Zichen Wang, Caroline D Monteiro, Kathleen M Jagodnik, Nicolas F Fernandez, Gregory W Gundersen, Andrew D Rouillard, Sherry L Jenkins, Axel S Feldmann, Kevin S Hu, Michael G McDermott, Qiaonan Duan, Neil R Clark, Matthew R Jones, Yan Kou, Troy Goff, Holly Woodland, Fabio M R Amaral, Gregory L Szeto, Oliver Fuchs, Sophia M Schüssler-Fiorenza Rose, Shvetank Sharma, Uwe Schwartz, Xabier Bengoetxea Bausela, Maciej Szymkiewicz, Vasileios Maroulis, Anton Salykin, Carolina M Barra, Candice D Kruth, Nicholas J Bongio, Vaibhav Mathur, Radmila D Todoric, Udi E Rubin, Apostolos Malatras, Carl T Fulp, John A Galindo, Ruta Motiejunaite, Christoph Jüschke, Philip C Dishuck, Katharina Lahl, Mohieddin Jafari, Sara Aibar, Apostolos Zaravinos, Linda H Steenhuizen, Lindsey R Allison, Pablo Gamallo, Fernando de Andres Segura, Tyler Dae Devlin, Vicente Pérez-García, Avi Ma'ayan.
Abstract
Gene expression data are accumulating exponentially in public repositories. Reanalysis and integration of themed collections from these studies may provide new insights, but requires further human curation. Here we report a crowdsourcing project to annotate and reanalyse a large number of gene expression profiles from Gene Expression Omnibus (GEO). Through a massive open online course on Coursera, over 70 participants from over 25 countries identify and annotate 2,460 single-gene perturbation signatures, 839 disease versus normal signatures, and 906 drug perturbation signatures. All these signatures are unique and are manually validated for quality. Global analysis of these signatures confirms known associations and identifies novel associations between genes, diseases and drugs. The manually curated signatures are used as a training set to develop classifiers for extracting similar signatures from the entire GEO repository. We develop a web portal to serve these signatures for query, download and visualization.
Year: 2016 PMID: 27667448 PMCID: PMC5052684 DOI: 10.1038/ncomms12846
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1: Workflow of the crowdsourcing project.
Participants identify relevant studies from GEO and then extract gene expression signatures using GEO2Enrichr. Participants also add metadata to each signature. Submitted signatures were manually reviewed and then used to scale up the collections with machine learning methods. All signatures are served on the CRowd Extracted Expression of Differential Signatures (CREEDS) web portal.
Figure 2: Batch effect correction influence on the quality of gene expression signatures.
Line plots show the probability density distribution of the scaled ranks of expected DEGs in gene expression signatures from the three collections: (a) single-gene perturbations, (b) disease signatures, and (c) single-drug perturbations. The colours indicate which algorithm was used to call the differentially expressed genes: Characteristic Direction (CD), limma, or fold change; and whether batch effect correction was applied with surrogate variable analysis (SVA).
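The scaled rank of an expected DEG can be illustrated with a minimal Python sketch. It assumes that genes are ranked by the absolute value of a differential expression score and that the rank is divided by the number of genes, so values near 0 mean the expected gene sits near the top of the signature. The function name `scaled_rank`, the ranking convention and the toy scores are illustrative assumptions, not taken from the caption.

```python
def scaled_rank(scores, gene):
    """Return the rank of `gene` divided by the number of genes.

    `scores` maps gene symbol -> differential expression score.
    Assumption: genes are ranked by absolute score, largest first.
    """
    ordered = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return (ordered.index(gene) + 1) / len(ordered)


# Toy signature for a hypothetical TP53 knockdown (values are made up).
toy_scores = {"TP53": -3.0, "MYC": 1.0, "ACTB": 0.1, "GAPDH": -0.5}
tp53_rank = scaled_rank(toy_scores, "TP53")
actb_rank = scaled_rank(toy_scores, "ACTB")
```

Under this convention, a method that calls the expected gene well produces a density concentrated near small scaled ranks.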
Figure 3: Benchmarking signature connections with prior knowledge.
Signed Jaccard index and absolute Jaccard index are used to measure the similarity between signatures, and plotted in dashed and solid lines, respectively. Different methods for identifying differentially expressed genes include: the Characteristic Direction (CD), limma with Benjamini–Hochberg (BH) correction, and limma with Bonferroni correction. These are plotted in blue, orange and green, respectively. ROC curves are plotted for (a) recovering the same perturbed genes; (b) recovering similar diseases; and (c) recovering drugs with similar chemical structure.
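To make the two similarity measures concrete, the sketch below shows one plausible formulation in Python. It assumes each signature is a pair of up-regulated and down-regulated gene sets, that the signed index rewards same-direction overlap and penalizes opposite-direction overlap, and that the absolute index ignores direction. The caption does not spell out the formulas, so the function names and the exact expressions are assumptions.

```python
def jaccard(a, b):
    """Plain Jaccard index of two collections; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signed_jaccard(up1, down1, up2, down2):
    """Assumed signed form: ranges from -1 (reversed) to 1 (identical)."""
    same = jaccard(up1, up2) + jaccard(down1, down2)
    opposite = jaccard(up1, down2) + jaccard(down1, up2)
    return (same - opposite) / 2


def absolute_jaccard(up1, down1, up2, down2):
    """Assumed absolute form: overlap of all DEGs, ignoring direction."""
    return jaccard(set(up1) | set(down1), set(up2) | set(down2))


# Toy gene sets: the second call swaps up and down to mimic a reversal.
identical = signed_jaccard({"A", "B"}, {"C"}, {"A", "B"}, {"C"})
reversed_ = signed_jaccard({"A", "B"}, {"C"}, {"C"}, {"A", "B"})
unsigned = absolute_jaccard({"A", "B"}, {"C"}, {"C"}, {"A", "B"})
```

With these definitions, a signature and its exact reversal score -1 on the signed index but 1 on the absolute index, which is why the two measures can rank the same pair of signatures differently.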
Figure 4: Hierarchical clustering of the adjacency matrix of all gene expression signatures and selected clusters.
(a) The entire adjacency matrix of all signatures. (b–d) Three selected zoomed-in views of clusters from the adjacency matrix displayed in (a).
Figure 5: Distributions of the ranks of matched perturbations between signatures from CREEDS and the LINCS L1000 dataset.
The highest ranks (a,c) and all ranks (b,d) of matched drugs (a,b) and matched genes (c,d) are presented. Drug perturbation signatures from CREEDS were queried against ∼30,000 significant drug perturbation signatures from the LINCS L1000 dataset, whereas gene perturbation signatures from CREEDS were queried against ∼110,000 gene perturbation signatures from the LINCS L1000 dataset.
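The distinction between the highest rank and all ranks of a matched perturbation can be sketched as follows. The example assumes that a query returns a similarity score for every reference signature, that reference signatures are sorted from most to least similar, and that a match is a reference signature labelled with the same perturbation as the query. The function name `matched_ranks` and the toy scores are hypothetical.

```python
def matched_ranks(similarities, query_label):
    """Return the 1-based ranks of reference signatures matching the query.

    `similarities` is a list of (reference_label, score) pairs.
    Assumption: a higher score means a more similar signature.
    """
    ordered = sorted(similarities, key=lambda pair: pair[1], reverse=True)
    return [rank for rank, (label, _) in enumerate(ordered, start=1)
            if label == query_label]


# Toy query result: two reference signatures share the query's drug label.
toy_hits = [
    ("dexamethasone", 0.40),
    ("doxorubicin", 0.10),
    ("dexamethasone", 0.05),
    ("lapatinib", 0.30),
]
all_ranks = matched_ranks(toy_hits, "dexamethasone")
highest_rank = min(all_ranks)
```

Here `all_ranks` corresponds to the quantity summarized in panels (b,d) and `highest_rank` to the one in panels (a,c), under the stated assumptions.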
Top hits for drug signatures extracted from GEO queried against drug perturbations from the LINCS L1000 dataset processed using the Characteristic Direction method.
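| Drug | PubChem CID | GEO series | Organism | GEO platform | Rank |
| --- | --- | --- | --- | --- | --- |
| Dexamethasone | 5743 | GSE34313 | human | GPL6480 | 1 |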
| Doxorubicin | 31703 | GSE58074 | human | GPL10558 | 1 |
| Azacitidine | 9444 | GSE29077 | human | GPL571 | 1 |
| Azacitidine | 9444 | GSE29077 | human | GPL571 | 1 |
| Azacitidine | 9444 | GSE29077 | human | GPL571 | 1 |
| Lapatinib | 208908 | GSE38376 | human | GPL6947 | 2 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 2 |
| Lapatinib | 208908 | GSE38376 | human | GPL6947 | 2 |
| Dexamethasone | 5743 | GSE54608 | human | GPL10558 | 3 |
| Lapatinib | 208908 | GSE38376 | human | GPL6947 | 3 |
| Tretinoin | 444795 | GSE1588 | mouse | GPL81 | 3 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 3 |
| Tretinoin | 444795 | GSE32161 | human | GPL570 | 3 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 3 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 4 |
| Trichostatin A | 444732 | GSE1437 | mouse | GPL81 | 4 |
| Dexamethasone | 5743 | GSE7683 | mouse | GPL1261 | 5 |
| Cycloheximide | 6197 | GSE8597 | human | GPL570 | 5 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 6 |
| Sorafenib | 216239 | GSE39192 | human | GPL6947 | 7 |
| Vemurafenib | 42611257 | GSE37441 | human | GPL10558 | 8 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 10 |
| Curcumin | 969516 | GSE10896 | human | GPL570 | 14 |
| Curcumin | 969516 | GSE10896 | human | GPL570 | 15 |
| Vemurafenib | 42611257 | GSE37441 | human | GPL10558 | 15 |
| Lapatinib | 208908 | GSE38376 | human | GPL6947 | 16 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 17 |
| Tretinoin | 444795 | GSE1588 | mouse | GPL81 | 20 |
| Vemurafenib | 42611257 | GSE42872 | human | GPL6244 | 23 |
| Azacitidine | 9444 | GSE29077 | human | GPL571 | 24 |
| Troglitazone | 5591 | GSE21329 | rat | GPL341 | 31 |
| Decitabine | 451668 | GSE29077 | human | GPL571 | 36 |
| Vemurafenib | 42611257 | GSE37441 | human | GPL10558 | 36 |
| Thapsigargin | 446378 | GSE19519 | human | GPL570 | 37 |
| Methylprednisolone | 6741 | GSE490 | rat | GPL85 | 48 |
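Because several drugs appear in more than one row, it can help to reduce the table to the best rank per drug. The sketch below parses a handful of rows copied from the table and keeps the lowest rank for each drug. The field names (`drug`, `cid`, `series`, `organism`, `platform`, `rank`) are assumed readings of the six columns, and the helper names are hypothetical.

```python
# A few rows copied verbatim from the table; field names are assumptions.
ROWS = """\
| Dexamethasone | 5743 | GSE34313 | human | GPL6480 | 1 |
| Lapatinib | 208908 | GSE38376 | human | GPL6947 | 2 |
| Dexamethasone | 5743 | GSE54608 | human | GPL10558 | 3 |
| Lapatinib | 208908 | GSE38376 | human | GPL6947 | 16 |
| Vemurafenib | 42611257 | GSE37441 | human | GPL10558 | 8 |
"""


def parse_rows(text):
    """Split pipe-delimited rows into dictionaries with assumed field names."""
    records = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        drug, cid, series, organism, platform, rank = cells
        records.append({
            "drug": drug, "cid": int(cid), "series": series,
            "organism": organism, "platform": platform, "rank": int(rank),
        })
    return records


def best_rank_per_drug(records):
    """Keep the lowest (best) rank observed for each drug."""
    best = {}
    for record in records:
        drug, rank = record["drug"], record["rank"]
        best[drug] = min(rank, best.get(drug, rank))
    return best


best = best_rank_per_drug(parse_rows(ROWS))
```

The same two functions would work on the full table if every row were pasted into `ROWS`.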