Andrew D Rouillard, Gregory W Gundersen, Nicolas F Fernandez, Zichen Wang, Caroline D Monteiro, Michael G McDermott, Avi Ma'ayan.
Abstract
Genomics, epigenomics, transcriptomics, proteomics and metabolomics efforts rapidly generate a plethora of data on the activity and levels of biomolecules within mammalian cells. At the same time, curation projects that organize knowledge from the biomedical literature into online databases are expanding. Hence, there is a wealth of information about genes, proteins and their associations, with an urgent need for data integration to achieve better knowledge extraction and data reuse. For this purpose, we developed the Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins from over 70 major online resources. We extracted, abstracted and organized data into ∼72 million functional associations between genes/proteins and their attributes. Such attributes could be physical relationships with other biomolecules, expression in cell lines and tissues, genetic associations with knockout mouse or human phenotypes, or changes in expression after drug treatment. We stored these associations in a relational database along with rich metadata for the genes/proteins, their attributes and the original resources. The freely available Harmonizome web portal provides a graphical user interface, a web service and a mobile app for querying, browsing and downloading all of the collected data. To demonstrate the utility of the Harmonizome, we computed and visualized gene-gene and attribute-attribute similarity networks, and through unsupervised clustering, identified many unexpected relationships by combining pairs of datasets such as the association between kinase perturbations and disease signatures. We also applied supervised machine learning methods to predict novel substrates for kinases, endogenous ligands for G-protein coupled receptors and mouse phenotypes for knockout genes, and to classify unannotated transmembrane proteins by their likelihood of being ion channels.
The Harmonizome is a comprehensive resource of knowledge about genes and proteins, and as such, it enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation.
Database URL: http://amp.pharm.mssm.edu/Harmonizome
Year: 2016 PMID: 27374120 PMCID: PMC4930834 DOI: 10.1093/database/baw100
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Hierarchical clustering of gene-term, term-term and gene-gene matrices. (A) Gene-phenotype associations from the MPO organized into a binary matrix and clustered using hierarchical clustering. (B) Zooming into a cluster of genes with similar associated phenotypes, filtered to show higher-level phenotypes associated with at least half of the genes in the cluster but no more than 10% of all genes. (C) The gene–gene and cell-line/cell-line similarity matrices are from the CCLE gene expression dataset. Along the main diagonal of both matrices, there are several distinct zones of high red intensity, indicating clusters of cell lines with similar differentially expressed genes (DEGs) and clusters of genes with similar patterns of expression across cell lines. (D) Zooming into the lung cancer cell-lines cluster.
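The filter described for panel B can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the list-of-lists matrix layout and the toy data are assumptions, and only the two thresholds (at least half of the cluster genes, no more than 10% of all genes) come from the caption.

```python
def select_cluster_attributes(matrix, cluster_rows,
                              min_cluster_frac=0.5, max_global_frac=0.10):
    """Return column indices of attributes enriched in one gene cluster.

    matrix       -- binary gene-by-attribute matrix as a list of rows (assumed layout)
    cluster_rows -- row indices of the genes in the cluster of interest
    An attribute is kept if it is associated with at least `min_cluster_frac`
    of the cluster genes and with at most `max_global_frac` of all genes.
    """
    n_genes = len(matrix)
    n_attrs = len(matrix[0])
    selected = []
    for j in range(n_attrs):
        in_cluster = sum(matrix[i][j] for i in cluster_rows)
        overall = sum(row[j] for row in matrix)
        frequent_in_cluster = in_cluster / len(cluster_rows) >= min_cluster_frac
        rare_overall = overall / n_genes <= max_global_frac
        if frequent_in_cluster and rare_overall:
            selected.append(j)
    return selected


# Toy data (hypothetical): 40 genes, 3 phenotype attributes.
toy = [[0, 0, 0] for _ in range(40)]
toy[0][0] = toy[1][0] = 1          # attribute 0: 2 of 3 cluster genes, 5% of all genes
for g in range(20):
    toy[g][1] = 1                  # attribute 1: common in the cluster but in 50% of all genes
toy[30][2] = 1                     # attribute 2: rare overall but absent from the cluster

kept = select_cluster_attributes(toy, cluster_rows=[0, 1, 2])
print(kept)
```

Only attribute 0 passes both conditions in the toy data: attribute 1 is too common across all genes, and attribute 2 does not occur in the cluster.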
Figure 2. Example of combining datasets: matching kinases with diseases and drugs. (A) Hierarchical clustering of kinase perturbation signatures extracted from GEO and disease signatures extracted from GEO. (B) Validation of kinase-disease associations with genomics datasets. ROC curve showing concordance of kinase-disease associations derived by comparing gene expression profiles and kinase-disease associations collected from GWAS and other genetic association datasets. Low, medium and high labels correspond to confidence levels of associations from GWAS datasets. (C) Network showing top predictions of drug-kinase-disease associations. Red edges indicate kinase-disease associations that have supporting GWAS evidence. (D) Hierarchical clustering of signatures of DEGs for kinase perturbations extracted from GEO compared with signatures for cancer cell lines from CCLE. (E) ROC curve showing concordance of kinase-cell line associations derived by comparing gene expression profiles and driver kinase mutations for cell lines from COSMIC. (F) Network showing top predictions of drug-kinase-cell line associations. Red edges indicate kinase-cell line associations supported by COSMIC as having a driver mutation in the cell line.
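As a rough illustration of how concordance in panels B and E could be summarized, the sketch below computes the area under a ROC curve from similarity scores and binary reference labels. It assumes that each candidate kinase-disease (or kinase-cell line) pair has one expression-based score and one label saying whether a reference dataset supports it; the function name and the toy numbers are hypothetical, and the record does not state how the authors computed their curves.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen supported pair (label 1)
    receives a higher score than a randomly chosen unsupported pair (label 0),
    counting ties as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: four candidate pairs, two supported by a reference set.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]
auc = roc_auc(example_scores, example_labels)
print(auc)
```

An area of 0.5 corresponds to no concordance beyond chance, and 1.0 to every supported pair outranking every unsupported one.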
Figure 3. Example of supervised machine learning: classifiers to predict ion channels (IC), phenotypes of single gene knockouts in mice (MP), ligands of GPCRs (G-L), and substrates of kinases (K-S). (A) ROC curve of the classifiers. (B) Matthews correlation coefficient (MCC) as a function of the fraction of correct predictions. (C) Network showing candidate ion channels, predicted at a false discovery rate (FDR) of 0.67, connected to their most similar known ion channels, and limited to no more than three edges per node. (D) Network showing candidate gene-phenotype associations, predicted at an FDR of 0.33, limited to no more than three edges per node, and trimmed to remove clusters with all edges supported by prior knowledge. Red edges indicate known associations. (E) Network showing candidate GPCR-ligand interactions, predicted at an FDR of 0.67 and limited to no more than three edges per node. Red edges indicate known interactions. (F) Network showing candidate kinase-substrate interactions, predicted at an FDR of 0.67 and limited to no more than three edges per node. Red edges indicate known interactions.
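For readers unfamiliar with the two metrics named in this caption, the sketch below gives their standard definitions in terms of confusion-matrix counts. The function names and example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient (MCC) from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when the denominator is zero, a
    common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def false_discovery_rate(tp, fp):
    """Fraction of positive predictions that are false: FP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no positive predictions")
    return fp / (tp + fp)


# Hypothetical counts for illustration.
perfect = matthews_corrcoef(tp=5, fp=0, tn=5, fn=0)
chance = matthews_corrcoef(tp=2, fp=2, tn=2, fn=2)
fdr = false_discovery_rate(tp=1, fp=2)
print(perfect, chance, round(fdr, 2))
```

By this definition, an FDR of 0.67 means that roughly one in three predictions at that threshold is expected to be correct, and an FDR of 0.33 means roughly two in three.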
Datasets. List of datasets grouped by attribute, with dataset citations
| Dataset | Citations |
|---|---|
| Achilles Cell Line Gene Essentiality Profiles | ( |
| BioGPS Cell Line Gene Expression Profiles | ( |
| CCLE Cell Line Gene CNV Profiles | ( |
| CCLE Cell Line Gene Expression Profiles | ( |
| CCLE Cell Line Gene Mutation Profiles | ( |
| COSMIC Cell Line Gene CNV Profiles | ( |
| COSMIC Cell Line Gene Mutation Profiles | ( |
| GDSC Cell Line Gene Expression Profiles | ( |
| Heiser et al., PNAS, 2011 Cell Line Gene Expression Profiles | ( |
| HPA Cell Line Gene Expression Profiles | ( |
| Klijn et al., Nat. Biotechnol., 2015 Cell Line Gene CNV Profiles | ( |
| Klijn et al., Nat. Biotechnol., 2015 Cell Line Gene Expression Profiles | ( |
| Klijn et al., Nat. Biotechnol., 2015 Cell Line Gene Mutation Profiles | ( |
| BioGPS Human Cell Type and Tissue Gene Expression Profiles | ( |
| BioGPS Mouse Cell Type and Tissue Gene Expression Profiles | ( |
| HPM Cell Type and Tissue Protein Expression Profiles | ( |
| ProteomicsDB Cell Type and Tissue Protein Expression Profiles | ( |
| Roadmap Epigenomics Cell and Tissue DNA Methylation Profiles | ( |
| Roadmap Epigenomics Cell and Tissue Gene Expression Profiles | ( |
| Allen Brain Atlas Developing Human Brain Tissue Gene Expression Profiles by Microarray | ( |
| Allen Brain Atlas Developing Human Brain Tissue Gene Expression Profiles by RNA-seq | ( |
| GTEx Tissue Sample Gene Expression Profiles | ( |
| HPA Tissue Sample Gene Expression Profiles | ( |
| TCGA Signatures of DEGs for Tumors | ( |
| Allen Brain Atlas Adult Human Brain Tissue Gene Expression Profiles | ( |
| Allen Brain Atlas Adult Mouse Brain Tissue Gene Expression Profiles | ( |
| Allen Brain Atlas Prenatal Human Brain Tissue Gene Expression Profiles | ( |
| GTEx Tissue Gene Expression Profiles | ( |
| HPA Tissue Gene Expression Profiles | ( |
| HPA Tissue Protein Expression Profiles | ( |
| TISSUES Curated Tissue Protein Expression Evidence Scores | ( |
| TISSUES Experimental Tissue Protein Expression Evidence Scores | ( |
| TISSUES Text-mining Tissue Protein Expression Evidence Scores | ( |
Datasets providing evidence for associations between genes and ‘cell lines, cell types or tissues’ (preceding table)
Datasets providing evidence for associations between genes and ‘chemicals’
| Dataset | Citations |
|---|---|
| CTD Gene-Chemical Interactions | ( |
| SILAC Phosphoproteomics Signatures of Differentially Phosphorylated Proteins for Drugs | |
| DrugBank Drug Targets | ( |
| Guide to Pharmacology Chemical Ligands of Receptors | ( |
| HMDB Metabolites of Enzymes | ( |
| CMAP Signatures of DEGs for Small Molecules | ( |
| GEO Signatures of DEGs for Small Molecules | ( |
| LINCS L1000 CMAP Signatures of DEGs for Small Molecules | ( |
| KinomeScan Kinase Inhibitor Targets | |
Datasets providing evidence for associations between genes and ‘diseases, phenotypes or traits’
| Dataset | Citations |
|---|---|
| GEO Signatures of DEGs for Diseases | ( |
| CTD Gene-Disease Associations | ( |
| DISEASES Curated Gene-Disease Association Evidence Scores | ( |
| DISEASES Experimental Gene-Disease Association Evidence Scores | ( |
| DISEASES Text-mining Gene-Disease Association Evidence Scores | ( |
| GAD Gene-Disease Associations | ( |
| GAD High Level Gene-Disease Associations | ( |
| GWASdb SNP-Disease Associations | ( |
| PhosphoSitePlus Phosphosite-Disease Associations | ( |
| ClinVar SNP-Phenotype Associations | ( |
| GWAS Catalog SNP-Phenotype Associations | ( |
| GWASdb SNP-Phenotype Associations | ( |
| HPO Gene-Disease Associations | ( |
| HuGE Navigator Gene-Phenotype Associations | ( |
| MPO Gene-Phenotype Associations | ( |
| OMIM Gene-Disease Associations | ( |
| dbGAP Gene-Trait Associations | ( |
Datasets providing evidence for associations between genes and ‘functional terms, phrases or references’
| Dataset | Citations |
|---|---|
| GO Biological Process Annotations | ( |
| GeneRIF Biological Term Annotations | ( |
| Phosphosite Textmining Biological Term Annotations | |
| COMPARTMENTS Curated Protein Localization Evidence Scores | ( |
| COMPARTMENTS Experimental Protein Localization Evidence Scores | ( |
| COMPARTMENTS Text-mining Protein Localization Evidence Scores | ( |
| GO Cellular Component Annotations | ( |
| LOCATE Curated Protein Localization Annotations | ( |
| LOCATE Predicted Protein Localization Annotations | ( |
| GO Molecular Function Annotations | ( |
| Biocarta Pathways | |
| HumanCyc Pathways | ( |
| KEGG Pathways | ( |
| PANTHER Pathways | ( |
| PID Pathways | ( |
| Reactome Pathways | ( |
| Wikipathways Pathways | ( |
| CORUM Protein Complexes | ( |
| NURSA Protein Complexes | ( |
| ESCAPE Omics Signatures of Genes and Proteins for Stem Cells | ( |
| GeneSigDB Published Gene Signatures | ( |
Datasets providing evidence for associations between genes and ‘other genes, proteins or microRNAs’
| Dataset | Citations |
|---|---|
| MSigDB Cancer Gene Co-expression Modules | ( |
| GEO Signatures of DEGs for Gene Perturbations | ( |
| LINCS L1000 CMAP Signatures of DEGs for Gene Knockdowns | ( |
| MSigDB Signatures of DEGs for Cancer Gene Perturbations | ( |
| SILAC Phosphoproteomics Signatures of Differentially Phosphorylated Proteins for Gene Perturbations | |
| Hub Proteins Protein–Protein Interactions | ( |
| BIND Biomolecular Interactions | ( |
| BioGRID Protein–Protein Interactions | ( |
| DIP Protein–Protein Interactions | ( |
| HPRD Protein–Protein Interactions | ( |
| IntAct Biomolecular Interactions | ( |
| NURSA Protein–Protein Interactions | ( |
| Pathway Commons Protein–Protein Interactions | ( |
| GEO Signatures of DEGs for Kinase Perturbations | ( |
| KEA Substrates of Kinases | ( |
| PhosphoSitePlus Substrates of Kinases | ( |
| SILAC Phosphoproteomics Signatures of Differentially Phosphorylated Proteins for Protein Ligands | |
| Guide to Pharmacology Protein Ligands of Receptors | ( |
| MiRTarBase microRNA Targets | ( |
| TargetScan Predicted Conserved microRNA Targets | ( |
| TargetScan Predicted Nonconserved microRNA Targets | ( |
| DEPOD Substrates of Phosphatases | ( |
| GEO Signatures of DEGs for Transcription Factor Perturbations | ( |
| CHEA Transcription Factor Targets | ( |
| ENCODE Transcription Factor Targets | ( |
| JASPAR Predicted Transcription Factor Targets | ( |
| TRANSFAC Curated Transcription Factor Targets | ( |
| TRANSFAC Predicted Transcription Factor Targets | ( |
| Virus MINT Protein-Viral Protein Interactions | ( |
Datasets providing evidence for associations between genes and ‘molecular profiles’
| Dataset | Citations |
|---|---|
| Kinativ Kinase Inhibitor Bioactivity Profiles | |
| ENCODE Histone Modification Site Profiles | ( |
| Roadmap Epigenomics Histone Modification Site Profiles | ( |
| CHEA Transcription Factor Binding Site Profiles | ( |
| ENCODE Transcription Factor Binding Site Profiles | ( |
Datasets providing evidence for associations between genes and ‘organisms’
| Dataset | Citations |
|---|---|
| GEO Signatures of DEGs for Viral Infections | ( |
| Virus MINT Protein-Virus Interactions | ( |
Datasets providing evidence for associations between genes and ‘sequence features’
| Dataset | Citations |
|---|---|
| GTEx eQTL | ( |
Datasets providing evidence for associations between genes and ‘structural features’
| Dataset | Citations |
|---|---|
| InterPro Predicted Protein Domain Annotations | ( |