Matthew J O'Meara, Sara Ballouz, Brian K Shoichet, Jesse Gillis.
Abstract
The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically-oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC~0.63-0.75 for the individual data modalities to AUROC~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited.
Year: 2016 PMID: 27467773 PMCID: PMC4965129 DOI: 10.1371/journal.pone.0160098
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Discordance of bioinformatics and functional genomic similarity with chemoinformatic similarity.
Likelihood that two proteins will be related by ligand similarity (solid line: SEA E-value < 1e-5, dashed line: SEA E-value < 1e-20) given a threshold in the (A) sequence similarity network, (B) co-expression network, and (C) extended protein-protein interaction network. The Y-axis is the likelihood that pairs of targets will have a SEA E-value better than 1e-5 (and, for sequence similarity, also 1e-20) at any given threshold of similarity on the X-axis. (D-F) Truth tables showing the correspondence of the protein-protein pairs that either are or are not related by ligand similarity and by sequence similarity, co-expression, or direct protein-protein interactions. In the upper left and lower right squares, the ligand-based and genomic associations agree that the targets are or are not related, while in the lower left and upper right squares they disagree.
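The truth tables in panels (D-F) and the conditional likelihood on the Y-axis can be illustrated with a small sketch. This is not the authors' code: the function names, the toy protein labels, and the representation of each network as a list of already-thresholded edges are all assumptions made for illustration.

```python
from itertools import combinations


def truth_table(proteins, ligand_edges, genomic_edges):
    """Count protein pairs by agreement between two binary (thresholded) networks.

    Hypothetical helper: edges are unordered pairs of protein names that
    already passed some threshold (e.g. a SEA E-value cutoff).
    """
    ligand = {frozenset(e) for e in ligand_edges}
    genomic = {frozenset(e) for e in genomic_edges}
    counts = {"both": 0, "ligand_only": 0, "genomic_only": 0, "neither": 0}
    for a, b in combinations(sorted(proteins), 2):
        pair = frozenset((a, b))
        in_ligand, in_genomic = pair in ligand, pair in genomic
        if in_ligand and in_genomic:
            counts["both"] += 1
        elif in_ligand:
            counts["ligand_only"] += 1
        elif in_genomic:
            counts["genomic_only"] += 1
        else:
            counts["neither"] += 1
    return counts


def likelihood_ligand_given_genomic(counts):
    """Fraction of genomically linked pairs that are also linked by ligands."""
    linked = counts["both"] + counts["genomic_only"]
    return counts["both"] / linked if linked else float("nan")


# Toy data (made up): four proteins, two edges per network.
toy_counts = truth_table(
    ["A", "B", "C", "D"],
    ligand_edges=[("A", "B"), ("C", "D")],
    genomic_edges=[("A", "B"), ("A", "C")],
)
print(toy_counts, likelihood_ligand_given_genomic(toy_counts))
```

Repeating the calculation while sweeping the threshold that defines `genomic_edges` would trace out a curve analogous to panels (A-C).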
Top 20 performing GO terms in the different networks.
| Co-expression: GO term | AUROC | Extended PPI: GO term | AUROC | SEA network: GO term | AUROC |
|---|---|---|---|---|---|
| nucleobase metabolic process | 0.86 | DNA-dependent transcription, initiation | 0.95 | protein kinase activity | 0.95 |
| DNA replication | 0.82 | transcription initiation from RNA polymerase II promoter | 0.94 | metallopeptidase activity | 0.95 |
| leukocyte activation | 0.80 | transcription from RNA polymerase II promoter | 0.92 | kinase activity | 0.95 |
| cell cycle phase | 0.80 | transcription regulatory region DNA binding | 0.92 | transferase activity, transferring phosphorus-containing groups | 0.95 |
| ion gated channel activity | 0.79 | transcription, DNA-dependent | 0.91 | phosphotransferase activity, alcohol group as acceptor | 0.95 |
| M phase | 0.79 | Toll signaling pathway | 0.91 | peptidase activity, acting on L-amino acid peptides | 0.94 |
| substrate-specific channel activity | 0.79 | TRIF-dependent toll-like receptor signaling pathway | 0.91 | endopeptidase activity | 0.93 |
| mitotic cell cycle phase transition | 0.79 | toll-like receptor 4 signaling pathway | 0.91 | protein serine/threonine kinase activity | 0.92 |
| regulation of purine nucleotide biosynthetic process | 0.79 | regulatory region nucleic acid binding | 0.91 | serine-type peptidase activity | 0.91 |
| synaptic transmission | 0.79 | toll-like receptor signaling pathway | 0.91 | nucleobase metabolic process | 0.91 |
| gated channel activity | 0.79 | toll-like receptor 3 signaling pathway | 0.91 | protein tyrosine kinase activity | 0.91 |
| ion channel activity | 0.79 | pattern recognition receptor signaling pathway | 0.91 | peptidase activity | 0.90 |
| positive regulation of I-kappaB kinase/NF-kappaB cascade | 0.79 | innate immune response-activating signal transduction | 0.91 | transmembrane receptor protein tyrosine kinase activity | 0.90 |
| cell chemotaxis | 0.78 | nucleic acid binding transcription factor activity | 0.91 | serine-type endopeptidase activity | 0.90 |
| chromosome organization | 0.78 | regulatory region DNA binding | 0.90 | phosphorylation | 0.90 |
| interphase | 0.78 | sequence-specific DNA binding transcription factor activity | 0.90 | serine hydrolase activity | 0.90 |
| multicellular organismal signaling | 0.78 | JNK cascade | 0.90 | protein phosphorylation | 0.90 |
| G-protein coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger | 0.78 | MyD88-dependent toll-like receptor signaling pathway | 0.90 | protein autophosphorylation | 0.89 |
| neurological system process | 0.78 | toll-like receptor 2 signaling pathway | 0.89 | transmembrane receptor protein kinase activity | 0.88 |
| transmission of nerve impulse | 0.78 | toll-like receptor 1 signaling pathway | 0.89 | stress-activated MAPK cascade | 0.87 |
Biological process (green) and molecular function (pink).
Fig 2. Functional genomic and chemoinformatic interactions among proteins involved in glutamate signaling.
(A) A schematic of glutamate signaling showing three of the major types of proteins involved: ionotropic glutamate receptors, metabotropic glutamate receptors, and glutamate transporters. (B) Proteins annotated in IUPHAR as involved in glutamatergic neurotransmission and linked by ligand-based SEA E-values better than 1e-25. (C) Proteins annotated in IUPHAR as involved in glutamatergic neurotransmission and linked by human physical protein-protein interactions from BioGRID (v3.2.121). (D) The ligand-similarity and PPI networks from (B) and (C) merged and extended to adjacent related proteins, using (1) ligand-based links (orange edges, using a SEA E-value threshold of 1e-75), (2) protein-protein interactions restricted to those supported by at least one low-throughput observation, one physical observation, and observed in at least two different experiments (green edges), and (3) co-expression links at a 94% threshold (pink edges). Edges that overlapped between shared nodes from the independent networks shown in (B) and (C), calculated at less stringent cut-offs, are preserved here to illustrate the few cases of overlap between the networks (e.g., between GRIA1 and GRIA2).
Fig 3. Ligand-based networks better recapitulate Gene Ontology than do PPI or co-expression networks.
The (A) co-expression network and the (B) extended protein-protein interaction network are compared with the (C) ligand-derived network for their ability to characterize gene function (as defined in the Gene Ontology, GO). We assessed performance through cross-validation (area under the ROC curve, AUROC) of a neighbor-voting algorithm. Each curve represents the distribution of AUROCs across 790 GO terms. The dark grey curves show the cross-validation scores in each network, the black curves show the AUROCs after permuting the network nodes, and the light grey curves show the scores using node degree as a generic predictor across all functional categories. The PPI network has the highest performance (B, dark grey, AUROC = 0.68), but this reflects node degree bias (light grey line, AUROC = 0.6). Co-expression has less bias (A, light grey line, AUROC = 0.52), but performs less well (dark grey line, AUROC = 0.62). The ligand network performs almost as well (C, dark grey line, AUROC = 0.67) as the extended PPI network, with little node degree bias (light grey line, AUROC = 0.52). The randomly permuted networks (black) have AUROCs between 0.48 and 0.5.
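A minimal sketch of neighbor voting and the node degree baseline may help make the comparison concrete. It assumes a leave-one-out form of cross-validation and a dense symmetric weight matrix; the fold scheme actually used is not stated in this caption, and all function names and the toy network are hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. O(n_pos * n_neg), fine for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def neighbor_voting_loo(weights, labels):
    """Leave-one-out neighbor voting (an assumed simplification).

    Each gene is scored by the share of its connection weight that goes to
    annotated genes; its own label is never used for its own score.
    """
    n = len(labels)
    scores = []
    for i in range(n):
        degree = sum(weights[i][j] for j in range(n) if j != i)
        votes = sum(weights[i][j] for j in range(n) if j != i and labels[j])
        scores.append(votes / degree if degree else 0.0)
    return scores


def node_degree(weights):
    """Generic baseline predictor: total connection weight of each gene."""
    return [sum(w for j, w in enumerate(row) if j != i)
            for i, row in enumerate(weights)]


# Toy network (made up): two fully connected triangles; the first carries the GO term.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
y = [1, 1, 1, 0, 0, 0]
print(auroc(neighbor_voting_loo(W, y), y))  # network signal
print(auroc(node_degree(W), y))             # degree-only baseline
```

In this toy case every gene has the same degree, so the baseline carries no information; a degree baseline well above 0.5, as reported for the PPI network, indicates that highly connected genes are predicted regardless of the specific function.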
Fig 4. Improving performance of the network as measured through Guilt-by-Association on GO.
(A) The prediction of GO annotation terms, grouped by evidence code and sub-ontology, by individual and combined networks. The ChEBI subset consists of terms associated with the Chemical Entities of Biological Interest (ChEBI) ontology (S3 Table, S3 Fig). Error bars represent the standard error of the mean. Combining the networks improves performance substantially (average AUROC ~0.80). (B) The performance of the CoExp network improves with increasing sample size but with diminishing returns, especially when compared with the gains obtained by combining it with the orthogonal chemoinformatics network. Extrapolating the aggregation curve (orange line), we predict that millions more samples would be needed to achieve similar performance with CoExp alone as with the combined chemoinformatics and CoExp networks (orange arrow).
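One common way to combine networks measured on different scales is to rank-standardize each one and average the results. The caption does not say how the networks were combined, so the sketch below is an assumption rather than the authors' procedure; `rank_standardize`, `aggregate`, and the toy matrices are hypothetical.

```python
def rank_standardize(weights):
    """Map each off-diagonal weight to its dense rank fraction in (0, 1].

    Assumes a symmetric matrix; only the upper triangle is read.
    """
    n = len(weights)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    distinct = sorted({weights[i][j] for i, j in pairs})
    rank = {v: (k + 1) / len(distinct) for k, v in enumerate(distinct)}
    out = [[0.0] * n for _ in range(n)]
    for i, j in pairs:
        out[i][j] = out[j][i] = rank[weights[i][j]]
    return out


def aggregate(networks):
    """Average the rank-standardized networks edge by edge."""
    ranked = [rank_standardize(w) for w in networks]
    n = len(ranked[0])
    return [[sum(r[i][j] for r in ranked) / len(ranked) for j in range(n)]
            for i in range(n)]


# Toy matrices (made up) on very different scales, e.g. a correlation
# and a -log10 E-value; ranking makes them comparable before averaging.
coexp = [[0, 0.9, 0.1], [0.9, 0, 0.5], [0.1, 0.5, 0]]
ligand = [[0, 30, 2], [30, 0, 10], [2, 10, 0]]
combined = aggregate([coexp, ligand])
print(combined)
```

The combined matrix can then be scored with the same guilt-by-association procedure as each individual network, which is how a gain from aggregation would be measured.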
Fig 5. Experimental characterization to test robustness in methods and underlying data.
(A) Performance of the networks as measured by the AUROC, average precision, and average fold change in precision. The black points represent the full network as assessed on a GO subset. The green points show the performance of the 1% sparsified network, while the blue points show the optimized version. (B) The topology of the BlastP sequence similarity network. Once optimized, the original network has few connections, mostly pairs and cliques.