Aaron P Gabow, Sonia M Leach, William A Baumgartner, Lawrence E Hunter, Debra S Goldberg.
Abstract
BACKGROUND: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence, and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity.
Year: 2008 PMID: 18412966 PMCID: PMC2375131 DOI: 10.1186/1471-2105-9-198
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Physical and Genetic Correspondence to Annotation
| Uetz 1498 | 37 | 15 | 32 | Y2H 2619 | 10 | 6 | Y2H 20045 | 10 | 14 |
| Ito 4469 | 19 | 8 | 17 | Aff Chr 26 | 19 | 0 | Immunoblotting 2 | 100 | 100 |
| Fromont 175 | 26 | 15 | 37 | Immuno Prec 5 | 60 | 100 | |||
| Gavin 3139 | 67 | 41 | 70 | Gel Retardation 2 | 100 | 100 | |||
| Ho 3464 | 38 | 15 | 36 | Experimental 1 | 100 | 100 | |||
| Biophysical 3 | 100 | 100 | |||||||
| Alanine Scanning 2 | 100 | 100 | |||||||
| Bellaoui 34 | 79 | 0 | 82 | All 20543 | 42 | 50 | All 6523 | 32 | 69 |
| Davierwala 564 | 40 | 13 | 39 | ||||||
| Huang 58 | 55 | 5 | 55 | ||||||
| Goehring 63 | 65 | 5 | 61 | ||||||
| Kozminski 30 | 70 | 7 | 53 | ||||||
| Krogan 36 | 50 | 11 | 78 | ||||||
| Parsons 86 | 55 | 2 | 43 | ||||||
| Tong 5907 | 49 | 12 | 39 | ||||||
Percentage of edges in the full graph which connect proteins sharing the same annotation according to the gold standard. These values are the r used in the calculation of edge weights by the noisy-or function. The number of edges scored is shown following the experimental group name. Listed by first author, PubMed identifiers for the groups are: Uetz PMID:10688190, Ito PMID:10655498, PMID:11283351, Fromont PMID:9207794, Gavin PMID:11805826, Ho PMID:11805837, Bellaoui PMID:12912927, Davierwala PMID:16155567, Huang PMID:12077337, Goehring PMID:12686605, Kozminski PMID:12960420, Krogan PMID:12773564, Parsons PMID:14661025, and Tong PMID:14764870. Abbreviations: GO SLIM Molecular Function (MF), GO SLIM Biological Process (BP), yeast two-hybrid (Y2H), affinity chromatography (Aff Chr), immunoprecipitation (Immuno Prec).
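As an illustration of how per-source percentages like those in the table could feed a noisy-or edge weight, the following is a minimal sketch using the standard noisy-or form, 1 minus the product of (1 - r_i) over the sources reporting an edge. The function name and the example edge are hypothetical, and the exact formulation used by the authors may differ.

```python
from math import prod

def noisy_or(reliabilities):
    """Combine per-source reliabilities r_i (each in [0, 1]) into one edge weight.

    Standard noisy-or: the edge is considered wrong only if every source
    reporting it is wrong, so weight = 1 - prod(1 - r_i).
    """
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical edge reported by two groups whose table entries are 37% and 19%.
weight = noisy_or([0.37, 0.19])
print(round(weight, 4))  # 0.4897
```

Under this form an edge supported by several moderately reliable sources receives a higher weight than any single source would give it, and an edge with no supporting source receives weight 0.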
Co-occurrence Correspondence to Annotation
| > 0.0 | 8621 | 80 | 41 | 71 | 21847 | 76 | 57 | 17508 | 47 | 70 |
| ≥0.1 | 8615 | 80 | 41 | 71 | 21711 | 77 | 57 | 17422 | 47 | 70 |
| ≥0.2 | 8554 | 80 | 41 | 71 | 21177 | 78 | 58 | 16753 | 47 | 71 |
| ≥0.3 | 8210 | 80 | 41 | 71 | 20209 | 80 | 60 | 14494 | 49 | 72 |
| ≥0.4 | 7216 | 80 | 43 | 72 | 18811 | 83 | 63 | 10625 | 53 | 76 |
| ≥0.5 | 5592 | 82 | 46 | 73 | 17813 | 85 | 64 | 7021 | 56 | 76 |
| ≥0.6 | 3605 | 82 | 51 | 74 | 15857 | 91 | 67 | 4112 | 63 | 74 |
| ≥0.7 | 1856 | 82 | 56 | 74 | 12770 | 91 | 61 | 1965 | 59 | 68 |
| ≥0.8 | 700 | 77 | 54 | 72 | 10924 | 94 | 61 | 1002 | 56 | 63 |
| ≥0.9 | 159 | 65 | 45 | 75 | 6360 | 94 | 91 | 308 | 38 | 40 |
| >0.0 | 8621 | 80 | 43 | 73 | 21847 | 76 | 57 | 17508 | 47 | 70 |
| ≥0.1 | 8614 | 80 | 43 | 73 | 21739 | 77 | 57 | 17125 | 47 | 71 |
| ≥0.2 | 8607 | 80 | 43 | 73 | 21680 | 77 | 57 | 17044 | 47 | 71 |
| ≥0.3 | 8600 | 80 | 43 | 73 | 21671 | 77 | 57 | 16907 | 47 | 71 |
| ≥0.4 | 8591 | 80 | 43 | 73 | 21397 | 78 | 58 | 16719 | 47 | 71 |
| ≥0.5 | 8572 | 80 | 43 | 73 | 21202 | 78 | 58 | 16575 | 48 | 71 |
| ≥0.6 | 8557 | 80 | 43 | 73 | 21183 | 78 | 58 | 16360 | 48 | 71 |
| ≥0.7 | 8532 | 80 | 44 | 73 | 21159 | 78 | 58 | 16060 | 48 | 71 |
| ≥0.8 | 8466 | 80 | 44 | 73 | 20650 | 79 | 59 | 15665 | 48 | 71 |
| ≥0.9 | 8368 | 80 | 44 | 73 | 20386 | 80 | 60 | 14764 | 49 | 72 |
| >0.0 | 8621 | 80 | 41 | 71 | 21847 | 76 | 57 | 17508 | 47 | 70 |
| ≥0.1 | 6220 | 82 | 45 | 73 | 20063 | 80 | 60 | 9610 | 56 | 75 |
| ≥0.2 | 4241 | 82 | 49 | 74 | 17836 | 84 | 63 | 6786 | 58 | 76 |
| ≥0.3 | 2947 | 82 | 54 | 76 | 17353 | 86 | 63 | 5078 | 61 | 76 |
| ≥0.4 | 2283 | 82 | 56 | 76 | 17023 | 87 | 64 | 4178 | 64 | 77 |
| ≥0.5 | 1745 | 80 | 55 | 74 | 16875 | 87 | 63 | 3589 | 66 | 77 |
| ≥0.6 | 1195 | 78 | 55 | 73 | 16574 | 88 | 64 | 2922 | 68 | 76 |
| ≥0.7 | 713 | 78 | 56 | 72 | 16082 | 88 | 64 | 2494 | 70 | 76 |
| ≥0.8 | 536 | 74 | 52 | 69 | 15938 | 89 | 63 | 2277 | 71 | 77 |
| ≥0.9 | 390 | 68 | 47 | 65 | 15821 | 89 | 64 | 2031 | 72 | 75 |
Percentage of edges in the full graph which connect proteins sharing the same annotation according to the gold standard. These values are the r used in the calculation of edge weights by the noisy-or function. The number of edges scored is shown in the columns labeled by organism name. Abbreviations: GO SLIM Molecular Function (MF), GO SLIM Biological Process (BP).
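The per-threshold rows of this table can be thought of as the output of a computation like the sketch below: keep the co-occurrence edges at or above a confidence threshold, then report how many were scored and what percentage connect proteins sharing an annotation. The function name, the data layout, and the choice to skip edges that touch an unannotated protein are assumptions made for illustration, not details taken from the paper.

```python
def shared_annotation_percentage(edges, annotations, threshold=0.0):
    """Return (edges_scored, percent_sharing_an_annotation).

    edges: iterable of (protein_a, protein_b, score)
    annotations: dict mapping protein -> set of gold-standard terms
    Edges are kept when score >= threshold and score > 0, which mirrors
    the "> 0.0" first row and the ">= t" rows of the table.
    """
    scored = 0
    shared = 0
    for a, b, score in edges:
        if score < threshold or score <= 0.0:
            continue
        terms_a = annotations.get(a)
        terms_b = annotations.get(b)
        if not terms_a or not terms_b:
            continue  # assumption: edges touching unannotated proteins are not scored
        scored += 1
        if terms_a & terms_b:
            shared += 1
    percent = 100.0 * shared / scored if scored else 0.0
    return scored, percent

# Toy data (hypothetical proteins and terms).
edges = [("P1", "P2", 0.9), ("P2", "P3", 0.4), ("P3", "P4", 0.05), ("P1", "P4", 0.0)]
annotations = {
    "P1": {"kinase"},
    "P2": {"kinase", "binding"},
    "P3": {"binding"},
    "P4": {"transport"},
}
for t in (0.0, 0.5):
    print(t, shared_annotation_percentage(edges, annotations, t))
```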
Characterization of Graphs
| PPI | 12177 (4581) | 2619 (1955) | 20056 (6689) | |
| GENETIC | 4429 (1304) | 20543 (2934) | 6523 (2734) | |
| COLIT | 8621 (2605) | 21847 (1665) | 17508 (2228) | |
| PPI | 12001 (4463) | 2451 (1685) | 19992 (6573) | |
| GENETIC | 4427 (1301) | 20359 (2736) | 6418 (2551) | |
| COLIT | 8390 (2291) | 12018 (921) | 17324 (2022) | |
| PPI | 3 (288) | 1 (145) | 3 (173) | |
| GENETIC | 2 (153) | 6 (150) | 2 (191) | |
| COLIT | 4 (88) | 16 (243) | 8 (278) | |
| PPI | 0.09 | 0.9 | 0.2 | |
| COLIT | 1 | 21 | 6 | |
| GENETIC | 0.2 | 0.1 | 0.6 | |
| COLIT | 7 | 0.07 | 0.7 | |
| GENETIC | 27 | 89 | 24 | |
| COLIT | 40 | 90 | 46 |
Various measures to characterize the density and overlap of graphs. All values are given for the largest connected component of the graph except in the first panel as indicated. Abbreviations: PPI – only edges from experiments measuring protein-protein interactions; GENETIC – only edges from genetic assays; COLIT – only edges between proteins mentioned at least twice together in literature abstracts.
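The COLIT definition above (proteins mentioned together in at least two abstracts) can be sketched as a simple pair count. The input format, one list of recognized protein names per abstract, and the function name are assumptions for illustration; the sketch does not cover how protein mentions are recognized in text.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(abstract_mentions, min_count=2):
    """Return {(protein_a, protein_b): n_abstracts} for pairs co-mentioned
    in at least min_count abstracts.

    abstract_mentions: iterable of protein-name lists, one list per abstract.
    """
    counts = Counter()
    for proteins in abstract_mentions:
        # Count each unordered pair once per abstract, however often it repeats.
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy corpus of three abstracts (hypothetical protein names).
mentions = [["A", "B", "C"], ["A", "B"], ["B", "C", "B"]]
print(cooccurrence_edges(mentions))  # {('A', 'B'): 2, ('B', 'C'): 2}
```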
Characterization of Annotations
| 85 | 37 | 32 | 37 | 48 | 39 | 49 | ||
| PPI | 23 | 38 | 28 | 41 | 51 | 39 | 42 | |
| GENETIC | 14 | 32 | 16 | 24 | 26 | 53 | 41 | |
| COLIT | 2 | 15 | 7 | 17 | 24 | 9 | 7 | |
| PPI | 31 | 53 | 36 | 46 | 61 | 69 | 70 | |
| GENETIC | 14 | 40 | 24 | 53 | 50 | 32 | 17 | |
| COLIT | 4 | 34 | 18 | 53 | 59 | 33 | 28 | |
| PPI | 4 | 9 | 5 | 17 | 29 | 15 | 16 | |
| GENETIC | 0.9 | 7 | 4 | 4 | 4 | 7 | 1 | |
| COLIT | 0.08 | 2 | 1 | 4 | 2 | 1 | 1 | |
| PPI | 37 | 18 | 36 | 10 | 6 | 10 | 14 | |
| GENETIC | 48 | 12 | 40 | 42 | 50 | 32 | 70 | |
| COLIT | 80 | 40 | 71 | 59 | 53 | 47 | 70 | |
Various measures to characterize the completeness and connections among gold-standard annotations in the graphs. All values are given for all nodes in the largest connected component of the graph. The numbers of nodes and edges from which these percentages are calculated are shown in panel 2 of Table 3. Unknown refers to proteins uncharacterized by the annotation source. Other abbreviations are as given in Table 3.
Figure 1. Histogram Comparison of Co-Occurrence Measures. Histogram of the number of proteins assigned a given confidence value by the co-occurrence measures. Abbreviations: MUT – Mutual Information Measure; HYG – Hypergeometric Measure; ACF – Asymmetric Co-occurrence Fraction.
Figure 2. Modified ROC curves for Functional Flow. Number of proteins predicted incorrectly (FP) versus number of proteins predicted correctly (TP). Abbreviations: GOMF – GO SLIM Molecular Function; GOBP – GO SLIM Biological Process; PPI ONLY – only edges from experiments measuring protein-protein interactions, such as yeast two-hybrid and affinity precipitation; GENETIC ONLY – only edges from genetic assays, such as synthetic lethality studies; PPI+GENETIC – edges from both PPI and from genetic assays, such as synthetic lethality studies; PPI+COLIT – edges from both PPI and edges between proteins found by literature co-occurrence, where Best and Worst correspond to the best and worst combinations of threshold setting and co-occurrence measure, respectively (cf. Figure 5).
Figure 3. Detailed Modified ROC curves for Functional Flow. Number of proteins predicted incorrectly (FP) versus number of proteins predicted correctly (TP), for FP up to 100. Abbreviations as in Figure 2.
Figure 4. Varying Annotation Granularity. Performance as the level of annotation detail increases from Level 2 to Level 3 in the MIPS functional hierarchy. a) Majority, b) Functional Flow. Abbreviations: PPI ONLY – only edges from experiments measuring protein-protein interactions; PPI+COLIT – PPI edges combined with edges between proteins mentioned at least twice together in literature abstracts.
Figure 5. Varying the Co-occurrence Threshold. Relative performance of Functional Flow when varying the threshold used to define the co-occurrence interaction set. Shown is the number of true positives (TP) when the scoring threshold is set to yield 100 false positives (FP) (y-axis). The values of the x-axis denote instances of Functional Flow on graphs combining PPI and the interaction sets for each corresponding setting of the co-occurrence threshold (x = -1 shows PPI ONLY and x = 0–9 denote PPI plus the datasets obtained using thresholds 0.0 to 0.9). The lines are annotated to denote the MUT, HYG and ACF metrics. The best and worst performers, respectively, over all co-occurrence measures and all thresholds, are shown in parentheses below the plot title. These combinations appear as Best and Worst in Figures 2 and 3.
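The quantity plotted in Figure 5 (TP at the score threshold that yields a fixed number of FP) can be read off a ranked prediction list, as in the hedged sketch below. The function name and input layout are hypothetical, ties in score are not treated specially, and counting correct predictions ranked between the last allowed FP and the next one is one possible convention, not necessarily the one used by the authors.

```python
def true_positives_at_fp(predictions, max_fp=100):
    """Count correct predictions accepted before the FP budget is exceeded.

    predictions: iterable of (score, is_correct) pairs, one per predicted protein.
    The list is walked from the highest score downward; the walk stops at the
    first incorrect prediction that would push FP above max_fp.
    """
    tp = 0
    fp = 0
    for _score, is_correct in sorted(predictions, key=lambda p: p[0], reverse=True):
        if is_correct:
            tp += 1
        elif fp < max_fp:
            fp += 1
        else:
            break
    return tp

# Toy ranked predictions (hypothetical scores and outcomes).
preds = [(0.9, True), (0.8, False), (0.7, True), (0.6, True),
         (0.5, False), (0.4, True), (0.3, False)]
print(true_positives_at_fp(preds, max_fp=2))  # 4
```

Recording the running (FP, TP) pair at every step of the same walk, instead of only the final TP count, would give the points of a modified ROC curve like those in Figures 2 and 3.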