Guoxian Yu1,2, Hailong Zhu3, Carlotta Domeniconi4, Jiming Liu5.
Abstract
BACKGROUND: High-throughput bio-techniques accumulate an ever-increasing amount of genomic and proteomic data. These data are far from being functionally characterized, despite advances in the functional annotation of genes (or of their protein products). Owing to the limits of experimental techniques and to research bias in biology, regularly updated functional annotation databases such as the Gene Ontology (GO) are far from complete. Given the importance of protein functions for biological studies and drug design, proteins should be annotated more comprehensively and precisely.
Year: 2015 PMID: 26310806 PMCID: PMC4551531 DOI: 10.1186/s12859-015-0713-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An example of a partially annotated protein. The GO terms in the white ellipses are the currently available functions of the protein, and the terms in the colored ellipses are the missing functions of the protein. In particular, the terms in the grey ellipses are missing functions of the first type: they are associated with other proteins, but are missing for the protein being considered. The terms in the blue ellipses belong to the second type: they exist in the GO hierarchy, but they are not associated with any protein of interest. We observe that any missing function of a protein should be a leaf node of the hierarchy, and this hierarchy is defined with respect to the available terms associated with the protein, rather than with the whole GO hierarchy. We can replenish a non-leaf term of a protein directly using its descendant terms, due to the true path rule of GO.
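The true path rule mentioned in the caption says that a protein annotated with a term is implicitly annotated with all of that term's ancestors. The following is a minimal sketch of that propagation, assuming the hierarchy is available as a child-to-parents mapping; the function names and the toy term names are hypothetical placeholders, not real GO identifiers or code from the paper.

```python
from collections import deque


def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as child -> list of parents."""
    seen = set()
    queue = deque(parents.get(term, ()))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, ()))
    return seen


def apply_true_path_rule(annotations, parents):
    """Close a set of annotated terms under the ancestor relation."""
    closed = set(annotations)
    for term in annotations:
        closed |= ancestors(term, parents)
    return closed


# Hypothetical toy hierarchy (placeholder names, not real GO terms).
toy_parents = {
    "leaf_a": ["mid_1"],
    "leaf_b": ["mid_1", "mid_2"],
    "mid_1": ["root"],
    "mid_2": ["root"],
}

print(sorted(apply_true_path_rule({"leaf_b"}, toy_parents)))
# prints ['leaf_b', 'mid_1', 'mid_2', 'root']
```

Because ancestors follow from descendants in this way, only the most specific (leaf) terms of a protein need to be predicted; the non-leaf terms can be filled in afterwards.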
Statistics of GO annotations. The number in parentheses after Yeast (or Human) is the number of proteins in that dataset. The first data column gives the total number of distinct GO terms used in the empirical study, with the number of GO annotations over all proteins in parentheses. [3,10) gives the number of terms associated with at least 3 and fewer than 10 proteins; [10,30) gives the number of terms associated with at least 10 and fewer than 30 proteins; and ≥30 gives the number of terms associated with at least 30 proteins. Avg ± Std is the average number of annotations per protein and its standard deviation. The root GO term of each sub-ontology (BP, CC and MF) is not included.
| Dataset | Ontology | Terms (annotations) | [3,10) | [10,30) | ≥30 | Avg ± Std |
|---|---|---|---|---|---|---|
| Yeast (5914) | BP | 2979 (210949) | 1350 | 761 | 868 | 35.67 ± 34.62 |
| | CC | 731 (79378) | 359 | 170 | 202 | 13.42 ± 12.01 |
| | MF | 978 (35033) | 546 | 236 | 196 | 5.92 ± 6.47 |
| Human (19009) | BP | 7294 (694455) | 3237 | 1877 | 2180 | 36.53 ± 53.25 |
| | CC | 978 (230826) | 414 | 224 | 340 | 12.14 ± 12.66 |
| | MF | 1772 (106410) | 943 | 420 | 409 | 5.59 ± 7.99 |
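The size bins in this table can be reproduced from a protein-to-terms mapping by counting how many proteins carry each term. In every row the three bins sum to the total number of terms (for Yeast BP, 1350 + 761 + 868 = 2979), which suggests that terms with fewer than 3 proteins were excluded. The sketch below assumes that reading; the function name and the toy data are hypothetical.

```python
from collections import Counter


def bin_terms_by_size(protein_terms, min_size=3):
    """Count terms whose number of annotated proteins falls in each size bin."""
    # Term size = number of distinct proteins annotated with the term.
    sizes = Counter(t for terms in protein_terms.values() for t in set(terms))
    bins = {"[3,10)": 0, "[10,30)": 0, ">=30": 0}
    for n in sizes.values():
        if n < min_size:
            continue  # assumed: terms below the minimum size are not counted
        if n < 10:
            bins["[3,10)"] += 1
        elif n < 30:
            bins["[10,30)"] += 1
        else:
            bins[">=30"] += 1
    return bins


# Hypothetical toy data: 40 proteins, four terms of sizes 2, 5, 12 and 30.
toy_protein_terms = {f"p{i}": [] for i in range(40)}
for name, size in [("t_tiny", 2), ("t_a", 5), ("t_b", 12), ("t_c", 30)]:
    for i in range(size):
        toy_protein_terms[f"p{i}"].append(name)

print(bin_terms_by_size(toy_protein_terms))
# prints {'[3,10)': 1, '[10,30)': 1, '>=30': 1}
```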
Results of predicting the missing BP functions of partially annotated Yeast proteins (N=5914, )
| Metric | m | dRW-kNN | dRW | ITSS | PILL | Naive |
|---|---|---|---|---|---|---|
| MacroF1 | 1 | 93.14 ± 0.13 | | 91.66 ± 0.09 | 91.52 ± 0.15 | 1.99 ± 0.00 |
| | 3 | 82.72 ± 0.25 | | 80.14 ± 0.14 | 79.77 ± 0.16 | 2.01 ± 0.00 |
| | 5 | 74.67 ± 0.22 | | 71.16 ± 0.33 | 70.96 ± 0.22 | 2.03 ± 0.00 |
| AvgROC | 1 | 99.88 ± 0.01 | | 98.24 ± 0.02 | 98.77 ± 0.03 | 45.88 ± 0.00 |
| | 3 | | | 94.44 ± 0.08 | 96.36 ± 0.15 | 45.88 ± 0.00 |
| | 5 | | 98.89 ± 0.03 | 90.48 ± 0.17 | 93.83 ± 0.06 | 45.88 ± 0.00 |
| 1-RankLoss | 1 | 99.96 ± 0.00 | | 98.99 ± 0.02 | 99.81 ± 0.01 | 91.13 ± 0.00 |
| | 3 | | 99.17 ± 0.03 | 96.89 ± 0.05 | 99.23 ± 0.03 | 91.04 ± 0.00 |
| | 5 | | 97.63 ± 0.03 | 93.99 ± 0.10 | 98.42 ± 0.05 | 90.95 ± 0.01 |
| Fmax | 1 | 97.97 ± 0.00 | | 97.90 ± 0.00 | 97.91 ± 0.00 | 36.96 ± 0.00 |
| | 3 | | 93.92 ± 0.01 | 93.66 ± 0.02 | 93.61 ± 0.00 | 36.86 ± 0.00 |
| | 5 | | 89.88 ± 0.00 | 89.66 ± 0.02 | 89.41 ± 0.00 | 36.84 ± 0.03 |
| RAccuracy | 1 | 38.75 ± 0.66 | | 12.41 ± 0.48 | 21.65 ± 0.37 | 37.51 ± 0.94 |
| | 3 | | 36.08 ± 0.24 | 23.02 ± 0.06 | 22.27 ± 0.37 | 37.84 ± 0.75 |
| | 5 | | 33.58 ± 0.24 | 27.39 ± 0.29 | 23.92 ± 0.08 | 37.69 ± 0.37 |
| Coverage | 1 | 78.24 ± 0.95 | | 405.01 ± 9.56 | 232.52 ± 4.58 | 1585.06 ± 0.99 |
| | 3 | | 234.61 ± 4.92 | 943.54 ± 10.84 | 524.50 ± 12.33 | 1605.22 ± 0.95 |
| | 5 | | 469.13 ± 9.17 | 1412.81 ± 9.85 | 806.23 ± 18.18 | 1625.35 ± 3.17 |
The numbers in boldface denote the best (or comparable best) statistically significant performance (according to a t-test at 95 % significance level). ↓ means the lower the value, the better the performance. m is the number of missing functions for a protein, N is the total number of missing functions, and is the number of the second kind of missing functions of N proteins for a given m. m=1, , N 1=4705; m=3, , N 3=14079; m=5, , N 5=23299
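Two of the ranking-based metrics reported here, 1-RankLoss and Coverage, can be illustrated for a single protein. The sketch below uses the standard multi-label definitions of ranking loss and coverage; this record does not give the paper's exact formulas, so the tie handling and normalization are assumptions, and the label names and scores are made up for illustration.

```python
def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs that are ordered wrongly."""
    rel = [s for label, s in scores.items() if label in relevant]
    irr = [s for label, s in scores.items() if label not in relevant]
    if not rel or not irr:
        return 0.0  # no pairs to compare
    # Assumption: a tie between a relevant and an irrelevant label counts as an error.
    wrong = sum(1 for r in rel for i in irr if r <= i)
    return wrong / (len(rel) * len(irr))


def coverage(scores, relevant):
    """Steps down the ranked label list needed to reach every relevant label.

    Assumes `relevant` is non-empty and every relevant label has a score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    worst_rank = max(ranked.index(label) + 1 for label in relevant)
    return worst_rank - 1


# Hypothetical predicted scores for one protein over four candidate terms.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.1}
toy_relevant = {"a", "c"}

print(1 - ranking_loss(toy_scores, toy_relevant))  # prints 0.75
print(coverage(toy_scores, toy_relevant))          # prints 2
```

Averaging these per-protein values over all proteins gives dataset-level scores. Lower coverage is better, while 1-RankLoss is oriented so that higher is better.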
Fig. 2 AUC difference between dRW-kNN and ITSS. The AUC (Area Under the ROC Curve) difference between dRW-kNN and ITSS on proteins of Yeast annotated with BP terms of different sizes. [3,10) includes 1350 terms, [10,30) includes 761 terms, and ≥30 includes 868 terms.
Results of dRW, dRW-Corpus, dRW-Disjoint, dRW-E in predicting the missing BP functions of Yeast proteins, with m=3
| Metric | dRW | dRW-Corpus | dRW-Disjoint | dRW-E |
|---|---|---|---|---|
| MacroF1 | 83.29 ± 0.13 | 79.77 ± 0.09 | 83.19 ± 0.09 | 83.17 ± 0.07 |
| AvgROC | 99.59 ± 0.02 | 93.61 ± 0.07 | 99.57 ± 0.01 | 99.57 ± 0.00 |
| 1-RankLoss | | 93.87 ± 0.05 | | 98.87 ± 0.01 |
| Fmax | 93.92 ± 0.01 | 93.67 ± 0.00 | 93.90 ± 0.01 | 93.89 ± 0.01 |
| RAccuracy | | 15.58 ± 0.31 | 33.65 ± 0.27 | 32.67 ± 0.50 |
| Coverage | | 1843.95 ± 18.70 | 242.53 ± 1.44 | 255.14 ± 5.49 |
Numbers of true positive predictions made by dRW, dRW-kNN, ITSS, PILL and Naive from an older GOA file (date: 2010-01-20) to a recent GOA file (date: 2014-06-09) of Yeast and Human. The data in the parentheses are the corresponding true positive rate for each of the methods. TPR means the true path rule is applied to append the ancestor functions of the positive predictions, and NoTPR means the true path rule is not applied
| Dataset | Setting | dRW | dRW-kNN | ITSS | PILL | Naive |
|---|---|---|---|---|---|---|
| Yeast | NoTPR | 6(6.00 %) | 17(17.00 %) | 6(6.00 %) | 0(0.00 %) | 31(31.00 %) |
| | TPR | 34(6.58 %) | 17(17.00 %) | 6(6.00 %) | 11(1.83 %) | 31(31.00 %) |
| Human | NoTPR | 10(10.00 %) | 27(27.00 %) | 20(20.00 %) | 19(19.00 %) | 48(48.00 %) |
| | TPR | 120(17.36 %) | 27(27.00 %) | 20(20.00 %) | 80(21.45 %) | 48(48.00 %) |
Examples of correctly predicted missing BP functions by dRW from an older GOA file (date: 2010-01-20) to a recent GOA file (date: 2014-06-09) of Yeast and Human. hCount gives the number of proteins annotated with the term in the older GOA file. Depth represents the term’s depth in the GO hierarchy
| Yeast | | | | Human | | | |
|---|---|---|---|---|---|---|---|
| Protein | GO terms | hCount | Depth | Protein | GO terms | hCount | Depth |
| HIS5 | GO:0001193 | 0 | 8 | EZR | GO:0002143 | 0 | 9 |
| PET494 | GO:0019379 | 0 | 5 | TMEM200C | GO:0007094 | 8 | 8 |
| CYC1 | GO:0019430 | 0 | 4 | C9orf96 | GO:0016056 | 3 | 7 |
| FES1 | GO:0044718 | 0 | 5 | DRGX | GO:0035511 | 0 | 5 |
| MET17 | GO:0090334 | 0 | 6 | HDAC7 | GO:0035511 | 0 | 5 |
| TSTA3 | GO:2000679 | 2 | 7 | QPRT | GO:0045040 | 0 | 5 |
| CSHL1 | GO:0045040 | 0 | 5 | | | | |
| RGS3 | GO:0060397 | 2 | 7 | | | | |
| PRDM7_V2 | GO:0071300 | 0 | 6 | | | | |
| TMEM82 | GO:0090050 | 2 | 7 | | | | |