Guoxian Yu, Chang Lu, Jun Wang.
Abstract
BACKGROUND: Gene Ontology (GO) is a community effort to represent the functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Owing to limited resources, only a small portion of annotations are manually checked by curators; the rest are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently reports that a considerable number of noisy (or incorrect) annotations remain. Given the wide application of annotations, identifying noisy annotations is an important but seldom studied open problem.
Keywords: Evidence codes; GO annotations; Gene ontology; Sparse representation
Year: 2017 PMID: 28732468 PMCID: PMC5521088 DOI: 10.1186/s12859-017-1764-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Four categories of evidence codes used in GO and their meanings
| Experimental | Computational | Author | Curatorial |
|---|---|---|---|
| EXP: inferred from experiment | ISS: inferred from sequence or structural similarity | TAS: traceable author statement | IC: inferred by curator |
| IDA: inferred from direct assay | ISO: inferred from sequence orthology | NAS: non-traceable author statement | ND: no biological data available |
| IPI: inferred from physical interaction | ISA: inferred from sequence alignment | | |
| IMP: inferred from mutant phenotype | ISM: inferred from sequence model | | |
| IGI: inferred from genetic interaction | IGC: inferred from genomic context | | |
| IEP: inferred from expression pattern | IBA: inferred from biological aspect of ancestor | | |
| | IBD: inferred from biological aspect of descendant | | |
| | IKR: inferred from key residues | | |
| | IRD: inferred from rapid divergence | | |
| | RCA: inferred from reviewed computational analysis | | |
| | IEA: inferred from electronic annotation | | |
Statistics of GO annotations of H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus (archived date: May, 2015)
| Species (genes) | Branch (GO terms) | Annotations | Noisy annotations |
|---|---|---|---|
| H. sapiens (18939) | BP (13875) | 1183415 | 23143 |
| | CC (1672) | 375982 | 2770 |
| | MF (4244) | 234599 | 2322 |
| A. thaliana (24377) | BP (5132) | 794092 | 2651 |
| | CC (848) | 222465 | 498 |
| | MF (2684) | 197422 | 2301 |
| S. cerevisiae (5887) | BP (4768) | 244374 | 898 |
| | CC (931) | 104831 | 87 |
| | MF (2282) | 65745 | 338 |
| G. gallus (12782) | BP (11783) | 572194 | 19603 |
| | CC (1451) | 201471 | 3859 |
| | MF (3350) | 144112 | 2345 |
| B. taurus (17316) | BP (11783) | 768861 | 20788 |
| | CC (1521) | 272289 | 3745 |
| | MF (3350) | 189509 | 2371 |
| M. musculus (21188) | BP (13744) | 1036467 | 15376 |
| | CC (1621) | 356694 | 1603 |
| | MF (4148) | 231078 | 2195 |
The number in parentheses in the first column is the number of genes, and the number in parentheses in the second column is the number of GO terms involved. The third column gives the number of annotations for each branch, and the last column gives the number of noisy annotations, i.e., annotations that were present in the GOA file archived in May but absent from the GOA file archived in September of the same year
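The noisy-annotation counts described above amount to a set difference between two archived GOA releases. The following is a minimal sketch of that comparison, assuming each annotation has already been reduced to a (gene, GO term) pair; the function name `noisy_annotations` and the toy pairs are hypothetical and are not taken from the paper, and real GOA files would need to be parsed first.

```python
def noisy_annotations(earlier, later):
    """Return annotations present in the earlier archive but absent from the later one."""
    return set(earlier) - set(later)


# Toy (gene, GO term) pairs, made up for illustration only.
may = {
    ("geneA", "GO:0000001"),
    ("geneA", "GO:0000002"),
    ("geneB", "GO:0000003"),
}
september = {
    ("geneA", "GO:0000001"),
    ("geneB", "GO:0000003"),
    ("geneB", "GO:0000004"),  # newly added annotations are not counted as noisy
}

removed = noisy_annotations(may, september)
print(sorted(removed))  # [('geneA', 'GO:0000002')]
```

Annotations that appear only in the later archive are ignored, which matches the one-directional definition given in the table note.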
Statistics of GO annotations of H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus (archived date: May, 2016)
| Species (genes) | Branch (GO terms) | Annotations | Noisy annotations |
|---|---|---|---|
| H. sapiens (18932) | BP (13172) | 1141456 | 22706 |
| | CC (1707) | 385525 | 3141 |
| | MF (4345) | 243928 | 4660 |
| A. thaliana (6931) | BP (4157) | 243249 | 15918 |
| | CC (750) | 97616 | 2937 |
| | MF (2271) | 81318 | 3554 |
| S. cerevisiae (6719) | BP (4385) | 222754 | 13647 |
| | CC (990) | 108186 | 2768 |
| | MF (2379) | 65032 | 4394 |
| G. gallus (10912) | BP (10643) | 244374 | 898 |
| | CC (1429) | 177491 | 4448 |
| | MF (3298) | 124997 | 2130 |
| B. taurus (17886) | BP (11724) | 753976 | 6541 |
| | CC (1550) | 281284 | 2244 |
| | MF (3298) | 194425 | 1396 |
| M. musculus (21279) | BP (13141) | 481417 | 18182 |
| | CC (1686) | 367461 | 3917 |
| | MF (4238) | 239664 | 2705 |
The number in parentheses in the first column is the number of genes, and the number in parentheses in the second column is the number of GO terms involved. The third column gives the number of annotations for each branch, and the last column gives the number of noisy annotations, i.e., annotations that were present in the GOA file archived in May but absent from the GOA file archived in September of the same year
Performance of predicting noisy annotations in GOA files of H. sapiens (archived date: May, 2016)
| Branch | Metric | Random | LF | NtN | NoisyGOA | SR | EC | NtN+EC | NoisyGOA+EC | NoGOA |
|---|---|---|---|---|---|---|---|---|---|---|
| BP | Precision | 23.99±0.49 | 29.50±0.57 | 23.71±0.47 | 33.98±0.67 | 35.24±0.56 | 29.43±0.56 | 26.30±0.51 | 38.55±0.72 | |
| | Recall | | 29.58±0.57 | 55.84±0.87 | 41.08±0.76 | 35.67±1.48 | 49.04±0.86 | 52.52±0.89 | 44.82±0.81 | 41.45±0.76 |
| | F1-Score | 31.51±0.60 | 29.54±0.57 | 30.94±0.55 | 36.63±0.70 | 35.44±0.69 | 35.04±0.64 | 33.24±0.61 | | |
| CC | Precision | 19.34±0.52 | 28.62±0.77 | 17.75±0.52 | 36.41±0.89 | | 17.40±0.45 | 18.00±0.48 | 36.13±0.88 | |
| | Recall | 50.62±1.12 | 28.69±0.77 | 49.68±1.18 | 44.45±1.02 | 41.91±1.02 | | 44.80±1.07 | 44.15±1.02 | 41.85±0.98 |
| | F1-Score | 25.98±0.65 | 28.65±0.77 | 24.22±0.65 | 38.79±0.93 | | 25.34±0.58 | 24.34±0.61 | 38.50±0.92 | |
| MF | Precision | 27.74±0.39 | 23.60±0.38 | 36.43±0.45 | 38.16±0.48 | 46.18±0.54 | 41.25±0.50 | 49.90±0.55 | 52.18±0.57 | |
| | Recall | 41.94±0.50 | 23.63±0.38 | 48.83±0.57 | 46.41±0.55 | 46.57±0.54 | | 56.80±0.60 | 58.26±0.62 | 59.47±0.60 |
| | F1-Score | 30.35±0.41 | 23.61±0.38 | 38.82±0.47 | 39.44±0.48 | 46.34±0.54 | 44.45±0.51 | 51.75±0.56 | 53.23±0.58 | |
The numbers in boldface denote the best performance
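For readers unfamiliar with the three metrics reported above, the sketch below computes precision, recall and F1-score for a set of predicted noisy annotations against a set of actual noisy annotations. It uses the standard definitions; the exact evaluation protocol of the paper (for example, how the ± values were obtained) is not given in this section, and the function name and toy labels are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Standard precision, recall and F1 for two sets of annotations."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotation labels, made up for illustration only.
predicted_noisy = {"a", "b", "c", "d"}
actual_noisy = {"b", "c", "e"}

p, r, f1 = precision_recall_f1(predicted_noisy, actual_noisy)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case two of the four predictions are correct (precision 0.5) and two of the three actual noisy annotations are recovered (recall about 0.667).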
Fig. 1 Performance of NoGOA in predicting noisy annotations under different input values of α
Results of gene function prediction on H. sapiens (archived date: May, 2016)
| Metric | BP Original | BP NoGOA | CC Original | CC NoGOA | MF Original | MF NoGOA |
|---|---|---|---|---|---|---|
| MicroAvgF1 | | 92.64 | 93.72 | | | |
| MacroAvgF1 | 89.04 | | 88.06 | | 89.55 | |
| AvgPrec | 88.45 | | 88.75 | | 90.78 | |
| AvgROC | 94.94 | | 95.12 | | 97.66 | |
| Fmax | | 93.50 | 93.85 | | 94.62 | |
| Smin ↓ | 8.69 | | | | 2.40 | |
Values in boldface denote the better result. ‘Original’ uses the annotations in the historical GOA file directly to predict gene function; ‘NoGOA’ first removes the predicted noisy annotations from the historical GOA file and then predicts gene function. ↓ indicates that lower values are better
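The difference between the MicroAvgF1 and MacroAvgF1 rows can be illustrated with a short sketch. It assumes the standard definitions (macro: average the per-term F1 values; micro: pool the counts over all terms, then compute one F1); the definitions used by the authors are not stated in this section, and the per-term counts below are made up.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in true positives, false positives, false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_avg_f1(counts):
    """Average of per-term F1 values; every GO term weighs the same."""
    return sum(f1_from_counts(*c) for c in counts.values()) / len(counts)


def micro_avg_f1(counts):
    """F1 of the pooled counts; frequent GO terms dominate."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts per GO term, for illustration only.
counts = {"termX": (8, 2, 2), "termY": (1, 1, 3)}

macro = macro_avg_f1(counts)
micro = micro_avg_f1(counts)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

With these toy counts the well-predicted, frequent term pulls the micro average above the macro average, which is why the two rows in the table can differ.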
Examples of correctly (√) and wrongly (×) predicted direct noisy annotations by NoGOA in the CC branch of S. cerevisiae
| Protein | | GO term | Evidence codes | Details |
|---|---|---|---|---|
| AAC1 (ADP/ATP carrier) | | GO:0005758 (mitochondrial intermembrane space) | TAS | Reactome:R-SCE-1252255 |
| | | GO:0005829 (cytosol) | TAS | Reactome:R-SCE-1252255 |
| AAP1 (Alanine/arginine aminopeptidase) | | GO:0005886 (plasma membrane) | IBA | GO_REF:0000033 |
| | | GO:0005664 (nuclear origin of replication recognition complex) | IDA | PMID:9372948 |
| | × | GO:0000276 (mitochondrial proton-transporting ATP synthase complex, coupling factor F(o)) | IDA | PMID:9224714 |