Mengyuan Fan, Hong Sang Low, Markus R Wenk, Limsoon Wong.
Abstract
MOTIVATION: Although semantic similarity in Gene Ontology (GO) and other approaches may be used to find similar GO terms, there is not yet a method to systematically find a class of GO terms sharing a common property with high accuracy (e.g., involving human curation).
Year: 2014 PMID: 25209026 PMCID: PMC4160098 DOI: 10.1093/database/bau089
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Starting condition #1, nonrandom
| GO accession number | Class label | Term name | Number of terms in the subtree rooted at the term |
|---|---|---|---|
| GO:0006644 | BP+ | Phospholipid metabolic process | 105 |
| GO:0016020 | CC+ | Membrane | 133 |
| GO:0008289 | MF+ | Lipid binding | 70 |
| GO:0006767 | BP− | Water-soluble vitamin metabolic process | 44 (178)ᵃ |
| GO:0030880 | CC− | RNA polymerase complex | 17 (31)ᵃ |
| GO:0000496 | MF− | Base pairing | 85 (96)ᵃ |
ᵃ Number of terms when their GO− ancestors are included under the inheritance constraint.
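The two counts given for each GO− seed term can be read as the size of the subtree rooted at the term and the size of that subtree once its ancestors are added. The following is a minimal sketch of that counting, assuming that a GO− label propagates upward to all ancestors. The toy graph and the helper names `invert` and `closure` are hypothetical and do not come from the source or from any GO library.

```python
from collections import deque

# Hypothetical toy DAG (child -> list of parents); these are NOT real GO terms.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c"],
}


def invert(parents):
    """Build a parent -> children map from a child -> parents map."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    return children


def closure(start, edges):
    """Return `start` plus every node reachable from it by following `edges`."""
    seen = set(start)
    queue = deque(start)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


CHILDREN = invert(PARENTS)

# Subtree rooted at a (hypothetical) GO- seed term.
subtree = closure({"c"}, CHILDREN)
# Assumed inheritance constraint: ancestors of GO- terms are also GO-.
with_ancestors = closure(subtree, PARENTS)

print(len(subtree), len(with_ancestors))
```

In this toy graph the first number plays the role of the subtree count and the second the role of the parenthesised count in the table.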
All six starting conditions for iterative prediction
| Starting condition | #1 | #2 | #3 | #4 | #5 | #6 |
|---|---|---|---|---|---|---|
| % GO+ in the training set | 6.5 | 2/6.7 | 5/12.6 | 20/45.3 | 50/76.1 | 80/92.3 |
| % GO− in the training set | 2.1 | 5/15 | 10/22.9 | 20/36.6 | 50/65.3 | 80/86.8 |
| Number of GO+ in the test set | 4366 | 4386 | 4061 | 2758 | 1145 | 358 |
| Number of GO+ and GO− in the test set | 18,305 | 16,529 | 15,047 | 11,814 | 6162 | 2210 |
Number of GO terms
| Class label | Gold standard with original curation rules (GO version: June 2009) | Gold standard mapped to April 2013 version of GO with original curation rules (may violate inheritance constraint) | After resolving inheritance constraints and with final set of curation rules (GO version: April 2013) | Final version (expanded with iterative prediction) with final curation rules (GO version: April 2013) |
|---|---|---|---|---|
| BP+ | 1639 | 1606 | 2225 | 2405 |
| BP− | 2309 | 2291 | 2650 | 5331 |
| BP? | 12,783 | 20,855 | 19,877 | 17,016 |
| CC+ | 924 | 912 | 712 | 748 |
| CC− | 1461 | 1429 | 1845 | 1870 |
| CC? | 0 | 817 | 601 | 540 |
| MF+ | 1736 | 1668 | 1495 | 1559 |
| MF− | 6882 | 6658 | 6956 | 7028 |
| MF? | 0 | 1206 | 1101 | 965 |
| GO+ | 4299 | 4186 | 4432 | 4712 |
| GO− | 10,652 | 10,378 | 11,451 | 14,229 |
| GO? | 12,783 | 22,898 | 21,579 | 18,521 |
| All | 27,734 | 37,462 | 37,462 | 37,462 |
KWs and lipid relatedness
| Number of terms | No KWs (87%) | Containing at least one KW (13%) |
|---|---|---|
| Non–lipid-related | 14,014 | 224 |
| Lipid-related | 2406 | 2306 |
KWs and lipid relatedness considering ancestors
| Number of terms | Neither the term nor any of its ancestors contains a KW (67%) | The term itself or at least one of its ancestors contains a KW (33%) |
|---|---|---|
| Non–lipid-related | 12,108 | 2121 |
| Lipid-related | 1204 | 3508 |
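One way to read the two KW tables is as confusion matrices, treating "contains at least one KW" as a positive prediction of lipid relatedness. That reading is an assumption made here for illustration, not something the tables state. Under it, precision and recall follow directly from the counts. The `Confusion` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 counts, with 'contains a KW' taken as the positive prediction (assumption)."""
    tp: int  # lipid-related, KW present
    fp: int  # non-lipid-related, KW present
    fn: int  # lipid-related, no KW
    tn: int  # non-lipid-related, no KW

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)


# Counts copied from the two tables above.
term_only = Confusion(tp=2306, fp=224, fn=2406, tn=14014)
term_or_ancestor = Confusion(tp=3508, fp=2121, fn=1204, tn=12108)

for name, c in [("term only", term_only), ("term or ancestor", term_or_ancestor)]:
    print(f"{name}: precision={c.precision:.3f}, recall={c.recall:.3f}")
```

On these counts, matching on the term alone gives a precision of about 0.91 with a recall of about 0.49, while also matching on ancestors raises recall to about 0.74 at a precision of about 0.62.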
Figure 1. Curation efficiency for the six starting conditions. The black line with circles represents the iterative version, whereas the blue line with triangles represents the non-iterative one.
Figure 2. Evaluation measures for starting condition 2. The left graph shows curation efficiency; the right graph shows recall and precision.
Paired t-test on ranks of lipid-related terms undiscovered after iteration 25
| | After iteration 8 | After iteration 16 | After iteration 25 |
|---|---|---|---|
| Before iteration 1 | 0.304 | 1.25E-04 | 1.15E-08 |
| After iteration 8 | – | 4.97E-10 | 1.60E-05 |
| After iteration 16 | – | – | 1.00E+00 |