Julien Gobeill, Emilie Pasche, Dina Vishnyakova, Patrick Ruch.
Abstract
The available curated data lag behind current biological knowledge contained in the literature. Text mining can help biologists and curators locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator identify and encode functions. Popular text mining tools for GO classification rely on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and the GO terms themselves. Their effectiveness remains limited, however, owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing number of curated abstracts (97 000 in November 2012) to populate this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performance have evolved. Systems are evaluated on their ability to propose, for a given abstract, the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms to GO (+300%), the effectiveness of our thesaurus-based system has remained nearly constant, moving from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at assisting biologists.
DATABASE URL: http://eagl.unige.ch/GOCat/
Year: 2013 PMID: 23842461 PMCID: PMC3706742 DOI: 10.1093/database/bat041
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
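The machine learning idea described in the abstract (inferring a functional profile from already curated instances) can be illustrated with a minimal k-Nearest Neighbours sketch. Everything here is an assumption made for illustration: the bag-of-words cosine similarity, the similarity-weighted voting, the function names and the toy knowledge base texts are hypothetical and are not taken from the system evaluated in the study.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercased token counts; a deliberately naive text representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_go_profile(abstract, knowledge_base, k=3):
    """Rank GO ids by summed similarity over the k most similar curated texts.

    knowledge_base: list of (text, [go_id, ...]) pairs (hypothetical format).
    """
    query = bag_of_words(abstract)
    scored = sorted(
        ((cosine(query, bag_of_words(text)), go_ids) for text, go_ids in knowledge_base),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, go_ids in scored[:k]:
        for go_id in go_ids:
            votes[go_id] += similarity
    return sorted(votes, key=votes.get, reverse=True)


# Toy knowledge base: invented texts paired with GO ids, for illustration only.
toy_kb = [
    ("regulation of secondary shoot formation in mutant plants", ["GO:2000032"]),
    ("shoot development is regulated during growth", ["GO:0048831"]),
    ("anatomical structure morphogenesis in embryos", ["GO:0022603"]),
]
ranking = knn_go_profile("secondary shoot formation is altered in the mutant", toy_kb, k=2)
print(ranking)
```

With a real knowledge base, each entry would be a curated MEDLINE abstract paired with the GO terms assigned to it in GOA, so the ranking improves as more curated abstracts become available.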
Figure 1. Yearly distribution of annotations linked to a PMID in the GOA database for the five largest source providers (UniProtKB, MGI, FlyBase, Reactome, TAIR).
Example of a gene ontology descriptor
| [Term] | |
| GO_id: GO:2000032 | |
| name: regulation of secondary shoot formation | |
| namespace: biological_process | |
| def: ‘Any process that modulates the frequency, rate or extent of secondary shoot formation.’ | |
| synonym: ‘regulation of auxiliary shoot formation’ [EXACT] | |
| synonym: ‘regulation of auxillary shoot formation’ [EXACT] | |
| is_a: GO:0022603 ! regulation of anatomical structure morphogenesis | |
| is_a: GO:0048831 ! regulation of shoot development | |
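A thesaurus-based approach needs the name and synonyms of each term as matchable strings. The following is a minimal sketch of how such a descriptor could be read, assuming it is available as plain `key: value` lines; the function names `parse_term_stanza` and `surface_forms` are hypothetical, and this is not a full parser for the ontology file format.

```python
def parse_term_stanza(lines):
    """Collect the 'key: value' lines of one [Term] stanza into a dict of lists."""
    term = {}
    for line in lines:
        line = line.strip()
        if not line or line == "[Term]":
            continue
        key, _, value = line.partition(": ")
        # 'is_a' values carry a trailing '! label' comment; keep only the id.
        if key == "is_a":
            value = value.split("!")[0].strip()
        term.setdefault(key, []).append(value)
    return term


def surface_forms(term):
    """Name plus synonyms, with quotes and the trailing [SCOPE] tag removed."""
    forms = list(term.get("name", []))
    for synonym in term.get("synonym", []):
        text = synonym.rsplit("[", 1)[0].strip()
        forms.append(text.strip("'\"‘’"))
    return forms


stanza = [
    "[Term]",
    "GO_id: GO:2000032",
    "name: regulation of secondary shoot formation",
    "namespace: biological_process",
    "def: ‘Any process that modulates the frequency, rate or extent of secondary shoot formation.’",
    "synonym: ‘regulation of auxiliary shoot formation’ [EXACT]",
    "synonym: ‘regulation of auxillary shoot formation’ [EXACT]",
    "is_a: GO:0022603 ! regulation of anatomical structure morphogenesis",
    "is_a: GO:0048831 ! regulation of shoot development",
]
term = parse_term_stanza(stanza)
print(term["GO_id"], term["is_a"])
print(surface_forms(term))
```

The three surface forms returned for this term are the strings a dictionary matcher would look for in an abstract, which illustrates why long, compositional GO names rarely match free text directly.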
Evolution of the gene ontology since 2006
| Year | GO terms | Exact synonyms | All synonyms |
|---|---|---|---|
| 2006 | 19 356 | 14 156 | 17 585 |
| 2007 | 21 917 | 15 846 | 19 727 |
| 2008 | 24 634 | 43 859 | 55 691 |
| 2009 | 26 505 | 45 353 | 57 013 |
| 2010 | 29 290 | 46 702 | 59 592 |
| 2011 | 31 794 | 48 939 | 63 866 |
| 2012 | 34 113 | 52 354 | 68 896 |
| 2013 | 37 070 | 63 215 | 83 920 |
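The synonym growth quoted in the abstract (+300% since 2006) can be checked against these counts. The short calculation below is a sketch that assumes the figure refers to the 2006 to 2012 period; which column the authors used is not stated here, so both are computed.

```python
# Counts copied from the 2006 and 2012 rows of the table.
synonyms = {
    2006: {"exact": 14156, "all": 17585},
    2012: {"exact": 52354, "all": 68896},
}


def relative_growth(old, new):
    """Relative increase from old to new, as a percentage."""
    return 100.0 * (new - old) / old


exact_growth = relative_growth(synonyms[2006]["exact"], synonyms[2012]["exact"])
all_growth = relative_growth(synonyms[2006]["all"], synonyms[2012]["all"])
print(f"exact synonyms: +{exact_growth:.0f}%")  # exact synonyms: +270%
print(f"all synonyms: +{all_growth:.0f}%")      # all synonyms: +292%
```

Both values are close to the rounded +300% figure, with the "All synonyms" column the nearer of the two.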
Example of a GOA database entry
| Database: UniProtKB | |
| Gene id: A0AQW4 | |
| Gene Name: TCP12 | |
| GO id: GO:2000032 regulation of secondary shoot formation | |
| Evidence Code: Inferred from Mutant Phenotype (IMP) | |
| PMID:17307924 | |
| Date: 2010/08/23 |
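To serve as knowledge base instances, entries like this one have to be grouped by PMID so that each curated abstract is paired with its set of GO terms. A minimal sketch follows, assuming each entry has already been turned into a dictionary; the field names and the second record (with a placeholder PMID) are hypothetical.

```python
from collections import defaultdict


def group_by_pmid(records):
    """Map each PMID to the set of GO ids annotated with that publication."""
    profiles = defaultdict(set)
    for record in records:
        profiles[record["pmid"]].add(record["go_id"])
    return dict(profiles)


records = [
    # Taken from the example entry.
    {"database": "UniProtKB", "gene_id": "A0AQW4", "go_id": "GO:2000032", "pmid": "17307924"},
    # Hypothetical record with a placeholder PMID, for illustration only.
    {"database": "UniProtKB", "gene_id": "EXAMPLE", "go_id": "GO:0048831", "pmid": "0000000"},
]
profiles = group_by_pmid(records)
print(profiles["17307924"])
```

Each resulting PMID-to-terms pair corresponds to one curated abstract whose text can then be retrieved from MEDLINE.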
Evolution of the knowledge base since 2007, i.e. the number of GO assignments linked to a PMID in GOA. Values are as of January 1st of each year.
| Year | Instances |
|---|---|
| 2007 | 104 743 |
| 2008 | 127 037 |
| 2009 | 152 651 |
| 2010 | 179 713 |
| 2011 | 209 419 |
| 2012 | 244 632 |
| 2013 | 287 354 |
Figure 2. Performance evolution of both classifiers since 2006 for the task of assigning GO terms to a newly published abstract. Graph (a) presents Recall at 20; graph (b) presents Mean Reciprocal Rank.
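The two metrics can be sketched as follows, using their common information-retrieval definitions. This is an assumption: the exact averaging used in the study (for instance, per-abstract averaging of recall) is not specified here, and the function names are hypothetical.

```python
def recall_at_k(ranked, gold, k=20):
    """Fraction of the gold GO ids found among the top-k ranked predictions."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct prediction, or 0.0 if none is correct."""
    gold = set(gold)
    for rank, go_id in enumerate(ranked, start=1):
        if go_id in gold:
            return 1.0 / rank
    return 0.0


def mean_over_abstracts(metric, runs):
    """Average a per-abstract metric over (ranked, gold) pairs."""
    return sum(metric(ranked, gold) for ranked, gold in runs) / len(runs)


# Two toy abstracts: ranked predictions paired with the curated GO ids.
runs = [
    (["GO:0048831", "GO:2000032", "GO:0022603"], ["GO:2000032"]),
    (["GO:0022603", "GO:0048831"], ["GO:0022603", "GO:2000032"]),
]
print(mean_over_abstracts(recall_at_k, runs))      # 0.75
print(mean_over_abstracts(reciprocal_rank, runs))  # 0.75
```

Recall at 20 rewards finding the curated terms anywhere in the first 20 proposals, which suits a curator scanning a list, while Mean Reciprocal Rank rewards placing a correct term near the top.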
Current performance of both classifiers for the three GO axes on abstracts published in 2012, along with the number of concepts per axis in the ontology
| Axis | ML classifier MRR | ML classifier R20 | TB classifier MRR | TB classifier R20 | Number of GO concepts |
|---|---|---|---|---|---|
| Biological processes | 0.27 | 0.47 | 0.21 | 0.34 | 24 414 |
| Molecular functions | 0.32 | 0.60 | 0.08 | 0.18 | 9529 |
| Cellular components | 0.42 | 0.71 | 0.35 | 0.39 | 3127 |
| All terms | 0.45 | 0.56 | 0.24 | 0.32 | 37 070 |
Performance of both approaches on the BioCreative I test set
| Article section | ML approach MRR | ML approach R20 | TB approach MRR | TB approach R20 |
|---|---|---|---|---|
| Abstracts | 0.49 | 0.65 | 0.23 | 0.26 |
| Full texts | 0.46 | 0.61 | 0.13 | 0.15 |
| Introduction | 0.45 | 0.64 | 0.16 | 0.18 |
| Methods | 0.42 | 0.55 | 0.10 | 0.12 |
| Results & discussion | 0.45 | 0.60 | 0.14 | 0.17 |