Julien Gobeill, Emilie Pasche, Dina Vishnyakova, Patrick Ruch.
Abstract
Curating gene function from the literature with Gene Ontology (GO) concepts is a particularly time-consuming task in genomics, and help from bioinformatics is in high demand to keep up with the flow of publications. In 2004, the first BioCreative challenge already included a task of automatic GO concept assignment from full text. At that time, results were judged to be far from the performance required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of a lack of training data. Ten years later, the available curation data have grown massively. In 2013, the BioCreative IV GO task revisited automatic GO assignment. For this task, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. Subtask A consisted of selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. Subtask B consisted of predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% hierarchical recall in the top 20 returned concepts. Contrary to previous competitions, machine learning this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. The observed performance is sufficient for use in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
DATABASE URL: http://eagl.unige.ch/GOCat4FT/.
Year: 2014 PMID: 25190367 PMCID: PMC4154439 DOI: 10.1093/database/bau088
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overall workflow of the BiTeM/SIBtex system for the BioCreative IV GO task. First (subtask A), given a full text and a protein name, the system extracts sentences relevant for GO curation. Then (subtask B), given these sentences, the system predicts relevant GO concepts for curation. For both subtasks, the system uses machine learning, relying on knowledge base (KB) resources built from the BioCreative training data and GOA.
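The concept-prediction step, which infers GO concepts from the curated instances most similar to an input text, can be illustrated with a minimal nearest-neighbour sketch. This is not the GOCat implementation: the toy knowledge base, the bag-of-words cosine similarity, the function names and the parameters `k` and `top_n` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Toy knowledge base of (curated text, GO concepts); entries are invented for illustration.
KB = [
    ("protein induces apoptosis in tumor cells", {"GO:0006915"}),
    ("caspase activation triggers apoptosis", {"GO:0006915"}),
    ("enzyme repairs damaged dna after irradiation", {"GO:0006281"}),
    ("kinase promotes cell proliferation", {"GO:0008283"}),
]


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_go(text, kb=KB, k=3, top_n=5):
    """Rank GO concepts by the summed similarity of the k nearest KB instances."""
    query = bag_of_words(text)
    scored = [(cosine(query, bag_of_words(t)), gos) for t, gos in kb]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    scores = Counter()
    for sim, gos in neighbours:
        for go in gos:
            scores[go] += sim
    # Keep only concepts supported by at least one neighbour with non-zero similarity.
    return [go for go, score in scores.most_common(top_n) if score > 0]


print(predict_go("the protein triggers apoptosis"))
```

A real system would use a much larger knowledge base and a proper retrieval engine; the sketch only shows how neighbour similarities can be turned into a ranked list of concepts.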
Official results of BiTeM/SIBtex for BioCreative IV subtask A with partial match metrics
| Precision | Recall | F1 | Training set for Naive Bayes |
|---|---|---|---|
| 0.344 | 0.213 | 0.263 | Official training set |
| | | | Official training and development set |
| 0.204 | 0.127 | 0.156 | GeneRIFs training set |
Given a paper and a gene name, the systems had to propose sentences relevant for GO curation. Precision is the proportion of proposed sentences that were correct, recall is the proportion of expected sentences that were proposed, and F1 is the harmonic mean of precision and recall. The first two runs were obtained with the official training data, and the third was obtained with the GeneRIFs training set designed by our group for this task. Best results for each metric are in bold.
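The three metrics defined above can be written as a short sketch. It assumes exact matching between proposed and expected sentence identifiers, whereas the official evaluation uses partial matches; the function names and sentence identifiers are hypothetical.

```python
def harmonic_mean(precision, recall):
    """F1 as the harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(proposed, expected):
    """Set-based precision, recall and F1 under exact matching (an assumption)."""
    proposed, expected = set(proposed), set(expected)
    correct = len(proposed & expected)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall, harmonic_mean(precision, recall)


# Hypothetical sentence identifiers: 4 proposed, 3 expected, 2 in common.
p, r, f1 = precision_recall_f1({"s1", "s2", "s3", "s4"}, {"s2", "s4", "s7"})
print(p, r, f1)

# Consistency check against the first run in the table (P = 0.344, R = 0.213).
print(round(harmonic_mean(0.344, 0.213), 3))
```

The last line reproduces the F1 of 0.263 reported for the run trained on the official training set.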
Figure 2. Official results of all competing systems for BioCreative IV subtask A, with partial match metrics. BiTeM/SIBtex results are in orange.
Results for subtask B, computed with the official evaluation script, with standard or hierarchical metrics (14)
| Metrics | Precision | Recall | F1 | Number of GO concepts returned |
|---|---|---|---|---|
| Standard | 0.117 | 0.157 | 0.134 | 5 |
| Hierarchical | | 0.356 | | 5 |
| Standard | 0.092 | 0.245 | 0.134 | 10 |
| Hierarchical | 0.248 | 0.513 | 0.334 | 10 |
| Standard | 0.057 | 0.306 | 0.096 | 20 |
| Hierarchical | 0.179 | | 0.280 | 20 |
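The gap between standard and hierarchical scores can be illustrated with a small sketch. It assumes one common formulation of hierarchical precision and recall, in which predicted and gold concept sets are each extended with all their ancestors before the overlap is counted; the official evaluation script may differ in its details. The toy ontology and its concept labels are hypothetical, not real GO identifiers.

```python
# Toy ontology as child -> parents; labels are hypothetical, not real GO identifiers.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
}


def with_ancestors(concepts, parents):
    """Return the concepts together with all of their ancestors."""
    closed, stack = set(), list(concepts)
    while stack:
        concept = stack.pop()
        if concept not in closed:
            closed.add(concept)
            stack.extend(parents.get(concept, ()))
    return closed


def hierarchical_pr(predicted, gold, parents):
    """Precision and recall computed on ancestor-extended concept sets."""
    pred_closed = with_ancestors(predicted, parents)
    gold_closed = with_ancestors(gold, parents)
    overlap = len(pred_closed & gold_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(gold_closed) if gold_closed else 0.0
    return precision, recall


# Predicting a sibling of the gold concept scores zero with standard metrics,
# but earns partial credit here through the shared ancestors "A" and "root".
print(hierarchical_pr({"A1"}, {"A2"}, PARENTS))
```

Under this formulation a near miss in the ontology is rewarded, which is consistent with the hierarchical scores in the table being higher than the standard ones.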
Figure 3. Official results of all competing systems for BioCreative IV subtask B with strict metrics. BiTeM/SIBtex results are in orange.