Kimberly Van Auken, Mary L Schaeffer, Peter McQuilton, Stanley J F Laulederkind, Donghui Li, Shur-Jen Wang, G Thomas Hayman, Susan Tweedie, Cecilia N Arighi, James Done, Hans-Michael Müller, Paul W Sternberg, Yuqing Mao, Chih-Hsuan Wei, Zhiyong Lu.
Abstract
Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database (MOD) groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered accuracy comparable to that of human curators. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ~10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need to use full-text articles when text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
Database URL: http://www.biocreative.org/resources/corpora/bc-iv-go-task-corpus/
Published by Oxford University Press 2014.
This work is written by US Government employees and is in the public domain in the US.
Year: 2014 PMID: 25070993 PMCID: PMC4112614 DOI: 10.1093/database/bau074
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
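The strict and relaxed F1 agreement figures quoted in the abstract can be illustrated with a small sketch. It assumes that "strict" means two curators selected exactly the same text span and "relaxed" means their spans overlap by at least one character; the abstract does not spell out these definitions, so treat them as assumptions. The function names and the example offsets are hypothetical and are not taken from the corpus.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start offset, end offset), end exclusive


def exact(a: Span, b: Span) -> bool:
    """Strict criterion (assumed): both curators chose the identical span."""
    return a == b


def overlaps(a: Span, b: Span) -> bool:
    """Relaxed criterion (assumed): the spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def pairwise_f1(ann_a: List[Span], ann_b: List[Span],
                match: Callable[[Span, Span], bool]) -> float:
    """F1 between two curators, treating A as reference and B as prediction."""
    if not ann_a or not ann_b:
        return 0.0
    # Precision: share of B's spans that match some span from A.
    precision = sum(any(match(b, a) for a in ann_a) for b in ann_b) / len(ann_b)
    # Recall: share of A's spans that match some span from B.
    recall = sum(any(match(a, b) for b in ann_b) for a in ann_a) / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical character offsets for two curators on the same article.
curator_a = [(0, 50), (120, 200), (300, 360)]
curator_b = [(0, 50), (150, 210)]

strict_f1 = pairwise_f1(curator_a, curator_b, exact)
relaxed_f1 = pairwise_f1(curator_a, curator_b, overlaps)
print(f"strict F1 = {strict_f1:.2f}, relaxed F1 = {relaxed_f1:.2f}")
```

Under these assumed definitions the relaxed score can never be lower than the strict one, which is consistent with the gap between the two sentence-level IAA values reported above.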
Figure 1. Screenshot of the GO annotation tool. When a line or more of text is highlighted, a pop-up window appears where annotation data are entered.
Figure 2. A sample of GO annotation in BioC format.
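Since the figure itself is not reproduced here, the following is a minimal sketch of how a GO annotation stored in BioC XML might be read. The element layout (`collection`, `document`, `passage`, `annotation`, `infon`, `location`, `text`) follows the general BioC format; the infon keys (`gene`, `go-term`, `goevidence`), the gene name, the document id and the sample sentence are illustrative assumptions and may differ from what the released corpus uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC fragment; the infon keys and values are assumptions.
SAMPLE = """<collection>
  <source>hypothetical</source>
  <document>
    <id>0000000</id>
    <passage>
      <offset>0</offset>
      <text>GeneX localizes to the nucleus in embryos.</text>
      <annotation id="1">
        <infon key="gene">GeneX</infon>
        <infon key="go-term">nucleus|GO:0005634</infon>
        <infon key="goevidence">IDA</infon>
        <location offset="0" length="42"/>
        <text>GeneX localizes to the nucleus in embryos.</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return one dict per annotation with the three core GO elements."""
    root = ET.fromstring(xml_text)
    records = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for ann in doc.iter("annotation"):
            # Collect the key/value infons attached to this annotation.
            infons = {i.get("key"): i.text for i in ann.findall("infon")}
            loc = ann.find("location")
            records.append({
                "doc_id": doc_id,
                "gene": infons.get("gene"),
                "go_term": infons.get("go-term"),
                "evidence_code": infons.get("goevidence"),
                "offset": int(loc.get("offset")),
                "length": int(loc.get("length")),
                "text": ann.findtext("text"),
            })
    return records


records = read_annotations(SAMPLE)
for rec in records:
    print(rec["gene"], rec["go_term"], rec["evidence_code"])
```

Each record pairs the evidence text with a gene, a GO term and an evidence code, mirroring the three core elements of a GO annotation listed in the abstract.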
Number of curated articles per MOD
| Data set | FlyBase | MaizeGDB | RGD | TAIR | WormBase | Total |
|---|---|---|---|---|---|---|
| Training set | 19 | 21 | 43 | 10 | 7 | 100 |
| Development set | 8 | 5 | 25 | 4 | 8 | 50 |
| Test set | 12 | 4 | 20 | 7 | 7 | 50 |
| Subtotal per team | 39 | 30 | 88 | 21 | 22 | 200 |
Overall statistics of the annotated corpus grouped by data sets
| Data set | Articles | Genes (unique) | GO terms (unique) | Evidence text passages (w.r.t. GO) | Evidence text passages (w.r.t. gene) | Evidence text passages (unique) |
|---|---|---|---|---|---|---|
| Training set | 100 | 316 | 611 | 2440 | 2478 | 1858 |
| Development set | 50 | 171 | 367 | 1302 | 1238 | 964 |
| Test set | 50 | 194 | 378 | 1763 | 1677 | 1253 |
| Total | 200 | 681 | 1356 | 5505 | 5393 | 4075 |
Overall statistics of the annotated corpus grouped by MODs
| MOD | Articles | Genes (unique) | GO terms (unique) | Evidence text passages (w.r.t. GO) | Evidence text passages (w.r.t. gene) | Evidence text passages (unique) |
|---|---|---|---|---|---|---|
| FlyBase | 39 | 140 | 267 | 1106 | 1106 | 881 |
| MaizeGDB | 30 | 85 | 193 | 664 | 595 | 492 |
| RGD | 88 | 236 | 369 | 1199 | 1223 | 946 |
| TAIR | 21 | 63 | 125 | 453 | 544 | 379 |
| WormBase | 22 | 157 | 402 | 2083 | 1925 | 1377 |
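As a quick consistency check on the per-MOD figures, the sketch below sums the article and evidence-passage columns and compares them with the corpus-wide totals reported in the data-set table (200 articles; 5505, 5393 and 4075 passages). It also derives the number of unique evidence passages per article for each MOD. The variable and function names are hypothetical; the numbers are copied from the tables.

```python
# MOD -> (articles, passages w.r.t. GO, passages w.r.t. gene, unique passages)
BY_MOD = {
    "FlyBase": (39, 1106, 1106, 881),
    "MaizeGDB": (30, 664, 595, 492),
    "RGD": (88, 1199, 1223, 946),
    "TAIR": (21, 453, 544, 379),
    "WormBase": (22, 2083, 1925, 1377),
}

# Totals from the table grouped by data set, same column order as above.
CORPUS_TOTAL = (200, 5505, 5393, 4075)


def column_sums(rows):
    """Sum each column across all MODs."""
    return tuple(sum(col) for col in zip(*rows.values()))


def unique_passages_per_article(rows):
    """Unique evidence passages divided by article count, per MOD."""
    return {mod: vals[3] / vals[0] for mod, vals in rows.items()}


sums = column_sums(BY_MOD)
density = unique_passages_per_article(BY_MOD)

print("column sums match corpus totals:", sums == CORPUS_TOTAL)
for mod, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mod}: {value:.1f} unique passages per article")
```

The per-article ratio is only a derived convenience figure; it is not reported in the source and says nothing about why annotation density differs between MODs.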
Figure 3. The proportion of annotated evidence text in different parts of the article.
Figure 4. The proportion of GO terms appearing in different parts of the article.