Yuqing Mao, Kimberly Van Auken, Donghui Li, Cecilia N Arighi, Peter McQuilton, G Thomas Hayman, Susan Tweedie, Mary L Schaeffer, Stanley J F Laulederkind, Shur-Jen Wang, Julien Gobeill, Patrick Ruch, Anh Tuan Luu, Jung-Jae Kim, Jung-Hsien Chiang, Yu-De Chen, Chia-Jung Yang, Hongfang Liu, Dongqing Zhu, Yanpeng Li, Hong Yu, Ehsan Emadzadeh, Graciela Gonzalez, Jian-Ming Chen, Hong-Jie Dai, Zhiyong Lu.
Abstract
Gene ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and of opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade, although much progress is still needed for computer-assisted GO curation.
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation. DATABASE URL: http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
Year: 2014 PMID: 25157073 PMCID: PMC4142793 DOI: 10.1093/database/bau086
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Overall statistics of the BC4GO corpus
| Curated data | Training set | Development set | Test set |
|---|---|---|---|
| Full-text articles | 100 | 50 | 50 |
| Genes | 300 | 171 | 194 |
| Gene-associated passages | 2234 | 1247 | 1681 |
| Unique gene-associated GO terms | 954 | 575 | 644 |
Team results for Task A using traditional Precision (P), Recall (R) and F-measure (F1)
| Team | Run | Genes | Passages | Exact match P | Exact match R | Exact match F1 | Overlap P | Overlap R | Overlap F1 |
|---|---|---|---|---|---|---|---|---|---|
| 183 | 1 | 173 | 1042 | 0.206 | 0.128 | 0.158 | 0.344 | 0.213 | 0.263 |
| 183 | 2 | 173 | 1042 | 0.217 | 0.134 | 0.166 | | 0.220 | 0.271 |
| 183 | 3 | 173 | 1042 | 0.107 | 0.066 | 0.082 | 0.204 | 0.127 | 0.156 |
| 237 | 1 | 23 | 54 | 0.185 | 0.006 | 0.012 | 0.333 | 0.011 | 0.021 |
| 237 | 2 | 96 | 2755 | 0.103 | 0.171 | 0.129 | 0.214 | 0.351 | 0.266 |
| 237 | 3 | 171 | 3717 | 0.138 | 0.305 | 0.190 | 0.213 | 0.471 | 0.293 |
| 238 | 1 | 194 | 2698 | 0.219 | 0.352 | | 0.313 | 0.503 | 0.386 |
| 238 | 2 | 194 | 2362 | | 0.310 | 0.257 | 0.314 | 0.442 | 0.367 |
| 238 | 3 | 194 | 2866 | 0.214 | 0.366 | | 0.307 | 0.524 | |
| 250 | 1 | 161 | 3297 | 0.146 | 0.286 | 0.193 | 0.239 | 0.469 | 0.317 |
| 250 | 2 | 140 | 2848 | 0.153 | 0.259 | 0.193 | 0.258 | 0.437 | 0.325 |
| 250 | 3 | 161 | 3733 | 0.140 | 0.311 | 0.193 | 0.226 | 0.503 | 0.312 |
| 264 | 1 | 167 | 13 533 | 0.052 | | 0.093 | 0.088 | | 0.157 |
| 264 | 2 | 111 | 2243 | 0.037 | 0.049 | 0.042 | 0.077 | 0.103 | 0.088 |
| 264 | 3 | 111 | 2241 | 0.037 | 0.049 | 0.042 | 0.077 | 0.103 | 0.088 |
Both strict exact match and relaxed overlap measure are considered.
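The F1 columns follow from the P and R columns as their harmonic mean. The sketch below is illustrative only: the function name `f1_score` is an assumption and not part of the task's evaluation software. It recomputes two F1 values from the P and R values reported in the Task A table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Team 183, run 1, exact match: P = 0.206, R = 0.128 (reported F1 = 0.158).
exact_f1 = round(f1_score(0.206, 0.128), 3)

# Team 250, run 1, overlap: P = 0.239, R = 0.469 (reported F1 = 0.317).
overlap_f1 = round(f1_score(0.239, 0.469), 3)

print(exact_f1, overlap_f1)
```

Because the published values are rounded to three decimals, recomputed F1 scores can differ from the reported ones in the last digit for some rows.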
Team results for Task B using traditional Precision (P), Recall (R) and F1-measure (F1) and hierarchical precision (hP), recall (hR) and F1-measure (hF1)
| Team | Run | Genes | GO terms | Exact match P | Exact match R | Exact match F1 | Hierarchical match hP | Hierarchical match hR | Hierarchical match hF1 |
|---|---|---|---|---|---|---|---|---|---|
| 183 | 1 | 172 | 860 | | 0.157 | | 0.322 | 0.356 | |
| 183 | 2 | 172 | 1720 | 0.092 | 0.245 | | 0.247 | 0.513 | 0.334 |
| 183 | 3 | 172 | 3440 | 0.057 | | 0.096 | 0.178 | | 0.280 |
| 220 | 1 | 50 | 2639 | 0.018 | 0.075 | 0.029 | 0.064 | 0.190 | 0.096 |
| 220 | 2 | 46 | 1747 | 0.024 | 0.065 | 0.035 | 0.087 | 0.158 | 0.112 |
| 237 | 1 | 23 | 37 | 0.108 | 0.006 | 0.012 | | 0.020 | 0.039 |
| 237 | 2 | 96 | 2424 | 0.018 | 0.068 | 0.029 | 0.084 | 0.336 | 0.135 |
| 237 | 3 | 171 | 4631 | 0.037 | 0.264 | 0.064 | 0.150 | 0.588 | 0.240 |
| 238 | 1 | 194 | 1792 | 0.054 | 0.149 | 0.079 | 0.243 | 0.459 | 0.318 |
| 238 | 2 | 194 | 555 | 0.088 | 0.076 | 0.082 | 0.250 | 0.263 | 0.256 |
| 238 | 3 | 194 | 850 | 0.029 | 0.039 | 0.033 | 0.196 | 0.310 | 0.240 |
| 243 | 1 | 109 | 510 | 0.073 | 0.057 | 0.064 | 0.249 | 0.269 | 0.259 |
| 243 | 2 | 104 | 393 | 0.084 | 0.051 | 0.064 | 0.280 | 0.248 | 0.263 |
| 243 | 3 | 144 | 2538 | 0.030 | 0.116 | 0.047 | 0.130 | 0.477 | 0.204 |
| 250 | 1 | 171 | 1389 | 0.052 | 0.112 | 0.071 | 0.174 | 0.328 | 0.227 |
| 250 | 2 | 166 | 1893 | 0.049 | 0.143 | 0.073 | 0.128 | 0.374 | 0.191 |
| 250 | 3 | 132 | 453 | 0.095 | 0.067 | 0.078 | 0.284 | 0.161 | 0.206 |
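The hierarchical measures give partial credit when a predicted GO term is close to the gold term in the ontology. The exact formula is not given in this record, so the sketch below assumes a common formulation: gold and predicted term sets are each expanded with all their ancestors, and precision and recall are computed on the expanded sets, summed over genes. The toy hierarchy, the term names and the function names are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy hierarchy (child -> direct parents); not real GO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
}


def with_ancestors(terms: Set[str]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS.get(term, set()))
    return closed


def hierarchical_prf(
    gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Hierarchical precision, recall and F1, summed over all genes."""
    shared = pred_total = gold_total = 0
    for gene in gold.keys() | pred.keys():
        g = with_ancestors(gold.get(gene, set()))
        p = with_ancestors(pred.get(gene, set()))
        shared += len(g & p)
        pred_total += len(p)
        gold_total += len(g)
    hp = shared / pred_total if pred_total else 0.0
    hr = shared / gold_total if gold_total else 0.0
    hf1 = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf1


# The prediction is a sibling of the gold term: exact match would score zero.
gold = {"gene1": {"A1"}}
pred = {"gene1": {"A2"}}
hp, hr, hf1 = hierarchical_prf(gold, pred)
print(hp, hr, hf1)
```

In this toy case the expanded sets share two of their three terms (`A` and `root`), so all three hierarchical scores are 2/3. Whether the root term is counted is a design choice that varies between formulations.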
Figure 1. The classification of the false-positive (FP) sentences.