Dongqing Zhu, Dingcheng Li, Ben Carterette, Hongfang Liu.
Abstract
This article describes our participation in the Gene Ontology Curation task (GO task) of BioCreative IV, where we took part in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data, supplemented with additional noisy negatives from an external resource. A greedy approach was then applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs at different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system.
Database URL: https://github.com/noname2020/Bioc
Year: 2014 PMID: 25183856 PMCID: PMC4150992 DOI: 10.1093/database/bau087
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Statistics of the data set for BioCreative IV Track 4 GO Task
| GO task data | Training set | Development set | Test set |
|---|---|---|---|
| Number of full-text articles | 100 | 50 | 50 |
| Number of genes | 300 | 171 | 194 |
| Number of gene-associated passages | 2234 | 1247 | 1681 |
| Number of GO terms | 954 | 575 | 644 |
Figure 1. Overview of System B1.
Figure 2. Overview of System B2.
Official evaluation results for subtask A using traditional P, R and F-Measure (F1)
| System | Overlap P | Overlap R | Overlap F1 | Exact P | Exact R | Exact F1 |
|---|---|---|---|---|---|---|
| A1 | 0.313 | 0.503 | 0.386 | 0.219 | 0.352 | 0.270 |
| A2 | 0.314 | 0.442 | 0.367 | 0.220 | 0.310 | 0.257 |
| A3 | 0.307 | 0.524 | 0.387 | 0.214 | 0.366 | 0.270 |
Both the strict exact-match measure and the relaxed overlap-match measure are considered.
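The F1 values in the subtask A table are the harmonic mean of the corresponding P and R values. As a minimal sketch (the function name `f1_score` and the dictionary layout are illustrative choices, not taken from the official scorer), the reported figures can be recomputed from the table rows:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the subtask A table:
# first the overlap-match pair, then the exact-match pair.
subtask_a = {
    "A1": [(0.313, 0.503), (0.219, 0.352)],
    "A2": [(0.314, 0.442), (0.220, 0.310)],
    "A3": [(0.307, 0.524), (0.214, 0.366)],
}

for system, pairs in subtask_a.items():
    overlap_f1, exact_f1 = (round(f1_score(p, r), 3) for p, r in pairs)
    print(system, overlap_f1, exact_f1)
# A1 0.386 0.27
# A2 0.367 0.257
# A3 0.387 0.27
```

The recomputed values agree with the F1 columns of the table to three decimal places.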
Official evaluation results for subtask B using traditional (flat) precision (P), recall (R) and F1-measure (F1) and hierarchical precision (hP), recall (hR) and F1-measure (hF1)
| System | Flat P | Flat R | Flat F1 | hP | hR | hF1 |
|---|---|---|---|---|---|---|
| B1 | 0.054 | 0.149 | 0.079 | 0.243 | 0.459 | 0.318 |
| B2 | 0.088 | 0.076 | 0.082 | 0.250 | 0.263 | 0.256 |
| B3 | 0.029 | 0.039 | 0.033 | 0.196 | 0.310 | 0.240 |
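The gap between the flat and hierarchical columns reflects partial credit for predictions that are close to the gold term in the ontology. The sketch below shows one common formulation of hierarchical measures, in which both the predicted and the gold term sets are extended with their ancestors before the overlap is counted. It is an assumption that this matches the official scorer in every detail (for example, in how the ontology root or per-gene aggregation is handled); the term names and the `parents` mapping are hypothetical toy data, not real GO identifiers.

```python
def with_ancestors(terms, parents):
    """Return the given terms plus every ancestor reachable via `parents`."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def prf(predicted, gold):
    """Flat precision, recall and F1 over two sets of terms."""
    overlap = len(predicted & gold)
    p = overlap / len(predicted) if predicted else 0.0
    r = overlap / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def hierarchical_prf(predicted, gold, parents):
    """Hierarchical P/R/F1: compare the ancestor-extended term sets."""
    return prf(with_ancestors(predicted, parents), with_ancestors(gold, parents))


# Hypothetical toy hierarchy: two sibling terms under one shared parent.
parents = {"protein_binding": ["binding"], "dna_binding": ["binding"]}
predicted = {"protein_binding"}
gold = {"dna_binding"}

print(prf(predicted, gold))                        # (0.0, 0.0, 0.0)
print(hierarchical_prf(predicted, gold, parents))  # (0.5, 0.5, 0.5)
```

In this toy case the prediction is a sibling of the gold term, so it scores zero under flat matching but earns partial credit hierarchically through the shared parent.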