Yanpeng Li, Hong Yu.
Abstract
Gene Ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck in database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on narrative sentences in the biomedical literature. This article presents our work on this task as well as the experimental results obtained after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using the reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained F1 scores of 22.1% and 35.7% by incorporating bigram features in RDE learning. On both the development and test sets, the RDE-based method achieved over 20% relative improvement in F1 and AUC over classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method that retrieves the GO term most relevant to each evidence sentence using a ranking function combining cosine similarity and the frequency of GO terms in documents, together with a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that incorporating frequency information and hierarchy filtering substantially improved performance. In the post-submission evaluation, we obtained 10.6% F1 using a simpler setting. Overall, the experimental analysis showed that our approaches were robust in both tasks.
Year: 2014 PMID: 25425037 PMCID: PMC4243380 DOI: 10.1093/database/bau113
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The framework of the GO annotation system.
Corpus statistics of the binary classification task
| | Training data | Development data | Test |
|---|---|---|---|
| Number of positive examples | 965 | 665 | 5494 |
| Number of negative examples | 4255 | 2400 | |
The number of different types of features for the evidence sentence classification task
| | Bag-of-words from sentence | Bag-of-words from sentence and paragraph | Bag-of-bigrams from sentence | Bag-of-bigrams from sentence and paragraph |
|---|---|---|---|---|
| Original lexical features | 65 538 | 92 408 | 176 921 | 347 123 |
| Features from RDEs | 200 | 200 | 200 | 200 |
The first row gives the number of original lexical features extracted from the labeled data. The second row gives the size of the final feature set derived from the 200 RDEs.
Figure 2. An example of RDE-based feature transformation for GO evidence sentence classification. S1 and S2 are two sentences. The example shows part of the original Boolean features, the reference features and the new features generated by RDE semi-supervised learning.
Figure 3. Distribution of GO terms appearing in biomedical literature.
Method description of submitted runs
| Subtask | Run ID | Method description |
|---|---|---|
| 1 | Run 1 | RDE, 110 reference features, Logistic Regression, classification threshold = 0.16 |
| 1 | Run 2 | RDE, 110 reference features, Logistic Regression, classification threshold = 0.18 |
| 1 | Run 3 | RDE, 110 reference features, Logistic Regression, classification threshold = 0.14 |
| 2 | Run 1 | GORank, hierarchy filtering, GO terms with a count over 2000 in the GOA database for ranking, classification threshold (Subtask 1) = 0, filtering threshold = 6 |
| 2 | Run 2 | GORank, hierarchy filtering, GO terms with a count over 500 in the GOA database for ranking, classification threshold (Subtask 1) = 0, filtering threshold = 8 |
| 2 | Run 3 | GORank, hierarchy filtering, GO terms with a count over 2000 in the GOA database for ranking, classification threshold (Subtask 1) = 0.16, filtering threshold = 2 |
In the table, ‘classification threshold’ is the threshold of the logistic regression classifier with 110 RDE features. The ‘filtering threshold’ is the number n of high-level GO classes most relevant to the sentence, as determined by the classifiers. If the GO term ranked highest by GORank belongs to one of these n classes, it is selected as a positive result.
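The decision rule described in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_go_term`, the dictionary-based inputs and the toy scores are hypothetical, each candidate GO term is assumed to be already mapped to one high-level GO class, and a sentence is assumed to be rejected when its score falls strictly below the classification threshold.

```python
from typing import Dict, List, Optional, Tuple


def select_go_term(
    sentence_score: float,
    ranked_terms: List[Tuple[str, float]],
    term_to_class: Dict[str, str],
    class_scores: Dict[str, float],
    classification_threshold: float,
    filtering_threshold: int,
) -> Optional[str]:
    """Return the top-ranked GO term if it passes both thresholds, else None."""
    # Step 1: drop sentences scored below the Subtask 1 classification threshold
    # (assumption: strict "<" rejects, so a threshold of 0 keeps every sentence).
    if sentence_score < classification_threshold:
        return None
    if not ranked_terms:
        return None

    # Step 2: take the GO term with the highest ranking score.
    top_term = max(ranked_terms, key=lambda pair: pair[1])[0]

    # Step 3: keep the n most relevant high-level GO classes for this sentence.
    top_classes = sorted(class_scores, key=class_scores.get, reverse=True)
    top_classes = top_classes[:filtering_threshold]

    # Step 4: accept the term only if its high-level class is among those n.
    if term_to_class.get(top_term) in top_classes:
        return top_term
    return None


# Toy inputs (hypothetical identifiers and scores).
ranked = [("term_a", 0.9), ("term_b", 0.4)]
classes = {"term_a": "class_1", "term_b": "class_2"}
class_scores = {"class_1": 0.5, "class_2": 0.8, "class_3": 0.1}

accepted = select_go_term(0.5, ranked, classes, class_scores, 0.16, 2)
rejected_by_filter = select_go_term(0.5, ranked, classes, class_scores, 0.16, 1)
rejected_by_score = select_go_term(0.1, ranked, classes, class_scores, 0.16, 2)
```

With these toy values, the top term is kept when n = 2, filtered out when n = 1 (its class is only the second most relevant), and never considered when the sentence score is below 0.16.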
Comparison of different methods on test set of Subtask 1
| ID | Method | Precision (exact) (%) | Recall (exact) (%) | F1 (exact) (%) | Precision (relaxed) (%) | Recall (relaxed) (%) | F1 (relaxed) (%) |
|---|---|---|---|---|---|---|---|
| 1 | NER, no classifier (baseline) | 9 | | 14.7 | 15.2 | | 24.6 |
| 2 | SVM (words) | 11.1 | 36.3 | 17 | 18.4 | 60.3 | 28.2 |
| 3 | Logistic (words) | 11.8 | 33 | 17.4 | 19.4 | 54.3 | 28.6 |
| 4 | SuRDE (words) | 12.8 | 32.6 | 18.4 | 20.4 | 51.9 | 29.3 |
| 5 | SeRDE (Run 1) | 14.6 | 28.6 | 19.3 | 23.9 | 46.9 | 31.7 |
| 6 | SeRDE (Run 2, our best submission) | 15.3 | 25.9 | 19.3 (+31.3%) | 25.8 | 43.7 | 32.5 (+32.1%) |
| 7 | SeRDE (Run 3) | 14 | 31.1 | 19.3 | 22.6 | 50.3 | 31.2 |
| 8 | SeRDE (200 refs, words) | 16.7 | 24.5 | 19.9 | 27.7 | 40.6 | 32.9 |
| 9 | SeRDE (200 refs, bigrams) | 17.1 | 23.6 | 19.8 | 27.5 | 38 | 31.9 |
| 10 | 8+9 | 18.3 | 24.3 | 20.9 | 29.8 | 39.7 | 34.1 |
| 11 | 3+8+9 | 27 | 43.7 |
‘NER, no classifier’ is the method that treats all gene sentences as evidence sentences. SuRDE and SeRDE are the supervised and semi-supervised RDEs defined in (22). All classifiers were trained on the labeled examples in the training and development sets in Table 1. Logistic regression was used to integrate RDE features in Methods 5 to 8. Random forest was used in Method 9. The ensemble Method 10 (8+9) used the mean of the decision scores of the individual classifiers (Methods 8 and 9) as the combination score. Method 11 combined Methods 3, 8 and 9 in the same way.
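The score-averaging ensemble mentioned in this note can be illustrated with a short sketch. The function names, sentence identifiers, scores and the 0.16 cut-off used below are hypothetical, and the sketch assumes that the individual classifiers output decision scores on a comparable scale (for example, probabilities), which the note does not state.

```python
from statistics import mean
from typing import Dict, List


def ensemble_scores(per_classifier: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the decision scores of several classifiers for each sentence."""
    sentence_ids = per_classifier[0].keys()
    return {
        sid: mean(scores[sid] for scores in per_classifier)
        for sid in sentence_ids
    }


def classify(scores: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """Label a sentence as evidence when its combined score reaches the threshold."""
    return {sid: score >= threshold for sid, score in scores.items()}


# Toy decision scores from two classifiers (hypothetical values).
word_model = {"s1": 0.30, "s2": 0.10}
bigram_model = {"s1": 0.10, "s2": 0.05}

combined = ensemble_scores([word_model, bigram_model])
labels = classify(combined, threshold=0.16)
```

Adding a third score dictionary to the list gives the three-way combination in the same way.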
Comparison of different methods on development set of Subtask 1
| | F1 (binary) (%) | AUC (binary) (%) | F1 (exact) (%) | F1 (relaxed) (%) |
|---|---|---|---|---|
| NER, no classifier | - | - | 14.6 | 22.8 |
| SVM (baseline) | 38.4 | 62 | 14.9 | 23.4 |
| Logistic | 36 | 61 | 15.4 | 23.7 |
| SuRDE | 45.2 | 71 | 17.9 | 27.4 |
| SeRDE (200 refs, words) | 49.2 | 74.6 | 18.7 | 29.6 |
| SeRDE (200 refs, bigrams) | 48.8 | 74.2 | 18.5 | 29.7 |
| SeRDE (200 refs, words + bigrams) | 50.2 | 76.5 | 19.2 | 30.7 |
F1 (exact) and F1 (relaxed) are the official evaluation measures. The F1 (binary) and AUC (binary) are the performance on the binary sentence classification task defined in ‘Method’ and Table 1.
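For readers checking the reported figures, the sketch below computes F1 as the harmonic mean of precision and recall, and a relative improvement over a baseline. The function names are illustrative; the input values are taken from the test-set comparison for Subtask 1, and treating 14.7 and 24.6 as the baseline F1 values is an assumption that reproduces the +31.3% and +32.1% shown there.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, e.g. %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# SVM (words), exact match: precision 11.1, recall 36.3.
svm_f1 = round(f1_score(11.1, 36.3), 1)            # 17.0

# Best submission versus the assumed baseline F1 values.
gain_exact = round(relative_improvement(19.3, 14.7), 1)    # 31.3
gain_relaxed = round(relative_improvement(32.5, 24.6), 1)  # 32.1
```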
Comparison of different features and classifiers on test set
| Classifier for RDE features | Original features | F1 (exact) (%) | F1 (relaxed) (%) |
|---|---|---|---|
| Logistic | Sentence, words | 18.8 | 31.1 |
| Random Forest | Sentence, words | 19.3 | 32.6 |
| Logistic | Sentence, bigrams | 19.2 | 31 |
| Random Forest | Sentence, bigrams | 19.5 | 32.8 |
| Logistic | Sentence + Paragraph, words | | |
| Random Forest | Sentence + Paragraph, words | 19.4 | 32.4 |
| Logistic | Sentence + Paragraph, bigrams | 19.6 | 30.6 |
| Random Forest | Sentence + Paragraph, bigrams | 19.8 | 31.9 |
Figure 4. The relation between the number of reference features and F1 on Subtask 1. Only the unigram word features were considered in the experiment; the classifier for RDE features is logistic regression.
Figure 5. Performance as a function of the amount of unlabeled data. The reference features are the bound-based reference features in Section 2.3.2 and Figure 4. The classifiers for RDE features are logistic regression (for unigrams) and random forest (for bigrams).
Performance of different methods on the test set of Subtask 2
| Method | Precision (exact) | Recall (exact) | F1 (exact) | Precision (hierarchy) | Recall (hierarchy) | F1 (hierarchy) |
|---|---|---|---|---|---|---|
| Indri (baseline) | 1% | 3% | 1.5% | 9.9% | 33.1% | 15.2% |
| Indri + definition | 0.8% | 3% | 1.3% | 8.5% | 34.7% | 13.7% |
| Cosine | 2.4% | 7.6% | 3.6% | 7.2% | 12.2% | |
| GORank | 5.9% | 8.4% | 13.5% | 31.8% | 19% | |
| GORank + hierarchy | | | 10.6% | 21.6% | 21.2% | 21.4% |
| Cosine + Frequency | 4.6% | 9.8% | 6.2% | 15.1% | 28.4% | 19.7% |
| GORank + frequency | 5.5% | 10.7% | 7.3% | 17.4% | 27.5% | 21.3% |
| GORank + frequency + hierarchy (Run 3) | 9.5% | 6.7% | 7.8% | 16.1% | 20.4% | |
| GORank + frequency + hierarchy (Run 1) | 5.2% | 11.2% | 7.1% | 17% | 32% | 22.2% |
| GORank + frequency + hierarchy (Run 2) | 4.9% | 14.3% | 7.3% | 12.7% | 36.8% | 18.8% |
‘Indri’ is a language model-based method (23). ‘Definition’ means appending the definition of each GO term to expand its text representation. ‘Cosine’ is the similarity function in the first part of Formula (2). ‘Frequency’ means limiting the GO vocabulary to high-frequency GO terms (Table 3). ‘Hierarchy’ is the filtering based on high-level GO classes.
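To make the ‘Cosine’ and ‘Frequency’ settings concrete, the sketch below ranks candidate terms by a generic bag-of-words cosine similarity after restricting the vocabulary to terms whose annotation count exceeds a cut-off. Formula (2) is not reproduced in this record, so this is a textbook cosine and not the authors' ranking function; the term identifiers, term names, counts and the sentence are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_terms(
    sentence: str,
    term_names: Dict[str, str],
    term_counts: Dict[str, int],
    min_count: int,
) -> List[Tuple[str, float]]:
    """Rank terms with count over `min_count` by similarity to the sentence."""
    sentence_vec = Counter(sentence.lower().split())
    # 'Frequency' setting: keep only terms annotated more than min_count times.
    candidates = [t for t in term_names if term_counts.get(t, 0) > min_count]
    scored = [
        (t, cosine(sentence_vec, Counter(term_names[t].lower().split())))
        for t in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vocabulary (hypothetical identifiers, names and counts).
names = {"T1": "protein binding", "T2": "nucleus", "T3": "rare process binding"}
counts = {"T1": 5000, "T2": 3000, "T3": 10}

ranking = rank_terms(
    "the protein shows binding activity in the nucleus", names, counts, 2000
)
```

Here `T3` is excluded by the frequency cut-off before any similarity is computed, and `T1` ranks above `T2` because it shares two tokens with the sentence.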
Performance of different methods on the development set of Subtask 2
| Method | F1 (exact) | F1 (hierarchy) |
|---|---|---|
| Indri (baseline) | 1.3% | 11.8% |
| GORank | 5.9% | |
| GORank + hierarchy | 6.6% | 16% |
| GORank + frequency + hierarchy (Run 3) | 5.9% | 12.7% |
| GORank + frequency + hierarchy (Run 1) | 6.9% | 16.3% |
| GORank + frequency + hierarchy (Run 2) | | 16.4% |
Performance analysis via incorporation of gold standard in different steps
| | Analysis method | F1 (exact, Task 1) (%) | F1 (relaxed, Task 1) (%) | F1 (exact, Task 2) (%) | F1 (hierarchy, Task 2) (%) |
|---|---|---|---|---|---|
| 1 | Baseline, Result 11 in | 22.1 | 35.7 | 10.6 | 21.4 |
| 2 | Add the gold standard evidence sentences to the gene sentences to be classified | 31.8 | 46.2 | 12.5 | 24.9 |
| 3 | Based on Result 2, replace all the gene IDs by the same ID for Subtask 1 | 36 | 53.4 | 12.5 | 24.9 |
| 4 | Use the gold standard of Subtask 1 as the input of Subtask 2 | 100 | 100 | 19.6 | 33.1 |
| 5 | Replace the final result by the gold standard of Subtask 2 only for the terms with the frequency over 2000 in GO annotation databases | 100 | 100 | 61.2 | 65.4 |