Antonio J Jimeno-Yepes, J Caitlin Sticco, James G Mork, Alan R Aronson.
Abstract
BACKGROUND: A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and our work aims to provide methods to support the creation of GeneRIF entries. Creating a GeneRIF entry involves identifying the genes mentioned in MEDLINE® citations and the sentences describing a novel function.
Year: 2013 PMID: 23725347 PMCID: PMC3687823 DOI: 10.1186/1471-2105-14-171
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1 Citation filtering pipeline. The figure shows the citation filtering flow from all of MEDLINE to the training and testing sets.
GeneRIF sentence distribution
| Data set | Total | Positives | Negatives |
|---|---|---|---|
| Training | 1987 | 829 (41.72%) | 1158 (58.28%) |
| Testing | 999 | 433 (43.34%) | 566 (56.66%) |
Figure 2 GeneRIF and sentence distribution. The y-axis denotes the number of GeneRIF sentences while the x-axis denotes the position of the sentence in the title and abstract of the citations. Position 1 is the title, position 2 is the first sentence of the abstract, and subsequent sentences increase by one. The negative numbers at the end of the x-axis denote the position of the sentence counted from the end of the abstract; position -1 is the last sentence of the abstract.
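The two position encodings described in this caption (used below as the pos and posf features) can be sketched as follows; the function name and dictionary layout are illustrative assumptions, not the authors' implementation.

```python
def position_features(title, abstract_sentences):
    """Position features as in Figure 2: the title is position 1,
    the first abstract sentence is position 2, and positions counted
    from the end of the abstract are negative (-1 = last sentence)."""
    n = len(abstract_sentences)
    # The title has no position relative to the end of the abstract.
    feats = [{"sentence": title, "pos": 1, "posf": None}]
    for i, sent in enumerate(abstract_sentences):
        feats.append({
            "sentence": sent,
            "pos": i + 2,   # counted from the beginning; abstract starts at 2
            "posf": i - n,  # counted from the end; last sentence = -1
        })
    return feats
```

With a three-sentence abstract, the first abstract sentence gets pos = 2 and posf = -3, and the last gets pos = 4 and posf = -1.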
GeneRIF prediction results
| Features | NB p | NB r | NB f | SVM p | SVM r | SVM f | ABM1 p | ABM1 r | ABM1 f |
|---|---|---|---|---|---|---|---|---|---|
| pos | 0.6052 | 0.3256 | 0.4234 | 0.6052 | 0.3256 | 0.4234 | 0.6594 | 0.7691 | 0.7100 |
| posf | 0.6705 | 0.5358 | 0.5956 | 0.6798 | 0.5196 | 0.5890 | 0.7218 | 0.7252 | 0.7235 |
| text | 0.5941 | 0.6051 | 0.5995 | 0.6322 | 0.6351 | 0.6336 | 0.0762 | 0.1395 | |
| gene | 0.5533 | 0.6952 | 0.6162 | 0.5533 | 0.6952 | 0.6162 | 0.5533 | 0.6952 | 0.6162 |
| dis | 0.6960 | 0.8037 | 0.7460 | 0.6755 | 0.8268 | 0.7435 | 0.7284 | 0.6628 | 0.6941 |
| posf + dis | 0.6974 | 0.8568 | 0.7689 | 0.6755 | 0.8268 | 0.7435 | 0.7323 | 0.7390 | 0.7356 |
| posf + dis + gene | 0.6996 | 0.8337 | 0.7608 | 0.6976 | 0.8152 | 0.7519 | 0.7875 | 0.7275 | 0.7563 |
| posf + dis + go | 0.6972 | 0.8614 | | 0.6755 | 0.8268 | 0.7435 | 0.7323 | 0.7390 | 0.7356 |
| posf + dis + go + text | 0.6751 | 0.7968 | 0.7309 | 0.7282 | 0.6559 | 0.6902 | 0.7342 | 0.7529 | 0.7434 |
| disg | 0.6061 | 0.7440 | | 0.6061 | 0.7440 | | 0.6061 | 0.7440 | |
| posf + disg | 0.6798 | 0.7552 | 0.7155 | 0.7250 | 0.7667 | 0.7452 | 0.7259 | 0.7644 | 0.7447 |
| posf + disg + gene | 0.7047 | 0.7991 | 0.7489 | 0.7886 | 0.7321 | 0.7593 | 0.7810 | 0.6836 | 0.7291 |
| posf + disg + go | 0.6708 | 0.7575 | 0.7115 | 0.7249 | 0.7667 | 0.7452 | 0.7259 | 0.7644 | 0.7447 |
| posf + disg + go + text | 0.6759 | 0.7321 | 0.7029 | 0.7802 | 0.6559 | 0.7127 | 0.7393 | 0.7206 | 0.7298 |
The results are shown for each feature, or combination of features, on the test set after training a Naïve Bayes (NB), Support Vector Machine (SVM) or AdaBoostM1 (ABM1) classifier. For each feature or feature combination, the precision (p), recall (r) and F-measure (f) are shown. The individual features are: sentence position from the beginning of the abstract (pos), sentence position from the end of the abstract (posf), text features from the sentence (text), gene mention and normalization (gene), discourse features predicted by the AdaBoostM1 classifier (dis), discourse features predicted by the CRF model (disg) and the Gene Ontology score (go).
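The f columns above are the harmonic mean of the corresponding p and r columns; a quick check (plain arithmetic, not the authors' code) reproduces, for example, the AdaBoostM1 entry for posf + dis + gene:

```python
def f_measure(p, r):
    """F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# AdaBoostM1, posf + dis + gene: p = 0.7875, r = 0.7275
print(round(f_measure(0.7875, 0.7275), 4))  # 0.7563, matching the f column
```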
GeneRIF data set discourse prediction
| Label | Positives | TP | FP | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| Background | 258 | 180 | 78 | 0.7692 | 0.6977 | 0.7317 |
| Conclusions | 165 | 108 | 57 | 0.6316 | 0.6545 | 0.6429 |
| Methods | 179 | 105 | 74 | 0.5412 | 0.5866 | 0.5630 |
| Purpose | 36 | 24 | 12 | 0.3750 | 0.6667 | 0.4800 |
| Results | 260 | 163 | 97 | 0.6417 | 0.6269 | 0.6342 |
Results of a classifier trained on the discourse labels annotated in our data set. For each label we show the number of instances in the data set (Positives), the number of True Positives (TP), the number of False Positives (FP), and the precision, recall and F-measure values.
Structured abstracts discourse label prediction based on an AdaBoostM1 model
| Label | Positives | TP | FP | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| Background | 18875 | 11045 | 8820 | 0.5560 | 0.5852 | 0.5702 |
| Conclusions | 53396 | 37402 | 12844 | 0.7444 | 0.7005 | 0.7218 |
| Methods | 85764 | 69003 | 21382 | 0.7634 | 0.8046 | 0.7835 |
| Objective | 26425 | 19237 | 7883 | 0.7093 | 0.7280 | 0.7185 |
| Results | 117546 | 93250 | 29424 | 0.7601 | 0.7933 | 0.7764 |
Results of an AdaBoostM1 classifier trained on structured abstracts. For each label we show the number of instances in the data set (Positives), the number of True Positives (TP), the number of False Positives (FP), and the precision, recall and F-measure values.
Structured abstracts discourse label prediction based on a CRF model
| Label | Positives | TP | FP | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| Background | 6161 | 4154 | 2259 | 0.6477 | 0.6742 | 0.6607 |
| Conclusions | 10126 | 8455 | 1683 | 0.8340 | 0.8350 | 0.8345 |
| Methods | 15617 | 13473 | 2357 | 0.8511 | 0.8627 | 0.8569 |
| Objective | 4657 | 2810 | 1634 | 0.6323 | 0.6034 | 0.6175 |
| Results | 22228 | 18724 | 3240 | 0.8525 | 0.8424 | 0.8474 |
Results of a CRF model trained on structured abstracts. For each label we show the number of instances in the data set (Positives), the number of True Positives (TP), the number of False Positives (FP), and the precision, recall and F-measure values.
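The precision, recall and F-measure columns of the discourse tables follow directly from the count columns; a small check (plain arithmetic, not from the paper) reproduces the CRF Conclusions row:

```python
def precision_recall_f(positives, tp, fp):
    """Metrics from the count columns of the discourse tables."""
    p = tp / (tp + fp)       # precision = TP / (TP + FP)
    r = tp / positives       # recall = TP / Positives
    f = 2 * p * r / (p + r)  # F-measure: harmonic mean of p and r
    return p, r, f

# CRF model, Conclusions row: Positives = 10126, TP = 8455, FP = 1683
p, r, f = precision_recall_f(10126, 8455, 1683)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.8340 0.8350 0.8345
```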