Sun Kim, Won Kim, Chih-Hsuan Wei, Zhiyong Lu, W John Wilbur.
Abstract
The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical-gene interactions, chemical-disease relationships and gene-disease relationships. Finding articles that contain this information is the first, and an important, step toward more efficient manual curation. However, the complex nature of named entities and their relationships makes it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles, based on our prior system for the protein-protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved a mean average precision (MAP) of 0.8030 in the official runs, the highest MAP among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.
Year: 2012 PMID: 23160415 PMCID: PMC3500521 DOI: 10.1093/database/bas042
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Our article prioritization method for the BioCreative 2012 Triage task. For input articles, features are generated in three different ways: word features, including multiwords, MeSH terms and substance/journal names; semantic features, utilizing dependency relations and a Semantic Model; and topic features, extracted by LDA topic modeling.
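One way to picture how the three feature groups in the figure could feed a single classifier is to merge them into one sparse feature map. The sketch below is illustrative only: the function `combine_features`, the group prefixes and the example feature names are assumptions introduced here, not the authors' implementation.

```python
def combine_features(word_feats, semantic_feats, topic_feats):
    """Merge three feature groups into one sparse feature map.

    Each group is a dict of feature name -> value. A group prefix keeps
    names from different groups from colliding (hypothetical convention).
    """
    combined = {}
    groups = (("word", word_feats), ("sem", semantic_feats), ("topic", topic_feats))
    for prefix, group in groups:
        for name, value in group.items():
            combined[f"{prefix}:{name}"] = value
    return combined


# Hypothetical features for one article.
example = combine_features(
    {"gene expression": 1.0, "Rats": 1.0},   # multiwords, MeSH terms
    {"induces(chemical, gene)": 1.0},        # dependency-based relation
    {"topic_7": 0.42},                       # LDA topic weight
)
```

Prefixing by group also makes it straightforward to switch one feature type on or off when comparing feature combinations.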
Dataset
| Dataset | Chemical names | Positives | Negatives | Total |
|---|---|---|---|---|
| Training | 2-Acetylaminofluorene | 81 | 97 | 178 |
| Training | Amsacrine | 37 | 32 | 69 |
| Training | Aniline | 100 | 126 | 226 |
| Training | Aspartame | 46 | 110 | 156 |
| Training | Doxorubicin | 138 | 61 | 199 |
| Training | Indomethacin | 76 | 9 | 85 |
| Training | Quercetin | 392 | 150 | 542 |
| Training | Raloxifene | 161 | 109 | 270 |
| Test | Cyclophosphamide | 107 | 47 | 154 |
| Test | Phenacetin | 65 | 21 | 86 |
| Test | Urethane | 106 | 98 | 204 |
The training and test sets include eight and three target chemicals, respectively. Because the ratio of positive to negative examples varies across target chemicals, our system is tuned to achieve a high MAP score on the training chemicals.
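The variation in class balance mentioned above can be checked directly from the dataset table. The following is a minimal sketch; the `DATASET` dictionary and the `positive_ratio` helper are names introduced here for illustration, and the counts are copied from the table.

```python
# (positives, negatives) per target chemical, copied from the dataset table.
DATASET = {
    "2-Acetylaminofluorene": (81, 97),
    "Amsacrine": (37, 32),
    "Aniline": (100, 126),
    "Aspartame": (46, 110),
    "Doxorubicin": (138, 61),
    "Indomethacin": (76, 9),
    "Quercetin": (392, 150),
    "Raloxifene": (161, 109),
    "Cyclophosphamide": (107, 47),
    "Phenacetin": (65, 21),
    "Urethane": (106, 98),
}


def positive_ratio(positives, negatives):
    """Fraction of articles labeled positive for one chemical."""
    return positives / (positives + negatives)


ratios = {name: positive_ratio(p, n) for name, (p, n) in DATASET.items()}

# List chemicals from the lowest to the highest positive fraction.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<24}{ratio:.3f}")
```

With these counts, the positive fraction ranges from roughly 0.29 for aspartame to roughly 0.89 for indomethacin.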
Semantic classes and the classification performance for the semantic model
| Class name | Gene | Chemical | Disease | Other |
|---|---|---|---|---|
| Number of strings | 70 832 | 49 800 | 7589 | 113 815 |
| Mean average precision | 0.914 | 0.868 | 0.706 | 0.912 |
The second row contains the number of unique strings in the four different classes. The last row shows the MAP scores from a 10-fold cross-validation to learn how to distinguish each class from the union of the other three.
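The one-versus-rest setup described above, with each class treated as positive against the union of the other three, can be sketched from the string counts in the table. The names `CLASS_COUNTS` and `one_vs_rest_counts` are introduced here for illustration; the sketch covers only how the positive and negative sets are formed, not the cross-validation or the model itself.

```python
# Number of unique strings per semantic class, copied from the table.
CLASS_COUNTS = {"Gene": 70832, "Chemical": 49800, "Disease": 7589, "Other": 113815}


def one_vs_rest_counts(class_counts, target):
    """Return (positives, negatives) when `target` is the positive class
    and the union of all other classes is the negative class."""
    positives = class_counts[target]
    negatives = sum(n for name, n in class_counts.items() if name != target)
    return positives, negatives


for target in CLASS_COUNTS:
    pos, neg = one_vs_rest_counts(CLASS_COUNTS, target)
    print(f"{target:<9} positives={pos:>7} negatives={neg:>7}")
```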
Average precision changes with Triage (training) + Triage (testing)
| Chemical names | BASE | IXN | TOPIC | IXN + TOPIC |
|---|---|---|---|---|
| 2-Acetylaminofluorene | 0.6702 | 0.6742 | 0.6956 | |
| Amsacrine | 0.6956 | 0.6773 | 0.6848 | |
| Aniline | 0.7765 | 0.7891 | 0.7887 | |
| Aspartame | 0.4845 | 0.4687 | 0.4859 | |
| Doxorubicin | 0.8610 | 0.8627 | 0.8689 | |
| Indomethacin | 0.9758 | 0.9748 | 0.9751 | |
| Quercetin | 0.9313 | 0.9310 | 0.9313 | |
| Raloxifene | 0.8060 | 0.8107 | 0.8152 | |
| Average performance | 0.7754 | 0.7812 | 0.7777 |
The Triage dataset is used for training and testing in a leave-one (chemical)-out approach. ‘BASE’ means word features without substance/journal names. ‘IXN’ and ‘TOPIC’ mean semantic and topic features, respectively. ‘BASE’ features are used for all the experiments.
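The leave-one-(chemical)-out procedure can be sketched as follows: each chemical in turn supplies the test articles, and the articles of all remaining chemicals form the training set. The function name and the toy data are assumptions made for illustration.

```python
def leave_one_chemical_out(articles_by_chemical):
    """Yield (held_out_chemical, train, test), holding out one chemical per fold."""
    for held_out in articles_by_chemical:
        test = list(articles_by_chemical[held_out])
        train = [
            article
            for chemical, articles in articles_by_chemical.items()
            if chemical != held_out
            for article in articles
        ]
        yield held_out, train, test


# Toy data: hypothetical (article_id, label) pairs grouped by target chemical.
toy = {
    "aniline": [("a1", 1), ("a2", 0)],
    "aspartame": [("b1", 1)],
    "quercetin": [("c1", 0), ("c2", 1)],
}
folds = list(leave_one_chemical_out(toy))
```

Splitting by chemical rather than by article means a model is always evaluated on a chemical it has not seen in training, which mirrors the official setting where the test chemicals differ from the training chemicals.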
Average precision changes with CTD (training) + Triage (testing)
| Chemical names | BASE | IXN | TOPIC | IXN + TOPIC |
|---|---|---|---|---|
| 2-Acetylaminofluorene | 0.6776 | 0.6776 | 0.6814 | |
| Amsacrine | 0.7202 | 0.7308 | 0.7468 | |
| Aniline | 0.7625 | 0.7542 | 0.7477 | |
| Aspartame | 0.4902 | 0.4958 | 0.5269 | |
| Doxorubicin | 0.8767 | 0.8828 | 0.8871 | |
| Indomethacin | 0.9608 | 0.9610 | 0.9604 | |
| Quercetin | 0.9186 | 0.9162 | 0.9189 | |
| Raloxifene | 0.7803 | 0.7737 | 0.7661 | |
| Average performance | 0.7736 | 0.7752 | 0.7802 |
Again, a leave-one-(chemical)-out train and test procedure is followed. The full dataset was downloaded from CTD and used to augment the training set. Any duplicates appearing in both training and testing sets were removed from the training set. ‘BASE’ uses word features without substance/journal names. ‘IXN’ and ‘TOPIC’ mean semantic and topic features, respectively. ‘BASE’ features are used for all the experiments.
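The duplicate removal step can be sketched as a simple filter. This assumes each article carries a unique identifier such as a PMID; the function name and the sample identifiers are hypothetical.

```python
def remove_test_duplicates(train_articles, test_articles):
    """Drop training articles whose identifier also occurs in the test set.

    Both arguments are lists of (article_id, label) pairs.
    """
    test_ids = {article_id for article_id, _ in test_articles}
    return [
        (article_id, label)
        for article_id, label in train_articles
        if article_id not in test_ids
    ]


# Hypothetical identifiers: "1002" appears in both sets and is dropped from training.
train = [("1001", 1), ("1002", 1), ("1003", 0)]
test = [("1002", 1), ("2001", 0)]
cleaned = remove_test_duplicates(train, test)
```

Filtering the training side, rather than the test side, leaves the evaluation set unchanged across experiments.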
Overall performance (average precision) changes for different dataset, feature and classifier combinations
| Training set / Feature / Classifier | Triage / Multiword features / Bayes | Triage / Multiword features / Huber | Triage / All proposed features / Huber | CTD / All proposed features / Huber |
|---|---|---|---|---|
| 2-Acetylaminofluorene | 0.6812 | 0.7055 | 0.6932 | |
| Amsacrine | 0.5880 | 0.6676 | 0.6850 | |
| Aniline | 0.7589 | 0.7646 | 0.7708 | |
| Aspartame | 0.3755 | 0.4520 | 0.4890 | |
| Doxorubicin | 0.8434 | 0.8718 | 0.8689 | |
| Indomethacin | 0.9599 | 0.9699 | 0.9626 | |
| Quercetin | 0.9068 | 0.9176 | 0.9227 | |
| Raloxifene | 0.7913 | 0.7940 | 0.7759 | |
| Average performance | 0.7424 | 0.7648 | 0.7843 | |
‘Triage’ means the Triage training set is used for training. ‘CTD’ means the full CTD set is used to augment the positive set, with negatives taken from the Triage set. Again, a leave-one-out train and test scenario is used. ‘Bayes’ and ‘Huber’ indicate Bayes and Huber classifiers, respectively.
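The text here does not define the Huber classifier. One common reading, and an assumption on our part, is a linear classifier trained with the modified Huber loss, which behaves like a squared hinge loss near the decision boundary and grows only linearly for badly misclassified examples. A minimal sketch of that loss, under this assumption:

```python
def modified_huber_loss(label, score):
    """Modified Huber loss for one example.

    label: +1 or -1 (true class); score: real-valued classifier output.
    Assumed form: 0 for margin >= 1, (1 - margin)^2 for -1 <= margin < 1,
    and -4 * margin for margin < -1.
    """
    margin = label * score
    if margin >= 1.0:
        return 0.0
    if margin >= -1.0:
        return (1.0 - margin) ** 2
    return -4.0 * margin


# Loss for a positive example at several classifier scores.
losses = [modified_huber_loss(+1, s) for s in (2.0, 0.0, -1.0, -2.0)]
```

The linear tail is what makes this loss less sensitive to outliers and label noise than a purely squared loss.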
Official performance on the Triage test set
| Chemical names | AP | Hit rate (gene) | Hit rate (chemical) | Hit rate (disease) |
|---|---|---|---|---|
| Cyclophosphamide | 0.857 | 0.339 | 0.593 | 0.646 |
| Phenacetin | 0.824 | 0.627 | 0.667 | 0.333 |
| Urethane | 0.728 | 0.311 | 0.681 | 0.389 |
| Average performance | 0.803 | 0.426 | 0.647 | 0.456 |
AP, average precision. ‘Hit Rate’ is the fraction of extracted terms that are matched with manually curated entities (precision).
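Both measures can be sketched in a few lines. Average precision is computed here in its usual ranked-retrieval form, the mean of the precision values at each rank where a relevant article appears; the exact variant used in the official evaluation is not stated here, so treat this as an assumption. The hit rate follows the definition above, with exact string matching assumed; the function names and sample terms are hypothetical.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list.

    ranked_labels: 0/1 relevance flags, best-ranked article first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def hit_rate(extracted_terms, curated_terms):
    """Fraction of distinct extracted terms that match a curated entity."""
    extracted = set(extracted_terms)
    if not extracted:
        return 0.0
    return len(extracted & set(curated_terms)) / len(extracted)


# Relevant articles at ranks 1, 3 and 4 of a five-article ranking.
ap = average_precision([1, 0, 1, 1, 0])
# Two of the three extracted terms appear among the curated entities.
rate = hit_rate(["TP53", "EGFR", "BRCA1"], ["TP53", "BRCA1", "MYC"])
```

In the toy ranking, the precision values at the three relevant ranks are 1/1, 2/3 and 3/4, so the average precision is their mean.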
Average precision comparison among top MAP scoring teams
| Chemical names | Our team | Team 130 | Team 133 |
|---|---|---|---|
| Cyclophosphamide | 0.7740 | 0.7220 | |
| Phenacetin | 0.8240 | 0.8020 | |
| Urethane | 0.7280 | 0.6660 | |
| Mean average precision | 0.7787 | 0.7543 | |
Team 130 uses co-occurrences between entities and their network centralities for document ranking. Team 133 uses document scores obtained from entity frequencies and the number of sentences for ranking. The average performance over all participants was 0.7617, 0.8171 and 0.6649 for ‘cyclophosphamide’, ‘phenacetin’ and ‘urethane’, respectively.
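As a quick arithmetic check, the MAP is the unweighted mean of the per-chemical average precision values, and each of those can be compared against the all-participant averages quoted above. The sketch uses our official per-chemical AP values from the test-set table; the variable names are introduced here for illustration.

```python
# Our official AP per test chemical (from the official performance table).
OUR_AP = {"cyclophosphamide": 0.857, "phenacetin": 0.824, "urethane": 0.728}

# Average AP over all participants, as quoted in the text.
PARTICIPANT_MEAN_AP = {"cyclophosphamide": 0.7617, "phenacetin": 0.8171, "urethane": 0.6649}

# MAP: unweighted mean of the per-chemical AP values.
mean_ap = sum(OUR_AP.values()) / len(OUR_AP)

# Margin of our AP over the participant average, per chemical.
margins = {
    name: round(OUR_AP[name] - PARTICIPANT_MEAN_AP[name], 4)
    for name in OUR_AP
}

print(round(mean_ap, 3))
print(margins)
```

The mean comes to about 0.803, matching the reported official MAP, and the margin over the participant average is largest for cyclophosphamide and smallest for phenacetin.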