Jun Xu, Yonghui Wu, Yaoyun Zhang, Jingqi Wang, Hee-Jin Lee, Hua Xu.
Abstract
Mining chemical-induced disease relations embedded in the vast biomedical literature could facilitate a wide range of computational biomedical applications, such as pharmacovigilance. In 2015, BioCreative V organized a Chemical Disease Relation (CDR) Track on extracting chemical-induced disease relations from biomedical literature. We participated in all subtasks of this challenge. In this article, we present our participating system, the Chemical Disease Relation Extraction SysTem (CD-REST), an end-to-end system for extracting chemical-induced disease relations from biomedical literature. CD-REST consists of two main components: (1) a chemical and disease named entity recognition and normalization module, which employs the Conditional Random Fields algorithm for entity recognition and a Vector Space Model-based approach for normalization; and (2) a relation extraction module that classifies both sentence-level and document-level candidate drug-disease pairs using support vector machines. Our system achieved the best performance on the chemical-induced disease relation extraction subtask in the BioCreative V CDR Track, demonstrating the effectiveness of our proposed machine learning-based approaches for automatic extraction of chemical-induced disease relations from biomedical literature. The CD-REST system provides web services that accept HTTP POST requests. The web services can be accessed at http://clinicalnlptool.com/cdr. The online CD-REST demonstration system is available at http://clinicalnlptool.com/cdr/cdr.html. Database URL: http://clinicalnlptool.com/cdr; http://clinicalnlptool.com/cdr/cdr.html
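To make the Vector Space Model-based normalization step more concrete, the following is a minimal sketch rather than the CD-REST implementation. It assumes TF-IDF weighting over whitespace tokens and cosine similarity, neither of which is specified in the abstract. The lexicon, its `TOY:` concept IDs, and all function names are hypothetical placeholders.

```python
import math
from collections import Counter

# Toy lexicon: concept ID -> known names. IDs and names are placeholders,
# not real MeSH entries.
LEXICON = {
    "TOY:0001": ["hypertension", "high blood pressure"],
    "TOY:0002": ["hypotension", "low blood pressure"],
    "TOY:0003": ["renal failure", "kidney failure"],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def vectorize(text, idf):
    """TF-IDF vector as a dict; tokens unseen in the lexicon are dropped."""
    counts = Counter(tokenize(text))
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}


def build_index(lexicon):
    """Return (idf, entries); each entry is (concept_id, name, tfidf_vector)."""
    names = [(cid, name) for cid, synonyms in lexicon.items() for name in synonyms]
    doc_freq = Counter(tok for _, name in names for tok in set(tokenize(name)))
    n = len(names)
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    entries = [(cid, name, vectorize(name, idf)) for cid, name in names]
    return idf, entries


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalize(mention, idf, entries):
    """Return (concept_id, score) of the most similar name, or (None, 0.0)."""
    query = vectorize(mention, idf)
    best_id, best_score = None, 0.0
    for cid, _, vec in entries:
        score = cosine(query, vec)
        if score > best_score:
            best_id, best_score = cid, score
    return best_id, best_score


idf, entries = build_index(LEXICON)
print(normalize("Kidney failure", idf, entries))
```

A mention that shares no token with the lexicon returns `(None, 0.0)`, which a real system would need to handle with a fallback strategy.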
Year: 2016 PMID: 27016700 PMCID: PMC4808251 DOI: 10.1093/database/baw036
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. A sample from the CDR corpus with the annotations of mentions, corresponding normalized MeSH IDs for both chemical and disease entities and normalized chemical-induced disease relation conveyed in the abstract.
Figure 2. An overview of CD-REST.
The entity and context information features used for the sentence-level classifier CS and the document-level classifier CD
| 1 | Entity mention | Bag of words & bigrams of the entity mentions | √ | √ |
| 2 | Chemical first | Is chemical the first entity in the sentence | √ | |
| 3 | MeSH IDs | The corresponding MeSH IDs of each entity | √ | √ |
| 4 | Core chemical | Whether target chemical is a core chemical | √ | √ |
| 5 | Before | Bag of words & bigrams before the entities | √ | |
| 6 | Between | Bag of words & bigrams between the entities | √ | |
| 7 | After | Bag of words & bigrams after the entities | √ | |
| 8 | Same sentence | Whether the | √ | |
| 9 | Adjacent sentences | Whether the | √ | |
| 10 | More than two sentences | Whether the | √ | |
| 11 | Match | Whether the words between the entities contain any term in | √ | √ |
| 12 | Match | Whether the sentence contains | √ | √(if feature 8 or 9 is true) |
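The context features in the table above (bag of words and bigrams before, between, and after the two entities, plus the chemical-first flag) can be sketched as follows. This is an illustrative sketch under several assumptions: the sentence is already tokenized, entity spans are given as non-overlapping token offsets, "chemical first" is read as "the chemical precedes the disease in the pair", and the feature-name prefixes and function names are hypothetical.

```python
def ngrams(tokens):
    """Return the unigrams followed by the adjacent-token bigrams."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def context_features(tokens, chem_span, dis_span):
    """Build a set of string features for one chemical-disease pair.

    Spans are (start, end) token offsets with an exclusive end and are
    assumed not to overlap.
    """
    first, second = sorted([chem_span, dis_span])
    lowered = [tok.lower() for tok in tokens]
    regions = {
        "before": lowered[:first[0]],
        "between": lowered[first[1]:second[0]],
        "after": lowered[second[1]:],
    }
    features = set()
    for name, region_tokens in regions.items():
        for gram in ngrams(region_tokens):
            features.add(f"{name}={gram}")
    # Assumed reading of the chemical-first feature: chemical precedes disease.
    features.add(f"chemical_first={chem_span[0] < dis_span[0]}")
    return features


# Illustrative sentence; the spans mark "Lidocaine" and "cardiac asystole".
tokens = "Lidocaine may induce cardiac asystole in some patients".split()
for feature in sorted(context_features(tokens, (0, 1), (3, 5))):
    print(feature)
```

Each feature string would typically be mapped to a dimension of a sparse vector before being passed to a classifier such as an SVM.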
Features extracted by incorporating knowledge bases
| 1 | Categories of | All direct or indirect hypernyms of |
| 2 | Categories of | All direct or indirect hypernyms of |
| 3 | Has a specific disease | Whether the document has a more specific disease |
| 4 | Has a general disease | Whether the document has a more general disease |
| 5 | Relation of | |
| 6 | Relation of | |
| 7 | Relation of | |
| 8 | Relation of | |
| 9 | Whether | |
| 10 | Relation of | |
| 11 | Whether | |
These features were used for both the CS and CD classifiers.
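The hierarchy-based features listed above (all direct or indirect hypernyms, and whether the document contains a more specific or more general disease) can be illustrated with a small traversal over an is-a hierarchy. The sketch below is not the authors' code: the `PARENTS` mapping is a toy hierarchy with placeholder concept names rather than real MeSH data, and the function names are hypothetical.

```python
# Toy is-a hierarchy: concept -> direct parents. Placeholder names only,
# not taken from MeSH.
PARENTS = {
    "acute kidney failure": ["kidney failure"],
    "kidney failure": ["kidney disease"],
    "kidney disease": ["urologic disease"],
    "urologic disease": [],
}


def hypernyms(concept, parents=PARENTS):
    """Return all direct and indirect hypernyms (ancestors) of a concept."""
    seen = set()
    stack = list(parents.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def has_more_specific(target, document_concepts, parents=PARENTS):
    """True if some other concept in the document is a descendant of target."""
    return any(
        target in hypernyms(other, parents)
        for other in document_concepts
        if other != target
    )


def has_more_general(target, document_concepts, parents=PARENTS):
    """True if some other concept in the document is an ancestor of target."""
    ancestors = hypernyms(target, parents)
    return any(other in ancestors for other in document_concepts if other != target)


doc = ["kidney failure", "acute kidney failure"]
print(sorted(hypernyms("acute kidney failure")))
print(has_more_specific("kidney failure", doc), has_more_general("kidney failure", doc))
```

The `seen` set keeps the traversal correct when a concept has several parents, which is common in polyhierarchical vocabularies.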
Performance of the CD-REST in the CNER and DNER tasks on the test set with different approaches
| Task | Run | NER approach | Corpus | P | R | F | P | R | F |
| CNER | 1 | U | V | 0.8850 | 0.9115 | 0.8980 | 0.9278 | 0.8858 | 0.9063 |
| CNER | 2 | S | V | 0.8941 | 0.9112 | 0.9027 | 0.9339 | 0.8819 | |
| CNER | 3 | S | V+IV | 0.9010 | 0.9199 | | 0.9376 | 0.8698 | 0.9024 |
| DNER | 1 | U | V | 0.8254 | 0.8395 | 0.8324 | 0.8648 | 0.8230 | 0.8434 |
| DNER | 2* | S | V | 0.8312 | 0.8395 | | 0.8689 | 0.8210 | |
| DNER | 3 | S | V+N | 0.8158 | 0.8355 | 0.8255 | 0.8636 | 0.8232 | 0.8429 |
U: the NER-U approach; S: the NER-S approach; V: the BioCreative V CDR Corpus; IV: the BioCreative IV CHEMDNER Corpus; N: the NCBI Disease Corpus. * marks the best run that CD-REST achieved on the DNER task in the CDR challenge. DNER Run #3 was not submitted to the challenge. Where applicable, the best performance in each category is highlighted in bold.
The performance of the CD-REST in the CID task using the end-to-end setting (CNER #1, DNER #1) and the gold-standard setting on the test set with different approaches. Where applicable, the best performance in each category is highlighted in bold.
| 0.4381 | 0.5209 | 0.5487 | 0.6059 | |||
| 0.6412 | 0.5047 | 0.5648 | 0.6836 | 0.6182 | 0.6493 | |
| 0.6186 | 0.6580 | |||||
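As a quick consistency check on how the scores relate, the sketch below computes the balanced F-measure as the harmonic mean of precision and recall. It assumes that each group of three values in the table is ordered precision, recall, F-measure, which the surviving rows do not state explicitly. The function name is hypothetical. Under that assumption, the fully populated row (0.6412, 0.5047, 0.5648 and 0.6836, 0.6182, 0.6493) is reproduced to four decimal places.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the fully populated row, read as (P, R) pairs (an assumption).
print(round(f_measure(0.6412, 0.5047), 4))  # 0.5648
print(round(f_measure(0.6836, 0.6182), 4))  # 0.6493
```

Small differences in the last digit can appear for other rows, since published scores are usually computed from raw counts rather than from rounded precision and recall.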
Results of CD-REST with the CS + CD approach on the test set using the end-to-end setting (CNER Run #1, DNER Run #1) and the gold-standard setting, when different sets of knowledge base features were used. The best results are highlighted in bold.
| Feature set | P (end-to-end) | R (end-to-end) | F (end-to-end) | P (gold standard) | R (gold standard) | F (gold standard) |
| Entity + Context | 0.5160 | 0.3640 | 0.4268 | 0.5960 | 0.4400 | 0.5073 |
| Entity + Context + MeSH | 0.5155 | 0.4222 | 0.4641 | 0.5842 | 0.5140 | 0.5469 |
| Entity + Context + MeSH + MEDI | 0.5206 | 0.4278 | 0.4696 | 0.5953 | 0.5244 | 0.5576 |
| Entity + Context + MeSH + MEDI + SIDER | 0.5308 | 0.4372 | 0.4794 | 0.6086 | 0.5310 | 0.5671 |
| Entity + Context + MeSH + MEDI + SIDER + CTD | | | | | | |
The performance of the CD-REST with CS + CD approach on the CID task using different combinations of CNER and DNER. Where applicable, the best performance in each category is highlighted in bold.
| Run | CNER run | DNER run | P | R | F |
| 1 | 1 | 1 | 0.6186 | | |
| 2 | 2 | 2 | 0.6216 | 0.5516 | 0.5845 |
| 3 | 3 | 2 | | 0.5422 | 0.5809 |
| 4 | 2 | 3 | 0.6193 | 0.5525 | 0.5840 |
| 5 | 3 | 3 | 0.6231 | 0.5413 | 0.5793 |