Jinghang Gu, Longhua Qian, Guodong Zhou.
Abstract
Understanding the relations between chemicals and diseases is crucial in various biomedical tasks such as new drug discoveries and new therapy developments. Manually mining these relations from the biomedical literature is costly and time-consuming, and the results are difficult to keep up-to-date. To address these issues, the BioCreative-V community proposed a challenging task of automatic extraction of chemical-induced disease (CID) relations in order to benefit biocuration. This article describes our work on the CID relation extraction task of BioCreative-V. We built a machine learning based system that utilized simple yet effective linguistic features to extract relations with maximum entropy models. In addition to leveraging various features, the hypernym relations between entity concepts derived from the Medical Subject Headings (MeSH)-controlled vocabulary were also employed during both training and testing stages to obtain more accurate classification models and better extraction performance, respectively. We reduced relation extraction between entities in documents to relation extraction between entity mentions. In our system, pairs of chemical and disease mentions at both intra- and inter-sentence levels were first constructed as relation instances for training and testing; then two classification models, one for each level, were trained from the training examples and applied to the testing examples. Finally, we merged the classification results from mention level to document level to acquire the final relations between chemicals and diseases. Our system achieved promising F-scores of 60.4% on the development dataset and 58.3% on the test dataset using gold-standard entity annotations.
Database URL: https://github.com/JHnlp/BC5CIDTask
Year: 2016 PMID: 27052618 PMCID: PMC4822558 DOI: 10.1093/database/baw042
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. System workflow diagram.
Figure 2. An annotation example of the CDR corpus.
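The abstract describes building chemical-disease mention pairs at intra- and inter-sentence levels, classifying them, and merging mention-level results to document level. A minimal sketch of that flow might look as follows. The `Mention` type, the function names, the placeholder concept IDs and the "any positive mention pair" merging rule are all illustrative assumptions, not the authors' implementation, and the boolean list stands in for the outputs of the two maximum entropy models.

```python
from itertools import product
from typing import NamedTuple


class Mention(NamedTuple):
    concept_id: str  # placeholder ID; the real corpus uses MeSH concept IDs
    sentence: int    # index of the sentence containing the mention


def build_pairs(chemicals, diseases):
    """Split all chemical-disease mention pairs by sentence co-occurrence."""
    intra, inter = [], []
    for chem, dis in product(chemicals, diseases):
        target = intra if chem.sentence == dis.sentence else inter
        target.append((chem, dis))
    return intra, inter


def merge_to_document(pairs, predictions):
    """Assumed rule: a concept pair is related if any mention pair is positive."""
    return {
        (chem.concept_id, dis.concept_id)
        for (chem, dis), positive in zip(pairs, predictions)
        if positive
    }


# Hypothetical document: one chemical mentioned twice, two diseases.
chemicals = [Mention("C1", 0), Mention("C1", 2)]
diseases = [Mention("D1", 0), Mention("D2", 1)]
intra, inter = build_pairs(chemicals, diseases)

# Stand-in classifier outputs, one per mention pair (intra first, then inter).
relations = merge_to_document(intra + inter, [True, False, False, True])
```

Keeping the two pair lists separate mirrors the two-model design in the abstract: each list would be scored by its own classifier before the results are pooled per document.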
The CID relation statistics on the corpus
| Task Datasets | No. of Articles | No. of CID Relations |
|---|---|---|
| Training | 500 | 1038 |
| Development | 500 | 1012 |
| Test | 500 | 1066 |
Performance on the development dataset (P: precision, R: recall, F: F-score; all values in %)
| Methods | Intra-sentence P | Intra-sentence R | Intra-sentence F | Inter-sentence P | Inter-sentence R | Inter-sentence F | Final CID relation P | Final CID relation R | Final CID relation F |
|---|---|---|---|---|---|---|---|---|---|
| LEX (baseline) |  | 59.2 | 63.3 | 46.0 | 33.1 | 38.5 | 58.1 | 52.7 | 55.3 |
| DEP | 67.7 | 54.2 | 60.2 | – | – | – | 58.7 | 51.2 | 54.7 |
| HF+LEX | 66.7 | 65.5 | 66.1 |  |  |  | 57.4 | 58.6 | 58.0 |
| HF+DEP | 65.5 | 61.8 | 63.6 | – | – | – | 56.6 | 57.4 | 57.0 |
| HF+LEX+DEP | 67.7 |  |  | – | – | – | 58.3 |  | 59.2 |
| Post-Processing | – | – | – | – | – | – |  | 59.0 |  |
Note: The best scores in each numerical column are in bold.
Performance on the test dataset
| Results | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|
| Intra-sentence level | 67.4 | 68.9 | 68.2 |
| Inter-sentence level | 51.4 | 29.8 | 37.7 |
| Final CID Relation | 62.0 | 55.1 | 58.3 |
Comparisons with the related works
| Methods | RT | No. of TP | No. of FP | No. of FN | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|---|---|
| Abstract level | – | 815 | 4145 | 251 | 16.4 | 76.5 | 27.1 |
| Sentence level | – | 570 | 1672 | 496 | 25.4 | 53.5 | 34.5 |
| Xu et al. ( | 8 | 623 | 496 | 443 | 55.7 | 58.4 | 57.0 |
| Pons et al. ( | 16 | 574 | 544 | 492 | 51.3 | 53.8 | 52.6 |
| Our online results | 5 | 358 | 346 | 708 | 50.9 | 33.6 | 40.5 |
| Approach in this article | 13 | 439 | 355 | 627 | 55.3 | 41.2 | 47.2 |
In the table, RT stands for the response time of different systems, TP stands for the true-positive relations, FP stands for the false-positive relations and FN stands for the false-negative relations.
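The last three columns of the comparison table follow from the TP, FP and FN counts through the standard precision, recall and F-score definitions. The short sketch below recomputes them for two rows of the table. The function name `prf` is an illustrative choice, and rounding to one decimal place is an assumption made to match how the table is printed.

```python
def prf(tp, fp, fn):
    """Return precision, recall and F-score as percentages (one decimal)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 1) for value in (precision, recall, f_score))


# Counts taken from the comparison table above.
sentence_level = prf(570, 1672, 496)
this_article = prf(439, 355, 627)
```

In every row, TP + FN equals 1066, the number of CID relations in the test dataset, so recall is simply TP divided by 1066.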