Yifan Peng, Chih-Hsuan Wei, Zhiyong Lu.
Abstract
BACKGROUND: Identifying relations between chemicals and diseases is important for new drug discovery and for improving chemical safety, so there has been growing interest in automatic relation extraction systems that capture these relations from the rich and rapidly growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for chemical-induced disease (CID) relations.
Keywords: BioNLP; Chemical-induced disease; Relation extraction; Text mining
Year: 2016 PMID: 28316651 PMCID: PMC5054544 DOI: 10.1186/s13321-016-0165-z
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The title and abstract of a sample document (PMID 2375138). Chemical and disease mentions are marked in green and yellow, respectively.
Statistics of the corpora
| Corpus | Documents | CID Pairs | Unique |
|---|---|---|---|
| BC5 training | 500 | 1038 | 927 |
| BC5 development | 500 | 1012 | 887 |
| BC5 test | 500 | 1066 | 941 |
| CTD-Pfizer | 18,410 | 33,224 | 15,439 |
Fig. 2 The pipeline of our CID extraction system
Fig. 3 The title and abstract of a sample document (PMID 11431197). Chemical and disease mentions are marked in green and yellow, respectively.
Fig. 4 The Extended Dependency Graph of the sentence “A number of angiogenesis inhibitors such as sunitinib and sorafenib have been found to cause acute hemolysis” (PMID: 20698227)
Fig. 5 The Extended Dependency Graph of the text “A case of tardive dyskinesia caused by metoclopramide” (PMID: 6727060)
Shortest path, v-walks, and e-walks of sample sentences in Figs. 4 and 5
| Path type | Example |
|---|---|
| Shortest path | Chemical ← arg0 ← cause → arg1 → Disease |
| v-walks | cause → arg0 → Chemical; cause → arg1 → Disease |
| e-walks | arg0 ← cause → arg1 |
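The walks in the table above can be derived mechanically from the shortest path. The following is a minimal sketch, assuming that a v-walk is a vertex-edge-vertex triple and an e-walk is an edge-vertex-edge triple, which is how the examples read. The `Edge` type and the `v_walks` and `e_walks` helpers are hypothetical names introduced for illustration and are not taken from the original system.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    """One dependency edge on the shortest path (hypothetical encoding)."""
    head: str       # governing vertex, e.g. the trigger "cause"
    label: str      # edge label, e.g. "arg0"
    dependent: str  # governed vertex, e.g. "Chemical"


def v_walks(path: List[Edge]) -> List[str]:
    """Vertex-edge-vertex walks, written from head to dependent."""
    return [f"{e.head} → {e.label} → {e.dependent}" for e in path]


def e_walks(path: List[Edge]) -> List[str]:
    """Edge-vertex-edge walks around the vertex shared by consecutive edges."""
    walks = []
    for left, right in zip(path, path[1:]):
        # Assumes consecutive edges share exactly one vertex.
        shared = ({left.head, left.dependent}
                  & {right.head, right.dependent}).pop()
        # Arrows always point from head to dependent.
        left_arrow = "←" if shared == left.head else "→"
        right_arrow = "→" if shared == right.head else "←"
        walks.append(
            f"{left.label} {left_arrow} {shared} {right_arrow} {right.label}"
        )
    return walks


# Shortest path: Chemical ← arg0 ← cause → arg1 → Disease
shortest_path = [
    Edge("cause", "arg0", "Chemical"),
    Edge("cause", "arg1", "Disease"),
]
```

Calling `v_walks(shortest_path)` and `e_walks(shortest_path)` on this path yields the strings listed in the table.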
Statistical features
| # | Feature | Type |
|---|---|---|
| 1 | # of chemical mentions | Numeric |
| 2 | # of disease mentions | Numeric |
| 3 | Is chemical in title | Boolean |
| 4 | Is disease in title | Boolean |
| 5 | Is chemical in the 1st sentence of the abstract | Boolean |
| 6 | Is disease in the 1st sentence of the abstract | Boolean |
| 7 | Is chemical in the last sentence of the abstract | Boolean |
| 8 | Is disease in the last sentence of the abstract | Boolean |
| 9 | Are both chemical and disease in the same sentence | Boolean |
| 10 | Was the disease-chemical relation curated by CTD in the past | Boolean |
| 11 | Did both disease and chemical appear in the MeSH indexing in the past | Boolean |
| 12 | Is any keyword around the disease, such as therapy, complicating, affect, etc. | Boolean |
| 13 | Is any keyword around the chemical, such as 3.0 mEg/L, mg, etc. | Boolean |
| 14 | Is “increase” or “decrease” around chemical | Boolean |
| 15 | Is “increase” or “decrease” around disease | Boolean |
| 16 | Is “ | Boolean |
| 17 | Is “p-value” around disease | Boolean |
| 18 | Is “men”, “women”, or “patient” around chemical | Boolean |
| 19 | Is “men”, “women”, or “patient” around disease | Boolean |
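As an illustration of how the positional features (#1 to #9) could be computed for one chemical-disease pair, here is a minimal sketch. It assumes each mention has already been normalized to a concept identifier and tagged with a sentence index, where index 0 is the title. The `Mention` class, the `statistical_features` function, the feature names, and the identifiers `C1` and `D1` are all hypothetical. Features #10 to #19 need external resources (CTD, MeSH indexing, keyword lists) and are not covered.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Mention:
    concept_id: str  # normalized concept identifier (placeholder values below)
    sentence: int    # 0 = title, 1 = first abstract sentence, and so on


def statistical_features(chemical: str, disease: str,
                         mentions: List[Mention],
                         n_sentences: int) -> Dict[str, Union[int, bool]]:
    """Features #1 to #9 for one chemical-disease pair in one document.

    n_sentences counts the title plus all abstract sentences.
    """
    chem = [m.sentence for m in mentions if m.concept_id == chemical]
    dis = [m.sentence for m in mentions if m.concept_id == disease]
    last = n_sentences - 1  # index of the last abstract sentence
    return {
        "n_chemical_mentions": len(chem),                # feature 1
        "n_disease_mentions": len(dis),                  # feature 2
        "chemical_in_title": 0 in chem,                  # feature 3
        "disease_in_title": 0 in dis,                    # feature 4
        "chemical_in_first_sentence": 1 in chem,         # feature 5
        "disease_in_first_sentence": 1 in dis,           # feature 6
        "chemical_in_last_sentence": last in chem,       # feature 7
        "disease_in_last_sentence": last in dis,         # feature 8
        "same_sentence": bool(set(chem) & set(dis)),     # feature 9
    }


# Toy document: a title plus five abstract sentences (indices 0 to 5).
example = [Mention("C1", 0), Mention("C1", 3),
           Mention("D1", 3), Mention("D1", 5)]
features = statistical_features("C1", "D1", example, n_sentences=6)
```

In this toy document the chemical appears in the title, the disease appears in the last sentence, and both co-occur in sentence 3, so features 3, 8, and 9 are true.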
Evaluation of named entity results in normalized concept identifiers
| Named entity | Precision | Recall | F-score |
|---|---|---|---|
| Disease concepts | 78.77 | 81.14 | 79.94 |
| Chemical concepts | 88.49 | 92.57 | 90.49 |
Evaluation of CID results
| Team/training corpus | Text-mined entity mentions: Precision | Text-mined entity mentions: Recall | Text-mined entity mentions: F-score | Gold entity mentions: Precision | Gold entity mentions: Recall | Gold entity mentions: F-score |
|---|---|---|---|---|---|---|
| Co-occurrence baseline | 16.43 | 76.45 | 27.05 | – | – | – |
| Avg team results | 47.09 | 42.61 | 43.37 | – | – | – |
| Best team results | 55.67 | 58.44 | 57.03 | – | – | – |
| 1. Train | 51.55 | 59.19 | 55.11 | 62.07 | 64.17 | 63.10 |
| 2. Train + dev | 64.24 | 52.06 | 57.51 | 68.15 | 66.04 | 67.08 |
| 3. Train + dev + 1000 | 63.78 | 53.85 | 58.39 | 68.12 | 68.95 | 68.53 |
| 4. Train + dev + 5000 | 62.50 | 56.75 | 59.49 | 67.63 | 72.33 | 69.90 |
| 5. Train + dev + 10,000 | 64.49 | 56.57 | 60.27 | 69.64 | 71.86 | 70.73 |
| 6. Train + dev + 18,410 | 65.59 | 56.94 | 61.01 | 71.07 | 72.61 | 71.83 |
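The F-score column appears to be the usual harmonic mean of precision and recall. The sketch below assumes that definition, which this record does not state explicitly, and checks it against three rows of the table. The function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean) of precision and recall, in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the text-mined columns of the table.
rows = {
    "Co-occurrence baseline": (16.43, 76.45),
    "1. Train": (51.55, 59.19),
    "2. Train + dev": (64.24, 52.06),
}
recomputed = {name: round(f_score(p, r), 2) for name, (p, r) in rows.items()}
```

For these three rows the recomputed values agree with the reported F-scores (27.05, 55.11, and 57.51). Small differences are possible on other rows because the reported precision and recall are themselves rounded.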
Contributions of different features
| # | Features | Precision (%) | Recall (%) | F-value (%) | F-value change (%) |
|---|---|---|---|---|---|
| 1 | All features | 64.24 | 52.06 | 57.51 | |
| 2 | - BOW | 63.09 | 51.31 | 56.60 | −0.91 |
| 3 | - BOB | 61.24 | 52.63 | 56.61 | −0.90 |
| 4 | - Pattern | 61.83 | 51.22 | 56.03 | −1.48 |
| 5 | - Shortest path | 62.03 | 52.72 | 57.00 | −0.51 |
| 6 | - Statistical | 53.29 | 41.74 | 46.82 | −10.69 |
| 7 | - #1 ~ #8 | 62.54 | 50.75 | 56.03 | −1.48 |
| 8 | - #1 and #2 | 62.90 | 51.69 | 56.75 | −0.76 |
| 9 | - #3 and #4 | 63.31 | 51.97 | 57.08 | −0.43 |
| 10 | - #5 ~ #8 | 63.23 | 51.78 | 56.94 | −0.57 |
| 11 | - #9 ~ #11 | 54.04 | 45.12 | 49.18 | −8.33 |
| 12 | - #9 | 63.62 | 52.16 | 57.32 | −0.19 |
| 13 | - #10 | 57.09 | 45.31 | 50.52 | −6.99 |
| 14 | - #11 | 61.49 | 50.47 | 55.44 | −2.07 |
| 15 | - #12 ~ #19 | 63.79 | 52.06 | 57.33 | −0.18 |
Precision on BC5 training set
| Trigger | TP | FP | Precision (%) |
|---|---|---|---|
| Associate | 29 | 9 | 76.32 |
| Cause | 21 | 10 | 67.74 |
| Induce | 179 | 65 | 73.36 |
| Produce | 12 | 4 | 75.00 |
| Total | 242 | 89 | 73.11 |
Fig. 6 The relationship between the percentage of overlapped CID relations and the method performance in F-scores with (fscore_gold_entity) and without (fscore_text-mined_entity) using gold entities
Fig. 7 The performance changes of the overlapped CID relations in the test set
Fig. 8 The performance changes of non-overlapped CID relations in the test set
Statistics of extraction errors by our method
| Error type | FN | FP | Total | % |
|---|---|---|---|---|
| NER or normalization errors | 254 | 58 | 312 | 39.90 |
| CID relations mentioned in single sentences | 148 | 124 | 272 | 34.78 |
| CID relations asserted across sentences | 63 | 54 | 117 | 14.96 |
| Extracted disease or chemical in CID is too general | 0 | 46 | 46 | 5.88 |
| The extracted disease/chemical pair is a treatment relation | 0 | 29 | 29 | 3.71 |
| Annotated CID relations absent in the abstract | 6 | 0 | 6 | 0.77 |
| Total | 471 | 311 | 782 | 100.00 |
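The percentage column can be recomputed from the FN and FP counts. The short sketch below assumes each percentage is the share of that error type among all 782 errors. The variable names are illustrative.

```python
# (FN, FP) counts copied from the error table.
errors = {
    "NER or normalization errors": (254, 58),
    "CID relations mentioned in single sentences": (148, 124),
    "CID relations asserted across sentences": (63, 54),
    "Extracted disease or chemical in CID is too general": (0, 46),
    "The extracted disease/chemical pair is a treatment relation": (0, 29),
    "Annotated CID relations absent in the abstract": (6, 0),
}

total_fn = sum(fn for fn, _ in errors.values())
total_fp = sum(fp for _, fp in errors.values())
total = total_fn + total_fp

# Share of each error type among all errors, formatted to two decimals.
shares = {name: f"{100 * (fn + fp) / total:.2f}"
          for name, (fn, fp) in errors.items()}
```

Under that assumption the column totals come out as 471, 311, and 782, and the shares match the reported percentages.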