Hoang-Quynh Le, Mai-Vu Tran, Thanh Hai Dang, Quang-Thuy Ha, Nigel Collier.
Abstract
The BioCreative V chemical-disease relation (CDR) track was proposed to accelerate progress in text mining toward an integrative understanding of chemicals, diseases and their relations. In this article, we describe an extension of our system, UET-CAM, which participated in the BioCreative V CDR track. The original UET-CAM system was ranked fourth among 18 participating systems by the BioCreative CDR track committee. In the Disease Named Entity Recognition and Normalization (DNER) phase, our system employed joint inference (decoding) with a perceptron-based named entity recognizer (NER) and a back-off model with Semantic Supervised Indexing and Skip-gram for named entity normalization. In the chemical-induced disease (CID) relation extraction phase, we proposed a pipeline that includes a coreference resolution module and a Support Vector Machine relation extraction model. The coreference resolution module used a multi-pass sieve to extend entity recall. In this article, the UET-CAM system was improved by adding a 'silver' CID corpus to train the prediction model. This silver-standard corpus of more than 50 thousand sentences was built automatically from the Comparative Toxicogenomics Database (CTD). We evaluated our method on the CDR test set. Results showed that our system reached state-of-the-art performance, with an F1 of 82.44 for the DNER task and 58.90 for the CID task. Analysis demonstrated substantial benefits from both the multi-pass sieve coreference resolution method (F1 +4.13%) and the silver CID corpus (F1 +7.3%).

Database URL: SilverCID, the silver-standard corpus for CID relation extraction, is freely available online at https://zenodo.org/record/34530 (doi:10.5281/zenodo.34530).
Year: 2016 PMID: 27630201 PMCID: PMC4962668 DOI: 10.1093/database/baw102
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Analysis of the direct evidence field in the CTD database.
Large-scale feature set used in the intra-sentence relation extraction module
| Feature types | Description | Features | Provided information |
|---|---|---|---|
| Token features | Token itself | Token orthography (capitalization, first letter of sentence, number, etc.); Base form of token; N-grams (…); Part-of-speech tagging | Information about the current token |
| Neighboring token features | Extracts all 2-step dependency paths from the target token, which are then used to extract n-grams | Features extracted by the token feature function for each token; Token and dependency n-grams (…); Token n-grams (…); Dependency n-grams (…) | Information about the surrounding context of the current token |
| Token n-gram features | Extracts token n-grams (…) | N-grams of words | Information about the phrase that contains the current token |
| Pair n-gram features | Extracts word n-grams (…) | Dependency n-grams (…); Token n-grams (…); N-grams (…) | Information about the function of the current token in the dependency tree |
| Shortest path features | Shortest dependency paths between two words (each word belonging to a disease or chemical entity) | Length of path; Word n-grams (…); Dependency n-grams (…); Consecutive word n-grams (…); Edge walks (word-dependency-word) and their sub-structures; Vertex walks (dependency-word-dependency) and their sub-structures | Information about the relation between the current token and the other tokens in the sentence, using the dependency tree and the function of each token on this path |
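The shortest path features in the table above can be illustrated with a small sketch. It is not the authors' implementation: the toy dependency edges, the relation labels (`nsubj`, `obj`) and the function names `ngrams` and `shortest_dependency_path` are assumptions made for illustration, and the n-gram ranges used by the system are not given in this record. The sketch finds the shortest path between a chemical token and a disease token with a breadth-first search, then reads off the path length, word and dependency n-grams, edge walks and vertex walks.

```python
from collections import deque


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def shortest_dependency_path(edges, start, goal):
    """Breadth-first search over an undirected view of the dependency tree.

    edges: list of (head, label, dependent) triples (toy format, assumed).
    Returns an alternating list [word, label, word, ...] or None.
    Tokens are identified by their string here; real code would use indices.
    """
    graph = {}
    for head, label, dep in edges:
        graph.setdefault(head, []).append((dep, label))
        graph.setdefault(dep, []).append((head, label))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, label in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [label, neighbor]))
    return None


# Toy parse of "Maleate induced nephrotoxicity" (hypothetical labels).
edges = [("induced", "nsubj", "Maleate"), ("induced", "obj", "nephrotoxicity")]
path = shortest_dependency_path(edges, "Maleate", "nephrotoxicity")

words = path[0::2]          # tokens on the path
deps = path[1::2]           # dependency labels on the path
path_length = len(deps)     # number of edges
word_bigrams = ngrams(words, 2)
edge_walks = [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]
vertex_walks = [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]

print(path_length, word_bigrams, deps, edge_walks, vertex_walks)
```

In a real pipeline each of these tuples would be turned into a sparse binary or count feature before being passed to the classifier.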
Summary of the CDR track data set
| Data set | Articles | Chemical Men | Chemical ID | Disease Men | Disease ID | CID |
|---|---|---|---|---|---|---|
| Training | 500 | 5203 | 1467 | 4182 | 1965 | 1038 |
| Development | 500 | 5347 | 1507 | 4244 | 1865 | 1012 |
| Test | 500 | 5385 | 1435 | 4424 | 1988 | 1066 |
Men, Mention; CID, CID relations.
Figure 2. An example of constructing silverCID corpus.
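The abstract states only that the silver corpus was built automatically from CTD; the exact procedure is not given in this record. One common way to do this is distant supervision, sketched below under explicit assumptions: sentences arrive with chemical and disease identifiers already attached, CTD records are `(chemical_id, disease_id, direct_evidence)` triples, and only records whose direct evidence is `marker/mechanism` are treated as chemical-induced. The `Sentence` class, the function name, the filter and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Sentence:
    text: str
    chemicals: frozenset  # IDs of chemical mentions in the sentence
    diseases: frozenset   # IDs of disease mentions in the sentence


def build_silver_examples(sentences, ctd_records):
    """Label a sentence as a silver CID example when it mentions a
    chemical-disease pair that CTD lists with marker/mechanism evidence.
    This filter is an assumption, not the documented procedure."""
    induced = {(chem, dis) for chem, dis, evidence in ctd_records
               if evidence == "marker/mechanism"}
    examples = []
    for sent in sentences:
        for chem, dis in product(sorted(sent.chemicals), sorted(sent.diseases)):
            if (chem, dis) in induced:
                examples.append((sent.text, chem, dis))
    return examples


# Toy data with placeholder identifiers (not real CTD content).
ctd_records = [
    ("CHEM_A", "DIS_X", "marker/mechanism"),
    ("CHEM_B", "DIS_X", "therapeutic"),
]
sentences = [
    Sentence("Chemical A caused disease X in rats.",
             frozenset({"CHEM_A"}), frozenset({"DIS_X"})),
    Sentence("Chemical B relieved disease X.",
             frozenset({"CHEM_B"}), frozenset({"DIS_X"})),
]
silver = build_silver_examples(sentences, ctd_records)
print(silver)
```

Labels produced this way are noisy, because a sentence can mention a known pair without asserting the relation. That is consistent with the "Noise from silverCID corpus" entries in the error analysis table.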
Figure 3. Architecture of the proposed CDR extraction system, which includes the pipeline of processing modules and material resources; boxes with dotted lines indicate sub-modules.
Figure 4. An example of the coreference between chemical entities. Two sequential sentences are extracted from PubMed abstract PMID: 23949582.
Figure 5. Coreference resolution using a nine-pass sieve. The examples are pairs that were kept by the sieves.
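The general idea of a multi-pass sieve can be sketched as follows. This is a generic illustration, not the nine sieves of the system: the three rules (`exact_match`, `abbreviation`, `generic_noun`), the mention format and the toy mentions are all assumptions. Rules are applied one pass at a time, from the most precise to the least precise, and a mention linked in an earlier pass is not revisited.

```python
def exact_match(mention, antecedent):
    """Most precise rule: identical surface strings, ignoring case."""
    return mention["text"].lower() == antecedent["text"].lower()


def abbreviation(mention, antecedent):
    """Mention equals the initials of a multi-word antecedent."""
    initials = "".join(word[0] for word in antecedent["text"].split()).lower()
    return len(mention["text"]) > 1 and mention["text"].lower() == initials


def generic_noun(mention, antecedent):
    """Least precise rule: a generic phrase refers to the nearest chemical."""
    generic = {"the drug", "this drug", "the compound"}
    return mention["text"].lower() in generic and antecedent["type"] == "Chemical"


# Ordered from high precision to low precision (hypothetical sieves).
SIEVES = [exact_match, abbreviation, generic_noun]


def resolve(mentions):
    """Return {mention_index: antecedent_index} after running every sieve."""
    links = {}
    for sieve in SIEVES:                       # one pass per sieve
        for i, mention in enumerate(mentions):
            if i in links:                     # keep earlier, more precise links
                continue
            for j in range(i - 1, -1, -1):     # nearest antecedent first
                if sieve(mention, mentions[j]):
                    links[i] = j
                    break
    return links


# Toy mentions in document order; the abbreviation "OB" is invented.
mentions = [
    {"text": "oxitropium bromide", "type": "Chemical"},
    {"text": "OB", "type": "Chemical"},
    {"text": "the drug", "type": "Chemical"},
    {"text": "Oxitropium bromide", "type": "Chemical"},
]
links = resolve(mentions)
print(links)
```

Once a generic mention such as "the drug" is linked to a named chemical, a relation stated across two sentences can be handled as if the named chemical appeared in the second sentence.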
DNER results
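| | System | P (%) | R (%) | F (%) |
|---|---|---|---|---|
| BioCreative benchmarks | Dictionary look-up | 42.71 | 67.46 | 52.30 |
| BioCreative benchmarks | DNorm | 81.15 | 80.13 | 80.64 |
| BioCreative benchmarks | Average results | 78.99 | 74.81 | 76.03 |
| BioCreative benchmarks | Ranked no. 1 result | 89.63 | 83.50 | 86.46 |
| | NER-NEN pipeline | 78.26 | 83.17 | 80.64 |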
Provided by the BioCreative 2015 organizer (33).
The silverCID corpus included in training the NER module.
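As a consistency check on the DNER results, the three numeric columns can be read as precision, recall and F-score, with F taken as the harmonic mean of the first two. The sketch below assumes that reading and recomputes F for the rows that report a single system; the function name `f_score` is chosen here for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the DNER results table.
dner_rows = {
    "Dictionary look-up": (42.71, 67.46, 52.30),
    "DNorm": (81.15, 80.13, 80.64),
    "NER-NEN pipeline": (78.26, 83.17, 80.64),
}

for name, (p, r, reported) in dner_rows.items():
    print(f"{name}: computed {f_score(p, r):.2f}, reported {reported:.2f}")
```

The "Average results" row does not satisfy this identity (the formula gives about 76.84 rather than 76.03), presumably because each column was averaged separately across the participating systems.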
CID relation extraction results
| | System | P (%) | R (%) | F (%) |
|---|---|---|---|---|
| BioCreative benchmarks | Co-occurrence | 16.43 | 76.45 | 27.05 |
| BioCreative benchmarks | Average result | 47.09 | 42.61 | 43.37 |
| BioCreative benchmarks | Rank no. 1 result | 55.67 | 58.44 | 57.03 |
| | SVM | 44.73 | 50.56 | 47.47 |
| | SVM + silverCID corpus | 51.42 | 52.81 | 52.11 |
| | SVM + CR EMC | 47.64 | 50.28 | 48.93 |
CR, coreference resolution; MPS, multi-pass sieve; EMC, EM clustering.
Results provided by the BioCreative 2015 organizer (33). SVM: SVM intra-sentence relation extraction. Bold values are performance measures of our two models on the Test set, not using cross-validation evaluation.
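The co-occurrence benchmark combines high recall (76.45) with low precision (16.43). A toy sketch shows why such a baseline behaves this way. It assumes that the baseline predicts a CID relation for every chemical-disease pair found in the same sentence; the granularity actually used by the organizers is not stated in this record, and the identifiers and function names are hypothetical.

```python
from itertools import product


def cooccurrence_pairs(sentences):
    """Predict a relation for every chemical-disease pair in a sentence.

    sentences: list of (chemical_ids, disease_ids) tuples (assumed format).
    """
    predicted = set()
    for chemicals, diseases in sentences:
        predicted.update(product(chemicals, diseases))
    return predicted


def precision_recall(predicted, gold):
    """Set-based precision and recall over relation pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy document: two sentences, placeholder identifiers.
sentences = [({"C1"}, {"D1", "D2"}), ({"C2"}, {"D1"})]
gold = {("C1", "D1"), ("C2", "D3")}

predicted = cooccurrence_pairs(sentences)
precision, recall = precision_recall(predicted, gold)
print(sorted(predicted), precision, recall)
```

Every co-occurring pair is predicted, so most true relations that share a sentence are recovered, but many predicted pairs are not relations at all. Relations whose entities never co-occur, such as `("C2", "D3")` in the toy data, are missed regardless.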
Analysis of the contribution of methods and resources used in our proposed system for capturing CID relationships
| CID relation (chemical-disease) | PMID | SVM | SVM+CR | SVM+SC | SVM+CR+SC | |||
|---|---|---|---|---|---|---|---|---|
| Intra- | Inter- | |||||||
| 1 | Maleate (C030272)–nephrotoxicity (D007674) | 25119790 | ✓ | ✓ | ✓ | ✓ | ✓ | |
| 2 | Quinacrine hydrochloride (D011796)–atrial thrombosis (D003328) | 6517710 | ✓ | ✓ | ✓ | |||
| 3 | Metolachlor (C051786)–liver cancer (D008113) | 26033014 | ✓ | ✓ | ✓ | |||
| 4 | Galantamine (D005702)–headaches (D006261) | 17069550 | ✓ | |||||
| 5 | Methoxamine (D008729)–headache (D006261) | 11135381 | ✓ | ✓ | ✓ | |||
| 6 | Gemfibrozil (D015248)–myositis (D009220) | 1615846 | ✓ | ✓ | ||||
| 7 | Oxidized and reduced glutathione (D019803)–reperfusion injury (D015427) | 1943082 | ✓ | ✓ | ||||
| 8 | Metolachlor (C051786)–follicular cell lymphoma (D008224) | 26033014 | ✓ | |||||
SVM, SVM intra-sentence relation extraction; CR, multi-pass sieve coreference resolution; SC, silverCID corpus; Intra-, Intra-sentence CID relation; Inter-, Inter-sentence CID relation; ✓, chemical-disease pair is classified as CID relation correctly. See supplementary 1 for the sample texts.
Sources of errors by our system on the CDR test set
| CID relation (chemical-disease) | PMID | Cause of error | |||
|---|---|---|---|---|---|
| FP | FN | ||||
| 1 | Corticosteroid (D000305)–systemic sclerosis (D012595) | 22836123 | ✓ | Complex inter-sentence structure | |
| 2 | Cyclophosphamide (D003520)–edema (D004487) | 23666265 | ✓ | Complex inter-sentence structure | |
| 3 | Chlorhexidine diphosphanilate (C048279)–pain (D010146) | 2383364 | ✓ | Noise from silverCID corpus | |
| 4 | Theophylline (D013806)–tremors (D014202) | 3074291 | ✓ | Error from NER | |
| 5 | Scopolamine (D012601)–retention deficit (D012153) | 3088653 | ✓ | Error from NER | |
| 6 | Clopidogrel (C055162)–acute hepatitis (D017114) | 23846525 | ✓ | Error from NER | |
| 7 | Isoproterenol (D007545)–heart hypertrophy (D006984) | 2974281 | ✓ | Error from NEN | |
| 8 | Nicotine (D009538)–anxiety (D001008) | 15991002 | ✓ | Noise from silverCID corpus | |
| 9 | Oxitropium bromide (C017590)–nausea (D009325) | 3074291 | ✓ | Error from SVM model | |
| 10 | Gamma-vinyl-GABA (D020888)–status epilepticus (D013226) | 3708328 | ✓ | Error from coreference resolution module | |
Intra-, Intra-sentence CID relation; Inter-, Inter-sentence CID relation; FP, false positive; FN, false negative. See supplementary 2 for the sample texts.