Firoj Alam, Anna Corazza, Alberto Lavelli, Roberto Zanoli.
Abstract
The article describes a knowledge-poor approach to the task of extracting Chemical-Disease Relations from PubMed abstracts. A first version of the approach was applied in our participation in the BioCreative V track 3, both in Disease Named Entity Recognition and Normalization (DNER) and in Chemical-induced diseases (CID) relation extraction. For both tasks, we adopted a general-purpose approach based on machine learning techniques, integrated with a limited number of domain-specific knowledge resources and using freely available tools for preprocessing data. Crucially, the system only uses the data sets provided by the organizers. The aim is to design an easily portable approach with a limited need for domain-specific knowledge resources. In the BioCreative V task, we ranked 5th out of 16 in DNER and 7th out of 18 in CID. In this article, we present our follow-up study, in particular on CID, in which we perform further experiments, extend our approach and improve its performance.
Year: 2016 PMID: 27189609 PMCID: PMC4869795 DOI: 10.1093/database/baw071
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Summary of the BioCreative V track 3 data set
| Data set | No. of docs | No. of relations | No. of unique relations | Avg. tokens per doc. | Avg. tokens per title | Avg. tokens per abstract | Chemical mentions (IDs) | Disease mentions (IDs) |
|---|---|---|---|---|---|---|---|---|
| Train | 500 | 1039 | 928 | 216.75 | 13.52 | 203.23 | 5203 (1467) | 4182 (1965) |
| Dev | 500 | 1012 | 889 | 215.33 | 13.61 | 201.72 | 5347 (1507) | 4244 (1865) |
| Test | 500 | 1066 | 941 | 226.57 | 13.42 | 212.59 | 5385 (1435) | 4424 (1988) |
Figure 1. System architecture.
Trigger and endpoint pairs for barrier features
| Endpoint | Trigger |
|---|---|
| JJ | JJR |
| DT | NN, NNP |
| PRP | NNS |
| JJ | RBR |
| DT, IN | VB |
| IN | VBP |
| DT, MD, VB, VBP, VBZ, TO | VBD, VBN |
| PRP | VBZ |
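The trigger and endpoint pairs above can be turned into a feature extractor along the following lines. This is a minimal sketch, not the authors' implementation: it assumes the common definition of a barrier feature (the set of POS tags found between a trigger and the nearest endpoint to its left), and the function name `barrier_feature`, the exclusion of the endpoint itself, and the `None` return when no endpoint is found are all illustrative assumptions.

```python
# Trigger tag -> set of endpoint tags, transcribed from the table above.
BARRIERS = {
    "JJR": {"JJ"},
    "NN": {"DT"},
    "NNP": {"DT"},
    "NNS": {"PRP"},
    "RBR": {"JJ"},
    "VB": {"DT", "IN"},
    "VBP": {"IN"},
    "VBD": {"DT", "MD", "VB", "VBP", "VBZ", "TO"},
    "VBN": {"DT", "MD", "VB", "VBP", "VBZ", "TO"},
    "VBZ": {"PRP"},
}


def barrier_feature(pos_tags, i):
    """Return the POS tags between the trigger at index i and the nearest
    endpoint to its left (assumed definition), or None if the tag at i is
    not a trigger or no endpoint precedes it."""
    endpoints = BARRIERS.get(pos_tags[i])
    if endpoints is None:
        return None  # tag at i is not a trigger
    # Scan leftwards from the trigger until an endpoint tag is met.
    for j in range(i - 1, -1, -1):
        if pos_tags[j] in endpoints:
            return frozenset(pos_tags[j + 1:i])
    return None  # assumption: no feature when no endpoint is found


tags = ["DT", "JJ", "JJ", "NN", "VBZ"]
noun_barrier = barrier_feature(tags, 3)  # tags between DT and NN
```

Representing the feature as a set rather than a sequence keeps it insensitive to repeated tags; a sequence-valued variant would be an equally plausible reading of the table.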
Different experimental strategies of the DNER task, including with/without external resources and feature analysis
| Strategy | Description |
|---|---|
| Default configuration | Dictionary matching (CTD) + morphological regularities + context based features |
| Baseline#1_CTD | Dictionary matching (CTD) only |
| Baseline#2_w/o_res | ML system on the training set without any additional resource |
| −Dictionary matching (CTD) | w/o dictionary matching |
| −Context-based features | w/o context-based features |
| −Morphological regularities | w/o morphological regularities |
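The dictionary-matching component listed above can be pictured as a greedy longest-match lookup of token spans against a name-to-identifier lexicon. The sketch below is an assumption about how such matching might work, not a description of the system: the function `dictionary_match`, the `max_len` limit, the lowercasing, and the toy lexicon with made-up `TOY:` identifiers are all hypothetical, and a real lexicon would be built from CTD.

```python
def dictionary_match(tokens, lexicon, max_len=5):
    """Greedy longest-match lookup over a token list.

    Returns a list of (start, end, concept_id) spans, with end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in lexicon:
                match = (i, i + n, lexicon[key])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans


# Hypothetical toy lexicon; the identifiers are placeholders, not real CTD IDs.
toy_lexicon = {"renal failure": "TOY:0001", "failure": "TOY:0002"}
found = dictionary_match("Acute renal failure was observed".split(), toy_lexicon)
```

Preferring the longest span means "renal failure" wins over the nested entry "failure", which is usually the desired behaviour for disease names.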
Different experimental strategies of the CID task, including feature level analysis and classifier combinations
| Strategy | Description |
|---|---|
| DLC | Entity pair in the entire abstract |
| SLC | Entity pair within a single sentence |
| Combo ( | OR of the outputs of the two classifiers DLC and SLC |
| Combo ( | The output of SLC is added as a feature for DLC |
| Combo ( | Linear combination of the output of the two classifiers with equal weights |
| Combo ( | Linear combination of the output of the two classifiers with weights computed as in combination of the basic classifiers |
| Basic feats | Features of the two entities + binary relation features |
| All-feats | Added three new features (Chemical in title; Disease in title; Core Chemical) |
| BFs | (see Features) |
| Word embeddings | (i) 1 feature (ii) 500 features (see Features) |
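Two of the combination strategies in the table, the OR of the two classifier outputs and the linear combination of their outputs, can be sketched as follows. This assumes each classifier yields a signed decision score per Chemical-Disease pair (positive meaning "relation") and that the combined score is thresholded at zero; the function names, the zero threshold and the score convention are illustrative assumptions rather than details taken from the article.

```python
def combo_or(dlc_label, slc_label):
    """Predict a relation if either classifier predicts one."""
    return dlc_label or slc_label


def combo_linear(dlc_score, slc_score, w_dlc=0.5, w_slc=0.5, threshold=0.0):
    """Weighted sum of the two decision scores, thresholded (assumed at 0).

    The defaults give the equal-weight variant; other weights can be passed
    for the weighted variant.
    """
    return w_dlc * dlc_score + w_slc * slc_score > threshold


# A pair accepted by DLC but rejected by SLC.
by_or = combo_or(True, False)
by_equal_weights = combo_linear(1.2, -0.4)
```

The OR combination can only add positive predictions, so it tends to trade precision for recall, whereas the linear combination lets a confident negative score from one classifier override a weak positive from the other.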
Results of entity normalization and mention detection (in brackets) on the development set
| | P | R | F1 |
|---|---|---|---|
| Chemical | 88.11(92.24) | 88.05(86.95) | 88.08(89.51) |
| Disease | 84.31(83.50) | 77.57(80.75) | 80.80(82.10) |
| Chemical+Disease | 86.09(88.32) | 82.26(84.20) | 84.13(86.21) |
| baseline#1_CTD | 76.03(81.07) | 64.01(69.47) | 69.51(74.82) |
| baseline#2_w/o_res | 88.14(78.40) | 64.13(64.21) | 74.24(70.60) |
Results of entity normalization and mention detection (in brackets) on the test set
| | P | R | F1 |
|---|---|---|---|
| Chemical | 88.57(93.50) | 88.57(89.71) | 88.57(91.57) |
| Disease | 86.82(84.15) | 81.84(82.21) | 84.26(83.17) |
| Chemical+Disease | 87.58(89.24) | 84.66(86.33) | 86.09(87.76) |
The system's official results are in bold.
Variation in results of entity normalization and mention detection (in brackets) when we remove one type of information at a time
| | P | R | F1 |
|---|---|---|---|
| Chemical+Disease entities | |||
| −Dictionary matching (CTD) | +1.84(−1.32) | −17.62(−5.95) | −9.62(−3.81) |
| −Context-based features | −3.46(−9.41) | −0.3(−2.26) | −1.84(−5.82) |
| −Morphological regularities | −1.63(−0.43) | +0.59(−0.23) | −0.43(−0.48) |
| Disease entities | |||
| −Dictionary matching (CTD) | +0.89(−2.66) | −13.94(−4.57) | −7.95(−3.66) |
| −Context-based features | −2.53(−10.53) | −0.75(−4.03) | −1.58(−7.30) |
| −Morphological regularities | −0.89(−0.39) | +0.70(−0.61) | −0.04(−0.50) |
Results of different configurations of the RE system
| | Document-level P | Document-level R | Document-level F1 | Sentence-level P | Sentence-level R | Sentence-level F1 | Combo P | Combo R | Combo F1 |
|---|---|---|---|---|---|---|---|---|---|
| GSE | |||||||||
| Basic feats | 42.41 | 77.39 | 54.79 | 47.96 | 56.37 | 51.83 | 40.33 | 80.30 | 53.70 |
| All-feats | 44.18 | 79.08 | 56.69 | 49.47 | 57.22 | 53.06 | 43.05 | 80.01 | 55.98 |
| ARE | |||||||||
| BioCreative V | 35.39 | 56.47 | 43.51 | ||||||
| Basic feats | 37.98 | 61.06 | 46.83 | 52.01 | 19.41 | 28.27 | 37.54 | 61.81 | 46.72 |
| All-feats | 40.31 | 63.03 | 49.17 | 53.94 | 19.23 | 28.35 | 40.14 | 63.22 | 49.10 |
Basic features: the ones used for the official submission; All features: includes the three new features (Chemical in title; Disease in title; Core Chemical).
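Each group of three values in the table above reads as precision, recall and F1, with F1 the harmonic mean of the first two. A short check, using a helper `f1` introduced here only for illustration, reproduces two of the reported document-level figures:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Document-level rows from the table above.
gse_basic = round(f1(42.41, 77.39), 2)     # GSE, Basic feats
are_all_feats = round(f1(40.31, 63.03), 2)  # ARE, All-feats
```

With these inputs the helper gives 54.79 and 49.17, matching the corresponding table entries.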
Results with word-embedding features
| | Document-level P | Document-level R | Document-level F1 | Sentence-level P | Sentence-level R | Sentence-level F1 | Combo P | Combo R | Combo F1 |
|---|---|---|---|---|---|---|---|---|---|
| GSE | |||||||||
| Chemical-Disease similarity (1 feature) | 44.66 | 79.55 | 57.20 | 49.35 | 56.85 | 52.83 | 43.30 | 80.39 | 56.29 |
| Average of the FVs of words (500 features) | 44.79 | 79.36 | 57.26 | 49.35 | 57.04 | 52.92 | 43.68 | 80.39 | 56.61 |
| ARE | |||||||||
| Chemical-Disease similarity (1 feature) | 39.65 | 63.60 | 48.85 | 53.79 | 19.32 | 28.43 | 39.47 | 63.79 | 48.76 |
| Average of the FVs of words (500 features) | 39.89 | 63.13 | 48.89 | 53.75 | 19.51 | 28.63 | 39.67 | 63.23 | 48.75 |
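The two embedding-based feature sets in the table can be illustrated with a small sketch. It assumes that the single Chemical-Disease similarity feature is a cosine similarity between the two entity vectors and that the 500 features are the component-wise average of 500-dimensional word vectors; the cosine choice and both function names are assumptions made for illustration, and the toy vectors below use two dimensions for readability.

```python
import math


def average_vector(vectors):
    """Component-wise mean of equally sized word vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors standing in for 500-dimensional embeddings.
chemical_vec = average_vector([[1.0, 0.0], [0.0, 1.0]])
disease_vec = [1.0, 1.0]
similarity_feature = cosine(chemical_vec, disease_vec)
```

Averaging gives a fixed-length representation regardless of how many words a mention or context contains, which is what allows it to be used directly as a block of classifier features.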
Results of different combination strategies of the two classifiers
| P | R | F1 | |
|---|---|---|---|
| GSE | |||
| 43.05 | 80.01 | 55.98 | |
| 43.18 | 50.84 | 46.70 | |
| 44.39 | 76.92 | 56.29 | |
| 43.05 | 80.01 | 55.98 | |
| ARE | |||
| 40.14 | 63.22 | 49.10 | |
| 51.12 | 16.97 | 25.49 | |
| 24.49 | 70.63 | 36.37 | |
| 24.39 | 71.01 | 36.31 |
Number of FNs in the results of the two classifiers
| | Total | Same sentence | Other |
|---|---|---|---|
| GSE | |||
| DLC | 261 | 96 | 165 |
| SLC | 551 | 183 | 368 |
| ARE | |||
| DLC | 431 | 216 | 215 |
| SLC | 947 | 579 | 368 |
Number of FPs in the results of the two classifiers
| | Total | Same sentence | Other |
|---|---|---|---|
| GSE | |||
| DLC | 1002 | 671 | 331 |
| SLC | 506 | 506 | 0 |
| ARE | |||
| DLC | 930 | 242 | 688 |
| SLC | 96 | 96 | 0 |