Saber A Akhondi1, Ewoud Pons1, Zubair Afzal1, Herman van Haagen1, Benedikt F H Becker1, Kristina M Hettne2, Erik M van Mulligen1, Jan A Kors3.
Abstract
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small.
Database URL: http://biosemantics.org/chemdner-patents
Year: 2016 PMID: 27141091 PMCID: PMC4852402 DOI: 10.1093/database/baw061
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Characteristics of the CHEMDNER patent corpus
| | Training | Development | Test | Total |
|---|---|---|---|---|
| Patent records | 7000 | 7000 | 7000 | 21 000 |
| Manual chemical annotations | 33 543 | 32 142 | 33 949 | 99 634 |
| Unique chemical annotations | 11 977 | 11 386 | 11 433 | 34 796 |
| Chemical-related titles and abstracts | 9152 | 8937 | 9270 | 27 359 |
Number of compounds and unique identifiers in chemical databases
| Database | No. of compounds | No. of identifiers |
|---|---|---|
| ChEBI | 23 240 | 82 612 |
| ChEMBL | 22 245 | 28 411 |
| DrugBank | 6516 | 31 948 |
| HMDB | 40 199 | 228 907 |
| NPC | 14 666 | 128 153 |
| PubChem | 4 235 189 | 19 049 175 |
| TTD | 3196 | 121 744 |
Number of unique identifiers that overlap between pairs of chemical databases
| Database | ChEBI | ChEMBL | DrugBank | HMDB | NPC | PubChem |
|---|---|---|---|---|---|---|
| ChEMBL | 1209 (4.3) | | | | | |
| DrugBank | 2444 (7.6) | 3931 (13.8) | | | | |
| HMDB | 4885 (5.9) | 2293 (8.1) | 5946 (18.6) | | | |
| NPC | 3406 (4.1) | 6508 (22.9) | 23 865 (74.7) | 7444 (5.8) | | |
| PubChem | 45 021 (54.5) | 26 251 (92.4) | 28 943 (90.6) | 52 533 (22.9) | 69 873 (54.5) | |
| TTD | 4481 (5.4) | 4507 (15.9) | 18 028 (56.4) | 6503 (5.3) | 23 901 (19.6) | 119 819 (98.4) |
The overlap as a percentage of the identifiers in the smaller database of each pair is given in parentheses.
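As a quick check on how the parenthesised figures are derived, the sketch below divides an overlap count by the identifier count of the smaller database in the pair. The function name `coverage_of_smaller` and the dictionary layout are illustrative choices, not part of the original work; the counts are copied from the two tables above.

```python
# Identifier counts per database, copied from the table of database sizes.
identifier_counts = {
    "ChEBI": 82_612,
    "ChEMBL": 28_411,
    "DrugBank": 31_948,
    "HMDB": 228_907,
    "NPC": 128_153,
    "PubChem": 19_049_175,
    "TTD": 121_744,
}


def coverage_of_smaller(db_a, db_b, overlap):
    """Overlap as a percentage of the identifiers in the smaller database."""
    smaller = min(identifier_counts[db_a], identifier_counts[db_b])
    return 100.0 * overlap / smaller


# ChEBI vs ChEMBL: 1209 shared identifiers, ChEMBL is the smaller database.
print(round(coverage_of_smaller("ChEBI", "ChEMBL", 1209), 1))
# PubChem vs TTD: 119 819 shared identifiers, TTD is the smaller database.
print(round(coverage_of_smaller("PubChem", "TTD", 119_819), 1))
```

Both calls reproduce the tabulated values (4.3 and 98.4), which supports reading the parentheses as overlap divided by the size of the smaller database.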
Performance of different dictionaries and dictionary combinations with and without removal of exclusion terms
| Dictionary | Precision (without exclusion) | Recall (without exclusion) | F-score (without exclusion) | Precision (with exclusion) | Recall (with exclusion) | F-score (with exclusion) |
|---|---|---|---|---|---|---|
| ChEBI | 56.51 | 29.47 | 38.74 | 78.87 | 28.42 | 41.79 |
| ChEMBL | 84.53 | 20.46 | 32.94 | 85.11 | 19.87 | 32.22 |
| DrugBank | 68.20 | 17.28 | 27.58 | 85.15 | 16.89 | 28.19 |
| HMDB | 66.11 | 29.38 | 40.68 | 79.59 | 28.19 | 41.63 |
| NPC | 30.90 | 44.85 | 36.59 | 55.23 | 30.61 | 39.39 |
| TTD | 66.89 | 14.07 | 23.24 | 80.90 | 13.89 | 23.71 |
| PubChem | 34.30 | 47.11 | 39.69 | 67.03 | 45.64 | 54.30 |
| All combined | 30.85 | 50.32 | 38.25 | 53.66 | 48.59 | 51.00 |
| ChEBI–HMDB | 55.46 | 36.98 | 44.37 | 78.12 | 35.45 | 48.77 |
| ChEMBL–DrugBank | 70.51 | 23.94 | 35.74 | 83.02 | 23.16 | 36.21 |
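The third figure in each group of three values in the table above is consistent with the balanced F-score, i.e. the harmonic mean of precision and recall. A minimal sketch, assuming that definition (the helper name `f_score` is hypothetical and not taken from the original system):

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean) of precision and recall, in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ChEBI, without removal of exclusion terms.
print(round(f_score(56.51, 29.47), 2))
# PubChem, with removal of exclusion terms.
print(round(f_score(67.03, 45.64), 2))
```

Because the tabulated precision and recall are themselves rounded, a recomputed F-score can differ from the published one in the last decimal place.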
Performance of the ensemble system trained on the training set and tested on the development set
| System | CEMP precision | CEMP recall | CEMP F-score | CPD sensitivity | CPD specificity | CPD accuracy |
|---|---|---|---|---|---|---|
| Dictionary-based (ChEMBL-DrugBank) | 70.51 | 23.94 | 35.74 | 50.63 | 88.41 | 64.29 |
| + Exclusion list | 83.02 | 23.16 | 36.21 | 44.29 | 94.37 | 62.40 |
| + Term removal (exclusion ratio 0.3) | 88.85 | 23.09 | 36.65 | 42.14 | 97.12 | 62.02 |
| + CRF original features | 84.96 | 83.83 | 84.39 | 95.11 | 85.33 | 91.57 |
| + Post-processing (CRF) | 84.50 | 84.91 | 84.70 | 95.39 | 85.01 | 91.64 |
| + POS and lemmatization features | 84.72 | 85.09 | 84.90 | 95.40 | 85.25 | 91.73 |
| + Word-vector cluster features | 84.88 | 85.55 | 85.21 | 95.31 | 84.87 | 91.54 |
| + Missed terms (exclusion ratio 0.5) | 75.88 | 88.63 | 81.76 | 97.00 | 82.74 | 91.84 |
Performance of different systems on the test set
| System | CEMP precision | CEMP recall | CEMP F-score | CPD sensitivity | CPD specificity | CPD accuracy |
|---|---|---|---|---|---|---|
| Statistical | 86.83 | 86.81 | 86.82 | 96.13 | 88.67 | 93.61 |
| Statistical + dictionary without missed terms | 84.92 | 88.25 | 86.55 | 97.00 | 87.91 | 93.93 |
| Statistical + dictionary with missed terms | 77.76 | 90.84 | 83.79 | 98.03 | 86.79 | 94.23 |
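The CPD accuracies can be cross-checked against sensitivity and specificity once the share of chemical-related passages is known. The sketch below assumes that each of the 7000 test patent records contributes two passages (a title and an abstract), giving 14 000 passages, of which 9270 are chemical-related according to the corpus table. That passage count is an inference from the table, not something the source states explicitly, and the function name `accuracy` is illustrative.

```python
def accuracy(sensitivity, specificity, positives, total):
    """Accuracy (%) as the prevalence-weighted mean of sensitivity and specificity."""
    prevalence = positives / total
    return sensitivity * prevalence + specificity * (1 - prevalence)


# Assumption: 7000 test records x 2 passages (title + abstract) each.
TEST_PASSAGES = 2 * 7000
# Chemical-related titles and abstracts in the test set (corpus table).
TEST_CHEMICAL_PASSAGES = 9270

# Statistical system on the test set.
print(round(accuracy(96.13, 88.67, TEST_CHEMICAL_PASSAGES, TEST_PASSAGES), 2))
# Statistical + dictionary with missed terms on the test set.
print(round(accuracy(98.03, 86.79, TEST_CHEMICAL_PASSAGES, TEST_PASSAGES), 2))
```

Under this assumption the recomputed values agree with the tabulated accuracies of 93.61 and 94.23, so the three CPD columns appear mutually consistent.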