David Campos, Sérgio Matos, José L. Oliveira.
Abstract
BACKGROUND: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text.
Keywords: Chemicals; Conditional Random Fields; Named Entity Recognition
Year: 2015 PMID: 25810778 PMCID: PMC4331697 DOI: 10.1186/1758-2946-7-S1-S7
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
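The "Named Entity Recognition" and "Conditional Random Fields" keywords point to the standard sequence-labeling formulation of this task: each token receives a B/I/O tag, and contiguous tagged runs are decoded into entity mentions. A minimal sketch of the decoding step follows, using toy data (not taken from the paper):

```python
# NER as sequence labeling: each token gets a B (begin), I (inside), or O (outside) tag.
tokens = ["The", "effect", "of", "acetic", "acid", "on", "cells"]
tags   = ["O",   "O",      "O",  "B",      "I",    "O",  "O"]

def decode(tokens, tags):
    """Collect contiguous B/I runs into entity mention strings."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(decode(tokens, tags))  # ['acetic acid']
```

A CRF model predicts the tag sequence; the decoding logic above is independent of how the tags were produced.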
Iterative feature elimination results.
| Group | Feature | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|---|
| | Windows | 78.65 | 78.75 | 78.70 |
| | Conjunctions | 81.37 | 85.83 | **83.54** |
| Base | Token | 81.32 | 85.86 | 83.53 |
| | Lemma | 78.51 | 83.80 | 81.07 |
| Linguistic | POS | 81.75 | 73.21 | 77.25 |
| | Chunk | 82.93 | 80.83 | 81.86 |
| | Dependency parsing | 85.88 | 82.78 | **84.30** |
| | Capitalization | 85.97 | 83.04 | *84.48* |
| Orthographic | Counting | 85.86 | 83.09 | 84.45 |
| | Symbols | 85.99 | 82.96 | 84.45 |
| | Char n-grams | 85.88 | 82.53 | 84.17 |
| Morphological | Suffix | 85.74 | 83.02 | 84.36 |
| | Prefix | 85.93 | 83.03 | 84.45 |
| | Word shape | 85.83 | 82.42 | 84.09 |
| Lexicons | Chemicals | 85.33 | 81.48 | 83.36 |
The first line shows the results obtained with the full set of features, together with windows or conjunctions of features. The following lines show results after iterative and cumulative removal of features. Values in bold indicate improvements over the previous best result; the italic value indicates the best result, obtained by removing dependency parsing and capitalization features.
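The F-measure column is the standard harmonic mean of precision and recall, which can be verified against any row of the table:

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reproduce the 'Conjunctions' row: P=81.37, R=85.83
print(round(f_measure(81.37, 85.83), 2))  # 83.54
```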
Impact of features on recall for each different class.
| Group | Feature | Multiple | Family | Abbrev | Systematic | Formula | Identifier | Trivial |
|---|---|---|---|---|---|---|---|---|
| | Conjunctions (all features) | 42.02 | 82.22 | 81.09 | 90.21 | 79.60 | 73.08 | 91.46 |
| Base | Token | +1.06 | -0.19 | +0.55 | -0.09 | -0.15 | 0.00 | +0.07 |
| | Lemma | +5.32 | -2.25 | -1.46 | -2.95 | -1.16 | -4.85 | -1.87 |
| Linguistic | POS | -18.62 | -12.65 | -10.93 | -13.88 | -16.65 | -24.88 | -9.61 |
| | Chunk | -22.87 | -6.68 | -4.34 | -4.91 | -6.14 | -12.83 | -3.13 |
| | Dependency | 0.00 | -3.72 | -3.76 | -1.44 | -4.45 | -9.39 | -2.52 |
| | Capitalization | -0.53 | +0.43 | +0.15 | +0.26 | +0.31 | -0.94 | +0.30 |
| Orthographic | Counting | 0.00 | +0.24 | -0.27 | +0.16 | -0.12 | -1.72 | +0.26 |
| | Symbols | -1.06 | -0.12 | -0.11 | 0.00 | -0.24 | -0.16 | +0.01 |
| | Char n-grams | +1.60 | -2.51 | -0.64 | -0.43 | -1.35 | +4.07 | +0.47 |
| Morphological | Suffix | +1.06 | -0.09 | -0.33 | +0.16 | +0.51 | +0.63 | -0.27 |
| | Prefix | 0.00 | -0.64 | -0.09 | +0.10 | -0.05 | +0.31 | +0.22 |
| | Word shape | +1.60 | -0.07 | -0.97 | -0.18 | -1.47 | -2.50 | -0.57 |
| Lexicons | Chemicals | +2.66 | -0.19 | -2.10 | -0.78 | -0.75 | -10.95 | -2.34 |
Class recall is given in percent for the baseline (first row); all other rows show differences from the baseline in percentage points.
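The orthographic and morphological feature groups in the table above correspond to simple token-level predicates. The sketch below illustrates such features for a single token; the names and exact definitions are hypothetical, and the paper's actual feature set may differ:

```python
import re

def token_features(token: str) -> dict:
    """Illustrative orthographic/morphological features for one token
    (hypothetical feature names; not the paper's exact implementation)."""
    # Word shape: collapse character classes (e.g. 'CHF3' -> 'AAA0').
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return {
        "token": token,
        "lower": token.lower(),
        "prefix3": token[:3],                 # morphological: prefix
        "suffix3": token[-3:],                # morphological: suffix
        "capitalized": token[:1].isupper(),   # orthographic: capitalization
        "has_digit": any(c.isdigit() for c in token),   # orthographic: counting
        "has_symbol": bool(re.search(r"[-()\[\],]", token)),  # orthographic: symbols
        "word_shape": shape,
        "char_3grams": [token[i:i + 3] for i in range(max(1, len(token) - 2))],
    }

print(token_features("CHF3")["word_shape"])  # AAA0
```

In a CRF, each of these predicates becomes a binary feature on the token (and, with windows and conjunctions, on its neighbours).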
Final evaluation results on the CHEMDNER test set.
| | Entity Mention | | | Document Indexing | | |
|---|---|---|---|---|---|---|
| System | Precision (%) | Recall (%) | F-measure (%) | Precision (%) | Recall (%) | F-measure (%) |
| Top scoring | 89.09 | 85.75 | 87.39 | 87.02 | 89.41 | 88.20 |
| Official best run | 86.50 | 85.66 | 86.08 | 86.35 | 82.37 | 84.31 |
| Corrected run | 87.35 | 86.49 | 86.92 | 87.07 | 87.97 | 87.52 |
| Δ vs. official run | +0.85 | +0.83 | +0.84 | +0.72 | +5.60 | +3.21 |
| 1st-order CRF | 88.04 | 84.89 | 86.44 | 88.00 | 86.42 | 87.20 |
| Δ vs. corrected run | +0.69 | -1.60 | -0.48 | +0.93 | -1.55 | -0.31 |
| 2nd-order CRF | 88.35 | 83.79 | 86.01 | 88.14 | 86.65 | 87.39 |
| Δ vs. corrected run | +1.01 | -2.71 | -0.91 | +1.08 | -1.32 | -0.13 |
| Post-processing | 88.67 | 86.32 | 87.48 | 87.68 | 87.81 | 87.75 |
| Δ vs. corrected run | +1.32 | -0.17 | +0.56 | +0.61 | -0.16 | +0.23 |
The 'Corrected run' row shows results obtained with the same models as in the official run, after correcting the generation of the annotation files. Results obtained with the first- and second-order CRF models alone (without model combination) and after post-processing are shown with differences relative to the corrected run. Differences are given in percentage points.
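The model-combination rows can be illustrated with a minimal sketch: take the union of the spans predicted by the two CRF models, and resolve overlapping predictions by preferring the longer span. This tie-breaking rule is an assumption for illustration, not necessarily the paper's exact combination strategy:

```python
def combine(spans_a, spans_b):
    """Union of two models' (start, end) spans; on overlap, keep the
    longer span (assumed heuristic, not necessarily the paper's rule)."""
    merged = []
    # Consider candidate spans longest-first so longer spans win conflicts.
    for span in sorted(set(spans_a) | set(spans_b),
                       key=lambda s: s[1] - s[0], reverse=True):
        if all(span[1] <= m[0] or span[0] >= m[1] for m in merged):
            merged.append(span)
    return sorted(merged)

# Model A and model B disagree on the first mention's boundary.
print(combine([(0, 5), (10, 14)], [(0, 7), (20, 25)]))
# [(0, 7), (10, 14), (20, 25)]
```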
Figure 1. Results obtained on the CHEMDNER test set, using a combination of a first-order and a second-order CRF model, trained using the selected feature set.
Figure 2. Web interface for recognition and annotation of chemical entities in text. Available at: http://bioinformatics.ua.pt/becas-chemicals
Figure 3. Overall architecture of the described solution, presenting the pipeline of required steps, tools and external resources. Boxes with dotted lines indicate optional processing modules.
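The pipeline architecture of Figure 3 can be approximated as a sequence of processing stages applied in order to a shared document object. The stage names and the toy recognizer below are illustrative only, not the tool's actual modules:

```python
def run_pipeline(text, stages):
    """Apply each processing stage in order to a shared document dict."""
    doc = {"text": text, "tokens": [], "entities": []}
    for stage in stages:
        doc = stage(doc)
    return doc

def tokenize(doc):
    # Stand-in for the pipeline's real sentence splitting and tokenization.
    doc["tokens"] = doc["text"].split()
    return doc

def tag_chemicals(doc):
    # Toy recognizer: flag tokens ending in '-ol' (illustrative only;
    # the real system uses CRF models, not a suffix rule).
    doc["entities"] = [t for t in doc["tokens"] if t.endswith("ol")]
    return doc

result = run_pipeline("ethanol dissolves in water", [tokenize, tag_chemicals])
print(result["entities"])  # ['ethanol']
```

Optional modules (the dotted boxes in the figure) map naturally onto stages that can be included or omitted from the `stages` list.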