Riza Batista-Navarro, Rafal Rak, Sophia Ananiadou.
Abstract
BACKGROUND: The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. In order to optimise its performance, we introduced customisations in various aspects of our solution. These include the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules.
Keywords: Chemical named entity recognition; Conditional random fields; Configurable workflows; Feature engineering; Sequence labelling; Text mining; Workflow optimisation
Year: 2015 PMID: 25810777 PMCID: PMC4331696 DOI: 10.1186/1758-2946-7-S1-S6
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Performance of ChER under the BioCreative IV CHEMDNER track setting.
| Custom Features | Abbr. | Comp. | CEM P | CEM R | CEM F1 | CDI P | CDI R | CDI F1 |
|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 92.76 | 81.02 | 86.49 | 91.39 | 85.29 | 88.23 |
| ✓ | ✓ | ✗ | 92.76 | 81.30 | 86.65 | 91.37 | 85.45 | 88.31 |
| ✓ | ✗ | ✓ | 92.14 | 81.41 | 86.44 | 90.55 | 85.72 | 88.07 |
| ✓ | ✓ | ✓ | 92.14 | 81.69 | 86.60 | 90.53 | 85.88 | 88.14 |
Key: Abbr. = Abbreviation recognition, Comp. = Chemical composition-based token relabelling
Comparative evaluation of ChER against state-of-the-art chemical name recognition methods.
| SciBorg (chemical molecules) | P | R | F1 | SCAI-100 (systematic names) | P | R | F1 |
|---|---|---|---|---|---|---|---|
| ChER | 85.96 | 74.22 | 79.66 | ChER | 86.70 | 67.50 | 75.90 |
| OSCAR | - | - | 81.20 | ChemSpot | 57.47 | 67.70 | 62.17 |
OSCAR's F1 score was taken from the paper of Corbett et al. [15].
Comparative evaluation of ChER against a state-of-the-art metabolite name recognition method.
| NaCTeM Metabolites | P | R | F1 |
|---|---|---|---|
| ChER | 81.42 | 79.66 | 80.53 |
| MetaboliNER | 83.02 | 74.42 | 78.49 |
Applicability of ChER with the CHEMDNER model to other chemical corpora.
| System | SCAI-100 (all names): P | R | F1 | Patents: P | R | F1 |
|---|---|---|---|---|---|---|
| ChER | 77.85 | 78.69 | 78.27 | 73.43 | 57.91 | 64.75 |
| ChemSpot | 76.35 | 72.55 | 74.41 | 67.79 | 41.97 | 51.84 |
| OSCAR4 | 50.88 | 81.34 | 62.60 | 49.90 | 60.73 | 54.79 |
Applicability of ChER with the CHEMDNER model to drug corpora.
| System | DDI test: P | R | F1 | PK: P | R | F1 |
|---|---|---|---|---|---|---|
| ChER | 75.88 | 92.05 | 83.18 | 79.83 | 88.34 | 83.87 |
| ChemSpot | 73.09 | 89.49 | 80.46 | 65.29 | 86.07 | 74.25 |
| OSCAR4 | 60.20 | 85.51 | 70.66 | 42.65 | 81.71 | 56.04 |
Applicability of ChER with the CHEMDNER model to the NaCTeM Metabolites corpus.
| NaCTeM Metabolites | P | R | F1 |
|---|---|---|---|
| ChER | 65.08 | 83.29 | 73.07 |
| ChemSpot | 58.02 | 73.99 | 65.04 |
| OSCAR4 | 35.37 | 84.18 | 49.81 |
Figure 1. The chemical entity recogniser in Argo. The proposed chemical entity recogniser is available as a processing component in the Web-based text mining workbench Argo. The component is shown here as part of two individual workflows. The left-hand-side workflow produces an RDF file containing annotated chemicals in user-specified PubMed abstracts. The right-hand-side workflow reports effectiveness metrics for the CHEMDNER corpus.
Character and word n-gram features extracted by NERsuite by default.
| Feature | Brief description | Sample features (bigrams) |
|---|---|---|
| Character | the set of all possible combinations of a token's consecutive characters, taken n at a time (n = 2, 3, 4) | { |
| Token | unigrams and bigrams of surface forms; unigrams and bigrams of normalised surface forms, in which digits are replaced with '0's and consecutive '0's are compressed into one | { |
| Lemma | unigrams and bigrams of lemmatised surface forms | { |
| POS tag | unigrams and bigrams of part-of-speech (POS) tags | { |
| Lemma & POS tag | unigrams and bigrams of lemmatised forms combined with POS tags | { |
| Chunk information | chunk tag of current token; surface form of the enclosing chunk's | { |
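
The character n-gram and normalised-token features described in the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not NERsuite's actual implementation: the function names `char_ngrams`, `normalise_token` and `token_bigrams` are hypothetical, and the normalisation simply follows the table's description (digits become '0', runs of '0' collapse to one).

```python
import re


def char_ngrams(token, sizes=(2, 3, 4)):
    """Return the set of consecutive-character n-grams of a token (hypothetical helper)."""
    return {
        token[i:i + n]
        for n in sizes
        for i in range(len(token) - n + 1)
    }


def normalise_token(token):
    """Replace every digit with '0', then compress runs of '0' into a single '0'."""
    zeroed = re.sub(r"\d", "0", token)
    return re.sub(r"0+", "0", zeroed)


def token_bigrams(tokens):
    """Pair each token with its right-hand neighbour."""
    return list(zip(tokens, tokens[1:]))
```

For instance, `normalise_token` maps a token such as `GSK214a` to a form in which the three digits collapse into a single `0`, so identifiers that differ only in their numbers share one normalised feature.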
Example of a sentence tokenised and labelled with part-of-speech and chunk tags.
| Surface form | Lemma | Part-of-speech tag | Chunk tag |
|---|---|---|---|
| | It | PRP | B-NP |
| | attenuate | VBD | B-VP |
| | GSK214a | NN | B-NP |
| | -induced | JJ | I-NP |
| | gestation | NN | I-NP |
| | in | IN | B-PP |
| | rat | NN | B-NP |
| . | . | . | O |
Orthographic features extracted by NERsuite by default.
| Feature | Example |
|---|---|
| Initial letter is in uppercase | |
| Contains only digits | |
| Contains digits | |
| Contains only alphanumeric characters | |
| Contains only uppercase letters and digits | |
| Contains only uppercase letters | |
| Does not contain any lowercase letters | |
| Contains non-initial uppercase letters | |
| Contains two consecutive uppercase letters | |
| Has a Greek letter name as a substring | |
| Contains a comma | |
| Contains a full stop | |
| Contains a hyphen | |
| Contains a forward slash | |
| Contains an opening square bracket | |
| Contains a closing square bracket | |
| Contains an opening parenthesis | |
| Contains a closing parenthesis | |
| Contains a semi-colon | |
| Contains a percentage symbol | |
| Contains an apostrophe | |
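
A few of the orthographic features listed above can be expressed as simple predicates over a token. The sketch below is illustrative only and does not reproduce NERsuite's code: the feature names, the `GREEK_LETTER_NAMES` subset and the case-insensitive Greek-name match are all assumptions.

```python
import re

# Illustrative subset of Greek letter names (assumption, not the tool's full list).
GREEK_LETTER_NAMES = ("alpha", "beta", "gamma", "delta", "epsilon")

# Hypothetical feature names mapped to predicates over a single token.
ORTHOGRAPHIC_TESTS = {
    "initial_upper": lambda t: t[:1].isupper(),
    "only_digits": lambda t: t.isdigit(),
    "has_digit": lambda t: any(c.isdigit() for c in t),
    "only_alnum": lambda t: t.isalnum(),
    "no_lowercase": lambda t: not any(c.islower() for c in t),
    "non_initial_upper": lambda t: any(c.isupper() for c in t[1:]),
    "two_consecutive_upper": lambda t: re.search(r"[A-Z]{2}", t) is not None,
    "has_greek_name": lambda t: any(g in t.lower() for g in GREEK_LETTER_NAMES),
    "has_hyphen": lambda t: "-" in t,
    "has_open_paren": lambda t: "(" in t,
}


def orthographic_features(token):
    """Return the names of all orthographic tests that the token satisfies."""
    return {name for name, test in ORTHOGRAPHIC_TESTS.items() if test(token)}
```

Each satisfied predicate would become one binary feature of the token in the sequence labelling model.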
Example of a token sequence tagged with matches against chemical dictionaries.
| Token | Normal form | ChEBI | DrugBank | CTD | PubChem | Jochem |
|---|---|---|---|---|---|---|
| | for | O | O | O | O | O |
| | the | O | O | O | O | O |
| | preparation | O | O | O | O | O |
| | of | O | O | O | O | O |
| | hydrogel | O | O | B | O | B |
| | microsphere | O | O | O | O | O |
| | base | O | O | O | O | O |
| | on | O | O | O | O | O |
| | hydroxyethyl | O | O | B | O | B |
| | starch | B | O | I | O | I |
| | _ | B | O | O | O | O |
| | hydroxyethyl | I | O | B | O | B |
| | methacrylate | I | O | I | B | I |
| | _ | O | O | O | O | O |
| | hes_hema | O | O | O | O | O |
| | _ | O | O | O | O | O |
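
The B/I/O dictionary tags shown above can be produced by matching spans of normalised tokens against a dictionary. The following sketch assumes a greedy longest-match strategy, which this excerpt does not confirm. The function name `tag_with_dictionary`, the `max_len` parameter and the toy dictionary in the usage note are hypothetical.

```python
def tag_with_dictionary(tokens, entries, max_len=5):
    """Greedy longest-match tagging of normalised tokens against a dictionary.

    `entries` is a set of tuples of normalised tokens (an assumed representation).
    Returns one label per token: 'B' (begin), 'I' (inside) or 'O' (outside).
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in entries:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels
```

With a toy dictionary containing `("hydroxyethyl", "starch")` and `("hydroxyethyl", "methacrylate")`, the normalised tokens `on hydroxyethyl starch _ hydroxyethyl methacrylate` receive the labels O, B, I, O, B, I, the same pattern as the CTD and Jochem columns in the table. One label sequence per dictionary then serves as a separate feature column.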
Example of a token sequence tagged with matches against our affix lists.
| Token | Prefixes | | | Suffixes | | |
|---|---|---|---|---|---|---|
| | O | O | O | O | O | O |
| | O | O | O | O | O | O |
| | di | O | O | yl | O | O |
| | O | O | fluo | O | ate | O |
| | O | O | O | O | O | O |
| | O | O | O | O | O | O |
| | O | O | O | O | ate | O |
Examples of chemical names with corresponding basic segments.
| Token | Basic segments | No. of basic segments |
|---|---|---|
| | 10, acet, oxy, actin, idine | 5 |
| | methyl, ergo, novi, ne | 4 |
| | interleukin, 2 | 2 |
Performance of models learned from the CHEMDNER training set when evaluated on the development set.
| Features | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|
| Default features | 86.66 | 79.01 | 80.89 | 88.55 | 76.82 | 82.27 |
| Enriched features | 88.26 | 81.11 | 82.86 | 89.87 | 78.99 | 84.07 |
| Margin | +1.60 | +2.10 | +1.97 | +1.32 | +2.17 | +1.80 |
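
The micro-averaged F1 values above are consistent with the harmonic mean of the reported micro-averaged precision and recall, and the margins are plain differences between the two rows. The short check below recomputes both. The helper names are hypothetical, and because the inputs are already rounded to two decimals, a recomputed F1 can differ from the published figure in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def margin(enriched, default):
    """Difference between the enriched-feature and default-feature scores."""
    return round(enriched - default, 2)


# Micro-averaged precision and recall taken from the table above.
default_micro_f1 = f1_score(88.55, 76.82)
enriched_micro_f1 = f1_score(89.87, 78.99)
micro_f1_margin = margin(84.07, 82.27)
```

The macro-averaged F1 cannot be recovered this way from the macro-averaged precision and recall, since macro-averaging computes the scores per document before averaging them.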
Distribution (according to chemical subtype) of the instances incorrectly rejected by the model trained with enriched features.
| Subtype | Frequency | Percentage |
|---|---|---|
| Abbreviation | 1,882 | 30.32% |
| Formula | 1,291 | 20.80% |
| Family | 979 | 15.77% |
| Trivial | 926 | 14.92% |
| Systematic | 693 | 11.16% |
| Identifier | 293 | 4.72% |
| Multiple | 118 | 1.90% |
| No class | 25 | 0.40% |
Sample tokens and their chemical segment composition.
| Token initially recognised as non-chemical | Chemical basic segments | Ratio |
|---|---|---|
| | poly, calcium | 1.0 |
| | meth, oxy, estra, di, ol | 0.89 |
| | toxin | 0.56 |
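
One reading of the ratio column that is consistent with the values shown is the fraction of a token's characters covered by its chemical basic segments. The sketch below follows that reading. It is an interpretation rather than a definition confirmed by this excerpt, and the function name and the example token `2-methoxyestradiol` are hypothetical, since the token column is not available here.

```python
def chemical_segment_ratio(token, chemical_segments):
    """Fraction of the token's characters covered by chemical basic segments.

    Assumed definition: total length of the chemical segments divided by the
    length of the whole token.
    """
    if not token:
        return 0.0
    covered = sum(len(segment) for segment in chemical_segments)
    return covered / len(token)


# Hypothetical token, chosen only because it contains the segments in the table.
ratio = chemical_segment_ratio(
    "2-methoxyestradiol", ["meth", "oxy", "estra", "di", "ol"]
)
```

Under this assumption, a post-processing rule could relabel a token as chemical once its ratio exceeds a chosen threshold.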
Summary of ChER's performance under the CHEMDNER track setting (set 1), under similar experimental settings as state-of-the-art methods (sets 2-4), and when applied to various corpora (sets 5-9).
| Set | Training | Test | Splitter | Tokeniser | Cust. Feats. | Abbr. | Comp. | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CHEMDNER training & dev. | CHEMDNER test | LingPipe | GENIA | ✗ | ✗ | ✗ | 88.87 | 70.95 | 78.91 |
| | | | Cafetiere | OSCAR4 | ✓ | ✓ | ✓ | 92.76 | 81.30 | 86.65 |
| 2 | SciBorg (CM): 3-fold CV | SciBorg (CM): 3-fold CV | LingPipe | GENIA | ✗ | ✗ | ✗ | 80.44 | 55.16 | 65.45 |
| | | | Cafetiere | OSCAR4 | ✓ | ✓ | ✓ | 85.96 | 74.22 | 79.66 |
| 3 | SCAI-IUPAC training | SCAI-100 (IUPAC) | LingPipe | GENIA | ✗ | ✗ | ✗ | 84.78 | 66.87 | 74.77 |
| | | | Cafetiere | GENIA | ✓ | ✓ | ✓ | 86.70 | 67.50 | 75.90 |
| 4 | NaCTeM Metabolites: 10-fold CV | NaCTeM Metabolites: 10-fold CV | LingPipe | GENIA | ✗ | ✗ | ✗ | 81.72 | 64.49 | 72.09 |
| | | | Cafetiere | OSCAR4 | ✓ | ✓ | ✓ | 81.42 | 79.66 | 80.53 |
| 5 | CHEMDNER training & dev. | SCAI-100 (All) | LingPipe | GENIA | ✗ | ✗ | ✗ | 72.56 | 66.00 | 69.13 |
| | | | Cafetiere | OSCAR4 | ✓ | ✓ | ✓ | 77.85 | 78.69 | 78.27 |
| 6 | CHEMDNER training & dev. | Patents | LingPipe | GENIA | ✗ | ✗ | ✗ | 72.66 | 52.97 | 61.27 |
| | | | Cafetiere | OSCAR4 | ✓ | ✓ | ✓ | 73.43 | 57.91 | 64.75 |
| 7 | CHEMDNER training & dev. | DDI test | LingPipe | GENIA | ✗ | ✗ | ✗ | 76.52 | 75.00 | 75.75 |
| | | | Cafetiere | OSCAR4 | ✓ | • | ✓ | 75.88 | 92.05 | 83.18 |
| 8 | CHEMDNER training & dev. | PK | LingPipe | GENIA | ✗ | ✗ | ✗ | 79.29 | 84.66 | 81.89 |
| | | | Cafetiere | GENIA | ✓ | ✓ | ✓ | 79.83 | 88.34 | 83.87 |
| 9 | CHEMDNER training & dev. | NaCTeM Metabolites | LingPipe | GENIA | ✗ | ✗ | ✗ | 63.57 | 71.63 | 67.36 |
| | | | Cafetiere | OSCAR4 | ✓ | ✓ | ✓ | 65.08 | 83.29 | 73.07 |
The first row in each set corresponds to the baseline. Key: Cust. Feats. = Custom Features, Abbr. = Abbreviation recognition, Comp. = Chemical composition-based token relabelling; ✓ = enabled, ✗ = disabled, • = enabling or disabling makes no difference in performance.