Anabel Usié, Joaquim Cruz, Jorge Comas, Francesc Solsona, Rui Alves.
Abstract
BACKGROUND: Small chemical molecules regulate biological processes at the molecular level. Such molecules are often involved in causing or treating pathological states. Automatically identifying them in biomedical text is difficult due to both the diverse morphology of chemical names and the alternative types of nomenclature that are simultaneously used to describe them. To address these issues, the last BioCreAtIvE challenge proposed the CHEMDNER task, a Named Entity Recognition (NER) challenge that aims at labelling different types of chemical names in biomedical text.
Year: 2015 PMID: 25810772 PMCID: PMC4331691 DOI: 10.1186/1758-2946-7-S1-S15
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Examples of chemical entity recognition applications.
| Applications | Availability |
|---|---|
| ProMiner | CL |
| Whatizit | F |
| Chemical Reader (MDL and TEMIS) | CL |
| Oscar3/4 | F |
| K&K CRF | NA |
| ChemicalTagger | F |
| SureChem | CL (TVA) |
| ChemFinder (ChemBioFinder) | CL (TVA) |
| Chemical Name Spotter (UIMA, IBM) | CL |
| ChemSpot | F |
| CheNER | F |
CL: Commercial License, NA: Not Available, F: Free, TVA: Trial Version Available
Sets of approaches combining CRFs, dictionary matching, and regular expression matching in five different ways.
| Run | Description |
|---|---|
| 1 | Combines a CRF to identify SYSTEMATIC entities with dictionary matching to identify TRIVIAL, FAMILY, and ABBREVIATION entities, and regular expression matching to identify FORMULA and IDENTIFIER entities. |
| 2 | Combines individual CRFs to identify SYSTEMATIC and TRIVIAL entities with dictionary matching to identify FAMILY and ABBREVIATION entities, and regular expression matching to identify FORMULA and IDENTIFIER entities. |
| 3 | Uses a single CRF to identify SYSTEMATIC, TRIVIAL, FAMILY, ABBREVIATION, FORMULA and IDENTIFIER entities. |
| 4 | Combines individual CRFs to identify SYSTEMATIC, TRIVIAL, FAMILY, ABBREVIATION, and FORMULA entities with regular expression matching to identify IDENTIFIER entities. |
| 5 | Uses a single CRF to identify SYSTEMATIC, TRIVIAL, FAMILY, ABBREVIATION, FORMULA and IDENTIFIER entities and specifically labels each class of entity. |
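As a rough illustration of the hybrid pipelines that Runs 1, 2, and 4 describe, the sketch below merges the output of a dictionary tagger and a regular-expression tagger over a tokenized sentence, keeping the first tagger's label when spans overlap. The taggers, dictionary entries, and identifier pattern are invented stand-ins, not CheNER's actual components; in a real run the CRF output would simply be another tagger in the list.

```python
import re

# Toy stand-ins for the dictionary and regex components of a Run-1-style
# pipeline (hypothetical entries/pattern, for illustration only).
TRIVIAL_DICT = {"aspirin", "caffeine"}
IDENTIFIER_RE = re.compile(r"CHEMBL\d+")

def dictionary_tagger(tokens):
    # Tag single tokens found in the trivial-name dictionary.
    return [(i, i + 1, "TRIVIAL") for i, t in enumerate(tokens)
            if t.lower() in TRIVIAL_DICT]

def regex_tagger(tokens):
    # Tag tokens that fully match the identifier pattern.
    return [(i, i + 1, "IDENTIFIER") for i, t in enumerate(tokens)
            if IDENTIFIER_RE.fullmatch(t)]

def combine(tokens, taggers):
    """Merge annotations from several taggers; earlier taggers win overlaps."""
    taken, merged = set(), []
    for tagger in taggers:
        for start, end, label in tagger(tokens):
            span = range(start, end)
            if not taken.intersection(span):
                taken.update(span)
                merged.append((start, end, label))
    return sorted(merged)

tokens = "Patients received aspirin and CHEMBL25 daily".split()
print(combine(tokens, [dictionary_tagger, regex_tagger]))
# → [(2, 3, 'TRIVIAL'), (4, 5, 'IDENTIFIER')]
```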
Examples of features and regular expressions used during training of the chemical entity identification systems.
| Name of feature | Description |
|---|---|
| Length | Classifies tokens by length: tokens shorter than 5 characters are Short, tokens of 5 to 15 characters are Medium, and longer tokens are Large. |
| Word class | Automatic generation of features in terms of the frequency of upper-case and lower-case characters, digits, and other types of characters. |
| Autom. Prefixes/Suffixes | Automatic generation of suffix and prefix features (lengths 2, 3, and 4). |
| List | Automatic generation of a feature for every token that matches an element within a list. We used lists of basic name segments (~3300) and stop words (~550). |
| Dictionaries | Dictionary matching for the trivial, family, and abbreviation name classes (~6400, ~1300, and ~1400 elements, respectively). |
| Regular expressions | Regular expressions that identify specific features, such as "contains dashes?", "is all caps?", or "contains numbers?". |
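The feature types in the table can be sketched as a token-level feature extractor. The feature names, the run-compressed word-class encoding, and the boolean regex-style checks below are illustrative choices of mine, following the descriptions above rather than CheNER's exact implementation.

```python
import re

def token_features(token):
    """Illustrative features in the spirit of the table above."""
    feats = {}
    n = len(token)
    # Length class, using the table's thresholds (<5, 5-15, >15).
    feats["length"] = "Short" if n < 5 else ("Medium" if n <= 15 else "Large")
    # Word class: map chars to A/a/0/x and compress runs, e.g. "IL-2" -> "Ax0".
    cls = "".join("A" if c.isupper() else "a" if c.islower()
                  else "0" if c.isdigit() else "x" for c in token)
    feats["word_class"] = re.sub(r"(.)\1+", r"\1", cls)
    # Prefixes and suffixes of lengths 2, 3, and 4.
    for k in (2, 3, 4):
        feats[f"prefix{k}"], feats[f"suffix{k}"] = token[:k], token[-k:]
    # Boolean features analogous to the regular-expression checks.
    feats["has_dash"] = "-" in token
    feats["all_caps"] = token.isupper()
    feats["has_digit"] = any(c.isdigit() for c in token)
    return feats

print(token_features("2-acetoxybenzoic"))
```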
Figure 1. Example of how chemical entity class names are tagged by CheNER using the IOB scheme format. Tokens that are not recognized as chemical entities are marked with O. Tokens that are recognized as the beginning of a chemical entity are marked with B. Tokens that are recognized as continuing the name of a chemical entity are marked with I. In addition, CheNER adds the class of the chemical name it tags.
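A minimal sketch of the class-aware IOB labelling described in the figure caption: the first token of an entity gets B-&lt;CLASS&gt;, continuations get I-&lt;CLASS&gt;, and all other tokens get O. The example sentence and entity span are hand-made for the demo.

```python
def iob_labels(tokens, entities):
    """entities: list of (start, end, chem_class) token spans, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end, chem_class in entities:
        labels[start] = f"B-{chem_class}"          # entity beginning
        for i in range(start + 1, end):
            labels[i] = f"I-{chem_class}"          # entity continuation
    return labels

tokens = ["treated", "with", "acetylsalicylic", "acid", "daily"]
entities = [(2, 4, "SYSTEMATIC")]  # "acetylsalicylic acid"
print(list(zip(tokens, iob_labels(tokens, entities))))
# "acetylsalicylic" → B-SYSTEMATIC, "acid" → I-SYSTEMATIC, rest → O
```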
Micro-average CDI subtask results.
| | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| P | 77.37 | 80.79 | 83.01 | 83.17 | 76.79 |
| R | 65.58 | 56.44 | 54.79 | 61.36 | 69.36 |
| F | 70.99 | 66.45 | 66.01 | 70.62 | 72.88 |
| AP | 50.25 | 44.83 | 44.94 | 50.70 | 52.18 |
| Fs | 58.85 | 53.54 | 53.48 | 59.02 | 60.82 |
P:precision; R:recall; F:F-score; AP: average precision; Fs: harmonic mean between AP and F-score.
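The F and Fs columns are harmonic means and can be checked directly from the table: F = 2PR/(P+R) and Fs = 2·AP·F/(AP+F). Using Run 1 of the CDI subtask (P = 77.37, R = 65.58, AP = 50.25) recovers the reported F = 70.99 and Fs = 58.85.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two values, as used for F and Fs."""
    return 2 * a * b / (a + b)

p, r, ap = 77.37, 65.58, 50.25   # Run 1, CDI subtask
f = harmonic_mean(p, r)          # F-score from precision and recall
fs = harmonic_mean(ap, f)        # Fs from average precision and F-score
print(round(f, 2), round(fs, 2))  # → 70.99 58.85
```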
Micro-average CEM subtask results.
| | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| P | 77.58 | 80.49 | 85.17 | 85.15 | 81.49 |
| R | 65.71 | 66.13 | 48.72 | 59.45 | 66.23 |
| F | 71.15 | 72.61 | 61.98 | 70.02 | 73.07 |
| AP | 49.79 | 50.35 | 40.13 | 49.23 | 51.82 |
| Fs | 58.58 | 49.47 | 48.71 | 57.85 | 60.64 |
P: precision; R: recall; F: F-score; AP: average precision; Fs: harmonic mean between AP and F-score.
Comparative micro-average performance evaluation of "out of the box" versions of ChemSpot and OSCAR.
| | P (no processing) | R (no processing) | F (no processing) | P (processed) | R (processed) | F (processed) |
|---|---|---|---|---|---|---|
| ChemSpot | 70.05 | 59.63 | 64.43 | 71.86 | 59.81 | 65.28 |
| OSCAR | 29.97 | 79.95 | 43.60 | 35.26 | 80.00 | 48.95 |
P: precision; R: recall; F: F-score. No processing: results were not processed through the post-processing step described in methods; Processing of results: results were passed through the post-processing step described in methods.
Comparative F-Score performance combining "out of the box" versions of ChemSpot, OSCAR, and CheNER.
| | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| CheNER | 70.99 | 66.45 | 66.01 | 70.62 | 72.88 |
| CheNER + ChemSpot | 73.05 | 70.03 | 73.31 | 73.83 | 73.18 |
| CheNER + OSCAR | 50.28 | 50.31 | 50.86 | 50.81 | 50.10 |
Comparative analysis of true and false positive tagging between the best run of CheNER and ChemSpot.
| | True Positives | False Positives | Unique True Positives | Unique False Positives |
|---|---|---|---|---|
| ChemSpot | 9626 | 3769 | 2643 | 3297 |
| CheNER | 9876 | 1999 | 2893 | 1527 |
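As a sanity check, precision can be recomputed from the counts in the table above via P = TP / (TP + FP); the results agree with the precision percentages reported in the earlier tables (ChemSpot ≈ 71.86, CheNER ≈ 83.17).

```python
def precision(tp, fp):
    """Precision as a percentage, from true- and false-positive counts."""
    return 100 * tp / (tp + fp)

chemspot = precision(9626, 3769)  # counts from the comparison table
chener = precision(9876, 1999)
print(round(chemspot, 2), round(chener, 2))  # → 71.86 83.17
```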
Figure 2. Example of an entity that is not consistently annotated over different abstracts. DCG-IV is correctly annotated as a chemical entity in Abstract 23164931. However, it is not annotated at all in Abstract 22445601.