| Literature DB >> 30631966 |
Wahed Hemati, Alexander Mehler.
Abstract
BACKGROUND: Chemical and biomedical named entity recognition (NER) is an essential preprocessing task in natural language processing. The identification and extraction of named entities from scientific articles is also attracting increasing interest in many scientific disciplines. Locating chemical named entities in the literature is an essential step in chemical text mining pipelines for identifying chemical mentions, their properties, and relations as discussed in the literature. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of chemical named entities. For this purpose, we transform the task of NER into a sequence labeling problem. We present a series of sequence labeling systems that we used, adapted and optimized in our experiments for solving this task. To this end, we experiment with hyperparameter optimization. Finally, we present LSTMVoter, a two-stage application of recurrent neural networks that integrates the optimized sequence labelers from our study into a single ensemble classifier.
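The abstract casts NER as a sequence labeling problem. A common way to do this (not detailed in this record, so the scheme and the example tokens below are illustrative assumptions) is BIO tagging, where each token receives a label marking the beginning of, inside of, or outside of an entity:

```python
# Illustrative sketch: turning entity spans into per-token BIO labels,
# so NER becomes a sequence labeling task. Entity types here echo the
# CEMP/CHEMDNER annotation subtypes, but the tokens are invented.

def spans_to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) token indices."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # continuation tokens
    return labels

tokens = ["Aspirin", "inhibits", "cyclooxygenase", "."]
spans = [(0, 1, "TRIVIAL"), (2, 3, "SYSTEMATIC")]
print(spans_to_bio(tokens, spans))
# -> ['B-TRIVIAL', 'O', 'B-SYSTEMATIC', 'O']
```

A sequence labeler then learns to predict these per-token labels, from which entity spans can be reconstructed.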
Keywords: Attention mechanism; BioCreative V.5; BioNLP; CEMP; CHEMDNER; Deep learning; LSTM; Named entity recognition
Year: 2019 PMID: 30631966 PMCID: PMC6689880 DOI: 10.1186/s13321-018-0327-2
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Number of instances for each annotation subtype in the CEMP and CHEMDNER corpora
| Annotation | CEMP | CHEMDNER |
|---|---|---|
| Abbreviation | 1,373 | 9,059 |
| Family | 36,238 | 8,313 |
| Formula | 6,818 | 8,585 |
| Identifier | 278 | 1,311 |
| Multiple | 418 | 390 |
| Systematic | 28,580 | 13,472 |
| Trivial | 25,927 | 17,802 |
| No class | 0 | 72 |
| Total count | 99,632 | 59,004 |
Comparison of annotators trained and tested on the CEMP and CHEMDNER corpora, measured by precision (P), recall (R), and F1-score (F1)
| System | CEMP P | CEMP R | CEMP F1 | CHEMDNER P | CHEMDNER R | CHEMDNER F1 |
|---|---|---|---|---|---|---|
| Stanford NER | 0.85 | 0.80 | 0.82 | 0.82 | 0.83 | 0.82 |
| MarMoT | 0.87 | 0.86 | 0.86 | 0.85 | 0.85 | 0.85 |
| CRF++ | 0.77 | 0.73 | 0.73 | 0.74 | 0.71 | 0.73 |
| MITIE | 0.65 | 0.65 | 0.65 | 0.62 | 0.61 | 0.62 |
| Glample | 0.76 | 0.79 | 0.77 | 0.82 | 0.84 | 0.83 |
| Majority vote | 0.78 | 0.79 | 0.78 | 0.70 | 0.76 | 0.73 |
| LSTMVoter | 0.90 | 0.88 | | 0.91 | 0.90 | |
In the original table, bold marks the system with the highest F1-score, which is LSTMVoter.
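The table's "Majority vote" baseline combines the individual taggers by per-token voting. The record does not spell out the voting rule, so the sketch below is an assumed simple variant: each tagger votes on each token's label, and ties are broken in favor of the earlier-listed tagger.

```python
# Hypothetical sketch of per-token majority voting over the outputs of
# several sequence labelers (the "Majority vote" baseline in the table).
from collections import Counter

def majority_vote(predictions):
    """predictions: one label sequence per tagger, all the same length.
    Returns the per-token majority label; ties go to the first tagger."""
    voted = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        best = max(counts.values())
        for lab in token_labels:              # preserve tagger order on ties
            if counts[lab] == best:
                voted.append(lab)
                break
    return voted

sys_a = ["B-CHEM", "O", "O"]
sys_b = ["B-CHEM", "I-CHEM", "O"]
sys_c = ["O", "O", "O"]
print(majority_vote([sys_a, sys_b, sys_c]))
# -> ['B-CHEM', 'O', 'O']
```

LSTMVoter goes beyond this fixed rule: per the abstract, it is a second-stage recurrent network that learns how to weight the underlying labelers, which is consistent with its higher scores in the table.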
Fig. 1: Architecture of LSTMVoter
Fig. 2: A long short-term memory cell
Fig. 3: A bidirectional LSTM network
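Fig. 2 depicts an LSTM memory cell. Its standard update equations (the textbook formulation; weights and inputs below are scalar toy values, not from the paper) can be traced in a few lines:

```python
# Minimal scalar sketch of one LSTM cell step (cf. Fig. 2): input gate i,
# forget gate f, output gate o, candidate g. Real cells use vectors and
# matrices; scalars keep the gate arithmetic visible.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h_prev, c_prev, w):
    """w maps each gate name to (w_x, w_h, b); all values are scalars."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])
    c = f * c_prev + i * g          # new cell state: gated memory update
    h = o * math.tanh(c)            # new hidden state: gated output
    return h, c

w = {k: (0.5, 0.5, 0.0) for k in "ifog"}   # toy weights
h, c = lstm_cell(1.0, 0.0, 0.0, w)
print(round(h, 3), round(c, 3))
# -> 0.174 0.288
```

A bidirectional LSTM (Fig. 3) runs one such cell chain left-to-right and a second right-to-left, concatenating both hidden states per token before the label is predicted.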