Jordi Armengol-Estapé, Felipe Soares, Montserrat Marimon, Martin Krallinger.
Abstract
Automatically detecting mentions of pharmaceutical drugs and chemical substances is key for the subsequent extraction of relations of chemicals with other biomedical entities such as genes, proteins, diseases, adverse reactions or symptoms. The identification of drug mentions is also a prerequisite for recognizing complex event types such as drug dosage, duration of medical treatments or drug repurposing. Formally, this task is known as named entity recognition (NER): automatically identifying mentions of predefined entities of interest in running text. In the domain of medical texts, for chemical entity recognition (CER), techniques based on hand-crafted rules and graph-based models can provide adequate performance. In recent years, the field of natural language processing has largely pivoted to deep learning, and state-of-the-art results for most natural language tasks are usually obtained with artificial neural networks. Competitive resources for drug name recognition in English medical texts are already available and heavily used, while for other languages such as Spanish such tools, although clearly needed, were missing. In this work, we adapt an existing neural NER system, NeuroNER, to the particular domain of Spanish clinical case texts, and extend the neural network so that it can take into account additional features apart from the plain text. NeuroNER can be considered a competitive baseline system for Spanish drug and CER promoted by the Spanish national plan for the advancement of language technologies (Plan TL). PharmacoNER Tagger can be accessed at https://github.com/PlanTL-SANIDAD/PharmacoNER.
Keywords: machine learning; natural language processing; neural networks (computer)
Year: 2019 PMID: 31307130 PMCID: PMC6808625 DOI: 10.5808/GI.2019.17.2.e15
Source DB: PubMed Journal: Genomics Inform ISSN: 1598-866X
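To make the NER formulation in the abstract concrete, the following minimal sketch casts entity mentions as per-token labels. It assumes a BIO labelling scheme (the abstract does not state which scheme is used), and the function `spans_to_bio`, the example sentence and its annotation are hypothetical illustrations, not data from the paper.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, type) over tokens to BIO labels."""
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"      # first token of the mention
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"      # remaining tokens of the mention
    return labels


# Hypothetical sentence and annotation, for illustration only.
tokens = ["Se", "inició", "tratamiento", "con", "ácido", "acetilsalicílico", "."]
spans = [(4, 6, "NORMALIZABLES")]
labels = spans_to_bio(tokens, spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A tagger of this kind then predicts one such label per token, which is the per-token view shown in the figures.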
Fig. 1. Schematic representation of the artificial neural network for a single token in NeuroNER: snippet of the architecture used in NeuroNER. The type of recurrent neural network (RNN) is long short-term memory (LSTM). n is the number of tokens, and x_i is the i-th token. V_T is the mapping from tokens to token embeddings. l(i) is the number of characters, and x_i,j is the j-th character of the i-th token. V_C is the mapping from characters to character embeddings. e_i is the character-enhanced token embedding of the i-th token. d_i is the output of the LSTM of the label prediction layer, a_i is the probability vector over labels, and y_i is the predicted label of the i-th token. Adapted from Dernoncourt et al. J Am Med Inform Assoc 2017;24:596-606 [15].
Fig. 2. Schematic representation of the artificial neural network for a single token in PharmacoNER tagger: snippet of the architecture used in PharmacoNER tagger. All notations are the same as in Fig. 1. We added the following V mappings as features in the character-enhanced token embeddings: a mapping from the token to its Part-of-Speech, a boolean mapping that identifies whether the token contains any of the affixes in the database, and a boolean mapping to the gazetteer database. RNN, recurrent neural network.
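The feature extension described in the Fig. 2 caption can be sketched as a simple concatenation. This is only an illustration under stated assumptions: the one-hot encoding of the Part-of-Speech tag, the reduced tag set, the prefix/suffix matching rule, and the contents of `AFFIXES` and `GAZETTEER` are all hypothetical and are not taken from the PharmacoNER tagger code.

```python
from typing import List

# Hypothetical resources, for illustration only.
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]   # assumed reduced tag set
AFFIXES = ("cilina", "azol", "mab")           # assumed affix database
GAZETTEER = {"amoxicilina", "paracetamol"}    # assumed gazetteer entries


def extend_embedding(token: str, pos_tag: str, embedding: List[float]) -> List[float]:
    """Append POS, affix and gazetteer features to a token embedding."""
    # Assumption: the POS tag is encoded as a one-hot vector.
    pos_one_hot = [1.0 if pos_tag == tag else 0.0 for tag in POS_TAGS]
    lower = token.lower()
    # Boolean feature: does the token start or end with a known affix?
    has_affix = float(any(lower.startswith(a) or lower.endswith(a) for a in AFFIXES))
    # Boolean feature: is the token listed in the gazetteer?
    in_gazetteer = float(lower in GAZETTEER)
    return embedding + pos_one_hot + [has_affix, in_gazetteer]


drug_vector = extend_embedding("Amoxicilina", "NOUN", [0.1, -0.2, 0.3])
other_vector = extend_embedding("paciente", "NOUN", [0.5, 0.0, -0.4])
print(drug_vector)
print(other_vector)
```

In this sketch the extended vector plays the role of the character-enhanced token embedding that is fed to the label prediction layer; the real system may encode or combine the features differently.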
Results for the best combination of embedding and features
| | Accuracy (Val) | Accuracy (Test) | Precision (Val) | Precision (Test) | Recall (Val) | Recall (Test) | F1 (Val) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|
| Overall | 99.59 | 99.66 | 92.37 | 91.35 | 89.7 | 86.89 | 91.01 | 89.06 |
| Normalizables | - | - | 94.99 | 93.97 | 88.97 | 85.54 | 91.88 | 89.56 |
| Proteins | - | - | 89.57 | 87.64 | 90.54 | 89.02 | 90.05 | 88.33 |
The features turned on were Part-of-Speech and Gazetteer.
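As a consistency check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall values; because those inputs are already rounded to two decimals, differences of about 0.01 from the reported F1 are expected. The helper name `f1_score` and the tolerance are choices made for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    ("Overall", "Val"): (92.37, 89.70, 91.01),
    ("Overall", "Test"): (91.35, 86.89, 89.06),
    ("Normalizables", "Val"): (94.99, 88.97, 91.88),
    ("Normalizables", "Test"): (93.97, 85.54, 89.56),
    ("Proteins", "Val"): (89.57, 90.54, 90.05),
    ("Proteins", "Test"): (87.64, 89.02, 88.33),
}

for (row, split), (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{row:14s} {split:4s} reported={f1:.2f} computed={computed:.2f}")
```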