| Literature DB >> 34920708 |
Pilar López-Úbeda, Manuel Carlos Díaz-Galiano, L. Alfonso Ureña-López, M. Teresa Martín-Valdivia.
Abstract
BACKGROUND: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as the biomedical literature.
Entities:
Keywords: Concept indexing; Named entity recognition; Natural language processing; Neural network; SNOMED-CT; Word embeddings
Mesh:
Substances:
Year: 2021 PMID: 34920708 PMCID: PMC8684055 DOI: 10.1186/s12859-021-04188-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Sample fragment from the SPACCC corpus (see English translation in “Appendix A”, Fig. 6)
Fig. 6 English sample fragment from the SPACCC corpus
Basic analysis of SPACCC corpus documents
| | Train | Dev | Test |
|---|---|---|---|
| Number of documents | 500 | 250 | 250 |
| Avg. sentences per document | 25.14 | 25.85 | 25.69 |
| No. tokens | 202,901 | 96,869 | 100,963 |
| No. unique tokens | 18,623 | 12,170 | 12,442 |
Distribution of labels in the SPACCC dataset
| | Train | Dev | Test |
|---|---|---|---|
| NORMALIZABLES | 2304 | 1121 | 973 |
| NO_NORMALIZABLES | 24 | 16 | 10 |
| PROTEINAS | 1405 | 745 | 859 |
| UNCLEAR | 89 | 44 | 34 |
Fig. 2 Proposed BiLSTM-CRF neural network using a combination of different word embeddings as the input layer. English translation: albumin/creatinine ratio: 0.6 μg
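The input layer in Fig. 2 feeds the BiLSTM-CRF one vector per token built by concatenating several word-embedding views. A minimal sketch of that concatenation step, using toy random lookup tables as stand-ins for the pretrained classic, contextual, and medical embeddings (names and dimensions here are illustrative assumptions, not the paper's actual models):

```python
import numpy as np

# Toy lookup tables standing in for the three pretrained embedding sources;
# dimensions are illustrative, not those used in the paper.
rng = np.random.default_rng(0)
vocab = ["albumina", "creatinina", "ratio"]
classic_we    = {w: rng.standard_normal(4) for w in vocab}  # word2vec-style
contextual_we = {w: rng.standard_normal(6) for w in vocab}  # contextual (e.g. BERT-style)
medical_we    = {w: rng.standard_normal(3) for w in vocab}  # domain-specific

def embed_token(token):
    """Concatenate all embedding views into one input vector for the BiLSTM."""
    return np.concatenate([classic_we[token], contextual_we[token], medical_we[token]])

sentence = ["albumina", "creatinina", "ratio"]
inputs = np.stack([embed_token(t) for t in sentence])
print(inputs.shape)  # (3, 13): one 4+6+3-dimensional vector per token
```

The concatenated matrix is what a BiLSTM layer (followed by a CRF decoder) would consume; the recurrent and CRF layers themselves are omitted here.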
Fig. 3 Workflow for assigning a SNOMED-CT code to an entity
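The indexing workflow of Fig. 3 can be approximated as a nearest-neighbour search: embed the recognized entity, embed each candidate SNOMED-CT description, and return the code whose description vector is most similar. A hedged sketch with toy vectors and placeholder codes (the real system's embedding model, candidate set, and any similarity threshold are not reproduced here):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy description embeddings; codes are illustrative placeholders,
# not verified SNOMED-CT entries.
candidates = {
    "111111111": np.array([1.0, 0.1, 0.0]),
    "222222222": np.array([0.0, 1.0, 0.2]),
}

def index_entity(entity_vec, candidates):
    """Assign the code whose description embedding is closest to the entity."""
    return max(candidates, key=lambda code: cosine(entity_vec, candidates[code]))

entity_vec = np.array([0.9, 0.2, 0.1])
print(index_entity(entity_vec, candidates))  # 111111111
```

In practice a production system would also need a fallback (e.g. no code assigned) when the best similarity falls below a confidence threshold.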
Micro-averaged performance for chemical and drug recognition task using BiLSTM-CRF approach
| System | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| Based on BERT (Xiong et al.) | 91.23 | 90.88 | 91.05 |
| Classic WE + Contextual WE + Medical WE | 91.41 | 90.14 | 90.77 |
| Medical WE | 87.94 | 86.24 | 87.08 |
| Contextual WE | 88.74 | 85.22 | 86.95 |
| Classic WE | 86.53 | 83.46 | 84.96 |
| CRF + features (López-Úbeda et al.) | 88.51 | 69.81 | 78.06 |
Fig. 4 Improvement in results obtained by concatenating word embeddings for the NER task
Micro-averaged performance for the concept indexing task
| System | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| Classic WE + Contextual WE + Medical WE | 92.91 | 92.44 | 92.67 |
| Rule + Dictionary-based method (León et al.) | 91.11 | 92.08 | 91.59 |
| Contextual WE | 91.11 | 91.93 | 91.34 |
| Medical WE | 92.16 | 90.15 | 91.17 |
| Classic WE | 92.13 | 89.34 | 90.14 |
| CRF + features | 82.89 | 61.84 | 70.83 |
Fine-grained evaluation considering different error categories in the NER task
| | Total | TP | FP | FN |
|---|---|---|---|---|
| NORMALIZABLES | 973 | 893 | 58 | 80 |
| NO_NORMALIZABLES | 10 | 3 | 1 | 7 |
| PROTEINAS | 859 | 768 | 98 | 91 |
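From per-class TP/FP/FN counts like those above, micro-averaged precision, recall, and F1 are obtained by summing the counts across classes before taking ratios. A sketch using the three classes listed (the UNCLEAR class is not broken out in this table, so the result differs slightly from the reported micro-averages):

```python
# Per-class (TP, FP, FN) counts from the fine-grained NER evaluation table.
counts = {
    "NORMALIZABLES":    (893, 58, 80),
    "NO_NORMALIZABLES": (3, 1, 7),
    "PROTEINAS":        (768, 98, 91),
}

# Micro-averaging: pool counts over all classes, then compute the ratios.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2%} R={recall:.2%} F1={f1:.2%}")  # P=91.38% R=90.34% F1=90.85%
```

Micro-averaging weights every entity mention equally, so frequent classes such as NORMALIZABLES dominate the score, which is why the rare NO_NORMALIZABLES class barely moves the aggregate.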
Fig. 5 Example of a false negative (FN) in the PharmaCoNER corpus comparing the gold output and the output of our system. English translation: determination of vimentin, cytokeratin 7 and broad-spectrum cytokeratin
Examples of misclassified entities in the NER task
| True label | Predicted label | Entities |
|---|---|---|
| NORMALIZABLES | PROTEINAS | |
| NORMALIZABLES | O | |
| NO_NORMALIZABLES | NORMALIZABLES | Ora-Sweet, harvoni, endoperox |
| NO_NORMALIZABLES | O | Ora-Plus, McGhan |
| PROTEINAS | NORMALIZABLES | |
| PROTEINAS | O | A.S.T, DHL, CLL-K |
| O | NORMALIZABLES | |
| O | NO_NORMALIZABLES | Aproten |
| O | PROTEINAS | |
Examples of entities incorrectly indexed by the unsupervised machine learning method
| Entity | SNOMED-CT code | SNOMED-CT description |
|---|---|---|
| cd 31 | 4167003 | |
| | 395835001 | |
| | 11353004 | |
Examples of entities correctly indexed by the unsupervised machine learning method
| Entity | SNOMED-CT code | SNOMED-CT description |
|---|---|---|
| | 372817009 | |
| EMA | 103092003 | |
| AA | 40185008 | |