Eduardo P García Del Valle, Gerardo Lagunes García, Lucía Prieto Santamaría, Massimiliano Zanin, Ernestina Menasalvas Ruiz, Alejandro Rodríguez-González.
Abstract
The ever-growing availability of biomedical text sources has resulted in a boost in clinical studies based on their exploitation. Biomedical named-entity recognition (bio-NER) techniques have evolved remarkably in recent years and their application in research is increasingly successful. Still, the disparity of tools and the limited available validation resources are barriers preventing a wider diffusion, especially within clinical practice. We here propose the use of omics data and network analysis as an alternative for the assessment of bio-NER tools. Specifically, our method introduces quality criteria based on edge overlap and community detection. The application of these criteria to four bio-NER solutions yielded comparable results to strategies based on annotated corpora, without suffering from their limitations. Our approach can constitute a guide both for the selection of the best bio-NER tool given a specific task, and for the creation and validation of novel approaches.
Year: 2021 PMID: 34188248 PMCID: PMC8242017 DOI: 10.1038/s41598-021-93018-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Experimental design. (a) First, data are extracted from textual and omics sources; (b) next, networks are generated from the extracted data, and their main characteristics are analysed and compared; (c) then, network-based criteria are applied to evaluate the accuracy of the bio-NER tools, and the results are compared with existing evaluations based on annotated corpora; (d) the same method is applied to DISNET’s bio-NER system; and (e) finally, the reference set is extended with pharmacologic data.
Characteristics of the extracted networks.
| Network | Nodes | Edges | Density | Modularity | Transitivity (normalized z-score) | Assortativity |
|---|---|---|---|---|---|---|
| Genomic | 1,725 | 8,208 | 0.0055 | 0.783 | 0.013 | −0.042 |
| Proteomic | 713 | 1,169 | 0.0046 | 0.961 | 0.000 | 0.356 |
| Pharmacologic | 2,832 | 21,817 | 0.0054 | 0.712 | 0.030 | 0.041 |
| MetaMap | 5,903 | 411,282 | 0.0236 | 0.481 | 0.379 | 0.067 |
| MetaMap (negation) | 5,900 | 386,967 | 0.0222 | 0.497 | 0.351 | 0.070 |
| MetaMap Lite | 6,042 | 595,110 | 0.0326 | 0.540 | 0.745 | 0.230 |
| MetaMap Lite (negation) | 5,872 | 585,465 | 0.0339 | 0.564 | 1.000 | 0.409 |
| CLAMP | 5,676 | 171,382 | 0.0106 | 0.454 | 0.256 | 0.273 |
| CLAMP (negation) | 5,627 | 144,936 | 0.0091 | 0.468 | 0.227 | 0.289 |
| BERN | 5,683 | 124,999 | 0.0077 | 0.572 | 0.241 | 0.368 |
| DISNET | 5,054 | 184,274 | 0.0144 | 0.505 | 0.416 | 0.610 |
Calculations of the transitivity, including the results of the normality tests, are available in the Supplementary Materials (see Supplementary Table S1).
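As a sanity check on the table above, the Density column appears consistent with the standard density of a simple undirected graph, 2E / (N(N − 1)). The record does not state which formula was used, so this is an assumption; the function name `undirected_density` is hypothetical. A minimal sketch using three rows of the table:

```python
def undirected_density(nodes: int, edges: int) -> float:
    """Density of a simple undirected graph: 2E / (N * (N - 1)).

    Assumption: the networks are undirected with no self-loops or
    duplicate edges; the source does not state this explicitly.
    """
    if nodes < 2:
        return 0.0
    return 2 * edges / (nodes * (nodes - 1))


# (nodes, edges) copied from the table of network characteristics.
networks = {
    "Genomic": (1725, 8208),
    "Proteomic": (713, 1169),
    "MetaMap": (5903, 411282),
}

for name, (n, e) in networks.items():
    print(f"{name}: density = {undirected_density(n, e):.4f}")
```

Under this assumption the computed values agree with the reported densities for these rows to four decimal places.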
Figure 2. Comparison of network characteristics. (a) Location of the analysed networks in the normalized transitivity versus modularity plane. The size and the color of the bubbles represent the density and assortativity of the networks, respectively; (b) log–log plot of the degree CCDF of the networks.
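For readers unfamiliar with the degree CCDF (complementary cumulative distribution function) plotted in panel (b), the sketch below shows one way to compute it from an edge list. It uses the convention P(K ≥ k); some authors use P(K > k) instead, and the record does not say which one the figure uses. The function name `degree_ccdf` and the toy edge list are illustrative only.

```python
from collections import Counter


def degree_ccdf(edges):
    """Return sorted (k, P(K >= k)) pairs for an undirected edge list.

    Assumptions: no self-loops or duplicate edges; isolated nodes are
    not represented because they never appear in an edge list.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    n = len(degree)
    nodes_per_degree = Counter(degree.values())

    ccdf = []
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(nodes_per_degree):
        ccdf.append((k, remaining / n))
        remaining -= nodes_per_degree[k]
    return ccdf


# Toy network: degrees are a=3, b=2, c=2, d=1.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(degree_ccdf(toy_edges))
```

Plotting the resulting pairs with both axes on a logarithmic scale gives a curve of the kind described in the caption.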
Figure 3. Coincidence of network communities with disease categories. The bar plots show the proportion of diseases associated with the 10 largest first-level categories in the DO (a), ICD-10-CM (b) and MeSH (c) classification systems, compared with the proportion obtained for the best performer (BERN) and worst performer (MetaMap).
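The record does not spell out how the coincidence between communities and disease categories is scored. One simple way to quantify the idea, offered here only as an illustrative assumption and not as the authors' measure, is a purity-style score: for each detected community, count the diseases that belong to its most frequent category, then divide by the number of classified diseases. The names `community_purity`, `communities` and `category_of` are hypothetical, and the toy data are invented for the example.

```python
from collections import Counter


def community_purity(communities, category_of):
    """Share of classified diseases that fall in the dominant category
    of their community (a purity-style score, assumed for illustration).
    """
    matched = 0
    total = 0
    for members in communities:
        # Diseases without a known category are skipped.
        labels = [category_of[d] for d in members if d in category_of]
        if not labels:
            continue
        matched += Counter(labels).most_common(1)[0][1]
        total += len(labels)
    return matched / total if total else 0.0


# Toy communities and first-level categories (invented values).
communities = [{"asthma", "copd", "flu"}, {"gastritis", "colitis"}]
category_of = {
    "asthma": "respiratory",
    "copd": "respiratory",
    "flu": "infectious",
    "gastritis": "digestive",
    "colitis": "digestive",
}
print(community_purity(communities, category_of))
```

A score near 1 would mean that communities align closely with the classification system; the same function could be run once per system (DO, ICD-10-CM, MeSH) by swapping the `category_of` mapping.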
Figure 4. Evaluation of the bio-NER accuracy according to the proposed model. (a) Results of the network overlapping and community coincidence tests and (b) normalized average results for the two tests, compared with the normalized average F1 score of the bio-NER tools obtained from gold-standard-based evaluations.
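To make the network overlapping test more concrete, the sketch below compares the edges of a text-derived disease network with those of a reference (for example, omics-derived) network. The exact overlap statistic used in the study is not given in this record, so two common choices are shown as assumptions: the fraction of reference edges recovered and the Jaccard index. All function names and the toy edge lists are hypothetical.

```python
def normalize_edges(edges):
    """Treat edges as undirected and drop self-loops."""
    return {frozenset((u, v)) for u, v in edges if u != v}


def edge_overlap(candidate_edges, reference_edges):
    """Compare a candidate network with a reference network by shared edges.

    Returns the number of shared edges, the fraction of reference edges
    found in the candidate, and the Jaccard index of the two edge sets.
    """
    cand = normalize_edges(candidate_edges)
    ref = normalize_edges(reference_edges)
    shared = cand & ref
    union = cand | ref
    return {
        "shared": len(shared),
        "reference_coverage": len(shared) / len(ref) if ref else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy disease-disease edges (invented identifiers).
text_edges = [("d1", "d2"), ("d2", "d3"), ("d3", "d4")]
omics_edges = [("d2", "d1"), ("d3", "d4"), ("d1", "d4")]
print(edge_overlap(text_edges, omics_edges))
```

Using `frozenset` for each edge makes `("d1", "d2")` and `("d2", "d1")` count as the same link, which matters when the two networks come from different pipelines.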
Bio-NER tools used in the study. MetaMap, MetaMap Lite and CLAMP provide configurable assertion detection (i.e., negation), hence the two performance values in the i2b2 2010 dataset.
| Bio-NER Tool | Description | F1 score (i2b2 2010) | F1 score (SemEval 2014) | F1 score (NCBI disease) |
|---|---|---|---|---|
| MetaMap | An open-source software program developed by the NLM for finding UMLS concepts in biomedical text using dictionary lookup | 0.37, 0.38 (negation) | 0.469 | 0.641 |
| MetaMap Lite | A lightweight implementation of MetaMap, meant for applications that emphasize processing speed and ease of use | 0.38, 0.45 (negation) | 0.645 | 0.725 |
| CLAMP | A clinical NLP toolkit that provides state-of-the-art NLP components and a user-friendly graphic user interface to build customized NLP pipelines. CLAMP uses various technologies, including machine learning-based methods and rule-based methods | 0.857, 0.9398 (negation) | 0.632 | – |
| BERN (with Bio-BERT) | A neural biomedical named entity recognition and multi-type normalization tool. BERN uses the Bio-BERT NER models to tag genes/proteins, diseases, drugs/chemicals, and species | 0.865 | 0.779 | 0.8936 |
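The performance figures in this table are F1 scores, that is, the harmonic mean of precision and recall. A minimal sketch of that standard definition follows; the precision and recall values in the example are hypothetical and are not taken from any of the cited evaluations.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall pair, for illustration only.
print(f"{f1_score(0.9, 0.6):.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a tool with high precision but low recall (or the reverse) still receives a modest F1 score.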