Wonjin Yoon1, Chan Ho So2, Jinhyuk Lee1, Jaewoo Kang3,4.
Abstract
BACKGROUND: Finding biomedical named entities is one of the most essential tasks in biomedical text mining. Recently, deep learning-based approaches have been applied to biomedical named entity recognition (BioNER) and have shown promising results. However, because deep learning approaches need abundant training data, a lack of data can hinder performance. BioNER datasets are scarce resources, and each dataset covers only a small subset of entity types. Furthermore, many bio entities are polysemous, which is one of the major obstacles in named entity recognition.
Keywords: Deep learning; NER; Named entity recognition; Text mining
Year: 2019 PMID: 31138109 PMCID: PMC6538547 DOI: 10.1186/s12859-019-2813-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Character-level word embedding using CNN and an overview of Bidirectional LSTM with Conditional Random Field (BiLSTM-CRF). Single-task model structure.
Fig. 2 Structure of CollaboNet. Arrows show the flow of information when target model M is training. The models in CollaboNet take turns being the target model.
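The turn-taking described in the Fig. 2 caption can be pictured as a round-robin schedule. The following is only a minimal sketch under assumptions: the function name `collabonet_schedule`, the number of rounds, and the use of dataset names as stand-ins for the individual models are hypothetical, and the actual training and information-passing steps are not shown.

```python
from typing import Iterator, List, Tuple


def collabonet_schedule(
    model_names: List[str], rounds: int
) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (round, target, collaborators) in round-robin order.

    Hypothetical helper: in each round every model takes one turn as the
    target model, while all remaining models act as collaborators.
    """
    for round_idx in range(rounds):
        for target in model_names:
            collaborators = [m for m in model_names if m != target]
            yield round_idx, target, collaborators


# Assumption: dataset names stand in for the per-dataset models.
models = ["NCBI-Disease", "JNLPBA", "BC5CDR", "BC4CHEMD", "BC2GM"]
schedule = list(collabonet_schedule(models, rounds=2))
for round_idx, target, collaborators in schedule[:3]:
    print(round_idx, target, collaborators)
```

Each entry names one target model and the models that would feed information to it during that turn; how that information is combined is left out of the sketch.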
Descriptions of datasets
| Datasets | Entity type | # of sentences | # of annotations | Data Size |
|---|---|---|---|---|
| NCBI-Disease (Dogan et al., 2014) | Disease | 7639 | 6881 | 793 abstracts |
| JNLPBA (Kim et al., 2004) | Gene/Proteins | 22,562 | 35,336 | 2404 abstracts |
| BC5CDR (Li et al., 2016) | Chemicals | 14,228 | 15,935 | 1500 articles |
| BC5CDR (Li et al., 2016) | Diseases | 14,228 | 12,852 | 1500 articles |
| BC4CHEMD (Krallinger et al., 2015a) | Chemicals | 86,679 | 84,310 | 10,000 abstracts |
| BC2GM (Akhondi et al., 2014) | Gene/Proteins | 20,510 | 24,583 | 20,000 sentences |
Performances of single-task models
| Model | Habibi et al. (2017) STM | | | Wang et al. (2018) STM | | | Our STM | | |
|---|---|---|---|---|---|---|---|---|---|
| Dataset | Precision | Recall | F1 Score | Precision | Recall | F1 Score | Precision | Recall | F1 Score |
| NCBI-disease | 85.31 | 83.58 | 84.44 | 84.95 | 82.92 | 83.92 | 83.95 | 85.45 | |
| JNLPBA | 74.83 | 79.82 | 77.25 | 69.60 | 74.95 | 72.17 | 72.51 | 82.98 | |
| BC5CDR-chem | 92.57 | 88.77 | 90.63 | *93.05 | *86.87 | *89.85 | 94.02 | 91.50 | |
| BC5CDR-disease | 84.19 | 82.79 | | *84.09 | *81.32 | *82.68 | 82.98 | 82.25 | 82.61 (±0.25) |
| BC4CHEMD | 87.83 | 85.45 | 86.62 | 90.53 | 87.04 | | 90.50 | 85.96 | 88.19 (±0.23) |
| BC2GM | 77.50 | 78.13 | 77.82 | 81.11 | 78.91 | | 79.70 | 77.47 | 78.56 (±0.38) |
| Macro Average | 83.71 | 83.09 | 83.38 | 83.89 | 82.00 | 82.90 | | | |
Our STM achieved the best performance on 3 of the 6 datasets. Scores in the asterisked (*) cells were obtained in experiments that we conducted; these scores are not reported in the original papers. The best scores from these experiments are in bold.
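For readers who want to check how the columns relate, here is a minimal sketch that assumes F1 is the harmonic mean of precision and recall and that the Macro Average row is the unweighted mean over the six datasets. The helper names `f1_score` and `macro_average` are illustrative, not taken from the paper; the input values are copied from the Habibi et al. (2017) STM columns.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores: Sequence[float]) -> float:
    """Unweighted mean over datasets (assumed definition of the last row)."""
    return sum(scores) / len(scores)


# Habibi et al. (2017) STM, in dataset order: NCBI-disease, JNLPBA,
# BC5CDR-chem, BC5CDR-disease, BC4CHEMD, BC2GM.
habibi_precision = [85.31, 74.83, 92.57, 84.19, 87.83, 77.50]
habibi_recall = [83.58, 79.82, 88.77, 82.79, 85.45, 78.13]

print(f"macro precision: {macro_average(habibi_precision):.3f}")
print(f"macro recall:    {macro_average(habibi_recall):.3f}")
print(f"NCBI-disease F1: {f1_score(85.31, 83.58):.3f}")
```

Under these assumptions the results should agree with the reported 83.71, 83.09, and 84.44 up to rounding. Scores reported with a ± spread are presumably averaged over runs, so they need not equal the harmonic mean of the averaged precision and recall.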
Performance of CollaboNet and the Multi-Task Model by Wang et al. [25]
| Model | Wang et al. (2018) MTM | | | CollaboNet | | |
|---|---|---|---|---|---|---|
| Dataset | Precision | Recall | F1 Score | Precision | Recall | F1 Score |
| NCBI-disease | 85.86 | 86.42 | 86.14 | 85.48 | 87.27 | |
| JNLPBA | 70.91 | 76.34 | 73.52 | 74.43 | 83.22 | |
| BC5CDR-chem | *93.09 | *89.56 | *91.29 | 94.26 | 92.38 | |
| BC5CDR-disease | *83.73 | *82.93 | *83.33 | 85.61 | 82.61 | |
| BC4CHEMD | 91.30 | 87.53 | | 90.78 | 87.01 | 88.85 |
| BC2GM | 82.10 | 79.42 | | 80.49 | 78.99 | 79.73 |
| Macro Average | 84.50 | 83.70 | 84.07 | | | |
Scores in the asterisked (*) cells are obtained in the experiments that we conducted; these scores are not reported in the original papers. The best scores from these experiments are in bold
The number of bio-entity type errors, the total number of errors, and the ratio of bio-entity errors to the total numbers of errors for each model prediction
| | Our STM | | | CollaboNet | | | |
|---|---|---|---|---|---|---|---|
| Dataset | Bio Entity | Total | Ratio of Bio Entity | Bio Entity | Total | Ratio of Bio Entity | Difference |
| NCBI-disease | 54 | 167 | 32.3% | 38 | 131 | 29.0% | -3.3% |
| JNLPBA | 749 | 1520 | 49.3% | 227 | 1437 | 15.8% | -33.5% |
| BC5CDR-chem | 142 | 503 | 28.2% | 122 | 505 | 24.2% | -4.1% |
| BC5CDR-disease | 199 | 867 | 23.0% | 131 | 728 | 18.0% | -5.0% |
| BC2GM | 189 | 1277 | 14.8% | 218 | 1165 | 18.7% | 3.9% |
Negative values in the Difference column indicate that CollaboNet reduced the number of false positives, especially false biomedical entities.
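The ratio and difference columns can be recomputed from the raw error counts. The sketch below assumes that the ratio is bio-entity errors divided by total errors and that the difference is the CollaboNet ratio minus the STM ratio; the function names are illustrative, and the counts are copied from the table.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (bio-entity type errors, total errors)


def bio_entity_ratio(bio_entity_errors: int, total_errors: int) -> float:
    """Share of bio-entity type errors among all errors, in percent."""
    return 100.0 * bio_entity_errors / total_errors


def ratio_difference(stm: Counts, collabonet: Counts) -> float:
    """CollaboNet ratio minus STM ratio, in percentage points."""
    return bio_entity_ratio(*collabonet) - bio_entity_ratio(*stm)


# Dataset -> (Our STM counts, CollaboNet counts).
errors: Dict[str, Tuple[Counts, Counts]] = {
    "NCBI-disease": ((54, 167), (38, 131)),
    "JNLPBA": ((749, 1520), (227, 1437)),
    "BC5CDR-chem": ((142, 503), (122, 505)),
    "BC5CDR-disease": ((199, 867), (131, 728)),
    "BC2GM": ((189, 1277), (218, 1165)),
}

for dataset, (stm, collabonet) in errors.items():
    print(f"{dataset}: {ratio_difference(stm, collabonet):+.1f}%")
```

Taking the difference of the unrounded ratios may explain why BC5CDR-chem is listed as -4.1% even though the rounded ratios (28.2% and 24.2%) differ by 4.0 points.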
Case study
| Chemical dataset | ||
| Our STM | No prophylaxis with | - globulin : Protein |
| CollaboNet | No prophylaxis with antilymphocyte globulin was used | |
| Ground Truth | No prophylaxis with antilymphocyte globulin was used | |
| Our STM | elderly patients using | ACE : Gene/Protein |
| CollaboNet | elderly patients using ACE / ARB in combination with | |
| Ground Truth | elderly patients using ACE / ARB in combination with | |
| Disease Dataset | ||
| Our STM | The ATM ( | A-T, mutated : Gene |
| CollaboNet | The ATM (A-T, mutated) gene on human chromosome 11q22. | |
| Ground Truth | The ATM (A-T, mutated) gene on human chromosome 11q22. | |
| Our STM | to bind to the human | cTNT : Gene/Protein |
| CollaboNet | to bind to the human cardiac troponin T (cTNT) pre-messenger RNA | |
| Ground Truth | to bind to the human cardiac troponin T (cTNT) pre-messenger RNA | |
| Gene / Protein Dataset | ||
| Our STM | which is inhibited by the | LMB : Chemical, Drug |
| CollaboNet | which is inhibited by the cytotoxin leptomycin B (LMB), and also by its interaction | |
| Ground Truth | which is inhibited by the cytotoxin leptomycin B (LMB), and also by its interaction | |
| Our STM | Classic Hodgkin disease ( | cHD : Disease |
| CollaboNet | Classic Hodgkin disease (cHD) is derived from B cells with high loads of mutations | |
| Ground Truth | Classic Hodgkin disease (cHD) is derived from B cells with high loads of mutations |
This table contains sentences that were incorrectly predicted by our STM but were correctly predicted by CollaboNet. The predicted labels or the ground truth labels are underlined.
Case study
| Gene / Protein Dataset | |
| CollaboNet | Troglitazone, a |
| Ground Truth | Troglitazone, a PPARgamma ligand, inhibits osteopontin gene expression in THP-1 cells |
| CollaboNet | The |
| Ground Truth | The |
| Chemical Dataset | |
| CollaboNet | recently identified Delta22-isomer of |
| Ground Truth | recently identified Delta22-isomer of beta-muricholate contribute for 5.4% |
| CollaboNet | |
| Ground Truth |
This table shows the questionable answers from the ground truth datasets. Our model achieves better performance in detecting entities in these example sentences. The predicted labels or the ground truth labels are underlined