Huiwei Zhou, Zhe Liu, Chengkun Lang, Yibin Xu, Yingyu Lin, Junjie Hou.
Abstract
BACKGROUND: Biomedical named entity recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotated datasets, especially the limited knowledge contained in them.
Keywords: Biomedical named entity recognition; Knowledge distillation; Label re-correction
Year: 2021 PMID: 34078270 PMCID: PMC8170952 DOI: 10.1186/s12859-021-04200-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The framework of our BioNER with label re-correction and knowledge distillation
Fig. 2 Illustration of the dataset generation pipeline from the perspectives of coverage and accuracy. The chemical and disease mentions are highlighted in yellow and green, respectively.
Various statistics of the datasets
| | Dataset | #Abstract | #Chemical | #Disease | $Chemical | $Disease |
|---|---|---|---|---|---|---|
| Weakly labeled | CDWC | 70,026 | 706,593 | 514,964 | 34,696 | 58,985 |
| | CDWA | 70,026 | 503,700 | 283,293 | 17,939 | 24,600 |
| | CDRC (BiLSTM-CRF) | 70,026 | 770,159 | 541,235 | 40,135 | 38,715 |
| | CDRA (BiLSTM-CRF) | 70,026 | 781,039 | 532,198 | 38,858 | 42,420 |
| | CDRC (BioBERT-CRF) | 70,026 | 795,096 | 557,434 | 50,018 | 52,447 |
| | CDRA (BioBERT-CRF) | 70,026 | 812,516 | 542,353 | 51,458 | 47,687 |
| | DRC (BiLSTM-CRF) | 70,026 | – | 469,849 | – | 69,567 |
| | DRA (BiLSTM-CRF) | 70,026 | – | 473,728 | – | 69,342 |
| | DRC (BioBERT-CRF) | 70,026 | – | 546,515 | – | 83,436 |
| | DRA (BioBERT-CRF) | 70,026 | – | 487,636 | – | 66,582 |
| Human annotated | CDR training data | 500 | 5203 | 4182 | 991 | 1384 |
| | CDR development data | 500 | 5347 | 4244 | 976 | 1254 |
| | CDR test data | 500 | 5385 | 4424 | 1239 | 1474 |
| | NCBI disease training data | 593 | – | 5145 | – | 1495 |
| | NCBI disease development data | 100 | – | 787 | – | 334 |
| | NCBI disease test data | 100 | – | 960 | – | 382 |
#Abstract: the number of abstracts
#Chemical: the number of chemical mentions
#Disease: the number of disease mentions
$Chemical: the number of unique chemical mentions
$Disease: the number of unique disease mentions
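The difference between the "#" columns (all mentions) and the "$" columns (unique mentions) can be illustrated with a small sketch. The function name `mention_stats`, the toy data, and the choice to treat mentions that differ only in letter case as the same mention are all assumptions made for illustration; the source does not say how uniqueness was determined.

```python
from collections import Counter


def mention_stats(mentions):
    """Return {entity_type: (total_mentions, unique_mentions)}.

    `mentions` is an iterable of (surface_text, entity_type) pairs.
    Assumption: uniqueness is decided on the lower-cased surface text.
    """
    totals = Counter()
    unique = {}
    for text, entity_type in mentions:
        totals[entity_type] += 1
        unique.setdefault(entity_type, set()).add(text.lower())
    return {t: (totals[t], len(unique[t])) for t in totals}


# Hypothetical toy data, not taken from any of the datasets above.
toy_mentions = [
    ("aspirin", "Chemical"),
    ("Aspirin", "Chemical"),
    ("ibuprofen", "Chemical"),
    ("headache", "Disease"),
    ("headache", "Disease"),
]
stats = mention_stats(toy_mentions)
```

Here `stats["Chemical"]` pairs the total count with the unique count, which corresponds to the #Chemical and $Chemical columns for this toy input.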
Fig. 3 Label similarity distribution over the large-scale dataset between the predictions of the two teacher models. Each bar represents the number of abstracts whose label similarity falls in the corresponding similarity interval.
The runtime of the experiments
| Models | Time (min) |
|---|---|
| Weakly labeled (BiLSTM-CRF) | 320 |
| Training (BiLSTM-CRF) | 2 |
| Distillation (BiLSTM-CRF) | 625 |
| Weakly labeled (BioBERT-CRF) | 334 |
| Training (BioBERT-CRF) | 4 |
| Distillation (BioBERT-CRF) | 554 |
“Time” denotes the training time for one epoch. “Weakly labeled” and “Training” are the training times of the models trained on the weakly labeled dataset and the CDR training dataset, respectively. “Distillation” is the training time of knowledge distillation.
Comparison of BiLSTM-CRF model results trained on CDWC and CDWA with different re-correction times
| Dataset | P | R | F | Dataset | P | R | F |
|---|---|---|---|---|---|---|---|
| CDR | 91.42 | 83.59 | 87.86 | CDR | 91.42 | 83.59 | 87.86 |
| CDR + CDWC | 90.17 | 84.49 | 87.24 | CDR + CDWA | 94.02 | 71.02 | 80.92 |
| CDWC | 89.72 | 83.65 | 86.58 | CDWA | 94.75 | 67.27 | 78.68 |
| CDWC¹ | 89.84 | 89.32 | 89.58 | CDWA¹ | 90.16 | 88.94 | 89.55 |
| CDWC² | 90.00 | 89.35 | 89.67 | CDWA² (CDRA) | 91.03 | 88.31 | |
| CDWC³ (CDRC) | 89.80 | 89.82 | | CDWA³ | 90.28 | 89.03 | 89.65 |
| CDWC⁴ | 89.90 | 89.70 | 89.80 | | | | |

The highest scores are highlighted in bold
All results are evaluated on the CDR test set. The first two rows are the baselines. The third row contains the weakly labeled datasets without re-correction. For the last 5 rows, each dataset is constructed by the correction model trained on the dataset directly above it. The superscript gives the number of re-corrections; for example, CDWC¹ is the dataset constructed by the correction model trained on CDWC. In addition, CDWC³ is CDRC, and CDWA² is CDRA.
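The three numeric columns in the rows checked below are consistent with precision, recall, and an F-score computed as their harmonic mean. This reading is an inference from the numbers, not something the table states, so the following is a minimal sketch under that assumption; the helper name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) copied from three rows of the table above.
# "CDWC1" and "CDWC2" are the datasets after one and two re-corrections.
rows = {
    "CDWC1": (89.84, 89.32, 89.58),
    "CDWC2": (90.00, 89.35, 89.67),
    "CDR + CDWA": (94.02, 71.02, 80.92),
}

# Largest gap between the recomputed and the reported F-score.
max_gap = max(abs(f_score(p, r) - f) for p, r, f in rows.values())
```

For these rows the recomputed value agrees with the reported one to within rounding of the second decimal place.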
Performance comparison of the distilled models trained with different combinations of losses
| Adv | ||||||
|---|---|---|---|---|---|---|
| ✔ | 89.99 | |||||
| ✔ | 90.13 | |||||
| ✔ | ✔ | 90.16 | ||||
| ✔ | ✔ | 90.13 | ||||
| ✔ | ✔ | |||||
| ✔ | ✔ | ✔ | 90.16 |
The highest scores are highlighted in bold
Adv: short for adversarial learning
Ablation study results
| Model | P | R | F |
|---|---|---|---|
| Our best (BiLSTM-CRF) | 90.71 | | |
| w/o label re-correction | | 80.76 | 85.73** |
| w/o CDRC | 90.48 | 89.14 | 89.81* |
| w/o CDRA | 90.17 | 89.55 | 89.86** |
The highest scores are highlighted in bold
w/o label re-correction: we train the teachers on the two weakly labeled datasets CDWC and CDWA rather than CDRC and CDRA
w/o CDRC: we train a single teacher without CDRC (i.e. only with CDRA)
w/o CDRA: we train a single teacher without CDRA (i.e. only with CDRC)
The markers * and ** represent P value < 0.05 and P value < 0.01, respectively, using a paired t-test against our best (BiLSTM-CRF). The paired t-statistic is the sum of the pairwise differences divided by the square root of the following quantity: n times the sum of the squared differences, minus the square of the sum of the differences, all over n − 1. That is, $t = \sum d \,/\, \sqrt{\left(n \sum d^2 - (\sum d)^2\right) / (n - 1)}$, where d is the difference within each pair and n is the number of pairs. We use a two-tailed test, in which the critical region lies in both tails of the distribution, so the test detects a difference in either direction.
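The paired t-statistic described in this footnote can be sketched in a few lines. The function name `paired_t_statistic` and the score lists are hypothetical; the scores are made-up values used only to exercise the formula, not results from the paper. The sketch also cross-checks the result against the equivalent form, the mean difference divided by its standard error.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t = sum(d) / sqrt((n * sum(d^2) - sum(d)^2) / (n - 1)), with d = a - b."""
    if len(a) != len(b):
        raise ValueError("both samples must have the same length")
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    if n < 2:
        raise ValueError("need at least two pairs")
    sum_d = sum(d)
    sum_d2 = sum(x * x for x in d)
    # Raises ZeroDivisionError if all differences are identical (zero variance).
    return sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))


# Hypothetical per-run F-scores for two systems (illustration only).
best = [90.4, 90.1, 90.6, 90.3, 90.5]
ablated = [89.8, 89.9, 90.0, 89.7, 89.9]

t = paired_t_statistic(best, ablated)

# Equivalent form: mean difference over its standard error.
diffs = [x - y for x, y in zip(best, ablated)]
t_alt = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

Turning the statistic into a two-tailed P value requires the t distribution with n − 1 degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the two-sided P value.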
Comparison with some state-of-the-art methods
| | Methods | CDR chemical | CDR disease | CDR both | NCBI disease |
|---|---|---|---|---|---|
| 1 | Habibi et al. [ | 91.05 | 83.49 | 87.63* | 84.44 |
| | Our baseline (BiLSTM-CRF) | 91.42 | 83.59 | 87.86 | 83.96 |
| | Our baseline (BioBERT-CRF) | 93.69 | 86.19 | 90.31 | 87.47 |
| 2 | Luo et al. [ | 92.57 | – | – | – |
| | Dang et al. [ | 93.14 | 84.68 | 89.30* | 84.41 |
| 3 | Wang et al. [ | – | – | 88.78 | 86.14 |
| | Yoon et al. [ | 92.74 | 82.61 | 88.15* | 86.36 |
| 4 | Lee et al. [ | 93.47 | 87.15 | 90.60* | 89.71 |
| | Our model (BiLSTM-CRF) | 94.17 | 85.69 | 90.35 | 85.71 |
| | Our model (BioBERT-CRF) | | | | |
The highest scores are highlighted in bold
1: models with word and character features
2: models with additional domain resource features and linguistic features
3: models with multi-task learning
4: models with large-scale unlabeled datasets
* indicates results that we calculated ourselves from the chemical and disease results they reported
Fig. 4 Case study of knowledge distillation effectiveness. Yellow for chemical and green for disease
Fig. 5 Label probabilities of the words “Coxon” and “scoline” predicted by , and our model
Fig. 6 Case study of re-correction effectiveness. Yellow for chemical and green for disease