John M Giorgi, Gary D Bader.
Abstract
MOTIVATION: Automatic biomedical named entity recognition (BioNER) is a key task in biomedical information extraction. For some time, state-of-the-art BioNER has been dominated by machine learning methods, particularly conditional random fields (CRFs), with a recent focus on deep learning. However, recent work has suggested that the high performance of CRFs for BioNER may not generalize to corpora other than the ones they were trained on. In our analysis, we find that a popular deep learning-based approach to BioNER, known as bidirectional long short-term memory network-conditional random field (BiLSTM-CRF), is correspondingly poor at generalizing. To address this, we evaluate three modifications of BiLSTM-CRF for BioNER to improve generalization: improved regularization via variational dropout, transfer learning and multi-task learning.
Year: 2020 PMID: 31218364 PMCID: PMC6956779 DOI: 10.1093/bioinformatics/btz504
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
In-corpus (IC) performance, measured by F1-score, of the baseline (BL) BiLSTM-CRF compared to a BiLSTM-CRF with variational dropout (VD)
| Entity | Corpus | BL average | BL σ | VD average | VD σ |
|---|---|---|---|---|---|
| Chemicals | BC4CHEMD | 88.46 | 0.61 | | 0.76 |
| | BC5CDR | 92.82 | 0.80 | | 0.82 |
| | CRAFT | 84.98 | 1.98 | | 1.37 |
| Disease | BC5CDR | 84.49 | 0.33 | | 0.56 |
| | NCBI-disease | 87.01 | 1.17 | | 1.50 |
| | Variome | **85.75** | 2.83 | 85.69 | 3.81 |
| Species | CRAFT | 96.28 | 2.21 | | 2.26 |
| | Linnaeus | 89.44 | 3.91 | | 7.47 |
| | S800 | 72.75 | 2.42 | | 4.17 |
| Genes/proteins | BC2GM | 81.48 | 0.48 | | 0.50 |
| | CRAFT | 84.46 | 6.08 | | 5.19 |
| | JNLPBA | 80.92 | 2.50 | | 2.62 |
Note: In the BL model, dropout is applied only to the character-enhanced word embeddings. In the VD model, dropout is additionally applied to the input, recurrent and output connections of all LSTM layers. IC performance is derived from 5-fold cross-validation, using exact matching criteria. Statistical significance is measured through a two-tailed t-test. Bold: best scores. σ: standard deviation.
Significantly different than the BL (P ≤ 0.01).
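The note above says where dropout is applied in the VD model but not how variational dropout differs from standard dropout. The usual distinction is that variational dropout samples one mask per sequence and reuses it at every time step, whereas standard dropout resamples the mask at each step. The following is a minimal sketch of that difference on a plain sequence of feature vectors. It is not the authors' implementation and does not model an LSTM; the function names and the inverted-dropout scaling are assumptions made for illustration.

```python
import random


def sample_mask(size, drop_prob, rng):
    """Sample an inverted-dropout mask: 0 for dropped units, 1/(1-p) for kept units."""
    keep_prob = 1.0 - drop_prob
    return [(1.0 / keep_prob) if rng.random() < keep_prob else 0.0
            for _ in range(size)]


def standard_dropout(sequence, drop_prob, rng):
    """Standard dropout: a fresh mask is sampled at every time step."""
    return [[x * m for x, m in zip(step, sample_mask(len(step), drop_prob, rng))]
            for step in sequence]


def variational_dropout(sequence, drop_prob, rng):
    """Variational dropout: one mask is sampled per sequence and reused at every step."""
    mask = sample_mask(len(sequence[0]), drop_prob, rng)
    return [[x * m for x, m in zip(step, mask)] for step in sequence]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical input: 4 time steps, 6 features each.
    sequence = [[1.0] * 6 for _ in range(4)]
    for step in variational_dropout(sequence, 0.5, rng):
        print(step)  # the same feature positions are zeroed in every row
```

In a recurrent network the same idea would be applied separately to the input, recurrent and output connections, each with its own per-sequence mask.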
Out-of-corpus (OOC) performance, measured by F1 score, of the baseline (BL) BiLSTM-CRF compared to a BiLSTM-CRF with variational dropout (VD)
| Entity | Train | Test | BL | VD | Δ (VD − BL) |
|---|---|---|---|---|---|
| Chemicals | BC4CHEMD | BC5CDR | **90.90** | 90.61 | −0.29 |
| | | CRAFT | 47.44 | | 0.23 |
| | BC5CDR | BC4CHEMD | 71.81 | | 0.60 |
| | | CRAFT | 39.55 | | 1.74 |
| | CRAFT | BC4CHEMD | 40.50 | | 2.14 |
| | | BC5CDR | 41.59 | | 15.05 |
| Diseases | BC5CDR | NCBI-disease | 76.67 | | 4.19 |
| | | Variome | 74.03 | | 0.81 |
| | NCBI-disease | BC5CDR | 69.62 | | 5.33 |
| | | Variome | 74.98 | | 0.72 |
| | Variome | BC5CDR | 22.45 | | 7.93 |
| | | NCBI-disease | 40.17 | | 4.99 |
| Species | CRAFT | Linnaeus | 45.32 | | 7.93 |
| | | S800 | 36.88 | | 9.21 |
| | Linnaeus | CRAFT | 82.49 | | 0.36 |
| | | S800 | 62.90 | | 4.02 |
| | S800 | CRAFT | 57.09 | | 19.34 |
| | | Linnaeus | 61.43 | | 5.62 |
| Genes/proteins | BC2GM | CRAFT | 56.04 | | 2.12 |
| | | JNLPBA | 69.77 | | 1.02 |
| | CRAFT | BC2GM | 44.11 | | 5.01 |
| | | JNLPBA | 52.88 | | 3.42 |
| | JNLPBA | BC2GM | 51.03 | | 4.57 |
| | | CRAFT | 44.29 | | 4.79 |
Note: In the BL model, dropout is applied only to the character-enhanced word embeddings. In the VD model, dropout is additionally applied to the input, recurrent and output connections of all LSTM layers. OOC performance is derived by training on one corpus (train) and testing on another annotated for the same entity type (test) using a relaxed, right-boundary matching criterion. Bold: best scores.
Significantly different than the BL (P ≤ 0.05).
Significantly different than the BL (P ≤ 0.01).
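The notes distinguish exact matching (used for IC scores) from relaxed, right-boundary matching (used for OOC scores). The sketch below shows one way the two criteria could be scored, assuming that an entity is a `(start, end, type)` tuple, that exact matching requires all three fields to agree, and that right-boundary matching requires only the end offset and type to agree. The tuple layout and the function name are illustrative assumptions, not the evaluation code used in the paper.

```python
def f1_score(gold, predicted, relaxed=False):
    """Entity-level F1 under exact or right-boundary matching.

    gold, predicted: iterables of (start, end, entity_type) tuples.
    relaxed=False: a prediction counts only if start, end and type all match.
    relaxed=True:  a prediction counts if end and type match (start is ignored).
    """
    if relaxed:
        key = lambda entity: (entity[1], entity[2])
    else:
        key = lambda entity: tuple(entity)
    gold_keys = {key(e) for e in gold}
    pred_keys = {key(e) for e in predicted}
    true_positives = len(gold_keys & pred_keys)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_keys)
    recall = true_positives / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical annotations: the first prediction misses the left boundary.
    gold = [(0, 2, "Chemical"), (5, 6, "Chemical")]
    predicted = [(1, 2, "Chemical"), (5, 6, "Chemical")]
    print(f1_score(gold, predicted))                # exact matching
    print(f1_score(gold, predicted, relaxed=True))  # right-boundary matching
```

Because the relaxed criterion forgives left-boundary disagreements, which often stem from differing annotation guidelines between corpora, IC and OOC scores in these tables are not measured on the same scale.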
In-corpus (IC) performance, measured by F1-score, of the baseline (BL) BiLSTM-CRF compared to a BiLSTM-CRF trained with transfer learning (TL)
| Entity | Corpus | BL average | BL σ | TL average | TL σ |
|---|---|---|---|---|---|
| Chemicals | BC4CHEMD | 88.46 | 0.61 | | 0.63 |
| | BC5CDR | **92.82** | 0.80 | 92.20 | 0.86 |
| | CRAFT | 84.98 | 1.98 | | 1.74 |
| Disease | BC5CDR | **84.49** | 0.33 | 84.41 | 0.24 |
| | NCBI-disease | 87.01 | 1.17 | | 0.86 |
| | Variome | 85.75 | 2.83 | | 3.03 |
| Species | CRAFT | 96.28 | 2.21 | | 1.72 |
| | Linnaeus | 89.44 | 3.91 | | 4.90 |
| | S800 | 72.75 | 2.42 | | 3.27 |
| Genes/proteins | BC2GM | **81.48** | 0.48 | 80.65 | 0.57 |
| | CRAFT | 84.46 | 6.08 | | 4.59 |
| | JNLPBA | 80.92 | 2.50 | | 2.73 |
Note: The TL model was pre-trained on the CALBC-Small-III corpus. IC performance is derived from 5-fold cross-validation, using exact matching criteria. Statistical significance is measured through a two-tailed t-test. Bold: best scores. σ: standard deviation.
Significantly different than the BL (P ≤ 0.05).
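The "Average" and "σ" columns of the IC tables summarize scores from 5-fold cross-validation. A minimal sketch of that procedure follows: split the corpus into five contiguous folds, score each held-out fold, then report the mean and standard deviation across folds. The helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
import statistics


def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) so that each item is tested exactly once."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def summarize(fold_scores):
    """Return (average, sample standard deviation) over per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


if __name__ == "__main__":
    for train, test in k_fold_splits(12, k=5):
        print(len(train), test)
    # Hypothetical per-fold F1 scores, for illustration only (not from the paper).
    average, sigma = summarize([80.0, 82.0, 81.0, 79.0, 83.0])
    print(average, sigma)
```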
Out-of-corpus (OOC) performance, measured by F1 score, of the baseline (BL) BiLSTM-CRF compared to a BiLSTM-CRF trained with transfer learning (TL)
| Entity | Train | Test | BL | TL | Δ (TL − BL) |
|---|---|---|---|---|---|
| Chemicals | BC4CHEMD | BC5CDR | **90.90** | 90.73 | −0.17 |
| | | CRAFT | **47.44** | 47.02 | −0.42 |
| | BC5CDR | BC4CHEMD | 71.81 | | 2.46 |
| | | CRAFT | 39.55 | | 1.64 |
| | CRAFT | BC4CHEMD | 40.50 | | 5.64 |
| | | BC5CDR | 41.59 | | 16.98 |
| Diseases | BC5CDR | NCBI-disease | 76.67 | | 1.83 |
| | | Variome | 74.03 | | 3.16 |
| | NCBI-disease | BC5CDR | 69.62 | | 3.56 |
| | | Variome | 74.98 | | 1.97 |
| | Variome | BC5CDR | 22.45 | | 27.83 |
| | | NCBI-disease | 40.17 | | 18.47 |
| Species | CRAFT | Linnaeus | 45.32 | | 8.04 |
| | | S800 | 36.88 | | 9.57 |
| | Linnaeus | CRAFT | 82.49 | | 0.57 |
| | | S800 | 62.90 | | 4.73 |
| | S800 | CRAFT | 57.09 | | 12.47 |
| | | Linnaeus | 61.43 | | 5.78 |
| Genes/proteins | BC2GM | CRAFT | 56.04 | | 0.79 |
| | | JNLPBA | 69.77 | | 0.50 |
| | CRAFT | BC2GM | 44.11 | | 5.58 |
| | | JNLPBA | 52.88 | | 5.03 |
| | JNLPBA | BC2GM | 51.03 | | 6.78 |
| | | CRAFT | 44.29 | | 12.61 |
Note: The TL model was pre-trained on the CALBC-Small-III corpus. OOC performance is derived by training on one corpus (train) and testing on another annotated for the same entity type (test) using a relaxed, right-boundary matching criterion. Statistical significance is measured through a two-tailed t-test. Bold: best scores.
Significantly different than the BL (P ≤ 0.05).
Significantly different than the BL (P ≤ 0.01).
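The last column of the OOC tables appears to hold the absolute difference between the modified model's F1 score and the BL score. This reading is an assumption, but it agrees with the two rows of the TL table where both scores are present: 90.73 with −0.17 against the BL of 90.90 listed for BC4CHEMD → BC5CDR in the multi-task OOC table, and 47.02 with −0.42 against the BL of 47.44 listed for BC4CHEMD → CRAFT. Under that assumption, a score missing from a row can be estimated from the BL score and the difference, as sketched below. The function names are hypothetical, and because the published differences may have been computed before rounding, an estimate could be off by 0.01.

```python
def is_consistent(baseline, modified, delta, tolerance=0.005):
    """Check that delta equals modified - baseline, within rounding tolerance."""
    return abs((modified - baseline) - delta) < tolerance


def recover_score(baseline, delta):
    """Estimate a missing score as baseline + delta, rounded to two decimals."""
    return round(baseline + delta, 2)


if __name__ == "__main__":
    # Rows of the TL table where BL, TL and the difference are all known.
    print(is_consistent(90.90, 90.73, -0.17))
    print(is_consistent(47.44, 47.02, -0.42))
    # A row where the TL score is missing: train BC5CDR, test BC4CHEMD.
    print(recover_score(71.81, 2.46))
```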
In-corpus (IC) performance, measured by F1-score, of the baseline (BL) BiLSTM-CRF compared to the multi-task model (MTM)
| Entity | Train | Partner | BL average | BL σ | MTM average | MTM σ |
|---|---|---|---|---|---|---|
| Chemicals | BC4CHEMD | BC5CDR | 88.46 | 0.61 | | 0.60 |
| | | CRAFT | — | — | 88.67 | 0.50 |
| | BC5CDR | BC4CHEMD | 92.82 | 0.80 | | 0.55 |
| | | CRAFT | — | — | 91.52 | 0.68 |
| | CRAFT | BC4CHEMD | 84.98 | 1.98 | | 1.49 |
| | | BC5CDR | — | — | 84.74 | 1.33 |
| Diseases | BC5CDR | NCBI-disease | **84.49** | 0.33 | 83.85 | 0.64 |
| | | Variome | — | — | 83.29 | 0.80 |
| | NCBI-disease | BC5CDR | **87.01** | 1.17 | 86.89 | 1.74 |
| | | Variome | — | — | 86.27 | 1.44 |
| | Variome | BC5CDR | 85.75 | 2.83 | | 2.49 |
| | | NCBI-disease | — | — | 85.73 | 2.46 |
| Species | CRAFT | Linnaeus | 96.28 | 2.21 | 96.82 | 1.51 |
| | | S800 | — | — | | 1.31 |
| | Linnaeus | CRAFT | 89.44 | 3.91 | 89.72 | 4.51 |
| | | S800 | — | — | | 3.42 |
| | S800 | CRAFT | 72.75 | 2.42 | | 2.98 |
| | | Linnaeus | — | — | 74.43 | 1.90 |
| Genes/proteins | BC2GM | CRAFT | **81.48** | 0.48 | 79.41 | 0.14 |
| | | JNLPBA | — | — | 79.60 | 0.53 |
| | CRAFT | BC2GM | 84.46 | 6.08 | | 2.65 |
| | | JNLPBA | — | — | 85.36 | 4.74 |
| | JNLPBA | BC2GM | 80.92 | 2.50 | | 2.53 |
| | | CRAFT | — | — | 81.15 | 2.04 |
Note: The MTM is trained on pairs of corpora (train, partner), where each corpus is used during training to update the parameters of all hidden layers. IC performance is derived from 5-fold cross-validation, using exact matching criteria. Statistical significance is measured through a two-tailed t-test. Bold: best scores. σ: standard deviation.
Significantly different than the BL (P ≤ 0.05).
Significantly different than the BL (P ≤ 0.01).
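The note describes a model in which both corpora of a pair update the same hidden-layer parameters. A toy sketch of this kind of parameter sharing is given below, assuming one shared weight that stands in for the hidden layers and one output weight per corpus. The separate per-corpus output weights, the scalar regression task and all names are assumptions for illustration; none of it reflects the actual BiLSTM-CRF architecture or training schedule.

```python
import random


class MultiTaskModel:
    """Toy model: one shared weight (the 'hidden layers') and one output weight per task."""

    def __init__(self, tasks):
        self.shared = 0.5
        self.heads = {task: 0.5 for task in tasks}

    def predict(self, task, x):
        return self.heads[task] * (self.shared * x)

    def sgd_step(self, task, x, target, lr=0.01):
        """One squared-error gradient step: updates the shared weight and this task's head."""
        hidden = self.shared * x
        error = self.heads[task] * hidden - target
        grad_head = 2 * error * hidden
        grad_shared = 2 * error * self.heads[task] * x
        self.heads[task] -= lr * grad_head
        self.shared -= lr * grad_shared


def train(model, corpora, epochs=50, seed=0):
    """Interleave examples from all corpora so each one updates the shared weight."""
    rng = random.Random(seed)
    examples = [(task, x, y) for task, pairs in corpora.items() for x, y in pairs]
    for _ in range(epochs):
        rng.shuffle(examples)
        for task, x, y in examples:
            model.sgd_step(task, x, y)


if __name__ == "__main__":
    # Hypothetical toy data standing in for a (train, partner) corpus pair.
    corpora = {"train": [(1.0, 2.0), (2.0, 4.0)], "partner": [(1.0, 3.0), (2.0, 6.0)]}
    model = MultiTaskModel(corpora)
    train(model, corpora)
    print(model.shared, model.heads)
```

At evaluation time only the output for the train corpus would be used, which matches the way the OOC results for the multi-task model are described.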
Out-of-corpus (OOC) performance, measured by F1 score, of the baseline (BL) BiLSTM-CRF compared to the multi-task model (MTM)
| Entity | Train | Partner | Test | BL | MTM | Δ (MTM − BL) |
|---|---|---|---|---|---|---|
| Chemicals | BC4CHEMD | BC5CDR | CRAFT | 47.44 | | 0.29 |
| | | CRAFT | BC5CDR | 90.90 | | 0.07 |
| | BC5CDR | BC4CHEMD | CRAFT | 39.55 | | 5.24 |
| | | CRAFT | BC4CHEMD | 71.81 | | 0.73 |
| | CRAFT | BC4CHEMD | BC5CDR | 41.59 | | 30.09 |
| | | BC5CDR | BC4CHEMD | 40.50 | | 9.30 |
| Diseases | BC5CDR | NCBI-disease | Variome | 74.03 | | 2.81 |
| | | Variome | NCBI-disease | 76.67 | | 0.66 |
| | NCBI-disease | BC5CDR | Variome | 74.98 | | 1.34 |
| | | Variome | BC5CDR | 69.62 | | 1.10 |
| | Variome | BC5CDR | NCBI-disease | 40.17 | | 29.18 |
| | | NCBI-disease | BC5CDR | 22.45 | | 37.61 |
| Species | CRAFT | Linnaeus | S800 | 36.88 | | 13.38 |
| | | S800 | Linnaeus | 45.32 | | 12.48 |
| | Linnaeus | CRAFT | S800 | 62.90 | | 4.99 |
| | | S800 | CRAFT | 82.49 | | 0.20 |
| | S800 | CRAFT | Linnaeus | 61.43 | | 6.46 |
| | | Linnaeus | CRAFT | 57.09 | | 22.94 |
| Genes/proteins | BC2GM | CRAFT | JNLPBA | 69.77 | | 0.49 |
| | | JNLPBA | CRAFT | 56.04 | | 1.12 |
| | CRAFT | BC2GM | JNLPBA | 52.88 | | 5.89 |
| | | JNLPBA | BC2GM | 44.11 | | 1.01 |
| | JNLPBA | BC2GM | CRAFT | 44.29 | | 8.49 |
| | | CRAFT | BC2GM | 51.03 | | 6.32 |
Note: The MTM is trained on pairs of corpora (train, partner), where each corpus is used during training to update the parameters of all hidden layers, but only the train corpus task is used for evaluation on another corpus annotated for the same entity type (test) using a relaxed, right-boundary matching criterion. Bold: best scores.
Significantly different than the BL (P ≤ 0.05).
Significantly different than the BL (P ≤ 0.01).
Fig. 1. Violin plot of the average in-corpus (IC) and out-of-corpus (OOC) performance, measured by F1 score, of the BiLSTM-CRF model. IC performance is derived from 5-fold cross-validation, using exact matching criteria. OOC performance is derived by training on one corpus (train) and testing on another corpus annotated for the same entity type (test) using a relaxed, right-boundary matching criterion. Also shown is the average performance of models employing each of the proposed modifications independently (variational dropout (VD), transfer learning (TL) and multi-task learning (MTL)) and of models employing all combinations of these methods.