Xi Wang, Jiagao Lyu, Li Dong, Ke Xu.
Abstract
BACKGROUND: Biomedical named entity recognition (BioNER) is a fundamental task in biomedical literature mining, and its accuracy affects the performance of downstream tasks. Most BioNER models rely on domain-specific features or hand-crafted rules, but extracting features from massive data requires considerable time and human effort. Neural network models address this by learning features automatically. Recently, multi-task learning has been applied successfully to neural network models for biomedical literature mining. For BioNER, multi-task learning lets a model draw on features from multiple datasets, which improves its performance.
Keywords: Cross-sharing structure; Multi-task learning; Named entity recognition
Year: 2019 PMID: 31419937 PMCID: PMC6697996 DOI: 10.1186/s12859-019-3000-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Bi-LSTM structure. The figure displays part of a Bi-LSTM network. Input vectors are fed to LSTMs running in two directions, and the outputs of the two directions are concatenated to form the overall output
Fig. 2 Single-task Model (STM). The input is a sentence from the BioNER dataset. The dotted rectangles represent words in a sentence, and the solid rectangles represent Bi-LSTM cells. The circles represent CNN units, and the double circles represent CRF units. The tags in the double circles, e.g., “O”, “B-GENE”, are the output of the CRF layer
Fig. 3 Fully-shared Multi-task Model (FS-MTM). The embedding layer and the Bi-LSTM layer are shared by the two datasets, and each dataset has its own CRF layer
Fig. 4 Shared-private Multi-task Model (SP-MTM). The embedding layer and the shared Bi-LSTM are shared by the two datasets. Each dataset has its own CRF layer and its own private Bi-LSTM
Fig. 5 Adversarial Multi-task Model (ADV-MTM). The embedding layer and the shared Bi-LSTM are shared by the two datasets. Each dataset has its own CRF layer and its own private Bi-LSTM. Three kinds of losses are marked on the figure
Fig. 6 Cross-sharing Multi-task Model (CS-MTM). The embedding layer and the shared Bi-LSTM are shared by the two datasets. A gated interaction unit adjusts the output of the private Bi-LSTMs. P1, P2: outputs of the private Bi-LSTMs. S: output of the shared Bi-LSTM. G1, G2: outputs of the gated interaction unit
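The cross-sharing idea in the Fig. 6 caption can be illustrated with a small NumPy sketch. The caption only states that a gated interaction unit adjusts the private Bi-LSTM outputs using P1, P2 and S; the specific gate used here (a sigmoid over the concatenation of S and the other task's private output, multiplied elementwise into the task's own private output) is an assumption made for illustration, not the formula from the paper. The function name `gated_interaction`, the weight shapes, and the random inputs are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_interaction(p_own, shared, p_other, W, b):
    """Adjust one task's private output using the shared and the other private output.

    ASSUMPTION: gate = sigmoid([S ; P_other] @ W + b), applied elementwise to P_own.
    The paper's actual gate may be defined differently.
    """
    gate = sigmoid(np.concatenate([shared, p_other], axis=-1) @ W + b)
    return gate * p_own

rng = np.random.default_rng(0)
T, H = 6, 8                        # hypothetical sentence length and hidden size

# Stand-ins for the Bi-LSTM outputs named in the caption.
S = rng.normal(size=(T, H))        # shared Bi-LSTM output
P1 = rng.normal(size=(T, H))       # private Bi-LSTM output, dataset 1
P2 = rng.normal(size=(T, H))       # private Bi-LSTM output, dataset 2

# One gate per task; each reads S and the *other* task's private output.
W1, b1 = rng.normal(scale=0.1, size=(2 * H, H)), np.zeros(H)
W2, b2 = rng.normal(scale=0.1, size=(2 * H, H)), np.zeros(H)

G1 = gated_interaction(P1, S, P2, W1, b1)
G2 = gated_interaction(P2, S, P1, W2, b2)
print(G1.shape, G2.shape)
```

Because the gate lies strictly between 0 and 1, each element of G1 is a scaled-down copy of the matching element of P1, so in this sketch the unit can only attenuate private features, never amplify them.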
Biomedical NER datasets used in the experiments
| Dataset | Size | Entity types & counts |
|---|---|---|
| BC2GM | 20,131 sentences | Gene (24,583) |
| Ex-PTM | 3,653 sentences | Protein (4,698) |
| NCBI-disease | 7,287 sentences | Disease (6,881) |
| Linnaeus | 23,155 sentences | Species (4,263) |
| JNLPBA | 24,806 sentences | Cell (12,969), Gene (10,589), Protein (35,336) |
| BC5CDR | 13,938 sentences | Chemical (15,935), Disease (12,852) |
| BioNLP09 | 11,356 sentences | Protein (14,963) |
| BioNLP11ID | 5,178 sentences | Chemical (973), Protein (6,551), Species (3,471) |
| BioNLP13PC | 5,051 sentences | Cell (1,013), Chemical (3,989), Gene (10,891) |
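Entity counts like those in the table are counts of mentions, not tokens. As a minimal sketch, assuming the corpora are stored as per-token BIO tags (the Fig. 2 caption shows tags such as “O” and “B-GENE”), mentions per entity type could be counted as follows. The helper names `bio_to_spans` and `count_entities` and the toy sentences are hypothetical.

```python
from collections import Counter

def bio_to_spans(tags):
    """Return (entity_type, start, end_exclusive) spans from a list of BIO tags."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An I- tag with no open entity of the same type is ignored in this sketch.
    return spans

def count_entities(tagged_sentences):
    """Count entity mentions per type over a list of BIO tag sequences."""
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(etype for etype, _, _ in bio_to_spans(tags))
    return counts

toy_corpus = [
    ["O", "B-GENE", "I-GENE", "O", "B-GENE"],
    ["B-PROTEIN", "I-PROTEIN", "I-PROTEIN", "O"],
]
toy_counts = count_entities(toy_corpus)
print(toy_counts)
```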
Model Performance Comparison
| Dataset | Metric | Baseline Single-task Model (STM) | Fully-shared Multi-task Model (FS-MTM) | Shared-private Multi-task Model (SP-MTM) | Adversarial Multi-task Model (ADV-MTM) | Cross-sharing Multi-task Model (CS-MTM) |
|---|---|---|---|---|---|---|
| BC2GM | Precision | 84.00 | 83.34 | 84.51 | 83.66 | 83.12 |
| | Recall | 83.82 | 84.75 | 84.17 | 84.05 | 85.74 |
| | F1 | 83.91 | 84.04 | 84.34 | 83.85 | |
| Ex-PTM | Precision | 70.83 | 72.56 | 70.45 | 76.60 | 74.73 |
| | Recall | 64.12 | 70.46 | 70.03 | 67.43 | 69.56 |
| | F1 | 67.31 | 71.49 | 70.24 | 71.72 | |
| NCBI-disease | Precision | 88.45 | 84.39 | 87.11 | 86.02 | 86.59 |
| | Recall | 83.78 | 86.61 | 85.49 | 86.86 | 86.42 |
| | F1 | 86.05 | 85.49 | 86.29 | 86.44 | |
| Linnaeus | Precision | 92.86 | 92.66 | 93.00 | 93.74 | 89.81 |
| | Recall | 67.62 | 66.76 | 73.86 | 73.81 | 76.12 |
| | F1 | 78.25 | 77.60 | 82.33 | | 82.40 |
Bold: the best F1 score for the dataset
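Each F1 score is the harmonic mean of the precision and recall reported for the same model and dataset. The short sketch below recomputes two entries from the table as a consistency check; the function name `f1_score` is illustrative, and rounding to two decimals is assumed to match the table's precision.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# STM on BC2GM: precision 84.00, recall 83.82
stm_bc2gm = round(f1_score(84.00, 83.82), 2)
# ADV-MTM on Ex-PTM: precision 76.60, recall 67.43
adv_exptm = round(f1_score(76.60, 67.43), 2)

print(stm_bc2gm)  # 83.91, matching the table
print(adv_exptm)  # 71.72, matching the table
```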
Parameter numbers of all models
| Model | Number |
|---|---|
| STM | 3.68M |
| FS-MTM | 3.68M |
| SP-MTM | 5.41M |
| ADV-MTM | 5.41M |
| CS-MTM | 5.44M |
Performance with different auxiliary datasets
| Main dataset | JNLPBA | BC5CDR | BioNLP09 | BioNLP11ID | BioNLP13PC |
|---|---|---|---|---|---|
| BC2GM | | 84.11 | 83.85 | 84.15 | 83.90 |
| Ex-PTM | 68.81 | 67.51 | | 68.89 | 70.87 |
| NCBI-disease | 86.17 | 85.74 | | 84.90 | 85.63 |
| Linnaeus | 78.07 | | 81.93 | 78.46 | 78.37 |
Bold: the best F1 score for the dataset. ↑ / ↓: improvement / decline compared with STM
Performance with different entity types in BioNLP11ID
| Main dataset | BioNLP11ID | BioNLP11ID-chem | BioNLP11ID-ggp | BioNLP11ID-species |
|---|---|---|---|---|
| BC2GM | 84.15 | | 84.01 | 83.45 |
| Ex-PTM | 68.89 | 67.51 | | 67.58 |
| NCBI-disease | 84.90 | | 85.26 | 85.24 |
| Linnaeus | 78.46 | 72.09 | 73.21 | |
Bold: the best F1 score among the sub-datasets. ↑ / ↓: improvement / decline compared with STM
Impact of dataset size
| Dataset | Metric | Full-size STM | Full-size CS-MTM | 50%-size STM | 50%-size CS-MTM | 25%-size STM | 25%-size CS-MTM | 10%-size STM | 10%-size CS-MTM |
|---|---|---|---|---|---|---|---|---|---|
| BC2GM | Precision | 84.00 | 83.12 | 82.37 | 79.37 | 77.82 | 79.44 | 73.19 | 72.95 |
| | Recall | 83.82 | 85.74 | 80.77 | 85.05 | 79.57 | 78.98 | 73.59 | 75.39 |
| | F1 | 83.91 | | 81.56 | | 78.69 | | 73.39 | |
| Ex-PTM | Precision | 70.83 | 74.73 | 67.74 | 68.18 | 57.46 | 54.00 | 42.47 | 50.69 |
| | Recall | 64.12 | 69.56 | 58.62 | 67.48 | 53.69 | 63.97 | 50.27 | 41.68 |
| | F1 | 67.31 | | 62.85 | | 55.51 | | | 45.75 |
| NCBI-disease | Precision | 88.45 | 86.59 | 84.03 | 84.72 | 81.52 | 81.00 | 81.02 | 79.32 |
| | Recall | 83.78 | 86.42 | 84.56 | 84.76 | 76.50 | 81.00 | 68.59 | 74.40 |
| | F1 | 86.05 | | 84.30 | | 78.93 | | 74.29 | |
| Linnaeus | Precision | 92.86 | 89.81 | 91.77 | 88.92 | 89.90 | 90.20 | 90.80 | 85.98 |
| | Recall | 67.62 | 76.12 | 68.11 | 72.95 | 67.62 | 68.29 | 52.65 | 51.33 |
| | F1 | 78.25 | | 78.19 | | 77.18 | | | 64.29 |
Bold: the better F1 scores between STM and CS-MTM for each dataset size
Performance with different word embeddings
| Embedding | STM: BC2GM | STM: Ex-PTM | STM: NCBI-disease | STM: Linnaeus | CS-MTM: BC2GM | CS-MTM: Ex-PTM | CS-MTM: NCBI-disease | CS-MTM: Linnaeus |
|---|---|---|---|---|---|---|---|---|
| PMC | 84.22 | 66.09 | 85.24 | 76.87 | 85.07 | 70.61 | 84.32 | 80.00 |
| PubMed | 84.15 | 66.86 | 85.21 | 71.23 | 83.84 | 70.66 | 84.99 | 74.63 |
| PMC+PubMed | 84.35 | 66.57 | 84.39 | 75.07 | | 72.03 | 85.34 | 76.71 |
| PMC+PubMed+Wikipedia | | 65.71 | 84.46 | 76.87 | 84.10 | 71.79 | 85.27 | 78.99 |
| Our GloVe | 83.91 | | | | 84.41 | | | |
Bold: the best F1 scores for the model on each dataset
Case Study: Bold text: ground-truth entity; Underlined text: model prediction
| Main dataset: Ex-PTM Auxiliary dataset: BioNLP09 | ||
| Case 1 | STM | The myristoylation of |
| CS-MTM | The myristoylation of | |
| Auxiliary data | Human immunodeficiency virus type 1 | |
| Description | The training data of the auxiliary dataset directly provides entity information about the Nef protein. | |
| Main dataset: Ex-PTM Auxiliary dataset: BioNLP09 | ||
| Case 2 | STM | |
| CS-MTM | Vitamin K deficiency is a relatively common condition in neonates. | |
| Auxiliary data | Ascorbic acid (ascorbate or vitamin C) has been shown to suppress the induction of HIV in... | |
| In conclusion, we demonstrate that the vitamin E derivative TCP succinate prevents monocytic... | ||
| Description | The training data of the auxiliary dataset indirectly indicates that a vitamin is not a protein. | |
| Main dataset: Linnaeus Auxiliary dataset: BC5CDR | ||
| Case 3 | STM | He |
| CS-MTM | He slept well at night, ate more than his mother thought was good for him, and was able to... | |
| Auxiliary data | During the night | |
| Description | The training data of the auxiliary dataset directly indicates that “sleep” does not belong to the species type. | |