| Literature DB >> 35379166 |
Abstract
BACKGROUND: Automatically extracting biomedical relations has become a significant subject in biomedical research due to the rapid growth of the biomedical literature. Since their adaptation to the biomedical domain, transformer-based BERT models have produced leading results on many biomedical natural language processing tasks. In this work, we explore approaches to improve the BERT model for relation extraction tasks in both the pre-training and fine-tuning stages. In the pre-training stage, we add another level of BERT adaptation on sub-domain data to bridge the gap between domain knowledge and task-specific knowledge. We also propose methods to incorporate knowledge otherwise ignored in the last layer of BERT to improve its fine-tuning.
Keywords: BERT; Biomedical relation extraction; Deep learning; Text mining; Transformer
Mesh:
Year: 2022 PMID: 35379166 PMCID: PMC8978438 DOI: 10.1186/s12859-022-04642-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance (F1 score) of BERT models on the ChemProt, DDI, and PPI datasets
| Model | PPI | DDI | ChemProt |
|---|---|---|---|
| BioBERT | 81.0 | 79.0 | 75.3 |
| SciBERT | 78.8 | 78.7 | 74.4 |
| BlueBERT | 71.9 | 76.8 | 71.2 |
Dataset statistics for PPI, DDI, and ChemProt
| Dataset | Instances | Train | Dev | Test |
|---|---|---|---|---|
| PPI (AIMed) | 5,834 | – | – | – |
| DDI | 33,508 | 22,233 | 5,559 | 5,716 |
| ChemProt | 45,048 | 18,035 | 11,268 | 15,745 |
For the AIMed dataset of PPI, there are only two labels: Positive and Negative. The ChemProt corpus is labeled with five positive classes (CPR:3, CPR:4, CPR:5, CPR:6, CPR:9) and a negative class. Similarly, the DDI dataset contains four positive labels (ADVICE, EFFECT, INT, MECHANISM) and one negative label.
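For fine-tuning, these label sets reduce to a simple label-to-id mapping; a minimal sketch (the id assignment is illustrative, not taken from the paper):

```python
# Label sets for the three relation extraction tasks, as described above.
# The integer id assignment is illustrative; any consistent mapping works.
LABELS = {
    "PPI": ["Negative", "Positive"],
    "DDI": ["Negative", "ADVICE", "EFFECT", "INT", "MECHANISM"],
    "ChemProt": ["Negative", "CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"],
}

def label_to_id(task: str, label: str) -> int:
    """Map a task's relation label to an integer class id for fine-tuning."""
    return LABELS[task].index(label)
```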
Performance of BlueBERT model on the PPI, ChemProt, and DDI tasks before and after removing MIMIC-III from the domain adaptation data
| Model | PPI P | PPI R | PPI F | DDI P | DDI R | DDI F | ChemProt P | ChemProt R | ChemProt F |
|---|---|---|---|---|---|---|---|---|---|
| BlueBERT | 69.3 | 75.0 | 71.9 | 76.2 | 77.4 | 76.8 | 70.9 | 71.5 | 71.2 |
| BlueBERT (-M) | | | | | | | | | |
Bold values indicate better results
P: Precision; R: Recall; F: F1 Score; -M: Subtract the MIMIC-III clinical notes
BERT performance after pre-training with sub-domain data
| Model | PPI P | PPI R | PPI F | DDI P | DDI R | DDI F | ChemProt P | ChemProt R | ChemProt F |
|---|---|---|---|---|---|---|---|---|---|
| BioBERT | 79.0 | 83.3 | 81.0 | 79.9 | 78.1 | 79.0 | 74.3 | 76.3 | 75.3 |
| BioBERT (+P/G) | 76.1 | 77.6 | 76.9 | 76.5 | 74.2 | 75.3 | | | |
| BioBERT (+D) | 81.5 | 80.9 | 81.2 | 74.4 | 75.6 | | | | |
| BioBERT (+CP) | 81.3 | 83.7 | 82.4 | 78.7 | 78.8 | | | | |
| PubMedBERT | 80.1 | 84.3 | 82.1 | 82.6 | 81.9 | 82.3 | 78.8 | 75.9 | 77.3 |
| PubMedBERT (+P/G) | 83.7 | 80.5 | 82.0 | 75.5 | 77.9 | | | | |
| PubMedBERT (+D) | 79.1 | 85.3 | 82.0 | 80.4 | 74.6 | 77.4 | | | |
| PubMedBERT (+CP) | 79.6 | 84.7 | 82.0 | 81.1 | 81.9 | | | | |
Bold values indicate better results
P: Precision; R: Recall; F: F1 Score; +P/G: add Protein/Gene-related PubMed abstracts as sub-domain data; +D: add Drug-related PubMed abstracts as sub-domain data; +CP: add protein-related and chemical-related PubMed abstracts as sub-domain data
Fig. 1 Learned knowledge of the training data across the layers of BioBERT. L is the total number of layers of the BERT model. "Measurement of Knowledge" is defined in the Methods section
Performance of BERT models on PPI, DDI, and ChemProt.
| Model | PPI P | PPI R | PPI F | DDI P | DDI R | DDI F | ChemProt P | ChemProt R | ChemProt F |
|---|---|---|---|---|---|---|---|---|---|
| BioBERT | 79.0 | 83.3 | 81.0 | 79.9 | 78.1 | 79.0 | 74.3 | 76.3 | 75.3 |
| BioBERT_SLL_LSTM | 80.2 | 84.0 | 82.0 | 80.5 | 78.5 | 79.5 | 77.6 | 74.4 | 76.0 |
| BioBERT_SLL_biLSTM | 80.2 | 82.7 | 81.4 | 80.8 | 78.5 | 79.6 | 73.9 | 75.9 | |
| BioBERT_SLL_Att | 77.5 | 75.1 | | | | | | | |
| PubMedBERT | 80.1 | 84.3 | 82.1 | 82.6 | 81.9 | 82.3 | 78.8 | 75.9 | 77.3 |
| PubMedBERT_SLL_LSTM | 79.8 | 82.6 | 82.6 | 82.7 | 77.0 | | | | |
| PubMedBERT_SLL_biLSTM | 80.5 | 82.6 | 81.7 | 82.6 | 81.4 | 82.0 | 78.5 | 76.5 | 77.5 |
| PubMedBERT_SLL_Att | 85.0 | 82.7 | 78.3 | | | | | | |
Bold values indicate better results
P: Precision; R: Recall; F: F1 Score; BioBERT/PubMedBERT_SLL_LSTM: model of summarizing the outputs of the last layer using LSTM; BioBERT/PubMedBERT_SLL_biLSTM: model of summarizing the outputs of the last layer using biLSTM; BioBERT/PubMedBERT_SLL_Att: model of summarizing the outputs of the last layer using attention mechanism
BERT performance after combining sub-domain adaptation and the refined fine-tuning mechanism
| Model | PPI P | PPI R | PPI F | DDI P | DDI R | DDI F | ChemProt P | ChemProt R | ChemProt F |
|---|---|---|---|---|---|---|---|---|---|
| BioBERT | 79.0 | 83.3 | 81.0 | 79.9 | 78.1 | 79.0 | 74.3 | 76.3 | 75.3 |
| BioBERT_SLL_Att | 80.7 | 84.4 | 82.5 | 81.3 | 80.1 | 80.7 | 76.5 | 76.8 | |
| BioBERT_SLL_Att (+P/G) | 80.4 | 79.7 | 80.0 | 78.4 | 75.1 | 76.7 | | | |
| BioBERT_SLL_Att (+D) | 81.5 | 84.5 | 82.9 | 76.8 | 74.7 | 75.7 | | | |
| BioBERT_SLL_Att (+CP) | 82.5 | 84.2 | 83.3 | 81.7 | 77.0 | 79.3 | | | |
| PubMedBERT | 80.1 | 84.3 | 82.1 | 82.6 | 81.9 | 82.3 | 78.8 | 75.9 | 77.3 |
| PubMedBERT_SLL_Att | 81.3 | 85.0 | 83.1 | 84.3 | 82.7 | 83.5 | 78.3 | 77.6 | 77.9 |
| PubMedBERT_SLL_Att (+P/G) | 83.6 | 80.6 | 82.1 | 77.0 | 78.4 | | | | |
| PubMedBERT_SLL_Att (+D) | 84.5 | 82.9 | 79.5 | 75.9 | 77.7 | | | | |
| PubMedBERT_SLL_Att (+CP) | 85.7 | 83.4 | 81.4 | 83.2 | | | | | |
Bold values indicate better results
P: Precision; R: Recall; F: F1 Score; BioBERT/PubMedBERT: original BERT model; BioBERT/PubMedBERT_SLL_Att: model of summarizing the outputs of the last layer using attention mechanism. +P/G: add Protein/Gene-related PubMed abstracts as sub-domain data; +D: add Drug-related PubMed abstracts as sub-domain data; +CP: add protein-related and chemical-related PubMed abstracts as sub-domain data
Model performance without using the [CLS] token in the last layer
| Model | PPI P | PPI R | PPI F | DDI P | DDI R | DDI F | ChemProt P | ChemProt R | ChemProt F |
|---|---|---|---|---|---|---|---|---|---|
| BioBERT | 79.0 | 83.3 | 81.0 | 79.9 | 78.1 | 79.0 | 74.3 | 76.3 | 75.3 |
| BioBERT_SLL_Att | 80.7 | 82.5 | 80.1 | | | | | | |
| BioBERT_SLL_Att* | 83.5 | 79.7 | 77.6 | 78.6 | 76.4 | 74.5 | 75.4 | | |
| PubMedBERT | 80.1 | 84.3 | 82.1 | 82.6 | 81.9 | 82.3 | 78.8 | 75.9 | 77.3 |
| PubMedBERT_SLL_Att | 85.0 | 78.3 | 77.6 | | | | | | |
| PubMedBERT_SLL_Att* | 80.0 | 82.4 | 82.5 | 80.9 | 81.7 | 75.7 | 76.7 | | |
Bold values indicate better results
P: Precision; R: Recall; F: F1 Score; BERT_SLL_Att*: models of fine-tuning with only the summarized information from attention mechanism (without [CLS] token)
Fig. 2 The visualization of attention weights in the last layer. a PPI example; b DDI example; c ChemProt example
Top 10 word stems with the largest learned attention weights in the PPI, DDI, and ChemProt corpora
| Task | Word stems |
|---|---|
| PPI | activ(ate), complex, associ(ate), interact, human, protein, bind, domain, specif(y), receptor |
| DDI | concomitantli, combin(e), concomit(ant), increas(e), use, concurr(ent), decreas(e), inhibit, receiv(e), administ(er) |
| ChemProt | phosphoryl(ate), attenu(ate), stimul(ate), deriv(e), regul(ate), novel, metabol(ize), reduc(e), induc(e), inhibit |
For the calculation of global attention weight, we use Porter's stemmer [32] to obtain the word stem for each word, since words might appear in different forms in a sentence. For example, the stem of "activate" is "activ", and words like "activation" and "activates" share the same word stem
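A sketch of how such global attention weights can be accumulated per word stem (the `crude_stem` helper is a simplified suffix-stripping stand-in for Porter's stemmer, which the paper actually uses):

```python
from collections import defaultdict

def crude_stem(word: str) -> str:
    # Crude suffix stripper; an illustrative stand-in for Porter's stemmer
    # (e.g. nltk.stem.PorterStemmer), which handles many more cases.
    for suffix in ("ations", "ation", "ates", "ated", "ate", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def global_attention_by_stem(examples):
    """Sum per-token attention weights over a corpus, grouped by word stem.

    `examples` is an iterable of (tokens, weights) pairs, where `weights`
    are the learned attention weights for the tokens of one sentence.
    """
    totals = defaultdict(float)
    for tokens, weights in examples:
        for tok, w in zip(tokens, weights):
            totals[crude_stem(tok.lower())] += w
    return dict(totals)
```

Ranking the resulting totals gives per-corpus lists of high-attention stems like those in the table above.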
Fig. 3BERT model training process with sub-domain adaptation
Fig. 4 Probing classifier architecture. We freeze the parameters of the BERT model during the training of the probing classifier. Through the learned weights of the probing classifier, we can assess the relevance between each layer and the task. We can also tell which layer learns the knowledge for a specific instance by building a series of probing classifiers. For the relation extraction instance "RFX5 interacts with histone deacetylase 2", if the probing classifier predicts the interacting relationship between the proteins "RFX5" and "histone deacetylase 2" correctly using the information of the first l layers, but not using the information of the first (l-1) layers, we can say that the knowledge about this instance is learned in the l-th layer
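The layer-attribution rule in this caption can be sketched as a small helper (names are illustrative; `correct_by_layer` would come from running the series of probing classifiers on one instance):

```python
def knowledge_layer(correct_by_layer):
    """Return the layer where an instance's knowledge is learned.

    `correct_by_layer[l-1]` is True iff the probing classifier built on the
    first l layers predicts the instance correctly. Following the caption,
    the instance is attributed to layer l when the classifier on the first
    l layers is correct but the one on the first (l-1) layers is not.
    Returns None if no classifier in the series is ever correct.
    """
    prev_correct = False  # "layer 0" (no BERT layers) counts as incorrect
    for l, correct in enumerate(correct_by_layer, start=1):
        if correct and not prev_correct:
            return l
        prev_correct = correct
    return None
```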
Fig. 5 Model architectures after incorporating all outputs from the last layer. In a we show both the LSTM (black lines only in the RNN box) and biLSTM (both black and grey lines in the RNN box) variants. a RNN on the last layer. b Attention on the last layer
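The attention summarization in b can be sketched in plain Python as follows; parameterizing the attention as a single learned vector is an assumption for illustration, and the paper's exact formulation may differ:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool_last_layer(hidden, w):
    """Summarize all last-layer token vectors with an attention mechanism.

    hidden: list of token vectors (lists of floats) from BERT's last layer;
    w: a learned attention parameter vector of the same dimension. Scores
    each token, normalizes the scores with softmax, and returns the
    attention-weighted sum of the token vectors.
    """
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for h in hidden]
    alphas = softmax(scores)  # one normalized weight per token
    dim = len(hidden[0])
    return [sum(a * h[d] for a, h in zip(alphas, hidden)) for d in range(dim)]
```

The classifier input would then be the [CLS] vector concatenated with this pooled vector; the *-variants in the last table drop the [CLS] part.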
Pre-processed examples for the three tasks
| Task | Label | Sentence example |
|---|---|---|
| PPI | Positive | Nuclear protein @PROTEIN$ is a coactivator for the transcription factor @PROTEIN$. |
| | Negative | Their order of selection was @PROTEIN$ effusion, @PROTEIN$ serum, TNFalpha-effusion, and C3 effusion. |
| DDI | EFFECT | @DRUG$ may increase the ototoxic potential of other drugs such as aminoglycoside and some @DRUG$. |
| | MECHANISM | Cimetidine: @DRUG$ increases @DRUG$ plasma levels. |
| ChemProt | CPR:6 | We conclude that @CHEMICAL$ and BAAM are competitive slowly reversible @PROTEIN$ antagonists on rat left atria. |
| | CPR:9 | @PROTEIN$ plays a role in purine salvage by catalyzing the direct conversion of adenine to @CHEMICAL$. |
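The @PROTEIN$/@DRUG$/@CHEMICAL$ placeholders come from masking entity mentions before feeding sentences to BERT; a minimal sketch (the helper and its character-span input format are illustrative, not the paper's actual preprocessing code):

```python
def mask_entities(sentence, entities):
    """Replace entity mentions with placeholder tokens, as in the table above.

    `entities` is a list of (start, end, type) character spans, with type
    one of "PROTEIN", "DRUG", "CHEMICAL". Spans are replaced right-to-left
    so earlier character offsets stay valid after each substitution.
    """
    for start, end, etype in sorted(entities, key=lambda e: e[0], reverse=True):
        sentence = sentence[:start] + f"@{etype}$" + sentence[end:]
    return sentence
```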