Cong Sun, Zhihao Yang, Lei Wang, Yin Zhang, Hongfei Lin, Jian Wang.
Abstract
BACKGROUND: The recognition of pharmacological substances, compounds and proteins is essential for biomedical relation extraction, knowledge graph construction, drug discovery, and medical question answering. Although considerable effort has gone into recognizing biomedical entities in English texts, only a few attempts have been made to recognize them in biomedical texts in other languages. PharmaCoNER is a named entity recognition challenge for recognizing pharmacological entities in Spanish texts. Because natural language processing resources are now abundant, how to leverage them for the PharmaCoNER challenge is a question worth studying.
Keywords: BERT; Language model; NER; Named entity recognition; Text mining
Year: 2021 PMID: 34920700 PMCID: PMC8684061 DOI: 10.1186/s12859-021-04260-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The statistical information of the PharmaCoNER corpus
| Set | Training | Development | Test | Total |
|---|---|---|---|---|
| Documents | 500 | 250 | 250 | 1000 |
| Sentences | 7003 | 3454 | 3403 | 13860 |
| NORMALIZABLES | 2304 | 1121 | 973 | 4398 |
| NO_NORMALIZABLES | 24 | 16 | 10 | 50 |
| PROTEINAS | 1405 | 745 | 859 | 3009 |
| UNCLEAR | 89 | 44 | 34 | 167 |
Fig. 1 The processing flowchart of our approach
Fig. 2 The architecture of the BERT model
Fig. 3 Overview of the pre-training process of various BERT models. Panels adapted from Lee et al. [22]
Comparison of existing BERTs
| Model | Corpus combination | Vocabulary |
|---|---|---|
| BERT(Cased) | Wiki+Books(Original) | BERT |
| BERT(Uncased) | Wiki+Books(Original) | BERT |
| NCBI BERT(+P,Uncased) | Original+PubMed | BERT |
| NCBI BERT(+P+M,Uncased) | Original+PubMed+MIMIC-III | BERT |
| Spanish BERT(Cased) | Original+Spanish Wikipedia+OPUS | Spanish BERT |
| Spanish BERT(Uncased) | Original+Spanish Wikipedia+OPUS | Spanish BERT |
| MultiBERT(Cased) | Multilingual Wikipedia | MultiBERT |
| MultiBERT(Uncased) | Multilingual Wikipedia | MultiBERT |
| SciBERT(BertVoc,Cased) | Original+Biomedical+Scientific | BERT |
| SciBERT(BertVoc,Uncased) | Original+Biomedical+Scientific | BERT |
| SciBERT(SciVoc,Cased) | Original+Biomedical+Scientific | SciBERT |
| SciBERT(SciVoc,Uncased) | Original+Biomedical+Scientific | SciBERT |
| BioBERTv1.0(+P,Cased) | Original+PubMed | BERT |
| BioBERTv1.0(+PMC,Cased) | Original+PMC | BERT |
| BioBERTv1.0(+P+PMC,Cased) | Original+PubMed+PMC | BERT |
| BioBERTv1.1(+P,Cased) | Original+PubMed | BERT |
Detailed experimental settings
| Parameters | Tune range | Optimal |
|---|---|---|
| Sequence length | [128, 256, 300] | 300 |
| Train batch size | [8, 16, 32] | 16 |
| Dev batch size | 16 | 16 |
| Test batch size | 16 | 16 |
| Learning rate | [1e−05, 2e−05, 3e−05] | 2e−05 |
| Epoch number | [10, 20, 30, 50] | 20 |
| Warmup | 0.1 | 0.1 |
| Dropout | 0.1 | 0.1 |
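The tune ranges in the table can be enumerated as a search space. The sketch below is only an illustration: the record does not say whether the authors ran an exhaustive grid search, and the dictionary keys (`max_seq_length`, `warmup_proportion`, and so on) are hypothetical names chosen here, not identifiers from the authors' code.

```python
from itertools import product

# Tune ranges copied from the settings table (key names are assumptions).
SEARCH_SPACE = {
    "max_seq_length": [128, 256, 300],
    "train_batch_size": [8, 16, 32],
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "num_epochs": [10, 20, 30, 50],
}

# Parameters the table lists with a single value, held fixed here.
FIXED = {
    "dev_batch_size": 16,
    "test_batch_size": 16,
    "warmup_proportion": 0.1,
    "dropout": 0.1,
}

# Values the table reports as optimal.
OPTIMAL = {
    "max_seq_length": 300,
    "train_batch_size": 16,
    "learning_rate": 2e-5,
    "num_epochs": 20,
}


def iter_configs():
    """Yield one full configuration dict per point of the grid."""
    keys = list(SEARCH_SPACE)
    for values in product(*(SEARCH_SPACE[k] for k in keys)):
        yield {**FIXED, **dict(zip(keys, values))}


configs = list(iter_configs())
print(len(configs))  # 3 * 3 * 3 * 4 grid points
print({**FIXED, **OPTIMAL} in configs)  # the reported optimum is one of them
```

An exhaustive sweep over this space would mean 108 fine-tuning runs per model, so a staged or partial search is equally plausible; the table alone does not settle which was used.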
Performance comparison on the PharmaCoNER dataset
| Method | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| Baseline-Glove | 83.26 | 81.00 | 82.11 |
| Baseline-Med | 87.02 | 83.71 | 85.34 |
| Sun et al. | 90.46 | 88.06 | 89.24 |
| Stoeckel et al. | 90.79 | 90.30 | 90.52 |
| Xiong et al. | 91.23 | 90.88 | 91.05 |
| Our method (BioBERTv1.1(+P,Cased)) | | | |
‘P’ denotes PubMed
The highest values are shown in bold
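As a reading aid, the F1 column can be related to the P and R columns. The sketch below assumes F1 is the usual harmonic mean of precision and recall, which the record does not state explicitly. Recomputing it from the rounded P and R values reproduces the reported F1 to within a few hundredths of a point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) for the rows of the table that list all three values.
rows = {
    "Baseline-Glove": (83.26, 81.00, 82.11),
    "Baseline-Med": (87.02, 83.71, 85.34),
    "Sun et al.": (90.46, 88.06, 89.24),
    "Stoeckel et al.": (90.79, 90.30, 90.52),
    "Xiong et al.": (91.23, 90.88, 91.05),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

Small gaps between computed and reported values are expected, since P and R are themselves rounded to two decimals.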
Performance comparison of various BERTs
| Method | Mean ± SD: P (%) | Mean ± SD: R (%) | Mean ± SD: F1 (%) | Max: P (%) | Max: R (%) | Max: F1 (%) |
|---|---|---|---|---|---|---|
| BERT(Cased) | 89.31 ± 0.26 | 88.00 ± 0.16 | 88.65 ± 0.12 | 89.51 | 88.06 | 88.78 |
| BERT(Uncased) | 89.60 ± 0.81 | 88.13 ± 0.40 | 88.86 ± 0.57 | 90.32 | 88.65 | 89.48 |
| NCBI BERT(P+M,Uncased) | 89.29 ± 0.67 | 87.11 ± 0.60 | 88.18 ± 0.35 | 89.58 | 87.30 | 88.42 |
| NCBI BERT(P,Uncased) | 90.20 ± 0.38 | 88.88 ± 0.52 | 89.53 ± 0.37 | 90.76 | 89.58 | 90.16 |
| Spanish BERT(Uncased) | 89.69 ± 0.74 | 90.56 ± 0.58 | 90.12 ± 0.37 | 90.47 | 90.72 | 90.59 |
| Spanish BERT(Cased) | 90.42 ± 0.77 | 90.51 ± 0.69 | 90.47 ± 0.69 | 91.76 | 91.31 | 91.54 |
| MultiBERT(Cased) | 89.53 ± 0.27 | 89.99 ± 0.43 | 89.76 ± 0.19 | 89.75 | 90.34 | 90.04 |
| MultiBERT(Uncased) | 90.74 ± 0.35 | 90.39 ± 0.37 | 90.56 ± 0.25 | 91.02 | 90.77 | 90.89 |
| SciBERT(Bertvoc,Cased) | 90.36 ± 0.75 | 89.55 ± 0.30 | 89.96 ± 0.40 | 91.66 | 89.52 | 90.58 |
| SciBERT(Bertvoc,Uncased) | 91.07 ± 0.71 | 89.00 ± 0.45 | 90.02 ± 0.55 | 91.85 | 89.36 | 90.59 |
| SciBERT(Scivoc,Uncased) | 90.75 ± 0.86 | 90.27 ± 0.32 | 90.51 ± 0.40 | 92.03 | 90.28 | 91.15 |
| SciBERT(Scivoc,Cased) | 91.25 ± 0.69 | 90.30 ± 0.58 | 90.77 ± 0.40 | 92.40 | 89.74 | 91.05 |
| BioBERTv1.0(+PMC,Cased) | 90.54 ± 0.71 | 89.59 ± 0.31 | 90.06 ± 0.45 | 91.09 | 89.90 | 90.49 |
| BioBERTv1.0(+P,Cased) | 90.44 ± 0.34 | 89.98 ± 0.64 | 90.21 ± 0.36 | 90.75 | 90.55 | 90.65 |
| BioBERTv1.0(+P+PMC,Cased) | 91.08 ± 0.86 | 89.76 ± 0.52 | 90.41 ± 0.42 | 91.13 | 90.34 | 90.73 |
| BioBERTv1.1(+P,Cased) | ||||||
‘P’ and ‘M’ denote PubMed and MIMIC-III, respectively. The table is sorted according to the average F1-score, and the highest values are shown in bold
* indicates a significant difference between the means of two models according to the t-test. Specifically, it marks a model whose mean differs significantly from that of BioBERTv1.1(+P,Cased) at the 95% confidence level (p < 0.05)
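The significance note compares the mean scores of two models over repeated runs. The sketch below shows one way such a comparison can be computed. It uses Welch's unequal-variance form of the t-test, which is an assumption, since the record does not say which variant was applied or how many runs were made. The score lists are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical F1 scores from repeated runs of two models (not from the paper).
model_a = [90.1, 90.5, 90.3, 90.8, 90.4]
model_b = [88.6, 88.8, 88.5, 88.7, 88.7]

t, df = welch_t(model_a, model_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

To decide significance at the 0.05 level, |t| would be compared against the two-sided critical value of the t distribution for the computed degrees of freedom, taken from a table or a statistics library.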
Performance of each type for PharmaCoNER
| Entity type | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| NORMALIZABLES | 95.33 | 94.35 | 94.83 |
| NO_NORMALIZABLES | 14.29 | 20.00 | 16.67 |
| PROTEINAS | 90.45 | 89.29 | 89.87 |
| Overall | 92.44 | 91.59 | 92.01 |
Performance comparison of BERT-CRF and BERT-Softmax
| Method | Mean ± SD: P (%) | Mean ± SD: R (%) | Mean ± SD: F1 (%) | Max: P (%) | Max: R (%) | Max: F1 (%) |
|---|---|---|---|---|---|---|
| BERT-CRF | 90.42 ± 1.16 | 89.59 ± 0.36 | 90.00 ± 0.68 | 91.69 | 89.90 | 90.79 |
| BERT-Softmax | ||||||
‘BERT’ refers to BioBERTv1.1(+P,Cased)
The highest values are shown in bold
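The two output layers compared here differ in how labels are decoded. A softmax layer picks the best label for each token independently, while a CRF layer scores whole label sequences using transition scores between adjacent labels. The toy sketch below illustrates that difference only. The label set, the emission scores and the hand-set transition penalty are all assumptions made for the example; in a trained CRF the transitions are learned, and the record does not specify the tagging scheme.

```python
LABELS = ["O", "B", "I"]  # assumed BIO-style label set


def greedy_decode(emissions):
    """Softmax-style decoding: best label per token, chosen independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in emissions]


def viterbi_decode(emissions, transitions):
    """CRF-style decoding: best sequence under emission + transition scores."""
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label i for arriving at label j.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + row[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    best = max(range(n_labels), key=score.__getitem__)
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]


# Toy per-token scores for three tokens (columns follow LABELS).
emissions = [[1.0, 0.8, 0.1], [0.3, 0.2, 2.0], [1.5, 0.1, 0.2]]
# transitions[i][j] scores moving from label i to label j; O -> I is penalized.
transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

greedy = [LABELS[k] for k in greedy_decode(emissions)]
viterbi = [LABELS[k] for k in viterbi_decode(emissions, transitions)]
print(greedy)   # ['O', 'I', 'O']: an I tag directly after O
print(viterbi)  # ['B', 'I', 'O']: the penalty steers decoding to a valid span
```

The toy case only shows how the two decoders can disagree; it says nothing about which layer scores higher on PharmaCoNER.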
Examples of errors in recognizing biomedical entities by BioBERTv1.1(+P,Cased)
| | Error examples | Number of errors in this type |
|---|---|---|
| Gold: | Se solicita serología de | 76 |
| Pred: | Se solicita serología de | |
| Gold: | A esto se añadía alteración de | 58 |
| Pred: | A esto se añadía alteración de | |
| Gold: | ... a dosis plenas (1 mg/kg/día) y | 39 |
| Pred: | ... a dosis plenas (1 mg/kg/día) y | |
| Gold: | La ecografía mostró derrame pleural loculado, administrándose en consecuencia 200,000 UI de | 9 |
| Pred: | La ecografía mostró derrame pleural loculado, administrándose en consecuencia 200,000 UI de | |
‘Gold’ denotes the gold standard, and ‘Pred’ denotes the prediction results. Bold marks the gold-standard entities and bold italic marks the predicted entities. Unless otherwise specified, a token defaults to the ‘O’ type, meaning it is not part of a chemical/protein entity