Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang.
Abstract
MOTIVATION: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.
Year: 2020 PMID: 31501885 PMCID: PMC7703786 DOI: 10.1093/bioinformatics/btz682
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of the pre-training and fine-tuning of BioBERT
List of text corpora used for BioBERT
| Corpus | Number of words | Domain |
|---|---|---|
| English Wikipedia | 2.5B | General |
| BooksCorpus | 0.8B | General |
| PubMed Abstracts | 4.5B | Biomedical |
| PMC Full-text articles | 13.5B | Biomedical |
Pre-training BioBERT on different combinations of the following text corpora: English Wikipedia (Wiki), BooksCorpus (Books), PubMed abstracts (PubMed) and PMC full-text articles (PMC)
| Model | Corpus combination |
|---|---|
| BERT | Wiki + Books |
| BioBERT (+PubMed) | Wiki + Books + PubMed |
| BioBERT (+PMC) | Wiki + Books + PMC |
| BioBERT (+PubMed + PMC) | Wiki + Books + PubMed + PMC |
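To make the scale of each pre-training setup concrete, the following sketch sums the per-corpus word counts from the corpus table for every combination listed above. The dictionary names are illustrative, and the totals are simple sums of corpus sizes; they say nothing about how many training steps were spent on each corpus.

```python
# Word counts in billions, copied from the corpus table above.
CORPUS_WORDS_B = {"Wiki": 2.5, "Books": 0.8, "PubMed": 4.5, "PMC": 13.5}

# Corpus combinations, copied from the model table above.
MODEL_CORPORA = {
    "BERT": ["Wiki", "Books"],
    "BioBERT (+PubMed)": ["Wiki", "Books", "PubMed"],
    "BioBERT (+PMC)": ["Wiki", "Books", "PMC"],
    "BioBERT (+PubMed + PMC)": ["Wiki", "Books", "PubMed", "PMC"],
}


def total_words_b(corpora):
    """Return the combined corpus size in billions of words."""
    return round(sum(CORPUS_WORDS_B[name] for name in corpora), 1)


totals = {model: total_words_b(corpora) for model, corpora in MODEL_CORPORA.items()}

for model, total in totals.items():
    print(f"{model}: {total}B words")
```

Under these figures, the general-domain corpora account for 3.3B words, while the largest combination reaches 21.3B words.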
Statistics of the biomedical named entity recognition datasets
| Dataset | Entity type | Number of annotations |
|---|---|---|
| NCBI Disease | Disease | 6881 |
| 2010 i2b2/VA | Disease | 19 665 |
| BC5CDR | Disease | 12 694 |
| BC5CDR | Drug/Chem. | 15 411 |
| BC4CHEMD | Drug/Chem. | 79 842 |
| BC2GM | Gene/Protein | 20 703 |
| JNLPBA | Gene/Protein | 35 460 |
| LINNAEUS | Species | 4077 |
| Species-800 | Species | 3708 |
Note: The number of annotations from Habibi and Zhu is provided.
Statistics of the biomedical relation extraction datasets
| Dataset | Entity type | Number of relations |
|---|---|---|
| GAD | Gene–disease | 5330 |
| EU-ADR | Gene–disease | 355 |
| CHEMPROT | Protein–chemical | 10 031 |
Note: For the CHEMPROT dataset, the number of relations in the training, validation and test sets was summed.
Statistics of biomedical question answering datasets
| Dataset | Number of training questions | Number of test questions |
|---|---|---|
| BioASQ 4b-factoid | 327 | 161 |
| BioASQ 5b-factoid | 486 | 150 |
| BioASQ 6b-factoid | 618 | 161 |
Test results in biomedical named entity recognition
| Type | Datasets | Metrics | SOTA | BERT (Wiki + Books) | BioBERT v1.0 (+ PubMed) | BioBERT v1.0 (+ PMC) | BioBERT v1.0 (+ PubMed + PMC) | BioBERT v1.1 (+ PubMed) |
|---|---|---|---|---|---|---|---|---|
| Disease | NCBI disease | P |  | 84.12 | 86.76 | 86.16 |  | 88.22 |
| Disease | NCBI disease | R | 89.00 | 87.19 | 88.02 | 89.48 |  |  |
| Disease | NCBI disease | F | 88.60 | 85.63 | 87.38 | 87.79 |  |  |
| Disease | 2010 i2b2/VA | P |  | 84.04 | 85.37 | 85.55 |  | 86.93 |
| Disease | 2010 i2b2/VA | R |  | 84.08 | 85.64 | 85.72 | 85.44 |  |
| Disease | 2010 i2b2/VA | F |  | 84.06 | 85.51 | 85.64 | 86.46 |  |
| Disease | BC5CDR | P |  | 81.97 | 85.80 | 84.67 | 85.86 |  |
| Disease | BC5CDR | R | 83.09 | 82.48 | 86.60 | 85.87 |  |  |
| Disease | BC5CDR | F |  | 82.41 | 86.20 | 85.27 | 86.56 |  |
| Drug/chem. | BC5CDR | P |  | 90.94 | 92.52 | 92.46 | 93.27 |  |
| Drug/chem. | BC5CDR | R | 92.38 | 91.38 | 92.76 | 92.63 |  |  |
| Drug/chem. | BC5CDR | F | 93.31 | 91.16 | 92.64 | 92.54 |  |  |
| Drug/chem. | BC4CHEMD | P |  | 91.19 | 91.77 | 91.65 | 92.23 |  |
| Drug/chem. | BC4CHEMD | R | 90.01 | 88.92 |  | 90.30 | 90.61 |  |
| Drug/chem. | BC4CHEMD | F | 91.14 | 90.04 | 91.26 | 90.97 |  |  |
| Gene/protein | BC2GM | P | 81.81 | 81.17 | 81.72 | 82.86 |  |  |
| Gene/protein | BC2GM | R | 81.57 | 82.42 | 83.38 |  | 83.65 |  |
| Gene/protein | BC2GM | F | 81.69 | 81.79 | 82.54 | 83.53 |  |  |
| Gene/protein | JNLPBA | P |  | 69.57 | 71.11 | 71.17 |  | 72.24 |
| Gene/protein | JNLPBA | R |  | 81.20 | 83.11 | 82.76 | 83.21 |  |
| Gene/protein | JNLPBA | F |  | 74.94 | 76.65 | 76.53 |  | 77.49 |
| Species | LINNAEUS | P |  | 91.17 | 91.83 | 91.62 |  | 90.77 |
| Species | LINNAEUS | R |  | 84.30 | 84.72 | 85.48 |  | 85.83 |
| Species | LINNAEUS | F |  | 87.60 | 88.13 | 88.45 |  | 88.24 |
| Species | Species-800 | P |  | 69.35 | 70.60 | 71.54 |  | 72.80 |
| Species | Species-800 | R |  | 74.05 | 75.75 | 74.71 |  | 75.36 |
| Species | Species-800 | F |  | 71.63 | 73.08 | 73.09 |  | 74.06 |
Notes: Precision (P), Recall (R) and F1 (F) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. We list the scores of the state-of-the-art (SOTA) models on different datasets as follows: scores of Xu on NCBI Disease, scores of Sachan on BC2GM, scores of Zhu (single model) on 2010 i2b2/VA, scores of Lou on BC5CDR-disease, scores of Luo on BC4CHEMD, scores of Yoon on BC5CDR-chemical and JNLPBA and scores of Giorgi and Bader (2018) on LINNAEUS and Species-800.
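As a reminder of how these three metrics relate, here is a minimal sketch of entity-level scoring. It assumes exact-match evaluation, where a predicted entity counts only if its span and type both match a gold annotation; this section does not state the matching criterion, so treat that as an assumption. The function names and the toy spans are hypothetical.

```python
def prf1(gold, pred):
    """Entity-level precision, recall and F1 under exact span matching.

    gold and pred are iterables of (start, end, entity_type) tuples.
    Exact matching is an assumption made for this sketch.
    """
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Toy example: three gold entities, two predictions, one exact match.
gold_spans = [(0, 2, "Disease"), (5, 6, "Disease"), (9, 10, "Disease")]
pred_spans = [(0, 2, "Disease"), (5, 7, "Disease")]
p, r, f = prf1(gold_spans, pred_spans)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")

# Consistency check against the table: BERT on NCBI disease has
# P = 84.12 and R = 87.19, and the reported F is 85.63.
print(round(f1_from_pr(84.12, 87.19), 2))
```

The second call shows that the F column is the harmonic mean of the P and R columns, which is a quick way to sanity-check any row of the table.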
Biomedical relation extraction test results
| Relation | Datasets | Metrics | SOTA | BERT (Wiki + Books) | BioBERT v1.0 (+ PubMed) | BioBERT v1.0 (+ PMC) | BioBERT v1.0 (+ PubMed + PMC) | BioBERT v1.1 (+ PubMed) |
|---|---|---|---|---|---|---|---|---|
| Gene–disease | GAD | P |  | 74.28 | 76.43 | 75.20 | 75.95 |  |
| Gene–disease | GAD | R |  | 85.11 | 87.65 | 86.15 |  | 82.68 |
| Gene–disease | GAD | F |  | 79.29 |  | 80.24 | 81.52 | 79.83 |
| Gene–disease | EU-ADR | P | 76.43 | 75.45 | 78.04 |  |  | 77.86 |
| Gene–disease | EU-ADR | R |  |  | 93.86 | 93.90 | 90.81 | 83.55 |
| Gene–disease | EU-ADR | F |  | 84.62 | 84.44 |  | 84.83 | 79.74 |
| Protein–chemical | CHEMPROT | P | 74.80 | 76.02 | 76.05 |  | 75.20 |  |
| Protein–chemical | CHEMPROT | R | 56.00 | 71.60 | 74.33 | 72.94 |  |  |
| Protein–chemical | CHEMPROT | F | 64.10 | 73.74 |  | 75.13 | 75.14 |  |
Notes: Precision (P), Recall (R) and F1 (F) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. The scores on GAD and EU-ADR were obtained from Bhasuran and Natarajan (2018), and the scores on CHEMPROT were obtained from Lim and Kang (2018).
Biomedical question answering test results
| Datasets | Metrics | SOTA | BERT (Wiki + Books) | BioBERT v1.0 (+ PubMed) | BioBERT v1.0 (+ PMC) | BioBERT v1.0 (+ PubMed + PMC) | BioBERT v1.1 (+ PubMed) |
|---|---|---|---|---|---|---|---|
| BioASQ 4b | S | 20.01 | 27.33 | 25.47 | 26.09 |  |  |
| BioASQ 4b | L | 28.81 |  |  | 42.24 |  | 44.10 |
| BioASQ 4b | M | 23.52 | 33.77 | 33.28 | 32.42 |  |  |
| BioASQ 5b | S | 41.33 | 39.33 | 41.33 | 42.00 |  |  |
| BioASQ 5b | L |  | 52.67 | 55.33 | 54.67 |  |  |
| BioASQ 5b | M | 47.24 | 44.27 | 46.73 | 46.93 |  |  |
| BioASQ 6b | S | 24.22 | 33.54 |  | 41.61 | 40.37 |  |
| BioASQ 6b | L | 37.89 | 51.55 | 55.90 | 55.28 |  |  |
| BioASQ 6b | M | 27.84 | 40.88 |  | 47.02 | 47.48 |  |
Notes: Strict Accuracy (S), Lenient Accuracy (L) and Mean Reciprocal Rank (M) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. The best BioASQ 4b/5b/6b scores were obtained from the BioASQ leaderboard (http://participants-area.bioasq.org).
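The three QA metrics can be illustrated with a short sketch. It assumes the usual factoid setup in which a system returns a ranked list of candidate answers per question: strict accuracy counts a question only if the top-ranked candidate is correct, lenient accuracy counts it if any candidate in the list is correct, and MRR averages the reciprocal rank of the first correct candidate. The function name, the `top_k=5` cutoff, the case-insensitive string matching, and the toy answers are all assumptions of this sketch, not details taken from this section.

```python
def factoid_scores(ranked_predictions, gold_answers, top_k=5):
    """Return (strict accuracy, lenient accuracy, MRR) over a set of questions.

    ranked_predictions: one ranked list of answer strings per question.
    gold_answers: one set of acceptable answer strings per question.
    top_k: how many ranked candidates are considered (assumed cutoff).
    """
    strict_hits = 0
    lenient_hits = 0
    reciprocal_rank_sum = 0.0
    for candidates, gold in zip(ranked_predictions, gold_answers):
        gold_normalized = {answer.lower() for answer in gold}
        # Rank (1-based) of the first correct candidate, or None if absent.
        rank = next(
            (i for i, cand in enumerate(candidates[:top_k], start=1)
             if cand.lower() in gold_normalized),
            None,
        )
        if rank is not None:
            lenient_hits += 1
            reciprocal_rank_sum += 1.0 / rank
            if rank == 1:
                strict_hits += 1
    n = len(gold_answers)
    return strict_hits / n, lenient_hits / n, reciprocal_rank_sum / n


# Hypothetical toy data: correct at rank 1, correct at rank 2, never correct.
predictions = [
    ["answer a", "answer x"],
    ["answer y", "Answer B"],
    ["answer z"],
]
gold = [{"answer a"}, {"answer b"}, {"answer c"}]

strict, lenient, mrr = factoid_scores(predictions, gold)
print(f"S={strict:.4f} L={lenient:.4f} M={mrr:.4f}")
```

By construction strict accuracy can never exceed MRR, and MRR can never exceed lenient accuracy, which matches the ordering S ≤ M ≤ L visible in every block of the table.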
Fig. 2. (a) Effects of varying the size of the PubMed corpus for pre-training. (b) NER performance of BioBERT at different checkpoints. (c) Performance improvement of BioBERT v1.0 (+ PubMed + PMC) over BERT
Prediction samples from BERT and BioBERT on NER and QA datasets
| Task | Dataset | Model | Sample |
|---|---|---|---|
| NER | NCBI disease | BERT | WT1 missense mutations, associated with male pseudohermaphroditism in |
| NER | NCBI disease | BioBERT | WT1 missense mutations, associated with |
| NER | BC5CDR (Drug/Chem.) | BERT | … a case of oral |
| NER | BC5CDR (Drug/Chem.) | BioBERT | … a case of oral |
| NER | BC2GM | BERT | Like the DMA, but unlike all other mammalian class II A genes, the zebrafish gene codes for two cysteine residues … |
| NER | BC2GM | BioBERT | Like the |
| QA | BioASQ 6b-factoid |  | Q: Which type of urinary incontinence is diagnosed with the Q tip test? |
| QA | BioASQ 6b-factoid | BERT | A total of 25 women affected by clinical |
| QA | BioASQ 6b-factoid | BioBERT | A total of 25 women affected by clinical |
| QA | BioASQ 6b-factoid |  | Q: Which bacteria causes erythrasma? |
| QA | BioASQ 6b-factoid | BERT |  |
| QA | BioASQ 6b-factoid | BioBERT |  |
Note: Predicted named entities for NER and predicted answers for QA are in bold.