| Literature DB >> 35382811 |
Xuedong Li1, Walter Yuan2, Dezhong Peng1, Qiaozhu Mei3, Yue Wang4.
Abstract
BACKGROUND: Natural language processing (NLP) tasks in the health domain often deal with limited amounts of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks have often used training data at a scale that is unlikely to be obtained in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP.
Keywords: Bidirectional encoder representations from transformers; Disease classification; Learning curve
Year: 2022 PMID: 35382811 PMCID: PMC8981604 DOI: 10.1186/s12911-022-01829-2
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Learning curves can inform NLP method selection given a labeling budget: which method is preferred depends on the budget, and the preference can flip as the budget increases
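The selection rule illustrated in Fig. 1 can be sketched in a few lines. This is a hypothetical helper (the function name, data layout, and all numbers are illustrative, not from the paper): given each method's learning curve as (number of labels, score) points, pick the method that scores highest at the largest sample size the budget allows.

```python
# Hypothetical sketch of budget-aware method selection from learning curves.
# All names and numbers are illustrative.

def preferred_method(curves, budget):
    """curves: {name: [(n_labels, score), ...]} sorted by n_labels.
    Return the method with the highest score at the largest
    sample size not exceeding the labeling budget."""
    best_name, best_score = None, float("-inf")
    for name, points in curves.items():
        feasible = [score for n, score in points if n <= budget]
        if feasible and feasible[-1] > best_score:
            best_name, best_score = name, feasible[-1]
    return best_name

# Illustrative curves: "A" wins with few labels, "B" wins with many,
# so the preferred method flips as the budget grows.
curves = {
    "A": [(100, 0.60), (1000, 0.70), (10000, 0.75)],
    "B": [(100, 0.40), (1000, 0.65), (10000, 0.85)],
}
print(preferred_method(curves, 500))    # -> A
print(preferred_method(curves, 10000))  # -> B
```

This mirrors the figure's point that learning curves, rather than a single accuracy number, should drive method choice under an annotation budget.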
Corpora statistics
| | HaoDaiFu | ChinaRe |
|---|---|---|
| # of documents | 51,374 | 86,663 |
| # of diseases | 805 | 44 |
| # of rare diseases | 89 | 5 |
| Vocabulary size | 59,879 | 41,087 |
| Average # of words/doc | 27 | 30 |
Fig. 2Learning curves of compared algorithms averaged across all diseases in the two corpora
Area under learning curve (ALC) for different methods aggregated over all diseases
| Method | HaoDaiFu (all 805 diseases) | ChinaRe (all 44 diseases) |
|---|---|---|
| BOW | 0.4158 | 0.8534 |
| BOW_EXP | 0.4266a | 0.8934a |
| BOW_EXP_KG | 0.4254a | 0.8940a |
| CBOW | 0.2097 | 0.5817 |
| CBOW_KG | 0.2064 | 0.5714 |
| LSTM | 0.2013 | 0.6064 |
| LSTM_KG | 0.0377 | 0.6243 |
| BERT | 0.5020ab | 0.9551ab |
Figure 2 plots the learning curves of the compared methods.
aResult significantly higher than BOW
bResult significantly higher than BOW_EXP_KG. (Fisher's randomization test, significance level )
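The ALC values in the table above summarize each learning curve as a single number. A common way to compute such an area is the trapezoidal rule, normalized by the x-range so that scores in [0, 1] yield an ALC in [0, 1]. The sketch below assumes this convention; the paper's exact ALC definition (e.g., whether the x axis is log-scaled) may differ.

```python
# Minimal sketch of area under a learning curve (ALC), assuming the
# trapezoidal rule with range normalization; the paper's exact
# definition may differ.

def alc(points):
    """points: [(x, score), ...] sorted by x, with at least two points.
    Returns the trapezoidal area under the curve divided by the
    x-range, so scores in [0, 1] give an ALC in [0, 1]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (y0 + y1) / 2.0 * (x1 - x0)
    return area / (points[-1][0] - points[0][0])

# Illustrative check: a curve flat at score 0.5 has ALC 0.5.
print(alc([(100, 0.5), (1000, 0.5), (10000, 0.5)]))  # -> 0.5
```

Under this convention, a higher ALC means a method reaches better scores across the whole range of training-set sizes, not just at the largest one.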
Fig. 3Learning curves of compared algorithms averaged across very rare diseases (prevalence 0.02%) in the two corpora
Area under learning curve (ALC) for different methods aggregated over extremely rare diseases
| Method | HaoDaiFu (89 rare diseases) | ChinaRe (5 rare diseases) |
|---|---|---|
| BOW | 0.3044 | 0.8454 |
| BOW_EXP | 0.3056a | 0.9058 |
| BOW_EXP_KG | 0.3115a | 0.9034 |
| CBOW | 0.1215 | 0.1945 |
| CBOW_KG | 0.1153 | 0.2136 |
| LSTM | 0 | 0 |
| LSTM_KG | 0 | 0 |
| BERT | 0.3795ab | 0.9028 |
Figure 3 plots the learning curves of the compared methods.
aResult significantly higher than BOW
bResult significantly higher than BOW_EXP_KG. (Fisher's randomization test, significance level )