Yuting Guo, Yao Ge, Yuan-Chi Yang, Mohammed Ali Al-Garadi, Abeed Sarker.
Abstract
Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performance in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both outperformed the remaining models. BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. Among the pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvements in three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance.
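All three extended-pretraining strategies (DAPT, SAPT, TSPT) continue masked-language-model (MLM) training on a new text collection; only the corpus changes, not the objective. The standard BERT-style masking step they share can be sketched in plain Python. The sample sentence and the tiny `VOCAB` below are illustrative placeholders, not the authors' data or code:

```python
import random

MASK = "[MSK]"
VOCAB = ["cancer", "tweet", "drug", "covid", "risk"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style MLM masking: select ~15% of positions; of those,
    replace 80% with [MSK], 10% with a random token, keep 10% as-is.
    Returns the corrupted sequence and per-position prediction targets."""
    rng = rng or random.Random(0)
    out, labels = list(tokens), [None] * len(tokens)
    n_mask = max(1, round(mask_prob * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_mask):
        labels[i] = tokens[i]            # the model must recover this token
        r = rng.random()
        if r < 0.8:
            out[i] = MASK                # 80%: mask
        elif r < 0.9:
            out[i] = rng.choice(VOCAB)   # 10%: random replacement
        # else 10%: leave the original token in place
    return out, labels

corrupted, targets = mask_tokens(
    "my aunt was diagnosed with breast cancer last year".split())
```

Under DAPT the corrupted/target pairs would come from PubMed text, under SAPT from Twitter, and under TSPT from topic-filtered posts.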
Keywords: machine learning; social media; text classification
Year: 2022 PMID: 36011135 PMCID: PMC9408372 DOI: 10.3390/healthcare10081478
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
Details of the classification tasks and the data statistics. F1 denotes the F1-score for the positive class, and micro-F1 denotes the micro-averaged F1-score among all the classes. * For NPMU, the metric is the F1-score of the non-medical use class. TRN, TST, and L denote the training set size, the test set size, and the number of classes, respectively. EHF denotes e-health forum. IAA is the inter-annotator agreement, where Task 4 used Fleiss' kappa, Task 13 used Krippendorff's alpha, Tasks 17–22 provided IAA but did not mention the coefficient used, and the other tasks used Cohen's kappa.
| ID | Task | Source | Evaluation Metric | TRN | TST | L | IAA |
|---|---|---|---|---|---|---|---|
| 1 | ADR Detection | | | 4318 | 1152 | 2 | 0.71 |
| 2 | Breast Cancer | | | 3513 | 1204 | 2 | 0.85 |
| 3 | NPMU characterization | | | 11,829 | 3271 | 4 | 0.86 |
| 4 | WNUT-20-T2 (informative COVID-19 tweet detection) | | | 6238 | 1000 | 2 | 0.80 |
| 5 | SMM4H-17-T1 (ADR detection) | | | 5340 | 6265 | 2 | 0.69 |
| 6 | SMM4H-17-T2 (medication consumption) | | | 7291 | 5929 | 3 | 0.88 |
| 7 | SMM4H-21-T1 (ADR detection) | | | 15,578 | 913 | 2 | - |
| 8 | SMM4H-21-T3a (regimen change on Twitter) | | | 5295 | 1572 | 2 | - |
| 9 | SMM4H-21-T3b (regimen change on WebMD) | WebMD | | 9344 | 1297 | 2 | - |
| 10 | SMM4H-21-T4 (adverse pregnancy outcomes) | | | 4926 | 973 | 2 | 0.90 |
| 11 | SMM4H-21-T5 (COVID-19 potential case) | | | 5790 | 716 | 2 | 0.77 |
| 12 | SMM4H-21-T6 (COVID-19 symptom) | | | 8188 | 500 | 3 | - |
| 13 | Suicidal Ideation Detection | | | 1695 | 553 | 6 | 0.88 |
| 14 | Drug Addiction and Recovery Intervention | | | 2032 | 601 | 5 | - |
| 15 | eRisk-21-T1 (Signs of Pathological Gambling) | | | 1511 | 481 | 2 | - |
| 16 | eRisk-21-T2 (Signs of Self-Harm) | | | 926 | 284 | 2 | - |
| 17 | Sentiment Analysis in EHF (Food Allergy Related) | MedHelp | | 618 | 191 | 3 | 0.75 |
| 18 | Sentiment Analysis in EHF (Crohn’s Disease Related) | MedHelp | | 1056 | 317 | 3 | 0.72 |
| 19 | Sentiment Analysis in EHF (Breast Cancer Related) | MedHelp | | 551 | 161 | 3 | 0.75 |
| 20 | Factuality Classification in EHF (Food Allergy Related) | MedHelp | | 580 | 159 | 3 | 0.73 |
| 21 | Factuality Classification in EHF (Crohn’s Disease Related) | MedHelp | | 1018 | 323 | 3 | 0.75 |
| 22 | Factuality Classification in EHF (Breast Cancer Related) | MedHelp | | 524 | 161 | 3 | 0.75 |
Figure 1. The model architectures for MLM (a) and classification (b). [CLS] and [SEP] are two special tokens indicating the start and end of the text sequence, and [MSK] denotes a masked token.
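In the classification setup of panel (b), the pooled [CLS] representation is typically passed through a linear layer followed by a softmax over the classes. A minimal pure-Python sketch of such a head follows; the 4-dimensional embedding and the weights are made-up toy values, not the paper's trained parameters:

```python
import math

def classify(h_cls, W, b):
    """Linear classification head over the [CLS] embedding:
    softmax(W @ h_cls + b), one row of W per class."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) + bk
              for row, bk in zip(W, b)]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

h_cls = [0.2, -0.4, 0.1, 0.9]            # toy pooled [CLS] vector
W = [[0.5, 0.1, -0.3, 0.2],              # one weight row per class
     [-0.2, 0.4, 0.0, 0.6]]              # (binary task, as in ADR detection)
b = [0.0, 0.1]
probs = classify(h_cls, W, b)            # probability for each class
```

In practice this head is trained jointly with the transformer encoder during fine-tuning; only the pooling-plus-softmax shape is shown here.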
Comparison of six pretrained transformer-based models on 22 text classification tasks. The metric for each task is shown along with 95% confidence intervals. The best model for each task is highlighted in boldface. Models that are statistically significantly better than all other models on the same task are underlined.
| Task | BERT | RoBERTa | BERTweet | Twitter BERT | BioClinical BERT | BioBERT |
|---|---|---|---|---|---|---|
| ADR Detection | 56.3 | 60.6 | 57.6 | 58.9 | 60.2 | |
| Breast Cancer | 84.7 | 87.4 | 86.3 | 83.0 | 83.9 | |
| NPMU | 59.5 | 61.8 | 59.5 | 56.8 | 52.7 | |
| WNUT-20-T2 (COVID-19) | 87.8 | 88.7 | 87.1 | 86.1 | 87.4 | |
| SMM4H-17-T1 (ADR detection) | 48.6 | 50.7 | 47.6 | 45.5 | 44.5 | |
| SMM4H-17-T2 (Medication consumption) | 76.8 | 79.2 | 77.6 | 74.7 | 75.2 | |
| SMM4H-21-T1 (ADR detection) | 68.3 | 66.2 | 64.9 | 64.9 | 62.7 | |
| SMM4H-21-T3a (Regimen change on Twitter) | 55.5 | 57.6 | 54.0 | 53.6 | 55.0 | |
| SMM4H-21-T3b (Regimen change on WebMD) | 87.7 | 87.6 | 87.7 | 86.7 | 87.1 | |
| SMM4H-21-T4 (Adverse pregnancy outcomes) | 86.0 | 88.8 | 88.4 | 83.4 | 83.3 | |
| SMM4H-21-T5 (COVID-19 potential case) | 69.5 | 71.0 | 70.9 | 65.0 | 66.4 | |
| SMM4H-21-T6 (COVID-19 symptom) | 98.0 | 98.2 | 97.8 | 97.8 | 98.2 | |
| Suicidal Ideation Detection | 63.9 | 63.3 | 59.8 | 61.7 | 61.7 | |
| Drug Addiction and Recovery Intervention | 71.9 | 71.9 | 69.9 | 69.7 | 69.7 | |
| eRisk-21-T1 (Signs of Pathological Gambling) | 73.9 | 67.9 | 70.2 | 68.1 | 62.7 | |
| eRisk-21-T2 (Signs of Self-Harm) | 49.1 | 48.6 | 49.2 | 40.0 | 45.2 | |
| EHF Sentiment Analysis (Food Allergy) | 74.3 | 74.3 | 71.2 | 71.7 | 74.9 | |
| EHF Sentiment Analysis (Crohn’s Disease) | 77.3 | 78.2 | 75.4 | 75.7 | 75.7 | |
| EHF Sentiment Analysis (Breast Cancer) | 73.9 | 70.8 | 72.7 | 73.9 | 70.2 | |
| EHF Factuality Classification (Food Allergy) | 76.1 | 76.1 | 76.1 | 70.4 | 76.7 | |
| EHF Factuality Classification (Crohn’s Disease) | 83.0 | 84.2 | 84.8 | 82.4 | 81.4 | |
| EHF Factuality Classification (Breast Cancer) | 75.8 | 75.2 | 74.5 | 75.8 | 72.0 |
Performance metrics obtained by models after pretraining on different data collections. The metric for breast cancer and COVID-19 is the F1-score of the positive class, and the metric for NPMU is the F1-score for the non-medical use class. RB and BT denote RoBERTa and BERTweet, respectively. Data sizes for extended pretraining are shown at the bottom. The best model for each task is shown in boldface. The models underlined are statistically significantly better than their initial models (i.e., RoBERTa and BERTweet without continual pretraining in Table 2).
| Continual Pretraining Data | Initial Model | Breast Cancer | Breast Cancer | NPMU | NPMU | COVID-19 | COVID-19 |
|---|---|---|---|---|---|---|---|
| OpenWebText (generic) | RB | 87.6 | 87.3 | 59.5 | 57.2 | 89.2 | 88.5 |
| | BT | 86.5 | 87.1 | 61.6 | 62.1 | 88.5 | 87.9 |
| Twitter + off-topic (SAPT) | RB | 87.5 | 86.4 | | | 89.2 | |
| | BT | 86.9 | 87.6 | 65.7 | 64.7 | 90.1 | |
| Twitter + on-topic (SAPT + TSPT) | RB | 88.9 | | | | | |
| | BT | 89.1 | 66.7 | | | | |
| PubMed + off-topic (DAPT) | RB | 85.1 | - | 55.8 | - | 89.0 | - |
| | BT | 85.9 | - | 58.8 | - | 88.8 | - |
| PubMed + on-topic (DAPT + TSPT) | RB | 85.8 | - | 58.6 | - | 89.8 | - |
| | BT | 86.9 | - | 60.2 | - | 89.2 | - |
| Data size | - | 298K | 1M | 586K | 1M | 272K | 1M |
Figure 2. Histograms of the distributions of cosine similarities for the models initialized from RoBERTa_Base and pretrained on 298K, 586K, and 272K samples for the breast cancer, NPMU, and COVID-19 tasks, respectively.
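The statistic behind these histograms is cosine similarity between a document-level embedding before and after continual pretraining, where values near 1.0 indicate the representation barely moved. A minimal sketch with toy vectors (not the paper's actual embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

before = [0.1, 0.7, -0.2]      # toy document embedding before pretraining
after = [0.2, 0.6, -0.1]       # toy embedding after continual pretraining
shift = cosine(before, after)  # near 1.0 => the embedding changed little
```

Computing this value for every document in a collection and plotting the distribution yields histograms of the kind shown in Figure 2.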