| Literature DB >> 35708760 |
Sedigheh Khademi Habibabadi1,2, Pari Delir Haghighi3, Jim Buttery1,4, Frada Burstein3.
Abstract
BACKGROUND: Traditional monitoring for adverse events following immunization (AEFI) relies on various established reporting systems, where there is an inevitable lag between an AEFI occurring and its potential reporting and the subsequent processing of reports. AEFI safety signal detection strives to detect AEFI as early as possible, ideally close to real time. Monitoring social media data holds promise as a resource for this.
Keywords: Twitter; immunization; machine learning; natural language processing; social media; vaccine adverse effects; vaccine safety; vaccines
Year: 2022 PMID: 35708760 PMCID: PMC9247809 DOI: 10.2196/34305
Source DB: PubMed Journal: JMIR Med Inform
Sample of vaccine-related tweets.
| Tweet | Type |
| “Aw wtf my poor arm is dead af from my flu shot.” | VAEMa |
| “Cannot lie on belly, baby gets squished; cannot lie on back, baby squishes; cannot lie on right side, i get heartburn; cannot lie on left side, vax arm is sore; let the third trimester moaning begin!” | VAEM |
| “2 people recently, including my 88yo father, had flu shot and really bad reaction afterwards. both said it was probably as bad as getting the flu!!! flu2018 maybe undercooked the vaccine.” | VAEM |
| “I got vaccinated as a kid. As a result, I'm now starting to gray and bald. My balding got so bad I had to shave my head. I've also gained weight. Because of vaccines I've started aging instead of dying as a baby.” | Non-VAEM |
| “Urgent vaccination plea after measles outbreak in West Yorkshire.” | Non-VAEM |
| “Researchers are developing a personalized vaccine which they hope could tackle ovarian cancer.” | Non-VAEM |
aVAEM: vaccine adverse event mention.
Data set numbers.
| Stage | Phase-One data, n (%) | Phase-Two data, n (%) | Total, n |
| Topic modeling | 328,822 (47.77) | 359,535 (52.23) | 688,357 |
| Filtering out by topic modeling | −310,021 (52.62) | −279,163 (47.38) | −589,184 |
| After topic modeling | 18,801 (18.96) | 80,372 (81.04) | 99,173 |
| Filtering out by data preparation and balancing | −14,668 (18.69) | −63,814 (81.31) | −78,482 |
| For classification training | 4133 (19.97) | 16,558 (80.03) | 20,691 |
| For training and validation | 3519 (18.28) | 15,730 (81.72) | 19,249 |
| For testing | 614 (42.58) | 828 (57.42) | 1442 |
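The split arithmetic in the table can be checked with a short script (illustrative, pure Python; the counts are taken directly from the rows above):

```python
# (Phase-One, Phase-Two) tweet counts per stage, from the table above.
splits = {
    "classification": (4_133, 16_558),   # for classification training
    "train_val":      (3_519, 15_730),   # for training and validation
    "test":           (614, 828),        # for testing
}

def row(phase_one, phase_two):
    """Return (total, Phase-One share %, Phase-Two share %) for a table row."""
    total = phase_one + phase_two
    return total, round(100 * phase_one / total, 2), round(100 * phase_two / total, 2)

# e.g. the classification-training row: 20,691 tweets split 19.97% / 80.03%,
# and train/validation + test counts sum back to the classification counts.
```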
List of classifiers.
| Models | Library or GitHub source |
| LR CVa | sklearn.linear_model |
| SGDb Classifier | sklearn.linear_model |
| Linear SVCc | sklearn.svm.SVC |
| RFd | sklearn.ensemble |
| Extra Trees | sklearn.ensemble |
| Multinomial NBe | sklearn.naive_bayes |
| NB SVMf (combined NB and Linear SVM) | GitHub Joshua-Chin/nbsvm |
| XGBoostg | GitHub dmlc/xgboost |
| Ensemble (NB SVM, LR CV, SGD, Linear SVC, and RF) | Majority voting |
| CNN,h LSTM,i BiLSTM,j GRU,k BiGRU,l CNN-BiLSTM, and CNN-BiGRU | PyTorch |
| RoBERTa,m RoBERTa Large, BERT,n XLNet,o XLNet Large, and XLMp | PyTorch; Hugging Face Transformers |
aLR CV: Logistic Regression Cross Validation.
bSGD: Stochastic Gradient Descent.
cSVC: Support Vector Classification.
dRF: Random Forest.
eNB: Naïve Bayes.
fSVM: Support Vector Machine.
gXGBoost: eXtreme Gradient Boosting.
hCNN: Convolutional Neural Network.
iLSTM: Long Short-Term Memory.
jBiLSTM: Bidirectional LSTM.
kGRU: Gated Recurrent Unit.
lBiGRU: Bidirectional Gated Recurrent Unit.
mRoBERTa: Robustly Optimized Bidirectional Encoder Representations Pretraining Approach.
nBERT: Bidirectional Encoder Representations from Transformers.
oXLNet: Generalized Autoregressive Pretraining for Language Understanding.
pXLM: Cross-Lingual Language Model.
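The Ensemble row combines five of the classical models by majority voting. A minimal sketch of that voting scheme (the per-model predictions below are invented for illustration; labels are 1 = VAEM, 0 = non-VAEM):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, all the same length.
    Returns the most common label per sample (ties go to the first-seen label)."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(model[i] for model in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Five models (NB SVM, LR CV, SGD, Linear SVC, RF) voting on three tweets:
preds = [
    [1, 0, 1],  # NB SVM
    [1, 0, 0],  # LR CV
    [0, 0, 1],  # SGD
    [1, 1, 1],  # Linear SVC
    [1, 0, 1],  # RF
]
# majority_vote(preds) -> [1, 0, 1]
```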
Figure 1. The vaccine adverse event mention–mine method. CNN: Convolutional Neural Network; LSTM: Long Short-Term Memory.
Phase-One F1 scores.
| Model | Validation | Imbalanced test | Balanced test | Combined test |
| CNNa-BiGRUb | 0.842 | 0.762 | 0.846 | 0.825 |
| BERTc | N/Ad | 0.767 | 0.841 | 0.824 |
| BiGRU | 0.807 | 0.793 | 0.828 | 0.822 |
| CNN–LSTMe | 0.805 | 0.777 | 0.815 | 0.808 |
| BiLSTMf | 0.815 | 0.807 | 0.807 | 0.807 |
| GRUg | 0.820 | 0.730 | 0.822 | 0.804 |
| CNN-BiLSTM | 0.816 | 0.766 | 0.810 | 0.802 |
| CNN | 0.816 | 0.787 | 0.800 | 0.798 |
| LSTM | 0.796 | 0.767 | 0.803 | 0.796 |
| Ensemble | 0.815 | 0.726 | 0.829 | 0.810 |
| Logistic Regression CVh | 0.812 | 0.730 | 0.820 | 0.803 |
| Linear SVCi | 0.814 | 0.693 | 0.824 | 0.797 |
| SGDj | 0.805 | 0.636 | 0.825 | 0.785 |
| Naïve Bayes SVMk | 0.792 | 0.767 | 0.789 | 0.785 |
| Random Forest | 0.814 | 0.694 | 0.801 | 0.779 |
| Extra Trees | 0.833 | 0.688 | 0.801 | 0.777 |
| XGBoostl | 0.811 | 0.704 | 0.791 | 0.774 |
| Naïve Bayes | 0.798 | 0.605 | 0.799 | 0.756 |
aCNN: Convolutional Neural Network.
bBiGRU: Bidirectional Gated Recurrent Unit.
cBERT: Bidirectional Encoder Representations from Transformers.
dN/A: not applicable.
eLSTM: Long Short-Term Memory.
fBiLSTM: Bidirectional Long Short-Term Memory.
gGRU: Gated Recurrent Unit.
hCV: Cross Validation.
iSVC: Support Vector Classification.
jSGD: Stochastic Gradient Descent.
kSVM: Support Vector Machine.
lXGBoost: Extreme Gradient Boosting.
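All scores above are F1, which the paper reports for the imbalanced, balanced, and combined test sets. As a reminder, F1 is the standard harmonic mean of precision and recall; the counts below are illustrative, not from the paper:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 80 true positives, 20 false positives, 25 false negatives:
# precision = 0.80, recall ≈ 0.762, F1 ≈ 0.780
```

Because F1 depends on the class mix only through the error counts, the same model can score quite differently on the imbalanced and balanced test sets, as the table shows.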
Phase-Two F1 scores.
| Model | Validation | Imbalanced test | Balanced test | Combined test | Imbalanced change, % | Combined change, % |
| RoBERTaa Large | N/Ab | 0.919 | 0.908 | 0.910 | —c | — |
| RoBERTa | N/A | 0.901 | 0.905 | 0.904 | — | — |
| XLNetd Large | N/A | 0.884 | 0.906 | 0.902 | — | — |
| XLNet | N/A | 0.870 | 0.903 | 0.897 | — | — |
| XLMe | N/A | 0.910 | 0.894 | 0.897 | — | — |
| BERTf | N/A | 0.863 | 0.892 | 0.887 | 12.6 | 7.7 |
| BiGRUg | 0.877 | 0.855 | 0.896 | 0.890 | 7.9 | 8.2 |
| CNNh-BiGRU | 0.874 | 0.849 | 0.890 | 0.884 | 11.4 | 7.1 |
| LSTMi | 0.866 | 0.875 | 0.879 | 0.878 | 14.1 | 10.3 |
| CNN-LSTM | 0.866 | 0.862 | 0.876 | 0.873 | 10.9 | 8.1 |
| BiLSTMj | 0.872 | 0.847 | 0.884 | 0.878 | 5.0 | 8.8 |
| GRUk | 0.869 | 0.825 | 0.876 | 0.868 | 13.1 | 7.9 |
| CNN-BiLSTM | 0.872 | 0.824 | 0.879 | 0.871 | 7.6 | 8.6 |
| CNN | 0.864 | 0.805 | 0.866 | 0.856 | 2.4 | 7.2 |
| Ensemble | 0.870 | 0.818 | 0.874 | 0.865 | 12.6 | 6.8 |
| Logistic RCVl | 0.866 | 0.807 | 0.873 | 0.861 | 10.5 | 7.3 |
| SGDm | 0.865 | 0.806 | 0.873 | 0.861 | 26.7 | 9.7 |
| Linear SVCn | 0.864 | 0.802 | 0.869 | 0.857 | 15.7 | 7.5 |
| Random Forest | 0.857 | 0.796 | 0.864 | 0.853 | 14.7 | 9.5 |
| Extra Trees | 0.857 | 0.789 | 0.862 | 0.849 | 14.7 | 9.2 |
| NBo SVMp | 0.838 | 0.798 | 0.838 | 0.832 | 3.9 | 5.9 |
| XGBoostq | 0.845 | 0.714 | 0.854 | 0.831 | 1.3 | 7.4 |
| NB | 0.835 | 0.735 | 0.841 | 0.822 | 21.5 | 8.7 |
aRoBERTa: Robustly Optimized Bidirectional Encoder Representations Pretraining Approach.
bN/A: not applicable.
cChange calculation was not performed because no previous figures existed.
dXLNet: Generalized Autoregressive Pretraining for Language Understanding.
eXLM: Cross-Lingual Language Model.
fBERT: Bidirectional Encoder Representations from Transformers.
gBiGRU: Bidirectional Gated Recurrent Unit.
hCNN: Convolutional Neural Network.
iLSTM: Long Short-Term Memory.
jBiLSTM: Bidirectional Long Short-Term Memory.
kGRU: Gated Recurrent Unit.
lRCV: Regression Cross Validation.
mSGD: Stochastic Gradient Descent.
nSVC: Support Vector Classification.
oNB: Naïve Bayes.
pSVM: Support Vector Machine.
qXGBoost: eXtreme Gradient Boosting.
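The change columns are consistent with relative improvement over the Phase-One F1 scores, (new − old) / old × 100. A sketch, using scores copied from the two tables (small discrepancies of ±0.1 point are possible because the published scores are rounded to three decimals):

```python
def pct_change(old, new):
    """Relative change between two F1 scores, in percent, rounded to 1 dp."""
    return round(100 * (new - old) / old, 1)

# SGD imbalanced-test F1 rose from 0.636 (Phase One) to 0.806 (Phase Two):
# pct_change(0.636, 0.806) -> 26.7, matching the table's 26.7%.
# LSTM: pct_change(0.767, 0.875) -> 14.1, matching the table's 14.1%.
```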
Summary of topic modeling counts (N=811,010).
| Steps | Counts, n (% of initial data) |
| Tweets collected | 811,010 (100) |
| Cleaned | –122,653 (–15.12) |
| Tweets after cleaning | 688,357 (84.88) |
| Discarded (stage 1) | –570,383 (–70.33) |
| Tweets after stage 1 | 117,974 (14.55) |
| Discarded (stage 2) | –19,083 (–2.35) |
| Tweets after stage 2a,b | 98,891 (12.19) |
aStage 2 proportions: 88,900 non–vaccine adverse event mention tweets and 9991 vaccine adverse event mention tweets (10.10% of stage 2 data; 1.45% of tweets after cleaning; 1.23% of initial data).
bVaccine adverse event mention proportions: 2367 in other stage 2 topics and 7624 in the best stage 2 topic (76.31% of all vaccine adverse event mentions).
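The funnel in this table can be reproduced arithmetically (counts from the table; percentages rounded to two decimals as in the table, so 10.10% prints as 10.1):

```python
# Tweet counts from the topic-modeling summary table.
collected = 811_010
after_cleaning = collected - 122_653      # 688,357 tweets after cleaning
after_stage1 = after_cleaning - 570_383   # 117,974 tweets after stage 1
after_stage2 = after_stage1 - 19_083      # 98,891 tweets after stage 2

def share(part, whole):
    """Percentage of `whole`, rounded to 2 dp."""
    return round(100 * part / whole, 2)

# VAEM tweets surviving stage 2 (9991) as a share of stage 2 data:
vaem = 9_991
# share(vaem, after_stage2) ≈ 10.1 (the table's 10.10%)
```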