A R Sanaullah1, Anupam Das1, Anik Das2, Muhammad Ashad Kabir3, Kai Shu4.
Abstract
The explosive growth of misinformation on social media and other platforms during pandemics such as COVID-19 can cause significant damage to people's physical and mental stability. To detect such misinformation, researchers have applied various machine learning (ML) and deep learning (DL) techniques. The objective of this study is to systematically review, assess, and synthesize state-of-the-art research articles that have used ML and DL techniques to detect COVID-19 misinformation. A structured literature search was conducted in the relevant bibliographic databases to ensure that the survey centered solely on reproducible, high-quality research. Of the 260 articles returned by our keyword search, we reviewed the 43 papers that fulfilled our inclusion criteria. We survey the complete pipeline of COVID-19 misinformation detection: we identify COVID-19 misinformation datasets and review the data processing, feature extraction, and classification techniques used to detect COVID-19 misinformation. Finally, we discuss the challenges and limitations of detecting COVID-19 misinformation with ML techniques, along with future research directions.
Keywords: COVID-19; Classification; Deep learning; Machine learning; Misinformation
Year: 2022 PMID: 35919516 PMCID: PMC9336132 DOI: 10.1007/s13278-022-00921-9
Source DB: PubMed Journal: Soc Netw Anal Min
Fig. 1 Types of COVID-19 misinformation
Database search string
| Database name | Query string/Keywords |
|---|---|
| Scopus | TITLE-ABS-KEY ((COVID-19 OR coronavirus) AND (“fake news” OR misinformation OR rumors OR misleading) AND (detection OR classification OR clustering)) |
| Web of Science | TS=(( COVID-19 OR coronavirus) AND (fake news OR misinformation OR misleading OR rumors) AND (detection OR classification OR clustering)) |
| Google Scholar | COVID-19 fake news detection, COVID-19 fake news classification, COVID-19 misinformation detection, COVID-19 misleading news detection, COVID-19 rumor detection |
Fig. 2 PRISMA flow diagram for the systematic selection and evaluation of the articles
Summary of the datasets
| Type | Paper | Dataset name | Data source: SP | Data source: FCW | Data source: NW | Data source: O | Dataset link | Language | Instances | Labels |
|---|---|---|---|---|---|---|---|---|---|---|
| Misleading | Elhadad et al. ( | COVID-19-FAKES | – | i | En, Ar | 7486 | 2 |||
| Fake News | Kar et al. ( | Indic-covidemic tweet dataset | – | – | – | ii | En,B,H | 1438 | 2 | |
| Shahi and Nandini ( | FakeCovid | – | – | – | iii | En, H, G and Other 37 | 5182 | 2 | ||
| Koirala ( | Abhishek Koirala | – | – | – | – | En | 3119 | 2 | ||
| Paka et al. ( | COVID-19 Twitter Fake News (CTF) | – | xxxi | En | 45,261 | 2 | ||||
| Madani et al. ( | Madani et al. | – | – | – | – | En | 2000 | 2 | ||
| Hossain et al. ( | COVID19-Lies | – | – | – | vi | En | 6761 | 3 | ||
| Boukouvalas et al. ( | COVID-19 Twitter Data | – | – | – | vii | En | 560 | 2 | ||
| Li et al. ( | MM-COVID | xxvi | En, S, P, H, It, F | 11,173 | 2 | |||||
| Al-Rakhami and Al-Amri ( | Al-Rakhami et al. | – | – | – | – | En | 409,484 | 2 | ||
| Bandyopadhyay and Dutta ( | Fake news dataset | – | – | – | En | 19,873 | 2 | |||
| Kumar et al. ( | Kumar et al. | – | – | – | – | En | 1970 | 4 | ||
| Patwa et al. ( | COVID-19 Fake News Dataset | – | – | viii | En | 10,700 | 2 | |||
| Micallef et al. ( | Counter-covid19-misinformation | – | – | – | ix | En | 155,468 | 3 | ||
| Zhou et al. ( | ReCOVery | – | – | x | En | News 2029 tweets 140,820 | 2 | |||
| Dharawat et al. ( | Covid-HeRA | – | – | – | xi | En | 61,286 | 5 | ||
| Shahi et al. ( | Misinformation COVID-19 | – | – | xii | En | 1500 | 2 | |||
| Cui and Lee ( | CoAID | – | xiv | En | 298,778 | 2 | ||||
| Haouari et al. ( | ArCOV-19 | – | – | xvi | Ar | 9414 | 3 | |||
| Kaliyar et al. ( | FN-COV | – | – | – | – | En | 69,976 | 2 | ||
| Ayoub et al. ( | Ayoub et al. | – | – | En | 984 | 2 | ||||
| Ng and Carley ( | Ng et al. | – | – | – | – | En | 6731 | 5 | ||
| Mahlous and Al-Laith ( | Arabic Fake News corpora | – | – | – | xxv | Ar | 36,066 | 2 | ||
| Yang et al. ( | CHECKED | – | – | – | xix | C | 2104 | 2 | ||
| Rumor | Chen ( | Shuaipu Chen | – | – | – | – | En | 3737 | 3 | |
| WANG et al. ( | Wang et al. | – | – | – | xxvii | En | 7179 | 3 | ||
| Shi et al. ( | Shi et al. | – | – | – | – | En | 1537 | 2 | ||
| Alkhalifa et al. ( | CLEF dataset | – | – | – | iv | En | 962 | 2 | ||
| Alsudias and Rayson ( | COVID-19 Arabic tweets | – | – | – | xvii | Ar | 2000 | 3 | ||
| Cheng et al. ( | COVID-19-rumor-dataset | – | xxviii | En | 6834 | 3 | ||||
| Conspiracy | Medina Serrano et al. ( | – | – | – | v | En | Videos 180, comments 151,567 | 2 | ||
| Disinformation | Alam et al. ( | COVID-19 Infodemic Twitter Dataset | – | – | – | xviii | En, Ar | 722 | 2 | |
| Song et al. ( | COVID-19 Disinformation corpus | – | – | – | – | En | 1293 | 10 | ||
| Unlabeled | Dimitrov et al. ( | TweetsCOV19 | – | – | – | xiii | En | 8,151,524 | – | |
| Lamsal ( | COV19Tweets Dataset | – | – | – | xx | En | Over 310 million | – | ||
| Paka et al. ( | CTF | – | xxix | En | 21.85 million | – | ||||
| Qazi et al. ( | GeoCoV19 | – | – | – | xxi | En, S and other 60 | 524,353,432 | – | ||
| Banda et al. ( | COVID-19 Twitter Chatter Dataset | – | – | – | xxii | En, F, S, G and others | Over 1.12 billion | – | ||
| Alqurashi et al. ( | COVID-19-Arabic-Tweets-Dataset | – | – | – | xxiii | Ar | 3,934,610 | – | ||
| Chen et al. ( | COVID-19 Twitter dataset | – | – | – | xxiv | En, S, I, F, P and other 62 | 123,113,914 | – | ||
| Lopez and Gallemore ( | – | – | – | xv | En, S, P and other 63 | 785,118,723 | – | |||
| Preda ( | COVID19 Tweets | – | – | – | xxx | En | 179,108 | – | ||
| Gao et al. ( | NAIST COVID | – | – | – | xxxi | En, Ja, C | 25,925,773 | – | ||
SP = social platform; FCW = fact checking website; NW = news website; O = others; En = English; B = Bengali; C = Chinese; Ja = Japanese; H = Hindi; Ar = Arabic; G = German; S = Spanish; P = Portuguese; I = Indonesian; F = French; It = Italian
Dataset links are provided in the appendix section
Fig. 3 Datasets from various social media platforms
Fig. 4 Count of different dataset types
Datasets and their class labels
| Dataset name | Class labels | Used in |
|---|---|---|
| COVID-19-FAKES | Real, Misleading | Elhadad et al. ( |
| Indic-covidemic tweet dataset | Fake, Non-Fake (Real) | Kar et al. ( |
| FakeCovid | False (Fake), Others | Shahi and Nandini ( |
| Abhishek Koirala | Fake, True (Real) | Koirala ( |
| Madani et al. | Fake, Real | Madani et al. ( |
| CTF | Fake, Genuine (Real) | Paka et al. ( |
| COVID-19 Twitter Data | Reliable, Unreliable | Boukouvalas et al. ( |
| MM-COVID | Fake, Real | Li et al. ( |
| Al-Rakhami et al. | Credible, Non-credible | Al-Rakhami and Al-Amri ( |
| Fake news dataset | Fake, Real | Bandyopadhyay and Dutta ( |
| Kumar et al. | Irrelevant, Conspiracy, True (Real), False (Fake) | Kumar et al. ( |
| COVID-19 Fake News Dataset | Fake, Real | Patwa et al. ( |
| Counter-covid19-misinformation | Misinformation, Counter-misinformation, Irrelevant | Micallef et al. ( |
| ReCOVery | Reliable, Unreliable | Zhou et al. ( |
| Covid-HeRA | Real, Refutes/Rebuts, Highly severe, Possibly severe, Not severe | Dharawat et al. ( |
| COVID19-Lies | Agree, Disagree, No Stance | Hossain et al. ( |
| Misinformation COVID-19 | False (Fake), Partially False (Partially fake) | Shahi et al. ( |
| CoAID | Fake, True (Real) | Cui and Lee ( |
| ArCOV-19 | False (Fake), True (Real), Other | Haouari et al. ( |
| FN-COV | Fake, Real | Kaliyar et al. ( |
| Ayoub et al. | Fake, True (Real) | Ayoub et al. ( |
| Ng et al. | True (Real), Partially true (Partially real), Partially false (Partially fake), False (Fake), Unknown | Ng and Carley ( |
| Arabic Fake News corpora | Fake, Not Fake (Real) | Mahlous and Al-Laith ( |
| CHECKED | Fake, Real | Yang et al. ( |
| Shuaipu Chen | Health rumor, Science rumor, Society rumor | Chen ( |
| Wang et al. | Fake (Rumor), Real, Unverified | WANG et al. ( |
| Shi et al. | Rumor, Real | Shi et al. ( |
| CLEF dataset | Rumor, Non-rumor | Alkhalifa et al. ( |
| COVID-19 Arabic tweets | True (Real), False (Rumor), Unrelated | Alsudias and Rayson ( |
| COVID-19-rumor-dataset | True (Real), False (Rumor), Unverified | Cheng et al. ( |
| YouTube_misinfo | Conspiracy, Agreement | Medina Serrano et al. ( |
| COVID-19 Infodemic Twitter Dataset | Yes (Not trustworthy), No (Trustworthy) | Alam et al. ( |
| COVID-19 Disinformation corpus | PubAuthAction, CommSpread, GenMedAdv, PromActs, Consp, VirTrans, VirOrgn, PubRec, Vacc, None | Song et al. ( |
Data pre-processing techniques used in existing research
| Techniques | Explanation | Papers |
|---|---|---|
| Tokenization | Splitting the text into smaller units, known as tokens | Boukouvalas et al. ( |
| Stop-words Removal | Removing words that provide little context and carry little useful information | Boukouvalas et al. ( |
| Case-folding | Converting the characters of a sentence to lower case | Kaliyar et al. ( |
| Stemming | Reducing a word to its stem so that its variants are represented by a single term | Elhadad et al. ( |
| Lemmatization | Transforming a word to its dictionary root, known as the lemma, depending on the context | Kumar et al. ( |
| POS tagging | Assigning a part-of-speech tag to each word | Elhadad et al. ( |
| Data Augmentation | Increasing the amount of data by modifying existing data | Kar et al. ( |
| Others | Removing HTML tags, URLs and other special characters from texts | Kaliyar et al. ( |
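Several of the pre-processing steps listed above can be illustrated together. The following is a minimal sketch, not the pipeline of any reviewed paper: the function name `preprocess`, the tiny stop-word list, and the regular expressions are assumptions made for illustration, and it covers only HTML/URL removal, case-folding, tokenization, and stop-word removal.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption; real pipelines use larger lists).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for", "that", "this"}

TAG_RE = re.compile(r"<[^>]+>")                        # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")          # URLs
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")  # words, keeping forms like "covid-19"


def preprocess(text: str) -> List[str]:
    """Clean a social media post and return its remaining tokens."""
    text = TAG_RE.sub(" ", text)     # remove HTML tags
    text = URL_RE.sub(" ", text)     # remove URLs
    text = text.lower()              # case-folding
    tokens = TOKEN_RE.findall(text)  # tokenization; other special characters are dropped
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


if __name__ == "__main__":
    post = "<b>BREAKING</b>: Garlic is a CURE for COVID-19! https://example.com/x"
    print(preprocess(post))
```

Stemming, lemmatization, and POS tagging are omitted here because they usually rely on language-specific resources from an NLP library.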
Feature extraction methods used in the literature
| Methods | Papers |
|---|---|
| Pre-trained BERT | Kar et al. ( |
| mBERT | Kar et al. ( |
| COVID-Twitter-BERT | Alkhalifa et al. ( |
| RoBERTa | Chen et al. ( |
| GloVe | Cui and Lee ( |
| ELMo | Alkhalifa et al. ( |
| Word2Vec | Alsudias and Rayson ( |
| FastText | Alsudias and Rayson ( |
| BoW | Cui and Lee ( |
| Count Vector | Alsudias and Rayson ( |
| TF | Elhadad et al. ( |
| TF-IDF | Elhadad et al. ( |
| PCA | Boukouvalas et al. ( |
| ICA | Boukouvalas et al. ( |
| LIWC | Medina Serrano et al. ( |
| RST | Zhou et al. ( |
| VAE | Cheng et al. ( |
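To make the TF and TF-IDF entries above concrete, here is a small sketch in plain Python. It assumes documents are already tokenized, uses term frequency normalized by document length, and applies the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, which is only one of several common variants; the reviewed papers may use different weightings. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    corpus = [["garlic", "cures", "covid"], ["vaccine", "prevents", "covid"]]
    for vector in tf_idf(corpus):
        print(vector)
```

Terms that occur in every document, such as "covid" in this toy corpus, receive a lower weight than terms specific to one document.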
Classification strategies used in the literature
| Strategy | Papers |
|---|---|
| Binary class | Cui and Lee ( |
| Multi-class | Chen ( |
Traditional ML methods used in the literature
| Methods | Papers |
|---|---|
| SVM | Cui and Lee ( |
| LR | Cui and Lee ( |
| RF | Cui and Lee ( |
| DT | Elhadad et al. ( |
| NB | Al-Rakhami and Al-Amri ( |
| MNB | Medina Serrano et al. ( |
| BNB | Elhadad et al. ( |
| kNN | Elhadad et al. ( |
| XGBoost | Elhadad et al. ( |
| GDBT | Patwa et al. ( |
| C4.5 | Al-Rakhami and Al-Amri ( |
| Perceptron | Elhadad et al. ( |
| BN | Al-Rakhami and Al-Amri ( |
| Linear Classifier | Hossain et al. ( |
Fig. 5 Relationships between feature extraction and traditional ML techniques
DL methods used in the literature
| Methods | Papers |
|---|---|
| NN | Elhadad et al. ( |
| DNN | Cheng et al. ( |
| MLP | Kar et al. ( |
| Transformer | Yang et al. ( |
| BERT-base | Chen ( |
| BERT-large | Kumar et al. ( |
| Distil-BERT | Kumar et al. ( |
| mBERT | Alam et al. ( |
| AraBERT | Alam et al. ( |
| RoBERTa-base | Medina Serrano et al. ( |
| RoBERTa-large | Kumar et al. ( |
| Distil-RoBERTa | Kumar et al. ( |
| ALBERT-base | Alam et al. ( |
| ALBERT-large | Kumar et al. ( |
| ALBERT-xlarge | Chen et al. ( |
| CT-BERT | Chen et al. ( |
| Covid-bert-base | Wani et al. ( |
| Ro-CT-BERT | Chen et al. ( |
| XLNet | Medina Serrano et al. ( |
| CNN | Cui and Lee ( |
| RCNN | Elhadad et al. ( |
| MCNNet | Kaliyar et al. ( |
| TextCNN | Chen ( |
| TextRNN | Chen ( |
| Att-TextRNN | Yang et al. ( |
| LSTM | Koirala ( |
| BiLSTM | Hossain et al. ( |
| BiLSTM-Attention | Wani et al. ( |
| BiGRU | Cui and Lee ( |
| Sequential Model | Elhadad et al. ( |
| SBERT | Hossain et al. ( |
| SBERT (DA) | Hossain et al. ( |
| XLM-r | Alam et al. ( |
| FastText | Alam et al. ( |
| SCHOLAR | Song et al. ( |
| SAFE | Zhou et al. ( |
| SAME | Cui and Lee ( |
| HAN | Cui and Lee ( |
| Cross-SEAN | Paka et al. ( |
| dEFEND | Cui and Lee ( |
| CSI | Cui and Lee ( |
| CANTM | Song et al. ( |
Fig. 6 Relationships between feature extraction and DL techniques
Combined models used in the literature
| Methods | Papers |
|---|---|
| C4.5 + RF | Al-Rakhami and Al-Amri ( |
| C4.5 + kNN | Al-Rakhami and Al-Amri ( |
| SVM + RF | Al-Rakhami and Al-Amri ( |
| SVM + kNN | Al-Rakhami and Al-Amri ( |
| SVM + Bayes Net + kNN | Al-Rakhami and Al-Amri ( |
| C4.5 + Bayes Net + kNN | Al-Rakhami and Al-Amri ( |
| C4.5 + SVM + RF + Bayes Net + kNN + Naive Bayes | Al-Rakhami and Al-Amri ( |
| RNN+LSTM | Elhadad et al. ( |
| RNN+GRU | Elhadad et al. ( |
| BiRNN+GRU | Elhadad et al. ( |
| BERTSCORE (DA)+BiLSTM | Hossain et al. ( |
| BERTSCORE (DA)+SBERT (DA) | Hossain et al. ( |
| BERT+BiLSTM | WANG et al. ( |
| CNN+RNN | Kumar et al. ( |
| RNN+CNN | Kumar et al. ( |
| C-LSTM | Kaliyar et al. ( |
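One simple way to combine several classifiers, as in the classical combinations listed above, is hard majority voting. The sketch below is an illustrative assumption rather than the combination scheme of any cited paper (the table does not state how the base models were combined, and stacking or weighted voting are also common); the keyword-rule classifiers are toy stand-ins for trained models.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier is modeled here as any function mapping a text to a label.
Classifier = Callable[[str], str]


def majority_vote(classifiers: Sequence[Classifier], text: str) -> str:
    """Return the label predicted by most classifiers.

    Ties are broken deterministically by choosing the alphabetically
    smallest label among the tied ones.
    """
    if not classifiers:
        raise ValueError("at least one classifier is required")
    votes = Counter(clf(text) for clf in classifiers)
    best = max(votes.values())
    return min(label for label, n in votes.items() if n == best)


# Toy, hypothetical base classifiers standing in for trained models.
def rule_cure(text: str) -> str:
    return "fake" if "cure" in text.lower() else "real"


def rule_miracle(text: str) -> str:
    return "fake" if "miracle" in text.lower() else "real"


def rule_source(text: str) -> str:
    return "real" if "who.int" in text.lower() else "fake"


if __name__ == "__main__":
    ensemble = [rule_cure, rule_miracle, rule_source]
    print(majority_vote(ensemble, "Miracle cure found for COVID-19"))
```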
Best performing models based on accuracy and F1-score
| Problem tackled | Reference | Best model | Split ratio (%): train, validation, test | Split count: train | Split count: validation | Split count: test | A (%) | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Misleading | Elhadad et al. ( | CNN | 80, 20, – | 5989 | 1497 | – | 99.99 | 99.93 | 100 | 99.97 |
| Elhadad et al. ( | NN (TF) | Fivefold C-V | – | – | – | 99.68 | 99.87 | 99.73 | 99.80 | |
| Fake news | Kar et al. ( | 80 , – , 20 | 1150 | – | 288 | – | 87.17 | 91.89 | 89.47 | |
| – | 76.47 | 86.66 | 81.25 | |||||||
| Koirala ( | CNN | 54,26 ,20 | 1672 | 823 | 624 | 80 | 73 | 70 | 72 | |
| BERT | 80 | – | – | – | ||||||
| Bang et al. ( | RoBERTa-large (Fakenews-19) | – , – , – | 6420 | 2140 | 2140 | 98.13 | – | – | 98.13 | |
| BERT-large (Tweets-19) | 60 | 200 | 61.10 | – | – | 54.33 | ||||
| Hossain et al. ( | BERTSCORE (DA) + BiLSTM (SNLI) | –, –, – | – | – | – | – | 44.20 | 45.30 | 43.10 | |
| BERTSCORE (DA) + SBERT (DA) (MultiNLI) | – | 55.90 | 50.90 | 50.20 | ||||||
| BERTSCORE (DA) + SBERT (DA) (MedNLI) | – | 47.80 | 49.20 | 48.40 | ||||||
| Boukouvalas et al. ( | 70, 30, – | 392 | 168 | – | 87.50 | 84.70 | 90 | 87.30 | ||
| SVM/Gaussian (ICA) | 81.20 | 85.90 | 76.30 | 80.30 | ||||||
| SparseICA-EBM ( | 69.10 | 74.40 | 67.90 | 64.40 | ||||||
| Al-Rakhami and Al-Amri ( | C4.5 (Meta-model) | Tenfold C-V | – | – | – | 95.11 | 95.3 | 95.10 | 95.10 | |
| SVM+RF (Ensemble-model) | 97.80 | – | – | – | ||||||
| Dharawat et al. ( | LR(Multiclass) | 80, –, 20 | 49,029 | – | 12,257 | 96.30 | 31.30 | 23.30 | 25 | |
| dEFEND w.news (Binary) | 98 | 92 | 68 | 75 | ||||||
| Chen et al. ( | Ro-CT-BERT | 60, 20, 20 | 6420 | 2140 | 2140 | 99.02 | 99.02 | 99.02 | 99.02 | |
| Kaliyar et al. ( | C-LSTM | –, –, – | – | – | – | 98.62 | 99.20 | 98.9 | 99.40 | |
| Wani et al. ( | BERT | 60, 20, 20 | 6420 | 2140 | 2140 | 98.41 | – | – | – | |
| Ayoub et al. ( | BERT | Tenfold C-V | – | – | – | 99.40 | ||||
| Ng and Carley ( | BoW + LR (Story Validity Classification) | 80, –, 20 | 5385 | – | 1346 | – | – | – | 89 | |
| Mahlous and Al-Laith ( | LR (Count Vector) | Fivefold C-V | – | – | – | – | 93.40 | 93.30 | 93.30 | |
| Kaliyar et al. ( | MCNNet | –, –, – | – | – | – | 98.20 | 97.50 | 98.70 | 98.10 | |
| Yang et al. ( | TextCNN | 70, 10, 20 | 1473 | 210 | 421 | – | – | – | 93.80 | |
| Shahi and Nandini ( | BERT | –, –, – | – | – | – | – | 78 | 75 | 76 | |
| Bandyopadhyay and Dutta ( | kNN (k-5) | –, –, – | – | – | – | 89 | – | – | 91 | |
| Paka et al. ( | Cross-SEAN | 80, –, 20 | – | – | – | 95.40 | 94.60 | 96.10 | 95.30 | |
| Patwa et al. ( | SVM | 60, 20, 20 | 6420 | 2140 | 2140 | 93.46 | 93.48 | 93.46 | 93.46 | |
| Zhou et al. ( | SAFE | 80, –, 20 | – | – | – | – | ||||
| Cui and Lee ( | dEFEND | 75, –, 25 | 2152 | – | 717 | – | 89.65 | 48.47 | 58.14 | |
| Kumar et al. ( | RoBERTa-large | –, –, – | – | – | – | – | 73.75 | 73.50 | 76 | |
| Rumor | Chen ( | BERT | Tenfold C-V | – | – | – | 99.20 | 99.17 | 98.13 | 98.34 |
| Alkhalifa et al. ( | CNN with CT-BERT | 97, 2, 1 | 9206 | 150 | 140 | – | 78 | – | – | |
| Shi et al. ( | XGBoost | –, –, – | – | – | – | 91 | 94 | 85 | 89 | |
| WANG et al. ( | BERT+BiLSTM | 80, –, 20 | 6102 | – | 1077 | – | 73.19 | 73.27 | 72.95 | |
| Alsudias and Rayson ( | LR (COUNT VECTOR) | Tenfold C-V | – | – | – | 84.03 | 81.04 | 80.03 | 80.50 | |
| Cheng et al. ( | DNN | Fivefold C-V | – | – | – | – | – | – | ||
| Conspiracy | Medina Serrano et al. ( | RoBERTa (Classification of Users Comments) | 80, –, 20 | 2582 | – | 645 | – | – | – | |
| SVM (Classification of YouTube Videos) | Tenfold C-V | – | – | – | 89.40 | – | – | – | ||
| Disinformation | Song et al. ( | CANTM | Fivefold C-V | – | – | – | 63.34 | – | – | 55.48 |
| Alam et al. ( | BERT (En) (Binary) | Tenfold C-V | – | – | – | – | – | – | ||
| mBERT (En) (Multiclass) | – | – | – | |||||||
| mBERT (Ar) (Binary) | – | – | – | |||||||
| FastText (Ar) (Multiclass) | – | – | – | |||||||
En = English; Ar = Arabic; A = Accuracy; P = Precision; R = Recall; F1 = F1-Score Macro Average (Calculated)
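The metrics reported in the table above can be computed from true and predicted labels as sketched below. This is a minimal illustration, assuming macro averaging means the unweighted mean of per-class precision, recall, and F1 (the usual definition); the function name `macro_metrics` is hypothetical, and individual papers may have averaged differently.

```python
from typing import Dict, Sequence


def macro_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n_labels = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n_labels,
        "recall": sum(recalls) / n_labels,
        "f1": sum(f1s) / n_labels,
    }


if __name__ == "__main__":
    truth = ["fake", "fake", "real", "real"]
    predicted = ["fake", "real", "real", "real"]
    print(macro_metrics(truth, predicted))
```

On imbalanced datasets accuracy can stay high while macro F1 is low, which is consistent with rows in the table that report, for example, 96.30% accuracy alongside an F1 of 25%.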