| Literature DB >> 35309597 |
Taiwo Olaleye, Adebayo Abayomi-Alli, Kayode Adesemowo, Oluwasefunmi Tale Arogundade, Sanjay Misra, Utku Kose.
Abstract
Fake COVID-19 tweets are dangerous: they are misinformative, often completely inaccurate, and they threaten efforts to flatten the pandemic curve. Thus, aside from the COVID-19 pandemic itself, fake news and myths about the virus constitute an infodemic, which must be tackled by ensuring that only valid information circulates. In this context, this study proposed the Synthetic Minority Over-Sampling Technique (SMOTE) and classifier vote ensemble (SCLAVOEM) method as a fake-news classifier and a hyperparameter optimization approach for predictive modelling of COVID-19 infodemic tweets. Hyperparameter optimization variables were deployed at specific points of the proposed model, and minority oversampling of the training sets was applied to imbalanced class representations. Experiments with SCLAVOEM for COVID-19 infodemic prediction returned weighted averages of 0.999 for F-measure and 1.000 for area under the curve (AUC). Thanks to SMOTE, performance increases of 3.74 and 1.11%, 5.05 and 0.29%, and 4.59 and 8.05% were seen across three different data sets. Ultimately, SCLAVOEM provides a framework for predictively detecting 'fake tweets' and for three sentiment classes: 'positive', 'negative' and 'click-trap'. It is expected that the model will automatically flag fake information on Twitter, thereby protecting the public from inaccurate information and from information overload.
Keywords: Bag-of-words; COVID-19; Ensemble machine learning; Fake news; Infodemic; Parameter optimization; Tweet; Twitter
Year: 2022 PMID: 35309597 PMCID: PMC8922071 DOI: 10.1007/s00500-022-06940-0
Source DB: PubMed Journal: Soft comput ISSN: 1432-7643 Impact factor: 3.732
Fig. 1 Cross-section of sampled valid and infodemic COVID-19 tweets
Text classification and fake news detection performance enhancement indices across the related studies and models
| Study | Feature set | Model | Dataset nature | Performance improvement |
|---|---|---|---|---|
| Jang et al. ( | Evolution pattern | Evolution tree | Private/balanced | No tuning |
| Wang ( | Nominal features | C-NN | Public/balanced | No tuning |
| Rodríguez-Ruiza et al. ( | Nominal features | Ensemble, Miner, R-Miner, OCKRA | Public/1-output class | Feature selection |
| Kumar et al. ( | Nominal Twitter metrics | Network analysis | Private/imbalanced | No tuning |
| Agarwala et al. ( | Bag-of-Words | SVM | Public/imbalanced | No tuning |
| Price et al. ( | Title/content features | kNN | Public/imbalanced | Feature selection |
| Rong ( | Nominal emotional features | R-F/ensemble | Public/imbalanced | Feature selection |
| Fletcher et al. ( | Nominal Twitter metrics | CSE | Public/imbalanced | No tuning |
| Alkhodaira et al. ( | Bag-of-Words | R-NN | Public/imbalanced | N-Gram tuning |
| Zhang et al. ( | Bag-of-Words | K-M Clustering | Private/imbalanced | No tuning |
| Monther and Alwahedi ( | Nominal Twitter metrics | L-CR | Private/imbalanced | Feature selection |
| Jwa et al. ( | Nominal metrics | Weighted Entropy | Public/imbalanced | No tuning |
| Yang et al. ( | Out-of-VOC tokens | Ensemble | Imbalanced | Tuning |
| Faustini and Covões ( | Bag-of-Words | RF, SVM, FS | Public/imbalanced | Parameter tuning |
| Bahad et al. ( | Vector representations | R-NN | Public/imbalanced | Parameter tuning |
| Ozbay and Alatas ( | Bag-of-Words | ML | Public/imbalanced | Feature selection |
| Thota et al. ( | Bag-of-Words | NN | Public/imbalanced | Parameter tuning |
| Zhang et al. ( | Nodes/normal links | Network model | Public/imbalanced | Tuning |
| Kaliyar et al. ( | Vector representations | D-CNN | Public/imbalanced | Parameter tuning |
| da Silva et al. ( | Bag-of-Words | Ensemble | Public/imbalanced | No tuning |
| Atodiresei et al. ( | Tweet title/content | NER | Private/imbalanced | No tuning |
| Rasool et al. ( | Bag-of-Words | Decision tree | Public/imbalanced | Feature selection |
| Ruz et al. ( | Bag-of-Words | Bayesian network | Public/imbalanced | No tuning |
| Abd-Elaziz et al. ( | Bag-of-Words | CDA, EAGLE, SVM | Private/imbalanced | No tuning |
| Baarah et al. ( | Bag-of-Words | LMT | Public/imbalanced | No tuning |
| Maktabar et al. ( | Bag-of-Words | Ensemble | Private/imbalanced | No tuning |
| Olaleye et al. ( | Bag-of-Words | Ensemble | Private/balanced | No tuning |
Fig. 2 Distribution of research variables observed across the related studies and models
Characteristic nature of the tweets across four datasets
| Data_ID/timeline | Tweet corpus | Class: size distribution | Class status | Tweet source | Tweet handles |
|---|---|---|---|---|---|
| BOW-1 (pre-lockdown) | 73,011 | Valid: 43.81%; Invalid: 56.12% | Imbalanced class | Twitter handles of the Nigeria Centre for Disease Control, Ministry of Health, World Health Organization; threads; hashtags | @ncdcgov, @FMH, @WHO_Nigeria, @WHO_Africa, @WHO, #TakeResponsibility, #Covid-19, #Covid-19Nigeria, #Coronavirus, threads, etc. |
| BOW-2 (during lockdown) | 103,866 | Valid: 34.87%; Invalid: 65.13% | Imbalanced class | As above + #lockdown, #flattenthecurve, #endcovidscamnow, #magadascarcare, #chinesedoctors, etc. | |
| BOW-3 (BOW-1 + BOW-2) | 176,877 | Valid: 38.76%; Invalid: 61.23% | Imbalanced class | | |
| BOW-4 (BOW-3 subset) | 111,122 | Negative: 75.15%; Positive: 21.80%; Click-trap: 3.05% | Imbalanced class | Replies, discussion threads and hashtags | #lockdown, #flattenthecurve, #endcovidscamnow, #magadascarcare, #chinesedoctors, #Covid-19, #Covid-19Nigeria, #Coronavirus, threads, etc. |
Preprocessing techniques and parameter optimization application points
| Preprocessing technique | Applied option | Hyperparameter optimization point | Result |
|---|---|---|---|
| Stop-word handler | Rainbow list | | Tweet-text without conjunctions, articles, prepositions, etc. |
| Stemmer | Snowball | | Terms replaced with their common stems |
| Tokenizer | N-Gram | Unigram, bigram, 1–3-gram | Contiguous words or phrases in tweet-text |
| Filter | StringToWordVector | Words to keep: 1000 | Numeric Bag-of-Words containing a dictionary of known words |
| Weighting scheme | TF-IDF | | |
| Minority oversampling | SMOTE | Random seed: 1, 2; Percentage (%): 30, 50, 100; Nearest neighbours: 5, 7 | Bag-of-Words with balanced class representation |
| Feature engineering | Information gain attribute evaluator; search method: Ranker | Entropy > 0.1 | Most significant differentiating attributes |
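The Bag-of-Words steps in the table above can be sketched in plain Python. This is a minimal illustration, not the authors' Weka pipeline: the stop list and toy corpus are invented, and real stemming and the 1000-term cap are omitted for brevity.

```python
# Minimal sketch (assumed details, not the paper's implementation) of
# stop-word removal, 1-3-gram tokenization, and TF-IDF weighting.
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is"}  # stand-in stop list

def ngrams(text, n_max=3):
    """Unigrams up to n_max-grams over stop-word-filtered tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tf_idf(corpus):
    """TF-IDF weighted Bag-of-Words: one dict of term -> weight per document."""
    docs = [Counter(ngrams(doc)) for doc in corpus]
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    return [
        {term: (count / sum(doc.values())) * math.log(n_docs / df[term])
         for term, count in doc.items()}
        for doc in docs
    ]

corpus = ["wash your hands regularly",
          "there is no corona virus",
          "no corona virus in nigeria"]
weights = tf_idf(corpus)
print(round(weights[0]["wash"], 4))  # 0.1221
```

Terms shared by several tweets (e.g. "corona") receive a smaller IDF factor than terms unique to one tweet, which is exactly what makes TF-IDF useful as a differentiating feature weight here.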
Fig. 3 Interaction overview diagram of the proposed infodemic predictive model
Classification metrics of infodemic tweets
| Classifier | BOW-1 Accuracy (%) | BOW-1 F-measure | BOW-1 AUC | BOW-2 Accuracy (%) | BOW-2 F-measure | BOW-2 AUC | BOW-3 Accuracy (%) | BOW-3 F-measure | BOW-3 AUC |
|---|---|---|---|---|---|---|---|---|---|
| | 0.881 | 88.31 | 0.884 | 0.966 | 83.80 | 0.840 | 0.920 | ||
| | 79.46 | 0.792 | 0.783 | 0.913 | 0.860 | ||||
| VP | 79.70 | 0.797 | 0.810 | 89.46 | 0.894 | 0.898 | 85.63 | 0.857 | 0.877 |
| kNN | 66.58 | 0.614 | 0.576 | 78.93 | 0.764 | 0.777 | 72.78 | 0.701 | 0.673 |
| | 79.46 | 0.788 | 88.31 | 0.879 | 87.26 | 0.870 | |||
| Hyperparameter optimization on minority oversampling | |||||||||
| NB | 71.77 | 0.706 | 0.803 | 90.19 | 0.902 | 0.928 | 80.99 | 0.809 | 0.863 |
| | 0.847 | 0.981 | 90.85 | 0.909 | 0.909 | ||||
| VP | 80.53 | 0.805 | 0.824 | 94.46 | 0.945 | 0.951 | 87.42 | 0.874 | 0.893 |
| kNN | 78.77 | 0.784 | 0.770 | 95.17 | 0.952 | 0.956 | 85.61 | 0.856 | 0.887 |
| | 84.35 | 0.844 | 97.01 | 0.970 | |||||
| Hyperparameter optimization on feature selection | |||||||||
| NB | 82.06 | 0.819 | 0.877 | 87.64 | 0.876 | 0.922 | 84.25 | 0.843 | 0.889 |
| SMO | 82.71 | 0.826 | 0.828 | 91.47 | 0.915 | 0.914 | 88.41 | 0.884 | 0.884 |
| | 84.25 | 0.857 | 89.91 | 0.899 | 0.928 | 0.904 | |||
| | 0.806 | 0.843 | 92.04 | 0.920 | 0.945 | 85.97 | 0.859 | 0.899 | |
| | 84.02 | 0.840 | 88.15 | 0.881 | |||||
| Classifier ensemble | |||||||||
| | 0.855 | 0.938 | 0.978 | 0.993 | 0.999 | 1.000 | |||
The best value in each column is shown as underlined and bold underlined text
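The VOTE row above combines the base learners' outputs. A minimal soft-voting sketch follows; the averaging rule and the base-classifier probabilities are assumptions for illustration, not the paper's exact combination scheme or models.

```python
# Sketch of a soft-vote classifier ensemble: each base classifier emits
# P(valid) for a tweet and the ensemble averages them. Probabilities
# below are made up for illustration.
def soft_vote(probabilities):
    """Average per-classifier P(valid) and threshold at 0.5."""
    avg = sum(probabilities) / len(probabilities)
    return ("valid", avg) if avg >= 0.5 else ("invalid", avg)

# Hypothetical outputs of four tuned base learners (e.g. NB, SMO, VP, kNN):
base_outputs = [0.91, 0.85, 0.40, 0.77]
label, confidence = soft_vote(base_outputs)
print(label, round(confidence, 4))  # valid 0.7325
```

Averaging smooths over any single weak learner (here the 0.40 outlier), which is the intuition behind the ensemble outperforming its base classifiers in the table above.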
Classification metrics on BOW-4 for sentiment analysis of fake tweets
| Classifier | Accuracy (%) | F-measure | AUC |
|---|---|---|---|
| NB | 75.21 | 0.752 | 0.902 |
| SMO | 99.49 | 0.995 | 0.996 |
| | 0.989 | | |
| VP | | | |
| VOTE | | | |
The best value in each column is shown as underlined and bold underlined text
Fig. 4 Graphical representation of the minority oversampling technique for balanced class distribution
Fig. 5 Performance metrics of the base classifiers across datasets, showing the improvement of the evaluation measures with minority oversampling
Fig. 6 Performance distribution of SCLAVOEM against the base classifiers after minority oversampling across three datasets, evaluated with F-measure and AUC, respectively
Fig. 7 Entropy-value distribution of the ten most significant dictionary terms across the four datasets, as ranked by the information gain evaluator
Classification rule tree generated on BOW-3 for COVID-19 infodemic tweet detection
Classification rule:

#COVID19 < 0
| China < 0.5
| | Replying < 0.68
| | | @MoetiTshidi < 0.03
| | | | State < 0.16
| | | | | measures < 0.03
| | | | | | African Region < 0.16
| | | | | | | throat < 0.05
| | | | | | | | health < 0.07
| | | | | | | | | based < 0.5
| | | | | | | | | | partners < 0.15
| | | | | | | | | | | #COVID19Nigeria < 0.22
| | | | | | | | | | | | DG < 0.5
| | | | | | | | | | | | | Force < 0.5
| | | | | | | | | | | | | | making < 0.2
| | | | | | | | | | | | | | | updates on < 0.32
| | | | | | | | | | | | | | | | disease < 0.22
| | | | | | | | | | | | | | | | | work < 0.09: 0 (268/34) [143/29]
| | | | | | | | | | | | | | | | | work >= 0.09: 1 (9/4) [4/1]
| | | | | | | | | | | | | | | | disease >= 0.22: 1 (6/2) [2/0]
| | | | | | | | | | | | | | | updates on >= 0.32: 1 (3/0) [0/0]
| | | | | | | | | | | | | | making >= 0.2: 1 (3/0) [1/1]
| | | | | | | | | | | | | Force >= 0.5: 1 (3/0) [1/0]
| | | | | | | | | | | | DG >= 0.5: 1 (3/0) [1/0]
| | | | | | | | | | | #COVID19Nigeria >= 0.22: 1 (7/1) [4/3]
| | | | | | | | | | partners >= 0.15: 1 (5/0) [2/1]
| | | | | | | | | based >= 0.5: 1 (5/0) [1/0]
| | | | | | | | health >= 0.07: 1 (21/7) [4/1]
| | | | | | | throat >= 0.05: 1 (8/0) [1/0]
| | | | | | African Region >= 0.16: 1 (8/0) [1/0]
| | | | | measures >= 0.03: 1 (11/0) [4/1]
| | | | State >= 0.16: 1 (21/3) [18/3]
| | | @MoetiTshidi >= 0.03: 1 (13/0) [4/0]
| | Replying >= 0.68: 0 (45/0) [25/3]
| China >= 0.5: 0 (48/0) [28/2]
#COVID19 >= 0
| China < 0.5: 1 (201/30) [100/13]
| China >= 0.5: 0 (4/0) [3/0]

Size of the tree: 39
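As a reading aid, the root-level splits of the printed rule tree can be transcribed as if/else logic over term weights. This is not the generated model itself: deeper splits are elided, and the `<` / `>=` thresholds are reproduced exactly as printed in the record.

```python
# Transcription sketch of the rule tree's top splits. `w` maps a term
# to its weight in a tweet; class 1 = valid, class 0 = infodemic.
def classify(w):
    covid19 = w.get("#COVID19", 0.0)
    china = w.get("China", 0.0)
    if covid19 >= 0:                     # lower block of the printed tree
        return 1 if china < 0.5 else 0
    # upper block (#COVID19 < 0), as printed:
    if china >= 0.5:
        return 0
    if w.get("Replying", 0.0) >= 0.68:
        return 0
    return 1                             # deeper splits elided here

print(classify({"#COVID19": 0.3}))                # 1 (valid)
print(classify({"#COVID19": 0.3, "China": 0.7}))  # 0 (infodemic)
```

Reading the leaves: a tweet weighting "China" highly is routed to class 0 (infodemic) regardless of the "#COVID19" branch, matching the `China >= 0.5: 0` leaves in both blocks of the tree.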
Experimental evaluation test table of classifier ensemble for infodemic tweets
| Tweet | Prediction | Reality |
|---|---|---|
| ‘who and fgn are scamming us’ | 0 | 0 |
| ‘the figures daily presented by @ncdc are all cooked up’ | 0 | 0 |
| ‘there is no single positive case of #covid-19 in nigeria i know that for a fact’ | 0 | 0 |
| ‘simply bathe with alcohol you will negative’ | 0 | 0 |
| ‘tonight 176 new cases covid19 cases reported nigeria @ncdcgov #takeresponsibiklity’ | 0 | 0 |
| ‘contrary news circulation fact finding team deployed kano state @fmohnigeria received @govumarganduje conduct appraisal situation currently looking outlines provide technical support’ | 1 | 1 |
| ‘chinese brought wrath entire world’ | 0 | 0 |
| ‘#covid19 infect ages hence need adhere strictly medical advices’ | 1 | 1 |
| ‘exposing yourself sun temperature higher than 25c degrees does prevent cure #covid19’ | 0 | 1 |
| ‘#covid19 is not transmitted through houseflies’ | 1 | 1 |
Experimental evaluation test table for invalid tweet sentiments
| Tweet | Prediction | Reality |
|---|---|---|
| ‘covid19 can cannot grow africa because high temperature’ | TN negative | Negative |
| ‘tb joshua project concerning covid19’ | FN negative | Positive |
| ‘there isno corona virus nigeria’ | TN negative | Negative |
| ‘#covid19lagos wash your hands regularly #take #responsibility’ | TP positive | Positive |
| ‘#china must pay this genocide’ | TN negative | Negative |
| ‘covid19 bed isolation centre abuja #covid19 #coronavirusupdate’ | TP positive | Positive |
| ‘@takeresposibility follow uptodate #covid19 breaking news’ | FP positive | Click-trap |
| ‘drop your account details get alert #lockdown’ | Click-trap | Click-trap |
| ‘replying @ncdcgov read your quran bible coronavirus negative’ | TN negative | Negative |
| ‘alcohol prevents #coronavirus’ | TN negative | Negative |
Fig. 8 Pie chart of the sentiment class distribution for BOW-4