Omar Darwish, Yahya Tashtoush, Amjad Bashayreh, Alaa Alomar, Shahed Alkhaza'leh, Dirar Darweesh.
Abstract
Misleading health information is a critical phenomenon of modern life, driven by advances in technology. Social media has made information easy to disseminate, and as a result misinformation spreads rapidly, cheaply, and effectively. Fake health information can significantly affect human behavior and attitudes. This survey presents current work on misleading information detection (MLID) in the health field based on machine learning and deep learning techniques, and provides a detailed discussion of the main phases of the generic approach adopted for MLID. In addition, we highlight the benchmarking datasets and the metrics most commonly used to evaluate the performance of MLID algorithms. Finally, we provide a deep investigation of the limitations and drawbacks of current technologies across various research directions, to help researchers choose the most appropriate methods for this emerging task of MLID.
Keywords: BERT; COVID-19; Deep learning; Disinformation; Machine learning; Misinformative; Misleading information
Year: 2022 PMID: 36034676 PMCID: PMC9396598 DOI: 10.1007/s10586-022-03706-z
Source DB: PubMed Journal: Cluster Comput ISSN: 1386-7857 Impact factor: 2.303
Types of misleading information
| Type | Description |
|---|---|
| Fabricated content | Totally fake content |
| Manipulated content | Real information or imagery that is distorted; for example, a headline crafted to provoke public interest, often promoted as 'clickbait' |
| Imposter content | Impersonation of authentic sources, such as using the branding of a well-known news organization |
| Misleading content | Misleading use of information, such as presenting a statement as established fact |
| False context or connection | Content that is factually correct but accompanied by false contextual information, such as when an article’s headline does not accurately reflect its content |
| Satire and parody | Presenting amusing but fake information as if it were real; although not typically classified as disinformation, it may unintentionally mislead readers |
Fig. 1 A representation of recent studies on the misleading information topic from 2015 to 2022
A summary of MLID approaches
| References | Model | Method | Specifications | Dataset | Size (text) | Topic |
|---|---|---|---|---|---|---|
| [ | SVM, KNN, RF, NB, DistilBERT, DistilRoBERTa | Cost-weighting settings | NVIDIA Tesla V100S 32 GB GPU, 240 GB of storage, 32 GB of RAM, an Intel(R) Core(TM) i7-9750H CPU | CLEF 2018 Consumer Health Search | 5,535,120 | Public Health |
| [ | KG-Miner TransE, text-CNN, CSI, dEFEND, GUpdater, HGAT, DETERRENT | Knowledge guided graph attention network | An attention mechanism, the knowledge guided article embeddings | Diabetes Cancer | 2269 6099 | Public Health |
| [ | RF | Recursive feature selection | Linguistic features, LIWC feature | Collected dataset | 2225 | Public Health |
| [ | MLP, NN, SVM, RT, MAda and RF | Graph theory and social-influence models | Identifies two structural levels (user level and network level) that lead to the extraction of representative features | Collected dataset | 709 | Public Health |
| [ | GB, LR, NB, RF, Bi-LSTM, CNN | L2 regularization | Textual representation, linguistic–stylistic, linguistic emotional, linguistic–medical, propagation-network, and user-profile features | CoAID, ReCOVery, FakeHealth (release) FakeHealth (Story) | 3555 2029 606 1690 | Public Health |
| [ | DT, kNN, MNB, NN, BNB, LSVM, LR, ERF, XGBoost | Voting ensemble | Feature Engineering, TF-IDF, N-gram | Collected dataset | 7486 | COVID-19 |
| [ | SBERT, BiLSTM, SBERT (DA) | BiLSTM and SBERT are jointly trained with the linear classifier | Bi-encoder, forward/backward LSTM, concatenate flatten | COVIDLIES, SNLI, MultiNLI MedNLI | 6761 570,000 433,000 14,049 | COVID-19 |
| [ | Random Forest, DT | Two RF and combination of multiple decision trees | TF-IDF, Count vectorizer, Bag of word | COVID19FN | 2800 | COVID-19 |
| [ | DT, kNN, SVM, RF, NB, LSTM, LR, GRU | Grid search with cross validation, Keras_tuner | TF-IDF, N-gram, word embedding | CoAID PolitiFact Disasters Gossip cop | 926 1050 7613 10,650 | COVID-19 |
| [ | SVM, PAC, MLP, LSTM with FastText, CNN with FastText, LSTM + CNN, BiLSTM + Attention, Ensemble Model (BERT, ALBERT, XLNet) | Transformer model using HuggingFace library | English Glove word embeddings, PyTorch transformer library, Transformer-XL | Collected dataset | 10,700 | COVID-19 |
| [ | BERT | Cluster analysis | Special character removal, stemming and lemmatization, TF-IDF | Collected dataset | 6731 | COVID-19 |
| [ | kNN–BSSA, kNN–BGA with feature selection, kNN–BPSO, and kNN | Wrapper feature selection methods | Reduce the number of features, TF, TF-IDF, and bag-of-words | Koirala | 3002 | COVID-19 |
| [ | Hybrid model (CNN and LSTM) | Hybrid model | TF-IDF, word embedding and hyperparameter optimization | Dataset1 Dataset2 Dataset3 | 1100 10,202 3001 | COVID-19 |
| [ | DT (C4.5), RF, NB, SVM, kNN, Bayes Net + kNN | Stacking method | Data annotation and feature extraction | Collected dataset | 409,484 | COVID-19 |
| [ | LSTM, BiLSTM, CNN, hybrid of LSTM-CNN | GloVe pre-trained word embedding features | Python regular expressions and NLTK | COVID-19 Fake News | 21,379 | COVID-19 |
| [ | Attention-based BiLSTM-CRF model | Conditional Random Field (CRF) | PubMed pretrained embeddings | Collected dataset | 20,137 | Cancer |
| [ | SVM | LinearSVC class based on the LibLinear library | User engagement features | Collected dataset | 250 | Cancer |
| [ | LSTM | GloVe, Word2Vec (CBOW and skip-gram), FastText (CBOW and skip-gram) | Word embeddings | Collected dataset | 140,000 | Influenza |
| [ | SVM, RF, RUSBoost, XGBoost, CNN | Sequence Alignment-free Methods | Word encoding, word embeddings | GI-SAID | 60,087 | Influenza |
| [ | DNN, SVM, J48, NB | CFS reduction method | WEKA, Sklearn library | Collected dataset | 7000 | Heart Attack |
| [ | RF, NB, J48, NN | Session-based model | User-based features, text-based features, and network-based features | Collected dataset | 1.6M | Cyberbullying (Psychological Health) |
| [ | RF, NB, J48 | SentiStrength, Indico API | Big Five, Dark Triad models, psychological features such as personalities, sentiment, and emotion | Collected dataset | 9484 | Cyberbullying (Psychological Health) |
| [ | RF + Big Five and Dark Triad Models | Ensemble technique | Big Five, Dark Triad models to determine user personality | Collected dataset | 9484 | Cyberbullying (Psychological Health) |
| [ | LR, SVM | SVM with two new hypothesis for feature extraction | N-gram, Counting, TF-IDF Score and hypotheses for feature extraction (Capturing pronouns, Skip-grams) | Collected dataset | 6547 | Cyberbullying (Psychological Health) |
| [ | kNN, SVM, NB, DT, RF | SMOTE technique | Network-Based Features, Activity Features, User Features, Content-Based Features, Personality Features, Pointwise Mutual Information-Semantic Orientation (PMI-SO) | Collected dataset | 14,495 | Cyberbullying (Psychological Health) |
| [ | NB, LibSVM, RF, kNN | SMOTE technique | Network features, Activity features, User features, Content features, Pearson correlation, chi-square test, and information gain | Collected dataset | 10,007 | Cyberbullying (Psychological Health) |
| [ | CapsNet–ConvNet | Hybrid deep technique of Capsule network (CapsNet) and convolution neural network (ConvNet) | Google Lens of Google Photos App | Collected dataset | 10,000 | Cyberbullying (Psychological Health) |
| [ | NB | NB Lexicon Based features | Bag of Word features and Lexicon Based features | Collected dataset | 350 | Cyberbullying (Psychological Health) |
| [ | BoW, sBoW, LSA, LDA, EBoW | BoW and latent semantic features | BoW features, latent semantic features and bullying features | Twitter dataset | 1762 | Cyberbullying (Psychological Health) |
| [ | Bi-LSTM, CNN, Bi-LSTM with attention, CNN–LSTM combined | Automatic identification of abusive language in Arabic | For Bayesian hyperparameter optimization, the Tree-structured Parzen Estimator (TPE) algorithm is used | Collected dataset | 15,050 | Cyberbullying (Psychological Health) |
| [ | SVM, NB | WEKA | Tweet To Senti Strength Feature Vector filter | Arabic dataset, English dataset | 35,273 91,431 | Cyberbullying (Psychological Health) |
| [ | NB, LR, SVM, XGBoost, CNN, LSTM, BLSTM and GRU | Youtube API | TF-IDF, Word Embedding, Sklearn, Tensorflow, NLTK, matplotlib | Dataset1 Dataset2 Dataset3 | 5000 7000 12,000 | Cyberbullying (Psychological Health) |
| [ | SVM, Logistic Regression, NB, Random Forest | Ensemble | Label Encoder | Dataset1, OLID, Dataset3, Dataset4 | 1990 14,100 8817 24,784 | Cyberbullying (Psychological Health) |
| [ | GBDT, Random Forest, SVM, XGB_CTD | Fuzzy C-Means (FCM) | Scikit-learn, Keras, FCM and XGBoost Libraries, Intel I7-8500H 3.60 GHz CPU and a laptop with a 12 GB RAM | Collected dataset | 542 | Cyberbullying (Psychological Health) |
| [ | FFNN | 4 hidden layers | One-hot encoding | Dataset1 Dataset2 | 4913 34,890 | Cyberbullying (Psychological Health) |
| [ | NB | Predefined list | BoW | Training Testing | 1,600,000 359 | Cyberbullying (Psychological Health) |
| [ | SVM, CNN-CB | Twitter streaming API | Spyder environment, 12 GB of RAM | Collected dataset | 39,000 | Cyberbullying (Psychological Health) |
| [ | CNN, LRCN | Skip-gram | four NVIDIA GTX 1080 servers | Collected dataset | 8815 | Cyberbullying (Psychological Health) |
| [ | SVM, DT (C4.5), NB, and kNN | Information gain and chi-square | Tokenization | Collected dataset | 900 | Cyberbullying (Psychological Health) |
| [ | CNN and PCNN | TM (threshold moving), CFA (cost function adjusting), and a hybrid solution (TM CFA) | Removing non-alphanumeric characters tokens | Dataset1 Dataset 2 | 1313 13,000 | Cyberbullying (Psychological Health) |
| [ | SVM | n-gram | Tokenization, Normalization | Collected dataset | 15,050 | Cyberbullying (Psychological Health) |
A summary of the used datasets
| Ref. | Dataset | Size (text) | No. of features |
|---|---|---|---|
| [ | CLEFeHealth | 5,535,120 | 6 |
| [ | CoAID | 3555 | 4 |
| [ | ReCOVery | 2029 | – |
| [ | FakeHealth (release) | 606 | – |
| [ | FakeHealth (Story) | 1690 | – |
| [ | SNLI | 570,000 | 13 |
| [ | MultiNLI | 433,000 | 10 |
| [ | MedNLI | 14,049 | 7 |
| [ | COVIDLIES | 6761 | 4 |
| [ | COVID19FN | 2800 | 11 |
| [ | Disasters | 7613 | 5 |
| [ | PolitiFact | 1050 | 4 |
| [ | Gossip cop | 10,650 | 5 |
| [ | Koirala | 3002 | 5 |
| [ | GI-SAID | 60,087 | – |
| [ | COVID-19 Fake News | 21,379 | – |
A comparison of the accuracies reached by MLID approaches
| References | Model | ACC (%) |
|---|---|---|
| [ | DETERRENT | 96.00 |
| [ | RF | 90.10 |
| [ | RF | 73.63 |
| [ | NN | 99.68 |
| [ | RF | 94.49 |
| [ | Modified LSTM | 98.57 |
| [ | Ensemble Model | 98.55 |
| [ | BERT | 87.00 |
| [ | kNN–BGA | 75.40 |
| [ | CNN | 94.20 |
| [ | DNN | 95.20 |
| [ | RF | 91.08 |
| [ | J48 | 91.88 |
| [ | LR | 86.00 |
| [ | RF | 93.00 |
| [ | ConvNet | 97.05 |
| [ | CNN | 87.84 |
| [ | CNN | 84.00 |
| [ | SVM, LR, Ensemble | 92.00 |
| [ | XGB-CTD | 91.75 |
| [ | FFNN | 94.56 |
| [ | NB | 67.30 |
| [ | CNN-CB | 95.00 |
| [ | LRCN | 87.22 |
| [ | NB | 84.00 |
| [ | PCNN | 96.00 |
| [ | SVM | 78.00 |
| [ | SVM | 90.50 |
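Several of the accuracies above were obtained after tuning hyperparameters via grid search with cross-validation (e.g., the "Grid search with cross validation" entry in the approaches table). A minimal sketch of that procedure, using synthetic data and an illustrative SVM grid rather than any surveyed work's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for vectorized article features and veracity labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 5-fold cross-validated grid search over SVM hyperparameters;
# accuracy on the held-out folds selects the best combination.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```

Reporting the test-set score of the refit best estimator, as here, is what the tabulated ACC values typically correspond to.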
A comparison of other performance metrics obtained using MLID approaches
| References | Model | Metric | (%) |
|---|---|---|---|
| [ | RF, SVM, DistilRoBERTa | F1-score | 93.00 |
| [ | DETERRENT | Precision | 94.00 |
| [ | DETERRENT | Recall | 91.00 |
| [ | DETERRENT | F1-score | 93.00 |
| [ | RF | Precision | 72.80 |
| [ | RF | Recall | 73.60 |
| [ | RF | AUC | 89.00 |
| [ | CNN | AUC | 97.30 |
| [ | CNN | F1-score | 95.30 |
| [ | NN | ERR | 32.00 |
| [ | NN | AUC | 99.47 |
| [ | BERTSCORE (DA) + SBERT (DA) | Precision | 63.30 |
| [ | BiLSTM | Recall | 94.20 |
| [ | BiLSTM | F1-score | 89.50 |
| [ | RF | Precision | 95.00 |
| [ | RF | Recall | 95.00 |
| [ | RF | F1-score | 95.00 |
| [ | Ensemble Model | Precision | 98.55 |
| [ | Ensemble Model | Recall | 98.55 |
| [ | Ensemble Model | F1-score | 98.55 |
| [ | Modified LSTM | Precision | 98.55 |
| [ | Modified LSTM | Recall | 98.60 |
| [ | Modified LSTM | F1-score | 98.50 |
| [ | kNN–BGA | Precision | 66.22 |
| [ | kNN | Recall | 69.57 |
| [ | kNN–BSSA | F1-score | 61.96 |
| [ | CNN | Precision | 93.60 |
| [ | CNN | Recall | 93.90 |
| [ | CNN | F1-score | 93.70 |
| [ | CNN | Specificity | 93.90 |
| [ | CNN | Error rate | 5.80 |
| [ | CNN | Miss rate | 5.50 |
| [ | CNN | FPR | 6.00 |
| [ | DNN | F1-score | 73.60 |
| [ | J48 | AUC | 97.00 |
| [ | J48 | F-score | 92.00 |
| [ | J48 | Kappa | 84.00 |
| [ | J48 | RMSE | 17.00 |
| [ | RF | Kappa | 59.00 |
| [ | RF | RMSE | 14.00 |
| [ | RF | AUC | 81.00 |
| [ | RF | Precision | 90.00 |
| [ | RF | Recall | 91.00 |
| [ | RF + Big Five and Dark Triad | Precision | 96.00 |
| [ | RF + Big Five and Dark Triad | Recall | 95.00 |
| [ | RF + Big Five and Dark Triad | F1-score | 92.00 |
| [ | LR | AUC | 86.92 |
| [ | LR | Recall | 71.00 |
| [ | LR | Precision | 76.90 |
| [ | RF | F1-score | 92.00 |
| [ | RF | Kappa | 84.00 |
| [ | RF | AUC | 94.30 |
| [ | RF | Recall | 93.00 |
| [ | RF | Precision | 94.00 |
| [ | RF | F1-score | 93.00 |
| [ | CapsNet–ConvNet | AUC | 98.00 |
| [ | ConvNet | Recall | 95.08 |
| [ | ConvNet | Precision | 98.60 |
| [ | CNN–LSTM | Recall | 83.46 |
| [ | CNN | Precision | 86.10 |
| [ | CNN | F1-score | 84.05 |
| [ | SVM | Precision | 93.40 |
| [ | SVM | Recall | 94.10 |
| [ | SVM | F1-score | 92.70 |
| [ | CNN | Precision | 84.00 |
| [ | CNN | ROC | 84.00 |
| [ | XGBoost | Recall | 91.00 |
| [ | NB | F1-score | 86.00 |
| [ | CNN-CB | Precision | 93.00 |
| [ | CNN-CB | Recall | 73.00 |
| [ | SVM | Precision | 76.80 |
| [ | SVM | Recall | 79.40 |
| [ | SVM | F1-score | 78.00 |
| [ | PCNN | Precision | 74.00 |
| [ | PCNN | Recall | 45.30 |
| [ | PCNN | F1-score | 56.20 |
| [ | SVM | Precision | 88.00 |
| [ | SVM | Recall | 80.00 |
| [ | SVM | F1-score | 82.00 |
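All of the metrics tabulated above (precision, recall, F1-score, specificity, accuracy, and their complements such as miss rate, FPR, and error rate) derive from the binary confusion matrix counts TP, FP, TN, FN. A minimal reference implementation, assuming binary 0/1 labels and that both classes appear in the predictions (no zero-division handling):

```python
def mlid_metrics(y_true, y_pred):
    """Compute the confusion-matrix-based metrics used in the tables above."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # miss rate = 1 - recall
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)         # FPR = 1 - specificity
    accuracy = (tp + tn) / len(y_true)   # error rate = 1 - accuracy
    return {"precision": precision, "recall": recall, "f1": f1,
            "specificity": specificity, "accuracy": accuracy}

print(mlid_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

AUC is the exception: it is computed from ranked scores rather than hard predictions, which is why works reporting it require probabilistic classifier outputs.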
The complete list of abbreviations used in this survey
| Abbreviations | Definition | Abbreviations | Definition |
|---|---|---|---|
| MLID | Misleading information detection | ERF | Ensemble random forest |
| DT | Decision tree | XGBoost | Extreme gradient boosting |
| kNN | k-Nearest neighbor | ACC | Accuracy |
| ML | Machine learning | SVM | Support vector machine |
| PAC | Passive aggressive classifier | Bi-LSTM | Bidirectional long short-term memory |
| MLP | Multi-Layer perceptron | BERT | Bidirectional encoder representations from transformers |
| ALBERT | A Lite BERT | NB | Naive Bayes |
| CNN | Convolutional neural network | LSTM | Long short-term memory |
| LSVM | Linear support vector machines | ERR | Error rate |
| NLI | Natural language inference | SBERT | Sentence-BERT |
| PSO | Particle swarm optimization | GA | Genetic algorithm |
| BNB | Bernoulli Naïve Bayes | AUC | Area under the curve |
| SSA | Salp swarm algorithm | BSSA | Binary salp swarm algorithm |
| RRSE | Root relative squared error | PRE | Precision |
| EBoW | Embeddings-enhanced Bag-of-Words | REC | Recall |
| BPSO | Binary particle swarm optimization | BGA | Binary-coded genetic algorithm |
| NN | Neural network | CAM | Complementary and alternative medicine |
| PMI-SO | Pointwise mutual information–semantic orientation | SMOTE | Synthetic Minority Over-sampling Technique |
| LDA | Latent Dirichlet allocation | LSA | Latent semantic analysis |
| LibSVM | Library for support vector machines | CapsNet | Capsule network deep neural network |