Shubhangi Rastogi, Divya Bansal.
Abstract
The emergence of social media platforms has amplified the dissemination of false information in many forms. Social media gives rise to virtual societies by granting users the freedom of expression that a democracy affords. Because of echo chambers on social media, social science theories play a vital role in understanding the spread of false news. To this end, we provide a comprehensive framework adapted from several scholarly studies. The framework classifies information into three types, namely real, disinformation, and satire, based on authenticity as well as intention. The process highlights an interdisciplinary approach: fundamental theories of social science are integrated with modern computational tools and techniques. Some of these theories hold that malicious users write fabricated content in a distinctive style to attract an audience. Style-based methods evaluate intention, i.e., whether or not the content is written with an intent to mislead the audience. However, writing style can be deceptive, so it is important to incorporate user-oriented social information to strengthen the model. The paper therefore takes an integrated approach, combining style-based and propagation-based features, thirty-one in total. The extracted features fall into ten categories: relative frequency, quantity, complexity, uncertainty, sentiment, subjectivity, diversity, informality, additional, and popularity. The features were used iteratively with supervised classifiers, and the best-correlated ones were then selected using the ANOVA test. Our experimental results show that the selected features can distinguish real news from disinformation and satirical news. The Ensemble machine learning model outperformed the other models on the developed multi-labelled corpus.
Keywords: Covid-19; Disinformation; Ensemble; Fake; Machine learning; Neural network; Satire
Year: 2022 PMID: 35582207 PMCID: PMC9098146 DOI: 10.1007/s11042-022-13129-y
Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.577
Fig. 1 Background flow: the figure summarises the approach followed
Extracted Features based on perspectives
| Approach | Features | Description |
|---|---|---|
| Style-based features | TF-IDF (F1) | Relative frequency of words |
| | Quantity (F2) | # Characters |
| | | # Words |
| | | # Noun phrases |
| | | # Sentences |
| | Complexity (F3) | Average # characters per word |
| | | Average # words per sentence |
| | | Average # punctuations per sentence |
| | Uncertainty (F4) | # Modal verbs |
| | | # Certainty terms |
| | | # Generalizing terms |
| | | # Tentative terms |
| | | # Numbers and quantifiers |
| | | # Question marks |
| | Sentiment (F5) | # Positive words |
| | | # Negative words |
| | | # Anxiety/angry/sadness words (emotion) |
| | | # Exclamation marks |
| | | Content sentiment polarity |
| | Subjectivity (F6) | # Subjective verbs |
| | Diversity (F7) | # Unique words |
| | | # Unique nouns, verbs, adjectives, adverbs |
| | Informality (F8) | # Typos/spellchecks |
| | | # Swear words/netspeak/assent/fillers |
| | Additional (F9) | # Hashtags |
| | | # Mentions |
| | | # Stopwords |
| | | # URLs |
| | | Mean word length |
| User engagement features | Popularity (F10) | # Likes |
| | | # Retweets |
| | | # Replies |
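Several of the style features above (quantity F2, diversity F7, additional F9) reduce to simple token counts. A minimal sketch of how such counts could be extracted with regular expressions — the paper does not specify its tooling, so the patterns below are illustrative assumptions:

```python
import re

def style_features(text: str) -> dict:
    """Counts for a few of the quantity (F2), diversity (F7), and
    additional (F9) features; regex approximations, not the paper's code."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "char_count": len(text),
        "word_count": len(words),
        "sentence_count": len([s for s in re.split(r"[.!?]+", text) if s.strip()]),
        "unique_word_count": len({w.lower() for w in words}),
        "hashtag_count": len(re.findall(r"#\w+", text)),
        "mention_count": len(re.findall(r"@\w+", text)),
        "url_count": len(re.findall(r"https?://\S+", text)),
        "mean_word_length": sum(map(len, words)) / max(1, len(words)),
    }
```

Each tweet is mapped to one row of the feature matrix this way; the remaining lexicon-based features (certainty, tentative, swear words, etc.) would need word lists such as LIWC-style dictionaries.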
Theories in Social Sciences helpful in deterring the spread of false information
| | Theory | Description | Features |
|---|---|---|---|
| News-related theories | Undeutsch hypothesis | A statement derived from real-life experience differs in content and quality from a fabricated one. | Writing style |
| | Reality monitoring | Factual accounts contain more detailed sensory information. | More unique words |
| | Four-factor theory | Deceptive statements differ in arousal, guilt, and emotion, and try to appear real. | Sentiments |
| | Information manipulation theory | Deceptive messages contain an extensive quantity of information. | Word counts |
| Social impact theories | Normative influence theory | People's behavior is influenced by society so as to be more accepted and liked. | Likes |
| | Availability cascade | People adopt others' perceptions when those perceptions gain popularity within their social circles. | Retweets |
| | Social cognitive theory | People are actively molded by their surroundings. | Popularity |
Literature Review
| Approach | Ref. | Type & Dataset | Purpose | Features | Result |
|---|---|---|---|---|---|
| Style-based | | News articles; crowdsourced dataset, 240 fake & 240 real | Exploratory analysis of linguistic differences between fake and legitimate news content | Linguistic features: n-grams, punctuation, psycholinguistic, readability, and syntax | 74% accuracy with SVM |
| | | News articles; 1627 articles from BuzzFeed corpus | Hyperpartisan news can be distinguished from mainstream news by its writing style | n-grams, readability scores, dictionary-based features | 90% accuracy with linear classifier |
| | | Crowdfunding platform | To examine the importance of content-based and linguistic cues for fraud detection | Linguistic (diversity, uncertainty, informality, etc.) and content-based cues (bag-of-words) | 77.45% accuracy using ensemble classifier |
| | | Customer reviews from | Compare verbal and non-verbal features to detect fake online reviews | Non-verbal (membership length, tips count, photo count) and verbal features (review length, average sentence length, noun ratio) | 87% accuracy with random forest classifier |
| | | Facebook and news sites; BuzzFeed election data (120), real/fake/satire websites (224), Burfoot and Baldwin data (4233) | Fake and satire news are distinguishable via stylistic features of the title using machine learning | Stylistic, complexity, and psychological features | 91% accuracy with SVM |
| | | News articles; real from Reuters & fake from Kaggle (12,600) | Fake news detection using n-gram analysis | TF-IDF with n-grams | 92% with linear SVM |
| | | Social media; Twitter API | Comparison of different feature sets for detecting sarcastic articles | Sentiment, complexity, and language-based features | 80% accuracy with gradient boosting |
| Social-based | | Social media; 14 million Twitter messages | Social bots heavily spread low-credibility content by retweeting and replying to those posts | Replies, retweets, mentions, sentiment, etc. | 94% accuracy using binary classifier |
| | | Social media; 343,645 Twitter messages | Sentiment analysis of Twitter discussions on the US presidential elections; distinguishing misinformation from negative information | Retweets, likes, replies, sentiments, number of followers | Observations |
| | | Social media; 10.8M Twitter posts and 6.2M Reddit comments | Develop a neural network to classify deceptive and trusted news sources by speed and types of reactions | Reactions such as answer, appreciation, elaboration, question | Observations |
| Latent-based | | News articles and fact-checking websites; 1400 true and 2004 fake | Find the best combination of features and word embeddings to detect deception accurately | Linguistic features | 95% accuracy with AdaBoost ensemble classifier |
| | | Existing datasets; Amazon (21,000) & hotel datasets (800) | Improve the performance of fake customer review detection | n-grams, word embeddings, lexicon-based emotion indicators | 89.56% & 82.80% with proposed DFFNN model |
Fig. 2 Data collection flowchart
Annotators' Guidelines
Fig. 3 Generic framework for detection of fake news
Fig. 4 Percentage and length distribution of data in each category
Fig. 5 Data distribution boxplots of linguistic features: (a) Char_count (b) Hashtag_count (c) Mean_word_len (d) Mention_count (e) Unique_word_count (f) Word_count (g) Stop_word_count (h) Punct_count (i) Url_count (j) Likes (k) Retweets (l) Replies (m) Polarity (n) Subjectivity (o) Modal_verbs
Skewed values of features
| Features | Skewness | After outlier removal | Features | Skewness | After outlier removal |
|---|---|---|---|---|---|
| Likes | 12.43 | -0.15 | Tentative | 2.81 | 0 |
| Retweets | 10.39 | 0.48 | Generalizing | 4.84 | 1.32 |
| Replies | 10.97 | 1.07 | Certainty | 2.98 | 0.60 |
| URL_count | -0.44 | - | Noun phrases | 2.69 | 0.81 |
| Hashtag_count | 3.76 | 1.92 | Sentences | -0.34 | - |
| Mention_count | 6.64 | 6.89 | Avg_character_count | -0.09 | - |
| Avg_Punct_count | 0.62 | - | Avg_words_count | 0.56 | - |
| Stopword_Count | 0.77 | - | Numbers+quantifiers | 3.76 | 0.24 |
| Mean_word_length | 1.60 | 0.57 | Question_mark_count | 4.89 | 1.03 |
| Word_count | 0.44 | - | Positive_count | 7.31 | 2.43 |
| Unique_word_count | 0.37 | - | Negative_count | 4.53 | 3.96 |
| Char_count | -0.06 | - | Anxiety/angry/sad_Count | 5.72 | 2.43 |
| Polarity | -0.53 | - | Exclamation_count | 6.34 | 1.04 |
| Subjectivity | 0.61 | - | Unique_nouns/verbs/adverbs | 2.54 | 1.77 |
| Modal verbs | 3.11 | 1.02 | Spellcheck_count | 5.63 | 2.48 |
| Swear_word_count | 4.54 | 3.61 | | | |
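The skewness values above follow the population (Fisher-Pearson) formula, and the right-hand columns imply outliers were removed before re-measuring. A stdlib-only sketch, assuming a standard 1.5×IQR fence for removal (the paper does not name its exact rule):

```python
from statistics import mean

def skewness(xs):
    # Population (Fisher-Pearson) skewness, as reported by scipy.stats.skew.
    m, n = mean(xs), len(xs)
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def iqr_clip(xs):
    # 1.5*IQR fences -- a common outlier rule; the exact method used on
    # the FaK_CoV features is an assumption here.
    ys = sorted(xs)
    q1, q3 = ys[len(ys) // 4], ys[(3 * len(ys)) // 4]
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [x for x in xs if lo <= x <= hi]
```

Heavily right-skewed counts such as Likes (12.43) move close to zero once the long tail of viral outliers is clipped, which matches the table's pattern.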
Fig. 6 Word embeddings visualized using t-SNE
Fig. 7 Flow chart of the proposed generic framework
Selected hyperparameter values mainly affecting the overall performance
| Classifier | Hyperparameters affecting classifier performance on the FaK_CoV dataset |
|---|---|
| Support Vector Machine (SVM) | Kernel= Linear & RBF; Decision function shape=ovo; penalty term C = 1; degree= 2 |
| Random Forest (RF) | Number of estimators = 3 |
| eXtreme Gradient Boosting (XGB) | Learning rate = 0.5; n_estimators = 3; max_depth = 2; min_child_weight = 1; gamma = 5; objective = multi:softmax |
| Decision Tree (DT) | Depth= 2 |
| K-nearest neighbor (KNN) | Number of neighbours, k = 5 |
| Multi-layer Perceptron (MLP) | Activation function = tanh, ReLU, logistic; solver = adam; hidden layers = (25, 20, 10); max iterations = 500 (tanh performed best) |
Fig. 8 Ensemble model architecture
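The tuned base learners from the table and a voting ensemble along the lines of Fig. 8 could be assembled in scikit-learn as follows. Hyperparameters not listed in the table stay at library defaults, and hard majority voting is an assumption about the combination rule; the XGB and MLP members are omitted for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Base learners configured with the table's listed values.
base = [
    ("svm", SVC(kernel="rbf", C=1, degree=2, decision_function_shape="ovo")),
    ("rf", RandomForestClassifier(n_estimators=3, random_state=0)),
    ("dt", DecisionTreeClassifier(max_depth=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
# Hard (majority-vote) combination of the base predictions.
ensemble = VotingClassifier(estimators=base, voting="hard")

# Synthetic 3-class data standing in for the real/disinfo/satire matrix.
X, y = make_classification(n_samples=120, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
ensemble.fit(X, y)
preds = ensemble.predict(X)
```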
Fig. 9 Generating n-grams: unigrams (top-left), bigrams (bottom-left), trigrams (top-right), and quad-grams (bottom-right)
Using TF-IDF
| Classification Model | Accuracy (%) | | | |
|---|---|---|---|---|
| N = 1 Unigram | N = 2 Bigram | N = 3 Trigram | N = 4 Quadgram | |
| MNB | 94.62 | 72.73 | 63.26 | 62.5 |
| LR | 97.11 | 92.8 | 66.67 | 62.5 |
| SVM1, SVM2 | 92.05, 92.8 | 90.15, 91.29 | 62.88, 88.26 | |
| RF | 97.62 | 93.56 | 62.88 | 50.76 |
| XGB | 97.62 | 74.62 | 44.7 | 44.7 |
| DT | 75.76 | 67.05 | 44.7 | 44.7 |
| KNN | 87.88 | 84.85 | 84.47 | 25.0 |
| Ensemble | 98.04 | 96.21 | 78.79 | 61.36 |
| MLP | 98.08 | 81.82 | 62.5 | 51.52 |
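The per-column setup of the table — a separate vocabulary for each n-gram size — corresponds to fixing `ngram_range=(n, n)` in a TF-IDF pipeline. A minimal sketch with invented toy texts and labels (unigrams shown; change `(1, 1)` to `(2, 2)` for the bigram column, and so on):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy corpus; one article per class.
texts = [
    "masks stop the spread of the virus",
    "masks secretly cause the illness",
    "aliens endorse masks for fashion",
]
labels = ["real", "disinformation", "satire"]

# ngram_range=(1, 1) restricts the vocabulary to unigrams only.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LogisticRegression())
model.fit(texts, labels)
```

Accuracy dropping sharply for trigrams and quad-grams is expected: higher-order n-grams are sparse in short tweets, so most test articles share no vocabulary with the training set.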
Variables in ANOVA test
| Variable | Name |
|---|---|
| Dependent variables | The 31 features, considered one at a time |
| Independent variable | Labels: disinformation, real, and satire |
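With this setup, a one-way ANOVA is run once per feature, grouping that feature's values by the three labels. A sketch with invented per-class values of a single feature:

```python
from scipy.stats import f_oneway

# Invented per-class values of one feature (e.g. word count); in the paper
# the test is repeated once per feature, with the article label as the
# grouping (independent) variable.
real = [120, 130, 125, 118, 140]
disinfo = [200, 210, 190, 220, 205]
satire = [90, 95, 100, 85, 92]

stat, p = f_oneway(real, disinfo, satire)
# A small p-value (< 0.05) marks the feature as discriminative.
```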
Feature Importance using ANOVA test
| Feature Set | Features | ANOVA (p-value) | Selected |
|---|---|---|---|
| F2 | # Characters | 0.000004 | ✓ |
| | # Words | 0.000001 | ✓ |
| | # Noun phrases | 0.34 | |
| | # Sentences | 0.89 | |
| F3 | Average # characters per word | 0.3 | |
| | Average # words per sentence | 0.09 | |
| | Average # punctuations per sentence | 0.845 | |
| F4 | # Modal verbs | 0.79 | |
| | # Certainty terms | 0.58 | |
| | # Generalizing terms | 0.91 | |
| | # Tentative terms | 0.12 | |
| | # Numbers and quantifiers | 0.56 | |
| | # Question marks | 0.89 | |
| F5 | # Positive words | 0.05 | ✓ |
| | # Negative words | 0.04 | ✓ |
| | # Anxiety/angry/sadness words (emotion) | 0.05 | ✓ |
| | # Exclamation marks | 0.9 | |
| | Content sentiment polarity | 0.2 | |
| F6 | # Subjective verbs | 0.03 | ✓ |
| F7 | # Unique words | 2.45e-07 | ✓ |
| | # Unique nouns, verbs, adjectives, adverbs | 0.8 | |
| F8 | # Typos/spellchecks | 0.73 | |
| | # Swear words/netspeak/assent/fillers | 0.68 | |
| F9 | # Hashtags | 0.221 | |
| | # Mentions | 0.662 | |
| | # Stopwords | 0.51 | |
| | # URL | 0.32 | |
| | Mean word length | 0.000052 | ✓ |
| F10 | # Likes | 0.11 | |
| | # Retweets | 0.054 | ✓ |
| | # Replies | 0.000019 | ✓ |
Using iterative feature engineering
| Classifier | F2 Acc. | F2 F1 | F2-3 Acc. | F2-3 F1 | F2-4 Acc. | F2-4 F1 | F2-5 Acc. | F2-5 F1 | F2-6 Acc. | F2-6 F1 | F2-7 Acc. | F2-7 F1 | F2-8 Acc. | F2-8 F1 | F2-9 Acc. | F2-9 F1 | F2-10 Acc. | F2-10 F1 | Selected Acc. | Selected F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MNB | 0.51 | 0.50 | 0.61 | 0.60 | 0.50 | 0.46 | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| LR | 0.64 | 0.62 | 0.49 | 0.45 | 0.60 | 0.59 | 0.58 | 0.57 | 0.57 | 0.56 | 0.57 | 0.56 | 0.58 | 0.57 | 0.73 | 0.73 | 0.76 | 0.75 | 0.82 | 0.81 |
| SVM1 | 0.61 | 0.54 | 0.68 | 0.68 | 0.67 | 0.67 | 0.66 | 0.67 | 0.66 | 0.66 | 0.66 | 0.66 | 0.63 | 0.63 | 0.62 | 0.55 | 0.71 | 0.65 | 0.72 | 0.65 |
| SVM2 | 0.70 | 0.65 | 0.64 | 0.62 | 0.62 | 0.60 | 0.58 | 0.57 | 0.58 | 0.57 | 0.59 | 0.52 | 0.58 | 0.59 | 0.77 | 0.77 | 0.80 | 0.79 | 0.84 | 0.85 |
| RF | 0.63 | 0.63 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.84 | 0.84 | 0.86 | 0.87 | ||||||||
| XGB | 0.60 | 0.58 | 0.72 | 0.72 | 0.71 | 0.71 | 0.67 | 0.65 | 0.63 | 0.60 | 0.52 | 0.43 | 0.53 | 0.53 | 0.72 | 0.72 | 0.77 | 0.73 | 0.80 | 0.79 |
| DT | 0.59 | 0.52 | 0.59 | 0.52 | 0.59 | 0.52 | 0.59 | 0.30 | 0.59 | 0.52 | 0.59 | 0.52 | 0.61 | 0.60 | 0.75 | 0.74 | 0.71 | 0.72 | 0.70 | 0.71 |
| KNN | 0.72 | 0.71 | 0.67 | 0.62 | 0.67 | 0.67 | 0.68 | 0.68 | 0.67 | 0.67 | 0.71 | 0.71 | 0.73 | 0.72 | 0.72 | 0.71 | 0.78 | 0.79 | 0.81 | 0.81 |
| Ensemble | 0.73 | 0.73 | 0.66 | 0.66 | 0.68 | 0.68 | 0.73 | 0.73 | ||||||||||||
| MLP | 0.67 | 0.64 | 0.70 | 0.70 | 0.71 | 0.71 | 0.69 | 0.69 | 0.70 | 0.70 | 0.72 | 0.71 | 0.81 | 0.81 | 0.88 | 0.88 | 0.92 | 0.92 | ||
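The cumulative scheme behind the columns — start from feature set F2 and add one group at a time, scoring each superset — can be sketched as follows. The data and the slice boundaries of the groups are synthetic stand-ins, since the real groups have varying sizes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 10))      # stand-in for the 31-feature matrix
y = rng.integers(0, 3, size=90)    # three labels: real / disinfo / satire

# Cumulative feature sets F2, F2-3, ..., F2-10; hypothetical boundaries.
feature_sets = [slice(0, k) for k in range(2, 11)]
scores = [
    cross_val_score(LogisticRegression(max_iter=500), X[:, fs], y, cv=3).mean()
    for fs in feature_sets
]
```

Plotting `scores` against the growing feature sets reproduces the kind of sweep the table reports per classifier.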
Using TF-IDF with Selected features (Data Mapper)
| Classification Model | Accuracy | F-1 measure | Kappa |
|---|---|---|---|
| MNB | - | - | - |
| LR | 0.81 | 0.81 | 0.70 |
| SVM1, SVM2 | 0.72, 0.85 | 0.65, 0.86 | 0.54, 0.81 |
| RF | 0.96 | 0.96 | 0.94 |
| XGB | 0.91 | 0.91 | 0.86 |
| DT | 0.76 | 0.68 | 0.60 |
| KNN | 0.81 | 0.81 | 0.70 |
| Ensemble | |||
| MLP | 0.98 | 0.98 | 0.97 |
Using word embeddings
| Classification Model | Accuracy | F-1 measure | Kappa |
|---|---|---|---|
| MNB | 52.94 | 0.52 | 0.30 |
| LR | 55.04 | 0.54 | 0.32 |
| SVM1, SVM2 | 68.49 | 0.68 | 0.49 |
| RF | 82.56 | 0.82 | 0.72 |
| XGB | 82.77 | 0.82 | 0.72 |
| DT | 77.73 | 0.77 | 0.64 |
| KNN | 61.34 | 0.59 | 0.36 |
| Ensemble, Ensemble1 | |||
| MLP (tanh, ReLU, logistic) | 67.86, 68.07, 69.54 | 0.67, 0.67, 0.69 | 0.78, 0.78, 0.78 |
Fig. 10 Visualization of tweet articles from the FaK_CoV corpus (a) using PCA and (b) using t-SNE
Confusion Matrix
| Pred/True | Real | Disinfo | Satire | Overall | % Accuracy |
|---|---|---|---|---|---|
| Real | 119 | 1 | 0 | 120 | 99.16 |
| Disinfo | 0 | 85 | 0 | 85 | 100 |
| Satire | 1 | 8 | 50 | 59 | 84.75 |
| Overall | 120 | 94 | 50 | 264 | 94.62 |
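Reading rows as predictions and columns as true labels, the per-class and overall accuracies can be recomputed directly from the matrix; note the trace gives 254/264 ≈ 96.2%, slightly above the 94.62% printed in the table, which may use a different denominator:

```python
# Rows are predicted labels, columns are true labels, as in the table.
cm = [
    [119, 1, 0],   # predicted Real
    [0, 85, 0],    # predicted Disinfo
    [1, 8, 50],    # predicted Satire
]
per_class = [100 * cm[i][i] / sum(cm[i]) for i in range(3)]  # 99.17, 100, 84.75
overall = 100 * sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))
```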
Fig. 11 Ten misclassified articles by the Ensemble classifier
Fig. 12 Misclassified satirical news extracted manually from Twitter
Random tweets tested on Ensemble using selected features
| | Text | Actual category | Predicted category | Conditional probability | Score |
|---|---|---|---|---|---|
| Related articles | No my friend, muslims are doing something called coronajihad, they are spitting on fruit and vegetables. | Disinfo | Disinfo | 99.70 | 5 |
| | Shameful: How Tablighi Jamaat workers manhandled a lady health worker in Delhi LNJP hospital. | Real | Real | 97.85 | 1 |
| | Vijay Mallya Plans to fill all his aircrafts with crude oil, will earn enough money to clear all his loans | Satire | Disinfo | 52.79 | 3 |
| Unrelated article | I have instructed the United states Navy to shoot down and destroy any and all Iranian gunboats if they harass our ships at sea. | Real | Real | 58.74 | 3 |
Popular hashtags inciting hate on Twitter
| #Islamophobia | 825.5K | 777.7K | 2.0B |
|---|---|---|---|
| #CoronaJihad | 538.0K | 504.1K | 914.4M |
| #NizamuddinIdiots | 156.7K | 151.2K | 276.7M |
| #TablighiVirus | 27.4K | 26.4K | 56.3M |
Fig. 13 Popular hashtags and interest over time
Fig. 14 Screenshots of tweets spreading fake information to push an anti-Muslim narrative