Suleman Khan1, Saqib Hakak2, N Deepa3, B Prabadevi3, Kapal Dev4, Silvia Trelova5.
Abstract
Since the emergence of COVID-19 in December 2019, social media, traditional print, and electronic media have carried numerous posts and news items about the pandemic. These sources draw on information from both trusted and non-trusted medical sources, and the news they carry spreads rapidly. Spreading deceptive information may lead to anxiety, unwanted exposure to medical remedies, and digital marketing tricks, and may even have deadly consequences. Therefore, a model for detecting fake news in the news pool is essential. In this work, a dataset that fuses COVID-19-related news from several social media and news sources is used for classification. In the first step, the dataset is preprocessed to remove unwanted text, and tokenization is then carried out to extract tokens from the raw text collected from the various sources. Next, feature selection is performed to avoid the computational overhead of processing all the features in the dataset. Linguistic and sentiment features are extracted for further processing. Finally, several state-of-the-art machine learning algorithms are trained to classify the COVID-19-related dataset and are evaluated using various metrics. The results show that the random forest classifier outperforms the other classifiers, with an accuracy of 88.50%.
Keywords: COVID-19; fake news; feature extraction; machine learning; social media
Year: 2022 PMID: 35059379 PMCID: PMC8764372 DOI: 10.3389/fpubh.2021.788074
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
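The preprocessing and tokenization steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say what counts as "unwanted text" or which tokenizer was used, so everything here is an assumption: the cleaning rules (lowercasing, dropping URLs and punctuation), the tiny `STOPWORDS` set, and the function names `preprocess` and `tokenize` are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "for", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s-]")


def preprocess(text):
    """Lowercase the text and strip URLs and punctuation (assumed 'unwanted text')."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_WORD_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split the cleaned text on whitespace and drop stopwords."""
    return [tok for tok in preprocess(text).split() if tok not in STOPWORDS]


if __name__ == "__main__":
    print(tokenize("The COVID-19 vaccine is SAFE! Read more: https://example.org/x"))
```

Because this cleaning step lowercases the text and removes symbols, any features that depend on case or punctuation would have to be computed from the raw text before it is applied.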
Figure 1. Application of machine learning algorithms for COVID fake news detection.
Figure 2. Proposed architecture.
Features extracted from text.
| Feature | Type | Feature | Type |
|---|---|---|---|
| News source | Non-Numeric | Num of ? | Numeric |
| Num of Stopwords | Numeric | Num of / | Numeric |
| Num of @ | Numeric | Num of # | Numeric |
| Num of numeric values | Numeric | Num of uppercase characters | Numeric |
| Num of lowercase characters | Numeric | Num of all uppercase characters | Numeric |
| Text language | Numeric | Word count | Numeric |
| Character count | Numeric | Sentence count | Numeric |
| Average word length | Numeric | Average sentence length | Numeric |
| Positive Sentiment Score | Numeric | Negative Sentiment Score | Numeric |
| Neutral Sentiment Score | Numeric | Compound Sentiment Score | Numeric |
| Person | Numeric | NORP | Numeric |
| FAC | Numeric | Organization | Numeric |
| GPE | Numeric | Location | Numeric |
| Product | Numeric | Event | Numeric |
| Work of Art | Numeric | Law | Numeric |
| Language | Numeric | Date | Numeric |
| Time | Numeric | Percent | Numeric |
| Money | Numeric | Quantity | Numeric |
| Cardinal | Numeric | Ordinal | Numeric |
| Text Polarity | Numeric | | |
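Several of the count-based features listed above can be computed directly from the raw text. The following is a minimal sketch, not the authors' implementation: the function name `surface_features`, the dictionary keys, the naive sentence splitting on `.`, `!`, and `?`, and the reading of "Num of all uppercase characters" as a count of all-uppercase words are assumptions made for illustration. The sentiment and named-entity features are omitted because they require external models.

```python
import re


def surface_features(text):
    """Compute simple count-based features from raw (uncleaned) text."""
    words = text.split()
    # Naive sentence split (assumption): break on runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_question_marks": text.count("?"),
        "num_slashes": text.count("/"),
        "num_at": text.count("@"),
        "num_hash": text.count("#"),
        "num_numeric_values": len(re.findall(r"\d+", text)),
        "num_uppercase_chars": sum(c.isupper() for c in text),
        "num_lowercase_chars": sum(c.islower() for c in text),
        # Assumed interpretation: words whose cased letters are all uppercase.
        "num_all_uppercase_words": sum(w.isupper() for w in words),
        "word_count": len(words),
        "char_count": len(text),
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    for name, value in surface_features("Is COVID-19 cured by #garlic? Ask @WHO now!").items():
        print(f"{name}: {value}")
```

Each news item maps to one such dictionary, which can then be turned into a numeric feature vector for a classifier.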
Figure 3. Classification report before feature extraction.
Performance of the ML algorithms before feature extraction.
| Classifier | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Training time | Testing time |
|---|---|---|---|---|---|---|
| Random forest classifier | 83.33 | 83.14 | 84.90 | 83.61 | 0.065862 | 0.383069 |
| AdaBoost classifier | 79.88 | 76.76 | 86.36 | 81.82 | 0.056415 | 0.347593 |
| Decision tree classifier | 67.81 | 70.51 | 62.50 | 66.26 | 0.003247 | 0.091995 |
| K-nearest neighbor classifier | 62.06 | 72.91 | 39.77 | 51.47 | 0.084975 | 0.001555 |
Figure 4. Training and testing time rate before feature extraction.
Figure 5. Classification report after feature extraction.
Performance of the ML algorithms after feature extraction.
| Classifier | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Training time | Testing time |
|---|---|---|---|---|---|---|
| Random forest classifier | 88.50 | 87.77 | 89.77 | 88.76 | 0.004685 | 0.011387 |
| AdaBoost classifier | 82.75 | 79.00 | 89.77 | 84.04 | 0.004750 | 0.053603 |
| Decision tree classifier | 77.58 | 72.47 | 89.77 | 80.20 | 0.002851 | 0.081209 |
| K-nearest neighbor classifier | 69.54 | 62.41 | 100 | 76.85 | 0.003293 | 0.008997 |
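As a sanity check on the table above, the F1-score is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The sketch below recomputes it for each row, assuming that the second, third, and fourth numeric columns are precision, recall, and F1-score in percent. That reading is an inference from the values themselves, and the helper name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table after feature extraction.
REPORTED = {
    "Random forest": (87.77, 89.77, 88.76),
    "AdaBoost": (79.00, 89.77, 84.04),
    "Decision tree": (72.47, 89.77, 80.20),
    "K-nearest neighbor": (62.41, 100.0, 76.85),
}

if __name__ == "__main__":
    for name, (p, r, reported) in REPORTED.items():
        print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

Under this reading, the recomputed values agree with the reported F1-scores to within rounding for all four classifiers.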
Figure 6. Training and testing time rate after feature extraction.