Abstract
Fake news is a serious problem in every society. It must be detected, and its sharing stopped, before it causes further damage to the country. Spotting fake news is challenging because of its dynamic nature. In this research, we propose a framework for robust Thai fake news detection. The framework comprises three main modules: information retrieval, natural language processing, and machine learning. The research has two phases: data collection and machine learning model building. In the data collection phase, we obtained data from Thai online news websites using web-crawler-based information retrieval, and we analyzed the data using natural language processing techniques to extract useful features from the web data. For comparison, we selected several well-known machine learning classification models: Naïve Bayes, Logistic Regression, K-Nearest Neighbor, Multilayer Perceptron, Support Vector Machine, Decision Tree, Random Forest, Rule-Based Classifier, and Long Short-Term Memory. The comparison on the test set showed that Long Short-Term Memory was the best model, and we deployed an automatic online fake news detection web application.
Keywords: Fake news detection; Information retrieval; Machine learning; Natural language processing
Year: 2021 PMID: 34458858 PMCID: PMC8382114 DOI: 10.1007/s42979-021-00775-6
Source DB: PubMed Journal: SN Comput Sci ISSN: 2661-8907
Fig. 1 Fake news detection framework
Fig. 2 Web crawler-based information retrieval
Fig. 3 Natural language processing framework
Fig. 4 Fake news detection with machine learning framework
Confusion matrix
| | Predicted classes |
|---|---|
| Actual classes | … |
Sample data collected
| Labels | No. samples |
|---|---|
| Fake | 13,816 |
| Real | 13,816 |
| Suspicious | 13,816 |
| Total | 41,448 |
Training, validation, and test sets
| Data sets | No. samples | Ratio |
|---|---|---|
| Training | 20,723 | 0.50 |
| Validation | 8,290 | 0.20 |
| Test | 12,435 | 0.30 |
| Total | 41,448 | 1.00 |
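The split sizes in the table can be reproduced from the ratios with simple arithmetic. The sketch below is one possible way to do it, assuming the validation and test sizes are rounded up and the training set takes the remainder; that rounding rule is an assumption that happens to match the reported counts, and the record does not say how the authors split the data or whether the split was stratified. The function names and the seed are hypothetical.

```python
import math
import random
from typing import List, Tuple


def split_sizes(n: int, val_ratio: float, test_ratio: float) -> Tuple[int, int, int]:
    """Sizes of (train, validation, test); validation and test are rounded up (assumed rule)."""
    n_test = math.ceil(n * test_ratio)
    n_val = math.ceil(n * val_ratio)
    n_train = n - n_val - n_test
    return n_train, n_val, n_test


def split_indices(n: int, val_ratio: float = 0.20, test_ratio: float = 0.30,
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and cut them into three disjoint parts."""
    n_train, n_val, _ = split_sizes(n, val_ratio, test_ratio)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(41_448)
print(len(train), len(val), len(test))
```

This plain shuffle does not balance the three labels across the parts; a stratified split would be needed to guarantee that.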
Feature data correlation
| | Score fake | Score real | Sim matched | Domain fake | Domain real | Fake | Real | Suspicious |
|---|---|---|---|---|---|---|---|---|
| Score fake | 1.00 | 0.07 | 0.35 | 0.91 | 0.15 | 0.70 | −0.43 | −0.35 |
| Score real | 0.07 | 1.00 | 0.16 | 0.10 | 0.86 | 0.04 | 0.16 | −0.22 |
| Sim matched | 0.35 | 0.16 | 1.00 | 0.37 | 0.22 | 0.27 | 0.30 | −0.65 |
| Domain fake | 0.91 | 0.10 | 0.37 | 1.00 | 0.20 | 0.76 | −0.47 | −0.38 |
| Domain real | 0.15 | 0.86 | 0.22 | 0.20 | 1.00 | 0.11 | 0.11 | −0.25 |
| Fake | 0.70 | 0.04 | 0.27 | 0.76 | 0.11 | 1.00 | −0.62 | −0.49 |
| Real | −0.43 | 0.16 | 0.30 | −0.47 | 0.11 | −0.62 | 1.00 | −0.38 |
| Suspicious | −0.35 | −0.22 | −0.65 | −0.38 | −0.25 | −0.49 | −0.38 | 1.00 |
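A correlation table like the one above is typically built by computing a pairwise correlation coefficient between every pair of feature and one-hot label columns. The sketch below assumes the Pearson coefficient, which the record does not confirm, and uses a tiny invented data set; the column names `score_fake`, `fake`, `real`, and `suspicious` only mirror the table headers.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("columns must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented toy data: one score feature plus one-hot encoded labels.
columns = {
    "score_fake": [0.9, 0.8, 0.1, 0.2, 0.4, 0.3],
    "fake":       [1, 1, 0, 0, 0, 0],
    "real":       [0, 0, 1, 1, 0, 0],
    "suspicious": [0, 0, 0, 0, 1, 1],
}

corr = correlation_matrix(columns)
for name, row in corr.items():
    print(name, " ".join(f"{value:+.2f}" for value in row.values()))
```

Because the label columns are one-hot encoded, any two of them are negatively correlated by construction, which is why the label-to-label entries in such a table are negative.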
Fig. 5 The scatter joint plot of clustered feature data
The details of LSTM model settings
| Layer type | Activation | Output shape | No. params |
|---|---|---|---|
| Text vectorization | | (None, None) | 0 |
| Embedding | | (None, None, 128) | 589,568 |
| Bidirectional LSTM | | (None, 256) | 263,168 |
| Dense | ReLU | (None, 64) | 16,448 |
| Dense | ReLU | (None, 256) | 16,640 |
| Dropout | | (None, 256) | 0 |
| Dense | ReLU | (None, 256) | 65,792 |
| Dropout | | (None, 256) | 0 |
| Dense | Softmax | (None, 3) | 771 |
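The parameter counts in the table can be checked with the standard formulas for embedding, LSTM, and dense layers. The sketch below is arithmetic only, not the authors' model code. Two values are inferred rather than stated in the record: a vocabulary size of 4,606 (589,568 / 128) and 128 LSTM units per direction (half of the 256-wide bidirectional output).

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (input_dim * units + units * units + units)


def bidirectional_lstm_params(input_dim: int, units: int) -> int:
    """Forward and backward LSTMs have separate weights."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim: int, output_dim: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return input_dim * output_dim + output_dim


EMBEDDING_DIM = 128             # from the Embedding output shape
VOCAB_SIZE = 589_568 // 128     # inferred assumption: 4606
LSTM_UNITS = 256 // 2           # inferred assumption: 128 per direction

layers = [
    ("Embedding", embedding_params(VOCAB_SIZE, EMBEDDING_DIM)),
    ("Bidirectional LSTM", bidirectional_lstm_params(EMBEDDING_DIM, LSTM_UNITS)),
    ("Dense 256 -> 64", dense_params(256, 64)),
    ("Dense 64 -> 256", dense_params(64, 256)),
    ("Dense 256 -> 256", dense_params(256, 256)),
    ("Dense 256 -> 3 (softmax)", dense_params(256, 3)),
]

for name, count in layers:
    print(f"{name}: {count:,}")
print(f"Total: {sum(count for _, count in layers):,}")
```

Under these assumptions every computed count agrees with the corresponding row of the table; the text vectorization and dropout layers have no trainable parameters.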
Test performance of RBC
| | Precision | Recall | F1-score |
|---|---|---|---|
| Fake | 0.98 | 0.95 | 0.96 |
| Real | 0.77 | 0.97 | 0.86 |
| Suspicious | 0.97 | 0.82 | 0.89 |
| Weighted avg | 0.92 | 0.90 | 0.91 |
| Accuracy | 0.90 | | |
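The third numeric column in the per-class rows above is consistent with the F1-score, the harmonic mean of precision and recall. The short check below recomputes it from the RBC precision and recall values; because those inputs are already rounded to two decimals, agreement is only expected after rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the RBC table.
RBC_ROWS = {
    "Fake": (0.98, 0.95),
    "Real": (0.77, 0.97),
    "Suspicious": (0.97, 0.82),
}

for label, (precision, recall) in RBC_ROWS.items():
    print(f"{label}: F1 = {f1_score(precision, recall):.2f}")
# Prints 0.96, 0.86, and 0.89, matching the table's third column.
```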
Test performance of SVM
| | Precision | Recall | F1-score |
|---|---|---|---|
| Fake | 0.99 | 0.99 | 0.99 |
| Real | 0.89 | 0.99 | 0.94 |
| Suspicious | 0.99 | 0.90 | 0.94 |
| Weighted avg | 0.96 | 0.96 | 0.96 |
| Accuracy | 0.96 | | |
Test performance of RF
| | Precision | Recall | F1-score |
|---|---|---|---|
| Fake | 0.99 | 1.00 | 1.00 |
| Real | 0.90 | 0.99 | 0.94 |
| Suspicious | 1.00 | 0.91 | 0.95 |
| Weighted avg | 0.97 | 0.96 | 0.96 |
| Accuracy | 0.96 | | |
Test performance of LSTM
| | Precision | Recall | F1-score |
|---|---|---|---|
| Fake | 1.00 | 1.00 | 1.00 |
| Real | 1.00 | 1.00 | 1.00 |
| Suspicious | 1.00 | 1.00 | 1.00 |
| Weighted avg | 1.00 | 1.00 | 1.00 |
| Accuracy | 1.00 | | |
Machine learning model comparisons
| Metrics / Models | NB | LR | MLP | SVM | DT | RF | KNN | RBC | LSTM |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.78 | 0.92 | 0.96 | 0.96 | 0.92 | 0.96 | 0.93 | 0.90 | 1.00 |
| Precision | 0.85 | 0.92 | 0.96 | 0.96 | 0.92 | 0.97 | 0.93 | 0.92 | 1.00 |
| Recall | 0.78 | 0.92 | 0.96 | 0.96 | 0.92 | 0.96 | 0.93 | 0.90 | 1.00 |
| F1-score | 0.79 | 0.92 | 0.96 | 0.96 | 0.92 | 0.96 | 0.93 | 0.91 | 1.00 |
Fig. 6 Test accuracy box plot of machine learning
Fig. 7 The automatic online Thai fake news detection
Fig. 8 The automatic online Thai fake news detection