Masood Ghayoomi, Maryam Mousavian.
Abstract
The spread of fake news on social media has increased dramatically in recent years, and fake news detection systems have consequently received researchers' attention globally. The COVID-19 outbreak in 2019 and the worldwide epidemic that followed made the importance of this issue more apparent. As a result, many researchers have begun to collect English datasets and to study COVID-19 fake news detection. However, for many low-resource languages, including Persian, accurate tools for automatic COVID-19 fake news detection cannot be developed because annotated data for the task is lacking. In this article, we aim to develop a Persian corpus in the COVID-19 domain in which fake news is annotated, and to provide a model for detecting Persian COVID-19 fake news. Given the impressive advancement of multilingual pre-trained language models, cross-lingual transfer learning can be used to improve the generalization of models trained on low-resource language datasets. Accordingly, we use the state-of-the-art deep cross-lingual contextualized language model XLM-RoBERTa together with parallel convolutional neural networks to detect Persian COVID-19 fake news. Moreover, we transfer knowledge across domains to improve the results by using both the English COVID-19 dataset and the general-domain Persian fake news dataset. The combination of cross-lingual and cross-domain transfer learning outperformed the other models and beat the baseline significantly, by 2.39%.
Keywords: COVID‐19; contextualized text representation; deep neural network; fake news detection; transfer learning
Year: 2022 PMID: 35599852 PMCID: PMC9111484 DOI: 10.1111/exsy.13008
Source DB: PubMed Journal: Expert Syst ISSN: 0266-4720 Impact factor: 2.812
FIGURE 1. The overall architecture of the proposed framework
FIGURE 2. BERT pre‐training and fine‐tuning steps (Devlin et al., 2019)
FIGURE 3. The architecture of the deep neural model
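The abstract describes passing contextualized XLM-RoBERTa representations through parallel convolutional neural networks. The following is a minimal pure-Python sketch of the parallel-convolution idea only: several 1-D filters with different window widths read the same token sequence, each is max-pooled over time, and the pooled values are concatenated. The window widths, the single filter per branch, the toy weights, and all function names are assumptions made for illustration; the actual model would use XLM-RoBERTa embeddings and learned filters.

```python
from typing import List, Sequence

Vector = List[float]


def conv1d_max(tokens: Sequence[Vector], kernel: Sequence[Vector]) -> float:
    """Slide one filter over the token sequence, apply ReLU, max-pool over time."""
    width = len(kernel)
    if len(tokens) < width:
        raise ValueError("sequence shorter than the filter width")
    best = 0.0  # ReLU output is never below zero
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Dot product between the filter and the current window of embeddings.
        activation = sum(
            w * x
            for k_row, t_row in zip(kernel, window)
            for w, x in zip(k_row, t_row)
        )
        best = max(best, activation)
    return best


def parallel_cnn_features(tokens, kernels):
    """Run every filter on the same input and concatenate the pooled outputs."""
    return [conv1d_max(tokens, kernel) for kernel in kernels]


# Toy 2-dimensional "embeddings" for a 4-token sentence (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, -1.0]]
# One filter per branch, with assumed window widths 1, 2 and 3.
kernels = [
    [[1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
features = parallel_cnn_features(tokens, kernels)
print(features)
```

In a full classifier the concatenated feature vector would feed a dense layer with a sigmoid output; that part is omitted here.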
Statistics of the English COVID‐19 fake news dataset (Patwa et al., 2020)
| Split | Real documents | Real words | Fake documents | Fake words | Total documents | Total words |
|---|---|---|---|---|---|---|
| All data | 5600 | 34,583 | 5100 | 31,082 | 10,700 | 65,665 |
| Training | 3360 | 17,502 | 3060 | 16,396 | 6420 | 33,898 |
| Validation | 1120 | 8356 | 1020 | 7285 | 2140 | 15,641 |
| Test | 1120 | 8725 | 1020 | 7401 | 2140 | 16,126 |
Statistics of the general domain Persian fake news dataset (Samadi et al., 2021)
| Split | Real documents | Real words | Fake documents | Fake words | Total documents | Total words |
|---|---|---|---|---|---|---|
| All data | 1861 | 59,364 | 1860 | 37,921 | 3721 | 97,285 |
| Training | 1294 | 33,161 | 1311 | 20,174 | 2605 | 53,335 |
| Validation | 367 | 16,091 | 378 | 10,643 | 745 | 26,734 |
| Test | 200 | 10,112 | 171 | 7104 | 371 | 17,216 |
Statistics of the Persian COVID‐19 fake news dataset
| Split | Real documents | Real words | Fake documents | Fake words | Total documents | Total words |
|---|---|---|---|---|---|---|
| All data | 265 | 11,197 | 265 | 7599 | 530 | 18,796 |
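The three statistics tables imply each corpus's class balance and split proportions without stating them. The sketch below recomputes both from the document counts in the tables; the dictionary layout and the `summarize` helper are hypothetical names introduced only for this illustration.

```python
# Document counts (real, fake) copied from the three statistics tables.
DATASETS = {
    "English COVID-19": {
        "Training": (3360, 3060),
        "Validation": (1120, 1020),
        "Test": (1120, 1020),
    },
    "Persian general domain": {
        "Training": (1294, 1311),
        "Validation": (367, 378),
        "Test": (200, 171),
    },
    # Only the overall figures are given for this corpus.
    "Persian COVID-19": {"All data": (265, 265)},
}


def summarize(splits):
    """Return (total documents, fake share, {split: share of all documents})."""
    total = sum(real + fake for real, fake in splits.values())
    fake_total = sum(fake for _, fake in splits.values())
    shares = {name: (real + fake) / total for name, (real, fake) in splits.items()}
    return total, fake_total / total, shares


for name, splits in DATASETS.items():
    total, fake_share, shares = summarize(splits)
    split_text = ", ".join(f"{k} {v:.1%}" for k, v in shares.items())
    print(f"{name}: {total} documents, {fake_share:.1%} fake ({split_text})")
```

On these counts the English corpus is split 60/20/20 and the general-domain Persian corpus roughly 70/20/10, while the Persian COVID-19 corpus is exactly balanced between real and fake documents.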
Performance of the proposed model using various training and test sets
| Mode | Train language | Test language | Train domain | Test domain | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|---|---|---|---|
| Baseline | Persian | Persian | COVID‐19 | COVID‐19 | 75.98 | 64.97 | 69.74 |
| CLL | English | Persian | COVID‐19 | COVID‐19 | 53.76 | 89.06 | 67.05 |
| CDL | Persian | Persian | General | COVID‐19 | 56.62 | 93.58 | 70.55 |
| CLCDL | English + Persian | Persian | COVID‐19 + General | COVID‐19 | 62.96 | 85.27 | 72.13 |
Abbreviations: CDL, Cross‐Domain Learning; CLCDL, Cross‐Lingual and Cross‐Domain Learning; CLL, Cross‐Lingual Learning.
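The 2.39% gain quoted in the abstract can be read off the last results column of this table. The sketch below computes each mode's difference from the baseline in percentage points; it assumes that the last column is the F-measure, and the names `SCORES` and `gains_over_baseline` are hypothetical.

```python
# Last results column of the performance table, in % (assumed to be the F-measure).
SCORES = {
    "Baseline": 69.74,
    "CLL": 67.05,
    "CDL": 70.55,
    "CLCDL": 72.13,
}


def gains_over_baseline(scores, baseline="Baseline"):
    """Difference of each mode's score from the baseline, in percentage points."""
    reference = scores[baseline]
    return {
        mode: round(value - reference, 2)
        for mode, value in scores.items()
        if mode != baseline
    }


gains = gains_over_baseline(SCORES)
for mode, gain in gains.items():
    print(f"{mode}: {gain:+.2f} points")
```

Read this way, cross-lingual learning alone falls below the baseline, cross-domain learning alone is slightly above it, and only the combination yields the 2.39-point gain.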
The parameter values used to train the proposed model
| Parameter | Value |
|---|---|
| Maximum sequence length | 64 |
| Learning rate | 5e−5 |
| Epochs | 5 |
| Batch size | 64 |
| Optimizer | Adam |
| Loss | Binary cross entropy |
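To make these settings concrete, the sketch below stores them in a small configuration object, derives the number of weight updates they would imply, and spells out the binary cross entropy loss in plain Python. The field names, the `steps_per_epoch` helper, and the use of the English training split size (6420 documents) as the example input are assumptions for illustration; the source does not say which training set sizes were paired with these values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values copied from the parameter table; field names are hypothetical."""
    max_sequence_length: int = 64
    learning_rate: float = 5e-5
    epochs: int = 5
    batch_size: int = 64
    optimizer: str = "Adam"
    loss: str = "binary cross entropy"

    def steps_per_epoch(self, n_examples: int) -> int:
        # The last batch may be smaller than batch_size, hence the ceiling.
        return math.ceil(n_examples / self.batch_size)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


config = TrainingConfig()
# 6420 is the size of the English COVID-19 training split (illustrative input).
steps = config.steps_per_epoch(6420)
total_updates = steps * config.epochs
loss = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
print(steps, total_updates, round(loss, 4))
```

With a batch size of 64, a training set of that size would need 101 batches per epoch, or 505 updates over 5 epochs.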
Performance of the proposed model on the English COVID‐19 data and the Persian general domain data
| Dataset | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|
| English COVID‐19 | 94.22 | 99.12 | 96.61 |
| Persian general domain | 94.46 | 80.78 | 87.09 |
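The last value in each row is consistent with the harmonic mean of the precision and recall beside it. The sketch below performs that check; the function name `f_measure` and the `ROWS` layout are hypothetical, and treating the last column as the F-measure (F1) is an inference from the numbers rather than something this page states.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1), on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported last column), all in %, from the table above.
ROWS = {
    "English COVID-19": (94.22, 99.12, 96.61),
    "Persian general domain": (94.46, 80.78, 87.09),
}

for name, (precision, recall, reported) in ROWS.items():
    computed = round(f_measure(precision, recall), 2)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```

Scores that are averaged over several runs or folds need not satisfy this identity exactly, so small mismatches in other tables would not by themselves indicate an error.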
FIGURE 4. First example of Persian real news detected as fake
FIGURE 5. Second example of Persian real news detected as fake
FIGURE 6. First example of Persian fake news detected as real
FIGURE 7. Second example of Persian fake news detected as real