Andrea Stevens Karnyoto, Chengjie Sun, Bingquan Liu, Xiaolong Wang.
Abstract
Misinformation has become a serious threat to society, especially fake news concerning COVID-19. It spreads massively on the Internet and causes misunderstanding in national and global communities during the pandemic. Detecting misinformation at this scale is crucial and challenging, and people have struggled against the phenomenon for a long time. Our research concerns detecting fake news related to COVID-19 using text augmentation [random deletion (RD), random insertion (RI), random swap (RS), and synonym replacement (SR)] and several graph neural network models [graph convolutional network (GCN), graph attention network (GAT), and GraphSAGE (SAmple and aggreGatE)]. We constructed the nodes and edges of the graph, with word-word and word-document connections, as input to the graph neural networks. We then trained the models on different amounts of training data and compared the accuracy obtained by each model. For our fake news detection task, training accuracy increased steadily for the GCN, GAT, and SAGE models from the first epoch to the last. The results showed that the difference in precision among the GNN models (GCN, GAT, and SAGE) was insignificant.
Keywords: COVID-19 fake news detection; Graph neural network; Text classification; Text augmentations
Year: 2022 PMID: 35035595 PMCID: PMC8742573 DOI: 10.1007/s13042-021-01503-5
Source DB: PubMed Journal: Int J Mach Learn Cybern ISSN: 1868-8071 Impact factor: 4.377
Example sentences generated by easy data augmentation
| Operation | Sentence |
|---|---|
| None | Graph classifications have been solved a significant task and achieved excellent performance |
| Synonym replacement (SR) | Graph classifications have been solved a significant job and achieved magnificent performance |
| Random insertion (RI) | Graph classifications have been solved a significant task and achieved excellent performance sophisticated |
| Random swap (RS) | Graph performance have been solved a significant task and achieved excellent classifications |
| Random deletion (RD) | Graph classifications have been solved a task and achieved excellent performance |
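The four operations in the table above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `SYNONYMS` table is a hypothetical stand-in for a real thesaurus such as WordNet, and the function names and signatures are assumptions made for this example.

```python
import random

# Hypothetical toy synonym table; a real pipeline would query a thesaurus.
SYNONYMS = {
    "task": ["job"],
    "excellent": ["magnificent"],
    "significant": ["important"],
}

def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    """Insert n synonyms of words in the sentence at random positions."""
    out = list(words)
    pool = [s for w in words for s in SYNONYMS.get(w, [])]
    for _ in range(n):
        if not pool:
            break
        out.insert(rng.randint(0, len(out)), rng.choice(pool))
    return out

def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
sentence = ("Graph classifications have been solved a significant task "
            "and achieved excellent performance").split()
augmented = {
    "SR": synonym_replacement(sentence, 2, rng),
    "RI": random_insertion(sentence, 1, rng),
    "RS": random_swap(sentence, 1, rng),
    "RD": random_deletion(sentence, 0.1, rng),
}
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing models trained on the same augmented set.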
Data distribution for the Constraint @ AAAI2021 - COVID19 fake news detection dataset
| Data | Real | Fake | Total | Unique Word |
|---|---|---|---|---|
| Train | 3360 | 3060 | 6420 | 30,046 |
| Validation | 1120 | 1020 | 2140 | 13,697 |
| Testing | 1120 | 1020 | 2140 | 14,121 |
Examples of real and fake posts
| Label | Post |
|---|---|
| Real | This #FourthOfJuly weekend if you choose to spend time outdoors at an event or gathering stay 6 ft apart & wear a cloth face cover to slow the spread of #COVID19. Learn more at |
| Real | We launched the #COVID19 Solidarity Response Fund which has so far mobilized $225 + M from more than 563,000 individuals companies & philanthropies. In addition we mobilized $1 + billion from Member States & other generous to support countries-@DrTedros |
| Fake | @realDonaldTrump has shifted his focus at different moments in the #CoronavirusOutbreak. We updated our running timeline of his response to the virus. |
| Fake | RT @EllenCutch: Coronavirus misinformation is moving offline. A reddit user posted this flyer to the site and told us it had been delive… |
Recommended usage parameters
| N_train | α | n_aug |
|---|---|---|
| 500 | 0.05 | 16 |
| 2000 | 0.05 | 8 |
| 5000 | 0.1 | 4 |
| More | 0.1 | 4 |
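The table above can be read as a lookup from training-set size to augmentation settings. The sketch below is one hedged reading: treating each listed size as an upper bound is an assumption, as are both function names. The rule that the number of changed words is `max(1, int(alpha * length))` follows common EDA practice and is not stated in this record.

```python
def recommended_eda_params(n_train):
    """Return (alpha, n_aug) for a training set with n_train samples.

    Each size in the table is treated as an upper bound (an assumption).
    """
    if n_train <= 500:
        return 0.05, 16
    if n_train <= 2000:
        return 0.05, 8
    return 0.1, 4  # the 5000 and "More" rows share the same settings

def n_changes(alpha, sentence_length):
    """Number of words to change per operation (assumed rule, at least 1)."""
    return max(1, int(alpha * sentence_length))

alpha, n_aug = recommended_eda_params(6420)  # size of the training split
```

Under this reading, the 6420-sample training split falls in the "More" row, giving four augmented sentences per original sentence.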
Fig. 1 Diagram of text GCN
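A text graph with word-document and word-word edges could be built along the following lines. This is a sketch under stated assumptions, not the authors' code: the edge weights (TF-IDF for word-document edges, positive pointwise mutual information over sliding windows for word-word edges) follow the usual Text GCN formulation, and the window size, the `+ 1.0` IDF smoothing, whitespace tokenization, and the function name `build_text_graph` are all choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations

def build_text_graph(docs, window=3):
    """Return (nodes, edges); edges maps a (u, v) pair to a positive weight."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    doc_nodes = [("doc", i) for i in range(len(docs))]
    edges = {}

    # Word-document edges, weighted by TF-IDF (assumed weighting).
    df = Counter(w for doc in tokenized for w in set(doc))
    for i, doc in enumerate(tokenized):
        for w, count in Counter(doc).items():
            idf = math.log(len(docs) / df[w]) + 1.0
            edges[(doc_nodes[i], w)] = (count / len(doc)) * idf

    # Collect sliding windows over every document.
    windows = []
    for doc in tokenized:
        if len(doc) <= window:
            windows.append(doc)
        else:
            windows.extend(doc[j:j + window]
                           for j in range(len(doc) - window + 1))

    # Word-word edges, weighted by positive PMI (assumed weighting).
    word_count = Counter(w for win in windows for w in set(win))
    pair_count = Counter(pair for win in windows
                         for pair in combinations(sorted(set(win)), 2))
    n_win = len(windows)
    for (a, b), c in pair_count.items():
        pmi = math.log(c * n_win / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return doc_nodes + vocab, edges

docs = ["covid vaccine works", "covid hoax spreads", "masks slow covid spread"]
nodes, edges = build_text_graph(docs)
```

Document nodes are tuples such as `("doc", 0)` so they cannot collide with a vocabulary word. In this toy corpus "covid" appears in every window, so its PMI with any other word is not positive and it gets no word-word edges.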
Dataset statistics after preprocessing and augmentation
| Dataset | Total train rows and validation | Total tokens | Sentence statistics: min number of tokens | Sentence statistics: max number of tokens | Sentence statistics: average number of tokens |
|---|---|---|---|---|---|
| Without augmentation | |||||
| Train-30 | 2569 | 4709 | 2 | 58 | 12.55 |
| Train-50 | 4281 | 6421 | 2 | 248 | 12.94 |
| Train-80 | 6848 | 8988 | 2 | 259 | 13.34 |
| Train-100 | 8558 | 10,689 | 2 | 263 | 13.48 |
| With augmentation | |||||
| Train-30 + Aug | 12,845 | 14,985 | 2 | 63 | 14.16 |
| Train-50 + Aug | 21,405 | 23,545 | 2 | 317 | 14.28 |
| Train-80 + Aug | 34,240 | 36,380 | 2 | 314 | 14.38 |
| Train-100 + Aug | 42,790 | 44,930 | 2 | 317 | 14.39 |
Precision and F1-score of testing results (LSTM and CNN are non-graph networks; GCN, GAT, and GraphSAGE are graph neural networks)
| Dataset | LSTM Prec | LSTM F1 | CNN Prec | CNN F1 | GCN Prec | GCN F1 | GAT Prec | GAT F1 | GraphSAGE Prec | GraphSAGE F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Train-30 | 0.8401 | 0.8672 | 0.8017 | 0.7997 | 0.8947 | 0.8946 | 0.8925 | 0.8922 | 0.9015 | 0.9017 |
| Train-50 | 0.893 | 0.8764 | 0.8397 | 0.8372 | 0.9012 | 0.9006 | 0.9012 | 0.9012 | 0.9096 | 0.9096 |
| Train-80 | 0.8875 | 0.8916 | 0.9019 | 0.9619 | 0.9136 | 0.9133 | 0.9117 | 0.9114 | 0.8944 | 0.8919 |
| Train-100 | 0.8957 | 0.8893 | 0.8909 | 0.8906 | 0.9137 | 0.9133 | 0.9058 | 0.9059 | 0.9204 | 0.9204 |
| Train-30 + Aug | 0.8681 | 0.8679 | 0.8793 | 0.8789 | 0.8975 | 0.8963 | 0.8969 | 0.8964 | 0.9057 | 0.9053 |
| Train-50 + Aug | 0.8868 | 0.8867 | 0.9068 | 0.9068 | 0.9053 | 0.9043 | 0.9025 | 0.9015 | 0.9129 | 0.9123 |
| Train-80 + Aug | 0.9160 | 0.9151 | 0.9152 | 0.9142 | 0.9090 | 0.9065 | ||||
| Train-100 + Aug | 0.9141 | 0.9141 | 0.9073 | 0.9073 | ||||||
Bold values indicate the highest performance for each model
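For reference, precision and F1 for one class can be computed from predictions as sketched below. Which class counts as positive and whether the reported scores were averaged across classes are not stated here, so the `positive="fake"` default and the single-class computation are assumptions, as is the function name.

```python
def precision_recall_f1(y_true, y_pred, positive="fake"):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "fake", "real"]
scores = precision_recall_f1(y_true, y_pred)
```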
Comparison between augmentation operations (precision)
| Operation | LSTM | CNN | GCN | GAT | SAGE |
|---|---|---|---|---|---|
| Random deletion (RD) | 0.9118 | 0.9091 | 0.9142 | 0.9148 | 0.9249 |
| Random insertion (RI) | 0.9143 | 0.9121 | 0.9171 | 0.9176 | 0.9277 |
| Random swap (RS) | 0.9128 | 0.9108 | 0.9160 | 0.9162 | 0.9257 |
| Synonym replacement (SR) | 0.9098 | 0.9075 | 0.9125 | 0.9129 | 0.9224 |
| RD + RI + RS + SR | 0.9150 | 0.9130 | 0.9183 | 0.9183 | 0.9279 |
Fig. 2 GCN, GAT, and GraphSAGE training accuracies
Fig. 3 Comparison of training accuracy for GCN, GAT, and SAGE on the Train-100 + Aug dataset
Effect of sentence length (in words) on classification accuracy
| Model | ≤ 15 | 16–30 | 31–45 | ≥ 46 |
|---|---|---|---|---|
| GCN | 0.9182 | 0.8723 | 0.6000 | |
| GAT | 0.9174 | 0.8723 | 0.6000 | |
| SAGE | 0.9265 | 0.8936 | 0.6000 | |
Bold values indicate the highest accuracy for each sentence length group
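A per-length breakdown like the one in the table above could be produced roughly as follows. This is an illustrative sketch only: counting words by whitespace splitting and the function name `accuracy_by_length` are assumptions, and groups with no samples are reported as `None`.

```python
def accuracy_by_length(texts, y_true, y_pred):
    """Accuracy per sentence-length group, using the groups in the table."""
    groups = {"<=15": [], "16-30": [], "31-45": [], ">=46": []}
    for text, truth, pred in zip(texts, y_true, y_pred):
        n_words = len(text.split())  # whitespace tokenization (assumption)
        if n_words <= 15:
            key = "<=15"
        elif n_words <= 30:
            key = "16-30"
        elif n_words <= 45:
            key = "31-45"
        else:
            key = ">=46"
        groups[key].append(truth == pred)
    # Empty groups get None rather than a misleading 0.0.
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

texts = ["a b c", " ".join(["w"] * 20), " ".join(["w"] * 50), "x y"]
y_true = ["fake", "real", "fake", "real"]
y_pred = ["fake", "fake", "fake", "fake"]
by_length = accuracy_by_length(texts, y_true, y_pred)
```

Groups with very few samples give coarse accuracy values, so small differences between models in those groups are hard to interpret.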
Fig. 4 The most common terms detected by our models in the Constraint @ AAAI2021 - COVID19 fake news detection dataset