Minyu Wan, Qi Su, Rong Xiang, Chu-Ren Huang.
Abstract
The rampant spread of the COVID-19 infodemic has been almost simultaneous with the outbreak of the pandemic. Many concerted efforts have been made to mitigate its negative effect on information credibility and data legitimacy. Existing work mainly focuses on fact-checking algorithms or multi-class labeling models that are less aware of the intrinsic characteristics of the language. Nor does it discuss how such representations can account for the common psycho-socio-behavior of information consumers. This work takes a data-driven analytical approach to (1) describe the prominent lexical and grammatical features of COVID-19 misinformation; (2) interpret the underlying (psycho-)linguistic triggers in terms of sentiment, power and activity based on the affective control theory; and (3) study the feature indexing for anti-infodemic modeling. The results show distinct language generalization patterns in misinformation, which favors evaluative terms and multimedia devices in delivering a negative sentiment. Such appeals are effective in arousing people's sympathy toward the vulnerable community and in fomenting their spreading behavior.
Keywords: COVID-19 Infodemic; Evaluation–potency–activity; Information credibility; Linguistic features; Misinformation
Year: 2022 PMID: 35730040 PMCID: PMC9194350 DOI: 10.1007/s41060-022-00339-8
Source DB: PubMed Journal: Int J Data Sci Anal
Fig. 1 The triple-dimension paradigm of understanding and combating infodemic
Fig. 2 An example of COVID-19 myth (with the debunked fact) from MythBusters WHO
Summary of truth-labeled datasets related to COVID-19
| Dataset | Size | Language | Modality | Class |
|---|---|---|---|---|
| COVID-19 FND | 10969 fake, 4298 true headlines | English | News headlines, content | 2 |
| CoAID | 4251 articles, 296,000 comments, 926 posts | English | News title, content, metadata | 2 |
| ReCOVery | 2029 articles, 140,820 tweets, 93,761 comments | English | Multi-modal information | 2 |
| Covid19-misinfo-data | 142 science claims, 340 politifact claims | English | Claims | 2 |
| CMU-MisCOV19 | 4573 tweets, 3629 users information | English | Tweets, users interactions | 17 |
| Covid-HeRA | 61,286 tweets, 84,545 tokens | English | Tweets | 5 |
| LitCovid | 181,848 biomedical articles | English | Title, abstract, keywords, metadata | 7 |
| COVID-19 rumor dataset | 4129 news rumors, 2705 tweets | English | Rumors, tweets | 12 |
| COVIDLIES | 6591 tweets, 62 claims | English | Wikipedia, tweets | 3 |
| FibVID | 1353 claims, 221,253 tweets | English | Claim, propagation, user information | 4 |
| ArCOV19-Rumors | 138 claims, 9.4K tweets | Arabic | Tweets, propagation networks | 3 |
| CHECKED | 344 fake, 1776 true microblogs | Chinese | Text, user engagement, metadata | 2 |
| FakeCovid | 5182 news articles | Multilingual | Article, source, title, date, metadata | 2 |
Basic statistics of the CovMythFact dataset
| | Token | Sentence | Type | Lemma | TTR (%) |
|---|---|---|---|---|---|
| Myths | 79,919 | 5000 | 10,252 | 7364 | 14.25 |
| Facts | 52,325 | 5000 | 9786 | 6835 | 20.46 |
| Total | 132,244 | 10,000 | 20,038 | 14,199 | 18.22 |
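The TTR column reports a type-token ratio, that is, the share of distinct word forms among all tokens. A minimal sketch of the usual definition follows. The tokenizer, the lowercasing step, and the sample sentence are assumptions made for illustration; the authors' preprocessing is not described here, so this sketch is not expected to reproduce the figures in the table.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct tokens / all tokens, as a percentage."""
    # Assumption: lowercase, word-like tokens; internal hyphens and
    # apostrophes are kept so that "covid-19" counts as one token.
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


# Hypothetical sample sentence, not taken from the CovMythFact dataset.
sample = (
    "Drinking hot water does not cure the virus "
    "and hot tea does not cure it either"
)
print(round(type_token_ratio(sample), 2))
```

Because TTR falls as a text gets longer, comparisons between corpora of different sizes (here, myths versus facts) are usually read with that length effect in mind.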
Fig. 3 Sentence length distribution
Fig. 4 Word average length distribution
Distinct lemmas for myths
| Rank | Lemma | DP1 | Rank | Lemma | DP1 | Rank | Lemma | DP1 | Rank | Lemma | DP1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | COVID-19 | 0.92 | 21 | Image | 0.36 | 41 | Spread | 0.27 | 61 | Victim | 0.23 |
| 2 | Video | 0.89 | 22 | Time | 0.36 | 42 | Cause | 0.27 | 62 | Brazilian | 0.23 |
| 3 | Show | 0.70 | 23 | Wuhan | 0.35 | 43 | Woman | 0.27 | 63 | Social | 0.23 |
| 4 | People | 0.66 | 24 | Police | 0.35 | 44 | Drink (v.) | 0.27 | 64 | Viral | 0.23 |
| 5 | Lockdown | 0.54 | 25 | Prevent | 0.35 | 45 | Create | 0.26 | 65 | Brazil | 0.23 |
| 6 | Government | 0.52 | 26 | Post | 0.35 | 46 | Bill | 0.26 | 66 | Street | 0.23 |
| 7 | China | 0.50 | 27 | Indian | 0.34 | 47 | Mask | 0.26 | 67 | Muslim | 0.23 |
| 8 | Cure (v.) | 0.47 | 28 | Say | 0.34 | 48 | | 0.26 | 68 | Picture | 0.23 |
| 9 | Novel | 0.46 | 29 | Hospital | 0.33 | 49 | State | 0.26 | 69 | Outbreak | 0.23 |
| 10 | Claim (n.) | 0.45 | 30 | Minister | 0.33 | 50 | Multiple | 0.25 | 70 | Hot | 0.22 |
| 11 | Infect | 0.45 | 31 | Italy | 0.32 | 51 | | 0.25 | 71 | Body | 0.22 |
| 12 | Share | 0.44 | 32 | Thousand | 0.32 | 52 | Message | 0.25 | 72 | Prime | 0.22 |
| 13 | | 0.44 | 33 | Man | 0.31 | 53 | Patient | 0.25 | 73 | French | 0.22 |
| 14 | India | 0.44 | 34 | Doctor | 0.30 | 54 | Use | 0.25 | 74 | Announce | 0.22 |
| 15 | Chinese | 0.44 | 35 | Virus | 0.30 | 55 | Country | 0.25 | 75 | Lemon | 0.22 |
| 16 | Photo | 0.42 | 36 | 5G | 0.29 | 56 | Dead | 0.24 | 76 | Ministry | 0.22 |
| 17 | President | 0.41 | 37 | Gate | 0.28 | 57 | Media | 0.24 | 77 | View (n.) | 0.21 |
| 18 | Kill | 0.41 | 38 | Quarantine | 0.28 | 58 | Food | 0.24 | 78 | Spanish | 0.21 |
| 19 | Water | 0.39 | 39 | Italian | 0.28 | 59 | Spain | 0.24 | 79 | Citizen | 0.21 |
| 20 | Die | 0.38 | 40 | Cure (n.) | 0.27 | 60 | Kid | 0.23 | 80 | Tea | 0.20 |
Distinct lemmas for facts
| Rank | Lemma | DP2 | Rank | Lemma | DP2 | Rank | Lemma | DP2 |
|---|---|---|---|---|---|---|---|---|
| 1 | Coronavirus | 0.91 | 21 | FDA | 0.26 | 41 | Trial | 0.22 |
| 2 | May | 0.55 | 22 | Consideration | 0.26 | 42 | Contact (n.) | 0.22 |
| 3 | Guidance | 0.34 | 23 | Clinical | 0.24 | 43 | Symptom | 0.22 |
| 4 | Test (v.) | 0.32 | 24 | Heart | 0.24 | 44 | Nursing | 0.22 |
| 5 | | 0.32 | 25 | Commentary | 0.24 | 45 | American | 0.22 |
| 6 | Study | 0.32 | 26 | Help | 0.24 | 46 | Toolkit | 0.22 |
| 7 | Case | 0.32 | 27 | Interim | 0.24 | 47 | Community | 0.21 |
| 8 | Know | 0.31 | 28 | NIH | 0.24 | 48 | Chicago | 0.21 |
| 9 | Could | 0.30 | 29 | Antibody | 0.24 | 49 | Plan | 0.21 |
| 10 | Pandemic | 0.30 | 30 | Adult | 0.23 | 50 | Tip | 0.21 |
| 11 | SARS-CoV-2 | 0.29 | 31 | Daily | 0.23 | 51 | Resource | 0.21 |
| 12 | Response | 0.29 | 32 | Update | 0.23 | 52 | Researcher | 0.20 |
| 13 | Healthcare | 0.28 | 33 | Clinic | 0.23 | 53 | Remdesivir | 0.20 |
| 14 | CDC | 0.28 | 34 | Facility | 0.23 | 54 | Cancer | 0.20 |
| 15 | Test (n.) | 0.28 | 35 | Need | 0.23 | 55 | State | 0.20 |
| 16 | Risk | 0.27 | 36 | Reopen | 0.22 | 56 | Question | 0.20 |
| 17 | U.S. | 0.27 | 37 | Trace | 0.22 | 57 | Strategy | 0.20 |
| 18 | Worker | 0.26 | 38 | Higher | 0.22 | 58 | Support | 0.20 |
| 19 | Severe | 0.26 | 39 | Data | 0.22 | 59 | Expert | 0.20 |
| 20 | Care | 0.26 | 40 | Setting | 0.22 | 60 | Information | 0.20 |
CDC is the acronym of “Centers for Disease Control and Prevention” as exemplified in
“The CDC now forecasts 100,000 US coronavirus deaths by June 1”
FDA is the acronym of “U.S. Food and Drug Administration” as exemplified in
“Any Potential COVID-19 Vaccine Will Have to Pass These FDA Requirements”
NIH is the acronym of “National Institutes of Health” as exemplified in
“NIH scientists discover key pathway in lysosomes that coronaviruses use to exit cells”
Fig. 5 EPA indexes of distinct words in the two groups. Words are displayed in descending order of the major dimension. Words with negative scores are highlighted in bold. Words that are dominant in the fact group are in italics. The gradient colors refer to different degrees of weights in each dimension of the words
Fig. 6 Word sketch difference analysis of ‘COVID-19’ versus ‘SARS-CoV-2’
Fig. 7 Distributions of parts of speech between myths and facts
Fig. 8 Top concept pairs in myths
Binomial logistic regression results for predicting information credibility
| Variable | Coefficient | S.E. | z value | Sig. |
|---|---|---|---|---|
| (Intercept) | 0.148425 | 1.782931 | 0.083 | 0.934 |
| Null | 0.021500 | 0.003140 | | 0.684 |
| Noun | 0.042980 | 0.003021 | 14.228 | 0.012* |
| Verb | 0.004076 | | | 0.013* |
| Adj | 0.021647 | 0.004294 | 5.041 | 0.0463 |
| Adv | 0.087086 | 0.007063 | 12.330 | 0.262 |
| Content | 0.011914 | | | 0.427 |
| Function | 0.018794 | | | 0.012 |
| s_len | 0.107455 | | | 0.020* |
| w_len | 0.135527 | 0.034277 | 3.954 | 0.007* |
| TTR | 0.018334 | | | 0.583 |
| E_avg | 0.429951 | 0.017336 | 24.801 | < 2e−16*** |
| P_avg | 0.690477 | 0.027839 | 24.803 | 0.002** |
| A_avg | 0.022488 | | | 0.021* |
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
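In the complete rows of the table above, the value that follows the standard error equals the coefficient divided by that standard error, which is the Wald z statistic. The sketch below shows how a single row can be read: it recomputes z, derives a two-sided p-value from the standard normal distribution, and converts the coefficient into an odds ratio. The helper name `wald_summary` is hypothetical, and this is only an illustration of the arithmetic, not the authors' code. How the outcome is coded (myth versus fact) is not stated here, so the direction of the odds ratio is left open.

```python
import math


def wald_summary(coef: float, se: float) -> dict:
    """Summarize one logistic-regression coefficient."""
    z = coef / se                            # Wald z statistic
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p-value under N(0, 1)
    odds_ratio = math.exp(coef)              # change in odds per unit increase
    return {"z": z, "p": p, "odds_ratio": odds_ratio}


# Coefficient and standard error copied from the E_avg row of the table.
e_avg = wald_summary(0.429951, 0.017336)
print(round(e_avg["z"], 3))
print(e_avg["p"] < 2e-16)
print(round(e_avg["odds_ratio"], 2))
```

Under this reading, a one-unit increase in the averaged evaluation score multiplies the odds of the modeled class by roughly 1.5, holding the other predictors fixed.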
Evaluation results on machine learning classifiers
| Features | LR | SVM | RFC |
|---|---|---|---|
| BOW | 0.76 | 0.77 | 0.75 |
| POS | 0.79 | 0.80 | 0.80 |
| W2V | 0.91 | 0.92 | 0.88 |
| EPA | 0.85 | 0.86 | 0.84 |
| W2V+E | 0.94 | 0.95 | 0.91 |
| W2V+P | 0.92 | 0.93 | 0.90 |
| W2V+A | 0.91 | 0.91 | 0.89 |
| W2V+EPA | 0.95 | 0.95 | 0.92 |
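The combined settings in the table (W2V+E, W2V+P, W2V+A, W2V+EPA) suggest that affective scores are appended to an embedding-based representation before classification. The sketch below illustrates that idea for the W2V+EPA case by averaging evaluation, potency and activity scores over a sentence and concatenating them with a sentence vector. The lexicon entries, their values, the three-dimensional stand-in embedding, and the function names are all hypothetical; how the authors aggregate and scale the features is not described here.

```python
from statistics import fmean

# Hypothetical EPA lexicon: word -> (evaluation, potency, activity).
# The scores are made up for illustration only.
EPA_LEXICON = {
    "kill": (-3.0, 2.0, 1.0),
    "cure": (2.0, 1.0, 0.0),
    "guidance": (1.0, 1.0, -1.0),
}


def epa_averages(tokens):
    """Average E, P and A over tokens found in the lexicon (zeros if none)."""
    hits = [EPA_LEXICON[t] for t in tokens if t in EPA_LEXICON]
    if not hits:
        return [0.0, 0.0, 0.0]
    # zip(*hits) regroups the triples into one sequence per dimension.
    return [fmean(dim) for dim in zip(*hits)]


def combine_features(embedding, tokens):
    """Append the three EPA averages to a sentence embedding."""
    return list(embedding) + epa_averages(tokens)


sentence_vec = [0.1, -0.2, 0.3]  # stand-in for a real word2vec sentence vector
features = combine_features(sentence_vec, ["lemon", "cure", "kill"])
print(features)
```

The resulting vector can be passed to any of the three classifiers in the table; dropping two of the three averages gives the single-dimension variants.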