Romy Sauvayre, Jessica Vernier, Cédric Chauvière.
Abstract
BACKGROUND: As the COVID-19 pandemic progressed, disinformation, fake news, and conspiracy theories spread through many parts of society. According to the literature, the disinformation circulating on social media is one of the causes of increased COVID-19 vaccine hesitancy. In this context, analyzing social media posts is particularly important, but the large volume of data exchanged on social media platforms requires specific methods. This is why machine learning and natural language processing models are increasingly applied to social media data.
Keywords: COVID-19; CamemBERT language model; disinformation; epistemology; language model; machine learning; method; natural language processing; public health; social media; vaccine
Year: 2022 PMID: 35512274 PMCID: PMC9116457 DOI: 10.2196/37831
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Flow chart of methodology steps. API v2: application programming interface version 2.
Classification criteria for tweets and definitions.
| Type of tweet | Definition | Translated example (French to English) |
| Classification problem 1 | | |
| Unclassifiable | Unclassifiable or irrelevant to the topics of vaccination or health measures | The Emmanuel Macron effect |
| Noncommittal | Neutral or without explicit opinion on vaccination and/or the health pass | I have to ask my doctor for the vaccine |
| Pros | Arguments in favor of the health pass | Personally, I am vaccinated so nothing to fear, on the other hand, good luck to all the anti-vaccine, you will not have the choice now?? |
| Cons | Arguments against vaccination or doubts about the effectiveness of COVID-19 vaccines, fear of side effects, and refusal to obtain the health pass | I am against the vaccine I am not afraid of the virus but I am afraid of the vaccine |
| Classification problem 2 | | |
| Unclassifiable | Irrelevant to the topic or unclassifiable | A vaccine |
| Scientific | Scientific or pseudoscientific content that uses true beliefs or false information | The vaccine is 95% efficient, a little less in fragile people. The risk is not zero, but a vaccinated person has much less chance of transmitting the virus. |
| Political | Comments on legal or political decisions about vaccination or health measures | Basically the vaccine is mandatory, shameful LMAO |
| Social | Comments, debates, or opinions on the relationship with other members of society | “Pro vaccine” you have to also understand that there are people who do not want to be vaccinated. |
| Vaccination status | Explicit tweet about the vaccination status of the tweet author | Example 1: I am very glad to have already done my 2 doses of the vaccine, fudge |
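The two label sets defined in the table above can be written down as lookup tables of the kind a text classifier typically needs. The following is a minimal sketch only: the variable names, the helper `build_label_maps`, and the integer ids are hypothetical and are not taken from the source.

```python
# Label names are copied from the criteria table; the ordering and the
# integer ids assigned below are illustrative assumptions.
PROBLEM_1_LABELS = ["Unclassifiable", "Noncommittal", "Pros", "Cons"]
PROBLEM_2_LABELS = [
    "Unclassifiable",
    "Scientific",
    "Political",
    "Social",
    "Vaccination status",
]


def build_label_maps(labels):
    """Return (label2id, id2label) dictionaries for a list of label names."""
    label2id = {name: idx for idx, name in enumerate(labels)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


p1_label2id, p1_id2label = build_label_maps(PROBLEM_1_LABELS)
p2_label2id, p2_id2label = build_label_maps(PROBLEM_2_LABELS)
```

Keeping one map per classification problem reflects the fact that each tweet receives two independent labels, one from each scheme.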
The proportion of tweets assigned to each label in the data set for classification problems 1 and 2 (n=1451).
| Classification problem and label | Tweets, n (%) |
| Classification problem 1 | |
| Unclassifiable | 189 (13) |
| Neutral | 354 (24.4) |
| Positive | 392 (27) |
| Negative | 516 (35.6) |
| Classification problem 2 | |
| Unclassifiable | 226 (15.6) |
| Scientific | 441 (30.4) |
| Political | 316 (21.8) |
| Social | 353 (24.3) |
| Vaccination status | 115 (7.9) |
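The percentages in the table above follow directly from the counts and the data set size (n=1451). A short sketch of that arithmetic is given here; the counts are copied from the table, while the dictionary and function names are illustrative.

```python
# Counts per label, copied from the table above.
problem_1 = {"Unclassifiable": 189, "Neutral": 354, "Positive": 392, "Negative": 516}
problem_2 = {
    "Unclassifiable": 226,
    "Scientific": 441,
    "Political": 316,
    "Social": 353,
    "Vaccination status": 115,
}


def shares(counts):
    """Return each label's share of the total, in percent."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


for name, counts in (("problem 1", problem_1), ("problem 2", problem_2)):
    print(name, "total:", sum(counts.values()))
    for label, pct in shares(counts).items():
        print(f"  {label}: {pct:.1f}%")
```

Both label sets sum to 1451, so every tweet carries exactly one label per problem. The distribution is uneven, with the vaccination status label covering fewer than 8% of tweets.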
Figure 2. Training loss (a) and validation accuracy (b) of the model over 20 epochs for classification problem 1.
Classification performance of the model for classification problem 1.
| Epochs, n | Precision (%) | Recall (%) | F1-score (%) |
| 7 | 59 | 55.3 | 55.3 |
| 15 | 56.6 | 53 | 53.2 |
| 20 | 56.9 | 54.5 | 55.2 |
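The source does not state how precision, recall, and F1-score are averaged across labels. As a minimal sketch, assuming support-weighted averaging, the three figures can be derived from a confusion matrix as shown here. The matrix `toy_cm` is hypothetical and is not taken from the study.

```python
def weighted_metrics(cm):
    """Support-weighted precision, recall, and F1.

    cm[i][j] is the number of items whose true label is i and whose
    predicted label is j.
    """
    n_labels = len(cm)
    total = sum(sum(row) for row in cm)
    precision = recall = f1 = 0.0
    for k in range(n_labels):
        tp = cm[k][k]
        support = sum(cm[k])                             # items truly labeled k
        predicted = sum(cm[i][k] for i in range(n_labels))  # items predicted as k
        p_k = tp / predicted if predicted else 0.0
        r_k = tp / support if support else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support / total
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    return precision, recall, f1


# Hypothetical 3-label confusion matrix, for illustration only.
toy_cm = [
    [8, 2, 0],
    [1, 5, 4],
    [0, 3, 7],
]
precision, recall, f1 = weighted_metrics(toy_cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under weighted averaging, recall equals overall accuracy, and the weighted F1-score is not the harmonic mean of the weighted precision and recall, which is why an F1-score can sit at or below both of them.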
Classification performance of the model for classification problem 2.
| Epochs, n | Precision (%) | Recall (%) | F1-score (%) |
| 6 | 67.6 | 64.5 | 62.9 |
| 15 | 62.7 | 62.8 | 61.3 |
| 20 | 60.6 | 59.5 | 56.5 |
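In both performance tables, the checkpoint with the fewest epochs scores highest, and results decline or stagnate with further training. A small sketch of selecting the best checkpoint by F1-score follows; the scores are copied from the two tables, and the selection rule and names are assumptions rather than the authors' stated procedure.

```python
# (precision, recall, F1) in percent, keyed by number of epochs.
results = {
    "problem_1": {7: (59.0, 55.3, 55.3), 15: (56.6, 53.0, 53.2), 20: (56.9, 54.5, 55.2)},
    "problem_2": {6: (67.6, 64.5, 62.9), 15: (62.7, 62.8, 61.3), 20: (60.6, 59.5, 56.5)},
}


def best_epoch(by_epoch):
    """Return the epoch count whose F1-score (third tuple entry) is highest."""
    return max(by_epoch, key=lambda epoch: by_epoch[epoch][2])


for problem, by_epoch in results.items():
    epoch = best_epoch(by_epoch)
    print(problem, "best epochs:", epoch, "F1:", by_epoch[epoch][2])
```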
Figure 3. Confusion matrix for classification problems 1 and 2 (n=400).
The number of tweets correctly classified for each label in classification problems 1 and 2 (n=400).
| Classification problem and label | Correctly classified tweets, n (%) |
| Classification problem 1 | |
| Unclassifiable | 10 (22.2) |
| Noncommittal | 62 (43.7) |
| Pros | 36 (67.9) |
| Cons | 113 (70.6) |
| Classification problem 2 | |
| Unclassifiable | 27 (40.3) |
| Scientific | 67 (79.8) |
| Political | 93 (82.3) |
| Social | 58 (66.7) |
| Vaccination status | 13 (26.5) |
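Summing the correctly classified counts in the table above and dividing by the test set size (n=400) gives the overall accuracy for each problem. The counts are copied from the table; the variable and function names are illustrative.

```python
TEST_SIZE = 400  # size of the test set, from the table caption

correct_1 = {"Unclassifiable": 10, "Noncommittal": 62, "Pros": 36, "Cons": 113}
correct_2 = {
    "Unclassifiable": 27,
    "Scientific": 67,
    "Political": 93,
    "Social": 58,
    "Vaccination status": 13,
}


def accuracy(correct_counts, n_total):
    """Fraction of all test tweets that were classified correctly."""
    return sum(correct_counts.values()) / n_total


acc_1 = accuracy(correct_1, TEST_SIZE)
acc_2 = accuracy(correct_2, TEST_SIZE)
print(f"problem 1: {acc_1:.4f}, problem 2: {acc_2:.4f}")
```

This yields 221/400 for problem 1 and 258/400 for problem 2, which is close to the recall reported for the 7-epoch and 6-epoch models. That agreement would be expected if the reported recall is support weighted, which is an assumption here and is not stated in the source.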
Figure 4. Tweet text length as a function of the accuracy of the fine-tuned CamemBERT model on classification problems 1 and 2 (Mann-Whitney U test).
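The Mann-Whitney U test named in the caption compares two groups by ranks, here the text lengths of correctly and incorrectly classified tweets. The sketch below computes only the U statistic by direct pair counting. The tweet lengths are hypothetical, and a p-value would normally come from a statistics library such as `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical character lengths, for illustration only.
correct_lengths = [212, 180, 240, 95, 199]
wrong_lengths = [60, 120, 45, 180]

u_correct = mann_whitney_u(correct_lengths, wrong_lengths)
u_wrong = mann_whitney_u(wrong_lengths, correct_lengths)
print(u_correct, u_wrong)
```

The two statistics always sum to the number of pairs (here 5 × 4 = 20). A value of `u_correct` close to that maximum indicates that correctly classified tweets tend to be longer in this toy sample.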
Classification performance of the model for classification problem 2, limited to long tweets (170 or more characters).
| Classification problem | Precision (%) | Recall (%) | F1-score (%) |
| 2 | 72.6 | 73.2 | 72.4 |
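Restricting the evaluation to long tweets amounts to a length filter applied before scoring. A minimal sketch follows, assuming length is measured with Python's `len` on the tweet text. The source does not say how characters were counted, and the sample tweets are hypothetical.

```python
MIN_LENGTH = 170  # threshold from the table caption: 170 or more characters


def is_long(text, min_length=MIN_LENGTH):
    """True if the tweet text has at least min_length characters."""
    return len(text) >= min_length


# Hypothetical tweets: one short, one exactly at the threshold, one just below.
tweets = ["short tweet", "x" * 170, "y" * 169]
long_tweets = [t for t in tweets if is_long(t)]
print(len(long_tweets), "of", len(tweets), "tweets kept")
```

The comparison is inclusive, so a tweet of exactly 170 characters is kept.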
Figure 5. Confusion matrix for classification problem 2 limited to long tweets (n=168).
The proportion of correct classifications for each label in classification problem 2, limited to long tweets (170 or more characters; n=168).
| Classification problem and label | Correctly classified tweets, n (%) |
| Classification problem 2 | |
| Unclassifiable | 6 (46.3) |
| Scientific | 42 (79.2) |
| Political | 45 (90) |
| Social | 25 (65.8) |
| Vaccination status | 5 (35.7) |