Siddhi Mishra, Abhigya Verma, Kavita Meena, Rishabh Kaushal.
Abstract
Social media have a significant impact on public opinion building. Vaccination in India started in January 2021, and because vaccination is one of the most crucial steps in the fight against COVID-19, people have expressed many opinions about it. In this paper, we compare the public's sentiments towards COVID-19 vaccination in India before and after the second wave. We built our datasets by extracting tweets regarding vaccination in India: 5977 tweets before the second wave and 42,936 tweets after it. We annotated the collected tweets into four categories, namely Provaccine, Antivaccine, Hesitant and Cognizant. We built a baseline model for sentiment analysis using multiple classification techniques, among which Random Forest with TF-IDF vectorization gave the best accuracy of 69%, using max-features and n-estimators as parameters.
Keywords: Covid-19 vaccination; Machine learning; Sentiment classification
Year: 2022 PMID: 35668822 PMCID: PMC9151355 DOI: 10.1007/s13278-022-00885-w
Source DB: PubMed Journal: Soc Netw Anal Min
Fig. 1 Our workflow diagram involving data collection, annotation, visualization, text analysis, vectorization, data augmentation (back translation), and machine learning
Comparative analysis between prior research works
| Paper | Description | Our remarks |
|---|---|---|
| Toll and Li | The aim was to examine the reasons behind attitudes towards Measles, Mumps, and Rubella (MMR) vaccination. The data used for the analysis came from the Longitudinal Study of Australian Children (LSAC), a longitudinal study that began in 2004 with a nationally representative sample of over 10,000 children and their families. The impact of different MMR vaccines and attitudes towards vaccination were modelled using multinomial logistic regression | Although this paper does not cover our use case, we use it as a reference for the methodology of extracting attitudes towards vaccination |
| Piedrahita-Valdés et al. | They studied hesitancy among users towards vaccines in general during the period 2011 to 2019 through the lens of social media. They collected 1,499,227 tweets regarding vaccines and performed sentiment analysis. Polarity analysis was performed using an association model based on a combination of lexical-based approaches and supervised machine learning methods. The results showed that 69.36% of the classified tweets were neutral, 21.78% were positive, and 8.86% were negative | We use this paper as a reference for the data preparation and extraction methodology |
| Dubey | They investigated the public’s sentiment towards the Covid-19 vaccination drive in India with respect to two vaccines, namely Oxford-AstraZeneca’s Covishield and Bharat Biotech’s Covaxin. Tweets were assigned to a sentiment class by applying the pre-trained NRC Emotion lexicon. For Covaxin, 69% of tweets carried positive sentiments while 31% carried negative sentiments. For Covishield, 71% of tweets carried positive sentiments while 29% carried negative sentiments | We also analyse users’ sentiment towards the Indian vaccination drive. However, we build our own novel dataset categorized into four classes: Provaccine, Antivaccine, Hesitant and Cognizant, which would be more useful for policy decisions |
| Radzikowski et al. | They studied the narratives related to the Measles vaccination drive on Twitter. The tweets were analysed to identify key terms, connections among such terms, retweet patterns, the structure of the narrative, and connections to the geographical space. They found that tweets made by news agencies had more effect on opinion than tweets by health organization accounts | Our work is similar in terms of understanding people’s opinions on social media. However, we go beyond the conventional sentiment classes and curate an annotated dataset of user attitudes towards Covid-19 vaccines in India |
| Samuel et al. | They inspected the general outlook of the public towards the pandemic using tweets related to Covid-19 and a pre-trained sentiment analysis model. An RNN is used along with NLP processing and sentiment analysis, covering both classical and deep learning approaches. They covered four critical issues, including (1) opinions related to the outbreak of Covid-19, (2) the use of tweets for sentiment analysis, and (3) the observation that naive Bayes is preferable for predicting sentiments, with a precision of 91% and an accuracy of 57% | In contrast, we do not use a pre-trained model. Instead, we annotate our own dataset and train a model on it from scratch |
| Shaban-Nejad et al. | They proposed a method for the design and build-out of an integrated semantic platform to understand vaccine sentiments, reliance, and attitudes using ontologies | In our work, we approach the problem as the design of a machine learning model to automatically flag the user’s attitude based on the tweet |
| Nemes and Kiss | They designed an emotion prediction model by correlating words and further labelling the words to entries, instead of the usual negative and positive classification. A deep learning model is used for sentiment analysis | We work on classifying a tweet into four categories, namely Provaccine, Antivaccine, Hesitant and Cognizant, moving beyond the typical sentiment classification |
Existing datasets related to vaccine
| Dataset | Authors | Description |
|---|---|---|
| Barrier childhood immunization | Pearce et al. | Data from the Longitudinal Study of Australian Children were collected; they contain a rich set of child and family information that is instrumental in controlling for the decision around vaccination |
| Twitter-measles | Radzikowski et al. | The GeoSocial Gauge system prototype was used to collect data from Twitter using a user-specified set of parameters such as keywords, locations, and time, in order to understand users’ opinions on measles vaccines |
| Fear sentiment tweets | Samuel et al. | Over nine hundred thousand tweets were downloaded using a Twitter API by applying the keyword ‘Corona’; the goal was to analyse fear sentiment related to Covid-19 |
Fig. 2 Data Extraction
Keywords used to extract data
| Keyword | Reasons/explanation |
|---|---|
| Covaxine | Indian Vaccine Name |
| Covidshield | Indian Vaccine Name |
| Bharat Biotech | Covaxine manufacturer |
| Serum Institute of India | Covidshield manufacturer |
| Largest vaccine Drive | Most popular tweet |
| India vaccine | Getting tweet related to Indian Vaccine |
| Government India vaccine | Getting tweet related to Indian Vaccine |
| Modi/BJPVaccine | Getting tweets / trolls related to politics in vaccine |
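As a rough illustration of how a keyword list like the one above could be applied to tweet text, the following minimal sketch checks which keywords match a tweet. The matching rule (every word of the keyword must appear, case-insensitively), the split of "Modi/BJPVaccine" into two terms, and the sample tweets are all assumptions made for illustration; the source does not describe the actual query mechanics.

```python
import re

# Keywords copied from the table above, spelled as the authors list them.
# Splitting "Modi/BJPVaccine" into two terms is an assumption.
KEYWORDS = [
    "Covaxine",
    "Covidshield",
    "Bharat Biotech",
    "Serum Institute of India",
    "Largest vaccine Drive",
    "India vaccine",
    "Government India vaccine",
    "Modi vaccine",
    "BJP vaccine",
]


def matches_keyword(text, keyword):
    """Return True if every word of `keyword` occurs in `text` (case-insensitive)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return all(word in tokens for word in keyword.lower().split())


def matched_keywords(text, keywords=KEYWORDS):
    """List the keywords whose words all appear in the tweet text."""
    return [kw for kw in keywords if matches_keyword(text, kw)]


# Hypothetical tweets, written for illustration only (not from the dataset).
sample_tweets = [
    "Got my first dose of Covaxine today!",
    "India rolls out the largest vaccine drive in the world",
    "Weather is lovely in Delhi this morning",
]
for tweet in sample_tweets:
    print(matched_keywords(tweet), "<-", tweet)
```

A tweet can match several keywords at once, so a collection step built this way would need to deduplicate tweets before annotation.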
Labels and definitions: The labels were inspired by the study of public opinions on Covid-19 vaccines by Lyu et al. (2021)
| Category | Description |
|---|---|
| Pro-vaccine | 1. Claiming that they would take the vaccine once it is available 2. Advocating and supporting the vaccine or vaccine-associated entities such as vaccine trials 3. Believing that the vaccine will stop the pandemic 4. Encouraging other people to take the vaccine |
| Anti-vaccine | 1. Promoting information that does not support vaccination 2. Arguing against facts that support vaccination 3. Believing that an effective vaccine would not be invented soon 4. Believing that the vaccine is dangerous |
| Vaccine-hesitant | 1. Claiming that they would like to take the vaccine provided that it is proven safe/effective 2. Asking queries related to the COVID-19 vaccine 3. Showing worries about the vaccine’s effectiveness |
| Vaccine-cognizant | 1. Vaccine-related news and facts 2. Mentioning the vaccine and the commenter’s opinion, but with the focus on something else |
Data collection results
| Dataset | Total tweets | Annotated tweets | Duration |
|---|---|---|---|
| Before Second Wave | 5977 | 4094 | March 2021 |
| After Second Wave | 42,936 | 5000 | June 2021 |
| Combined | 48,913 | 9094 | N/A |
Distribution of tweets before and after second wave of Covid-19 in India among four categories
| Classes | Before | After |
|---|---|---|
| Vaccine-hesitant | 545 | 849 |
| Anti-vaccine | 413 | 274 |
| Pro-vaccine | 1340 | 947 |
| Vaccine-cognizant | 1796 | 2930 |
| Total | 4094 | 5000 |
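Because the two annotated sets differ in size, the before/after comparison is easier to read as percentage shares than as raw counts. The short sketch below recomputes those shares from the counts in the table above; the variable and function names are illustrative only.

```python
# Annotated tweet counts per class, copied from the table above
# as (before second wave, after second wave).
counts = {
    "Vaccine-hesitant": (545, 849),
    "Anti-vaccine": (413, 274),
    "Pro-vaccine": (1340, 947),
    "Vaccine-cognizant": (1796, 2930),
}

total_before = sum(before for before, _ in counts.values())
total_after = sum(after for _, after in counts.values())


def share(count, total):
    """Percentage share of one class, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {
    label: (share(before, total_before), share(after, total_after))
    for label, (before, after) in counts.items()
}

for label, (before_pct, after_pct) in shares.items():
    change = round(after_pct - before_pct, 1)
    print(f"{label:<18} {before_pct:5.1f}% -> {after_pct:5.1f}%  ({change:+.1f} pp)")
```

On these counts, the Pro-vaccine share falls from about 32.7% to 18.9% of annotated tweets, while the Vaccine-cognizant share rises from about 43.9% to 58.6%.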
Fig. 3 Pie chart and bar chart of tweets before and after second Covid-19 wave
Fig. 4 Most used hashtags circulated (i) before second Covid-19 wave and (ii) after second Covid-19 wave
Fig. 5 Word cloud and frequency of mentions before and after second Covid-19 wave
Fig. 6 Effect of backtranslation
Fig. 7 Illustration of BoW vectorisation technique
Fig. 8 TF-IDF vectorisation technique
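The two vectorisation techniques named in the captions for Fig. 7 and Fig. 8 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes simple lowercase tokenisation, raw counts for BoW, and the textbook weighting tf = count / document length, idf = ln(N / df). Library implementations typically use smoothed variants, and the example tweets are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each count by tf = count / doc_length and idf = ln(N / df)."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in bow if vec[i] > 0) for i in range(len(vocab))]
    weighted = []
    for vec in bow:
        length = sum(vec) or 1  # guard against empty documents
        weighted.append(
            [(count / length) * math.log(n_docs / df[i]) for i, count in enumerate(vec)]
        )
    return vocab, weighted


# Hypothetical tweets for illustration only (not taken from the dataset).
docs = [
    "vaccine is safe",
    "vaccine is not safe",
    "got my vaccine today",
]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)
print(bow[1])
print([round(w, 3) for w in weights[1]])
```

In this toy corpus the word "vaccine" appears in every document, so its TF-IDF weight is zero, whereas "not" appears in only one document and receives the largest weight in the second tweet. BoW, by contrast, gives both words the same count of 1.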
Fig. 9 Comparison of vectorization techniques on all four datasets
Fig. 10 Results of Hyperparameter tuning in Logistic Regression
Fig. 11 Results of Hyperparameter tuning in Random Forest
Fig. 12 Results of Hyperparameter tuning in Decision Tree Classifier
Fig. 13 Results of Hyperparameter tuning in KNN
Fig. 14 Result of Hyperparameter tuning in Gradient Boosting
Accuracy of machine learning algorithms using vectorisation techniques with different datasets (D1: Before Second Wave, D2: After Second Wave, D3: Merged, and D4: Merged Back Translation)
| Models used | D1 | D2 | D3 | D4 | Avg |
|---|---|---|---|---|---|
| BOW + LR | 0.54 | 0.66 | 0.62 | 0.61 | 0.61 |
| BOW + DT | 0.48 | 0.59 | 0.53 | 0.52 | 0.53 |
| BOW + KNN | 0.44 | 0.57 | 0.50 | 0.49 | 0.50 |
| BOW + RF | 0.57 | 0.66 | 0.61 | 0.62 | 0.61 |
| BOW + GB | 0.57 | 0.65 | 0.60 | 0.60 | 0.61 |
| TF-IDF + LR | **0.61** | 0.67 | **0.65** | 0.63 | 0.64 |
| TF-IDF + DT | 0.45 | 0.60 | 0.54 | 0.54 | 0.53 |
| TF-IDF + KNN | 0.58 | 0.62 | 0.561 | 0.55 | 0.58 |
| TF-IDF + RF | 0.59 | **0.69** | 0.64 | 0.64 | |
| TF-IDF + GB | 0.54 | 0.67 | 0.64 | **0.65** | 0.62 |
Bold values are the best performing results
Results: accuracy of the best models on the different datasets using the TF-IDF vectorization technique
| Datasets | Best model | Accuracy |
|---|---|---|
| Before second wave | Logistic regression | 61% |
| After second wave | Random forest | 69% |
| Merged | Logistic regression | 65% |
| Merged Backtranslation | Gradient boosting | 65% |
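One way to put these accuracies in context is to compare them with a majority-class baseline, that is, a classifier that always predicts the most frequent label (Vaccine-cognizant). The sketch below does this using the annotated class counts reported in the class-distribution table. It is an approximation under a stated assumption: the baseline is computed on the full annotated sets, whereas the reported accuracies presumably come from a held-out split whose class proportions may differ slightly.

```python
# Size of the largest class (Vaccine-cognizant) and the total number of
# annotated tweets, taken from the class-distribution table.
largest_class = {"Before second wave": 1796, "After second wave": 2930}
annotated = {"Before second wave": 4094, "After second wave": 5000}

# Best reported accuracy per dataset, taken from the results table above.
reported = {"Before second wave": 0.61, "After second wave": 0.69}


def majority_baseline(dataset):
    """Accuracy of always predicting the most frequent class."""
    return largest_class[dataset] / annotated[dataset]


for name in reported:
    base = majority_baseline(name)
    gain = reported[name] - base
    print(f"{name}: baseline {base:.3f}, reported {reported[name]:.2f}, gain {gain:+.3f}")
```

Under this assumption, the best models exceed the majority-class baseline by roughly 17 percentage points before the second wave and roughly 10 percentage points after it, so the higher raw accuracy on the after-wave data partly reflects its more skewed class distribution.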