Literature DB >> 35528967

Analyzing the vaccination debate in social media data Pre- and Post-COVID-19 pandemic.

Abstract

The COVID-19 virus has caused and continues to cause unprecedented impacts on the life trajectories of millions of people globally. Recently, to combat the transmission of the virus, vaccination campaigns around the world have become prevalent. However, while many see such campaigns as positive (e.g., protecting lives), others see them as negative (e.g., the side effects that are not fully understood scientifically), resulting in diverse sentiments towards vaccination campaigns. In addition, the diverse sentiments have seldom been systematically quantified let alone their dynamic changes over space and time. To shed light on this issue, we propose an approach to analyze vaccine sentiments in space and time by using supervised machine learning combined with word embedding techniques. Taking the United States as a test case, we utilize a Twitter dataset (approximately 11.7 million tweets) from January 2015 to July 2021 and measure and map vaccine sentiments (Pro-vaccine, Anti-vaccine, and Neutral) across the nation. In doing so, we can capture the heterogeneous public opinions within social media discussions regarding vaccination among states. Results show how positive sentiment in social media has a strong correlation with the actual vaccinated population. Furthermore, we introduce a simple ratio between Anti and Pro-vaccine as a proxy to quantify vaccine hesitancy and show how our results align with other traditional survey approaches. The proposed approach illustrates the potential to monitor the dynamics of vaccine opinion distribution online, which we hope, can be helpful to explain vaccination rates for the ongoing COVID-19 pandemic.

Entities: Chemical

Keywords: COVID-19 Pandemic; Sentiment Analysis; Social Media; Time & Space; United States; Vaccination

Year: 2022 PMID： 35528967 PMCID： PMC9069236 DOI： 10.1016/j.jag.2022.102783

Source DB: PubMed Journal: Int J Appl Earth Obs Geoinf ISSN： 1569-8432

Introduction

Over the last two decades there have been several diseases outbreaks, such as HIN1 influenza, the Ebola and Zika viruses and the current COVID-19 outbreak. Of these diseases have been more localised than others (e.g., Ebola) but without exception, they have brought tremendous economic losses and deaths. Take the ongoing COVID-19 pandemic as an example, which was declared an international public health emergency by the World Health Organization (WHO) in early 2020, it has now spread all around the world. The pandemic has affected hundreds of millions of people’s lives in many aspects including environmental, psychological, social, and economic (Saadat et al., 2020, Saladino et al., 2020, Sharifi and Khavarian-Garmsir, 2020). In order to prevent the spread of the COVID-19 in society, governments in different countries have put in place various prevention and control measures, such as social distancing, stay-at-home constraints, closure of educational institutions, workplaces along with restricting movement either internally or externally (GÜNER et al., 2020). Although it has been seen that such counter measures for infection control can slow the transmission of the disease (Ge et al., 2022, Lai et al., 2020, Ruktanonchai et al., 2020), thus far they cannot prevent the disease from spreading totally. One possible solution however to stop the spread is that of vaccination, which can also help reduce mortality and economic losses and aid the world gradually to return to some sort of normalcy. However, notwithstanding the continuous development and maturation of vaccine research since the 18th century (Plotkin, 2014), the implementation of vaccination campaigns is still challenging. This is mainly due to the potential impacts of vaccines on human longevity and health, such as side effects that are not fully understood scientifically, and the concerns related to religion and philosophical beliefs (Calandrillo, 2004, Phipps, 2020). These impacts and concerns have slowly taken root in people’s consciousness, resulting in diverse public sentiments (e.g., positive, negative, or neutral) towards vaccination. With the proliferation of social media platforms, such as Facebook, Twitter, Weibo, and so on, people now have more flexibility than ever to share their attitudes regarding vaccination (Liu, 2012). However, the convenience of the digital era has also put people in an environment flooded with diverse information. The information, especially those from Anti-vaccine groups who disseminate negative views about vaccines, can alter the public’s perceptions regarding vaccination. As such, vaccine hesitancy has been identified as one of the ten main threats to global health in 2019 by the WHO (2019). Negative or hesitant vaccine sentiment raises risks of vaccine-preventable diseases (Dubé et al., 2013, Puri et al., 2020), contributing to a suboptimal uptake of vaccination. From this point of view, it is crucial to understand the heterogeneous public opinions regarding vaccination so as to provide insightful informs for enhancing vaccine coverage. One way to explore this is via sentiment analysis, which is one of the most active research fields in Natural Language Processing (NLP) and has been widely used to detect and classify sentiments from text data. Although social media platforms have the potential to amplify the negative voice of vaccination as information carriers, they also serve as digital Petri dishes, opening up a host of new possibilities for sentiment analysis by making it more affordable and convenient to collect large-scale data. Recently, researchers have used different techniques, such as lexicon-based or learning-based sentiment analysis, to assess public attitudes towards vaccination in online social media (Hu et al., 2021, Villavicencio et al., 2021, Yousefinaghani et al., 2021, Yuan et al., 2019), but most of them solely focused on a short study period, and no one to our knowledge has compared the dynamic changes of vaccine sentiments before and after a disease outbreak over a prolonged period of time. We take the promising intersection of sentiment analysis and social media data as a starting point to further unpack the potential dynamic changes of the public’s vaccine attitudes over a long-term period. We argue that sentiment analysis lens on large-scale social media data can complement understanding public opinions regarding vaccination based on survey research with views from both spatial and temporal perspectives. To illustrate this potential, we took the United States as a test case and utilized a Twitter dataset from January 2015 to July 2021. We started by developing a classifier to detect three vaccine sentiments (Pro-vaccine, Anti-vaccine, and Neutral) based on an approach that combines machine learning and word embedding techniques. Subsequently, taking the time of the COVID-19 outbreak as the demarcation point, we divided the study period into two phases - before and after the COVID-19 outbreak. In doing so, we can compare the three vaccine sentiments before and after the outbreak and as such reveal specific changes or trends in public attitudes towards vaccination in the United States. Moreover, we proposed a metric (A2P Ratio) derived from the identified Pro- and Anti-vaccine sentiments to evaluate the vaccine hesitancy and validated it with the estimated vaccine hesitancy from Centers for Disease Control and Prevention (CDC). Doing this allows us to address questions such as of what is the dominant vaccine sentiment before and after the outbreak? Did vaccine sentiment change over time and where did such changes take place? What are the relationships between different vaccine sentiments and the actual vaccination rates? These questions - and others alike - have direct policy implications. Such as discussions surrounding the topics of the effectiveness of interventions/strategies for enhancing vaccine uptake and immunization coverage, and the psychological, social, and political factors that sustain public trust in vaccines (Larson et al., 2011, Odone et al., 2015). We argue that tracking the dynamics of vaccine sentiments over space and time can generate informed insights for these questions. This is especially the case when combined with the analytical latitude offered by social media data that give more possibilities for assessing vaccine sentiments at larger scales in terms of space and time. In the remainder of the paper, we will provide an overview of the vaccination background and the commonly used techniques for sentiment analysis (Section 2). After that, we will describe the details of the data sets and the methodology utilized to identify public attitudes towards vaccination in the online discussion (Section 3). We will discuss the results through a series of visualization and specific state examples in Section 4 and lastly, outline the core findings and potential future works in Section 5.

Background

Modern day vaccinations can be traced back to the late 18th and early 19th Century, when Edward Jenner created the world’s first ever vaccine to contain smallpox (Battley, 1982, Edward, 1802). However, the term vaccine at that time was only used to refer to the smallpox vaccine. It was not until Louis Pasteur and his cure for rabies in the late 19th Century that the term vaccine became more broadly used (Stern and Markel, 2005). This medical cure to rabies, changed the American public’s perceptions and expectations of science (Hansen, 1998) and leaded to the widespread adoption of other vaccines, such as measles, rubella, tetanus, and diphtheria. As such vaccination campaigns were designated as one of the top ten public health achievements of the United States in the 20th century (CDC, 1999). Although vaccination has been a crucial part of preventative health care that has saved millions of lives, some have negative opinions with respect to them. One typical example is the autism-vaccine controversy where Wakefield et al. (1998) linked the rise in the ratio of children with autism to measles, mumps, and rubella (MMR) vaccination. Although the claim was eventually proved false (Wakefield et al., 1998), people became conscious about vaccine safety not only in the United States but also more globally (Motta and Stecula, 2021). Over time, public debate regarding vaccination has gradually become polarized. People who support vaccination claim that it is part of a citizen’s responsibility to be vaccinated as it helps mitigate risks of diseases transmission in society (Blume, 2006). While for anti-vaccinists, they appeal more to human rights and liberties fighting against the mandatory nature of vaccination based on concerns related to vaccine safety, religious and philosophical beliefs (Calandrillo, 2004, Phipps, 2020). The anti-vaccine movement may put vaccine practices at risk of being rejected, and their reach can be significantly amplified in the age of information because anti-vaccinists can disseminate the negatives about vaccination through social media (Evrony and Caplan, 2017, Kata, 2012, Smith and Graham, 2019). As such, the anti-vaccine movement can alter public perceptions about vaccination, and the resulting vaccine hesitancy is one of such risks. Vaccine hesitancy can be colloquially referred to as a kind of behavioral phenomenon in which people are not confident in or dissatisfied with vaccination and may potentially delay or even refuse vaccination (Finney Rutten et al., 2021). Research has provided evidence that vaccine hesitancy is one of the factors leading to the decline of vaccine coverage, which raises disease transmission risks that, in practice, can be mitigated by vaccination (Dubé et al., 2013, Puri et al., 2020). This has increased difficulties in the uptake of vaccines, particularly in the context of the ongoing COVID-19 pandemic, which has caused and continues to cause unprecedented impacts on life trajectories of millions of people globally (Palacios Cruz et al., 2020, Velavan and Meyer, 2020, Xiong et al., 2020). As such, understanding public attitude towards vaccination responding to the COVID-19 pandemic is of utmost importance to the current global public health emergency. In this context, many studies have used different sentiment analysis techniques to unfold the public sentiments or their emotions regarding vaccination, especially in the era of smart technologies, where crowd sourced data (e.g., social media data, customer reviews) have become more and more available (Yang et al., 2017). These data sources provide researchers more opportunities and flexibility to perform sentiment analysis on a large scale. Generally, sentiment analysis techniques can be classified into two major categories, namely lexical rule-based approaches and learning-based methods. More precisely, the lexical rule-based approaches utilize the manually created and validated lexicon with pre-assigned valence scores to analyze sentiment polarities. One typical example is the Valence Aware Dictionary and sEntiment Reasoner (VADER) model developed by Hutto and Gilbert (2014). VADER has the ability to capture both the polarity (positive or negative) and intensity (how positive or negative) of sentiments. For example, Hu et al. (2021) utilized VADER combined with topic modeling to measure public perceptions on COVID-19 vaccines by different topics extracted from tweets in the United States and compared them to spatio-temporal shifts of the identified sentiments from early 2020 to early 2021. Similarly, Yousefinaghani et al. (2021) used VADER to measure public opinions and emotions on vaccinations with respect to emotional expression based on Twitter posts, but in multiple countries. However, as noted earlier, the lexical rule-based approaches require manually generating and verifying predefined dictionaries, which can be a time-intensive process and the lexicon may need to be revised after a certain time. In addition, they only work on individual words without evaluating the context of words, which may result in errors, especially for sarcastic or ironic texts. Therefore, other studies have turned their attention to more automated ways, and the learning-based sentiment analysis methods are what they looked for, such as machine learning technology. The machine learning algorithms generally require transforming textual data into numeric representations before they can process them. To this end, different methods have been used, for example, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. TF-IDF is based on the Bag-of-Words (BoW) model, which measures the mathematical importance of words in documents from a probabilistic point of view (Aizawa, 2003). For example, Yuan et al. (2019) investigated the polarized online MMR vaccine debate in Twitter after the California Disneyland measles outbreak in 2015. They used TF-IDF to extract features of the input dataset before classifying sentiments with different machine learning techniques. While in another example, Villavicencio et al. (2021) studied the stance of Filipinos regarding COVID-19 vaccines. They also utilized TF-IDF to generate word vectors from Twitter posts and performed sentiment analysis using the Naïve Bayes classification algorithm. However, the traditional TF-IDF approach may be slow for enormous vocabularies because it works directly in the word-count space. Moreover, it cannot catch the position in a text and the semantics between words due to the designed principle. In order to overcome these limitations, word embedding techniques, such as Word2Vec (Mikolov et al., 2013), have been proposed which can not only efficiently compute vector representations of words for large datasets, but can also capture the syntactic and semantic relations among words (Rezaeinia et al., 2019). Although Word2Vec has been used in sentiment analysis related to COVID-19 pandemic (e.g., Mostafa, 2021, Sciandra, 2020, Yu et al., 2020), to the authors knowledge, no one has yet used it to analyze public attitudes towards vaccination in the online discussion, let alone the dynamic changes of vaccine sentiments over space and time. Against this backdrop, we proposed to analyze public attitudes towards vaccination in online debate from both spatial and temporal perspectives through the combination of Word2Vec and supervised machine learning techniques.

Methodology

Sentiment analysis has been widely used in different fields to study public perceptions of health-related issues, especially in the context of infodemiology related to the COVID-19 pandemic. However, as far as we know, no one has compared public attitudes towards vaccination before and after a major disease outbreak. In this paper, we take the United States as a case study and utilize a large-scale Twitter dataset to understand the dynamic public attitudes on social media towards vaccination from both spatial and temporal perspectives. More precisely, we start by introducing the Twitter dataset used in this study (Section 3.1), followed by hand-labeled tweets collection through the designed dynamic online labeling questionnaire (Section 3.2). Then we demonstrate an ensemble approach combining Word2Vec and machine learning techniques to develop a classifier for sentiment analysis (Section 3.3). The classifier was subsequently applied to identify the sentiment of unlabeled tweets. Lastly, in Section 3.4, we introduce a simple ratio between Anti and Pro-vaccine as a proxy to quantify vaccine hesitancy. Fig. 1 illustrates an overview of the research outline.

Fig. 1

An overview of research outline.

Twitter data collection

To illustrate the aforementioned approach, we collected tweets sent in the United States between Jan 2015 and July 2021 utilizing the geosocial system Croitoru et al. (2013) using the keyword “vaccination” or its derivatives that are often encountered in social media (i.e., “vaccine”, “vaccines”, “vax”, “vaxine”, and “vaxx”) which have been shown to be representative of the vaccine debate in social media (e.g., Radzikowski et al., 2016). However, due to the data collection infrastructure of the geosocial system, and the Twitter API more generally, non-related tweets can show up in the dataset. For example, if someone wrote “Who should get the new vaccine?” and another person quoted this tweet saying “Apparently me sometime before today”, both tweets would be assigned the vaccination label even if the later one does not directly mention vaccines. To obtain a clearer dataset, we removed these non-related tweets, yielding a total of 11,676,612 tweets sent by 2,591,363 users. Among them, 0.21% (24,791) have precise GPS locational information, where the coordinates were directly collected when a user shares their location at the time of tweet with a GPS enabled device. 3.07% (358,443) of them were geolocated by Twitter using its API. The remainder of the tweets (96.7%) had moderate locational information as they were geolocated based on the place name given in users' public profile description. Generally, the place name can infer a user's location at city or state level using a gazetteer lookup, such as Geonames (Schulz et al., 2013) and Yahoo! Placemaker (Kinsella et al., 2011) and has been shown to be an effective way to geolocate tweets (e.g., Cheng et al., 2010, Croitoru et al., 2013, Hecht et al., 2011). Fig. 2 displays the spatial distribution of tweets and Twitter users at the state level, as well as the word cloud of high-frequency words appeared in the Twitter dataset used.

Fig. 2

The spatial distributions of tweets and Twitter users (a) the spatial distribution of Tweets; (b) the spatial distribution of Twitter users; (c) Word Cloud of high frequency words in the Twitter data set. In order to check the Twitter data is representative of the vaccination narrative at large, we compared the percentage of keywords that appeared in tweets every month with the proportion of keywords queries analyzed in Google Trends. Although certain months may either have more or fewer tweets with the same keywords as with Google Trends, we can observe a strong positive correlation (R = 0.89, p < 0.001) between the two as shown in Fig. 3 . In addition, we can also see how Twitter has the ability to capture important news or announcements (see Fig. 3(c)). Both analysis indicate that the Twitter data is representative of the vaccination debate and can used for subsequent vaccine sentiment analysis.

Fig. 3

Comparison between Twitter data and Google Trends. (a) Distribution of keywords search and tweets over time; (b) Correlation between Google Trends and Twitter data of keywords search; (c) Important news or announcements catched on Twitter activity (Note: period is the shaded area in (a)).

Hand-labeled tweets collection

In order to carry out supervised learning, a certain number of the tweets needs to be classified as pro, anti or neutral with respect to their vaccine stance. This lays the foundation for developing the sentiment classifier during the model training process (Section 3.3). The labeling of tweets was done by creating a dynamic online labeling questionnaire. Each questionnaire contained 100 random tweets, and each tweet needed to be assigned a label from the three sentiment options: “Pro-vaccine”, “Neutral”, and “Anti-vaccine”. Table 1 lists examples of manually labeled tweets of each type of sentiment.

Table 1

Examples of hand-labeled tweets of each type of sentiment.

Sentiment Labels	Examples
Pro-vaccine	Measles can be serious for young children. Make sure your child is up to date on MMR vaccine.
Neutral	State records show Blair County kindergarten measles vaccination rate is among lowest in the state.
Anti-vaccine	The CDC states the 0 have died in the past 10 years of #measles while the vaccine has killed 108 according to VAERS.

Examples of hand-labeled tweets of each type of sentiment. To quantify the reliability of the hand-labeled tweets, we made some adjustments to the designed questionnaire. Specifically, we first created a sample pool that contained 15,000 random tweets. Each tweet was assigned a random value from Uniform distribution and was ranked based on the assigned value. Similarly, we generated 100 random values from the Uniform distribution for each questionnaire and ranked them from 1 to 100. Subsequently, we matched the ranks in the sample pool and extracted the matching tweets. It is important to point out that the ranks in both sample pool and questionnaire vary every time conducting the random selection. As such, we can assure there are some overlapping tweets between questionnaires to evaluate how consistent labels are across different participants. Moreover, we also made sure there were 5% repeated tweets in each questionnaire. Thus this ensured that a participant was presented with 5 of the same tweets twice, which allowed us to examine participants' self-consistency. The dynamic questionnaires were sent to 150 participants randomly. In the end, we received 96 responses, resulting in a total of 5,969 unique hand-labeled tweets. Simultaneously, we gathered all duplicated tweets from questionnaires and asked two researchers with domain knowledge to mark them manually, and used these labels as references in the reliability measurement. Reliability measurement aims to identify participants who are not only self-consistent (i.e., internal consistency, which refers to the consistency of people’s responses to tweets multiple times) but also have a certain degree of agreement among others in labeling behavior (i.e., inter-rater reliability, which refers to the consistency between different participants in their judgments of tweet sentiment). As the quality of the hand-labeled tweets has profound implications for the model development, we only utilized the hand-labeled tweets from participants who passed the reliability test. To be precise, We first measured the percentage of agreement of the duplicated tweets for each participant and only retained participants with higher than 50% agreement of themselves. Afterward, we used Cohen’s kappa statistic (McHugh, 2012) to examine the inter-rate reliability between participants and the researchers and excluded participants with scores below 0.4 (i.e., moderate agreement). In doing so, we identified 39 reliable participants, with a total of 2,032 unique hand-labeled tweets. In addition to the data quality, we expanded the training data while still preserving the quality by combining it with other tweets from the same data corpus which were labeled by Yuan et al. (2019). In the end, we obtained a total of 7,086 hand-labeled tweets for subsequent sentiment analysis. For interested readers, we provide source code and labeled tweets at: https://figshare.com/s/7b6a91c6949d63280e99.

Sentiment classification

Sentiment analysis is a common technique used to identify and categorize public attitudes and emotions expressed on social media platforms in response to different events - in our case, the vaccination debate. To tackle the sentiment analysis, we began with a minimal text preprocessing (Section 3.3.1), followed by machine labeling (Section 3.3.2).

Preprocessing

Text preprocessing generally relates to removing noise in the contents that do not contribute to the classification process. In this paper, we cleaned out contents by pruning multiple consecutive same characters, removing Unicode characters, hashtags, URLs, as well as replacing emoji with the respective emotion which is common practice (e.g., Solomon, 2021, Surikov and Egorova, 2020). In addition, as tweets are limited in their length, and people may use the same language to support opposing viewpoints, therefore, tackling sentiment classification on tweets is sensitive to some preprocessing steps, such as stop words and punctuation removals (Saif et al., 2014, Symeonidis et al., 2018). For example, one person may post “It is insane to get vaccinated” while another may say “It is insane to not get vaccinated”, and if we remove the stop-word “not”, a classifier may not distinguish the difference between the two. Therefore, we decided to keep both stop words and punctuation to help maintain the context of each tweet.

Machine labeling

After preprocessing the tweets, we applied Word2Vec to convert words mathematically into a vector representation. Compared to other methods like the BoW approach used within TF-IDF, the advantage of Word2Vec is that it represents a word in a relatively low dimensional vector space and can capture non-trivial (grammatical) associations between words while preserving their contexts. With this being said, words with semantic relations (i.e., similar meanings) are more likely to appear close to each other in the vector space, and vice versa. In this study, we used a pre-trained word2vec model released by Google (Rezaeinia et al., 2019) to compute the numeric representation of the input words for each tweet. The embeddings were then used as the input in a set of machine learning algorithms for classification, including Naive Bayes, Support Vector Machine (SVM), Logistic Regression, and Extreme Gradient Boosting (XGBoost) in order to see which one provides the best performance. More specifically, we used stratified sampling to split the hand-labeled tweets into a training (80%) and test (20%) set. The training set was utilized for model training and the test set was used to evaluate model performance. In addition, we used 5-fold Cross-Validation (CV) and hyperparameter tuning to optimize the model performance. By comparing the performance on the test set, we found XGBoost stands out from other algorithms with an accuracy of 74% (see Table 3). The hyperparameter settings of XGBoost are shown in Table 2 . Then, the XGBoost classifier was applied to detect the sentiments (i.e., Pro-vaccine, Anti-vaccine, Neutral) of each tweet in the rest data corpus.

Table 3

Performance metrics of the XGBoost classifier.

Performance metrics
	Precision	Recall	F1 Score
Anti-vaccine	0.78	0.61	0.69
Neutral	0.71	0.59	0.65
Pro-vaccine	0.74	0.87	0.80
Macro average	0.75	0.69	0.71
Weighted average	0.74	0.74	0.74
Accuracy	0.74

Table 2

Parameter settings that generates the best performance used in the XGBoost classifier.

Hyperparameter settings
Parameters	Values
Number of trees	500
Maximum depth of a tree	8
Learning rate	0.2
γ	0.0

Parameter settings that generates the best performance used in the XGBoost classifier. Performance metrics of the XGBoost classifier.

Vaccine hesitancy estimation

In addition to detecting the three sentiments (Pro-vaccine, Anti-vaccine, Neutral), we took one step further to analyze vaccine hesitancy. As discussed earlier, vaccine hesitancy is one of the main challenges for improving vaccine coverage. However, precise quantification of vaccine hesitancy can be challenging as vaccine hesitancy itself can be affected by various factors (e.g., confidence, complacency, convenience, demographic, cultural, social, and political factors), and can vary across time, place, and vaccines (Dubé et al., 2013, MacDonald, 2015). Traditional methods for estimating vaccine hesitancy are generally done by survey or focus group discussions (e.g., Detoc et al., 2020, Krishnamoorthy et al., 2019, Larson et al., 2016). These methods are naturally time-consuming, labor-intensive, and difficult to scale up. Moreover, survey samples may also have biases in the geographical distribution, for example, individuals who live in poorly accessible areas could be underrepresented. To circumvent the limitations, we proposed a metric called “A2P Ratio” (i.e., the ratio of Anti-vaccine to Pro-vaccine) as a simple proxy to quantify vaccine hesitancy due to its nuanced nature. The larger the ratio, the higher the vaccine hesitancy. To validate the proposed metric, we further compared it with the estimated vaccine hesitancy from CDC at the state level. By doing so, we are able to identify the potential correlation between the two, so as to use the “A2P Ratio” as a simple proxy to quantify vaccine hesitancy in a more efficient way. However, it is important to stress that the “A2P Ratio” introduced here only focuses on sentiment derived from the online vaccination discussions, and as such works as a simple alternative way to provide informed insights into vaccine hesitancy. Yet, to develop a more comprehensive understanding of vaccine hesitancy will require incorporating many other factors aforementioned, such as demographic, political, cultural factors which is beyond the scope of the current paper. We will come back to this issue in Section 5.

Results

Building upon our methodology, we now turn to the results of our study. We will first discuss the results of the classified vaccine sentiments at both the country and state levels, this will be followed by exploring the potential correlation between the “Pro-vaccine” sentiment and the actual vaccination records at the state level. Finally, we will compare the “A2P Ratio” with estimated vaccine hesitancy from CDC to evaluate the rationality of the proposed metric for estimating vaccine hesitancy. Table 4 displays the results that compare the public’s attitudes towards vaccination from Twitter before and after the COVID-19 outbreak. We used “1”, which is the date of symptom onset of the 1st patient in Wuhan, China (Huang et al., 2020), as the demarcation point and split the entire study period into two phases. In other words, any dates before the demarcation point were called “before COVID-19”, otherwise they were referred to as “after COVID-19”. Among the Twitter users in the whole dataset, “Pro-vaccine” users accounted for the greatest proportion (56.97%), followed by the “Anti-vaccine” users, which made up 22.80% of total users, and the proportion of “Neutral” users was the smallest (20.22%). This result indicated that the positive vaccine sentiment was the dominant opinion. Nonetheless, the rate of “Pro-vaccine” users decreased after the outbreak, dropping from 61.56% to 56.20%. Concurrently, the percentage of “Anti-vaccine” users revealed a modest increment after the outbreak, which grew from 20.17% to 23.28%, implying that the outbreak indeed moderately shifted public attitudes towards vaccination.

Table 4

Number of Tweets and Twitter Users by each sentiment category.

Sentiments	Number of Users	Number of Tweets
Entire period
Pro-vaccine	2,055,959 (56.97%)	7,190,846 (61.58%)
Neutral	729,868 (20.22%)	2,158,271 (18.48%)
Anti-vaccine	822,926 (22.8%)	2,327,495 (19.93%)
Before COVID-19
Pro-vaccine	544,365 (61.56%)	1,655,642 (60.56%)
Neutral	161,609 (18.28%)	457,925 (16.75%)
Anti-vaccine	178,339 (20.17%)	620,103 (22.68%)
After COVID-19
Pro-vaccine	1,631,444 (56.2%)	5,535,204 (61.89%)
Neutral	595,655 (20.52%)	1,700,346 (19.01%)
Anti-vaccine	675,706 (23.28%)	1,707,392 (19.09%)
Average number of tweets per user: 5 tweets

Number of Tweets and Twitter Users by each sentiment category. To further investigate the dynamics of vaccine sentiment over time, we split the proportion of the three vaccine sentiments into years. Fig. 4 demonstrates the percentages of users (orange) and tweets (green) by the three sentiments in different years. The results show the rate of “Anti-vaccine” users approached the highest point in 2020, then slightly shrank in 2021, although the overall “Anti-vaccine” sentiment was higher compared to that of before the outbreak. One possible reason for this could be the uptake of the coronavirus vaccine(s) in some cases is accompanied by various side effects that have not been comprehensively explained scientifically (Ledford, 2020). Moreover, the severity of the side effects (e.g., minor to moderate adverse effects like fever, headache, tiredness, muscle pain, or serious safety problems such as anaphylaxis and thrombosis with thrombocytopenia syndrome (CDC, 2020)) vary from different brands of coronavirus vaccines for diverse physiques (Menni et al., 2021), resulting in negative vaccine safety concerns. This finding was coincided with Yousefinaghani et al.’s (2021) results, where the authors discovered the negative vaccine sentiment significantly increased in late 2020 when different brands of coronavirus vaccines came out. A similar trend was observed when disaggregating the sentiments at the state level as shown in Fig. 5 . For most states, the rate of “Anti-vaccine” users increased in 2020 compared to 2019 and showed a minimal drop in 2021, while the changes in the rate of “Pro-vaccine” users over time are the opposite.

Fig. 4

Distribution of tweets and Twitter users by different sentiment before and after the COVID-19 outbreak.

Fig. 5

Distribution of state-level vaccination sentiment from 2015 to 2021, (a) Percentage of Pro-vaccine users; (b) Percentage of Anti-vaccine users.

Distribution of tweets and Twitter users by different sentiment before and after the COVID-19 outbreak. Distribution of state-level vaccination sentiment from 2015 to 2021, (a) Percentage of Pro-vaccine users; (b) Percentage of Anti-vaccine users. Moreover, in order to understand the potential correlation between the positive vaccine attitude online and the actual vaccination rate offline, especially after the COVID-19 outbreak, we compared the odds ratio of the Pro-vaccine users after the outbreak to that of the coronavirus vaccinations in each state. The coronavirus vaccinations data comes from “Our World in Data” (Ritchie et al., 2020) which records the total number of people in each state who have been fully vaccinated and the number is updated daily according to governmental reports. It is necessary to note that to be in line with the study period, we chose the number of vaccinations reported on 6th July 2021 as the reference for the comparison. The odds ratio is a statistic generally utilized in medical studies (Bland and Altman, 2000), however, it is increasingly being used in a geographic context, especially in social media analysis, to alleviate size-related issues, such as uneven spatial distribution of tweets (e.g., Shelton et al., 2015, Zook et al., 2017). In this paper, we calculated the odds ratio of the “Pro-vaccine” users and vaccination records as follows: where refers to the total numbers. Fig. 6 (a) & (b) display the spatial distribution of the odds ratios of the “Pro-vaccine” users and vaccination records, separately. The results reveal there was geographic difference in Pro-vaccine sentiment on Twitter. More specifically, states, such as Massachusetts (MA), Connecticut (CT), Vermont (VT), Colorado (CO), Washington (WA), New York (NY), had relatively higher Pro-vaccine odds than other states (see Fig. 6 (a)). Part of the reason for this could be attributed to the relatively complete health system in these states as it has been shown that a well-functioning health system is crucial for improving vaccine coverage (Lahariya, 2015). According to the Commonwealth Fund’s (2020) scorecard on state health system performance, where the scorecard was assessed based on a series of indicators, including access to health care, quality of care, service use and costs of care, health outcomes, as well as income-based health care disparities, these states were ranked in the top 10. Our finding follows a similar trend to that of the actual vaccination rate (see Fig. 6 (b)), which implies the positive attitude regarding vaccines identified from the social media data can, to some extent, reflect the actual vaccination rate offline. This was further validated by measuring the correlation coefficient between the two. Fig. 6 (c) presents the correlation between the odds ratio of actual vaccination records and the odds ratio of Pro-vaccine users, where a positive correlation () between the two was observed. We argue that the proposed approach for identifying positive vaccine sentiments online can be used as an indicator for evaluating offline vaccination rates.

Fig. 6

Correlation between Pro-vaccine users and actual vaccination records (a) Spatial distribution of odds ratio of Pro-vaccine users; (b) Spatial distribution of odds ratio of actual vaccination records; (c) Correlation between the Pro-vaccine users and the actual vaccination records. In addition, to demonstrate public’s attitudes identified from online vaccine discussion can be useful for predicting the tendency of vaccines hesitancy offline, we used the A2P Ratio (see Equation (4.3)) as a proxy of vaccine hesitancy prediction and compared it with the estimated vaccine hesitancy rate from CDC1 , which is measured based on the U.S. Census Bureau's Household Pulse Survey (HPS)2 . In doing so, we were able to detect the correlation between the A2P ratio and the estimated vaccine hesitancy. By mapping the spatial distribution of both normalized A2P ratio and estimated vaccine hesitancy, we contextualized the findings in Fig. 7 . We observed that relatively higher A2P ratios and estimated vaccine hesitancy are mostly entrenched in states in the West and South, such as Wyoming (WY), Arkansas(AR), Florida (FL), Louisiana (LA), Nevada (NV), and so on. Besides this, we also observed that WY stood out from the other states in both maps, appearing as the most vaccine-hesitant state in the country, which was consistent with the result in Douthit et al.’s (2015) study. An important reason may be the inequality in resources allocation and distribution. More specifically, these states have a high proportion of people living in more rural areas who have greater difficulties in seeking medical care due to poor quality of medical services, scarcity of medical staff, lack of public transport, and more (Douthit et al., 2015). Other possible reasons could be, for example, cultural conservatism, safety concerns, distrust of government, low health literacy, and so on (Tram et al., 2021). The results indicate the proposed A2P ratio has the ability to capture a comparable pattern as the estimated vaccine hesitancy obtained from a survey research. To quantified the relationship between the two, we conducted a correlation analysis and found they indeed had a positive correlation () as shown in Fig. 7. This finding implies the proposed A2P ratio can effectively estimate vaccine hesitancy, complementing the limitations of vaccine hesitancy based on survey research, such as difficulties in scaling up, time-consuming, and labor-intensive.

Fig. 7

Vaccine hesitancy (a) Spatial distribution of Anti-vaccine to Pro-vaccine ratio; (b) Spatial distribution of estimated vaccine hesitancy from CDC; (c) Correlation between Anti-vaccine to Pro-vaccine ratio and the estimated vac- cine hesitancy from CDC.

Discussion and conclusion

In this study, we proposed a method to identify the dynamic changes in the public's sentiments towards vaccination based on social media data with respect to spatial and temporal aspects. We do so by using sentiment analysis techniques to categorize vaccine sentiment into three classes, namely “Pro-vaccine”, “Anti-vaccine” and “Neutral”. The sentiment analysis was operationalized by combining a machine learning algorithm with a pre-trained Word2Vec model leveraging a large-scale Twitter dataset. The analytical latitude offered by social media data give more possibilities for comparing changes in vaccine sentiments before and after a disease outbreak. As such, we argue the methodology presented here offers a scan of the changing pulse of the public’s perceptions regarding vaccination in light of vaccination campaigns and policy decision-making facing the current COVID-19 and future public health challenges. In addition, we introduced a metric named “A2P ratio”, which is derived from the identified positive and negative vaccine sentiments, to estimate vaccine hesitancy. The proposed A2P ratio provides a way to effectively monitor vaccine hesitancy in near real-time, which could complement the limitations of vaccine hesitancy based on survey research, such as difficulties with respect to scaling up, time-consuming and labor-intensive nature of interviewing people, etc. By using the United States as a test case, we revealed how sentiment analysis techniques combined with social media data can be utilized to capture a comprehensive picture of vaccine sentiment in online discussions from 2015 to 2021. Furthermore, we showed how the odds ratio is a helpful method for alleviating size-related issues (Shelton et al., 2015, Zook et al., 2017) when it comes to vaccine sentiment. The analysis illustrates its capacity to identify differences in geographic and temporal of Twitter users. Specifically, we discovered how positive vaccine sentiment was dominant throughout the years, although there was a slight decline in 2020 when different brands of coronavirus vaccines came out. Possible reasons for this could be due to concerns of side effects that have not been fully understood scientifically. A similar pattern was observed when disaggregating vaccine sentiments at the state level. Moreover, we found the relatively higher Pro-vaccine odds generally appeared in states that have relatively complete health systems. This is reasonable as a well-functioning health system plays an influential role in vaccine campaign (Lahariya, 2015). In addition, the Pro-vaccine odds displayed a positive correlation with the actual vaccination odds. Although the correlation was moderate, it still demonstrates the capability to be used as a preliminary reflection of the vaccination rate. However, the most intriguing insight is the proposed A2P ratio for evaluating vaccine hesitancy. We found states with relatively high vaccine hesitancy were generally concentrated in the Western and Southern states, such as Wyoming and Arkansas. Wyoming, in particular stood out from the other states appearing as the most vaccine-hesitant state in the country. As noted above, partial reasons for this could be attributed to the unequal distribution of medical resources. Moreover, the result showed there was a positive relationship between the A2P ratio and the estimated vaccine hesitancy based on survey research. As discussed earlier, vaccine hesitancy has been one of the key challenges in improving vaccination campaigns. As such, effectively evaluating vaccine hesitancy is crucial for enhancing vaccine uptake. However, deriving vaccine hesitancy directly from tweets as opposed to pro and anti-vaccine sentiment is a major challenge and we would argue that the proposed A2P ratio could provide informed insights into vaccine hesitancy in near real-time, which could helpful for health professionals and policymakers to propose more precise and promotion plans and effective strategies to improve vaccine coverage so as to control the transmission of viruses in the society. The study presented here can certainly be expanded. First, we solely tested the methodology on the Twitter dataset, while other social media datasets, such as Facebook and Weibo, should theoretically be compatible with it. Second, the social media data have certain limitations, such as sampling bias and lack of demographic information (Yuan et al., 2020). As such, future studies can make more efforts to generalize these sorts of data, such as calibrating or evaluating bias (Jiang et al., 2019, Longley et al., 2015), before utilizing them in public health policy. Second, it is crucial to mention that the “A2P Ratio” metric, introduced here is not intended to precisely predict vaccine hesitancy based only on “Anti-vaccine” and “Pro-vaccine” sentiment derived from the online vaccination discussion, but rather to provide an initial quantitative informed insights of the potential trend of vaccine hesitancy. Thus, integrating this metric with other data sources, such as census data, could enhance our understanding of vaccine hesitancy. Moreover, as different public concerns, views, and responses may result in negative sentiments or vaccine hesitancy. Therefore, it is of interest to explore key topics or themes discussed online about vaccination in the future. One such approach for this could be semantic network analysis, which deduces themes from discussions by analyzing word co-occurrence in the texts (e.g., Featherstone et al., 2020, Ruiz et al., 2021). This could be helpful for identifying more specific factors that drive different public attitudes regarding vaccination. Nonetheless, the study demonstrates the effectiveness and practicality of integrating machine learning with pre-trained Word2Vec model, allowing quantitative research on social media data to uncover insights of public perceptions towards vaccination. The methodology proposed here takes into account the complex and changing sentiments of the public on vaccination and can be expanded to assess other general public opinions related to health issues.

CRediT authorship contribution statement

Qingqing Chen: Methodology, Formal analysis, Visualization, Writing – original draft, Writing – review & editing. Andrew Crooks: Supervision, Conceptualization, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

40 in total

1. The history of vaccines and immunization: familiar patterns, new challenges.

Authors: Alexandra Minna Stern; Howard Markel
Journal: Health Aff (Millwood) Date: 2005 May-Jun Impact factor: 6.301

2. The overlooked dangers of anti-vaccination groups' social media presence.

Authors: Ayelet Evrony; Arthur Caplan
Journal: Hum Vaccin Immunother Date: 2017-04-13 Impact factor: 3.452

3. History of vaccination.

Authors: Stanley Plotkin
Journal: Proc Natl Acad Sci U S A Date: 2014-08-18 Impact factor: 11.205

4. Anti-vaccine activists, Web 2.0, and the postmodern paradigm--an overview of tactics and tropes used online by the anti-vaccination movement.

Authors: Anna Kata
Journal: Vaccine Date: 2011-12-13 Impact factor: 3.641

Review 5. The COVID-19 pandemic: Impacts on cities and major lessons for urban planning, design, and management.

Authors: Ayyoob Sharifi; Amir Reza Khavarian-Garmsir
Journal: Sci Total Environ Date: 2020-09-18 Impact factor: 7.963

6. The COVID-19 epidemic.

Authors: Thirumalaisamy P Velavan; Christian G Meyer
Journal: Trop Med Int Health Date: 2020-02-16 Impact factor: 2.622

7. Factors related to vaccine hesitancy during the implementation of Measles-Rubella campaign 2017 in rural Puducherry-A mixed-method study.

Authors: Yuvaraj Krishnamoorthy; Sivaranjini Kannusamy; Gokul Sarveswaran; Marie Gilbert Majella; Sonali Sarkar; Vishwanath Narayanan
Journal: J Family Med Prim Care Date: 2019-12-10

1. Information propagation on cyber, relational and physical spaces about covid-19 vaccine: Using social media and splatial framework.

Authors: Fuzhen Yin; Andrew Crooks; Li Yin
Journal: Comput Environ Urban Syst Date: 2022-09-14

1 in total