Yihua Su1, Aarthi Venkat2, Yadush Yadav1, Lisa B Puglisi3, Samah J Fodeh4. 1. Health Informatics Program, Yale School of Public Health, 60 College St, New Haven, CT, 06510, USA. 2. Computational Biology and Bioinformatics Program, Yale University, 300 George Street, Suite 501, New Haven, CT, 06511, USA. 3. SEICHE Center for Health and Justice, Yale School of Medicine, 333 Cedar St, New Haven, CT, 06510, USA; Pain Research, Informatics, Multimorbidities and Education Center, VA Connecticut Healthcare System, 950 Campbell Avenue, West Haven, CT, 06516, USA. 4. Health Informatics Program, Yale School of Public Health, 60 College St, New Haven, CT, 06510, USA; Computational Biology and Bioinformatics Program, Yale University, 300 George Street, Suite 501, New Haven, CT, 06511, USA; Department of Emergency Medicine, Yale School of Medicine, 333 Cedar St, New Haven, CT, 06510, USA. Electronic address: samah.fodeh@yale.edu.
Abstract
OBJECTIVE: We sought to understand spatial-temporal factors and socioeconomic disparities that shaped U.S. residents' response to COVID-19 as it emerged. METHODS: We mined coronavirus-related tweets from January 23rd to March 25th, 2020. We classified tweets by the socioeconomic status of the county from which they originated with the Area Deprivation Index (ADI). We applied topic modeling to identify and monitor topics of concern over time. We investigated how topics varied by ADI and between hotspots and non-hotspots. RESULTS: We identified 45 topics in 269,556 unique tweets. Topics shifted from early-outbreak-related content in January, to the presidential election and governmental response in February, to lifestyle impacts in March. High-resourced areas (low ADI) were concerned with stocks and social distancing, while under-resourced areas shared negative expression and discussion of the CARES Act relief package. These differences were consistent within hotspots, with increased discussion regarding employment in high ADI hotspots. DISCUSSION: Topic modeling captures major concerns on Twitter in the early months of COVID-19. Our study extends previous Twitter-based research as it assesses how topics differ based on a marker of socioeconomic status. Comparisons between low and high-resourced areas indicate more focus on personal economic hardship in less-resourced communities and less focus on general public health messaging. CONCLUSION: Real-time social media analysis of community-based pandemic responses can uncover differential conversations correlating to local impact and income, education, and housing disparities. In future public health crises, such insights can inform messaging campaigns, which should partly focus on the interests of those most disproportionately impacted.
Early in the course of the COVID-19 pandemic, with no specific treatment for the disease available and fears of the burden of illness overwhelming health systems, the primary public health focus was on disease mitigation strategies [[1], [2], [3]], and it still is almost a year later. New concepts were introduced to the general public, such as social distancing and recommendations for routine masking. These mitigation efforts along with others, including travel bans, shelter-in-place orders, and school closures, were anticipated to negatively affect many sectors of the United States (U.S.) economy, and they have drastically changed the quotidian lives of most Americans. Given marked community-level socioeconomic disparities and segregation in the U.S. that predated COVID-19, these measures were likely to have disparate uptake by and impact on Americans depending on where they live [4].
With the expansive geography of the U.S. and modern-day travel patterns, the disease initially was largely localized in a few cities, and these so-called “hotspots” were a primary focus of much of the initial media coverage [5]. However, as expected, other COVID-19 hotspots with large marginalized populations later emerged [6,7]. This brought to the forefront the need to understand differential reactions to the crisis as a tool for shaping public health communication and allocation of health resources.
Social media has been a prominent venue for personal and public health communication, both in previous public health crises and now with COVID-19. Twitter, in particular, has the advantage over some other social media platforms of providing brief, real-time content availability with access to networks of similar discussions through hashtags. Twitter has been used to assess mitigation strategies such as social distancing [8,9] and estimate mobility dynamics within and across states [10].
However, Twitter has not, to our knowledge, been used as a tool to identify trends in public responses to a health crisis at the local level, while factoring in socioeconomic status. Understanding the public responses and reactions at the initial stage of the pandemic across areas with socioeconomic disparities can better inform future public health guidelines and communication under similar circumstances.
In this study, we sought to leverage a novel approach that utilizes Twitter to understand how social media analysis can provide insight on local-level concerns that can guide future public health pandemic messaging. Specifically, we investigated two hypotheses: (1) there are differential concerns across less-resourced areas (high ADI) and high-resourced areas (low ADI), and (2) there exist differential concerns across hotspots and non-hotspots.
In the following section, we provide a brief review of the related literature, and in Materials and Methods, we describe the Twitter data and the methods used for analysis. In the Results section, we present our findings, and we comment on them in the Discussion section. We also report the limitations of the study in the Limitation subsection and finally conclude the paper in the Conclusion section.
Related work
To provide greater context for understanding our use of Twitter in this study, we first provide a general and brief review of how natural language processing and analytics of Twitter data have been used as research and public health tools to characterize, contextualize and monitor health conditions. Pre-COVID-19, social media research in the context of health was primarily focused on examining the patient experience [[13], [14], [15], [16], [17]]. Comments and reviews on Twitter were used to measure healthcare quality [15] and monitor patient health status along with sentiment level [17]. Twitter has also been useful in understanding social networks, public health messaging, and forecasting spread [[19], [20], [21], [22]]. Twitter played an important role in Ebola outbreak surveillance by contributing to disease surveillance efforts, detecting an epidemic nearly a week before its first case [20]. Influenza infection rate [21] and Zika Virus case number [22] predictions, learned from the tweet count pattern of disease-related tweets, have also proven successful.
During COVID-19, Twitter has been used to capture self-reported symptoms of COVID-19 [23] and explore fake news and rumors related to the pandemic [24]. Many studies have explored the utility of using advanced data analytics such as neural networks to study the spread and impact of COVID-19 [25]. Different types of data were utilized in these studies, including medical image data harnessed for early detection of COVID-19 [26], mortality and recovery rates leveraged to measure the security levels of the pandemic [27], and mobility data of cellphone users for monitoring impacts on the spread of COVID-19 [28,29]. Twitter data has also been used to learn more about COVID-19 spread and impact. It has been used to assess mitigation strategies such as social distancing [18,19], capture self-reported symptoms of COVID-19 [20], and identify differential psychological impacts of lockdowns using hashtags [30].
Materials and methods
Twitter dataset
The dataset we used for this analysis is composed of Twitter entries (tweets) in English posted by users in the United States from January 23rd to March 25th, 2020. We mined the tweets with Twitter's standard search API, which returns a sampling of relevant tweets matching a specific query [31]. This search service is not meant to be an exhaustive source of tweets, and is instead optimized for relevance to the query. We queried for keywords ‘coronavirus’, ‘corona virus’, ‘corona’, ‘covid’, ‘covid-19’, ‘covid 19’, and ‘covid19’. For each tweet, the Twitter standard search API provides detailed tweet attributes, including unique de-identified user ID, time and text of the tweet, and four geographic coordinates (latitude and longitude) delineating the bounding box [32] from which the tweet was posted. For privacy reasons, Twitter does not provide the exact location from which tweets were posted. Fig. 1 demonstrates the overall workflow of the analysis given these tweet attributes, which will be further detailed in the following sections.
Fig. 1
Data integration and analysis workflow.
Preprocessing of tweets
We pre-processed the tweets following standard data cleaning practices [33] through the removal of punctuation marks, numbers, emojis, URLs, stop words, and end-of-line characters. We then shortened the remaining words to their roots using the stemmer package provided by the NLTK toolkit [34]. We removed tweets with missing or invalid data, such as those without a month or date of entry, a valid user ID, or valid stemmed tweet text. Finally, we filtered out tweets containing only words that occurred in fewer than 20 documents or in more than 50% of all documents (the latter threshold excluded only “coronavirus”) in order to achieve better topic models. This is a common approach [35,36], used to avoid spurious associations by excluding words based on their frequency distribution.
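The cleaning and frequency-filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the regular expressions, and the function names `clean_tweet` and `filter_by_document_frequency` are assumptions made for the example. Stemming is left as a pluggable `stem` argument (for instance, the `stem` method of an NLTK `PorterStemmer` could be passed in), since the text does not say which NLTK stemmer was used.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); the paper does not name the list used.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "for", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text, stem=lambda word: word):
    """Lowercase, strip URLs, punctuation, digits and emojis, drop stop words, stem."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_LETTER_RE.sub(" ", text)  # removes punctuation, numbers and emojis
    return [stem(word) for word in text.split() if word not in STOP_WORDS]


def filter_by_document_frequency(docs, min_docs=20, max_fraction=0.5):
    """Drop tokens seen in fewer than `min_docs` documents or in more than
    `max_fraction` of all documents, then drop documents left with no tokens."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    keep = {
        token
        for token, freq in doc_freq.items()
        if freq >= min_docs and freq <= max_fraction * n_docs
    }
    filtered = [[token for token in doc if token in keep] for doc in docs]
    return [doc for doc in filtered if doc]
```

The defaults `min_docs=20` and `max_fraction=0.5` mirror the thresholds stated in the text; everything else is a placeholder.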
Reverse geocodes of tweets
We employed GeoPy [37] to reverse geocode the coordinates and output the county and state names of each tweet. As the bounding box provides enough information to confidently geotag the tweet at the county resolution, we used the midpoint of the rectangle of latitude and longitude coordinates of each tweet as the effective location. This location was then linked to a five-digit FIPS code, a code designed to uniquely identify counties and states in the U.S., to determine the location of tweets at the county level. We followed a similar approach in our previous work [9] to map tweets to the county level.
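The midpoint step can be illustrated with a short sketch. It assumes the bounding box arrives as four (longitude, latitude) pairs, which is an assumption about the input layout rather than something stated in the text, and the function name is hypothetical. The resulting (latitude, longitude) pair would then be passed to a reverse geocoder and mapped to a FIPS code; neither of those steps is shown here.

```python
def bounding_box_midpoint(corners):
    """Return (latitude, longitude) at the center of a rectangular bounding box.

    `corners` is a sequence of four (longitude, latitude) pairs; this ordering
    is an assumption about how the bounding box is delivered.
    """
    lons = [lon for lon, _ in corners]
    lats = [lat for _, lat in corners]
    mid_lat = (min(lats) + max(lats)) / 2
    mid_lon = (min(lons) + max(lons)) / 2
    return mid_lat, mid_lon
```

Taking the mean of the minimum and maximum on each axis gives the center of the rectangle regardless of the order in which the corners are listed.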
Area Deprivation Index (ADI) designation
We leveraged ADI from The Neighborhood Atlas [38], a location-based socioeconomic index at the census-block-group level, which incorporates income, education, employment, and housing data and has been used to inform health delivery and policy. ADI scores range from 0 to 100, where 0 corresponds to low deprivation and 100 corresponds to high deprivation. We mapped the location of each tweet, derived from reverse geocoding, to the median ADI score of all the census block groups within the county using its FIPS code. Counties were considered “low”, “mid”, or “high” ADI based on the ADI distribution of the unique counties represented in the dataset. Low ADI designation was assigned to counties in the lowest quintile of the ADI distribution of represented counties, and high ADI designation was assigned to counties in the highest quintile of the distribution, as has been done in other studies using ADI [39,40].
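A minimal sketch of the county-level designation follows, using only the Python standard library. The function names, the input dictionaries, and the handling of counties that fall exactly on a quintile boundary are assumptions for illustration; the text does not specify how ties at the cut points were treated or how the quintiles were computed.

```python
from statistics import median, quantiles


def county_median_adi(block_group_adi):
    """Map county FIPS code -> median ADI over that county's census block groups."""
    return {fips: median(scores) for fips, scores in block_group_adi.items()}


def designate_adi_groups(county_adi):
    """Label each county 'low', 'mid', or 'high' by quintile of county ADI.

    Lowest quintile -> 'low', highest quintile -> 'high', everything else 'mid'.
    Boundary handling (<= for low, > for high) is an assumption.
    """
    cuts = quantiles(county_adi.values(), n=5)  # four quintile cut points
    low_cut, high_cut = cuts[0], cuts[-1]
    labels = {}
    for fips, adi in county_adi.items():
        if adi <= low_cut:
            labels[fips] = "low"
        elif adi > high_cut:
            labels[fips] = "high"
        else:
            labels[fips] = "mid"
    return labels
```

Because the quintiles are taken over the unique counties in the dataset rather than over tweets, the three groups can hold very different numbers of tweets.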
Hotspot identification
We defined hotspots in January and February as areas with any cases of COVID-19 because there were few U.S. cases in these months and they were concentrated (as published by the New York Times [41]). For analyzing hotspots in March, we leveraged the curated resource The U.S. COVID-19 Atlas [42], defining a tweet as from a hotspot if the county was listed among the published population-adjusted hotspots.
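The month-dependent hotspot rule above can be written as a small lookup. This is a sketch only: the function name and the two input sets (counties with any reported case, and counties on the published March hotspot list) are hypothetical stand-ins for the New York Times and U.S. COVID-19 Atlas data, which would need to be loaded separately.

```python
def is_hotspot(county_fips, month, counties_with_cases, march_hotspots):
    """Apply the month-dependent hotspot definition to one tweet's county.

    January/February: any county with at least one reported case.
    March: any county on the published population-adjusted hotspot list.
    """
    if month in (1, 2):
        return county_fips in counties_with_cases
    if month == 3:
        return county_fips in march_hotspots
    raise ValueError("only months 1-3 are covered by the study period")
```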
Topic modeling
We used the Latent Dirichlet Allocation (LDA) approach [43] for topic modeling. LDA is an unsupervised approach and has been shown to be successful at modeling topics in tweets [44]. We leveraged LDA from the MALLET package [45] and the “gensim” package in Python to detect topics from COVID-related tweets. To determine the optimal number of topics, we compared models by their coherence scores, which act as a proxy for interpretability by measuring the degree of semantic similarity between top words in the topic [46]. We used the topic-word distribution to annotate topics: we first ranked the words of a topic by probability and then assigned the underlying theme.
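The model-selection and annotation steps can be sketched independently of the LDA fitting itself, which is not reproduced here. The sketch assumes that coherence scores have already been computed for each candidate number of topics and that a topic is available as a word-to-probability mapping; both function names are hypothetical.

```python
def select_num_topics(coherence_by_k):
    """Return the number of topics whose model has the highest coherence score.

    `coherence_by_k` maps a candidate topic count to its coherence score.
    """
    return max(coherence_by_k, key=coherence_by_k.get)


def top_words(topic_word_probs, k=10):
    """Rank one topic's words by probability and return the k most probable."""
    ranked = sorted(topic_word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```

The ranked word list is what a human annotator would read to assign the theme; the naming step itself remains a manual judgment.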
Spatiotemporal analysis
We leveraged the document-topic probability distribution for this analysis. We compared topic prevalence over time, across low and high ADI areas, between hotspot and non-hotspot areas, and within hotspots between low ADI and high ADI areas.
Temporal analysis of topic prevalence
To understand how public reactions to COVID-19 varied temporally, we averaged the topic distributions of all tweets for each month. We then compared the average scores of all topics over time. For selected topics, we plotted the daily topic dynamics to demonstrate how the topic distribution changed.
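The monthly averaging can be illustrated with a short sketch. It assumes each tweet is represented as a pair of a month number and its document-topic probability vector; the function name and input layout are assumptions for the example.

```python
from collections import defaultdict


def monthly_topic_means(tweets):
    """Average document-topic distributions per month.

    `tweets` is an iterable of (month, topic_probabilities) pairs, where every
    probability list has the same length (one entry per topic).
    """
    sums = {}
    counts = defaultdict(int)
    for month, probs in tweets:
        if month not in sums:
            sums[month] = [0.0] * len(probs)
        for index, prob in enumerate(probs):
            sums[month][index] += prob
        counts[month] += 1
    return {month: [total / counts[month] for total in sums[month]] for month in sums}
```

The same function applies to the daily dynamics if the grouping key is a date instead of a month.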
Spatial analysis of topic prevalence
We anticipated that the topic differences across areas with differing ADIs would be skewed; thus, we used the log of the odds ratio (log odds ratio), a common approach to transform skewed data to a normal distribution [47], to compare the topic differences across area groups. To compare the dominant topics in counties of low versus high ADI designation, we computed the log odds ratios of dominant topics in both groups. We first identified the dominant topic (the topic with the highest probability) for each tweet, then we calculated the log odds ratio of dominant topics between the two groups to achieve a fair comparison. The log odds ratio of a topic indicates how much more likely that topic is to dominate in one group than in the other.
The odds that any topic T dominates in a group G are calculated as:
The log odds ratio of any topic T between two groups G0, G1 is calculated as:
We used the same approach to compare topic prevalence between hotspots and non-hotspots. All of the calculations above were done in Python, using the packages “NumPy” and “math”.
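The two formulas referred to above are not present in this text. A standard formulation consistent with the description would take p as the fraction of tweets in group G whose dominant topic is T, define the odds as p / (1 - p), and define the log odds ratio as log(odds in G1 / odds in G0). The sketch below implements that reading. The direction of the ratio (G1 over G0), the function names, and the refusal to handle topics that never or always dominate are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def dominant_topic(topic_probs):
    """Index of the highest-probability topic for one tweet."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)


def odds_of_dominance(topic, dominant_topics):
    """Odds p / (1 - p) that `topic` is the dominant topic within one group.

    `dominant_topics` lists the dominant topic of every tweet in the group.
    """
    p = Counter(dominant_topics)[topic] / len(dominant_topics)
    if p in (0.0, 1.0):
        raise ValueError("log odds ratio is undefined when a topic never or always dominates")
    return p / (1 - p)


def log_odds_ratio(topic, group0, group1):
    """log(odds in group1 / odds in group0); positive values favor group1."""
    return math.log(odds_of_dominance(topic, group1) / odds_of_dominance(topic, group0))
```

Under this convention a value of zero means the topic is equally likely to dominate in both groups, and swapping the groups only flips the sign.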
Statistical validation
We implemented the chi-squared test and independent t-test to assess the differences in discussed topics across geographically grouped tweets. More specifically, the chi-squared test was used to validate the hypotheses stated in the Introduction that (1) there are differential concerns across less-resourced areas (high ADI) and high-resourced areas (low ADI), and (2) there exist differential concerns across hotspots and non-hotspots.
The chi-squared test determines whether there were statistically significant differences between the expected and observed dominant topic frequencies across the ADI groups and hotspot groups. Following related research, we acknowledged that Twitter data might be imbalanced [48,49] and therefore used Welch's unequal variances t-test, which is more robust than Student's t-test for skewed distributions and unequal sample sizes [50], to identify the topics that differed significantly between the groups. Formally, the t-test determines whether there was a difference between the means of the dominant topic probabilities in the low and high ADI groups. All of the statistical validations above were conducted in SPSS.
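The authors ran these tests in SPSS. Purely as an illustration of what the two statistics measure, the sketch below computes the Pearson chi-squared statistic with its degrees of freedom for a contingency table, and Welch's t statistic with the Welch-Satterthwaite degrees of freedom for two samples. The function names are hypothetical, and p-values are deliberately not computed because the standard library has no distribution functions for them; in practice a statistics package (for example `scipy.stats.chi2_contingency` and `scipy.stats.ttest_ind` with `equal_var=False`) would supply them.

```python
import math
from statistics import mean, variance


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` is a list of rows of observed counts (e.g. topics x ADI levels).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def welch_t(sample_a, sample_b):
    """Welch's unequal-variances t statistic and approximate degrees of freedom."""
    var_a = variance(sample_a) / len(sample_a)
    var_b = variance(sample_b) / len(sample_b)
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof
```

For a table of 45 dominant topics by 3 ADI levels, the degrees of freedom are (45 - 1) × (3 - 1) = 88, and for 45 topics by 2 hotspot categories they are 44, which agrees with the df values reported in Table 2.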
Results
Preprocessing and integration of tweets
Pre-processing resulted in 269,556 tweets from 119,611 Twitter users (of which only 63 users had more than 100 tweets). This dataset represents 1331 counties from all 50 states, the District of Columbia, and Puerto Rico. The range of the ADI is from 3 to 98. Fig. 2 diagrams the pre-processing workflow. Table 1 summarizes the characteristics of the final dataset.
Fig. 2
Object process diagram of tweet pre-processing.
Table 1
Characteristics of Dataset. Summary statistics of Twitter dataset in terms of user, geographic, and socioeconomic distribution.
| Tweet Characteristics (n = 269,556) | Value |
|---|---|
| Southern States | 36.65% (n = 98,792) |
| Western States | 28.10% (n = 75,745) |
| Northeastern States and DC | 20.42% (n = 55,043) |
| Midwestern States | 14.64% (n = 39,462) |
| Puerto Rico | 0.19% (n = 514) |
| Low ADI (3–43) | 50.07% (n = 134,967) |
| Mid ADI (43.5–77) | 45.65% (n = 123,052) |
| High ADI (77.5–98) | 4.29% (n = 11,537) |
| Mean Tweet Count per User | 2.25 tweets |
| Median Tweet Count per User | 1 tweet |
| Max Tweet Count per User | 456 tweets |
We evaluated models ranging from 10 to 50 topics and selected the model with the highest coherence score (0.571), which had 45 topics. Coherence scores for 10 to 50 topics are plotted in Supplementary Fig. 1. We named topics based on the common theme of the top words. For example, we defined topic 1 as “Shopping” due to top words “toilet”, “paper”, “store”, “shop”, “walmart”, and “groceri” (stemmed version of groceries). The top 10 words in each topic are shown using word clouds in Fig. 3, wherein the font size in each plot reflects the importance of a word in a specific topic. Representative tweets (tweets with the highest probability of belonging to the given topic) for all topics are available in Supplementary Table 1.
Fig. 3
Visualization of the top 10 words in all topics.
Comparing topic prevalence over time
We present the topic dynamics from January to March, including the average distribution of topics that peaked by month. For each month, topic prevalence compared to both of the other months had a significance of p < .0001 unless indicated otherwise.
In January (Fig. 4), there were significant peaks in topics such as intense expression, negative expression, and personal expression (vs. Mar, p < .001). These topics are associated with profanity, anxiety, and emotions. There was also a peak in discussion regarding an early understanding of the novel disease, namely symptoms, flu deaths, and preventative measures (vs. Feb, p < .01; vs. Mar, p < .05). Further, there was significant discussion regarding China, international outbreak events (vs. Feb, p < .01), and ethnicity, as well as tweets concerning case counts (vs. Feb, p < .05), hotspots (vs. Feb, ns), and confirmed cases.
Fig. 4
Distribution of topics with higher proportions in tweets posted in January. Topics that had the same proportions for all months not shown. Significance testing results from two-sided Welch's t-test with Bonferroni correction. Significance legend: ns: 5.00e-02 < p ≤ 1.00e+00. *: 1.00e-02 < p ≤ 5.00e-02. **: 1.00e-03 < p ≤ 1.00e-02. ***: 1.00e-04 < p ≤ 1.00e-03. ****: p ≤ 1.00e-04.
In February (Fig. 5), there was a significant rise in the discussion surrounding the election, President Trump, news articles, stocks, the task force conference, and the CDC (Centers for Disease Control and Prevention). February also saw a significant discussion surrounding vaccines and travel (vs. Jan, p < .05).
Fig. 5
Distribution of topics with higher proportions in tweets posted in February. Topics that had the same proportions for all months not shown. Significance testing results from two-sided Welch's t-test with Bonferroni correction. Significance legend: ns: 5.00e-02 < p ≤ 1.00e+00. *: 1.00e-02 < p ≤ 5.00e-02. **: 1.00e-03 < p ≤ 1.00e-02. ***: 1.00e-04 < p ≤ 1.00e-03. ****: p ≤ 1.00e-04.
In March (Fig. 6), there was a rise in discussions related to social distancing and disease mitigation strategies, namely closures, cancellations (vs. Feb, p < .001), social distancing, staying home, online media (vs. Jan, p < .05), and education. In general, there were higher topic proportions of activities related to quarantine, in particular exercising, sport, shopping, prayers, words related to time, and adaptation. March also saw more dissemination of information, discussion regarding the CARES Act, discussion of cases in Florida and New York, and tweets related to employment and local business support. Finally, in March there was a significantly higher proportion of tweets related to the pandemic (vs. Feb, p < .001), public health measures, tests and test results, and also a higher prevalence of COVID-related hashtags.
Fig. 6
Distribution of topics with higher proportions in March. Topics with same proportions for all months not shown. Significance testing results from two-sided Welch's t-test with Bonferroni correction. Significance legend: ns: 5.00e-02 < p ≤ 1.00e+00. *: 1.00e-02 < p ≤ 5.00e-02. **: 1.00e-03 < p ≤ 1.00e-02. ***: 1.00e-04 < p ≤ 1.00e-03. ****: p ≤ 1.00e-04.
Comparing topic prevalence between low and high ADI areas
The ADI-specific analysis revealed significant differences in topic prevalence between low and high ADI areas. Comparing areas at the highest and lowest quintiles of ADI designation demonstrated differential effects (p < .001) in tweets by county-level socioeconomic resourcing. Topics that were more likely to dominate in high ADI (lower resourced) counties and low ADI (higher resourced) counties are shown in Fig. 7A. Topic prevalence comparisons between low and high ADI designated tweets had a significance of p < .0001 unless indicated otherwise. Tweets from high ADI areas were more likely to share emotional content with intense, negative (p < .01), or personal (p < .01) expression, or prayers (p < .05), as well as news regarding confirmed cases, the outbreak in China, flu deaths, and the CARES Act. On the other hand, tweets from low ADI areas were more likely to discuss the impact of COVID-19 on hotspots, local businesses, and New York status. Topics related to the larger public health crisis (p < .001) and pandemic (p = .001), as well as dissemination of information, stocks (p < .01), and the task force conference (p = .01), were also significantly more prevalent in tweets from low ADI areas. These areas were also more concerned about the progress of potential treatments like vaccines (p < .001). While tweets with political topics about elections (p = .937) and President Trump (p = .605) were more likely to come from low ADI areas, the differences were not statistically significant.
Fig. 7
Topic prevalence comparisons between High and Low ADI based on Log odds ratio. A. Topics with significant difference between both groups (p < .05) B. Topic dynamics for example topics.
Observing how topic proportions progressed from January through March (Fig. 7B), we noticed that the “Intense Expression” and “CARES Act” topics had consistent trends in both high and low ADI areas, with the high ADI areas having an overall higher daily average topic probability. Furthermore, topics associated with public health policies and disease mitigation strategies in March, such as “Social Distancing” and “Local Business Support”, arose in tweets from low ADI areas at a higher prevalence than in tweets from high ADI areas.
Comparing topic prevalence between hotspots and non-hotspots
There were significant differences in the dominant topics between hotspot and non-hotspot areas. Tweets from hotspots were more likely to include topics relating to New York, social distancing, public health and pandemic, information dissemination, exercise/sport, education, time, closures, and employment (Fig. 8). Tweets that were not posted from hotspots were more likely to include topics pertaining to negative or intense emotion, concern regarding the CDC guidelines and task force conference, international events and flu deaths, as well as stocks and shopping.
Fig. 8
Topic prevalence between hotspots vs non-hotspots based on log odds ratio.
Comparing topic prevalence within hotspots between low and high ADI areas
Comparing the topic prevalence of within-hotspot tweets between areas of high ADI and low ADI demonstrated that topics including confirmed cases, closures, intense expression, and hashtags were more prevalent in high ADI hotspots (Fig. 9A). Notably, tweets regarding employment concerns were also more likely to come from high ADI hotspots (p < .001), a difference that was not significant in the previous analyses comparing ADI and hotspots separately. Tweets from low ADI hotspots were significantly more concerned with exercise, stocks, information dissemination, vaccine treatment, and cases in New York. We next observed the topic dynamics for selected topics from tweets collected in March (note that no high ADI areas were hotspots in January and February) (Fig. 9B). There were notable spikes in employment concerns and intense expression from high ADI hotspots, whereas these topics remained consistent throughout the month for tweets from low ADI hotspots. Tweets about New York and social distancing remained consistently high in low ADI tweets throughout March.
Fig. 9
Topic prevalence comparisons within Hotspots between low and high ADI areas. A. Topics with significant difference between the two groups (p < .05). B. Topic dynamics for example topics.
Chi-squared findings
Table 2 shows the chi-squared testing results for our hypotheses. Testing for differential concerns across less-resourced areas (high ADI) and high-resourced areas (low ADI), we found that the dominant topics differ significantly across areas with different socioeconomic levels (p < .01). Similarly, testing for differential concerns across hotspots and non-hotspots, we found that the dominant topics differ significantly relative to the pandemic severity (p < .01).
Table 2
Test results for (1) Dominant topics and ADI levels of each tweet, (2) Dominant topics and IsHotspots, and (3) ADI levels and IsHotspots.
| Test Pairs | Chi-square Value | df | P-Value |
|---|---|---|---|
| Dominant Topics * ADI Levels | 1660.841 | 88 | <.001 |
| Dominant Topics * IsHotspots | 1399.751 | 44 | <.001 |
| ADI Levels * IsHotspots | 18338.770 | 2 | <.001 |
Discussion
Our analysis of COVID-19-related social media content demonstrates that Twitter can be used effectively to identify individual-level responses to infectious disease outbreaks in such a way that considers the impact of local-level socioeconomic resources and disease incidence. It shows too that socioeconomic disparity is associated with differential responses to the current COVID-19 pandemic, even among areas which are most severely impacted by disease cases. To our knowledge, this is the first study to link geocoded tweets to the ADI in order to explore the impact of geographic area-based socioeconomic status on tweet content.This analysis follows the early pandemic timeline and establishes that topic modeling performs well in identifying major subjects of discussion on Twitter and successfully capturing the nuances of their variability. Though topic modeling has been applied to COVID-19-related tweets in an overlapping window of time (January 23 to March 7, 2020) [51], limited topics were identified and no analysis was reported about the emergence of new topics during that period. As the first cases of COVID-19 broke news in January, we found the fear sentiment in tweets as people were broadly focused on disseminating as much information as possible and similar conclusions were reported by Xue et al. [51]. As time progressed, there was increasing focus on local cases and events, public health information dissemination and testing, and quarantine activities.Ordun et al. [52] explored topic prevalence over time in COVID-19 related tweets, however, the analysis was limited to reporting trends and lacked extended investigations of linking the trending topics to other health or social factors. 
In our study, by linking topic prevalence to socioeconomic status, we found that tweets from high ADI areas were more likely to share content regarding personal experiences, which ranged from positive affirmations of hope and prayers to negative or intense expressions of anxiety or frustration. This was not surprising given that the disparate impact of the pandemic and the associated economic fallout have, as predicted, disproportionately impacted poorer communities [53]. Furthermore, centuries of structural racism in the United States have led to lower resourcing in these areas and higher rates of medical co-morbidities that have been shown to increase COVID-19 risk [53] – all potentially contributing factors to an increase in intense, negative, and personal discussion in these areas pertaining to the public health and economic crisis.Tweets from low ADI areas in March showed more discussion of social distancing and local business support, as quarantine policies hurt local businesses and resulted in discussions about bill relief to support these businesses. This result is consistent with the quicker response to stay at home orders from low ADI areas and is in line with recent reports of movement dynamic differences between low-income and high-income areas [54]. The higher prevalence of discussion surrounding stocks that was noted in low ADI areas was consistent with a greater stock market wealth residing amongst the wealthiest US households [55].In the comparison between low and high ADI area hotspots, we identified that tweets with intense expression and those about employment insecurity were significantly more likely to come from high ADI hotspots. This reinforces the notion that, even after restricting to areas with high case counts, income and resource disparity result in disproportionate effects due to closures and job loss [56]. 
Furthermore, low ADI counties were significantly more concerned with information dissemination, cases in New York (on average a large low ADI hotspot), stocks, and vaccine treatment, showing an increased focus on social and institutional reactions to the crisis.
Our approach of integrating a location-based socioeconomic index with Twitter topics offered increased insight into the topics inferred from the text, providing a novel framework for assessing differential topics of conversation as they correlate with income, education, and housing disparities. Our integration of published COVID-19 hotspots further provides time-specific information on disease spread and how it corresponds to topics discussed on Twitter. These nuances are valuable for recognizing how public health communication, resource allocation policy, and information dissemination can respond to the needs of different communities, especially those with the lowest health resourcing, in future waves of the pandemic and in emerging infectious disease outbreaks. Future public health efforts may use Twitter topic modeling to target messaging to the unique concerns of local communities and study the impact of health resource utilization. Our findings emphasize the importance of social media as a platform for public health communication, as it is freely available to communities with different levels of socioeconomic resources. In fact, using public health communication to mitigate health disparities is not a novel concept [11], and it is in line with future directions laid out in the National Institute on Minority Health and Health Disparities 2019 research framework [12]. However, the implementation of these methods should be expanded to other national institutions and organizations, such as the Office of Disease Prevention and Health Promotion and the Centers for Disease Control and Prevention.
Furthermore, such initiatives need to be enhanced with more targeted messages, announcements, and policies that address community-level social and behavioral differences.
Limitations
Though our study successfully explored pandemic-related topics of conversation across tweets, it had a number of limitations, some of which have also been reported in other studies [57]. One limitation relates to missing data. Although Twitter data is publicly available, some tweets were posted from private accounts and, for data privacy reasons, could not be retrieved from the Twitter API. Another limitation that reduced the sample size was that the Twitter Search API, which we used in this study, retrieves tweets from a reduced sample of all historic tweets posted about COVID-19. This sample was reduced further by our focus on English, US-specific, and geocoded tweets. Furthermore, due to restrictions in Twitter geocoding, we accepted some degree of positional inaccuracy in our study design: we were only able to collect geographic coordinates at the resolution of a county, and therefore characterized each tweet by its county rather than its census tract or block group. Given the inherent geographic masking techniques used by Twitter to promote confidentiality, and our study design, which involved cross-area estimation and simple geographic centroid assessment [9], we acknowledge aggregation bias as a study limitation. However, previous work assessing the quality of deprivation indices shows that aggregated ADI outperforms other metrics in capturing county- and tract-level information [58]. Furthermore, aggregated ADI has previously been used in other work to compare county-level socioeconomic status [59]. In our dataset, on average, the ADI within each county was distributed such that the median ADI was a reasonable approximation for the county. Finally, for technical reasons on our server, fewer tweets were scraped on some dates. However, we were still able to draw valuable conclusions from our data that represent the early progression of the pandemic.
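The county-level aggregation discussed above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name, the input layout, and the example values are hypothetical, and it assumes only that block-group ADI values are keyed by 12-digit census block-group FIPS codes, whose first five digits identify the county. Alongside the median, it reports the within-county range as one rough way to judge how well a single value summarizes a county.

```python
from collections import defaultdict
from statistics import median


def county_adi_summary(block_groups):
    """Summarize block-group ADI values per county.

    block_groups: iterable of (block_group_fips, adi) pairs, where
    block_group_fips is a 12-digit string (state 2 + county 3 +
    tract 6 + block group 1). Returns {county_fips: (median, range)}.
    """
    by_county = defaultdict(list)
    for fips, adi in block_groups:
        # The first five digits of a block-group FIPS code are the county FIPS.
        by_county[fips[:5]].append(adi)
    return {
        county: (median(values), max(values) - min(values))
        for county, values in by_county.items()
    }


# Hypothetical ADI values, for illustration only.
example = [
    ("090091401001", 20),
    ("090091401002", 40),
    ("090091402001", 90),
    ("360610001001", 10),
]
summary = county_adi_summary(example)
```

A wide range within a county, as in the first hypothetical county here, signals that the median hides substantial within-county variation, which is the aggregation bias acknowledged above.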
Conclusion
Twitter analysis linking geocoded tweets to markers of geographic socioeconomic resourcing demonstrates that the COVID-19 pandemic has differentially impacted areas of the United States that are already institutionally underserved, even among the areas most severely impacted. Highly resourced areas were concerned with stocks, social distancing, and national-level policies, while low-resourced areas shared content with negative expression, prayers, and discussion of the CARES Act economic relief package. Within hotspots, we observed increased discussion of employment in low-resourced versus high-resourced areas. This finding highlights the need to address the specific fears and concerns of these communities through personalized public health messaging at the local level. Our work indicates the emerging utility of linking natural language processing techniques to real-time social media data and measures of social determinants of health. In future work, we plan to further analyze the sentiment of U.S. residents towards COVID-19 vaccination in areas with socioeconomic disparities. The speed at which vaccine-related misinformation is being propagated is alarming and has negative ramifications for global population health. We plan to investigate whether the volume and speed of misinformation differ by socioeconomic status and, specifically, whether residents in less-resourced areas are disproportionately impacted by misinformation.
Funding
This research was supported in part by the (to A.V.).