Sarah A Nowak, Christine Chen, Andrew M Parker, Courtney A Gidengil, Luke J Matthews.
Abstract
Over the past decade, the percentage of adults in the United States who use some form of social media has roughly doubled, increasing from 36 percent in early 2009 to 72 percent in 2019. There has been a corresponding increase in research aimed at understanding opinions and beliefs that are expressed online. However, the generalizability of findings from social media research is a subject of ongoing debate. Social media platforms are conduits of both information and misinformation about vaccines and vaccine hesitancy. Our research objective was to examine whether we can draw similar conclusions from Twitter and national survey data about the relationship between vaccine hesitancy and a broader set of beliefs. In 2018 we conducted a nationally representative survey of parents in the United States, informed by a literature review, to ask their views on a range of topics, including vaccine side effects, conspiracy theories, and understanding of science. We developed a set of keyword-based queries corresponding to each of the belief items from the survey and, in 2018, pulled matching tweets from 2017, the most recent full year of data at the time. Our primary measures of belief covariation were the loadings and scores of the first principal components obtained using principal component analysis (PCA) from the two sources. We found that, after using manually coded weblinks in tweets to infer stance, there was good qualitative agreement between the first principal component loadings and scores using survey and Twitter data. This held true after we took the additional processing step of resampling the Twitter data based on the number of topics that an individual tweeted about, as a means of correcting for differential representation for elicited (survey) vs. volunteered (Twitter) beliefs. Overall, the results show that analyses using Twitter data may be generalizable in certain contexts, such as assessing belief covariation.
Year: 2020 PMID: 33031405 PMCID: PMC7544030 DOI: 10.1371/journal.pone.0239826
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Description of data file pairs.
| Data file pair | Purpose of data file pair | Values data could take | Number of belief tags | Number of rows (individuals) in survey file | Number of rows (accounts) in Twitter file |
|---|---|---|---|---|---|
| Topic | Examine agreement with minimal processing of Twitter data | 0, +1 | 29 | 565 | 551,738 |
| Stance | Examine the impact of inferring stance with Twitter data | -1, 0, +1 | 10 | 550 | 183,169 |
| Limited Topic | Create a file without stance that is otherwise identical to the Stance data files to isolate the effect of inferring stance | 0, +1 | 10 | 550 | 183,169 |
| Resampled Stance | Examine whether the number of belief items tweeted about could be used to re-sample Twitter data to make the data more representative of the population | -1, 0, +1 | 10 | 550 | 550 |
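The relationship between the Stance and Limited Topic files can be illustrated with a minimal sketch. It assumes that dropping stance simply means discarding the sign of each tag, so that -1 and +1 both become +1; the source does not state the exact procedure, and the function name and toy rows here are hypothetical.

```python
def drop_stance(stance_rows):
    """Map each stance tag (-1, 0, +1) to a topic tag (0, +1) by discarding the sign.

    Assumption: this is one plausible way to build a stance-free file;
    it is not confirmed as the authors' procedure.
    """
    return [[abs(tag) for tag in row] for row in stance_rows]


# Hypothetical rows: one account or respondent per row, one belief tag per column.
stance_rows = [
    [1, 0, -1],
    [0, 0, 1],
    [-1, -1, 0],
]
topic_rows = drop_stance(stance_rows)
```

Under this assumption the two files keep the same rows and belief tags, so any difference in the PCA results can be attributed to the stance information alone.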
Distribution of non-zero belief tags in Twitter and survey stance files (before resampling).
| Number of topics | Twitter stance file, count | Survey stance file, count | Twitter stance file, percent | Survey stance file, percent |
|---|---|---|---|---|
| 1 | 166,216 | 68 | 90.7% | 12.4% |
| 2 | 13,432 | 67 | 7.3% | 12.2% |
| 3 | 2,303 | 73 | 1.3% | 13.3% |
| 4 | 717 | 76 | 0.4% | 13.8% |
| 5+ | 501 | 266 | 0.3% | 48.4% |
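The table shows why resampling matters: most Twitter accounts touch a single topic, while nearly half of survey respondents express five or more beliefs. The sketch below shows one way such a resampling step could work, drawing Twitter rows so that the number of non-zero tags per row follows a target distribution. It is an illustration under assumptions, not the authors' implementation: the function names, the toy data, and the choice to sample without replacement are all hypothetical.

```python
import random
from collections import defaultdict


def topic_bucket(row, cap=5):
    """Count the non-zero belief tags in a row, capped at `cap` (the '5+' bucket)."""
    return min(sum(1 for tag in row if tag != 0), cap)


def resample_to_target(rows, target_counts, seed=0):
    """Draw rows without replacement so each bucket holds target_counts[bucket] rows.

    Assumption: sampling without replacement; the source does not say
    which sampling scheme was used.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for row in rows:
        buckets[topic_bucket(row)].append(row)
    sample = []
    for bucket, n in target_counts.items():
        if len(buckets[bucket]) < n:
            raise ValueError(f"bucket {bucket} has too few rows to sample {n}")
        sample.extend(rng.sample(buckets[bucket], n))
    return sample


# Toy stand-in for the Twitter stance file: mostly single-topic accounts.
twitter_rows = [[1, 0, 0]] * 90 + [[1, -1, 0]] * 8 + [[1, 1, -1]] * 2
# Toy target distribution. With the survey counts from the table, the target
# would be {1: 68, 2: 67, 3: 73, 4: 76, 5: 266}, which sums to 550 rows.
target = {1: 2, 2: 2, 3: 2}
resampled = resample_to_target(twitter_rows, target)
```

With the survey counts as the target, the result would have 550 rows, matching the size of the Resampled Stance Twitter file described above.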
Fig 1. Proportion of individuals or accounts with different belief tag values in the survey and Twitter data files for each of the four data file pairs.
In panels a, b, e, and f, black indicates presence (1) and yellow absence (0) of a belief. Within panels c, d, g, and h, blue indicates agreement (+1), yellow absence (0), and orange disagreement (-1) with a belief.
Fig 2. Loadings on PC1, the first principal component from the PCA for each data file.
The values shown are the Pearson correlation coefficients for each pair of PC1 loadings.
Fig 3. Scatter plot and linear fit of predicted PC1 score and actual PC1 score.
The “actual PC1 score” was calculated using the loadings from the data set for which we were estimating the scores. That is, actual scores for Twitter data were calculated using Twitter PC1 loadings, and actual scores for survey data were calculated using survey PC1 loadings. The “predicted PC1 score” for Twitter data sets was calculated using survey PC1 loadings, and the “predicted PC1 score” for survey data sets was calculated using Twitter PC1 loadings. For each plot, the predicted and actual PC1 scores were re-scaled to lie between -1 and +1, which did not alter the correlation.
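The actual and predicted scores described in this caption can be sketched as follows. The example assumes that a PC1 score is the projection of column-centered data onto a loading vector and that the re-scaling is a linear min-max map; neither detail is spelled out in the source, and all data, loadings, and function names are hypothetical.

```python
import math


def pc_scores(rows, loadings):
    """Project column-centered rows onto a loading vector, giving one score per row."""
    n_cols = len(loadings)
    means = [sum(row[j] for row in rows) / len(rows) for j in range(n_cols)]
    return [
        sum((row[j] - means[j]) * loadings[j] for j in range(n_cols))
        for row in rows
    ]


def rescale(values, lo=-1.0, hi=1.0):
    """Linearly map values onto [lo, hi] (assumed form of the re-scaling)."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical survey stance rows and PC1 loadings (illustrative values only).
survey_rows = [[1, 0, -1], [1, 1, 0], [-1, 0, 1], [0, -1, 1], [1, 1, -1]]
survey_loadings = [0.6, 0.5, -0.62]
twitter_loadings = [0.7, 0.3, -0.65]

actual = pc_scores(survey_rows, survey_loadings)      # survey data, survey loadings
predicted = pc_scores(survey_rows, twitter_loadings)  # survey data, Twitter loadings

r_raw = pearson(predicted, actual)
r_scaled = pearson(rescale(predicted), rescale(actual))
```

Because the min-max map is linear with a positive slope, `r_raw` and `r_scaled` agree up to floating-point error, which is consistent with the caption's statement that re-scaling did not alter the correlation.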