Trent D Buskirk, Brian P Blakely, Adam Eck, Richard McGrath, Ravinder Singh, Youzhi Yu.
Abstract
As survey costs continue to rise and response rates decline, researchers are seeking more cost-effective ways to collect, analyze and process social and public opinion data. These issues have created an opportunity and interest in expanding the fit-for-purpose paradigm to include alternate sources such as passively collected sensor data and social media data. However, methods for accessing, sourcing and sampling social media data are just now being developed. In fact, there has been a small but growing body of literature focusing on comparing different Twitter data access methods through either the elaborate firehose or the free Twitter search or streaming APIs. Missing from the literature is a good understanding of how to randomly sample Tweets to produce datasets that are representative of the daily discourse, especially within geographical regions of interest, without requiring a census of all Tweets. This understanding is necessary for producing quality estimates of public opinion from social media sources such as Twitter. To address this gap, we propose and test the Velocity-Based Estimation for Sampling Tweets (VBEST) algorithm for selecting a probability based sample of tweets. We compare the performance of VBEST sample estimates to other methods of accessing Twitter through the Search API on the distribution of total Tweets as well as COVID-19 keyword incidence and frequency and find that the VBEST samples produce consistent and relatively low levels of overall bias compared to common methods of access through the Search API across many experimental conditions.
Keywords: Big data; COVID-19; Probability sampling; Social media; Survey research; Tweets; Twitter
Year: 2022 PMID: 35223365 PMCID: PMC8857877 DOI: 10.1140/epjds/s13688-022-00321-1
Source DB: PubMed Journal: EPJ Data Sci ISSN: 2193-1127 Impact factor: 3.184
Figure 1. Visual depiction of the first several steps of the VBEST algorithm: illustration of the initial time point selection (Step 1a), querying Twitter at selected time points (Step 1b), and computing Tweet Velocities (Step 2a)
Figure 2. Plot of Tweet velocities and the estimated Twitter Velocity Curve (Step 2b), as well as creation of Tweet PSUs from the estimated Twitter Velocity Curve (Step 3)
Twitter Search API parameters and values used in sampling and gathering Tweets for our experiment
| Twitter API parameter | Values used in our experiment | Notes |
|---|---|---|
| keyword | ‘-filter:retweets’ | We exclude retweets. This parameter setting allows us to gather all non-retweet Tweets regardless of content. |
| result_type | ‘recent’ | Options include ‘recent’, ‘mixed’ or ‘popular’. For the popular and mixed sampling methods we selected ‘popular’ and ‘mixed’, respectively but for all other methods we selected ‘recent’. |
| count | 100 | The count values range from 10 to 100 for the TSAPI version we used in our experiment. |
| lang | ‘en’ | We gathered English language Tweets within the MSAs. |
| tweet_mode | ‘extended’ | This setting allows the user to receive the full text of the Tweet rather than the default option, which truncates the body of the Tweet beyond a character limit set by the API. |
| geocode | Latitude, Longitude and Radius in Miles | These parameters provide the geographic center of a circle of a given radius for the TSAPI to use as the location filter of the Tweets. This step is not the geo-filtering we discuss in Sect. |
| max_id | various | Time of desired query was converted to a Tweet ID as described and this parameter was used with only the Uniform, SRS and VBEST methods. |
| until | 2020-11-26; 2020-11-27 and so on…to 2021-01-01 | This parameter value was set to the date of the day following the field period date for which data collection is desired. This parameter provides a non-inclusive upper bound for the date of the Tweets for the TSAPI. |
| since_id | various | This parameter represents a time stamp corresponding to midnight of the day for which Tweet samples are desired to ensure that our daily samples do not go beyond the given day. We converted a midnight time stamp into a synthetic Tweet id and used these values for this parameter. |
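The max_id and since_id parameters above require time points expressed as synthetic Tweet IDs. A minimal sketch of that conversion, assuming Twitter's documented Snowflake ID layout (milliseconds since the Twitter epoch in the top bits, shifted past 22 worker/sequence bits); this is our own illustration, not the authors' code:

```python
# Convert a UTC timestamp into a synthetic Tweet ID usable as max_id/since_id.
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch: 2010-11-04 01:42:54.657 UTC

def timestamp_to_tweet_id(dt: datetime) -> int:
    """Smallest Tweet ID that could have been minted at time dt."""
    unix_ms = int(dt.timestamp() * 1000)
    return (unix_ms - TWITTER_EPOCH_MS) << 22

def tweet_id_to_timestamp(tweet_id: int) -> datetime:
    """Invert the mapping: recover the creation time encoded in a Tweet ID."""
    unix_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)

# Example: midnight UTC on the first field-period day as a since_id value.
midnight = datetime(2020, 11, 24, 0, 0, 0, tzinfo=timezone.utc)
since_id = timestamp_to_tweet_id(midnight)
```

Because the mapping is monotone in time, any later time point yields a strictly larger synthetic ID, which is what makes these IDs usable as chronological bounds.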
Listing of the 5 MSAs and Principal Cities, geo-coordinates and search radius specified for each of the Twitter queries for our experiment
| Metropolitan statistical area (MSA) | Center: Lat, Long | Radius | Principal cities |
|---|---|---|---|
| Chicago-Naperville-Elgin, Illinois–Indiana–Wisconsin | 41.905170, −87.624664 | 50 miles | Bolingbrook, IL; Chicago, IL; Des Plaines, IL; Elgin, IL; Evanston, IL; Hoffman Estates, IL; Naperville, IL; Schaumburg, IL; Skokie, IL; Gary, IN; Kenosha, WI |
| Atlanta–Alpharetta–Sandy Springs, Georgia | 33.6937280, −84.3999113 | 40 miles | Alpharetta, Atlanta, Marietta, Sandy Springs, GA |
| Phoenix–Mesa–Chandler, Arizona | 33.448400, −112.074000 | 53 miles | Casa Grande, Chandler, Mesa, Phoenix, Scottsdale, Tempe, AZ |
| Baltimore–Columbia–Towson, Maryland | 39.297002, −76.676317 | 16 miles | Baltimore, Columbia, Towson, MD |
| Pittsburgh, Pennsylvania | 40.437202, −79.982197 | 28 miles | Pittsburgh, PA |
Figure A1.1. Example specification of the Atlanta–Alpharetta–Sandy Springs, Georgia MSA using the center and radius specified in Table 2. Twitter queries based on geographies in the public TSAPI require that a latitude, longitude and radius be specified rather than a place keyword or location name. Underlying map accessed from Google Maps (https://www.google.com/maps)
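The geocode parameter defines a circular search area like the one pictured above. A small sketch of testing whether a coordinate pair falls inside such a circle, using the haversine formula; this is our own illustration of the circle geometry, not the authors' separate geo-filtering step, which matches Tweets to principal cities:

```python
# Check whether a (lat, lon) point lies inside a TSAPI-style geocode circle.
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def in_search_circle(lat, lon, center_lat, center_lon, radius_miles):
    return haversine_miles(lat, lon, center_lat, center_lon) <= radius_miles

# Atlanta MSA circle from Table 2: center (33.6937280, -84.3999113), 40 miles.
# Downtown Atlanta (~33.749, -84.388) falls inside it.
```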
Description of the 6 different methods we used in our experiment. The first three methods are possible settings for the TSAPI and the last three are new variants we are introducing and comparing in our experiment
| Tweet access method | Description |
|---|---|
| 1. Popular | One of three methods available for the result_type parameter in the TSAPI that returns the most popular results, as determined by Twitter, in the query. |
| 2. Mixed | The current default method for the result_type parameter of the TSAPI: returns both “popular” and “recent” Tweets as part of the query. Popular Tweets are determined by Twitter. |
| 3. Recent | Another option for the result_type parameter of the TSAPI that can be selected by the user, in which the most recent Tweets are returned. If more than 100 of the most recent Tweets are available, additional queries can be submitted in sequence to obtain collections of Tweets that run chronologically backward from 11:59:59.999 pm of a given day to midnight at the beginning of that day, as described in |
| 4. Uniform | A series of evenly spaced time points from a given day is determined, and a single query is submitted for each of the selected time points using the TSAPI with the result_type parameter set to “recent”. For this method we randomly select a starting time point within a sampling interval determined by the number of queries desired and then determine subsequent, evenly spaced points. The identified time points are then converted to Tweet IDs and used as the max_id parameters in the TSAPI. |
| 5. VBEST-SYS | A systematic random sample (without replacement, circular) is taken of a desired size from the universe of Tweet PSUs identified from the VBEST algorithm. The right-most endpoint of each of the Tweet PSU intervals is then used in a TSAPI query with result_type set to “recent”. One query is submitted per selected Tweet PSU. |
| 6. VBEST-SRS | A simple random sample (without replacement) is taken of a desired size from the sampling frame of Tweet PSUs constructed from the VBEST algorithm. The right-most endpoint of each of the Tweet PSU intervals is then used in a TSAPI query with result_type set to “recent”. One query is submitted per selected Tweet PSU. |
Note: For more information on Twitter Search API search options please refer to Twitter documentation available at: https://developer.twitter.com/en/docs/twitter-api/v1/Tweets/search/api-reference/get-search-Tweets.
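The selection logic behind methods 4 through 6 can be sketched as follows. The function names and the seconds-within-a-day representation are our own illustration, not the authors' code; in practice each selected point or PSU endpoint would be converted to a synthetic Tweet ID and used as a max_id value:

```python
# Sketches of the Uniform, VBEST-SYS and VBEST-SRS selection steps.
import random

def uniform_time_points(n_queries, day_seconds=86400, rng=random):
    """Method 4: random start within one sampling interval, then evenly
    spaced time points across the day."""
    interval = day_seconds / n_queries
    start = rng.uniform(0, interval)
    return [start + q * interval for q in range(n_queries)]

def vbest_sys_sample(psu_endpoints, n_sample, rng=random):
    """Method 5: circular systematic sample (without replacement) of PSUs,
    returning the right-most endpoint of each selected PSU interval."""
    N = len(psu_endpoints)
    step = N / n_sample
    start = rng.uniform(0, N)
    picks = sorted({int((start + i * step) % N) for i in range(n_sample)})
    return [psu_endpoints[i] for i in picks]

def vbest_srs_sample(psu_endpoints, n_sample, rng=random):
    """Method 6: simple random sample (without replacement) of PSU endpoints."""
    return rng.sample(psu_endpoints, n_sample)
```

With the fractional-interval circular systematic scheme, the n_sample selection points are equally spaced around the circle with spacing at least 1 unit, so they always land in n_sample distinct PSUs.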
Topics used in our evaluation and a complete listing of keywords that defined each topic. The specific keywords were identified in early spring of 2020 just as more information about tracking of COVID-19 became available in the U.S.
| Covid | Social distancing | Working | Masks | Sanitizing | General virus | Symptoms | Treatment |
|---|---|---|---|---|---|---|---|
| covid | social distance | wfh | face mask (s) | hand sanitizer | virus | can’t smell | ventilator |
| covid-19 | social distancing | working from home | mask (s, ed) | disinfect | flu | no Smell | remdesivir |
| covid19 | six feet apart | work from home | PPE | disinfectant | pandemic | can’t taste | vaccine |
| covid test (ing) | 6 ft apart | not working now | N95 | lysol | sars | no taste | contact tracing |
| covid cases | 6 feet apart | furlough | face cover (ing) | sanitize | pneumonia | cough | |
| coronavirus | hunker down | reopen | face shield | sanitizing | Fauci | fever | |
| rona | lockdown | reopening | sanitizer | chills | |||
| cv19 | quarantine | stimulus checks | hand wash | sore throat | |||
| quarantining | remote work | hand washing | asymptomatic | ||||
| working remotely | bleach | ||||||
| unemployed | washing hands | ||||||
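A Tweet is filtered into a topic when its text contains any of that topic's keywords. A minimal sketch, assuming case-insensitive substring matching (the paper does not state its exact matching rule) and showing only two abbreviated topic lists from the table above:

```python
# Flag Tweets whose text matches any keyword for a topic (illustrative subset).
TOPICS = {
    "Covid": ["covid", "covid-19", "covid19", "coronavirus", "rona", "cv19"],
    "Masks": ["face mask", "mask", "ppe", "n95", "face cover", "face shield"],
}

def topics_for(text):
    """Return the set of topics whose keyword list matches the Tweet text."""
    lowered = text.lower()
    return {t for t, kws in TOPICS.items() if any(k in lowered for k in kws)}
```

Note that substring matching deliberately catches inflected forms such as “masked” or “covid testing”, mirroring the parenthetical suffixes listed in the table.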
Calculation of the metrics we will use to evaluate the methods in our experiment. Bold-faced metrics represent the key outcomes of interest
| Statistic/metric# | Description and calculation |
|---|---|
| R, D, M, S, T | R refers to one of the five MSA regions; D refers to one of the 38 days in our field period; M refers to one of the 6 sampling methods; S refers to one of the four sample size settings (i.e., number of queries); T refers to one of the 8 topics: COVID, Social Distancing, Working, Masks, Sanitizing, General Virus, Symptoms or Treatment. |
| Geo-filtered Tweets per query | Total number of geo-filtered Tweets in the qth query of a sample of size S taken from Region R on Day D using Method M. Note q = 1, 2, 3, …, S. |
| Topic Tweets per query | Total number of geo-filtered Tweets containing any of the keywords for topic T in the qth query of a sample of size S taken from Region R on Day D using Method M. Note q = 1, 2, 3, …, S. |
| Estimated topic incidence | Estimated incidence rate of topic T among all geo-filtered Tweets within the sample of size S taken from region R on day D using method M, computed as the sum of the topic-T Tweet counts across the S queries divided by the sum of the geo-filtered Tweet counts across the S queries. |
| Actual topic incidence and frequency | Incidence rate and frequency of topic T, respectively, among all Tweets in region R on day D based on full Twitter corpus data accessed through a TFV. |
| **PRAB(I) and MPRAB(I)** | Percent Relative Absolute Bias for the incidence of topic T based on geo-filtered Tweets from a sample of size S taken from region R on day D using method M, computed as the absolute difference between the estimated and actual incidence divided by the actual incidence, expressed as a percentage. MPRAB(I) is the mean of PRAB(I) across the 8 topics. |
| Estimated topic frequency | Estimated frequency of the number of Tweets from topic T among all geo-filtered Tweets within the sample of size S taken from region R on day D using method M, computed for the SRS and VBEST methods by weighting the sampled Tweet counts by the inverses of their selection probabilities. |
| **PRAB(F) and MPRAB(F)** | Percent Relative Absolute Bias for the frequency of topic T and Mean Percent Relative Absolute Bias for topic frequencies, derived in the same manner as the incidence-based metrics described above, but using the frequency estimates and actual frequencies. |
| Estimated and actual total Tweets | Estimated (and actual) total number of geo-filtered Tweets in region R for day D based on a sample of size S taken using method M. The actual value is based on TFV data supplied by our vendor; the estimated totals are computed for SRS and VBEST by weighting the sampled Tweet counts by the inverses of their selection probabilities. |
| **PRAB(N)** | Percent Relative Absolute Bias for total Tweets within region R on day D based on geo-filtered Tweets from a sample of size S taken using method M, computed as the absolute difference between the estimated and actual totals divided by the actual total, expressed as a percentage. |
#Note: when we aggregate the MPRAB(I), MPRAB(F) and PRAB(N) metrics over all days of the experiment for a given combination of region, method and size we will refer to the measure as the overall average metric value.
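The bias metrics described above reduce to a few one-line calculations. A sketch with our own variable names, pooling the per-query counts across the S queries:

```python
# Incidence and percent-relative-absolute-bias calculations for the metrics
# table above. topic_counts/total_counts are per-query Tweet counts.
def estimated_incidence(topic_counts, total_counts):
    """Estimated incidence: topic Tweets over all geo-filtered Tweets,
    pooled across the S queries."""
    return sum(topic_counts) / sum(total_counts)

def prab(estimate, actual):
    """Percent relative absolute bias: |estimate - actual| / actual * 100."""
    return abs(estimate - actual) / actual * 100.0

def mprab(estimates, actuals):
    """Mean PRAB across the 8 topics for one region/day/method/size cell."""
    return sum(prab(e, a) for e, a in zip(estimates, actuals)) / len(estimates)
```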
Distribution of sampled Tweets, geo-filtered Tweets and Tweets filtered into one of our Topics, by MSA and overall, for our experiment. These counts represent samples gathered from all methods and sizes included in our experiment over the 38 days in our field period
| Metropolitan statistical area (MSA) | Total Tweets sampled | Total Tweets geo-filtered into principal cities | Total Tweets filtered into one of our 8 topics | Tweets from sample missing user geography metadata |
|---|---|---|---|---|
| Chicago–Naperville–Elgin, Illinois–Indiana–Wisconsin | 22,442,771 | 18,525,841 | 531,402 | 178 |
| Atlanta–Alpharetta–Sandy Springs, Georgia | 22,413,310 | 17,624,013 | 329,958 | 90 |
| Phoenix–Mesa–Chandler, Arizona | 22,607,714 | 17,191,865 | 640,358 | 60 |
| Baltimore–Columbia–Towson, Maryland | 22,435,821 | 16,087,382 | 446,228 | 29 |
| Pittsburgh, Pennsylvania | 22,424,971 | 11,099,119 | 393,554 | 309 |
| Grand Total from All MSA Regions in the Experiment | 112,324,587 | 80,528,220 | 2,341,500 | 666 |
Overall averages (and standard deviations) for the three primary statistical outcomes by MSA region, size and method. The overall averages represent the arithmetic mean for the metric within a given combination of MSA Region, Size and Method across the 38-day field period for our experiment. Standard deviations are also given in parentheses
| Outcome | MSA | Recent (264/360) | Uniform (264/360) | VBEST-SRS (264/360) | VBEST-SYS (264/360) | Recent (444/540) | Uniform (444/540) | VBEST-SRS (444/540) | VBEST-SYS (444/540) | Recent (624/720) | Uniform (624/720) | VBEST-SRS (624/720) | VBEST-SYS (624/720) | Mixed | Popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MPRAB(I) | Atlanta | 19.43 (6.69) | 17.67 (5.66) | 17.92 (6.47) | 15.83 (4.92) | 16.36 (5.49) | 13.81 (4.63) | 13.41 (3.12) | 14.64 (4.65) | 13.84 (4.17) | 12.6 (3.54) | 11.94 (3.71) | 10.45 (3.11) | 138.53 (88.98) | 100 (0) |
| MPRAB(I) | Baltimore | 17.24 (7.67) | 14.43 (3.08) | 16.01 (4.74) | 13.62 (4.14) | 14.11 (5.92) | 12.49 (3.06) | 11.28 (3.38) | 10.5 (2.9) | 7.99 (4.18) | 10.84 (3.25) | 8.89 (2.71) | 8.8 (3.03) | 132.02 (44.73) | 100 (0) |
| MPRAB(I) | Chicago | 14.93 (4.45) | 13.59 (4.82) | 15.05 (5.28) | 14.15 (4.04) | 12.49 (4.1) | 10.47 (2.94) | 11.61 (3.47) | 11.36 (3.25) | 10.93 (3.83) | 10.3 (2.78) | 9.39 (3) | 9.46 (2.64) | 129.94 (74.2) | 100 (0) |
| MPRAB(I) | Phoenix | 14.09 (4.59) | 11.98 (3.52) | 13.96 (4.1) | 13.66 (3.51) | 12.57 (4.02) | 9.86 (2.76) | 8.44 (2.53) | 9.73 (3.33) | 7.49 (1.85) | 9.37 (3.9) | 7.86 (2.08) | 6.88 (2.07) | 41.96 (17.09) | 100 (0) |
| MPRAB(I) | Pittsburgh | 17.18 (7.05) | 15.22 (5.52) | 17.18 (5.08) | 15.17 (6.19) | 15.12 (6.64) | 12.33 (4.36) | 12.07 (4.31) | 11.45 (3.07) | 12.8 (6.88) | 10.37 (3.36) | 8.96 (2.44) | 9.62 (3.23) | 308.86 (727.05) | 100 (0) |
| MPRAB(F) | Atlanta | 26.95 (12.25) | 38.38 (4.86) | 18.44 (5.58) | 17.28 (4.4) | 26.64 (11.3) | 39.13 (5.56) | 16.13 (4.15) | 16.56 (5.27) | 26.66 (9.55) | 37.99 (4.75) | 14.71 (3.73) | 13.35 (3.91) | 98.26 (5.1) | 100 (0) |
| MPRAB(F) | Baltimore | 33.98 (10.57) | 35.2 (4.66) | 17.81 (4.89) | 15.48 (4.22) | 26.43 (12.26) | 34.07 (4.59) | 14.16 (3.73) | 13.86 (2.9) | 10.96 (4.21) | 34.62 (4.1) | 12.67 (3.49) | 12.46 (3.67) | 94.5 (4.49) | 100 (0) |
| MPRAB(F) | Chicago | 22.3 (8.48) | 41.39 (4.02) | 17.97 (5.37) | 17.93 (4.65) | 22.4 (8.52) | 42.06 (3.03) | 15.93 (3.89) | 16.09 (4.4) | 23.3 (8.72) | 42.95 (3.81) | 15.03 (3.68) | 14.88 (3.04) | 94.85 (4.96) | 100 (0) |
| MPRAB(F) | Phoenix | 32.2 (11.03) | 42.64 (3.55) | 16.62 (3.94) | 16.91 (5.02) | 19.06 (9.21) | 43.64 (3.77) | 13.44 (3.1) | 13.99 (3.6) | 13.45 (3.41) | 43.86 (3.41) | 13.02 (3.34) | 12.34 (3.38) | 93.15 (2.05) | 100 (0) |
| MPRAB(F) | Pittsburgh | 36.73 (15.52) | 30.86 (4.64) | 18.33 (5.52) | 16.98 (5.68) | 33.51 (12.49) | 29.54 (4.98) | 14.55 (4.12) | 13.78 (3.61) | 12.44 (4.59) | 30.2 (3.8) | 11.94 (3.45) | 12.49 (3.87) | 116.39 (82.84) | 100 (0) |
| PRAB(N) | Atlanta | 19.15 (6.78) | 35.65 (2.51) | 7.24 (3.88) | 7.12 (3.8) | 19.83 (7.25) | 35.6 (2.5) | 7.19 (3.79) | 7.19 (3.83) | 21.25 (7) | 35.54 (2.63) | 7.23 (3.85) | 7.24 (3.84) | 92.07 (5.1) | 100 (0) |
| PRAB(N) | Baltimore | 27.28 (11.22) | 30.34 (3.26) | 7.05 (2.73) | 7.01 (2.66) | 18.52 (11.24) | 30.32 (3.22) | 7.13 (2.66) | 7.04 (2.65) | 9.73 (4.63) | 30.34 (3.3) | 7 (2.71) | 7.03 (2.65) | 93.36 (1.14) | 100 (0) |
| PRAB(N) | Chicago | 17.29 (6) | 41.03 (2.24) | 10.81 (2.3) | 10.84 (2.29) | 19.42 (6.48) | 40.94 (2.22) | 10.85 (2.35) | 10.86 (2.32) | 21.25 (6.91) | 41.05 (2.23) | 10.85 (2.34) | 10.84 (2.3) | 92.46 (1.33) | 100 (0) |
| PRAB(N) | Phoenix | 26.02 (7.55) | 43.29 (1.96) | 11.34 (2.7) | 11.44 (2.68) | 16.16 (7.03) | 43.43 (2) | 11.38 (2.68) | 11.34 (2.68) | 15.35 (1.75) | 43.35 (2.02) | 11.34 (2.65) | 11.35 (2.67) | 92.6 (1.09) | 100 (0) |
| PRAB(N) | Pittsburgh | 30.21 (16.12) | 29.2 (3.72) | 8.76 (2.72) | 8.66 (2.52) | 23.23 (13.45) | 29.32 (3.7) | 8.92 (2.48) | 8.89 (2.43) | 13.4 (7.19) | 29.38 (3.72) | 8.78 (2.55) | 8.82 (2.41) | 89.37 (11) | 100 (0) |
∗Note that the “Mixed” and “Popular” columns are included for reference only and were not included in any of our analyses or pairwise comparisons as described in the text.
Figure 3. Boxplots depicting the mean percent relative absolute bias measures for topic incidence (MPRAB(I)) for each of the methods and sample sizes within each of the five MSA regions. Lower values are better
The difference between two overall average MPRAB(I) values within a row (Region) in Table 4 is declared significant if it exceeds 2.58 in absolute value. The primary results of the post-hoc analysis for differences in overall average MPRAB(I) values are listed within this table
| 1) | In Chicago, we cannot distinguish the performance between the Methods for any Size. |
| 2) | When the Size = 360, we cannot distinguish between the Methods in Phoenix and Pittsburgh. In Atlanta, VBEST-SYS is significantly better than Recent while in Baltimore, VBEST-SYS and Uniform are better than Recent. |
| 3) | At Size = 540, Recent has the worst performance in Phoenix and Pittsburgh. Recent is worse than VBEST-SRS in Atlanta and VBEST-SRS and VBEST-SYS in Baltimore. |
| 4) | With Size = 720, Recent performs significantly worse than VBEST-SRS and VBEST-SYS in Pittsburgh, and worse than VBEST-SYS in Atlanta. Baltimore shows an anomaly: Recent has the lowest mean MPRAB(I), although not significantly better than VBEST-SRS and VBEST-SYS, with Uniform significantly worse than Recent. |
| 5) | At Size = 540, the mean MPRAB(I) for Recent is not significantly better than those from any other Method using Size = 360. Similarly, Recent with Size = 720 is no better than the other Methods with Size = 540 (with the exception of the Baltimore Region). |
| 6) | The MPRAB(I) means for a given Method with Size = 540 fall between those values of the Method using Sizes of 360 and 720 respectively. This result generally holds for each of the Methods across the Regions. |
Figure 4. Boxplot depicting the distribution of daily MPRAB(F) metric values for topic frequencies for each Method and Size within Region. Lower values indicate smaller biases
The difference between two overall average MPRAB(F) values within a row (Region) in Table 4 is declared significant if it exceeds 3.70 in absolute value. The primary results of the post-hoc analysis for differences in overall average MPRAB(F) values are listed in this table
| 1. | For the Recent Method, the average MPRAB(F) values are very consistent across Size within Atlanta and Chicago. In Phoenix, Baltimore, and Pittsburgh, the overall average MPRAB(F) values decrease with Size. Notably, the former are the most populous MSAs in our experiment, whereas the latter are the least populous. Thus, the method's consistency tracked MSA population and, correspondingly, Twitter volume. |
| 2. | Across all Regions, the average MPRAB(F) values for the Uniform Method were not distinguishable across the different Sizes. These values were consistently the highest across Regions and Sizes, falling between 30% and 40%. One exception is Size = 360 in Baltimore, where Recent and Uniform are indistinguishable. Another is Pittsburgh, where Recent performs worse than Uniform for Sizes 360 and 540 but much better than Uniform for Size = 720. |
| 3. | For Chicago, Atlanta and Phoenix, MPRAB(F) values for Uniform samples, regardless of Size, were significantly larger, on average, than for samples of any Size gathered from the Recent method. Uniform samples of any size in Baltimore produced larger MPRAB(F) values, on average, than Sizes of 540 and 720 from the Recent Method and in Pittsburgh Uniform samples had smaller mean MPRAB(F) values compared to those from Recent samples of Size 360 and 540 but larger than Recent for Size = 720. |
| 4. | VBEST-SRS and VBEST-SYS average MPRAB(F) values were indistinguishable from each other in all Regions for all Sizes and were better than Recent and Uniform in all cases with three exceptions. They were not significantly better than Recent in Baltimore, Phoenix or Pittsburgh for Size = 720. |
Figure 5. Estimated regression slopes relating the sample-based topic frequency estimates to their full-corpus (TFV) values, by region, method and size. The regression models each use the frequency estimates from the 8 topics of interest across the 38 days of the experiment, are fitted separately for each region, method and size combination, and fix the intercept at 0
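With the intercept fixed at 0, the least-squares slope used for these models has a simple closed form, sketched here (our own illustration; a slope near 1 indicates that the sample-based estimates track the full-corpus values):

```python
# Through-the-origin least-squares slope: sum(x*y) / sum(x*x).
def zero_intercept_slope(x, y):
    """Least-squares slope of y ~ x with the intercept fixed at 0."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
```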
Figure 6. Boxplots depicting the percent relative absolute bias (PRAB(N)) values for estimating the total number of Tweets by Method and Size for each of the 5 MSA regions
The difference between two average PRAB(N) values within a row (Region) in Table 4 is declared significant if it exceeds 2.65 in absolute value. Key findings from the post hoc analysis of differences between average PRAB(N) values are included in this table
| 1. | Average PRAB(N) values from Uniform samples ranged from about 30% to about 45% across the regions and were very consistent across Size within Region. They were significantly larger than those from any other method with one exception—for query size 360, average PRAB(N) values from Uniform samples were not significantly different from that of Recent in Pittsburgh. |
| 2. | Recent samples from Atlanta had PRAB(N) values that were consistent across all levels of Size. In Baltimore and Pittsburgh, the Recent PRAB(N) averages improved significantly with larger sample sizes. In Phoenix, PRAB(N) averages for 540 and 720 queries were indistinguishable but significantly better than with 360 queries. For Chicago, we saw the opposite behavior—the performance degraded for larger sample sizes with 720 queries performing significantly worse than 360. |
| 3. | Across Regions and query sizes, Recent samples produced significantly higher PRAB(N) values, on average, than VBEST-SRS and VBEST-SYS which were indistinguishable. |
| 4. | VBEST-SYS and VBEST-SRS clearly had the best performance among the four Methods averaging between about 7% and 11% depending on Region. |
Figure 7. Estimated regression slopes relating the sample-based estimates of the total number of Tweets to their full-corpus (TFV) values, by Region, Method and Size. The regression models each use the total Tweet population estimates across the 38 days of the experiment, are fitted separately for each Region, Method and Size combination, and fix the intercept at 0
Figure A3.1. Distribution of the full corpus of Tweets for the first day of our experiment (11/24/2020), along with the geo-filtered Tweet distribution and the proportion of total Tweets represented by 360, 540 and 720 queries (vertical lines) used in conjunction with the Recent method for Baltimore (A) and Atlanta (B). We also plot the distribution of Tweets associated with the COVID Topic (full Tweets and geo-filtered Tweets) to illustrate the topic level of analysis