Jon Agley, Yunyu Xiao, Rachael Nolan, Lilian Golzarri-Arroyo.
Abstract
Crowdsourced psychological and other biobehavioral research using platforms like Amazon's Mechanical Turk (MTurk) is increasingly common, but it has proliferated more rapidly than studies to establish data quality best practices. Thus, this study investigated whether outcome scores for three common screening tools would be significantly different among MTurk workers who were subject to different sets of quality control checks. We conducted a single-stage, randomized controlled trial with equal allocation to each of four study arms: Arm 1 (Control Arm), Arm 2 (Bot/VPN Check), Arm 3 (Truthfulness/Attention Check), and Arm 4 (Stringent Arm - All Checks). Data collection was completed in Qualtrics, to which participants were referred from MTurk. Subjects (n = 1100) were recruited on November 20-21, 2020. Eligible workers were required to claim U.S. residency, have a successful task completion rate > 95%, have completed a minimum of 100 tasks, and have completed a maximum of 10,000 tasks. Participants completed the US-Alcohol Use Disorders Identification Test (USAUDIT), the Patient Health Questionnaire (PHQ-9), and a screener for Generalized Anxiety Disorder (GAD-7). We found that differing quality control approaches significantly, meaningfully, and directionally affected outcome scores on each of the screening tools. Most notably, workers in Arm 1 (Control) reported higher scores than those in Arms 3 and 4 for all tools, and a higher score than workers in Arm 2 for the PHQ-9. These data suggest that the use, or lack thereof, of quality control questions in crowdsourced research may substantively affect findings, as might the types of quality control items.
Keywords: MTurk; crowdsourced sampling; data quality; reproducibility
Year: 2021 PMID: 34357539 PMCID: PMC8344397 DOI: 10.3758/s13428-021-01665-8
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X
Quality control (intervention) information
| Arm name | Quality control questions | Rationale |
|---|---|---|
| Arm 1: Control/No Treatment | No additional exclusion criteria were appended to the basic eligibility requirements. | Control Arm. |
| Arm 2: Bot/VPN Check | (a) “If you had an emergency, what telephone number would you dial?” with the response options [112, 911, 000, and 119], each of which is a real emergency number in a different area of the world. (b) Participants were shown an image of an eggplant and asked, “What is the name of this vegetable?” with the response options [guinea squash, brinjal, aubergine, and eggplant], which are the four most common names of the vegetable. | (a) Since this was a U.S.-based sample, and respondents were at least age 18, it was expected that true U.S.-based participants would select 911. However, workers using a VPN to mimic a US-based IP address were hypothesized to select their own regional numbers, if present. Our experience in prior studies indicated that a meaningful number of supposedly U.S.-based workers would fail to select 911 (Agley & Xiao, (b) It was suspected that all but highly sophisticated bots would fail to directly identify an eggplant by name given only an image. Further, this functioned as a secondary VPN-check because the four names provided as response options are regional, with eggplant being standard terminology in the U.S. |
| Arm 3: Truthfulness/Attention Check | (a) “In the past 2 years, have you ever traveled to, or done any business with entities in, Latveria?” with response options [no, never; yes, but not within the past 2 years; yes, I have done so within the past 2 years]. (b) “Research has suggested that a person’s favorite color can tell us a lot about the way that they think about other people. In this case, however, we would like you to ignore this question entirely. Instead, please choose all of the response options provided. In other words, regardless of your actual favorite color, click all of the answers.” Respondents were provided with responses [red, blue, yellow, green, purple] but needed to select all five to demonstrate careful reading of the prompts. (c) “When you were in school, how hard did you work on your studies? In answering this question, please ignore everything else and select the final option indicating that you don’t really remember.” Responses were [I worked incredibly hard in school, I worked moderately hard in school, I didn’t work very hard in school, and I don’t recall how hard I worked]. Selecting anything but the last option indicated inattention. | (a) Latveria is a fictional nation ruled by Doctor Doom in the Marvel Comic Universe. This was an assessment of truthful response, with particular emphasis on the increased risk for “rare” datapoints (MacInnis et al., (b and c) In addition to the literature cited within the manuscript, our own experience also suggested that a meaningful segment of workers would be inattentive (Agley & Xiao, |
| Arm 4: Stringent Check | All questions from Arm 2 and Arm 3 were included in this arm. | This arm assessed whether there was a differential outcome when the approaches from Arm 2 and Arm 3 were combined. |
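
The arm-specific checks in the table above can be read as a retention rule per participant. The following is a minimal sketch of that logic, not the authors' implementation. It assumes that a participant is excluded on failing any single item assigned to their arm. The response keys (`emergency_number`, `vegetable_name`, `latveria`, `favorite_colors`, `school_effort`) are hypothetical names chosen for illustration, since the record does not give the Qualtrics variable names.

```python
# Hypothetical response keys and exact answer strings; the record does not
# give the survey's variable names, so these are illustrative assumptions.
ALL_COLORS = {"red", "blue", "yellow", "green", "purple"}
NEVER_LATVERIA = "no, never"
DONT_RECALL = "I don't recall how hard I worked"


def passes_bot_vpn_check(resp):
    """Arm 2 items: the U.S. emergency number and the U.S. name for the vegetable."""
    return (
        resp.get("emergency_number") == "911"
        and resp.get("vegetable_name") == "eggplant"
    )


def passes_truthfulness_attention_check(resp):
    """Arm 3 items: fictional-country item plus two instructed-response items."""
    return (
        resp.get("latveria") == NEVER_LATVERIA
        and set(resp.get("favorite_colors", [])) == ALL_COLORS
        and resp.get("school_effort") == DONT_RECALL
    )


def is_retained(arm, resp):
    """Return True if a participant in the given arm passes that arm's checks.

    Assumption: failing any one assigned item leads to exclusion.
    """
    if arm == 1:  # Control: no additional exclusion criteria
        return True
    if arm == 2:
        return passes_bot_vpn_check(resp)
    if arm == 3:
        return passes_truthfulness_attention_check(resp)
    if arm == 4:  # Stringent: all items from Arms 2 and 3
        return passes_bot_vpn_check(resp) and passes_truthfulness_attention_check(resp)
    raise ValueError(f"unknown arm: {arm}")
```

Under this sketch, the same set of answers can be retained in one arm and excluded in another, which is the contrast the trial design relies on.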
Fig. 1 Conceptual CONSORT diagram
Fig. 2 Interim data management
Fig. 3 Actual CONSORT diagram
Sociodemographic characteristics by study arm
| Characteristic | Arm 1 n | Arm 1 % | Arm 2 n | Arm 2 % | Arm 3 n | Arm 3 % | Arm 4 n | Arm 4 % | p |
|---|---|---|---|---|---|---|---|---|---|
| Gender | | | | | | | | | .025¹ |
| Male | 154 | 54.4 | 167 | 57.8 | 130 | 48.1 | 157 | 56.7 | |
| Female | 129 | 45.6 | 119 | 41.2 | 138 | 51.1 | 118 | 42.6 | |
| Transgender | 0 | 0.0 | 0 | 0.0 | 2 | 0.7 | 2 | 0.7 | |
| Other | 0 | 0.0 | 3 | 1.0 | 0 | 0.0 | 0 | 0.0 | |
| Ethnicity | | | | | | | | | .008¹ |
| Hispanic/Latino | 55 | 19.4 | 58 | 20.1 | 30 | 11.1 | 39 | 14.1 | |
| Non-Hispanic/Latino | 228 | 80.6 | 231 | 79.9 | 240 | 88.9 | 238 | 85.9 | |
| Race | | | | | | | | | .570¹ |
| White | 215 | 76.0 | 229 | 79.2 | 212 | 78.5 | 197 | 71.1 | |
| Black/African American | 38 | 13.4 | 35 | 12.1 | 28 | 10.4 | 44 | 15.9 | |
| American Indian or Alaska Native | 3 | 1.1 | 3 | 1.0 | 1 | 0.4 | 2 | 0.7 | |
| Asian | 22 | 7.8 | 12 | 4.2 | 20 | 7.4 | 20 | 7.2 | |
| Native Hawaiian or Pacific Islander | 0 | 0.0 | 1 | 0.3 | 1 | 0.4 | 1 | 0.4 | |
| Other | 2 | 0.7 | 3 | 1.0 | 4 | 1.5 | 5 | 1.8 | |
| More than One Race | 3 | 1.1 | 6 | 2.1 | 4 | 1.5 | 8 | 2.9 | |
| Education | | | | | | | | | .747¹ |
| Less than High School | 2 | 0.7 | 1 | 0.3 | 1 | 0.4 | 1 | 0.4 | |
| High School Graduate / GED | 51 | 18.0 | 56 | 19.4 | 48 | 17.8 | 63 | 22.7 | |
| Associate's Degree | 27 | 9.5 | 38 | 13.1 | 25 | 9.3 | 34 | 12.3 | |
| Bachelor's Degree | 146 | 51.6 | 136 | 47.1 | 127 | 47.0 | 120 | 43.3 | |
| Master's Degree | 51 | 18.0 | 51 | 17.6 | 62 | 23.0 | 51 | 18.4 | |
| Doctoral or Professional Degree | 6 | 2.1 | 7 | 2.4 | 7 | 2.6 | 8 | 2.9 | |
| Age (mean) | 38.4 | | 39.3 | | 39.9 | | 38.9 | | .516² |
1. Fisher’s exact test
2. ANOVA
Screening scores by study arm
| Screening tool (score range) | Arm 1 | Arm 2 | Arm 3 | Arm 4 |
|---|---|---|---|---|
| USAUDIT (0-46) | 13.6¹ | 12.1¹ | 9.1¹ | 9.3¹ |
| PHQ-9 (0-27) | 10.2² | 8.5³ | 7.2³ | 7.1³ |
| GAD-7 (0-21) | 8.6⁴ | 7.3⁴ | 6.7⁴ | 6.2⁴ |
1. USAUDIT score indicates Zone 2
2. PHQ-9 score indicates Moderate Depression
3. PHQ-9 score indicates Mild Depression
4. GAD-7 score indicates Mild Anxiety
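
The severity labels in the footnotes above follow from score cut points. As a minimal sketch, the code below maps a PHQ-9 or GAD-7 score to a band using the conventional published cut points (PHQ-9: 5, 10, 15, 20; GAD-7: 5, 10, 15). Those cut points are not stated in this record and are an assumption here, as is applying bands defined for individual scores to arm-level averages. USAUDIT zones are omitted because their thresholds are not given in this record. The names `PHQ9_BANDS`, `GAD7_BANDS`, and `severity_band` are illustrative.

```python
# Conventional cut points (lower bound of each band); assumed, not taken
# from this record.
PHQ9_BANDS = [
    (0, "Minimal"),
    (5, "Mild"),
    (10, "Moderate"),
    (15, "Moderately severe"),
    (20, "Severe"),
]
GAD7_BANDS = [
    (0, "Minimal"),
    (5, "Mild"),
    (10, "Moderate"),
    (15, "Severe"),
]


def severity_band(score, bands, max_score):
    """Return the label of the highest band whose lower bound is <= score."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} outside 0-{max_score}")
    label = bands[0][1]
    for lower_bound, name in bands:
        if score >= lower_bound:
            label = name
    return label


# Values from the screening scores table, keyed by study arm.
phq9_by_arm = {1: 10.2, 2: 8.5, 3: 7.2, 4: 7.1}
gad7_by_arm = {1: 8.6, 2: 7.3, 3: 6.7, 4: 6.2}

for arm in sorted(phq9_by_arm):
    phq = severity_band(phq9_by_arm[arm], PHQ9_BANDS, 27)
    gad = severity_band(gad7_by_arm[arm], GAD7_BANDS, 21)
    print(f"Arm {arm}: PHQ-9 {phq}, GAD-7 {gad}")
```

With these cut points, only the Arm 1 PHQ-9 value falls in a different band from the other arms, which matches the footnote markers in the table.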
ANOVA and Tukey HSD post hoc test scores
USAUDIT

| Source | SS | Df | MS | F | p |
|---|---|---|---|---|---|
| Between | 4011.45 | 3 | 1337.15 | 15.78 | <.001 |
| Within | 94480.65 | 1115 | 84.74 | - | - |

| Comparison | Mean difference | SE | 95% LL | 95% UL | p |
|---|---|---|---|---|---|
| Arm 1 vs. 2 | 1.57 | 0.77 | -0.41 | 3.55 | .176 |
| Arm 1 vs. 3 | 4.51 | 0.78 | 2.49 | 6.52 | <.001 |
| Arm 1 vs. 4 | 4.31 | 0.78 | 2.31 | 6.31 | <.001 |
| Arm 2 vs. 3 | 2.94 | 0.78 | 0.93 | 4.94 | .001 |
| Arm 2 vs. 4 | 2.74 | 0.77 | 0.75 | 4.74 | .002 |
| Arm 3 vs. 4 | -0.20 | 0.79 | -2.22 | 1.83 | .995 |

PHQ-9

| Source | SS | Df | MS | F | p |
|---|---|---|---|---|---|
| Between | 1807.76 | 3 | 602.59 | 11.94 | <.001 |
| Within | 56290.00 | 1115 | 50.48 | - | - |

| Comparison | Mean difference | SE | 95% LL | 95% UL | p |
|---|---|---|---|---|---|
| Arm 1 vs. 2 | 1.69 | 0.59 | 0.16 | 3.22 | .023 |
| Arm 1 vs. 3 | 3.04 | 0.60 | 1.49 | 4.60 | <.001 |
| Arm 1 vs. 4 | 3.14 | 0.60 | 1.59 | 4.68 | <.001 |
| Arm 2 vs. 3 | 1.35 | 0.60 | -0.20 | 2.90 | .112 |
| Arm 2 vs. 4 | 1.44 | 0.60 | -0.10 | 2.98 | .075 |
| Arm 3 vs. 4 | 0.09 | 0.61 | -1.47 | 1.65 | .999 |

GAD-7

| Source | SS | Df | MS | F | p |
|---|---|---|---|---|---|
| Between | 859.48 | 3 | 286.49 | 8.35 | <.001 |
| Within | 38267.90 | 1115 | 34.32 | - | - |

| Comparison | Mean difference | SE | 95% LL | 95% UL | p |
|---|---|---|---|---|---|
| Arm 1 vs. 2 | 1.23 | 0.49 | -0.03 | 2.49 | .058 |
| Arm 1 vs. 3 | 1.89 | 0.50 | 0.61 | 3.17 | .001 |
| Arm 1 vs. 4 | 2.32 | 0.50 | 1.05 | 3.59 | <.001 |
| Arm 2 vs. 3 | 0.66 | 0.50 | -0.62 | 1.93 | .545 |
| Arm 2 vs. 4 | 1.09 | 0.49 | -0.18 | 2.35 | .122 |
| Arm 3 vs. 4 | 0.43 | 0.50 | -0.86 | 1.72 | .828 |

*SS = Sum of squares; Df = degrees of freedom; MS = mean squares; 95%LL/UL = 95% confidence interval of the mean difference, lower and upper levels; SE = standard error
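
The Between and Within rows above are internally consistent: each mean square is the sum of squares divided by its degrees of freedom, and the F statistic is the ratio of the two mean squares. The sketch below re-derives F for each block from the tabulated SS and Df values. It assumes the three blocks correspond to USAUDIT, PHQ-9, and GAD-7 in that order, which is inferred from the Arm 1 vs. 2 differences matching the screening score table. The names `ANOVA_ROWS` and `f_statistic` are illustrative.

```python
# (SS between, Df between, SS within, Df within, reported F), copied from
# the table above. The scale labels are an inferred assumption.
ANOVA_ROWS = {
    "USAUDIT": (4011.45, 3, 94480.65, 1115, 15.78),
    "PHQ-9": (1807.76, 3, 56290.00, 1115, 11.94),
    "GAD-7": (859.48, 3, 38267.90, 1115, 8.35),
}


def f_statistic(ss_between, df_between, ss_within, df_within):
    """One-way ANOVA F: mean square between divided by mean square within."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within


for scale, (ssb, dfb, ssw, dfw, reported) in ANOVA_ROWS.items():
    f_value = f_statistic(ssb, dfb, ssw, dfw)
    print(f"{scale}: F({dfb}, {dfw}) = {f_value:.2f} (reported {reported})")
```

The degrees of freedom also carry sample-size information: with four arms, a within-groups Df of 1115 corresponds to 1119 analyzed participants, which equals the sum of the arm sizes in the sociodemographic table.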
Fig. 4 Skewness by study arm by scale
Fig. 5 Kurtosis by study arm by scale
Correlations between screening scores by study arm
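| Arm | Scale | PHQ-9 | GAD-7 |
|---|---|---|---|
| Arm 1 | USAUDIT | 0.562 | 0.524 |
| Arm 1 | PHQ-9 | 1 | 0.918 |
| Arm 2 | USAUDIT | 0.580 | 0.472 |
| Arm 2 | PHQ-9 | 1 | 0.884 |
| Arm 3 | USAUDIT | 0.536 | 0.403 |
| Arm 3 | PHQ-9 | 1 | 0.855 |
| Arm 4 | USAUDIT | 0.543 | 0.477 |
| Arm 4 | PHQ-9 | 1 | 0.882 |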
Note: All correlations significant at p < .001.