Abstract
Participants from public participant panels, such as Amazon Mechanical Turk, are shared across many labs and participate in many studies during their panel tenure. Here, I demonstrate direct and indirect downstream consequences of frequent exposure in three studies (combined N for Studies 1–3 = 3,660), focusing on the cognitive reflection test (CRT), one of the most frequently used cognitive measures in online research. Study 1 explored several variants of the signature bat-and-ball item in samples recruited from Mechanical Turk. Panel tenure was shown to impact responses to both the original and merely similar items. Solution rates were not found to be higher than in a commercial online panel with less exposure to the CRT (Qualtrics panels, n = 1,238). In Study 2, an alternative test with transformed numeric values showed higher correlations with validation measures than the original test. Finally, Study 3 investigated sources of item familiarity and measured performance on novel lure items.
Keywords: Mechanical Turk (MTurk); cognitive reflection test (CRT); online research; practice effects; professional participants
Year: 2019 PMID: 31866890 PMCID: PMC6909056 DOI: 10.3389/fpsyg.2019.02646
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Original CRT items and items presented in Studies 1–3: study, variant name, item text, correct solution, and intuitive solution.
| Study | Variant | Item text | Correct solution | Intuitive solution |
|---|---|---|---|---|
| CRT | I1 | A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? | $0.05 | $0.10 |
| | I2 | If it takes 5 machines 5 min to make 5 widgets, how long would it take 100 machines to make 100 widgets? | 5 min | 100 min |
| | I3 | In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? | 47 days | 24 days |
| Study 1 | Original | A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? [in cents] | 5 | 10 |
| | Complementary | A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the bat cost? [in cents] | 105 | 100 |
| | Trivial | A bat and a ball cost $1.10 in total. The bat costs more than the ball. It costs $1.00. How much does the ball cost? [in cents] | 10 | 10 |
| | Transformed | A golden bat and a golden ball cost $5,000 in total. The golden bat costs $4,000 more than the golden ball. How much does the golden ball cost? [in $] | 500 | 1,000 |
| Study 2 (CRTt) | T1 | A golden bat and a golden ball cost $5,000 in total. The golden bat costs $4,000 more than the golden ball. How much does the golden ball cost? [in $] | 500 | 1,000 |
| | T2 | If it takes 10 machines 10 min to make 10 widgets, how long would it take 1,000 machines to make 1,000 widgets [in minutes]? | 10 | 1,000 |
| | T3 | In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 40 days for the patch to cover the entire lake, how long would it take for the patch to cover a quarter of the lake [in days]? | 38 | 10 |
| Study 3 | I1 | A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? [in cents] | 5 | 10 |
| | I2 | If it takes 5 machines 5 min to make 5 widgets, how long would it take 100 machines to make 100 widgets? [in minutes] | 5 | 100 |
| | N1 | Peter has four friends. Together they are able to carry 40 boxes. If Peter had 20 friends instead, how many boxes would they be able to carry? | 160/168 | 200 |
| | N2 | If you divided a long baguette by four cuts into even pieces, each piece would be 18 cm long. How long would a piece be if you did it with eight cuts? [in cm] | 10 | 9 |
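The gap between the correct and the intuitive solutions listed for these items follows from simple arithmetic. The sketch below reproduces the tabulated values for the bat-and-ball, lily-pad, and baguette (N2) items. The function names are illustrative and do not come from the paper, and the formulas for the intuitive answers are only one plausible reading of how the tabulated lure values arise.

```python
from fractions import Fraction


def bat_and_ball(total, diff):
    """Return (correct, intuitive) price of the cheaper object.

    Correct: ball + (ball + diff) = total, so ball = (total - diff) / 2.
    Intuitive (assumed lure): simply subtract, total - diff.
    """
    total, diff = Fraction(total), Fraction(diff)
    return (total - diff) / 2, total - diff


def lily_pad_days(full_days, halvings):
    """Days until a daily-doubling patch covers 1 / 2**halvings of the lake.

    Each halving of the target area moves the date one day earlier.
    """
    return full_days - halvings


def baguette_piece_length(cuts_before, length_before, cuts_after):
    """Return (correct, intuitive) piece length after a new number of cuts.

    Correct: k cuts yield k + 1 pieces, so the total length is
    (cuts_before + 1) * length_before.
    Intuitive (assumed lure): treat the number of cuts as the number of pieces.
    """
    total_length = (cuts_before + 1) * length_before
    correct = Fraction(total_length, cuts_after + 1)
    intuitive = Fraction(cuts_before * length_before, cuts_after)
    return correct, intuitive


if __name__ == "__main__":
    for label, result in [
        ("original bat-and-ball [cents]", bat_and_ball(110, 100)),
        ("transformed bat-and-ball [$]", bat_and_ball(5000, 4000)),
        ("baguette N2 [cm]", baguette_piece_length(4, 18, 8)),
    ]:
        correct, intuitive = result
        print(f"{label}: correct={correct}, intuitive={intuitive}")
    print("lily pads, half of lake (48 days):", lily_pad_days(48, 1))
    print("lily pads, quarter of lake (40 days):", lily_pad_days(40, 2))
```

For the complementary variant, the bat's price is the ball's correct price plus the stated difference (5 + 100 = 105 cents).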
Figure 1. Relative frequencies of response categories for the four question variants in Study 1 [(A) original, (B) complementary, (C) trivial, and (D) transformed variant] in percent: All non-listed responses are categorized as Other; error bars mark the 95% CI of the proportion.
Figure 2. Relationship between response categories and response time/number of HITs in Study 1: Plots show average values of the log-transformed response time (left column) and the log-transformed number of previous HITs (right column). Each row contains the plots for one of the four task variants (from top to bottom: original, complementary, trivial, transformed); whiskers correspond to the 95% CI of the mean.
Figure 3. (A) Average proportion of correct answers to the original variant of the problem for participants who indicated experience (black dots) or no experience (white dots) with the bat-and-ball problem, separated by the category of previous HITs on MTurk. Dot areas correspond to the proportions of participants with and without experience for a given interval of HITs. Whiskers indicate 95% CIs for the proportions. (B) Percentage of correct responses to the original variant (circles) and the trivial variant (triangles) for participant groups whose stated number of previous HITs falls into different categories. Whiskers indicate 95% CIs for the proportions.
Figure 4. (A) Relative frequencies of response categories for the standard variant (Qualtrics data, n = 1,238): All non-listed responses are categorized as Other; error bars mark the 95% CI of the proportion. (B) Average log-transformed response time for the standard variant (Qualtrics data); whiskers correspond to the 95% CI of the mean.
Figure 5. Absolute frequencies of CRT scores and deviations for CRTt scores split by categories of self-reported number of completed HITs in Study 2: Each row reports on the group of participants whose number of reported HITs falls into the specified interval. The left mosaic plot shows absolute numbers of the four possible scores; the numbers on the right side show differences for the CRTt frequencies, with positive numbers indicating a larger frequency for the CRTt. Each rectangle is proportional in size to the observed frequency of the combination of score and participant group. Relative frequencies of the five categories are reported in the middle column.
Figure 6. Proportion of participants giving correct answers to the three items in Study 2 split by panel tenure: Markers represent proportions of correct answers split by self-reported number of HITs; vertical lines represent 95% CIs for the proportions.
Figure 7. Relative frequencies of response categories for the three items (rows) in the original version (CRT, left column) and the transformed variant (CRTt, right column) in Study 2: All non-listed responses are categorized as Other; error bars mark the 95% CI of the proportion. Bars correspond to the results in the MTurk sample. For the CRT items, results for the Qualtrics sample are marked by the letter Q with CI bars.
Mean scale scores for the CRT (Qualtrics and MTurk) and the CRTt (MTurk) split by gender, with tests for differences (two-sided independent-samples t-test; MTurk: n = 344 and n = 354; Qualtrics: n = 692 and n = 546), 95% CIs for the difference in group means, and Cohen's d.
| Scale | Sample | Gender group 1, M (SD) | Gender group 2, M (SD) | Difference | t | p | 95% CI | Cohen's d |
|---|---|---|---|---|---|---|---|---|
| CRT | MTurk | 1.78 (1.21) | 2.07 (1.15) | 0.29 | 3.26 | 0.001 | [0.12, 0.47] | −0.25 |
| CRTt | MTurk | 1.40 (1.11) | 1.77 (1.12) | 0.37 | 4.34 | <0.001 | [0.20, 0.53] | −0.33 |
| CRT | Qualtrics | 0.28 (0.69) | 0.66 (0.93) | 0.38 | 8.23 | <0.001 | [0.29, 0.47] | −0.46 |
Standard deviations are presented in brackets after the means.
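The reported t values and effect sizes can be approximately recomputed from the summary statistics alone. The sketch below does this for the MTurk CRT row, assuming a pooled standard deviation (Student's t-test with equal variances) and assuming that the group sizes in the caption are listed in the same order as the two mean columns. Neither assumption is confirmed by the extracted text, and the function names are illustrative.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d as (first mean - second mean) / pooled SD."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def student_t(m1, sd1, n1, m2, sd2, n2):
    """Student's t for (second mean - first mean), assuming equal variances."""
    standard_error = pooled_sd(sd1, n1, sd2, n2) * math.sqrt(1 / n1 + 1 / n2)
    return (m2 - m1) / standard_error


if __name__ == "__main__":
    # MTurk CRT row: means 1.78 and 2.07, SDs 1.21 and 1.15, n = 344 and 354.
    row = (1.78, 1.21, 344, 2.07, 1.15, 354)
    print(f"d = {cohens_d(*row):.2f}")
    print(f"t = {student_t(*row):.2f}")
```

Because the inputs are rounded to two decimals, the recomputed values match the tabulated ones only approximately (the table reports t = 3.26 and d = −0.25 for this row).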
Pearson correlations and Steiger's Z for original CRT (o) and new CRTt (t) with subjective numeracy (Fagerlin et al., 2007) and financial literacy (Hastings et al., 2013) in Study 2.
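| Measure | r with CRT (o) | r with CRTt (t) | Steiger's Z |
|---|---|---|---|
| Subjective numeracy | 0.24 (p < 0.001) | 0.30 (p < 0.001) | −2.68 (p = 0.007) |
| Financial literacy | 0.34 (p < 0.001) | 0.38 (p < 0.001) | −1.95 (p = 0.051) |
The value in brackets in the last column is the p-value for Z.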
Figure 8. Average proportion of correct answers to item 1 (A) and item 2 (B) for participants in Study 3 who indicated exposure (black dots) or no exposure (white dots) to the bat-and-ball problem, separated by the category of number of previous HITs on MTurk. Dot sizes correspond to the proportions of participants with and without experience for a given category of HITs. Whiskers indicate 95% CIs for the proportions.
Figure 9. Results for item 1 (I1), item 2 (I2), and the two novel items (N1 and N2) in Study 3. Subfigures show the proportion of intuitive and correct responses (top row), average logarithmized response times for response types (second row), and average logarithmized number of HITs (bottom row). Bars represent 95% CIs for proportions and means, respectively.
Figure 10. Cross-tabulation of correct and false responses for original and novel items (showing rounded percentages) in Study 3. Improvements from row item to column item are captured in the lower left corner (in blue), worse performance in the upper right corner (in red).
Relative number of male participants and attention check errors: Proportions, differences in proportion and CIs for differences in proportions split by correct and false solutions for the items in Study 3.
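| Item | Male, solution group 1 | Male, solution group 2 | Difference | CI for difference | Attention check errors, solution group 1 | Attention check errors, solution group 2 | Difference | CI for difference |
|---|---|---|---|---|---|---|---|---|
| I1 | 47.1% | 60.5% | 13.4% | [7.2%, 19.4%] | 15.6% | 8.4% | −7.3% | [−11.2%, −3.2%] |
| I2 | 41.3% | 60.9% | 19.6% | [12.5%, 26.5%] | 14.0% | 9.7% | −4.4% | [−9.1%, 0.2%] |
| N1 | 48.8% | 56.0% | 7.3% | [−4.7%, 18.8%] | 11.0% | 8.8% | −2.2% | [−8.4%, 6.1%] |
| N2 | 52.0% | 78.6% | 26.6% | [−5.8%, 44.4%] | 14.4% | 14.3% | −0.1% | [−9.7%, 17.5%] |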
EV-scale scores and performance on original items: Means, mean differences and CIs for the mean difference split by correct and false solutions for the items in Study 3.
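| Item | EV scale, solution group 1 | EV scale, solution group 2 | Mean difference | CI for difference | CRT (1+2), solution group 1 | CRT (1+2), solution group 2 | Mean difference | CI for difference |
|---|---|---|---|---|---|---|---|---|
| I1 | 1.49 | 1.97 | 0.48 | [0.36, 0.59] | | | | |
| I2 | 1.51 | 1.88 | 0.37 | [0.23, 0.50] | | | | |
| N1 | 1.62 | 1.79 | 0.17 | [−0.06, 0.41] | 0.88 | 1.51 | 0.62 | [0.40, 0.84] |
| N2 | 1.71 | 2.00 | 0.29 | [−0.06, 0.64] | 0.92 | 1.86 | 0.95 | [0.72, 1.18] |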
Note that correct or incorrect answers to I1 and I2 limit the possible range of values for CRT (1+2); affected cells show values in italics.