| Literature DB >> 28677322 |
Michael L Mortensen, Gaelen P Adam, Thomas A Trikalinos, Tim Kraska, Byron C Wallace.
Abstract
Systematic reviews are increasingly used to inform health care decisions, but are expensive to produce. We explore the use of crowdsourcing (distributing tasks to untrained workers via the web) to reduce the cost of screening citations. We used Amazon Mechanical Turk as our platform and 4 previously conducted systematic reviews as examples. For each citation, workers answered 4 or 5 questions that were equivalent to the eligibility criteria. We aggregated responses from multiple workers into an overall decision to include or exclude the citation using 1 of 9 algorithms and compared the performance of these algorithms to the corresponding decisions of trained experts. The most inclusive algorithm (designating a citation as relevant if any worker did) identified 95% to 99% of the citations that were ultimately included in the reviews while excluding 68% to 82% of irrelevant citations. Other algorithms increased the fraction of irrelevant articles excluded at some cost to the inclusion of relevant studies. Crowdworkers completed screening in 4 to 17 days, costing $460 to $2220, a cost reduction of up to 88% compared to trained experts. Crowdsourcing may represent a useful approach to reducing the cost of identifying literature for systematic reviews.
Keywords: crowdsourcing (MeSH); evidence-based medicine (MeSH); review literature as topic (MeSH); study selection; systematic review methods
Year: 2017 PMID: 28677322 PMCID: PMC5589498 DOI: 10.1002/jrsm.1252
Source DB: PubMed Journal: Res Synth Methods ISSN: 1759-2879 Impact factor: 5.273
Figure 1. A schematic of the crowdsourcing process used for this work.
Description of systematic review datasets
| Systematic Review Dataset (Reference) | Number of Citations Screened in Full Review, N | Citations Selected in the Experiment | Honeypots, Number | Screened in at Title/Abstract/Keyword Level (% of n) | Screened in on the Basis of Full Text (% of n) |
|---|---|---|---|---|---|
| Proton beam | 5208 | With PubMed records | 0 | 243 (5.1) | 22 (0.5) |
|  | 21 650 | With PubMed records, identified in updating and published in 2013 | 10 | 242 (14.5) | 61 (3.7) |
|  | 15 515 | With PubMed records, identified in updating and published between 2012 and 2014 | 10 | 183 (2.3) | 46 (0.5) |
|  | 9676 | With PubMed records, pertaining to the updated outcomes from a previous report and published between 2002 and 2015 | 10 | 310 (5.3) | 144 (2.5) |
No honeypots were used for quality control in this first review (see text).
Lessons learned from each experimental iteration of citation screening
| Experiment | Lessons Learned |
|---|---|
|  | Quality controls are needed to avoid spamming (ie, low-quality, "bare minimum" responses issued just to receive payment). |
|  | Answering 7 questions took a lot of time, even when the answer to one of the first questions was "No," which already precluded the citation from inclusion. |
|  | When asking for numerical facts, we can more easily detect errors (and thus spam) by asking for the number itself rather than a Yes/No answer about the number. |
|  | Workers did not understand the point of the NA answer. |
|  | Workers often had to scroll down to read definitions, hurting efficiency and result quality. |
|  | The payment was unnecessarily high. |
|  | Workers lacked a means of providing feedback. |
|  | Explaining the reasoning behind reducing payment would have been beneficial. |
|  | Reducing the payment from $0.20 to $0.10 increased the response time but did not reduce quality. |
|  | While question 2 (Q2), "How many humans were involved in the study?" was easy to answer, very few citations were excluded by it. Most citations were removed by Q1, followed by Q3, Q4, and finally Q2. |
|  | The interface was hard for workers to read because it lacked structure. |
|  | While the qualification test reduced the amount of spam, some workers passed the test but still later provided poor-quality responses. |
|  | Response was slow, likely in part because of the qualification requirements and the payment level. |
|  | Honeypots drastically reduced spam incidents. |
|  | Workers still reported a lack of structure in the interface. |
|  | The payment of $0.03 was too low, resulting in significant worker backlash, both directly and on the Mechanical Turk review site Turkopticon. |
|  | Removing the qualification test did not significantly affect response time from quality workers, but it did incur a significant cost: 3 honeypots per spammer were needed to determine whether a worker should be blocked. |
|  | Although the optimizations improved worker income per hour, the payment was still too low. As a result, similar worker backlash occurred and response time was poor. |
|  | Some workers were reading the citations in full before giving responses, heavily impacting response time. Some even tried to understand each medical concept before answering, to avoid making mistakes. |
|  | Instructions for how to complete the citation screening were not clear enough and left too many details up to the worker (eg, how to quickly determine a firm "No"). |
|  | Instructions gave no insight into the mechanics of citation screening, leading workers to fear that many citations were actually relevant even when they thought the answer to a question was "No." |
|  | The price point of $0.15 improved response time again; however, some workers were still not performing efficiently enough, resulting in low hourly pay for those workers. |
|  | Some workers still did not understand what we were studying and questioned the purpose of the work. |
|  | Removing the honeypots was a mistake. Knowing that no hidden tests were present, a few workers began providing erroneous responses. |
|  | The explanation at the beginning of each HIT is useful when the worker is not aware of the purpose. After it has been read a few times, however, it only clutters the page and forces scrolling for each HIT; removing it would improve response time further. |
|  | Some workers expressed a wish to be retrained when sufficient time has passed between experiments. One option would be to introduce a nonpaid training step at the start of each experiment cycle, giving experienced workers the opportunity to refresh their skills before working on actual citations. |
|  | Some workers misunderstood our auto-approval of their HITs as a seal of approval of the correctness of their answers. A better description of the purpose of honeypots and the approval process could possibly solve this issue. |
|  | Some workers expressed disapproval of the conditional questions in our DST experiment. Specifically, they found the bundling of several questions and conditionals into one question confusing, eg, "IF this study is about patients, is it a randomized controlled trial (RCT) with at least 10 participants in each group, OR, IF the study is about providers, is it a study with some form of a comparison aspect (eg, RCT, but also nonrandomized groups, before/after comparisons, etc)?" |
|  | A conditional interface, dynamically showing the relevant questions depending on worker answers, may be a solution. |
|  | To further avoid low-quality workers, one could automatically flag a worker as questionable if his/her completion time per HIT is unrealistically low. Such a flag could be used to temporarily block the worker until his/her answers have been evaluated manually. |
The following subsections describe lessons learned from each experimental iteration of citation screening on the Proton beam dataset. The final interface, processing, and quality controls were developed over several months during the summer of 2014. We note that this preliminary work was necessary because no prior work on crowdsourcing citation screening existed. Once we settled on our setup and interface, comparatively little effort was needed to begin acquiring crowd labels for citations from new datasets.
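As a rough illustration of the kind of quality controls described above (honeypot citations with known answers, plus flagging of implausibly fast completion times), here is a minimal sketch in Python; the `Response` structure, field names, and thresholds are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Response:
    worker_id: str
    citation_id: str
    is_honeypot: bool       # citation with a known, expert-verified answer
    passed_honeypot: bool   # worker's answer matched the known answer (only meaningful for honeypots)
    seconds_spent: float    # time the worker spent on the HIT

# Illustrative thresholds (assumptions, not values reported in the paper)
MAX_HONEYPOT_FAILURES = 3   # cf. the 3 honeypots per spammer needed to decide on blocking
MIN_PLAUSIBLE_SECONDS = 10  # flag workers whose completion time per HIT is unrealistically low

def workers_to_review(responses):
    """Return worker IDs to block temporarily, pending manual review of their answers."""
    failures, too_fast = {}, set()
    for r in responses:
        if r.is_honeypot and not r.passed_honeypot:
            failures[r.worker_id] = failures.get(r.worker_id, 0) + 1
        if r.seconds_spent < MIN_PLAUSIBLE_SECONDS:
            too_fast.add(r.worker_id)
    failed = {w for w, n in failures.items() if n >= MAX_HONEYPOT_FAILURES}
    return failed | too_fast
```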
Figure A1. Appendicitis review: citation screening HIT interface.
Figure A2. Appendicitis review: honeypot failure, with feedback for workers.
Example clarifying the 9 aggregation strategies
|  | W1 | W2 | W3 | W4 | W5 | Majority | p1 | p2 | p3 | p4 | p5 | Champion | Champion (DR) | Majority Question |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q1 | Yes | Yes | Yes | Cannot tell | No | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | 12 Yes or Cannot Tell/20 maximum answers |
| Q2 | Yes | Yes | Yes | No | — | Yes | Yes | Yes | Yes | No | No | Yes | Yes | |
| Q3 | Yes | Yes | Yes | — | — | Yes | Yes | Yes | Yes | No | No | Yes | Yes | |
| Q4 | Yes | Yes | No | — | — | No | Yes | Yes | No | No | No | Yes | Yes | |
| Citation screened in? | Yes | Yes | No | No | No | No | Yes | Yes | No | No | No | Yes | Yes | Yes |
Q1-Q4 denote questions 1 through 4; W1-W5 denote crowdworkers 1 through 5. In this example, using the p1, p2, Champion, Champion (DR), or Majority Question aggregation algorithms would have resulted in the citation being screened in; using p3, p4, or p5 would have led to exclusion (a simple sketch of the pk family of rules appears below).
A dash (—) indicates a question that was not posed because a previous answer was No.
Imputing No for questions that were not posed because of a previous No answer.
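The abstract describes the most inclusive algorithm as screening a citation in if any single worker would include it. Below is a minimal sketch of that family of threshold rules (screen in if at least k workers would individually include the citation), assuming a worker includes a citation only when none of the questions posed to them was answered No; the majority rule shown votes over individual worker decisions, and the sketch does not reproduce the Champion or Majority Question strategies:

```python
# Each worker's answers: question -> "Yes" | "No" | "Cannot tell"
# (questions skipped after an earlier "No" are simply absent).
def worker_includes(answers):
    """A worker's individual decision: include unless some posed question was answered No."""
    return all(a != "No" for a in answers.values())

def at_least_k(workers_answers, k):
    """pk-style rule: screen the citation in if at least k workers would include it."""
    votes = sum(worker_includes(a) for a in workers_answers.values())
    return votes >= k

def majority(workers_answers):
    """Majority over individual worker decisions."""
    return at_least_k(workers_answers, len(workers_answers) // 2 + 1)

# Worked example using the answers from the table above (W1-W5, Q1-Q4):
crowd = {
    "W1": {"Q1": "Yes", "Q2": "Yes", "Q3": "Yes", "Q4": "Yes"},
    "W2": {"Q1": "Yes", "Q2": "Yes", "Q3": "Yes", "Q4": "Yes"},
    "W3": {"Q1": "Yes", "Q2": "Yes", "Q3": "Yes", "Q4": "No"},
    "W4": {"Q1": "Cannot tell", "Q2": "No"},
    "W5": {"Q1": "No"},
}
print(at_least_k(crowd, 1))  # True: p1, the most inclusive rule, screens the citation in
print(at_least_k(crowd, 3))  # False: only 2 of 5 workers would include it
print(majority(crowd))       # False
```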
Figure 2. Results on each dataset using the 9 aggregation strategies.
Experimental results for the 9 aggregation strategies across the 4 datasets (each cell reports yield; gain)
| Dataset | Majority | 1p | 2p | 3p | 4p | 5p | Champion | Champion (DR) | Majority Question |
|---|---|---|---|---|---|---|---|---|---|
| Proton beam | 0.95; 0.86 | 0.95; 0.82 | 0.86; 0.92 | 0.86; 0.93 | 0.86; 0.93 | 0.86; 0.93 | 0.95; 0.85 | 0.95; 0.85 | 0.95; 0.78 |
| Appendicitis | 0.92; 0.97 | 0.97; 0.87 | 0.93; 0.95 | 0.89; 0.98 | 0.80; 0.99 | 0.64; 1.00 | 0.93; 0.96 | 0.97; 0.92 | 1.00; 0.64 |
| DST | 0.80; 0.99 | 0.98; 0.78 | 0.91; 0.96 | 0.76; 0.99 | 0.57; 1.00 | 0.22; 1.00 | 0.93; 0.91 | 0.93; 0.91 | 0.93; 0.91 |
| Omega3 | 0.74; 0.93 | 0.99; 0.68 | 0.93; 0.85 | 0.71; 0.94 | 0.38; 0.98 | 0.13; 0.99 | 0.93; 0.84 | 0.94; 0.82 | 0.92; 0.80 |
Lack of improvement after 3p is due to a small number of unconscientious workers in the pool.
In the Proton beam dataset we did not use honeypots as a quality control mechanism (see text).
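For reference, yield and gain figures of this kind can be computed as sketched below, under the assumption that yield is the proportion of expert-included citations that the crowd decision also screens in and gain is the proportion of all screened citations that the crowd decision screens out; these definitions are our reading of the table and may differ in detail from the paper's:

```python
def yield_and_gain(crowd_include, expert_include):
    """Both arguments map citation id -> True (screen in) / False (screen out)."""
    ids = list(expert_include)
    relevant = [c for c in ids if expert_include[c]]
    y = sum(crowd_include[c] for c in relevant) / len(relevant)  # assumed yield: relevant citations retained
    g = sum(not crowd_include[c] for c in ids) / len(ids)        # assumed gain: citations screened out
    return y, g
```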
Figure A3. Time elapsed (hours) vs number of crowd screening decisions received.
Fleiss kappa (a measure of agreement) calculated for each question
| Dataset | Q1 Kappa | Q2 Kappa | Q3 Kappa | Q4 Kappa | Average Kappa |
|---|---|---|---|---|---|
| Appendicitis | 0.252 | 0.500 | 0.387 | 0.196 | 0.333 |
| DST | 0.056 | 0.057 | 0.018 | −0.030 | 0.026 |
| Omega3 | 0.245 | 0.203 | 0.116 | NA | 0.188 |
| ProtonBeam | 0.175 | 0.128 | 0.063 | 0.071 | 0.109 |
Average agreement ranges (across reviews) from slight to fair, motivating the use of aggregation strategies.
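For completeness, Fleiss' kappa can be computed from an items x categories count table with the standard formula, as in the sketch below (the three-category example data are illustrative, not the study's responses):

```python
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j] = number of workers assigning citation i to answer category j.
    Every row must sum to the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)              # category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()                 # observed vs chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Example: 3 citations, 5 workers each, answer categories (Yes, No, Cannot tell)
answers = [[4, 1, 0],
           [2, 2, 1],
           [5, 0, 0]]
print(round(fleiss_kappa(answers), 3))  # 0.043
```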
Costs and duration of each crowdsourcing experiment
| Dataset | Worker Salary (with Amazon Fee) | Approximate Cost of Experts' Screening (with Fringe Benefits) | Experiment Running Time (after Task Setup) |
|---|---|---|---|
| Proton beam | $1187.25 ($1305.98) | $6859.67 ($8917.57) | 4 d, 21 h, and 36 min |
| Appendicitis | $416.00 ($457.60) | $3034.23 ($3944.50) | 5 d, 10 h, and 58 min |
| DST | $2017.75 ($2219.53) | $6173.75 ($8025.88) | 16 d, 20 h, and 11 min |
| Omega3 | $2020.90 ($2222.99) | $8776.79 ($11 409.83) | 6 d, 16 h, and 17 min |
At the time we ran our experiments, Amazon Mechanical Turk charged a 10% commission fee on each HIT, with a minimum payment of $0.005 per HIT; this has since been increased to 20% (https://requestersandbox.mturk.com/pricing).
Fringe benefit costs are estimated here to be 30% of salary, reflecting (roughly) the true costs at the institutes at which this work was performed (Tufts and Brown).
Because of the higher complexity of questions for this review, worker compensation was increased from $0.15 to $0.21 per HIT.
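As a worked check of the cost columns, the totals follow directly from the raw amounts and the stated overheads (10% Amazon commission at the time of the experiments, and fringe benefits estimated at 30% of expert salary); the sketch below reproduces the Appendicitis row and the roughly 88% cost reduction cited in the abstract:

```python
AMT_COMMISSION = 0.10  # Amazon Mechanical Turk fee at the time (since increased to 20%)
FRINGE_RATE = 0.30     # estimated fringe benefits as a fraction of expert salary

def crowd_cost(worker_salary):
    return worker_salary * (1 + AMT_COMMISSION)

def expert_cost(expert_salary):
    return expert_salary * (1 + FRINGE_RATE)

# Appendicitis row from the table above
print(round(crowd_cost(416.00), 2))                             # 457.60
print(round(expert_cost(3034.23), 2))                           # 3944.50
print(round(1 - crowd_cost(416.00) / expert_cost(3034.23), 2))  # 0.88, ie, an ~88% cost reduction
```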
Estimated hourly pay rates for workers, using different thresholds to infer when workers were not actively working. See text for discussion.
| Dataset | 30 min | 15 min | 10 min | 5 min |
|---|---|---|---|---|
| Appendicitis | $3.73 | $3.94 | $4.15 | $4.41 |
| DST | $3.60 | $4.06 | $4.31 | $4.97 |
| Omega3 | $6.25 | $6.45 | $6.75 | $7.44 |
| ProtonBeam | $5.89 | $6.29 | $6.40 | $7.08 |
| Average overall | $4.87 | $5.18 | $5.40 | $5.97 |
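The rates above depend on a threshold for deciding when a worker was not actively working between HIT submissions. One plausible reading, sketched below, assumes that gaps between consecutive submissions longer than the threshold are excluded from active working time; the timestamps and payment in the example are hypothetical:

```python
def estimated_hourly_pay(submission_times, pay_per_hit, idle_threshold_minutes):
    """submission_times: sorted HIT submission timestamps (in seconds) for one worker.
    Gaps longer than the idle threshold are treated as time not spent working."""
    threshold = idle_threshold_minutes * 60
    active_seconds = 0
    for earlier, later in zip(submission_times, submission_times[1:]):
        gap = later - earlier
        if gap <= threshold:
            active_seconds += gap
    total_pay = pay_per_hit * len(submission_times)
    return total_pay / (active_seconds / 3600)

# Hypothetical worker: 20 HITs at $0.15, with one long break after the 10th HIT
times = [i * 90 for i in range(10)] + [3600 + i * 90 for i in range(10)]
print(round(estimated_hourly_pay(times, 0.15, idle_threshold_minutes=5), 2))  # 6.67
```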
We are interested in the effects of omega‐3 fatty acids. Synonyms of "omega‐3 fatty acids" include "n‐3 fatty acids", "(long chain) PUFA", and "(long chain) polyunsaturated fatty acids".
| Accepted outcome types | Minimum follow‐up duration | Minimum number of participants | Accepted study types |
|---|---|---|---|
| Cardiovascular disease outcomes | 1 year | None | We are interested in studies that: |