Christopher J Brady, Andrea C Villanti, Jennifer L Pearson, Thomas R Kirchner, Omesh P Gupta, Chirag P Shah.
Abstract
BACKGROUND: Screening for diabetic retinopathy is both effective and cost-effective, but rates of screening compliance remain suboptimal. As screening improves, new methods of handling screening data may help reduce human resource needs. Crowdsourcing has been used in many contexts to harness distributed human intelligence for the completion of small tasks, including image categorization.
Keywords: Amazon Mechanical Turk; crowdsourcing; diabetic retinopathy; fundus photography; telemedicine
Year: 2014 PMID: 25356929 PMCID: PMC4259907 DOI: 10.2196/jmir.3807
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. Screenshot of the Amazon Mechanical Turk Web interface for fundus photo grading.
Turker grading of individual imagesa.
| Image # | Two-category rating | | | Three-category rating | | | Four-category rating | | |
| | Expert | Correct, % | Turkerc | Expert | Correct, % | Turkerc | Expert | Correct, % | Turkerc |
| 1 | Nor | 65 | — | Nor | 90 | — | Nor | 55 | — |
| 2 | Ab | 85 | — | M/M | 50 | Sev | Mild | 0 | Sev |
| 3 | Nor | 70 | — | Nor | 70 | — | Nor | 70 | — |
| 4 | Nor | 50 | Ab | Nor | 40 | M/M | Nor | 60 | — |
| 5 | Nor | 80 | — | Nor | 70 | — | Nor | 50 | — |
| 6 | Ab | 100 | — | M/M | 90 | — | Mild | 20 | Mod |
| 7 | Ab | 90 | — | Severe | 60 | — | Sev | 10 | Mod |
| 8 | Nor | 50 | Ab | Sev | 40 | M/M | Nor | 65 | — |
| 9 | Ab | 100 | — | Sev | 95 | — | Sev | 100 | — |
| 10 | Ab | 100 | — | Sev | 40 | M/M | Sev | 70 | — |
| 11 | Ab | 90 | — | Sev | 0 | M/M | Sev | 20 | Mild |
| 12 | Nor | 90 | — | Nor | 80 | — | Nor | 90 | — |
| 13 | Ab | 100 | — | M/M | 30 | Sev | Mod | 20 | Sev |
| 14 | Ab | 80 | — | Sev | 40 | M/M | Sev | 10 | Mod |
| 15 | Nor | 90 | — | Nor | 100 | — | Nor | 90 | — |
| 16 | Ab | 90 | — | Sev | 70 | — | Sev | 50 | — |
| 17 | Ab | 100 | — | M/M | 60 | — | Mild | 10 | Mod |
| 18 | Ab | 100 | — | M/M | 100 | — | Mod | 95 | — |
| 19 | Ab | 90 | — | M/M | 80 | — | Mild | 20 | Mod |
| | | Individualb | Consensus | | Individualb | Consensus | | Individualb | Consensus |
| Correct, % | | 81.3 | 89.5 | | 64.4 | 63.2 | | 50.9 | 57.9 |
| Sensitivityd, % | | 93.6 | 100.0 | | 96.3 | 100.0 | | 96.3 | 100.0 |
| Specificityd, % | | 67.8 | 71.4 | | 66.7 | 71.4 | | 66.7 | 100.0 |
aNor=Normal; Ab=Abnormal; M/M=Mild or Moderate; Sev=Severe; Mod=Moderate.
bAt the level of the individual graders.
cConsensus rating presented only if it differed from the expert rating.
dCalculated for normal versus any disease level.
Time to complete ratings (in seconds).
| | Two-category rating | Three-category rating | Four-category rating | Four-category rating (improved training) | Four-category rating (increased approval) | Four-category rating (Master Graders)a |
| Mean time per HIT, s | 25.16 | 50.87 | 54.52 | 50.98 | 38.79 | 44.14 |
| 95% CI | 21.93-28.38 | 43.18-58.55 | 46.15-62.88 | 39.66-62.30 | 31.65-45.93 | 36.00-52.27 |
| Hourly wage, $ | 14.31 | 7.08 | 6.60 | 7.06 | 9.28 | 12.23 |
| Cost per image, $ | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.95 |
aMaster Graders received US $0.15 per rating, plus a 30% Amazon commission, for a total cost of US $0.195 per rating.
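The hourly-wage row in the table above follows directly from the payment per HIT and the mean completion time. A minimal sketch, assuming payments of US $0.10 per HIT for the standard trials and US $0.15 for Master Graders (these per-HIT amounts are inferred from the reported wages, not stated in the table):

```python
# Effective hourly wage implied by a per-HIT payment and mean completion time.
# The $0.10 and $0.15 per-HIT payments are assumptions inferred from the
# reported wages ($14.31/h at 25.16 s/HIT; $12.23/h at 44.14 s/HIT).

def hourly_wage(payment_usd: float, mean_seconds: float) -> float:
    """Hourly wage for a task paying `payment_usd` per HIT of `mean_seconds`."""
    return payment_usd * 3600.0 / mean_seconds

print(round(hourly_wage(0.10, 25.16), 2))  # two-category rating -> 14.31
print(round(hourly_wage(0.15, 44.14), 2))  # Master Graders -> 12.23
```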
Figure 2. Area under the curve (AUC) of the receiver operating characteristic (ROC) curve for increasing numbers of Turker interpretations of a prototypical image from each severity level. Turkers had low accuracy for the Mild (Panel A) and Severe (Panel C) images, but acceptable accuracy for the Moderate image (Panel B). When all four images were analyzed for absence or presence of disease only, Turkers performed well (Panel D), with a highly significant AUC.
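An AUC of the kind plotted in Figure 2 can be computed without fitting a curve: score each image by the fraction of Turkers calling it abnormal, and take the Mann-Whitney probability that a diseased image outscores a normal one. A sketch under that assumption (the image scores below are illustrative, not the study's actual ratings):

```python
# Empirical (Mann-Whitney) AUC for binary expert labels, scoring each image
# by the fraction of Turkers who rated it abnormal. Data are illustrative.

def auc(scores, labels):
    """P(score of diseased image > score of normal image); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # expert: diseased
    neg = [s for s, y in zip(scores, labels) if y == 0]  # expert: normal
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Fraction of Turkers voting "abnormal" per image; expert label 1 = diseased.
scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
labels = [1, 1, 0, 0, 1, 0]
print(auc(scores, labels))  # -> 1.0 (perfect separation in this toy example)
```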
Turker consensus in Phase III.
| | Number correct (mean)a | Correct (mean), % | Number correct (mode)a | Correct (mode), % | Sensitivityb, % | Specificityb, % |
| Phase I: Four-category rating | 5 | 26.3 | 11 | 57.9 | 100.0 | 100.0 |
| Phase III: Trial 1 (improved training) | 4 | 21.1c | 8d | 42.1 | 100.0 | 57.1 |
| Phase III: Trial 2 (raised approval rating) | 10 | 52.6 | 11e | 57.9 | 100.0 | 100.0 |
| Phase III: Trial 3 (Master Graders) | 7 | 36.8 | 11 | 57.9 | 100.0 | 100.0 |
aCalculated by level (eg, Turker consensus matches expert designation as normal, mild, moderate, and severe).
bCalculated for normal versus any disease level using the mode consensus score.
cAfter excluding a single Turker with systematically higher scores, 42.1% correct.
dThree images had no mode and were considered incorrect for “Number Correct” and “% correct” but recoded as abnormal for sensitivity and specificity.
eOne image had no mode and was considered incorrect for “Number Correct” and “% correct” but recoded as abnormal for sensitivity and specificity.
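The consensus rules in the footnotes above can be sketched as follows: the mode of the Turker ratings is taken per image, images with no unique mode are recoded as abnormal for sensitivity and specificity, and both metrics are computed for normal versus any disease level. The ratings below are illustrative, not the study data:

```python
# Mode-based Turker consensus with the footnotes' recoding rules:
# no unique mode -> no consensus; treated as abnormal for sens/spec,
# which are computed for normal ("Nor") vs any disease level.
from collections import Counter

def mode_consensus(ratings):
    """Return the unique most common rating, or None when the top is tied."""
    counts = Counter(ratings).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no mode (footnotes d/e)
    return counts[0][0]

def sens_spec(consensus, expert):
    """Sensitivity/specificity for normal vs any disease level."""
    tp = fn = tn = fp = 0
    for c, e in zip(consensus, expert):
        pred_abnormal = c is None or c != "Nor"  # recode missing mode as abnormal
        if e != "Nor":
            tp += pred_abnormal; fn += not pred_abnormal
        else:
            fp += pred_abnormal; tn += not pred_abnormal
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative ratings for three images (10 Turkers per image is an assumption).
images = [["Nor"] * 7 + ["Mild"] * 3,   # clear mode: Nor
          ["Sev"] * 5 + ["Mod"] * 5,    # tie: no mode
          ["Mod"] * 6 + ["Nor"] * 4]    # clear mode: Mod
expert = ["Nor", "Sev", "Mod"]
consensus = [mode_consensus(r) for r in images]
print(consensus)                      # -> ['Nor', None, 'Mod']
print(sens_spec(consensus, expert))   # -> (1.0, 1.0)
```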