Naihui Zhou1,2, Zachary D Siegel3, Scott Zarecor4, Nigel Lee5, Darwin A Campbell4, Carson M Andorf6,7, Dan Nettleton1,8, Carolyn J Lawrence-Dill1,4,9, Baskar Ganapathysubramanian5, Jonathan W Kelly3, Iddo Friedberg1,2.
Abstract
The accuracy of machine learning tasks critically depends on high-quality ground truth data. Producing good ground truth data therefore typically involves trained professionals, which can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate large volumes of high-quality training data. We examine an image analysis task involving the segmentation of corn tassels from images taken in a field setting, and investigate accuracy, speed, and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We find that the MTurk and Master MTurk workers perform significantly better than the for-credit students, with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert. We provide best practices for assessing the quality of ground truth data and for comparing data quality across sources, together with several metrics for assessing the quality of the generated datasets. We conclude that properly managed crowdsourcing can establish large volumes of viable ground truth data at low cost and high quality, especially in the context of high-throughput plant phenotyping.
Year: 2018 PMID: 30059508 PMCID: PMC6085066 DOI: 10.1371/journal.pcbi.1006337
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Overall schema of datasets (boxes) and processes (arrows) that led to the analyses (red).
Top row: The Expert Labeled dataset was used as a gold standard to analyze how well the different experimental groups (blue boxes) performed. Bottom row: The labeling from each experimental group was used to train an ML classifier. Each ML classifier was then tested against an expert-labeled test set.
Fig 2. Example image used during training to demonstrate correct placement of bounding boxes around tassels.
Fig 3. Drawing boxes around tassels.
Left: Sample participant-drawn boxes. Right: The red box is the gold-standard box and the black box is a participant-drawn box.
Fig 4. Density of precision-recall pairs by group.
Density based on a total of 61,888 participant-drawn boxes. A: Master MTurkers. B: MTurkers. C: Course Credit participants. D: Violin plots showing the distribution of F-measure per image per user, where white circles: distribution median; black bars: second and third quartiles; black lines: 95% confidence intervals.
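The precision-recall pairs in Fig 4 score each participant-drawn box against the gold-standard box of Fig 3. A minimal sketch of such a score, assuming (as a common convention; the paper's Methods give the exact definition) that precision and recall are defined by rectangle overlap area:

```python
# Hypothetical sketch: score a participant-drawn bounding box against a
# gold-standard box. Boxes are (x1, y1, x2, y2) tuples with x1 < x2, y1 < y2.

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def precision_recall_f(drawn, gold):
    inter = overlap_area(drawn, gold)
    if inter == 0:
        return 0.0, 0.0, 0.0
    precision = inter / area(drawn)  # fraction of the drawn box that is correct
    recall = inter / area(gold)      # fraction of the gold box that is covered
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f
```

With this definition, a box that is too large loses precision while a box that is too small loses recall, and the F-measure summarizes both per box.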
Parameter estimates from the ANOVA with the Master MTurk group as baseline.
| Comparison | Estimate | Standard Error | p-value |
|---|---|---|---|
| non-Master MTurk vs. Master MTurk | 0.01125 | 0.02078 | 0.5893 |
| Course Credit vs. Master MTurk | -0.1005 | 0.02521 | 0.0001 |
| Course Credit vs. non-Master MTurk | -0.1117 | 0.01517 | <0.0001 |
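Baseline-contrast estimates like those in the table arise from dummy-coded least squares: one indicator column per non-baseline group, so each coefficient is that group's mean difference from the baseline. A sketch on synthetic data (not the paper's actual model, which also accounts for image and participant effects):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-observation F-measures; group means chosen only to echo the
# direction of the reported contrasts, with Master MTurk as the baseline.
group_means = {"master": 0.80, "mturk": 0.81, "credit": 0.70}

y, X = [], []
for name, mean in group_means.items():
    for _ in range(200):
        y.append(mean + rng.normal(0, 0.05))
        # Columns: intercept (baseline), mturk-vs-master, credit-vs-master.
        X.append([1, int(name == "mturk"), int(name == "credit")])

beta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
# beta[0]: baseline (Master MTurk) mean
# beta[1]: non-Master MTurk minus Master MTurk
# beta[2]: Course Credit minus Master MTurk
```

Under this coding, a small positive beta[1] and a clearly negative beta[2] mirror the pattern in the table: MTurk workers are indistinguishable from Master MTurk workers, while the Course Credit group scores lower.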
Fig 5. Both accuracy and time per question change as participants progress through the task.
A: Time spent (log scale) as a function of image order. B: Mean F value decreases very slightly over the course of the survey.
Parameter estimates in the linear mixed effects regression of time spent on each image.
| Group | Estimate (β) | Exponential of Estimate (exp(β)) | p-value |
|---|---|---|---|
| Master MTurk | -0.01043 | 0.9896 | < 0.0001 |
| non-Master MTurk | -0.01073 | 0.9893 | < 0.0001 |
| Course Credit | -0.01181 | 0.9883 | < 0.0001 |
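Because the regression models time on a log scale, each slope corresponds to a multiplicative change in time per additional image, and the exp(β) column follows directly from the estimates. A short check using the values in the table:

```python
import math

# Slope estimates from the table above (log-time per additional image).
estimates = {
    "Master MTurk": -0.01043,
    "non-Master MTurk": -0.01073,
    "Course Credit": -0.01181,
}

# exp(estimate) is the multiplicative change in time per successive image;
# 1 - exp(estimate) is the fraction of time saved on each new image.
speedup = {group: 1 - math.exp(b) for group, b in estimates.items()}

for group, s in speedup.items():
    print(f"{group}: {s:.2%} less time per successive image")
```

All three groups speed up by roughly 1% per image, consistent with the nearly flat curves in Fig 5A.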
Fig 6. Best Linear Unbiased Predictors for images.
BLUPs are calculated in both analyses, for F-measure and for time in log scale. Color represents image difficulty as determined by an expert.