Jonathan Lawson1, Rupesh J Robinson-Vyas1, Janette P McQuillan1, Andy Paterson1, Sarah Christie1, Matthew Kidza-Griffiths1, Leigh-Anne McDuffus2, Karwan A Moutasim3, Emily C Shaw1,3, Anne E Kiltie4, William J Howat2, Andrew M Hanby5, Gareth J Thomas3, Peter Smittenaar1.
Abstract
BACKGROUND: Academic pathology suffers from an acute and growing lack of workforce resource. This especially affects the translational elements of clinical trials, which can require detailed analysis of thousands of tissue samples. We tested whether crowdsourcing (enlisting help from the public) is a sufficiently accurate method to score such samples.
Year: 2016 PMID: 27959886 PMCID: PMC5243992 DOI: 10.1038/bjc.2016.404
Source DB: PubMed Journal: Br J Cancer ISSN: 0007-0920 Impact factor: 7.640
Figure 1. The 'Trailblazer' interface for viewing, annotating and scoring tissue microarray (TMA) cores. (A) Participants evaluated squares on a 6×6 grid overlaid on a TMA for the presence of cancer cells. (B) They were asked to mark squares with cancer as red, squares without cancer as green and completely empty squares as blank. (C) To aid in cancer detection and IHC scoring, the participant could move their cursor over the core to reveal a high-magnification view of the area under the cursor. Furthermore, a scrollable gallery of high-magnification example images of cancer tissue and healthy tissue was available at the bottom of the screen. (D) Prior to starting the task, each participant completed a ∼10-minute tutorial explaining the type of sample and how to distinguish cancer cells from non-cancer cells; a screenshot of the tutorial is shown here. In the first experiment we tested the effect of feedback-based training and/or annotated images provided in addition to this baseline tutorial.
Figure 2. Full factorial design to test the effect of annotated images and feedback-based training on cancer detection performance of individual participants. (A) Experimental design and number of participants in each cell. (B) Box plot showing performance in cancer detection across individuals in each of the four groups, expressed as F1-score, specificity and sensitivity. Statistics for main effects and interactions are shown in Table 1.
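The three performance measures named in the Figure 2 caption can be computed from per-square labels as in the minimal sketch below. It assumes each grid square has a binary label (1 = cancer, 0 = no cancer) from both the expert reference and the participant, and that blank squares are excluded beforehand; the function name `detection_metrics` and the example labels are hypothetical and are not taken from the study.

```python
def detection_metrics(truth, predicted):
    """Return (sensitivity, specificity, F1) for binary labels, 1 = cancer."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancer, marked cancer
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, marked healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, marked cancer
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancer, marked healthy
    nan = float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else nan
    specificity = tn / (tn + fp) if tn + fp else nan
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else nan
    return sensitivity, specificity, f1


# Hypothetical labels for eight grid squares (illustration only).
expert_labels = [1, 1, 1, 0, 0, 0, 0, 1]
participant_labels = [1, 0, 1, 0, 1, 0, 0, 1]
print(detection_metrics(expert_labels, participant_labels))  # (0.75, 0.75, 0.75)
```

F1 combines precision and sensitivity, so it penalises both missed cancer squares and false alarms, whereas specificity only reflects how well healthy squares are recognised.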
Table 1. Main effects of annotated images and feedback-based training and their interaction
| Annotated images | β=1.18 (−3.11, 5.48) | β=2.69 (−0.64, 6.02) | |
| Feedback-based training | | | |
| Interaction | β=−1.81 (−3.89, 0.27) | β=−0.71 (−5.00, 3.58) | β=−1.95 (−5.28, 1.38) |
All regression coefficients represent the estimated change in performance when adding the factor, multiplied by 100. For example, adding annotated images is estimated to improve the F1-score by 0.0211. Values in brackets represent the 95% confidence interval of the coefficient. Cells in bold are significant at P<0.05, uncorrected for multiple comparisons.
Figure 3. Accuracy of aggregated responses across four sample types. (A) We used Cohen's kappa to calculate correspondence between raters. The histogram indicates the distribution of kappas of each individual participant with the expert consensus. The solid blue line indicates the agreement between the majority consensus of all participants and the expert consensus, showing that the majority outperforms the average individual. The pairwise kappas between experts are indicated as small black lines underneath the histogram; the average of the pairwise kappas is indicated by the dashed red line. (B) A second method to compare the participant consensus with the expert consensus is the area under the receiver operating characteristic curve (AUC). Here we examined how the AUC changed as we varied the number of participants included in the consensus between 3 and 40. The red dotted line indicates an AUC of 0.90. Shaded areas indicate the bootstrapped 95th percentile CI.
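As an illustration of the comparison in panel (A), the sketch below computes Cohen's kappa between two raters and builds a majority consensus from several participants. It is a minimal example under stated assumptions: labels are categorical per item, ties in the vote are resolved by whichever label was seen first, and the names `cohen_kappa`, `majority_vote` and the example data are hypothetical rather than the study's own implementation.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement expected from each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1:
        return float("nan")  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def majority_vote(ratings):
    """ratings: one list of labels per participant; returns one label per item."""
    return [Counter(item).most_common(1)[0][0] for item in zip(*ratings)]


# Hypothetical data: six items, one expert consensus, three participants.
expert = [1, 1, 0, 0, 1, 0]
participants = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
]
for labels in participants:
    print("individual kappa:", round(cohen_kappa(labels, expert), 3))
print("majority kappa:", cohen_kappa(majority_vote(participants), expert))
```

In this toy data each participant makes one different error, so every individual kappa is below 1 while the majority consensus matches the expert exactly, which is the effect the caption describes.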
Figure 4. Comparison of expert and aggregated participant H-scores for each image. (A) Lung/EGFR sample. Grey dots indicate the three individual expert scores per sample, black dots indicate the median H-score based on all participants who evaluated the image, and error bars indicate the bootstrapped 95th percentile confidence interval of the median. The images have been sorted along the x axis by median expert score. (B) Bladder/p53 sample. Details as described under (A).
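The error bars described above are percentile bootstrap intervals of a median. A minimal sketch of that technique follows, assuming the participant scores for one image are resampled with replacement; the number of resamples, the seed, the function name `bootstrap_median_ci` and the example scores are all hypothetical choices, not values from the study.

```python
import random
import statistics


def bootstrap_median_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Return (median, lower, upper) using a percentile bootstrap of the median."""
    rng = random.Random(seed)
    # Resample the scores with replacement and record each resample's median.
    medians = sorted(
        statistics.median(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(scores), lower, upper


# Hypothetical participant H-scores for a single image (illustration only).
image_scores = [120, 150, 90, 200, 160, 140, 130, 170, 110, 155, 145]
median, lower, upper = bootstrap_median_ci(image_scores)
print(f"median H-score {median}, 95% CI ({lower}, {upper})")
```

Because the interval is read off the sorted bootstrap medians, it needs no distributional assumption, which suits bounded and often skewed scores.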
Figure 5. (A) In lung/EGFR we observed that the Spearman correlation between participants and experts increased strongly as we included more participants in the aggregate score. The black line represents the median of the bootstrapped samples, and the shaded area represents the bootstrapped 95th percentile confidence interval of the median. (B) Bladder/p53, legend as in (A).
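One way to reproduce the kind of curve described in this caption is sketched below: draw a crowd of a given size, aggregate its scores per image by the median, and compute the Spearman correlation with the expert scores, repeating the draw many times. This is a sketch under assumptions: participants are drawn with replacement, every participant scored every image, and the names `average_ranks`, `spearman`, `correlation_for_crowd_size` and the example data are hypothetical.

```python
import math
import random
import statistics


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def correlation_for_crowd_size(expert, crowd, size, n_boot=500, seed=0):
    """Median bootstrapped Spearman correlation for a crowd of `size` raters.

    crowd: one list of per-image scores per participant.
    """
    rng = random.Random(seed)
    rhos = []
    for _ in range(n_boot):
        sample = rng.choices(crowd, k=size)  # draw participants with replacement
        aggregate = [statistics.median(col) for col in zip(*sample)]
        rhos.append(spearman(expert, aggregate))
    return statistics.median(rhos)


# Hypothetical scores for five images (illustration only).
expert = [10, 40, 90, 150, 220]
crowd = [
    [5, 30, 100, 140, 230],
    [20, 50, 80, 160, 200],
    [0, 45, 95, 120, 250],
]
print(correlation_for_crowd_size(expert, crowd, size=3))
```

Calling `correlation_for_crowd_size` over a range of `size` values yields one point per crowd size; with real, noisy participant scores the curve would be expected to rise as `size` grows.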