Ljiljana Platiša, Leen Van Brantegem, Asli Kumcu, Richard Ducatelle, Wilfried Philips.
Abstract
Despite the current rapid advance in technologies for whole slide imaging, there is still no scientific consensus on the recommended methodology for image quality assessment of digital pathology slides. For medical images in general, it has been recommended to assess image quality in terms of doctors' success rates in performing a specific clinical task while using the images (clinical image quality, cIQ). However, digital pathology is a new modality, and already identifying the appropriate task is difficult. In an alternative common approach, humans are asked to do a simpler task such as rating overall image quality (perceived image quality, pIQ), but that involves the risk of nonclinically relevant findings due to an unknown relationship between the pIQ and cIQ. In this study, we explored three different experimental protocols: (1) conducting a clinical task (detecting inclusion bodies), (2) rating image similarity and preference, and (3) rating the overall image quality. Additionally, within protocol 1, overall quality ratings were also collected (task-aware pIQ). The experiments were done by diagnostic veterinary pathologists in the context of evaluating the quality of hematoxylin and eosin-stained digital pathology slides of animal tissue samples under several common image alterations: additive noise, blurring, change in gamma, change in color saturation, and JPG compression. While the size of our experiments was small and prevents drawing strong conclusions, the results suggest the need to define a clinical task. Importantly, the pIQ data collected under protocols 2 and 3 did not always rank the image alterations the same as their cIQ from protocol 1, warning against using conventional pIQ to predict cIQ. 
At the same time, there was a correlation between the cIQ and task-aware pIQ ratings from protocol 1, suggesting that the clinical experiment context (set by specifying the clinical task) may affect human visual attention and bring focus to their criteria of image quality. Further research is needed to assess whether and for which purposes (e.g., preclinical testing) task-aware pIQ ratings could substitute cIQ for a given clinical task.
Keywords: digital pathology; human observer; image compression; image quality; signal detection
Year: 2017 PMID: 28653011 PMCID: PMC5478946 DOI: 10.1117/1.JMI.4.2.021108
Source DB: PubMed Journal: J Med Imaging (Bellingham) ISSN: 2329-4302
Fig. 1 Example reference (M-NONE) images of the three considered tissue samples: (a) gastric fundic glands of a dog, (b) liver of a foal, and (c) gastric fundic glands of a dog. All M-NONE images and their corresponding artificially altered variants (M-Blur, M-Gamma, M-ColSat, M-Noise, and M-JPG) were in size.
Fig. 2 Location-level mark classification. A mark is accepted as “TP” if it belongs to the acceptance region around the actual lesion; otherwise, it is classified as “FP.” The acceptance region is a manually delineated rectangular area determined by the largest width () and height () of the actual lesion in the reference image. The actual lesions are the lesions marked by the senior expert with a confidence rating above 60%.
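The decision rule in Fig. 2 can be sketched in a few lines. This is an illustrative Python sketch: the `Lesion` structure and the coordinate convention are assumptions; only the rectangular acceptance region and the 60% confidence threshold come from the caption.

```python
from dataclasses import dataclass

@dataclass
class Lesion:
    """An expert-marked lesion with its manually delineated rectangular
    acceptance region and the senior expert's confidence rating (%)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    confidence: float

def classify_mark(x: float, y: float, lesions: list) -> str:
    """Return 'TP' if the mark (x, y) falls inside the acceptance region
    of any actual lesion (expert confidence above 60%), else 'FP'."""
    for les in lesions:
        if les.confidence <= 60:
            continue  # below the 60% threshold: not counted as an actual lesion
        if les.x_min <= x <= les.x_max and les.y_min <= y <= les.y_max:
            return "TP"
    return "FP"
```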
Six categories of image alterations considered in the main study.
| Image category | Image alteration | Parameter value |
|---|---|---|
| M-NONE | None | — |
| M-Blur | Added Gaussian blur | |
| M-Gamma | Decreased gamma | Approx. |
| M-ColSat | Decreased color saturation | Approx. |
| M-Noise | Added white Gaussian noise | |
| M-JPG | JPG compression | libjpeg quality 50 |
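The six categories can be reproduced, in spirit, with standard image libraries. The sketch below uses Pillow and NumPy; since the study's parameter values are not reproduced in the table, every setting except the libjpeg quality of 50 is a placeholder.

```python
import io

import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def make_variants(ref: Image.Image) -> dict:
    """Produce the six image categories from a reference (M-NONE) image.
    All parameter values below are illustrative placeholders; only the
    libjpeg quality of 50 is taken from the table."""
    variants = {"M-NONE": ref}
    # M-Blur: added Gaussian blur (placeholder radius)
    variants["M-Blur"] = ref.filter(ImageFilter.GaussianBlur(radius=2))
    # M-Gamma: decreased gamma via a per-band lookup (placeholder gamma)
    gamma = 0.8
    variants["M-Gamma"] = ref.point(lambda v: round(255 * (v / 255) ** (1 / gamma)))
    # M-ColSat: decreased color saturation (placeholder factor)
    variants["M-ColSat"] = ImageEnhance.Color(ref).enhance(0.5)
    # M-Noise: added white Gaussian noise (placeholder sigma)
    arr = np.asarray(ref, dtype=np.float64)
    noisy = arr + np.random.default_rng(0).normal(0.0, 10.0, arr.shape)
    variants["M-Noise"] = Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
    # M-JPG: round-trip through libjpeg at quality 50
    buf = io.BytesIO()
    ref.save(buf, format="JPEG", quality=50)
    variants["M-JPG"] = Image.open(buf)
    return variants
```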
Distribution of the observers according to gender, age, and experience in diagnostic pathology.
| Parameter | Value |
|---|---|
| Total observers | 6 |
| Male observers | 1 |
| Female observers | 5 |
| Minimum age | 25 |
| Maximum age | 40 |
| Median age | 29.5 |
| Mean years of experience | 6.2 |
Three different protocols used in the human observer experiments. Within each protocol, an observer answers the experimental question using the corresponding reporting scale.
| Protocol | FROC | DS | SS |
|---|---|---|---|
| Stimuli | Reference image and all corresponding altered images (one image per trial) | All pairwise combinations (including self-pairs) of a given reference image and its corresponding altered images (one pair per trial) | Reference image and all corresponding altered images (one image per trial) |
| Visualization | One image viewed individually | Two images viewed simultaneously | One image viewed individually |
| Questions | | | |
| Reporting scale | | | |
| Data analysis | | Median and IQR (per question) | MdnOS and Kruskal–Wallis analysis (per question) |
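The nonparametric summaries named under “Data analysis” (median, IQR, and a Kruskal–Wallis test across alteration groups) can be sketched as follows. The data layout and the use of `scipy.stats.kruskal` are assumptions for illustration, not the study's actual tooling.

```python
import numpy as np
from scipy.stats import kruskal

def summarize_ratings(ratings_by_alteration: dict):
    """Per-alteration median and IQR, plus a Kruskal-Wallis H-test across
    all alteration groups (one analysis per experimental question).
    `ratings_by_alteration` maps alteration name -> list of ratings."""
    summary = {}
    for name, ratings in ratings_by_alteration.items():
        q1, med, q3 = np.percentile(ratings, [25, 50, 75])
        summary[name] = {"median": med, "IQR": (q1, q3)}
    # Kruskal-Wallis: do the rating distributions differ across alterations?
    h_stat, p_value = kruskal(*ratings_by_alteration.values())
    return summary, h_stat, p_value
```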
Overview of the experiments performed in the prestudy (pilot) and in the main study (exp1, exp2, and exp3).
| Name | Protocol | Context | No. observers and observer experience | Image data | Trials per observer |
|---|---|---|---|---|---|
| Pilot | SS | Technical | One diagn. pathologist, one veterinary student, and two imaging experts | 5 reference images and their altered variants | 50 |
| Exp1 | FROC | Clinical | Six diagn. pathologists | 12 reference images and their altered variants | 72 |
| Exp2 | DS | Technical | Six diagn. pathologists | 21 pairwise combinations with repetition within three reference images | 63 |
| Exp3 | SS | Technical | Six diagn. pathologists | 12 reference images and their altered variants | 72 |
Parameters of the JAFROC experiment.
| Name | Value |
|---|---|
| No. readers | 6 |
| No. treatments | 6 |
| No. normal cases | 5 |
| No. abnormal cases | 7 |
| Fraction normal cases | 0.417 |
| Min lesions per image | 1 |
| Max lesions per image | 2 |
| Mean lesions per image | 1.286 |
| Total lesions | 9 |
| Mean nonlesion localization marks per reader on normal images | 2.667 |
| Mean nonlesion localization marks per reader on abnormal images | 1.091 |
| Mean lesion localization marks per reader | 0.722 |
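The JAFROC figure of merit used in these comparisons is, in essence, a Wilcoxon statistic: the probability that a lesion-localization rating exceeds the highest nonlesion-localization rating on a normal image. Below is a textbook-style sketch (not necessarily the implementation used in the study); by a common convention, unmarked lesions and normal images with no marks enter with a rating of negative infinity.

```python
def jafroc_fom(lesion_ratings: list, normal_fp_ratings: list) -> float:
    """Wilcoxon-style JAFROC figure of merit.

    lesion_ratings:    one rating per lesion (float('-inf') if unmarked).
    normal_fp_ratings: highest nonlesion-localization rating per normal
                       image (float('-inf') if the image got no marks).
    Returns the mean of the kernel psi over all (lesion, normal-image)
    pairs: 1 if the lesion rating wins, 0.5 on a tie, 0 otherwise.
    """
    def psi(fp: float, ll: float) -> float:
        if ll > fp:
            return 1.0
        if ll == fp:
            return 0.5
        return 0.0

    total = sum(psi(fp, ll) for ll in lesion_ratings for fp in normal_fp_ratings)
    return total / (len(lesion_ratings) * len(normal_fp_ratings))
```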
Difference in FOM between all pairings of image alterations (including M-NONE) and the corresponding 95% confidence intervals (CIs). The asterisk symbols indicate statistically significant differences in FOMs.
| Compared image alterations | Difference in FOM | 95% CI |
|---|---|---|
| M-Blur versus M-Gamma | 0.05000 | |
| M-Blur versus M-ColSat | ||
| M-Blur versus M-Noise | 0.00556 | |
| M-Blur versus M-JPG | 0.03333 | |
| M-Blur versus M-NONE | ||
| M-Gamma versus M-ColSat | ||
| M-Gamma versus M-Noise | ||
| M-Gamma versus M-JPG | ||
| M-Gamma versus M-NONE | ||
| M-ColSat versus M-Noise | 0.11481 | |
| M-ColSat versus M-JPG | 0.14259 | |
| M-ColSat versus M-NONE | 0.03889 | |
| M-Noise versus M-JPG | 0.02778 | |
| M-Noise versus M-NONE | ||
Fig. 3 JAFROC FOM for all considered types of image alteration (including M-NONE). The FOM is averaged over observers, and the error bars correspond to the 95% CI.
Summary statistics of image similarity ratings collected in exp2 under question “How similar are the images?” while using a six-point Likert-type scale from “not similar at all” (0) to “the same” (5). For each image-pairing, the median and the IQR of the similarity ratings are shown.
| M-NONE | M-Blur | M-Gamma | M-ColSat | M-Noise | M-JPG | |
|---|---|---|---|---|---|---|
| M-NONE | 5.00 [4.00,5.00] | |||||
| M-Blur | 3.00 [1.25,3.75] | 4.00 [4.00,5.00] | ||||
| M-Gamma | 4.00 [4.00,4.75] | 2.00 [1.00,3.75] | 5.00 [4.00,5.00] | |||
| M-ColSat | 1.00 [0.25,2.00] | 1.00 [0,2.50] | 1.00 [0,1.75] | 5.00 [4.25,5.00] | ||
| M-Noise | 5.00 [4.00,5.00] | 4.00 [2.00,4.00] | 4.00 [3.00,5.00] | 1.00 [0.25,1.00] | 4.50 [4.00,5.00] | |
| M-JPG | 4.50 [4.00,5.00] | 3.50 [2.00,4.00] | 4.00 [4.00,4.75] | 1.00 [0.25,1.75] | 4.00 [4.00,5.00] | 5.00 [4.00,5.00] |
Summary statistics of image preference ratings collected in exp2 under question “Which image do you prefer for overall quality?” while using a seven-point Likert-type scale from “left image” () to “right image” (). In the table, the left image side corresponds to the columns and the right image side is represented in the rows. For each image-pairing, the median and the IQR of the preference ratings are shown.
| M-NONE | M-Blur | M-Gamma | M-ColSat | M-Noise | M-JPG | |
|---|---|---|---|---|---|---|
| M-NONE | 0 [0,0] | |||||
| M-Blur | 2.50 [2.00,3.00] | 0 | ||||
| M-Gamma | 0 | 0 [0,0] | ||||
| M-ColSat | 2.00 | 2.00 | 0 [0,0] | |||
| M-Noise | 0 [0,0.75] | 0 [0,1.00] | 0 | 0 [0,0.75] | ||
| M-JPG | 0 [0,0.75] | 0 | 0 | 0 [0,0] |
Fig. 4 Overall IQ ratings by diagnostic pathologists: (a) results from exp1 (task-aware pIQ), continuous rating scale from 0% to 100%, and (b) results from exp3 (conventional pIQ), discrete six-point rating scale from 0 to 5. For both experiments, a higher rating score corresponds to higher pIQ. In both plots, the x-axis represents the type of image alteration (M-NONE, M-Blur, M-Gamma, M-ColSat, M-Noise, and M-JPG). Each box in the plot indicates the median, the IQR, and the 1.5 IQR interval (whiskers); no “outliers” (measured points outside of the whisker range) were identified.