Florian Kofler, Ivan Ezhov, Lucas Fidon, Carolin M Pirkl, Johannes C Paetzold, Egon Burian, Sarthak Pati, Malek El Husseini, Fernando Navarro, Suprosanna Shit, Jan Kirschke, Spyridon Bakas, Claus Zimmer, Benedikt Wiestler, Bjoern H Menze.
Abstract
A multitude of image-based machine learning segmentation and classification algorithms has recently been proposed, offering diagnostic decision support for the identification and characterization of glioma, Covid-19, and many other diseases. Even though these algorithms often outperform human experts in segmentation tasks, their limited reliability, and in particular their inability to detect failure cases, has hindered translation into clinical practice. To address this major shortcoming, we propose an unsupervised quality estimation method for segmentation ensembles. Our primitive solution examines discord in binary segmentation maps to automatically flag segmentation results that are particularly error-prone and therefore require special assessment by human readers. We validate our method on the segmentation of brain glioma in multi-modal magnetic resonance (MR) images and of lung lesions in computed tomography (CT) images. Additionally, our method provides an adaptive prioritization mechanism that maximizes the efficacy of human expert time by enabling radiologists to focus on the most difficult, yet important, cases while maintaining full diagnostic autonomy. Our method offers an intuitive and reliable uncertainty estimation from segmentation ensembles and thereby closes an important gap toward successful translation of automatic segmentation into clinical routine.
Keywords: CT; MR; OOD; anomaly detection; ensembling; failure prediction; fusion; quality estimation
Year: 2021 PMID: 35035351 PMCID: PMC8757043 DOI: 10.3389/fnins.2021.752780
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 5.152
Figure 1. Quality estimation procedure. After computing the fusion of the candidate segmentations, similarity metrics between the fused and the candidate segmentations are evaluated. From these, we obtain a threshold for each metric by subtracting the median absolute deviation (mad) of the similarity metric, multiplied by the tunable parameter α, from its median. We raise an alarm flag if an individual similarity metric falls below the computed threshold, for example: median(Dice) − α·mad(Dice).
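The thresholding rule in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy Dice values are hypothetical, and the MAD is assumed to be unscaled (the caption does not say whether a consistency constant is applied).

```python
import numpy as np

def mad(values):
    """Unscaled median absolute deviation (assumption: no consistency constant)."""
    values = np.asarray(values, dtype=float)
    return float(np.median(np.abs(values - np.median(values))))

def alarm_threshold(similarities, alpha):
    """Threshold as in the caption: median(similarities) - alpha * mad(similarities)."""
    similarities = np.asarray(similarities, dtype=float)
    return float(np.median(similarities)) - alpha * mad(similarities)

def alarm_flags(similarities, alpha):
    """Flag every exam whose similarity to the fusion falls below the threshold."""
    threshold = alarm_threshold(similarities, alpha)
    return np.asarray(similarities, dtype=float) < threshold

# Made-up Dice values between one candidate algorithm and the fused
# segmentation, one value per exam (illustrative data only).
dice_to_fusion = [0.95, 0.93, 0.96, 0.60, 0.94]

threshold = alarm_threshold(dice_to_fusion, alpha=0.1)
flags = alarm_flags(dice_to_fusion, alpha=0.1)
print(f"threshold = {threshold:.3f}, flagged exams = {flags.tolist()}")
```

A larger α lowers the threshold and raises fewer alarms, while a negative α pushes the threshold above the median and flags most exams.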
Figure 2. Exemplary glioma segmentation exam with multi-modal MR. Segmentations are overlaid on T1, T1c, T2, and FLAIR images at the tumor's center of mass, defined by the tumor core (necrosis and enhancing tumor) of the ground truth label. The segmentation outlines represent the tumor core labels, that is, the combination of the enhancing tumor and necrosis labels. Top: the four input images without segmentation overlay; Middle: ground truth segmentation (GT) in reddish purple vs. majority voting fusion (mav) in bluish green; Bottom: mav fusion in bluish green vs. individual segmentation algorithms in various colors. Note the small outliers encircled in pink in the frontal lobe, which probably contribute to the 3 Dice-based and 4 Hausdorff-distance-based alarms raised for this exam, whose volumetric Dice similarity coefficient with the ground truth is a mediocre 0.66.
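As a rough illustration of the majority voting fusion and the fusion-versus-candidate comparison mentioned in the caption, the following sketch fuses binary masks and computes the Dice coefficient of each candidate against the fusion. The function names, the tie handling (strictly more than half of the votes), the convention for two empty masks, and the toy one-dimensional masks are all assumptions made for this example.

```python
import numpy as np

def majority_vote(masks):
    """Fuse binary masks: foreground where strictly more than half of the masks agree."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stack.sum(axis=0) > (len(masks) / 2)

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / denom)

# Three made-up candidate segmentations of a six-voxel "image".
candidates = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
]

fused = majority_vote(candidates)
scores = [dice(c, fused) for c in candidates]
print(fused.astype(int).tolist(), scores)
```

Candidates that disagree with the consensus, like the third mask with its stray voxels, receive a lower Dice score against the fusion, which is the signal the alarm thresholds act on.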
Figure 3. Example Covid-19 lung lesion segmentation exams with CT images. Segmentations are overlaid at the lesions' center of mass, defined by the slice with the most lesion voxels. Left: the input images without segmentation overlay; Middle: SIMPLE segmentation fusion (simple) in bluish green; Right: SIMPLE fusion in bluish green vs. individual segmentation algorithms in various colors. The volumetric Dice similarity coefficients with the ground truth and the respective alarm counts are as follows: Top row: 0.81, 0; Middle row: 0.58, 2; Last row: 0.14, 3.
Distribution of alarm counts depending on α for the MR experiment. The table lists the number of images in each alarm count category (a), from 0 to 10, for different values of α.
| α | entropy | r:dice | r:hd | a=0 | a=1 | a=2 | a=3 | a=4 | a=5 | a=6 | a=7 | a=8 | a=9 | a=10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| −3.00 | −0.00 | NA | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 |
| −2.00 | 0.22 | NA | 0.04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 64 |
| −1.00 | 1.28 | −0.27 | −0.2 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 5 | 5 | 13 | 40 |
| −0.75 | 1.80 | −0.55 | −0.27 | 0 | 0 | 1 | 3 | 4 | 2 | 10 | 5 | 4 | 14 | 25 |
| −0.50 | 2.02 | −0.63 | −0.3 | 0 | 6 | 1 | 3 | 5 | 4 | 11 | 1 | 10 | 8 | 19 |
| −0.25 | 2.33 | −0.7 | −0.38 | 3 | 5 | 4 | 5 | 7 | 4 | 6 | 7 | 8 | 7 | 12 |
| −0.10 | 2.37 | −0.73 | −0.41 | 7 | 4 | 4 | 6 | 4 | 7 | 7 | 8 | 7 | 6 | 8 |
| 0.00 | 2.35 | −0.76 | −0.45 | 9 | 5 | 7 | 4 | 4 | 6 | 6 | 8 | 8 | 3 | 8 |
| 0.10 | 2.30 | −0.77 | −0.46 | 9 | 6 | 10 | 3 | 6 | 7 | 2 | 9 | 5 | 3 | 8 |
| 0.25 | 2.28 | −0.77 | −0.51 | 11 | 7 | 12 | 3 | 2 | 7 | 3 | 8 | 5 | 5 | 5 |
| 0.50 | 2.23 | −0.78 | −0.59 | 15 | 11 | 8 | 3 | 2 | 4 | 5 | 8 | 4 | 4 | 4 |
| 0.75 | 2.06 | −0.73 | −0.59 | 18 | 13 | 7 | 3 | 1 | 5 | 6 | 7 | 2 | 6 | 0 |
| 1.00 | 1.97 | −0.72 | −0.58 | 23 | 12 | 3 | 3 | 2 | 6 | 8 | 6 | 3 | 2 | 0 |
| 2.00 | 1.71 | −0.66 | −0.55 | 30 | 10 | 6 | 4 | 3 | 8 | 2 | 5 | 0 | 0 | 0 |
| 3.00 | 1.40 | −0.65 | −0.52 | 37 | 11 | 4 | 1 | 3 | 10 | 1 | 1 | 0 | 0 | 0 |
Additionally, we report the Pearson correlation coefficients of the Dice-based (r:dice) and Hausdorff-distance-based (r:hd) alarm counts with volumetric Dice segmentation performance, as well as the entropy of the respective alarm count distribution. The selected value for α of 0.1 is highlighted in pink. The resulting computed thresholds are listed in the table of per-algorithm thresholds for the MR experiment.
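The entropy column can be reproduced from the alarm count distribution of each row. The sketch below assumes Shannon entropy with the natural logarithm, which is an inference from the reported values rather than something the text states; the function name is hypothetical, and the two count rows are copied from the MR table.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (natural log, assumed) of a histogram of alarm counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # empty categories contribute 0
    return -sum(p * math.log(p) for p in probs)

# Alarm count distributions (categories a = 0..10) taken from the MR table.
row_alpha_minus_2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 64]   # alpha = -2.00
row_alpha_0_1 = [9, 6, 10, 3, 6, 7, 2, 9, 5, 3, 8]        # alpha = 0.10

print(round(shannon_entropy(row_alpha_minus_2), 2))  # table reports 0.22
print(round(shannon_entropy(row_alpha_0_1), 2))      # table reports 2.30
```

Higher entropy means the exams are spread more evenly across the alarm count categories, which gives a finer-grained ranking of exams for review. The upper bound for 11 categories is ln(11) ≈ 2.40.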
Distribution of alarm counts depending on α for the CT experiment. The table lists the number of images in each alarm count category (a), from 0 to 3, for different values of α.
| α | entropy | r:dice | a=0 | a=1 | a=2 | a=3 |
|---|---|---|---|---|---|---|
| −3.00 | −0.00 | NA | 0 | 0 | 0 | 46 |
| −2.00 | −0.00 | NA | 0 | 0 | 0 | 46 |
| −1.00 | 0.58 | −0.45 | 0 | 3 | 5 | 38 |
| −0.75 | 0.88 | −0.56 | 5 | 2 | 6 | 33 |
| −0.50 | 1.19 | −0.67 | 6 | 7 | 8 | 25 |
| −0.25 | 1.32 | −0.64 | 10 | 7 | 10 | 19 |
| −0.10 | 1.36 | −0.73 | 12 | 8 | 11 | 15 |
| 0.00 | 1.37 | −0.7 | 13 | 8 | 14 | 11 |
| 0.10 | 1.37 | −0.7 | 15 | 10 | 11 | 10 |
| 0.25 | 1.33 | −0.62 | 18 | 9 | 11 | 8 |
| 0.50 | 1.20 | −0.61 | 23 | 6 | 12 | 5 |
| 0.75 | 1.17 | −0.69 | 25 | 9 | 8 | 4 |
| 1.00 | 1.13 | −0.71 | 26 | 10 | 6 | 4 |
| 2.00 | 0.86 | −0.67 | 33 | 8 | 2 | 3 |
| 3.00 | 0.66 | −0.62 | 37 | 6 | 1 | 2 |
Additionally, we report the Pearson correlation coefficients of the Dice-based (r:dice) alarm counts with volumetric Dice segmentation performance, as well as the entropy of the respective alarm count distribution. The selected value for α of 0.1 is highlighted in pink. The resulting computed Dice similarity thresholds are as follows: ADA: 0.9489; RAN: 0.9446; AUG: 0.9024.
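To illustrate how such thresholds might translate into alarm counts and a review order, the following sketch applies the three CT thresholds listed above to made-up exams. The exam names and Dice values are invented for illustration, and sorting by descending alarm count is one simple assumed way to realize the prioritization idea.

```python
# Dice similarity thresholds for the CT experiment, as listed in the text.
CT_DICE_THRESHOLDS = {"ADA": 0.9489, "RAN": 0.9446, "AUG": 0.9024}

def count_alarms(dice_to_fusion, thresholds=CT_DICE_THRESHOLDS):
    """Count the algorithms whose Dice similarity to the fusion is below threshold."""
    return sum(1 for name, thr in thresholds.items() if dice_to_fusion[name] < thr)

# Hypothetical exams: Dice of each candidate segmentation against the fusion.
exams = {
    "exam_a": {"ADA": 0.97, "RAN": 0.96, "AUG": 0.93},
    "exam_b": {"ADA": 0.91, "RAN": 0.95, "AUG": 0.85},
    "exam_c": {"ADA": 0.70, "RAN": 0.65, "AUG": 0.40},
}

alarm_counts = {name: count_alarms(scores) for name, scores in exams.items()}

# Exams with the most alarms are reviewed first.
review_order = sorted(alarm_counts, key=alarm_counts.get, reverse=True)
print(alarm_counts, review_order)
```

With three algorithms and one metric, each exam falls into one of the four alarm count categories (0 to 3) used in the CT table.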
Figure 4. Segmentation performances vs. alarm counts. The group means are illustrated with horizontal black lines. For display purposes, only the 0–95 percent quantile range of the Hausdorff distances is shown on the y-axis. In line with the volumetric Dice coefficient, which decreases with increasing alarm count, Hausdorff distances increase with increasing alarm count. Infinite Hausdorff distances, which can occur when the ground truth or the prediction is empty, are excluded from the plot. Subplots (A) + (B) illustrate findings for the MR experiment, while subplots (C) + (D) depict results for the CT experiment.
Thresholds computed with α = 0.1 for the MR experiment per algorithm. The columns Dice and Hausdorff give the respective volumetric Dice and Hausdorff distance based thresholds used for the alarm computation for each segmentation algorithm.
| Algorithm | Reference | Dice | Hausdorff |
|---|---|---|---|
| micdkfz | Isensee et al., | 0.9055 | 10.2277 |
| xfeng | Feng et al., | 0.9092 | 8.9835 |
| scan2019 | McKinley et al., | 0.9147 | 8.8292 |
| scan | McKinley et al., | 0.9084 | 10.4850 |
| zyx | Zhao et al., | 0.9293 | 8.4451 |