| Literature DB >> 32322186 |
Alain Jungo, Fabian Balsiger, Mauricio Reyes.
Abstract
Automatic segmentation of brain tumors has the potential to enable volumetric measures and high-throughput analysis in the clinical setting. This potential seems almost within reach, considering the steady increase in segmentation accuracy. However, despite high segmentation accuracy, current methods still do not meet the robustness levels required for patient-centered clinical use. In this regard, uncertainty estimates are a promising direction to improve the robustness of automated segmentation systems. Different uncertainty estimation methods have been proposed, but little is known about their usefulness and limitations for brain tumor segmentation. In this study, we present an analysis of the most commonly used uncertainty estimation methods with regard to their benefits and challenges for brain tumor segmentation. We evaluated their quality in terms of calibration, segmentation error localization, and segmentation failure detection. Our results show that the uncertainty methods are typically well-calibrated when evaluated at the dataset level. Evaluated at the subject level, we found notable miscalibrations and limited segmentation error localization (e.g., for correcting segmentations), which hinder the direct use of the voxel-wise uncertainties. Nevertheless, voxel-wise uncertainty showed value for detecting failed segmentations when uncertainty estimates are aggregated at the subject level. Therefore, we suggest a careful usage of voxel-wise uncertainty measures and highlight the importance of developing solutions that address the subject-level requirements on calibration and segmentation error localization.
Keywords: brain tumor; deep learning; quality; segmentation; uncertainty estimation
Year: 2020 PMID: 32322186 PMCID: PMC7156850 DOI: 10.3389/fnins.2020.00282
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
Figure 1Overview of the analysis performed for the uncertainties produced by different uncertainty estimation methods. The red color indicates additions introduced by these methods with respect to the baseline.
Figure 2Comparison between dataset-level and subject-level calibration (shown in reliability diagrams) for the selected uncertainty estimation methods. The first column shows the dataset-level calibration, which considers all voxels in the dataset. The second to fourth columns show subject-level calibrations, which consider voxels of a single subject. The exemplary subjects indicate underconfident, overconfident, and well-calibrated methods. The rows indicate the three tumor regions.
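The reliability diagrams in Figure 2 plot per-bin accuracy against per-bin confidence, and the expected calibration error (ECE) reported in the table summarizes the same binning as a single number. A minimal sketch of both, assuming voxel-wise confidences and binary correctness labels and an equal-width binning with 10 bins (the paper's exact binning is not stated in this record):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Bin voxel-wise confidences and return per-bin mean confidence,
    accuracy, and voxel counts (the data behind a reliability diagram)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mean_conf, accuracy, counts = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            mean_conf.append(confidences[mask].mean())
            accuracy.append(correct[mask].mean())
            counts.append(mask.sum())
    return np.array(mean_conf), np.array(accuracy), np.array(counts)

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: count-weighted mean absolute gap between confidence and accuracy."""
    conf, acc, n = reliability_bins(confidences, correct, n_bins)
    return float(np.sum(n / n.sum() * np.abs(acc - conf)))
```

Running this once over all voxels in the test set gives the dataset-level calibration; running it per subject gives the subject-level diagrams, which is where the miscalibrations reported in the paper appear.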
Performances of the different uncertainty estimation methods in terms of expected calibration error (ECE), uncertainty-error overlap (U-E), and Dice coefficient.

| Method | WT ECE (%) | WT U-E | WT Dice | TC ECE (%) | TC U-E | TC Dice | ET ECE (%) | ET U-E | ET Dice |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 1.059 | 0.427 | 0.869 | 0.853 | 0.41 | 0.767 | 0.309 | 0.401 | 0.692 |
| Concrete | 0.984 | 0.429 | 0.875 | 0.802 | 0.419 | 0.775 | 0.278 | 0.407 | 0.686 |
| Center low | 0.942 | 0.434 | 0.88 | 0.83 | 0.409 | 0.775 | 0.28 | 0.403 | 0.686 |
| Center | 1.606 | 0.425 | 0.817 | 1.086 | 0.41 | 0.695 | 0.381 | 0.395 | 0.642 |
| Baseline + MC | 1.016 | 0.433 | 0.869 | 0.805 | 0.41 | 0.765 | 0.284 | 0.403 | 0.693 |
| Concrete + MC | 0.952 | 0.431 | 0.877 | 0.785 | 0.409 | 0.689 | | | |
| Center low + MC | 0.922 | 0.435 | 0.83 | 0.41 | 0.769 | 0.275 | 0.409 | 0.69 | |
| Center + MC | 1.014 | 0.432 | 0.874 | 1.06 | 0.409 | 0.716 | 0.462 | 0.4 | 0.651 |
| Ensemble | 0.88 | 0.402 | 0.275 | 0.411 | | | | | |
| Aleatoric | 12.187 | 0.001 | 0.874 | 2.407 | 0 | 0.757 | 1.284 | 0.007 | 0.673 |
| Auxiliary segm. | 1.058 | 0.428 | 0.869 | 0.887 | 0.397 | 0.767 | 0.323 | 0.39 | 0.692 |
| Auxiliary feat. | 1.057 | 0.433 | 0.869 | 0.852 | 0.403 | 0.767 | 0.318 | 0.692 | |
All metrics range from 0 to 1, but the ECE is reported in % for easier comparison. Lower ECEs are better, as are higher U-Es and Dice coefficients. We note that the Dice coefficient is not a measure for analyzing the quality of the uncertainty; it is reported to monitor the segmentation performance of the different methods. Mean values are presented, and standard deviations are omitted due to marginal differences. Bold values indicate best performances. Horizontal separations group types of uncertainty methods, and WT, TC, and ET indicate the tumor regions whole tumor, tumor core, and enhancing tumor. Some table cells are empty in the source record.
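The uncertainty-error overlap (U-E) quantifies segmentation error localization as the Dice overlap between the thresholded uncertainty map and the map of misclassified voxels. A minimal sketch under the assumption of a normalized uncertainty map and a fixed threshold of 0.5 (the record does not state the threshold used in the study):

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def uncertainty_error_overlap(uncertainty, prediction, target, threshold=0.5):
    """U-E: Dice between the binarized uncertainty map and the
    segmentation error map (voxels where prediction != target)."""
    uncertain = uncertainty >= threshold
    error = prediction != target
    return dice(uncertain, error)
```

A U-E near 1 means the uncertain voxels coincide with the segmentation errors; the near-zero U-E of the aleatoric method in the table indicates that its uncertainty barely localizes errors at all.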
Figure 3Visual examples of the whole tumor uncertainty produced by the different uncertainty estimation methods. The columns correspond to underconfident, overconfident, and well-calibrated subjects (same as in Figure 2).
Figure 4Effect of the training dataset size on the expected calibration error (ECE).
Figure 5Differences among the three aggregation methods for each uncertainty estimation approach in terms of area under the curve of the receiver operating characteristic (AUC-ROC) for segmentation failure detection.
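For failure detection, the voxel-wise uncertainty map is first aggregated into a single subject-level score, which is then evaluated by how well it ranks failed above successful segmentations (AUC-ROC). A minimal sketch, assuming mean uncertainty over a region of interest as the aggregation (one simple choice; the paper compares several aggregation methods not detailed in this record) and the rank-based formulation of the AUC:

```python
import numpy as np

def aggregate_uncertainty(uncertainty_map, mask=None):
    """Aggregate a voxel-wise uncertainty map into one subject-level
    score; here simply the mean over an optional region of interest."""
    if mask is not None:
        uncertainty_map = uncertainty_map[mask]
    return float(uncertainty_map.mean())

def auc_roc(scores, is_failure):
    """AUC-ROC via the Mann-Whitney U formulation: the probability that
    a failed case receives a higher score than a successful one."""
    scores = np.asarray(scores, dtype=float)
    is_failure = np.asarray(is_failure, dtype=bool)
    pos, neg = scores[is_failure], scores[~is_failure]
    # Compare every (failed, successful) pair; ties count as half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC-ROC of 1 means the subject-level uncertainty perfectly separates failed from successful segmentations, matching the way Figures 5 and 6 report detection performance.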
Figure 6Segmentation failure detection performance of uncertainty aggregated via automatically extracted features, in terms of area under the curve of the receiver operating characteristic (AUC-ROC) and Youden's accuracy, as well as correlation with the segmentation performance in terms of Spearman's rank correlation (ρ). Each point per color represents an uncertainty estimation method. The negative outliers in each metric and for each tumor region correspond to the aleatoric method.