Lars Johannes Isaksson1, Paul Summers2, Abhir Bhalerao3, Sara Gandini4, Sara Raimondi4, Matteo Pepa5, Mattia Zaffaroni5, Giulia Corrao5,6, Giovanni Carlo Mazzola5,6, Marco Rotondi5,6, Giuliana Lo Presti4, Zaharudin Haron7, Sara Alessi2, Paola Pricolo2, Francesco Alessandro Mistretta8, Stefano Luzzago8, Federica Cattani9, Gennaro Musi6,8, Ottavio De Cobelli6,8, Marta Cremonesi10, Roberto Orecchia11, Giulia Marvaso5,6, Giuseppe Petralia6,12, Barbara Alicja Jereczek-Fossa5,6.
Abstract
OBJECTIVE: Deploying an automatic segmentation model in practice should require rigorous quality assurance (QA) and continuous monitoring of the model's use and performance, particularly in high-stakes scenarios such as healthcare. Currently, however, tools to assist with QA for such models are not available to AI researchers. In this work, we build a deep learning model that estimates the quality of automatically generated contours.Entities:
Keywords: Confidence calibration; Diagnostic imaging; Magnetic resonance imaging; Prostate; Quality assurance (Health care)
Year: 2022 PMID: 35976491 PMCID: PMC9385913 DOI: 10.1186/s13244-022-01276-7
Source DB: PubMed Journal: Insights Imaging ISSN: 1869-4101
Fig. 1 Overview of the problem of predicting the quality of organ segmentations. First, a segmentation model (not covered in this article) takes images as input and produces segmentation maps. Then, our quality prediction model takes both the images and the segmentation maps as input and produces an estimate of the quality of the segmentation—in this case the Dice similarity coefficient. Good contours have a high Dice value and poor contours have a low Dice value. Note that we need ground truth segmentations in order to calculate the true Dice value and train the quality prediction model with supervised learning. In this study, we used 60 ground truth segmentations and 80 automatically generated masks along with heavy data augmentation to train the quality prediction model
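The Dice similarity coefficient used as the quality target in Fig. 1 can be computed directly from a pair of binary masks. A minimal sketch (the mask shapes and contents below are illustrative, not data from the study):

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    """Dice similarity coefficient between two binary masks:
    2 * |pred AND gt| / (|pred| + |gt|)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

gt = np.zeros((8, 8), dtype=np.uint8)
gt[2:6, 2:6] = 1           # a 4x4 "organ" region
good = gt.copy()           # perfect contour -> Dice near 1
poor = np.zeros_like(gt)
poor[5:8, 5:8] = 1         # mostly missed contour -> low Dice
```

Good contours score near 1 and poor contours near 0, which is exactly the quantity the quality prediction model is trained to regress.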
Hyperparameter space explored by the Optuna search for the baseline CatBoost model
| Parameter | Values |
|---|---|
| n estimators | {1, 256} |
| max depth | {1, 6} |
| l2 leaf reg | [10−3, …] |
| random strength | [0.1, 3] |
*log-uniform prior
[]: continuous interval
{}: integer interval
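The search space in the table above can be sampled as follows. This is a minimal stdlib sketch standing in for Optuna; the upper bound of `l2 leaf reg` is truncated in the source, so the value 10.0 below is an assumed placeholder, and applying the log-uniform prior to that parameter is likewise an assumption:

```python
import math
import random

random.seed(0)

def sample_params():
    """Draw one configuration from the CatBoost search space.
    Upper bound of l2_leaf_reg (10.0) is an assumed placeholder."""
    return {
        "n_estimators": random.randint(1, 256),        # integer interval {1, 256}
        "max_depth": random.randint(1, 6),             # integer interval {1, 6}
        # assumed log-uniform prior on a continuous interval starting at 1e-3
        "l2_leaf_reg": math.exp(random.uniform(math.log(1e-3), math.log(10.0))),
        "random_strength": random.uniform(0.1, 3.0),   # continuous interval [0.1, 3]
    }

trials = [sample_params() for _ in range(100)]
```

In Optuna itself, the integer intervals would map to `trial.suggest_int` and the continuous ones to `trial.suggest_float` (with `log=True` for the log-uniform prior).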
Fig. 2 Network architecture. An EfficientNet B0 backbone is connected to three repeated bidirectional feature pyramid blocks (BiFPNs). The regression head consists of serially connected fast normalized fusion nodes and finally BatchNorm (BN), PReLU, a single-channel convolution, and a sigmoid activation. Numbers indicate the resolution at each level relative to the input. Image adapted from [35]
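The fast normalized fusion nodes mentioned in the caption combine feature maps with ReLU-constrained weights normalized to sum to roughly one, following the BiFPN (EfficientDet) formulation. A NumPy sketch; the tensor shapes and epsilon are assumptions, not details confirmed by this article:

```python
import numpy as np

def fast_normalized_fusion(features, w, eps=1e-4):
    """Weighted sum of feature maps with non-negative, normalized weights.
    ReLU keeps each weight >= 0; dividing by the weight sum (plus eps)
    normalizes without the cost of a softmax."""
    w = np.maximum(w, 0.0)          # ReLU on the learnable weights
    w = w / (w.sum() + eps)         # fast normalization
    return sum(wi * f for wi, f in zip(w, features))

f1 = np.ones((4, 4))                # two example feature maps
f2 = 3 * np.ones((4, 4))
fused = fast_normalized_fusion([f1, f2], np.array([1.0, 1.0]))
```

With equal weights the node reduces to a simple average of its inputs, which is a useful sanity check when wiring up such blocks.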
Fig. 3 Predicted vs. target Dice values of the baseline CatBoost model, which only uses clinical variables to predict segmentation quality. The dotted line indicates perfect x = y predictions. The predictions of this model vary only minimally (staying very close to the naïve model), suggesting that the clinical variables are not indicative of segmentation performance
Fig. 7 Performance of the quality prediction model (MAE and rank correlation of predicted Dice scores) on different cases of failed segmentations: completely empty contours, pure noise, matrices full of ones, and shifted ground truths (GTs). The performance on real GT segmentation maps is also shown. All results are aggregated over 5 different validation splits for a total of 80 samples each
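The synthetic failure cases listed in the Fig. 7 caption are straightforward to generate and score against a ground truth. A NumPy sketch (mask size, shift amount, and noise threshold are illustrative choices, not the study's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def dice(a, b, eps=1e-7):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps)

gt = np.zeros((32, 32), dtype=np.uint8)
gt[8:24, 8:24] = 1                                        # square "organ" mask

failures = {
    "empty": np.zeros_like(gt),                           # completely empty contour
    "noise": (rng.random(gt.shape) > 0.5).astype(np.uint8),  # pure noise
    "ones": np.ones_like(gt),                             # matrix full of ones
    "shifted": np.roll(gt, shift=8, axis=1),              # shifted ground truth
}
scores = {name: dice(mask, gt) for name, mask in failures.items()}
```

Each failure mode yields a characteristically low Dice score, so a well-calibrated quality model should flag all of them.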
Average mean absolute error (MAE) and rank correlation of the baseline CatBoost and naïve models, which only utilize clinical variables to predict segmentation quality. The naïve method predicts the mean target value for all samples. Parentheses indicate standard deviation. The values are aggregated from 64 repeated fivefold cross-validations
| Model | MAE | Corr |
|---|---|---|
| CatBoost | 0.016 (± 3·10−4) | − 0.16 (± 0…) |
| Naïve | 0.016 (± 0) | n/a |
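The naïve baseline in the table predicts the mean target Dice for every sample. A small sketch showing why its rank correlation is undefined ("n/a") and why its MAE equals the mean absolute deviation of the targets (the target values below are hypothetical, not from the study):

```python
import numpy as np

def naive_baseline(targets):
    """Predict the mean target for all samples, as the naive model does.
    Rank correlation is undefined for constant predictions ("n/a" in the
    table), since every sample receives the same rank."""
    preds = np.full_like(targets, targets.mean())
    mae = np.abs(preds - targets).mean()   # = mean absolute deviation
    return preds, mae

targets = np.array([0.90, 0.92, 0.95, 0.88, 0.93])  # hypothetical Dice targets
preds, mae = naive_baseline(targets)
```

This is why a learned model must beat both the naïve MAE and achieve a meaningful positive rank correlation to be useful for QA.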
Fig. 4 Predicted vs. target Dice values of the deep network shown in Fig. 2. The dotted line indicates perfect x = y predictions
Fig. 5 Predicted Dice values, targets, and the respective absolute error of the quality prediction deep learning network. The mean absolute error is 0.02 and the correlation between the predicted and target values is 0.42
Fig. 6 Characteristic training curves of the deep quality prediction network. The plot is an aggregate of all the different validation folds. The validation loss often spikes in early training; these spikes disappear after the learning rate reduction at 120 epochs