L B van den Oever1, W A van Veldhuizen2, L J Cornelissen1, D S Spoor1, T P Willems3, G Kramer4, T Stigter4, M Rook4, A P G Crijns1, M Oudkerk5, R N J Veldhuis6, G H de Bock7, P M A van Ooijen8.
Abstract
Organs-at-risk contouring is time-consuming and labour-intensive. Automation by deep learning algorithms would considerably decrease the workload of radiotherapists and technicians. However, the variety of metrics used to evaluate deep learning algorithms makes the results of many papers difficult to interpret and compare. In this paper, a qualitative evaluation was done on five established metrics to assess whether their values correlate with clinical usability. A total of 377 CT volumes with heart delineations were randomly selected for training and evaluation. A deep learning algorithm was used to predict the contours of the heart. A total of 101 CT slices from the validation set with the predicted contours were shown to three experienced radiologists. Each reader independently judged whether they would accept or adjust the prediction and whether it contained (small) mistakes. For each slice, the scores of this qualitative evaluation were then compared with the Sørensen-Dice coefficient (DC), the Hausdorff distance (HD), pixel-wise accuracy, sensitivity, and precision. Statistical analysis showed a significant correlation between the qualitative evaluation and the metrics. Of the slices with a DC above 0.96 (N = 20) or a 95% HD below 5 voxels (N = 25), none were rejected by the readers. However, contours with a lower DC or a higher HD appeared among both rejected and accepted slices. The qualitative evaluation shows that common quantitative metrics are difficult to use as indicators of clinical usability. The reporting of quantitative metrics may need to change to better reflect clinical acceptance.
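The five metrics compared in the abstract are all computable from a pair of binary masks (prediction and ground truth). A minimal sketch of how such an evaluation could be implemented, assuming NumPy and SciPy; the helper names `surface_distances` and `evaluate_contour` are illustrative, not from the paper:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface_distances(a, b):
    """Euclidean distances from surface voxels of mask `a` to the surface of mask `b`."""
    surf_a = a & ~binary_erosion(a)   # boundary voxels of a
    surf_b = b & ~binary_erosion(b)   # boundary voxels of b
    # distance of every voxel to the nearest surface voxel of b
    dist_to_b = distance_transform_edt(~surf_b)
    return dist_to_b[surf_a]

def evaluate_contour(pred, gt):
    """DC, 95% HD (in voxels), precision, sensitivity and pixel-wise accuracy
    for a predicted binary mask against a ground-truth mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    dc = 2 * tp / (2 * tp + fp + fn)          # Sørensen-Dice coefficient
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    accuracy = (tp + tn) / pred.size           # pixel-wise accuracy
    # symmetric 95th-percentile Hausdorff distance over both surfaces
    d = np.concatenate([surface_distances(pred, gt),
                        surface_distances(gt, pred)])
    hd95 = np.percentile(d, 95)
    return dict(dc=dc, hd95=hd95, precision=precision,
                sensitivity=sensitivity, accuracy=accuracy)
```

The paper does not specify its exact HD implementation; the symmetric surface-distance formulation above is one common choice.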
Keywords: Automatic contouring; CT; Deep learning; Qualitative assessment; Turing test
Year: 2022 PMID: 35083620 PMCID: PMC8921356 DOI: 10.1007/s10278-021-00573-9
Source DB: PubMed Journal: J Digit Imaging ISSN: 0897-1889 Impact factor: 4.056
Fig. 1 Four CT slices with predicted contouring of the heart as shown to the radiologists for the qualitative evaluation. The letters correspond with the consensus answers for each slice: A was rejected with clear mistakes, B was rejected with minor but clinically relevant mistakes, C was accepted with small but irrelevant mistakes, and D was accepted with no mistakes
Results of the evaluation as done by the radiologists. The answers are combined by majority vote and grouped into rejected (A or B answers) and accepted (C or D answers) slices. The medians and 25th–75th percentiles of the DC, 95% HD, precision, sensitivity and pixel-wise accuracy are given for the rejected and accepted slices

| Consensus | DC | 95% HD (in voxels) | Precision | Sensitivity | Pixel-wise accuracy |
|---|---|---|---|---|---|
| Rejected (A or B) | 0.83 (0.76–0.87) | 18.1 (11.7–27.0) | 0.80 (0.72–0.85) | 0.87 (0.80–0.97) | 0.93 (0.84–0.99) |
| Accepted (C or D) | 0.93 (0.87–0.97) | 8.08 (3.27–13.2) | 0.90 (0.82–0.98) | 0.98 (0.95–0.99) | 0.98 (0.95–0.99) |
Results of the evaluation by the radiologists per answer. The medians and 25th–75th percentiles of the metrics are given per answer (A: rejected, clear mistakes; B: rejected, minor but clinically relevant mistakes; C: accepted, small but irrelevant mistakes; D: accepted, no mistakes)

| Answer | DC | 95% HD (in voxels) | Precision | Sensitivity | Pixel-wise accuracy |
|---|---|---|---|---|---|
| A | 0.81 (0.76–0.88) | 21.1 (12.7–32.0) | 0.72 (0.71–0.91) | 0.89 (0.84–0.98) | 0.91 (0.85–1.00) |
| B | 0.84 (0.76–0.87) | 16.9 (10.0–23.8) | 0.81 (0.73–0.85) | 0.86 (0.78–0.96) | 0.94 (0.82–0.99) |
| C | 0.90 (0.86–0.93) | 10.3 (5.4–18.5) | 0.87 (0.77–0.94) | 0.98 (0.77–0.94) | 0.98 (0.90–0.99) |
| D | 0.95 (0.89–0.99) | 5.90 (2.00–10.2) | 0.94 (0.86–0.99) | 0.99 (0.97–0.99) | 0.98 (0.96–0.99) |
Fig. 2 Overview of the consensus answers and the corresponding 95% HD values of the slices. The grey bar indicates the median value with the 25th and 75th percentiles. The coloured dots are the qualitative evaluation answers
Fig. 3 Overview of the consensus answers and the corresponding DC values of the slices. The grey bar indicates the median value with the 25th and 75th percentiles. The coloured dots are the qualitative evaluation answers
Fig. 4 Overview of the consensus answers and the corresponding pixel-wise accuracy values of the slices. The grey bar indicates the median value with the 25th and 75th percentiles. The coloured dots are the qualitative evaluation answers
Fig. 5 This graph shows the agreement between the metric and the readers when shifting the threshold on the 95% HD for acceptance or rejection of contours
Fig. 6 This graph shows the agreement between the metric and the readers when shifting the threshold on the DC for acceptance or rejection of contours
Fig. 7 This graph shows the agreement between the metric and the readers when shifting the threshold on the pixel-wise accuracy for acceptance or rejection of contours
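Figures 5–7 sweep an acceptance threshold over each metric and measure how often the thresholded metric reproduces the readers' accept/reject consensus. A minimal sketch of such a threshold sweep, assuming NumPy; `agreement_curve` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def agreement_curve(metric_values, reader_accepted, thresholds, higher_is_better=True):
    """For each threshold, return the fraction of slices where accepting by
    metric threshold agrees with the readers' accept/reject decision.

    higher_is_better=True suits DC or accuracy; use False for the 95% HD,
    where lower values indicate better contours."""
    metric_values = np.asarray(metric_values, dtype=float)
    reader_accepted = np.asarray(reader_accepted, dtype=bool)
    agreement = []
    for t in thresholds:
        if higher_is_better:
            predicted_accept = metric_values >= t
        else:
            predicted_accept = metric_values <= t
        agreement.append(np.mean(predicted_accept == reader_accepted))
    return np.array(agreement)
```

Plotting the returned agreement against the thresholds reproduces the shape of curves like those in Figs. 5–7: agreement peaks at some intermediate threshold and falls off when the threshold is too strict or too lenient.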
Fig. 8 An example of a mismatch between the readers and the ground truth. On the left, the ground-truth segmentation, which excludes the aorta; on the right, the predicted segmentation