Diana Veiga-Canuto, Leonor Cerdà-Alberich, Cinta Sangüesa Nebot, Blanca Martínez de Las Heras, Ulrike Pötschger, Michela Gabelloni, José Miguel Carot Sierra, Sabine Taschner-Mandl, Vanessa Düster, Adela Cañete, Ruth Ladenstein, Emanuele Neri, Luis Martí-Bonmatí.
Abstract
Tumor segmentation is one of the key steps in image processing. The goals of this study were to assess the inter-observer variability of manual segmentation of neuroblastic tumors and to analyze whether the state-of-the-art deep learning architecture nnU-Net can provide a robust solution to detect and segment tumors on MR images. A retrospective multicenter study of 132 patients with neuroblastic tumors was performed. The Dice Similarity Coefficient (DSC) and the Area Under the Receiver Operating Characteristic Curve (AUC ROC) were used to compare segmentation sets. Two additional metrics were defined to capture the direction of the errors: a modified False Positive Rate (FPRm) and the False Negative Rate (FNR). Two radiologists manually segmented 46 tumors and a comparative study was performed. nnU-Net was trained and tuned with 106 cases divided into five balanced folds for cross-validation. The five resulting models were used as an ensemble solution to measure training (n = 106) and validation (n = 26) performance independently. The time needed by the model to automatically segment 20 cases was compared with the time required for manual segmentation. The median DSC for the manual segmentation sets was 0.969 (IQR 0.032); the median DSC for the automatic tool was 0.965 (IQR 0.018). The automatic segmentation model achieved a better performance regarding the FPRm. Segmentation variability on MR images is similar between radiologists and nnU-Net. The time saved when using the automatic model, followed by visual validation and manual adjustment, corresponds to 92.8%.
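The record does not state how this 92.8% figure was derived; read as a relative time saving over the 20 timed cases (an assumption, not the authors' stated formula), it corresponds to:

$$\text{time saving} = \frac{T_{\text{manual}} - T_{\text{automatic + review}}}{T_{\text{manual}}} \times 100\% \approx 92.8\%$$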
Keywords: automatic segmentation; deep learning; inter-observer variability; manual segmentation; neuroblastic tumors; tumor segmentation
Year: 2022 PMID: 35954314 PMCID: PMC9367307 DOI: 10.3390/cancers14153648
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.575
Figure 1. Study design. The first part assessed manual segmentation variability by comparing the performance of two radiologists (n = 46). The second part covered the training and validation of nnU-Net using 132 cases manually segmented by Radiologist 2. Training-tuning with cross-validation was performed. The five segmentation models obtained with the cross-validation method were used as an ensemble solution to test all cases of the training-tuning (n = 106) and validation (n = 26) datasets, measuring training and validation performance independently. Cases had previously been split into balanced groups considering vendor, magnetic field strength, location, and segmented sequence.
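As a minimal sketch of the ensembling step described above, assuming each of the five fold models yields a voxel-wise tumor probability map for a case (nnU-Net performs this averaging internally; the function name and inputs here are illustrative, not taken from the study's code):

```python
import numpy as np

def ensemble_prediction(prob_maps: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    """Average five per-fold tumor probability maps and binarize.

    prob_maps: arrays of identical shape, one per cross-validation fold,
    holding voxel-wise tumor probabilities in [0, 1].
    """
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_prob >= threshold).astype(np.uint8)  # final binary tumor mask
```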
Composition of the validation dataset (20% of cases, n = 26), balanced across four variables: vendor, magnetic field strength, location, and segmented sequence.
| Sequence | Vendor | Field Strength (T) | Location |
|---|---|---|---|
| T2 | Philips | 1.5 | Abdominopelvic |
| T2 | Siemens | 1.5 | Abdominopelvic |
| T2 | Philips | 1.5 | Abdominopelvic |
| T2 | GE | 1.5 | Abdominopelvic |
| T2 | GE | 1.5 | Cervicothoracic |
| T2 | Philips | 1.5 | Abdominopelvic |
| T2 | Siemens | 1.5 | Abdominopelvic |
| T2 | Philips | 1.5 | Cervicothoracic |
| T2 | Siemens | 1.5 | Abdominopelvic |
| T2 fat sat | GE | 1.5 | Abdominopelvic |
| T2 | Siemens | 3 | Abdominopelvic |
| T2 | GE | 1.5 | Abdominopelvic |
| T2 | GE | 1.5 | Cervicothoracic |
| T2 | Philips | 1.5 | Abdominopelvic |
| T2 fat sat | Philips | 1.5 | Abdominopelvic |
| T2 | Siemens | 3 | Abdominopelvic |
| T2 | Siemens | 1.5 | Abdominopelvic |
| T2 fat sat | Siemens | 1.5 | Abdominopelvic |
| T2 fat sat | GE | 1.5 | Cervicothoracic |
| T2 fat sat | GE | 1.5 | Abdominopelvic |
| T2 fat sat | GE | 1.5 | Abdominopelvic |
| T2 fat sat | Siemens | 1.5 | Abdominopelvic |
| T2 | GE | 1.5 | Cervicothoracic |
| T2 fat sat | Siemens | 1.5 | Abdominopelvic |
| T2 fat sat | Siemens | 3 | Abdominopelvic |
| T2 fat sat | GE | 1.5 | Cervicothoracic |
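The balanced split described above can be approximated by sampling a fixed fraction of cases within each combination of the four variables. A minimal sketch with pandas, using a hypothetical metadata table in place of the study's 132 cases:

```python
import pandas as pd

# Hypothetical per-case metadata; the real study had 132 such rows.
meta = pd.DataFrame({
    "case_id":  [f"case_{i:03d}" for i in range(20)],
    "sequence": ["T2"] * 10 + ["T2 fat sat"] * 10,
    "vendor":   (["Philips"] * 5 + ["Siemens"] * 5) * 2,
    "field_T":  [1.5] * 20,
    "location": ["Abdominopelvic"] * 20,
})

# Sample ~20% of the cases within every combination of the four
# variables so the validation set mirrors their joint distribution.
val = (meta.groupby(["sequence", "vendor", "field_T", "location"], group_keys=False)
           .apply(lambda g: g.sample(frac=0.2, random_state=0)))
train = meta.drop(val.index)
```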
Figure 2. The ground truth (true positive and false negative voxels) corresponds to the manual segmentation performed by Radiologist 2, which was compared first to the manual segmentation performed by Radiologist 1 and then to the mask produced by the automatic segmentation model (non-ground-truth mask: true positive and false positive voxels). The FPRm counts the voxels the model labeled as tumor that actually belong to other structures, divided by the number of ground-truth voxels. The FNR counts the tumor voxels the model failed to include, divided by the number of ground-truth voxels.
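Following the definitions in the caption above, a minimal sketch of the voxel-wise metrics (function name and signature are illustrative; AUC ROC additionally requires continuous probability outputs and is omitted):

```python
import numpy as np

def segmentation_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    """DSC, modified FPR, and FNR between two binary masks.

    Both rates are normalized by the number of ground-truth voxels,
    as in Figure 2; gt is assumed to contain at least one tumor voxel.
    """
    gt, pred = gt.astype(bool), pred.astype(bool)
    tp = np.sum(gt & pred)    # tumor voxels the model found
    fp = np.sum(~gt & pred)   # voxels wrongly labeled as tumor
    fn = np.sum(gt & ~pred)   # tumor voxels the model missed
    n_gt = gt.sum()           # size of the ground-truth mask
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "FPRm": fp / n_gt,
        "FNR": fn / n_gt,
    }
```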
Inter-observer variability: performance metrics for the comparison of the two manual segmentation sets, considering DSC, AUC ROC, 1-FPRm, and 1-FNR.
| | DSC | AUC ROC | 1-FPRm | 1-FNR |
|---|---|---|---|---|
| Median | 0.969 | 0.998 | 0.939 | 0.998 |
| IQR | 0.032 | 0.004 | 0.063 | 0.008 |
| CI | 0.042 | 0.021 | 0.044 | 0.042 |
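The tables report a CI row alongside the median and IQR without stating how it was obtained. One common choice, shown here purely as an assumption about the method, is a bootstrap confidence interval for the median:

```python
import numpy as np

def median_ci_halfwidth(values, n_boot=10_000, alpha=0.05, seed=0):
    """Half-width of a (1 - alpha) bootstrap CI for the median."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and take the median of each resample.
    medians = np.median(
        rng.choice(values, size=(n_boot, values.size), replace=True), axis=1)
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return (hi - lo) / 2
```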
Figure 3. Two cases segmented by Radiologist 1 (red label) and Radiologist 2 (pink label), with mask superposition for comparison. Case 1 was segmented on T2w and case 2 on T2w fat-sat images. In both cases, the DSC was 0.957.
Performance metrics for the comparison between nnU-Net and Radiologist 2. Cases were divided into five folds for cross-validation; DSC, AUC ROC, 1-FPRm, and 1-FNR are reported for each fold.
| Fold | Metric | DSC | AUC ROC | 1-FPRm | 1-FNR |
|---|---|---|---|---|---|
| Fold 0 | Median | 0.895 | 0.940 | 0.922 | 0.882 |
| | IQR | 0.121 | 0.116 | 0.082 | 0.233 |
| | CI | 0.146 | 0.117 | 0.074 | 0.148 |
| Fold 1 | Median | 0.873 | 0.926 | 0.944 | 0.856 |
| | IQR | 0.110 | 0.100 | 0.100 | 0.100 |
| | CI | 0.127 | 0.066 | 0.088 | 0.132 |
| Fold 2 | Median | 0.899 | 0.936 | 0.935 | 0.875 |
| | IQR | 0.131 | 0.064 | 0.133 | 0.133 |
| | CI | 0.123 | 0.062 | 0.125 | 0.124 |
| Fold 3 | Median | 0.901 | 0.948 | 0.949 | 0.897 |
| | IQR | 0.122 | 0.062 | 0.088 | 0.124 |
| | CI | 0.046 | 0.030 | 0.090 | 0.061 |
| Fold 4 | Median | 0.874 | 0.927 | 0.958 | 0.856 |
| | IQR | 0.134 | 0.110 | 0.033 | 0.221 |
| | CI | 0.141 | 0.071 | 0.032 | 0.142 |
Figure 4. Original transverse and coronal MR images and three example cases automatically segmented by nnU-Net (blue label) and manually by Radiologist 2 (pink label), with mask superposition for comparison. Case 1 was segmented on T2w fat-sat with a DSC of 0.869. Case 2 was segmented on T2w with a DSC of 0.954. Case 3 was segmented with a DSC of 0.617.
Figure 5. Box plots depicting the whole set of DSC values for each fold of the training group and for the validation set.
The five segmentation models obtained with the cross-validation method were used as an ensemble solution to test all cases of the training-tuning set (n = 106). Performance metrics for the final results are reported overall and broken down by location (abdominopelvic or cervicothoracic) and magnetic field strength (1.5 T or 3 T).
| | DSC | AUC ROC | 1-FPRm | 1-FNR |
|---|---|---|---|---|
| All cases (n = 106) | | | | |
| Median | 0.965 | 0.981 | 0.968 | 0.963 |
| IQR | 0.018 | 0.010 | 0.015 | 0.021 |
| CI | 0.031 | 0.015 | 0.025 | 0.031 |
| Cervicothoracic (n = 21) | | | | |
| Median | 0.956 | 0.975 | 0.962 | 0.950 |
| IQR | 0.024 | 0.012 | 0.015 | 0.024 |
| CI | 0.036 | 0.018 | 0.037 | 0.036 |
| Abdominopelvic (n = 85) | | | | |
| Median | 0.966 | 0.982 | 0.969 | 0.965 |
| IQR | 0.015 | 0.009 | 0.014 | 0.019 |
| CI | 0.037 | 0.018 | 0.030 | 0.038 |
| 1.5 T (n = 93) | | | | |
| Median | 0.965 | 0.981 | 0.969 | 0.963 |
| IQR | 0.018 | 0.011 | 0.016 | 0.021 |
| CI | 0.029 | 0.014 | 0.021 | 0.029 |
| 3 T (n = 13) | | | | |
| Median | 0.964 | 0.982 | 0.967 | 0.964 |
| IQR | 0.013 | 0.005 | 0.007 | 0.010 |
| CI | 0.145 | 0.073 | 0.138 | 0.145 |
Performance metrics for the validation cohort (n = 26), considering DSC, AUC ROC, 1-FPRm, and 1-FNR. Results for Radiologist 2 vs. the automatic model are shown. To compare these results with inter-radiologist agreement, Radiologist 1 also segmented the 26 validation cases, and comparisons with Radiologist 2 and with the automatic model were made.
| | DSC | AUC ROC | 1-FPRm | 1-FNR |
|---|---|---|---|---|
| Radiologist 2 vs. automatic model | | | | |
| Median | 0.918 | 0.968 | 0.943 | 0.938 |
| IQR | 0.080 | 0.051 | 0.088 | 0.104 |
| CI | 0.059 | 0.473 | 0.134 | 0.063 |
| Radiologist 1 vs. Radiologist 2 | | | | |
| Median | 0.920 | 0.950 | 0.929 | 0.930 |
| IQR | 0.090 | 0.192 | 0.015 | 0.024 |
| CI | 0.038 | 0.053 | 0.166 | 0.058 |
| Radiologist 1 vs. automatic model | | | | |
| Median | 0.915 | 0.950 | 0.915 | 0.912 |
| IQR | 0.443 | 0.122 | 0.436 | 0.189 |
| CI | 0.114 | 0.054 | 0.161 | 0.104 |