BACKGROUND/ PURPOSE: Oropharyngeal cancer (OPC) primary gross tumor volume (GTVp) segmentation is crucial for radiotherapy. Multiparametric MRI (mpMRI) is increasingly used for OPC adaptive radiotherapy but relies on manual segmentation. Therefore, we constructed mpMRI deep learning (DL) OPC GTVp auto-segmentation models and determined the impact of input channels on segmentation performance. MATERIALS/ METHODS: GTVp ground truth segmentations were manually generated for 30 OPC patients from a clinical trial. We evaluated five mpMRI input channels (T2, T1, ADC, Ktrans, Ve). 3D Residual U-net models were developed and assessed using leave-one-out cross-validation. A baseline T2 model was compared to mpMRI models (T2 + T1, T2 + ADC, T2 + Ktrans, T2 + Ve, all five channels [ALL]) primarily using the Dice similarity coefficient (DSC). False-negative DSC (FND), false-positive DSC, sensitivity, positive predictive value, surface DSC, Hausdorff distance (HD), 95% HD, and mean surface distance were also assessed. For the best model, ground truth and DL-generated segmentations were compared through a blinded Turing test using three physician observers. RESULTS: Models yielded mean DSCs from 0.71 ± 0.12 (ALL) to 0.73 ± 0.12 (T2 + T1). Compared to the T2 model, performance was significantly improved for FND, sensitivity, surface DSC, HD, and 95% HD for the T2 + T1 model (p < 0.05) and for FND for the T2 + Ve and ALL models (p < 0.05). No model demonstrated significant correlations between tumor size and DSC (p > 0.05). Most models demonstrated significant correlations between tumor size and HD or Surface DSC (p < 0.05), except those that included ADC or Ve as input channels (p > 0.05). On average, there were no significant differences between ground truth and DL-generated segmentations for all observers (p > 0.05). CONCLUSION: DL using mpMRI provides reasonably accurate segmentations of OPC GTVp that may be comparable to ground truth segmentations generated by clinical experts. Incorporating additional mpMRI channels may increase the performance of FND, sensitivity, surface DSC, HD, and 95% HD, and improve model robustness to tumor size.
BACKGROUND/ PURPOSE: Oropharyngeal cancer (OPC) primary gross tumor volume (GTVp) segmentation is crucial for radiotherapy. Multiparametric MRI (mpMRI) is increasingly used for OPC adaptive radiotherapy but relies on manual segmentation. Therefore, we constructed mpMRI deep learning (DL) OPC GTVp auto-segmentation models and determined the impact of input channels on segmentation performance. MATERIALS/ METHODS: GTVp ground truth segmentations were manually generated for 30 OPC patients from a clinical trial. We evaluated five mpMRI input channels (T2, T1, ADC, Ktrans, Ve). 3D Residual U-net models were developed and assessed using leave-one-out cross-validation. A baseline T2 model was compared to mpMRI models (T2 + T1, T2 + ADC, T2 + Ktrans, T2 + Ve, all five channels [ALL]) primarily using the Dice similarity coefficient (DSC). False-negative DSC (FND), false-positive DSC, sensitivity, positive predictive value, surface DSC, Hausdorff distance (HD), 95% HD, and mean surface distance were also assessed. For the best model, ground truth and DL-generated segmentations were compared through a blinded Turing test using three physician observers. RESULTS: Models yielded mean DSCs from 0.71 ± 0.12 (ALL) to 0.73 ± 0.12 (T2 + T1). Compared to the T2 model, performance was significantly improved for FND, sensitivity, surface DSC, HD, and 95% HD for the T2 + T1 model (p < 0.05) and for FND for the T2 + Ve and ALL models (p < 0.05). No model demonstrated significant correlations between tumor size and DSC (p > 0.05). Most models demonstrated significant correlations between tumor size and HD or Surface DSC (p < 0.05), except those that included ADC or Ve as input channels (p > 0.05). On average, there were no significant differences between ground truth and DL-generated segmentations for all observers (p > 0.05). CONCLUSION: DL using mpMRI provides reasonably accurate segmentations of OPC GTVp that may be comparable to ground truth segmentations generated by clinical experts. Incorporating additional mpMRI channels may increase the performance of FND, sensitivity, surface DSC, HD, and 95% HD, and improve model robustness to tumor size.
Oropharyngeal cancer (OPC), a type of head and neck squamous cell carcinoma (HNSCC), is among the most common malignancies globally [1]. Treatment for OPC often includes radiotherapy because of its high cure rate [2]. Segmentation (also termed contouring) of the primary gross tumor volume (GTVp) on radiologic imaging is necessary for the OPC radiotherapy workflow. The GTVp, with a clinical and planning safety margin, acts as a target volume to deliver the radiotherapy dose. Consequently, inadequate GTVp definition may cause under-dosage of the tumor or over-dosage of surrounding normal tissues [[3], [4]]. However, the current clinical standard is manual segmentation by physician experts, which is labor-intensive and subject to high inter-observer variation [[5], [6], [7]]. Therefore, an auto-segmentation tool would be a promising alternative to the current manual standard in OPC radiotherapy workflows.Deep learning (DL) has found wide success in auto-segmentation [[8], [9]], with many HNSCC auto-segmentation studies applying DL to CT imaging [[10], [11], [12]]. Although CT is the most commonly used imaging modality in OPC radiotherapy planning, MRI has been increasingly recognized as essential for tumor segmentation because of its exceptional soft-tissue contrast [[13], [14]]. Additionally, the emergence of MR-Linac technology, an image-guided adaptive radiotherapy approach [15], has further incentivized the incorporation of MRI in OPC radiotherapy planning. Importantly, we recently demonstrated the utility of DL for HNSCC organ-at-risk auto-segmentation using MRI, with improvements in performance, execution time, and dosimetric differences compared to other auto-segmentation methods [16]. While several DL tumor auto-segmentation studies for nasopharyngeal cancer using MRI have been published [[17], [18], [19], [20], [21], [22], [23], [24], [25], [26]], to our knowledge, only one study has been published for OPC [27]. Since HNSCC tumors at different anatomical sites have distinct anatomic boundaries and characteristics [[28], [29]], it is crucial that tumor segmentation models are developed for each site accordingly. Thus, there exists an unmet need for OPC DL tumor segmentation tools using MRI.Multiparametric MRI (mpMRI) incorporates multiple sequence acquisitions that highlight anatomical and functional information in tumors. For example, dynamic contrast-enhanced (DCE) MRI and diffusion-weighted imaging (DWI) can quantify tumor perfusion and diffusion patterns, respectively, and may affect OPC treatment guidance [[30], [31]]. Recent studies of PET/CT HNSCC DL auto-segmentation [[11], [21], [32], [33], [34], [35], [36], [37]] have demonstrated increased segmentation performance when combining functional and anatomical modalities. However, investigations that combine anatomical with functional MRI in HNSCC to achieve acceptable DL auto-segmentation performance are lacking [[38], [39]].In this pilot study, we evaluated the effects of anatomical and functional mpMRI inputs on OPC GTVp segmentation performance. Using open-source DL frameworks with standardized clinical trial data, we trained and evaluated DL models based on variable mpMRI input channels. We then compared the models qualitatively and quantitatively to determine which channel combinations led to the best segmentation results. Finally, we characterized the clinical acceptability of the best-performing model using physician expert observers.
Methods
Imaging data
We acquired pre-radiotherapy T2-weighted (T2), contrast-enhanced T1-weighted Dixon fat-suppressed (T1), DCE, and DWI MRI sequences in Digital Imaging and Communications in Medicine (DICOM) format for 124 HNSCC patients from a prospective clinical trial investigating longitudinal mpMRI (NCT03145077). Images were collected from August 2018-August 2019 under a HIPAA-compliant protocol (PA16-0302) approved by The University of Texas MD Anderson Cancer Center’s institutional review board. All patients provided study-specific informed consent. We curated 30 OPC patient data sets with a visible GTVp based on the complete availability of T2, T1, DCE, and DWI image sets (Appendix C, Fig. C1). Demographic characteristics of the patients are shown in Appendix C Table C1. Imaging was performed on a Siemens Aera scanner with a magnetic field strength of 1.5 T and standardized acquisition parameters (Appendix C Table C2). All patients were immobilized with a thermoplastic mask. Apparent diffusion coefficient (ADC) parametric maps were derived from DWI sequences through a proprietary Siemens algorithm (Munich, Germany) using a monoexponential model. The Tofts model was used to generate parametric maps from DCE sequences for the volume transfer constant (Ktrans) and the extravascular extracellular volume fraction (Ve) [40] (additional details can be found in Appendix A). GTVp ground truth structures were manually segmented in the DICOM-RT Structure format by a physician observer (radiologist with > 5 years of expertise in HNSCC) in Velocity AI v.3.0.1 (Atlanta, GA, USA). GTVp ground truth structures were segmented on the T2 MRI with all other co-registered images made available to the observer during the segmentation process. An example of the mpMRI images used in this study and overlying GTVp segmentation for one patient is shown in Fig. 1A.
Fig. 1
Annotation, processing, and analysis of data used in this study. (A) Multiparametric MRI input channels for oropharyngeal tumor segmentation. The white dotted line depicts the primary gross tumor volume segmentation. Anatomical sequence images are outlined in grey boxes, while functional sequence parametric map images are outlined in red boxes. (B) Image processing steps which included image cropping, resampling, and rescaling. (C) An illustration of the 3D Residual U-net model architecture. For illustrative purposes, only one input channel (T2-weighted image) is shown, but multiple input channel combinations were used throughout the analysis as separate models. (D) Overall study design which incorporated multi-channel input combinations coupled to a leave-one-out cross-validation (LOOCV) evaluation approach. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels. BN = Batch normalization, PReLU = parametric rectified linear unit activation function. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Annotation, processing, and analysis of data used in this study. (A) Multiparametric MRI input channels for oropharyngeal tumor segmentation. The white dotted line depicts the primary gross tumor volume segmentation. Anatomical sequence images are outlined in grey boxes, while functional sequence parametric map images are outlined in red boxes. (B) Image processing steps which included image cropping, resampling, and rescaling. (C) An illustration of the 3D Residual U-net model architecture. For illustrative purposes, only one input channel (T2-weighted image) is shown, but multiple input channel combinations were used throughout the analysis as separate models. (D) Overall study design which incorporated multi-channel input combinations coupled to a leave-one-out cross-validation (LOOCV) evaluation approach. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels. BN = Batch normalization, PReLU = parametric rectified linear unit activation function. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Image processing
To ensure adequate MRI comparability between patients [41], we performed intensity standardization for all images. Anatomical sequences (T2, T1) were standardized using a Z-score (mean = 0, standard deviation = 1), while functional parametric maps (ADC, Ktrans, Ve) were truncated to the 10th and 90th percentile for all patients to remove potential outliers and rescaled to [-1, 1] as per a previous study [38]. All images were cropped to the smallest field of view (Ktrans, Ve) and resampled to the T2 resolution. An example of the image processing workflow is shown in Fig. 1B.
Segmentation model architecture and implementation
A DL convolutional neural network based on the 3D Residual U-net architecture [[42], [43]] was implemented in the Medical Open Network for Artificial Intelligence (MONAI) software package [44] (Fig. 1C). The GTVp mask was used as the ground truth target to train the segmentation model. The MRI images acted as variable-channel inputs to the models. We investigated the following channel combinations as separate models: T2, T2 + T1, T2 + ADC, T2 + Ktrans, T2 + Ve, and all five input channels (ALL). The T2 model acted as a baseline of comparison for all other models. We implemented an Adam optimizer with a Sørensen-Dice similarity coefficient (DSC) loss function. The models were trained for 700 iterations with a learning rate of 2 × 10-4 for the first 550 iterations and 1 × 10-4 for the remaining 150 iterations based on empirical observations in previous studies [35]. We implemented data augmentation to mitigate overfitting, which included random horizontal flips of 50 % and random affine transformations with an axial rotation range of 12 degrees and a scale range of 10 %. Additional details on the DL architecture and implementation are found in Appendix A.
Model evaluation
Model performance was primarily assessed using DSC. We also implemented additional spatial similarity metrics, including Hausdorff distance (HD), false-negative DSC (FND), false-positive DSC (FPD), sensitivity, positive predictive value (PPV), surface DSC, 95% HD, and mean surface distance (MSD). For surface DSC, a tolerance of 3.0 mm was selected as suitable from previous inter-observer variability studies on T2 MRI of OPC GTVp [45]. Surface distance metrics were calculated using the surface-distance Python package [46], while all other metrics were calculated in Elekta ADMIRE v.2.9 (Stockholm, Sweden). Each model was trained and evaluated using leave-one-out cross-validation (LOOCV) (Fig. 1D).
Clinical evaluation
For our best-performing model, we assigned three physician expert observers (radiologist from 2.1 > 1-year post-segmenting, two radiation oncologists) to evaluate the ground truth and corresponding DL-generated segmentations using subjective scoring criteria based on a 4-point Likert scale. The score categories were: 1 = requires corrections, large errors; 2 = requires corrections, minor errors; 3 = clinically acceptable, errors not clinically significant; 4 = clinically acceptable, highly accurate. Additionally, we asked observers to predict the source of the segmentations as either human (ground truth) or DL-generated through a modified Turing test [47]. Ground truth and DL-generated segmentations for all 30 patients were anonymized and randomly presented to experts for clinical evaluation. Experts were blinded to the segmentation source.
Statistical analysis
After performing a Shapiro-Wilk test, we found that our data were not normally distributed (p < 0.05); therefore, we utilized nonparametric statistical tests. We used one-sided Wilcoxon signed-rank tests (alternative hypothesis of greater than for DSC, sensitivity, surface DSC, and PPV; alternative hypothesis of less than for HD, FND, FPD, 95% HD, and MSD) to evaluate differences between our baseline T2 model and models with additional channels. Given the utilization of a small number of a priori planned comparisons and exploratory nature of this work, corrections for multiple hypotheses were not considered. We used Mann-Whitney U tests to detect differences in model performance based on tumor subsite (base of tongue vs tonsil). Additionally, to assess correlations of tumor size with model performance, we calculated Pearson correlation coefficients with corresponding p-values of ground truth volume against DSC, HD, and surface DSC for every model. Finally, to assess the clinical evaluation of ground truth against DL-generated segmentations, for each observer we implemented a two-sided Wilcoxon signed-rank test for scores and a McNemar test for source predictions. For all statistical analyses, p-values<0.05 were considered significant. Analyses were performed in Python v.3.7.9. Code notebooks can be found at GitHub (https://github.com/kwahid/mpMRI_OPC_GTVp_segmentation).
Results
Model performance
T2 + T1 was the best performing model overall with the best mean scores in DSC, HD, sensitivity, surface DSC, and 95% HD, while ALL was the worst performing model overall with the worst mean scores in in DSC, FPD, and PPV (Fig. 2). To further quantify the impact of specific channel combinations in comparison to the baseline T2 model, significance testing was performed between different channel combinations and the baseline model. Of the significant relationships, T2 + T1 had better performance (p < 0.05) than the baseline T2 model for HD, FND, sensitivity, surface DSC, and 95% HD while T2 + Ve and ALL had better performance (p < 0.05) than the baseline T2 model for FND. These combined results suggest the T2 + T1 model to be the optimal combination for this segmentation task and was further analyzed in the Clinical Evaluation section. A complete heatmap of p-values comparing channel combinations to the baseline T2 model can be found in Appendix C Figure C2. Subsequent subgroup analysis revealed no significant differences in model performance for any combination of models and metrics based on OPC subsite (base of tongue vs tonsil), as all p-values were > 0.05 (Appendix C, Table C3). We further inspected predicted segmentation results of the various models for a few select cases based on DSC scores across all models (Fig. 3). For the high-performance case, the T2 model demonstrates a DSC of 0.88, with the incorporation of additional channels leading to DSC scores of 0.87–0.90. For the medium-performance case, the T2 model demonstrates a DSC of 0.71, with the incorporation of additional channels leading to DSC scores of 0.72–0.78. For the low-performance case, the T2 model demonstrated a DSC of 0.37 and many spuriously predicted voxels in the posterior region of the head, with the incorporation of additional channels reducing the number of spurious voxels and leading to DSC scores of 0.52–0.61. These results indicate improved performance in select cases when incorporating additional input channels. Additional in-depth analysis of this low performance case, a second low performance case, and an HPV-negative tumor case are explored in Appendix B.
Fig. 2
Boxplots comparing evaluation metrics of models built with different input channels. Evaluation metrics correspond to Dice similarity coefficient (DSC) (A), Hausdorff distance (HD) (B), false-negative DSC (FND) (C), false-positive DSC (FPD) (D), sensitivity (E), positive predictive value (PPV) (F), surface DSC (G), 95% HD (H), and mean surface distance (MSD) (I). Boxes show quartiles and median lines, while whiskers extend to the remaining distribution. Mean ± standard deviation is shown inside or adjacent to the corresponding box. The single and double stars above the boxplots correspond to significantly lower or higher values, respectively, compared to the baseline model for that metric. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels.
Fig. 3
2D axial slice representations of ground truth segmentations (red dotted outline) and predicted segmentations (yellow dotted outline) for high- (green), medium- (blue), and low- (orange) performance cases. Slices for each case are shown in rows superiorly to inferiorly (top, middle, and bottom). Models are shown in columns. The DSC scores for corresponding models are shown in the top left corners. The high-performance case corresponds to a left tonsillar T4 tumor. The medium-performance case corresponds to a left base of tongue T4 tumor. The low-performance case corresponds to a right base of tongue T4 tumor. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Boxplots comparing evaluation metrics of models built with different input channels. Evaluation metrics correspond to Dice similarity coefficient (DSC) (A), Hausdorff distance (HD) (B), false-negative DSC (FND) (C), false-positive DSC (FPD) (D), sensitivity (E), positive predictive value (PPV) (F), surface DSC (G), 95% HD (H), and mean surface distance (MSD) (I). Boxes show quartiles and median lines, while whiskers extend to the remaining distribution. Mean ± standard deviation is shown inside or adjacent to the corresponding box. The single and double stars above the boxplots correspond to significantly lower or higher values, respectively, compared to the baseline model for that metric. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels.2D axial slice representations of ground truth segmentations (red dotted outline) and predicted segmentations (yellow dotted outline) for high- (green), medium- (blue), and low- (orange) performance cases. Slices for each case are shown in rows superiorly to inferiorly (top, middle, and bottom). Models are shown in columns. The DSC scores for corresponding models are shown in the top left corners. The high-performance case corresponds to a left tonsillar T4 tumor. The medium-performance case corresponds to a left base of tongue T4 tumor. The low-performance case corresponds to a right base of tongue T4 tumor. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Size dependence of models
To determine the impact of tumor size on model performance, we investigated correlations comparing tumor size to representative metrics of volumetric and surface overlap, i.e., DSC, HD, and surface DSC (Fig. 4). The range of values for tumor size were 1.74–45.19 cc. Every model showed non-significant positive correlations for DSC (p > 0.05) and significant positive correlations for HD (p < 0.005), except for T2 + Ve (r = 0.33, p = 0.079) and ALL (r = 0.06, p = 0.76). Every model also showed significant negative correlations for surface DSC (p < 0.05), except for T2 + ADC (r = -0.34, p = 0.07), T2 + Ve (r = -0.30, p = 0.11), and ALL (r = -0.30, p = 0.1).
Fig. 4
Dependence of tumor size on the Dice Similarity Coefficient (DSC) (A), Hausdorff Distance (HD) (B), and surface DSC (C), for various input channel models. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels.
Dependence of tumor size on the Dice Similarity Coefficient (DSC) (A), Hausdorff Distance (HD) (B), and surface DSC (C), for various input channel models. T2 = T2-weighted MRI, T1 = T1-weighted MRI, ADC = apparent diffusion coefficient, Ktrans = volume transfer constant, Ve = extravascular extracellular volume fraction, ALL = all five input channels.3.3. Clinical Evaluation: The mean segmentation quality evaluation scores for ground truth vs DL-generated segmentations for the selected model (T2 + T1) were 3.0 vs 2.5, 2.5 vs 2.7, and 3.0 vs 3.0 for observers 1, 2, and 3, respectively. Significance testing revealed no observer could differentiate between the scores (p > 0.05) or source (p > 0.05) of the ground truth segmentations compared to the DL-generated segmentations (Table 1). Sub-analysis of clinical evaluation results for select cases can be found in Appendix B.
Table 1
Clinical evaluation and Turing test results for three physician expert observers. Each observer was asked to score blinded ground truth (GT) or deep learning (DL)-generated segmentations on a 4-point Likert scale (1 = requires corrections, large errors; 2 = requires corrections, minor errors; 3 = clinically acceptable, errors not clinically significant; 4 = clinically acceptable, highly accurate) and asked to identify the source of the segmentation (GT or DL). DL-generated segmentations corresponded to the best DL model tested (T2-weighted + T1-weighted).
Observer
Score
GT (#)
DL (#)
p-value1
Source
GT (#)
DL (#)
p-value2
Observer 1 (Radiologist)
1
3
6
0.13
GT
16
10
0.18
2
7
11
DL
14
20
3
7
4
4
13
9
Observer 2 (Radiation Oncologist)
1
3
6
0.44
GT
14
14
1.00
2
10
4
DL
16
16
3
16
13
4
1
7
Observer 3 (Radiation Oncologist)
1
1
3
0.98
GT
9
12
0.61
2
8
6
DL
21
18
3
11
10
4
10
11
Two-sided Wilcoxon signed rank tests were used for score comparisons. 2 McNemar tests were used for source prediction comparisons.
Clinical evaluation and Turing test results for three physician expert observers. Each observer was asked to score blinded ground truth (GT) or deep learning (DL)-generated segmentations on a 4-point Likert scale (1 = requires corrections, large errors; 2 = requires corrections, minor errors; 3 = clinically acceptable, errors not clinically significant; 4 = clinically acceptable, highly accurate) and asked to identify the source of the segmentation (GT or DL). DL-generated segmentations corresponded to the best DL model tested (T2-weighted + T1-weighted).Two-sided Wilcoxon signed rank tests were used for score comparisons. 2 McNemar tests were used for source prediction comparisons.
Discussion
In this pilot study, we determined the impact of mpMRI input channel combinations (T2, T2 + T1, T2 + ADC, T2 + Ktrans, T2 + Ve, ALL) on DL model segmentation performance. Recent work has suggested that the average agreement between physicians measured in DSC for OPC tumor segmentation is exceptionally low [45]. Notably, compared to previous fully-automated primary tumor segmentation studies of HNSCC patients, we achieved promising average DSC performance (Table 2). While it is difficult to directly compare DSCs between studies due to different datasets and model training, our models seemingly improve upon the only other fully-automated OPC tumor segmentation study to our knowledge, which exclusively investigated anatomical MRI [27].
Table 2
Survey of relevant DL auto-segmentation literature for comparison with our study. Only studies ≤ 3 years old and with sample sizes ≥ 30 were selected for comparison. DL = deep learning, N = number of images sets used in study, GTVp = primary gross tumor volume, DSC = Dice similarity coefficient, OPC = oropharyngeal cancer, NPC = nasopharyngeal cancer, HNSCC = head and neck squamous cell carcinoma, T2 = T2-weighted MRI, T1 = T1-weighted MRI, DCE = dynamic contrast enhanced MRI, DWI = diffusion weighted imaging MRI, LOOCV = leave-one-out cross-validation, CV = cross-validation, CNN = convolutional neural network.
Author, Year
Site
Modality
DL Architecture
N (Train, Test)
GTVp DSC (average)
This study, 2021
OPC
MRI (T1, T2, DCE, DWI)
3D Residual Unet
30 (LOOCV)
0.73 (best model, T2 + T1)
Outeiral et al., 2021 [27]
OPC
MRI (T1, T2)
3D Unet
171 (151, 20)
0.55
Andrearczyk et al., 2020 [32]
OPC
CT, PET
2D Unet
202 (LOOCV)
0.48 (CT), 0.58 (PET), 0.6 (CT + PET)
Moe et al., 2019 [33]
OPC
CT, PET
2D Unet
197 (157, 40)
0.65 (CT), 0.71 (PET), 0.75 (CT + PET)
Naser et al., 2020 [35]
OPC
CT, PET
3D Unet
201 (5-fold CV)
0.69
Iantsen et al., 2020 [36]
OPC
CT, PET
3D Unet
201 (4-fold CV)
0.75
Ma et al., 2018 [19]
NPC
MRI (T1)
3D CNN + graph-cut
30 (LOOCV)
0.85
Ye et al., 2020 [18]
NPC
MRI (T1, T2)
3D Unet
44 (10-fold CV)
0.62 (T1), 0.64 (T2), 0.72 (T1 + T2)
Chen et al., 2020 [21]
NPC
MRI (T1, T2)
3D Encoder-decoder network
149 (5-fold CV)
0.72
Huang et al., 2019 [26]
NPC
MRI (T1, T2)
2D CNN + recurrent attention
596 (430, 166)
0.78
Lin et al., 2019 [17]
NPC
MRI (T1, T2)
3D CNN
1021 (818, 203)
0.79
Ke et al., 2020 [23]
NPC
MRI (T1, T2)
3D DenseNet + multi-task learning
3142 (2792, 350)
0.77
Li et al., 2018 [20]
NPC
MRI (T1)
2D CNN
87 (LOOCV)
0.89
Ma et al., 2019 [25]
NPC
CT, MRI (T1)
2D CNN
90 (5-fold CV)
0.75
Bielak et al., 2020 [38]
HNSCC
MRI (T1, T2, DCE, DWI)
3D CNN (DeepMedic)
36 (LOOCV)
∼0.30*
Ren et al., 2021 [37]
HNSCC
CT, PET, MRI (T1, T2)
3D Residual Unet
153 (123, 30)
0.72 (PET + MRI), 0.74 (CT + PET + MRI)
Average DSC interpreted from manuscript figure.
Survey of relevant DL auto-segmentation literature for comparison with our study. Only studies ≤ 3 years old and with sample sizes ≥ 30 were selected for comparison. DL = deep learning, N = number of images sets used in study, GTVp = primary gross tumor volume, DSC = Dice similarity coefficient, OPC = oropharyngeal cancer, NPC = nasopharyngeal cancer, HNSCC = head and neck squamous cell carcinoma, T2 = T2-weighted MRI, T1 = T1-weighted MRI, DCE = dynamic contrast enhanced MRI, DWI = diffusion weighted imaging MRI, LOOCV = leave-one-out cross-validation, CV = cross-validation, CNN = convolutional neural network.Average DSC interpreted from manuscript figure.The best average DSC performance was achieved by the T2 + T1 model (DSC = 0.73), which was higher than the baseline T2 model (DSC = 0.72) but not statistically significant. Moreover, average DSC decreased when combining all input channels (DSC = 0.71), though non-significantly. However, a previous similar study by Bielak et al. investigating HNSCC tumors with segmentations derived from T2 MRI demonstrated an increased DSC after the inclusion of all available mpMRI channels [38], which is in direct opposition to our results. Importantly, the authors used a smaller number of patients (n = 18) than our study and implemented repeat imaging at different time-points, which could confound their results. Additionally, their results may be more relevant for a specific HNSCC tumor site, but no analysis was performed to verify this. Notably, auto-segmentation studies in prostate cancer have also reported conflicting results on the additive effects of additional mpMRI input channels for DSC when using ground truth annotations derived from T2 MRI [[48], [49], [50]]. Therefore, further investigations are likely needed to verify if a significant positive DSC effect exists for mpMRI input channel combinations in OPC tumor auto-segmentation.While most auto-segmentation studies have focused on DSC as an evaluation metric, it has been argued that other metrics should also be taken into consideration, depending on the use-case of the auto-segmentation tool [[51], [52]]. Therefore, to increase the robustness of our analysis, we have included complimentary metrics (HD, FND, FPD, sensitivity, PPV, surface DSC, 95% HD, and MSD) to evaluate our models. Like DSC, most metrics show high performance across various models, with some models demonstrating significantly better values than the baseline T2 model. Interestingly, we demonstrated that in certain edge cases (low-performance example), the inclusion of additional channels could circumvent spurious voxel predictions derived from the baseline T2 model (a possible byproduct of model overfitting), which may increase model robustness (see further discussion in Appendix B, Case 2). These results indicate that the additional channels may contain underlying additive information to improve performance for aspects other than traditional DSC-based evaluation. Notably, the specific anatomic subsite of the tumor (base of tongue or tonsil) had no significant effect on performance for any models for any evaluation metric, indicating that the models were robust to the spatial location of the OPC. Moreover, while the majority of the cases in our study cohort were HPV-positive, our best model (T2 + T1) was minimally affected in the evaluation of an HPV-negative case, though other models were negatively affected (Appendix B, Case 1). Generally, cases that were not well-represented in model training, i.e., irregular tumor presentations, led to suboptimal auto-segmentation performance (Appendix B, Cases 2 and 3).Previous studies [[17], [38]] have suggested small tumors may be more difficult for DL models to segment, which would hinder the incorporation of models into radiotherapy workflows. Importantly, there were no significant correlations between tumor size and DSC for any of our models. However, it should be noted that surface distance metrics, such as the HD and surface DSC, demonstrate some size dependence, with larger and smaller tumors being easier for our models to segment, respectively. Interestingly, the surface distance metrics do not demonstrate a significant size dependence for some models that utilize additional channels, particularly those that correspond to functional parametric maps. Therefore, the inclusion of additional channels may strengthen the robustness of models to tumor size for surface distance metric performance, but further confirmatory work is needed.The acceptability of segmentations used in a radiotherapy workflow is ultimately determined by physician judgment, with physician rating scales considered the gold standard for clinically relevant segmentation quality [52]. While subjective evaluation through rating scales is common in auto-segmentation studies, the established variability of OPC tumor segmentation between observers [45] highlights the difficulty in the interpretation of multi-observer segmentation quality analysis. Therefore, we implemented a comparative approach for each observer to determine if significant clinical differences were present between the ground truth segmentations and the corresponding segmentations of the best DL model (T2 + T1). We demonstrated that experts were on average unable to determine differences between the ground truth and the DL-generated segmentations or identify the source of the segmentations. Of note, the radiologist who provided the original ground truth segmentations was the closest among the observers to correctly discriminating the segmentation sources but was still unable to achieve statistical significance. Additionally, for the radiation oncologist observers the mean clinical acceptability score of the DL-generated segmentations was equal to or higher than the ground truth segmentations, which may indicate a slight preference towards DL-generated OPC tumor segmentations for radiotherapy end users. Importantly, while these results are encouraging, they should not be conflated with a necessarily positive indication for the desired endpoint of radiation therapy planning, which would require further dosimetric studies [52]. Moreover, we demonstrate several DL segmentation failure cases confirmed by clinical evaluation (Appendix B, Cases 2 and 3), so further work is needed to ensure the generation clinically useful segmentations using these approaches.One limitation of our study is the use of a small cohort of predominately HPV-positive tumors with standardized acquisition parameters, thus the generalizability of the tested models as well as their robustness to different patient groups and acquisition settings may be restricted. However, we have taken steps to optimally utilize our data by implementing a LOOCV approach and investigating various evaluation metrics. Moreover, we plan to include additional prospectively acquired data for model training and use external heterogenous validation sets in future studies to increase model generalizability. Another limitation of our study is that we have constrained our analysis of input image channels based on those that were investigated in previous literature [38]. However, mpMRI input channels can be further investigated through additional quantitative parametric maps (e.g., extended Tofts model [53], advanced DWI fitting models [54], etc.). Moreover, additional data sources such as CT and PET could also be integrated into existing models to determine their impact on auto-segmentation performance. Additionally, we have limited our analysis to primary tumor volumes, so the utility of mpMRI DL for metastatic cervical lymph node tumor volume segmentation in OPC remains unknown. We plan to include additional input channels and incorporate lymph node segmentation in future analyses. A final limitation of our study is the lack of overt image registration. Our images were acquired from a standardized clinical trial with patient immobilization; therefore, implicit co-registration was deemed adequate for tumor overlap. However, small amounts of motion artifacts may cause the segmentation mask to overlap improperly on mpMRI image channels, impacting auto-segmentation quality. Furthermore, though no geometric distortion was observed on any parametric maps, distortions were not explicitly quantified. Future studies should investigate the role of additional OPC-specific registration algorithms and geometric distortion correction in combination with mpMRI DL auto-segmentation algorithms.
Conclusions
In summary, using mpMRI inputs, we built OPC primary tumor DL auto-segmentation models that demonstrated reasonable performance across multiple evaluation metrics. While most channels did not significantly impact model performance as long as the T2 MRI was included, we find that adding T1 MRI significantly improved HD, FND, sensitivity, surface DSC, and 95% HD and adding Ve or using all input channels simultaneously significantly improved FND. Additionally, certain favorable aspects of model construction, including decreased spurious voxel predictions and robustness to tumor size when considering surface distance metric performance, are apparent for models that leverage additional input channels. Finally, blinded physician experts could not differentiate ground truth from DL-generated segmentations (on average), demonstrating our model shows promise in a Turing test scenario, but should be further investigated in dosimetric impact studies. Our results should be verified in large independent external datasets. Moreover, future studies should investigate if the incorporation of additional images germane to the radiation oncology workflow, such as CT and PET, can further improve model performance. Overall, our pilot study is an important step towards fully automated MR-guided OPC radiotherapy workflows.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Authors: Mark J Gooding; Annamarie J Smith; Maira Tariq; Paul Aljabar; Devis Peressutti; Judith van der Stoep; Bart Reymen; Daisy Emans; Djoya Hattu; Judith van Loon; Maud de Rooy; Rinus Wanders; Stephanie Peeters; Tim Lustberg; Johan van Soest; Andre Dekker; Wouter van Elmpt Journal: Med Phys Date: 2018-10-12 Impact factor: 4.071
Authors: Brigid A McDonald; Sastry Vedam; Jinzhong Yang; Jihong Wang; Pamela Castillo; Belinda Lee; Angela Sobremonte; Sara Ahmed; Yao Ding; Abdallah S R Mohamed; Peter Balter; Neil Hughes; Daniela Thorwarth; Marcel Nachbar; Marielle E P Philippens; Chris H J Terhaard; Daniel Zips; Simon Böke; Musaddiq J Awan; John Christodouleas; Clifton D Fuller Journal: Int J Radiat Oncol Biol Phys Date: 2020-12-16 Impact factor: 7.038
Authors: Travis C Salzillo; Nicolette Taku; Kareem A Wahid; Brigid A McDonald; Jarey Wang; Lisanne V van Dijk; Jillian M Rigert; Abdallah S R Mohamed; Jihong Wang; Stephen Y Lai; Clifton D Fuller Journal: Semin Radiat Oncol Date: 2021-10 Impact factor: 5.421
Authors: F De Felice; V Tombolini; V Valentini; M de Vincentiis; S Mezi; O Brugnoletti; A Polimeni Journal: J Oncol Date: 2019-04-14 Impact factor: 4.375
Authors: Stanislav Nikolov; Sam Blackwell; Alexei Zverovitch; Cían Owen Hughes; Joseph R Ledsam; Olaf Ronneberger; Ruheena Mendes; Michelle Livne; Jeffrey De Fauw; Yojan Patel; Clemens Meyer; Harry Askham; Bernadino Romera-Paredes; Christopher Kelly; Alan Karthikesalingam; Carlton Chu; Dawn Carnell; Cheng Boon; Derek D'Souza; Syed Ali Moinuddin; Bethany Garie; Yasmin McQuinlan; Sarah Ireland; Kiarna Hampton; Krystle Fuller; Hugh Montgomery; Geraint Rees; Mustafa Suleyman; Trevor Back Journal: J Med Internet Res Date: 2021-07-12 Impact factor: 5.428
Authors: Nicolette Taku; Kareem A Wahid; Lisanne V van Dijk; Jaakko Sahlsten; Joel Jaskari; Kimmo Kaski; Clifton D Fuller; Mohamed A Naser Journal: Clin Transl Radiat Oncol Date: 2022-06-18
Authors: Mohamed A Naser; Kareem A Wahid; Lisanne V van Dijk; Renjie He; Moamen Abobakr Abdelaal; Cem Dede; Abdallah S R Mohamed; Clifton D Fuller Journal: Head Neck Tumor Segm Chall (2021) Date: 2022-03-13
Authors: Gerard M Walls; Valentina Giacometti; Aditya Apte; Maria Thor; Conor McCann; Gerard G Hanna; John O'Connor; Joseph O Deasy; Alan R Hounsell; Karl T Butterworth; Aidan J Cole; Suneil Jain; Conor K McGarry Journal: Phys Imaging Radiat Oncol Date: 2022-07-26
Authors: Mohamed A Naser; Kareem A Wahid; Aaron J Grossberg; Brennan Olson; Rishab Jain; Dina El-Habashy; Cem Dede; Vivian Salama; Moamen Abobakr; Abdallah S R Mohamed; Renjie He; Joel Jaskari; Jaakko Sahlsten; Kimmo Kaski; Clifton D Fuller Journal: Front Oncol Date: 2022-07-28 Impact factor: 5.738
Authors: Roque Rodríguez Outeiral; Paula Bos; Hedda J van der Hulst; Abrahim Al-Mamgani; Bas Jasperse; Rita Simões; Uulke A van der Heide Journal: Phys Imaging Radiat Oncol Date: 2022-08-13