Barbara D. Wichtmann, Steffen Albert, Wenzhao Zhao, Angelika Maurer, Claus Rödel, Ralf-Dieter Hofheinz, Jürgen Hesser, Frank G. Zöllner, Ulrike I. Attenberger.
Abstract
This retrospective study aims to evaluate the generalizability of a promising state-of-the-art multitask deep learning (DL) model for predicting the response of locally advanced rectal cancer (LARC) to neoadjuvant chemoradiotherapy (nCRT) using a multicenter dataset. To this end, we retrained and validated a Siamese network with two U-Nets joined at multiple layers, using pre- and post-therapeutic T2-weighted (T2w) images, diffusion-weighted (DW) images, and apparent diffusion coefficient (ADC) maps of 83 LARC patients acquired under study conditions at four different medical centers. To assess the predictive performance of the model, the trained network was then applied to an external clinical routine dataset of 46 LARC patients imaged outside of study conditions. The training and test datasets differed significantly in composition, e.g., T-/N-staging and the time intervals between initial staging/nCRT/re-staging and surgery, as well as in acquisition parameters, such as resolution, echo/repetition time, flip angle, and field strength. We found that even after dedicated data pre-processing, the predictive performance dropped significantly in this multicenter setting compared to a previously published single- or two-center setting. Testing the network on the external clinical routine dataset yielded an area under the receiver operating characteristic curve of 0.54 (95% confidence interval [CI]: 0.41, 0.65) when using only pre- and post-therapeutic T2w images as input, and 0.60 (95% CI: 0.48, 0.71) when using the combination of pre- and post-therapeutic T2w, DW images, and ADC maps as input. Our study highlights the importance of data quality and harmonization in clinical trials using machine learning.
Only through a joint, cross-center effort involving a multidisciplinary team can we generate sufficiently large curated and annotated datasets and develop the pre-processing pipelines for data harmonization needed to apply DL models successfully in the clinic.
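The abstract reports AUCs with 95% confidence intervals. This record does not state how those intervals were derived; a common choice is a percentile bootstrap over patients. The sketch below is illustrative only (not the authors' code): it computes the AUC via the Mann–Whitney pairwise-ranking formulation and bootstraps a CI.

```python
import random

def auc(labels, scores):
    """AUC as the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the AUC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # resample must contain both classes
            aucs.append(auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    return aucs[int(alpha / 2 * n_boot)], aucs[int((1 - alpha / 2) * n_boot) - 1]
```

Resampling whole patients (rather than individual images) keeps the per-patient correlation structure intact, which matters when a patient contributes several slices.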
Keywords: deep learning; locally advanced rectal cancer; machine learning; multicenter; response prediction to nCRT
Year: 2022 PMID: 35885506 PMCID: PMC9317842 DOI: 10.3390/diagnostics12071601
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Figure 1. Flowchart of the inclusion and exclusion of LARC patients from the CAO-ARO-AIO-12 study.
Figure 2. Distribution of selected imaging parameters per center for the T2w images.
Characteristics of patients in the training, validation, and test cohorts, disaggregated by center.

| Characteristic | Center 1 (Training) | Center 2 (Training) | Center 3 (Training) | Center 4a (Validation) | Center 4b (Test) | p-value |
|---|---|---|---|---|---|---|
| Acquisition period (years) | 2015–2018 | 2016–2017 | 2015–2017 | 2015–2017 | 2009–2013 | |
| Age (mean ± std) | 60 ± 10 | 61 ± 11 | 60 ± 6 | 66 ± 6 | 64 ± 11 | 0.07 |
| **Sex** | | | | | | 0.15 |
| Male | 26 | 14 | 8 | 9 | 37 | |
| Female | 11 | 6 | 7 | 2 | 9 | |
| **T-stage at initial staging** | | | | | | |
| T0 | 0 | 0 | 0 | 0 | 0 | |
| T1 | 0 | 0 | 0 | 0 | 0 | |
| T2 | 3 | 0 | 2 | 0 | 8 | |
| T3 | 28 | 1 | 11 | 5 | 38 | |
| T4 | 4 | 0 | 1 | 0 | 0 | |
| Not specified | 2 | 19 | 1 | 6 | 0 | |
| **N-stage at initial staging** | | | | | | |
| N− | 2 | 2 | 0 | 1 | 25 | |
| N+ | 35 | 18 | 15 | 10 | 21 | |
| **Mesorectal fascia (MRF)** | | | | | | 0.06 |
| Minimal distance to MRF (mm) | 12 ± 9 | 2 ± 5 | 3 ± 3 | | | |
| MRF involvement | 2 | 0 | 0 | 2 | 7 | |
| Not specified | 13 | 20 | 0 | 9 | 0 | |
| **Tumor location** | | | | | | |
| Lower third | 18 | 5 | 5 | 4 | 10 | |
| Middle third | 13 | 9 | 10 | 5 | 25 | |
| Upper third | 0 | 0 | 0 | 0 | 11 | |
| Not specified | 6 | 6 | 0 | 2 | 0 | |
| **T-stage at re-staging** | | | | | | |
| T0 | 1 | 0 | 3 | 0 | 0 | |
| T1 | 2 | 0 | 1 | 0 | 2 | |
| T2 | 13 | 0 | 1 | 0 | 27 | |
| T3 | 17 | 0 | 10 | 0 | 17 | |
| T4 | 4 | 0 | 0 | 1 | 0 | |
| Not specified | 0 | 20 | 0 | 10 | 0 | |
| **N-stage at re-staging** | | | | | | 0.68 |
| N− | 17 | 0 | 2 | 0 | 39 | |
| N+ | 20 | 0 | 13 | 2 | 5 | |
| Not specified | 0 | 20 | 0 | 9 | 2 | |
| **Pathological response** | | | | | | 0.15 |
| pCR | 7 | 9 | 1 | 2 | 5 | |
| Non-pCR | 30 | 11 | 14 | 9 | 41 | |
| **Time interval (days, mean ± std)** | | | | | | |
| Initial staging to surgery | 146 ± 12 | 146 ± 11 | 142 ± 8 | 177 ± 35 | 123 ± 20 | |
| Post-nCRT MRI to surgery | 13 ± 10 | 7 ± 3 | 8 ± 3 | 32 ± 15 | 29 ± 14 | |
Unless otherwise indicated, data are numbers of patients. The significance of differences between the training/validation and test cohorts was calculated using the Mann–Whitney U-test; a p-value below 0.05 was considered significant.
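The cohort comparisons above rely on the Mann–Whitney U-test with a 0.05 threshold. A minimal sketch of such a comparison using `scipy.stats.mannwhitneyu`; the per-patient values below are invented placeholders, not the study's data.

```python
from scipy.stats import mannwhitneyu

# Invented placeholder values (NOT the study's data): per-patient ages in
# the training/validation cohort vs. the external test cohort.
train_val_ages = [60, 58, 71, 49, 63, 66, 55, 62, 59, 70]
test_ages = [64, 67, 52, 75, 61, 58, 69, 60]

stat, p = mannwhitneyu(train_val_ages, test_ages, alternative="two-sided")
significant = p < 0.05  # the threshold used in the table above
print(f"U = {stat:.1f}, p = {p:.3f}, significant: {bool(significant)}")
```

A rank-based test is a sensible default here because several of the tabulated characteristics (staging categories, day counts) are ordinal or non-normally distributed.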
Overview of the different scanners used for imaging the patients at each center.

| Center | Vendor | Model | Field Strength (T) | Patients before nCRT | Patients after nCRT |
|---|---|---|---|---|---|
| 1 | Siemens | Prisma_fit | 3.0 | 6 | 18 |
| 1 | Siemens | Avanto | 1.5 | 14 | 0 |
| 1 | Siemens | Avanto_fit | 1.5 | 2 | 14 |
| 1 | Siemens | SymphonyTim | 1.5 | 12 | 1 |
| 1 | Siemens | Aera | 1.5 | 1 | 3 |
| 1 | Siemens | Espree | 1.5 | 0 | 1 |
| 1 | Philips | Ingenia | 1.5 | 1 | 0 |
| 1 | Siemens | Spectra | 3.0 | 1 | 0 |
| 2 | Siemens | Skyra | 3.0 | 17 | 10 |
| 2 | Siemens | Avanto | 1.5 | 3 | 10 |
| 3 | Siemens | Prisma_fit | 3.0 | 11 | 9 |
| 3 | Siemens | Skyra | 3.0 | 4 | 6 |
| 4a | Siemens | Skyra | 3.0 | 4 | 8 |
| 4a | Siemens | TrioTim | 3.0 | 6 | 3 |
| 4a | Siemens | Avanto | 1.5 | 1 | 0 |
| 4b | Siemens | TrioTim | 1.5 | 46 | 46 |
Differences between the training and test datasets.

| Parameter | Training T2w | Test T2w | Training DWI | Test DWI |
|---|---|---|---|---|
| Number of averages | 3.3 ± 0.5 | 3.1 ± 0.2 | 5.1 ± 1.1 | 5.0 ± 0.0 |
| Repetition time (ms) | 4994.9 ± 1913.5 | 3971.5 ± 708.1 | 6463.1 ± 1539.1 | 4121.9 ± 562.7 |
| Pixel bandwidth (Hz/pixel) | 215.9 ± 48.9 | 201.6 ± 8.8 | 1792.5 ± 351.6 | 1735.0 ± 8.3 |
| Flip angle (°) | 137.0 ± 16.5 | 148.7 ± 5.6 | 90.0 ± 0.0 | 90.0 ± 0.0 |
| Echo time (ms) | 97.8 ± 14.5 | 101.9 ± 4.7 | 67.2 ± 13.4 | 73.2 ± 2.2 |
| Slice thickness (mm) | 2.5 ± 0.7 | 3.0 ± 0.0 | 2.5 ± 0.7 | 3.0 ± 0.0 |
| In-plane pixel spacing (mm) | 0.6 ± 0.2 | 0.6 ± 0.0 | 1.6 ± 0.4 | 2.0 ± 0.1 |
Each column shows the mean and standard deviation of the different acquisition parameters of the T2w and DW sequences for the training and test cohort, respectively. Only the flip angle (90°) and pixel bandwidth of DWI were not significantly different between cohorts.
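Because slice thickness and in-plane pixel spacing differ between cohorts, harmonization typically starts by resampling every volume to a common voxel grid. The paper's exact pre-processing pipeline is not given in this record; the following is a minimal resampling sketch using `scipy.ndimage.zoom`, with an arbitrarily chosen target spacing for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_spacing(volume, spacing, target_spacing, order=1):
    """Resample a 3-D volume (z, y, x) from its native voxel spacing to a
    common target spacing via spline interpolation (order=1 -> linear)."""
    factors = [src / dst for src, dst in zip(spacing, target_spacing)]
    return zoom(volume, factors, order=order)

# Hypothetical T2w stack: 20 slices of 3.0 mm with 0.6 mm in-plane spacing,
# resampled to a (likewise hypothetical) common 2.5 mm slice thickness.
vol = np.zeros((20, 128, 128), dtype=np.float32)
res = resample_to_spacing(vol, spacing=(3.0, 0.6, 0.6),
                          target_spacing=(2.5, 0.6, 0.6))
print(res.shape)  # (24, 128, 128)
```

In practice this spatial step would be combined with intensity normalization (e.g., z-scoring per volume), since field strength and bandwidth differences also shift the intensity distributions.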
Figure 3. Receiver operating characteristic (ROC) curves of all obtained models (i.e., folds) and input-data combinations for the training (a,b) and external validation (c,d) datasets. (a,b) ROC curves for the training cohort using only T2w images (a) vs. T2w and DW images (b) as input. (c,d) Corresponding results for the external validation dataset using only T2w images (c) vs. T2w and DW images (d) as input. In each panel, the bold brown curve depicts the average over the five folds.
Figure 4. The area under the receiver operating characteristic curve (AUC) calculated for the different ROC curves depicted in Figure 3, shown for the training (red) and test (blue) cohorts. For the training cohort, classification with only T2w images as input performed significantly better than with T2w and DW images.
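The bold mean curves in Figure 3 and the fold-wise AUCs in Figure 4 correspond to standard cross-validation ROC averaging: interpolate each fold's ROC onto a common false-positive-rate grid, average vertically, and average the per-fold AUCs. A sketch of that procedure (assumed, not the authors' implementation):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

def mean_roc(folds, grid_size=101):
    """Vertically average per-fold ROC curves on a common FPR grid and
    average the per-fold AUCs; `folds` is a list of (y_true, y_score)."""
    mean_fpr = np.linspace(0.0, 1.0, grid_size)
    tprs, aucs = [], []
    for y_true, y_score in folds:
        fpr, tpr, _ = roc_curve(y_true, y_score)
        interp_tpr = np.interp(mean_fpr, fpr, tpr)
        interp_tpr[0] = 0.0  # anchor the averaged curve at the origin
        tprs.append(interp_tpr)
        aucs.append(auc(fpr, tpr))
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    return mean_fpr, mean_tpr, float(np.mean(aucs))
```

With five folds, `mean_roc` would yield the single bold curve drawn in each panel of Figure 3 and the mean AUC summarized in Figure 4.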
Figure 5. Examples of correctly and incorrectly classified patients. Each row shows the input images of one patient before and after nCRT, overlaid with the segmentation mask. Note the slight misalignment of the segmentation mask with the tumor region on the DW images, which may contribute to misclassification.