Timo M. Deist, Frank J. W. M. Dankers, Gilmer Valdes, Robin Wijsman, I-Chow Hsu, Cary Oberije, Tim Lustberg, Johan van Soest, Frank Hoebers, Arthur Jochems, Issam El Naqa, Leonard Wee, Olivier Morin, David R. Raleigh, Wouter Bots, Johannes H. Kaanders, José Belderbos, Margriet Kwint, Timothy Solberg, René Monshouwer, Johan Bussink, Andre Dekker, Philippe Lambin.
Abstract
PURPOSE: Machine learning classification algorithms (classifiers) for prediction of treatment response are becoming increasingly popular in the radiotherapy literature. The general machine learning literature provides evidence in favor of some classifier families (random forest, support vector machine, gradient boosting) in terms of classification performance. The purpose of this study is to compare such classifiers specifically on (chemo)radiotherapy datasets and to estimate their average discriminative performance for radiation treatment outcome prediction.
Keywords: classification; machine learning; outcome prediction; predictive modeling; radiotherapy
Year: 2018 PMID: 29763967 PMCID: PMC6095141 DOI: 10.1002/mp.12967
Source DB: PubMed Journal: Med Phys ISSN: 0094-2405 Impact factor: 4.071
Dataset characteristics. The number of features is determined before preprocessing.
| Dataset | Disease | Outcome | Prevalence (%) | Patients | Features | Feature types | Source |
|---|---|---|---|---|---|---|---|
| Belderbos et al. (2005) | Non-small-cell lung cancer | Grade ≥ 2 acute esophagitis | 27 | 156 | 22 | Clinical, dosimetric, blood | Private |
| Bots et al. (2017) | Head and neck cancer | 2-yr overall survival | 42 | 137 | 10 | Clinical, dosimetric | Private |
| Carvalho et al. (2016) | Non-small-cell lung cancer | 2-yr overall survival | 40 | 363 | 18 | Clinical, dosimetric, blood | Public |
| Janssens et al. (2012) | Laryngeal cancer | 5-yr regional control | 89 | 179 | 48 | Clinical, dosimetric, blood | Private |
| Jochems et al. (2016) | Non-small-cell lung cancer | 2-yr overall survival | 36 | 327 | 9 | Clinical, dosimetric | Private |
| Kwint et al. (2012) | Non-small-cell lung cancer | Grade ≥ 2 acute esophagitis | 61 | 139 | 83 | Clinical, dosimetric, blood | Private |
| Lustberg et al. (2016) | Laryngeal cancer | 2-yr overall survival | 83 | 922 | 7 | Clinical, dosimetric, blood | Private |
| Morin et al. (forthcoming) | Meningioma | Local failure | 36 | 257 | 18 | Clinical | Private |
| Oberije et al. (2015) | Non-small-cell lung cancer | 2-yr overall survival | 17 | 548 | 20 | Clinical, dosimetric | Public |
| Olling et al. (2017) | Small and non-small-cell lung cancer | Odynophagia prescription medication | 67 | 131 | 47 | Clinical, dosimetric | Private |
| Wijsman et al. (2015) | Non-small-cell lung cancer | Grade ≥ 2 acute esophagitis | 36 | 149 | 11 | Clinical, dosimetric, blood | Private |
| Wijsman et al. (2017) | Non-small-cell lung cancer | Grade ≥ 3 radiation pneumonitis | 14 | 188 | 18 | Clinical, dosimetric, blood | Private |
Classifier characteristics.
| Classifier | Requires dummy coding | Tuned hyperparameters |
|---|---|---|
| Elastic net logistic regression | Yes | |
| Random forest | No | |
| Single-hidden-layer neural network | No | |
| Support vector machine with radial basis function (RBF) kernel | Yes | |
| LogitBoost | Yes | |
| Decision tree | No | |
FIG. 1. Experimental design: each dataset is split into five stratified outer folds (step 1). For each of the folds, the data are preprocessed (imputation, dummy coding, deleting zero-variance features, rescaling) (step 2). The hyperparameters are tuned on the training set via a fivefold inner CV (steps 3–5). Based on the selected hyperparameters, a model is learned on the training set (step 6) and applied to the test set (step 7). Performance metrics are calculated on the test set (step 8) and stored for all outer folds. This process is repeated 100 times for each classifier. Randomization seeds are stable across classifiers within a repetition to allow pairwise comparison.
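The nested cross-validation design in Fig. 1 can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data, not the authors' code: the dataset, the elastic net classifier, and the hyperparameter grid are assumptions chosen for brevity.

```python
# Sketch of the Fig. 1 design: stratified outer 5-fold CV, preprocessing fit
# on the training fold only, inner 5-fold CV for hyperparameter tuning.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=150, n_features=20, random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # step 1
aucs = []
for train_idx, test_idx in outer.split(X, y):
    # step 2: preprocessing lives inside the pipeline, so it is
    # fit on the training data of each fold only (no leakage)
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                                   l1_ratio=0.5, max_iter=5000)),
    ])
    # steps 3-5: tune hyperparameters via an inner fivefold CV
    grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0]},
                        scoring="roc_auc",
                        cv=StratifiedKFold(5, shuffle=True, random_state=0))
    grid.fit(X[train_idx], y[train_idx])           # step 6: refit on full training set
    p = grid.predict_proba(X[test_idx])[:, 1]      # step 7: apply to test fold
    aucs.append(roc_auc_score(y[test_idx], p))     # step 8: score the test fold

print(round(float(np.mean(aucs)), 3))
```

In the study this whole loop is repeated 100 times per classifier with repetition-stable random seeds, so that fold splits are identical across classifiers and their AUCs can be compared pairwise.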
FIG. 2. Box and scatterplot of the AUC rank (lower being better) per outer fivefold CV, aggregated over all datasets and repetitions (12 datasets × 100 repetitions = 1200 data points per classifier).
FIG. 3. Pairwise comparisons of each classifier pair (12 datasets × 100 repetitions = 1200 comparisons per pair). The numbers in the plot indicate how often classifier A (y-axis) achieved an AUC greater than classifier B (x-axis). The color indicates whether the increased AUCs by classifier A are statistically significant (violet), insignificant (light violet), or have not been tested (gray). The significance cutoff was set at the 0.05 level (one-sided Wilcoxon signed-rank test, Holm–Bonferroni correction for 15 tests).
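The significance test used in Fig. 3 can be sketched as below: a one-sided Wilcoxon signed-rank test on paired AUCs, with a Holm–Bonferroni step-down correction across the 15 classifier pairs. The AUC arrays here are synthetic stand-ins, not the paper's values.

```python
# Sketch of the Fig. 3 statistics: paired one-sided Wilcoxon test plus
# Holm-Bonferroni correction. AUC values are simulated for illustration.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
auc_a = rng.uniform(0.65, 0.80, size=1200)         # classifier A, paired folds
auc_b = auc_a - rng.normal(0.02, 0.01, size=1200)  # classifier B, slightly worse

# One-sided test: does classifier A achieve greater AUCs than B?
stat, p = wilcoxon(auc_a, auc_b, alternative="greater")

def holm_bonferroni(pvals, alpha=0.05):
    """Holm step-down: reject while p_(k) <= alpha / (m - k), then stop."""
    order = np.argsort(pvals)
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: all larger p-values are also not rejected
    return reject

# 15 pairwise tests as in the paper; the 14 other p-values are placeholders
decisions = holm_bonferroni(np.array([p] + [0.2] * 14))
print(bool(decisions[0]))
```

Holm–Bonferroni controls the family-wise error rate like plain Bonferroni but is uniformly more powerful, since only the smallest p-value faces the full alpha/m threshold.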
FIG. 4. The mean AUC for each pair of classifier and dataset (100 repetitions = 100 data points per pair).
FIG. 5. The mean rank derived from the AUC (100 repetitions = 100 data points per pair).
For each dataset, the AUC rank averaged over all repetitions when (a) randomly selecting a classifier (Random classifier), (b) preselecting the classifier with the average best AUC rank in all other datasets, that is, without any information about the current dataset (Preselected classifier), (c) selecting the classifier that yielded the highest AUC in the inner CV (Set-specific classifier). Improvements in average AUC and average AUC rank compared to (a) are reported. The average AUC improvements by preselection and set-specific selection were tested for statistical significance (P < 0.05, one-sided Wilcoxon signed-rank test) and found to be statistically significant (*). No other statistical tests besides the two aforementioned tests were conducted.
| Dataset | Random: mean rank | Preselected: name | Preselected: mean rank | Preselected: rank increase | Preselected: AUC increase | Set-specific: mean rank | Set-specific: rank increase | Set-specific: AUC increase |
|---|---|---|---|---|---|---|---|---|
| Set A | 3.59 | | 3.64 | −0.05 | 0.00 | 3.10 | 0.49 | 0.02 |
| Set B | 3.48 | | 2.92 | 0.56 | 0.02 | 3.31 | 0.17 | 0.01 |
| Set C | 3.50 | | 3.12 | 0.37 | 0.03 | 2.78 | 0.72 | 0.03 |
| Set D | 3.57 | | 2.60 | 0.97 | 0.04 | 3.31 | 0.26 | 0.02 |
| Set E | 3.53 | | 3.35 | 0.18 | 0.01 | 1.75 | 1.78 | 0.05 |
| Set F | 3.39 | | 1.89 | 1.50 | 0.04 | 2.58 | 0.81 | 0.03 |
| Set G | 3.47 | | 2.99 | 0.47 | 0.04 | 3.52 | −0.06 | 0.01 |
| Set H | 3.44 | | 3.81 | −0.37 | 0.00 | 1.70 | 1.74 | 0.05 |
| Set I | 3.45 | | 1.59 | 1.86 | 0.06 | 1.72 | 1.73 | 0.05 |
| Set J | 3.52 | | 4.18 | −0.66 | −0.02 | 3.41 | 0.11 | 0.00 |
| Set K | 3.50 | | 3.33 | 0.16 | 0.01 | 3.20 | 0.30 | 0.01 |
| Set L | 3.58 | | 3.50 | 0.08 | 0.01 | 3.66 | −0.08 | 0.00 |
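Selection strategies (a) and (b) from the table above can be sketched on a toy rank matrix. The numbers below are synthetic, not the paper's results; strategy (c), set-specific selection, would additionally need the inner-CV AUCs and is therefore omitted.

```python
# Sketch of classifier-selection strategies: (a) random pick vs.
# (b) leave-one-dataset-out preselection. Rank matrix is simulated.
import numpy as np

rng = np.random.default_rng(0)
# mean AUC rank per (dataset, classifier): 12 datasets x 6 classifiers
ranks = rng.uniform(1.0, 6.0, size=(12, 6))

# (a) random selection: the expected rank is the mean over classifiers
random_rank = ranks.mean(axis=1)

# (b) preselection: for each dataset, choose the classifier with the best
# (lowest) average rank across all *other* datasets, i.e. using no
# information about the current dataset
preselected_rank = np.empty(12)
for d in range(12):
    mean_other = np.delete(ranks, d, axis=0).mean(axis=0)
    preselected_rank[d] = ranks[d, np.argmin(mean_other)]

improvement = random_rank - preselected_rank  # positive = preselection helps
```

As the table shows, preselection helps on most datasets but can backfire on individual ones (negative rank increase, e.g. Sets A, H, and J), since the classifier that ranks best on the other eleven datasets need not rank well on the left-out one.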
Median performance metrics per classifier aggregated over repetitions and datasets (1200 data points each). Undefined (NaN) values are excluded when calculating the median.
| Classifier | AUC | Brier score | Accuracy | Cohen’s kappa | Calibration intercept error | Calibration slope error |
|---|---|---|---|---|---|---|
| | 0.72 | 0.17 | 0.72 | 0.10 | 0.12 | 0.37 |
| | 0.72 | 0.18 | 0.72 | 0.14 | 0.26 | 0.68 |
| | 0.71 | 0.21 | 0.69 | 0.11 | 0.36 | 0.96 |
| | 0.69 | 0.18 | 0.72 | 0.06 | 0.26 | 0.86 |
| | 0.66 | 0.23 | 0.68 | 0.18 | 0.22 | 0.60 |
| | 0.63 | 0.20 | 0.71 | 0.16 | 0.21 | 0.56 |
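The calibration columns in the table above can be illustrated with a logistic recalibration: regress the observed outcome on the logit of the predicted probability, where a perfectly calibrated model yields intercept 0 and slope 1, and report absolute deviations as the errors. This follows the common Cox recalibration definition, which is assumed (not verified) to match the paper's usage; the data below are simulated.

```python
# Sketch of calibration intercept/slope error and Brier score on
# simulated, well-calibrated predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def calibration_errors(y_true, p_pred):
    """Absolute deviation of the recalibration fit from intercept 0, slope 1."""
    logit = np.log(p_pred / (1.0 - p_pred)).reshape(-1, 1)
    # large C = near-unpenalized fit, so the recalibration is not shrunk
    recal = LogisticRegression(C=1e9, solver="lbfgs").fit(logit, y_true)
    intercept_error = abs(float(recal.intercept_[0]))
    slope_error = abs(float(recal.coef_[0, 0]) - 1.0)
    return intercept_error, slope_error

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 2000)
y = (rng.random(2000) < p).astype(int)  # outcomes drawn from p: well calibrated
ie, se = calibration_errors(y, p)
brier = brier_score_loss(y, p)
```

On these well-calibrated simulated predictions both errors come out near zero; the sizable slope errors in the table (up to 0.96) indicate that discrimination (AUC) and calibration can diverge substantially across classifiers.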