| Literature DB >> 29556480 |
Design and Selection of Machine Learning Methods Using Radiomics and Dosiomics for Normal Tissue Complication Probability Modeling of Xerostomia
Hubert S. Gabryś, Florian Buettner, Florian Sterzing, Henrik Hauswald, Mark Bangert.
Abstract
PURPOSE: The purpose of this study is to investigate whether machine learning with dosiomic, radiomic, and demographic features allows for a more precise xerostomia risk assessment than normal tissue complication probability (NTCP) models based on the mean radiation dose to the parotid glands.
Keywords: IMRT; NTCP; dosiomics; head and neck; machine learning; radiomics; radiotherapy; xerostomia
Year: 2018 PMID: 29556480 PMCID: PMC5844945 DOI: 10.3389/fonc.2018.00035
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 6.244
Patients and tumor characteristics.
| | All | Grade 0 (0–6 months) | Grade 1 (0–6 months) | Grade 2 (0–6 months) | Grade 0 (6–15 months) | Grade 1 (6–15 months) | Grade 2 (6–15 months) | Grade 0 (15–24 months) | Grade 1 (15–24 months) | Grade 2 (15–24 months) |
|---|---|---|---|---|---|---|---|---|---|---|
| Total patients | 153 | 17 | 87 | 30 | 19 | 99 | 13 | 15 | 53 | 9 |
| Age | ||||||||||
| Median | 61 | 60 | 60 | 62 | 60 | 61 | 61 | 61 | 61 | 61 |
| Q1–Q3 | 55–66 | 54–66 | 54–64 | 53–69 | 57–63 | 53–66 | 54–68 | 55–68 | 52–66 | 54–68 |
| Range | 29–82 | 44–78 | 29–82 | 43–80 | 49–75 | 29–82 | 43–74 | 47–80 | 39–78 | 41–80 |
| Sex | ||||||||||
| Female | 37 | 5 | 19 | 7 | 6 | 24 | 2 | 2 | 9 | 4 |
| Male | 116 | 12 | 68 | 23 | 13 | 75 | 11 | 13 | 44 | 5 |
| Tumor site | ||||||||||
| Hypopharynx/larynx | 37 | 7 | 20 | 7 | 7 | 20 | 2 | 3 | 15 | 0 |
| Nasopharynx | 12 | 0 | 8 | 2 | 2 | 8 | 1 | 0 | 5 | 0 |
| Oropharynx | 99 | 9 | 57 | 20 | 10 | 69 | 9 | 11 | 32 | 9 |
| Other | 5 | 1 | 2 | 1 | 0 | 2 | 1 | 1 | 1 | 0 |
| Radiation modality | ||||||||||
| IMRT | 37 | 2 | 25 | 5 | 1 | 29 | 2 | 2 | 18 | 1 |
| Tomotherapy | 116 | 15 | 62 | 25 | 18 | 70 | 11 | 13 | 35 | 8 |
| Ipsi parotid dose (Gy) | ||||||||||
| Median | 24.3 | 22.9 | 25.0 | 23.0 | 19.5 | 24.8 | 25.9 | 22.9 | 23.8 | 24.5 |
| Q1–Q3 | 20.6–27.6 | 18.5–24.6 | 21.4–29.0 | 21.4–25.4 | 16.8–24.3 | 21.8–28.7 | 21.8–27.2 | 18.5–31.5 | 20.8–26.4 | 21.6–26.2 |
| Range | 0.4–63.4 | 0.4–36.0 | 7.4–61.4 | 4.6–59.0 | 0.4–32.9 | 4.6–61.4 | 17.3–63.4 | 0.4–51.4 | 4.6–46.0 | 17.3–63.4 |
| Contra parotid dose (Gy) | ||||||||||
| Median | 19.9 | 19.4 | 20.3 | 19.6 | 15.6 | 20.5 | 20.4 | 12.7 | 19.7 | 20.1 |
| Q1–Q3 | 15.4–23.1 | 13.1–21.8 | 15.2–23.8 | 16.5–22.0 | 10.3–20.7 | 16.3–23.8 | 19.8–23.1 | 5.2–17.9 | 16.3–23.7 | 16.4–22.3 |
| Range | 0.3–30.9 | 0.3–24.9 | 4.1–28.6 | 4.2–26.2 | 0.3–27.9 | 4.1–30.9 | 15.1–26.2 | 0.3–27.9 | 4.1–27.2 | 15.1–26.0 |
The total number of patients differs among the groups because of varying follow-up availability.
Figure 1. Frequency of follow-up report collection.
Feature sets before and after the removal of highly correlated pairs (Kendall’s |τ| > 0.5).
| Feature group | Initial feature set | Final feature set |
|---|---|---|
| Demographics | Age, sex | Age, sex |
| Parotid shape | Volume, area, sphericity, eccentricity, compactness, λ | Volume, sphericity, eccentricity |
| Dose–volume histogram | Mean, spread, skewness, D2, D98, D10, D20, D30, D40, D50, D60, D70, D80, D90, V10, V15, V20, V25, V30, V35, V40, V45, entropy, uniformity | Mean, spread, skewness |
| Subvolume mean dose | ||
| Spatial dose gradient | Gradient_x, gradient_y, gradient_z | Gradient_x, gradient_y, gradient_z |
| Spatial dose spread | ||
| Spatial dose correlation | ||
| Spatial dose skewness | ||
| Spatial dose coskewness |
Feature definitions are provided in the Appendix.
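The correlation filter behind the table above can be reproduced in a few lines of pandas: compute pairwise Kendall correlations and drop one feature from every pair with |τ| > 0.5. A minimal sketch, assuming a hypothetical feature matrix (column names and the induced correlation are illustrative, not the study's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=["mean", "spread", "skewness", "D50", "V30"])
df["D50"] = 0.9 * df["mean"] + rng.normal(scale=0.1, size=100)  # induce one correlated pair

tau = df.corr(method="kendall").abs()                           # pairwise |tau|
upper = tau.where(np.triu(np.ones(tau.shape, dtype=bool), k=1)) # upper triangle only
to_drop = [c for c in upper.columns if (upper[c] > 0.5).any()]
reduced = df.drop(columns=to_drop)                              # drops "D50", keeps "mean"
```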
Figure 2. The workflow of a multivariate five-step model building comprising, in this order, feature-group selection, feature scaling, sampling, feature selection, and classification.
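A minimal sketch of such a five-step pipeline, assuming scikit-learn and imbalanced-learn; the particular combination here (standard scaling, SMOTE, univariate F-score selection, SVM) is one illustrative pick from the grids below, not the study's final model:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Feature-group selection amounts to picking a column subset up front;
# the remaining four steps chain in the pipeline (the sampler is applied
# to the training folds only, never at prediction time).
X, y = make_classification(n_samples=150, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
pipe = Pipeline([
    ("scale", StandardScaler()),               # feature scaling
    ("sample", SMOTE(random_state=0)),         # sampling
    ("select", SelectKBest(f_classif, k=4)),   # feature selection (UFS-F)
    ("clf", SVC(probability=True)),            # classification
]).fit(X, y)
risk = pipe.predict_proba(X)[:, 1]             # predicted complication risk
```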
Predictive performance of the mean-dose models and of the morphological model proposed by Buettner et al. (4), that is, logistic regression with four morphological features.
| End point | Model | AUC |
|---|---|---|
| Early | Mean_i | 0.58 (0.56–0.60) |
| | Mean_c | 0.42 (0.41–0.44) |
| | Mean_b | 0.50 (0.48–0.53) |
| | Mean_i, mean_c | 0.49 (0.48–0.51) |
| | Morphological | 0.42 (0.40–0.44) |
| Late | Mean_i | 0.48 (0.44–0.51) |
| | Mean_c | 0.58 (0.55–0.61) |
| | Mean_b | 0.55 (0.52–0.58) |
| | Mean_i, mean_c | 0.54 (0.51–0.57) |
| | Morphological | 0.59 (0.56–0.62) |
| Long-term | Mean_i | 0.40 (0.37–0.44) |
| | Mean_c | 0.58 (0.55–0.61) |
| | Mean_b | 0.56 (0.52–0.60) |
| | Mean_i, mean_c | 0.47 (0.44–0.50) |
| | Morphological | 0.64 (0.60–0.67) |
| Longitudinal | Mean_i | 0.51 (0.45–0.56) |
| | Mean_c | 0.57 (0.51–0.62) |
| | Mean_b | 0.50 (0.44–0.55) |
| | Mean_i, mean_c | 0.52 (0.46–0.58) |
| | Morphological | 0.55 (0.49–0.60) |
i, ipsilateral gland; c, contralateral gland; b, both glands.
Figure 3. Predictive power of individual features in the time-specific models measured with the area under the receiver operating characteristic curve (AUC). The left-hand vertical axis lists the features; the right-hand vertical axis lists the feature groups. The AUCs were calculated from the corresponding Mann–Whitney U statistic. Bars marked with * are significant at the false discovery rate (FDR) ≤ 0.05.
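The U-to-AUC relation used in the caption is AUC = U / (n₁ · n₀), where n₁ and n₀ are the positive and negative class sizes. A small sketch with toy data, assuming scipy:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)               # toy labels: 1 = xerostomia, 0 = none
x = y + rng.normal(scale=2.0, size=100)   # toy feature with weak separation

u, _ = mannwhitneyu(x[y == 1], x[y == 0], alternative="two-sided")
auc = u / ((y == 1).sum() * (y == 0).sum())   # AUC = U / (n1 * n0)
print(auc)  # matches sklearn.metrics.roc_auc_score(y, x) for tie-free data
```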
Figure 4. Distribution of the mean dose and the absolute right–left dose gradient in our patient cohort.
Figure 5. A comparison of classification, feature selection, and sampling algorithms in terms of their predictive performance in model tuning. All heat maps in a given column belong to a single end point, whereas all heat maps in a given row correspond to a single classifier. In each heat map, rows represent feature selection algorithms and columns correspond to sampling methods. The color maps are normalized per end point. The color bar ticks correspond to the worst, average, and best model performance.
Figure 6. Heat maps showing the proportion of times the algorithm on the vertical axis outperformed the algorithm on the horizontal axis in terms of the best AUC in model tuning. For example, support vector machines (SVM) performed better than extra-trees (ET) in 73% of the time-specific models.
Figure 7. A comparison of classification, feature selection, and sampling methods against one another with the Nemenyi test. Lower ranks correspond to better performance; rank 1 is the best. Algorithms whose ranks differ by less than the critical difference (CD) are not significantly different at the 0.05 significance level and are connected by black bars.
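For reference, the CD in this caption follows Demšar's formulation of the Nemenyi test, CD = q_α · sqrt(k(k+1)/(6N)). A short sketch with illustrative k and N (the q values are taken from Demšar, 2006):

```python
import math

# q_alpha for the two-tailed Nemenyi test at alpha = 0.05 (Demšar, 2006);
# k is the number of compared algorithms, n the number of rankings.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949}

def critical_difference(k: int, n: int) -> float:
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * n))."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n))

print(critical_difference(7, 40))  # e.g., 7 classifiers compared over 40 rankings
```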
Expected generalization performance of selected models evaluated by nested cross-validation.
| End point | Classifier | Feature selection | Sampling | AUC tuning | AUC testing |
|---|---|---|---|---|---|
| Early | LR-L1 | RFE-ET | NCL | 0.62 (0.60–0.64) | 0.56 (0.53–0.60) |
| | LR-L2 | RFE-LR | NCL | 0.62 (0.60–0.64) | 0.46 (0.42–0.49) |
| | LR-EN | MB-ET | NCL | 0.62 (0.60–0.64) | 0.54 (0.50–0.57) |
| | kNN | UFS-F | SMOTE + ENN | 0.68 (0.66–0.70) | 0.65 (0.62–0.68)a |
| | SVM | UFS-F | None | 0.70 (0.68–0.72) | 0.57 (0.53–0.61) |
| | ET | MB-LR | NCL | 0.63 (0.61–0.65) | 0.44 (0.41–0.47) |
| | GTB | UFS-F | None | 0.66 (0.64–0.68) | 0.55 (0.51–0.59) |
| Late | LR-L1 | RFE-LR | NCL | 0.78 (0.75–0.80) | 0.63 (0.56–0.69) |
| | LR-L2 | RFE-LR | NCL | 0.76 (0.73–0.78) | 0.60 (0.53–0.66) |
| | LR-EN | MB-LR | SMOTE + TL | 0.73 (0.70–0.76) | 0.56 (0.51–0.62) |
| | kNN | MB-LR | NCL | 0.78 (0.76–0.80) | 0.62 (0.57–0.67) |
| | SVM | UFS-F | TL | 0.80 (0.77–0.82) | 0.52 (0.46–0.58) |
| | ET | RFE-ET | NCL | 0.78 (0.75–0.80) | 0.55 (0.50–0.61) |
| | GTB | MB-LR | OSS | 0.77 (0.75–0.79) | 0.65 (0.59–0.70)a |
| Long-term | LR-L1 | MB-LR | ROS | 0.95 (0.94–0.96) | 0.86 (0.80–0.90) |
| | LR-L2 | MB-LR | None | 0.96 (0.95–0.97) | 0.86 (0.81–0.90) |
| | LR-EN | MB-LR | SMOTE + ENN | 0.92 (0.90–0.93) | 0.83 (0.76–0.88) |
| | kNN | UFS-MI | TL | 0.88 (0.86–0.90) | 0.74 (0.68–0.80) |
| | SVM | RFE-LR | ENN | 0.94 (0.92–0.96) | 0.79 (0.73–0.85) |
| | ET | MB-LR | ENN | 0.93 (0.92–0.94) | 0.88 (0.84–0.91)a |
| | GTB | UFS-F | ROS | 0.89 (0.86–0.91) | 0.77 (0.71–0.83) |
| Longitudinal | LR-L1 | UFS-MI | None | 0.63 (0.57–0.68) | 0.52 (0.41–0.61) |
| | LR-L2 | RFE-LR | NCL | 0.60 (0.55–0.66) | 0.39 (0.29–0.48) |
| | LR-EN | UFS-MI | TL | 0.62 (0.57–0.68) | 0.52 (0.42–0.60) |
| | kNN | UFS-MI | NCL | 0.65 (0.61–0.69) | 0.58 (0.49–0.66) |
| | SVM | UFS-MI | OSS | 0.66 (0.60–0.71) | 0.57 (0.46–0.66) |
| | ET | UFS-MI | TL | 0.66 (0.61–0.71) | 0.51 (0.40–0.60) |
| | GTB | RFE-LR | ROS | 0.68 (0.62–0.72) | 0.63 (0.52–0.71)a |
a, the highest testing AUC for a given end point.
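A minimal sketch of the nested cross-validation scheme behind this table, assuming scikit-learn: the inner loop tunes hyperparameters ("AUC tuning"), the outer loop estimates generalization ("AUC testing"). The L2 logistic regression and its C grid mirror the tuning ranges listed further below; the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=150, n_features=20, random_state=0)

# Inner loop: hyperparameter tuning ("AUC tuning").
inner = GridSearchCV(
    LogisticRegression(penalty="l2", solver="liblinear"),
    param_grid={"C": 2.0 ** np.arange(-5, 11)},  # 2^-5 ... 2^10
    scoring="roc_auc",
    cv=5,
)
# Outer loop: generalization estimate on held-out folds ("AUC testing").
outer_auc = cross_val_score(inner, X, y, scoring="roc_auc", cv=5)
print(outer_auc.mean())
```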
Figure 8. Features underlying the multivariate models of long-term xerostomia. i, ipsilateral gland; c, contralateral gland.
| Abbreviation | Meaning |
|---|---|
| LR-L1 | Logistic regression with L1 penalty |
| LR-L2 | Logistic regression with L2 penalty |
| LR-EN | Logistic regression with elastic net penalty |
| kNN | k-Nearest neighbors |
| SVM | Support vector machine |
| ET | Extra-trees |
| GTB | Gradient tree boosting |
| UFS-F | Univariate feature selection by F-score |
| UFS-MI | Univariate feature selection by mutual information |
| RFE-LR | Recursive feature elimination by logistic regression |
| RFE-ET | Recursive feature elimination by extra-trees |
| MB-LR | Model-based feature selection by logistic regression |
| MB-ET | Model-based feature selection by extra-trees |
| ROS | Random oversampling |
| SMOTE | Synthetic minority oversampling |
| ADASYN | Adaptive synthetic sampling |
| OSS | One-sided selection |
| TL | Tomek links |
| ENN | Wilson's edited nearest neighbor rule |
| NCL | Neighborhood cleaning rule |
| SMOTE + ENN | SMOTE followed by ENN |
| SMOTE + TL | SMOTE followed by TL |
Hyperparameters used to tune the sampling algorithms.
| Algorithm | Hyperparameters | Values |
|---|---|---|
| ROS | – | – |
| SMOTE | k_neighbors | {3,4,5} |
| | m_neighbors | {7,8,9} |
| | kind | {“regular,” “borderline1,” “borderline2”} |
| ADASYN | n_neighbors | {3,5,8} |
| OSS | – | – |
| TL | – | – |
| ENN | n_neighbors | {2,3,5} |
| | kind_sel | {“all,” “mode”} |
| NCL | n_neighbors | {2,3,5} |
| SMOTE + TL | – | – |
| SMOTE + ENN | – | – |
Hyperparameters not listed in this table assumed the default values of the imbalanced-learn package.
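The samplers above map directly onto imbalanced-learn classes. A minimal usage sketch on toy data, assuming a recent imbalanced-learn with fit_resample; the single hyperparameter values are picked from the grids in the table:

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NeighbourhoodCleaningRule
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

X_os, y_os = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)    # oversampling
X_us, y_us = NeighbourhoodCleaningRule(n_neighbors=3).fit_resample(X, y)  # NCL
X_cb, y_cb = SMOTEENN(random_state=0).fit_resample(X, y)                # SMOTE + ENN
print(Counter(y), Counter(y_os), Counter(y_us), Counter(y_cb))          # class balances
```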
Hyperparameters used to tune the feature selection algorithms.
| Algorithm | Hyperparameters | Values |
|---|---|---|
| UFS-F | Number of features to select. | {2,3,4,5,6} |
| UFS-MI | Number of features to select. | {2,3,4,5,6} |
| RFE-LR | Number of features to select. | {2,3,4,5,6} |
| | Number of features to remove at each iteration. | 1 |
| | Class weights; equal or inversely proportional to class frequencies. | {None, “balanced”} |
| | Inverse regularization strength C. | {2^−5, 2^−4.985, 2^−4.97, …, 2^10} |
| | Regularization penalty. | “l2” |
| RFE-ET | Number of features to select. | {2,3,4,5,6} |
| | Fraction of features to remove at each iteration. | 0.5 |
| | Class weights; equal or inversely proportional to class frequencies. | {None, “balanced,” “balanced_subsample”} |
| | Number of decision trees. | [90,140] |
| MB-LR | Number of features to select. | {2,3,4,5,6} |
| | Class weights; equal or inversely proportional to class frequencies. | {None, “balanced”} |
| | Inverse regularization strength C. | {2^−5, 2^−4.985, 2^−4.97, …, 2^10} |
| | Regularization penalty. | {“l1,” “l2”} |
| MB-ET | Number of features to select. | {2,3,4,5,6} |
| | Class weights; equal or inversely proportional to class frequencies. | {None, “balanced,” “balanced_subsample”} |
| | Number of decision trees. | [90,140] |
Hyperparameters not listed in this table assumed the default values of the scikit-learn package.
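A minimal sketch of the three feature-selection families from the table (univariate, recursive elimination, model-based) via scikit-learn, with single illustrative values drawn from the grids above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=20, random_state=0)

ufs_f = SelectKBest(f_classif, k=4).fit(X, y)                       # UFS-F
rfe_lr = RFE(LogisticRegression(C=1.0, solver="liblinear"),
             n_features_to_select=4, step=1).fit(X, y)              # RFE-LR
mb_et = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0)).fit(X, y)  # MB-ET
print(ufs_f.get_support(), rfe_lr.get_support(), mb_et.get_support())  # selected columns
```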
Hyperparameters used to tune the classification algorithms.
| Algorithm | Hyperparameters | Values |
|---|---|---|
| LR-L1 | Class weights. | {None, “balanced”} |
| | Inverse regularization strength C. | {2^−5, 2^−4.985, 2^−4.97, …, 2^10} |
| LR-L2 | Class weights. | {None, “balanced”} |
| | Inverse regularization strength C. | {2^−5, 2^−4.985, 2^−4.97, …, 2^10} |
| LR-EN | Class weights. | {None, “balanced”} |
| | Regularization strength. | {2^−10, 2^−9.985, 2^−9.97, …, 2^5} |
| | Elastic net mixing ratio (L1 ratio). | [0,1] |
| kNN | Number of neighbors. | {1,2,3,…,9} |
| | Minkowski distance power. | {1,2,∞} |
| SVM | Class weights. | {None, “balanced”} |
| | Regularization strength C. | {2^−5, 2^−4.985, 2^−4.97, …, 2^10} |
| | Kernel coefficient γ. | {2^−15, 2^−14.982, 2^−14.964, …, 2^3} |
| ET | Number of decision trees. | [90, 230] |
| | Class weights. | {None, “balanced”} |
| | Split criterion. | {“gini,” “entropy”} |
| | Fraction of features considered per split. | {0.05, 0.10, 0.15,…,1} |
| | Minimum samples to split a node. | {2,3,4,…,20} |
| | Minimum samples per leaf. | {1,2,3,…,20} |
| GTB | Number of decision trees. | [200, 2000] |
| | Learning rate. | {2^−7, 2^−6.994, 2^−6.988, …, 2^−1} |
| | Maximum tree depth. | {1,2,3,…,6} |
| | | {0.05,0.1,0.3,0.5,0.7,0.9,1} |
| | | {1,3,5,7} |
| | | {0.6,0.65,0.70,…,1} |
| | | [0,1] |
| | | [0,1] |
Hyperparameters not listed in this table assumed the default values of the scikit-learn package.
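As one concrete example, the SVM rows above correspond to a log2-spaced grid search. A sketch assuming scikit-learn, with coarser steps (integer exponents) than the paper's grid for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, random_state=0)

grid = GridSearchCV(
    SVC(),
    param_grid={
        "class_weight": [None, "balanced"],
        "C": 2.0 ** np.arange(-5, 11),       # 2^-5 ... 2^10
        "gamma": 2.0 ** np.arange(-15, 4),   # 2^-15 ... 2^3
    },
    scoring="roc_auc",
    cv=5,
).fit(X, y)
print(grid.best_params_, grid.best_score_)
```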