Mahmoud Aldraimli, Sarah Osman, Diana Grishchuck, Samuel Ingram, Robert Lyon, Anil Mistry, Jorge Oliveira, Robert Samuel, Leila E A Shelley, Daniele Soria, Miriam V Dwek, Miguel E Aguado-Barrera, David Azria, Jenny Chang-Claude, Alison Dunning, Alexandra Giraldo, Sheryl Green, Sara Gutiérrez-Enríquez, Carsten Herskind, Hans van Hulle, Maarten Lambrecht, Laura Lozza, Tiziana Rancati, Victoria Reyes, Barry S Rosenstein, Dirk de Ruysscher, Maria C de Santis, Petra Seibold, Elena Sperk, R Paul Symonds, Hilary Stobart, Begoña Taboada-Valadares, Christopher J Talbot, Vincent J L Vakaet, Ana Vega, Liv Veldeman, Marlon R Veldwijk, Adam Webb, Caroline Weltens, Catharine M West, Thierry J Chaussalet, Tim Rattay.
Abstract
Purpose: Some patients with breast cancer treated by surgery and radiation therapy experience clinically significant toxicity, which may adversely affect cosmesis and quality of life. There is a paucity of validated clinical prediction models for radiation toxicity. We used machine learning (ML) algorithms to develop and optimise a clinical prediction model for acute breast desquamation after whole breast external beam radiation therapy in the prospective multicenter REQUITE cohort study.

Methods and Materials: Using demographic and treatment-related features (m = 122) from patients (n = 2058) at 26 centers, we trained 8 ML algorithms with 10-fold cross-validation in a 50:50 random-split data set with class stratification to predict acute breast desquamation. Based on performance in the validation data set, the logistic model tree, random forest, and naïve Bayes models were taken forward to cost-sensitive learning optimisation.
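The 50:50 split with class stratification described above can be illustrated with a minimal sketch. It uses only the Python standard library and toy labels, not study data. The function name `stratified_split`, the seed, and the toy class sizes are assumptions made for illustration; the abstract does not state which software performed the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.5, seed=0):
    """Split indices so that each class is divided in the same proportion."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(valid_idx)


# Toy labels (hypothetical): 1 = acute desquamation, 0 = no desquamation.
labels = [1] * 20 + [0] * 180
train_idx, valid_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx)
valid_pos = sum(labels[i] for i in valid_idx)
print(len(train_idx), len(valid_idx), train_pos, valid_pos)  # 100 100 10 10
```

Stratifying keeps the minority-class proportion the same in both halves, which matters when the event of interest is rare.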
Year: 2022 PMID: 35647396 PMCID: PMC9133391 DOI: 10.1016/j.adro.2021.100890
Source DB: PubMed Journal: Adv Radiat Oncol ISSN: 2452-1094
Fig. 1 Diagram depicting the overall study design showing data preprocessing, splitting, and imputation at the top, and model development and optimization at the bottom. Abbreviations: ANN = artificial neural network; BVA = boundary value analysis; C4.5 = C4.5 decision tree; CS = cost sensitive optimisation; DMI = decision-tree-based missing-value imputation; ECP = equivalence class partitioning; ITD = imbalanced training data set; KNN = K-nearest neighbor; LMT = logistic model tree; LR = logistic regression; NB = naïve Bayes; RF = random forest; ROS = random over-sampling; RUS = random under-sampling; SMOTE = synthetic minority oversampling technique; SVM = support vector machine; VD = validation data set.
Summary study characteristics of eligible patients from the REQUITE patient cohort
| REQUITE breast cancer cohort | |
|---|---|
| Eligible patients | 2059 |
| Location | Western Europe, United States |
| Study design | Prospective cohort |
| Recruitment year (range) | 2014-2016 |
| Treatment year (range) | 2014-2016 |
| Toxicity assessment scale | CTCAE v4.0 |
| Toxicity assessment time points | Start-of-RT, End-of-RT |
| Age (median, range) | 58 (23-90) |
| Whole breast dose (Gy, median, range) | 50 (28.5-56) |
| Whole breast fractions (median, range) | 25 (5-31) |
| Hypofractionated regimen (proportion of patients) | 47.9% |
| IMRT, simple field-in-field | 39.7% |
| IMRT, complex modulated | 9.8% |
| RT to axilla | 11.9% |
| RT to supraclavicular fossa | 12.8% |
| Boost | 67.8% |
| BMI ≥25 | 54.0% |
| Smoker (current or previous) | 42.7% |
| Chemotherapy | 31.0% |
| Diabetes | 6.1% |
| Hypertension | 28.0% |
| Cardiovascular disease | 6.9% |
| Toxicity (end of treatment) | |
| Ulceration | |
| Grade 0 | 1868 (91.2%) |
| Grade ≥1 | 181 (8.8%) |
| Dermatitis | |
| Grade 0 | 257 (12.5%) |
| Grade 1 | 1288 (62.6%) |
| Grade 2 | 462 (22.4%) |
| Grade 3 | 28 (1.4%) |
| Acute desquamation | |
| Ulceration ≥G1 or dermatitis ≥G3 | 192 (9.3%) |
Abbreviations: BMI = body mass index; CTCAE = Common Terminology Criteria for Adverse Events; IMRT = intensity modulated radiation therapy; RT = radiation therapy.
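The composite end point in the last row of the table (ulceration ≥G1 or dermatitis ≥G3) can be written as a small predicate. This is a sketch only: the function name and the example patients are hypothetical, and it assumes both CTCAE grades are available as non-missing integers.

```python
def acute_desquamation(ulceration_grade, dermatitis_grade):
    """Composite end point from the table: ulceration >= G1 or dermatitis >= G3."""
    return ulceration_grade >= 1 or dermatitis_grade >= 3


# Hypothetical patients as (ulceration grade, dermatitis grade) pairs.
patients = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3)]
flags = [acute_desquamation(u, d) for u, d in patients]
print(flags)  # [False, False, True, True, True]
```

Because the two criteria are combined with a logical OR, a patient who meets both is counted once. This is consistent with the composite count (192) being smaller than the sum of the component counts in the table (181 + 28 = 209).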
Model performance with imputed imbalanced training data set DMI(ITD) and validation data set DMI(VD)
| Classifier | Training in ITD (n = 1029): Specificity (TNR) | Training in ITD: Sensitivity (TPR) | Training in ITD: AUC | Validation in VD (n = 1029): Specificity (TNR) | Validation in VD: Sensitivity (TPR) | Validation in VD: AUC | Rank |
|---|---|---|---|---|---|---|---|
| (K = 1) NN | 0.908 | 0.167 | 0.548 | 0.923 | 0.292 | 0.607 | 9 |
| (K = 3) NN | 0.975 | 0.094 | 0.601 | 0.979 | 0.125 | 0.627 | 8 |
| (K = 5) NN | 0.985 | 0.042 | 0.624 | 0.989 | 0.063 | 0.651 | 6 |
| (K = 7) NN | 0.996 | 0.031 | 0.648 | 0.998 | 0.052 | 0.644 | 7 |
| (K = 9) NN | 0.999 | 0.031 | 0.660 | 0.999 | 0.042 | 0.665 | 5 |
| ANN | 0.945 | 0.198 | 0.694 | 0.953 | 0.177 | 0.676 | 4 |
| C4.5 | 0.985 | 0.083 | 0.575 | 0.979 | 0.125 | 0.496 | 12 |
| LMT | 0.996 | 0.010 | 0.578 | 0.995 | 0.042 | 0.746 | 1 |
| LR | 0.910 | 0.188 | 0.567 | 0.959 | 0.135 | 0.596 | 10 |
| NB | 0.810 | 0.438 | 0.697 | 0.833 | 0.500 | 0.737 | 3 |
| SVM | 0.966 | 0.156 | 0.561 | 0.976 | 0.146 | 0.561 | 11 |
| RF | 0.998 | 0.021 | 0.725 | 0.999 | 0.010 | 0.742 | 2 |
Abbreviations: ANN = artificial neural network; AUC = area under the curve; C4.5 = C4.5 decision tree; DMI = decision-tree-based missing-value imputation; ITD = imbalanced training data set; KNN = K-nearest neighbor; LMT = logistic model tree; LR = logistic regression; NB = naïve Bayes; RF = random forest; SVM = support vector machine; TNR = true negative rate; TPR = true positive rate; VD = validation data set.
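The three metrics reported in the table can be computed as in the following sketch. The labels and scores are toy values, not study data, and the 0.5 cut-off for turning scores into class predictions is an assumption; the record does not state which cut-off was used.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (TPR, TNR) from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: 2 positives, 8 negatives, all scores below 0.5.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.10, 0.35, 0.05, 0.08, 0.02, 0.12, 0.03, 0.06, 0.04]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tpr, tnr = sensitivity_specificity(y_true, y_pred)
print(tpr, tnr, auc(y_true, scores))  # 0.0 1.0 0.875
```

The toy case shows how a model can rank patients reasonably well (AUC 0.875) while detecting no positives at a fixed cut-off, the same pattern seen in the table for LMT and RF, which combine the highest validation AUCs with sensitivities below 0.05.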
Fig. 2 Radar charts plotting sensitivity (TPR) and specificity (TNR) in the validation data set for all ML models developed with the RUS, ROS, and SMOTE resampled training data and after applying cost-sensitive learning to the 3 best-performing ML models (RF, NB, and LMT). Abbreviations: LMT = logistic model tree; ML = machine learning; NB = naïve Bayes; RF = random forest; ROS = random over-sampling; RUS = random under-sampling; SMOTE = synthetic minority oversampling technique; TNR = true negative rate; TPR = true positive rate.
Fig. 3 Trade-off threshold lines are shown for sensitivity (TPR) and specificity (TNR) at 0.63 and 0.70, respectively. Five models cross both threshold lines and their TPR, TNR, and AUC values are shown at the bottom. Two out of 5 models have a higher TNR than TPR and 3 out of the 5 models have a higher TPR than TNR. The “hero” model (no. 1) was the cost-sensitive random forest algorithm with a penalty of 90:1. Abbreviations: AUC = area under the curve; TNR = true negative rate; TPR = true positive rate.
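The sketch below illustrates one common way a misclassification penalty can shift predictions, together with the trade-off filter from the caption (TPR of at least 0.63 and TNR of at least 0.70). It is illustrative only. It assumes the 90:1 penalty means a false negative costs 90 times as much as a false positive, and it applies the cost through a minimum-expected-cost decision rule; the caption states neither the direction of the ratio nor the mechanism (for example, instance reweighting versus threshold shifting). All function names are hypothetical.

```python
def min_cost_threshold(cost_fn, cost_fp):
    """Probability above which predicting 'positive' has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def predict_cost_sensitive(prob_positive, cost_fn=90.0, cost_fp=1.0):
    """Predict 1 when the expected cost of a miss exceeds that of a false alarm."""
    expected_cost_if_negative = prob_positive * cost_fn
    expected_cost_if_positive = (1.0 - prob_positive) * cost_fp
    return int(expected_cost_if_negative > expected_cost_if_positive)


def passes_tradeoff(tpr, tnr, min_tpr=0.63, min_tnr=0.70):
    """Check a model against both threshold lines described in the caption."""
    return tpr >= min_tpr and tnr >= min_tnr


# With an assumed 90:1 penalty the decision threshold drops from 0.5 to 1/91.
threshold = min_cost_threshold(cost_fn=90.0, cost_fp=1.0)
print(predict_cost_sensitive(0.05))   # 1: a 5% risk is flagged as positive
print(predict_cost_sensitive(0.005))  # 0: a 0.5% risk is not

# NB validation values from the imbalanced-data table (TPR 0.500, TNR 0.833).
print(passes_tradeoff(0.500, 0.833))  # False: sensitivity is below 0.63
```

Penalising missed cases more heavily trades specificity for sensitivity, which is the direction needed to move a high-specificity, low-sensitivity model toward both threshold lines.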
Features in the “hero” optimized cost-sensitive RF classifier ranked by importance
| Model's feature | MDI | Model's feature | MDI |
|---|---|---|---|
| other_lipid_lowering_drugs_duration_yrs | 0.52 | alcohol_current_consumption | 0.2 |
| surgery_type | 0.41 | smoking_time_since_quitting_yrs | 0.2 |
| radio_bolus | 0.4 | radio_imrt | 0.19 |
| chemotherapy | 0.36 | radio_photon_boostdose_Gy | 0.19 |
| boost | 0.35 | other_antihypertensive_drug | 0.19 |
| radio_photon_dose_MV | 0.34 | household_members | 0.19 |
| epirubicin_chemo_drug | 0.34 | radio_breast_fractions_dose_per_fraction_Gy | 0.19 |
| blood_pressure | 0.33 | radio_elec_boost_field_y_cm | 0.19 |
| Bra_band_size | 0.3 | radio_photon_2nd | 0.19 |
| radio_treated_breast | 0.3 | bra_cup_size | 0.19 |
| tumour_size_mm | 0.29 | radio_breast_fractions | 0.19 |
| paclitaxel_chemo_drug | 0.29 | n_stage | 0.18 |
| grade_invasive | 0.28 | hypertension_duration_yrs | 0.18 |
| breast_separation | 0.28 | radio_supraclavicular_fossa | 0.18 |
| smoking | 0.27 | education_profession | 0.18 |
| radio_elec_energy_MeV | 0.27 | radio_axillary_levels | 0.18 |
| BED_boost | 0.27 | hypertension | 0.18 |
| docetaxel_chemo_drug | 0.27 | radio_photon_boost_fractions_per_week | 0.17 |
| BED_Total | 0.27 | smoker | 0.17 |
| radio_elec_boost_dose_Gy | 0.27 | depression | 0.17 |
| On_tamoxifen | 0.26 | menopausal_status | 0.17 |
| radio_heart_mean_dose_Gy | 0.26 | radio_boost_diameter_cm | 0.16 |
| t_stage | 0.26 | 5-fluorouracil (5-FU)_chemo_drug | 0.16 |
| radio_hot_spots_107 | 0.25 | radio_photon_boost_dose_per_fraction_Gy | 0.16 |
| BED_Breast | 0.25 | antidepressant_duration_yrs | 0.16 |
| tobacco_products_per_day | 0.25 | radio_breast_fractions_per_week | 0.15 |
| age_at_radiotherapy_start_yrs | 0.25 | radio_boost_type | 0.15 |
| radio_breast_ct_volume_cm3 | 0.25 | Carboplatin_chemo_drug | 0.15 |
| hormone_replacement_therapy | 0.24 | radio_boost_sequence | 0.15 |
| radio_photon_boost_volume_cm3 | 0.24 | radio_photon_boost_fractions | 0.15 |
| antidepressant | 0.24 | household_income | 0.15 |
| height_cm | 0.24 | methotrexate_chemo_drug | 0.15 |
| radio_photon_2nd_energy_MV | 0.24 | other_lipid_lowering_drugs | 0.14 |
| radio_ipsilateral_lung_mean_Gy | 0.24 | radio_photon_energy_MV or kV | 0.14 |
| alcohol_previous_consumption | 0.24 | ace_inhibitor | 0.13 |
| radio_photon_2nd_dose_fractions_per_week | 0.23 | analgesics_duration_yrs | 0.13 |
| radio_skin_max_dose_Gy | 0.23 | radio_photon_2nd_dose_per_fraction_Gy | 0.13 |
| histology | 0.23 | antidiabetic_duration_yrs | 0.13 |
| monopause_age_yrs | 0.23 | depression_duration_yrs | 0.13 |
| other_antihypertensive_drug_duration_yrs | 0.23 | on_statin_duration_yrs | 0.12 |
| weight_at_cancer_diagnosis_kg | 0.23 | antidiabetic | 0.12 |
| tobacco_product | 0.23 | diabetes | 0.11 |
| cyclophosphamide_chemo_drug | 0.22 | ace_inhibitor_duration_yrs | 0.11 |
| combined_chemo_drugs | 0.22 | on_statin | 0.11 |
| boost_frac | 0.22 | doxorubicin_chemo_drug | 0.11 |
| analgesics | 0.22 | history_of_heart_disease | 0.09 |
| breast_cancer_family_history_1st_degree | 0.22 | radio_axillary_other | 0.09 |
| smoking_duration_yrs | 0.21 | ethnicity | 0.09 |
| radio_photon_boostdose_precise_Gy | 0.21 | radio_interrupted | 0.08 |
| radio_elec_boost_field_x_cm | 0.21 | pegfilgrastim_chemo_drug | 0.07 |
| radio_photon_2nd_fractions | 0.21 | history_of_heart_disease_duration_yrs | 0.06 |
| radio_boost_fractions | 0.21 | radiotherapy_toxicity_family_history | 0.06 |
| alcohol_intake | 0.21 | diabetes_duration_yrs | 0.05 |
| radio_type_imrt | 0.21 | radio_interrupted_days | 0.05 |
| radio_treatment_pos | 0.21 | trastuzumab_chemo_drug | 0.04 |
| radio_breast_dose_Gy | 0.2 | other_collagen_vascular_disease | 0.03 |
| rheumatoid arthritis_duration_yrs | 0.2 | rheumatoid arthritis | 0.02 |
Abbreviations: BED = biologically effective dose; IMRT = intensity modulated radiation therapy; MDI = mean decrease impurity; MeV = megaelectron volt; MV = megavolt; RF = random forest.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value, the more important the feature.
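The calculation described above can be sketched for a single tree node as follows. Gini impurity is an assumption here, since the text does not name the impurity measure, and the function names and class counts are hypothetical. In a random forest, a feature's importance is obtained by summing these node contributions over every node that splits on the feature and averaging across trees.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease at one split, weighted by the probability of reaching the node."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    decrease = (
        gini(parent)
        - (n_left / n_parent) * gini(left)
        - (n_right / n_parent) * gini(right)
    )
    # Node probability = samples reaching the node / total samples.
    return (n_parent / n_total) * decrease


# Hypothetical root split: counts are [negatives, positives].
root_gain = weighted_impurity_decrease(
    parent=[90, 10], left=[80, 2], right=[10, 8], n_total=100
)
print(round(root_gain, 4))

# A perfectly separating split recovers the full parent impurity.
print(weighted_impurity_decrease([5, 5], [5, 0], [0, 5], n_total=10))  # 0.5
```

Splits close to the root are reached by more samples and therefore carry more weight than equally clean splits deep in the tree.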