Stefania Volpe, Matteo Pepa, Mattia Zaffaroni, Federica Bellerba, Riccardo Santamaria, Giulia Marvaso, Lars Johannes Isaksson, Sara Gandini, Anna Starzyńska, Maria Cristina Leonardi, Roberto Orecchia, Daniela Alterio, Barbara Alicja Jereczek-Fossa.
Abstract
Keywords: artificial intelligence; head and neck cancer; machine learning; radiotherapy; systematic review
Year: 2021 PMID: 34869010 PMCID: PMC8637856 DOI: 10.3389/fonc.2021.772663
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 6.244
Figure 1 Machine learning workflow and current applications in Radiation Oncology.
Figure 2 Study selection process per the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
Figure 3 Classification of the machine-learning algorithms included in the analysis. *Comprises ANN, CNN, and FCNN. ANN, Artificial Neural Network; CNN, Convolutional Neural Network; FCNN, Fully CNN; HMM, Hidden Markov Model; k-NN, k-Nearest Neighbour; MARS, Multivariate Adaptive Regression Splines; PCA, principal component analysis; PCR, principal component regression; SVC, support vector classifier; SVM, support vector machine.
Summary and definitions of most common machine learning (ML) models.
| ML model | Abbreviation(s) | Definition |
|---|---|---|
| Artificial neural network | ANN, NN | Any set of algorithms modeled on the neuronal connections of the human brain |
| Active shape model | ASM | Model-based method that compares an image reference model with the image of interest |
| Bayesian bagging | BB | Bayesian analog of the original bootstrap. Bootstrap samples of the data are taken, the model is fit to each sample, and the predictions are averaged over all of the fitted models to obtain the bagged prediction |
| Boosting | – | A generic algorithm rather than a specific model. Boosting starts from a weak model (e.g., regression, shallow decision trees) and iteratively improves it |
| Bagging (bootstrap aggregating) | – | Meta-algorithm designed to improve the stability and accuracy of ML algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method |
| Classification and regression tree | CART | Predictive model that predicts an outcome variable value based on other values. A CART output is a decision tree where each fork is a split in a predictor variable and each end node contains a prediction for the outcome variable |
| Convolutional neural network | CNN, NN | Ordinary NN that implements convolution (a mathematical operation on two functions producing a third function that expresses how the shape of the first is modified by the second) in at least one of its layers. Most commonly, inputs are images |
| C4.5 | – | An algorithm used to generate a decision tree. The decision trees generated by C4.5 can be used for classification, and for this reason the algorithm is often referred to as a statistical classifier |
| Decision tree | DT | Algorithm containing conditional control statements organized in a flowchart-like structure, also called a tree-like model. Paths from root to leaves represent classification rules, while each node is a class label (a decision based on the computation of the attributes) |
| Decision stump | DS | Model consisting of a 1-level decision tree: a tree whose internal node (root) is immediately connected to the terminal nodes (its leaves). A DS makes a prediction based on the value of a single input feature. Decision stumps are sometimes also called 1-rules |
| Fully convolutional neural network | FCNN | A deep learning model based on the traditional CNN. In a FCNN all the learnable layers are convolutional, so it does not have any fully connected layer |
| Incremental association Markov blanket | IAMB | Feature selection method |
| Least absolute shrinkage and selection operator | LASSO | A regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model |
| Likelihood fuzzy analysis | LFA | A method for translating statistical information from labeled data into a fuzzy classification system with a good confidence measure in terms of class probabilities and interpretability of the fuzzy classification model, by means of semantically interpretable fuzzy partitions and if–then rules |
| Linear discriminant analysis | LDA | A method used to find a linear combination of features that characterizes or separates two or more classes of objects or events |
| Logistic regression | LR | A statistical model that uses a logistic function to model a binary dependent variable |
| k-nearest neighbours | k-NN | Non-parametric algorithm that classifies data points based on their similarity (also called distance or proximity) to the objects (feature vectors) contained in a collection of known objects (the vector space or feature space) |
| Multivariate adaptive regression splines | MARS | A nonparametric regression technique that extends linear models by automatically modeling nonlinearities and interactions between variables |
| Minimum redundancy maximum relevance | MRMR | Supervised feature selection algorithm that requires both the input features and the output class labels of the data. MRMR attempts to find the set of features that associates best with the output class labels while minimizing the redundancy between the selected features |
| Naive Bayes | NB | Applies Bayes' theorem to calculate the probability of a hypothesis being true, assuming prior knowledge and a strong (hence "naive") degree of independence between the features |
| Partial least squares regression and principal component regression | PLSR and PCR | Both methods model a response variable when there is a large number of highly correlated predictor variables. Both construct new predictor variables, known as components, as linear combinations of the original predictors. PCR creates components to explain the observed variability in the predictor variables without considering the response variable at all; PLSR does take the response variable into account and therefore often leads to models that fit the response with fewer components |
| Principal component analysis | PCA | Captures the maximum variance in the data in a new coordinate system whose axes are called "principal components," to reduce data dimensionality, favor data exploration, and reduce computational cost |
| Penalized logistic regression | PLR | Imposes a penalty on the logistic model for having too many variables, shrinking the coefficients of the less contributive variables toward zero. This is also known as regularization |
| Random forest | RF, RFC | Operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees |
| Relief | – | An algorithm that takes a filter-method approach to feature selection and is notably sensitive to feature interactions. Relief calculates a score for each feature, which can then be used to rank and select the top-scoring features |
| Random survival forest | RSF | A nonparametric ensemble method constructed by bagging classification trees for survival data; it has been proposed as an alternative method for better survival prediction and variable selection |
| Rescorla–Wagner | RW | A model of classical conditioning in which learning is conceptualized in terms of associations between conditioned and unconditioned stimuli |
| Gradient boosting | – | A ML technique that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees |
| Support vector classifier | SVC | The objective of a linear SVC is to fit the provided data and return a "best-fit" hyperplane that divides, or categorizes, them |
| Support vector machine | SVM | Based on the idea of finding a hyperplane that best divides the support vectors into classes. The SVM algorithm performs best in binary classification problems, although it can also be used for multiclass classification |
| U-Net | – | A CNN developed for biomedical image segmentation. The main idea is to supplement a usual contracting network with successive layers in which pooling operations are replaced by upsampling operators. These layers increase the resolution of the output, and a successive convolutional layer can then learn to assemble a precise output based on this information |
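To make one of the simpler models defined above concrete, a decision stump (DS) can be sketched in a few lines of Python. This is an illustrative toy on synthetic data, not code from any of the reviewed studies; the single split is chosen by exhaustive search over candidate thresholds.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a 1-level decision tree on a single feature: pick the
    (threshold, direction) pair minimizing misclassification."""
    best = (None, None, 1.0)  # (threshold, sign, error)
    for t in np.unique(x):
        for sign in (1, -1):
            pred = np.where(sign * (x - t) >= 0, 1, 0)
            err = np.mean(pred != y)
            if err < best[2]:
                best = (t, sign, err)
    return best

def predict_stump(x, t, sign):
    # Predict class 1 on one side of the threshold, 0 on the other
    return np.where(sign * (x - t) >= 0, 1, 0)

# Toy data: the feature perfectly separates the classes at x >= 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 1, 1, 1])
t, sign, err = fit_stump(x, y)
```

Because a stump queries only one feature once, it is a canonical "weak learner" for the boosting and bagging meta-algorithms described in the same table.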
Reported Dice Similarity Coefficient (DSC) in literature for different organs.
| Organ | No. of studies (n) | Reference papers | DSC (median, IQR) |
|---|---|---|---|
| | 13 | 56–67, 100 | 0.84 (0.83–0.86) |
| | 9 | 56–61, 64, 67, 100 | 0.93 (0.90–0.94) |
| | 8 | 56–61, 67, 100 | 0.86 (0.84–0.89) |
| | 7 | 56, 58–61, 64, 67 | 0.69 (0.67–0.71) |
| | 7 | 56, 58–61, 64, 67, 100 | 0.80 (0.76–0.81) |
| | 5 | 56, 59, 61, 64, 68 | 0.532 (0.412–0.581) |
| | 4 | 57, 58, 60, 64 | 0.88 (0.77–0.96) |
| | 3 | 57, 58, 100 | 0.90 (0.80–0.91) |
| | 2 | 57, 64 | 0.91 |
| | 2 | 57, 60 | 0.86 |
| | 2 | 57, 64 | 0.85 |
| | 2 | 58, 60 | 0.82 |
| | 2 | 58 | 0.57 |
| | 2 | 58, 100 | 0.57 |
| | 1 | 60 | 0.99 |
| | 1 | 60 | 0.65 |
| | 1 | 60 | 0.93 |
| | 1 | 60 | 0.84 |
| | 1 | 60 | 0.98 |
| | 1 | 58 | 0.69 |
| | 1 | 58 | 0.77 |
| | 1 | 57 | 0.87 |
| | 1 | 57 | 0.82 |
| | 1 | 64 | 0.69 |
Vandewinckele et al. (57) achieved a DSC of 0.65 using a CNN, and Nikolov et al. (59) a DSC of 0.982 with a 3D U-Net.
The reported DSC was computed as an average of inferior, medial and superior.
The average value of two (in some cases three) models was considered.
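The DSC values tabulated above quantify spatial overlap between an automated contour and a reference (usually manual) contour. A minimal Python sketch of the metric on binary masks follows; it is illustrative only, not code from any of the cited studies.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice Similarity Coefficient between two binary masks:
    DSC = 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy 2x3 "image": the automated mask has one extra voxel
auto = np.array([[1, 1, 0],
                 [1, 0, 0]])
manual = np.array([[1, 1, 0],
                   [0, 0, 0]])
score = dice(auto, manual)  # 2*2 / (3+2) = 0.8
```

The same formula applies voxel-wise in 3D; the surface DSC used by Nikolov et al. is a distinct, boundary-based variant.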
Characteristics for machine-learning studies on autosegmentation.
| Author, year of publication | Study population | HN subsite | Imaging modality | Textural and dosimetric parameters | ROI(s) | Tested ML algorithm(s) | Statistical findings and model performance |
|---|---|---|---|---|---|---|---|
| Brunenberg et al., 2020 ( | 58 pts | Mixed | CT | – | PGs, SMGs, thyroid, buccal mucosa, extended OC, pharynx constrictors, cricopharyngeal inlet, supraglottic area, MNDB, BS | Commercially available DL model; external validation | The best performance was reached for the MNDB (DSC 0.90; HD95 3.6 mm); the agreement was moderate for the aerodigestive tract with the exception of the OC. The largest variations were in the cranial and/or caudal directions (binned measurements) |
| Ma et al., 2019 ( | 90 pts | NPC | CT and MR | – | GTVs | CNNs | Both M-CNN and C-CNN showed better performance on MR than on CT. C-CNN outperformed M-CNN in both CTs (higher mean Sn, DSC, and ASSD, comparable mean PPV) and MR applications (higher mean PPV, DSC, and ASSD, comparable mean Sn) |
| Vandewinckele et al., 2019 ( | 9 pts | Mixed | CT | – | Cochlea, BS, upper esophagus, glottis area, MNDB, OC, PGs, inferior, medial and superior PCMs, SC, SMGs, supraglottic Lar | CNN | The longitudinal CNN is able to improve the segmentation results in terms of DSC compared with the DIR for 6/13 considered OARs. The longitudinal approach outperforms the cross-sectional one in terms of both DSC and ASSD for 6 different organs (BS, upper esophagus, OC, PGs, PCM medial, and SMGs) |
| Hänsch et al., 2018 ( | 254 pts, 254 R PGs, 253 L PGs | Mixed | CT | – | Ipsi- and contralateral PGs | DL U-net | The 3 ANNs showed comparable performance for training and internal validation sets (DSC ≈0.83). The 2-D ensemble and 3-D U-net showed satisfactory performance when externally validated (AUC and DSC: 0.865 and 0.880, respectively; 2-D U-net omitted) |
| Mocnik et al., 2018 ( | 44 pts | Not specified | CT and MR | – | PGs | CNN | The multimodal CNN (CT + MR) compared favorably with the single modality CNN (CT only) in the 80.6% of cases. Overall, DSCs value were 78.8 and 76.5, respectively. Both multi- and single-modality CNNs showed satisfactory registration performance |
| Nikolov et al., 2018 ( | 486 pts, 838 CT scans for training, test and internal validation; 46 pts and 45 CT scans for external validation | Mixed | CT | – | Brain, BS, L and R cochlea, L and R LG, L and R Lens, L and R Lung, MNDB, L and R ON, L and R Orbit, L and R PGs, SC, L and R SMG | 3D U-Net | The segmentation algorithm showed good generalizability across different datasets and has the potential of improving segmentation efficiency. For 19/21 performance metrics (surface and volumetric DSC) were comparable with experienced radiographers; less accuracy was demonstrated for brainstem and R-lens |
| Ren et al., 2018 ( | 48 pts | Not specified | CT | – | Chiasm, L and R ON | 3D-CNNs | The proposed segmentation method outperformed the one developed by the MICCAI 2015 challenge winner for all the considered ROIs (DSC chiasm: 0.58 ± 0.17 |
| Tong et al., 2018 ( | 32 pts | Not specified | CT | – | L and R PGs, BS, Chiasm, L and R ONs, MNDB, L and R SMG | FCNN with and without SRM | Accuracy and robustness of the model were improved when incorporating shape priors through the SRM for all considered ROIs. Segmentation results were satisfactory, ranging from DSC values of 0.583 for the chiasm to 0.937 for the MNDB. Average time for segmenting the whole structure set was 9.5 s |
| Zhu et al., 2018 ( | 271 CT scans | Not specified | CT | – | BS, Chiasma, MNDB, L and R ON, L and R PG, L and R SMG | Implemented 3D U-Net (AnatomyNet) | The AnatomyNet allowed for an average improvement in segmentation performance of 3.3% (DSC) as compared with previously published data of the MICCAI 2015 challenge. Segmentation time was 0.12 s for the whole structure set. |
| Doshi et al., 2017 ( | 10 pts/102 MR slices | Mixed | MR | – | GTVs | FCLSM | PLCSF showed a good performance |
| Ibragimov et al., 2017 ( | 50 pts | Not specified | CT | – | SC, MNDB, PGs, SMGs, Lar, Phar, R and L EB, R and L ON, optic chiasm | CNN-MRF | Model performance was satisfactory for almost all considered OARs (DSC values as follows—spinal cord: 87 ± 3.2; mandible: 89.5 ± 3.6; PGs DSC: 77.3 ± 5.8; submandibular glands DSC: 71.4 ± 11.6; Lar DSC: 85.6 ± 4.2; phar DSC: 69.3 ± 6.3; eye globes DSC: 88.0 ± 3.2; optic ONs DSC: 62.2 ± 7.2; optic chiasm: 37.4 ± 13.4) |
| Liang et al., 2017 ( | 185 pts | NPC | CT | – | BS, R and L EB, R and L lens, Lar, R and L MNDB, OC, R and L MAS, SC, R and left PG, R and L T-M, R and L ON | CNNs (ODS-net) | ODS-net showed satisfactory Sn and Sp for most OARs (range: 0.997–1.000 and 0.983–0.999, respectively), with DSC >0.85 when compared with manually segmented contours. ODS-net outperformed a competing FCNN ( |
| Men et al., 2017 ( | 230 pts | NPC | CT | – | GTV-T, GTV-N, CTV | DDNN | DDNN generated accurate segmentations for GTV-T and CTV (ground truth: manual segmentation), with DSC of 0.809 and 0.826, respectively. Performance for GTV-N was less satisfactory (DSC: 0.623). DDNN outperformed a competing model (VGG-16) for all the analyzed segmentations |
| Stefano et al., 2017 ( | 4 phantom experiments+ 18 pts/40 lesions | Mixed | PET | – | GTVs | RW | Both the K-RW and the AW-RW compare favorably with previously developed methods in delineating complex-shaped lesions; accuracy on phantom studies was satisfactory |
| Wang et al., 2017 ( | 111 pts | Mixed | CT | – | Cochlea, BS, upper esophagus, glottis area, MNDB, OC, PGs, inferior, medial and superior PCMs, SC, SMGs, supraglottic Lar | 3D U-Net | The model showed satisfactory performance for most of the 9 considered ROIs; when compared with other models, it ranked first in 5/9 cases (L and R PG, L and R ON, L SMG), and second in 4/9 cases |
| Beichel et al., 2016 ( | 59 pts/230 lesions | Mixed | PET | – | GTVs | Semiautomated segmentation (LOGISMOS) | Segmentation accuracy measured by the DSC was comparable for semiautomated and manual segmentation (DSC: 0.766 and 0.764, respectively) |
| Yang et al., 2014 ( | 15 pts/30 PGs/57 MRs | Mixed | MR | – | Ipsi- and contralateral PGs | SVM | Average DSCs between automated and manual contours were 91.1% ± 1.6% for the L PG and 90.5% ± 2.4% for the R PG. Performance was slightly better for the L PG, also when assessed per the averaged maximum and average surface distance |
| Cheng G et al., 2013 ( | 5 pts, 10 PGs | NPC | MR | – | Ipsi- and contralateral PGs | SVM | Mean DSC between automated and physician’s PG contours was 0.853 (range: 0.818–0.891) |
| Qazi et al., 2011 ( | 25 pts | Not specified | CT | I | MNDB, BS, L and R PG, L and R SMG, L and R node level IB, L and R node levels II–IV | Atlas-based segmentation | As compared with manual delineations by an expert, the automated segmentation framework showed high accuracy, with DSC of 0.93 for the MNDB, 0.83 for the PGs, 0.83 for the SMGs, and 0.74 for nodal levels |
| Chen et al., 2010 ( | 15 pts/15 neck nodal levels | Mixed | CT | – | II, III, and IV neck nodal levels | ASM | The ASM outperformed the atlas-based method (ground truth: manually segmented contours), with higher DSC (10.7%) and lower mean and median surface errors (−13.6% and −12.0%, respectively) |
| Yu et al., 2009 ( | 10 pts/10 GTV-T and 19 GTV-N | Mixed | PET and CT | I | GTVs | KNN | The feature-based classifier showed better performance than other delineation methods (e.g. standard uptake value of 2.5, 50% maximal intensity and signal/background ratio) |
2D/3D, 2/3-dimensional; ANN, artificial neural network; ASM, active shape model; ASSD, average symmetric surface distance; AW-RW, K-RW algorithm with adaptive probability threshold; BS, brainstem; CNN, convolutional neural network; C-CNN, combined CNN; CT, computed tomography; CTV, clinical target volume; D, dosimetric; DDNN, deep deconvolutional neural network; DIR, deformable image registration; DL, deep learning; DSC, Dice Similarity Coefficient; EB, eyeball; FCLSM, modified fuzzy c-means clustering integrated with the level set method; FCNN, fully convolutional neural network; GTV-N, nodal gross tumor volume; GTV-T, tumor gross tumor volume; HD, Hausdorff distance; I, imaging; KNN, k-nearest neighbors; K-RW, RW algorithm with K-means; L, left; Lar, larynx; LG, lacrimal gland; LOGISMOS, layered optimal graph image segmentation of multiple objects and surfaces; M-CNN, multimodality convolutional neural network; MHD, modified Hausdorff distance; MICCAI, Medical Image Computing and Computer Assisted Intervention; MNDB, mandible; MR, magnetic resonance; MRF, Markov random field; MAS, mastoid; MS, mean shift; Ncut, normalized cut; NPC, nasopharyngeal carcinoma; OAR, organ at risk; OC, oral cavity; ODS-net, organs-at-risk detection and segmentation network; ON, optic nerve; p, p-value; PCC, Pearson correlation coefficient; PCM, pharyngeal constrictor muscles; PET, positron emission tomography; PG, parotid gland; Phar, pharynx; PLCSF, pharyngeal and laryngeal cancer segmentation framework; PPV, positive predictive value; pt, patient; R, right; RAD, relative area difference; ROI, region of interest; RW, Rescorla–Wagner; SC, spinal cord; s, second; SMG, submandibular gland; Sn, sensitivity; Sp, specificity; SRM, shape representation model; SVM, support vector machine; VGG-16, visual geometry group-16.
Characteristics for machine-learning studies on oncological outcome.
| Authors, publication year | Sample study population | HN subsite | Clinical endpoint | Imaging modality | Textural and dosimetric parameters | ROI(s) | Tested ML algorithm(s) | Statistical findings and model performance |
|---|---|---|---|---|---|---|---|---|
| De Felice et al, 2020 ( | 273 pts | OPC | OS prediction in OPC pts treated with IMRT | None | – | None | Decision trees | The most relevant clinical variables identified were HPV status, nodal stage and early complete response to IMRT |
| Howard et al, 2020 ( | 33,527 pts | Mixed | OS prediction in HNC pts with intermediate risk factors treated with adjuvant CHT-RT or RT; identification of which pts may benefit from CHT-RT | None | – | None | DeepSurv, RSF, N-MLTR | Indication to treatment according to model recommendations was associated with a survival benefit; the best performance was achieved by DeepSurv, with an HR of 0.79 (95% CI, 0.72–0.85; |
| Starke et al, 2020 ( | 291 pts | Mixed | LRC in locally-advanced HN SCC treated with primary CHT-RT | CT | – | GTVs | 3D- and 2D-CNNs (from scratch, transfer learning and extraction of deep autoencoder features) | The best performance was achieved by an ensemble of 3D-CNNs (C-index = 0.31 on the external validation cohort); the model yielded a satisfactory performance in discriminating high- vs. low-risk LRC ( |
| Tseng et al, 2020 ( | 334 pts | OC | Risk stratification of locally-advanced OC pts treated with surgery | None | – | None | Elastic net penalized Cox proportional hazards regression-based risk stratification model | The incorporation of genetic information to clinicopathologic data led to better model performance for the prediction of both CSS and LRC, as compared with models using clinicopathologic variables alone (mean C index, 0.689 vs. 0.673; |
| Fujima et al., 2019 ( | 36 pts | SNC | LC following superselective arterial CDDP infusion and concomitant RT | MR | I | GTVs (necrotic and cystic areas excluded) | Nonlinear SVM | Mean Sn: 1.0, Sp 0.82, PPV 0.86, NPV 1.0 (on validation data sets, 9-fold crossvalidation scheme used) |
| Tran et al., 2019 ( | 32 pts | NPC | RT response of metastatic nodes by ultrasound-derived radiomic markers | CT, MR, EUS | – | GTVs | LR, naive Bayes, and k-NN | There was a statistically significant difference in the pretreatment QUS-radiomic parameters between radiological complete responders vs. partial responders ( |
| Wu et al., 2019 ( | 140 pts | OPC | DMFS | CT | I | Baseline and mid-treatment GTV-T and GTV-N | RSF | Better performance on testing set was achieved by the model incorporating mid-treatment characteristics (C-index: 0.73, |
| Li et al., 2018 ( | 306 pts | NPC | Analyze the recurrence patterns in pts with NPC treated with IMRT | CT, MR and PET | I | GTVs | ANN, k-NN, and SVM | NPC-IFRs vs NPC-NPDs could be differentiated by 8 features (AUCs: 0.727–0.835). The classification models showed potential in prediction of NPC-IFR with higher accuracies (ANN: 0.812, KNN: 0.775, SVM: 0.732) |
| Zdilar et al., 2018 ( | 529 pts, >3,800 radiomic features | OPC | OS and RFS | CT | I | GTVs | Feature selectors: MRMR, Wilcoxon rank sum test, RF, ReliefF, RRF, IAMB, RSF, PCA; predictive models: LR, CPH, RF, RSF, logistic elastic net, ensemble models | RF feature selectors achieved the best performance for both OS prediction (AUC: 0.75, C-index: 0.76, calibration: 0.87) and RFS (AUC: 0.71, C-index: 0.68, calibration: 19.1). The ensemble model (clinical + radiomic) yielded the best scores for AUC and C-index in all cases |
| Jiang et al., 2015 ( | 347 pts | NPC | OS prediction in pts with ab initio metastatic NPC (M1a vs. M1b) | None | – | None | SVM | The SVM classifier showed good performance at internal validation (AUC: 0.761, Sn 80.7%, Sp: 71.3%), while performance was less satisfactory when externally validated (AUC: 0.633) |
| Parmar et al., 2015 ( | 136 pts | Mixed | OS | CT and PET | – | GTVs | Feature selectors: RELF, FSCR, Gini, JMI, CIFE, DISR, MIM, CMIM, ICAP, TSCR, MRMR, MIFS, Wilcoxon; predictive models: NN, decision tree, boosting, Bayesian bagging, RF, multivariate adaptive regression splines (MARS), SVM, k-NN, GLM, partial least squares, and principal component regression | The three feature selection methods minimum redundancy maximum relevance (AUC = 0.69, stability = 0.66), mutual information feature selection (AUC = 0.66, stability = 0.69), and conditional infomax feature extraction (AUC = 0.68, stability = 0.70) had high prognostic performance and stability. The highest prognostic performance was achieved by GLM (median AUC ± SD: 0.72 ± 0.08) and PLSR (median AUC ± SD: 0.73 ± 0.07), whereas BAG (AUC = 0.55 ± 0.06), DT (AUC = 0.56 ± 0.05), and BST (AUC = 0.56 ± 0.07) showed lower AUC values. RF (RSD = 7.36%) and BAG (RSD = 9.27%) were more stable classification methods, whereas PLSR (RSD = 12.75%) and SVM (RSD = 12.69%) showed lower stability |
| Bryce et al., 1998 ( | 95 pts | Mixed | Survival prediction in pts with advanced HN SCC treated with RT ± chemotherapy | None | – | None | LR, ANN | ANNs compared favorably with LR models at survival prediction, with an AUC of 0.78 ± 0.05 for the best ANN and of 0.67 ± 0.05 for the best LR model. The best ANN also outperformed the modified AJCC TNM 4th edition in survival prediction. Incorporated clinical parameters for the ANN were: tumor size, tumor resectability, nodal stage, tumor stage, and baseline hemoglobin levels |
ANN, artificial neural network; AUC, area under the curve; CDDP, cisplatin; CHT, chemotherapy; CIFE, conditional infomax feature extraction; CMIM, conditional mutual information maximization; CNN, convolutional neural network; CSS, cancer-specific survival; CT, computed tomography; D, dosimetric; DISR, double input symmetric relevance; DMFS, distant metastasis-free survival; GTV, gross tumor volume; HN, head and neck; HR, hazard ratio; I, imaging; ICAP, interaction capping; IMRT, intensity-modulated RT; JMI, joint mutual information; k-NN, k-nearest neighbor; LC, local control; LR, logistic regression; LRC, loco-regional control; MARS, multivariate adaptive regression splines; MIFS, mutual information feature selection; MIM, mutual information maximization; MR, magnetic resonance; MRMR, minimum redundancy maximum relevance; NN, neural network; N-MLTR, neural network multitask logistic regression; NPC, nasopharyngeal cancer; OC, oral cavity cancer; OPC, oropharyngeal cancer; OS, overall survival; PET, positron emission tomography; PLSR, partial least squares regression; RF, random forest; RFS, relapse-free survival; RSD, relative standard deviation; RSF, random survival forest; RT, radiotherapy; SCC, squamous cell carcinoma; SNC, sinonasal cancer; SVM, support vector machine; TSCR, t-test score.
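Several of the studies above report Harrell's concordance index (C-index) for survival models. A minimal Python sketch of the metric for right-censored data follows; it is illustrative only (toy inputs, pairwise O(n²) implementation), not code from any of the cited studies.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index: over comparable pairs (the earlier time is an
    observed event), the fraction where the higher predicted risk belongs
    to the earlier failure; ties in risk count as 0.5."""
    n_conc, n_comp = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:
                n_comp += 1
                if risk[i] > risk[j]:
                    n_conc += 1.0
                elif risk[i] == risk[j]:
                    n_conc += 0.5
    return n_conc / n_comp

# Toy cohort: one censored observation (event = 0)
time = np.array([2.0, 4.0, 6.0, 8.0])
event = np.array([1, 1, 0, 1])
risk = np.array([0.9, 0.7, 0.3, 0.1])  # perfectly anti-ordered with time
cindex = concordance_index(time, event, risk)  # 1.0
```

A C-index of 0.5 corresponds to random ranking; with no censoring and a binary endpoint it reduces to the AUC, which is why both metrics appear side by side in the table.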
Characteristics for machine learning studies on toxicity outcome.
| Author, year of publication | Study population | HN subsite(s) | Clinical endpoint | Imaging modality | Textural and dosimetric parameters | ROI(s) | Tested ML algorithm(s) | Statistical findings and model performance |
|---|---|---|---|---|---|---|---|---|
| Humbert-Vidan et al, 2021 ( | 96 pts (of these, 50% controls) | Mixed | Prediction of osteoradionecrosis of the mandible | CT | D | Mandible | LR, SVM, RF, AdaBoost, ANN | No statistically significant difference was found among the models in terms of accuracy, TPR, TNR, PPV, or NPV |
| Zhang et al, 2020 ( | 242 pts | NPC | Early radiation-induced brain (temporal lobes) injury prediction | MRI | I | Temporal lobes | RF (3 models) | The incorporation of textural features yielded better model performance; features derived from T2-w images achieved higher performance than those extracted from T1-w images. In the testing cohort, models 1, 2, and 3 yielded AUCs of 0.830 (95% CI: 0.823–0.837), 0.773 (95% CI: 0.763–0.782), and 0.716 (95% CI: 0.699–0.733), respectively |
| Guo et al., 2019 ( | 146 pts | PGs | Correlation between voxel dose and xerostomia recovery 18 months after RT | None | D | PGs, SMGs | LR with ridge regularization | The AUC scores for the ridge logistic regression model evaluated by 10-fold crossvalidation for recovery and injury prediction were 0.68 ± 0.07 and 0.74 ± 0.03, respectively. |
| Leng et al., 2019 ( | 77 pts, 67 healthy controls | NPC | Identification of biomarkers of WM injury | MR | – | 116 brain regions (90 for the brain lobes and 26 for the cerebellum) per the AAL method | SVM | WM regions and WM connections were involved in RBI. The SVM classifier showed satisfactory performances (GR, Sn, Sp) for both FA and WM connections in discriminating patients and controls at all-time points (0–6, 6–12, >12 months) |
| Abdollahi et al., 2018 ( | 47 pts, 94 cochleas, 490 radiomic features | Mixed subsites | Sensorineural hearing loss prediction following chemoradiotherapy | CT | I, D | Cochlea | Decision stump, Hoeffding, C4.5, Bayesian network, naive, adaptive boosting, bootstrap aggregating, Classification | Predictive power was >70% for all models, with Decision stump and Hoeffding being the best-performing models. Incorporation of the gEUD improved both precision and AUC of all models, while accuracy was not affected |
| Dean et al., 2018 ( | 173 pts + 90 pts for external validation | Mixed subsites | Peak grade of acute dysphagia prediction (severe = CTCAE 3.0 grade ≥3 vs. nonsevere = CTCAE 3.0 grade <3) | None | D | Pharyngeal mucosa | PLR, SVC, RFC (each trained and validated on standard dose-volume metrics and spatial dose-metrics) | PLR was not outperformed by any of the more complex models, on both internal and external validation (AUC: 0.76 and 0.82 for the standard-dose model and AUC: 0.75 and 0.73 for the spatial model, respectively). Calibration was superior for the RFC model. Dosimetric parameters (DVH, DLH and DCH) were relevant for accurate toxicity prediction: the volume of pharyngeal mucosa receiving ≥1 Gy should be minimized |
| Gabrys et al., 2018 ( | 153 pts, 24 selected radiomic features | Mixed subsites | Evaluation of xerostomia risk prediction with integrated ML models (clinical, dosimetric, and radiomic features) vs. NTCP models based on mean RT dose to the PGs | CT | I, D | Ipsi- and contralateral PGs | LR-L1, LR-L2, LR-EN, kNN, SVM, ET, GTB | SVMs were the top performing classifiers in time-specific xerostomia prediction (early, late, long term). In the longitudinal approach, the best models were GTB, ET and SVM. LR models were the best in feature selection, although selecting features did not provide any improvement in predictive performance. The NTCP mean dose-based models failed to predict xerostomia (AUC <0.60) |
| Cheng Z et al., 2017 ( | 391 pts | Mixed subsites | Prediction of WL ≥5 kg at 3 months post-RT | None | D | Pharyngeal constrictors, cricopharyngeus, masticator, temporalis, pterygoids, oral cavity, oral mucosa, soft palate, larynx, parotid gland, submandibular glands | CART algorithms | CART model encompassing toxicity and QoL data performed better than the one including baseline characteristics and dosimetric data (AUC: 0.82 vs. 0.77, Sn: 0.98 vs. 0.77, Sp 0.59 vs. 0.67, PPV 0.46 vs. 0.43, NPV: 0.99 vs. 0.90, respectively) |
| Soares et al., 2017 ( | 138 pts | Mixed subsites | Predicting xerostomia after RT | None | D | PGs | RF, stochastic boosting, SVM, NN, model-based clustering and LR | RF yielded the best model performance (AUC: 0.73); the incorporation of clinical (gender, age, baseline xerostomia) and dosimetric parameters (PG Dmean) outperformed all other RF combinations |
| Pota et al., 2015 ( | 21 pts, 42 parotids | NPC | Parotid gland shrinkage prediction | CT | I | Ipsi- and contralateral PGs | LFA, LDA, LR, 0-R method | In some cases with only one predictor, the LR method presents the highest accuracy but low specificity, while in other single-variable cases the performances of LDA, LR, and LFA are comparable. If more than one variable is used, the LFA classifier is the best in almost all cases (best accuracy and sensitivity), while specificity is comparable with that of the other classifiers. Adding a variable to a model hardly worsens the performances of both LDA and LR, while LFA models tolerate the noise |
ANN, artificial neural network; AUC, area under the curve; CART, classification and regression tree; CT, computed tomography; CTCAE, common terminology criteria for adverse events; D, dosimetric; DCH, dose coverage histogram; DLH, dose lymphocyte histogram; Dmean, mean (RT) dose; DTI, diffusion tensor imaging; DVH, dose volume histogram; ET, extra-trees; gEUD, generalized equivalent uniform dose; GTB, gradient tree boosting; I, imaging; k-NN, k-nearest neighbor; LDA, linear discriminant analysis; LFA, likelihood fuzzy analysis; LR, logistic regression; ML, machine learning; MR, magnetic resonance; NN, neural network; NPC, nasopharyngeal cancer; NPV, negative predictive value; NTCP, normal tissue complication probability; OPC, oropharyngeal cancer; PG, parotid gland; PLR, penalized LR; PPV, positive predictive value; QoL, quality of life; RFC, random forest classification; SMG, submandibular gland; Sn, sensitivity; Sp, specificity; SVM, support vector machine; TNR, true-negative rate; TPR, true-positive rate; T1/T2-w, T1/T2-weighted; TBSS, tract-based spatial statistics; WL, weight loss; WM, white matter.
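The sensitivity (Sn), specificity (Sp), PPV, and NPV figures quoted throughout the toxicity table derive from a binary confusion matrix. A minimal Python sketch with made-up labels (illustrative only, not data from the cited studies):

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity (TPR), specificity (TNR), PPV, and NPV
    from binary ground-truth labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "Sn": tp / (tp + fn),   # sensitivity / true-positive rate
        "Sp": tn / (tn + fp),   # specificity / true-negative rate
        "PPV": tp / (tp + fp),  # positive predictive value
        "NPV": tn / (tn + fn),  # negative predictive value
    }

# Hypothetical toxicity outcomes (1 = toxicity) vs. model predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
m = confusion_metrics(y_true, y_pred)
```

Note that PPV and NPV, unlike Sn and Sp, depend on event prevalence, which is one reason case–control designs such as Humbert-Vidan et al. (50% controls) report all four.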
Figure 4 Boxplots for global and methodological scores (modified Luo classification) for the studies included in the analysis, categorized according to the task of the proposed algorithm(s). Tr, treatment.
Figure 5 Boxplots for global and methodological scores (modified Luo classification) for the studies included in the analysis, categorized according to imaging data used as input parameters (texture analysis vs. no texture analysis).
Figure 6 Boxplots representing global and methodological scores (modified Luo classification) for the studies included in the analysis, categorized per the presence of texture analysis.