Paula Dhiman, Jie Ma, Constanza L Andaur Navarro, Benjamin Speich, Garrett Bullock, Johanna A A Damen, Lotty Hooft, Shona Kirtley, Richard D Riley, Ben Van Calster, Karel G M Moons, Gary S Collins.
Abstract
BACKGROUND: To describe and evaluate the methodological conduct of prognostic prediction models developed using machine learning methods in oncology.
Keywords: Machine learning; Methodology; Prediction
Year: 2022 PMID: 35395724 PMCID: PMC8991704 DOI: 10.1186/s12874-022-01577-x
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1 PRISMA flow diagram of studies included in the systematic review
Model type of the 152 models developed in the 62 included publications
| Model characteristics | All models (n = 152) |
|---|---|
| | n (%) |
| Logistic regression | 26 |
| Cox regression | 7 |
| Linear regression | 3 |
| LASSO (Logistic regression) | 1 |
| LASSO (Cox regression) | 1 |
| LASSO (model not specified) | 3 |
| Best subset regression with leave-out cross-validation | 1 |
| Neural network (including deep learning) | 18 |
| Classification tree (e.g., CART, decision tree) | 28 |
| Support vector machine | 12 |
| Naive Bayes | 6 |
| K nearest neighbours | 3 |
| Otherᵃ | 4 |
| Random forest (including random survival forest) | 23 |
| Gradient boosting machine | 8 |
| RUSBoost - boosted random forests | 1 |
| Bagging with J48 selected by Auto-WEKA | 1 |
| CoxBoost - boosted Cox regression | 1 |
| XGBoost: exTreme Gradient Boosting | 1 |
| Gradient boosting machine and Nystroem, combined using elastic net | 1 |
| Adaboost | 1 |
| Bagging, method not specified | 1 |
| Partitioning Around Medoid algorithm and complete linkage method | 1 |
| | 2 [1–4], 1–6 |
CART: Classification And Regression Tree; LASSO: Least Absolute Shrinkage and Selection Operator
ᵃ Other includes voted perceptron; fuzzy logic, soft set theory and soft set computing; hierarchical clustering model based on the unsupervised learning for survival data using the distance matrix of survival curves; Bayes point machine
Methods for predictor selection before and after modelling and hyperparameter tuning for 152 developed clinical prediction models, by modelling type
| | All (n = 152) | Regression-based models (n = 42) | Non-regression-based models (n = 71) | Ensemble models (n = 39) |
|---|---|---|---|---|
| | n (%) | n (%) | n (%) | n (%) |
| A-priori | 5 | 3 | 1 | 1 |
| No selection before modelling | 3 | 1 | 2 | – |
| Univariable | 24 | 12 | 8 | 4 |
| Clinically relevant and available data | 1 | – | 1 | – |
| Dropout technique at input layer | 1 | – | 1 | – |
| Random forest with RPA | 9 | 1 | 6 | 2 |
| Other modelling approachᵃ | 9 | 3 | 4 | 2 |
| Stepwise | 6 | 4 | 2 | – |
| Forward selection | 6 | 5 | – | 1 |
| Backward elimination | 5 | 3 | 2 | – |
| Full model approach (no selection) | 11 | 4 | 5 | 2 |
| Feed forward/backpropagation | 5 | – | 5 | – |
| Recursive partitioning analysis | 7 | – | 7 | – |
| LASSO | 5 | 5 | – | – |
| Gini index (minimised) | 7 | 1 | 4 | 2 |
| Cross validation | 4 | 2 | – | 2 |
| Otherᵇ | 7 | 1 | 2 | 4 |
| Cross validation | 19 | 4 | 7 | 8 |
| Grid search (no further details provided) | 6 | – | 4 | 2 |
| Max tree depth | 2 | – | 1 | 1 |
| Adadelta method | 2 | – | 2 | – |
| Default software values | 2 | – | 1 | 1 |
RPA: Recursive partitioning analysis; LASSO: Least Absolute Shrinkage and Selection Operator
ᵃ Modelling approaches include support vector machine, logistic regression, Cox regression, best subset linear regression, decision tree, meta-transformer (base algorithm of extra trees)
ᵇ Other includes change in unspecified performance measure, stochastic gradient descent, function, aggregation of bootstrapped decision trees and Waikato Environment for Knowledge Analysis for development-only studies, and hyperbolic tangent function, greedy algorithm for all models and using final chosen predictors from comparator model
Sample size and number of candidate predictors informing analyses for 152 developed models, by modelling type
| | Regression-based models (n = 42) | | Non-regression-based models (n = 71) | | Ensemble models (n = 39) | |
|---|---|---|---|---|---|---|
| | Reported, n (%) | Median [IQR], range | Reported, n (%) | Median [IQR], range | Reported, n (%) | Median [IQR], range |
| Model development | 42 (100) | 561 [203 to 2822], 20 to 582,398 | 70 (99) | 447 [156 to 11,901], 20 to 582,398 | 39 (100) | 768 [203 to 1599], 20 to 582,398 |
| Internal validationᵃ | 20 (48) | 122 [82 to 228], 47 to 291,200 | 35 (49) | 145 [90 to 492], 47 to 291,200 | 24 (62) | 162 [97 to 1510], 67 to 291,200 |
| External validation | 12 (29) | 511 [67 to 2300], 11 to 836,659 | 14 (20) | 793 [59 to 1675], 11 to 836,659 | 11 (28) | 313 [229 to 836,659], 11 to 836,659 |
| Model development | 20 (48) | 236 [34 to 1326], 7 to 35,019 | 37 (52) | 62 [22 to 1075], 7 to 45,797 | 10 (26) | 37 [22 to 241], 8 to 35,019 |
| Internal validationᵃ | 2 (5) | 41 [21 to 61], 21 to 61 | 3 (4) | 61 [21 to 62], 21 to 62 | 1 (3) | 61 |
| External validation | 8 (19) | 81 [18 to 327], 7 to 513 | 11 (15) | 19 [7 to 513], 7 to 1323 | 5 (13) | 81 [81 to 81], 7 to 513 |
| Number of candidate predictors | 38 (90) | 21 [15 to 34], 6 to 33,788 | 64 (90) | 16 [12 to 25], 5 to 33,788 | 36 (92) | 25 [14 to 37], 4 to 33,788 |
| Events per predictorᵇ | 20 (48) | 8.0 [7.1 to 23.5], 0.2 to 5836.5 | 35 (49) | 3.4 [1.1 to 19.1], 0.2 to 5836.5 | 10 (26) | 1.7 [1.1 to 6.0], 0.7 to 5836.5 |
ᵃ Combines all internal validation methods, e.g., split sample, cross validation, bootstrapping
ᵇ Events per predictor for model development
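As an illustration of the events-per-predictor footnote, the quantity can be sketched as the number of outcome events in the development data divided by the number of candidate predictors. The function name and the example values below are hypothetical and are not taken from any included study; the review may count predictor parameters rather than variables, which this sketch does not distinguish.

```python
def events_per_predictor(n_events: int, n_candidate_predictors: int) -> float:
    """Return events per candidate predictor for a model development dataset.

    Assumption: a plain ratio of outcome events to candidate predictors.
    """
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    if n_candidate_predictors <= 0:
        raise ValueError("n_candidate_predictors must be positive")
    return n_events / n_candidate_predictors


# Hypothetical development set: 120 outcome events, 15 candidate predictors.
epp = events_per_predictor(120, 15)
print(f"Events per predictor: {epp:.1f}")
```

A low value signals that few outcome events are available for each candidate predictor considered during model development.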