Lorena Hafermann, Nadja Klein, Geraldine Rauch, Michael Kammer, Georg Heinze.
Abstract
There is increasing interest in machine learning (ML) algorithms for predicting patient outcomes, as these methods are designed to automatically discover complex data patterns. For example, the random forest (RF) algorithm is designed to identify relevant predictor variables out of a large set of candidates. In addition, researchers may use external information for variable selection to improve model interpretability and variable selection accuracy, and thereby prediction quality. However, it is unclear to what extent, if at all, RF and ML methods may benefit from external information. In this paper, we examine the usefulness of external information from prior variable selection studies that used traditional statistical modeling approaches such as the Lasso, or suboptimal methods such as univariate selection. We conducted a plasmode simulation study based on subsampling a data set from a pharmacoepidemiologic study with nearly 200,000 individuals, two binary outcomes and 1152 candidate predictor variables (mainly sparse binary). When the scope of candidate predictors was reduced based on external knowledge, RF models achieved better calibration, that is, better agreement between predictions and observed outcome rates. However, prediction quality measured by cross-entropy, AUROC or the Brier score did not improve. We recommend appraising the methodological quality of studies that serve as an external information source for future prediction model development.
Keywords: calibration; machine learning; sparsity; variable selection
Year: 2022 PMID: 35741566 PMCID: PMC9222226 DOI: 10.3390/e24060847
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. Motivating study: prevalence of 1150 binary predictors.
Figure 2. Motivating study: bar chart indicating the distribution of permutation-based predictor importance values (increase in prediction error by permuting a predictor) in an RF model based on 150,000 individuals. Predictors are ordered by descending importance, with more important predictors shown on the left. Permutation predictor importance values were estimated using out-of-bag data in the RF procedure. Here, 537 predictors with negative importance values are not shown. The 20 variables with the highest importance values are provided in Supplemental Table S1.
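The idea behind the permutation-based importance in Figure 2 (increase in prediction error after permuting one predictor) can be illustrated with a minimal sketch. It is not the out-of-bag procedure used in the RF model of the study: the function names, the misclassification-rate error measure, the toy data and the stand-in `toy_model` are all assumptions made for illustration.

```python
import random

def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the observed one."""
    return sum(int(t != p) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in prediction error after permuting each predictor column.

    predict: hypothetical callable mapping one row (a list) to a predicted class.
    X: list of rows; y: list of observed classes.
    """
    rng = random.Random(seed)
    baseline = misclassification_rate(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between predictor j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = misclassification_rate(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy data (assumed): the outcome equals predictor 0; predictor 1 is pure noise.
data_rng = random.Random(1)
X = [[i % 2, data_rng.randint(0, 1)] for i in range(200)]
y = [row[0] for row in X]
toy_model = lambda row: row[0]  # stands in for a fitted model
importances = permutation_importance(toy_model, X, y)
```

In this toy setting the noise predictor has an importance of exactly zero because the stand-in model ignores it. With a fitted model and finite data, an uninformative predictor can by chance yield a slightly lower error after permutation, which is how negative importance values can arise.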
Simulation study: number of selected variables (mean [range]) in preceding studies. N, sample size. Results are based on 100 replications.
| Predictability | N | Lasso | Univariate Selection | Union of Lasso and Univariate Selection | Intersection of Lasso and Univariate Selection |
|---|---|---|---|---|---|
| Strong | 4000 | 117 [58, 196] | 210 [175, 228] | 277 [220, 346] | 50 [33, 69] |
| | 2000 | 103 [47, 178] | 111 [105, 118] | 179 [129, 246] | 35 [22, 51] |
| | 1000 | 75 [29, 139] | 55 [51, 60] | 109 [67, 167] | 22 [15, 31] |
| | 500 | 56 [11, 112] | 27 [25, 30] | 71 [32, 130] | 12 [8, 17] |
| | 250 | 33 [9, 89] | 14 [11, 15] | 41 [17, 95] | 6 [1, 11] |
| Weak | 4000 | 68 [22, 152] | 84 [60, 102] | 130 [81, 212] | 22 [13, 35] |
| | 2000 | 55 [11, 151] | 52 [35, 76] | 93 [56, 174] | 15 [7, 25] |
| | 1000 | 37 [0, 131] | 30 [17, 45] | 60 [29, 144] | 8 [0, 18] |
| | 500 | 21 [0, 88] | 17 [4, 29] | 34 [4, 96] | 3 [0, 10] |
| | 250 | 14 [0, 76] | 8 [0, 15] | 20 [2, 81] | 1 [0, 7] |
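The union and intersection columns combine the variable sets chosen by the Lasso and by univariate selection in a preceding study. A minimal sketch with Python sets follows; the function name `combine_selections` and the variable names in the example are hypothetical and do not come from the study data.

```python
def combine_selections(lasso_selected, univariate_selected):
    """Return the candidate predictor sets derived from two preceding selections."""
    lasso = set(lasso_selected)
    univariate = set(univariate_selected)
    return {
        "lasso": lasso,
        "univariate": univariate,
        "union": lasso | univariate,         # selected by at least one method
        "intersection": lasso & univariate,  # selected by both methods
    }

# Hypothetical variable names, for illustration only.
lasso_vars = ["age", "drug_a", "drug_b", "diag_x"]
univariate_vars = ["age", "drug_b", "diag_y"]
sets = combine_selections(lasso_vars, univariate_vars)
sizes = {name: len(members) for name, members in sets.items()}
```

Within one replication the sizes satisfy |union| = |Lasso| + |univariate| - |intersection|. The tabulated means follow the same identity up to rounding, for example 117 + 210 - 50 = 277 in the first row.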
Simulation study: mean cross-entropy achieved by Lasso models and RF models with different preselection of variables in preceding studies for M1: no preselection; M2: preselection based on Lasso; M3: preselection based on intersection of Lasso and univariate selection; M4: preselection based on union of Lasso and univariate selection; M5: preselection based on optimum of the Lasso and univariate model. Results are based on 1000 replications. Bold numbers indicate optimal model in a scenario.
| Predictability | Sample Size | Lasso | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|---|---|
| Strong | 4000 | 5081 | 5050 | | 4990 | 5024 | 5045 |
| | 2000 | 5152 | 5116 | | 5073 | 5090 | 5122 |
| | 1000 | 5253 | 5195 | | 5185 | 5178 | 5236 |
| | 500 | 5396 | 5296 | | 5368 | 5298 | 5419 |
| | 250 | 5599 | | 5455 | 5609 | 5466 | 5581 |
| Weak | 4000 | 6621 | | 6606 | 6633 | 6625 | 6644 |
| | 2000 | 6660 | | 6662 | 6688 | 6677 | 6707 |
| | 1000 | 6713 | | 6733 | 6751 | 6749 | 6796 |
| | 500 | 6777 | | 6816 | 6839 | 6850 | 6897 |
| | 250 | 6846 | | 6894 | 6957 | 6959 | 6987 |
Simulation study: mean AUROC achieved by Lasso models and RF models with different preselection of variables in preceding studies for M1: no preselection; M2: preselection based on Lasso; M3: preselection based on intersection of Lasso and univariate selection; M4: preselection based on union of Lasso and univariate selection; M5: preselection based on optimum of the Lasso and univariate model. Results are based on 1000 replications. Bold numbers indicate optimal model in a scenario.
| Predictability | Sample Size | Lasso | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|---|---|
| Strong | 4000 | 0.809 | 0.812 | | 0.813 | 0.811 | 0.809 |
| | 2000 | 0.803 | 0.807 | | 0.805 | 0.805 | 0.801 |
| | 1000 | 0.794 | 0.802 | | 0.794 | 0.797 | 0.790 |
| | 500 | 0.780 | | 0.789 | 0.775 | 0.785 | 0.773 |
| | 250 | 0.760 | | 0.767 | 0.744 | 0.769 | 0.756 |
| Weak | 4000 | 0.627 | | 0.628 | 0.623 | 0.623 | 0.620 |
| | 2000 | 0.619 | | 0.617 | 0.611 | 0.614 | 0.609 |
| | 1000 | 0.606 | | 0.601 | 0.593 | 0.603 | 0.596 |
| | 500 | 0.587 | | 0.579 | 0.562 | 0.586 | 0.577 |
| | 250 | 0.569 | | 0.562 | 0.547 | 0.563 | 0.556 |
Simulation study: mean Brier score achieved by Lasso models and RF models with different preselection of variables in preceding studies for M1: no preselection; M2: preselection based on Lasso; M3: preselection based on intersection of Lasso and univariate selection; M4: preselection based on union of Lasso and univariate selection; M5: preselection based on optimum of the Lasso and univariate model. Results are based on 1000 replications. Bold numbers indicate optimal model in a scenario.
| Predictability | Sample Size | Lasso | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|---|---|
| Strong | 4000 | 0.167 | 0.166 | | 0.164 | 0.165 | 0.166 |
| | 2000 | 0.170 | 0.168 | 0.169 | | 0.168 | 0.170 |
| | 1000 | 0.174 | | 0.171 | 0.171 | 0.171 | 0.172 |
| | 500 | 0.179 | | | 0.178 | 0.176 | 0.179 |
| | 250 | 0.188 | | 0.184 | 0.188 | 0.182 | 0.186 |
| Weak | 4000 | 0.235 | | 0.235 | 0.237 | 0.235 | 0.236 |
| | 2000 | 0.237 | | 0.237 | 0.238 | 0.238 | 0.239 |
| | 1000 | 0.239 | | 0.240 | 0.240 | 0.241 | 0.244 |
| | 500 | 0.242 | | 0.245 | 0.245 | 0.244 | 0.247 |
| | 250 | 0.245 | | 0.247 | 0.250 | 0.250 | 0.252 |
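The three performance measures tabulated above (cross-entropy, AUROC and Brier score) can be computed from observed binary outcomes and predicted probabilities as in the following minimal sketch. The function names and the toy data are assumptions. The cross-entropy is written as a sum over observations, which is also an assumption, since the scaling used in the tables is not stated in this record; dividing by the number of observations gives the per-observation version.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Summed negative Bernoulli log-likelihood (lower is better)."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)  # guard against log(0)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total

def brier_score(y, p):
    """Mean squared difference between prediction and outcome (lower is better)."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auroc(y, p):
    """Probability that a random event is ranked above a random non-event.

    Ties count one half. Assumes both outcome classes are present.
    """
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in events for b in non_events)
    return wins / (len(events) * len(non_events))

# Toy data (assumed): four observations with one tied pair of predictions.
y = [1, 0, 1, 0]
p = [0.8, 0.3, 0.6, 0.6]
ce = cross_entropy(y, p)   # about 2.007
bs = brier_score(y, p)     # 0.1625
auc = auroc(y, p)          # 0.875
```

The pairwise AUROC loop is quadratic in the sample size and is chosen here for readability; a rank-based formula would be preferable for large validation sets.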
Figure 3. Simulation study: mean calibration slopes achieved by Lasso models and RF models with different preselection of variables in preceding studies for M1: no preselection; M2: preselection based on Lasso; M3: preselection based on intersection of Lasso and univariate selection; M4: preselection based on union of Lasso and univariate selection; M5: preselection based on optimum of the Lasso and univariate model. Results are based on 1000 replications.
Figure 4. Calibration plots (independent validation) for models developed on two data sets of size N = 250 randomly picked from the plasmode simulation. Observed outcome rates in 10 groups defined by the deciles of the predicted probabilities are plotted against the mean predicted probabilities in these groups, and the regression line from a logistic model with the log odds of the predicted probabilities as the only predictor is overlaid. Upper panels: data set with weak predictability; lower panels: data set with strong predictability. M1 (left): RF with no preselection; M3 (right): RF with preselection based on the intersection of Lasso and univariate selection in preceding studies.
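The two ingredients of the calibration plots in Figure 4 can be sketched as follows: the grouped points (mean predicted probability versus observed outcome rate in 10 groups of sorted predictions) and the calibration slope from a logistic model with the log odds of the predictions as the only predictor. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the Newton-Raphson fitting routine and the simulated data are all assumptions.

```python
import math
import random

def logit(p, eps=1e-6):
    """Log odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def calibration_slope(y, p, n_iter=25):
    """Fit logit(P(y = 1)) = a + b * logit(p) by Newton-Raphson; return (a, b)."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w               # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def decile_calibration(y, p, n_groups=10):
    """Mean prediction and observed outcome rate per group of sorted predictions."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    points = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if chunk:
            mean_pred = sum(pi for pi, _ in chunk) / len(chunk)
            obs_rate = sum(yi for _, yi in chunk) / len(chunk)
            points.append((mean_pred, obs_rate))
    return points

# Simulated data (assumed): predictions are twice too extreme on the log-odds scale.
rng = random.Random(0)
true_logits = [rng.uniform(-2, 2) for _ in range(5000)]
y = [int(rng.random() < 1 / (1 + math.exp(-t))) for t in true_logits]
p = [1 / (1 + math.exp(-2 * t)) for t in true_logits]
intercept, slope = calibration_slope(y, p)
points = decile_calibration(y, p)
```

In this simulated example the fitted slope should be close to 0.5, because the predicted log odds were constructed to be twice the true ones. A slope of 1 corresponds to predictions that are neither too extreme nor too moderate, and a slope below 1 is commonly read as a sign of overfitting.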