Georg Heinze, Christine Wallisch, Daniela Dunkler.
Abstract
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well-established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10-30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise stability of a final model, unbiasedness of regression coefficients, and validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.
Keywords: change-in-estimate criterion; penalized likelihood; resampling; statistical model; stepwise selection
Year: 2018 PMID: 29292533 PMCID: PMC5969114 DOI: 10.1002/bimj.201700067
Source DB: PubMed Journal: Biom J ISSN: 0323-3847 Impact factor: 2.207
Four potential models to estimate body fat in %
| Model | Intercept: estimate | Intercept: SE | Weight in kg: estimate | Weight in kg: SE | Height in cm: estimate | Height in cm: SE | Abdomen circumference: estimate | Abdomen circumference: SE | Adjusted R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | −14.892 | 2.762 | 0.420 | 0.034 | | | | | 0.381 |
| 2 | 76.651 | 9.976 | 0.582 | 0.034 | −0.586 | 0.062 | | | 0.543 |
| 3 | −47.659 | 2.634 | −0.292 | 0.047 | | | 0.979 | 0.056 | 0.722 |
| 4 | −30.364 | 11.432 | −0.215 | 0.068 | −0.096 | 0.062 | 0.910 | 0.071 | 0.723 |

Estimates are regression coefficients. SE, standard error.
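For reference, the adjusted R² reported for the four models is conventionally computed from the ordinary R² as shown below. This is the standard textbook definition, not a formula quoted from the source; the symbols n (number of observations) and p (number of IVs, excluding the intercept) are introduced here for illustration.

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

Because the correction factor grows with p, adding an IV raises the adjusted R² only if it improves the fit by more than would be expected by chance.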
Figure 1. Simulation study to illustrate possible differential effects of variable selection. Graphs show scatterplots of the estimated regression coefficients β1 and β2 in 50 simulated datasets with two correlated standard normal IVs. Circles and dots indicate simulated datasets where a test of the null hypothesis β2 = 0 yields p-values greater or lower than 0.157, respectively. The dashed lines are regression lines of β1 on β2; thus they indicate how β1 would change if β2 is set to 0.
Some popular variable selection algorithms
| Algorithm | Description | Stopping rule |
|---|---|---|
| Backward elimination (BE) | Start with the global model. Repeat: Remove the most insignificant independent variable (IV) and reestimate the model. Stop if no insignificant IV is left. | All (Wald) |
| Forward selection (FS) | Start with the most significant univariable model. Repeat: Evaluate the added value of each IV that is currently not in the model. Include the most significant IV and reestimate the model. Stop if no significant IV is left to include. | All (score) |
| Stepwise forward | Start with the null model. Repeat: Perform an FS step. After each inclusion of an IV, perform a BE step. In subsequent FS steps, reconsider IVs that were removed in former steps. Stop if no IV can be removed or added. | All |
| Stepwise backward | Stepwise approach (see above) starting with the global model, cycling between BE and FS steps until convergence. | All |
| Augmented backward elimination | Combines BE with a standardized change‐in‐estimate criterion. IVs are not excluded even if | No further variable to exclude by significance and change‐in‐estimate criteria |
| Best subset selection | Estimate all 2 | No subset of variables attains a better information criterion. |
| Univariable selection | Estimate all univariable models. Let the multivariable model include all IVs with | |
| LASSO | Imposes a penalty on the sum of squares or log likelihood that is equal to the absolute sum of regression coefficients. | Relative weight of penalty is optimized by cross‐validated sum of squares or deviance. |
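The backward elimination row of the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses ordinary least squares on simulated data, approximates the Wald p-values with the standard normal distribution instead of the t distribution, and borrows the significance level 0.157 from the body fat example. The function names `wald_p_values` and `backward_elimination` are hypothetical.

```python
import math
import numpy as np


def wald_p_values(design, y):
    """Fit OLS and return coefficients with two-sided Wald p-values.

    Assumption: p-values use a standard normal approximation.
    """
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)              # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    z = beta / np.sqrt(np.diag(cov))              # Wald statistics
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, p


def backward_elimination(X, y, names, alpha=0.157):
    """Repeatedly drop the least significant IV until all p-values are below alpha."""
    active = list(range(X.shape[1]))              # start with the global model
    while active:
        design = np.column_stack([np.ones(len(y)), X[:, active]])
        _, p = wald_p_values(design, y)
        p_iv = p[1:]                              # the intercept is never removed
        worst = int(np.argmax(p_iv))
        if p_iv[worst] < alpha:                   # no insignificant IV is left
            break
        del active[worst]                         # remove it and re-estimate
    return [names[j] for j in active]


# Simulated data: only x1 and x2 have true effects.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(n)
names = ["x1", "x2", "x3", "x4"]
selected = backward_elimination(X, y, names)
print(selected)
```

With a threshold of 0.157, each noise variable still has a noticeable chance of surviving the elimination, which is one reason the selection should be accompanied by a stability investigation.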
Figure 2. A schematic network of dependencies arising from variable selection. β, regression coefficient; IV, independent variable; RMSE, root mean squared error
Some recommendations on variable selection, shrinkage, and stability investigations based on events‐per‐variable ratios
| Situation | Recommendation |
|---|---|
| For some IVs it is known from previous studies that their effects are strong, for example age in cardiovascular risk studies or tumor stage at diagnosis in cancer studies. | Do not perform variable selection on IVs with known strong effects. |
| | Variable selection (on IVs with unclear effect size) should be accompanied by stability investigation. |
| | Variable selection on IVs with unclear effect size should be accompanied by postestimation shrinkage methods (e.g. Dunkler et al., |
| | Variable selection not recommended. |
Implementations of variable selection methods and resampling‐based stability analysis in selected statistical software
| | | | | | | R packages and functions | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Modeling techniques | PROC GLMSELECT | PROC REG | PROC LOGISTIC, PROC PHREG | %ABE macro | | lm(), glm(), | step() | | | | |
| Backward | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No | No | Yes |
| Forward | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No |
| Stepwise forward | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No |
| Stepwise backward | No | No | No | No | No | No | No | Yes | No | No | No |
| Best subset/other | Yes | Yes | Yes | No | — | No | No | No | Yes | No | No |
| Augmented backward | No | No | No | Yes | No | | — | — | — | — | — |
| LASSO | Yes | No | No | No | No | No | No | No | No | Yes | No |
| Multi‐model inference | (Yes) | No | No | No | No | No | No | No | Yes | No | (Yes) |
| Bootstrap stability investigation | Yes | No | No | (No) | No(!) | No | No | No | No | No | (Yes) |
| Linear | Yes | Yes | No | Yes | Yes | lm() | Yes | Yes | Yes | Yes | Yes |
| Logistic | No | No | Yes (LOGISTIC) | Yes | Yes | glm() | Yes | Yes | Yes | Yes | Yes |
| Cox | No | No | Yes (PHREG) | Yes | Yes | coxph() | Yes | Yes | ? | Yes | Yes |
Body fat study: global model, model selected by backward elimination with a significance level of 0.157 (AIC selection), and some bootstrap‐derived quantities useful for assessing model uncertainty
| Predictors | Estimate (global model) | Standard error (global model) | Bootstrap inclusion frequency (%) | Estimate (selected model) | Standard error (selected model) | RMSD ratio | Relative conditional bias (%) | Bootstrap median | Bootstrap 2.5th percentile | Bootstrap 97.5th percentile |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 4.143 | 23.266 | 100 (fixed) | 5.945 | 8.150 | 0.97 | | 5.741 | −49.064 | 50.429 |
| height | −0.108 | 0.074 | 100 (fixed) | −0.130 | 0.047 | 1.02 | +4.9 | −0.116 | −0.253 | 0.043 |
| abdomen | 0.897 | 0.091 | 100 (fixed) | 0.875 | 0.065 | 1.05 | −2.1 | 0.883 | 0.687 | 1.050 |
| wrist | −1.838 | 0.529 | 97.6 | −1.729 | 0.483 | 1.07 | −1.6 | −1.793 | −2.789 | −0.624 |
| age | 0.074 | 0.032 | 84.6 | 0.060 | 0.025 | 1.14 | +4.2 | 0.069 | 0 | 0.130 |
| neck | −0.398 | 0.234 | 62.9 | −0.330 | 0.219 | 1.24 | +30.3 | −0.387 | −0.825 | 0 |
| forearm | 0.276 | 0.206 | 54.0 | 0.365 | 0.192 | 1.14 | +46.6 | 0.264 | 0 | 0.641 |
| chest | −0.127 | 0.108 | 50.9 | −0.135 | 0.088 | 1.14 | +68.0 | −0.055 | −0.342 | 0 |
| thigh | 0.173 | 0.146 | 47.9 | | | 1.13 | +64.4 | 0 | 0 | 0.471 |
| biceps | 0.175 | 0.170 | 43.1 | | | 1.15 | +101.4 | 0 | 0 | 0.541 |
| hip | −0.149 | 0.143 | 41.4 | | | 1.08 | +85.3 | 0 | −0.415 | 0 |
| ankle | 0.190 | 0.220 | 33.5 | | | 1.11 | +82.2 | 0 | −0.370 | 0.605 |
| weight | −0.025 | 0.147 | 28.3 | | | 0.95 | +272.3 | 0 | −0.355 | 0.295 |
| knee | −0.038 | 0.244 | 17.8 | | | 0.78 | +113.0 | 0 | −0.505 | 0.436 |
RMSD, root mean squared difference, see Section 3.2(iv).
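The bootstrap-derived columns of the table above can be computed from a matrix of resampled coefficients, as in the following sketch. The definitions are assumptions, since Section 3.2(iv) is not reproduced here: a coefficient is set to 0 in resamples where the variable was not selected; the RMSD ratio divides the root mean squared difference between resampled and global-model coefficients by the global-model standard error; and the relative conditional bias compares the mean coefficient over resamples that selected the variable with the global-model coefficient. The function name `stability_summary` and the toy numbers are hypothetical.

```python
import numpy as np


def stability_summary(boot_coefs, global_coef, global_se):
    """Summarize resampled coefficients (rows: resamples, columns: IVs).

    Assumption: a coefficient of exactly 0 means the IV was not selected.
    """
    boot = np.asarray(boot_coefs, dtype=float)
    global_coef = np.asarray(global_coef, dtype=float)
    global_se = np.asarray(global_se, dtype=float)

    selected = boot != 0.0
    counts = selected.sum(axis=0)
    inclusion_pct = 100.0 * selected.mean(axis=0)

    # Root mean squared difference to the global model, relative to its SE.
    rmsd = np.sqrt(((boot - global_coef) ** 2).mean(axis=0))
    rmsd_ratio = rmsd / global_se

    # Mean over resamples that selected the IV (unselected entries are 0).
    cond_mean = np.where(counts > 0, boot.sum(axis=0) / np.maximum(counts, 1), np.nan)
    rel_cond_bias_pct = 100.0 * (cond_mean / global_coef - 1.0)

    lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
    return {
        "inclusion_pct": inclusion_pct,
        "rmsd_ratio": rmsd_ratio,
        "rel_cond_bias_pct": rel_cond_bias_pct,
        "median": np.median(boot, axis=0),
        "pct_2.5": lower,
        "pct_97.5": upper,
    }


# Toy input: 4 resamples, 2 IVs; the second IV is selected in half of them.
boot = [[1.1, 0.0], [0.9, 1.0], [1.0, 0.0], [1.2, 1.0]]
summary = stability_summary(boot, global_coef=[1.0, 0.5], global_se=[0.1, 0.25])
for key, value in summary.items():
    print(key, np.round(value, 3))
```

In the toy input the second IV is selected only when its estimate is large, so its conditional mean (1.0) is twice the global-model coefficient (0.5). The same mechanism would explain the large positive relative conditional bias of the rarely selected predictors in the table.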
Body fat study: model selection frequencies. Selected model is model 4
| Model | Included predictors | Count | Percent | Cumulative percent |
|---|---|---|---|---|
| 1 | Height abdomen wrist age chest biceps | 32 | 3.2 | 3.2 |
| 2 | Height abdomen wrist age neck forearm thigh hip | 29 | 2.9 | 6.1 |
| 3 | Height abdomen wrist age forearm chest | 19 | 1.9 | 8.0 |
| 4 | Height abdomen wrist age neck forearm chest | 19 | 1.9 | 9.9 |
| 5 | Height abdomen wrist age neck forearm chest thigh hip | 19 | 1.9 | 11.8 |
| 6 | Height abdomen wrist age neck chest biceps | 18 | 1.8 | 13.6 |
| 7 | Height abdomen wrist age neck thigh biceps hip | 16 | 1.6 | 15.2 |
| 8 | Height abdomen wrist age neck forearm | 15 | 1.5 | 16.7 |
| 9 | Height abdomen wrist age neck biceps | 15 | 1.5 | 18.2 |
| 10 | Height abdomen wrist age neck forearm chest biceps | 14 | 1.4 | 19.6 |
| 11 | Height abdomen wrist age neck forearm chest ankle | 12 | 1.2 | 20.8 |
| 12 | Height abdomen wrist age neck forearm chest thigh hip ankle | 12 | 1.2 | 22.0 |
| 13 | Height abdomen wrist age neck forearm thigh | 10 | 1.0 | 23.0 |
| 14 | Height abdomen wrist age neck forearm biceps | 10 | 1.0 | 24.0 |
| 15 | Height abdomen wrist age forearm chest ankle | 10 | 1.0 | 25.0 |
| 16 | Height abdomen wrist age neck forearm thigh hip knee | 10 | 1.0 | 26.0 |
| 17 | Height abdomen wrist age neck thigh hip | 9 | 0.9 | 26.9 |
| 18 | Height abdomen wrist age chest biceps ankle | 9 | 0.9 | 27.8 |
| 19 | Height abdomen wrist age neck thigh hip ankle | 9 | 0.9 | 28.7 |
| 20 | Height abdomen wrist age chest | 8 | 0.8 | 29.5 |
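A model selection frequency table of this kind can be tabulated from the sets of predictors selected in each resample. The sketch below uses only the Python standard library; the function name `model_frequencies` and the four toy resamples are hypothetical and are not taken from the body fat study.

```python
from collections import Counter


def model_frequencies(selections):
    """Count how often each distinct set of predictors was selected.

    selections: one iterable of predictor names per resample.
    Returns rows of (predictors, count, percent, cumulative percent),
    ordered by decreasing count.
    """
    counts = Counter(frozenset(s) for s in selections)
    total = len(selections)
    rows, cumulative = [], 0.0
    for model, count in counts.most_common():
        percent = 100.0 * count / total
        cumulative += percent
        rows.append((sorted(model), count, percent, cumulative))
    return rows


# Toy input: the order of names within a resample does not matter.
resamples = [
    ["height", "abdomen", "wrist"],
    ["abdomen", "height", "wrist"],
    ["height", "abdomen"],
    ["height", "abdomen", "wrist", "age"],
]
table = model_frequencies(resamples)
for predictors, count, percent, cumulative in table:
    print(" ".join(predictors), count, f"{percent:.1f}", f"{cumulative:.1f}")
```

A slowly rising cumulative percent, as in the table above where the 20 most frequent models cover only 29.5% of resamples, indicates that no single model dominates the selection.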