Edwin Kipruto, Willi Sauerbrei.
Abstract
In low-dimensional data and within the framework of a classical linear regression model, we intend to compare variable selection methods and investigate the role of shrinkage of regression estimates in a simulation study. Our primary aim is to build descriptive models that capture the data structure parsimoniously, while our secondary aim is to derive a prediction model. Simulation studies are an important tool in statistical methodology research if they are well designed, executed, and reported. However, bias in favor of an "own" preferred method is prevalent in most simulation studies in which a new method is proposed and compared with existing methods. To overcome such bias, neutral comparison studies, which disregard the superiority or inferiority of a particular method, have been proposed. In this paper, we designed a simulation study with key principles of neutral comparison studies in mind, though certain unintentional biases cannot be ruled out. To improve the design and reporting of a simulation study, we followed the recently proposed ADEMP structure, which entails defining the aims (A), data-generating mechanisms (D), estimand/target of analysis (E), methods (M), and performance measures (P). To ensure the reproducibility of results, we published the protocol before conducting the study. In addition, we presented earlier versions of the design to several experts whose feedback influenced certain aspects of the design. We will compare popular penalized regression methods (lasso, adaptive lasso, relaxed lasso, and nonnegative garrote) that combine variable selection and shrinkage with classical variable selection methods (best subset selection and backward elimination) with and without post-estimation shrinkage of parameter estimates.
Year: 2022 PMID: 36191290 PMCID: PMC9529280 DOI: 10.1371/journal.pone.0271240
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Summary of the simulation design following the ADEMP structure.
Aims (A):
- To compare variable selection methods using different tuning parameters (CV, AIC and BIC) or initial estimates in terms of model selection and prediction.
- To assess the usefulness of post-estimation shrinkage in the prediction of classical variable selection methods and compare the results with penalized methods.
- To compare the amount of shrinkage of regression coefficients of penalized and post-estimation shrinkage methods.
- To assess the performance of different methods in the presence of relatively many noise variables, in larger sample size, in relatively high correlation and when […]

Data-generating mechanisms (D):
- […] New simulations with the same design as the training dataset […]

Estimands/targets of analysis (E):
- Selection status of each covariate and identification of the true model
- Shrinkage factors for each regression estimate
- Model prediction errors

Methods (M):

| Method | Tuning parameters | Initial estimates |
|---|---|---|
| Lasso | 10-fold CV, AIC & BIC | N/A |
| Garrote | 10-fold CV, AIC & BIC | OLS, ridge and lasso |
| Alasso | 10-fold CV, AIC & BIC | OLS, ridge and lasso |
| Rlasso | 10-fold CV, AIC & BIC | N/A |
| Best subset | 10-fold CV, AIC & BIC | N/A |
| BE | 10-fold CV, AIC & BIC | N/A |

Performance measures (P):
- Inclusion and exclusion of variables: FNR & FPR (subsection 2.6.1)
- Classification of models: probabilities (subsection 2.6.1)
- Prediction accuracy: model error (ME) (subsection 2.6.2)
- Variability of ME within and between scenarios (section 5 in […])
*Alasso, Rlasso and BE denote adaptive lasso, relaxed lasso and backward elimination, respectively; FNR and FPR denote false negative rate and false positive rate, respectively.
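The FNR and FPR named among the performance measures can be illustrated with a minimal Python sketch. It assumes the usual definitions (FNR: share of signal variables that a method fails to select; FPR: share of noise variables that it selects); the paper's exact definitions are given in its subsection 2.6.1 and may differ in detail. The function name `selection_error_rates`, the `tol` parameter, and the example coefficients are hypothetical.

```python
def selection_error_rates(true_coefs, est_coefs, tol=0.0):
    """Return (FNR, FPR) for one fitted model.

    Assumption: a covariate is a signal variable if its true coefficient is
    nonzero, and it counts as selected if |estimate| > tol.
    """
    if len(true_coefs) != len(est_coefs):
        raise ValueError("coefficient vectors must have the same length")

    signal = [b != 0 for b in true_coefs]
    selected = [abs(b) > tol for b in est_coefs]

    n_signal = sum(signal)
    n_noise = len(signal) - n_signal

    # Signal variables that were dropped, and noise variables that were kept.
    false_neg = sum(s and not sel for s, sel in zip(signal, selected))
    false_pos = sum((not s) and sel for s, sel in zip(signal, selected))

    # Rates are undefined (NaN) when there are no signal or no noise variables.
    fnr = false_neg / n_signal if n_signal else float("nan")
    fpr = false_pos / n_noise if n_noise else float("nan")
    return fnr, fpr


# Hypothetical example: 3 signal and 3 noise variables.
true_beta = [1.5, 0.0, -0.5, 0.0, 0.25, 0.0]
est_beta = [1.3, 0.0, 0.0, 0.1, 0.2, 0.0]
fnr, fpr = selection_error_rates(true_beta, est_beta)
print(f"FNR = {fnr:.3f}, FPR = {fpr:.3f}")
```

In a simulation study these rates would typically be averaged over replications for each method and scenario.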
Classification of selected models for 15 covariates (Taken from [24]).
| Category | Model category | # of SV excluded | # of NV included |
|---|---|---|---|
| 1 | True | 0 | 0 |
| 2 | Under-selection | 1 or 2 | 0 |
| 3 | Over-selection | 0 | 1 or 2 |
| 4 | Almost-real | 1 or 2 | 1 or 2 |
| 5 | Wrong | Models which cannot be classified in category 1, 2, 3 or 4 | |
*SV and NV denote signal and noise variables, respectively.
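The classification rule in the table can be sketched in Python as follows. The sketch reads "# of SV" as the number of signal variables missing from the selected model and "# of NV" as the number of noise variables included in it, which is an interpretation of the table (a "True" model with both counts at zero) rather than wording confirmed here. The function names, variable names such as `x1`, and the example selections are hypothetical.

```python
from collections import Counter


def classify_model(n_signal_excluded, n_noise_included):
    """Map the two error counts of a selected model to a table category."""
    if n_signal_excluded < 0 or n_noise_included < 0:
        raise ValueError("counts must be non-negative")

    sv_few = 1 <= n_signal_excluded <= 2
    nv_few = 1 <= n_noise_included <= 2

    if n_signal_excluded == 0 and n_noise_included == 0:
        return "True"
    if sv_few and n_noise_included == 0:
        return "Under-selection"
    if n_signal_excluded == 0 and nv_few:
        return "Over-selection"
    if sv_few and nv_few:
        return "Almost-real"
    return "Wrong"


def classify_selection(true_signal, selected):
    """Classify a selected variable set against the set of true signal variables."""
    true_signal, selected = set(true_signal), set(selected)
    n_signal_excluded = len(true_signal - selected)
    n_noise_included = len(selected - true_signal)
    return classify_model(n_signal_excluded, n_noise_included)


# Hypothetical selections from five simulation replications.
true_signal = {"x1", "x2", "x3"}
selections = [
    {"x1", "x2", "x3"},        # nothing missing, nothing extra
    {"x1", "x2"},              # one signal variable missing
    {"x1", "x2", "x3", "x7"},  # one noise variable included
    {"x1", "x3", "x9"},        # one missing and one extra
    {"x1", "x5", "x6", "x7"},  # two missing, three extra
]

counts = Counter(classify_selection(true_signal, s) for s in selections)
proportions = {cat: n / len(selections) for cat, n in counts.items()}
print(proportions)
```

Averaging the category indicators over replications, as in `proportions`, gives one way to estimate the probability of each model category for a given method and scenario.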