Michael Kammer, Daniela Dunkler, Stefan Michiels, Georg Heinze.
Abstract
BACKGROUND: Variable selection for regression models plays a key role in the analysis of biomedical data. However, inference after selection is not covered by classical statistical frequentist theory, which assumes a fixed set of covariates in the model. This leads to over-optimistic selection and replicability issues.Entities:
Keywords: Linear model; Penalized regression; Selective inference; Simulation study; Variable selection
Year: 2022 PMID: 35883041 PMCID: PMC9316707 DOI: 10.1186/s12874-022-01681-y
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
Simulation setup
| Design parameter | Toy setup | Realistic setup |
|---|---|---|
| Motivation | Simplicity, insight | Realistic data |
| Number of variables | 4 | 17 |
| Type of variables | Continuous | Continuous, binary |
| Distribution of variables | Gaussian | Mixed (Supplementary Table S…) |
| Correlation structures | 7 blocked correlation matrices with no or strong correlation (Supplementary Material Sect. …) | Fixed, mimicking real study (Supplementary Material Sect. …) |
| Coefficient structures | 10 (Supplementary Table S…) | 13 (Supplementary Table S…) |
| True target R2 (noise…) | 0.2, 0.5, 0.8 | 0.2, 0.5, 0.8 |
| Observations per variable | 5, 10, 50 | 5, 10, 50 |
| Number of scenarios | 630 | 117 |
| Iterations per scenario | 900 | 900 |
See the description of the simulation setups in the Supplementary Material for more details on the correlation and coefficient structures. The varying design parameters are those varied in the full factorial design of both setups.
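As a concrete illustration of the setup above, the following sketch generates one toy-setup scenario: correlated Gaussian predictors and a linear response whose noise variance is chosen so that the population R2 matches a target value. The function and parameter names (`gen_scenario`, `rho`, `beta`) are ours, not the authors'; the paper's actual correlation and coefficient structures are in its Supplementary Material.

```python
# Hypothetical sketch of one toy-setup scenario: correlated Gaussian
# predictors and a linear model calibrated to a target population R^2.
import numpy as np

def gen_scenario(n, beta, rho, target_r2, rng):
    p = len(beta)
    # exchangeable correlation: rho off the diagonal, 1 on the diagonal
    # (a stand-in for the paper's blocked correlation matrices)
    corr = np.full((p, p), rho)
    np.fill_diagonal(corr, 1.0)
    X = rng.multivariate_normal(np.zeros(p), corr, size=n)
    signal = X @ beta
    # pick sigma^2 so that var(signal) / (var(signal) + sigma^2) = target R^2
    var_signal = beta @ corr @ beta
    sigma2 = var_signal * (1 - target_r2) / target_r2
    y = signal + rng.normal(scale=np.sqrt(sigma2), size=n)
    return X, y

rng = np.random.default_rng(1)
X, y = gen_scenario(n=200, beta=np.array([1.0, 0.5, 0.0, 0.0]),
                    rho=0.3, target_r2=0.5, rng=rng)
```

With 4 variables and, e.g., 50 observations per variable, this reproduces the scale of one toy-setup scenario; varying `rho`, `beta`, `target_r2`, and `n` over a grid yields the full factorial design.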
Overview of methods investigated in this study
| Method | Variable selection | Tuning | Inference |
|---|---|---|---|
| Full | None | None | Wald CI |
| Oracle | None | None | Wald CI |
| Lasso-CV-Split | Lasso | tenfold CV | Split-sample |
| Lasso-CV-PoSI | Lasso | tenfold CV | Universally valid post-selection inference […] |
| Lasso-CV-SI | Lasso | tenfold CV | Exact post-selection inference […] |
| Lasso-Neg-SI | Lasso | Fixed penalization parameter […] | Exact post-selection inference […] |
| ALasso-CV-Split | Adaptive Lasso | tenfold CV | Split-sample |
| ALasso-CV-PoSI | Adaptive Lasso | tenfold CV | Universally valid post-selection inference […] |
| ALasso-CV-SI | Adaptive Lasso | tenfold CV | Exact post-selection inference […] |
| ALasso-Neg-SI | Adaptive Lasso | Fixed penalization parameter […] | Exact post-selection inference […] |
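The "Split" rows are the simplest of these methods to sketch: select variables with the lasso on one half of the data, then compute classical Wald CIs for the selected submodel on the other half, which are valid because the inference half played no role in selection. The sketch below fixes the penalty `lam` for brevity (the study tunes it by tenfold CV) and uses a minimal coordinate-descent lasso; all names are illustrative, not the authors' code.

```python
# Illustrative split-sample inference: lasso selection on one half,
# naive OLS Wald CIs on the held-out half. Assumed names throughout.
import numpy as np
from scipy.stats import t

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

def split_sample_ci(X, y, lam=0.1, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    a, b = idx[::2], idx[1::2]                     # selection / inference halves
    sel = np.flatnonzero(lasso_cd(X[a], y[a], lam) != 0)
    if sel.size == 0:
        return sel, None                           # empty model selected
    # classical OLS inference for the selected submodel on fresh data
    Xb = np.column_stack([np.ones(len(b)), X[b][:, sel]])
    coef, *_ = np.linalg.lstsq(Xb, y[b], rcond=None)
    resid = y[b] - Xb @ coef
    dof = len(b) - Xb.shape[1]
    se = np.sqrt(resid @ resid / dof * np.diag(np.linalg.inv(Xb.T @ Xb)))
    q = t.ppf(1 - alpha / 2, dof)                  # 90% CI for alpha = 0.10
    return sel, np.column_stack([coef - q * se, coef + q * se])
```

The PoSI and exact ("SI") post-selection methods in the other rows replace the Wald step with wider simultaneous intervals or with intervals conditioned on the selection event; those require the cited specialized machinery and are not sketched here.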
Primary performance measures investigated in this study
| Measure | Definition | Approximation by simulation |
|---|---|---|
| Coverage | | |
| Power | | |
| Type 1 error | | |
In the definitions, expectations are taken over all iterations of a simulation scenario; the full model uses all predictors, whereas the selected model may differ between iterations, and indicator functions mark the events specified between square brackets. Note that for methods without variable selection, the estimands reduce to the usual definitions of frequentist properties. More details on the derivation of the approximation in the simulation are given in Supplementary Material Sect. 3.1.
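The key point in these estimands is the conditioning: selective coverage of a variable's CI is averaged only over the iterations in which that variable was actually selected. A minimal sketch of that Monte Carlo approximation (our own data layout and names, not the paper's):

```python
# Per-variable selective coverage, approximated over simulation iterations.
# selected: (iters, p) bool; lower/upper: (iters, p) CI bounds (NaN where the
# variable was not selected); beta_true: (p,) true coefficients.
import numpy as np

def selective_coverage(selected, lower, upper, beta_true):
    covered = (lower <= beta_true) & (beta_true <= upper)  # NaN -> False
    hits = (covered & selected).sum(axis=0)
    n_sel = selected.sum(axis=0)
    # coverage is undefined (NaN) for variables that were never selected
    return np.divide(hits, n_sel,
                     out=np.full(len(n_sel), np.nan), where=n_sel > 0)
```

Selective power and type 1 error follow the same pattern, replacing the coverage indicator with "CI excludes zero", restricted to true predictors or to noise variables, respectively.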
Fig. 1 Simulation study: selective coverage from both simulation setups of selective 90% CIs for the submodel inference target. For each scenario, the actual selective coverage rate was estimated by simulation, and over all scenarios, the values were summarised by boxplots. See Supplementary Figure S4 for stratified results. The nominal confidence level of 0.9 used in the construction of the CIs is depicted as a dashed line. Colors indicate the type of variable selection. Monte Carlo error is indicated by grey areas describing binomial 95% CIs expected at the nominal confidence level with 900 iterations
Fig. 2 Toy simulation study: selective power and selective type 1 error of selective 90% confidence intervals. For each scenario, power or type 1 error was estimated by simulation, and over all scenarios with specified target simulation R2, the values were summarised by boxplots. The target values are depicted as dashed lines (1 for power, 0.1 for type 1 error). Colors indicate the type of variable selection. Results from the realistic setup are comparable to the ones shown here (Supplementary Figure S5)
Fig. 3 Toy simulation study: median and interquartile range (IQR) of the widths of selective 90% CIs. CIs were standardized. For each scenario the median and IQR of CI widths were computed, and over all variables and scenarios with specified target simulation R2, the values were summarised by boxplots. Dashed lines mark a width of zero. Colors indicate the type of variable selection. Supplementary Figure S7 shows the same results stratified by true predictors and noise variables. Results for the realistic setup are comparable (Supplementary Figure S8)
Fig. 4 Toy simulation study: predictive accuracy in terms of the difference between validation R2 and target simulation R2. The target simulation R2 was 0.2 in the left panel and 0.8 in the right panel. For each scenario, predictive accuracy was estimated by simulation, and over all scenarios, the values were summarised by boxplots. Dashed lines mark an optimal difference of zero. Colors indicate the type of variable selection. Results for the realistic setup are comparable (Supplementary Figure S12)
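The predictive-accuracy measure in Fig. 4 rests on the ordinary out-of-sample R2, evaluated on fresh validation data and compared to the scenario's target simulation R2. A sketch of that computation (our formulation, assumed to match the usual definition):

```python
# Out-of-sample R^2 of a fitted model's predictions on validation data;
# Fig. 4 plots validation_r2(...) minus the scenario's target simulation R^2.
import numpy as np

def validation_r2(y_val, y_pred):
    ss_res = np.sum((y_val - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_val - y_val.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A value near the target R2 means the selected submodel captures the attainable signal; differences below zero indicate lost predictive accuracy.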
Overview of main results for the primary estimands of our simulation study
| Method | Stability | Coverage | Power | Type 1 error |
|---|---|---|---|---|
| Lasso-CV-Split | No concern | Acceptable | Low | Acceptable |
| Lasso-CV-PoSI | No concern | Too high | Low | Low |
| Lasso-CV-SI | Problematic | Acceptable | Acceptable | High |
| Lasso-Neg-SI | Problematic | Acceptable | Acceptable | Acceptable |
| ALasso-CV-Split | No concern | Acceptable | Acceptable | Acceptable |
| ALasso-CV-PoSI | No concern | Too high | Acceptable | Acceptable |
| ALasso-CV-SI | Problematic | Too low | High | High |
| ALasso-Neg-SI | Problematic | Acceptable | High | High |
By “Acceptable” we mean that results were mostly (i.e. in median over all scenarios) within the expected simulation variability.
Fig. 5Real data example: point estimates and 90% selective CIs for regression coefficients. Results are shown at the original scales of the variables. Each method is depicted in a separate panel. The variables are ordered by increasing standardized coefficients. The individual selection frequencies estimated by 100 subsamples are given as percentages above each panel
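The selection frequencies above each panel of Fig. 5 come from refitting the selection procedure on repeated subsamples (the figure uses 100) and counting how often each variable enters the model. A sketch of that resampling loop, where `fit_select` stands in for any of the selection methods and is assumed to return a boolean inclusion vector:

```python
# Per-variable selection frequencies from repeated subsampling.
# fit_select(X_sub, y_sub) -> (p,) boolean inclusion vector (assumed API).
import numpy as np

def selection_frequencies(X, y, fit_select, n_sub=100, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += fit_select(X[idx], y[idx])       # bool vector adds 0/1
    return 100.0 * counts / n_sub                  # percentages, as in Fig. 5
```

Low selection frequencies flag the stability concerns noted in the results table: a variable selected in only a minority of subsamples signals an unstable selection.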