Eliana Lima, Peers Davies, Jasmeet Kaler, Fiona Lovatt, Martin Green.
Abstract
Variable selection in inferential modelling is problematic when the number of variables is large relative to the number of data points, especially when multicollinearity is present. A variety of techniques have been described to identify 'important' subsets of variables from within a large parameter space, but these may produce different results, which creates difficulties with inference and reproducibility. Our aim was to evaluate the extent to which variable selection would change depending on statistical approach and whether triangulation across methods could enhance data interpretation. A real dataset containing 408 subjects, 337 explanatory variables and a normally distributed outcome was used. We show that, with model hyperparameters optimised to minimise cross-validation error, ten methods of automated variable selection produced markedly different results: different variables were selected and model sparsity varied greatly. Comparison between multiple methods provided valuable additional insights. Two variables that were consistently selected and stable across all methods accounted for the majority of the explainable variability; these were the most plausible important candidate variables. Further variables of importance were identified by evaluating selection stability across all methods. In conclusion, triangulation of results across methods, including use of covariate stability, can greatly enhance data interpretation and confidence in variable selection.
Year: 2020 PMID: 32409668 PMCID: PMC7224285 DOI: 10.1038/s41598-020-64829-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Numbers of variables selected and model performance for ten automated methods of variable selection.
| Technique | Number of variables in final model | Approach for evaluation of model performance | MAE | R2 |
|---|---|---|---|---|
| Backward stepwise linear regression | 265 | Internal | 26.5 | 0.95 |
| Multivariate adaptive regression splines | 2 | Internal | 64.6 | 0.67 |
| Least absolute shrinkage and selection operator | 36 | Internal | 57.0 | 0.73 |
| Ridge regression | 335 | Internal | 56.0 | 0.78 |
| Elastic net | 42 | Internal | 56.5 | 0.74 |
| Adaptive elastic net | 3 | Internal | 63.6 | 0.68 |
| Smoothly clipped absolute deviation | 19 | Internal | 61.2 | 0.70 |
| Minimax convex penalty | 3 | Internal | 65.0 | 0.67 |
| SparseStep | 3 | Internal | 63.6 | 0.68 |
| Ranking-based variable selection | 5 | Internal | 62.6 | 0.68 |
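
The MAE and R2 columns above are standard summaries of model fit. As a minimal sketch, assuming the usual definitions (mean absolute error, and one minus the ratio of residual to total sum of squares), they could be computed as follows. The source does not state the exact formulas used, and the function names are illustrative.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A model that predicts the outcome mean for every subject has an R2 of 0 under this definition, while perfect predictions give an MAE of 0 and an R2 of 1.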
Figure 1. Covariates selected (out of 337 available) in final models using eight different automated methods of variable selection. Key: Sparse step - SparseStep regression, SCAD - smoothly clipped absolute deviation, Elastic net - elastic net regression, RBVS - ranking-based variable selection, MCP - minimax convex penalty, MARS - multivariate adaptive regression splines, Lasso - least absolute shrinkage and selection operator regression, Aenet - adaptive elastic net regression.
Coefficients of variables selected in final models of eight automatic variable selection methods (blank spaces indicate the variable was not selected).
| Variable ID | MARS | Lasso | Elastic net | Aenet | SCAD | MCP | Sparse step | RBVS |
|---|---|---|---|---|---|---|---|---|
| V29 | 58.5 | 34.8 | 35.7 | 54.0 | 45.7 | 34.6 | 58.6 | 59.2 |
| V40 | *207.2 | 181.5 | 182.2 | 191.7 | 196.2 | 198.4 | 19.2 | 218.7 |
| V6 | 22.5 | 22.9 | 28.6 | 16.0 | 9.7 | 33.5 | ||
| V34 | 16.6 | 18.8 | 2.2 | 22.8 | ||||
| V2 | 5.4 | 6.2 | 0.6 | |||||
| V9 | 8.4 | 8.3 | 5.0 | |||||
| V10 | 2.7 | 4.4 | 7.0 | |||||
| V13 | 6.5 | 7.8 | 2.2 | |||||
| V15 | 12.2 | 13.2 | 3.6 | |||||
| V17 | 2.3 | 3.9 | 3.6 | |||||
| V21 | 8.3 | 10.0 | 3.1 | |||||
| V22 | 3.2 | 2.7 | 1.6 | |||||
| V25 | 13.5 | 14.0 | 5.7 | |||||
| V36 | −15.5 | −15.9 | −9.1 | |||||
| V37 | −7.9 | −9.3 | −0.7 | |||||
| V38 | 12.1 | 12.9 | 2.3 | |||||
| V43 | 21.7 | 22.9 | 11.5 | |||||
| V40^2 | 2.2 | 0.3 | −29.9 | |||||
| V4 | 2.4 | 3.0 | ||||||
| V7 | 9.3 | 9.8 | ||||||
| V8 | 2.3 | 3.0 | ||||||
| V12 | 1.0 | 2.1 | ||||||
| V16 | −2.7 | −2.6 | ||||||
| V19 | −1.1 | −2.2 | ||||||
| V24 | 4.8 | 4.2 | ||||||
| V26 | −1.5 | −2.7 | ||||||
| V27 | 0.2 | 0.7 | ||||||
| V28 | 1.2 | 2.4 | ||||||
| V30 | −12.6 | −14.5 | ||||||
| V32 | −4.5 | −6.4 | ||||||
| V33 | −0.5 | −1.4 | ||||||
| V35 | −0.6 | −1.6 | ||||||
| V39 | 3.3 | 4.6 | ||||||
| V41 | 2.1 | 2.5 | ||||||
| V42 | −7.4 | −8.0 | ||||||
| V44 | 0.8 | 2.5 | ||||||
| V1 | 0.5 | |||||||
| V3 | 39.7 | |||||||
| V5 | −0.6 | |||||||
| V11 | 1.7 | |||||||
| V14 | 4.5 | |||||||
| V18 | 0.2 | |||||||
| V20 | −22.4 | |||||||
| V23 | 0.7 | |||||||
| V31 | −0.5 |
*Represents a hinge function of variable V40.
Key: Sparse step - SparseStep regression, SCAD - smoothly clipped absolute deviation, Elastic net - elastic net regression, RBVS - ranking-based variable selection, MCP - minimax convex penalty, MARS - multivariate adaptive regression splines, Lasso - least absolute shrinkage and selection operator regression, Aenet - adaptive elastic net regression.
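
Triangulation across methods amounts to asking how many methods selected each variable and which variables every method agrees on. A minimal sketch of that bookkeeping follows; the method names and selected sets are made up for illustration and are not the paper's results.

```python
from collections import Counter

# Hypothetical selections: method name -> set of selected variable IDs.
# These sets are invented for illustration only.
selections = {
    "method_a": {"V1", "V2", "V3"},
    "method_b": {"V1", "V2"},
    "method_c": {"V1", "V4"},
}


def selection_counts(selections):
    """Count how many methods selected each variable."""
    counts = Counter()
    for selected in selections.values():
        counts.update(selected)
    return counts


def selected_by_all(selections):
    """Variables chosen by every method."""
    return set.intersection(*selections.values())


counts = selection_counts(selections)
consensus = selected_by_all(selections)
```

Variables in `consensus` are the candidates that no method disagrees on; those with intermediate counts are the ones where the choice of method drives the result.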
Figure 2. Illustration of the distribution of covariate selection stability for ten methods of automated covariate selection. Selection stability was defined as the percentage of bootstrap samples (out of 500) in which each covariate (n = 337) was selected by each specified method. Key: SparseStep - SparseStep regression, SCAD - smoothly clipped absolute deviation, ridge - ridge regression, RBVS - ranking-based variable selection, MCP - minimax convex penalty, MARS - multivariate adaptive regression splines, lasso - least absolute shrinkage and selection operator regression, BSLR - backward stepwise linear regression, enet - elastic net regression, Aenet - adaptive elastic net regression.
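
The definition of selection stability in the caption above can be sketched as a generic bootstrap loop. This is a sketch under stated assumptions, not the authors' code: `selection_stability` and its parameters are hypothetical names, and `select` stands in for any of the ten selection methods. The toy selector in the usage lines is a placeholder, not a real method.

```python
import random


def selection_stability(rows, select, n_variables, n_boot=500, seed=0):
    """Percentage of bootstrap samples in which each variable is selected.

    rows        -- list of observations (any structure `select` understands)
    select      -- callable taking a resampled list of rows and returning
                   an iterable of selected variable indices
    n_variables -- total number of candidate variables
    """
    rng = random.Random(seed)
    hits = [0] * n_variables
    n = len(rows)
    for _ in range(n_boot):
        # Resample subjects with replacement, same size as the original data.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        # Count each variable at most once per bootstrap sample.
        for j in set(select(sample)):
            hits[j] += 1
    return [100.0 * h / n_boot for h in hits]


# Usage with a toy stand-in selector that always keeps variable 0.
def always_first(sample):
    return [0]


rows = [((0.1, 0.5), 1.0), ((0.4, 0.2), 2.0), ((0.9, 0.7), 3.0)]
stability = selection_stability(rows, always_first, n_variables=2, n_boot=50)
```

In practice `select` would refit the chosen method on each resample and return the indices of the covariates it retains.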
Maximum and median selection stability values (%) for covariates across eight statistical models, ranked in order of median stability. Covariates shown all had a stability >90% in at least one method (BSLR and ridge methods excluded).
| Variable ID | Maximum Stability | Median Stability | SCAD | Lasso | MARS | MCP | Aenet | Enet | RBVS | SparseStep |
|---|---|---|---|---|---|---|---|---|---|---|
| V40 | 100 | 100 | 100 | 100 | 65 | 100 | 100 | 100 | 99 | 100 |
| V29 | 100 | 97 | 97 | 100 | 54 | 91 | 100 | 100 | 60 | 97 |
| V34 | 100 | 79 | 71 | 100 | 88 | 52 | 94 | 100 | 29 | 27 |
| V6 | 92 | 73 | 83 | 90 | 39 | 64 | 89 | 92 | 20 | 61 |
| V36 | 92 | 65 | 73 | 92 | 47 | 57 | 83 | 90 | 0 | 12 |
| V42 | 95 | 51 | 42 | 91 | 60 | 30 | 83 | 95 | 0 | 7 |
| V39 | 95 | 48 | 49 | 95 | 36 | 46 | 80 | 94 | 10 | 17 |
| V21 | 93 | 46 | 57 | 90 | 20 | 34 | 83 | 93 | 2 | 16 |
| V37 | 94 | 45 | 53 | 91 | 37 | 35 | 62 | 94 | 0 | 10 |
| V10 | 94 | 44 | 52 | 92 | 20 | 37 | 82 | 94 | 2 | 7 |
| V30 | 98 | 40 | 32 | 98 | 47 | 22 | 64 | 98 | 0 | 10 |
| X1 | 92 | 38 | 23 | 88 | 65 | 12 | 53 | 92 | 1 | 1 |
| V2 | 92 | 36 | 42 | 90 | 12 | 31 | 71 | 92 | 3 | 26 |
| V4 | 90 | 31 | 36 | 88 | 7 | 21 | 61 | 90 | 1 | 26 |
| X2 | 93 | 28 | 20 | 91 | 36 | 17 | 54 | 93 | 0 | 0 |
| V19 | 90 | 27 | 32 | 88 | 4 | 21 | 65 | 90 | 0 | 0 |
| X3 | 95 | 25 | 29 | 91 | 7 | 20 | 66 | 95 | 0 | 1 |
| V8 | 91 | 25 | 26 | 84 | 4 | 17 | 57 | 91 | 1 | 23 |
| X4 | 99 | 22 | 27 | 97 | 10 | 18 | 66 | 99 | 1 | 14 |
| V41 | 95 | 20 | 15 | 92 | 25 | 6 | 74 | 95 | 13 | 5 |
| X5 | 92 | 19 | 9 | 83 | 29 | 7 | 42 | 92 | 0 | 0 |
| X6 | 93 | 18 | 11 | 90 | 25 | 9 | 61 | 93 | 0 | 0 |
| X8 | 92 | 13 | 15 | 87 | 5 | 10 | 61 | 92 | 0 | 0 |
| X9 | 92 | 7 | 10 | 87 | 2 | 5 | 47 | 92 | 0 | 2 |
Key: SparseStep - SparseStep regression, SCAD - smoothly clipped absolute deviation, Enet - elastic net regression, RBVS - ranking-based variable selection, MCP - minimax convex penalty, MARS - multivariate adaptive regression splines, Lasso - least absolute shrinkage and selection operator regression, Aenet - adaptive elastic net regression.
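
The maximum and median columns in the table above summarise each covariate's stability across the eight methods. A minimal sketch of that summary, using the V40 and V29 rows as input, is shown below. For some other rows the median of the rounded percentages differs slightly from the tabulated value (for V34 it is 79.5 against a tabulated 79), presumably because the published medians were derived from unrounded stabilities.

```python
from statistics import median

# Per-method stability (%) in the table's column order:
# SCAD, Lasso, MARS, MCP, Aenet, Enet, RBVS, SparseStep
stability = {
    "V40": [100, 100, 65, 100, 100, 100, 99, 100],
    "V29": [97, 100, 54, 91, 100, 100, 60, 97],
}


def summarise(values):
    """Return (maximum, median) stability across methods."""
    return max(values), median(values)


summary = {var: summarise(vals) for var, vals in stability.items()}

# Rank covariates by median stability, highest first.
ranked = sorted(summary, key=lambda var: summary[var][1], reverse=True)
```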
Spearman correlations between variable selection stability by method.
| Method | SCAD | Lasso | MARS | MCP | Ridge | Aenet | Enet | RBVS | Sparse step | BSLR |
|---|---|---|---|---|---|---|---|---|---|---|
| SCAD | 1.00 | 0.50 | 0.46 | 0.93 | −0.05 | 0.84 | 0.48 | 0.49 | 0.50 | 0.05 |
| Lasso | 0.50 | 1.00 | 0.45 | 0.52 | 0.39 | 0.47 | 0.98 | 0.37 | 0.60 | 0.09 |
| MARS | 0.46 | 0.45 | 1.00 | 0.50 | 0.15 | 0.39 | 0.44 | 0.41 | 0.38 | −0.27 |
| MCP | 0.93 | 0.52 | 0.50 | 1.00 | 0.00 | 0.85 | 0.51 | 0.49 | 0.50 | 0.07 |
| Ridge | −0.05 | 0.39 | 0.15 | 0.00 | 1.00 | −0.08 | 0.39 | 0.07 | 0.19 | 0.15 |
| Aenet | 0.84 | 0.47 | 0.39 | 0.85 | −0.08 | 1.00 | 0.46 | 0.48 | 0.39 | 0.04 |
| Enet | 0.48 | 0.98 | 0.44 | 0.51 | 0.39 | 0.46 | 1.00 | 0.38 | 0.59 | 0.08 |
| RBVS | 0.49 | 0.37 | 0.41 | 0.49 | 0.07 | 0.48 | 0.38 | 1.00 | 0.53 | −0.04 |
| Sparse step | 0.50 | 0.60 | 0.38 | 0.50 | 0.19 | 0.39 | 0.59 | 0.53 | 1.00 | 0.01 |
| BSLR | 0.05 | 0.09 | −0.27 | 0.07 | 0.15 | 0.04 | 0.08 | −0.04 | 0.01 | 1.00 |
Key: Sparse step - SparseStep regression, SCAD - smoothly clipped absolute deviation, Ridge - ridge regression, RBVS - ranking-based variable selection, MCP - minimax convex penalty, MARS - multivariate adaptive regression splines, Lasso - least absolute shrinkage and selection operator regression, BSLR - backward stepwise linear regression, Enet - elastic net regression, Aenet - adaptive elastic net regression.
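
Each entry in the correlation table compares two methods' stability values, presumably across all 337 covariates. As a minimal sketch, assuming the standard definition of Spearman's coefficient (Pearson correlation of the ranks, with tied values given their average rank), one entry could be computed as follows. The function names are illustrative, and the inputs would be two per-covariate stability vectors of equal length.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Average ranks matter here because stability percentages contain many ties, for example the covariates that a sparse method never selects.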