Anupreet Porwal, Adrian E Raftery.
Abstract
Probability models are used for many statistical tasks, notably parameter estimation, interval estimation, inference about model parameters, point prediction, and interval prediction. Thus, choosing a statistical model and accounting for uncertainty about this choice are important parts of the scientific process. Here we focus on one such choice, that of variables to include in a linear regression model. Many methods have been proposed, including Bayesian and penalized likelihood methods, and it is unclear which one to use. We compared 21 of the most popular methods by carrying out an extensive set of simulation studies based closely on real datasets that span a range of situations encountered in practical data analysis. Three adaptive Bayesian model averaging (BMA) methods performed best across all statistical tasks. These used adaptive versions of Zellner’s g-prior for the parameters, where the prior variance parameter g is a function of sample size or is estimated from the data. We found that for BMA methods implemented with Markov chain Monte Carlo, 10,000 iterations were enough. Computationally, we found two of the three best methods (BMA with g = √n and empirical Bayes-local) to be competitive with the least absolute shrinkage and selection operator (LASSO), which is often preferred as a variable selection technique because of its computational efficiency. BMA performed better than Bayesian model selection (in which just one model is selected).
Keywords: Bayesian model averaging; LASSO; interval estimation; model selection; parameter estimation
Year: 2022 PMID: 35412893 PMCID: PMC9169744 DOI: 10.1073/pnas.2120737119
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 12.779
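For readers unfamiliar with the terms in the abstract, the following block sketches the standard textbook form of BMA and of Zellner's g-prior. The notation (Δ for a quantity of interest, D for the data, M_k for a candidate model with design matrix X_k) is assumed here for illustration and does not appear in this record.

```latex
% BMA: the posterior of a quantity \Delta is a mixture over the K candidate models,
% weighted by their posterior model probabilities.
p(\Delta \mid D) \;=\; \sum_{k=1}^{K} p(\Delta \mid M_k, D)\, p(M_k \mid D)

% Zellner's g-prior on the regression coefficients of model M_k:
\beta_k \mid \sigma^2, M_k \;\sim\; \mathcal{N}\!\left(0,\; g\,\sigma^2 \left(X_k^{\top} X_k\right)^{-1}\right)

% Adaptive choice of g named in the abstract (g as a function of sample size):
g = \sqrt{n}
```

Bayesian model selection, by contrast, keeps only the single model with the highest posterior probability instead of averaging over all K models.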
Fig. 1. Sample size n versus the number of candidate variables p for the 14 datasets on which our simulation studies are based. The n = p line is shown in red.
Fig. 2. Performance of 21 methods for inference in linear regression under model uncertainty: “PointEst” is the RMSE for point estimation, “IntEst” is the MIS for interval estimation, “Inference” is 1 – the AUPRC, “Prediction” is the RMSE for point prediction, and “IntPred” is the MIS for interval prediction. “N vars” is the average number of variables used for the task. All metrics are standardized to equal 1 for the JZS method. See Results and Materials and Methods for more information about the ranking and coloring and the definitions of the methods and metrics. Note that BICREG denotes the BICREG-SIS method, in which sure independence screening is used first to reduce the number of variables to 30.
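The metrics named in the Fig. 2 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it assumes MIS denotes the mean interval score of Gneiting and Raftery, that intervals have nominal level 1 − alpha with alpha = 0.05, and that standardization means dividing by the JZS value. The function names are hypothetical.

```python
import math


def rmse(estimates, truths):
    """Root mean squared error between estimates and true values."""
    n = len(truths)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / n)


def interval_score(lower, upper, truth, alpha=0.05):
    """Interval score for one (1 - alpha) interval: width plus a penalty
    of 2/alpha times the distance by which the interval misses the truth.
    alpha = 0.05 is an assumed default, not taken from the source."""
    score = upper - lower
    if truth < lower:
        score += (2.0 / alpha) * (lower - truth)
    elif truth > upper:
        score += (2.0 / alpha) * (truth - upper)
    return score


def mean_interval_score(lowers, uppers, truths, alpha=0.05):
    """Average interval score over a set of intervals (assumed meaning of MIS)."""
    scores = [interval_score(l, u, t, alpha)
              for l, u, t in zip(lowers, uppers, truths)]
    return sum(scores) / len(scores)


def standardize(metric_by_method, reference="JZS"):
    """Rescale a metric so the reference method equals 1 (assumed: by division)."""
    ref = metric_by_method[reference]
    return {method: value / ref for method, value in metric_by_method.items()}
```

Lower is better for all three quantities, which is why the caption reports 1 – AUPRC rather than AUPRC itself. Computing AUPRC is left out of this sketch.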
Comparison of BMA and BMS for top three methods
Variable selection methods compared in this study
| Method | Authors | Implementation (R package–version) | Function |
|---|---|---|---|
| g = √n | Fernández et al. | BAS-V1.5.5 | bas.lm(…, prior="g-prior", alpha = sqrt(n)) |
| Hyper-g | Liang et al. | BAS-V1.5.5 | bas.lm(…, prior="hyper-g") |
| EB-local | Hansen and Yu | BAS-V1.5.5 | bas.lm(…, prior="EB-local") |
| JZS | Zellner and Siow | BAS-V1.5.5 | bas.lm(…, prior="JZS") |
| Horseshoe | Carvalho et al. | horseshoe-V0.2.0 | horseshoe() |
| UIP | Kass and Wasserman | BAS-V1.5.5 | bas.lm(…, prior="g-prior", alpha = n) |
| EB-global | Clyde and George | BAS-V1.5.5 | bas.lm(…, prior="EB-global") |
| Benchmark | Fernández et al. | BAS-V1.5.5 | bas.lm(…, prior="g-prior", alpha = max(n, |
| NLP | Rossell and Telesca | mombf-V2.2.9 | modelSelection() |
| LASSO | Tibshirani | glmnet-V3.0.2 | cv.glmnet() |
| SCAD | Fan and Li | ncvreg-V3.11.2 | cv.ncvreg(…, penalty="SCAD") |
| BIC-BAS | George and Foster | BAS-V1.5.5 | bas.lm(…, prior="BIC") |
| BICREG-SIS | Raftery | BMA-V3.18.12 | bicreg() |
| Spike slab | George and McCulloch | BoomSpikeSlab-V1.2.3 | lm.spike() |
| Elastic net | Zou and Hastie | glmnet-V3.0.2 | cv.glmnet(…, alpha) |
| MCP | Zhang et al. | ncvreg-V3.11.2 | cv.ncvreg(…, penalty="MCP") |
| SS lasso | Ročková and George | SSLASSO-V1.2.2 | SSLASSO() |
| EMVS | Ročková and George | EMVS-V1.1 | EMVS() |
| AIC | George and Foster | BAS-V1.5.5 | bas.lm(…, prior="AIC") |
| g = 1 | van Zwet | BAS-V1.5.5 | bas.lm(…, prior="g-prior", alpha = 1) |
*LASSO-1se has the same reference as LASSO.
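Several rows above use the same g-prior and differ only in the value of g passed as alpha. The sketch below illustrates what such a method computes on a toy problem. It is not the BAS implementation: it enumerates every model instead of sampling, assumes a uniform prior over models, and uses the standard closed-form Bayes factor of each model against the intercept-only model under Zellner's g-prior. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = sum(m[r][j] * x[j] for j in range(r + 1, k))
        x[r] = (m[r][k] - s) / m[r][r]
    return x


def r_squared(cols, y):
    """R^2 of a least-squares fit with intercept on the given predictor columns."""
    if not cols:
        return 0.0
    n = len(y)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    means = [sum(c) / n for c in cols]
    xc = [[v - mu for v in c] for c, mu in zip(cols, means)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in xc] for ci in xc]
    xty = [sum(u * v for u, v in zip(ci, yc)) for ci in xc]
    beta = solve(xtx, xty)
    return sum(b * t for b, t in zip(beta, xty)) / sum(v * v for v in yc)


def bma_g_prior(columns, y, g):
    """Posterior model probabilities and inclusion probabilities under a g-prior."""
    n, p = len(y), len(columns)
    log_bf = {}
    for k in range(p + 1):
        for subset in combinations(range(p), k):
            r2 = r_squared([columns[j] for j in subset], y)
            # log BF vs. the null model:
            # (n-1-k)/2 * log(1+g) - (n-1)/2 * log(1 + g(1 - R^2))
            log_bf[subset] = (0.5 * (n - 1 - k) * math.log1p(g)
                              - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))
    top = max(log_bf.values())
    weights = {s: math.exp(v - top) for s, v in log_bf.items()}
    total = sum(weights.values())
    post = {s: w / total for s, w in weights.items()}
    incl = [sum(pr for s, pr in post.items() if j in s) for j in range(p)]
    return post, incl


# Toy data (hypothetical): y depends on x1 only; x2 is an irrelevant predictor.
n = 30
x1 = [i / 10.0 for i in range(n)]
x2 = [math.sin(3.0 * i) for i in range(n)]
y = [1.0 + 2.0 * x1[i] + 0.3 * math.cos(7.0 * i) for i in range(n)]

post, incl = bma_g_prior([x1, x2], y, g=math.sqrt(n))
print("posterior model probabilities:", post)
print("posterior inclusion probabilities:", incl)
```

Changing the `g` argument to `n` or `1` mimics the UIP and g = 1 rows. Full enumeration is only feasible for small p, which is why the abstract discusses Markov chain Monte Carlo for the BMA methods.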
Datasets used in the study
| Dataset name | Sample size (n) | Covariates (p) | Source |
|---|---|---|---|
| College | 777 | 14 | ISLR |
| Bias Correction-Tmax | 7,590 | 21 | UCI ML repository |
| Bias Correction-Tmin | 7,590 | 21 | UCI ML repository |
| SML2010 | 1,373 | 22 | UCI ML repository |
| Bike sharing-daily | 731 | 28 | UCI ML repository |
| Bike sharing-hourly | 17,379 | 32 | UCI ML repository |
| Superconductivity | 21,263 | 81 | UCI ML repository |
| Diabetes | 442 | 64 | spikeslab |
| Ozone | 330 | 44 | gss |
| Boston housing | 506 | 103 | mlbench |
| NIR | 166 | 225 | chemometrics |
| Nutrimouse | 40 | 120 | mixOmics |
| Multidrug | 60 | 853 | mixOmics |
| Liver toxicity | 64 | 3,116 | mixOmics |