Categorical variables with many categories are preferentially selected in bootstrap-based model selection procedures for multivariable regression models.

Susanne Rospleszcz, Silke Janitza, Anne-Laure Boulesteix.

Abstract

Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap-based approach. The idea is to apply the variable selection procedure to a large number of bootstrap samples successively and to examine the obtained models, for instance, in terms of the inclusion of specific predictor variables. In this paper, we aim to investigate a particularly important problem affecting this method in the case of categorical predictor variables with different numbers of categories and to give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables, even if none of them has any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
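
The comparison described in the abstract can be illustrated with a small simulation. The following Python sketch is not the authors' code and simplifies their setup considerably. It replaces backward elimination with a marginal likelihood ratio test per variable in a Gaussian linear model, and it assumes a sample size of 200, a binary and a 10-category predictor (both without effect), 500 resamples, and a subsample fraction of 0.632. All function and variable names are hypothetical.

```python
import math
import random

# 95% quantiles of the chi-square distribution for df = 1 and df = 9.
CHI2_CRIT = {1: 3.841, 9: 16.919}

def lr_statistic(y, groups):
    """LR statistic: one categorical predictor vs. intercept-only Gaussian model."""
    n = len(y)
    grand_mean = sum(y) / n
    rss0 = sum((v - grand_mean) ** 2 for v in y)  # residual SS, null model
    by_group = {}
    for v, g in zip(y, groups):
        by_group.setdefault(g, []).append(v)
    rss1 = 0.0  # residual SS with one mean per category
    for values in by_group.values():
        m = sum(values) / len(values)
        rss1 += sum((v - m) ** 2 for v in values)
    if rss1 <= 0.0:
        return math.inf
    return n * math.log(rss0 / rss1)

def draw_indices(n, scheme, rng, fraction=0.632):
    """Bootstrap: n draws with replacement. Subsample: fraction*n without."""
    if scheme == "bootstrap":
        return [rng.randrange(n) for _ in range(n)]
    return rng.sample(range(n), round(fraction * n))

def inclusion_frequencies(y, predictors, scheme, n_resamples, rng):
    """Share of resamples in which each predictor is significant at the 5% level."""
    counts = {name: 0 for name in predictors}
    for _ in range(n_resamples):
        idx = draw_indices(len(y), scheme, rng)
        y_star = [y[i] for i in idx]
        for name, x in predictors.items():
            x_star = [x[i] for i in idx]
            # Assumption: df is taken from the original data (categories - 1).
            df = len(set(x)) - 1
            if lr_statistic(y_star, x_star) > CHI2_CRIT[df]:
                counts[name] += 1
    return {name: c / n_resamples for name, c in counts.items()}

rng = random.Random(1)
n = 200
y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # response unrelated to predictors
x2 = [i % 2 for i in range(n)]               # 2 categories, no effect
x10 = [i % 10 for i in range(n)]             # 10 categories, no effect
rng.shuffle(x2)
rng.shuffle(x10)
predictors = {"x2": x2, "x10": x10}

for scheme in ("bootstrap", "subsample"):
    print(scheme, inclusion_frequencies(y, predictors, scheme, 500, rng))
```

Comparing the printed inclusion frequencies of `x2` and `x10` under the two resampling schemes gives a rough impression of the behavior the abstract reports. A faithful replication would require the full multivariable selection procedure and repeated simulated datasets.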

Keywords:  Automated selection procedures; Bootstrap samples; Categorical variables; Likelihood ratio test; Model selection

Year:  2016        PMID: 27003311     DOI: 10.1002/bimj.201400185

Source DB:  PubMed          Journal:  Biom J        ISSN: 0323-3847            Impact factor:   2.207


  4 in total

1.  The Impacts of Air Pollution on Mental Health: Evidence from the Chinese University Students.

Authors:  Daqing Zu; Keyu Zhai; Yue Qiu; Pei Pei; Xiaoxian Zhu; Dongho Han
Journal:  Int J Environ Res Public Health       Date:  2020-09-16       Impact factor: 3.390

2.  State of the art in selection of variables and functional forms in multivariable analysis-outstanding issues.

Authors:  Willi Sauerbrei; Aris Perperoglou; Matthias Schmid; Michal Abrahamowicz; Heiko Becher; Harald Binder; Daniela Dunkler; Frank E Harrell; Patrick Royston; Georg Heinze
Journal:  Diagn Progn Res       Date:  2020-04-02

3.  Bootstrapping promotes the RSFC-behavior associations: An application of individual cognitive traits prediction.

Authors:  Lijiang Wei; Bin Jing; Haiyun Li
Journal:  Hum Brain Mapp       Date:  2020-03-16       Impact factor: 5.038

4.  Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling.

Authors:  Christine Wallisch; Daniela Dunkler; Geraldine Rauch; Riccardo de Bin; Georg Heinze
Journal:  Stat Med       Date:  2020-10-21       Impact factor: 2.373
