L Nelson Sanchez-Pinto (1), Laura Ruth Venable (2), John Fahrenbach (3), Matthew M Churpek (4)
1. Ann & Robert H. Lurie Children's Hospital of Chicago, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
2. Rollins School of Public Health, Emory University, Atlanta, GA, USA.
3. The Center for Healthcare Delivery Science and Innovation, The University of Chicago, Chicago, IL, USA.
4. Department of Medicine, The University of Chicago, Chicago, IL, USA. Electronic address: matthew.churpek@uchospitals.edu.
Abstract
OBJECTIVE: Modern machine learning-based modeling methods are increasingly applied to clinical problems. One such application is variable selection for predictive modeling. However, there is limited research comparing the performance of classic and modern methods for variable selection in clinical datasets.
MATERIALS AND METHODS: We analyzed the performance of eight variable selection methods: four regression-based methods (stepwise backward selection using p-value, stepwise backward selection using AIC, Least Absolute Shrinkage and Selection Operator, and Elastic Net) and four tree-based methods (Variable Selection Using Random Forest, Regularized Random Forests, Boruta, and Gradient Boosted Feature Selection). We used two clinical datasets of different sizes: a multicenter adult clinical deterioration cohort and a single-center pediatric acute kidney injury cohort. Method evaluation included measures of parsimony, variable importance, and discrimination.
RESULTS: In the large, multicenter dataset, the modern tree-based Variable Selection Using Random Forest and Gradient Boosted Feature Selection methods achieved the best parsimony. In the smaller, single-center dataset, the classic regression-based stepwise backward selection methods (using p-value and AIC) achieved the best parsimony. In both datasets, variable selection tended to decrease the accuracy of the random forest models and increase the accuracy of the logistic regression models.
CONCLUSIONS: The performance of classic regression-based and modern tree-based variable selection methods is associated with the size of the clinical dataset used. Classic regression-based variable selection methods seem to achieve better parsimony in clinical prediction problems in smaller datasets, while modern tree-based methods perform better in larger datasets.
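As a rough illustration of how discrimination and parsimony might be weighed against each other when comparing selection methods, the sketch below computes the area under the ROC curve (AUC) from predicted scores and then picks the method that keeps the fewest variables while staying within a tolerance of the best AUC. The function names, the tolerance rule, and the toy numbers are assumptions made for illustration; they are not the study's definitions or results.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def most_parsimonious(results, tolerance=0.01):
    """Hypothetical parsimony rule: among methods whose AUC is within
    `tolerance` of the best AUC, return the one with the fewest variables.

    `results` maps a method name to a (n_selected_variables, auc) pair.
    """
    best_auc = max(a for _, a in results.values())
    eligible = {m: (k, a) for m, (k, a) in results.items()
                if a >= best_auc - tolerance}
    # Fewest variables first; break ties in favor of the higher AUC.
    return min(eligible, key=lambda m: (eligible[m][0], -eligible[m][1]))


if __name__ == "__main__":
    # Toy data only: four cases with outcome labels and model scores.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    # Made-up (variable count, AUC) pairs for three unnamed methods.
    toy = {"method_a": (10, 0.80), "method_b": (5, 0.795), "method_c": (3, 0.70)}
    print(most_parsimonious(toy))
```

Under this rule a method that discards many variables is preferred only if its discrimination stays close to the best model's, which is one plausible reading of "parsimony" in a comparison like this one.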