
A Comparison of Random Forest Variable Selection Methods for Classification Prediction Modeling.

Jaime Lynn Speiser, Michael E Miller, Janet Tooze, Edward Ip.

Abstract

Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times and area under the receiver operating characteristic curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods and test-based versus performance-based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is its assessment of different variable selection techniques in the setting of random forest classification, which identifies preferable methods based on applications in expert and intelligent systems.
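Two of the evaluation criteria named in the abstract, prediction error rate and area under the receiver operating characteristic curve (AUC), can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `error_rate` and `auc`, the toy labels and scores, and the 0.5 classification threshold are all assumptions made for illustration. The AUC is computed here through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case), which may differ from the implementation used in the study.

```python
from typing import Sequence


def error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of observations whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as one half.
    Labels are assumed to be coded 1 (positive) and 0 (negative).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true classes and predicted probabilities of class 1.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]

# Assumed decision rule: classify as 1 when the score is at least 0.5.
y_pred = [int(s >= 0.5) for s in scores]

print(round(error_rate(y_true, y_pred), 3))  # 0.333 (2 of 6 misclassified)
print(round(auc(y_true, scores), 3))         # 0.778 (7 of 9 pairs ranked correctly)
```

The pairwise AUC shown here is quadratic in the number of observations, which is fine for a small illustration; a rank-based formula would be a more practical choice for large datasets.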

Keywords:  classification; feature reduction; random forest; variable selection

Year:  2019        PMID: 32968335      PMCID: PMC7508310          DOI: 10.1016/j.eswa.2019.05.028

Source DB:  PubMed          Journal:  Expert Syst Appl        ISSN: 0957-4174            Impact factor:   6.954


  6 in total

1.  Random forest classification of etiologies for an orphan disease.

Authors:  Jaime Lynn Speiser; Valerie L Durkalski; William M Lee
Journal:  Stat Med       Date:  2014-11-03       Impact factor: 2.373

2.  Permutation importance: a corrected feature importance measure.

Authors:  André Altmann; Laura Toloşi; Oliver Sander; Thomas Lengauer
Journal:  Bioinformatics       Date:  2010-04-12       Impact factor: 6.937

3.  Comparison of variable selection methods for clinical predictive modeling.

Authors:  L Nelson Sanchez-Pinto; Laura Ruth Venable; John Fahrenbach; Matthew M Churpek
Journal:  Int J Med Inform       Date:  2018-05-21       Impact factor: 4.046

4.  Gene selection and classification of microarray data using random forest.

Authors:  Ramón Díaz-Uriarte; Sara Alvarez de Andrés
Journal:  BMC Bioinformatics       Date:  2006-01-06       Impact factor: 3.169

5.  Evaluation of variable selection methods for random forests and omics data sets.

Authors:  Frauke Degenhardt; Stephan Seifert; Silke Szymczak
Journal:  Brief Bioinform       Date:  2019-03-22       Impact factor: 11.622

6.  Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes.

Authors:  Hongying Jiang; Youping Deng; Huann-Sheng Chen; Lin Tao; Qiuying Sha; Jun Chen; Chung-Jui Tsai; Shuanglin Zhang
Journal:  BMC Bioinformatics       Date:  2004-06-24       Impact factor: 3.169

  38 in total

1.  Development and Validation of an Early Scoring System for Prediction of Disease Severity in COVID-19 Using Complete Blood Count Parameters.

Authors:  Tawsifur Rahman; Amith Khandakar; Md Enamul Hoque; Nabil Ibtehaz; Saad Bin Kashem; Reehum Masud; Lutfunnahar Shampa; Mohammad Mehedi Hasan; Mohammad Tariqul Islam; Somaya Al-Maadeed; Susu M Zughaier; Saif Badran; Suhail A R Doi; Muhammad E H Chowdhury
Journal:  IEEE Access       Date:  2021-08-16       Impact factor: 3.367

2.  Regional Scale Assessment of Shallow Groundwater Vulnerability to Contamination from Unconventional Hydrocarbon Extraction.

Authors:  Mario A Soriano; Nicole C Deziel; James E Saiers
Journal:  Environ Sci Technol       Date:  2022-08-12       Impact factor: 11.357

3.  Development and Validation of a Mortality Prediction Model in Extremely Low Gestational Age Neonates.

Authors:  Alvaro Moreira; Domenico Benvenuto; Christopher Fox-Good; Yasmeen Alayli; Mary Evans; Baldvin Jonsson; Stellan Hakansson; Nathan Harper; Jennifer Kim; Mikael Norman; Matteo Bruschettini
Journal:  Neonatology       Date:  2022-05-20       Impact factor: 5.106

4.  Usefulness of Random Forest Algorithm in Predicting Severe Acute Pancreatitis.

Authors:  Wandong Hong; Yajing Lu; Xiaoying Zhou; Shengchun Jin; Jingyi Pan; Qingyi Lin; Shaopeng Yang; Zarrin Basharat; Maddalena Zippi; Hemant Goyal
Journal:  Front Cell Infect Microbiol       Date:  2022-06-10       Impact factor: 6.073

5.  Air quality prediction models based on meteorological factors and real-time data of industrial waste gas.

Authors:  Ying Liu; Peiyu Wang; Yong Li; Lixia Wen; Xiaochao Deng
Journal:  Sci Rep       Date:  2022-06-03       Impact factor: 4.996

6.  Seeing the forest for the trees: Predicting attendance in trials for co-occurring PTSD and substance use disorders with a machine learning approach.

Authors:  Teresa López-Castro; Yihong Zhao; Skye Fitzpatrick; Lesia M Ruglass; Denise A Hien
Journal:  J Consult Clin Psychol       Date:  2021-10

7.  A random forest method with feature selection for developing medical prediction models with clustered and longitudinal data.

Authors:  Jaime Lynn Speiser
Journal:  J Biomed Inform       Date:  2021-03-26       Impact factor: 6.317

8.  Estimating the Growing Stem Volume of Coniferous Plantations Based on Random Forest Using an Optimized Variable Selection Method.

Authors:  Fugen Jiang; Mykola Kutia; Arbi J Sarkissian; Hui Lin; Jiangping Long; Hua Sun; Guangxing Wang
Journal:  Sensors (Basel)       Date:  2020-12-17       Impact factor: 3.576

9.  Deploying viscosity and starch polymer properties to predict cooking and eating quality models: A novel breeding tool to predict texture.

Authors:  Reuben James Q Buenafe; Vasudev Kumanduri; Nese Sreenivasulu
Journal:  Carbohydr Polym       Date:  2021-02-15       Impact factor: 9.381

10.  Application of the random forest algorithm to Streptococcus pyogenes response regulator allele variation: from machine learning to evolutionary models.

Authors:  Sean J Buckley; Robert J Harvey; Zack Shan
Journal:  Sci Rep       Date:  2021-06-16       Impact factor: 4.379
