Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology.

Eric W Fox, Ryan A Hill, Scott G Leibowitz, Anthony R Olsen, Darren J Thornbrugh, Marc H Weber.

Abstract

Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used, or stepwise procedures are employed that iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues, such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
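The distinction between validation folds that are external to variable selection and folds that are not can be illustrated with a minimal sketch. This is not the authors' code and it does not use random forests: as labeled stand-ins, it ranks variables by the absolute gap between class means (in place of RF importance) and classifies with a nearest-centroid rule (in place of an RF model). All function names are hypothetical. The point is only the fold structure: with `external=True`, selection is repeated inside each training fold, so held-out sites never influence which variables are kept.

```python
import random
from statistics import mean


def rank_by_mean_gap(X, y):
    """Rank column indices by |mean(class 1) - mean(class 0)|, largest first.
    Stand-in for an RF importance ranking (assumption, not the paper's method)."""
    p = len(X[0])
    scores = []
    for j in range(p):
        ones = [row[j] for row, lab in zip(X, y) if lab == 1]
        zeros = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(mean(ones) - mean(zeros)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)


def fit_centroids(X, y, cols):
    """Per-class mean vector over the selected columns (stand-in classifier)."""
    cents = {}
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        cents[lab] = [mean(r[j] for r in rows) for j in cols]
    return cents


def predict(cents, row, cols):
    """Assign the class whose centroid is nearest in squared distance."""
    def dist(lab):
        return sum((row[j] - c) ** 2 for j, c in zip(cols, cents[lab]))
    return 0 if dist(0) <= dist(1) else 1


def cv_accuracy(X, y, n_keep, k=5, external=True, seed=0):
    """k-fold CV accuracy when keeping the top n_keep variables.
    external=True: select variables within each training fold only.
    external=False: select once on all data, then cross-validate (biased)."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    cols_all = rank_by_mean_gap(X, y)[:n_keep]  # has seen every label
    correct = 0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        cols = rank_by_mean_gap(X_tr, y_tr)[:n_keep] if external else cols_all
        cents = fit_centroids(X_tr, y_tr, cols)
        correct += sum(predict(cents, X[i], cols) == y[i] for i in fold)
    return correct / len(y)


def demo(n=60, p=200, n_keep=5, seed=1):
    """Pure-noise predictors: any apparent skill comes from selection bias."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    print("selection inside folds :", cv_accuracy(X, y, n_keep, external=True))
    print("selection on all data  :", cv_accuracy(X, y, n_keep, external=False))


if __name__ == "__main__":
    demo()
```

On pure-noise predictors, the variant that selects on all data would be expected to report accuracy above chance more often than the external variant, which is the kind of selection bias the abstract describes; the exact numbers depend on the seed and on the stand-in methods used here.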

Keywords:  Benthic macroinvertebrates; Model selection bias; National rivers and streams assessment; Random forest modeling; StreamCat dataset; Variable selection

Year:  2017        PMID: 28589457      PMCID: PMC6049094          DOI: 10.1007/s10661-017-6025-0

Source DB:  PubMed          Journal:  Environ Monit Assess        ISSN: 0167-6369            Impact factor:   2.513


  12 in total

1.  Random forest: a classification and regression tool for compound classification and QSAR modeling.

Authors:  Vladimir Svetnik; Andy Liaw; Christopher Tong; J Christopher Culberson; Robert P Sheridan; Bradley P Feuston
Journal:  J Chem Inf Comput Sci       Date:  2003 Nov-Dec

2.  What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models.

Authors:  Michael A Babyak
Journal:  Psychosom Med       Date:  2004 May-Jun       Impact factor: 4.312

3.  North American vegetation model for land-use planning in a changing climate: a solution to large classification problems.

Authors:  Gerald E Rehfeldt; Nicholas L Crookston; Cuauhtémoc Sáenz-Romero; Elizabeth M Campbell
Journal:  Ecol Appl       Date:  2012-01       Impact factor: 4.657

4.  Random forests for classification in ecology.

Authors:  D Richard Cutler; Thomas C Edwards; Karen H Beard; Adele Cutler; Kyle T Hess; Jacob Gibson; Joshua J Lawler
Journal:  Ecology       Date:  2007-11       Impact factor: 5.499

5.  Predicting the biological condition of streams: use of geospatial indicators of natural and anthropogenic characteristics of watersheds.

Authors:  Daren M Carlisle; James Falcone; Michael R Meador
Journal:  Environ Monit Assess       Date:  2008-05-21       Impact factor: 2.513

Review 6.  Random forests for genetic association studies.

Authors:  Benjamin A Goldstein; Eric C Polley; Farren B S Briggs
Journal:  Stat Appl Genet Mol Biol       Date:  2011-07-12

7.  Predictive mapping of the biotic condition of conterminous U.S. rivers and streams.

Authors:  Ryan A Hill; Eric W Fox; Scott G Leibowitz; Anthony R Olsen; Darren J Thornbrugh; Marc H Weber
Journal:  Ecol Appl       Date:  2017-11-03       Impact factor: 4.657

8.  An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings.

Authors:  Benjamin A Goldstein; Alan E Hubbard; Adele Cutler; Lisa F Barcellos
Journal:  BMC Genet       Date:  2010-06-14       Impact factor: 2.797

9.  Predicting disease risks from highly imbalanced data using random forest.

Authors:  Mohammed Khalilia; Sounak Chakraborty; Mihail Popescu
Journal:  BMC Med Inform Decis Mak       Date:  2011-07-29       Impact factor: 2.796

10.  Gene selection and classification of microarray data using random forest.

Authors:  Ramón Díaz-Uriarte; Sara Alvarez de Andrés
Journal:  BMC Bioinformatics       Date:  2006-01-06       Impact factor: 3.169

  10 in total

1.  Modeling Spatial and Temporal Variation in Natural Background Specific Conductivity.

Authors:  John R Olson; Susan M Cormier
Journal:  Environ Sci Technol       Date:  2019-04-01       Impact factor: 9.028

2.  The Lake-Catchment (LakeCat) Dataset: characterizing landscape features for lake basins within the conterminous USA.

Authors:  Ryan A Hill; Marc H Weber; Rick M Debbout; Scott G Leibowitz; Anthony R Olsen
Journal:  Freshw Sci       Date:  2018-06-01       Impact factor: 2.034

3.  Predictive mapping of the biotic condition of conterminous U.S. rivers and streams.

Authors:  Ryan A Hill; Eric W Fox; Scott G Leibowitz; Anthony R Olsen; Darren J Thornbrugh; Marc H Weber
Journal:  Ecol Appl       Date:  2017-11-03       Impact factor: 4.657

4.  Methane and Carbon Dioxide Emissions From Reservoirs: Controls and Upscaling.

Authors:  Jake J Beaulieu; Sarah Waldo; David A Balz; Will Barnett; Alexander Hall; Michelle C Platz; Karen M White
Journal:  J Geophys Res Biogeosci       Date:  2020-12-04       Impact factor: 3.822

5.  Patterns and predictions of drinking water nitrate violations across the conterminous United States.

Authors:  Michael J Pennino; Scott G Leibowitz; Jana E Compton; Ryan A Hill; Robert D Sabo
Journal:  Sci Total Environ       Date:  2020-03-05       Impact factor: 7.963

6.  Characterizing nonnative plants in wetlands across the conterminous United States.

Authors:  Teresa K Magee; Karen A Blocksom; Alan T Herlihy; Amanda M Nahlik
Journal:  Environ Monit Assess       Date:  2019-06-20       Impact factor: 2.513

7.  A risk score based on baseline risk factors for predicting mortality in COVID-19 patients.

Authors:  Ze Chen; Jing Chen; Jianghua Zhou; Fang Lei; Feng Zhou; Juan-Juan Qin; Xiao-Jing Zhang; Lihua Zhu; Ye-Mao Liu; Haitao Wang; Ming-Ming Chen; Yan-Ci Zhao; Jing Xie; Lijun Shen; Xiaohui Song; Xingyuan Zhang; Chengzhang Yang; Weifang Liu; Xiao Zhang; Deliang Guo; Youqin Yan; Mingyu Liu; Weiming Mao; Liming Liu; Ping Ye; Bing Xiao; Pengcheng Luo; Zixiong Zhang; Zhigang Lu; Junhai Wang; Haofeng Lu; Xigang Xia; Daihong Wang; Xiaofeng Liao; Gang Peng; Liang Liang; Jun Yang; Guohua Chen; Elena Azzolini; Alessio Aghemo; Michele Ciccarelli; Gianluigi Condorelli; Giulio G Stefanini; Xiang Wei; Bing-Hong Zhang; Xiaodong Huang; Jiahong Xia; Yufeng Yuan; Zhi-Gang She; Jiao Guo; Yibin Wang; Peng Zhang; Hongliang Li
Journal:  Curr Med Res Opin       Date:  2021-04-10       Impact factor: 2.580

8.  Land-use history impacts spatial patterns and composition of woody plant species across a 35-hectare temperate forest plot.

Authors:  David A Orwig; Jason A Aylward; Hannah L Buckley; Bradley S Case; Aaron M Ellison
Journal:  PeerJ       Date:  2022-01-03       Impact factor: 2.984

9.  Development of Novel Management Tools for Phortica variegata (Diptera: Drosophilidae), Vector of the Oriental Eyeworm, Thelazia callipaeda (Spirurida: Thelaziidae), in Europe.

Authors:  M A González; D Bravo-Barriga; P M Alarcón-Elbal; J M Álvarez-Calero; C Quero; M Ferraguti; S López
Journal:  J Med Entomol       Date:  2022-01-12       Impact factor: 2.278

10.  Variable selection and validation in multivariate modelling.

Authors:  Lin Shi; Johan A Westerhuis; Johan Rosén; Rikard Landberg; Carl Brunius
Journal:  Bioinformatics       Date:  2019-03-15       Impact factor: 6.937

