
Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations.

Anne-Laure Boulesteix, Andreas Bender, Justo Lorenzo Bermejo, Carolin Strobl.

Abstract

Random forests are increasingly used in genetic association studies. The variable importance measure (VIM) that the algorithm calculates automatically as a by-product is often used to rank polymorphisms by their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered an important pitfall: the widely used Gini VIM systematically favours common variants. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present article is 3-fold: (i) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation based); (ii) to explore the trees and to compare the behaviour of random forests with that of the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants; and (iii) to summarize these results and previously investigated properties of random forest VIMs in the context of genetic association studies and to make practical recommendations regarding the choice of random forest and variable importance type. All our analyses can be reproduced using R code available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/.
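
The preference for common variants described in the abstract can be illustrated with a very small null simulation. The sketch below is not the authors' R code and does not fit a random forest. It looks at a single tree node and assumes a balanced case-control phenotype that is independent of the SNP, genotypes coded 0/1/2 under Hardy-Weinberg proportions, and the two ordered splits (0 | 1,2) and (0,1 | 2). All function names and default values are hypothetical, chosen for illustration. Under these assumptions, a common SNP usually offers two usable cut points while a rare SNP often offers only one, so the best Gini decrease found by chance alone tends to be larger for the common SNP.

```python
import random


def gini(cases, total):
    """Gini impurity of a node holding `cases` cases among `total` samples."""
    if total == 0:
        return 0.0
    p = cases / total
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, phenotype):
    """Largest impurity decrease over the two ordered splits of a 0/1/2 SNP."""
    n = len(genotypes)
    total = [0, 0, 0]  # samples per genotype
    cases = [0, 0, 0]  # cases per genotype
    for g, y in zip(genotypes, phenotype):
        total[g] += 1
        cases[g] += y
    all_cases = sum(cases)
    parent = gini(all_cases, n)
    best = 0.0
    for cut in (1, 2):  # the left child holds genotypes < cut
        n_left = sum(total[:cut])
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue  # an empty child means this cut is not a valid split
        c_left = sum(cases[:cut])
        c_right = all_cases - c_left
        children = (n_left * gini(c_left, n_left)
                    + n_right * gini(c_right, n_right)) / n
        best = max(best, parent - children)
    return best


def draw_genotype(maf, rng):
    """Minor-allele count (0, 1 or 2) under Hardy-Weinberg proportions."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def mean_null_decrease(maf, n=200, reps=2000, seed=1):
    """Average best Gini decrease when the SNP is unrelated to the phenotype."""
    rng = random.Random(seed)
    phenotype = [1] * (n // 2) + [0] * (n - n // 2)  # balanced cases/controls
    acc = 0.0
    for _ in range(reps):
        genotypes = [draw_genotype(maf, rng) for _ in range(n)]
        acc += best_gini_decrease(genotypes, phenotype)
    return acc / reps


if __name__ == "__main__":
    for maf in (0.05, 0.5):
        print(f"MAF={maf}: mean null Gini decrease = {mean_null_decrease(maf):.5f}")
```

This single-node view captures only one possible contributor to the effect. In a full forest, node sizes shrink with depth and importance is accumulated over many splits and trees, which the sketch does not model.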

Year:  2011        PMID: 21908865     DOI: 10.1093/bib/bbr053

Source DB:  PubMed          Journal:  Brief Bioinform        ISSN: 1467-5463            Impact factor:   11.622


  19 in total

1.  iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou's PseAAC to formulate DNA samples.

Authors:  Muhammad Kabir; Maqsood Hayat
Journal:  Mol Genet Genomics       Date:  2015-08-30       Impact factor: 3.291

2.  An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data.

Authors:  Raymond Walters; Charles Laurin; Gitta H Lubke
Journal:  Bioinformatics       Date:  2012-07-30       Impact factor: 6.937

3.  Interpretation of machine learning predictions for patient outcomes in electronic health records.

Authors:  William La Cava; Christopher Bauer; Jason H Moore; Sarah A Pendergrass
Journal:  AMIA Annu Symp Proc       Date:  2020-03-04

4.  Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Authors:  Wouter G Touw; Jumamurat R Bayjanov; Lex Overmars; Lennart Backus; Jos Boekhorst; Michiel Wels; Sacha A F T van Hijum
Journal:  Brief Bioinform       Date:  2012-07-10       Impact factor: 11.622

5.  Use of random forest to estimate population attributable fractions from a case-control study of Salmonella enterica serotype Enteritidis infections.

Authors:  W Gu; A R Vieira; R M Hoekstra; P M Griffin; D Cole
Journal:  Epidemiol Infect       Date:  2015-02-12       Impact factor: 4.434

6.  Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes.

Authors:  Yue Wang; Wilson Goh; Limsoon Wong; Giovanni Montana
Journal:  BMC Bioinformatics       Date:  2013-10-22       Impact factor: 3.169

7.  Regularized machine learning in the genetic prediction of complex traits.

Authors:  Sebastian Okser; Tapio Pahikkala; Antti Airola; Tapio Salakoski; Samuli Ripatti; Tero Aittokallio
Journal:  PLoS Genet       Date:  2014-11-13       Impact factor: 5.917

8.  Genome-scale screening of drug-target associations relevant to Ki using a chemogenomics approach.

Authors:  Dong-Sheng Cao; Yi-Zeng Liang; Zhe Deng; Qian-Nan Hu; Min He; Qing-Song Xu; Guang-Hua Zhou; Liu-Xia Zhang; Zi-xin Deng; Shao Liu
Journal:  PLoS One       Date:  2013-04-05       Impact factor: 3.240

9.  An AUC-based permutation variable importance measure for random forests.

Authors:  Silke Janitza; Carolin Strobl; Anne-Laure Boulesteix
Journal:  BMC Bioinformatics       Date:  2013-04-05       Impact factor: 3.169

10.  Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes.

Authors:  Wangchao Lou; Xiaoqing Wang; Fan Chen; Yixiao Chen; Bo Jiang; Hua Zhang
Journal:  PLoS One       Date:  2014-01-24       Impact factor: 3.240
