Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations.


Anne-Laure Boulesteix, Andreas Bender, Justo Lorenzo Bermejo, Carolin Strobl.

Abstract

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered an important pitfall, namely that common variants are systematically favoured by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present article is threefold: (i) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation-based); (ii) to explore the trees and to compare the behaviour of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants; and (iii) to summarize these results and previously investigated properties of random forest VIMs in the context of genetic association studies and to make practical recommendations regarding the choice of the random forest and variable importance type. All our analyses can be reproduced using R code available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/.
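As a rough illustration of the effect described in the abstract, the following Python sketch simulates SNPs that are unrelated to a binary phenotype and records the largest Gini impurity decrease available from a single root-node split. It is not the authors' R code and does not grow a full forest. The function names, the sample size, the number of replicates and the minor allele frequency (MAF) grid are all assumptions chosen for illustration. It isolates one plausible contributor to the bias: a common SNP offers two usable split points (genotypes 0 vs 1/2 and 0/1 vs 2), whereas a rare SNP often offers only one, so the best null split tends to look better for the common SNP.

```python
import random


def gini(cases, n):
    """Gini impurity of a node holding `cases` positives among n samples."""
    if n == 0:
        return 0.0
    p = cases / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, labels):
    """Largest Gini decrease over the two ordered splits of a 0/1/2 genotype."""
    n = len(labels)
    total_cases = sum(labels)
    parent = gini(total_cases, n)
    best = 0.0
    for threshold in (0, 1):  # {0} vs {1,2}, then {0,1} vs {2}
        left_n = left_cases = 0
        for g, y in zip(genotypes, labels):
            if g <= threshold:
                left_n += 1
                left_cases += y
        right_n = n - left_n
        if left_n == 0 or right_n == 0:
            continue  # split not available: one child would be empty
        right_cases = total_cases - left_cases
        children = (left_n * gini(left_cases, left_n)
                    + right_n * gini(right_cases, right_n)) / n
        best = max(best, parent - children)
    return best


def simulate_null_gain(maf, n_samples=200, n_reps=2000, seed=0):
    """Mean best-split Gini decrease for a SNP with no effect on the phenotype."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Phenotype: fair coin flip, independent of the genotype (null model).
        labels = [rng.randint(0, 1) for _ in range(n_samples)]
        # Genotype: number of minor alleles, two independent draws per person.
        genotypes = [(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n_samples)]
        total += best_gini_decrease(genotypes, labels)
    return total / n_reps


# Hypothetical MAF grid; all SNPs are pure noise by construction.
null_gain = {maf: simulate_null_gain(maf) for maf in (0.05, 0.25, 0.5)}
for maf, gain in null_gain.items():
    print(f"MAF={maf:.2f}  mean null Gini decrease={gain:.5f}")
```

Because every simulated SNP is uninformative, any systematic difference in the printed averages reflects selection bias rather than signal. A full analysis along the lines of the abstract would repeat this comparison inside complete forests and with permutation-based importance.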

Year: 2011 PMID: 21908865 DOI: 10.1093/bib/bbr053

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Cited

19 in total

1. iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou's PseAAC to formulate DNA samples.

Authors: Muhammad Kabir; Maqsood Hayat
Journal: Mol Genet Genomics Date: 2015-08-30 Impact factor: 3.291

2. An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data.

Authors: Raymond Walters; Charles Laurin; Gitta H Lubke
Journal: Bioinformatics Date: 2012-07-30 Impact factor: 6.937

3. Interpretation of machine learning predictions for patient outcomes in electronic health records.

Authors: William La Cava; Christopher Bauer; Jason H Moore; Sarah A Pendergrass
Journal: AMIA Annu Symp Proc Date: 2020-03-04

4. Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Authors: Wouter G Touw; Jumamurat R Bayjanov; Lex Overmars; Lennart Backus; Jos Boekhorst; Michiel Wels; Sacha A F T van Hijum
Journal: Brief Bioinform Date: 2012-07-10 Impact factor: 11.622

5. Use of random forest to estimate population attributable fractions from a case-control study of Salmonella enterica serotype Enteritidis infections.

Authors: W Gu; A R Vieira; R M Hoekstra; P M Griffin; D Cole
Journal: Epidemiol Infect Date: 2015-02-12 Impact factor: 4.434

6. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes.

7. Regularized machine learning in the genetic prediction of complex traits.

8. Genome-scale screening of drug-target associations relevant to Ki using a chemogenomics approach.

9. An AUC-based permutation variable importance measure for random forests.

10. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes.

Authors: Wangchao Lou; Xiaoqing Wang; Fan Chen; Yixiao Chen; Bo Jiang; Hua Zhang
Journal: PLoS One Date: 2014-01-24 Impact factor: 3.240