
Correction for population stratification in random forest analysis.

Yang Zhao, Feng Chen, Rihong Zhai, Xihong Lin, Zhaoxi Wang, Li Su, David C Christiani.

Abstract

BACKGROUND: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS) because it can produce spurious associations. Random forest (RF) has been increasingly applied to GWAS data because it handles high-dimensional genetic data well. RF produces importance measures for single nucleotide polymorphisms (SNPs), which are useful for feature selection. However, if PS is not appropriately corrected, RF tends to assign high importance to disease-unrelated SNPs whose allele or genotype frequencies differ among subpopulations, leading to inaccurate results.
METHODS: In this study, the authors propose to correct for the confounding effect of PS by incorporating PS information into the RF analysis. The correction procedure first extracts PS information from a large number of structure-inference SNPs, using either EIGENSTRAT or a multi-dimensional scaling clustering procedure. The phenotype and genotypes, adjusted for this PS information, are then used as the outcome and predictors in the RF analysis.
RESULTS: Extensive simulations indicate that the importance measure of the causal SNP increases after PS correction. In an analysis of a real dataset, the proposed correction removes the spurious association between the lactase gene and height.
CONCLUSION: The authors propose a simple method to correct for PS in RF analysis on GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.
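
The correction procedure described under METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact adjustment formula or the RF importance measure, so the sketch assumes that PS information is summarised by the top principal components of the structure-inference SNPs (in the spirit of EIGENSTRAT), that "adjusted" means taking residuals from a linear regression on those components, and that scikit-learn's impurity-based importance stands in for the paper's importance measure. The simulated data, the function names (`top_pcs`, `residualize`, `simulate`, `rf_importances`) and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def top_pcs(structure_snps, n_pcs=2):
    """Leading principal-component scores of a samples x SNPs genotype matrix."""
    centred = structure_snps - structure_snps.mean(axis=0)
    sd = centred.std(axis=0)
    sd[sd == 0] = 1.0  # leave monomorphic SNPs unscaled instead of dividing by zero
    u, s, _ = np.linalg.svd(centred / sd, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def residualize(values, pcs):
    """Regress `values` (vector or matrix) on intercept + PCs; return the residuals."""
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def simulate(n_per_pop=300, n_structure=200, seed=0):
    """Two subpopulations (assumed toy model) with a trait mean shift between them."""
    rng = np.random.default_rng(seed)
    pop = np.repeat([0, 1], n_per_pop)
    n = pop.size
    # Structure-inference SNPs: allele frequencies differ between subpopulations.
    freq0 = rng.uniform(0.1, 0.5, n_structure)
    freq1 = rng.uniform(0.5, 0.9, n_structure)
    structure = rng.binomial(2, np.where(pop[:, None] == 0, freq0, freq1))
    causal = rng.binomial(2, 0.3, n)                             # same frequency everywhere
    stratified = rng.binomial(2, np.where(pop == 0, 0.1, 0.9))   # unrelated to the trait
    nulls = rng.binomial(2, 0.3, (n, 3))
    X = np.column_stack([causal, stratified, nulls]).astype(float)
    y = 0.5 * causal + 2.0 * pop + rng.normal(0.0, 1.0, n)
    return structure, X, y


def rf_importances(X, y, seed=0):
    """Impurity-based importances from a random forest regressor."""
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    return forest.fit(X, y).feature_importances_


if __name__ == "__main__":
    structure, X, y = simulate()
    pcs = top_pcs(structure)
    raw = rf_importances(X, y)
    adjusted = rf_importances(residualize(X, pcs), residualize(y, pcs))
    names = ["causal", "stratified_null", "null_1", "null_2", "null_3"]
    for name, r, a in zip(names, raw, adjusted):
        print(f"{name:<16} raw={r:.3f}  adjusted={a:.3f}")
```

Running the script prints the importance of each candidate SNP before and after adjustment, so the effect of the correction on the stratified but trait-unrelated SNP can be inspected directly. The exact values depend on the seed and on the toy model, so they should not be read as a reproduction of the paper's simulations.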

Year:  2012        PMID: 23148107      PMCID: PMC3535752          DOI: 10.1093/ije/dys183

Source DB:  PubMed          Journal:  Int J Epidemiol        ISSN: 0300-5771            Impact factor:   7.196


