BACKGROUND: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS) because it can produce spurious associations. Random forest (RF) is increasingly applied to GWAS data because it handles high-dimensional genetic data well. RF produces importance measures for single nucleotide polymorphisms (SNPs), which are useful for feature selection. However, if PS is not appropriately corrected, RF tends to assign high importance to disease-unrelated SNPs whose allele or genotype frequencies differ among subpopulations, leading to inaccurate results.

METHODS: In this study, the authors propose to correct for the confounding effect of PS by incorporating PS information into the RF analysis. The correction procedure first extracts PS information from a large number of structure inference SNPs, using EIGENSTRAT or a multi-dimensional scaling clustering procedure. The phenotype and genotypes, adjusted for this PS information, are then used as the outcome and predictors in the RF analysis.

RESULTS: Extensive simulations indicate that the importance measure of the causal SNP increases following the PS correction. In an analysis of a real dataset, the proposed correction removes the spurious association between the lactase gene and height.

CONCLUSION: The authors propose a simple method to correct for PS in RF analysis of GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.
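The correction described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions the abstract does not confirm: plain principal component analysis via NumPy stands in for EIGENSTRAT, "adjusted" is read as taking least-squares residuals on the top components, and scikit-learn's `RandomForestRegressor` is used for the RF step. All function names, parameter values, and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def top_pcs(structure_snps, n_pcs=2):
    """Top principal components of a (samples x SNPs) genotype matrix.

    Assumption: ordinary PCA is used here as a stand-in for EIGENSTRAT.
    """
    X = structure_snps - structure_snps.mean(axis=0)
    sd = X.std(axis=0)
    X = X[:, sd > 0] / sd[sd > 0]  # drop monomorphic SNPs, scale the rest
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]


def residualize(values, pcs):
    """Residuals of `values` (vector or matrix) after regression on the PCs.

    Assumption: "adjusted by the information of PS" is read as linear residuals.
    """
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def rf_importance(y_adj, G_adj, n_trees=500, seed=0):
    """RF importance per SNP, fitted on the adjusted outcome and predictors."""
    from sklearn.ensemble import RandomForestRegressor  # needs scikit-learn

    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(G_adj, y_adj)
    return rf.feature_importances_


# Hypothetical simulated data: two subpopulations of 100 samples each.
rng = np.random.default_rng(0)
n = 200
pop = np.repeat([0, 1], n // 2)

# Structure inference SNPs whose allele frequency differs by subpopulation.
freq = np.where(pop[:, None] == 0, 0.2, 0.8)
structure_snps = rng.binomial(2, freq, size=(n, 100)).astype(float)

# Candidate SNPs: column 0 is causal, column 1 is stratified but not causal.
G = rng.binomial(2, 0.5, size=(n, 5)).astype(float)
G[:, 1] = rng.binomial(2, np.where(pop == 0, 0.2, 0.8))

# The phenotype depends on the causal SNP and on subpopulation membership.
y = 1.0 * G[:, 0] + 2.0 * pop + rng.normal(size=n)

pcs = top_pcs(structure_snps, n_pcs=2)
y_adj = residualize(y, pcs)
G_adj = residualize(G, pcs)
```

Calling `rf_importance(y_adj, G_adj)` and `rf_importance(y, G)` on the same data gives one way to compare SNP importance with and without the adjustment. The residualised phenotype is continuous even when the original trait is binary, which is why a regression forest is assumed here.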