
Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias.

Stacey J Winham, Gregory D Jenkins, Joanna M Biernacka.

Abstract

Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
© 2015 WILEY PERIODICALS, INC.
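The source of the bias described in the abstract can be illustrated with a small simulation. The sketch below is not the snpRF algorithm and does not use its API. It is a toy example, under stated assumptions, of why an X SNP coded as a raw allele count carries information about sex: males have one X copy (count 0 or 1) and females have two (count 0, 1 or 2), so a SNP with no effect of its own still correlates with sex, and through sex with any sex-associated trait. The function names, the allele frequency of 0.3, the sample size, and the 0/2 recoding of males (a coding often used in regression analyses of X SNPs, chosen here only for illustration) are all assumptions that do not come from the abstract.

```python
import random


def simulate_x_snp(n, p, rng):
    """Simulate a null X SNP with allele frequency p (hypothetical helper).

    Returns two lists: sex (0 = male, one X copy; 1 = female, two X copies)
    and the raw count of the tested allele per subject.
    """
    sex, count = [], []
    for _ in range(n):
        s = rng.randint(0, 1)          # 0 = male, 1 = female
        copies = 1 + s                 # males carry one X, females two
        g = sum(rng.random() < p for _ in range(copies))
        sex.append(s)
        count.append(g)
    return sex, count


def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)                 # fixed seed for reproducibility
sex, raw = simulate_x_snp(20000, 0.3, rng)

# Assumed illustrative recoding: male genotypes 0/1 become 0/2, so that the
# expected coded value is 2p in both sexes.
recoded = [2 * g if s == 0 else g for s, g in zip(sex, raw)]

r_raw = pearson(sex, raw)              # raw allele count tracks sex
r_recoded = pearson(sex, recoded)      # recoded value should not track sex

print(f"correlation of sex with raw allele count: {r_raw:.3f}")
print(f"correlation of sex with recoded genotype: {r_recoded:.3f}")
```

In this toy setting the raw count acts as a partial proxy for sex, which is one way a tree-based method could assign importance to a SNP with no effect of its own when sex is associated with the trait. The recoding only equalizes the mean between sexes; the variance still differs, so it should be read as an illustration of the problem rather than as a substitute for the stratified or X-inactivation-based approaches the authors propose.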

Keywords:  Random Forest; X chromosome; bias; sex differences; variable importance


Year: 2015; PMID: 26639183; PMCID: PMC4724236; DOI: 10.1002/gepi.21946

Source DB: PubMed; Journal: Genet Epidemiol; ISSN: 0741-0395; Impact factor: 2.135


