Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias.

Stacey J Winham¹, Gregory D Jenkins¹, Joanna M Biernacka¹,².

Abstract

Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
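The source of the bias described in the abstract can be illustrated with a small simulation. The sketch below is not the snpRF implementation; it is a minimal illustration under stated assumptions. It assumes males are conventionally coded 0/1 for an X SNP (one copy) and females 0/1/2 (two copies), and it assumes that an inactivation-motivated coding treats a hemizygous male like a homozygous female (0/2). The function names, the sample size, and the allele frequency are hypothetical choices made for the example.

```python
import random
from statistics import mean


def simulate_null_x_snp(n=20000, maf=0.3, seed=0):
    """Simulate one X SNP with no effect on any trait.

    Returns two lists: sex (1 = male, 0 = female) and the minor-allele
    count under the conventional coding (males 0/1, females 0/1/2).
    """
    rng = random.Random(seed)
    sex, count = [], []
    for _ in range(n):
        is_male = rng.random() < 0.5
        copies = 1 if is_male else 2  # males carry a single X chromosome
        count.append(sum(rng.random() < maf for _ in range(copies)))
        sex.append(1 if is_male else 0)
    return sex, count


def recode_males_0_2(sex, count):
    """Assumed inactivation-style coding: male allele counts are doubled."""
    return [2 * c if s == 1 else c for s, c in zip(sex, count)]


def sex_gap(sex, geno):
    """Difference in mean coded genotype, females minus males."""
    males = [g for s, g in zip(sex, geno) if s == 1]
    females = [g for s, g in zip(sex, geno) if s == 0]
    return mean(females) - mean(males)


sex, raw = simulate_null_x_snp()
xci = recode_males_0_2(sex, raw)

raw_gap = sex_gap(sex, raw)  # expected near maf: the SNP acts as a proxy for sex
xci_gap = sex_gap(sex, xci)  # expected near 0: mean genotype no longer tracks sex

print(f"female-male gap, 0/1 male coding: {raw_gap:.3f}")
print(f"female-male gap, 0/2 male coding: {xci_gap:.3f}")
```

Under the conventional coding, the SNP's value differs systematically by sex, so any learner can use it as a stand-in for sex when sex is associated with the trait. The 0/2 recoding equalizes the means but not the variances between sexes, which is one reason a sex-stratified analysis is a natural alternative.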

Keywords: Random Forest; X chromosome; bias; sex differences; variable importance

Substances: Genetic Markers

Year: 2015
PMID: 26639183
PMCID: PMC4724236
DOI: 10.1002/gepi.21946

Source DB: PubMed
Journal: Genet Epidemiol
ISSN: 0741-0395
Impact factor: 2.135
