
Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

Thanh-Tung Nguyen, Joshua Huang, Qingyao Wu, Thuy Nguyen, Mark Li.   

Abstract

BACKGROUND: Selecting and identifying single-nucleotide polymorphisms (SNPs) are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data are very high-dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been used successfully in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful is Random Forests (RF). Although RF achieves good prediction accuracy on some data sets of moderate size, it still struggles in GWAS to select informative SNPs and to build accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method for random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building the trees of the forest, only SNPs from these two sub-groups are taken into account. The feature subspace used to split a node of a tree always contains highly informative SNPs.
RESULTS: This approach generates more accurate trees with a lower prediction error while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) and 10 gene data sets demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing state-of-the-art random forests. The top 25 SNPs identified by the proposed model in the Parkinson data set include four interesting genes associated with neurological disorders.
CONCLUSION: The presented approach has been shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail to detect. The new RF works well for data in which the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experimental results demonstrated the effectiveness of the proposed RF model, which outperformed state-of-the-art RFs including Breiman's RF, GRRF and wsRF.
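
The two-stage sampling idea described in the BACKGROUND paragraph can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which statistical test produces the p-values or how the two sub-groups are separated, so the 3x2 chi-square test, the thresholds `alpha` and `strong`, the `high_fraction` parameter and all function names below are assumptions made for illustration only.

```python
import math
import random


def chi2_pvalue_df2(genotypes, labels):
    """P-value of a 3x2 chi-square test (genotype 0/1/2 vs. control 0 / case 1).

    Assumption: a plain chi-square test stands in for the paper's
    unspecified "p-value assessment". With 2 degrees of freedom the
    survival function has the closed form exp(-x / 2). If a genotype
    never occurs, the df=2 p-value is only a conservative approximation.
    """
    table = [[0, 0] for _ in range(3)]
    for g, y in zip(genotypes, labels):
        table[g][y] += 1
    n = len(labels)
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(3)) for j in range(2)]
    stat = 0.0
    for i in range(3):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return math.exp(-stat / 2.0)


def partition_snps(pvalues, alpha=0.05, strong=0.001):
    """Stage 1: drop SNPs with p > alpha; stage 2: split the rest in two.

    Both thresholds are hypothetical. Returns (highly informative,
    weakly informative) lists of SNP indices.
    """
    high = [i for i, p in enumerate(pvalues) if p <= strong]
    weak = [i for i, p in enumerate(pvalues) if strong < p <= alpha]
    return high, weak


def sample_subspace(high, weak, mtry, rng, high_fraction=0.5):
    """Draw up to mtry SNP indices for one node, from the two sub-groups only.

    At least one highly informative SNP is included whenever any exists,
    mirroring the statement that subspaces always contain such SNPs.
    """
    n_high = min(len(high), max(1, round(mtry * high_fraction))) if high else 0
    n_weak = min(len(weak), mtry - n_high)
    return rng.sample(high, n_high) + rng.sample(weak, n_weak)
```

In a full random forest, `sample_subspace` would be called once per node in place of uniform sampling over all SNPs, so irrelevant SNPs never enter a candidate split set.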

Year: 2015; PMID: 25708662; PMCID: PMC4331719; DOI: 10.1186/1471-2164-16-S2-S5

Source DB: PubMed; Journal: BMC Genomics; ISSN: 1471-2164; Impact factor: 3.969

