Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies.

Andrei S Rodin, Anatoliy Litvinenko, Kathy Klos, Alanna C Morrison, Trevor Woodage, Josef Coresh, Eric Boerwinkle.

Abstract

Modern large-scale genetic association studies generate increasingly high-dimensional datasets. A variable selection procedure should therefore be applied before traditional data analysis methods, both for computational efficiency and to limit overfitting. We describe here a "wrapper" strategy (SIZEFIT) for variable selection that couples a Random Forests classifier with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, and that only a small number of SNPs (sometimes fewer than 100) contain any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by the different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for the CHD (fewer than 150 SNPs) and LDL (fewer than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
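
The general idea of a wrapper can be illustrated with a minimal sketch: a search procedure proposes variable subsets, and a classifier's estimated accuracy scores each one. The code below is not SIZEFIT or FIXFIT. It assumes plain greedy backward elimination as the search, and it swaps in a leave-one-out nearest-centroid classifier for Random Forests so that it runs with the standard library alone. The names `loo_accuracy` and `backward_wrapper` and the toy genotype data are hypothetical.

```python
import random


def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier on columns `cols`.

    Stand-in scorer (assumption): the abstract uses Random Forests instead.
    """
    classes = sorted(set(y))
    # Per-class column sums and counts make each held-out centroid cheap.
    sums = {c: [0.0] * len(cols) for c in classes}
    counts = {c: 0 for c in classes}
    for row, label in zip(X, y):
        counts[label] += 1
        for k, j in enumerate(cols):
            sums[label][k] += row[j]
    correct = 0
    for row, label in zip(X, y):
        best_class, best_dist = None, float("inf")
        for c in classes:
            held_out = 1 if c == label else 0  # exclude the test row itself
            m = counts[c] - held_out
            if m == 0:
                continue
            dist = 0.0
            for k, j in enumerate(cols):
                centroid = (sums[c][k] - held_out * row[j]) / m
                dist += (row[j] - centroid) ** 2
            if dist < best_dist:
                best_class, best_dist = c, dist
        correct += best_class == label
    return correct / len(X)


def backward_wrapper(X, y, score_fn, min_size=1):
    """Greedy backward elimination: repeatedly drop the least useful variable."""
    current = list(range(len(X[0])))
    best_subset, best_score = current[:], score_fn(X, y, current)
    while len(current) > min_size:
        # Score every one-variable removal and keep the one that hurts least.
        trials = [(score_fn(X, y, [j for j in current if j != d]), d)
                  for d in current]
        score, dropped = max(trials)
        current.remove(dropped)
        if score >= best_score:  # ties favor the smaller subset
            best_subset, best_score = current[:], score
    return best_subset, best_score


# Toy data (hypothetical): genotypes coded 0/1/2; column 0 determines the
# phenotype, and the remaining columns are random noise "SNPs".
rng = random.Random(0)
n_samples, n_noise = 180, 5
X = [[i % 3] + [rng.randint(0, 2) for _ in range(n_noise)]
     for i in range(n_samples)]
y = [1 if row[0] >= 1 else 0 for row in X]

selected, score = backward_wrapper(X, y, loo_accuracy)
print("selected columns:", selected, "accuracy:", score)
```

Because `score_fn` is a parameter, a Random Forests out-of-bag accuracy estimate could be plugged in without changing the search loop. This greedy search evaluates on the order of p^2 subsets, which is cheap for a toy example but would need a faster search strategy for thousands of SNPs.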

Year:  2009        PMID: 20047492      PMCID: PMC2980837          DOI: 10.1089/cmb.2008.0037

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


