
SNP selection and classification of genome-wide SNP data using stratified sampling random forests.

Qingyao Wu, Yunming Ye, Yang Liu, Michael K Ng.

Abstract

For high-dimensional genome-wide association (GWA) case-control data on complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. Simple random sampling in a random forest, using the default mtry parameter to choose the feature subspace, will select too many subspaces without informative SNPs. An exhaustive search for an optimal mtry is often required to include useful and relevant SNPs and to discard the vast number of non-informative SNPs. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs, avoids the very high computational cost of an exhaustive search for an optimal mtry, and maintains the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408 803 SNPs and Alzheimer case-control data comprising 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates better random forests, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also show some interesting genes identified by the method that may be associated with neurological disorders and merit further biological investigation.
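
The subspace-selection step described in the abstract can be illustrated with a minimal sketch. It assumes that a per-SNP informativeness score has already been computed; the abstract does not name the measure, so the scores here are random placeholders. The function names `equal_width_groups` and `stratified_subspace` and the parameters `n_groups` and `per_group` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def equal_width_groups(scores, n_groups):
    """Assign each feature to one of n_groups equal-width bins of its score."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All scores identical: every feature falls into a single group.
        return np.zeros(scores.shape[0], dtype=int)
    # Interior bin edges only; np.digitize then yields labels 0..n_groups-1.
    edges = np.linspace(lo, hi, n_groups + 1)[1:-1]
    return np.digitize(scores, edges)

def stratified_subspace(groups, per_group, rng):
    """Draw up to per_group feature indices, without replacement, from each non-empty group."""
    chosen = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        k = min(per_group, members.size)  # a sparse group contributes all it has
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

# Toy usage: 1000 hypothetical SNPs with placeholder informativeness scores.
rng = np.random.default_rng(0)
scores = rng.random(1000)
groups = equal_width_groups(scores, n_groups=5)
# One subspace per tree; each tree would then be grown on these columns only.
subspace = stratified_subspace(groups, per_group=10, rng=rng)
print(subspace.size)
```

Drawing a fresh subspace for every tree keeps the forest randomized while guaranteeing that the high-informativeness groups are represented in each one. How to handle groups with fewer than `per_group` members is a design choice made here for illustration, not something the abstract specifies.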

Year:  2012        PMID: 22987127     DOI: 10.1109/TNB.2012.2214232

Source DB:  PubMed          Journal:  IEEE Trans Nanobioscience        ISSN: 1536-1241            Impact factor:   2.935


