
Gene Selection with Sequential Classification and Regression Tree Algorithm.

Caleb D Bastian, Grzegorz A Rempala.

Abstract

BACKGROUND: In the typical setting of gene-selection problems from high-dimensional data, e.g., gene expression data from microarray or next-generation sequencing-based technologies, an enormous volume of high-throughput data is generated, and there is often a need for a simple, computationally inexpensive, non-parametric screening procedure that can quickly and accurately find a low-dimensional variable subset that preserves biological information from the original very high-dimensional data (dimension p > 40,000). This is in contrast to the very sophisticated variable selection methods that are computationally expensive, need pre-processing routines, and often require calibration of priors.
RESULTS: We present a tree-based sequential CART (S-CART) approach to variable selection in the binary classification setting and compare it against the more sophisticated procedures using simulated and real biological data. In simulated data, we analyze S-CART performance versus (i) a random forest (RF), (ii) a fully-parametric Bayesian stochastic search variable selection (SSVS), and (iii) the moderated t-test statistic from the LIMMA package in R. The simulation study is based on a hierarchical Bayesian model, where dataset dimensionality, percentage of significant variables, and substructure via dependency vary. Selection efficacy is measured through false-discovery and missed-discovery rates. In all scenarios, the S-CART method is seen to consistently outperform SSVS and RF in both speed and detection accuracy. We demonstrate the utility of the S-CART technique both on simulated data and in a control-treatment mouse study. We show that the network analysis based on the S-CART-selected gene subset in essence recapitulates the biological findings of the study using only a fraction of the original set of genes considered in the study's analysis.
CONCLUSIONS: In practical circumstances, relatively simple-minded gene selection algorithms like S-CART may often be preferred over much more sophisticated ones. The advantage of the "greedy" selection methods used by S-CART and the like is that they scale well with the problem size and require virtually no tuning or training, while remaining efficient in extracting the relevant information from microarray-like datasets containing a large number of redundant or irrelevant variables.
AVAILABILITY: The MATLAB 7.4b code for the S-CART implementation is available for download from https://neyman.mcg.edu/posts/scart.zip.
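
The abstract states that selection efficacy is measured through false-discovery and missed-discovery rates but does not spell out the formulas. The following is a minimal Python sketch under assumed definitions: the false-discovery rate is taken as the fraction of selected variables that are not truly significant, and the missed-discovery rate as the fraction of truly significant variables that were not selected. The function name `selection_error_rates` and the example indices are hypothetical and are not taken from the authors' MATLAB code.

```python
def selection_error_rates(selected, truly_significant):
    """Return (false_discovery_rate, missed_discovery_rate) for one selection.

    Assumed definitions (not confirmed by the abstract):
      FDR = |selected but not significant| / |selected|
      MDR = |significant but not selected| / |significant|
    """
    selected = set(selected)
    truth = set(truly_significant)

    false_discoveries = selected - truth   # selected, but not truly significant
    missed_discoveries = truth - selected  # truly significant, but not selected

    # Guard against empty sets so the rates are always defined.
    fdr = len(false_discoveries) / len(selected) if selected else 0.0
    mdr = len(missed_discoveries) / len(truth) if truth else 0.0
    return fdr, mdr


# Hypothetical example: variables 1-5 are truly significant in a simulated
# dataset, and a selection procedure returns variables 1, 2, 3 and 7.
fdr, mdr = selection_error_rates(selected=[1, 2, 3, 7],
                                 truly_significant=[1, 2, 3, 4, 5])
print(f"FDR = {fdr:.2f}, MDR = {mdr:.2f}")
```

In a simulation study the set of truly significant variables is known by construction, so both rates can be computed for every method and every scenario and then compared across methods.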


Year:  2011        PMID: 25364211      PMCID: PMC4214923     

Source DB:  PubMed          Journal:  Biostat Bioinforma Biomath        ISSN: 0976-1594


  8 in total

1.  Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage.

Authors:  Naijun Sha; Marina Vannucci; Mahlet G Tadesse; Philip J Brown; Ilaria Dragoni; Nick Davies; Tracy C Roberts; Andrea Contestabile; Mike Salmon; Chris Buckley; Francesco Falciani
Journal:  Biometrics       Date:  2004-09       Impact factor: 2.571

2.  Revealing strengths and weaknesses of methods for gene network inference.

Authors:  Daniel Marbach; Robert J Prill; Thomas Schaffter; Claudio Mattiussi; Dario Floreano; Gustavo Stolovitzky
Journal:  Proc Natl Acad Sci U S A       Date:  2010-03-22       Impact factor: 11.205

Review 3.  Applications of DNA microarrays in biology.

Authors:  Roland B Stoughton
Journal:  Annu Rev Biochem       Date:  2005       Impact factor: 23.643

4.  Linear models and empirical bayes methods for assessing differential expression in microarray experiments.

Authors:  Gordon K Smyth
Journal:  Stat Appl Genet Mol Biol       Date:  2004-02-12

5.  Classification of breast cancer versus normal samples from mass spectrometry profiles using linear discriminant analysis of important features selected by random forest.

Authors:  Somnath Datta
Journal:  Stat Appl Genet Mol Biol       Date:  2008-02-19

6.  Stochastic search gene suggestion: a Bayesian hierarchical model for gene mapping.

Authors:  Michael D Swartz; Marek Kimmel; Peter Mueller; Christopher I Amos
Journal:  Biometrics       Date:  2006-06       Impact factor: 2.571

7.  Creatine improves health and survival of mice.

Authors:  A Bender; J Beckers; I Schneider; S M Hölter; T Haack; T Ruthsatz; D M Vogt-Weisenhorn; L Becker; J Genius; D Rujescu; M Irmler; T Mijalski; M Mader; L Quintanilla-Martinez; H Fuchs; V Gailus-Durner; M Hrabé de Angelis; W Wurst; J Schmidt; T Klopstock
Journal:  Neurobiol Aging       Date:  2007-04-09       Impact factor: 4.673

8.  The null distribution of stochastic search gene suggestion: a Bayesian approach to gene mapping.

Authors:  Michael D Swartz; Sanjay Shete
Journal:  BMC Proc       Date:  2007-12-18