Caleb D Bastian (1), Grzegorz A Rempala (2). 1. Program in Applied and Computational Mathematics, Princeton University, Fine Hall, Washington Road, Princeton, NJ 08544. 2. Department of Biostatistics and the Cancer Center, Georgia Health Sciences University, Augusta, GA 30912.
Abstract
BACKGROUND: In the typical setting of gene-selection problems from high-dimensional data, e.g., gene expression data from microarray or next-generation sequencing-based technologies, an enormous volume of high-throughput data is generated, and there is often a need for a simple, computationally-inexpensive, non-parametric screening procedure that can quickly and accurately find a low-dimensional variable subset that preserves the biological information of the original very high-dimensional data (dimension p > 40,000). This is in contrast to the very sophisticated variable selection methods that are computationally expensive, need pre-processing routines, and often require calibration of priors. RESULTS: We present a tree-based sequential CART (S-CART) approach to variable selection in the binary classification setting and compare it against more sophisticated procedures using simulated and real biological data. On simulated data, we analyze S-CART performance versus (i) a random forest (RF), (ii) a fully-parametric Bayesian stochastic search variable selection (SSVS), and (iii) the moderated t-test statistic from the LIMMA package in R. The simulation study is based on a hierarchical Bayesian model in which dataset dimensionality, the percentage of significant variables, and substructure via dependency vary. Selection efficacy is measured through false-discovery and missed-discovery rates. In all scenarios, the S-CART method consistently outperforms SSVS and RF in both speed and detection accuracy. We demonstrate the utility of the S-CART technique both on simulated data and in a control-treatment mouse study. We show that the network analysis based on the S-CART-selected gene subset in essence recapitulates the biological findings of the study using only a fraction of the original set of genes considered in the study's analysis.
CONCLUSIONS: Relatively simple-minded gene selection algorithms like S-CART may, in practical circumstances, often be preferred over much more sophisticated ones. The advantage of the "greedy" selection methods employed by S-CART and the like is that they scale well with the problem size and require virtually no tuning or training, while remaining efficient at extracting the relevant information from microarray-like datasets containing a large number of redundant or irrelevant variables. AVAILABILITY: The MATLAB 7.4b code for the S-CART implementation is available for download from https://neyman.mcg.edu/posts/scart.zip.
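The abstract does not spell out the S-CART algorithm itself, but the "greedy sequential tree" idea it describes can be sketched as follows: repeatedly fit a shallow CART to the current variable set, record the variables actually used in splits, remove them from play, and refit. This is a hedged illustration only, not the authors' MATLAB implementation; the function name `s_cart_screen` and all parameter defaults are assumptions, and scikit-learn's `DecisionTreeClassifier` stands in for a generic CART fitter.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def s_cart_screen(X, y, n_rounds=5, max_depth=3, random_state=0):
    """Sketch of a sequential-CART screen (hypothetical, not the authors' code).

    Each round fits a shallow CART on the remaining variables, harvests the
    variables used in its internal splits, and removes them before refitting.
    Returns the indices of all harvested variables, in order of selection.
    """
    selected = []
    active = np.arange(X.shape[1])          # variables still in play
    for _ in range(n_rounds):
        if active.size == 0:
            break
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=random_state)
        tree.fit(X[:, active], y)
        # internal nodes store the split feature; leaves are marked negative
        used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
        if used.size == 0:                   # tree degenerated to a stump/leaf
            break
        selected.extend(active[used].tolist())
        active = np.delete(active, used)     # greedy removal before the refit
    return selected

# Toy usage: one informative variable (column 0) among 50 noise columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = (X[:, 0] > 0).astype(int)
picked = s_cart_screen(X, y, n_rounds=3)
```

Because each round discards the variables just used, the procedure makes a fixed small number of tree fits regardless of p, which is one plausible reading of the scalability claim in the conclusions.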