Li Chen1, Han Liu2, Jean-Pierre A Kocher3, Hongzhe Li4, Jun Chen3.
1. Division of Biomedical Statistics and Informatics and Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, USA; Department of Computer Science, Emory University, Atlanta, GA 30322, USA.
2. Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.
3. Division of Biomedical Statistics and Informatics and Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, USA.
4. Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA 19104, USA.
Abstract
A central theme of modern high-throughput genomic data analysis is to identify relevant genomic features and to build a predictive model from the selected features for tasks such as personalized medicine. Correlating the large number of 'omics' features with a given phenotype is particularly challenging because of the small sample size (n) and high dimensionality (p). To address this small n, large p problem, various forms of sparse regression models have been proposed that exploit the sparsity assumption. Among these, the network-constrained sparse regression model is of particular interest because it can use the prior graph/network structure in the omics data. Despite its potential usefulness for omics data analysis, no efficient R implementation is publicly available. Here we present an R software package, 'glmgraph', that implements graph-constrained regularization for both sparse linear regression and sparse logistic regression. We implement both the L1 penalty and the minimax concave penalty for variable selection, and the Laplacian penalty for coefficient smoothing. An efficient coordinate descent algorithm is used to solve the optimization problem. We demonstrate the use of the package by applying it to a human microbiome dataset, where the phylogenetic structure among bacterial taxa is available.
AVAILABILITY AND IMPLEMENTATION: 'glmgraph' is implemented in R and C++ (Armadillo) and is publicly available on CRAN.
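To make the combination of an L1 penalty, a Laplacian penalty and coordinate descent concrete, the following is a minimal Python sketch for the linear regression case. It is not the glmgraph implementation (which is in R and C++). The objective scaling, (1/2n)||y - Xb||^2 + lam1*||b||_1 + (lam2/2)*b'Lb, the use of an unnormalized Laplacian, and the names `graph_lasso_cd`, `chain_laplacian` and `soft_threshold` are all assumptions made for illustration. The minimax concave penalty, the logistic model, the intercept and feature standardization are omitted.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the L1 (lasso) update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def chain_laplacian(p):
    """Unnormalized Laplacian of a chain graph 0-1-...-(p-1) (illustrative only)."""
    A = np.zeros((p, p))
    for i in range(p - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def graph_lasso_cd(X, y, L, lam1, lam2, n_iter=500, tol=1e-8):
    """Coordinate descent for the assumed objective
    (1/2n)||y - X b||^2 + lam1*||b||_1 + (lam2/2) * b' L b."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = np.asarray(y, dtype=float).copy()  # y - X @ beta, with beta = 0
    col_sq = (X ** 2).sum(axis=0) / n          # (1/n) * x_j' x_j
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            curv = col_sq[j] + lam2 * L[j, j]  # curvature of the smooth part in b_j
            if curv == 0.0:
                continue
            # Contribution of the neighbours of j through the Laplacian.
            off_diag = L[j] @ beta - L[j, j] * beta[j]
            z = X[:, j] @ resid / n + col_sq[j] * beta[j] - lam2 * off_diag
            new_bj = soft_threshold(z, lam1) / curv
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta       # keep the residual in sync
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Toy usage: two neighbouring features on a chain carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = X @ np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=80)
beta_hat = graph_lasso_cd(X, y, chain_laplacian(6), lam1=0.05, lam2=0.1)
```

In this sketch, `lam1` controls how many coefficients are set exactly to zero, while `lam2` pulls the coefficients of connected features toward each other. Setting `lam2 = 0` reduces the update to the ordinary lasso coordinate step.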