Simultaneous gene clustering and subset selection for sample classification via MDL.

Rebecka Jörnsten, Bin Yu.

Abstract

MOTIVATION: Microarray technology allows the simultaneous monitoring of thousands of genes for each sample. The resulting high-dimensional gene expression data can be used to study similarities of gene expression profiles across different samples and so form a gene clustering. The clusters may be indicative of genetic pathways. Parallel to gene clustering is the important application of sample classification based on all or selected gene expressions. Gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, such separation of the two tasks may obscure informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen's MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels). The final output of the proposed algorithm is a sparse and interpretable classification rule based on cluster centroids or the genes closest to the centroids.
RESULTS: Our algorithm for simultaneous gene clustering and subset selection for classification is applied to three publicly available data sets. For all three data sets, we obtain sparse and interpretable classification models based on centroids of clusters. At the same time, these models give test error rates competitive with the best reported methods. Compared with classification models based on single gene selections, our rules are stable in the sense that the number of clusters has small variability and the centroids of the clusters are well correlated (or consistent) across different cross-validation samples. We also discuss models where the centroids of clusters are replaced with the genes closest to the centroids. These models show test error rates comparable to models based on single gene selection, but are sparser as well as more stable. Moreover, we comment on how the inclusion of a classification criterion affects the gene clustering, bringing out class-informative structure in the data.
AVAILABILITY: The methods presented in this paper have been implemented in the R language. The source code is available from the first author.
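
ILLUSTRATION: The general idea in the abstract, clustering genes and then classifying samples from a selected subset of cluster centroids, can be illustrated with a toy sketch. This is not the authors' algorithm or their R implementation. It assumes plain k-means for the gene clustering, a nearest-class-mean rule for classification, and a crude two-part code length (bits to list the misclassified samples plus bits to name the chosen clusters) as a stand-in for the paper's MDL criterion, which is not reproduced here. All function names and the data are hypothetical.

```python
import math
from itertools import combinations


def kmeans_genes(X, k, iters=20):
    """Cluster the rows of X (genes x samples) with plain k-means.

    Assumption: deterministic start, the first k genes are the initial centroids.
    """
    centroids = [row[:] for row in X[:k]]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assign each gene to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
            for row in X
        ]
        # Recompute each centroid as the mean profile of its member genes.
        for c in range(k):
            members = [row for row, lab in zip(X, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def nearest_mean_errors(features, y):
    """Count training errors of a nearest-class-mean rule on per-sample features."""
    classes = sorted(set(y))
    means = {}
    for c in classes:
        rows = [f for f, lab in zip(features, y) if lab == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
    errors = 0
    for f, lab in zip(features, y):
        pred = min(classes, key=lambda c: sum((a - b) ** 2 for a, b in zip(f, means[c])))
        errors += pred != lab
    return errors


def select_clusters(centroids, y):
    """Exhaustively pick the centroid subset with the shortest toy code length.

    Toy score (an assumption, not the paper's criterion), valid for k >= 2:
    errors * log2(n) bits to list misclassified samples,
    plus size * log2(k) bits to name the selected clusters.
    """
    k, n = len(centroids), len(y)
    best = None
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            # One feature per selected cluster: the centroid value at each sample.
            features = [[centroids[c][s] for c in subset] for s in range(n)]
            errors = nearest_mean_errors(features, y)
            bits = errors * math.log2(n) + size * math.log2(k)
            if best is None or bits < best[0]:
                best = (bits, subset, errors)
    return best


# Made-up data: 4 genes x 6 samples; genes 0 and 2 track the class, genes 1 and 3 are flat.
X = [
    [0.0, 0.1, 0.2, 2.0, 2.1, 1.9],
    [1.0, 1.1, 0.9, 1.0, 1.1, 0.9],
    [0.1, 0.0, 0.1, 2.1, 2.0, 2.0],
    [1.1, 1.0, 1.0, 0.9, 1.0, 1.1],
]
y = [0, 0, 0, 1, 1, 1]

labels, centroids = kmeans_genes(X, k=2)
bits, subset, errors = select_clusters(centroids, y)
print(labels, subset, errors)
```

On this toy data the two class-tracking genes fall into one cluster, and the selection keeps only that cluster's centroid, since adding the flat cluster costs extra bits without removing any errors. Note that the sketch clusters first and selects afterwards, whereas the paper describes doing both simultaneously.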

Year:  2003        PMID: 12801870     DOI: 10.1093/bioinformatics/btg039

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  12 in total

1.  Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR.

Authors:  Howard D Bondell; Brian J Reich
Journal:  Biometrics       Date:  2007-06-30       Impact factor: 2.571

2.  Online Decentralized Leverage Score Sampling for Streaming Multidimensional Time Series.

Authors:  Rui Xie; Zengyan Wang; Shuyang Bai; Ping Ma; Wenxuan Zhong
Journal:  Proc Mach Learn Res       Date:  2019-04

3.  Hypermethylation of genes for diagnosis and risk stratification of prostate cancer.

Authors:  Donkena Krishna Vanaja; Mathias Ehrich; Dirk Van den Boom; John C Cheville; R Jeffrey Karnes; Donald J Tindall; Charles R Cantor; Charles Y F Young
Journal:  Cancer Invest       Date:  2009-06       Impact factor: 2.176

4.  A non-parametric approach to population structure inference using multilocus genotypes.

Authors:  Nianjun Liu; Hongyu Zhao
Journal:  Hum Genomics       Date:  2006-06       Impact factor: 4.639

5.  Metabolomics-based discovery of diagnostic biomarkers for onchocerciasis.

Authors:  Judith R Denery; Ashlee A K Nunes; Mark S Hixon; Tobin J Dickerson; Kim D Janda
Journal:  PLoS Negl Trop Dis       Date:  2010-10-05

6.  Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection.

Authors:  Yong Mao; Xiaobo Zhou; Daoying Pi; Youxian Sun; Stephen T C Wong
Journal:  J Biomed Biotechnol       Date:  2005-06-30

7.  Bayesian profiling of molecular signatures to predict event times.

Authors:  Dabao Zhang; Min Zhang
Journal:  Theor Biol Med Model       Date:  2007-01-19       Impact factor: 2.432

8.  Optimal reference sequence selection for genome assembly using minimum description length principle.

Authors:  Bilal Wajid; Erchin Serpedin; Mohamed Nounou; Hazem Nounou
Journal:  EURASIP J Bioinform Syst Biol       Date:  2012-11-27

9.  A stable iterative method for refining discriminative gene clusters.

Authors:  Min Xu; Mengxia Zhu; Louxin Zhang
Journal:  BMC Genomics       Date:  2008-09-16       Impact factor: 3.969

10.  Altered expression of mitochondrial and extracellular matrix genes in the heart of human fetuses with chromosome 21 trisomy.

Authors:  Anna Conti; Floriana Fabbrini; Paola D'Agostino; Rosa Negri; Dario Greco; Rita Genesio; Maria D'Armiento; Carlo Olla; Dario Paladini; Mariastella Zannini; Lucio Nitsch
Journal:  BMC Genomics       Date:  2007-08-07       Impact factor: 3.969
