Comparisons and validation of statistical clustering techniques for microarray gene expression data.

Susmita Datta, Somnath Datta.

Abstract

MOTIVATION: With the advent of microarray chip technology, large data sets are emerging that contain the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While hierarchical clustering (UPGMA) with correlation 'distance' has been the most common choice in microarray studies, there are many more clustering algorithms to choose from in the pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm for grouping genes based on their expression profiles.
RESULTS: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performances on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the 'best' method is dependent on the exact validation strategy and the number of clusters to be used, overall Diana appears to be a solid performer. Interestingly, the performances of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on what validation measure one employs. Next it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes.
AVAILABILITY: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have an S+ implementation in the library MASS. S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation
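
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of hierarchical clustering (UPGMA, i.e. average linkage) with correlation 'distance', taken here as one minus the Pearson correlation between two expression profiles. It is not the authors' S+ code; the function names, the toy profiles, and the choice of two clusters are assumptions made purely for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Correlation 'distance': 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


def upgma(profiles, k):
    """Average-linkage agglomerative clustering, stopped when k clusters remain.

    Returns a list of clusters, each a sorted list of row indices into `profiles`.
    """
    n = len(profiles)
    dist = [[correlation_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def average_link(a, b):
        # UPGMA: mean of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = average_link(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical toy data: four "genes" measured at seven time points.
profiles = [
    [0.1, 0.5, 1.2, 2.0, 2.4, 2.5, 2.6],  # rising
    [0.0, 0.4, 1.0, 1.9, 2.2, 2.6, 2.7],  # rising
    [2.5, 2.1, 1.5, 0.9, 0.4, 0.2, 0.1],  # falling
    [2.6, 2.0, 1.6, 1.0, 0.5, 0.1, 0.0],  # falling
]
clusters = upgma(profiles, 2)
```

On this toy input the two rising profiles and the two falling profiles end up in separate clusters, because correlation distance compares the shape of the temporal pattern rather than its absolute level.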

Year:  2003        PMID: 12611800     DOI: 10.1093/bioinformatics/btg025

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  72 in total

Review 1.  Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments.

Authors:  Yulan Liang; Arpad Kelemen
Journal:  Funct Integr Genomics       Date:  2005-11-15       Impact factor: 3.410

2.  RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble.

Authors:  Ye Ding; Chi Yu Chan; Charles E Lawrence
Journal:  RNA       Date:  2005-08       Impact factor: 4.942

Review 3.  Systems interface biology.

Authors:  Francis J Doyle; Jörg Stelling
Journal:  J R Soc Interface       Date:  2006-10-22       Impact factor: 4.118

4.  Quantifying gene network connectivity in silico: scalability and accuracy of a modular approach.

Authors:  N Yalamanchili; D E Zak; B A Ogunnaike; J S Schwaber; A Kriete; B N Kholodenko
Journal:  Syst Biol (Stevenage)       Date:  2006-07

5.  Analysis of time-series gene expression data: methods, challenges, and opportunities.

Authors:  I P Androulakis; E Yang; R R Almon
Journal:  Annu Rev Biomed Eng       Date:  2007       Impact factor: 9.590

6.  Class-specific correlations of gene expressions: identification and their effects on clustering analyses.

Authors:  Jigang Zhang; Jian Li; Hongwen Deng
Journal:  Am J Hum Genet       Date:  2008-08       Impact factor: 11.025

7.  Heritable clustering and pathway discovery in breast cancer integrating epigenetic and phenotypic data.

Authors:  Zailong Wang; Pearlly Yan; Dustin Potter; Charis Eng; Tim H-M Huang; Shili Lin
Journal:  BMC Bioinformatics       Date:  2007-02-01       Impact factor: 3.169

Review 8.  Systems analysis of high-throughput data.

Authors:  Rosemary Braun
Journal:  Adv Exp Med Biol       Date:  2014       Impact factor: 2.622

9.  In vivo endotoxin synchronizes and suppresses clock gene expression in human peripheral blood leukocytes.

Authors:  Beatrice Haimovich; Jacqueline Calvano; Adrian D Haimovich; Steve E Calvano; Susette M Coyle; Stephen F Lowry
Journal:  Crit Care Med       Date:  2010-03       Impact factor: 7.598

10.  Endovascular Biopsy: In Vivo Cerebral Aneurysm Endothelial Cell Sampling and Gene Expression Analysis.

Authors:  Daniel L Cooke; David B McCoy; Van V Halbach; Steven W Hetts; Matthew R Amans; Christopher F Dowd; Randall T Higashida; Devon Lawson; Jeffrey Nelson; Chih-Yang Wang; Helen Kim; Zena Werb; Charles McCulloch; Tomoki Hashimoto; Hua Su; Zhengda Sun
Journal:  Transl Stroke Res       Date:  2017-09-13       Impact factor: 6.829
