Taegyun Yun, Taeho Hwang, Kihoon Cha, Gwan-Su Yi.
Abstract
Large microarray data sets have recently become common. However, most available clustering methods cannot easily handle them because of their high computational complexity and memory requirements. Furthermore, typical clustering methods construct oversimplified clusters that ignore subtle but meaningful changes in the expression patterns present in large microarray data sets. An efficient clustering method is needed that identifies both absolute expression differences and expression profile patterns at different expression levels in large microarray data sets. This study presents CLIC, which meets the requirements of clustering analysis particularly for, but not limited to, large microarray data sets. CLIC is based on a novel concept in which genes are first clustered in individual dimensions, and the ordinal labels of the clusters in each dimension are then used for further full dimension-wide clustering. CLIC enables iterative sub-clustering into more homogeneous groups and the identification of common expression patterns among genes that are separated into different groups because of large differences in their expression levels. In addition, the clustering computation is parallelized, the number of clusters is detected automatically, and functional enrichment is provided for each cluster and pattern. CLIC is freely available at http://gexp2.kaist.ac.kr/clic.
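The idea of clustering each dimension separately and keeping only ordinal cluster labels can be illustrated with a minimal sketch. It is not the CLIC implementation: a plain 1D k-means stands in for the unspecified 1D clustering step, the number of clusters per array is passed in by hand rather than chosen by a validity index, and the function names `ordinal_labels_1d` and `index_matrix` are hypothetical.

```python
import numpy as np

def ordinal_labels_1d(values, k, n_iter=50):
    """Cluster one array (column) into k groups and return ordinal labels.

    Assumption: plain 1D k-means is used only for illustration; k is
    supplied by the caller instead of being chosen by a validity index.
    """
    values = np.asarray(values, dtype=float)
    # Start centroids at evenly spaced interior quantiles.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k + 2)[1:-1])
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = values[labels == j]
            if members.size:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean()
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Relabel so that 0 is the lowest-expression cluster, 1 the next, and so on.
    order = np.argsort(centroids)
    rank = np.empty(k, dtype=int)
    rank[order] = np.arange(k)
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return rank[labels]

def index_matrix(expr, k_per_column):
    """Build a genes x arrays matrix of ordinal cluster labels."""
    expr = np.asarray(expr, dtype=float)
    cols = [ordinal_labels_1d(expr[:, j], k_per_column[j])
            for j in range(expr.shape[1])]
    return np.column_stack(cols)
```

Because the labels are ranked by cluster mean, two genes with the same row of labels share the same coarse expression level in every array, which is what makes the label rows usable for the later full dimension-wide step.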
Year: 2010 PMID: 20529873 PMCID: PMC2896182 DOI: 10.1093/nar/gkq516
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic diagram of individual dimension-based clustering. Genes in individual dimensions (arrays) are clustered independently, each with the optimal number of clusters k that maximizes the internal cluster validity Si. After 1D clustering, the genes (rows) are aligned by their cluster indices successively, starting from the column with the highest validity, to build a combined index matrix. To identify the cluster boundaries in the combined matrix, the cluster boundary distance is measured for every pair of adjacent genes. Starting from the site with the largest cluster boundary distance, cluster boundaries are selected successively until the boundary selection no longer increases the average cluster homogeneity. The cluster homogeneity is measured as the average Pearson correlation coefficient of the expression values of the genes in each cluster.
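The boundary selection described in the caption can be sketched as follows. Several details are assumptions, because the caption does not define them: aligning genes is taken to mean a lexicographic sort of the label rows, the cluster boundary distance is taken to be the L1 distance between adjacent label rows, and a single-gene cluster is given homogeneity 0 so that splitting off singletons is never rewarded. The names `homogeneity` and `select_boundaries` are hypothetical, and genes are assumed to have non-constant expression so that the Pearson correlation is defined.

```python
import numpy as np

def homogeneity(expr_rows):
    """Average pairwise Pearson correlation of the genes (rows) in one cluster."""
    if expr_rows.shape[0] < 2:
        return 0.0  # assumption: a single-gene cluster earns no homogeneity
    corr = np.corrcoef(expr_rows)
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float(upper.mean())

def select_boundaries(expr, labels, column_order):
    """Sort genes by label rows, then cut greedily at the largest boundary distances.

    column_order lists the columns from highest to lowest validity.
    Returns the gene order and a cluster id for each sorted gene.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    # np.lexsort uses the LAST key as the primary key, so reverse the order.
    keys = [labels[:, c] for c in reversed(column_order)]
    order = np.lexsort(keys)
    expr, labels = expr[order], labels[order]
    # Assumed boundary distance: L1 distance between adjacent label rows.
    dist = np.abs(np.diff(labels, axis=0)).sum(axis=1)
    candidates = [i + 1 for i in np.argsort(-dist, kind="stable") if dist[i] > 0]

    def avg_homogeneity(cuts):
        parts = np.split(expr, sorted(cuts))
        return float(np.mean([homogeneity(p) for p in parts]))

    cuts, best = [], avg_homogeneity([])
    for cut in candidates:
        score = avg_homogeneity(cuts + [cut])
        if score <= best:  # stop once a new boundary no longer helps
            break
        cuts, best = cuts + [cut], score
    cluster_ids = np.zeros(len(expr), dtype=int)
    for cut in cuts:
        cluster_ids[cut:] += 1
    return order, cluster_ids
```

The greedy loop mirrors the stopping rule in the caption: boundaries are tried from the largest distance downward, and selection ends at the first boundary that fails to raise the average homogeneity.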
Table 1. Adjusted Rand indexes of clustering algorithms for three different data sets
| | Simulated (1000, 100, …) | Simulated (100, 33, …) | Yeast galactose (205, …) |
|---|---|---|---|
| CLIC | 1 (…) | 1 (…) | 0.97 (…) |
| k-means | 0.68 (…) | 0.88 (…) | 0.87 (…) |
| HPCluster | 1 (…) | 1 (…) | 0.83 (…) |
| CRC | 0.90 (…) | 0.46 (…) | 0.97 (…) |
| MCLUST | 1 (…) | 0.98 (…) | 0.97 (…) |
| | 0.72 (…) | 0.20 (…) | 0.95 (…) |
| CLICK | NA (1) | 0 (…) | 0.81 (…) |
The details of the yeast galactose data set and the simulated data sets are described in the manuscript and in the Supplementary Data. Each data set name is followed by (the number of genes, the number of samples, and the number of true clusters).
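For readers unfamiliar with the metric, the adjusted Rand index compares a clustering with the true labels by counting gene pairs that are grouped together in both, corrected for chance agreement. The sketch below uses the standard contingency-table formula; the function name is hypothetical, and returning 1.0 in the degenerate cases is an assumption that follows common library behavior.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(true_labels)
    total = comb(n, 2)
    if total == 0:
        return 1.0  # assumption: fewer than two items counts as perfect agreement
    cells = Counter(zip(true_labels, pred_labels))  # contingency table entries
    rows = Counter(true_labels)                     # sizes of the true clusters
    cols = Counter(pred_labels)                     # sizes of the predicted clusters
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total          # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

A value of 1 means the two partitions are identical up to renaming of the clusters, and values near 0 mean the agreement is no better than chance.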
Table 2. Execution times of clustering algorithms for data sets of different sizes
| | 5000 | 10 000 | 15 000 | 20 000 | 25 000 | 30 000 |
|---|---|---|---|---|---|---|
| CLIC | 69 | 132 | 201 | 273 | 345 | 466 |
| CRC | 9709 | 28 871 | NC | NC | NC | NC |
| MCLUST | 2023 | 6533 | 12 517 | 23 771 | 32 861 | 46 972 |
| | 185 | 432 | 660 | 1028 | 1183 | 1781 |
| CLICK | 559 | 930 | 481 | 373 | 325 | 587 |
The HPCluster and k-means methods listed in Table 1 are not included in this comparison because they require a separate procedure to determine the number of clusters. Execution times are measured in seconds. NC: clustering analysis was not completed.
Figure 2. Example study of NCI 60 data with CLIC. (a) Heatmap (left: gene expression levels; right: pattern of cluster indices) and some of the top-ranked functionally enriched terms for cluster 1 and its sub-cluster 9. Functional terms uniquely enriched in the selected cluster 1 at a given threshold level are highlighted, as are functional terms identified in the sub-cluster but not in its original cluster 1. The homogeneity of the sub-cluster (0.987) is dramatically higher than that of its parent cluster (0.4). (b) Heatmap and some of the top-ranked functionally enriched terms for pattern 190 (upper table), and the terms for three sub-clusters that include the genes in pattern 190 (lower table).