Jeremy J Jay, John D Eblen, Yun Zhang, Mikael Benson, Andy D Perkins, Arnold M Saxton, Brynn H Voy, Elissa J Chesler, Michael A Langston.
Abstract
BACKGROUND: A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering, to graphical approaches such as k-clique communities, weighted gene co-expression networks (WGCNA) and paraclique. Comparison of these methods to evaluate their relative effectiveness provides guidance to algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices that have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph theoretical clustering algorithms are compared using well-characterized transcriptomic data at a genome scale from Saccharomyces cerevisiae.
Year: 2012 PMID: 22759431 PMCID: PMC3382433 DOI: 10.1186/1471-2105-13-S10-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Overview of algorithms tested
| Method | Type | Result Range | Pre-specified Number of Clusters (k) | Allows Overlapping Clusters | Thresholded Correlations | Parameters Tested |
|---|---|---|---|---|---|---|
| Ward | Hierarchical | | Y | | | Average cluster size |
| Average | Hierarchical | | Y | | | Average cluster size |
| McQuitty | Hierarchical | | Y | | | Average cluster size |
| Complete | Hierarchical | | Y | | | Average cluster size |
| k-Means | Partitioning | | Y | | | Number of clusters |
| SOM | Neural network | | Y | | | Grid size/type (a) |
| QT Clust | Partitioning | 24-385 | | | | Maximum cluster diameters |
| CAST | Graph-based | 1-6162 | | | Y | Threshold |
| CLICK | Graph-based | 4-32 | | | | Cluster homogeneity |
| WGCNA | Graph-based | 4-160 | | | | Power, Module detection method |
| NNN | Graph-based | 23-52 | | Y (b) | | Minimum neighborhood size |
| k-Clique Communities | Graph-based | 1-68 | | Y | Y | Threshold, Clique size |
| Maximal Clique | Graph-based | 1,000-64,000 | | Y | Y | Threshold |
| Paraclique | Graph-based | 8-615 | | Y (c) | Y | Threshold, Glom factor |
Clustering methods are listed by name, along with the type of algorithm and a general listing of the parameters tested. The number of clusters in the result, given the parameters and data set tested, is provided only as an approximate figure. Empty results are not included. (a) Grid type can be either rectangular or hexagonal, in an m x n layout. We tested both types, but used an m x m layout for simplicity (k = m). (b) Rarely occurs in practice. On this data set we observed no overlap with NNN. (c) Optional, not used in this analysis.
Algorithms ranked by quartile comparisons
| Clustering Method | Average Quartile | Small (3-10 genes) Quartile | Small (3-10 genes) BAT5 Jaccard | Medium (11-100 genes) Quartile | Medium (11-100 genes) BAT5 Jaccard | Large (101-1000 genes) Quartile | Large (101-1000 genes) BAT5 Jaccard |
|---|---|---|---|---|---|---|---|
| K-Clique Communities | 1.00 | 1 | 0.7531 | 1 | 0.4465 | 1 | 0.4915 |
| Maximal Clique | 1.00 | 1 | 0.8433 | 1 | 0.4081 | | 0.0000 |
| Paraclique | 1.00 | 1 | 0.7576 | 1 | 0.4285 | 1 | 0.4169 |
| Ward (H) | 1.33 | 2 | 0.5782 | 1 | 0.4011 | 1 | 0.5723 |
| CAST | 1.67 | 1 | 0.7455 | 3 | 0.3146 | 1 | 0.4994 |
| QT Clust | 2.00 | 2 | 0.5473 | 2 | 0.3670 | 2 | 0.3944 |
| Complete (H) | 2.33 | 3 | 0.3933 | 2 | 0.3677 | 2 | 0.3419 |
| NNN | 2.67 | 2 | 0.5521 | 2 | 0.3705 | 4 | 0.2406 |
| K-Means | 3.00 | 4 | 0.2573 | 3 | 0.3015 | 2 | 0.3463 |
| SOM | 3.00 | 4 | 0.3260 | 2 | 0.3286 | 3 | 0.3282 |
| WGCNA | 3.00 | 3 | 0.4391 | 3 | 0.3106 | 3 | 0.2949 |
| Average (H) | 3.33 | 3 | 0.4087 | 4 | 0.2792 | 3 | 0.3037 |
| McQuitty (H) | 3.33 | 3 | 0.4594 | 3 | 0.3065 | 4 | 0.2868 |
| CLICK | 4.00 | 4 | 0.0339 | 4 | 0.1453 | 4 | 0.2817 |
Results from Figure 1 are displayed by quartile (1 = top 25%, 4 = bottom 25%), with missing values for maximal clique discarded. (H) denotes a hierarchical clustering agglomeration method.
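As a rough illustration of the Average Quartile column, the sketch below averages the per-size-class quartiles and skips a missing entry, as the note above describes for maximal clique. The function name `average_quartile` and the use of `None` to mark a missing quartile are assumptions made for this example; this is not the authors' code.

```python
from typing import Optional, Sequence


def average_quartile(quartiles: Sequence[Optional[int]]) -> float:
    """Mean of the available quartile ranks, rounded to two decimals.

    None marks a missing value (hypothetical convention for this sketch)
    and is discarded before averaging.
    """
    present = [q for q in quartiles if q is not None]
    if not present:
        raise ValueError("no quartile values available")
    return round(sum(present) / len(present), 2)


# Quartiles (small, medium, large) copied from the table above.
ward = average_quartile([2, 1, 1])
cast = average_quartile([1, 3, 1])
maximal_clique = average_quartile([1, 1, None])  # large class is missing
print(ward, cast, maximal_clique)
```

Under these assumptions the three values agree with the Average Quartile entries listed for Ward (H), CAST and Maximal Clique.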
Figure 1. Algorithms ranked by best average top 5 clusters (BAT5). BAT5 Jaccard values are shown for each clustering method and cluster size classification. (H) = Hierarchical clustering agglomeration method.
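The caption does not spell out how a BAT5 Jaccard value is computed, so the following is only a sketch under stated assumptions: each cluster is scored by its highest Jaccard similarity against a collection of annotation gene sets, and BAT5 is the mean of the five highest cluster scores. The helper names (`jaccard`, `best_jaccard`, `bat5`), the toy gene identifiers, and the choice to return 0.0 when a method yields no clusters are all hypothetical.

```python
from typing import Iterable, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_jaccard(cluster: Set[str], annotations: Iterable[Set[str]]) -> float:
    """Highest Jaccard score of one cluster against any annotation gene set."""
    return max((jaccard(cluster, ann) for ann in annotations), default=0.0)


def bat5(clusters: Sequence[Set[str]], annotations: Sequence[Set[str]],
         top_n: int = 5) -> float:
    """Assumed BAT5: mean of the top_n best per-cluster Jaccard scores.

    If fewer than top_n clusters exist, the available scores are averaged;
    with no clusters at all the result is 0.0 (both are assumptions).
    """
    scores = sorted((best_jaccard(c, annotations) for c in clusters), reverse=True)
    top = scores[:top_n]
    return sum(top) / len(top) if top else 0.0


# Toy data with made-up gene identifiers.
toy_annotations = [{"g1", "g2", "g3", "g4"}, {"g4", "g5"}]
toy_clusters = [{"g1", "g2", "g3"}, {"g4", "g5"}]
toy_score = bat5(toy_clusters, toy_annotations)
empty_score = bat5([], toy_annotations)
print(toy_score, empty_score)
```

In the toy data the first cluster matches the first annotation with a Jaccard of 3/4 and the second cluster matches the second annotation exactly, so the averaged score is 0.875.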
Figure 3. Number of clusters produced by each method. The number of clusters produced by each method at the optimal parameter settings for each size class is displayed on a log10 scale. Note that some methods produced a single cluster for one or more of the size classes, which appears as absent on the graph. Maximal clique generated no clusters in the large size class, also showing as 0 on the graph.
Figure 4. Average cluster size produced by each method. The average size of the clusters produced by each method at the optimal parameter settings for each size class is shown on a log10 scale.
Runtimes for each clustering method
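| Clustering Method | Small (3-10 genes) Parameters | Small (3-10 genes) Runtime | Medium (11-100 genes) Parameters | Medium (11-100 genes) Runtime | Large (101-1000 genes) Parameters | Large (101-1000 genes) Runtime | Software |
|---|---|---|---|---|---|---|---|
| K-Clique Communities | 0.80/03 | | 0.80/57 | | 0.80/48 | | Standalone,*** |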
| Maximal Clique | 0.80 | 26.510 | 0.80 | 26.510 | N/A | N/A | Standalone,*** |
| Paraclique | 0.80/01 | 5.120 | 0.80/09 | 0.780 | 0.60/09 | 9.050 | Standalone,*** |
| Ward (H) | N/A | 2.863 | N/A | 2.863 | N/A | 2.863 | R 2.13.0 |
| CAST | 0.875 | 37.324 | 0.85 | 34.242 | 0.90 | 34.121 | MeV 4.5.1 |
| QT Clust | 030 | 6 904.518 | 035 | 6 759.073 | 050 | 5 559.467 | MeV 4.5.1 |
| Complete (H) | N/A | 2.721 | N/A | 2.721 | N/A | 2.721 | R 2.13.0 |
| NNN | 11 | 25.550 | 24 | 30.610 | 27 | 34.370 | Standalone,*** |
| K-Means | 617 | 6 711.143 | 308 | 4 060.351 | 21 | 1 068.069 | R 2.13.0 |
| SOM | 25/r | 6.159 | 25/h | 6.121 | 18/r | 2.956 | MeV 4.5.1 |
| WGCNA | 2/10 | 79.430 | 1/06 | 80.962 | 2/06 | 80.962 | R 2.13.0 |
| Average (H) | N/A | 2.452 | N/A | 2.452 | N/A | 2.452 | R 2.13.0 |
| McQuitty (H) | N/A | 2.445 | N/A | 2.445 | N/A | 2.445 | R 2.13.0 |
| CLICK | 015 | 38.270 | 060 | 45.310 | 065 | 52.570 | Standalone,*** |
Parameters used to produce the best Jaccard score, and the associated runtime for the given method and parameters, are displayed. Specific parameter descriptions are listed in Table 1. MeV times are those reported by the GUI. Hierarchical methods use the "flashClust" package for R, which is a C++ implementation of the standard "hclust" package. Hierarchical timings do not include the time for tree cutting (which is negligible). The flashClust and WGCNA packages were downloaded from the CRAN repository on June 22, 2011. Versions reported refer to the version used for runtime calculation; in some cases, previous versions were used to generate clusters for scoring. r = rectangular, h = hexagonal. * A GUI-based graphical tool that is no longer maintained was used to generate clusters for Jaccard scoring, while the latest R implementation was used for timing. ** Total elapsed time reported by the system.
Figure 2. Algorithms ranked by prominent annotations. 112 annotations received a Jaccard score above 0.25. Each clustering method was ranked by the average of its highest Jaccard score for each of these annotations. (H) = Hierarchical agglomeration method.
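The ranking described in this caption can be sketched as follows. This is an illustration rather than the authors' procedure: it assumes an annotation counts as prominent when at least one cluster from any method scores above the cutoff against it, and all names (`rank_by_prominent_annotations`, `method_x`, `method_y`, the gene identifiers) are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_prominent_annotations(
    results: Dict[str, List[Set[str]]],
    annotations: Dict[str, Set[str]],
    cutoff: float = 0.25,
) -> List[Tuple[str, float]]:
    """Rank methods by their mean best Jaccard over the prominent annotations."""
    # best[method][annotation] = highest Jaccard of any cluster from that method
    best = {
        method: {
            name: max((jaccard(c, genes) for c in clusters), default=0.0)
            for name, genes in annotations.items()
        }
        for method, clusters in results.items()
    }
    # Assumed definition: prominent = some method scores above the cutoff.
    prominent = [
        name for name in annotations
        if any(best[method][name] > cutoff for method in best)
    ]
    if not prominent:
        return []
    means = {
        method: sum(best[method][name] for name in prominent) / len(prominent)
        for method in best
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy data with made-up methods, annotations and gene identifiers.
toy_annotations = {
    "A": {"g1", "g2", "g3"},
    "B": {"g4", "g5"},
    "C": {"g9", "g10", "g11", "g12"},
}
toy_results = {
    "method_x": [{"g1", "g2", "g3"}, {"g4", "g5", "g6"}],
    "method_y": [{"g1", "g2"}, {"g4"}, {"g9"}],
}
ranking = rank_by_prominent_annotations(toy_results, toy_annotations)
print(ranking)
```

In the toy data annotation C never scores above 0.25 and is dropped, so the methods are compared only on annotations A and B.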