Yuping Lu, Charles A Phillips, Michael A Langston.
Abstract
BACKGROUND: Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change?
Keywords: Clustering algorithms; Paraclique; Robustness
Year: 2019 PMID: 31874625 PMCID: PMC6929270 DOI: 10.1186/s12859-019-3089-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Clusters produced by three runs of a clustering algorithm
Clustering methods tested for robustness
| Algorithm | Type | Setting | Implementation |
|---|---|---|---|
| Average | Hierarchical | Number of clusters | R 3.2.3 |
| Complete | Hierarchical | Number of clusters | R 3.2.3 |
| McQuitty | Hierarchical | Number of clusters | R 3.2.3 |
| Ward | Hierarchical | Number of clusters | R 3.2.3 |
| CLICK | Graph-based | Cluster homogeneity | Expander4 |
| NNN | Graph-based | Min neighborhood size | Java |
| Paraclique | Graph-based | Starting clique | C++ |
| WGCNA | Graph-based | Power | R 3.2.3 |
| K-means | Partitioning | Number of clusters | R 3.2.3 |
| QT Clustering | Partitioning | Max cluster diameter | R 3.2.3 |
| SOM | Neural network | Grid type/size | R 3.2.3 |
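
Each algorithm in the table is paired with one setting that is varied between runs. As a rough illustration of how variation across such runs might be quantified, the sketch below scores how consistently pairs of items are placed in the same cluster. This is an assumed, illustrative measure and not the robustness metric defined by the authors; the function names `co_clustered_pairs` and `pair_stability` and the toy gene labels are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered item pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair one canonical orientation.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pair_stability(runs):
    """Illustrative stability score in [0, 1] (an assumption, not the paper's metric).

    For every pair co-clustered in at least one run, take the fraction of
    runs in which that pair is co-clustered, then average over those pairs.
    """
    per_run = [co_clustered_pairs(clusters) for clusters in runs]
    all_pairs = set().union(*per_run)
    if not all_pairs:
        return 1.0  # no pair is ever co-clustered, so nothing varies
    total = sum(sum(pair in pairs for pairs in per_run) for pair in all_pairs)
    return total / (len(all_pairs) * len(runs))


# Three runs of one algorithm under different (hypothetical) setting values.
runs = [
    [{"g1", "g2", "g3"}, {"g4", "g5"}],
    [{"g1", "g2"}, {"g3", "g4", "g5"}],
    [{"g1", "g2", "g3"}, {"g4", "g5"}],
]
score = pair_stability(runs)
```

A score of 1.0 means every run produced the same co-clustered pairs; lower values indicate that cluster membership shifts as the setting changes.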
Gene expression datasets tested in this study
| Dataset | Organism | Threshold | Vertices | Edges |
|---|---|---|---|---|
| GDS516 | Drosophila melanogaster | 0.89 | 3980 | 195322 |
| GDS2485 | Drosophila melanogaster | 0.91 | 4604 | 30412 |
| GDS2504 | Drosophila melanogaster | 0.81 | 7888 | 191715 |
| GDS2674 | Drosophila melanogaster | 0.95 | 3334 | 5820 |
| GDS1842 | Drosophila melanogaster | 0.91 | 2307 | 4589 |
| GDS653 | Drosophila melanogaster | 0.95 | 1688 | 3368 |
| GDS664 | Drosophila melanogaster | 0.8 | 14008 | 2298635 |
| GDS1399 | Escherichia coli | 0.95 | 2880 | 5614 |
| GDS5160 | Escherichia coli | 0.94 | 4826 | 74819 |
| GDS5162 | Escherichia coli | 0.95 | 5038 | 293061 |
| GDS5010 | Mus musculus | 0.9 | 10269 | 120907 |
| GDS3870 | Penicillium chrysogenum | 0.94 | 6826 | 62431 |
| GDS344 | Saccharomyces cerevisiae | 0.95 | 3071 | 6303 |
| GDS772 | Saccharomyces cerevisiae | 0.94 | 1463 | 3785 |
| GDS777 | Saccharomyces cerevisiae | 0.91 | 2244 | 11916 |
| GDS1013 | Saccharomyces cerevisiae | 0.81 | 5312 | 555852 |
| GDS1103 | Saccharomyces cerevisiae | 0.95 | 4215 | 38139 |
| GDS1534 | Saccharomyces cerevisiae | 0.8 | 9335 | 1470003 |
| GDS1674 | Saccharomyces cerevisiae | 0.93 | 3839 | 11904 |
| GDS2267 | Saccharomyces cerevisiae | 0.83 | 4676 | 302104 |
| GDS2508 | Saccharomyces cerevisiae | 0.9 | 3069 | 10485 |
| GDS2663 | Saccharomyces cerevisiae | 0.8 | 9335 | 2617139 |
| GDS3332 | Saccharomyces cerevisiae | 0.86 | 7290 | 572118 |
| GDS2969 | Saccharomyces cerevisiae | 0.95 | 1679 | 5206 |
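
The table reports a threshold together with vertex and edge counts for each dataset but does not state how the graphs were built. Assuming, as is common for gene co-expression graphs, that the threshold is applied to the absolute pairwise Pearson correlation and that genes with no surviving edge are dropped, a minimal sketch of the construction could look like this. The helper names `pearson` and `threshold_graph` and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def threshold_graph(profiles, threshold):
    """Keep an edge when |r| >= threshold (assumed rule, see lead-in).

    Returns (vertices, edges), where vertices are the genes that have
    at least one edge.
    """
    edges = [
        (g, h)
        for g, h in combinations(sorted(profiles), 2)
        if abs(pearson(profiles[g], profiles[h])) >= threshold
    ]
    vertices = {v for edge in edges for v in edge}
    return vertices, edges


# Toy expression profiles: a, b and c are strongly (anti-)correlated, d is not.
profiles = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],
    "c": [4, 3, 2, 1],
    "d": [1, -1, 1, -1],
}
vertices, edges = threshold_graph(profiles, 0.95)
```

Under these assumptions, raising the threshold can only remove edges, which is consistent with the smaller graphs in the table tending to appear at the higher threshold values.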
Fig. 2 Robustness of four hierarchical algorithms on 24 transcriptomic datasets
Fig. 3 Robustness of all algorithms tested on 24 transcriptomic datasets
Fig. 4 Average robustness of each algorithm
Fig. 5 Coefficient of variation of each algorithm