Chenyue W Hu, Hanyang Li, Amina A Qutub.
Abstract
BACKGROUND: Many common clustering algorithms require a two-step process that limits their efficiency: the algorithm must be run repeatedly, and it must be paired with a model selection criterion. Both steps are needed to determine the number of clusters present in the data and the corresponding cluster memberships. As biomedical datasets grow in size and prevalence, there is a growing need for methods that are easier to implement and more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size for the clustering result to be meaningful and interpretable in subsequent analysis.
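The two-step process described in the background can be sketched as follows. This is a minimal illustration of the conventional workflow, not of Shrinkage Clustering itself: a toy 1-D k-means is run once per candidate cluster number, and a model selection criterion (here the mean silhouette width, chosen as an assumption) picks the best run. The helper names `kmeans_1d`, `silhouette`, and `select_k`, and the toy data, are hypothetical.

```python
def kmeans_1d(xs, k, iters=50):
    """Plain Lloyd's k-means on 1-D data with a deterministic, evenly spread init."""
    s = sorted(xs)
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        new_centers = []
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(xs, labels):
    """Mean silhouette width; points in singleton clusters score 0."""
    clusters = {}
    for x, l in zip(xs, labels):
        clusters.setdefault(l, []).append(x)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for x, l in zip(xs, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        b = min(sum(abs(x - y) for y in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(xs)


def select_k(xs, k_values):
    """Step 1: cluster once per candidate k. Step 2: keep the best-scoring k."""
    scores = {k: silhouette(xs, kmeans_1d(xs, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Hypothetical toy data with three well-separated groups.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
best_k, scores = select_k(data, range(2, 6))
```

The cost of this approach grows with the number of candidate values of k, which is the inefficiency the background points to.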
Keywords: Cancer subtyping; Clustering; Gene expression; Matrix factorization
Year: 2018 PMID: 29361928 PMCID: PMC5782397 DOI: 10.1186/s12859-018-2022-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Clustering results of simulated similarity matrices with varying size constraints (ω), where C is the cluster generated by Shrinkage Clustering
| True Label | C1 | C2 | C3 | C4 | C5 | C1 | C2 | C3 | C4 | C1 | C2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster 1 | 0 | 0 | 24 | 0 | 0 | 0 | 24 | 0 | 0 | 0 | 24 |
| Cluster 2 | 15 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 15 | 0 |
| Cluster 3 | 0 | 0 | 0 | 24 | 0 | 0 | 0 | 24 | 0 | 0 | 24 |
| Cluster 4 | 0 | 17 | 0 | 0 | 0 | 17 | 0 | 0 | 0 | 17 | 0 |
| Cluster 5 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 0 | 20 | 20 | 0 |
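One way to read the table above is to sum each column, which gives the size of every generated cluster, and then take the smallest size within each of the three column blocks (five, four, and two clusters). The sketch below does that with the counts copied from the table; the block boundaries and the names `blocks` and `cluster_sizes` are assumptions made for illustration, and the ω value behind each block is not stated here.

```python
# Rows: true clusters 1-5. Columns: the table's three blocks of generated clusters.
table = [
    [0, 0, 24, 0, 0,   0, 24, 0, 0,   0, 24],
    [15, 0, 0, 0, 0,   15, 0, 0, 0,   15, 0],
    [0, 0, 0, 24, 0,   0, 0, 24, 0,   0, 24],
    [0, 17, 0, 0, 0,   17, 0, 0, 0,   17, 0],
    [0, 0, 0, 0, 20,   0, 0, 0, 20,   20, 0],
]

# Assumed column ranges (start, stop) of the three blocks.
blocks = [(0, 5), (5, 9), (9, 11)]


def cluster_sizes(table, start, stop):
    """Column sums: number of samples placed in each generated cluster."""
    return [sum(row[j] for row in table) for j in range(start, stop)]


sizes_per_block = [cluster_sizes(table, a, b) for a, b in blocks]
smallest = [min(sizes) for sizes in sizes_per_block]
```

The smallest cluster grows from block to block as the smaller true clusters are merged, which is the behavior a minimum-size constraint is meant to produce.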
Fig. 1 Performance of the base algorithm on simulated similarity data. Shrinkage paths plot the change in cluster number over the entire iteration process. a The first five shrinkage paths from the 1000 runs (with 20 initial random clusters) are illustrated. b Example shrinkage paths are shown for runs initialized with 5, 10, 20, 50 and 100 random clusters
Fig. 2 Performance of Shrinkage Clustering with cluster size constraints. a The average number of iterations is plotted for ω values of 1 to 5, 10, 15, 20 and 25. b Example shrinkage paths are shown for ω of 1 to 5, 10, 15, 20 and 25 (the path for ω=10 overlaps the path for ω=15)
Fig. 3 Robustness of Shrinkage Clustering against noise. a The distribution density of S is shown with a varying degree of noise, as ε is sampled with σ from 0 to 0.5. b The probability of successfully recovering the underlying cluster structure is plotted against different noise levels. True cluster recovery is defined as the frequency with which clustering the noisy data reproduces exactly the true cluster assignment, with the noise generated 1000 times
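The recovery measure in the Fig. 3 caption can be sketched as a simple frequency over repeated runs. The version below assumes that two assignments count as "the exact same" when they define the same groups regardless of how the cluster labels are numbered; that reading, and the names `same_partition` and `recovery_rate`, are assumptions rather than the authors' code.

```python
def same_partition(a, b):
    """True if two label lists describe the same clusters, ignoring label names."""
    if len(a) != len(b):
        return False
    forward, backward = {}, {}
    for x, y in zip(a, b):
        # Each label in a must map to exactly one label in b, and vice versa.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


def recovery_rate(true_labels, runs):
    """Fraction of runs whose assignment matches the true assignment exactly."""
    hits = sum(same_partition(true_labels, run) for run in runs)
    return hits / len(runs)


# Hypothetical example: four noisy runs, two of which recover the true structure.
truth = [0, 0, 1, 1, 2]
runs = [
    [2, 2, 0, 0, 1],  # same groups, different label names
    [0, 0, 1, 1, 1],  # two true clusters merged
    [1, 1, 0, 0, 2],  # same groups, different label names
    [0, 1, 1, 1, 2],  # one sample misassigned
]
rate = recovery_rate(truth, runs)
```

In the setting of the caption, `runs` would hold the 1000 assignments obtained at one noise level.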
Clustering results of the TCGA dataset, where the clustering assignments from Shrinkage Clustering are compared against the three known tumor types
| Tumor Type | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| BRCA | 3 | 204 | 0 |
| GBM | 0 | 0 | 67 |
| LUSC | 17 | 2 | 0 |
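Agreement measures such as the Rand Index and NMI can be computed directly from a contingency table like the one above. The sketch below uses the counts from that table. The NMI normalization (mutual information divided by the arithmetic mean of the two entropies) is an assumption, since several conventions exist, so the resulting numbers are not claimed to match those reported by the authors.

```python
import math


def comb2(m):
    """Number of unordered pairs among m items."""
    return m * (m - 1) // 2


def rand_index(table):
    """Rand Index from a contingency table (rows: true classes, columns: clusters)."""
    n = sum(sum(row) for row in table)
    cell_pairs = sum(comb2(v) for row in table for v in row)
    row_pairs = sum(comb2(sum(row)) for row in table)
    col_pairs = sum(comb2(sum(col)) for col in zip(*table))
    total = comb2(n)
    # Agreements = pairs grouped together in both + pairs separated in both.
    return (total + 2 * cell_pairs - row_pairs - col_pairs) / total


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def nmi(table):
    """Mutual information normalized by the mean of the two entropies (assumed)."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, v in enumerate(row):
            if v:
                p = v / n
                mi += p * math.log(p / (row_p[i] * col_p[j]))
    denom = (entropy(row_p) + entropy(col_p)) / 2
    return mi / denom if denom > 0 else 1.0


# Counts from the TCGA table: rows BRCA, GBM, LUSC; columns Cluster 1 to 3.
tcga = [
    [3, 204, 0],
    [0, 0, 67],
    [17, 2, 0],
]
tcga_rand = rand_index(tcga)
tcga_nmi = nmi(tcga)
```

Both functions depend only on the counts, so they apply unchanged to any of the other contingency tables in this record.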
Performance comparison of ten algorithms on six biological data sets, i.e. TCGA, BCWD, Dyrskjot-2003, Nutt-2003-v1, Nutt-2003-v3 and AIBT
| Data | Metric | Shrinkage | Spectral | K-means | Hierarchical | PAM | DBSCAN | Affinity | AGNES | Clusterdp | SymNMF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA | NMI | **…** | 0.77 | NA | 0.83 | 0.76 | NA | NA | **…** | NA | **…** |
| | Rand | **…** | 0.91 | NA | 0.91 | 0.77 | NA | NA | **…** | NA | **…** |
| | F1 | **…** | 0.92 | NA | 0.92 | 0.80 | NA | NA | **…** | NA | **…** |
| | K (3) | **…** | 2 | NA | 2 | 2 | NA | NA | **…** | NA | 2 |
| BCWD | NMI | **…** | 0.29 | 0.46 | 0.09 | **…** | 0.20 | 0.45 | 0.09 | 0.20 | **…** |
| | Rand | **…** | 0.68 | 0.75 | 0.55 | **…** | 0.64 | 0.76 | 0.55 | 0.53 | **…** |
| | F1 | **…** | 0.69 | 0.79 | 0.69 | **…** | 0.75 | 0.79 | 0.69 | 0.59 | **…** |
| | K (2) | **…** | 2 | 2 | 2 | **…** | 2 | 3 | 2 | 2 | **…** |
| Dyrskjot-2003 | NMI | **…** | 0.07 | **…** | 0.12 | 0.56 | 0.30 | 0.42 | 0.12 | 0.07 | **…** |
| | Rand | **…** | 0.55 | **…** | 0.42 | 0.77 | 0.55 | 0.72 | 0.42 | 0.50 | **…** |
| | F1 | **…** | 0.36 | **…** | 0.54 | 0.66 | 0.60 | 0.66 | 0.54 | 0.43 | **…** |
| | K (3) | **…** | 3 | **…** | 3 | 3 | 3 | 3 | 3 | 2 | **…** |
| Nutt-2003-v1 | NMI | **…** | 0.45 | 0.47 | 0.28 | 0.34 | **…** | 0.41 | 0.11 | 0.17 | **…** |
| | Rand | **…** | 0.73 | 0.72 | 0.52 | 0.68 | **…** | 0.73 | 0.35 | 0.64 | **…** |
| | F1 | **…** | 0.51 | 0.51 | 0.43 | 0.41 | **…** | 0.44 | 0.38 | 0.34 | **…** |
| | K (4) | **…** | 4 | 4 | 4 | 4 | **…** | 5 | 4 | 4 | **…** |
| Nutt-2003-v3 | NMI | **…** | 0.20 | **…** | 0.13 | 0.33 | 0.13 | 0.13 | 0.13 | 0.29 | **…** |
| | Rand | **…** | 0.58 | **…** | 0.58 | 0.58 | 0.58 | 0.58 | 0.58 | 0.55 | **…** |
| | F1 | **…** | 0.59 | **…** | 0.71 | 0.60 | 0.71 | 0.71 | 0.71 | 0.57 | **…** |
| | K (2) | **…** | 2 | **…** | 2 | 2 | 2 | 3 | 2 | 2 | **…** |
| AIBT | NMI | **…** | 0.20 | **…** | 0.17 | 0.54 | 0.56 | 0.53 | 0.02 | 0.55 | **…** |
| | Rand | **…** | 0.68 | **…** | 0.37 | 0.78 | 0.65 | 0.76 | 0.26 | 0.69 | **…** |
| | F1 | **…** | 0.39 | **…** | 0.40 | 0.59 | 0.59 | 0.51 | 0.40 | 0.57 | **…** |
| | K (4) | **…** | 4 | **…** | 4 | 4 | 4 | 5 | 4 | 3 | **…** |
Clustering accuracy is assessed via metrics including NMI (Normalized Mutual Information), Rand Index, F1 score and K (the optimal cluster number). The top three performers in each case are highlighted in bold
Performances of Shrinkage Clustering on Simulated, Iris and Wine data, where the clustering assignments are compared against the three simulated centers, three Iris species and three wine types respectively
| Simulated | | | | Iris | | | Wine | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Center | C1 | C2 | C3 | Species | C1 | C2 | Type | C1 | C2 | C3 |
| (-2,2) | 0 | 49 | 1 | … | 50 | 0 | 1 | 0 | 59 | 0 |
| (-2,-2) | 0 | 1 | 49 | … | 0 | 50 | 2 | 59 | 6 | 0 |
| (2,0) | 50 | 0 | 0 | … | 0 | 50 | 3 | 0 | 6 | 48 |
Parameter values of DBSCAN, Affinity Propagation and clusterdp
| Algorithm | DBSCAN | | Affinity propagation | | clusterdp | |
|---|---|---|---|---|---|---|
| Parameter | minPts | eps | p | q | rho | delta |
| BCWD | 31 | 3000 | NA | 0 | 20 | 3000 |
| Dyrskjot-2003 | 2 | 23000 | NA | 0.07 | 3 | 20000 |
| Nutt-2003-v1 | 2 | 11000 | NA | 0.12 | 1.5 | 3000 |
| Nutt-2003-v3 | 1 | 8000 | NA | 0.1 | 1 | 7000 |
| AIBT | 5 | 400 | NA | 0 | 2.5 | 240 |
Fig. 4 Speed comparison using the AIBT data. The computation time of Shrinkage Clustering is recorded and compared against other commonly used clustering algorithms