Michael C Thrun.
Abstract
Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their clustering ability in the presence of clusters defining distance-based structures resulting in a biased clustering solution. Data sets might not have cluster structures. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QM) are always biased and misleading. Only if the predefined structures happen to meet the particular clustering criterion and QM, can the clusters be recovered. Results are presented based on 41 open-source algorithms which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than that with the typically used box plots or violin plots.
Year: 2021 PMID: 34556686 PMCID: PMC8460803 DOI: 10.1038/s41598-021-98126-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 2. MD-plots of the micro-averaged F1 score (left) and Davies–Bouldin index (right) across 120 trials for 33 clustering algorithms calculated on the leukaemia dataset. Distance-based structures with imbalanced classes are not easy to tackle in high-dimensional data. The chance level is shown by the dotted line at 50%. Choosing an algorithm by the Davies–Bouldin index would lead to the selection of the CentroidL or, for some trials, the VarSelLCM algorithm, whereas using the ground truth shows that AverageL, CompleteL, DBS, Diana, SingleL and WPGMA are appropriate algorithms to reproduce the high-dimensional structures with low variance and bias. The results for Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RTC, and Spectrum could not be computed. Note that Markov clustering results in only one cluster, in which case the Davies–Bouldin index is not defined.
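Because clustering yields arbitrary labels, a supervised score such as the micro-averaged F1 can only be computed after cluster labels are matched to the ground-truth classes. The following minimal sketch illustrates one way to do this; it is not the procedure used in the study. It assumes that every point carries exactly one label, in which case the micro-averaged F1 equals the accuracy, and that there are no more clusters than classes. The function name `micro_f1_best_mapping` is hypothetical, and the brute-force search over assignments is only practical for a handful of clusters.

```python
from itertools import permutations


def micro_f1_best_mapping(true_labels, cluster_labels):
    """Micro-averaged F1 under the best one-to-one relabelling of clusters.

    Assumption: single-label data, so micro-averaged F1 equals accuracy.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than classes")
    best = 0.0
    # Try every injective assignment of cluster ids to class ids.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_labels))
        best = max(best, hits / len(true_labels))
    return best


# The same partition with swapped label names receives the same score.
truth = [0, 0, 0, 1, 1, 1]
score_a = micro_f1_best_mapping(truth, [0, 0, 0, 1, 1, 1])
score_b = micro_f1_best_mapping(truth, [1, 1, 1, 0, 0, 0])
score_c = micro_f1_best_mapping(truth, [1, 1, 0, 0, 0, 0])  # one point misassigned
```

For larger numbers of clusters, an assignment solver would replace the exhaustive search; the exhaustive version is shown only because it makes the label-invariance explicit.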
Typical distance-based clustering challenges with one example dataset each. The table summarizes the results of SI C (Supplementary Fig. 10) and SI D (Supplementary Figs. 11–14). No algorithm is able to reproduce the structures of all types of problems with highly stable results. The challenge that no distance-based cluster structures exist is not included in this table because benchmarking is not possible in this case. Note that the benchmarking performed here does not allow one to deduce whether an algorithm fails due to the cluster structures or due to the distribution of the data.
| Distance-based cluster structures | Exemplary dataset, dimensionality D, range of cluster sizes | Stable clustering solution | Small bias with minor variance | Small bias and unstable clustering solution (multimodality) | Large bias |
|---|---|---|---|---|---|
| Non-overlapping convex hulls with varying intra-cluster distance | Hepta, D = 3, 14–15% | 24/41 | QT, SOM | CrossEntropyC, Hartigan, HCL, HDD, LBG, mvnpEM, npEM, Orclus, SOM, Sparse k-means, Spectral | Diana, ProClus, RTC, PPC |
| Overlapping convex hulls | Atom, D = 3, 50% | 10/41 | DBS | CrossEntropyC | 29/41 |
| Non-overlapping convex hulls with varying geometric shapes and noise | Lsun3D, D = 3, 24–49% (additionally, 4 outliers as noise) | Clustvarsel, Gini, HDBSCAN, Minimax, ModelBased, mvnpEM, npEM, VarSelLCM, Ward | Fanny, DBS, Orclus, CrossEntropyC, HDD | Spectral, ProClus | 25/41 |
| Linear non-separable entanglements | Chainlink, D = 3, 50% | DBS, Gini, HDBSCAN, mvnpEM, SingleL, Spectral, Spectrum | Clustvarsel, CrossEntropyC, ModelBased, npEM, VarSelLCM | / | 29/41 |
| High dimensionality with highly imbalanced cluster sizes | Leukaemia, D = 7447, range of cluster sizes: 2.7–50% (additionally, 1 outlier as noise) | AverageL, CompleteL, Diana, SingleL, WPGMA | DBS | Clara, HCL, QT | 32/41, with Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RTC, and Spectrum not computable |
| High dimensionality with an unstable clustering solution | Cancer, D = 18,167, range of cluster sizes: 10–17% | Gini | Ward | DBS, Hartigan, HDD, LBG, Neural Gas | 34/41, with Clustvarsel, CrossEntropyC, ModelBased, mvnpEM, npEM, Orclus, RobustTrimmedC, SparseH and Spectrum not computable |
Figure 1. The coloured points of the two SOM clusters of the GolfBall dataset [16]. The figure on the left shows the best case, with a Davies–Bouldin index of 0.83, and the figure on the right shows the worst case, with a Davies–Bouldin index of 11.8.
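To make the Davies–Bouldin values in the figure captions easier to interpret, the sketch below computes the index in its common textbook form: for each cluster, the worst-case ratio of summed within-cluster scatter to centroid distance, averaged over all clusters, so that lower values indicate more compact and better separated clusters. It assumes Euclidean distances and centroid-based scatter; the implementation used in the study may differ. The function name `davies_bouldin` and the toy data are hypothetical.

```python
from math import dist


def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distances (lower is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        # A single cluster has no neighbouring cluster to compare against.
        raise ValueError("Davies-Bouldin index is undefined for fewer than two clusters")
    # Centroid of each cluster: coordinate-wise mean of its points.
    centroids = {
        label: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for label, pts in clusters.items()
    }
    # Scatter of each cluster: mean distance of its points to the centroid.
    scatter = {
        label: sum(dist(p, centroids[label]) for p in pts) / len(pts)
        for label, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Note: coinciding centroids would cause a division by zero here.
        total += max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical toy data: two well-separated pairs of points.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
db_good = davies_bouldin(pts, [0, 0, 1, 1])  # clusters follow the two pairs
db_bad = davies_bouldin(pts, [0, 1, 0, 1])   # clusters cut across the pairs
```

The explicit error for a single cluster mirrors the note in the Figure 2 caption that the index is not defined when an algorithm returns only one cluster.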