Simon Crase, Suresh N Thennadil.
Abstract
Cluster analysis is a valuable unsupervised machine learning technique that is applied in a multitude of domains to identify similarities or clusters in unlabelled data. However, its performance depends on the characteristics of the data to which it is applied. There is no universally best clustering algorithm, and hence there are numerous clustering algorithms available with different performance characteristics. This raises the problem of how to select an appropriate clustering algorithm for the given analytical purposes. We present and validate an analysis framework to address this problem. Unlike most current literature, which focuses on characterizing the clustering algorithm itself, we present a wider holistic approach, with a focus on the user's needs, the data's characteristics and the characteristics of the clusters it may contain. In our analysis framework, we utilize a softer qualitative approach to identify appropriate characteristics for consideration when matching clustering algorithms to the intended application. These are used to generate a small subset of suitable clustering algorithms whose performance is then evaluated using quantitative cluster validity indices. To validate our analysis framework for selecting clustering algorithms, we applied it to four different types of datasets: three datasets of homemade explosives spectroscopy, eight datasets of publicly available spectroscopy data covering food and biomedical applications, a gene expression cancer dataset, and three classic machine learning datasets. Each data type has discernible differences in the composition of the data and the context within which they are used. Our analysis framework, when applied to each of these challenges, recommended differing subsets of clustering algorithms for final quantitative performance evaluation. For each application, the recommended clustering algorithms were confirmed to contain the top performing algorithms through quantitative performance indices.
Year: 2022 PMID: 35358292 PMCID: PMC8970496 DOI: 10.1371/journal.pone.0266369
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Analysis framework for clustering algorithm selection.
Characteristics of the validation datasets.
| Domain | Dataset Name | Dataset Type | No. of Samples | No. of Features | No. of Classes |
|---|---|---|---|---|---|
| Explosives Spectroscopy | Output Energetic | Fourier Transform Infrared Spectroscopy | 73 | 3350 | 5 |
| Explosives Spectroscopy | Transition Energetic | Fourier Transform Infrared Spectroscopy | 69 | 3350 | 8 |
| Explosives Spectroscopy | First Fire Energetic | Fourier Transform Infrared Spectroscopy | 53 | 3350 | 7 |
| General (Public) Spectroscopy | Coffee | Mid Infrared Spectroscopy | 56 | 286 | 2 |
| General (Public) Spectroscopy | Fruit | Fourier Transform Infrared Spectroscopy | 983 | 234 | 2 |
| General (Public) Spectroscopy | Liver | Fourier Transform Infrared Spectroscopy | 731 | 234 | 4 |
| General (Public) Spectroscopy | Mangos | Near Infrared Spectroscopy | 186 | 1157 | 4 |
| General (Public) Spectroscopy | Marzipan | Fourier Transform Infrared Spectroscopy | 32 | 1557 | 9 |
| General (Public) Spectroscopy | Meats | Fourier Transform Infrared Spectroscopy | 120 | 448 | 3 |
| General (Public) Spectroscopy | Olive Oil | Fourier Transform Infrared Spectroscopy | 120 | 570 | 4 |
| General (Public) Spectroscopy | Wine | Fourier Transform Infrared Spectroscopy | 44 | 842 | 4 |
| Medical | Gene Expression | RNA-Sequence | 801 | 20531 | 5 |
| Classic Machine Learning Examples | Iris | Multivariate | 150 | 4 | 3 |
| Classic Machine Learning Examples | Wine | Multivariate | 178 | 13 | 3 |
| Classic Machine Learning Examples | Breast Cancer | Multivariate | 569 | 30 | 2 |
Fig 2. PC1 vs PC2 PCA score plots of the clusters present in the explosives spectroscopy datasets.
A: Output Energetic; B: Transition Energetic; C: First Fire Energetic. The percentage of original information contained in each principal component is shown on each axis.
Fig 3. PC1 vs PC2 PCA score plots of the clusters present in the public spectroscopy datasets.
A: Coffee; B: Fruit; C: Liver; D: Mangos; E: Marzipan; F: Meats; G: Olive Oil; H: Wine. The percentage of original information contained in each principal component is shown on each axis.
Fig 4. PC1 vs PC2 PCA score plots of the clusters present in the gene expression dataset.
The percentage of original information contained in each principal component is shown on each axis.
Fig 5. PC1 vs PC2 PCA score plots of the clusters present in the classic ML datasets.
A: Iris; B: Wine; C: Breast Cancer. The percentage of original information contained in each principal component is shown on each axis.
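The axis percentages in the PCA score plots are presumably the share of total variance captured by each principal component (the explained variance ratio). The following is a minimal pure-Python sketch of that quantity, under that assumption. The function name `explained_variance_ratios`, the power-iteration approach, and the start vector are illustrative choices, not the authors' implementation; a linear algebra library would normally be used instead.

```python
from math import sqrt


def explained_variance_ratios(rows, n_components=2, iterations=500):
    """Fraction of total variance captured by the leading principal components.

    Hypothetical helper: eigenvalues of the sample covariance matrix are
    estimated by power iteration with deflation.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))
    if total_variance == 0:
        raise ValueError("data has zero variance")

    ratios = []
    for _ in range(n_components):
        # Arbitrary uneven start vector (an assumption of this sketch).
        vec = [1.0 / (j + 1) for j in range(d)]
        for _ in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            norm = sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]

        # Rayleigh quotient gives the eigenvalue estimate for this component.
        cov_vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        eigenvalue = (sum(v * cv for v, cv in zip(vec, cov_vec))
                      / sum(v * v for v in vec))
        ratios.append(eigenvalue / total_variance)

        # Deflate so the next pass finds the next component.
        cov = [[cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    return ratios
```

Multiplying each returned ratio by 100 gives a percentage of the kind shown on the PC1 and PC2 axes. For spectra with thousands of features, building the full covariance matrix this way would be slow, so this is only meant to make the definition concrete.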
Characteristics applicable to the validation datasets.
| Dataset Characteristics | Cluster Characteristics | Clustering Algorithm Characteristics | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset Type | Small Datasets | High dimensions | Non-Spherical shape | Variable Cluster Density | Single Point Cluster | Uneven Cluster Size | Robust to Noise and Outliers | Multi-modal (hierarchical) | No. of parameters/ hyperparameters | Deterministic | Efficiency (O) |
| Explosives Spectroscopy Datasets | X | X | X | X | X | X | X | X | X | ||
| General Spectroscopy Datasets | X | X | X | X | X | X | |||||
| Gene Expression Dataset | X | X | |||||||||
| Classic ML Datasets | X | ||||||||||
Comparative matrix of identified characteristics vs. candidate clustering algorithms.
| Dataset Characteristics | Cluster Characteristics | Clustering Algorithm Characteristics | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Clustering Algorithm | Small Datasets | High dimensions | Non-Spherical shape | Variable Cluster Density | Single Point Cluster | Uneven Cluster Size | Robust to Noise and Outliers | Multi-modal (hierarchical) | No. of parameters/ hyperparameters | Deterministic | Efficiency (O) |
| Hierarchical (Ward’s) [ | Y | N | N | Y | Y | Y | Y | Y |
| Y | |
| Hierarchical (Single Link) [ | Y | N | Y | N | Y | Y | N | Y |
| Y | |
| BIRCH [ | Y | Y | N | Y | Y | Y | Y |
| |||
| Y | N | N | N | Y | N | N | N |
| N |
| |
| N | Y | N | N | Y | N | N | N | N |
| ||
| PAM [ | Y | N | N | N | Y | Y | Y | N |
| N |
|
| Fuzzy C-Means [ | Y | N | N | Y | N | N | N |
| N |
| |
| DBSCAN [ | Y | N | Y | N | Y | Y | Y | N | 2 | Y | |
| HDBSCAN [ | Y | N | Y | Y | Y | Y | Y | 2 | Y |
| |
| OPTICS [ | Y | N | Y | Y | N | Y | Y | Y | 2 | Y |
|
| Mean Shift [ | Y | N | Y | N | Y | Y | Y | Y | 1 |
| |
| Spectral Clustering [ | N | Y | Y | Y | N | N | N | N |
| ||
| Affinity Propagation [ | Y | Y | Y | N | Y | Y | N | Y | 2 |
| |
| Gaussian Mixture Model [ | Y | N | Y | Y | Y | Y | Y | N |
| N |
|
n = the number of objects to be clustered.
k = the number of clusters.
d = the number of dimensions.
i = the number of iterations to convergence.
*Results can change based on the order the data is provided.
+Produces an object such as a minimum spanning tree from which hierarchy can be inferred.
#Enabling single point clustering removes outlier/noise detection capabilities.
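One possible reading of how the comparative matrix is used: list the characteristics that matter for an application, then keep only the algorithms marked Y for all of them. The sketch below illustrates that reading with four rows from the matrix. It assumes the first eight Y/N cells of each row map in order onto the eight dataset and cluster characteristic columns. The identifiers `CHARACTERISTICS`, `MATRIX`, `supports`, and `shortlist` are hypothetical, and the all-Y rule is an assumption rather than the authors' stated procedure.

```python
# Column names for the first eight Y/N columns of the comparative matrix.
CHARACTERISTICS = [
    "small_datasets",
    "high_dimensions",
    "non_spherical_shape",
    "variable_cluster_density",
    "single_point_cluster",
    "uneven_cluster_size",
    "robust_to_noise_and_outliers",
    "multi_modal",
]

# Four rows transcribed from the matrix (assumption: the cells appear in
# column order). The remaining rows are omitted from this illustration.
MATRIX = {
    "Hierarchical (Ward's)": "YNNYYYYY",
    "Hierarchical (Single Link)": "YNYNYYNY",
    "DBSCAN": "YNYNYYYN",
    "OPTICS": "YNYYNYYY",
}


def supports(algorithm):
    """Return the set of characteristics marked Y for an algorithm."""
    flags = MATRIX[algorithm]
    return {name for name, flag in zip(CHARACTERISTICS, flags) if flag == "Y"}


def shortlist(required):
    """Return algorithms marked Y for every required characteristic."""
    required = set(required)
    unknown = required - set(CHARACTERISTICS)
    if unknown:
        raise ValueError(f"unknown characteristics: {sorted(unknown)}")
    return [name for name in MATRIX if required <= supports(name)]


if __name__ == "__main__":
    print(shortlist({"non_spherical_shape", "robust_to_noise_and_outliers"}))
```

A strict all-Y filter is the simplest rule. A softer variant could rank algorithms by how many required characteristics they satisfy, which may sit closer to the qualitative matching the abstract describes.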
V-measure scores for all algorithms and datasets.
The highlighted algorithms are those suggested through our analysis framework. The cumulative total score for each type of dataset is shown in the bold columns.
| | Explosives Spectroscopy Datasets | | | | Public Spectroscopy Datasets | | | | | | | | | Gene Dataset | Classic ML Datasets | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clustering Algorithm | Output Energetic | Transition Energetic | First Fire Energetic | Total Score | Coffee | Fruit | Liver | Mangos | Marzipan | Meats | Olive Oil | Wine | Total Score | Gene Expression | Wine | Iris | Breast Cancer | Total Score |
| Hierarchical (Ward’s) | 1.00 | 0.71 | 0.68 | | 0.70 | 0.08 | 0.79 | 0.41 | 0.81 | 0.65 | 0.73 | 0.37 | | | 0.79 | 0.77 | 0.46 | |
| Hierarchical (Single) | 1.00 | 0.65 | 0.68 | | 0.03 | 0.00 | 0.01 | 0.10 | 0.72 | 0.07 | 0.22 | 0.11 | | | 0.03 | 0.72 | 0.01 | |
| BIRCH | 1.00 | 0.71 | 0.68 | | 0.70 | 0.08 | 0.57 | 0.08 | 0.81 | 0.65 | 0.73 | 0.37 | | | 0.79 | 0.77 | 0.46 | |
| | 1.00 | 0.71 | 0.73 | | 0.54 | 0.15 | 0.75 | 0.40 | 0.81 | 0.81 | 0.67 | 0.33 | | | 0.89 | 0.76 | 0.53 | |
| | 1.00 | 0.69 | 0.74 | | 0.54 | 0.15 | 0.75 | 0.29 | 0.83 | 0.67 | 0.69 | 0.33 | | | 0.86 | 0.76 | 0.58 | |
| PAM | 1.00 | 0.73 | 0.73 | | 0.59 | 0.15 | 0.73 | 0.39 | 0.81 | 0.58 | 0.67 | 0.28 | | | 0.78 | 0.79 | 0.49 | |
| Fuzzy C-Means | 1.00 | 0.63 | 0.70 | | 0.54 | 0.12 | 0.74 | 0.37 | 0.84 | 0.80 | 0.62 | 0.33 | | | 0.88 | 0.75 | 0.56 | |
| DBSCAN | 1.00 | 0.65 | 0.68 | | 0.82 | 0.10 | 0.46 | 0.54 | 0.72 | 0.45 | 0.56 | 0.12 | | | 0.52 | 0.66 | 0.04 | |
| HDBSCAN | 0.95 | 0.74^ | 0.77 | | 0.78 | 0.18 | 0.58 | 0.68 | 0.84^ | 0.67 | 0.49 | 0.27 | | | 0.50 | 0.73 | 0.21 | |
| OPTICS | 0.36 | 0.62 | 0.44 | | 0.69 | 0.03 | 0.59 | 0.56 | 0.80 | 0.50 | 0.53 | 0.30 | | | 0.55 | 0.63 | 0.20 | |
| Mean Shift | 1.00 | 0.63 | 0.68 | | 0.06 | 0.16 | 0.59 | 0.20 | 0.77 | 0.60 | 0.50 | 0.18 | | | 0.04 | 0.77 | 0.02 | |
| Spectral Clustering | 0.06 | 0.15 | 0.13 | | 0.09 | 0.12 | 0.79 | 0.18 | 0.86 | 0.63 | 0.73 | 0.29 | | | 0.66 | 0.77 | 0.01 | |
| Affinity Propagation | 0.64 | 0.55 | 0.50 | | 0.57 | 0.15 | 0.72 | 0.07 | 0.75 | 0.35 | 0.62 | 0.21 | | | 0.56 | 0.73 | 0.50 | |
| Gaussian Mixture Model | 1.00 | 0.74 | 0.68 | | 0.65 | 0.14 | 0.74 | 0.36 | 0.83 | 0.54 | 0.67 | 0.26 | | | 0.88 | 0.90 | 0.66 | |
#Large number of noise points.
^Could not generate the specified number of clusters, k.
*Could not generate a complete graph from the similarity matrix, which resulted in incorrect clustering.
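The scores in the table above are V-measure values, which compare a clustering against known class labels. The sketch below assumes the standard definition, the harmonic mean of homogeneity and completeness with equal weighting; the source does not state which variant or implementation the authors used. The function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_classes = entropy(true_labels)
    h_clusters = entropy(cluster_labels)

    # Homogeneity: each cluster contains members of a single class.
    homogeneity = 1.0 if h_classes == 0 else (
        1.0 - conditional_entropy(true_labels, cluster_labels) / h_classes)
    # Completeness: all members of a class fall in the same cluster.
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - conditional_entropy(cluster_labels, true_labels) / h_clusters)

    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

The score does not depend on how clusters are numbered, so a clustering that recovers every class exactly scores 1.00 whatever label values it uses. A result that places all samples in one cluster has zero homogeneity and therefore scores 0.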