Abiodun M Ikotun, Absalom E Ezugwu.
Abstract
The K-means clustering algorithm is an iterative unsupervised learning algorithm that tries to partition a given dataset into k pre-defined, distinct, non-overlapping clusters, where each data point belongs to only one group. However, its performance is affected by its sensitivity to the initial cluster centroids, with the possibility of convergence to a local optimum, and by the need to specify the number of clusters as an input parameter. Recently, the hybridization of metaheuristic algorithms with the K-means algorithm has been explored to address these problems and effectively improve the algorithm's performance. Nonetheless, most metaheuristic algorithms require rigorous parameter tuning to achieve an optimum result. This paper proposes a hybrid clustering method that combines the well-known symbiotic organisms search (SOS) algorithm with K-means, using SOS as a global search metaheuristic to generate the optimum initial cluster centroids for K-means. The SOS algorithm is more of a parameter-free metaheuristic with excellent search quality that only requires initialising a single control parameter. The performance of the proposed algorithm is investigated by comparing it with the classical SOS, classical K-means and other existing hybrid clustering algorithms on eleven (11) UCI Machine Learning Repository datasets and one artificial dataset. The results from the extensive computational experimentation show improved performance of the hybrid SOSK-means for solving automatic clustering compared to the standard K-means, symbiotic organisms search clustering methods and other hybrid clustering approaches.
Year: 2022 PMID: 35951672 PMCID: PMC9371361 DOI: 10.1371/journal.pone.0272861
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Summary of literature review on K-means hybridization with metaheuristic algorithms.
| S/N | Algorithm | Reference | Method | Findings | Limitations |
|---|---|---|---|---|---|
| 1 | GA- and DE-based Heuristics hybrid algorithm | Mustafi and Sahoo [ | Hybridised GA and DE with K-means | Improved initial seeding for the K-means algorithm with the requisite number of clusters | Proper tuning of basic GA input parameters is required with increased computational time and complexity |
| 2 | MapReduce-based hybrid algorithm | Sinha and Jana [ | Hybridised GA with K-means using Mahalanobis distance as fitness function and K-means++ initialization process | MapReduce-based K-means hybridized with GA for clustering in a distributed environment | Proper tuning of basic GA input parameters is required |
| 3 | GENCLUST++ | Islam et al. [ | Hybridised GA with K-means | Advancement in genetic algorithm-based clustering for quality clustering solutions with O(n) complexity | Proper tuning of basic GA input parameters is required with increased computational complexity |
| 4 | NCLUST | Zhang and Zhou [ | Hybridised GA with K-means++ | Genetic algorithm-based hybrid clustering with maintained population diversity | Proper tuning of basic GA input parameters and increased computational time are required for a high-quality result |
| 5 | Genetic K-means | Kapil, Chawla and Ansari [ | Hybridised GA with K-means | Optimized K-means using GA for better K-means initialisation process using sample dataset as chromosomes | Proper tuning of basic GA input parameters is required with increased computational complexity |
| 6 | GENCLUST | Rahman and Islam [ | Hybridised GA with K-means | Genetic algorithm-based clustering with automatically generated accurate cluster numbers and high-quality cluster centres | Proper tuning of basic GA input parameters is required with increased computational complexity |
| 7 | KMQGA | Xiao et al. [ | Hybridised GA with K-means | Generation of an optimal number of clusters and optimal cluster centroids using Q-bits operations and representation | Proper tuning of basic GA input parameters is required, with increased computational complexity |
| 8 | ACDE-K-MEANS | Kuo, Suryani and Yasid [ | Hybridised improved DE with K-means | Automatic generation of a number of clusters | Proper tuning of basic DE input parameters is required |
| 9 | ACDE | Silva et al. [ | Hybridised DE with K-means | Automatic generation of a number of clusters using U-control chart-based DE | Proper tuning of basic DE input parameters is required |
| 10 | CDE | Cai et al. [ | Use of one-step K-means with DE | Improved performance for DE-based clustering | Proper tuning of basic DE input parameters is required. Higher computational time for better quality clustering |
| 11 | IGBHSK | Cobos et al. [ | Hybridised global best HS with K-means | Automatic clustering using BIC for determining cluster numbers for document clustering | Proper tuning of basic HS input parameters is required |
| 12 | iABC | Kuo and Zulvia [ | Hybridised improved ABC with K-means | Better initial cluster centroid for the K-means algorithm with better and more stable clustering result | Requires parameter analysis to achieve optimal performance. Higher computational time for better quality clustering |
| 13 | Classical SOS | Zhou et al. [ | Using SOS algorithm for solving clustering problem | Automatic clustering using classical SOS | Requires parameter analysis to achieve optimal performance. Limited in performance as a single algorithm |
| 14 | CSOS | Yang & Sutrisno [ | Integrated K-means with SOS | Uses K-means to improve the classical SOS algorithm's initialization, improving search quality and efficiency | The focus is on improving the classical SOS algorithm |
| 15 | SOSFA, SOSDE, SOSPSO, SOSTLBO | Rajah and Ezugwu [ | Hybridised SOS with FA, DE, PSO and TLBO | Improving the performance of the basic SOS algorithms through hybridization | Proper tuning of the participating metaheuristics' basic input parameters is required |
Fig 1. Proposed hybrid SOSK-means clustering algorithm.
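
The abstract notes that K-means is sensitive to its initial centroids and can converge to a local optimum, which is the motivation for handing centroid selection to a global search. The following is a minimal pure-Python sketch of the K-means stage only, assuming the initial centroids come from some outside search; the SOS stage itself is not implemented here. The names `kmeans` and `wcss` and the four-point toy dataset are illustrative assumptions, not taken from the paper.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, max_iter=100):
    """Lloyd-style K-means started from caller-supplied centroids (assumed API)."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the centroids are stable too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances (lower is tighter)."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


# Hypothetical toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = kmeans(points, [(0, 0), (10, 0)])  # one seed per true group
poor = kmeans(points, [(0, 0), (0, 1)])   # both seeds in the same group
```

On this toy data the second seeding settles on a partition that splits both pairs, with a much larger within-cluster sum of squares than the first. It is a small instance of the local-optimum behaviour that a global search for initial centroids is meant to avoid.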
Initial parameter setting for classical SOS, classical K-means and proposed hybrid SOSK-means.
| SOS parameter | SOS value | K-means parameter | K-means value | SOSK-means parameter | SOSK-means value |
|---|---|---|---|---|---|
| Max-It | 200 | k | As per dataset | Max-It | 200 |
| np | 20 | cc1….cck | First k elements in dataset | np | 20 |
Initial parameter setting for the compared algorithms.
| DCPSO parameter | DCPSO value | GCUK parameter | GCUK value |
|---|---|---|---|
| Popl_size | 100 | Popl_size | 50 |
| Inertial Weight | 0.7200 | Cross-over | 0.800 |
| c1, c2 | 1.4940 | Mutation probability | 0.0010 |
| | 20 | | 20 |
| | 2 | | 2 |
Dataset characteristics.
| Datasets | Dataset Features | Number of Clusters | Number of Objects | Dataset Type |
|---|---|---|---|---|
| Breast [ | 9 | 2 | 699 | UCI |
| Compound [ | 2 | 6 | 399 | Shape |
| Flame [ | 2 | 2 | 240 | Shape |
| Glass [ | 9 | 7 | 214 | UCI |
| Iris [ | 4 | 3 | 150 | UCI |
| Jain [ | 2 | 2 | 373 | Shape |
| Path-based [ | 2 | 3 | 300 | Shape |
| Spiral [ | 2 | 2 | 312 | Shape |
| Thyroid [ | 5 | 2 | 215 | UCI |
| Two-moons [ | 2 | 2 | 10,000 | - |
| Wine [ | 13 | 3 | 178 | UCI |
| Yeast [ | 8 | 10 | 1,484 | UCI |
SOSK-means results over forty independent runs with the DB and CS validity indices as the fitness functions.
| Dataset | DB Best | DB Worst | DB Average | DB Std Dev | CS Best | CS Worst | CS Average | CS Std Dev |
|---|---|---|---|---|---|---|---|---|
| Breast | 0.8121 | 0.8121 | 0.8121 | 0.0000 | | 0.9574 | 0.7606 | 0.1217 |
| Compound | | 0.5158 | 0.5046 | 0.0044 | 0.5032 | 0.5918 | 0.5072 | 0.0155 |
| Flame | 0.7755 | 0.7787 | 0.7770 | 0.0008 | | 0.3846 | 0.3846 | 0.0000 |
| Glass | 0.3633 | 0.8159 | 0.7113 | 0.1217 | | 0.0608 | 0.0608 | 0.0000 |
| Iris | 0.5937 | 0.6744 | 0.6346 | 0.0188 | | 0.6444 | 0.5743 | 0.0237 |
| Jain | | 0.6535 | 0.6518 | 0.0009 | 0.6546 | 0.6546 | 0.6546 | 0.0000 |
| Pathbased | 0.6579 | 0.6740 | 0.6708 | 0.0031 | | 0.6894 | 0.6511 | 0.0120 |
| Spiral | 0.7350 | 0.7541 | 0.7437 | 0.0045 | | 0.6862 | 0.6812 | 0.0115 |
| Thyroid | | 0.6934 | 0.6321 | 0.0316 | 0.6409 | 0.6409 | 0.6409 | 0.0000 |
| Twomoons | | 0.6048 | 0.6032 | 0.0010 | 0.7176 | 0.7664 | 0.7498 | 0.0162 |
| Wine | 1.0045 | 1.0896 | 1.0460 | 0.0207 | | 0.8829 | 0.8422 | 0.0527 |
| Yeast | 0.4460 | 1.0819 | 0.8496 | 0.1588 | | 0.6303 | 0.5242 | 0.0437 |
| Average | 0.6426 | 0.7623 | 0.7197 | 0.0305 | | 0.6325 | 0.5860 | 0.0248 |
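
The DB columns above report the Davies-Bouldin validity index, where lower values indicate more compact, better-separated clusters. As a rough illustration of what is being minimised, here is a minimal pure-Python sketch of the standard Davies-Bouldin definition with Euclidean distances. It is an assumption that the paper uses this exact variant, since this section does not give the formula, and the function names and toy clusters are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def davies_bouldin(clusters):
    """Standard Davies-Bouldin index for a list of clusters.

    Assumes at least two non-empty clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its own centroid.
    scatter = [
        sum(euclidean(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k


# Hypothetical toy partitions of the same four points.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
```

The partition that keeps nearby points together scores far lower than the one that mixes them, which is the property that makes the index usable as a fitness function to minimise.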
Fig 2. The mean run time achieved by SOSK-means on DB and CS measures over forty independent runs for the twelve datasets.
SOSK-means compared with SOS and K-means for forty replications.
| Dataset | Algorithm | DB Mean Sol. | DB Std Dev | CS Mean Sol. | CS Std Dev |
|---|---|---|---|---|---|
| Breast | SOS | 1.3520 | 0.2858 | 0.9946 | 0.2667 |
| Breast | Kmeans | 0.8121 | 0.0000 | 1.1019 | 0.0000 |
| Breast | SOSKmeans | 0.8121 | 0.0000 | 0.7606 | 0.1217 |
| Compound | SOS | 0.6924 | 0.1481 | 0.5670 | 0.1225 |
| Compound | Kmeans | 0.9716 | 0.0748 | 1.2887 | 0.1486 |
| Compound | SOSKmeans | 0.5046 | 0.0044 | 0.5072 | 0.0155 |
| Flame | SOS | 0.8234 | 0.0180 | 1.2707 | 0.1006 |
| Flame | Kmeans | 1.2306 | 0.0059 | 1.5806 | 0.0263 |
| Flame | SOSKmeans | 0.7770 | 0.0008 | 0.3846 | 0.0000 |
| Glass | SOS | 0.8164 | 0.1174 | 0.2200 | 0.2563 |
| Glass | Kmeans | 1.2208 | 0.1570 | 1.4894 | 0.1904 |
| Glass | SOSKmeans | 0.7113 | 0.1217 | 0.0608 | 0.0000 |
| Iris | SOS | 0.8602 | 0.1809 | 0.8585 | 0.1922 |
| Iris | Kmeans | 0.9167 | 0.0033 | 1.2404 | 0.0092 |
| Iris | SOSKmeans | 0.6346 | 0.0188 | 0.5743 | 0.0237 |
| Jain | SOS | 0.7007 | 0.0274 | 0.8196 | 0.0212 |
| Jain | Kmeans | 0.8587 | 0.0001 | 1.0668 | 0.0003 |
| Jain | SOSKmeans | 0.6518 | 0.0009 | 0.6546 | 0.0000 |
| Pathbased | SOS | 0.7578 | 0.0686 | 1.0021 | 0.1708 |
| Pathbased | Kmeans | 0.7696 | 0.0066 | 0.9893 | 0.0086 |
| Pathbased | SOSKmeans | 0.6708 | 0.0031 | 0.6511 | 0.0120 |
| Spiral | SOS | 0.8013 | 0.0447 | 1.0818 | 0.2107 |
| Spiral | Kmeans | 0.9589 | 0.0109 | 1.1896 | 0.0053 |
| Spiral | SOSKmeans | 0.7437 | 0.0045 | 0.6812 | 0.0115 |
| Thyroid | SOS | 1.0232 | 0.1479 | 0.6446 | 0.0238 |
| Thyroid | Kmeans | 1.0298 | 0.2042 | 1.7863 | 0.3602 |
| Thyroid | SOSKmeans | 0.6321 | 0.0316 | 0.6409 | 0.0000 |
| Twomoons | SOS | 0.6128 | 0.0179 | 0.7701 | 0.0281 |
| Twomoons | Kmeans | 0.7948 | 0.0000 | 0.9385 | 0.0000 |
| Twomoons | SOSKmeans | 0.6032 | 0.0010 | 0.7498 | 0.0162 |
| Wine | SOS | 1.1488 | 0.1394 | 1.1938 | 0.3318 |
| Wine | Kmeans | 1.3053 | 0.0022 | 1.4425 | 0.0128 |
| Wine | SOSKmeans | 1.0460 | 0.0207 | 0.8422 | 0.0527 |
| Yeast | SOS | 1.2144 | 0.2911 | 0.5594 | 0.2847 |
| Yeast | Kmeans | 1.7176 | 0.1875 | 2.6417 | 0.5950 |
| Yeast | SOSKmeans | 0.8496 | 0.1588 | 0.5242 | 0.0437 |
SOSK-means results compared with results from existing algorithms in the literature.
| Dataset | Algorithm | DB Mean | DB Std Dev | CS Mean | CS Std Dev |
|---|---|---|---|---|---|
| Breast | SOSKmeans | 0.8121 | 0 | 0.7606 | 0.1217 |
| Breast | SOS | 1.352 | 0.2858 | 0.9946 | 0.2667 |
| Breast | Kmeans | 0.8121 | 0 | 1.1019 | 0 |
| Breast | SOSTLBO | 0.8937 | 0.0384 | - | - |
| Breast | SOSFA | 0.7644 | 0.0211 | - | - |
| Breast | SOSPSO | 0.7128 | 0.1458 | - | - |
| Breast | SOSDE | 1.1378 | 0.0947 | - | - |
| Breast | DE | 0.5199 | 0.007 | 0.8984 | 0.381 |
| Breast | DCPSO | 0.5754 | 0.073 | 0.4854 | 0.009 |
| Breast | GCUK | 0.6328 | 0.002 | 0.6089 | 0.016 |
| Glass | SOSKmeans | 0.7113 | 0.1217 | 0.0608 | 0 |
| Glass | SOS | 0.8164 | 0.1174 | 0.22 | 0.2563 |
| Glass | Kmeans | 1.2208 | 0.157 | 1.4894 | 0.1904 |
| Glass | SOSTLBO | 0.7832 | 0.0357 | - | - |
| Glass | SOSFA | 0.6707 | 0.0459 | - | - |
| Glass | SOSPSO | 0.6318 | 0.0418 | - | - |
| Glass | SOSDE | 0.8444 | 0.0216 | - | - |
| Glass | DE | 1.6673 | 0.004 | 0.7782 | 0.643 |
| Glass | DCPSO | 1.5152 | 0.073 | 0.7361 | 0.671 |
| Glass | GCUK | 1.8371 | 0.034 | 0.7282 | 2.003 |
| Iris | SOSKmeans | 0.6346 | 0.0188 | 0.5743 | 0.0237 |
| Iris | SOS | 0.8602 | 0.1809 | 0.8585 | 0.1922 |
| Iris | Kmeans | 0.9167 | 0.0033 | 1.2404 | 0.0092 |
| Iris | SOSTLBO | 0.634 | 0.0182 | - | - |
| Iris | SOSFA | 0.591 | 0.0075 | - | - |
| Iris | SOSPSO | 0.5714 | 0.0038 | - | - |
| Iris | SOSDE | 0.6916 | 0.0267 | - | - |
| Iris | DE | 0.5822 | 0.067 | 0.7633 | 0.039 |
| Iris | DCPSO | 0.6899 | 0.008 | 0.6899 | 0.008 |
| Iris | GCUK | 0.7377 | 0.065 | 0.7377 | 0.65 |
| Spiral | SOSKmeans | 0.7437 | 0.0045 | 0.6812 | 0.0115 |
| Spiral | SOS | 0.8013 | 0.0447 | 1.0818 | 0.2107 |
| Spiral | Kmeans | 0.9589 | 0.0109 | 1.1896 | 0.0053 |
| Spiral | SOSTLBO | 0.7412 | 0.042 | - | - |
| Spiral | SOSFA | 0.7388 | 0.003 | - | - |
| Spiral | SOSPSO | 0.7332 | 0.0053 | - | - |
| Spiral | SOSDE | 0.7453 | 0.004 | - | - |
| Spiral | DE | - | - | - | - |
| Spiral | DCPSO | - | - | - | - |
| Spiral | GCUK | - | - | - | - |
| Thyroid | SOSKmeans | 0.6321 | 0.0316 | 0.6409 | 0 |
| Thyroid | SOS | 1.0232 | 0.1479 | 0.6446 | 0.0238 |
| Thyroid | Kmeans | 1.0298 | 0.2042 | 1.7863 | 0.3602 |
| Thyroid | SOSTLBO | 0.6148 | 0.0234 | - | - |
| Thyroid | SOSFA | 0.5313 | 0.0077 | - | - |
| Thyroid | SOSPSO | 0.5021 | 0.0483 | - | - |
| Thyroid | SOSDE | 0.7172 | 0.0532 | - | - |
| Thyroid | DE | - | - | - | - |
| Thyroid | DCPSO | - | - | - | - |
| Thyroid | GCUK | - | - | - | - |
| Wine | SOSKmeans | 1.046 | 0.0207 | 0.8422 | 0.0527 |
| Wine | SOS | 1.1488 | 0.1394 | 1.1938 | 0.3318 |
| Wine | Kmeans | 1.3053 | 0.0022 | 1.4425 | 0.0128 |
| Wine | SOSTLBO | 1.0413 | 0.0242 | - | - |
| Wine | SOSFA | 0.9229 | 0.0189 | - | - |
| Wine | SOSPSO | 0.8489 | 0.0741 | - | - |
| Wine | SOSDE | 1.1108 | 0.0399 | - | - |
| Wine | DE | 3.3923 | 0.092 | 1.7964 | 0.802 |
| Wine | DCPSO | 4.3432 | 0.232 | 1.8721 | 0.232 |
| Wine | GCUK | 5.3424 | 0.343 | 1.5842 | 0.343 |
| Yeast | SOSKmeans | 0.8496 | 0.1588 | 0.5242 | 0.0437 |
| Yeast | SOS | 1.2144 | 0.2911 | 0.5594 | 0.2847 |
| Yeast | Kmeans | 1.7176 | 0.1875 | 2.6417 | 0.595 |
| Yeast | SOSTLBO | 0.8954 | 0.0236 | - | - |
| Yeast | SOSFA | 0.7518 | 0.0346 | - | - |
| Yeast | SOSPSO | 0.7599 | 0.0666 | - | - |
| Yeast | SOSDE | 0.9869 | 0.0312 | - | - |
| Yeast | DE | - | - | - | - |
| Yeast | DCPSO | - | - | - | - |
| Yeast | GCUK | - | - | - | - |
Friedman mean rank test results for the SOS, K-means and hybrid SOSK-means algorithms.
| Dataset | DB SOS | DB Kmeans | DB SOSKmeans | CS SOS | CS Kmeans | CS SOSKmeans |
|---|---|---|---|---|---|---|
| Breast | 3.00 | | | 2.25 | 2.45 | |
| Compound | 1.95 | 2.95 | | 1.65 | 3.00 | |
| Flame | 2.00 | 3.00 | | 2.00 | 3.00 | |
| Glass | 1.73 | 3.00 | | 1.84 | 3.00 | |
| Iris | 2.35 | 2.48 | | 1.86 | 3.00 | |
| Jain | 2.00 | 3.00 | | 2.00 | 3.00 | |
| Pathbased | 2.30 | 2.45 | | 2.66 | 2.23 | |
| Spiral | 1.83 | 3.00 | | 2.29 | 2.58 | |
| Thyroid | 2.40 | 2.53 | | 1.51 | 3.00 | |
| Twomoons | 1.66 | 3.00 | | 1.73 | 3.00 | |
| Wine | 2.03 | 2.70 | | 2.14 | 2.63 | |
| Yeast | 2.00 | 2.88 | | 1.23 | 3.00 | 1.78 |
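
For readers unfamiliar with Friedman mean ranks: within each run the competing algorithms are ranked by their index value (rank 1 for the lowest, ties sharing the average rank), and the ranks are then averaged over runs. The sketch below is a minimal pure-Python illustration of that ranking step under those assumptions. The function names and the score lists are hypothetical, and the test statistic itself is not computed.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def friedman_mean_ranks(runs):
    """Mean rank per algorithm; `runs` holds one list of scores per run."""
    k = len(runs[0])
    totals = [0.0] * k
    for scores in runs:
        for j, rank in enumerate(average_ranks(scores)):
            totals[j] += rank
    return [t / len(runs) for t in totals]


# Hypothetical scores for three algorithms over two runs (lower is better).
example_runs = [
    [0.9, 0.7, 0.6],
    [0.6, 0.9, 0.7],
]
mean_ranks = friedman_mean_ranks(example_runs)
```

With three algorithms the mean ranks in each row always sum to 6, so an algorithm that wins every run outright scores 1.00 and one that loses every run scores 3.00.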
Wilcoxon rank-sum test for equal medians showing corresponding p-values.
| Dataset | DB: SOS vs Kmeans | DB: SOSKmeans vs SOS | DB: SOSKmeans vs Kmeans | CS: SOS vs Kmeans | CS: SOSKmeans vs SOS | CS: SOSKmeans vs Kmeans |
|---|---|---|---|---|---|---|
| Breast | 0.000 | 0.000 | 1.000 | 0.034 | 0.000 | 0.000 |
| Compound | 0.000 | 0.000 | 0.000 | 0.000 | 0.006 | 0.000 |
| Flame | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Glass | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 |
| Iris | 0.147 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Jain | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Pathbased | 0.301 | 0.000 | 0.000 | 0.053 | 0.000 | 0.000 |
| Spiral | 0.000 | 0.000 | 0.000 | 0.044 | 0.000 | 0.000 |
| Thyroid | 0.554 | 0.000 | 0.000 | 0.000 | 0.317 | 0.000 |
| Twomoons | 0.000 | 0.009 | 0.000 | 0.000 | 0.000 | 0.000 |
| Wine | 0.000 | 0.000 | 0.000 | 0.002 | 0.000 | 0.000 |
| Yeast | 0.000 | 0.000 | 0.000 | 0.000 | 0.021 | 0.000 |
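
As a rough guide to how p-values of this kind arise, the following pure-Python sketch implements a two-sided Wilcoxon rank-sum test using the large-sample normal approximation. It is an illustration only: the software and exact variant used in the paper are not stated in this section, no tie or continuity correction is applied to the variance, and the function name and sample values are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). No tie or continuity correction is applied (assumption).
    """
    pooled = list(x) + list(y)
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    ranks = [0.0] * len(pooled)
    pos = 0
    while pos < len(order):
        end = pos
        # Tied observations share the mean of the ranks they span.
        while end + 1 < len(order) and pooled[order[end + 1]] == pooled[order[pos]]:
            end += 1
        for idx in order[pos:end + 1]:
            ranks[idx] = (pos + end) / 2 + 1
        pos = end + 1

    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail probability
    return z, p


# Hypothetical index values from two algorithms over five runs each.
z_stat, p_value = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

When the two samples are identical, the rank sum equals its expected value and the sketch returns p = 1, which is how an entry of 1.000 can arise when two algorithms produce the same value on every run.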
Fig 3. Clustering illustration of hybrid SOSK-means for the listed datasets using DB-Index.
Fig 4. Clustering illustration of hybrid SOSK-means for the listed datasets using CS-Index.