Saeedeh Pourahmad1,2, Atefeh Basirat2, Amir Rahimi1,3, Marziyeh Doostfatemeh2.
Abstract
Abstract
Random selection of the initial centroids (centers) for the clusters is a fundamental defect of the K-means clustering algorithm: the algorithm's performance depends on the initial centroids, and it may end up in a local optimum. Various hybrid methods have been introduced to resolve this defect. Since there are no comparative studies assessing these methods in various aspects, the present paper compared three hybrid methods, based on the concepts of the genetic algorithm, the minimum spanning tree, and hierarchical clustering, with the ordinary K-means clustering algorithm. Although these three hybrid methods have received considerable attention in previous research, few studies have compared their results. Hence, seven quantitative datasets with different characteristics in terms of sample size, number of features, and number of classes were utilized in the present study. Eleven external and internal evaluation indices were also considered for comparing the methods. The data indicated that the hybrid methods reached the final solution at a higher convergence rate than the ordinary K-means method. Furthermore, the hybrid method based on hierarchical clustering converges to the optimal solution in fewer iterations than the other two hybrid methods. However, the hybrid methods based on the minimum spanning tree and the genetic algorithm may not always, or even often, be more effective than the ordinary K-means method. Therefore, despite their computational complexity, these three hybrid methods have not led to much improvement over the K-means method. A simulation study is, however, required to compare the methods further and complete the conclusion.
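The best-performing hybrid in this comparison, hierarchical initialization of K-means, can be sketched with scikit-learn as below. This is an illustration on the Iris data under assumed defaults (Ward-linkage agglomeration, a single K-means run per initialization), not the authors' implementation:

```python
# Sketch of a hierarchical-initialized K-means hybrid versus ordinary
# (randomly initialized) K-means; scikit-learn illustration only.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering

X = load_iris().data
k = 3

# Ordinary K-means: random initial centroids, a single run.
km_random = KMeans(n_clusters=k, init="random", n_init=1, random_state=0).fit(X)

# Hybrid: run hierarchical clustering first, then use its cluster means
# as deterministic initial centroids for K-means.
hier_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
centroids = np.vstack([X[hier_labels == j].mean(axis=0) for j in range(k)])
km_hybrid = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)

# The paper's finding is about convergence speed: the hybrid typically
# needs fewer Lloyd iterations than the random start.
print("random init iterations:", km_random.n_iter_)
print("hierarchical init iterations:", km_hybrid.n_iter_)
```

The `n_iter_` attribute corresponds to the iteration counts ("I") reported in the comparison table below the dataset descriptions.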
Year: 2020 PMID: 32802153 PMCID: PMC7416251 DOI: 10.1155/2020/7636857
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Description of seven datasets utilized for comparisons among the methods1.
| Name of dataset | Sample size (+/-) | No. of variables (features) | No. of classes (labels) | No. of optimal clusters2 |
|---|---|---|---|---|
| Leukemia | 64 (26/38) | 4 | 2 | 2 |
| Prostate | 30 (15/15) | 3 | 2 | 2 |
| Colon Cancer | 111 (56/55) | 4 | 2 | 2 |
| Haberman | 306 | 3 | 2 | 2 |
| Iris | 150 | 4 | 3 | 3 |
| Wine | 178 | 13 | 3 | 3 |
| Glass | 214 | 10 | 7 | 7 |
1Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/gds) & University of California, Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php). 2The number of optimal clusters based on the elbow, gap, and silhouette criteria, applying the majority rule.
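The optimal-cluster column above was obtained by a majority vote over the elbow, gap, and silhouette criteria. A minimal sketch of one of those three, silhouette-based selection of k with scikit-learn (illustration on Iris; the elbow and gap statistics would be tabulated analogously before voting):

```python
# Choose k by the silhouette criterion: fit K-means for a range of k
# and keep the k with the highest average silhouette score.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("silhouette-optimal k:", best_k)
```

Under the majority rule used in the table, this vote would be combined with the elbow and gap votes to fix the final number of clusters.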
Figure 1. K-means clustering algorithm.
Figure 2. MST-based K-means clustering algorithm.
Figure 3. GA-based K-means clustering algorithm.
Figure 4. Hierarchical-based K-means clustering algorithm.
Comparison of four different ordinary clustering methods based on the silhouette and RPT indexes.

| Dataset | Index | Hierarchical | DBSCAN | EM algorithm |
|---|---|---|---|---|
| Leukemia | Silhouette | 0.4663 | 0.2693 | 0.4419 |
| Leukemia | RPT | 0.8612 | 0.5087 | 0.8160 |
| Prostate | Silhouette | 0.3265 | 0.3339 | 0.2756 |
| Prostate | RPT | 0.6141 | 0.6319 | 0.5295 |
| Colon | Silhouette | 0.5189 | 0.3156 | 0.5176 |
| Colon | RPT | 0.9516 | 0.5747 | 0.9478 |
| Haberman | Silhouette | 0.2477 | 0.6266 | 0.1384 |
| Haberman | RPT | 0.4787 | 1.15 | 0.2704 |
| Iris | Silhouette | 0.4589 | 0.4796 | 0.3728 |
| Iris | RPT | 0.8446 | 0.8614 | 0.6812 |
| Wine | Silhouette | 0.2469 | 0.1575 | 0.1911 |
| Wine | RPT | 0.4788 | 0.3092 | 0.3742 |
| Glass | Silhouette | 0.3411 | 0.4281 | 0.2809 |
| Glass | RPT | 0.6369 | 0.7921 | 0.5148 |

RPT: robustness-performance trade-off.
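For readers who want to reproduce this style of comparison, a small scikit-learn sketch that scores hierarchical clustering, DBSCAN, and an EM (Gaussian mixture) model with the silhouette index on Iris. The parameter choices (e.g. `eps=0.8`) are ad hoc assumptions, and the outputs will not match the table's values, which come from the paper's own runs:

```python
# Silhouette comparison of three ordinary clustering methods on Iris.
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

X = load_iris().data

hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)
db = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # eps chosen ad hoc
em = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

for name, labels in [("hierarchical", hier), ("DBSCAN", db), ("EM", em)]:
    if len(set(labels)) > 1:  # silhouette needs at least two clusters
        print(name, round(silhouette_score(X, labels), 4))
```

Note that DBSCAN labels noise points as -1; for a faithful comparison one would decide whether to drop them or treat noise as its own cluster before scoring.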
Comparison among the hybrid and ordinary K-means clustering methods based on eleven evaluation criteria.

| Dataset | I | SSE | Si | RPT | Dunn | RI | ARI | AC | F | HI | VI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Leukemia | 5 | 116.2 | 0.4702 | 0.880 | 0.1431 | 0.8809 | 0.7617 | 0.9375 | 0.8848 | 0.76197 | 0.6477 |
| Leukemia | 2 | 116.2 | 0.4650 | 0.880 | 0.1431 | 0.8809 | 0.7617 | 0.9375 | 0.8848 | 0.76197 | 0.6477 |
| Leukemia | 1 | 116.8 | 0.4675 | 0.8719 | 0.1679 | 0.9092 | 0.8183 | 0.9531 | 0.9115 | 0.8184 | 0.5357 |
| Leukemia | 4 | 116.2 | 0.4702 | 0.8801 | 0.1431 | 0.8809 | 0.7617 | 0.9375 | 0.8848 | 0.76197 | 0.6477 |
| Prostate | 6 | 60.3 | 0.2677 | 0.5149 | 0.0969 | 0.6298 | 0.2599 | 0.7667 | 0.6247 | 0.2602 | 1.51 |
| Prostate | 1 | 58.1 | 0.3944 | 0.6141 | 0.1549 | 0.5954 | 0.1980 | 0.7337 | 0.6364 | 0.2069 | 1.21 |
| Prostate | 2 | 62.1 | 0.3935 | 0.7498 | 0.2239 | 0.4919 | 0.0019 | 0.5667 | 0.5915 | 0.0022 | 1.51 |
| Prostate | 4 | 58.7 | 0.2796 | 0.5385 | 0.1498 | 0.7126 | 0.4247 | 0.8333 | 0.7031 | 0.4247 | 1.29 |
| Colon | 4 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 |
| Colon | 2 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 |
| Colon | 3 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 |
| Colon | 2 | 161.47 | 0.5248 | 0.9650 | 0.1431 | 0.8650 | 0.73 | 0.9279 | 0.8638 | 0.7300 | 0.7411 |
| Haberman | 6 | 698.8 | 0.2477 | 0.4787 | 0.023 | 0.4991 | -0.002 | 0.5196 | 0.5483 | -0.0015 | 1.83 |
| Haberman | 4 | 684.4 | 0.2733 | 0.5256 | 0.035 | 0.5038 | 0.0083 | 0.5523 | 0.5523 | 0.0085 | 1.82 |
| Haberman | 4 | 702.8 | 0.3888 | 0.7427 | 0.073 | 0.6189 | 0.1284 | 0.7451 | 0.7270 | 0.7451 | 0.1405 |
| Haberman | 5 | 682.1 | 0.2751 | 0.5305 | 0.039 | 0.4997 | -0.001 | 0.5261 | 0.5488 | -0.003 | 1.83 |
| Iris | 7 | 140 | 0.4589 | 0.8446 | 0.02637 | 0.8322 | 0.6201 | 0.8333 | 0.7452 | 0.6201 | 1.079 |
| Iris | 3 | 141.1 | 0.4554 | 0.8359 | 0.07756 | 0.8431 | 0.6451 | 0.8533 | 0.7622 | 0.6452 | 1.072 |
| Iris | 5 | 191.7 | 0.4787 | 0.8917 | 0.05309 | 0.7197 | 0.4290 | 0.5732 | 0.6505 | 0.4488 | 1.19 |
| Iris | 3 | 140 | 0.4589 | 0.8446 | 0.02637 | 0.8322 | 0.6201 | 0.8333 | 0.7452 | 0.6201 | 1.079 |
| Wine | 8 | 1589.1 | 0.2469 | 0.4788 | 0.1357 | 0.6915 | 0.3757 | 0.6067 | 0.6237 | 0.3927 | 1.42 |
| Wine | 2 | 1270.2 | 0.2905 | 0.5481 | 0.2323 | 0.9543 | 0.8975 | 0.9663 | 0.9319 | 0.8976 | 0.39 |
| Wine | 4 | 1270.2 | 0.2849 | 0.5481 | 0.2323 | 0.9543 | 0.8975 | 0.9663 | 0.9319 | 0.8976 | 0.39 |
| Wine | 4 | 1270.2 | 0.2849 | 0.5481 | 0.2323 | 0.9543 | 0.8975 | 0.9663 | 0.9319 | 0.8976 | 0.39 |
| Glass | 13 | 687.4 | 0.3411 | 0.6369 | 0.05804 | 0.6891 | 0.1966 | 0.4346 | 0.4073 | 0.1966 | 2.8 |
| Glass | 2 | 679.9 | 0.3458 | 0.6433 | 0.04906 | 0.6926 | 0.2036 | 0.4395 | 0.4116 | 0.2036 | 2.73 |
| Glass | 4 | 790.2 | 0.3021 | 0.5754 | 0.06644 | 0.6531 | 0.1908 | 0.3598 | 0.4327 | 0.1954 | 2.60 |
| Glass | 10 | 678.6 | 0.3427 | 0.6390 | 0.04502 | 0.6879 | 0.1946 | 0.4766 | 0.4062 | 0.1946 | 2.84 |
I: number of iterations; SSE: sum of squared errors; Si: silhouette, −1 < Si < +1; RPT: robustness-performance trade-off, RPT > 0; RI: Rand index, RI > 0; ARI: adjusted Rand index, −1 < ARI < +1; AC: accuracy, 0 < AC < 1; F: F-measure; HI: Hubert's Γ index, −1 < HI < +1; VI: variation of information, VI > 0; K+H: hierarchical K-means clustering; K+MST: minimum spanning tree K-means clustering; K+GA: genetic K-means clustering.
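Two of the external indices defined above, ARI and RI, can be computed directly with scikit-learn. The toy labelling below is only for illustration; the other true-versus-predicted indices in the table follow the same pattern:

```python
# External evaluation of a clustering against ground-truth class labels.
from sklearn.metrics import adjusted_rand_score, rand_score

true_labels = [0, 0, 1, 1, 2, 2]  # hypothetical ground truth
pred_labels = [0, 0, 1, 2, 2, 2]  # hypothetical clustering result

ari = adjusted_rand_score(true_labels, pred_labels)  # chance-corrected, in [-1, 1]
ri = rand_score(true_labels, pred_labels)            # raw pair agreement, in [0, 1]
print(round(ari, 3), round(ri, 3))
```

ARI corrects RI for chance agreement, which is why the table can show a near-zero ARI (e.g. Haberman) alongside an RI near 0.5.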
Figure 5. Number of iterations to converge for the hybrid methods in comparison with the K-means method.