
Does Determination of Initial Cluster Centroids Improve the Performance of K-Means Clustering Algorithm? Comparison of Three Hybrid Methods by Genetic Algorithm, Minimum Spanning Tree, and Hierarchical Clustering in an Applied Study.

Saeedeh Pourahmad1,2, Atefeh Basirat2, Amir Rahimi1,3, Marziyeh Doostfatemeh2.   

Abstract

Random selection of initial centroids (centers) is a fundamental defect of the K-means clustering algorithm: the algorithm's performance depends on the initial centroids, and it may end up in local optima. Various hybrid methods have been introduced to resolve this defect. Since there are no comparative studies examining these methods from various aspects, the present paper compared three hybrid K-means methods based on the genetic algorithm, the minimum spanning tree, and hierarchical clustering. Although these three hybrid methods have received much attention in previous research, few studies have compared their results. Hence, seven quantitative datasets with different characteristics in terms of sample size, number of features, and number of classes were utilized in the present study. Eleven external and internal evaluation indices were also considered for comparing the methods. The results indicated that the hybrid methods converge to the final solution faster than the ordinary K-means method. Furthermore, the hybrid method with hierarchical clustering converges to the optimal solution with fewer iterations than the other two hybrid methods. However, the hybrid methods with minimum spanning trees and genetic algorithms are not always more effective than the ordinary K-means method. Therefore, despite their computational complexity, these three hybrid methods did not lead to much improvement in the K-means method. However, a simulation study is required to compare the methods and complete the conclusion.
Copyright © 2020 Saeedeh Pourahmad et al.


Year:  2020        PMID: 32802153      PMCID: PMC7416251          DOI: 10.1155/2020/7636857

Source DB:  PubMed          Journal:  Comput Math Methods Med        ISSN: 1748-670X            Impact factor:   2.238


1. Introduction

Clustering is a branch of unsupervised learning. It is widely used as a first step in interpreting data: samples are divided into groups whose members are similar to each other [1]. A good clustering algorithm should be efficient, reliable, and capable of determining relevant clusters [2]. The four well-known crisp clustering branches, distribution-based, density-based, connection-based, and partition-based, are represented by the EM (expectation-maximization), DBScan (density-based spatial clustering of applications with noise), hierarchical, and K-means clustering methods, respectively [1]. Of course, there are other categories of clustering methods, such as fuzzy clustering algorithms (e.g., fuzzy C-means), which are outside the scope of the present research. K-means clustering is an important and popular technique in data mining. It is a partition-based clustering algorithm that starts with randomly selected points as the initial centroids (centers) and then updates these centroids in an iterative process until some convergence criterion is met. The simplicity of the K-means clustering method makes it a basic and popular method in different fields of research. A notable property of this method is that it works relatively well even when clusters overlap [3]. The method also scales to high-volume data. However, the more clusters there are, the more K-means may fail to find all of them correctly [3]. In addition, the clusters created by this method are spherical and convex. Its performance also depends on the initial cluster centroids, and it often ends in a local optimum [3, 4]. To solve these problems, different hybrid methods have been proposed [5-25]. Some of them try to solve the problems of K-means by different methods [5–10, 12–20], and others use the simplicity of K-means to improve the performance of other clustering methods [11, 21–24].
The present paper evaluated three well-known hybrid K-means methods, based on the minimum spanning tree (MST) [5], the genetic algorithm (GA) [6], and hierarchical clustering [7], on different datasets. The genetic algorithm is a good option for avoiding the local optima of K-means and yields proper initial cluster centers [25]. Clustering based on the MST is known for deriving irregular boundaries and for outlier detection [23], and MST-based clustering techniques have been widely used for efficient clustering [23, 24]. The combination of partition-based and hierarchical clustering methods may also strengthen both approaches and discard their disadvantages [7]. Meanwhile, an important task in cluster analysis is evaluating the result of a clustering method or comparing it to another clustering result; many different validity measures have been proposed in the literature [26]. Among these, we applied eleven validity indices (internal, external, and relative) to judge and compare the results of the clustering methods. The analysis was therefore performed in two phases. In phase I, to investigate whether K-means is a proper clustering method for each dataset, the EM, DBScan, hierarchical, and K-means clustering methods were applied first. Then, the three hybrid methods were tested on each dataset in phase II and compared with the results of phase I. Accordingly, this paper is organized as follows. The ordinary K-means algorithm is briefly reviewed along with the three hybrid methods in Section 2, where the seven Internet datasets utilized in the present study are also introduced. In Section 3, four ordinary clustering algorithms (K-means, hierarchical, DBScan, and EM) together with the three hybrid methods (MST-based, GA-based, and hierarchical-based K-means) are applied to each dataset, and the results of eleven different external and internal evaluation indices are reported for comparison.
Section 4 contains some discussion on comparing these methods.

2. Materials and Methods

2.1. Materials

All the hybrid methods of the K-means algorithm introduced in the present paper, despite their different underlying theories, aim to improve this method by eliminating the defect of random selection of initial centroids in K-means clustering. The behavior of these hybrid methods can be influenced by various factors such as the number of variables (features) in the dataset, the sample size, and even the number of labels (classes) in the data, and they can exhibit quite different results. Since these hybrid methods have not yet been compared on the same datasets, seven web datasets with different characteristics were used to investigate their performance in the present paper. The data include three gene expression datasets relating to leukemia, prostate, and colon cancers; these are high-dimensional data with expression of more than 20,000 genes and were downloaded from the Gene Expression Omnibus (GEO) database (Table 1). The other four datasets are well-known standard Internet data appropriate for clustering methods and have been used in many applied papers to measure the performance of clustering algorithms (for instance, [5, 6]). These data are available to all researchers for scientific research in the UCI database (University of California, Irvine (UCI): Center for Machine Learning and Intelligent Systems) (Table 1).
Table 1

Description of seven datasets utilized for comparisons among the methods1.

Name of dataset   Sample size (+/-)   No. of variables (features)   No. of classes (labels)   No. of optimal clusters2
Leukemia          64 (26/38)          4                             2                         2
Prostate          30 (15/15)          3                             2                         2
Colon cancer      111 (56/55)         4                             2                         2
Haberman          306                 3                             2                         2
Iris              150                 4                             3                         3
Wine              178                 13                            3                         3
Glass             214                 10                            7                         7

1Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/gds) & University of California, Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php). 2The number of optimal clusters is based on the elbow, gap, and silhouette methods, applying the majority rule.

2.2. Methods

2.2.1. K-Means Clustering Method

The basic idea of K-means clustering is to define clusters such that the total within-cluster variation is minimized. There are many algorithms for the K-means clustering method. The MacQueen algorithm was used in the present paper [27], which defines the total within-cluster variation as the sum of squared Euclidean distances between objects and centroids. Let X = {x_i}, i = 1, 2, ⋯, n, be a set of n d-dimensional observations (points) to be clustered into a set of K clusters, C = {c_k, k = 1, 2, ⋯, K}. The K-means algorithm finds a partition such that the squared error between the center (empirical mean) of a cluster and the points in the cluster is minimized. Let μ_k be the mean of cluster c_k. The sum of squared errors (SSE) between μ_k and the points in cluster c_k is defined as [28]

SSE(c_k) = Σ_{x_i ∈ c_k} ‖x_i − μ_k‖².     (1)

The goal of K-means is to minimize the sum of the squared errors over all K clusters,

SSE(C) = Σ_{k=1}^{K} Σ_{x_i ∈ c_k} ‖x_i − μ_k‖².     (2)

In general, the algorithmic steps of this method are summarized as follows (Figure 1):
Figure 1

K-means clustering algorithm.

1. Initial K cluster centroids are selected randomly from the observations.
2. The distance between each observation and each cluster centroid is calculated, and the observation is assigned to the cluster with the minimal distance to its centroid.
3. Cluster centroids are updated by averaging the observations contained in each cluster.
4. The distance between each observation and the new cluster centroids is recalculated, and observations are placed in new clusters based on the minimum distance to the centroids.
5. Steps 3 and 4 are repeated until the cluster centroids no longer change and convergence occurs.
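The steps above can be sketched in a few lines of Python. This is a minimal illustration with NumPy, not the implementation used in the paper; the stopping rule follows step 5 (centroids no longer change):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means following the five steps above: random initial
    centroids from the data, iterate assignment and centroid updates
    until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K initial centroids at random from the observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for n_iter in range(1, max_iter + 1):
        # Steps 2/4: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as its cluster mean
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centroids, sse, n_iter
```

Running this with different seeds changes the randomly chosen initial centroids, which is exactly the sensitivity the hybrid methods below try to remove.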

2.2.2. Combination of K-Means Clustering Algorithm with Minimum Spanning Tree Method

Minimum spanning trees (MSTs) have long been applied in data mining, pattern recognition, and machine learning [3]. MST-based clustering techniques usually lead to efficient clustering [23, 24]. Indeed, these hybrid clustering methods can identify clusters of arbitrary shape by removing inconsistent edges and can detect clusters of heterogeneous nature. The MST-based clustering algorithm was proposed by Zahn [23]; since then, several studies have improved it (e.g., [5, 23, 24]). Here, the MST is utilized as a preanalysis step to find the initial centroids for the K-means algorithm [5]. In graph theory, a dataset can be represented by a complete graph G, where the number of vertices equals the number of points in the dataset. The weight of the edge between two vertices is the Euclidean distance between the corresponding points based on their feature (variable) vectors. A tree is an undirected connected graph that does not contain any cycle. A spanning tree is a subgraph of a complete weighted graph that has all the properties of a tree and contains all vertices of the graph. For a complete weighted graph, the minimum spanning tree has the least total weight among all spanning trees of that graph. In the present study, we followed the idea introduced by Yang et al. [5] and used the MST to initialize the K-means clustering algorithm. Accordingly, the MST-based K-means clustering algorithm applied in the present study is as follows [5]:

1. The number of points (n observations) and the number of clusters (K) are entered as input parameters.
2. The MST is generated using Prim's algorithm.
3. The set S = {s_1, s_2, ⋯, s_m} is created, containing the skeleton points through which the most edges pass (the number of edges incident to a point is its degree). S contains m points that satisfy some specific criteria (see ref. [5] for details) and are important candidates for cluster centroids in the first stage of K-means clustering.
4. Distances between any two skeleton points of S are calculated (Equation (3)), where v_i and v_j are the degrees of s_i and s_j, respectively.
5. The skeleton point s with the highest degree is selected and entered into the set of initial centroids, denoted by C = {s}.
6. The remaining skeleton point r of S satisfying Equation (4) is added to the set: C = {r} ∪ C.
7. Step 6 is repeated until the number of initial cluster centroids equals K.

Figure 2 describes this process in a flowchart.
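A rough sketch of this initialization in Python follows, using SciPy's minimum_spanning_tree. Because Equations (3) and (4) are not reproduced here, the selection rule below (degree times distance to the already-chosen seeds) is only an assumption standing in for the criteria of Yang et al.; it illustrates the flow of the algorithm, not the paper's exact rule:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_initial_centroids(X, k):
    """Sketch of MST-based seeding for K-means: build the MST of the
    complete distance graph, take high-degree vertices as skeleton points,
    then greedily pick K seeds that are far apart, weighted by degree
    (our stand-in for Equations (3)-(4), not the paper's rule)."""
    D = squareform(pdist(X))                  # complete weighted graph
    mst = minimum_spanning_tree(D).toarray()  # MST (off-diagonal weights > 0)
    adj = (mst > 0) | (mst.T > 0)
    degree = adj.sum(axis=0)                  # degree of each vertex in the MST
    # Skeleton points: the highest-degree vertices (2K candidates here).
    skeleton = np.argsort(degree)[::-1][:max(2 * k, k)]
    # Start with the highest-degree skeleton point ...
    centers = [skeleton[0]]
    # ... then repeatedly add the skeleton point maximizing
    # degree x (distance to the chosen set), until K seeds are selected.
    while len(centers) < k:
        rest = [s for s in skeleton if s not in centers]
        scores = [degree[s] * min(D[s, c] for c in centers) for s in rest]
        centers.append(rest[int(np.argmax(scores))])
    return X[centers]
```

The returned array can then be handed to a standard K-means run, e.g. scikit-learn's KMeans(n_clusters=k, init=mst_initial_centroids(X, k), n_init=1).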
Figure 2

MST-based K-means clustering algorithm.

2.2.3. Combination of K-Means Clustering with Genetic Algorithm

Genetic algorithms (GAs) in clustering analysis are usually used to determine the number of clusters automatically and to find initial centroids for K-means clustering [16]. Indeed, the genetic algorithm is a good option for solving the local-minimum problem of K-means [25]. Usually, the simplicity of the K-means algorithm and the power of the genetic algorithm are combined to provide a GA-based clustering algorithm, which has also attracted the attention of researchers in the health sciences (e.g., [17-21]). The genetic algorithm is inspired by genetics and Darwin's theory of evolution and is based on survival of the fittest, or natural selection. A common application of genetic algorithms is as a function optimizer: inspired by the evolutionary process of nature, they create a population of individuals and, by operating on this population, reach an optimal individual or set. The hybrid method used in the present paper is a version of the K-means algorithm combined with the genetic algorithm that effectively solves the problem of random selection of initial centroids; simulation tests confirmed this claim [11]. This algorithm preserves all important properties of the K-means method and is also more robust for data containing outliers. In general, the steps of the GA-based K-means clustering algorithm are as follows (see ref. [6] for details):

1. The input parameters are determined, including the initial population size M (number of chromosomes), the number of iterations T (number of generations), the number of clusters K, and the operator rates (crossover, mutation, etc.).
2. M chromosomes are randomly generated to form the initial population, where each chromosome is a set of initial cluster centroids; the centroids within a chromosome should not be identical.
3. A target function is calculated for each chromosome, and the fitness value is computed from it.
4. Crossover, selection, and mutation operators are used to generate the next generation.
5. If the number of produced generations is less than the user-specified number of generations, go to step 3; otherwise, go to step 6.
6. The fitness value is calculated for each chromosome of the last generation; the best fitness in this generation is compared with the best fitness obtained in previous generations, and the larger one is selected based on the estimator function.
7. Finally, the initial centroids obtained from the best chromosome are used, as in step 2, as the initial centroids for the K-means clustering method (Figure 3).
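The loop above can be sketched as follows. This is a simplified illustration, not the exact operators of ref. [6]: the target function is taken here to be the total distance of points to their nearest candidate centroid, selection is fitness-proportional, and crossover/mutation act on the list of candidate centroids:

```python
import numpy as np

def ga_initial_centroids(X, k, pop_size=20, generations=30,
                         p_crossover=0.8, p_mutation=0.1, seed=0):
    """Sketch of GA-based seeding: each chromosome is a set of K candidate
    centroids drawn from the data; fitness is 1/(1 + objective), with the
    objective being the total distance of points to their nearest candidate
    centroid (an assumed stand-in for the paper's target function)."""
    rng = np.random.default_rng(seed)
    n = len(X)

    def objective(cents):
        d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
        return d.min(axis=1).sum()  # total distance to nearest centroid

    # Step 2: initial population, each chromosome = K distinct observations.
    pop = [X[rng.choice(n, k, replace=False)] for _ in range(pop_size)]
    best = min(pop, key=objective)
    for _ in range(generations):
        # Step 3: fitness from the target function.
        fitness = np.array([1.0 / (1.0 + objective(c)) for c in pop])
        new_pop = []
        while len(new_pop) < pop_size:
            # Step 4: fitness-proportional (roulette-wheel) selection ...
            i, j = rng.choice(pop_size, 2, p=fitness / fitness.sum())
            a, b = pop[i].copy(), pop[j].copy()
            if rng.random() < p_crossover:      # ... one-point crossover ...
                cut = rng.integers(1, k) if k > 1 else 0
                a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
            for child in (a, b):
                if rng.random() < p_mutation:   # ... and mutation.
                    child[rng.integers(k)] = X[rng.integers(n)]
                new_pop.append(child)
        pop = new_pop[:pop_size]
        # Step 6: keep the best chromosome seen so far.
        gen_best = min(pop, key=objective)
        if objective(gen_best) < objective(best):
            best = gen_best
    return best  # Step 7: used as the initial centroids for K-means
```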
Figure 3

GA-based K-means clustering algorithm.

2.2.4. Combination of K-Means Clustering Algorithm with Hierarchical Clustering Method

The hierarchical method is the second most important crisp clustering method in microarray technology. In this method, clusters are formed by calculating the similarity or distance between each pair of elements [27]. The number of clusters is determined by the user based on the height at which the clusters merge. The weak point of hierarchical clustering is its termination, and the most important problem of K-means is its initiation [7]. Therefore, the combination of these two methods leads to a hybrid method with interesting characteristics. In the present study, we first apply an agglomerative hierarchical clustering algorithm to the dataset to obtain initial information (initial cluster centroids); then, the K-means algorithm is applied. The steps of the hierarchical-based K-means method are summarized as follows [7]:

1. An agglomerative hierarchical clustering method is applied to the data, and the resulting tree is cut into K clusters.
2. The centroid (mean) of each cluster is calculated, and the set C is created.
3. The K-means algorithm is run with the set C obtained in step 2 as the initial centroids.

Figure 4 summarizes this algorithm in a flowchart.
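These three steps map directly onto SciPy's hierarchy module. The linkage method ("ward" below) is our choice for illustration; the paper does not specify one:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def hierarchical_initial_centroids(X, k, method="ward"):
    """Hierarchical-based seeding: cut an agglomerative tree into K
    clusters and return the cluster means as K-means initial centroids.
    The 'ward' linkage is an assumption made for this sketch."""
    Z = linkage(X, method=method)                    # build the agglomerative tree
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into K clusters
    # Centroid of each cluster = mean of its members (labels run from 1 to K).
    return np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
```

The result can again be passed as the init argument of a standard K-means implementation, e.g. KMeans(n_clusters=k, init=hierarchical_initial_centroids(X, k), n_init=1).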
Figure 4

Hierarchical-based K-means clustering algorithm.

2.2.5. Validation of Clustering Methods

To evaluate the results of clustering algorithms, cluster validation methods are used. These methods guard against random patterns in the data and also allow comparison of different clustering algorithms. A good validity measure should be invariant to changes in sample size, cluster size, and number of clusters [26]. In general, clustering evaluation indices are classified into three categories: internal, external, and relative. Internal validity indices measure the compactness, connectedness, and separation of each cluster, while external validity indices measure how well a clustering result matches the truth (if available) or another clustering result [26]. Relative validity methods are used to determine the optimal input parameters, for instance by varying the number of clusters in K-means, and also to compare clustering methods. The silhouette criterion (Si), the Dunn index, and the hybrid robustness-performance trade-off (RPT) index were applied in the present study for internal evaluation. External validity methods can be categorized into pair-counting, information-theoretic, and set-matching measures. Pair-counting measures (such as the Rand index (RI) and adjusted Rand index (ARI) used in our research) are based on counting the pairs of objects in a dataset on which two different partitions agree or disagree; for instance, if two objects placed in one cluster in the first partition also fall in the same cluster in the second partition, this is counted as an agreement [26]. Information-theoretic indices such as mutual information (MI) measure the information that two clusterings share; the variation of information (VI), a simple linear function of MI, is applied in the present study. Set-matching indices, such as the accuracy (AC), F-measure, and Hubert's Γ index (HI) utilized here, are based on pairing similar clusters in two partitions.
It should be noted that the optimal number of clusters in the present paper was determined by the majority rule using three methods: the average silhouette criterion, the gap statistic, and the elbow method. Data were standardized prior to any clustering analysis.
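Several of these indices are available off the shelf. The sketch below computes the silhouette (internal), the Rand and adjusted Rand indices (external, pair-counting), and the variation of information (information-theoretic, derived from MI) for a K-means run on the standardized iris data; Dunn, RPT, and the set-matching indices are not in scikit-learn and are omitted here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (adjusted_rand_score, mutual_info_score,
                             rand_score, silhouette_score)
from sklearn.preprocessing import StandardScaler

def variation_of_information(u, v):
    """VI(U, V) = H(U) + H(V) - 2 I(U, V), from integer label vectors."""
    def entropy(lbl):
        p = np.bincount(lbl) / len(lbl)
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    return entropy(u) + entropy(v) - 2 * mutual_info_score(u, v)

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # data standardized before clustering
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Si  =", silhouette_score(X, pred))          # internal
print("RI  =", rand_score(y, pred))                # external, pair-counting
print("ARI =", adjusted_rand_score(y, pred))       # external, chance-corrected
print("VI  =", variation_of_information(y, pred))  # information-theoretic
```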

3. Results

To compare the performance of the three hybrid methods and the ordinary K-means method, seven freely downloadable Internet datasets were used: "leukemia cancer," "prostate cancer," and "colon cancer" from the GEO site and "haberman," "iris," "wine," and "glass" from the UCI Center for Machine Learning and Intelligent Systems. Table 1 summarizes these datasets. To reduce the dimension of the gene expression datasets and find the important genes (attributes), the results of Ram et al. [29] were used: they selected a subset of three or four important genes with a feature selection method based on the random forest model. The clustering methods (ordinary or hybrid) were applied to the selected subsets for these gene expression datasets. It should be mentioned that these datasets already contain classes (labels). Ignoring these classes, we obtained the optimal number of clusters (among 2-15 clusters) for each dataset by the majority rule applied to the mean silhouette value, the elbow criterion, and the gap statistic. The data analysis was then organized in two phases:

3.1. Phase I

To investigate whether K-means is an appropriate clustering method for each dataset, four ordinary clustering methods, including K-means, DBScan, hierarchical, and EM algorithms, were first applied to the datasets. The mean silhouette value and the RPT criterion were then used to determine the best method for each dataset (Table 2). A mean silhouette value near one and a high RPT value indicate good clustering. Accordingly, the K-means clustering method was the best method for just two of the seven datasets discussed in the present study, the leukemia and colon cancer datasets. The hierarchical clustering method was best for the prostate and haberman datasets, the DBScan method was best for the iris and glass datasets, and the EM algorithm was best for the wine dataset.
Table 2

Comparison of four different ordinary clustering methods based on the silhouette and RPT indexes.

Dataset     Index        K-means   Hierarchical   DBScan   EM algorithm
Leukemia    Silhouette   0.4702    0.4663         0.2693   0.4419
            RPT          0.880     0.8612         0.5087   0.8160
Prostate    Silhouette   0.3265    0.4029         0.3339   0.2756
            RPT          0.6141    0.7586         0.6319   0.5295
Colon       Silhouette   0.5248    0.5189         0.3156   0.5176
            RPT          0.9650    0.9516         0.5747   0.9478
Haberman    Silhouette   0.2477    0.6632         0.6266   0.1384
            RPT          0.4787    1.21           1.15     0.2704
Iris        Silhouette   0.4589    0.4796         0.5029   0.3728
            RPT          0.8446    0.8614         0.9085   0.6812
Wine        Silhouette   0.2469    0.1575         0.1911   0.2728
            RPT          0.4788    0.3092         0.3742   0.5202
Glass       Silhouette   0.3411    0.4281         0.6389   0.2809
            RPT          0.6369    0.7921         1.17     0.5148

RPT: robustness performance trade-off.

3.2. Phase II

The hybrid K-means methods were then applied to each dataset, and the results are summarized in Table 3. The higher the value of an evaluation criterion, the better the clustering algorithm, except for the SSE and VI indices (for which lower values are desirable). Figure 5 shows that all hybrid methods converge faster than the K-means method in terms of the number of iterations (the K-means line dominates the others).
Table 3

Comparison among the hybrid and ordinary K-means clustering method based on eleven evaluation criteria.

Method    I    SSE      Si      RPT     Dunn      RI      ARI      AC      F       HI       VI
Leukemia dataset
K-means   5    116.2    0.4702  0.880   0.1431    0.8809  0.7617   0.9375  0.8848  0.76197  0.6477
K+H       2    116.2    0.4650  0.880   0.1431    0.8809  0.7617   0.9375  0.8848  0.76197  0.6477
K+MST     1    116.8    0.4675  0.8719  0.1679    0.9092  0.8183   0.9531  0.9115  0.8184   0.5357
K+GA      4    116.2    0.4702  0.8801  0.1431    0.8809  0.7617   0.9375  0.8848  0.76197  0.6477

Prostate dataset
K-means   6    60.3     0.2677  0.5149  0.0969    0.6298  0.2599   0.7667  0.6247  0.2602   1.51
K+H       1    58.1     0.3944  0.6141  0.1549    0.5954  0.1980   0.7337  0.6364  0.2069   1.21
K+MST     2    62.1     0.3935  0.7498  0.2239    0.4919  0.0019   0.5667  0.5915  0.0022   1.51
K+GA      4    58.7     0.2796  0.5385  0.1498    0.7126  0.4247   0.8333  0.7031  0.4247   1.29

Colon dataset
K-means   4    161.47   0.5248  0.9650  0.1431    0.8650  0.73     0.9279  0.8638  0.7300   0.7411
K+H       2    161.47   0.5248  0.9650  0.1431    0.8650  0.73     0.9279  0.8638  0.7300   0.7411
K+MST     3    161.47   0.5248  0.9650  0.1431    0.8650  0.73     0.9279  0.8638  0.7300   0.7411
K+GA      2    161.47   0.5248  0.9650  0.1431    0.8650  0.73     0.9279  0.8638  0.7300   0.7411

Haberman dataset
K-means   6    698.8    0.2477  0.4787  0.023     0.4991  -0.002   0.5196  0.5483  -0.0015  1.83
K+H       4    684.4    0.2733  0.5256  0.035     0.5038  0.0083   0.5523  0.5523  0.0085   1.82
K+MST     4    702.8    0.3888  0.7427  0.073     0.6189  0.1284   0.7451  0.7270  0.7451   0.1405
K+GA      5    682.1    0.2751  0.5305  0.039     0.4997  -0.001   0.5261  0.5488  -0.003   1.83

Iris dataset
K-means   7    140      0.4589  0.8446  0.02637   0.8322  0.6201   0.8333  0.7452  0.6201   1.079
K+H       3    141.1    0.4554  0.8359  0.07756   0.8431  0.6451   0.8533  0.7622  0.6452   1.072
K+MST     5    191.7    0.4787  0.8917  0.05309   0.7197  0.4290   0.5732  0.6505  0.4488   1.19
K+GA      3    140      0.4589  0.8446  0.02637   0.8322  0.6201   0.8333  0.7452  0.6201   1.079

Wine dataset
K-means   8    1589.1   0.2469  0.4788  0.1357    0.6915  0.3757   0.6067  0.6237  0.3927   1.42
K+H       2    1270.2   0.2905  0.5481  0.2323    0.9543  0.8975   0.9663  0.9319  0.8976   0.39
K+MST     4    1270.2   0.2849  0.5481  0.2323    0.9543  0.8975   0.9663  0.9319  0.8976   0.39
K+GA      4    1270.2   0.2849  0.5481  0.2323    0.9543  0.8975   0.9663  0.9319  0.8976   0.39

Glass dataset
K-means   13   687.4    0.3411  0.6369  0.05804   0.6891  0.1966   0.4346  0.4073  0.1966   2.8
K+H       2    679.9    0.3458  0.6433  0.04906   0.6926  0.2036   0.4395  0.4116  0.2036   2.73
K+MST     4    790.2    0.3021  0.5754  0.06644   0.6531  0.1908   0.3598  0.4327  0.1954   2.60
K+GA      10   678.6    0.3427  0.6390  0.04502   0.6879  0.1946   0.4766  0.4062  0.1946   2.84

I: number of iterations; SSE: sum of squared errors; ARI: adjusted Rand index (−1 < ARI < +1); RI: Rand index (RI > 0); VI: variation of information (VI > 0); AC: accuracy (0 < AC < 1); Si: silhouette (−1 < Si < +1); HI: Hubert's Γ index (−1 < HI < +1); RPT: robustness-performance trade-off (RPT > 0); K+H: hierarchical K-means clustering; K+MST: minimum spanning tree K-means clustering; K+GA: genetic K-means clustering.

Figure 5

Number of iterations to converge for the hybrid methods in comparison with K-means method.

Obviously, no single clustering method was superior based on all evaluation criteria. However, depending on the purpose of a study, internal or external validity indices may matter more. According to the internal validity indices, the MST-based clustering method was the best for all datasets except leukemia, wine, and glass; for the former, the GA-based method, and for the two latter, the hierarchical-based method was the best hybrid method (Table 3). However, the internal validity indices for the best hybrid method could not reach the values of the best ordinary clustering method determined in phase I (Tables 2 and 3), except for the two datasets (leukemia and colon cancer) for which K-means was the best ordinary method. According to the external validity indices, the MST-based method for leukemia and haberman, the GA-based method for prostate, and the hierarchical-based method for iris and glass were the best hybrid clustering methods. For the colon cancer and wine datasets, all three hybrid methods performed identically. Overall, the hybrid methods could not greatly improve the performance of the K-means clustering method in the present study. Meanwhile, although the results do not reveal any regular relationship between sample size, number of variables, or number of classes and the best hybrid method, the hierarchical-based method seems to work better for larger sample sizes with more variables (the wine and glass datasets).

4. Discussion

We have conducted a comparison study of three hybrid clustering methods that try to solve the random-centroid problem in K-means clustering [5-7]. Seven existing Internet datasets were used to compare the methods, with eleven indices from different cluster validation families as the comparison criteria. The hybrid methods, namely MST-based [5], GA-based [6], and hierarchical-based [7] K-means clustering, are three popular approaches for addressing the random-centroid problem in K-means. However, there are other methods that try to improve K-means performance, such as principal component analysis [8], different rules for updating the new centroids [9-12], and online machine learning algorithms [13]. Meanwhile, some previous studies report improvements that K-means brings to other clustering methods [11, 21–24]. To the best of our knowledge, the MST-, GA-, and hierarchical-based K-means methods utilized in the present study have not been compared in any simulation or experimental study before. The seven datasets used here differed in sample size, number of variables, and natural classes; hence, these three methods were compared from different aspects. The results indicated that the hybrid methods did not necessarily improve the ordinary K-means method and sometimes even performed worse on some indices (Table 3). The hybrid methods improved the ordinary K-means method only in the number of iterations needed to reach the final solution; in this regard, the hierarchical-based, MST-based, and GA-based clustering methods rank first to third in convergence rate (Figure 5). Overall, the hybrid methods could not greatly improve the performance of K-means on the internal validity indices; however, on the external validity indices, these methods outperformed the K-means clustering method (Table 3).
Finally, since some previous studies reported better performance for these three hybrid methods than for the ordinary K-means clustering algorithm [5-7], simulation studies are recommended to compare these hybrid methods with K-means clustering in terms of initial centroids.
References

1.  Clustering high throughput biological data with B-MST, a minimum spanning tree based heuristic.

Authors:  Harun Pirim; Burak Ekşioğlu; Andy D Perkins
Journal:  Comput Biol Med       Date:  2015-04-14       Impact factor: 4.589

2.  Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest.

Authors:  Malihe Ram; Ali Najafi; Mohammad Taghi Shakeri
Journal:  Iran J Pathol       Date:  2017-10-01
