
VIASCKDE Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation.

Ali Şenol

Abstract

The cluster evaluation process is of great importance in areas of machine learning and data mining. Evaluating the clustering quality of clusters shows how much any proposed approach or algorithm is competent. Nevertheless, evaluating the quality of any cluster is still an issue. Although many cluster validity indices have been proposed, there is a need for new approaches that can measure the clustering quality more accurately because most of the existing approaches measure the cluster quality correctly when the shape of the cluster is spherical. However, very few clusters in the real world are spherical. Therefore, a new Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (the VIASCKDE Index) to overcome the mentioned issue was proposed in the study. In the VIASCKDE Index, we used separation and compactness of each data to support arbitrary-shaped clusters and utilized the kernel density estimation (KDE) to give more weight to the denser areas in the clusters to support cluster compactness. To evaluate the performance of our approach, we compared it to the state-of-the-art cluster validity indices. Experimental results have demonstrated that the VIASCKDE Index outperforms the compared indices.
Copyright © 2022 Ali Şenol.


Year:  2022        PMID: 35720897      PMCID: PMC9200537          DOI: 10.1155/2022/4059302

Source DB:  PubMed          Journal:  Comput Intell Neurosci


1. Introduction

Clustering approaches are unsupervised learning techniques that separate data into groups called clusters according to the similarities and dissimilarities among the data [1, 2]. The DBSCAN [3], kmeans [4], BIRCH [5], Spectral Clustering [6], Agglomerative Clustering [7], HDBSCAN [8], Affinity Propagation [9], and OPTICS [10] are some examples of them, and they are used in many fields such as pattern recognition [11-13], machine learning [14-16], data mining [17, 18], web mining [1, 19], bioinformatics [20, 21], and streaming data mining [22, 23]. On the other hand, measuring the performance of any proposed clustering approach is also an important issue because each algorithm has its special point of view, and the results of each clustering technique vary. Therefore, to overcome this problem, cluster validation analysis or cluster validation indices have emerged. These approaches are generally used for two purposes, which are measuring the performance of clustering algorithms and contributing to clustering algorithms as a guide by finding the optimum number of clusters. Cluster validation indices are divided into two main categories as internal and external indices. In external indices, true class labels are compared with the labels that are assigned by the proposed algorithm to measure the performance. Therefore, to use these indices, there is a need for true class labels. The Purity [24], Rand Index [25], Adjusted Rand Index [26], Accuracy, Precision and Recall [27], F-Measure [28], and NMI [29] can be given as examples of these types of indices. On the other hand, in the internal indices, we do not need actual class labels to measure the quality of clusters. In these indices, the evaluation of clustering performance is based on how similar the data in the same cluster are to each other, known as compactness, and how dissimilar the data in different clusters are from each other, known as separation. 
The Silhouette Index (SI) [30], Dunn Index [31], Davies–Bouldin (DB) [32], Calinski-Harabasz (CH) [33], Xie-Beni (XB) [34], S_Dbw [35], and RMSSTD [36] can be mentioned as primary cluster validity indices. Besides, there are many newer cluster validity indices such as the CVNN [37], CVDD [38], DSI [39], SCV [40], and AWCD [41]. The main problem of the majority of state-of-the-art cluster validity indices is that they measure the cluster quality correctly only when the clusters are spherical. As an example, the Silhouette Index (SI) uses the means of the distances of each data in the cluster to evaluate cluster quality. Similarly, Davies–Bouldin (DB) uses cluster diameters and cluster centroids, and Calinski-Harabasz (CH) uses the squares of the intracluster and intercluster distances. All of these calculations are ideal if the cluster is spherical. However, only a minority of real-world clusters are spherical. If the shape is arbitrary, these indices cannot measure the cluster quality correctly because the center of gravity of a cluster lies in its middle only if the shape is spherical. Similar to our approach, there is another kernel density estimation-based cluster validation index, named the M [42]. In the M, the authors used a mode estimation function to assess cluster quality. This function allows the index to assess the cluster quality by adopting interpoint distance measures that can be assumed to have a probability density function. To evaluate clusterings with more than one cluster (K > 1), they applied the mode estimation procedure to the interpoint distances between the data members. In contrast, in this study, we proposed a novel Internal Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (the VIASCKDE Index). 
We aimed to calculate the cluster quality accurately by using the compactness and separation of each data to support arbitrary-shaped clusters and by using the kernel density estimation (KDE) to give more weight to the denser regions of the clusters, which supports their compactness. The advantages of our new approach can be listed as follows:
The VIASCKDE Index can evaluate arbitrary-shaped clusters correctly.
It weights denser regions to support the compactness of clusters.
It is suitable for all types of clustering techniques, especially for density-based algorithms.
It can be used for micro-cluster-based approaches.
It performs better than the compared state-of-the-art techniques.
The rest of this paper is organized as follows: Section 2 reviews the related studies. Section 3 explains the problem with existing works and the need for the proposed approach. Section 4 gives the details of the VIASCKDE Index, and Section 5 compares the experimental results with the state-of-the-art approaches on real and synthetic datasets. Section 6 discusses the results. Finally, Section 7 presents the conclusion of the study.

2. Background and Related Works

Internal cluster validation methods do not need the actual class labels. Validation is done by calculating the similarities within the clusters and the differences between the clusters produced by the model to reveal how consistent the produced clusters are [43]. As mentioned above, internal methods evaluate cluster quality in terms of two concepts [44]:
Compactness: it states how close the data in the same cluster are to each other. Closer data mean better clustering.
Separation: it evaluates how far the clusters are from each other. In a good clustering, the clusters are expected to be as far from each other as possible.
The illustration of these two concepts is presented in Figure 1, while the equation is demonstrated in Eq. (1). Here, α and β are the weights.
Figure 1

The example of the relationship between the compactness and separation concepts of two clusters in a two-dimensional data space.

There are many internal methods proposed in the literature. In this section, we focused on the validation indices that are relevant to our approach. To make the definitions shorter and more understandable, the general notation is as follows: Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$ be a dataset containing $n$ points in a $d$-dimensional space, with $x_i \in \mathbb{R}^d$. $X$ is partitioned into $k$ disjoint clusters $C_i$ ($i = 1, 2, 3, \ldots, k$), and $n_i$ data are in cluster $C_i$. The cluster center $\mu_i$, which is the gravity center of cluster $C_i$, is the mean of the data that belong to $C_i$ and is calculated by $\mu_i = \frac{1}{n_i}\sum_{x \in C_i} x$, while the mean of the whole dataset is calculated by $\frac{1}{n}\sum_{x \in X} x$. In the present study, the distance is the Euclidean distance; for any two data $x$ and $y$ of the dataset, the Euclidean distance between them is written $d(x, y)$. In light of this information, we can briefly list the main internal cluster validity indices as follows:
Silhouette Index (SI) [30]: as given in Figure 2, the compactness value of a data in any cluster is calculated by measuring the distance from that data to each data in the same cluster. Then, the compactness of the cluster, denoted $a(x)$, is calculated as the mean of the compactness values of all the data in the cluster. The average of the distances from a data to the elements of the nearest cluster to which it does not belong gives the separation value of that data. After that, the separation value of the cluster, denoted $b(x)$, is found by calculating the mean of the separation values of all the data of the cluster. From these, we can calculate the SI value, which is the cluster validity index of the model. The equations to calculate SI, $a(x)$, and $b(x)$ are given in equations (2)–(4), respectively. The SI value lies in [−1, +1]: −1 means the worst clustering and +1 means the best clustering.
Figure 2

The example of Silhouette Index.
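The Silhouette description above can be illustrated with a short sketch. It assumes the standard per-point form $s(x) = (b(x) - a(x)) / \max(a(x), b(x))$ averaged over all data, which may aggregate slightly differently from equations (2)–(4) of the paper; the function name `silhouette_index` is hypothetical, and only the Python standard library is used.

```python
import math

def silhouette_index(points, labels):
    """Mean per-point silhouette (assumed standard form, Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a(x): mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, so dividing by len(own) - 1 is correct)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(x): smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        top = max(a, b)
        scores.append((b - a) / top if top > 0 else 0.0)
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(silhouette_index(pts, [0, 0, 1, 1]))  # sensible grouping: close to +1
print(silhouette_index(pts, [0, 1, 0, 1]))  # mixed grouping: negative
```

Because both a(x) and b(x) are means over whole clusters, the score implicitly rewards compact, roughly spherical groups, which is the limitation the paper discusses.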

Dunn Index (DI) [31]: the DI calculates the success of the model based on the compactness of the clusters and the separation between them. To do this, the DI value of a cluster is calculated from the distance to the closest cluster and its own diameter. Let $d_{min}(C_i, C_j)$ be the closest distance between clusters $C_i$ and $C_j$, and let $diam(C_i)$ be the diameter of cluster $C_i$; these two values are calculated by $d_{min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$ and $diam(C_i) = \max_{x, y \in C_i} d(x, y)$. Knowing $d_{min}(C_i, C_j)$ and $diam(C_i)$, the DI of the model is calculated by equation (5). The larger the resulting value, the more successful the clustering is.
Calinski-Harabasz (CH) [33]: the CH calculates the compactness and separation values via the mean of the squares of the interclass and intraclass distances. The CH index value is calculated by (6). In the CH index, the goal is to make the result as large as possible.
Davies–Bouldin (DB) [32]: the compactness value is calculated over the mean of the variance of the data in each cluster. On the other hand, the separation value is calculated over the distance from the center of the cluster to the center of the closest one. Let $avg(C_i)$, which is calculated by (7), be the average of the distances of each data in cluster $C_i$ to the cluster center; the DB value is then calculated by (8).
S_Dbw Index [35]: the S_Dbw calculates the compactness value of the clusters over the standard deviations ($\sigma$) of the data in the cluster. On the other hand, it calculates the separation value from the distance between the centers of the clusters. The S_Dbw index is a type of index that considers the density of clusters. Let den be the density of the cluster, and the S_Dbw index value is calculated with the following equations:
Distance-based Separability Index (DSI) [39]: the DSI is another approach that measures the cluster quality by means of intercluster and intracluster distances. Let $C_i$ and $C_j$ be two clusters that have $N_i$ and $N_j$ data points, respectively. The intracluster distance set of cluster $C_i$ is the set given in equation (13). Moreover, the intercluster distance set is built from the distances of the data pairs of clusters $C_i$ and $C_j$. To compute the DSI, the Kolmogorov–Smirnov (KS) test is utilized. Let $S_i$ be the Kolmogorov–Smirnov statistic of cluster $C_i$, which is calculated as $S_i = KS(\{d_{C_i}\}, \{d_{C_i, C_j}\})$, and let $S_j$ be that of $C_j$; the DSI of these two clusters is the result of the following equation:
RMSSTD [35]: the root-mean-square standard deviation (RMSSTD) aims to calculate the clustering quality by measuring the homogeneity of clusters. It is commonly used for hierarchical clustering. Let the dataset consist of $k$ clusters, let $p$ be the number of independent variables, let $\bar{x}_{ij}$ be the mean of the data in variable $j$ and cluster $i$, and let $n_{ij}$ be the number of data in variable $j$ and cluster $i$. The RMSSTD is measured by equation (12). A lower RMSSTD means better clustering.
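As an illustration of the Dunn Index quantities described above, the following sketch computes the smallest intercluster distance and the largest cluster diameter and returns their ratio. Equation (5) is not reproduced in this text, so the ratio used here is the usual textbook form and should be read as an assumption; the function name `dunn_index` is hypothetical.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Smallest intercluster distance divided by the largest cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("at least two clusters are required")
    # d_min: closest pair of data that lie in two different clusters
    d_min = min(math.dist(x, y)
                for ci, cj in combinations(groups, 2)
                for x in ci for y in cj)
    # diam: largest pairwise distance inside any single cluster
    diam_max = max(
        max((math.dist(x, y) for x, y in combinations(c, 2)), default=0.0)
        for c in groups)
    return math.inf if diam_max == 0 else d_min / diam_max

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(dunn_index(pts, [0, 0, 1, 1]))  # 5.0: gap of 5 over a diameter of 1
```

The nested loops over all pairs make the quadratic cost listed for the DI in Table 1 visible, and a single outlier can inflate the diameter term.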

3. Statement of the Problem

Although many approaches have been proposed, analysis of the cluster quality is still an issue. Because the many clustering approaches in the literature differ from each other in many aspects, no cluster validation technique can evaluate the quality of all produced clusters precisely. Nevertheless, some indices are commonly used for this task, including the Silhouette Index, Dunn Index, Davies–Bouldin, Calinski-Harabasz, and S_Dbw. Each of them has a specific problem with cluster validation, as given in Table 1. For example, a significant part of the proposed cluster validity indices assumes that clusters are spherical. In fact, only a minority of real-world clusters are spherical; some examples are given in Figure 3. The SI is an example of this kind of index: it cannot achieve a good score if the cluster is not spherical. The DB and the CH, in turn, identify clusters that are compact and well separated. However, in the real world, very few clusters have that shape. Similarly, despite being better than the DB and the CH when the clusters are not well separated, the DI has a high computational cost when the number of clusters or the dimensionality is high. Besides, it is affected by noisy data because noise increases the cluster diameter. As for the S_Dbw, although it is proposed as a density-supported validity index and gets a good score with compact and well-separated clusters, it is affected by the distribution of the data. The DSI, being a density-based clustering validity index, is good at dealing with arbitrary-shaped clusters and can successfully evaluate the quality of such clusters. However, the DSI is affected when clusters are too close. Likewise, the RMSSTD encounters problems when the clusters are close to each other. Further examples of the cluster-shape problems that existing indices come across could be given.
Table 1

Comparison of clustering validity indices that were used for experimentation in the present study.

| Cluster validity index | Notation | Runtime complexity | Optimal value | Considering denser region? | Handling arbitrary-shaped clusters? | Advantages | Disadvantages |
|---|---|---|---|---|---|---|---|
| Silhouette Index [30] | SI | O(n^2) | Max. | | | The score is higher when the clusters are dense and well separated | Good at handling the spherical clusters, high computational complexity |
| Dunn Index [31] | DI | O(n^2) | Max. | | | Competent at cluster validity task | High computational cost with high-dimensional data and the number of clusters |
| Calinski-Harabasz Index [33] | CH | O(n) | Max. | | | Good at well separated and compact clusters, its computational complexity is very low | It is not competent enough at the cluster validation task |
| Davies–Bouldin Index [32] | DB | O(n) | Min. | | | Good at well separated and compact clusters, its computational complexity is very low | It is not competent enough at the cluster validation task |
| S_Dbw validity Index [35] | S_Dbw | O(n) | Min. | | | Its computational complexity is very low | Affected negatively by the distribution of data |
| Distance-based Separability Index [39] | DSI | O(n^3) | Min. | | | Useful to discover the shape of clusters | Affected negatively when clusters are too close and its computational complexity is high |
| Root-mean-square std dev [35] | RMSSTD | O(n) | Min. | | | Good for hierarchical clustering | Has issues when the clusters are close to each other |
| VIASCKDE Index (proposed) | VIASCKDE | O(n^2) | Max. | | | It can handle the arbitrary-shaped clusters, take into account the denser regions, can be used for density-based and micro-cluster-based approaches | Has issues when the clusters are close to each other |
Figure 3

Some examples of the arbitrary-shaped cluster.

Another problem with existing cluster validation indices is that they assume that all the data in any cluster have a homogeneous distribution. However, a cluster mostly contains regions of different densities, as seen in Figure 4 (darker areas mean denser regions) and, for a single cluster, in Figure 4(b). So, an approach that considers the density of data inside the clusters is still needed to support the compactness of the cluster. Although the S_Dbw and the DSI take the density of clusters into consideration, they do not examine the density variation inside the clusters. These kinds of indices are useful to discover the shapes of clusters, but they do not take into account that some regions inside a cluster may be denser than others. Giving more weight to denser regions may make the evaluation more accurate because it supports compactness. In the present study, we proposed a new cluster validity index that can handle arbitrary-shaped clusters and weight the denser regions by using the kernel density estimation, which is explained in Section 4.2.
Figure 4

An example of various densities in clusters: example of an Aggregation dataset. (a) Density distribution of the dataset. (b) Density distribution inside a cluster.

4. Proposed Cluster Validity Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation (The VIASCKDE Index)

4.1. Basic Idea

In the present study, a new cluster validation index, named the VIASCKDE Index (the Validity Index for Arbitrary-Shaped Clusters based on the Kernel Density Estimation), was proposed. The VIASCKDE Index is not affected by cluster shape, and thus, it can make a realistic evaluation of clustering performance regardless of the clusters' shape. Unlike the existing cluster validation indices, our index calculates the compactness and separation values of a cluster from the compactness and separation values of each data separately. In other words, it works on the distances between data, independent of parameters such as the cluster center because, in nonspherical clusters, the distance of a data to its closest data is more important than its distance to the cluster center. As can be seen in the example given in Figure 5, the compactness value of the data x is calculated from the closest data in the cluster to which x belongs. Similarly, the separation value of x is calculated from the distance to the closest data of a cluster to which x does not belong.
Figure 5

Relationship between the compactness and separation values of any data in the VIASCKDE Index.

As mentioned before, another problem with existing cluster validity indices is that they assume the data inside a cluster are homogeneously distributed, even if the shape of the cluster is arbitrary. Therefore, they give every data of the cluster the same weight, whereas, as presented in Figure 4, the distribution of the data inside the same cluster may vary. We therefore need a new method that considers this situation. To overcome this problem, we proposed a weighting method based on the kernel density estimation (KDE), which is detailed in the next section.

4.2. Kernel Density Estimation-Based Weighting

In the literature, there are two types of distribution estimation methods: parametric and nonparametric. Parametric methods assume a distribution of a fixed form. For example, the Gaussian distribution assumes that the data are gathered around the center and that the majority of the data lie within a radius of one standard deviation of it. It means that the density curve has only one peak. It is important to keep in mind that the univariate normal distribution, with mean $\mu$ and variance $\sigma^2$, has the probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

where $-\infty < x < \infty$. Let $X = [X_1, \ldots, X_n]$ be an $n$-dimensional vector that has a multivariate Gaussian (or normal) distribution with the $n$-dimensional mean vector $\mu \in \mathbb{R}^n$, and let $\Sigma$ be the $n \times n$ covariance matrix. The multivariate Gaussian density is calculated as follows:

$$f(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right).$$

On the other hand, nonparametric distribution estimation methods allow more than one peak on the curve. The kernel density estimation (KDE) is a nonparametric density estimator. It can also be used to judge, from the existing data, where incoming data are likely to fall. For this ability, it is commonly used in many areas such as data analysis procedures in healthcare services, artificial intelligence applications, the stock market, and many other areas [2]. In Figure 6, the bars represent the histogram and the orange line represents the KDE calculated over the same data. The KDE figures out the distribution of the data with one of various kernel functions, which are given in Figure 7. Each one has its own characteristic and equation. In mathematical formulation, the KDE is the function

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),$$

where $K(\cdot)$ is one of the functions given in Figure 7. The most commonly used one is the Gaussian function. These functions are known as smoothing functions, and the bandwidth $h > 0$ controls the amount of smoothing. The KDE smooths each data $X_i$, one after the other, until reaching the final density estimation.
Figure 6

An example of the kernel density estimation and its histogram.

Figure 7

Types of kernel density estimation curves.
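The univariate estimator described above can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian kernel and a fixed bandwidth `h`; the function name `gaussian_kde_1d` and the sample values are hypothetical and are not taken from Figure 6.

```python
import math

def gaussian_kde_1d(sample, h):
    """Return f_hat(x) = 1/(n*h) * sum_i K((x - X_i)/h) with a standard normal K."""
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        # Each sample point contributes one Gaussian bump centred on itself
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return f_hat

f = gaussian_kde_1d([1.0, 1.2, 0.9, 5.0], h=0.5)
print(f(1.0), f(3.0), f(5.0))  # highest near the three close points, lowest in the gap
```

A smaller `h` produces a spikier estimate with one bump per data point, while a larger `h` merges the bumps into a smoother curve.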

In addition to estimating the density function of univariate data, as in the example given in Figure 6, we can apply the KDE to multivariate datasets. In this case, we have to use a kernel function that can process a multidimensional dataset. To achieve this, the kernel function should be constructed by a product kernel or a radial basis approach. Let $X = (X_1, X_2, X_3, \ldots, X_n)'$ denote a sample of size $n$ from a multivariate random variable with density $f(x)$ defined on $\mathbb{R}^d$, and let $\{x_1, \ldots, x_n\}$ be an independent random sample drawn from $f(x)$. In the following, we only consider the two-dimensional case without loss of generality. Thus, $X_i$, $i = 1, \ldots, n$, is given by $(X_{i1}, X_{i2})$, where $X_{i1}$ and $X_{i2}$ denote the x and y coordinates, respectively. The multivariate kernel density estimator at point $x$ is given by

$$\hat{f}_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - X_i), \qquad K_H(u) = |H|^{-1/2} K\left(H^{-1/2}u\right),$$

where $K(\cdot)$ is a multivariate kernel function and $H$ denotes a symmetric positive definite bandwidth matrix. Although the KDE is a nonparametric density estimator designed for inhomogeneous distributions, we can also use it as a weighting function to support the compactness of clusters. As the KDE of any data is the summation of the contributions of the data around it, the KDE of a data close to the edges of the data distribution is expected to be lower, while the KDE of a data near the center is expected to be higher. Therefore, the KDE can be used as a weighting function. In our approach, doing that supports the compactness of the cluster regardless of its shape. Namely, we used the KDE to weight each data so that the data in the denser regions receive more importance. Therefore, we calculated the weight $W$ of each data according to the obtained KDE value. For example, let us assume we want to find the $W$ values for the data $x_1 = 30$ and $x_2 = 40$ in the example dataset given in Figure 6. $W$ would be 0.007 for $x_1$, while it would be 0.05 for $x_2$, which is very high compared to the other one. That makes our approach superior to existing clustering validity indices, which ignore the distribution of the data inside the same cluster. Other density-based approaches would weight $x_1$ and $x_2$ equally in this example, which would be incorrect.

4.3. Definitions and Equations

In light of these explanations, let us explain the details of the VIASCKDE Index.

Definition 1 .

(CoSeD, Compactness and Separation Value of a Data): the CoSeD can be described as the compactness and separation value of any data. To calculate this value, the weight $W(x_i)$ of each data, which is explained in Section 4.2, is calculated first. Let $a(x_i)$ (compactness) be the distance from $x_i$ to the closest data of the cluster $C_j$ to which $x_i$ belongs, and let $b(x_i)$ (separation) be the distance from $x_i$ to the closest data of any cluster to which $x_i$ does not belong; the compactness and separation value of the data $x_i$, $CoSeD(x_i)$, is then calculated by the following equation:

Definition 2 .

(CoSeC, Compactness and Separation Value of a Cluster): the CoSeC value is the average of the CoSeD values of the data owned by the cluster. The CoSeC value of the cluster $C_j$ is calculated by equation (18), where $C_j$ is the cluster to which the data $x_i$ belong, and $n_j$ is the number of data that cluster $C_j$ possesses.

Definition 3 .

(the VIASCKDE, the Value of Overall Clustering): let $k$ be the number of clusters, let $n_j$ be the number of data that cluster $C_j$ possesses, and let $CoSeC_j$ be the value of cluster $C_j$, which is calculated by equation (18); the VIASCKDE Index value is then calculated by equation (19). The VIASCKDE value lies in [−1, +1], where +1 refers to the best possible value and −1 refers to the worst possible value.
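The displayed forms of equations (17)–(19) are not reproduced in this text. One reading that is consistent with Definitions 1–3 is sketched below. It assumes that CoSeD is a silhouette-style ratio scaled by the KDE weight $W$ and that the overall index is the size-weighted mean of the CoSeC values; both forms are assumptions and should be checked against the published equations.

```latex
\begin{align}
\mathrm{CoSeD}(x_i) &= W(x_i)\,\frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}} \\
\mathrm{CoSeC}(C_j) &= \frac{1}{n_j}\sum_{x_i \in C_j} \mathrm{CoSeD}(x_i) \\
\mathrm{VIASCKDE} &= \frac{\sum_{j=1}^{k} n_j\,\mathrm{CoSeC}(C_j)}{\sum_{j=1}^{k} n_j}
\end{align}
```

With $W(x_i) \in [0, 1]$ after min-max normalization, every CoSeD value lies in [−1, +1], so both averages stay in the range stated in Definition 3.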

4.4. The Algorithm

Let Gaussian_KDE be a function that calculates the KDE, and let MinMaxNormalization be a function that normalizes the data to the range [0, 1]. The CoSeD and CoSeC values were explained in Section 4.3. In light of this information and the equations given in the previous section, the pseudocode of the VIASCKDE Index is given in Algorithm 1.
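Algorithm 1 itself is not reproduced in this text, so the following is only a sketch of how the described steps could fit together. It is not the author's reference implementation (that is at the repository linked in Section 5.1). Several choices are assumptions: the KDE is estimated inside each cluster, the weights are min-max normalized per cluster, CoSeD is a weighted silhouette-style ratio, and the final value is the size-weighted mean of the CoSeC values. The names `kde_weights` and `viasckde_sketch` are hypothetical.

```python
import math

def kde_weights(cluster, h):
    """Gaussian KDE value of each point inside its cluster, min-max normalised to [0, 1]."""
    # The KDE normalising constant is omitted because min-max scaling cancels it
    dens = [sum(math.exp(-0.5 * (math.dist(p, q) / h) ** 2) for q in cluster)
            for p in cluster]
    lo, hi = min(dens), max(dens)
    if hi == lo:                       # uniform density: weight every point equally
        return [1.0] * len(cluster)
    return [(v - lo) / (hi - lo) for v in dens]

def viasckde_sketch(points, labels, h=0.05):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    weighted_sum = 0.0
    for lab, own in clusters.items():
        others = [q for other_lab, c in clusters.items() if other_lab != lab for q in c]
        cosed = []
        for i, (p, w) in enumerate(zip(own, kde_weights(own, h))):
            rest = own[:i] + own[i + 1:]
            if not rest:               # singleton cluster: no compactness defined
                cosed.append(0.0)
                continue
            a = min(math.dist(p, q) for q in rest)    # nearest data in own cluster
            b = min(math.dist(p, q) for q in others)  # nearest data in any other cluster
            top = max(a, b)
            cosed.append(w * (b - a) / top if top > 0 else 0.0)
        cosec = sum(cosed) / len(own)                 # CoSeC of this cluster
        weighted_sum += len(own) * cosec
    return weighted_sum / len(points)

pts = [(0, 0), (0.1, 0), (0.2, 0), (5, 0), (5.1, 0), (5.2, 0)]
print(viasckde_sketch(pts, [0, 0, 0, 1, 1, 1], h=0.1))  # positive for the sensible grouping
print(viasckde_sketch(pts, [0, 1, 0, 1, 0, 1], h=0.1))  # negative for the mixed grouping
```

Because a(x) and b(x) use nearest neighbours rather than centroids, a thin curved cluster can still score well, and the weight lets data in dense regions dominate the cluster average.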

4.5. Computational Complexity

Let $k$ be the number of clusters in the dataset, let $n$ be the number of data that the clusters possess, and let $d$ be the number of features of each data; the time complexity of the VIASCKDE Index is then $O(kn^2d)$, since it calculates the distance of each data to all others. This means that the complexity of the proposed approach is $O(n^2)$ in the number of data, as listed in Table 1. This is acceptable when compared with the complexity of the other indices given in Table 1.

5. Experimental Study

5.1. Development Environment

To demonstrate the effectiveness of the VIASCKDE Index (https://github.com/senolali/VIASCKDE) in the experimental studies, the data were processed using the Python language in the Anaconda Spyder environment. Various modules of the Scikit-learn library, such as DBSCAN, Spectral Clustering, HDBSCAN, and metrics, were used. The datasets were imported with the Pandas library, and mathematical operations were performed with the NumPy library. Visualization was carried out with the matplotlib library. All experiments and comparisons were performed on a computer with 16 GB RAM, an Intel i7 processor, and the Windows 11 operating system.

5.2. Used Datasets

To measure the performance of the proposed approach, we performed an experimental study in both synthetic and real datasets. Since the main purpose of our approach is to measure the performance of nonspherical clusters, artificial datasets containing clusters in different shapes were used. In Figure 3, some of the used datasets that contain clusters in different shapes are demonstrated. In addition to these synthetic datasets, real datasets, which are frequently used in the clustering field, were also used for testing. Details of the datasets used in the comparison process are provided in Table 2. Additionally, as given in Figure 8, some imbalanced datasets were used to analyze the performance of our cluster validation index on the imbalanced data distribution.
Table 2

Used datasets.

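| Dataset | Type | # of features | # of data | # of classes | Reference |
|---|---|---|---|---|---|
| Half-kernel | Synthetic | 2 | 1000 | 2 | [45] |
| Two spirals | Synthetic | 2 | 312 | 3 | [45] |
| Outlier | Synthetic | 2 | 700 | 4 | [45] |
| Corners | Synthetic | 2 | 2000 | 4 | [45] |
| Cluster in cluster | Synthetic | 2 | 1012 | 2 | [45] |
| Crescent full moon | Synthetic | 2 | 1000 | 2 | [45] |
| Moon | Synthetic | 2 | 514 | 4 | [45] |
| Face | Synthetic | 2 | 322 | 4 | [46] |
| Wave | Synthetic | 2 | 287 | 2 | [46] |
| Aggregation | Synthetic | 2 | 788 | 7 | [47] |
| Zelnik1 | Synthetic | 2 | 622 | 4 | [48] |
| Zelnik5 | Synthetic | 2 | 512 | 4 | [48] |
| Xclara | Synthetic | 2 | 3000 | 3 | [48] |
| Banana | Synthetic | 2 | 4811 | 2 | [48] |
| Ds2c2sc13 | Synthetic | 2 | 588 | 13 | [48] |
| 2sp2glob | Synthetic | 2 | 999 | 3 | [48] |
| Cure-t1-2000n | Synthetic | 2 | 2000 | 5 | [48] |
| Thyroid | Real | 4 | 215 | 2 | [49] |
| Fisher iris | Real | 4 | 150 | 3 | [49] |
| Breast cancer | Real | 8 | 699 | 2 | [49] |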
Figure 8

The distributions of some of the used datasets.

5.3. Experimental Procedure

For the experimental study, we used the procedure given below. First, to ensure that all features are in the same range and to make it easy to determine parameters, the data were normalized using the min-max normalization given in (20), $x' = (x - x_{min}) / (x_{max} - x_{min})$. In addition, the ARI (Adjusted Rand Index) was used as the ground truth measure to evaluate the performance of the cluster validation indices by comparing the cluster labels produced by the clustering algorithm with the actual cluster labels. The reason we chose the ARI is that the generated cluster labels do not need to be the same as the actual cluster labels. For example, let us assume the clustering algorithm produced the cluster labels {1,1,1,2,2,2} and the actual labels are {2,2,2,4,4,4}. The accuracy value for this situation would be 0%, while the ARI would be 100%, which is the correct result. The procedure established in the testing process is as follows:
Step #1: Select one of the algorithms (DBSCAN, HDBSCAN, and Spectral Clustering).
Step #2: Test the algorithm with randomly selected parameters on one of the selected datasets.
Step #3: Evaluate the quality of the clusters produced by the selected algorithm with the clustering validation indices (SI, DI, CH, DB, S_Dbw, DSI, RMSSTD, and VIASCKDE).
Step #4: Calculate the VIASCKDE Index for the produced clusters and check whether this is the best result so far. If it is, accept this value as the best one for the VIASCKDE Index. Then, do the same operation for the other indices.
Step #5: To test each index sufficiently, go to Step #2 and repeat the cycle 100 times. If the cycle is completed, go to Step #6.
Step #6: Calculate the ARI value that corresponds to the most successful value obtained for each of the clustering validity indices, including our proposed approach.
Step #7: Compare the ARI values calculated for all cluster validity indices. Consider the one with the highest ARI value as the most competent one for this dataset.
Step #8: Go to Step 2 and do the same operations for the new dataset. If all datasets are performed, go to Step 9.
Step #9: If all algorithms are performed, finish the procedure; otherwise, go to Step 1.
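The label example used to motivate the ARI can be checked with a small pure-Python sketch. The functions `adjusted_rand_index` and `accuracy` are illustrative helpers written for this example; in practice a library routine such as `sklearn.metrics.adjusted_rand_score` would normally be used instead.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """ARI from the pair counts of the contingency table."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:          # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def accuracy(true_labels, pred_labels):
    """Fraction of positions where the two label values are literally equal."""
    return sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)

actual = [2, 2, 2, 4, 4, 4]
produced = [1, 1, 1, 2, 2, 2]
print(adjusted_rand_index(actual, produced))  # 1.0: the two partitions are identical
print(accuracy(actual, produced))             # 0.0: no label value matches
```

The ARI compares how pairs of data are grouped rather than the label values themselves, which is why renaming the clusters does not change it.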

5.4. Experimental Study

5.4.1. The Selection of Density Distribution Estimation Method

We performed some experimental studies on the datasets to decide which density estimation method should be selected, parametric or nonparametric. For the parametric method, we selected the Gaussian distribution, and for the nonparametric method, the KDE. We carried out the experiments with the procedure given in Section 5.3, using the DBSCAN with randomly selected parameters. The KDE-based VIASCKDE Index used kernel = “Gaussian” and h = 0.05, while the parametric VIASCKDE Index used the Gaussian distribution. According to the obtained results, the Gaussian-based method was the best in 15 datasets and the KDE-based method was the best in 17 datasets, as demonstrated in Table 3. Therefore, we selected the KDE-based method as the weighting function for our approach.
Table 3

ARI results obtained with the parametric and nonparametric methods.

| Dataset | Gaussian weight (ARI) | KDE weight (ARI) |
|---|---|---|
| Half-kernel | 1.0000 | 1.0000 |
| Two spirals | 1.0000 | 1.0000 |
| Outlier | 1.0000 | 1.0000 |
| Corners | 1.0000 | 1.0000 |
| Cluster in cluster | 1.0000 | 1.0000 |
| Crescent full moon | 1.0000 | 1.0000 |
| Moon | 0.7424 | 0.7424 |
| Face | 0.9949 | 1.0000 |
| Wave | 1.0000 | 1.0000 |
| Fisher iris | 0.7493 | 0.7493 |
| Breast cancer | 0.7540 | 0.7540 |
| Aggregation | 0.7338 | 0.9118 |
| Thyroid | −0.0619 | 0.6783 |
| Zelnik1 | 1.0000 | 0.9488 |
| Zelnik5 | 1.0000 | 1.0000 |
| Xclara | 0.0001 | 0.0001 |
| Banana | 1.0000 | 1.0000 |
| Ds2c2sc13 | 0.3187 | 0.5904 |
| 2sp2glob | 1.0000 | 0.9880 |
| Cure-t1-2000n | 0.8850 | 0.8850 |

5.4.2. The Kernel Selection for KDE

As mentioned in Section 4.2, there are various kernels in the literature, such as the Gaussian, cosine, linear, tophat, exponential, and Epanechnikov kernels, and the choice affects the smoothness of the estimated distribution. We followed the procedure given in Section 5.3, with the parameters of the DBSCAN algorithm selected randomly, and repeated the experiments with each kernel. As can be seen in Table 4, with a bandwidth of 0.05 the Gaussian kernel reached the highest ARI value, alone or tied, on all of the selected datasets.
Table 4

Results obtained with the different kernels.

Obtained VIASCKDE values with each kernel:

| Kernel | Face | Aggregation | Outliers | Thyroid | Crescent full moon | Cure-t1-2000n |
|---|---|---|---|---|---|---|
| Gaussian | 0.7063 | 0.6368 | 0.6797 | 0.4947 | 0.6623 | 0.6555 |
| Cosine | 0.5967 | 0.6564 | 0.6499 | 0.1699 | 0.6340 | 0.6343 |
| Exponential | 0.7005 | 0.6371 | 0.6714 | 0.5541 | 0.6426 | 0.6653 |
| Linear | 0.5736 | 0.6427 | 0.6306 | 0.1594 | 0.6169 | 0.6371 |
| Epanechnikov | 0.6021 | 0.6562 | 0.6581 | 0.1758 | 0.6388 | 0.6295 |
| Tophat | 0.6457 | 0.6165 | 0.6433 | 0.2306 | 0.6664 | 0.6299 |

Obtained ARI values with each kernel:

| Kernel | Face | Aggregation | Outliers | Thyroid | Crescent full moon | Cure-t1-2000n |
|---|---|---|---|---|---|---|
| Gaussian | 0.6085 | 0.8246 | 1.0000 | 0.5083 | 1.0000 | 0.8850 |
| Cosine | 0.6085 | 0.8089 | 1.0000 | 0.5083 | 1.0000 | 0.8850 |
| Exponential | 0.0386 | 0.8089 | 1.0000 | 0.5034 | 1.0000 | 0.8850 |
| Linear | 0.6085 | 0.8089 | 1.0000 | 0.5083 | 1.0000 | 0.8850 |
| Epanechnikov | 0.6085 | 0.8089 | 1.0000 | 0.5083 | 1.0000 | 0.8850 |
| Tophat | 0.6085 | 0.0333 | 1.0000 | 0.5083 | 1.0000 | 0.8850 |
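To make the difference between the kernels in Table 4 concrete, the sketch below evaluates their profiles as functions of the scaled distance u = x/h. The definitions are the commonly used unnormalized forms and are an assumption here; the exact normalization used in the study is not stated in this section, and the dictionary name `KERNELS` is illustrative.

```python
import math

# Unnormalized kernel profiles as functions of the scaled distance u = x / h.
# These are common textbook forms, assumed here for illustration only.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u),
    "tophat": lambda u: 1.0 if abs(u) < 1 else 0.0,
    "epanechnikov": lambda u: max(0.0, 1.0 - u * u),
    "exponential": lambda u: math.exp(-abs(u)),
    "linear": lambda u: max(0.0, 1.0 - abs(u)),
    "cosine": lambda u: math.cos(math.pi * u / 2) if abs(u) < 1 else 0.0,
}

# Print each profile at a few scaled distances to compare how fast it decays.
for name, kernel in KERNELS.items():
    values = [round(kernel(u), 3) for u in (0.0, 0.5, 1.0, 2.0)]
    print(f"{name:13s} {values}")
```

The Gaussian and exponential kernels never reach zero, so every point contributes to the density everywhere, whereas the other four ignore points farther away than one bandwidth.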

5.4.3. Bandwidth Selection for the KDE

One of the most important parameters of the KDE is the bandwidth (h), which has a direct effect on the results. When h is too small, the density curve has many wiggly structures. When h is too large, the bumps on the curve are smoothed out, as shown in Figure 9. To find the best bandwidth for our approach, we ran experiments with the procedure given in Section 5.3, testing different bandwidth values on some of the datasets provided in Table 2. As can be seen in Table 5, the best bandwidth with the Gaussian kernel was found to be 0.05.
Figure 9

Types of the kernel density estimation curves.
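The effect of the bandwidth on the density curve can be reproduced with a small one-dimensional Gaussian KDE. This is a minimal sketch in plain Python; the sample values, the evaluation grid, and the helper names `gaussian_kde_1d` and `count_peaks` are hypothetical and are not taken from the experiments.

```python
import math


def gaussian_kde_1d(samples, h):
    """Return the Gaussian kernel density estimate of 1-D samples for bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density


def count_peaks(values):
    """Count strict local maxima of a curve sampled on a grid."""
    return sum(
        1 for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Hypothetical min-max normalized samples forming two loose groups.
samples = [0.10, 0.14, 0.18, 0.25, 0.70, 0.74, 0.80, 0.86]
grid = [i / 200 for i in range(201)]

for h in (0.01, 0.05, 0.5):
    density = gaussian_kde_1d(samples, h)
    curve = [density(x) for x in grid]
    print(f"h = {h}: {count_peaks(curve)} local maxima")
```

A very small bandwidth produces a separate bump around almost every sample, while a very large one merges both groups into a single smooth bump.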

Table 5

Obtained results with the different bandwidth values.

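Obtained VIASCKDE values with each bandwidth:

| Bandwidth | Face | Aggregation | Outliers | Thyroid | Crescent full moon | Cure-t1-2000n |
|---|---|---|---|---|---|---|
| 0.01 | 0.3377 | 0.3444 | 0.4650 | 0.0556 | 0.4780 | 0.5264 |
| 0.03 | 0.6627 | 0.6565 | 0.6508 | 0.3493 | 0.6608 | 0.6421 |
| 0.05 | 0.7063 | 0.6388 | 0.6797 | 0.4947 | 0.6623 | 0.6555 |
| 0.1 | 0.7365 | 0.6225 | 0.6851 | 0.6306 | 0.6486 | 0.6565 |
| 0.3 | 0.7857 | 0.5947 | 0.6773 | 0.7402 | 0.6143 | 0.6189 |
| 0.5 | 0.7586 | 0.5689 | 0.5481 | 0.7591 | 0.5945 | 0.6039 |
| 1.0 | 0.7412 | 0.5636 | 0.5257 | 0.7618 | 0.5927 | 0.6018 |
| 1.5 | 0.7362 | 0.5629 | 0.5236 | 0.7618 | 0.5923 | 0.6016 |
| 2 | 0.7339 | 0.5626 | 0.5229 | 0.7618 | 0.5921 | 0.6015 |
| 2.5 | 0.7328 | 0.5625 | 0.5226 | 0.7618 | 0.5920 | 0.6015 |
| 3 | 0.7322 | 0.5624 | 0.5225 | 0.7618 | 0.5920 | 0.6015 |
| 3.5 | 0.7317 | 0.5624 | 0.5223 | 0.7618 | 0.5919 | 0.6015 |
| 4 | 0.7314 | 0.5623 | 0.5222 | 0.7617 | 0.5919 | 0.6015 |
| 4.5 | 0.3377 | 0.3444 | 0.4650 | 0.0556 | 0.4780 | 0.5264 |
| 5 | 0.6627 | 0.6565 | 0.6508 | 0.3493 | 0.6608 | 0.6421 |

Obtained ARI values with each bandwidth:

| Bandwidth | Face | Aggregation | Outliers | Thyroid | Crescent full moon | Cure-t1-2000n |
|---|---|---|---|---|---|---|
| 0.01 | −0.0386 | 0.8089 | 1.0000 | 0.5277 | 1.0000 | 0.8850 |
| 0.03 | 0.6085 | 0.8089 | 1.0000 | 0.5034 | 1.0000 | 0.8850 |
| 0.05 | 0.6085 | 0.9898 | 1.0000 | 0.5034 | 1.0000 | 0.8850 |
| 0.1 | −0.0386 | 0.8089 | 1.0000 | 0.5034 | 1.0000 | 0.8850 |
| 0.3 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 0.5 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 1.0 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 1.5 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 2 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 2.5 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 3 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 3.5 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 4 | −0.0386 | 0.7338 | 1.0000 | 0.2099 | 1.0000 | 0.8850 |
| 4.5 | −0.0386 | 0.8089 | 1.0000 | 0.5277 | 1.0000 | 0.8850 |
| 5 | 0.6085 | 0.8089 | 1.0000 | 0.5034 | 1.0000 | 0.8850 |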

5.4.4. The Tests on Both Synthetic and Real Datasets

In this section, experiments were carried out on both synthetic and real datasets. To detect nonspherical clusters in the tests, the DBSCAN, Spectral Clustering, and HDBSCAN algorithms were used. The DBSCAN algorithm takes two parameters (MinPts: the minimum number of points required to form a dense region, and ε: the neighborhood radius), Spectral Clustering takes one parameter (n_clusters: the number of clusters) when affinity = “nearest_neighbors,” and HDBSCAN takes two parameters (min_cluster_size: the minimum number of points in a cluster, and min_samples). To test each algorithm with different parameters, we applied random search within the procedure given in Section 5.3, so that each cluster validity index acted as the guide toward better clustering results. As the example in Figure 10 shows, the indices suggested different results, which means that their cluster validation performance also differs. According to the obtained results, our index was the best one. The performance of each index on all datasets is presented for each clustering algorithm in Tables 6–14.
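The random search guided by a validity index (Steps #2 to #6 of the procedure in Section 5.3) can be sketched as follows. This is an illustration under strong simplifications, not the setup used in the study: `gap_clustering` is a toy stand-in for a clustering algorithm with a single parameter `eps`, `toy_index` is a toy internal index where higher is better, and the data are hypothetical.

```python
import random
from collections import Counter
from math import comb


def ari(true, pred):
    """Pair-counting Adjusted Rand Index."""
    cells = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    rows = sum(comb(c, 2) for c in Counter(true).values())
    cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = rows * cols / comb(len(true), 2)
    maximum = (rows + cols) / 2
    return 1.0 if maximum == expected else (cells - expected) / (maximum - expected)


def gap_clustering(data, eps):
    """Toy clustering: split sorted 1-D data wherever the gap exceeds eps."""
    labels, current = [0], 0
    for prev, nxt in zip(data, data[1:]):
        if nxt - prev > eps:
            current += 1
        labels.append(current)
    return labels


def toy_index(data, labels):
    """Toy internal index: smallest between-cluster gap minus largest within-cluster gap."""
    between, within = [], []
    for (a, la), (b, lb) in zip(zip(data, labels), zip(data[1:], labels[1:])):
        (within if la == lb else between).append(b - a)
    return min(between, default=0.0) - max(within, default=0.0)


def random_search(data, true_labels, indices, n_trials=100, seed=0):
    """Keep, per index, the trial with the best index value and the ARI of that trial."""
    rng = random.Random(seed)
    best = {}  # index name -> (best index value, eps, ARI of that clustering)
    for _ in range(n_trials):
        eps = rng.uniform(0.01, 2.0)          # Step #2: random parameters
        labels = gap_clustering(data, eps)
        for name, index_fn in indices.items():
            value = index_fn(data, labels)    # Step #3: evaluate with each index
            if name not in best or value > best[name][0]:
                best[name] = (value, eps, ari(true_labels, labels))  # Steps #4 and #6
    return best


data = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]   # hypothetical sorted 1-D data
true_labels = [0, 0, 0, 1, 1, 1]
result = random_search(data, true_labels, {"toy_index": toy_index})
value, eps, ari_value = result["toy_index"]
print(f"best index value {value:.3f} at eps = {eps:.3f}, ARI = {ari_value:.3f}")
```

The ARI is never used to pick the parameters; it is only recorded for the clustering that each index prefers, which is what allows the indices to be compared afterwards. For an index where lower values are better, the comparison in `random_search` would have to be reversed.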
Figure 10

The clustering results suggested by each validity index when the DBSCAN algorithm was tested in the Aggregation dataset.

Table 6

The best parameters for datasets that were detected by the cluster validity indices with the DBSCAN algorithm.

Dataset | DBSCAN parameter | Best parameter values detected by the indices for the DBSCAN algorithm, in column order: SI, DI, DB, CH, S_Dbw, DSI, RMSSTD, VIASCKDE
Half-kernel ε 0.080.080.050.080.050.050.080.08
MinPts77117151177
Two spirals ε 0.10.10.050.10.050.10.050.1
MinPts1111151115111411
Outlier ε 0.070.070.070.070.050.070.050.07
MinPts151515158151415
Corners ε 0.10.10.10.10.10.10.10.1
MinPts1515151515151515
Cluster in cluster ε 0.060.060.060.060.060.060.060.06
MinPts1212121212121412
Crescent full moon ε 0.070.070.070.070.050.060.050.07
MinPts1414141415121514
Moon ε 0.060.080.060.060.050.050.060.06
MinPts71197991515
Face ε 0.060.10.10.060.060.050.060.1
MinPts158561512118
Wave ε 0.090.090.060.090.050.060.050.06
MinPts12512129121512
Fisher iris ε 0.140.190.140.140.080.140.060.19
MinPts156151551576
Breast cancer ε 0.390.330.390.390.060.060.050.4
MinPts858855145
Aggregation ε 0.060.090.060.060.060.060.050.06
MinPts137131314121413
Thyroid ε 0.10.10.060.090.070.050.050.1
MinPts551256895
Zelnik1 ε 0.080.080.050.10.070.070.080.07
MinPts61514755155
Zelnik5 ε 0.060.10.050.10.060.050.050.1
MinPts1413121315121413
Xclara ε 0.050.080.090.050.050.050.080.05
MinPts1312151313131213
Banana ε 0.050.050.050.050.050.050.050.05
MinPts99999999
Ds2c2sc13 ε 0.090.090.060.060.050.060.090.05
MinPts101014141314108
2sp2glob ε 0.10.10.050.070.080.10.060.07
MinPts99121469514
Cure-t1-2000n ε 0.10.10.10.10.10.10.10.1
MinPts1010101010101010
Table 7

Obtained values for each index based on the parameters given in Table 6.

Dataset | Obtained values for each index, in column order: SI, DI, DB, CH, S_Dbw, DSI, RMSSTD, VIASCKDE
Half-kernel0.20100.09491.8818127.89050.54190.50680.24950.7125
Two spirals0.05880.13173.3241152.94470.58480.10690.280.7903
Outlier0.56080.42910.40371075.56090.20990.96540.13020.6797
Corners0.46140.28720.74362020.10680.49760.63580.11870.6295
Cluster in cluster0.22310.2341208.84580.01690.85360.73320.22760.595
Crescent full moon0.27840.19231.1646285.14230.32550.65680.24490.6623
Moon0.23710.10520.9739244.17220.20810.87880.25250.7508
Face0.45690.22171.1099213.02460.37250.76270.24230.6631
Wave0.45250.12910.7119366.10950.23440.89350.26960.6495
Fisher iris0.56920.12220.5234223.61370.33860.82960.25270.443
Breast cancer0.56980.12280.8037900.19880.36060.96170.29930.2944
Aggregation0.47630.14320.54611156.75390.20730.94420.18780.6388
Thyroid0.4330.05982.762616.64290.53430.74860.15280.3275
Zelnik10.20450.09925.697895.1960.25230.89390.21710.6604
Zelnik50.49710.22240.8098413.88350.36510.83380.15340.7739
Xclara0.66540.06561.18636889.01540.34920.74620.2290.8101
Banana0.35890.12581.13223532.22010.76250.43340.21460.8076
Ds2c2sc130.57240.2370.58911907.23880.19210.91930.10910.605
2sp2glob0.38990.12782.7559158.51870.63740.80030.20890.8819
Cure-t1-2000n0.45140.11960.67751365.07740.30540.7870.17210.6555
Table 8

The best parameters for the datasets that were detected by the cluster validity indices with the Spectral Clustering algorithm.

Dataset | Spectral Clustering parameter | Best parameter values detected by the indices for the Spectral Clustering algorithm, in column order: SI, DI, DB, CH, S_Dbw, DSI, RMSSTD, VIASCKDE
Half-kerneln_clusters1421515141522
Two spiralsn_clusters1521515151522
Outliern_clusters244133424
Cornersn_clusters1241212151422
Cluster in clustern_clusters42415151522
Crescent full moonn_clusters52513151426
Moonn_clusters1521515151522
Facen_clusters1121012151322
Waven_clusters721515151522
Fisher irisn_clusters222315223
Breast cancern_clusters222211141512
Aggregationn_clusters4261421522
Thyroidn_clusters3233151523
Zelnik1n_clusters1221312151333
Zelnik5n_clusters82815151524
Xclaran_clusters323310323
Bananan_clusters92915141522
Ds2c2sc13n_clusters335821525
2sp2globn_clusters721515151527
Cure-t1-2000nn_clusters5241321223
Table 9

The best parameters for the datasets that were detected by the cluster validity indices with the HDBSCAN algorithm.

Dataset | HDBSCAN parameter | Best parameter values detected by the indices for the HDBSCAN algorithm, in column order: SI, DI, DB, CH, S_Dbw, DSI, RMSSTD, VIASCKDE
Half-kerneln_clusters_size242422525252424
n_samples661025252566
Two spiralsn_clusters_size32531722156
n_samples21727221912
Outliern_clusters_size1616161616161616
n_samples1212121212121212
Cornersn_clusters_size88882288
n_samples88882288
Cluster in clustern_clusters_size2020911772020
n_samples101022331010
Crescent full moonn_clusters_size2020320332020
n_samples1212212221212
Moonn_clusters_size2262222102106
n_samples34332425244
Facen_clusters_size211392199139
n_samples5198588198
Waven_clusters_size16616163462
n_samples1332313131935
Fisher irisn_clusters_size5514551895
n_samples1212161212212512
Breast cancern_clusters_size1152522225
n_samples3455355335355
Aggregationn_clusters_size1712912232122
n_samples25141614134144
Thyroidn_clusters_size32333328
n_samples272242164
Zelnik1n_clusters_size11353202143
n_samples1611251516171911
Zelnik5n_clusters_size2020202020202020
n_samples33333333
Xclaran_clusters_size92231333313
n_samples26393339
Bananan_clusters_size2121132121162121
n_samples1414161414241414
Ds2c2sc13n_clusters_size222216224222416
n_samples191920196192410
2sp2globn_clusters_size2121212121212121
n_samples2222222222222222
Cure-t1-2000nn_clusters_size444444425
n_samples66666664
Table 10

Obtained values for each index based on the parameters given in Table 8.

Dataset | Obtained values for each index, in column order: SI, DI, DB, CH, S_Dbw, RMSSTD, DSI, VIASCKDE
Half-kernel0.47480.09490.60661761.61980.22460.91630.24950.7395
Two spirals0.31750.13171.0581378.8780.28570.78290.28650.8151
Outlier0.61780.42910.40371804.4630.11760.96540.29240.6863
Corners0.56720.28720.53154102.58830.18730.94390.2070.6575
Cluster in cluster0.45470.23410.9465832.93850.27640.8570.22750.6052
Crescent full moon0.49930.19230.57922022.70220.20550.91030.24230.6689
Moon0.45430.12850.6781602.09070.21690.90980.26890.7527
Face0.49960.23610.54731055.05730.17050.92710.24810.7575
Wave0.49570.12910.631681.36810.16390.91240.25410.617
Fisher iris0.62950.35810.4877356.2890.21630.89230.14320.4539
Breast cancer0.58390.12910.7738993.01580.17960.77950.20310.4341
Aggregation0.45410.10910.5891623.96840.14340.9210.29660.6944
Thyroid0.55170.09730.85138.12910.38090.6850.13090.4832
Zelnik10.50420.09920.663194.5860.28360.86140.21710.6544
Zelnik50.59480.26510.53531832.56260.15480.94950.27630.7686
Xclara0.69460.0230.420310843.72030.27790.9460.16120.8164
Banana0.50870.12580.573414012.55970.18060.93430.21460.82
Ds2c2sc130.39390.06390.80821133.55450.14340.90640.28960.6187
2sp2glob0.61020.14560.69211548.84650.25440.86930.23960.725
Cure-t1-2000n0.49940.19210.65813615.53020.15820.90160.28170.6589
Table 11

ARI values obtained with the parameters given in Table 6 that were proposed by each index.

| Dataset | SI | DI | DB | CH | S_Dbw | DSI | RMSSTD | VIASCKDE |
|---|---|---|---|---|---|---|---|---|
| Half-kernel | 1.0000 | 1.0000 | 0.9940 | 1.0000 | 0.9153 | 0.9940 | 1.0000 | 1.0000 |
| Two spirals | 1.0000 | 1.0000 | 0.9804 | 1.0000 | 0.9804 | 1.0000 | 0.9990 | 1.0000 |
| Outlier | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9973 | 1.0000 | 0.8621 | 1.0000 |
| Corners | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Cluster in cluster | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.8879 | 1.0000 |
| Crescent full moon | 1.0000 | 1.0000 | 0.9968 | 1.0000 | 0.9105 | 0.9873 | 0.8509 | 1.0000 |
| Moon | 0.9379 | 0.6322 | 0.9256 | 0.9379 | 0.7874 | 0.7874 | 0.7949 | 0.7949 |
| Face | 0.2645 | 0.9949 | 0.9961 | 0.2892 | 0.1304 | 0.1226 | 0.8521 | 0.9961 |
| Wave | 0.3514 | 1.0000 | 0.1441 | 0.3514 | 0.1913 | 0.1441 | 0.0508 | 0.0536 |
| Fisher iris | 0.4518 | 0.5503 | 0.4518 | 0.4518 | 0.2369 | 0.4518 | 0.0106 | 0.5503 |
| Breast cancer | 0.8240 | 0.8189 | 0.8240 | 0.8240 | −0.0779 | −0.0779 | −0.0780 | 0.8283 |
| Aggregation | 0.9898 | 0.7338 | 0.9898 | 0.9898 | 0.8770 | 0.9866 | 0.6330 | 0.9898 |
| Thyroid | 0.6715 | 0.6715 | −0.0664 | 0.7339 | 0.2940 | −0.1332 | −0.1396 | 0.6715 |
| Zelnik1 | 0.7708 | 1.0000 | 0.3409 | 0.7852 | 0.7724 | 0.7724 | 1.0000 | 0.7781 |
| Zelnik5 | 0.9214 | 1.0000 | 0.9278 | 1.0000 | 0.9216 | 0.9126 | 0.9839 | 1.0000 |
| Xclara | 0.9813 | 0.0001 | 0.0001 | 0.9813 | 0.9813 | 0.9813 | 0.0001 | 0.9813 |
| Banana | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Ds2c2sc13 | 0.3187 | 0.3187 | 0.4911 | 0.4911 | 0.5325 | 0.4911 | 0.3187 | 0.5904 |
| 2sp2glob | 1.0000 | 1.0000 | 0.9850 | 0.9940 | 0.9985 | 1.0000 | 0.9970 | 0.9940 |
| Cure-t1-2000n | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 |

Table 12

ARI values obtained with the parameters given in Table 8 that were proposed by each index.

| Dataset | SI | DI | DB | CH | S_Dbw | RMSSTD | DSI | VIASCKDE |
|---|---|---|---|---|---|---|---|---|
| Half-kernel | 0.1514 | 1.0000 | 0.1422 | 0.1421 | 0.1515 | 0.1421 | 1.0000 | 1.0000 |
| Two spirals | 0.1401 | 1.0000 | 0.1435 | 0.1401 | 0.1401 | 0.1401 | 0.2047 | 1.0000 |
| Outlier | 0.8463 | 1.0000 | 1.0000 | 0.2236 | 0.2322 | 1.0000 | 0.2271 | 1.0000 |
| Corners | 0.4581 | 1.0000 | 0.4581 | 0.4581 | 0.3917 | 0.4199 | 0.3330 | 0.3330 |
| Cluster in cluster | 0.6584 | 1.0000 | 0.6584 | 0.1365 | 0.1368 | 0.1365 | 1.0000 | 1.0000 |
| Crescent full moon | 0.2934 | 1.0000 | 0.2934 | 0.1021 | 0.0869 | 0.0955 | 1.0000 | 0.2341 |
| Moon | 0.3629 | 0.2973 | 0.3629 | 0.3629 | 0.3092 | 0.3092 | 0.4916 | 0.4916 |
| Face | 0.0646 | 0.3662 | 0.0747 | 0.0580 | 0.0443 | 0.0538 | 0.3662 | 0.3662 |
| Wave | 0.2970 | 1.0000 | 0.1333 | 0.1323 | 0.1323 | 0.1356 | 1.0000 | 1.0000 |
| Fisher iris | 0.5681 | 0.5681 | 0.5681 | 0.7445 | 0.2395 | 0.5681 | 0.5681 | 0.7445 |
| Breast cancer | 0.8933 | 0.8933 | 0.8933 | 0.8933 | 0.2875 | 0.1779 | 0.0669 | 0.2534 |
| Aggregation | 0.7975 | 0.0646 | 0.9066 | 0.4453 | 0.0486 | 0.4156 | 0.1149 | 0.0646 |
| Thyroid | 0.6307 | 0.4204 | 0.6307 | 0.6307 | 0.0830 | 0.0830 | 0.4204 | 0.6307 |
| Zelnik1 | 0.3170 | 0.4352 | 0.3004 | 0.3170 | 0.2225 | 0.3007 | 1.0000 | 1.0000 |
| Zelnik5 | 0.6567 | 0.3096 | 0.6567 | 0.3638 | 0.3790 | 0.3638 | 0.5003 | 1.0000 |
| Xclara | 0.9939 | 0.6270 | 0.9939 | 0.9939 | 0.3602 | 0.9939 | 0.6270 | 0.9939 |
| Banana | 0.2394 | 1.0000 | 0.2394 | 0.1369 | 0.1463 | 0.1369 | 1.0000 | 1.0000 |
| Ds2c2sc13 | 0.3267 | 0.3267 | 0.2766 | 0.4531 | 0.0244 | 0.5344 | 0.0244 | 0.2394 |
| 2sp2glob | 0.7852 | 0.5709 | 0.3226 | 0.3195 | 0.3185 | 0.3226 | 0.5709 | 0.7852 |
| Cure-t1-2000n | 0.6334 | 0.3423 | 0.7818 | 0.3303 | 0.1757 | 0.3546 | 0.1757 | 0.8427 |

Table 13

Obtained values for each index based on the parameters given in Table 9.

Dataset | Obtained values for each index, in column order: SI, DI, DB, CH, S_Dbw, DSI, RMSSTD, VIASCKDE
Half-kernel0.2010.09491.8878171.89840.55890.46620.24950.7125
Two spirals0.40710.13171.1858259.03490.01360.99570.280.8151
Outlier0.56080.42910.40371075.56090.20990.96540.12350.6881
Corners0.46140.28720.74362020.10680.04370.97910.11870.6268
Cluster in cluster0.22310.23414.40832.56240.06420.9470.22750.6052
Crescent full moon0.27840.19231.0934285.14230.05270.98290.24230.6623
Moon0.23710.07941.1729244.17220.32430.70210.26280.7002
Face0.4170.22170.9539204.56650.40310.85570.23390.6654
Wave0.37460.12911.1785168.99360.31550.78620.25410.617
Fisher iris0.62950.35810.4659353.36740.44880.92960.14780.4722
Breast cancer0.43060.11251.1919493.46320.19580.95750.29830.0143
Aggregation0.49250.14320.6452778.94480.27010.84810.14970.6108
Thyroid0.43590.06831.6838.42350.65190.79130.15320.3833
Zelnik10.00080.099213.253512.74330.30220.72870.21710.541
Zelnik50.46630.22241.0459413.88350.45930.74250.14930.7739
Xclara0.67450.02951.2137008.87460.04750.99180.11140.7814
Banana0.35890.12581.02883532.22010.76250.70030.21460.82
Ds2c2sc130.57240.2370.58291785.90020.18310.89280.10930.6045
2sp2glob0.38990.12782.7973158.4080.63740.80030.20880.7146
Cure-t1-2000n0.45140.11960.67751365.07740.30540.7870.17210.655
Table 14

ARI values obtained with the parameters given in Table 9 that were proposed by each index.

| Dataset | SI | DI | DB | CH | S_Dbw | DSI | RMSSTD | VIASCKDE |
|---|---|---|---|---|---|---|---|---|
| Half-kernel | 1.0000 | 1.0000 | 0.9980 | 0.7901 | 0.7901 | 0.7901 | 1.0000 | 1.0000 |
| Two spirals | 0.0079 | 1.0000 | 0.0079 | 0.7524 | 0.0076 | 0.0076 | 0.9990 | 1.0000 |
| Outlier | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Corners | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.8261 | 0.8261 | 1.0000 | 1.0000 |
| Cluster in cluster | 1.0000 | 1.0000 | 0.5285 | 0.5360 | 0.5274 | 0.5274 | 1.0000 | 1.0000 |
| Crescent full moon | 1.0000 | 1.0000 | 0.1160 | 1.0000 | 0.1160 | 0.1160 | 1.0000 | 1.0000 |
| Moon | 0.9379 | 1.0000 | 0.9379 | 0.9379 | 0.2933 | 0.3697 | 0.2933 | 1.0000 |
| Face | 0.1883 | 0.9949 | 1.0000 | 0.1883 | 1.0000 | 1.0000 | 0.9949 | 1.0000 |
| Wave | 0.2609 | 1.0000 | 0.1709 | 0.2609 | 0.2528 | 0.2140 | 1.0000 | 1.0000 |
| Fisher iris | 0.5681 | 0.5681 | 0.5657 | 0.5681 | 0.5681 | 0.5638 | 0.5482 | 0.5682 |
| Breast cancer | 0.8349 | 0.8522 | 0.0011 | 0.8522 | 0.0011 | 0.0011 | −0.0707 | 0.8522 |
| Aggregation | 0.7962 | 0.7338 | 0.7323 | 0.7338 | 0.8154 | 0.8089 | 0.7338 | 0.8089 |
| Thyroid | 0.4885 | 0.5662 | 0.4885 | 0.4885 | 0.4880 | 0.4885 | −0.0255 | 0.4873 |
| Zelnik1 | 0.9313 | 1.0000 | 0.3207 | 0.8880 | 1.0000 | 0.9680 | 0.9771 | 1.0000 |
| Zelnik5 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Xclara | 0.9861 | 0.9904 | 0.3936 | 0.9880 | 0.3936 | 0.3936 | 0.3936 | 0.9904 |
| Banana | 1.0000 | 1.0000 | 0.8308 | 1.0000 | 1.0000 | 0.8278 | 1.0000 | 1.0000 |
| Ds2c2sc13 | 0.3187 | 0.3187 | 0.3180 | 0.3187 | 0.7624 | 0.3187 | 0.3165 | 0.4260 |
| 2sp2glob | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Cure-t1-2000n | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 | 0.8850 |

6. Evaluation of the Results and Discussion

In our approach, we used the compactness and separation values of each data point to support arbitrary-shaped clusters. On its own, this tended to divide spherical clusters into small partitions. To cope with this issue, we used a density estimation method to support the compactness of clusters. The literature offers two types of density estimation methods, parametric and nonparametric. To decide which one suits our approach, we carried out experiments on the datasets using the DBSCAN as the clustering algorithm. According to these experiments, whose results are given in Table 3, the nonparametric method was better than the parametric one. We therefore selected the kernel density estimation as the nonparametric density estimation method, which also supports multivariate data (Table 4). The second point worth discussing is the selection of the parameters of the kernel density estimation, which are the kernel and the bandwidth. To find the best values, we carried out a separate experiment for each parameter, following the procedure given in Section 5.3 and using the DBSCAN with randomly selected parameters. As can be seen in Tables 4 and 5, the Gaussian was the best kernel and h = 0.05 was the best bandwidth. These values were used in the experiments that compare our approach with the other indices. One of the advantages of the proposed VIASCKDE Index is that it can realistically evaluate the clustering performance regardless of the cluster shape. To test our index on different cluster types, we used the DBSCAN, Spectral Clustering, and HDBSCAN algorithms with the procedure given in Section 5.3. The ARI values that correspond to the best value of each index are given in Tables 11, 12, and 14. As can be seen in these tables and in the summary in Table 15, the VIASCKDE Index reached the highest ARI value in 47 of the 60 experiments. Moreover, even when it was not the index with the highest ARI value, its ARI value was very high. In addition, our index obtained better results than the two density-based indices, the S_Dbw and the DSI, as shown in Table 15.
Table 15

The number of highest ARI values that each index reached.

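Number of datasets on which each index was the best, for each algorithm:

| Index | DBSCAN | Spectral Clustering | HDBSCAN | Total |
|---|---|---|---|---|
| SI | 11 | 4 | 9 | 24 |
| DI | 13 | 10 | 16 | 39 |
| DB | 7 | 5 | 6 | 18 |
| CH | 13 | 4 | 8 | 25 |
| S_Dbw | 5 | 0 | 9 | 14 |
| DSI | 8 | 7 | 5 | 20 |
| RMSSTD | 5 | 3 | 11 | 19 |
| VIASCKDE (proposed index) | 15 | 15 | 17 | 47 |
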
The other important advantage of our approach is that it considers the density of each cluster independently. For example, the Aggregation dataset has a nonhomogeneous density, as can be seen in Figure 4, and each cluster may also have a nonhomogeneous distribution, as shown in Figure 4(b). Our approach therefore neither assumes that the data inside a cluster are homogeneously distributed nor weights each data point equally. It gives more importance to the data in the denser regions by multiplying them by a coefficient obtained from the KDE, which supports the compactness of clusters and improved the results of our index. Since the VIASCKDE Index is density-based, it can also be used to evaluate algorithms built on a microcluster structure, which is used by the majority of density-based clustering algorithms, because such algorithms use the center of each microcluster as the actual data in the offline phase.
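The idea of weighting each data point by its estimated density can be illustrated with a short sketch. It is not the exact formulation of the VIASCKDE Index: the per-point scores are hypothetical placeholders for a compactness-separation value, the data are one-dimensional for brevity, and the helper names `kde_weights` and `density_weighted_mean` are assumptions made for this example.

```python
import math


def kde_weights(points, h=0.05):
    """Gaussian KDE density at each 1-D point of a cluster, rescaled to sum to 1."""
    densities = [
        sum(math.exp(-0.5 * ((p - q) / h) ** 2) for q in points)
        for p in points
    ]
    total = sum(densities)
    return [d / total for d in densities]


def density_weighted_mean(points, scores, h=0.05):
    """Average of per-point scores in which denser points count more."""
    return sum(w * s for w, s in zip(kde_weights(points, h), scores))


# Hypothetical cluster: four points in a dense region and one isolated point.
points = [0.50, 0.51, 0.52, 0.53, 0.90]
# Hypothetical per-point scores: the isolated point scores poorly.
scores = [0.9, 0.9, 0.9, 0.9, 0.1]

plain = sum(scores) / len(scores)
weighted = density_weighted_mean(points, scores)
print(f"plain mean = {plain:.3f}, density-weighted mean = {weighted:.3f}")
```

In this toy case the isolated point lies in a low-density region, so it pulls the weighted mean down far less than it pulls the plain mean, which is the sense in which the weighting favors the denser parts of a cluster.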

7. Conclusion and Future Works

In the present study, we proposed a cluster validity index, called the VIASCKDE Index, to evaluate the quality of both spherical and nonspherical clusters. Our approach draws its strength from considering the distribution of the data inside the clusters by using the KDE. This supports the compactness of clusters irrespective of the cluster center, so the clusters can have an arbitrary shape. Most of the cluster validity indices in the literature can evaluate cluster quality realistically only when the clusters are spherical; in many instances, however, they are not. Our proposed approach calculates the compactness and separation values based only on the data, which makes it possible to evaluate cluster quality irrespective of shape. Experimental studies revealed that the VIASCKDE Index reached the highest ARI values on most of the datasets, which indicates that it was the most successful of the compared indices. In future work, we plan to reduce the runtime complexity of the proposed index.
