| Literature DB >> 28042291 |
Min Ren, Peiyu Liu, Zhihao Wang, Jing Yi.
Abstract
To address the shortcoming that the fuzzy c-means (FCM) algorithm needs the number of clusters in advance, this paper proposes a new self-adaptive method to determine the optimal number of clusters. First, a density-based algorithm is put forward that, according to the characteristics of the dataset, automatically determines the possible maximum number of clusters instead of using the empirical rule c_max ≤ √n, and that obtains good initial cluster centroids, easing the limitation of FCM that randomly selected centroids can make the convergence result a local minimum. Second, by introducing a penalty function, a new fuzzy clustering validity index based on fuzzy compactness and separation is proposed; the penalty ensures that the index does not monotonically decrease toward zero as the number of clusters approaches the number of objects in the dataset, a degeneration that would otherwise deprive the estimated optimal number of clusters of robustness and decisive power. Then, building on these results, a self-adaptive FCM (SAFCM) algorithm is put forward that estimates the optimal number of clusters by an iterative trial-and-error process. Finally, experiments on UCI, KDD Cup 1999, and synthetic datasets show that the method not only determines the optimal number of clusters effectively but also reduces the number of FCM iterations while yielding a stable clustering result.
Year: 2016 PMID: 28042291 PMCID: PMC5153549 DOI: 10.1155/2016/2647389
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1: An example. Point A has the highest local density. If A does not belong to any cluster, then A is a core point. Points B, C, D, and E are directly density-reachable from point A. Point F is density-reachable from point A. Point H is a border point.
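The reachability notions in this caption can be made concrete with a small sketch. The coordinates and the cutoff distance d_c below are illustrative choices of ours, not values from the paper, and the chain rule used for density-reachability is a simplified reading:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Toy layout: A is surrounded by B..E, F hangs off B, H sits far away.
points = {"A": (0.0, 0.0), "B": (0.6, 0.0), "C": (0.0, 0.6),
          "D": (-0.6, 0.0), "E": (0.0, -0.6), "F": (1.2, 0.0),
          "H": (3.0, 3.0)}
d_c = 0.8  # assumed cutoff distance

# Local density: number of other points within the cutoff distance.
density = {k: sum(dist(p, q) <= d_c for j, q in points.items() if j != k)
           for k, p in points.items()}
core = max(density, key=density.get)  # the highest-density point

# Directly density-reachable: within d_c of the core point.
directly = {k for k in points
            if k != core and dist(points[k], points[core]) <= d_c}

# Density-reachable: connected to the core by a chain of d_c-close points.
reachable = set(directly)
grew = True
while grew:
    grew = False
    for k in points:
        if k == core or k in reachable:
            continue
        if any(dist(points[k], points[j]) <= d_c for j in reachable):
            reachable.add(k)
            grew = True

print(core)               # A
print(sorted(directly))   # ['B', 'C', 'D', 'E']
print(sorted(reachable))  # ['B', 'C', 'D', 'E', 'F']; H stays a border point
```

Here F is not directly density-reachable from A (it is farther than d_c) but is density-reachable through B, mirroring the figure.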
Algorithm 1: Density-based algorithm.
Figure 2: Demonstration of the process of the density-based algorithm. Panel (a) shows the initial data distribution of a synthetic dataset consisting of two 2-dimensional Gaussian clusters with centroids (2, 3) and (7, 8), respectively; each class has 100 samples. In (b), the blue circle represents the highest-density core point taken as the centroid of the first cluster, and the red plus sign represents the objects belonging to the first cluster. In (c), the red circle represents the core point taken as the centroid of the second cluster, and the blue asterisk represents the objects belonging to the second cluster. In (d), the purple circle represents the core point taken as the centroid of the third cluster, the green × sign represents the objects belonging to the third cluster, and the black dots represent the final border points which do not belong to any cluster. Under a certain cutoff distance, the maximum number of clusters is 3; calculated by the empirical rule, it would be 14. The algorithm can therefore effectively reduce the number of FCM iterations.
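The peeling process the figure demonstrates can be sketched as follows: repeatedly take the densest unassigned point as a new cluster centroid, peel off everything density-reachable from it, and stop when only isolated border points remain. The toy data, cutoff distance, and stopping rule here are our own simplifications, not the paper's exact procedure:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Two tight blobs around (2, 3) and (7, 8), echoing the figure, plus one outlier.
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
data = ([(2 + dx, 3 + dy) for dx, dy in offsets]
        + [(7 + dx, 8 + dy) for dx, dy in offsets]
        + [(12.0, 1.0)])
d_c = 1.5  # assumed cutoff distance

unassigned = set(range(len(data)))
centroids = []
while unassigned:
    # Local density over the points not yet assigned to any cluster.
    density = {i: sum(dist(data[i], data[j]) <= d_c for j in unassigned if j != i)
               for i in unassigned}
    core = max(density, key=density.get)
    if density[core] == 0:  # only isolated border points are left
        break
    centroids.append(data[core])
    # Peel off everything density-reachable from the new core point.
    cluster = {core}
    grew = True
    while grew:
        grew = False
        for i in list(unassigned - cluster):
            if any(dist(data[i], data[j]) <= d_c for j in cluster):
                cluster.add(i)
                grew = True
    unassigned -= cluster

c_max = len(centroids)
print(c_max)  # 2, versus the empirical-rule estimate floor(sqrt(11)) = 3
```

The two blobs each yield one core point and one centroid, while the outlier's density never reaches the threshold, so it ends as a border point exactly as in panel (d).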
Algorithm 2: SAFCM.
The data type and distribution of SubKDD.
| Attack behavior | Number of samples |
|---|---|
| normal | 200 |
| ipsweep | 50 |
| portsweep | 50 |
| neptune | 200 |
| smurf | 300 |
| back | 50 |
Figure 3: Four synthetic datasets.
c_max estimated by several methods. n is the number of objects in the dataset, c is the actual number of clusters, c_ER is the number of clusters estimated by the empirical rule, that is, c_ER = ⌊√n⌋, c_AP is the number of clusters obtained by the AP algorithm, and c_DBA is the number of clusters obtained by the density-based algorithm, reported under four cutoff-distance settings.
| Dataset | n | c | c_ER | c_AP | c_DBA (1) | c_DBA (2) | c_DBA (3) | c_DBA (4) |
|---|---|---|---|---|---|---|---|---|
| Iris | 150 | 3 | 12 | 9 | 20 | 14 | 9 | 6 |
| Wine | 178 | 3 | 13 | 15 | 14 | 7 | 6 | 3 |
| Seeds | 210 | 3 | 14 | 13 | 18 | 12 | 5 | 2 |
| SubKDD | 1050 | 6 | 32 | 24 | 21 | 17 | 10 | 7 |
| SD1 | 200 | 20 | 14 | 19 | 38 | 22 | 20 | — |
| SD2 | 2000 | 4 | 44 | 25 | 16 | 3 | 4 | 2 |
| SD3 | 885 | 3 | 29 | 27 | 24 | 19 | 5 | 3 |
| SD4 | 947 | 3 | 30 | 31 | 23 | 13 | 8 | 4 |
Comparison of the number of FCM iterations. Method 1 uses random initial cluster centroids, and Method 2 uses the cluster centroids obtained by the density-based algorithm.
| Dataset | Method 1 | Method 2 |
|---|---|---|
| Iris | 21 | 16 |
| Wine | 27 | 18 |
| Seeds | 19 | 16 |
| SubKDD | 31 | 23 |
| SD1 | 38 | 14 |
| SD2 | 18 | 12 |
| SD3 | 30 | 22 |
| SD4 | 26 | 21 |
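The effect shown in this table can be reproduced in miniature with the standard FCM update equations (fuzzifier m = 2). The data, seed, and both initializations below are our own, so the iteration counts will not match the table; the point is only that well-placed initial centroids tend to converge in fewer sweeps:

```python
import math, random

def fcm(data, centroids, m=2.0, eps=1e-4, max_iter=100):
    """Minimal fuzzy c-means; returns (final centroids, iterations used)."""
    c, dim, n = len(centroids), len(data[0]), len(data)
    for it in range(1, max_iter + 1):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1))
        u = []
        for x in data:
            d = [max(math.dist(x, v), 1e-12) for v in centroids]
            u.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(c))
                      for i in range(c)])
        # Centroid update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        new = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(n)]
            s = sum(w)
            new.append(tuple(sum(wk * x[t] for wk, x in zip(w, data)) / s
                             for t in range(dim)))
        shift = max(math.dist(a, b) for a, b in zip(centroids, new))
        centroids = new
        if shift < eps:
            break
    return centroids, it

# Two well-separated blobs around (0, 0) and (5, 5).
offsets = [(0.4, 0), (-0.4, 0), (0, 0.4), (0, -0.4), (0.2, 0.2), (-0.2, -0.2)]
data = [(dx, dy) for dx, dy in offsets] + [(5 + dx, 5 + dy) for dx, dy in offsets]

random.seed(1)
rand_init = [(random.uniform(0, 5), random.uniform(0, 5)) for _ in range(2)]
good_init = [(0.2, 0.1), (4.8, 5.2)]  # stand-in for density-based centroids

cents_r, iters_r = fcm(data, rand_init)
cents_g, iters_g = fcm(data, good_init)
print(iters_r, iters_g)  # good initial centroids typically need fewer sweeps
```

With centroids already near the true cluster centers, the first membership sweep is nearly crisp and the stopping criterion is met quickly, which is the mechanism behind Method 2's smaller counts.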
Clustering accuracy.
| Dataset | Iris | Wine | Seeds | SubKDD |
|---|---|---|---|---|
| Clustering accuracy | 84.00% | 96.63% | 91.90% | 94.35% |
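The paper does not spell out how clustering accuracy is computed; a common definition scores the best one-to-one matching between discovered cluster ids and true class labels, which can serve as a plausible stand-in:

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Fraction of objects correct under the best cluster-to-label mapping."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best = max(best, hits)
    return best / len(true_labels)

# Hypothetical example: 8 objects, 3 true classes, 3 discovered clusters.
truth    = ["a", "a", "a", "b", "b", "b", "c", "c"]
assigned = [1, 1, 2, 2, 2, 2, 0, 0]
print(clustering_accuracy(truth, assigned))  # 0.875
```

Exhaustive permutation matching is fine for a handful of clusters; for many clusters the Hungarian algorithm is the scalable way to find the same optimal assignment.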
Figure 4: Clustering results of two synthetic datasets.
Optimal number of clusters estimated by several clustering validity indices.
| Dataset | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| Iris | 2 | 9 | 2 | 2 |
| Wine | 3 | 6 | 3 | 3 |
| Seeds | 2 | 5 | 2 | 2 |
| SubKDD | 10 | 10 | 9 | 4 |
| SD1 | 20 | 20 | 20 | 20 |
| SD2 | 4 | 4 | 4 | 4 |
| SD3 | 5 | 3 | 5 | 4 |
| SD4 | 2 | 8 | 2 | 3 |
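The defining formulas of the four indices are not preserved in this extraction. As an illustration of the compactness-to-separation idea they share, here is the classical Xie-Beni index for a crisp partition (the paper's proposed index additionally introduces a penalty term so the value does not degenerate as the number of clusters approaches the number of objects):

```python
import math

def xie_beni(data, centroids, labels):
    """Within-cluster squared scatter over n times the squared minimum
    centroid separation; smaller values indicate a better partition."""
    compact = sum(math.dist(x, centroids[k]) ** 2 for x, k in zip(data, labels))
    sep = min(math.dist(a, b) ** 2
              for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return compact / (len(data) * sep)

# Two clear blobs; compare a 2-cluster and an over-fine 4-cluster partition.
data = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (-0.3, 0.0),
        (5.0, 5.0), (5.3, 5.0), (5.0, 5.3), (4.7, 5.0)]

cent2 = [(0.0, 0.075), (5.0, 5.075)]             # one centroid per blob
lab2 = [0, 0, 0, 0, 1, 1, 1, 1]

cent4 = [(-0.15, 0.0), (0.15, 0.15), (4.85, 5.0), (5.15, 5.15)]
lab4 = [0, 1, 1, 0, 2, 3, 3, 2]                  # each blob split in half

xb2 = xie_beni(data, cent2, lab2)
xb4 = xie_beni(data, cent4, lab4)
print(xb2 < xb4)  # True: the index correctly prefers c = 2
```

Splitting a natural blob shrinks the minimum centroid separation much faster than it shrinks the scatter, so the ratio jumps; an index of this family is minimized at the right c, which is how the tables below are read.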
Values of the clustering validity indices on Iris.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | … | 0.223604 | … | … |
| 3 | 0.223973 | 0.124598 | 34.572146 | 0.551565 |
| 4 | 0.316742 | 0.099103 | 49.279488 | 0.615436 |
| 5 | 0.560109 | 0.089108 | 87.540968 | 0.676350 |
| 6 | 0.574475 | 0.072201 | 90.563379 | 0.691340 |
| 7 | 0.400071 | 0.067005 | 63.328679 | 0.735311 |
| 8 | 0.275682 | 0.036283 | 45.736972 | 0.614055 |
| 9 | 0.250971 | … | 42.868449 | 0.584244 |
Values of the clustering validity indices on Wine.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | 0.663406 | 1.328291 | 118.33902 | 1.578291 |
| 3 | … | 0.513071 | … | … |
| 4 | — | 0.473254 | — | 1.735791 |
| 5 | — | 0.373668 | — | 1.846686 |
| 6 | — | … | — | 1.683222 |
Values of the clustering validity indices on Seeds.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | … | 0.293609 | … | … |
| 3 | 0.212127 | 0.150899 | 45.326216 | 0.599001 |
| 4 | 0.243483 | 0.127720 | 52.215334 | 0.697943 |
| 5 | 0.348842 | … | 75.493654 | 0.701153 |
Values of the clustering validity indices on SubKDD.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | 0.646989 | 1.324434 | 550.166676 | 1.574431 |
| 3 | 0.260755 | 0.378775 | 222.020838 | 1.090289 |
| 4 | 0.133843 | 0.062126 | 119.544560 | … |
| 5 | 0.234402 | 0.052499 | 202.641204 | 0.537852 |
| 6 | 0.180728 | 0.054938 | 156.812271 | 0.583800 |
| 7 | 0.134636 | 0.047514 | 119.029265 | 0.619720 |
| 8 | 0.104511 | 0.032849 | 91.9852740 | 0.690873 |
| 9 | 0.129721 | 0.027639 | … | 0.562636 |
| 10 | … | … | 91.3528560 | 0.528528 |
Values of the clustering validity indices on SD1.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | 0.221693 | 0.443390 | 44.592968 | 0.693390 |
| 3 | 0.206035 | 0.198853 | 40.264251 | 0.726245 |
| 4 | 0.127731 | 0.093653 | 26.220200 | 0.655550 |
| 5 | 0.130781 | 0.069848 | 27.154867 | 0.651465 |
| 6 | 0.144894 | 0.050067 | 22.922121 | 0.639325 |
| 7 | 0.136562 | 0.040275 | 29.126152 | 0.636258 |
| 8 | 0.112480 | 0.032874 | 24.323625 | 0.627442 |
| 9 | 0.115090 | 0.026833 | 24.242580 | 0.624936 |
| 10 | 0.141415 | 0.022611 | 28.574579 | 0.616701 |
| 11 | 0.126680 | 0.019256 | 28.821707 | 0.611524 |
| 12 | 0.103178 | 0.016634 | 23.931865 | 0.605990 |
| 13 | 0.110355 | 0.013253 | 26.517065 | 0.588246 |
| 14 | 0.095513 | 0.011083 | 23.635022 | 0.576808 |
| 15 | 0.075928 | 0.009817 | 19.302095 | 0.562289 |
| 16 | 0.066025 | 0.008824 | 17.236138 | 0.557990 |
| 17 | 0.054314 | 0.007248 | 14.995284 | 0.544341 |
| 18 | 0.045398 | 0.006090 | 13.208810 | 0.534882 |
| 19 | 0.039492 | 0.005365 | 11.977437 | 0.527131 |
| 20 | … | … | … | … |
Values of the clustering validity indices on SD2.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | 0.066286 | 0.132572 | 132.81503 | 0.382572 |
| 3 | 0.068242 | 0.063751 | 137.52535 | 0.394200 |
| 4 | … | … | … | … |
Values of the clustering validity indices on SD3.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | 0.148379 | 0.300899 | 131.557269 | 0.570876 |
| 3 | 0.195663 | … | 173.900551 | 0.599680 |
| 4 | 0.127512 | 0.142150 | 113.748947 | … |
| 5 | … | 0.738070 | … | 0.589535 |
Values of the clustering validity indices on SD4.
| c | Index 1 | Index 2 | Index 3 | Index 4 |
|---|---|---|---|---|
| 2 | … | 0.208832 | … | 0.473748 |
| 3 | 0.170326 | 0.142561 | 162.044450 | … |
| 4 | 0.221884 | 0.081007 | 211.692699 | 0.583529 |
| 5 | 0.156253 | 0.053094 | 157.683921 | 0.603211 |
| 6 | 0.123191 | 0.041799 | 118.279116 | 0.575396 |
| 7 | 0.165465 | 0.032411 | 107.210082 | 0.592625 |
| 8 | 0.145164 | … | 139.310969 | 0.606049 |