Aihua Zheng, Bo Jiang, Yan Li, Xuehan Zhang, Chris Ding.
Abstract
The widely used K-means clustering is a hard clustering algorithm. Here we propose an Elastic K-means clustering model (EKM) that uses posterior probabilities to give soft assignments, so that each data point can belong to multiple clusters fractionally, and we show the benefit of the proposed model. Furthermore, in many applications, pairwise relations (graph information) are available in addition to vector attributes. We therefore integrate EKM with Normalized Cut graph clustering into a single clustering formulation. Finally, we provide several matrix inequalities that are useful for matrix formulations of learning models. Based on these results, we prove the correctness and convergence of the EKM algorithms. Experimental results on six benchmark datasets demonstrate the effectiveness of the proposed EKM and its integrated model.
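To make the contrast between hard and fractional membership concrete, the sketch below computes soft assignments as posteriors under an assumed equal-weight isotropic Gaussian mixture. This is a generic illustration and not the EKM formulation, which this record does not spell out; the function name `soft_assignments`, the bandwidth `sigma`, and the demo data are all hypothetical.

```python
import numpy as np

def soft_assignments(X, centers, sigma=1.0):
    """Posterior cluster memberships under an ASSUMED equal-weight
    isotropic Gaussian mixture (illustrative only, not the EKM model)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every point to every center: shape (n, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * sigma ** 2)
    # Subtract the row maximum before exponentiating for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    # Normalize so each row is a probability distribution over clusters.
    return P / P.sum(axis=1, keepdims=True)

# Hypothetical demo data: the middle point lies halfway between the two centers.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
centers_demo = np.array([[0.0, 0.0], [2.0, 0.0]])
P_demo = soft_assignments(X_demo, centers_demo)
```

A hard K-means assignment corresponds to taking `P_demo.argmax(axis=1)`, which discards the fact that the middle point belongs equally to both clusters.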
Year: 2017 PMID: 29240756 PMCID: PMC5730165 DOI: 10.1371/journal.pone.0188252
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. EKM clustering results on three 2-dimensional Gaussian clusters. The x and y axes denote the first and second dimensions, respectively.
Fig 2. An example of the posterior G for 3 ambiguous data points and 3 sharp data points from the above EKM clustering results.
Fig 3. The gap in the posterior distribution: the difference between the highest and second-highest peaks.
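The gap described in this caption can be computed directly from a posterior matrix. The following is a minimal sketch, assuming G is stored as an n × K array with one row per data point; the function name `posterior_gap` and the example values are hypothetical and are not taken from the paper.

```python
import numpy as np

def posterior_gap(G):
    """Per data point: highest posterior value minus the second highest."""
    G = np.asarray(G, dtype=float)
    # Sort each row in ascending order; the last two columns hold the two largest values.
    top2 = np.sort(G, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

# Hypothetical posteriors for illustration only.
G_example = np.array([
    [0.90, 0.05, 0.05],  # a "sharp" point: one dominant cluster
    [0.40, 0.35, 0.25],  # an "ambiguous" point: two close peaks
])
gaps = posterior_gap(G_example)
```

A large gap indicates a point assigned confidently to one cluster, while a gap near zero indicates a point shared between at least two clusters.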
Table 1. Dataset names, number of data samples, data dimension, and number of classes of the six benchmark datasets.
| Dataset | #Sample | #Dimension | #Class |
|---|---|---|---|
| AT&T | 400 | 1024 | 40 |
| USPS | 9298 | 256 | 10 |
| MNIST | 1000 | 784 | 10 |
| COIL20 | 1440 | 1024 | 20 |
| Isolet1 | 1560 | 617 | 26 |
| BinAlph | 1014 | 320 | 26 |
Fig 4. Clustering accuracy of gEKM against α on six datasets in full dimension.
Fig 5. Changes in the objective function of EKM and gEKM over 100 iterations.
Clustering accuracy of six datasets in full dimension.
| Dataset | Kmeans | FCM | NMF | Ncut | MinMaxCut | EKM | gEKM |
|---|---|---|---|---|---|---|---|
| AT&T | 0.715 | 0.701 | 0.721 | 0.731 | 0.730 | | |
| USPS | 0.618 | 0.626 | 0.624 | 0.644 | 0.621 | | |
| MNIST | 0.494 | 0.501 | 0.518 | 0.531 | 0.532 | | |
| COIL20 | 0.625 | 0.615 | 0.631 | 0.635 | 0.634 | | |
| Isolet1 | 0.626 | 0.624 | 0.628 | 0.651 | 0.645 | | |
| BinAlph | 0.495 | 0.511 | 0.502 | 0.511 | 0.512 | | |
Clustering accuracy of six datasets in PCA subspace when p = 50.
| Dataset | Kmeans | FCM | NMF | Ncut | MinMaxCut | EKM | gEKM |
|---|---|---|---|---|---|---|---|
| AT&T | 0.713 | 0.713 | 0.715 | 0.731 | 0.734 | | |
| USPS | 0.604 | 0.619 | 0.621 | 0.627 | 0.632 | | |
| MNIST | 0.521 | 0.523 | 0.524 | 0.537 | 0.536 | | |
| COIL20 | 0.620 | 0.631 | 0.627 | 0.633 | 0.640 | | |
| Isolet1 | 0.626 | 0.627 | 0.626 | 0.641 | 0.635 | | |
| BinAlph | 0.498 | 0.510 | 0.518 | 0.534 | 0.533 | | |
Clustering accuracy of six datasets in PCA subspace when p = 200.
| Dataset | Kmeans | FCM | NMF | Ncut | MinMaxCut | EKM | gEKM |
|---|---|---|---|---|---|---|---|
| AT&T | 0.712 | 0.708 | 0.702 | 0.725 | 0.723 | | |
| USPS | 0.610 | 0.616 | 0.625 | 0.629 | 0.631 | | |
| MNIST | 0.512 | 0.503 | 0.518 | 0.525 | 0.532 | | |
| COIL20 | 0.623 | 0.611 | 0.621 | 0.626 | 0.630 | | |
| Isolet1 | 0.619 | 0.624 | 0.625 | 0.631 | 0.629 | | |
| BinAlph | 0.477 | 0.512 | 0.515 | 0.522 | 0.525 | | |
Fig 6. Clustering accuracy of gEKM against the number of dimensions of the PCA subspace on six datasets, where “full” on the x-axis denotes the full dimension of the datasets as indicated in Table 1.
Clustering accuracy of six datasets in PCA subspace when p = 100.
| Dataset | Kmeans | FCM | NMF | Ncut | MinMaxCut | EKM | gEKM |
|---|---|---|---|---|---|---|---|
| AT&T | 0.711 | 0.701 | 0.711 | 0.737 | 0.731 | | |
| USPS | 0.608 | 0.625 | 0.625 | 0.632 | 0.622 | | |
| MNIST | 0.516 | 0.521 | 0.525 | 0.534 | 0.535 | | |
| COIL20 | 0.618 | 0.625 | 0.630 | 0.633 | 0.634 | | |
| Isolet1 | 0.628 | 0.623 | 0.627 | 0.637 | 0.639 | | |
| BinAlph | 0.481 | 0.515 | 0.518 | 0.523 | 0.522 | | |
Clustering accuracy of six datasets in PCA subspace when p = 150.
| Dataset | Kmeans | FCM | NMF | Ncut | MinMaxCut | EKM | gEKM |
|---|---|---|---|---|---|---|---|
| AT&T | 0.717 | 0.702 | 0.702 | 0.720 | 0.722 | | |
| USPS | 0.605 | 0.612 | 0.627 | 0.631 | 0.623 | | |
| MNIST | 0.519 | 0.511 | 0.528 | 0.537 | 0.539 | | |
| COIL20 | 0.607 | 0.622 | 0.629 | 0.631 | 0.633 | | |
| Isolet1 | 0.627 | 0.622 | 0.631 | 0.642 | 0.640 | | |
| BinAlph | 0.485 | 0.512 | 0.519 | 0.526 | 0.531 | | |