Yi Wang1,2, Yi Li1,3,2, Chunhong Qiao1,2, Xiaoyu Liu1,2, Meng Hao1,2, Yin Yao Shugart4,5,6, Momiao Xiong7, Li Jin8,9,10.
Abstract
Clustering techniques are widely used in many applications. The goal of clustering is to identify patterns or groups of similar objects within a dataset of interest. However, many clustering methods are not robust to the noise and outliers found in real data. In this paper, we present Nuclear Norm Clustering (NNC, available at https://sourceforge.net/projects/nnc/), an algorithm that can be used in various fields as a promising alternative to the k-means clustering method. The NNC algorithm requires users to provide a data matrix M and a desired number of clusters K. We employed simulated annealing to choose an optimal label vector that minimizes the nuclear norm of the pooled within-cluster residual matrix. To evaluate the NNC algorithm, we compared it with other classic methods on 15 public datasets and 2 genome-wide association study (GWAS) datasets on psoriasis. The results indicate that NNC has competitive performance in terms of F-score on the 15 benchmark public datasets and the 2 psoriasis GWAS datasets, so NNC is a promising alternative method for clustering tasks.
Year: 2018 PMID: 30022093 PMCID: PMC6052164 DOI: 10.1038/s41598-018-29246-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The pseudocode of Nuclear Norm Clustering.
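As a rough illustration of the idea stated in the abstract (a label vector chosen by simulated annealing to minimize the nuclear norm of the pooled within-cluster residual matrix), a minimal Python sketch might look like the following. It is not the authors' implementation: the function names, the single-label-flip proposal, the geometric cooling schedule, and the temperature parameters `t_start` and `t_end` are all assumptions made for illustration.

```python
import numpy as np


def pooled_residual_nuclear_norm(M, labels, K):
    """Nuclear norm of M after subtracting each row's cluster mean."""
    R = np.asarray(M, dtype=float).copy()
    for k in range(K):
        members = labels == k
        if members.any():
            # Centre the rows of cluster k on that cluster's mean.
            R[members] -= R[members].mean(axis=0)
    # ord="nuc" returns the sum of the singular values.
    return np.linalg.norm(R, ord="nuc")


def nnc_sketch(M, K, n_iter=2000, t_start=1.0, t_end=1e-3, seed=0):
    """Hypothetical simulated-annealing search over label vectors (needs K >= 2)."""
    rng = np.random.default_rng(seed)
    n = np.asarray(M).shape[0]
    labels = rng.integers(0, K, size=n)
    cost = pooled_residual_nuclear_norm(M, labels, K)
    best_labels, best_cost = labels.copy(), cost
    for it in range(n_iter):
        # Assumed geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (it / max(n_iter - 1, 1))
        i = rng.integers(n)
        old = labels[i]
        # Propose moving one sample to a different cluster.
        labels[i] = (old + rng.integers(1, K)) % K
        new_cost = pooled_residual_nuclear_norm(M, labels, K)
        accept = new_cost <= cost or rng.random() < np.exp((cost - new_cost) / temp)
        if accept:
            cost = new_cost
            if cost < best_cost:
                best_labels, best_cost = labels.copy(), cost
        else:
            labels[i] = old  # reject the proposal
    return best_labels, best_cost
```

Each proposal recomputes a full singular value decomposition, so this sketch is only practical for small matrices; the `iter` setting reported in the running time table suggests that the iteration count is the main cost driver in the published tool as well.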
Macro-averaged F-score of all methods on 15 datasets.
| Dataset | Samples | Features | Classes | k-means | PAM | Hcluster | CLARA | AGNES | DIANA | Clusterdp | NNC |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 4601 | 57 | 2 | 0.4756 | 0.7594 | 0.8257 | 0.3771 | 0.3779 | 0.3779 | 0.6088 |  |
|  | 579 | 10 | 2 | 0.4122 | 0.5406 |  | 0.5418 | 0.4163 | 0.4122 | 0.5981 | 0.5837 |
|  | 748 | 4 | 2 | 0.5630 | 0.5710 |  | 0.5849 | 0.4658 | 0.5547 | 0.6304 | 0.5554 |
|  | 768 | 8 | 2 | 0.5803 | 0.6202 | 0.6918 | 0.6169 | 0.4131 | 0.6385 | 0.6100 |  |
|  | 195 | 22 | 2 | 0.4682 | 0.6748 | 0.7013 | 0.6733 | 0.4231 | 0.4073 |  | 0.6376 |
|  | 1055 | 41 | 2 | 0.5025 | 0.7112 | 0.7119 | 0.6570 | 0.3982 | 0.3982 |  | 0.7057 |
|  | 351 | 33 | 2 | 0.7024 | 0.6991 |  | 0.6872 | 0.3992 | 0.5004 | 0.6904 | 0.7024 |
|  | 830 | 5 | 2 | 0.6774 |  | 0.8067 | 0.8010 | 0.5218 | 0.5374 | 0.7976 | 0.7987 |
|  | 569 | 30 | 2 | 0.8268 |  | 0.9181 | 0.9276 | 0.4007 | 0.8832 | 0.8552 | 0.9303 |
|  | 373 | 2 | 2 | 0.7660 | 0.8369 |  | 0.7974 | 0.9127 | 0.8416 | 0.9001 | 0.8636 |
|  | 240 | 2 | 2 | 0.8331 | 0.8461 | 0.8962 | 0.8620 | 0.7986 | 0.8584 |  | 0.8303 |
|  | 300 | 2 | 3 | 0.7081 | 0.7270 | 0.7586 | 0.7147 | 0.7223 |  | 0.7273 | 0.7270 |
|  | 150 | 4 | 3 | 0.8918 | 0.8593 | 0.8841 | 0.8867 | 0.8841 | 0.8512 |  | 0.8853 |
|  | 210 | 7 | 3 | 0.8954 | 0.9104 | 0.9290 | 0.9054 | 0.8795 | 0.9037 | 0.9286 |  |
|  | 178 | 13 | 3 | 0.7032 | 0.9270 | 0.9500 | 0.9425 | 0.5500 | 0.8245 | 0.7860 |  |
Bold: the best result among all methods compared.
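For readers unfamiliar with the metric, the macro-averaged F-score is the unweighted mean of the per-class F-scores. The sketch below computes it in plain Python. Because cluster IDs are arbitrary, it also includes a hypothetical helper, `best_matched_macro_f`, which tries every one-to-one assignment of clusters to classes and keeps the best score; the source does not say how clusters were matched to classes, so that matching step is an assumption.

```python
from itertools import permutations


def macro_f_score(true_labels, pred_labels):
    """Unweighted mean of the per-class F-scores."""
    classes = sorted(set(true_labels))
    pairs = list(zip(true_labels, pred_labels))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_matched_macro_f(true_labels, cluster_ids):
    """Assumed matching step: best score over one-to-one cluster-to-class maps.

    Only meaningful when there are no more clusters than classes.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        relabelled = [mapping[c] for c in cluster_ids]
        best = max(best, macro_f_score(true_labels, relabelled))
    return best
```

The exhaustive search over permutations is affordable here because the benchmark datasets have only 2 or 3 classes.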
Mean and SD of F-score on 2 psoriasis datasets.
| Method | Psoriasis 1: Mean | Psoriasis 1: SD | Psoriasis 1: P-value | Psoriasis 2: Mean | Psoriasis 2: SD | Psoriasis 2: P-value |
|---|---|---|---|---|---|---|
| k-means | 0.4363 | 0.1155 |  | 0.6314 | 0.0316 |  |
| PAM | 0.4864 | 0.1221 |  | 0.6548 | 0.0328 | 6.5430E-02 |
| Hcluster |  | 0.0138 | 9.8145E-01 | 0.6590 | 0.0214 | 5.3664E-02 |
| CLARA | 0.4875 | 0.1247 | 9.6680E-02 | 0.6507 | 0.0229 |  |
| AGNES | 0.3654 | 0.0029 |  | 0.5261 | 0.0711 |  |
| DIANA | 0.4340 | 0.1127 |  | 0.6119 | 0.0401 |  |
| NNC | 0.5735 | 0.0722 | - |  | 0.0065 | - |
Bold: the best result among all methods compared. SD: standard deviation.
The p-value was calculated by the Wilcoxon rank-sum test (paired = TRUE, alternative = "greater").
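The arguments quoted above match R's `wilcox.test`, where `paired = TRUE` runs the signed-rank form of the test on paired differences. As a hedged illustration of what a one-sided paired test computes, the sketch below gives an exact p-value in plain Python. The function name is hypothetical, and the code assumes no zero differences and no tied absolute differences; it is not the authors' analysis script.

```python
def signed_rank_p_greater(x, y):
    """Exact one-sided p-value for the alternative that x tends to exceed y.

    Assumes paired samples with no zero and no tied absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    if any(d == 0 for d in diffs) or len({abs(d) for d in diffs}) != len(diffs):
        raise ValueError("exact test here requires non-zero, untied differences")
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    # counts[s] = number of subsets of the ranks {1..n} whose sum equals s.
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    # Under the null, every sign pattern is equally likely (2**n patterns).
    return sum(counts[w_plus:]) / 2 ** n
```

A small p-value from `signed_rank_p_greater(nnc_scores, other_scores)` would indicate that the first method's scores tend to be higher across paired runs; `nnc_scores` and `other_scores` are placeholder names for two equally long lists of per-run F-scores.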
Figure 1. The macro-averaged F-score of the selected top 50 associated SNPs on the Psoriasis GWAS dataset of the GRU group.
Figure 2. The macro-averaged F-score of the selected top 50 associated SNPs on the Psoriasis GWAS dataset of the ADO group.
Detailed running time comparison of all benchmarked methods.
| Method | Psoriasis 1 (1590 samples): computing time (seconds) | Psoriasis 2 (1133 samples): computing time (seconds) |
|---|---|---|
| k-means# | 0.030 | 0.025 |
| PAM# | 0.053 | 0.030 |
| Hcluster# | 0.056 | 0.025 |
| CLARA# |  |  |
| AGNES# | 0.041 | 0.035 |
| DIANA# | 0.084 | 0.036 |
| NNC(iter = 20, K = 2) | 0.016 | 0.012 |
| NNC(iter = 200, K = 2) | 0.053 | 0.024 |
| NNC(iter = 2000, K = 2) | 0.414 | 0.735 |
| NNC(iter = 20000, K = 2) | 4.754 | 7.328 |
| NNC(iter = 200000, K = 2) | 82.757 | 86.205 |
Bold: the fastest running time among all methods compared.
Computing time: the time measured on the processor.
#: Sum of the computing time over 10 runs with default parameters.
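A timing of this kind (processor time summed over 10 runs) could be collected along the following lines. This is only a sketch of the measurement idea using Python's standard library; the helper name `total_cpu_time` is hypothetical, and the source does not describe the tooling actually used for the benchmark.

```python
import time


def total_cpu_time(func, repeats=10):
    """Sum of the CPU time spent in `func` over `repeats` calls."""
    total = 0.0
    for _ in range(repeats):
        # process_time() counts CPU time of this process and excludes sleep.
        start = time.process_time()
        func()
        total += time.process_time() - start
    return total
```

For example, `total_cpu_time(lambda: sorted(range(100000)))` returns the summed CPU seconds for ten sorts; a clustering call would be wrapped in the same way.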