Min Wang, Zachary B Abrams, Steven M Kornblau, Kevin R Coombes.
Abstract
BACKGROUND: Cluster analysis is the most common unsupervised method for finding hidden groups in data. Clustering presents two main challenges: (1) finding the optimal number of clusters, and (2) removing "outliers" among the objects being clustered. Few clustering algorithms currently deal directly with the outlier problem. Furthermore, existing methods for identifying the number of clusters still have some drawbacks. Thus, there is a need for a better algorithm to tackle both challenges.
Keywords: Clustering; Gap statistics; NbClust; Number of clusters; SCOD; Silhouette width; von Mises-Fisher mixture model
Year: 2018 PMID: 29310570 PMCID: PMC5759208 DOI: 10.1186/s12859-017-1998-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
True positive and false positive rates for Δ between 0.20 and 0.60
| Delta | False positive rate | True positive rate |
|---|---|---|
| 0.20 | 0.0300 | 0.9954664 |
| 0.25 | 0.0108 | 0.9954143 |
| 0.26 | 0.0082 | 0.9953882 |
| 0.27 | 0.0074 | 0.9953882 |
| 0.28 | 0.0068 | 0.9953882 |
| 0.29 | 0.0056 | 0.9953882 |
| 0.30 | 0.0042 | 0.9953882 |
| 0.31 | 0.0040 | 0.9953622 |
| 0.32 | 0.0038 | 0.9953622 |
| 0.33 | 0.0038 | 0.9953361 |
| 0.34 | 0.0038 | 0.9953361 |
| 0.35 | 0.0038 | 0.9952840 |
| 0.40 | 0.0036 | 0.9949453 |
| 0.45 | 0.0036 | 0.9939812 |
| 0.50 | 0.0036 | 0.9904898 |
| 0.55 | 0.0036 | 0.9819698 |
| 0.60 | 0.0036 | 0.9554455 |
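As a rough illustration of how one row of this table could be computed, the sketch below evaluates a false positive rate and a true positive rate at a given Δ. It assumes each object has a numeric score and is called positive when the score exceeds Δ; that reading is consistent with both rates falling as Δ rises, but the scoring rule itself is not given in this excerpt. The function name `rates_at_threshold` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    scores: Sequence[float], is_positive: Sequence[bool], delta: float
) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) at threshold delta.

    Assumption: an object is called positive when its score exceeds delta.
    """
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, is_positive):
        called_positive = score > delta
        if called_positive and truth:
            tp += 1
        elif called_positive and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one true positive and one true negative object")
    return fp / (fp + tn), tp / (tp + fn)


# Hypothetical toy data, not taken from the study.
toy_scores = [0.9, 0.8, 0.35, 0.25, 0.1, 0.5]
toy_truth = [True, True, True, False, False, False]

for delta in (0.20, 0.30, 0.60):
    fpr, tpr = rates_at_threshold(toy_scores, toy_truth, delta)
    print(f"delta={delta:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Sweeping Δ over a grid and tabulating the two rates gives a table of the same shape as the one above.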
Fig. 1 The 16 correlation matrices considered in the simulation studies. Values of correlations are provided by the colorbar. Numbers in parentheses correspond to the known numbers of clusters
Summary statistics for detecting good and bad objects in datasets 7-10 from Thresher
| Scenarios and datasets | 96 variables, 24 objects | | | | 24 variables, 96 objects | | | |
|---|---|---|---|---|---|---|---|---|
| | Dataset 7 | Dataset 8 | Dataset 9 | Dataset 10 | Dataset 7 | Dataset 8 | Dataset 9 | Dataset 10 |
| Sensitivity | 0.990 | 0.985 | 0.988 | 0.958 | 0.822 | 0.816 | 0.836 | 0.809 |
| Specificity | 0.606 | 0.552 | 1 | 0.999 | 0.688 | 0.655 | 1 | 0.917 |
| FDR | 0.427 | 0.458 | 0 | 0.001 | 0.399 | 0.426 | 0 | 0.047 |
| AUC | 0.798 | 0.768 | 0.994 | 0.978 | 0.755 | 0.735 | 0.918 | 0.863 |
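For reference, sensitivity, specificity and FDR follow from confusion-matrix counts by their standard definitions. The sketch below is a minimal version; which class (good or bad objects) is treated as "positive" is not stated in this excerpt, and the function name `detection_summary` and the example counts are hypothetical.

```python
from typing import Dict


def detection_summary(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard detection statistics from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("each denominator needs at least one object")
    return {
        "sensitivity": tp / (tp + fn),  # share of true positives that are detected
        "specificity": tn / (tn + fp),  # share of true negatives that are rejected
        "fdr": fp / (fp + tp),          # share of positive calls that are wrong
    }


# Hypothetical counts, not taken from the study.
print(detection_summary(tp=80, fp=20, tn=30, fn=10))

# With no false positives, specificity is 1 and FDR is 0.
print(detection_summary(tp=5, fp=0, tn=7, fn=1))
```

The second call shows why a specificity of 1 and an FDR of 0 appear together for Dataset 9: both follow from having no false positives.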
Summary statistics for detecting good and bad objects in datasets 7-10 from SCOD algorithm
| Scenarios and datasets | 96 variables, 24 objects | | | | 24 variables, 96 objects | | | |
|---|---|---|---|---|---|---|---|---|
| | Dataset 7 | Dataset 8 | Dataset 9 | Dataset 10 | Dataset 7 | Dataset 8 | Dataset 9 | Dataset 10 |
| Sensitivity | 0.337 | 0.344 | 0.327 | 0.328 | 0.225 | 0.228 | 0.223 | 0.217 |
| Specificity | 0.670 | 0.661 | 0.674 | 0.658 | 0.780 | 0.774 | 0.780 | 0.786 |
| FDR | 0.660 | 0.663 | 0.333 | 0.342 | 0.661 | 0.666 | 0.333 | 0.338 |
| AUC | 0.504 | 0.502 | 0.501 | 0.493 | 0.502 | 0.501 | 0.502 | 0.501 |
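The AUC values in this table sit close to 0.5, which under the usual interpretation means the ranking of objects is about as good as chance. The sketch below computes AUC as the probability that a randomly chosen positive object scores higher than a randomly chosen negative one, with ties counted as one half. It is a generic rank-based calculation, not the procedure used in the study; `rank_auc` and the toy inputs are hypothetical.

```python
from typing import Sequence


def rank_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as P(score of a positive > score of a negative), ties counted as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative object")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy inputs, not taken from the study.
print(rank_auc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))  # perfect separation
print(rank_auc([0.5, 0.5, 0.5, 0.5], [True, True, False, False]))  # uninformative scores
```

The pairwise loop is quadratic in the number of objects, which is acceptable for datasets of 24 or 96 objects.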
Fig. 2 Values of the absolute difference between the estimated values and the known number of clusters across the correlation matrices for 96 variables and 24 objects
Fig. 3 Values of the absolute difference between the estimated values and the known number of clusters across the correlation matrices for 24 variables and 96 objects
Values of the absolute difference between the estimated and the known number of clusters across the correlation matrices for 96 variables and 24 objects
| Methods | NbClust Top 10 Best Indices | | | | | | | | | | Thresher | | SCOD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Correlation matrix | trcovw | tracew | ratkowsky | mcclain | ptbiserial | tau | sdindex | kl | ccc | hartigan | CPT | TwiceMean | SCOD |
| 1 | 1.008 | 1.037 | 1.088 | 1.123 | 1.920 | 2.020 | 2.255 | 2.087 | 2.022 | 2.905 | 0.477 | 0.727 | |
| 2 | 1.004 | 1.032 | 1.073 | 1.179 | 1.959 | 2.081 | 2.287 | 2.258 | 2.363 | 3.048 | | 0.119 | 0.153 |
| 3 | 1.008 | 1.031 | 1.065 | 1.135 | 1.858 | 2.041 | 2.193 | 2.099 | 2.763 | 2.902 | 0.135 | | 0.165 |
| 4 | 0.811 | 0.968 | 0.921 | 0.904 | 0.635 | 0.559 | 0.551 | 1.023 | 1.263 | 0.941 | 0.438 | | 1.887 |
| 5 | 0.965 | 0.978 | 0.918 | 0.888 | 0.598 | 0.524 | 0.516 | 1.058 | 1.940 | 0.846 | 0.192 | | 1.882 |
| 6 | 0.822 | 0.960 | 0.917 | 0.897 | 0.613 | 0.522 | 0.516 | 1.082 | 1.558 | 1.016 | | 0.438 | 1.897 |
| 7 | 0.064 | | 0.082 | 0.112 | 0.906 | 1.068 | 1.239 | 1.163 | 1.307 | 1.954 | 0.618 | 0.776 | 0.890 |
| 8 | 0.068 | | 0.075 | 0.108 | 0.946 | 1.078 | 1.255 | 1.199 | 1.177 | 1.857 | 0.760 | 0.802 | 0.914 |
| 9 | 1.05 | 1.029 | 1.082 | 1.126 | 1.882 | 2.045 | 2.215 | 2.153 | 1.975 | 2.933 | 0.422 | 0.401 | |
| 10 | 1.011 | 1.025 | 1.072 | 1.130 | 1.981 | 2.080 | 2.262 | 2.129 | 2.011 | 2.932 | 0.502 | 0.611 | |
| 11 | 0.571 | 0.024 | 0.069 | 0.123 | 0.900 | 1.084 | 1.239 | 1.114 | | 1.906 | 0.109 | 0.104 | 0.918 |
| 12 | 0.104 | | 0.081 | 0.101 | 0.919 | 1.088 | 1.225 | 1.148 | 0.450 | 1.901 | 0.115 | 0.120 | 0.913 |
| 13 | 1.664 | 1.956 | 1.902 | 1.897 | 1.147 | 0.983 | 0.852 | 1.392 | 1.971 | 0.767 | 0.582 | | 2.930 |
| 14 | 1.938 | 1.969 | 1.925 | 1.893 | 1.184 | 0.997 | 0.857 | 1.463 | 2.201 | 0.793 | 0.105 | | 2.884 |
| 15 | 0.810 | 0.960 | 0.902 | 0.892 | 0.614 | 0.531 | | 1.020 | 1.025 | 0.914 | 1.354 | 1.328 | 1.910 |
| 16 | 0.958 | 0.964 | 0.896 | 0.906 | 0.635 | 0.549 | 0.537 | 1.017 | 1.586 | 0.912 | | 0.278 | 1.897 |
| Average | 0.866 | 0.877 | 0.879 | 0.901 | 1.169 | 1.203 | 1.282 | 1.463 | 1.602 | 1.783 | | 0.441 | 1.233 |
Bold values indicate the best results for row settings
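The fractional entries suggest that each cell is an average of the absolute difference between the estimated and the known number of clusters over repeated simulated datasets; the number of repetitions is not given in this excerpt. A minimal sketch of that calculation, with a hypothetical function name and made-up estimates:

```python
from statistics import mean
from typing import Sequence


def mean_abs_cluster_error(estimates: Sequence[int], known_k: int) -> float:
    """Average |estimated K - known K| over a set of simulated datasets."""
    if not estimates:
        raise ValueError("need at least one estimate")
    return mean([abs(k_hat - known_k) for k_hat in estimates])


# Hypothetical estimates from five simulated datasets whose known K is 3.
toy_estimates = [2, 3, 3, 4, 2]
print(mean_abs_cluster_error(toy_estimates, known_k=3))
```

A value of 0 would mean the method recovered the known number of clusters in every repetition, so smaller entries indicate better performance.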
Values of the absolute difference between the estimated and the known number of clusters across the correlation matrices for 24 variables and 96 objects
| Methods | NbClust Top 10 Best Indices | | | | | | | | | | Thresher | | SCOD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Correlation matrix | tracew | mcclain | ratkowsky | trcovw | scott | silhouette | sdindex | kl | tau | ptbiserial | CPT | TwiceMean | SCOD |
| 1 | | 1.011 | 1.055 | 1.078 | 1.213 | 1.528 | 2.246 | 1.850 | 2.382 | 2.320 | 1.147 | 1.153 | 1.011 |
| 2 | 1.011 | 1.024 | 1.064 | 1.076 | 1.195 | 2.665 | 2.250 | 1.885 | 2.320 | 2.348 | | | 1.021 |
| 3 | 1.018 | 1.013 | 1.070 | 1.088 | 1.221 | 1.674 | 2.285 | 1.829 | 2.369 | 2.344 | | | 0.988 |
| 4 | 0.987 | 0.988 | 0.923 | 0.849 | 0.840 | 1.024 | | 0.871 | 0.571 | 0.711 | 2.834 | 2.999 | 1.013 |
| 5 | 0.982 | 0.994 | 0.936 | 0.908 | 0.845 | 0.962 | 0.424 | 0.947 | 0.550 | 0.695 | 0.790 | | 1.008 |
| 6 | 0.989 | 0.979 | 0.933 | 0.820 | 0.844 | 1.083 | | 0.901 | 0.572 | 0.703 | 2.935 | 2.971 | 1.011 |
| 7 | | 0.012 | 0.067 | 0.137 | 0.206 | 0.614 | 1.254 | 0.838 | 1.369 | 1.360 | 0.760 | 2.634 | 0.318 |
| 8 | 0.015 | | 0.067 | 0.138 | 0.212 | 0.585 | 1.314 | 0.882 | 1.422 | 1.366 | 0.650 | 2.689 | 0.346 |
| 9 | 1.014 | 1.016 | 1.050 | 1.133 | 1.198 | 1.562 | 2.250 | 1.902 | 2.365 | 2.320 | 1.163 | 1.163 | |
| 10 | 1.011 | 1.015 | 1.078 | 1.083 | 1.215 | 1.547 | 2.277 | 1.893 | 2.378 | 2.330 | 1.035 | 1.037 | |
| 11 | 0.023 | | 0.094 | 0.741 | 0.231 | 0.571 | 1.226 | 0.981 | 1.373 | 1.308 | 0.050 | 0.049 | 0.350 |
| 12 | | 0.017 | 0.068 | 0.232 | 0.205 | 0.503 | 1.244 | 0.816 | 1.359 | 1.285 | 0.025 | 0.025 | 0.372 |
| 13 | 1.983 | 1.985 | 1.920 | 1.597 | 1.779 | 1.634 | | 1.497 | 0.839 | 0.951 | 0.878 | 0.881 | 2.006 |
| 14 | 1.987 | 1.988 | 1.936 | 1.834 | 1.809 | 1.663 | | 1.515 | 0.819 | 0.945 | 1.152 | 1.043 | 2.006 |
| 15 | 0.983 | 0.988 | 0.928 | 0.805 | 0.831 | 0.942 | | 0.861 | 0.554 | 0.699 | 1.853 | 1.858 | 1.033 |
| 16 | 0.987 | 0.987 | 0.922 | 0.878 | 0.824 | 0.892 | | 0.910 | 0.529 | 0.693 | 1.444 | 1.772 | 1.005 |
| Average | | 0.878 | 0.882 | 0.900 | 0.917 | 1.216 | 1.262 | 1.274 | 1.361 | 1.399 | 1.048 | 1.289 | 0.967 |
Bold values indicate the best results for row settings
Average running time of the methods (Thresher, SCOD and the indices in NbClust with top performance) across correlation matrices (unit: seconds)
| Method | Rule | 96 var., 24 obj. | 24 var., 96 obj. |
|---|---|---|---|
| NbClust | trcovw | 0.09 | 0.025 |
| NbClust | tracew | 0.09 | 0.025 |
| NbClust | ratkowsky | 0.116 | 0.055 |
| NbClust | mcclain | 0.029 | 0.039 |
| NbClust | ptbiserial | 0.029 | 0.165 |
| NbClust | tau | 0.077 | 1.780 |
| NbClust | sdindex | 0.291 | 0.113 |
| NbClust | kl | 0.275 | 0.109 |
| NbClust | ccc | 0.088 | 0.025 |
| NbClust | hartigan | 0.169 | 0.075 |
| NbClust | scott | 0.092 | 0.025 |
| NbClust | silhouette | 0.027 | 0.071 |
| Thresher | CPT | 0.25 | 0.419 |
| Thresher | TwiceMean | 0.271 | 0.530 |
| SCOD | SCOD | 0.009 | 0.057 |
Fig. 4 Comparison of top NbClust indices with Thresher (TwiceMean) and SCOD on estimating the number of clusters from GEO breast cancer datasets
Summary of the data and analysis in clustering breast cancer subtypes
| Dataset | Sample # | Outlier # | Outlier percentage | Cluster # |
|---|---|---|---|---|
| GSE60785 | 55 | 10 | 18.18 | 2 |
| GSE43358 | 57 | 5 | 8.77 | 3 |
| GSE10810 | 58 | 0 | 0.00 | 6 |
| GSE29431 | 66 | 2 | 3.03 | 9 |
| GSE50939 | 71 | 1 | 1.41 | 2 |
| GSE39004 | 72 | 9 | 12.50 | 4 |
| GSE46184 | 74 | 8 | 10.81 | 4 |
| GSE19177 | 75 | 2 | 2.67 | 4 |
| GSE37145 | 76 | 0 | 0.00 | 7 |
| GSE21921 | 85 | 3 | 3.53 | 6 |
| GSE20711 | 90 | 18 | 20.00 | 7 |
| GSE40115 | 92 | 1 | 1.09 | 5 |
| GSE12622 | 103 | 0 | 0.00 | 7 |
| GSE22093 | 103 | 3 | 2.91 | 5 |
| GSE19783 | 115 | 1 | 0.87 | 8 |
| GSE56493 | 120 | 6 | 5.00 | 6 |
| GSE10885 | 125 | 4 | 3.20 | 7 |
| GSE2607 | 126 | 3 | 2.38 | 7 |
| GSE45255 | 139 | 10 | 7.19 | 8 |
| GSE45827 | 155 | 9 | 5.81 | 6 |
| GSE3143 | 158 | 6 | 3.80 | 7 |
| GSE53031 | 167 | 10 | 5.99 | 6 |
| GSE2741 | 169 | 7 | 4.14 | 6 |
| GSE1992 | 170 | 8 | 4.71 | 8 |
| GSE4611 | 218 | 23 | 10.55 | 9 |
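The outlier percentage column appears to be the outlier count divided by the sample count, multiplied by 100 and rounded to two decimals. The sketch below checks that reading against three rows copied from the table; the helper name `outlier_percentage` is hypothetical.

```python
def outlier_percentage(n_outliers: int, n_samples: int) -> float:
    """Outliers as a percentage of samples, rounded to two decimals."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return round(100.0 * n_outliers / n_samples, 2)


# (samples, outliers, reported percentage) for three rows of the table above.
table_rows = {
    "GSE60785": (55, 10, 18.18),
    "GSE43358": (57, 5, 8.77),
    "GSE4611": (218, 23, 10.55),
}

for dataset, (samples, outliers, reported) in table_rows.items():
    computed = outlier_percentage(outliers, samples)
    print(f"{dataset}: computed {computed:.2f}, reported {reported:.2f}")
```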