K-means clustering versus validation measures: a data-distribution perspective.



Abstract

K-means is a well-known and widely used partitional clustering method. While considerable research effort has gone into characterizing the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions affect the performance of K-means clustering. To that end, this paper provides a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if the input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and may provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variation of cluster sizes, as measured by CV, is in a range of about 0.3-1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces the variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases the variation in resultant cluster sizes to greater than 0.3. In other words, in both of these cases, K-means produces clustering results that deviate from the "true" cluster distributions.
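The CV criterion described in the abstract can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes CV is the sample standard deviation of the cluster sizes divided by their mean, and the function name `cluster_size_cv` and both label lists are hypothetical, made up for this example rather than taken from the paper.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation (std / mean) of cluster sizes.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to compute a CV")
    return stdev(sizes) / mean(sizes)


# Hypothetical "true" class labels with highly skewed sizes (90 / 8 / 2).
true_labels = ["a"] * 90 + ["b"] * 8 + ["c"] * 2

# Hypothetical cluster assignments with more uniform sizes (50 / 30 / 20),
# standing in for the kind of output the abstract attributes to K-means.
found_labels = [0] * 50 + [1] * 30 + [2] * 20

cv_true = cluster_size_cv(true_labels)
cv_found = cluster_size_cv(found_labels)

print(f"CV of true class sizes:  {cv_true:.3f}")
print(f"CV of found cluster sizes: {cv_found:.3f}")
```

Comparing the two values shows whether a clustering has flattened (or exaggerated) the size distribution of the "true" classes, which is the effect the abstract says entropy-style measures can miss.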

Year: 2008 PMID: 19095536 DOI: 10.1109/TSMCB.2008.2004559

Source DB: PubMed Journal: IEEE Trans Syst Man Cybern B Cybern ISSN: 1083-4419

Cited

7 in total

1. Water quality assessment with hierarchical cluster analysis based on Mahalanobis distance.

Authors: Xiangjun Du; Fengjing Shao; Shunyao Wu; Hanlin Zhang; Si Xu
Journal: Environ Monit Assess Date: 2017-06-13 Impact factor: 2.513

2. Longitudinal Clustering of High-cost Patients' Spend Trajectories: Delineating Individual Behaviors from Aggregate Trends.

Authors: Andrew M Placona; Rich King; Fengjuan Wang
Journal: AMIA Annu Symp Proc Date: 2018-12-05

3. Characterization and quantification of angiogenesis in rheumatoid arthritis in a mouse model using μCT.

Authors: Svitlana Gayetskyy; Oleg Museyko; Johannes Käßer; Andreas Hess; Georg Schett; Klaus Engelke
Journal: BMC Musculoskelet Disord Date: 2014-09-06 Impact factor: 2.362

4. Comparative RNA-Seq profiling of berry development between table grape 'Kyoho' and its early-ripening mutant 'Fengzao'.

Authors: Da-Long Guo; Fei-Fei Xi; Yi-He Yu; Xiao-Yu Zhang; Guo-Hai Zhang; Gan-Yuan Zhong
Journal: BMC Genomics Date: 2016-10-12 Impact factor: 3.969

5. Regression on imperfect class labels derived by unsupervised clustering.

Authors: Rasmus Froberg Brøndum; Thomas Yssing Michaelsen; Martin Bøgsted
Journal: Brief Bioinform Date: 2021-03-22 Impact factor: 11.622

6. Convalescing Cluster Configuration Using a Superlative Framework.

Authors: R Sabitha; S Karthik
Journal: ScientificWorldJournal Date: 2015-10-12

7. A Strong Machine Learning Classifier and Decision Stumps Based Hybrid AdaBoost Classification Algorithm for Cognitive Radios.

Authors: Siji Chen; Bin Shen; Xin Wang; Sang-Jo Yoo
Journal: Sensors (Basel) Date: 2019-11-20 Impact factor: 3.576
