Hongfang Zhou, Yihui Zhang, Yibin Liu.
Abstract
The k-modes clustering algorithm has been widely used to cluster categorical data. In this paper, we first analyze the k-modes algorithm and its dissimilarity measure. Based on this analysis, we propose a novel dissimilarity measure named GRD. GRD considers not only the relationships between an object and all cluster modes but also the differences between attributes. Finally, experiments were conducted on four real data sets from UCI. The results show that GRD achieves better performance than the two existing dissimilarity measures used in the k-modes and Cao's algorithms.
Year: 2017 PMID: 28458686 PMCID: PMC5387825 DOI: 10.1155/2017/3691316
Source DB: PubMed Journal: Comput Intell Neurosci
An artificial data set.
| Objects | | | | |
|---|---|---|---|---|
| | A | B | A | E |
| | A | A | B | D |
| | C | A | A | E |
| Cluster 1 (mode) | A | A | A | E |
| | A | B | B | E |
| | B | A | C | E |
| | A | B | C | E |
| Cluster 2 (mode) | A | B | C | E |
| | A | A | A | E |
| | D | C | B | E |
| | C | C | A | E |
| Cluster 3 (mode) | A | C | A | E |
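The cluster rows in the artificial data set above can be reproduced by taking the most frequent value of each attribute within a cluster. The sketch below does this and then applies the simple matching dissimilarity (the number of mismatched attributes), which is the measure conventionally used by k-modes. It is not an implementation of GRD, whose formula does not appear in this record. The names `cluster_mode` and `simple_matching` and the first-seen tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

# Objects of the artificial data set, grouped by cluster (values copied from the table).
clusters = {
    1: [("A", "B", "A", "E"), ("A", "A", "B", "D"), ("C", "A", "A", "E")],
    2: [("A", "B", "B", "E"), ("B", "A", "C", "E"), ("A", "B", "C", "E")],
    3: [("A", "A", "A", "E"), ("D", "C", "B", "E"), ("C", "C", "A", "E")],
}


def cluster_mode(objects):
    """Return the most frequent value of each attribute.

    Assumption: ties are broken in favour of the value that appears first.
    """
    mode = []
    for column in zip(*objects):
        counts = Counter(column)
        best = max(counts.values())
        mode.append(next(value for value in column if counts[value] == best))
    return tuple(mode)


def simple_matching(x, z):
    """Count the attributes on which object x and mode z disagree."""
    return sum(a != b for a, b in zip(x, z))


modes = {k: cluster_mode(objs) for k, objs in clusters.items()}

# Dissimilarity between the first object of the table and each cluster mode.
first_object = clusters[1][0]
distances = [simple_matching(first_object, modes[k]) for k in sorted(modes)]

print(modes)
print(distances)
```

For the first object (A, B, A, E) the simple matching dissimilarity is 1 to every mode, so this measure alone cannot decide which of the three clusters the object belongs to.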
Pseudocode 1: Pseudocode of the KBGRD algorithm.
Data sets.
| Data set | Attribute characteristics | # of data objects | # of attributes | # of classes | Missing values |
|---|---|---|---|---|---|
| QSAR | Integer/real | 1055 | 41 | 2 | No |
| Chess | Categorical | 3196 | 36 | 2 | No |
| Mushroom | Categorical | 8124 | 22 | 2 | Yes (very few) |
| Nursery | Categorical | 12960 | 8 | 5 | No |
Average RandIndex on four data sets for three algorithms.
| Algorithm | QSAR | Chess | Mushroom | Nursery |
|---|---|---|---|---|
| k-modes | 0.513 | 0.5102 | 0.5101 | 0.6908 |
| Cao's | 0.5106 | 0.5136 | 0.5251 | 0.7895 |
| KBGRD | 0.5153 | 0.5229 | 0.5543 | 0.7933 |
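This record does not define RandIndex. Assuming it is the standard Rand index (the fraction of object pairs on which the clustering and the true classes agree about "same group" versus "different group"), a minimal sketch follows. The function name `rand_index` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs treated consistently by both labelings.

    Assumption: this is the standard (unadjusted) Rand index.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if len(labels_true) < 2:
        raise ValueError("at least two objects are required")

    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred  # both "same" or both "different"
        total += 1
    return agree / total


# Toy example: only the pair (2, 3) is treated differently, so 5 of 6 pairs agree.
example = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
print(example)
```

The pairwise loop is quadratic in the number of objects, which is fine for a sketch but slow for data sets the size of Nursery; a contingency-table formulation would scale better.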
Average AC on four data sets for three algorithms.
| Algorithm | QSAR | Chess | Mushroom | Nursery |
|---|---|---|---|---|
| k-modes | 0.5820 | 0.5720 | 0.5701 | 0.4786 |
| Cao's | 0.5944 | 0.5432 | 0.5895 | 0.5897 |
| KBGRD | 0.6042 | 0.6073 | 0.6634 | 0.5938 |
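AC is not defined in this record either. One definition that is common in the k-modes literature counts, for each cluster, the objects that belong to the class associated with that cluster, and divides the total by the number of objects. The sketch below assumes each cluster is associated with its majority class, which makes the score equivalent to purity; the paper may use a different matching. The function name `clustering_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def clustering_accuracy(labels_true, labels_pred):
    """Share of objects that fall in the majority class of their cluster.

    Assumption: each cluster is matched to its majority class.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if not labels_true:
        raise ValueError("at least one object is required")

    # Collect the true class of every object, grouped by predicted cluster.
    members = defaultdict(list)
    for true_label, cluster in zip(labels_true, labels_pred):
        members[cluster].append(true_label)

    # For each cluster, count the objects in its most common class.
    correct = sum(Counter(classes).most_common(1)[0][1] for classes in members.values())
    return correct / len(labels_true)


# Toy example: cluster 0 holds 2 "a"; cluster 1 holds 1 "a" and 3 "b".
example_ac = clustering_accuracy(["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 1])
print(example_ac)
```

Under this assumption, five of the six toy objects sit in the majority class of their cluster.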
Average RandIndex of three algorithms on QSAR data set.
| Algorithm | 10 | 15 | 20 | 25 | 30 | 35 | Average |
|---|---|---|---|---|---|---|---|
| k-modes | 0.4613 | 0.4608 | 0.4611 | 0.4596 | 0.4603 | 0.4584 | 0.4603 |
| Cao's | 0.4650 | 0.4610 | 0.4625 | 0.4608 | 0.4611 | 0.4593 | 0.4616 |
| KBGRD | 0.4658 | 0.4634 | 0.4628 | 0.4612 | 0.4620 | 0.4605 | 0.4626 |
Average RandIndex of three algorithms on Chess data set.
| Algorithm | 10 | 15 | 20 | 25 | 30 | 35 | Average |
|---|---|---|---|---|---|---|---|
| k-modes | 0.5016 | 0.5011 | 0.5008 | 0.5024 | 0.5032 | 0.5027 | 0.5020 |
| Cao's | 0.5041 | 0.5023 | 0.5014 | 0.5064 | 0.5045 | 0.5044 | 0.5039 |
| KBGRD | 0.5060 | 0.5090 | 0.5073 | 0.5072 | 0.5070 | 0.5075 | 0.5074 |
Average RandIndex of three algorithms on Mushroom data set.
| Algorithm | 10 | 15 | 20 | 25 | 30 | 35 | Average |
|---|---|---|---|---|---|---|---|
| k-modes | 0.5771 | 0.5641 | 0.5611 | 0.5622 | 0.5443 | 0.5404 | 0.5582 |
| Cao's | 0.5925 | 0.5644 | 0.5679 | 0.5790 | 0.5558 | 0.5638 | 0.5706 |
| KBGRD | 0.5932 | 0.5648 | 0.5731 | 0.5834 | 0.5678 | 0.5730 | 0.5759 |
Average RandIndex of three algorithms on Nursery data set.
| Algorithm | 10 | 15 | 20 | 25 | 30 | 35 | Average |
|---|---|---|---|---|---|---|---|
| k-modes | 0.6839 | 0.7061 | 0.6963 | 0.6875 | 0.6815 | 0.6942 | 0.6916 |
| Cao's | 0.7188 | 0.7071 | 0.6956 | 0.6989 | 0.6847 | 0.6982 | 0.7006 |
| KBGRD | 0.7195 | 0.7073 | 0.6967 | 0.7022 | 0.6957 | 0.6988 | 0.7034 |