Semeh Ben Salem, Sami Naouali, Zied Chtourou.
Abstract
The categorical clustering problem has attracted much attention, especially in recent decades, since many real-world applications produce categorical data. The k-modes algorithm, proposed in 1998, and its many variants have been widely used in this context. However, they suffer from a significant limitation related to the update of the modes in each iteration: the mode in the last step of these algorithms is selected randomly, even though several candidate modes can be identified. In this paper, a rough density mode selection method is proposed to identify the adequate modes among a list of candidates in each iteration of the k-modes. The proposed method, called Density Rough k-Modes (DRk-M), was evaluated on real-world datasets extracted from the UCI Machine Learning Repository, the Global Terrorism Database (GTD), and a set of collected tweets. DRk-M was also compared with many state-of-the-art clustering methods and showed great efficiency.
Keywords: Categorical clustering; K-modes; Rough set theory; Uncertainty; Unsupervised learning
Year: 2021 PMID: 33815625 PMCID: PMC7998089 DOI: 10.1007/s13042-021-01293-w
Source DB: PubMed Journal: Int J Mach Learn Cybern ISSN: 1868-8071 Impact factor: 4.012
Fig. 1 The k-modes algorithm and its variants compared to the DRk-M
Fig. 2 W matrix assigning observations to clusters
Assignment of observations to clusters
| obs1 | obs2 | obs3 | obs4 | obs5 | obs6 | obs7 | \|cl_i\| |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 |
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 3 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
| | | | | | | | 7 |
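The bookkeeping behind this membership matrix can be checked with a short sketch. It assumes each row of W is one cluster and each column one observation, as the table suggests; the variable names are illustrative and not taken from the paper.

```python
# Membership matrix W copied from the table above: one row per cluster,
# one column per observation (obs1..obs7).
W = [
    [0, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0],
]

# Hard partition check: every observation belongs to exactly one cluster,
# so every column must sum to 1.
column_sums = [sum(col) for col in zip(*W)]
is_hard_partition = all(s == 1 for s in column_sums)

# Cluster cardinalities |cl_i| are the row sums; together they give N.
cluster_sizes = [sum(row) for row in W]
n_observations = sum(cluster_sizes)

# 0-based cluster index of each observation (the row holding its single 1).
labels = [max(range(len(W)), key=lambda i: W[i][j]) for j in range(len(W[0]))]

print(is_hard_partition, cluster_sizes, n_observations, labels)
```

The row sums reproduce the last column of the table (2, 3, 2) and the total of 7 observations.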
Illustrative example of the generation of the candidate modes in a categorical dataset
| | obs1 | obs2 | obs3 | obs4 | obs5 | obs6 | obs7 |
|---|---|---|---|---|---|---|---|
| a1 | a | a | c | d | a | d | d |
| a2 | e | f | e | g | f | g | h |
| a3 | l | n | k | l | l | m | l |
| a4 | x | z | y | y | z | x | y |
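Why several candidate modes can arise is easy to see on this example. The sketch below treats the seven observations as a single cluster (an assumption made for illustration), collects every value tied for the highest frequency in each attribute, and forms all combinations. It shows only the candidate generation, not the rough density selection proposed in the paper; the helper `most_frequent_values` is a hypothetical name.

```python
from collections import Counter
from itertools import product

# Attribute values copied from the table above (rows a1..a4, columns obs1..obs7).
data = {
    "a1": ["a", "a", "c", "d", "a", "d", "d"],
    "a2": ["e", "f", "e", "g", "f", "g", "h"],
    "a3": ["l", "n", "k", "l", "l", "m", "l"],
    "a4": ["x", "z", "y", "y", "z", "x", "y"],
}


def most_frequent_values(values):
    """Return every value tied for the highest frequency, sorted for determinism."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Per-attribute ties: a1 -> a, d; a2 -> e, f, g; a3 -> l; a4 -> y.
per_attribute = {name: most_frequent_values(vals) for name, vals in data.items()}

# Every combination of tied values is an equally valid mode for plain k-modes.
candidate_modes = list(product(*per_attribute.values()))

for mode in candidate_modes:
    print(mode)
```

With two ties on a1 and three on a2, the example yields 2 × 3 × 1 × 1 = 6 candidate modes, any of which a standard k-modes update could pick at random.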
Classification dataset for the Covid infection
| Patient | Fever | Fatigue | Cough | Sneezing | Aches and pains | Sore throat | Headache | Covid/ Not Covid |
|---|---|---|---|---|---|---|---|---|
| Patient01 | Yes | Yes | No | Yes | Yes | Yes | Yes | Covid |
| Patient02 | Yes | Yes | No | Yes | Yes | Yes | Yes | Not Covid |
| Patient03 | No | No | Yes | No | Yes | Yes | No | Covid |
| Patient04 | Yes | Yes | Yes | Yes | Yes | No | Yes | Not Covid |
| Patient05 | No | Yes | Yes | Yes | Yes | Yes | No | Covid |
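This record does not say how the paper uses the table; assuming it illustrates the rough set notions listed in the keywords, the sketch below computes the standard lower and upper approximations of the set of Covid patients with respect to the seven symptom attributes. Patient01 and Patient02 share identical symptoms but different labels, which is what creates a boundary region. All names in the code are illustrative.

```python
from collections import defaultdict

# Rows copied from the table above: (patient, symptom values, class label).
# Symptom order: Fever, Fatigue, Cough, Sneezing, Aches and pains,
# Sore throat, Headache.
patients = [
    ("Patient01", ("Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"), "Covid"),
    ("Patient02", ("Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"), "Not Covid"),
    ("Patient03", ("No", "No", "Yes", "No", "Yes", "Yes", "No"), "Covid"),
    ("Patient04", ("Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes"), "Not Covid"),
    ("Patient05", ("No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"), "Covid"),
]

# Indiscernibility classes: patients with identical symptom vectors.
classes = defaultdict(set)
for name, symptoms, _ in patients:
    classes[symptoms].add(name)

# Target concept: the patients labelled Covid.
target = {name for name, _, label in patients if label == "Covid"}

# Lower approximation: classes entirely inside the target (certainly Covid).
lower = set().union(*(c for c in classes.values() if c <= target))
# Upper approximation: classes that overlap the target (possibly Covid).
upper = set().union(*(c for c in classes.values() if c & target))
# Boundary region: patients that cannot be classified with certainty.
boundary = upper - lower

print(sorted(lower), sorted(upper), sorted(boundary))
```

Here the lower approximation is {Patient03, Patient05}, the upper approximation adds Patient01 and Patient02, and those two patients form the boundary region.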
Fig. 3 The DRk-M clustering process
Confusion matrix for two classes
| | Predicted: Cancer = YES | Predicted: Cancer = NO |
|---|---|---|
| Actual: Cancer = YES | TP = 25 | FN = 5 |
| Actual: Cancer = NO | FP = 5 | TN = 65 |
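The accuracy and F1-score reported in the result tables follow from a confusion matrix like this one. As a minimal sketch using the standard definitions (the variable names are illustrative), the four counts above give:

```python
# Counts copied from the confusion matrix above.
tp, fn, fp, tn = 25, 5, 5, 65

total = tp + fn + fp + tn

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / total
# Precision: share of predicted positives that are truly positive.
precision = tp / (tp + fp)
# Recall: share of actual positives that were found.
recall = tp / (tp + fn)
# F1-score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

For these counts the accuracy is 90/100 = 0.90, and precision, recall, and F1 all equal 25/30 ≈ 0.8333.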
Fig. 4 Experiments for the Mushroom dataset with various dimensions (N = 8124, K = 2)
Experimental results computed for the Breast cancer dataset for K (6 → 10), N = 644, d = 4
| K | 6 | | 7 | | 8 | | 9 | | 10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| | k-modes | DRk-modes | k-modes | DRk-modes | k-modes | DRk-modes | k-modes | DRk-modes | k-modes | DRk-modes |
| Accuracy | 0.4716 | 0.4440 | 0.324 | 0.4512 | 0.3603 | 0.4435 | ||||
| Entropy | 3.5738 | 4.3474 | 4.3770 | 5.4385 | 6.5446 | |||||
| NMI | 0.0461 | 0.0442 | 0.0045 | 0.0107 | 0.0338 | 8 × 10⁻⁴ | 2 × 10⁻⁴ | |||
Values written in bold correspond to the metrics where the proposed algorithm performed better than state-of-the-art methods
The accuracy (AC) and F1-score computed for 100 runs for the Mushroom dataset
| Methods | Huang’s k-modes | Improved Huang’s k-modes | Weighted k-modes | Improved Weighted k-modes | Ng’s k-modes [7,46] (2014) | Improved Ng’s k-modes | Bai’s fuzzy NFKM | Khan’s initialization method | Fuzzy k-modes | DRk-M |
|---|---|---|---|---|---|---|---|---|---|---|
| AC | 0.7176 | 0.8190 | 0.7106 | 0.8006 | 0.7969 | 0.8366 | 0.8298 | 0.7001 | | |
| F1-score | 0.7289 | 0.8250 | 0.7230 | 0.7827 | 0.7742 | 0.8411 | 0.8359 | 0.6787 | | |
Values written in bold correspond to the metrics where the proposed algorithm performed better than state-of-the-art methods
The accuracy (AC) and F1-score computed for 100 runs for the lung cancer dataset
| Methods | Huang’s k-modes | Improved Huang’s k-modes | Weighted k-modes | Improved Weighted k-modes | Ng’s k-modes [7,46] (2014) | Improved Ng’s k-modes | Bai’s fuzzy NFKM | Khan’s initialization method | Fuzzy k-modes | DRk-M |
|---|---|---|---|---|---|---|---|---|---|---|
| AC | 0.5322 | 0.5803 | 0.5344 | 0.5631 | 0.5516 | 0.6003 | 0.6012 | 0.5000 | 0.5306 | |
| F1-score | 0.5545 | 0.5967 | 0.5408 | 0.5557 | 0.5779 | 0.6265 | 0.6008 | 0.5735 | 0.5580 | |
Values written in bold correspond to the metrics where the proposed algorithm performed better than state-of-the-art methods
The accuracy (AC) and F1-score computed for 100 runs for the breast cancer dataset
| Methods | Huang’s k-modes | Improved Huang’s k-modes | Weighted k-modes | Improved Weighted k-modes | Ng’s k-modes | Improved Ng’s k-modes | Bai’s fuzzy NFKM | Khan’s initialization method | Fuzzy k-modes | DRk-M |
|---|---|---|---|---|---|---|---|---|---|---|
| AC | 0.8482 | 0.9270 | 0.8530 | 0.8441 | 0.8645 | 0.8770 | 0.9446 | 0.9127 | 0.8343 | |
| F1-score | 0.8263 | 0.9191 | 0.8051 | 0.8771 | 0.8535 | 0.8499 | 0.9383 | 0.9042 | 0.8111 | |
Values written in bold correspond to the metrics where the proposed algorithm performed better than state-of-the-art methods
The accuracy (AC) and F1-score computed for 100 runs for the credit approval dataset
| Methods | Huang’s k-modes | Improved Huang’s k-modes | Weighted k-modes | Improved Weighted k-modes | Ng’s k-modes [7,46] (2014) | Improved Ng’s k-modes | Bai’s fuzzy NFKM | Fuzzy k-modes | DRk-M |
|---|---|---|---|---|---|---|---|---|---|
| AC | 0.7367 | 0.7647 | 0.7442 | 0.7578 | 0.7612 | 0.7942 | 0.7701 | 0.7441 | |
| F1-score | 0.7453 | 0.7628 | 0.7428 | 0.7550 | 0.7590 | 0.7680 | 0.7712 | 0.7630 | |
Values written in bold correspond to the metrics where the proposed algorithm performed better than state-of-the-art methods
The accuracy (AC) and F1-score computed for 100 runs for the soybean dataset
| Methods | Huang’s k-modes | Improved Huang’s k-modes | Weighted k-modes | Improved Weighted k-modes | Ng’s k-modes [7,46] (2014) | Improved Ng’s k-modes | Bai’s fuzzy NFKM | Khan’s initialization method | Fuzzy k-modes | DRk-M |
|---|---|---|---|---|---|---|---|---|---|---|
| AC | 0.8553 | 0.9234 | 0.8613 | 0.9068 | 0.9396 | 0.9979 | 0.9264 | 0.9574 | 0.8336 | |
| F1-score | 0.8702 | 0.9288 | 0.8702 | 0.9100 | 0.9442 | 0.9978 | 0.9319 | 0.9643 | 0.8495 | |
Values written in bold correspond to the metrics where the proposed algorithm performed better than state-of-the-art methods
Clusters resulting from the segmentation process for K = 5
| Country | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Total |
|---|---|---|---|---|---|---|
| Tunisia | 10 | 1 | 190 | 5 | 5 | 221 |
| Egypt | 5 | 160 | 9 | 10 | 5 | 189 |
| Algeria | 165 | 2 | 3 | 2 | 1 | 173 |
| Libya | 2 | 4 | 2 | 48 | 3 | 59 |
| Morocco | 2 | 1 | 6 | 1 | 72 | 83 |
| Max | 165 | 160 | 190 | 48 | 72 | 635/725 = 0.87 |
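The figure 635/725 in the last row matches a purity-style accuracy: take the largest count in each cluster column, sum these maxima, and divide by the number of observations. A minimal sketch follows, assuming the five numeric columns are the clusters and the final column holds the row totals; the denominator is taken from the table as reported rather than recomputed from the cells.

```python
# Per-country counts copied from the table above (five cluster columns).
counts = {
    "Tunisia": [10, 1, 190, 5, 5],
    "Egypt": [5, 160, 9, 10, 5],
    "Algeria": [165, 2, 3, 2, 1],
    "Libya": [2, 4, 2, 48, 3],
    "Morocco": [2, 1, 6, 1, 72],
}

# Number of observations as reported in the table's last row.
reported_total = 725

# Dominant count in each cluster: the maximum of every column.
column_maxima = [max(col) for col in zip(*counts.values())]

# Purity-style accuracy: correctly grouped observations over all observations.
purity = sum(column_maxima) / reported_total

print(column_maxima, sum(column_maxima), round(purity, 4))
```

The column maxima are 165, 160, 190, 48, and 72, which sum to 635; dividing by 725 gives about 0.876, which the table reports as 0.87.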
Average accuracy, STD and average_Silhouette computed for the DRk-M and the k-modes for the two Twitter datasets for 50 runs of the algorithms
| K | Metric | Dataset 1: k-modes | Dataset 1: DRk-M | Dataset 2: k-modes | Dataset 2: DRk-M |
|---|---|---|---|---|---|
| 5 | Average_accuracy | 0.6412 | 0.7232 | 0.7557 | 0.7289 |
| | STD | 2.68% | 2.27% | 1.25% | 0.79% |
| | Average_Silhouette | 0.7854 | 0.7846 | 0.7432 | 0.7637 |
| 6 | Average_accuracy | 0.6251 | 0.6927 | 0.7597 | 0.7914 |
| | STD | 1.79% | 1.12% | 2.08% | 1.83% |
| | Average_Silhouette | 0.7284 | 0.7876 | 0.7013 | 0.7522 |
| 7 | Average_accuracy | 0.7543 | 0.7643 | 0.6210 | 0.6938 |
| | STD | 2.67% | 2.39% | 1.19% | 0.72% |
| | Average_Silhouette | 0.6832 | 0.6893 | 0.6144 | 0.6381 |
| 8 | Average_accuracy | 0.6718 | 0.7351 | 0.7183 | 0.7318 |
| | STD | 3.28% | 2.83% | 0.86% | 0.59% |
| | Average_Silhouette | 0.7291 | 0.7456 | 0.8267 | 0.8819 |
| 9 | Average_accuracy | 0.6706 | 0.7091 | 0.7418 | 0.9063 |
| | STD | 2.12% | 1.88% | 2.07% | 1.80% |
| | Average_Silhouette | 0.7156 | 0.7612 | 0.7155 | 0.7516 |
| 10 | Average_accuracy | 0.7164 | 0.7763 | 0.8161 | 0.8201 |
| | STD | 1.89% | 0.93% | 1.01% | 0.91% |
| | Average_Silhouette | 0.7396 | 0.8137 | 0.7236 | 0.7514 |
Accuracy computed for the DRk-M and the k-modes for various N and K
| Dataset | Algorithm | K = 5 | K = 6 | K = 7 | K = 8 | K = 9 | K = 10 |
|---|---|---|---|---|---|---|---|
| Dataset 1 (N = 1803) | k-modes | 0.6523 | 0.6310 | 0.7623 | 0.6845 | 0.6874 | 0.7239 |
| | DRk-M | 0.7623 | | | | | |
| Dataset 2 (N = 284) | k-modes | 0.7354 | 0.7325 | 0.6178 | 0.6912 | 0.7523 | 0.8234 |
| | DRk-M | 0.7354 | 0.8234 | | | | |
Values written in bold correspond to the metrics where the proposed algorithm performed better than state-of-the-art methods
Fig. 5 Entropy computed for the two algorithms for N = 10³ and K (3 → 15)
Fig. 6 NMI computed for the two algorithms for N = 15 × 10³ and K (3 → 15)
Fig. 7 Distribution of Silhouette scores for various clusterings according to the number of clusters for the DRk-M and k-modes (N = 10³, K: 5 → 10 and 50 runs)
Fig. 8 Execution time computed for N (500 → 25 × 10³) and K = 8 and 10
Accuracy, STD and average_Silhouette computed for the DRk-M and the k-modes for 50 runs using the GTD dataset
| K | Metric | Dataset 1: k-modes | Dataset 1: DRk-M | Dataset 2: k-modes | Dataset 2: DRk-M |
|---|---|---|---|---|---|
| 3 | Average_accuracy | 0.6448 | 0.6837 | 0.6537 | 0.6967 |
| | STD | 1.13% | 0.82% | 1.72% | 1.19% |
| | Average_Silhouette | 0.6581 | 0.6819 | 0.7294 | 0.7628 |
| 4 | Average_accuracy | 0.6215 | 0.6928 | 0.6534 | 0.6976 |
| | STD | 2.34% | 1.69% | 2.91% | 1.76% |
| | Average_Silhouette | 0.6213 | 0.7089 | 0.6095 | 0.6780 |
| 5 | Average_accuracy | 0.6519 | 0.6686 | 0.6207 | 0.6911 |
| | STD | 3.67% | 3.28% | 1.68% | 1.89% |
| | Average_Silhouette | 0.6391 | 0.6814 | 0.6135 | 0.6318 |
| 6 | Average_accuracy | 0.7637 | 0.8019 | 0.6125 | 0.6308 |
| | STD | 3.68% | 2.98% | 2.59% | 2.14% |
| | Average_Silhouette | 0.7284 | 0.7446 | 0.8381 | 0.8734 |
| 7 | Average_accuracy | 0.7493 | 0.7739 | 0.6717 | 0.7193 |
| | STD | 2.57% | 3.09% | 1.89% | 1.48% |
| | Average_Silhouette | 0.7293 | 0.7675 | 0.8098 | 0.8539 |
| 8 | Average_accuracy | 0.7824 | 0.8239 | 0.8190 | 0.8382 |
| | STD | 1.32% | 0.92% | 1.68% | 0.98% |
| | Average_Silhouette | 0.7287 | 0.8097 | 0.6381 | 0.6937 |