Bin Dong, Songlei Jian, Ke Zuo.
Abstract
Categorical data are ubiquitous in machine learning tasks, and the representation of categorical data plays an important role in learning performance. The heterogeneous coupling relationships between features and feature values reflect the characteristics of real-world categorical data and need to be captured in the representations. This paper proposes an enhanced categorical data embedding method, CDE++, which captures heterogeneous feature value coupling relationships in the representations. Based on information theory and the hierarchical couplings defined in our previous work CDE (Categorical Data Embedding by learning hierarchical value coupling), CDE++ adopts mutual information and marginal entropy to capture feature couplings and designs a hybrid clustering strategy to capture multiple types of feature value clusters. Moreover, an autoencoder is used to learn non-linear couplings between features and value clusters. The categorical data embeddings generated by CDE++ are low-dimensional numerical vectors that can be applied directly to clustering and classification, where they achieve the best performance compared with other categorical representation learning methods. Parameter sensitivity and scalability tests are also conducted to demonstrate the superiority of CDE++.
Keywords: autoencoder; categorical data; classification; clustering; data embedding; heterogeneous couplings; hybrid clustering strategy
Year: 2020 PMID: 33286165 PMCID: PMC7516865 DOI: 10.3390/e22040391
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
A simple example to explain the value coupling relationships.
| Name | Gender | Major | Occupation |
|---|---|---|---|
| John | Male | Engineering | Programmer |
| Tony | Male | Science | Analyst |
| Alisa | Female | Liberal arts | Lawyer |
| Ben | Male | Engineering | Programmer |
| Abby | Female | Liberal arts | Marketing Manager |
| James | Male | Engineering | Technician |
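The abstract names mutual information and marginal entropy as the quantities used to capture feature couplings. As a minimal sketch, the snippet below computes both on the toy table above using empirical frequencies. The function names `entropy` and `mutual_information` are illustrative choices, and the exact coupling functions of CDE++ are not given here, so this only shows the underlying information-theoretic quantities rather than the method itself.

```python
from collections import Counter
from math import log2

# Toy table from the example above: (Gender, Major, Occupation) per person.
rows = [
    ("Male", "Engineering", "Programmer"),
    ("Male", "Science", "Analyst"),
    ("Female", "Liberal arts", "Lawyer"),
    ("Male", "Engineering", "Programmer"),
    ("Female", "Liberal arts", "Marketing Manager"),
    ("Male", "Engineering", "Technician"),
]


def entropy(values):
    """Marginal entropy H(X) in bits, from empirical value frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits, from empirical joint frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


gender, major, occupation = zip(*rows)

print("H(Gender)           =", round(entropy(gender), 4))
print("H(Major)            =", round(entropy(major), 4))
print("I(Gender; Major)    =", round(mutual_information(gender, major), 4))
print("I(Major; Occupation)=", round(mutual_information(major, occupation), 4))
```

In this toy table every Major value occurs with a single Gender value, so I(Gender; Major) equals H(Gender). Likewise every Occupation occurs with a single Major, so I(Major; Occupation) equals H(Major). These are the strongest couplings the two feature pairs can exhibit.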
Figure 1. Overview of CDE++.
The descriptions of the notations in CDE++.
| Symbols | Description |
|---|---|
| | The dataset and a specific object. |
| | The feature set in the dataset and a specific feature. |
| | The feature that value |
| | The whole feature value set in the dataset and a specific feature value. |
| | The feature value set for feature |
| | The value in feature |
| | The number of objects in the dataset. |
| | The number of features in the dataset. |
| | The number of feature values in the dataset. |
| | The number of ground-truth classes in the dataset. |
| | The probability of |
| | The joint probability of |
| | The relation between two features |
| | The relative entropy of joint distribution and marginal distribution between two features |
| | The marginal entropy of feature |
| | The occurrence-based value coupling function. |
| | The co-occurrence-based value coupling function. |
| | The occurrence-based relationship matrix. |
| | The co-occurrence-based relationship matrix. |
| | The parameter of DBSCAN. |
| | The number-of-clusters parameter of HC. |
| | The cluster indicator matrix. |
| | The dimension of the cluster indicator matrix. |
| | The factor for dropping redundant value clusters. |
| | The hidden factor of the Autoencoder. |
| | The dimension of a value after the Autoencoder. |
| | The general function to generate new object embeddings. |
The Dataset attributes and F-score results of Clustering by Inverse Document Frequency (IDF), DILCA, Categorical Data Embedding (CDE), CDE-AE, 1-HOT, 1-HOT-AE, and our method CDE++ on 15 Data Sets. The best performance for each data set is boldfaced. The Data Sets are sorted in descending order of F-score.
| Dataset attributes | | | | | F-score | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Datasets | Objects | Features | Feature values | Classes | IDF | DILCA | CDE | CDE-AE | 1-HOT | 1-HOT-AE | CDE++ |
| Zoo | 101 | 17 | 43 | 7 | 0.827 | 0.746 | 0.833 | 0.79 | 0.826 | 0.871 | |
| Iris | 150 | 4 | 123 | 3 | 0.59 | 0.632 | 0.717 | 0.667 | 0.585 | 0.467 | |
| Hepatitis | 155 | 19 | 360 | 2 | 0.535 | 0.679 | 0.672 | 0.687 | 0.677 | 0.684 | |
| Tic-tac-toe | 958 | 9 | 27 | 2 | 0.536 | 0.542 | 0.557 | 0.559 | 0.578 | 0.578 | |
| Annealing | 798 | 38 | 317 | 5 | 0.512 | 0.534 | 0.577 | 0.547 | 0.528 | 0.588 | |
| Bloger | 100 | 5 | 15 | 2 | 0.484 | 0.492 | 0.546 | 0.539 | 0.53 | 0.51 | |
| Balance-scale | 625 | 4 | 20 | 3 | 0.463 | 0.497 | 0.514 | 0.499 | 0.462 | 0.419 | |
| Lymphography | 148 | 18 | 59 | 4 | 0.556 | 0.513 | 0.528 | 0.489 | 0.494 | 0.493 | |
| Hayes-roth | 132 | 4 | 15 | 3 | 0.48 | 0.478 | 0.495 | 0.484 | 0.348 | 0.341 | |
| Teaching A.E. | 151 | 5 | 101 | 3 | 0.395 | 0.41 | 0.432 | 0.44 | 0.428 | 0.444 | |
| Student A.P. | 131 | 21 | 75 | 3 | 0.429 | 0.423 | 0.449 | 0.433 | 0.445 | 0.466 | |
| Lenses | 24 | 4 | 9 | 3 | 0.442 | 0.471 | 0.458 | 0.467 | 0.546 | | 0.458 |
| Nursery | 12,960 | 8 | 27 | 5 | 0.306 | 0.294 | 0.32 | 0.356 | 0.283 | | 0.325 |
| Primary-tumor | 339 | 17 | 37 | 21 | 0.218 | 0.224 | 0.285 | 0.29 | 0.291 | 0.289 | |
| Chess | 28,056 | 6 | 40 | 18 | 0.154 | 0.16 | 0.165 | 0.16 | 0.157 | 0.151 | |
| Average | | | | | 0.462 | 0.473 | 0.503 | 0.494 | 0.479 | 0.484 | |
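The final row of the F-score table is consistent with per-column means over the 15 data sets. As a quick check, the sketch below recomputes the mean of the IDF column from the values listed in the table. The variable names are illustrative only.

```python
# IDF column of the F-score table, in row order (Zoo ... Chess).
idf_fscores = [
    0.827, 0.59, 0.535, 0.536, 0.512, 0.484, 0.463, 0.556,
    0.48, 0.395, 0.429, 0.442, 0.306, 0.218, 0.154,
]

# Arithmetic mean over the 15 data sets, rounded to the table's precision.
mean_idf = sum(idf_fscores) / len(idf_fscores)
print(round(mean_idf, 3))  # 0.462, matching the first F-score entry of the last row
```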
Basic Parameters of Autoencoder.
| Architecture | A-64-code dimension-64- |
| MaxEpochs | 1000 |
| LossFunction | MSE with L2 and Sparsity Regularizers |
| Training Algorithm | Scaled Conjugate Gradient Descent |
The code dimension in Architecture is determined by the hidden factor and the original data dimension.
The Accuracy results of Classifying by IDF, DILCA, CDE, CDE-AE, 1-HOT, 1-HOT-AE, and our method CDE++ on 15 Data Sets. The best performance for each data set is boldfaced. The Data Sets are sorted in descending order of Accuracy.
| | Accuracy | | | | | | |
|---|---|---|---|---|---|---|---|
| Datasets | IDF | DILCA | CDE | CDE-AE | 1-HOT | 1-HOT-AE | CDE++ |
| Zoo | 0.937 | 0.946 | 0.97 | 0.944 | 1 | 1 | 1 |
| Lenses | 0.826 | 0.793 | 0.833 | 0.714 | 0.8 | 0.811 | |
| Annealing | 0.973 | 0.979 | 0.985 | 0.978 | | 0.989 | 0.988 |
| Tic-tac-toe | 0.894 | 0.872 | 0.913 | 0.735 | 0.981 | 0.98 | |
| Balance-scale | 0.741 | 0.713 | 0.727 | 0.649 | | 0.957 | |
| Nursery | 0.804 | 0.729 | 0.817 | 0.549 | 0.937 | 0.939 | 0.943 |
| Iris | 0.883 | 0.897 | 0.893 | 0.887 | 0.92 | | 0.911 |
| Bloger | 0.742 | 0.733 | 0.739 | 0.808 | 0.758 | 0.852 | |
| Hepatitis | 0.829 | 0.817 | 0.834 | 0.811 | 0.877 | 0.804 | |
| Hayes-roth | 0.754 | 0.771 | 0.807 | 0.763 | 0.829 | 0.834 | |
| Lymphography | 0.792 | 0.803 | 0.819 | 0.803 | 0.826 | 0.834 | |
| Primary-tumor | 0.538 | 0.551 | 0.577 | 0.528 | 0.595 | 0.61 | |
| Student A.P. | 0.518 | 0.499 | 0.529 | 0.519 | 0.551 | 0.544 | |
| Teaching A.E. | 0.547 | 0.543 | 0.576 | 0.509 | 0.561 | 0.585 | |
| Chess | 0.184 | 0.216 | 0.242 | 0.215 | 0.27 | 0.257 | |
| Average | 0.731 | 0.724 | 0.751 | 0.694 | 0.791 | 0.795 | |
Ablation Study Settings.
| | Learning Value Clusters | | Learning Value Cluster Couplings |
|---|---|---|---|
| | DBSCAN | HC | Autoencoder |
| i | 🗸 | × | × |
| ii | × | 🗸 | × |
| iii | 🗸 | 🗸 | × |
| iv | 🗸 | 🗸 | 🗸 |
F-score Results of Ablation Study based on Clustering task. The best performance for each data set is boldfaced.
| | F-score | | | |
|---|---|---|---|---|
| Datasets | DBSCAN | HC | DBSCAN+HC | CDE++ |
| Zoo | 0.829 | 0.85 | 0.854 | |
| Iris | 0.453 | 0.491 | 0.646 | |
| Hepatitis | 0.503 | 0.617 | 0.666 | |
| Tic-tac-toe | 0.621 | 0.578 | 0.573 | |
| Annealing | 0.606 | 0.514 | 0.544 | |
| Bloger | 0.53 | 0.588 | 0.572 | |
| Balance-scale | 0.464 | 0.5 | 0.549 | |
| Lymphography | 0.434 | 0.488 | 0.484 | |
| Hayes-roth | 0.445 | 0.445 | 0.441 | |
| Teaching A.E. | 0.437 | 0.419 | 0.491 | |
| Student A.P. | 0.435 | 0.425 | 0.446 | |
| Lenses | | 0.583 | 0.583 | 0.458 |
| Nursery | | 0.295 | 0.273 | 0.325 |
| Primary-tumor | 0.248 | 0.291 | 0.288 | |
| Chess | 0.163 | 0.167 | 0.151 | |
| Average | 0.475 | 0.483 | 0.504 | |
Accuracy Results of Ablation Study based on Classifying task. The best performance for each data set is boldfaced.
| | Accuracy | | | |
|---|---|---|---|---|
| Datasets | DBSCAN | HC | DBSCAN+HC | CDE++ |
| Zoo | 0.964 | 0.97 | 0.976 | |
| Lenses | | 0.789 | 0.811 | |
| Annealing | 0.975 | 0.93 | 0.971 | |
| Tic-tac-toe | 0.695 | 0.965 | 0.977 | |
| Balance-scale | | 0.961 | 0.958 | 0.968 |
| Nursery | | 0.944 | 0.943 | 0.943 |
| Iris | 0.833 | 0.738 | 0.898 | |
| Bloger | 0.839 | 0.781 | 0.819 | |
| Hepatitis | 0.828 | 0.838 | 0.845 | |
| Hayes-roth | 0.812 | 0.837 | 0.834 | |
| Lymphography | 0.828 | 0.813 | 0.843 | |
| Primary-tumor | | 0.579 | 0.564 | 0.623 |
| Student A.P. | 0.578 | 0.527 | 0.527 | |
| Teaching A.E. | 0.565 | 0.557 | 0.578 | |
| Chess | | 0.461 | 0.236 | 0.413 |
| Average | 0.812 | 0.779 | 0.785 | |
Figure 2. Sensitivity test w.r.t. parameter in terms of Representation Dimension.
Figure 3. Sensitivity test w.r.t. parameter in terms of F-score.
Figure 4. Sensitivity test w.r.t. hidden factor in terms of Representation Dimension.
Figure 5. Sensitivity test w.r.t. hidden factor in terms of F-score.
Figure 6. Scalability test w.r.t. Data Size in terms of Execution Time.
Figure 7. Scalability test w.r.t. Data Dimension in terms of Execution Time.