Ziqiang Lin, Eugene Laska, Carole Siegel.
Abstract
The quality of a cluster analysis of unlabeled units depends on the quality of the between-unit dissimilarity measures. Data-dependent dissimilarity is more objective than data-independent geometric measures such as Euclidean distance. As suggested by Breiman, many data-driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. An RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program, and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Cluster (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF, yielding an updated proximity matrix, which is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures, such as the Extremely Randomized Tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. The properties measured by the Silhouette Score are substantially superior to those of the base clustering algorithm. The GIC package has been released in R: https://cran.r-project.org/web/packages/GIC/index.html.
Keywords: Clustering; Extremely Randomized Tree; Proximity; Random Forest; iterative RF clustering
Year: 2022 PMID: 36061078 PMCID: PMC9438941 DOI: 10.1002/sam.11573
Source DB: PubMed Journal: Stat Anal Data Min ISSN: 1932-1864 Impact factor: 1.247
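The iterative loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the API of the GIC R package: the synthetic units are made by permuting each feature column independently, the dissimilarity is taken as 1 − proximity, the "clustering program" is average-linkage hierarchical clustering, and convergence is declared when the partition stops changing. The function names `proximity`, `cluster`, and `gic_sketch` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier


def proximity(forest, X):
    """Fraction of trees in which each pair of units falls in the same leaf."""
    leaves = forest.apply(X)  # shape (n_units, n_trees): leaf index per tree
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)


def cluster(dissimilarity, k):
    """Assumed clustering step: average linkage cut into at most k clusters."""
    condensed = squareform(dissimilarity, checks=False)
    return fcluster(linkage(condensed, method="average"), t=k, criterion="maxclust")


def gic_sketch(X, k, max_iter=10, n_trees=200, seed=0):
    rng = np.random.default_rng(seed)
    # Base step: real units (label 1) versus synthetic units (label 0).
    # Assumption: synthetic rows come from permuting each column independently.
    synthetic = np.column_stack([rng.permutation(col) for col in X.T])
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(np.vstack([X, synthetic]), np.repeat([1, 0], len(X)))
    labels = cluster(1.0 - proximity(rf, X), k)

    for it in range(max_iter):
        # Iterative step: grow a new forest on the current cluster labels,
        # then recluster the units with the updated proximity matrix.
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed + it + 1)
        rf.fit(X, labels)
        new_labels = cluster(1.0 - proximity(rf, X), k)
        # Assumed stopping rule: the partition is unchanged up to relabeling.
        same = np.array_equal(labels[:, None] == labels[None, :],
                              new_labels[:, None] == new_labels[None, :])
        labels = new_labels
        if same:
            break
    return labels
```

Replacing `RandomForestClassifier` with scikit-learn's `ExtraTreesClassifier` would give an analogous sketch for the Extremely Randomized Tree base procedure.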
Mean pairwise proximity and its change between iterations, with standard deviations in parentheses, for IRFC and IERT on the heart disease data
| Iteration | IRFC: Mean pairwise distance | IRFC: Mean iteration change in pairwise proximity | IERT: Mean pairwise proximity | IERT: Mean iteration change in pairwise proximity |
|---|---|---|---|---|
| 1 | 0.80 (0.22) | ‐ | 0.64 (0.01) | ‐ |
| 2 | 0.87 (0.17) | 0.07 (0.11) | 0.62 (0.01) | −0.02 (0.01) |
| 3 | 0.86 (0.18) | −0.01 (0.09) | 0.62 (0.01) | 0.004 (0.001) |
| 4 | 0.84 (0.20) | −0.02 (0.08) | 0.62 (0.004) | 0.003 (0.004) |
| 5 | 0.82 (0.22) | −0.02 (0.08) | 0.63 (0.004) | 0.001 (0.001) |
| 6 | 0.78 (0.24) | −0.04 (0.08) | 0.63 (0.004) | 0.001 (0.001) |
| 7 | 0.77 (0.24) | −0.01 (0.08) | 0.63 (0.004) | <0.001 (0.001) |
| 8 | 0.77 (0.24) | −0.01 (0.07) | 0.63 (0.004) | <0.001 (0.001) |
Silhouette score and Jaccard index for RFC, IRFC, ERT and IERT clustering methods for real‐world data sets
| Dataset | Silhouette score: RFC | Silhouette score: IRFC | Silhouette score: ERT | Silhouette score: IERT | Jaccard index: RFC | Jaccard index: IRFC | Jaccard index: ERT | Jaccard index: IERT |
|---|---|---|---|---|---|---|---|---|
| Mean value and standard deviation | ||||||||
| Iris | 0.169 | 0.834 | 0.023 | 0.437 | 0.648 | 0.706 | 0.287 | 0.517 |
| (150, 4, 3) | (0.004) | (0.014) | (0.003) | (0.041) | (0.031) | (0.025) | (0.194) | (0.235) |
| Heart disease | 0.022 | 0.407 | 0.009 | 0.192 | 0.474 | 0.414 | 0.177 | 0.179 |
| (270, 13, 2) | (0.002) | (0.068) | (0.001) | (0.053) | (0.037) | (0.026) | (0.057) | (0.063) |
| Mean value | ||||||||
| Wisconsin | 0.109 | 0.761 | 0.505 | 0.604 | 0.666 | 0.796 | 0.942 | 0.894 |
| (699, 9, 2) | ||||||||
| Lung | 0.055 | 0.301 | 0.094 | 0.274 | 0.268 | 0.261 | 0.095 | 0.182 |
| (32, 56, 3) | ||||||||
| Breast tissue | 0.224 | 0.709 | 0.282 | 0.409 | 0.331 | 0.355 | 0.427 | 0.431 |
| (106, 9, 6) | ||||||||
| Isolet | −0.004 | 0.187 | 0.063 | 0.244 | 0.156 | 0.192 | 0.016 | 0.039 |
| (1559, 617, 26) | ||||||||
| Parkinson | 0.254 | 0.814 | 0.254 | 0.447 | 0.451 | 0.446 | 0.168 | 0.189 |
| (768, 8, 2) | ||||||||
| Ionosphere | 0.122 | 0.594 | 0.298 | 0.496 | 0.440 | 0.404 | 0.513 | 0.519 |
| (351, 34, 2) | ||||||||
| Segmentation | 0.245 | 0.603 | 0.295 | 0.528 | 0.405 | 0.386 | 0.179 | 0.137 |
| (2310, 19, 7) | ||||||||
The triple in parentheses under each dataset name gives (sample size, number of features, number of labels).
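For readers who want to see how the two measures in this table are computed, here is a minimal pure-Python sketch. It assumes the Silhouette score is evaluated directly on a precomputed dissimilarity matrix and that the Jaccard index is the pair-counting form (pairs of units grouped together in both partitions, divided by pairs grouped together in at least one); the source does not state which variant was used. The function names are hypothetical.

```python
from itertools import combinations


def jaccard_index(truth, predicted):
    """Pair-counting Jaccard index between two labelings of the same units."""
    both = either = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        both += same_truth and same_pred
        either += same_truth or same_pred
    return both / either if either else 1.0


def silhouette_score(dissimilarity, labels):
    """Mean silhouette width; requires at least two clusters."""
    n = len(labels)
    clusters = {c: [i for i in range(n) if labels[i] == c] for c in set(labels)}
    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes s(i) = 0 by convention
        # a: mean dissimilarity to the other members of the unit's own cluster
        a = sum(dissimilarity[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(sum(dissimilarity[i][j] for j in members) / len(members)
                for c, members in clusters.items() if c != labels[i])
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / n
```

A high Silhouette score therefore reflects compact, well-separated clusters under the given dissimilarity, while the Jaccard index reflects agreement with the ground-truth labels; the two need not move together.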
FIGURE 1 Scatterplots of petal length versus petal width features for the iris data for ground truth and 4 clustering methods. Ground truth clusters are setosa, versicolor and virginica, shown in the upper left plot
FIGURE 2 Scatterplots of maximum heart rate achieved versus exercise‐induced ST depression for the heart disease data, for ground truth and 4 clustering methods
Silhouette score and Jaccard index for RFC, IRFC, ERT and IERT clustering methods in simulated data with 9 and 49 continuous features
| Number of clusters | Silhouette score: RFC | Silhouette score: IRFC | Silhouette score: ERT | Silhouette score: IERT | Jaccard index: RFC | Jaccard index: IRFC | Jaccard index: ERT | Jaccard index: IERT |
|---|---|---|---|---|---|---|---|---|
| 9 continuous features | ||||||||
| 2 | 0.016 | 0.539 | 0.110 | 0.182 | 0.376 | 0.366 | 0.493 | 0.487 |
| 5 | 0.008 | 0.356 | 0.073 | 0.113 | 0.143 | 0.134 | 0.081 | 0.068 |
| 10 | −0.007 | 0.323 | 0.074 | 0.105 | 0.165 | 0.132 | 0.033 | 0.037 |
| 49 continuous features | ||||||||
| 2 | 0.005 | 0.337 | 0.032 | 0.107 | 0.372 | 0.374 | 0.575 | 0.688 |
| 5 | 0.003 | 0.141 | 0.011 | 0.048 | 0.196 | 0.160 | 0.111 | 0.096 |
| 10 | −0.018 | 0.105 | 0.039 | 0.097 | 0.325 | 0.211 | 0.036 | 0.028 |
Silhouette score and Jaccard index for RFC, IRFC, ERT and IERT clustering methods in simulated data with 9 and 49 continuous features and one categorical feature
| Number of clusters | Silhouette score: RFC | Silhouette score: IRFC | Silhouette score: ERT | Silhouette score: IERT | Jaccard index: RFC | Jaccard index: IRFC | Jaccard index: ERT | Jaccard index: IERT |
|---|---|---|---|---|---|---|---|---|
| 9 continuous features + 1 categorical feature | ||||||||
| 2 | 0.021 | 0.500 | 0.143 | 0.339 | 0.385 | 0.360 | 0.347 | 0.370 |
| 5 | 0.006 | 0.314 | 0.109 | 0.231 | 0.135 | 0.128 | 0.063 | 0.045 |
| 10 | −0.007 | 0.201 | 0.177 | 0.224 | 0.137 | 0.103 | 0.048 | 0.025 |
| 49 continuous features + 1 categorical feature | ||||||||
| 2 | 0.009 | 0.291 | 0.040 | 0.217 | 0.543 | 0.411 | 0.594 | 0.384 |
| 5 | 0.002 | 0.136 | 0.017 | 0.135 | 0.220 | 0.169 | 0.065 | 0.272 |
| 10 | −0.011 | 0.100 | 0.079 | 0.187 | 0.168 | 0.219 | 0.027 | 0.042 |
Mean pairwise proximity and its change between iterations, with standard deviations in parentheses, for IRFC and IERT on the iris data
| Iteration | IRFC: Mean pairwise distance | IRFC: Mean iteration change in pairwise distance | IERT: Mean pairwise distance | IERT: Mean iteration change in pairwise distance |
|---|---|---|---|---|
| 1 | 0.717 (0.39) | ‐ | 0.658 (0.004) | ‐ |
| 2 | 0.702 (0.40) | −0.015 (0.08) | 0.651 (<0.001) | −0.007 (0.004) |
| 3 | 0.694 (0.41) | −0.008 (0.05) | 0.651 (<0.001) | <0.001 (<0.001) |
| 4 | 0.702 (0.70) | 0.007 (0.04) | 0.651 (<0.001) | <0.001 (<0.001) |
| 5 | 0.696 (0.41) | −0.006 (0.04) | 0.651 (<0.001) | <0.001 (<0.001) |
| 6 | 0.698 (0.41) | 0.002 (0.04) | 0.651 (<0.001) | <0.001 (<0.001) |
| 7 | 0.696 (0.41) | −0.002 (0.04) | 0.651 (<0.001) | <0.001 (<0.001) |
| 8 | 0.696 (0.41) | <0.001 (0.04) | 0.651 (<0.001) | <0.001 (<0.001) |
FIGURE 3 Silhouette scores over the range of possible cluster numbers for the iris data. Note that the ordinate ranges of the two plots are not the same. The Silhouette score is maximized at 6 clusters for RFC and at 2 clusters for IRFC
FIGURE 4 Scatterplots of petal length versus petal width for the iris data for the number of clusters that maximized the Silhouette scores, 6 for RFC and 2 for IRFC. Ground truth with 3 clusters is shown in the upper left plot in Figure 1
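Choosing the number of clusters by maximizing the Silhouette score, as in Figures 3 and 4, can be sketched as below. This is an illustration only: it assumes a precomputed dissimilarity matrix with a zero diagonal and uses average-linkage hierarchical clustering as a stand-in for whatever clustering program is applied; `best_cluster_count` is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score


def best_cluster_count(dissimilarity, candidates):
    """Return (best_k, scores), where scores maps each usable k to its Silhouette score."""
    tree = linkage(squareform(dissimilarity, checks=False), method="average")
    scores = {}
    for k in candidates:
        labels = fcluster(tree, t=k, criterion="maxclust")
        n_found = len(np.unique(labels))
        # The Silhouette score is defined only for 2 .. n_units - 1 clusters.
        if not 2 <= n_found <= len(labels) - 1:
            continue
        scores[k] = silhouette_score(dissimilarity, labels, metric="precomputed")
    return max(scores, key=scores.get), scores
```

Because the score depends on the dissimilarity matrix itself, the same data can yield different optimal cluster counts under different proximity matrices, which is consistent with the differing maxima reported for RFC and IRFC.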
Number of units in clusters and the Silhouette score for IRFC for different initial labels for the iris data
| Initial label method | Setosa | Versicolor | Virginica | Silhouette score |
|---|---|---|---|---|
| Ground truth | 50 | 50 | 50 | 0.759 |
| Breiman and Cutler | 50 | 66 | 34 | 0.834 |
| Purposeful clustering | 50 | 54 | 46 | 0.861 |
| | 34 | 74 | 42 | 0.753 |
FIGURE 5 Scatterplots of petal length versus petal width for the iris data for different initial labeling strategies, all assuming 3 clusters. Ground truth clusters are shown in the upper left