Ali Seyed Shirkhorshidi, Saeed Aghabozorgi, Teh Ying Wah.
Abstract
Similarity or distance measures are core components of distance-based clustering algorithms, which place similar data points in the same cluster and dissimilar or distant data points in different clusters. The performance of similarity measures has mostly been studied in two- or three-dimensional spaces; to the best of our knowledge, no empirical study has revealed the behavior of similarity measures on high-dimensional datasets. To fill this gap, this study proposes a technical framework to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility, fifteen publicly available datasets were used, so that future distance measures can be evaluated and compared with the results of the measures discussed in this work. The datasets were classified into low- and high-dimensional categories to study the performance of each measure against each category. This research should help the research community identify suitable distance measures for datasets and facilitate the comparison and evaluation of newly proposed similarity or distance measures with traditional ones.
Year: 2015 PMID: 26658987 PMCID: PMC4686108 DOI: 10.1371/journal.pone.0144059
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Similarity Measures for continuous data (in time complexity, n is the number of dimensions of x and y).
| Distance Measure | Equation | Time complexity | Advantages | Disadvantages | Applications |
|---|---|---|---|---|---|
| Euclidean Distance | | O(n) | Very common, easy to compute, and works well with datasets that have compact or isolated clusters. | Sensitive to outliers. | |
| Average Distance | | O(n) | Better than Euclidean distance. | Variables contribute independently to the measure of distance. Redundant values could dominate the similarity between data points. | |
| Weighted Euclidean | | O(n) | The weight matrix allows more important data points to have a greater effect than less important ones. | Same as Average Distance. | Fuzzy |
| Chord | | O(3n) | Can work with un-normalized data. | It is not invariant to linear transformation. | Ecological resemblance detection. |
| Mahalanobis | | | Mahalanobis is a data-driven measure that can ease the distance distortion caused by a linear combination of attributes. | It can be expensive in terms of computation. | Hyperellipsoidal clustering algorithm. |
| Cosine Measure | | O(3n) | Independent of vector length and invariant to rotation. | It is not invariant to linear transformation. | Mostly used in document similarity applications. |
| Manhattan | | O(n) | Common and, like other Minkowski-driven distances, works well with datasets that have compact or isolated clusters. | Sensitive to outliers. | |
| Mean Character Difference | | O(n) | | | Partitioning and hierarchical clustering algorithms. |
| Index of Association | | O(3n) | - | | Partitioning and hierarchical clustering algorithms. |
| Canberra Metric | | O(n) | | - | Partitioning and hierarchical clustering algorithms. |
| Czekanowski Coefficient | | O(2n) | | - | Partitioning and hierarchical clustering algorithms. |
| Coefficient of Divergence | | O(n) | | - | Partitioning and hierarchical clustering algorithms. |
| Pearson coefficient | | O(2n) | | - | Partitioning and hierarchical clustering algorithms. |
*Points marked by asterisk are compiled based on this article’s experimental results.
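The equations for these measures are not reproduced in the table above. As a rough illustration only, the sketch below implements several of the listed measures using their common textbook definitions; these forms are assumptions and may differ in detail from the equations used in the paper (for example, in how Canberra handles negative values or zero denominators). All function names are hypothetical.

```python
import math


def _dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_distance(x, y):
    """Euclidean distance rescaled by 1/sqrt(n), n = number of dimensions."""
    return euclidean(x, y) / math.sqrt(len(x))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mean_character_difference(x, y):
    """Manhattan distance averaged over the n dimensions."""
    return manhattan(x, y) / len(x)


def cosine_distance(x, y):
    """1 - cosine similarity; undefined for all-zero vectors."""
    return 1.0 - _dot(x, y) / (math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y)))


def chord(x, y):
    """Euclidean distance between the unit-normalized vectors."""
    # sqrt(2 - 2*cos) == sqrt(2 * cosine_distance); max() guards rounding.
    return math.sqrt(max(0.0, 2.0 * cosine_distance(x, y)))


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|), skipping terms where both are zero."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if a != 0 or b != 0)
```

Under these definitions, Average Distance is a constant multiple of Euclidean distance for a fixed n, and Chord is a monotone function of the cosine distance, so each pair ranks point pairs identically. That would be consistent with the near-identical Euclidean/Average and Cosine/Chord columns in the Rand index table, although the paper's exact formulas should be checked before relying on this.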
Fig 1. Overview of experimental study.
Fig 2. Arrangement of experiments.
Rand Index values used for ANOVA test (in the table HAverage stands for Hierarchical Average algorithm and HSingle stands for Hierarchical Single link).
| Dataset \ Distance/Similarity Measure | Euclidean | Average | Cosine | Chord | Mahalanobis | Canberra | CoeffDiv | Czekan | IndOfAssoc | Manhattan | MCharDiff | Pearson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| k-Means | ||||||||||||
| sensor_2 | 0.722 | 0.733 | 0.659 | 0.659 | 0.725 | 0.744 | 0.741 | 0.765 | 0.662 | 0.729 | 0.729 | 0.403 |
| Aggregation | 0.929 | 0.929 | 0.798 | 0.799 | 0.927 | 0.921 | 0.904 | 0.949 | 0.799 | 0.927 | 0.927 | 0.636 |
| Compound | 0.919 | 0.914 | 0.746 | 0.746 | 0.926 | 0.890 | 0.908 | 0.886 | 0.744 | 0.906 | 0.904 | 0.497 |
| Flame | 0.756 | 0.756 | 0.569 | 0.569 | 0.750 | 0.716 | 0.498 | 0.710 | 0.557 | 0.750 | 0.750 | 0.536 |
| Pathbased | 0.750 | 0.750 | 0.639 | 0.639 | 0.758 | 0.735 | 0.733 | 0.746 | 0.637 | 0.748 | 0.748 | 0.635 |
| R15 | 0.999 | 0.999 | 0.949 | 0.948 | 0.999 | 0.999 | 0.998 | 0.998 | 0.947 | 0.998 | 0.998 | 0.552 |
| Spiral | 0.554 | 0.554 | 0.562 | 0.562 | 0.555 | 0.550 | 0.552 | 0.553 | 0.562 | 0.556 | 0.556 | 0.496 |
| D31 | 0.994 | 0.992 | 0.956 | 0.956 | 0.995 | 0.992 | 0.992 | 0.994 | 0.956 | 0.994 | 0.994 | 0.528 |
| Iris | 0.880 | 0.880 | 0.966 | 0.966 | 0.880 | 0.942 | 0.950 | 0.927 | 0.958 | 0.874 | 0.874 | 0.776 |
| sensor_4 | 0.612 | 0.624 | 0.637 | 0.637 | 0.619 | 0.745 | 0.709 | 0.737 | 0.649 | 0.726 | 0.728 | 0.670 |
| Data_User_Modeling | 0.725 | 0.725 | 0.668 | 0.668 | 0.719 | 0.711 | 0.706 | 0.713 | 0.668 | 0.712 | 0.711 | 0.657 |
| Seeds | 0.876 | 0.874 | 0.884 | 0.884 | 0.876 | 0.859 | 0.782 | 0.891 | 0.890 | 0.872 | 0.872 | 0.359 |
| Glass | 0.741 | 0.742 | 0.737 | 0.740 | 0.732 | 0.604 | 0.602 | 0.734 | 0.732 | 0.734 | 0.731 | 0.342 |
| sensor_24 | 0.610 | 0.615 | 0.614 | 0.617 | 0.596 | 0.618 | 0.621 | 0.613 | 0.610 | 0.604 | 0.611 | 0.626 |
| Libras movement | 0.914 | 0.917 | 0.913 | 0.917 | 0.915 | 0.911 | 0.914 | 0.910 | 0.913 | 0.914 | 0.912 | 0.918 |
| k-Medoids | ||||||||||||
| sensor_2 | 0.777 | 0.736 | 0.661 | 0.661 | 0.729 | 0.804 | 0.806 | 0.797 | 0.675 | 0.785 | 0.796 | 0.403 |
| Aggregation | 0.949 | 0.949 | 0.790 | 0.790 | 0.950 | 0.928 | 0.901 | 0.958 | 0.787 | 0.941 | 0.953 | 0.636 |
| Compound | 0.925 | 0.911 | 0.734 | 0.733 | 0.920 | 0.890 | 0.890 | 0.900 | 0.740 | 0.916 | 0.913 | 0.497 |
| Flame | 0.762 | 0.762 | 0.538 | 0.538 | 0.756 | 0.705 | 0.498 | 0.716 | 0.565 | 0.744 | 0.744 | 0.536 |
| Pathbased | 0.746 | 0.746 | 0.606 | 0.606 | 0.756 | 0.743 | 0.745 | 0.745 | 0.667 | 0.741 | 0.741 | 0.635 |
| R15 | 0.999 | 0.999 | 0.947 | 0.945 | 0.988 | 0.998 | 0.988 | 0.998 | 0.947 | 0.999 | 0.998 | 0.552 |
| Spiral | 0.555 | 0.554 | 0.555 | 0.555 | 0.555 | 0.571 | 0.555 | 0.557 | 0.551 | 0.556 | 0.564 | 0.496 |
| D31 | 0.994 | 0.992 | 0.956 | 0.956 | 0.992 | 0.990 | 0.988 | 0.991 | 0.956 | 0.991 | 0.994 | 0.528 |
| Iris | 0.912 | 0.912 | 0.966 | 0.966 | 0.824 | 0.927 | 0.950 | 0.906 | 0.950 | 0.880 | 0.880 | 0.776 |
| sensor_4 | 0.707 | 0.711 | 0.711 | 0.711 | 0.656 | 0.740 | 0.722 | 0.709 | 0.690 | 0.696 | 0.716 | 0.656 |
| Data_User_Modeling | 0.725 | 0.712 | 0.654 | 0.654 | 0.728 | 0.285 | 0.285 | 0.285 | 0.646 | 0.734 | 0.745 | 0.659 |
| Seeds | 0.874 | 0.874 | 0.842 | 0.842 | 0.798 | 0.872 | 0.771 | 0.876 | 0.865 | 0.867 | 0.867 | 0.359 |
| Glass | 0.735 | 0.736 | 0.738 | 0.732 | 0.711 | 0.633 | 0.582 | 0.737 | 0.735 | 0.737 | 0.739 | 0.342 |
| sensor_24 | 0.624 | 0.623 | 0.623 | 0.622 | 0.588 | 0.652 | 0.634 | 0.630 | 0.629 | 0.620 | 0.617 | 0.613 |
| Libras movement | 0.907 | 0.909 | 0.908 | 0.905 | 0.720 | 0.897 | 0.905 | 0.901 | 0.906 | 0.904 | 0.904 | 0.907 |
| HSingle | ||||||||||||
| sensor_2 | 0.432 | 0.432 | 0.355 | 0.355 | 0.432 | 0.432 | 0.432 | 0.431 | 0.365 | 0.432 | 0.432 | 0.405 |
| Aggregation | 0.926 | 0.926 | 0.574 | 0.574 | 0.926 | 0.619 | 0.927 | 0.927 | 0.550 | 0.926 | 0.926 | 0.635 |
| Compound | 0.890 | 0.890 | 0.415 | 0.415 | 0.896 | 0.895 | 0.898 | 0.891 | 0.415 | 0.712 | 0.712 | 0.497 |
| Flame | 0.541 | 0.541 | 0.522 | 0.522 | 0.541 | 0.531 | 0.531 | 0.541 | 0.522 | 0.541 | 0.541 | 0.538 |
| Pathbased | 0.338 | 0.338 | 0.362 | 0.362 | 0.340 | 0.339 | 0.338 | 0.338 | 0.362 | 0.338 | 0.338 | 0.635 |
| R15 | 0.910 | 0.910 | 0.817 | 0.817 | 0.910 | 0.856 | 0.857 | 0.856 | 0.817 | 0.911 | 0.911 | 0.574 |
| Spiral | 1.000 | 1.000 | 0.383 | 0.383 | 1.000 | 0.781 | 0.781 | 0.781 | 0.383 | 1.000 | 1.000 | 0.497 |
| D31 | 0.779 | 0.779 | 0.818 | 0.818 | 0.754 | 0.740 | 0.731 | 0.730 | 0.518 | 0.755 | 0.755 | 0.536 |
| Iris | 0.777 | 0.777 | 0.772 | 0.772 | 0.343 | 0.753 | 0.753 | 0.772 | 0.772 | 0.776 | 0.776 | 0.772 |
| sensor_4 | 0.341 | 0.341 | 0.345 | 0.345 | 0.346 | 0.451 | 0.339 | 0.333 | 0.345 | 0.338 | 0.338 | 0.651 |
| Data_User_Modeling | 0.309 | 0.309 | 0.301 | 0.301 | 0.304 | 0.302 | 0.302 | 0.305 | 0.302 | 0.299 | 0.299 | 0.311 |
| Seeds | 0.357 | 0.357 | 0.340 | 0.340 | 0.337 | 0.340 | 0.337 | 0.340 | 0.340 | 0.340 | 0.340 | 0.358 |
| Glass | 0.304 | 0.304 | 0.308 | 0.308 | 0.309 | 0.293 | 0.294 | 0.308 | 0.308 | 0.308 | 0.308 | 0.342 |
| sensor_24 | 0.347 | 0.347 | 0.346 | 0.346 | 0.353 | 0.346 | 0.347 | 0.346 | 0.346 | 0.345 | 0.345 | 0.349 |
| Libras movement | 0.187 | 0.187 | 0.202 | 0.202 | 0.131 | 0.183 | 0.183 | 0.187 | 0.192 | 0.187 | 0.187 | 0.296 |
| HAverage | ||||||||||||
| sensor_2 | 0.466 | 0.466 | 0.634 | 0.634 | 0.506 | 0.466 | 0.729 | 0.716 | 0.634 | 0.466 | 0.466 | 0.404 |
| Aggregation | 1.000 | 1.000 | 0.778 | 0.778 | 0.997 | 0.930 | 0.948 | 0.927 | 0.778 | 0.991 | 0.991 | 0.643 |
| Compound | 0.921 | 0.921 | 0.676 | 0.676 | 0.921 | 0.850 | 0.852 | 0.829 | 0.697 | 0.933 | 0.933 | 0.511 |
| Flame | 0.721 | 0.721 | 0.503 | 0.503 | 0.847 | 0.512 | 0.529 | 0.501 | 0.503 | 0.689 | 0.689 | 0.538 |
| Pathbased | 0.738 | 0.738 | 0.699 | 0.699 | 0.754 | 0.438 | 0.377 | 0.708 | 0.629 | 0.724 | 0.724 | 0.635 |
| R15 | 0.999 | 0.999 | 0.917 | 0.917 | 0.999 | 0.981 | 0.963 | 0.990 | 0.914 | 0.998 | 0.998 | 0.566 |
| Spiral | 0.537 | 0.537 | 0.528 | 0.528 | 0.557 | 0.424 | 0.499 | 0.498 | 0.428 | 0.540 | 0.540 | 0.497 |
| D31 | 0.994 | 0.994 | 0.950 | 0.950 | 0.996 | 0.977 | 0.979 | 0.986 | 0.952 | 0.996 | 0.996 | 0.537 |
| Iris | 0.892 | 0.892 | 0.772 | 0.772 | 0.343 | 0.753 | 0.753 | 0.778 | 0.772 | 0.886 | 0.886 | 0.776 |
| sensor_4 | 0.338 | 0.338 | 0.561 | 0.561 | 0.338 | 0.479 | 0.479 | 0.480 | 0.544 | 0.376 | 0.376 | 0.653 |
| Data_User_Modeling | 0.659 | 0.659 | 0.301 | 0.301 | 0.337 | 0.302 | 0.302 | 0.307 | 0.309 | 0.645 | 0.645 | 0.594 |
| Seeds | 0.887 | 0.887 | 0.691 | 0.691 | 0.337 | 0.879 | 0.581 | 0.802 | 0.688 | 0.802 | 0.802 | 0.362 |
| Glass | 0.329 | 0.329 | 0.570 | 0.570 | 0.309 | 0.328 | 0.323 | 0.415 | 0.415 | 0.415 | 0.415 | 0.369 |
| sensor_24 | 0.353 | 0.353 | 0.538 | 0.538 | 0.347 | 0.498 | 0.516 | 0.518 | 0.521 | 0.428 | 0.428 | 0.446 |
| Libras movement | 0.886 | 0.886 | 0.892 | 0.892 | 0.131 | 0.582 | 0.613 | 0.827 | 0.844 | 0.861 | 0.861 | 0.886 |
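For readers unfamiliar with the metric in this table, the following is a minimal sketch of the standard pair-counting Rand index: the fraction of point pairs on which two labelings agree (both place the pair together, or both place it apart). It assumes the usual unadjusted definition; the function name is hypothetical and this is not the authors' code.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if len(labels_true) < 2:
        raise ValueError("need at least two points to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair counts as agreement when both labelings group it
        # together or both keep it apart.
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total
```

The index depends only on co-membership, so relabeling clusters does not change it: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0. The pairwise loop is quadratic in the number of points, which is acceptable for a sketch but slow for the larger sensor datasets.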
ANOVA results for k-means.
| K_means | SS | df | MS | F | Prob>F |
|---|---|---|---|---|---|
| Columns | 0.68317 | 11 | 0.06211 | 2.96 | 0.0013 |
| Error | 3.52624 | 168 | 0.02099 | | |
| Total | 4.20942 | 179 | | | |
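The entries in this table follow the usual one-way ANOVA arithmetic: each mean square is SS/df, and F is the ratio of the Columns mean square to the Error mean square. The sketch below, with hypothetical function names, computes these quantities from raw groups and also rechecks F from the tabulated sums of squares. It assumes a one-way layout with the 12 measures as groups and 15 datasets per group, which matches the degrees of freedom 11 and 168 shown here; the p-value is not computed.

```python
def one_way_anova(groups):
    """Return (ss_between, df_between, ss_error, df_error, f_stat)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_error = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        # Between-group variation: group mean against the grand mean.
        ss_between += len(g) * (mean - grand_mean) ** 2
        # Within-group (error) variation: values against their group mean.
        ss_error += sum((v - mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_error = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_error / df_error)
    return ss_between, df_between, ss_error, df_error, f_stat


def f_from_table(ss_columns, df_columns, ss_error, df_error):
    """Recompute F from tabulated sums of squares and degrees of freedom."""
    return (ss_columns / df_columns) / (ss_error / df_error)
```

With the values above, `f_from_table(0.68317, 11, 3.52624, 168)` rounds to 2.96, matching the reported F.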
ANOVA results for HSingle.
| HSingle | SS | df | MS | F | Prob>F |
|---|---|---|---|---|---|
| Columns | 0.3194 | 11 | 0.02903 | 2.38 | 0.0095 |
| Error | 1.8788 | 154 | 0.0122 | | |
| Total | 10.2233 | 179 | | | |
Dataset Details.
| Dataset Name | Dimensions | Clusters | Vectors |
|---|---|---|---|
| Aggregation | 2 | 7 | 788 |
| Compound | 2 | 6 | 399 |
| D31 | 2 | 31 | 3100 |
| Flame | 2 | 2 | 240 |
| Pathbased | 2 | 3 | 300 |
| R15 | 2 | 15 | 600 |
| Sensor_2 | 2 | 4 | 5456 |
| Spiral | 2 | 3 | 312 |
| Iris | 4 | 3 | 150 |
| Sensor_4 | 4 | 4 | 5456 |
| Data_User_Modeling | 5 | 4 | 258 |
| Seeds | 7 | 3 | 210 |
| Glass | 9 | 7 | 214 |
| Sensor_24 | 24 | 4 | 5456 |
| Libras movement | 90 | 15 | 360 |
Fig 3. K-means color scale table for normalized Rand index values (green represents the highest value and changes to red for the lowest).
Fig 4. K-medoids color scale table for normalized Rand index values (green represents the highest value and changes to red for the lowest).
Fig 5. Sample box charts for k-means iteration counts, built from normalized results of 100 runs of the algorithm for each similarity measure and dataset.
Fig 6. Color scale table for iteration count mean and variance (green represents the lowest value and changes to red for the greatest iteration count).
Fig 7. Bar chart of normalized Rand index values for selected datasets using the Single-link algorithm.
Fig 8. Bar chart of normalized Rand index values for selected datasets using the Group Average algorithm.
Fig 9. Color scale table of normalized Rand index values for the Single-link method (green represents the highest value and changes to red for the lowest).
Fig 10. Color scale table of normalized Rand index values for Group Average (green represents the highest value and changes to red for the lowest).
Fig 11. Overall RI average.
Fig 12. Average RI for four algorithms.
ANOVA results for k-medoids.
| K_medoids | SS | df | MS | F | Prob>F |
|---|---|---|---|---|---|
| Columns | 0.69565 | 11 | 0.06324 | 2.62 | 0.0042 |
| Error | 4.05766 | 168 | 0.02415 | | |
| Total | 4.75331 | 179 | | | |
ANOVA results for HAverage.
| HAverage | SS | df | MS | F | Prob>F |
|---|---|---|---|---|---|
| Columns | 0.47251 | 11 | 0.04296 | 2.62 | 0.0043 |
| Error | 2.52617 | 154 | 0.0164 | | |
| Total | 8.91175 | 175 | | | |