Arthur Flexer, Dominik Schnitzer.
Abstract
The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points, while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation between different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification accuracy. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness.
Keywords: Concentration of distances; Fractional norms; High-dimensional data analysis; Hubness
Year: 2015 PMID: 26640321 PMCID: PMC4567076 DOI: 10.1016/j.neucom.2014.11.084
Source DB: PubMed Journal: Neurocomputing ISSN: 0925-2312 Impact factor: 5.719
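As a rough illustration of the quantities named in the abstract, the following Python sketch computes an ℓp dissimilarity for arbitrary p > 0 and a hubness score. It assumes that hubness is measured as the skewness of the k-occurrence distribution (how often each point appears among the k nearest neighbors of other points), a common choice in the hubness literature; the abstract itself does not state the measure. The function names and the default k = 5 are illustrative assumptions, not taken from the paper.

```python
import random


def lp_distance(x, y, p):
    """l_p dissimilarity between two vectors for any p > 0.

    For 0 < p < 1 (a "fractional norm") the triangle inequality no longer
    holds, so the result is a dissimilarity rather than a true metric.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def k_occurrence(data, p, k=5):
    """Count how often each point is among the k nearest neighbors of another point."""
    n = len(data)
    counts = [0] * n
    for i in range(n):
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: lp_distance(data[i], data[j], p),
        )
        for j in neighbors[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment; larger positive values mean stronger hubness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic Gaussian data, for illustration only
    data = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(100)]
    for p in (0.5, 1.0, 2.0):
        print(p, round(skewness(k_occurrence(data, p)), 2))
```

The brute-force neighbor search is quadratic in the number of points, which is acceptable for a demonstration but not for data sets of the sizes listed in the tables.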
Data sets, their dimensionality d and size m, number of classes c, hubness, classification rates (C) in the original Euclidean space (ℓ2), actual maximum (max C) and estimated maximum based on anti-hubs (Aest) and hubs (Hest), each with the corresponding ℓp norm. C values better than or equal to those for the original data are given in bold; an asterisk indicates that the respective method was able to find the actual maximum.
| Data set | d | m | c | Hubness | ℓ (orig) | C (orig) | ℓ (max C) | max C | ℓ (Aest) | C (Aest) | ℓ (Hest) | C (Hest) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dexter | 20 000 | 300 | 2 | 2.9 | 2 | 64.3 | 1.75 | | 2 | | 2.25 | 52.0 |
| Gisette | 5000 | 6000 | 2 | 2.9 | 2 | 93.5 | 0.5 | | 1.5 | | 1.25 | |
| Leeds Butt. | 36 000 | 832 | 10 | 3.5 | 2 | 50.4 | 1.5 | | 1.25 | | 1.75 | |
| 17 Flowers | 36 000 | 1360 | 17 | 3.9 | 2 | 42.3 | 1 | | 1 | ⁎ | 1 | ⁎ |
| Splice | 60 | 1000 | 2 | 5.6 | 2 | 69.4 | 0.5 | | 0.25 | | 0.25 | |
| | 49 820 | 969 | 17 | 14.6 | 2 | 10.3 | 4 | | 4 | ⁎ | 4 | ⁎ |
| Protein | 357 | 6621 | 3 | 43.1 | 2 | 52.1 | 1 | | 1 | ⁎ | 1 | ⁎ |
Fig. 1. The minimum in anti-hub (A) and hub (H) occurrence while changing the ℓp norm is closely related to the maximum kNN classification rate (C). See Section 5.1.
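The selection idea behind Fig. 1 can be sketched as a sweep over candidate values of p, counting anti-hubs and hubs for each and taking the p with the fewest. This is a minimal sketch under stated assumptions: an anti-hub is taken to be a point with k-occurrence zero and a hub a point with k-occurrence above `hub_factor * k`. Both thresholds, the names `knn_counts`, `hub_stats` and `pick_p`, and the candidate grid are hypothetical and may differ from the definitions in Section 5.1.

```python
import random


def knn_counts(data, p, k):
    """k-occurrence of every point under the l_p dissimilarity."""
    n = len(data)
    counts = [0] * n
    for i, x in enumerate(data):
        # The 1/p root is monotone, so ranking by the plain sum of powers
        # yields the same neighbor order as the full l_p dissimilarity.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum(abs(a - b) ** p for a, b in zip(x, data[j])),
        )
        for j in order[:k]:
            counts[j] += 1
    return counts


def hub_stats(data, p, k=5, hub_factor=2):
    """Return (number of anti-hubs, number of hubs); thresholds are assumptions."""
    counts = knn_counts(data, p, k)
    anti_hubs = sum(c == 0 for c in counts)
    hubs = sum(c > hub_factor * k for c in counts)
    return anti_hubs, hubs


def pick_p(data, candidates, k=5, hub_factor=2):
    """Pick the p with the fewest anti-hubs and the p with the fewest hubs.

    Uses no class labels. Ties go to the earliest candidate in the list.
    """
    stats = {p: hub_stats(data, p, k, hub_factor) for p in candidates}
    a_est = min(candidates, key=lambda p: stats[p][0])
    h_est = min(candidates, key=lambda p: stats[p][1])
    return a_est, h_est, stats


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic data, for illustration only
    data = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(80)]
    a_est, h_est, stats = pick_p(data, [0.25, 0.5, 1.0, 2.0, 4.0])
    print("fewest anti-hubs at p =", a_est, "| fewest hubs at p =", h_est)
```

Because the two criteria are evaluated separately, they can disagree on the chosen p, which is consistent with the anti-hub and hub estimates being reported in separate columns.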
Data sets, classification rates (C) in percent in the original Euclidean space (orig), actual maximum (max C), estimated maximum based on anti-hubs (Aest) and hubs (Hest), and classification rates based on secondary measures computed with MP, LS and SNN. The best classification results per data set are printed in bold. Classification results that are significantly better than those achieved in the original Euclidean space are marked with an asterisk (McNemar test, 5% significance level, one degree of freedom). The last line gives the average gain in absolute percentage points relative to using the original ℓ2 norm.
| Data set | Orig | max C | Aest | Hest | MP | LS | SNN |
|---|---|---|---|---|---|---|---|
| Dexter | 64.3 | 64.3 | 52.0 | 68.0 | 70.3⁎ | 66.0 | |
| Gisette | 93.5 | 93.8⁎ | 93.7 | 93.1 | 93.0 | 90.3 | |
| Leeds Butt. | 50.4 | 51.7 | 51.0 | 51.0 | 58.8⁎ | 42.2 | |
| 17 Flowers | 42.3 | 43.1 | 43.1 | 43.1 | 50.6⁎ | 36.5 | |
| Splice | 69.4 | 77.7⁎ | 77.5⁎ | 77.5⁎ | 77.2⁎ | 69.3 | |
| 10.3 | 19.6⁎ | 19.6⁎ | 19.6⁎ | 45.6⁎ | 17.3⁎ | ||
| Protein | 52.1 | 49.1 | 50.1 | 43.9 | |||
| Ave. gain | – | 5.37 | 3.37 | 1.6 | 8.69 | 9.29 | −2.40 |
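The significance marks rely on a McNemar test at the 5% level with one degree of freedom. A minimal sketch of such a test on paired predictions follows. It assumes the chi-square form with continuity correction, since the source does not say whether a correction was applied; the function name `mcnemar` and the toy labels are illustrative only.

```python
import math


def mcnemar(pred_a, pred_b, truth, correction=True):
    """Return (chi-square statistic, p-value) for two classifiers on the same items."""
    # b: A correct and B wrong; c: A wrong and B correct.
    # Only these discordant pairs enter the statistic.
    b = sum(pa == t and pb != t for pa, pb, t in zip(pred_a, pred_b, truth))
    c = sum(pa != t and pb == t for pa, pb, t in zip(pred_a, pred_b, truth))
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of a difference
    diff = abs(b - c) - (1.0 if correction else 0.0)
    stat = max(diff, 0.0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    truth = [1] * 20
    new_space = [1] * 20             # toy predictions, all correct
    original = [0] * 10 + [1] * 10   # toy predictions, half wrong
    stat, p_value = mcnemar(new_space, original, truth)
    print("significant at 5%:", p_value < 0.05)
```

With one degree of freedom, a p-value below 0.05 corresponds to a statistic above roughly 3.84.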
Fig. 2. Comparison of classification accuracy (y-axis) for all seven data sets (x-axis) and all six methods, shown as six bars per data set. Shown are differences in absolute percentage points relative to using the original ℓ2 norm.
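The gain plotted per method is a difference in absolute percentage points from the ℓ2 result, and the "Ave. gain" row of the table above averages it over the seven data sets. The sketch below recomputes this average for SNN. It assumes that the SNN value is the last entry in each data row of that table, following the column order given in the caption; the function name `average_gain` and the placeholder key for the row without a data set name are illustrative.

```python
# Classification rates in percent, copied from the table above.
ORIG = {
    "Dexter": 64.3,
    "Gisette": 93.5,
    "Leeds Butt.": 50.4,
    "17 Flowers": 42.3,
    "Splice": 69.4,
    "(name missing)": 10.3,  # placeholder key: this row has no data set name
    "Protein": 52.1,
}
SNN = {
    "Dexter": 66.0,
    "Gisette": 90.3,
    "Leeds Butt.": 42.2,
    "17 Flowers": 36.5,
    "Splice": 69.3,
    "(name missing)": 17.3,
    "Protein": 43.9,
}


def average_gain(orig, method):
    """Mean difference in absolute percentage points relative to the original space."""
    gains = [method[name] - orig[name] for name in orig]
    return sum(gains) / len(gains)


print(round(average_gain(ORIG, SNN), 2))  # -2.4
```

The result agrees with the −2.40 reported for SNN in the "Ave. gain" row.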