A comprehensive empirical comparison of hubness reduction in high-dimensional spaces.

Roman Feldbauer, Arthur Flexer

Abstract

Hubness is an aspect of the curse of dimensionality related to the distance concentration effect. Hubs occur in high-dimensional data spaces as objects that are particularly often among the nearest neighbors of other objects. Conversely, other data objects become antihubs, which are rarely or never nearest neighbors to other objects. Many machine learning algorithms rely on nearest neighbor search and some form of distance measurement, both of which are impaired by high hubness. Degraded performance due to hubness has been reported for various tasks such as classification, clustering, regression, visualization, recommendation, retrieval and outlier detection. Several hubness reduction methods based on different paradigms have previously been developed. Local and global scaling as well as shared neighbors approaches aim at repairing asymmetric neighborhood relations. Global and localized centering try to eliminate spatial centrality, while the related global and local dissimilarity measures are based on density gradient flattening. Additional methods and alternative dissimilarity measures that were argued to mitigate detrimental effects of distance concentration also influence the related hubness phenomenon. In this paper, we present a large-scale empirical evaluation of all available unsupervised hubness reduction methods and dissimilarity measures. We investigate several aspects of hubness reduction as well as its influence on data semantics, which we measure via nearest neighbor classification. Scaling and density gradient flattening methods consistently improve evaluation measures such as hubness and classification accuracy for data sets from a wide range of domains, while centering approaches achieve the same only under specific settings.
© The Author(s) 2018.
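
The abstract's central quantities can be made concrete in code. Below is a minimal, illustrative Python sketch (not the authors' implementation; the function names and the choice k = 10 are assumptions) that measures hubness as the skewness of the k-occurrence distribution and applies local scaling, one of the scaling-based reduction paradigms named above, to convert primary distances into secondary distances.

import numpy as np
from scipy.stats import skew
from scipy.spatial.distance import pdist, squareform

def k_occurrence(dist, k):
    # Count how often each object appears among the k nearest
    # neighbors of all other objects (self-neighborhood excluded).
    n = dist.shape[0]
    d = dist.copy()
    np.fill_diagonal(d, np.inf)           # an object is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per object
    return np.bincount(nn.ravel(), minlength=n)

def hubness(dist, k=10):
    # Hubness score: skewness of the k-occurrence distribution.
    # Near 0 means no hubness; large positive values indicate hubs.
    return skew(k_occurrence(dist, k))

def local_scaling(dist, k=10):
    # Secondary distances ls(x, y) = 1 - exp(-d(x, y)^2 / (sigma_x * sigma_y)),
    # where sigma_x is the distance from x to its k-th nearest neighbor.
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    sigma = np.sort(d, axis=1)[:, k - 1]  # k-th NN distance per object
    return 1.0 - np.exp(-dist ** 2 / np.outer(sigma, sigma))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 100))   # high-dimensional Gaussian data
    D = squareform(pdist(X))              # primary Euclidean distances
    print("hubness before:", hubness(D))
    print("hubness after local scaling:", hubness(local_scaling(D)))

On i.i.d. Gaussian data of this dimensionality the hubness score of the primary distances is typically clearly positive, and local scaling usually moves it toward zero, mirroring the consistent improvements the paper reports for scaling-based methods.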

Keywords:  Classification; Curse of dimensionality; Hubness; Nearest neighbors; Secondary distances

Year: 2018        PMID: 32647403        PMCID: PMC7327987        DOI: 10.1007/s10115-018-1205-y

Source DB: PubMed        Journal: Knowl Inf Syst        ISSN: 0219-3116        Impact factor: 2.822


References: 7 in total

1.  A comparison of methods for multiclass support vector machines.

Authors:  Chih-Wei Hsu; Chih-Jen Lin
Journal:  IEEE Trans Neural Netw       Date:  2002

2.  Stability of ranked gene lists in large microarray analysis studies.

Authors:  Gregor Stiglic; Peter Kokol
Journal:  J Biomed Biotechnol       Date:  2010-06-27

3.  Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings.

Authors:  Betul Erdogdu Sakar; M Erdem Isenkul; C Okan Sakar; Ahmet Sertbas; Fikret Gurgen; Sakir Delil; Hulya Apaydin; Olcay Kursun
Journal:  IEEE J Biomed Health Inform       Date:  2013-07       Impact factor: 5.772

4.  Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome.

Authors:  Clara Higuera; Katheleen J Gardiner; Krzysztof J Cios
Journal:  PLoS One       Date:  2015-06-25       Impact factor: 3.240

5.  Mutual proximity graphs for improved reachability in music recommendation.

Authors:  Arthur Flexer; Jeff Stevens
Journal:  J New Music Res       Date:  2017-08-03       Impact factor: 1.143

6.  Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning.

Authors:  Samuel A Danziger; Roberta Baronio; Lydia Ho; Linda Hall; Kirsty Salmon; G Wesley Hatfield; Peter Kaiser; Richard H Lathrop
Journal:  PLoS Comput Biol       Date:  2008-09-04       Impact factor: 4.475

7.  Choosing ℓp norms in high-dimensional spaces based on hub analysis.

Authors:  Arthur Flexer; Dominik Schnitzer
Journal:  Neurocomputing       Date:  2015-12-02       Impact factor: 5.719
