Abstract
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of classical methods is not guaranteed under the classical asymptotic premise, in which the sample size grows without bound while the dimensionality of observations stays fixed. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still use classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either unaware of other existing machineries in the field or unwilling to try them out. The primary goal of this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. Where appropriate, readers are referred to relevant review articles for more information on a specific subject.
Keywords: G-analysis; Kolmogorov asymptotics; curse of dimensionality; double asymptotics; high-dimensional analysis; random matrix theory; ridge estimation; shrinkage; sparsity
Year: 2016 PMID: 27081307 PMCID: PMC4830639 DOI: 10.4137/CIN.S30804
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Expected error of the Euclidean-distance classifier versus dimension for n0 = n1 = 100. The solid curve is obtained from theoretical results. The small circles are the result of simulation experiments for p = 10, 65, 200, 535, and 1,400.
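The trend in Figure 1 can be reproduced with a small Monte Carlo experiment. The sketch below is a hypothetical setup, not the paper's exact simulation: two Gaussian classes with identity covariance whose means differ by a fixed amount `delta` in one coordinate, so the signal stays constant while the dimension p grows with n0 = n1 = 100 fixed. The Euclidean-distance classifier assigns a test point to the class with the nearer sample centroid; as p increases, the noise accumulated in estimating the centroids swamps the fixed separation and the error climbs toward chance.

```python
import numpy as np

def euclid_error(p, n0=100, n1=100, delta=2.0, trials=200, seed=0):
    """Monte Carlo estimate of the expected error of the Euclidean-distance
    (nearest sample-centroid) classifier for two N(mu_k, I_p) classes whose
    means differ by `delta` in the first coordinate only. Hypothetical setup
    chosen for illustration, not the article's exact simulation design."""
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(p)
    mu1[0] = delta  # class 0 mean is the origin; class 1 shifted by delta
    errors = 0
    for _ in range(trials):
        # Training samples and their centroids.
        X0 = rng.standard_normal((n0, p))
        X1 = rng.standard_normal((n1, p)) + mu1
        c0, c1 = X0.mean(axis=0), X1.mean(axis=0)
        # One test point drawn from class 1; classify by nearer centroid.
        x = rng.standard_normal(p) + mu1
        if np.sum((x - c1) ** 2) >= np.sum((x - c0) ** 2):
            errors += 1  # misclassified as class 0
    return errors / trials

# Error should rise with dimension while n0, n1, and delta stay fixed,
# mirroring the monotone curve in Figure 1.
err_low, err_high = euclid_error(10), euclid_error(1400)
```

The key design point is that only the centroids, not the true means, are available to the classifier: the p/n estimation-noise term is exactly what makes the expected error dimension-dependent.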
Figure 2. (A)–(C) Comparing the empirical spectral distribution of one realization of the covariance matrix with the limiting spectral distribution: (A) p = 1,000, n = 2,000; (B) p = 100, n = 200; (C) p = 20, n = 40. (D)–(F) Comparing the average empirical spectral distribution of N realizations of the covariance matrix with the limiting spectral distribution: (D) p = 1,000, n = 2,000, and N = 10; (E) p = 100, n = 200, and N = 100; (F) p = 20, n = 40, and N = 10,000. Marčenko and Pastur71 obtained the closed-form solution of the limiting spectral distribution by using the double-asymptotic framework.
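The comparison in Figure 2 can be sketched numerically: draw standard normal data with a fixed aspect ratio y = p/n, compute the eigenvalues of the sample covariance matrix, and compare their histogram with the closed-form Marčenko–Pastur density. The function names and the choice p = 1,000, n = 2,000 (matching panel (A)) are illustrative assumptions.

```python
import numpy as np

def mp_density(x, y):
    """Marcenko-Pastur density for aspect ratio y = p/n (0 < y < 1) and unit
    population variance, supported on [(1-sqrt(y))^2, (1+sqrt(y))^2]."""
    a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    d = np.zeros_like(x, dtype=float)
    inside = (x > a) & (x < b)
    d[inside] = np.sqrt((b - x[inside]) * (x[inside] - a)) / (2 * np.pi * y * x[inside])
    return d

def sample_covariance_spectrum(p, n, seed=0):
    """Eigenvalues of S = X X^T / n for a p x n matrix of iid N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((p, n))
    S = X @ X.T / n
    return np.linalg.eigvalsh(S)

# One realization at p/n = 1/2, as in Figure 2, panel (A).
eigs = sample_covariance_spectrum(1000, 2000)
hist, edges = np.histogram(eigs, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# The empirical histogram should track the limiting density.
max_dev = np.max(np.abs(hist - mp_density(centers, 0.5)))
```

At this size the eigenvalues already concentrate on the Marčenko–Pastur support; the smaller panels (p = 20, n = 40) show larger single-realization fluctuations, which is why the figure also averages over N realizations.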