
Challenges of Big Data Analysis.

Jianqing Fan, Fang Han, Han Liu.

Abstract

Big Data bring new opportunities to modern society and new challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated because of incidental endogeneity. When these assumptions fail, the result can be wrong statistical inferences and, consequently, wrong scientific conclusions.
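
The spurious correlation named in the abstract can be illustrated with a small simulation. The sketch below is not taken from the article: the sample size, the dimensions, the seed, and the helper names `pearson` and `max_spurious_correlation` are all assumptions chosen for illustration. It draws a response and p predictors independently, so every true correlation is zero, and reports the largest absolute sample correlation, which tends to grow as p increases.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n, p, rng):
    """Largest |correlation| between a response and p predictors.

    All variables are drawn independently (hypothetical setup), so any
    nonzero sample correlation is spurious by construction.
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(pearson(x, y)))
    return best


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 50                  # hypothetical sample size
for p in (10, 100, 2000):
    print(p, round(max_spurious_correlation(n, p, rng), 3))
```

With the sample size held fixed, the printed maximum typically rises with p, which is why a strong sample correlation found by searching over many candidate variables is weak evidence of a real association.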


Keywords:  Big Data; data storage; high dimensional data; incidental endogeneity; large-scale optimization; massive data; massively parallel data processing; noise accumulation; random projection; scalability; spurious correlation

Year:  2014        PMID: 25419469      PMCID: PMC4236847          DOI: 10.1093/nsr/nwt032

Source DB:  PubMed          Journal:  Natl Sci Rev        ISSN: 2053-714X            Impact factor:   17.275


