Analyzing large datasets with bootstrap penalization.

Kuangnan Fang, Shuangge Ma.

Abstract

Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. Applying penalization directly to large datasets demands a "big computer" with high computational power. To improve computational feasibility, we develop bootstrap penalization, which divides one big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a "small computer". The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks, and the approach then proceeds in two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then take a weighted average over multiple bootstrap samples. For data with a large p and a large n, a natural combination of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
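
The large-n, small-to-moderate-p strategy described in the abstract (bootstrap a small number of subjects, fit a penalized estimator, take a weighted average) can be illustrated with a minimal Python sketch. This is not the authors' R package or its API. The function names `lasso_cd` and `bootstrap_penalized` are hypothetical, the choice of the lasso as the penalty is an assumption, and so is the weighting rule (inverse holdout mean squared error), because the abstract does not specify how the weights are formed.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    No intercept is fitted; X and y are assumed to be centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-column mean of squares
    resid = y.astype(float)                # astype copies, so y is untouched
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual
            # (residual with coordinate j's own contribution added back).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def bootstrap_penalized(X, y, lam, m, n_boot, seed=0):
    """Average lasso fits over n_boot bootstrap subsamples of size m.

    ASSUMPTION: each fit is weighted by the inverse of its mean squared
    error on a second, independently drawn subsample. The abstract only
    says "weighted average", so this rule is illustrative.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = np.empty((n_boot, p))
    weights = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(n, size=m, replace=True)    # small bootstrap sample
        betas[b] = lasso_cd(X[idx], y[idx], lam)
        # When m is much smaller than n, this draw rarely overlaps idx.
        hold = rng.choice(n, size=m, replace=True)
        mse = np.mean((y[hold] - X[hold] @ betas[b]) ** 2)
        weights[b] = 1.0 / (mse + 1e-12)
    weights /= weights.sum()
    return weights @ betas                           # weighted average


# Usage on simulated data: n is large, each fit only sees m subjects.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((5000, 10))
beta_true = np.array([2.0, -1.5] + [0.0] * 8)
y_demo = X_demo @ beta_true + 0.5 * rng.standard_normal(5000)
est = bootstrap_penalized(X_demo, y_demo, lam=0.05, m=200, n_boot=20)
print(np.round(est, 2))
```

Each iteration of the loop in `bootstrap_penalized` is independent of the others, so the fits could be dispatched to separate workers, which is the sense in which each small problem only needs a "small computer".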
© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

Keywords:  Bootstrap; Computational feasibility; Large datasets; Penalization

Year:  2016        PMID: 27870109      PMCID: PMC5577005          DOI: 10.1002/bimj.201600052

Source DB:  PubMed          Journal:  Biom J        ISSN: 0323-3847            Impact factor:   1.715


