
Optimal Subsampling for Large Sample Logistic Regression.

HaiYing Wang, Rong Zhu, Ping Ma.

Abstract

For massive data, subsampling algorithms are widely used to reduce the data volume and the computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this paper, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resultant estimator. An alternative minimization criterion is also proposed to further reduce the computational cost. The optimal subsampling probabilities depend on the full data estimate, so we develop a two-step algorithm to approximate the optimal subsampling procedure. This algorithm is computationally efficient and significantly reduces computing time compared with the full data approach. Consistency and asymptotic normality of the estimator from the two-step algorithm are also established. Synthetic and real data sets are used to evaluate the practical performance of the proposed method.
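
The two-step idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: the pilot subsample is drawn uniformly, the second-step probabilities are taken proportional to |y_i - p_i| * ||x_i|| (a form commonly associated with the cheaper alternative criterion), and the final estimate is an inverse-probability-weighted fit on both subsamples combined. The function names `fit_weighted_logistic` and `two_step_subsample` are hypothetical.

```python
import numpy as np


def fit_weighted_logistic(X, y, w, n_iter=25, ridge=1e-8):
    """Maximize the weighted logistic log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        hess += ridge * np.eye(X.shape[1])  # tiny ridge for numerical stability
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


def two_step_subsample(X, y, r0, r, rng):
    """Two-step subsampling estimate (illustrative sketch, see assumptions above)."""
    n = X.shape[0]
    # Step 1 (assumed): uniform pilot subsample gives a rough estimate beta0.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = fit_weighted_logistic(X[idx0], y[idx0], np.ones(r0))
    # Step 2 (assumed form): probabilities proportional to |y_i - p_i| * ||x_i||.
    p = 1.0 / (1.0 + np.exp(-(X @ beta0)))
    score = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=pi)
    # Combine both subsamples, weighting each point by 1 / (its sampling probability).
    idx = np.concatenate([idx0, idx1])
    w = np.concatenate([np.full(r0, float(n)), 1.0 / pi[idx1]])
    return fit_weighted_logistic(X[idx], y[idx], w)


# Usage on synthetic data (all values below are illustrative).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_full = fit_weighted_logistic(X, y, np.ones(n))       # full-data estimate
beta_hat = two_step_subsample(X, y, r0=500, r=2000, rng=rng)  # uses 2,500 draws
```

The inverse-probability weights are what keep the subsample estimate targeting the full-data maximum likelihood estimate even though informative points are oversampled; comparing `beta_hat` with `beta_full` gives a quick check of the approximation.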

Keywords:  A-optimality; Logistic Regression; Massive Data; Optimal Subsampling; Rare Event

Year:  2018        PMID: 30078922      PMCID: PMC6075720          DOI: 10.1080/01621459.2017.1292914

Source DB:  PubMed          Journal:  J Am Stat Assoc        ISSN: 0162-1459            Impact factor:   5.033


  2 in total

1.  Online Decentralized Leverage Score Sampling for Streaming Multidimensional Time Series.

Authors:  Rui Xie; Zengyan Wang; Shuyang Bai; Ping Ma; Wenxuan Zhong
Journal:  Proc Mach Learn Res       Date:  2019-04

2.  Sampling-based estimation for massive survival data with additive hazards model.

Authors:  Lulu Zuo; Haixiang Zhang; HaiYing Wang; Lei Liu
Journal:  Stat Med       Date:  2020-11-03       Impact factor: 2.373

