LOCAL CASE-CONTROL SAMPLING: EFFICIENT SUBSAMPLING IN IMBALANCED DATA SETS.

William Fithian, Trevor Hastie.

Abstract

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to 1 + 1/c if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly (1 + c)/2 times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
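
The accept-reject scheme and the post-hoc correction can be illustrated with a short simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes the acceptance probability is |y − p̃(x)|, where p̃ is the pilot model's fitted probability, and that the correction adds the pilot coefficients back to the subsample fit. The abstract states neither detail explicitly. The function names, the simulated data, and the choice of a small uniform subsample for the pilot are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Unpenalized logistic regression via Newton's method.

    The tiny ridge term only stabilizes the linear solve; the iteration
    still stops where the unpenalized score X^T (y - p) vanishes.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: accept, refit on the subsample, correct."""
    p_pilot = sigmoid(X @ theta_pilot)
    # Assumed acceptance rule: keep points whose response is "surprising"
    # under the pilot, i.e. with probability |y - p_pilot(x)|.
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed post-hoc adjustment: add the pilot coefficients back.
    return theta_sub + theta_pilot, keep

# Simulated imbalanced data (illustrative values, not from the paper).
rng = np.random.default_rng(0)
n = 200_000
theta_true = np.array([-4.0, 1.0, -1.0, 0.5])  # first entry is the intercept
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Independent pilot: fit on the first rows, subsample from the rest.
n_pilot = 5_000
theta_pilot = fit_logistic(X[:n_pilot], y[:n_pilot])
theta_lcc, keep = local_case_control(X[n_pilot:], y[n_pilot:], theta_pilot, rng)

print("fraction of points kept:", keep.mean())
print("corrected estimate:", np.round(theta_lcc, 3))
```

Because each point is accepted or rejected independently given the pilot, the scan over the full data set parallelizes naturally. The variant with acceptance probabilities scaled by c > 1 and reweighting is not shown here.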

Keywords:  Logistic regression; case-control sampling; subsampling

Year:  2014        PMID: 25492979      PMCID: PMC4258397          DOI: 10.1214/14-AOS1220

Source DB:  PubMed          Journal:  Ann Stat        ISSN: 0090-5364            Impact factor:   4.028

