
Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis?

Jing-Hao Xue, Peter Hall.   

Abstract

Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, through oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely-used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.

Year:  2015        PMID: 26353332     DOI: 10.1109/TPAMI.2014.2359660

Source DB:  PubMed          Journal:  IEEE Trans Pattern Anal Mach Intell        ISSN: 0098-5589            Impact factor:   6.226


  9 in total

1.  Early Detection of Human Epileptic Seizures Based on Intracortical Microelectrode Array Signals.

Authors:  Yun S Park; G Rees Cosgrove; Joseph R Madsen; Emad N Eskandar; Leigh R Hochberg; Sydney S Cash; Wilson Truccolo
Journal:  IEEE Trans Biomed Eng       Date:  2019-06-06       Impact factor: 4.538

2.  Comparison of logistic regression, support vector machines, and deep learning classifiers for predicting memory encoding success using human intracranial EEG recordings.

Authors:  Akshay Arora; Jui-Jui Lin; Alec Gasperian; Joseph Maldjian; Joel Stein; Michael Kahana; Bradley Lega
Journal:  J Neural Eng       Date:  2018-09-13       Impact factor: 5.043

3.  A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM.

Authors:  Qi Wang; ZhiHao Luo; JinCai Huang; YangHe Feng; Zhong Liu
Journal:  Comput Intell Neurosci       Date:  2017-01-30

4.  Integrated Chemometrics and Statistics to Drive Successful Proteomics Biomarker Discovery. (Review)

Authors:  Anouk Suppers; Alain J van Gool; Hans J C T Wessels
Journal:  Proteomes       Date:  2018-04-26

5.  XGBoost-based and tumor-immune characterized gene signature for the prediction of metastatic status in breast cancer.

Authors:  Qingqing Li; Hui Yang; Peipei Wang; Xiaocen Liu; Kun Lv; Mingquan Ye
Journal:  J Transl Med       Date:  2022-04-18       Impact factor: 8.440

6.  An empirical evaluation of sampling methods for the classification of imbalanced data.

Authors:  Misuk Kim; Kyu-Baek Hwang
Journal:  PLoS One       Date:  2022-07-28       Impact factor: 3.752

7.  Decoding declarative memory process for predicting memory retrieval based on source localization.

Authors:  Jenifer Kalafatovich; Minji Lee; Seong-Whan Lee
Journal:  PLoS One       Date:  2022-09-08       Impact factor: 3.752

8.  Prediction of Drug-Induced Long QT Syndrome Using Machine Learning Applied to Harmonized Electronic Health Record Data.

Authors:  Steven T Simon; Divneet Mandair; Premanand Tiwari; Michael A Rosenberg
Journal:  J Cardiovasc Pharmacol Ther       Date:  2021-03-08       Impact factor: 2.457

9.  From ERPs to MVPA Using the Amsterdam Decoding and Modeling Toolbox (ADAM).

Authors:  Johannes J Fahrenfort; Joram van Driel; Simon van Gaal; Christian N L Olivers
Journal:  Front Neurosci       Date:  2018-07-03       Impact factor: 4.677

