
Classifying Imbalanced Data Streams via Dynamic Feature Group Weighting with Importance Sampling.

Ke Wu, Andrea Edwards, Wei Fan, Jing Gao, Kun Zhang.

Abstract

Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been well studied, and many interesting algorithms have been developed. However, only a few approaches reported in the literature address the intersection of these two fields, owing to their complex interplay. In this work, we proposed an importance-sampling-driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams with imbalanced class distributions. Two components are tightly integrated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are tackled by a weighted ensemble trained on a set of feature groups, with each sub-classifier (i.e., a single classifier or an ensemble) weighted by its discriminative power and stability. The uneven class distribution, on the other hand, is addressed by the sub-classifier built on a specific feature group, with the underlying distribution rebalanced by the importance sampling technique. We derived a theoretical upper bound on the generalization error of the proposed algorithm. We also studied the empirical performance of our method on a set of benchmark synthetic and real-world data sets, where it achieved significant improvement over the competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.
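
The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' DFGW-IS implementation, and every name in it (rebalance, CentroidClassifier, train_ensemble, predict_ensemble) is hypothetical. It assumes that the stream arrives in chunks, that importance weights can be approximated by inverse class frequency, that each sub-classifier is a simple nearest-centroid model, and that balanced accuracy on a validation chunk can stand in for discriminative power. The stability term of the ensemble weight is omitted.

```python
import random
from collections import Counter

def rebalance(X, y, rng):
    """Resample a chunk so that classes are roughly balanced.

    Each example is drawn with probability proportional to 1 / (size of its
    class). This is a simple stand-in for importance weights (an assumption,
    not the weighting scheme of the paper).
    """
    counts = Counter(y)
    weights = [1.0 / counts[label] for label in y]
    idx = rng.choices(range(len(y)), weights=weights, k=len(y))
    return [X[i] for i in idx], [y[i] for i in idx]

class CentroidClassifier:
    """Nearest-centroid sub-classifier restricted to one feature group."""

    def __init__(self, features):
        self.features = features  # column indices of this feature group
        self.centroids = {}

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(self.features))
            for j, f in enumerate(self.features):
                acc[j] += row[f]
        self.centroids = {c: [s / counts[c] for s in acc]
                          for c, acc in sums.items()}
        return self

    def predict(self, row):
        def dist(c):
            return sum((row[f] - m) ** 2
                       for f, m in zip(self.features, self.centroids[c]))
        return min(self.centroids, key=dist)

def balanced_accuracy(clf, X, y):
    """Mean per-class recall, so the minority class counts equally."""
    hits, totals = Counter(), Counter(y)
    for row, label in zip(X, y):
        if clf.predict(row) == label:
            hits[label] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def train_ensemble(X, y, feature_groups, X_val, y_val, seed=0):
    """Train one rebalanced sub-classifier per feature group and weight it."""
    rng = random.Random(seed)
    members = []
    for group in feature_groups:
        Xb, yb = rebalance(X, y, rng)
        clf = CentroidClassifier(group).fit(Xb, yb)
        members.append((clf, balanced_accuracy(clf, X_val, y_val)))
    return members

def predict_ensemble(members, row):
    """Weighted vote over the sub-classifiers."""
    votes = Counter()
    for clf, weight in members:
        votes[clf.predict(row)] += weight
    return votes.most_common(1)[0][0]
```

In a streaming setting, train_ensemble would be called again on each new chunk, so a feature group whose weight falls after a concept drift contributes less to the vote.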


Keywords:  Class imbalance; Data stream classification; Ensemble weighting; Feature group ensemble; Importance sampling

Year:  2014        PMID: 25568835      PMCID: PMC4283472          DOI: 10.1137/1.9781611973440.83

Source DB:  PubMed          Journal:  Proc SIAM Int Conf Data Min


  3 in total

1.  Critical evaluation of in silico methods for prediction of coiled-coil domains in proteins. (Review)

Authors:  Chen Li; Catherine Ching Han Chang; Jeremy Nagel; Benjamin T Porebski; Morihiro Hayashida; Tatsuya Akutsu; Jiangning Song; Ashley M Buckle
Journal:  Brief Bioinform       Date:  2015-07-15       Impact factor: 11.622

2.  RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection.

Authors:  Ke Wu; Kun Zhang; Wei Fan; Andrea Edwards; Philip S Yu
Journal:  Proc IEEE Int Conf Data Min       Date:  2014

3.  CSTG: An Effective Framework for Cost-sensitive Sparse Online Learning.

Authors:  Zhong Chen; Zhide Fang; Wei Fan; Andrea Edwards; Kun Zhang
Journal:  SIAM Rev Soc Ind Appl Math       Date:  2017-04       Impact factor: 10.780

