Distance Metric Based Oversampling Method for Bioinformatics and Performance Evaluation.

Meng-Fong Tsai, Shyr-Shen Yu.

Abstract

Imbalanced classification arises when the classes in a dataset are unequally represented. On such data, the predictions made by most classification methods are highly accurate for the majority class but significantly less accurate for the minority class. To overcome this problem, this study took several imbalanced datasets from the well-known UCI datasets and designed and implemented an efficient algorithm that couples Top-N Reverse k-Nearest Neighbor (TRkNN) with the Synthetic Minority Oversampling TEchnique (SMOTE). The proposed algorithm was evaluated with classification methods such as logistic regression (LR), C4.5, Support Vector Machine (SVM), and Back Propagation Neural Network (BPNN). This research also adopted different distance metrics to classify the same UCI datasets. The empirical results indicate that the Euclidean and Manhattan distances are not only more accurate but also more computationally efficient than the Chebyshev and Cosine distances. The proposed algorithm based on TRkNN and SMOTE can therefore be applied broadly to imbalanced datasets, and the recommendations on choosing suitable distance metrics can serve as a reference for future studies.
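To make the role of the distance metric concrete, the following is a minimal sketch of SMOTE-style oversampling in which the neighbor search accepts any of the four distances named in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the TRkNN step is omitted because the abstract does not specify it, and the names `METRICS`, `k_nearest`, and `smote_like` are hypothetical helpers introduced here.

```python
import math
import random


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest coordinate-wise difference (L-infinity)."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """Cosine distance: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical lookup table, for illustration only.
METRICS = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
    "cosine": cosine,
}


def k_nearest(index, points, k, metric):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: metric(points[index], points[i]))
    return others[:k]


def smote_like(minority, n_new, k=3, metric=euclidean, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(i, minority, k, metric))
        gap = rng.random()  # interpolation factor in [0, 1)
        base, neighbor = minority[i], minority[j]
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


if __name__ == "__main__":
    minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.3]]
    for name, metric in METRICS.items():
        print(name, smote_like(minority, n_new=2, k=2, metric=metric))
```

Only the neighbor ranking depends on the metric; the interpolation step is the same in every case, so a change of metric alters which pairs of minority samples are blended rather than how they are blended.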

Keywords:  Distance Metric; Imbalanced classification; Synthetic minority oversampling technique; UCI Dataset

Year:  2016        PMID: 27185255     DOI: 10.1007/s10916-016-0516-3

Source DB:  PubMed          Journal:  J Med Syst        ISSN: 0148-5598            Impact factor:   4.460


  4 in total

1.  An approach for classification of highly imbalanced data using weighting and undersampling.

Authors:  Ashish Anand; Ganesan Pugalenthi; Gary B Fogel; P N Suganthan
Journal:  Amino Acids       Date:  2010-04-22       Impact factor: 3.520

2.  Learning from imbalanced data in surveillance of nosocomial infection.

Authors:  Gilles Cohen; Mélanie Hilario; Hugo Sax; Stéphane Hugonnet; Antoine Geissbuhler
Journal:  Artif Intell Med       Date:  2005-10-17       Impact factor: 5.326

3.  Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance.

Authors:  Maciej A Mazurowski; Piotr A Habas; Jacek M Zurada; Joseph Y Lo; Jay A Baker; Georgia D Tourassi
Journal:  Neural Netw       Date:  2007-12-27

4.  An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data.

Authors:  Ming Hao; Yanli Wang; Stephen H Bryant
Journal:  Anal Chim Acta       Date:  2013-11-06       Impact factor: 6.558
