Mingyang Deng, Yingshi Guo, Chang Wang, Fuwei Wu.
Abstract
To address the oversampling problem for multi-class small samples and improve their classification accuracy, we develop an oversampling method based on classification ranking and weight setting. The algorithm sorts the data within each class of the dataset according to the distance from the original data to the hyperplane. Iterative sampling is then performed within each class, and inter-class sampling is adopted at the boundaries of adjacent classes according to a sampling weight composed of data density and data rank. Finally, information assignment is performed on all newly generated samples. The algorithm is trained and tested on UCI imbalanced datasets, and established composite metrics are used to compare the proposed algorithm against other algorithms in a comprehensive evaluation. The results show that the proposed algorithm balances multi-class imbalanced data in quantity, while the newly generated data maintain the distribution characteristics and information properties of the original samples. Moreover, compared with other algorithms such as SMOTE and SVMOM, the proposed algorithm reaches a higher classification accuracy of about 90%. We conclude that the algorithm is highly practical and general for imbalanced multi-class samples.
Year: 2021 PMID: 34767567 PMCID: PMC8589211 DOI: 10.1371/journal.pone.0259227
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
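Since the entry carries no code, the ranking-and-weighting idea from the abstract can be sketched roughly as follows. Everything here is an illustrative assumption, not the authors' implementation: the hyperplane `(w, b)` is taken as given, the weight combining rank and density is invented, and the interpolation step is borrowed from SMOTE.

```python
# Rough sketch of the abstract's idea: rank minority samples by distance to a
# separating hyperplane, weight them by rank and local density, then create
# synthetic samples by interpolation. All names and formulas are illustrative.
import math
import random

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def local_density(x, data, radius=1.0):
    """Fraction of other samples within `radius` of x (crude density)."""
    close = sum(1 for y in data if y is not x and math.dist(x, y) <= radius)
    return close / max(len(data) - 1, 1)

def oversample(minority, w, b, n_new, seed=0):
    """Generate n_new synthetic points, favouring samples that are close to
    the class boundary (early in the ranking) and lie in dense regions."""
    rng = random.Random(seed)
    ranked = sorted(minority, key=lambda x: abs(signed_distance(x, w, b)))
    weights = [(len(ranked) - i) * (1.0 + local_density(x, ranked))
               for i, x in enumerate(ranked)]
    synthetic = []
    for _ in range(n_new):
        a, c = rng.choices(ranked, weights=weights, k=2)
        t = rng.random()  # linear interpolation between the two picks
        synthetic.append(tuple(ai + t * (ci - ai) for ai, ci in zip(a, c)))
    return synthetic
```

The intra-class iteration and inter-class boundary sampling described in the abstract would repeat this step per class and at the borders of adjacent classes.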
Classification of typical algorithms for imbalanced sampling and representative literature.
| Classification | Strategy | Typical literature |
|---|---|---|
| Oversampling | K-order approach | Chawla et al. (2002), Han et al. (2005) |
| Oversampling | Clustering | Sanchez et al. (2013), Nekooeimehr et al. (2016) |
| Oversampling | Neural networks | Konno et al. (2019) |
| Undersampling | Clustering | Yen et al. (2009), Tsai et al. (2018) |
| Undersampling | Integration | Liu et al. (2009), Tahir et al. (2012) |
| Hybrid sampling | Random sampling | Seiffert et al. (2009) |
| Hybrid sampling | SMOTE+ENN/TOMEK | Batista et al. (2004) |
Fig 1. Multi-dimensional classification data.
Fig 2. Two-dimensional spatial classification data.
Fig 3. The classification oversampling algorithm flow chart.
Datasets for training and testing the algorithms.
| Datasets | Imbalance degree | Minority class | Majority class | Total samples | Usage |
|---|---|---|---|---|---|
| Weather data | 0.12 | 163 | 1358 | 1521 | training |
| Clinical cases | 0.28 | 6636 | 23700 | 30336 | training |
| Financial data | 0.31 | 178 | 574 | 752 | training |
| Product sampling | 0.56 | 126 | 225 | 351 | training |
| Market Research | 0.43 | 431 | 10048 | 10479 | testing |
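Treating the imbalance-degree column as the minority-to-majority ratio (an interpretation inferred from the table values, not stated in the entry), the four training rows can be reproduced:

```python
# Check: imbalance degree = minority / majority, rounded to two decimals.
# The meaning of the column is an assumption inferred from the values.
training = {
    "Weather data":     (163, 1358, 0.12),
    "Clinical cases":   (6636, 23700, 0.28),
    "Financial data":   (178, 574, 0.31),
    "Product sampling": (126, 225, 0.56),
}
for name, (minority, majority, degree) in training.items():
    assert round(minority / majority, 2) == degree, name
```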
Fig 4. Data sorting of oversampling for imbalanced data.
Fig 7. Inter-class sampling of oversampling for imbalanced data.
Confusion matrix.
| Category | Predicted positive | Predicted negative | Actual total |
|---|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) | TP+FN |
| Actual negative | False Positive (FP) | True Negative (TN) | FP+TN |
| Predicted total | TP+FP | FN+TN | TP+FN+FP+TN |
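The single-index metrics reported below follow directly from these confusion-matrix counts; a minimal sketch (the example counts are invented, not taken from the paper):

```python
# Standard per-class metrics computed from confusion-matrix counts.
def recall(tp, fn):
    return tp / (tp + fn)        # true positive rate

def precision(tp, fp):
    return tp / (tp + fp)        # positive predictive value

def specificity(tn, fp):
    return tn / (tn + fp)        # true negative rate

# Invented example: 90 true positives, 10 false negatives,
# 10 false positives, 90 true negatives.
assert recall(90, 10) == precision(90, 10) == specificity(90, 10) == 0.9
```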
Single index evaluation of different algorithms.
| Testing dataset | Algorithm | Recall | Precision | Specificity |
|---|---|---|---|---|
| Market Research | SMOTE | 0.8934 | 0.8972 | 0.8986 |
| | SVMOM | 0.8731 | 0.9098 | 0.8893 |
| | SMO+TLK | 0.8857 | 0.9028 | 0.8922 |
| | SVM+ENN | 0.8943 | 0.8962 | 0.8998 |
| | STCPS | 0.9013 | 0.8937 | 0.8909 |
Comparison of composite indicators of different algorithms.
| Testing dataset | Algorithm | G-mean | F-value | AUC | CI |
|---|---|---|---|---|---|
| Market Research | SMOTE | 0.8960 | 0.8953 | 0.8028 | 0.8647 |
| | SVMOM | 0.8812 | 0.8911 | 0.7764 | 0.8496 |
| | SMO+TLK | 0.8889 | 0.8942 | 0.7902 | 0.8578 |
| | SVM+ENN | 0.8968 | 0.8952 | 0.8047 | 0.8657 |
| | STCPS | 0.8970 | 0.8975 | 0.8030 | 0.8655 |
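The composite indicators derive from the single-index table via the standard definitions G-mean = √(Recall × Specificity) and F-value = 2 · Precision · Recall / (Precision + Recall); for example, the SMOTE row recomputes to four decimals (how the paper averages across classes for the remaining rows is not stated in this entry):

```python
import math

def g_mean(recall, specificity):
    """Geometric mean of recall and specificity."""
    return math.sqrt(recall * specificity)

def f_value(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2.0 * precision * recall / (precision + recall)

# SMOTE row of the single-index table: R=0.8934, P=0.8972, S=0.8986
assert round(g_mean(0.8934, 0.8986), 4) == 0.8960
assert round(f_value(0.8972, 0.8934), 4) == 0.8953
```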