Peter Gnip, Liberios Vokorokos, Peter Drotár.
Abstract
Challenges posed by imbalanced data arise in many real-world applications. One approach to improving classifier performance on imbalanced data is oversampling. In this paper, we propose a new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes using an outlier detection technique and then uses these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely the synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). Prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always achieved the same or better performance than the other existing oversampling methods considered. © 2021 Gnip et al.
Keywords: ADASYN; Bankruptcy prediction; Imbalanced data; Outlier detection; Oversampling; SMOTE
Year: 2021 PMID: 34239981 PMCID: PMC8237317 DOI: 10.7717/peerj-cs.604
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. Principle of the proposed selective oversampling approach.
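The two-step principle described in the abstract (filter the minority class with an outlier detector, then oversample only the retained samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the outlier detection technique, so a simple mean k-nearest-neighbour distance filter stands in for it, and the function names and the `keep_fraction` parameter are hypothetical. The interpolation step follows the general SMOTE idea of placing a synthetic point on the segment between a sample and one of its neighbours.

```python
import math
import random


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return [j for _, j in dists[:k]]


def select_representative(minority, k=3, keep_fraction=0.8):
    """Stand-in outlier filter (an assumption, not the paper's detector).

    Scores each minority sample by its mean distance to its k nearest
    minority neighbours and keeps the lowest-scoring fraction.
    Assumes the minority class has more than k samples.
    """
    scores = []
    for i in range(len(minority)):
        nbrs = knn_indices(minority, i, k)
        mean_dist = sum(math.dist(minority[i], minority[j]) for j in nbrs) / len(nbrs)
        scores.append(mean_dist)
    n_keep = max(k + 1, round(keep_fraction * len(minority)))
    order = sorted(range(len(minority)), key=lambda i: scores[i])
    return [minority[i] for i in order[:n_keep]]


def smote_like(samples, n_new, k=3, rng=None):
    """SMOTE-style interpolation between a sample and a random neighbour."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        j = rng.choice(knn_indices(samples, i, k))
        t = rng.random()  # position along the segment between the two points
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(samples[i], samples[j]))
        )
    return synthetic


def selective_oversample(minority, n_new, k=3, keep_fraction=0.8, rng=None):
    """Filter the minority class first, then oversample only the kept samples."""
    core = select_representative(minority, k, keep_fraction)
    return smote_like(core, n_new, k, rng)


# Hypothetical toy data: a tight minority cluster plus one distant outlier.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
new_points = selective_oversample(minority, n_new=10)
```

Because the distant point is filtered out before interpolation, no synthetic sample is generated along a segment leading to it, which is the effect the selective step is meant to achieve.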
Detailed characteristics of the datasets used.
| Dataset | Samples | Attributes | Imbalance ratio | Reference |
|---|---|---|---|---|
| Synt. dataset-1 | 2200 | 40 | 20:1 | |
| Synt. dataset-2 | 1500 | 20 | 39:1 | |
| Synt. dataset-3 | 1500 | 20 | 50:1 | |
| Synt. dataset-4 | 2414 | 40 | 70:1 | |
| Bankruptcy - manufacture | 5854 | 20 | 417:1 | |
| Bankruptcy - construction | 3128 | 20 | 222:1 | |
| Wine | 4898 | 11 | 26:1 | |
| Bank marketing | 4119 | 20 | 8:1 | |
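To make the imbalance ratio column concrete, the short sketch below computes the ratio from a label vector and the number of synthetic minority samples needed to reach a 1:1 balance. The label counts are hypothetical and chosen only to give a round 20:1 ratio; they are not taken from the datasets in the table, and the helper names are illustrative.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def synthetic_needed(labels):
    """Synthetic minority samples required to match the majority class size."""
    counts = Counter(labels)
    return max(counts.values()) - min(counts.values())


# Hypothetical labels: 2000 majority samples (0) and 100 minority samples (1).
labels = [0] * 2000 + [1] * 100
print(imbalance_ratio(labels))   # 20.0, i.e. a 20:1 imbalance
print(synthetic_needed(labels))  # 1900
```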
Figure 2. Visualization of data samples after applying the SMOTE and SOA-S methods in two-dimensional space.
The best GM scores (%) achieved on the synthetic datasets (± for standard deviation).
| Dataset | Sampling | SVC | AB | KNN | RF |
|---|---|---|---|---|---|
| Synt. dataset-1 | none | 66.73 ± 13.5 | 75.41 ± 10.1 | 33.14 ± 15.3 | 70.42 ± 16.6 |
| SMOTE | 84.72 ± 5.62 | 78.61 ± 5.67 | 78.13 ± 11.3 | ||
| ADASYN | 86.82 ± 4.13 | 84.93 ± 5.48 | 77.59 ± 5.53 | 78.04 ± 11.4 | |
| SOA-S | 86.94 ± 3.39 | ||||
| SOA-A | 86.77 ± 3.61 | 86.77 ± 4.61 | 82.74 ± 4.49 | 79.79 ± 9.75 | |
| Synt. dataset-2 | none | 63.40 ± 17.5 | 69.81 ± 13.7 | 37.80 ± 22.1 | 66.06 ± 18.3 |
| SMOTE | 85.73 ± 4.61 | 80.50 ± 6.83 | 79.69 ± 7.09 | 72.53 ± 12.9 | |
| ADASYN | 85.65 ± 4.82 | 80.57 ± 6.65 | 79.48 ± 6.86 | 70.89 ± 14.1 | |
| SOA-S | 84.03 ± 5.22 | 81.99 ± 5.42 | |||
| SOA-A | 86.46 ± 3.99 | 73.17 ± 11.7 | |||
| Synt. dataset-3 | none | 54.63 ± 20.5 | 61.75 ± 17.9 | 27.67 ± 22.6 | 54.07 ± 21.6 |
| SMOTE | 82.57 ± 5.81 | 75.36 ± 9.85 | 76.13 ± 7.94 | 58.77 ± 18.4 | |
| ADASYN | 82.55 ± 6.15 | 75.34 ± 9.22 | 76.09 ± 8.49 | 57.56 ± 18.4 | |
| SOA-S | 79.23 ± 6.19 | ||||
| SOA-A | 84.10 ± 4.65 | 79.39 ± 6.27 | 60.28 ± 15.9 | ||
| Synt. dataset-4 | none | 21.24 ± 24.6 | 36.32 ± 21.6 | 5.31 ± 14.9 | 20.88 ± 24.9 |
| SMOTE | 72.94 ± 7.52 | 60.80 ± 9.85 | 67.72 ± 6.65 | 19.59 ± 20.9 | |
| ADASYN | 72.14 ± 7.78 | 61.32 ± 10.7 | 67.69 ± 6.29 | 17.95 ± 19.8 | |
| SOA-S | 68.86 ± 8.09 | 74.84 ± 6.13 | |||
| SOA-A | 75.22 ± 7.78 | 28.09 ± 17.7 |
Notes.
Highest results are in bold.
The best GM scores (%) achieved on the real-world datasets (± for standard deviation).
| Dataset | Sampling | SVC | AB | KNN | RF |
|---|---|---|---|---|---|
| Bankruptcy - manufacture | none | 7.44 ± 11.3 | 22.37 ± 12.7 | 5.65 ± 9.84 | 5.60 ± 9.93 |
| SMOTE | 90.61 ± 2.36 | 52.70 ± 12.4 | 39.70 ± 11.3 | 12.33 ± 7.95 | |
| ADASYN | 90.65 ± 2.38 | 53.48 ± 12.3 | 62.79 ± 10.9 | 12.24 ± 7.92 | |
| SOA+S | 95.13 ± 1.09 | 61.41 ± 10.7 | 55.37 ± 14.3 | ||
| SOA+A | 71.09 ± 12.1 | ||||
| Bankruptcy - construction | none | 33.45 ± 13.2 | 44.46 ± 13.1 | 5.20 ± 9.74 | 14.54 ± 10.4 |
| SMOTE | 94.54 ± 1.42 | 57.99 ± 14.8 | 62.19 ± 13.8 | 27.35 ± 12.9 | |
| ADASYN | 94.52 ± 1.33 | 61.35 ± 13.7 | 61.86 ± 13.3 | 28.91 ± 13.4 | |
| SOA+S | 78.43 ± 10.7 | ||||
| SOA+A | 95.65 ± 0.99 | 78.52 ± 9.93 | 55.56 ± 13.4 | ||
| Wine | none | 64.06 ± 2.45 | 48.41 ± 1.89 | 63.21 ± 2.37 | 60.82 ± 2.32 |
| SMOTE | 77.49 ± 0.91 | 73.71 ± 1.21 | 79.28 ± 1.35 | 70.90 ± 2.33 | |
| ADASYN | 77.48 ± 1.01 | 73.86 ± 1.38 | 79.46 ± 1.28 | 71.27 ± 2.31 | |
| SOA+S | 81.21 ± 1.07 | 79.81 ± 0.87 | |||
| SOA+A | 78.79 ± 0.69 | 77.76 ± 0.78 | |||
| Bank marketing | none | 64.69 ± 0.78 | 67.71 ± 0.82 | 54.06 ± 0.71 | 65.47 ± 1.04 |
| SMOTE | 86.88 ± 0.28 | 81.99 ± 1.21 | 79.13 ± 0.67 | 73.84 ± 0.81 | |
| ADASYN | 87.02 ± 0.29 | 81.36 ± 0.93 | 79.23 ± 0.79 | 73.55 ± 0.83 | |
| SOA+S | 83.38 ± 0.51 | ||||
| SOA+A | 87.02 ± 0.35 | 83.02 ± 0.48 | 79.42 ± 0.99 |
Notes.
Highest results are in bold.
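The tables above report GM scores without defining the metric in this record. Assuming GM denotes the geometric mean of sensitivity and specificity, as is common in the imbalanced-learning literature, it can be computed as in the sketch below. The function name and the toy labels are illustrative only.

```python
import math


def gm_score(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical labels: 2 minority (1) and 8 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts the majority class is 80% accurate
# but never detects the minority class, so its GM is 0.
always_majority = [0] * 10
print(gm_score(y_true, always_majority))  # 0.0

# One of two minority samples found, one false alarm among eight negatives.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(gm_score(y_true, y_pred))  # sqrt(0.5 * 0.875)
```

Under this definition, a low GM for the "none" rows is consistent with a classifier that largely ignores the minority class, even when plain accuracy would look high.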