| Literature DB >> 35327833 |
Der-Chiang Li, Qi-Shi Shi, Yao-San Lin, Liang-Sian Lin.
Abstract
Oversampling is the most popular data-preprocessing technique for imbalanced learning: it makes traditional classifiers applicable to imbalanced data. Through an overall review of oversampling techniques (oversamplers), we find that some of them can be regarded as danger-information-based oversamplers (DIBOs), which create samples near danger areas so that these positive examples can be correctly classified, while others are safe-information-based oversamplers (SIBOs), which create samples near safe areas to increase the rate of correctly predicted positive values. However, DIBOs cause too many negative examples in the overlapped areas to be misclassified, and SIBOs cause too many borderline positive examples to be misclassified. Weighing these advantages and disadvantages, a boundary-information-based oversampler (BIBO) is proposed. First, a concept of boundary information that considers safe information and danger information at the same time is proposed, so that the created samples lie near decision boundaries. The experimental results show that DIBOs and BIBO perform better than SIBOs on the basic metrics of recall and negative-class precision; SIBOs and BIBO perform better than DIBOs on the basic metrics of specificity and positive-class precision; and BIBO is better than both DIBOs and SIBOs in terms of integrated metrics.
Keywords: boundary information; imbalanced datasets; synthetic sample generation
Year: 2022 PMID: 35327833 PMCID: PMC8947752 DOI: 10.3390/e24030322
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
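For readers unfamiliar with oversamplers, the following is a minimal SMOTE-style interpolation sketch of the general idea the abstract builds on (function name and parameters are illustrative, not the paper's code): each synthetic minority sample is placed on the segment between a minority point and one of its k nearest minority neighbors.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating each
    seed point toward one of its k nearest minority neighbors
    (a SMOTE-style sketch, not the paper's BIBO procedure)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # never pick self as neighbor
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest minority neighbors
    seeds = rng.integers(0, len(X_min), n_new) # random seed points
    out = np.empty((n_new, X_min.shape[1]))
    for i, s in enumerate(seeds):
        nb = X_min[rng.choice(nn[s])]          # a random neighbor of the seed
        gap = rng.random()                     # interpolation factor in [0, 1)
        out[i] = X_min[s] + gap * (nb - X_min[s])
    return out

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like_oversample(X_min, n_new=6, k=2, rng=0)
print(synth.shape)  # (6, 2)
```

Because every synthetic point is a convex combination of two minority points, the generated samples stay inside the minority region's bounding box.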
Figure 1. Two types of majority class examples.
Two classes of oversamplers.
| Oversamplers | Authors [Reference] | Methods |
|---|---|---|
| Danger-information-based | Han et al. [ ] | B1_SMOTE |
| | Han et al. [ ] | B2_SMOTE |
| | He et al. [ ] | ADASYN |
| | Nguyen et al. [ ] | BOS |
| | Barua et al. [ ] | MWMOTE |
| | Piri et al. [ ] | SIMO |
| | Fahrudin et al. [ ] | AWH_SMOTE |
| Safe-information-based | Cieslak et al. [ ] | C_SMOTE |
| | Bunkhumpornpat et al. [ ] | SL_SMOTE |
| | Maciejewski and Stefanowski [ ] | LN_SMOTE |
| | Sanchez et al. [ ] | SOI_CJ |
| | Douzas et al. [ ] | km_SMOTE |
Figure 2. Classification results using different resamplers. (a) the raw imbalanced dataset; (b) using CNN; (c) using ROS; (d) using SMOTE; (e–h) using danger-information-based oversamplers; (i–l) using safe-information-based oversamplers.
Figure 3. A demonstration of created samples that are biased towards the larger BIW.
The algorithm of the proposed boundary-information-based oversampler.

Input:
    imbData: an imbalanced dataset.
    K: the number of kNPNs for oversampling.
    k: the number of kNNs for computing …

Output:
    resData: the imbData after being resampled by this procedure.

Procedure (most step bodies were not preserved in this extraction; only the recoverable lines are shown, with their original step numbers):
    3.  resData = imbData
    4.  while the length of resData < twice the length of …:
    5.      for …:
    9.          if …:
    16.         else:
    19.         resData = resData + s
    20.     if the length of resData >= twice the length of …:
    22.         return resData
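The listing above is incomplete in this extraction. As a rough, assumption-laden sketch of the idea described in the abstract, one might weight seed points by how close they sit to the decision boundary and sample proportionally. The weight formula and sampling scheme below are guesses for illustration, not the authors' exact algorithm:

```python
import numpy as np

def boundary_weight(x, X_pos, X_neg, k=3, eps=1e-12):
    """A hypothetical boundary-information weight: close to 1 when x lies
    near the decision boundary (mean distances to the k nearest positive
    and negative neighbors are similar), smaller when one side dominates.
    Illustrative only; not the authors' published formula."""
    d_pos = np.sort(np.linalg.norm(X_pos - x, axis=1))[:k].mean()
    d_neg = np.sort(np.linalg.norm(X_neg - x, axis=1))[:k].mean()
    return min(d_pos, d_neg) / (max(d_pos, d_neg) + eps)

def bibo_like_oversample(X_pos, X_neg, n_new, k=3, rng=None):
    """Draw minority seeds with probability proportional to their boundary
    weight, then interpolate toward another minority point."""
    rng = np.random.default_rng(rng)
    X_pos = np.asarray(X_pos, dtype=float)
    X_neg = np.asarray(X_neg, dtype=float)
    # each point's weight is computed against the other minority points
    w = np.array([boundary_weight(x, np.delete(X_pos, i, axis=0), X_neg, k)
                  for i, x in enumerate(X_pos)])
    p = w / w.sum()
    out = []
    for _ in range(n_new):
        i = rng.choice(len(X_pos), p=p)   # boundary-weighted seed
        j = rng.choice(len(X_pos))        # random minority partner
        gap = rng.random()
        out.append(X_pos[i] + gap * (X_pos[j] - X_pos[i]))
    return np.array(out)

X_pos = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
X_neg = np.array([[2.0, 2.0], [3.0, 3.0]])
synth = bibo_like_oversample(X_pos, X_neg, n_new=4, k=1, rng=1)
print(synth.shape)  # (4, 2)
```

The design intent matches the abstract: seeds near the class boundary (similar distances to both classes) receive higher weight, so the created samples concentrate near decision boundaries rather than deep in safe or danger areas.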
Figure 4. The classification results using BIBO with different values of K and k.
Confusion matrix.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Positive class | True positive (TP) | False negative (FN) |
| Negative class | False positive (FP) | True negative (TN) |
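The metrics compared in the abstract follow directly from this matrix. A small sketch using the standard definitions (the G-mean shown here is one common integrated metric; the paper may use others):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard metrics derived from the confusion matrix above."""
    recall = tp / (tp + fn)                 # sensitivity: positive coverage
    specificity = tn / (tn + fp)            # negative coverage
    precision = tp / (tp + fp)              # positive-class precision
    npv = tn / (tn + fn)                    # negative-class precision
    g_mean = (recall * specificity) ** 0.5  # an integrated metric
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "negative_precision": npv,
            "g_mean": g_mean}

m = confusion_metrics(tp=40, fn=10, fp=20, tn=130)
print(m["recall"])  # 0.8
```

The abstract's claim pattern reads naturally in these terms: DIBOs trade specificity and positive-class precision for recall, SIBOs trade in the opposite direction, and integrated metrics such as the G-mean reward balancing both.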
The simulated datasets (Ex. = number of examples; IR = imbalance ratio; DR = disturbance ratio).
| No. | Name | Ex. | IR | DR (%) | No. | Name | Ex. | IR | DR (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | paw1 | 600 | 5 | 0 | 10 | clover4 | 800 | 7 | 0 |
| 2 | paw2 | 600 | 5 | 30 | 11 | clover5 | 800 | 7 | 30 |
| 3 | paw3 | 600 | 5 | 60 | 12 | clover6 | 800 | 7 | 60 |
| 4 | paw4 | 800 | 7 | 0 | 13 | subcl1 | 600 | 5 | 0 |
| 5 | paw5 | 800 | 7 | 30 | 14 | subcl2 | 600 | 5 | 30 |
| 6 | paw6 | 800 | 7 | 60 | 15 | subcl3 | 600 | 5 | 60 |
| 7 | clover1 | 600 | 5 | 0 | 16 | subcl4 | 800 | 7 | 0 |
| 8 | clover2 | 600 | 5 | 30 | 17 | subcl5 | 800 | 7 | 30 |
| 9 | clover3 | 600 | 5 | 60 | 18 | subcl6 | 800 | 7 | 60 |
The real-world datasets (Att. = number of attributes; Ex. = number of examples; IR = imbalance ratio).
| No. | Name | Att. | Ex. | IR | No. | Name | Att. | Ex. | IR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ecoli-0_vs_1 | 7 | 220 | 1.86 | 12 | page-blocks0 | 10 | 5472 | 8.79 |
| 2 | ecoli1 | 7 | 336 | 3.36 | 13 | pima | 8 | 768 | 1.87 |
| 3 | ecoli2 | 7 | 336 | 5.46 | 14 | segment0 | 19 | 2308 | 6.02 |
| 4 | ecoli3 | 7 | 336 | 8.6 | 15 | vehicle0 | 18 | 846 | 3.25 |
| 5 | glass-0-1-2-3 | 9 | 214 | 3.2 | 16 | vehicle1 | 18 | 846 | 2.9 |
| | | | | | 17 | vehicle2 | 18 | 846 | 2.88 |
| 6 | glass0 | 9 | 214 | 2.06 | 18 | vehicle3 | 18 | 846 | 2.99 |
| 7 | glass1 | 9 | 214 | 1.82 | 19 | wisconsin | 9 | 683 | 1.86 |
| 8 | glass6 | 9 | 214 | 6.38 | 20 | yeast1 | 8 | 1484 | 2.46 |
| 9 | haberman | 3 | 306 | 2.78 | 21 | yeast3 | 8 | 1484 | 8.1 |
| 10 | new-thyroid1 | 5 | 215 | 5.14 | 22 | ionosphere | 34 | 351 | 1.79 |
| 11 | new-thyroid2 | 5 | 215 | 5.14 | 23 | Swarm | 2400 | 24017 | 2.20 |
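IR in these tables is, as is standard in this literature, the majority-to-minority class-size ratio (an assumption here; verify against the paper). A quick sketch of how it is computed from labels:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# e.g., 220 examples split 143 negative / 77 positive give IR of about 1.86,
# matching the ecoli-0_vs_1 row above
labels = ["neg"] * 143 + ["pos"] * 77
print(round(imbalance_ratio(labels), 2))  # 1.86
```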
The rankings of the oversamplers using different metrics.

| Oversamplers | Methods | Recovered rankings (nine metrics; column headers not preserved in this extraction) |
|---|---|---|
| RAW | - | …, 11.00, …, 10.43, 7.65, 11.00, 8.00 |
| SMOTE | - | … |
| DIBOs | B1_SMOTE | 7.30, 5.30, 7.87, 7.26, …, 4.00, 6.26, 5.30, 6.80 |
| | B2_SMOTE | 10.00, 7.78, 9.96, 10.22, 9.65, 6.35, 9.91, 7.78, 9.91 |
| | ADASYN | 10.43, 5.43, 10.26, 9.39, 9.09, 5.22, 9.78, 5.43, 9.65 |
| | MWMOTE | …, 8.00, …, 7.70, …, 5.57, … |
| SIBOs | SL_SMOTE | 9.17, 8.96, 7.22, 9.74, 9.04, 8.22, 9.87, 8.96, 9.87 |
| | LN_SMOTE | … |
| | SOI_CJ | …, 4.61, …, 6.09, …, 4.61 |
| | km_SMOTE | …, 4.91, …, 8.09, …, 4.91 |
| BIBO | - | …, 6.57, …, 9.83, …, 6.57 |
Note: The values in bold are the oversamplers that outperform the SMOTE (underlined ranking).
The performance results of the oversamplers using different classifiers.

First block (classifier and metric column headers were not preserved in this extraction):

| Oversamplers | Methods | Recovered rankings |
|---|---|---|
| RAW | - | 4.65, 6.52, 8.35, 8.00, 8.96, 7.61, 8.09, 8.57, 9.52, 8.26 |
| SMOTE | - | 7.59, 8.04, 7.61, 5.83, 7.30, 7.39, 7.13, 7.09, 7.00, 7.04 |
| DIBOs | B1_SMOTE | 7.48, 7.91, 7.00, 5.78, 7.26, 8.78, 8.83, 8.22, 8.30, 8.65 |
| | B2_SMOTE | 9.87, 9.26, 9.30, 7.57, 9.26, 9.65, 9.65, 9.65, 9.04, 9.65 |
| | ADASYN | 9.17, 7.43, 7.57, 3.74, 7.04, 6.30, 5.35, 4.61, 2.87, 4.30 |
| | MWMOTE | 6.20, 5.39, 5.26, …, 5.00, 4.48, 3.61, 2.48, 4.78, 3.25 |
| SIBOs | SL_SMOTE | 10.26, 10.96, 9.78, 9.26, 10.87, 10.26, 9.17, 9.26, 9.91, 8.22 |
| | LN_SMOTE | 4.70, 4.17, 3.26, 3.35, 3.48, 2.63, 3.56, 3.48, 2.61, 3.74 |
| | SOI_CJ | 2.65, 2.35, …, 3.04, …, 2.57, 3.04, 4.13, 4.39, 4.13 |
| | km_SMOTE | 1.78, …, 1.87, 5.74, 1.70, 4.26, 5.09, 5.17, 5.35, 5.22 |
| BIBO | - | …, 2.65, 3.70, 6.39, 3.70, … |

Second block (classifier and metric column headers were not preserved in this extraction):

| Oversamplers | Methods | Recovered rankings |
|---|---|---|
| RAW | - | 4.50, 7.70, 6.50, 4.50, 6.50, 7.37, 8.46, 7.50, 5.50, 5.50 |
| SMOTE | - | 7.57, 6.13, 5.04, 4.30, 5.78, 7.74, 6.30, 5.04, 4.57, 6.00 |
| DIBOs | B1_SMOTE | 8.96, 9.46, 6.87, 6.48, 7.13, 8.91, 7.70, 6.70, 6.22, 7.09 |
| | B2_SMOTE | 9.96, 8.70, 7.87, 7.39, 7.96, 9.00, 8.61, 7.83, 7.17, 8.00 |
| | ADASYN | 4.96, 2.74, 2.09, …, 4.87, 3.70, …, 2.04, 3.91 |
| | MWMOTE | 5.39, 3.09, …, 1.91, 2.35, 5.87, …, 2.13, …, 2.78 |
| SIBOs | SL_SMOTE | 7.54, 10.46, 10.50, 10.50, 10.50, 7.37, 10.46, 10.50, 10.50, 10.50 |
| | LN_SMOTE | 5.00, 5.35, 4.91, 5.26, 5.13, 4.48, 5.30, 5.00, 5.70, 5.00 |
| | SOI_CJ | 7.54, 6.74, 8.74, 8.74, 8.74, 4.74, 3.61, 2.91, 2.83, 2.87 |
| | km_SMOTE | 2.76, …, 4.70, 6.61, 2.57, 2.70, 2.53, 4.87, 6.39, … |
| BIBO | - | …, 3.65, 2.83, 2.70, 3.09, …, 6.48, 8.70, 8.74, 8.74 |
Note: The values in bold indicate the best ranking.
The comparison of computational complexity (the oversampler column headers were not preserved in this extraction).

| | … | … | … | … | … |
|---|---|---|---|---|---|
| computational time (s) | 0.085 | 0.353 | 0.355 | 0.371 | 3.025 |
| computational time (s) | 0.899 | 2.002 | 50.090 | 4.076 | 0.689 |
The training dataset.
| NO. | Mcg | Gvh | Lip | Chg | Aac | Alm1 | Alm2 | Class |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | 0.48 | 0.48 | 0.50 | 0.59 | 0.88 | 0.89 | Negative |
| 2 | 0.56 | 0.40 | 0.48 | 0.50 | 0.49 | 0.37 | 0.46 | Positive |
| … | … | … | … | … | … | … | … | … |
| 175 | 0.24 | 0.41 | 0.48 | 0.50 | 0.49 | 0.23 | 0.34 | Positive |
| 176 | 0.20 | 0.44 | 0.48 | 0.50 | 0.46 | 0.51 | 0.57 | Positive |
Synthetic example generation.
| No. | … | … | … | Synthetic Examples |
|---|---|---|---|---|
| 1 | 11.053 | 11.458 | 0.982 | [0.50, 0.37, …, 0.69] |
| 2 | 16.335 | 9.340 | 0.272 | [0.00, 0.51, …, 0.44] |
| … | … | … | … | … |
| 53 | 9.503 | 10.345 | 0.958 | [0.12, 0.67, …, 0.63] |
| 54 | 5.326 | 9.503 | 0.718 | [0.33, 0.37, …, 0.65] |
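The table pairs two unnamed distance-like values and a weight-like value with each synthetic example. Consistent with Figure 3 ("biased towards the larger BIW"), a hypothetical biased interpolation step could look like the following; the formula and parameter names are illustrative only, not taken from the paper:

```python
import numpy as np

def biased_interpolate(x_seed, x_nbr, w_seed, w_nbr, rng=None):
    """Create a synthetic example on the segment between two minority
    points, drifting toward the endpoint with the larger weight
    (w_seed and w_nbr play the role of BIW-like values; the exact
    biasing rule here is an illustrative guess)."""
    rng = np.random.default_rng(rng)
    x_seed = np.asarray(x_seed, dtype=float)
    x_nbr = np.asarray(x_nbr, dtype=float)
    # the step toward the neighbor shrinks when the seed's weight dominates
    gap = rng.random() * w_nbr / (w_seed + w_nbr)
    return x_seed + gap * (x_nbr - x_seed)

s = biased_interpolate([0.23, 0.48], [0.56, 0.40],
                       w_seed=0.982, w_nbr=0.272, rng=0)
```

With these example weights the seed dominates, so the synthetic point stays close to the seed end of the segment.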