Kevin Teh, Paul Armitage, Solomon Tesfaye, Dinesh Selvarajah, Iain D Wilkinson.
Abstract
One of the fundamental challenges when dealing with medical imaging datasets is class imbalance. Class imbalance occurs when the number of instances in the class of interest is low relative to the rest of the data. This study applies oversampling strategies to balance the classes and improve classification performance. We evaluated four classifiers, k-nearest neighbors (k-NN), support vector machine (SVM), multilayer perceptron (MLP) and decision trees (DT), with 73 oversampling strategies. We used imbalanced learning oversampling techniques to improve classification in datasets that are distinctively sparser and clustered. This work reports the best oversampling and classifier combinations and concludes that using oversampling methods always outperforms using no oversampling, thereby improving the classification results.
Year: 2020 PMID: 33320890 PMCID: PMC7737960 DOI: 10.1371/journal.pone.0243907
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Diagram of the conventional SMOTE algorithm.
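For orientation, conventional SMOTE is usually described as follows: pick a minority sample, pick one of its k nearest minority neighbours, and place a synthetic point at a random position on the line segment between the two. The sketch below is a minimal pure-Python illustration of that idea, not the implementation used in the study; the function name `smote`, its parameters, and the toy data are assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative sketch only: `minority` is a list of equal-length
    numeric feature vectors from the under-represented class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square (hypothetical data).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(toy_minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already spanned by the minority class.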
Fig 2. The two datasets described in Table 1.
DTS1 shows classes with more pockets of points bunched together (clustered), whereas DTS2 is a more sporadic dataset (sparser).
Parameters of the datasets used in this study.
| Dataset | ATR | N(N+/N-) | IR |
|---|---|---|---|
| | 14 | 158(119/39) | 3.051 |
| | 13 | 53(40/13) | 3.077 |
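The IR values in the table are consistent with the ratio of the larger class count to the smaller one (119/39 and 40/13). The following is a minimal sketch of that arithmetic, assuming this definition, which the record itself does not spell out; the function names are illustrative.

```python
def imbalance_ratio(n_larger, n_smaller):
    """Ratio of the larger class count to the smaller class count (assumed definition of IR)."""
    return n_larger / n_smaller


def samples_to_balance(n_larger, n_smaller):
    """Number of synthetic samples needed to bring the smaller class up to the larger one."""
    return n_larger - n_smaller


# Class counts taken from the N(N+/N-) column of the table.
ir_first = round(imbalance_ratio(119, 39), 3)   # 3.051
ir_second = round(imbalance_ratio(40, 13), 3)   # 3.077
needed_first = samples_to_balance(119, 39)      # 80
needed_second = samples_to_balance(40, 13)      # 27
```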
Fig 3. Schematic diagram illustrating the imbalanced learning workflow.
Top 10 performing oversamplers for DTS1 versus baseline (no oversampling and SMOTE) averaged across four classifiers.
| Rank | Sampler | AUC | F1 | G | AS | AUCRank | F1Rank | GRank | avgRank |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | 0.7655 | 0.5603 | 0.7151 | 0.6803 | 3 | 3 | 5 | 3.67 |
| 2 | | 0.7606 | 0.5635 | 0.7199 | 0.6813 | 12 | 1 | 1 | 4.67 |
| 3 | | 0.7629 | 0.5600 | 0.7143 | 0.6791 | 5 | 4 | 6 | 5.00 |
| 4 | | 0.7669 | 0.5585 | 0.7138 | 0.6797 | 2 | 10 | 7 | 6.33 |
| 5 | | 0.7619 | 0.5592 | 0.7127 | 0.6779 | 7 | 7 | 9 | 7.67 |
| 6 | | 0.7629 | 0.5573 | 0.7120 | 0.6774 | 6 | 13 | 12 | 10.33 |
| 7 | | 0.7670 | 0.5587 | 0.7095 | 0.6784 | 1 | 8 | 24 | 11.00 |
| 8 | | 0.7633 | 0.5597 | 0.7094 | 0.6775 | 4 | 5 | 25 | 11.33 |
| 9 | | 0.7584 | 0.5563 | 0.7167 | 0.6771 | 17 | 17 | 2 | 12.00 |
| 10 | | 0.7602 | 0.5587 | 0.7103 | 0.6764 | 13 | 9 | 20 | 14.00 |
| Baseline | | 0.7522 | 0.5436 | 0.7032 | 0.6663 | 49 | 46 | 41 | 45.33 |
| Baseline | | 0.6877 | 0.4041 | 0.5612 | 0.5510 | 72 | 74 | 74 | 73.33 |
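The AS and avgRank columns are not defined in this record, but the tabulated values match the arithmetic mean of the three metrics (AUC, F1, G) and of the three per-metric ranks, respectively. The sketch below reproduces the first row and the last baseline row under that assumption; the function names are illustrative and the definitions are inferred from the numbers, not stated by the authors.

```python
from statistics import mean


def average_score(auc, f1, g):
    """Mean of AUC, F1 and G (assumed definition of the AS column)."""
    return mean([auc, f1, g])


def average_rank(auc_rank, f1_rank, g_rank):
    """Mean of the three per-metric ranks (assumed definition of avgRank)."""
    return mean([auc_rank, f1_rank, g_rank])


# Rank 1 row of the DTS1 table.
top_as = round(average_score(0.7655, 0.5603, 0.7151), 4)    # 0.6803
top_avg_rank = round(average_rank(3, 3, 5), 2)              # 3.67

# Last baseline row of the DTS1 table.
base_as = round(average_score(0.6877, 0.4041, 0.5612), 4)   # 0.551
base_avg_rank = round(average_rank(72, 74, 74), 2)          # 73.33
```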
Top 10 performing oversamplers for DTS2 versus baseline (no oversampling and SMOTE) averaged across four classifiers.
| Rank | Sampler | AUC | F1 | G | AS | AUCRank | F1Rank | GRank | avgRank |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | 0.8903 | 0.7322 | 0.8501 | 0.8242 | 1 | 2 | 1 | 1.33 |
| 2 | | 0.8872 | 0.7337 | 0.8489 | 0.82327 | 3 | 1 | 3 | 2.33 |
| 3 | | 0.8854 | 0.7299 | 0.8494 | 0.82157 | 8 | 4 | 2 | 4.67 |
| 4 | | 0.885 | 0.7305 | 0.8424 | 0.8193 | 9 | 3 | 7 | 6.33 |
| 5 | | 0.8894 | 0.7261 | 0.8388 | 0.8181 | 2 | 6 | 11 | 6.33 |
| 6 | | 0.8856 | 0.7239 | 0.8459 | 0.81847 | 7 | 9 | 4 | 6.67 |
| 7 | | 0.8857 | 0.7226 | 0.8427 | 0.817 | 6 | 11 | 6 | 7.67 |
| 8 | | 0.8839 | 0.7242 | 0.8452 | 0.81777 | 11 | 8 | 5 | 8 |
| 9 | | 0.8798 | 0.7258 | 0.8424 | 0.816 | 21 | 7 | 8 | 12 |
| 10 | | 0.8823 | 0.7208 | 0.8409 | 0.81467 | 16 | 14 | 9 | 13 |
| Baseline | | 0.8779 | 0.7086 | 0.8286 | 0.80503 | 22 | 28 | 29 | 26.33 |
| Baseline | | 0.831 | 0.5662 | 0.6795 | 0.69223 | 69 | 71 | 71 | 70.33 |
Average and top-performing AS over all oversamplers for the four classifier types in DTS1 and DTS2.
| DTS1 | ||||
|---|---|---|---|---|
| Classifier | AUC | F1 | G | AS |
| k-NN | 0.7397 | 0.5225 | 0.6849 | 0.6490 |
| MLP | 0.7796 | 0.5597 | 0.7137 | 0.6843 |
| DT | 0.6597 | 0.4680 | 0.6369 | 0.5882 |
| SVM | 0.8170 | 0.6051 | 0.7430 | 0.7217 |
| DTS2 | ||||
| Classifier | AUC | F1 | G | AS |
| k-NN | 0.7948 | 0.8886 | 0.8130 | 0.8321 |
| MLP | 0.8124 | 0.9058 | 0.8300 | 0.8494 |
| DT | 0.6566 | 0.7116 | 0.7037 | 0.6907 |
| SVM | 0.8786 | 0.9462 | 0.8861 | 0.9037 |
Top-performing oversamplers ranked by AS for each of the four classifiers used.
| DTS1 | ||||||||
|---|---|---|---|---|---|---|---|---|
| Classifier | SVM | | DT | | k-NN | | MLP | |
| Rank | Sampler | AS | Sampler | AS | Sampler | AS | Sampler | AS |
| 1 | A_SUWO | 0.7588 | Borderline_SMOTE2 | 0.6253 | CURE_SMOTE | 0.6841 | Stefanowski | 0.7183 |
| 2 | Borderline_SMOTE1 | 0.7563 | MSMOTE | 0.6169 | polynom_fit_SMOTE | 0.6836 | polynom_fit_SMOTE | 0.7153 |
| 3 | SMOTE_ENN | 0.7509 | SMOTE_ENN | 0.6144 | NRAS | 0.6831 | SMOTE_D | 0.7148 |
| 4 | SL_graph_SMOTE | 0.7499 | SL_graph_SMOTE | 0.6111 | Gazzah | 0.6814 | CBSO | 0.7136 |
| 5 | Borderline_SMOTE2 | 0.7496 | ISOMAP_Hybrid | 0.6106 | Gaussian_SMOTE | 0.6786 | DE_oversampling | 0.7132 |
| 6 | SMOTE_TomekLinks | 0.747 | AND_SMOTE | 0.6103 | ProWSyn | 0.6777 | MWMOTE | 0.7132 |
| 7 | SDSMOTE | 0.7463 | Assembled_SMOTE | 0.6093 | SOI_CJ | 0.6766 | distance_SMOTE | 0.7095 |
| 8 | SMOTE_FRST_2T | 0.7436 | ADOMS | 0.6084 | MDO | 0.6749 | ISMOTE | 0.7091 |
| 9 | LN_SMOTE | 0.7431 | LN_SMOTE | 0.6083 | Lee | 0.6723 | SN_SMOTE | 0.7077 |
| 10 | SMOBD | 0.7417 | SMOBD | 0.6076 | LLE_SMOTE | 0.672 | ADOMS | 0.7055 |
| DTS2 | | | | | | | | |
| Classifier | SVM | | DT | | k-NN | | MLP | |
| Rank | Sampler | AS | Sampler | AS | Sampler | AS | Sampler | AS |
| 1 | SMOTE_Cosine | 0.93 | LVQ_SMOTE | 0.7221 | SMOTE_IPF | 0.8406 | Borderline_SMOTE2 | 0.8549 |
| 2 | Borderline_SMOTE1 | 0.9263 | Lee | 0.7207 | CE_SMOTE | 0.838 | cluster_SMOTE | 0.8525 |
| 3 | SDSMOTE | 0.9203 | SMOTE_D | 0.7107 | SMOTE_OUT | 0.838 | SMOTE_IPF | 0.852 |
| 4 | polynom_fit_SMOTE | 0.92 | SMOBD | 0.7054 | CBSO | 0.8365 | Edge_Det_SMOTE | 0.85 |
| 5 | G_SMOTE | 0.9198 | Assembled_SMOTE | 0.7017 | polynom_fit_SMOTE | 0.835 | SMOTE_FRST_2T | 0.8494 |
| 6 | SMOTE_OUT | 0.9198 | CE_SMOTE | 0.6978 | SMOTE_TomekLinks | 0.8338 | NDO_sampling | 0.8477 |
| 7 | Assembled_SMOTE | 0.9196 | G_SMOTE | 0.6974 | Selected_SMOTE | 0.832 | SMOTE_TomekLinks | 0.8475 |
| 8 | Lee | 0.9188 | NRSBoundary_SMOTE | 0.6968 | Borderline_SMOTE2 | 0.8284 | CURE_SMOTE | 0.8468 |
| 9 | MWMOTE | 0.9179 | polynom_fit_SMOTE | 0.6968 | MWMOTE | 0.8264 | CBSO | 0.8462 |
| 10 | cluster_SMOTE | 0.917 | Random_SMOTE | 0.696 | CURE_SMOTE | 0.8262 | SMOTE_D | 0.8449 |
Each row lists the oversampling techniques that gave the top results, in descending order of AS. This was done for both DTS1 and DTS2.
Comparison of operating principles over DTS1 and DTS2, based on the oversampler categories in S3 Table.
| | DTS1 | | DTS2 | |
|---|---|---|---|---|
| Rank | Operating Principle | AS | Operating Principle | AS |
| 1 | Ordinary Sampling | 0.6689 | Application | 0.8004 |
| 2 | Density Based | 0.6677 | Uses Classifier | 0.8002 |
| 3 | Application | 0.6656 | Ordinary Sampling | 0.7949 |
| 4 | Uses Clustering | 0.6644 | Uses Clustering | 0.7940 |
| 5 | Componentwise Sampling | 0.6626 | Componentwise Sampling | 0.7928 |
| 6 | Borderline | 0.6624 | Density Based | 0.7884 |
| 7 | Uses Classifier | 0.6614 | Borderline | 0.7861 |
| 8 | Memetic | 0.6597 | Sampling By Cloning | 0.7807 |
| 9 | Dimensionality Reduction | 0.6563 | Changes Majority | 0.7624 |
| 10 | Changes Majority | 0.6549 | Noise Removal | 0.7614 |
| 11 | Noise Removal | 0.6505 | Memetic | 0.7600 |
| 12 | Sampling By Cloning | 0.6489 | Dimensionality Reduction | 0.7506 |
| 13 | Density Estimation | 0.6383 | Density Estimation | 0.7310 |