Venkata Pavan Kumar Turlapati, Manas Ranjan Prusty.
Abstract
Class imbalance is a recurring predicament in present-day datasets. Classifiers trained on such data tend to become biased towards the majority classes, which reduces their performance. This setback is often tackled with over-sampling or under-sampling algorithms, among which the Synthetic Minority Oversampling Technique (SMOTE) stands out. SMOTE generates synthetic samples of the minority class as linear combinations of each minority data-point and its existing minority-class neighbors, and every minority data-point generates an equal number of synthetic samples. With the world suffering from the COVID-19 pandemic, the authors applied this idea to help boost classification performance when detecting the virus. This paper presents a modified version of SMOTE, known as Outlier-SMOTE, in which each data-point is oversampled according to its distance from the other data-points: a data-point that lies farther from the others is given greater importance and is oversampled more than its counterparts. Outlier-SMOTE reduces the chance of overlap among minority data samples, which often occurs with the traditional SMOTE algorithm. The method is tested on five benchmark datasets and then on a COVID-19 dataset. F-measure, Recall and Precision are used as the principal metrics to evaluate classifier performance, as is usual for class-imbalanced datasets. The proposed algorithm performs considerably better than the traditional SMOTE algorithm on the considered datasets.
Keywords: COVID-19; Classification; Imbalanced dataset; Over-sampling; SMOTE
Year: 2020; PMID: 33289013; PMCID: PMC7710484; DOI: 10.1016/j.ibmed.2020.100023
Source DB: PubMed; Journal: Intell Based Med; ISSN: 2666-5212
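The SMOTE step described in the abstract (a synthetic sample formed as a linear combination of a minority data-point and one of its minority-class neighbors) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `smote_sample`, the neighbor count `k`, and the toy points are assumptions made for the example.

```python
import math
import random


def smote_sample(point, minority, k=2, rng=random):
    """Return one synthetic sample on the segment between `point`
    and one of its k nearest minority-class neighbors."""
    # Candidate neighbors: every minority point except the point itself.
    others = [p for p in minority if p is not point]
    # Keep the k closest ones by Euclidean distance.
    neighbors = sorted(others, key=lambda p: math.dist(point, p))[:k]
    neighbor = rng.choice(neighbors)
    # Random interpolation factor in [0, 1).
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(point, neighbor)]


# Hypothetical minority-class points (2 features each).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [6.0, 6.0]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = smote_sample(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

In plain SMOTE this function would be called the same number of times for every minority data-point; Outlier-SMOTE, as described in the abstract, varies that count per point.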
Fig. 1 An illustration describing the stage at which the algorithm works.
Fig. 2 Workflow of Outlier-SMOTE.
An example showing the working of Outlier-SMOTE algorithm.
| N | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Row Sum of Euclidean Distance Matrix | 96 | 159 | 51 | 264 | 192 |
| Normalized Matrix | 0.12598 | 0.20866 | 0.06693 | 0.34645 | 0.25196 |
| Rounded Off Values | 0.13 | 0.21 | 0.07 | 0.35 | 0.25 |
| Oversampling Rates | 3 | 5 | 2 | 9 | 6 |
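The arithmetic in the table above can be reproduced with a short sketch. The normalization and two-decimal rounding follow the table rows; the final step is an assumption: the listed rates sum to 25, so the sketch treats 25 as the total number of synthetic samples to distribute and rounds each share to the nearest integer. The function names are hypothetical.

```python
import math


def distance_sums(points):
    """Sum of Euclidean distances from each point to all the others."""
    return [sum(math.dist(p, q) for q in points) for p in points]


def rates_from_distance_sums(dist_sums, total_synthetic):
    """Turn per-point distance sums into per-point oversampling counts."""
    total = sum(dist_sums)
    # Normalize, then round to two decimals as in the table.
    weights = [round(s / total, 2) for s in dist_sums]
    # Assumption: each point receives its rounded share of the total budget.
    return [round(w * total_synthetic) for w in weights]


# Distance sums taken from the table; 25 is the inferred total budget.
rates = rates_from_distance_sums([96, 159, 51, 264, 192], total_synthetic=25)
print(rates)  # [3, 5, 2, 9, 6]
```

Because each share is rounded independently, the counts are not guaranteed to add up to the budget for other inputs; a real implementation would need a rule for distributing any remainder.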
Illustration of the confusion matrix.
| | Predicted +ve | Predicted -ve |
|---|---|---|
| Actual +ve | TP | FN |
| Actual -ve | FP | TN |
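The three metrics named in the abstract follow directly from the cells of this confusion matrix. The sketch below uses the standard definitions of Precision, Recall and F-measure; the counts in the usage lines are made up for illustration and are not results from the paper.

```python
def precision_recall_f1(tp, fn, fp, tn):
    """Compute Precision, Recall and F1 from confusion-matrix counts.

    `tn` is accepted for completeness but does not enter these metrics,
    which is one reason they are preferred for imbalanced data.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 50 actual positives, 950 actual negatives.
precision, recall, f1 = precision_recall_f1(tp=40, fn=10, fp=20, tn=930)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```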
Description of the datasets.
| DATASET | DESCRIPTION | MIN : MAJ | # SAMPLES | # FEATURES |
|---|---|---|---|---|
| ECOLI DATASET | Classification of proteins based on their amino acid sequences. | 35 : 336 (1 : 9) | 371 | 8 |
| ABALONE DATASET | Prediction of the age of abalone | 42 : 689 (1 : 16) | 731 | 8 |
| YEAST DATASET | Contains the data of localization of yeast bacteria | 51 : 1270 (1 : 24) | 1321 | 9 |
| WINE QUALITY DATASET | Signifies the quality of white wine | 175 : 4898 (1 : 27) | 5073 | 11 |
| MAMMOGRAPHY DATASET | Test for breast cancer | 260 : 11,183 (1 : 42) | 11,443 | 6 |
Fig. 3 Bar Graph illustrating the number of majority and minority samples.
An illustration of 5-Fold Cross Validation.
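As a rough sketch of what 5-fold cross validation involves, the data indices are shuffled, split into five folds, and each fold serves once as the test set while the other four form the training set. The helper name `k_fold_indices` and the sample count are assumptions for the example; the source does not specify how the folds were built.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

When oversampling is combined with cross validation, a common practice is to oversample only the training indices of each split so that synthetic points never leak into the test fold; whether the paper does this is not stated in the text above.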
Results obtained on ECOLI dataset.
| Recall | Precision | F1 - Score | |||||||
|---|---|---|---|---|---|---|---|---|---|
| OS Rate | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN |
| 100 | 99.4 | 99.8 | 88.9 | 88.6 | 94.14 | 94.14 | 93.9 | ||
| 200 | 93.6 | 96.4 | 85.8 | 85.2 | 90.5 | 90.4 | |||
| 300 | 88.6 | 91.1 | 85.1 | 88.9 | 88.6 | 89.4 | |||
| 400 | 87.7 | 89.4 | 91.23 | 88.8 | 88.5 | 88.6 | |||
| 500 | 84.33 | 87.6 | 93.3 | 90.1 | 87 | 88.5 | |||
Results obtained on ABALONE dataset.
| Recall | Precision | F1 - Score | |||||||
|---|---|---|---|---|---|---|---|---|---|
| OS Rate | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN |
| 100 | 99 | 99.89 | 84 | 86.2 | 91.3 | 92.5 | |||
| 200 | 94 | 97.6 | 81 | 81 | 87.6 | 89.3 | |||
| 300 | 89.5 | 88 | 84.7 | 79.6 | 86.3 | 84.4 | |||
| 400 | 85 | 85.4 | 84.8 | 81.1 | 85.3 | 83.4 | |||
| 500 | 82.5 | 81.7 | 84.8 | 80.8 | 83.2 | 81.7 | |||
Results obtained on YEAST dataset.
| Recall | Precision | F1 - Score | |||||||
|---|---|---|---|---|---|---|---|---|---|
| OS Rate | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN |
| 100 | 99.99 | 99.9 | 99.9 | 93.1 | 94.2 | 96.06 | 97.0 | ||
| 200 | 99.99 | 99.9 | 99.8 | 92.4 | 91.9 | 95.1 | 95.8 | ||
| 300 | 99.3 | 99.6 | 89.6 | 89.6 | 94.2 | 94.1 | |||
| 400 | 96.6 | 98.8 | 86.6 | 88.6 | 91.3 | 93.0 | |||
| 500 | 96.7 | 97.2 | 87.4 | 86.4 | 92.04 | 91.5 | |||
Results obtained on WINE-QUALITY dataset.
| Recall | Precision | F1 - Score | |||||||
|---|---|---|---|---|---|---|---|---|---|
| OS Rate | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN |
| 100 | 94.9 | 95.1 | 86.8 | 87.2 | 90.7 | 91 | |||
| 200 | 92.9 | 93 | 84.5 | 86.6 | 88.5 | 87 | |||
| 300 | 87.6 | 88.5 | 83.8 | 84 | 85.7 | 85.2 | |||
| 400 | 84.9 | 85 | 82.6 | 82.8 | 83.7 | 83.9 | |||
| 500 | 81.6 | 81.3 | 83.5 | 82.9 | 82.55 | 83.1 | |||
Results obtained on MAMMOGRAPHY dataset.
| Recall | Precision | F1 - Score | |||||||
|---|---|---|---|---|---|---|---|---|---|
| OS Rate | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN |
| 100 | 99.4 | 99.5 | 97.7 | 98.2 | 98.5 | 98.9 | |||
| 200 | 99.3 | 99.4 | 97.6 | 97.0 | 98.5 | 98.2 | |||
| 300 | 99.19 | 99.2 | 96.95 | 95.5 | 98.0 | 97.3 | |||
| 400 | 98.9 | 98.9 | 96.2 | 94.3 | 97.6 | 96.5 | |||
| 500 | 98.5 | 98.6 | 95.6 | 93.4 | 97.1 | 95.9 | |||
Fig. 4 Correlation Matrix denoting the importance of the various features in the considered COVID-19 dataset.
Results obtained on COVID-19 dataset.
| Recall | Precision | F1 - Score | |||||||
|---|---|---|---|---|---|---|---|---|---|
| OS Rate | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN | O-SMOTE | SMOTE | ADASYN |
| 100 | 69 | 64.2 | 69.5 | 69.2 | 68.1 | 58.0 | |||
| 200 | 71.3 | 70.2 | 77.3 | 76.6 | 74.1 | 71.4 | |||
| 300 | 82.13 | 81.5 | 80.8 | 80.8 | 81.44 | 80.8 | |||
| 400 | 86.7 | 87.45 | 84.12 | 83.9 | 85.37 | 85.5 | |||
| 500 | 88 | 86.9 | 85.5 | 85.07 | 86.77 | 85.86 | |||
Fig. 5 Feature Importance of each attribute in the dataset.
Fig. 6 SHAP values for each feature in the dataset.