Garba Abdulrauf Sharifai, Zurinahni Zainol.
Abstract
Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding when samples are limited but the number of features is massive (high dimensionality). High dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced class or high dimensional data sets and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the high dimensional and imbalanced class problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA). rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to formulate the feature selection process as an optimisation problem and to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by statistical analysis, indicate that rCBR-BGOA can improve classification performance for high dimensional and imbalanced datasets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
Keywords: Grasshopper optimisation algorithm; class-imbalanced dataset; high dimensionality; multi-filter
Year: 2020 PMID: 32605144 PMCID: PMC7397300 DOI: 10.3390/genes11070717
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Schematic illustration of the proposed approach for high dimensional imbalanced data sets using feature selection.
Confusion matrix.
| | Predicted Positive (Class 1) | Predicted Negative (Class 2) |
|---|---|---|
| Actual positive class | TP (True Positive) | FN (False Negative) |
| Actual negative class | FP (False Positive) | TN (True Negative) |
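The G-mean, sensitivity, and specificity reported in the result tables all derive from the four cells of this confusion matrix. The following is a minimal sketch of the usual binary definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), G-mean = their geometric mean). The function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual positives predicted positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of actual negatives predicted negative."""
    return tn / (tn + fp)


def g_mean_from_rates(sens: float, spec: float) -> float:
    """Geometric mean of sensitivity and specificity (binary case)."""
    return math.sqrt(sens * spec)


def g_mean(tp: int, fn: int, fp: int, tn: int) -> float:
    """G-mean computed directly from the four confusion-matrix cells."""
    return g_mean_from_rates(sensitivity(tp, fn), specificity(tn, fp))


# Hypothetical counts: 20 actual positives, 40 actual negatives.
print(g_mean(tp=18, fn=2, fp=4, tn=36))

# Rates reported for rCBR on Colon (Sens 97.1, Spec 72.6), as fractions.
print(g_mean_from_rates(0.971, 0.726))
```

Because the G-mean is a product of the two per-class rates, it drops towards zero whenever either class is poorly recognised, which is why it is preferred over plain accuracy on imbalanced data. Applied to the Colon rates for rCBR, the sketch gives roughly 0.84, in line with the 83.9 listed in the comparison table.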
Data sets and their characteristics used in this paper for evaluation (IR: imbalance ratio).
| Data Sets | #Features | #Samples | #Classes | IR |
|---|---|---|---|---|
| Colon | 2000 | 62 | 2 | 1.82 |
| DLBCL | 7129 | 59 | 2 | 1.04 |
| CNS | 7129 | 60 | 2 | 1.86 |
| Leukaemia | 7129 | 72 | 2 | 1.88 |
| Breast | 24482 | 97 | 2 | 1.11 |
| CAR | 9182 | 174 | 2 | 14.82 |
| LUNG | 12534 | 181 | 2 | 4.84 |
| GLIOMA | 4433 | 50 | 2 | 2.43 |
| SRBCT | 2308 | 83 | 2 | 6.55 |
| Brain_Tumor1 | 5920 | 90 | 5 | 15.00 |
| Brain_Tumor2 | 10367 | 50 | 4 | 2.14 |
| SRBCT_4 | 2308 | 83 | 4 | 2.64 |
| LUNG Cancer | 12601 | 203 | 5 | 23.17 |
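The IR column can be read as the ratio between the largest and the smallest class. The sketch below assumes that common definition (majority count divided by minority count, extended to multi-class data as largest over smallest); the paper's exact formula is not given in this excerpt. The label lists are illustrative only, since the table does not list per-class sample counts.

```python
from collections import Counter
from typing import Hashable, Sequence


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Illustrative binary split of 62 samples into 40 and 22 (assumed counts).
binary_labels = ["majority"] * 40 + ["minority"] * 22
print(round(imbalance_ratio(binary_labels), 2))

# Illustrative multi-class case: only the largest and smallest classes matter.
multi_labels = ["a"] * 30 + ["b"] * 12 + ["c"] * 6
print(imbalance_ratio(multi_labels))
```

A 40/22 split of 62 samples gives 1.82, the value listed for Colon. An IR close to 1 (for example DLBCL at 1.04) indicates nearly balanced classes, while values above 10 (CAR, Brain_Tumor1, LUNG Cancer) indicate severe imbalance.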
Comparison of the proposed rCBR method against single filter-based methods in terms of G-mean, AUC, Sensitivity (Sens) and specificity (Spec) performance measures.
| Dataset | Measures | ReliefF | Fisher Score | Chi Square | rCBR |
|---|---|---|---|---|---|
| Breast | G-mean | 66.9 | | 61.3 | 66.7 |
| | AUC | 66.7 | 67.3 | 61.5 | |
| | Sens | 61.6 | 68.1 | 59.7 | 65.2 |
| | Spec | 71.8 | 66.4 | 63.4 | 69.3 |
| Colon | G-mean | 80.5 | 82.1 | 68.1 | 83.9 |
| | AUC | 81.1 | 82.9 | 69.4 | 84.9 |
| | Sens | 80.4 | 85.1 | 70.9 | 97.1 |
| | Spec | 81.6 | 80.7 | 68.0 | 72.6 |
| CAR | AUC | 30.8 | 73.2 | 71.5 | 83.9 |
| | G-mean | 30.7 | 67.7 | 50.9 | 73.5 |
| | Sens | 30.7 | 47.1 | 46.6 | 70.0 |
| | Spec | 95.9 | 99.4 | 96.4 | 97.6 |
| CNS | AUC | 77.1 | 74.2 | 66.4 | 85.6 |
| | G-mean | 76.4 | 71.1 | 65.6 | 85.1 |
| | Sens | 78.3 | 55.1 | 54.0 | 83.8 |
| | Spec | 75.9 | 93.3 | 79.7 | 88.1 |
| DLBCL | AUC | 98.0 | 96.7 | 86.0 | 98.0 |
| | G-mean | 97.9 | 96.5 | 85.1 | 97.9 |
| | Sens | 96.0 | 93.3 | 83.4 | 96.0 |
| | Spec | 100 | 100 | 66.7 | 100 |
| Leukemia | AUC | 89.7 | 97.2 | 87.6 | 99.8 |
| | G-mean | 89.1 | 97.0 | 86.9 | 99.0 |
| | Sens | 100 | 100 | 77.6 | 100 |
| | Spec | 79.6 | 94.4 | 97.5 | 98.8 |
| Lung | AUC | 97.8 | 99.0 | 94.9 | 97.3 |
| | G-mean | 97.8 | 99.0 | 94.7 | 97.2 |
| | Sens | 95.6 | 98.0 | 89.8 | 94.5 |
| | Spec | 100 | 100 | 100 | 100 |
| Glioma | AUC | 30.8 | 81.7 | 30.4 | 95.0 |
| | G-mean | 30.9 | 81.7 | 30.9 | 95.0 |
| | Sens | 30.7 | 63.3 | 30.7 | 90.0 |
| | Spec | 89.3 | 100 | 86.1 | 100 |
| SRBCT | AUC | 98.7 | 96.7 | 90.8 | 100 |
| | G-mean | 98.6 | 96.3 | 89.9 | 100 |
| | Sens | 100 | 93.3 | 81.7 | 100 |
| | Spec | 97.4 | 100 | 100 | 100 |
| Brain Tumour1 | AUC | 89.6 | 85.9 | 74.0 | 86.2 |
| | G-mean | 88.9 | 85.1 | 72.9 | 86.2 |
| | Sens | 100 | 83.3 | 70.0 | 90.0 |
| | Spec | 79.1 | 88.5 | 78.0 | 82.4 |
Figure 2. (a–i) The G-mean accuracy results of rCBR-BGOA with different population sizes for the Breast, CAR, CNS, Colon, DLBCL, GLIOMA, Lung, Leukemia, and SRBCT datasets. (a) Breast dataset. (b) CAR dataset. (c) CNS dataset. (d) Colon dataset. (e) DLBCL dataset. (f) GLIOMA dataset. (g) Lung dataset. (h) Leukemia dataset. (i) SRBCT dataset.
Experimental results of rCBR-BGOA in comparison with other methods on the G-mean metric. F/5, 2F/5, and 3F/5 are 20%, 40%, and 60% of the total features in F.
| Data Sets | d | rCBR-BGOA | SYMON | SSVM-FS | FRHS | SVM-RFE | SVM-BFE | D-HELL | SMOTE-RLF | SMOTE-PCA |
|---|---|---|---|---|---|---|---|---|---|---|
| BREAST | F/5 | 92.8 | 62.6 | - | - | 56.0 | 66.4 | 62.6 | 52.4 | 52.4 |
| | 2F/5 | 91.8 | 62.6 | - | - | 62.6 | 58.0 | 62.6 | 58.5 | 52.4 |
| | 3F/5 | 90.5 | 62.6 | - | - | 62.6 | 52.0 | 66.4 | 52.4 | 52.4 |
| CAR | F/5 | 97.1 | 93.5 | 96.4 | 100 | 90.5 | 95.3 | 88.7 | 98.5 | 90.5 |
| | 2F/5 | 97.4 | 93.5 | 98.2 | 100 | 92.4 | 93.6 | 90.5 | 95.3 | 90.5 |
| | 3F/5 | 97.4 | 93.5 | 98.2 | 96.5 | 92.4 | 92.4 | 88.7 | 95.3 | 90.5 |
| CNS | F/5 | 95.2 | 79.0 | - | - | 74.5 | 74.5 | 70.7 | 57.7 | 62.4 |
| | 2F/5 | 94.1 | 79.0 | - | - | 74.5 | 69.7 | 74.5 | 66.7 | 74.5 |
| | 3F/5 | 94.2 | 79.0 | - | - | 74.5 | 74.5 | 74.5 | 74.5 | 74.5 |
| Colon | F/5 | 94.6 | 67.4 | 78.2 | 74.6 | 60.0 | 56.0 | 67.0 | 60.3 | 58.5 |
| | 2F/5 | 93.1 | 71.5 | 71.4 | 74.6 | 60.0 | 60.0 | 60.0 | 60.3 | 63.0 |
| | 3F/5 | 93.1 | 67.4 | 71.4 | 74.6 | 56.0 | 56.0 | 64.0 | 60.3 | 63.0 |
| DLBCL | F/5 | 100 | 29.6 | 54.3 | 76.2 | 25.0 | 25.0 | 22.3 | 27.4 | 38.7 |
| | 2F/5 | 100 | 29.6 | 58.8 | 76.2 | 27.3 | 25.0 | 25.0 | 27.4 | 54.7 |
| | 3F/5 | 100 | 29.6 | 62.6 | 78.4 | 27.3 | 25.0 | 25.0 | 2.4 | 29.6 |
| Leukemia | F/5 | 99.4 | 100 | - | - | 31.6 | 31.6 | 50.0 | 0.7 | 0.7 |
| | 2F/5 | 99.7 | 100 | - | - | 31.6 | 31.6 | 83.6 | 0.7 | 0.7 |
| | 3F/5 | 99.7 | 100 | - | - | 31.6 | 31.6 | 44.7 | 0.7 | 0.7 |
| LUNG | F/5 | 100 | 100 | - | - | 97.3 | 100 | 100 | 96.8 | 96.8 |
| | 2F/5 | 100 | 100 | - | - | 97.3 | 100 | 100 | 96.8 | 96.8 |
| | 3F/5 | 100 | 100 | - | - | 97.3 | 100 | 100 | 96.8 | 96.8 |
| GLIOMA | F/5 | 96.6 | 88.7 | 85.6 | 91.5 | 75.4 | 92.6 | 72.6 | 82.5 | 84.2 |
| | 2F/5 | 96.5 | 85.4 | 83.4 | 92.8 | 73.8 | 90.2 | 80.6 | 82.5 | 84.2 |
| | 3F/5 | 97.3 | 81.3 | 79.3 | 89.6 | 73.2 | 86.7 | 82.6 | 79.4 | 84.2 |
SYMON: symmetrical uncertainty and harmony search; FS: feature selection; FRHS: feature ranking based on harmony search; RFE: recursive feature elimination; BFE: backward feature elimination; HELL: Hellinger distance; RLF: ReliefF; PCA: principal component analysis.
AUC performance metric on microarray datasets for the different methods. F/5 means 20% of the total features in F.
| Data Sets | d | rCBR-BGOA | SYMON | SVM-RFE | SVM-BFE | D-HELL | SMOTE-RLF | SMOTE-PCA |
|---|---|---|---|---|---|---|---|---|
| BRE | F/5 | 78.1(0.2) | 79.2(0.8) | 75.0(0.2) | 29.1(0.4) | 75.0(0.4) | 65.2(0.8) | 59.3(0.2) |
| CAR | F/5 | 96.9(0.2) | 100(0.2) | 75.0(0.2) | 29.1(0.4) | 75.0(0.4) | 93.7(0.2) | 93.7(0.2) |
| CNS | F/5 | 94.1(0.2) | 65.2(0.2) | 46.0(0.6) | 65.2(0.4) | 65.2(0.6) | 75.2(0.2) | 75.0(0.4) |
| Col | F/5 | 96.4(0.2) | 72.0(0.8) | 61.3(0.2) | 67.0(0.4) | 73.8(0.4) | 61.6(0.2) | 67.6(0.8) |
| DLBCL | F/5 | 100(0.2) | 78.7(0.2) | 68.7(0.6) | 68.7(0.4) | 68.7(0.6) | 63.4(0.2) | 63.7(0.8) |
| LEU | F/5 | 99.0(0.2) | 93.5(0.2) | 87.5(0.2) | 87.5(0.2) | 87.5(0.2) | 87.5(0.2) | 87.5(0.2) |
| LUG | F/5 | 100(0.2) | 96.8(0.2) | 96.8(0.2) | 96.8(0.2) | 96.8(0.2) | 96.8(0.2) | 96.8(0.2) |
| SRBCT | F/5 | 100(0.2) | 100(0.2) | 100(0.2) | 100(0.2) | 100(0.2) | 100(0.2) | 100(0.2) |
Wilcoxon signed-rank test for G-mean evaluation metric.
| Evaluation Metric | Comparison | Hypothesis Decision | p-value | Significant Difference |
|---|---|---|---|---|
| G-mean | rCBR-BGOA vs. SYMON | Reject at 5% | 2.2689 × 10⁻⁴ (1) | Yes |
| | rCBR-BGOA vs. SSVM-FS | Reject at 5% | 0.0049 (1) | Yes |
| | rCBR-BGOA vs. FRHS | Retain at 5% | 0.0078 (1) | No |
| | rCBR-BGOA vs. SVM-RFE | Reject at 5% | 1.8162 × 10⁻⁵ (1) | Yes |
| | rCBR-BGOA vs. SVM-BFE | Reject at 5% | 3.8662 × 10⁻⁵ (1) | Yes |
| | rCBR-BGOA vs. D-HELL | Reject at 5% | 3.8767 × 10⁻⁵ (1) | Yes |
| | rCBR-BGOA vs. SMOTE-ReliefF | Reject at 5% | 2.0645 × 10⁻⁵ (1) | Yes |
| | rCBR-BGOA vs. SMOTE-PCA | Reject at 5% | 1.816 × 10⁻⁵ (1) | Yes |
Wilcoxon signed-rank test for AUC evaluation metric.
| Evaluation Metric | Comparison | Hypothesis Decision | p-value | Significant Difference |
|---|---|---|---|---|
| AUC | rCBR-BGOA vs. SYMON | Retain at 5% | 0.0781 | No |
| | rCBR-BGOA vs. SVM-RFE | Reject at 5% | 0.0156 | Yes |
| | rCBR-BGOA vs. SVM-BFE | Reject at 5% | 0.0156 | Yes |
| | rCBR-BGOA vs. D-HELL | Reject at 5% | 0.0156 | Yes |
| | rCBR-BGOA vs. SMOTE-ReliefF | Reject at 5% | 0.0156 | Yes |
| | rCBR-BGOA vs. SMOTE-PCA | Reject at 5% | 0.0156 | Yes |
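To illustrate how p-values of this kind arise, the sketch below implements a two-sided exact Wilcoxon signed-rank test in plain Python and applies it to the F/5 AUC values listed in the AUC comparison table (taking the number before each parenthesis). The source does not state which software or test settings were used, so two choices here are assumptions: zero differences are dropped, and the exact null distribution is used. Tied absolute differences are rejected instead of mid-ranked, to keep the sketch short. The function name is hypothetical.

```python
def wilcoxon_signed_rank_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Assumptions of this sketch: zero differences are dropped, and tied
    absolute differences raise an error instead of being mid-ranked.
    Returns (W, p) with W = min(W+, W-).
    """
    # Paired differences, rounded to suppress floating-point noise.
    diffs = [round(a - b, 6) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    magnitudes = sorted(abs(d) for d in diffs)
    if len(set(magnitudes)) != n:
        raise ValueError("tied |differences| are not handled in this sketch")
    rank = {m: i + 1 for i, m in enumerate(magnitudes)}

    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank[abs(d)] for d in diffs if d < 0)
    w = min(w_plus, w_minus)

    # counts[s] = number of subsets of ranks {1..n} whose sum equals s.
    # Under the null hypothesis each of the 2**n sign patterns is equally likely.
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]

    p = min(1.0, 2 * sum(counts[: w + 1]) / 2 ** n)
    return w, p


# AUC at F/5 for BRE, CAR, CNS, Col, DLBCL, LEU, LUG, SRBCT.
rcbr_bgoa = [78.1, 96.9, 94.1, 96.4, 100.0, 99.0, 100.0, 100.0]
symon = [79.2, 100.0, 65.2, 72.0, 78.7, 93.5, 96.8, 100.0]
svm_rfe = [75.0, 75.0, 46.0, 61.3, 68.7, 87.5, 96.8, 100.0]

print(round(wilcoxon_signed_rank_exact(rcbr_bgoa, symon)[1], 4))    # 0.0781
print(round(wilcoxon_signed_rank_exact(rcbr_bgoa, svm_rfe)[1], 4))  # 0.0156
```

Under these assumptions the sketch gives 0.0781 for SYMON and 0.0156 for SVM-RFE, the same values as in the table. With only seven non-zero pairs, 0.0156 (2/128) is the smallest two-sided p-value the exact test can produce, which is why every comparison that rCBR-BGOA wins or ties on all data sets shares that value.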
The total execution time of rCBR-BGOA in comparison with similar methods across data sets.
| Algorithm | 2K (COL) | 7K (DLBCL) | 12K (LUG) | 24K (BC) |
|---|---|---|---|---|
| SVM-RFE | 2.6 | 7.58 | 16.67 | 28.68 |
| SVM-BFE | 80.64 | 358.52 | 25,066.6 | 48,569.73 |
| D-HELL | 2.51 | 7.57 | 12.57 | 21.58 |
| SYMON | 289 | 2622 | 17,023 | 31,805 |
| SMOTE-RLF | 5.045 | 11.778 | 196.05 | 92.32 |
| SMOTE-PCA | 2.755 | 11.305 | 59.466 | 1134.15 |
| rCBR-BGOA | 12.92 | 17.39 | 143.19 | 76.67 |
G-mean predictive accuracy of rCBR-BGOA compared with other state-of-the-art methods.
| Data Sets | rCBR-BGOA | EnSVM-OAA(RUS) | C-E-MWELM |
|---|---|---|---|
| Brain-Tumor1 | 97.9 | 40.3 | 83.0 |
| Brain-Tumor2 | 98.8 | 64.6 | 92.4 |
| Lung-Cancer | 96.9 | 96.2 | 97.2 |
| SRBCT | 100 | 100 | 99.9 |
OAA(RUS): one against all random undersampling; MWELM: modified weighted extreme learning machine.