Jue Zhang, Li Chen, Fazeel Abid.
Abstract
To address the two-class imbalance problem in breast cancer diagnosis, we propose K-Boosted C5.0, a hybrid of K-means and Boosted C5.0 that is based on undersampling. K-means is used to select the informative samples near the class boundary. During the training phase, the K-means algorithm clusters the majority and minority instances and selects a similar number of instances from each cluster. Boosted C5.0 is then used as the classifier. Because instance selection via clustering encourages diversity in the training subspace, K-Boosted C5.0 is expected to perform better. To test the new hybrid classifier, we evaluate it on 12 small-scale and 2 large-scale datasets that are commonly used in class-imbalanced learning. The experimental results show that the proposed hybrid method outperforms most of the competing algorithms in terms of Matthews' correlation coefficient (MCC) and accuracy. It can be a good alternative to well-known machine learning methods.
Year: 2019 PMID: 31737241 PMCID: PMC6817921 DOI: 10.1155/2019/7294582
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Figure 1. Block diagram for the proposed classification model.
Algorithm 1. The clustering-based undersampling procedure.
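The body of Algorithm 1 is not reproduced here, so the following is only a minimal sketch of the idea stated in the abstract: cluster the instances with K-means and draw a similar number from each cluster. It makes several assumptions that are not from the paper: only the majority class is reduced, instances are drawn uniformly at random inside each cluster (the boundary-aware selection is not modelled), K-means is a plain Lloyd's iteration, and the names `kmeans_labels` and `cluster_undersample` as well as the toy data are hypothetical.

```python
import random
from math import dist


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority points, drawn evenly across k clusters."""
    rng = random.Random(seed)
    labels = kmeans_labels(majority, k, seed=seed)
    clusters = [[p for p, lab in zip(majority, labels) if lab == c] for c in range(k)]
    clusters = [members for members in clusters if members]
    quota = max(1, n_keep // len(clusters))  # similar share per cluster
    kept = []
    for members in clusters:
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Hypothetical toy data: 90 majority points in three blobs, 9 minority points.
rng = random.Random(1)
majority = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
            for cx, cy in [(0, 0), (4, 0), (2, 4)] for _ in range(30)]
minority = [(rng.gauss(2, 0.3), rng.gauss(1.5, 0.3)) for _ in range(9)]

reduced_majority = cluster_undersample(majority, n_keep=len(minority))
training_set = [(p, 0) for p in reduced_majority] + [(p, 1) for p in minority]
print(len(majority), "->", len(reduced_majority), "majority instances kept")
```

The reduced training set would then be passed to the classifier (Boosted C5.0 in the paper); that step is outside the scope of this sketch.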
Experimental datasets.
| Datasets | No. of data samples | No. of features | Imbalance ratio |
|---|---|---|---|
| Small-scale datasets | | | |
| (1) Abalone | 731 | 8 | 16.4 |
| (2) Bcwo | 683 | 9 | 1.8577 |
| (3) Pima | 336 | 8 | 2.027 |
| (4) Redwine1 | 837 | 11 | 3.21 |
| (5) Redwine2 | 880 | 11 | 3.42 |
| (6) Redwine3 | 734 | 11 | 12.85 |
| (7) Redwine4 | 691 | 11 | 12.04 |
| (8) Wbcd | 569 | 30 | 1.8 |
| (9) Whitewine | 1043 | 11 | 5.4 |
| (10) Yeast1 | 707 | 8 | 1.8975 |
| (11) Yeast2 | 626 | 8 | 2.840 |
| (12) Yeast3 | 892 | 8 | 1.08 |
| Large-scale datasets | | | |
| (1) Breast cancer | 102294 | 117 | 163.19 |
| (2) Protein homology prediction | 145751 | 74 | 111.46 |
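The table does not define the imbalance ratio. A common convention, assumed here rather than taken from the paper, is the number of majority instances divided by the number of minority instances. The sketch below computes that ratio from a label list and, in the other direction, estimates the minority count implied by a sample size and a ratio. The function names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority count divided by minority count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def implied_minority_count(n_samples, ratio):
    """Minority size implied by n_samples and ratio under the same definition."""
    return round(n_samples / (1 + ratio))


labels = [0] * 90 + [1] * 10          # hypothetical two-class label vector
print(imbalance_ratio(labels))        # 9.0
print(implied_minority_count(731, 16.4))
```

Under this assumed definition, the Abalone row (731 samples, ratio 16.4) would correspond to roughly 42 minority instances.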
Confusion matrix.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
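The measures reported in the comparison tables can all be derived from these four counts. The sketch below uses the standard textbook definitions of accuracy, sensitivity, specificity, and MCC; it also returns the geometric mean of sensitivity and specificity, a measure often reported in imbalanced learning. The function name and the example counts are hypothetical and are not taken from the paper.

```python
from math import sqrt


def classification_metrics(tp, fn, fp, tn):
    """Standard measures computed from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    g_mean = sqrt(sensitivity * specificity)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
        "mcc": mcc,
    }


metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)  # made-up counts
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC uses all four cells of the matrix, which is why it stays informative when one class dominates and accuracy alone would look optimistic.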
Performance comparison based on Wbcd and Bcwo datasets.
| Dataset | Method | Accuracy | Sensitivity | Specificity | G-mean | AUC | MCC |
|---|---|---|---|---|---|---|---|
| Wbcd | K-Boosted C5.0 | | 0.9375 | | | | |
| | SMOTEBoost | 0.964 | | 0.978 | 0.9619 | 0.963 | 0.924 |
| | RUSBoost | 0.944 | 0.93 | 0.954 | 0.942 | 0.942 | 0.886 |
| | SMOTE-Boosted C5.0 | 0.925 | 0.939 | 0.911 | 0.9248 | 0.925 | 0.847 |
| Bcwo | K-Boosted C5.0 | | | | | | |
| | SMOTEBoost | 0.92 | 0.98 | 0.89 | 0.934 | 0.933 | 0.839 |
| | RUSBoost | 0.936 | 0.926 | 0.944 | 0.9350 | 0.934 | 0.8539 |
| | SMOTE-Boosted C5.0 | 0.937 | 0.934 | 0.941 | 0.9375 | 0.937 | 0.8756 |
Performance comparison based on the Wbcd dataset.
| ML method | Accuracy (%) | Sensitivity (%) | Specificity (%) | G-mean (%) | MCC |
|---|---|---|---|---|---|
| QKCLDA | 97.26 | — | — | — | |
| K-SVM | 97.38 | — | — | — | |
| PSO + Boosted C5.0 | 96.38 | 97.70 | 94.28 | — | |
| Aisl | 98.00 | 95.9 | 98.7 | — | |
| PSO-KDE | 98.45 | 100 | 97.99 | — | |
| EC | 96.5 | — | | | |
| BBHA | 97.38 | 95.79 | 98.57 | — | |
| EM-PCA-CART-fuzzy rule-based | 93.2 | — | — | — | |
| FSMLP | 100 | 100 | 100 | 100 | |
Performance comparison based on the Bcwo dataset.
| ML method | Accuracy (%) | Sensitivity (%) | Specificity (%) | G-mean (%) | MCC (%) |
|---|---|---|---|---|---|
| Aisl | 98.3 | 94.3 | 99.6 | 96.91 | |
| PSO-KDE | 98.53 | 95.79 | 100 | — | |
Result comparison based on different datasets.
| Dataset | Method | Accuracy | Sensitivity | Specificity | G-mean | AUC | MCC |
|---|---|---|---|---|---|---|---|
| Abalone | K-Boosted C5.0 | 0.960 | 0.2 | 0.992 | 0.445 | 0.596 | |
| | SMOTEBoost | 0.822 | 0.628 | 0.834 | 0.724 | 0.730 | 0.264 |
| | RUSBoost | 0.592 | 0.802 | 0.58 | 0.682 | 0.69 | 0.15 |
| | SMOTE-Boosted C5.0 | 0.618 | 0.635 | 0.601 | 0.618 | 0.624 | 0.232 |
| Pima | K-Boosted C5.0 | 0.766 | 0.640 | 0.820 | 0.725 | 0.730 | |
| | SMOTEBoost | 0.75 | 0.646 | 0.8 | 0.719 | 0.723 | 0.432 |
| | RUSBoost | 0.714 | 0.792 | 0.676 | 0.732 | 0.733 | 0.446 |
| | SMOTE-Boosted C5.0 | 0.713 | 0.742 | 0.684 | 0.712 | 0.713 | 0.425 |
| Redwine1 | K-Boosted C5.0 | 0.823 | 0.517 | 0.905 | 0.684 | 0.823 | |
| | SMOTEBoost | 0.784 | 0.544 | 0.858 | 0.683 | 0.701 | 0.397 |
| | RUSBoost | 0.702 | 0.656 | 0.718 | 0.686 | 0.69 | 0.34 |
| | SMOTE-Boosted C5.0 | 0.688 | 0.715 | 0.661 | 0.687 | 0.688 | 0.377 |
| Redwine2 | K-Boosted C5.0 | 0.902 | 0.681 | 0.969 | 0.812 | 0.825 | |
| | SMOTEBoost | 0.826 | 0.85 | 0.82 | 0.835 | 0.835 | 0.59 |
| | RUSBoost | 0.82 | 0.852 | 0.812 | 0.832 | 0.832 | 0.577 |
| | SMOTE-Boosted C5.0 | 0.844 | 0.865 | 0.822 | 0.8432 | 0.843 | 0.691 |
| Redwine3 | K-Boosted C5.0 | 0.834 | 0.410 | 0.866 | 0.596 | 0.637 | |
| | SMOTEBoost | 0.89 | 0.12 | 0.948 | 0.337 | 0.545 | 0.054 |
| | RUSBoost | 0.764 | 0.46 | 0.788 | 0.602 | 0.634 | 0.171 |
| | SMOTE-Boosted C5.0 | 0.567 | 0.509 | 0.625 | 0.564 | 0.573 | 0.137 |
| Redwine4 | K-Boosted C5.0 | 0.940 | 0.263 | 0.989 | 0.510 | 0.626 | |
| | SMOTEBoost | 0.916 | 0.28 | 0.964 | 0.520 | 0.623 | 0.317 |
| | RUSBoost | 0.678 | 0.64 | 0.682 | 0.661 | 0.660 | 0.152 |
| | SMOTE-Boosted C5.0 | 0.618 | 0.635 | 0.601 | 0.618 | 0.624 | 0.232 |
| Whitewine | K-Boosted C5.0 | 0.925 | 0.650 | 0.961 | 0.79 | 0.805 | |
| | SMOTEBoost | 0.804 | 0.838 | 0.798 | 0.818 | 0.818 | 0.502 |
| | RUSBoost | 0.794 | 0.85 | 0.784 | 0.816 | 0.817 | 0.49 |
| | SMOTE-Boosted C5.0 | 0.796 | 0.801 | 0.792 | 0.796 | 0.797 | 0.593 |
| Yeast1 | K-Boosted C5.0 | 0.952 | 0.957 | 0.949 | 0.953 | 0.957 | |
| | SMOTEBoost | 0.762 | 0.722 | 0.788 | 0.754 | 0.754 | 0.497 |
| | RUSBoost | 0.798 | 0.694 | 0.852 | 0.769 | 0.773 | 0.552 |
| | SMOTE-Boosted C5.0 | 0.723 | 0.734 | 0.712 | 0.723 | 0.723 | 0.45 |
| Yeast2 | K-Boosted C5.0 | 0.951 | 0.924 | 0.958 | 0.941 | 0.941 | |
| | SMOTEBoost | 0.932 | 0.904 | 0.94 | 0.922 | 0.921 | 0.836 |
| | RUSBoost | 0.93 | 0.938 | 0.926 | 0.928 | 0.931 | 0.824 |
| | SMOTE-Boosted C5.0 | 0.9116 | 0.8934 | 0.9388 | 0.9158 | 0.916 | 0.821 |
| Yeast3 | K-Boosted C5.0 | 0.646 | 0.575 | 0.706 | 0.637 | 0.641 | |
| | SMOTEBoost | 0.618 | 0.618 | 0.612 | 0.615 | 0.616 | 0.232 |
| | RUSBoost | 0.64 | 0.544 | 0.728 | 0.629 | 0.636 | 0.27 |
| | SMOTE-Boosted C5.0 | 0.598 | 0.450 | 0.735 | 0.575 | 0.593 | 0.195 |
Figure 2. MCC result comparison based on different datasets.
Figure 3. MCC of the different classifiers on the breast cancer and protein homology datasets.
Figure 4. Computational efficiency of approaches on 12 datasets.
Figure 5. Computational efficiency of approaches on breast cancer and protein homology datasets.
Mean rank of the Friedman test over the four classification algorithms.
| | K-Boosted C5.0 | SMOTEBoost | RUSBoost | SMOTE-Boosted C5.0 |
|---|---|---|---|---|
| 7.488 | 4 | 2.25 | 1.917 | 1.83 |
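The four mean ranks sum to about 10 (1 + 2 + 3 + 4), which is what ranking four methods on each dataset and averaging would give. K-Boosted C5.0 holds the value 4, which suggests that the highest rank goes to the best score. The sketch below shows how such mean ranks can be computed under that reading, with ties sharing the average of their positions. The function name and the score matrix are hypothetical and are not the paper's data.

```python
def mean_ranks(scores):
    """scores[d][m] is the score of method m on dataset d.

    Within each dataset the lowest score gets rank 1 and the highest gets
    rank k (so a larger mean rank means a better method); tied scores share
    the average of the ranks they span.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: row[m])
        i = 0
        while i < k:
            j = i
            # Extend j over the run of scores tied with position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
            for pos in range(i, j + 1):
                totals[order[pos]] += shared
            i = j + 1
    return [total / len(scores) for total in totals]


# Hypothetical MCC-like scores: 3 datasets (rows) x 4 methods (columns).
scores = [
    [0.9, 0.5, 0.4, 0.3],
    [0.8, 0.6, 0.7, 0.2],
    [0.7, 0.7, 0.1, 0.3],  # the first two methods tie on this dataset
]
ranks = mean_ranks(scores)
print([round(r, 3) for r in ranks])
```

The Friedman test then asks whether these mean ranks differ more than chance would allow if all methods performed equally well.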
Figure 6. Results of the pairwise comparisons of methods using the Nemenyi post hoc test.
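In the Nemenyi test, two methods are considered significantly different when their mean ranks differ by more than a critical difference, CD = q_α · sqrt(k(k + 1) / (6N)), where k is the number of methods and N the number of datasets. The sketch below applies this to the mean ranks in the Friedman table. Two inputs are assumptions and do not come from the paper: N = 12 (the small-scale datasets) and the tabulated critical value q = 2.569 for k = 4 at α = 0.05. The function name is hypothetical.

```python
from itertools import combinations
from math import sqrt


def nemenyi_critical_difference(k, n_datasets, q_alpha):
    """Critical difference in mean rank for the Nemenyi post hoc test."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n_datasets))


# Mean ranks as listed in the Friedman table.
mean_rank = {
    "K-Boosted C5.0": 4.0,
    "SMOTEBoost": 2.25,
    "RUSBoost": 1.917,
    "SMOTE-Boosted C5.0": 1.83,
}

# Assumptions: N = 12 datasets, q = 2.569 (k = 4, alpha = 0.05).
cd = nemenyi_critical_difference(k=len(mean_rank), n_datasets=12, q_alpha=2.569)

significant = [
    (a, b)
    for a, b in combinations(mean_rank, 2)
    if abs(mean_rank[a] - mean_rank[b]) > cd
]
print(f"critical difference: {cd:.3f}")
for a, b in significant:
    print(f"{a} vs {b}: rank gap {abs(mean_rank[a] - mean_rank[b]):.3f}")
```

Under these assumptions, only the pairs involving K-Boosted C5.0 exceed the critical difference; a different N or significance level would shift the threshold.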