Jiang Shen1, Jiachao Wu1, Man Xu2, Dan Gan3, Bang An1, Fusheng Liu1.
Abstract
Predicting postoperative survival of lung cancer patients (LCPs) is an important problem in medical decision-making. However, the imbalanced distribution of patient survival in the dataset makes prediction more difficult. Although the synthetic minority oversampling technique (SMOTE) can be used to deal with imbalanced data, it cannot identify noisy samples. Many studies also combine a support vector machine (SVM) with resampling techniques to deal with imbalanced data, but most of them require manual setting of the SVM parameters, which makes it difficult to obtain the best performance. In this paper, a hybrid method combining an improved SMOTE with an adaptive SVM is proposed for imbalanced data to predict the postoperative survival of LCPs. The proposed method has two stages. In the first stage, the cross-validated committees filter (CVCF) removes noise samples to improve the performance of SMOTE. In the second stage, we propose an adaptive SVM, which uses fuzzy self-tuning particle swarm optimization (FPSO) to optimize the SVM parameters. Compared with other advanced algorithms, the proposed method obtains the best performance, with 95.11% accuracy, 95.10% G-mean, 95.02% F1, and 95.10% area under the curve (AUC) for predicting postoperative survival of LCPs.
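The noise-filtering stage can be pictured with a minimal sketch of a cross-validated committees filter. The abstract does not state the base learner, the number of folds, or the voting rule, so the 3-nearest-neighbour learner, the five folds, and the majority vote below are all assumptions, and `knn_predict` and `cvcf_filter` are hypothetical names.

```python
import random
from collections import Counter


def knn_predict(train, x, k=3):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(
        train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cvcf_filter(samples, n_folds=5, k=3, seed=0):
    """Keep the (features, label) pairs that the committee does not flag.

    One k-NN committee member is built from each set of n_folds - 1 folds.
    A sample is dropped when more than half of the members mislabel it
    (majority voting; an assumption, not taken from the paper).
    """
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]

    committee = []
    for held_out in range(n_folds):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        committee.append([samples[i] for i in train_idx])

    kept = []
    for features, label in samples:
        wrong = sum(knn_predict(train, features, k) != label for train in committee)
        if wrong <= n_folds / 2:
            kept.append((features, label))
    return kept
```

In this sketch, a sample sitting inside the opposite class's cluster is outvoted by its neighbours in every committee member and is removed before any oversampling takes place.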
Year: 2021 PMID: 34545291 PMCID: PMC8449740 DOI: 10.1155/2021/2213194
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Feature details of the thoracic surgery dataset.
| Feature ID | Description | Type of attribute |
|---|---|---|
| 1 | Size of the original tumor, from OC11 (smallest) to OC14 (largest) | Nominal |
| 2 | Diagnosis (specific combination of ICD-10 codes for primary and secondary as well as multiple tumors, if any) | Nominal |
| 3 | Forced vital capacity | Numeric |
| 4 | Pain (presurgery) | Binary |
| 5 | Age at surgery | Numeric |
| 6 | Performance status | Nominal |
| 7 | Weakness (presurgery) | Binary |
| 8 | Dyspnoea (presurgery) | Binary |
| 9 | Cough (presurgery) | Binary |
| 10 | Haemoptysis (presurgery) | Binary |
| 11 | Peripheral arterial diseases | Binary |
| 12 | MI up to 6 months | Binary |
| 13 | Asthma | Binary |
| 14 | Volume that has been exhaled at the end of the first second of forced expiration | Numeric |
| 15 | Smoking | Binary |
| 16 | Type 2 diabetes mellitus | Binary |
| 17 | 1-year survival period (true value if died) | Binary |
Figure 1. Using SMOTE alone may indiscriminately aggravate the noise.
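The effect named in this caption follows from how SMOTE builds new points: it interpolates between a minority sample and one of its minority neighbours without checking whether either is noise. The following is a minimal sketch of that interpolation step only; `smote_samples` and its parameters are hypothetical names, and the neighbour count is an arbitrary choice.

```python
import random


def smote_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base))
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, target)))
    return synthetic
```

If one of the minority points is a mislabeled sample lying deep in the majority region, it is picked as `base` as often as any other point, so synthetic samples are generated along segments that reach into that region.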
Confusion matrix.
| | Actual positive | Actual negative |
|---|---|---|
| Predicted positive | TP | FP |
| Predicted negative | FN | TN |
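The reported metrics can be derived from these four counts. The sketch below uses the standard definitions of accuracy, G-mean (geometric mean of sensitivity and specificity), and F1; this excerpt does not reproduce the paper's formulas, so treating them as the standard ones is an assumption, and `imbalance_metrics` is a hypothetical helper.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy, G-mean, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": f1,
    }
```

With hypothetical counts of 15 positives and 85 negatives, a classifier that predicts every case as negative gives `imbalance_metrics(0, 0, 15, 85)`: accuracy is 0.85 while G-mean and F1 are both 0. This is why accuracy alone can look acceptable on imbalanced data even when the minority class is never detected.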
Defuzzification of w, csoc, ccog, η, and λ.
| Output | Low | Medium | High |
|---|---|---|---|
| w | 0.3 | 0.5 | 1.0 |
| csoc | 1.0 | 2.0 | 3.0 |
| ccog | 0.1 | 1.5 | 3.0 |
| η | 0.0 | 0.001 | 0.01 |
| λ | 0.1 | 0.15 | 0.2 |
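To show where the inertia weight w, the social factor csoc, and the cognitive factor ccog enter, here is a minimal sketch of one particle update in standard particle swarm optimization. It assumes these symbols play their usual roles in the velocity update; η and λ are left out because this excerpt does not define them. The function name `pso_step` and the default coefficient values are illustrative only.

```python
import random


def pso_step(position, velocity, personal_best, global_best,
             w=0.5, c_soc=2.0, c_cog=1.5, rng=random):
    """One velocity-and-position update for a single particle.

    v <- w*v + c_cog*r1*(personal_best - x) + c_soc*r2*(global_best - x)
    x <- x + v
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r_cog, r_soc = rng.random(), rng.random()
        v_next = w * v + c_cog * r_cog * (p - x) + c_soc * r_soc * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity
```

In a fuzzy self-tuning variant, the three coefficients would be recomputed for each particle at each iteration from the defuzzified levels instead of being fixed arguments as they are here.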
Figure 2. Flowchart of the proposed hybrid method for predicting postoperative survival of LCPs.
Accuracy comparison for different algorithms with different preprocessing methods.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | | | | 0.7378 | | 0.9511 |
| PSO-SVM | | 0.6570 | 0.6217 | 0.6776 | 0.7267 | 0.8643 |
| SVM | | 0.5294 | 0.5561 | 0.4781 | 0.5493 | 0.5204 |
| RF | 0.8369 | | 0.6023 | | 0.8430 | 0.8869 |
| GBDT | 0.8156 | 0.7059 | 0.5864 | 0.7025 | 0.8213 | 0.9276 |
| KNN | 0.8227 | 0.6561 | 0.5833 | 0.6910 | 0.7905 | 0.9005 |
| AdaBoost | 0.7943 | 0.6652 | 0.5615 | 0.6458 | 0.7674 | 0.9095 |
G-mean comparison for different algorithms with different preprocessing methods.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | 0 | 0.6942 | | 0.7203 | | 0.9510 |
| PSO-SVM | 0 | 0.5832 | 0.5628 | 0.6150 | 0.6567 | 0.8501 |
| SVM | 0 | 0 | 0 | 0.1537 | 0.1015 | 0.1659 |
| RF | 0 | | 0.6017 | | 0.8404 | 0.8868 |
| GBDT | | 0.6901 | 0.5835 | 0.7024 | 0.8154 | 0.9274 |
| KNN | 0 | 0.6572 | 0.5819 | 0.6874 | 0.7919 | 0.9000 |
| AdaBoost | 0.2059 | 0.6550 | 0.5552 | 0.6464 | 0.7597 | 0.9096 |
F1 comparison for different algorithms with different preprocessing methods.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | 0 | 0.6612 | 0.5549 | 0.7059 | | 0.9502 |
| PSO-SVM | 0 | 0.5089 | 0.4995 | 0.5600 | 0.6022 | 0.8336 |
| SVM | 0 | 0 | 0 | 0.2823 | 0.0605 | 0.0536 |
| RF | 0 | | | | 0.8241 | 0.8889 |
| GBDT | | 0.6524 | 0.5470 | 0.7025 | 0.7950 | 0.9292 |
| KNN | 0 | 0.6545 | 0.5473 | 0.7094 | 0.7760 | 0.9035 |
| AdaBoost | 0.0645 | 0.6186 | 0.5101 | 0.6425 | 0.7323 | 0.9099 |
AUC comparison for different algorithms with different preprocessing methods.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | 0.5000 | | | | | 0.9510 |
| PSO-SVM | 0.5000 | 0.6426 | 0.6069 | 0.6754 | 0.7094 | 0.8631 |
| SVM | 0.5000 | 0.5000 | 0.5000 | 0.4993 | 0.5059 | 0.5138 |
| RF | 0.4958 | 0.7115 | 0.6038 | 0.7397 | 0.8411 | 0.8873 |
| GBDT | | 0.6993 | 0.5857 | 0.7052 | 0.8171 | 0.9281 |
| KNN | 0.4874 | 0.6581 | 0.5842 | 0.6919 | 0.7927 | 0.9010 |
| AdaBoost | 0.4891 | 0.6603 | 0.5582 | 0.6483 | 0.7621 | 0.9097 |
Figure 3. Stacked histograms of accuracy, G-mean, F1, and AUC for different algorithms under different preprocessing methods.
Paired t-test results comparing CVCF-SMOTE+FPSO-SVM with the best performance under each preprocessing method in terms of accuracy, F1, G-mean, and AUC on the thoracic surgery dataset. Each entry gives the t statistic with the p value in parentheses. For CVCF-SMOTE, the comparison is between the best result and the second-best result.
| Methods | Accuracy | F1 | G-mean | AUC |
|---|---|---|---|---|
| NONE | 11.034 (0.000) | 25.502 (0.000) | 21.102 (0.000) | 27.01 (0.000) |
| SMOTE | 14.348 (0.000) | 16.01 (0.000) | 10.261 (0.000) | 12.469 (0.000) |
| SL-SMOTE | 29.947 (0.000) | 25.764 (0.000) | 30.349 (0.000) | 31.255 (0.000) |
| SMOTE-TL | 29.815 (0.000) | 30.281 (0.000) | 22.248 (0.000) | 26.895 (0.000) |
| B-SMOTE | 6.541 (0.000) | 5.176 (0.001) | 5.297 (0.000) | 5.997 (0.000) |
| CVCF-SMOTE | 5.237 (0.001) | 4.994 (0.001) | 4.67 (0.001) | 4.719 (0.001) |
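For reference, a paired t statistic of the kind tabulated above can be computed from two lists of matched scores, for example one score per cross-validation fold for each method. The sketch below uses the textbook formula; the function name and any scores passed to it are hypothetical, and the number of paired runs used in the paper is not stated in this excerpt.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the pairwise differences
    and sd is the sample standard deviation (n - 1 in the denominator).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n), n - 1
```

Turning t into a p value requires the Student t distribution with the returned degrees of freedom; a statistics library routine such as `scipy.stats.ttest_rel` reports both values directly.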
Comparative results with previous studies based on accuracy.
| Authors | Methods | Accuracy |
|---|---|---|
| Mangat and Vig | DA-AC | 82.18% |
| Elyan and Gaber | RFGA | 84.67% |
| Li et al. | STDPNF | 85.32% |
| Muthukumar and Krishnan | IFSSs | 88% |
| Saber Iraji | ELM (wave kernel) | 88.79% |
| Our work | CVCF-SMOTE+FPSO-SVM | 95.11% |
Figure 4. ROC curve comparison of different algorithms under different preprocessing methods.
Figure 5. Fitness curves of FPSO-SVM (a) and PSO-SVM (b) with CVCF-SMOTE.
Details of Haberman and appendicitis datasets.
| Datasets | Case number | Attribute number | Class distribution |
|---|---|---|---|
| Haberman | 306 | 3 | 225/81 |
| Appendicitis | 106 | 7 | 85/21 |
Accuracy comparison for different algorithms with different preprocessing methods on the Haberman dataset.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | | | 0.6386 | | | |
| PSO-SVM | 0.7098 | 0.6435 | | 0.6538 | 0.6831 | 0.7205 |
| SVM | 0.7196 | 0.6291 | 0.6409 | 0.6423 | 0.6772 | 0.7165 |
| RF | 0.6989 | 0.6795 | 0.6142 | 0.7315 | 0.7559 | 0.7772 |
| GBDT | 0.6837 | 0.6606 | 0.6299 | 0.7252 | 0.7465 | 0.7764 |
| KNN | 0.7174 | 0.6630 | 0.6417 | 0.7000 | 0.7449 | 0.7992 |
| AdaBoost | 0.7163 | 0.6402 | 0.6331 | 0.6117 | 0.6819 | 0.7559 |
AUC comparison for different algorithms with different preprocessing methods on the Haberman dataset.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | 0.5274 | 0.6813 | 0.6288 | | | |
| PSO-SVM | 0.5012 | 0.6131 | 0.6325 | 0.6669 | 0.6518 | 0.7121 |
| SVM | 0.5077 | 0.6096 | 0.6246 | 0.6598 | 0.6566 | 0.7035 |
| RF | 0.5731 | | 0.6132 | 0.7283 | 0.7588 | 0.7784 |
| GBDT | 0.5492 | 0.6607 | 0.6274 | 0.7226 | 0.7475 | 0.7765 |
| KNN | 0.5737 | 0.6649 | | 0.6997 | 0.7433 | 0.8009 |
| AdaBoost | | 0.6359 | 0.6293 | 0.6118 | 0.6779 | 0.7549 |
Paired t-test results of CVCF-SMOTE+FPSO-SVM and the best performance under different preprocessing methods in terms of accuracy and AUC on the Haberman dataset.
| Methods | Accuracy | AUC |
|---|---|---|
| NONE | 6.603 (0.000) | 18.744 (0.000) |
| SMOTE | 6.555 (0.000) | 10.315 (0.000) |
| SL-SMOTE | 15.959 (0.000) | 15.806 (0.000) |
| SMOTE-TL | 4.506 (0.001) | 3.539 (0.006) |
| B-SMOTE | 2.601 (0.029) | 2.83 (0.02) |
| CVCF-SMOTE | 4.669 (0.001) | 4.392 (0.002) |
Accuracy comparison for different algorithms with different preprocessing methods on the appendicitis dataset.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | | | | | | |
| PSO-SVM | 0.8625 | 0.8713 | 0.7620 | 0.8104 | 0.8714 | 0.9277 |
| SVM | 0.8469 | 0.7979 | 0.7854 | 0.8310 | 0.8813 | 0.9021 |
| RF | 0.8438 | 0.8438 | 0.7271 | 0.8714 | 0.9083 | 0.9106 |
| GBDT | 0.8188 | 0.8479 | 0.7146 | 0.8690 | 0.8917 | 0.9085 |
| KNN | 0.8500 | 0.7708 | 0.7354 | 0.8476 | 0.8708 | 0.8957 |
| AdaBoost | 0.8031 | 0.8396 | 0.7458 | 0.8690 | 0.8896 | 0.9106 |
AUC comparison for different algorithms with different preprocessing methods on the appendicitis dataset.
| Algorithms | NONE | SMOTE | SL-SMOTE | SMOTE-TL | B-SMOTE | CVCF-SMOTE |
|---|---|---|---|---|---|---|
| FPSO-SVM | 0.6878 | | | | | |
| PSO-SVM | 0.5893 | 0.7602 | 0.7708 | 0.9311 | 0.8917 | 0.9239 |
| SVM | 0.6674 | 0.7966 | 0.7832 | 0.8423 | 0.8788 | 0.8982 |
| RF | | 0.8475 | 0.7324 | 0.8755 | 0.9064 | 0.9070 |
| GBDT | 0.6460 | 0.8539 | 0.7207 | 0.8713 | 0.8909 | 0.9092 |
| KNN | 0.6885 | 0.7736 | 0.7374 | 0.8499 | 0.8676 | 0.8954 |
| AdaBoost | 0.6352 | 0.8461 | 0.7492 | 0.8685 | 0.8888 | 0.9102 |
Paired t-test results of CVCF-SMOTE+FPSO-SVM and the best performance under different preprocessing methods in terms of accuracy and AUC on the appendicitis dataset.
| Methods | Accuracy | AUC |
|---|---|---|
| NONE | 6.591 (0.000) | 15.628 (0.000) |
| SMOTE | 4.562 (0.001) | 5.176 (0.001) |
| B-SMOTE | 3.024 (0.014) | 3.373 (0.008) |
| SL-SMOTE | 6.227 (0.000) | 7.009 (0.000) |
| SMOTE-TL | 1.089 (0.304) | 0.785 (0.453) |
| CVCF-SMOTE | 2.764 (0.022) | 2.787 (0.021) |
Running time (in seconds) of CVCF-SMOTE+FPSO-SVM and state-of-the-art algorithms.
| Dataset | Algorithm | Running time (s) |
|---|---|---|
| Thoracic surgery | CVCF-SMOTE+GBDT | 31.2 |
| Thoracic surgery | CVCF-SMOTE+PSO-SVM | 53.6 |
| Thoracic surgery | CVCF-SMOTE+FPSO-SVM | 43.5 |
| Haberman | CVCF-SMOTE+KNN | 18.8 |
| Haberman | CVCF-SMOTE+PSO-SVM | 27.5 |
| Haberman | CVCF-SMOTE+FPSO-SVM | 24.5 |
| Appendicitis | SMOTE-TL+FPSO-SVM | 13.8 |
| Appendicitis | CVCF-SMOTE+PSO-SVM | 22.2 |
| Appendicitis | CVCF-SMOTE+FPSO-SVM | 17.3 |