| Literature DB >> 36028803 |
Xuanxuan Liu1, Li Guo2, Hexiang Wang3, Jia Guo3, Shifeng Yang4, Lisha Duan5.
Abstract
BACKGROUND: Soft tissue sarcoma is a rare and highly heterogeneous tumor in clinical practice. Pathological grading of the soft tissue sarcoma is a key factor in patient prognosis and treatment planning while the clinical data of soft tissue sarcoma are imbalanced. In this paper, we propose an effective solution to find the optimal imbalance machine learning model for predicting the classification of soft tissue sarcoma data.Entities:
Keywords: Extremely randomized trees; Imbalanced data; Machine learning; Radiomics; Soft tissue sarcoma
Mesh:
Year: 2022 PMID: 36028803 PMCID: PMC9417078 DOI: 10.1186/s12880-022-00876-5
Source DB: PubMed Journal: BMC Med Imaging ISSN: 1471-2342 Impact factor: 2.795
Summary of recent literature on solving data imbalance problems
| Ref | Year | Dataset | Methods | Evaluation metric |
|---|---|---|---|---|
| [ | 2011 | National Inpatient Sample (NIS) data | Repeated random subsampling-RF | AUC = 88.79% |
| [ | 2014 | Real datasets of human protein | MTD-SVM | AC = 96.71% |
| [ | 2021 | From Hospital Israelita Albert Einstein | MiDT | AC = 93.255% |
| [ | 2022 | The esophageal cancer patient dataset | GDO-SVM | AUC = 0.71 |
| [ | 2022 | Wisconsin | GDO-SVM | AUC = 0.9662 |
| [ | 2020 | HTRU2 | Hybrid resampling-ETC | AC = 99.3% |
| [ | 2021 | The comments on social media platforms | RVVC-SMOTE | AC = 97% |
| [ | 2021 | UCI(fraud detection) | RONS/ROS/ROA-LR/SVM | Gmean = 0.905 |
| [ | 2021 | WCE images | BIR-CNN | AC = 99.3% |
| [ | 2021 | Chest X-ray image dataset | CNNs | AC = 99.5% |
AC accuracy. The datasets and evaluation measures in the table are selected from parts of the original literature or are the best-performing ones reported
The number of samples and the imbalance ratio of the MRI-QSH dataset
| Number of samples | High-grade | Low-grade | Imbalance ratio |
|---|---|---|---|
| 252 | 190 | 62 | 3.06 |
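The imbalance ratio in the table is simply the majority-to-minority class count; a minimal check using the counts above (variable names are illustrative):

```python
# Class counts from the MRI-QSH table above; the imbalance ratio is
# the majority class size divided by the minority class size.
high_grade, low_grade = 190, 62
total = high_grade + low_grade
imbalance_ratio = high_grade / low_grade
print(total)                       # 252
print(round(imbalance_ratio, 2))   # 3.06
```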
Fig. 1Example images of soft tissue sarcoma obtained by radiomics
17 different machine learning models
| Number | Feature selection method | Sampling technique | Classification method |
|---|---|---|---|
| 1 | RFE | ROSE | ERT |
| 2 | RFE | SMOTE | ERT |
| 3 | RFE | STT | ERT |
| 4 | RFE | ADASYN | ERT |
| 5 | RFE | ROSE | RF |
| 6 | RFE | SMOTE | RF |
| 7 | RFE | STT | RF |
| 8 | RFE | ADASYN | RF |
| 9 | RFE | ROSE | BRF |
| 10 | RFE | SMOTE | BRF |
| 11 | RFE | STT | BRF |
| 12 | RFE | ADASYN | BRF |
| 13 | RFE | ROSE | SVM |
| 14 | RFE | SMOTE | SVM |
| 15 | RFE | STT | SVM |
| 16 | RFE | ADASYN | SVM |
| 17 | RFE | GDO | SVM |
RFE recursive feature elimination; ROSE random oversampling examples; SMOTE synthetic minority oversampling technique; STT SMOTETomek; ADASYN adaptive synthetic sampling; ERT extremely randomized trees; RF random forest; BRF balanced random forest; SVM support vector machine
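Each of the 17 models chains a feature selector, a resampler, and a classifier. A minimal sketch of pipeline 1 (RFE + random oversampling + ERT) using only scikit-learn and NumPy; in practice the SMOTE/ADASYN variants come from the imbalanced-learn package, and every parameter here is illustrative, not the authors' setting:

```python
# Sketch of pipeline 1: RFE feature selection, random oversampling of
# the minority class (ROSE-like), then extremely randomized trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 3:1-imbalanced radiomics data.
X, y = make_classification(n_samples=252, n_features=30,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# 1) Recursive feature elimination down to 10 features.
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=10).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = rfe.transform(X_tr), rfe.transform(X_te)

# 2) Random oversampling of the minority class, applied to the
#    training set only so no resampled copies leak into the test set.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == np.argmin(np.bincount(y_tr)))
extra = rng.choice(minority, size=np.bincount(y_tr).max() - minority.size)
X_bal = np.vstack([X_tr_sel, X_tr_sel[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

# 3) Extremely randomized trees on the balanced training set.
clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print(round(clf.score(X_te_sel, y_te), 2))
```

Resampling after the train/test split is the key design choice: oversampling before splitting would place near-duplicates of training samples in the test set and inflate every metric.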
Fig. 2The conventional dataset splitting process
Fig. 3The process of dataset splitting with SRS
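Assuming SRS denotes stratified random sampling, the difference from the conventional split in Fig. 2 can be sketched as follows; with a 3:1 class imbalance, stratification keeps the class ratio identical in the training and testing subsets, while a plain random split lets it drift with the seed:

```python
# Conventional vs. stratified random split on a 3:1-imbalanced label
# vector (190 high-grade vs. 62 low-grade, as in the MRI-QSH table).
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 190 + [0] * 62)      # 1 = high-grade, 0 = low-grade
X = np.arange(len(y)).reshape(-1, 1)    # dummy features

# Conventional split: test-set class ratio varies from seed to seed.
_, _, _, y_te_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the class ratio is preserved in the test set.
_, _, _, y_te_strat = train_test_split(X, y, test_size=0.2,
                                       stratify=y, random_state=0)
print(np.bincount(y_te_strat))
```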
Confusion matrix of classification results
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
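The metrics reported in the result tables all derive from these four confusion-matrix cells; a minimal computation with illustrative counts (not the paper's results):

```python
# Sensitivity, specificity, accuracy and G-mean from confusion-matrix
# entries; the counts are illustrative placeholders.
import math

TP, FN, FP, TN = 45, 5, 8, 42
sensitivity = TP / (TP + FN)                # true-positive rate
specificity = TN / (TN + FP)                # true-negative rate
accuracy = (TP + TN) / (TP + FN + FP + TN)
g_mean = math.sqrt(sensitivity * specificity)
print(round(sensitivity, 2), round(specificity, 2),
      round(accuracy, 2), round(g_mean, 2))   # 0.9 0.84 0.87 0.87
```

G-mean is the metric of interest under imbalance: unlike accuracy, it collapses toward zero if either class is poorly recognized.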
The effectiveness of 17 different machine learning methods in the testing set
| N | FS | ST | CM | AUC ± SD | AC (%) ± SD | Sens (%) ± SD | Spec (%) ± SD | G-mean ± SD |
|---|---|---|---|---|---|---|---|---|
| 1 | RFE | ROSE | ERT | 0.6013 ± 0.0482 | 78.82 ± 0.0545 | | 21.60 ± 0.1014 | 0.4477 ± 0.1154 |
| 2 | RFE | SMOTE | ERT | 0.6863 ± 0.0515 | 81.37 ± 0.0500 | 95.80 ± 0.0284 | 41.47 ± 0.0972 | 0.6260 ± 0.0782 |
| 3 | RFE | STT | ERT | | | 96.03 ± 0.0254 | 41.55 ± 0.1091 | 0.6263 ± 0.0860 |
| 4 | RFE | ADASYN | ERT | 0.6461 ± 0.0595 | 79.41 ± 0.0464 | 95.04 ± 0.0279 | 34.18 ± 0.1121 | 0.5621 ± 0.1017 |
| 5 | RFE | ROSE | RF | 0.6197 ± 0.0473 | 77.45 ± 0.0533 | 93.97 ± 0.0425 | 29.97 ± 0.0865 | 0.5258 ± 0.0746 |
| 6 | RFE | SMOTE | RF | 0.6567 ± 0.0488 | 76.27 ± 0.0502 | 87.50 ± 0.0427 | 43.84 ± 0.1032 | 0.6147 ± 0.0700 |
| 7 | RFE | STT | RF | 0.6580 ± 0.0447 | 76.67 ± 0.0448 | 88.35 ± 0.0396 | 43.25 ± 0.1018 | 0.6133 ± 0.0680 |
| 8 | RFE | ADASYN | RF | 0.6142 ± 0.0618 | 73.92 ± 0.0599 | 87.60 ± 0.0482 | 35.24 ± 0.1026 | 0.5503 ± 0.0877 |
| 9 | RFE | ROSE | BRF | 0.6151 ± 0.0332 | 77.45 ± 0.0446 | 94.52 ± 0.0356 | 28.49 ± 0.0645 | 0.5154 ± 0.0593 |
| 10 | RFE | SMOTE | BRF | 0.6287 ± 0.0487 | 74.90 ± 0.0422 | 86.97 ± 0.0381 | 38.77 ± 0.1031 | 0.5750 ± 0.0770 |
| 11 | RFE | STT | BRF | 0.6367 ± 0.0578 | 75.69 ± 0.0436 | 87.84 ± 0.0461 | 39.51 ± 0.1182 | 0.5822 ± 0.0872 |
| 12 | RFE | ADASYN | BRF | 0.6243 ± 0.0331 | 74.12 ± 0.0441 | 86.76 ± 0.0370 | 38.10 ± 0.0735 | 0.5720 ± 0.0503 |
| 13 | RFE | ROSE | SVM | 0.6863 ± 0.2226 | 77.65 ± 0.0436 | 87.49 ± 0.0438 | 52.30 ± 0.1295 | |
| 14 | RFE | SMOTE | SVM | 0.6812 ± 0.0591 | 76.47 ± 0.0606 | 85.41 ± 0.0633 | 50.82 ± 0.0894 | 0.6564 ± 0.0672 |
| 15 | RFE | STT | SVM | 0.6812 ± 0.0591 | 76.47 ± 0.0606 | 85.41 ± 0.0633 | 50.82 ± 0.0894 | 0.6564 ± 0.0672 |
| 16 | RFE | ADASYN | SVM | 0.6795 ± 0.0483 | 75.29 ± 0.0499 | 83.48 ± 0.0672 | | 0.6588 ± 0.0557 |
| 17 | RFE | GDO | SVM | 0.6691 ± 0.0685 | 76.67 ± 0.0657 | 87.51 ± 0.0557 | 46.30 ± 0.1083 | 0.6328 ± 0.0580 |
Best results are highlighted in bold
N number; FS feature selection; ST sampling technique; CM classification method; AUC area under the curve; Sens sensitivity; Spec specificity; ROSE random oversampling examples; SMOTE synthetic minority oversampling technique; STT SMOTETomek; ADASYN adaptive synthetic sampling; RFE recursive feature elimination; ERT extremely randomized trees; RF random forest; BRF balanced random forest; SVM support vector machine
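The "mean ± SD" entries above come from repeating the split/train/evaluate cycle over several random splits and aggregating; a minimal sketch of that aggregation (number of repeats, model, and data are all illustrative):

```python
# Repeated evaluation: train/test split with a different seed each
# round, collect AUC, then report mean and standard deviation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=252, weights=[0.75, 0.25],
                           random_state=0)
aucs = []
for seed in range(10):                       # 10 repeated splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=seed)
    clf = ExtraTreesClassifier(n_estimators=100,
                               random_state=seed).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print(f"AUC = {np.mean(aucs):.4f} ± {np.std(aucs):.4f}")
```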
Fig. 4Histogram of classification performance of 17 models
Performance of the SRS dataset splitting method on 17 models in the testing set
| N | FS | ST | CM | AUC ± SD | AC (%) ± SD | Sens (%) ± SD | Spec (%) ± SD | G-mean ± SD |
|---|---|---|---|---|---|---|---|---|
| 1 | RFE | ROSE | ERT | 0.9308 ± 0.0445 | 95.49 ± 0.0245 | | 87.47 ± 0.0915 | 0.9278 ± 0.0478 |
| 2 | RFE | SMOTE | ERT | | | 96.66 ± 0.0229 | 92.10 ± 0.0713 | |
| 3 | RFE | STT | ERT | | | 96.66 ± 0.0229 | 92.10 ± 0.0713 | |
| 4 | RFE | ADASYN | ERT | 0.9419 ± 0.0430 | 94.90 ± 0.0280 | 96.11 ± 0.0200 | 92.28 ± 0.0821 | 0.9409 ± 0.0443 |
| 5 | RFE | ROSE | RF | 0.9358 ± 0.0410 | 94.71 ± 0.0321 | 96.30 ± 0.0363 | 90.86 ± 0.0754 | 0.9345 ± 0.0425 |
| 6 | RFE | SMOTE | RF | 0.9087 ± 0.0547 | 92.94 ± 0.0211 | 94.24 ± 0.0326 | 87.49 ± 0.1104 | 0.9059 ± 0.0600 |
| 7 | RFE | STT | RF | 0.9197 ± 0.0439 | 93.14 ± 0.0349 | 94.31 ± 0.0420 | 89.63 ± 0.0759 | 0.9185 ± 0.0447 |
| 8 | RFE | ADASYN | RF | 0.9220 ± 0.0429 | 92.55 ± 0.0289 | 92.62 ± 0.0322 | 91.78 ± 0.0865 | 0.9208 ± 0.0437 |
| 9 | RFE | ROSE | BRF | 0.9356 ± 0.0396 | 94.90 ± 0.0295 | 96.86 ± 0.0324 | 90.27 ± 0.0749 | 0.9342 ± 0.0412 |
| 10 | RFE | SMOTE | BRF | 0.9111 ± 0.0562 | 93.14 ± 0.0212 | 94.23 ± 0.0229 | 88.00 ± 0.1095 | 0.9088 ± 0.0614 |
| 11 | RFE | STT | BRF | 0.9350 ± 0.0284 | 93.53 ± 0.0186 | 93.89 ± 0.0315 | 93.10 ± 0.0695 | 0.9339 ± 0.0291 |
| 12 | RFE | ADASYN | BRF | 0.9388 ± 0.0404 | 93.73 ± 0.0304 | 93.99 ± 0.0346 | | 0.9378 ± 0.0415 |
| 13 | RFE | ROSE | SVM | 0.8191 ± 0.0448 | 87.84 ± 0.0180 | 94.53 ± 0.0324 | 69.29 ± 0.1085 | 0.8062 ± 0.0559 |
| 14 | RFE | SMOTE | SVM | 0.8276 ± 0.0545 | 86.08 ± 0.0339 | 88.94 ± 0.0445 | 76.59 ± 0.1120 | 0.8227 ± 0.0586 |
| 15 | RFE | STT | SVM | 0.8276 ± 0.0545 | 86.08 ± 0.0339 | 88.94 ± 0.0445 | 76.59 ± 0.1120 | 0.8227 ± 0.0586 |
| 16 | RFE | ADASYN | SVM | 0.8699 ± 0.0573 | 89.22 ± 0.0349 | 90.99 ± 0.0374 | 83.00 ± 0.1204 | 0.8664 ± 0.0604 |
| 17 | RFE | GDO | SVM | 0.8143 ± 0.0598 | 87.06 ± 0.0230 | 92.34 ± 0.0373 | 70.52 ± 0.1395 | 0.8020 ± 0.0752 |
Best results are highlighted in bold
N number; FS feature selection; ST sampling technique; CM classification method; AUC area under the curve; Sens sensitivity; Spec specificity; ROSE random oversampling examples; SMOTE synthetic minority oversampling technique; STT SMOTETomek; ADASYN adaptive synthetic sampling; RFE recursive feature elimination; ERT extremely randomized trees; RF random forest; BRF balanced random forest; SVM support vector machine
Fig. 5Histogram of classification performance of 17 models using the SRS method
Running time of different machine learning models
| Number | Model | Conventional Split-Running time (s) | SRS-Running time (s) |
|---|---|---|---|
| 1 | RFE+ROSE+ERT | 64 | 65 |
| 2 | RFE+SMOTE+ERT | 66 | 66 |
| 3 | RFE+STT+ERT | 65 | 66 |
| 4 | RFE+ADASYN+ERT | 67 | 68 |
| 5 | RFE+ROSE+RF | 67 | 67 |
| 6 | RFE+SMOTE+RF | 69 | 67 |
| 7 | RFE+STT+RF | 66 | 68 |
| 8 | RFE+ADASYN+RF | 67 | 67 |
| 9 | RFE+ROSE+BRF | 66 | 66 |
| 10 | RFE+SMOTE+BRF | 66 | 66 |
| 11 | RFE+STT+BRF | 66 | 66 |
| 12 | RFE+ADASYN+BRF | 67 | 70 |
| 13 | RFE+ROSE+SVM | 68 | 66 |
| 14 | RFE+SMOTE+SVM | 64 | 66 |
| 15 | RFE+STT+SVM | 66 | 66 |
| 16 | RFE+ADASYN+SVM | 64 | 67 |
| 17 | RFE+GDO+SVM | 66 | 65 |
RFE recursive feature elimination; ROSE random oversampling examples; SMOTE synthetic minority oversampling technique; STT SMOTETomek; ADASYN adaptive synthetic sampling; ERT extremely randomized trees; RF random forest; BRF balanced random forest; SVM support vector machine
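The running times in the table are wall-clock measurements of a full pipeline run; a minimal sketch of how such a measurement can be taken (the model and data are illustrative, not the paper's setup):

```python
# Wall-clock timing of a model fit with time.perf_counter.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=252, random_state=0)
start = time.perf_counter()
ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
elapsed = time.perf_counter() - start
print(f"Running time: {elapsed:.2f} s")
```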
Fig. 6Running time of different machine learning models using different dataset splitting methods