Mostafa Shanbehzadeh, Mohammad Reza Afrash, Nader Mirani, Hadi Kazemi-Arpanahi.
Abstract
INTRODUCTION: Chronic myeloid leukemia (CML) is a myeloproliferative disorder resulting from the translocation of chromosomes 9 and 22. CML accounts for 15–20% of all cases of leukemia. Although bone marrow transplant and, more recently, tyrosine kinase inhibitors (TKIs) as a first-line treatment have significantly prolonged survival in CML patients, accurately predicting survival from available patient-level factors remains challenging. We aimed to predict 5-year survival among CML patients using eight machine learning (ML) algorithms and to compare their performance.
Keywords: Data mining; Leukemia; Machine learning; Support vector machine; Survival
Year: 2022 PMID: 36068539 PMCID: PMC9450320 DOI: 10.1186/s12911-022-01980-w
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 3.298
Fig. 1 The roadmap of the proposed system based on the CRISP methodology. SVM support vector machine, RBF radial basis function, DT decision tree, KNN k-nearest neighbors, XG Boost eXtreme gradient boosting, AUC area under the curve
Baseline predictor variables
| Data class | Type of variable | Variable | Range | Deceased within 5 years (total) | Survived within 5 years (total) | P-value |
|---|---|---|---|---|---|---|
| Basic data | Independent variables | Age (years) | 18–45 | 19 | 121 | 0.081 |
| | | | 45–65 | 36 | 295 | |
| | | | 65–100 | 42 | 324 | |
| | | Gender | Male | 65 | 480 | 0.093 |
| | | | Female | 32 | 260 | |
| History | | Radiation exposure | Yes–No | 25–72 | 85–655 | 0.805 |
| | | Previous cancer treatment | Yes–No | 22–57 | 36–704 | 0.692 |
| | | Genetic disorders | Yes–No | 6–91 | 13–727 | 0.811 |
| | | Family history of leukemia | Yes–No | 14–83 | 39–701 | 0.957 |
| | | Tobacco smoke | Yes–No | 9–88 | 41–699 | 0.561 |
| | | Pesticides and industrial solvents | Yes–No | 8–89 | 58–682 | 0.374 |
| | | Exposure to certain chemicalsᵃ | Yes–No | 5–92 | 11–729 | 0.459 |
| Manifestations | | Fever | Yes–No | 49–48 | 208–532 | 0.671 |
| | | Chill | Yes–No | 17–80 | 148–592 | 0.759 |
| | | Swollen lymph nodes | Yes–No | 36–61 | 108–632 | 0.714 |
| | | Petechiae | Yes–No | 21–76 | 158–582 | 0.920 |
| | | Easy bleeding or bruising | Yes–No | 26–71 | 128–612 | 0.802 |
| | | Recurrent nosebleeds | Yes–No | 14–83 | 89–651 | 0.981 |
| | | Frequent or severe infections | Yes–No | 36–61 | 280–460 | 0.059 |
| | | Arthralgia | Yes–No | 29–68 | 211–485 | 0.630 |
| | | Headache | Yes–No | 36–61 | 248–492 | 0.710 |
| | | Malaise | Yes–No | 25–72 | 305–435 | 0.837 |
| | | Dyspnea | Yes–No | 8–89 | 63–677 | 0.910 |
| | | Dizziness | Yes–No | 6–91 | 58–682 | 0.891 |
| | | Visual disturbances | Yes–No | 13–84 | 62–678 | 0.452 |
| | | Nausea/vomiting | Yes–No | 88 | 119 | 0.100 |
| | | Ankle edema | Yes–No | 64 | 132 | 0.924 |
| | | Weakness | Yes–No | 51 | 79 | 0.130 |
| | | Sweats | Yes–No | 102 | 140 | 0.092 |
| | | Weight loss | Yes–No | 63–34 | 297–443 | 0.721 |
| | | Bone pain | Yes–No | 12–85 | 165–575 | 0.816 |
| | | Spleen palpable | Yes–No | 27–70 | 117–623 | 0.649 |
| | | Pain or a sense of "fullness" in the belly | Yes–No | 22–75 | 86–654 | 0.930 |
| | | Feeling full after eating even a small amount of food | Yes–No | 19–78 | 91–649 | 0.922 |
| Laboratory | | BCR-ABL (Philadelphia chromosome) | Positive–negative | 88–9 | 82–658 | 0.631 |
| | | Anemia | Yes–No | 43–54 | 215–489 | 0.052 |
| | | Poor appetite | Yes–No | 37–60 | 119–621 | 0.760 |
| | | Areas of bone damage | Yes–No | 10–87 | 74–666 | 0.058 |
| | | Increased leucocyte count | > 50 × 10³/ml | 63 | 565 | 0.041 |
| | | Neutrophil proportion | > 72.6% | 53 | 445 | 0.029 |
| | | Elevated blast cell proportion | > 10% | 32 | 396 | 0.042 |
| | | Increased eosinophil count | > 0.5 × 10³/µL | 66 | 321 | 0.049 |
| | | Increased basophil count | > 0.1 × 10³/µL | 48 | 625 | 0.018 |
| | | Decreased platelet counts | < 150 × 10³/ml | 29 | 108 | 0.052 |
| | | Increased neutrophil alkaline phosphatase | > 20 per 100 score neutrophils | 52 | 256 | 0.049 |
| | | Resistance to tyrosine kinase inhibitors | Yes–No | 24–73 | 268–472 | 0.072 |
| Outcome variable | Dependent variable | Five-year survival status | Deceased within 5 years/survived within 5 years | 97 | 740 | – |
ᵃ Exposure to certain chemicals, such as benzene (which is found in gasoline and is used by the chemical industry), is linked to an increased risk of some kinds of leukemia
Fig. 2 The most important variables selected by the minimum redundancy maximum relevance (mRMR) method
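The record does not say which mRMR variant or relevance measure was used, so the following is only a minimal sketch of the general idea: a greedy selection that scores each candidate by its mutual information with the outcome minus its mean mutual information with the features already chosen. The function names, the difference-form score, and the toy feature columns (`anemia`, `anemia_dup`, `blast_gt_10`, `fever`) are all assumptions made for illustration and are not taken from the study data.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def mrmr(features, target, k):
    """Greedy mRMR: relevance to the target minus mean redundancy
    with the features selected so far (difference form)."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(mutual_information(col, features[s])
                                 for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:  # ties keep the earlier feature
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: 1 = deceased within 5 years, 0 = survived.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "anemia":      [1, 1, 1, 0, 0, 0, 0, 0],  # informative
    "anemia_dup":  [1, 1, 1, 0, 0, 0, 0, 0],  # exact duplicate (redundant)
    "blast_gt_10": [0, 1, 1, 1, 1, 0, 0, 0],  # weaker but complementary
    "fever":       [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the outcome
}

print(mrmr(features, target, k=2))
```

In this toy case the duplicate column is as relevant as the first pick but is penalized for redundancy, so the weaker, less correlated feature is chosen second.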
Best hyperparameters of all the trained algorithms
| No. | Data mining models | Hyperparameters | F-score |
|---|---|---|---|
| 1 | Decision tree (J48) | | |
| 2 | MLP classifier | learning_rate = 'constant', hidden_layer_size = (100, 100, 100), alpha = 0.05, activation = 'relu' | 87.6 |
| 3 | SVM (kernel = linear) | C = 100, G = 0.0001 | 83.04 |
| 4 | SVM (kernel = RBF) | C = 10, G = 0.001 | 81.9 |
| 5 | XG Boost classifier | min_child_weight = 1, max_depth = 12, learning_rate = 0.1, gamma = 0.4, colsample_bytree = 0.3 | 81.02 |
| 6 | KNN | K = 5 | 67.1 |
| 7 | Pattern recognition network | 57-10-5-2 | 69.02 |
| 8 | Probabilistic neural network | 57-2, Spread = 0.1 | 70.01 |
SVM support vector machine, XG Boost eXtreme gradient boosting, KNN k-nearest neighbors
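The parameter names in the table resemble those of scikit-learn and XGBoost, although the record does not name the software used. Under that assumption, and reading "G" as the kernel coefficient `gamma`, the listed settings could be written roughly as follows. The J48 tree and the two neural-network rows are left out because the table gives no hyperparameters for the first and no library-style names for the others; the XGBoost settings are kept as a plain dictionary so the sketch does not depend on that package.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Assumed mapping of the tabulated hyperparameters onto scikit-learn estimators.
models = {
    "MLP": MLPClassifier(
        hidden_layer_sizes=(100, 100, 100),
        alpha=0.05,
        activation="relu",
        learning_rate="constant",  # only has an effect with solver="sgd"
    ),
    # gamma is accepted but ignored by the linear kernel
    "SVM (linear)": SVC(kernel="linear", C=100, gamma=0.0001),
    "SVM (RBF)": SVC(kernel="rbf", C=10, gamma=0.001),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# XGBoost settings from the table, as keyword arguments for a classifier.
xgb_params = {
    "min_child_weight": 1,
    "max_depth": 12,
    "learning_rate": 0.1,
    "gamma": 0.4,
    "colsample_bytree": 0.3,
}

for name, model in models.items():
    print(name, type(model).__name__)
```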
Performance evaluation of the selected ML algorithms
| Metric | MLP, full features | MLP, selected features | KNN, full features | KNN, selected features | DT (J48), full features | DT (J48), selected features | Pattern recognition network, full features | Pattern recognition network, selected features | XG Boost, full features | XG Boost, selected features | Probabilistic neural network, full features | Probabilistic neural network, selected features | SVM (kernel = RBF), full features | SVM (kernel = RBF), selected features | SVM (kernel = linear), full features | SVM (kernel = linear), selected features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean accuracy | 0.67 | 0.77 | 0.62 | 0.68 | 0.73 | 0.83 | 0.62 | 0.68 | 0.69 | 0.79 | 0.62 | 0.69 | 0.69 | 0.85 | 0.69 | 0.83 |
| 95% confidence interval | (0.66, 0.68) | (0.76, 0.781) | (0.59, 0.66) | (0.671, 0.71) | (0.71, 0.75) | (0.834, 0.848) | (0.611, 0.64) | (0.68, 0.7) | (0.68, 0.7) | (0.77, 0.81) | (0.62, 0.63) | (0.691, 0.71) | (0.69, 0.71) | (0.82, 0.85) | (0.69, 0.71) | (0.82, 0.84) |
| Standard deviation | 0.01 | 0.09 | 0.05 | 0.02 | 0.02 | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
| Mean specificity | 0.68 | 0.76 | 0.62 | 0.66 | 0.74 | 0.81 | 0.62 | 0.68 | 0.68 | 0.76 | 0.62 | 0.68 | 0.69 | 0.85 | 0.69 | 0.82 |
| 95% confidence interval | (0.67, 0.71) | (0.75, 0.77) | (0.58, 0.66) | (0.651, 0.71) | (0.731, 0.75) | (0.80, 0.82) | (0.61, 0.64) | (0.67, 0.691) | (0.67, 0.69) | (0.75, 0.77) | (0.62, 0.63) | (0.68, 0.7) | (0.68, 0.7) | (0.85, 0.86) | (0.68, 0.7) | (0.816, 0.83) |
| Standard deviation | 0.02 | 0.01 | 0.07 | 0.02 | 0.01 | 0.09 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| Mean sensitivity | 0.71 | 0.72 | 0.62 | 0.70 | 0.74 | 0.83 | 0.61 | 0.70 | 0.70 | 0.78 | 0.62 | 0.71 | 0.71 | 0.86 | 0.70 | 0.83 |
| 95% confidence interval | (0.71, 0.73) | (0.71, 0.74) | (0.57, 0.68) | (0.68, 0.73) | (0.73, 0.752) | (0.83, 0.85) | (0.591, 0.64) | (0.69, 0.72) | (0.69, 0.72) | (0.78, 0.79) | (0.61, 0.65) | (0.7, 0.73) | (0.7, 0.73) | (0.86, 0.87) | (0.7, 0.72) | (0.82, 0.84) |
| Standard deviation | 0.01 | 0.01 | 0.08 | 0.03 | 0.02 | 0.09 | 0.03 | 0.02 | 0.02 | 0.01 | 0.03 | 0.03 | 0.02 | 0.01 | 0.01 | 0.01 |
| Mean area under the curve | 0.70 | 0.76 | 0.62 | 0.69 | 0.75 | 0.83 | 0.62 | 0.69 | 0.69 | 0.76 | 0.62 | 0.70 | 0.70 | 0.861 | 0.70 | 0.83 |
| 95% confidence interval | (0.69, 0.71) | (0.751, 0.774) | (0.610, 0.630) | (0.671, 0.712) | (0.731, 0.76) | (0.83, 0.85) | (0.61, 0.63) | (0.68, 0.7) | (0.68, 0.71) | (0.75, 0.778) | (0.62, 0.64) | (0.69, 0.71) | (0.69, 0.71) | (0.85, 0.86) | (0.69, 0.71) | (0.82, 0.84) |
| Standard deviation | 0.01 | 0.09 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.014 | 0.01 |
| Mean F1-score | 0.70 | 0.76 | 0.61 | 0.68 | 0.73 | 0.83 | 0.62 | 0.69 | 0.69 | 0.77 | 0.62 | 0.70 | 0.72 | 0.87 | 0.69 | 0.82 |
| 95% confidence interval | (0.69, 0.71) | (0.751, 0.772) | (0.61, 0.63) | (0.671, 0.71) | (0.72, 0.74) | (0.83, 0.851) | (0.611, 0.63) | (0.68, 0.7) | (0.68, 0.71) | (0.76, 0.78) | (0.61, 0.64) | (0.69, 0.71) | (0.69, 0.71) | (0.86, 0.88) | (0.69, 0.71) | (0.821, 0.84) |
| Standard deviation | 0.01 | 0.08 | 0.01 | 0.03 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
| Kappa statistic (KS) | 0.7201 | 0.762 | 0.612 | 0.681 | 0.701 | 0.832 | 0.622 | 0.671 | 0.621 | 0.782 | 0.602 | 0.718 | 0.752 | 0.861 | 0.681 | 0.831 |
| 95% confidence interval | (0.71, 0.73) | (0.75, 0.771) | (0.61, 0.63) | (0.66, 0.69) | (0.70, 0.71) | (0.828, 0.85) | (0.59, 0.63) | (0.66, 0.69) | (0.61, 0.63) | (0.77, 0.79) | (0.58, 0.62) | (0.68, 0.73) | (0.74, 0.76) | (0.85, 0.86) | (0.67, 0.70) | (0.82, 0.84) |
| Standard deviation | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 | 0.08 | 0.07 | 0.01 | 0.02 | 0.01 | 0.05 | 0.04 | 0.01 | 0.01 | 0.01 | 0.01 |
SVM support vector machine, RBF radial basis function, DT decision tree, KNN k-nearest neighbors, XG Boost eXtreme gradient boosting
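For readers unfamiliar with the metrics tabulated above, the sketch below shows how accuracy, sensitivity, specificity, F1-score, and Cohen's kappa follow from a single binary confusion matrix, and how a mean, standard deviation, and 95% confidence interval can be summarized over repeated runs. The record does not state how its intervals were computed; the normal approximation (mean ± 1.96 standard errors) used here is an assumption, and the confusion-matrix counts and fold scores are hypothetical.

```python
import math
import statistics


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion matrix (positive = event class)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


def mean_ci95(values):
    """Mean, sample standard deviation, and a normal-approximation 95% CI."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    half_width = 1.96 * sd / math.sqrt(len(values))
    return mean, sd, (mean - half_width, mean + half_width)


# Hypothetical confusion matrix and per-fold accuracies.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
fold_accuracy = [0.84, 0.86, 0.85, 0.85]
mean, sd, ci = mean_ci95(fold_accuracy)

print({k: round(v, 3) for k, v in metrics.items()})
print(round(mean, 3), round(sd, 3), tuple(round(x, 3) for x in ci))
```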
Fig. 3 Comparison of ML models' performance on (A) full and (B) selected features
Fig. 4 The ROC curves for the top four ML algorithms
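As a reminder of what an ROC curve and its AUC represent, here is a minimal sketch that builds the ROC points by sweeping a decision threshold over predicted scores and computes the AUC in two equivalent ways: as the trapezoidal area under those points, and as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The labels and scores are hypothetical and unrelated to the study's models.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over all scores."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_rank(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = event) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

points = roc_points(labels, scores)
print(points)
print(auc_trapezoid(points), auc_rank(labels, scores))  # both 0.75
```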
Fig. 5 Classification report for SVM with RBF kernel