Liyan Pan1, Guangjian Liu1, Fangqin Lin1, Shuling Zhong2, Huimin Xia3, Xin Sun4, Huiying Liang5.
Abstract
Predicting relapse in childhood acute lymphoblastic leukemia (ALL) is critical for successful treatment and follow-up planning. Our goal was to construct an ALL relapse prediction model based on machine learning algorithms. Monte Carlo cross-validation nested with 10-fold cross-validation was used to rank clinical variables on randomly split training sets of 336 children with newly diagnosed ALL, and a forward feature selection algorithm was employed to find the shortest list of the most discriminatory variables. To enable an unbiased estimate of the model's performance on new patients, we evaluated it not only on the split test sets of 150 patients but also on an independent data set of 84 patients. The Random Forest model with 14 features achieved a cross-validation accuracy of 0.827 ± 0.031 on the test sets and an accuracy of 0.798 on the independent data set, with areas under the curve of 0.902 ± 0.027 and 0.904, respectively. The model performed well across risk-level groups, with the best accuracy of 0.829 in the standard-risk group. To our knowledge, this is the first study to use machine learning models to predict childhood ALL relapse from electronic medical record data, which will further facilitate stratified treatment.
Year: 2017 PMID: 28784991 PMCID: PMC5547099 DOI: 10.1038/s41598-017-07408-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
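The resampling scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes plain (unstratified) random splits, takes the 336/150 split sizes from the abstract and the 100 repetitions from the performance table, and uses the hypothetical helper names `monte_carlo_splits` and `k_fold`. Model fitting and variable ranking are left out.

```python
import random


def monte_carlo_splits(n_patients, n_train, n_repeats, seed=0):
    """Yield (train_idx, test_idx) for repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(n_patients))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


def k_fold(indices, k=10):
    """Yield (inner_train, inner_valid) partitions of `indices` into k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        inner_train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield inner_train, valid


# Assumption: 486 = 336 + 150 patients are resampled; the 84 independent
# patients are held out entirely and never enter these splits.
splits = list(monte_carlo_splits(n_patients=486, n_train=336, n_repeats=100))
train_idx, test_idx = splits[0]
inner = list(k_fold(train_idx, k=10))
print(len(splits), len(train_idx), len(test_idx), len(inner))  # prints: 100 336 150 10
```

In this arrangement the inner 10-fold loop would be used only for ranking variables on each training set, so the 150 test patients of each outer split stay unseen until evaluation.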
Selected patient characteristics.
| Variable | Mean (SD) | Patients, N | Missing, N |
|---|---|---|---|
| Age (years) | 4.72 (2.65) | 570 | 0 |
| Birth Weight (kg) | 3.02 (0.43) | 510 | 60 |
| Hepatomegaly (cm) | 2.49 (2.08) | 557 | 13 |
| Splenomegaly (cm) | 2.04 (2.60) | 557 | 13 |
| White Blood Cell count (×10⁹/L) | 40.23 (80.36) | 570 | 0 |
| Haemoglobin (g/L) | 65.77 (23.12) | 570 | 0 |
| Platelet (×10⁹/L) | 72.55 (80.20) | 570 | 0 |
| Peripheral Heterotypic cell (%) | 0.30 (0.29) | 565 | 5 |
| Lactic Dehydrogenase (U/L) | 910.11 (1631.24) | 570 | 0 |
| Ferroprotein (ng/mL) | 540.61 (700.13) | 570 | 0 |
| Lymphoblast in Bone Marrow at diagnosis (%) | 0.83 (0.12) | 541 | 29 |
| Lymphoblast in Bone Marrow on Day 15 (%) | 0.10 (0.19) | 570 | 0 |
| Lymphoblast in Bone Marrow on Day 33 (%) | 0.01 (0.02) | 570 | 0 |
| Minimal Residual Disease on day 15 (%) | 8.07 (16.96) | 233 | 337 |
| Minimal Residual Disease on day 33 (%) | 0.84 (3.69) | 266 | 304 |
| Categorical variable | N (%) | Patients, N | Missing, N |
| Sex | | 570 | 0 |
| Male | 367 (64.39) | | |
| Female | 203 (35.61) | | |
| Fever | | 553 | 17 |
| Yes | 386 (69.80) | | |
| No | 167 (30.20) | | |
| Extramedullary Leukemia | | 570 | 0 |
| Yes | 19 (3.33) | | |
| No | 551 (96.67) | | |
| Bone Invasion | | 417 | 153 |
| Negative | 281 (67.39) | | |
| Positive | 136 (32.61) | | |
| Prednisone Response | | 570 | 0 |
| Poor | 50 (8.77) | | |
| Good | 520 (91.23) | | |
| French-American-British | | 570 | 0 |
| L1 | 225 (39.47) | | |
| L2 | 285 (50.00) | | |
| L3 | 60 (10.53) | | |
| BCR-ABL | | 570 | 0 |
| Negative | 552 (96.84) | | |
| Positive | 18 (3.16) | | |
Validated predictive performance of classifiers.
| Classifier | Samples | Features | Accuracy | Sensitivity | Specificity | PPV | NPV | AUC | Ƙ* |
|---|---|---|---|---|---|---|---|---|---|
| Mean performances (±standard deviation) of four classifiers on 100 training sets with all features | |||||||||
| RF | 150 | 103 | 0.831 ± 0.033 | 0.767 ± 0.058 | 0.895 ± 0.040 | 0.880 ± 0.041 | 0.795 ± 0.047 | 0.902 ± 0.030 | — |
| SVM | 150 | 103 | 0.719 ± 0.034 | 0.580 ± 0.069 | 0.859 ± 0.049 | 0.807 ± 0.050 | 0.673 ± 0.034 | 0.806 ± 0.055 | 0.553 |
| LR | 150 | 103 | 0.719 ± 0.037 | 0.601 ± 0.066 | 0.838 ± 0.046 | 0.788 ± 0.052 | 0.679 ± 0.051 | 0.802 ± 0.035 | 0.557 |
| DT | 150 | 103 | 0.791 ± 0.037 | 0.810 ± 0.055 | 0.773 ± 0.051 | 0.781 ± 0.045 | 0.804 ± 0.054 | 0.792 ± 0.037 | 0.596 |
| Mean performances (±standard deviation) of RF on 100 test sets with 14 selected features | |||||||||
| RF | 150 | 14 | 0.827 ± 0.031 | 0.756 ± 0.051 | 0.897 ± 0.041 | 0.882 ± 0.040 | 0.788 ± 0.044 | 0.902 ± 0.027 | — |
| Performances of RF on independent 84 patients with 14 selected features | |||||||||
| RF | 84 | 14 | 0.798 | 0.750 | 0.813 | 0.556 | 0.912 | 0.904 | — |
PPV, Positive Predictive Value; NPV, Negative Predictive Value; AUC, Area Under Curve; SVM, Support Vector Machine; LR, Logistic Regression; DT, Decision Tree; RF, Random Forest.
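The metrics in the table above all derive from a binary confusion matrix, and a short worked example may help in reading them. The counts used below (15 true positives, 5 false negatives, 52 true negatives, 12 false positives) are not reported in the source. They are one set of integers that sums to 84 and reproduces the independent-set row to three decimals, so they should be read as an inferred illustration only. They imply 20 relapses among the 84 patients, which the source does not state. The function name `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics, with relapse as the positive class."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # relapsed patients correctly flagged
        "specificity": tn / (tn + fp),  # non-relapsed patients correctly cleared
        "ppv": tp / (tp + fp),          # flagged patients who actually relapsed
        "npv": tn / (tn + fn),          # cleared patients who did not relapse
    }


# Assumed counts (not from the source); chosen to match the reported row.
metrics = binary_metrics(tp=15, fn=5, tn=52, fp=12)

# Values reported for RF on the 84 independent patients.
reported = {
    "accuracy": 0.798,
    "sensitivity": 0.750,
    "specificity": 0.813,
    "ppv": 0.556,
    "npv": 0.912,
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f} (reported {reported[name]:.3f})")
```

Under these assumed counts, the low PPV next to the high NPV follows from class imbalance: false positives (12) are nearly as numerous as true positives (15), while false negatives (5) are few relative to true negatives (52).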
Figure 1. Composition of predictive variable sets selected by the nested cross-validation strategy. The features were ranked according to their selection probability measured across the 100 training sets. Variables with selection probability >20% are marked by red rhombuses, with details shown in the upper right inset. WBC, White Blood Cell count; HB, Haemoglobin; PLT, Platelet; PHC, Peripheral Heterotypic cell; LDH, Lactic Dehydrogenase; D0-BM, Lymphoblast in Bone Marrow at diagnosis; D33-BM, Lymphoblast in Bone Marrow on Day 33; BW, Birth Weight; FER, Ferroprotein; D15-BM, Lymphoblast in Bone Marrow on Day 15; Hepat, Hepatomegaly; Splen, Splenomegaly.
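The selection probability used for ranking in Figure 1 can be read as the fraction of training sets in which a variable was selected. The sketch below shows that calculation on made-up data: the five toy runs are purely illustrative and are not results from the study, and `selection_probability` is a hypothetical helper name. Only the 20% threshold comes from the caption.

```python
from collections import Counter


def selection_probability(selected_sets):
    """Fraction of runs in which each feature appears at least once."""
    counts = Counter(f for run in selected_sets for f in set(run))
    n_runs = len(selected_sets)
    return {feature: count / n_runs for feature, count in counts.items()}


# Toy example: features selected in five hypothetical training runs.
runs = [
    ["WBC", "D33-BM", "LDH"],
    ["WBC", "D33-BM"],
    ["WBC", "PLT"],
    ["WBC", "D33-BM", "HB"],
    ["WBC", "LDH"],
]

probs = selection_probability(runs)

# Rank by descending probability, breaking ties alphabetically.
ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))

# Keep features selected in strictly more than 20% of runs, as in the caption.
frequent = [feature for feature, p in ranked if p > 0.20]
print(frequent)  # prints: ['WBC', 'D33-BM', 'LDH']
```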
Figure 2. Classification accuracy, sensitivity and AUC of the random forest model as a function of the number of features considered. AUC, area under the curve.
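A curve like the one in Figure 2 is what a forward selection over ranked features produces: features are added one at a time and the model is rescored at each step. The sketch below is one plausible reading of "shortest list of most discriminatory variables", not the authors' algorithm. It assumes features are added in ranked order and that the shortest prefix scoring within `tolerance` of the best score is kept. The names `forward_select` and `toy_score` are hypothetical, and the toy scoring curve is invented for illustration.

```python
def forward_select(ranked_features, score, tolerance=0.0):
    """Return (shortest prefix within `tolerance` of the best score, all scores)."""
    if not ranked_features:
        return [], []
    # Score every prefix of the ranked list: top 1, top 2, ..., all features.
    scores = [score(ranked_features[:k]) for k in range(1, len(ranked_features) + 1)]
    best = max(scores)
    for k, s in enumerate(scores, start=1):
        if s >= best - tolerance:
            return list(ranked_features[:k]), scores
    return list(ranked_features), scores


def toy_score(features):
    """Invented accuracy curve that stops improving after three features."""
    return 0.60 + 0.08 * min(len(features), 3)


ranked = ["WBC", "D33-BM", "LDH", "PLT", "HB"]
selected, curve = forward_select(ranked, toy_score)
print(selected)  # prints: ['WBC', 'D33-BM', 'LDH']
```

In practice `score` would be a cross-validated estimate, such as mean accuracy or AUC over the inner folds, rather than a fixed function.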
Prediction performances on stratified risk groups of the 84 patients.
| Risk group | Total (N) | Relapsed (N) | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Standard-Risk | 26 | 3 | 0.829 | 0.579 | 0.880 |
| Intermediate-Risk | 30 | 12 | 0.699 | 0.691 | 0.705 |
| High-Risk | 28 | 13 | 0.821 | 0.855 | 0.739 |
Figure 3. Flow chart of data collection, data preprocessing, feature selection and model development. The overall data set comprised 570 patients, 121 of whom relapsed.