| Literature DB >> 35589908 |
Qiong Bai, Chunyan Su, Wen Tang, Yike Li.
The purpose of this study was to assess the feasibility of machine learning (ML) in predicting the risk of end-stage kidney disease (ESKD) from patients with chronic kidney disease (CKD). Data were obtained from a longitudinal CKD cohort. Predictor variables included patients' baseline characteristics and routine blood test results. The outcome of interest was the presence or absence of ESKD by the end of 5 years. Missing data were imputed using multiple imputation. Five ML algorithms, including logistic regression, naïve Bayes, random forest, decision tree, and K-nearest neighbors were trained and tested using fivefold cross-validation. The performance of each model was compared to that of the Kidney Failure Risk Equation (KFRE). The dataset contained 748 CKD patients recruited between April 2006 and March 2008, with the follow-up time of 6.3 ± 2.3 years. ESKD was observed in 70 patients (9.4%). Three ML models, including the logistic regression, naïve Bayes and random forest, showed equivalent predictability and greater sensitivity compared to the KFRE. The KFRE had the highest accuracy, specificity, and precision. This study showed the feasibility of ML in evaluating the prognosis of CKD based on easily accessible features. Three ML models with adequate performance and sensitivity scores suggest a potential use for patient screenings. Future studies include external validation and improving the models with additional predictor variables.Entities:
Year: 2022 PMID: 35589908 PMCID: PMC9120106 DOI: 10.1038/s41598-022-12316-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
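The pipeline described in the abstract (multiple imputation of missing values followed by fivefold cross-validation) can be sketched with scikit-learn. This is an illustrative reconstruction, not the authors' code: `IterativeImputer` stands in for their multiple-imputation step, the data below are synthetic, and only the logistic regression model from the hyperparameter table is shown.

```python
# Sketch of the abstract's pipeline: impute missing values, then evaluate a
# classifier with fivefold cross-validation. Data are synthetic, sized to
# mimic the cohort (748 patients, ~9.4% ESKD prevalence).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(748, 20))              # 748 patients, 20 baseline features
X[rng.random(X.shape) < 0.05] = np.nan      # introduce ~5% missing values
y = (rng.random(748) < 0.094).astype(int)   # ~9.4% positive outcome rate

# Imputation inside the pipeline keeps it within each CV fold,
# avoiding leakage from the test fold into the imputation model.
model = make_pipeline(
    IterativeImputer(random_state=0),
    LogisticRegression(penalty="l2", class_weight="balanced",
                       max_iter=100000, C=10, solver="liblinear"),
)
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(auc_scores.mean())
```

Fitting the imputer inside the pipeline, rather than on the full dataset up front, is the standard way to keep cross-validation folds independent.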
Baseline patient characteristics.
| Variables | Original data |
|---|---|
| Age (years) | 57.8 ± 17.6 |
| Gender (male/female) | 419/329 |
| SBP (mmHg) | 129.5 ± 17.8 |
| DBP (mmHg) | 77.7 ± 11.1 |
| BMI (kg/m²) | 24.8 ± 3.7 |
| Cause of CKD | |
| Primary GN | 292 (39.0%) |
| Diabetes | 224 (29.9%) |
| Hypertension | 97 (13.0%) |
| CIN | 64 (8.6%) |
| Others | 18 (2.4%) |
| Unknown | 53 (7.1%) |
| Creatinine (µmol/L) | 130.0 (100.0, 163.0) |
| Urea (mmol/L) | 7.9 (5.6, 10.4) |
| ALT (U/L) | 17.0 (12.0, 24.0) |
| AST (U/L) | 18.0 (15.0, 22.0) |
| ALP (U/L) | 60.0 (50.0, 75.0) |
| Total protein (g/L) | 71.6 ± 8.4 |
| Albumin (g/L) | 42.2 ± 5.6 |
| Uric acid (µmol/L) | 374.0 (301.0, 459.0) |
| Calcium (mmol/L) | 2.2 ± 0.1 |
| Phosphorous (mmol/L) | 1.2 ± 0.2 |
| Ca × P (mg²/dL²) | 33.5 ± 5.6 |
| Blood leukocytes (10⁹/L) | 7.1 ± 2.4 |
| Hemoglobin (g/L) | 131.0 ± 20.3 |
| Platelets (10⁹/L) | 209.8 ± 57.1 |
| eGFR (ml/min/1.73 m²) | 46.1 (32.6, 67.7) |
| CKD stage | |
| Stage 1 | 58 (7.8%) |
| Stage 2 | 183 (24.5%) |
| Stage 3 | 352 (47.1%) |
| Stage 4 | 119 (15.9%) |
| Stage 5 | 36 (4.8%) |
| Total cholesterol (mmol/L) | 5.1 (4.3, 5.9) |
| Triglyceride (mmol/L) | 1.8 (1.3, 2.6) |
| HDL-c (mmol/L) | 1.3 (1.1, 1.6) |
| LDL-c (mmol/L) | 3.0 (2.4, 3.7) |
| Fasting glucose (mmol/L) | 5.4 (4.9, 6.2) |
| Potassium (mmol/L) | 4.3 ± 0.5 |
| Sodium (mmol/L) | 140.2 ± 2.8 |
| Chlorine (mmol/L) | 106.9 ± 3.7 |
| Bicarbonate (mmol/L) | 25.9 ± 3.6 |
| Comorbidities | |
| Hypertension | 558 (74.6%) |
| Diabetes mellitus | 415 (55.5%) |
| Cardiovascular or cerebrovascular disease | 177 (23.7%) |
| Smoking | 91 (12.6%) |
SBP systolic blood pressure, DBP diastolic blood pressure, GN glomerulonephritis, CIN chronic interstitial nephritis, BMI body mass index, eGFR estimated glomerular filtration rate, ALT alanine aminotransferase, AST aspartate transaminase, ALP alkaline phosphatase, CKD chronic kidney disease, HDL-c high density lipoprotein cholesterol, LDL-c low density lipoprotein cholesterol, Ca × P calcium-phosphorus product.
Hyperparameters of the algorithms.
| Algorithms | Hyperparameters |
|---|---|
| Logistic regression | penalty = 'l2', class_weight = 'balanced', max_iter = 100000, C = 10, solver = 'liblinear' |
| Naive Bayes | type = 'multinomial', alpha = 150 |
| Decision tree | criterion = 'gini', splitter = 'best', max_depth = 16, max_features = 15, min_samples_leaf = 5, min_samples_split = 0.0001 |
| Random forest | class_weight = 'balanced', criterion = 'gini', max_depth = 9, max_features = 17, min_samples_leaf = 6, min_samples_split = 30, n_estimators = 32 |
| K-nearest neighbors | weights = 'distance', metric = 'minkowski', n_neighbors = 16, leaf_size = 10 |
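The hyperparameter names in the table match scikit-learn's API, so the five models can be instantiated as follows. This is an illustrative reconstruction under that assumption, not the authors' code; the `"multinomial"` type for naïve Bayes is read as `MultinomialNB`.

```python
# Illustrative reconstruction of the tabled hyperparameters as
# scikit-learn estimators (parameter names match sklearn's API).
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic regression": LogisticRegression(
        penalty="l2", class_weight="balanced", max_iter=100000,
        C=10, solver="liblinear"),
    "Naive Bayes": MultinomialNB(alpha=150),  # "multinomial" type
    "Decision tree": DecisionTreeClassifier(
        criterion="gini", splitter="best", max_depth=16, max_features=15,
        min_samples_leaf=5, min_samples_split=0.0001),
    "Random forest": RandomForestClassifier(
        class_weight="balanced", criterion="gini", max_depth=9,
        max_features=17, min_samples_leaf=6, min_samples_split=30,
        n_estimators=32),
    "K-nearest neighbors": KNeighborsClassifier(
        weights="distance", metric="minkowski",
        n_neighbors=16, leaf_size=10),
}
for name, estimator in models.items():
    print(name, type(estimator).__name__)
```

Note that `class_weight='balanced'` (logistic regression, random forest) upweights the minority ESKD class, which is consistent with the higher sensitivity those models show in the results table.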
The performance of all algorithms.
| | Accuracy | Sensitivity | Specificity | Precision | F1 Score | AUC |
|---|---|---|---|---|---|---|
| Logistic regression | 0.75 (0.72, 0.79) | 0.79 (0.73, 0.85) | 0.75 (0.71, 0.79) | 0.26 (0.24, 0.29) | 0.38 (0.36, 0.41) | 0.79 (0.77, 0.82) |
| Naïve Bayes | 0.86 (0.85, 0.87) | 0.72 (0.68, 0.75) | 0.87 (0.86, 0.89) | 0.37 (0.35, 0.40) | 0.49 (0.46, 0.51) | 0.80 (0.77, 0.82) |
| Random forest | 0.82 (0.80, 0.85) | 0.76 (0.71, 0.81) | 0.83 (0.80, 0.86) | 0.34 (0.30, 0.39) | 0.46 (0.43, 0.49) | 0.81 (0.78, 0.83) |
| K-nearest neighbors | 0.84 (0.81, 0.86) | 0.60 (0.57, 0.64) | 0.86 (0.83, 0.89) | 0.35 (0.30, 0.40) | 0.43 (0.40, 0.46) | 0.73 (0.71, 0.75) |
| Decision tree | 0.84 (0.82, 0.86) | 0.44 (0.39, 0.49) | 0.89 (0.86, 0.91) | 0.33 (0.26, 0.40) | 0.35 (0.32, 0.39) | 0.66 (0.63, 0.68) |
| KFRE | 0.90 (0.90, 0.91) | 0.47 (0.42, 0.52) | 0.95 (0.94, 0.96) | 0.50 (0.45, 0.55) | 0.48 (0.43, 0.52) | 0.80 (0.78, 0.83) |
All outcomes are expressed as mean (95% confidence interval).
KFRE kidney failure risk equation, AUC area under the curve.
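All of the tabled metrics except AUC derive from the 2×2 confusion matrix of predicted versus observed ESKD. The sketch below shows the formulas; the counts are hypothetical, chosen only for illustration (they are not taken from the study).

```python
# How the tabled metrics derive from a 2x2 confusion matrix.
# The counts below are hypothetical, for illustration only.
tp, fn, fp, tn = 50, 20, 95, 583  # true pos, false neg, false pos, true neg

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)        # recall on the ESKD-positive class
specificity = tn / (tn + fp)        # recall on the ESKD-negative class
precision   = tp / (tp + fp)        # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, sensitivity, specificity, precision, f1)
```

With only 9.4% of patients reaching ESKD, accuracy is dominated by the negative class; this is why the KFRE can top the accuracy and specificity columns while trailing three ML models on sensitivity, the metric that matters most for screening.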
Figure 1. ROC curves of the random forest algorithm and the KFRE model.