San Wang1, Jieun Han2, Se Young Jung3,4, Tae Jung Oh5,6, Sen Yao1, Sanghee Lim1, Hee Hwang7, Ho-Young Lee7, Haeun Lee7.
Abstract
This study aimed to develop a machine learning (ML) model to predict the 5-year risk of developing end-stage renal disease (ESRD) in patients with type 2 diabetes mellitus (T2DM). It also aimed to implement the developed algorithm in an electronic medical record (EMR) system using Health Level Seven (HL7) Fast Healthcare Interoperability Resources (FHIR). The final dataset used for modeling included 19,159 patients. The medical data were engineered into various types of features, which were input into several ML classifiers. The best-performing classifier was XGBoost, with an area under the receiver operating characteristic curve (AUROC) of 0.95 and an area under the precision-recall curve (AUPRC) of 0.79 under three-fold cross-validation, compared with the other models, namely logistic regression, random forest, and support vector machine (AUROC range, 0.929-0.943; AUPRC range, 0.765-0.792). Serum creatinine, serum albumin, the urine albumin-to-creatinine ratio, the Charlson comorbidity index, estimated GFR, and medication days of insulin were the features ranked highest for ESRD risk prediction. The algorithm was implemented in the EMR system using HL7 FHIR through an ML-dedicated server that preprocessed unstructured data and trained on updated data.
Year: 2022 PMID: 35789173 PMCID: PMC9253099 DOI: 10.1038/s41598-022-15036-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
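The abstract states that predictions reach the EMR through HL7 FHIR but does not say which resource carries them. As a minimal sketch, assuming the FHIR R4 `RiskAssessment` resource is used (an assumption, not something the abstract confirms), a predicted probability could be wrapped as follows. The helper name `build_risk_assessment`, the patient ID, and the probability value are all hypothetical.

```python
import json
from datetime import date


def build_risk_assessment(patient_id: str, probability: float, assessed_on: date) -> dict:
    """Wrap a predicted 5-year ESRD probability in a RiskAssessment-shaped dict.

    Assumption: the choice of RiskAssessment is illustrative only; the
    abstract does not name the FHIR resource the authors used.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return {
        "resourceType": "RiskAssessment",
        "status": "final",
        "subject": {"reference": f"Patient/{patient_id}"},
        "occurrenceDateTime": assessed_on.isoformat(),
        "prediction": [
            {
                # Free-text outcome; a production system would use a coded concept.
                "outcome": {"text": "End-stage renal disease within 5 years"},
                "probabilityDecimal": round(probability, 3),
            }
        ],
    }


# Hypothetical patient ID and model output, for illustration only.
resource = build_risk_assessment("example-123", 0.274, date(2022, 1, 15))
print(json.dumps(resource, indent=2))
```

Keeping the model output in a standard resource, rather than a custom payload, is what would let an EMR consume it without model-specific parsing.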
Baseline characteristics of study participants.
| | Total | Not progressed to ESRD | Progressed to ESRD | P value |
|---|---|---|---|---|
| Number | 19,159 | 17,576 | 1,583 | |
| Age | 62.3 ± 11.7 (100.0%) | 62.0 ± 11.6 (100.0%) | 66.1 ± 12.1 (100.0%) | < 0.001 |
| Male | 10,674 (55.7%) | 9739 (55.4%) | 935 (59.1%) | |
| Female | 8485 (44.3%) | 7837 (44.6%) | 648 (40.9%) | 0.005 |
| No | 10,052 (52.5%) | 9178 (52.2%) | 874 (55.2%) | |
| Yes | 3622 (18.9%) | 3281 (18.7%) | 341 (21.5%) | 0.192 |
| Duration of Hospital Visits (days)a | 1260.5 ± 1061.0 (100.0%) | 1267.6 ± 1042.5 (100.0%) | 1181.8 ± 1245.2 (100.0%) | 0.002 |
| Weight (kg) | 66.0 ± 12.3 (88.3%) | 66.2 ± 12.2 (88.2%) | 63.8 ± 12.6 (88.6%) | < 0.001 |
| BMI (kg/m2) | 25.2 ± 4.2 (56.4%) | 25.2 ± 4.1 (57.2%) | 24.7 ± 5.3 (47.9%) | 0.003 |
| SBP (mmHg) | 129.1 ± 17.7 (89.4%) | 128.6 ± 17.2 (89.3%) | 135.1 ± 21.3 (90.1%) | < 0.001 |
| DBP (mmHg) | 74.1 ± 11.4 (90.6%) | 74.2 ± 11.3 (90.5%) | 73.0 ± 12.6 (91.9%) | < 0.001 |
| TC (mg/dL) | 170.4 ± 37.7 (99.1%) | 170.8 ± 37.1 (99.2%) | 165.8 ± 44.4 (98.8%) | < 0.001 |
| HDL (mg/dL) | 48.5 ± 12.2 (85.1%) | 48.7 ± 12.1 (85.6%) | 45.2 ± 12.6 (79.5%) | < 0.001 |
| LDL (mg/dL) | 92.2 ± 29.0 (78.0%) | 92.1 ± 28.6 (78.7%) | 93.0 ± 33.8 (70.9%) | 0.324 |
| TG (mg/dL) | 144.7 ± 80.6 (85.1%) | 144.0 ± 80.2 (85.6%) | 152.7 ± 85.5 (79.6%) | < 0.001 |
| eGFR (ml/min/1.73 m2) | 76.7 ± 22.4 (98.9%) | 79.9 ± 19.3 (98.9%) | 41.5 ± 24.0 (98.9%) | < 0.001 |
| UACR | 72.7 ± 216.9 (51.9%) | 45.3 ± 145.6 (52.4%) | 420.6 ± 494.4 (45.6%) | < 0.001 |
| Duration of Type 2 Diabetesb | 884.5 ± 901.4 (100.0%) | 884.3 ± 884.2 (100.0%) | 886.4 ± 1074.2 (100.0%) | 0.930 |
| Hypertension: No | 2446 (12.8%) | 2422 (13.8%) | 24 (1.5%) | < 0.001 |
| Hypertension: Yes | 16,713 (87.2%) | 15,154 (86.2%) | 1559 (98.5%) | |
| Duration of hypertension | 1001.9 ± 973.5 (87.2%) | 1009.8 ± 959.1 (86.2%) | 924.7 ± 1101.0 (98.5%) | 0.001 |
Data are shown as mean ± SD or number (%). The percentage in parentheses indicates the percentage of non-zero values (continuous variables) or the percentage of a given category (categorical variables). Patient characteristics are calculated from one randomly selected cohort.
ESRD end stage renal disease, BMI body mass index, SBP systolic blood pressure, DBP diastolic blood pressure, TC total cholesterol, HDL high density lipoprotein, LDL low density lipoprotein, TG triglyceride, T2D type 2 diabetes, UACR Urine albumin to creatinine ratio, eGFR estimated glomerular filtration rate.
a Duration of Hospital Visits is the total duration of a patient's visits to Seoul National University Bundang Hospital.
b Duration of Type 2 Diabetes is the total duration of T2DM management, derived from ICD-10 diagnosis codes for type 2 diabetes.
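The group sizes in the baseline table fix the class balance, which matters when reading the AUPRC. A random classifier's expected AUPRC equals the positive-class prevalence, so the short calculation below, using only the counts from the table and the AUPRC quoted in the abstract, gives a rough sense of scale. The variable names are illustrative.

```python
# Counts taken from the baseline characteristics table.
n_total = 19_159
n_no_esrd = 17_576
n_esrd = 1_583

# The two outcome groups should partition the cohort.
assert n_no_esrd + n_esrd == n_total

# Prevalence of the positive class (progressed to ESRD).
prevalence = n_esrd / n_total

# AUPRC reported in the abstract for XGBoost.
reported_auprc = 0.79

# Ratio of the reported AUPRC to the chance-level AUPRC (= prevalence).
lift_over_chance = reported_auprc / prevalence

print(f"ESRD prevalence: {prevalence:.1%}")
print(f"AUPRC relative to chance level: {lift_over_chance:.1f}x")
```

With about 8.3% of patients progressing to ESRD, the AUPRC is the more informative of the two headline metrics, because the AUROC is less sensitive to class imbalance.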
Model performance of the developed model.
| Model | XGB | | | | |
|---|---|---|---|---|---|
| Metric | Accuracy | AUPRC | AUROC | Precision | Recall |
| Count | 9.000 | 9.000 | 9.000 | 9.000 | 9.000 |
| Mean | 0.959 | 0.785 | 0.947 | 0.828 | 0.631 |
| SD | 0.002 | 0.013 | 0.005 | 0.018 | 0.024 |
| ci_low | 0.957 | 0.777 | 0.944 | 0.817 | 0.616 |
| ci_high | 0.960 | 0.794 | 0.950 | 0.840 | 0.644 |
The values are derived from 9 iterations: the number of folds (k = 3) × the number of epochs (N = 3). The final performance value is the average of three-fold validation over three cohorts. The confidence interval is calculated by bootstrapping and then taking the 95% percentile range.
ci_low Lowest 95% confidence interval of experiments, ci_high Highest 95% confidence interval of experiments, SD Standard Deviation, AUPRC Area Under Precision-Recall Curve, AUROC Area Under Receiver-Operator Characteristics.
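The table note describes a bootstrap followed by a percentile range but gives no further detail. One common reading is a percentile bootstrap of the mean over the 9 per-fold scores, sketched below with the standard library only. The function name `bootstrap_ci`, the resample count, the seed, and the nine scores are hypothetical and are not taken from the paper.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical AUROC scores, one per (cohort, fold) pair: 3 cohorts x 3 folds = 9.
scores = [0.941, 0.952, 0.946, 0.949, 0.943, 0.950, 0.945, 0.953, 0.944]

ci_low, ci_high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f} ci_low={ci_low:.3f} ci_high={ci_high:.3f}")
```

An interval built this way describes uncertainty in the mean score, which would explain why the reported intervals are narrower than the mean ± 2 SD.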
Figure 1. Model discrimination and calibration performance. AUROC area under the receiver-operating characteristic, PRC precision recall curve, ROC receiver-operating characteristic.
Model performance of different machine learning algorithms.
| Model | LR | | RF | | SVM | | XGB | |
|---|---|---|---|---|---|---|---|---|
| Metric | AUPRC | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC | AUROC |
| count | 9.000 | 9.000 | 9.000 | 9.000 | 9.000 | 9.000 | 9.000 | 9.000 |
| mean | 0.783 | 0.942 | 0.792 | 0.943 | 0.765 | 0.929 | 0.792 | 0.947 |
| SD | 0.012 | 0.002 | 0.015 | 0.005 | 0.011 | 0.007 | 0.009 | 0.004 |
| ci_low | 0.775 | 0.940 | 0.782 | 0.939 | 0.759 | 0.924 | 0.786 | 0.945 |
| ci_high | 0.790 | 0.943 | 0.800 | 0.946 | 0.772 | 0.933 | 0.797 | 0.950 |
ci_low Lowest 95% confidence interval of experiments, ci_high Highest 95% confidence interval of experiments, SD Standard Deviation, AUPRC Area Under Precision-Recall Curve, AUROC Area Under Receiver-Operator Characteristics.
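One rough way to read the comparison table is to check which models' confidence intervals overlap with XGBoost's. The sketch below uses the `ci_low` and `ci_high` values from the table as given. Interval overlap is only a heuristic, not a formal significance test, and the helper name `intervals_overlap` is illustrative.

```python
def intervals_overlap(a, b):
    """Return True if closed intervals a=(low, high) and b=(low, high) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# (ci_low, ci_high) pairs copied from the model comparison table.
ci = {
    "AUPRC": {
        "LR": (0.775, 0.790),
        "RF": (0.782, 0.800),
        "SVM": (0.759, 0.772),
        "XGB": (0.786, 0.797),
    },
    "AUROC": {
        "LR": (0.940, 0.943),
        "RF": (0.939, 0.946),
        "SVM": (0.924, 0.933),
        "XGB": (0.945, 0.950),
    },
}

# For each metric, flag the models whose interval intersects XGBoost's.
overlap_with_xgb = {
    metric: {
        model: intervals_overlap(interval, by_model["XGB"])
        for model, interval in by_model.items()
        if model != "XGB"
    }
    for metric, by_model in ci.items()
}

for metric, result in overlap_with_xgb.items():
    print(metric, result)
```

On these numbers, XGBoost's AUROC interval overlaps only with random forest's, while its AUPRC interval overlaps with those of both logistic regression and random forest. The margin over the simpler models is therefore small.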
Figure 2. Cost–benefit analysis.
Figure 3. SHAP summary plot. The higher a feature appears in this figure, the greater its contribution to the prediction. Each dot represents the data of one patient, and the color of the dot indicates whether the respective feature value is low or high (as shown on the y-axis on the right). The horizontal position of the dot indicates whether the feature increases (right) or decreases (left) the risk prediction. The farther a dot is from 0, the greater the contribution to the prediction. SHAP Shapley Additive Explanations.
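The per-patient contributions plotted in a SHAP summary are Shapley values: each feature's average marginal effect on the prediction across all orderings of the features. The sketch below computes them exactly by enumerating subsets, which is feasible only for a handful of features. It is not the algorithm used by tree-based SHAP explainers, and the toy risk function, feature values, and reference values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the patient's value, the rest the baseline's.
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(features):
    """Hypothetical risk score over (creatinine, albumin, eGFR); not the paper's model."""
    creatinine, albumin, egfr = features
    return 0.3 * creatinine - 0.1 * albumin - 0.004 * egfr + 0.05 * creatinine * (egfr < 60)


patient = [2.1, 3.2, 41.5]    # hypothetical patient
reference = [0.9, 4.2, 79.9]  # hypothetical reference point
phi = shapley_values(toy_risk, patient, reference)

for name, contribution in zip(["creatinine", "albumin", "eGFR"], phi):
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the patient's score and the reference score. This is why a dot's distance from 0 in the summary plot can be read directly as that feature's share of the prediction.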
Figure 4. Cohort selection. ESRD end-stage renal disease, SNUBH Seoul National University Bundang Hospital.
Figure 5. Outcome labeling examples. OW observation window, PW prediction window, TAR time at risk.
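The Figure 5 caption names an observation window and a prediction window, but the figure itself is not reproduced here, so the authors' exact labeling rules are unknown. As a hedged sketch of how window-based labeling commonly works, the function below assigns a label from an index date (the end of the observation window) and a 5-year prediction window. The function name, the day-based approximation of 5 years, and the handling of incomplete follow-up are all assumptions.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: the 5-year prediction window is approximated as 5 * 365 days.
PREDICTION_WINDOW = timedelta(days=5 * 365)


def label_outcome(index_date: date, esrd_date: Optional[date],
                  last_follow_up: date) -> Optional[int]:
    """Return 1, 0, or None (patient cannot be labeled) under the assumed rules."""
    window_end = index_date + PREDICTION_WINDOW
    if esrd_date is not None:
        if esrd_date <= index_date:
            return None  # ESRD already present at the index date
        # ESRD inside the window is a positive; ESRD after it is a negative.
        return 1 if esrd_date <= window_end else 0
    # No ESRD recorded: negative only if follow-up covers the whole window.
    return 0 if last_follow_up >= window_end else None


# Hypothetical patients, for illustration only.
progressed = label_outcome(date(2010, 1, 1), date(2013, 6, 1), date(2013, 6, 1))
stable = label_outcome(date(2010, 1, 1), None, date(2016, 1, 1))
censored = label_outcome(date(2010, 1, 1), None, date(2012, 1, 1))
print(progressed, stable, censored)
```

Returning `None` for patients whose follow-up ends before the window closes avoids counting them as negatives. Whether the authors excluded such patients is not stated in this excerpt.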