Violeta Rodriguez-Romero, Richard F Bergstrom, Brian S Decker, Gezim Lahu, Majid Vakilynejad, Robert R Bies.
Abstract
Applying data mining and machine learning (ML) techniques to clinical data might identify predictive biomarkers for diabetic nephropathy (DN), a common complication of type 2 diabetes mellitus (T2DM). A retrospective analysis of the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial was intended to identify such factors using ML. The longitudinal data were stratified by time after patient enrollment to differentiate early and late predictors. Our results showed that Random Forest and Simple Logistic Regression methods exhibited the best performance among the evaluated algorithms. Baseline values for glomerular filtration rate (GFR), urinary creatinine, urinary albumin, potassium, cholesterol, low-density lipoprotein, and urinary albumin to creatinine ratio were identified as DN predictors. Early predictors were the baseline values of GFR, systolic blood pressure, as well as fasting plasma glucose (FPG) and potassium at month 4. Changes per year in GFR, FPG, and triglycerides were recognized as predictors of late development. In conclusion, ML-based methods successfully identified predictive factors for DN among patients with T2DM.
Year: 2019 PMID: 31112000 PMCID: PMC6742939 DOI: 10.1111/cts.12647
Source DB: PubMed Journal: Clin Transl Sci ISSN: 1752-8054 Impact factor: 4.689
Figure 1. Model development. Development of the classification model:
1. Feature selection. All the available attributes were analyzed. The Action to Control Cardiovascular Risk in Diabetes (ACCORD) data set was divided into eight different time windows.
2. Data enhancement. The slopes for each attribute were calculated for each time window in order to account for the change per year.
3. Data splitting. Each subset of data was randomly allocated into separate training and testing data sets using the sample_frac function.
4. Data balancing. Training subsets were balanced for the binary outcome (presence or absence of nephropathy) using the SMOTE method.
5. Model training. Different classifiers were evaluated following a 10‐fold cross‐validation.
6. Model validation. The learning algorithms were tested using the testing subsets.
7. Model selection. Evaluation of receiver operating characteristics (ROCs) and true positive rates (TPRs) to identify the most sensitive and accurate classifier.
8. Attribute selection. Identification of the most predictive attributes using the InfoGain method.

Notes: Dashed lines represent the result from each step of the process. Bold lines represent the final outcome of the model development. Italic font represents the implemented method. 1R, One Rule; ALT, alanine aminotransferase; BP, blood pressure; CPK, creatine phosphokinase; CVD, cardiovascular disease; FPG, fasting plasma glucose; GFR, glomerular filtration rate; HbA1c, glycosylated hemoglobin; J48, J48 Decision Tree; K, potassium; NB, Naïve Bayes; RF, Random Forest; SCr, serum creatinine; SL, Simple Logistic; SMO, Sequential Minimal Optimization; UAlb, urinary albumin; UACR, urinary albumin to creatinine ratio; UCr, urinary creatinine. For more details, please refer to the text.
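The data-balancing step in Figure 1 names SMOTE but gives no implementation details. As a rough illustration only, the core idea of SMOTE (creating synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours) can be sketched as below. The function name `smote_sketch`, its parameters, and the toy points are assumptions made for this example; they are not taken from the study.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from minority-class points.

    Illustrative only: each synthetic point lies on the line segment between
    a randomly chosen minority point and one of its k nearest minority
    neighbours (squared Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by distance to the chosen base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class (values are made up).
minority_points = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority_points, n_new=6, k=2, seed=42)
```

Because every synthetic point is an interpolation between two existing minority points, the new samples stay inside the region already occupied by the minority class. In a pipeline like the one in Figure 1, such oversampling would be applied to the training subsets only, so the testing subsets keep their original class balance.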
Patient baseline characteristics from the ACCORD trial (n = 10,251)
| Variable | Mean (SD), median [IQR], or n (%) |
|---|---|
| Age, year | 62.8 (6.6) |
| Male | 6,299 (61.4%) |
| Race | |
| Black | 1,953 (19%) |
| White | 6,393 (62%) |
| Hispanic | 737 (7%) |
| Other | 1,168 (11%) |
| History of CV event | 3,609 (35.2%) |
| GHb, % {mmol/mol} | 8.3 (1.01) {67} |
| Blood pressure, mmHg | |
| Systolic | 136.2 (16.5) |
| Diastolic | 74.7 (10.2) |
| eGFR, mL/minute/1.73 m² | 89.6 [75.4–105.1] |
| SCr, mg/dL | 0.9 [0.8–1.0] |
| UCr, mg/dL | 114.7 [79.4–158.5] |
| UAlb, mg/dL | 1.59 [0.7–4.96] |
| Cholesterol, mg/dL | 182.9 |
| LDL | 104.9 (33.9) |
| HDL | 41.9 (11.6) |
| Triglycerides, mg/dL | 155.5 [106–229] |
| LDL | 101 [81–125] |
| HDL | 40 [34–48] |
| FPG, mg/dL | 167 [138–204] |
| ALT, mg/dL | 24 [18–32] |
| CPK, mg/dL | 105 [72–164] |
| K, mmol/L | 4.4 [4.2–4.7] |
| UACR, mg/g | 14 [7–47] |
ACCORD, Action to Control Cardiovascular Risk in Diabetes; ALT, alanine aminotransferase; CPK, creatine phosphokinase; CV, cardiovascular; FPG, fasting plasma glucose; eGFR, estimated glomerular filtration rate; GHb, glycosylated hemoglobin; HDL, high‐density lipoprotein; IQR, interquartile range; K, potassium; LDL, low‐density lipoprotein; SCr, serum creatinine; UACR, urinary albumin to creatinine ratio; UAlb, urinary albumin; UCr, urinary creatinine.
Figure 2. Incidence of nephropathy among the Action to Control Cardiovascular Risk in Diabetes (ACCORD) population. Blue bars represent the number of patients who did not develop nephropathy while the pink bars are the number of patients who developed nephropathy within the specified time window on the x‐axis.
Figure 3. Incidence of the different nephropathies in patients with type 2 diabetes from Action to Control Cardiovascular Risk in Diabetes (ACCORD) study across time windows. (a) Decline in estimated glomerular filtration rate (eGFR); (b) macroalbuminuria; (c) microalbuminuria; and (d) renal failure.
Figure 4. Sensitivity and accuracy (receiver operating characteristic (ROC) areas) for the training data sets at the established time windows using different algorithms in Waikato Environment for Knowledge Analysis (WEKA). Black circles = one rule; black squares = J48; black triangles = random forest; black diamonds = simple logistic; white triangles = sequential minimal optimization (SMO); white diamonds = Naïve Bayes. TPR, true positive rate.
Figure 5. Sensitivity and accuracy (receiver operating characteristic (ROC) areas) for the testing data sets at the established time windows using different algorithms in Waikato Environment for Knowledge Analysis (WEKA). Black circles = one rule; black squares = J48; black triangles = random forest; black diamonds = simple logistic; white triangles = sequential minimal optimization (SMO); white diamonds = Naïve Bayes. TPR, true positive rate.
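Figures 4 and 5 compare classifiers by true positive rate (sensitivity) and ROC area. For readers unfamiliar with these two metrics, the following is a minimal sketch of how each can be computed from labels and classifier scores. The function names, the 0.5 decision threshold, and the toy data are assumptions for illustration; the study itself obtained these metrics from WEKA.

```python
def true_positive_rate(y_true, y_pred):
    """Sensitivity: fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive cases in y_true")
    return tp / (tp + fn)


def roc_area(y_true, scores):
    """ROC area as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = developed nephropathy) and classifier scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical 0.5 cut-off

tpr = true_positive_rate(y_true, y_pred)
auc = roc_area(y_true, scores)
```

The TPR depends on the chosen cut-off, whereas the ROC area summarizes ranking quality across all cut-offs, which is one reason for reporting both.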
Predictive risk factors through the established time windows
| Time window | 0–5.9 months | 6–11.9 months | 1–1.9 years | 2–2.9 years | 3–3.9 years | 4–4.9 years | 5–5.9 years | 6–7 years |
|---|---|---|---|---|---|---|---|---|
| Risk factor (ranked by highest to lowest importance) | GFR00 | GFR00 | UAlb00 | GFR00 | UCr00 | UAlb00 | s1.dBP | K00 |
| | UCr00 | UCr00 | GFR00 | Trig00 | Age | FPG00 | UACR00 | UAlb86 |
| | GFR04 | UAlb00 | UCr00 | s2.GFR | CPK00 | UCr00 | LDL24 | fs.UAlb |
| | Age | GFR04 | s1.GFR | UAlb00 | Trig00 | Trig00 | sBP16 | fs.UACR |
| | UAlb00 | Age | Trig00 | UCr00 | FPG00 | Chol00 | HbA1c00 | fs.GFR |
| | CPK00 | Trig00 | FPG00 | s1.GFR | GFR04 | LDL00 | HDL00 | s3.vLDL |
| | Trig00 | CPK00 | CPK00 | FPG00 | K00 | K04 | s1.LDL | Trig00 |
| | FPG00 | GFR08 | Age | Chol00 | GFR00 | UACR00 | K00 | s3.Trig |
| | K00 | FPG00 | GFR04 | CPK00 | s2.Trig | s3.GFR | Chol00 | s1.FPG |
| | Arm | K00 | K00 | UCr24 | Trig12 | K08 | s3.sBP | HDL24 |
| | LDL00 | K08 | LDL00 | K12 | Chol00 | GFR04 | s1.GFR | HDL00 |
| | Chol00 | Chol00 | Chol00 | Age | s3.GFR | GFR36 | Age | Chol00 |
| | sBP00 | FPG08 | GFR12 | s1.FPG | FPG04 | GFR00 | FPG08 | LDL00 |
| | FPG04 | LDL00 | FPG04 | s2.UCr | FPG08 | s2.GFR | GFR00 | UAlb00 |
| | K04 | UACR00 | s1.FPG | FPG08 | s1.GFR | HbA1c08 | UAlb00 | FPG00 |
| | vLDL | FPG04 | s1.Trig | LDL00 | LDL00 | sBP00 | UCr00 | UCr00 |
| | UACR00 | K04 | UACR00 | K00 | UACR00 | K00 | GFR04 | GFR04 |
| | dBP00 | sBP00 | FPG08 | UACR00 | UAlb00 | Age | LDL00 | UACR00 |
ALT, alanine aminotransferase; dBP, diastolic blood pressure; Chol, cholesterol; CPK, creatine phosphokinase; FPG, fasting plasma glucose; fs, final slope (from baseline to the end of the study); GFR, glomerular filtration rate; HbA1c, glycosylated hemoglobin; HR, heart rate; HDL, high‐density lipoprotein; K, potassium; LDL, low‐density lipoprotein; s, slope of change; sBP, systolic blood pressure; SCr, serum creatinine; Trig, triglycerides; UACR, urinary albumin to creatinine ratio; UAlb, urinary albumin; UCr, urinary creatinine; vLDL, very low‐density lipoprotein.
Numbers following a risk factor give the month of measurement counted from baseline (00 = baseline), whereas a number preceding the risk factor, together with the slope prefix s, denotes the year of change. Thus, GFR04 corresponds to the GFR measurement at month 4; s1.GFR corresponds to the change in GFR from baseline to year 1.
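The labelling convention described above is regular enough to decode mechanically. The sketch below is one possible way to do so, together with a simple two-point change-per-year calculation. The function names, the returned dictionary layout, and the two-point slope formula are assumptions for illustration; the source does not state how the slopes were actually fitted.

```python
import re


def parse_feature(label):
    """Decode a risk-factor label such as 'GFR04', 's1.GFR' or 'fs.UAlb'."""
    match = re.fullmatch(r"fs\.(\w+)", label)
    if match:  # final slope: baseline to end of study
        return {"attribute": match.group(1), "kind": "final_slope"}
    match = re.fullmatch(r"s(\d+)\.(\w+)", label)
    if match:  # slope of change, with the year given after 's'
        return {"attribute": match.group(2), "kind": "slope", "year": int(match.group(1))}
    match = re.fullmatch(r"([A-Za-z0-9]*[A-Za-z])(\d{2})", label)
    if match:  # measured value, with a two-digit month suffix
        return {"attribute": match.group(1), "kind": "value", "month": int(match.group(2))}
    return {"attribute": label, "kind": "unspecified"}  # e.g. 'Age', 'Arm'


def change_per_year(start_value, end_value, years):
    """Two-point slope (assumed form): change in a measurement per year."""
    if years <= 0:
        raise ValueError("years must be positive")
    return (end_value - start_value) / years


gfr_month4 = parse_feature("GFR04")
gfr_year1_slope = parse_feature("s1.GFR")
# Hypothetical GFR falling from 90 to 84 over 2 years.
example_slope = change_per_year(90.0, 84.0, 2.0)
```

Labels without a numeric suffix or slope prefix, such as Age or Arm, fall through to the "unspecified" case rather than being guessed at.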
a: Time after the diagnosis of diabetes. b: Displayed in order of importance (from top to bottom). c: Risk factors that fed the classifiers in at least 4 windows of time. d: Those that were consistent throughout the study. e: Those that were important for the first <3 years. f: Those that gained importance after year 2.
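Figure 1 names InfoGain as the method used to rank attributes by importance. As a simplified illustration of the underlying quantity, the sketch below computes the information gain of a numeric attribute split at a single threshold. This is an assumption-laden simplification: WEKA's evaluator handles numeric attributes through discretization rather than a hand-picked threshold, and the function names, thresholds, and toy values here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels, threshold):
    """Reduction in class entropy after splitting values at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# Made-up data: 1 = developed nephropathy, 0 = did not.
nephropathy = [0, 0, 1, 1, 0, 1]
gfr = [95, 88, 60, 55, 102, 58]        # separates the classes at 70
uninformative = [1, 2, 1, 2, 1, 2]     # nearly unrelated to the outcome

gain_gfr = info_gain(gfr, nephropathy, threshold=70)
gain_other = info_gain(uninformative, nephropathy, threshold=1)
```

Sorting attributes by a score of this kind, from highest to lowest, yields a ranked list in the same spirit as the table of predictive risk factors.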