Ramon Casanova, Santiago Saldana, Sean L Simpson, Mary E Lacy, Angela R Subauste, Chad Blackshear, Lynne Wagenknecht, Alain G Bertoni.
Abstract
Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and at two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The full RF model performed similarly (AUC = 0.82) to the more parsimonious models. The top-ranked variables according to RF included hemoglobin A1c, fasting plasma glucose, waist circumference, adiponectin, C-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data.
Year: 2016 PMID: 27727289 PMCID: PMC5058485 DOI: 10.1371/journal.pone.0163942
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Scheme illustrating the computational experiment designed to compare Random Forests and logistic regression methods.
Baseline Characteristics by Incident Diabetes Mellitus Status in Prediction of Incident Diabetes in the Jackson Heart Study Cohort using Random Forests.
| Baseline Characteristic | Diabetes* (N = 584) | No Diabetes (N = 2779) | All (N = 3363) |
|---|---|---|---|
| Sex | | | |
| Male (%) | 37.0 | 36.3 | 36.5 |
| Female (%) | 63.0 | 63.7 | 63.5 |
| Age, y | 55.2 (11.0) | 53.0 (12.8) | 53.4 (12.5) |
| Education | | | |
| < High school (%) | 19.9 | 14.4 | 15.4 |
| High school graduate (%) | 18.7 | 17.7 | 17.9 |
| Some college (%) | 29.6 | 29.7 | 29.7 |
| ≥ Bachelor’s degree (%) | 31.8 | 38.2 | 37.1 |
| BMI (kg/m²) | | | |
| BMI <18.5 (underweight) (%) | 0.7 | 0.2 | 0.6 |
| BMI 18.5–24.9 (normal weight) (%) | 17.0 | 6.4 | 15.1 |
| BMI 25–29.9 (overweight) (%) | 36.4 | 26.2 | 34.6 |
| BMI ≥ 30.0 (obese) (%) | 46.0 | 67.3 | 49.7 |
| Waist circumference (cm) | 105.0 (14.1) | 97.3 (15.6) | 98.6 (15.6) |
*Developed after baseline measurements. Abbreviations: BMI, body mass index.
Fig 2. Dependence of classification accuracy on sample size.
Prediction performance of the five models when using sample size 1000 (500 participants per group).
The values in each cell correspond to mean and standard deviation across the 100 computations.
| Method | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC |
|---|---|---|---|---|
| RF93 | 74 (0.02) | 75 (0.05) | 74 (0.02) | 0.82 (0.02) |
| LRARIC | 74 (0.01) | 74 (0.05) | 75 (0.01) | 0.82 (0.02) |
| LR93 | 71 (0.01) | 70 (0.05) | 71 (0.01) | 0.78 (0.03) |
| RF15 | 75 (0.02) | 74 (0.05) | 75 (0.01) | 0.82 (0.02) |
| LR15 | 74 (0.01) | 74 (0.04) | 74 (0.01) | 0.82 (0.02) |
RF93 = RF using all 93 variables as input; LR93 = logistic regression using all 93 variables as input; LRARIC = the logistic ARIC model; RF15 = two-stage RF; LR15 = two-stage LR.
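The four metrics reported in this table can be computed from true labels and predicted scores. The following is a minimal sketch in standard-library Python, not the authors' code. The function names, the 0.5 decision threshold, sampling without replacement, and the toy labels and scores are all assumptions made for illustration. AUC is computed here as the fraction of case/non-case pairs in which the case receives the higher score, with ties counting one half.

```python
import random
from statistics import mean, stdev


def balanced_subsample(labels, per_group, rng):
    """Draw per_group indices from each class (1 = case, 0 = non-case).

    Sampling without replacement is an assumption of this sketch.
    """
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    return rng.sample(cases, per_group) + rng.sample(controls, per_group)


def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, sensitivity, specificity and AUC for binary labels."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    pos = [s for y, s in pairs if y == 1]
    neg = [s for y, s in pairs if y == 0]
    # AUC: share of (case, non-case) pairs ranked correctly; ties count 0.5.
    wins = sum(
        (1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "auc": wins / (len(pos) * len(neg)),
    }


def summarize(runs):
    """Mean and standard deviation of each metric across repeated runs."""
    return {k: (mean(r[k] for r in runs), stdev(r[k] for r in runs)) for k in runs[0]}


# Toy, made-up labels and scores (not study data).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
metrics = classification_metrics(y_true, scores)
```

In a repeated design such as the one summarized above, `balanced_subsample` would be called once per repetition (for example with `per_group=500`), a model fitted and scored on held-out data, and `summarize` applied to the list of per-run metric dictionaries.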
Top 15 Variables Found in Random Forest Analyses, according to the Gini Index (N = 1000).
| Variable | Gini Index | Diabetes | No Diabetes | p-value |
|---|---|---|---|---|
| Hemoglobin A1c (%) | 57.4 | 5.9 (0.4) | 5.4 (0.4) | < .0001 |
| Fasting plasma glucose (mg/dL) | 39.9 | 97.1 (10.7) | 88.8 (7.8) | < .0001 |
| Waist circumference (cm) | 19.4 | 105.0 (14.1) | 97.3 (15.6) | < .0001 |
| Adiponectin (ng/mL) | 19.0 | 4091.9 (2750.3) | 5566.3 (4032.8) | < .0001 |
| Body mass index (kg/m²) | 17.6 | 33.56 (7.0) | 30.7 (6.9) | < .0001 |
| High sensitivity C-reactive protein (mg/dL) | 15.4 | 0.6 (0.9) | 0.4 (0.7) | < .0001 |
| Triglycerides (mg/dL) | 14.9 | 113.88 (59.0) | 94.8 (54.7) | < .0001 |
| Age (years) | 13.5 | 55.2 (11.1) | 53.0 (12.8) | 0.0001 |
| Leptin (ng/mL) | 13.2 | 32.1 (27.2) | 26.0 (21.9) | < .0001 |
| Body surface area (m²) | 12.6 | 2.1 (0.2) | 2.0 (0.2) | < .0001 |
| eGFR (mL/min/1.73 m²) | 12.0 | 85.8 (17.8) | 87.2 (16.1) | 0.02 |
| 2D calculated left ventricular mass (grams) | 11.6 | 157.1 (89.3) | 141.8 (39.3) | < .0001 |
| Fasting HDL cholesterol level (mg/dL) | 11.5 | 49.3 (12.9) | 52.9 (14.8) | < .0001 |
| Fasting LDL cholesterol level (mg/dL) | 11.2 | 129.2 (37.9) | 127.1 (35.9) | 0.15 |
| Aldosterone (ng/mL) | 11.0 | 6.43 (6.48) | 5.28 (4.05) | < .0001 |
* Values are mean (standard deviation); p-values are from Wilcoxon-Mann-Whitney tests.
a Developed after baseline measurements.
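The Gini-based ranking in this table builds on how much a split on a variable reduces class impurity in a tree node. A random forest's Gini importance is commonly obtained by accumulating such decreases over all splits and trees. The sketch below shows only the single-split quantity for a binary outcome. The function names, cutoffs, and all values are hypothetical and chosen for illustration; this is not the study's data or implementation.

```python
def gini_impurity(labels):
    """Gini impurity of a node with binary labels (1 = case, 0 = non-case)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    # For two classes, 1 - p^2 - (1 - p)^2 simplifies to 2p(1 - p).
    return 2.0 * p * (1.0 - p)


def gini_decrease(values, labels, cutoff):
    """Impurity reduction from splitting a node at `values <= cutoff`."""
    left = [y for x, y in zip(values, labels) if x <= cutoff]
    right = [y for x, y in zip(values, labels) if x > cutoff]
    n = len(labels)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Hypothetical values: an informative marker and an uninformative one.
informative = [5.2, 5.3, 5.4, 5.5, 5.8, 5.9, 6.0, 6.1]
informative_labels = [0, 0, 0, 0, 1, 1, 1, 1]
uninformative = [1, 2, 3, 4, 5, 6, 7, 8]
uninformative_labels = [0, 1, 0, 1, 0, 1, 0, 1]

strong = gini_decrease(informative, informative_labels, cutoff=5.65)
weak = gini_decrease(uninformative, uninformative_labels, cutoff=4.5)
```

A split that separates the classes perfectly removes all of the parent node's impurity, while a split that leaves both children as mixed as the parent removes none, which is why informative variables accumulate larger importance values.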
Studies investigating prediction of diabetes using machine learning methods.
| Reference | Method | Predictors | Sample Size | Type of prediction | Performance |
|---|---|---|---|---|---|
| Yu et al. 2010 | SVM | family history, age, gender, race and ethnicity, weight, height, waist circumference, BMI, hypertension, physical activity, smoking, alcohol use, education, and household income (NHANES cohort). | 4915 | Cross-sectional | AUC = 0.73 |
| Mani et al. 2012 | RF | A1c, Sys BP, Diastolic BP, GLU, BMI, Creatinine, HDL, MDRD, Triglycerides, Race, Gender, Age (EHR data). | 2280 | 1 year ahead | AUC = 0.80 |
| Choi et al. 2014 | SVM, ANN | age, body mass index, hypertension, gender, daily alcohol intake, and waist circumference (KNHANES cohort). | 4685 | Cross-sectional | AUC = 0.74 |
| Anderson et al. 2016 | | age, gender, systolic/diastolic BP, height, weight, BMI, 150 ICD-9 codes, 150 common meds (EHR data). | 9948 | Cross-sectional | AUC = 0.81 |
| Luo 2016 | BRT + RF | The data set includes information on demographics, diagnoses, allergies, immunizations, lab results, medications, smoking status, and vital signs. | 9948 | 1 year ahead | Accuracy = 87.4% |
| Our Study | RF15 | Hemoglobin A1c, fasting glucose, waist circumference, adiponectin, BMI, hs-CRP, triglycerides, age, leptin, body surface area, eGFR, 2D calculated left ventricular mass, HDL cholesterol, LDL cholesterol, aldosterone. | 3633 | 8 years ahead | AUC = 0.82; Accuracy = 75% |
ANN = Artificial Neural Networks; BRT + RF = Combination of Boosting Regression Trees and RF classifiers.