Hang Lai, Huaxiong Huang, Karim Keshavjee, Aziz Guergachi, Xin Gao.
Abstract
BACKGROUND: Diabetes Mellitus is an increasingly prevalent chronic disease characterized by the body's inability to metabolize glucose. The objective of this study was to build an effective predictive model with high sensitivity and selectivity to better identify Canadian patients at risk of having Diabetes Mellitus based on patient demographic data and the laboratory results during their visits to medical facilities.
Keywords: Diabetes mellitus; Gradient boosting machine; Machine learning; Misclassification cost; Predictive models
Year: 2019 PMID: 31615566 PMCID: PMC6794897 DOI: 10.1186/s12902-019-0436-6
Source DB: PubMed Journal: BMC Endocr Disord ISSN: 1472-6823 Impact factor: 2.763
Comparing the median of continuous variables between DM and No DM groups
| Group | BMI | FBS | HDL | TG | LDL | sBP | Age |
|---|---|---|---|---|---|---|---|
| DM | 31.16 | 6.10 | 1.20 | 1.56 | 2.71 | 130 | 64.00 |
| No DM | 28.32 | 5.20 | 1.40 | 1.24 | 2.74 | 130 | 66.00 |
Predictors associated with the logistic regression model
| Variables | Estimated coefficient | Odds ratio | 95% CI for odds ratio | p-value |
|---|---|---|---|---|
| Intercept | −11.816 | | | < 0.0001 |
| Age | | | | |
| Middle-Aged (40–64) | (Reference) | 1.000 | | |
| Elderly (85–90) | −0.829 | 0.436 | (0.31, 0.61) | < 0.0001 |
| Senior (65–84) | −0.127 | 0.881 | (0.78, 0.99) | 0.036 |
| Young (< 40) | 0.238 | 1.269 | (0.90, 1.79) | 0.170 |
| Male | −0.250 | 0.779 | (0.69, 0.88) | < 0.0001 |
| FBS | 1.963 | 7.122 | (6.45, 7.87) | < 0.0001 |
| BMI | 0.023 | 1.024 | (1.01, 1.03) | < 0.0001 |
| HDL | −0.894 | 0.409 | (0.34, 0.49) | < 0.0001 |
| TG | 0.158 | 1.171 | (1.09, 1.26) | < 0.0001 |
| sBP | −0.001 | 0.999 | (0.96, 1.00) | 0.560 |
| LDL | −0.011 | 0.990 | (0.93, 1.05) | 0.740 |
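The odds ratios in the logistic regression table are the exponentials of the estimated coefficients, since a logistic regression coefficient is a change in log-odds. The following is a minimal sketch of that conversion; the function name `odds_ratio` and the choice of rows are illustrative and not taken from the study's code.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Convert a logistic-regression coefficient (a log-odds change) to an odds ratio."""
    return math.exp(coefficient)


# Coefficients copied from the logistic regression table; the selection is illustrative.
coefficients = {"FBS": 1.963, "HDL": -0.894, "TG": 0.158, "Male": -0.250}

for name, beta in coefficients.items():
    # Results should agree with the table's odds ratios up to rounding of the coefficients.
    print(f"{name}: odds ratio = {odds_ratio(beta):.3f}")
```

Read this way, a one-unit increase in FBS multiplies the odds of DM by roughly 7, while a one-unit increase in HDL roughly halves them, holding the other predictors fixed.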
Fig. 1 Information gain measure from predictors
Comparing the AROC values with other machine-learning techniques
| Model | Area under the ROC curve, AROC |
|---|---|
| GBM | 84.7% |
| LOGISTIC REGRESSION | 84.0% |
| RANDOM FOREST | 83.4% |
| RPART | 78.2% |
Fig. 2 Receiver operating curves for the Rpart, random forest, logistic regression, and GBM models
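For readers unfamiliar with the metric, the AROC can be read as the probability that a randomly chosen DM patient receives a higher predicted score than a randomly chosen non-DM patient. The sketch below computes it by direct pairwise comparison. It is an illustration only: the function name and the toy data are hypothetical, and this is not the procedure or software used in the study.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not drawn from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly: 0.75
```

The pairwise loop is quadratic in the number of patients, so it suits small illustrations; rank-based formulas give the same value more efficiently on large data sets.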
Mean of AROC for the four models from the cross-validation results
| Model | Mean AROC |
|---|---|
| GBM | 83.9% |
| Logistic Regression | 83.5% |
| Random Forest | 83.0% |
| Rpart | 77.1% |
Fig. 3 Box plot: comparing the AROC of the four models in the cross-validation results
AROC, standard deviation, and 95% confidence interval of AROC for the four models using the DeLong method
| Model | AROC | Standard deviation | 95% CI |
|---|---|---|---|
| GBM | 84.5% | 0.97% | (82.6, 86.4) |
| Logistic Regression | 84.1% | 1.01% | (82.1, 86.1) |
| Random Forest | 83.2% | 1.05% | (81.1, 85.2) |
| Rpart | 78.1% | 1.10% | (76.0, 80.3) |
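The reported 95% confidence intervals appear consistent with a normal approximation, AROC ± 1.96 × standard deviation. The record does not state the exact computation, so the sketch below is an assumption, and the last digit can differ from the table because the inputs are already rounded. The function name `normal_ci` is illustrative.

```python
def normal_ci(estimate: float, sd: float, z: float = 1.96):
    """Two-sided interval estimate ± z * sd (normal approximation; an assumption here)."""
    return (estimate - z * sd, estimate + z * sd)


# AROC and standard deviation (both in percent) copied from the table above.
rows = {
    "GBM": (84.5, 0.97),
    "Logistic Regression": (84.1, 1.01),
    "Random Forest": (83.2, 1.05),
    "Rpart": (78.1, 1.10),
}

for model, (aroc, sd) in rows.items():
    low, high = normal_ci(aroc, sd)
    print(f"{model}: ({low:.1f}, {high:.1f})")
```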
Paired one-sided DeLong test to compare the AROC values of the four models
| Test name | z-statistic | p-value |
|---|---|---|
| GBM vs. Logistic Regression | 1.392 | 0.081 |
| GBM vs. Random Forest | 3.885 | 5.13e-05 |
| GBM vs. Rpart | 8.914 | 2.20e-16 |
| Logistic Regression vs. Random Forest | 2.038 | 0.021 |
| Logistic Regression vs. Rpart | 8.006 | 5.95e-16 |
| Random Forest vs. Rpart | 7.028 | 1.05e-12 |
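The values in the last column of the table above behave like one-sided p-values, that is, the upper-tail probability of a standard normal at the reported z-statistic. The sketch below shows that conversion under this assumption; the function name is illustrative. Results for the moderate z-statistics should match the table to about two significant figures, while the extreme tails depend on how the original software reported very small values.

```python
import math


def one_sided_p(z: float) -> float:
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# z-statistics copied from the table above.
tests = {
    "GBM vs. Logistic Regression": 1.392,
    "GBM vs. Random Forest": 3.885,
    "Logistic Regression vs. Random Forest": 2.038,
}

for name, z in tests.items():
    print(f"{name}: p = {one_sided_p(z):.3g}")
```

At a 5% significance level, this reading implies that GBM's advantage over logistic regression (p ≈ 0.08) is not statistically significant, whereas its advantage over random forest and Rpart is.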
Comparing the AROC values of the four models using PIMA Indian data set
| Model | Mean AROC |
|---|---|
| GBM | 85.1% |
| Logistic Regression | 84.6% |
| Random Forest | 85.5% |
| Rpart | 80.5% |
Fig. 4 Box plot of AROC values for the Rpart, random forest, logistic regression, and GBM models applied to PIMA Indian data set