Qing Liu, Miao Zhang, Yifeng He, Lei Zhang, Jingui Zou, Yaqiong Yan, Yan Guo.
Abstract
Early identification of individuals at high risk of diabetes is crucial for implementing early intervention strategies. However, algorithms specific to elderly Chinese adults are lacking. The aim of this study was to build effective machine learning (ML) prediction models for the risk of type 2 diabetes mellitus (T2DM) in elderly Chinese adults. A retrospective cohort study was conducted using the health screening data of adults older than 65 years in Wuhan, China, from 2018 to 2020. After strict data filtering, 127,031 records from eligible participants were used. Overall, 8298 participants were diagnosed with incident T2DM during the 2-year follow-up (2019–2020). The dataset was randomly split into a training set (n = 101,625) and a test set (n = 25,406). We developed prediction models based on four ML algorithms: logistic regression (LR), decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost). Using LASSO regression, 21 prediction features were selected. Random under-sampling (RUS) was applied to address the class imbalance, and SHapley Additive exPlanations (SHAP) were used to calculate and visualize feature importance. Model performance was evaluated by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy. The XGBoost model achieved the best performance (AUC = 0.7805, sensitivity = 0.6452, specificity = 0.7577, accuracy = 0.7503). Fasting plasma glucose (FPG), education, exercise, gender, and waist circumference (WC) were the five most important predictors. This study showed that the XGBoost model can be applied to screen individuals at high risk of T2DM in the early phase and has strong potential for intelligent prevention and control of diabetes. The key features could also be useful for developing targeted diabetes prevention interventions.
Keywords: Chinese elderly; machine learning; prediction model; type 2 diabetes mellitus (T2DM)
Year: 2022 PMID: 35743691 PMCID: PMC9224915 DOI: 10.3390/jpm12060905
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Figure 1. The study flow chart. LR, logistic regression; DT, decision tree; RF, random forest; XGBoost, extreme gradient boosting; LASSO, least absolute shrinkage and selection operator.
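The class-imbalance step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how RUS was implemented, so the function below (`random_under_sample`, a hypothetical name) is only one plain-Python way to do it, run on synthetic labels rather than study data.

```python
import random


def random_under_sample(labels, seed=0):
    """Return sorted indices of a class-balanced subset of 0/1 labels.

    Every minority-class record is kept, together with an equally
    sized random sample (without replacement) of the majority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    return sorted(kept)


# Synthetic labels with roughly the cohort's 6.5% incidence (not study data).
labels = [1] * 65 + [0] * 935
idx = random_under_sample(labels)
balanced = [labels[i] for i in idx]
print(len(balanced), sum(balanced))
```

Under-sampling is usually applied to the training data only, so that the test set keeps the original class ratio; whether the authors did so is not stated in the abstract.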
Baseline characteristics of the participants.
| Characteristics | Total | Incident T2DM: Yes | Incident T2DM: No | p-Value |
|---|---|---|---|---|
| Age, mean (SD), years | 71.94 (5.10) | 72.39 (5.31) | 71.91 (5.08) | <0.001 |
| Gender, n (%) | | | | <0.001 |
| Men | 56,774 (44.69) | 4114 (7.25) | 52,660 (92.75) | |
| Women | 70,257 (55.31) | 4184 (5.96) | 66,073 (94.04) | |
| Education, n (%) | | | | <0.001 |
| Elementary school and below | 75,828 (59.69) | 5597 (7.38) | 70,231 (92.62) | |
| Junior high school | 28,298 (22.28) | 1522 (5.38) | 26,776 (94.62) | |
| Technical secondary school or high school | 13,742 (10.82) | 695 (5.06) | 13,047 (94.94) | |
| Junior college and above | 9163 (7.21) | 484 (5.28) | 8679 (94.72) | |
| Marital status, n (%) | | | | <0.001 |
| Married | 98,131 (77.25) | 6046 (6.16) | 92,085 (93.84) | |
| Divorced | 656 (0.52) | 48 (7.32) | 608 (92.68) | |
| Widowed | 27,350 (21.53) | 2082 (7.61) | 25,268 (92.39) | |
| Single | 894 (0.70) | 122 (13.65) | 772 (86.35) | |
| Hypertension, n (%) | | | | <0.001 |
| Yes | 56,847 (44.75) | 4347 (7.65) | 52,500 (92.35) | |
| No | 70,184 (55.25) | 3951 (5.63) | 66,233 (94.37) | |
| Myocardial infarction, n (%) | | | | 0.621 |
| Yes | 686 (0.54) | 48 (7.00) | 638 (93.00) | |
| No | 126,345 (99.46) | 8250 (6.53) | 118,095 (93.47) | |
| Coronary heart disease, n (%) | | | | 0.413 |
| Yes | 7471 (5.88) | 505 (6.76) | 6966 (93.24) | |
| No | 119,560 (94.12) | 7793 (6.52) | 111,767 (93.48) | |
| Angina pectoris, n (%) | | | | 0.711 |
| Yes | 506 (0.40) | 31 (6.13) | 475 (93.87) | |
| No | 126,525 (99.60) | 8267 (6.53) | 118,258 (93.47) | |
| Fatty liver, n (%) | | | | 0.020 |
| Yes | 2279 (1.79) | 176 (7.72) | 2103 (92.28) | |
| No | 124,752 (98.21) | 8122 (6.51) | 116,630 (93.49) | |
| Exercise, n (%) | | | | <0.001 |
| Yes | 74,741 (58.84) | 4323 (5.78) | 70,418 (94.22) | |
| No | 52,290 (41.16) | 3975 (7.60) | 48,315 (92.40) | |
| Current smoking, n (%) | | | | <0.001 |
| Yes | 20,498 (16.14) | 1515 (7.39) | 18,983 (92.61) | |
| No | 106,533 (83.86) | 6783 (6.37) | 99,750 (93.63) | |
| Current drinking, n (%) | | | | 0.908 |
| Yes | 21,429 (16.87) | 1396 (6.51) | 20,033 (93.49) | |
| No | 105,602 (83.13) | 6902 (6.54) | 98,700 (93.46) | |
| BMI, mean (SD), kg/m² | 23.70 (3.26) | 24.47 (3.51) | 23.65 (3.24) | <0.001 |
| WC, mean (SD), cm | 84.12 (9.16) | 86.30 (9.62) | 83.97 (9.10) | <0.001 |
| SBP, mean (SD), mm Hg | 137.12 (20.00) | 140.63 (20.38) | 136.87 (19.95) | <0.001 |
| DBP, mean (SD), mm Hg | 80.09 (11.20) | 81.63 (11.42) | 79.99 (11.18) | <0.001 |
| FPG, mean (SD), mmol/L | 5.12 (0.69) | 5.71 (0.79) | 5.08 (0.66) | <0.001 |
| TC, median (IQR), mmol/L | 4.81 (4.20–5.45) | 4.84 (4.20–5.49) | 4.81 (4.20–5.44) | 0.034 |
| TG, median (IQR), mmol/L | 1.17 (0.85–1.63) | 1.28 (0.90–1.79) | 1.16 (0.85–1.62) | <0.001 |
| HDL-C, median (IQR), mmol/L | 1.36 (1.15–1.62) | 1.32 (1.11–1.58) | 1.37 (1.15–1.62) | <0.001 |
| LDL-C, median (IQR), mmol/L | 2.60 (2.08–3.17) | 2.64 (2.11–3.24) | 2.60 (2.07–3.16) | <0.001 |
| ALT, median (IQR), U/L | 16.00 (12.00–21.00) | 17.00 (13.00–23.00) | 16.00 (12.00–20.90) | <0.001 |
| AST, median (IQR), U/L | 21.50 (18.00–26.00) | 22.00 (18.00–26.00) | 21.50 (18.00–26.00) | 0.004 |
| TBIL, median (IQR), µmol/L | 11.90 (9.17–15.30) | 12.40 (9.50–15.90) | 11.90 (9.10–15.30) | <0.001 |
| Scr, mean (SD), µmol/L | 76.82 (19.93) | 79.21 (20.94) | 76.66 (19.85) | <0.001 |
| BUN, median (IQR), mmol/L | 5.71 (4.76–6.82) | 5.67 (4.70–6.80) | 5.71 (4.77–6.83) | 0.037 |
| SUA, mean (SD), µmol/L | 323.80 (91.90) | 333.01 (94.31) | 323.15 (91.70) | <0.001 |
SD, standard deviation; IQR, interquartile range (Q1–Q3); T2DM, type 2 diabetes mellitus; BMI, body mass index; WC, waist circumference; SBP, systolic blood pressure; DBP, diastolic blood pressure; FPG, fasting plasma glucose; TC, total cholesterol; TG, triglycerides; HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; ALT, alanine aminotransferase; AST, aspartate transaminase; TBIL, total bilirubin; Scr, serum creatinine; BUN, blood urea nitrogen; SUA, serum uric acid.
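The group comparisons for the categorical rows can be checked by hand. The record does not state which test produced the p-values; the sketch below assumes a Pearson chi-square test without continuity correction and applies it to the gender and current-drinking counts from the baseline table. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Counts per group: (incident T2DM yes, incident T2DM no).
gender_stat, gender_p = chi_square_2x2(4114, 52660, 4184, 66073)
drink_stat, drink_p = chi_square_2x2(1396, 20033, 6902, 98700)
print(f"gender:   chi2 = {gender_stat:.1f}, p = {gender_p:.3g}")
print(f"drinking: chi2 = {drink_stat:.3f}, p = {drink_p:.3f}")
```

Under this assumption the gender comparison falls well below 0.001 and the drinking comparison is close to the tabulated 0.908.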
Least Absolute Shrinkage and Selection Operator (LASSO) regression coefficients.
| Predictors | Coefficient |
|---|---|
| Age | 0.012 |
| Gender | −0.026 |
| Education | −0.027 |
| Marital status | 0.023 |
| Hypertension | 0.010 |
| Exercise | −0.035 |
| Current smoking | 0.017 |
| Current drinking | −0.010 |
| WC | 0.033 |
| SBP | 0.014 |
| FPG | 0.219 |
| TC | −0.022 |
| TG | 0.020 |
| HDL-C | 0.006 |
| LDL-C | 0.009 |
| ALT | 0.037 |
| AST | −0.026 |
| TBIL | 0.006 |
| Scr | 0.004 |
| BUN | −0.017 |
| SUA | −0.002 |
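As a reading aid, the 21 coefficients above can be ranked by absolute size. This is only a sketch: magnitudes are comparable across predictors only if the inputs were standardized before fitting, which this record does not state, and the variable names are illustrative.

```python
# LASSO coefficients copied from the table above.
lasso_coef = {
    "Age": 0.012, "Gender": -0.026, "Education": -0.027,
    "Marital status": 0.023, "Hypertension": 0.010, "Exercise": -0.035,
    "Current smoking": 0.017, "Current drinking": -0.010, "WC": 0.033,
    "SBP": 0.014, "FPG": 0.219, "TC": -0.022, "TG": 0.020,
    "HDL-C": 0.006, "LDL-C": 0.009, "ALT": 0.037, "AST": -0.026,
    "TBIL": 0.006, "Scr": 0.004, "BUN": -0.017, "SUA": -0.002,
}

# Sort by absolute value, largest first; the sign gives the direction.
ranked = sorted(lasso_coef.items(), key=lambda kv: abs(kv[1]), reverse=True)
top5 = [name for name, _ in ranked[:5]]

for name, coef in ranked[:5]:
    direction = "higher risk" if coef > 0 else "lower risk"
    print(f"{name:10s} {coef:+.3f}  ({direction})")
```

This ordering need not match the SHAP ranking reported for the XGBoost model, since the two methods measure different things.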
Comparison of performance of the four machine learning models.
| Model | AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| LR | 0.7601 | 0.6320 | 0.7636 | 0.7550 |
| DT | 0.7280 | 0.5821 | 0.7633 | 0.7514 |
| RF | 0.7772 | 0.6428 | 0.7524 | 0.7453 |
| XGBoost | 0.7805 | 0.6452 | 0.7577 | 0.7503 |
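The three threshold-based metrics in this table are linked by an identity: accuracy = p × sensitivity + (1 − p) × specificity, where p is the share of positives in the evaluation data. The sketch below solves for p in each row as a consistency check; the helper name `implied_prevalence` is hypothetical, and the confusion-matrix counts themselves are not available in this record.

```python
def implied_prevalence(sensitivity, specificity, accuracy):
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    return (specificity - accuracy) / (specificity - sensitivity)


# (sensitivity, specificity, accuracy) copied from the table above.
metrics = {
    "LR": (0.6320, 0.7636, 0.7550),
    "DT": (0.5821, 0.7633, 0.7514),
    "RF": (0.6428, 0.7524, 0.7453),
    "XGBoost": (0.6452, 0.7577, 0.7503),
}

prevalence = {model: implied_prevalence(*vals) for model, vals in metrics.items()}
for model, p in prevalence.items():
    print(f"{model:8s} implied prevalence = {p:.3f}")
```

All four rows imply a positive share of about 6.5%, in line with the cohort's overall incidence (8298 of 127,031), which suggests the metrics were computed on data that kept the original class ratio.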
Figure 2. The receiver operating characteristic (ROC) curves of the four machine learning models on the training set (A) and test set (B).
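The AUC summarizing each ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal plain-Python sketch of that definition follows, run on toy scores rather than study data; `roc_auc` is a hypothetical helper, not the routine the authors used.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

The pairwise loop is quadratic in sample size, so a rank-based routine is preferable for a test set of 25,406 records.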
Figure 3. The confusion matrix of the four machine learning models.
Figure 4. The interpretations for the XGBoost model. (A): The feature importance ranking by the SHAP value; (B): SHAP summary plot of the XGBoost model. Each dot represents a sample, with blue indicating a low feature value and red indicating a high feature value. The higher the SHAP value of a feature, the higher the risk of incident T2DM. Smoking was defined as current smoking; drinking was defined as current drinking.
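The idea behind the SHAP values in Figure 4 can be shown with an exact Shapley computation on a toy model. This is a conceptual sketch only: it does not use the SHAP library, and `toy_risk`, its weights, and the baseline values are hypothetical stand-ins for the study's XGBoost model.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample.

    Averages each feature's marginal contribution over all feature
    orderings; features not yet added keep their baseline value.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        prev = model(current)
        for name in order:
            current[name] = x[name]
            new = model(current)
            totals[name] += new - prev
            prev = new
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(f):
    # Hypothetical risk score with one interaction term (illustration only).
    return (0.5 * f["FPG"] + 0.02 * f["WC"] - 0.3 * f["exercise"]
            + 0.01 * f["FPG"] * f["WC"])


baseline = {"FPG": 5.1, "WC": 84.0, "exercise": 1}  # illustrative reference
sample = {"FPG": 5.7, "WC": 86.0, "exercise": 0}    # illustrative individual
phi = shapley_values(toy_risk, sample, baseline)
for name, value in phi.items():
    print(f"{name:8s} {value:+.4f}")
```

The contributions sum to the difference between the sample's score and the baseline score, which is the additivity property that lets SHAP values be read as per-feature shifts in predicted risk. A ranking like panel (A) is typically obtained by averaging the absolute values over all samples.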