Yunxing Jiang1, Xianghui Zhang1, Rulin Ma1, Xinping Wang1, Jiaming Liu1, Mulatibieke Keerman1, Yizhong Yan1, Jiaolong Ma1, Yanpeng Song1,2, Jingyu Zhang1, Jia He1, Shuxia Guo1,3, Heng Guo1.
Abstract
BACKGROUND: Cardiovascular disease (CVD) is the leading cause of mortality worldwide. Accurately identifying subjects at high risk of CVD may improve CVD outcomes. We sought to systematically examine the feasibility and performance of 7 widely used machine learning (ML) algorithms in predicting CVD risk.
Keywords: Kazakh population; cardiovascular disease; machine learning; prediction model
Year: 2021 PMID: 34135637 PMCID: PMC8200454 DOI: 10.2147/CLEP.S313343
Source DB: PubMed Journal: Clin Epidemiol ISSN: 1179-1349 Impact factor: 4.790
Baseline Characteristics of the Chinese Kazakh Study Subjects
| Characteristics | Training Set: Non-CVD (n=1038) | Training Set: CVD (n=168) | P Value | Test Set: Non-CVD (n=267) | Test Set: CVD (n=35) | P Value |
|---|---|---|---|---|---|---|
| Age (years), Mean (SD) | 38.01 (12.36) | 52.27 (11.97) | <0.001 | 37.27 (12.05) | 51.43 (11.81) | <0.001 |
| SBP(mmHg), Mean (SD) | 126.12 (20.34) | 147.63 (28.78) | <0.001 | 126.09 (19.14) | 146.26 (32.81) | 0.001 |
| FPG(mmol/L), Mean (SD) | 4.63 (1.01) | 5.13 (1.53) | <0.001 | 5.46 (13.14) | 5.17 (1.41) | 0.898 |
| TG (mmol/L), Mean (SD) | 1.17 (0.92) | 1.22 (0.61) | 0.364 | 1.26 (0.93) | 1.24 (0.77) | 0.875 |
| TC(mmol/L), Mean (SD) | 4.26 (1.02) | 4.67 (0.98) | <0.001 | 4.26 (1.13) | 4.45 (1.18) | 0.347 |
| HDL(mmol/L), Mean (SD) | 1.35 (0.38) | 1.42 (0.32) | 0.035 | 1.33 (0.39) | 1.35 (0.38) | 0.753 |
| Waistline(cm), Mean (SD) | 83.21 (11.21) | 87.77 (12.46) | <0.001 | 84.31 (0.96) | 90.69 (13.84) | 0.002 |
| BMI, Mean (SD) | 23.43 (3.73) | 25.29 (4.70) | <0.001 | 23.69 (3.69) | 27.13 (5.33) | 0.001 |
| BAI, Mean (SD) | 28.19 (4.51) | 30.32 (4.89) | <0.001 | 28.06 (4.39) | 31.86 (5.67) | <0.001 |
| LHR, Mean (SD) | 1.82 (3.11) | 1.83 (0.66) | 0.977 | 1.78 (0.72) | 1.77 (0.61) | 0.913 |
| INS (ng/mL), Median (P25,P75)# | 9.61 (5.26, 21.25) | 13.37 (7.48, 23.96) | 0.001 | 9.66 (5.24, 23.19) | 15.76 (6.05, 31.42) | 0.108 |
| IL-6 (ng/mL), Median (P25, P75)# | 30.41 (15.40, 88.70) | 51.12 (23.08, 157.96) | <0.001 | 30.55 (15.20, 97.22) | 45.18 (17.24, 109.46) | 0.176 |
| NEFA (mmol/L), Median (P25, P75)# | 0.48 (0.33, 0.75) | 0.59 (0.35, 1.00) | 0.002 | 0.50 (0.32, 0.82) | 0.70 (0.45, 1.20) | 0.003 |
| hs-CRP (pg/mL), Median (P25,P75)# | 226.05 (22.32, 1133.81) | 756.26 (195.37, 1983.12) | <0.001 | 394.60 (30.88, 1253.46) | 513.68 (193.57, 1121.05) | 0.201 |
| ADP(ng/mL), Median (P25,P75)# | 33.41 (11.81, 174.23) | 16.96 (8.37, 40.39) | <0.001 | 26.34 (10.78, 118.49) | 16.68 (6.55, 29.37) | 0.004 |
| Sex (male), n (%) | 468 (45.1) | 64 (38.1) | 0.090 | 117 (43.8) | 13 (37.1) | 0.453 |
| Dyslipidemia, n (%) | 259 (25.0) | 50 (29.8) | 0.185 | 71 (26.6) | 15 (42.9) | 0.045 |
| Family history of hypertension, n (%) | 281 (27.1) | 59 (35.1) | 0.031 | 76 (28.5) | 19 (54.3) | 0.002 |
| Family history of diabetes, n (%) | 12 (1.2) | 2 (1.2) | 0.969 | 4 (1.5) | 1 (2.9) | 0.554 |
| Current smoker, n (%) | 281 (27.1) | 74 (44.0) | 0.02 | 86 (32.2) | 16 (45.7) | 0.112 |
| Alcohol drinking, n (%) | 94 (9.1) | 21 (12.5) | 0.159 | 30 (11.2) | 4 (11.4) | 0.973 |
| MetS, n (%) | 233 (22.4) | 55 (32.7) | 0.004 | 71 (26.6) | 16 (45.7) | 0.019 |
| Follow-up period (years), Median | 5.17 | |||||
Note: #Mann–Whitney test.
Abbreviations: SBP, systolic blood pressure; FPG, fasting plasma glucose; TG, triglycerides; TC, total cholesterol; HDL, high-density lipoprotein; BMI, body mass index; BAI, body adiposity index; LHR, LDL/HDL ratio; INS, insulin; IL-6, interleukin 6; NEFA, nonesterified fatty acid; hs-CRP, high-sensitivity C-reactive protein; ADP, adiponectin; MetS, metabolic syndrome.
Figure 1. Feature importance of included variables obtained from a tuned random forest model.
Predictive Performance Metrics and Diagnostic Test Metrics of 7 ML-Based Models
| ML Risk Equations | AUC (95% CI) | Threshold Probability | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | Youden Index | High-Risk Patients (%) | Brier Score (95% CI) | Hosmer-Lemeshow χ² |
|---|---|---|---|---|---|---|---|---|---|---|
| DT | 0.770 (0.719, 0.817) | 0.15 | 60.0 | 82.8 | 31.3 | 94.0 | 0.43 | 22.5 | 0.092 (0.068, 0.115) | 10.94 |
| KNN | 0.845 (0.800, 0.884) | 0.13 | 80.0 | 79.8 | 34.1 | 96.8 | 0.60 | 27.5 | 0.086 (0.064, 0.110) | 10.50 |
| LR | 0.872 (0.829, 0.907) | 0.10 | 97.1 | 65.5 | 27.0 | 99.4 | 0.63 | 42.1 | 0.078 (0.061, 0.099) | 12.24 |
| NB | 0.791 (0.740, 0.835) | 0.07 | 68.6 | 79.4 | 30.4 | 95.1 | 0.48 | 26.5 | 0.090 (0.066, 0.117) | 14.17 |
| RF | 0.840 (0.794, 0.880) | 0.06 | 91.4 | 64.4 | 25.2 | 98.3 | 0.56 | 41.7 | 0.089 (0.065, 0.114) | 9.46 |
| SVM | 0.868 (0.825, 0.904) | 0.13 | 85.7 | 74.2 | 30.3 | 97.5 | 0.60 | 33.1 | 0.079 (0.059, 0.100) | 8.49 |
| XGB | 0.804 (0.754, 0.847) | 0.06 | 82.9 | 69.3 | 26.1 | 96.9 | 0.52 | 37.1 | 0.090 (0.066, 0.113) | 9.05 |
Abbreviations: ML, machine learning; DT, decision tree; RF, random forest; KNN, k-nearest neighbors; NB, Gaussian naive Bayes; SVM, support vector machine; XGB, extreme gradient boosting; LR, logistic regression with L-1 penalization; AUC, area under the receiver operating characteristic curve.
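The diagnostic columns in the table above are linked by standard identities, which the following minimal sketch illustrates. It assumes the Youden index is sensitivity + specificity − 1 and that PPV and NPV follow from sensitivity, specificity, and the test-set CVD prevalence (35 of 302 subjects, taken from the baseline table). The function names are illustrative and are not the authors' code.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence                  # true-positive fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Assumption: test-set prevalence from the baseline table (35 CVD, 267 non-CVD).
prevalence = 35 / (267 + 35)

# LR row of the table: sensitivity 97.1%, specificity 65.5%.
lr_j = youden_index(0.971, 0.655)
lr_ppv, lr_npv = predictive_values(0.971, 0.655, prevalence)

print(f"Youden index: {lr_j:.2f}")
print(f"PPV: {lr_ppv * 100:.1f}%  NPV: {lr_npv * 100:.1f}%")
```

Under these assumptions the LR row is reproduced to within rounding (Youden index about 0.63, PPV about 27%, NPV about 99.4%). Small discrepancies in other rows may arise because the published percentages are themselves rounded.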
Figure 2. Receiver operating characteristic curves for 7 ML models in predicting CVD outcomes in Chinese Kazakhs.
Figure 3. Distribution of predicted probabilities for subjects who developed CVD versus those who did not.
Figure 4. Calibration plots of 7 ML models in predicting CVD outcomes in Chinese Kazakhs.
Figure 5. Decision curves for predicting CVD outcomes in Chinese Kazakhs using LR and SVM.
Net Benefits for Identifying High-Risk Subjects with LR or SVM Using Their Own Optimal Threshold Probability
| ML Risk Equations (Pt) | Net Benefit: Treat All | Net Benefit: ML Model | Advantage of Model#: Net Benefit | Advantage of Model#: Reduction in Avoidable Statin Use per 1000 Subjects |
|---|---|---|---|---|
| LR (0.10) | 0.018 | 0.077 | 0.059 | 533 |
| SVM (0.13) | −0.016 | 0.064 | 0.080 | 535 |
Note: #The value was calculated as: (net benefit of the model – net benefit of treat all)/(pt/(1− pt)) × 100.
Abbreviations: ML, machine learning; Pt, optimal threshold probability; SVM, support vector machine; LR, logistic regression with L-1 penalization.
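The net-benefit figures can be checked with a short sketch. It assumes the usual decision-curve definitions: net benefit = TP/n − (FP/n) × pt/(1 − pt), and for "treat all", prevalence − (1 − prevalence) × pt/(1 − pt), with the test-set prevalence of 35/302 taken from the baseline table. The note above states a multiplier of 100, whereas the column is reported per 1000 subjects; the sketch uses 1000 as an assumption to match the column. Function names are illustrative.

```python
def net_benefit(tp, fp, n, pt):
    """Net benefit of a model at threshold probability pt."""
    return tp / n - (fp / n) * pt / (1.0 - pt)


def net_benefit_treat_all(prevalence, pt):
    """Net benefit of treating every subject at threshold probability pt."""
    return prevalence - (1.0 - prevalence) * pt / (1.0 - pt)


def reduction_per_1000(nb_model, nb_all, pt):
    """Reduction in avoidable treatment per 1000 subjects (assumed x1000)."""
    return (nb_model - nb_all) / (pt / (1.0 - pt)) * 1000.0


# Assumption: test-set prevalence from the baseline table (35 CVD of 302).
prevalence = 35 / 302

nb_all_lr = net_benefit_treat_all(prevalence, 0.10)
nb_all_svm = net_benefit_treat_all(prevalence, 0.13)

# Model net benefits are taken from the table as published (rounded values).
lr_reduction = reduction_per_1000(0.077, 0.018, 0.10)
svm_reduction = reduction_per_1000(0.064, -0.016, 0.13)

print(f"Treat all: LR pt=0.10 -> {nb_all_lr:.3f}, SVM pt=0.13 -> {nb_all_svm:.3f}")
print(f"Reduction per 1000: LR {lr_reduction:.0f}, SVM {svm_reduction:.0f}")
```

With these inputs the "treat all" values match the table (0.018 and −0.016). The SVM reduction comes out near 535, and the LR reduction near 531 rather than the published 533, a gap that is plausibly due to the table's net benefits being rounded to three decimals.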