Xin Qian, Yu Li, Xianghui Zhang, Heng Guo, Jia He, Xinping Wang, Yizhong Yan, Jiaolong Ma, Rulin Ma, Shuxia Guo.
Abstract
Background: Cardiovascular diseases (CVD) are currently the leading cause of premature death worldwide. Model-based early detection of populations at high risk of CVD is the key to CVD prevention. This research therefore aimed to use machine learning (ML) algorithms to establish a CVD prediction model, based on routine physical examination indicators, that is suitable for the Xinjiang rural population.
Method: Cohort data were collected in two stages. The first stage involved a baseline survey from 2010 to 2012, with follow-up ending in December 2017. The second-stage baseline survey was conducted from September to December 2016, and follow-up ended in August 2021. A total of 12,692 participants (10,407 Uyghur and 2,285 Kazak) were included in the study. Predictors were screened and variable subsets established using least absolute shrinkage and selection operator (Lasso) regression, logistic regression forward partial likelihood estimation (FLR), random forest (RF) feature importance, and RF variable importance. The selected variable subsets were then combined with L1 regularized logistic regression (L1-LR), RF, support vector machine (SVM), and AdaBoost algorithms, and the resulting models were compared to establish a CVD prediction model suitable for this population. The incidence of CVD in this population was then analyzed.
Result: After 4.94 years of follow-up, a total of 1,176 people were diagnosed with CVD (cumulative incidence: 9.27%). In the comparison of discrimination and calibration, the variable subset selected by FLR gave better prediction performance than the other subsets. Combining the results of discrimination, calibration, and clinical validity, the prediction model based on L1-LR had the best prediction performance.
Conclusion: Age, systolic blood pressure, the ratio of low-density lipoprotein cholesterol to high-density lipoprotein cholesterol, triglyceride blood glucose index, body mass index, and body adiposity index were all important predictors of the onset of CVD in the Xinjiang rural population.
Keywords: cardiovascular disease; cohort study; machine learning; predictive models; routine physical examination indicators
Year: 2022 PMID: 35783868 PMCID: PMC9247206 DOI: 10.3389/fcvm.2022.854287
Source DB: PubMed Journal: Front Cardiovasc Med ISSN: 2297-055X
Comparison of the prediction performance of the optimal model of each algorithm.
| Model | AUC | Youden index | Threshold probability | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lasso-AdaBoost | 0.798 (0.782, | 0.472 | 0.11 | 73.09 | 74.10 | 23.5 | 96.2 | 30.4 | 0.078 (0.070, 0.086) | 13.81 | 0.09 |
| FLR-L1-LR | 0.817 (0.801, | 0.524 | 0.11 | 73.49 | 78.86 | 27.4 | 96.5 | 26.7 | 0.076 (0.069, 0.084) | 11.51 | 0.17 |
| FLR-RF | 0.804 (0.788, | 0.506 | 0.08 | 79.52 | 71.09 | 23.0 | 97.0 | 33.1 | 0.077 (0.070, 0.086) | 11.59 | 0.17 |
| FLR-SVM | 0.814 (0.798, | 0.511 | 0.11 | 73.90 | 77.16 | 26.0 | 96.5 | 38.4 | 0.076 (0.069, 0.084) | 16.10 | 0.04 |
AUC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value; Lasso-AdaBoost, AdaBoost with Lasso regression; FLR-L1-LR, L1 regularized Logistic regression with forward Partial Likelihood Estimation; FLR-RF, random forest with forward Partial Likelihood Estimation; FLR-SVM, support vector machine with forward Partial Likelihood Estimation.
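The measures abbreviated above are linked by simple identities. The following is a minimal Python sketch, assuming that the values 73.49 and 78.86 in the FLR-L1-LR row are sensitivity and specificity in percent (an assumption about the column meaning) and borrowing the 9.27% cumulative incidence from the abstract as a stand-in prevalence. The function names are illustrative and do not come from the study.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1 (inputs as fractions)."""
    return sensitivity + specificity - 1.0


def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence.

    Works on expected proportions of a population rather than raw counts.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed inputs: FLR-L1-LR row read as sensitivity/specificity,
# prevalence taken from the abstract's cumulative incidence (9.27%).
sens, spec, prev = 0.7349, 0.7886, 0.0927
j = youden_index(sens, spec)
ppv, npv = predictive_values(sens, spec, prev)
print(f"Youden index: {j:.4f}")
print(f"PPV: {ppv:.3f}, NPV: {npv:.3f}")
```

Under these assumptions the Youden index comes out at about 0.5235, in line with the 0.524 listed in that row. PPV and NPV depend on the prevalence in the evaluation data, which is not stated here, so the sketch is not expected to reproduce the tabulated values exactly.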
Figure 1. Receiver operating characteristic curves of the optimal prediction models in the Xinjiang rural population. FLR-L1-LR, L1 regularized Logistic regression with forward Partial Likelihood Estimation; FLR-RF, random forest with forward Partial Likelihood Estimation; FLR-SVM, support vector machine with forward Partial Likelihood Estimation.
Figure 2. Calibration plots of four ML models in predicting CVD outcomes in the Xinjiang rural population. CVD, cardiovascular disease; ML, machine learning; FLR-L1-LR, L1 regularized Logistic regression with forward Partial Likelihood Estimation; FLR-RF, random forest with forward Partial Likelihood Estimation; FLR-SVM, support vector machine with forward Partial Likelihood Estimation.
Comparison of discrimination performance of optimal prediction models.
| Comparison | AUC difference | P | cNRI | P | IDI | P |
|---|---|---|---|---|---|---|
| Lasso-AdaBoost vs. FLR-L1-LR | 0.019 | 0.002 | 0.208 (0.078, 0.337) | <0.001 | 0.032 (0.019, 0.045) | <0.010 |
| Lasso-AdaBoost vs. FLR-RF | 0.007 | 0.334 | 0.097 (−0.033, 0.228) | 0.143 | 0.016 (0.007, 0.025) | <0.010 |
| Lasso-AdaBoost vs. FLR-SVM | 0.016 | 0.047 | 0.167 (0.037, 0.296) | 0.012 | 0.029 (0.016, 0.042) | <0.010 |
| FLR-RF vs. FLR-L1-LR | 0.012 | 0.045 | 0.108 (−0.022, 0.238) | 0.105 | 0.016 (0.003, 0.028) | 0.010 |
| FLR-RF vs. FLR-SVM | 0.003 | 0.016 | 0.072 (−0.058, 0.203) | 0.278 | 0.013 (0.001, 0.026) | 0.040 |
| FLR-SVM vs. FLR-L1-LR | 0.010 | 0.118 | 0.278 (0.149, 0.408) | <0.001 | 0.003 (0.001, 0.004) | <0.010 |
AUC, area under the receiver operating characteristic curve; cNRI, continuous Net Reclassification Index; IDI, Integrated Discrimination Improvement Index; Lasso-AdaBoost, AdaBoost with Lasso regression; FLR-L1-LR, L1 regularized Logistic regression with forward Partial Likelihood Estimation; FLR-RF, random forest with forward Partial Likelihood Estimation; FLR-SVM, support vector machine with forward Partial Likelihood Estimation.
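For readers unfamiliar with cNRI and IDI, the sketch below shows how the two point estimates are commonly computed from the predicted probabilities of a reference model and a comparison model. It uses the standard textbook definitions, not code from the study. The toy data and function names are hypothetical, and the confidence intervals and P values reported in the table are not reproduced.

```python
def continuous_nri(y_true, p_old, p_new):
    """Category-free NRI: net upward movement among events plus
    net downward movement among non-events."""
    up_e = down_e = up_n = down_n = 0
    n_events = n_nonevents = 0
    for y, old, new in zip(y_true, p_old, p_new):
        if y == 1:
            n_events += 1
            up_e += new > old
            down_e += new < old
        else:
            n_nonevents += 1
            up_n += new > old
            down_n += new < old
    nri_events = (up_e - down_e) / n_events
    nri_nonevents = (down_n - up_n) / n_nonevents
    return nri_events + nri_nonevents


def idi(y_true, p_old, p_new):
    """Integrated discrimination improvement: change in mean predicted
    risk among events minus the change among non-events."""
    ev = [new - old for y, old, new in zip(y_true, p_old, p_new) if y == 1]
    ne = [new - old for y, old, new in zip(y_true, p_old, p_new) if y == 0]
    return sum(ev) / len(ev) - sum(ne) / len(ne)


# Hypothetical toy data: two events followed by two non-events.
y = [1, 1, 0, 0]
old = [0.2, 0.4, 0.3, 0.1]
new = [0.5, 0.3, 0.2, 0.1]
print(continuous_nri(y, old, new))  # 0.5
print(round(idi(y, old, new), 3))   # 0.15
```

In the toy data, one event moves up and one moves down (net 0), while one of the two non-events moves down (net 0.5), giving a cNRI of 0.5. Mean predicted risk rises by 0.1 among events and falls by 0.05 among non-events, giving an IDI of 0.15.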
Figure 3. Decision curves for predicting CVD outcomes in the Xinjiang rural population using four ML models. CVD, cardiovascular disease; ML, machine learning; FLR-L1-LR, L1 regularized Logistic regression with forward Partial Likelihood Estimation; FLR-RF, random forest with forward Partial Likelihood Estimation; FLR-SVM, support vector machine with forward Partial Likelihood Estimation.
Comparison of clinical effectiveness of models.
| Model | Pt (%) | Net benefit (treat all) | Net benefit (model) | Difference in net benefit | Difference / [Pt/(1 − Pt)] × 100 |
|---|---|---|---|---|---|
| FLR-L1-LR | 5 | 0.051 | 0.066 | 0.015 | 29 |
| | 10 | −0.002 | 0.049 | 0.051 | 46 |
| | 11 | −0.013 | 0.048 | 0.061 | 49 |
| FLR-SVM | 5 | 0.051 | 0.065 | 0.014 | 27 |
| | 10 | −0.002 | 0.048 | 0.050 | 45 |
| | 11 | −0.013 | 0.045 | 0.058 | 47 |
| Lasso-AdaBoost | 5 | 0.051 | 0.063 | 0.012 | 23 |
| | 10 | −0.002 | 0.045 | 0.047 | 43 |
| | 11 | −0.013 | 0.043 | 0.056 | 46 |
| FLR-RF | 5 | 0.051 | 0.064 | 0.013 | 25 |
| | 10 | −0.002 | 0.046 | 0.048 | 43 |
| | 8 | 0.02 | 0.053 | 0.033 | 38 |
The value was calculated as: (net benefit of the model − net benefit of treat all)/[Pt/(1 − Pt)] × 100.
The optimal threshold probability of each model was selected according to AUC.
Pt, Threshold probability; Lasso-AdaBoost, AdaBoost with Lasso regression; FLR-L1-LR, L1 regularized Logistic regression with forward Partial Likelihood Estimation; FLR-RF, random forest with forward Partial Likelihood Estimation; FLR-SVM, support vector machine with forward Partial Likelihood Estimation.
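The formula in the footnote above can be checked directly against the tabulated net benefits. The sketch below is a minimal Python illustration. `scaled_advantage` implements the footnote formula, while `net_benefit_model` and `net_benefit_treat_all` use the standard decision curve analysis definitions, which this record does not spell out and which are included here as an assumption. All function names are hypothetical.

```python
def threshold_odds(pt):
    """Odds at threshold probability pt (pt as a fraction, e.g. 0.05)."""
    return pt / (1.0 - pt)


def net_benefit_model(true_pos, false_pos, n, pt):
    """Standard decision-curve net benefit (assumed definition):
    TP/n - FP/n * pt/(1 - pt)."""
    return true_pos / n - false_pos / n * threshold_odds(pt)


def net_benefit_treat_all(prevalence, pt):
    """Net benefit of treating everyone (assumed standard definition)."""
    return prevalence - (1.0 - prevalence) * threshold_odds(pt)


def scaled_advantage(nb_model, nb_treat_all, pt):
    """Footnote formula: (NB model - NB treat all) / [pt/(1 - pt)] * 100."""
    return (nb_model - nb_treat_all) / threshold_odds(pt) * 100.0


# FLR-L1-LR rows of the table: (Pt, net benefit treat all, net benefit model).
rows = [(0.05, 0.051, 0.066), (0.10, -0.002, 0.049), (0.11, -0.013, 0.048)]
for pt, nb_all, nb_model in rows:
    print(f"Pt={pt:.2f}: {scaled_advantage(nb_model, nb_all, pt):.1f}")
```

With the rounded net benefits from the table, the three FLR-L1-LR rows evaluate to approximately 28.5, 45.9, and 49.4, which agree with the tabulated 29, 46, and 49 up to rounding of the inputs.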
Figure 4. Feature importance of included variables obtained from the random forest with forward Partial Likelihood Estimation (FLR-RF), L1 regularized Logistic regression with FLR (FLR-L1-LR), and Lasso-AdaBoost models. SD, pulse pressure difference; DBP, diastolic blood pressure; BAI, body adiposity index; BMI, body mass index; TyG, triglyceride blood glucose index; LpH, low-high-density lipoprotein ratio; AI, arteriosclerosis index; aUA, uric acid; TB, total bilirubin; APOB, apolipoprotein B; HDL-C, high-density lipoprotein cholesterol; TP, total protein; HBDH, α-hydroxybutyrate dehydrogenase; LDH, lactate dehydrogenase; SBP, systolic blood pressure; LCI, blood lipid index; AIP, plasma arteriosclerosis index; TC, total cholesterol; ALP, alkaline phosphatase; aFBG, fasting blood glucose; AST, aspartate aminotransferase; WHR, waist-to-height ratio; APOAB, apolipoprotein AB; GGT, γ-glutamyl transferase; DB, direct bilirubin; DM, diabetes mellitus; Fhchd, family history of coronary heart disease.