Joung Ouk Ryan Kim1, Yong-Suk Jeong2, Jin Ho Kim1, Jong-Weon Lee3, Dougho Park4, Hyoung-Seop Kim3.
Abstract
BACKGROUND: This study proposes a cardiovascular disease (CVD) prediction model using machine learning (ML) algorithms based on the National Health Insurance Service-Health Screening datasets.
Keywords: algorithm; artificial intelligence; cardiovascular disease; machine learning; risk factors
Year: 2021 PMID: 34070504 PMCID: PMC8229422 DOI: 10.3390/diagnostics11060943
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Figure 1. Study process and architecture of the CVD prediction model.
Baseline characteristics and major contributing factors in the non-CVD and CVD groups.
| | Non-CVD Group | CVD Group | p-value |
|---|---|---|---|
| Sex, n (%) | | | 1.00 a |
| Male | 2789 (59.4%) | 2789 (59.4%) | |
| Female | 1910 (40.6%) | 1910 (40.6%) | |
| Age group, n (%) | | | 1.00 b |
| 45–49 | 248 (5.3%) | 248 (5.3%) | |
| 50–54 | 556 (11.8%) | 556 (11.8%) | |
| 55–59 | 658 (14.0%) | 658 (14.0%) | |
| 60–64 | 893 (19.0%) | 893 (19.0%) | |
| 65–69 | 730 (15.5%) | 730 (15.5%) | |
| 70–74 | 957 (20.4%) | 957 (20.4%) | |
| 75–79 | 411 (8.7%) | 411 (8.7%) | |
| 80–84 | 210 (4.5%) | 210 (4.5%) | |
| ≥85 | 36 (0.8%) | 36 (0.8%) | |
| Height (cm) | 161.1 ± 9.0 | 161.1 ± 8.8 | 0.82 c |
| Weight (kg) | 62.4 ± 10.4 | 64.1 ± 10.8 | <0.001 c |
| Waist (cm) | 82.8 ± 8.5 | 84.5 ± 8.5 | <0.001 c |
| Total Cholesterol | 196.63 ± 37.52 | 173.52 ± 40.06 | <0.001 c |
| LDL Cholesterol | 116.44 ± 34.55 | 96.13 ± 35.44 | <0.001 c |
| HDL Cholesterol | 50.85 ± 13.56 | 53.09 ± 16.71 | <0.001 c |
| Triglyceride | 135.07 ± 82.23 | 137.40 ± 85.16 | 0.18 c |
| Previous CVD, n (%) | | | <0.001 b |
| No | 4576 (97.4%) | 2336 (49.7%) | |
| Yes | 123 (2.6%) | 2363 (50.3%) | |
| Previous Stroke, n (%) | | | 0.008 b |
| No | 4607 (98.0%) | 4568 (97.2%) | |
| Yes | 92 (2.0%) | 131 (2.8%) | |
| Previous Hypertension, n (%) | | | <0.001 a |
| No | 2949 (62.8%) | 2364 (50.3%) | |
| Yes | 1750 (37.2%) | 2335 (49.7%) | |
| Previous Diabetes, n (%) | | | <0.001 a |
| No | 4047 (86.1%) | 3680 (78.3%) | |
| Yes | 652 (13.9%) | 1019 (21.7%) | |
| Previous Hyperlipidemia, n (%) | | | <0.001 a |
| No | 4414 (93.9%) | 4222 (89.8%) | |
| Yes | 285 (6.1%) | 477 (10.2%) | |
| FH of CVD, n (%) | | | <0.001 b |
| No | 4587 (97.6%) | 4284 (91.2%) | |
| Yes | 112 (2.4%) | 415 (8.8%) | |
| FH of Stroke, n (%) | | | <0.001 a |
| No | 4409 (93.8%) | 4273 (90.9%) | |
| Yes | 290 (6.2%) | 426 (9.1%) | |
| Smoking Type, n (%) | | | <0.001 a |
| Never Smoking | 2823 (60.1%) | 2732 (58.1%) | |
| Past Smoking | 1025 (21.8%) | 1215 (25.9%) | |
| Current Smoking | 851 (18.1%) | 752 (16.0%) | |
| Drinking (days/week), n (%) | | | <0.001 b |
| 0 | 2953 (62.8%) | 3219 (68.5%) | |
| 1 | 634 (13.5%) | 516 (11.0%) | |
| 2 | 450 (9.6%) | 395 (8.4%) | |
| 3 | 290 (6.2%) | 258 (5.5%) | |
| 4 | 115 (2.4%) | 91 (1.9%) | |
| 5 | 70 (1.5%) | 86 (1.8%) | |
| 6 | 59 (1.3%) | 44 (0.9%) | |
| 7 | 128 (2.7%) | 90 (1.9%) | |
| Walk 30 min (days/week), n (%) | | | 0.27 a |
| 0 | 1503 (32.0%) | 1482 (31.5%) | |
| 1 | 337 (7.2%) | 322 (6.9%) | |
| 2 | 507 (10.8%) | 506 (10.8%) | |
| 3 | 584 (12.4%) | 651 (13.9%) | |
| 4 | 355 (7.6%) | 328 (7.0%) | |
| 5 | 423 (9.0%) | 410 (8.7%) | |
| 6 | 304 (6.5%) | 270 (5.7%) | |
| 7 | 686 (14.6%) | 730 (15.5%) | |
CVD: cardiovascular disease, LDL: low-density lipoprotein, HDL: high-density lipoprotein, FH: family history, min: minutes. a Chi-square test. b Fisher’s exact test. c Independent t-tests.
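
As an illustration of the chi-square comparison named in the footnote, the following minimal sketch recomputes a test statistic from the previous-hypertension counts in the table. It assumes a plain Pearson chi-square without continuity correction, which may differ from the authors' exact procedure; the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Assumption: this is an illustrative calculation, not the study's own code.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Previous hypertension counts from the table:
# rows = No / Yes, columns = non-CVD group / CVD group.
chi2, p_value = chi_square_2x2(2949, 2364, 1750, 2335)
print(f"chi2 = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```

The resulting p-value falls below 0.001, in line with the entry reported for previous hypertension.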
Figure 2. Receiver operating characteristic curves of the prediction performance for each algorithm. The extreme gradient boosting, gradient boosting, and random forest algorithms showed the best average prediction accuracy.
Results of the prediction performance for each algorithm.
| Algorithm | Trials | Mean | SD | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| eXtreme Gradient Boosting | 30 | 0.812 | 0.005 | 0.803 | 0.810 | 0.812 | 0.815 | 0.822 |
| Gradient Boosting | 30 | 0.812 | 0.005 | 0.800 | 0.809 | 0.812 | 0.816 | 0.821 |
| Random Forest | 30 | 0.811 | 0.005 | 0.799 | 0.809 | 0.811 | 0.815 | 0.821 |
| Multi-layer Perceptron | 30 | 0.808 | 0.006 | 0.794 | 0.805 | 0.808 | 0.812 | 0.817 |
| Adaptive Boosting | 30 | 0.806 | 0.005 | 0.798 | 0.802 | 0.806 | 0.810 | 0.818 |
| Logistic Regression | 30 | 0.805 | 0.005 | 0.793 | 0.801 | 0.805 | 0.808 | 0.814 |
| Extra-Trees | 30 | 0.802 | 0.005 | 0.791 | 0.799 | 0.803 | 0.806 | 0.811 |
| K-nearest Neighbors | 30 | 0.792 | 0.005 | 0.779 | 0.789 | 0.794 | 0.796 | 0.802 |
| Decision Tree | 30 | 0.789 | 0.005 | 0.780 | 0.786 | 0.790 | 0.793 | 0.796 |
| Support Vector Machine | 30 | 0.782 | 0.006 | 0.766 | 0.779 | 0.783 | 0.786 | 0.793 |
SD: standard deviation, Min: minimum result, Max: maximum result.
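
The summary columns above (trials, mean, SD, minimum, quartiles, maximum) can be derived from a list of per-trial scores. The sketch below is one way to do that with the Python standard library. It assumes the sample standard deviation and linearly interpolated quartiles, which the record does not confirm; `summarize_trials` and the example scores are hypothetical placeholders, not study data.

```python
import statistics


def summarize_trials(scores):
    """Summarize repeated-trial scores using the table's column names.

    Assumptions: sample standard deviation and inclusive (linearly
    interpolated) quartiles; the study's exact conventions are not stated.
    """
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "Trials": len(scores),
        "Mean": statistics.mean(scores),
        "SD": statistics.stdev(scores),
        "Min": min(scores),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "Max": max(scores),
    }


# Placeholder scores for illustration only (not taken from the study).
example_scores = [0.80, 0.81, 0.81, 0.82, 0.82]
for name, value in summarize_trials(example_scores).items():
    print(f"{name}: {value}")
```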
Figure 3. Importance analysis of contributing factors with the feature importance method. Previous CVD was the most important factor, followed by total cholesterol (TC), LDL cholesterol, waist-to-height ratio, and body mass index.
Figure 4. Importance analysis of contributing factors with the permutation importance method. As with the feature importance method, previous CVD was the most important factor, followed by TC, LDL cholesterol, waist-to-height ratio, and body mass index.
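
To make the permutation importance idea concrete, here is a minimal sketch of the general technique: shuffle one feature at a time and measure how much accuracy drops. It is not the authors' implementation. The toy data, the rule-based `toy_predict` model, and all function names are hypothetical and chosen only so the example runs on its own.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    correct = sum(predict(row) == label for row, label in zip(rows, labels))
    return correct / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [
                row[:j] + [value] + row[j + 1:]
                for row, value in zip(rows, column)
            ]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model that only looks at feature 0."""
    return int(row[0] > 0.5)


# Synthetic data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

importances = permutation_importance(toy_predict, rows, labels)
print(importances)
```

In this toy setup, shuffling the informative feature lowers accuracy substantially, while shuffling the noise feature leaves it unchanged. A ranking like the one in Figure 4 rests on the same logic.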