Xiao Huang, Tianyu Cao, Liangziqian Chen, Junpei Li, Ziheng Tan, Benjamin Xu, Richard Xu, Yun Song, Ziyi Zhou, Zhuo Wang, Yaping Wei, Yan Zhang, Jianping Li, Yong Huo, Xianhui Qin, Yanqing Wu, Xiaobin Wang, Hong Wang, Xiaoshu Cheng, Xiping Xu, Lishun Liu.
Abstract
Background: Stroke is a major global health burden, and risk prediction is essential for the primary prevention of stroke. However, uncertainty remains about the optimal prediction model for analyzing stroke risk. In this study, we aim to determine the most effective stroke prediction method in a Chinese hypertensive population using machine learning and establish a general methodological pipeline for future analysis.
Keywords: XGBoost; machine learning; primary prevention; risk assessment; stroke
Year: 2022 PMID: 35600480 PMCID: PMC9120532 DOI: 10.3389/fcvm.2022.901240
Source DB: PubMed Journal: Front Cardiovasc Med ISSN: 2297-055X
Figure 1. Analysis flow for the development and evaluation of models.
Baseline and follow-up characteristics of the study participants.
| Characteristic | CSPPT, total | CSPPT, stroke | CSPPT, no stroke | P value | NCC, total | NCC, cases | NCC, controls | P value |
|---|---|---|---|---|---|---|---|---|
| N | 20,702 | 637 | 20,065 | | 2,568 | 1,284 | 1,284 | |
| Sex | | | | 0.001 | | | | 0.97 |
| Male | 8,497 (41.0) | 302 (47.4) | 8,195 (40.8) | | 1,311 (51.1) | 656 (51.1) | 655 (51.0) | |
| Female | 12,205 (59.0) | 335 (52.6) | 11,870 (59.2) | | 1,257 (48.9) | 628 (48.9) | 629 (49.0) | |
| Age, year | 60.0 (7.5) | 62.2 (7.3) | 59.9 (7.5) | <0.001 | 70.6 (8.2) | 70.6 (8.2) | 70.6 (8.2) | 0.99 |
| Hip, cm | 94.6 (6.9) | 94.7 (7.2) | 94.6 (6.9) | 0.73 | 98.1 (8.5) | 98.7 (8.4) | 97.5 (8.5) | 0.001 |
| BMI, kg/m² | 24.8 (3.5) | 25.1 (3.6) | 24.8 (3.5) | 0.04 | 26.3 (4.1) | 26.6 (4.3) | 26.0 (3.7) | <0.001 |
| DBP, mmHg | 93.8 (10.9) | 95.8 (11.0) | 93.8 (10.8) | <0.001 | 85.5 (12.3) | 87.4 (12.8) | 83.6 (11.5) | <0.001 |
| SBP, mmHg | 165.8 (18.3) | 173.1 (18.5) | 165.5 (18.3) | <0.001 | 153.1 (22.8) | 157.0 (23.4) | 149.2 (21.4) | <0.001 |
| Pulse, bpm | 72.9 (8.7) | 72.9 (8.7) | 72.9 (8.7) | 0.93 | 74.5 (12.7) | 75.2 (12.8) | 73.7 (12.5) | 0.006 |
| Laboratory data | | | | | | | | |
| Albumin, mmol/L | 48.6 (4.2) | 48.3 (4.2) | 48.6 (4.2) | 0.02 | 47.0 (3.2) | 47.0 (3.1) | 47.1 (3.2) | 0.21 |
| AST, mmol/L | 24.5 (6.8) | 23.6 (6.8) | 24.5 (6.8) | 0.001 | 21.9 (14.4) | 21.5 (8.9) | 22.4 (18.4) | 0.11 |
| γ-GT, mmol/L | 21.9 (9.6) | 23.3 (10.2) | 21.9 (9.5) | <0.001 | 27.0 (27.9) | 27.4 (25.7) | 26.6 (29.9) | 0.49 |
| TC, mmol/L | 5.5 (1.1) | 5.6 (1.0) | 5.5 (1.1) | <0.001 | 5.8 (1.2) | 5.8 (1.2) | 5.8 (1.2) | 0.56 |
| Calcium, mmol/L | 2.6 (0.2) | 2.6 (0.2) | 2.6 (0.2) | 0.30 | 2.3 (0.2) | 2.4 (0.2) | 2.3 (0.2) | 0.04 |
| Triglycerides, mmol/L | 1.5 (0.6) | 1.5 (0.6) | 1.5 (0.6) | 0.81 | 1.4 (0.9) | 1.5 (0.9) | 1.3 (0.8) | <0.001 |
| Glucose, mmol/L | 5.6 (0.8) | 5.7 (0.9) | 5.6 (0.8) | <0.001 | 6.2 (2.3) | 6.5 (2.5) | 6.0 (2.0) | <0.001 |
| Creatinine, mmol/L | 64.6 (13.2) | 66.1 (13.8) | 64.6 (13.2) | 0.005 | 60.2 (27.8) | 61.2 (27.4) | 59.2 (28.2) | 0.08 |
| Cardiovascular risk factors | | | | | | | | |
| Diabetes | | | | <0.001 | | | | <0.001 |
| No | 18,414 (88.9) | 521 (81.8) | 17,893 (89.2) | | 2,133 (83.1) | 1,019 (79.4) | 1,114 (86.8) | |
| Yes | 2,288 (11.1) | 116 (18.2) | 2,172 (10.8) | | 435 (16.9) | 265 (20.6) | 170 (13.2) | |
| Smoking | | | | <0.001 | | | | 0.06 |
| Never | 14,263 (68.9) | 387 (60.8) | 13,876 (69.2) | | 1,745 (68.0) | 856 (66.7) | 889 (69.2) | |
| Former | 1,570 (7.6) | 62 (9.7) | 1,508 (7.5) | | 256 (10.0) | 122 (9.5) | 134 (10.4) | |
| Current | 4,869 (23.5) | 188 (29.5) | 4,681 (23.3) | | 567 (22.1) | 306 (23.8) | 261 (20.3) | |
| Alcohol drinking | | | | 0.08 | | | | 0.66 |
| Never | 14,283 (69.0) | 415 (65.1) | 13,868 (69.1) | | 1,850 (72.0) | 923 (71.9) | 927 (72.2) | |
| Former | 1,459 (7.0) | 57 (8.9) | 1,402 (7.0) | | 95 (3.7) | 61 (4.8) | 34 (2.6) | |
| Current | 4,960 (24.0) | 165 (25.9) | 4,795 (23.9) | | 623 (24.3) | 300 (23.4) | 323 (25.2) | |
| Living standard | | | | 0.09 | | | | 0.43 |
| Good | 2,476 (12.0) | 72 (11.3) | 2,404 (12.0) | | 419 (16.3) | 207 (16.1) | 212 (16.5) | |
| Common | 15,863 (76.6) | 476 (74.7) | 15,387 (76.7) | | 2,014 (78.4) | 1,003 (78.1) | 1,011 (78.7) | |
| Bad | 2,363 (11.4) | 89 (14.0) | 2,274 (11.3) | | 135 (5.3) | 74 (5.8) | 61 (4.8) | |
| Noon nap | | | | 0.10 | | | | 0.11 |
| No | 14,665 (70.8) | 433 (68.0) | 14,232 (70.9) | | 1,154 (44.9) | 557 (43.4) | 597 (46.5) | |
| Yes | 6,037 (29.2) | 204 (32.0) | 5,833 (29.1) | | 1,414 (55.1) | 727 (56.6) | 687 (53.5) | |
| Fruit, kg/week | | | | 0.04 | | | | 0.07 |
| <0.5 | 553 (2.7) | 29 (4.6) | 524 (2.6) | | 117 (4.6) | 70 (5.5) | 47 (3.7) | |
| 0.5–1.5 | 3,747 (18.1) | 116 (18.2) | 3,631 (18.1) | | 705 (27.5) | 356 (27.7) | 349 (27.2) | |
| >3 | 16,402 (79.2) | 492 (77.2) | 15,910 (79.3) | | 1,746 (68.0) | 858 (66.8) | 888 (69.2) | |
| Taste | | | | 0.11 | | | | 0.72 |
| Bland | 4,208 (20.3) | 124 (19.5) | 4,084 (20.4) | | 1,316 (51.2) | 664 (51.7) | 652 (50.8) | |
| Common | 8,699 (42.0) | 249 (39.1) | 8,450 (42.1) | | 657 (25.6) | 324 (25.2) | 333 (25.9) | |
| Heavy | 7,795 (37.7) | 264 (41.4) | 7,531 (37.5) | | 595 (23.2) | 296 (23.1) | 299 (23.3) | |
| Medication use | | | | | | | | |
| Antihypertensive drugs | | | | 0.007 | | | | <0.001 |
| No | 11,166 (53.9) | 310 (48.7) | 10,856 (54.1) | | 1,354 (52.7) | 583 (45.4) | 771 (60.0) | |
| Yes | 9,536 (46.1) | 327 (51.3) | 9,209 (45.9) | | 1,214 (47.3) | 701 (54.6) | 513 (40.0) | |
| Hypoglycemic drugs | | | | 0.94 | | | | <0.001 |
| No | 20,385 (98.5) | 627 (98.4) | 19,758 (98.5) | | 2,265 (88.2) | 1,091 (85.0) | 1,174 (91.4) | |
| Yes | 317 (1.5) | 10 (1.6) | 307 (1.5) | | 303 (11.8) | 193 (15.0) | 110 (8.6) | |
Data are mean (SD) or n (%).
BMI, body mass index; AST, aspartate aminotransferase; γ-GT, gamma glutamyltranspeptidase; TC, total cholesterol.
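As a rough illustration of how a P value for a categorical row could arise, the sketch below runs a Pearson chi-square test on the sex counts from the 20,702-participant cohort. It assumes that the 637 and 20,065 columns are the stroke and non-stroke groups, and that a chi-square test without continuity correction was used; the record does not state the test, and the helper name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: male, female. Columns: assumed stroke, assumed non-stroke.
chi2, p_value = chi_square_2x2(302, 8195, 335, 11870)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Under these assumptions the result is of the same order as the 0.001 reported for sex, which suggests (but does not confirm) that a test of this kind underlies the categorical rows.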
Performance of machine learning methods in different datasets with different data balancing methods.
| Variable set | Method | Balancing method | CSPPT AUC | CSPPT sensitivity, % | CSPPT specificity, % | CSPPT accuracy, % | CSPPT kappa | NCC AUC | NCC sensitivity, % | NCC specificity, % | NCC accuracy, % | NCC kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No laboratory data | RF | Null | 0.651 | 0 | 100 | 97 | 0.970 | 0.565 | 0 | 100 | 50.0 | 0.500 |
| No laboratory data | XG | Null | 0.640 | 0 | 100 | 97 | 0.970 | 0.552 | 0 | 100 | 50.0 | 0.500 |
| No laboratory data | LR | Null | 0.641 | 0 | 100 | 97 | 0.970 | 0.595 | 0.2 | 100 | 50.1 | 0.002 |
| No laboratory data | SLR | Null | 0.641 | 0 | 100 | 97 | 0.970 | 0.610 | 0 | 100 | 50.0 | 0.500 |
| No laboratory data | RF | RUS | 0.629 | 69.4 | 51.7 | 52.2 | 0.025 | 0.562 | 60.2 | 46.1 | 53.2 | 0.063 |
| No laboratory data | XG | RUS | 0.624 | 67.2 | 49.0 | 49.5 | 0.018 | 0.580 | 72.6 | 36.1 | 54.3 | 0.086 |
| No laboratory data | LR | RUS | 0.626 | 58.6 | 57.9 | 57.9 | 0.022 | 0.582 | 62.4 | 50.2 | 56.3 | 0.125 |
| No laboratory data | SLR | RUS | 0.616 | 60.2 | 56.3 | 56.4 | 0.022 | 0.579 | 65.4 | 46.3 | 55.8 | 0.117 |
| No laboratory data | RF | SMOTE | 0.524 | 5.4 | 94.3 | 91.7 | −0.002 | 0.507 | 4.6 | 97.0 | 50.8 | 0.016 |
| No laboratory data | XG | SMOTE | 0.514 | 9.7 | 87.4 | 85.1 | −0.011 | 0.512 | 12.1 | 87.7 | 49.9 | −0.002 |
| No laboratory data | LR | SMOTE | 0.505 | 21.5 | 78 | 76.3 | −0.001 | 0.483 | 35.9 | 63.4 | 49.6 | −0.007 |
| No laboratory data | SLR | SMOTE | 0.505 | 21 | 77.9 | 76.2 | −0.003 | 0.483 | 36.1 | 63.3 | 49.7 | −0.006 |
| With laboratory data | RF | Null | 0.654 | 0 | 100 | 97 | 0.970 | 0.580 | 0 | 100 | 50.0 | 0.500 |
| With laboratory data | XG | Null | 0.621 | 0 | 100 | 97 | 0.970 | 0.576 | 0 | 100 | 50.0 | 0.500 |
| With laboratory data | LR | Null | 0.656 | 0 | 100 | 97 | 0.970 | 0.584 | 0.2 | 100 | 50.1 | 0.002 |
| With laboratory data | SLR | Null | 0.657 | 0 | 100 | 97 | 0.970 | 0.610 | 0 | 100 | 50.0 | 0.500 |
| With laboratory data | RF | RUS | 0.640 | 72.0 | 52.8 | 53.4 | 0.030 | 0.584 | 68.5 | 42.9 | 55.7 | 0.114 |
| With laboratory data | XG | RUS | 0.620 | 67.7 | 50.9 | 51.4 | 0.022 | 0.577 | 73.7 | 32.4 | 53.0 | 0.061 |
| With laboratory data | LR | RUS | 0.634 | 60.8 | 58.6 | 58.6 | 0.026 | 0.538 | 62.9 | 42.8 | 52.8 | 0.057 |
| With laboratory data | SLR | RUS | 0.639 | 60.2 | 57.4 | 57.5 | 0.023 | 0.579 | 65.4 | 46.3 | 55.8 | 0.117 |
| With laboratory data | RF | SMOTE | 0.533 | 4.8 | 95.2 | 92.5 | 0.000 | 0.531 | 3.0 | 97.8 | 50.4 | 0.008 |
| With laboratory data | XG | SMOTE | 0.525 | 10.8 | 88.1 | 85.8 | −0.005 | 0.526 | 13.9 | 88.2 | 51.1 | 0.021 |
| With laboratory data | LR | SMOTE | 0.538 | 30.6 | 75.9 | 74.5 | 0.015 | 0.498 | 43.4 | 56.9 | 50.2 | 0.003 |
| With laboratory data | SLR | SMOTE | 0.538 | 30.6 | 75.9 | 74.5 | 0.015 | 0.497 | 43.0 | 57.4 | 50.2 | 0.004 |
RF, random forest; XG, XGBoost; LR, logistic regression; SLR, stepwise logistic regression; RUS, random under-sampling; SMOTE, synthetic minority over-sampling technique; and AUC, area under the receiver operating characteristic curve.
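The rows without data balancing pair high accuracy with zero sensitivity, and a small calculation shows why. The sketch below assumes a classifier that labels every participant as stroke-free and, purely for illustration, applies it to the full-cohort counts (637 strokes among 20,702 participants); the published figures may have been computed on a held-out split instead. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy as percentages."""
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "accuracy": 100 * (tp + tn) / (tp + fn + tn + fp),
    }


# An all-negative classifier: every stroke is missed, every non-stroke is correct.
null_model = classification_metrics(tp=0, fn=637, tn=20065, fp=0)
for name, value in null_model.items():
    print(f"{name}: {value:.1f}%")
```

With a stroke rate near 3%, accuracy alone rewards predicting the majority class, which is why sensitivity and specificity need to be read alongside it when comparing balancing methods.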
Figure 2. Receiver operating characteristic (ROC) curves for data analysis methods with laboratory data in (A) CSPPT dataset (training set) and (B) NCC dataset (external validation set).
Figure 3. Most important variables from RUS-applied RF with both inclusion (A) and exclusion (B) of laboratory variables.
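For readers unfamiliar with random under-sampling (RUS), a minimal sketch follows. It keeps every minority-class record and draws an equally sized random subset of the majority class. This is a generic illustration using only the standard library, not the authors' pipeline; the function name `random_under_sample`, the fixed seed, and the use of full-cohort counts are all assumptions.

```python
import random


def random_under_sample(labels, seed=0):
    """Return indices of a class-balanced subset of a binary (0/1) label list."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Keep the whole minority class and sample the majority class down to its size.
    minority, majority = sorted((positives, negatives), key=len)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return kept


# Illustrative labels: 637 stroke events among 20,702 participants.
labels = [1] * 637 + [0] * 20065
kept = random_under_sample(labels)
balanced = [labels[i] for i in kept]
print(len(balanced), sum(balanced))
```

The balanced subset discards most majority-class records, so a model trained on it sees far fewer examples overall; that trade-off is the price of the improved sensitivity.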