Vida Abedi, Venkatesh Avula, Durgesh Chaudhary, Shima Shahjouei, Ayesha Khan, Christoph J Griessenauer, Jiang Li, Ramin Zand.
Abstract
BACKGROUND: The long-term risk of recurrent ischemic stroke, estimated to be between 17% and 30%, cannot be reliably assessed at an individual level. Our goal was to study whether machine-learning models can be trained to predict stroke recurrence, to identify key clinical variables, and to assess whether performance metrics can be optimized.
Keywords: artificial intelligence; clinical decision support system; electronic health record; explainable machine learning; healthcare; interpretable machine learning; ischemic stroke; machine learning; outcome prediction; recurrent stroke
Year: 2021 PMID: 33804724 PMCID: PMC8003970 DOI: 10.3390/jcm10061286
Source DB: PubMed Journal: J Clin Med ISSN: 2077-0383 Impact factor: 4.241
Figure 1. (A) Flow chart of the inclusion and exclusion of subjects in the case and control groups. Patients in the control group had available records in the electronic health record for at least 5 years and no documented stroke recurrence within 5 years. The distribution panel shows the number of recurrences over time. At 24 days, the number of recurrent cases can be seen to approach a plateau. (B) The design strategy for predicting stroke recurrence using electronic health records (EHR), the Geisinger Quality database, and the Social Security Death database.
Patient demographics and past medical and family history by group. A detailed description of the variables is provided in the Geisinger Neuroscience Ischemic Stroke (GNSIS) study [13]. IQR: interquartile range; HDL: high-density lipoprotein; LDL: low-density lipoprotein.
| Patient Characteristics | % Missing | Statistics (All Patients) | Control Group | Case Group 1 | Case Group 2 | Case Group 3 | Case Group 4 | Case Group 5 |
|---|---|---|---|---|---|---|---|---|
| Total number of patients | - | 2091 | 1654 | 210 | 306 | 375 | 411 | 437 |
| Age in years, mean (SD) | - | 67 (13) | 66 (13) | 71 (14) | 71 (13) | 71 (13) | 71 (13) | 71 (13) |
| Age in years, median (IQR) | - | 68 (58–77) | 67 (57–76) | 73 (62–83) | 72 (63–81) | 73 (63–81) | 73 (63–81) | 73 (63–81) |
| Male, n (%) | - | 1079 (52%) | 53% | 47% | 46% | 46% | 47% | 47% |
| Body mass index (BMI) in kg/m², mean (SD) | 2.63% | 30 (7) | 30 (7) | 29 (6) | 29 (6) | 29 (6) | 29 (7) | 29 (6) |
| Body mass index (BMI) in kg/m², median (IQR) | 2.63% | 29 (26–33) | 29 (26–33) | 28 (24–32) | 28 (25–32) | 28 (25–32) | 28 (25–32) | 28 (25–32) |
| Diastolic Blood Pressure, mean (SD) | 31.90% | 76 (12) | 76 (12) | 75 (13) | 75 (12) | 75 (12) | 75 (12) | 74 (12) |
| Systolic Blood Pressure, mean (SD) | 31.90% | 137 (22) | 136 (22) | 139 (26) | 139 (25) | 140 (24) | 139 (24) | 139 (24) |
| Hemoglobin (Unit: g/dL), mean (SD) | 1.82% | 14 (2) | 14 (2) | 13 (2) | 14 (2) | 14 (2) | 14 (2) | 14 (2) |
| Hemoglobin A1c (Unit: %), mean (SD) | 25.11% | 7 (2) | 7 (2) | 7 (2) | 7 (2) | 7 (2) | 7 (2) | 7 (2) |
| HDL (Unit: mg/dL), mean (SD) | 5.40% | 47 (15) | 47 (15) | 45 (13) | 45 (14) | 45 (14) | 45 (14) | 45 (14) |
| LDL (Unit: mg/dL), mean (SD) | 5.79% | 102 (40) | 103 (40) | 103 (44) | 100 (43) | 101 (42) | 101 (41) | 100 (41) |
| Platelet (Unit: 10³/µL), mean (SD) | 1.82% | 232 (77) | 233 (76) | 227 (70) | 229 (73) | 231 (80) | 230 (78) | 229 (78) |
| White blood cell (Unit: 10³/µL), mean (SD) | 1.82% | 9 (3) | 9 (3) | 8 (3) | 8 (3) | 9 (3) | 9 (3) | 9 (3) |
| Creatinine (Unit: mg/dL), mean (SD) | 2.58% | 1 (1) | 1 (0.5) | 1 (1) | 1 (1) | 1 (1) | 1 (1) | 1 (1) |
| Current smoker, n (%) | - | 288 (14%) | 14 (1) | 12 (6) | 12 (4) | 13 (3) | 13 (3) | 13 (3) |
| Difference in days between Last outpatient visit prior to index date and index date, mean (SD) | 26.16% | 347 (726) | 345 (691) | 371 (882) | 354 (846) | 369 (855) | 352 (826) | 354 (840) |
| MEDICAL HISTORY, n (%) | | | | | | | | |
| Atrial flutter | | 41 (2%) | 28 (2%) | 4 (2%) | 9 (3%) | 11 (3%) | 13 (3%) | 13 (3%) |
| Atrial fibrillation | | 319 (15%) | 230 (14%) | 35 (17%) | 55 (18%) | 72 (19%) | 82 (20%) | 89 (20%) |
| Atrial fibrillation/flutter | | 324 (15%) | 233 (14%) | 36 (17%) | 56 (18%) | 74 (20%) | 84 (20%) | 91 (21%) |
| Chronic Heart failure (CHF) | | 159 (8%) | 103 (6%) | 33 (16%) | 42 (14%) | 49 (13%) | 53 (13%) | 56 (13%) |
| Chronic kidney disease | | 223 (11%) | 142 (9%) | 55 (26%) | 68 (22%) | 74 (20%) | 78 (19%) | 81 (19%) |
| Chronic liver disease | | 35 (2%) | 23 (1%) | 2 (1%) | 7 (2%) | 10 (3%) | 11 (3%) | 12 (3%) |
| Chronic liver disease (mild) | | 33 (2%) | 21 (1%) | 2 (1%) | 7 (2%) | 10 (3%) | 11 (3%) | 12 (3%) |
| Chronic liver disease (moderate to severe) | | 7 (0.3%) | 5 (0.3%) | 0 (0%) | 1 (0.3%) | 1 (0.3%) | 2 (0.5%) | 2 (0.5%) |
| Chronic lung diseases | | 391 (19%) | 296 (18%) | 51 (24%) | 70 (23%) | 83 (22%) | 92 (22%) | 95 (22%) |
| Diabetes | | 615 (29%) | 439 (27%) | 86 (41%) | 122 (40%) | 151 (40%) | 165 (40%) | 176 (40%) |
| Dyslipidemia | | 1298 (62%) | 994 (60%) | 142 (68%) | 211 (69%) | 258 (69%) | 285 (69%) | 304 (70%) |
| Hypertension | | 1495 (72%) | 1150 (70%) | 168 (80%) | 240 (78%) | 293 (78%) | 327 (80%) | 345 (79%) |
| Myocardial infarction | | 215 (10%) | 159 (10%) | 30 (14%) | 43 (14%) | 51 (14%) | 53 (13%) | 56 (13%) |
| Neoplasm | | 284 (14%) | 211 (13%) | 35 (17%) | 49 (16%) | 61 (16%) | 65 (16%) | 73 (17%) |
| Hypercoagulable | | 29 (1%) | 24 (1%) | 4 (2%) | 4 (1%) | 5 (1%) | 5 (1%) | 5 (1%) |
| Peripheral vascular disease | | 313 (15%) | 219 (13%) | 46 (22%) | 65 (21%) | 75 (20%) | 88 (21%) | 94 (22%) |
| Patent Foramen Ovale | | 241 (12%) | 184 (11%) | 30 (14%) | 41 (13%) | 47 (13%) | 53 (13%) | 57 (13%) |
| Rheumatic diseases | | 76 (4%) | 53 (3%) | 11 (5%) | 14 (5%) | 18 (5%) | 21 (5%) | 23 (5%) |
| FAMILY HISTORY | | | | | | | | |
| Heart disorder | | 943 (45%) | 747 (45%) | 85 (40%) | 130 (42%) | 165 (44%) | 182 (44%) | 196 (45%) |
| Stroke | | 361 (17%) | 279 (17%) | 39 (19%) | 60 (20%) | 72 (19%) | 77 (19%) | 82 (19%) |
Figure 2. Model performance summaries for the five different prediction windows, six different classifiers, and four feature selection approaches. Performance metrics for (A–F) Decision tree, (G–L) Gradient Boost, (M–R) Logistic Regression, (S–X) Random Forest, (Y–AD) SVM, and (AE–AJ) XGBoost.
Figure 3. Area under the receiver operating characteristic curve (AUROC) for six classifiers in the 1-year prediction window, using feature Set 3. (A) Model without sampling; (B) model with up-sampling at a 1:2 ratio; (C) model with up-sampling at a 1:1 ratio. The best-performing model (AUROC of 0.79) is Random Forest with up-sampling (panel B).
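For readers unfamiliar with the metric reported in Figure 3, the following is a minimal sketch of how an AUROC value can be computed from labels and predicted scores. It uses the pairwise (Mann-Whitney) interpretation: the probability that a randomly chosen recurrent case receives a higher score than a randomly chosen control. The function name `auroc` and the toy data are illustrative assumptions, not the study's code or data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Hypothetical helper for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 1 = recurrence, 0 = no recurrence.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.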
Figure 4. Feature importance based on the different trained models. (A–E) Six different classifiers (Gradient Boost, Random Forest, Extreme Gradient Boosting (XGBoost), Decision Trees, Support Vector Machine (SVM), and Logistic Regression) and five different prediction windows were used. (F) Average feature importance score across the different models and prediction windows.
Figure 5. Model performance summaries with sampling-based optimization for the 1- and 3-year prediction windows, using feature Set 3. Up-sampling was performed using the Synthetic Minority Over-sampling Technique (SMOTE). (A–F) Model without sampling; (G–L) model with down-sampling; (M–R) model with up-sampling.
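The general idea behind SMOTE-style up-sampling can be sketched as follows: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbours. This is a simplified illustration, not the implementation used in the study; the function name `smote_upsample`, its parameters (`ratio`, `k`, `seed`), and the toy data are all assumptions made for the example. A `ratio` of 0.5 corresponds to a 1:2 minority-to-majority balance and 1.0 to 1:1.

```python
import math
import random


def smote_upsample(minority, n_majority, ratio=1.0, k=3, seed=0):
    """Return the minority samples plus synthetic ones (hypothetical helper).

    The output size is round(ratio * n_majority), unless the minority class
    is already at least that large, in which case nothing is added.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    target = int(round(ratio * n_majority))
    n_new = max(0, target - len(minority))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base sample (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return list(minority) + synthetic


# Toy data (made up): 4 minority samples with two features, 16 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
balanced_1_to_2 = smote_upsample(minority, n_majority=16, ratio=0.5)
balanced_1_to_1 = smote_upsample(minority, n_majority=16, ratio=1.0)
print(len(balanced_1_to_2), len(balanced_1_to_1))  # 8 16
```

In common practice, up-sampling of this kind is applied to the training split only, so that evaluation data contain no synthetic samples; whether and how the study did this is not stated in this record.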