Jenish Maharjan1, Yasha Ektefaie1, Logan Ryan1, Samson Mataraso1, Gina Barnes1, Sepideh Shokouhi1, Abigail Green-Saxena1, Jacob Calvert1, Qingqing Mao1, Ritankar Das1.
Abstract
BACKGROUND: Strokes represent a leading cause of mortality globally. New therapies must undergo safety and efficacy testing in clinical trials, which operate within a limited timeframe. To maximize the impact of these trials, patient cohorts for whom ischemic stroke is likely during that timeframe should be identified. Machine learning may improve upon existing candidate identification methods, thereby maximizing the impact of clinical trials for stroke prevention and treatment and improving patient safety.
Keywords: anticoagulant therapy; artificial intelligence; clinical trial; machine learning; stroke prediction
Year: 2022 PMID: 35145468 PMCID: PMC8823366 DOI: 10.3389/fneur.2021.784250
Source DB: PubMed Journal: Front Neurol ISSN: 1664-2295 Impact factor: 4.086
Figure 1. Study design timeline. Patients in the positive class according to our gold standard had to have been diagnosed with ischemic stroke within the prediction window, i.e., from 1 day to 1 year after the end of the visit. The negative class included patients for whom no diagnosis of ischemic stroke was identified within the prediction window and who had at least 1 year of data after the end of the visit.
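The labeling rule in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `label_encounter`, its arguments, and the choice of 365 days for "1 year" are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Iterable, Optional

# Assumed window bounds: 1 day to 365 days after the end of the visit.
WINDOW_START = timedelta(days=1)
WINDOW_END = timedelta(days=365)  # assumption: "1 year" taken as 365 days


def label_encounter(
    end_of_visit: date,
    stroke_dx_dates: Iterable[date],
    last_data_date: date,
) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (excluded) for one encounter.

    Hypothetical helper: positive if any ischemic stroke diagnosis falls in
    the prediction window; negative if none does and the patient has at
    least 1 year of data after the end of the visit; otherwise excluded.
    """
    start = end_of_visit + WINDOW_START
    end = end_of_visit + WINDOW_END
    if any(start <= d <= end for d in stroke_dx_dates):
        return 1
    if last_data_date >= end:
        return 0
    return None  # insufficient follow-up to call the encounter negative
```

Encounters that return `None` here correspond to visits that fail the prediction window requirement and would be dropped before training.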
Inclusion and exclusion details. International classification of diseases version 10 (ICD-10) codes were used to determine inclusion of ischemic stroke patients.
Features used in the model.
Figure 2. Patient encounter inclusion diagram. Initially, more than 28 million inpatient visits were included in the analysis; patient encounters were then filtered by the exclusion criteria and the prediction window requirements. A total of 41,970 patients were identified as positive for ischemic stroke based on our gold standard. The prevalence of ischemic stroke encounters was 5.9% in the training set, 5.8% in the hold-out test set, and 6.7% in the external validation set.
Demographic information for the study population sample in the training and testing of the algorithm.
| Characteristic | Subgroup | Ischemic stroke | No ischemic stroke | p-value |
|---|---|---|---|---|
| Age | 18–40 | 1,705 (4.1%) | 163,566 (24.3%) | < 0.0001 |
| | 40–60 | 10,620 (25.3%) | 205,509 (30.5%) | < 0.0001 |
| | 60–75 | 15,489 (36.9%) | 191,351 (28.4%) | < 0.0001 |
| | 75–100 | 14,156 (33.7%) | 113,440 (16.8%) | < 0.0001 |
| Sex | Male | 21,499 (51.2%) | 307,425 (45.6%) | < 0.0001 |
| | Female | 20,397 (48.6%) | 364,875 (54.1%) | < 0.0001 |
| | Unknown sex | 74 (0.2%) | 1,566 (0.2%) | 0.0204 |
| Race | African American | 7,193 (17.1%) | 88,415 (13.1%) | < 0.0001 |
| | Asian | 569 (1.4%) | 7,050 (1.0%) | < 0.0001 |
| | Caucasian | 31,189 (74.3%) | 530,059 (78.7%) | < 0.0001 |
| | Unknown or other race | 3,019 (7.2%) | 48,342 (7.2%) | 0.8841 |
| Ethnicity | Hispanic | 2,600 (6.2%) | 41,696 (6.2%) | 0.9501 |
| | Non-Hispanic | 36,946 (88.0%) | 587,308 (87.2%) | 0.1747 |
| | Unknown ethnicity | 2,424 (5.8%) | 44,862 (6.7%) | < 0.0001 |
| Comorbidities | Atrial fibrillation | 6,879 (16.4%) | 44,382 (6.6%) | < 0.0001 |
| | Diabetes mellitus | 15,902 (37.9%) | 139,044 (20.6%) | < 0.0001 |
| | Congestive heart failure | 8,235 (19.6%) | 59,028 (8.8%) | < 0.0001 |
| | History of stroke | 24,693 (58.8%) | 38,066 (5.6%) | < 0.0001 |
| | Hypertension | 31,803 (75.8%) | 303,664 (45.1%) | < 0.0001 |
| | Peripheral vascular disease | 5,610 (13.4%) | 31,981 (4.7%) | < 0.0001 |
| | COPD | 8,831 (21.0%) | 99,652 (14.8%) | < 0.0001 |
| | Renal (CKD) | 9,217 (22.0%) | 70,550 (10.5%) | < 0.0001 |
| | Cancer (Leukemia and Lymphoma) | 894 (2.1%) | 13,946 (2.1%) | 0.4069 |
| | Cancer (Solid Tumor) | 4,850 (11.6%) | 59,280 (8.8%) | < 0.0001 |
Demographic information for the study population sample in the external validation dataset.
| Characteristic | Subgroup | Ischemic stroke | No ischemic stroke | p-value |
|---|---|---|---|---|
| Age | 18–40 | 93 (2.5%) | 7,004 (13.4%) | < 0.0001 |
| | 40–60 | 810 (21.4%) | 14,972 (28.6%) | < 0.0001 |
| | 60–75 | 1,405 (37.1%) | 17,868 (34.1%) | < 0.0001 |
| | 75–100 | 1,482 (39.1%) | 12,509 (23.9%) | < 0.0001 |
| Sex | Male | 1,858 (49.0%) | 23,740 (45.4%) | < 0.0001 |
| | Female | 1,932 (51.0%) | 28,603 (54.6%) | < 0.0001 |
| | Unknown sex | 0 (0.0%) | 10 (0.0%) | 1 |
| Race | African American | 1,060 (28.0%) | 10,475 (20.0%) | < 0.0001 |
| | Asian | 52 (1.4%) | 619 (1.2%) | < 0.0001 |
| | Caucasian | 2,551 (67.3%) | 39,500 (75.4%) | < 0.0001 |
| | Unknown or other race | 127 (3.4%) | 1,759 (3.4%) | 1 |
| Ethnicity | Hispanic | 218 (5.8%) | 3,137 (6.0%) | 0.5949 |
| | Non-Hispanic | 3,557 (93.9%) | 48,808 (93.2%) | 0.7903 |
| | Unknown ethnicity | 15 (0.4%) | 408 (0.8%) | 0.0062 |
| Comorbidities | Atrial fibrillation | 839 (22.1%) | 7,315 (14.0%) | < 0.0001 |
| | Diabetes mellitus | 1,678 (44.3%) | 15,709 (30.0%) | < 0.0001 |
| | Congestive heart failure | 986 (26.0%) | 8,736 (16.7%) | < 0.0001 |
| | History of stroke | 2,393 (63.1%) | 3,704 (7.1%) | < 0.0001 |
| | Hypertension | 3,259 (86.0%) | 34,023 (65.0%) | < 0.0001 |
| | Peripheral vascular disease | 665 (17.5%) | 4,649 (8.9%) | < 0.0001 |
| | COPD | 951 (25.1%) | 11,759 (22.5%) | < 0.0001 |
| | Renal (CKD) | 1,200 (31.7%) | 10,054 (19.2%) | < 0.0001 |
| | Cancer (Leukemia and Lymphoma) | 71 (1.9%) | 964 (1.8%) | 0.8514 |
| | Cancer (Solid Tumor) | 442 (11.7%) | 5,163 (9.9%) | < 0.0001 |
Performance metrics for XGBoost, logistic regression, and multilayer perceptron (MLP) machine learning algorithms (MLAs) on the testing set and external validation set in comparison to the CHA2DS2-VASc risk score.
| Model | AUROC | Sensitivity | Specificity | PPV | NPV | LR+ | LR− | DOR |
|---|---|---|---|---|---|---|---|---|
| Testing set | | | | | | | | |
| XGBoost | 0.880 | 0.8 | 0.793 | 0.194 | 0.985 | 3.87 | 0.25 | 15.37 |
| Logistic regression (All Inputs) | 0.862 | 0.8 | 0.754 | 0.168 | 0.984 | 3.25 | 0.27 | 12.24 |
| MLP classifier | 0.862 | 0.8 | 0.772 | 0.179 | 0.984 | 3.50 | 0.26 | 13.54 |
| CHA2DS2-VASc Score | 0.754 | 0.871 | 0.479 | 0.094 | 0.984 | 1.67 | 0.27 | 6.22 |
| External validation set | | | | | | | | |
| XGBoost | 0.864 (0.859–0.869) | 0.8 | 0.749 | 0.188 | 0.981 | 3.19 | 0.27 | 11.97 |
| Logistic regression (All Inputs) | 0.858 | 0.8 | 0.745 | 0.185 | 0.981 | 3.14 | 0.27 | 11.68 |
| MLP classifier | 0.835 | 0.8 | 0.703 | 0.163 | 0.98 | 2.70 | 0.28 | 9.49 |
| CHA2DS2-VASc Score | 0.728 | 0.812 | 0.519 | 0.109 | 0.974 | 1.69 | 0.36 | 4.68 |
The testing set included 203,237 total patient encounters, with 11,789 patients identified in the positive class. Area under the receiver operating characteristic (AUROC) curve, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), likelihood ratios (LR), and diagnostic odds ratio (DOR) are shown for the MLAs.
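As a rough check on how the reported metrics relate to one another, the likelihood ratios, DOR, PPV, and NPV can be derived from sensitivity, specificity, and prevalence using the standard definitions. The sketch below is illustrative only: the function name `diagnostic_metrics` is hypothetical, and the inputs are the rounded values reported for XGBoost on the testing set, so the outputs match the table only up to rounding.

```python
def diagnostic_metrics(sensitivity: float, specificity: float, prevalence: float) -> dict:
    """Derive LR+, LR-, DOR, PPV and NPV from sensitivity, specificity, prevalence."""
    lr_pos = sensitivity / (1.0 - specificity)   # positive likelihood ratio
    lr_neg = (1.0 - sensitivity) / specificity   # negative likelihood ratio
    dor = lr_pos / lr_neg                        # diagnostic odds ratio

    # Expected cell fractions of the confusion matrix at the given prevalence.
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    tn = specificity * (1.0 - prevalence)

    return {
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Rounded XGBoost testing-set values; prevalence from 11,789 / 203,237 encounters.
metrics = diagnostic_metrics(sensitivity=0.8, specificity=0.793, prevalence=11789 / 203237)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these rounded inputs the sketch gives LR+ of about 3.86 and LR− of about 0.25, and a PPV near 0.19, close to the tabulated 3.87, 0.25, and 0.194; the small differences are consistent with the table having been computed from unrounded sensitivity and specificity.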
Figure 3. Receiver operating characteristic (ROC) curve for prediction of ischemic stroke for up to 1 year after first inpatient encounter on the test set data.
Figure 4. SHAP plot for model feature importance. Features are ranked in descending order of importance as measured by SHAP values. Red indicates a high feature value; blue indicates a low feature value. Dots to the right are indicative of a higher score; dots to the left a lower score. Mean and STD represent average and standard deviation, respectively. BMI, body mass index; BUN, blood urea nitrogen; CHF, congestive heart failure; DBP, diastolic blood pressure; RBC, red blood cells; SBP, systolic blood pressure; TEMP, temperature.