Yizhao Ni, Kathleen Alwell, Charles J Moomaw, Daniel Woo, Opeolu Adeoye, Matthew L Flaherty, Simona Ferioli, Jason Mackey, Felipe De Los Rios La Rosa, Sharyl Martini, Pooja Khatri, Dawn Kleindorfer, Brett M Kissela.
Abstract
OBJECTIVE: 1) To develop a machine learning approach for detecting stroke cases and subtypes from hospitalization data; 2) to assess algorithm performance and predictors on real-world data collected by a large-scale epidemiology study in the US; and 3) to identify directions for future development of high-precision stroke phenotypic signatures.
Year: 2018; PMID: 29444182; PMCID: PMC5812624; DOI: 10.1371/journal.pone.0192586
Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240
Fig 1. The overall processes of the study.
Table 1. Summary of the variables used in the study.
| Variable Category | Number of Variables | Description |
|---|---|---|
| DEMO | 6 | Patient demographics, including age, sex, race, ethnicity, marital status and employment status |
| SU | 11 | Patients’ history of substance use (smoking, alcohol and street drugs) |
| VI | 4 | Visit information at time of admission (e.g., type of first medical contact, type of visited institution) |
| ED | 13 | Evaluations (e.g., blood pressure, Glasgow Coma Scale) performed in the emergency department |
| SE | 29 | Stroke-related evaluations (e.g., NIH stroke scale) |
| SS | 20 | Signs and symptoms that caused a patient to seek medical attention (e.g., weakness, headache, speech and vision) |
| CT/MRI | 24 | CT or MRI performed (Yes/No) and, if so, the findings (e.g., normal, acute infarct, intracerebral hemorrhage) |
| ANG | 6 | MRA, CTA, or cerebral angiography performed (Yes/No) and, if so, the findings (e.g., normal/abnormal) |
| CU | 2 | Carotid ultrasound performed (Yes/No) and, if so, the findings (e.g., normal/abnormal) |
| ECHO | 19 | Echocardiogram performed (Yes/No) and, if so, the findings (e.g., cardiomyopathy Yes/No) |
| EKG | 16 | Electrocardiogram performed (Yes/No) and, if so, the findings (e.g., normal/abnormal) |
| LAB | 14 | Laboratory results collected during hospitalization (e.g., white blood cell count, glucose level, total cholesterol) |
| MH | 52 | General medical history prior to hospitalization (e.g., history of hypertension Yes/No) |
| SH | 18 | History of stroke prior to hospitalization (e.g., ischemic stroke Yes/No) |
| ICD9 | 1 | Primary and secondary ICD-9 codes on patients’ discharge lists |
| DX | 47 | Complications and new diagnoses during hospitalization (e.g., pain, seizure, cardiac arrest Yes/No) |
| IT | 13 | Interventions performed (e.g., aneurysm clipping/coiling, clot evacuation Yes/No) |
| TH | 15 | Therapies performed (e.g., physical, occupational or speech therapy Yes/No) |
| OC | 6 | Clinical outcome of hospitalization (e.g., disposition at discharge, modified Rankin Scale) |
Fig 2. The ICD-9 coded baseline.
Fig 3. The event distribution of stroke subtypes among the four categories.
Fig 4. The performance curves when adding the variable sets (Table 1).
Performance of different classification algorithms for stroke case identification.
| ACC | 60.45 | 85.41 | 87.17 | 87.56 | 87.56 | 87.35 | |
| P | 87.96 | 86.06 | 88.96 | 90.26 | 90.32 | 91.25 | |
| R | 62.47 | 97.11 | 95.87 | 96.45 | 92.81 | 94.31 | |
| F | 73.05 | 92.10 | 92.86 | 93.28 | 92.78 | 92.75 | |
| AUC | 55.83 | 50.60 | 86.11 | 85.93 | 86.41 | 85.98 | |
| AUC-PR | 87.91 | 83.29 | 97.15 | 96.81 | 96.86 | 96.74 | |
| ACC | 61.68 | 85.85 | 87.21 | 87.89 | 88.57 | 86.90 | |
| P | 88.41 | 86.39 | 89.40 | 90.94 | 90.99 | 91.59 | |
| R | 63.72 | 96.54 | 95.39 | 95.97 | 92.80 | 93.31 | |
| F | 74.06 | 92.32 | 92.84 | 93.11 | 93.30 | 92.44 | |
| AUC | 55.40 | 51.65 | 86.69 | 86.31 | 86.61 | 85.87 | |
| AUC-PR | 88.29 | 83.51 | 97.23 | 97.19 | 97.22 | 96.89 | |
Bold numbers indicate the best results.
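The F rows in the table above are consistent with the standard F-measure, the harmonic mean of precision (P) and recall (R). The short Python sketch below assumes that standard definition and checks it against the first value column of the first block of rows. The function name `f_measure` is illustrative and does not come from the study.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R copied from the first value column of the first block in the table above.
f_first_column = f_measure(87.96, 62.47)
print(f"F = {f_first_column:.2f}")
```

The result lands within rounding distance of the 73.05 reported in the same column. A small gap is expected if the published figures were averaged or rounded before tabulation.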
Statistical significance tests (paired t-test) of the performance difference between the machine learning algorithms and the baselines on stroke case identification.
P values between the machine learning algorithms and the baselines:
| Baseline | Measure | LR | SVM-P | SVM-R | RF | ANN |
|---|---|---|---|---|---|---|
| ICD9 | ACC | 1.04E-12 | 1.63E-12 | 6.82E-13 | 1.65E-12 | 9.61E-12 |
| ICD9 | P | 4.99E-4 | 4.95E-6 | 7.01E-6 | 7.68E-9 | 4.67E-7 |
| ICD9 | R | 3.47E-13 | 1.40E-12 | 2.50E-13 | 9.98E-13 | 4.34E-12 |
| ICD9 | F | 1.50E-12 | 2.07E-12 | 8.01E-13 | 2.31E-12 | 9.13E-12 |
| ICD9 | AUC | 4.40E-11 | 3.46E-12 | 3.11E-12 | 8.43E-12 | 1.58E-11 |
| ICD9 | AUC-PR | 8.28E-12 | 1.25E-11 | 1.01E-11 | 2.46E-12 | 1.02E-11 |
| CLIN | ACC | 1.63E-4 | 4.65E-5 | 1.16E-5 | 5.45E-5 | 6.96E-4 |
| CLIN | P | 1.53E-7 | 6.62E-10 | 1.43E-10 | 1.04E-10 | 1.49E-10 |
| CLIN | R | 0.999 | 1.00 | 0.999 | 1.00 | 1.00 |
| CLIN | F | 1.00E-3 | 7.85E-4 | 1.13E-4 | 3.90E-3 | 1.48E-2 |
| CLIN | AUC | 4.40E-11 | 1.50E-11 | 7.37E-12 | 1.23E-11 | 4.86E-11 |
| CLIN | AUC-PR | 5.94E-9 | 3.25E-9 | 3.80E-9 | 3.94E-9 | 9.13E-9 |
*indicates statistical significance (p value < 0.05).
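The p values above come from paired t-tests. As a rough illustration of how the underlying statistic is formed, the sketch below computes a paired t statistic using only the Python standard library. It assumes the pairs are matched evaluation runs, such as cross-validation folds, which this excerpt does not specify. The scores are made-up placeholders, not data from the study, and `paired_t_statistic` is an illustrative name.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two matched samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-run accuracies (placeholders, NOT values from the study).
model_scores = [0.87, 0.88, 0.86, 0.89, 0.87]
baseline_scores = [0.85, 0.86, 0.85, 0.86, 0.85]
t_stat, dof = paired_t_statistic(model_scores, baseline_scores)
print(f"t = {t_stat:.3f}, df = {dof}")
```

Turning the statistic into a p value additionally requires the Student t distribution with the returned degrees of freedom, which a statistics package would normally supply.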
Fig 5. Precision-recall curves generated by the algorithms.
Performance of different classification algorithms for stroke type identification.
| ACC | 68.67 | 84.05 | 86.66 | 85.63 | 86.57 | 85.75 | ||
| P | 80.64 | 83.36 | 89.37 | 92.22 | 92.58 | 90.31 | ||
| R | 79.59 | 94.17 | 88.48 | 91.05 | 90.53 | 91.80 | ||
| F | 80.11 | 89.91 | 91.08 | 91.62 | 91.54 | 91.04 | ||
| P | 87.40 | 93.88 | 94.31 | 93.85 | 92.69 | 94.61 | ||
| R | 82.80 | 96.97 | 94.36 | 96.24 | 97.98 | 94.51 | ||
| F | 84.99 | 95.60 | 94.50 | 95.00 | 95.23 | 94.50 | ||
| P | 67.42 | 83.83 | 87.02 | 87.43 | 86.16 | 88.00 | ||
| R | 72.47 | 94.36 | 89.19 | 93.18 | 96.07 | 91.48 | ||
| F | 69.81 | 89.75 | 90.53 | 88.90 | 90.19 | 89.69 | ||
| P | 12.52 | 31.23 | 52.30 | 57.00 | 59.18 | 54.67 | ||
| R | 11.90 | 2.72 | 38.71 | 51.91 | 49.96 | 47.55 | ||
| F | 12.18 | 4.98 | 47.10 | 54.20 | 54.15 | 50.75 | ||
| ACC | 67.68 | 84.12 | 86.40 | 85.17 | 87.08 | 87.08 | ||
| P | 79.50 | 82.43 | 89.91 | 93.28 | 93.60 | 91.16 | ||
| R | 79.69 | 92.07 | 86.06 | 90.02 | 89.66 | 91.71 | ||
| F | 79.59 | 89.12 | 90.97 | 90.12 | 91.59 | 91.43 | ||
| P | 88.02 | 97.14 | 95.98 | 95.53 | 94.48 | 94.35 | ||
| R | 84.48 | 97.70 | 95.98 | 98.28 | 98.28 | 95.98 | ||
| F | 86.22 | 97.42 | 95.98 | 96.88 | 96.34 | 95.16 | ||
| P | 65.99 | 84.48 | 87.65 | 88.21 | 86.71 | 89.22 | ||
| R | 67.36 | 96.61 | 94.52 | 89.82 | 93.73 | 95.04 | ||
| F | 66.67 | 90.13 | 90.96 | 89.70 | 90.89 | 91.63 | ||
| P | 11.95 | 50.00 | 56.18 | 50.09 | 56.77 | 58.67 | ||
| R | 11.79 | 5.24 | 43.67 | 56.77 | 54.59 | 50.22 | ||
| F | 11.87 | 9.49 | 49.14 | 56.78 | 56.77 | 54.12 | ||
Bold numbers indicate the best results.
Statistical significance tests (paired t-test) of the performance difference between the machine learning algorithms and the baselines on stroke type identification.
P values between the ML algorithms and the baselines:
| Baseline | Measure | LR | SVM-P | SVM-R | RF | ANN |
|---|---|---|---|---|---|---|
| ICD9 | Overall ACC | 2.23E-10 | 2.62E-10 | 1.63E-9 | 1.65E-10 | 3.21E-10 |
| ICD9 | P (Ischemic stroke) | 6.07E-9 | 9.81E-8 | 5.24E-8 | 1.02E-8 | 4.40E-8 |
| ICD9 | P (Hemorrhagic stroke) | 1.15E-5 | 4.73E-5 | 2.27E-5 | 4.87E-5 | 7.92E-5 |
| ICD9 | P (TIA) | 2.13E-10 | 2.17E-9 | 7.82E-11 | 3.72E-10 | 8.84E-11 |
| ICD9 | P (Non-stroke control) | 2.63E-9 | 8.46E-9 | 1.61E-8 | 8.88E-10 | 2.19E-9 |
| CLIN | Overall ACC | 2.40E-3 | 1.89E-2 | 1.64E-4 | 7.57E-7 | 2.60E-3 |
| CLIN | P (Ischemic stroke) | 4.35E-11 | 3.64E-9 | 3.94E-9 | 1.73E-10 | 5.33E-9 |
| CLIN | P (Hemorrhagic stroke) | 0.104 | 0.229 | 0.529 | 0.993 | 0.116 |
| CLIN | P (TIA) | 3.33E-5 | 6.05E-5 | 4.56E-6 | 2.34E-5 | 1.33E-4 |
| CLIN | P (Non-stroke control) | 1.90E-3 | 2.90E-3 | 7.05E-4 | 3.14E-4 | 2.00E-3 |
*indicates statistical significance (p value < 0.05).
Fig 6. Confusion matrices generated by ICD9, CLIN, and RF on the test set.
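Confusion matrices such as those in Fig 6 are the source of per-class precision and recall as well as overall accuracy. The sketch below shows that relationship for a four-class problem (IS, HS, TIA, NS). The counts are hypothetical placeholders, not the figure's values, and the orientation (rows are true classes, columns are predicted classes) is an assumption. Function names are illustrative.

```python
LABELS = ["IS", "HS", "TIA", "NS"]

# Hypothetical counts (NOT from the study): rows = true class, columns = predicted class.
confusion = [
    [80, 2, 10, 8],
    [3, 40, 1, 6],
    [12, 0, 20, 18],
    [5, 3, 9, 83],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} from a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total: events predicted as this class
        actual = sum(matrix[i])                    # row total: events truly in this class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


def overall_accuracy(matrix):
    """Fraction of all events that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


for label, (p, r) in per_class_metrics(confusion, LABELS).items():
    print(f"{label}: P = {p:.2%}, R = {r:.2%}")
print(f"Overall ACC = {overall_accuracy(confusion):.2%}")
```

Reading the matrix this way also makes it easy to see which class pairs account for most of the off-diagonal mass.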
Misclassification errors made by the RF algorithm on the test set.
| 1 | No focal symptoms and key diagnostic tests (CT/MRI findings) were not performed (16.67%) | 0 | 0 | 2 | 0 | 0 | 0 | 12 | 0 | 20 |
| 2 | Missing CT/MRI findings (e.g., “no acute intracranial abnormality”) stored in textual data fields (11.27%) | 0 | 0 | 6 | 1 | 0 | 1 | 1 | 0 | 14 |
| 3 | Physicians used information not in the data (e.g., raw MRI images and clinical notes) to make the decisions (6.86%) | 0 | 0 | 7 | 0 | 1 | 1 | 3 | 1 | 1 |
| 4 | Missing information (e.g., MRI findings) due to ED or outpatient settings (6.86%) | 0 | 1 | 4 | 1 | 0 | 2 | 2 | 0 | 4 |
| 5 | Dilemma samples. Physicians determined as cases but the patients did not meet all inclusion criteria. The events were labeled as non-stroke “control” in our study (14.71%) | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 2 | 1 |
| 6 | Complex cases. Ischemic stroke with hemorrhagic conversion (4.90%) | 7 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | Undetermined etiology of cases. No focal symptoms or findings from diagnostic tests (12.25%) | 0 | 4 | 16 | 1 | 0 | 4 | 0 | 0 | 0 |
| 8 | Conflict findings between symptoms and diagnostic tests (21.08%) | 0 | 0 | 33 | 0 | 1 | 1 | 4 | 0 | 4 |
| 9 | Wrong predictions. Unidentified reason (5.39%) | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 8 |
IS: ischemic stroke; HS: hemorrhagic stroke; TIA: transient ischemic attack; NS: non-stroke control. The percentage of errors for each category is given in parentheses.
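The percentage shown with each error category can be reproduced from the counts in its row: it is that row's share of all misclassified events. The Python sketch below uses the nine counts of each row exactly as listed in the table above. The dictionary layout and variable names are only for illustration.

```python
# Counts per error category, copied row by row from the table above
# (the nine count columns of each row, in their listed order).
error_counts = {
    1: [0, 0, 2, 0, 0, 0, 12, 0, 20],
    2: [0, 0, 6, 1, 0, 1, 1, 0, 14],
    3: [0, 0, 7, 0, 1, 1, 3, 1, 1],
    4: [0, 1, 4, 1, 0, 2, 2, 0, 4],
    5: [0, 0, 0, 0, 0, 0, 27, 2, 1],
    6: [7, 0, 3, 0, 0, 0, 0, 0, 0],
    7: [0, 4, 16, 1, 0, 4, 0, 0, 0],
    8: [0, 0, 33, 0, 1, 1, 4, 0, 4],
    9: [0, 0, 2, 0, 0, 0, 1, 0, 8],
}

# Errors per category and across all categories.
totals = {category: sum(counts) for category, counts in error_counts.items()}
grand_total = sum(totals.values())

# Share of all errors, in percent, rounded to two decimals as in the table.
shares = {category: round(100 * n / grand_total, 2) for category, n in totals.items()}

for category, pct in shares.items():
    print(f"category {category}: {totals[category]} errors ({pct:.2f}%)")
print(f"total errors: {grand_total}")
```

The rows sum to 204 errors, and each computed share matches the percentage printed beside its category.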