Jordan W Smoller, Ben Y Reis, Ilkin Bayramli, Victor Castro, Yuval Barak-Corren, Emily M Madsen, Matthew K Nock.
Abstract
Clinical risk prediction models powered by electronic health records (EHRs) are becoming increasingly widespread in clinical practice. With suicide-related mortality rates rising in recent years, it is becoming increasingly urgent to understand, predict, and prevent suicidal behavior. Here, we compare the predictive value of structured and unstructured EHR data for predicting suicide risk. We find that Naive Bayes Classifier (NBC) and Random Forest (RF) models trained on structured EHR data perform better than those based on unstructured EHR data. An NBC model trained on both structured and unstructured data yields similar performance (AUC = 0.743) to an NBC model trained on structured data alone (0.742, p = 0.668), while an RF model trained on both data types yields significantly better results (AUC = 0.903) than an RF model trained on structured data alone (0.887, p < 0.001), likely due to the RF model's ability to capture interactions between the two data types. To investigate these interactions, we propose and implement a general framework for identifying specific structured-unstructured feature pairs whose interactions differ between case and non-case cohorts, and thus have the potential to improve predictive performance and increase understanding of clinical risk. We find that such feature pairs tend to capture heterogeneous pairs of general concepts, rather than homogeneous pairs of specific concepts. These findings and this framework can be used to improve current and future EHR-based clinical modeling efforts.
Year: 2022 PMID: 35087182 PMCID: PMC8795240 DOI: 10.1038/s41746-022-00558-0
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1 Information overlap in EHR data.
Electronic health records contain both structured and unstructured data. These two types of data contain both unique and overlapping information.
Fig. 2 Data and modeling workflow.
The diagram describes the filtering and processing steps taken to arrive at the final datasets used for training and testing different models described in this paper. STR—Structured Data; NLP—Unstructured data processed by Natural Language Processing; NBC—Naïve Bayesian Classifier; BRFC—Balanced Random Forest Classifier.
Fig. 3 Distribution of time between penultimate hospital visit and first suicide attempt, in days.
As the distribution was highly skewed, the x-axis was capped at 100 days for clarity. A few patients had several years between their last recorded visit and suicide attempt.
Table 1 Correspondence between structured and unstructured codes.
| Concept | Struct. | Unstruct. | Both | Total |
|---|---|---|---|---|
| Impulse control disorder | 145 (19%) | 688 (86%) | 37 (5%) | 796 |
| Unspecified bipolar disorder | 1,322 (30%) | 4,053 (94%) | 1,051 (24%) | 4,324 |
| Schizo-affective disorder | 250 (42%) | 522 (88%) | 177 (30%) | 595 |
| Opioid dependence or abuse | 1,183 (27%) | 3,893 (90%) | 761 (17%) | 4,315 |
The number of patients who have a structured EHR code for a given concept (first column), an NLP code (based on a free-text mention of that concept in their unstructured clinician notes, second column), and both a structured code and an NLP code for the given concept (third column). Since NLP concepts are more general, each row includes one NLP code but several structured codes with similar descriptions. Furthermore, “opioid dependence” and “opioid abuse” codes were merged into one code since many EHR codes mention both opioid dependence and abuse.
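The Total column is consistent with inclusion-exclusion over the two code sources: patients with a structured code plus patients with an NLP code, minus those with both. The sketch below recomputes it from the counts in the table. The class and field names are illustrative, not from the paper, and shares recomputed this way can differ by about a percentage point from the printed percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptCounts:
    """Patient counts for one concept (names are illustrative)."""
    name: str
    structured: int    # patients with a structured EHR code
    unstructured: int  # patients with an NLP code from clinician notes
    both: int          # patients with both kinds of code

    @property
    def total(self) -> int:
        # Inclusion-exclusion: patients with at least one of the two codes.
        return self.structured + self.unstructured - self.both

    def share(self, count: int) -> float:
        # Fraction of all patients with the concept that `count` represents.
        return count / self.total


# Counts copied from the table above.
rows = [
    ConceptCounts("Impulse control disorder", 145, 688, 37),
    ConceptCounts("Unspecified bipolar disorder", 1322, 4053, 1051),
    ConceptCounts("Schizo-affective disorder", 250, 522, 177),
    ConceptCounts("Opioid dependence or abuse", 1183, 3893, 761),
]

for row in rows:
    print(f"{row.name}: total={row.total}, "
          f"structured={row.share(row.structured):.1%}, "
          f"unstructured={row.share(row.unstructured):.1%}, "
          f"both={row.share(row.both):.1%}")
```

For every concept the unstructured notes cover far more patients than the structured codes, and only a minority of patients carry both.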
Table 2 Performance of NBC models on the test set.
| Specificity | Unstructured: PPV | Unstructured: Sensitivity | Structured: PPV | Structured: Sensitivity | Both: PPV | Both: Sensitivity |
|---|---|---|---|---|---|---|
| | 0.070 | 0.079 | 0.072 | 0.076 | 0.088 | 0.092 |
| | 0.046 | 0.254 | 0.047 | 0.239 | 0.051 | 0.260 |
| | 0.035 | 0.378 | 0.036 | 0.365 | 0.039 | 0.391 |
| | 0.024 | 0.520 | 0.026 | 0.530 | 0.027 | 0.540 |
| AUC | 0.714 | | 0.742 | | 0.743 | |
There is no significant increase (p = 0.688) in AUC between the model based on structured-data-only and the model based on both structured and unstructured data.
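Each row of the performance table reports PPV and sensitivity at a fixed specificity. As a minimal sketch of how such an operating point can be derived from model scores, the function below scans observed scores for the lowest threshold that reaches a target specificity. The function name, the "predict positive when score >= threshold" convention, and the toy data are assumptions for illustration, not the paper's procedure.

```python
def metrics_at_specificity(scores, labels, target_specificity):
    """Return (threshold, sensitivity, ppv) for the lowest observed score
    threshold whose specificity is at least `target_specificity`.

    Assumes binary labels (1 = case, 0 = non-case) with both classes present.
    Returns None if no observed score reaches the target specificity.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for threshold in sorted(set(scores)):
        # A sample is predicted positive when its score is >= threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        specificity = (n_neg - fp) / n_neg
        if specificity >= target_specificity:
            sensitivity = tp / n_pos
            ppv = tp / (tp + fp)  # tp + fp >= 1 because threshold is an observed score
            return threshold, sensitivity, ppv
    return None


# Toy data (hypothetical): 4 cases and 5 non-cases.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1]

operating_point = metrics_at_specificity(toy_scores, toy_labels, 0.8)
print(operating_point)
```

Raising the target specificity moves the threshold up, which typically trades sensitivity for PPV, the same pattern visible down the rows of the table.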
Table 3 Performance of BRF models on the test set.
| Specificity | Unstructured: PPV | Unstructured: Sensitivity | Structured: PPV | Structured: Sensitivity | Both: PPV | Both: Sensitivity |
|---|---|---|---|---|---|---|
| | 0.142 | 0.168 | 0.191 | 0.246 | 0.219 | 0.267 |
| | 0.082 | 0.447 | 0.092 | 0.507 | 0.097 | 0.545 |
| | 0.057 | 0.608 | 0.063 | 0.657 | 0.066 | 0.697 |
| | 0.037 | 0.766 | 0.040 | 0.820 | 0.041 | 0.845 |
| AUC | 0.868 | | 0.887 | | 0.902 | |
There is a significant increase (p < 0.001) in AUC between the model based on structured-data-only and the model based on both structured and unstructured data. There are also substantial increases in sensitivity.
Fig. 4 Performance of NBC and BRFC models, by type of data used.
BRFC models perform considerably better than NBC models in terms of AUC across all three datasets. Combining structured and unstructured data yields better performance than using structured data alone, which itself performs better than using unstructured data only.
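The comparison here is in terms of AUC. For readers less familiar with the metric, AUC equals the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half. The sketch below computes it by direct pair counting on hypothetical toy data; it illustrates the metric only and is not the evaluation code used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (case, non-case) pairs, how often the case has the
    higher score; tied pairs contribute one half. Quadratic in the number of
    samples, so intended for illustration rather than large test sets.
    """
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    non_case_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in case_scores:
        for n in non_case_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(non_case_scores))


# Toy data (hypothetical): 4 cases and 5 non-cases.
example_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
example_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1]
print(auc(example_scores, example_labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from non-cases.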
Table 4 Structured-unstructured feature pairs AB with high interaction heterogeneity (IH), where A is a strong risk factor for suicide attempt.
| Structured (A) | Unstructured (B) | Cases: A | Cases: B | Cases: AB expected | Cases: AB observed | Non-cases: A | Non-cases: B | Non-cases: AB expected | Non-cases: AB observed | IH |
|---|---|---|---|---|---|---|---|---|---|---|
| Other, mixed, or unsp. Drug abuse, unsp. Use | Suicide attempts | 2356 | 3741 | 374.53 | 1003 | 148 | 563 | 3.54 | 53 | 77.55 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Section XII | 2356 | 3045 | 304.85 | 849 | 148 | 403 | 2.53 | 43 | 74.72 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Living on the street | 2356 | 1113 | 111.43 | 532 | 148 | 154 | 0.97 | 36 | 66.66 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Prison | 2356 | 2043 | 204.53 | 825 | 148 | 358 | 2.25 | 51 | 62.57 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Intoxications | 2356 | 2663 | 266.61 | 889 | 148 | 462 | 2.91 | 50 | 60.56 |
| Suicidal ideation | Section XII | 1820 | 3045 | 235.49 | 1299 | 127 | 403 | 2.17 | 81 | 54.69 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Undomiciled | 2356 | 2357 | 235.97 | 964 | 148 | 408 | 2.57 | 55 | 54.50 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Opioid dependence | 2356 | 1625 | 162.69 | 841 | 148 | 195 | 1.23 | 44 | 53.86 |
| Suicidal ideation | Schizoaffective schizophrenia | 1820 | 676 | 52.28 | 223 | 127 | 118 | 0.64 | 21 | 52.75 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Sober | 2356 | 3667 | 367.12 | 1329 | 148 | 723 | 4.55 | 76 | 52.29 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Unspecified bipolar disorder | 2356 | 3488 | 349.20 | 932 | 148 | 699 | 4.40 | 49 | 48.53 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Schizoaffective schizophrenia | 2356 | 676 | 67.68 | 172 | 148 | 118 | 0.74 | 15 | 46.44 |
| Opioid abuse, unspec. Use | Sober | 1305 | 3667 | 203.35 | 710 | 78 | 723 | 2.40 | 42 | 46.09 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Methadone | 2356 | 2992 | 299.54 | 1165 | 148 | 653 | 4.11 | 69 | 45.55 |
| Borderline personality | Methadone | 582 | 2992 | 74.00 | 139 | 35 | 653 | 0.97 | 14 | 43.59 |
| Opioid abuse, unspec. Use | Living on the street | 1305 | 1113 | 61.72 | 293 | 78 | 154 | 0.51 | 18 | 43.28 |
| Opioid type dependence, continuous use | Drug seeking | 710 | 463 | 13.97 | 96 | 50 | 51 | 0.11 | 9 | 37.61 |
| Suicidal ideation | Suicidality | 1820 | 2546 | 196.90 | 1057 | 127 | 380 | 2.05 | 58 | 35.84 |
| Other, mixed, or unsp. Drug abuse, unsp. Use | Cluster b | 2356 | 495 | 49.56 | 175 | 148 | 43 | 0.27 | 10 | 35.70 |
| Unspec. Neurotic disorder | Opioid dependence | 1003 | 1625 | 69.26 | 191 | 72 | 195 | 0.60 | 12 | 35.48 |
A high IH value indicates that the relationship between A and B changes significantly between case and non-case populations.
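The "AB expected" values in the table are consistent with the co-occurrence count expected if A and B were independent within a cohort, that is (count of A × count of B) / cohort size. The sketch below computes that quantity and the observed-to-expected ratio for the first row. It does not compute IH, whose formula is not given in this section. The cohort size of 23,566 is an assumption taken from the contingency-table captions; the printed expected values imply a slightly smaller denominator, so the results are approximate. Function names are illustrative.

```python
def expected_joint_count(count_a, count_b, cohort_size):
    """Expected number of patients with both A and B under independence."""
    return count_a * count_b / cohort_size


def observed_to_expected(observed_ab, count_a, count_b, cohort_size):
    """Ratio > 1 means A and B co-occur more often than independence predicts."""
    return observed_ab / expected_joint_count(count_a, count_b, cohort_size)


COHORT_SIZE = 23_566  # assumption: taken from the contingency-table captions

# First row of the table: drug abuse (A) with "Suicide attempts" (B).
ratio_cases = observed_to_expected(1003, 2356, 3741, COHORT_SIZE)
ratio_non_cases = observed_to_expected(53, 148, 563, COHORT_SIZE)

print(f"cases: observed/expected = {ratio_cases:.2f}")
print(f"non-cases: observed/expected = {ratio_non_cases:.2f}")
```

For this pair the features co-occur more often than independence predicts in both cohorts, but by a much larger factor among non-cases, which is the kind of cohort-dependent relationship the table is meant to surface.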
Table 5 Structured-unstructured feature pairs AB with high interaction heterogeneity (IH), where A is a strong protective factor against suicide.
| Structured (A) | Unstructured (B) | Cases: A | Cases: B | Cases: AB expected | Cases: AB observed | Non-cases: A | Non-cases: B | Non-cases: AB expected | Non-cases: AB observed | IH |
|---|---|---|---|---|---|---|---|---|---|---|
| Screening mammogram for malignant neoplasm of breast | Imp. Cont. Dis. | 89 | 661 | 2.50 | 51 | 2091 | 3658 | 325.03 | 875 | 110.08 |
| Annual Exam | Imp. Cont. Dis. | 171 | 661 | 4.80 | 81 | 2596 | 3658 | 403.53 | 1249 | 94.20 |
| Screening mammogram for malignant neoplasm of breast | vacuuming | 89 | 231 | 0.87 | 25 | 2091 | 1546 | 137.37 | 374 | 93.77 |
| Screening digital breast tomosynthesis, bilateral | Imp. Cont. Dis | 103 | 661 | 2.89 | 46 | 1656 | 3658 | 257.41 | 730 | 71.63 |
| Encounter for screening, unspec. | Imp. Cont. Dis | 55 | 661 | 1.54 | 30 | 809 | 3658 | 125.75 | 344 | 66.36 |
| Screening digital breast tomosynthesis, bilateral | vacuuming | 103 | 231 | 1.01 | 23 | 1656 | 1546 | 108.79 | 332 | 62.36 |
| Encounter for screening for malignant neoplasm of colon | Imp. Cont. Dis | 61 | 661 | 1.71 | 31 | 1399 | 3658 | 217.46 | 620 | 57.69 |
| Screening mammogram for malignant neoplasm of breast | Imp. Cont. Dis | 89 | 2019 | 7.64 | 80 | 2091 | 10987 | 976.24 | 1765 | 53.97 |
| Pure hypercholesterolemia, unsp. | Imp. Cont. Dis | 64 | 661 | 1.80 | 30 | 1328 | 3658 | 206.43 | 596 | 49.89 |
| Screening digital breast tomosynthesis, bilateral | Imp. Cont. Dis | 103 | 2019 | 8.84 | 82 | 1656 | 10987 | 773.15 | 1422 | 44.84 |
| Annual Exam | vacuuming | 171 | 231 | 1.68 | 23 | 2596 | 1546 | 170.54 | 423 | 44.53 |
| Physical therapy evaluation low complex 20 mins | Imp. Cont. Dis | 36 | 661 | 1.01 | 22 | 678 | 3658 | 105.39 | 325 | 44.29 |
| Screening, malig. neopl. colon | vacuuming | 61 | 231 | 0.60 | 14 | 1399 | 1546 | 91.91 | 269 | 43.32 |
| Screening, malig. neopl. breast | Imp. Cont. Dis | 30 | 661 | 0.84 | 18 | 571 | 3658 | 88.76 | 272 | 36.53 |
| Other hemorrhoids | Imp. Cont. Dis | 37 | 661 | 1.04 | 17 | 559 | 3658 | 86.89 | 236 | 33.29 |
| Age-related osteoporosis without current pathological fracture | Imp. Cont. Dis | 32 | 661 | 0.90 | 18 | 549 | 3658 | 85.34 | 271 | 32.33 |
| Asymptomatic menopausal state | vacuuming | 20 | 231 | 0.20 | 7 | 387 | 1546 | 25.42 | 81 | 29.70 |
| Other melanin hyperpigmentation | vacuuming | 25 | 231 | 0.25 | 8 | 699 | 1546 | 45.92 | 156 | 29.59 |
| Screening, unspecified | Imp. Cont. Dis | 55 | 2019 | 4.72 | 46 | 809 | 10987 | 377.70 | 692 | 29.58 |
| Mod sed same phys/qhp each addl 15 mins | Imp. Cont. Dis | 28 | 661 | 0.79 | 13 | 822 | 3658 | 127.77 | 329 | 28.45 |
A high IH value indicates that the relationship between A and B changes significantly between case and non-case populations. Among the unstructured concepts, “Imp. Cont. Dis” refers to impulse-control disorder, and “vacuuming” refers to use of hallucinogenic and psychoactive drugs derived from psilocybin mushrooms.
Table 6 Contingency tables for the structured-unstructured pair “Other, mixed, or unspecified drug abuse, unspecified use” (A) and “suicide attempts” (B).
| | Cases, B: 1 | Cases, B: 0 | Non-cases, B: 1 | Non-cases, B: 0 |
|---|---|---|---|---|
| A: 1 | 0.0401 | 0.0541 | 0.0021 | 0.004 |
| A: 0 | 0.1095 | 0.7376 | 0.0204 | 0.9150 |
This feature pair has a high interaction heterogeneity (IH) value of 77.55. Values shown are proportions of the total number of samples (23,566) for each bin.
Table 7 Contingency tables for the structured-unstructured pair “Opioid abuse, unspecified use” (A) and “junk (heroin)” (B).
| | Cases, B: 1 | Cases, B: 0 | Non-cases, B: 1 | Non-cases, B: 0 |
|---|---|---|---|---|
| A: 1 | 0.0443 | 0.0079 | 0.0022 | 0.0010 |
| A: 0 | 0.1071 | 0.7820 | 0.0297 | 0.9085 |
This feature pair has a low IH value of 3.95. Values shown are proportions of the total number of samples (23,566) for each bin. The differences between the two distributions are smaller in Table 7 than in Table 6, resulting in a lower IH value.
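One conventional way to see how much the A-B association differs between cohorts is to compare odds ratios computed from the two contingency tables for each pair. The sketch below does this for the two pairs above. It is an illustration only: the odds ratio is a swapped-in, standard association measure, not the IH statistic, whose formula is not given in this section. The function names are hypothetical, and the cell layout (rows A: 1 and A: 0, columns B: 1 and B: 0) is assumed.

```python
import math


def odds_ratio(p11, p10, p01, p00):
    """Odds ratio of a 2x2 table; pXY is the proportion with A = X and B = Y."""
    return (p11 * p00) / (p10 * p01)


def log_odds_ratio_gap(case_cells, non_case_cells):
    """Absolute difference in log odds ratio between the two cohorts."""
    return abs(math.log(odds_ratio(*non_case_cells))
               - math.log(odds_ratio(*case_cells)))


# Cells in the order (A1B1, A1B0, A0B1, A0B0), copied from the tables above.
drug_abuse_vs_attempts = {
    "cases": (0.0401, 0.0541, 0.1095, 0.7376),
    "non_cases": (0.0021, 0.004, 0.0204, 0.9150),
}
opioid_abuse_vs_junk = {
    "cases": (0.0443, 0.0079, 0.1071, 0.7820),
    "non_cases": (0.0022, 0.0010, 0.0297, 0.9085),
}

gap_high_ih = log_odds_ratio_gap(drug_abuse_vs_attempts["cases"],
                                 drug_abuse_vs_attempts["non_cases"])
gap_low_ih = log_odds_ratio_gap(opioid_abuse_vs_junk["cases"],
                                opioid_abuse_vs_junk["non_cases"])

print(f"high-IH pair: log odds ratio gap = {gap_high_ih:.2f}")
print(f"low-IH pair:  log odds ratio gap = {gap_low_ih:.2f}")
```

Under this measure the high-IH pair shows the larger gap between cohorts, in line with the observation that the two distributions differ less for the low-IH pair.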
Fig. 5 Interaction heterogeneity versus joint suicide risk.
A comparison of joint suicide attempt risk and interaction heterogeneity. Each data point corresponds to a structured-unstructured feature pair AB. The x-axis shows the joint suicide risk of features A and B, defined as the log of the ratio of the expected joint occurrences of AB in the case vs. non-case cohorts. The y-axis shows the interaction heterogeneity, a measure of how much the interaction between A and B differs between case and non-case cohorts. The plot shows that feature pairs with similar joint suicide attempt risk can have very different interaction heterogeneity.
Table 8 Structured-unstructured feature pairs A-B with high interaction heterogeneity (IH) values.
| Structured feature (A) | Unstructured feature (B) | Joint suicide attempt risk | IH |
|---|---|---|---|
| Other, mixed, or unspecified drug abuse, unspecified use | Suicide attempts | 2.02 | 77.55 |
| Other, mixed, or unspecified drug abuse, unspecified use | Section XII | 2.08 | 74.72 |
| Other, mixed, or unspecified drug abuse, unspecified use | Living on the street | 2.06 | 66.66 |
| Other, mixed, or unspecified drug abuse, unspecified use | Prison | 1.96 | 62.57 |
| Other, mixed, or unspecified drug abuse, unspecified use | Undomiciled | 2.02 | 61.18 |
| Other, mixed, or unspecified drug abuse, unspecified use | Intoxications | 1.96 | 60.56 |
| Suicidal ideation | Section XII | 2.03 | 54.69 |
| Other, mixed, or unspecified drug abuse, unspecified use | Undomiciled | 1.96 | 54.50 |
| Other, mixed, or unspecified drug abuse, unspecified use | Opioid dependence | 2.12 | 53.86 |
| Suicidal ideation | Schizoaffective schizophrenia | 1.91 | 52.75 |
| Other, mixed, or unspecified drug abuse, unspecified use | Sober | 1.91 | 52.29 |
| Opioid abuse, unspecified use | Methadone | 2.02 | 48.85 |
| Other, mixed, or unspecified drug abuse, unspecified use | Unspecified bipolar disorder | 1.90 | 48.53 |
| Suicidal ideation | Delusions | 1.86 | 48.32 |
| Other, mixed, or unspecified drug abuse, unspecified use | Methadone | 2.00 | 46.72 |
| Other, mixed, or unspecified drug abuse, unspecified use | Schizoaffective schizophrenia | 1.96 | 46.44 |
| Opioid abuse, unspecified use | Sober | 1.93 | 46.09 |
| Other, mixed, or unspecified drug abuse, unspecified use | Methadone | 1.86 | 45.55 |
| Cocaine abuse, unspecified use | Methadone | 1.97 | 43.78 |
| Borderline personality | Methadone | 1.88 | 43.59 |
The joint suicide attempt risk of features A and B is defined as the log of the ratio of the expected joint occurrences of AB in the case vs. non-case cohorts.
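As a worked example of this definition, the sketch below computes the joint risk from the expected AB counts listed for cases and non-cases in the earlier risk-factor table. The base of the logarithm is not stated; base 10 is an assumption here, chosen because it reproduces the tabulated values (for example, expected counts of 374.53 and 3.54 give 2.02). The function name is illustrative.

```python
import math


def joint_risk(expected_ab_cases, expected_ab_non_cases):
    """Log ratio of expected AB co-occurrences, cases vs. non-cases.

    Base 10 is an assumption: the text only says "log".
    """
    return math.log10(expected_ab_cases / expected_ab_non_cases)


# (B concept, expected AB in cases, expected AB in non-cases) for pairs whose
# structured feature A is "Other, mixed, or unspecified drug abuse".
examples = [
    ("Suicide attempts", 374.53, 3.54),
    ("Section XII", 304.85, 2.53),
    ("Living on the street", 111.43, 0.97),
    ("Prison", 204.53, 2.25),
]

for concept, cases, non_cases in examples:
    print(f"{concept}: joint risk = {joint_risk(cases, non_cases):.2f}")
```

A joint risk near 2 therefore means the pair is expected to co-occur roughly 100 times more often among cases than among non-cases, which is why pairs in this table share similar joint risk while differing widely in IH.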
Table 9 Structured-unstructured feature pairs A-B with low interaction heterogeneity (IH) values.
| Structured feature (A) | Unstructured feature (B) | Joint suicide attempt risk | IH |
|---|---|---|---|
| Opioid type dependence, continuous use | Hearing voices | 2.03 | 0.05 |
| Opioid type dependence, continuous use | Suicidality | 1.98 | 0.05 |
| Methadone tab 40 mg | Junk (heroin) | 1.73 | 0.05 |
| Barbiturate and similarly acting sedative or hypnotic abuse, unspecified use | Mugged (assault) | 1.96 | 0.04 |
| Unspecified neurotic disorder | VH (visual hallucinations) | 1.89 | 0.04 |
| Other, mixed, or unspecified drug abuse, unspecified use | Judgment impaired | 2.12 | 0.03 |
| Barbiturate and similarly acting sedative or hypnotic abuse, unspecified use | Prison | 2.04 | 0.03 |
| Opioid type dependence, continuous use | Junk (heroin) | 1.83 | 0.02 |
| Cocaine abuse, unspecified use | Blackouts | 1.88 | 0.02 |
| Methadone tab 40 mg | Junk (heroin) | 1.83 | 0.02 |
| Opioid type dependence, continuous use | Thioridazine | 1.99 | 0.02 |
| Barbiturate and similarly acting sedative or hypnotic abuse, unspecified use | Junk (heroin) | 2.11 | 0.01 |
| Acute alcoholic intoxication in alcoholism, continuous drinking behavior | Hallucinosis | 1.99 | 0.01 |
| Suicidal ideation | Crack | 2.02 | 0.01 |
| Methadone tab 40 mg | Stolen | 1.73 | 0.01 |
| Unspecified neurotic disorder | Sexual assaults | 1.81 | 0.01 |
| Depressive Neuroses (MS v24) | Sober | 1.96 | 0.00 |
| Depressive Neuroses (MS v24) | Prison | 2.01 | 0.00 |
| Unspecified neurotic disorder | VH | 1.85 | 0.00 |
| Cocaine abuse, continuous use | VH | 1.95 | 0.00 |
The joint suicide attempt risk of features A and B is defined as the log of the ratio of the expected joint occurrences of AB in the case vs. non-case cohorts.
Table 10 Contingency tables of structured-unstructured concept pairs A-B, for case and non-case cohorts.
| | Cases, B: 1 | Cases, B: 0 | Non-cases, B: 1 | Non-cases, B: 0 |
|---|---|---|---|---|
| A: 1 | | | | |
| A: 0 | | | | |