Cosmin A Bejan, Michael Ripperger, Drew Wilimitis, Ryan Ahmed, JooEun Kang, Katelyn Robinson, Theodore J Morley, Douglas M Ruderfer, Colin G Walsh.
Abstract
Methods relying on diagnostic codes to identify suicidal ideation and suicide attempt in Electronic Health Records (EHRs) at scale are suboptimal because suicide-related outcomes are heavily under-coded. We propose to improve the ascertainment of suicidal outcomes using natural language processing (NLP). We developed information retrieval methodologies to search over 200 million notes from the Vanderbilt EHR. Suicide query terms were extracted using word2vec. A weakly supervised approach was designed to label cases of suicidal outcomes. The NLP validation of the top 200 retrieved patients showed high performance for suicidal ideation (area under the precision-recall curve [AUPRC]: 98.6, 95% confidence interval [CI] 97.1-99.5) and suicide attempt (AUPRC: 97.3, 95% CI 95.2-98.7). Case extraction produced the best performance when combining NLP and diagnostic codes and when accounting for negated suicide expressions in notes. Overall, we demonstrated that scalable and accurate NLP methods can be developed to identify suicidal behavior in EHRs to enhance prevention efforts, predictive models, and precision medicine.
Year: 2022 PMID: 36071081 PMCID: PMC9452591 DOI: 10.1038/s41598-022-19358-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. A weakly supervised method of case label assignment for a ranked list of patients retrieved by the NLP system.
Characteristics of patients retrieved by the NLP system.
| Characteristic | SI, N | SI, % | SA, N | SA, % |
|---|---|---|---|---|
| Total patients retrieved | 187,047 | | 52,738 | |
| Patients w/ ICD codes | 24,053 | 12.9 | 12,393 | 23.5 |
| Patients w/ 1+ positive mentions | 93,690 | 50.1 | 50,108 | 95.0 |
| Patients w/ chart review | | | | |
| Cases | 921 | 0.5 | 682 | 1.3 |
| Non-cases | 79 | 0.04 | 138 | 0.3 |
| Patients w/ psychiatric forms | | | | |
| Cases | 4484 | 2.4 | 2164 | 4.1 |
| Non-cases | 4308 | 2.3 | 1380 | 2.6 |
The extraction of cases and non-cases from psychiatric forms and chart review was restricted to the patients retrieved using NLP. The cases from psychiatric forms have at least one positive field while the non-cases have all the fields negated.
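The labeling rule for psychiatric forms stated above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `label_from_form`, the boolean encoding of form fields, and the treatment of unfilled fields (`None`) are all assumptions made for this example.

```python
from typing import Optional, Sequence


def label_from_form(fields: Sequence[Optional[bool]]) -> Optional[str]:
    """Return "case", "non-case", or None for one patient's form fields.

    Each entry is True (positive field), False (negated field), or
    None (field not filled in; a hypothetical state assumed here).
    """
    if any(f is True for f in fields):
        return "case"        # at least one positive field
    if fields and all(f is False for f in fields):
        return "non-case"    # every field explicitly negated
    return None              # empty or incomplete form: left unlabeled
```

Under this sketch, a form with one positive field yields a case regardless of the other fields, while a form with any unfilled field and no positive field is left unlabeled rather than counted as a non-case.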
Figure 2. Distribution of patients with ICD codes across the ranked lists of suicidal ideation (SI) and suicide attempt (SA) patients. For each retrieved list, patients were first ordered by their similarity score (or rank position) such that the most relevant ones are ranked at the top of the list. Each list was then split into 100 equal groups (percentiles) with the first and last percentiles representing the highest and lowest ranked patients, respectively. The percent of patients with relevant ICD codes was computed for each percentile.
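The percentile computation described in this caption could look roughly like the sketch below. The function name `icd_rate_by_percentile` and the input encoding (one boolean per patient, already sorted from most to least relevant) are assumptions for illustration. How the authors handled list lengths not divisible by 100 is not stated, so the integer-boundary split here is only one reasonable choice.

```python
from typing import List, Sequence


def icd_rate_by_percentile(has_icd_ranked: Sequence[bool],
                           n_groups: int = 100) -> List[float]:
    """Percent of patients with a relevant ICD code in each rank group.

    has_icd_ranked[i] is True if the patient at rank i (0 = most relevant)
    has a relevant ICD code.
    """
    n = len(has_icd_ranked)
    if n < n_groups:
        raise ValueError("need at least one patient per group")
    rates = []
    for g in range(n_groups):
        # Integer boundaries keep group sizes within one patient of each other.
        start = g * n // n_groups
        end = (g + 1) * n // n_groups
        group = has_icd_ranked[start:end]
        rates.append(100.0 * sum(group) / len(group))
    return rates
```

A curve that starts high and falls toward the last group would indicate that ICD-coded patients concentrate at the top of the ranking.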
NLP and ICD10CM validation.
| Patient selection | Outcome | P@200 | AUPRC (95% CI) |
|---|---|---|---|
| Top 200 patients retrieved by the NLP system | SI | 98.5 | 98.6 (97.1-99.5) |
| Top 200 patients retrieved by the NLP system | SA | 96.5 | 97.3 (95.2-98.7) |
AUPRC area under the precision-recall curve, CI confidence interval, P@K precision at top K retrieved patients, P precision, SI suicidal ideation, SA suicide attempt.
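As a rough illustration of the two metrics reported in the table, the sketch below computes P@K and average precision from a ranked list of binary labels (1 = confirmed case). Average precision is one common estimator of the area under the precision-recall curve. Whether the authors used this estimator is not stated in the source, and the function names are hypothetical.

```python
from typing import Sequence


def precision_at_k(labels_ranked: Sequence[int], k: int) -> float:
    """Fraction of true cases among the top-k ranked patients."""
    if not 1 <= k <= len(labels_ranked):
        raise ValueError("k must be between 1 and the list length")
    return sum(labels_ranked[:k]) / k


def average_precision(labels_ranked: Sequence[int]) -> float:
    """Mean of the precision values taken at the rank of each true case."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels_ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

For reference, a P@200 of 98.5 for SI corresponds to 197 of the top 200 patients being true cases, and 96.5 for SA corresponds to 193 of 200.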
Figure 3. Precision-recall curves for suicidal ideation (SI) and suicide attempt (SA) evaluation of the top 200 highest ranked patients retrieved by the NLP system.
Evaluation of label assignment methods for suicidal ideation (SI) and suicide attempt (SA).
| Metric | SI, all retrieved (N = 187,047): NLP | SI, all retrieved: NLP + ICD | SI, w/ 1+ positive (N = 93,690): NLP | SI, w/ 1+ positive: NLP + ICD | SA, all retrieved (N = 52,738): NLP | SA, all retrieved: NLP + ICD | SA, w/ 1+ positive (N = 50,108): NLP | SA, w/ 1+ positive: NLP + ICD |
|---|---|---|---|---|---|---|---|---|
| AUPRC | 55.2 | 58.5 | 57.5 | 62.2 | 44.5 | 52.9 | 45.2 | 54.1 |
| Top K for P@K = 90% | 930 | 1270 | 980 | 1321 | 360 | 384 | 365 | 390 |
| Patients w/o ICD codes | 140 (15.1%) | 186 (14.6%) | 146 (14.9%) | 189 (14.3%) | 79 (21.9%) | 83 (21.6%) | 79 (21.6%) | 84 (21.5%) |
| Patients w/o manual review | 612 (65.8%) | 909 (71.6%) | 650 (66.3%) | 952 (72.1%) | 116 (32.2%) | 134 (34.9%) | 119 (32.6%) | 138 (35.4%) |
| Patients w/o psychiatric forms | 676 (72.7%) | 952 (75.0%) | 715 (73.0%) | 996 (75.4%) | 304 (84.4%) | 325 (84.6%) | 307 (84.1%) | 329 (84.4%) |
| Top K for P@K = 80% | 2941 | 5641 | 2971 | 5790 | 670 | 1420 | 680 | 1455 |
| Patients w/o ICD codes | 580 (19.7%) | 1581 (28.0%) | 559 (18.8%) | 1519 (26.2%) | 153 (22.8%) | 306 (21.5%) | 153 (22.5%) | 313 (21.5%) |
| Patients w/o manual review | 2491 (84.7%) | 5141 (91.1%) | 2527 (85.1%) | 5290 (91.4%) | 347 (51.8%) | 1000 (70.4%) | 354 (52.1%) | 1034 (71.1%) |
| Patients w/o psychiatric forms | 2408 (81.9%) | 4824 (85.5%) | 2427 (81.7%) | 4933 (85.2%) | 570 (85.1%) | 1227 (86.4%) | 580 (85.3%) | 1257 (86.4%) |
The “All retrieved” columns report results for methods run on the initial lists of all retrieved patients for SI (N = 187,047) and SA (N = 52,738). The “w/ 1+ positive” columns correspond to methods using only patients with at least one positively asserted suicide mention in their notes. The “NLP” and “NLP + ICD” columns correspond to methods whose suicide label assignment uses NLP evidence alone and NLP evidence combined with ICD codes, respectively.
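The “Top K for P@K = 90%” and “Top K for P@K = 80%” rows can be read as the deepest cut-off in the ranked list at which precision still meets the target. The sketch below implements that reading. It is an assumption about how K was chosen, and `largest_k_at_precision` is a hypothetical name.

```python
from typing import Sequence


def largest_k_at_precision(labels_ranked: Sequence[int],
                           min_precision: float) -> int:
    """Largest K such that precision over the top K is >= min_precision.

    Returns 0 if no prefix of the ranked list reaches the target.
    """
    best_k = 0
    hits = 0
    for k, label in enumerate(labels_ranked, start=1):
        if label:
            hits += 1
        if hits / k >= min_precision:
            best_k = k  # keep scanning: precision can recover further down
    return best_k
```

Because precision is not monotonic in K, the scan covers the whole list instead of stopping at the first drop below the target.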
Figure 4. Evaluation comparing NLP and NLP + ICD label assignment methods for suicidal ideation (SI) and suicide attempt (SA). The patients used in this evaluation have at least one positively asserted mention of suicide in their notes.
Figure 5. Comparative analysis for extracting the top K highest ranked suicidal ideation (SI) and suicide attempt (SA) patients using various configurations of the suicide label assignment method described in Fig. 1. For each configuration, the label assignment method was run 1000 times. The “Patient retrieval: all” experiments include all the patients retrieved by the NLP system, while the “Patient retrieval: w/ 1+ positive” experiments use only patients with at least one positive suicide mention in their notes. The “NLP” and “NLP + ICD” experiments used label assignment probabilities based on NLP alone and on NLP combined with ICD codes, respectively.
High-precision extraction of suicidal ideation (SI) and suicide attempt (SA) cases from the EHR.
| PPV criterion | SI resource | SI N (pooled) | SA resource | SA N (pooled) |
|---|---|---|---|---|
| PPV ≥ 90% | NLP, top K, P@K = 90 | 1209 | NLP, top K, P@K = 90 | 380 |
| | +chart review (SI cases) | 1831 | +chart review (SA cases) | 846 |
| | +psychiatric forms (SI cases) | 6670 | +psychiatric forms (SA cases) | 4978 |
| | +ICD10CM codes (PPV = 96%) | 22,218 | | |
| PPV ≥ 80% | NLP, top K, P@K = 80 | 5342 | NLP, top K, P@K = 80 | 1384 |
| | +chart review (SI cases) | 5833 | +chart review (SA cases) | 1681 |
| | +psychiatric forms (SI cases) | 10,150 | +psychiatric forms (SA cases) | 5701 |
| | +ICD10CM codes (PPV = 96%) | 23,848 | +ICD10CM codes (PPV = 85%) | 18,843 |
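One way to read the cumulative “+” rows is that each resource adds its distinct patients to a running pool, so a patient found by several resources is counted once. The sketch below shows that reading only. The source does not define “pooled” precisely, and the function name and toy patient identifiers are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def cumulative_case_counts(
    sources: List[Tuple[str, Iterable[str]]]
) -> List[Tuple[str, int]]:
    """Running count of distinct patient IDs as each source is added."""
    pooled: Set[str] = set()
    counts = []
    for name, patient_ids in sources:
        pooled.update(patient_ids)  # duplicates across sources count once
        counts.append((name, len(pooled)))
    return counts


# Toy example with made-up identifiers.
example = cumulative_case_counts([
    ("NLP top K", ["p1", "p2"]),
    ("+chart review", ["p2", "p3"]),
    ("+psychiatric forms", ["p4"]),
])
```

Under this reading, the increase between consecutive rows is the number of patients contributed only by the newly added resource.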
Characteristics of suicidal ideation (SI) and suicide attempt (SA) cases identified in the EHR with a precision of at least 90%.
| Characteristic | SI, N | SI, % | SA, N | SA, % |
|---|---|---|---|---|
| Total | 22,218 | 100 | 4978 | 100 |
| Age, years* | 31.7 | 17.4 | 35.6 | 15.7 |
| Dead | 541 | 2.4 | 207 | 4.2 |
| Sex | | | | |
| Male | 10,026 | 45.1 | 2165 | 43.5 |
| Female | 12,191 | 54.9 | 2813 | 56.5 |
| Unknown | 1 | 0 | 0 | 0 |
| Race | | | | |
| White | 17,210 | 77.5 | 3999 | 80.3 |
| Black | 3254 | 14.6 | 743 | 14.9 |
| Asian | 271 | 1.2 | 37 | 0.7 |
| Native | 38 | 0.2 | 14 | 0.3 |
| Unknown | 1445 | 6.5 | 185 | 3.7 |
| Ethnicity | | | | |
| Not Hispanic or Latino | 19,957 | 89.8 | 4686 | 94.1 |
| Hispanic or Latino | 919 | 4.1 | 125 | 2.5 |
| Unknown | 1342 | 6.0 | 167 | 3.4 |
*Reported as mean and standard deviation.