Natalia Viani, Joyce Kam, Lucia Yin, André Bittar, Rina Dutta, Rashmi Patel, Robert Stewart, Sumithra Velupillai.
Abstract
BACKGROUND: Duration of untreated psychosis (DUP) is an important clinical construct in the field of mental health, as longer DUP can be associated with worse intervention outcomes. DUP estimation requires knowledge about when psychosis symptoms first started (symptom onset), and when psychosis treatment was initiated. Electronic health records (EHRs) represent a useful resource for retrospective clinical studies on DUP, but the core information underlying this construct is most likely to lie in free text, meaning it is not readily available for clinical research. Natural Language Processing (NLP) is a means of addressing this problem by automatically extracting relevant information in a structured form. As a first step, it is important to identify appropriate documents, i.e., those that are likely to include the information of interest. Next, temporal information extraction methods are needed to identify time references for early psychosis symptoms. This NLP challenge requires solving three different tasks: time expression extraction, symptom extraction, and temporal "linking". In this study, we focus on the first step, using two relevant EHR datasets.
Keywords: Electronic health records; Mental health; Natural language processing; Schizophrenia; Temporal information extraction
Year: 2020 PMID: 32156302 PMCID: PMC7063705 DOI: 10.1186/s13326-020-00220-2
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 Example of clinical text describing the onset of psychosis symptoms. The example includes two structured dates (visit date and birth date) and four time expressions that are written in the text (“when he was 8 years old”, “oct 2009”, “since his teens”, “today”). As shown in the figure, time expressions can be normalized and placed on a timeline in order to reconstruct patient trajectories.
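The normalization idea in Fig. 1 can be sketched in a few lines of Python. This is only an illustration, not the system used in the study: the birth and visit dates are invented, the function names are hypothetical, and the age-related rule (birth year plus age) is a simplification. The expression “since his teens” is left out because the record does not say which convention maps it to a value.

```python
from datetime import date


def normalize_age_expression(birth_date: date, age_years: int) -> str:
    """Return a year-granularity value for 'when he was N years old'.

    Simplification: birth year + age. The true date may fall in the
    following calendar year, depending on the birthday.
    """
    return str(birth_date.year + age_years)


def normalize_today(visit_date: date) -> str:
    """Anchor 'today' to the structured visit date (ISO format)."""
    return visit_date.isoformat()


# Hypothetical structured dates; these values are not from the record.
BIRTH_DATE = date(1990, 5, 17)
VISIT_DATE = date(2012, 3, 2)

# (normalized value, original text). ISO-style strings sort chronologically.
timeline = sorted([
    (normalize_today(VISIT_DATE), "today"),
    ("2009-10", "oct 2009"),
    (normalize_age_expression(BIRTH_DATE, 8), "when he was 8 years old"),
])

for value, text in timeline:
    print(f"{value:<12}{text}")
```

Sorting works here only because every value is an ISO-style string beginning with a four-digit year; durations or frequencies would need separate handling.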
Manual annotation results on the two EHR corpora (First referral and Early intervention)
| Corpus | Batch | Documents (# tokens) | All annotations (A1, A2) | Overlapping annotations | Same value | IAA (acc) |
|---|---|---|---|---|---|---|
| First referral | dev | 10 (49 K) | 932, 972 | 913 | 768 | 0.84 |
| First referral | valid | 23 (83 K) | 1455, 1475 | 1429 | 1254 | 0.88 |
| First referral | test | 19 (74 K) | 1119, 1159 | 1100 | 927 | 0.84 |
| Early intervention | batchA | 14 (18 K) | 435, 391 | 353 | 300 | 0.85 |
| Early intervention | batchB | 35 (57 K) | 867, 822 | 714 | 600 | 0.84 |
Manual annotation results on the two EHR corpora (First referral and Early intervention) divided into development (dev), validation (valid) and test sets, and batches (batchA and batchB), respectively. IAA: Inter-annotator agreement; A1/A2: annotators 1 and 2
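The IAA (acc) column appears to be the share of overlapping annotations to which both annotators assigned the same normalized value. The sketch below assumes that definition (it reproduces the reported figures when rounded to two decimals); the function name is hypothetical.

```python
def iaa_accuracy(same_value: int, overlapping: int) -> float:
    """Fraction of overlapping annotations with an identical value."""
    if overlapping <= 0:
        raise ValueError("overlapping must be positive")
    return same_value / overlapping


# (corpus, batch) -> (overlapping annotations, same value), from the table.
annotation_counts = {
    ("First referral", "dev"): (913, 768),
    ("First referral", "valid"): (1429, 1254),
    ("First referral", "test"): (1100, 927),
    ("Early intervention", "batchA"): (353, 300),
    ("Early intervention", "batchB"): (714, 600),
}

for (corpus, batch), (overlapping, same) in annotation_counts.items():
    print(corpus, batch, round(iaa_accuracy(same, overlapping), 2))
```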
Fig. 2 Filtering steps from EHR documents related to early psychosis intervention services. First, we retain documents with length and average line length (avg_line_length) greater than a certain threshold. Then, we keep documents including at least one psychosis symptom keyword (from a list of predefined keywords). Finally, we retain documents containing more than five time expressions (as automatically extracted by a rule-based system).
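The three filtering steps can be sketched as a single predicate. Everything specific here is an assumption made for illustration: the length thresholds, the keyword list, and the year-matching regex, which merely stands in for the rule-based time expression extractor. Only the "more than five time expressions" cut-off comes from the caption.

```python
import re

# Hypothetical keyword stems; the study's predefined list is not given here.
SYMPTOM_KEYWORDS = ("hallucination", "delusion", "paranoi")

# Stand-in for the rule-based extractor: counts four-digit years only.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def count_time_expressions(text: str) -> int:
    return len(YEAR_RE.findall(text))


def avg_line_length(text: str) -> float:
    """Mean length of non-empty lines (0.0 for an empty document)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return sum(len(line) for line in lines) / len(lines) if lines else 0.0


def keep_document(text: str,
                  min_length: int = 500,           # hypothetical threshold
                  min_avg_line_length: float = 40,  # hypothetical threshold
                  min_time_expressions: int = 5) -> bool:
    # Step 1: length and average line length above the thresholds.
    if len(text) <= min_length or avg_line_length(text) <= min_avg_line_length:
        return False
    # Step 2: at least one psychosis symptom keyword.
    lowered = text.lower()
    if not any(keyword in lowered for keyword in SYMPTOM_KEYWORDS):
        return False
    # Step 3: more than five time expressions.
    return count_time_expressions(text) > min_time_expressions
```

Ordering the checks from cheapest to most expensive means the time expression extractor, the costliest step in a real pipeline, runs only on documents that survive the first two filters.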
Fig. 3 Psychosis symptom keyword and time expression counts in the early intervention services dataset. The x-axis represents the number of documents obtained after applying length, average line length, and psychosis symptom keyword filters (9901). The y-axis represents normalized counts for psychosis symptom keywords (blue) and automatically extracted time expressions (orange), normalized to the range 0–1. Texts containing many temporal expressions are more likely to also include relevant psychosis symptom keywords.
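The caption states only that counts are scaled to the range 0–1. One common way to do this is min-max scaling, sketched below as an assumption rather than as the method actually used for the figure; the example counts are invented.

```python
from typing import List, Sequence


def min_max_scale(counts: Sequence[float]) -> List[float]:
    """Rescale counts linearly so the minimum maps to 0 and the maximum to 1."""
    if not counts:
        return []
    lowest, highest = min(counts), max(counts)
    if highest == lowest:
        # All values identical: avoid division by zero.
        return [0.0 for _ in counts]
    return [(c - lowest) / (highest - lowest) for c in counts]


# Hypothetical per-document time expression counts.
print(min_max_scale([2, 4, 6]))
```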
Automated time expression normalization results on the two EHR corpora (First referral and Early intervention)
| Corpus | Batch | Reference standard | TPs | System1 (value acc) | System2 (value acc) |
|---|---|---|---|---|---|
| First referral | dev | 768 | 686 | 0.77 | 0.86 |
| First referral | valid | 1254 | 1115 | 0.76 | 0.80 |
| First referral | test | 927 | 828 | 0.66 | 0.71 |
| Early intervention | batchA | 300 | 272 | 0.76 | 0.86 |
| Early intervention | batchB | 600 | 556 | 0.82 | 0.86 |
Automated time expression extraction results (normalized values) on the two EHR corpora (First referral and Early intervention), divided into development (dev), validation (valid) and test sets, and batches (batchA and batchB), respectively. Accuracy values are reported on overlapping annotations (TPs) for both the first system (System1) and its refined version (System2)
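Value accuracy on overlapping annotations can be sketched as follows. The assumptions are that an annotation is a (start, end, value) triple with character offsets, that a system annotation counts as a TP when its span overlaps a reference span, and that the first overlapping system annotation is the one compared. The record does not spell out these matching details, so they are illustrative choices.

```python
from typing import List, Tuple

# (start offset, end offset, normalized value); end is exclusive.
Annotation = Tuple[int, int, str]


def spans_overlap(a: Annotation, b: Annotation) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def value_accuracy(reference: List[Annotation],
                   system: List[Annotation]) -> Tuple[int, float]:
    """Return (number of TPs, share of TPs whose value matches)."""
    true_positives = 0
    correct = 0
    for ref in reference:
        match = next((s for s in system if spans_overlap(ref, s)), None)
        if match is None:
            continue  # missed by the system, so excluded from accuracy
        true_positives += 1
        if match[2] == ref[2]:
            correct += 1
    accuracy = correct / true_positives if true_positives else 0.0
    return true_positives, accuracy


# Hypothetical annotations for illustration.
reference = [(0, 8, "2009-10"), (20, 25, "1998"), (40, 45, "P2Y")]
system = [(0, 8, "2009-10"), (21, 25, "1999")]
print(value_accuracy(reference, system))
```

Because accuracy is computed over TPs only, a system that misses many expressions can still score well on this measure, which is why the TPs column sits next to the reference standard counts.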
Automated time expression normalization results on the First referral corpus, divided per time expression type
| Batch | Type | IAA (matches) | IAA (acc) | System2 (TPs) | System2 (acc) | System2 (acc*) |
|---|---|---|---|---|---|---|
| dev | Date | 572 | 0.84 | 427 | 0.93 | 0.93 |
| dev | Time | 77 | 0.87 | 65 | | |
| dev | Duration | 137 | 0.82 | 102 | 0.74 | 0.74 |
| dev | Frequency | 58 | 0.95 | 52 | 0.92 | 0.92 |
| dev | Age_related | 69 | 0.81 | 40 | 0.93 | 0.93 |
| valid | Date | 845 | 0.91 | 705 | 0.85 | 0.85 |
| valid | Time | 128 | 0.79 | 100 | | |
| valid | Duration | 209 | 0.77 | 147 | 0.84 | 0.84 |
| valid | Frequency | 123 | 0.98 | 101 | 0.95 | 0.95 |
| valid | Age_related | 124 | 0.81 | 62 | 0.73 | 0.73 |
| test | Date | 554 | 0.92 | 482 | 0.82 | 0.82 |
| test | Time | 156 | 0.78 | 116 | | |
| test | Duration | 192 | 0.72 | 128 | 0.77 | 0.77 |
| test | Frequency | 90 | 0.72 | 48 | 0.83 | 0.83 |
| test | Age_related | 108 | 0.86 | 54 | 0.80 | 0.80 |
Automated time expression extraction results (normalized values) on the First referral corpus (dev, valid, test), divided per time expression type. Results are presented in terms of inter-annotator agreement (IAA), system raw accuracy (System2 acc) and system relaxed accuracy (System2 acc*), where expressions with type Time are evaluated only on the “Thh:mm” portion
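The difference between raw and relaxed accuracy can be illustrated with a small comparison function. It assumes normalized values follow an ISO 8601 style such as `2012-03-02T14:30`; the function names and example values are hypothetical.

```python
import re
from typing import Optional

# Captures the "Thh:mm" portion of a value such as "2012-03-02T14:30".
TIME_PORTION_RE = re.compile(r"T\d{2}:\d{2}")


def time_portion(value: str) -> Optional[str]:
    match = TIME_PORTION_RE.search(value)
    return match.group(0) if match else None


def values_match(reference: str, system: str, timex_type: str,
                 relaxed: bool = False) -> bool:
    """Strict equality, except Time values under relaxed evaluation."""
    if relaxed and timex_type == "Time":
        ref_time, sys_time = time_portion(reference), time_portion(system)
        if ref_time is not None and sys_time is not None:
            return ref_time == sys_time
    # Fall back to strict comparison when no "Thh:mm" portion is present.
    return reference == system


# Same clock time, different date: counted as correct only when relaxed.
print(values_match("2012-03-02T14:30", "2012-03-03T14:30", "Time"))
print(values_match("2012-03-02T14:30", "2012-03-03T14:30", "Time", relaxed=True))
```

Relaxation applies to the Time type only, so the other types score the same under both measures, as the acc and acc* columns show.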