
Generating contextual embeddings for emergency department chief complaints.

David Chang¹, Woo Suk Hong², Richard Andrew Taylor².

Abstract

OBJECTIVE: We learn contextual embeddings for emergency department (ED) chief complaints using Bidirectional Encoder Representations from Transformers (BERT), a state-of-the-art language model, to derive a compact and computationally useful representation for free-text chief complaints.
MATERIALS AND METHODS: Retrospective data on 2.1 million adult and pediatric ED visits was obtained from a large healthcare system covering the period of March 2013 to July 2019. A total of 355 497 (16.4%) visits from 65 737 (8.9%) patients were removed for absence of either a structured or unstructured chief complaint. To ensure adequate training set size, chief complaint labels that comprised less than 0.01%, or 1 in 10 000, of all visits were excluded. The cutoff threshold was incremented on a log scale to create seven datasets of decreasing sparsity. The classification task was to predict the provider-assigned label from the free-text chief complaint using BERT, with Long Short-Term Memory (LSTM) and Embeddings from Language Models (ELMo) as baselines. Performance was measured as the Top-k accuracy from k = 1:5 on a hold-out test set comprising 5% of the samples. The embedding for each free-text chief complaint was extracted as the final 768-dimensional layer of the BERT model and visualized using t-distributed stochastic neighbor embedding (t-SNE).
RESULTS: The models achieved increasing performance with datasets of decreasing sparsity, with BERT outperforming both LSTM and ELMo. The BERT model yielded Top-1 accuracies of 0.65 and 0.69, Top-3 accuracies of 0.87 and 0.90, and Top-5 accuracies of 0.92 and 0.94 on datasets comprised of 434 and 188 labels, respectively. Visualization using t-SNE mapped the learned embeddings in a clinically meaningful way, with related concepts embedded close to each other and broader types of chief complaints clustered together.
DISCUSSION: Despite the inherent noise in the chief complaint label space, the model was able to learn a rich representation of chief complaints and generate reasonable predictions of their labels. The learned embeddings accurately predict provider-assigned chief complaint labels and map semantically similar chief complaints to nearby points in vector space.
CONCLUSION: Such a model may be used to automatically map free-text chief complaints to structured fields and to assist the development of a standardized, data-driven ontology of chief complaints for healthcare institutions.


Keywords: BERT; chief complaint; emergency medicine; machine learning; natural language processing

Year: 2020 PMID: 32734154 PMCID: PMC7382638 DOI: 10.1093/jamiaopen/ooaa022

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

LAY SUMMARY

Patient care in the emergency department (ED) is guided by the patient’s chief complaint, a concise statement regarding the patient’s medical history, current symptoms, and reason for visit. Because chief complaints are often stored as free-text descriptions of varying length and quality, secondary use of chief complaint data in operational decisions and research has been impractical. Moreover, even when chief complaints are stored in a structured format in electronic health records, there exists no standard nomenclature on how they are categorized. To remedy this problem, we use Bidirectional Encoder Representations from Transformers, a state-of-the-art language model, on a dataset of 1.8 million free-text ED chief complaints to derive a numerical representation for chief complaints, called “contextual embeddings.” We show that contextual embeddings accurately predict provider-assigned chief complaint labels and map chief complaints with similar meaning (eg “wheezing” and “breathing problem”) to nearby points in vector space. The model with its associated embeddings may be used to automatically map free-text chief complaints to structured labels and to help derive a standardized dictionary of chief complaints for healthcare institutions.

BACKGROUND AND SIGNIFICANCE

Patient care in the emergency department (ED) is guided by the patient’s chief complaint. Collected during the first moments of the patient encounter, a chief complaint is a concise statement regarding the patient’s medical history, current symptoms, and reason for visit. While a chief complaint can be represented in a structured format with predefined categories, it is often captured in unstructured, free-text descriptions of varying length and quality. Moreover, even when chief complaints are stored in a structured format, there exists no standard nomenclature or guidance on how they should be categorized. As a consequence, administrators and researchers frequently find chief complaint data difficult to use for downstream tasks such as quality improvement initiatives and predictive analytics. Thus, the secondary use of chief complaint data in daily operational decisions and research has been hampered by its form and representation.

Advances in natural language processing (NLP) provide an opportunity to address many of the challenges of chief complaint data. Contextual language models such as Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT) are able to generate dense vector representations, or embeddings, of free-text data such that semantically similar words or documents are mapped to nearby points in vector space. Such methods have been successfully applied in the medical domain. Recent work has used contextual language models to generate embeddings for chief complaints in the primary care setting, using a small dataset of patient-generated text.

Contextual embeddings for ED chief complaints have many desirable properties. They distill the complex information stored in free-text into a compact, numeric format while avoiding the data sparsity that results from converting categorical variables into dummy variables or from using traditional NLP models such as Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW). Moreover, a contextual embedding model trained specifically on ED triage notes stores appropriate information about chief complaints within the context of ED patient care, as opposed to word similarities within a large undifferentiated corpus. ED chief complaints have been an important part of many clinical decision support tools, including those for early sepsis detection, in-hospital mortality, patient disposition, and early ED return. Contextual embeddings for ED chief complaints may facilitate incorporating free-text information into prediction models, as has been shown in models for in-patient readmission. Contextual embeddings also enable us to calculate a numeric distance between any two chief complaints to determine their relatedness, or similarity, an elusive concept that has hampered outcomes research focusing on subgroups of chief complaints as well as quality improvement projects on short-term ED return. Lastly, contextual embeddings may be used to derive a standardized, data-driven ontology of ED chief complaints that could be shared among healthcare institutions and research entities to minimize the variability in how chief complaint labels are assigned from ED to ED, as has been suggested by recent work on the Hierarchical Presenting Problem Ontology (HaPPy).
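To make the idea of a numeric distance between two chief complaints concrete, the following is a minimal sketch using cosine similarity. The choice of cosine similarity is an assumption made for illustration (the text does not name a specific measure), and the three-dimensional vectors are made-up toy values rather than real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration; real embeddings would be much longer.
wheezing = [0.9, 0.1, 0.2]
breathing_problem = [0.8, 0.2, 0.1]
ankle_injury = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(wheezing, breathing_problem)
sim_unrelated = cosine_similarity(wheezing, ankle_injury)
print(sim_related > sim_unrelated)
```

With embeddings that place related complaints close together, the similarity for a related pair would be expected to exceed that of an unrelated pair, as it does for these toy values.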

OBJECTIVES

In this study, we expand on prior work by applying BERT, a state-of-the-art language model, to a dataset of 1.8 million provider-generated free-text ED chief complaints from a healthcare system covering seven independent EDs. We use Long Short-Term Memory (LSTM) and ELMo for baseline comparison. We show that the contextual embeddings generated by BERT accurately predict provider-assigned chief complaint labels and map semantically similar chief complaints to nearby points in vector space.

MATERIALS AND METHODS

Retrospective data on all adult and pediatric ED visits were obtained from a large healthcare system covering the period of March 2013 to July 2019, with a combined annual census of approximately 500 000 visits across seven independent EDs, three of which are community hospital-based. The centralized data warehouse for the electronic health record (EHR) system (Epic, Verona, WI) was queried for chief complaint data. This study was approved, and the informed consent process waived, by the Human Investigation Committee at the authors’ institution (HIC 2000025236).

Chief complaint data in the Epic EHR are represented in two forms, a “presenting problem” that is a structured list of 1145 labels and a free-text chief complaint comment section in the form of an unstructured text box. The structured label system does not correspond to external nomenclatures such as SNOMED Clinical Terms. The free-text chief complaint is entered by the triage nurse at the moment of patient encounter, along with one or more presenting problems, which the nurse selects from a structured list after searching for a free-text term. We removed visits that did not contain both forms of chief complaint data, but examined the distribution of structured chief complaint labels without a comment section through categorical data analysis and chi-square distance metrics to determine that the reduced dataset was representative of the full dataset. Visits that had been assigned more than one chief complaint label were treated as separate training instances. Given the skewed distribution of chief complaint labels, where the 25 most common labels out of a total of 1145 account for roughly half of the dataset, chief complaint labels that comprised less than 0.01%, or 1 in 10 000, of all visits were excluded to ensure adequate training samples per label. The cutoff threshold was then incremented on a log scale to create seven datasets of decreasing sparsity (Supplementary Table S1). A full list of the chief complaint labels, along with their frequencies, is available in Supplementary Table S2.
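The two preprocessing steps described above (one training instance per assigned label, then a label-frequency cutoff) can be sketched as follows. This is an illustration under stated assumptions, not the study code: the function names, the toy visits, and the toy cutoff of 20% are hypothetical, and the fraction is computed over training instances, which may differ from the denominator used in the study.

```python
from collections import Counter


def expand_visits(visits):
    """Turn each (text, [labels]) visit into one (text, label) instance per label."""
    return [(text, label) for text, labels in visits for label in labels]


def filter_by_label_frequency(instances, min_fraction):
    """Keep instances whose label makes up at least min_fraction of all instances."""
    counts = Counter(label for _, label in instances)
    total = len(instances)
    keep = {label for label, n in counts.items() if n / total >= min_fraction}
    return [(text, label) for text, label in instances if label in keep]


# Toy visits invented for illustration.
visits = [
    ("fall at home", ["FALL", "HEAD LACERATION"]),
    ("chest pain x2 days", ["CHEST PAIN"]),
    ("cp radiating to arm", ["CHEST PAIN"]),
    ("sob", ["SHORTNESS OF BREATH"]),
    ("chest tightness", ["CHEST PAIN"]),
]

instances = expand_visits(visits)
# A 20% cutoff is used only because the toy dataset is tiny;
# the cutoffs named in the text are 0.01% (0.0001) and 0.08% (0.0008).
filtered = filter_by_label_frequency(instances, min_fraction=0.2)
print(len(instances), len(filtered))
```

Raising `min_fraction` removes more rare labels, which is how a series of progressively less sparse datasets could be produced from a single source.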

Model training

For each of the seven datasets, all samples were randomly split into training (90%), validation (5%), and test (5%) sets. The classification task was to predict the provider-assigned label from the free-text chief complaint. Given the clinical nature of the dataset, we used a version of clinical BERT pretrained on the MIMIC corpus. LSTM and ELMo were trained as baseline models on the largest dataset consisting of 434 labels. Using the open source library PyTorch, we fine-tuned each clinical BERT model for three epochs on three GTX 1080 Ti GPUs. Each epoch on the full dataset took about an hour using per_gpu_train_batch_size of 144. Hyperparameter tuning beyond the default values for BERT fine-tuning did not yield noticeable gains in performance, with the test accuracies converging to the same range of values for any reasonable configuration. A learning rate of 1e−4 and max_seq_length of 64 were used. Sequences longer than max_seq_length were truncated. The implementation code is available at https://github.com/dchang56/chief_complaints. Notably, the repository also includes an easy-to-use script with instructions to generate predictions for custom chief complaint datasets. For baseline comparison, we trained a bidirectional LSTM and ELMo using the AllenNLP framework. In both cases, the hidden dimension size was 512. The LSTM model was a one-layer bidirectional LSTM with GloVe embeddings, and the ELMo model was a two-layer biLSTM initialized with pretrained ELMo weights.
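A random 90/5/5 split like the one described above might look like the sketch below. The function name and the fixed seed are assumptions added for reproducibility of the example; the study does not state how the shuffle was seeded.

```python
import random


def split_dataset(samples, train_frac=0.90, val_frac=0.05, seed=0):
    """Shuffle a copy of samples and split it into train, validation, and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after train and validation
    return train, val, test


# Integers stand in for (text, label) training instances.
samples = list(range(1000))
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

Because the test set takes whatever remains after the first two slices, every sample lands in exactly one of the three sets.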

Performance

Performance was measured as the Top-k accuracy from k = 1:5 on a hold-out test set comprising 5% of the samples. Top-k accuracy is defined such that the model is considered to be correct if its top-k probability outputs contain the correct class label.
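The Top-k definition can be illustrated with a minimal sketch in plain Python. The score matrix and labels below are toy values invented for the example, and the function name is hypothetical.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true class index is among the k highest-scoring classes."""
    if len(scores) != len(true_labels):
        raise ValueError("scores and true_labels must have the same length")
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy per-class probabilities for four samples and three classes.
scores = [
    [0.60, 0.30, 0.10],  # true class 0 is ranked first
    [0.20, 0.50, 0.30],  # true class 2 is ranked second
    [0.10, 0.20, 0.70],  # true class 0 is ranked third
    [0.40, 0.35, 0.25],  # true class 1 is ranked second
]
true_labels = [0, 2, 0, 1]

accuracies = {k: top_k_accuracy(scores, true_labels, k) for k in range(1, 4)}
print(accuracies)  # {1: 0.25, 2: 0.75, 3: 1.0}
```

Top-k accuracy can only stay the same or increase as k grows, which is why the reported values rise from Top-1 to Top-5.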

Error analysis

Having hundreds of potential labels with considerable semantic overlap (eg FACIAL LACERATION, LACERATION, HEAD LACERATION, FALL, FALL > 65) justifies taking into account the top few predictions rather than just the top 1. We hypothesized that the redundancy and noise in the label space would be responsible for the majority of the model’s errors and a priori determined to examine through two-physician review a random sample of errors, as well as look at the most frequent kinds of mislabeling for common chief complaint labels.
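Tallying the most frequent kinds of mislabeling for each ground truth label could be done along the lines of the sketch below. The function name is hypothetical and the label pairs are toy data that reuse label names from the text; they are not results from the study.

```python
from collections import Counter, defaultdict


def most_common_errors(true_labels, predicted_labels, top_n=5):
    """For each true label, list its most frequent incorrect predictions with counts."""
    errors = defaultdict(Counter)
    for truth, pred in zip(true_labels, predicted_labels):
        if truth != pred:
            errors[truth][pred] += 1
    return {truth: counter.most_common(top_n) for truth, counter in errors.items()}


# Toy ground truth labels and Top-1 predictions.
true_labels = ["FALL", "FALL", "FALL", "LACERATION", "LACERATION", "FALL"]
predicted = ["FALL>65", "FALL>65", "HEAD LACERATION", "FACIAL LACERATION", "LACERATION", "FALL"]

error_table = most_common_errors(true_labels, predicted)
print(error_table["FALL"])  # [('FALL>65', 2), ('HEAD LACERATION', 1)]
```

Errors that concentrate on semantically overlapping labels, as in this toy case, would point to noise in the label space rather than to a failure to understand the text.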

Embedding visualization

The embedding for each free-text chief complaint was extracted as the final 768-dimensional layer of the BERT classifier. We took the mean of the embeddings across each chief complaint label and visualized the averaged, label-specific embeddings in a two-dimensional space using t-SNE. More specifically, the mean of the 768-dimensional embeddings across each chief complaint label was reduced to two dimensions using the Rtsne package (v. 0.15) in R with the following default hyperparameters: initial_dims = 50, perplexity = 30, theta = 0.5. To enhance readability of the figure, we limited the number of visualized labels to 188 by using a cutoff threshold of 0.08%. The ggrepel and ggplot2 packages in R were used for plot generation. Clusters were determined via Gaussian mixture modeling with the optimal number selected by silhouette analysis.
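The averaging step that precedes t-SNE can be sketched as below. This covers only the per-label mean; the dimensionality reduction itself (performed with Rtsne in the study) is not reproduced here. The function name and the two-dimensional toy vectors are assumptions for illustration, standing in for 768-dimensional embeddings.

```python
from collections import defaultdict


def mean_embedding_by_label(embeddings, labels):
    """Average the embedding vectors that share the same label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[label][i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


# Toy 2-dimensional vectors invented for illustration.
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 10.0]]
labels = ["COUGH", "COUGH", "FEVER"]

label_means = mean_embedding_by_label(embeddings, labels)
print(label_means)  # {'COUGH': [2.0, 3.0], 'FEVER': [0.0, 10.0]}
```

The resulting one-vector-per-label table is what would then be passed to a t-SNE implementation and a clustering step.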

RESULTS

In the defined query time period, there were an initial 2 154 862 visits among 736 570 patients. 355 497 (16.4%) visits from 65 737 (8.9%) patients were removed for absence of either a structured or unstructured chief complaint. Among chief complaint labels, 43 of the 1145 labels were removed because of the absence of any visit with unstructured text. In comparison to the initial dataset, the chi-square distance metric for the histogram of the remaining chief complaint categories (n = 1102) was 0.005. For model training, an additional 668 labels comprising 25 143 (1.3%) visits were removed after filtering out labels that comprised less than 0.01%, or 1 in 10 000, of all visits, resulting in a total of 434 labels and 1 859 599 training instances. The BERT models achieved increasing performance with higher label-frequency cutoff thresholds (Figure 1). BERT outperformed both LSTM and ELMo (Table 1). The BERT model yielded Top-1 accuracies of 0.65 and 0.69, Top-3 accuracies of 0.87 and 0.90, and Top-5 accuracies of 0.92 and 0.94 on datasets comprised of 434 and 188 labels, respectively. Common types of mislabeling for the frequent chief complaint labels, as well as labels with the lowest accuracies, are shown in Figure 2. The interquartile range for Top-5 accuracies amongst the chief complaint labels was 74.0–92.3%.
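For readers unfamiliar with the chi-square distance between two label histograms, the sketch below uses one common definition, half the sum of (p - q)^2 / (p + q) over normalized frequencies. The exact formula used in the study is not stated, so this definition is an assumption, as are the function name and the toy counts.

```python
def chi_square_distance(counts_a, counts_b):
    """Chi-square distance between two label-count histograms after normalization."""
    labels = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    distance = 0.0
    for label in labels:
        p = counts_a.get(label, 0) / total_a
        q = counts_b.get(label, 0) / total_b
        if p + q > 0:
            distance += (p - q) ** 2 / (p + q)
    return 0.5 * distance


# Toy label counts invented for illustration.
full = {"CHEST PAIN": 500, "FALL": 300, "OTHER": 200}
reduced = {"CHEST PAIN": 450, "FALL": 280, "OTHER": 170}

d = chi_square_distance(full, reduced)
print(d)
```

Under this definition the distance is 0 for identical distributions and 1 for distributions with no labels in common, so a value near 0 indicates that the reduced dataset closely mirrors the full one.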

Figure 1.

Model performance for Top-1 to Top-5 accuracy. Label-frequency cutoff thresholds are represented by colors. The accuracy increases drastically when taking into account the first few predictions. Dotted line shows 90% accuracy.

Table 1.

Predictive performance by algorithm

Dataset	Metric	LSTM	ELMo	BERT
Full dataset (434 labels)	Top-1	0.63	0.63	0.65
	Top-2	0.77	0.78	0.80
	Top-3	0.84	0.85	0.87
	Top-4	0.88	0.88	0.90
	Top-5	0.90	0.90	0.92
Reduced dataset (188 labels)	Top-1	0.66	0.66	0.69
	Top-2	0.81	0.81	0.84
	Top-3	0.88	0.88	0.90
	Top-4	0.90	0.91	0.93
	Top-5	0.93	0.93	0.94

Figure 2.

Common types of mislabeling for select chief complaint labels. Top row shows three of the most common chief complaint labels, with their accuracies shown within parentheses. Bottom row shows three chief complaint labels with lowest accuracies. X-axis shows the top five most common misclassifications in decreasing order. Y-axis shows frequency of error. Note that even for low performing chief complaint labels, a high percentage of errors are due to semantic overlap.

Manual error analysis showed that many errors were due to the problem of redundancy and noise in the label space. In some cases, the predictions of the model were more suitable than the provider-assigned labels, as independently validated by physicians. We show 10 representative examples in Table 2 and provide a random selection of 100 errors in Supplementary Table S3.

Table 2.

Examples of chief complaints and their corresponding top-5 predictions

	Chief complaint	Top-5 predictions
Correctly classified at second prediction	“right third finger injured in door”	FINGER INJURY, HAND PAIN, HAND INJURY, FINGER PAIN, EXTREMITY LACERATION
	“pt comes to er with cc piece of plastic stuck to back of left ear from earing”	FOREIGN BODY IN EAR, EAR PROBLEM, EAR PAIN, OTALGIA, FOREIGN BODY
	“vomiting for days, increasing yesterday. pos home preg test on Saturday”	EMESIS, EMESIS DURING PREGNANCY, NAUSEA, ABDOMINAL PAIN PREGNANT, GI PROBLEM
	“both eyes swollen & itchy & tearing after his nap”	EYE SWELLING, EYE PROBLEM, EYE REDNESS, EYE PAIN, CONJUNCTIVITIS
	“fall at 0300 today, rt side weakness”	FALL, FALL>65, ALTERED MENTAL STATUS, NEUROLOGIC PROBLEM, WEAKNESS
Correctly classified at fifth prediction	“Felt like heart was pounding history of CABG. missed metoprolol for about 3 days.”	PALPITATIONS, RAPID HEART RATE, TACHYCARDIA, IRREGULAR HEART BEAT, CHEST PAIN
	“2 weeks of sore throat, aches, dry cough. Denies intervention.”	SORE THROAT, COLD LIKE SYMPTOMS, URI, COUGH, FLU-LIKE SYMPTOMS
	“fall down 5 stairs lace to right eyebrow”	FALL, FACIAL LACERATION, LACERATION, FALL>65, HEAD LACERATION
	“fever to 101, diarrhea, vomiting”	FEVER-9 WEEKS TO 74 YEARS, FEVER, EMESIS, ABDOMINAL PAIN, FEVER-8 WEEKS OR LESS
	“blister on back of foot.”	BLISTER, FOOT PAIN, FOOT INJURY, FOOT SWELLING, SKIN PROBLEM

The predictions are sorted in decreasing order of likelihood. The provider-assigned ground truth label is italicized. The examples highlight the problem of semantic overlap in the label space.

Figure 3 shows the t-SNE visualization of averaged embeddings for common chief complaint labels, clustered via Gaussian mixture modeling. Using the silhouette analysis, 15 was chosen to be the optimal number of clusters. A cutoff-threshold of 0.08% (ie 188 chief complaint labels) was used for readability in a two-dimensional space. t-SNE visualization for embeddings generated using LSTM and ELMo are shown in Supplementary Figures S4 and S5.

Figure 3.

t-SNE visualization of averaged embeddings of common chief complaints. Embeddings for common chief complaints were grouped by their ground truth label, then their arithmetic mean visualized using t-SNE. The embeddings are distributed in a clinically meaningful way, with related concepts embedded close to each other and broader types of chief complaints clustered together. Note that t-SNE is a stochastic algorithm and, while it preserves local structure of the data, does not completely preserve its global structure. The text labels have been jittered to enhance readability. Colored groupings represent clusters as determined by Gaussian mixture modeling.

DISCUSSION

By applying BERT to a dataset of 1.8 million ED chief complaints from a healthcare system covering seven independent EDs, we derive contextual embeddings for chief complaints that accurately predict provider-assigned labels as well as map semantically similar chief complaints to nearby points in vector space. Prior studies have derived embeddings for medical concepts, patient-to-provider messages, and primary care chief complaints. We expand on prior work by using a large dataset of healthcare professional-generated text, as opposed to patient-generated text, and by generating contextual embeddings for chief complaints within the emergency care setting. These embeddings may be instrumental in multiple downstream tasks, such as augmenting predictive performance of clinical decision support tools, calculating similarity measures between chief complaints to determine whether ED bounce-backs are due to a related cause, or creating a standardized, data-driven ontology of chief complaints.

Recently, much important work has been done to create a standardized ontology, namely, the Hierarchical Presenting Problem Ontology (HaPPy), which increased the likelihood of label assignment from a free-text chief complaint from 26.2% to 97.2% in one study. Using such an ontology for training and testing purposes may present an opportunity for gold standard labels to be used to derive contextual embeddings.

Our study has several limitations. Our data come from a single healthcare system that uses an internal classification system for ED chief complaints, and our results may not be generalizable across EDs operating under different EHR systems. Moreover, certain conditions may be more likely to have structured chief complaint labels and by only training on that subset of patients, the model may have restricted applicability. Also, free-text chief complaints often list several comorbid signs and symptoms, making it difficult to choose a single ground truth label. This raises concerns about whether the prediction task should be set up as a multi-label classification task. Another limitation is the noise inherent in the default set of chief complaint labels provided by our EHR. Of the 1145 default categories, 153 have one or no instance out of 1.87 million visits, while 472 account for 99% of the visits. Labels such as “Other” and “Medical” provide little to no information in an emergency care setting and restrict the applicability of the model. Some labels are synonyms (eg “Dyspnea” and “Shortness of breath”; “Otalgia” and “Ear pain”), while many more are hypo/hypernyms of one another (eg “Fall” and “Fall > 65”; “Migraine” and “Known dx migraine”). Such issues highlight the need to develop a principled and data-driven ontology for ED chief complaints. Despite the noise in the data, the model was able to learn a rich representation of chief complaints and generate reasonable predictions of their labels. In fact, many of the predictions that resulted in errors were more suitable than the ground truth labels, suggesting that the model did not overfit to the training data. Finally, our model was trained only on free-text data, without any other patient information. Including non-textual patient data such as demographics, vital signs, and hospital usage statistics may improve performance, as shown in many prediction tasks. Further studies are needed to assess the validity of these approaches.

CONCLUSION

The BERT language model was able to learn a rich representation of chief complaints and generate reasonable predictions of their labels despite the inherent noise in the label space. The learned embeddings accurately predicted provider-assigned chief complaint labels and mapped semantically similar chief complaints to nearby points in vector space. Such a model may be used to automatically map free-text chief complaints to structured fields and to derive a standardized, data-driven ontology of chief complaints for healthcare institutions.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

FUNDING

David Chang was supported by NIH Training Grant 5T15LM007056-33. Woo Suk Hong and Richard Andrew Taylor received no specific funding for this work.

Authors’ contributions

All authors contributed to the conception and design of the study. R.T. performed the acquisition and preprocessing of data. D.C., W.H., and R.T. performed the analysis of data, visualization, and interpretation of the results. D.C. and W.H. drafted the initial manuscript. D.C., W.H., and R.T. revised the final manuscript.

Conflict of interest statement. None declared.
