Literature DB >> 32240233

Risk of mortality and cardiopulmonary arrest in critical patients presenting to the emergency department using machine learning and natural language processing.

Marta Fernandes¹, Rúben Mendes¹, Susana M Vieira¹, Francisca Leite², Carlos Palos³, Alistair Johnson⁴, Stan Finkelstein⁵, Steven Horng⁶, Leo Anthony Celi^4,7.

Abstract

Emergency department triage is the first point in time when a patient's acuity level is determined. The time to assign a priority at triage is short and it is vital to accurately stratify patients at this stage, since under-triage can lead to increased morbidity, mortality and costs. Our aim was to present a model that can assist healthcare professionals in triage decision making, namely in the stratification of patients through the risk prediction of a composite critical outcome-mortality and cardiopulmonary arrest. Our study cohort consisted of 235826 adult patients triaged at a Portuguese Emergency Department from 2012 to 2016. Patients were assigned to emergent, very urgent or urgent priorities of the Manchester Triage System (MTS). Demographics, clinical variables routinely collected at triage and the patients' chief complaint were used. Logistic regression, random forests and extreme gradient boosting were developed using all available variables. The term frequency-inverse document frequency (TF-IDF) natural language processing weighting factor was applied to vectorize the chief complaint. Stratified random sampling was used to split the data into train (70%) and test (30%) data sets. Ten-fold cross validation was performed in train to optimize model hyper-parameters. The performance obtained with the best model was compared against the reference model-a regularized logistic regression trained using only triage priorities. Extreme gradient boosting exhibited good calibration properties and yielded areas under the receiver operating characteristic and precision-recall curves of 0.96 (95% CI 0.95-0.97) and 0.31 (95% CI 0.26-0.36), respectively. The predictors ranked with higher importance by this model were the Glasgow coma score, the patients' age, pulse oximetry and arrival mode. Compared to the reference, the extreme gradient boosting model using clinical variables and the chief complaint presented higher recall for patients assigned MTS-3 and can identify those who are at risk of the composite outcome.

Entities: Chemical Disease Gene Species

Year: 2020 PMID： 32240233 PMCID： PMC7117713 DOI： 10.1371/journal.pone.0230876

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

1 Introduction

1.1 Emergency department triage

Emergency Department (ED) triage is the first evaluation of a patient when the patient condition and acuity level are defined. The main purpose of triage is to identify patients who need immediate care. ED triage systems ensure that patients are assigned a certain level of priority that is appropriate to their clinical condition and urgency of treatment. Thus, an accurate triage decision support system to assist the health professional becomes of vital importance.

1.2 Existing clinical triage systems

So far, several triage systems have been implemented worldwide to assist health professionals classifying patients according to the severity of their medical condition. The Manchester Triage System (MTS) is a 5-level triage system widely used in Europe [1]. MTS priorities range from level 1 (emergent patients that should have immediate medical observation) to level 5 (non urgent patients that should wait a maximum time of 4 hours for medical observation). Although this triage system is well established, over-triage and under-triage still occur, which indicates room for improvement. According to studies [2-5], under-triage contributes to delays in time-sensitive interventions, morbidity, and mortality. Over-triage, may have indirect but equally harmful effects [6], resulting in diversion of limited time and resources from more urgent patients and inappropriate allocation to less severe patients. In a study [2], the clinical severity of under-triage in the MTS was assessed for a cohort of pediatric patients. It was found that 0.9% (119/13408) patients were under-triaged and that in 53% (63/119) of these patients, experts considered under-triage as clinically severe. It was concluded that although infrequent, under-triage could have serious clinical consequences. In another study [6], the frequency of under- and over-triaged patients was measured by nurses using the Emergency Severity Index (ESI). Initial ESI-determined triage priority was classified as inaccurate for 16426 of 96071 patients. It was found that under-triage was associated with significantly higher admission rates and critical outcomes, while over-triage was associated with a lower rate of both.

1.3 Prior work in machine learning for risk stratification

There is a richness of information contained in electronic health records (EHR) stored in large databases that can be explored using machine learning to provide insights to assist providers in making informed decisions based on objective criteria [7]. In the literature machine learning models have been developed to assist in the stratification of patients for prioritization, according to their acuity level at the triage [8-23], and according to their risk of mortality [24-37], cardiac arrest [32-34], Intensive Care Unit (ICU) admission [27-30], hospital admission [9, 27, 38–53], need for critical care [9], ED revisits [45, 54], abnormal medical condition [17], acute morbidity and infectious diseases [31] and expected number of resources [55]. A few studies used MTS priority as modeling predictor along with other variables collected at triage, namely for prediction of hospital admissions [39-41] and mortality [26, 53]. In [39], a logistic regression and an artificial neural network model both yielded an AUROC of 0.86 (95% confidence intervals (CI) 0.85-0.86) for prediction of hospital admission. The authors developed a nomogram using the logistic regression and an automated software decision support system with a web interface based on the artificial neural network model. In [40], a logistic regression model was developed yielding an AUROC of 0.88 (95% CI 0.87-0.88) and in [41], a logistic regression, a decision tree and a gradient boosted machine model presented an AUROC of 0.85, 0.82 and 0.86, respectively, for prediction of hospital admissions. In [26], a logistic regression was developed to predict mortality within 30 days from information of gender, major disease by priority, Charlson’s comorbidity index, ICU admission, blood transfusion, troponin elevation, door-to-team and team-to-ward time. The MTS priority was found to be highly predictive of subsequent 30-day mortality ranging from 4.8% (MTS-5) to 46.1% (MTS-1). The results for both independent predictors of death within 30 days were: door-to-team times OR of 1.13 (95% CI 1.07-1.18) and team-to-ward times OR of 1.07 (95% CI 1.02-1.13). In [53] a logistic regression model yielded an AUROC of 0.91 (95% CI 0.89-0.93) from information of MTS priority at presentation, age, gender, comorbidities, functional status at presentation, mode of arrival, time of ED visit, type of specialty, physiological parameters at different times, level of consciousness, need for supplement oxygen, need for ventilation assistance, use of vasoactive and inotropic agents and initial laboratory markers in the ED. Other studies were performed for mortality prediction, where MTS priority was not used as modeling predictor. In [25] an area under the receiver-operating characteristic curve (AUROC) of 0.92 (95% CI 0.92-0.93) was obtained using logistic regression. The mortality predictors of the final model were age, prolonged capillary refill time, blood pressure, mechanical ventilation, oxygen saturation, Glasgow coma score and the diagnostic category of the Acute Physiology and Chronic Health Evaluation II (APACHE II)—a classification system for severity of disease. In [24], the authors used logistic regression as well and the AUROC for the internal validation set was 0.85 (95% CI 0.84-0.86) and for the external validation set was 0.84 (95% CI 0.82-0.85). The predictors of the model were age, gender, year, mode of arrival, triage priority and nine triage complaint codes (cardiac arrest, syncope/collapse, other cardiac, sepsis, other neurological, stroke/transient ischaemic attack, other respiratory, malignancy and malaise). In [27], for the combined prediction of mortality and ICU admission, the authors used four machine learning models: Lasso regression, random forest, gradient boosted decision tree, and deep neural networks. As the reference model, the authors created a logistic regression model using the five-level ESI. The machine learning models outperformed the reference model, AUROC of 0.86 (95%CI 0.85-0.87) in the deep neural network vs 0.74 (95%CI 0.72-0.75) in the reference model, with less under-triaged patients in ESI triage levels 3 to 5 (urgent to non-urgent). Predictors included demographics, mode of arrival, triage vital signs, chief complaints and comorbidities. A more comprehensive systematic review of machine learning applied for ED triage decision support can be found in [56].

1.4 Goals of this investigation

In this work, we employed machine learning and natural language processing to identify ED patients with high risk of mortality and cardiopulmonary arrest using data routinely collected at triage from a Portuguese public Hospital Beatriz Ângelo (HBA). The case study of Portugal is relevant, given that it represents high demand for emergency services, compared to other Organisation for Economic Co-operation and Development (OECD) countries [57]. We developed models for prediction of a composite outcome combining mortality and cardiopulmonary arrest, in the first 24 hours after triage. The models were developed for the cohort of emergent, very urgent and urgent patients, where early intervention is more critical, compared to the cohort of standard, less urgent and non-urgent patients. The performance obtained with these models was compared against the reference model trained using only the MTS priority. The impact in the models performance was assessed when including the chief complaint text feature. The calibration of the models was also assessed, since inadequate model calibration can limit the use of the models for clinical decision support [58-62]. The uncertainty and reliability of risk predictions were finally assessed for the best model.

2 Materials and methods

2.1 Study overview

Data acquired from the Emergency Department Information Systems (EDIS) of HBA ranged from 2011 to 2016, with a total of 599276 ED visits in the adult population (≥ 18 years old). This study was approved by HBA Ethics Committee that waived the requirement for informed consent. At HBA, re-triage is performed for reassessment of parameters or change of priority, activation of clinical pathway or introduction/correction of other information in the system. re-triages were excluded from the study given that not all patients are re-triaged. Referrals from HBA ED to other hospital ED were excluded as well, since there was no information regarding the patients’ outcome. Patients who died before ED admission were also excluded. Finally, the cohorts of non-urgent or standard priority patients were excluded, as well as patients assigned a white priority (in HBA this priority is assigned for patients with less urgency for care). Less urgent patients usually are admitted to the ED for minor problems, and although there were no patients in this cohort with a critical outcome, it is a rare event in this cohort. The exclusion of these patients also reduced class imbalance, allowing for the development of more reliable models. For detailed exclusion criteria refer to S1 Appendix in supplementary materials.

2.2 Data collection and processing

We included as predictors variables routinely collected at triage (temperature, heart and respiratory rate, systolic, diastolic and mean arterial blood pressure, pulse oximetry SpO2 and pain scale), the chief complaint, glycemia levels, Glasgow coma scale (GCS), the priority assigned to the patient, the patient age and gender, mode of arrival (ambulance, walk-in), disabilities (stretcher, wheelchair or none), time of triage (weekday, hour and month), ED visit (first triage registered on the system or not), number and type of exams prescribed at triage (ophthalmology, otolaryngology, electrocardiogram, X-ray and orthopedic). For each physiological variable, two dummy indicator variables were created indicating abnormal/not abnormal (1/0) and missing/not missing (1/0) values. Detailed preprocessing can be depicted in S2 Appendix.

2.2.1 Text pre-processing

Term frequency–inverse document frequency (TF-idf) was used for vectorization of the chief complaint. The number of words (unigrams/bigrams/trigrams) to select from each patient chief complaint as well as the total number of words to use from the training vocabulary are indicated in S2 Table. The chief complaint for HBA dataset consists of unstructured free text and it was subjected to lowercasing, a process of temporal normalization, tokenization, expansion of abbreviations and correction using Jaro-Winkler and stemming. Detailed preprocessing can be depicted in S2 Appendix.

2.3 Outcome measure

The primary outcome measure is defined as the composite of in-hospital death or cardiopulmonary arrest within 24 hours after triage, with triage as reference (0 min).

2.4 Model development

For model design and evaluation, we performed stratified random sampling, splitting 70% of data for train and 30% for test. The data preprocessing methodology is depicted in supplementary materials. We performed a stratified 10-fold cross validation (CV) in the training set using a randomized search for hyperparameter optimization. The models were trained and tested with the same dataset partitions, to allow for reproducibility. The configuration of the model with highest AUROC was selected as well as the corresponding threshold. This model was evaluated in the held-out test dataset. We performed 100 iterations of bootstrapping random sampling in 95% confidence intervals (CI) to measure variance in performance. We compared results from logistic regression (LR), random forests regression bootstrap aggregation of decision trees (RFR) and extreme gradient boosting classifier (XGBoost). These algorithms were selected so we could compare a boosting classification technique—XGBoost, and a bagging regression technique—random forests with a more traditional technique—LR. LR is suitable for predicting a binary outcome variable, such as deceased/not deceased when there are no non-linear relationships. Random forests perform a randomized sampling process to train a set of individual decision trees where each aggregates the output to produce a single probabilistic prediction for each outcome. At the end, the majority vote is implemented for all of the tree outcomes [63]. XGBoost algorithm is an efficient supervised learning algorithm which is a variant of the original Gradient boosting method [64], having been very recently developed in [65]. XGBoost iteratively applies greedy search to find the optimal model structure by adding a split to the existing tree structure at each iteration. We performed a randomized search of hyperparameters using 10-fold cross validation in the training set. The hyperparameters search settings can be depicted in S2 Table. Besides AUROC, we also assessed the area under the precision recall curve (AUPRC), the average precision (AP), which summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, the sensitivity, also referred to as recall or the true positive rate (TPR), specificity or true negative rate (TNR) and precision or positive predictive value (PPV), which were used in previous studies [41, 47, 49]. We calculated the F1 score, which is suited for dealing with imbalanced datasets [66] and Cohen’s Kappa (κ) [67] to analyse interrater reliability [68]. Finally, we presented the standardized mortality ratio (SMR) which in this case represents the ratio between the observed number of composite outcomes (death or cardiopulmonary arrest) predicted by the model and the number of composite outcomes which would be expected. We also evaluated the quality of the predictions using the Brier Skill Score (BSS). The BSS is based on the Brier Score (BS), a statistical index which can be used to validate predicted probabilities. BSS = 1 − BS/BSref, thus if the reference score (BSref) is evaluated, it results in a BSS of 0.0, which represents a no skill prediction. Negative values represent worse than no skill and values above 0.0 represent skillful predictions, with a perfect prediction value being 1.0 [69].

3 Results

3.1 Summary of patient population

Initially the dataset had a population of 599276 adult ED visits, from where we excluded triaged patients with unknown age (n = 51), unknown priority (n = 473) and unknown time of ED admission (n = 222). Standard (n = 287280), non-urgent (n = 7100) and white (n = 2095) priorities were excluded since the study focused on the first three most urgent priorities. Transfers to other hospital ED (n = 5524), obstetric patients (n = 64130), activation of protocols at triage (n = 20982), re-triages (n = 8515) and death before ED admission (n = 448) were also excluded leaving a final cohort of 235826 triaged patients, as presented in Fig 1. In this cohort, there were 1121 (0.48%) patients with a composite outcome: 1100 (98.13%) deceased, 21 (1.87%) with a cardiopulmonary arrest and 796 (71%) with both outcomes, in the first 24 hours after triage.

Fig 1

Inclusion and exclusion criteria.

“n” corresponds to the number of triages.

Inclusion and exclusion criteria.

“n” corresponds to the number of triages. Demographics and a subset of variables are presented in Table 1. The descriptive statistics of the remaining variables used for modeling, namely number of missing and abnormal vitals, Glasgow coma scale and types of exams are presented in S3 Table, time variables in S4 Table, variables indicators of missing or abnormal values in vitals in S5 Table and pain scale statistics in S6 Table.

Table 1

Demographic variables and a subset of features are summarized for emergency department patients with and without the composite outcome.

	Composite outcome
	Yes (N = 1121)	No (N = 234711)
Gender
Female	610 (54)	130292 (56)
Male	511 (46)	104419 (44)
Age (years old)	80 (12.3), 83	58 (21.0), 59
Arrival mode
Walk-in	102 (9)	124888 (53)
Ambulance	718 (64)	57016 (24)
Other	301 (27)	52807 (23)
Disability
Stretcher	252 (22)	12500 (5.3)
Wheelchair	9 (1)	10325 (4.4)
None	860 (77)	211886 (90.3)
Heart rate (beats/min)	89 (23.5), 86	86(13.2), 86
Diastolic blood pressure (mmHg)	71 (14.1), 77	77 (8.5), 77
Systolic blood pressure (mmHg)	130 (25.6), 143	143 (16.4), 143
Pulse oximetry (%)	91 (7.7), 96	96 (2.4), 96
Glycaemia (mg/dL)	162 (68.4), 147	147 (42.4), 147
Temperature (°C)	36 (1.4), 37	37 (0.7), 37
Respiratory rate (breaths/min)	18 (4.8), 17	17 (1.7), 17
Number of exams
0	1109 (98.9)	213073 (91)
1	10 (0.9)	14783 (6)
2	2 (0.2)	4651 (2)
3 or more	0 (0)	2204 (1)
Priorities
Emergent (MTS-1)	385 (34)	978 (0.4)
Very urgent (MTS-2)	539 (48)	40677 (17.3)
Urgent (MTS-3)	197 (18)	193056 (82.3)
MTS discriminators (top 5)	Very low pulse oximetry—287 (26)	Moderate pain—82831 (35)
	Airway compromised—213 (19)	Pleuritic pain—14176 (6)
	Ineffective breathing—156 (14)	Sudden onset—12330 (5.3)
	Change of conscious status—102 (9)	Low pulse oximetry—11034 (4.7)
	Abnormal pulse—60 (5)	Severe pain—10867 (4.6)

The table shows number of patients. The figures in parentheses are the column percentages within each categorical variable for the respective outcome of admission. For continuous variables mean is presented with respective range in parentheses, followed by median: mean (range), median. MTS—Manchester Triage System. We observed that gender was balanced and that the triaged population with a composite outcome—the positive class—was older compared with the negative class (80 vs 58 years old). The majority of the patients in the positive class arrived in ambulance (64%) while those in the negative class walked-in (53%). Most of the patients in both classes had no disability at arrival, however in the positive class 22% arrived in a stretcher vs. 5.3% in the negative class. Approximately half (48%) of the patients in the positive class was assigned a very urgent priority (MTS-2), while the majority (82%) of patients in the negative class was assigned an urgent priority (MTS-3). The top 5 MTS discriminators assigned to the positive class were “very low pulse oximetry”, “abnormal heart rate”, “compromised airway” and “ineffective breathing” as well as “change in conscious status”. For the negative class, three discriminators were related with pain (“moderate”, “severe” and “pleuritic”), the other two were “low pulse oximetry” and “sudden onset”. Regarding vital signs, the main discrepancy between positive vs negative classes was blood pressure (systolic—130 vs 143, diastolic 71 vs 77), pulse oximetry (91 vs 96) and glycaemia (162 vs 147).

3.2 Main results

The modeling training set for negative and positive classes consisted of 164297 and 785 patients, respectively, and for the test set 70414 and 336 patients, respectively. We developed models using all available predictors except triage priority, since the priority assignment is subjective and can vary across providers. The difference in performance of the models including vs not including the chief complaint as predictor can be visualized in Fig 2. We observe that both AUROC and AUPRC yielded better results for LR and XGBoost models when this predictor was included. The increase in AUROC was 2% for LR and there was no change for XGBoost, however the AP increased in 8% and 4% for each model, respectively. Although RF presented better performance than LR when not using the chief complaint as predictor, when adding this predictor, LR presented higher performance. In terms of specificity and recall there was a 6% increase in recall and 3% decrease in specificity for LR, while for RF, on the contrary, recall decreased 1% and specificity increased 1%. For XGBoost, only the specificity increased 1%. Although LR presented a 0.86 recall vs 0.84 in XGBoost (a 2% difference), specificity for LR was 0.88 and for XGBoost 0.94 (a 6% difference). Despite this difference in sensitivity, we selected XGBoost as the best model since it presented higher overall performance compared to the other two models. The performance results for all models can be depicted in S7 Table.

Fig 2

Models performance using only clinical variables (on the left) and including the chief complaint (on the right).

Models performance using only clinical variables (on the left) and including the chief complaint (on the right).

For plots on the left, each model performance (AUROC, AUPRC) is: logistic regression (0.93, 0.17), random forests (0.95, 0.21) and extreme gradient boosting (0.96, 0.25). For plots on the right, each model performance (AUROC, AUPRC) is: logistic regression (0.95, 0.25), random forests (0.94, 0.20) and extreme gradient boosting (0.96, 0.30). AUROC—Area under the ROC curve. AUPRC—Area under the precision recall curve. We assessed the models calibration to understand the reliability of the predictions, as depicted in Fig 3. When the chief complaint was added, the calibration improved. RF did not exhibit a good calibration curve and under-predicted the risk probabilities while LR had the tendency to over-predict. The XGBoost model was better calibrated in patients on the lower end of the risk spectrum, but it became less well calibrated above the mean risk of 50% and as the risk increased. We applied isotonic calibration to the XGBoost model as depicted in Fig 4.

Fig 3

Calibration curves for the models using only clinical variables (on the left) and including the chief complaint (on the right).

Fig 4

Calibration curve for the XGBoost model using clinical variables and the chief complaint with isotonic calibration.

Among the models, the best was selected based on AUROC and AUPRC values and the calibration curves. XGBoost presented overall higher performance when compared to the other models. The results for this model with isotonic calibration can be depicted in Table 2, as well as the hyperparameters tuned in 10-fold CV during model training. The BSS, depicted in Table 3, also indicated a better quality of predictions given by XGBoost when compared to the other models. Analyzing the XGBoost hyperparameters, we observed that each tree was built upon the selection of 90% of the patients information and less than half (40%) of their features. A combination of unigrams and bigrams yielded the best performance using 1000 words from the training vocabulary. Comparing the XGBoost performance against the reference model using only triage priority as predictor, we observe a significant overall increase in all performance measures. There was an increase of 11% in average AUROC, 3% in AUPRC, 20% in AP, 12% in specificity and inter-rater agreement was also significantly higher. Sensitivity presented the lower increase (2%) among all measures. The threshold obtained for XGBoost was approximately a half of the one for the reference, which indicates the importance of finding the best threshold using a performance metric. Although the SMR was still high for the XGBoost model, when comparing to the reference, we observed a reduction to almost a third. We also compared the true and false predictions, given by the models against the real classifications for both XGBoost and the reference, which can be depicted in S3 Fig. Comparing to the reference, the XGBoost model presented lower number of false positive predictions and could better identify the risk of a composite outcome in urgent patients, correspondent to MTS-3.

Table 2

Performance results for the XGBoost calibrated model using clinical variables and chief complaint against the reference model (triage priority) and respective hyperparameters.

Parameter	XGBoost	Reference
AUROC	0.96 [0.95-0.97]	0.85 [0.82-0.87]
AUPRC	0.31 [0.26-0.36]	0.28 [0.24-0.32]
AP	0.30 [0.25-0.34]	0.10 [0.07-0.12]
Specificity	0.94 [0.94-0.94]	0.82 [0.82-0.82]
Recall	0.84 [0.80-0.88]	0.82 [0.77-0.85]
Precision	0.06 [0.06-0.07]	0.02 [0.02-0.02]
F1	0.12 [0.11-0.13]	0.04 [0.04-0.05]
Cohen’s k	0.11 [0.10-0.13]	0.03 [0.03-0.04]
SMR	12.87 [11.69-14.20]	38.21 [34.77-42.29]
Threshold	0.008	0.015
Warm start	Yes	No
Regularization constant	γ = 5	C = 0.01
N-gram range	Combination of unigrams and bigrams	-
Words from vocabulary	1000	-
Number of trees	150	-
Maximum depth	4	-
Learning rate	0.09	-
Subsample (% of rows to build a tree)	90	-
Colsample (% of columns to build a tree)	40	-

In brackets is the result for 100 bootstrapping iterations in 95% confidence intervals. AUROC—area under the ROC curve. AUPRC—area under the precision recall curve. AP—Average precision. SMR—Standardized Mortality Ratio.

Table 3

Brier Skill Score (BSS) for the models using clinical variables and chief complaint against a unskilled reference model using triage priority.

Reference Brier Score = 0.005.

	XGBoost	RF	LR	Reference
BSS	0.19	0.05	0.15	0.00

Performance results for the XGBoost calibrated model using clinical variables and chief complaint against the reference model (triage priority) and respective hyperparameters.

AUROC was the performance measure for hyperparameter tuning and best model selection in train. The hyperparameters not mentioned in the table were the default in XGBClassifier from Python version 3.7. In brackets is the result for 100 bootstrapping iterations in 95% confidence intervals. AUROC—area under the ROC curve. AUPRC—area under the precision recall curve. AP—Average precision. SMR—Standardized Mortality Ratio.

Brier Skill Score (BSS) for the models using clinical variables and chief complaint against a unskilled reference model using triage priority.

Reference Brier Score = 0.005. We assessed importance estimates of predictors in the train set given by each model. For LR, we used the absolute values of the regression coefficients, for RF and XGBoost we used the importance estimates given by the ensemble of trees. The XGBoost relative importance estimates is presented in Fig 5, and for LR and RF it can be depicted in S1 and S2 Figs. The most important predictor in the XGBoost was the Glasgow coma score. The patients’ arrival mode was ranked a relative importance of approximately 80% and the patients’ age and pulse oximetry were ranked an importance of 60%. The remaining predictors were ranked an importance inferior to 50%. From the information of exams ordered, only the orthopedic exam was ranked a relative importance of 18%, the remaining exams or number of exams were not considered important by the model. Regarding the time predictors, these were ranked with a lower importance (10-20%). For RF, the most important predictor was also the Glasgow coma score. Pulse oximetry was ranked an importance of approximately 55% and the remaining predictors were ranked an importance inferior to 50%. The most important predictor in the LR was the patients’ age. Pulse oximetry and systolic blood pressure were ranked an importance of approximately 50%, followed by temperature, number of exams and orthopedic exam. These were followed by diastolic blood pressure, which was also ranked as important by the XGBoost model. The models also ranked specific missing measured variables as important predictors of the composite outcome. Missing value of pain scale was ranked by XGBoost, LR and RF with an importance estimate of 80%, 70% and 20%, respectively. XGBoost model also ranked the missing values of heart rate and Glasgow coma score an importance of 80% and 70%, respectively.

Fig 5

Relative importance of predictors obtained with XGBoost using all available variables.

Exams are prescribed at the time of triage.

Relative importance of predictors obtained with XGBoost using all available variables.

Exams are prescribed at the time of triage. We finally evaluated the uncertainty in risk prediction given by the XGBoost model built with clinical variables and the chief complaint, through visualization of the predicted probability as presented in Fig 6. The threshold with the value of 0.00824 was selected using the training data, for the best model obtained in train based on the maximum value of AUROC. The higher the distance of the predicted probability to the threshold, the less uncertainty of the model prediction. For the false negative predictions (on top of Fig 6) this distance does not reach 1%. For the false positive predictions, the majority is located bellow 20% of distance to the threshold. For the true positives (on bottom of Fig 6) we observed that there was higher certainty in prediction, since the distances of the probabilities to the threshold were higher. For the true negative predictions, the maximum distance to threshold was bellow 1%. We also analyzed these results in a perspective of triage prioritization, for false negative and false positive classifications, as presented in Figs 7 and 8, respectively. For the case of false positive predictions, we observe that the highest distances to the threshold correspond to patients assigned an emergent priority (MTS-1 color red), while for the urgent priorities (MTS-3 color yellow) the uncertainty is higher since the probabilities are very close to the threshold and all bellow a distance of 20%. From this group of false positive patients, we found that 8% (n = 364) had a critical outcome in an interval of 48 hours after triage, either death, cardiopulmonary arrest, ICU or intermediate care unit admission, sepsis or stroke. In terms of uncertainty, for these 8% patients the absolute minimum, average and maximum distance to the threshold were 0.005%, 6% and 75%. Moreover, those with a distance superior to 20% corresponded to approximately 23% of the group of false positives. In terms of priorities, 14% corresponded to emergent patients with an average distance of 16%, 73% to very urgent patients with an average distance of 5% and 13% to urgent patients with an average distance of 2%.

Fig 6

Risk probability for incorrect (on the top) and correct (on the bottom) classifications obtained with extreme gradient boosting model using clinical variables and chief complaint.

A zoom-in is presented for the predicted risk probability for the plots on the right. The classification threshold is 0.00824.

Fig 7

Predicted risk probability for false negative classifications discriminated by triage priority obtained with XGBoost model using clinical variables and chief complaint.

Fig 8

Predicted risk probability for false positive classifications discriminated by triage priority obtained with XGBoost model using clinical variables and chief complaint.

Risk probability for incorrect (on the top) and correct (on the bottom) classifications obtained with extreme gradient boosting model using clinical variables and chief complaint.

A zoom-in is presented for the predicted risk probability for the plots on the right. The classification threshold is 0.00824.

4 Discussion

In this study we built a model to predict the risk of a composite critical outcome—mortality and cardiopulmonary arrest—in a cohort of emergent, very urgent and urgent patients who presented to the emergency department of a Portuguese hospital from 2011 to 2016. We observed that patients in the positive class were older and presented lower blood pressure, lower pulse oximetry and higher glycaemia levels. We developed models for prediction of the composite outcome in the first 24 hours after emergency department triage using regularized logistic regression, random forests regression and extreme gradient boosting. We assessed the models performance when adding the chief complaint—unstructured text feature—to the clinical and demographic variables. Including chief complaint as predictor increased the model predictive power for logistic regression and extreme gradient boosting. We concluded that although logistic regression presented 2% higher recall compared to extreme gradient boosting, the specificity was 6% higher for the latter model. Since we have a class imbalance this difference is significant, which is reflected in the other measures such as average precision, F1-score and Cohen’s k. When visualizing the models calibration curves, we observed that the calibration improved when adding the chief complaint. In both scenarios, random forests performed poorly, being under confident and focusing on the hard samples close to the classification threshold. An isotonic or sigmoid re-calibration could solve this issue, however it was not pursued since the other two models exhibited better calibration curves and overall performance. Logistic regression and extreme gradient boosting were better calibrated in patients on the lower end of the predicted risk spectrum, but became less well calibrated as the predicted risk increased, over-estimating the risk. We decided that it is preferable to over-estimate rather than under-estimate probabilities of a composite outcome for the high-risk group of patients. The extreme gradient boosting model presented the best calibration and overall performance results, therefore we proceeded the study using this model. We then compared its performance against the one obtained using only the priority assigned to the patients, with the Manchester Triage System. We concluded that the extreme gradient boosting model had a better predictive performance, exhibiting higher performance across all measures. Recall presented the lowest increase (2%), which indicates that the Manchester Triage System might lead to over-triage of patients. This is reflected in the higher value of standardized mortality ratio –- approximately three times higher than the gradient boosting model ratio. Compared to the reference, this model had the ability to identify patients with higher risk of a composite outcome in MTS-3 and presented lower number of false negative classifications for MTS-1 and MTS-2. The important predictors of a composite outcome were identified for the models built using clinical variables and the chief complaint. We observed that for all models, the patients’ age, pulse oximetry and Glasgow coma score were ranked with relative high importance. The most important predictor for both extreme gradient boosting and random forests was Glasgow coma score. The most important predictor in the logistic regression was the patients’ age, which was also ranked with high importance by the other models. The arrival mode was also ranked a high importance by the extreme gradient boosting model. Since a significant amount of patients in critical conditions may arrive by ambulance, this can explain the ranking of importance given by the model to this predictor. The information regarding time for triage was not ranked a high importance by the this model, neither the number or type of exams, except for orthopedic exam which was ranked a relative low importance. Missing values of pain scale and Glasgow coma score were ranked a high importance. These variables might have been selected by the models as important predictors of the composite outcome due to the fact that the patients in severe conditions may be unresponsive, comatosed or unable to verbalize their symptoms and score their level of pain. The extreme gradient boosting model also ranked missing value of heart rate a relative high importance. The patient might have presented evident conditions of criticality or the intervention might have been immediate, and in these cases there was either no need or time to measure heart rate. We finally assessed the uncertainty in risk prediction for the best model. A high uncertainty in the prediction of false positive classifications was observed, where the majority had a prediction very close to the threshold (bellow 5%). We observed that the predictions more far apart from the threshold corresponded to patients assigned an emergent priority. We hypothesize that these patients might have similar characteristics to those with a composite outcome, since they were assigned an emergent priority. However, the intervention might have been successful or on the other hand these patients might have had a composite outcome later than the 24 hours and therefore it was not accounted for our 24-hour outcome. We analyzed for these false positives if they had a composite outcome in the following 24 hours (48 hours after triage) and we concluded that 8% had an outcome of either death, cardiopulmonary arrest, ICU or intermediate care unit admission, sepsis or stroke. Furthermore, from this group of patients, the emergent presented the highest average distance to the threshold, which indicates less uncertainty in the prediction for this priority. When analyzing the correct model classifications, we observed that there was high certainty in the predictions of patients who had a composite outcome and for the majority of those who did not have the outcome, the frequency of predictions was higher for values far apart from the threshold, which corresponded to higher certainty. We concluded that the measures of average precision and F1-score were relatively low compared to the other performance measures due to class imbalance present in the data. A significant proportion of patients did not have a composite outcome of death or cardiopulmonary arrest and may have had other complications, which the model does not take into account. To handle class imbalance we propose to use novelty detection methods as future work. Nonetheless, the extreme gradient boosting model presented good calibration and reliability in predictions. While the reference model presented higher recall for MTS-1 and MTS-2, the extreme gradient boosting model proved to have a higher sensitivity in identifying patients assigned MTS-3. This demonstrates the potential to complement the already existing triage system with a machine learning model and avoid under-triaging.

Detailed exclusion criteria.