Sean C Yu1,2, Mackenzie R Hofford2,3, Albert M Lai2, Marin H Kollef4, Philip R O Payne2, Andrew P Michelson2,4. 1. Department of Biomedical Engineering, School of Engineering, Washington University School in St. Louis, St. Louis, Missouri, USA. 2. Institute for Informatics, Department of Medicine, Washington University School of Medicine in St. Louis, St. Louis, Missouri, USA. 3. Division of General Medicine, Department of Medicine, Washington University School of Medicine in St. Louis, St. Louis, Missouri, USA. 4. Division of Pulmonary and Critical Care, Department of Medicine, Washington University School of Medicine in St. Louis, St. Louis, Missouri, USA.
Abstract
OBJECTIVE: Respiratory support status is critical in understanding patient status, but electronic health record data are often scattered, incomplete, and contradictory. Further, there has been limited work on standardizing representations for respiratory support. The objective of this work was to (1) propose a practical terminology system for respiratory support methods; (2) develop (meta-)heuristics for constructing respiratory support episodes; and (3) evaluate the utility of respiratory support information for mortality prediction. MATERIALS AND METHODS: All analyses were performed using electronic health record data of COVID-19-tested, emergency department-admit, adult patients at a large, Midwestern healthcare system between March 1, 2020 and April 1, 2021. Logistic regression and XGBoost models were trained with and without respiratory support information, and performance metrics were compared. Importance of respiratory-support-based features was explored using absolute coefficient values for logistic regression and SHapley Additive exPlanations values for the XGBoost model. RESULTS: The proposed terminology system for respiratory support methods is as follows: Low-Flow Oxygen Therapy (LFOT), High-Flow Oxygen Therapy (HFOT), Non-Invasive Mechanical Ventilation (NIMV), Invasive Mechanical Ventilation (IMV), and ExtraCorporeal Membrane Oxygenation (ECMO). The addition of respiratory support information significantly improved mortality prediction (logistic regression area under receiver operating characteristic curve, median [IQR] from 0.855 [0.852-0.855] to 0.881 [0.876-0.884]; area under precision recall curve from 0.262 [0.245-0.268] to 0.319 [0.313-0.325], both P < 0.01). The proposed generalizable, interpretable, and episodic representation had commensurate performance compared to alternate representations despite loss of granularity. Respiratory support features were among the most important in both models. 
CONCLUSION: Respiratory support information is critical in understanding patient status and can facilitate downstream analyses.
Managing respiratory status and providing appropriate respiratory support to prevent or mitigate hypoxemia is a critical aspect of clinical management, especially for patients suffering from respiratory conditions such as coronavirus disease 2019 (COVID-19). Leveraging respiratory support information not only provides a more complete clinical picture, but it can also be used in downstream analyses such as sub-phenotyping or predictive modeling, either as a source of features or to identify endpoints. However, generating generalizable conclusions from respiratory support information is difficult due to the heterogeneity of respiratory support methods and settings, the lack of standardized representations, and the poor quality of data regarding respiratory support.
Standardization of patient data representation has accelerated knowledge discovery, replication of results, and translation into practice. Many parts of healthcare information have been standardized, for example through the International Classification of Diseases (ICD) for diagnoses, LOINC for measurements and observations, or the all-encompassing UMLS. However, much of the patient information in electronic health record (EHR) data, such as clinical cultures and respiratory support methods, remains unmapped. Standardization of respiratory support is made challenging by the inherent diversity of methods, especially when considering methods designed for subpopulations, variations, and modifiable settings. Worse yet, the lack of a widely adopted standardization schema has resulted in the usage of highly heterogeneous terms in literature and practice, even for identical concepts. For example, heated and humidified high-flow nasal cannula has been described as high-flow nasal cannula, high humidity nasal cannula, high-flow nasal oxygen therapy, or referred to by brand names such as Airvo™ or Optiflow™.
Prior efforts to standardize respiratory support terms resulted in high granularity, which, while necessary for coverage, can be excessive for downstream tasks. For example, the “Respiratory Therapy” concept in SNOMED-CT contains 14 children, of which “Oxygen Therapy” itself contains 14 children. Thus, there is a need for a pragmatic and parsimonious standardization of respiratory support terms.
Beyond the lack of standardization, extraction of respiratory support status from EHR data is made challenging by scattered, incomplete, and often contradictory documentation, which necessitates the usage of auxiliary documentation and heuristics to determine respiratory support status. Prior heuristic development work, however, is dataset-specific and focuses on endotracheal intubation, and thus fails to address other strata of respiratory support.
Therefore, the objective of the study was to (1) propose a preliminary, parsimonious, and pragmatic terminology system for respiratory support stratified by severity of hypoxemia; (2) develop (meta-)heuristics for the construction of respiratory support episodes from raw and heterogeneous EHR data; and (3) evaluate the terminology system and heuristics by measuring their impact on 30-day mortality prediction through the assessment of feature importance and feature ablation studies.
MATERIALS AND METHODS
Study design, data sources, and population
All patients ≥ 18 years of age admitted to all hospitals within a large Midwestern healthcare system serving the metropolitan St. Louis, mid-Missouri, and southern Illinois regions between March 1, 2020 and April 1, 2021 were eligible for inclusion. Patients were included if they had a COVID-19 PCR or antigen test, positive or negative, within 14 days prior to or 7 days after hospital admission. Patients were excluded if they had a hospital length of stay (LOS) < 24 h to allow for a sufficient observation window for the predictive modeling study. Patients without associated demographics, comorbidities, or location data were excluded. Only patients with at least 5 heart rate and 5 SpO2 measurements during the first 24 h of hospital arrival were included. EHR data, including demographics, vital signs, lab results, flowsheet entries, and so on, were extracted for all included subjects. This project was approved with a waiver of informed consent by the Washington University in St. Louis Institutional Review Board.
Classification system of respiratory support methods
To facilitate generalizability and reproducibility of studies leveraging respiratory support information, we propose the following terminology, in order of increasing severity: Low-Flow Oxygen Therapy (LFOT), High-Flow Oxygen Therapy (HFOT), Non-Invasive Mechanical Ventilation (NIMV), Invasive Mechanical Ventilation (IMV), and ExtraCorporeal Membrane Oxygenation (ECMO).
Meta-heuristics for identification of respiratory support episodes
The authors of MIMIC-III, in addition to the data, also published code for data processing and analysis to expedite and encourage collaborative research. Among the SQL scripts in their GitHub repository is one for calculating mechanical ventilation duration. Essentially, their logic was to chain together proximal pieces of documentation that are indicative of mechanical ventilation to form episodes with start and end times. Their heuristic has since been used successfully by researchers using the MIMIC-III dataset. To generalize the heuristic for other forms of respiratory support beyond mechanical ventilation, and for other datasets beyond MIMIC-III, we developed a generalized version of the MIMIC-III heuristic (a “meta-heuristic”) to guide the development of heuristics for the assembly of respiratory support episodes as follows:
1. Define 2 parameters:
   MIN_DURATION for the minimum episode duration
   EXTENSION_TOLERANCE for the maximum allowable time gap between documentation for the formation of episodes
2. Identify timestamped documentation that is indicative of the presence of respiratory support
3. Link consecutive documentation occurring within EXTENSION_TOLERANCE into episodes
4. Discard any episodes with duration less than or equal to MIN_DURATION
Next, as we conceived of the respiratory support methods as being mutually exclusive, respiratory support episodes are “flattened” into a single timeline such that at any given time, a patient is on either no respiratory support or a single respiratory support method, by giving higher severity methods priority. Finally, the respiratory support trajectories are “repaired” such that gaps between episodes with a duration less than EXTENSION_TOLERANCE are filled by extending the preceding episode. In addition, gaps at the beginning and end of the patient stay (between encounter start time and first respiratory support episode start time, and between last respiratory support episode end time and encounter end time), if less than MIN_DURATION, are filled by extending the first or last episode, respectively.
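The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (build_episodes, flatten, repair), the data layout (a list of timestamps per method), and the example timestamps are all hypothetical, and the parameter values of 6 h and 24 h are taken from the Figure 1 example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Proposed methods, ordered from lowest to highest severity.
SEVERITY = ["LFOT", "HFOT", "NIMV", "IMV", "ECMO"]
Episode = Tuple[datetime, datetime]        # (start, end)
Segment = Tuple[datetime, datetime, str]   # (start, end, method)


def build_episodes(times: List[datetime], min_duration: timedelta,
                   extension_tolerance: timedelta) -> List[Episode]:
    """Link timestamps within the tolerance, then drop episodes <= min_duration."""
    episodes: List[Episode] = []
    for t in sorted(times):
        if episodes and t - episodes[-1][1] <= extension_tolerance:
            episodes[-1] = (episodes[-1][0], t)   # extend the current episode
        else:
            episodes.append((t, t))               # start a new episode
    return [(s, e) for s, e in episodes if e - s > min_duration]


def flatten(episodes_by_method: Dict[str, List[Episode]]) -> List[Segment]:
    """Merge per-method episodes into one timeline; higher severity wins."""
    bounds = sorted({t for eps in episodes_by_method.values()
                     for s, e in eps for t in (s, e)})
    timeline: List[Segment] = []
    for lo, hi in zip(bounds, bounds[1:]):
        active = [m for m in SEVERITY
                  if any(s <= lo and hi <= e
                         for s, e in episodes_by_method.get(m, []))]
        if not active:
            continue                               # no support in this interval
        method = active[-1]                        # highest-severity method
        if timeline and timeline[-1][2] == method and timeline[-1][1] == lo:
            timeline[-1] = (timeline[-1][0], hi, method)
        else:
            timeline.append((lo, hi, method))
    return timeline


def repair(timeline: List[Segment], enc_start: datetime, enc_end: datetime,
           min_duration: timedelta, extension_tolerance: timedelta) -> List[Segment]:
    """Fill short gaps between segments and at the encounter boundaries."""
    if not timeline:
        return []
    fixed = [list(seg) for seg in timeline]
    zero = timedelta(0)
    for prev, nxt in zip(fixed, fixed[1:]):
        if zero < nxt[0] - prev[1] < extension_tolerance:
            prev[1] = nxt[0]                       # extend the preceding segment
    if zero < fixed[0][0] - enc_start < min_duration:
        fixed[0][0] = enc_start
    if zero < enc_end - fixed[-1][1] < min_duration:
        fixed[-1][1] = enc_end
    return [tuple(seg) for seg in fixed]


# Hypothetical example: hours are offsets from an arbitrary reference time.
MIN_DURATION = timedelta(hours=6)
EXTENSION_TOLERANCE = timedelta(hours=24)
t0 = datetime(2020, 3, 1)


def at(hours: float) -> datetime:
    return t0 + timedelta(hours=hours)


docs = {
    "LFOT": [at(0), at(8), at(20), at(60), at(70)],
    "HFOT": [at(50)],                              # single note: too short to keep
    "IMV": [at(10), at(16), at(30), at(40)],
}
episodes = {m: build_episodes(ts, MIN_DURATION, EXTENSION_TOLERANCE)
            for m, ts in docs.items()}
timeline = flatten(episodes)
repaired = repair(timeline, at(-2), at(72), MIN_DURATION, EXTENSION_TOLERANCE)
```

In this example the isolated HFOT note is discarded, IMV takes priority over the overlapping LFOT episode, and the 20 h gap after IMV is filled by extending the IMV segment. Which documentation counts as evidence for each method remains site-specific and is not addressed by the sketch.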
Evaluation through in-hospital mortality prediction
A predictive modeling study was designed to evaluate the utility of the respiratory support information extracted through our heuristics on downstream analyses. The task was to predict, at 24 h after hospital arrival, in-hospital mortality within 30 days for a COVID-19-tested, adult cohort presenting to an emergency department (ED). A total of 121 baseline features were generated from demographic, laboratory, vital sign, and other clinical data extracted from the EHR. For most numeric measurements, the median value during the observation window was extracted, but for frequent measurements such as heart rate, other distributional statistics (25th quantile, 75th quantile, and interquartile range) were also extracted.
Ten additional respiratory-support-derived features were generated using the proposed classification schema and heuristics, including the duration of respiratory support per type and the last respiratory support during the observation period. For comparison, we identified a small set of measurements related to respiratory status (“Related”): fraction of inspired oxygen (FiO2) and oxygen flow rate. We also extracted the EHR-native representation of respiratory support, called oxygen delivery method (“O2 Del Method”), which included ETT, CPAP, T-Piece, and so on. Lastly, we considered a set of features based on the proposed classification, but using the raw time-stamped data prior to assembly into episodes (“Raw”). The feature sets including explicit respiratory support information (“O2 Del Method,” “Raw,” and “Proposed”) all also include the “Baseline” and “Related” features (Online Supplemental Materials, eTable 1, eFigure 1).
The compared algorithms were logistic regression (LogReg) and XGBoost (XGB). For LogReg, features were standardized and mean-imputed, whereas for XGB, features were left as is. Hyperparameters such as regularization strength were optimized for log loss using the baseline features through 1000-iteration, 4-fold cross-validation (Online Supplemental Materials, eAppendix A). Once optimal hyperparameters were identified, five replicates of 2-fold cross-validation were performed to generate a distribution of performance metrics: area under receiver operating characteristic curve (AUROC), area under precision recall curve (AUPRC), and negative log loss. The distributions of performance metrics were compared using the Wilcoxon signed-rank test (two-sided, paired). Feature importance was quantified using SHapley Additive exPlanations (SHAP) values for the XGBoost model and coefficient values for the logistic regression model, both of which were aggregated over 100 bootstrap samples.
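As a rough illustration of how episode-based features such as duration per type and last respiratory support might be derived, the sketch below computes them from a flattened timeline over a 24 h observation window. The function name respiratory_features, the feature names, and the reading of “last respiratory support” as the most recent method seen within the window are assumptions made for this example; the exact feature definitions are given in the Online Supplemental Materials.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Union

METHODS = ["LFOT", "HFOT", "NIMV", "IMV", "ECMO"]
Segment = Tuple[datetime, datetime, str]   # (start, end, method), non-overlapping


def respiratory_features(timeline: List[Segment], window_start: datetime,
                         window_hours: float = 24.0) -> Dict[str, Union[float, str]]:
    """Hours on each method inside the window, plus the most recent method."""
    window_end = window_start + timedelta(hours=window_hours)
    hours = {m: 0.0 for m in METHODS}
    last_method = "None"
    for start, end, method in sorted(timeline):
        # Clip the segment to the observation window.
        lo, hi = max(start, window_start), min(end, window_end)
        if hi > lo:
            hours[method] += (hi - lo).total_seconds() / 3600.0
            last_method = method   # segments are sorted, so this ends as the latest
    features: Dict[str, Union[float, str]] = {
        m + "_hours": h for m, h in hours.items()
    }
    # Time in the window not covered by any respiratory support episode.
    features["None_hours"] = window_hours - sum(hours.values())
    features["last_support"] = last_method
    return features


# Hypothetical encounter: LFOT from hour 2 to 10, then IMV from hour 10 to 60.
arrival = datetime(2020, 3, 1)
example_timeline = [
    (arrival + timedelta(hours=2), arrival + timedelta(hours=10), "LFOT"),
    (arrival + timedelta(hours=10), arrival + timedelta(hours=60), "IMV"),
]
features = respiratory_features(example_timeline, arrival)
```

Because the timeline is flattened, the per-method hours and the unsupported hours sum to the window length, which mirrors the “None” row reported alongside the per-method durations in Table 2.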
Statistical analysis
Variables were summarized using frequencies and proportions for categorical data and medians and interquartile ranges or means and standard deviations for continuous data. Statistical comparisons were performed using the Chi-square and Mann-Whitney U tests where appropriate unless specified otherwise. A P-value < 0.01 was considered statistically significant. All resampling analyses (cross-validation and bootstrap) were performed using a fixed seed. All analysis and figure generation were performed with Python version 3.7.1 (Python Software Foundation, Beaverton, OR) using the following packages: scipy, numpy, pandas, matplotlib, sklearn, xgboost, and shap.
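To make the role of the fixed seed concrete, the following sketch shows a seeded bootstrap that summarizes a statistic by its median, interquartile range, and 2.5th and 97.5th percentiles, the same summary used in the feature importance figures. It is a simplified stand-in, not the study's pipeline: it resamples a plain list of hypothetical values and applies a simple statistic rather than refitting a model on each resample, the function name bootstrap_summary is invented for illustration, and statistics.quantiles requires Python 3.8 or later.

```python
import random
import statistics
from typing import Callable, Dict, List


def bootstrap_summary(values: List[float], stat: Callable[[List[float]], float],
                      n_boot: int = 100, seed: int = 0) -> Dict[str, float]:
    """Resample with replacement n_boot times and summarize the statistic."""
    rng = random.Random(seed)          # fixed seed makes the resampling repeatable
    n = len(values)
    replicates = [
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    # 39 cut points at 2.5% steps: index 0 is the 2.5th percentile,
    # 9 the 25th, 19 the 50th, 29 the 75th, and 38 the 97.5th.
    cuts = statistics.quantiles(replicates, n=40, method="inclusive")
    return {
        "p2.5": cuts[0],
        "q25": cuts[9],
        "median": cuts[19],
        "q75": cuts[29],
        "p97.5": cuts[38],
    }


# Hypothetical values standing in for a quantity of interest.
sample = [0.12, 0.40, 0.35, 0.22, 0.51, 0.09, 0.33, 0.28, 0.45, 0.19]
summary = bootstrap_summary(sample, statistics.mean, n_boot=100, seed=42)
repeat = bootstrap_summary(sample, statistics.mean, n_boot=100, seed=42)
```

Calling the function twice with the same seed yields identical summaries, which is the property a fixed seed is meant to guarantee for cross-validation splits and bootstrap samples alike.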
RESULTS
The severity-stratified respiratory support methods and the documentation serving as evidence for each method are listed in Table 1. Examples of the full heuristic application process, along with documentation germane to respiratory status, can be found in Figure 1. Figure 2 shows respiratory support utilization over time, with patients temporally aligned at ED arrival.
Table 1.
Respiratory support terminologies and heuristics
| Respiratory support method | Documentation serving as evidence |
|---|---|
| Low-Flow Oxygen Therapy (LFOT) | Oxygen delivery method documentation: nasal cannula, non-rebreather mask, simple mask, venturi mask, aerosol mask, face tent. Oxygen flow rate ≤ 15 L/min |
| High-Flow Oxygen Therapy (HFOT) | Oxygen delivery method documentation: high-flow nasal cannula, high humidity nasal cannula, optiflow. Oxygen flow rate > 15 L/min |
| Non-Invasive Mechanical Ventilation (NIMV) | Documentation of “NPPV Status” as “In Use” in the “Adult NPPV/NIV” flowsheet |
| Invasive Mechanical Ventilation (IMV) | Documentation of “Vent Status” as “In Use” in the “Ventilator Documentation” flowsheet |
| ExtraCorporeal Membrane Oxygenation (ECMO) | Documentation of “Pump Flow (L/min)” or “ECMO Pump Speed (RPM)” in the ECMO or VAD flowsheets |
Figure 1.
Respiratory support trajectory example. This plot demonstrates the application of the respiratory support trajectory heuristics on a full, single patient encounter using data elements from Table 1, MIN_DURATION of 6 h, and EXTENSION_TOLERANCE of 24 h. The x-axis indicates time with each black tick indicating 24 h and red tick indicating 6 h. The top 5 subplots each pertain to a single respiratory support method where vertical lines indicate the times at which pieces of documentation serving as evidence for respiratory support were documented. The individual sub-trajectories are all merged into a single timeline as shown in the “flattened” subplot, after which it is repaired according to the heuristic. The subplots below the “repaired” subplot provide context for the patient, showing patient location and measurements pertaining to respiratory status. Abbreviations: ECMO: extracorporeal membrane oxygenation; ED: emergency department; HFOT: high-flow oxygen therapy; ICU: intensive care unit; IMV: invasive mechanical ventilation; LFOT: low-flow oxygen therapy; NIMV: non-invasive mechanical ventilation.
Figure 2.
Respiratory support utilization aligned at arrival. All patients were aligned at ED arrival, and their usage of respiratory support was plotted for the first 4 weeks. The top subplot shows the total number of patients utilizing each respiratory support method, whereas the bottom subplot shows the proportion. As expected, patients who have been in the hospital longer are more likely to be on higher levels of respiratory support. Abbreviations: ECMO: extracorporeal membrane oxygenation; HFOT: high-flow oxygen therapy; IMV: invasive mechanical ventilation; LFOT: low-flow oxygen therapy; NIMV: non-invasive mechanical ventilation.
Cohort characteristics and outcomes for the patient population used in this study can be seen in Table 2. During the study period there were 45,908 hospitalizations lasting at least 24 h available for analysis. Of these, 1601 (3.5%) experienced in-hospital death within 30 days. Non-survivors were older, more likely to be male, more likely to be COVID-19 positive, and had a longer length of stay (Table 2).
Table 2.
Cohort characteristics
Outcome = in-hospital mortality within 30 days of index time
| Variable | Total (n = 45,908) | Outcome: Yes (n = 1601, 3.5%) | Outcome: No (n = 44,307, 96.5%) | P(a) |
|---|---|---|---|---|
| Age (years), median (IQR) | 64.0 (51.0–76.0) | 73.0 (62.0–83.0) | 64.0 (50.0–76.0) | < 0.01* |
| Male, n (%) | 22,638 (49.3%) | 889 (55.5%) | 21,749 (49.1%) | < 0.01* |
| Race, n (%) | | | | < 0.01* |
| White | 28,032 (61.1%) | 933 (58.3%) | 27,099 (61.2%) | 0.021 |
| Black | 16,706 (36.4%) | 576 (36.0%) | 16,130 (36.4%) | 0.747 |
| Asian | 340 (0.7%) | 17 (1.1%) | 323 (0.7%) | 0.168 |
| Other/unknown | 830 (1.8%) | 75 (4.7%) | 755 (1.7%) | < 0.01* |
| BMI, median (IQR) | 27.5 (23.2–33.4) | 27.1 (22.7–32.2) | 27.5 (23.2–33.4) | < 0.01* |
| COVID-19 positive, n (%) | 8332 (18.1%) | 502 (31.4%) | 7830 (17.7%) | < 0.01* |
| Respiratory support duration during observation window (h), mean ± SD | | | | |
| None | 16.7 ± 10.4 | 7.5 ± 10.2 | 17.0 ± 10.2 | < 0.01* |
| LFOT | 5.31 ± 9.16 | 7.36 ± 10.05 | 5.24 ± 9.12 | < 0.01* |
| HFOT | 0.39 ± 2.70 | 1.86 ± 5.73 | 0.34 ± 2.50 | < 0.01* |
| NIMV | 0.76 ± 3.74 | 1.47 ± 5.23 | 0.73 ± 3.68 | < 0.01* |
| IMV | 0.85 ± 4.21 | 5.79 ± 9.73 | 0.67 ± 3.75 | < 0.01* |
| ICU transfer, n (%) | 10,311 (22.5%) | 1321 (82.5%) | 8990 (20.3%) | < 0.01* |
| Total LOS (h), median (IQR) | 99.7 (59.0–172.7) | 166.8 (83.4–313.6) | 98.6 (58.2–170.1) | < 0.01* |
| In-hospital mortality, n (%) | 1682 (3.7%) | 1601 (100.0%) | 81 (0.2%) | < 0.01* |
Abbreviations: BMI: body mass index; COVID-19: Coronavirus disease 2019; ECMO: extracorporeal membrane oxygenation; HFOT: high-flow oxygen therapy; ICU: intensive care unit; IMV: invasive mechanical ventilation; IQR: interquartile range; LFOT: low-flow oxygen therapy; LOS: length of stay; NIMV: non-invasive mechanical ventilation; SD: standard deviation.
(a) Comparison of variables between those with and without the primary outcome of 30-day in-hospital mortality was performed using Mann-Whitney U test for continuous variables, and χ2 for categorical variables. Statistical significance, P < 0.01, is indicated by *.
The optimized hyperparameters (Online Supplemental Materials, eAppendix A) were used for generating distributions of performance metrics through repeated cross-validation (Figure 3). For both XGB and LogReg, the addition of “Related” features significantly improved on “Baseline,” and the addition of “O2 Del Method,” “Raw,” or “Proposed” features improved on “Related” across all 3 metrics: AUROC, AUPRC, and negative log loss (Figure 3, Online Supplemental Materials, eTable 2, eTable 3). However, “O2 Del Method,” “Raw,” and “Proposed” rarely differed significantly, and when they did, the differences were very small, as was the case for LogReg AUROC between “O2 Del Method” and “Proposed” (0.887 [0.884-0.890] and 0.887 [0.885-0.891], P < 0.01, Online Supplemental Materials, eTable 3).
Figure 3.
In-hospital mortality prediction feature ablation performance comparison. Comparison of in-hospital mortality prediction performance for LogReg and XGB models with varying sets of features from 5-repeat, 2-fold cross-validation. “Baseline” includes demographics, common lab results, and vital signs from the EHR data. “Related” also includes O2 flow rate and fraction of inspired oxygen. In addition, “O2 Del Method” includes the EHR-native representation of respiratory support status, “Raw” includes data from the proposed approach prior to assembly into episodes, and “Proposed” includes features derived from respiratory support episodes based on the proposed approach. The center horizontal line represents median, box represents the interquartile range between 25th and 75th percentiles, and whiskers represent 2.5th and 97.5th percentiles. Abbreviations: AUPRC: area under precision recall curve; AUROC: area under receiver operating characteristic curve; LogReg: logistic regression; XGB: extreme gradient boosted trees model.
Six of the top 20 most impactful features for the LogReg model were respiratory-support-derived features, including last respiratory support, IMV, and LFOT duration (Figure 4). For the XGB model, respiratory-support-derived features ranked 10th (last respiratory support, IMV) and 17th (LFOT duration) (Figure 5).
Figure 4.
Logistic regression feature importance. The left subplot shows the top 20 most important features in the logistic regression model based on coefficient values aggregated over 100 bootstrap samples, and the right subplot shows the absolute coefficient values. For each feature, the center vertical line represents median, box represents the interquartile range between 25th and 75th percentiles, and whiskers represent 2.5th and 97.5th percentiles. Features based on respiratory support information are colored/shaded in red. Abbreviations: FiO2: fraction of inspired oxygen; GCS: Glasgow Coma Scale; HFOT: high-flow oxygen therapy; IMV: invasive mechanical ventilation; LFOT: low-flow oxygen therapy; MAP: mean arterial pressure; NIMV: non-invasive mechanical ventilation; PLT: platelet count; RDW_CV: red blood cell distribution width coefficient of variation.
Figure 5.
XGBoost SHAP feature importance. The left subplot shows the top 20 most important features in the XGB model based on absolute mean SHAP values aggregated over 100 bootstrap samples. The right subplot shows the individual SHAP value for the same top 20 features, for all encounters in the full dataset. For each feature, the center vertical line represents median, box represents the interquartile range between 25th and 75th percentiles, and whiskers represent 2.5th and 97.5th percentiles. Features based on respiratory support information are colored/shaded in red. Abbreviations: ASP: aspartate aminotransferase; BMI: body mass index; FiO2: fraction of inspired oxygen; GCS: Glasgow Coma Scale; IMV: Invasive Mechanical Ventilation; LFOT: low-flow oxygen therapy; PLT: platelet count.
DISCUSSION
In this study, we (1) propose a preliminary, parsimonious, and pragmatic terminology system for respiratory support methods; (2) develop (meta-)heuristics for extraction of respiratory support information from EHR data; and (3) investigate the utility of the respiratory support information extracted through the proposed terminology system and heuristics via a mortality prediction study in a COVID-19-tested, ED-admit, adult cohort. The developed heuristic was successfully applied to EHR data to extract respiratory support episodes, which were then used for in-hospital mortality prediction as features, which were found to be among the most important features for both LogReg and XGB models.Compared to models using demographics and commonly documented lab results and vital signs, the addition of respiratory-support-related information, FiO2 and O2 flow rate, significantly improved prediction performance for both XGB and LogReg across all measured performance metrics. Moreover, the additional inclusion of explicit respiratory support information further improved performance significantly, again for both model types and across all metrics.Because there are no other dataset-agnostic, full-severity-spanning classification, and heuristic system for respiratory support information found in literature, we compared our proposed approach against 2 other methods of explicit respiratory status representation: “O2 Del Method” which used the EHR-native representation and “Raw” which uses the data elements for the proposed approach but prior to assembly into episodes (Online Supplemental Materials, eTable 2, eFigure 1). 
The proposed representation had commensurate performance to alternate representations of explicit respiratory support information, despite loss of both conceptual and temporal granularity resulting from the aggregation of heterogeneous timestamped raw data into the more human-understandable format of encounter-spanning series of episodes (Figure 3).The increase in model performance associated with the addition of explicit respiratory support information in XGB was less than that of LogReg. We hypothesize that more complex models are able to infer respiratory support status or reconstitute information contained in respiratory status based on other features.For the models using features from the proposed approach, IMV status and duration was an important predictor which is unsurprising—patients who are intubated are known to have higher rates of in-hospital mortality. This result simply underscores the importance of leveraging respiratory support information, especially those of high severity, for understanding patient status. However, even features based on low-flow oxygen therapy status were among the most important features for both the XGB and LogReg models, indicating that lower severity respiratory support information is also critical for developing a complete clinical picture of patients.In this study, respiratory support information was used as features for in-hospital mortality prediction. However, there are many other potential uses of respiratory support information, such as endpoint identification, patient sub-phenotype discovery, patient trajectory analytics, or characterization of patient cohorts.,,As documentation regarding respiratory support status varies across time and across sites, identification of timestamped documentation that serve as evidence for respiratory support cannot be generalized and must be specified in each study by researchers with appropriate knowledge of local practice patterns contained within the dataset. 
Therefore, a (meta-)heuristic was developed which provides the structure for developing heuristics to establish episodes for any respiratory support method.
As is typical of studies using EHR data, this study suffers from missingness and inaccuracy of information. For example, we identified patients admitted through the ED with a recent positive COVID-19 test who were transferred to and remained in the ICU for several days, yet had no documentation of respiratory support throughout their entire stay. While heuristics can work with scattered, conflicting, and incomplete documentation, significant missingness will still result in unrealistic scenarios. There is a trade-off in setting the MIN_DURATION parameter: if it is too long, then temporary/interim respiratory supports will be underrepresented; conversely, if it is too short, then the heuristic will allow for unrealistically rapid oscillation among respiratory support methods. Additionally, tracheostomy status was considered orthogonal to the proposed system. NIMV is often used nightly for sleep apnea; thus, researchers utilizing the heuristic must decide whether to ignore those episodes based on their needs. Also, patients can be connected to a device for respiratory support but not be actively using it (eg, delivering no or low-flow oxygen through a device capable of delivering high-flow oxygen); thus, researchers must decide which is more important for their work: the occupation of the device or the active use of the device. Respiratory support methods are ever-evolving; helmet NIV and high-flow nasal cannula, for instance, have only recently been used for adult patients, meaning that the terminology system will also require regular revisiting and updating.
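The MIN_DURATION trade-off described above can be illustrated with a minimal sketch. This is not the heuristic used in the study; it is a simplified stand-in in which each timestamped observation is assumed to already carry a method label (LFOT, HFOT, IMV), and any episode shorter than MIN_DURATION is absorbed into the preceding episode. The function name `build_episodes`, the tuple layouts, the absorption rule, and the toy timestamps are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Observation = Tuple[datetime, str]        # (timestamp, method label); assumed input shape
Episode = Tuple[str, datetime, datetime]  # (method, start, end); assumed output shape


def build_episodes(observations: List[Observation],
                   encounter_end: datetime,
                   min_duration: timedelta) -> List[Episode]:
    """Collapse timestamped evidence into contiguous episodes (illustrative only)."""
    # Pass 1: keep only the observations at which the method changes.
    changes: List[Observation] = []
    for ts, method in sorted(observations):
        if not changes or changes[-1][1] != method:
            changes.append((ts, method))

    # Each candidate episode runs until the next change (or the encounter end).
    raw: List[Episode] = []
    for i, (start, method) in enumerate(changes):
        end = changes[i + 1][0] if i + 1 < len(changes) else encounter_end
        raw.append((method, start, end))

    # Pass 2: absorb too-short episodes into the preceding episode (assumed rule)
    # and merge neighbours that end up with the same method.
    # The very first episode is always kept, even if it is short.
    episodes: List[Episode] = []
    for method, start, end in raw:
        too_short = end - start < min_duration
        if episodes and (too_short or episodes[-1][0] == method):
            prev_method, prev_start, _ = episodes[-1]
            episodes[-1] = (prev_method, prev_start, end)
        else:
            episodes.append((method, start, end))
    return episodes


# Toy encounter: a 20-minute HFOT interlude between two stretches of LFOT.
day = datetime(2021, 1, 1)
obs = [
    (day.replace(hour=8), "LFOT"),
    (day.replace(hour=10), "HFOT"),
    (day.replace(hour=10, minute=20), "LFOT"),
    (day.replace(hour=14), "IMV"),
]
end_of_stay = day.replace(hour=20)

coarse = build_episodes(obs, end_of_stay, timedelta(hours=1))   # long MIN_DURATION
fine = build_episodes(obs, end_of_stay, timedelta(minutes=10))  # short MIN_DURATION
```

With the longer setting, the brief HFOT interlude disappears and the surrounding LFOT stretches merge into a single episode; with the shorter setting, every switch is preserved. A real implementation would also need site-specific rules for mapping raw documentation to method labels, which this sketch takes as given.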
CONCLUSION
To facilitate generalizable and reproducible research, a terminology system was developed for standardized representation of respiratory support methods. (Meta-)heuristics were also developed to enable extraction of respiratory support episodes from EHR data and their transformation into an encounter-spanning set of respiratory support trajectories. To demonstrate the utility of respiratory support information extracted through the proposed methods, feature ablation and feature importance analyses were performed via an in-hospital mortality prediction study for COVID-19-tested, ED-admit, adult patients. The addition of features generated from the proposed approach significantly improved model performance. Further, those features were found to be among the most important for both models. Finally, the proposed approach, which generated more interpretable and generalizable representations, had commensurate performance to alternate representations of explicit respiratory support information despite the loss of conceptual and temporal granularity.
FUNDING
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
AUTHOR CONTRIBUTIONS
SCY, MH, and APM conceived of the idea and planned the experiments. SCY performed the analysis and drafted the manuscript. MH and APM made substantial contributions to the language of the manuscript. AML, MHK, and PROP supervised the project for methodological rigor and veracity. All authors provided critical feedback and contributed to the final manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
The EHR data underlying this article cannot be shared publicly to protect the privacy of individuals included in the study, and thus is available only for authorized Washington University in St. Louis (WashU) investigators.