Literature DB >> 34115013

Predicting ventilator-associated pneumonia with machine learning.

Christine Giang1, Jacob Calvert, Keyvan Rahmani, Gina Barnes, Anna Siefkas, Abigail Green-Saxena, Jana Hoffman, Qingqing Mao, Ritankar Das.   

Abstract

Ventilator-associated pneumonia (VAP) is the most common and fatal nosocomial infection in intensive care units (ICUs). Existing methods for identifying VAP display low accuracy, and their use may delay antimicrobial therapy. VAP diagnostics derived from machine learning (ML) methods that utilize electronic health record (EHR) data have not yet been explored. The objective of this study is to compare the performance of a variety of ML models trained to predict whether VAP will be diagnosed during the patient stay.

A retrospective study examined data from 6126 adult ICU encounters lasting at least 48 hours following the initiation of mechanical ventilation. The gold standard was the presence of a diagnostic code for VAP. Five different ML models were trained to predict VAP 48 hours after initiation of mechanical ventilation. Model performance was evaluated with regard to the area under the receiver operating characteristic (AUROC) curve on a 20% hold-out test set. Feature importance was measured in terms of Shapley values.

The highest performing model achieved an AUROC value of 0.854. The most important features for the best-performing model were the length of time on mechanical ventilation, the presence of antibiotics, sputum test frequency, and the most recent Glasgow Coma Scale assessment.

Supervised ML using patient EHR data is promising for VAP diagnosis and warrants further validation. This tool has the potential to aid the timely diagnosis of VAP.
Copyright © 2021 the Author(s). Published by Wolters Kluwer Health, Inc.


Year:  2021        PMID: 34115013      PMCID: PMC8202554          DOI: 10.1097/MD.0000000000026246

Source DB:  PubMed          Journal:  Medicine (Baltimore)        ISSN: 0025-7974            Impact factor:   1.817


Introduction

Pneumonia is one of the deadliest infections across the globe. It can be acquired in community or hospital settings, as well as through the use of invasive mechanical ventilation. Ventilator-associated pneumonia (VAP) is notoriously difficult to diagnose due to the absence of a diagnostic gold standard, which can be attributed to the diversity of disease-causing pathogens, lung cultures that limit sampling to anatomical surfaces, clinical interpretation of pathogens extracted in lung samples, underlying health issues in individuals, and agreement of radiological scans with Centers for Disease Control and Prevention (CDC) diagnostic criteria. VAP, defined as pneumonia that develops more than 48 hours after the initiation of invasive mechanical ventilation, is estimated to occur in 5% to 67% of intubated patients, with the highest rates among those hospitalized due to physical trauma. Mortality estimates range from 24% to 76%, and VAP is associated with an estimated $47,238 in additional healthcare costs per stay in the United States.

Given the high mortality rate of VAP and the vulnerability of the patient population at risk, prevention is crucial to clinical management. The CDC's evidence-based recommendations to prevent VAP encompass in-hospital safety protocols that limit the introduction of microbes into the lungs of intubated patients, such as opting for non-invasive mechanical ventilation methods and avoiding intubation when feasible, keeping levels of subglottic secretions low through frequent secretion removal, and maintaining intubated patients at an elevated chest position. Diagnostic methods for VAP vary and rely on a combination of factors, including clinical assessment for infectious signs and symptoms (eg, fever), chest radiology, lung biopsy, and/or quantitative microbiological testing of respiratory secretions. However, diagnostic accuracy for VAP continues to be poor.
A recent meta-analysis by Fernando et al found that eight common clinical criteria for VAP diagnosis lacked specificity, highlighting the need for diagnostic tools that can better inform the need for and timing of antibiotic treatment while respecting antibiotic stewardship. Despite the shortcomings of existing clinical indicators and scoring tools for VAP diagnosis, there is minimal research on the application of machine learning (ML) methods using electronic health record (EHR) data to diagnose VAP. Studies of ML applications for pneumonia have focused largely on mortality prediction among known pneumonia cases. ML has been proposed as a component of other diagnostic tools, such as to assist with chest radiography interpretation, to predict pneumonia outcomes, or to work with electronic sensors to detect pneumonia in exhaled breath. Yet ML has not been robustly investigated for developing stand-alone diagnostic tools.

The dearth of such ML research may reflect the very reason for its urgent need: the lack of a clear diagnostic gold standard. The absence of such a standard complicates the identification of pneumonia onset in retrospective EHR data, making it difficult to develop a supervised ML approach to pneumonia prediction. For VAP, the difficulty in identifying onset time in retrospective data is partially alleviated by the constraints on onset time relative to the start of mechanical ventilation, as per the definition of VAP. It is possible to determine when inpatients have reached the 48th hour from the initiation of mechanical ventilation, the minimum required time for pneumonia to be designated as "ventilator-associated." Using this definition, it is possible to develop and assess the ability of ML technology to accurately diagnose VAP in retrospective intensive care unit (ICU) data.

Objective

This exploratory analysis examined the suitability of ML methods for the prediction of VAP. To meet this goal, we developed and assessed a variety of ML approaches for the following two prediction tasks:

Intubation task. Among ICU encounters lasting at least 48 hours following the initiation of mechanical ventilation, predict whether or not the given encounter will be associated with a diagnosis of VAP at any later time during the patient's stay, with predictions generated at the 48th hour following intubation.

Admission task. Among all ICU encounters lasting at least k hours, classify whether or not the given encounter will be associated with a diagnosis of VAP at any time, with classifications made at the kth hour following admission. Note that, for this task, classifications could be made prior to the initiation of mechanical ventilation.

Materials and methods

Data processing

For all models, data were extracted from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC)-III version 1.3 dataset, collected from the ICU at Beth Israel Deaconess Medical Center in Boston, Massachusetts. MIMIC-III contains EHR data (including lab results) and clinical notes on over 40,000 individual patient encounters. All MIMIC-III data were passively extracted from the patient EHR, were de-identified, and were collected in compliance with the Health Insurance Portability and Accountability Act.

Intubation task

Data were included from encounters of patients aged 18 years or older, with a minimum of one observation of each of the following vital signs and lab tests: diastolic blood pressure, creatinine, Glasgow Coma Scale (GCS), heart rate, oxygen saturation (SpO2), platelet count, respiratory rate (RR), systolic blood pressure, temperature, hematocrit, and white blood cell count. Hematocrit has been shown to improve other pneumonia-related predictions. Community-acquired pneumonia patients were identified by the presence of a pneumonia diagnosis at admission and were excluded. All encounters for this task were required to involve at least one period of invasive mechanical ventilation. As VAP is defined as pneumonia developing after 48 hours following intubation, encounters were required to last at least 48 hours after intubation. All mechanically ventilated patients in this dataset met this requirement. ML models were compared against the CURB-65, VAP predisposition, insult, response, organ dysfunction (PIRO), and clinical pulmonary infection score (CPIS) scoring systems. To facilitate the comparison with CURB-65, we required encounters to include at least one measurement of blood urea nitrogen (BUN). These exclusion steps are summarized in Figure 1. For this task, the windows of data used to generate predictions were calculated backward from the 48th hour following the initiation of ventilation. That is, a 12-hour intubation task model used the 12 hours of data up to and including the 48th hour after the initiation of mechanical ventilation, ie, hours 37 to 48. All windows for this intubation task included data from an identical number of patients.
Figure 1

Patient exclusion numbers for the intubation task. BUN = blood urea nitrogen, CAP = community acquired pneumonia, MV = mechanical ventilation.


Admission task

Identical exclusion criteria were applied for the admission task as for the intubation task, with the exception of the mechanical ventilation requirement, which was not applied. For this task, the windows of data used to generate predictions were calculated forward from the time of ICU admission. For example, a 12-hour admission task model used the first 12 hours of data after a patient was admitted to the ICU, after which point a VAP risk prediction was generated. Patients were required to have a length of stay as long as or longer than the prediction window being examined; the number of patients included in the experiments therefore varied by prediction window (Fig. 1).

For both tasks, we extracted patient baseline and time-varying clinical measurements for each encounter. Baseline data included age and a Boolean value for the presence of any relevant comorbidities or symptoms at the time of admission (bacteremia, cirrhosis, congestive heart failure, fever, intracranial hemorrhage, renal failure, respiratory distress, respiratory failure, sepsis, subarachnoid hemorrhage, and shortness of breath). We additionally included an indicator for acute respiratory distress syndrome (ARDS), as pneumonia is associated with ARDS. Time-varying clinical measurements included the required vital signs and laboratory tests, as well as urine output (evaluated as the number of urine measurements over the duration of stay) and blood culture information (evaluated as whether any tests were ordered during the relevant window, and as the total test count during the window). We further included the hour of the initiation of mechanical ventilation and the number of accumulated mechanical ventilation hours at the time of prediction. Raw measurements were binned into 1-hour intervals and averaged within bins to produce a single, representative value for each hour.
Missing values were imputed with median values, which were determined using only data in the training set; this ensured that information from the hold-out test set could not influence the imputation. We calculated six summary statistics (minimum, maximum, median, first, last, and average) of each vital sign and laboratory test over a variable-length window (Table 1). Specifically, for each window length k ∊ {6, 12, 24, 48}, we calculated the statistics for the k hours preceding and including the 48th hour after the initiation of mechanical ventilation (in the case of the intubation task) or the kth hour after admission (in the case of the admission task). Age, the number of total urine output events, and the number of blood culture tests were kept in their raw form. Boolean indicators were added for the presence of antibiotics, sputum labs, blood culture labs, the comorbidities and symptoms listed above, and ARDS (Table 1). All variables were then concatenated into one vector for each encounter.
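The windowed summary-statistic construction described above can be sketched as follows. This is an illustrative reimplementation rather than the authors' code, and the hourly heart-rate values are hypothetical:

```python
import pandas as pd

def window_features(hourly: pd.Series, k: int) -> dict:
    """Six summary statistics over the last k hourly bins of one measurement.

    `hourly` is assumed to be indexed by hour and to end at the prediction
    hour (the 48th hour after intubation, or the kth hour after admission).
    """
    w = hourly.iloc[-k:]
    return {
        "min": w.min(),
        "max": w.max(),
        "median": w.median(),
        "first": w.iloc[0],
        "last": w.iloc[-1],
        "mean": w.mean(),
    }

# Example: hourly-binned heart rate for one encounter (hypothetical values),
# covering hours 41 through 48 after the initiation of ventilation.
hr = pd.Series([88, 90, 95, 97, 92, 91, 89, 90], index=range(41, 49))
feats = window_features(hr, k=6)  # statistics over hours 43-48
```

In the study, one such dictionary per measurement and per window length would be concatenated with the raw-form and Boolean features into a single vector per encounter.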
Table 1

Data included as input to the algorithm.

Required vitals and labs: systolic BP, diastolic BP, HR, respiratory rate, temperature, hematocrit, SpO2, BUN, GCS, platelet count, WBC, creatinine.
Boolean indicators: antibiotics; sputum labs; blood culture labs; any of cirrhosis, congestive heart failure, fever, bacteremia, intracranial hemorrhage, renal failure, respiratory distress, respiratory failure, sepsis, subarachnoid hemorrhage, shortness of breath; ARDS.
Optional measures: age, total urine output events, number of blood culture tests, number of sputum tests, number of MV hours.

ARDS = acute respiratory distress syndrome, BP = blood pressure, BUN = blood urea nitrogen, GCS = Glasgow Coma Scale, HR = heart rate, MV = mechanical ventilation, SpO2 = oxygen saturation, WBC = white blood cell.


Gold standard

The International Classification of Diseases (ICD) Revision 9 code 997.31 was the gold standard definition of VAP. Literature assessing the accuracy of ICD codes for VAP identification remains limited. However, studies have suggested that, while the sensitivity of administrative coding may be only moderate for VAP identification, specificity and negative predictive value are quite high.
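As an illustration (not the authors' code), labeling encounters against this gold standard might look like the following. The table and column names follow the MIMIC-III schema, in which ICD-9 codes are stored without the decimal point; the rows shown are hypothetical:

```python
import pandas as pd

# Hypothetical slice of MIMIC-III's DIAGNOSES_ICD table: one row per
# (hospital admission, diagnosis code). "99731" is ICD-9 997.31 (VAP).
dx = pd.DataFrame({
    "HADM_ID": [100, 100, 101, 102],
    "ICD9_CODE": ["99731", "4280", "486", "99731"],
})

# Positive class: any encounter carrying the VAP code.
vap_label = dx.groupby("HADM_ID")["ICD9_CODE"].apply(
    lambda codes: "99731" in set(codes)
)
```

Each encounter's Boolean label would then serve as the prediction target for both the intubation and admission tasks.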

ML methods and comparators

Due to the absence of VAP prediction literature, a variety of ML methods were evaluated to identify those most appropriate for these novel tasks. The methods included both simple linear models and complex, non-linear models. For each prediction task and each window length, we evaluated five ML methods: logistic regression, multilayer perceptron, random forest, support vector machines, and gradient boosted trees. Logistic regression and support vector machines were chosen as representative linear models; random forest and gradient boosted trees were chosen as representative ensemble and tree-based methods. The multilayer perceptron model was included in lieu of neural network methods with more layers, as there were too few training examples to effectively train such models. Except for the gradient boosted trees model, which was created using the XGBoost Python package, the models were implemented using the scikit-learn Python package. For both the admission and intubation prediction tasks, each of the above ML methods was used to train prediction models for each predefined window length (6, 12, 24, and 48 hours). The use of five methods at each of four prediction windows resulted in a total of 20 models being evaluated for each prediction task. We compared the performance of the ML algorithms to the CURB-65, VAP PIRO, and CPIS scores for evaluating pneumonia severity. CPIS performance was estimated as described in the literature, as it could not be calculated in our dataset. We implemented CURB-65 and VAP PIRO in our dataset. CURB-65 values were calculated for each hour according to the number of the following criteria that were true: BUN > 19 mg/dL, RR ≥ 30, systolic BP < 90 mmHg or diastolic BP ≤ 60 mmHg, and age ≥ 65.
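The CURB-65 variant just described can be sketched as a simple hourly scoring function. This is an illustrative reconstruction of the four criteria as listed, not the study's implementation:

```python
def curb65_like(bun_mg_dl: float, resp_rate: float,
                sbp: float, dbp: float, age: float) -> int:
    """Score one hourly observation against the four CURB-65 criteria used
    in this study: BUN > 19 mg/dL, RR >= 30, SBP < 90 or DBP <= 60 mmHg,
    and age >= 65. Each criterion met contributes one point."""
    score = 0
    score += bun_mg_dl > 19
    score += resp_rate >= 30
    score += (sbp < 90) or (dbp <= 60)
    score += age >= 65
    return int(score)
```

Per the next paragraph, the hourly scores would then be aggregated (eg, averaged) over the temporal window of interest.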
We tried several variations of assigning a CURB-65 score to a temporal window, including using its maximum, average, and last values over the window. As the results were similar in each case, we report its average over the window. PIRO is a four-variable score based on predisposition, insult, response, and organ dysfunction. The score is measured by assigning 1 point in each of four areas: detection of a comorbidity (chronic obstructive pulmonary disease, immunocompromise, heart failure, cirrhosis, or chronic renal failure), bacteremia, a systolic BP < 90 mmHg, and ARDS.

The data were partitioned uniformly at random into a set for training and hyperparameter tuning (80%) and a 20% hold-out test set, against which all trained models were evaluated for final performance metrics. For each task and each window length k, each model was trained using four-fold grid search cross-validation on the 80% training set. After searching the space of model hyperparameter values, the hyperparameters that produced the best cross-validation performance in terms of area under the receiver operating characteristic (AUROC) curve were chosen. Class weights were used as a hyperparameter in all ML algorithms to improve performance on the imbalanced dataset. Final hyperparameter ranges used for all models are presented in Supplementary Table 1. Each model was then tested on the 20% hold-out test set. Feature importance was measured through Shapley additive explanation (SHAP) values to assess similarities and differences in the features used to generate predictions across model types.
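A minimal sketch of this evaluation protocol follows: an 80/20 split, four-fold grid-search cross-validation selecting on AUROC, class weights as a tuned hyperparameter, and final scoring on the hold-out set. The data here are synthetic and the hyperparameter grid is illustrative (the study's actual grids are in its Supplementary Table 1):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature vectors, with class imbalance
# roughly matching the intubation task's ~9% VAP prevalence.
X, y = make_classification(n_samples=1000, weights=[0.91, 0.09], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Class weight is part of the search space, as described above.
grid = {"C": [0.01, 0.1, 1.0], "class_weight": [None, "balanced"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), grid, cv=4, scoring="roc_auc"
)
search.fit(X_tr, y_tr)

# Final metric: AUROC on the untouched 20% hold-out set.
auroc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
```

The same scaffold applies to the other four model families by swapping the estimator and grid.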

Minimal input models

In both the admission and intubation tasks, we conducted feature selection on the overall best performing model (ie, combination of ML method and time window) for each task using the full set of features. As the best performing model was XGBoost in both cases, we used the SHAP plots to eliminate a large subset of features. In the first step, the top 10 most important features as measured by SHAP values were kept, and the performance of the models was re-evaluated on these features alone. We observed that model performance remained relatively unchanged using only the top 10 features in each case. The number of features was then reduced further by assessing the correlations among the remaining 10 features and removing those that were highly correlated. This resulted in a five-feature model for the intubation task and a nine-feature model for the admission task. For each minimal input model, we assessed model AUROC. We additionally assessed model specificity with sensitivity fixed at 0.80.
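The correlation-pruning step can be sketched as follows. The feature names and the 0.9 threshold are illustrative assumptions, and `importance` stands in for the mean absolute SHAP values of the surviving top-10 features:

```python
import numpy as np
import pandas as pd

def prune_correlated(X: pd.DataFrame, importance: pd.Series,
                     thresh: float = 0.9) -> list:
    """Greedily keep features in descending importance order, dropping any
    feature that is highly correlated with an already-kept one."""
    ranked = list(importance.sort_values(ascending=False).index)
    corr = X.corr().abs()
    kept = []
    for f in ranked:
        if all(corr.loc[f, g] < thresh for g in kept):
            kept.append(f)
    return kept

# Synthetic example: a feature, a near-duplicate of it, and an
# independent feature (names are hypothetical).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "mv_hours": a,
    "mv_hours_dup": a + rng.normal(scale=0.01, size=200),
    "last_gcs": rng.normal(size=200),
})
imp = pd.Series({"mv_hours": 0.9, "mv_hours_dup": 0.5, "last_gcs": 0.4})
selected = prune_correlated(X, imp)  # the near-duplicate is dropped
```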

Results

In total, 6126 patients were included in the intubation task experiments. Of those, 524 received a diagnosis of VAP during their stay, yielding a VAP prevalence of 8.55% in the intubation task patient population and 4.97% in the admission task patient population. Those who were diagnosed with VAP had a higher prevalence of ARDS, had a greater number of sputum labs performed, and were on average older than those without VAP (Table 2). Additional comorbidity information for the intubation task population is presented in Supplementary Table 2.
Table 2

Demographic and comorbidity information for the experimental population.

Characteristic | VAP positive (n = 524) | VAP negative (n = 5602)
Age
 <30 | 27 (5.15%) | 215 (3.84%)
 30–49 | 81 (15.46%) | 735 (13.12%)
 50–59 | 105 (20.04%) | 972 (17.35%)
 60–69 | 120 (22.90%) | 1383 (24.69%)
 70–79 | 107 (20.42%) | 1185 (21.15%)
 80+ | 71 (13.55%) | 984 (17.57%)
ARDS
 Yes | 35 (6.68%) | 284 (5.07%)
 No | 489 (93.32%) | 5318 (94.93%)
Sputum test performed
 Yes | 497 (94.85%) | 2644 (47.20%)
 No | 27 (5.15%) | 2958 (52.80%)
Gender
 Male | 313 (59.73%) | 3172 (56.62%)
 Female | 211 (40.27%) | 2430 (43.38%)
Ethnicity
 White | 354 (67.56%) | 4072 (72.69%)
 Black/African-American | 44 (8.40%) | 484 (8.64%)
 Asian | 16 (3.05%) | 134 (2.39%)
 Hispanic/Latino | 11 (2.10%) | 209 (3.73%)
 Unknown/other | 99 (18.89%) | 703 (12.55%)

ARDS = acute respiratory distress syndrome, VAP = ventilator-associated pneumonia.


Prediction of VAP 48 hours after intubation

For this task, the gradient boosted trees models demonstrated better performance than other model types, except when using summary statistics from a 24-hour window (Table 3), in which case a random forest model demonstrated the highest performance. The best AUROC was recorded by XGBoost using summary statistics from a 6-hour window. Multilayer perceptron models demonstrated lower performance than all other models, particularly when using 48 hours of data. All models outperformed the CURB-65 and PIRO scores at all prediction times.
Table 3

AUROC results on the hold-out test set of models trained to predict VAP 48 hours after intubation, using summary statistics from the previous k hours of patient data.

k (hours) | 6 | 12 | 24 | 48
Logistic regression | 0.744 | 0.751 | 0.739 | 0.776
Multilayer perceptron | 0.731 | 0.740 | 0.722 | 0.741
Random forest | 0.771 | 0.767 | 0.780 | 0.777
Support vector machines | 0.765 | 0.769 | 0.764 | 0.775
XGBoost | 0.799 | 0.794 | 0.775 | 0.791
CURB-65 | 0.503 | 0.498 | 0.498 | 0.498
PIRO | 0.565 | 0.555 | 0.566 | 0.557

AUROC = the area under the receiver operating characteristic curve, ICU = intensive care unit, PIRO = predisposition, insult, response, organ dysfunction, VAP = ventilator-associated pneumonia.

Receiver operating characteristic (ROC) curves for all intubation task models at the time windows k = 6 and k = 48 are presented in Figure 2; these plots are annotated with the performance of several common clinical criteria in the diagnosis of VAP, as determined by a meta-analysis. All models meet or exceed the reported performance of all of the clinical criteria. ROC curves for the remaining time points are presented in Supplementary Figure 1.
Figure 2

ROC curve comparison for intubation task models, with summary statistics calculated from the (A) 6 hours and (B) 48 hours of data preceding the time of prediction. K = number of hours used to make prediction, MIMC = multiparameter intelligent monitoring in intensive care, ROC = receiver operating characteristic.

An assessment of SHAP plots for the intubation task models showed a high degree of overlap in the features identified as most important for generating predictions (Supplementary Figures 2–6), with key recurrent features including length of time on mechanical ventilation, sputum culture measures, and clinical measures related to SpO2 and GCS.

Prediction of VAP after admission

Of the models trained for the admission task, each performed best when using summary statistics from the full 48 hours between admission and the time of prediction (Table 4). Overall best performance was obtained by the XGBoost model using 48 hours of data. Figure 3 provides comparisons of ROC curves for these windows, with annotated comparisons to common clinical criteria. Most models meet or exceed the reported performance of existing clinical criteria at 6 hours, and all do so at 48 hours. The k = 12 and k = 24 cases are shown in Supplementary Figure 7.
Table 4

AUROC results on the hold-out test set of models trained to predict VAP k hours after ICU admission.

k (hours) | 6 | 12 | 24 | 48
Logistic regression | 0.772 | 0.788 | 0.812 | 0.822
Multilayer perceptron | 0.587 | 0.712 | 0.753 | 0.820
Random forest | 0.706 | 0.762 | 0.822 | 0.838
Support vector machines | 0.766 | 0.799 | 0.812 | 0.829
XGBoost | 0.733 | 0.796 | 0.820 | 0.854
CURB-65 | 0.481 | 0.496 | 0.506 | 0.517
PIRO | 0.584 | 0.595 | 0.599 | 0.622

AUROC = the area under the receiver operating characteristic curve, PIRO = predisposition, insult, response, organ dysfunction, VAP = ventilator-associated pneumonia.

Figure 3

ROC curve comparison for admission task models with summary statistics calculated from the (A) 6 hours and (B) 48 hours of data preceding the time of prediction.

As with the intubation task, SHAP plots demonstrated that the features most important for generating VAP predictions were similar across model types and prediction windows (Supplementary Figures 8–12), with mechanical ventilation hours, GCS, and sputum again being important features across many models. This finding further supports that a wide range of models may be suitable for this prediction task.

The best performing model for the intubation task was XGBoost using 6 hours of data, and for the admission task was XGBoost using 48 hours of data. After feature selection based on feature importance (Supplementary Figures 6 and 12) and correlation, the minimal input intubation task model included five features (last GCS, last temperature, mechanical ventilation duration, antibiotics indication, and count of sputum cultures). The minimal input admission task model included nine features (last GCS, last systolic blood pressure, last RR, last white blood cell count, mechanical ventilation duration, count of sputum cultures, count of blood cultures, count of urine cultures, and age). The AUROCs of both minimal input models were very close to those of the initial all-feature models, at 0.80 and 0.83 for the intubation and admission tasks, respectively. With sensitivity fixed at 0.80, the minimal input intubation and admission task models achieved specificity values of 0.69 and 0.73, respectively.
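The specificity-at-fixed-sensitivity metric reported above can be read off a model's ROC curve. A minimal sketch, with hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def specificity_at_sensitivity(y_true, scores, target_sens=0.80):
    """Specificity at the first ROC operating point whose sensitivity
    (TPR) reaches the target, mirroring the paper's sensitivity = 0.80."""
    fpr, tpr, _ = roc_curve(y_true, scores, drop_intermediate=False)
    idx = np.argmax(tpr >= target_sens)  # first point meeting the target
    return 1.0 - fpr[idx]

# Hypothetical labels and model scores for ten encounters.
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.65, 0.8, 0.9, 0.95])
spec = specificity_at_sensitivity(y, s)
```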

Discussion

Our retrospective results demonstrate the success of ML models trained to predict VAP in two use cases (Tables 3 and 4). Due to the novelty of the task, a variety of models and prediction windows were explored, all of which demonstrated strong predictive performance. In both use cases, the test set performance of ML models significantly exceeds the reported performance of classic clinical indicators of VAP, and does so with the potential of advance warning (Figures 2 and 3, Supplementary Figures 1 and 4). Additionally, simple, interpretable models such as logistic regression demonstrate strong performance for both tasks. Given the morbidity and mortality associated with VAP and with VAP treatment delays, these models may have important implications for improving patient care and outcomes, subject to future external and prospective validation.

ML methods were explored due to the need for methods that improve upon the discriminatory ability of existing VAP prediction and detection methods. Existing methods include the PIRO score, which was originally developed for sepsis and has since been utilized for VAP risk stratification in hospitalized patients. The CPIS tool is also used as a VAP diagnostic tool for hospitalized patients. However, prior research has shown these tools exhibit poor diagnostic performance. In a recent study, PIRO achieved an AUROC of 0.605. A recent meta-analysis found that CPIS demonstrated a sensitivity and specificity of 73.8% and 66.4%, respectively, for VAP detection. The CURB-65 score, utilized for risk stratification and to predict 30-day mortality, has better accuracy than PIRO and CPIS, with a c-statistic of 0.761. However, the sensitivity of CURB-65 varies with the severity of pneumonia.
A 2016 study showed that 36% of patients who were classified as low risk based on their CURB-65 score were ultimately hospitalized as a result of pneumonia. Despite the urgent need for better VAP diagnostics and the popularity of ML applications in healthcare, relatively little effort has been devoted to the application of ML to EHR data for the purpose of predicting VAP. Several methods have been developed for identifying community-acquired pneumonia using neural networks and genetic algorithms, and one study predicted hospital-acquired pneumonia in patients with schizophrenia. Concerning VAP, studies have examined the accuracy of electronic nose (e-nose) sniffers for screening potential VAP cases. These devices use ML methods to analyze exhaled breath for metabolites that may be suggestive of VAP, and some have demonstrated strong discrimination for identifying the presence of VAP. However, prospective validation of an e-nose tool found that its sensitivity and specificity were insufficient for general clinical use.

In this context, our study provides a valuable characterization of a variety of ML methods applied to two VAP prediction use cases. The first use case corresponds to the intubation task, which predicted VAP at the first time VAP can be diagnosed: the 48th hour following the initiation of mechanical ventilation. While the highest overall AUROC was demonstrated by XGBoost using data from the 6-hour window leading up to and including the prediction time, all models met or exceeded the performance of existing VAP identification methods. An alert at or before VAP onset by any of these methods is likely to improve identification of VAP, potentially overcoming limitations in diagnostic criteria that may lead to both under- and over-treatment with antibiotics. It is worth noting that XGBoost demonstrated decreasing performance over longer data collection windows for the intubation task.
This may be because, with increasing time from the point at which predictions are made, many of the model inputs lose clinical relevance to the current patient state, decreasing the overall relevance of the inputs. However, other methods, such as random forest, demonstrated increased performance with longer data windows. The relatively low performance of the multilayer perceptron model may be because a single-layer perceptron can only classify linearly separable sets of vectors. Since the data used here have at least 82 dimensions, it is likely that the classes are not linearly separable. Although we tried a multilayered neural network, we had insufficient training data.

The second use case corresponds to the admission task, which aims to predict VAP a fixed number of hours (6, 12, 24, or 48) following admission to the ICU. In the first three of these cases, alerts generated by the models give advance warning of VAP, as patients cannot yet have been ventilated for the 48 hours required for a VAP diagnosis. Further, these models have no requirement that patients be ventilated for algorithm alerts to be generated, although they incorporate information about current ventilation status. Therefore, these models may be able to identify, before ventilation occurs, patients at high risk of developing VAP should they become ventilated. For this task, the best-performing model was XGBoost using summary statistics generated from the entire 48-hour window after admission (Table 4, Figure 3). All models demonstrated performance meeting or exceeding current methods for VAP identification. For this task, models generally demonstrated increasing performance when using larger windows of data and when generating predictions further into the patient stay. Models addressing these two use cases exhibit different strengths, with potential implications for future clinical use. The intubation task is, by definition, applied to a high-risk population of patients.
In many clinical settings, the positive predictive value of an intubation task alert is therefore likely to be high, and an intubation-based ML system is likely to identify the vast majority of VAP cases. Such a system could therefore meaningfully improve the timeliness of antimicrobial administration, improving patient outcomes. In contrast, the admission task is designed for application to all ICU patients, regardless of ventilation status. Therefore, in addition to providing early identification of patients for whom additional monitoring is warranted, the admission task alert may provide clinicians with the opportunity to consider non-invasive methods of ventilation for high-risk patients, preventing VAP entirely. The success of both approaches in this retrospective study supports the potential for ML methods to meet a wide range of clinical needs relating to VAP treatment, identification, and prevention.

For both use cases, the strong performance of simple linear models (eg, logistic regression) has important implications. Logistic regression models are readily interpretable, with the relative importance of each input feature measurable as the relative magnitude of its coefficient. While many ML methods are viewed as "black boxes," linear models are far more transparent, with clear similarities to tabular scores already commonly used in clinical practice. This transparency may increase trust in the model, increasing its utility. Additionally, the importance of commonly collected clinical features, simple indicator variables, and readily available information such as length of ventilation supports the practicality of a clinical application of these models. ML models may be challenging to implement in clinical settings when specialized testing is required. However, the strong performance found in this study supports further work to develop models that use accessible measurements.
Such work may help enable the practical implementation of ML tools in a clinical setting.

There are several limitations to this study. Because the exact onset time of VAP could not be determined retrospectively from this dataset, it was not possible to determine the degree of advance warning provided by the models. For the best-performing models, the cumulative duration of mechanical ventilation, the administration of antibiotics, and the ordering of a sputum test were the most important features, along with summary statistics of GCS and creatinine (Supplementary Figures 2 and 4). Although VAP by definition may not be diagnosed until 48 hours after the initiation of mechanical ventilation, clinicians may suspect that VAP is developing and order sputum tests or administer antibiotics within this window. In these cases, while the classifications made by the models may technically be considered predictions, the alerts they would provide would not necessarily lead to a significant change in care. Given this, it may be valuable for future research to examine prediction tasks using data obtained immediately after the initiation of mechanical ventilation, so that physicians have greater lead time to intervene before the 48th hour. These results may also be limited by the use of ICD codes as the VAP gold standard, which may fail to accurately capture all patients who experienced VAP during their hospitalization. However, the generally high specificity of codes for hospital-acquired infections supports the conclusion that our positive class consisted of true positive VAP cases. Another limitation is that these models were trained and tested on data from a single institution, which may limit generalizability; model performance on novel patient populations or specific subpopulations cannot be inferred. Finally, due to the retrospective nature of this study, the impact that these algorithms may have on patient care and outcomes in a live clinical setting cannot be determined.
These limitations underscore the need for additional, prospective validation. Beyond prospective validation of model performance, further work is needed to determine the ideal data collection and prediction windows. Incorporation of clinician feedback will be essential to ensure that a prospectively implemented model appropriately balances the need for early prediction with the collection of sufficient patient data.
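To make the modeling setup discussed above concrete, the sketch below trains two of the model families compared in the study on per-encounter summary statistics and evaluates hold-out AUROC. It uses scikit-learn's gradient boosting as a stand-in for XGBoost, and all data are synthetic; the feature construction and dimensions are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-encounter summary statistics (e.g., min/mean/max
# of vitals and labs over a 48-hour window); 82 features mirrors the text's
# "at least 82 dimensions".
n_encounters, n_features = 2000, 82
X = rng.normal(size=(n_encounters, n_features))

# Synthetic outcome depending on a few features, loosely mimicking VAP labels.
logits = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2]
y = (logits + rng.normal(scale=1.0, size=n_encounters) > 0.5).astype(int)

# 80/20 split mirrors the study's hold-out evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
aurocs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aurocs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aurocs)
```

In a real replication, the summary statistics would be recomputed for each data window (6, 12, 24, or 48 hours) and the comparison repeated per window.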
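The coefficient-based interpretability noted for logistic regression can be sketched as follows: after standardizing the inputs, coefficient magnitudes give a simple importance ranking. The feature names below are hypothetical placeholders chosen to echo the text, not the study's actual 82 inputs, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical features; the real study used 82 EHR-derived inputs.
feature_names = ["hours_ventilated", "antibiotics_given", "sputum_tests", "last_gcs"]
X = rng.normal(size=(1000, 4))
# Outcome driven mainly by the first feature, with a negative GCS effect.
y = (1.2 * X[:, 0] - 0.6 * X[:, 3] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

# Standardize so coefficient magnitudes are comparable across features.
Xs = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(Xs, y)

# Rank features by |coefficient|: the transparency property noted in the text.
ranking = sorted(zip(feature_names, clf.coef_[0]), key=lambda t: -abs(t[1]))
for name, coef in ranking:
    print(f"{name:20s} {coef:+.2f}")
```

The signed coefficients also indicate direction of effect, which is what gives such models their resemblance to tabular clinical scores.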
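The study measured feature importance with Shapley values; as a lighter-weight stand-in conveying the same idea (an assumption for illustration, not the authors' method), permutation importance ranks features by how much shuffling each one degrades AUROC.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)

# Five synthetic features; only the first two carry signal for the outcome.
X = rng.normal(size=(1500, 5))
y = (1.5 * X[:, 0] + 1.0 * X[:, 1]
     + rng.normal(scale=1.0, size=1500) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Mean drop in AUROC when a feature is shuffled ~ that feature's importance.
result = permutation_importance(model, X, y, n_repeats=10,
                                scoring="roc_auc", random_state=0)
order = np.argsort(result.importances_mean)[::-1]
print(order.tolist())  # feature indices, most to least important
```

With real EHR data, the same ranking would be computed on the hold-out set to avoid attributing importance to overfit noise features.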

Conclusion

The development of accurate and timely diagnostic tools for ventilator-associated pneumonia has been limited, despite the prevalence, mortality, and costs associated with VAP. ML may be a key contributor to the future management of VAP risk associated with mechanical ventilation, with a variety of ML methods demonstrating suitability for this prediction task. Future work is necessary to further validate ML algorithms for VAP prediction.

Author contributions

Conceptualization: Jacob Calvert. Data curation: Christine Giang. Formal analysis: Christine Giang, Keyvan Rahmani. Methodology: Jacob Calvert. Project administration: Ritankar Das. Software: Christine Giang. Supervision: Jacob Calvert, Jana Hoffman, Qingqing Mao, Ritankar Das. Validation: Jacob Calvert, Jana Hoffman, Qingqing Mao. Visualization: Christine Giang. Writing – original draft: Gina Barnes, Anna Siefkas. Writing – review & editing: Christine Giang, Jacob Calvert, Keyvan Rahmani, Gina Barnes, Anna Siefkas, Abigail Green-Saxena, Jana Hoffman.