
Personalized antibiograms for machine learning driven antibiotic selection.

Conor K Corbin1, Lillian Sung2, Arhana Chattopadhyay1, Morteza Noshad1, Amy Chang3, Stanley Deresinski3, Michael Baiocchi1, Jonathan H Chen1.

Abstract

Background: The Centers for Disease Control and Prevention identify antibiotic prescribing stewardship as the most important action to combat increasing antibiotic resistance. Clinicians balance broad empiric antibiotic coverage vs. precision coverage targeting only the most likely pathogens. We investigate the utility of machine learning-based clinical decision support for antibiotic prescribing stewardship.
Methods: In this retrospective multi-site study, we developed machine learning models that predict antibiotic susceptibility patterns (personalized antibiograms) using electronic health record data of 8342 infections from Stanford emergency departments and 15,806 uncomplicated urinary tract infections from Massachusetts General Hospital and Brigham & Women's Hospital in Boston. We assessed the trade-off between broad-spectrum and precise antibiotic prescribing using linear programming.
Results: We find in Stanford data that personalized antibiograms reallocate clinician antibiotic selections with a coverage rate (fraction of infections covered by treatment) of 85.9%, similar to clinician performance (84.3%, p = 0.11). In the Boston dataset, the personalized antibiogram coverage rate is 90.4%, a significant improvement over clinicians (88.1%, p < 0.0001). Personalized antibiograms achieve similar coverage to the clinician benchmark with narrower antibiotics. With Stanford data, personalized antibiograms maintain clinician coverage rates while narrowing 69% of empiric vancomycin + piperacillin/tazobactam prescriptions to piperacillin/tazobactam. In the Boston dataset, personalized antibiograms maintain clinician coverage rates while narrowing 48% of ciprofloxacin to trimethoprim/sulfamethoxazole.
Conclusions: Precision empiric antibiotic prescribing with personalized antibiograms could improve patient safety and antibiotic stewardship by reducing unnecessary use of broad-spectrum antibiotics that breed a growing tide of resistant organisms.
© The Author(s) 2022.

Keywords:  Antibiotics; Bacterial infection; Disease prevention; Epidemiology

Year:  2022        PMID: 35603264      PMCID: PMC9053259          DOI: 10.1038/s43856-022-00094-8

Source DB:  PubMed          Journal:  Commun Med (Lond)        ISSN: 2730-664X


Introduction

The World Health Organization (WHO) estimates that 700,000 people already die annually due to antibiotic-resistant infections, and expects this number to exceed 10 million per year by 2050[1]. Increasing antibiotic resistance is a natural and inevitable consequence of regular antibiotic use, raising the looming threat of a post-antibiotic era that could cripple routine medical care with higher infection-related mortality and costs of care[2-5]. The Centers for Disease Control and Prevention (CDC) identify improving antibiotic prescribing through antibiotic stewardship as the most important action to combat the spread of antibiotic-resistant bacteria[6]. For example, sixty percent of hospitalized patients receive antibiotics despite the fact that half of antibiotic treatments are inappropriate—meaning antibiotic use was unwarranted, the wrong antibiotic was given, or the antibiotic was delivered with the wrong dose or duration[7]. A key challenge is that antibiotics must often be prescribed empirically, before the identity of the infecting organism and its antibiotic susceptibilities are known. Microbial cultures are the definitive diagnostic tests for this information, but may take days to confirm final results, far too long to delay initial therapy[8]. Broad-spectrum antibiotics help ensure coverage of a range of organisms that would lead to rapid clinical deterioration if left untreated[9,10]. Yet, it is precisely the excessive use of antibiotics that increases drug-resistant organisms[11]. Overuse of broad-spectrum antibiotics can thus have severe immediate and indirect consequences ranging from increasing antibiotic resistance to drug-specific toxicities and secondary infections such as Clostridioides difficile colitis[12-14].
Existing standards of care for selecting empiric antibiotics involve referring to clinical practice guidelines combined with knowledge of institution-specific antibiograms—an annual report from an institution’s microbiology lab that tracks the most common organisms isolated by microbial cultures and the percentages that were found susceptible to different antibiotics[15-17]. An institution’s antibiogram might report, for example, that 1000 Escherichia coli isolates were identified in the prior year, and that 98% were susceptible to meropenem while only 89% were susceptible to ceftriaxone. These approaches may not consider many or any patient-specific features. Microbial culture results found within the electronic health record can be used to objectively measure not only whether chosen antibiotics were appropriate, but whether alternatives would have sufficed. Here we hypothesize that the standard of care may benefit from machine learning-based clinical decision support for personalized treatment recommendations. The development of computerized clinical decision support for antibiotic prescribing stems back decades to the likes of MYCIN and Evans et al.—rule-based systems that guide clinicians through empiric antibiotic selection[18,19]. Though promising, neither system was widely adopted, as neither was easily integrated into clinical workflows or adaptable to constantly evolving local antibiotic resistance patterns[20]. With modern-day hospital IT and electronic medical record software it is now possible to integrate clinical decision support into medical workflows and dynamically train models with real-world clinical data streams[21]. Literature concerning modern data-driven approaches to antibiotic decision support falls into two distinct categories. One category of studies predicts infection status at the time microbial cultures were ordered, offering promising consideration for when antibiotics are needed at all[22-24].
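As a concrete illustration of how an institutional antibiogram is tabulated, the sketch below aggregates isolate-level susceptibility results into percent-susceptible figures per organism-antibiotic pair. The record format is a simplified assumption for illustration, not the paper's actual data model.

```python
from collections import defaultdict

def build_antibiogram(isolates):
    """Aggregate isolate-level susceptibility results into an antibiogram:
    percent susceptible per (organism, antibiotic) pair.
    `isolates` is a list of (organism, antibiotic, susceptible) tuples
    (illustrative schema)."""
    counts = defaultdict(lambda: [0, 0])  # (organism, antibiotic) -> [n_susceptible, n_tested]
    for organism, antibiotic, susceptible in isolates:
        counts[(organism, antibiotic)][0] += int(susceptible)
        counts[(organism, antibiotic)][1] += 1
    return {key: 100.0 * s / t for key, (s, t) in counts.items()}
```

With such a table, a lab can report that, say, a given fraction of last year's E. coli isolates were susceptible to ceftriaxone; the paper's point is that these population-level percentages ignore patient-specific features.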
A limitation of most of these prior studies is that positive microbial culture results were used as a proxy for the outcome of infection, despite there being both false positive and false negative microbial cultures with respect to an actual clinical infection. Moreover, these studies do not address the question of which antibiotics should have been administered. The second category of studies predicts antibiotic susceptibility results for positive microbial cultures[25-28]. These studies address the challenge of selecting the right antibiotic. Antibiotic prescribing policies that leverage machine learning predictions were simulated and benchmarked against retrospective clinician prescribing, suggesting improved performance. Optimizing patient coverage rates, however, is only one important objective, and one that could be naively addressed by prescribing maximally broad antibiotics to all patients without consideration for adverse effects on the individual or population. Further critical research needs to systematically evaluate the trade-off between maximizing antibiotic coverage across a population of patients and minimizing broad-spectrum antibiotic use. In a previous work we demonstrated that machine learning models could predict antibiotic susceptibility results when conditioned on microbial species[29]. We examined precision-recall curves of these models and highlighted thresholds that separated subgroups of patients whose probability of coverage with narrower-spectrum antibiotics equaled antibiogram values of broader-spectrum antibiotics. Here we substantially extend our work on personalized antibiograms to generalize beyond species identity, introduce a linear programming optimization framework to simulate optimal antibiotic allocations across a set of patients, conduct a sensitivity analysis to estimate model performance on patients with negative microbial cultures, and assess the generalizability of our findings with data from an external site.
Specifically, our objective in this study is the following. We (1) train and evaluate personalized antibiograms—machine learning models that use electronic health record data to predict antibiotic susceptibility results; (2) evaluate the performance of antibiotic selections informed by personalized antibiograms relative to selections made by clinicians; and (3) systematically evaluate the trade-off in performance when fewer broad-spectrum antibiotics are selected across a population of patients. We complete this objective using a cohort of patients who presented to Stanford emergency departments between 2009 and 2019 and then replicate our process on an external cohort of patients who presented to the Massachusetts General Hospital and Brigham & Women’s Hospital in Boston between 2007 and 2016. In our Stanford cohort we find that personalized antibiograms are able to reallocate antibiotic selections made by clinicians with a coverage rate (defined as the fraction of infections covered by the antibiotic selection) of 85.9%, similar to the clinician coverage rate (84.3%, p = 0.11). We find in the Boston data that personalized antibiograms reallocate antibiotic selections with a coverage rate of 90.4%—significantly higher than the coverage rate clinicians achieve (88.1%, p < 0.0001). In the Stanford data we find that antibiotic selections guided by personalized antibiograms achieve a coverage rate as good as the real-world clinician prescribing rates while narrowing 69% of the vancomycin + piperacillin/tazobactam selections to piperacillin/tazobactam, 40% of piperacillin/tazobactam prescriptions to cefazolin, and 21% of ceftriaxone prescriptions to ampicillin. In the Boston data we find that personalized antibiograms can replace 93% of the total ciprofloxacin prescriptions with nitrofurantoin without falling below the real-world coverage rate. Similarly, 48% of the total ciprofloxacin and 62% of nitrofurantoin prescriptions can be exchanged with trimethoprim/sulfamethoxazole.

Methods

Data sources

We used the STAnford Research Repository (STARR) clinical data warehouse to extract de-identified patient medical records[30]. STARR contains electronic health record data collected from over 2.4 million unique patients spanning 2009–2021 who have visited Stanford Hospital (an academic medical center in Palo Alto, California), ValleyCare hospital (a community hospital in Pleasanton, California) and Stanford University Healthcare Alliance affiliated ambulatory clinics. We included patient encounters from both the Stanford and ValleyCare emergency departments. Structured electronic health record data include patient demographics, comorbidities, procedures, medications, labs, vital signs, and microbiology data. STARR microbiology data contain information about microbial cultures ordered within the hospital, emergency departments, and outpatient clinics. Our microbiology data included source of culture, order timestamp, and result timestamp. Microbial culture data also included the resulting organism name and antibiotic susceptibility pattern, which indicated whether each organism was susceptible, intermediate, or resistant to a set of tested antibiotics. Microbiology data collected from ValleyCare and Stanford emergency departments were analyzed at separate microbiology labs. Both follow standardized national procedures to measure antibiotic susceptibility as defined by the Clinical & Laboratory Standards Institute (CLSI)[31]. Our study was approved by the institutional review board of the Stanford University School of Medicine. Project-specific informed consent was not required because the study was restricted to secondary analysis of existing clinical data. We replicated the analysis on electronic medical record data of patients from Massachusetts General Hospital and the Brigham and Women’s Hospital in Boston, MA—a dataset made available through PhysioNet[28,32].

Cohort definitions

The unit of observation in this analysis was a patient-infection. In the Stanford data, analysis was restricted to patients who presented to the emergency department with infection between January 2009 and December 2019. We included patients 18 years or older who required hospital admission. We further restricted the cohort to patients for whom an order for at least one blood, urine, cerebrospinal fluid, or body fluid microbial culture that ultimately returned positive, and at least one order for intravenous or intramuscular antibiotics, were placed in the first 24 h after presentation to the emergency department. We excluded observations where antibiotics or microbial cultures had been ordered within the 2 weeks prior to the presentation to the emergency department. We incorporated admissions with negative cultures in a sensitivity analysis. Figure 1 illustrates the flow diagram of patient evaluation, reasons for exclusion, and the number included in our study, tabulating both the number of infections and the number of unique patients. In the Boston dataset the unit of observation was similarly a patient-infection. Analysis was restricted to uncomplicated urinary tract infections, as described in Kanjilal et al.[28]. Observations between the years 2007 and 2016 were included in the study.
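The inclusion and exclusion criteria above can be sketched as a simple predicate over an admission record. The field names here are illustrative assumptions, not the study's actual schema.

```python
def include_admission(adm):
    """Sketch of the Stanford cohort inclusion/exclusion logic.
    `adm` is a dict with illustrative fields; returns True if the admission
    qualifies as a patient-infection observation."""
    return (
        # Adults requiring hospital admission.
        adm["age"] >= 18
        and adm["admitted_to_hospital"]
        # At least one qualifying microbial culture (that returned positive)
        # ordered within 24 h of ED presentation.
        and adm["hours_to_first_culture"] is not None
        and adm["hours_to_first_culture"] <= 24
        and adm["culture_positive"]
        # At least one IV/IM antibiotic order within 24 h of ED presentation.
        and adm["hours_to_first_iv_im_abx"] is not None
        and adm["hours_to_first_iv_im_abx"] <= 24
        # Exclusion: antibiotics or cultures ordered in the prior 2 weeks.
        and not adm["orders_within_prior_2wk"]
    )
```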
Fig. 1

Study cohort selection.

a 119,840 hospital admissions corresponding to 69,069 unique adult patients admitted from Stanford emergency rooms between 2009 and 2019 were initially examined for inclusion. b 42,448 admissions had a microbial culture and intravenous or intramuscular empiric antibiotic order placed within the first 24 h of the encounter. c Admissions were excluded if microbial cultures had been ordered in the 2 weeks leading up to the encounter. d Admissions resulting in negative microbial cultures were excluded in the primary analysis, leaving 8342 infections from 6920 unique patients.


Labelling infections for personalized antibiogram models

Using Stanford data we trained 12 binary machine learning models to estimate the probability that common antibiotic selections would provide activity against infections at the point in time at which empiric antibiotics were chosen. An antibiotic selection was said to provide activity against a patient’s infection if all microbial organisms that grew in the patient’s microbial cultures were listed as susceptible to at least one of the antibiotics in the selection. While microbial cultures growing coagulase-negative staphylococci sometimes represent true infections warranting antibiotic treatment, we excluded these cases as they frequently represent non-infectious contaminants. We trained models for eight commonly administered single antibiotic choices (vancomycin, piperacillin/tazobactam, cefepime, ceftriaxone, cefazolin, ciprofloxacin, ampicillin and meropenem) and four combination therapies (vancomycin + piperacillin/tazobactam, vancomycin + cefepime, vancomycin + ceftriaxone, and vancomycin + meropenem). We defined our prediction time to be the time at which the first intravenous or intramuscular antibiotic was ordered for the patient following admission to the emergency department. Using Boston data, we similarly trained personalized antibiogram models that predicted whether four antibiotics commonly administered for urinary tract infection (trimethoprim/sulfamethoxazole, nitrofurantoin, ciprofloxacin, and levofloxacin) would provide activity against the target infection. Not all antibiotics were tested against all organisms in the microbiology lab, which resulted in missing labels for some of our observations. Antibiotics were only tested if they could plausibly be active against a specific organism. We received consultation from the Stanford microbiology lab to generate a set of rules to impute missing labels.
For example, our imputation rules assumed that Pseudomonas aeruginosa would be resistant to ceftriaxone and cefazolin; that Gram-negative rods would be resistant to vancomycin; and that Streptococcus agalactiae would be susceptible to cephalosporins[33-35]. Antibiotic susceptibility was also inferred from observed results of related antibiotics. For example, if an organism was susceptible to a first-generation cephalosporin, it was assumed that it would also be susceptible to a second-, third- or fourth-generation cephalosporin[36].
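The coverage label and rule-based imputation described above might be sketched as follows. The rule table here is a small illustrative subset of the rules developed with the microbiology lab, and the data structures are assumptions for illustration.

```python
# Illustrative subset of imputation rules: organism/antibiotic pairs
# assumed resistant when no susceptibility test result is available.
ASSUMED_RESISTANT = {
    ("Pseudomonas aeruginosa", "ceftriaxone"),
    ("Pseudomonas aeruginosa", "cefazolin"),
}

def susceptible(organism, antibiotic, tested_results):
    """Return the tested susceptibility if available, otherwise fall back
    to rule-based imputation; raise if neither applies (label stays missing)."""
    if (organism, antibiotic) in tested_results:
        return tested_results[(organism, antibiotic)]
    if (organism, antibiotic) in ASSUMED_RESISTANT:
        return False
    raise KeyError(f"no result or rule for {organism}/{antibiotic}")

def selection_covers(organisms, selection, tested_results):
    """An antibiotic selection provides activity against the infection iff
    every organism that grew is susceptible to at least one antibiotic in it."""
    return all(
        any(susceptible(org, abx, tested_results) for abx in selection)
        for org in organisms
    )
```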

Feature engineering

Using Stanford data, a feature matrix was constructed from data in the EHR with timestamps up until the prediction time. Though observations were restricted to infections that required hospital admission, features were constructed based on data from all forms of available medical encounters (including, for example, prior primary care visits). We used a bag of words featurization technique similar to Rajkomar et al. to construct our feature matrix[37]. Categorical features included diagnosis codes (ICD9 and ICD10 codes), procedure orders, lab orders (including microbiology lab orders), medication orders, imaging orders and orders for respiratory care. For prior microbial culture results, categorical features were constructed based on the antibiotic susceptibility pattern of extracted isolates. Bag of words feature representations for each admission were generated such that the value in each column was the number of times that feature was present in the patient’s record during a pre-set look back window. For diagnosis codes, the look back window was defined as the entire medical history. For other categorical features, the look back window was defined as the year leading up until prediction time. If a feature was not present for a patient (for example a diagnosis code was never assigned or a lab test was never ordered), the value in the corresponding column was zero. This allowed us to implicitly encode missing values into our representation without having to impute our data. Numeric features included lab results and vital signs from flowsheet data. We binned the values of each unique numerical feature into categorical buckets based on the decile cutoffs of their distributions in the training set. We used the training set to identify thresholds for each decile in a feature’s distribution and then applied these thresholds to patients in our validation and test set to prevent information leakage.
To create the bag of words representation, we created columns in our feature matrix where the corresponding value represented the number of times the feature with a value in a particular decile was observed within a look back window. For lab results and vital signs, the look back window was 14 days prior to prediction time. Features were not standardized but rather left as counts. In addition to these categorical and numeric features, we included patient demographics (age in years, sex, race, and ethnicity), insurance information, and institution (Stanford or ValleyCare). Sex, race, ethnicity, insurance payer, and institution were one-hot encoded. In total, the sparse feature matrix contained 43,220 columns. With Boston data, a feature matrix was generated as in Kanjilal et al.[28]. Features included prior microbiology data, antibiotic exposures, comorbidities, procedures, lab results and patient demographics. The total number of features used in this portion of the analysis was 788.
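The decile binning and bag-of-words counting steps can be sketched as below. Note that the decile thresholds are estimated on training data only and then applied to validation/test data, mirroring the leakage-prevention step described above; the token naming is an illustrative assumption.

```python
from collections import Counter

def decile_thresholds(train_values):
    """Decile cut points estimated on the training set only, to avoid leakage."""
    xs = sorted(train_values)
    return [xs[int(len(xs) * q / 10)] for q in range(1, 10)]

def to_decile(value, thresholds):
    """Map a raw numeric value to a 0-9 bucket via training-set thresholds."""
    return sum(value >= t for t in thresholds)

def bag_of_words(event_tokens):
    """Count occurrences of each (binned) feature token within the look-back
    window; absent tokens are implicitly zero, so no imputation is needed."""
    return Counter(event_tokens)
```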

Training and model selection procedure

With Stanford data we split the cohort by year into training (2009–2017), validation (2018), and test (2019) sets to mimic distributional shifts that occur with deployment[38]. This is particularly important so that we can take into consideration changes in the data generating process (resistance patterns and medical practice) that occur over time when estimating model performance. We did not re-weight or re-sample our training data according to class balance in an attempt to preserve model calibration on our test set[39]. We selected from four model classes: L1 (lasso) and L2 (ridge) logistic regressions, random forests, and gradient boosted trees. These were specifically chosen so that we could search over model classes with different biases and variances. The L1 and L2 penalized logistic regressions assume the outcome is a linear function of the features, are less flexible, but also less prone to overfitting the data. The random forests and gradient boosted trees can model nonlinear interactions between features and outcomes, are more flexible, but more prone to overfitting. Random forests perform inference by averaging predictions from a collection of trees, and gradient boosted trees perform inference by summing the predictions of a collection of trees that each fit the residuals at the prior boosting round[40]. The training and model selection procedure we used for the logistic regressions and random forest is as follows. First, hyperparameters for each model class were selected by performing a stratified k = 5 fold cross validation grid search over the training set. Hyperparameters that led to the highest mean area under the receiver operating characteristic curve (AUROC) were selected for each model class. We then fit the final model for each model class using the entire training set and evaluated each on the validation set. The best model class was chosen by selecting the model with the highest AUROC in the validation set.
After choosing the best model class, hyperparameters were re-tuned on the combined training and validation set using a stratified k = 5 fold cross validation grid search. The training procedure was altered for the gradient boosted tree models so that we could regularize with early stopping. The training procedure was as above except that for each model fit, 5% of each training fold was held out and used as an additional validation set for early stopping. We set the maximum number of boosting iterations to 1000 and a tolerance of ten boosting rounds for the early stopping criteria. The final model was then trained using the combined training and validation set and final performance was evaluated on the test set. The logistic regressions and random forest models were fit using the scikit-learn python package[41]. The gradient boosted tree models were fit using the lightgbm python package[42]. We computed the AUROC and average precision, with 95% confidence intervals estimated by bootstrapping the test set 1000 times[43]. We list all tested hyperparameter configurations in Supplementary Note 1. The Boston data was split into training (2007–2013) and test (2014–2016) sets by time (as in Kanjilal et al.). The optimal model class and hyperparameter setting for each of the four binary models was chosen with a k = 5 fold cross validation grid search over the training set[28].
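The year-based splitting and stratified fold assignment can be sketched in a few lines of plain Python. This is a simplified stand-in for the scikit-learn and lightgbm tooling the study actually used, with illustrative record fields.

```python
from collections import defaultdict

def time_split(records, train_years, val_years):
    """Split observations by calendar year (e.g., train 2009-2017,
    validate 2018, test the remainder) to mimic deployment-time shift."""
    train = [r for r in records if r["year"] in train_years]
    val = [r for r in records if r["year"] in val_years]
    test = [r for r in records if r["year"] not in train_years | val_years]
    return train, val, test

def stratified_folds(labels, k=5):
    """Assign each example a fold id 0..k-1 while keeping the class balance
    roughly constant across folds (round-robin within each class)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [0] * len(labels)
    for idxs in by_class.values():
        for pos, i in enumerate(idxs):
            folds[i] = pos % k
    return folds
```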

Optimizing antibiotic selection with personalized antibiograms

We used the out of sample predicted probabilities from each of our binary classifiers to optimize antibiotic selections across patients in the test set and benchmarked against: (1) random antibiotic selections and (2) the observed clinician selections. Clinician performance was measured by extracting which antibiotics were administered to patients using information stored in the medication administration records. In the Stanford data, we restricted this analysis to admissions in the test set where one of the twelve antibiotic selections we trained models for were administered. In the Boston data, analysis was similarly restricted to patients who were prescribed one of the four antibiotics we trained personalized antibiogram models for. The optimized antibiotic selections were generated by solving a constrained optimization formulation using linear programming[44]. For each admission we selected an antibiotic option that maximized the predicted probability of choosing an antibiotic listed as susceptible subject to the constraints that (1) only one antibiotic option could be selected for a given patient infection and (2) the total number of times certain antibiotic options were selected across patients matched a fixed budget. In the initial simulation, this budget was defined to be the number of times particular antibiotic choices were actually administered by clinicians in the real-world data. Thus, if ceftriaxone was allocated 100 times in the data, the optimizer was similarly forced to allocate ceftriaxone 100 times. In further simulations, these budget constraints were perturbed to empirically estimate the trade-off between maximally selecting antibiotics with activity against a patient’s infection and reducing the use of broad-spectrum antibiotics. An illustrative example of how patient feature vectors are converted into antibiotic assignments is shown in Fig. 2.
Fig. 2

Optimizing antibiotic selections with linear programming.

Patient feature vectors are ingested by personalized antibiogram models (a) to produce antibiotic efficacy estimates (b). Each patient in the test set receives a predicted probability of efficacy for each antibiotic. In this illustration, pentagons refer to one antibiotic option and triangles refer to another. Green indicates the antibiotic option is likely to cover the patient; orange indicates the antibiotic is unlikely to cover the patient. A linear programming objective function is specified with a set of constraints that limit how frequently certain antibiotics can be used. Here the objective function specifies to maximize the total predicted antibiotic efficacy (green) across the two patients subject to the constraint that each antibiotic option is only used once. c Depicts all possible antibiotic allocations color coded by patient specific antibiotic efficacy estimates produced by personalized antibiograms. Antibiotic allocations are only considered (d) if they meet the constraints of the linear programming formulation. The antibiotic allocation that maximizes the total predicted efficacy across the set of patients (e) is chosen.

In technical detail, let N be the number of admissions in our held-out test set and M be the total number of antibiotic selections. Let S ∈ {0, 1}^(N×M) be a matrix of binary decision variables, where s_ij = 1 represents that the jth antibiotic option is allocated to the patient in the ith admission. Let Φ ∈ R^(N×M) be a matrix of out-of-sample probability estimates from our machine learning models, where ϕ_ij represents the predicted probability that the patient in the ith admission would be covered by antibiotic option j. Finally, let K_j be the total number of times the jth antibiotic must be used over the set of our N admissions (the budget parameters). For our initial simulation we let the budget parameters match the number of times clinicians actually allocated the jth antibiotic selection over the set of N admissions.
Our problem formulation, implemented using the PuLP python package and solved with the CBC solver[45], is given by Eq. (1):

maximize   Σ_i Σ_j ϕ_ij s_ij
subject to Σ_j s_ij = 1,   i = 1, …, N
           Σ_i s_ij = K_j,  j = 1, …, M
           s_ij ∈ {0, 1}                         (1)
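The study solves this allocation with PuLP and the CBC solver. As a minimal self-contained illustration of the same objective and budget constraints, the sketch below brute-forces the optimal allocation instead of calling an LP solver, which is only tractable for very small N.

```python
from itertools import permutations

def optimize_allocation(phi, budgets):
    """Exhaustively search antibiotic allocations that respect per-antibiotic
    budgets, maximizing total predicted coverage probability.
    phi[i][j]: predicted probability admission i is covered by antibiotic j.
    budgets[j]: number of times antibiotic j must be allocated (sums to N).
    Returns (assignment, score) where assignment[i] is the antibiotic index
    chosen for admission i."""
    n = len(phi)
    assert sum(budgets) == n, "budgets must allocate exactly one option per admission"
    # Multiset of antibiotic indices that honors the budget constraints;
    # each ordering of this pool is one feasible allocation.
    pool = [j for j, k in enumerate(budgets) for _ in range(k)]
    best, best_score = None, float("-inf")
    for assignment in set(permutations(pool)):
        score = sum(phi[i][j] for i, j in enumerate(assignment))
        if score > best_score:
            best, best_score = list(assignment), score
    return best, best_score
```

For realistic cohort sizes this enumeration is infeasible; the point of the LP formulation is that the same optimum can be found efficiently by a solver such as CBC.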

Sensitivity analysis

We performed a sensitivity analysis to estimate model performance on the full deployment population, including patients with negative microbial culture results. This is important because at prediction time, whether a microbial culture will return positive is unknown. Further, negative microbial cultures do not preclude infection at a site not tested. Some patients with negative microbial cultures will have a latent undetected infection with an antibiotic susceptibility profile (set of labels) that goes unobserved. This can skew model performance estimates if patients with censored labels have a covariate distribution different from those with observed labels. To address this, we (1) constructed an electronic phenotype to identify patients with negative microbial cultures that truly lacked infection, (2) re-trained a new set of personalized antibiogram prediction models that include patients flagged by the electronic phenotype and (3) used inverse probability weighted estimates of AUROC to evaluate performance on the deployment population, the union of patient admissions with positive and negative microbial cultures.
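A minimal sketch of an inverse-probability-weighted AUROC estimator follows. Each labeled example is up-weighted by the inverse of its estimated probability of label observation, so the estimate targets the full deployment population; with uniform observation probabilities it reduces to the ordinary pairwise-ranking AUROC. How the observation probabilities are estimated (e.g., from a model of label observability) is outside this sketch.

```python
def ipw_auroc(y_true, scores, obs_probs):
    """AUROC reweighted by inverse probability of label observation.
    obs_probs[i]: estimated probability that example i's label was observed.
    Computed as a weighted pairwise-ranking statistic over (pos, neg) pairs."""
    w = [1.0 / p for p in obs_probs]
    pos = [(s, wi) for yi, s, wi in zip(y_true, scores, w) if yi == 1]
    neg = [(s, wi) for yi, s, wi in zip(y_true, scores, w) if yi == 0]
    num = 0.0
    for sp, wp in pos:
        for sn, wn in neg:
            if sp > sn:
                num += wp * wn       # correctly ranked pair
            elif sp == sn:
                num += 0.5 * wp * wn  # tie counts half
    denom = sum(wp for _, wp in pos) * sum(wn for _, wn in neg)
    return num / denom
```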

Electronic phenotype

We created a rule-based electronic phenotype that, when applied to the set of patients in our cohort with negative microbial cultures, attempted to extract instances where patients were truly uninfected. We created a strict phenotype, prioritizing positive predictive value over sensitivity. Patients were labelled as uninfected during the admission in question only if all of the following were true: (1) none of the microbial cultures ordered within 2 weeks of the admission returned positive (as in the prior labelling scheme, microbial cultures that grew only coagulase-negative staphylococci were considered negative); (2) antibiotics were either never administered or were stopped within 24 h of starting; (3) if stopped, antibiotics were not restarted for an additional 2 weeks; (4) no ICD codes related to bacterial infection were associated with the hospital admission (see Supplementary Note 2); and (5) the patient did not die during the admission.
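The phenotype rules can be sketched as a conjunction of checks over an admission record. Field names here are illustrative assumptions, not the study's actual schema.

```python
def is_uninfected(adm):
    """Strict rule-based electronic phenotype (sketch). Flags an admission as
    truly uninfected, prioritizing positive predictive value over sensitivity.
    `adm["abx_hours"]` is None if antibiotics were never administered,
    otherwise the number of hours antibiotics ran before being stopped."""
    return all([
        # No culture within 2 weeks of admission returned positive
        # (cultures growing only coagulase-negative staphylococci
        # count as negative).
        not adm["any_positive_culture_within_2wk"],
        # Antibiotics never given, or stopped within 24 h of starting.
        adm["abx_hours"] is None or adm["abx_hours"] <= 24,
        # If stopped, antibiotics not restarted for an additional 2 weeks.
        not adm["abx_restarted_within_2wk"],
        # No bacterial-infection ICD codes attached to the admission.
        not adm["infection_icd_codes"],
        # The patient survived the admission.
        not adm["died_during_admission"],
    ])
```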

Updated labelling schema

Applying the above electronic phenotype to patients with negative microbial cultures resulted in a cohort of patient admissions broken down into three distinct buckets. Bucket 1 included patient microbial cultures that returned positive. Antibiotic susceptibility testing was performed and we observed their class label. This bucket is the set of patient infections included in our primary analysis. Bucket 2 included admissions whose microbial cultures returned negative and were flagged by our electronic phenotype indicating lack of infection. We observed these class labels, which we define as positive for each prediction task because lack of infection indicates the patient would have been covered by any antibiotic selection. Bucket 3 included admissions whose microbial cultures were negative and were not flagged by our electronic phenotype. Patients in this bucket may or may not have been infected. We did not observe their class labels. These three buckets are illustrated in Fig. 3.
Fig. 3

Three buckets of observations in the deployment population.

a The deployment population is the set of patients that would trigger personalized antibiogram model predictions in a deployment scenario. b Prediction time is defined as the time the empiric antibiotic order is placed. c After prediction time, cultures can go on to have a positive or negative result. d If cultures are positive, antibiotic susceptibility testing is performed. If negative, our electronic phenotype flags patients who, with high likelihood, lacked a clinical infection warranting antibiotics. e Three buckets of observations. Patients landing in Bucket 1 or 2 have observed labels under the labelling scheme defined in the sensitivity analysis. Patients landing in Bucket 3 have labels that go unobserved.

We included patients in bucket 2 in our model training and evaluation procedure by adopting the following altered labelling schema. The labelling schema was as before, except that all patients in bucket 2 were assigned a positive label for every antibiotic. Specifically, for each of the twelve antibiotic options, a positive label was assigned if the admission resulted in a positive microbial culture (bucket 1) and the resulting organism(s) were susceptible to the antibiotic, or if the admission resulted in negative cultures and was flagged by the electronic phenotype (bucket 2). A negative label was assigned if the admission resulted in a positive microbial culture (bucket 1) and a resulting organism was not susceptible to the antibiotic. Models were trained using patient admissions in buckets 1 and 2. The covariate distribution of patient admissions in bucket 3 was used to estimate model performance on the full deployment population (patients in buckets 1, 2, and 3) using inverse probability weighting.
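The altered labelling schema reduces to a small per-(admission, antibiotic) rule. A minimal sketch, assuming bucket membership has already been determined as above (`assign_label` and the integer bucket encoding are illustrative names, not the study's code):

```python
from typing import Optional

def assign_label(bucket: int, susceptible: Optional[bool] = None) -> Optional[int]:
    """Class label for one (admission, antibiotic) pair under the altered schema.

    Bucket 1: positive culture -> label follows antibiotic susceptibility testing.
    Bucket 2: negative culture, phenotype-flagged -> positive for every antibiotic,
              since lack of infection means any selection would have covered the patient.
    Bucket 3: negative culture, not flagged -> label unobserved (None); excluded from
              training and handled via inverse probability weighting at evaluation.
    """
    if bucket == 1:
        return 1 if susceptible else 0
    if bucket == 2:
        return 1
    return None
```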

Inverse probability weighting to estimate model performance on the deployment population

We used inverse probability weighting to account for patient admissions whose class labels we did not observe (bucket 3) in our estimates of model performance (AUROC). In a theoretical deployment scenario, we would deploy our models on a population of patient admissions comprising the union of buckets 1, 2, and 3. If the covariate distribution of patient admissions in buckets 1 and 2 differs from that of patient admissions in bucket 3, and our models perform better or worse in the regions of covariate space that are more common in buckets 1 and 2, then we risk over- or underestimating how well our models would perform in deployment. To estimate performance on a population that includes patients whose labels we do not observe, we weight each patient admission with an observed label (buckets 1 and 2) by the inverse probability of observing it. We obtain this probability by fitting a binary random forest classifier (using the same feature matrix and index time as the personalized antibiogram models) to predict whether a patient admission lands in buckets 1 and 2 or in bucket 3. The inverse probability weighted estimates of sensitivity and specificity for each of our 12 models are [Eqs. 2–3]:

$$\widehat{\mathrm{Sensitivity}}(t) = \frac{\sum_{i} \mathbb{1}[\hat{p}_i \ge t]\,\mathbb{1}[y_i = 1]/\hat{\pi}_i}{\sum_{i} \mathbb{1}[y_i = 1]/\hat{\pi}_i} \quad (2)$$

$$\widehat{\mathrm{Specificity}}(t) = \frac{\sum_{i} \mathbb{1}[\hat{p}_i < t]\,\mathbb{1}[y_i = 0]/\hat{\pi}_i}{\sum_{i} \mathbb{1}[y_i = 0]/\hat{\pi}_i} \quad (3)$$

where $\hat{p}_i$ is the predicted probability from a personalized antibiogram model for patient admission $i$, $y_i$ is its observed class label, $\hat{\pi}_i$ is the predicted probability of observing patient admission $i$'s class label, $t$ is the probability cut-off threshold used to map predicted probabilities to predicted class labels, and the sums run over admissions with observed labels (buckets 1 and 2). The inverse probability weighted ROC curve, and the area under it, can be estimated by varying the probability cut-off threshold $t$ in these estimators [Eqs. 2–3].
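A minimal numerical sketch of these estimators, assuming binary labels `y`, antibiogram model scores `p_hat`, and observation probabilities `pi_hat` (all illustrative names) are available as NumPy arrays with `pi_hat > 0`:

```python
import numpy as np

def ipw_roc(y, p_hat, pi_hat, thresholds):
    """IPW sensitivity and specificity at each cut-off threshold t [Eqs. 2-3].

    y: observed labels (buckets 1 and 2); p_hat: antibiogram model scores;
    pi_hat: estimated probability that each admission's label is observed (> 0).
    """
    w = 1.0 / pi_hat                                  # inverse probability weights
    sens, spec = [], []
    for t in thresholds:
        pred = p_hat >= t
        sens.append((w * pred * (y == 1)).sum() / (w * (y == 1)).sum())
        spec.append((w * ~pred * (y == 0)).sum() / (w * (y == 0)).sum())
    return np.array(sens), np.array(spec)

def ipw_auroc(y, p_hat, pi_hat, n_grid=1001):
    """Area under the IPW ROC curve, traced by sweeping the cut-off threshold."""
    sens, spec = ipw_roc(y, p_hat, pi_hat, np.linspace(0.0, 1.0, n_grid))
    fpr = 1.0 - spec
    order = np.lexsort((sens, fpr))                   # sort by FPR, then sensitivity
    f, s = fpr[order], sens[order]
    return float(np.sum(np.diff(f) * (s[1:] + s[:-1]) / 2.0))  # trapezoidal rule
```

In practice `pi_hat` would come from the bucket-membership random forest described above; with uniform weights the estimators reduce to the ordinary empirical sensitivity, specificity, and AUROC.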

Statistics and reproducibility

Coverage rates of random antibiotic selection in the two cohorts were statistically compared to coverage rates achieved with personalized antibiograms using one-sided permutation tests [46]. Specifically, an empirical distribution of random coverage rates was created by randomly re-assigning antibiotic selections to different patient-infections 10,000 times. A p value was calculated as the fraction of coverage rates in this empirical distribution that equaled or exceeded the coverage rate achieved with personalized antibiograms. If no value in the empirical distribution equaled or exceeded the observed value, the p value was reported as p < 0.0001. The clinician coverage rate was compared to the personalized antibiogram coverage rate using a similar procedure, except that the empirical distribution of clinician coverage rates was generated by performing a stratified bootstrap (stratified by antibiotic selection) 10,000 times.
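A minimal sketch of the permutation test, assuming a hypothetical binary coverage matrix (`covered`, `abx_given`, and `observed_rate` are illustrative names, not the study's data structures):

```python
import numpy as np

def permutation_pvalue(covered, abx_given, observed_rate, n_iter=10_000, seed=0):
    """One-sided permutation test for coverage rates.

    covered[i, j] = 1 if infection i would have been covered by antibiotic j;
    abx_given[i] is the index of the antibiotic actually selected. Returns the
    fraction of random re-assignments whose coverage rate equals or exceeds
    observed_rate; a returned 0.0 would be reported as p < 1/n_iter.
    """
    rng = np.random.default_rng(seed)
    n = len(abx_given)
    rates = np.empty(n_iter)
    for k in range(n_iter):
        shuffled = rng.permutation(abx_given)          # re-assign selections across patients
        rates[k] = covered[np.arange(n), shuffled].mean()
    return float((rates >= observed_rate).mean())
```

The clinician comparison follows the same recipe, except the empirical distribution is generated by resampling clinician selections with a bootstrap stratified by antibiotic, rather than by permutation.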
Table 1

Stanford cohort demographics grouped by train test split.

| Description | Category | Test (2019) | Train + Validation (2009–2018) |
|---|---|---|---|
| n | Total | 1320 | 7022 |
| Emergency department, n (%) | Stanford ED | 855 (64.8) | 6669 (95.0) |
| | Valley Care ED | 465 (35.2) | 353 (5.0) |
| Age, mean (SD) | | 70.4 (17.2) | 67.5 (17.3) |
| Sex, n (%) | Female | 793 (60.1) | 4171 (59.4) |
| | Male | 527 (39.9) | 2851 (40.6) |
| Race, n (%) | White | 757 (57.3) | 3937 (56.1) |
| | Other | 251 (19.0) | 1411 (20.1) |
| | Asian | 201 (15.2) | 937 (13.3) |
| | Black | 69 (5.2) | 464 (6.6) |
| | Pacific Islander | 30 (2.3) | 206 (2.9) |
| | Unknown | 7 (0.5) | 40 (0.6) |
| | Native American | 5 (0.4) | 27 (0.4) |
| Ethnicity, n (%) | Non-Hispanic | 1117 (84.6) | 5823 (82.9) |
| | Hispanic/Latino | 195 (14.8) | 1169 (16.6) |
| | Unknown | 8 (0.6) | 30 (0.4) |
| Language, n (%) | English | 1112 (84.2) | 5743 (81.8) |
| | Non-English | 208 (15.8) | 1279 (18.2) |
| Insurance Payer, n (%) | Medicare | 651 (49.3) | 3805 (54.2) |
| | Other | 615 (46.6) | 2987 (42.5) |
| | Medi-Cal | 54 (4.1) | 230 (3.3) |
Table 2

Most frequently isolated species grouped by microbial culture type and emergency department.

| Emergency department | Culture type | Organism | Infections |
|---|---|---|---|
| Stanford ED | Blood culture | Escherichia coli | 1031 |
| | | Staphylococcus aureus | 585 |
| | | Klebsiella pneumoniae | 318 |
| | | Enterococcus faecalis | 159 |
| | | Streptococcus agalactiae (group B) | 131 |
| | Urine culture | Escherichia coli | 2927 |
| | | Enterococcus species | 877 |
| | | Klebsiella pneumoniae | 653 |
| | | Proteus mirabilis | 299 |
| | | Pseudomonas aeruginosa | 268 |
| | Other fluid culture | Staphylococcus aureus | 127 |
| | | Escherichia coli | 83 |
| | | Streptococcus anginosus group | 56 |
| | | Klebsiella pneumoniae | 45 |
| | | Enterococcus faecium | 28 |
| Valley Care ED | Blood culture | Escherichia coli | 98 |
| | | Staphylococcus aureus | 49 |
| | | Klebsiella pneumoniae | 29 |
| | | Proteus mirabilis | 15 |
| | | Pseudomonas aeruginosa | 9 |
| | Urine culture | Escherichia coli | 361 |
| | | Proteus mirabilis | 90 |
| | | Klebsiella pneumoniae | 84 |
| | | Enterococcus faecalis | 59 |
| | | Pseudomonas aeruginosa | 43 |
| | Other fluid culture | Escherichia coli | 13 |
| | | Staphylococcus aureus | 11 |
| | | Klebsiella pneumoniae | 5 |
| | | Streptococcus anginosus group | 4 |
| | | Enterococcus faecium | 2 |
Table 3

Antibiotic susceptibility classifier performance.

| Antibiotic selection | Best model class | Prevalence | Average precision | AUROC |
|---|---|---|---|---|
| Vancomycin | Gradient Boosted Tree | 0.23 | 0.46 [0.40, 0.52] | 0.72 [0.68, 0.75] |
| Ampicillin | Gradient Boosted Tree | 0.43 | 0.54 [0.49, 0.58] | 0.62 [0.59, 0.65] |
| Cefazolin | Gradient Boosted Tree | 0.59 | 0.72 [0.68, 0.76] | 0.67 [0.64, 0.70] |
| Ciprofloxacin | Random Forest | 0.63 | 0.73 [0.70, 0.76] | 0.61 [0.58, 0.64] |
| Ceftriaxone | Gradient Boosted Tree | 0.66 | 0.79 [0.77, 0.82] | 0.69 [0.66, 0.72] |
| Cefepime | Random Forest | 0.80 | 0.87 [0.84, 0.89] | 0.65 [0.61, 0.69] |
| Vancomycin + Ceftriaxone | Gradient Boosted Tree | 0.81 | 0.87 [0.84, 0.89] | 0.67 [0.63, 0.71] |
| Meropenem | Gradient Boosted Tree | 0.82 | 0.90 [0.88, 0.92] | 0.69 [0.65, 0.72] |
| Pip-Tazo | Random Forest | 0.90 | 0.94 [0.92, 0.95] | 0.64 [0.59, 0.69] |
| Vancomycin + Pip-Tazo | Random Forest | 0.96 | 0.98 [0.97, 0.99] | 0.70 [0.62, 0.77] |
| Vancomycin + Cefepime | Random Forest | 0.97 | 0.98 [0.98, 0.99] | 0.70 [0.62, 0.78] |
| Vancomycin + Meropenem | Gradient Boosted Tree | 0.98 | 0.99 [0.99, 0.99] | 0.73 [0.65, 0.81] |

Pip-Tazo = piperacillin/tazobactam.

Table 4

Boston Model Performances.

| Antibiotic selection | Best model class | Prevalence | Average precision | AUROC |
|---|---|---|---|---|
| Trime/Sulf | Gradient Boosted Tree | 0.80 | 0.85 [0.84, 0.87] | 0.60 [0.58, 0.62] |
| Nitrofurantoin | Gradient Boosted Tree | 0.89 | 0.91 [0.90, 0.92] | 0.57 [0.54, 0.61] |
| Ciprofloxacin | Lasso | 0.94 | 0.95 [0.95, 0.96] | 0.64 [0.60, 0.68] |
| Levofloxacin | Lasso | 0.94 | 0.96 [0.95, 0.96] | 0.64 [0.60, 0.67] |

Trime/Sulf = trimethoprim/sulfamethoxazole.

Table 5

Personalized antibiogram sensitivity analysis with and without inverse probability weights. Pip-Tazo = piperacillin/tazobactam.

| Antibiotic selection | AUROC (original classifiers) | AUROC (sensitivity analysis) | AUROC, IPW (sensitivity analysis) |
|---|---|---|---|
| Vancomycin | 0.72 [0.68, 0.75] | 0.74 [0.71, 0.76] | 0.75 [0.72, 0.77] |
| Ampicillin | 0.62 [0.59, 0.65] | 0.69 [0.66, 0.71] | 0.69 [0.66, 0.71] |
| Cefazolin | 0.67 [0.64, 0.70] | 0.71 [0.68, 0.73] | 0.70 [0.67, 0.73] |
| Ceftriaxone | 0.69 [0.66, 0.72] | 0.72 [0.69, 0.75] | 0.72 [0.69, 0.74] |
| Cefepime | 0.65 [0.61, 0.69] | 0.64 [0.60, 0.68] | 0.62 [0.58, 0.66] |
| Pip-Tazo | 0.64 [0.59, 0.69] | 0.65 [0.59, 0.70] | 0.62 [0.56, 0.68] |
| Ciprofloxacin | 0.61 [0.58, 0.64] | 0.64 [0.62, 0.68] | 0.64 [0.61, 0.67] |
| Meropenem | 0.69 [0.65, 0.72] | 0.71 [0.68, 0.74] | 0.70 [0.67, 0.74] |
| Vancomycin + Meropenem | 0.73 [0.65, 0.81] | 0.76 [0.67, 0.84] | 0.74 [0.65, 0.84] |
| Vancomycin + Pip-Tazo | 0.70 [0.62, 0.77] | 0.71 [0.63, 0.78] | 0.70 [0.62, 0.78] |
| Vancomycin + Cefepime | 0.70 [0.62, 0.78] | 0.68 [0.60, 0.77] | 0.67 [0.59, 0.76] |
| Vancomycin + Ceftriaxone | 0.67 [0.63, 0.71] | 0.71 [0.68, 0.75] | 0.70 [0.66, 0.74] |

1. Vancomycin: a history. Donald P Levine. Clin Infect Dis, 2006.

2. The potential for artificial intelligence in healthcare. Thomas Davenport, Ravi Kalakota. Future Healthc J, 2019.

3. Blood Culture Turnaround Time in U.S. Acute Care Hospitals and Implications for Laboratory Process Optimization. Ying P Tabak, Latha Vankeepuram, Gang Ye, Kay Jeffers, Vikas Gupta, Patrick R Murray. J Clin Microbiol, 2018.

4. CLSI Methods Development and Standardization Working Group Best Practices for Evaluation of Antimicrobial Susceptibility Tests. Romney M Humphries, Jane Ambler, Stephanie L Mitchell, Mariana Castanheira, Tanis Dingle, Janet A Hindler, Laura Koeth, Katherine Sei. J Clin Microbiol, 2018.

5. Personalized Antibiograms: Machine Learning for Precision Selection of Empiric Antibiotics. Conor K Corbin, Richard J Medford, Kojo Osei, Jonathan H Chen. AMIA Jt Summits Transl Sci Proc, 2020.

6. A quality improvement initiative to improve adherence to national guidelines for empiric management of community-acquired pneumonia in emergency departments. Kylie A McIntosh, David J Maxwell, Lisa K Pulver, Fiona Horn, Marion B Robertson, Karen I Kaye, Gregory M Peterson, William B Dollman, Angela Wai, Susan E Tett. Int J Qual Health Care, 2010.

7. The influence of inadequate antimicrobial treatment of bloodstream infections on patient outcomes in the ICU setting. E H Ibrahim, G Sherman, S Ward, V J Fraser, M H Kollef. Chest, 2000.

8. A decision support tool for antibiotic therapy. R S Evans, D C Classen, S L Pestotnik, T P Clemmer, L K Weaver, J P Burke. Proc Annu Symp Comput Appl Med Care, 1995.

9. Towards personalized guidelines: using machine-learning algorithms to guide antimicrobial selection. Ed Moran, Esther Robinson, Christopher Green, Matt Keeling, Benjamin Collyer. J Antimicrob Chemother, 2020.

10. Disruption of the Gut Ecosystem by Antibiotics. Mi Young Yoon, Sang Sun Yoon. Yonsei Med J, 2018.

