
A framework for making predictive models useful in practice.

Kenneth Jung¹, Sehj Kashyap¹, Anand Avati², Stephanie Harman³, Heather Shaw⁴, Ron Li³, Margaret Smith³, Kenny Shum⁵, Jacob Javitz⁵, Yohan Vetteth⁵, Tina Seto⁵, Steven C Bagley¹, Nigam H Shah¹.

Abstract

OBJECTIVE: To analyze the impact of factors in healthcare delivery on the net benefit of triggering an Advanced Care Planning (ACP) workflow based on predictions of 12-month mortality.
MATERIALS AND METHODS: We built a predictive model of 12-month mortality using electronic health record data and evaluated the impact of healthcare delivery factors on the net benefit of triggering an ACP workflow based on the model's predictions. Factors included nonclinical reasons that make ACP inappropriate: limited capacity for ACP, inability to follow up due to patient discharge, and availability of an outpatient workflow to follow up on missed cases. We also quantified the relative benefits of increasing capacity for inpatient ACP versus outpatient ACP.
RESULTS: Work capacity constraints and discharge timing can significantly reduce the net benefit of triggering the ACP workflow based on a model's predictions. However, the reduction can be mitigated by creating an outpatient ACP workflow. Given limited resources to either add capacity for inpatient ACP or develop outpatient ACP capability, the latter is likely to provide more benefit to patient care.
DISCUSSION: The benefit of using a predictive model for identifying patients for interventions is highly dependent on the capacity to execute the workflow triggered by the model. We provide a framework for quantifying the impact of healthcare delivery factors and work capacity constraints on achieved benefit.
CONCLUSION: An analysis of the sensitivity of the net benefit realized by a predictive model triggered clinical workflow to various healthcare delivery factors is necessary for making predictive models useful in practice.


Keywords: machine learning, evaluation, utility assessment, workflow simulation, advanced care planning

Year: 2021 PMID: 33355350 PMCID: PMC8200271 DOI: 10.1093/jamia/ocaa318

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

INTRODUCTION

Over the past decade, the rapid increase in the availability of healthcare data collected during routine care and dramatic advances in machine learning have fed a great deal of excitement about using machine learning to improve clinical care. Predictive models, which estimate the probability of some event of interest occurring in a specified time frame in the future, have been developed for events such as heart failure, inpatient mortality, and patient deterioration. However, there have been relatively few success stories in which these models affected what matters to patients, providers, and healthcare decision makers, such as reduced costs, lower rates of those clinical events, and increased access to care. The lack of impact is in part because the evaluation of machine learning models typically focuses on measures of performance, such as area under the receiver operating characteristic curve (AUROC) and mean precision. Such measures provide a convenient quantitative summary of the performance of the model, but they do not reflect the consequences of taking action based on the model’s output. The translation of model performance into benefit to patient care is mediated by multiple factors related to existing and proposed clinical workflows that incur a broad range of data acquisition, change management, and workflow integration burdens. While measures such as AUROC can summarize the accuracy of predicting an event, they have little bearing on the effectiveness of actions taken in response to the prediction, the constraints on those actions, and the ethical concerns or value mismatches that may arise from those actions.
In many cases the overall benefit of using predictive models for improving healthcare will be critically determined by considerations of implementation costs, actionability, safety, and utility. In this work, we present a comprehensive analysis of the net benefit of a machine-learning enabled workflow for identifying patients for Advance Care Planning (ACP). ACP is the elicitation and documentation of patient values and preferences regarding goals of care. Documentation of ACP is critical for guiding care when patients are seriously ill and unable to express their wishes or make their own decisions. The target population is patients admitted to the General Medicine service at Stanford Hospital. The prediction is an estimate of 12-month mortality based on the patient’s historical electronic health record (EHR) data; the action triggered by a high-risk estimate is the offering of ACP, carried out by the General Medicine teams. Timely offering of ACP and palliative care has proven to be effective at reducing downstream costs, improving physician morale, and, in some cases, improving outcomes such as survival. However, despite widespread recognition that such interventions are underutilized, timely ACP by hospitalists can be difficult, given the urgency of taking care of a severely ill patient, which can interfere with consideration of a patient’s longer-term prognosis. Predictive models for mortality have been evaluated for identifying patients to receive ACP or similar interventions. There is evidence that care workflows using such predictive models to screen patients can identify the right patients for timely conversations about goals of care. However, over the course of designing an implementation of a workflow enabled by a machine learning model at Stanford Hospital in early 2020, we identified various factors that limit the presumed net benefit resulting from setting up a model-triggered workflow.
First, there may be limited capacity to perform ACP even if patients are correctly identified by the model. Second, effective ACP takes time to conduct, since it is based on conversations with patients and families. Building the shared understanding and rapport necessary for these conversations takes time, which busy physicians may find difficult to spare. Third, it is possible to imagine alternatives that can mitigate the impact of these factors, such as increasing our capacity for inpatient ACP by hiring more staff or developing the capability for outpatient ACP. We quantify the impact of these factors and gain insight into how best to mitigate those impacts. Our approach is inspired by the evaluation of screening tests, and we view the prediction model as a screening test that triggers follow-up care. For example, analogous to a number needed to screen, we can compute a number needed to benefit, that is, the number of patients that would need to be screened by a test and treated in response to a positive result. Cost–benefit analysis is a widely used approach to evaluate decisions made under such uncertainty; here we apply a cost–benefit analysis to multiple factors impacting the effectiveness of the proposed workflow using simulations and sensitivity analysis. Unlike prior work applying the cost–benefit framework to the clinical use of predictive models, our focus is not on selecting a suitable decision threshold for triggering a fixed intervention. Rather, akin to Horvitz’s efforts, we apply this framework to alternative setups of the ACP workflow to gain insight into the impact of healthcare delivery factors on the achieved benefit.
The essence of our approach is to estimate utilities for each of 4 possible outcomes derived from the model output for each patient: true positives (the model correctly flags a patient for ACP and ACP is carried out), false positives (the model incorrectly flags a patient for ACP, and it is carried out), true negatives (the model correctly fails to flag a patient for ACP, and none is carried out), and false negatives (the model incorrectly fails to flag a patient for ACP, and no ACP is carried out). The utility for each outcome quantifies the benefit associated with that outcome, in whatever units are most convenient or meaningful. In this work, we quantify utility as total healthcare expenditures in the 6 months following discharge because estimates are available from a multisite randomized controlled trial (RCT). If utility estimates are available in standard units of the amount of unwanted care avoided, or increase in patient comfort, the framework would still apply. Given these RCT-derived utilities, along with a set of patients for whom we have both risk estimates and expert-provided, ground-truth assessments of appropriateness for ACP, we can then estimate the benefit under the setup of screening by a model and compare it to the benefit achieved by policies such as intervening on all or none of the patients. While such approaches have been advocated for evaluating predictive analytics in healthcare, they likely overestimate the benefit realized in practice, because factors such as limited capacity to perform ACP can reduce the effective number of true positives. For example, if we can only perform ACP for 1 patient per day, then no benefit is accrued by the second and third patients flagged for ACP even if the predictions are correct. In this study, we quantify the effect of such healthcare delivery factors by performing simulations that assess a broad range of failure rates for a variety of factors.
We also assess an alternative workflow in which ACP is conducted via an outpatient pathway for patients who were correctly flagged for ACP by the model, but for whom ACP was not completed during their hospital stay.

MATERIALS AND METHODS

Our analysis uses simulations of a proposed prediction-triggered workflow (ie, the responsive action) running for an extended period. The prediction–action dyad starts with an automated estimation of the risk of 12-month all-cause mortality for patients admitted to the General Medicine service at Stanford Hospital. Risk estimates are output by a model for all-cause mortality, as described below. Chart review by an experienced palliative care nurse was used to assign ground-truth labels for whether ACP was appropriate. Patients whose risk estimate exceeds a chosen threshold are considered referred for an ACP intervention. Given a risk threshold, we can calculate a utility achieved for each patient, using the risk estimate and the ground-truth label to categorize that prediction as true positive, false positive, true negative, or false negative. Simulations vary the effects of various factors, such as work capacity constraints, which may prevent ACP from being successfully carried out, thus reducing the utility actually achieved for the patient. Below, we describe the predictive model, its validation via chart review, and our simulation setup.

Model development

We developed a gradient boosted tree model to estimate the probability of 1-year all-cause mortality upon inpatient admission, using a deidentified, retrospective dataset of EHR records for adult patients seen at Stanford Hospital between 2010 and 2017. This dataset comprised 97 683 admissions, and had a prevalence of 17.6% for 1-year all-cause mortality. Data were obtained under a Stanford University Institutional Review Board approved protocol, and informed consent was waived. The input data for a given admission consisted of basic demographic information (age and gender), along with counts of diagnosis codes, medication orders, and encounter types (eg, inpatient, office visits, surgery) observed in the year prior to admission. A total of 63 043 features were used. We split the admissions according to year: we used 82 525 admissions from 2010 through 2015 as training data and 15 098 admissions in 2016 and 2017 as validation data. We tuned hyperparameters based on performance in the validation data and trained a final model on the full dataset using the optimal hyperparameters. We evaluated the performance of the resulting model against clinician chart review prospectively, as described below. We omitted admissions for which we do not have reliable indications of survival (eg, in person encounters) or mortality in the following year. Although our analysis in this work is focused on admissions to the General Medicine service, the 1-year all-cause mortality model was developed using admissions to all services. We did so to ensure adequate training data and to obtain a prediction model that can be used for other service lines as well. Note that 1-year mortality is a surrogate for the true target, “appropriate for ACP,” which is difficult to define programmatically. 
Therefore, in our prospective evaluation, we assess the concordance of referrals triggered by a model trained to predict this surrogate outcome against the key clinical criteria currently used by General Medicine clinicians at Stanford Hospital to trigger a consultation with Palliative Care.

Prospective evaluation by chart review

We conducted a prospective evaluation of the model against expert chart review over 2 months in the first half of 2019; each day in that period we presented an experienced palliative-care advanced-practice (AP) nurse with a list of patients newly admitted to General Medicine at Stanford Hospital. These lists were in random order, and no model output was presented. On any given day, the evaluator might not have been able to complete the chart review for all, or even any, of the newly admitted patients. However, the set of patients for whom chart review was completed is random. These data were collected under a separate Stanford University Institutional Review Board approved protocol; need for informed consent was waived. The AP nurse performed chart reviews for each patient in order, as time permitted each day, to answer the question, “Would you be surprised if this patient passed away in the next twelve months?” Note that we use the responses to this question as a ground-truth label for appropriateness for ACP because currently such assessment by General Medicine clinicians is a key determinant for triggering a consultation with palliative care. An important factor explored in our simulations is patients leaving the hospital before ACP can be completed. Length of stay information was not available at the time of chart review, which is done on day of admission. Thus, we mapped the patients who underwent chart review to a deidentified EHR data extract containing length of stay information. Of the 191 patient charts reviewed, 178 were mapped and used in subsequent analysis. The median age was 59.6 years (+/− 19.52 years), and 49.7% were female. The case mix for this population in comparison with the General Medicine admitted patients in the first half of 2019 is shown in the Supplementary Table S1. The AP nurse responded “No” to the evaluation question for 29.2% of these patients. 
The model achieved an area under the ROC curve and Precision-Recall curve of 0.86 and 0.76 respectively (with 95% confidence intervals of 0.8–0.92 and 0.65–0.97 respectively, estimated from 1000 bootstrap samples of the chart review set).
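The bootstrap interval for the area under the ROC curve can be sketched as follows. This is a minimal illustration using the percentile method, which is an assumption (the text does not say which bootstrap interval was used); the function names `auroc` and `bootstrap_ci` are hypothetical, and the scores and labels would come from the chart review set.

```python
import random

def auroc(scores, labels):
    """AUROC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed method) from n_boot resamples with replacement."""
    rng = random.Random(seed)
    indices = list(range(len(scores)))
    stats = []
    while len(stats) < n_boot:
        sample = rng.choices(indices, k=len(indices))
        ys = [labels[i] for i in sample]
        if all(ys) or not any(ys):
            continue  # AUROC is undefined unless both classes are present
        stats.append(auroc([scores[i] for i in sample], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The same resampling loop applies to the precision-recall curve by swapping in a different statistic for `auroc`.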

Utility values and simulation parameters

We estimated the utilities of the 4 possible outcomes of an intervention, Utp, Ufn, Ufp, and Utn, using a multicenter randomized trial of inpatient palliative care consults to measure the postdischarge healthcare costs (Table 1). Gade et al found that patients who received usual care incurred an average cost of $21 252 in the 6 months following discharge, while patients who received an inpatient palliative care consultation incurred an average cost of $14 486. We use the former as an estimate of the cost of a false negative Ufn (ie, the cost incurred by a patient who should have received ACP but did not), and the difference ($6766) as the savings from ACP. The consultation itself cost $1911, for a net savings of $4855. These data were collected between 2002 and 2003; accounting for inflation yields costs of $37 085 for patients who do not receive ACP. Subtracting the inflation-adjusted net savings of $8472 yields a cost for true positives (Utp) of $28 613. We used these estimates as the values for true positives and false negatives, respectively. We estimated the cost of true negatives and false positives as follows. For the cost of true negatives, Utn, we use the mean per capita annual spending for US patients ($11 646) as determined from the Peterson–Kaiser Health System Tracker. For false positives, we added the inflation-adjusted cost of the intervention ($3324) to this cost, yielding a total of $14 970. These utility values are summarized in Table 1. We note that the net expected utility at a given decision threshold depends solely on the prevalence of “appropriate for ACP” patients, the recall and specificity of the model, and the magnitudes of the differences Utp − Ufn ($8472) and Ufp − Utn ($3324) rather than the individual utilities.
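The closing remark about the two utility differences can be made explicit with a short derivation. The symbols p (prevalence of "appropriate for ACP" patients), r(t) (recall at threshold t), and s(t) (specificity at threshold t) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
EU(t) &= p\,\bigl[r(t)\,U_{tp} + (1-r(t))\,U_{fn}\bigr]
       + (1-p)\,\bigl[(1-s(t))\,U_{fp} + s(t)\,U_{tn}\bigr] \\
EU_{\text{nobody}} &= p\,U_{fn} + (1-p)\,U_{tn} \\
EU(t) - EU_{\text{nobody}} &= p\,r(t)\,(U_{tp}-U_{fn})
       + (1-p)\,(1-s(t))\,(U_{fp}-U_{tn})
\end{align*}
```

With the utilities expressed as negative costs, as in Table 1, the first difference is +8472 (the gain per true positive) and the second is −3324 (the loss per false positive), so the net utility relative to treating nobody depends only on p, r(t), s(t), and these two differences.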

Table 1.

Utility values

Parameter	Description	Value (US$)	Source
U_tp	Utility for true positives (ACP is appropriate and provided)	−28 613	Gade et al; net savings of 4855 × inflation multiplier of 1.745, subtracted from the U_fn cost
U_fn	Utility for false negatives (ACP is appropriate but not provided)	−37 085	Gade et al; original value of 21 252 × inflation multiplier of 1.745
U_fp	Utility for false positives (ACP is not appropriate but provided)	−14 970	U_tn plus inflation-adjusted cost of intervention
U_tn	Utility for true negatives (ACP is not appropriate and not provided)	−11 646	Per capita spend in US, 2018, Peterson–Kaiser
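The entries of Table 1 can be reproduced from the figures quoted in the text. The sketch below is only a check of that arithmetic; the variable names are hypothetical, and the inflation-adjusted intervention cost ($3324) is taken as quoted rather than recomputed.

```python
# Figures quoted in the text (US dollars); names are illustrative only.
INFLATION = 1.745              # multiplier listed in Table 1
USUAL_CARE_COST = 21_252       # Gade et al, 6-month cost without a consult
CONSULT_ARM_COST = 14_486      # Gade et al, 6-month cost with a consult
CONSULT_COST = 1_911           # cost of the consultation itself
PER_CAPITA_SPEND = 11_646      # Peterson-Kaiser per capita annual spending
INTERVENTION_COST_ADJ = 3_324  # inflation-adjusted intervention cost, as quoted

gross_savings = USUAL_CARE_COST - CONSULT_ARM_COST  # 6766
net_savings = gross_savings - CONSULT_COST          # 4855

# Utilities are negative costs, matching the sign convention of Table 1.
u_fn = -round(USUAL_CARE_COST * INFLATION)
u_tp = u_fn + round(net_savings * INFLATION)
u_tn = -PER_CAPITA_SPEND
u_fp = u_tn - INTERVENTION_COST_ADJ
```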


Establishing the “best case”

Our goal is to gain insight into how healthcare delivery factors impact the net benefit of triggering an Advanced Care Planning (ACP) workflow based on predictions of 12-month mortality. To that end, we use simulations based on resampling of the 178 admissions for which we have clinician evaluations, model output, and length of stay. We first establish the “best case” utilities achievable given the predictive model as it stands, where “best case” is the absence of failures due to healthcare delivery factors. Thus, even under the “best case,” given the model’s mistakes, false positive patients receive ACP (and false negatives miss ACP) in the simulation. To establish this best case, we simulated 5000 days of offering ACP. Each simulated day, we randomly sampled the number of admissions from the empirical distribution of admissions per day observed from March through August 2019 (the mean number of admissions to the General Medicine service was 9.4 per day, with a standard deviation of 3.4). We then chose that many admissions with replacement from the 178 admissions with model output and chart review. We calculated the number of patients in each of the 4 possible outcomes at each possible threshold of the model output to get the true positives, false negatives, false positives, and true negatives for the day. These counts were multiplied by the utilities for each category (Utp, Ufn, Ufp, and Utn), and the products summed. The result is the utility achieved on a given day, which is divided by the number of admissions to yield a per-admission utility. We simulated 5000 such days and calculated the mean utility over simulated days at each possible threshold. We smoothed the mean utilities across thresholds using loess with a smoothing parameter of 0.75 and took the maximum utility across all possible thresholds. We then normalize this maximum to the status quo ante by subtracting the utility under a policy of Treat Nobody, yielding the possible maximum net utility. 
Note that establishing the “best case” in this manner assumes each true positive is translated into successful execution of the ACP workflow.
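The best-case procedure can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: `cohort` is a hypothetical list of (risk estimate, appropriate-for-ACP) pairs standing in for the 178 reviewed admissions, `daily_counts` stands in for the empirical distribution of admissions per day, and the loess smoothing step is omitted.

```python
import random

UTIL = {"tp": -28_613, "fn": -37_085, "fp": -14_970, "tn": -11_646}  # Table 1

def outcome(risk, appropriate, threshold):
    """Classify one admission at a given risk threshold."""
    flagged = risk >= threshold
    if appropriate:
        return "tp" if flagged else "fn"
    return "fp" if flagged else "tn"

def simulate_best_case(cohort, daily_counts, thresholds, n_days=5000, seed=0):
    """Maximum mean per-admission utility over thresholds, net of Treat Nobody."""
    rng = random.Random(seed)
    # An infinite threshold flags nobody, which gives the Treat Nobody policy.
    grid = list(thresholds) + [float("inf")]
    totals = [0.0] * len(grid)
    days = 0
    for _ in range(n_days):
        n = rng.choice(daily_counts)         # admissions on this simulated day
        if n == 0:
            continue
        admitted = rng.choices(cohort, k=n)  # resample with replacement
        for i, t in enumerate(grid):
            day_utility = sum(UTIL[outcome(r, a, t)] for r, a in admitted)
            totals[i] += day_utility / n     # per-admission utility for the day
        days += 1
    means = [total / days for total in totals]
    return max(means[:-1]) - means[-1]
```

Evaluating every threshold on the same simulated days, as done here, keeps the comparison against Treat Nobody free of resampling noise between policies.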

Impact of external factors

Our next objective is to explore how various factors impact the achieved utility using the best case as a reference. The factors explored are listed in Table 2.

Table 2.

Simulation parameters to explore the impact of external factors

Parameter	Description	Default value	Range
Rejection rate	Fraction of patients for whom ACP is not possible for nonclinical reasons	0.1	0.1, 0.2, 0.3
Daily capacity	Daily capacity to carry out ACP	3	1, 2, 3, 4, 5
Mean time to complete ACP	Mean time in days to complete ACP, parameter to exponential distribution	2	1, 2, 3, 4
Outpatient pathway rescue rate	Rate of successful ACP by outpatient pathway	NA	0, 0.25, 0.50, 0.75, 1.0


Rejection for nonclinical reasons

Often, a patient recommended by a model is a valid candidate for ACP with respect to clinical prognosis, but has declined prior attempts at goals-of-care conversations. We simulated this factor by flipping a weighted coin for each true positive patient in the best case simulation. The weighted coin represents a Bernoulli trial with some probability that ACP will be rejected for nonclinical reasons. We varied the probability of rejection as 0.1, 0.2, and 0.3. We normalized the maximum utility achieved across all possible thresholds against the utility of the status quo ante to yield a net utility as before and compared against the net utility of the best case. We present all results as a fraction of the best-case utility to avoid tying the analysis to specific dollar values and to ensure that, if utility estimates are available in standard units of the amount of unwanted care avoided, or increase in patient comfort, the approach can still be used.
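The weighted coin flip can be sketched as one Bernoulli trial per true positive. The function name is hypothetical, and treating a rejected patient as receiving no ACP is an assumption, since the text does not state how rejected cases are scored.

```python
import random

def apply_rejection(n_true_pos, reject_prob, rng):
    """Flip a weighted coin per true positive; return (accepted, rejected) counts."""
    rejected = sum(rng.random() < reject_prob for _ in range(n_true_pos))
    return n_true_pos - rejected, rejected

# Example: one simulated day with 4 true positives and a 10% rejection rate.
accepted, rejected = apply_rejection(4, 0.1, random.Random(0))
```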

Limited capacity for ACP

Another frequent situation is that a patient is a valid candidate, but the clinician has limited time during the day to initiate ACP (ie, the “crazy day in the clinic”). We simulated this factor as a hard constraint on the capacity of the General Medicine service to intervene. We proceeded similarly to the best case simulation except that for each day we capped the number of true positives at some fixed limit (1, 2, 3, 4, or 5; Table 2). Any true positives in excess of this limit were counted as false negatives. As before, we compared the net maximum utility under each scenario against that of the best case.
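The hard capacity constraint amounts to a cap on the day's true positives, with the overflow reclassified. A minimal sketch, with a hypothetical function name:

```python
def apply_capacity(n_true_pos, n_false_neg, daily_capacity):
    """Cap the day's true positives; the excess is counted as false negatives."""
    served = min(n_true_pos, daily_capacity)
    overflow = n_true_pos - served
    return served, n_false_neg + overflow
```

For example, with 5 true positives, 1 false negative, and a capacity of 3, the day is scored as 3 true positives and 3 false negatives.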

Failure to complete ACP due to early discharge

A significant fraction (57%) of hospital stays is 3 days or less, while ACP may take several days due to logistical constraints, such as scheduling time with family members. For each true positive patient, we sampled the time required to complete ACP from an exponential distribution with different means, representing scenarios in which it takes 1, 2, 3, or 4 days on average to complete ACP. If the sampled time to completion exceeds the true length of stay for the patient, we count the patient as a false negative from failure to complete ACP due to early discharge. As before, we compared the net maximum utility under each scenario against that of the best case.
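The early-discharge check can be sketched with the standard library's exponential sampler, which takes a rate (the reciprocal of the mean). The function names are hypothetical.

```python
import random

def completes_before_discharge(length_of_stay_days, mean_days_to_complete, rng):
    """Sample the time needed for ACP and compare it with the patient's length of stay."""
    time_needed = rng.expovariate(1.0 / mean_days_to_complete)
    return time_needed <= length_of_stay_days

def completion_rate(length_of_stay_days, mean_days_to_complete, n_draws, seed=0):
    """Monte Carlo estimate of the fraction of patients whose ACP is completed in time."""
    rng = random.Random(seed)
    done = sum(
        completes_before_discharge(length_of_stay_days, mean_days_to_complete, rng)
        for _ in range(n_draws)
    )
    return done / n_draws
```

For a fixed length of stay L and mean completion time m, the exponential assumption gives a completion probability of 1 − exp(−L/m), which the Monte Carlo estimate should approach.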

Effect of an outpatient ACP pathway

An outpatient care pathway for ACP may significantly mitigate decreases in achieved benefit from factors such as capacity constraints or failure to complete ACP due to discharge. To simulate this, we flipped a weighted coin for each patient who was moved from the true positive to the false negative category due to a capacity constraint or early discharge. This coin flip represents the probability that the outpatient ACP pathway will be successful in serving the patient. We varied this probability across 0, 0.5, and 1.0, while keeping rejection for nonclinical reasons at 10%, a capacity constraint of 3 ACP interventions daily, and a mean time to completion of 2 days. As before, we compared the net maximum utility under each scenario against that of the best case.
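The rescue step can be sketched as a second coin flip applied only to patients lost to capacity or early discharge; patients rejected for nonclinical reasons are not eligible. The function name and argument layout are hypothetical.

```python
import random

def apply_outpatient_rescue(missed_capacity, missed_discharge, rescue_prob, rng):
    """Return (rescued, still_missed) among true positives lost to capacity or discharge."""
    eligible = missed_capacity + missed_discharge
    rescued = sum(rng.random() < rescue_prob for _ in range(eligible))
    return rescued, eligible - rescued
```

Rescued patients are moved back to the true positive category before the day's utility is computed; those still missed remain false negatives.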

Trade-offs between inpatient capacity vs outpatient ACP

Of the failure modes analyzed, we have the least control over rejection for nonclinical reasons. Similarly, it is not reasonable to extend the length of stay solely to complete ACP, especially given that length of stay is a closely watched quality metric. Therefore, the 2 factors that are most amenable to modification are capacity to offer ACP in the inpatient setting, and serving patients via an outpatient ACP pathway. We therefore explored the relative efficiency of increasing inpatient capacity versus improving “rescue” via an outpatient pathway. We simulated 5000 days at each combination of a range of inpatient capacities (1–5) and rescue rates (0%, 50%, and 100%). We then calculated the net maximum utility under each scenario and compared the net maximum utility gained by increasing inpatient capacity by 1 versus improving the outpatient pathway.

Analysis of utility ranges and effects of work capacity limits

Our simulation analysis used fixed utility values for the true positives, false negatives, false positives, and true negatives. These values might differ at different sites, and in some cases it may not be possible to obtain exact values. Therefore, to analyze the effect of choosing different utilities, each of the 4 entries in the utility matrix was expanded symmetrically by 10% to form the lower and upper bounds of utility ranges. For each combination of lower and upper bound of all 4 ranges, we computed the maximum expected utility for acting on the model’s predictions, producing 2^4, or 16 possible maxima. Separately, we constructed a graphical representation showing the relationship between choosing different classifier probability thresholds, the expected utility, and the total number of predicted positives, both true as well as false. Each positive prediction is a patient who will be offered an ACP consult. We refer to the total number of positives on whom to follow up as “work.”
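The enumeration of the 16 utility combinations and the definition of "work" can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and `operating_points` stands in for the (recall, specificity) pairs the model attains across thresholds, which are not given in the text.

```python
from itertools import product

BASE = {"tp": -28_613, "fn": -37_085, "fp": -14_970, "tn": -11_646}  # Table 1

def utility_corners(base, spread=0.10):
    """All 2^4 combinations of the two range bounds for the 4 utilities."""
    keys = list(base)
    bounds = [(base[k] * (1 - spread), base[k] * (1 + spread)) for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*bounds)]

def expected_utility(u, prevalence, recall, specificity):
    """Expected per-patient utility at one operating point."""
    p, r, s = prevalence, recall, specificity
    return p * (r * u["tp"] + (1 - r) * u["fn"]) + (1 - p) * ((1 - s) * u["fp"] + s * u["tn"])

def max_expected_utility(u, prevalence, operating_points):
    """Best expected utility over the (recall, specificity) pairs supplied."""
    return max(expected_utility(u, prevalence, r, s) for r, s in operating_points)

def work_fraction(prevalence, recall, specificity):
    """Fraction of patients flagged (true plus false positives), ie, the 'work'."""
    return prevalence * recall + (1 - prevalence) * (1 - specificity)
```

Calling `max_expected_utility` once per element of `utility_corners(BASE)` yields the 16 maxima whose spread gives the utility range.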

RESULTS

We performed a series of simulation studies to evaluate the impact of multiple factors on the net utility of a workflow for ACP triggered by a 12-month mortality prediction. The factors analyzed were: (1) ineligibility of candidates due to nonclinical reasons, (2) limited capacity to offer ACP, (3) failure to complete ACP due to early discharge, and (4) availability of an outpatient pathway to follow up on missed cases. We also evaluated the relative benefits of increasing capacity for outpatient ACP versus increasing inpatient ACP capacity. We found that work capacity constraints and failure to complete ACP due to early discharge can significantly reduce the benefit. The impact of these factors can be mitigated by an alternative pathway for offering ACP in the outpatient setting. Given limited resources to either add capacity for conducting inpatient ACP or develop the outpatient pathway, the latter is likely to provide more benefit to patients.

Impact of external factors on the realized benefit

Rejection for nonclinical reasons

Figure 1a shows the fraction of the best-case per-patient utility achieved as the rate of rejection for nonclinical reasons varies from 0% to 30%. We see a decrease in the per-patient utility at all thresholds. In the best-case scenario (no rejections), we would have seen an average utility of −$17 701 per patient. This best case sets the maximum achievable benefit, and ‘not offering ACP’ sets the zero baseline on the y-axis of the figure. At a 10% rejection rate, we achieve the majority of the ‘best case’ utility, with a linear decline as more patients are rejected for nonclinical reasons. At a 30% rejection rate, we achieve just short of 75% of the best case. We present all results as a fraction of the best-case utility to avoid tying the analysis to specific dollar values and to ensure that, if utility estimates are available in standard units of the amount of unwanted care avoided, or increase in patient comfort, the approach can still be used.

Figure 1.

The figure summarizes the effect of different factors on the realized net utility of triggering a care workflow based on a predictive model for 1-year mortality. In all plots the y-axis shows the achieved net utility relative to the best case, labeled as ‘optimistic.’ The default state of treating nobody is the 0 point on the y-axis. The achieved utility is plotted as a percentage of the best-case scenario, in which every prediction is followed up by ACP. We also plot the relative net utility of treating everybody (Treat all) for comparison. A. Impact of rejection of recommendations for ACP for nonclinical reasons. The x-axis shows the rate of rejection of ACP due to nonclinical factors, ranging from 10% to 30%. The rejection rate translates to a linear reduction in net utility. B. Impact of capacity constraints on per-patient utility. The x-axis shows different capacity constraints for conducting ACP. Capacity constraints have a large impact on net utility, with a capacity of 1 capturing close to 50% of the utility of the “best case.” Increasing capacity offers rapidly diminishing returns because there are few days when more than 4 patients are recommended for ACP. C. Impact of failure to complete ACP due to discharge on per-patient utility. The x-axis shows the average number of days it takes to complete ACP. The relative net benefit ranges from 92% to 62.5% of the best-case estimate as the mean time to complete ACP ranges from 1 to 4 days. D. Impact of an outpatient rescue pathway on per-patient utility. The x-axis shows the effect of rescuing 0%, 50%, and 100% of the model’s recommendations. Without rescue, the net utility is 65% of the optimistic estimate. At 50% rescue, we achieve 76% of the optimistic estimate. At 100% rescue, we achieve 90.5% of the best-case scenario because the outpatient rescue pathway cannot rescue ACP rejected for nonclinical reasons.

Limited capacity for ACP

The General Medicine service takes care of complex patients, and clinicians often do not have time to perform ACP. As described in the methods, we simulated the impact of this as a limit on the number of patients for whom ACP can be done each day. The results are shown in Figure 1b. We see that having a work capacity of just 1 ACP per day captures just shy of half of the best case achieved benefit. Increasing the ACP capacity yields rapidly diminishing returns, with very little difference at a capacity of 5 compared to the “best case” scenario. This is because there are very few days in which we can expect to need ACP for 4 or more patients.

Failure to complete ACP due to early discharge

Because the ACP intervention takes time to perform, clinicians may not be able to complete ACP before the patient is discharged. We simulated this loss to discharge as a random process as described in the methods. Figure 1c shows that failure to complete ACP due to discharge may have a significant impact on achieved utility. If it takes on average 2 days to complete ACP, the savings relative to the best case drop to about 75%. If it takes 4 days on average to complete ACP, the net benefit drops to 62.5% of the best case.

Effect of an outpatient ACP pathway

Based on our results, capacity constraints and failure to complete ACP due to discharge could significantly reduce the benefit of a prediction-triggered ACP workflow. Therefore, we explored the potential impact of having a pathway for performing ACP in an outpatient setting, post-discharge. We fixed a 10% rate of rejection due to external factors, along with a capacity constraint of 3 ACP per day, and a mean time to complete ACP of 2 days. We varied the rate at which the outpatient pathway successfully completes the recommended ACP across 0%, 50%, and 100%. Figure 1d shows the impact of the proposed outpatient pathway. Without outpatient rescue, the maximum achievable benefit is 66% of the best case, which is significantly down from the near-best-case seen with a 100% rescue rate. However, even a 50% rescue rate recovers a significant fraction of the best case net benefit.

Trade-offs between inpatient capacity vs outpatient ACP pathway

Of the factors we examine in this study, the modifiable ones are capacity for inpatient ACP and creation of an outpatient ACP pathway. The relative benefit of these alternatives matters because they are likely to have significantly different set-up and operating costs. We examine the relative benefit of increasing inpatient ACP capacity by 1 patient, versus investing in an outpatient pathway for ACP, at various starting levels of inpatient ACP capacity. Figure 2 shows the change in mean per-patient utility as we increase inpatient capacity starting with different initial inpatient capacities (solid red line). The figure also shows the change in mean per-patient utility with an outpatient pathway for ACP with 25% to 100% success rates. We find that no matter what the starting inpatient capacity is, an outpatient pathway with a 50% success rate (dashed green line) results in greater savings than increasing inpatient ACP capacity.

Figure 2.

Trade-off between adding inpatient capacity for ACP versus outpatient capacity. The plot shows the change in mean per patient utility as we increment inpatient capacity starting from different initial inpatient capacity (solid red line). The dashed lines show the change in mean patient utility for having an outpatient pathway for ACP with 50% and 100% success rates. We find that at all starting inpatient capacities, an outpatient pathway with even a 50% success rate results in greater utility than adding to inpatient capacity.

Analysis of utility ranges and effects of work capacity limits

The analysis thus far used fixed utility values, which can be hard to obtain or simply differ by site. Therefore, as described in the methods, each of the 4 entries in the utility matrix was expanded by +/− 10% to form utility ranges. Examining all possible combinations of utility ranges gives a range of −$15 800 to −$19 300 for the maximum expected per-patient utility. Whatever the value of the maximum expected per-patient utility may be, there exists a trade-off between per-patient utility and work capacity, as illustrated in Figure 3. The x-axis is the probability threshold on the output from the classifier, ranging from 1.0 down to 0.0. Depending on the chosen threshold, patients with a predicted probability higher than the threshold will be followed up on. The y-axis is the expected per-patient utility calculated using the 4 utility values from Table 1. The boxed numbers, located at regularly spaced probability thresholds, are the patients to follow up on (ie, work), expressed as a percentage of the total patients for whom a prediction is made.
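The construction of utility ranges can be sketched as follows. This is a minimal illustration under stated assumptions: the confusion-matrix rates and the 4 utility values below are hypothetical placeholders (they are not the values from Table 1), and the function names are introduced only for this example. For a fixed threshold, expected per-patient utility is linear in each utility entry, so the extremes over the +/− 10% ranges occur at the corner combinations that the code enumerates.

```python
from itertools import product


def expected_utility(rates, utilities):
    """Expected per-patient utility: sum over outcomes of rate times utility."""
    return sum(rates[k] * utilities[k] for k in rates)


def utility_range(rates, utilities, rel=0.10):
    """Min and max expected utility when each entry varies by +/- rel."""
    keys = list(utilities)
    values = []
    # Enumerate all 2**4 corner combinations of the perturbed utility matrix.
    for signs in product((-1, 1), repeat=len(keys)):
        perturbed = {k: utilities[k] * (1 + s * rel) for k, s in zip(keys, signs)}
        values.append(expected_utility(rates, perturbed))
    return min(values), max(values)


# Hypothetical outcome rates at one threshold (they sum to 1).
rates = {"tp": 0.10, "fp": 0.10, "fn": 0.05, "tn": 0.75}
# Placeholder utilities in dollars, NOT the values from Table 1.
utilities = {"tp": -10_000.0, "fp": -20_000.0, "fn": -30_000.0, "tn": -15_000.0}

low, high = utility_range(rates, utilities)
print(low, expected_utility(rates, utilities), high)
```

Repeating this calculation at each probability threshold and taking the maximum over thresholds would give a range for the maximum expected per-patient utility, in the spirit of the analysis described above.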

Figure 3.

Unit (per-patient) utility versus the probability threshold at which a patient is referred for follow up. The boxed numbers are the number of patients to follow up with (true positive and false positive), or “work” at that threshold, expressed as a percentage. Work increases as more patients are referred for ACP consultation. There is a tension between maximizing total utility, which is the product of per-patient utility and the number of patients acted upon, and keeping the number of patients followed up below the hospital system’s work capacity limit. For our model, the number of new cases for which predictions would be made corresponds to the new admissions per day to the Gen Med service (mean 9.4 patients, std. dev. 3.4). In this situation, a work capacity of 3 (which is 23.4% of the mean + one std. dev. of daily admits) is not close to the maximum per-patient utility, reinforcing the need to consider alternatives, such as an outpatient ACP pathway.
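The 23.4% figure quoted above follows directly from the stated mean and standard deviation of daily admissions, as the arithmetic below shows:

```latex
\[
\frac{\text{work capacity}}{\text{mean} + 1\ \text{std. dev.}}
  = \frac{3}{9.4 + 3.4}
  = \frac{3}{12.8}
  \approx 0.234 = 23.4\%
\]
```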

DISCUSSION

Despite considerable improvements in measures of performance of predictive models in healthcare, there have been relatively few successes using such models to provide better clinical outcomes at lower costs. We believe that this is in part because the translation of modeling advances into improvements in clinical care requires integrating the model’s output into complex human workflows that are separate from model performance. If the actions taken in response to a predictive model are embedded in such a system, the ability to make predictions and improvements in predictive accuracy alone are not sufficient to improve care. We note that successful uses of machine learning in healthcare typically require considerable integration efforts after model development is done. For example, the majority of published efforts regarding the sepsis early warning system (EWS) deployed at Kaiser Permanente Northern California focus on understanding existing processes and then managing the iterative refinement and dissemination of new workflows supported by the EWS model. This experience reinforces the lesson learned from decades of quality improvement efforts: clinical care takes place in a complex, adaptive social environment, and changing processes within that environment is not amenable to purely technical interventions. In this study of a predictive model for mortality used to recommend patients for ACP, we identified several workflow factors that limit the benefit of setting up a model-triggered care workflow, such as limited capacity for ACP during inpatient admissions, and early discharge before ACP can be completed. These factors partially explain the critiques that models are proliferating but concrete benefit is scant. Our analyses used simulations examining the net utility of workflows triggered by the model to elucidate the impact of workflow factors.
These analysis techniques are not new, but have not received sufficient attention in the machine learning-for-healthcare community. We believe it is time to adopt methods from health delivery science and health services research to provide honest evaluations of machine learning guided interventions in healthcare and to develop a delivery science for artificial intelligence interventions in healthcare. It is natural to ask where such analyses fit into the development and deployment path of predictive models in the clinic. We adopt an idealized 4-stage framework, presented in Figure 4, for such projects. The first stage focuses on clear definitions of both the modeling problem (ie, what is the prediction target, how is it defined, when does the prediction happen, and what data are available at that time) and, just as important, the intervention presumably triggered by the model’s output, along with a clear articulation of the desired benefits sought. The second stage consists of development and technical validation of the prediction model; this is the focus of the bulk of currently published work in the machine learning for health community. The third stage comprises careful, iterative development of the clinical workflow associated with the model: what is the set of actions to be undertaken in response to a prediction? The final stage consists of monitoring and maintenance of a model and associated workflows, ideally followed by a prospective trial to demonstrate efficacy.

Figure 4.

A 4-stage framework guiding the development and evaluation of a predictive model throughout its life cycle. The stages are: 1) problem specification and clarification, 2) development and validation of the model, 3) analysis of utility and impacts on the clinical workflow that is triggered by the model, and 4) monitoring and maintenance of the deployed model as well as evaluation of the running system comprised of the model-triggered workflow.

Analyses such as the one presented here are for guiding stage 3 (green) of the process in Figure 4. In this stage, the model itself takes a back seat, and we focus on where it fits into the clinical workflow and how to alter that workflow to achieve the desired benefit. Such analyses, when used in conjunction with traditional methodologies for change management and implementation science, can provide valuable guidance on where to focus efforts and resources in order to close the gap between best case estimates of benefit and the realized benefit. Such analyses have limitations. First and foremost is that it can be very difficult to estimate true cost and thus calculate utilities, and our results are contingent on the utilities being accurate. Cost benefit analyses are very often quite sensitive to the utilities, and different values may lead to quite different conclusions about the relative importance of the different healthcare delivery factors. In this work, we have used postdischarge healthcare expenditures as a stand-in for utility; however, we acknowledge that this is to illustrate the need for such analysis in the absence of a standard quantification mechanism for the amount of unwanted care avoided, or increase in patient comfort.
We note that the ultimate purpose of ACP is to provide care that is concordant with the expressed values and wishes of patients, that this provides benefits for providers who seek to provide the best care possible for their patients, and that this “true goal” is not easily measured and not captured by downstream healthcare spending. Therefore, cost as a measure, while limited, is still directionally useful, and we present results as a fraction of the best case utility to avoid tying the analysis to specific dollar values. A second significant limitation of this study is that we used a single expert’s response to the question, “Would you be surprised if this patient passed away in the next 12 months?” as a source for our ground-truth labels. This mirrors one of the key criteria that are supposed to trigger ACP in current workflows, but it is possible that this expert may be incorrect in their judgement, introducing bias into the evaluations. Finally, our simulations are not sufficiently fine-grained to accurately capture the subtleties of all the factors in play. These limitations notwithstanding, we believe that the analysis presented is a useful template for how to obtain insight into the net effect of a proposed predictive model paired with a clinical intervention.

CONCLUSION

We analyzed the impact of factors in healthcare delivery on the realized benefit of using a predictive model for 12-month mortality to identify patients for ACP. The analyses use simulations to identify factors that have a large impact on the achieved benefit of using the model to trigger an intervention. Factors included nonclinical reasons that make ACP inappropriate, limited capacity for ACP, inability to follow up due to patient discharge, and availability of an outpatient workflow to follow up on missed cases. The resulting estimates of the impact of these factors can guide allocation of resources to mitigate reductions in achieved benefit. We argue that routine use of such analyses of the sensitivity of the net benefit to various healthcare delivery factors is necessary for translation of advances in predictive modeling into real-world clinical benefit.

FUNDING

The work was supported by the Stanford Medicine Department of Medicine, an endowment from Debra and Mark Leslie, and innovations funds from Stanford Healthcare.

AUTHOR CONTRIBUTIONS

KJ and NHS conceived the project and drafted the manuscript. KJ wrote the code and set up the simulation analysis. SK, AA, KS, JJ, YV, and TS wrote code for data extraction, model training, and evaluation. SH, HS, RL, and MS provided domain expertise on the healthcare delivery factors to consider in the analysis and their plausible ranges. SCB analyzed the effect of work capacity limits. All authors have read and approved the manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.
