Automated physician order recommendations and outcome predictions by data-mining electronic medical records.

Abstract

The meaningful use of electronic medical records (EMR) will come from effective clinical decision support (CDS) applied to physician orders, the concrete manifestation of clinical decision making. CDS development is currently limited by a top-down approach, requiring manual production and limited end-user awareness. A statistical data-mining alternative automatically extracts expertise as association statistics from structured EMR data (>5.4M data elements from >19K inpatient encounters). This powers an order recommendation system analogous to commercial systems (e.g., Amazon.com's "Customers who bought this…"). Compared to a standard benchmark, the association method improves order prediction precision from 26% to 37% (p<0.01). Introducing an inverse frequency weighted recall metric demonstrates a quantifiable improvement from 3% to 17% (p<0.01) in recommending more specifically relevant orders. The system also predicts clinical outcomes, such as 30 day mortality and 1 week ICU intervention, with ROC AUC of 0.88 and 0.78 respectively, comparable to state-of-the-art prognosis scores.

Year: 2014 PMID: 25717414 PMCID: PMC4333710

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction

Electronic medical records (EMR) can improve patient safety and healthcare cost efficiency, but that depends on meaningful use of the data [1]. This will require effective clinical decision support (CDS) content, particularly to drive clinical orders (labs, imaging, medications, etc.), the concrete manifestation of clinical decision making. Order sets, risk scores, and similar CDS constructs help reinforce consistency and compliance with best practices [2,3], but their conventional development is limited by a top-down approach. This approach requires manual production of CDS content, feasible for only a limited number of common scenarios, and often with limited end-user awareness [4]. With the progressive digitization of clinical data in EMRs, a Big Data [5,6] approach can instead crowd-source clinical expertise from the bottom up by data-mining EMRs. Such an approach could continuously “learn” in real time by streaming accumulating EMR records into data-driven models of clinical expertise, even as it is simultaneously applied to patient care with direct EMR integration.

Background

Prior work in automated CDS content development includes association rules and Bayesian networks between orders and diagnoses, and review of possible order set and corollary order content by subject experts [7–10]. With inspiration from analogous problems of information retrieval in recommender systems, collaborative filtering, market basket analysis, and natural language processing, we initiated an item association order recommendation framework [11] analogous to Netflix or Amazon.com’s “Customers who bought A also bought B” system [12]. Here we update our initial efforts with a much larger dataset that includes non-order data to better define a patient’s clinical context, propose an alternative evaluation metric to identify recommendation methods that highlight items specifically relevant to a given clinical scenario, and use the framework to predict clinical outcomes.

Methods

Deidentified, structured patient data from inpatient hospitalizations at Stanford University Hospital in 2011 was extracted by the STRIDE project [13]. Extracted data covers patient encounters starting from their initial (emergency room) presentation until hospital discharge. With >19K distinct patients, the data consists of >5.4M instances of >17K distinct clinical items, with patients, instances, and items respectively analogous to documents, words, and vocabulary items. The clinical items include >3,500 medication, >1,000 laboratory, >800 imaging, and >700 nursing orders. Non-order items include >1,000 lab results, >5,800 problem list entries, >3,400 admission diagnosis ICD9 codes, and patient demographics on age, gender, and date of death. Numerical data was binned into categorical data, particularly lab results, based on “abnormal” flags as established by the clinical laboratory. The ICD9 coding hierarchy was collapsed as necessary into diagnosis codes with a significant number of instances. The relationship between item instances covered and the top clinical items considered is consistent with the “80/20 rule” in the form of a power law distribution [14]. This property allows one to ignore most clinical items with minimal information loss. In this case, ignoring sparsely populated clinical items with <256 instances each (<0.005% of all instances) reduces the effective item count from >17K to 1.5K (9%), while only reducing item instance coverage from 5.4M to 5.1M (94%). Computational efficiency of subsequent order recommendations improves significantly with this simplification, given methods requiring O(m^2) space and O(q * m log m) time complexity, where m is the number of clinical items considered and q is the number of query items for a recommendation.
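
The pruning of sparsely populated items can be illustrated with a minimal sketch. The function name `prune_sparse_items`, the `(patient_id, item_id)` input layout, and the toy data are assumptions for illustration only and are not taken from the study's implementation.

```python
from collections import Counter


def prune_sparse_items(instances, min_count):
    """Drop clinical items that have fewer than min_count instances.

    instances: iterable of (patient_id, item_id) pairs, repeats allowed
               (hypothetical layout, assumed non-empty).
    Returns (kept_items, fraction_of_items_kept, fraction_of_instances_covered).
    """
    counts = Counter(item for _, item in instances)
    kept = {item for item, n in counts.items() if n >= min_count}
    total_instances = sum(counts.values())
    covered = sum(n for item, n in counts.items() if item in kept)
    return kept, len(kept) / len(counts), covered / total_instances


# Toy data: two common items and three rare ones.
toy = (
    [(p, "CBC") for p in range(50)]
    + [(p, "IV SALINE") for p in range(40)]
    + [(1, "RARE_A"), (2, "RARE_B"), (3, "RARE_C"), (4, "RARE_C")]
)
kept, item_fraction, coverage = prune_sparse_items(toy, min_count=10)
print(sorted(kept), item_fraction, round(coverage, 3))
```

In the toy data, dropping three of five items still covers 90 of 94 instances, the same qualitative trade-off the power law distribution permits at full scale.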
A pre-computation step collects frequency statistics on clinical item instance co-occurrences from a training set of 16,408 randomly selected patients to build an item association matrix, based on the definitions in Table 1. These statistics drive subsequent recommendations by approximating Bayesian conditional probabilities as in Table 2.

Table 1

Pre-computed frequency statistics for clinical items. Counting repeats allowed.

Notation	Definition
n_A	Number of occurrences of order A
n_ABt	Number of occurrences of order B following an order A within time t
N	Total number of patients
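
A rough sketch of how the Table 1 counts might be collected from timestamped events follows. The source does not spell out exactly how repeats are counted, so this sketch makes an assumption: each occurrence of A contributes at most once to n_AB for a given B, if B occurs strictly later and within the time window. The function name and the event layout are likewise hypothetical.

```python
from collections import defaultdict


def count_associations(patients, window_hours):
    """Collect n_A, n_AB(t) and N from per-patient event lists.

    patients: dict of patient_id -> list of (time_in_hours, item) events
              (hypothetical layout).
    Assumption: an occurrence of A counts once toward n_AB if some B
    follows it (strictly later) within window_hours.
    """
    n_item = defaultdict(int)
    n_pair = defaultdict(int)
    for events in patients.values():
        events = sorted(events)  # chronological order
        for i, (t_a, a) in enumerate(events):
            n_item[a] += 1
            followers = set()
            for t_b, b in events[i + 1:]:
                if t_b - t_a > window_hours:
                    break  # later events are outside the window
                if t_b > t_a and b != a:
                    followers.add(b)
            for b in followers:
                n_pair[(a, b)] += 1
    return dict(n_item), dict(n_pair), len(patients)


toy_patients = {
    "p1": [(0, "GI HEMORRHAGE"), (1, "TYPE AND SCREEN"),
           (2, "PANTOPRAZOLE IV"), (30, "UPPER GI ENDOSCOPY")],
    "p2": [(0, "GI HEMORRHAGE"), (3, "TYPE AND SCREEN")],
    "p3": [(0, "CHEST PAIN"), (1, "TYPE AND SCREEN")],
}
n_item, n_pair, n_patients = count_associations(toy_patients, window_hours=24)
print(n_pair[("GI HEMORRHAGE", "TYPE AND SCREEN")], n_patients)
```

With a 24 hour window, the endoscopy at hour 30 is not counted as following the admission diagnosis, which is how the time span t restricts associations.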

Table 2

Bayesian probability estimates based on item frequency statistics.

Probability	Estimate	Notation / Notes
P(A)	n_A / N	BaselineFreq(A)
P(AB)	n_AB / N	n_AB (“Support”) only counts directed association where A occurs before B
P(B|A) = P(AB) / P(A)	n_AB / n_A	ConditionalFreq(B|A) (“Confidence”). Frequency of B, given A
P(B|A) / P(B) = P(AB) / (P(A)*P(B))	(n_AB / n_A) / (n_B / N)	FreqRatio(B|A). Estimates likelihood ratio. Expect = 1 if A and B occur independently
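
The Table 2 estimates reduce to simple ratios of counts. The sketch below uses hypothetical function names and made-up counts purely to show the arithmetic, including the expectation that FreqRatio equals 1 when A and B occur independently.

```python
def baseline_freq(n_a, n_patients):
    """P(A) estimate: n_A / N."""
    return n_a / n_patients


def conditional_freq(n_ab, n_a):
    """P(B|A) estimate ("confidence"): n_AB / n_A."""
    return n_ab / n_a


def freq_ratio(n_ab, n_a, n_b, n_patients):
    """P(B|A) / P(B) estimate: (n_AB / n_A) / (n_B / N)."""
    return conditional_freq(n_ab, n_a) / baseline_freq(n_b, n_patients)


# Hypothetical counts, not taken from the study data.
N, n_A, n_B = 1000, 50, 200
associated = freq_ratio(n_ab=30, n_a=n_A, n_b=n_B, n_patients=N)
independent = freq_ratio(n_ab=10, n_a=n_A, n_b=n_B, n_patients=N)
print(conditional_freq(30, n_A), associated, independent)
```

Here n_AB = 10 is exactly n_A * n_B / N, the count expected under independence, so the ratio comes out to 1.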

To generate order recommendations from the above association statistics, query clinical items (A1,…,Aq) are used to select item association pairs from the pre-computed association matrix for all possible target orders (B1,…,Bm). Target orders are ranked by a score such as ConditionalFreq(Bj|Ai), the maximum likelihood estimator for the probability of order Bj occurring after query item Ai. As previously noted [11], ranking by ConditionalFreq identifies likely orders, but also tends to yield non-specific orders (e.g., CBC, IV saline) that are common overall, yet not necessarily “interesting.” To identify orders more significantly relevant to the query, recommendations are ranked or filtered by FreqRatio(B|A), comparable to the TF*IDF (term frequency * inverse document frequency) information retrieval concept [15]. To quantify the significance of item associations, -2 log FreqRatio can approximate a chi-square statistic [15], or the chi-square statistic can be directly calculated by comparing observed vs. expected occurrence counts. Issues with misinterpreting association strengths in the setting of inadequate data (heuristics advise at least 5 occurrences to be reliable [15]) are mitigated by excluding rare items occurring <0.005% of the time as previously described.
Given q query items, the above method generates q scored lists of all m possible orders. These are aggregated into a single scored recommendation list by taking a weighted average of the component scores, weighted inversely proportional to their respective query item baseline frequencies (lending more weight to less common, more specific query items). Unweighted score averaging and a Naïve Bayes [15] style composite product of the component conditional probabilities (i.e., conditional frequencies) were also attempted, though the weighted average method was retained as it yielded the best results.
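
The weighted average aggregation can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function name `recommend`, the dictionary layout of the counts, the choice to skip query items as targets, and the toy counts are all assumptions.

```python
def recommend(query_items, n_item, n_pair, n_patients, top_k=10):
    """Rank targets by a weighted average of ConditionalFreq(B|A_i).

    Each query item A_i is weighted by 1 / BaselineFreq(A_i) = N / n_{A_i},
    so rarer (more specific) query items count for more.
    Assumption: query items themselves are not recommended back.
    """
    weights = {a: n_patients / n_item[a] for a in query_items if n_item.get(a)}
    total_weight = sum(weights.values())
    if total_weight == 0:
        return []
    targets = {b for (a, b) in n_pair if a in weights and b not in weights}
    scores = {}
    for b in targets:
        weighted_sum = sum(
            w * n_pair.get((a, b), 0) / n_item[a] for a, w in weights.items()
        )
        scores[b] = weighted_sum / total_weight
    # Highest score first; item name breaks ties deterministically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical counts: a rare, specific query item and a common one.
toy_n_item = {"GI HEMORRHAGE": 10, "CBC": 800,
              "PANTOPRAZOLE IV": 400, "IV SALINE": 700}
toy_n_pair = {
    ("GI HEMORRHAGE", "PANTOPRAZOLE IV"): 8,
    ("GI HEMORRHAGE", "IV SALINE"): 5,
    ("CBC", "PANTOPRAZOLE IV"): 320,
    ("CBC", "IV SALINE"): 640,
}
ranked = recommend(["GI HEMORRHAGE", "CBC"], toy_n_item, toy_n_pair, 1000)
print(ranked)
```

With these toy counts an unweighted average would rank IV saline first (0.65 vs. 0.60), whereas the inverse frequency weighting lets the rarer GI hemorrhage query item dominate and ranks pantoprazole first.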
While there is no well accepted notion of recommendation quality, accuracy in predicting subsequent items is the most commonly measured, with precision (positive predictive value) and recall (sensitivity) correlating with end-user satisfaction [16]. A test set of 1,903 patients was randomly selected, separate from the training set. For each test patient, all clinical items from the first 4 hours of their hospital encounter were used (average of 29) to query for 10 recommended orders that were compared against the actual subsequent orders within the first 24 hours (average of 15). To quantitatively recognize recommenders that yield results that are more meaningfully relevant to a query and not simply common, we introduce the alternative metrics of inverse frequency weighted precision and recall, based on the following function definition: TP(i) = {1 if recommended item i is a true positive, 0 if not}. Likewise FP(i) for false positives and FN(i) for false negatives. The inverse frequency weighted precision and recall metrics are defined below in summation notation, with components weighted by the inverse baseline frequency of each item i, where BaselineFreq(i) = n_i / N:

$$\mathrm{WeightedPrecision} = \frac{\sum_i TP(i)\,(N/n_i)}{\sum_i \left[TP(i) + FP(i)\right](N/n_i)}, \qquad \mathrm{WeightedRecall} = \frac{\sum_i TP(i)\,(N/n_i)}{\sum_i \left[TP(i) + FN(i)\right](N/n_i)}$$

Note that the common constant factor N can be cancelled out to yield:

$$\mathrm{WeightedPrecision} = \frac{\sum_i TP(i)/n_i}{\sum_i \left[TP(i) + FP(i)\right]/n_i}, \qquad \mathrm{WeightedRecall} = \frac{\sum_i TP(i)/n_i}{\sum_i \left[TP(i) + FN(i)\right]/n_i}$$

The association framework was also applied towards “recommending” non-order items to predict outcomes such as patient death and ICU intervention. For the latter, a composite “AnyICU” clinical item was defined as the occurrence of interventions including mechanical ventilation, vasopressor infusion (epinephrine, norepinephrine, dopamine, phenylephrine, vasopressin, dobutamine), or continuous renal replacement therapy (CRRT). Taking 1,905 test patients separate from the training set, their first 24 hours of clinical items were used to query the association model for the probability (ConditionalFreq(B|A)t) of an outcome event within t time (30 days for death, 1 week for AnyICU), and these probabilities were compared against actual event rates by receiver operating characteristic (ROC) analysis.
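
The inverse frequency weighted precision and recall, with each item i weighted by 1 / n_i, can be computed as in the following sketch. The function name and the toy item counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the source specifies.

```python
def weighted_precision_recall(recommended, actual, n_item):
    """Inverse frequency weighted precision and recall.

    Each item i contributes weight 1 / n_i, so correctly predicting a
    rare item counts for more than predicting a common one.
    Assumption: an empty denominator yields 0.0.
    """
    recommended, actual = set(recommended), set(actual)
    weight = {i: 1.0 / n_item[i] for i in recommended | actual}
    tp = sum(weight[i] for i in recommended & actual)  # true positives
    fp = sum(weight[i] for i in recommended - actual)  # false positives
    fn = sum(weight[i] for i in actual - recommended)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical occurrence counts n_i for four items.
toy_counts = {"CBC": 1000, "TYPE AND SCREEN": 500,
              "PANTOPRAZOLE IV INFUSION": 20, "OCTREOTIDE INFUSION": 10}
w_precision, w_recall = weighted_precision_recall(
    recommended=["CBC", "PANTOPRAZOLE IV INFUSION", "TYPE AND SCREEN"],
    actual=["CBC", "PANTOPRAZOLE IV INFUSION", "OCTREOTIDE INFUSION"],
    n_item=toy_counts,
)
print(round(w_precision, 3), round(w_recall, 3))
```

Unweighted precision and recall are both 2/3 in this toy case. The weighted versions diverge because the false positive is a common item (small penalty) while the missed item is rare (large penalty).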

Results

Table 3 illustrates example order recommendations. Table 4 reports accuracy metrics for different recommendation methods, illustrating the trends toward the best results. Table 5 reports the ROC area-under-curve (AUC) prediction accuracy for outcomes of 30 day mortality and 1 week use of AnyICU. Table 6 illustrates an inverted query example, identifying items commonly preceding an outcome event.

Table 3

Example orders recommended when queried by admitting diagnosis of GI Hemorrhage, ranked by ConditionalFreq(B|A)day and filtering out those with FreqRatio(B|A)day <1. Example interpretation: Given a GI Hemorrhage, 75% of patients receive IV Pantoprazole (standard initial treatment for an acute GI bleed) within 24 hours. This is somewhat more likely (FreqRatio 1.8) than for all patients in general, though even the baseline of 42% is relatively common as IV Pantoprazole is used for non-GI bleed scenarios (e.g., prophylaxis against stress ulcers). For comparison, the Pantoprazole IV continuous infusion is less common (51%), but has a higher relative likelihood (FreqRatio 16.0), as it is used almost exclusively in the treatment of GI bleeds.

Rank	Description	Conditional Freq	Baseline Freq	Freq Ratio	p
1	TYPE AND SCREEN	0.98	0.78	1.3	0.00
2	Pantoprazole (Intravenous)	0.75	0.42	1.8	0.00
3	TRANSFUSE RBC	0.55	0.52	1.1	0.20
4	PANTOPRAZOLE IV INFUSION	0.51	0.03	16.0	0.00
5	CONSULT MEDICINE	0.32	0.16	2.0	0.00
6	LIPASE	0.29	0.26	1.1	0.15
7	ISTAT TROPONIN I	0.28	0.28	1.0	0.96
8	CONSULT GASTROENTEROLOGY	0.22	0.03	8.6	0.00
9	UPPER GI ENDOSCOPY	0.21	0.08	2.8	0.00
10	ISTAT, VBG AND LACTATE	0.21	0.19	1.1	0.47
11	Oral Electrolyte Solution (Bowel Prep)	0.17	0.03	5.3	0.00
12	OCTREOTIDE INFUSION	0.17	0.01	11.7	0.00
13	TRANSFUSE FFP	0.16	0.16	1.0	0.91
14	Benzocaine+Tetracaine (Topical)	0.09	0.04	2.0	0.00
15	H. PYLORI AG, STOOL	0.08	0.02	4.9	0.00

Table 4

Average accuracy statistics for recommendation methods across 1,903 test patients, comparing 10 system-recommended orders vs. actual orders occurring within 24 hours. The ConditionalFreq ranked methods are subdivided by the time span t within which item associations are counted. The last pair of methods use the FreqRatio for filtering (excluding recommendations with FreqRatio <1) or ranking. Bolded entries represent the best value for each metric.

Ranking Method	Time Span	Ratio Filter	Recall	Precision	F1-Score	Weighted Recall	Weighted Precision	Weighted F1-Score
Random			1%	2%	1%	1%	1%	1%
BaselineFreq*			17%	26%	19%	3%	24%	4%
ConditionalFreq	Any		22%	31%	23%	5%	29%	6%
ConditionalFreq	Hour		19%	27%*	20%	5%	17%	6%
ConditionalFreq	Day		27%	37%	28%	7%	37%	9%
ConditionalFreq	Day	Yes	9%	17%	11%	15%	14%	12%
FreqRatio	Day		8%	12%	9%	17%	8%	10%

All metrics are compared against the BaselineFreq method as a benchmark, with all yielding p<0.01, except precision of the ConditionalFreq (1 Hour) method, having p = 0.08.

Table 5

ROC area-under-curve prediction metrics for 30 day mortality and 1 week requirement for ICU intervention (ventilator, vasopressor infusion, CRRT) based upon 1,905 test patients’ first 24 hours of query clinical items.

	Death	Any ICU
Evaluation period	30 days	1 week
Patients screened	1,905	1,905
Patients evaluated, excluding those with outcome occurring during 24 hour query period	1,898	1,765
Patients with outcome subsequently occurring during evaluation period	44 (2.3%)	55 (3.1%)
ROC AUC score for association prediction	0.88	0.78
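
For readers unfamiliar with the ROC AUC figure reported above, the following sketch computes it from its rank-based definition: the probability that a randomly chosen patient with the outcome receives a higher predicted score than a randomly chosen patient without it. The function name and toy scores are illustrative assumptions; the source does not state how its AUC values were computed.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positives and negatives.

    scores: predicted outcome probabilities (e.g. ConditionalFreq values).
    labels: 1 if the outcome occurred in the evaluation period, else 0.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: two patients with the outcome, three without.
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.1]
toy_labels = [1, 0, 1, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of patients, which is acceptable at the scale of roughly 1,900 test patients; a sort-based rank computation would be the usual choice for larger cohorts.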

Table 6

Inverted query example showing the top “recommendations” for items that occur prior to a query item of patient death, ranked by FreqRatio(B|A)week. This recognizes that many deaths are anticipated with a greater likelihood for ordering “Comfort Care Measures” and “Liberalize Visitation Policy,” representing reprioritization of care for patients with expected imminent death. Complementary to that are deaths preceded by aggressive life-supporting ICU interventions including vasopressors (norepinephrine), continuous renal replacement therapy (CRRT), and mechanical ventilation for ARDS (lung protective ventilation protocol). Inverse queries can appropriately “recommend” non-order items such as abnormal lab values as well, in this case recognizing that lactic acidosis (high lactic acid) and acidemia (low pH) disproportionately precede death.

Rank	Description	Conditional Freq	Baseline Freq	Freq Ratio	p
1	COMFORT CARE MEASURES	0.11	0.02	5.22	0.00
2	LIBERALIZE VISITATION POLICY	0.08	0.02	5.09	0.00
3	LACTIC ACID (High)	0.46	0.11	4.12	0.00
4	NOREPINEPHRINE IV INFUSION	0.15	0.04	3.84	0.00
5	CALCIUM CHLORIDE IV INFUSION	0.06	0.01	3.77	0.00
6	Citrate + Sodium Bicarbonate (CRRT)	0.05	0.01	3.68	0.00
7	CONSULT TO PALLIATIVE CARE	0.15	0.04	3.60	0.00
8	OSMOLALITY, SERUM (High)	0.07	0.02	3.55	0.00
9	pH Venous (Low)	0.23	0.06	3.51	0.00
10	LUNG PROTECTIVE VENTILATION	0.07	0.02	3.49	0.00

Discussion

The item association system developed above, analogous to commercial recommender systems, recommends physician orders and predicts clinical outcomes based on statistics data-mined from electronic medical records. As illustrated in Table 4, personalizing order recommendations with the ConditionalFreq ranking method improves accuracy compared to the standard BaselineFreq benchmark method that only functions as a general “best seller” list, recommending the overall most common orders, irrespective of query items. Demonstrated again is the importance of temporal information in order recommendation [11], with accuracy optimized when the association time span t is comparable to the evaluation time frame. Specifically, when predicting orders occurring within 24 hours of hospitalization, shorter time span filters (e.g., one hour) result in the recommender missing relevant associations for orders outside the filter time, while longer time span filters (e.g., any time) result in the recommender being distracted by associations that occur outside the relevant 24 hour evaluation period. Similarly, when predicting 30 day mortality and 1 week ICU intervention, the time span filters should optimally be adjusted to one month and one week, respectively. Qualitative examples in Table 3 indicate that FreqRatio based methods can provide more specifically relevant recommendations, but these approaches inherently perform worse by standard accuracy metrics, as confirmed in Table 4. While standard accuracy metrics favor common items, it is more impressive to correctly predict a rare item (e.g., pantoprazole infusion) than the relatively mundane correct prediction of a common item (e.g., Type & Screen). Alternative metrics, the inverse frequency weighted precision and recall, are introduced here to preferentially score prediction of uncommon items. Interestingly, the ConditionalFreq method that performs best on standard accuracy metrics still performs best by the weighted precision metric.
It is only for weighted recall that the FreqRatio based methods show improvement (3% to 17%, p<0.01). This reinforces the notion that the two approaches serve different purposes and can both be useful depending on the goals of the query. Table 5 reports the association framework’s ability to predict clinical outcomes with ROC AUC of 0.88 for 30 day mortality and 0.78 for requiring ICU intervention within 1 week of hospitalization. These are comparable to state-of-the-art prognosis scoring systems such as APACHE, MPM, and SAPS with scores ranging from 0.75 to 0.90 for predicting hospital mortality [17] and CURB-65, PSI, SCAP, and REA-ICU with scores ranging from 0.69 to 0.81 for predicting early ICU admission [18]. Other prediction possibilities could include hospital length of stay, readmissions, and many others, though the virtue of the framework is that it can predict any item labeled as an outcome event with minimal incremental effort in future work. While the FreqRatio based methods elaborated here help distinguish specifically relevant orders from those that are simply common, a primary concern with this method is favoring common practices that are not actually ideal. With preliminary results on predicting clinical outcomes above, a tempting possibility is to link recommendations to favorable outcomes instead of just prevalence, but ultimately this concern will only be proven or disproven by deploying these methods in a prospective clinical trial. Another general concern is that order recommenders may favor over-utilization by encouraging unnecessary orders. The framework can counter-balance this by recommending against uncommon orders, and future work will explore personalized prediction of lab result pre-test probabilities to recommend against lab tests unlikely to impact clinical care.
Another limitation of the current item association method is that it only considers pair-wise associations, thus querying with multiple items assumes independence between the query items. Incorporating more complex models such as Bayesian networks [8] is possible, but it is unclear whether significant accuracy would be gained in exchange for the lost computational efficiency of a simpler model. In closing, this represents another step in ongoing work towards mature clinical decision support systems that will unlock the Big Data potential of electronic medical records. A clinical order recommendation framework is enhanced here with additional non-order data to better define clinical contexts, reporting of significance statistics for individual recommendations to further aid interpretability, multiple evaluation metrics to discern common from specifically relevant items, and application towards predicting clinical outcomes.

15 in total

1. Ten commandments for effective clinical decision support: making the practice of evidence-based medicine a reality.

Authors: David W Bates; Gilad J Kuperman; Samuel Wang; Tejal Gandhi; Anne Kittler; Lynn Volk; Cynthia Spurr; Ramin Khorasani; Milenko Tanasijevic; Blackford Middleton
Journal: J Am Med Inform Assoc Date: 2003-08-04 Impact factor: 4.497

2. STRIDE--An integrated standards-based translational research informatics platform.

Authors: Henry J Lowe; Todd A Ferris; Penni M Hernandez; Susan C Weber
Journal: AMIA Annu Symp Proc Date: 2009-11-14

3. A recommendation algorithm for automating corollary order generation.

Authors: Jeffrey Klann; Gunther Schadow; J M McCoy
Journal: AMIA Annu Symp Proc Date: 2009-11-14

4. Automated development of order sets and corollary orders by data mining in an ambulatory computerized physician order entry system.

Authors: Adam Wright; Dean F Sittig
Journal: AMIA Annu Symp Proc Date: 2006

5. Big data meets the electronic medical record: a commentary on "identifying patients at increased risk for unplanned readmission".

Authors: Greg de Lissovoy
Journal: Med Care Date: 2013-09 Impact factor: 2.983

6. A randomized trial of "corollary orders" to prevent errors of omission.

Authors: J M Overhage; W M Tierney; X H Zhou; C J McDonald
Journal: J Am Med Inform Assoc Date: 1997 Sep-Oct Impact factor: 4.497

7. Health information technology: standards, implementation specifications, and certification criteria for electronic health record technology, 2014 edition; revisions to the permanent certification program for health information technology. Final rule.

Authors:
Journal: Fed Regist Date: 2012-09-04

8. A method to compute treatment suggestions from local order entry data.

Authors: Jeffrey Klann; Gunther Schadow; Stephen M Downs
Journal: AMIA Annu Symp Proc Date: 2010-11-13

9. Distribution of Problems, Medications and Lab Results in Electronic Health Records: The Pareto Principle at Work.

Authors: Adam Wright; David W Bates
Journal: Appl Clin Inform Date: 2010 Impact factor: 2.342

10. Mining for clinical expertise in (undocumented) order sets to power an order suggestion system.

Authors: Jonathan H Chen; Russ B Altman
Journal: AMIA Jt Summits Transl Sci Proc Date: 2013-03-18
