
Data-Mining Electronic Medical Records for Clinical Order Recommendations: Wisdom of the Crowd or Tyranny of the Mob?

Jonathan H Chen, Russ B Altman

Abstract

Uncertainty and variability are pervasive in medical decision making, with insufficient evidence-based medicine and inconsistent implementation where established knowledge exists. Clinical decision support constructs like order sets help distribute expertise, but are constrained by knowledge-based development. We previously produced a data-driven order recommender system to automatically generate clinical decision support content from structured electronic medical record data on >19K hospital patients. We now present the first structured validation of such automatically generated content against an objective external standard by assessing how well the generated recommendations correspond to orders referenced as appropriate in clinical practice guidelines. For example scenarios of chest pain, gastrointestinal hemorrhage, and pneumonia in hospital patients, the automated method identifies guideline reference orders with ROC AUCs (c-statistics) (0.89, 0.95, 0.83) that improve upon statistical prevalence benchmarks (0.76, 0.74, 0.73) and pre-existing human-expert authored order sets (0.81, 0.77, 0.73) (P<10⁻³⁰ in all cases). We demonstrate that data-driven, automatically generated clinical decision support content can reproduce and optimize top-down constructs like order sets while largely avoiding inappropriate and irrelevant recommendations. This will be even more important when extrapolating to more typical clinical scenarios where well-defined external standards and decision support do not exist.

Year:  2015        PMID: 26306281      PMCID: PMC4525236     

Source DB:  PubMed          Journal:  AMIA Jt Summits Transl Sci Proc


Introduction

Medical decision making is fraught with uncertainty, reflected in a lack of evidence to confirm or deny the efficacy of a third of surgeries to place pacemakers or ear tubes1 and over forty percent of recommendations for the management of cardiac disease2. Evidence-based medicine seeks to fill these gaps, but even with disruptive reforms3, the expanding breadth and evolving complexity of medical practice ensures that high-quality prospective data will perpetually lag behind the need to answer clinical questions. Even when high-quality evidence is available, inconsistency in distribution and implementation can result in wide practice variability, such as a quarter of patients with a heart attack not receiving aspirin4. Clinical decision support (CDS) constructs such as order sets and templates reinforce consistency and compliance with best practices by systematically distributing expertise5,6, but their development is limited by a top-down, knowledge-based approach. This approach requires manual production and maintenance that is only feasible for a limited number of common scenarios7. The progressive adoption of electronic medical records (EMR) creates the opportunity for a Big Data8 approach of crowd-sourcing clinical expertise from the bottom-up, tapping into the collective experience of many practitioners to automatically generate CDS content in a learning health system9–11.

Background

Prior work in automated CDS content development includes reports of association rules and Bayesian networks relating orders and diagnoses, as well as unsupervised clustering of clinical orders12–15. We developed an item-based association framework to generate clinical order recommendations16 analogous to Netflix or Amazon.com's "Customers who bought A also bought B" system17, extracting the expertise hidden in the patterns of clinical orders (e.g., labs, medications, imaging) that concretely manifest clinical decision making. This recommender predicts real clinical orders (improving precision at ten recommendations from a baseline of 26% to 37%) as well as clinical outcomes such as mortality and intensive care unit interventions (c-statistics of 0.88 and 0.78, respectively), comparable with state-of-the-art prognosis scoring systems18. Despite these encouraging results, a recurring concern for automatically generated CDS content is that common clinical decisions derived from the wisdom of the crowd19 do not entail "good" or appropriate decisions. Validating the quality of recommender systems is challenging as there is not a well-defined or generally accepted definition of a "good" recommendation20. We previously demonstrated an internal validation by predicting clinician behavior and clinical outcomes18, while others have assessed coverage of manually authored order sets15 and qualitative assessment by a couple of clinician reviewers12,13. Limited evaluations exist against objective external standards. Here we contribute a structured evaluation of automated order recommendations against the external standard of clinical practice guidelines, representing the appropriate standard of care for sample scenarios including chest pain, gastrointestinal hemorrhage (GI bleed), and pneumonia in hospital patients.
While we would not expect automated methods to discover new medical knowledge for these scenarios beyond what already exists in their reference guidelines, we hypothesize that automated methods will reproduce best practice patterns consistent with clinical guidelines. More importantly, they will automatically produce that content in the form of executable electronic orders, and can be applied for scenarios where reference guidelines do not exist.

Methods

As described previously18, we extracted patient data deidentified of protected health information for inpatient hospitalizations at Stanford University Hospital in 2011 from the STRIDE clinical data warehouse21. The structured data covers patient encounters from their initial (emergency room) presentation until hospital discharge, including >19K distinct patients with >5.4M instances of >17K distinct clinical items. Clinical items include medication, laboratory, imaging, and nursing orders, as well as non-order items for lab results, ICD9 diagnosis codes, and patient demographics. Medications were normalized by active ingredient based on RxNorm22. Applying the “80/20 rule”23, rarely used clinical items were removed from consideration to reduce the effective item count from >17K to the top 1.5K (9%), while still covering 5.1M (94%) of the item instances. Furthermore, infrastructure orders that are commonly part of treatment processes or admission protocols (e.g., vital signs, notify MD, regular diet, patient transport, and all other nursing and PRN medication orders) were excluded as they rarely reflect meaningful clinical decisions. These exclusions left 811 clinical order items as candidates for recommendation. To develop an external reference standard for order quality, we evaluated clinical practice guidelines from the National Guideline Clearinghouse (http://www.guideline.gov) that inform the management of selected hospital ICD9 admission code groups for chest pain24,25, GI Bleed26–29, and pneumonia30,31. Hospital admission scenarios were selected as they are “order-dense” enough for an order recommender to demonstrate measurable utility, while the specific diagnoses were selected based on the existence of relevant guidelines and a significant quantity of clinical data examples. 
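The "80/20" pruning step described above can be sketched in Python (an illustrative reconstruction, not the authors' code; the function name and coverage threshold are our assumptions): rank items by frequency and keep the head of the distribution that covers the bulk of the item instances.

```python
from collections import Counter

def filter_frequent_items(item_instances, coverage_target=0.94):
    """Keep the most frequent items until they cover the target
    fraction of all item instances (Pareto-style pruning).
    Hypothetical helper; threshold mirrors the ~94% coverage in the text."""
    counts = Counter(item_instances)
    total = sum(counts.values())
    kept, covered = [], 0
    for item, n in counts.most_common():
        if covered / total >= coverage_target:
            break
        kept.append(item)
        covered += n
    return set(kept)

# Toy example: a few common orders dominate the instance counts,
# so a small head of the distribution covers most instances.
instances = ["cbc"] * 50 + ["bmp"] * 30 + ["ekg"] * 15 + ["rare_a"] * 3 + ["rare_b"] * 2
kept = filter_frequent_items(instances, coverage_target=0.9)  # keeps {"cbc", "bmp", "ekg"}
```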
The 811 candidate clinical orders were labeled as “guideline reference orders” based on whether a guideline explicitly mentioned them as appropriate to consider (e.g., treating pneumonia with levofloxacin), or implied them (e.g., bowel preps and NPO diet orders are implicitly necessary to fulfill explicitly recommended endoscopy procedures for GI Bleeds). Given the non-specific nature of admission diagnoses, separate guideline recommendations for management of ulcerative, variceal, and lower GI bleeding were included as the specific etiology is generally unknown at the time of hospital admission. Similarly, while we included chest pain guideline recommendations focused on empiric management and diagnosis, we excluded recommendations prescribing treatment for confirmed diagnoses (e.g., clopidogrel for heart attack, NSAIDs and colchicine for pericarditis) as such cases presumably would have been admitted under the specific diagnosis, rather than the undifferentiated “chest pain” syndrome. Automated order recommendations were generated from past clinician behavior by our previously described methods16. Specifically, based on Amazon’s product recommender17, an intensive pre-computation step collects frequency statistics for all clinical item instances and co-occurrences on a randomly selected training set of 15,629 patients to build a time-stratified item association matrix18. Counting by patients affords a natural interpretation of 2×2 contingency tables for pairs of items, from which various association statistics can be derived (e.g., odds ratio (OR), relative risk (RR), positive predictive value (PPV), sensitivity, baseline prevalence, and Fisher’s P-value)32. The order recommender used admission diagnoses as query items to generate lists of all candidate clinical order items occurring within the order-dense first 24 hours of each admission, score-ranked by one of the association statistics. 
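The patient-level 2×2 counting described above can be illustrated with a short Python sketch (not the authors' implementation; the function name and example counts are hypothetical). Given a contingency table for a query item (e.g., an admission diagnosis) and a candidate order, it derives the association statistics named in the text, with a one-sided Fisher's exact p-value computed from the hypergeometric distribution; nonzero marginal cells are assumed.

```python
from math import comb

def contingency_stats(a, b, c, d):
    """Association statistics for a patient-level 2x2 table:
    a = patients with both query item and order, b = query only,
    c = order only, d = neither. Assumes nonzero margins."""
    n = a + b + c + d
    stats = {
        "prevalence": (a + c) / n,      # pre-test probability of the order
        "ppv": a / (a + b),             # post-test probability given the query
        "sensitivity": a / (a + c),
        "odds_ratio": (a * d) / (b * c) if b * c else float("inf"),
        "relative_risk": (a / (a + b)) / (c / (c + d)),
    }
    # One-sided Fisher's exact p-value (enrichment): sum hypergeometric
    # probabilities for tables at least as extreme, margins held fixed.
    r1, col1 = a + b, a + c
    stats["p_fisher"] = sum(
        comb(r1, k) * comb(n - r1, col1 - k)
        for k in range(a, min(r1, col1) + 1)
        if col1 - k <= n - r1
    ) / comb(n, col1)
    return stats

# Hypothetical counts: 100 patients with the query diagnosis, 40 of whom
# received the candidate order, out of 1000 patients total.
s = contingency_stats(40, 60, 100, 800)  # ppv 0.40, odds ratio ~5.33
```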
These score-ranked order item lists were evaluated against the guideline reference orders by receiver operating characteristic (ROC) analysis and recommendation accuracy in terms of precision (PPV) and recall (sensitivity) when considering only the top K recommendations. We labeled the 811 candidate clinical order items as "order set items" based on inclusion in relevant pre-authored order sets available in the hospital EMR. Incorporating these as benchmarks is especially important, as their prior availability could result in the automated methods simply relearning the existing order sets. A key distinction is that the automated methods provide fully score-ranked lists of candidate items, whereas conventional order sets generally present all n order set items (up to 102 in the case of chest pain) without a ranking method to convey the relative importance of items (default selections and submenus differentiated <25% of order set items). Consequently, using "order set item" labels to infer guideline reference orders effectively yields only two possible score-ranks for candidate items, resulting in a single discrete point on the ROC "curve" and an arbitrary ranking of items within order sets for the accuracy vs. top K items plots.
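The evaluation metrics above can be sketched as follows (illustrative Python, not from the paper): the c-statistic is computed rank-wise as the probability that a guideline reference order is scored above a non-reference order, and precision/recall are taken over the top K score-ranked candidates.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: probability that a positive (guideline reference)
    item outscores a negative item, counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_at_k(scores, labels, k):
    """Precision and recall over the top-K score-ranked items."""
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    hits = sum(ranked[:k])
    return hits / k, hits / sum(labels)

# Hypothetical recommender scores for five candidate orders;
# label 1 marks a guideline reference order.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)                     # 5/6
prec, rec = precision_recall_at_k(scores, labels, 2)  # 0.5, 0.5
```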

Results

Table 1 reports summary counts of patient information available, guideline reference orders, and pre-authored order set items for each of the admission diagnoses considered. Table 2 contains recommendation examples for the chest pain admission diagnosis with association statistics and reference labels. Figure 1 depicts ROC curves assessing discrimination of guideline reference orders. Figure 2 depicts recommendation accuracy as the number of top K recommendations considered increases, illustrating the tradeoff between precision and recall, and performance for more practical small values of K.
Table 1.

– Admission diagnoses evaluated, number of patients in training dataset, number of candidate orders referenced as appropriate in clinical practice guidelines, number of candidate orders available in pre-authored order sets, and the intersection between the latter two.

Admission Diagnosis (ICD9) | Training Patients | Guideline Reference Orders | Order Set Items | Guideline Reference Orders in Order Set
GI Bleed (578)             | 282 | 38 | 51  | 22
Chest Pain (786.5)         | 433 | 32 | 102 | 23
Pneumonia (486)            | 206 | 51 | 42  | 25
Table 2.

– Example top order recommendations occurring within 24 hours of admission diagnosis of Chest Pain (ICD9: 786.5), sorted by odds ratio (OR). Additional metrics include prevalence (pretest probability), positive predictive value (post-test probability), and P-value by Fisher’s exact test. Binary labels are assigned if the order exists in pre-authored order sets (order set item) or clinical practice guidelines (guideline reference order).

Item Description          | Prevalence | PPV   | OR   | P-Fisher | Order Set / Guideline
POC Troponin I            | 16.3%      | 71.4% | 14.4 | 1.6E-148 | 1 / 1
EKG 12-Lead               | 51.8%      | 92.8% | 12.7 | 8.0E-80  | 1 / 1
Nitroglycerin (Sublingual)| 1.1%       | 9.2%  | 11.2 | 2.3E-25  | 0 / 1
Consult Cardiology        | 4.6%       | 28.4% | 9.6  | 8.2E-64  | 1 / 1
D-Dimer (ELISA)           | 1.4%       | 9.7%  | 9.4  | 5.2E-24  | 1 / 0
Aspirin (Oral)            | 24.4%      | 68.4% | 7.2  | 2.4E-85  | 1 / 1
CK-MB                     | 15.7%      | 51.3% | 6.1  | 1.5E-68  | 1 / 0
Troponin I                | 23.8%      | 62.4% | 5.6  | 3.3E-67  | 1 / 1
Clopidogrel (Oral)        | 5.6%       | 20.6% | 4.7  | 1.5E-27  | 1 / 0
Cardiac Catheterization   | 2.6%       | 9.9%  | 4.4  | 6.7E-14  | 1 / 1
Heparin Activity Level    | 6.1%       | 18.9% | 3.8  | 1.7E-20  | 0 / 0
Lipid Panel w/Direct LDL  | 8.5%       | 24.5% | 3.7  | 4.6E-24  | 1 / 0
NT-proBNP                 | 10.0%      | 26.3% | 3.4  | 6.4E-23  | 1 / 1
Nitroglycerin (Topical)   | 2.2%       | 6.2%  | 3.2  | 8.1E-07  | 1 / 1
Dobutamine Stress Echo    | 1.2%       | 3.2%  | 3.0  | 6.2E-04  | 1 / 1
Figure 1.

– Receiver operating characteristic (ROC) curves for predicting clinical practice guideline reference orders based on automated recommender methods using different score-ranking options (PPV, OR, baseline prevalence, and presence in pre-authored order sets). Area-under-curve (AUC) reported as c-statistics with 95% confidence intervals empirically estimated by sampling items with replacement 1000 times.
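The empirical confidence-interval procedure in the caption, resampling items with replacement and recomputing the c-statistic, can be sketched as follows (hypothetical Python; the percentile method and function names are our assumptions):

```python
import random

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) CI for AUC by resampling items with
    replacement n_boot times and taking percentile bounds."""
    def auc(sc, lb):
        pos = [s for s, y in zip(sc, lb) if y]
        neg = [s for s, y in zip(sc, lb) if not y]
        if not pos or not neg:
            return None  # degenerate resample: one class missing
        wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
        return wins / (len(pos) * len(neg))

    rng = random.Random(seed)
    items = list(zip(scores, labels))
    aucs = []
    while len(aucs) < n_boot:
        sample = [rng.choice(items) for _ in items]
        a = auc(*zip(*sample))
        if a is not None:
            aucs.append(a)
    aucs.sort()
    return aucs[int(alpha / 2 * n_boot)], aucs[int((1 - alpha / 2) * n_boot) - 1]
```

With perfectly separated toy scores the interval collapses to (1.0, 1.0); on real data the bounds spread to reflect sampling variability in the item set.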

Figure 2.

– Recommender accuracy (precision or recall) for predicting guideline reference orders as a function of the number of top K recommendations considered (up to 100) when sorting by different score-ranking options (OR, PPV, prevalence, and presence in pre-authored order sets). Data labels added for K = 10 and n, where n = Number of items available in the respective order sets.

Discussion

Figure 1 illustrates c-statistics (AUC) when using automated recommendations to identify orders referenced in clinical practice guidelines, improving upon the benchmarks set by pre-authored order sets and baseline prevalence (pre-test probability) (P<10⁻³⁰ by Fisher's exact test in all cases). Figure 2 illustrates a similar trend for recommendation precision at small values of K recommendations considered, which is more relevant to a satisfactory end-user experience. These results support the hypothesis that automated methods for generating executable decision support content from prior behavior can reproduce recommendations consistent with standards of care, while avoiding inappropriate or irrelevant recommendations. However, the primary limitation is that this validation is only demonstrated for this sample selection of admission diagnosis scenarios. Given that the pre-authored order sets were available to clinicians during the training data period, the direct incremental value of the example recommendations presented is limited, especially if they only reproduced existing order sets. The results above show that recommendations that incorporate real clinician behavior patterns do further optimize order validity beyond the pre-existing order set benchmarks, but the more important value of this work comes from extrapolating to more typical clinical scenarios that are too specific or complex, such that clinical practice guidelines and pre-authored order sets do not apply or perhaps do not even exist (e.g., management of an admission diagnosis of "altered mental status" (ICD9: 780.97), or a patient with a combination of medication orders for furosemide, spironolactone, and lactulose).
In such cases where reference standards for high-quality orders do not exist, automated learning methods still provide data-driven recommendations based on other practitioners' experience, with the example cases analyzed here providing confidence in the quality of those automated recommendations. A key issue for decision support quality and safety is the complementary goal of not recommending "inappropriate" orders. For our example cases, guidelines recommend against the routine use of IV hydrocortisone (stress dose steroids) and filgrastim (granulocyte colony stimulating factor) in pneumonia and Factor VIIa in GI bleeding. The automated order recommendations appropriately score these orders with low odds ratios <1 and PPVs <3%. CK-MB for chest pain is the notable exception where automated recommendations endorse an order which guidelines explicitly reference as inappropriate. CK-MB is a cardiac biomarker for heart attacks largely made obsolete by more accurate troponin testing, but the practice patterns at this hospital (revealed by the recommender statistics) demonstrate the routine habit of ordering CK-MB (likely exacerbated by the inclusion of CK-MB in the pre-authored chest pain order set!), perpetuating a debate as to whether CK-MB tests still provide any value33. A major limitation of this study approach is the inherent complexity of medicine, with patient-to-patient variability and insufficient prospective evidence that defy the existence of a gold standard for general clinical decision-making quality. Even for scenarios where clinical practice guidelines exist as a reference standard, guidelines often offer deliberately vague, conditional, and sometimes conflicting recommendations.
For example, multiple guidelines recommended providing appropriate counseling for patients and performing hemodynamic stability assessment with resuscitative stabilization, yet the former does not map to concrete actions and the latter is left to the clinician to interpret and fulfill. Some guideline recommendations, such as Streptococcus antigen testing for pneumonia and terlipressin for esophageal variceal bleeding, are not routinely available in the hospital evaluated (the latter is not even FDA approved in the US, with the guideline produced in the UK). One chest pain guideline specifically recommended against the use of stress EKG testing and natriuretic peptides in the diagnostic workup for hospitalized patients with chest pain, while another specifically referenced both as reasonable options. The complexity and uncertainty of medical decision-making requires clinicians to accumulate expert knowledge through extensive experiential learning, yet they rarely document their expertise in any formal manner. Information retrieval methods will enable the systematic extraction and dissemination of this undocumented collective wisdom by translating the underlying clinical data into a reproducible and executable form of expertise, unlocking the Big Data potential of electronic medical records.
References (23 in total; 10 shown)

1.  Ten commandments for effective clinical decision support: making the practice of evidence-based medicine a reality.

Authors:  David W Bates; Gilad J Kuperman; Samuel Wang; Tejal Gandhi; Anne Kittler; Lynn Volk; Cynthia Spurr; Ramin Khorasani; Milenko Tanasijevic; Blackford Middleton
Journal:  J Am Med Inform Assoc       Date:  2003-08-04

2.  STRIDE--An integrated standards-based translational research informatics platform.

Authors:  Henry J Lowe; Todd A Ferris; Penni M Hernandez; Susan C Weber
Journal:  AMIA Annu Symp Proc       Date:  2009-11-14

3.  Infectious Diseases Society of America/American Thoracic Society consensus guidelines on the management of community-acquired pneumonia in adults.

Authors:  Lionel A Mandell; Richard G Wunderink; Antonio Anzueto; John G Bartlett; G Douglas Campbell; Nathan C Dean; Scott F Dowell; Thomas M File; Daniel M Musher; Michael S Niederman; Antonio Torres; Cynthia G Whitney
Journal:  Clin Infect Dis       Date:  2007-03-01

4.  Scientific evidence underlying the ACC/AHA clinical practice guidelines.

Authors:  Pierluigi Tricoci; Joseph M Allen; Judith M Kramer; Robert M Califf; Sidney C Smith
Journal:  JAMA       Date:  2009-02-25

5.  A method to compute treatment suggestions from local order entry data.

Authors:  Jeffrey Klann; Gunther Schadow; Stephen M Downs
Journal:  AMIA Annu Symp Proc       Date:  2010-11-13

6.  Paving the COWpath: data-driven design of pediatric order sets.

Authors:  Yiye Zhang; Rema Padman; James E Levin
Journal:  J Am Med Inform Assoc       Date:  2014-03-27

7.  Distribution of Problems, Medications and Lab Results in Electronic Health Records: The Pareto Principle at Work.

Authors:  Adam Wright; David W Bates
Journal:  Appl Clin Inform       Date:  2010

8.  Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system.

Authors:  Harlan M Krumholz
Journal:  Health Aff (Millwood)       Date:  2014-07

9.  NICE guidance. Chest pain of recent onset: assessment and diagnosis of recent onset chest pain or discomfort of suspected cardiac origin.

Authors:  Jane S Skinner; Liam Smeeth; Jason M Kendall; Philip C Adams; Adam Timmis
Journal:  Heart       Date:  2010-06

10.  Mining for clinical expertise in (undocumented) order sets to power an order suggestion system.

Authors:  Jonathan H Chen; Russ B Altman
Journal:  AMIA Jt Summits Transl Sci Proc       Date:  2013-03-18