Hossein Estiri, Kari Stephens.
Abstract
Data variability is a commonly observed phenomenon in Electronic Health Records (EHR) data networks. A common question in scientific investigations of EHR data is whether cross-site and cross-time variability reflects an underlying data quality error at one or more contributing sites or actual differences driven by idiosyncrasies of the healthcare settings. Although research analysts and data scientists commonly use statistical methods to detect and account for variability in analytic datasets, self-service tools for exploring cross-organizational variability in EHR data warehouses are lacking and could benefit from meaningful data visualizations. DQe-v, an interactive, database-agnostic tool for visually exploring variability in EHR data, provides such a solution. DQe-v is built on an open source platform, R statistical software, with annotated scripts and a readme document that make it fully reproducible. To illustrate DQe-v's functionality, we describe its readme document, which includes a complete guide to installation, running the program, and interpreting the outputs. We also provide annotated R scripts and an example dataset as supplemental materials. DQe-v offers a self-service tool to visually explore data variability within EHR datasets irrespective of the data model. GitHub and CIELO host and distribute the tool and can facilitate collaboration across any interested community of users as we target improving usability, efficiency, and interoperability.
Keywords: Data Quality; Data Variability; Data Warehouse; Electronic Health Records
Year: 2017 PMID: 29930954 PMCID: PMC5994933 DOI: 10.13063/2327-9214.1277
Source DB: PubMed Journal: EGEMS (Wash DC) ISSN: 2327-9214
Figure 1. Workflow for DQe-v
Excerpt of Input Data from the Provided Example Data
| U_LOC | U_TIME | U_COND | POPULATION | PATIENT | PREVALENCE |
|---|---|---|---|---|---|
| LOC_P | 2010 | Condition_F | 807 | 298 | 0.369 |
| LOC_V | 2009 | Condition_F | 5456 | 1411 | 0.259 |
| LOC_Y | 1903 | Condition_C | 21514 | 46 | 0.002 |
| LOC_Y | 1950 | Condition_C | 21514 | 46 | 0.002 |
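The PREVALENCE column in the excerpt is consistent with PATIENT divided by POPULATION, rounded to three decimal places. The following minimal Python sketch checks that relationship on the first three rows of the excerpt. The `prevalence` helper and the rounding rule are assumptions made for illustration and are not taken from DQe-v's R scripts.

```python
# Rows copied from the first three lines of the excerpt table; column names follow the table.
rows = [
    {"U_LOC": "LOC_P", "U_TIME": 2010, "U_COND": "Condition_F", "POPULATION": 807, "PATIENT": 298},
    {"U_LOC": "LOC_V", "U_TIME": 2009, "U_COND": "Condition_F", "POPULATION": 5456, "PATIENT": 1411},
    {"U_LOC": "LOC_Y", "U_TIME": 1903, "U_COND": "Condition_C", "POPULATION": 21514, "PATIENT": 46},
]


def prevalence(patient, population):
    """Assumed definition: PATIENT / POPULATION, rounded to 3 decimal places."""
    if population <= 0:
        raise ValueError("POPULATION must be positive")
    return round(patient / population, 3)


# Add the derived column and print one line per site, time unit, and condition.
for row in rows:
    row["PREVALENCE"] = prevalence(row["PATIENT"], row["POPULATION"])
    print(row["U_LOC"], row["U_TIME"], row["U_COND"], row["PREVALENCE"])
```

The computed values match the PREVALENCE column shown in the excerpt (0.369, 0.259, and 0.002).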
DQe-v’s Four Tabs and Their Functionality
| TAB NAME | OUTPUT DESCRIPTION |
|---|---|
| Variability Preview | An overall visualization of data distribution and variability over time |
| Exploratory Analysis | Interactive visualizations to explore high-low variability |
| Density Plots | Visualization of probability density functions |
| Regression-Based Analysis | Predictive analytics to recommend anomalous site locations and times |
Figure 2. The Variability Preview Tab Previewing Diagnoses Per Visit Per Year
Figure 3. The Exploratory Analysis Tab’s Plots Help Identify Years and Site Locations with High Variability in the Number of Diagnoses Per Visit
Note: The yellow box highlights a period with relatively high variability.
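This excerpt does not describe how DQe-v computes variability. One simple way to rank time units by cross-site variability is the coefficient of variation (standard deviation divided by mean) of a measure across sites within each time unit. The Python sketch below illustrates that idea only. The records, site names, and the `variability_by_time` helper are hypothetical and do not come from the DQe-v example data or scripts.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (site, time unit, value) records; NOT taken from the DQe-v example data.
records = [
    ("LOC_A", 2009, 0.25), ("LOC_B", 2009, 0.26), ("LOC_C", 2009, 0.24),
    ("LOC_A", 2010, 0.10), ("LOC_B", 2010, 0.37), ("LOC_C", 2010, 0.25),
]


def variability_by_time(records):
    """Return {time unit: coefficient of variation across sites}."""
    by_time = defaultdict(list)
    for _site, time_unit, value in records:
        by_time[time_unit].append(value)
    # Skip time units whose mean is not positive to avoid division by zero.
    return {t: pstdev(v) / mean(v) for t, v in by_time.items() if mean(v) > 0}


cv = variability_by_time(records)
most_variable = max(cv, key=cv.get)  # time unit with the largest cross-site spread
print(most_variable, round(cv[most_variable], 2))
```

A high value for one time unit only signals that sites disagree in that period. As the abstract notes, deciding whether the disagreement is a data quality error or a real difference between healthcare settings still requires follow-up.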
Figure 4. Visualization of Probability Density Functions in Density Plots Tab for Number of Creatinine Labs Per Patient with a CKD Diagnosis and 1+ Visits
Note: Colors signify time units.
Figure 5. Outputs of the Regression-Based Analysis on Hemoglobin A1c Labs Per Patient with a Diabetes Diagnosis and 1+ Visits. (Site-location Names Are Obstructed.)