Oliwier Dziadkowiec, Tiffany Callahan, Mustafa Ozkaynak, Blaine Reeder, John Welton.
Abstract
OBJECTIVES: We examine (1) the appropriateness of using a data quality (DQ) framework developed for relational databases as a data-cleaning tool for a data set extracted from two EPIC databases, and (2) the differences in statistical parameter estimates between a data set cleaned with the DQ framework and a data set not cleaned with it.
Keywords: Applied Statistics; Data Quality; Electronic Health Records; Relational Databases
Year: 2016 PMID: 27429992 PMCID: PMC4933574 DOI: 10.13063/2327-9214.1201
Source DB: PubMed Journal: EGEMS (Wash DC) ISSN: 2327-9214
Figure 1. Data Set Preparation Phases
Practical Guide to Examining EHR Data Sets Based on Kahn et al. (2012)
| Accuracy and response validity | Coding and recoding checks and frequency analysis. |
| Missing data | Missing data analysis. |
| Between-database consistency | Compare patient IDs after a merge or compare the same patient ID on demographic variables. |
| Between-site consistency | Compare results of merging data sets (by comparing primary keys or patient IDs) between sites. |
| Time interval coding | Make sure that the time intervals are coded in the same units for all records and capture the desired time frame. |
| Time stamps | Check that time stamps fall in expected intervals (weekly or monthly) and don’t exceed a preestablished frequency. |
| Event sequences per person and within a site | Make sure that the first event time occurs before the last event time. |
| Sequence timing by event | Make sure that events have appropriate concurrent event times. | |
| Qualifying events | Check that events that depend on a previous event (for example, a treatment that follows a certain diagnosis) make sense. |
| Dependent events | Find chief complaint variable and compare to the first ED event frequency |
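Three of the checks in the table above (missing data, between-database consistency, and event sequences) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the record layout, the field names (`patient_id`, `sex`, `arrival`, `departure`), the helper functions, and the toy records are all hypothetical.

```python
from datetime import datetime


def missing_counts(records, fields):
    """Missing data: count records where each field is None or empty."""
    return {f: sum(1 for r in records if r.get(f) in (None, "")) for f in fields}


def id_mismatches(db_a, db_b, field):
    """Between-database consistency: patient IDs present in both
    databases whose value for `field` disagrees."""
    a = {r["patient_id"]: r.get(field) for r in db_a}
    b = {r["patient_id"]: r.get(field) for r in db_b}
    return sorted(pid for pid in a.keys() & b.keys() if a[pid] != b[pid])


def out_of_order(records, first, last):
    """Event sequences: patient IDs of records whose `last` time
    precedes their `first` time (records missing either time are skipped)."""
    return [
        r["patient_id"]
        for r in records
        if r.get(first) and r.get(last) and r[last] < r[first]
    ]


# Made-up records for illustration only (hypothetical field names).
db_a = [
    {"patient_id": 1, "sex": "F",
     "arrival": datetime(2014, 1, 1, 8, 0), "departure": datetime(2014, 1, 1, 12, 0)},
    {"patient_id": 2, "sex": "M",
     "arrival": datetime(2014, 1, 2, 9, 0), "departure": datetime(2014, 1, 2, 7, 0)},
    {"patient_id": 3, "sex": None,
     "arrival": datetime(2014, 1, 3, 10, 0), "departure": None},
]
db_b = [
    {"patient_id": 1, "sex": "F"},
    {"patient_id": 2, "sex": "F"},
    {"patient_id": 4, "sex": "M"},
]

missing = missing_counts(db_a, ["sex", "departure"])
mismatched = id_mismatches(db_a, db_b, "sex")
reversed_events = out_of_order(db_a, "arrival", "departure")
```

Each helper returns the offending counts or patient IDs rather than modifying the data, so flagged records can be reviewed before any recoding is done.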
Initial Descriptive Screening by Data Set
| Number of patients | 241,773 | 70,061 |
| Number of encounters | 2,815,550 | 70,061 |
| Records needing recoding (check for Attribute Domain Constraints–related issues) | 25 | 13 |
| Number of potential primary keys (check for Relational Integrity Rules) | 2 | 1 |
| Variables with missing observations (check for Attribute Domain Constraints–related issues) | 32 | 4 |
| Variables with time sequences (check for Historical Data Rules, Relational Integrity Rules, State Dependent Object Rules) | 8 | 6 |
| Variables with dependent events (check for Attribute Dependency Rules and Relational Integrity Rules) | 7 | 5 |
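A screening pass like the one summarized above can be sketched as follows. This is a rough illustration in plain Python, assuming the data set is a list of dictionaries with one dictionary per encounter; the `screen` function, the field names, and the toy records are hypothetical and are not taken from the study.

```python
def screen(records, id_field, fields):
    """Summarize a data set given as a list of dicts (one dict per encounter)."""
    n = len(records)
    patients = {r[id_field] for r in records}
    # A field is a candidate primary key if it is never missing and never repeats.
    candidate_keys = [
        f for f in fields
        if all(r.get(f) is not None for r in records)
        and len({r[f] for r in records}) == n
    ]
    # Fields with at least one missing observation.
    with_missing = [f for f in fields if any(r.get(f) is None for r in records)]
    return {
        "patients": len(patients),
        "encounters": n,
        "candidate_keys": candidate_keys,
        "fields_with_missing": with_missing,
    }


# Made-up encounters for illustration only.
encounters = [
    {"encounter_id": 10, "patient_id": 1, "chief_complaint": "chest pain"},
    {"encounter_id": 11, "patient_id": 1, "chief_complaint": None},
    {"encounter_id": 12, "patient_id": 2, "chief_complaint": "fever"},
]

summary = screen(
    encounters, "patient_id", ["encounter_id", "patient_id", "chief_complaint"]
)
```

In this toy data only `encounter_id` qualifies as a candidate key, because `patient_id` repeats across encounters and `chief_complaint` has a missing value.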
Figure 8. Pairwise Comparison Results for the Kruskal-Wallis ANOVA on Uncorrected (a) and Corrected (b)
Note: Yellow lines represent significant relationships (ps < 0.05); black lines represent nonsignificant relationships (ps > 0.05).