Ritu Khare, Levon H Utidjian, Hanieh Razzaghi, Victoria Soucek, Evanette Burrows, Daniel Eckrich, Richard Hoyt, Harris Weinstein, Matthew W Miller, David Soler, Joshua Tucker, L Charles Bailey.
Abstract
BACKGROUND: Clinical data research networks (CDRNs) aggregate electronic health record data from multiple hospitals to enable large-scale research. A critical operation in building a CDRN is conducting continual evaluations to optimize data quality. The key challenges include determining the assessment coverage for big datasets, handling data variability over time, and facilitating communication with data teams. This study presents the evolution of a systematic workflow for data quality assessment in CDRNs.

IMPLEMENTATION: Using a specific CDRN as a use case, the workflow was iteratively developed and packaged into a toolkit. The resultant toolkit comprises 685 data quality checks to identify data quality issues, procedures to reconcile them with a history of known issues, and a contemporary GitHub-based reporting mechanism for organized tracking.
Keywords: CDRN; Checks; Data Quality; Electronic Health Records; GitHub; Issues
Year: 2019; PMID: 31531382; PMCID: PMC6676917; DOI: 10.5334/egems.294
Source DB: PubMed; Journal: EGEMS (Wash DC); ISSN: 2327-9214
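The abstract mentions procedures that reconcile newly detected issues with a history of known issues. The following is a minimal sketch of that idea, not the toolkit's actual implementation: the `Issue` fields, the `reconcile` function, and the use of a plain dictionary as the issue history are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical key for a data quality issue (names are illustrative)."""
    table: str
    field: str
    check_type: str


def reconcile(detected, known):
    """Split detected issues into new ones and ones already tracked.

    detected: list of Issue objects found in the current data cycle
    known:    dict mapping Issue -> status string from earlier cycles
    """
    to_report = []          # issues with no history, candidates for a new report
    already_tracked = {}    # issues seen before, with their recorded status
    for issue in detected:
        if issue in known:
            already_tracked[issue] = known[issue]
        else:
            to_report.append(issue)
    return to_report, already_tracked


# Illustrative history and detections (not real network data).
pre_birth = Issue("visit_occurrence", "visit_start_date", "pre-birth fact")
count_change = Issue("drug_exposure", "", "unexpected change in number of records")

known_issues = {pre_birth: "persistent"}
new_issues, tracked = reconcile([pre_birth, count_change], known_issues)
```

In this sketch only `count_change` would be posted as a new report, while `pre_birth` is matched to its existing "persistent" status and not reported again.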
Examples of check type, check and data quality issue in PEDSnet.
| Entity | Attribute | Example-1 | Example-2 | Example-3 |
|---|---|---|---|---|
| | | unexpected most frequent values | Pre-birth fact | unexpected change in number of records between data cycles |
| | | 0 | 0 | |
| | | 0 | 15 | |
| | | Condition_concept_id | Visit_start_date, time_of_birth | |
| | | Numeric | Date, date | |
| | | condition_occurrence | visit_occurrence, person | drug_exposure |
| | | Shooting pain (OMOP concept_id: 4171519) | 11557 visits before patient was born | 22.65% |
| | | Solution proposed | persistent | withdrawn |
| | | ETL: programming error | Characteristic: true anomaly | False alarm: improvement in previous ETL |
| | | ETLv11 | ETLv8 | ETLv10 |
| | | September 2016 | April 2016 | September 2016 |
| | | 2.4.0 | 2.2.0 | 2.4.0 |
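Two of the check types in the table, the pre-birth fact check and the check for an unexpected change in record counts between data cycles, can be sketched in a few lines. This is an illustrative sketch only, not the PEDSnet toolkit's code: the function names, the in-memory data layout, the use of dates rather than timestamps, and the threshold value in the usage lines are all assumptions.

```python
from datetime import date


def count_pre_birth_visits(visits, birth_dates):
    """Count visits whose start date precedes the patient's birth date.

    visits:      iterable of (person_id, visit_start_date) pairs
    birth_dates: dict mapping person_id -> date of birth
    """
    return sum(
        1
        for person_id, start in visits
        if person_id in birth_dates and start < birth_dates[person_id]
    )


def record_count_change_pct(previous_count, current_count):
    """Absolute percentage change in record count between two data cycles."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return 100.0 * abs(current_count - previous_count) / previous_count


def exceeds_threshold(value, upper_threshold):
    """Flag a finding when the measured value is above the allowed maximum."""
    return value > upper_threshold


# Illustrative data only (not PEDSnet data).
births = {1: date(2010, 5, 1), 2: date(2012, 1, 15)}
visits = [
    (1, date(2010, 4, 30)),   # one day before birth: flagged
    (1, date(2011, 6, 2)),
    (2, date(2012, 1, 15)),   # on the birth date: not flagged
]
pre_birth_count = count_pre_birth_visits(visits, births)

# Hypothetical record counts for one table in two consecutive cycles.
change = record_count_change_pct(previous_count=10000, current_count=12265)
flagged = exceeds_threshold(change, upper_threshold=15.0)  # threshold is assumed
```

A production check of this kind would more plausibly run as a query against the `visit_occurrence` and `person` tables than over Python lists; the lists here only keep the sketch self-contained.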
Figure 1. The PEDSnet data quality assessment workflow.
Figure 2. Examples of data quality issues posted on GitHub; sensitive data are hidden to preserve anonymity.
Figure 3. The domain-wise longitudinal distribution of number and types of reported issues. The top horizontal bar indicates the version number of network conventions adopted for a given data cycle.
Figure 4. Distribution of GitHub issue closure duration across data domains, issue causes, and DQ check types.