Vojtech Huser, Frank J DeFalco, Martijn Schuemie, Patrick B Ryan, Ning Shang, Mark Velez, Rae Woong Park, Richard D Boyce, Jon Duke, Ritu Khare, Levon Utidjian, Charles Bailey.
Abstract
INTRODUCTION: Data quality and fitness for analysis are crucial if outputs of analyses of electronic health record data or administrative claims data should be trusted by the public and the research community.
Keywords: Common Data Model; Data Use and Quality; Electronic Health Record (EHR); Informatics
Year: 2016 PMID: 28154833 PMCID: PMC5226382 DOI: 10.13063/2327-9214.1239
Source DB: PubMed Journal: EGEMS (Wash DC) ISSN: 2327-9214
Data Quality Reporting Recommendations Formulated by the DQC13 (shortened)
| Category | Description | Recommendations | Example recommendation |
| --- | --- | --- | --- |
| 1. Data capture descriptions | Information on how data was observed, collected and recorded | Recommendations 1–6 | #2 Data Steward: A description of the type of organization responsible for obtaining and managing the target data set (e.g., registry or state agency). |
| 2. Data processing descriptions | Information on how data was transformed (e.g., mapping, unit conversion, derived values) | Recommendations 7–11 | #8 Mappings from original values to standardized values: Documentation on how original data values were transformed to conform to the target data model format. |
| 3. Data elements characterizations | Information on observed data features of the target data, such as data distributions and missingness | Recommendations 12–15 | #13 Single element data descriptive statistics: For each variable, calculate the following descriptive statistics (count and % of missing, descriptive statistics for numerical and categorical variables, goodness-of-fit tests for anticipated distributions). |
| 4. Analysis-specific data element characterizations | Information on data quality for a specific cohort and analysis (not on the level of the entire database) | Recommendations 16–20 | #17 Data quality checks of key variables used for cohort identification: Study specific additions to recommendations #13–15. |
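Recommendation #13 above (count and percentage of missing values plus descriptive statistics for each variable) can be illustrated with a minimal sketch. This is not code from the DQC or from ACHILLES; the function name `describe_column`, the toy records, and the choice of statistics are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean, median


def describe_column(values):
    """Summarize one variable: missingness plus basic descriptive statistics."""
    total = len(values)
    present = [v for v in values if v is not None]
    missing = total - len(present)
    summary = {
        "count": total,
        "missing": missing,
        "missing_pct": round(100.0 * missing / total, 1) if total else 0.0,
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numerical variable: range and central tendency.
        summary.update(
            min=min(present),
            max=max(present),
            mean=mean(present),
            median=median(present),
        )
    elif present:
        # Categorical variable: most frequent values with their counts.
        summary["top_values"] = Counter(present).most_common(3)
    return summary


# Hypothetical toy records; None marks a missing value.
records = [
    {"year_of_birth": 1980, "gender": "F"},
    {"year_of_birth": None, "gender": "M"},
    {"year_of_birth": 1975, "gender": "F"},
    {"year_of_birth": 2001, "gender": None},
]
report = {col: describe_column([r[col] for r in records]) for col in records[0]}
```

Each entry of `report` holds the per-variable summary; goodness-of-fit tests, which the recommendation also mentions, are left out of this sketch.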
Figure 1: Screenshot Showing Viewing of ACHILLES Heel Errors and Warnings
Overview of Participating Sites
| Site | Number of data sets | Data type |
| --- | --- | --- |
| Site A | 5 | Claims data |
| Site B | 1 | Drug dispensing + administrative data |
| Site C | 1 | EHR data |
| Site D | 7 | Claims + EHR data |
| Site E | 1 | Claims + EHR data |
| Site F | 1 | EHR data |
| Site G | 8 | EHR data |
Overview of Data Sets (Number of Heel Errors and Context Characteristics)
| Data set | Number of Heel errors | ETL stage | Data set size |
| --- | --- | --- | --- |
| siteA-data set1 | 104,125 | after multiple ETLs without Heel results | 1M+ |
| siteA-data set2 | 243 | after multiple ETLs without Heel results | 1M+ |
| siteA-data set3 | 22,289 | after multiple ETLs without Heel results | 1M+ |
| siteA-data set4 | 58,296 | after multiple ETLs without Heel results | 1M+ |
| siteA-data set5 | 43,089 | after multiple ETLs without Heel results | 1M+ |
| siteB-data set1 | 39 | after initial ETL | <10k |
| siteC-data set1 | 424 | after multiple ETLs | 1M+ |
| siteD-data set1 | 19 | after multiple ETLs without Heel results | 1M+ |
| siteD-data set2 | 13 | after multiple ETLs | 1M+ |
| siteD-data set3 | 7 | after multiple ETLs | 1M+ |
| siteD-data set4 | 25 | after multiple ETLs | 1M+ |
| siteD-data set5 | 19 | after multiple ETLs | 10k–100k |
| siteD-data set6 | 3 | after multiple ETLs | 10k–100k |
| siteD-data set7 | 22 | after multiple ETLs | 1M+ |
| siteE-data set1 | 31 | after multiple ETLs | 1M+ |
| siteF-data set1 | 25 | after multiple ETLs | 1M+ |
| siteG-data set1 | 17 | after multiple ETLs | 10k–100k |
| siteG-data set2 | 16 | after multiple ETLs | 10k–100k |
| siteG-data set3 | 16 | after multiple ETLs | 10k–100k |
| siteG-data set4 | 12 | after multiple ETLs | 10k–100k |
| siteG-data set5 | 14 | after multiple ETLs | 10k–100k |
| siteG-data set6 | 13 | after multiple ETLs | 10k–100k |
| siteG-data set7 | 15 | after multiple ETLs | 10k–100k |
| siteG-data set8 | 9 | after multiple ETLs | 10k–100k |
Most Common Errors Found
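| Rule ID | Count of data sets with error | Number of instances | Rule description |
| --- | --- | --- | --- |
| 101 | 16 | n/a | Number of persons by age, with age at first observation period; should not have age < 0 |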
| 103 | 15 | n/a | Distribution of age at first observation period; age should not be negative |
| 206 | 13 | 18 | Distribution of age by visit_concept_id; age should not be negative |
| 406 | 13 | 31 | Distribution of age by condition_concept_id; min(age) should not be negative |
| 600 | 13 | 14 | Number of persons with at least one procedure occurrence, by procedure_concept_id; concepts in data are not in correct vocabulary (CPT4; HCPCS,ICD9P) |
| 717 | 12 | 3173 | Distribution of quantity by drug_concept_id; max(quantity) should not be > 600 |
| 114 | 11 | n/a | Number of persons with observation period before year of birth; should not be > 0 |
| 410 | 11 | n/a | Number of condition occurrence records outside valid observation period; should not be > 0 |
| 510 | 11 | n/a | Number of death records outside valid observation period; count should not be > 0 |
| 806 | 11 | 25 | Distribution of age by observation_concept_id; should not be negative |
| 606 | 10 | 19 | Distribution of age by procedure_concept_id; min(age) should not be negative |
| 610 | 10 | n/a | Number of procedure occurrence records outside valid observation period; count should not be > 0 |
Notes:
Errors marked with an asterisk are all possibly related to the same underlying error in birth year
n/a indicates that the rule operates on the whole data set and the number of instances is not applicable (the rule can generate only one instance per database, and the count of instances for that rule is always equal to the “count of data sets with error” shown in the second column).
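The kinds of rules listed above can be sketched in a few lines. The following is a simplified illustration, not the ACHILLES Heel implementation: the function `run_checks`, the dictionary keys, the check names, and the one-observation-period-per-person simplification are all assumptions. It mirrors three rule types from the table: negative age at the start of the observation period, a maximum drug quantity above 600 per drug concept, and condition records outside a valid observation period.

```python
from datetime import date


def run_checks(persons, observation_periods, drug_exposures, conditions):
    """Return {check_name: number of offending items} for three simplified checks."""
    birth_year = {p["person_id"]: p["year_of_birth"] for p in persons}
    # Simplification: assume at most one observation period per person.
    period = {op["person_id"]: op for op in observation_periods}
    findings = {}

    # Age at start of observation period should not be negative.
    negative_age = [
        pid for pid, op in period.items()
        if pid in birth_year and op["start_date"].year - birth_year[pid] < 0
    ]
    if negative_age:
        findings["age_negative"] = len(negative_age)

    # Per drug concept, max(quantity) should not exceed 600.
    max_quantity = {}
    for d in drug_exposures:
        cid = d["drug_concept_id"]
        max_quantity[cid] = max(max_quantity.get(cid, 0), d["quantity"])
    too_large = [cid for cid, q in max_quantity.items() if q > 600]
    if too_large:
        findings["quantity_gt_600"] = len(too_large)

    # Condition records should fall inside the person's observation period.
    outside = 0
    for c in conditions:
        op = period.get(c["person_id"])
        if op is None or not (op["start_date"] <= c["start_date"] <= op["end_date"]):
            outside += 1
    if outside:
        findings["condition_outside_period"] = outside

    return findings


# Hypothetical toy data; person 2 has a birth year after the observation period.
persons = [
    {"person_id": 1, "year_of_birth": 1980},
    {"person_id": 2, "year_of_birth": 2020},
]
observation_periods = [
    {"person_id": 1, "start_date": date(2010, 1, 1), "end_date": date(2015, 12, 31)},
    {"person_id": 2, "start_date": date(2012, 1, 1), "end_date": date(2014, 12, 31)},
]
drug_exposures = [
    {"drug_concept_id": 111, "quantity": 30},
    {"drug_concept_id": 111, "quantity": 900},
    {"drug_concept_id": 222, "quantity": 10},
]
conditions = [
    {"person_id": 1, "start_date": date(2011, 5, 1)},
    {"person_id": 1, "start_date": date(2016, 1, 1)},
]
findings = run_checks(persons, observation_periods, drug_exposures, conditions)
```

In this sketch the quantity check counts offending concepts, which corresponds to a rule that can produce many instances per data set, while the other two checks operate on the data set as a whole.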
Data Quality Questionnaire Results
| Data Quality Evaluation | Q1: Describe at what stage of your CDM implementation did you execute ACHILLES Heel analysis? | Heel was executed 1.5 years after the CDM data set was created. | Use Heel iteratively during translation and loading process | After first iteration of the full ETL. During ETL and the end. Run custom DQA scripts and Heel to identify DQA issues -> communicate back to site -> sites fix ETL issues and resend data -> DQA analysis | After each data update and each change to our CDM implementation. Iteratively during translation and loading process. |
| Data Quality Evaluation | Q2: Describe what impact had seeing the ACHILLES Heel results on your future ETL versions? | None. ETL is static. | Able to identify serious problems with ETL and fix the issues. | Motivated to set cut-offs for outliers leading to invalid data, and to revise our observation period logic. Detect ETL errors, improve ETL scripts and learn pediatric-specific data issues for better understanding of data. | Provide data quality check for each ETL version and further analysis and understanding ETL as well as data. Provide substantial feedback for future ETL versions. Made 1,500 lines SQL codes for the ETL to fix all the bugs encountered by Heel. |
| General Organizational Data Quality Context | Q3: How frequently do you refresh your CDM data and how frequently do you modify the ETL? What resources are allocated to this task? | None. ETL is static. | 1 time per two years | CDM monthly ETL after problems are detected 15–30 days for both Two FTEs currently <10%, but we expect to increase once our version 5 CDM stabilizes. Monitoring committee meets monthly to discuss data quality across health information exchange (HIE). | ETL based on feedback from Heel, changes to source data or updates to CDM model CDM on a quarterly basis CDM: first implementation took a year, second renewal after 4 months of the first implementation, which took 2 weeks ETL: first DQM and ETL took a month (4 people), DQM took a month by a single person, second DQM & ETL took 2 weeks by a single person |
| General Organizational Data Quality Context | Q4: What other tools and methods are your site (or your site’s specific data set) using to assess data quality? | SQL queries | Public quality measures from CMS Nursing Home Compare and with data provided by PBM at our request | Frequently run variants of the ETL and compare the resulting ACHILLES to find the best approach. Also consult senior clinical and technical staff for validity. If it’s programmer error, issue tracking software is important. Biostatistician monitors A suite of DQA scripts in R | A custom system that captures benchmarks of data volumes by table within each data source. The system can compare current and prior versions to show discrepancies and variation in volume trends within each table. Before Heel, all researchers took care of their own data quality analyses. |
| General Organizational Data Quality Context | Q5: Is there an ETL high-level description document? | on OHDSI website | Working on one to have public before submitting a manuscript | WhiteRabbit and Rabbit-In-A-Hat In the form of a wiki Site-specific ETL documents for 8 sites and a common conventions document to populate the OMOP | WhiteRabbit and Rabbit-In-A-Hat OMOP ETL template and prepared own ETL SQL code |
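One Q4 response describes a custom system that benchmarks data volumes by table and compares current and prior versions to show discrepancies. A minimal sketch of that idea follows; it is not the site's actual system, and the function name `volume_discrepancies`, the 10% tolerance, and the row counts are hypothetical.

```python
def volume_discrepancies(prior, current, tolerance=0.10):
    """Flag tables whose row count changed by more than `tolerance` between versions.

    `prior` and `current` map table names to row counts. The result maps each
    flagged table to its relative change (float("inf") for newly populated tables).
    """
    flagged = {}
    for table in sorted(set(prior) | set(current)):
        before = prior.get(table, 0)
        after = current.get(table, 0)
        if before == 0:
            # A table that was empty or absent before has no defined ratio.
            change = None if after == 0 else float("inf")
        else:
            change = (after - before) / before
        if change is not None and abs(change) > tolerance:
            flagged[table] = change
    return flagged


# Hypothetical row counts for two successive versions of a data set.
prior_counts = {"person": 1000, "drug_exposure": 50000, "visit_occurrence": 8000}
current_counts = {"person": 1010, "drug_exposure": 30000, "visit_occurrence": 8100, "death": 12}
flagged = volume_discrepancies(prior_counts, current_counts)
```

Here `drug_exposure` is flagged because it shrank by 40%, and `death` is flagged because it appears only in the current version; small changes in `person` and `visit_occurrence` stay within the tolerance.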