
Predicting Causes of Data Quality Issues in a Clinical Data Research Network.

Ritu Khare¹, Byron J Ruth¹, Matthew Miller¹, Joshua Tucker¹, Levon H Utidjian¹,², Hanieh Razzaghi¹, Nandan Patibandla³, Evanette K Burrows¹, L Charles Bailey¹,².

Abstract

Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of the identified data quality issues are resolvable as they are caused by errors in the extract-transform-load (ETL) code. Nonetheless, with no prior knowledge of issue causes, partner institutions end up spending significant time investigating issues that represent either inherent data characteristics or false alarms. This work investigates whether the causes (ETL, Characteristic, or False alarm) can be predicted before spending time investigating issues. We trained a classifier on the metadata from 10,281 real-world data quality issues, and achieved a cause prediction F1-measure of up to 90%. While initially tested on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results.


Year: 2018 PMID: 29888053 PMCID: PMC5961770

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction and Background

Clinical data research networks (CDRNs) transform electronic health record (EHR) data from multiple institutions into common data models, and make that data available, in either a centralized or a distributed fashion, to conduct a wide range of scientific studies.[1-3] Given that EHRs are designed for clinical operations rather than research use, one of the most critical aspects of building a CDRN is ensuring that the aggregated clinical data are “high-quality” or “ready for research use.”[2-7] CDRN datasets are typically built in an iterative fashion. The data coordinating center executes data characterization or validation modules on the dataset to identify data quality problems; these problems are communicated to the contributing institutions, which investigate and resolve them and generate improved datasets. The investigation of data quality problems is a complex process that involves local replication of the problems, reviews of the relevant extract-transform-load (ETL) code, verification of data assumptions, and discussions about local data characteristics with an interdisciplinary team of clinicians, analysts, researchers, and administrative staff. Several studies have recommended techniques for conducting data quality assessments on EHR-derived datasets, such as expert judgment, heuristics, knowledge of impossibilities, gold standard benchmarking, code reviews, conformance to value set domains, and computation of derived values,[2,5,8,9] and more recently Kahn et al. designed a comprehensive ontology to classify data quality checks.[9] However, the handling, analysis, and classification of real-world data quality issues remain largely undocumented.[10] Here, we present an automated approach that, given a data quality issue, classifies the issue cause as “ETL” vs. “characteristic” vs. “false alarm,” to assist in the prioritization and resolution of issues.
The main contribution of this work is the use of supervised machine learning to predict the causes of data quality issues, with promising performance. In this study, we focus on a pediatric CDRN, PEDSnet, that aggregates EHR data from eight of the nation’s largest children’s hospitals[11,12] using the Observational Medical Outcomes Partnership (OMOP) common data model (CDM).[13] PEDSnet has invested substantial effort in designing and implementing data quality “checks” to evaluate the validity of EHR-derived datasets and identify any data quality “issues” that indicate that the data could be inaccurate or difficult to use for some research purposes.[14] PEDSnet uses “GitHub issues” to report the data quality issues to individual sites,[15] as shown in Figure 1. Once the issues are reported, the originating site’s first task is to determine whether the issue is an error in the ETL programming pipeline (Figure 1a), represents a characteristic or inherent property of the data such as EHR data entry errors, administrative issues, source data incompleteness, or institutional data anomalies (Figure 1b), or is a false alarm caused by a programming bug in the PEDSnet data quality program or a limitation of the data quality check (Figure 1c). The next task is to resolve any ETL-related issues for the next data submission. Based on a detailed analysis of the six most recent data cycles (iterations) in PEDSnet, a partner site investigates on average 26 new issues in each cycle, of which only 35% represent resolvable problems (i.e. the ETL category), as shown in Figure 2. In addition, based on an analysis of issue timelines in GitHub, we found that a data quality issue stays open for 31 days on average, which suggests the potential duration of the issue investigation process; see Figure 3 for the distribution of GitHub issue duration across the different types of causes.
In sum, issue handling is a major bottleneck toward the iterative development of PEDSnet, as the partner sites need to resort to time-consuming and expensive processes for manual prioritization and investigation of individual issues. In this study, we examine whether the cause of a data quality issue can be predicted before delving into investigation, to help minimize issue fatigue and avoid spending time on issues that cannot be resolved, e.g. characteristic issues, or that should not have been reported at all, e.g. false alarms.

Figure 1

GitHub screenshots of PEDSnet data quality issues illustrating different causes - top to bottom (a) ETL issue (in red), (b) Characteristic issue (in blue), (c) False alarm (in gray).

Figure 2

Longitudinal distribution of causes of data quality issues reported to PEDSnet sites

Figure 3

The GitHub (open → close) duration of different classes of issues

Methods

As our data source, we use the PEDSnet data quality issue warehouse.[17] The warehouse contains metadata about data quality issues, and manually identified causes of those issues. The metadata includes the affected domain(s) and field(s) in the CDM, the tailored description of the issue, site information, the check type (Table 1) generating the issue, and the version of the CDM adopted for the given data cycle. It should be noted that the “characteristic” issues, once determined, get documented in the subsequent data cycles, but are not reported to the sites to avoid duplication of efforts.
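To make the shape of this metadata concrete, the following is a minimal sketch of one issue record. The class name, field names, and example values are illustrative assumptions and are not taken from the PEDSnet issue warehouse schema; only the kinds of metadata and the three cause labels come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The three manually identified cause labels named in the paper.
KNOWN_CAUSES = ("ETL", "Characteristic", "False alarm")


@dataclass(frozen=True)
class IssueRecord:
    """Hypothetical layout for one data quality issue (names are assumptions)."""
    domain: str                  # affected CDM table, e.g. "visit_occurrence"
    fields: Tuple[str, ...]      # affected field(s) in that table
    description: str             # tailored description of the issue
    site: str                    # site identifier (placeholder values here)
    check_type: str              # check type alias, e.g. "ImplEvent"
    cdm_version: str             # CDM version adopted for the data cycle
    cause: Optional[str] = None  # one of KNOWN_CAUSES, or None if not yet determined


def has_known_cause(issue: IssueRecord) -> bool:
    """True when the issue carries one of the three cause labels."""
    return issue.cause in KNOWN_CAUSES


# Example built from the ImplEvent row of Table 1; site and version are placeholders.
example = IssueRecord(
    domain="visit_occurrence",
    fields=("visit_start_date", "visit_end_date"),
    description="Found encounters with visit_start_date occurring after visit_end_date",
    site="site_A",
    check_type="ImplEvent",
    cdm_version="placeholder-version",
)
print(has_known_cause(example))
```

Only records for which `has_known_cause` holds would be usable as labeled training examples under this layout.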

Table 1.

Some Examples of Data Quality Check Types and Issues in PEDSnet

Check Type (alias: Name)	Example Data Quality Issues
InconSource: Inconsistency with source	Distribution of NULL values in race_source_value does not match the distribution of the “No Information” concept in race_concept_id in the Person table
InvalidValue: Value set violations	A non-standard concept used for populating the condition_concept_id field
UnexFact: Unexpected facts	A medication name entered into the location.zip field
ImplEvent: Implausible events	Found encounters with visit_start_date occurring after visit_end_date
CatOutlier: Categorical outlier	A patient with over 30,000 procedures
UnexTop: Unexpected most frequent values	“injection for contraceptive” as the most frequent procedure at a site
UnexDiff: Unexpected difference from the previous data cycle	Decrease in the number of deaths, or large increase (e.g. 2X) in the number of conditions
MissData: Missing data	Gestational age is not available for 70% of patients
MissFact: Missing expected facts	No “creatinine” lab record found in measurement table

We hypothesize that a machine learning classifier trained on metadata about known issues can help determine the cause of a data quality issue, and that the classifier can deliver performance sufficient to drive issue prioritization. As input, we selected a variety of features from the PEDSnet issue warehouse, intended to capture several aspects of an issue. Overall, 83 binary features were selected. The feature types and instances are described below, and illustrated in Table 2.

Table 2.

Feature types and positive features for the example issues shown in Figure 1

Feature Types	ETL issue (Fig. 1a)	Characteristic issue (Fig. 1b)	False alarm (Fig. 1c)
Domain	Condition_occurrence	Visit_occurrence	Drug_exposure
Field Type	Concept identifier	Multiple	-
Check Type	UnexTop	ImplEvent	UnexDiff
Prevalence	Medium	Low	Medium
CDM version upgrade	No	Yes	No

Domain: The CDM table where the issue was observed, e.g. Person, Care_Site, Location, Death, Condition_occurrence, Visit_payer, Visit_occurrence, Procedure_occurrence, Measurement, Drug_exposure, Measurement_organism, etc.
Field Type: The type of field where the issue was observed, e.g. numerical fields, foreign keys, concept identifiers, source values, combination of fields, or others.
Check Type: The type of data quality assessment conducted to identify the issue; some examples are shown in Table 1.
Prevalence: The proportion of records affected by the issue, categorized as full (100%), high (30%-100%), medium (1%-30%), low (0%-1%), or unknown.
Site: The site where the issue was observed, i.e. one of the eight PEDSnet sites.
CDM version upgrade: A boolean feature denoting whether the PEDSnet CDM version was upgraded since the previous data cycle.

We targeted two classification problems: binary (ETL vs. Non-ETL) and three-way (ETL vs. Characteristic vs. False alarm). We evaluated several classification methods, including Naïve Bayes (NB), Decision tree (DT), Decision tree with boosting (DTB), k-Nearest neighbor (KNN), and support vector machine (SVM). We used Python implementations[16] of the classifiers, and used the datasets extracted from the PEDSnet issue warehouse for training. Prior to choosing specific configurations for these learners, a “model grid search” was performed using the GridSearchCV algorithm to find the optimal set of parameters for each of the target learners, as shown in Table 3. The search was performed on 80% of the data, using five-fold stratified splits of this training set. Each combination was evaluated against a hold-out set to score the model.
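As an illustration of how the feature types described above could be turned into binary features, here is a minimal sketch of prevalence binning and one-hot encoding. The function names, the handling of the bin boundaries (lower bound inclusive), and the affected-record fractions are assumptions; the categorical values are the positive features of the first two example issues in Table 2.

```python
def prevalence_category(fraction):
    """Map the affected fraction of records to a prevalence category.

    Boundary handling is an assumption: the text gives the ranges
    full (100%), high (30%-100%), medium (1%-30%), low (0%-1%)
    without saying which side is inclusive.
    """
    if fraction is None:
        return "unknown"
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction == 1.0:
        return "full"
    if fraction >= 0.30:
        return "high"
    if fraction >= 0.01:
        return "medium"
    return "low"


def build_vocabulary(issues):
    """Assign one column index to every observed 'feature=value' pair."""
    tokens = sorted({f"{name}={value}" for issue in issues for name, value in issue.items()})
    return {token: index for index, token in enumerate(tokens)}


def encode(issue, vocabulary):
    """Return a binary vector with a 1 for each feature value present in the issue."""
    vector = [0] * len(vocabulary)
    for name, value in issue.items():
        index = vocabulary.get(f"{name}={value}")
        if index is not None:  # values never seen before are ignored
            vector[index] = 1
    return vector


# Categorical values follow Table 2; the fractions 0.05 and 0.002 are made up.
issues = [
    {"domain": "Condition_occurrence", "field_type": "Concept identifier",
     "check_type": "UnexTop", "prevalence": prevalence_category(0.05),
     "cdm_upgrade": False},
    {"domain": "Visit_occurrence", "field_type": "Multiple",
     "check_type": "ImplEvent", "prevalence": prevalence_category(0.002),
     "cdm_upgrade": True},
]
vocabulary = build_vocabulary(issues)
vectors = [encode(issue, vocabulary) for issue in issues]
```

With the full set of domains, field types, check types, prevalence levels, sites, and the upgrade flag, an encoding of this kind yields a fixed-width binary vector per issue; the paper reports 83 such features but does not spell out the encoding, so this is one plausible construction.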

Table 3.

The learned parameters for various classifiers using a grid search

Learner	Parameters
Decision tree + pruning (DT)	Max depth = 10
Decision tree + pruning + boosting (DT+B)	Max depth = 4
	Estimators = 300
k-nearest neighbor (KNN)	K = 5 (binary), 3 (three-way)
Naïve Bayes (NB)	Class Priors = None
Support vector machine (SVM)	Kernel = linear
	Error term = 0.1
	Tolerance = 0.001
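The grid search described above can be sketched with scikit-learn as follows. This is one plausible reading of the procedure (an 80/20 stratified split, five-fold stratified cross-validation on the 80%, and a final score on the held-out 20%), not the authors' actual script. The data are synthetic, the label rule is invented so that there is some signal to learn, and the candidate depths in the grid are assumptions apart from including the values 4 and 10 reported in Table 3.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the issue dataset: 400 issues x 83 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 83))
y = (X[:, 0] | X[:, 1]).astype(int)  # invented label rule: 1 = ETL, 0 = Non-ETL

# Keep 20% aside as a hold-out set; search on the remaining 80%.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 10]},  # candidate values are assumptions
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

# Score the refitted best model on the hold-out set with the same metric.
holdout_f1 = search.score(X_hold, y_hold)
print(search.best_params_, round(holdout_f1, 3))
```

The same pattern applies to the other learners in Table 3 by swapping the estimator and its parameter grid.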

Results

We used the July 2017 version of the PEDSnet issue warehouse,[17] which contains metadata on 11,434 data quality issues identified over a span of 30 months. We drew two datasets for experimentation: the all-issues-dataset, which includes 10,281 issues after filtering out issues with unknown causes, and the unique-issues-dataset, with 4,388 issues, a subset of the all-issues-dataset prepared by excluding duplicate characteristic issues. The class label distributions are 14.37% (ETL), 81.78% (Characteristic), and 3.84% (False alarm) for the all-issues-dataset, and 33.68% (ETL), 57.31% (Characteristic), and 9% (False alarm) for the unique-issues-dataset. Figures 4 and 5 show the receiver operating characteristic (ROC) curves and performance measures for binary classification (ETL vs. Non-ETL), respectively. The results indicate that the k-nearest neighbor and decision tree with boosting algorithms could be promising choices for this problem. The classifiers trained on the unique-issues-dataset deliver a higher F1-measure for ETL issues than those trained on the all-issues-dataset, given its better class balance. Figure 6 shows the performance of the classifiers on the three-way classification problem. The performance for each class is higher than that of the binary classifiers. The classification performance on characteristic issues is higher than that on ETL issues or false alarms. This is most likely due to the availability of significantly more training examples for characteristic issues in both datasets.
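Because the classes are imbalanced, accuracy alone can mislead: a classifier that always predicted Characteristic would be correct on about 81.78% of the all-issues-dataset while finding no ETL issues at all. The per-class F1-measure avoids this. The sketch below computes precision, recall, and F1 for one class from true and predicted labels; the function name and the six toy labels are illustrative and are not results from the study.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class in a multi-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels for six issues (made up for illustration).
y_true = ["ETL", "ETL", "Characteristic", "Characteristic", "Characteristic", "False alarm"]
y_pred = ["ETL", "Characteristic", "Characteristic", "Characteristic", "ETL", "False alarm"]

for label in ("ETL", "Characteristic", "False alarm"):
    print(label, per_class_scores(y_true, y_pred, label))

# The degenerate "always Characteristic" classifier scores zero on ETL.
always_characteristic = ["Characteristic"] * len(y_true)
print(per_class_scores(y_true, always_characteristic, "ETL"))
```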

Figure 4

ROC curve for the binary (ETL vs. Non-ETL) classifier trained on all-issues-dataset

Figure 5

Performance measures for binary (ETL vs. Non-ETL) classification of issues

Figure 6

Performance measures for three-way (Characteristic, ETL, False Alarm) classification of issues

To further understand the results, we examined the types of issues that constitute the most frequent error cases in the all-issues-dataset using the Decision tree with boosting classifier (Table 4). In general, the issues that are frequently difficult to classify are limited to five major check types, which are candidates for further study. In the majority of error cases, the classifier could not determine whether MissData (missing data) for drug_exposure and measurement was due to inherent characteristics or an ETL error. Both are among the most rapidly evolving domains in PEDSnet: sites are gradually populating various fields, which likely accounts for the fluctuating causes of missingness over the past several cycles. Another difficult check type was UnexDiff (unexpected difference in the number of records between two data cycles), for which it is difficult to determine whether the issue is due to an ETL error or to a natural enlargement of the site’s dataset, i.e. a false alarm.

Table 4.

Most frequent error cases (check type, domain); FP = false positive, FN = false negative

Cause Class	ETL		Characteristic		Non-issue
	FP	FN	FP	FN	FP	FN
1	UnexDiff, Measurement	MissData, drug_exposure	MissData, drug_exposure	MissData, Measurement	UnexDiff, Procedure_occurrence	UnexDiff, Visit_occurrence
2	UnexDiff, Visit_occurrence	MissData, Measurement	MissData, Measurement	InvalidConID, Provider	UnexDiff, Drug_exposure	UnexDiff, Measurement
3	InvalidConID, Provider	MissConID, Drug_exposure	MissConID, Drug_exposure	MissData, Observation	MissData, Location	MissFact, care_site
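A tally like the one in Table 4 can be produced by grouping misclassified issues by (check type, domain) for each cause class. The sketch below shows one way to do this; the function name and the six records are hypothetical and do not reproduce the study's predictions.

```python
from collections import Counter


def top_errors(records, cause, n=3):
    """Count false positives and false negatives for one cause class.

    Each record is (check_type, domain, true_cause, predicted_cause).
    Returns the n most frequent (check_type, domain) pairs for FP and FN.
    """
    false_pos, false_neg = Counter(), Counter()
    for check_type, domain, true_cause, predicted in records:
        if predicted == cause and true_cause != cause:
            false_pos[(check_type, domain)] += 1
        elif true_cause == cause and predicted != cause:
            false_neg[(check_type, domain)] += 1
    return {"FP": false_pos.most_common(n), "FN": false_neg.most_common(n)}


# Hypothetical predictions, for illustration only.
records = [
    ("MissData", "drug_exposure", "ETL", "Characteristic"),
    ("MissData", "drug_exposure", "ETL", "Characteristic"),
    ("UnexDiff", "measurement", "False alarm", "ETL"),
    ("MissData", "measurement", "Characteristic", "ETL"),
    ("UnexDiff", "measurement", "False alarm", "ETL"),
    ("UnexDiff", "visit_occurrence", "ETL", "ETL"),  # correctly classified
]
etl_errors = top_errors(records, "ETL")
print(etl_errors["FP"])
print(etl_errors["FN"])
```

Note that every misclassification appears twice across classes: a false negative for the true class is a false positive for the predicted class, which is why the same (check type, domain) pairs recur across the columns of Table 4.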

Discussion

Based on our experience of conducting iterative data quality assessments on a pediatric CDRN, we find that a majority (>60%) of the data quality issues should receive lower priority for investigation, as they are either false alarms or inherent characteristics of the data that cannot be altered or resolved. In this study, we addressed the cause prediction problem using a machine learning classifier that, given a data quality issue, predicts the cause of the issue. The best performing classifier achieved a promising F1-measure of 0.9, which indicates the potential to save significant effort for the data generation teams. While this study was primarily driven by the efficiency challenges faced in PEDSnet and the proposed method was tested on the pediatric dataset, the methodology can be applied to benefit other CDRNs. By conducting the experiments using several classifiers with different class configurations, we were able to identify strong candidates for real-world implementation and execution. While our interest lies primarily in accurately predicting ETL issues, the performance of all classifiers on ETL issues (F1-measure, 0.71) leaves substantial scope for improvement, e.g. through further analysis of the frequent cases identified through error analysis. In the future, we plan to extend this work using systematic feature selection, development of more granular causal classes, development of balanced datasets, and assessment of the impact of automatic predictions on user experience.
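One way predicted causes could drive prioritization is to rank a site's open issues by the classifier's estimated probability that the cause is ETL, so that likely resolvable issues are investigated first. The sketch below is an assumption about how such triage might look, not a procedure described in the study; the issue identifiers, probabilities, and the 0.5 threshold are all made up.

```python
from typing import Dict, List, Tuple


def triage_order(p_etl: Dict[str, float], threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split issue ids into (investigate_first, lower_priority), each sorted by P(ETL)."""
    ranked = sorted(p_etl, key=p_etl.get, reverse=True)
    investigate_first = [issue for issue in ranked if p_etl[issue] >= threshold]
    lower_priority = [issue for issue in ranked if p_etl[issue] < threshold]
    return investigate_first, lower_priority


# Hypothetical classifier outputs for four open issues.
scores = {"issue-101": 0.12, "issue-102": 0.91, "issue-103": 0.55, "issue-104": 0.03}
first, later = triage_order(scores)
print(first)
print(later)
```

Given the F1-measure of 0.71 reported for ETL issues, lower-priority issues would still need eventual review rather than being dismissed outright.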

References

1. Defining and improving data quality in medical registries: a literature review, case study, and generic framework.

Authors: Danielle G T Arts; Nicolette F De Keizer; Gert-Jan Scheffer
Journal: J Am Med Inform Assoc Date: 2002 Nov-Dec

2. Challenges in using electronic health record data for CER: experience of 4 learning organizations and solutions applied.

Authors: K Bruce Bayley; Tom Belnap; Lucy Savitz; Andrew L Masica; Nilay Shah; Neil S Fleming
Journal: Med Care Date: 2013-08

3. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research.

Authors: Michael G Kahn; Marsha A Raebel; Jason M Glanz; Karen Riedlinger; John F Steiner
Journal: Med Care Date: 2012-07

4. Caveats for the use of operational electronic health record data in comparative effectiveness research.

Authors: William R Hersh; Mark G Weiner; Peter J Embi; Judith R Logan; Philip R O Payne; Elmer V Bernstam; Harold P Lehmann; George Hripcsak; Timothy H Hartzog; James J Cimino; Joel H Saltz
Journal: Med Care Date: 2013-08

5. A longitudinal analysis of data quality in a large pediatric data research network.

Authors: Ritu Khare; Levon Utidjian; Byron J Ruth; Michael G Kahn; Evanette Burrows; Keith Marsolo; Nandan Patibandla; Hanieh Razzaghi; Ryan Colvin; Daksha Ranade; Melody Kitzmiller; Daniel Eckrich; L Charles Bailey
Journal: J Am Med Inform Assoc Date: 2017-11-01

6. Multi-Institutional Sharing of Electronic Health Record Data to Assess Childhood Obesity.

Authors: L Charles Bailey; David E Milov; Kelly Kelleher; Michael G Kahn; Mark Del Beccaro; Feliciano Yu; Thomas Richards; Christopher B Forrest
Journal: PLoS One Date: 2013-06-18

7. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research.

Authors: Nicole Gray Weiskopf; Chunhua Weng
Journal: J Am Med Inform Assoc Date: 2012-06-25

8. Transparent reporting of data quality in distributed data networks.

Authors: Michael G Kahn; Jeffrey S Brown; Alein T Chun; Bruce N Davidson; Daniella Meeker; Patrick B Ryan; Lisa M Schilling; Nicole G Weiskopf; Andrew E Williams; Meredith Nahm Zozus
Journal: EGEMS (Wash DC) Date: 2015-03-23

9. PEDSnet: a National Pediatric Learning Health System.

Authors: Christopher B Forrest; Peter A Margolis; L Charles Bailey; Keith Marsolo; Mark A Del Beccaro; Jonathan A Finkelstein; David E Milov; Veronica J Vieland; Bryan A Wolf; Feliciano B Yu; Michael G Kahn
Journal: J Am Med Inform Assoc Date: 2014-05-12

10. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data.

Authors: Michael G Kahn; Tiffany J Callahan; Juliana Barnard; Alan E Bauck; Jeff Brown; Bruce N Davidson; Hossein Estiri; Carsten Goerg; Erin Holve; Steven G Johnson; Siaw-Teng Liaw; Marianne Hamilton-Lopez; Daniella Meeker; Toan C Ong; Patrick Ryan; Ning Shang; Nicole G Weiskopf; Chunhua Weng; Meredith N Zozus; Lisa Schilling
Journal: EGEMS (Wash DC) Date: 2016-09-11
