Ritu Khare, Byron J Ruth, Matthew Miller, Joshua Tucker, Levon H Utidjian, Hanieh Razzaghi, Nandan Patibandla, Evanette K Burrows, L Charles Bailey.
Abstract
Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of identified data quality issues are resolvable, i.e., caused by errors in the extract-transform-load (ETL) code. With no prior knowledge of issue causes, however, partner institutions end up spending significant time investigating issues that represent either inherent data characteristics or false alarms. This work investigates whether the cause (ETL, Characteristic, or False alarm) can be predicted before time is spent investigating an issue. We trained a classifier on the metadata from 10,281 real-world data quality issues and achieved a cause-prediction F1-measure of up to 90%. While initially tested on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results.
Year: 2018 PMID: 29888053 PMCID: PMC5961770
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. GitHub screenshots of PEDSnet data quality issues illustrating different causes, top to bottom: (a) ETL issue (in red), (b) Characteristic issue (in blue), (c) False alarm (in gray).
Figure 2. Longitudinal distribution of causes of data quality issues reported to PEDSnet sites.
Figure 3. The GitHub (open → close) duration of different classes of issues.
Some Examples of Data Quality Check Types and Issues in PEDSnet
| Check Type | Example Data Quality Issues |
|---|---|
| | Distribution of NULL values in race_source_value does not match the distribution of the "No Information" concept in race_concept_id in the Person table |
| | A non-standard concept used to populate the condition_concept_id field |
| | A medication name entered into the location.zip field |
| | Encounters found with visit_start_date occurring after visit_end_date |
| | A patient with over 30,000 procedures |
| | "injection for contraceptive" as the most frequent procedure at a site |
| | Decrease in the number of deaths, or large increase (e.g., 2×) in the number of conditions |
| | Gestational age not available for 70% of patients |
| | No "creatinine" lab record found in the measurement table |
Feature types and positive features for the example issues shown in Figure 1

| Feature Types | ETL issue (Figure 1a) | Characteristic issue (Figure 1b) | False alarm (Figure 1c) |
|---|---|---|---|
| Domain | Condition_occurrence | Visit_occurrence | Drug_exposure |
| Field Type | Concept identifier | Multiple | - |
| Check Type | | | |
| Prevalence | Medium | Low | Medium |
| CDM version upgrade | No | Yes | No |
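The categorical issue metadata above lends itself to one-hot encoding before classification. The following is a minimal sketch, assuming scikit-learn; the three toy records mirror the table's columns but are illustrative, not actual PEDSnet issues, and the feature keys (`domain`, `field_type`, `prevalence`, `cdm_upgrade`) are hypothetical names.

```python
# Hypothetical sketch: one-hot encode categorical issue metadata and
# fit a cause classifier. Records below are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

issues = [
    {"domain": "condition_occurrence", "field_type": "concept identifier",
     "prevalence": "medium", "cdm_upgrade": "no"},
    {"domain": "visit_occurrence", "field_type": "multiple",
     "prevalence": "low", "cdm_upgrade": "yes"},
    {"domain": "drug_exposure", "prevalence": "medium", "cdm_upgrade": "no"},
]
labels = ["ETL", "Characteristic", "False alarm"]

vec = DictVectorizer(sparse=False)  # one-hot encodes categorical features
X = vec.fit_transform(issues)

clf = DecisionTreeClassifier(max_depth=10, random_state=0)
clf.fit(X, labels)

# Predict the cause of a new, uninvestigated issue from its metadata alone
new_issue = {"domain": "visit_occurrence", "prevalence": "low",
             "cdm_upgrade": "yes"}
prediction = clf.predict(vec.transform([new_issue]))[0]
print(prediction)
```

`DictVectorizer` silently zero-fills features absent from a record (e.g., the missing `field_type` in the third issue), which matches the sparse, partially populated metadata that issue trackers typically yield.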
The learned parameters for various classifiers using a grid search
| Learner | Parameters |
|---|---|
| Decision tree + pruning (DT) | Max depth = 10 |
| Decision tree + pruning + boosting (DT+B) | Max depth = 4; Estimators = 300 |
| k-nearest neighbor (KNN) | K = 5 (binary), 3 (three-way) |
| Naïve Bayes (NB) | Class priors = None |
| Support vector machine (SVM) | Kernel = linear; Error term = 0.1; Tolerance = 0.001 |
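Parameters like these can be tuned with a cross-validated grid search. A minimal sketch, assuming scikit-learn's `GridSearchCV` and the macro-averaged F1 scoring the paper reports; the random stand-in data and the depth grid below are illustrative assumptions, not the study's actual setup.

```python
# Hypothetical sketch of tuning a decision tree's max depth by grid
# search over cross-validated macro-F1. Data are random stand-ins.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 8)).astype(float)  # stand-in one-hot features
y = rng.choice(["ETL", "Characteristic", "False alarm"], size=60)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [4, 6, 8, 10]},  # illustrative grid
    scoring="f1_macro",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Each learner in the table would get its own `param_grid` (e.g., `n_neighbors` for KNN, `C` and `tol` for the SVM), with the winning settings read from `best_params_`.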
Figure 4. ROC curve for the binary (ETL vs. non-ETL) classifier trained on the all-issues dataset.
Figure 5. Performance measures for binary (ETL vs. non-ETL) classification of issues.
Figure 6. Performance measures for three-way (Characteristic, ETL, False alarm) classification of issues.
Most frequent error cases (check type, domain); FP = false positive, FN = false negative.