Jose-Franck Diaz-Garelli, Elmer V Bernstam, MinJae Lee, Kevin O Hwang, Mohammad H Rahbar, Todd R Johnson.
Abstract
The well-known hazards of repurposing data make Data Quality (DQ) assessment a vital step towards ensuring valid results, regardless of analytical methods. However, there is no systematic process to implement DQ assessments for secondary uses of clinical data. This paper presents DataGauge, a systematic process for designing and implementing DQ assessments to evaluate repurposed data for a specific secondary use. DataGauge is composed of five steps: (1) Define information needs, (2) Develop a formal Data Needs Model (DNM), (3) Use the DNM and DQ theory to develop goal-specific DQ assessment requirements, (4) Extract DNM-specified data, and (5) Evaluate according to DQ requirements. DataGauge's main contribution is integrating general DQ theory and DQ assessment methods into a systematic process. This process supports the integration and practical implementation of existing Electronic Health Record-specific DQ assessment guidelines. DataGauge also provides an initial theory-based guidance framework that ties the DNM to DQ testing methods for each DQ dimension to aid the design of DQ assessments. This framework can be augmented with existing DQ guidelines to enable systematic assessment. DataGauge sets the stage for future systematic DQ assessment research by defining an assessment process capable of adapting to a broad range of clinical datasets and secondary uses. Defining DataGauge also opens new research directions such as DQ theory integration, DQ requirements portability research, DQ assessment tool development and DQ assessment tool usability.
Keywords: Clinical data quality; clinical and translational science; data quality assessment; model-driven development; secondary use of clinical data
Year: 2019; PMID: 31367649; PMCID: PMC6659577; DOI: 10.5334/egems.286
Source DB: PubMed; Journal: EGEMS (Wash DC); ISSN: 2327-9214
Figure 1. DataGauge, an iterative analysis-specific DQ assessment method for the secondary use of clinical data. This process defines the general stages and steps for analysis-specific DQ assessment using data models and an analysis-specific DQ standard.
Figure 2. Evolution of the data needs model for the purpose of assessing a relationship between prednisone and weight gain using repurposed clinical data. This data model defines the data needs for the evaluation of an association between prednisone and weight gain. Panels (a), (b) and (c) show the three versions of the DNM, one for each iteration. Note how the first DNM (a) obscures the observations of interest and their relationships, whereas the third (c) makes these explicit and makes it possible to specify cardinality requirements among them.
DQ requirement development guidance table. Integrated and modified from Wang & Strong’s classification of data quality dimensions (1996) [39] and Borek et al.’s classification of data quality assessment methods (2011) [44].
| Data Granularity Levels / Data Quality Dimensions | Correctness and Plausibility | Completeness | Concordance | Representation | Timeliness |
|---|---|---|---|---|---|
| | Domain Analysis, Data Validation, Lexical Analysis | Domain Analysis, Lexical Analysis | Domain Analysis | Column Analysis, Lexical Analysis, Schema Matching | Domain Analysis |
| | Column Analysis, Data Validation, Semantic Profiling | Column Analysis, Domain Analysis | Column Analysis, Data Validation | Column Analysis, Schema Matching | Column Analysis, Domain Analysis |
| | Domain Analysis, Semantic Profiling | Domain Analysis, Semantic Profiling | Domain Analysis, Semantic Profiling | Domain Analysis, Schema Matching | Domain Analysis, Semantic Profiling |
| | Domain Analysis | Domain Analysis, Column Analysis | Column Analysis, Semantic Profiling | Schema Matching | Semantic Profiling, Domain Analysis |
| | Semantic Profiling, PK/FK Analysis, Column Analysis | Domain Analysis, Semantic Profiling | Domain Analysis, PK/FK Analysis, Semantic Profiling | Column Analysis, PK/FK Analysis, Semantic Profiling, Schema Matching | Semantic Profiling, Domain Analysis |
| | Semantic Profiling, Domain Analysis, Column Analysis | Domain Analysis, Semantic Profiling | Semantic Profiling, Domain Analysis | Column Analysis, Schema Matching, Semantic Profiling | Semantic Profiling, Domain Analysis |
DQ requirement examples as they were generated with their respective percentage of compliance. The requirements became more specific and analysis-specific with each iteration.
| Iteration | DQ Dimension | Variable Granularity | Variable(s) | Analysis Specific | Requirement | DQ assessment method | DQ Result (% Compliance or Pass/Fail) |
|---|---|---|---|---|---|---|---|
| 1 | Accuracy | Value | Gender | No | In {‘M’,‘F’,‘U’} | Data Validation | 99.99 |
| | Accuracy | Value | WeightValue | No | >0 | Range Checking | 92.65 |
| | Believability | Value | WeightValue | No | <400 | Range Checking | 99.95 |
| | Accuracy | Value | Strength | No | >0 | Range Checking | 97.37 |
| | Believability | Value | Strength | No | <2 × [Max dose] | Domain Analysis | 100 |
| | Accuracy | Value | Dose | No | >0 | Range Checking | 51.68 |
| | Believability | Value | Dose | No | <2 × [Max pills at min strength] | Domain Analysis | 100 |
| | Accuracy | Value | Refills | No | >=0 | Range Checking | 100 |
| 2 | Accuracy | Value | WeightTime | No | >[System Installation Date] | Data Validation | 100 |
| | Accuracy | Column | PatientID | No | Unique | Column Analysis | 100 |
| | Concordance | Row | WeightTime, DoB | No | Timestamp > DoB | Domain Analysis | 100 |
| | Concordance | Row | PrescDTTM, DoB | No | PrescDTTM > DoB | Domain Analysis | 100 |
| | Concordance | Table | PatientID, WeightTime, WeightValue | Yes | Patient weights on prescription date are less than 2% apart | Domain Analysis | 92.45 |
| | Completeness | Table | PatientID, WeightValue | Yes | 2 weight measurements per patient | Domain Analysis | 85.92 |
| | Completeness | Line | PatientID, WeightTime | Yes | Patient has weight measurement on prescription date | Domain Analysis | 97.54 |
| | Timeliness | Table | PatientID, WeightTime | Yes | Patient has second weight measure within 4 months of prescription | Domain Analysis | 48.62 |
| 3 | Amount of data | Table | Strength, Dose, Days, Refills | Yes | Can calculate total milligrams prescribed for 50% of prescriptions | Domain Analysis | Failed |
| | Amount of data | Table | Patient, PRN | Yes | Less than 25% PRN prescriptions | Domain Analysis | Passed |
| | Amount of data | Dataset | PatientID, WeightTime | Yes | 50% patients 2 weight measures within 4 months of first prescription | Domain Analysis | Failed |
| | Completeness | Dataset | PatientID, WeightValue, WeightTime, PrescriptionTable | Yes | Patients with at least 2 unflawed weights after an unflawed prescription | Domain Analysis | 13.1 |
| | All | Dataset | All Variables | No | Patient records with no general DQ flaw | Domain Analysis | 2.93 |
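
To make the "% Compliance" column concrete, the following is a minimal sketch of how a few of the requirements in the table above could be turned into executable checks. It is not the authors' implementation. The helper name `compliance`, the in-memory record layout, and all data values are assumptions made for illustration; only the variable names (`Gender`, `WeightValue`, `WeightTime`, `DoB`, `PatientID`) and the requirement expressions come from the table. The requirement "2 weight measurements per patient" is read here as "at least two", which is also an assumption.

```python
from collections import Counter
from datetime import date

# Hypothetical toy records. Field names follow the table above;
# every value is invented for illustration only.
patients = [
    {"PatientID": 1, "Gender": "F", "DoB": date(1960, 5, 1)},
    {"PatientID": 2, "Gender": "M", "DoB": date(1975, 3, 12)},
    {"PatientID": 3, "Gender": "X", "DoB": date(1980, 7, 30)},
]
weights = [
    {"PatientID": 1, "WeightTime": date(2015, 1, 10), "WeightValue": 82.0},
    {"PatientID": 1, "WeightTime": date(2015, 4, 2), "WeightValue": 85.5},
    {"PatientID": 2, "WeightTime": date(2016, 6, 1), "WeightValue": 0.0},
    {"PatientID": 3, "WeightTime": date(1979, 1, 1), "WeightValue": 70.0},
]


def compliance(records, predicate):
    """Return the percentage (0-100) of records that satisfy the predicate."""
    if not records:
        raise ValueError("cannot compute compliance on an empty record set")
    passed = sum(1 for record in records if predicate(record))
    return 100.0 * passed / len(records)


# Value-level checks (iteration 1): each value is tested on its own.
gender_valid = compliance(patients, lambda p: p["Gender"] in {"M", "F", "U"})
weight_positive = compliance(weights, lambda w: w["WeightValue"] > 0)

# Row-level concordance (iteration 2): WeightTime must fall after DoB,
# which requires looking up the patient that owns each weight record.
dob_by_patient = {p["PatientID"]: p["DoB"] for p in patients}
weight_after_birth = compliance(
    weights, lambda w: w["WeightTime"] > dob_by_patient[w["PatientID"]]
)

# Table-level completeness (iteration 2): at least two weights per patient.
weights_per_patient = Counter(w["PatientID"] for w in weights)
two_weights = compliance(
    patients, lambda p: weights_per_patient[p["PatientID"]] >= 2
)

if __name__ == "__main__":
    print(f"Gender in allowed set:   {gender_valid:.2f}%")
    print(f"WeightValue > 0:         {weight_positive:.2f}%")
    print(f"WeightTime > DoB:        {weight_after_birth:.2f}%")
    print(f">= 2 weights per patient: {two_weights:.2f}%")
```

Expressing every requirement as a predicate over a record set keeps the value-, row- and table-level checks uniform: the granularity only changes which records are iterated and what context the predicate needs, not how compliance is computed.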