| Literature DB >> 33166397 |
Jiang Bian, Tianchen Lyu, Alexander Loiacono, Tonatiuh Mendoza Viramontes, Gloria Lipori, Yi Guo, Yonghui Wu, Mattia Prosperi, Thomas J George, Christopher A Harle, Elizabeth A Shenkman, William Hogan.
Abstract
OBJECTIVE: To synthesize data quality (DQ) dimensions and assessment methods of real-world data, especially electronic health records, through a systematic scoping review and to assess the practice of DQ assessment in the national Patient-centered Clinical Research Network (PCORnet).
Keywords: PCORnet; clinical data research network; data quality assessment; electronic health record; real-world data
Year: 2020 PMID: 33166397 PMCID: PMC7727392 DOI: 10.1093/jamia/ocaa245
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. The flow chart of the literature review process: (A) individual studies identified from Chan et al (2010) and Weiskopf et al (2013a), and (B) new data quality-related articles (both individual studies and review/framework articles) published from 2012 to February 2020.
Figure 2. The numbers of studies by (A) data type, (B) DQ dimension, and (C) DQ assessment method.
Figure 3. A summary of existing DQ dimensions and DQ assessment methods.
Data quality dimensions summarized from existing DQ frameworks and reviews
| ID | Dimension | Definition | Source frameworks/reviews |
|---|---|---|---|
| D1 | Currency | Data were considered current if they were recorded in the EHR within a reasonable period of time following measurement or, alternatively, if they were representative of the patient state at a desired time of interest. Weiskopf et al (2013a) | Bloland et al (2019), |
| D2 | Correctness/Accuracy | EHR data were considered correct when the information they contained was true. Weiskopf et al (2013a) | Bloland et al (2019), |
| D3 | Plausibility | Plausibility focuses on actual values as a representation of a real-world object or conceptual construct by examining the distribution and density of values or by comparing multiple values that have an expected relationship to each other. Kahn et al (2016) | Henley-Smith et al (2019), |
| D3-1* | Uniqueness Plausibility | The Uniqueness subcategory seeks to determine if objects (entities, observations, facts) appear multiple times in settings where they should not be duplicated or cannot be distinguished within a database (Verification) or when compared with an external reference (Validation). Kahn et al (2016) | Henley-Smith et al (2019), |
| D3-2* | Atemporal Plausibility | Atemporal Plausibility seeks to determine if observed data values, distributions, or densities agree with local or “common” knowledge (Verification) or from comparisons with external sources that are deemed to be trusted or relative gold standards (Validation). Kahn et al (2016) | Henley-Smith et al (2019), |
| D3-3* | Temporal Plausibility | Temporal plausibility seeks to determine if time-varying variables change values as expected based on known temporal properties or across 1 or more external comparators or gold standards. Kahn et al (2016) | Henley-Smith et al (2019), |
| D4 | Completeness | Completeness focuses on features that describe the frequencies of data attributes present in a data set without reference to data values. Kahn et al (2016) | Henley-Smith et al (2019), |
| D5 | Concordance | Is there agreement between elements in the EHR, or between the EHR and another data source? Weiskopf et al (2013a) | Bloland et al (2019), |
| D6 | Comparability | Comparability is similarity in data quality and availability for specific data elements used in a measure across different entities, such as health plans or physicians or data sources. Chan et al (2010) | Terry et al (2019), |
| D7 | Conformance | Whether the values that are present meet syntactic or structural constraints. Kahn et al (2016) | Henley-Smith et al (2019), |
| D7-1* | Value Conformance | Agreement with a prespecified, constraint-driven data architecture. Kahn et al (2016) | Henley-Smith et al (2019), |
| D7-2* | Relational Conformance | Agreement with additional structural constraints imposed by the physical database structures that store data values. Kahn et al (2016) | Henley-Smith et al (2019), |
| D7-3* | Computational Conformance | If computations used to create derived values from existing variables yield the intended results either within a data set (Verification) or between data sets (Validation), when programs are based on identical specifications. Kahn et al (2016) | Henley-Smith et al (2019), |
| D8 | Flexibility | The extent to which data are expandable, adaptable, and easily applied to many tasks. Wang et al (1996) | Johnson et al (2015), |
| D9 | Relevance | The extent to which information is applicable and helpful for the task at hand. Liaw et al (2013) | Bloland et al (2019), |
| D10 | Usability/Ease-of-Use | A measure of the degree to which data can be accessed and used and the degree to which data can be updated, maintained, and managed. McGilvray (2008) | Liaw et al (2013) |
| D11 | Security | Personal data are not corrupted, and access is suitably controlled to ensure privacy and confidentiality. Liaw et al (2013) | Liaw et al (2013), |
| D12 | Information Loss and Degradation | The loss and degradation of information content over time. Zozus et al (2014) | Bloland et al (2019), |
| D13 | Consistency | Pertains to the constancy of the data, at the desired degree of detail for the study purpose, within and across databases and data sets. Feder SL (2018) | Feder SL (2018), |
| D14 | Understandability/Interpretability | The ease with which a user can understand the data. Smith et al (2017) | Smith et al (2017), |
*D3-1, D3-2, and D3-3 are subcategories of D3; D7-1, D7-2, and D7-3 are subcategories of D7.
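The dimensions above are conceptual definitions rather than executable rules. As a rough sketch only, and not something drawn from the reviewed frameworks, two of them (D4 Completeness and D3-2 Atemporal Plausibility) could be operationalized against a tabular EHR extract along the following lines; the table layout, column names, and plausibility range are hypothetical.

```python
# Illustrative sketch only; the DataFrame layout, column names, and the
# plausibility range for height are assumptions, not from the reviewed papers.
import pandas as pd

def completeness(df: pd.DataFrame, column: str) -> float:
    """D4 Completeness: fraction of records with a non-missing value."""
    return 1.0 - df[column].isna().mean()

def atemporal_plausibility(df: pd.DataFrame, column: str,
                           lower: float, upper: float) -> float:
    """D3-2 Atemporal Plausibility: fraction of recorded values that fall
    inside a range supplied by local or 'common' clinical knowledge."""
    values = df[column].dropna()
    return float(values.between(lower, upper).mean())

vitals = pd.DataFrame({
    "PATID": ["p1", "p2", "p3", "p4"],
    "HT_CM": [172.0, None, 15.0, 180.5],   # 15 cm is implausible for an adult
})

print(completeness(vitals, "HT_CM"))                       # 0.75
print(atemporal_plausibility(vitals, "HT_CM", 120, 220))   # ~0.67
```

Following Kahn et al (2016), the same computation counts as verification when the reference range comes from local or "common" knowledge and as validation when it comes from an external, trusted comparator.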
Data quality assessment methods summarized from existing DQ frameworks and reviews
| ID | Method | Definition | Source frameworks/reviews |
|---|---|---|---|
| M1 | Log review | Information on the actual data entry practices (eg, dates, times, edits) is examined. Weiskopf et al (2013a) | Bloland et al (2019), |
| M2 | Element presence | A determination is made as to whether or not desired or expected data elements are present. Weiskopf et al (2013a) | Henley-Smith et al (2019), |
| M3 | Data element agreement | Two or more elements within an EHR are compared to see if they report the same or compatible information. Weiskopf et al (2013a) | Henley-Smith et al (2019), |
| M4 | Validity check | Whether observed data values or densities agree with “common” or external knowledge, and whether time-varying variables change values as expected based on known temporal properties or external knowledge. Kahn et al (2016) | Henley-Smith et al (2019), |
| M5 | Conformance check | Check the uniqueness of objects that should not be duplicated, and the data set's agreement with prespecified or additional structural constraints. Kahn et al (2016) | Henley-Smith et al (2019), |
| M6 | Data source agreement | Data from the EHR are compared with data from another source to determine if they are in agreement. Weiskopf et al (2013a) | Bloland et al (2019), |
| M7 | Distribution comparison | Distributions or summary statistics of aggregated data from the EHR are compared with the expected distributions for the clinical concepts of interest. Weiskopf et al (2013a) | Terry et al (2019), |
| M8 | Gold standard | Data values and their presence in the data set match the values and presence in trusted reference standards or data sets. Data rigorously abstracted from paper records (eg, manual chart review) can serve as a gold standard. | Bloland et al (2019), |
| M9 | Qualitative assessment | Descriptive qualitative measures collected through group interviews and interpreted with grounded theory. Liaw et al (2013) | Liaw et al (2013) |
| M10 | Security analyses | Analyses of access reports to examine whether there are security issues. Liaw et al (2013) | Liaw et al (2013) |
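Several of the methods above reduce to simple computations over a tabular data set. The sketch below is illustrative only (the table layout, reference date, and drift tolerance are assumptions, not prescribed by any reviewed framework) and shows M2 Element presence, M4 Validity check, and M7 Distribution comparison in that spirit.

```python
# Illustrative sketch; table layout and thresholds are assumed, not prescribed
# by the reviewed frameworks.
import pandas as pd

def element_presence(df: pd.DataFrame, required: list[str]) -> list[str]:
    """M2 Element presence: list required columns that are missing."""
    return [col for col in required if col not in df.columns]

def future_date_rate(df: pd.DataFrame, date_col: str, as_of: str) -> float:
    """M4 Validity check: fraction of records dated after a reference date."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return float((dates > pd.Timestamp(as_of)).mean())

def mean_drift(df: pd.DataFrame, column: str,
               expected_mean: float, tolerance: float) -> bool:
    """M7 Distribution comparison: flag when the observed mean departs from
    the expected value by more than the tolerance."""
    return abs(df[column].mean() - expected_mean) > tolerance

encounters = pd.DataFrame({
    "PATID": ["p1", "p2", "p3"],
    "ADMIT_DATE": ["2019-03-02", "2031-01-01", "2018-11-20"],  # one future date
    "SYSTOLIC": [118, 240, 121],
})

print(element_presence(encounters, ["PATID", "ADMIT_DATE", "DISCHARGE_DATE"]))
print(future_date_rate(encounters, "ADMIT_DATE", "2020-02-29"))               # ~0.33
print(mean_drift(encounters, "SYSTOLIC", expected_mean=120, tolerance=15))    # True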
Mapping PCORnet data characterization checks to the 14 DQ dimensions and 10 DQ assessment methods
| Data Check (DC) | Working description | Status | Method | Dimension |
|---|---|---|---|---|
| DC 1.01 | Required tables are not present | Since version 1 | M2 | D4, D7 |
| DC 1.02 | Required tables are not populated | Since version 1 | M2 | D4, D7 |
| DC 1.03 | Required fields are not present | Since version 1 | M2 | D4, D7 |
| DC 1.04 | Required fields do not conform to data model specifications for data type, length, or name. | Since version 1 | M5 | D7-1, D7-2 |
| DC 1.05 | Tables have primary key definition errors | Since version 1 | M5 | D3-1, D7-2 |
| DC 1.06 | Required fields contain values outside of specifications | Since version 1 | M5 | D7-1 |
| DC 1.07 | Required fields have non-permissible missing values | Since version 1 | M2 | D4 |
| DC 1.08 | Tables contain orphan PATIDs | Added in version 2 | M2, M5 | D4, D5, D7-2 |
| DC 1.09 | Tables contain orphan ENCOUNTERIDs | Added in version 2 | M2, M5 | D4, D5, D7-2 |
| DC 1.10 | Replication errors between the ENCOUNTER, PROCEDURES and DIAGNOSIS tables | Added in version 2 | M5 | D3-1, D7-2 |
| DC 1.11 | > 5% of encounters are assigned to more than 1 patient | Added in version 3 | M5 | D3-1, D7-2 |
| DC 1.12 | Tables contain orphan PROVIDERIDs | Added in version 5 | M2, M5 | D4, D5, D7-2 |
| DC 1.13 | More than 5% of ICD, CPT, LOINC, RXCUI, or NDC codes do not conform to the expected length or content | Added in version 6 | M5 | D7-1, D7-2 |
| DC 1.14 | Patients in the DEMOGRAPHIC table are not in the HASH_TOKEN table | Added in version 8 | M2, M5 | D4, D5, D7-2 |
| DC 2.01 | More than 5% of records have future dates | Since version 1 | M4 | D2, D3-3 |
| DC 2.02 | > 10% of records fall into the lowest or highest categories of age, height, weight, diastolic blood pressure, systolic blood pressure, or dispensed days supply | Since version 1 | M4, M7 | D3-2 |
| DC 2.03 | More than 5% of patients have illogical date relationships | Added in version 2 | M4 | D2, D3-3 |
| DC 2.04 | The average number of encounters per visit is > 2.0 for inpatient (IP), emergency department (ED), or ED to inpatient (EI) encounters | Added in version 2 | M4, M7 | D3-2 |
| DC 2.05 | More than 5% of results for selected laboratory tests do not have the appropriate specimen source | Added in version 3 | M4, M5 | D4, D7 |
| DC 2.06 | The median lab result value for selected tests is an outlier. | Added in version 5 | M4 | D3-2 |
| DC 2.07 | The average number of principal diagnoses per encounter is above threshold (2.0 for inpatient [IP] and ED to inpatient [EI]) | Added in version 5 | M4, M7 | D3-2 |
| DC 2.08 | The monthly volume of encounter, diagnosis, procedure, vital, prescribing, or laboratory records is an outlier. | Added in version 7 | M4, M7 | D3-2 |
| DC 3.01 | The average number of diagnoses records with known diagnosis types per encounter is below threshold (1.0 for ambulatory [AV], inpatient [IP], emergency department [ED], or ED to inpatient [EI] encounters) | Since version 1 | M4, M7 | D3-2 |
| DC 3.02 | The average number of procedure records with known procedure types per encounter is below threshold (0.75 for ambulatory [AV] encounters, 0.75 for emergency department [ED] encounters, 1.00 for ED to inpatient [EI] encounters, and 1.00 for inpatient [IP] encounters) | Since version 1 | M4, M7 | D3-2 |
| DC 3.03 | More than 10% of records have missing or unknown values for the following fields: BIRTH_DATE, SEX, DISCHARGE_DISPOSITION, among others | Since version 1 | M2 | D4 |
| DC 3.04 | Less than 50% of patients with encounters have DIAGNOSIS records | Added in version 2 | M2 | D4 |
| DC 3.05 | Less than 50% of patients with encounters have PROCEDURES records | Added in version 2 | M2 | D4 |
| DC 3.06 | More than 10% of inpatient (IP) or ED to inpatient (EI) encounters with any diagnosis do not have a principal diagnosis | Added in version 2 | M2 | D4 |
| DC 3.07 | Encounters, diagnoses, or procedures in an ambulatory (AV), emergency department (ED), ED to inpatient (EI), or inpatient (IP) setting are less than 75% complete 3 months prior to the current month | Added in version 3 | M2 | D1, D4 |
| DC 3.08 | Less than 80% of prescribing orders are mapped to a RXNORM_CUI which fully specifies the ingredient, strength and dose form | Added in version 3 | M2 | D4 |
| DC 3.09 | Less than 80% of laboratory results are mapped to LAB_LOINC | Added in version 3 | M2 | D4 |
| DC 3.10 | Less than 80% of quantitative results for tests mapped to LAB_LOINC fully specify the normal range | Added in version 3 | M2 | D4 |
| DC 3.11 | Vital, prescribing, or laboratory records are less than 75% complete 3 months prior to the current month | Added in version 4 | M2 | D1, D4 |
| DC 3.12 | Less than 80% of quantitative results for tests mapped to LAB_LOINC fully specify the RESULT_UNIT | Added in version 5 | M2 | D4 |
| DC 3.13 | The percentage of patients with selected lab tests is below threshold | Added in version 8 | M4, M7 | D3-2, D4 |
| DC 4.01 | More than a 5% decrease in the number of patients or records in a CDM table | Added in version 6 | M2 | D12 |
| DC 4.02 | More than a 5% decrease in the number of patients or records for diagnosis, procedures, labs or prescriptions during an ambulatory (AV), other ambulatory (OA), emergency department (ED), or inpatient (IP) encounter | Added in version 6 | M2 | D12 |
| DC 4.03 | More than a 5% decrease in the number of records or distinct codes for ICD9 or ICD10 diagnosis or procedure codes or CPT/HCPCS procedure codes | Added in version 6 | M2 | D12 |
| DC 4.01 | DataMart's DIAGNOSIS table has a minimum ADMIT_DATE after January 2010. DataMarts should include data that can be well curated. When possible, DataMarts should include historical data from no later than 2010 to the present. | Since version 1, but removed in version 2 | M2 | D1 |
| DC 4.02 | DataMart's PROCEDURES table has a minimum ADMIT_DATE after January 2010. DataMarts should include data that can be well curated. When possible, DataMarts should include historical data from no later than 2010 to the present. | Since version 1, but removed in version 2 | M2 | D1 |
| DC 4.03 | DataMart's VITAL table has a minimum MEASURE_DATE after January 2010. DataMarts should include data that can be well curated. When possible, DataMarts should include historical data from no later than 2010 to the present. | Since version 1, but removed in version 2 | M2 | D1 |
| DC 4.04 | DataMart does not include all of the following encounter types: ambulatory (AV), inpatient (IP or EI), and emergency department (ED or EI) encounters. This complement of encounter types is not required but may be important for some research studies. | Since version 1, but removed in version 2 | M2 | D4 |
| DC 4.05 | DataMart has obfuscated or imputed | Since version 1, but removed in version 2 | M10 | D11 |
Data in PCORnet follow the PCORnet common data model (CDM). Both the PCORnet CDM and the PCORnet data check specifications are available at https://pcornet.org/data-driven-common-model/.
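The data checks themselves are specified and distributed through PCORnet's data curation process rather than reproduced in the article. Purely as an illustration (not the official PCORnet implementation; the permissible value set below is assumed, not quoted from the CDM specification), checks in the spirit of DC 2.01 and DC 1.06 might look like this:

```python
# Illustrative only: not the official PCORnet data curation code. Table and
# column names loosely mimic the CDM ENCOUNTER/DEMOGRAPHIC tables; the
# permissible value set is an assumption, not quoted from the CDM.
import pandas as pd

FUTURE_DATE_THRESHOLD = 0.05                      # DC 2.01: flag > 5% future-dated records
SEX_VALUES = {"F", "M", "A", "NI", "UN", "OT"}    # illustrative value set only

def dc_2_01_future_dates(encounter: pd.DataFrame, as_of: str) -> bool:
    """Flag the DataMart if more than 5% of ADMIT_DATEs lie in the future."""
    dates = pd.to_datetime(encounter["ADMIT_DATE"], errors="coerce")
    return float((dates > pd.Timestamp(as_of)).mean()) > FUTURE_DATE_THRESHOLD

def dc_1_06_out_of_spec(demographic: pd.DataFrame) -> float:
    """Return the fraction of SEX values outside the permitted value set."""
    observed = demographic["SEX"].dropna()
    return float((~observed.isin(SEX_VALUES)).mean())
```

Checks of this kind return one result per DataMart refresh; in practice the thresholds, value sets, and table definitions come from the PCORnet CDM and data check specifications linked above.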
The numbers of PCORnet data checks mapped to each DQ dimension and DQ assessment method
| DQ assessment method | Number of DCs | DQ dimension | Number of DCs |
|---|---|---|---|
| M1 Log review | 1 | D1 Currency | 9 |
| M2 Element presence | 25 | D2 Correctness/Accuracy | 2 |
| M3 Data element agreement | 0 | D3 Plausibility | 13 |
| M4 Validity check | 11 | D4 Completeness | 21 |
| M5 Conformance check | 11 | D5 Concordance | 4 |
| M6 Data source agreement | 0 | D6 Comparability | 0 |
| M7 Distribution comparison | 7 | D7 Conformance | 14 |
| M8 Gold standard | 0 | D8 Flexibility | 0 |
| M9 Qualitative assessment | 0 | D9 Relevance | 0 |
| M10 Security analyses | 1 | D10 Usability/Ease-of-Use | 0 |
| | | D11 Security | 1 |
| | | D12 Information Loss and Degradation | 3 |
| | | D13 Consistency | 0 |
| | | D14 Understandability/Interpretability | 0 |
Abbreviations: DC, data check; DQ, data quality.
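The counts above can be regenerated mechanically from the mapping table by exploding the multi-valued Method and Dimension cells and counting distinct data checks; the sketch below uses only three of the mapped checks, so its output does not reproduce the full totals. In the summary, subcategory labels such as D3-3 and D7-2 roll up into D3 and D7.

```python
# Minimal sketch; only three mapping rows (taken from the table above) are
# shown, so the resulting counts are partial.
import pandas as pd

mapping = pd.DataFrame({
    "DC":        ["DC 1.01", "DC 1.08", "DC 2.01"],
    "Method":    [["M2"], ["M2", "M5"], ["M4"]],
    "Dimension": [["D4", "D7"], ["D4", "D5", "D7-2"], ["D2", "D3-3"]],
})

# One row per (check, method) and per (check, dimension), then count checks.
by_method = mapping.explode("Method").groupby("Method")["DC"].nunique()
by_dimension = mapping.explode("Dimension").groupby("Dimension")["DC"].nunique()

print(by_method)       # M2 -> 2, M4 -> 1, M5 -> 1
print(by_dimension)    # D4 -> 2; D2, D3-3, D5, D7, D7-2 -> 1 each
```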