| Literature DB >> 25992385 |
Michael G Kahn, Jeffrey S Brown, Alein T Chun, Bruce N Davidson, Daniella Meeker, Patrick B Ryan, Lisa M Schilling, Nicole G Weiskopf, Andrew E Williams, Meredith Nahm Zozus.
Abstract
INTRODUCTION: Poor data quality can be a serious threat to the validity and generalizability of clinical research findings. The growing availability of electronic administrative and clinical data is accompanied by a growing concern about the quality of these data for observational research and other analytic purposes. Currently, there are no widely accepted guidelines for reporting data quality results that would enable investigators and consumers to independently determine if a data source is fit for use to support analytic inferences and reliable evidence generation.
MODEL AND METHODS: We developed a conceptual model that captures the flow of data from the data originator across successive data stewards and finally to the data consumer. This "data lifecycle" model illustrates how data quality issues can result in data being returned to previous data custodians. We highlight the potential risks of poor data quality to clinical practice and research results. Because of the need to ensure transparent reporting of data quality issues, we created a unifying data quality reporting framework and a complementary set of 20 data quality reporting recommendations for studies that use observational clinical and administrative data for secondary analysis. We obtained stakeholder input on the perceived value of each recommendation by soliciting public comments via two face-to-face meetings of informatics and comparative-effectiveness investigators, through multiple public webinars targeted to the health services research community, and with an open-access online wiki.
RECOMMENDATIONS: Our recommendations propose reporting on both general and analysis-specific data quality features.
The goals of these recommendations are to improve the reporting of data quality measures for studies that use observational clinical and administrative data, to ensure transparency and consistency in computing data quality measures, and to facilitate best practices and trust in new clinical discoveries based on secondary use of observational data.
Keywords: Informatics; comparative effectiveness; data use and quality; observational data; research networks
Year: 2015 PMID: 25992385 PMCID: PMC4434997 DOI: 10.13063/2327-9214.1052
Source DB: PubMed Journal: EGEMS (Wash DC) ISSN: 2327-9214
Figure 1. Chain of Data Stewardship with Key Data Stewards
Notes: Dashed lines represent data quality issues referred back to previous data stewards.
Data Quality Assessment Documentation and Reporting Recommendations
| Data origin | 1 | A description of the source of the original or raw data prior to any subsequent processing or transformation for secondary use. Examples would be “clinical practices via AllScripts EHR 2009,” “interviewer-administered survey,” or “claim for reimbursement.” |
| Data capture method | 2 | A description of the technology used to record the data values in electronic format. Examples would be “EHR screen entry via custom form,” “automated instrument upload,” and “interactive voice response (IVR).” |
| Original collection purpose | 3 | A description of the original context in which data were collected. Examples would be “clinical care and operations,” “reimbursement,” or “research”—and in which kinds of facilities data were collected—such as “ambulatory clinic,” “same-day surgery clinic,” and “clinical research center.” |
| Data steward | 4 | A description of the type of organization responsible for obtaining and managing the target data set. Examples could be “PBRN,” “Registry,” “Medical group practice,” and “State agency.” |
| Database model/data set structure | 5 | A description of how the data tables and variables are structured and linked in the target database or data set. Includes information on variable types (integer, date, string), min/max ranges if defined, and allowed values for enumerated categorical variables. Includes rules for mandatory/optional fields (variables), especially for fields used to link rows across tables. |
| Data dictionary/data set definitions | 6 | A description of data definitions used for data elements, including the URL to documentation if available on the Internet, that provides table- and field-level descriptions of data types and content for each element, and any required context for interpreting data within a patient or across the population. Whereas Recommendation #5 focuses on how the data are structured, this recommendation focuses on what the data elements mean and how they should be interpreted. |
| Data extraction specifications, including use of natural language processing to extract variables from text documents | 7 | Documentation on how the target data were obtained from the source data. Examples would be “direct data entry by medical personnel,” “indirect data entry via medical record chart abstraction guidelines,” and “natural language processing algorithms.” Should include the URL to the documentation of the data creation specifications if available on the Internet. |
| Mappings from original values to standardized values | 8 | Documentation on how original data values were transformed to conform to the target data model format. Documentation should list source values and describe the logic or mappings used to transform from the original source to the required target values. |
| Data management organization’s data transformation routines, including constructed variables | 9 | Documentation of any additional data alterations that were performed by the data management team in creating the final data set, such as replacing missing values by imputed values, removal of extreme values, and creation of additional computed values, such as BMI from raw height and weight observations. Should include the URL to documentation if available on the Internet. The documentation should allow an independent reader to trace a value in the target data set to the original source value(s) and should explain all operations performed on the data. |
| Data processing validation routines | 10 | Documentation of all data validation rules to which the data were subjected. Rules should identify both data elements and validation algorithms. Examples include comparisons of row counts between source and target data sets and an explanation for any differences in row count or documentation, and a listing of differences in the distribution of categorical data values across source-to-target mappings. Should include the URL to documentation if available on the Internet. |
| Audit trail | 11 | Documentation of all changes made to data values, user/system making the change and date/time of the change in the process of “cleaning” a data set prior to use. Reason for the change should be evident from data transformation routines or documented issues (e.g., correction of isolated error, replacement of missing values with standardized “missing value” flag). |
| Data format | 12 | For required data variables, verify the format, proper storage, and that required elements are not missing. Examples include verifying that floating point values are not rounded to integer values, that conversions across units of measure are correct, and that precision and rounding rules are as expected based on transformations. |
| Single element data descriptive statistics | 13 | For each variable, calculate the following descriptive statistics: availability (number/% missing); for continuous variables: min, max, mean, median, range, percentiles; for categorical variables: frequencies and proportions by category; if a specific distribution is anticipated, report on goodness-of-fit tests. |
| Temporal constraints | 14 | Evaluate whether expected temporal constraints are violated. Examples include: start dates and times occur before stop dates and times; distribution of intervals between successive measurements; for time series, changes in adjacent values and expected directionality of changes meet expectations; and conformance to state transition/sequencing rules. |
| Multiple variables cross validations/consistency | 15 | Across two or more data variables that are known to be linked: Report violations of data model cardinality rules. A cardinality rule determines when zero, one, or more than one data rows in one table can be linked to one or more data rows in another table. Report violations of data model primary/foreign key rules. A primary/foreign key rule requires that a row in one table (the foreign key) point to an existing row in another table (the primary key). Report violations of cross-variable dependency rules. A cross-variable dependency states that one row or value can only exist if another row or value exists; for example, the state of pregnancy should exist only if the patient sex is female. Report violations of co-occurrence rules; systolic and diastolic blood pressures should always occur as a pair. Report violations of co-measurement rules (two distinct measurements of the same observation); age and date of birth should agree. Report violations of mutual exclusivity rules; a patient should not be recorded as being dead and alive at the same time. |
| Data cleansing/customization | 16 | Analytic- or study-specific additions to Item #9. |
| Data quality checks of key variables used for cohort identification | 17 | Analytic- or study-specific additions to Items #13–15 that focus on variables that identify cohorts, detect outcomes, define exposures, and participate as covariates. Where these variables may be affected by other related (perhaps causal) variables, those influential variables should also be included. The list of variables contained in these assessments will vary by intended analysis/clinical study. However, variables assessed should be organized according to the following categories: cohort, outcome, exposure, confounding. |
| Data quality checks of key variables used for outcome categorization | 18 | See Recommendation #17. |
| Data quality checks of key variables used to classify exposure | 19 | See Recommendation #17. |
| Data quality checks of key confounding variables | 20 | See Recommendation #17. |
Notes: “Source data” refers to the original data as collected by the data originator. “Target data” refers to the data as received by the data user.
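Recommendations #13 and #14 describe checks that can be computed directly. The sketch below is a minimal illustration, not part of the paper's framework: the record layout, field names, and values are all hypothetical, standing in for rows of a target data set.

```python
from statistics import mean, median
from datetime import date

# Hypothetical toy records standing in for rows of a target data set.
rows = [
    {"sbp": 120, "visit_start": date(2014, 1, 5), "visit_end": date(2014, 1, 5)},
    {"sbp": 135, "visit_start": date(2014, 3, 2), "visit_end": date(2014, 3, 2)},
    {"sbp": None, "visit_start": date(2014, 6, 9), "visit_end": date(2014, 6, 1)},
]

def describe_continuous(rows, field):
    """Recommendation #13: availability plus min/max/mean/median for one variable."""
    values = [r[field] for r in rows if r[field] is not None]
    n_missing = len(rows) - len(values)
    return {
        "n": len(rows),
        "missing": n_missing,
        "pct_missing": 100.0 * n_missing / len(rows),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

def start_before_end_violations(rows, start_field, end_field):
    """Recommendation #14: flag row indices whose start date falls after the end date."""
    return [i for i, r in enumerate(rows) if r[start_field] > r[end_field]]

sbp_stats = describe_continuous(rows, "sbp")
violations = start_before_end_violations(rows, "visit_start", "visit_end")
print(sbp_stats, violations)
```

Reporting would then publish these summaries per variable, so a data consumer can judge fitness for use without access to the raw records.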
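The cross-variable consistency rules in Recommendation #15 can likewise be expressed as executable checks. This is a sketch under assumed, hypothetical field names (`sex`, `pregnant`, `sbp`, `dbp`) showing a dependency rule and a co-occurrence rule; the records are illustrative only.

```python
# Hypothetical patient records; field names and values are illustrative assumptions.
patients = [
    {"id": 1, "sex": "F", "pregnant": True,  "sbp": 118,  "dbp": 76},
    {"id": 2, "sex": "M", "pregnant": True,  "sbp": 130,  "dbp": None},
    {"id": 3, "sex": "F", "pregnant": False, "sbp": None, "dbp": None},
]

def dependency_violations(patients):
    """Cross-variable dependency: pregnancy should be recorded only when sex is female."""
    return [p["id"] for p in patients if p["pregnant"] and p["sex"] != "F"]

def co_occurrence_violations(patients):
    """Co-occurrence: systolic and diastolic pressures should occur as a pair."""
    return [p["id"] for p in patients
            if (p["sbp"] is None) != (p["dbp"] is None)]

print(dependency_violations(patients))   # ids where pregnancy is recorded for non-female sex
print(co_occurrence_violations(patients))  # ids with exactly one of the paired measurements
```

Each check returns the offending record identifiers, which supports the paper's goal of transparent, reproducible reporting of rule violations rather than silent data cleansing.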