Julia Eaton, Ian Painter, Don Olson, William B Lober.
Abstract
Secondary use of clinical health data for near real-time public health surveillance presents challenges surrounding its utility due to data quality issues. Data used for real-time surveillance must be timely, accurate and complete if they are to be useful; if incomplete data are used for surveillance, understanding the structure of the incompleteness is necessary. Such data are commonly aggregated due to privacy concerns. The Distribute project was a near real-time influenza-like-illness (ILI) surveillance system that relied on aggregated secondary clinical health data. The goal of this work is to disseminate the data quality tools developed to gain insight into the data quality problems associated with these data. These tools apply in general to any system where aggregate data are accrued over time, and they were created through the end-user-as-developer paradigm. Each tool was developed during the exploratory analysis to gain insight into structural aspects of data quality. Our key finding is that data quality of partially accruing data must be studied in the context of accrual lag: the difference between the time an event occurs and the time data for that event are received, i.e. the time at which data become available to the surveillance system. Our visualization methods therefore revolve around visualizing dimensions of data quality affected by accrual lag, in particular the tradeoff between timeliness and completion, and the effects of accrual lag on accuracy. Accounting for accrual lag in partially accruing data is necessary to avoid misleading or biased conclusions about trends in indicator values and data quality.
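To make the definition of accrual lag and the timeliness versus completion tradeoff concrete, the following is a minimal Python sketch, not the tooling developed for the Distribute project. The record layout `(event_date, receipt_date, count)` and the function names `accrual_lag_days` and `completion_curve` are assumptions introduced here for illustration only.

```python
from collections import Counter
from datetime import date


def accrual_lag_days(event_date, receipt_date):
    """Accrual lag in days: receipt date minus event date."""
    return (receipt_date - event_date).days


def completion_curve(records, max_lag):
    """Cumulative proportion of all counts received by each lag 0..max_lag.

    `records` is a hypothetical layout: an iterable of
    (event_date, receipt_date, count) tuples of aggregated counts.
    Counts arriving after `max_lag` still contribute to the total,
    so the curve may end below 1.0.
    """
    counts_by_lag = Counter()
    for event_date, receipt_date, count in records:
        counts_by_lag[accrual_lag_days(event_date, receipt_date)] += count

    total = sum(counts_by_lag.values())
    if total == 0:
        raise ValueError("no counts to summarise")

    curve = []
    received_so_far = 0
    for lag in range(max_lag + 1):
        received_so_far += counts_by_lag.get(lag, 0)
        curve.append(received_so_far / total)
    return curve


# Made-up example: 100 visits for one event date, arriving in three uploads.
example_records = [
    (date(2015, 1, 1), date(2015, 1, 1), 50),  # lag 0
    (date(2015, 1, 1), date(2015, 1, 2), 30),  # lag 1
    (date(2015, 1, 1), date(2015, 1, 4), 20),  # lag 3
]
example_curve = completion_curve(example_records, max_lag=3)
```

In this made-up example, half of the data are available on the event date itself, while waiting three days yields the complete count; that is the tradeoff between timeliness and completion described above.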
Keywords: accrual lag; data quality; data visualization; incomplete data; partially accruing data; real-time surveillance; secondary-use data
Year: 2015 PMID: 27252794 PMCID: PMC4874726 DOI: 10.5210/ojphi.v7i3.6096
Source DB: PubMed Journal: Online J Public Health Inform ISSN: 1947-2579
Figure 1. Upload Pattern Plots. The horizontal axis represents receipt date and each vertical axis represents accrual lag in days.
Figure 2a. Stacklag Difference Plot. The horizontal axis represents the event date and each value on the vertical axis represents accrual lag in days. The time series plotted for each accrual lag represents the change in the total number of counts from the previous accrual lag.
Figure 3. Posterior mean from Bayesian change point detection method. The horizontal axis represents event date and the vertical axis represents the count. The scale of the vertical axis is intentionally suppressed for publication.
Figure 4. ILI ratio errors with 5th, 10th, 25th, 50th, 75th, 90th and 95th quantiles using the ratio error function. We used the 21 day lagged counts as final counts to avoid the effects of long backfills that sites periodically provided.
Figure 5. Lag histograms for three sites. The horizontal axis represents the proportion of data received; the vertical axis represents the accrual lag.
Figure 6. Summary completion curves for all sites.
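The quantity plotted in the stacklag difference plot (the change in the total count from the previous accrual lag) can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the input layout (a mapping from event date to a list of cumulative totals at lags 0, 1, 2, ...) and the function name `stacklag_differences` are hypothetical, and treating the lag 0 total as its own first difference is a choice made here.

```python
def stacklag_differences(cumulative_by_lag):
    """Per-lag change in the total count for each event date.

    `cumulative_by_lag` is a hypothetical layout mapping an event date
    to the list of cumulative totals observed at lags 0, 1, 2, ...
    Assumption: the lag 0 entry is reported as-is, since there is no
    earlier lag to subtract.
    """
    differences = {}
    for event_date, totals in cumulative_by_lag.items():
        if not totals:
            differences[event_date] = []
            continue
        # Difference between each total and the one at the previous lag.
        changes = [later - earlier for earlier, later in zip(totals, totals[1:])]
        differences[event_date] = [totals[0]] + changes
    return differences


# Made-up totals for one event date observed at lags 0 through 3.
example_totals = {"2015-01-01": [50, 80, 80, 100]}
example_differences = stacklag_differences(example_totals)
```

A zero at a given lag means no new data arrived for that event date on that day, and a large value at a long lag would correspond to a backfill.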