Rinette Badker1, Kierste Miller2, Chris Pardee3, Ben Oppenheim3, Nicole Stephenson3, Benjamin Ash3, Tanya Philippsen3,4, Christopher Ngoon3, Patrick Savage3, Cathine Lam3, Nita Madhav3.
Abstract
The proliferation of composite data sources tracking the COVID-19 pandemic emphasises the need for such databases during large-scale infectious disease events, as well as the potential pitfalls that arise when combining disparate data sources. Multiple organisations have attempted to standardise the compilation of disparate data from multiple sources during the COVID-19 pandemic. However, each composite data source can use a different approach to compile data and address data issues, with varying results.

We discuss some best practices for researchers endeavouring to create such compilations while discussing three key categories of challenges: (1) data dissemination, which includes discrepant estimates and varying data structures due to multiple agencies and reporting sources generating public health statistics on the same event; (2) data elements, where date formats and location names lack standardisation and differing spatial and temporal resolutions often create challenges when combining sources; and (3) epidemiological factors, including missing data, reporting lags, retrospective data corrections and changes to case definitions, which cannot easily be addressed by the data compiler but must be kept in mind when reviewing the data.

Efforts to reform the global health data ecosystem should bear such challenges in mind. Standards and best practices should be developed and incorporated to yield more robust, transparent and interoperable data. Since no standards exist yet, we have highlighted key challenges in creating a comprehensive spatiotemporal view of outbreaks from multiple, often discrepant, reporting sources and provided guidelines to address them. In general, we caution against an over-reliance on fully automated systems for integrating surveillance data and strongly advise that epidemiological experts remain engaged in the process of data assessment, integration, validation and interpretation to identify, diagnose and resolve data challenges.
© Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.
Keywords: COVID-19; epidemiology; public health
Year: 2021 PMID: 33958393 PMCID: PMC8103560 DOI: 10.1136/bmjgh-2021-005542
Source DB: PubMed Journal: BMJ Glob Health ISSN: 2059-7908
Figure 1. Screenshot of the New Jersey Department of Health interactive dashboard.32
Key types of data provided by each source included in the Metabiota composite (metrics may have changed over the course of the pandemic)*
| | Sources (total n=66) | Proportion of sources† |
| Multinational | 10 | 0.15 |
| National | 44 | 0.67 |
| State/province (admin 1) | 55 | 0.83 |
| Substate (admins 2 and 3, locality and sublocality) | 33 | 0.50 |
| Include probable counts? | 18 | 0.27 |
| Data overwritten when updated | 29 | 0.44 |
| Retrospectively update cases | 4 | 0.06 |
| Individual | 15 | 0.23 |
| Population level | 66 | 1.00 |
| Individual and population levels | 15 | 0.23 |
*As of 15 February 2021; statistics may change as the event is ongoing. A full breakdown of sources can be found in online supplemental table S1.
†Proportion of the 66 total sources represented in each category. Sources may appear in multiple categories, so proportions may not sum to 1; for example, some sources provide multiple geographical granularities.
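As a minimal sketch of the arithmetic behind the table, the snippet below recomputes the proportions for the four geographical-resolution rows and shows why they need not sum to 1. The counts are copied from the table; the variable names are illustrative only.

```python
# Counts of sources by geographical resolution, copied from the table above.
TOTAL_SOURCES = 66
geo_counts = {
    "Multinational": 10,
    "National": 44,
    "State/province (admin 1)": 55,
    "Substate": 33,
}

# Each proportion is the count divided by the 66 sources in the composite.
proportions = {name: round(n / TOTAL_SOURCES, 2) for name, n in geo_counts.items()}

# A source can report at several resolutions, so the counts overlap
# and their sum exceeds the number of sources.
overlap_total = sum(geo_counts.values())

print(proportions)
print(overlap_total, "category memberships across", TOTAL_SOURCES, "sources")
```

The four counts add up to 142 memberships for 66 sources, which is only possible if many sources report at more than one resolution.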
Figure 2. Drop in cumulative cases on 2 July 2020 due to data cleaning.33
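A compiler can flag this kind of downward revision by scanning a cumulative series for days on which the count falls. The sketch below is one simple way to do so, not the authors' method; the function name `find_cumulative_drops` and all case counts are hypothetical and made up for illustration.

```python
from datetime import date

def find_cumulative_drops(series):
    """Return (date, previous, current) for each day the cumulative count falls."""
    drops = []
    for (_, prev), (day, curr) in zip(series, series[1:]):
        if curr < prev:
            drops.append((day, prev, curr))
    return drops

# Hypothetical cumulative case counts; the values are invented for illustration.
cumulative = [
    (date(2020, 6, 30), 1000),
    (date(2020, 7, 1), 1040),
    (date(2020, 7, 2), 990),   # downward revision, e.g. after data cleaning
    (date(2020, 7, 3), 1015),
]

for day, prev, curr in find_cumulative_drops(cumulative):
    print(f"{day}: cumulative count fell from {prev} to {curr} ({curr - prev:+d})")
```

A flagged date is a prompt for expert review, not an automatic correction: the compiler still has to decide, and document, whether to keep the drop or redistribute it over earlier dates.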
Figure 3. Time series of the number of cases reported by day in China.34
Summary of key challenges, best practices and recommendations
| Challenges | Description | Best practices for compilations | Best practices for reporting sources |
| Overall: standardisation | Standards do not exist, so data from multiple reporting sources cannot be directly compared. | Make and document any necessary adjustments to ensure the same information is being captured from different sources. | Create global data standards for epidemic reports and metadata. |
| Data structure | There is a variety of formats for disseminating data, which requires varying amounts of interpretation. | Take care when extracting data elements from the reporting source. | Use a standard data structure, ideally across reporting sources and events. |
| Overwritten data | Some sources only report a current snapshot of the event, which makes it difficult to know when the cases occurred. | Visit reporting sources daily to create an epidemic timeline of cases and deaths. | Automatically archive all reported data on a regular cadence. Tools to do so are freely available. |
| Data corrections | Epidemic timelines can be inaccurate when no information about corrections to data (including data cleaning and retrospective cases and deaths) is provided. | Be consistent in applying data corrections. | Document any changes made, the impacted dates and the reason for the correction. |
| Language and regional date formats | Translating text, dates and times can be a challenge, especially when non-Roman alphabets are used. | Verify which date format(s) are used in each country or region of interest and adjust accordingly. Pay careful attention to translation. | Provide data in the most accessible file formats (eg, csv rather than pdf) and ensure the date format is clearly understandable. |
| Location names | Locations that share a name are not always reported with enough context to tell them apart, and boundaries change over time. | Verify location names against standard naming conventions. | Use standard International Organization for Standardization (ISO) codes to ensure clarity and consistency when describing a location. |
| Spatial and temporal resolution | Reporting sources do not have a consistent spatial and temporal resolution. | Verify spatial and temporal consistency of data and ensure data are correctly rolled up to less granular resolutions (eg, from ADM1 to ADM0). | Use a ‘nested hierarchy’ method to ensure spatial and temporal consistency. |
| Missing data | There is no universal handling of missing data. | Handle null and zero incidence data consistently within each source and across sources. | Provide clear descriptions of how missing data are handled. |
| Reporting lags | Data cleaning, holidays and overtaxed laboratories can lead to reporting lags. | Report incidence data along with whether the date is the symptom onset, sample collection, laboratory diagnosis or the date reported. | Provide information on the reporting lag. |
| Case definitions | Case definitions are not standardised and may vary by reporting agency. | Ensure the case definition being used is clear and adjust as required to standardise across reports. | Clearly document case definitions and note any changes over time. |