Brian Lee1,2, Brandi Dupervil1,3, Nicholas P Deputy1,3,4, Wil Duck1,5, Stephen Soroka1,6, Lyndsay Bottichio1,6, Benjamin Silk1,4,7, Jason Price1,8, Patricia Sweeney1,8, Jennifer Fuld1,9, J Todd Weber1,6, Dan Pollock1,6.
Abstract
OBJECTIVES: Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently has more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data.
Keywords: COVID-19; SARS-CoV-2; data paper; data privacy; de-identification; open data
Year: 2021 PMID: 34139910 PMCID: PMC8216038 DOI: 10.1177/00333549211026817
Source DB: PubMed Journal: Public Health Rep ISSN: 0033-3549 Impact factor: 2.792
Figure 1. Creation of COVID-19 case surveillance public datasets from state, tribal, local, and territorial public health jurisdictions, Centers for Disease Control and Prevention, 2020. Abbreviation: DCIPHER, Data Collation and Integration for Public Health Event Response.
Figure 2. The 7-step process of privacy review implemented by the Centers for Disease Control and Prevention in the design of 2 public datasets for COVID-19 case surveillance in 2020.
Privacy characteristics used to create datasets and how they differ between the public-use and scientific-use datasets, developed in 2020 by the Centers for Disease Control and Prevention for design of 2 public datasets for COVID-19 case surveillance
| Variable | Definition | Public-use dataset | Scientific-use dataset |
|---|---|---|---|
| No. of fields | Total fields | 11 | 31 |
| Privacy threshold | Minimum acceptable value for privacy calculations | | |
| Quasi-identifier fields | Dataset fields that may identify individuals | sex [sex]; age_group [age group]; race_ethnicity_combined [race and ethnicity] | sex [sex]; age_group [age group]; race_ethnicity_combined [race and ethnicity]; res_county [county of residence]; res_state [state of residence]; hc_work_yn [health care worker status] |
| Confidential fields | Dataset fields that do not identify individuals but contain confidential information | pos_spec_dt [date of first positive specimen] | pos_spec_dt [date of first positive specimen] |
Figure 3. An example of how k-anonymity field suppression changes the values of the quasi-identifier fields sex, age_group, and race_ethnicity_combined to reduce the risk of re-identification of individuals in 2 public datasets developed by the Centers for Disease Control and Prevention in 2020 for COVID-19 case surveillance. When the frequency count of raw records with shared quasi-identifiers is below the k = 5 privacy threshold, suppressed data are produced with “NA” values for some quasi-identifiers so that the frequency increases to 5. Abbreviation: NA, not applicable.
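The suppression idea described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not CDC's production code: the function names `smallest_group` and `suppress_for_k_anonymity`, the field-by-field suppression order, and the sample values are assumptions made for illustration. Only the field names and the k = 5 threshold come from the caption.

```python
from collections import Counter

# Quasi-identifier field names as listed for the public-use dataset.
QUASI_IDENTIFIERS = ("sex", "age_group", "race_ethnicity_combined")


def _key(record, quasi_identifiers):
    """Tuple of quasi-identifier values that defines a record's group."""
    return tuple(record[q] for q in quasi_identifiers)


def smallest_group(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Size of the smallest group of records sharing quasi-identifier values."""
    return min(Counter(_key(r, quasi_identifiers) for r in records).values())


def suppress_for_k_anonymity(records, quasi_identifiers=QUASI_IDENTIFIERS, k=5):
    """Return copies of the records with some quasi-identifiers set to "NA".

    Assumed strategy: suppress one field at a time, last listed field first,
    in every record whose group is still smaller than k, and stop as soon as
    all groups reach k. If groups remain below k after every field has been
    suppressed, this sketch leaves them as they are.
    """
    out = [dict(r) for r in records]
    for field in reversed(quasi_identifiers):
        counts = Counter(_key(r, quasi_identifiers) for r in out)
        small = {key for key, n in counts.items() if n < k}
        if not small:
            break
        for r in out:
            if _key(r, quasi_identifiers) in small:
                r[field] = "NA"
    return out


# Hypothetical sample: one group of 5, one group of 4, and one single record.
sample = (
    [{"sex": "Female", "age_group": "0 - 9 Years",
      "race_ethnicity_combined": "White"} for _ in range(5)]
    + [{"sex": "Male", "age_group": "0 - 9 Years",
        "race_ethnicity_combined": "Asian"} for _ in range(4)]
    + [{"sex": "Male", "age_group": "0 - 9 Years",
        "race_ethnicity_combined": "Black"}]
)
suppressed = suppress_for_k_anonymity(sample)
```

In this sample, the two small groups differ only in race_ethnicity_combined, so suppressing that one field merges them into a single group of 5, while the group that already met the threshold is left unchanged.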
Figure 4. An example of how l-diversity field suppression changes values of the confidential pos_spec_dt field to reduce the risk of disclosure of personally identifiable information in 2 public datasets developed by the Centers for Disease Control and Prevention in 2020 for COVID-19 case surveillance. When the count of distinct pos_spec_dt values among raw records with shared quasi-identifiers sex, age_group, and race_ethnicity_combined is below the l = 2 privacy threshold, suppressed data are produced with “NA” values for pos_spec_dt, preventing confidential information from being disclosed based on knowing a patient’s quasi-identifier values. Abbreviation: NA, not applicable.
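The l-diversity check described in the Figure 4 caption can be sketched in Python as follows. This is a minimal illustration, not CDC's implementation: the function name `suppress_for_l_diversity`, the parameter names, and the sample dates are assumptions. The field names and the l = 2 threshold come from the caption.

```python
from collections import defaultdict

# Field names as they appear in the privacy characteristics table.
QUASI_IDENTIFIERS = ("sex", "age_group", "race_ethnicity_combined")
CONFIDENTIAL_FIELD = "pos_spec_dt"


def suppress_for_l_diversity(records, quasi_identifiers=QUASI_IDENTIFIERS,
                             confidential=CONFIDENTIAL_FIELD, l_threshold=2):
    """Return copies of the records with the confidential field set to "NA"
    in every quasi-identifier group that has fewer than l_threshold distinct
    confidential values."""
    # Collect the distinct confidential values seen in each group.
    distinct_values = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        distinct_values[key].add(r[confidential])

    out = []
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        copy = dict(r)
        if len(distinct_values[key]) < l_threshold:
            # Every record in this group shares one value, so knowing the
            # quasi-identifiers alone would reveal it; suppress it.
            copy[confidential] = "NA"
        out.append(copy)
    return out


# Hypothetical sample: one group with a single shared date, one with two dates.
same_date = [{"sex": "Female", "age_group": "0 - 9 Years",
              "race_ethnicity_combined": "White",
              "pos_spec_dt": "2020-04-01"} for _ in range(5)]
mixed_dates = [{"sex": "Male", "age_group": "0 - 9 Years",
                "race_ethnicity_combined": "Asian",
                "pos_spec_dt": d}
               for d in ("2020-04-01", "2020-04-02", "2020-04-02",
                         "2020-04-01", "2020-04-02")]
diverse = suppress_for_l_diversity(same_date + mixed_dates)
```

The sketch treats every value of pos_spec_dt, including any missing-value marker already present in the raw data, as an ordinary distinct value. A production process would need an explicit rule for missing dates.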