Yangdi Jiang1,2, Lucy Mosquera2, Bei Jiang1, Linglong Kong1, Khaled El Emam2,3,4.
Abstract
BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.
Year: 2022 PMID: 35714132 PMCID: PMC9205507 DOI: 10.1371/journal.pone.0269097
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
A summary of the datasets that were used in our simulation.
These datasets represent the populations that were used in our simulation.
| Name | Description | Number of Records |
|---|---|---|
| Adult dataset | UCI Machine Learning Repository Adults dataset; this dataset is included as a reference point since it is often used in the machine learning and disclosure control community | 48,842 |
| Texas hospitals 2007 dataset | The Texas hospital discharge dataset (public dataset from the Texas Department of Health and Social Services) | 50,000 records selected from the original 2,244,997 records |
| Washington 2007 hospitals dataset | The Washington state hospital discharge dataset | 50,000 records selected from the original 644,902 records |
| Nexoid dataset | An on-line survey on COVID-19 exposure | 50,000 records selected from the original 968,408 records |
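The figure captions in this listing report results at sampling fractions of 0.05, 0.3, and 0.7 of a 50,000-record population. As a minimal sketch of how a microdata sample might be drawn at such a fraction, the following assumes simple random sampling without replacement; the listing does not state the sampling scheme, and the function name `draw_microdata_sample` and the stand-in record IDs are hypothetical.

```python
import random


def draw_microdata_sample(population, sampling_fraction, seed=0):
    """Draw a simple random sample without replacement.

    Assumption: the sample size is round(sampling_fraction * N); the
    sampling scheme used in the study is not given in this listing.
    """
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n = round(sampling_fraction * len(population))
    return rng.sample(population, n)


# Stand-in population: 50,000 record IDs (hypothetical data).
population = list(range(50_000))

for fraction in (0.05, 0.3, 0.7):
    sample = draw_microdata_sample(population, fraction)
    print(fraction, len(sample))
```

Fixing the seed keeps repeated runs comparable, which matters when estimation error is compared across sampling fractions.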
Fig 1. The estimation error for the Texas hospitals 2007 dataset at the 0.05 sampling fraction.
Fig 2. The estimation error for the Texas hospitals 2007 dataset at the 0.3 sampling fraction.
Fig 3. The estimation error for the Texas hospitals 2007 dataset at the 0.7 sampling fraction.
Fig 4. The estimation error for the Nexoid dataset at the 0.05 sampling fraction.
Fig 6. The estimation error for the Nexoid dataset at the 0.7 sampling fraction.
Fig 7. The sensitivity of the estimation error for the Texas dataset at the 0.3 sampling fraction.
Fig 8. The sensitivity of the estimation error for the Nexoid dataset at the 0.3 sampling fraction.
The quasi-identifiers and how they were modified to ensure a low risk of re-identification.
| Variable | Generalizations |
|---|---|
| Date | Converted to month format |
| FSA | Forward Sortation Area, which is the first three characters of the postal code |
| Conditions | Medical conditions diagnosed |
| age_1 | Age categories: <26, 26–44, 45–64, >65 |
| travel_outside_canada | Travel outside Canada in the last 14 days (binary) |
| Ethnicity | |
| Sex | |
| tobacco_usage | |
| travel_work_school | |
| covid_results_date | Converted to month format |
| people_in_household | Removed |
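The generalizations in the table above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the input field names `postal_code` and `age` are hypothetical, "month format" is assumed to mean `YYYY-MM`, and because the table lists the top age category as `>65`, the sketch assumes that age 65 itself falls into that category.

```python
from datetime import date


def to_month(d):
    """Generalize a date to year-month (assumed format: YYYY-MM)."""
    return d.strftime("%Y-%m")


def to_fsa(postal_code):
    """Forward Sortation Area: first three characters of the postal code."""
    return postal_code.replace(" ", "").upper()[:3]


def age_category(age):
    """Map an age in years to the categories listed in the table."""
    if age < 26:
        return "<26"
    if age <= 44:
        return "26-44"
    if age <= 64:
        return "45-64"
    return ">65"  # assumption: age 65 is placed in the top category


def generalize_record(record):
    """Apply the generalizations to one record (input keys are hypothetical)."""
    out = dict(record)
    out["date"] = to_month(out["date"])
    out["FSA"] = to_fsa(out.pop("postal_code"))
    out["age_1"] = age_category(out.pop("age"))
    out.pop("people_in_household", None)  # listed as "Removed" in the table
    return out


raw = {
    "date": date(2020, 4, 17),
    "postal_code": "k1a 0b1",
    "age": 37,
    "people_in_household": 3,
}
print(generalize_record(raw))
```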
| Notation | Interpretation |
|---|---|
|  | Microdata sample dataset |
|  | Synthetic population dataset |
|  | Synthetic microdata dataset, sampled from the synthetic population dataset |
|  | Index for records in the microdata sample |
|  | Number of records in the microdata sample |
|  | Number of records in the population |
|  | Size of the equivalence class (in the microdata sample) that record |
|  | Size of the equivalence class (in the population) that record |
|  | Match rate for population-to-sample attacks |
|  | Match rate for sample-to-population attacks |
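To make the quantities in the notation table concrete, the sketch below counts equivalence class sizes on a set of quasi-identifiers and computes both match rates for a toy case where the full population is known. The formulas are an assumption: they follow a commonly used definition in which the population-to-sample rate is the sum of 1/f over sample records divided by N, and the sample-to-population rate is the sum of 1/F over sample records divided by n. They are not quoted from this listing, and the symbols `f`, `F`, `n`, and `N` are labels chosen here. The paper's estimators, which apply when the population is not available, are not reproduced.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def match_rates(sample, population, quasi_identifiers):
    """Return (population-to-sample, sample-to-population) match rates.

    Assumed definitions (not taken from the source listing):
      population-to-sample = (1/N) * sum over sample records of 1/f
      sample-to-population = (1/n) * sum over sample records of 1/F
    where f and F are the record's equivalence class sizes in the sample
    and in the population.
    """
    f = equivalence_class_sizes(sample, quasi_identifiers)
    F = equivalence_class_sizes(population, quasi_identifiers)
    n, N = len(sample), len(population)
    keys = [tuple(r[q] for q in quasi_identifiers) for r in sample]
    if any(F[k] == 0 for k in keys):
        raise ValueError("every sample record must also appear in the population")
    pop_to_sample = sum(1 / f[k] for k in keys) / N
    sample_to_pop = sum(1 / F[k] for k in keys) / n
    return pop_to_sample, sample_to_pop


# Toy population of six records with two quasi-identifiers (hypothetical data).
qi = ("Sex", "age_1")
toy_population = [
    {"Sex": "F", "age_1": "<26"},
    {"Sex": "F", "age_1": "<26"},
    {"Sex": "F", "age_1": "<26"},
    {"Sex": "M", "age_1": "<26"},
    {"Sex": "M", "age_1": "26-44"},
    {"Sex": "M", "age_1": "26-44"},
]
toy_sample = [toy_population[0], toy_population[1], toy_population[3]]

pop_to_sample, sample_to_pop = match_rates(toy_sample, toy_population, qi)
# Population-to-sample: (1/2 + 1/2 + 1) / 6 = 1/3
# Sample-to-population: (1/3 + 1/3 + 1) / 3 = 5/9
print(pop_to_sample, sample_to_pop)
```

In the toy case the sample-to-population rate is higher because the single male record under 26 is unique in the population, so an adversary who picks it from the sample matches it with certainty.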