Khaled El Emam, Lucy Mosquera, Jason Bass.
Abstract
BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: if the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them.
Keywords: data access; data sharing; de-identification; open data; privacy; synthetic data
Year: 2020 PMID: 33196453 PMCID: PMC7704280 DOI: 10.2196/23139
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. The relationships between the different datasets under consideration. Matching between a synthetic sample record and someone in the population goes through the real sample and can occur in 2 directions.
Example of a population dataset, with one’s origin as the quasi-identifier and one’s income as the sensitive variable.
| National ID | Origin | Income ($) |
| 1 | Japanese | 110k |
| 2 | Japanese | 100k |
| 3 | Japanese | 105k |
| 4 | North African | 95k |
| 5 | European | 70k |
| 6 | Hispanic | 100k |
| 7 | Hispanic | 130k |
| 8 | Hispanic | 65k |
Example of a real sample, with one’s origin as the quasi-identifier and one’s income as the sensitive variable.
| Origin | Income ($) |
| European | 70k |
| Japanese | 100k |
| Hispanic | 130k |
| Hispanic | 65k |
| North African | 95k |
Example of a synthetic sample, with one’s origin as the quasi-identifier and one’s income as the sensitive variable.
| Origin | Income ($) |
| Japanese | 115k |
| Japanese | 120k |
| North African | 100k |
| European | 110k |
| Hispanic | 65k |
Notation used in this paper.
| Notation | Interpretation |
| | An index to count records in the real sample |
| | An index to count records in the synthetic sample |
| | The number of records in the true population |
| | The equivalence class group size in the real sample for a particular record |
| | The equivalence group size in the population that has the same quasi-identifier values as record |
| | The number of records in the (real or synthetic) sample |
| | A binary indicator of whether record |
| | A binary indicator of whether the adversary would learn something new if record |
| | Number of quasi-identifiers |
| λ | Adjustment to account for errors in matching and a verification rate that is not perfect |
| | The minimal percentage of sensitive variables that need to be similar between the real sample and synthetic sample to consider that an adversary has learned something new |
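To make the equivalence class quantities in the notation concrete, the following is a minimal Python sketch that applies them to the example population, real sample, and synthetic sample tables above, with origin as the only quasi-identifier. The names `f`, `F`, and `matched` are illustrative labels chosen here, not the paper's notation, and `matched` only checks quasi-identifier agreement; it does not implement the paper's test for learning something new or its risk estimator.

```python
from collections import Counter

# Example data from the tables above: (origin, income in $k).
population = [
    ("Japanese", 110), ("Japanese", 100), ("Japanese", 105),
    ("North African", 95), ("European", 70),
    ("Hispanic", 100), ("Hispanic", 130), ("Hispanic", 65),
]
real_sample = [
    ("European", 70), ("Japanese", 100), ("Hispanic", 130),
    ("Hispanic", 65), ("North African", 95),
]
synthetic_sample = [
    ("Japanese", 115), ("Japanese", 120), ("North African", 100),
    ("European", 110), ("Hispanic", 65),
]


def class_sizes(records):
    """Count how many records share each quasi-identifier value (origin)."""
    return Counter(origin for origin, _income in records)


population_sizes = class_sizes(population)
real_sizes = class_sizes(real_sample)
synthetic_origins = {origin for origin, _income in synthetic_sample}

# One row per real-sample record (illustrative field names):
#   f       = equivalence class size in the real sample
#   F       = equivalence class size in the population
#   matched = whether any synthetic record has the same origin
rows = [
    {
        "origin": origin,
        "f": real_sizes[origin],
        "F": population_sizes[origin],
        "matched": origin in synthetic_origins,
    }
    for origin, _income in real_sample
]

for row in rows:
    print(row)
```

In this toy example, the European and North African records are unique in both the real sample and the population, whereas the Japanese record is unique in the real sample but shares its origin with 3 people in the population.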
Figure 2. The relationship of a real observation to the rest of the data in the real sample and to the synthetic observation, which can be used to determine the likelihood of meaningful identity disclosure.
Quasi-identifiers included in the analysis of the Washington State Inpatient Database (SID) dataset.
| Variable | Definition |
| AGE | patient's age in years at the time of admission |
| AGEDAY | age in days of a patient under 1 year of age |
| AGEMONTH | age in months for patients under 11 years of age |
| PSTCO2 | patient's state/county federal information processing standard (FIPS) code |
| ZIP | patient's zip code |
| FEMALE | sex of the patient |
| AYEAR | hospital admission year |
| AMONTH | admission month |
| AWEEKEND | admission date was on a weekend |
Overall meaningful identity disclosure risk results. (The italicized values are the maximum risk values.)
| Parameter | Synthetic data: population-to-sample risk | Synthetic data: sample-to-population risk | Real data: population-to-sample risk | Real data: sample-to-population risk |
| Washington State Inpatient Database | 0.00056 | | 0.016 | |
| Canadian COVID-19 cases | 0.0043 | | 0.012 | |