Khaled El Emam, Lucy Mosquera, Xi Fang.
Abstract
Background: One increasingly accepted method for evaluating the privacy of synthetic data is to measure the risk of membership disclosure. This risk is expressed as the F1 accuracy with which an adversary can correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model. It is commonly estimated using a data partitioning methodology with a partitioning parameter of 0.5.

Objective: To validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation.

Materials and methods: We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack.
Keywords: data privacy; membership disclosure; synthetic data generation
Year: 2022 PMID: 36238080 PMCID: PMC9553223 DOI: 10.1093/jamiaopen/ooac083
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
The notation used in the article
| Datasets | |
|---|---|
| | The real dataset |
| | The population from which the real dataset is sampled |
| | Synthetic datasets |
| | The attack dataset |
| | A record in the attack dataset (ie, |
| | A record in the synthetic dataset (ie, |
| | A record in the real dataset (ie, |
| Dataset sizes | |
| | The number of records in the real dataset (ie, |
| | The number of records in the attack dataset (ie, |
| | The size of the population that |
| | The proportion of the attack dataset records that are in the real dataset |
| Hamming distance | |
| | Hamming distance function |
| | Hamming distance threshold |
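
To make the Hamming distance entries in the notation table concrete, the following is a minimal sketch that counts the fields on which two records differ and flags a match when that count is at or below a threshold. The function names `hamming_distance` and `is_match`, the "at or below" matching rule, and the example records are illustrative assumptions, not taken from the article; records are assumed to be equal-length tuples of discrete values.

```python
from typing import Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of fields")
    return sum(x != y for x, y in zip(a, b))


def is_match(a: Sequence, b: Sequence, threshold: int) -> bool:
    """Assumed rule: two records match when their distance is at or below the threshold."""
    return hamming_distance(a, b) <= threshold


# Hypothetical records with four categorical fields.
attack_record = ("40-49", "F", "travel", "recovered")
synthetic_record = ("40-49", "F", "outbreak", "recovered")

distance = hamming_distance(attack_record, synthetic_record)
print(distance, is_match(attack_record, synthetic_record, threshold=1))
```

A threshold of 0 requires an exact match on every field, while larger thresholds tolerate that many differing fields.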
Figure 1. The (ground truth) process for a membership disclosure attack, which accounts for the fact that the attack dataset will be sampled independently from the same population as the real dataset. The attack dataset is matched with the synthetic dataset to infer which records are in the real dataset.
Figure 2. An overview of the membership disclosure evaluation process that is commonly used in the literature.
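
The partitioning-based evaluation can be sketched roughly as follows. This is one reading of the general approach, not the authors' implementation: it assumes the real data is split into a training part (used for synthesis) and a holdout part, that an attack dataset is assembled with a proportion `t` of training records, and that the adversary claims membership whenever an attack record lies within a Hamming distance threshold of some synthetic record. All function names, the toy data, and the choice of synthetic data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def hamming(a: Sequence, b: Sequence) -> int:
    """Count the fields on which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def build_attack_dataset(
    training: List[tuple], holdout: List[tuple], size: int, t: float, rng: random.Random
) -> Tuple[List[tuple], List[bool]]:
    """Draw round(size * t) members from training and the rest from holdout."""
    n_members = round(size * t)
    members = rng.sample(training, n_members)
    non_members = rng.sample(holdout, size - n_members)
    labels = [True] * n_members + [False] * (size - n_members)
    return members + non_members, labels


def membership_f1(
    attack: List[tuple], labels: List[bool], synthetic: List[tuple], threshold: int
) -> float:
    """F1 of the adversary's membership claims against the true labels."""
    tp = fp = fn = 0
    for record, is_member in zip(attack, labels):
        claimed = min(hamming(record, s) for s in synthetic) <= threshold
        if claimed and is_member:
            tp += 1
        elif claimed:
            fp += 1
        elif is_member:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
training = [(i, i % 3) for i in range(10)]       # toy "real" records used for synthesis
holdout = [(100 + i, i % 3) for i in range(10)]  # toy records never seen by the generator
synthetic = list(training)                       # worst case: the generator memorized its input

attack, labels = build_attack_dataset(training, holdout, size=6, t=0.5, rng=rng)
f1_strict = membership_f1(attack, labels, synthetic, threshold=0)
f1_loose = membership_f1(attack, labels, synthetic, threshold=2)
print(f1_strict, f1_loose)
```

In this toy setup the strict threshold separates members from non-members perfectly, while a threshold equal to the number of fields makes the adversary claim membership for every record, so precision falls to the proportion `t`. The estimated F1 therefore depends directly on the partitioning parameter, which is why its choice matters.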
The fields in the datasets used in our study: (a) the Ontario COVID-19 case dataset, (b) the Washington state hospital discharge database, (c) the Canadian Community Health Survey data, and (d) the Nexoid COVID-19 behavioral survey
| Variable | Definitions | Variable | Definitions |
|---|---|---|---|
| (a) | | (b) | |
| Date reported | Number of days since 1 January 2020; this variable was discretized into 20 groups | AGE | Patient age in years |
| Health region | 34 unique regions | AMONTH | Admission month |
| Age group | Decades from 20 to 80+ (ordinal) | AWEEKEND | Weekend admission (Y/N) |
| Gender | Binary gender | DIED | Whether the patient died |
| Exposure | close contact, outbreak, travel, not reported | FEMALE | Sex |
| Case status | recovered, deceased, active | LOS | Length of stay |
| | | ZIP | Patient ZIP code |
| | | AYEAR | Admission year (2006 or 2007) |
| | | DX1-DX9 | Diagnosis codes |
| (c) | | (d) | |
| LBSG31 | Employment status over the last 12 months (full/part time) | survey_date | Date survey was administered |
| SMKDSTY | Type of smoker | country | Country of residence of the respondent |
| GEOGPRV | Province of residence | sex | Sex |
| DHHGAGE | Age (category) | age | Age in years |
| DHH_SEX | Sex | height, weight, bmi | Height (cm), weight (kg), and BMI |
| DHHGMS | Marital status | blood_type | Blood type or “unknown” |
| DHHGLVG | Living arrangements | smoking | Number of cigarettes smoked |
| DHHGHSZ | Household size | Drugs (x6) | Drugs that the respondent may be taking |
| GEN_08 | Worked in a job or business over last 12 months | risk_infection | Calculated risk of infection with COVID-19 |
| LBSGSOC | Occupation group | Risk_mortality | Calculated mortality risk from COVID-19 |
| EDUDH04 | Highest level of education | | |
| SDC_8 | Current student | | |
| SDCFIMM | Immigrant or not | | |
| SDCGCGT | Cultural or racial origin | | |
| INCGHH | Household income | | |
F1 score results: (a) the ground truth F1 values (from the simulation) versus the F1 values estimated using the partitioning method with t = n/N, and (b) the mean results on 50 iterations for the 4 datasets
| Dataset | Sequential trees 5k: Act. F1 | Sequential trees 5k: Est. F1 | Sequential trees 15k: Act. F1 | Sequential trees 15k: Est. F1 | Sequential trees 25k: Act. F1 | Sequential trees 25k: Est. F1 | CTGAN 5k: Act. F1 | CTGAN 5k: Est. F1 | CTGAN 15k: Act. F1 | CTGAN 15k: Est. F1 | CTGAN 25k: Act. F1 | CTGAN 25k: Est. F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| COVID | 0.105 | 0.104 | 0.283 | 0.283 | 0.426 | 0.432 | 0.104 | 0.104 | 0.28 | 0.284 | 0.431 | 0.432 |
| Washington | 0.146 | 0.148 | 0.34 | 0.334 | 0.456 | 0.454 | 0.066 | 0.07 | 0.168 | 0.169 | 0.235 | 0.24 |
| CCHS | 0.077 | 0.075 | 0.21 | 0.2 | 0.329 | 0.327 | 0.076 | 0.075 | 0.214 | 0.211 | 0.33 | 0.327 |
| Nexoid | 0.169 | 0.174 | 0.402 | 0.4 | 0.568 | 0.564 | 0.156 | 0.159 | 0.358 | 0.36 | 0.507 | 0.502 |
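
As a quick check on how closely the estimated F1 values track the ground truth values, the short sketch below computes the absolute gap for each actual/estimated pair in the table. The numbers are transcribed from the table; the variable names and the choice of summary statistics (largest gap and mean gap) are illustrative assumptions rather than part of the article's analysis.

```python
# Column order of the (actual, estimated) pairs in each row of the table.
columns = [
    "Sequential trees 5k", "Sequential trees 15k", "Sequential trees 25k",
    "CTGAN 5k", "CTGAN 15k", "CTGAN 25k",
]

# (Act. F1, Est. F1) pairs transcribed from the table above.
results = {
    "COVID": [(0.105, 0.104), (0.283, 0.283), (0.426, 0.432),
              (0.104, 0.104), (0.28, 0.284), (0.431, 0.432)],
    "Washington": [(0.146, 0.148), (0.34, 0.334), (0.456, 0.454),
                   (0.066, 0.07), (0.168, 0.169), (0.235, 0.24)],
    "CCHS": [(0.077, 0.075), (0.21, 0.2), (0.329, 0.327),
             (0.076, 0.075), (0.214, 0.211), (0.33, 0.327)],
    "Nexoid": [(0.169, 0.174), (0.402, 0.4), (0.568, 0.564),
               (0.156, 0.159), (0.358, 0.36), (0.507, 0.502)],
}

# Absolute gap between actual and estimated F1 for every dataset/column cell.
gaps = {
    (dataset, column): abs(actual - estimated)
    for dataset, pairs in results.items()
    for column, (actual, estimated) in zip(columns, pairs)
}

worst = max(gaps, key=gaps.get)
mean_gap = sum(gaps.values()) / len(gaps)
print(worst, round(gaps[worst], 3), round(mean_gap, 4))
```

With the values as transcribed here, the largest gap is 0.010 (CCHS, sequential trees, 15k) and the mean gap across the 24 cells is roughly 0.003.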
Summary of the oncology trials used in the analysis, with the study size and the population, as well as the membership disclosure risk
| Dataset | Population size (dataset size) | Membership disclosure risk |
|---|---|---|
| Trial #1 (NCT00041197): National Cancer Institute | | |
| Tests if postsurgery receipt of imatinib could reduce the recurrence of gastrointestinal stromal tumors (GIST). Imatinib is an FDA-approved protein-tyrosine kinase inhibitor for treating certain cancers of the blood cells. This drug is hypothesized to be effective against GIST as imatinib inhibits the kinase which experiences gain of function mutations in up to 90% of GIST patients. | 1310 ( | −1.42 |
| Trial #2 (NCT01124786): Clovis Oncology | | |
| Most pancreatic cancer patients have advanced inoperable disease and potentially metastases. At the time of this trial, the first-line therapy for patients with inoperable disease was gemcitabine monotherapy. One transporter (hENT1: human equilibrative nucleoside transporter-1) has been identified as a potential predictor of successful treatment via gemcitabine. This trial compares standard gemcitabine therapy to a novel fatty acid derivative of gemcitabine. This is hypothesized to be superior to gemcitabine in metastatic pancreatic adenocarcinoma patients with low hENT1 activity as it exhibits anticancer activity independent of nucleoside transporters like hENT1, while gemcitabine seems to require nucleoside transporters for anticancer activity. | 19 255 ( | −0.0137 |
| Trial #3 (NCT00688740): Sanofi | | |
| This phase 3 trial compares adjuvant anthracycline chemotherapy (fluorouracil, doxorubicin, and cyclophosphamide) with anthracycline taxane chemotherapy (docetaxel, doxorubicin, and cyclophosphamide) in women with lymph node positive early breast cancer. | 21 875 ( | −0.034 |
| Trial #4 (NCT00113763): Amgen | | |
| This was a randomized Phase 3 trial examining whether panitumumab, when combined with best supportive care, improves progression-free survival among patients with metastatic colorectal cancer, compared with those receiving best supportive care alone. | 58 381 ( | −0.0137 |
| Trial #5 (NCT00460265): Amgen | | |
| This was also a randomized Phase 3 trial on panitumumab, but among patients with metastatic and/or recurrent squamous cell carcinoma of the head and neck. The treatment group received panitumumab in addition to other chemotherapy (Cisplatin and Fluorouracil), while the control group received Cisplatin and Fluorouracil as first-line therapy. | 5868 ( | −0.0947 |
| Trial #6 (NCT00119613): Amgen | | |
| This was a randomized and blinded Phase 3 trial aimed at evaluating whether “increasing or maintaining hemoglobin concentrations with darbepoetin alfa” improves survival among patients with previously untreated extensive-stage small cell lung cancer. The treatment group received darbepoetin alfa with platinum-containing chemotherapy, whereas the control group received placebo instead of darbepoetin alfa. | 16 484 ( | −0.0322 |
| Trial #7 (N0147): NCCTG | | |
| This was a randomized trial of 2686 patients with stage 3 colon adenocarcinoma who were randomly assigned to adjuvant regimens with or without Cetuximab. After resection of colon cancer, Cetuximab was added to the modified 6th version of the FOLFOX regimen including oxaliplatin plus 5-fluorouracil and leucovorin (mFOLFOX6), fluorouracil, leucovorin, and irinotecan (FOLFIRI), or a hybrid regimen consisting of mFOLFOX6 followed by FOLFIRI. | 27 526 ( | 0.052 |
Note: The population includes the specific study participants. The n value indicates the number of trial participants for which we had data available.