Khaled El Emam, Ann Brown, Philip AbdelMalik, Angelica Neisa, Mark Walker, Jim Bottomley, Tyson Roffey.
Abstract
BACKGROUND: A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones. A recent study provided a method for deciding when an area is too small based on the uniqueness criterion. The uniqueness criterion stipulates that an area is no longer too small when the proportion of unique individuals on the relevant variables (the quasi-identifiers) approaches zero. However, a uniqueness value of zero is a stringent threshold and is suitable only when the risks from data disclosure are quite high. Other uniqueness thresholds that have been proposed for health data are 5% and 20%.
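To make the uniqueness criterion concrete, the following is a minimal sketch, not the authors' implementation. It assumes records are dictionaries, treats a record as unique when its combination of quasi-identifier values occurs exactly once, and flags an area as high risk when uniqueness is above the chosen threshold. The function names, field names, and toy records are all hypothetical.

```python
from collections import Counter


def uniqueness(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


def is_high_risk(uniqueness_value, threshold):
    """High risk means uniqueness is strictly above the threshold (e.g. 0.05 or 0.20)."""
    return uniqueness_value > threshold


# Toy records invented for illustration only (not census data).
records = [
    {"age": "30-34", "sex": "F"},
    {"age": "30-34", "sex": "F"},
    {"age": "30-34", "sex": "M"},  # the only record with this combination
    {"age": "65-69", "sex": "M"},
    {"age": "65-69", "sex": "M"},
]

u = uniqueness(records, ["age", "sex"])
print(f"uniqueness = {u:.1%}")
print("high risk at 5%: ", is_high_risk(u, 0.05))
print("high risk at 20%:", is_high_risk(u, 0.20))
```

In this toy area one of five records is unique, so the area would be flagged at the 5% threshold but not at the 20% threshold, since a value equal to the threshold counts as low risk.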
Year: 2010 PMID: 20361870 PMCID: PMC2858714 DOI: 10.1186/1472-6947-10-18
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
The list of quasi-identifiers that were analyzed from the census file
| Variable Name in the 2001 Census RDC File | Definition | # Response categories(*) |
|---|---|---|
| SEXP | Gender | 2 |
| BRTHYR | Year of birth (from 1880 to 2001). | 24 |
| HLNABDR | Language: Language spoken most often at home by the individual at the time of the census. | 4 |
| ETH1-6 | Ethnic Origin: Refers to the six possible answers for the ethnic or cultural group(s) to which the respondent's ancestors belong. | 26 |
| ASRR | Aboriginal Identity: Persons identifying with at least one Aboriginal group. | 8 |
| RELIGWI | Religious denomination: Specific religious denominations, groups or bodies as well as sects, cults, or other religiously defined communities or systems of belief. | 3 |
| TOTYRSR | Total Years of Schooling: Total sum of the years (or grades) of schooling at the elementary, high school, university and college levels. Only available for individuals age 15+. | 9 |
| MARST | Marital Status (Legal) | 5 |
| TOTINC | Total income: Total money income received from all sources during the calendar year 2000 by persons 15 years of age and over. We defined categories in $15K ranges. | 22 |
| DVISMIN | Visible minority status | 4 |
| DISABIL | Activity difficulties/reductions: Combinations of one or more activity difficulties/reduction. | 4 |
(*) The number of response categories excludes non-specific responses such as missing values, not available or "other".
Example uniqueness estimates, POP and MaxCombs values for some FSA and quasi-identifier combinations.
Example of Uniqueness Estimates for FSA and Quasi-identifier Model Combinations

| Example | FSA | Quasi-identifiers | Uniqueness estimate | Uniqueness > 5% | Uniqueness > 20% |
|---|---|---|---|---|---|
| 1 | K7N | Age, Sex | 0% | N | N |
| 2 | M2K | Age, Aboriginal, Religion | 1.7% | N | N |
| 3 | K1A | Sex, Marital Status, Language | 14.3% | Y | N |
| 4 | L6P | Sex, Aboriginal, Schooling, Language | 16.7% | Y | N |
| 5 | H3T | Age, Aboriginal, Income, Marital Status, Language | 56.0% | Y | Y |
| 6 | L1M | Sex, Disability, Marital Status, Schooling, Ethnicity | 67.80% | Y | Y |
| 7 | K1A | Age, Disability, Income, Marital Status, Schooling | 94.70% | Y | Y |
Example of what the raw data used to build the models looked like.
Example of Raw Data Used in Building the Logistic Regression Models

| Example | POP | MaxCombs | Uniqueness > 5% (b) | Uniqueness > 20% (b) |
|---|---|---|---|---|
| 1 | 6,228 | 48 | 0 | 0 |
| 2 | 14,047 | 576 | 0 | 0 |
| 3 | 100 | 40 | 1 | 0 |
| 4 | 2,247 | 576 | 1 | 0 |
| 5 | 12,916 | 84,480 | 1 | 1 |
| 6 | 7,080 | 9,360 | 1 | 1 |
| 7 | 100 | 95,040 | 1 | 1 |
(b) The population uniqueness binary value is used in the logistic regression model with the other predictor variables. We used the 2001 Canadian Census population values.
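The MaxCombs values in the seven example rows match the product of the response-category counts listed for each quasi-identifier (for example, Age and Sex give 24 × 2 = 48). The sketch below reproduces that arithmetic. Reading MaxCombs as this product is an inference from the example rows, not a definition stated here, and the short variable names (taken from the example table) and the `max_combs` helper are hypothetical.

```python
from math import prod

# Response-category counts from the quasi-identifier table, keyed by the
# short names used in the example table (mapping assumed, e.g. Age = BRTHYR).
CATEGORIES = {
    "Sex": 2,
    "Age": 24,
    "Language": 4,
    "Ethnicity": 26,
    "Aboriginal": 8,
    "Religion": 3,
    "Schooling": 9,
    "Marital Status": 5,
    "Income": 22,
    "Visible Minority": 4,
    "Disability": 4,
}


def max_combs(quasi_identifiers):
    """Product of category counts: the number of possible value combinations."""
    return prod(CATEGORIES[q] for q in quasi_identifiers)


examples = {
    1: ["Age", "Sex"],
    2: ["Age", "Aboriginal", "Religion"],
    3: ["Sex", "Marital Status", "Language"],
    4: ["Sex", "Aboriginal", "Schooling", "Language"],
    5: ["Age", "Aboriginal", "Income", "Marital Status", "Language"],
    6: ["Sex", "Disability", "Marital Status", "Schooling", "Ethnicity"],
    7: ["Age", "Disability", "Income", "Marital Status", "Schooling"],
}

for number, qis in examples.items():
    print(number, f"{max_combs(qis):,}")
```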
Figure 1. Definition of prediction evaluation metrics. Low Risk means that the (predicted) percentage of unique records is below or equal to the 5% or 20% threshold. High Risk means that the (predicted) percentage of unique records is above the 5% or 20% threshold.
Figure 2. The population sizes for urban and rural FSAs in Canadian provinces.
Figure 3. The population sizes for urban and rural FSAs in Canada overall.
Distribution of FSAs based on whether they are urban or rural.
| Prov | Total Rural | Total Urban | Grand Total | %Rural | %Urban |
|---|---|---|---|---|---|
| | 12 | 138 | 150 | 8.00% | 92.00% |
| | 18 | 171 | 189 | 9.52% | 90.48% |
| | 10 | 54 | 64 | 15.63% | 84.38% |
| | | 110 | 110 | 0.00% | 100.00% |
| | 13 | 22 | 35 | 37.14% | 62.86% |
| | 14 | 62 | 76 | 18.42% | 81.58% |
| | 56 | 466 | 522 | 10.73% | 89.27% |
| | 39 | 374 | 413 | 9.44% | 90.56% |
| | 11 | 37 | 48 | 22.92% | 77.08% |
| Total | 173 | 1434 | 1607 | 10.77% | 89.23% |
Figure 4. Areas in km.
Comparison of unbalanced data modeling methods.
| Model Evaluation for the 5% Uniqueness Threshold | |||
|---|---|---|---|
| 0.9849 | 0.87 | 0.996 | |
| 0.9849 | 0.449 | 0.992 | |
| 0.947 | 0.74 | 0.98 | |
| 0.949 | 0.59 | 0.949 | |
**We tested the difference between the AUC values; the difference between the two methods was statistically significant only for the 20% uniqueness threshold, at an alpha level of 0.05.
Logistic regression model results for the 5% and 20% thresholds using down-sampling.
| Logistic Regression Model for 5% Threshold | ||||
|---|---|---|---|---|
| 779.1 | -37.35 | 137.8 | -6.5 | |
| (744, 815.5) | (-60.46, -13.72) | (131.6, 144.2) | (-10.61, -2.36) | |
| <0.0001 | <0.0017 | <0.001 | 0.0019 | |
| 63.3 | -6 | 11.8 | -1 | |
| (61.85, 64.74) | (-6.83, -5.16) | (11.59, 12.1) | (-1.16, -0.86) | |
| <0.0001 | <0.0001 | <0.0001 | <0.0001 | |
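The caption says the models were fitted using down-sampling but does not give the sampling details. As a rough illustration only, the sketch below assumes simple random down-sampling of the majority class to a 1:1 ratio before model fitting. The `down_sample` helper, the field names (`pop`, `max_combs`, `high_risk`), and the toy rows are hypothetical, and the actual procedure used in the study may differ.

```python
import random


def down_sample(rows, label_key, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    # Sample without replacement so no majority row is duplicated.
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Toy training rows invented for illustration: 2 high-risk, 8 low-risk.
rows = [{"pop": 100 * (i + 1), "max_combs": 48, "high_risk": 0} for i in range(8)]
rows += [{"pop": 100, "max_combs": 95040, "high_risk": 1} for _ in range(2)]

balanced = down_sample(rows, "high_risk")
print(len(balanced), sum(r["high_risk"] for r in balanced))
```

Balancing the classes this way discards data, so in practice the result would normally be checked against an alternative such as fitting on the unbalanced data.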
The percentage of Niday and emergency department records that would have to be suppressed because they are high risk for each of the uniqueness thresholds.
| | 0% Threshold | 5% Threshold | 20% Threshold |
|---|---|---|---|
| | 85% | 77% | 0% |
| | 93% | 54% | 0% |