Demetris Avraam, Elinor Jones, Paul Burton.
Abstract
BACKGROUND: Data privacy is one of the biggest challenges for any organisation that processes personal data, especially in medical research, where data include sensitive information about patients and study participants. Sharing of data is therefore problematic, which is at odds with the principle of open data that is so important to the advancement of society and science. Several statistical methods and computational tools have been developed to help data custodians and analysts overcome this challenge.
Keywords: Data privacy; Deterministic anonymisation; Disclosure risk; Information loss; k nearest neighbours
Year: 2022 PMID: 35090447 PMCID: PMC8796499 DOI: 10.1186/s12911-022-01754-4
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
An example of four records from the Titanic passengers data
| Passenger Id | Survived | Pclass | Name | Sex | Age | SibSp | ParCh | Fare |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | Male | 22 | 1 | 0 | 7.2500 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley | Female | 38 | 1 | 0 | 71.2833 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | Female | 26 | 0 | 0 | 7.9250 |
| 891 | 0 | 3 | Dooley, Mr. Patrick | Male | 32 | 0 | 0 | 7.750 |
Fig. 1 Illustration of the steps of the anonymisation procedure applied to the Titanic passengers data. A the original continuous variables, B the variables after standardisation, C stratification of the variables into the 12 distinct strata, D the centroids of each set of 3 nearest neighbours, E the centroids rescaled and re-centred to the observed means (these are the anonymised data), F a comparison between the original and the anonymised variables
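The steps listed in the caption of Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that the caption does not settle: each record counts as one of its own k nearest neighbours, distances are Euclidean on the standardised variables, and the centroids are rescaled so that each variable regains its original standard deviation before being re-centred. The function names (`standardise`, `knn_centroids`, `anonymise`) are hypothetical.

```python
import math
from statistics import mean, pstdev

def standardise(col):
    """Z-score one variable (assumes the variable is not constant)."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]

def knn_centroids(rows, k):
    """Replace each row by the centroid of its k nearest rows.

    Assumption: the row itself is one of its k nearest neighbours.
    """
    out = []
    for r in rows:
        nearest = sorted(rows, key=lambda q: math.dist(r, q))[:k]
        out.append([mean(c) for c in zip(*nearest)])
    return out

def anonymise(data, strata, k=3):
    """data: rows of continuous values; strata: one stratum label per row."""
    cols = list(zip(*data))
    # Step B: standardise every continuous variable.
    z_rows = list(zip(*[standardise(c) for c in cols]))
    # Steps C and D: within each stratum, take k-nearest-neighbour centroids.
    centroids = [None] * len(data)
    for label in set(strata):
        idx = [i for i, s in enumerate(strata) if s == label]
        group = [z_rows[i] for i in idx]
        for i, c in zip(idx, knn_centroids(group, min(k, len(group)))):
            centroids[i] = c
    # Step E: rescale each centroid variable to the original standard
    # deviation and re-centre it to the original mean (assumed scaling rule;
    # fails if every centroid of a variable is identical).
    out_cols = []
    for orig, cen in zip(cols, zip(*centroids)):
        m_o, s_o = mean(orig), pstdev(orig)
        m_c, s_c = mean(cen), pstdev(cen)
        out_cols.append([m_o + (v - m_c) * s_o / s_c for v in cen])
    return [list(r) for r in zip(*out_cols)]

# Toy input: (Age, Fare) pairs with made-up stratum labels.
toy = [[22, 7.25], [38, 71.28], [26, 7.93], [32, 7.75],
       [35, 8.05], [54, 51.86], [2, 21.08], [27, 11.13]]
labels = ["a", "b", "a", "a", "b", "b", "a", "b"]
anonymised = anonymise(toy, labels, k=3)
```

Under this scaling rule the anonymised variables keep the original means and standard deviations by construction, while individual values are replaced by local averages.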
Frequencies of all possible combinations of the levels of the categorical variables Pclass, Sex and Family
| Pclass | Sex | Family | Frequency | Survival ratio |
|---|---|---|---|---|
| 1 | Female | 0 | 34 | 33/34 |
| 2 | Female | 0 | 32 | 29/32 |
| 3 | Female | 0 | 60 | 37/60 |
| 1 | Male | 0 | 75 | 25/75 |
| 2 | Male | 0 | 72 | 7/72 |
| 3 | Male | 0 | 264 | 32/264 |
| 1 | Female | 1 | 60 | 58/60 |
| 2 | Female | 1 | 44 | 41/44 |
| 3 | Female | 1 | 84 | 35/84 |
| 1 | Male | 1 | 47 | 20/47 |
| 2 | Male | 1 | 36 | 10/36 |
| 3 | Male | 1 | 83 | 15/83 |
The table also shows the survival ratio, that is, the number of surviving passengers over the total number of passengers in each stratum
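As a quick consistency check, the stratum counts in the table above can be loaded into a dictionary and summarised. The sketch below only re-uses the numbers printed in the table; the variable names are illustrative. The frequencies add up to 891 passengers, which agrees with the last Passenger Id shown in the sample records.

```python
# (Pclass, Sex, Family) -> (survivors, passengers), copied from the table.
strata = {
    (1, "Female", 0): (33, 34),
    (2, "Female", 0): (29, 32),
    (3, "Female", 0): (37, 60),
    (1, "Male", 0): (25, 75),
    (2, "Male", 0): (7, 72),
    (3, "Male", 0): (32, 264),
    (1, "Female", 1): (58, 60),
    (2, "Female", 1): (41, 44),
    (3, "Female", 1): (35, 84),
    (1, "Male", 1): (20, 47),
    (2, "Male", 1): (10, 36),
    (3, "Male", 1): (15, 83),
}

total_passengers = sum(n for _, n in strata.values())
total_survivors = sum(s for s, _ in strata.values())

# Survival ratio per stratum as a decimal fraction.
ratios = {key: s / n for key, (s, n) in strata.items()}

# The smallest stratum limits how many nearest neighbours can be used
# when records are only matched within their own stratum.
smallest = min(strata, key=lambda key: strata[key][1])

print(total_passengers, total_survivors, smallest)
```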
Fig. 2 Risky observations according to the robust Mahalanobis distance-based metric. A 38 observations are considered risky according to risk1, B 8 of those observations are considered unsafe according to risk2
Fig. 3 Box plots of the individual propensity scores for each observation of the original and anonymised data
Estimated coefficients of the logistic regression model predicting survival of Titanic passengers using the original (left) and the anonymised (right) data
| Coefficient | Original: Estimate | Original: Std. Err. | Original: z value | Original: Pr(>\|z\|) | Anonymised: Estimate | Anonymised: Std. Err. | Anonymised: z value | Anonymised: Pr(>\|z\|) | Std. difference |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 3.519 | 0.437 | 8.054 | 8.00e-16 | 3.615 | 0.454 | 7.972 | 1.56e-15 | 0.220 |
| Pclass (2nd) | −1.067 | 0.286 | −3.732 | 0.0002 | −1.112 | 0.300 | −3.707 | 0.0002 | 0.159 |
| Pclass (3rd) | −2.283 | 0.281 | −8.135 | 4.12e-16 | −2.343 | 0.300 | −7.822 | 5.21e-15 | 0.216 |
| Sex (male) | −2.628 | 0.194 | −13.527 | < 2e-16 | −2.625 | 0.194 | −13.508 | < 2e-16 | 0.012 |
| Age | −0.033 | 0.008 | −4.437 | 9.13e-06 | −0.035 | 0.008 | −4.586 | 4.53e-06 | 0.205 |
| Fare | 0.001 | 0.002 | 0.466 | 0.641 | 0.001 | 0.002 | 0.217 | 0.829 | 0.223 |
| Family (yes) | −0.091 | 0.194 | −0.471 | 0.638 | −0.089 | 0.197 | −0.452 | 0.651 | 0.010 |
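The "Std. difference" column is not defined in this excerpt. One plausible reading, used here purely as an assumption, is the absolute difference between the two estimates divided by the standard error of the original estimate. The helper name `std_difference` is hypothetical. With the rounded values printed in the table, this definition gives roughly 0.220 for the intercept; other rows agree only approximately, presumably because the printed estimates are rounded to three decimals.

```python
def std_difference(est_orig, est_anon, se_orig):
    """Assumed definition: |anonymised - original| / SE of the original."""
    return abs(est_anon - est_orig) / se_orig

# (original estimate, original std. error, anonymised estimate),
# copied from the logistic regression table for the Titanic data.
coefficients = {
    "Intercept": (3.519, 0.437, 3.615),
    "Pclass (2nd)": (-1.067, 0.286, -1.112),
    "Pclass (3rd)": (-2.283, 0.281, -2.343),
    "Sex (male)": (-2.628, 0.194, -2.625),
    "Family (yes)": (-0.091, 0.194, -0.089),
}

differences = {
    name: std_difference(orig, anon, se)
    for name, (orig, se, anon) in coefficients.items()
}

for name, value in differences.items():
    print(f"{name}: {value:.3f}")
```

A value well below 1 means the anonymised estimate lies within one original standard error of the original estimate.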
Fig. 4 Standardised coefficients and their 95% confidence intervals of the logistic regression model that predicts survival of the Titanic passengers. Black denotes the estimates of the model applied to the original data and red denotes the estimates of the model applied to the anonymised data. For ease of presentation, the intercept coefficient of both models is not displayed in the plots
Fig. 5 The disclosure risks (A risk1 and B risk2) estimated by the robust Mahalanobis distance-based metric for different values of the parameters and
Estimated coefficients of the linear regression model predicting fasting glucose level of participants in the National Child Development Study using the original (left) and the anonymised (right) data
| Coefficient | Original: Estimate | Original: Std. Err. | Original: t value | Original: Pr(>\|t\|) | Anonymised: Estimate | Anonymised: Std. Err. | Anonymised: t value | Anonymised: Pr(>\|t\|) | Std. difference |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 6.430 | 0.527 | 12.211 | < 2e-16 | 6.982 | 0.564 | 12.379 | < 2e-16 | 1.048 |
| Sex (female) | −0.098 | 0.057 | −1.729 | 0.084 | −0.153 | 0.061 | −2.524 | 0.012 | 0.959 |
| Smoker (yes) | 0.110 | 0.049 | 2.253 | 0.024 | 0.099 | 0.048 | 2.054 | 0.040 | 0.212 |
| HDL | −0.229 | 0.058 | −3.924 | 9.19e-05 | −0.245 | 0.059 | −4.119 | 4.06e-05 | 0.278 |
| Height | −0.013 | 0.003 | −4.225 | 2.57e-05 | −0.017 | 0.003 | −4.916 | 1.01e-06 | 1.10 |
| Weight | 0.010 | 0.002 | 6.265 | 5.18e-10 | 0.011 | 0.002 | 6.589 | 6.59e-11 | 0.606 |
Fig. 6 Standardised coefficients and their 95% confidence intervals of the linear regression model that predicts fasting glucose levels of participants in the 1958 Birth Cohort. Black denotes the estimates of the model applied to the original data and red denotes the estimates of the model applied to the anonymised data. For ease of presentation, the intercept coefficient of both models is not displayed in the plots
Fig. 7 The effect of the number of nearest neighbours (parameter k) on utility loss and disclosure risk of non-stochastic anonymised data. A The dataset-specific utility loss as measured by the summary statistic U of propensity scores. B The variables-specific utility loss as measured by the Euclidean distance-based metric. C The analysis-specific information loss as measured by the standardised difference of regression model coefficients. D The robust Mahalanobis distance-based disclosure risks. Each point and error bar in the four panels indicates the mean ± one standard deviation of the metrics across 100 generated synthetic samples of 500 individual-level records each
Fig. 8 The effect of the sample size on utility loss and disclosure risk of non-stochastic anonymised data. A The dataset-specific utility loss as measured by the summary statistic U of propensity scores. B The variables-specific utility loss as measured by the Euclidean distance-based metric. C The analysis-specific information loss as measured by the standardised difference of regression model coefficients. D The robust Mahalanobis distance-based disclosure risks. Each point and error bar in the four panels indicates the mean ± one standard deviation of the metrics across 100 simulations with constant k = 5