Amanda M Y Chu, Benson S Y Lam, Agnes Tiwari, Mike K P So.
Abstract
Patient data and information collected from public health and health care surveys are of great research value, but they usually contain sensitive personal information. Doctors, nurses, and researchers in the public health and health care sector often do not analyze the available datasets or survey data themselves and may outsource the task to third parties. Even when all direct identifiers, such as names and ID card numbers, are removed, an individual can sometimes still be re-identified from the demographic or other particular information in the dataset. Such data privacy issues can become an obstacle to health-related research. Statistical disclosure control (SDC) addresses this problem by masking the original data and releasing a designed dataset derived from them. The released data should satisfy researchers’ needs for data analysis while the original data remain highly protected from disclosure. In this research, we discuss the statistical properties of two SDC methods: the General Additive Data Perturbation (GADP) method and the Gaussian Copula General Additive Data Perturbation (CGADP) method. An empirical study demonstrates how these two SDC methods can be applied in public health research.
Keywords: data perturbation; data privacy; data utility; health care; risk
Year: 2019 PMID: 31731730 PMCID: PMC6888099 DOI: 10.3390/ijerph16224519
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. Procedure for applying GADP.
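The steps shown in the figure are not reproduced in this record. As a minimal sketch only, assuming the commonly cited formulation of GADP in which each perturbed confidential record is drawn from the multivariate normal distribution of the confidential variables conditional on the non-confidential ones, the core computation might look like the following. The function name `gadp_perturb` and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def gadp_perturb(X, S, rng):
    """Sketch of a GADP-style perturbation (assumed formulation).

    X: (n, p) confidential variables; S: (n, q) non-confidential variables.
    Each perturbed row is drawn from N(mu_X + B (s_i - mu_S), Sigma_X|S),
    using the sample means and covariances of the original data.
    """
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    n, p = X.shape
    cov = np.cov(np.hstack([X, S]), rowvar=False)
    Sxx, Sxs, Sss = cov[:p, :p], cov[:p, p:], cov[p:, p:]
    B = Sxs @ np.linalg.inv(Sss)        # regression coefficients of X on S
    cond_cov = Sxx - B @ Sxs.T          # covariance of X given S
    cond_mean = X.mean(axis=0) + (S - S.mean(axis=0)) @ B.T
    noise = rng.multivariate_normal(np.zeros(p), cond_cov, size=n)
    return cond_mean + noise

# Synthetic illustration (hypothetical data, not the survey data).
rng = np.random.default_rng(0)
S = rng.normal(size=(5000, 2))
X = S @ np.array([[0.5, 0.2, 0.0], [0.1, 0.3, 0.4]]) + rng.normal(size=(5000, 3))
Y = gadp_perturb(X, S, rng)
```

Under this assumption the non-confidential columns are released unchanged, while the perturbed columns keep approximately the same means, covariances, and cross-covariances with `S` as the originals. The perturbed values are continuous and can fall outside the original range.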
Pearson’s correlation matrix of .
| | | | | | |
|---|---|---|---|---|---|
| | 1 | | | | |
| | 0.70 | 1 | | | |
| | 0.80 | 0.75 | 1 | | |
| | 0.50 | 0.40 | 0.25 | 1 | |
| | 0.30 | 0.20 | 0.15 | 0.60 | 1 |
Pearson’s correlation matrix of .
| | | | | | |
|---|---|---|---|---|---|
| | 1 | | | | |
| | 0.6322 | 1 | | | |
| | 0.7571 | 0.7216 | 1 | | |
| | 0.5025 | 0.3576 | 0.2342 | 1 | |
| | 0.2152 | 0.1336 | 0.1000 | 0.4398 | 1 |
Figure 2. Procedure for applying CGADP.
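The figure itself is not available in this record. The following is a rough sketch of the general Gaussian copula idea, under stated assumptions: each variable is mapped to normal scores through its marginal distribution, a GADP-style conditional normal draw is made in that normal space, and the result is mapped back to the original marginals. For simplicity the sketch uses empirical marginals (ranks and observed values) rather than fitted distributions, and breaks ties by record order. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for the score transforms

def normal_scores(col):
    """Map a column to standard normal scores through its empirical CDF."""
    n = len(col)
    order = np.argsort(col, kind="stable")
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)      # ranks 1..n, ties broken by order
    return np.array([_STD.inv_cdf(r / (n + 1)) for r in ranks])

def conditional_normal_draw(Z, T, rng):
    """GADP-style draw of Z given T, treating both as jointly normal."""
    p = Z.shape[1]
    cov = np.cov(np.hstack([Z, T]), rowvar=False)
    B = cov[:p, p:] @ np.linalg.inv(cov[p:, p:])
    cond_cov = cov[:p, :p] - B @ cov[p:, :p]
    mean = Z.mean(axis=0) + (T - T.mean(axis=0)) @ B.T
    return mean + rng.multivariate_normal(np.zeros(p), cond_cov, size=len(Z))

def cgadp_perturb(X, S, rng):
    """Sketch of a copula-based perturbation with empirical marginals."""
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    n = len(X)
    Z = np.column_stack([normal_scores(c) for c in X.T])
    T = np.column_stack([normal_scores(c) for c in S.T])
    Z_new = conditional_normal_draw(Z, T, rng)
    Y = np.empty_like(X)
    for j in range(X.shape[1]):
        u = np.array([_STD.cdf(z) for z in Z_new[:, j]])   # back to (0, 1)
        idx = np.minimum((u * n).astype(int), n - 1)        # empirical quantile
        Y[:, j] = np.sort(X[:, j])[idx]
    return Y

# Synthetic illustration: two 0-7 count-like items and two continuous covariates.
rng = np.random.default_rng(1)
S = rng.normal(size=(500, 2))
raw = 2 + S @ np.array([[1.0, 0.5], [0.3, 1.0]]) + rng.normal(size=(500, 2))
X = np.clip(np.round(raw), 0, 7)
Y = cgadp_perturb(X, S, rng)
```

Because the back-transformation in this sketch selects from the observed values, the perturbed items stay on the original integer scale.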
Questions.
| Variable | Description |
|---|---|
| | Feel rested upon awakening at the end of a sleep period |
| | Feel satisfied with the quality of your sleep |
| | Get too much sleep |
| | Take a nap at a scheduled time |
| | Fall asleep at an unscheduled time |
| | Weight |
| | Height |
Note: 0 = no days, 1 = 1 day, 2 = 2 days, 3 = 3 days, 4 = 4 days, 5 = 5 days, 6 = 6 days, 7 = every day. Respondents answer according to their own situation.
Original survey data.
| No. | | | | | | | |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 7 | 5 | 7 | 56 | 165 |
| 2 | 0 | 0 | 0 | 0 | 0 | 48 | 152 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 185 | 1 | 3 | 1 | 0 | 0 | 72 | 162 |
| 186 | 7 | 1 | 0 | 1 | 1 | 47 | 149 |
Statistical information from the original survey database.
| Summary Statistics | | Pearson’s Correlation Matrix | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Mean | Std | | | | | | | | |
| 2.37 | 2.18 | | 1 | | | | | | |
| 2.31 | 2.28 | | 0.4461 | 1 | | | | | |
| 1.15 | 1.71 | | 0.1795 | 0.1590 | 1 | | | | |
| 1.74 | 1.99 | | 0.1092 | 0.0379 | 0.2091 | 1 | | | |
| 1.59 | 1.90 | | | 0.0116 | 0.1593 | 0.2489 | 1 | | |
| 55.44 | 8.75 | | | | | | 0.1003 | 1 | |
| 157.69 | 5.60 | | 0.0032 | | 0.0592 | 0.0807 | | 0.3719 | 1 |
Spearman’s Rank correlation matrix of the original survey database.
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 1 | | | | | | |
| | 0.4680 | 1 | | | | | |
| | 0.2017 | 0.2254 | 1 | | | | |
| | 0.1285 | 0.0787 | 0.1985 | 1 | | | |
| | | 0.0621 | 0.1690 | 0.3009 | 1 | | |
| | | | | | 0.1090 | 1 | |
| | 0.0039 | | 0.0123 | 0.0954 | | 0.3482 | 1 |
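As an illustration of how the two kinds of correlation matrix reported here can be computed, the sketch below obtains Pearson’s matrix directly and Spearman’s matrix as the Pearson correlation of average ranks. The helper names are hypothetical, and only the weight and height values of the four rows visible in the survey table are used, so the numbers it yields are not the published ones.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_matrix(data):
    """Spearman's rank correlation: Pearson correlation of the ranked columns."""
    data = np.asarray(data, dtype=float)
    ranked = np.column_stack([average_ranks(col) for col in data.T])
    return np.corrcoef(ranked, rowvar=False)

# Weight and height for records 1, 2, 185 and 186 of the survey table.
visible = np.array([[56, 165], [48, 152], [72, 162], [47, 149]], dtype=float)
pearson = np.corrcoef(visible, rowvar=False)
spearman = spearman_matrix(visible)
```

Average ranks matter for the sleep items, which take only the values 0 to 7 and therefore contain many ties.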
Perturbed data using the GADP method.
| No. | | | | | | | |
|---|---|---|---|---|---|---|---|
| 1 | 4.6587 | 2.7093 | −2.0323 | 3.6562 | 5.9077 | 56 | 165 |
| 2 | 4.4505 | 5.0854 | 0.7921 | −1.7925 | −2.3083 | 48 | 152 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 185 | 2.0445 | 2.0600 | 1.2364 | 0.4676 | 2.6754 | 72 | 162 |
| 186 | 4.3816 | 4.2980 | 2.8277 | 2.9120 | 2.6487 | 47 | 149 |
Statistical information of the perturbed data using the GADP method.
| Summary Statistics | | Pearson’s Correlation Matrix | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Mean | Std | | | | | | | | |
| 2.43 | 2.14 | | 1 | | | | | | |
| 2.43 | 2.39 | | 0.4994 # | 1 | | | | | |
| 0.93 | 1.72 | | 0.1741 | 0.2190 # | 1 | | | | |
| 1.61 | 2.03 | | 0.1834 # | 0.1141 # | 0.1503 # | 1 | | | |
| 1.58 | 1.88 | | 0.0315 ^ | 0.0548 | 0.2573 # | 0.3349 # | 1 | | |
| 55.44 | 8.75 | | −0.1022 | −0.1272 # | −0.1063 | −0.0614 | 0.1642 # | 1 | |
| 157.69 | 5.60 | | 0.0375 | −0.1244 # | 0.0808 | 0.0759 | −0.0722 | 0.3719 | 1 |
Note: values are marked with ^ if the sign changed after perturbation, or with # if the absolute difference from the original value exceeds 0.05.
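The marking rule in the note can be expressed as a small helper. This is a sketch with a hypothetical function name. Giving the sign-change marker precedence when both conditions hold is an assumption, since the note does not say which marker wins.

```python
def flag_change(original, perturbed, tol=0.05):
    """Return the table marker for one correlation entry."""
    if original * perturbed < 0:
        return "^"          # sign changed after perturbation
    if abs(perturbed - original) > tol:
        return "#"          # absolute difference exceeds the tolerance
    return ""               # change is within the tolerance

# Two entries taken from the original and GADP Pearson matrices.
first = flag_change(0.4461, 0.4994)    # difference of about 0.053
second = flag_change(0.1795, 0.1741)   # difference of about 0.005
# A made-up pair (not from the tables) to show the sign-change case.
third = flag_change(-0.02, 0.03)
```

Here `first` is marked `#`, `second` is unmarked, and `third` is marked `^`.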
Spearman’s Rank correlation matrix of the perturbed data using the GADP method.
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 1 | | | | | | |
| | 0.4892 | 1 | | | | | |
| | 0.1790 | 0.2300 | 1 | | | | |
| | 0.2117 # | 0.1306 # | 0.1873 | 1 | | | |
| | 0.0109 ^ | 0.0436 | 0.2704 # | 0.3062 | 1 | | |
| | −0.0994 | −0.1151 # | −0.1405 # | −0.0163 | 0.2163 # | 1 | |
| | 0.0488 | −0.0800 | 0.1086 # | 0.0892 | −0.0788 | 0.3482 | 1 |
Note: values are marked with ^ if the sign changed after perturbation, or with # if the absolute difference from the original value exceeds 0.05.
Figure 3. Frequency histograms of all the confidential and non-confidential data.
Fitting distribution of and goodness of fit test.
| | Distribution | Parameters | | Test Value | |
|---|---|---|---|---|---|
| | Normal | | | 0.0995 | 0.0502 |
| | Normal | | | 0.0605 | 0.5029 |
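This record does not name the goodness-of-fit test behind the test values. Purely as an illustration of how a fitted normal distribution can be checked, the sketch below computes the one-sample Kolmogorov-Smirnov statistic, one common choice, against a normal distribution with the sample mean and standard deviation. The function name and both samples are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    fitted = NormalDist(mean(sample), stdev(sample))
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# A sample that follows the normal shape closely, and one that does not.
near_normal = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
two_point = [0.0] * 40 + [10.0] * 10
d_good = ks_statistic_normal(near_normal)
d_bad = ks_statistic_normal(two_point)
```

A small statistic indicates a close fit. Turning it into a p-value requires the reference distribution of the chosen test, which this sketch leaves out.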
Figure 4. Frequency histograms of non-confidential data with a density curve.
Fitting distribution of and goodness of fit test.
| | Distribution | Parameters | | | Test Value | |
|---|---|---|---|---|---|---|
| | ZANB | | | | 9.04 | 0.2498 |
| | ZINB | | | | 5.2487 | 0.6296 |
| | ZINB | | | | 6.2476 | 0.5112 |
| | ZANB | | | | 6.3010 | 0.5051 |
| | ZINB | | | | 9.2857 | 0.2328 |
Note: is the parameter for a probability of zero in a zero inflated/adjusted model.
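To make the difference between the zero-inflated (ZINB) and zero-adjusted (ZANB) negative binomial models concrete, the sketch below writes out both probability mass functions. The parameter names `mu`, `sigma`, and `nu` and the mean-dispersion parameterization (variance `mu + sigma * mu**2`) are assumptions, because the table’s parameter symbols did not survive in this record.

```python
from math import exp, lgamma, log

def nb_pmf(y, mu, sigma):
    """Negative binomial pmf with mean mu and variance mu + sigma * mu**2."""
    r = 1.0 / sigma
    log_p = (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
             + r * log(r / (r + mu)) + y * log(mu / (r + mu)))
    return exp(log_p)

def zinb_pmf(y, mu, sigma, nu):
    """Zero-inflated: an extra point mass nu at zero is mixed with the NB."""
    base = (1.0 - nu) * nb_pmf(y, mu, sigma)
    return nu + base if y == 0 else base

def zanb_pmf(y, mu, sigma, nu):
    """Zero-adjusted: P(0) = nu, positive counts follow a zero-truncated NB."""
    if y == 0:
        return nu
    return (1.0 - nu) * nb_pmf(y, mu, sigma) / (1.0 - nb_pmf(0, mu, sigma))

# Hypothetical parameter values, chosen only for illustration.
mu, sigma, nu = 2.0, 0.5, 0.3
zinb_total = sum(zinb_pmf(y, mu, sigma, nu) for y in range(500))
zanb_total = sum(zanb_pmf(y, mu, sigma, nu) for y in range(500))
```

In the zero-inflated form the probability of zero is `nu` plus the negative binomial’s own zero probability scaled by `1 - nu`, whereas in the zero-adjusted form it is exactly `nu`.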
Perturbed data using the CGADP method.
| No. | | | | | | | |
|---|---|---|---|---|---|---|---|
| 1 | 4 | 5 | 0 | 2 | 4 | 56 | 165 |
| 2 | 3 | 1 | 2 | 0 | 2 | 48 | 152 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 185 | 1 | 0 | 2 | 0 | 0 | 72 | 162 |
| 186 | 0 | 0 | 0 | 3 | 6 | 47 | 149 |
Statistical information of the perturbed data using the CGADP method.
| Summary Statistics | | Pearson’s Correlation Matrix | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Mean | Std | | | | | | | | |
| 2.90 | 2.02 | | 1 | | | | | | |
| 2.98 | 2.14 | | 0.4975 # | 1 | | | | | |
| 1.84 | 1.61 | | 0.1793 | 0.2930 # | 1 | | | | |
| 2.42 | 1.85 | | 0.2327 # | 0.1283 # | 0.3395 # | 1 | | | |
| 2.01 | 1.80 | | −0.0543 | 0.0294 | 0.1530 | 0.3484 # | 1 | | |
| 55.44 | 8.75 | | −0.1179 | −0.1061 # | −0.1382 | 0.0436 ^ | 0.0744 | 1 | |
| 157.69 | 5.60 | | 0.0418 | −0.0871 | 0.0140 | 0.1057 | −0.0500 | 0.3719 | 1 |
Note: values are marked with ^ if the sign changed after perturbation, or with # if the absolute difference from the original value exceeds 0.05.
Spearman’s Rank correlation matrix of the perturbed data using the CGADP method.
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 1 | | | | | | |
| | 0.5122 | 1 | | | | | |
| | 0.1969 | 0.2902 # | 1 | | | | |
| | 0.2359 # | 0.1636 # | 0.3282 # | 1 | | | |
| | −0.0195 | 0.0588 | 0.1953 | 0.3304 | 1 | | |
| | −0.1211 | −0.0873 | −0.1357 # | −0.0108 | 0.0306 # | 1 | |
| | 0.0525 | −0.0765 | 0.0264 | 0.1168 | −0.0300 | 0.3482 | 1 |
Note: values are marked with ^ if the sign changed after perturbation, or with # if the absolute difference from the original value exceeds 0.05.