Gatha Varma, Ritu Chauhan, Dhananjay Singh.
Abstract
The collection of user attributes by service providers is a double-edged sword. These attributes are instrumental in driving statistical analysis and in training more accurate predictive models such as recommenders. The analysis of the collected user data includes frequency estimation for categorical attributes. Nonetheless, users deserve privacy guarantees against inadvertent identity disclosure. Algorithms called frequency oracles were therefore developed to randomize or perturb user attributes and to estimate the frequencies of their values. We propose Sarve, a frequency oracle that uses Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) and Hadamard Response (HR) for randomization in combination with fake data. The design of a service-oriented architecture must consider two types of complexity, namely computational and communication. Such systems aim to minimize both, so the choice of privacy-enhancing methods must be a calculated decision. The variant of RAPPOR we used was realized through Bloom filters. A Bloom filter is a memory-efficient data structure that offers O(1) time complexity. HR, on the other hand, has been proven to give the best communication cost, of the order of log(b) for b-bit communication. Sarve is therefore a step towards frequency oracles that show how the privacy provisions of existing methods can be combined with those of fake data to achieve statistical results comparable to the original data. Sarve also implements an adaptive solution enhanced from the work of Arcolezi et al. The use of RAPPOR was found to provide better privacy-utility tradeoffs for specific privacy budgets in both high and general privacy regimes.
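To make the notion of a frequency oracle concrete, the following is a minimal sketch of one, assuming the simplest one-hot (symmetric unary encoding) form of RAPPOR rather than the Bloom-filter variant used by the authors, and without the fake-data or Hadamard Response components of Sarve. The function names `keep_probability`, `perturb` and `estimate_frequencies`, the toy data and the chosen ε are illustrative assumptions, not part of the paper.

```python
import math
import random


def keep_probability(epsilon):
    """Probability that each bit of the one-hot vector is reported unchanged."""
    half = math.exp(epsilon / 2)
    return half / (half + 1)


def perturb(value, k, epsilon, rng):
    """Client side: one-hot encode `value` in range(k), then flip each bit
    independently with probability 1 - keep_probability(epsilon)."""
    p = keep_probability(epsilon)
    report = []
    for i in range(k):
        bit = 1 if i == value else 0
        if rng.random() >= p:
            bit = 1 - bit
        report.append(bit)
    return report


def estimate_frequencies(reports, k, epsilon):
    """Server side: unbiased frequency estimate for each of the k values.
    E[observed share of 1s] = q + f * (p - q), so f is recovered by inversion."""
    n = len(reports)
    p = keep_probability(epsilon)
    q = 1 - p
    return [(sum(r[i] for r in reports) / n - q) / (p - q) for i in range(k)]


# Toy data (assumption): 20,000 users, one attribute with 4 allowed values.
rng = random.Random(0)
values = [0] * 10000 + [1] * 6000 + [2] * 3000 + [3] * 1000
epsilon = 2.0
reports = [perturb(v, 4, epsilon, rng) for v in values]
estimates = estimate_frequencies(reports, 4, epsilon)
print([round(e, 3) for e in estimates])
```

The estimates are noisy and need not sum to exactly one; the noise shrinks as the number of users grows or as ε increases.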
Keywords: Differential privacy; Frequency estimation; Frequency oracle; Privacy; Synthetic data
Year: 2022 PMID: 35936976 PMCID: PMC9345740 DOI: 10.1186/s42400-022-00129-6
Source DB: PubMed Journal: Cybersecur (Singap) ISSN: 2523-3246
Fig. 1 A schematic of the RS + FD framework proposed by Arcolezi et al. that formed the basis of the proposed solution Sarve
Fig. 2 The probabilities associated with the use of RAPPOR as a privatization mechanism
Fig. 3 The probabilities associated with the use of Hadamard Response as a privatization mechanism
Fig. 4 The flow of logic for an improvised adaptive approach to dynamic randomization using variance as the evaluation metric
Table 1 The parameter values that comprised the experimental setups tested on real-world datasets
| Experimental setup identifier | Values of ε | Dataset name | Number of observations = N | Number of categorical attributes = D | Number of allowed values for each attribute = A |
|---|---|---|---|---|---|
| ES_Real_1 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | UCI Adult | 45,222 | 9 | [7, 16, 7, 14, 6, 5, 2, 41, 2] |
| ES_Real_2 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | UCI Nursery | 12,960 | 9 | [3, 5, 4, 4, 3, 2, 3, 3, 5] |
| ES_Real_3 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | MS-FIMU | 88,935 | 6 | [3, 3, 8, 12, 37, 11] |
| ES_Real_4 | [2, 3, 4, 5, 6, 7] | UCI Adult | 45,222 | 9 | [7, 16, 7, 14, 6, 5, 2, 41, 2] |
| ES_Real_5 | [2, 3, 4, 5, 6, 7] | UCI Nursery | 12,960 | 9 | [3, 5, 4, 4, 3, 2, 3, 3, 5] |
| ES_Real_6 | [2, 3, 4, 5, 6, 7] | MS-FIMU | 88,935 | 6 | [3, 3, 8, 12, 37, 11] |
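The two groups of ε values in the table above correspond to the high privacy regime (ε = ln 2, ..., ln 7) and the general privacy regime (ε = 2, ..., 7). As a rough guide to what these budgets mean, the sketch below evaluates the truth-telling probability of plain binary randomized response, e^ε / (e^ε + 1), for each budget. Binary randomized response is used here only as a familiar yardstick and is an assumption of this illustration; it is not one of the mechanisms evaluated in the paper.

```python
import math


def truth_probability(epsilon):
    """Probability of reporting the true bit under binary randomized response
    that satisfies epsilon-local differential privacy."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1)


high_privacy = [math.log(x) for x in range(2, 8)]   # ln(2) ... ln(7)
general_privacy = [float(x) for x in range(2, 8)]   # 2 ... 7

for label, budgets in (("high", high_privacy), ("general", general_privacy)):
    for eps in budgets:
        print(f"{label:8s} eps={eps:.3f}  truth probability={truth_probability(eps):.3f}")
```

For ε = ln(2) the true bit is reported two times out of three, whereas every budget in the general regime exceeds ln(7) and therefore perturbs the data less.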
Table 2 The parameter values that comprised the experimental setups tested on synthetic datasets
| Experimental setup identifier | Values of ε | Dataset name | Number of observations = N | Number of categorical attributes = D | Number of allowed values for each attribute = A |
|---|---|---|---|---|---|
| ES_Syn_1 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | 50K_5D | 50,000 | 5 | [10, 10, 10, 10, 10] |
| ES_Syn_2 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | 50K_10D | 50,000 | 10 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10] |
| ES_Syn_3 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | 500K_5D | 500,000 | 5 | [10, 10, 10, 10, 10] |
| ES_Syn_4 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | 500K_10D | 500,000 | 10 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10] |
| ES_Syn_5 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | 500K_10D_NU | 500,000 | 10 | [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] |
| ES_Syn_6 | [ln(2), ln(3), ln(4), ln(5), ln(6), ln(7)] | 500K_20D_NU | 500,000 | 20 | [10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100] |
| ES_Syn_7 | [2, 3, 4, 5, 6, 7] | 50K_5D | 50,000 | 5 | [10, 10, 10, 10, 10] |
| ES_Syn_8 | [2, 3, 4, 5, 6, 7] | 50K_10D | 50,000 | 10 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10] |
| ES_Syn_9 | [2, 3, 4, 5, 6, 7] | 500K_5D | 500,000 | 5 | [10, 10, 10, 10, 10] |
| ES_Syn_10 | [2, 3, 4, 5, 6, 7] | 500K_10D | 500,000 | 10 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10] |
| ES_Syn_11 | [2, 3, 4, 5, 6, 7] | 500K_10D_NU | 500,000 | 10 | [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] |
| ES_Syn_12 | [2, 3, 4, 5, 6, 7] | 500K_20D_NU | 500,000 | 20 | [10, 10, 20, 20, 30, 30, 40, 40, 50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100] |
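A synthetic dataset of the shape listed in the table above can be sketched as follows. The table gives only N, D and the domain sizes A, so the uniform sampling used here is an assumption that fits the datasets without the NU suffix at best; the helper names `make_synthetic` and `empirical_frequencies` are hypothetical and do not come from the paper.

```python
import random


def make_synthetic(n, domain_sizes, seed=0):
    """Return n records; attribute j is drawn uniformly from range(domain_sizes[j])."""
    rng = random.Random(seed)
    return [[rng.randrange(a) for a in domain_sizes] for _ in range(n)]


def empirical_frequencies(records, domain_sizes):
    """Per-attribute real frequencies, the ground truth a frequency oracle tries to recover."""
    n = len(records)
    freqs = []
    for j, a in enumerate(domain_sizes):
        counts = [0] * a
        for rec in records:
            counts[rec[j]] += 1
        freqs.append([c / n for c in counts])
    return freqs


# Shape of the 50K_5D dataset: N = 50,000, D = 5, A = [10, 10, 10, 10, 10].
domain_sizes = [10] * 5
records = make_synthetic(50_000, domain_sizes)
freqs = empirical_frequencies(records, domain_sizes)
```

A non-uniform variant would replace `rng.randrange(a)` with a weighted draw such as `rng.choices(range(a), weights=...)`, with weights chosen by the experimenter.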
Fig. 5 The MSE averaged over 100 runs for the UCI Adult dataset privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Real_1 in Table 1; (b) a general privacy regime, experimental setup labeled ES_Real_4 in Table 1
Fig. 6 The MSE averaged over 100 runs for the UCI Nursery dataset privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Real_2 in Table 1; (b) a general privacy regime, experimental setup labeled ES_Real_5 in Table 1
Fig. 7 The MSE averaged over 100 runs for the MS-FIMU dataset privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Real_3 in Table 1; (b) a general privacy regime, experimental setup labeled ES_Real_6 in Table 1
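The metric reported in these figures and in the tables that follow is the mean squared error (MSE) between real and estimated frequencies, averaged over repeated runs. A minimal sketch of that computation is given below, assuming the error is averaged first over the values of each attribute, then over attributes, then over runs; the exact averaging order and the function names are assumptions rather than details confirmed by the source.

```python
def mse(real, estimated):
    """Mean squared error between two frequency vectors of equal length."""
    return sum((r - e) ** 2 for r, e in zip(real, estimated)) / len(real)


def mse_over_attributes(real_by_attr, est_by_attr):
    """Average the per-attribute MSE over all attributes of one run."""
    per_attr = [mse(r, e) for r, e in zip(real_by_attr, est_by_attr)]
    return sum(per_attr) / len(per_attr)


def mse_over_runs(real_by_attr, est_by_attr_per_run):
    """Average the attribute-level MSE over independent runs."""
    per_run = [mse_over_attributes(real_by_attr, est) for est in est_by_attr_per_run]
    return sum(per_run) / len(per_run)


# Toy example (assumption): one binary attribute, two runs.
real = [[0.5, 0.5]]
runs = [[[0.4, 0.6]], [[0.5, 0.5]]]
average_error = mse_over_runs(real, runs)
```

In the toy example the first run has a squared error of 0.01 per value and the second run is exact, so the average over the two runs is about 0.005.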
The lowest MSE for the existing method and the enhancements, tested on the UCI Adult dataset
| Method | ε = ln(2) | ε = ln(3) | ε = ln(4) | ε = ln(5) | ε = ln(6) | ε = ln(7) |
|---|---|---|---|---|---|---|
| RS + FD[ADP] | 0.000596388 | 0.000325887 | 0.000278437 | 0.000183621 | 0.000162579 | 0.000126356 |
| RAPPOR | 0.000821897 | 0.000474918 | 0.000429617 | 0.000391613 | 0.000293406 | 0.000274562 |
| Hadamard response | 0.001347787 | 0.000483257 | 0.000394622 | 0.00029166 | 0.00023901 | 0.000161183 |
| Sarve | 0.000190493 | 0.00014343 |
The values in bold indicate privacy conditions when Sarve performed better than adaptive RS+FD and resulted in lower MSE between real and post-privatization estimated frequencies
The lowest MSE for the existing method and the enhancements, tested on the UCI Nursery dataset
| Method | ε = ln(2) | ε = ln(3) | ε = ln(4) | ε = ln(5) | ε = ln(6) | ε = ln(7) |
|---|---|---|---|---|---|---|
| RS + FD[ADP] | 0.000491035 | |||||
| RAPPOR | 0.000981 | 0.000799 | 0.000441 | 0.000529 | 0.000483 | 0.000338 |
| Hadamard response | 0.00163 | 0.00075 | 0.000433 | 0.000659 | 0.000556 | 0.000396 |
| 0.00087 | 0.000417 | 0.000259 | 0.000387 | 0.000285 |
The values in bold indicate privacy conditions when Sarve performed better than adaptive RS+FD and resulted in lower MSE between real and post-privatization estimated frequencies
The lowest MSE for the existing method and the enhancements, tested on the MS-FIMU dataset
| Method | ε = ln(2) | ε = ln(3) | ε = ln(4) | ε = ln(5) | ε = ln(6) | ε = ln(7) |
|---|---|---|---|---|---|---|
| RS + FD[ADP] | 0.000108 | 3.37E−05 | 2.97E−05 | |||
| RAPPOR | 0.000207 | 8.86E−05 | 8.01E−05 | 5.29E−05 | 5.78E−05 | 4.75E−05 |
| Hadamard response | 0.004249 | 0.001295 | 0.000635 | 0.00037 | 0.000263 | 0.000187 |
| 6.73E−05 | 4.78E−05 | 4.41E−05 |
The values in bold indicate privacy conditions when Sarve performed better than adaptive RS+FD and resulted in lower MSE between real and post-privatization estimated frequencies
Fig. 8 The MSE averaged over 100 runs for synthetic data with N = 50,000, D = 5 privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Syn_1 in Table 2; (b) a general privacy regime, experimental setup labeled ES_Syn_7 in Table 2
Fig. 9 The MSE averaged over 100 runs for synthetic data with N = 50,000, D = 10 privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Syn_2 in Table 2; (b) a general privacy regime, experimental setup labeled ES_Syn_8 in Table 2
Fig. 10 The MSE averaged over 100 runs for synthetic data with N = 500,000, D = 5 privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Syn_3 in Table 2; (b) a general privacy regime, experimental setup labeled ES_Syn_9 in Table 2
Fig. 11 The MSE averaged over 100 runs for synthetic data with N = 500,000, D = 10 privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Syn_4 in Table 2; (b) a general privacy regime, experimental setup labeled ES_Syn_10 in Table 2
Fig. 12 The MSE averaged over 100 runs for synthetic non-uniform data with N = 500,000, D = 10 privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Syn_5 in Table 2; (b) a general privacy regime, experimental setup labeled ES_Syn_11 in Table 2
Fig. 13 The MSE averaged over 100 runs for synthetic non-uniform data with N = 500,000, D = 20 privatized using different randomization techniques and fake data under (a) a high privacy regime, experimental setup labeled ES_Syn_6 in Table 2; (b) a general privacy regime, experimental setup labeled ES_Syn_12 in Table 2
The lowest MSE for the existing method and the enhancements, tested on a synthetic 10-dimensional dataset having 50,000 records
| Method | ε = ln(2) | ε = ln(3) | ε = ln(4) | ε = ln(5) | ε = ln(6) | ε = ln(7) |
|---|---|---|---|---|---|---|
| RS + FD[ADP] | 0.000496 | 0. | 0.00015 | 0.000111 | ||
| RAPPOR | 0.000807 | 0.00048 | 0.000325 | 0.00026 | 0.000247 | 0.000226 |
| Hadamard response | 0.000943 | 0.000312 | 0.000268 | 0.000215 | 0.000187 | 0.000151 |
| 0.000261 | 0.000207 | 0.000118 |
The values in bold indicate privacy conditions when Sarve performed better than adaptive RS+FD and resulted in lower MSE between real and post-privatization estimated frequencies
The lowest MSE for the existing method and the enhancements, tested on a synthetic 20-dimensional dataset having 50,000 records
| Method | ε = ln(2) | ε = ln(3) | ε = ln(4) | ε = ln(5) | ε = ln(6) | ε = ln(7) |
|---|---|---|---|---|---|---|
| RS + FD[ADP] | 2.77E−05 | |||||
| RAPPOR | 0.000155 | 0.000121 | 9.77E−05 | 8.32E−05 | 7.35E−05 | 6.61E−05 |
| Hadamard response | 0.000159 | 7.82E−05 | 5.37E−05 | 4.32E−05 | 3.57E−05 | 3.27E−05 |
| Sarve | 0.000113 | 6.46E−05 | 4.51E−05 | 3.39E−05 | 2.77E−05 | 2.50E−05 |
The values in bold indicate privacy conditions when Sarve performed better than adaptive RS+FD and resulted in lower MSE between real and post-privatization estimated frequencies
The attribute values followed a non-uniform distribution