Ziye Tao1, Griffin M Weber2, Yun William Yu1,3.
Abstract
MOTIVATION: The rapid growth of electronic medical records provides immense potential to researchers, but these records are often siloed at separate hospitals. As a result, federated networks have arisen, which allow simultaneous querying of medical databases at a group of connected institutions. The most basic such query is the aggregate count, e.g. how many patients have diabetes? However, depending on the protocol used to estimate that total, there is always a tradeoff between the accuracy of the estimate and the risk of leaking confidential data. Prior work has shown that it is possible to empirically control that tradeoff by using the HyperLogLog (HLL) probabilistic sketch.
Year: 2021 PMID: 34252969 PMCID: PMC8275349 DOI: 10.1093/bioinformatics/btab292
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Illustration of HyperLogLog k-anonymity. A hospital has an identified set B contained within the background population A. Binary hashes are taken of all patient identifiers. Those hashes are first used to partition the patients into four buckets. Within each bucket of B, the smallest value is chosen as the representative. Then the k-anonymity of that bucket is the number of hashes in the corresponding bucket of the background population that share the same position of the leading 1 bit.
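The procedure described in the Fig. 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 64-bit hash derived from SHA-256, the use of the top hash bits as the bucket index, and the names `hash64`, `bucket_and_rank` and `bucket_k_anonymity` are all hypothetical choices made for the example.

```python
import hashlib
from collections import Counter

HASH_BITS = 64  # assumed hash width for this sketch


def hash64(identifier):
    """Hypothetical 64-bit hash of a patient identifier (SHA-256, truncated)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_and_rank(h, bucket_bits):
    """Split a hash into (bucket index, position of the leading 1 bit).

    The top `bucket_bits` bits select the bucket; the position is counted
    from the left of the remaining bits, starting at 1.
    """
    rest_bits = HASH_BITS - bucket_bits
    bucket = h >> rest_bits
    rest = h & ((1 << rest_bits) - 1)
    rank = rest_bits - rest.bit_length() + 1
    return bucket, rank


def bucket_k_anonymity(background_ids, query_ids, bucket_bits=2):
    """Return {bucket: k} for every bucket that contains a member of B.

    Assumes the identifiers in `background_ids` (A) are distinct and that
    `query_ids` (B) is a subset of A.
    """
    # How many background hashes fall in each (bucket, leading-1 position).
    counts = Counter(bucket_and_rank(hash64(x), bucket_bits) for x in background_ids)
    # The smallest hash in a bucket of B has the largest leading-1 position.
    representative = {}
    for x in query_ids:
        bucket, rank = bucket_and_rank(hash64(x), bucket_bits)
        representative[bucket] = max(representative.get(bucket, 0), rank)
    return {b: counts[(b, r)] for b, r in representative.items()}


if __name__ == "__main__":
    A = [f"patient-{i}" for i in range(1000)]  # made-up identifiers
    B = A[:100]
    print(bucket_k_anonymity(A, B, bucket_bits=2))  # four buckets, as in Fig. 1
```

Using `bucket_bits=2` gives the four buckets of the figure. The tables below use bucket counts such as 100 that are not powers of two, which would need a different bucket assignment (for example, a hash value reduced modulo m).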
Expected number of non-10-anonymous buckets from Approximations A1 and A2 compared against ground truth simulations
| \|A\|/m | \|A\| | m | r | Simulation average | Simulation replicates | A1 | A1 time (s) | A2 | A2 time (s) |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 10 000 | 100 | 0.1 | 70.60 | 100 | 70.28 | 14.75 | 72.76 | 2.00 |
| 50 | 10 000 | 200 | 0.1 | 141.14 | 100 | 141.12 | 6.61 | 149.85 | 1.00 |
| 20 | 10 000 | 500 | 0.1 | 354.38 | 100 | 353.74 | 2.60 | 414.61 | 0.20 |
| 300 | 30 000 | 100 | 0.1 | 70.68 | 100 | 70.60 | 62.26 | 71.60 | 3.00 |
| 150 | 30 000 | 200 | 0.1 | 141.79 | 100 | 141.59 | 26.66 | 144.55 | 2.00 |
| 60 | 30 000 | 500 | 0.1 | 354.90 | 100 | 354.65 | 9.88 | 372.67 | 1.00 |
| 30 | 30 000 | 1000 | 0.1 | 712.22 | 100 | 709.73 | 4.12 | 783.81 | 0.40 |
| 500 | 50 000 | 100 | 0.1 | 71.87 | 100 | 70.71 | 84.74 | 71.44 | 4.00 |
| 250 | 50 000 | 200 | 0.1 | 142.94 | 100 | 141.76 | 47.65 | 143.68 | 3.00 |
| 100 | 50 000 | 500 | 0.1 | 352.96 | 100 | 353.20 | 19.70 | 363.80 | 2.00 |
| 50 | 50 000 | 1000 | 0.1 | 707.75 | 100 | 706.99 | 8.19 | 749.25 | 0.70 |
| 800 | 80 000 | 100 | 0.1 | 70.57 | 100 | 70.40 | 136.09 | 70.98 | 4.00 |
| 400 | 80 000 | 200 | 0.1 | 142.29 | 100 | 140.90 | 77.41 | 142.45 | 3.00 |
| 160 | 80 000 | 500 | 0.1 | 354.54 | 100 | 353.85 | 34.99 | 360.40 | 2.00 |
| 80 | 80 000 | 1000 | 0.1 | 704.88 | 100 | 707.91 | 16.47 | 734.01 | 1.00 |
| 1000 | 100 000 | 100 | 0.1 | 71.13 | 100 | 70.69 | 252.00 | 71.24 | 4.60 |
| 500 | 100 000 | 200 | 0.1 | 142.77 | 100 | 141.76 | 134.00 | 142.87 | 3.18 |
| 200 | 100 000 | 500 | 0.1 | 354.27 | 100 | 353.37 | 40.57 | 358.63 | 2.00 |
| 100 | 100 000 | 1000 | 0.1 | 705.02 | 100 | 706.95 | 22.16 | 727.60 | 1.30 |
| 50 | 100 000 | 2000 | 0.1 | 1416.61 | 100 | 1414.37 | 9.95 | 1498.50 | 0.70 |
| 20 | 100 000 | 5000 | 0.1 | 3536.27 | 100 | 3539.54 | 3.37 | 4146.33 | 0.20 |
| 3000 | 300 000 | 100 | 0.1 | 70.47 | 100 | | | 70.76 | 8.00 |
| 300 | 300 000 | 1000 | 0.1 | 709.57 | 100 | 709.02 | 90.00 | 715.96 | 2.40 |
| 5000 | 500 000 | 100 | 0.1 | 71.13 | 100 | | | 70.91 | 10.00 |
| 500 | 500 000 | 1000 | 0.1 | 708.77 | 100 | 710.08 | 155.00 | 714.36 | 3.00 |
| 8000 | 800 000 | 100 | 0.1 | 71.92 | 100 | | | 71.06 | 14.00 |
| 800 | 800 000 | 1000 | 0.1 | 708.00 | 100 | 707.00 | 25.00 | 709.76 | 4.00 |
| 10 000 | 1 000 000 | 100 | 0.1 | 70.58 | 100 | | | 70.89 | 16.00 |
| 2000 | 1 000 000 | 500 | 0.1 | 356.33 | 100 | 354.85 | 607.00 | 355.69 | 7.00 |
| 1000 | 1 000 000 | 1000 | 0.1 | 707.66 | 100 | 710.06 | 316.00 | 712.36 | 5.00 |
| 500 | 1 000 000 | 2000 | 0.1 | 1419.32 | 60 | 1420.47 | 150.00 | 1428.72 | 3.00 |
| 200 | 1 000 000 | 5000 | 0.1 | 3534.64 | 50 | 3536.36 | 65.00 | 3586.25 | 2.00 |
| 100 | 1 000 000 | 10 000 | 0.1 | 7068.96 | 50 | 7073.09 | 30.00 | 7275.97 | 1.30 |
| 50 | 1 000 000 | 20 000 | 0.1 | | | 14 146.62 | 12.00 | 14 985.00 | 0.70 |
| 20 | 1 000 000 | 50 000 | 0.1 | | | 35 396.39 | 4.00 | 41 463.50 | 0.20 |
| 30 000 | 3 000 000 | 100 | 0.1 | 71.01 | 100 | | | 70.98 | 30.00 |
| 3000 | 3 000 000 | 1000 | 0.1 | 703.79 | 100 | | | 707.55 | 8.00 |
| 50 000 | 5 000 000 | 100 | 0.1 | 71.54 | 100 | | | 70.71 | 40.00 |
| 5000 | 5 000 000 | 1000 | 0.1 | 708.32 | 100 | | | 709.07 | 12.00 |
| 80 000 | 8 000 000 | 100 | 0.1 | 71.01 | 100 | | | 70.87 | 50.00 |
| 8000 | 8 000 000 | 1000 | 0.1 | 707.81 | 70 | | | 710.63 | 15.00 |
| 100 000 | 10 000 000 | 100 | 0.1 | 70.48 | 100 | | | 70.71 | 55.00 |
| 20 000 | 10 000 000 | 500 | 0.1 | 354.08 | 100 | | | 354.39 | 30.00 |
| 10 000 | 10 000 000 | 1000 | 0.1 | 711.81 | 70 | | | 708.87 | 16.00 |
| 5000 | 10 000 000 | 2000 | 0.1 | | | | | 1418.13 | 11.00 |
| 2000 | 10 000 000 | 5000 | 0.1 | | | 3551.59 | 726.00 | 3556.85 | 7.00 |
| 1500.15 | 10 000 000 | 6666 | 0.1 | | | 4711.94 | 547.00 | 4720.90 | 5.70 |
| 1000 | 10 000 000 | 10 000 | 0.1 | | | 7103.56 | 366.86 | 7123.63 | 4.60 |
| 666.7 | 10 000 000 | 15 000 | 0.1 | | | 10 614.93 | 250.00 | 10 659.18 | 3.70 |
| 500 | 10 000 000 | 20 000 | 0.1 | | | 14 207.49 | 192.00 | 14 287.16 | 3.00 |
| 200 | 10 000 000 | 50 000 | 0.1 | | | 35 366.18 | 79.00 | 35 862.54 | 2.00 |
Note: Some entries are empty because the computation time was infeasibly long. We have highlighted (in yellow or green) the more accurate approximation that finished within 10 min. Full simulation and computation results for r are available on GitHub in machine-readable format.
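The "Simulation average" column can be approximated with a short Monte Carlo sketch. The version below is an illustration under stated assumptions, not the authors' simulation code: hashes are modelled as independent uniform random values (a uniform bucket index plus the leading-1 position of 64 random bits), exactly round(r·|A|) patients are taken to match the query, and buckets with no matching patient are not counted. The function name `simulate_non_k_anonymous` and its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_non_k_anonymous(n_background, n_buckets, prevalence,
                             k=10, trials=20, seed=0):
    """Estimate the expected number of non-k-anonymous buckets by simulation."""
    rng = random.Random(seed)
    n_query = round(prevalence * n_background)  # size of the matching set B
    total = 0
    for _ in range(trials):
        counts = Counter()  # (bucket, rank) -> number of background patients
        max_rank = {}       # bucket -> leading-1 position of B's smallest hash
        for i in range(n_background):
            bucket = rng.randrange(n_buckets)
            # Leading-1 position of 64 random bits (1-based from the left).
            rank = 64 - rng.getrandbits(64).bit_length() + 1
            counts[(bucket, rank)] += 1
            # Hashes are random, so treating the first n_query patients as B
            # is equivalent to picking B at random.
            if i < n_query and rank > max_rank.get(bucket, 0):
                max_rank[bucket] = rank
        # A bucket is non-k-anonymous if fewer than k background patients
        # share the representative's leading-1 position.
        total += sum(1 for b, r in max_rank.items() if counts[(b, r)] < k)
    return total / trials


if __name__ == "__main__":
    # Regime of the first table row: |A| = 10 000, m = 100, r = 0.1.
    print(simulate_non_k_anonymous(10_000, 100, 0.1))
```

For the regime of the first row, the table reports a simulation average of 70.60 over 100 replicates; with only 20 trials and a different random number generator, the sketch should land near that value rather than reproduce it exactly.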
Choice table for approximation method
| \|A\|/m | \|A\| | m | A1 | A2 | A1 | A2 | A1 | A2 | A1 | A2 | A1 | A2 | A1 | A2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 10^4 | 100 | | | | | | | | | | | | |
| 50 | 10^4 | 200 | | | | | | | | | | | | |
| 20 | 10^4 | 500 | | | | | | | | | | | | |
| 1000 | 10^5 | 100 | | | | | | | | | | | | |
| 500 | 10^5 | 200 | | | | | | | | | | | | |
| 200 | 10^5 | 500 | | | | | | | | | | | | |
| 100 | 10^5 | 1000 | | | | | | | | | | | | |
| 50 | 10^5 | 2000 | | | | | | | | | | | | |
| 20 | 10^5 | 5000 | | | | | | | | | | | | |
| 10 000 | 10^6 | 100 | | | | | | | | | | | | |
| 2000 | 10^6 | 500 | | | | | | | | | | | | |
| 1500 | 10^6 | 666 | | | | | | | | | | | | |
| 1000 | 10^6 | 1000 | | | | | | | | | | | | |
| 500 | 10^6 | 2000 | | | | | | | | | | | | |
| 200 | 10^6 | 5000 | | | | | | | | | | | | |
| 100 | 10^6 | 10 000 | | | | | | | | | | | | |
| 50 | 10^6 | 20 000 | | | | | | | | | | | | |
| 20 | 10^6 | 50 000 | | | | | | | | | | | | |
| 100 000 | 10^7 | 100 | | | | | | | | | | | | |
| 20 000 | 10^7 | 500 | | | | | | | | | | | | |
| 10 000 | 10^7 | 1000 | | | | | | | | | | | | |
| 5000 | 10^7 | 2000 | | | | | | | | | | | | |
| 3333 | 10^7 | 3000 | | | | | | | | | | | | |
| 2000 | 10^7 | 5000 | | | | | | | | | | | | |
| 1500 | 10^7 | 6666 | | | | | | | | | | | | |
| 1000 | 10^7 | 10 000 | | | | | | | | | | | | |
| 500 | 10^7 | 20 000 | | | | | | | | | | | | |
| 200 | 10^7 | 50 000 | | | | | | | | | | | | |
Note: |A| is the total size of the hospital background population, m is the number of buckets used in the HyperLogLog sketch and r is the fraction of the background population that matches the query criteria. ‘A1’ and ‘A2’ denote Approximations 1 and 2, respectively. For each parameter regime, we used simulations to determine which of the approximation methods is more suitable for the practitioner.
Fig. 2. Errors between the approximation (chosen from the choice table) and simulation of 100 random trials with number of buckets = 100 (top) and 1000 (bottom).
Fig. 3. Expected number of non-10-anonymous buckets under different combinations of number of buckets (m) and prevalence rate (r) when the total number of patients is 10^7. (Top) Number of non-10-anonymous buckets under different combinations of m and r. (Left bottom) The fraction of non-10-anonymous buckets remains constant as the number of buckets increases when the other variables are held fixed. (Right bottom) It is the relationship to prevalence rate that is more complicated and nonlinear, as shown by focusing on the behavior for 100 and 500 buckets.
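The constant-fraction observation in Fig. 3 can be checked against the simulation averages tabulated earlier. The short sketch below divides each simulation average by its bucket count; it uses the |A| = 100 000, r = 0.1 rows (rather than the 10^7 patients of the figure) because those rows have simulation results for every m, and the variable names are only illustrative.

```python
# (m, simulation average) pairs copied from the rows with
# |A| = 100 000 and r = 0.1 in the approximation-versus-simulation table.
rows = [
    (100, 71.13),
    (200, 142.77),
    (500, 354.27),
    (1000, 705.02),
    (2000, 1416.61),
    (5000, 3536.27),
]

# Fraction of buckets that are not 10-anonymous for each bucket count m.
fractions = {m: average / m for m, average in rows}

for m, fraction in fractions.items():
    print(f"m = {m:>4}: fraction = {fraction:.3f}")
```

Every fraction comes out between 0.70 and 0.72, even though m varies by a factor of 50, which is consistent with the left-bottom panel.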