
Expected 10-anonymity of HyperLogLog sketches for federated queries of clinical data repositories.

Ziye Tao1, Griffin M Weber2, Yun William Yu1,3.   

Abstract

MOTIVATION: The rapid growth of electronic medical records provides immense potential to researchers, but these records are often siloed at separate hospitals. As a result, federated networks have arisen, which allow simultaneously querying medical databases at a group of connected institutions. The most basic such query is the aggregate count, e.g. 'How many patients have diabetes?' However, depending on the protocol used to estimate that total, there is always a tradeoff between the accuracy of the estimate and the risk of leaking confidential data. Prior work has shown that it is possible to empirically control that tradeoff by using the HyperLogLog (HLL) probabilistic sketch.
RESULTS: In this article, we prove complementary theoretical bounds on the k-anonymity privacy risk of using HLL sketches, as well as exhibit code to efficiently compute those bounds.
AVAILABILITY AND IMPLEMENTATION: https://github.com/tzyRachel/K-anonymity-Expectation.
© The Author(s) 2021. Published by Oxford University Press.


Year:  2021        PMID: 34252969      PMCID: PMC8275349          DOI: 10.1093/bioinformatics/btab292

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931


1 Introduction

Clinical data containing patients’ personal medical records are important resources for biomedical research. Fully centralizing these data may permit the widest array of potential analyses, but this is often not feasible due to privacy and confidentiality requirements (Benitez and Malin, 2010; Emam et al.; Heatherly et al.). During times of pressing need, such as a global pandemic, these privacy requirements may be justifiably relaxed (Haendel et al.), for instance by using trusted third-party vendors such as Datavant (Kho and Goel, 2019), but even then it is important to keep in mind the various privacy-utility tradeoffs (Bengio et al., 2020). A more privacy-friendly alternative is to use a federated network, which gives hospitals control over their local databases; a distributed query tool then enables researchers to send queries to the network, such as 'how many patients across the network have diabetes' (Brat et al.; Weber, 2015). A number of these hospital networks have emerged, including the Shared Health Research Information Network for Harvard-affiliated hospitals (Weber et al.), the Federated Aggregate Cohort Estimator developed through a collaboration of five universities and institutions (Wyatt et al.), the open-source PopMedNet (Davies et al.) and PCORnet, a network of networks launched by the Patient-Centered Outcomes Research Institute (Fleurence et al.).

However, patients often receive medical care from multiple hospitals, so medical records at different hospitals may be duplicated or incomplete. Depending on the aggregation method used to combine results from the network, this can produce errors. For example, consider using a simple summation of aggregate counts: if a patient with hypertension receives medical care from both Hospital A and Hospital B, then the sum may double-count that patient, which results in an overestimate of the number of patients with hypertension (Weber, 2013).
Of course, this problem can be mostly alleviated by having each hospital send hashed identifiers of its matching patients to a trusted third party, but that again raises privacy concerns (Oechslin, 2003). There is a natural tradeoff between the privacy guaranteed to individual patients and the accuracy of the aggregate query, and hashed identifiers and simple summation fall at opposite ends of the spectrum. Several of the authors of this article recently proposed using the HyperLogLog (HLL) 'probabilistic sketch' (Durand and Flajolet, 2003; Flajolet and Martin, 1985; Flajolet et al.) to access intermediate tradeoffs of privacy versus accuracy (Yu and Weber, 2020). Probabilistic counting was introduced to the computer science literature decades ago and has found use in analyzing large streaming data in a variety of settings, ranging from internet routers (Cai et al.) to text corpora comparisons (Broder, 1997) to genomic sequences (Baker and Langmead, 2019; Ondov et al.; Solomon and Kingsford, 2018). Instead of sharing a single aggregate count, or sharing the full list of matching patient IDs (Weber, 2013), each hospital shares a smaller 'summary sketch' built by taking the logarithm of a coordinated random sample of m matching patient hashed IDs (Yu and Weber, 2020). Because only m patient IDs are used, and those are obfuscated through taking a logarithm, these HLL sketches are significantly more private than sending a full list of matching IDs. Due to the way the estimators work, HLL sketches have a relative error of about 1.04/sqrt(m), which can be much less than expected from simple summation. But whenever any data are shared by a hospital with a third party, there is risk of accidental leakage. Advances in homomorphic encryption and secure multi-party computation (Lindell, 2005) may eventually solve this problem by not allowing the third party any unencrypted data, but these are currently still impractical for deployment due to both computational and communication complexity.
For example, consider the case where a hospital finds that there is only one patient satisfying the criteria for a query. If this hospital returns the aggregate count of one, then this unique patient's record is singled out and can potentially be re-identified through a linkage attack (Emam and Dankar, 2008; Yu and Weber, 2020). To properly compare the privacy of various methods of data aggregation, we turn to the concept of k-anonymity. The basic idea behind k-anonymity is that if a method or dataset is k-anonymous, then each patient is similar to at least k − 1 other patients with respect to potentially identifying variables, so that it is hard to determine the identity of any single patient in the dataset (Emam and Dankar, 2008; Sweeney, 2002). Although other mathematical formalisms like differential privacy (Dwork, 2008) are much stronger, they are harder to work with, as they require injecting deliberate noise, and they are not currently in wide use by clinical databases. Furthermore, it is provably impossible for composable cardinality estimators (such as HLL) to be differentially private, because the ability to deduplicate runs counter to the base assumptions of differential privacy (Desfontaines et al.). In this article, we assume that hospitals in a federated network implement the HLL algorithm for queries. We then prove bounds on the expected k-anonymity of the shared sketches, and provide fast algorithms for computing that expected k-anonymity. This study is an extension of previous work (Yu and Weber, 2020), which operated under the same setting and assumptions but provided only empirical results, with no proofs of the levels of privacy achieved. Here, we provide rigorous theoretical justification for those empirical claims.

2 Materials and methods

2.1 Setting and summary

In this article, we adopt the HLL-sketch federated clinical network setting given in prior work (Yu and Weber, 2020). For completeness, we duplicate the salient points below. Assume that every patient has a single invariant ID that is used across hospitals. Prototypically, one might consider using social security numbers in the USA for that purpose. Even without a single unique identifier, it is possible to generate an ID based on a combination of other possibly non-unique identifiers, such as first and last name, zip code, address, birthdate, etc. Unfortunately, there may be errors in these records due to character-recognition errors (e.g. S and 8), phonetic errors (e.g. ph and f) and typographic errors, including insertions, transpositions and substitutions. Luckily, there is a large existing literature on this problem, and methods such as BIN-DET and BIN-PROB (Durham et al.) have been proposed to deal with the issue. Thus, in this article, we treat this problem as out of scope and assume for simplicity that every patient has a unique stable ID known to all institutions. Further assume that there is a federated network of hospitals (or other institutions) responding to clinical queries, along with a central party that manages and relays messages. When hospitals receive a query, they generate a list of the IDs of patients who match the query. Each hospital then uses a publicly known hash function to first pseudorandomly partition the patients into m buckets and then assign a uniform pseudorandom number between 0 and 1 to each patient. We also assume that the hash function is known by the attacker, because the attacker may have compromised one of the other hospitals or the central party. The hospital then stores the order of magnitude of the smallest number within each bucket and sends these m bucket values to the central party.
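Concretely, the per-hospital computation just described can be sketched as follows. This is a simplified illustration, not the project's actual code: the SHA-256-based hash, the salt, and the 64-bit split are our own assumptions standing in for the publicly known hash function.

```python
import hashlib

def leading_zeros_64(v: int) -> int:
    """Number of 0 bits before the first 1 bit of a 64-bit hash value."""
    return 64 - v.bit_length()

def hospital_sketch(matching_ids, m, salt="query-123"):
    """Build the m bucket values a hospital would share with the central
    party: for each bucket, the largest number of leading zero bits among
    the hash values of its matching patients (-1 marks an empty bucket)."""
    buckets = [-1] * m
    for pid in matching_ids:
        digest = hashlib.sha256(f"{salt}:{pid}".encode()).digest()
        b = int.from_bytes(digest[:8], "big") % m    # partition into m buckets
        v = int.from_bytes(digest[8:16], "big")      # ~ uniform 64-bit value
        buckets[b] = max(buckets[b], leading_zeros_64(v))
    return buckets

sketch = hospital_sketch([f"patient-{i}" for i in range(5000)], m=64)
print(len(sketch))  # 64 values, one per bucket
```

Only these m small integers leave the hospital; the raw hashed IDs never do.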
By applying the HLL estimator, the central party is then able to compute the aggregate count for the query with a relative error of around 1.04/sqrt(m) (Flajolet et al.). Here, we focus on an individual hospital and want to determine the expected exposure to accidentally disclosing private information if the central party is compromised. As the HLL sketch aggregates information within each of the m buckets, our goal is to compute the expected number of buckets that are not k-anonymized. In line with common practice, we set k = 10 for most of our results, though the algorithms and proofs hold for other k. Below, we provide two approximation formulas for the expected value, and in Section 4 we construct a table for the user to determine which approximation should be chosen based on the number of distinct patients and other relevant parameters.

2.2 k-Anonymity and HLL

2.2.1 High-level overview

The HLL (Flajolet et al.) probabilistic sketching algorithm is widely used to estimate the cardinality (number of distinct elements) of a set. Assume we have a database of electronic medical records; we can estimate the number of distinct patients by applying the HLL algorithm. The basic idea behind HLL is that the minimum of a collection of random numbers between 0 and 1 is inversely proportional to the size of the collection. Therefore, we can estimate the cardinality of a set by first applying a hash function that maps all elements uniformly onto (0, 1) and then considering the minimum value. For the purposes of this article, we operate in the random oracle model, where we assume that the hash function actually maps to a random number; in practice, a standard hash function like SHA-256 would probably be employed. In order to increase the accuracy of the estimate, we randomly divide the set into m partitions and then estimate the cardinality of the original set from the harmonic mean of the m partition estimates. Furthermore, the HLL algorithm only needs to store the position of the first 1 bit in the 64-bit hash value, rather than the full patient ID hash, providing partial privacy protection. As the expected relative error in the final estimate is around 1.04/sqrt(m), increasing m reduces the error of an HLL query but increases the risk of privacy leaks. In our setting, when a hospital is sent a query, there are two relevant sets to consider: (i) the background population (often, the set of all patients at the hospital) and (ii) the set of patients matching the query. The reason for considering the background population is that they can 'hide' patients who match the query by providing plausible deniability. The hospital will return an HLL sketch, which contains m values: the maximum position of the first 1 bit within each bucket.
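To make the harmonic-mean step concrete, here is a small self-contained simulation of the standard HLL estimator. The bias-correction constant alpha_m ≈ 0.7213/(1 + 1.079/m) is the usual published value for m ≥ 128; uniform random numbers stand in for hashed IDs, as in the article's own simulations, and the register/seed names are our own.

```python
import math, random

def hll_cardinality_estimate(n, m=1024, seed=7):
    """Simulate sketching n distinct items and return the HLL estimate.
    Registers store the position of the first 1 bit (1 + leading zeros)."""
    rng = random.Random(seed)
    registers = [0] * m
    for _ in range(n):
        b = rng.randrange(m)                      # bucket choice
        u = 1.0 - rng.random()                    # uniform in (0, 1]
        pos = math.floor(-math.log2(u)) + 1       # position of first 1 bit
        registers[b] = max(registers[b], pos)
    alpha = 0.7213 / (1 + 1.079 / m)              # bias correction, m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

est = hll_cardinality_estimate(100_000)
print(round(est))  # typically within a few percent of 100 000
```

With m = 1024 registers the standard error is about 1.04/sqrt(1024) ≈ 3%, illustrating the accuracy/size tradeoff discussed above.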
We define an HLL bucket with value x to be 'k-anonymous' if at least k − 1 patients in the background population have hash value x; we call these corresponding hash values in the background population hash collisions (Yu and Weber, 2020). This means that if an attacker gains access to the sketch and can narrow the set of potential patients down to the background population, they cannot determine with certainty which of the k or more patients with that hash value was in the set of patients matching the query. Our goal is to determine the expected number of buckets that are not at least 10-anonymous (Fig. 1).
Fig. 1.

Illustration of HyperLogLog k-anonymity. A hospital has an identified set B contained within the background population A. Binary hashes are taken of all patient identifiers. Those hashes are first used to partition the patients into four buckets. Within each bucket of B, the smallest value is chosen as the representative. Then the k-anonymity of that bucket is the number of hashes in the corresponding bucket of the background population that share the same position of the leading 1 bit

We wish to note that in this article, we deliberately use the much weaker notion of privacy provided by k-anonymity (Emam and Dankar, 2008), rather than stronger alternatives like differential privacy (Dwork, 2008), which have provable protection against inference attacks. Unfortunately, differential privacy (and similar alternatives) is provably incompatible with any composable cardinality estimation (Desfontaines et al.). In practice, hospital IRBs accept the use of 10-anonymity for query-set patients as a useful metric, despite the known vulnerability of k-anonymity to inference attacks. Our article thus focuses on analyzing probabilistic sketches as a more private alternative to the standard practice of sending full hashed IDs.
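The bucket-level notion of k-anonymity can be checked by direct simulation, mirroring the article's ground-truth experiments. The following sketch is our own simplified version: uniform random numbers stand in for hashes, and we count buckets with at least one but fewer than k colliding background hashes (boundary conventions vary slightly in the text).

```python
import math, random

def non_k_anon_buckets(n, r, m, k=10, rng=None):
    """One simulated query: count buckets whose stored value collides with
    fewer than k background-population hashes. n = |A|, r = |B|/|A|."""
    rng = rng or random.Random()
    b = int(n * r)
    bucket = [rng.randrange(m) for _ in range(n)]
    # rho-value of a uniform hash: P(rho >= v) = 2^-v
    rho = [math.floor(-math.log2(1.0 - rng.random())) for _ in range(n)]
    # Treat the first b patients as the query set B; V[i] = max rho in bucket i
    V = [-1] * m
    for j in range(b):
        V[bucket[j]] = max(V[bucket[j]], rho[j])
    coll = [0] * m
    for j in range(n):  # background patients sharing the stored value
        if V[bucket[j]] == rho[j]:
            coll[bucket[j]] += 1
    return sum(1 for i in range(m) if 0 < coll[i] < k)

rng = random.Random(0)
avg = sum(non_k_anon_buckets(10_000, 0.1, 100, rng=rng) for _ in range(10)) / 10
print(avg)  # compare Table 2's first row (simulation average ~70.6)
```

Averaging more replicates tightens the estimate, but at the population sizes in Table 2 each trial already takes noticeable time, which is what motivates the analytical approximations below.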

2.2.2 Formal description

Let us recast the textual description above more rigorously as the following mathematical problem. Let A be a set and B a non-empty subset of A; A represents the background population and B the patients satisfying the query. We define r = |B|/|A|, the ratio of the number of patients satisfying the query to the background population (also sometimes known as concept prevalence). Let σ be a one-way hash function. In theory, we assume that σ is a shared random oracle available to both parties; in practice, a cryptographic hash function such as SHA-1, SHA-224 or SHA-256 (Johnson, 2020) is generally used. σ uniformly maps each element of A to a random real number in the interval (0, 1). Let ρ map (0, 1) to the non-negative integers, where ρ(x) is the number of 0 bits before the first 1 bit in the base-2 expansion of x. Let p be a map that randomly partitions patients into m buckets; in practice, this map can also be derived from a cryptographic hash function. From the partition function p, we define A_i = {a ∈ A : p(a) = i} and B_i = A_i ∩ B, which respectively represent the ith bucket of the whole database and of the sample. Let V_i = max_{b ∈ B_i} ρ(σ(b)) be the maximum number of zeros before the first one among all hash values, represented in base 2, in the ith bucket B_i of B. Let E_i = {a ∈ A_i : ρ(σ(a)) = V_i} be the set of elements in the ith bucket of A that collide with the elements of B. We want to compute P(|E_i| = k) and E[#{i : 1 ≤ |E_i| ≤ k}], the expected number of non-k-anonymous buckets.

2.3 Probability of exactly k collisions

As described above, we need to consider the collisions across all m buckets. Here, however, we first give a simple analysis with no partition function (i.e. the case m = 1) and compute the probability of each possible number of collisions, so that in the later sections we can use this result to compute the desired expected number of 'non-k-anonymous' buckets. Since there is only one bucket, there are only the two sets A and B, which represent the set of all patients and the set of patients matching the query, respectively. We denote by V = max_{b ∈ B} ρ(σ(b)) the maximum number of zeros before the first one among all hash values (in base 2) in B, and by E = {a ∈ A : ρ(σ(a)) = V} the set of collisions. We want to compute the probability that the number of collisions is at most k, that is, P(|E| ≤ k). Each element of σ(A) can be thought of as an i.i.d. random variable with distribution Unif(0, 1). Therefore, ρ(σ(a)) ≥ v if and only if σ(a) ≤ 2^(−v). Then we get P(ρ(σ(a)) ≥ v) = 2^(−v). Thus, P(ρ(σ(a)) = v) = 2^(−(v+1)).

Lemma 2.1. Given sets B ⊆ A with B non-empty, the probability of exactly k ≥ 1 collisions is

P(|E| = k) = Σ_{v≥0} Σ_{j=1}^{min(k, |B|)} C(|B|, j) (2^(−(v+1)))^j (1 − 2^(−v))^(|B|−j) · C(|A|−|B|, k−j) (2^(−(v+1)))^(k−j) (1 − 2^(−(v+1)))^(|A|−|B|−(k−j)),

where C(n, j) denotes the binomial coefficient and j counts the collisions contributed by B itself.

Proof. Since the sets A and B are fixed, we write P(k) for P(|E| = k) for notational simplicity. By the law of total probability, P(k) = Σ_{v≥0} P(|E| = k, V = v). First we consider the collisions in B: the event that V = v with exactly j collisions in B requires that j elements of B have ρ-value exactly v and that the remaining |B| − j have ρ-value < v, which has probability C(|B|, j) (2^(−(v+1)))^j (1 − 2^(−v))^(|B|−j); note that j ≥ 1, because the maximum is attained by some element of B. Next we consider the collisions in A \ B: exactly k − j of its elements must have ρ-value exactly v and the rest any value other than v, which has probability C(|A|−|B|, k−j) (2^(−(v+1)))^(k−j) (1 − 2^(−(v+1)))^(|A|−|B|−(k−j)). Since all hash values are independent, multiplying the two factors and summing over v and j gives the result. □
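The collision-count distribution above can be evaluated directly. The code below is a sketch under our reading of the lemma (binomial factors for B and for A \ B, summed over the bucket value v), with the infinite sum over v truncated at 64 bits as discussed later in the text; treat the exact expression as a reconstruction rather than the authors' code.

```python
from math import comb

def prob_exact_collisions(nA, nB, k, vmax=64):
    """P(|E| = k) for a single bucket: exactly k of the nA background
    hashes share the maximum rho-value V of the nB query hashes."""
    total = 0.0
    for v in range(vmax + 1):
        p_eq = 2.0 ** -(v + 1)   # P(rho = v)
        p_lt = 1.0 - 2.0 ** -v   # P(rho < v)
        p_ne = 1.0 - p_eq        # P(rho != v)
        for j in range(1, min(k, nB) + 1):  # j collisions come from B itself
            in_B = comb(nB, j) * p_eq**j * p_lt**(nB - j)
            in_rest = (comb(nA - nB, k - j) * p_eq**(k - j)
                       * p_ne**(nA - nB - (k - j)))
            total += in_B * in_rest
    return total

# Sanity check: with nB >= 1 there is always at least one collision,
# so the probabilities over k = 1..nA sum to 1 (up to 2^-64 truncation).
s = sum(prob_exact_collisions(20, 5, k) for k in range(1, 21))
print(s)
```

The normalization check is a useful guard: any sign or index error in the two binomial factors breaks it immediately.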

2.4 Expected number of buckets with less than k collisions

Recall that A is the background population and B the set of patients satisfying the query criteria. We denote the buckets of A and B under our partition function by A_i and B_i = A_i ∩ B for i = 1, …, m, and E_i is the set of collisions in the ith bucket. Thus, the expected value of the number of buckets with no more than k collisions is E[#{i : 1 ≤ |E_i| ≤ k}] = m · P(1 ≤ |E_1| ≤ k), by symmetry across buckets. Note that each patient lands in a given bucket independently with probability 1/m. Therefore, for a single bucket, say A_1, its cardinality follows a binomial distribution: |A_1| ~ Binomial(|A|, 1/m). For a given |A_1| = a, the bucket A_1 is a uniformly random subset of A of size a, so |B_1| follows a hypergeometric distribution: P(|B_1| = b | |A_1| = a) = C(|B|, b) C(|A|−|B|, a−b) / C(|A|, a). The expected number of buckets which have at least 1 collision but no more than k collisions is therefore:

E = m Σ_{a=1}^{|A|} P(|A_1| = a) Σ_{b=1}^{min(a, |B|)} P(|B_1| = b | |A_1| = a) Σ_{c=1}^{k} P(|E_1| = c | |A_1| = a, |B_1| = b),

where the innermost probability is given by Lemma 2.1 applied to the sets A_1 and B_1.

Proof. By linearity of expectation, E = m · P(1 ≤ |E_1| ≤ k). In order to compute this probability, we have to consider the range of (|A_1|, |B_1|). In contrast to the simple case in Section 2.3, here B_1 is not necessarily a non-empty subset of A_1, because A_1 can be the empty set, in which case B_1 is also empty. The collision number is zero if and only if B_1 is empty. Therefore, we expand the formula in Lemma 2.1 by conditioning on |A_1| and |B_1|. Furthermore, because we want to rule out the case of zero collisions (when the bucket contains no query patients, there is no patient ID for which we need to guarantee k-anonymity), we set the ranges of a and b to {1, …, |A|} and {1, …, min(a, |B|)}, respectively, which yields the displayed formula. □
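Putting the pieces together, the full (quadratic-time) expectation can be evaluated for small populations. Again, this follows our reconstruction of the formula: the binomial bucket size is mixed with the hypergeometric query-subset size, and the inner probability is the single-bucket analysis of Section 2.3.

```python
from math import comb

def bucket_collision_prob(a, b, c, vmax=64):
    """P(exactly c collisions) in a bucket with a background and b query
    hashes (b >= 1), per the single-bucket analysis."""
    total = 0.0
    for v in range(vmax + 1):
        p_eq, p_lt = 2.0 ** -(v + 1), 1.0 - 2.0 ** -v
        for j in range(1, min(c, b) + 1):
            total += (comb(b, j) * p_eq**j * p_lt**(b - j)
                      * comb(a - b, c - j) * p_eq**(c - j)
                      * (1 - p_eq)**(a - b - (c - j)))
    return total

def expected_flagged_buckets(nA, nB, m, k):
    """E[# buckets with between 1 and k collisions], summing over the
    Binomial(|A|, 1/m) bucket size and the hypergeometric |B_1| given |A_1|."""
    expect = 0.0
    for a in range(1, nA + 1):
        pa = comb(nA, a) * (1 / m) ** a * (1 - 1 / m) ** (nA - a)
        for b in range(1, min(a, nB) + 1):
            pb = comb(nB, b) * comb(nA - nB, a - b) / comb(nA, a)
            p_k = sum(bucket_collision_prob(a, b, c) for c in range(1, k + 1))
            expect += pa * pb * p_k
    return m * expect

# Consistency check: with k = nA every non-empty B-bucket is counted, so the
# expectation reduces to m * P(|B_1| >= 1) = m * (1 - (1 - 1/m)**nB).
val = expected_flagged_buckets(30, 5, 3, k=30)
print(val)
```

Even at these toy sizes the triple sum is slow, which previews the complexity analysis of Section 3.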

3 Algorithms

3.1 Time complexity of evaluating expectation

Again, recall that A is the background population, B the set of patients satisfying the query criteria and E_1 the set of collisions in a bucket. In Section 2.4, we gave an explicit formula for computing the expected number of non-k-anonymous buckets. However, the time complexity of carrying out that computation is troublesome. In Lemma 2.1, the sum over j has at most k terms (usually k is smaller than |B_1|), and the infinite sum over v can be truncated at 64 (or some other constant < 100), because v represents the maximum number of zeros before the first one among all hash values in base 2; as there are only 7 billion people on Earth, 64 bits are sufficient for the original hash function to have a low probability of collisions. Therefore, computing the probability of at most k collisions for one pair (|A_1|, |B_1|) takes O(k^2) time, treating the 64-term truncation as a constant. We now consider the time complexity of computing the desired expectation. Theoretically, the range of |A_1| is {0, …, |A|} and the range of |B_1| is {0, …, min(|A_1|, |B|)}. Therefore, the computation time is O(k^2 |A| |B|), which is quadratic in the size of the population for at most k collisions. In practice, for large set sizes, it is computationally infeasible to use this theoretical formula to compute the desired expectation; thus, in the remainder of this article we analyze fast approximations.

3.2 Approximation A1: concentration inequalities

When |A| is large, it is infeasible to sum over the whole range of |A_1|. Therefore, we use concentration inequalities to restrict |A_1| and |B_1| to smaller ranges. Because there is only an exponentially small probability that |A_1| and |B_1| fall outside these restricted windows, this has minimal effect on the final answer while reducing the computation time from quadratic to linear in the size of A. Recall that |A_1| ~ Binomial(|A|, 1/m), with mean |A|/m, and that |B_1| given |A_1| is hypergeometric. In order to reduce the time complexity, we restrict |A_1| in our computations to an interval (L_A, U_A) centered on its mean. For |B_1|, we define a dominating binomial distribution whose variance is greater than that of the hypergeometric distribution of |B_1|, and restrict |B_1| to a corresponding interval (L_B, U_B), in order to compute the error bound more easily below in Section 3.2.1. After concentration, we can make sure that P(|A_1| outside (L_A, U_A)) and P(|B_1| outside (L_B, U_B)) are both negligibly small, as shown in detail in Section 3.2.1. As an aside, while these two intervals for |A_1| and |B_1| have been chosen so that the error bound and time complexity can be analyzed analytically, in the computing code we can directly use built-in functions to compute the relevant confidence intervals for |A_1| and |B_1|. Using the concentration inequalities on |A_1| and |B_1|, the desired expectation is approximated by:

E1 = m Σ_{a=L_A}^{U_A} P(|A_1| = a) Σ_{b=max(1, L_B)}^{U_B} P(|B_1| = b | |A_1| = a) Σ_{c=1}^{k} P(|E_1| = c | |A_1| = a, |B_1| = b).

The computation time after concentration is O(k^2 (U_A − L_A)(U_B − L_B)); since both window widths are O(sqrt(|A|/m)), the time complexity after concentration is O(k^2 |A|/m), which is linear in |A|. After concentration, the estimated expected value E1 is smaller than the actual expectation, but we can bound the error.
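The built-in-function route mentioned above can be illustrated with nothing but the standard library. This sketch (our own, with an assumed default alpha) scans the Binomial(|A|, 1/m) pmf via log-gamma, to avoid overflow at large |A|, for a central window whose tails each hold at most alpha/2 mass; the hypergeometric window for |B_1| can be found the same way.

```python
from math import lgamma, log, exp

def log_binom_pmf(n, x, p):
    """log P(X = x) for X ~ Binomial(n, p)."""
    return (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
            + x * log(p) + (n - x) * log(1 - p))

def binomial_window(n, p, alpha=1e-9):
    """Central interval [lo, hi] leaving at most ~alpha/2 mass in each tail."""
    lo, hi, cdf = None, n, 0.0
    for x in range(n + 1):
        cdf += exp(log_binom_pmf(n, x, p))
        if lo is None and cdf >= alpha / 2:
            lo = x
        if cdf >= 1 - alpha / 2:
            hi = x
            break
    return lo, hi

lo, hi = binomial_window(100_000, 1 / 100)   # |A| = 1e5, m = 100, mean 1000
print(lo, hi)  # a window of width O(sqrt(|A|/m)) around |A|/m
```

The window holds a few hundred values of |A_1| instead of all 100 000, which is exactly the quadratic-to-linear saving claimed above.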

3.2.1 Error bounds

Recall that |A_1| ~ Binomial(|A|, 1/m), with mean |A|/m, and that we concentrate |A_1| in the interval (L_A, U_A). Define F as the cumulative distribution function of |A_1|. First, we consider the concentration of |A_1|. We apply the higher-moments inequality (Blum et al.): choosing the moment r = 6 yields a polynomially small bound on the tail mass F(L_A) + (1 − F(U_A)) outside the window. Then we consider |B_1| for a given |A_1| = a. For a given a, we know that |B_1| ~ Hypergeometric(|A|, |B|, a), with expected value ar and variance ar(1 − r)(|A| − a)/(|A| − 1). We concentrate |B_1| in an interval (L_B, U_B) around ar. Note that a Binomial(a, r) random variable has the same expected value ar but variance ar(1 − r), which is bigger than the variance of |B_1| for this given a. This reflects the fact that the hypergeometric distribution is more concentrated about its mean than the binomial distribution (Kalbfleisch, 1985). Therefore, we can use this binomial distribution to bound the tail of our hypergeometric distribution. In the computing code, however, we can simply use built-in functions to find intervals (L_A, U_A) and (L_B, U_B) such that P(|A_1| outside (L_A, U_A)) ≤ α and P(|B_1| outside (L_B, U_B)) ≤ α. This does not affect the time complexity, and we can ensure that the absolute error between the estimated expected value and the actual expected value is < 1 by choosing a proper α. Clearly, the smaller α is, the smaller the error will be, but the intervals (L_A, U_A) and (L_B, U_B) will be wider, which means a longer computing time. Therefore, there is a tradeoff between accuracy and speed (see Table 2 for real computing times). Fortunately, in all the cases we explore, the intervals given above ensure that the absolute error is < 1.
Table 2

Expected number of non-10-anonymous buckets from Approximations A1 and A2 compared against ground truth simulations

|A|/m | |A| | m | r | Simulation average | Simulation replicates | A1 | A1 time (s) | A2 | A2 time (s)
100 | 10 000 | 100 | 0.1 | 70.60 | 100 | 70.28 | 14.75 | 72.76 | 2.00
50 | 10 000 | 200 | 0.1 | 141.14 | 100 | 141.12 | 6.61 | 149.85 | 1.00
20 | 10 000 | 500 | 0.1 | 354.38 | 100 | 353.74 | 2.60 | 414.61 | 0.20
300 | 30 000 | 100 | 0.1 | 70.68 | 100 | 70.60 | 62.26 | 71.60 | 3.00
150 | 30 000 | 200 | 0.1 | 141.79 | 100 | 141.59 | 26.66 | 144.55 | 2.00
60 | 30 000 | 500 | 0.1 | 354.90 | 100 | 354.65 | 9.88 | 372.67 | 1.00
30 | 30 000 | 1000 | 0.1 | 712.22 | 100 | 709.73 | 4.12 | 783.81 | 0.40
500 | 50 000 | 100 | 0.1 | 71.87 | 100 | 70.71 | 84.74 | 71.44 | 4.00
250 | 50 000 | 200 | 0.1 | 142.94 | 100 | 141.76 | 47.65 | 143.68 | 3.00
100 | 50 000 | 500 | 0.1 | 352.96 | 100 | 353.20 | 19.70 | 363.80 | 2.00
50 | 50 000 | 1000 | 0.1 | 707.75 | 100 | 706.99 | 8.19 | 749.25 | 0.70
800 | 80 000 | 100 | 0.1 | 70.57 | 100 | 70.40 | 136.09 | 70.98 | 4.00
400 | 80 000 | 200 | 0.1 | 142.29 | 100 | 140.90 | 77.41 | 142.45 | 3.00
160 | 80 000 | 500 | 0.1 | 354.54 | 100 | 353.85 | 34.99 | 360.40 | 2.00
80 | 80 000 | 1000 | 0.1 | 704.88 | 100 | 707.91 | 16.47 | 734.01 | 1.00
1000 | 100 000 | 100 | 0.1 | 71.13 | 100 | 70.69 | 252.00 | 71.24 | 4.60
500 | 100 000 | 200 | 0.1 | 142.77 | 100 | 141.76 | 134.00 | 142.87 | 3.18
200 | 100 000 | 500 | 0.1 | 354.27 | 100 | 353.37 | 40.57 | 358.63 | 2.00
100 | 100 000 | 1000 | 0.1 | 705.02 | 100 | 706.95 | 22.16 | 727.60 | 1.30
50 | 100 000 | 2000 | 0.1 | 1416.61 | 100 | 1414.37 | 9.95 | 1498.50 | 0.70
20 | 100 000 | 5000 | 0.1 | 3536.27 | 100 | 3539.54 | 3.37 | 4146.33 | 0.20
3000 | 300 000 | 100 | 0.1 | 70.47 | 100 | | | 70.76 | 8.00
300 | 300 000 | 1000 | 0.1 | 709.57 | 100 | 709.02 | 90.00 | 715.96 | 2.40
5000 | 500 000 | 100 | 0.1 | 71.13 | 100 | | | 70.91 | 10.00
500 | 500 000 | 1000 | 0.1 | 708.77 | 100 | 710.08 | 155.00 | 714.36 | 3.00
8000 | 800 000 | 100 | 0.1 | 71.92 | 100 | | | 71.06 | 14.00
800 | 800 000 | 1000 | 0.1 | 708.00 | 100 | 707.00 | 25.00 | 709.76 | 4.00
10 000 | 1 000 000 | 100 | 0.1 | 70.58 | 100 | | | 70.89 | 16.00
2000 | 1 000 000 | 500 | 0.1 | 356.33 | 100 | 354.85 | 607.00 | 355.69 | 7.00
1000 | 1 000 000 | 1000 | 0.1 | 707.66 | 100 | 710.06 | 316.00 | 712.36 | 5.00
500 | 1 000 000 | 2000 | 0.1 | 1419.32 | 60 | 1420.47 | 150.00 | 1428.72 | 3.00
200 | 1 000 000 | 5000 | 0.1 | 3534.64 | 50 | 3536.36 | 65.00 | 3586.25 | 2.00
100 | 1 000 000 | 10 000 | 0.1 | 7068.96 | 50 | 7073.09 | 30.00 | 7275.97 | 1.30
50 | 1 000 000 | 20 000 | 0.1 | | | 14 146.62 | 12.00 | 14 985.00 | 0.70
20 | 1 000 000 | 50 000 | 0.1 | | | 35 396.39 | 4.00 | 41 463.50 | 0.20
30 000 | 3 000 000 | 100 | 0.1 | 71.01 | 100 | | | 70.98 | 30.00
3000 | 3 000 000 | 1000 | 0.1 | 703.79 | 100 | | | 707.55 | 8.00
50 000 | 5 000 000 | 100 | 0.1 | 71.54 | 100 | | | 70.71 | 40.00
5000 | 5 000 000 | 1000 | 0.1 | 708.32 | 100 | | | 709.07 | 12.00
80 000 | 8 000 000 | 100 | 0.1 | 71.01 | 100 | | | 70.87 | 50.00
8000 | 8 000 000 | 1000 | 0.1 | 707.81 | 70 | | | 710.63 | 15.00
100 000 | 10 000 000 | 100 | 0.1 | 70.48 | 100 | | | 70.71 | 55.00
20 000 | 10 000 000 | 500 | 0.1 | 354.08 | 100 | | | 354.39 | 30.00
10 000 | 10 000 000 | 1000 | 0.1 | 711.81 | 70 | | | 708.87 | 16.00
5000 | 10 000 000 | 2000 | 0.1 | | | | | 1418.13 | 11.00
2000 | 10 000 000 | 5000 | 0.1 | | | 3551.59 | 726.00 | 3556.85 | 7.00
1500.15 | 10 000 000 | 6666 | 0.1 | | | 4711.94 | 547.00 | 4720.90 | 5.70
1000 | 10 000 000 | 10 000 | 0.1 | | | 7103.56 | 366.86 | 7123.63 | 4.60
666.7 | 10 000 000 | 15 000 | 0.1 | | | 10 614.93 | 250.00 | 10 659.18 | 3.70
500 | 10 000 000 | 20 000 | 0.1 | | | 14 207.49 | 192.00 | 14 287.16 | 3.00
200 | 10 000 000 | 50 000 | 0.1 | | | 35 366.18 | 79.00 | 35 862.54 | 2.00

Note: Some entries are empty because the computation time was infeasibly long. We have highlighted (in yellow or green) the more accurate approximation that finished within 10 min. Full simulation and computation results for the other values of r are available on Github in machine-readable format.

3.3 Approximation A2: mean-field approximation

Although the time complexity after concentration is linear in |A|, for |A| large and m small this speedup is often still not enough. We can further approximate |B_1| by its conditional mean |A_1| · r and get the following approximation of the expectation:

E2 = m Σ_{a=L_A}^{U_A} P(|A_1| = a) Σ_{c=1}^{k} P(|E_1| = c | |A_1| = a, |B_1| = ar).

This is a 'mean-field' approximation based on Approximation A1. The basic idea behind this approximation is to use the probability at the mean value ar to represent all the probabilities for |B_1| in (L_B, U_B), because the inner probability varies monotonically in |B_1| and the interval (L_B, U_B) is small compared with the theoretical range {0, …, min(a, |B|)}. The range of |A_1| is still (L_A, U_A). Therefore, the computation time of E2 is O(k^2 (U_A − L_A)), and the time complexity is O(k^2 sqrt(|A|/m)). The real computing time is discussed in Section 4. Unfortunately, we do not have a strong provable guarantee for this approximation, but it seems empirically to work well in practice.
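Under the same reconstruction as in the earlier sections, the mean-field shortcut amounts to collapsing the inner sum over |B_1| to the single point b = round(a·r). A self-contained sketch (our own names and alpha; not the repository's implementation):

```python
from math import comb, lgamma, log, exp

def collision_cdf(a, b, k, vmax=64):
    """P(1 <= |E| <= k) for a bucket with a background and b query hashes."""
    total = 0.0
    for v in range(vmax + 1):
        p_eq, p_lt = 2.0 ** -(v + 1), 1.0 - 2.0 ** -v
        for c in range(1, k + 1):
            for j in range(1, min(c, b) + 1):
                total += (comb(b, j) * p_eq**j * p_lt**(b - j)
                          * comb(a - b, c - j) * p_eq**(c - j)
                          * (1 - p_eq)**(a - b - (c - j)))
    return total

def approx_A2(nA, nB, m, k=10, alpha=1e-9):
    """Mean-field approximation E2: sum over a concentration window of
    |A_1| ~ Binomial(|A|, 1/m), with |B_1| frozen at its mean a*r."""
    r = nB / nA
    expect, cdf = 0.0, 0.0
    for a in range(nA + 1):
        pa = exp(lgamma(nA + 1) - lgamma(a + 1) - lgamma(nA - a + 1)
                 + a * log(1 / m) + (nA - a) * log(1 - 1 / m))
        cdf += pa
        if cdf < alpha / 2:        # below the window: skip
            continue
        b = round(a * r)
        if b >= 1:
            expect += pa * collision_cdf(a, b, k)
        if cdf >= 1 - alpha / 2:   # past the window: stop
            break
    return m * expect

e2 = approx_A2(10_000, 1_000, 100)
print(round(e2, 1))  # compare Table 2's first row (A2 = 72.76)
```

Dropping the hypergeometric sum is what buys the extra speed: only one collision probability is evaluated per bucket-size value a.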

4 Results

In order to assess the accuracy-speed tradeoffs of our two approximations, we ran simulations measuring the ground-truth empirical k-anonymity of patients in several different regimes using HLL sketches. Those simulations serve as the ground truth because they have the same distribution as hashing real patient identifiers with a random seed, without needing to use real patient data for this article. We then compared those empirical values against the approximations described in this article. In the large-cardinality regimes, it is computationally infeasible to run full simulations, so there we only compare the run-times of the two approximation methods. In Table 2, we provide full tables of these results. In Table 1, we provide a high-level summary giving a practitioner guidance on which method is appropriate under particular parameter choices. All computations were run in single-threaded mode on an AMD Ryzen Threadripper 3970X 32-core CPU machine running Ubuntu 18.04.5 LTS (bionic) with 256 GiB of RAM. It is worth mentioning that the steps in the computations are trivially parallelizable, but for benchmarking purposes all our results are of single-threaded performance. Additionally, instead of using actual hash functions (e.g. SHA-256), we generate uniform random numbers as the hashed values, which have the same probability distribution. Code is available on Github and relies on the numpy, scipy.stats and decimal packages for simulation of patient hashes and explicit computation of probability distributions: https://github.com/tzyRachel/K-anonymity-Expectation
Table 1.

Choice table for approximation method

|A|/m | |A| | m | r = 0.1 | r = 0.08 | r = 0.05 | r = 0.01 | r = 0.005 | r = 0.001
100 | 10^4 | 100 | | | | | |
50 | 10^4 | 200 | | | | | |
20 | 10^4 | 500 | | | | | |
1000 | 10^5 | 100 | | | | | |
500 | 10^5 | 200 | | | | | |
200 | 10^5 | 500 | | | | | |
100 | 10^5 | 1000 | | | | | |
50 | 10^5 | 2000 | | | | | |
20 | 10^5 | 5000 | | | | | |
10 000 | 10^6 | 100 | | | | | |
2000 | 10^6 | 500 | | | | | |
1500 | 10^6 | 666 | | | | | |
1000 | 10^6 | 1000 | | | | | |
500 | 10^6 | 2000 | | | | | |
200 | 10^6 | 5000 | | | | | |
100 | 10^6 | 10 000 | | | | | |
50 | 10^6 | 20 000 | | | | | |
20 | 10^6 | 50 000 | | | | | |
100 000 | 10^7 | 100 | | | | | |
20 000 | 10^7 | 500 | | | | | |
10 000 | 10^7 | 1000 | | | | | |
5000 | 10^7 | 2000 | | | | | |
3333 | 10^7 | 3000 | | | | | |
2000 | 10^7 | 5000 | | | | | |
1500 | 10^7 | 6666 | | | | | |
1000 | 10^7 | 10 000 | | | | | |
500 | 10^7 | 20 000 | | | | | |
200 | 10^7 | 50 000 | | | | | |

Note: A is the total size of the hospital background population, m is the number of buckets used in the HyperLogLog sketch and r is the fraction of the background population that matches the query criteria. ‘A1’ and ‘A2’, respectively, denote approximations 1 and 2. For every one of the parameter regimes, we used simulations to determine which of the approximation methods is more suitable for the practitioner.

Recall that A represents the set of all patients, B the set of patients who meet the query criteria and m the number of buckets in the HLL process. We introduce r = |B|/|A| to represent the ratio of |B| to |A| because, as we will see, this ratio controls the number of collisions to a large extent. Intuitively, 1/r represents the number of background-population persons available to provide plausible deniability to each patient in the query set. Our simulations sweep over different combinations of the parameters |A|, r and m to construct a table assessing Approximations A1 and A2. In all simulations, we restrict |A| to the interval [10^4, 10^7] and m to the interval [100, 50 000]. In this article, the total number of distinct patients in a hospital is assumed to be over 10^4; not only are approximation methods unnecessary below that size, because exact computations are feasible, but there is also not a sufficiently large background population to hide the query set when |A| is small, and the privacy characteristics then become equivalent to sending hashed IDs (Yu and Weber, 2020).
Since the simulations are run under the condition of '10-anonymity', we make sure that the mean single-bucket size |A|/m is sufficiently large (at least 20 in our sweeps). Also, r is restricted to the interval [0.001, 0.1]: we choose six different values of r, namely 0.1, 0.08, 0.05, 0.01, 0.005 and 0.001, run the simulations and compare the simulation results with the computed results. As discussed in Section 2, we can estimate the desired expected value by both Approximations A1 and A2. The final choice of Approximation A1 or A2 turns out to depend primarily on |A|/m. In most cases, when |A|/m is large, Approximation A2 is accurate enough and the computing time is no longer than 3 min. When |A|/m is small, Approximation A2 is not accurate enough and we have to choose Approximation A1. The computing time of Approximation A1 is proportional to |A|, which is sometimes a concern. In most regimes we tested, that computing time is no longer than 8 min, but there are several special cases, such as when r = 0.1 or r = 0.08 at large |A|, where the computing time is ∼10 min, which might be acceptable but is really not ideal. Furthermore, in extreme cases, the approximate expected k-anonymity returned by Approximations A1 and A2 can differ by more than 10% (Table 2). To make things easier for the end-practitioner, we provide a summary 'choice' table (Table 1) suggesting which approximation to use, based on the number of patients, the number of buckets and the ratio of the number of patients matching the query to all patients. Choosing between Approximations A1 and A2 is an accuracy/running-time tradeoff: A1 is usually both more accurate and more expensive than A2. For the purposes of the choice table, to give a concrete recommendation, we aim for single-threaded running times below 10 min; many modern multi-core machines can run over a dozen threads at once, and given that the approximation algorithms are trivially parallelizable, this amounts implicitly to a goal of real wall-clock time of less than a minute.
The choice table is filled out by selecting the approximation method with the least error under that time constraint. In most cases, we choose A1 if its running time is below 10 min. Sometimes the results from A1 and A2 are almost identical, in which case the faster method can be chosen. Based on this rule, we compare gold-standard simulation results against the approximations in Table 2 to construct the 'choice' table. Note that our choice of a 10-min single-threaded run time was arbitrary; given extra computational resources, the ideal switch-over point between approximations will vary. Figure 2 shows the errors between the approximation results (based on choice Table 1) and the simulation results (Table 2) when the number of distinct patients is 10^7 and the number of buckets is 100 and 1000, respectively. The absolute values of all the errors are no more than 4.
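The selection rule can be caricatured in a few lines of Python. This is a toy stand-in for the paper's choice table, not a reproduction of it: the numeric cutoff below is our own illustrative assumption, and the real recommendation depends jointly on A, m and r via Table 1.

```python
def choose_approximation(A, m, threshold=1000):
    """Toy stand-in for the choice table (Table 1).

    Approximation A2 is preferred when the mean bucket size A/m is
    large (it is fast and accurate there); A1 is preferred when A/m is
    small.  The `threshold` value is illustrative only, not the
    paper's exact cutoff.
    """
    return "A2" if A / m >= threshold else "A1"
```

Under this illustrative cutoff, choose_approximation(10**7, 100) selects A2 (mean bucket size 10^5), while choose_approximation(10**7, 10**5) selects A1 (mean bucket size 100).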
Fig. 2.

Errors between Approximation (based on choice table) and simulation of 100 random trials with number of buckets = 100 (top) and 1000 (bottom)


5 Discussion

We first note that all of the approximations we have provided finish on the order of minutes. As they are analytical approximations, there is also no need to run them multiple times. Although we have not shown explicit simulation run times in the tables above, the larger simulations take upwards of hours; furthermore, we did not perform simulations for the largest parameter ranges because we expected those to take significantly longer. Our approximations thus speed up determining the expected privacy loss from distributing HLL sketches. We are also able to draw some general conclusions about the expected privacy of HLL sketches. As mentioned above, the prevalence ratio r = B/A, where A and B are, respectively, the background population and the query population, can be interpreted as the fraction of patients matching a query (e.g. 'How many patients have been diagnosed with diabetes?'). In the HLL setting, m is the number of buckets, and A_i and B_i denote the ith buckets of A and B. Figure 3 plots the number of buckets and the prevalence rate against the estimated expected number of non-'k-anonymous' buckets, as well as the number of buckets versus the percentage of non-'k-anonymous' buckets. The two top plots simply show the number of non-k-anonymous buckets against the number of buckets while varying the other parameters, but this turns out not to be the right set of variables to control.
Fig. 3.

Expected number of non-10-anonymous buckets under different combinations of number of buckets (m) and prevalence rate (r) when total number of patients is 107. (Top) Number of non-10-anonymous buckets under different combinations of m (number of buckets) and r (prevalence rate) when total number of patients is 107. (Left bottom) However, the fraction of non-10-anonymous buckets remains constant as the number of buckets increase when the other variables are held fixed. (Right bottom) It is the relationship to prevalence rate that is more complicated and nonlinear, as shown by focusing on the behavior for 100 and 500 buckets

Instead, as evidenced by the lower-left plot (Fig. 3), a roughly constant fraction of the buckets is not k-anonymized when r is held constant. This is unsurprising because, as mentioned earlier, 1/r is intuitively the number of background-population members that could be used to hide each patient. Of course, random chance also plays a large role. More precisely, this constant is close to p, where p is the probability that the number of collisions is >0 and <10 when the bucket size is at its mean value A/m. It is not quite equal for two reasons. The first reason is the obvious one: we are using the approximations that form the subject of this article. The second reason is that the single bucket size follows a Binomial(A, 1/m) distribution with mean A/m. When A and m are big enough, we can estimate p by concentrating the distribution in an interval centered at the mean, similar to (but simpler than) what we did in Approximation A2. However, when A/m is not that big, the collision probability differs substantially across the bucket sizes that occur with appreciable probability. Now that it is clear that r is the value of primary importance, we see in the lower-right plot of Figure 3 that as the prevalence rate r increases, more buckets are non-'k-anonymized'. This is because bigger r means more overlap between the sets A and B, and hence between each pair of buckets A_i and B_i.
Thus, the maximum number of zeros before the first one among the hash values in B_i is more likely to equal that in A_i. A hospital IRB or clinical query system seeking to understand the 10-anonymity of a particular query can therefore use a first-order approximation based only on r, without even needing to run our code: they need only consult the lower-right plot in Figure 3 and scale to the size of their background population. When a more precise result is needed, however, our two Approximations can provide an answer in only a few minutes. Of course, if even that is insufficient, the practitioner may choose to directly measure the k-anonymity of a particular HLL sketch; this is beyond the scope of this article, but was done empirically in prior work (Yu and Weber, 2020).
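The concentration heuristic behind Approximation A2 (evaluating the collision probability only near the mean bucket size A/m) can be illustrated numerically. The sketch below is our own illustration: when m is large, the Binomial(A, 1/m) bucket-size law is well approximated by a Poisson(A/m) distribution, and nearly all of its mass lies within a few standard deviations of the mean; the parameter values and helper name are assumptions chosen for illustration.

```python
import math

def poisson_pmf(lam, n):
    # Poisson approximation to the Binomial(A, 1/m) bucket-size law
    return math.exp(-lam) * lam**n / math.factorial(n)

# mass of the bucket-size distribution within 3 standard deviations
# of the mean bucket size A/m, for one illustrative parameter setting
A, m = 10**5, 1000
lam = A / m                         # mean bucket size A/m = 100
lo = max(int(lam - 3 * lam**0.5), 0)
hi = int(lam + 3 * lam**0.5)
mass = sum(poisson_pmf(lam, n) for n in range(lo, hi + 1))
# mass is close to 1, so evaluating the collision probability only at
# bucket sizes near the mean loses very little probability mass
```

Because the computed mass is close to 1, a practitioner gets a serviceable first-order answer from the behavior at the mean bucket size alone, which is why the constant-fraction heuristic above works when A/m is large.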

6 Conclusion

In this article, we have developed a method to quickly compute the expected number of non-'k-anonymous' buckets in an HLL sketch. Because the number of patients (denoted A in our model) is too big to compute the precise expected value, we introduced two approximations based on concentration inequalities. In general, Approximation A1 is suitable when the expected single-bucket size A/m is 'small', for example, when the total number of patients A is 10^5 and the number of buckets m is 100, or when A is 10^7 and m is 10^5. Approximation A2 is suitable when A/m is 'big', for example, when A is 10^7 and m is 100 (see the choice table in Section 4). By an appropriate choice of approximation method, we can keep the computing time under 300 s in almost all cases. In other words, when an individual hospital is asked to return aggregate counts by sharing HLL sketches, we can compute the expected number of buckets that match fewer than 10 patients in the background population. If this number is too high, that is a signal to the clinical query system that the particular query is unsafe to release using HLL sketches. It is then up to the clinical query system to decide whether to fall back on another aggregation method, or simply not respond to the query. Our results further give some guidance on the parameter ranges in which HLL sketches are likely to be safe to release. HLL sketches are especially useful for rare diseases, where the prevalence ratio in the population is low. Note that this is in marked contrast to sending raw counts, for which rare diseases are precisely the least k-anonymous; HLL sketches thus fill a complementary role. Indeed, at the heart of the problem is the tradeoff between the utility/accuracy of HLL sketches and their privacy, which respectively increase and decrease with the number of buckets.
The average k-anonymity of a bucket is roughly inversely proportional to the square of the estimation error; our work instead computes the number of buckets that are not at least 10-anonymous. For more guidance on this tradeoff, we refer the reader to prior work, where we graphed it empirically (Yu and Weber, 2020). Ultimately, our work is primarily useful in contexts where federated clinical query systems are used in biomedical research. The past year has seen increasing amounts of data centralization to combat the COVID-19 pandemic. The cost to privacy has been accepted because of the urgent, clear and present need. However, in the post-pandemic era, as the pendulum swings the other direction, privacy may again take center stage. We hope that our work will be useful in analyzing the privacy consequences of distributed query systems and will help inform policy-makers and institutional IRBs about the privacy-utility tradeoffs at hand.

Data Availability

All of the data and code used to generate benchmarks are available on GitHub: https://github.com/tzyRachel/K-anonymity-Expectation

Funding

We acknowledge startup funding from the University of Toronto Department of Computer and Mathematical Sciences. Conflict of Interest: none declared.
References (19 in total)

1.  Protecting privacy using k-anonymity.

Authors:  Khaled El Emam; Fida Kamal Dankar
Journal:  J Am Med Inform Assoc       Date:  2008-06-25       Impact factor: 4.497

2.  The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories.

Authors:  Griffin M Weber; Shawn N Murphy; Andrew J McMurry; Douglas Macfadden; Daniel J Nigrin; Susanne Churchill; Isaac S Kohane
Journal:  J Am Med Inform Assoc       Date:  2009-06-30       Impact factor: 4.497

3.  Evaluating re-identification risks with respect to the HIPAA privacy rule.

Authors:  Kathleen Benitez; Bradley Malin
Journal:  J Am Med Inform Assoc       Date:  2010 Mar-Apr       Impact factor: 4.497

4.  Inherent privacy limitations of decentralized contact tracing apps.

Authors:  Yoshua Bengio; Daphne Ippolito; Richard Janda; Max Jarvie; Benjamin Prud'homme; Jean-François Rousseau; Abhinav Sharma; Yun William Yu
Journal:  J Am Med Inform Assoc       Date:  2021-01-15       Impact factor: 4.497

5.  The need for privacy with public digital contact tracing during the COVID-19 pandemic.

Authors:  Yoshua Bengio; Richard Janda; Yun William Yu; Daphne Ippolito; Max Jarvie; Dan Pilat; Brooke Struck; Sekoul Krastev; Abhinav Sharma
Journal:  Lancet Digit Health       Date:  2020-06-02

6.  Launching PCORnet, a national patient-centered clinical research network.

Authors:  Rachael L Fleurence; Lesley H Curtis; Robert M Califf; Richard Platt; Joe V Selby; Jeffrey S Brown
Journal:  J Am Med Inform Assoc       Date:  2014-05-12       Impact factor: 4.497

7.  Software-Enabled Distributed Network Governance: The PopMedNet Experience.

Authors:  Melanie Davies; Kyle Erickson; Zachary Wyner; Jessica Malenfant; Rob Rosen; Jeffrey Brown
Journal:  EGEMS (Wash DC)       Date:  2016-03-30

8.  International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium.

Authors:  Gabriel A Brat; Griffin M Weber; Nils Gehlenborg; Paul Avillach; Nathan P Palmer; Luca Chiovato; James Cimino; Lemuel R Waitman; Gilbert S Omenn; Alberto Malovini; Jason H Moore; Brett K Beaulieu-Jones; Valentina Tibollo; Shawn N Murphy; Sehi L' Yi; Mark S Keller; Riccardo Bellazzi; David A Hanauer; Arnaud Serret-Larmande; Alba Gutierrez-Sacristan; John J Holmes; Douglas S Bell; Kenneth D Mandl; Robert W Follett; Jeffrey G Klann; Douglas A Murad; Luigia Scudeller; Mauro Bucalo; Katie Kirchoff; Jean Craig; Jihad Obeid; Vianney Jouhet; Romain Griffier; Sebastien Cossin; Bertrand Moal; Lav P Patel; Antonio Bellasi; Hans U Prokosch; Detlef Kraska; Piotr Sliz; Amelia L M Tan; Kee Yuan Ngiam; Alberto Zambelli; Danielle L Mowery; Emily Schiver; Batsal Devkota; Robert L Bradford; Mohamad Daniar; Christel Daniel; Vincent Benoit; Romain Bey; Nicolas Paris; Patricia Serre; Nina Orlova; Julien Dubiel; Martin Hilka; Anne Sophie Jannot; Stephane Breant; Judith Leblanc; Nicolas Griffon; Anita Burgun; Melodie Bernaux; Arnaud Sandrin; Elisa Salamanca; Sylvie Cormont; Thomas Ganslandt; Tobias Gradinger; Julien Champ; Martin Boeker; Patricia Martel; Loic Esteve; Alexandre Gramfort; Olivier Grisel; Damien Leprovost; Thomas Moreau; Gael Varoquaux; Jill-Jênn Vie; Demian Wassermann; Arthur Mensch; Charlotte Caucheteux; Christian Haverkamp; Guillaume Lemaitre; Silvano Bosari; Ian D Krantz; Andrew South; Tianxi Cai; Isaac S Kohane
Journal:  NPJ Digit Med       Date:  2020-08-19

9.  Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation.

Authors:  Yun William Yu; Griffin M Weber
Journal:  J Med Internet Res       Date:  2020-11-03       Impact factor: 5.428

10.  The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment.

Authors:  Melissa A Haendel; Christopher G Chute; Tellen D Bennett; David A Eichmann; Justin Guinney; Warren A Kibbe; Philip R O Payne; Emily R Pfaff; Peter N Robinson; Joel H Saltz; Heidi Spratt; Christine Suver; John Wilbanks; Adam B Wilcox; Andrew E Williams; Chunlei Wu; Clair Blacketer; Robert L Bradford; James J Cimino; Marshall Clark; Evan W Colmenares; Patricia A Francis; Davera Gabriel; Alexis Graves; Raju Hemadri; Stephanie S Hong; George Hripscak; Dazhi Jiao; Jeffrey G Klann; Kristin Kostka; Adam M Lee; Harold P Lehmann; Lora Lingrey; Robert T Miller; Michele Morris; Shawn N Murphy; Karthik Natarajan; Matvey B Palchuk; Usman Sheikh; Harold Solbrig; Shyam Visweswaran; Anita Walden; Kellie M Walters; Griffin M Weber; Xiaohan Tanner Zhang; Richard L Zhu; Benjamin Amor; Andrew T Girvin; Amin Manna; Nabeel Qureshi; Michael G Kurilla; Sam G Michael; Lili M Portilla; Joni L Rutter; Christopher P Austin; Ken R Gersing
Journal:  J Am Med Inform Assoc       Date:  2021-03-01       Impact factor: 7.942

