Privacy Risks from Genomic Data-Sharing Beacons.

Suyash S Shringarpure¹, Carlos D Bustamante².

Abstract

The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries--such as "Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?"--with either "yes" or "no." Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes.

Year: 2015 PMID： 26522470 PMCID： PMC4667107 DOI： 10.1016/j.ajhg.2015.09.010

Source DB: PubMed Journal: Am J Hum Genet ISSN： 0002-9297 Impact factor: 11.025

Introduction

In the coming decade, a great deal of human genomic data, along with linked phenotypes in electronic health records, will be collected in the context of health care. A major goal of the human genomics community is to enable efficient sharing, aggregation, and analysis of these data in order to understand the genetic contributors to health and disease. Previous large-scale data-sharing approaches have had limited success because of the potential for privacy breaches and risks of participant re-identification. Homer et al. and others [2, 3, 4, 5] showed that subjects in a genome-wide association study could be re-identified with the use of allele frequencies, resulting in the removal of publicly available allele-frequency data. The Beacon Project by the Global Alliance for Genomics & Health (GA4GH) aims to simplify data sharing through a web service (“beacon”) that provides only allele-presence information. Users can query institutional beacons for information about genomic data available at the institution. Queries are of the form “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?” and the beacon server can answer “yes” or “no.” Beacons are intended to be easily set up and to allow data sharing while protecting participant privacy. By providing only allele-presence information, beacons are safe from attacks that require allele frequencies [1, 2, 3, 4, 5]. However, a privacy breach from a beacon would be troubling given that beacons often summarize data with a particular disease of interest. For instance, identifying that a given genome is part of the SFARI beacon, which contains genomic data from families with a child affected by autism spectrum disorder, means that the individual belongs to a family where some member has autism spectrum disorder. Thus, beacons could leak not only membership information but also phenotype information. 
Although genetic privacy is protected to some extent by the Genetic Information Nondiscrimination Act (GINA), the offered protections are limited, and GINA does not apply to long-term care insurance, life insurance, disability insurance, or other special cases. Therefore, all data-sharing mechanisms, including beacons, must protect participant privacy. To examine the question of re-identification in a beacon, we have developed a likelihood-ratio test (LRT) that uses allele presence or absence responses from a beacon to predict whether a given individual genome is present in the beacon database. Our approach is independent of allele frequencies. The statistical properties of the LRT guarantee that it is the most powerful test for this problem. A variation of our LRT can detect relatives of the query individual in the beacon. Our results suggest that anonymous-access beacons do not protect individual privacy and are open to re-identification attacks. As a result, they can also disclose phenotype information about individuals whose genomes are present in the beacon.

Material and Methods

We assume a beacon composed of unrelated individuals from a single population. Given query q = {C, P, A}, the beacon answers “yes” (represented as 1) if allele A is an alternate allele at position P on chromosome C and has a non-zero frequency in the sample used for constructing the beacon, and it answers “no” (represented as 0) otherwise. We consider only bi-allelic SNPs for our analysis. Thus, given a set of n queries $Q = \{q_1, \ldots, q_n\}$, the beacon returns a set of responses $R = \{x_1, \ldots, x_n\}$. For our scenario, we assume that the attacker has access to more information: the number of individuals (N) in the beacon database and the site frequency spectrum (SFS) of the population in the beacon, parameterized as a beta distribution with shape parameters (a′, b′). Thus, we assume that alternate allele frequencies f for all SNPs observed in the population are distributed as f ∼ beta(a′, b′). For our attack scenario, we assume a setting identical to that used by Homer et al. and others. In this setting, the attacker receives a VCF file listing all the SNP positions at which the query individual has an alternate allele and the genotype calls at the corresponding positions. The attacker then queries the beacon for all heterozygous positions by using the alternate allele listed in the VCF and obtains the set of responses R from the beacon. We develop an LRT that can use the responses R to decide whether the query genome is in the beacon. If the query individual is present in the beacon, then every allele in the query genome must be present in the beacon. Thus, the beacon will return a “yes” (1) response to every query. If a query individual is not present in the beacon, then the beacon response will be “yes” (1) if some individual in the beacon has the allele and “no” (0) otherwise. By calculating the likelihood of the responses, we can differentiate query individuals in the beacon from those not in the beacon. 
Our approach for re-identifying individuals within a beacon is based on a LRT that uses this information. For each query genome, we calculate the likelihood of the beacon responses to n allele-presence queries under the null hypothesis that a given individual is not in the beacon and the alternative hypothesis that the given individual is in the beacon. We then calculate the test statistic as the ratio of the two likelihoods. To make our LRT generalizable across populations, we will remove direct dependence on allele frequencies given that frequencies can vary considerably for a given allele across populations. Instead, we will allow our test to depend on the shape of the SFS, which is described by (a′, b′), the parameters of the beta distribution. Although allele frequencies for a given allele can vary considerably across populations, the SFS parameters for most populations are similar to each other (Modeling SFSs by Beta Distributions in Appendix A). Therefore, the results from a test that depends on the shape of the SFS but is independent of the actual allele frequencies can be generalized to many populations (Figure S1). Our LRT evaluates the likelihood of the beacon response under two possible hypotheses. Null hypothesis H0: query genome is not in the beacon database. Alternative hypothesis H1: query genome is in the beacon database.
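The query model described above can be illustrated with a small sketch. The class `ToyBeacon`, the dictionary-based genome representation, and the helper `heterozygous_queries` are hypothetical names introduced here for illustration only; they are not part of the GA4GH Beacon interface or of the authors' code.

```python
from typing import Dict, List, Set, Tuple

# A query is (chromosome, position, alternate allele), mirroring q = {C, P, A}.
Query = Tuple[str, int, str]
# Hypothetical genome representation: alternate-allele count per site
# (1 = heterozygous, 2 = homozygous alternate); absent sites are reference.
Genome = Dict[Query, int]


class ToyBeacon:
    """Answers allele-presence queries with 1 ("yes") or 0 ("no")."""

    def __init__(self, genomes: List[Genome]) -> None:
        self.size = len(genomes)
        self._alleles: Set[Query] = set()
        for genome in genomes:
            self._alleles.update(genome)  # iterating a dict yields its keys

    def query(self, q: Query) -> int:
        return 1 if q in self._alleles else 0


def heterozygous_queries(genome: Genome) -> List[Query]:
    """Sites the attacker would query: heterozygous positions only."""
    return sorted(q for q, alt_count in genome.items() if alt_count == 1)


beacon = ToyBeacon([
    {("1", 11272, "A"): 1, ("2", 500, "T"): 2},
    {("1", 20000, "G"): 1},
])
query_genome: Genome = {("1", 11272, "A"): 1, ("2", 500, "T"): 2, ("3", 42, "C"): 1}
responses = [beacon.query(q) for q in heterozygous_queries(query_genome)]
print(responses)
```

A genome that is in the beacon would receive 1 for every query in the error-free case; here the third site is absent from the beacon, so the response set contains a 0.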

LRT

In an ideal setting, we would expect $x_1 = x_2 = \cdots = x_n = 1$ if a query genome g is in the beacon B. In practice, because of sequencing errors and differences in variant-calling pipelines, we might have some mismatches between the query copy of a genome and its copy in the beacon. We assume that this happens with probability δ. Let the alternate allele frequency at the SNP corresponding to query $q_i$ be $f_i$. Because the beacon is only queried at the positions where the query genome is heterozygous, $f_i$ is not distributed as beta(a′, b′) but shows an ascertainment bias. We can show that $f_i \sim \mathrm{beta}(a, b)$, where a = a′ + 1 and b = b′ + 1 in theory (Posterior Distribution of Allele Frequencies in Appendix A). The log-likelihood of a response set $R = \{x_1, \ldots, x_n\}$ can be written as $L(R) = \sum_{i=1}^{n} \left[ x_i \log \Pr(x_i = 1) + (1 - x_i) \log \Pr(x_i = 0) \right]$. For the LRT, we need to evaluate this log-likelihood under the null hypothesis and the alternative hypothesis. The null hypothesis is that the query genome is not present in the beacon, and the alternative hypothesis is that the query genome is present in the beacon. We can show that under the alternative hypothesis, the log-likelihood can be calculated as $L_{H_1}(R) = \sum_{i=1}^{n} \left[ x_i \log(1 - \delta D_{N-1}) + (1 - x_i) \log(\delta D_{N-1}) \right]$, where $D_{N-1}$ is the probability that none of N − 1 genomes has an alternate allele at a given position (see Likelihood under the Alternative Hypothesis in Appendix A). Similarly, the log-likelihood under the null hypothesis is $L_{H_0}(R) = \sum_{i=1}^{n} \left[ x_i \log(1 - D_N) + (1 - x_i) \log D_N \right]$, where $D_N$ is the corresponding probability for N genomes (see Likelihood under the Null Hypothesis in Appendix A). The log of the likelihood-ratio statistic can then be written as $\Lambda = L_{H_0}(R) - L_{H_1}(R) = nB + C \sum_{i=1}^{n} x_i$, where we have defined $B = \log\left(\frac{D_N}{\delta D_{N-1}}\right)$ and $C = \log\left(\frac{\delta D_{N-1}(1 - D_N)}{D_N (1 - \delta D_{N-1})}\right)$ (see LRT Statistic in Appendix A). For $\delta < D_N / D_{N-1}$, we have C < 0. In practice, this condition always holds for the parameter values and mismatch rates we consider. Therefore, the LRT statistic can be stated as $\Lambda = nB + C \sum_{i=1}^{n} x_i$ with C < 0, so that more “yes” responses give a smaller Λ. The LRT stated above can be understood to be a test for a simple null hypothesis $H_0: \theta = 1 - D_N$ against a simple alternative hypothesis $H_1: \theta = 1 - \delta D_{N-1}$ when we are given $\{x_1, \ldots, x_n\}$ sampled as $x_i \sim \mathrm{Bernoulli}(\theta)$. By the Neyman-Pearson lemma, the LRT is the most powerful test for a given test size α.
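The statistic can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation. It assumes that the probability that none of N diploid genomes carries the alternate allele is the beta expectation E[(1 − f)^{2N}] for f ∼ beta(a, b), which follows from Hardy-Weinberg equilibrium and independent genomes; the appendix derivation is not reproduced in this text, so treat that closed form as an assumption. The function names and the parameter values a = b = 1 are hypothetical.

```python
import math
from typing import Sequence


def prob_no_alt_allele(n_individuals: int, a: float, b: float) -> float:
    """E[(1 - f)^(2N)] for f ~ beta(a, b): no alternate allele in N diploid genomes."""
    m = 2 * n_individuals
    log_d = (math.lgamma(a + b) + math.lgamma(b + m)
             - math.lgamma(b) - math.lgamma(a + b + m))
    return math.exp(log_d)


def lrt_statistic(responses: Sequence[int], n_individuals: int,
                  a: float, b: float, delta: float) -> float:
    """Log likelihood ratio (null over alternative) for beacon responses."""
    d_n = prob_no_alt_allele(n_individuals, a, b)        # P(x = 0 | H0)
    d_n1 = prob_no_alt_allele(n_individuals - 1, a, b)
    p_no_h1 = delta * d_n1                               # P(x = 0 | H1)
    coef_b = math.log(d_n / p_no_h1)
    coef_c = math.log(p_no_h1 * (1 - d_n) / (d_n * (1 - p_no_h1)))
    return len(responses) * coef_b + coef_c * sum(responses)


# Hypothetical parameters: beacon of 1,000 genomes, a = b = 1, mismatch rate 1e-6.
all_yes = lrt_statistic([1] * 100, 1000, 1.0, 1.0, 1e-6)
one_no = lrt_statistic([1] * 99 + [0], 1000, 1.0, 1.0, 1e-6)
print(all_yes, one_no)
```

Because the coefficient on the number of “yes” responses is negative, a single “no” response pushes the statistic sharply upward, toward the null hypothesis that the genome is not in the beacon.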

Binomial Test

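The null hypothesis is rejected if Λ < t for some threshold t. Let t be such that P(Λ < t | H0) = α. This is equivalent to rejecting the null hypothesis if $\sum_{i=1}^{n} x_i > t'$, where $t' = (t - nB)/C$ and B and C are the constants of the LRT statistic $\Lambda = nB + C \sum_i x_i$ (the inequality reverses because C < 0). Because the $x_i$ are independent and identically distributed (i.i.d.) under both hypotheses, $\sum_i x_i \sim \mathrm{Binomial}(n, 1 - D_N)$ under H0 and $\sum_i x_i \sim \mathrm{Binomial}(n, 1 - \delta D_{N-1})$ under H1, where $D_N$ denotes the probability that none of N genomes carries the alternate allele. Therefore, the power of the exact test can be calculated as $1 - \beta = P\left(\sum_i x_i > t' \mid H_1\right)$, where $t'$ is chosen such that $P\left(\sum_i x_i > t' \mid H_0\right) = \alpha$. A sufficient statistic for the LRT is the number of “yes” responses from the beacon.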
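The exact test can be sketched with plain binomial tail sums. The function names below are hypothetical, and the success probabilities passed in the example are illustrative values rather than estimates from the paper. Because the binomial distribution is discrete, the sketch picks the smallest rejection count whose null tail probability does not exceed α, so the realized false-positive rate can be below α.

```python
import math
from typing import Tuple


def binom_log_pmf(k: int, n: int, p: float) -> float:
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def exact_power(n: int, theta0: float, theta1: float,
                alpha: float) -> Tuple[int, float]:
    """Return (k_crit, power): reject H0 when the number of "yes" responses >= k_crit.

    theta0 and theta1 are the per-query "yes" probabilities under H0 and H1.
    """
    tail0 = 0.0
    k_crit = n + 1  # n + 1 means "never reject"
    for k in range(n, -1, -1):
        tail0 += math.exp(binom_log_pmf(k, n, theta0))
        if tail0 > alpha:
            break
        k_crit = k
    power = sum(math.exp(binom_log_pmf(k, n, theta1))
                for k in range(k_crit, n + 1))
    return k_crit, power


# Illustrative values only: 10 queries, "yes" probability 0.5 under H0, 0.99 under H1.
k_crit, power = exact_power(10, 0.5, 0.99, 0.05)
print(k_crit, power)
```

Rejecting when the count is at least `k_crit` is the same rule as rejecting when the count exceeds a threshold of `k_crit - 1`.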

Relationship between the Number of Queries Required and Beacon Size

In the null and alternative hypotheses, $x_i$ is a Bernoulli random variable. Therefore, by the central limit theorem, the LRT statistic has an approximately Gaussian distribution when n is large. We can therefore use the parameters of the Gaussian distribution to obtain a relationship between the number of queries (required for achieving a desired power and false-positive rate) and the number of individuals in the beacon. Let μ0 and σ0 be the mean and SD, respectively, of the LRT statistic under the null hypothesis, and let μ1 and σ1 be the corresponding values under the alternative hypothesis. For an LRT statistic with false-positive rate α, power 1 − β, and a normal distribution, we have that $\mu_0 + z_\alpha \sigma_0 = \mu_1 + z_{1-\beta} \sigma_1$, where $z_y$ is the y quantile of the standard normal distribution. For the LRT we describe, this relationship can be rearranged (see Gaussian LRT Power Approximation in Appendix A) so that the right-hand side of the resulting equation is independent of both n and N for a specified false-positive rate α and power 1 − β. Thus, we have that $n \propto N$: the number of queries required grows linearly with the beacon size.
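The scaling can be checked numerically with a short sketch. It solves the Gaussian quantile condition for n by using the Bernoulli means and variances of the responses directly; this rearrangement is our own and is not the appendix formula, which is not reproduced in this text. The closed form used for the no-allele probability and the parameter values a = b = 1 and δ = 10⁻⁶ are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def prob_no_alt_allele(n_individuals: int, a: float, b: float) -> float:
    """Assumed closed form: E[(1 - f)^(2N)] for f ~ beta(a, b)."""
    m = 2 * n_individuals
    return math.exp(math.lgamma(a + b) + math.lgamma(b + m)
                    - math.lgamma(b) - math.lgamma(a + b + m))


def queries_required(n_individuals: int, a: float, b: float, delta: float,
                     alpha: float = 0.05, power: float = 0.95) -> int:
    """Number of queries n satisfying mu0 + z_alpha*sd0 = mu1 + z_power*sd1."""
    p0 = prob_no_alt_allele(n_individuals, a, b)              # P("no" | H0)
    p1 = delta * prob_no_alt_allele(n_individuals - 1, a, b)  # P("no" | H1)
    z_alpha = NormalDist().inv_cdf(alpha)
    z_power = NormalDist().inv_cdf(power)
    s0 = math.sqrt(p0 * (1 - p0))
    s1 = math.sqrt(p1 * (1 - p1))
    root_n = (z_power * s1 - z_alpha * s0) / (p0 - p1)
    return math.ceil(root_n ** 2)


n_1000 = queries_required(1000, 1.0, 1.0, 1e-6)
n_2000 = queries_required(2000, 1.0, 1.0, 1e-6)
print(n_1000, n_2000, n_2000 / n_1000)
```

Under these illustrative parameters, doubling the beacon size roughly doubles the number of queries needed, in line with the linear relationship stated above. Other SFS parameters may scale differently, so this is a check of the idea rather than a reproduction of Table 1.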

LRT for Detecting Relatives

The relatedness of two individuals can be parameterized with a single parameter ϕ, which is the probability that the two individuals share an allele at a single SNP. Thus, identical twins have ϕ = 1, parent-offspring and sibling pairs have ϕ = 0.5, first cousins have ϕ = 0.25, and so on. The likelihood for the null hypothesis remains the same as before. Under the alternate hypothesis (a relative of the query genome g with relatedness ϕ is present in beacon B), the log-likelihood has the same Bernoulli form as before, with a “yes” probability that depends on ϕ (see Likelihood under the Alternate Hypothesis in Appendix B). We can use this form to calculate the LRT statistic for this setting. Here, the exact test uses $\sum_{i=1}^{n} x_i$ as the sufficient statistic (as before), and the sufficient statistic is binomially distributed under both hypotheses. Under the null hypothesis, it has the same binomial distribution as in the test for individuals; under the alternate hypothesis, its success probability depends on ϕ. Therefore, the power of the exact test can be calculated as $P\left(\sum_i x_i > t' \mid H_1\right)$, where $t'$ is chosen such that $P\left(\sum_i x_i > t' \mid H_0\right) = \alpha$.

Simulation Experiments

We simulated 500,000 SNPs in a sample of 1,000 diploid individuals. Alternate allele frequencies were sampled from a multinomial distribution with probabilities obtained from the expected allele-frequency distribution for a standard neutral model under the assumption of a population size of 10,000 individuals. We constructed a beacon by using the 1,000 simulated individuals. The query set of individuals consisted of 200 diploid individuals from the beacon and 200 diploid individuals who were not in the beacon and whose genotypes were simulated according to the generated allele frequencies at all SNPs. For initial experiments, the mismatch rate between the beacon and query copies of the same genomes was set to 10⁻⁶ to simulate near-ideal data. The null distribution of the LRT statistic was obtained with the exact-test calculation for the 200 individuals not in the beacon. Power was calculated as the proportion of successfully rejected tests (out of 200) for the query genomes in the beacon.
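A scaled-down version of this simulation can be sketched as follows. It assumes that the expected allele-frequency distribution under the standard neutral model assigns derived-allele count i a probability proportional to 1/i, and that genotypes are drawn under Hardy-Weinberg equilibrium. The function names, the random seed, and the reduced sizes (2,000 SNPs, 50 beacon genomes) are choices made for this illustration and are not taken from the paper.

```python
import random
from typing import List


def sample_allele_frequencies(n_snps: int, pop_size: int,
                              rng: random.Random) -> List[float]:
    """Draw alternate allele frequencies with P(count = i) proportional to 1/i."""
    n_chrom = 2 * pop_size
    counts = range(1, n_chrom)               # segregating sites only
    weights = [1.0 / c for c in counts]
    drawn = rng.choices(counts, weights=weights, k=n_snps)
    return [c / n_chrom for c in drawn]


def simulate_genome(freqs: List[float], rng: random.Random) -> List[int]:
    """Diploid genotype (0, 1, or 2 alternate alleles) per SNP under HWE."""
    return [int(rng.random() < f) + int(rng.random() < f) for f in freqs]


rng = random.Random(0)
freqs = sample_allele_frequencies(2000, 10_000, rng)
beacon_genomes = [simulate_genome(freqs, rng) for _ in range(50)]
outsider = simulate_genome(freqs, rng)

# Beacon answer per SNP: is the alternate allele present in any beacon genome?
present = [any(g[j] > 0 for g in beacon_genomes) for j in range(len(freqs))]


def yes_fraction(genome: List[int]) -> float:
    """Fraction of "yes" responses over the genome's heterozygous sites."""
    het_sites = [j for j, gt in enumerate(genome) if gt == 1]
    return sum(present[j] for j in het_sites) / len(het_sites)


print(yes_fraction(beacon_genomes[0]), yes_fraction(outsider))
```

With no mismatches, a beacon member receives a “yes” at every heterozygous site, whereas an outsider typically receives some “no” responses at rare alleles.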

Detecting Relatives

To examine whether relatives could be identified from the beacon, we used 200 individuals from the beacon to generate query genomes with varying degrees of relatedness to the original individual.

Effect of Noise

Genome sequencing is more error prone than array genotyping. Even with high-coverage data, biological replicates of the same individual could have 1%–5% SNPs unique to each replicate. On the same sequenced sample, different variant-calling pipelines can produce SNP calls at positions that might differ from each other. We model this in our simulation by allowing for a mismatch probability (δ) that for a query individual who is in the beacon and is heterozygous at the query SNP, the copy in the beacon is a homozygous reference, i.e., the beacon will (erroneously) return 0 as the response to the query. Table S2 shows the levels of mismatch modeled in our experiments.

Experiments with Real Data

1000 Genomes Phase 1 CEU Beacon

We created a beacon by using the CEU population (Utah residents with ancestry from northern and western Europe from the CEPH collection) from phase 1 of the 1000 Genomes Project. Of the 85 CEU samples present in phase 1, 65 were used in the beacon. 20 samples from the beacon and the remaining 20 samples were used as query genomes. Figure S4 shows the setup of the 1000 Genomes phase 1 CEU beacon. To test the effect of censoring on power, we constructed a beacon by using the same data as above, except that the beacon always responded “no” to queries for singletons. We then used whole genomes to query the beacon, as before. To test whether sharing SNP array data was more secure than sharing whole genomes, we repeated the setup of Figure S4 with Affymetrix array data for the CEU samples. We then used SNP array data to query the beacon.

Combining Multiple Datasets

We used the scheme of Figure S5 to create beacons that contained either a single population (65 CEU individuals) or multiple populations (a CEU + YRI [Yoruba in Ibadan, Nigeria] beacon with 32 CEU and 33 YRI individuals and a CEU + JPT [Japanese in Tokyo, Japan] beacon with 32 CEU and 33 JPT individuals). We used 40 CEU individuals as query individuals, 20 of whom belonged to all beacons and 20 of whom belonged to none of the beacons.

Re-identifying a Personal Genome Project Individual

To test our method on existing beacons, we selected from the Personal Genome Project (PGP) a single genome (ID hu48C4EB or PGP 183). We chose 1,000 heterozygous SNPs from the selected individual’s genome and used the GA4GH Beacon Network query interface to query all existing beacons for the alternate allele at the chosen SNPs. If a beacon of size N produced k “yes” responses to n queries, the p value was calculated under the null hypothesis as $p = P\left(\sum_{i=1}^{n} x_i \ge k \mid H_0\right)$, i.e., the upper tail of the binomial distribution of the number of “yes” responses for a beacon of size N. Through metadata (see Web Resources), we were able to ascertain that the selected individual was present in the PGP beacon and the Kaviar beacon.

Results

Re-identification in a Simulated Beacon

We validated our LRT framework by simulating a beacon with 1,000 individuals and 500,000 total SNPs. From the power curve (Figure 1A), we can see that the LRT had more than 95% power to detect whether an individual was in the beacon with just 5,000 SNP queries. We also see that our theoretical analysis matches the empirical results. For the same number of SNPs queried, the power for detecting relatives was reduced but still considerable (Figure 1B; Figure S2). Sequencing errors and variant-calling differences reduced the power of the test (Figure S3).

Figure 1

Power of Re-identification Attacks on Beacons Constructed with Simulated Data

Power curves for the likelihood-ratio test (LRT) on (A) a simulated beacon with 1,000 individuals and (B) detecting relatives in the simulated beacon. The false-positive rate was set to 0.05 for all scenarios.

Re-identification in Phase 1 CEU Beacon

For evaluation with real data, we set up a beacon by using 65 CEU individuals from phase 1 of the 1000 Genomes Project (Figure S4). With just 250 SNPs, beacon membership could be detected with 95% power and a 5% false-positive rate (Figure 2A). A beacon constructed with the same individuals but with SNP array data showed a reduction in power, as did a beacon that used sequence data but censored responses by always replying “no” to queries for singletons (Figure 2B). Even in these scenarios, the LRT had greater than 90% power if 10,000 or more queries were permitted.

Figure 2

Power of Re-identification Attacks on Beacons Constructed with Real Data

Power curves for the LRT on (A) a beacon constructed from 65 CEU individuals from 1000 Genomes phase 1 and (B) CEU beacons of size 65 and constructed with array data, censored WGS data (without singletons), and WGS data. The false-positive rate was set to 0.05 for all scenarios.

Re-identification in Multi-population Beacon

From our theoretical analysis, we can see that increasing beacon size increases the number of SNPs required for achieving a given power level at a specified threshold for the false-positive rate. Combining multiple datasets can make detection more difficult in the same way. A question of interest is whether combining multiple datasets can also make detection more difficult by affecting the SFS of the samples in the beacon. Figure 3 shows the power curves for beacons containing multiple populations. The results show that for a fixed number of SNPs to query, the power for the multi-population beacons is higher than that for the CEU-only beacon. A single-population beacon is therefore more secure than a multi-population beacon of the same size. Because the protective effect of extra samples in the beacon against re-identification depends on their allele sharing with the query genome, including other populations in a beacon is less effective than including the same number of individuals from the population of the query genome.

Figure 3

Power of the LRT for Multi-population Datasets

Power is larger for multi-population beacons than for the CEU-only beacon.

Re-identification in Existing Beacons

We used our theoretical analysis to estimate the number of queries our framework would require to re-identify individuals and relatives from existing beacons. We used publicly available beacon metadata to infer the number of individuals present in the beacon. Where this was not possible (the AMPlab, ICGC, and NCBI beacons), we used conservative estimates based on the size of the underlying datasets. For SFS parameters, we used the estimates we obtained for our simulation data. The Kaviar (Known VARiants) beacon contains 63,500 exomes and 8,400 whole genomes. Because exomes are only 1% of entire genomes in length, this beacon can be assumed to consist of two beacons: an exome beacon with 72,000 exomes and a genome beacon with 8,400 whole genomes. Re-identification is possible in the genome beacon if queries for SNPs in the coding regions are avoided. From Table 1, we see that only the Cafe CardioKit gene-panel beacon, the Broad Institute exome beacon, and the Kaviar beacon are safe from our re-identification attack, given that the gene panels and exomes have far fewer SNPs than genomes. For all other beacons, re-identification is possible with 95% power and fewer than 50,000 allele queries. Thus, our approach is computationally feasible with existing beacons.

Table 1

Estimated Number of SNP Queries Required for Re-identification in Real Beacons with a 5% False-Positive Rate and 95% Power

Beacon Name	Number of Samples	SNPs Required: Identical Genomes	SNPs Required: First-Degree Relatives	SNPs Required: Second-Degree Relatives
1000 Genomes Project	1,092	3,649	34,467	157,861
1000 Genomes Project phase 3	2,535	8,469	79,976	366,276
AMPLab	2,535	8,469	79,976	366,276
Broad Institute	60,706	202,770	1,914,581	8,768,007
Cafe CardioKit	1,070	3,575	33,773	154,684
ICGC	12,807	42,779	403,936	1,849,878
Known VARiants	72,000	240,494	2,270,772	10,399,218
Known VARiants (genomes only)	8,400	28,059	264,947	1,213,368
NCBI	14,466	48,320	456,258	2,089,490
PGP	174	582	5,515	25,273
IBD	5,070	16,936	159,926	732,410
Native American + Egyptian	100	335	3,181	14,586
UK10K	6,322	21,118	199,411	913,239
SFARI	10,400	34,739	328,024	1,502,231

Re-identifying a PGP Individual

We demonstrated the feasibility of re-identification in existing beacons by querying them 1,000 times with a single genome from the PGP. To avoid overloading the beacon servers, we inserted a delay of 5 s between queries, and all 1,000 queries were completed in 3 hr 53 min from a single computer. In beacons where the presence of the individual could be confirmed from metadata, we obtained 100% “yes” responses (Table 2). The null hypothesis (the query genome is not in the beacon) could be rejected only for the PGP beacon (p = 0.0033), but not for the larger Kaviar beacon (p = 0.98), demonstrating that re-identification is more difficult in larger beacons.

Table 2

Theoretical p Values for 1,000 Queries for SNPs from a Genome in the Personal Genome Project

Beacon Name	Beacon Size	“Yes” Responses	p Value
Known VARiants (a)	72,000	1000	0.98
Broad Institute	60,706	27	1
1000 Genomes Project	1,092	711	1
PGP (a)	174	1000	0.0033
Cafe CardioKit	1,070	0	1
Wellcome Trust Sanger Institute	11,492	960	1
NCBI	14,466	947	1
ICGC	12,807	134	1
AMPLab	2,535	946	1
1000 Genomes Project phase 3	2,535	946	1

(a) Beacons known to contain the individual (from metadata).

Discussion

We have developed an LRT for identifying whether a given individual genome is part of a beacon. Our experiments show that in a variety of settings, detecting membership in a beacon is possible with high power for not only individuals in the beacon but also their relatives. Because beacons are often designed to share samples with a certain phenotype, this also discloses phenotype information about the individual who is detected to be part of the beacon. Although detecting membership does not breach privacy, disclosure of potentially sensitive phenotype information is a serious privacy breach. In Table 1, of the nine beacons that index non-publicly available genomic data (see Table S3 for details of beacon datasets and phenotypes), four are associated with a single phenotype (Cafe CardioKit, ICGC, IBD, and SFARI beacons), four are associated with multiple phenotypes (Broad Institute, Kaviar, NCBI, and UK10K beacons), and one is not associated with any phenotype (Native American + Egyptian beacon). For instance, identifying that a given genome is part of the SFARI beacon, which contains genomic data from families with a child affected by autism spectrum disorder, means that the individual belongs to a family where some member has autism spectrum disorder. The LRT we describe can be used in a number of undesirable ways. For instance, a United States direct-to-consumer genetic-testing company that collects genome-wide data from customers could use it to infer phenotype or disease information without its customers’ knowledge by querying beacons. Because the re-identification attack we describe requires the attacker to have access to an individual’s genome, an alternative is that the attacker can use the query genome to directly predict disease risk by using existing risk-prediction methods, such as genomic risk scores or machine-learning methods. 
A comparison of the performance of risk prediction and the re-identification LRT would be useful in understanding whether re-identification discloses any extra information about the query individual. However, most risk-prediction methods focus on the risk that the subject will develop the disease (in 10 years or at some future time), whereas identifying beacon membership gives a direct estimate of the probability that the queried individual currently has the disease studied in the beacon sample. A fair comparison of the two is therefore not possible. If our LRT (with false-positive rate α = 5%) identifies an individual as belonging to a case-only beacon (i.e., rejects the null hypothesis) for a disease with population prevalence (prior probability that an individual has the disease) p = 1%, the posterior probability that the individual has the disease is given by (1 − α) + αp = 0.9505 according to Bayes’ theorem. For the same result in a case-control beacon with equal numbers of case and control individuals, the probability that the individual has the disease is given by 0.5 × (1 − α) + 0.05p = 0.4755. In contrast, although genomic risk prediction has high success rates for Mendelian diseases with highly penetrant alleles and in some cancers, the success of such approaches for predicting common disease risk is modest. An upper bound on performing genomic risk prediction by using an individual’s genome can be obtained if one considers the (broad-sense) heritability of the disease being studied. Polderman et al. examined the heritability of 17,804 human traits. From their analysis, we can see that 26 out of 43 ICD-10 (International Classification of Diseases, Tenth Revision) and ICF (International Classification of Functioning, Disability, and Health) subchapter-level disease categories have heritability less than 50%, suggesting that the performance of genomic risk prediction for many disease categories will be limited. 
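The two probabilities quoted above can be reproduced with a short calculation. The function name and the `case_fraction` parameter are hypothetical: the text gives only the case-only and the equal case-control settings, and the generalization to an arbitrary fraction of case samples is an assumption made for this sketch.

```python
def disease_probability(alpha: float, prevalence: float,
                        case_fraction: float = 1.0) -> float:
    """Probability that an individual flagged by the LRT has the disease.

    Follows the expression in the text: a true detection (weight 1 - alpha)
    implies disease with probability case_fraction; a false positive
    (weight alpha) implies disease with the population prevalence.
    """
    return case_fraction * (1.0 - alpha) + alpha * prevalence


case_only = disease_probability(alpha=0.05, prevalence=0.01)
case_control = disease_probability(alpha=0.05, prevalence=0.01, case_fraction=0.5)
print(round(case_only, 4), round(case_control, 4))
```

Lowering the fraction of case samples in the beacon lowers this probability proportionally, which is the basis for the mitigation of adding control samples discussed later in this section.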
Our approach makes some simplifying assumptions. We assume that the beacon samples and the query genome belong to the same population. This is a reasonable assumption given that beacons often publish the ethnicity of the datasets included, whereas the ethnicity of the query genome can be identified by comparison to reference panels such as 1000 Genomes. Genotypes are assumed to be distributed according to Hardy-Weinberg equilibrium. We also assume that allele queries are independent, which can lead to overly confident predictions for common SNPs. We expect that it will not affect our results significantly, given that most SNPs are rare (<5% frequency) in human populations. Inaccurate estimates of the shape of the SFS can affect our theoretical analysis. However, as Figure S1 shows for the theoretical power, the power of the test is similar for populations with different SFS parameters, and Figure 2A shows good agreement between theoretical and empirical power curves on the CEU beacon. In addition, the empirical power of the test does not depend on the SFS parameters (Binomial Test in Appendix A). This suggests that our test is robust to different SFS parameters. A computational limitation is that establishing high confidence might need millions of queries. In our experiments with existing beacons, we were able to make 1,000 queries to the beacon server in 3 hr 53 min, with a 5 s delay between queries. An important caveat is that the proposed LRT is only a demonstration that individual privacy can be compromised through beacons. It aims to show that beacon membership can be identified with only the query genome, even if allele frequencies are not known. As a result, the bounds we obtain for the number of queries required for re-identification (Table 1) are conservative and should not be used directly to guide the construction of beacons. 
A re-identification test that uses only rare SNPs and/or incorporates the allele frequencies at SNPs will be more powerful than our method and will require fewer queries than our estimates. Because the LRT we describe requires access to genomic data, such attacks might not be frequent or imminent at this time. However, as access to genomic data becomes easier, such attacks might need to be accounted for in the design of data-sharing mechanisms. Our results have important implications for setting up beacons to allow data sharing and protect individual privacy. Beacons are designed to help researchers find datasets relevant to their research interests (e.g., datasets containing an allele that the researchers might suspect to be associated with a rare Mendelian disorder). Access to individual-level genotype data is usually controlled, and a researcher might spend considerable time and effort applying for access only to find that the dataset is not relevant to his or her study. An advantage of a beacon is that any researcher can use it to query access-controlled data without applying for access. This will allow researchers to establish whether an access-controlled dataset might be of interest to them and apply for access only for relevant datasets. Two desirable features in beacons might therefore be that they contain a single dataset (so researchers who find a relevant dataset by querying a beacon can get data access through a single request) and that they return accurate information about the presence of rare alleles. Solutions for protecting privacy in beacons must also maintain the utility of beacons by supporting these features. We examine two ways in which security can be improved for anonymous-access beacons: (1) making detection of membership in the beacon harder and (2) reducing the leakage of phenotype information from the beacon. A number of approaches can be used for making detection of membership in the beacon harder. 
Increasing beacon size can make detection harder, but protection against genome-wide re-identification attacks will require tens of thousands of individuals. Beacons sharing small genomic regions (single genes or exomes) are more secure than those sharing whole genomes. Beacons containing multiple populations are less secure than single-population beacons of the same size. Publishing metadata—such as the ethnicity of samples, beacon size, or the names of datasets included—reduces beacon security. Limiting the number and/or rate of queries per IP address can only slow down attackers and is therefore ineffective. Data-anonymization approaches, such as using only common variation or censoring (Figure 2B; Censoring Beacon Responses in Appendix B), make re-identification harder but not impossible. All of these methods make detection of membership in the beacon harder, but they also reduce the utility of beacons to users. An alternative way of improving beacon security is to address the leakage of phenotype information instead of the possibility of genomic re-identification. As described earlier, the probability that a re-identified sample has the disease associated with the beacon dataset depends on the proportion of case samples in the beacon dataset. Therefore, adding a suitable number of control samples or aggregating responses from multiple beacons (implemented as an option in the Beacon Network) might reduce the probability that a re-identified sample has the disease to an acceptable level. Heritability estimates can be used for determining an acceptable probability level for a particular disease. By including non-case samples, these solutions reduce the phenotype information that can be obtained from a beacon while keeping the reduction in the utility of the beacon to a minimum. We expect that, because of the lack of monitoring and access control, anonymous-access beacons will always be open to re-identification attempts. 
The most important step for improving security and reducing loss of privacy through beacons would be to prohibit anonymous access. Requiring users to authenticate their identity to access beacons will allow the research community to discourage re-identification attacks through policies outlining acceptable uses of beacons.
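
The earlier point that larger beacons make detection harder can be illustrated with a simplified model. This is a sketch only and is not the likelihood-ratio test itself. It assumes that each of the 2N chromosomes in a beacon of N unrelated diploid individuals carries a given allele independently with the same frequency, and that there are no sequencing errors. Under those assumptions a beacon always answers "yes" for an allele carried by a member, whereas for a non-member carrying the same allele it answers "yes" only if someone else in the beacon has it. The function name and the example frequency are hypothetical.

```python
def p_yes_nonmember(allele_freq: float, n_individuals: int) -> float:
    """Chance that a beacon answers "yes" purely because of other people.

    Simplifying assumptions (not from the source): each of the 2 * N
    chromosomes carries the allele independently with probability
    allele_freq, and there are no sequencing errors.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be in [0, 1]")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    return 1.0 - (1.0 - allele_freq) ** (2 * n_individuals)


if __name__ == "__main__":
    freq = 0.001  # hypothetical rare-allele frequency
    for size in (65, 1_000, 10_000):
        # A member always gets "yes", so the per-query gap is 1 - P(yes | non-member).
        gap = 1.0 - p_yes_nonmember(freq, size)
        print(f"N={size:>6}: member/non-member gap per query = {gap:.4f}")
```

As N grows, the gap between the responses seen by members and non-members shrinks toward zero for any fixed allele frequency, which is one way to read the observation that protection against genome-wide attacks requires very large beacons.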
