Jean Louis Raisaro1, Florian Tramèr1, Zhanglong Ji2, Diyue Bu3, Yongan Zhao3, Knox Carey4, David Lloyd5,6, Heidi Sofia7, Dixie Baker8, Paul Flicek5, Suyash Shringarpure9, Carlos Bustamante9, Shuang Wang2, Xiaoqian Jiang2, Lucila Ohno-Machado2, Haixu Tang3, XiaoFeng Wang3, Jean-Pierre Hubaux1.
Abstract
The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or "beacon") is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertain and in accordance with the GA4GH policy and standards.

While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual's whole genome sequence), the individual's membership in a beacon can be inferred through repeated queries for variants present in the individual's genome.

In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets.
Keywords: beacon; ga4gh; genomic data sharing; genomic privacy; re-identification
Year: 2017 PMID: 28339683 PMCID: PMC5881894 DOI: 10.1093/jamia/ocw167
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Algorithm describing mitigation strategy S3
| Algorithm 1 | |
|---|---|
| 1. | Set all |
| 2. | Receive |
| 3. | Return the previous answer, then go to Step 2. |
| 4. | Compute the risk |
| 5. | Check whether there are any records with the asked variant and |
| 6. | For all the individuals with such variant and |
| 7. | Go back to Step 2 and wait for the next query. |
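
The steps of Algorithm 1 can be sketched in Python as follows. Because the symbols and formulas are missing from the steps as listed here, this is only a minimal illustration under stated assumptions: the class name `BudgetedBeacon`, the parameters `risk_fn` and `initial_budget`, the eligibility test "remaining budget is at least the query risk", and the rule "subtract the risk from each eligible carrier's budget" are all hypothetical choices, and the risk function used in the toy data is a placeholder rather than the paper's formula.

```python
import math
from typing import Callable, Dict, Set, Tuple

Variant = Tuple[str, int, str]  # (chromosome, position, allele)


class BudgetedBeacon:
    """Toy beacon that keeps, per user, a privacy budget for each individual."""

    def __init__(self, genomes: Dict[str, Set[Variant]],
                 risk_fn: Callable[[Variant], float],
                 initial_budget: float) -> None:
        self.genomes = genomes                # individual id -> variants carried
        self.risk_fn = risk_fn                # hypothetical: variant -> risk cost
        self.initial_budget = initial_budget  # hypothetical starting budget
        self.budgets: Dict[str, Dict[str, float]] = {}    # user -> individual -> budget
        self.answers: Dict[str, Dict[Variant, bool]] = {}  # user -> variant -> reply

    def query(self, user: str, variant: Variant) -> bool:
        # Step 1: set all budgets the first time this user is seen.
        if user not in self.budgets:
            self.budgets[user] = {ind: self.initial_budget for ind in self.genomes}
            self.answers[user] = {}
        budgets, cache = self.budgets[user], self.answers[user]

        # Step 3: a repeated query returns the previous answer at no extra cost.
        if variant in cache:
            return cache[variant]

        # Step 4: compute the risk of this query (placeholder function).
        risk = self.risk_fn(variant)

        # Step 5: look for carriers whose budget still covers the risk (assumed test).
        eligible = [ind for ind, carried in self.genomes.items()
                    if variant in carried and budgets[ind] >= risk]

        # Step 6: charge the risk to every eligible carrier (assumed update rule).
        for ind in eligible:
            budgets[ind] -= risk

        # Step 7: store the answer and wait for the next query.
        cache[variant] = bool(eligible)
        return cache[variant]


# Toy data: "ind2" is the only carrier of two rare alleles.
genomes = {
    "ind1": {("1", 100, "A")},
    "ind2": {("1", 100, "A"), ("2", 200, "T"), ("3", 300, "G")},
}
allele_freq = {("1", 100, "A"): 0.5, ("2", 200, "T"): 0.25, ("3", 300, "G"): 0.25}

# Placeholder risk: rarer alleles cost more. NOT the formula from the paper.
beacon = BudgetedBeacon(genomes, lambda v: -math.log(allele_freq[v]), initial_budget=2.0)

first = beacon.query("alice", ("2", 200, "T"))
repeat = beacon.query("alice", ("2", 200, "T"))
second = beacon.query("alice", ("3", 300, "G"))
print(first, repeat, second)
```

In this toy run the first rare-allele query is answered "Yes", the repeat returns the cached "Yes" without charging the budget again, and the second rare allele is answered "No" because the remaining budget of `ind2` no longer covers the query's risk. Budgets are tracked per user, which is why the approach presupposes identifiable users with a single account each.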
Figure 1. “Optimal” re-identification attack in single-population beacon. Different power rates per number of SNPs queried from an unprotected beacon with a single population (EUR) by an adversary with different types of background knowledge: (green) the attacker knows the allele frequencies (AFs) of a population from the same ancestry (EUR) as the one in the beacon and performs queries following the rare-allele-first logic; (red, cyan, blue, and purple) the attacker knows the AFs of a population from an ancestry different from the one in the beacon and performs queries following the rare-allele-first logic (African [AFR], admixed American [AMR], East Asian [EAS], or South Asian [SAS], respectively); (yellow) the attacker knows the AFs of a distinct population with the same ancestry (EUR) other than the one in the beacon but performs queries in random order; (black) the attacker does not have any information on AFs (i.e., the original attack by Shringarpure and Bustamante).
Figure 2. Kendall rank correlation coefficient with respect to true beacon allele frequencies. Kendall rank correlation coefficient between the actual AFs of the single-population beacon of Figure 1 and the AFs of populations with different ancestries. Values closer to 1 represent higher correlation. Color mapping as in Figure 1.
Figure 3. “Optimal” re-identification attack in beacon with S1. Different power rates per number of SNPs randomly queried from a beacon with mitigation S1 by an adversary with knowledge on and on AFs from the 1000 Genomes project: (blue) ; (green) .
Figure 4. “Optimal” re-identification attack in beacon with S2. Different power rates per number of SNPs queried (with rare-first logic) from a beacon with mitigation S2 by an adversary with knowledge on and on AFs from the 1000 Genomes project. Different colors for different values of .
Proportions of queries (over a period of 12 weeks) for each range of allele frequency
| Allele frequency | < 0.001 | 0.001–0.01 | 0.01–0.05 | 0.05–0.5 | > 0.5 |
|---|---|---|---|---|---|
| Queries in ExAC | 0.853 | 0.076 | 0.023 | 0.033 | 0.014 |
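
A query profile like the one in this table can be used to simulate a typical user's queries. The snippet below is a minimal sketch of that idea: it draws allele-frequency ranges in the tabulated proportions. The names `QUERY_PROFILE` and `sample_af_ranges` are hypothetical, and the sampling scheme is an illustration, not the evaluation procedure used in the paper.

```python
import random
from collections import Counter

# Proportions copied from the table above (ExAC queries over 12 weeks).
QUERY_PROFILE = {
    "< 0.001": 0.853,
    "0.001-0.01": 0.076,
    "0.01-0.05": 0.023,
    "0.05-0.5": 0.033,
    "> 0.5": 0.014,
}


def sample_af_ranges(n_queries: int, seed: int = 0) -> Counter:
    """Draw n_queries allele-frequency ranges according to QUERY_PROFILE."""
    rng = random.Random(seed)
    ranges = list(QUERY_PROFILE)
    weights = list(QUERY_PROFILE.values())
    # random.choices normalizes the weights, so they need not sum to exactly 1.
    return Counter(rng.choices(ranges, weights=weights, k=n_queries))


# Share of queries that target alleles with frequency of at most 0.01.
rare_share = QUERY_PROFILE["< 0.001"] + QUERY_PROFILE["0.001-0.01"]

counts = sample_af_ranges(10_000)
print(f"rare share: {rare_share:.3f}")
print(counts.most_common())
```

Roughly 93% of the logged queries target alleles with frequency of at most 0.01, so any budget or noise mechanism is exercised mostly by rare-allele queries.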
Figure 5. Budget evaluation in beacon with S3. Behaviors of individual budgets per number of SNPs queried according to the typical user’s query profile obtained from ExAC log data. The cyan curve represents the number of individuals with enough budget to answer “Yes” to queries targeting alleles with AF = 0.001. Red, green, and blue curves correspond to 0.002, 0.005, and 0.01, respectively.
Summary of advantages and disadvantages of the 3 proposed mitigation strategies
| Risk mitigation strategy | Disadvantages | Advantages |
|---|---|---|
| S1 | Eliminates the possibility of querying for unique alleles, which are highly likely to be the most useful in genetic research | Protects privacy of individuals possessing variants most likely to be targeted by attackers |
| S2 | Decreases rate of true answers returned from querying unique alleles likely to be useful in genetic research | Permits some unique alleles to be discoverable and allows fine-tuning of the privacy–utility trade-off |
| S3 | Requires the assumption of Beacon user being nonanonymous and holding no more than one Beacon account; may require complicated accounting scheme | Enables all alleles to be discoverable until budget is exceeded |
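
The contrast between the first two strategies in the table above can be illustrated with a short sketch. It assumes, based only on the table's wording, that S1 refuses to confirm alleles carried by a single individual and that S2 answers such unique alleles truthfully only part of the time. The function names, the threshold `k`, and the flip probability `epsilon` are hypothetical; the paper's exact definitions and parameter values are not reproduced here.

```python
import random
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str]  # (chromosome, position, allele)
Genomes = Dict[str, Set[Variant]]


def carrier_count(genomes: Genomes, variant: Variant) -> int:
    """Number of individuals in the beacon who carry the variant."""
    return sum(variant in carried for carried in genomes.values())


def answer_unprotected(genomes: Genomes, variant: Variant) -> bool:
    """Plain beacon: 'Yes' if anyone carries the variant."""
    return carrier_count(genomes, variant) >= 1


def answer_s1(genomes: Genomes, variant: Variant, k: int = 2) -> bool:
    """S1-style sketch: 'Yes' only if at least k individuals carry the variant."""
    return carrier_count(genomes, variant) >= k


def answer_s2(genomes: Genomes, variant: Variant,
              epsilon: float, rng: random.Random) -> bool:
    """S2-style sketch: a unique allele gets a true 'Yes' with probability 1 - epsilon."""
    count = carrier_count(genomes, variant)
    if count == 1:
        return rng.random() >= epsilon
    return count >= 1


genomes: Genomes = {
    "ind1": {("1", 100, "A"), ("2", 200, "T")},  # ("2", 200, "T") is unique
    "ind2": {("1", 100, "A")},
}
unique, shared = ("2", 200, "T"), ("1", 100, "A")
rng = random.Random(42)

print(answer_unprotected(genomes, unique), answer_s1(genomes, unique))
print(sum(answer_s2(genomes, unique, 0.3, rng) for _ in range(1000)))
print(answer_s1(genomes, shared), answer_s2(genomes, shared, 0.3, rng))
```

Under these assumptions S1 never reveals the unique allele, while S2 reveals it for some fraction of draws controlled by `epsilon`, which is the knob behind the privacy-utility trade-off noted in the table. This sketch redraws the noise on every call; a deployed beacon would presumably need to keep the answer for a given allele fixed so that repeated queries cannot average the noise away.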