| Literature DB >> 25521230 |
Xiaoqian Jiang, Yongan Zhao, Xiaofeng Wang, Bradley Malin, Shuang Wang, Lucila Ohno-Machado, Haixu Tang.
Abstract
To answer the need for the rigorous protection of biomedical data, we organized the Critical Assessment of Data Privacy and Protection initiative as a community effort to evaluate privacy-preserving dissemination techniques for biomedical data. We focused on the challenge of sharing aggregate human genomic data (e.g., allele frequencies) in a way that preserves the privacy of the data donors, without undermining the utility of genome-wide association studies (GWAS) or impeding their dissemination. Specifically, we designed two problems for disseminating the raw data and the analysis outcome, respectively, based on publicly available data from HapMap and from the Personal Genome Project. A total of six teams participated in the challenges. The final results were presented at a workshop of the iDASH (integrating Data for Analysis, Anonymization, and SHaring) National Center for Biomedical Computing. We report the results of the challenge and our findings about the current genome privacy protection techniques.
Year: 2014 PMID: 25521230 PMCID: PMC4290799 DOI: 10.1186/1472-6947-14-S1-S1
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Figure 1. The haploblock structure in the first dataset used in task 1.
Figure 2. The haploblock structure in the datasets used in task 1. (a) The haploblock structure for dataset 1. (b) The haploblock structure for dataset 2.
Figure 3. The p-value distributions in the datasets used in task 1. (a) The p-value distribution for dataset 1. (b) The p-value distribution for dataset 2.
Figure 4. A snapshot of the dataset released for task 1.
Figure 5. The number of SNVs sampled from chromosomes. (a) The number-sampling distribution for the large dataset. (b) The number-sampling distribution for the small dataset.
Figure 6. Evaluation results for task 1: (a) data utility and (b) privacy evaluation through WIDGET.
Figure 7. Evaluation results for task 2: (a) comparison of different methods for a given K in the top-K SNV identification challenge; (b) comparison of top-K SNV identification performance across different values of K for a given method.
Results of Task 1.
| | | Baseline | Team 1 | Team 2 | Team 3 | | # of sig SNVs |
|---|---|---|---|---|---|---|---|
| D1 | Power | 0.05 | 0.03 | 0.61 | 0.04 | 0.01 | 22 |
| D2 | Power | 0.04 | 0.115 | 0.005 | 0.01 | 0.09 | 45 |
In the first column, D1 refers to 200 participants, 311 SNVs (~29504091-30044866, chr2) and D2 refers to 200 participants, 610 SNVs (~55127312-56292137, chr10). The rows labeled 'Power' indicate the ratio of identifiable individuals using the likelihood ratio test in the case group. The other rows start with a cutoff threshold for the χ² test (e.g., 5 × 10⁻², 10⁻³, 10⁻⁵), for which two measurements (true positive rate and false positive rate for SNVs using the χ² test) were calculated under each method. The last column corresponds to the number of significant SNVs calculated using the original data (i.e., without added noise).
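The per-SNV significance test described above is a standard Pearson χ² test on a 2×2 table of allele counts (case vs. control × alternate vs. reference allele). A minimal stdlib-only sketch is below; the counts and the 5 × 10⁻² cutoff are illustrative and are not drawn from the challenge data:

```python
import math

def chi2_allelic(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic for a 2x2 allele-count table
    (case/control x alt/ref allele), 1 degree of freedom."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    rows = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    cols = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def chi2_pvalue_1df(stat):
    """p-value for a chi-square statistic with 1 df;
    the survival function reduces to erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical allele counts for one SNV: 200 case and 200 control alleles
stat = chi2_allelic(120, 80, 90, 110)
p = chi2_pvalue_1df(stat)
print(p < 5e-2)  # is the SNV significant at the 5e-2 cutoff?
```

Running this test on every SNV and counting hits above/below a cutoff yields the true/false positive rates reported in the table.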
Results of Task 2.
| Dataset | Teams | Top 1 | Top 3 | Top 5 | Top 10 | Top 15 | Top 20 | Top 30 |
|---|---|---|---|---|---|---|---|---|
| Small (5000 SNVs) | UT Austin | 1 | 2.66 | 4.44 | 8.48 | 7.07 | 4.68 | 2.37 |
| Large (100K SNVs) | UT Austin | 1 | 2.65 | 4.41 | 5.90 | 2.26 | 0.69 | 0.18 |
The table shows the average number (over 1000 iterations) of SNVs correctly identified by the privacy-preserving SNV identification algorithms developed by the two participating teams. Both algorithms were trained using the small dataset consisting of 5000 SNVs, and then were tested on both the small and large datasets by selecting the top K (K = 1, 3, 5, 10, 15, 20, 30) most significant SNVs.
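The top-K score reported above can be sketched as the overlap between the true top-K SNVs (ranked by p-value on the original data) and the top-K SNVs ranked from the privacy-protected output. This is a hypothetical sketch of that metric; the challenge's exact scoring procedure may differ in details:

```python
def top_k_snvs(pvalues, k):
    """Indices of the k SNVs with the smallest p-values."""
    return sorted(range(len(pvalues)), key=lambda i: pvalues[i])[:k]

def top_k_overlap(true_pvalues, released_pvalues, k):
    """Number of true top-k SNVs recovered from the released
    (privacy-protected) statistics -- one cell of the table above."""
    true_top = set(top_k_snvs(true_pvalues, k))
    released_top = set(top_k_snvs(released_pvalues, k))
    return len(true_top & released_top)

# Hypothetical p-values for 5 SNVs, before and after noise addition
true_p = [1e-4, 0.3, 2e-3, 0.6, 0.05]
released_p = [2e-4, 0.25, 0.4, 0.55, 0.03]
print(top_k_overlap(true_p, released_p, 3))  # SNVs recovered among the top 3
```

Averaging this count over many randomized iterations gives per-K values like those in the table (e.g., 2.66 out of a possible 3).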