Hyunghoon Cho1, David J Wu2, Bonnie Berger1,3.
Abstract
Most sequenced genomes are currently stored in strict access-controlled repositories. Free access to these data could improve the power of genome-wide association studies (GWAS) to identify disease-causing genetic variants and aid the discovery of new drug targets. However, concerns over genetic data privacy may deter individuals from contributing their genomes to scientific studies and could prevent researchers from sharing data with the scientific community. Although cryptographic techniques for secure data analysis exist, none scales to computationally intensive analyses, such as GWAS. Here we describe a protocol for large-scale genome-wide analysis that facilitates quality control and population stratification correction in 9K, 13K, and 23K individuals while maintaining the confidentiality of underlying genotypes and phenotypes. We show the protocol could feasibly scale to a million individuals. This approach may help to make currently restricted data available to the scientific community and could potentially enable secure genome crowdsourcing, allowing individuals to contribute their genomes to a study without compromising their privacy.
Year: 2018 PMID: 29734293 PMCID: PMC5990440 DOI: 10.1038/nbt.4108
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. Overview of our secure GWAS pipeline
Study participants (private individuals or institutes) secretly share their genotypes and phenotypes with computing parties (research groups or government agencies), denoted CP1 and CP2, who jointly carry out our secure GWAS protocol to obtain association statistics without revealing the underlying data to any party involved. An auxiliary computing party (CP0) performs input-independent precomputation to greatly speed up the main computation.
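The division of labor in this caption can be illustrated with a minimal sketch of two-party additive secret sharing. The caption does not say how CP0's precomputation is used, so the Beaver multiplication triples below are an assumption: they are a standard way to let an auxiliary party speed up multiplication on shared values, not necessarily the exact construction in the protocol. The modulus `P` and all function names are hypothetical choices for illustration.

```python
import secrets

# Assumed prime modulus for this sketch; not taken from the source.
P = 2**61 - 1

def share(x):
    """Split x into two additive shares modulo P (one per computing party)."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def reconstruct(s1, s2):
    """Recombine two additive shares."""
    return (s1 + s2) % P

def beaver_triple():
    """Input-independent precomputation (the role of CP0 in this sketch):
    shares of random a, b and of their product c = a * b."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)

def secure_multiply(x_sh, y_sh, triple):
    """Multiply two shared values using one precomputed triple.

    x_sh and y_sh are (CP1 share, CP2 share) pairs. Only the masked
    differences d = x - a and e = y - b are opened; because a and b are
    uniformly random, d and e reveal nothing about x and y.
    """
    (a1, a2), (b1, b2), (c1, c2) = triple
    d = reconstruct((x_sh[0] - a1) % P, (x_sh[1] - a2) % P)
    e = reconstruct((y_sh[0] - b1) % P, (y_sh[1] - b2) % P)
    # x*y = c + d*b + e*a + d*e; the public term d*e is added by CP1 only.
    z1 = (c1 + d * b1 + e * a1 + d * e) % P
    z2 = (c2 + d * b2 + e * a2) % P
    return z1, z2

# A participant shares a genotype count (2) and a weight (3) with CP1 and CP2.
product_shares = secure_multiply(share(2), share(3), beaver_triple())
print(reconstruct(*product_shares))
```

Each share on its own is a uniformly random value, so neither CP1 nor CP2 learns the input. Because the triple does not depend on the data, it can be generated before any participant submits anything.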
Our secure GWAS protocol accurately identifies SNPs with significant disease associations while protecting privacy
We securely performed GWAS on published data sets for lung cancer (n = 9,098 after quality control), bladder cancer (n = 10,678), and age-related macular degeneration (AMD; n = 20,679). Top two significant associations for each disease identified by our protocol are shown (disregarding redundant nearby hits). Ground truth association statistics are calculated based on the plaintext data. P-values are obtained via the Cochran-Armitage trend test (one-sided) and adjusted for multiple testing via Bonferroni correction. For AMD, p-values were smaller than machine precision and thus could not be precisely determined. Our protocol infers biologically meaningful discoveries from GWAS data sets without compromising the privacy of the underlying data. We additionally provide the top 20 associations for each data set in Supplementary Tables 1–3.
| Data set | Top SNP | Genomic position | Gene | Cochran-Armitage statistic (secure) | Cochran-Armitage statistic (ground truth) | P-value (secure) |
|---|---|---|---|---|---|---|
| Lung cancer | rs2736100 | chr5:1339516 | | 0.01194 | 0.01201 | 7.9924E-20 |
| | rs7086803 | chr10:114488466 | | 0.00799 | 0.00796 | 6.1631E-12 |
| Bladder cancer | rs4862110 | chr4:183988023 | | 0.01403 | 0.01449 | 8.7899E-29 |
| | rs11245742 | chr11:50478883 | - | 0.01031 | 0.01101 | 4.0381E-20 |
| AMD | rs3750847 | chr10:124215421 | | 0.09296 | 0.09297 | <1E-300 |
| | rs3766405 | chr1:196695161 | | 0.07441 | 0.07440 | <1E-300 |
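For reference, the plaintext ("ground truth") calculation named in the caption can be sketched as follows: a Cochran-Armitage trend test on genotype counts, followed by Bonferroni correction. This is a minimal sketch, not the secure protocol itself. The additive weights (0, 1, 2), the function names, and the reading of "one-sided" as the upper tail of a chi-square distribution with one degree of freedom are assumptions.

```python
import math

def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Return (chi2, p) for the Cochran-Armitage trend test.

    cases[i] and controls[i] are the numbers of individuals in each group
    carrying i copies of the tested allele.
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    col = [r + s for r, s in zip(cases, controls)]
    # Trend statistic: T = sum_i t_i * (r_i * S - s_i * R)
    T = sum(t * (r * S - s * R) for t, r, s in zip(weights, cases, controls))
    # Variance of T under the null hypothesis of no association
    squares = sum(t * t * c * (N - c) for t, c in zip(weights, col))
    cross = sum(
        weights[i] * weights[j] * col[i] * col[j]
        for i in range(len(col))
        for j in range(i + 1, len(col))
    )
    var = R * S / N * (squares - 2 * cross)
    if var == 0:
        return 0.0, 1.0  # degenerate table: no variation to test
    chi2 = T * T / var
    # Upper tail of chi-square with 1 degree of freedom (assumed convention)
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)

# Hypothetical genotype counts, for illustration only.
chi2, p = cochran_armitage(cases=(30, 50, 20), controls=(50, 40, 10))
print(chi2, bonferroni(p, n_tests=500_000))
```

In double-precision arithmetic, `math.erfc` underflows to 0 for very large statistics, which is the same kind of limit the caption describes for the AMD p-values.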
Figure 2. Our secure GWAS protocol achieves practical runtimes, and all of our scalability metrics follow a linear trend
We quantified runtime, communication bandwidth, the size of the precomputed data, and the size of the initial data sharing (Online Methods) for the lung cancer, bladder cancer, and AMD data sets as well as simulated data sets of varying sizes obtained by subsampling the lung cancer data set (for 2K and 5K individuals) or duplicating the AMD data set (for 50K and 100K individuals). Since the number of SNPs differs between the data sets, we normalized all measurements to 500K SNPs for comparison, assuming a linear dependence on the number of SNPs. Lines show the best linear fit for each group. Note that the observed linear trends are not perfect because the fraction of individuals or SNPs passing quality control differs across data sets. Overall, our protocol achieves practical runtimes, and all of our performance measures scale linearly with the number of individuals. Phase 1: Quality control procedure. Phase 2: Population stratification analysis (PCA). Phase 3: Association tests.
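The normalization and line fitting described in this caption can be sketched as below. The function names are hypothetical and the measurements are invented placeholders, not values from the study. The fit is ordinary least squares, which is one common meaning of "best linear fit"; the caption does not name the fitting method.

```python
def normalize_to_snps(value, n_snps, target_snps=500_000):
    """Rescale a measurement to a common SNP count, assuming the
    measurement depends linearly on the number of SNPs."""
    return value * target_snps / n_snps

def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Placeholder measurements: (individuals, SNPs, runtime in hours).
# These numbers are illustrative only and do not come from the paper.
runs = [(2_000, 600_000, 6.0), (5_000, 600_000, 15.0), (10_000, 400_000, 20.0)]

individuals = [n for n, _, _ in runs]
runtimes = [normalize_to_snps(t, snps) for _, snps, t in runs]
slope, intercept = linear_fit(individuals, runtimes)

# Extrapolate the fitted line to a larger cohort.
print(slope * 1_000_000 + intercept)
```

Extrapolating a fitted line in this way is only as reliable as the linearity assumption, which the caption notes holds imperfectly because quality-control pass rates vary between data sets.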