Zhiyu Wan, James W Hazel, Ellen Wright Clayton, Yevgeniy Vorobeychik, Murat Kantarcioglu, Bradley A Malin.
Abstract
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information.
Year: 2022 PMID: 35246669 PMCID: PMC8896074 DOI: 10.1038/s41576-022-00455-y
Source DB: PubMed Journal: Nat Rev Genet ISSN: 1471-0056 Impact factor: 59.581
Fig. 1 | An overview of privacy intrusions and safeguards in genomic data flows.
The four routes of genomic data flow (as indicated by the arrow colours) represent four settings in which data are used or shared: health care (red), research (gold), direct-to-consumer (DTC; green) and forensic (dark blue). The grey line represents a combination of the first three settings. In the health-care setting, data collected by a health-care entity (for example, Vanderbilt University Medical Center) are protected by the Genetic Information Nondiscrimination Act of 2008 (GINA)[128] and the Health Insurance Portability and Accountability Act of 1996 (HIPAA)[116,117] for primary uses. In the research setting, data collected by a research entity (for example, 1000 Genomes Project, Electronic Medical Records and Genomics (eMERGE) network or All of Us Research Program) are primarily protected by the Common Rule[14,124] for primary uses and protected by the US National Institutes of Health (NIH) data sharing policy[37,38] for secondary uses. In the DTC setting, data collected by a DTC entity are protected by the European Union’s General Data Protection Regulation (GDPR)[12] and/or the US state privacy laws (for example, California Consumer Privacy Act[130], California Privacy Rights Act[131] or Virginia Consumer Data Protection Act[132]) for primary uses and protected by self-regulation (for example, data use agreements[36], privacy policies[173] or terms of service[174]) for secondary uses. In the forensic setting, data shared with law enforcement are protected by informed consent[192]. A first party refers to the individual to whom the data correspond, whereas a second party refers to the organization (or individual) who collects and/or uses the data for a purpose that the first party is made aware of. By contrast, third parties refer to users (or recipients) of data who have the ability to communicate with the second party only and might include malicious attackers. 
Examples of third parties include researchers who access data from an existing research study or a pharmaceutical company that partners with a DTC genetic testing company. The data flow from a DTC entity to a research entity is represented by the arrow at the bottom. Confidentiality is chiefly at stake when data are being used, whereas anonymity and solitude are chiefly at stake when data are being shared. Specifically, cryptographic tools[31] protect confidentiality against unauthorized access attacks, whereas access control[27] and data perturbation approaches[83] protect anonymity against privacy intrusions such as re-identification and membership inference attacks. We simplify the figure by omitting the impacts of the GDPR and data use agreements in the research setting.
A taxonomy of technical research articles on genomic data privacy featured in this Review
| Attack or protection | Use | Data flow | Data level | Setting | How attacks or protections are achieved | Attributes studied other than genotypes/how data are used | Refs |
|---|---|---|---|---|---|---|---|
| Attack | Secondary | Share | Individual | Health care | Re-ID | Demographics, hospital trail | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID | NA | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID | Pedigree | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID, genotype imputation | Signal profiles | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID, genotype inference | Diseases | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID, genotype inference | Visual traits/3D facial structures | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID, non-genotypic attribute inference | Demographics, name | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID, non-genotypic attribute inference | Demographics, surname | [ |
| Attack | Secondary | Share | Individual | Research | Re-ID, non-genotypic attribute inference | Face, traits, demographics | [ |
| Attack | Secondary | Share | Individual | Research | Genotype imputation | NA | [ |
| Attack | Secondary | Share | Individual | Research, DTC | Genotype imputation | Pedigree | [ |
| Attack | Secondary | Share | Individual | Research, DTC | Genotype imputation, genotype inference, genotype reconstruction | Pedigree | [ |
| Attack | Secondary | Share | Summary | Research | Membership inference | GWAS statistics | [ |
| Attack | Secondary | Share | Summary | Research | Membership inference, genotype inference | Machine learning model, demographics | [ |
| Attack | Secondary | Share | Summary | Research | Membership inference, genotype inference | GWAS statistics, pedigree | [ |
| Attack | Secondary | Share | Summary | Research | Membership inference, non-genotypic attribute inference | Disease status | [ |
| Attack | Secondary | Share | Summary | Research | Membership inference, re-ID, genotype imputation | GWAS statistics | [ |
| Attack | Secondary | Share | Summary | Research | Membership inference, re-ID, genotype inference, genotype reconstruction | GWAS statistics, visual traits | [ |
| Protection | Secondary | Share | Individual | Research | Generalization | RNA sequences | [ |
| Protection | Secondary | Share | Individual | Research | Generalization, suppression | NA | [ |
| Protection | Secondary | Share | Individual | Research | Masking/hiding, risk assessment | Demographics | [ |
| Protection | Secondary | Share | Summary | Research | Suppression, risk assessment | NA | [ |
| Protection | Secondary | Share | Summary | Research | Beacons | Disease | [ |
| Protection | Secondary | Share | Summary | Research | Beacons | GWAS statistics, pedigree | [ |
| Protection | Secondary | Share | Summary | Research | Beacons, differential privacy | GWAS statistics | [ |
| Protection | Secondary | Share | Summary | Research | Beacons, risk assessment | GWAS statistics | [ |
| Protection | Secondary | Share | Summary | Research | Differential privacy | GWAS statistics | [ |
| Protection | Secondary | Share | Summary | Research | Generative adversarial network | Disease | [ |
| Protection | Secondary | Share | Summary | Research | Federated learning | GWAS statistics | [ |
| Protection | Secondary | Share | Summary | Research | Risk assessment | NA | [ |
| Protection | Primary | Use | Individual | Health care | Homomorphic encryption | Disease susceptibility test | [ |
| Protection | Primary | Use | Individual | Health care | Controlled functional encryption | Relatedness tests | [ |
| Protection | Primary | Use | Individual | Health care | SMC | Disease diagnosis | [ |
| Protection | Primary | Use | Individual | Research | Homomorphic encryption | GWAS computation | [ |
| Protection | Primary | Use | Individual | Research | Homomorphic encryption, SMC | GWAS computation | [ |
| Protection | Primary | Use | Individual | Research | Homomorphic encryption, TEE | GWAS computation | [ |
| Protection | Primary | Use | Individual | Research | SMC | GWAS computation | [ |
| Protection | Primary | Use | Individual | Research | TEE | GWAS computation | [ |
| Protection | Primary | Use | Individual | Research | Symmetric encryption, cryptographic hardware | GWAS computation | [ |
| Protection | Primary | Use | Individual | Research, DTC | Homomorphic encryption | Sequence matching, sequence comparison | [ |
| Protection | Primary | Use | Individual | Research, DTC | SMC | Sequence comparison | [ |
| Protection | Primary | Use | Individual | Research, DTC | Fuzzy encryption | Relative identification | [ |
| Protection | Primary | Use | Individual | DTC | Private set intersection protocols | Paternity test, genetic compatibility test | [ |
| Protection | Primary | Store | Individual | Health care | Honey encryption | NA | [ |
| Protection | Primary | Store | Individual | Health care | Secure file format | NA | [ |
| Protection | Secondary | Share | Individual | Research | Blockchain | NA | [ |
| Protection | Secondary | Share | Individual | Research, DTC | Blockchain | NA | [ |
| Protection | Secondary | Share | Individual | DTC | Blockchain, controlled access, homomorphic encryption, SMC | NA | [ |
| Protection | Secondary | Share | Summary | Research | Blockchain | Machine learning model | [ |
| Protection | Secondary | Share | Summary | Research | Controlled access | GWAS statistics | [ |
| Attack | Secondary | Share | Individual | DTC, forensic | Familial search, genotype imputation, genotype reconstruction | Name, e-mail address | [ |
| Attack | Secondary | Share | Individual | Forensic | Familial search, re-ID | Demographics | [ |
| Attack | Secondary | Share | Individual | Forensic | Familial search, re-ID, genotype imputation | Pedigree | [ |
| Attack | Secondary | Share | Individual, summary | Research, DTC | Non-genotypic attribute inference, kin genotype reconstruction | Pedigree | [ |
| Attack | Secondary | Share | Individual, summary | Research, DTC | Attribute inference, kin genotype reconstruction | Pedigree, disease | [ |
| Protection | Primary | Collect | Individual | Forensic | Controlled access, encryption | NA | [ |
| Protection | Secondary | Share | Individual | DTC, research | Masking/hiding, risk assessment | Pedigree | [ |
DTC, direct-to-consumer; GWAS, genome-wide association study; ID, identification; NA, not applicable; SMC, secure multiparty computation; TEE, trusted execution environment.
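Several rows in the table pair membership inference with released GWAS statistics. As a hedged illustration of how such an attack works in principle, the sketch below implements a classic Homer-style statistic that compares a target's genotype against the pool's allele frequencies and a reference population's; it is a textbook construction, not necessarily the exact statistic of the cited works, and the helper `simulate_genotype` and all population parameters are invented for the sketch.

```python
import random

def simulate_genotype(freqs):
    """Draw a per-SNP minor-allele fraction (0, 0.5 or 1) by sampling
    two alleles at each SNP; purely synthetic data for this sketch."""
    return [(int(random.random() < f) + int(random.random() < f)) / 2
            for f in freqs]

def membership_statistic(target, pool_freqs, ref_freqs):
    """Homer-style membership statistic: for each SNP, ask whether the
    target's allele fraction is closer to the pool's frequency than to
    the reference population's. A clearly positive sum over many SNPs
    suggests the target contributed to the pool's summary statistics."""
    return sum(abs(t - r) - abs(t - p)
               for t, p, r in zip(target, pool_freqs, ref_freqs))
```

A data sharer can run the same statistic against its own intended release to quantify membership-inference risk before publishing summary statistics, which is the risk-assessment pattern the table's protection rows refer to.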
Fig. 2 | Data perturbation approaches for privacy protection in genomic data sharing.
Each module (or submodule) can work independently to protect data as shown by the corresponding data flow. In the transformation module, data can be masked[93], generalized[88] and/or suppressed according to a privacy protection model (for example, k-anonymity)[87]. In the aggregation module, data can be aggregated to summary statistics[81] or parameters in a machine learning (ML) model[61]. In the module of synthetic data generation, a synthetic data set can be generated using a generative adversarial network (GAN)[110]. In the obfuscation module, noise can be added to data using a privacy protection model (for example, differential privacy)[103]. All contents in each module (or submodule) are examples for illustration purposes only. In the example for the generalization submodule, the plus sign represents a generalization of values one and two for a genomic attribute. In the example for the submodule of summary statistics, the minor allele frequency for each single-nucleotide polymorphism (SNP) marker is computed for each group of individual records. (n represents the number of records in the group; x_i represents the value of a genomic attribute for the ith record in a group, which is the number of minor alleles at a SNP position for a record in this example.) In the example for the submodule of ML models, the neural network with three layers has 21 parameters (that is, 16 weights and 5 biases) that need to be learned. In the example for the GAN submodule, X represents the input data set, G represents the generator network and D represents the discriminator network. In the example for the reconstruction attack in the module of risk assessment[91], the attacker tries to reconstruct the original data set by linkage and inference[66], and the privacy risk is assessed by the data sharer using a distance function.
In the example for the membership inference attack in the module of risk assessment[92], the attacker tries to infer the membership of each targeted individual by hypothesis testing[58], and the privacy risk is assessed by the data sharer using a function that measures the test's accuracy. The reconstruction attack and the membership inference attack are used here for illustration purposes only and could be replaced with any other attack (for example, a re-identification attack or a familial search attack) or some arbitrary combination of attacks. Data can be sequentially protected by multiple modules and submodules before the privacy risk is mitigated to an acceptable level and finally released. r represents the privacy risk; d represents the distance function; and f represents the function that measures the test's accuracy; data are released only once the risk falls below an acceptable threshold.
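The aggregation and obfuscation modules described above can be sketched in a few lines. The sketch below assumes genotype records coded as per-SNP minor-allele counts (0, 1 or 2), aggregates them to minor allele frequencies, then adds Laplace noise; the function names are ours, and the differential-privacy accounting is deliberately simplified to a single released statistic.

```python
import math
import random

def minor_allele_frequencies(records):
    """Aggregation module: each record lists the number of minor
    alleles (0, 1 or 2) per SNP; the MAF of SNP j over n records
    is sum_i(x_ij) / (2n)."""
    n = len(records)
    return [sum(snp) / (2 * n) for snp in zip(*records)]

def sample_laplace(scale):
    """Draw Laplace(0, scale) noise via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def obfuscate_mafs(mafs, n, epsilon):
    """Obfuscation module: removing one record shifts a MAF by at most
    2/(2n) = 1/n, so Laplace noise with scale 1/(n*epsilon) gives
    epsilon-differential privacy for each released statistic
    (per-statistic accounting only, for illustration)."""
    scale = 1.0 / (n * epsilon)
    return [min(max(m + sample_laplace(scale), 0.0), 1.0) for m in mafs]
```

Chaining the two functions mirrors the figure's pipeline: individual-level records enter the aggregation module, and only the perturbed summary leaves the obfuscation module for release.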
Fig. 3 | Cryptographic approaches for privacy protection in the use of genomic data.
a | Homomorphic encryption enables computation by a third party on encrypted data without decrypting any specific record[141]. In this instance, it is applied to a genome-wide association study[142] and a disease susceptibility test[78]. b | Secure multiparty computation enables multiple parties to jointly compute a function of their inputs without revealing those inputs[146]. Here, three institutions share encrypted data with third parties to compute summary statistics (for example, minor allele frequency (MAF))[145]. c | A trusted execution environment, such as Intel Software Guard Extensions (SGX)[152], isolates the computation process in an encrypted enclave using central processing unit (CPU) support so that even malicious operating system software cannot see the enclave contents[153]. Here, an institution computes summary statistics (for example, MAF) in a secure enclave of a third party. d | A blockchain enables encrypted, immutable records to be stored on a decentralized network[161]. Here, the individual manages the decryption key using a blockchain while sharing encrypted data with researchers[32]. Avg., average; RAM, random-access memory; SNP, single-nucleotide polymorphism.
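As a hedged sketch of the secure multiparty computation idea in panel b, the code below uses additive secret sharing, a common SMC building block (not necessarily the cited protocols[145,146]): each institution splits its local minor-allele count into random shares, each server sums only the shares it holds, and only the joint total is ever reconstructed.

```python
import random

PRIME = 2**61 - 1  # field modulus for additive secret sharing

def share(value, n_parties):
    """Split an integer into n additive shares summing to the value
    mod PRIME; any subset of fewer than n shares looks uniformly
    random and reveals nothing about the input."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def joint_minor_allele_count(private_counts):
    """Each institution secret-shares its local minor-allele count
    across one server per institution; server j adds up the j-th
    share from every institution, and only the final total (never
    any individual input) is reconstructed."""
    n = len(private_counts)
    all_shares = [share(c, n) for c in private_counts]
    server_sums = [sum(col) % PRIME for col in zip(*all_shares)]
    return sum(server_sums) % PRIME
```

Dividing the joint count by twice the total number of sequenced individuals yields the MAF of panel b without any institution disclosing its own cohort's count.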