Literature DB >> 26873931

Privacy-preserving microbiome analysis using secure computation.

Justin Wagner¹, Joseph N Paulson¹, Xiao Wang², Bobby Bhattacharjee², Héctor Corrada Bravo¹.

Abstract

MOTIVATION: Developing targeted therapeutics and identifying biomarkers relies on large amounts of research participant data. Beyond human DNA, scientists now investigate the DNA of micro-organisms inhabiting the human body. Recent work shows that an individual's collection of microbial DNA consistently identifies that person and could be used to link a real-world identity to a sensitive attribute in a research dataset. Unfortunately, the current suite of DNA-specific privacy-preserving analysis tools does not meet the requirements for microbiome sequencing studies.
RESULTS: To address privacy concerns around microbiome sequencing, we implement metagenomic analyses using secure computation. Our implementation allows comparative analysis over combined data without revealing the feature counts for any individual sample. We focus on three analyses and perform an evaluation on datasets currently used by the microbiome research community. We use our implementation to simulate sharing data between four policy-domains. Additionally, we describe an application of our implementation for patients to combine data that allows drug developers to query against and compensate patients for the analysis.
AVAILABILITY AND IMPLEMENTATION: The software is freely available for download at: http://cbcb.umd.edu/∼hcorrada/projects/secureseq.html SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: hcorrada@umiacs.umd.edu.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
DNA

Year: 2016 PMID： 26873931 PMCID： PMC4908319 DOI： 10.1093/bioinformatics/btw073

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Microbiome sequencing seeks to characterize and classify the composition and structure of microbial communities from metagenomic DNA samples. It is estimated that only 1 in 10 cells in and on a person’s body contain that individual’s DNA (Turnbaugh ), the remainder corresponding to microbial DNA, most from organisms that cannot be cultured and studied in the laboratory. The Human Microbiome Project (HMP) (Turnbaugh ), the Global Enterics Multi-Center Study (MSD)(Pop ), the Personal Genome Project (Church, 2005) and the American Gut Project (Blaser ) aim to characterize the ecology of human microbiota and its impact on human health. Potentially pathogenic or probiotic bacteria can be identified by detecting significant differences in their distribution across healthy and disease populations. While the biology has led to promising results, privacy concerns of microbiome research are now being identified with no secure analysis tools available. Recent work by Franzosa et al. (2015) shows that microbiome data are an unique identifier across time points in a dataset and could be used to link a sensitive attribute to an individual. Earlier work by Fierer showed that it is possible to identify an object that an individual touched by comparing microbiome samples from the object and the individual’s hand. We provide a thorough review of microbiome sequencing and a categorization of microbiome privacy considerations in the Supplementary Materials. To counter these concerns, we present an implementation and evaluation of metagenomic association analyses in a secure multi-party computation (SMC) framework. For this work, we focus on garbled circuits, a cryptographic technique that evaluates a function over private inputs from two parties. In this article, we concentrate on the case where two parties, each holding organism abundances in a set of case and control samples, are interested in performing an association analysis (e.g. determining organisms that are differentially abundant in cases) over their combined data, without revealing organism abundances in any specific sample. We provide a detailed review of this approach in Section 3 and benchmark our secure implementation of commonly used microbiome analyses on three public datasets. We also quantify the statistical gain of analysis using combined datasets by simulation with a dataset that contains samples from four different countries. We believe that implementing metagenomic analyses in an SMC framework will prove beneficial to researchers focused on the human microbiome as well as the secure computation community. Computational biologists will benefit from a method that allows efficient and secure function evaluation over datasets which they may be obligated to keep confidential. Security researchers can draw on the findings from our work and construct protocols that enable sharing large, sparse datasets to perform analysis.

2 System and methods

Our secure metagenomic analysis system is built upon garbled circuits (Malkhi ), which we describe in this Section. We then detail our system including participants along with alternative approaches in the design space for privacy-preserving analysis.

2.1 Garbled circuits

Two parties, one holding input x and another holding input y, wish to compute a public function F(x, y) over their inputs without revealing anything besides the output. The parties could provide their inputs to a trusted third-party that computes the function and reveals the output to each party. However, modern cryptography offers a mechanism to run a protocol between only the two parties while achieving the desired functionality. The main idea behind garbled circuits is to represent the function to be computed as a Boolean circuit over the inputs from both parties and use encryption to hide the input of each party during evaluation by mapping each 0 and 1 bit of the inputs unto random strings that still compute the same result. At the end of circuit evaluation, the resulting random strings can be mapped back to appropriate 0 and 1 bit values that can then be released to each party. In this way, each party learns F(x, y) without learning anything else about the input of x and y. Figure 1 illustrates the garbled circuits protocol.

Fig. 1.

Schematic illustration of the garbled circuits protocol. For analyses discussed in this paper, parties P1 and P2 are researchers performing a statistical analysis over combined data. They provide metagenomic count matrices, or locally precomputed statistics computed from count matrices, along with case/control status as input. Function F(x, y) is determined by the analysis performed, e.g. test on difference in Alpha Diversity between case and control. The ‘garbling’ in step (B) also includes randomly permuting the rows of the truth table so that the inputs are not revealed by the ordering - we omit this from the figure for clarity. A review of the Oblivious Transfer protocol used in step (D) is provided in Supplementary Materials Section S3

2.2 System participants

We consider the case in which parties located in two policy-domains want to perform metagenomic analyses over shared data. Examples of policy-domains include countries with differing privacy laws or institutions (universities, companies) that stipulate different data disclosure procedures. For , denoting PD as a policy domain, R as a researcher in policy domain i, D as the data from R, F as the set of functions that a set of Rs would like to compute we consider the following setting: R1 and R2 would like to compute F over combined D1 and D2 but cannot do so by broadcasting the data as either PD1 or PD2 does not allow for public release or reception of individual-level microbiome data. We set but this setting could be generalized to any i. Policy domains naturally arise due to differences in privacy laws. For example, studies currently funded by the NIH are required to release non-human genomic sequences including human microbiome data (http://gds.nih.gov/PDF/NIH_GDS_Policy.pdf). In contrast, the European General Data Protection Regulation, which is currently in draft form, lists biometric data and ‘any “data concerning health” means any personal data which relates to the physical or mental health of an individual, or to the provision of health services to the individual’ as protected information that is not to be released publicly (http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+TA+P7-TA-2014-0212+0+DOC+XML+V0//EN). Therefore, researchers in the USA and EU may encounter different policies for data release but still have an interest in computing metagenomic analyses over shared data. Also, given the results published by Franzosa et al., some institutions may re-evaluate microbiome data release policies.

2.2.1 Threat model

We consider researcher R1, who has a microbiome sample from a victim mixed with other samples, to be a semi-honest adversary, or one that follows the protocol but examines the transcript to learn more information than it should. Researcher R2 is examining an association for a specific trait and would like to expand her study to use samples held by R1. R1 wants to determine if the victim is in R2’s dataset and thus learn a sensitive attribute of the victim such as disease status. The attacks of Fierer et al. and Franzosa et al. operate over the vector of feature counts for a given sample. For the analyses studied in this article, an adversary will have no better chance of reconstructing the count vector for a specific sample than guessing the majority, or mode, of the count of any specific feature in this system. Through using a garbled circuit implementation of metagenomic analyses, R2 will be able to keep the vector of microbiome features for any sample private, learn the outputs of functions that she would like to learn over the shared data, and prevent R1 from completing the attack.

2.3 Solution design approaches

We consider different approaches to allow two parties to compute analyses over data which each must keep confidential.

2.3.1 Access control plus trusted third party

In the USA, the NIH has recognized re-identification through publicly posted genomic data as a realistic threat. Therefore, policy allows for publication of summary statistics and transfer of individual level sequencing data through access control using the Database for Genotypes and Phenotypes (Mailman ). Once a researcher receives permission to access data, she is provided the data and is required to maintain the access control list for her research group. We look to remove the need for access control by implementing the queries that a researcher would like to run without revealing the data directly.

2.3.2 Differential privacy

Statistical perturbation of analysis results, most widely implemented as differential privacy, is a second approach for researchers to provide privacy guarantees to participants. In this setting, a researcher maintains a data set and allows other researchers to perform queries over the data. Informally, the results of these queries are perturbed in such a manner that an adversary, with access to query results over a data set in which one specific participant has a set of values and results from another data set with that specific participant having a different set of values, will not be able to infer any information about that individual by examining the results (Groce, 2014). Although this approach provides provable privacy guarantees, the introduction of statistical noise has not gained traction in the computational biology research community. Also, recent work showed that learning warfarin dosage models on differentially private data sets introduces enough noise that the dosage recommendation could be fatal to patients (Fredrikson ).

2.3.3 Secure multiparty computation

An alternative solution which we undertake is using secure computation to perform metagenomic analyses. Other researchers have presented SMC for computing secure genome-wide association studies using secret-sharing, but that particular approach requires the use of three parties for computing tasks (Kamm ). We address the feasibility of using garbled circuits to implement metagenomic analyses in terms of running time, network traffic, and accuracy. We believe that garbled circuits is the best approach for this scenario as it allows for direct communication between two parties and models research settings well. Further, garbled circuits can handle a variety of adversaries beyond the semi-honest one that we consider in this work.

3 Implementation

In this section, we describe how we implemented metagenomic analyses in garbled circuits and detail an evaluation of our system.

3.1 Metagenomics using garbled circuits

3.1.1 FlexSC

FlexSC, the back end of ObliVM, is a framework for secure computation including garbled circuits with a semi-honest adversary (Liu ). FlexSC allows users to write a function in Java for two parties to compute then compiles and evaluates the garbled circuit representation of that function. We implemented all metagenomic tests as Java packages then compiled and ran each with FlexSC. Our initial work on χ2-test was based on a χ2-test implementation using SNP data (https://github.com/wangxiao1254/idash_competition).

3.1.2 Metagenomic analysis assumptions

For this article, we perform all analyses at the species taxonomic level. As detailed in Supplementary Materials Section S1, OTUs are generated from direct pairwise comparison of sequencing reads. This is a compute-intensive process when performed on clear text (Ghodsi ). We do not attempt it in SMC for this work and assume each party performs this operation locally. We assume that each party will annotate each resulting OTU by matching to a common reference database, previously agreed upon by both parties (note that this reference database is orthogonal to sample-specific sequencing results obtained by each party). For illustration we assume that the agreed upon reference database yields annotation at the microbial species level. We also assume that parties can split data into case and control groups based on an agreed upon phenotype. Finally, we do not consider features that have all zeros in the case or control group for either party.

3.2 Design approaches

We took several approaches to implement each statistic. Since the metagenomic datasets we examined are at least 80% sparse and this trend is expected with OTU data (Paulson ), we make design choices to make computation with garbled circuits feasible. We now detail each implementation of the χ2 -test, odds ratio, Differential Abundance and Alpha Diversity. To measure the impact of our design choices we implemented a naive algorithm for each statistic and compared results.

3.2.1 Precomputation

We first developed a method that finds an aggregate statistic at each party so that only those values are circuit inputs. This method is a straightforward approach to reduce the amount of operations and data in the secure computation protocol. As expected, for each statistic this approach had the best performance on all the datasets we evaluated. Supplementary Figure S2 shows the process for calculating a χ2-test and odds ratio on precomputed contingency table counts. An issue with this approach is not all analyses that researchers are interested in computing may be able to be performed over locally generated aggregates.

3.2.2 Sparse matrix

We devised two methods to account for the sparsity of the feature count matrices we used for evaluation. We first followed an approach introduced by Nikolaenko to perform sparse matrix factorization in garbled circuits. We detail our work with this technique in the Supplementary Materials Section S4. As our contribution, we took a conceptually simpler approach that input the non-zero elements for each feature to the circuit and operated over those elements directly. As shown in Figures 4 and 5, this method significantly reduces the number of operations that need to be performed in the secure protocol and offers reasonable running times compared to the precomputation approach.

3.2.3 Presence/absence

We implemented the χ2-test and odds ratio to perform presence/absence association testing. We provide a review of χ2-test and odds ratio in Supplementary Materials Section S1. For the precomputation technique, each party splits its data into case and control groups on a characteristic determined outside of this protocol. Each party then locally computes the contingency table counts on the split data. These contingency table counts are each party’s input into the circuit. Within the circuit, the counts are summed for both case and control groups then the χ2-statistic along with the odds ratio are computed for each feature. In the sparse matrix approach, the total number of samples and all non-zero elements for each feature are input to a garbled circuit. The circuit first adds the number of non-zero elements to compute the present contingency table counts then uses the total number of samples to find the absent counts.

3.2.4 Differential abundance

For calculating differential abundance, we implemented a two-sample t-test for testing the mean abundance between case and control groups. We assume normalization of sequencing counts can be accomplished in a preprocessing step between both parties. We make this assumption because we use normalized datasets in our evaluation. We leave implementation of normalization techniques in garbled circuits to future work. For review of two-sample t-test we refer the reader to the Supplementary Materials Section S1. We examined the process for calculating mean, variance and the t-statistic to determine what optimizations can be made for computing in a circuit. In order to avoid processing all samples within the computation framework, we observe transformations that reduce the total number of operations. In the Supplementary Materials, we show how mean abundance and variance can be computed using the sum, sum of squares and total number of elements from each party. For precomputation, as each institution only needs to provide three values per feature we calculate them locally. In the circuit, a two-sample t-statistic to test difference between case and control groups is computed. For the sparse matrix approach, the total sum and sum of squares are calculated in the circuit using the non-zero elements for each feature. Mean abundance along with variance can then be calculated and used compute the two-sample t-test. We refer the reader to Supplementary Materials Section S4 for more detail.

3.2.5 Alpha diversity

We use a two-sample t-test to determine the significance of mean Alpha Diversity difference between case and control groups. Given that FlexSC does not currently compute logarithm, we measure Alpha Diversity as Simpson’s index: where n is the number of OTU counts for OTU and N is the total number of counts observed in a sample. For precomputation, we locally compute Simpson’s index for each sample. These values are input into the circuit where they are summed, mean and variance is taken, and the t-statistic is calculated. In Alpha Diversity, all samples in case and control must be processed together as opposed to Presence/Absence and Differential Abundance which can be computed per feature. For our sparse computation design, the two values for Simpson’s index, and are generated over each sample in the circuit during one pass through the matrix. Then a pass over an array of these values using division yields Simpson’s index from which the total sum and sum of squares can be used to compute the two-sample t-test between case and control groups.

3.3 Evaluation

We evaluated our implementation using two Amazon EC2 r3.2xLarge instances with 2.5 GHz processors and 61 GB RAM running Amazon Linux AMI 2015.3. We measured the size of the circuit generated, running time and network traffic between both parties for each metagenomic statistic and dataset. Circuit size serves as a useful comparison metric since it depends on the function and input sizes but is independent of hardware. Running time and network traffic are helpful in system-design decisions and benchmarking of deployments.

3.4 Datasets

We used OTU count data from the Personal Genome Project (PGP) (Church, 2005), the HMP (Turnbaugh ), and the Global Enterics MSD (Pop ). We retrieved the MSD data from the project website (ftp://ftp.cbcb.umd.edu/pub/data/GEMS/MSD1000.biom) as well as the PGP and HMP datasets are from the American-Gut project site (https://github.com/biocore/American-Gut/tree/master/data) (Blaser ). We used the tongue as the case and gingiva as control for the HMP data. For PGP, we set forehead as case and left palm as control. Case and control criteria for the MSD dataset were already set by the researchers that publish the data depending on disease phenotype. After aggregating to species and removing features which hold all zeros for either the case or the control group, the PGP contains 168 samples and 277 microbiome features, the HMP has 694 samples and 97 features, and the MSD dataset consists of 992 samples and 754 features. Supplementary Table S2 summarizes the size and sparsity of each dataset.

3.5 Efficiency of secure computation

3.5.1 Circuit size

Figure 2 shows the circuit size per feature for each experiment. As a result of the work by Kolesnikov and Schnieder (2008), XOR gates in each circuit do not require costly network traffic and computation, therefore the total number of non-XOR gates is reported for each statistic and dataset. Using precomputation, the complexity of the equation in terms of arithmetic operations to calculate each statistic determines the circuit size. This explains the circuit sizes for odds ratio and χ2 test as compared with Differential Abundance. For Alpha Diversity, all rows and columns are preprocessed with only the two sample t-test computed in the circuit. With the sparse implementation, the complexity of the test along with the number of non-zero elements in the dataset directly affects circuit size.

Fig. 2.

Circuit size per feature for each implementation and dataset. The feature count for Alpha Diversity is the number of samples. The differences in Alpha Diversity between datasets is explained by the number of samples for PGP (168) being much lower than that of HMP (694) and MSD (992). PC, Pre-compute

3.5.2 Running time

For the sparse implementation, the running time was proportional to the size and number of non-zero elements in each dataset. For precomputation, Alpha Diversity was affected by the number of samples in each dataset. The running time for the χ2 test, odds ratio, and Differential Abundance were proportional to the number of features (rows) processed. Figure 3 summarizes the effects of input size and algorithm complexity on running time.

Fig. 3.

Running time for each statistic and each dataset in minutes. In each statistic, the number of arithmetic operations determined the running time. The size of the dataset along with sparsity contributed to running time for the sparse implementations. Alpha Diversity MSD Naive did not run to completion on the EC2 instance size due to insufficient memory. Based on the circuit size and the number of gates processed per second for other statistics, we estimate the running time to be 378 min. PC, Pre-compute

3.5.3 Network traffic

Supplementary Table S5 shows the network traffic for each experiment. The increase in network traffic between the precomputation and sparse implementations is more significant than the differences in running times of those approaches. We believe that the network traffic for the precompute implementation is quite good for the security guarantees provided with using garbled circuits while the sparse approach presents an acceptable tradeoff depending on the network resources available.

3.6 Accuracy

We compared the accuracy of our implementation results to computing the statistic using standard R libraries. Table 1 lists the accuracy of results for the χ2 statistic, odds ratio, as well as the t-test results for Differential Abundance and Alpha Diversity. The differences in our garbled circuits results compared to the R values appear to be the result of circuit complexity. The floating-point arithmetic operations in FlexSC are software implementations. Therefore the operations are subject to rounding errors that are rarely observed on modern processors which have hardware level support for floating-point arithmetic.

Table 1.

Computation accuracy

	PGP	HMP	MSD
Chi-square statistic	7.84e-07	7.48e-06	7.02e-08
Chi-square P-value	2.00e-07	2.14e-06	9.72e-08
odds ratio	1.60e-13	5.42e-13	2.44e-13
Differential abundance
t-statistic	0.023	0.0017	0.0012
Differential abundance
degrees of freedom	2.7e-4	2.5e-4	0.0028
Differential abundance
P-value	0.0024	0.0026	0.0011
Alpha Diversity
t-statistic	0.0038	0.017	0.0049
Alpha Diversity
degrees of freedom	1.48e-05	9.7e-4	2.2e-4
Alpha Diversity
P-value	0.0088	0.044	0.014

Results were generated using the R chisq.test{stats}, odds.ratio{abd}, t.test{stats}, and diversity{vegan} against our implementation in ObliVM for the χ2-test, odds ratio, differential abundance and Alpha Diversity. We use Normalized Mean Squared Error: with x as the value output by R and y the value from our implementation. For comparing P-values, we use the log10 P-value and exclude any exact matches [since log10(0) = −Inf in R] while computing the mean.

Computation accuracy Results were generated using the R chisq.test{stats}, odds.ratio{abd}, t.test{stats}, and diversity{vegan} against our implementation in ObliVM for the χ2-test, odds ratio, differential abundance and Alpha Diversity. We use Normalized Mean Squared Error: with x as the value output by R and y the value from our implementation. For comparing P-values, we use the log10 P-value and exclude any exact matches [since log10(0) = −Inf in R] while computing the mean. We investigated if our implementation yielded any false positives and false negatives with the results from R acting as ground truth. For the P-values of Differential Abundance in PGP, HMP, and MSD datasets we found no false positives or false negatives for a significance level of 0.05.

3.7 Significant features discovered through data-sharing

Researchers in different policy domains may be forced to compute analyses on partial data. We measured the effect of using our implementation for data-sharing between policy domains. The MSD dataset provides a means to simulate secure computation of microbiome analyses between different countries. The data were gathered from Kenya, The Gambia, Bangladesh and Mali. We simulate each country performing secure Differential Abundance pair-wise with the other countries. We observed that sharing data resulted in a substantial increase (at minimum a 98% increase) in the number of species found to be differentially abundant between case and control groups. Table 2 summarizes the results.

Table 2.

Significant features found from sharing data between each country

	Features found	Total increase
Kenya only	47	N/A
Gambia only	84	N/A
Mali only	58	N/A
Bangladesh only	75	N/A
Kenya + The Gambia	133	86
Kenya + Mali	112	65
Kenya + Bangladesh	138	91
Gambia + Bangladesh	166	82
Mali + Gambia	167	109
Mali + Bangladesh	169	111

When computing data with another policy domain, each country saw an increase in the number of features detected to be significantly different between case and control groups.

Significant features found from sharing data between each country When computing data with another policy domain, each country saw an increase in the number of features detected to be significantly different between case and control groups.

3.8 Metagenomic codes

We also evaluated our implementation on the genetic marker data that showed the greatest identification power in the metagenomic codes analysis (Franzosa ). The data are also from the HMP and consists of a total of 85 samples and 221 111 features. Due to the large number of features and sparsity of the data, we implemented a filtering garbled circuit in which we first return a vector to each party denoting if a given feature meets a presence cutoff and then have each party input those features into our existing implementations to compute the statistical test. For χ2, the 1 729 851 751 gate circuit (circuit size of 7823 Non-Free gates per feature) is evaluated in 67.4 min, with 51 926.35 MB sent to the evaluator, and 1 642.53 MB sent to generator. For odds ratio, the 632 918 505 gate circuit is evaluated in 33.18 min, with 20,542.84 MB sent to the evaluator, and 1,642.29 MB sent to generator. This result shows that the secure comparative analyses we would like to perform are possible given the legitimate concerns raised by Franzosa et al.

4 Discussion

In this section, we describe related work and provide a context for our contribution. We also discuss a use case for our solution in building datasets and finally present conclusions we formed during the course of our work.

4.1 Related work

As we are the first, to our knowledge, to approach secure microbiome analysis, we review related work on privacy-preserving operations over human DNA.

4.1.1 Secure DNA sequence matching and searching

Comparing two DNA segments is essential to genome alignment and identifying the presence of a disease causing mutation. One approach is to use an oblivious finite state machine for privacy-preserving approximate string matching (Troncoso-Pastoriza ). FastGC, the predecessor of the FlexSC library, was benchmarked by computing Levenstein distance and the Smith-Waterman algorithm between private strings held by two parties (Huang ). More recently, Wang compute approximate edit-distance using whole genome sequences.

4.1.2 Privacy-preserving Genome-wide association studies

Prior work has shown that secure computation between two institutions on biomedical data is possible by using a three-party secret-sharing scheme (Kamm ). The authors present an implementation of a χ2-test over SNP data using the Sharemind framework. Other researchers have presented a modification of functional encryption that enables a person to provide her genome and phenotype to a study but only for a restricted set of functions based on a policy parameter (Naveed ). Prior works have built systems for genomic studies using different cryptographic protocols, including systems using additive homomorphic encryption (Dachman-Soled ) and systems using fully homomorphic encryption (Lauter ). When compared with these works, we use a garbled circuit protocol with circuits for floating-point operations. Our system has two unique advantages compared to these prior works: (i) We can benefit from a long line of work on improving the practicality of garbled circuits (Huang ; Kolesnikov and Schneider, 2008; Zahur ) (ii) Floating-point operations ensure us a small and bounded error even after multiple operations.

4.1.3 Secure genetic testing

For using sequencing results in the clinical realm, paternity determination and patient-matching is possible using private set intersection (Baldi ). Also, it is feasible to utilize homomorphic encryption for implementing disease-risk calculation without revealing the value of any genomic variant (Ayday ).

4.2 Patient pool

A novel application of multi-party secure computation approaches to genomic analysis are patient pool designs that can benefit patient groups, specifically those suffering from rare diseases or those with insufficient data in existing repositories for association studies. The recent announcement by 23andMe to begin drug development on its genome variant datasets highlights the value of biomarker data. We imagine a scenario where individuals can use our solution to create and manage datasets in order to charge drug developers to run analysis functions over the data. The companies will have to be non-colluding as otherwise all function results could be shared among companies. The current regulatory process for drug development allows a mechanism to enforce this constraint. The patient pool can be paid to compute a function to over its data and sign the output. Upon requesting drug trial permission in the USA, a company is required to hand over all data from research, which in this case would include the output of the patient pool analysis and signatures over those results. The FDA could verify the signatures to enforce non-collusion between companies. This provides a mechanism to create high-quality datasets that are accessible to a variety of companies and ensure patients are compensated for their efforts.

5 Conclusions

In this article, we show that it is possible to perform metagenomic analyses in a secure computation framework. Our implementation made use of precomputation steps to minimize the number of operations performed in secure computation making the use of garbled circuits feasible. We also implemented sparse-matrix methods for each statistic. We took this step in order to prove the applicability of this solution for other analyses when the data itself acts as sufficient statistics, such as for the Wilcoxon rank-sum test. We also explored potential applications of our implementation in patient pool designs. Although the storage and sharing of medical data is ultimately a policy matter, providing a technical solution is useful to forming good policy. We believe that given the time costs associated with re-consenting patients to release data to another researcher or creating a legal contract stipulating a data receiver’s responsibility, that the running times we presented for metagenomic analyses are a reasonable tradeoff. DNA-sequencing technologies are entering a period of unprecedented applicability in clinical and medical settings with a concomitant need for regulatory oversight over each individual’s sequencing data. We believe that addressing privacy concerns through computational frameworks similar to those used in this article is paramount for patients while allowing researchers to have access to the largest and most descriptive datasets possible. We expect that secure computation and storage of DNA sequencing data, both the individual’s DNA and their metagenomic DNA, will play an increasingly important role in the biomedical research and clinical practice landscape.

11 in total

1. Forensic identification using skin bacterial communities.

Authors: Noah Fierer; Christian L Lauber; Nick Zhou; Daniel McDonald; Elizabeth K Costello; Rob Knight
Journal: Proc Natl Acad Sci U S A Date: 2010-03-15 Impact factor: 11.205

2. The NCBI dbGaP database of genotypes and phenotypes.

Authors: Matthew D Mailman; Michael Feolo; Yumi Jin; Masato Kimura; Kimberly Tryka; Rinat Bagoutdinov; Luning Hao; Anne Kiang; Justin Paschall; Lon Phan; Natalia Popova; Stephanie Pretel; Lora Ziyabari; Moira Lee; Yu Shao; Zhen Y Wang; Karl Sirotkin; Minghong Ward; Michael Kholodov; Kerry Zbicz; Jeffrey Beck; Michael Kimelman; Sergey Shevelev; Don Preuss; Eugene Yaschenko; Alan Graeff; James Ostell; Stephen T Sherry
Journal: Nat Genet Date: 2007-10 Impact factor: 38.330

Review 3. The microbiome explored: recent insights and future challenges.

Authors: Martin Blaser; Peer Bork; Claire Fraser; Rob Knight; Jun Wang
Journal: Nat Rev Microbiol Date: 2013-02-04 Impact factor: 60.633

4. Identifying personal microbiomes using metagenomic codes.

Authors: Eric A Franzosa; Katherine Huang; James F Meadow; Dirk Gevers; Katherine P Lemon; Brendan J M Bohannan; Curtis Huttenhower
Journal: Proc Natl Acad Sci U S A Date: 2015-05-11 Impact factor: 11.205

5. Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.

Authors: Matthew Fredrikson; Eric Lantz; Somesh Jha; Simon Lin; David Page; Thomas Ristenpart
Journal: Proc USENIX Secur Symp Date: 2014-08

6. A new way to protect privacy in large-scale genome-wide association studies.

Authors: Liina Kamm; Dan Bogdanov; Sven Laur; Jaak Vilo
Journal: Bioinformatics Date: 2013-02-14 Impact factor: 6.937

7. DNACLUST: accurate and efficient clustering of phylogenetic marker genes.

Authors: Mohammadreza Ghodsi; Bo Liu; Mihai Pop
Journal: BMC Bioinformatics Date: 2011-06-30 Impact factor: 3.169

8. The personal genome project.

Authors: G M Church
Journal: Mol Syst Biol Date: 2005-12-13 Impact factor: 11.429

9. Differential abundance analysis for microbial marker-gene surveys.

Authors: Joseph N Paulson; O Colin Stine; Héctor Corrada Bravo; Mihai Pop
Journal: Nat Methods Date: 2013-09-29 Impact factor: 28.547

10. Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition.

Authors: Mihai Pop; Alan W Walker; Joseph Paulson; Brianna Lindsay; Martin Antonio; M Anowar Hossain; Joseph Oundo; Boubou Tamboura; Volker Mai; Irina Astrovskaya; Hector Corrada Bravo; Richard Rance; Mark Stares; Myron M Levine; Sandra Panchalingam; Karen Kotloff; Usman N Ikumapayi; Chinelo Ebruke; Mitchell Adeyemi; Dilruba Ahmed; Firoz Ahmed; Meer Taifur Alam; Ruhul Amin; Sabbir Siddiqui; John B Ochieng; Emmanuel Ouma; Jane Juma; Euince Mailu; Richard Omore; J Glenn Morris; Robert F Breiman; Debasish Saha; Julian Parkhill; James P Nataro; O Colin Stine
Journal: Genome Biol Date: 2014-06-27 Impact factor: 13.583

6 in total

Review 6. The development of large-scale de-identified biomedical databases in the age of genomics-principles and challenges.

Authors: Fida K Dankar; Andrey Ptitsyn; Samar K Dankar
Journal: Hum Genomics Date: 2018-04-10 Impact factor: 4.639