Justin Wagner, Joseph N Paulson, Xiao Wang, Bobby Bhattacharjee, Héctor Corrada Bravo.
Abstract
MOTIVATION: Developing targeted therapeutics and identifying biomarkers rely on large amounts of research participant data. Beyond human DNA, scientists now investigate the DNA of micro-organisms inhabiting the human body. Recent work shows that an individual's collection of microbial DNA consistently identifies that person and could be used to link a real-world identity to a sensitive attribute in a research dataset. Unfortunately, the current suite of DNA-specific privacy-preserving analysis tools does not meet the requirements for microbiome sequencing studies.
Year: 2016 PMID: 26873931 PMCID: PMC4908319 DOI: 10.1093/bioinformatics/btw073
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Schematic illustration of the garbled circuits protocol. For analyses discussed in this paper, parties P1 and P2 are researchers performing a statistical analysis over combined data. They provide metagenomic count matrices, or locally precomputed statistics computed from count matrices, along with case/control status as input. Function F(x, y) is determined by the analysis performed, e.g. a test on the difference in Alpha Diversity between case and control. The ‘garbling’ in step (B) also includes randomly permuting the rows of the truth table so that the inputs are not revealed by the ordering; we omit this from the figure for clarity. A review of the Oblivious Transfer protocol used in step (D) is provided in Supplementary Materials Section S3.
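The garbling step in the caption above, including the row permutation, can be illustrated for a single gate. The following is a minimal sketch only: the names `garble_gate` and `evaluate_gate`, the SHA-256 based pad and the zero-byte tag are assumptions made for illustration, not the construction used by ObliVM, and the Oblivious Transfer of input labels is left out entirely.

```python
import hashlib
import random
import secrets

LABEL_BYTES = 16
TAG = b"\x00" * 8  # illustrative padding that marks a valid decryption


def _pad(label_a: bytes, label_b: bytes) -> bytes:
    """Derive a one-time pad from the two input wire labels (assumed scheme)."""
    return hashlib.sha256(label_a + label_b).digest()[: LABEL_BYTES + len(TAG)]


def _xor(left: bytes, right: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(left, right))


def garble_gate(gate):
    """Garble one two-input boolean gate; returns (wire labels, shuffled table)."""
    # Each wire gets two random labels: index 0 encodes bit 0, index 1 encodes bit 1.
    labels = {
        wire: (secrets.token_bytes(LABEL_BYTES), secrets.token_bytes(LABEL_BYTES))
        for wire in ("a", "b", "out")
    }
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = labels["out"][gate(bit_a, bit_b)]
            pad = _pad(labels["a"][bit_a], labels["b"][bit_b])
            table.append(_xor(out_label + TAG, pad))
    # Permute the rows so that row position reveals nothing about the inputs.
    random.SystemRandom().shuffle(table)
    return labels, table


def evaluate_gate(table, label_a: bytes, label_b: bytes) -> bytes:
    """Recover the output label using exactly one label per input wire."""
    pad = _pad(label_a, label_b)
    for row in table:
        plain = _xor(row, pad)
        if plain.endswith(TAG):
            return plain[:LABEL_BYTES]
    raise ValueError("no row decrypted with the supplied labels")


if __name__ == "__main__":
    labels, table = garble_gate(lambda a, b: a & b)
    out = evaluate_gate(table, labels["a"][1], labels["b"][1])
    print(out == labels["out"][1])
```

The evaluator sees only random-looking labels and a shuffled table, so it learns the output label without learning which input bits the labels stand for.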
Fig. 2. Circuit size per feature for each implementation and dataset. The feature count for Alpha Diversity is the number of samples. The difference in Alpha Diversity between datasets is explained by the number of samples for PGP (168) being much lower than that of HMP (694) and MSD (992). PC, Pre-compute
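Alpha Diversity is computed once per sample, which is why its feature count equals the number of samples. As a plaintext illustration (not the secure implementation), the sketch below assumes the Shannon index, which is the default of vegan's `diversity` function, and Welch's t-test, which is the default of R's `t.test`. The function names and the toy counts are hypothetical.

```python
import math


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln p_i) over the non-zero counts of one sample."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def welch_t(case, control):
    """Welch's t statistic and degrees of freedom for two groups of values."""
    n1, n2 = len(case), len(control)
    m1, m2 = sum(case) / n1, sum(control) / n2
    v1 = sum((x - m1) ** 2 for x in case) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    se1, se2 = v1 / n1, v2 / n2  # squared standard errors of the two means
    t = (m1 - m2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    # Toy per-sample count vectors (hypothetical data, one list per sample).
    case_samples = [[10, 0, 5, 1], [8, 2, 2, 0], [3, 3, 3, 3]]
    control_samples = [[20, 0, 0, 1], [15, 1, 0, 0], [9, 0, 1, 0]]
    case_h = [shannon_diversity(s) for s in case_samples]
    control_h = [shannon_diversity(s) for s in control_samples]
    print(welch_t(case_h, control_h))
```

Each party could compute the per-sample index locally, which corresponds to the idea of supplying locally precomputed statistics as input.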
Fig. 3. Running time for each statistic and each dataset in minutes. For each statistic, the number of arithmetic operations determined the running time. The size of the dataset along with sparsity contributed to running time for the sparse implementations. Alpha Diversity MSD Naive did not run to completion on the EC2 instance because of insufficient memory. Based on the circuit size and the number of gates processed per second for other statistics, we estimate the running time to be 378 min. PC, Pre-compute
Computation accuracy
| Chi-square statistic | 7.84e-07 | 7.48e-06 | 7.02e-08 |
| Chi-square | 2.00e-07 | 2.14e-06 | 9.72e-08 |
| odds ratio | 1.60e-13 | 5.42e-13 | 2.44e-13 |
| Differential abundance | 0.023 | 0.0017 | 0.0012 |
| Differential abundance, degrees of freedom | 2.7e-4 | 2.5e-4 | 0.0028 |
| Differential abundance | 0.0024 | 0.0026 | 0.0011 |
| Alpha Diversity | 0.0038 | 0.017 | 0.0049 |
| Alpha Diversity, degrees of freedom | 1.48e-05 | 9.7e-4 | 2.2e-4 |
| Alpha Diversity | 0.0088 | 0.044 | 0.014 |
Results were generated by comparing the R functions chisq.test{stats}, odds.ratio{abd}, t.test{stats} and diversity{vegan} against our ObliVM implementations of the χ2-test, odds ratio, differential abundance and Alpha Diversity. The error measure is Normalized Mean Squared Error, with x the value output by R and y the value from our implementation. For P-values, we compare log10 P-values and exclude any exact matches [since log10(0) = −Inf in R] when computing the mean.
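The comparison described above can be sketched as follows. The exact normalisation is not given in this text, so the sketch assumes one common definition, the summed squared error divided by the summed squared reference values; that choice, the function names and the handling of zero P-values are all assumptions made for illustration.

```python
import math


def nmse(x, y):
    """Assumed NMSE: sum((x_i - y_i)^2) / sum(x_i^2), with x as the reference."""
    numerator = sum((a - b) ** 2 for a, b in zip(x, y))
    denominator = sum(a ** 2 for a in x)
    return numerator / denominator


def nmse_log10_pvalues(p_reference, p_implementation):
    """Compare P-values on the log10 scale, skipping exact matches.

    Zero P-values are skipped as well (an assumption), because
    math.log10(0) raises ValueError in Python.
    """
    kept = [
        (math.log10(a), math.log10(b))
        for a, b in zip(p_reference, p_implementation)
        if a != b and a > 0 and b > 0
    ]
    return nmse([a for a, _ in kept], [b for _, b in kept])


if __name__ == "__main__":
    # Hypothetical values standing in for R output and secure-computation output.
    print(nmse([1.0, 2.0], [1.0, 3.0]))
    print(nmse_log10_pvalues([0.01, 0.5, 0.001], [0.01, 0.5, 0.0001]))
```

If every pair matches exactly, no pairs remain and the function raises ZeroDivisionError; a caller would need to treat that case separately.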
Significant features found from sharing data between each country
| Kenya only | 47 | N/A |
| Gambia only | 84 | N/A |
| Mali only | 58 | N/A |
| Bangladesh only | 75 | N/A |
| Kenya + The Gambia | 133 | 86 |
| Kenya + Mali | 112 | 65 |
| Kenya + Bangladesh | 138 | 91 |
| Gambia + Bangladesh | 166 | 82 |
| Mali + Gambia | 167 | 109 |
| Mali + Bangladesh | 169 | 111 |
When computing on data combined with another policy domain, each country saw an increase in the number of features detected to be significantly different between case and control groups.