Md Momin Al Aziz, Dima Alhadidi, Noman Mohammed.
Abstract
BACKGROUND: Edit distance is a well-established metric that quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. It is used to measure similarity between human genomic sequences because it captures the requirements of that domain and leads to better diagnosis of diseases. However, in addition to the computational cost imposed by the length of genomic sequences, the privacy of these sequences is highly important. Because genomic sequences are unique and can identify an individual, they cannot be shared in plaintext.
Keywords: Edit distance approximation on genomic data; Genomic sequence similarity; Privacy of genomic data; Secure edit distance; Secure genomic sequence similarity
Year: 2017 PMID: 28786362 PMCID: PMC5547448 DOI: 10.1186/s12920-017-0279-9
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
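The edit distance described in the abstract can be illustrated with a short plaintext sketch. It assumes the usual Levenshtein operations (insertion, deletion, substitution, each with unit cost); the abstract only says "operations", so this choice and the function name `edit_distance` are assumptions for illustration. The sketch involves no privacy protection.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions, and
    substitutions needed to turn string a into string b (assumed
    Levenshtein definition)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

The table is filled row by row, so the running time grows with the product of the two sequence lengths, which is why long genomic sequences make the exact computation expensive.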
Fig. 1 Problem architecture
Fig. 2 Garbled Circuits
Fig. 3 Execution order (second approximation)
Dataset consideration
| Parameters | Dataset 1 | Dataset 2 |
|---|---|---|
| Number of records ( | 500 | 2000 |
| Sequence length ( | 3400-3500 | 9000-10000 |
| Number of queries | 1 | 50 |
| Query length | 3461 | 9000-10000 |
| Data size (MB) | 1.65 | 17.2 |
| Data source | iDash 2016 [ | Generated |
Relationship between the shingle dataset size and the number of unique shingles for different shingle sizes (w)
| Shingle size | Unique shingles | Shingle dataset size (MB) |
|---|---|---|
| 5 | 1024 | 0.007 |
| 10 | 354,457 | 4.05 |
| 15 | 1,383,525 | 22.4 |
| 20 | 2,927,918 | 61.4 |
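The shingle counts above can be related to a small plaintext sketch. It assumes a shingle is a length-w substring and that records are ranked by the number of shingles they share with the query. Both the ranking rule and the names `shingles` and `top_k_by_shared_shingles` are assumptions for illustration, and the plain set intersection used here provides no privacy. The 1024 unique shingles reported for w = 5 equal 4^5, which is consistent with a four-letter alphabet.

```python
from typing import List, Set


def shingles(seq: str, w: int) -> Set[str]:
    """Return the set of all length-w substrings (shingles) of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def top_k_by_shared_shingles(query: str, records: List[str],
                             w: int, k: int) -> List[int]:
    """Rank record indices by how many shingles they share with the query
    (assumed similarity score) and return the k best indices."""
    q = shingles(query, w)
    scored = [(len(q & shingles(rec, w)), idx)
              for idx, rec in enumerate(records)]
    # Highest overlap first; ties broken by the lower record index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

For example, with w = 3 the query `"ACGTACGA"` shares four shingles with `"ACGTACGT"` and two with `"ACGTTTTT"`, so those records rank first and second.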
Fig. 4 Accuracy of shingling and PSI approximation using Dataset 1. X-axis shows different k values (top-k) and Y-axis shows the accuracy for different w values
Fig. 5 Accuracy of shingling and PSI approximation using Dataset 2. X-axis shows different k values (top-k) and Y-axis shows the accuracy for different w values
Fig. 6 Accuracy of the banded alignment using Dataset 2. X-axis shows different k values (top-k) and Y-axis shows the accuracy for different band values b
Fig. 7 Accuracy of the banded alignment after shingles and PSI method using Dataset 2. X-axis shows different k values (top-k) and Y-axis shows the accuracy for different t values
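The banded alignment named in the captions for Figs. 6 and 7 can be sketched as follows. This is a generic banded edit distance, not necessarily the variant evaluated in the figures: it assumes the band value is the maximum allowed offset |i - j| from the main diagonal of the dynamic-programming table, and the name `banded_edit_distance` is hypothetical.

```python
def banded_edit_distance(a: str, b: str, band: int) -> float:
    """Edit distance restricted to table cells with |i - j| <= band.

    The result is never smaller than the true edit distance and equals
    it whenever the true distance is at most band. Returns infinity if
    the final cell lies outside the band.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return INF
    # Row 0: only the first band + 1 cells are inside the band.
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i  # delete the first i characters of a
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[m]
```

Only about 2·band + 1 cells per row are evaluated, so a small band trades accuracy on distant sequence pairs for far fewer table cells.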
Running time analysis (top-10 queries with k=10, c=5 (t=ck), w=10, and b=5)
| Dataset | Method | Preprocessing time (s) | Query time (s) |
|---|---|---|---|
| Dataset 1 | Plain edit distance | 0 | 23 |
| Dataset 1 | Shingles with PSI | 18 | 5 |
| Dataset 1 | Protocol 1 [ | 5.7 | 585 |
| Dataset 1 | Protocol 2 [ | 5.7 | 511 |
| Dataset 2 | Plain edit distance | 0 | 930 |
| Dataset 2 | Protocol 1 [ | 61 | 3049 |
| Dataset 2 | Protocol 2 [ | 61 | 2800 |
| Dataset 2 | Shingles with PSI | 181 | 108 |
| Dataset 2 | Shingles with PSI + banded alignment | 181 | 730 |
Fig. 8 Run time analysis using Dataset 2. X-axis shows different k values (top-k) and Y-axis shows the run time (in seconds) for different approximations where b=5 and c=5
Fig. 9 Accuracy of Protocol 1 and Protocol 2 [1] using Dataset 1. X-axis shows different k values (top-k) and Y-axis shows the accuracy for both protocols
Fig. 10 Accuracy of Protocol 1 and Protocol 2 [1] using Dataset 2
Chronological development of privacy-preserving genomic data similarity methods
| Authors | Year | Data ( | Time (s) | Principal method |
|---|---|---|---|---|
| Jha et al. [ | 2008 | 25×25 | <40 | Smith-Waterman |
| Wang et al. [ | 2009 | 400×400 | 28.5 | Custom protocols |
| Wang et al. [ | 2015 | 2000×9000 | 2800 | Private set difference with a reference sequence |
| Cheon et al. [ | 2015 | 8×8 | 16.4 | Homomorphic encryption |
| Shimizu et al. [ | 2016 | 2184 genomes | 4-10 | Burrows-Wheeler transform |