Maleeha Najam, Raihan Ur Rasool, Hafiz Farooq Ahmad, Usman Ashraf, Asad Waqar Malik.
Abstract
Storing and processing large DNA sequences has long been a major problem because the volume of DNA sequence data keeps increasing. A number of solutions have been proposed, but they require significant computation and memory. An efficient storage and pattern matching solution is therefore required for DNA sequencing data. Bloom filters (BFs) are an efficient data structure used in bioinformatics mostly for the classification of DNA sequences. In this paper, we explore uses of BFs beyond classification. The proposed solution is based on Multiple Bloom Filters (MBFs) and finds all the locations and the number of repetitions of a specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of a disease. This paper serves as a first effort towards optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. It presents a proof-of-concept implementation for a given set of data using the proposed MBF technique, and we expect that further optimizations can bring remarkable results. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
Year: 2019 PMID: 31111064 PMCID: PMC6487161 DOI: 10.1155/2019/7074387
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Classification of data structures.
| Data Structure | Variants |
|---|---|
| Full-Text Index | Suffix Array |
| Full-Text Index | Suffix Tree |
| Self-Index | Compressed Suffix Array (CSA) |
| Self-Index | Run Length Compressed Suffix Array (RLCSA) |
| Self-Index | Succinct Suffix Array |
| Self-Index | FM-Index |
| Self-Index | Alphabet Friendly FM-Index |
| Self-Index | LZ-Index |
| Word-based Self-Index | Word based Compressed Suffix Array (WCSA) |
| Word-based Self-Index | Word based Succinct Suffix Array (WSSA) |
| Word-based Self-Index | Byte oriented Codes Wavelet Tree (BOC-WT) |
| Probabilistic-Index | Fast and Accurate Classification of Sequences (FACS) |
| Probabilistic-Index | Probabilistic de Bruijn Graph |
| Probabilistic-Index | Bloom filter Alignment-free reference-based Compression and Decompression (BARCIDE) |
| Probabilistic-Index | Sequence Bloom Tree |
Figure 1. Overview of proposed approach.
Figure 2. Formation of words.
Figure 3. Number of occurrences of k-mers in chromosomes.
Key-Value store.
| Key | Value (k-mer positions) |
|---|---|
| ACGTT | 0, 7 |
| CACGT | 6 |
| CCACG | 5 |
| CGTTC | 1, 8 |
| GTTCA | 9 |
| GTTCC | 2 |
| TCCAC | 4 |
| TTCCA | 3 |
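
A key-value store of this shape can be built with a single pass over the sequence. The following is a minimal sketch, not the authors' implementation: the function name `build_kmer_index` and the example sequence are assumptions introduced here, with the sequence chosen so that k = 5 yields the keys and positions listed in the table.

```python
from collections import defaultdict


def build_kmer_index(sequence: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in `sequence` to the list of its 0-based start positions."""
    index: dict[str, list[int]] = defaultdict(list)
    # Slide a window of width k over the sequence; positions are appended
    # in increasing order, so each list is already sorted.
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


# Hypothetical example sequence (an assumption, not taken from the source).
EXAMPLE_SEQUENCE = "ACGTTCCACGTTCA"

kmer_index = build_kmer_index(EXAMPLE_SEQUENCE, k=5)
for kmer in sorted(kmer_index):
    print(kmer, kmer_index[kmer])
```

Storing positions as a list per key means that a repeated k-mer such as ACGTT keeps every location, which is what allows both the locations and the number of repetitions to be read from one lookup.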
Figure 4. Illustration of usage of MBFs.
Impact of k-mer size on Key-Value store.
| K-mer size | Chr 1 KV store size (GB) | Chr 12 KV store size (MB) | Chr 21 KV store size (MB) |
|---|---|---|---|
| 4 | 1.2 | 700 | 195.1 |
| 5 | 1.3 | 756.8 | 209.8 |
| 6 | 1.4 | 816.4 | 226.4 |
Impact of k-mer size on MBFs and construction time (false positive probability of Bloom filters = 0.01).
| K-mer size | Chr 1 MBF size (MB) | Chr 1 construction time (s) | Chr 12 MBF size (MB) | Chr 12 construction time (s) | Chr 21 MBF size (MB) | Chr 21 construction time (s) |
|---|---|---|---|---|---|---|
| 4 | 268 | 25.4 | 153 | 14.68 | 41.5 | 4.21 |
| 5 | 270 | 30.9 | 154.8 | 16.29 | 41.9 | 4.5 |
| 6 | 271.5 | 33.2 | 155.4 | 17.49 | 42.04 | 5.01 |
Impact of false positive probability on MBF size (k-mer size=4).
| Chromosome | Original file size (MB) | MBF size (MB), FP prob = 0.1 | MBF size (MB), FP prob = 0.01 | MBF size (MB), FP prob = 0.001 |
|---|---|---|---|---|
| Chromosome 1 | 230.8 | 134.5 | 268 | 402.2 |
| Chromosome 12 | 131.9 | 76.9 | 153 | 230.6 |
| Chromosome 21 | 35.9 | 20.81 | 41.5 | 62.24 |
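
The way these sizes grow is consistent with the standard sizing rule for a Bloom filter that uses an optimal number of hash functions. As a sketch, assume a filter holds $n$ items and is sized for a target false positive probability $p$. The symbols $m$, $n$, and $p$ are introduced here, and whether the authors sized their filters by this rule is an assumption. The number of bits $m$ then scales with $\ln(1/p)$:

$$
m = -\frac{n \ln p}{(\ln 2)^2},
\qquad
\frac{m(p = 0.01)}{m(p = 0.1)} = \frac{\ln 100}{\ln 10} = 2,
\qquad
\frac{m(p = 0.001)}{m(p = 0.1)} = \frac{\ln 1000}{\ln 10} = 3
$$

For Chromosome 1 the measured ratios are $268 / 134.5 \approx 1.99$ and $402.2 / 134.5 \approx 2.99$, which match this scaling. Each tenfold reduction in false positive probability therefore costs roughly the same fixed amount of additional space.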
Pattern searching time (k-mer size=4; FP probability =0.01).
| Pattern | Length | Chr 1 occurrences | Chr 1 time (s) | Chr 12 occurrences | Chr 12 time (s) | Chr 21 occurrences | Chr 21 time (s) |
|---|---|---|---|---|---|---|---|
| TTTGAT | 6 | 98196 | 137.10 | 59222 | 36.27 | 16169 | 2.31 |
| CATCAT | 6 | 77840 | 83.81 | 46203 | 25.41 | 12202 | 1.85 |
| GTGTCTGT | 8 | 8615 | 88.87 | 5035 | 23.70 | 1471 | 1.50 |
| TGGAATGGGA | 10 | 552 | 116.64 | 286 | 28.70 | 108 | 1.85 |
| TTTTTTAGAAT | 11 | 327 | 81.48 | 208 | 23.14 | 60 | 1.48 |
| GAGGCAGGAGGATCCC | 16 | 82 | 32.14 | 44 | 9.28 | 13 | 0.65 |
| TTTATTGGAAATATGGGAT | 19 | 0 | 34.37 | 1 | 9.93 | 0 | 0.66 |
| AGCATATTTTTACTGTAGGAGAA | 23 | 0 | 34.08 | 1 | 9.97 | 0 | 0.66 |
Number of false positives for FP probability=0.1 and 0.01.
| Pattern | Length | Chr 1, FP prob = 0.1 | Chr 1, FP prob = 0.01 | Chr 12, FP prob = 0.1 | Chr 12, FP prob = 0.01 | Chr 21, FP prob = 0.1 | Chr 21, FP prob = 0.01 |
|---|---|---|---|---|---|---|---|
| TTTGAT | 6 | 39260 | 9172 | 23743 | 5666 | 6438 | 1549 |
| CATCAT | 6 | 39567 | 9906 | 23097 | 6007 | 6453 | 1620 |
| GTGTCTGT | 8 | 7309 | 1251 | 3982 | 756 | 1207 | 206 |
| TGGAATGGGA | 10 | 636 | 104 | 339 | 49 | 55 | 14 |
| TTTTTTAGAAT | 11 | 431 | 75 | 261 | 50 | 69 | 9 |
| GAGGCAGGAGGATCCC | 16 | 36 | 18 | 29 | 10 | 3 | 0 |
| TTTATTGGAAATATGGGAT | 19 | 0 | 0 | 1 | 1 | 0 | 0 |
| AGCATATTTTTACTGTAGGAGAA | 23 | 0 | 0 | 1 | 1 | 0 | 0 |
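
False positives of this kind are inherent to Bloom filters: a query can report an item as present even though it was never inserted, while an inserted item is never missed. The sketch below illustrates that behaviour with a single generic Bloom filter. It is not the authors' MBF construction; the class `BloomFilter`, its parameters, the SHA-256 double-hashing scheme, and the synthetic 6-mer test set are all assumptions made for illustration.

```python
import hashlib
import math
from itertools import product


class BloomFilter:
    """Minimal Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, expected_items: int, fp_probability: float) -> None:
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(fp_probability) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _bit_positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")  # second hash used as probe stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._bit_positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._bit_positions(item))


# Synthetic test set: insert every fourth 6-mer, then query the ones left out.
all_6mers = ["".join(p) for p in product("ACGT", repeat=6)]
members = all_6mers[::4]
absent = [kmer for i, kmer in enumerate(all_6mers) if i % 4 != 0]

bf = BloomFilter(expected_items=len(members), fp_probability=0.01)
for kmer in members:
    bf.add(kmer)

false_positives = sum(kmer in bf for kmer in absent)
print(f"false positives: {false_positives} of {len(absent)} absent k-mers")
```

Every inserted k-mer is reported as present, so any error is a spurious hit on an absent k-mer. Lowering `fp_probability` reduces such hits at the cost of a larger bit array, the same trade-off the tables report for FP probabilities of 0.1 and 0.01.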
Pattern searching time in seeq.
| Pattern | Length | Chr 1 occurrences | Chr 1 time (s) | Chr 12 occurrences | Chr 12 time (s) | Chr 21 occurrences | Chr 21 time (s) |
|---|---|---|---|---|---|---|---|
| TTTGAT | 6 | 98196 | 1.624 | 59222 | 0.932 | 16169 | 0.258 |
| CATCAT | 6 | 77840 | 1.606 | 46203 | 0.902 | 12202 | 0.275 |
| GTGTCTGT | 8 | 8615 | 1.49 | 5035 | 0.856 | 1471 | 0.261 |
| TGGAATGGGA | 10 | 552 | 1.341 | 286 | 0.807 | 108 | 0.22 |
| TTTTTTAGAAT | 11 | 327 | 1.36 | 208 | 0.79 | 60 | 0.215 |
| GAGGCAGGAGGATCCC | 16 | 82 | 1.224 | 44 | 0.721 | 13 | 0.204 |