Abhilash Srikantha, Ajit S Bopardikar, Kalyan Kumar Kaipa, Parthasarathy Venkataraman, Kyusang Lee, TaeJin Ahn, Rangavittal Narayanan.
Abstract
MOTIVATION: Exact sequence search allows a user to search for a specific DNA subsequence in a larger DNA sequence or database. It is a vital building block in many areas such as pharmacogenetics, phylogenetics and personal genomics. As sequencing of genomic data becomes increasingly affordable, the amount of sequence data that must be processed will also increase exponentially. In this context, fast sequence search algorithms will play an important role in exploiting the information contained in newly sequenced data. Many existing algorithms do not scale well to large sequences or databases because of their high computational costs. This article describes an efficient algorithm for performing fast searches on large DNA sequences. It uses hash tables of Q-grams, constructed after downsampling the database, to enable efficient search and memory use. The time complexity of pattern search is reduced using beam pruning techniques. Theoretical complexity calculations and performance figures are presented to indicate the potential of the proposed algorithm.
Year: 2010 PMID: 20823301 PMCID: PMC2935425 DOI: 10.1093/bioinformatics/btq364
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Block diagram of the preprocessing stage.
A sample Q-gram hash table
| Key (Q-gram) | Bucket (list of locations) |
|---|---|
| agt | 0, 3, 6 |
| gta | 1, 4, 7 |
| tag | 2, 5 |
| taa | 8 |
| aac | 9 |
| aca | 10 |
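
To make the sample table concrete, here is a minimal Python sketch of how such a Q-gram hash table could be built. The input string `agtagtagtaaca` is not given in the text; it is an assumption, reconstructed as the shortest sequence consistent with the keys and locations listed above for Q = 3. The function name `build_qgram_table` and its `step` parameter are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_qgram_table(sequence, q, step=1):
    """Map each Q-gram to the list of start positions where it occurs.

    `step` is a hypothetical sampling interval: only start positions that
    are multiples of `step` are indexed (step=1 indexes every position).
    """
    table = defaultdict(list)
    for start in range(0, len(sequence) - q + 1, step):
        table[sequence[start:start + q]].append(start)
    return dict(table)


# Assumed example sequence, reconstructed from the sample table (Q = 3).
SEQUENCE = "agtagtagtaaca"
table = build_qgram_table(SEQUENCE, q=3)
for key, locations in table.items():
    print(key, locations)
```

With `step=1` every position is indexed, which matches the sample table. A larger `step` stores fewer locations per key, which is one plausible reading of how downsampling reduces memory use.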
Fig. 2. Block diagram of the pattern-search stage.
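
The following Python sketch illustrates one generic way to run an exact search against a downsampled Q-gram index. It is not the paper's algorithm: beam pruning is not shown, and treating M as the sampling interval of the index is an assumption based on the table captions. All function names are hypothetical. The idea is that any occurrence of a pattern with at least Q + M − 1 characters must cover one sampled position, so trying the M possible offsets into the pattern finds a seed Q-gram that is present in the index; each candidate is then verified against the database.

```python
from collections import defaultdict


def build_sampled_index(database, q, m):
    """Index only the Q-grams that start at multiples of m (assumed scheme)."""
    index = defaultdict(list)
    for start in range(0, len(database) - q + 1, m):
        index[database[start:start + q]].append(start)
    return index


def exact_search(database, index, pattern, q, m):
    """Return sorted start positions of exact occurrences of `pattern`."""
    if len(pattern) < q + m - 1:
        raise ValueError("pattern must have at least q + m - 1 characters")
    hits = set()
    for offset in range(m):
        # One of these m seeds is aligned with a sampled database position.
        seed = pattern[offset:offset + q]
        for location in index.get(seed, ()):
            start = location - offset
            # Verify the full pattern, since a seed match is only a candidate.
            if start >= 0 and database.startswith(pattern, start):
                hits.add(start)
    return sorted(hits)


# Assumed example data: the sequence reconstructed from the sample table.
database = "agtagtagtaaca"
index = build_sampled_index(database, q=3, m=2)
print(exact_search(database, index, "tagta", q=3, m=2))  # prints [2, 5]
```

The verification step plays the role of post-processing the candidate matches. In this sketch a larger `m` shrinks the index but requires longer patterns and more seed lookups per query.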
Comparison of theoretical complexities of various pattern/homology search algorithms
| Search algorithm | Space complexity | Time complexity |
|---|---|---|
| BLAST | ||
| FASTA | ||
| Finite automaton | ||
| Knuth Morris Pratt | ||
| Suffix tree based | ||
| BWA-SW | ||
| SSAHA | ||
a: Methods also yield approximate matches.
bL < < L < < L.
The bold line corresponds to the complexity of the proposed algorithm.
Hash table size for M − Q combinations (MB)
| 117.0 | 118.3 | 120.3 | 129.1 | |
| 55.0 | 55.3 | 57.6 | 65.0 | |
| 35.9 | 36.4 | 38.4 | 45.3 | |
| 26.7 | 27.2 | 29.2 | 35.2 | |
| 21.3 | 21.7 | 23.7 | 28.9 |
Search times for various M − Q combinations (in microseconds)
| 1830 | 277 | 27 | 30 | |
| 288 | 33 | 7 | 3 | |
| 105 | 9 | 1.4 | 0.5 | |
| 108 | 8 | – | – | |
| – | – | – | – |
Number of matches to be post processed versus M
| Number of matches | 1 | 1 | 2 | 330 |