SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size.

Eldar Giladi, Michael G Walker, James Z Wang, Wayne Volkmuth.

Abstract

MOTIVATION: Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. The time and cost of performing these large-scale sequence-similarity searches are prohibitive, even with the fastest existing algorithms. Faster algorithms are desired.
RESULTS: We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length called 'windows' using multiple offsets. Each window is mapped into a vector of dimension 4^k that contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4-6. Then we create a tree-structured index of the windows in vector space, using tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. When the tree is balanced, this yields O(log n) complexity for the search. This complexity was observed in our computations. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequences or matching ESTs to genomic sequence. The algorithm is also an effective filtration method. Specifically, it can be used as a preprocessing step for other search methods to reduce the complexity of searching one large database against another. For the problem of identifying overlapping fragments in the assembly of 120 000 fragments from a 1.5 megabase genomic sequence, SST is 15 times faster than BLAST when we consider both building and searching the tree. For searching alone (i.e. after building the tree index), SST is 27 times faster than BLAST.
AVAILABILITY: Request from the authors.
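
The pipeline described in the abstract (windows, k-tuple frequency vectors of dimension 4^k, a tree-structured index, greedy descent for a query) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is available only on request. The function names, the `step` parameter standing in for the multiple offsets, the 2-means splitting rule, the leaf size, and the squared Euclidean distance are all assumptions chosen for illustration.

```python
import random

ALPHABET = "ACGT"

def windows(seq, length, step):
    """Yield (start, fragment) for fixed-length windows taken every `step` bases."""
    for start in range(0, len(seq) - length + 1, step):
        yield start, seq[start:start + length]

def ktuple_vector(window, k):
    """Count each k-tuple in the window; the result has 4**k entries."""
    vec = [0] * (4 ** k)
    for i in range(len(window) - k + 1):
        code = 0
        for base in window[i:i + k]:
            pos = ALPHABET.find(base)
            if pos < 0:          # skip k-tuples with ambiguous bases such as N
                break
            code = code * 4 + pos
        else:
            vec[code] += 1
    return vec

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(items):
    return [sum(col) / len(items) for col in zip(*(v for _, v in items))]

def split(items, c0, c1):
    left = [it for it in items if sq_dist(it[1], c0) <= sq_dist(it[1], c1)]
    right = [it for it in items if sq_dist(it[1], c0) > sq_dist(it[1], c1)]
    return left, right

def build_tree(items, leaf_size=4, iters=5):
    """Recursively split (label, vector) items with 2-means (assumed splitting rule)."""
    if len(items) <= leaf_size:
        return {"leaf": items}
    c0 = items[0][1]
    c1 = max(items, key=lambda it: sq_dist(it[1], c0))[1]
    for _ in range(iters):
        left, right = split(items, c0, c1)
        if not left or not right:
            return {"leaf": items}
        c0, c1 = mean(left), mean(right)
    # Final assignment uses the stored centroids, so search follows the same rule.
    left, right = split(items, c0, c1)
    if not left or not right:
        return {"leaf": items}
    return {"c": (c0, c1),
            "kids": (build_tree(left, leaf_size, iters),
                     build_tree(right, leaf_size, iters))}

def search(tree, vec):
    """Greedy descent to one leaf, then a linear scan of that leaf."""
    while "leaf" not in tree:
        c0, c1 = tree["c"]
        tree = tree["kids"][0 if sq_dist(vec, c0) <= sq_dist(vec, c1) else 1]
    return min(tree["leaf"], key=lambda it: sq_dist(it[1], vec))

# Toy usage: index overlapping windows of a random sequence, then look one up.
random.seed(0)
genome = "".join(random.choice(ALPHABET) for _ in range(2000))
items = [(start, ktuple_vector(w, 4)) for start, w in windows(genome, 100, 25)]
tree = build_tree(items)
query = genome[500:600]
hit = search(tree, ktuple_vector(query, 4))
print("best window starts at", hit[0])
```

Because the search follows a single root-to-leaf path, its cost grows with the tree depth, which is logarithmic in the number of windows when the tree is balanced. The greedy descent can miss the true nearest neighbor of a query that differs from every indexed window, which is consistent with the abstract's remark that the method suits highly similar targets and works as a filtration step.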

Year: 2002 PMID: 12075023 DOI: 10.1093/bioinformatics/18.6.873

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Cited

6 in total

1. Sequence alignment by cross-correlation.

Authors: Alan L Rockwood; David K Crockett; James R Oliphant; Kojo S J Elenitoba-Johnson
Journal: J Biomol Tech Date: 2005-12

2. Geometric aspects of biological sequence comparison.

Authors: Aleksandar Stojmirović; Yi-Kuo Yu
Journal: J Comput Biol Date: 2009-04 Impact factor: 1.479

3. Towards computational improvement of DNA database indexing and short DNA query searching.

Authors: Done Stojanov; Sašo Koceski; Aleksandra Mileva; Nataša Koceska; Cveta Martinovska Bande
Journal: Biotechnol Biotechnol Equip Date: 2014-10-31 Impact factor: 1.632

4. Modeling of endothelial cell dysfunction using human induced pluripotent stem cells derived from patients with end-stage renal disease.

Authors: Kyoung Woon Kim; Yoo Jin Shin; Bo-Mi Kim; Sheng Cui; Eun Jeong Ko; Sun Woo Lim; Chul Woo Yang; Byung Ha Chung
Journal: Kidney Res Clin Pract Date: 2021-10-20

5. Acceleration of sequence clustering using longest common subsequence filtering.

Authors: Youhei Namiki; Takashi Ishida; Yutaka Akiyama
Journal: BMC Bioinformatics Date: 2013-05-09 Impact factor: 3.169

6. Database indexing for production MegaBLAST searches.

Authors: Aleksandr Morgulis; George Coulouris; Yan Raytselis; Thomas L Madden; Richa Agarwala; Alejandro A Schäffer
Journal: Bioinformatics Date: 2008-06-21 Impact factor: 6.937
