Clustered sequence representation for fast homology search.

Michael Cameron, Yaniv Bernstein, Hugh E Williams.

Abstract

We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
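The cluster representation and two-stage search described in the abstract can be illustrated with a minimal sketch. It is not the FSA-BLAST implementation. It assumes equal-length cluster members that differ only by substitutions, uses a hypothetical `*` wildcard for positions where members disagree, and replaces BLAST alignment with a toy identity count. All names (`Cluster`, `build_cluster`, `reconstruct`, `toy_score`, `search`) are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass

WILDCARD = "*"  # hypothetical marker for positions where cluster members disagree


@dataclass
class Cluster:
    union: str                   # representative union-sequence
    edits: list[dict[int, str]]  # per member: position -> actual residue


def build_cluster(members: list[str]) -> Cluster:
    """Collapse equal-length, near-identical sequences into a union plus edits."""
    length = len(members[0])
    if any(len(m) != length for m in members):
        raise ValueError("this sketch only handles equal-length members")
    # A column where all members agree keeps its residue; otherwise a wildcard.
    union = "".join(
        column[0] if len(set(column)) == 1 else WILDCARD
        for column in zip(*members)
    )
    # Each member stores only the residues hidden behind wildcards.
    edits = [
        {i: m[i] for i, u in enumerate(union) if u == WILDCARD}
        for m in members
    ]
    return Cluster(union, edits)


def reconstruct(cluster: Cluster, member_index: int) -> str:
    """Rebuild one member by applying its edits to the union-sequence."""
    chars = list(cluster.union)
    for pos, residue in cluster.edits[member_index].items():
        chars[pos] = residue
    return "".join(chars)


def toy_score(query: str, target: str) -> int:
    """Toy stand-in for alignment: count matching positions, wildcards match."""
    return sum(1 for q, t in zip(query, target) if t == WILDCARD or q == t)


def search(query: str, clusters: list[Cluster], threshold: int):
    """Score union-sequences first; expand members only for promising clusters."""
    hits = []
    for c_idx, cluster in enumerate(clusters):
        if toy_score(query, cluster.union) < threshold:
            continue  # whole cluster skipped without reconstructing members
        for m_idx in range(len(cluster.edits)):
            score = toy_score(query, reconstruct(cluster, m_idx))
            if score >= threshold:
                hits.append((c_idx, m_idx, score))
    return hits


clusters = [
    build_cluster(["MKTAYIAK", "MKTAYIAR", "MKSAYIAK"]),
    build_cluster(["GGGGGGGG", "GGGGGGGA"]),
]
hits = search("MKTAYIAK", clusters, threshold=8)
```

Because a wildcard counts as a match in this toy scoring, the union-sequence score is never lower than the score of any member, so skipping a low-scoring cluster cannot drop a member that would have passed the threshold. How wildcards are actually scored in the published method is not stated in the abstract.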

Year: 2007 PMID: 17683263 DOI: 10.1089/cmb.2007.R005

Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479

Cited

9 in total

1. TBC: a clustering algorithm based on prokaryotic taxonomy.

Authors: Jae-Hak Lee; Hana Yi; Yoon-Seong Jeon; Sungho Won; Jongsik Chun
Journal: J Microbiol Date: 2012-04-27 Impact factor: 3.422

2. Mining the NCBI Influenza Sequence Database: adaptive grouping of BLAST results using precalculated neighbor indexing.

Authors: Leonid Zaslavsky; Tatiana Tatusova
Journal: PLoS Curr Date: 2009-10-30

3. PSimScan: algorithm and utility for fast protein similarity search.

4. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.

5. Minimizing proteome redundancy in the UniProt Knowledgebase.

6. Clustering analysis of proteins from microbial genomes at multiple levels of resolution.

7. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

8. Compressive genomics for protein databases.

9. Large-scale analysis of NBS domain-encoding resistance gene analogs in Triticeae.