Clustered sequence representation for fast homology search.

Michael Cameron, Yaniv Bernstein, Hugh E Williams.

Abstract

We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared only to the union-sequence representing each cluster; cluster members are reconstructed and aligned only if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new open-source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
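
The two-stage idea in the abstract (score the union-sequence first, reconstruct members only for promising clusters) can be illustrated with a minimal Python sketch. It is not the FSA-BLAST implementation. Everything here is an assumption made for illustration: the `*` wildcard marking positions where members differ, the restriction to equal-length members with substitution-only edits, the toy ungapped match-count `score`, and all function and class names. Because the wildcard matches any residue in this toy scoring, the union-sequence score is never lower than a member's score, so skipping low-scoring clusters loses no hits in this simplified setting.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

WILDCARD = "*"  # hypothetical marker for positions where cluster members differ


@dataclass
class Cluster:
    union: str                        # representative union-sequence
    edits: Dict[str, Dict[int, str]]  # member name -> {position: residue}


def build_cluster(members: Dict[str, str]) -> Cluster:
    """Collapse equal-length, near-identical sequences into union + edits."""
    seqs = list(members.values())
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("this sketch only handles equal-length members")
    # Keep a residue where all members agree, otherwise emit a wildcard.
    union = "".join(
        col[0] if len(set(col)) == 1 else WILDCARD for col in zip(*seqs)
    )
    # Each member stores only its residues at the wildcard positions.
    edits = {
        name: {i: seq[i] for i, u in enumerate(union) if u == WILDCARD}
        for name, seq in members.items()
    }
    return Cluster(union, edits)


def reconstruct(cluster: Cluster, name: str) -> str:
    """Rebuild one member by applying its edits to the union-sequence."""
    chars = list(cluster.union)
    for pos, residue in cluster.edits[name].items():
        chars[pos] = residue
    return "".join(chars)


def score(query: str, subject: str) -> int:
    """Toy ungapped score: best number of matching positions over all offsets.

    A wildcard in the subject matches any query residue.
    """
    best = 0
    for offset in range(-len(query) + 1, len(subject)):
        matches = 0
        for i, q in enumerate(query):
            j = offset + i
            if 0 <= j < len(subject) and subject[j] in (q, WILDCARD):
                matches += 1
        best = max(best, matches)
    return best


def search(query: str, clusters: List[Cluster], threshold: int) -> List[Tuple[str, int]]:
    """Score union-sequences first; expand only clusters that pass the threshold."""
    hits = []
    for cluster in clusters:
        if score(query, cluster.union) < threshold:
            continue  # whole cluster skipped without reconstructing members
        for name in cluster.edits:
            member_score = score(query, reconstruct(cluster, name))
            if member_score >= threshold:
                hits.append((name, member_score))
    return sorted(hits, key=lambda hit: -hit[1])
```

For example, clustering the made-up sequences `MKTAYIAKQR`, `MKTAYIGKQR` and `MKTAYIAKQK` yields the union-sequence `MKTAYI*KQ*`, and each member is stored as two single-residue edits. A real system would also need insertions, deletions, and a proper alignment score, none of which this sketch attempts.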

Year:  2007        PMID: 17683263     DOI: 10.1089/cmb.2007.R005

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  9 in total

1.  TBC: a clustering algorithm based on prokaryotic taxonomy.

Authors:  Jae-Hak Lee; Hana Yi; Yoon-Seong Jeon; Sungho Won; Jongsik Chun
Journal:  J Microbiol       Date:  2012-04-27       Impact factor: 3.422

2.  Mining the NCBI Influenza Sequence Database: adaptive grouping of BLAST results using precalculated neighbor indexing.

Authors:  Leonid Zaslavsky; Tatiana Tatusova
Journal:  PLoS Curr       Date:  2009-10-30

3.  PSimScan: algorithm and utility for fast protein similarity search.

Authors:  Anna Kaznadzey; Natalia Alexandrova; Vladimir Novichkov; Denis Kaznadzey
Journal:  PLoS One       Date:  2013-03-07       Impact factor: 3.240

4.  UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.

Authors:  Baris E Suzek; Yuqi Wang; Hongzhan Huang; Peter B McGarvey; Cathy H Wu
Journal:  Bioinformatics       Date:  2014-11-13       Impact factor: 6.937

5.  Minimizing proteome redundancy in the UniProt Knowledgebase.

Authors:  Borisas Bursteinas; Ramona Britto; Benoit Bely; Andrea Auchincloss; Catherine Rivoire; Nicole Redaschi; Claire O'Donovan; Maria Jesus Martin
Journal:  Database (Oxford)       Date:  2016-12-26       Impact factor: 3.451

6.  Clustering analysis of proteins from microbial genomes at multiple levels of resolution.

Authors:  Leonid Zaslavsky; Stacy Ciufo; Boris Fedorov; Tatiana Tatusova
Journal:  BMC Bioinformatics       Date:  2016-08-31       Impact factor: 3.169

7.  Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

Authors:  Qingyu Chen; Justin Zobel; Karin Verspoor
Journal:  Database (Oxford)       Date:  2017-01-10       Impact factor: 3.451

8.  Compressive genomics for protein databases.

Authors:  Noah M Daniels; Andrew Gallant; Jian Peng; Lenore J Cowen; Michael Baym; Bonnie Berger
Journal:  Bioinformatics       Date:  2013-07-01       Impact factor: 6.937

9.  Large-scale analysis of NBS domain-encoding resistance gene analogs in Triticeae.

Authors:  Dhia Bouktila; Yosra Khalfallah; Yosra Habachi-Houimli; Maha Mezghani-Khemakhem; Mohamed Makni; Hanem Makni
Journal:  Genet Mol Biol       Date:  2014-09       Impact factor: 1.771

