
Tolerating some redundancy significantly speeds up clustering of large protein databases.

Weizhong Li, Lukasz Jaroszewski, Adam Godzik.

Abstract

MOTIVATION: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases such as the NCBI Non-Redundant database (NR) is very time-consuming even with the best currently available algorithms, and is practical only at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in approximately 1 h and at 75% identity in approximately 1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower identity thresholds.
RESULTS: In our previous algorithm (CD-HI), we employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy allows more efficient use of these short-word filters and increases the program's speed 100-fold. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in approximately 5 days. Although some redundancy is present after clustering, the new program's results differ from those of our previous program by less than 0.4%.
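To make the idea of a short-word filter concrete, the following Python sketch shows one way such a filter can sit in front of a greedy clustering loop. It is an illustration under stated assumptions, not the authors' implementation: the function names, the word length k=2, the integer percentage threshold, and the ungapped identity check (a stand-in for a real alignment) are all hypothetical. The counting bound is the standard argument that each mismatch can destroy at most k overlapping words; it is a strict filter that never rejects a pair meeting the threshold under this ungapped model. The abstract does not say how the new program relaxes its filters, so that step is not modeled here.

```python
from collections import Counter


def shared_words(a, b, k):
    """Count length-k words common to a and b (multiset intersection)."""
    words_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    words_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((words_a & words_b).values())


def min_shared_words(length, identity_pct, k):
    """Lower bound on shared words for an ungapped match at identity_pct.

    A sequence of this length has (length - k + 1) words, and each
    mismatch can destroy at most k of them.
    """
    matches_needed = -(-length * identity_pct // 100)  # integer ceiling
    mismatches = length - matches_needed
    return max(0, (length - k + 1) - k * mismatches)


def best_ungapped_matches(short, long_):
    """Stand-in for alignment: best match count over all ungapped offsets."""
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        best = max(best, sum(x == y for x, y in zip(short, window)))
    return best


def cluster(seqs, identity_pct, k=2):
    """Greedy clustering: longest sequences first, keep one representative."""
    representatives = []
    for seq in sorted(seqs, key=len, reverse=True):
        need = min_shared_words(len(seq), identity_pct, k)
        for rep in representatives:
            if shared_words(seq, rep, k) < need:
                continue  # filter says the threshold is unreachable; skip the costly check
            if best_ungapped_matches(seq, rep) * 100 >= identity_pct * len(seq):
                break  # seq is redundant with this representative
        else:
            representatives.append(seq)  # no representative matched; start a new cluster
    return representatives
```

Calling `cluster(sequences, 90)` on a list of protein strings returns the representatives retained at a 90% threshold. The cheap word count rejects most dissimilar pairs before the more expensive comparison is attempted, which is where the speed-up of a filter-based approach comes from.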

Year:  2002        PMID: 11836214     DOI: 10.1093/bioinformatics/18.1.77

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

