Removing near-neighbour redundancy from large protein sequence collections.

Abstract

MOTIVATION: To maximize the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of first-hand annotation.
RESULTS: These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, Fasta or Blast, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily world-wide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy.
AVAILABILITY: A regularly updated non-redundant protein sequence database (nrdb90), a server for homology searches against nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are available for academic use from http://www.embl-ebi.ac.uk/holm/nrdb90.
CONTACT: holm@embl-ebi.ac.uk
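
The filtering idea in RESULTS can be illustrated with a minimal Python sketch. It is not the nrdb90.pl implementation, whose details the abstract does not give. It assumes a greedy, longest-first selection of representatives, measures identity over the best ungapped placement of the shorter sequence inside the longer one, and uses a simple counting bound as the composition filter: a single mismatch can destroy at most k overlapping k-mers, so a pair that shares too few deca- or pentapeptides cannot exceed the identity threshold and is skipped without comparing residues. All function names are hypothetical.

```python
import math
from collections import Counter

FILTER_KS = (10, 5)  # decapeptide filter first, then pentapeptide (assumed order)


def kmer_counts(seq, k):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_matches(length, identity):
    """Smallest number of identical positions that is strictly above
    `identity` for an alignment covering `length` residues."""
    return math.floor(identity * length + 1e-9) + 1


def passes_filters(short_counts, long_counts, length, needed):
    """Cheap pre-screen: with at most (length - needed) mismatches, at least
    (length - k + 1) - k * mismatches k-mers must be shared."""
    mismatches = length - needed
    for k in FILTER_KS:
        bound = (length - k + 1) - k * mismatches
        shared = sum((short_counts[k] & long_counts[k]).values())
        if shared < bound:
            return False
    return True


def best_ungapped_matches(short, long):
    """Most identical positions over all ungapped placements of short in long
    (a simplification: no gaps are considered)."""
    best = 0
    for offset in range(len(long) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long[offset:]))
        best = max(best, matches)
    return best


def select_representatives(seqs, identity=0.9):
    """Greedy selection: longest sequences first, so fragments and close
    variants are absorbed by an already chosen representative."""
    reps = []  # list of (sequence, {k: k-mer Counter})
    for seq in sorted(set(seqs), key=lambda s: (-len(s), s)):
        counts = {k: kmer_counts(seq, k) for k in FILTER_KS}
        needed = min_matches(len(seq), identity)
        redundant = any(
            passes_filters(counts, rep_counts, len(seq), needed)
            and best_ungapped_matches(seq, rep) >= needed
            for rep, rep_counts in reps
        )
        if not redundant:
            reps.append((seq, counts))
    return [rep for rep, _ in reps]
```

Calling `select_representatives` on a list of protein strings returns the retained representatives, longest first. The filter only prunes pairs that cannot pass the ungapped identity test, so under these assumptions it changes the running time and not the result.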

Year: 1998 PMID: 9682055 DOI: 10.1093/bioinformatics/14.5.423

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Cited by

82 in total

1. The identification of conserved interactions within the SH3 domain by alignment of sequences and structures.

Authors: S M Larson; A R Davidson
Journal: Protein Sci Date: 2000-11 Impact factor: 6.725

2. Identification of thermophilic species by the amino acid compositions deduced from their genomes.

Authors: D P Kreil; C A Ouzounis
Journal: Nucleic Acids Res Date: 2001-04-01 Impact factor: 16.971

3. LiveBench-1: continuous benchmarking of protein structure prediction servers.

Authors: J M Bujnicki; A Elofsson; D Fischer; L Rychlewski
Journal: Protein Sci Date: 2001-02 Impact factor: 6.725

4. Environment-dependent residue contact energies for proteins.

Authors: C Zhang; S H Kim
Journal: Proc Natl Acad Sci U S A Date: 2000-03-14 Impact factor: 11.205

5. Proteins of the endoplasmic-reticulum-associated degradation pathway: domain detection and function prediction.

Authors: C P Ponting
Journal: Biochem J Date: 2000-10-15 Impact factor: 3.857

6. Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes.

Authors: I Yanai; A Derti; C DeLisi
Journal: Proc Natl Acad Sci U S A Date: 2001-07-03 Impact factor: 11.205

7. Proteomics of Mycoplasma genitalium: identification and characterization of unannotated and atypical proteins in a small model genome.

Authors: S Balasubramanian; T Schneider; M Gerstein; L Regan
Journal: Nucleic Acids Res Date: 2000-08-15 Impact factor: 16.971

8. A comparison of profile hidden Markov model procedures for remote homology detection.

Authors: Martin Madera; Julian Gough
Journal: Nucleic Acids Res Date: 2002-10-01 Impact factor: 16.971

9. Pauling and Corey's alpha-pleated sheet structure may define the prefibrillar amyloidogenic intermediate in amyloid disease.

Authors: Roger S Armen; Mari L DeMarco; Darwin O V Alonso; Valerie Daggett
Journal: Proc Natl Acad Sci U S A Date: 2004-07-27 Impact factor: 11.205

10. Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate.

Authors: Hyrum D Carroll; Alex C Williams; Anthony G Davis; John L Spouge
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2015 May-Jun Impact factor: 3.710