RSDB: representative protein sequence databases have high information content.

Abstract

MOTIVATION: Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant entries, with many identical and nearly identical sequences, and (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide the same amount of biological information as the original full database?
RESULTS: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to their size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size, which made iterative profile searching six times faster. The RSDBs are produced at different granularities for efficient homology searching.
AVAILABILITY: All the RSDB files generated and the full analysis results are available over the internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB
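
The idea of reducing a database to a set of sequences with bounded mutual identity can be illustrated with a small sketch. The abstract does not describe how the RSDBs were actually built, so everything below is an assumption made for illustration: the greedy longest-first selection, the function names `identity` and `representatives`, the toy sequences, and in particular the use of Python's `difflib.SequenceMatcher` ratio as a crude stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity score between 0.0 and 1.0.

    Assumption: a real pipeline would compute identity from a pairwise
    alignment; difflib's matching-block ratio is used here only as a
    dependency-free placeholder.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def representatives(seqs: dict[str, str], max_identity: float = 0.5) -> list[str]:
    """Greedily pick IDs so that no two picked sequences exceed max_identity."""
    kept: list[str] = []
    # Visit longest sequences first (ties broken by ID) so that the longest
    # member of each redundant group becomes its representative.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        if all(identity(seqs[name], seqs[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: a1 and a2 differ in one residue, b1 is unrelated.
toy = {
    "a1": "MKTAYIAKQRQISFVKSHFSRQ",
    "a2": "MKTAYIAKQRQISFVKSHFSRA",
    "b1": "G" * 10 + "W" * 10,
}
print(representatives(toy))
```

With `max_identity=0.5` the near-duplicate `a2` is dropped and one sequence per group remains, which mirrors the RSDB50 idea at toy scale. The all-against-kept comparison is quadratic in the worst case, so this sketch is not suitable for databases of realistic size.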

Substances:
Proteins

Year: 2000 PMID: 10871268 DOI: 10.1093/bioinformatics/16.5.458

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Cited by: 12 in total

1. Uniclust databases of clustered and deeply annotated protein sequences and alignments.

Authors: Milot Mirdita; Lars von den Driesch; Clovis Galiez; Maria J Martin; Johannes Söding; Martin Steinegger
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

2. Intrinsic disorder in transcription factors.

Authors: Jiangang Liu; Narayanan B Perumal; Christopher J Oldfield; Eric W Su; Vladimir N Uversky; A Keith Dunker
Journal: Biochemistry Date: 2006-06-06 Impact factor: 3.162

3. The Pfam protein families database.

Authors: Robert D Finn; Jaina Mistry; John Tate; Penny Coggill; Andreas Heger; Joanne E Pollington; O Luke Gavin; Prasad Gunasekaran; Goran Ceric; Kristoffer Forslund; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman
Journal: Nucleic Acids Res Date: 2009-11-17 Impact factor: 16.971

4. Ultrafast clustering algorithms for metagenomic sequence analysis.

Authors: Weizhong Li; Limin Fu; Beifang Niu; Sitao Wu; John Wooley
Journal: Brief Bioinform Date: 2012-07-06 Impact factor: 11.622

5. Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction.

Authors: Eric D Scheeff; Philip E Bourne
Journal: BMC Bioinformatics Date: 2006-09-14 Impact factor: 3.169

6. Localizome: a server for identifying transmembrane topologies and TM helices of eukaryotic proteins utilizing domain information.

Authors: Sunghoon Lee; Byungwook Lee; Insoo Jang; Sangsoo Kim; Jong Bhak
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

7. ADDA: a domain database with global coverage of the protein universe.

8. SANS: high-throughput retrieval of protein sequences allowing 50% mismatches.

9. Probing metagenomics by rapid cluster analysis of very large datasets.

10. PairsDB atlas of protein sequence space.