Thomas Meinel, Antje Krause, Hannes Luz, Martin Vingron, Eike Staub.
Abstract
The SYSTERS project aims to provide a meaningful partitioning of the whole protein sequence space by a fully automatic procedure. A refined two-step algorithm assigns each protein to a family and a superfamily. The sequence data underlying SYSTERS release 4 now comprise several protein sequence databases derived from completely sequenced genomes (ENSEMBL, TAIR, SGD and GeneDB), in addition to the comprehensive Swiss-Prot/TrEMBL databases. The SYSTERS web server (http://systers.molgen.mpg.de) provides access to 158 153 SYSTERS protein families. To augment the automatically derived results, information from external databases such as Pfam and Gene Ontology is added to the web server. Furthermore, users can retrieve pre-processed analyses of families, such as multiple alignments and phylogenetic trees. New query options comprise a batch retrieval tool for functional inference about families based on automatic keyword extraction from sequence annotations. A new access point, PhyloMatrix, allows the retrieval of phylogenetic profiles of SYSTERS families across organisms with completely sequenced genomes.
Year: 2005 PMID: 15608183 PMCID: PMC539984 DOI: 10.1093/nar/gki030
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Information flow in SYSTERS. Left-hand side: publicly accessible protein sequence resources that serve as input to SYSTERS. The four information levels are arranged in rows: top row, possible queries to the web server; second row, the SYSTERS database; third row, output features; bottom row, analysis options. In black: new in SYSTERS release 4.
Characteristic data of the SYSTERS all-against-all search and clustering procedure: numbers of sequences obtained from the source databases and used in the two pre-processing steps.
| Number of sequences (input) | Sequence quality; usage in SYSTERS procedure |
|---|---|
| 1 168 498 | Redundant sequences from all source databases (for details see text) |
| −139 843 | Duplicated sequences: 100% identity, full length of both sequences |
| −59 076 | Included sequences: 100% identity, full length of shorter sequence |
| 969 579 | Non-redundant sequence set, used in Smith–Waterman all-against-all searches |
| −423 041 | Fragmental sequences: ≥80% identity and ≥80% of length of shorter sequence |
| 546 538 | Non-redundant sequence set, used in clustering procedure |
The non-redundant sequence set results from the subtraction of identical and fragmental sequences.
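The first pre-processing step in the table (removal of duplicated and included sequences) can be illustrated with a minimal Python sketch. It assumes that "100% identity over the full length of the shorter sequence" can be approximated by plain substring containment; the function name and the toy sequences are hypothetical, and this is not the SYSTERS implementation. The fragment filter (≥80% identity over ≥80% of the shorter sequence) needs pairwise alignments and is not covered here.

```python
def remove_duplicates_and_included(seqs):
    """Return (kept, n_duplicates, n_included) for a list of sequence strings.

    Illustrative assumption: an "included" sequence is one that occurs as an
    exact substring of a longer sequence in the set.
    """
    # Drop exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(seqs))
    n_duplicates = len(seqs) - len(unique)

    # Process longest first, so every possible container is examined
    # before the shorter sequences it might contain.
    kept = []
    for s in sorted(unique, key=len, reverse=True):
        if not any(s in longer for longer in kept):
            kept.append(s)
    n_included = len(unique) - len(kept)
    return kept, n_duplicates, n_included


# Toy input: one exact duplicate and one included sequence.
toy = ["MKTAYIAK", "MKTAYIAK", "TAYI", "GGHW"]
kept, n_dup, n_inc = remove_duplicates_and_included(toy)

# Bookkeeping with the counts reported in the table above.
after_step_one = 1_168_498 - 139_843 - 59_076   # non-redundant set for all-against-all search
after_step_two = after_step_one - 423_041       # set used in the clustering procedure
```

The substring scan is quadratic in the number of sequences, which is fine for a toy example but would need an index or hashing scheme at the scale of the table.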
Characteristic data of the SYSTERS all-against-all search and clustering procedure: numbers of SYSTERS superfamilies and families resulting from the two clustering steps.
| Number of superfamilies | Superfamily type | Sequences (non-redundant) | Sequences (redundant) | Number of protein families | Cluster graph type | Sequences (non-redundant) | Sequences (redundant) |
|---|---|---|---|---|---|---|---|
| 37 488 | Multi sequence | 436 230 | 1 030 969 | 35 345 | Perfect | 134 191 | 238 717 |
| 110 308 | Single sequence | 110 308 | 137 529 | 9355 | Nested | 127 036 | 265 357 |
| 147 796 | Superfamilies | 546 538 | 1 168 498 | 3131 | Overlapping | 174 989 | 526 877 |
| | | | | 110 322 | Single | 110 322 | 137 547 |
| | | | | 158 153 | Protein families | 546 538 | 1 168 498 |
Protein families are categorized according to the intra-family relationships between proteins: in perfect clusters all sequences match each other, in nested clusters at least one sequence matches all others, and in overlapping clusters no sequence matches all others.
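The three cluster graph types can be expressed as a simple degree test on the family's match graph. The following is a minimal sketch under stated assumptions: matches are given as undirected pairs of member identifiers, every pair refers to members of the same family, and the function name and identifiers are hypothetical rather than part of SYSTERS.

```python
def cluster_graph_type(members, matches):
    """Classify a family as 'single', 'perfect', 'nested' or 'overlapping'.

    members: iterable of sequence identifiers in the family.
    matches: iterable of (a, b) pairs, meaning sequences a and b match.
    """
    members = list(members)
    if len(members) == 1:
        return "single"

    # Build an undirected adjacency map of the match graph.
    neighbours = {m: set() for m in members}
    for a, b in matches:
        neighbours[a].add(b)
        neighbours[b].add(a)

    n_others = len(members) - 1
    # Number of other family members each sequence matches (self-matches ignored).
    degrees = [len(neighbours[m] - {m}) for m in members]

    if all(d == n_others for d in degrees):
        return "perfect"       # every sequence matches every other
    if any(d == n_others for d in degrees):
        return "nested"        # at least one sequence matches all others
    return "overlapping"       # no sequence matches all others


perfect = cluster_graph_type("abc", [("a", "b"), ("b", "c"), ("a", "c")])
nested = cluster_graph_type("abc", [("a", "b"), ("a", "c")])
overlapping = cluster_graph_type("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
single = cluster_graph_type(["a"], [])
```

Because a perfect cluster also satisfies the nested condition, the test for "perfect" has to come first.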
Figure 2. PhyloMatrix: phylogenetic profiling based on SYSTERS protein families. (i) On the left: PhyloMatrix entry page with several options to access phylogenetic profiles: (a) by browsing, (b) by family selection or (c) by specification of an organism pattern. (ii) On the right: 45 phylogenetic patterns describe 99 protein families comprising ribosomal proteins. For the given example, protein families were pre-selected via the second option (b). The order of organisms in each pattern follows the order of the taxonomic tree on the query page. Selected profiles are sorted according to a hierarchical clustering of all profiles.
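A phylogenetic profile of the kind shown in the figure can be thought of as a presence/absence vector per family over a fixed organism order. The sketch below illustrates this idea and two of the access routes named in the caption (grouping families that share a pattern, and querying by an organism pattern). All function names, family identifiers and the organism list are hypothetical; this is not the PhyloMatrix code, and the hierarchical clustering of profiles is omitted.

```python
from collections import defaultdict


def phylogenetic_profiles(family_organisms, organisms):
    """Map each family to a 0/1 tuple in the fixed order given by `organisms`."""
    profiles = {}
    for family, present in family_organisms.items():
        present = set(present)
        profiles[family] = tuple(int(org in present) for org in organisms)
    return profiles


def group_by_pattern(profiles):
    """Collect the families that share an identical presence/absence pattern."""
    groups = defaultdict(list)
    for family, pattern in profiles.items():
        groups[pattern].append(family)
    return dict(groups)


def families_matching(profiles, query):
    """Select families by organism pattern; None in `query` means 'do not care'."""
    return [
        family
        for family, pattern in profiles.items()
        if all(q is None or q == v for q, v in zip(query, pattern))
    ]


# Hypothetical organism order and family memberships.
organisms = ["H. sapiens", "A. thaliana", "S. cerevisiae"]
family_organisms = {
    "fam1": ["H. sapiens", "A. thaliana", "S. cerevisiae"],
    "fam2": ["H. sapiens", "S. cerevisiae", "A. thaliana"],
    "fam3": ["A. thaliana"],
}

profiles = phylogenetic_profiles(family_organisms, organisms)
groups = group_by_pattern(profiles)
plant_only = families_matching(profiles, (0, 1, 0))
in_human = families_matching(profiles, (1, None, None))
```

Grouping by pattern is what lets a small number of patterns describe a larger number of families, as in the ribosomal protein example of the figure.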