
The SYSTERS Protein Family Database in 2005.

Thomas Meinel, Antje Krause, Hannes Luz, Martin Vingron, Eike Staub.

Abstract

The SYSTERS project aims to provide a meaningful partitioning of the whole protein sequence space by a fully automatic procedure. A refined two-step algorithm assigns each protein to a family and a superfamily. The sequence data underlying SYSTERS release 4 now comprise several protein sequence databases derived from completely sequenced genomes (ENSEMBL, TAIR, SGD and GeneDB), in addition to the comprehensive Swiss-Prot/TrEMBL databases. The SYSTERS web server (http://systers.molgen.mpg.de) provides access to 158 153 SYSTERS protein families. To augment the automatically derived results, information from external databases such as Pfam and Gene Ontology is added to the web server. Furthermore, users can retrieve pre-processed analyses of families, such as multiple alignments and phylogenetic trees. New query options comprise a batch retrieval tool for functional inference about families based on automatic keyword extraction from sequence annotations. A new access point, PhyloMatrix, allows the retrieval of phylogenetic profiles of SYSTERS families across organisms with completely sequenced genomes.


Year: 2005 PMID: 15608183 PMCID: PMC539984 DOI: 10.1093/nar/gki030

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The principal goal of the SYSTERS project is to automatically partition all the available protein space. Because the fully automated classification scheme does not rely on interventions and updates by experts, the SYSTERS approach is complementary to expert-curated protein domain or protein family classification schemes like Pfam (1), SMART (2) or PROSITE (3). The SYSTERS database is derived from rigorous all-against-all Smith–Waterman searches (4). The resulting pairwise sequence similarities are used in a refined two-step clustering approach that assigns each protein to a family and a superfamily (A. Krause, J. Stoye and M. Vingron, submitted for publication). The SYSTERS web resource comprises a multitude of query access points, data retrieval options, pre-processed sequence analyses of individual families and comprehensive views on multiple families (Figure 1). The automatically derived protein families are augmented with expert-curated biological information from various resources. For the functional characterization of each cluster, keywords are extracted from annotations of source sequence databases and are assigned to each family. In SYSTERS release 4, Pfam domain assignments to sequences of Swiss-Prot/TrEMBL (5) help to visualize the domain architecture of a protein and to identify differences in domain composition within a protein family. A special focus of SYSTERS is to support phylogenetic studies of protein families. Sequences of SYSTERS families can be selected and downloaded in multiple ways. The users are offered pre-calculated multiple alignments and phylogenetic trees that can serve as a starting point for their own focused analyses.

Figure 1

Information flow in SYSTERS. Left-hand side: publicly accessible protein sequence resources: input to SYSTERS. Four information levels in rows: top row, possible queries to the web server; second row, the SYSTERS database; third row, output features; and bottom row: analysis options. In black: new in SYSTERS release 4.

In this paper, we will describe the differences of SYSTERS release 4 compared to previous releases and highlight the recent developments of tools to access and view information on SYSTERS protein families and superfamilies.

INPUT DATA AND CLUSTERING RESULTS OF SYSTERS RELEASE 4

The underlying protein data for SYSTERS release 4 comprise more than 1.1 million sequences. The Swiss-Prot/TrEMBL database content was extended by several protein data sources with information from completely sequenced genomes (Figure 1): Saccharomyces cerevisiae (6), Schizosaccharomyces pombe (7), Arabidopsis thaliana (8), Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Caenorhabditis briggsae, Takifugu rubripes, Mus musculus and Homo sapiens (9). After removal of redundant sequences, the results of more than 10^11 pairwise Smith–Waterman comparisons were fed into the clustering procedure (Table 1).
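The order of magnitude of the comparison count can be checked from the size of the non-redundant set in Table 1 (969 579 sequences). The sketch below assumes that each unordered pair of distinct sequences is compared once; the source does not say whether pairs were counted once or in both directions, and the function name is illustrative.

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Size of the non-redundant set used in the all-against-all search (Table 1).
N_SEARCH_SET = 969_579

pairs = unordered_pairs(N_SEARCH_SET)
print(f"{pairs:.2e} pairwise comparisons")
```

Under this assumption the count is roughly 4.7 × 10^11, consistent with "more than 10^11"; counting ordered pairs would double it without changing the order of magnitude.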

Table 1.

Characteristic data of the SYSTERS all-against-all search and clustering procedure: numbers of sequences obtained from the source databases and used in the two pre-processing steps

Input
Numbers of sequences	Sequence quality; usage in SYSTERS procedure
1 168 498	Redundant sequences from all source databases (for details see text)
−139 843	Duplicated sequences: 100% identity, full length of both sequences
−59 076	Included sequences: 100% identity, full length of shorter sequence
969 579	Non-redundant sequence set, used in Smith–Waterman all-against-all searches
−423 041	Fragmental sequences: ≥80% identity and ≥80% of length of shorter sequence
546 538	Non-redundant sequence set, used in clustering procedure

The non-redundant sequence set results from the subtraction of identical and fragmental sequences.
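The first two subtraction steps in Table 1 (duplicated and included sequences) can be illustrated with a small sketch. It assumes that "100% identity over the full length of the shorter sequence" means the shorter sequence occurs as an exact substring of a longer one; the function name and the toy sequences are hypothetical, and the quadratic scan is only suitable for small inputs. The third step (fragments with at least 80% identity) requires alignment scores and is not covered here.

```python
def remove_redundant(seqs: dict[str, str]) -> dict[str, str]:
    """Drop exact duplicates and sequences fully contained in a longer one."""
    # Step 1: collapse identical sequences, keeping the first identifier seen.
    first_id: dict[str, str] = {}
    for seq_id, seq in seqs.items():
        first_id.setdefault(seq, seq_id)

    # Step 2: walk from longest to shortest and drop any sequence that is
    # an exact substring of a sequence already kept.
    kept: list[str] = []
    for seq in sorted(first_id, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)

    return {first_id[seq]: seq for seq in kept}


toy = {"a": "MKTAYIAK", "b": "MKTAYIAK", "c": "TAYI", "d": "GGGG"}
non_redundant = remove_redundant(toy)

# The counts in Table 1 follow the same two subtractions, then a third one.
search_set = 1_168_498 - 139_843 - 59_076
clustering_set = search_set - 423_041
```

In the toy input, "b" duplicates "a" and "c" is contained in "a", so only "a" and "d" remain. The two arithmetic lines reproduce the 969 579 and 546 538 totals given in the table.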

The resulting numbers of SYSTERS superfamilies and protein families are presented in Table 2. Only 11.8% of sequences remained as singletons. The majority (74%) of multi-sequence families are ‘perfect’, meaning that all sequences in a family match with each other. Only 6.5% of the families are classified as ‘overlapping’: these families might harbour protein pairs that do not share homologous regions, but are linked indirectly via an intermediate protein that has distinct homologous regions in common with both. The protein family size is power-law-like distributed (10). There are few families with many sequences and many families with only a few sequences. This result complements earlier findings (11,12) on the mode of protein evolution.
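The percentages quoted above can be reproduced from the counts in Table 2. The choice of denominators is an inference, not something the text states: the family shares appear to be taken over multi-sequence families only, and the singleton share over the redundant sequence set. Variable names are illustrative.

```python
# Family counts by cluster graph type (Table 2).
perfect, nested, overlapping = 35_345, 9_355, 3_131
multi_sequence_families = perfect + nested + overlapping

# Redundant sequences in single-sequence families and in total (Table 2).
singleton_sequences = 137_547
all_sequences = 1_168_498

pct_perfect = 100 * perfect / multi_sequence_families
pct_overlapping = 100 * overlapping / multi_sequence_families
pct_singletons = 100 * singleton_sequences / all_sequences

print(f"perfect: {pct_perfect:.1f}%")
print(f"overlapping: {pct_overlapping:.1f}%")
print(f"singletons: {pct_singletons:.1f}%")
```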

Table 2.

Characteristic data of the SYSTERS all-against-all search and clustering procedure: SYSTERS superfamilies and protein families resulting from the two clustering steps

Output superfamilies (left four columns) and output protein families (right four columns)
Number of superfamilies	Superfamily type	Sequences, non-redundant	Sequences, redundant	Number of protein families	Cluster graph type	Sequences, non-redundant	Sequences, redundant
37 488	Multi sequence	436 230	1 030 969	35 345	Perfect	134 191	238 717
110 308	Single sequence	110 308	137 529	9355	Nested	127 036	265 357
147 796	Superfamilies	546 538	1 168 498	3131	Overlapping	174 989	526 877
				110 322	Single	110 322	137 547
				158 153	Protein families	546 538	1 168 498

Protein families are categorized according to the intra-family relationships between proteins: in perfect clusters all sequences match each other, in nested clusters at least one sequence matches all others, in overlapping clusters there is no sequence matching with all others.
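The three cluster graph types can be expressed as a simple test on the match graph of a family. The following is a minimal sketch of the definitions given above, not the SYSTERS implementation; it assumes that matches are symmetric, that every matched identifier belongs to the family, and the function and argument names are hypothetical.

```python
from typing import Iterable


def classify_family(members: Iterable[str],
                    matches: Iterable[tuple[str, str]]) -> str:
    """Classify a family as 'single', 'perfect', 'nested' or 'overlapping'."""
    members = list(members)
    if len(members) == 1:
        return "single"

    # Build an undirected neighbour map, ignoring self-matches.
    neighbours: dict[str, set[str]] = {m: set() for m in members}
    for a, b in matches:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    others = len(members) - 1
    matches_all = [len(neighbours[m]) == others for m in members]

    if all(matches_all):
        return "perfect"      # every sequence matches every other
    if any(matches_all):
        return "nested"       # at least one sequence matches all others
    return "overlapping"      # no sequence matches all others
```

For example, three proteins that all match each other form a perfect cluster, whereas a chain a-b, b-c, c-d of four proteins is overlapping because no single protein matches the other three.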

NEW FEATURES AND SERVICES

Information characterizing a SYSTERS family

For each protein family, SYSTERS provides a comprehensive overview of its member proteins and their annotations. On the entry page, users have access to more detailed information on protein annotations, sequences, multiple alignments, phylogenetic analyses, protein domains, taxonomic distribution and gene structure-related data (Figure 1). In addition to pre-calculated multiple alignments by MView (13), the SYSTERS web server now offers multiple alignments and UPGMA trees generated using DIALIGN (14). The DIALIGN alignment incorporates all sequences in full length, colour-coded information on alignment quality and Pfam domain positions. From the MView alignments we derived consensus sequences for each family. The database of consensus sequences can be queried via a BLAST (15) interface.

SYSTERS provides a new wizard-like tool that allows a flexible selection of user-defined sequences. In this way, users can compile sequences of different SYSTERS families or user-supplied sequences. Subsequently, multiple alignments and UPGMA trees can be constructed using DIALIGN and viewed online.

We extracted frequently occurring keywords from all original protein annotations of a SYSTERS family. The keyword list represents a succinct functional description of a family and thus helps to infer the functions of hypothetical proteins. We integrated further Swiss-Prot/TrEMBL annotations that support function inference, such as Gene Ontology (16) terms, InterPro (17) terms and Enzyme Commission (EC) numbers (18). A new batch-retrieval tool allows fast annotation of large protein sets. Users can supply a list of sequence database identifiers, e.g. from Swiss-Prot, and download a list of associated SYSTERS protein family IDs, extracted keywords and GO terms.

Protein domain positions of all Swiss-Prot/TrEMBL proteins as annotated in the Pfam database are now integrated into SYSTERS. Domain architectures of all proteins in a SYSTERS family are visualized and can easily be compared. This makes it possible to pinpoint differences in domain architectures within the family that might indicate lineage-specific domain acquisitions or losses.
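The text does not specify how frequently occurring keywords are extracted from the annotations. One plausible approach, sketched below purely as an illustration, counts in how many member annotations each word occurs and keeps the words found in at least a given fraction of them. The stop-word list, the tokenisation rule, the `min_fraction` threshold and the function name are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stop words: generic terms that carry no functional signal.
STOPWORDS = frozenset({
    "protein", "putative", "hypothetical", "fragment", "precursor",
    "of", "the", "and", "like",
})


def extract_keywords(annotations: list[str],
                     min_fraction: float = 0.5) -> list[str]:
    """Return words present in at least min_fraction of the annotations."""
    doc_freq: Counter[str] = Counter()
    for text in annotations:
        # Count each word once per annotation (document frequency).
        words = set(re.findall(r"[a-z][a-z0-9-]+", text.lower()))
        doc_freq.update(words - STOPWORDS)

    threshold = min_fraction * len(annotations)
    frequent = [(w, n) for w, n in doc_freq.items() if n >= threshold]
    # Most frequent first; ties broken alphabetically.
    return [w for w, _ in sorted(frequent, key=lambda wn: (-wn[1], wn[0]))]


family_annotations = [
    "Serine/threonine-protein kinase",
    "Putative protein kinase C",
    "Hypothetical protein",
    "Kinase domain protein (fragment)",
]
keywords = extract_keywords(family_annotations)
```

In this toy family only "kinase" occurs in at least half of the annotations, so it would be the single keyword transferred to the hypothetical member.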

Taxonomy and phylogenetic profiling

We have integrated the taxonomic system maintained by the NCBI (19) into SYSTERS and offer a visualization of the distribution of protein family members over the taxonomic tree. Users can now select the sequences of a subfamily specified by internal nodes of the taxonomic tree for further analysis. Additionally, it is possible to select all SYSTERS protein families that have at least one member protein, or exclusively member proteins, within a user-defined taxonomic range. A special taxonomic view of a protein family focuses on the presence/absence patterns of member proteins across organisms, also known as phylogenetic profiles (20). Similar profiles often point to a similar cellular function or a physical interaction. We set up PhyloMatrix, an extension of SYSTERS for phylogenetic profiling. PhyloMatrix profiles are based on the representation of 106 completely sequenced organisms in SYSTERS protein families: 78 bacteria, 12 eukaryota and 16 archaea. We found 7563 different profiles for 19 374 protein families under the constraint that at least three organisms be present in a family. Users can define a list of protein family IDs to retrieve a set of profiles. Alternatively, PhyloMatrix can be queried with a specific organism pattern to display the profiles of matching families. PhyloMatrix is a helpful tool for the exploration of evolutionary events. For example, Figure 2 shows profiles of ribosomal protein families. The profiles of the mitochondrial and cytosolic forms are complementary, reflecting the endosymbiotic origin of mitochondria.
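A phylogenetic profile can be represented as a presence/absence string over a fixed organism order. The sketch below illustrates the idea under stated assumptions; it is not the PhyloMatrix implementation. The family IDs, the four-organism list, the `?` wildcard and all function names are hypothetical. Only the constraint of at least three organisms per family comes from the text.

```python
from collections import defaultdict


def build_profile(organisms_in_family: set[str],
                  organism_order: list[str]) -> str:
    """Encode presence ('1') or absence ('0') for each organism in order."""
    return "".join("1" if org in organisms_in_family else "0"
                   for org in organism_order)


def group_by_profile(families: dict[str, set[str]],
                     organism_order: list[str],
                     min_organisms: int = 3) -> dict[str, list[str]]:
    """Map each distinct profile to the families that share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for family_id, organisms in families.items():
        profile = build_profile(organisms, organism_order)
        if profile.count("1") >= min_organisms:
            groups[profile].append(family_id)
    return dict(groups)


def match_pattern(profile: str, pattern: str) -> bool:
    """Pattern uses '1' (present), '0' (absent) and '?' (either)."""
    return len(profile) == len(pattern) and all(
        p in ("?", c) for c, p in zip(profile, pattern))


order = ["E. coli", "B. subtilis", "S. cerevisiae", "H. sapiens"]
toy_families = {
    "F1": {"E. coli", "B. subtilis", "S. cerevisiae"},
    "F2": {"B. subtilis", "E. coli", "S. cerevisiae"},
    "F3": {"H. sapiens"},
    "F4": set(order),
}
profiles = group_by_profile(toy_families, order)
```

Here F1 and F2 share the profile `1110`, F4 has `1111`, and F3 is excluded because it covers fewer than three organisms. Grouping by identical strings is what lets many families collapse into fewer distinct profiles, as in the 19 374 families and 7563 profiles reported above.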

Figure 2

PhyloMatrix: phylogenetic profiling based on SYSTERS protein families. (i) On the left: PhyloMatrix entry page with several options to access phylogenetic profiles: (a) by browsing, (b) by family selection or (c) by specification of an organism pattern. (ii) On the right: 45 phylogenetic patterns describe 99 protein families comprising ribosomal proteins. For the given example, protein families were pre-selected via the second option (b). The order of organisms in each pattern follows their order in the taxonomic tree on the query page. Selected profiles are sorted according to a hierarchical clustering of all profiles.

Cross-references to external databases

The SYSTERS web server augments information on sequences and protein families by links to a multitude of data resources. We reference all protein source databases (Figure 1). In addition, SYSTERS can be queried with gene names, with accessions from the EMBL nucleotide database (21) or with identifiers of the specialized structure databases, such as PDB (22), MSD (23) and IMB (24). SYSTERS is embedded in the network of genomic database resources in the Computational Molecular Biology Department of the Max Planck Institute for Molecular Genetics, Berlin, including GeneNest, SpliceNest (25) and CORG (26).

25 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment.

Authors: B Morgenstern
Journal: Bioinformatics Date: 1999-03 Impact factor: 6.937

3. SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein.

Authors: Antje Krause; Stefan A Haas; Eivind Coward; Martin Vingron
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

4. The IMB Jena Image Library of Biological Macromolecules: 2002 update.

Authors: Jan Reichert; Jürgen Sühnel
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

5. The structure of the protein universe and genome evolution.

Authors: Eugene V Koonin; Yuri I Wolf; Georgy P Karev
Journal: Nature Date: 2002-11-14 Impact factor: 49.962

6. The EMBL Nucleotide Sequence Database.

Authors: Tamara Kulikova; Philippe Aldebert; Nicola Althorpe; Wendy Baker; Kirsty Bates; Paul Browne; Alexandra van den Broek; Guy Cochrane; Karyn Duggan; Ruth Eberhardt; Nadeem Faruque; Maria Garcia-Pastor; Nicola Harte; Carola Kanz; Rasko Leinonen; Quan Lin; Vincent Lombard; Rodrigo Lopez; Renato Mancuso; Michelle McHale; Francesco Nardone; Ville Silventoinen; Peter Stoehr; Guenter Stoesser; Mary Ann Tuli; Katerina Tzouvara; Robert Vaughan; Dan Wu; Weimin Zhu; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

7. The InterPro Database, 2003 brings increased coverage and new features.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Daniel Barrell; Alex Bateman; David Binns; Margaret Biswas; Paul Bradley; Peer Bork; Phillip Bucher; Richard R Copley; Emmanuel Courcelle; Ujjwal Das; Richard Durbin; Laurent Falquet; Wolfgang Fleischmann; Sam Griffiths-Jones; Daniel Haft; Nicola Harte; Nicolas Hulo; Daniel Kahn; Alexander Kanapin; Maria Krestyaninova; Rodrigo Lopez; Ivica Letunic; David Lonsdale; Ville Silventoinen; Sandra E Orchard; Marco Pagni; David Peyruc; Chris P Ponting; Jeremy D Selengut; Florence Servant; Christian J A Sigrist; Robert Vaughan; Evgueni M Zdobnov
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

8. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.

Authors: Brigitte Boeckmann; Amos Bairoch; Rolf Apweiler; Marie-Claude Blatter; Anne Estreicher; Elisabeth Gasteiger; Maria J Martin; Karine Michoud; Claire O'Donovan; Isabelle Phan; Sandrine Pilbout; Michel Schneider
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

9. CORG: a database for COmparative Regulatory Genomics.

Authors: C Dieterich; H Wang; K Rateitschak; H Luz; M Vingron
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

10. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community.

Authors: Seung Yon Rhee; William Beavis; Tanya Z Berardini; Guanghong Chen; David Dixon; Aisling Doyle; Margarita Garcia-Hernandez; Eva Huala; Gabriel Lander; Mary Montoya; Neil Miller; Lukas A Mueller; Suparna Mundodi; Leonore Reiser; Julie Tacklind; Dan C Weems; Yihe Wu; Iris Xu; Daniel Yoo; Jungwon Yoon; Peifen Zhang
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971
