SIMAP--structuring the network of protein similarities.

Thomas Rattei¹, Patrick Tischler, Roland Arnold, Franz Hamberger, Jörg Krebs, Jan Krumsiek, Benedikt Wachinger, Volker Stümpflen, Werner Mewes.

Abstract

Protein sequences are the most important source of evolutionary and functional information for new proteins. In order to facilitate the computationally intensive tasks of sequence analysis, the Similarity Matrix of Proteins (SIMAP) database aims to provide a comprehensive and up-to-date dataset of the pre-calculated sequence similarity matrix and sequence-based features like InterPro domains for all proteins contained in the major public sequence databases. As of September 2007, SIMAP covers approximately 17 million proteins and more than 6 million non-redundant sequences and provides a complete annotation based on InterPro 16. Novel features of SIMAP include a new, portlet-based web portal providing multiple, structured views on retrieved proteins and integration of protein clusters and a unique search method for similar domain architectures. Access to SIMAP is freely provided for academic use through the web portal for individuals at http://mips.gsf.de/simap/ and through Web Services for programmatic access at http://mips.gsf.de/webservices/services/SimapService2.0?wsdl.

Year: 2007 PMID: 18037617 PMCID: PMC2238827 DOI: 10.1093/nar/gkm963

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The number of proteins stored in public databases is rapidly growing and the sequences of amino acids are, at the moment, the most important source of evolutionary and functional information for new proteins. Therefore, the calculations of similarities and features based on protein sequences are by far the most frequently used bioinformatics applications and consume huge amounts of CPU cycles worldwide. Database searches of individual sequences that are already included in sequence databases and the generation of sequence similarity networks by all-against-all comparisons, e.g. for clustering of proteins or prediction of orthologous groups, can be drastically accelerated and made cheaper by the pre-calculation of sequence similarities and features. Redundant calculations are hence replaced by retrieval of data from a database. In order for such a database to be useful and applicable to a wide range of bioinformatics problems, it should cover the known protein space comprehensively and be frequently updated. The database ‘Similarity Matrix of Proteins’ (SIMAP) aims to provide a comprehensive and up-to-date dataset of pre-calculated sequence similarities and features for all proteins contained in the major public sequence databases, including Uniprot/Swissprot, Uniprot/TrEMBL (1), PDB (2), GenBank (3) and RefSeq (4). Due to its high coverage and frequent update cycles, SIMAP has developed into the largest and thus unique resource of pre-calculated sequence analysis so far. The core of SIMAP consists of a database system that consistently stores all proteins imported from heterogeneous data sources and provides efficient and fully automated update functionality (5). The amino acid sequences are kept non-redundantly, resulting in a current number of ∼17 million proteins and >6 million sequences in SIMAP (see Figure 1 for a comparison of the proteins and sequences covered by the three most important public sequence databases).
The basic protein data are supplemented by the taxonomic assignments, if available, from the source databases. Other information that is important for downstream analysis, e.g. chromosomal location or functional annotation, is available from the tightly interconnected PEDANT genome database system (6).

Figure 1.

Numbers of the proteins (left) and non-redundant sequences (right) covered by the three most important public sequence databases: Uniprot, RefSeq and GenBank as of September 2007.

For all non-redundant sequences in SIMAP, a matrix of all-against-all sequence similarities [calculated by a sensitive two-step algorithm based on FASTA and Smith–Waterman (5)] is maintained by our system. In contrast to other databases storing pre-calculated similarities (like NCBI BLink), the similarity calculation is thresholded only by a static and sensitive raw score cutoff and not by a maximal number of hits per sequence. Therefore, the structure of the graph formed by the sequence similarities is not altered by the representation of particular protein families in sequence databases and is thus well suited for downstream analysis like clustering or the analysis of its network structure. To facilitate the individual analysis of protein families, the graph formed by pairwise sequence alignments has to be complemented by position-specific scoring of similarities in order to focus on functionally or structurally important residues. SIMAP therefore provides pre-calculated predictions of protein domains for all member databases of InterPro (7) and of additional features like transmembrane helices (8), signal peptides (9) or localization predictions (10) for the complete set of sequences. The computational space of calculating sequence similarities and features is minimized by the non-redundant representation of both sequences and feature models. This allows for a strictly incremental updating procedure, not only with respect to the sequences but also for the feature space. Thus, when upgrading all SIMAP features to a new InterPro release, only a usually small number of changed and new domain models have to be calculated. Most of the calculations are performed by the public resource computing project BOINCSIMAP (11).
All data in SIMAP are freely available for academic use through the web portal and Web Services. The smaller parts of the data, i.e. protein and sequence information and the sequence features, can be downloaded as flat files. The similarity data are not suited for direct download because of their size, currently more than 1 TB, and can therefore only be accessed through the SIMAP Web Services. For projects that want to make use of SIMAP data for a large set of proteins, dumps are provided individually upon request, including a regular update service.
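
The difference between a static raw score cutoff and a limit on the number of hits per sequence can be illustrated with a small sketch. It is not SIMAP code: the function names, the toy hit list and the cutoff values are invented for illustration only.

```python
from collections import defaultdict

def edges_by_score_cutoff(hits, min_score):
    """Keep every hit whose raw score reaches the static cutoff."""
    return {(q, t) for q, t, score in hits if score >= min_score}

def edges_by_max_hits(hits, max_hits):
    """For contrast: keep only the best `max_hits` hits of each query."""
    per_query = defaultdict(list)
    for q, t, score in hits:
        per_query[q].append((score, t))
    edges = set()
    for q, scored in per_query.items():
        for score, t in sorted(scored, reverse=True)[:max_hits]:
            edges.add((q, t))
    return edges

# Toy hit list of (query, target, raw score); all values are invented.
hits = [
    ("A", "B", 300), ("A", "C", 250), ("A", "D", 120),
    ("D", "A", 120), ("D", "E", 90),
]

cutoff_edges = edges_by_score_cutoff(hits, min_score=100)  # hypothetical cutoff
top_n_edges = edges_by_max_hits(hits, max_hits=2)          # hypothetical limit
```

With the score cutoff, the pair A/D is kept in both directions. With the per-query limit, the edge from A to D is dropped only because A belongs to a larger family in this toy list, so the graph changes with database composition rather than with the similarity itself.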

NEW FEATURES AND IMPROVEMENTS IN SIMAP

User-friendly access through integrative web portal

To retrieve proteins, features and homologs from SIMAP, a new and improved web portal provides a user-friendly and powerful toolbox. During the implementation of this portal, the integration of information from heterogeneous databases into the different views, e.g. proteins and homologs from SIMAP and functional annotation from PEDANT, has been a major challenge. Therefore, the new SIMAP web portal is based on an enterprise portal server that is capable of aggregating individual content by reusable portlets, thereby providing context-specific views. The entry point into SIMAP through the web portal is the search for proteins by user-defined text terms and sequences. If a query sequence cannot be found in SIMAP, closely related sequences are searched by a rapid ‘SeqFinder’ algorithm based on a suffix array representation of SIMAP. In order to find related sequences in the SIMAP database, the query sequence is translated into a reduced alphabet of 10 groups of amino acids having positive substitution scores in the BLOSUM50 matrix (12). The transformed query sequence is fragmented into overlapping short substrings. Each substring is searched for exact matches in the suffix array representation of SIMAP, which has also been transformed into the reduced alphabet of amino acids (13). All matching sequences are classified by their relation to the complete query sequence as ‘equal’, ‘containing’, ‘contained’ and ‘similar’ sequences. The search space can be reduced easily by selection of databases and taxa. The classical list view of the results is complemented by a taxonomic view, which allows the user to explore the proteins found in a tree-like structure based on the NCBI taxonomy (14). For every protein in SIMAP, its pre-calculated features and the list of homologs can be retrieved immediately from the database.
To explore homologous proteins by multiple criteria, the classical result list including a graphical representation of the alignments and grouping of proteins that share the same sequences (Figure 2) is complemented by alternative views that structure the homologs by taxonomy or assignment to sequence clusters (see below).
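
The ‘SeqFinder’ search described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the SIMAP implementation: the ten amino acid groups, the substring length `k`, the naive suffix array construction and the choice to classify matches in the reduced alphabet are all assumptions made for this example.

```python
# Illustrative grouping into 10 classes; not necessarily the one used by SIMAP.
GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]
REDUCE = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}

def reduce_seq(seq):
    """Translate a sequence into the reduced alphabet ('x' for other symbols)."""
    return "".join(REDUCE.get(aa, "x") for aa in seq.upper())

def build_index(db):
    """Naive suffix array over all reduced sequences, separated by '$'."""
    text, owners = "", []
    for sid, seq in db.items():
        reduced = reduce_seq(seq) + "$"
        text += reduced
        owners.extend([sid] * len(reduced))
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, owners

def find(text, suffix_array, pattern):
    """Binary search for all suffixes that start with `pattern`."""
    k = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose k-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose k-prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return suffix_array[start:lo]

def seqfinder(query, db, index, k=4):
    """Collect sequences sharing a reduced k-mer and classify them."""
    text, suffix_array, owners = index
    q = reduce_seq(query)
    candidates = set()
    for i in range(len(q) - k + 1):  # overlapping substrings of length k
        for pos in find(text, suffix_array, q[i:i + k]):
            candidates.add(owners[pos])
    result = {}
    for sid in candidates:
        s = reduce_seq(db[sid])  # assumption: compare in the reduced alphabet
        if s == q:
            result[sid] = "equal"
        elif q in s:
            result[sid] = "containing"
        elif s in q:
            result[sid] = "contained"
        else:
            result[sid] = "similar"
    return result

# Toy database; identifiers and sequences are invented.
db = {"p1": "MKVLAAGST", "p2": "KVLA", "p3": "GGMKVLAAGSTPP",
      "p4": "HHHHCCCC", "p5": "MKVLAAPPP"}
matches = seqfinder("MRILAAGTS", db, build_index(db))
```

Because the exact-match lookup runs on the reduced alphabet, the query in this example still finds `p1` although the two differ at four positions in the original amino acid alphabet. A production index would use a linear-time suffix array construction instead of sorting suffix strings.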

Figure 2.

To explore the homologs of a user-defined query protein, the classical result list including a graphical representation of the alignments is shown by default. Additional views allow structuring the homologs by taxonomy or assignment to sequence clusters.

Structuring the sequence space by clustering of protein families

In order to structure the sequence space of known proteins, SIMAP provides an integrated clustering that is based on sequence homology as well as domain architectures. Clustering a large number of sequences by their pairwise sequence similarities is a non-trivial, computationally very expensive task. Among the many approaches that have been established successfully [see e.g. (15), (16) or (17)], the Tribe-MCL pipeline (18) provides the implementation of an efficient algorithm for large-scale detection of protein families based on the Markov cluster algorithm. Due to the huge number of pairwise similarities in SIMAP, even the application of the very fast Tribe-MCL pipeline requires preprocessing steps as described below. To avoid contamination of clusters by promiscuous domains as discussed in Ref. (17), we implemented a subclustering method that splits MCL clusters based on the domain architecture of the cluster members. Clusters are calculated using a hierarchical algorithm consisting of five main steps:

1. separation of sequences into the major taxonomic divisions (bacteria, archaea, eukaryota and viruses);
2. generation of non-redundant sets of sequences by pre-clustering of very similar sequences (the ratio of the alignment score between two sequences to the maximal alignment score of the two sequences compared with themselves, and the ratio of alignment length to sequence length, must both be ≥90%);
3. Markov chain linkage clustering (18) of the similarity networks of non-redundant sequence sets into main clusters;
4. subclustering of the main clusters from Step 3 based on different domain architectures (more details on this method are given below); and
5. comparison of all member proteins of the main clusters from Step 3 between the taxonomic divisions to form metaclusters connecting related protein families from bacteria, archaea, eukaryota and viruses.

The cluster-centric view, which is available for all sets of proteins shown in the SIMAP portal, allows exploring the similarity relations of the query protein and its homologs in an easy and convenient manner.
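
The pre-clustering criterion of Step 2 and the architecture-based split of Step 4 can be sketched as below. The text fixes only the two 90% thresholds; the use of the longer sequence as the length reference, single-linkage grouping via union-find, and splitting by exactly identical domain architectures are assumptions of this sketch, and all names and values are hypothetical.

```python
from collections import defaultdict

def nearly_identical(score_ab, self_a, self_b, aln_len, len_a, len_b,
                     threshold=0.9):
    """Both ratios of the pre-clustering criterion must reach the threshold."""
    score_ratio = score_ab / max(self_a, self_b)
    coverage = aln_len / max(len_a, len_b)  # assumption: longer sequence
    return score_ratio >= threshold and coverage >= threshold

def precluster(seq_ids, similar_pairs):
    """Group sequences linked by near-identity (single linkage, union-find)."""
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for s in seq_ids:
        groups[find(s)].append(s)
    return list(groups.values())

def subcluster_by_architecture(cluster, architectures):
    """Split one main cluster into groups with identical domain order."""
    groups = defaultdict(list)
    for pid in cluster:
        groups[tuple(architectures.get(pid, ()))].append(pid)
    return list(groups.values())

# Toy values; scores, lengths and domain names are invented.
same = nearly_identical(950, 1000, 980, aln_len=290, len_a=300, len_b=295)
groups = precluster(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
subclusters = subcluster_by_architecture(
    ["p1", "p2", "p3"],
    {"p1": ["domA", "domB"], "p2": ["domA", "domB"], "p3": ["domA"]},
)
```

In this toy run, `p3` shares a domain with `p1` and `p2` but is split off because its architecture differs, which is the effect intended to limit contamination by promiscuous domains.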

Search by similarity of domain architectures

A novel search method in SIMAP addresses the task of finding homologs of multidomain proteins, especially in case of domain duplications or domain shuffling. The new ‘Domain similarity’ tool takes advantage of the consistent annotation of all sequences in SIMAP with their InterPro domains. Given a query protein, it allows searching for sequences of similar domain architecture. To quantitatively describe the evolutionary distance of two domain architectures, which is not trivial due to the specific evolution of multidomain proteins, we adapted a method proposed by Lin et al. (19). ‘Domain similarity’ searches are capable not only of refining the sort order of homologs, but also of finding remote homologs that lack sufficient sequence similarity for significant hits by FASTA and Smith–Waterman; however, their conservation is still detectable using position-specific scoring models (Figure 3).

Figure 3.

Example of remote homologs retrieved by the ‘Domain similarity’ tool of SIMAP. When searching the query sequence in the Uniprot database, high E-values result from the low bitscores. Thus, these proteins show insufficient pairwise sequence homology to the query and would not be found by database searches that are typically restricted to a maximal E-value of 10. However, the similar domain architectures suggest a common ancestry of these proteins.
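
The idea of ranking proteins by the similarity of their domain architectures can be illustrated with a deliberately simple stand-in. The sketch below does not reproduce the adapted method of Lin et al. (19); it uses a plain Jaccard index on the sets of annotated domains, and the protein identifiers and domain names are invented.

```python
def jaccard(arch_a, arch_b):
    """Shared domains divided by all domains of the two architectures."""
    a, b = set(arch_a), set(arch_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_by_architecture(query_arch, candidates):
    """Sort candidates by decreasing similarity; ties broken by identifier."""
    scored = [(jaccard(query_arch, arch), pid)
              for pid, arch in candidates.items()]
    return sorted(scored, key=lambda item: (-item[0], item[1]))

# Toy annotations; domain names stand in for InterPro entries.
query = ["kinase", "SH2", "SH3"]
candidates = {
    "p1": ["SH3", "SH2", "kinase"],
    "p2": ["kinase"],
    "p3": ["zinc_finger"],
}
ranking = rank_by_architecture(query, candidates)
```

A set-based index ignores domain order and copy number, so it cannot distinguish domain shuffling or duplication. A measure intended for multidomain proteins would need additional terms for both.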

Mapping of individual proteins into the public protein space

Due to the use of multiple identifiers for the same protein in different databases, an important but time-consuming task in bioinformatics is the transformation of a set of proteins into another domain of identifiers. This task is necessary also for proprietary databases that use special identifiers and should be mapped to recent public databases. A similar situation occurs for data from proteomics experiments. SIMAP provides a very fast mapping between protein sets, based on the identity of protein sequences by comparison of their MD5 hashes. In cases that do not allow for mapping by sequence identity, e.g. if sequences are fragmented or altered by unidentified residues, a more time-consuming mapping can be performed using PROMPT (20) that makes use of the SIMAP ‘SeqFinder’ function and provides the mapping by individual similarity searches.
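
Mapping by sequence identity through MD5 hashes can be sketched in a few lines. The normalization (removing whitespace and converting to upper case) and all identifiers are assumptions of this example; the text does not state how SIMAP prepares sequences before hashing.

```python
import hashlib

def seq_md5(seq):
    """MD5 hex digest of a sequence after simple normalization (assumed)."""
    normalized = "".join(seq.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

def map_by_identity(private, public):
    """Map each private identifier to all public entries with equal sequence."""
    index = {}
    for pid, seq in public.items():
        index.setdefault(seq_md5(seq), []).append(pid)
    return {pid: index.get(seq_md5(seq), []) for pid, seq in private.items()}

# Toy data; identifiers and sequences are invented.
private = {"x1": "mkvla", "x2": "MKVLX"}
public = {"P1": "MKVLA", "Q1": "MKVLA", "P2": "GGG"}
mapping = map_by_identity(private, public)
```

The entry `x2` stays unmapped because a single unidentified residue changes the hash, which is the situation in which the slower similarity-based mapping becomes necessary.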

FUTURE DIRECTIONS

In the future, the contents of the SIMAP database will be updated every month to stay abreast of all published protein sequences. Current statistics and information about the contained databases can be found on the SIMAP web portal. Recently, the natural diversity of life and its underlying genetic information has been investigated by metagenomic projects. Sequences from environmental sequencing projects (‘metagenomes’) will be integrated into SIMAP soon. Together with future plans for the enhanced integration of functional annotations of proteins and the improvement of the clustering procedures, SIMAP will continue to facilitate individual discoveries as well as systematic downstream projects by providing a structured database of the pre-calculated sequence similarity and feature spaces.

18 in total

1. An efficient algorithm for large-scale detection of protein families.

Authors: A J Enright; S Van Dongen; C A Ouzounis
Journal: Nucleic Acids Res Date: 2002-04-01 Impact factor: 16.971

2. Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters.

Authors: E V Kriventseva; F Servant; R Apweiler
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

3. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

4. Improved prediction of signal peptides: SignalP 3.0.

Authors: Jannick Dyrløv Bendtsen; Henrik Nielsen; Gunnar von Heijne; Søren Brunak
Journal: J Mol Biol Date: 2004-07-16 Impact factor: 5.469

5. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.

Authors: O Emanuelsson; H Nielsen; S Brunak; G von Heijne
Journal: J Mol Biol Date: 2000-07-21 Impact factor: 5.469

6. New developments in the InterPro database.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Peer Bork; Virginie Buillard; Lorenzo Cerutti; Richard Copley; Emmanuel Courcelle; Ujjwal Das; Louise Daugherty; Mark Dibley; Robert Finn; Wolfgang Fleischmann; Julian Gough; Daniel Haft; Nicolas Hulo; Sarah Hunter; Daniel Kahn; Alexander Kanapin; Anish Kejariwal; Alberto Labarga; Petra S Langendijk-Genevaux; David Lonsdale; Rodrigo Lopez; Ivica Letunic; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Jaina Mistry; Alex Mitchell; Anastasia N Nikolskaya; Sandra Orchard; Christine Orengo; Robert Petryszak; Jeremy D Selengut; Christian J A Sigrist; Paul D Thomas; Franck Valentin; Derek Wilson; Cathy H Wu; Corin Yeats
Journal: Nucleic Acids Res Date: 2007-01 Impact factor: 16.971

7. ProtoNet 4.0: a hierarchical classification of one million protein sequences.

Authors: Noam Kaplan; Ori Sasson; Uri Inbar; Moriah Friedlich; Menachem Fromer; Hillel Fleischer; Elon Portugaly; Nathan Linial; Michal Linial
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

8. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema.

Authors: Nita Deshpande; Kenneth J Addess; Wolfgang F Bluhm; Jeffrey C Merino-Ott; Wayne Townsend-Merino; Qing Zhang; Charlie Knezevich; Lie Xie; Li Chen; Zukang Feng; Rachel Kramer Green; Judith L Flippen-Anderson; John Westbrook; Helen M Berman; Philip E Bourne
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Super paramagnetic clustering of protein sequences.

Authors: Igor V Tetko; Axel Facius; Andreas Ruepp; Hans-Werner Mewes
Journal: BMC Bioinformatics Date: 2005-04-01 Impact factor: 3.169
