Literature DB >> 19906725

SIMAP--a comprehensive database of pre-calculated protein sequence similarities, domains, annotations and clusters.

Thomas Rattei¹, Patrick Tischler, Stefan Götz, Marc-André Jehl, Jonathan Hoser, Roland Arnold, Ana Conesa, Hans-Werner Mewes.

Abstract

The prediction of protein function as well as the reconstruction of evolutionary genesis employing sequence comparison at large is still the most powerful tool in sequence analysis. Due to the exponential growth of the number of known protein sequences and the subsequent quadratic growth of the similarity matrix, the computation of the Similarity Matrix of Proteins (SIMAP) becomes a computational intensive task. The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences. Novel features of SIMAP include the expansion of the sequence space by including databases such as ENSEMBL as well as the integration of metagenomes based on their consistent processing and annotation. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP and the data access and query functions have been improved. SIMAP assists biologists to query the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology. Access to SIMAP is freely provided through the web portal for individuals (http://mips.gsf.de/simap/) and for programmatic access through DAS (http://webclu.bio.wzw.tum.de/das/) and Web-Service (http://mips.gsf.de/webservices/services/SimapService2.0?wsdl).

Entities: Disease Species

Mesh：

Substances：
Proteins

Year: 2009 PMID： 19906725 PMCID： PMC2808863 DOI： 10.1093/nar/gkp949

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein sequences are of utmost importance for studying the function and evolution of genes and genomes. Evolutionary processes of mutation and selection have shaped the protein sequence space and became manifest in the protein sequences as well as their pair-wise and group-wise similarities. Therefore, a rich collection of methods in computational biology relies on the analysis and comparison of protein sequences. Many of these intensively used methods perform sequence similarity searches [e.g. BLAST (1)] or compare protein sequences against secondary databases of protein families [e.g. InterPro (2)]. The fast increasing volume of publicly available protein sequences forges a computational dilemma for bioinformatics tasks that require repeated all-against-all calculations of sequence similarities or sequence features. Such rather straightforward but technically challenging tasks among others are the annotation of genomes or the clustering of the protein sequence space into protein families. Due to the exponential growth of the number of sequences and the quadratic complexity of the sequence similarity matrix, the computational demand of calculating an all-versus-all sequence matrix of all known proteins easily outgrows available computational resources. Due to the subsequent growth of the secondary databases, a similar problem exists for the prediction of protein domains. As a consequence, any repeated ab initio recalculation of the similarity matrix is highly ineffective due to the recalculation of the vast majority of already known sequence similarity relations. However, as the number of recently added sequences is always small compared to the bulk of know sequences, repeated recalculations—frequently performed in many sequence-based projects—waste a remarkable amount of compute sources worldwide. The Similarity Matrix of Proteins (SIMAP) solves the computational dilemma described above by incrementally pre-calculating the sequence similarities forming the known protein sequence space (3). The comparison of new sequences versus known ones returns symmetric scores that can be updated accordingly in the existing records. Compared to other resources that pre-calculate sequence similarities [e.g. NCBI Blink (4)], the FASTA (5) and Smith–Waterman (6) based similarity calculation in SIMAP is only restricted by a static and sensitive raw score threshold without limiting the maximal number of hits per sequence. Hence the structure of the sequence similarity matrix is not influenced by the taxonomy and study biases that exist in the major protein sequence databases. The SIMAP database stores raw scores from the calculated alignments. When querying SIMAP, e-values are calculated on-the-fly according to the selected databases and taxa. To complement the pair-wise sequence similarity matrix by position specific searches against known protein families, SIMAP in addition pre-calculates sequence based features as e.g. InterPro matches (2). To maximize its coverage to provide an efficient alternative to BLAST or Interproscan calculations, the comprehensive representation of the protein sequence space is crucial for SIMAP. Recent improvements in SIMAP have addressed this requirement by further expanding the sequence space and including metagenomic sequences. Further improvements have extended the functional annotation of the protein sequence space in SIMAP by pre-calculated GO annotations and improved the data access and query tools of SIMAP.

NEW FEATURES AND IMPROVEMENTS IN SIMAP

Comprehensive coverage of the protein sequence space

SIMAP represents the known protein sequence space comprehensively and up-to-date. According to this goal, the SIMAP database is synchronized once per month with the major protein sequence databases (Table 1). The consideration of each of these databases in SIMAP is justified by providing either unique protein sequences that are not found in other databases [e.g. ENSEMBL (7)], or unique protocols for data processing [e.g. NCBI RefSeq (8)]. The continuous and rapid growth of the sequence space demands for a sophisticated high-performance computing infrastructure to pre-calculate the sequence similarities of all new sequences and their sequence-based features immediately after the import of new sequences even in case of SIMAP’s incremental implementation. The SIMAPBOINC public resource computing project (9) steadily provides compute power beyond the current need and thus enables rapid updating of SIMAP.

Table 1.

Number of protein entries and non-redundant sequences of the major protein sequence databases included in SIMAP as of September 2009

Database	Protein entries	Non-redundant sequences
NCBI GenBank	16 146 018	13 065 886
NCBI RefSeq	8 181 910	6 681 186
Uniprot/TrEMBL	8 926 016	7 586 794
Uniprot/Swissprot	495 880	416 496
PDB	139 106	41 445
PEDANT	5 480 442	5 389 911
ENSEMBL	1 094 482	1 062 197

Number of protein entries and non-redundant sequences of the major protein sequence databases included in SIMAP as of September 2009

Consistent processing and annotation of metagenomes

With the breakthrough of next generation sequencing methods and their application to environmental samples (10), metagenomic sequences have indelibly expanded the protein sequence space to non-culturable organisms and environmental communities. However, the pioneering ‘Global ocean sampling’ (GOS) project (11) so far remains the only metagenomic dataset of which protein sequences are represented in a major public sequence database [NCBI GenBank (4)]. All other metagenomes are—if at all—deposited in distributed resources as the ‘Whole Genome Shutgun’ (wgs) section of NCBI GenBank (4) or the IMG/M database (12). No standardized protocol for gene calling and the annotation of protein-coding sequences has been established so far for these data collections. As the consistent annotation of metagenomes is indispensable for any downstream comparative analysis such as comparisons of taxonomic or functional profiles between different metagenomes, an extension of SIMAP was implemented that extracts coding sequences from metagenomic sequencing reads, assembled contigs and scaffolds in a consistent way. This part of SIMAP covering environmental sequence fragments is monthly synchronized with three major repositories of metagenomes (Table 2). Entirely redundant metagenomes are considered only once, whereas redundant representations of the same project differing in their total number of nucleotides (e.g. the whale fall samples in IMG/M and GenBank wgs) are retained. Similar to the methodology used by the GOS project (13), coding sequences are extracted from the nucleotide sequences in a multi-step procedure:

Table 2.

Number of metagenomic samples and extracted protein-coding sequences in SIMAP as of September 2009

Database	Metagenomic samples	Non-redundant sequences
Camera (JCVI)	54	6 031 109
NCBI Genbank/wgs section	130	4 244 008
IMG/M (JGI)	65	2 833 359

all open reading frames (ORF) exceeding a length of 90 nt are extracted from the nucleotide sequences of a metagenome, all-against-all protein sequence similarities between all ORFs in a metagenome are calculated using the SIMAP software (3): first a FASTA (5) similarity search against the low-complexity masked sequences down to the BLOSUM50 (14) score of 80 is performed without restricting the number of hits, thereafter the alignments are re-calculated without low-complexity masking, ORFs are weighted by the number and score of their sequence alignments; shadow ORFs are detected by their overlap with higher weighted ORFs and removed using the methodology and parameters as in the GOS project (13), remaining ORFs having a length of at least 60 aa are imported into the main SIMAP database, all-against-all protein sequence similarities between all ORFs of a metagenome and all other protein sequences in SIMAP are calculated as in step 2, again, shadow ORFs are removed as in step 3. Number of metagenomic samples and extracted protein-coding sequences in SIMAP as of September 2009 Compared to the supervised gene prediction methods as used in other metagenomic resources, the procedure applied in SIMAP is not biased towards any taxonomic group (i.e. prokaryotes) and only limited by the minimal length of open reading frames in step 1 and 4. The parameters applied in this procedure ensure optimal sensitivity in detecting coding sequences both in single-exon and multi-exon genes. The derived metagenomic ORFs have almost doubled the volume of the known protein sequence space and thus significantly added valuable information (Figure 1). However, metagenomic sequences exhibit lower accuracy compared to completely sequenced genes and genomes, show fragmentation in case of multi-exon genes and lack knowledge of their taxonomic origin. Therefore, metagenomic sequences can be excluded when retrieving data from SIMAP according to the individual requirements of the user.

Figure 1.

Composition of the non-redundant protein sequence space in SIMAP as of September 2009.

Functional annotation of the protein sequence space

Many computational methods to support the prediction of protein function are computationally expensive and therefore benefit from comprehensive pre-calculation and incremental updates as the basic design principles of SIMAP. SIMAP thus pre-calculates Interpro domains and features (2) for all sequences including metagenomic ORFs (15). New releases of InterPro are incorporated into SIMAP as soon as they become available; SIMAP is regularly updated to the latest InterPro version (currently 22.0). SIMAP provides an ideal complete resource for the computation of secondary features such as the functional annotation of protein sequences based on information transfer from annotated proteins. BLAST2GO may serve as an example that provides various annotation tools for the functional classification of proteins (16,17). Blast2GO achieves the automatic functional annotation of DNA or protein sequences employing the Gene Ontology vocabulary. We have adapted the Blast2GO suite to enable the retrieval of sequence similarities from the SIMAP database instead of performing BLAST (1) searches. This step saves an enormous amount of compute-time compared to BLAST and allows annotating the complete protein sequence space of SIMAP using a few PCs within a week. We have integrated the adapted BLAST2GO program into the monthly update workflow of SIMAP in order to keep the pre-calculated BLAST2GO annotations complete and up-to-date (Table 3).

Table 3.

Pre-calculated functional annotations in SIMAP as of September 2009

Method	Number of pre-calculated features
InterProScan	133 829 528
TargetP	17 205 439
SignalP	11 060 831
TMHMM	15 841 454
Phobius	18 488 832
Blast2GO	190 801 556

Pre-calculated functional annotations in SIMAP as of September 2009

High performance data access facilities

All data in SIMAP are freely available. The continuously growing size of SIMAP demands a sophisticated implementation of the database to provide versatile and rapid access to the data with respect to a broad spectrum of use cases. Based on the established database and standard middleware components of SIMAP, we have improved the performance and stability of SIMAP through clustering of two independent database and application servers. This clustering effectively uncouples production and maintenance processes. Each of the servers is ready to process more than 2 million complex queries per day. Furthermore, we have improved the different data access facilities connecting SIMAP to its users. The versatile web portal allows searching for proteins by text or sequence queries. The matches are starting points for retrieving homologous proteins based on sequence similarity or domain architecture. Protein report pages integrate data from SIMAP including InterPro and GO annotation as well as from external resources as the PEDANT database (18). To facilitate clustering methods, all-against-all matrices of similarity scores can be downloaded for user-supplied groups of proteins. Programmatic access to SIMAP is provided by several SOAP based Web-Services. The SimpAT (Simap Access Tools) allows easy access to the SIMAP database using Web-Service functionality. Recently, we have implemented Distributed Annotation System (DAS) services for SIMAP. These can be accessed via the URL http://webclu.bio.wzw.tum.de/das/ and provide easy and rapid access to the proteins, sequence similarities, InterPro matches and GO annotations from SIMAP. These data with the exception of the very huge similarity matrix itself can also be downloaded as flat files from the SIMAP web portal. For research projects interested in parts of the similarity matrix, we provide project specific monthly dumps upon request.

DISCUSSION

The SIMAP database is a unique fundamental resource for computational biology that consequently puts the principle of incremental pre-calculation of sequence similarities and sequence based features into practice. SIMAP as an exhaustive, up-to-date resource to inspect the sequence similarity of any known sequence enables of any type of systematic post-processing with respect to the functional or structural classification of proteins. The recent integration of metagenomic sequences into SIMAP based on a consistent extraction of coding sequences has been beneficial to preserve the comprehensiveness of the sequence space representation in SIMAP. At the same time it makes use of the sequence similarity matrix of SIMAP to resolve overlaps and remove shadow ORFs. SIMAP represents to our knowledge the largest and most homogeneous resource for the annotation of coding sequences in metagenomes. It provides an ideal data repository and speed-up for tools as e.g. MEGAN (19) that extract taxonomic and functional information from similarities between metagenomic ORFs and known proteins in major sequence databases. The extended functional annotation of the sequence space through the pre-calculation of GO annotations and the improved data access facilities have enhanced the potential of SIMAP in assisting biologists in answering their individual research questions as well as facilitating downstream projects in computational biology at any scale.

FUNDING

SUN Microsystems Inc. (funding a fully equipped X4500 data center server that is hosting parts of the SIMAP database, through a SUN Academic Excellence Grant), European Science Foundation (financial support for Stefan Götz through the activity entitled ‘Frontiers of Functional Genomics’). Funding for open access charge: Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg. Conflict of interest statement. None declared.

18 in total

1. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

Review 2. Metagenomics: application of genomics to uncultured microorganisms.

Authors: Jo Handelsman
Journal: Microbiol Mol Biol Rev Date: 2004-12 Impact factor: 11.056

3. Rapid and sensitive sequence comparison with FASTP and FASTA.

Authors: W R Pearson
Journal: Methods Enzymol Date: 1990 Impact factor: 1.600

Review 4. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

5. Identification of common molecular subsequences.

Authors: T F Smith; M S Waterman
Journal: J Mol Biol Date: 1981-03-25 Impact factor: 5.469

6. PEDANT covers all complete RefSeq genomes.

Authors: Mathias C Walter; Thomas Rattei; Roland Arnold; Ulrich Güldener; Martin Münsterkötter; Karamfilka Nenova; Gabi Kastenmüller; Patrick Tischler; Andreas Wölling; Andreas Volz; Norbert Pongratz; Ralf Jost; Hans-Werner Mewes; Dmitrij Frishman
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

7. InterPro: the integrative protein signature database.

Authors: Sarah Hunter; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Peer Bork; Ujjwal Das; Louise Daugherty; Lauranne Duquenne; Robert D Finn; Julian Gough; Daniel Haft; Nicolas Hulo; Daniel Kahn; Elizabeth Kelly; Aurélie Laugraud; Ivica Letunic; David Lonsdale; Rodrigo Lopez; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Jaina Mistry; Alex Mitchell; Nicola Mulder; Darren Natale; Christine Orengo; Antony F Quinn; Jeremy D Selengut; Christian J A Sigrist; Manjula Thimma; Paul D Thomas; Franck Valentin; Derek Wilson; Cathy H Wu; Corin Yeats
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

8. Ensembl 2009.

Authors: T J P Hubbard; B L Aken; S Ayling; B Ballester; K Beal; E Bragin; S Brent; Y Chen; P Clapham; L Clarke; G Coates; S Fairley; S Fitzgerald; J Fernandez-Banet; L Gordon; S Graf; S Haider; M Hammond; R Holland; K Howe; A Jenkinson; N Johnson; A Kahari; D Keefe; S Keenan; R Kinsella; F Kokocinski; E Kulesha; D Lawson; I Longden; K Megy; P Meidl; B Overduin; A Parker; B Pritchard; D Rios; M Schuster; G Slater; D Smedley; W Spooner; G Spudich; S Trevanion; A Vilella; J Vogel; S White; S Wilder; A Zadissa; E Birney; F Cunningham; V Curwen; R Durbin; X M Fernandez-Suarez; J Herrero; A Kasprzyk; G Proctor; J Smith; S Searle; P Flicek
Journal: Nucleic Acids Res Date: 2008-11-25 Impact factor: 16.971

9. Database resources of the National Center for Biotechnology Information.

Authors: Eric W Sayers; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Michael Feolo; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; Vadim Miller; Ilene Mizrachi; James Ostell; Kim D Pruitt; Gregory D Schuler; Edwin Sequeira; Stephen T Sherry; Martin Shumway; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko; Jian Ye
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

10. High-throughput functional annotation and data mining with the Blast2GO suite.

Authors: Stefan Götz; Juan Miguel García-Gómez; Javier Terol; Tim D Williams; Shivashankar H Nagaraj; María José Nueda; Montserrat Robles; Manuel Talón; Joaquín Dopazo; Ana Conesa
Journal: Nucleic Acids Res Date: 2008-04-29 Impact factor: 16.971

24 in total

1. SeqDepot: streamlined database of biological sequences and precomputed features.

Authors: Luke E Ulrich; Igor B Zhulin
Journal: Bioinformatics Date: 2013-11-13 Impact factor: 6.937

2. Genome comparison of barley and maize smut fungi reveals targeted loss of RNA silencing components and species-specific presence of transposable elements.

Authors: John D Laurie; Shawkat Ali; Rob Linning; Gertrud Mannhaupt; Philip Wong; Ulrich Güldener; Martin Münsterkötter; Richard Moore; Regine Kahmann; Guus Bakkeren; Jan Schirawski
Journal: Plant Cell Date: 2012-05-22 Impact factor: 11.277

3. Maximized Autotransporter-Mediated Expression (MATE) for Surface Display and Secretion of Recombinant Proteins in Escherichia coli.

Authors: Shanna Sichwart; Iasson E P Tozakidis; Mark Teese; Joachim Jose
Journal: Food Technol Biotechnol Date: 2015-09 Impact factor: 3.918

4. Deciphering the cryptic genome: genome-wide analyses of the rice pathogen Fusarium fujikuroi reveal complex regulation of secondary metabolism and novel metabolites.

Authors: Philipp Wiemann; Christian M K Sieber; Katharina W von Bargen; Lena Studt; Eva-Maria Niehaus; Jose J Espino; Kathleen Huß; Caroline B Michielse; Sabine Albermann; Dominik Wagner; Sonja V Bergner; Lanelle R Connolly; Andreas Fischer; Gunter Reuter; Karin Kleigrewe; Till Bald; Brenda D Wingfield; Ron Ophir; Stanley Freeman; Michael Hippler; Kristina M Smith; Daren W Brown; Robert H Proctor; Martin Münsterkötter; Michael Freitag; Hans-Ulrich Humpf; Ulrich Güldener; Bettina Tudzynski
Journal: PLoS Pathog Date: 2013-06-27 Impact factor: 6.823

5. Unity in variety--the pan-genome of the Chlamydiae.

Authors: Astrid Collingro; Patrick Tischler; Thomas Weinmaier; Thomas Penz; Eva Heinz; Robert C Brunham; Timothy D Read; Patrik M Bavoil; Konrad Sachse; Simona Kahane; Maureen G Friedman; Thomas Rattei; Garry S A Myers; Matthias Horn
Journal: Mol Biol Evol Date: 2011-06-20 Impact factor: 16.240

6. Selected rapporteur summaries from the XX World Congress of Psychiatric Genetics, Hamburg, Germany, October 14-18, 2012.

Authors: Heike Anderson-Schmidt; Olga Beltcheva; Mariko D Brandon; Enda M Byrne; Eric J Diehl; Laramie Duncan; Suzanne D Gonzalez; Eilis Hannon; Katri Kantojärvi; Iordanis Karagiannidis; Mark Z Kos; Eszter Kotyuk; Benjamin I Laufer; Katarzyna Mantha; Nathaniel W McGregor; Sandra Meier; Vanessa Nieratschker; Helen Spiers; Alessio Squassina; Geeta A Thakur; Yash Tiwari; Biju Viswanath; Michael J Way; Cybele C P Wong; Anne O'Shea; Lynn E DeLisi
Journal: Am J Med Genet B Neuropsychiatr Genet Date: 2013-01-22 Impact factor: 3.568

7. Signature protein of the PVC superphylum.

Authors: Ilias Lagkouvardos; Marc-André Jehl; Thomas Rattei; Matthias Horn
Journal: Appl Environ Microbiol Date: 2013-11-01 Impact factor: 4.792

8. Integrating metagenomic and amplicon databases to resolve the phylogenetic and ecological diversity of the Chlamydiae.

Authors: Ilias Lagkouvardos; Thomas Weinmaier; Federico M Lauro; Ricardo Cavicchioli; Thomas Rattei; Matthias Horn
Journal: ISME J Date: 2013-08-15 Impact factor: 10.302

9. Pythoscape: a framework for generation of large protein similarity networks.

Authors: Alan E Barber; Patricia C Babbitt
Journal: Bioinformatics Date: 2012-09-08 Impact factor: 6.937

10. Genome sequencing of the plant pathogen Taphrina deformans, the causal agent of peach leaf curl.

Authors: Ousmane H Cissé; João M G C F Almeida; Alvaro Fonseca; Ajay Anand Kumar; Jarkko Salojärvi; Kirk Overmyer; Philippe M Hauser; Marco Pagni
Journal: mBio Date: 2013-04-30 Impact factor: 7.867