Roland Arnold, Florian Goldenberg, Hans-Werner Mewes, Thomas Rattei.
Abstract
The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith-Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing universe of sequenced proteins. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads.
Year: 2013 PMID: 24165881 PMCID: PMC3965014 DOI: 10.1093/nar/gkt970
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
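The abstract names the Smith-Waterman algorithm as the alignment method behind the pre-calculated similarities. As a rough illustration of what that algorithm computes, the sketch below returns the best local alignment score of two sequences. The scoring scheme (fixed match/mismatch values and a linear gap penalty) and the function name are assumptions chosen for brevity; they are not SIMAP's parameters, and production tools typically use a substitution matrix and affine gap penalties instead.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b.

    Toy scoring (assumption): constant match/mismatch scores and a
    linear gap penalty. Only two rows of the dynamic-programming
    matrix are kept, since the score alone needs no traceback.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                         # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,             # gap in b
                curr[j - 1] + gap,         # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("MKVLAAGIV", "KVLAAG"))
```

Because every cell is clamped at zero, the score reflects the best-matching subsequence pair rather than the full-length alignment, which is what makes the method local.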
Number of protein entries, non-redundant sequences, pre-calculated sequence similarities, protein domains, features and functional annotations (all given in millions) in SIMAP as of September 2013
| Category | Item | Count (millions) |
|---|---|---|
| The protein sequence universe covered by SIMAP | Protein entries | 163 |
| | Unique sequences (non-metagenomic) | 27 |
| | Unique sequences (metagenomic) | 35 |
| Sequence similarities | FASTA/Smith-Waterman hits | 3 517 306 |
| InterPro hits | BlastProDom | 1 |
| | FPrintScan | 28 |
| | HMMPanther | 40 |
| | HMMPfam | 50 |
| | HMMPIR | 2 |
| | HMMSmart | 16 |
| | HMMTigr | 7 |
| | ProfileScan | 17 |
| | PatternScan | 10 |
| | Superfamily | 39 |
| | Gene3D | 43 |
| | Coil | 8 |
| | Seg | 71 |
| | HAMAP | 2 |
| Sequence features | SignalP | 30 |
| | TargetP | 51 |
| | TMHMM | 39 |
| | PHOBIUS | 45 |
| Functional annotations | Blast2GO | 157 |
Performance of the main methods of the SIMAP Web-Service
| Web-Service method | Requests per minute from a single client |
|---|---|
| Retrieval of homologs (SIMAP XML) | 26 |
| Retrieval of homologs (BLAST XML) | 25 |
| Retrieval of InterPro hits | 37 |
Values denote average numbers of requests per minute from a geographically remote location (Toronto, Ontario, Canada).
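To put these throughput figures in perspective, the following sketch estimates how long a batch of sequential requests from a single client would take at the reported average rates. The dictionary keys and the function name are hypothetical labels introduced here, and the estimate assumes one request per protein and a constant rate equal to the tabulated average; real throughput will vary with location and load.

```python
# Average requests per minute from a single client, as reported in the table.
# The keys are hypothetical labels, not names used by the SIMAP Web-Service.
REQUESTS_PER_MINUTE = {
    "homologs_simap_xml": 26,
    "homologs_blast_xml": 25,
    "interpro_hits": 37,
}


def estimated_minutes(n_requests: int, method: str) -> float:
    """Estimate the minutes needed for n_requests sequential requests.

    Assumes a constant rate equal to the reported average for the method.
    """
    if n_requests < 0:
        raise ValueError("n_requests must be non-negative")
    return n_requests / REQUESTS_PER_MINUTE[method]


if __name__ == "__main__":
    for method in REQUESTS_PER_MINUTE:
        minutes = estimated_minutes(10_000, method)
        print(f"{method}: {minutes:.1f} min ({minutes / 60:.1f} h)")
```

Under these assumptions, retrieving homologs for 10,000 proteins as SIMAP XML would take roughly 6.4 hours, which suggests that bulk analyses are better served by the flat file downloads mentioned in the abstract.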
Figure 1. Schematic representation of the SIMAP database contents and access facilities.