Thomas Rattei, Patrick Tischler, Stefan Götz, Marc-André Jehl, Jonathan Hoser, Roland Arnold, Ana Conesa, Hans-Werner Mewes.
Abstract
The prediction of protein function as well as the reconstruction of evolutionary genesis employing sequence comparison at large is still the most powerful tool in sequence analysis. Due to the exponential growth of the number of known protein sequences and the subsequent quadratic growth of the similarity matrix, the computation of the Similarity Matrix of Proteins (SIMAP) becomes a computationally intensive task. The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences. Novel features of SIMAP include the expansion of the sequence space by including databases such as ENSEMBL as well as the integration of metagenomes based on their consistent processing and annotation. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP, and the data access and query functions have been improved. SIMAP helps biologists query the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology. Access to SIMAP is freely provided through the web portal for individuals (http://mips.gsf.de/simap/) and for programmatic access through DAS (http://webclu.bio.wzw.tum.de/das/) and Web-Service (http://mips.gsf.de/webservices/services/SimapService2.0?wsdl).
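To make the quadratic growth mentioned in the abstract concrete, the following minimal sketch counts the unordered sequence pairs in a full all-against-all comparison. It assumes a symmetric matrix without self-comparisons and uses the rounded figure of 23 million non-redundant sequences; the function name `pairwise_comparisons` is hypothetical, and the actual SIMAP computation may rely on heuristics or score thresholds that are not described here.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs (i, j) with i < j among n sequences."""
    return n * (n - 1) // 2


# Rounded figure from the abstract (assumption: exactly 23 million sequences).
n_sequences = 23_000_000
pairs = pairwise_comparisons(n_sequences)

# Doubling the number of sequences roughly quadruples the number of pairs.
ratio = pairwise_comparisons(2 * n_sequences) / pairs

if __name__ == "__main__":
    print(f"{n_sequences:,} sequences -> {pairs:,} pairs")
    print(f"doubling the input multiplies the pair count by {ratio:.4f}")
```

Under these assumptions, each doubling of the sequence space costs about four times as many comparisons, which is the motivation for pre-calculating the matrix once and updating it incrementally.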
Year: 2009 PMID: 19906725 PMCID: PMC2808863 DOI: 10.1093/nar/gkp949
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Number of protein entries and non-redundant sequences of the major protein sequence databases included in SIMAP as of September 2009
| Database | Protein entries | Non-redundant sequences |
|---|---|---|
| NCBI GenBank | 16 146 018 | 13 065 886 |
| NCBI RefSeq | 8 181 910 | 6 681 186 |
| Uniprot/TrEMBL | 8 926 016 | 7 586 794 |
| Uniprot/Swissprot | 495 880 | 416 496 |
| PDB | 139 106 | 41 445 |
| PEDANT | 5 480 442 | 5 389 911 |
| ENSEMBL | 1 094 482 | 1 062 197 |
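As a rough illustration of how redundancy differs between the source databases, the sketch below computes the fraction of non-redundant sequences per database from the table above. The counts are copied from the table; the dictionary layout and the function name `nonredundant_fraction` are assumptions made for this example and are not part of SIMAP.

```python
# (protein entries, non-redundant sequences), copied from the table above.
DATABASES = {
    "NCBI GenBank": (16_146_018, 13_065_886),
    "NCBI RefSeq": (8_181_910, 6_681_186),
    "Uniprot/TrEMBL": (8_926_016, 7_586_794),
    "Uniprot/Swissprot": (495_880, 416_496),
    "PDB": (139_106, 41_445),
    "PEDANT": (5_480_442, 5_389_911),
    "ENSEMBL": (1_094_482, 1_062_197),
}


def nonredundant_fraction(entries: int, nonredundant: int) -> float:
    """Share of database entries that remain after removing duplicate sequences."""
    return nonredundant / entries


fractions = {
    name: nonredundant_fraction(*counts) for name, counts in DATABASES.items()
}

# Lowest fraction = most duplicated sequences among the entries.
most_redundant = min(fractions, key=fractions.get)
least_redundant = max(fractions, key=fractions.get)

if __name__ == "__main__":
    for name, fraction in sorted(fractions.items(), key=lambda item: item[1]):
        print(f"{name:<18} {fraction:.3f}")
```

These fractions describe redundancy within each database only; they say nothing about the overlap between databases, which the table does not report.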
Number of metagenomic samples and extracted protein-coding sequences in SIMAP as of September 2009
| Database | Metagenomic samples | Non-redundant sequences |
|---|---|---|
| Camera (JCVI) | 54 | 6 031 109 |
| NCBI GenBank/wgs section | 130 | 4 244 008 |
| IMG/M (JGI) | 65 | 2 833 359 |
Figure 1. Composition of the non-redundant protein sequence space in SIMAP as of September 2009.
Pre-calculated functional annotations in SIMAP as of September 2009
| Method | Number of pre-calculated features |
|---|---|
| InterProScan | 133 829 528 |
| TargetP | 17 205 439 |
| SignalP | 11 060 831 |
| TMHMM | 15 841 454 |
| Phobius | 18 488 832 |
| Blast2GO | 190 801 556 |