Thomas Rattei, Patrick Tischler, Roland Arnold, Franz Hamberger, Jörg Krebs, Jan Krumsiek, Benedikt Wachinger, Volker Stümpflen, Werner Mewes.
Abstract
Protein sequences are the most important source of evolutionary and functional information for new proteins. In order to facilitate the computationally intensive tasks of sequence analysis, the Similarity Matrix of Proteins (SIMAP) database aims to provide a comprehensive and up-to-date dataset of the pre-calculated sequence similarity matrix and sequence-based features like InterPro domains for all proteins contained in the major public sequence databases. As of September 2007, SIMAP covers approximately 17 million proteins and more than 6 million non-redundant sequences and provides a complete annotation based on InterPro 16. Novel features of SIMAP include a new, portlet-based web portal providing multiple, structured views on retrieved proteins and integration of protein clusters and a unique search method for similar domain architectures. Access to SIMAP is freely provided for academic use through the web portal for individuals at http://mips.gsf.de/simap/ and through Web Services for programmatic access at http://mips.gsf.de/webservices/services/SimapService2.0?wsdl.
Year: 2007 PMID: 18037617 PMCID: PMC2238827 DOI: 10.1093/nar/gkm963
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Numbers of proteins (left) and non-redundant sequences (right) covered by the three most important public sequence databases: Uniprot, RefSeq and GenBank as of September 2007.
Figure 2. To explore the homologs of a user-defined query protein, the classical result list, including a graphical representation of the alignments, is shown by default. Additional views allow structuring the homologs by taxonomy or by assignment to sequence clusters.
Figure 3. Example of remote homologs retrieved by the ‘Domain similarity’ tool of SIMAP. When searching the query sequence in the Uniprot database, high E-values result from the low bitscores. Thus, these proteins show insufficient pairwise sequence homology to the query and would not be found by database searches that are typically restricted to a maximal E-value of 10. However, the similar domain architectures suggest a common ancestry of these proteins.
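
The link between low bitscores and high E-values described in the Figure 3 caption can be illustrated with the standard Karlin-Altschul relation for bit scores used by BLAST-like searches, E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database length. The following is a minimal sketch only: it is an assumption that SIMAP E-values follow exactly this form, the sketch ignores the effective-length corrections that real search tools apply, and the function names and the query and database sizes are hypothetical values chosen for illustration.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses the Karlin-Altschul relation for normalized (bit) scores:
    E = m * n * 2^(-S'). Raw lengths are used here; real tools apply
    effective-length corrections that this sketch leaves out.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical sizes for illustration only (not SIMAP or Uniprot figures).
QUERY_LENGTH = 300                # residues in the query protein
DATABASE_LENGTH = 2_000_000_000   # total residues in the searched database


def passes_cutoff(bit_score: float, max_e_value: float = 10.0) -> bool:
    """True if a hit with this bit score survives the E-value cutoff."""
    return expect_value(bit_score, QUERY_LENGTH, DATABASE_LENGTH) <= max_e_value


if __name__ == "__main__":
    # Lower bit scores give exponentially larger E-values.
    for score in (25.0, 35.0, 50.0):
        e_value = expect_value(score, QUERY_LENGTH, DATABASE_LENGTH)
        print(f"bit score {score:5.1f} -> E-value {e_value:.3g}, "
              f"reported: {passes_cutoff(score)}")
```

Because the E-value grows with database size at a fixed bit score, a weak pairwise hit that might be reported against a small database can fall above the cutoff in a large one, which is the situation the domain-architecture comparison is meant to address.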