20 years of the SMART protein domain annotation resource.

Abstract

SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 8 contains manually curated models for more than 1300 protein domains, with approximately 100 new models added since our last update article (1). The underlying protein databases were synchronized with UniProt (2), Ensembl (3) and STRING (4), doubling the total number of annotated domains and other protein features to more than 200 million. In its 20th year, the SMART analysis results pages have been streamlined again and its information sources have been updated. SMART's vector based display engine has been extended to all protein schematics in SMART and rewritten to use the latest web technologies. The internal full text search engine has been redesigned and updated, resulting in greatly increased search speed.

Year: 2018 PMID: 29040681 PMCID: PMC5753352 DOI: 10.1093/nar/gkx922

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The concept of summarizing alignments of proteins, or of domains therein, as profiles, collecting them in a resource and scanning new sequences against it was already implemented in the 1980s, e.g. using regular expressions (PROSITE release 1.0, (5)), alignment profiles (ProfileScan, (6)) or property patterns (7) (ExCell, (8)). The precursor of SMART (Simple Modular Architecture Research Tool), a growing collection of profiles for shuffled extracellular domains, was in use as an in-house tool soon after, i.e. in pre-WWW times (ExCell, (9,10)). With more data and increasing awareness and availability of mobile intracellular protein domains, the first version of SMART, released in 1997, aimed at creating a comprehensive resource for analyzing the modular architecture of proteins; it already used HMM profiles (11) and offered a web interface (12). Despite the focus shift in bioinformatics towards comparative genomics, transcriptomics and network biology, protein domain analysis remains an essential research tool, made easy by various frequently used online domain resources and databases, like Pfam (13), PANTHER (14) or PROSITE (15). Many of these, including SMART, are integrated into InterPro (16). The SMART database (12) integrates manually curated hidden Markov models (11,17) for many domains with a powerful web-based interface offering various analysis and visualization tools. Twenty years after its inception, it remains a popular and widely used tool with close to 50 000 distinct users per month. In the following sections, we give an overview of the major developments and new features introduced since our last update (1).

EXPANDED DOMAIN COVERAGE

SMART was never intended to be exhaustive, and was initially focused on mobile domains. To provide the context of other domains in modular proteins, and to help in functional annotation, it has gradually expanded its domain coverage with each new release. The current version introduces >100 new domains compared to the last version (1), bringing the total to 1302. SMART’s domain annotation includes a significant amount of manual work and expertise, in particular in creating the high-quality underlying multiple sequence alignments and selecting the individual per-domain cut-off values. Other, more exhaustive databases, like Pfam (13), have already annotated many of these domains, but SMART’s own manual annotation pipeline leads to partially different protein annotations, enabling increased hypothesis generation by biologists.
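The role of per-domain cut-off values can be illustrated with a minimal Python sketch. The domain names, E-values and thresholds below are hypothetical and chosen only for illustration; they are not SMART's curated cut-offs, and the fallback threshold is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str
    start: int
    end: int
    evalue: float


# Hypothetical per-domain E-value cut-offs (not SMART's curated values).
CUTOFFS = {"SH3": 1e-3, "SH2": 1e-5}
# Assumed fallback for domains without a dedicated cut-off.
DEFAULT_CUTOFF = 1e-4


def filter_hits(hits, cutoffs=CUTOFFS, default=DEFAULT_CUTOFF):
    """Keep only hits whose E-value is at or below the cut-off of their domain."""
    return [h for h in hits if h.evalue <= cutoffs.get(h.domain, default)]


hits = [
    Hit("SH3", 10, 65, 2e-4),    # passes the lenient SH3 cut-off
    Hit("SH2", 80, 170, 2e-4),   # fails the stricter SH2 cut-off
    Hit("PDZ", 200, 280, 5e-6),  # no dedicated cut-off, passes the default
]
accepted = filter_hits(hits)
print([h.domain for h in accepted])
```

The same E-value is accepted for one domain and rejected for another, which is the practical effect of tuning thresholds per model rather than using a single global value.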

UPDATED PROTEIN DATABASES

The main underlying protein database in SMART combines the complete UniProt (2) with all stable Ensembl (3) proteomes. The current release contains >50 million proteins from around 460 thousand species, subspecies and strains. To minimize the impact of the inherently high redundancy of these databases, we use a per-species clustering method described in (18), which created 2.9 million multi-protein clusters with a total of 5.5 million proteins. In addition to the regular protein database described above, SMART offers a ‘genomic’ analysis mode that contains only proteins from completely sequenced genomes. Synchronized with the current STRING version 10.5 (4), it currently contains approximately 9.6 million proteins from 2031 complete genomes (238 Eukaryota, 1678 Bacteria and 115 Archaea).

NEW PROTEIN VISUALIZATION ENGINE AND UPDATED ANNOTATION PAGES

SMART version 8 introduces a vector-based display engine for protein schematics (‘bubblograms’) throughout the server. Protein schematics in various list displays (such as domain architecture analysis results) are displayed as inline SVG (Scalable Vector Graphics) images, which seamlessly scale to the user's display size, regardless of its resolution. A vector-based protein schematic display applet has been used in the single protein annotation mode since the previous SMART version. However, it was implemented in Adobe Flash, which has recently been discontinued and will not be available in future web browsers. Therefore, we have implemented a new version based on the HTML5 Canvas element, which should have a reasonable lifetime (Figure 1A). Schematics can be zoomed to any level without loss in display quality, and exported as vector or bitmap images at any resolution. A tool box within the interactive viewer provides access to several additional functions, for example allowing users to toggle the display of intron positions or to navigate among various alternative representations of proteins containing overlapping domain predictions.
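As a rough illustration of why inline SVG schematics scale without quality loss, the following Python sketch renders a domain architecture as an SVG string. The function name, colours, dimensions and example domains are assumptions made for this sketch and do not reproduce SMART's actual rendering code.

```python
from xml.sax.saxutils import escape


def protein_svg(length, domains, width=600, height=40):
    """Return an SVG string: a grey backbone line plus one rounded box per domain.

    `domains` is a list of (name, start, end) tuples in 1-based residue coordinates.
    """
    scale = width / length  # residues -> viewBox units
    mid = height / 2
    parts = [
        # viewBox fixes the internal coordinate system; width="100%" lets the
        # browser stretch the drawing to whatever space is available.
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 0 {width} {height}" width="100%">',
        f'<line x1="0" y1="{mid}" x2="{width}" y2="{mid}" '
        f'stroke="#888" stroke-width="4"/>',
    ]
    for name, start, end in domains:
        x = (start - 1) * scale
        w = (end - start + 1) * scale
        parts.append(
            f'<rect x="{x:.1f}" y="5" width="{w:.1f}" height="{height - 10}" '
            f'rx="10" fill="#6baed6"><title>{escape(name)}</title></rect>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical 300-residue protein with two domains.
svg = protein_svg(300, [("SH3", 10, 65), ("SH2", 80, 170)])
print(svg)
```

Because all coordinates are expressed relative to the `viewBox`, the same markup can be embedded in a list view or enlarged for export without recomputing any positions.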

Figure 1.

Example SMART protein annotation and tree export. (A) SMART annotation page for protein CS1_HUMAN. Protein schematic representations are displayed using vector graphics in a Canvas-based applet. Schematics are zoomable without quality loss and exportable into high-resolution SVG or bitmap images. Protein features selected in various data tables are dynamically highlighted directly in the viewer. Using the interactive scale, any protein region can be selected and submitted for further BLAST analysis. (B) An example domain architecture analysis result, exported from SMART directly into the interactive Tree Of Life (iTOL) version 3 (21).

The new protein viewer interactively ties together various parts of the annotation page. Selecting a predicted domain or other feature in any of the data tables will automatically highlight its position in the protein. Since many predicted features are not directly displayed in the protein schematic (mostly due to overlaps), this function simplifies the visual identification of relations among different protein features. Various parts of the protein sequence can be interactively selected independently of the annotated features, and submitted for further BLAST analysis when a more fine-grained evaluation is required. Detailed information about any detected protein feature can be displayed in streamlined floating popup dialogs, enhancing the user experience and reducing the need to navigate across different web pages. A condensed version of the domain annotation pages is included in the dialogs, with optional links to the complete annotation. In addition, several convenience functions are included, allowing users to copy the underlying amino acid sequence to their clipboard, or to submit the subsequence for further BLAST analysis.
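One simple way in which overlapping predictions could be split into displayed and hidden features is sketched below in Python. The greedy best-E-value-first rule is an assumption made for illustration; the rules SMART actually applies are not described here, and the feature names and values are hypothetical.

```python
def select_for_display(features):
    """Greedily keep the best-scoring features that do not overlap ones already kept.

    `features` is a list of (name, start, end, evalue) tuples; a lower E-value
    is better. Returns (shown, hidden), each sorted by start position.
    """
    shown, hidden = [], []
    for feat in sorted(features, key=lambda f: f[3]):
        _, start, end, _ = feat
        # Two closed intervals overlap if each starts before the other ends.
        overlaps = any(
            start <= s_end and s_start <= end for _, s_start, s_end, _ in shown
        )
        (hidden if overlaps else shown).append(feat)
    return (
        sorted(shown, key=lambda f: f[1]),
        sorted(hidden, key=lambda f: f[1]),
    )


# Hypothetical predictions: two overlapping SH3 calls and one SH2 call.
features = [
    ("SH3", 10, 65, 1e-20),
    ("Pfam:SH3_1", 12, 60, 1e-12),
    ("SH2", 80, 170, 1e-30),
]
shown, hidden = select_for_display(features)
print([f[0] for f in shown])
print([f[0] for f in hidden])
```

Features that end up in the hidden list are the ones a viewer would need to highlight on demand, for instance when the corresponding row is selected in a data table.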

UPDATED EXTERNAL INFORMATION SOURCES

Protein orthology data are parsed from the eggNOG database version 4.5 (19) and cover ∼7.5 million proteins from >3500 species. SMART’s annotation pages show a detailed list of all orthologous groups that include the annotated protein, with their description and taxonomic class. Cross links to eggNOG are provided, with detailed overviews of each orthologous group as well as the associated alignments and phylogenetic trees. Data on posttranslational protein modifications, which have been displayed since the last SMART release, have been synchronized with the latest version 2 of the PTMcode database (20). SMART displays the total numbers of various posttranslational modifications annotated in a particular protein, with links to the detailed annotation pages in PTMcode, where users can explore the modifications and their possible functional associations within the protein, as well as with its direct interaction partners.

EXPANDED PROTEIN INTERACTION DATA

With the update of the underlying protein databases, we have also synchronized our protein interaction data with version 10.5 of the STRING database (4). Updated graphical representations of putative interaction partners are now available for >9.5 million proteins.

UPDATED TAXONOMIC TREE DATA EXPORT

Domain architecture analysis functions in SMART allow users to easily access proteins containing combinations of particular domains. These sets can also be generated using combinations of GO terms associated with protein domains, and restricted to various taxonomic classes. In addition to the standard SMART protein schematic visualization, these data can also be exported into FASTA files or phylogenetic trees. The phylogenetic tree export has been completely rewritten and made compatible with version 3 of the Interactive Tree of Life (iTOL) (21), with which these trees and their associated protein domain datasets can be further annotated (Figure 1B). Furthermore, backend taxonomic information used for the tree generation was synchronized with the latest NCBI taxonomy database.
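The two export formats mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each protein comes with a root-to-leaf taxonomic lineage; the function names and example data are hypothetical and do not reflect SMART's export code. The tree is written in Newick format, which iTOL accepts as input.

```python
def build_tree(lineages):
    """Nest lineages (root-to-leaf lists of taxon names) into a dict of dicts."""
    root = {}
    for lineage in lineages:
        node = root
        for taxon in lineage:
            node = node.setdefault(taxon, {})
    return root


def to_newick(tree):
    """Serialize the nested dict as Newick, labelling internal nodes."""
    def fmt(name, children):
        label = name.replace(" ", "_")  # spaces are not allowed in unquoted labels
        if not children:
            return label
        inner = ",".join(fmt(n, c) for n, c in sorted(children.items()))
        return f"({inner}){label}"

    body = ",".join(fmt(n, c) for n, c in sorted(tree.items()))
    # Several top-level taxa need an enclosing pair of parentheses.
    return (f"({body})" if len(tree) > 1 else body) + ";"


def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA with fixed-width lines."""
    lines = []
    for ident, seq in records:
        lines.append(f">{ident}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical lineages of three species in a result set.
lineages = [
    ["Eukaryota", "Metazoa", "Homo sapiens"],
    ["Eukaryota", "Metazoa", "Mus musculus"],
    ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
]
newick = to_newick(build_tree(lineages))
print(newick)
print(to_fasta([("example_protein", "ACDEFGHIKL")], width=4), end="")
```

Sorting the children before serialization keeps the output deterministic, so repeated exports of the same result set yield identical trees.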

BACKEND OPTIMIZATIONS AND EXPANDED SEARCH ENGINE

The backend of SMART is a relational database management system (RDBMS), powered by the PostgreSQL engine, which stores the annotation of all SMART domains, protein annotation and sequences, taxonomy information and the pre-calculated protein analyses for the entire UniProt (2), Ensembl (3) and STRING (4) proteomes. In addition to the predictions of all SMART and Pfam domains, this includes various intrinsic protein features, like signal peptides, transmembrane and coiled coil regions. Due to the constant growth of the number of annotated features, we regularly restructure our backend databases and optimize various parts of the server code in order to keep the user experience satisfactory. Additionally, the server hardware that powers the sequence annotation searches and database queries has been replaced and significantly expanded with additional RAM and CPUs, greatly increasing the processing speed of user-submitted proteins and lowering the overall response times.

SMART’s full text search engine allows users to quickly identify domains or proteins based on their annotation and other associated text. The current version introduces an updated search backend, providing access to a wider array of text information associated with each protein/domain, while offering increased search speed.
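The general idea behind a full text search over annotation text can be illustrated with a small inverted index in Python. This is not the SMART search backend, whose implementation is not described here; the tokenizer, the AND semantics and the example descriptions are assumptions made for this sketch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Illustrative one-line descriptions, not SMART's annotation text.
docs = {
    "SH3": "Src homology 3 domains bind proline-rich peptides",
    "SH2": "Src homology 2 domains bind phosphotyrosine-containing peptides",
    "PDZ": "PDZ domains bind C-terminal peptide motifs",
}
index = build_index(docs)
print(sorted(search(index, "src homology")))
print(sorted(search(index, "proline peptides")))
```

Building the index once and answering queries by set intersection avoids scanning every annotation per query, which is the main reason precomputed text indexes respond quickly.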

CONCLUSION

Since the initial release of SMART more than 20 years ago, our goal has been to provide a useful biological web resource, characterized by a high quality of underlying data and a powerful, simple user interface, even in the context of a very low funding level (on average less than 1 FTE over the last 10 years). We are therefore confident that we will be able to continue to modestly expand our coverage, keep in line with the latest web standards and implement new features to make using SMART a better and more enjoyable experience for both existing and new users.

20 in total

1. Profile analysis: detection of distantly related proteins.

Authors: M Gribskov; A D McLachlan; D Eisenberg
Journal: Proc Natl Acad Sci U S A Date: 1987-07 Impact factor: 11.205

2. Hidden Markov models in computational biology. Applications to protein modeling.

Authors: A Krogh; M Brown; I S Mian; K Sjölander; D Haussler
Journal: J Mol Biol Date: 1994-02-04 Impact factor: 5.469

3. Comprehensive sequence analysis of the 182 predicted open reading frames of yeast chromosome III.

Authors: P Bork; C Ouzounis; C Sander; M Scharf; R Schneider; E Sonnhammer
Journal: Protein Sci Date: 1992-12 Impact factor: 6.725

4. New and continuing developments at PROSITE.

Authors: Christian J A Sigrist; Edouard de Castro; Lorenzo Cerutti; Béatrice A Cuche; Nicolas Hulo; Alan Bridge; Lydie Bougueleret; Ioannis Xenarios
Journal: Nucleic Acids Res Date: 2012-11-17 Impact factor: 16.971

5. SMART: recent updates, new developments and status in 2015.

Authors: Ivica Letunic; Tobias Doerks; Peer Bork
Journal: Nucleic Acids Res Date: 2014-10-09 Impact factor: 16.971

6. Ensembl 2017.

Authors: Bronwen L Aken; Premanand Achuthan; Wasiu Akanni; M Ridwan Amode; Friederike Bernsdorff; Jyothish Bhai; Konstantinos Billis; Denise Carvalho-Silva; Carla Cummins; Peter Clapham; Laurent Gil; Carlos García Girón; Leo Gordon; Thibaut Hourlier; Sarah E Hunt; Sophie H Janacek; Thomas Juettemann; Stephen Keenan; Matthew R Laird; Ilias Lavidas; Thomas Maurel; William McLaren; Benjamin Moore; Daniel N Murphy; Rishi Nag; Victoria Newman; Michael Nuhn; Chuang Kee Ong; Anne Parker; Mateus Patricio; Harpreet Singh Riat; Daniel Sheppard; Helen Sparrow; Kieron Taylor; Anja Thormann; Alessandro Vullo; Brandon Walts; Steven P Wilder; Amonida Zadissa; Myrto Kostadima; Fergal J Martin; Matthieu Muffato; Emily Perry; Magali Ruffier; Daniel M Staines; Stephen J Trevanion; Fiona Cunningham; Andrew Yates; Daniel R Zerbino; Paul Flicek
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

7. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible.

Authors: Damian Szklarczyk; John H Morris; Helen Cook; Michael Kuhn; Stefan Wyder; Milan Simonovic; Alberto Santos; Nadezhda T Doncheva; Alexander Roth; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2016-10-18 Impact factor: 16.971

8. UniProt: the universal protein knowledgebase.

Authors:
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

9. SMART 6: recent updates and new developments.

Authors: Ivica Letunic; Tobias Doerks; Peer Bork
Journal: Nucleic Acids Res Date: 2008-10-31 Impact factor: 16.971

10. The Pfam protein families database: towards a more sustainable future.

Authors: Robert D Finn; Penelope Coggill; Ruth Y Eberhardt; Sean R Eddy; Jaina Mistry; Alex L Mitchell; Simon C Potter; Marco Punta; Matloob Qureshi; Amaia Sangrador-Vegas; Gustavo A Salazar; John Tate; Alex Bateman
Journal: Nucleic Acids Res Date: 2015-12-15 Impact factor: 16.971
