SMART 7: recent updates to the protein domain annotation resource.

Ivica Letunic, Tobias Doerks, Peer Bork.

Abstract

SMART (Simple Modular Architecture Research Tool) is an online resource (http://smart.embl.de/) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 7 contains manually curated models for 1009 protein domains, 200 more than in the previous version. The current release introduces several novel features and a streamlined user interface resulting in a faster and more comfortable workflow. The underlying protein databases were greatly expanded, resulting in a 2-fold increase in the number of annotated domains and features. The database of completely sequenced genomes now includes 1133 species, compared to 630 in the previous release. Domain architecture analysis results can now be exported and visualized through the iTOL phylogenetic tree viewer. 'metaSMART' was introduced as a novel subresource dedicated to the exploration and analysis of domain architectures in various metagenomics data sets. An advanced full text search engine was implemented, covering the complete annotations for SMART and Pfam domains, as well as the complete set of protein descriptions, allowing users to quickly find relevant information.

Year: 2011 PMID: 22053084 PMCID: PMC3245027 DOI: 10.1093/nar/gkr931

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The SMART database (http://smart.embl.de) is now in its 13th year (1), and provides high-quality, manually curated hidden Markov models and alignments of protein domain families. Accessible through a web interface or via various programmatic methods, SMART remains a popular tool for domain annotation and exploration of protein domain architectures, with an average of 200 000 user-submitted proteins analyzed monthly.

IMPROVED DOMAIN COVERAGE

Even though the rate of novel domain discovery is constantly declining (2), SMART gradually expands its domain coverage in each release. The current version 7 introduces more than 200 new domains, bringing the total to 1009 distinct modules that can be searched. Even though many of these domains were already annotated in other databases, like Pfam (3), SMART's domain annotation pipeline relies heavily on manual intervention, making the re-annotation process worthwhile.

UPDATED PROTEIN DATABASES

The number of annotated protein sequences is constantly growing, which also increases the redundancy in the databases. Since protein redundancy significantly skews the number of domains reported both in domain architecture analyses and in comparisons of domain counts across complete genomes, past versions of SMART (4) introduced several features to minimize these problems. The standard protein database used by SMART combines the complete Uniprot protein database (5) with predicted proteins from all stable Ensembl (6) genomes. Since these are inherently highly redundant, SMART implements a per-species clustering method (7) to minimize the redundancy in the final database. Even so, the updated version currently contains more than 11 million proteins from around 150 thousand species, subspecies and varieties. Additionally, SMART offers a ‘genomic’ analysis mode that contains only proteins from completely sequenced genomes. Synchronized with STRING version 9 (8), this database has been significantly expanded, and contains 1133 complete genomes (121 Eukaryota, 943 Bacteria and 69 Archaea).

NOVEL ARCHITECTURE ANALYSIS DATA EXPORT AND VISUALIZATION FEATURES

Domain architecture analysis functions in SMART allow users to easily access proteins containing combinations of particular domains. These queries can also be built using combinations of GO terms (9) associated with protein domains, and restricted to various taxonomic classes. Previous versions of SMART allowed users to download these selected proteins as FASTA formatted files or to display them through schematic representations (SMART ‘bubblograms’). SMART 7 offers a new data export function for domain architecture analysis, which is tightly coupled with iTOL [interactive Tree Of Life (10,11)], our phylogenetic tree visualization tool. Data are exported into two separate files, which can be directly used by iTOL: a Newick formatted phylogenetic tree and a protein domain data set file used to visualize proteins on the tree. The procedure is as follows: an initial list of proteins is obtained through an architecture analysis query; proteins are grouped according to their species of origin; these species are used to ‘prune’ the complete NCBI taxonomy database (12) by walking the taxonomy tree up to the root and exporting the resulting structure into a Newick formatted phylogenetic tree; and each protein's domain organization is converted into a plain text format understood by iTOL. The resulting plain text files can be downloaded, or directly visualized in iTOL by a simple button click (Figure 1).

Figure 1.

Displaying SMART protein domain architectures in iTOL. New data export features allow users to easily display domain architecture query results on an NCBI taxonomy based phylogenetic tree. Phylogenetic trees are generated on-the-fly by pruning the NCBI taxonomy database (12) and visualized in interactive Tree Of Life (10). (a) SMART was queried for all proteins containing both CUB and CCP domains. (b) Query results visualized on a phylogenetic tree in iTOL.
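The taxonomy pruning step described above can be illustrated with a minimal Python sketch. It is not SMART's actual implementation: the function names, the child-to-parent dictionary and the toy taxonomy are assumptions made for illustration only. The sketch walks from each species up to the root, keeps only the visited nodes, and writes them out as a Newick string. The second export file, the iTOL domain data set, is not shown because its format is not described here.

```python
from collections import defaultdict


def prune_taxonomy(parent, species):
    """Collect the part of the taxonomy that lies between `species` and the root.

    `parent` maps every node to its parent; the root maps to None.
    Returns the root and a dict of node -> set of retained children.
    """
    children = defaultdict(set)
    visited = set()
    root = None
    for node in species:
        # Walk up until the root or an already visited ancestor is reached.
        while node not in visited:
            visited.add(node)
            up = parent[node]
            if up is None:
                root = node
                break
            children[up].add(node)
            node = up
    return root, children


def to_newick(node, children):
    """Serialize the retained subtree below `node`, labelling internal nodes."""
    kids = sorted(children.get(node, ()))
    if not kids:
        return node
    return "(" + ",".join(to_newick(k, children) for k in kids) + ")" + node


def export_newick(parent, species):
    """Prune the taxonomy to the given species and return a Newick string."""
    if not species:
        raise ValueError("at least one species is required")
    root, children = prune_taxonomy(parent, species)
    return to_newick(root, children) + ";"


# Toy taxonomy (hypothetical, heavily simplified; labels avoid spaces).
toy_parent = {
    "root": None,
    "Eukaryota": "root",
    "Bacteria": "root",
    "Metazoa": "Eukaryota",
    "Fungi": "Eukaryota",
    "Homo_sapiens": "Metazoa",
    "Mus_musculus": "Metazoa",
    "Saccharomyces_cerevisiae": "Fungi",
    "Escherichia_coli": "Bacteria",
}

print(export_newick(toy_parent, {"Homo_sapiens", "Mus_musculus", "Escherichia_coli"}))
# prints ((Escherichia_coli)Bacteria,((Homo_sapiens,Mus_musculus)Metazoa)Eukaryota)root;
```

In this sketch, branches without any queried species (Fungi in the toy example) are dropped, while internal nodes with a single remaining child are kept, which mirrors the idea of exporting the structure encountered on the way to the root.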

EXPANDED PROTEIN INTERACTION DATA

Similar to previous SMART updates, we synchronized our underlying protein interaction data with the latest version of the STRING database (8). Since the number of species in our protein database based on completely sequenced genomes increased almost 2-fold in this release, the information on putative protein interaction partners has also been significantly expanded, and is now available for more than 3.5 million proteins. Interaction network data display has been updated, and uses a streamlined graphical representation, which brings several extra layers of information while being easier to interpret.

metaSMART: BASIC INTEGRATION OF ENVIRONMENTAL SEQUENCING DATA

Metagenomics projects (that is, environmental shotgun sequencing) are constantly increasing the amount of novel, uncharacterized DNA and (fragments of) protein sequences. Functional characterization and annotation of such data remains a daunting task, and various pipelines, such as SmashCommunity (13), are being developed to help scientists in this process. As an initial step toward meaningful integration of these data into SMART, we created ‘metaSMART’. Its primary goal is the exploration and analysis of protein domain architectures in various publicly available metagenomics data sets. Users can compare domain frequencies, co-occurrences and complex architectures across different environments to illustrate how domain variability depends on the habitat. Furthermore, metaSMART allows the exploration of completely novel domain architectures that are so far unique in the databases; analyses of such previously undescribed domain compositions could broaden the knowledge about new protein functions related to their domain interdependency (Figure 2). Four metagenomics data sets are the starting point of metaSMART: Sargasso sea (14), acid mine drainage biofilm (15), Minnesota farm soil (16) and ‘Whale fall’ carcasses (16). We are currently integrating several additional metagenomes [for example, the human gut (17)], which will significantly expand the amount of available information in metaSMART and provide novel biological insights in the context of metagenomics.

Figure 2.

metaSMART, a novel subresource dedicated to the exploration of domain architectures in metagenomics data sets. (a) The metaSMART user interface provides simple access to all available functions. (b) A subset of protein domain architectures present in the Sargasso Sea data set (14). These are not present in other metagenomics data sets or the standard SMART database, and could be pointing to novel functional associations of various domains.
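The kind of comparison described above, finding domain architectures that occur in one data set but in no other, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an architecture is modelled as the ordered tuple of domain names in a protein, the function names are hypothetical, and the toy inputs are invented and do not represent any real metaSMART data.

```python
from collections import Counter


def architecture_counts(proteins):
    """Count architectures; each protein is an ordered list of domain names."""
    return Counter(tuple(domains) for domains in proteins)


def unique_architectures(target, others):
    """Return architectures (with counts) found in `target` but in none of `others`."""
    seen_elsewhere = set()
    for proteins in others:
        seen_elsewhere.update(architecture_counts(proteins))
    counts = architecture_counts(target)
    return {arch: n for arch, n in counts.items() if arch not in seen_elsewhere}


# Invented toy data: three small collections of annotated proteins.
dataset_a = [["CUB", "CCP"], ["SH3", "SH2"], ["CUB", "CCP"], ["PDZ"]]
dataset_b = [["SH3", "SH2"]]
reference = [["PDZ"]]

print(unique_architectures(dataset_a, [dataset_b, reference]))
# prints {('CUB', 'CCP'): 2}
```

Treating the architecture as an ordered tuple means that the same domains in a different order count as a different architecture; an unordered comparison would use a sorted tuple or a frozenset instead.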

DATABASE AND WEB SERVER OPTIMIZATIONS

The backend of SMART is a PostgreSQL-based relational database management system, which stores the annotation of all SMART domains and the pre-calculated protein analyses for the entire Uniprot (18), Ensembl (19) and STRING (8) sequence databases. These include SMART and Pfam domains, as well as several protein intrinsic features, like signal peptides, transmembrane and coiled-coil regions. With close to 50 million annotated features in the current database, we have to constantly find new ways of keeping the response times of the server acceptable. Therefore, the database was restructured and several parts of the database access code have been optimized. Additionally, the hardware cluster that powers the sequence annotation searches and database queries has been refreshed and expanded with additional CPUs.

USER INTERFACE IMPROVEMENTS

Version 7 brings various updates to SMART's web interface. Many parts of the interface have been simplified and compacted, resulting in easier navigation and simpler identification of relevant content. To make SMART more accessible to new users, we added help popup windows to various parts of the interface, making different functions easier to understand. A new full text search engine has been implemented, based on KinoSearch libraries (http://incubator.apache.org/lucy). It indexes the complete annotation pages for all SMART and Pfam domains, as well as Uniprot, Ensembl and STRING protein descriptions, allowing users to quickly identify domains or proteins of interest. Programmatic access to SMART has been extended with an easy-to-parse text-only output mode, allowing simple batch access to the SMART search engine. Ready-to-use example scripts that use the batch access interface are also provided.

FUNDING

EMBL (internal budget) and the European Union under the program ‘FP7 capacities: Scientific Data Repositories’ (grant 213037) (IMproving Protein Annotation and Co-ordination using Technology – IMPACT). Funding for open access charge: EMBL (internal budget).

Conflict of interest statement. None declared.

19 in total

1. SmashCommunity: a metagenomic annotation and analysis tool.

Authors: Manimozhiyan Arumugam; Eoghan D Harrington; Konrad U Foerstner; Jeroen Raes; Peer Bork
Journal: Bioinformatics Date: 2010-10-19 Impact factor: 6.937

2. The Gene Ontology (GO) project: structured vocabularies for molecular biology and their application to genome and expression analysis.

Authors: Judith A Blake; Midori A Harris
Journal: Curr Protoc Bioinformatics Date: 2008-09

3. Enterotypes of the human gut microbiome.

Authors: Manimozhiyan Arumugam; Jeroen Raes; Eric Pelletier; Denis Le Paslier; Takuji Yamada; Daniel R Mende; Gabriel R Fernandes; Julien Tap; Thomas Bruls; Jean-Michel Batto; Marcelo Bertalan; Natalia Borruel; Francesc Casellas; Leyden Fernandez; Laurent Gautier; Torben Hansen; Masahira Hattori; Tetsuya Hayashi; Michiel Kleerebezem; Ken Kurokawa; Marion Leclerc; Florence Levenez; Chaysavanh Manichanh; H Bjørn Nielsen; Trine Nielsen; Nicolas Pons; Julie Poulain; Junjie Qin; Thomas Sicheritz-Ponten; Sebastian Tims; David Torrents; Edgardo Ugarte; Erwin G Zoetendal; Jun Wang; Francisco Guarner; Oluf Pedersen; Willem M de Vos; Søren Brunak; Joel Doré; María Antolín; François Artiguenave; Hervé M Blottiere; Mathieu Almeida; Christian Brechot; Carlos Cara; Christian Chervaux; Antonella Cultrone; Christine Delorme; Gérard Denariaz; Rozenn Dervyn; Konrad U Foerstner; Carsten Friss; Maarten van de Guchte; Eric Guedon; Florence Haimet; Wolfgang Huber; Johan van Hylckama-Vlieg; Alexandre Jamet; Catherine Juste; Ghalia Kaci; Jan Knol; Omar Lakhdari; Severine Layec; Karine Le Roux; Emmanuelle Maguin; Alexandre Mérieux; Raquel Melo Minardi; Christine M'rini; Jean Muller; Raish Oozeer; Julian Parkhill; Pierre Renault; Maria Rescigno; Nicolas Sanchez; Shinichi Sunagawa; Antonio Torrejon; Keith Turner; Gaetana Vandemeulebrouck; Encarna Varela; Yohanan Winogradsky; Georg Zeller; Jean Weissenbach; S Dusko Ehrlich; Peer Bork
Journal: Nature Date: 2011-04-20 Impact factor: 49.962

4. The Pfam protein families database.

Authors: Robert D Finn; Jaina Mistry; John Tate; Penny Coggill; Andreas Heger; Joanne E Pollington; O Luke Gavin; Prasad Gunasekaran; Goran Ceric; Kristoffer Forslund; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman
Journal: Nucleic Acids Res Date: 2009-11-17 Impact factor: 16.971

5. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored.

Authors: Damian Szklarczyk; Andrea Franceschini; Michael Kuhn; Milan Simonovic; Alexander Roth; Pablo Minguez; Tobias Doerks; Manuel Stark; Jean Muller; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

6. Ensembl 2011.

Authors: Paul Flicek; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Yuan Chen; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Leo Gordon; Maurice Hendrix; Thibaut Hourlier; Nathan Johnson; Andreas Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Felix Kokocinski; Eugene Kulesha; Pontus Larsson; Ian Longden; William McLaren; Bert Overduin; Bethan Pritchard; Harpreet Singh Riat; Daniel Rios; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sobral; Giulietta Spudich; Y Amy Tang; Stephen Trevanion; Jana Vandrovcova; Albert J Vilella; Simon White; Steven P Wilder; Amonida Zadissa; Jorge Zamora; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; Jan Vogel; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

7. Database resources of the National Center for Biotechnology Information.

Authors: Eric W Sayers; Tanya Barrett; Dennis A Benson; Evan Bolton; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Scott Federhen; Michael Feolo; Ian M Fingerman; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; David Landsman; David J Lipman; Zhiyong Lu; Thomas L Madden; Tom Madej; Donna R Maglott; Aron Marchler-Bauer; Vadim Miller; Ilene Mizrachi; James Ostell; Anna Panchenko; Lon Phan; Kim D Pruitt; Gregory D Schuler; Edwin Sequeira; Stephen T Sherry; Martin Shumway; Karl Sirotkin; Douglas Slotta; Alexandre Souvorov; Grigory Starchenko; Tatiana A Tatusova; Lukas Wagner; Yanli Wang; W John Wilbur; Eugene Yaschenko; Jian Ye
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

8. Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy.

Authors: Ivica Letunic; Peer Bork
Journal: Nucleic Acids Res Date: 2011-04-05 Impact factor: 16.971

9. The Universal Protein Resource (UniProt) in 2010.

Authors:
Journal: Nucleic Acids Res Date: 2009-10-20 Impact factor: 16.971

10. The universal protein resource (UniProt).

Authors:
Journal: Nucleic Acids Res Date: 2007-11-27 Impact factor: 16.971
