SMART 5: domains in the context of genomes and networks.

Ivica Letunic, Richard R Copley, Birgit Pils, Stefan Pinkert, Jörg Schultz, Peer Bork.

Abstract

The Simple Modular Architecture Research Tool (SMART) is an online resource (http://smart.embl.de/) used for protein domain identification and the analysis of protein domain architectures. Many new features were implemented to make SMART more accessible to scientists from different fields. The new 'Genomic' mode in SMART makes it easy to analyze domain architectures in completely sequenced genomes. Domain annotation has been updated with a detailed taxonomic breakdown and a prediction of the catalytic activity for 50 SMART domains is now available, based on the presence of essential amino acids. Furthermore, intrinsically disordered protein regions can be identified and displayed. The network context is now displayed in the results page for more than 350 000 proteins, enabling easy analyses of domain interactions.

Substances: Multiprotein Complexes

Year: 2006 PMID: 16381859 PMCID: PMC1347442 DOI: 10.1093/nar/gkj079

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

When the Simple Modular Architecture Research Tool (SMART) database was first made public 8 years ago (1), the current extent of completely sequenced genomes was little more than a dream. In the last few years, the astonishing successes of whole-organism approaches to biology have not been limited to sequencing efforts; they also include techniques such as the high-throughput identification of protein–protein interactions, which have created new opportunities and higher expectations for computational approaches to interpreting biological sequences. In the last 2 years, we have been developing new ways of meeting these challenges. The basic data of SMART are high-quality, manually derived alignments of protein domain families. Converted into hidden Markov models (2), these alignments allow us to identify protein domains in sequence databases; the results are stored in a database accessible via a simple web interface (http://smart.embl.de/). The data provide a framework for understanding the evolution and function of genes and proteins throughout the living world. Whereas the SMART philosophy has been to include essentially all available protein sequences, we recognize that many users are interested primarily in the biology of a particular organism. Accordingly, we have developed new views that are more tightly integrated with genome data. These new genome views allow further cross-referencing with protein–protein interaction maps, making SMART an invaluable tool for systems biologists to interpret pathways and networks.

REDUCED PROTEIN DATABASE REDUNDANCY AND ‘GENOMIC’ MODE

Owing to the nature of our source databases (Swiss-Prot, SP-TrEMBL and Ensembl) (3,4), the protein database in SMART has significant redundancy, even though identical proteins are removed. Different proteins and fragments in the source databases often correspond to the same gene. Users who explore domain architectures or compare domain counts across genomes are particularly affected by this problem, because the numbers they obtain are often inflated and unrealistic. To overcome this problem, we extended SMART with a new operating mode, ‘Genomic’ mode. The main difference between normal and genomic mode in SMART is the underlying protein database. In genomic mode, only the proteins from 170 completely sequenced genomes are included (a full list is available on the SMART website). Swiss-Prot (3) is our main source database of genomic data, together with Ensembl (4) for metazoan genomes. This database has minimal redundancy and is therefore particularly useful for whole-genome studies of domain architectures or single-domain distributions.
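The effect of redundancy on domain counts can be illustrated with a minimal sketch. The protein records, gene identifiers and the rule of keeping the entry with the most domains per gene are all assumptions made for illustration; the text above does not describe how SMART builds its genomic database beyond restricting it to completely sequenced genomes.

```python
# Hypothetical records: (protein_id, gene_id, domain architecture).
# All identifiers are invented for illustration only.
PROTEINS = [
    ("P1", "geneA", ("SH3", "SH2", "TyrKc")),
    ("P1_frag", "geneA", ("SH2", "TyrKc")),  # fragment of the same gene
    ("P2", "geneB", ("SH2",)),
    ("P2_alt", "geneB", ("SH2",)),           # second entry for the same gene
]


def count_domain(records, domain):
    """Count the records whose architecture contains `domain`."""
    return sum(1 for _, _, arch in records if domain in arch)


def one_protein_per_gene(records):
    """Keep one representative per gene: the entry with the most domains.

    This selection rule is an assumption, not SMART's documented procedure.
    """
    best = {}
    for rec in records:
        gene = rec[1]
        if gene not in best or len(rec[2]) > len(best[gene][2]):
            best[gene] = rec
    return list(best.values())


redundant_count = count_domain(PROTEINS, "SH2")
genomic_count = count_domain(one_protein_per_gene(PROTEINS), "SH2")
print(redundant_count, genomic_count)
```

In this toy data set the redundant count is twice the per-gene count, which is the kind of inflation the genomic mode is meant to avoid.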

PREDICTION OF CATALYTIC ACTIVITY

To improve function prediction for single domains, we annotated the essential catalytic sites of all enzymatic domains in SMART. These were extracted from structural reports in the primary literature, wherever the catalytic mechanism was known (5). Protein sequences can now be scanned for the presence of these important catalytic amino acids (Figure 1). The absence of any one of them very likely results in loss of catalytic activity. It has recently become apparent that many domains homologous to signaling enzymes seem to have lost their catalytic ability, although they are evolutionarily conserved. Instead of a catalytic function, these domains appear to play a role in regulatory processes. This trend is especially obvious in the protein tyrosine phosphatase family (5). The inclusion of catalytic amino acid residues in the database will allow a more rapid identification of inactive enzyme homologs in the future.
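The residue check described above can be sketched as follows. This is a simplified illustration under stated assumptions: the site table, the positions and the example sequences are hypothetical, and positions are taken relative to the start of the domain, whereas a real implementation would need to map them through the domain alignment.

```python
# Hypothetical annotation: 1-based position within the domain -> residues
# that are compatible with catalysis. Not taken from the SMART database.
ESSENTIAL_SITES = {3: {"D"}, 7: {"N", "D"}}


def missing_catalytic_residues(domain_seq, essential_sites):
    """Return (position, observed residue) for every unsatisfied site."""
    missing = []
    for pos, allowed in sorted(essential_sites.items()):
        # A position beyond the end of the sequence is reported as a gap.
        residue = domain_seq[pos - 1] if pos <= len(domain_seq) else "-"
        if residue not in allowed:
            missing.append((pos, residue))
    return missing


def activity_label(domain_seq, essential_sites):
    """Label a domain 'inactive' if any essential residue is missing."""
    if missing_catalytic_residues(domain_seq, essential_sites):
        return "inactive"
    return "active"


print(activity_label("AKDLLGNPR", ESSENTIAL_SITES))
print(missing_catalytic_residues("AKALLGNPR", ESSENTIAL_SITES))
```

Returning the list of unsatisfied sites, rather than only a label, mirrors the idea of reporting which amino acids were not detected.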

Figure 1

Prediction of catalytic activity in SMART. The first guanylyl cyclase domain in human adenylate cyclase type III (ENSP00000260600) is marked as ‘inactive’ because the two amino acids required for its activity are not present. The domain annotation page shows which amino acids were not detected and gives pointers to the relevant literature.

DOMAIN ARCHITECTURE INVENTION DATING

As a further step from single domains towards an understanding of multidomain proteins, SMART now predicts the taxonomic class in which the concept of a protein, that is, its domain architecture, was invented. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive the point of its invention, all proteins with the same domain architecture are mapped onto the NCBI taxonomy (6). The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as its point of origin. From knowledge of the origin of a domain architecture, one might infer its presence and distribution in genomes that are not yet, or only incompletely, sequenced. In addition, conclusions about the general function of domain architectures can be drawn.
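The last-common-ancestor step can be sketched in a few lines. The lineages below are heavily simplified stand-ins for NCBI taxonomy paths, and the function name is hypothetical; the sketch only assumes that each organism is represented by its root-to-leaf list of taxa.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    lca = None
    # zip stops at the shortest lineage, so the comparison stays in range.
    for taxa in zip(*lineages):
        if all(t == taxa[0] for t in taxa):
            lca = taxa[0]
        else:
            break
    return lca


# Simplified, illustrative lineages (not complete NCBI taxonomy paths).
LINEAGES = {
    "Homo sapiens": ["cellular organisms", "Eukaryota", "Metazoa", "Chordata"],
    "Drosophila melanogaster": ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"],
    "Saccharomyces cerevisiae": ["cellular organisms", "Eukaryota", "Fungi", "Ascomycota"],
}

# Organisms assumed to contain at least one protein with the architecture.
animals_only = [LINEAGES["Homo sapiens"], LINEAGES["Drosophila melanogaster"]]
with_yeast = animals_only + [LINEAGES["Saccharomyces cerevisiae"]]

print(last_common_ancestor(animals_only))
print(last_common_ancestor(with_yeast))
```

Adding a single fungal protein with the same architecture moves the inferred origin from Metazoa to Eukaryota, which shows how sensitive the dating is to the set of organisms sampled.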

PROTEIN INTERACTION DATA

The latest version of SMART provides information about putative interaction partners for more than 350 000 proteins (Figure 2). This information is imported from the STRING database (7), in which known and predicted protein–protein associations are integrated from a variety of sources. The interactors are shown in SMART in the form of a summary graphic (network); the various types of interaction evidence are depicted as lines of different colors in the network. Clicking on the graphic will launch the STRING website, where the underlying evidence can be studied in detail. The interactions in STRING include physical binding interactions, as well as functional associations, such as membership in a common pathway or process. The data are derived from a variety of sources, including knowledge bases, such as BIND (8), KEGG (9), HPRD (10) and Reactome (11), as well as in silico prediction approaches and automated text-mining. STRING aims to improve usability of the interactome by scoring and ranking interaction data (making a confidence estimate on each prediction), as well as by transferring interaction knowledge between model organisms where applicable. SMART and STRING are both cross-referenced through a common set of proteins and genomes, and STRING in turn uses domain information from the SMART server in its pages as well.
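Ranking putative partners by confidence can be sketched as below. The partner names, evidence labels, scores and the 0.7 cutoff are hypothetical values chosen for illustration; they are not STRING data, and the sketch does not reproduce how STRING computes its confidence estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    partner: str
    evidence: str      # illustrative label, e.g. "experimental"
    confidence: float  # assumed to lie between 0 and 1


def ranked_partners(interactions, min_confidence=0.7):
    """Keep the best-scoring evidence per partner and sort by confidence."""
    best = {}
    for it in interactions:
        if it.confidence >= min_confidence and it.confidence > best.get(it.partner, 0.0):
            best[it.partner] = it.confidence
    # Highest confidence first; ties are broken alphabetically.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical example records.
EXAMPLE = [
    Interaction("GRB2", "experimental", 0.95),
    Interaction("SOS1", "text-mining", 0.60),
    Interaction("SHC1", "database", 0.80),
    Interaction("GRB2", "text-mining", 0.75),
]

print(ranked_partners(EXAMPLE))
```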

Figure 2

Interaction networks in SMART. Around 350 000 protein annotation pages include an interaction network in a pop-up window. Networks are linked to the STRING database, which provides the data.

NEW DATABASE FEATURES

The core of SMART is a relational database management system (RDBMS) that stores information on SMART domains (1,12). Owing to the exponentially increasing amount of data, many parts of the database access code have been updated or completely rewritten, resulting in greatly improved response times, most noticeably in the domain architecture analysis operations. The SMART database includes information on domain presence in all proteins in a non-redundant database, now with added data on the catalytic activity of 50 catalytic domains. All domain architecture analysis results include this information, and domains with missing essential amino acids are overlaid with the word ‘inactive’ (Figure 1). The domain annotation page provides detailed information on which of the required amino acids are missing and gives pointers to the relevant literature.

NEW ANALYSIS METHODS

Predictions of intrinsic protein disorder from DisEMBL (13) have been incorporated into SMART's analysis methods. DisEMBL is a computational tool for the prediction of disordered/unstructured regions within a protein sequence. The predictions included in SMART are based on two definitions of disorder: missing coordinates in X-ray structures, as defined by REMARK 465 entries in PDB, and the ‘Hot loops’ method. Hot loops constitute a refined subset of the standard loops/coils as defined by DSSP (14), namely, those loops with a high degree of mobility as determined from C-α temperature factors (B-factors).
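The ‘Hot loops’ definition can be illustrated with a short sketch that labels residues in a known structure. It assumes that the DSSP states H, G and E count as ordered and every other state counts as loop/coil, and it uses an arbitrary B-factor cutoff; both choices, and the example data, are assumptions. The sketch shows how such labels are defined, not how DisEMBL predicts them from sequence.

```python
# Assumption: DSSP helix (H, G) and strand (E) states are treated as ordered;
# every other state is treated as loop/coil.
ORDERED_STATES = {"H", "G", "E"}


def hot_loop_mask(dssp_states, ca_bfactors, bfactor_cutoff):
    """Flag residues that are loops and have a C-alpha B-factor above cutoff."""
    if len(dssp_states) != len(ca_bfactors):
        raise ValueError("one B-factor is required per residue")
    return [
        state not in ORDERED_STATES and bfactor > bfactor_cutoff
        for state, bfactor in zip(dssp_states, ca_bfactors)
    ]


# Hypothetical per-residue data for a 12-residue fragment.
states = "HHHHTTS-EEEE"
bfactors = [12.0, 14.0, 13.0, 15.0, 41.0, 45.0, 22.0, 50.0, 18.0, 16.0, 15.0, 60.0]

mask = hot_loop_mask(states, bfactors, bfactor_cutoff=40.0)
hot_positions = [i for i, is_hot in enumerate(mask) if is_hot]
print(hot_positions)
```

The last residue has a high B-factor but sits in a strand, so it is not flagged; only mobile residues within loops qualify.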

USER INTERFACE IMPROVEMENTS AND TECHNICAL CHANGES

SMART's user interface was completely rewritten and is now fully compliant with the latest web standards, such as XHTML 1.0 and CSS2. Users with standards-compliant web browsers benefit fully from the extra speed and features. Owing to increasing server load, the queuing system was completely rewritten and the hardware greatly expanded, resulting in more stable operation and faster response times. An important new feature is the introduction of taxonomic trees into SMART. The two primary uses of taxonomic trees in SMART are the grouping of domain architecture query results and the detailed taxonomic distribution of domains now shown on domain annotation pages (Figure 3). The grouping of architecture query results allows users to easily display only proteins from certain species or taxonomic nodes. The taxonomic distribution of proteins on domain annotation pages gives a detailed overview of domain presence in different species and taxa.

Figure 3

Taxonomic trees in SMART. (a) Domain architecture query results grouped into a tree. Users can select individual proteins or taxonomic nodes to display. (b) Domain annotation pages show detailed domain and protein counts in various taxonomic nodes.
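Grouping query results into a taxonomic tree can be sketched with nested dictionaries. The hit list, lineages and function names are hypothetical and the lineages are simplified; the sketch only illustrates counting proteins per taxonomic node and selecting all proteins below a chosen node.

```python
def _new_node():
    return {"count": 0, "proteins": [], "children": {}}


def build_taxonomy_tree(hits):
    """Group (protein_id, lineage) hits into a tree with per-node counts."""
    tree = _new_node()
    for protein_id, lineage in hits:
        node = tree
        node["count"] += 1
        for taxon in lineage:
            node = node["children"].setdefault(taxon, _new_node())
            node["count"] += 1
        node["proteins"].append(protein_id)  # stored at the species node
    return tree


def proteins_under(tree, path):
    """Return all protein ids at or below the node reached by `path`."""
    node = tree
    for taxon in path:
        node = node["children"][taxon]
    collected, stack = [], [node]
    while stack:
        current = stack.pop()
        collected.extend(current["proteins"])
        stack.extend(current["children"].values())
    return sorted(collected)


# Hypothetical query results with simplified lineages.
HITS = [
    ("P1", ["Eukaryota", "Metazoa", "Homo sapiens"]),
    ("P2", ["Eukaryota", "Metazoa", "Mus musculus"]),
    ("P3", ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"]),
    ("P4", ["Bacteria", "Escherichia coli"]),
]

tree = build_taxonomy_tree(HITS)
print(tree["children"]["Eukaryota"]["count"])
print(proteins_under(tree, ["Eukaryota", "Metazoa"]))
```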

14 in total

1. SMART 4.0: towards genomic data integration.

Authors: Ivica Letunic; Richard R Copley; Steffen Schmidt; Francesca D Ciccarelli; Tobias Doerks; Jörg Schultz; Chris P Ponting; Peer Bork
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. The KEGG resource for deciphering the genome.

Authors: Minoru Kanehisa; Susumu Goto; Shuichi Kawashima; Yasushi Okuno; Masahiro Hattori
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.

Authors: Brigitte Boeckmann; Amos Bairoch; Rolf Apweiler; Marie-Claude Blatter; Anne Estreicher; Elisabeth Gasteiger; Maria J Martin; Karine Michoud; Claire O'Donovan; Isabelle Phan; Sandrine Pilbout; Michel Schneider
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. Inactive enzyme-homologues find new function in regulatory processes.

Authors: Birgit Pils; Jörg Schultz
Journal: J Mol Biol Date: 2004-07-09 Impact factor: 5.469

5. SMART, a simple modular architecture research tool: identification of signaling domains.

Authors: J Schultz; F Milpetz; P Bork; C P Ponting
Journal: Proc Natl Acad Sci U S A Date: 1998-05-26 Impact factor: 11.205

6. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

Authors: W Kabsch; C Sander
Journal: Biopolymers Date: 1983-12 Impact factor: 2.505

7. Hidden Markov models in computational biology. Applications to protein modeling.

Authors: A Krogh; M Brown; I S Mian; K Sjölander; D Haussler
Journal: J Mol Biol Date: 1994-02-04 Impact factor: 5.469

8. Human protein reference database as a discovery resource for proteomics.

Authors: Suraj Peri; J Daniel Navarro; Troels Z Kristiansen; Ramars Amanchy; Vineeth Surendranath; Babylakshmi Muthusamy; T K B Gandhi; K N Chandrika; Nandan Deshpande; Shubha Suresh; B P Rashmi; K Shanker; N Padma; Vidya Niranjan; H C Harsha; Naveen Talreja; B M Vrushabendra; M A Ramya; A J Yatish; Mary Joy; H N Shivashankar; M P Kavitha; Minal Menezes; Dipanwita Roy Choudhury; Neelanjana Ghosh; R Saravana; Sreenath Chandran; Sujatha Mohan; Chandra Kiran Jonnalagadda; C K Prasad; Chandan Kumar-Sinha; Krishna S Deshpande; Akhilesh Pandey
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

9. Protein disorder prediction: implications for structural proteomics.

Authors: Rune Linding; Lars Juhl Jensen; Francesca Diella; Peer Bork; Toby J Gibson; Robert B Russell
Journal: Structure Date: 2003-11 Impact factor: 5.006

10. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Wolfgang Helmberg; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Joan U Pontius; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
