The SWISS-MODEL Repository and associated resources.

Florian Kiefer¹, Konstantin Arnold, Michael Künzli, Lorenza Bordoli, Torsten Schwede.

Abstract

SWISS-MODEL Repository (http://swissmodel.expasy.org/repository/) is a database of 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated 3D protein models generated by automated homology modelling for all sequences in Swiss-Prot and for relevant model organisms. Regular updates ensure that target coverage is complete, that models are built using the most recent sequence and template structure databases, and that improvements in the underlying modelling pipeline are fully utilised. As of September 2008, the database contains 3.4 million entries for 2.7 million different protein sequences from the UniProt database. SWISS-MODEL Repository allows users to assess the quality of the models in the database, to search for alternative template structures, and to build models interactively via SWISS-MODEL Workspace (http://swissmodel.expasy.org/workspace/). Annotation of models with functional information and cross-linking with other databases such as the Protein Model Portal (http://www.proteinmodelportal.org) of the PSI Structural Genomics Knowledge Base facilitates the navigation between protein sequence and structure resources.

Year: 2008 PMID: 18931379 PMCID: PMC2686475 DOI: 10.1093/nar/gkn750

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Three-dimensional protein structures are crucial for understanding protein function at a molecular level. In recent years, tremendous progress in experimental techniques for large-scale protein structure determination by X-ray crystallography and NMR has been achieved. Structural genomics efforts have contributed significantly to the elucidation of novel protein structures (1), and to the development of technologies that have increased the speed and success rate at which structures can be determined and lowered the cost of the experiments (2,3). However, the number of known protein sequences is growing even faster, as large-scale sequencing projects such as the Global Ocean Sampling expedition produce sequence data at an unprecedented rate (4). Consequently, the last release of the UniProt (5) protein knowledgebase (version 14.0) contained more than 6.5 million sequences, which is about 100 times the number of protein structures currently deposited in the Protein Data Bank (6) (∼53 000, September 2008). For the foreseeable future, stable and reliable computational approaches for protein structure modelling will therefore be required to derive structural information for the majority of proteins, and a broad variety of in silico methods for protein structure prediction has been developed in recent years. Homology (or comparative) modelling techniques have been shown to provide the most accurate models in cases where experimental structures related to the protein of interest are available. Although the number of protein sequence families increases at a rate that is linear or almost linear with the addition of new sequences (4), the number of distinct protein folds in nature is limited (1,7) and the growth in the complexity of protein families appears as a result of the combination of domains (M. Levitt, manuscript in preparation). 
Achieving complete structural coverage of whole proteomes (on the level of individual soluble domain structures) by combining experimental and comparative modelling techniques therefore appears to be a realistic goal, and is already being pursued, e.g. by the Joint Center for Structural Genomics (JCSG) for the small model organism Thermotoga maritima (8,9). Assessment of the accuracy of methods for protein structure prediction, e.g. during the biennial CASP (Critical Assessment of Techniques for Protein Structure Prediction) experiments (10,11) or the automated EVA project (12), has demonstrated that comparative protein structure modelling is currently the most accurate technique for prediction of the 3D structure of proteins. During the CASP7 experiment, it became apparent that the best fully automated modelling methods have improved to a level where they challenge most human predictors in producing the most accurate models (13–15). Nowadays, comparative protein structure models are often sufficiently accurate to be employed for a wide spectrum of biomedical applications, such as structure-based drug design (16–20), functional characterization of diverse members of a protein family (21), rational protein engineering, e.g. the humanization of therapeutic antibodies, or the study of functional properties of proteins (22–26). Here, we describe the SWISS-MODEL Repository, a database of annotated protein structure models generated by the SWISS-MODEL Pipeline, and a set of associated web-based services that facilitate protein structure modelling and assessment. We emphasize the improvements of the SWISS-MODEL Repository which have been implemented since our last report (27). 
These include a new pipeline for template selection, the integration with interactive tools in the SWISS-MODEL Workspace, the programmatic access via DAS (distributed annotation system) (28), the implementation of a reference frame for protein sequences based on md5 cryptographic hashes, and the integration with the Protein Model Portal (http://www.proteinmodelportal.org) of the PSI Structural Genomics Knowledge Base (29,30).

REPOSITORY CONTENTS, ACCESS AND INTERFACE

Homology modelling

The SWISS-MODEL Repository contains models that are calculated using a fully automated homology modelling pipeline. Homology modelling typically consists of the following steps: selection of a suitable template, alignment of target sequence and template structure, model building, energy minimization and/or refinement, and model quality assessment. This requires a set of specialized software tools as well as up-to-date sequence and structure databases. The SWISS-MODEL pipeline (version 8.9) integrates these steps into a fully automated workflow by combining the required programs in a Perl-based framework. Since template search and selection is a crucial step for successful model building, we have implemented a hierarchical template search and selection protocol, which is sufficiently fast to be used for automated large-scale modelling, sensitive in detecting low-homology targets, and accurate in correctly identifying close target structures. In the first step, segments of the target sequence sharing close similarity to known protein structures are identified using a conservative BLAST (31) search with restrictive parameters [E-value cut-off: 10⁻⁵, 60% minimum sequence identity to sequences of the SWISS-MODEL Template Library SMTL (32)]. This ensures that information about close sequence relationships is not dispersed by the subsequent profile-based search strategies (33). If regions of the target sequence remain uncovered, in the second step a search for suitable templates is performed against a library of Hidden Markov Models for SMTL using HHSearch (14). Templates resulting from both steps are ranked according to their E-value, sequence identity to the target, resolution and structure quality. From this ranked list, the best templates are progressively selected to maximize the length of the modelled region of the protein. 
New templates are added if they significantly increase the coverage of the target sequence (spanning at least 25 consecutive residues), or new information is gained (e.g. templates spanning several domains help to infer relative domain orientation). For each selected target–template alignment, 3D models are calculated using ProModII (34) and energy minimized using the Gromos force field (35). The quality of the resulting model is assessed using the ANOLEA mean force potential (36). Depending on the size of the protein and the evolutionary distance to the template, model building can be relatively time-consuming. Therefore, comprehensive databases of pre-computed models (27,37,38) have been developed in order to be able to cross-link real-time model information with other biological data resources, such as sequence databases or genome browsers.
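
The progressive selection step described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code (which is Perl-based and not shown in this article): the `Hit` record, the function names and the ranking key (E-value, then sequence identity) are assumptions made for illustration, and the criteria of resolution, structure quality and "new information gained" are left out. Only the 25-residue threshold is taken from the text.

```python
from dataclasses import dataclass

# From the text: a new template must add at least 25 consecutive residues.
MIN_NEW_RESIDUES = 25


@dataclass(frozen=True)
class Hit:
    """One target-template alignment (all field names are illustrative)."""
    template: str
    start: int       # first covered target residue, 1-based, inclusive
    end: int         # last covered target residue, inclusive
    evalue: float
    identity: float  # percent sequence identity to the target


def longest_new_run(hit, covered):
    """Longest run of consecutive target residues in `hit` not yet covered."""
    best = run = 0
    for pos in range(hit.start, hit.end + 1):
        run = 0 if pos in covered else run + 1
        best = max(best, run)
    return best


def select_templates(hits):
    """Greedily pick templates from a ranked list to extend target coverage."""
    # Assumed ranking: lower E-value first, then higher sequence identity.
    ranked = sorted(hits, key=lambda h: (h.evalue, -h.identity))
    covered, selected = set(), []
    for hit in ranked:
        if longest_new_run(hit, covered) >= MIN_NEW_RESIDUES:
            selected.append(hit)
            covered.update(range(hit.start, hit.end + 1))
    return selected


# Hypothetical hits, invented for this example.
hits = [
    Hit("tmplA", 1, 120, 1e-40, 72.0),
    Hit("tmplB", 100, 135, 1e-20, 65.0),  # adds only 15 new residues
    Hit("tmplC", 150, 300, 1e-10, 35.0),
]
print([h.template for h in select_templates(hits)])  # ['tmplA', 'tmplC']
```

In this toy input, the second hit is skipped because it extends the region already covered by the first hit by only 15 residues, which is below the 25-residue threshold.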

Model database

The SWISS-MODEL Repository is a relational database of models generated by the automated SWISS-MODEL pipeline based on protein sequences from the UniProt database (5). Within the database, model target sequences are uniquely identified by the md5 cryptographic hash of the full-length raw amino acid sequence. This mechanism allows the redundancy in protein sequence database entries to be reduced, and facilitates cross-referencing with databases using different accession code systems. The mapping from UniProt and various other database accession code systems to our md5-based reference system is derived from the iProClass database (39). Regular updates are performed for all protein sequences in the Swiss-Prot database (40), as well as complete proteomes of several model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Caenorhabditis elegans and Hepacivirus). Incremental updates are performed on a regular basis in order both to include new target sequences from the UniProt database and to take advantage of newly available template structures, whereas full updates are required when major improvements to the underlying modelling algorithms have been made. The current SWISS-MODEL Repository release contains 3.45 million models for 2.72 million unique sequences, built on 26 185 different template structures (34 540 chains), covering 48.8% of the entries from UniProt (14.0), and more specifically 65.4% of the unique sequences of Swiss-Prot (56.0), the manually annotated section of the UniProt knowledgebase. The size of the models ranges from 25 up to 2059 residues (e.g. fatty acid synthase β-subunit from Thermomyces lanuginosus) with an average model length of 221 residues.
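
As an illustration of the md5-based reference frame, the following Python sketch derives a sequence key and groups accession codes that share an identical sequence. The normalisation applied before hashing (removing whitespace, upper-casing) is an assumption, since the exact convention used by the Repository is not stated here; the function names and the example accessions are hypothetical.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Return the hex md5 digest of a raw amino acid sequence.

    Assumption: whitespace is removed and letters are upper-cased before
    hashing; the exact normalisation is not specified in the text.
    """
    raw = "".join(sequence.split()).upper()
    return hashlib.md5(raw.encode("ascii")).hexdigest()


def group_by_sequence(entries):
    """Map each md5 key to the accession codes that share that sequence."""
    index = {}
    for accession, sequence in entries:
        index.setdefault(sequence_md5(sequence), []).append(accession)
    return index


# Hypothetical accession codes and sequences, invented for this example.
entries = [
    ("ACC_A", "MKTAYIAKQR"),
    ("ACC_B", "mktay iakqr"),   # same sequence, different formatting
    ("ACC_C", "MKTAYIAKQK"),    # differs in the last residue
]
for key, accessions in group_by_sequence(entries).items():
    print(key, accessions)
```

Because the key depends only on the sequence itself, two entries with different accession codes but identical primary sequence collapse onto one record, while a single-residue difference yields a different key.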

Graphical user web interface

The web interface at http://swissmodel.expasy.org/repository/ provides the main entry point to the SWISS-MODEL Repository. Models for specific proteins can be queried using different database accession codes (e.g. UniProt AC and ID, GenBank, IPI, RefSeq) or directly with the protein amino acid sequence (or fragments thereof, e.g. for a specific domain). For a given target protein, a graphical overview illustrating the segments for which models (or experimental structures) are available is shown (Figure 1). Functional and domain annotation for the target protein is retrieved dynamically in real time using web service protocols to ensure that the annotation information is up-to-date. UniProt annotation of the target protein is retrieved via REST queries (http://www.uniprot.org). Structural domains in the target protein are annotated by PFAM domain assignment (41), which is retrieved dynamically by querying the InterPro (42) database using the DAS protocol (28). The md5-based reference frame for target proteins allows the database accession mappings to be updated in between modelling release cycles. This ensures that cross-references with functional annotation resources such as InterPro correspond to proteins of identical primary sequence, thereby avoiding commonly observed problems with incorrect cross-references as a result of unstable accession codes or asynchronous updates of different data resources. Finally, for each model, a summary page provides information on the modelling process (template selection and alignment), model quality assessment by ANOLEA (36) and Gromos (35), and in-page visualization of the structure using the Astex Viewer (43) plugin.

Figure 1.

Typical view of a SWISS-MODEL Repository entry. For the UniProt entry P53354, the α-amylase I (EC 3.2.1.1; 1,4-α-D-glucan glucanohydrolase) from Aedes aegypti (yellow fever mosquito), a model covering the active amylase domain is shown, including information on the template structure used for model building, the target–template sequence alignment, and quality assessment of the model. Functional annotation such as PFAM domain structure and UniProt annotation of the protein sequence is retrieved dynamically. Links to SWISS-MODEL Workspace enable the user to run additional model quality assessment tools on the model, or search the template library for alternative template structures.

Integration with SWISS-MODEL Workspace

The SWISS-MODEL Repository is a large-scale database of pre-computed 3D models. Often, however, one may be interested in performing additional analyses either on the models themselves, or on the underlying protein target sequence. We have therefore implemented a tight link between the entries of the SWISS-MODEL Repository and the corresponding modules in the SWISS-MODEL Workspace, which provides an interactive, web-based, personalized working environment (32,34,44). Besides the functionality for building protein models, it provides various modules to assess protein structures and models. The estimation of the quality of a protein model is an important step to assess its usefulness for specific applications. In particular, models based on template structures sharing low sequence identity require careful evaluation. Therefore, entries from the Repository can be directly submitted to the Workspace for quality assessment using different global and local quality scores such as DFire (45), ProQRes (46) or QMEAN (47). The default output format for models in the Repository is the project file for the program DeepView (34); this program allows the underlying alignments to be adjusted manually and the request to be resubmitted to Workspace for modelling. While new protein structures are deposited in the PDB on a daily basis, the respective modelling update cycles are less frequent, resulting in a delay in the incorporation of new templates. The Repository therefore links directly to the corresponding template search module in Workspace, which allows searches for newly released templates to be performed. The direct cross-linking between Repository and Workspace combines the advantages of the database of pre-computed models with the flexibility of an interactive modelling system.

INTEROPERABILITY

Programmatic access

One of the major challenges of computational biology today is the integration of large amounts of diverse data in heterogeneous formats. Very often, data exchange within one domain, e.g. sequence-based data resources, is relatively straightforward, but seamless exchange between resources serving different data types, such as genome browsers and protein structure databases, is more difficult due to the lack of common and accepted standards. DAS (28) is a light-weight mechanism for web service-based annotation exchange. The DAS concept relies on an XML specification which defines the communication between server and client. Queries can be executed by sending a specific HTTP request to the DAS server. The result of the DAS server request is a human-readable and easy-to-parse XML document following the BioDAS specifications (http://www.biodas.org). The DAS server of the SWISS-MODEL Repository is based on the DAS/1 standard and can be queried by primary UniProt accession codes or md5 hashes of the corresponding sequences. Individual models for a query sequence (‘SEGMENT’) are annotated as ‘FEATURE’, with information about the start and stop position in the target sequence, template-sequence identity and the URL to the corresponding SWISS-MODEL Repository entry. The DAS service allows the SWISS-MODEL Repository to be cross-linked with other resources using the same standards, e.g. genome browsers. The SWISS-MODEL Repository DAS service is accessible at http://swissmodel.expasy.org/service/das/swissmodel/.
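
A client for such a service can be sketched in a few lines of Python. The example below builds a DAS/1 `features` request URL and parses a response in the general DASGFF layout (SEGMENT and FEATURE elements with START, END, NOTE and LINK children). The sample document is hand-written and hypothetical: its identifiers, coordinates and links are invented, and the exact fields returned by the SWISS-MODEL server are an assumption. No network request is made, since the service address from 2008 may no longer respond.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Service address as given in the text.
DAS_BASE = "http://swissmodel.expasy.org/service/das/swissmodel/"


def features_url(segment_id: str) -> str:
    """Build a DAS/1 'features' request URL for a UniProt AC or md5 hash."""
    return DAS_BASE + "features?" + urlencode({"segment": segment_id})


def parse_features(xml_text: str):
    """Extract one record per FEATURE from a DASGFF document."""
    root = ET.fromstring(xml_text)
    rows = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            link = feature.find("LINK")
            rows.append({
                "segment": segment.get("id"),
                "feature": feature.get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
                "note": feature.findtext("NOTE"),
                "link": link.get("href") if link is not None else None,
            })
    return rows


# Hypothetical response written by hand; all values are invented.
SAMPLE = """<DASGFF>
  <GFF version="1.0" href="http://example.org/das/features">
    <SEGMENT id="EXAMPLE" start="1" stop="500">
      <FEATURE id="model_1" label="model_1">
        <TYPE id="homology_model">homology model</TYPE>
        <START>21</START>
        <END>495</END>
        <NOTE>sequence identity: 55%</NOTE>
        <LINK href="http://example.org/repository/entry">entry</LINK>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

print(features_url("P53354"))
for row in parse_features(SAMPLE):
    print(row["segment"], row["feature"], row["start"], row["end"], row["link"])
```

Keeping URL construction and parsing separate means the parser can be tested on stored documents, and the same function can be reused for any other DAS/1 source that returns features in this layout.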

The protein model portal

One of the major bottlenecks in the use of protein models is that, unlike for experimental structures, modelling resources are heterogeneous and distributed over numerous servers. However, it is often beneficial for the user to directly compare the results of different modelling methods for the same protein. We have therefore developed the Protein Model Portal (PMP) as a component of the PSI Structural Genomics Knowledge Base (29,30). This resource provides access to all structures in the PDB, functional annotations, homology models, structural genomics protein target tracking information, available protocols and the potential to obtain DNA materials for many of the targets. The PMP currently provides access to several million pre-built models from four PSI centers, ModBase (38) and SWISS-MODEL Repository (27,37).

FUTURE DIRECTIONS

SWISS-MODEL Repository will be updated regularly to reflect the growth of the sequence and structure databases. Future releases of SWISS-MODEL Repository will include models of oligomeric assemblies, as well as models including essential co-factors, metal ions and structural ligands. Structural clustering of the SWISS-MODEL Template Library will also allow us to routinely include ensembles of models for proteins that undergo extensive domain movements.

CITATION

Users of SWISS-MODEL Repository are requested to cite this article in their publications.

FUNDING

The PSI SGKB Protein Model Portal was supported by the National Institutes of Health as a sub-grant with Fox Chase Cancer Center (3 P20 GM076222-02S1); as a sub-grant with Rutgers University, under Prime Agreement Award Number (3U54GM074958-04S2). SWISS-MODEL Workspace and Repository have been supported by the Swiss Institute of Bioinformatics (SIB). Funding for open access charges: Swiss Institute of Bioinformatics. Conflict of interest statement. None declared.

44 in total

1. Progress from CASP6 to CASP7.

Authors: Andriy Kryshtafovych; Krzysztof Fidelis; John Moult
Journal: Proteins Date: 2007

2. The challenge of protein structure determination--lessons from structural genomics.

Authors: Lukasz Slabinski; Lukasz Jaroszewski; Ana P C Rodrigues; Leszek Rychlewski; Ian A Wilson; Scott A Lesley; Adam Godzik
Journal: Protein Sci Date: 2007-11 Impact factor: 6.725

3. Automated server predictions in CASP7.

Authors: James N D Battey; Jürgen Kopp; Lorenza Bordoli; Randy J Read; Neil D Clarke; Torsten Schwede
Journal: Proteins Date: 2007

4. Assessment of CASP7 predictions for template-based modeling targets.

Authors: Jürgen Kopp; Lorenza Bordoli; James N D Battey; Florian Kiefer; Torsten Schwede
Journal: Proteins Date: 2007

5. Template-based modeling and free modeling by I-TASSER in CASP7.

Authors: Yang Zhang
Journal: Proteins Date: 2007

6. Benchmarking template selection and model quality assessment for high-resolution comparative modeling.

Authors: M I Sadowski; D T Jones
Journal: Proteins Date: 2007-11-15

7. Harnessing knowledge from structural genomics.

Authors: Helen M Berman
Journal: Structure Date: 2008-01 Impact factor: 5.006

Review 8. Integration of bioinformatics resources for functional analysis of gene expression and proteomic data.

Authors: Hongzhan Huang; Zhang-Zhi Hu; Cecilia N Arighi; Cathy H Wu
Journal: Front Biosci Date: 2007-09-01

9. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

Authors: Shibu Yooseph; Granger Sutton; Douglas B Rusch; Aaron L Halpern; Shannon J Williamson; Karin Remington; Jonathan A Eisen; Karla B Heidelberg; Gerard Manning; Weizhong Li; Lukasz Jaroszewski; Piotr Cieplak; Christopher S Miller; Huiying Li; Susan T Mashiyama; Marcin P Joachimiak; Christopher van Belle; John-Marc Chandonia; David A Soergel; Yufeng Zhai; Kannan Natarajan; Shaun Lee; Benjamin J Raphael; Vineet Bafna; Robert Friedman; Steven E Brenner; Adam Godzik; David Eisenberg; Jack E Dixon; Susan S Taylor; Robert L Strausberg; Marvin Frazier; J Craig Venter
Journal: PLoS Biol Date: 2007-03 Impact factor: 8.029

10. Computational design of antibody-affinity improvement beyond in vivo maturation.

Authors: Shaun M Lippow; K Dane Wittrup; Bruce Tidor
Journal: Nat Biotechnol Date: 2007-09-23 Impact factor: 54.908
