DBAli tools: mining the protein structure space.

Marc A Marti-Renom, Ursula Pieper, M S Madhusudhan, Andrea Rossi, Narayanan Eswar, Fred P Davis, Fátima Al-Shahrour, Joaquín Dopazo, Andrej Sali.

Abstract

The DBAli tools use a comprehensive set of structural alignments in the DBAli database to leverage the structural information deposited in the Protein Data Bank (PDB). These tools include (i) the DBAlit program that allows users to input the 3D coordinates of a protein structure for comparison by MAMMOTH against all chains in the PDB; (ii) the AnnoLite and AnnoLyze programs that annotate a target structure based on its stored relationships to other structures; (iii) the ModClus program that clusters structures by sequence and structure similarities; (iv) the ModDom program that identifies domains as recurrent structural fragments and (v) an implementation of the COMPARER method in the SALIGN command in MODELLER that creates a multiple structure alignment for a set of related protein structures. Thus, the DBAli tools, which are freely accessible via the World Wide Web at http://salilab.org/DBAli/, allow users to mine the protein structure space by establishing relationships between protein structures and their functions.

Year: 2007 PMID: 17478513 PMCID: PMC1933139 DOI: 10.1093/nar/gkm236

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The number of known protein structures deposited in the Protein Data Bank (PDB) has grown exponentially over the years (1). This trend is expected to continue, partly due to the structural genomics efforts (2,3). Currently, there are ∼41 000 protein structures deposited in the PDB, containing ∼88 000 protein chains. These protein structures constitute a structural space that can be mined to facilitate the understanding, assignment and modification of protein function. Previously developed databases for the classification of protein structure domains, such as SCOP [http://scop.mrc-lmb.cam.ac.uk/scop/ (4)] or CATH [http://www.cathdb.info (5)], and servers for functional annotation of protein structures, such as ProFunc [http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/ (6,7)], ProKnow [http://www.doe-mbi.ucla.edu/Services/ProKnow (8)] and Phunctioner [http://www.sbg.bio.ic.ac.uk (9)], provide an effective way of describing and annotating the protein structure space. However, none of these resources combines a comprehensive database of protein structural alignments with tools for automatically annotating protein structures. Here, we describe five tools that aid in the analysis of the data stored in DBAli, our comprehensive relational database of pairwise and multiple structural alignments (10). These tools include (i) the DBAlit program that allows users to input their structure for comparison by MAMMOTH (11) against all chains in the PDB; (ii) the AnnoLite and AnnoLyze programs that annotate a target structure based on its stored relationships to other structures; (iii) the ModClus program that clusters structures by sequence and structure similarities; (iv) the ModDom program that identifies recurrent fragments, including domains, from structure; and (v) an implementation of the COMPARER method (12) in the SALIGN command in MODELLER that creates a multiple structure alignment for a set of related protein structures.
The DBAli tools allow users to establish relationships between protein structures and their fragments in a flexible and dynamic manner. The DBAli database is briefly introduced first. Next, we describe each of the five tools that make use of the structural alignments deposited in DBAli. Finally, we discuss the use of the DBAli tools to analyze a structure determined by the New York Structural Genomics Research Consortium (NYSGXRC).

THE DBALI SERVER

The DBAli server (http://salilab.org/DBAli/) is divided into four main sections: the DBAli database, search pages, tools pages and special pages, each of which is dedicated to a specific task (Figure 1).

Figure 1.

DBAli server organization.

The DBAli database

The DBAli database contains pairwise and multiple structure alignments of protein structures in the PDB. Pairwise alignments are updated weekly and multiple alignments are updated monthly. Currently (January 2007), DBAli contains a total of 86,277 PDB chains in ∼1.38 billion pairwise alignments with a MAMMOTH P-value higher than 2.0 (Table 1). DBAli also stores multiple structure alignments for 11,615 families with 30,900 non-redundant PDB chains representing 86,277 chains in the PDB (Table 1). Structural and functional annotations are obtained from our databases LigBase (13) and PIBASE (14) and from external databases such as CATH (5), SCOP (4), InterPro (15), PFam (16), EC (17) and GO (18). Cross-links between the external databases are adopted from the MSD database (19) (Table 1).

Table 1.

Internal and external data sources used by the DBAli tools

Database	Last update	Type of information	Number of entries	Reference
DBAli pairwise	January 2007	Structural alignments	1,379,352,642	(10)
DBAli multiple	January 2007	Structural alignments	11,615	(10)
MSD	November 2006	PDB chains	79,170	(19)
CATH	July 2006	Superfamilies	2,359	(5)
SCOP	April 2006	Domains	94,779	(4)
InterPro	July 2006	Domains and motifs	13,057	(15)
PFam	July 2006	Protein families	8,376	(16)
EC	November 2006	Enzymes	32,137	(17)
GO	July 2006	Functional terms	21,017	(18)
LigBase	February 2004	Protein ligands	101,359	(13)
PIBASE	September 2004	Protein interactions	158,915	(14)

The DBAli tools

DBAli incorporates five tools that use the structural relationships in DBAli.

DBAlit

The DBAlit program takes a user input structure in the PDB format and compares it against all chains in the PDB using the MAMMOTH algorithm. On average, a user-input structure is compared against all known protein structure chains in ∼200 min using 10 Xeon CPUs. To preserve the privacy of the coordinates, DBAli generates a random and unique chain identifier that is returned to the user by e-mail. The new chain identifier can then be used to retrieve data from the DBAli database as well as to use the other tools outlined below.
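The quoted run time can be turned into an approximate cost per comparison. The short calculation below is only a back-of-the-envelope sketch: it assumes ideal parallel scaling across the 10 CPUs and that the comparison set is the 86,277 chains reported for DBAli in January 2007. The function and constant names are illustrative and not part of DBAlit.

```python
# Figures quoted in the text (January 2007 snapshot).
WALL_MINUTES = 200   # approximate wall-clock time per query structure
N_CPUS = 10          # Xeon CPUs used in parallel
N_CHAINS = 86_277    # PDB chains in DBAli (assumed to be the comparison set)


def cpu_seconds_per_comparison(wall_minutes, n_cpus, n_chains):
    """Average CPU time per pairwise comparison, assuming ideal parallel scaling."""
    return wall_minutes * 60 * n_cpus / n_chains


per_chain = cpu_seconds_per_comparison(WALL_MINUTES, N_CPUS, N_CHAINS)
print(f"{per_chain:.2f} CPU-seconds per comparison")
```

Under these assumptions, each pairwise MAMMOTH comparison costs roughly 1.4 CPU-seconds on average, including any overhead.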

AnnoLite and AnnoLyze

The AnnoLite and AnnoLyze programs (20) annotate the structures processed by DBAli. A given structure is characterized by annotations of related structures in the MODBASE database of comparative models (21), the PIBASE database of structurally defined protein interfaces (14), the LigBase database of small molecule binding sites (13) and the MSD database of macromolecular structures (19). The inputs for the annotation programs are a PDB chain code and thresholds for filtering sequence and structure similarities. The AnnoLite program searches the DBAli database for structurally similar proteins and collects their known annotations. Next, a P-value score is calculated for each transferred annotation using a Fisher's exact test for 2 × 2 contingency tables comparing two groups of annotated chains (i.e. the group of chains similar to the query and the group of all annotated chains in the PDB) (22). Currently, AnnoLite annotates the input protein structure with CATH (5) and SCOP (4) fold assignments, EC numbers (17), InterPro entries (15), PFAM families (16) and Gene Ontology codes (23). The accuracy and coverage of AnnoLite were benchmarked with a set of 1,879 fully annotated non-redundant PDB chains. AnnoLite can reliably annotate a structure for all of the functional properties, with the exception of the GO cellular component term. For example, the CATH fold can be recovered for 96% of the dataset with 89% reliability, and direct functional annotation with EC numbers and Gene Ontology molecular function codes can be recovered with reliabilities of 81 and 74% for 83 and 88% of the dataset, respectively. Additionally, AnnoLyze inherits ligands from LigBase and interacting partners from PIBASE. The output from the two programs provides an automatic annotation of the protein structures.
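The scoring step can be illustrated with a minimal, self-contained sketch of a one-sided Fisher's exact test on a 2 × 2 table. The text does not state whether AnnoLite uses a one-sided or two-sided test, nor exactly how the background group is counted; here the second row is assumed to hold the annotated chains that are not similar to the query, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    a: chains similar to the query that carry the annotation
    b: chains similar to the query that lack it
    c: remaining annotated chains that carry the annotation (assumed layout)
    d: remaining annotated chains that lack it (assumed layout)

    Returns the probability of observing a or more annotated chains in the
    similar group if the annotation were independent of similarity.
    """
    row1 = a + b          # size of the similar group
    col1 = a + c          # chains carrying the annotation
    n = a + b + c + d     # all chains considered
    upper = min(row1, col1)
    # Sum the hypergeometric tail from the observed count to its maximum.
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)


# Toy counts, not taken from DBAli: 3 of 4 similar chains carry the term,
# versus 1 of 4 of the remaining chains.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

A small P-value indicates that the annotation is enriched among the chains similar to the query relative to the background, which is the rationale for transferring it.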

ModClus

The ModClus program clusters protein structures based on their sequence and structure similarities. The input to ModClus is a list of PDB chain codes. The output is a list of clusters of the input chains. The clustering depends on user-defined thresholds for structure and sequence similarities. ModClus implements a greedy algorithm as follows: (i) the first chain in the list seeds the first cluster; (ii) the next chain is compared by sequence and/or structure to all chains in each of the existing clusters; it joins the first cluster in which it is sufficiently similar to all member chains or, if no such cluster exists, seeds a new cluster; (iii) the clustering continues with step (ii) until all chains are processed.
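The greedy procedure in steps (i) to (iii) can be sketched as follows. This is an illustration of the described logic rather than the ModClus source: the similarity test is passed in as a function, and the chain labels, identity values and cutoff in the example are invented.

```python
from typing import Callable, Dict, FrozenSet, List


def greedy_cluster(chains: List[str],
                   is_similar: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedy clustering in input order.

    A chain joins the first cluster whose members are all similar to it;
    if no such cluster exists, it seeds a new cluster.
    """
    clusters: List[List[str]] = []
    for chain in chains:
        for cluster in clusters:
            if all(is_similar(chain, member) for member in cluster):
                cluster.append(chain)
                break
        else:
            # No existing cluster accepted the chain.
            clusters.append([chain])
    return clusters


# Toy pairwise sequence identities (percent); unlisted pairs count as 0.
identity: Dict[FrozenSet[str], float] = {
    frozenset(("c1", "c2")): 95.0,
    frozenset(("c3", "c4")): 90.0,
    frozenset(("c2", "c3")): 40.0,
}


def similar_by_identity(x: str, y: str, cutoff: float = 80.0) -> bool:
    return identity.get(frozenset((x, y)), 0.0) >= cutoff


clusters = greedy_cluster(["c1", "c2", "c3", "c4"], similar_by_identity)
print(clusters)  # [['c1', 'c2'], ['c3', 'c4']]
```

Because the algorithm is greedy, the result depends on the order of the input list: a chain is never moved once it has been assigned, so reordering the chains can change the clusters.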

ModDom

The ModDom program assigns domain boundaries in a given structure using the superpositions stored in DBAli. The input is a PDB chain code, which is used as a query to identify structurally similar chains in DBAli. ModDom relies on the relationship between recurrent structures and structural units to predict domain boundaries. The program first builds a residue co-occurrence matrix based on structural alignments selected from the DBAli database and then clusters residue co-occurrences to find common fragments in the query protein structure. The ModDom program has been benchmarked for domain assignment with a non-redundant set of protein structures. ModDom assigns 80% of residues identically to the domain assignments in the SCOP database. The user is provided with all domain assignments and their scores, a structural conservation profile based on the retrieved alignments with similar structures, and a JMol window for inspection of the domain assignments.
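The idea of a residue co-occurrence matrix can be made concrete with a small sketch. The text does not specify how ModDom scores or clusters co-occurrences, so everything below is an assumption made for illustration: each alignment is reduced to the set of query residues it covers, co-occurrence is a simple count, and residues are grouped by single linkage on a normalized count using a hypothetical threshold.

```python
from typing import List, Set


def cooccurrence_matrix(n_residues: int,
                        alignments: List[Set[int]]) -> List[List[int]]:
    """m[i][j] counts the alignments in which query residues i and j are both aligned."""
    m = [[0] * n_residues for _ in range(n_residues)]
    for covered in alignments:
        for i in covered:
            for j in covered:
                m[i][j] += 1
    return m


def group_residues(m: List[List[int]], threshold: float) -> List[List[int]]:
    """Link residues whose normalized co-occurrence reaches the threshold,
    then return the connected components (a stand-in for ModDom's clustering)."""
    n = len(m)
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # m[i][i] is the number of alignments covering residue i.
            denom = max(m[i][i], m[j][j])
            if denom and m[i][j] / denom >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy query of six residues: two alignments cover only residues 0-2,
# one covers only residues 3-5 and one covers the whole chain.
toy_alignments = [{0, 1, 2}, {0, 1, 2}, {3, 4, 5}, {0, 1, 2, 3, 4, 5}]
matrix = cooccurrence_matrix(6, toy_alignments)
fragments = group_residues(matrix, threshold=0.6)
print(fragments)  # [[0, 1, 2], [3, 4, 5]]
```

In this toy case, residues that recur together across the alignments fall into two fragments, which is the kind of signal the text describes as indicating domain boundaries.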

SALIGN

The SALIGN command, an implementation of the COMPARER program (12) in MODELLER, generates a multiple structure alignment given a list of PDB chain codes. The alignment is displayed through the JMol applet. In addition, an HTML frame presents an easy-to-read sequence alignment corresponding to the structural superposition. The user is provided with options to download the alignment in the HTML or PIR formats and the superposed coordinates in the PDB format.

UTILITY AND DISCUSSION

Target selection strategies for structural genomics have led to the experimental determination of many protein structures whose functions are not yet known. The tools in DBAli can be employed to annotate the functions of such structures, as illustrated by the following example. The New York Structural Genomics Research Consortium (NYSGXRC) selected a hypothetical protein from Pseudomonas aeruginosa as a target for structure determination (target T1794). The structure was successfully determined and deposited in the PDB database (code 1u6l, release date 14 December 2004). Searches by PSI-BLAST (24) and threading by GenThreader (25) indicated similarity to the glyoxalase/bleomycin domain (PfamA family PF00903). This domain is found in several proteins including the bleomycin resistance protein and dioxygenases. The DBAli tools confirm and add to these findings. A search for structures similar to chain A of 1u6l (1u6lA) results in 306 related structures. The first annotated hit in the list of similar structures corresponds to a bleomycin resistance protein (PDB code 1xrkB). As a result, the AnnoLyze program (all parameters set to their default values except for minimal sequence identity set to 15%) predicts a binding site on 1u6l that may bind a bleomycin-like ligand. The AnnoLite program predicts that the 1u6lA chain adopts the glyoxalase/bleomycin resistance fold (SCOP code d.32.1.1). ModDom detects that the protein in fact contains two glyoxalase/bleomycin resistance fold domains. In summary, ModDom, AnnoLyze and AnnoLite, adding to the results from sequence-based searches, annotate 1u6lA as a two-domain antibiotic resistance protein and localize a putative binding site for the antibiotic bleomycin (Figure 2).

Figure 2.

Functional annotation of the target T1794 from the NYSGXRC consortium (PDB code 1u6l chain A). (a) The structure closest to the 1u6lA chain with known annotation is an antibiotic inhibitor from Streptoalloteichus hindustanus (PDB code 1xrk chain B). It is shown in the ribbon representation in red, superposed on 1u6lA in blue (RMSD of 3.5 Å over 106 residues, sequence identity of 14.3% and MAMMOTH P-value = 13.8). (b) Predicted binding site for bleomycin A2 ligand (residues colored in red and orange: Lys80, Gly81, Cys82, Ser83, Ser85, Asn87, Gln108, Phe115, Trp116, Ser119, Gly121, Thr124, Gly128, Val129, Ala130 and Val133). (c) Domain boundaries of 1u6lA as assigned by ModDom. The protein structure is predicted to have two domains: residues 1–76 (yellow) and residues 77–123 (red). (d) Screen capture of the AnnoLite results for the 1u6lA chain.

CONCLUSIONS

The DBAli database stores a comprehensive and up-to-date comparison of protein structures in the PDB. The data are stored in a MySQL relational database and can be accessed and downloaded via a web server. Several tools have been developed that interact with the data deposited in DBAli, including the DBAlit, AnnoLite and AnnoLyze, ModClus, ModDom, and SALIGN programs. The design of DBAli allows easy cross-referencing to other databases. DBAli already includes links to the MODBASE, LigBase, PIBASE, PDB, CATH, SCOP, PFamA, InterPro, Enzyme and GO databases. Through further integration of new tools and other databases, DBAli and its tools are becoming a valuable resource to the structural biology community. As of January 2007, the DBAli tools have been used by more than 1,700 unique users from 79 different countries worldwide who performed on average ∼600 tasks per month.

AVAILABILITY AND REQUIREMENTS

DBAli is freely available on the Internet at http://salilab.org/DBAli and requires a web browser that is capable of running the JMol applet. The web interface is programmed in PHP, and the underlying database is supported by MySQL.

25 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The ENZYME database in 2000.

Authors: A Bairoch
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

4. PIBASE: a comprehensive database of structurally defined protein interfaces.

Authors: Fred P Davis; Andrej Sali
Journal: Bioinformatics Date: 2005-01-18 Impact factor: 6.937

5. Inference of protein function from protein structure.

Authors: Debnath Pal; David Eisenberg
Journal: Structure Date: 2005-01 Impact factor: 5.006

6. MODBASE: a database of annotated comparative protein structure models and associated resources.

Authors: Ursula Pieper; Narayanan Eswar; Fred P Davis; Hannes Braberg; M S Madhusudhan; Andrea Rossi; Marc Marti-Renom; Rachel Karchin; Ben M Webb; David Eramian; Min-Yi Shen; Libusha Kelly; Francisco Melo; Andrej Sali
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. ProFunc: a server for predicting protein function from 3D structure.

Authors: Roman A Laskowski; James D Watson; Janet M Thornton
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

8. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis.

Authors: Frances Pearl; Annabel Todd; Ian Sillitoe; Mark Dibley; Oliver Redfern; Tony Lewis; Christopher Bennett; Russell Marsden; Alistair Grant; David Lee; Adrian Akpor; Michael Maibaum; Andrew Harrison; Timothy Dallman; Gabrielle Reeves; Ilhem Diboun; Sarah Addou; Stefano Lise; Caroline Johnston; Antonio Sillero; Janet Thornton; Christine Orengo
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. InterPro, progress and status in 2005.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Paul Bradley; Peer Bork; Phillip Bucher; Lorenzo Cerutti; Richard Copley; Emmanuel Courcelle; Ujjwal Das; Richard Durbin; Wolfgang Fleischmann; Julian Gough; Daniel Haft; Nicola Harte; Nicolas Hulo; Daniel Kahn; Alexander Kanapin; Maria Krestyaninova; David Lonsdale; Rodrigo Lopez; Ivica Letunic; Martin Madera; John Maslen; Jennifer McDowall; Alex Mitchell; Anastasia N Nikolskaya; Sandra Orchard; Marco Pagni; Chris P Ponting; Emmanuel Quevillon; Jeremy Selengut; Christian J A Sigrist; Ville Silventoinen; David J Studholme; Robert Vaughan; Cathy H Wu
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. The AnnoLite and AnnoLyze programs for comparative annotation of protein structures.

Authors: Marc A Marti-Renom; Andrea Rossi; Fátima Al-Shahrour; Fred P Davis; Ursula Pieper; Joaquín Dopazo; Andrej Sali
Journal: BMC Bioinformatics Date: 2007-05-22 Impact factor: 3.169
