
iPBA: a tool for protein structure comparison using sequence alignment strategies.

Jean-Christophe Gelly, Agnel Praveen Joseph, Narayanaswamy Srinivasan, Alexandre G de Brevern.

Abstract

With the immense growth in the number of available protein structures, fast and accurate structure comparison has become essential. We propose an efficient method for structure comparison based on a structural alphabet. Protein Blocks (PBs) are a widely used structural alphabet of 16 pentapeptide conformations that can fairly approximate a complete protein chain. Thus, a 3D structure can be translated into a 1D sequence of PBs. With a simple Needleman-Wunsch approach and a raw PB substitution matrix, PB-based structural alignments were better than those of many popular methods. The iPBA web server presents an improved alignment approach using (i) specialized PB Substitution Matrices (SM) and (ii) an anchor-based alignment methodology. With these developments, the quality of ∼88% of alignments was improved. iPBA alignments were also better than those of DALI, MUSTANG and GANGSTA(+) in >80% of the cases. The web server is designed for both pairwise comparisons and database searches. Outputs are given as sequence alignments and superposed 3D structures displayed using PyMol and Jmol. A local alignment option for detecting sub-structural similarity is also embedded. We believe that iPBA, as a fast and efficient 'sequence-based' structure comparison tool, will be quite useful to the scientific community. iPBA can be accessed at http://www.dsimb.inserm.fr/dsimb_tools/ipba/.

Year: 2011 PMID: 21586582 PMCID: PMC3125758 DOI: 10.1093/nar/gkr333

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The continuous increase in the number of 3D structures of proteins necessitates the development of efficient tools for structure comparison. Such developments facilitate the characterization of the function of a protein of known structure (1) or aid in evolutionary studies (2–4). Considering the complexity involved in obtaining an optimal superposition solely by global structural searches, a large majority of structural alignment approaches focus on optimizing a combination of local segments of similarity to derive the global alignment (5–7). Many recent approaches consider the match between secondary structural elements (8–10), while others are fragment based (11–16). This idea has been extended further to investigate the flexibility of protein structures (17,18). Local backbone conformations such as α-helices, β-strands, β-turns and PPII helices characterize a large part of the tertiary structure of a protein chain. A complete protein backbone can be approximated with a limited set of local conformations. Such a collection of local structural prototypes is called a Structural Alphabet (SA). Protein Blocks (PBs) (19–21) are one such SA, involving 16 pentapeptide conformations (represented by the letters a to p) characterized by backbone dihedral angles. Several biological questions can be addressed with the PB-based abstraction. The main-chain 3D information can be represented as a 1D sequence of PBs, which reduces the problem of protein structure comparison to a classical sequence alignment. Dynamic programming algorithms such as Needleman–Wunsch (22) and Smith–Waterman (23) were used earlier for PB alignment, and a PB substitution matrix was generated for scoring the alignment (24–26). We propose an improved and novel version of PB alignment using (i) specialized substitution matrices for pairwise alignment and database search and (ii) an anchor-based dynamic programming algorithm.
Most recent web tools for structure comparison are dedicated either to database searches (9,10,13,27,28) or to pairwise structural alignments (29–32). As an efficient tool for both pairwise alignments and database searches, this web server serves as a good platform for such studies. A local alignment strategy for motif or sub-structure search is also available. The server provides: (i) different scoring schemes to indicate the quality of the alignment, (ii) a user-friendly interface to view and analyze the 3D superposition and (iii) downloadable alignment files (both sequence and structural alignment).

MATERIALS AND METHODS

The server can be used to search for structural relatives of a query protein (Figure 1A) or to compare two protein structures (Figure 1B). In both cases, the user can decide whether to carry out alignments for the complete structure (global) or to look for the best local similarity (local).

Figure 1.

The framework of iPBA and underlying methods. The user can either compare two structures or search for structural neighbors (mining) in a databank. The input and output web interfaces for pairwise structural alignment are highlighted with a blue background. The web interfaces for mining have a green background. The rest of the figure (white background) outlines the underlying methodological aspects. (A) Search for structurally similar proteins in a 3D database. (B) Compare two protein structures. (C) Alignment approach. (D) Main outputs.

Input

To compare two structures, the user can either provide the coordinates in the standard PDB format or enter the PDB code. The identifiers of the chains to be compared should also be given. To search for related protein structures in the database, only one PDB file or code is necessary (Figure 1A and B).

Pre-processing

Atomic coordinate sets are first translated into a sequence of PBs (Figure 1C). PBs constitute 16 pentapeptide conformations (labeled from a to p), each described by a series of Φ, Ψ dihedral angles. They give a reasonable approximation of local structures (19), with a root mean square deviation (RMSD) of 0.42 Å (33).
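The translation step can be illustrated with a minimal sketch of nearest-prototype assignment. It assumes that each pentapeptide window is summarized by eight Φ/Ψ angles and that the closest prototype is chosen by an angular RMSD; both points are assumptions about the encoding, not details given in this text. The two prototypes below are hypothetical placeholders, not the published PB angle values.

```python
import math

# Hypothetical prototypes for illustration only: these are NOT the 16
# published Protein Block definitions. Each entry holds 8 dihedral angles
# (degrees) describing one pentapeptide window.
PROTOTYPES = {
    "helix_like": (-47.0, -57.0, -47.0, -57.0, -47.0, -57.0, -47.0, -57.0),
    "strand_like": (130.0, -120.0, 130.0, -120.0, 130.0, -120.0, 130.0, -120.0),
}


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def angular_rmsd(window, prototype):
    """Root mean square deviation over paired dihedral angles."""
    squared = [angle_diff(a, b) ** 2 for a, b in zip(window, prototype)]
    return math.sqrt(sum(squared) / len(squared))


def assign_letter(window, prototypes=PROTOTYPES):
    """Return the name of the prototype closest to the given window."""
    return min(prototypes, key=lambda name: angular_rmsd(window, prototypes[name]))


# Usage: a window of angles measured on a structure (made-up values).
observed = (-50.0, -60.0, -45.0, -55.0, -48.0, -58.0, -46.0, -56.0)
print(assign_letter(observed))
```

Sliding such a window along the chain, one residue at a time, would yield one letter per central residue and hence a 1D string for the whole backbone.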

Computing pairwise alignment

The alignment method implemented in this server represents a significant improvement over our earlier work (24). In the previous work, the PB substitution matrix was generated from pairwise alignments in the PALI database (3). This database was redundant in terms of the distribution of related proteins. We have since refined the databank: the PB substitutions were calculated from a non-redundant subset sharing sequence identity <40%, and a refined substitution matrix was generated. Also, in our previous approach, a simple Needleman–Wunsch (22) algorithm was used for alignment. Protein structural homologues are often characterized by conserved stretches separated by variable regions. Hence a combination of local and global alignment is expected to perform better. A set of local alignments (anchors) associated with the two PB sequences is derived using a modified version of the SIM algorithm (34). The remaining segments between anchors (linkers) are then aligned using the Needleman–Wunsch algorithm (Figure 1C). Affine gap penalties are used for the anchor and linker alignments. Distance constraints on the structures are included to identify false anchors. The different parameters were optimized, as in the previous work, based on alignments of proteins in the PALI data set (3). A total of 80% of the alignments were better than those obtained with our previous work (24). Different scores are used to quantify the quality of the PB alignment, including GDT_PB, a score similar to the Global Distance Test Total Score (GDT_TS) (35) for PB sequence alignment, derived using seven decreasing cut-offs of PB substitution scores (similar to the distance cut-offs for GDT_TS):

$$\mathrm{GDT\_PB} = \frac{1}{k} \sum_{j=1}^{k} P_j$$

where k corresponds to the total number of thresholds used, i.e. 7, and P_j is the percentage of PB substitutions that are within the cut-off level j.
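As a rough illustration of the dynamic programming step and of the GDT_TS-like score on PB substitutions, the sketch below aligns two PB strings globally and averages the percentage of aligned pairs that reach each cut-off. It is deliberately simplified and rests on several assumptions: a linear gap penalty instead of the affine penalties used by the server, no anchors or distance constraints, a toy `substitution_score` function in place of the real PB substitution matrix, and made-up cut-off values. Reading "within the cut-off" as "score at or above the cut-off" is also an assumption.

```python
def substitution_score(a, b):
    """Toy PB substitution score (hypothetical, not the iPBA matrix)."""
    return 2.0 if a == b else -1.0


def needleman_wunsch(seq1, seq2, gap=-2.0, score=substitution_score):
    """Global alignment with a linear gap penalty (simplified sketch)."""
    n, m = len(seq1), len(seq2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]):
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out1)), "".join(reversed(out2))


def gdt_pb(aligned1, aligned2, cutoffs, score=substitution_score):
    """Mean, over all cut-offs, of the percentage of aligned pairs scoring >= cut-off."""
    pairs = [(a, b) for a, b in zip(aligned1, aligned2) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    percentages = [
        100.0 * sum(score(a, b) >= c for a, b in pairs) / len(pairs)
        for c in cutoffs
    ]
    return sum(percentages) / len(percentages)


# Usage with made-up PB strings and hypothetical cut-offs.
total, row1, row2 = needleman_wunsch("mmmmddfkl", "mmmddfkl")
print(total, row1, row2, gdt_pb(row1, row2, cutoffs=[2.0, 1.0, 0.0, -1.0]))
```

In an anchor-based scheme, a routine like `needleman_wunsch` would only be applied to the linker segments between anchors, which keeps each dynamic programming problem small.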
The residue equivalences from the PB alignment then guide the 3D fitting of the structures by ProFit (36) (http://www.bioinf.org.uk/software/profit/), which reports the RMSD and the number of aligned residues (within 5 Å) (Figure 2). The GDT_TS score for the alignment is also provided, along with the Aln_Score and GDT_PB. Note that the GDT_TS score used for comparison of iPBA with other web tools (Table 1) was computed with a maximum distance threshold of 5 Å, and the percentage of equivalent residues was calculated from only one of the protein lengths. These variations were included to avoid bias in the score due to the different distance thresholds used by different methods and to the incomplete alignment outputs provided by some servers.
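The GDT_TS variant described here can be sketched as follows. This is a minimal illustration, assuming that the distances between equivalent residues (for example between Cα atoms, which is an assumption) are already available after superposition. The function name and inputs are hypothetical; the cut-offs follow the 0.5 Å steps up to 5 Å stated for Table 1, and the percentage is taken over a single reference chain length.

```python
def gdt_ts_like(distances, reference_length, max_cutoff=5.0, step=0.5):
    """Average percentage of residues within each distance cut-off.

    distances: distances (in Å) between equivalent residues after fitting.
    reference_length: length of the one chain used for normalization.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    n_cutoffs = int(round(max_cutoff / step))
    cutoffs = [step * (i + 1) for i in range(n_cutoffs)]
    fractions = [
        sum(d <= c for d in distances) / reference_length for c in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Usage with made-up distances for a 4-residue reference chain;
# the fourth residue is unaligned and therefore contributes nothing.
print(gdt_ts_like([0.4, 1.2, 6.0], reference_length=4))
```

Normalizing by one chain length rather than by the number of aligned residues penalizes short alignments, which is why this kind of score complements RMSD.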

Figure 2.

Comparison of iPBA with other Rigid Body alignment methods. The 3D superposition of Nucleotide Kinases (PDB IDs: 1AKY and 1GKY) by different methods is shown. The RMSD (in bold) and the number of aligned residues (as reported by the tool) are also given.

Table 1.

Comparison of iPBA with different structural alignment tools (web services)

Each protein pair is chosen at random from different structural classes (in parentheses) in the HOMSTRAD database (4). The number of aligned residues (as defined by the different methods) and their RMSD are given within parentheses. The GDT_TS score, calculated for distances increasing in steps of 0.5 Å over the range 0.5–5 Å, is also shown in italics. The best and second best scores are highlighted in red and blue. (–––) reflects incomplete output of the program, which limits the GDT_TS calculation. Rigid-body approaches have been tested with CE, DALI and TM-Align. The best RMSD and GDT_TS of the rigid-body approaches are highlighted in bold.

Database search

A sequence of PBs can also be used to search for structurally related proteins in a data set of structures (Figure 1A). SCOP version 1.75 (37) is used as the structure data set, and the user can also search refined subsets derived at different sequence identity cut-offs. The top 100 hits are reported based on the PB alignment score, which is scaled to values between −13 and 17. Values >1.5 are generally associated with high confidence. GDT_PB scores are also provided for the hits obtained. To keep the search fast, structure-based refinements are not included. The user can carry out further alignments of the hits obtained (Figure 1A and B).
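The reporting rule for a search can be sketched in a few lines. This is only an illustration, assuming the scaled PB alignment scores have already been computed for every database entry; the function name and the domain identifiers are hypothetical.

```python
def top_hits(scores, limit=100, confident_above=1.5):
    """Rank hits by scaled PB alignment score and flag confident ones.

    scores: mapping from a database entry identifier to its scaled score.
    Returns (identifier, score, is_confident) tuples, best first.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, s, s > confident_above) for name, s in ranked[:limit]]


# Usage with made-up identifiers and scores.
for name, score, confident in top_hits({"dom_a": 3.2, "dom_b": 0.4, "dom_c": 1.5}):
    print(name, score, "high confidence" if confident else "low confidence")
```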

Output for pairwise alignments

With the help of a Jmol applet, users can analyze the superposed structures in 3D and choose different visual representations of the structure (Figure 1D). Images of aligned structures rendered in PyMol are also provided. The residue equivalences in the 3D alignment are given as a complete sequence alignment. The corresponding PBs are also shown in the alignment. PB stretches of high similarity, identified as anchors, are highlighted (Figure 1D). The user can download the coordinates of the aligned structures in PDB format and PyMol scripts for local analysis of the superposition. A raw output file with the sequence alignment and quality scores is also downloadable in text format.

Implementation

The tool is implemented mainly in C, Python and HTML, and also uses the Jmol and PyMol programs. The front end is based on HTML and PHP. Perl/CGI programs control the input, while Python and C programs carry out the processing behind the database search and pairwise comparisons. Direct visualization and manipulation of aligned structures is enabled with a Jmol applet, and static images of superposed structures are rendered in PyMol using its internal ‘raytracer’ option. Supplementary Data S1 shows a schematic representation of the series of steps involved in the iPBA web server.

DISCUSSION

As shown in Figure 1, the web-based iPBA alignment tool is simple to use. The user only needs to give the coordinates to mine SCOP (Figure 1A) or for pairwise superimposition (Figure 1B). Outputs are mainly given visually as sequence alignments and 3D structure superimpositions (Figure 1D). Output alignment files can also be downloaded for local use. The local alignment strategy also provides a route to detect specific structural motifs in proteins. The improvement in the alignment methodology and the use of specialized PB substitution matrices have greatly enhanced the quality of alignments and the mining efficiency. The PB-based alignment approach had already shown an impressive performance as a structure comparison tool (24). Supplementary Figure 2 highlights the gain in alignment quality with respect to the earlier approach [PBALIGN, (24)]. One hundred randomly chosen SCOP domain pairs sharing <40% sequence identity were used for comparison. In total, 89% of the alignments have a better RMSD when compared to PBALIGN (Supplementary Data S2). Comparison performed on a bigger benchmark data set also suggested that a significant gain of 82% in alignment quality could be achieved. The mining efficiency also improved by 6.8%, and the gain was largely uniform across different structural classes. To give a picture of the performance, the quality of alignments generated by iPBA was compared with the output alignments of other well-established tools such as CE, DALI, FATCAT and TMalign (7,18,38,39) (Table 1). For the full-length chains (‘global’ alignment option), the alignments generated using iPBA have the lowest RMSD. However, the number of aligned residues is also lower in many cases. GDT_TS scores are more appropriate in such cases to give a better idea of the alignment quality. As highlighted in Table 1, iPBA generates alignments of very high quality.
Among the non-flexible aligners (CE, DALI and TMalign), iPBA alignments have the best quality scores in the majority of cases. FATCAT produces flexible alignments and is expected to give the best performance when flexible movements are involved. This is true for the first three cases in Table 1, where iPBA scores next to FATCAT. Thus the quality of iPBA alignments is largely comparable. In a systematic comparison using the standalone version of iPBA, the alignments were found to be better than those of DALI and MUSTANG in >80% of the cases. To demonstrate this, we chose the data set of 100 domain pairs from the SCOP database sharing <40% sequence identity. On this set of domain pairs, the alignments generated by iPBA were compared to those obtained with DALI (38), MUSTANG (40), GANGSTA+ (41) and TMalign (39). A total of 93.2 and 95.1% of the alignments had a better GDT_TS score compared to DALI and MUSTANG alignments, respectively (Supplementary Data 3A and B). The quality of ∼81.6% of alignments was better than that of GANGSTA+, while the difference was less striking when compared to TMalign. About 45% of the alignments had a GDT_TS score lower than TMalign (Supplementary Data 3D); however, the difference in scores for 80% of these cases was <3, reflecting similar alignments. Figure 2 presents a view of the 3D alignments of two Nucleotide Kinase structures with similar folds, using different non-flexible alignment approaches such as DALI, CE, TM-Align, GANGSTA+ and ALADYN. As highlighted (also see Table 1), the alignment quality is better with iPBA. A closer look at the figure shows that iPBA gives a more refined alignment, with the equivalent secondary structural elements well fitted onto each other.

CONCLUSION

The ability to represent the complete backbone conformation of a protein chain as a series of letters, followed by the use of sequence alignment techniques, is what mainly distinguishes iPBA from other structure comparison tools. In terms of alignment quality and efficiency in detecting structural relatives, iPBA has been quite successful among the wide range of methods available (42). The local alignment option further adds to the utility of this approach. The web tool also provides an interface for the visualization and analysis of the alignments.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

French Ministry of Research; University of Paris Diderot – Paris 7; French National Institute for Blood Transfusion (INTS); French Institute for Health and Medical Research (INSERM) (to A.P.J., J.-C.G. and A.G.d.B.); Department of Biotechnology, India (to N.S.); CEFIPRA number 3903-E (to A.P.J.); CEFIPRA for collaborative grant (number 3903-E) (to N.S. and A.G.d.B.). Funding for open access charge: INSERM (NAR membership). Conflict of interest statement. None declared.

40 in total

Review 1. Structural genomics and its importance for gene function analysis.

Authors: J Skolnick; J S Fetrow; A Kolinski
Journal: Nat Biotechnol Date: 2000-03 Impact factor: 54.908

2. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks.

Authors: A G de Brevern; C Etchebest; S Hazout
Journal: Proteins Date: 2000-11-15

3. A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications.

Authors: Manoj Tyagi; Venkataraman S Gowri; Narayanaswamy Srinivasan; Alexandre G de Brevern; Bernard Offmann
Journal: Proteins Date: 2006-10-01

4. Structural search and retrieval using a tableau representation of protein folding patterns.

Authors: Arun S Konagurthu; Peter J Stuckey; Arthur M Lesk
Journal: Bioinformatics Date: 2008-01-05 Impact factor: 6.937

5. Alignment of multiple protein structures based on sequence and structure features.

Authors: M S Madhusudhan; Benjamin M Webb; Marc A Marti-Renom; Narayanan Eswar; Andrej Sali
Journal: Protein Eng Des Sel Date: 2009-07-08 Impact factor: 1.650

6. CLePAPS: fast pair alignment of protein structures based on conformational letters.

Authors: Sheng Wang; Wei-Mou Zheng
Journal: J Bioinform Comput Biol Date: 2008-04 Impact factor: 1.122

7. Protein Block Expert (PBE): a web-based protein structure analysis server using a structural alphabet.

Authors: M Tyagi; P Sharma; C S Swamy; F Cadet; N Srinivasan; A G de Brevern; B Offmann
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

8. TM-align: a protein structure alignment algorithm based on the TM-score.

Authors: Yang Zhang; Jeffrey Skolnick
Journal: Nucleic Acids Res Date: 2005-04-22 Impact factor: 16.971

9. ProSMoS server: a pattern-based search using interaction matrix representation of protein structures.

Authors: Shuoyong Shi; Bhadrachalam Chitturi; Nick V Grishin
Journal: Nucleic Acids Res Date: 2009-05-06 Impact factor: 16.971

10. RAPIDO: a web server for the alignment of protein structures in the presence of conformational changes.

Authors: Roberto Mosca; Thomas R Schneider
Journal: Nucleic Acids Res Date: 2008-05-06 Impact factor: 16.971
