Protein Block Expert (PBE): a web-based protein structure analysis server using a structural alphabet.

M Tyagi, P Sharma, C S Swamy, F Cadet, N Srinivasan, A G de Brevern, B Offmann.

Abstract

Encoding protein 3D structures into 1D strings using short structural prototypes or structural alphabets opens a new front for structure comparison and analysis. Using the well-documented 16 motifs of Protein Blocks (PBs) as a structural alphabet, we have developed a methodology to compare protein structures that are encoded as sequences of PBs by aligning them with dynamic programming and a substitution matrix for PBs. This methodology is implemented in the applications available on the Protein Block Expert (PBE) server. PBE addresses common issues in the field of protein structure analysis, such as comparison of protein structures and identification of protein structures in structural databanks that resemble a given structure. PBE-T provides a facility to transform any PDB file into a sequence of PBs. PBE-ALIGNc compares two protein structures based on the alignment of their corresponding PB sequences. PBE-ALIGNm is a facility for mining the SCOP database for similar structures based on the alignment of PBs. In addition, PBE provides an interface to a database (PBE-SAdb) of preprocessed PB sequences from SCOP culled at 95% and of all-against-all pairwise PB alignments at family and superfamily levels. The PBE server is freely available at http://bioinformatics.univ-reunion.fr/PBE/.

Year: 2006 PMID: 16844973 PMCID: PMC1538797 DOI: 10.1093/nar/gkl199

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The central paradigm of protein science holds that protein function is directly controlled by protein structure. With the growing number of solved protein structures, structure comparison methods are becoming increasingly important. A number of semi-automated or fully automated structure comparison methods have been developed, based on approaches such as alignment of secondary structure elements (1–3), environmental profiles (4) and distance matrices (5). Most of these methods rely on regular secondary structure information. By analyzing local protein structures, many groups have found recurring short structural motifs, collectively called a structural alphabet (SA), that span structural space (6–8). These short motifs represent the local structural variations in protein space from which the backbone model of most proteins can be built. They have been shown to be informative for analyzing protein structures (9) and have been used in structure prediction (10), backbone reconstruction (11,12) and loop modeling (13).

We present a web-based service called Protein Block Expert (PBE) for protein structure comparison and analysis using an SA of 16 pentapeptide structural motifs known as 'Protein Blocks' (PBs) (14,15). A protein structure can be encoded into a sequence of PBs by sliding an overlapping window of five residues along the backbone. This simplified 1D representation of protein structure can then be analyzed just like an amino acid sequence to find similarities, dissimilarities and relationships among proteins in terms of structure. In this respect, the approach used by PBE is similar to classical sequence alignment (16,17). Its concept is similar to that of the SA-Search (18) web server, but it differs greatly in that it uses a specialized SA substitution matrix derived from the aligned homologous proteins present in a large database of Phylogeny and ALIgnment of homologous protein structures (PALI) (19,20). Applications and validation of such a matrix have been demonstrated (M. Tyagi, V. S. Gowri, N. Srinivasan, A. G. de Brevern, B. Offmann, manuscript submitted). PBE is not only a service for finding structural similarities between proteins or a mining tool for recognizing the fold of a protein structure; it also provides an interface to a database for studying proteins in terms of PBs at the levels of superfamily and family.

PBE provides the following features to the user: (i) a tool to encode a protein structure into a PB sequence; (ii) structure comparison between a pair of proteins using their PB descriptions, with both local and global alignment algorithms; (iii) mining of a databank, derived from SCOP, for proteins with a similar fold; and (iv) access to a database of preprocessed PB sequences and pairwise alignments at family and superfamily levels. PBE is freely accessible at http://bioinformatics.univ-reunion.fr/PBE/.

PBE-T: ENCODING PROTEIN STRUCTURE INTO PROTEIN BLOCKS

The set of PBs consists of 16 structural motifs, each five residues long (14,15). Each PB is represented by a vector of eight (φ,ψ) dihedral angles associated with five consecutive Cα atoms, and the 16 PBs are denoted by the letters a, b, …, p. Encoding a protein 3D structure into a sequence of PBs, as implemented in our server, is a two-step process. First, the protein backbone is encoded into a sequence of (φ,ψ) angles calculated from backbone atomic positions. Second, an overlapping window of five Cα atoms, i.e. a vector of eight (φ,ψ) angles, is moved along the backbone. The PB for each window is assigned on the basis of the smallest dissimilarity measure, the root mean square deviation on angular values or r.m.s.d.a. (21), between the observed (φ,ψ) values in the window and the standard dihedral angles of each PB. PBs have been used in several prediction methods (22–24). This encoding is easily performed using PBE-T, which accepts a structure as input and lists the sequence of PBs in the structure.
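The window-based assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server's Java implementation: the function names are hypothetical, the ordering of the eight angles (ψ of the first residue, φ and ψ of the three central residues, φ of the last) follows the usual PB convention rather than anything stated here, and the two prototypes labeled "H" and "E" are made-up placeholders, not the published reference angles of the 16 PBs.

```python
import math

# Made-up placeholder prototypes (NOT the published PB reference angles).
# Each vector holds eight angles: psi1, phi2, psi2, phi3, psi3, phi4, psi4, phi5.
TOY_PROTOTYPES = {
    "H": [-45.0, -60.0, -45.0, -60.0, -45.0, -60.0, -45.0, -60.0],      # helix-like
    "E": [130.0, -120.0, 130.0, -120.0, 130.0, -120.0, 130.0, -120.0],  # strand-like
}


def angular_diff(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def rmsda(observed, prototype):
    """Root mean square deviation on angular values between two angle vectors."""
    if len(observed) != len(prototype):
        raise ValueError("angle vectors must have the same length")
    total = sum(angular_diff(o, p) ** 2 for o, p in zip(observed, prototype))
    return math.sqrt(total / len(observed))


def assign_pb(window, prototypes):
    """Return the letter of the prototype with the smallest r.m.s.d.a. to the window."""
    return min(prototypes, key=lambda letter: rmsda(window, prototypes[letter]))


def encode(phi_psi, prototypes):
    """Slide a five-residue window over per-residue (phi, psi) pairs.

    Returns one letter per window, so a chain of n residues yields n - 4 letters.
    """
    letters = []
    for start in range(len(phi_psi) - 4):
        five = phi_psi[start:start + 5]
        window = [five[0][1]]            # psi of the first residue
        for phi, psi in five[1:4]:       # phi and psi of the three central residues
            window.extend([phi, psi])
        window.append(five[4][0])        # phi of the last residue
        letters.append(assign_pb(window, prototypes))
    return "".join(letters)
```

With the real set of 16 reference vectors substituted for `TOY_PROTOTYPES`, `encode` would return a string over the letters a to p. The angle wrapping in `angular_diff` matters because 170° and -170° are only 20° apart.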

PBE-ALIGNc: PROTEIN STRUCTURE COMPARISON USING PROTEIN BLOCKS

Analyzing sequences of PBs with classical amino acid sequence alignment algorithms makes it possible to search for structural similarities between two proteins using a reduced-complexity representation of protein structure. The PBE server has been designed and implemented to meet this need. It allows the user to compare two proteins with a simple dynamic programming (DP) algorithm that aligns the two PB sequences using our PB substitution matrix. The substitution matrix was derived by re-encoding, in terms of PBs, the structurally aligned homologous proteins present in the PALI database (19,20). A detailed description of its calculation, a discussion of the PB substitution matrix and its proposed applications are reported elsewhere (M. Tyagi, V. S. Gowri, N. Srinivasan, A. G. de Brevern, B. Offmann, manuscript submitted). Local structural similarities between two uploaded protein structures are thus found using PB sequence alignment. This approach has already been benchmarked against standard flexible alignment methods like DALI (5) and rigid body superposition methods like STAMP (25): >75% of the structurally equivalent residues in our PB alignments overlapped with those identified by these standard methods. Moreover, careful inspection of the coordinates superimposed by PBE-ALIGNc after aligning PBs shows r.m.s.d. values identical to, and in some cases lower than, those obtained with DALI (M. Tyagi, V. S. Gowri, N. Srinivasan, A. G. de Brevern, B. Offmann, manuscript submitted). These results are expected to improve with a more robust DP algorithm combined with optimized gap penalties. Interestingly, in the same study we showed that the PB alignment method is able to pick up subtle similarities at the local level between two proteins. Hence PB alignment provides both local and global flavors of DP algorithms.

In PBE-ALIGNc, the user is required to upload two protein structures in PDB format. The 3D structures are transformed into 1D PB sequences, which are then aligned using the DP algorithm. If the uploaded structures have more than one chain, the user is given the option to select any one of the possible chain pairs for alignment, and the selected pair is then aligned. The output displays the aligned PB sequences along with information such as the lengths of the proteins, the alignment length and the best-fit superposition r.m.s.d. value, computed with the ProFit program based on the McLachlan algorithm (26). The PB alignment is transformed into an amino acid alignment to define the equivalent regions required by ProFit, and further iterations are performed to obtain the best-fit r.m.s.d. value. The server makes it possible to download the initial PB and corresponding amino acid alignments in Fasta format as well as the superimposed coordinates of the two structures. As PBE requires only backbone atoms to generate a PB sequence and is independent of residue type, the user can upload anonymized protein structures by changing all residues to any one kind and providing only the coordinates of backbone atoms in PDB format. Hence newly solved structures can be easily analyzed without making them public.
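To make the alignment step concrete, here is a minimal sketch of global DP alignment (Needleman-Wunsch) applied to two PB strings. It is not the server's C implementation. The linear gap penalty of -2 and the `toy_score` function (2 for identical PBs, -1 otherwise) are assumptions chosen for illustration; the actual PB substitution matrix and gap penalties are not given in this text.

```python
def toy_score(x, y):
    """Hypothetical stand-in for the PB substitution matrix."""
    return 2.0 if x == y else -1.0


def global_align(seq_a, seq_b, score=toy_score, gap=-2.0):
    """Needleman-Wunsch alignment of two PB strings with a linear gap penalty.

    Returns (raw_score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] is the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # match/mismatch
                dp[i - 1][j] + gap,                                    # gap in seq_b
                dp[i][j - 1] + gap,                                    # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `global_align("ddmmd", "ddmd")` places a single gap in the shorter string. Columns where neither string has a gap are the kind of equivalent positions that could then be handed to a superposition program such as ProFit.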

PBE-ALIGNm: MINING SCOP DATABASE FOR PROTEINS WITH SIMILAR FOLD

PBE-ALIGNm uses a database of protein domains derived from the SCOP database (27). Protein structures were extracted from SCOP 1.65 via the ASTRAL (28) server using a sequence identity cutoff of 95% (SCOP95), yielding 9392 domains. These domains were encoded into PB sequences and are made available for the user to query at family and superfamily levels in the PBE-SAdb database. Further, extensive all-against-all pairwise PB sequence alignments between 7195 domains were generated using DP and our PB substitution matrix. Protein domains in SCOP95 with any chain breaks were not considered for the PB sequence alignment process. Pairwise alignments within each of the seven major structural classes of SCOP95, which amount to 5 405 433 alignments, are featured in the PBE-SAdb database, where the user can view or download pairwise PB alignments at the level of family or superfamily. Each PB alignment in the generated databank has a raw score given by the DP algorithm. To remove the dependence of this value on the lengths of the two proteins, the score is also provided in normalized form, obtained by dividing the raw score by the length of the alignment including gaps. This normalized score from the global alignment algorithm is used to rank alignments in the following analyses. In the first study, we analyzed how well our method discriminates between the various SCOP classes, in other words, how much the scores overlap between classes on the basis of PB sequence alignment. This question is important since the 1D representation of protein structure as a PB sequence lacks topological information, which can create confusion when two proteins with different topologies share an identical linear sequence of regular structures. A dataset of 1500 protein domains was selected randomly from SCOP95, keeping the relative proportions of the seven major classes the same as in the original databank. All-against-all PB sequence alignment was performed for this dataset.
A jackknife approach was adopted to perform a comprehensive analysis. Each domain in turn was selected and queried against the databank to find the top 10 ranking PB alignments for that query, and statistics were calculated for true hits at each rank position. The appearance of the same SCOP CLASS among the top 10 ranks was considered a true hit. Analysis of the distribution of true hits shows that 85.9% of queries find a true hit at first rank and that a hit rate of 98.2% is achieved when the first 10 ranking alignments are considered (data not shown). It should be noted that the value increases from 85.9 to 93% when the same analysis is performed on the 7267 × 7267 pairwise alignments. A confusion matrix between the seven SCOP classes was also calculated, taking into account only the top hit for each query. The matrix is populated simply according to whether the query protein and the first-rank protein have the same class or not. Table 1 shows the resulting matrix. Among the four main classes (alpha, beta, alpha/beta and alpha plus beta), alpha plus beta was the most confused, with 76.2% of cases finding their own class at first rank. The beta class was the best behaved, with an accuracy of 94.4%, followed by the alpha/beta and alpha classes. The low accuracy for the multi-domain and membrane classes can be attributed to the very low number of proteins of these classes present in our dataset.
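The score normalization and the rank-based hit statistics described above can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, and a query is counted as a hit at top N if at least one of its N best-ranked alignments shares its SCOP class, which is one plausible reading of the procedure. The class labels in the usage example are toy data, not results from the study.

```python
def normalized_score(raw_score, aligned_a, aligned_b):
    """Raw DP score divided by the alignment length, gaps included."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    return raw_score / len(aligned_a)


def first_true_hit_rank(query_label, ranked_labels, top_n=10):
    """1-based rank of the first hit sharing the query's label, or None."""
    for rank, label in enumerate(ranked_labels[:top_n], start=1):
        if label == query_label:
            return rank
    return None


def hit_rate(queries, top_n):
    """Fraction of queries with at least one true hit among their top_n hits.

    `queries` is a list of (query_label, ranked_labels) pairs, where
    ranked_labels are the labels of the hits sorted by decreasing
    normalized score, with the query itself left out (jackknife).
    """
    if not queries:
        raise ValueError("no queries given")
    hits = sum(
        1 for label, ranked in queries
        if first_true_hit_rank(label, ranked, top_n) is not None
    )
    return hits / len(queries)


# Toy example: three queries with the labels of their two best hits.
toy_queries = [
    ("alpha", ["alpha", "beta"]),
    ("beta", ["alpha", "beta"]),
    ("small", ["alpha", "beta"]),
]
top1 = hit_rate(toy_queries, top_n=1)
top10 = hit_rate(toy_queries, top_n=10)
```

The same functions apply unchanged to fold-level statistics if the labels are SCOP folds instead of classes.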

Table 1

Mining SCOP for similar structures using PB alignment

True class versus hit class	ALPHA	BETA	ALPHABETA	APLUSB	MULTIDOM	MEMBRANE	SMALL	Total
ALPHA	245 (88.1%)	1	12	9	1	5	5	278
BETA	2	404 (94.4%)	5	10	0	1	6	428
ALPHABETA	3	5	255 (89.5%)	18	3	0	1	285
APLUSB	16	23	27	240 (76.2%)	0	1	8	315
MULTIDOM	0	0	5	2	11 (61.1%)	0	0	18
MEMBRANE	10	5	0	1	1	12 (41.3%)	0	29
SMALL	2	15	0	8	0	0	122 (84.7%)	144
								1500

Confusion matrix between true (vertical) and predicted (horizontal) SCOP classes.
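As a quick consistency check, the overall first-rank hit rate quoted in the text can be recomputed from the cell counts of Table 1. The sketch below copies the seven rows of counts from the table (rows are true classes, columns are hit classes); the variable and function names are illustrative only.

```python
CLASSES = ["ALPHA", "BETA", "ALPHABETA", "APLUSB", "MULTIDOM", "MEMBRANE", "SMALL"]

# Cell counts copied from Table 1: one row per true class, one column per hit class.
CONFUSION = [
    [245, 1, 12, 9, 1, 5, 5],
    [2, 404, 5, 10, 0, 1, 6],
    [3, 5, 255, 18, 3, 0, 1],
    [16, 23, 27, 240, 0, 1, 8],
    [0, 0, 5, 2, 11, 0, 0],
    [10, 5, 0, 1, 1, 12, 0],
    [2, 15, 0, 8, 0, 0, 122],
]


def top1_accuracy(matrix):
    """Fraction of queries whose first-ranked hit has the query's own class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


overall = top1_accuracy(CONFUSION)
print(f"{100 * overall:.1f}% of {sum(map(sum, CONFUSION))} queries")
```

The diagonal holds 1289 of the 1500 queries, which reproduces the 85.9% first-rank figure.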

In a second study, we assessed how well PB alignment can extract proteins of similar fold from a databank within a given class. A jackknife approach (as in the previous analysis, cf. supra) was applied to calculate statistics for identifying the true FOLD of a protein as defined by SCOP at various levels. Figure 1 shows the distribution of true hits at different rank positions. It is noteworthy that 81.3% of true hits are at first rank, while 89.3% of true hits are within the top 10 ranking alignments.

Figure 1

Mining SCOP for similar structures using PB alignment. Distribution of number of hits in top 10 ranking alignments. If a given query and extracted alignment have same FOLD, a hit is counted at that position.

The efficiency of mining similar folds within each of the seven major classes was studied further, and the results are reported in Table 2. For each class, the hit rate was calculated at three levels, top10, top5 and top1, where the first 10, the first 5 and the first ranking alignments were considered, respectively. The ability of our method to extract the same SCOP FOLD at the top10 level varies from a hit rate of 86.1% for the alpha class to 93.6% for the alpha/beta class. Similarly, at the top1 level, the hit rate varies from 70.3% (small protein class) to 88.4% (alpha/beta class). The consistently good hit rates across the various classes when mining similar folds with the PB alignment method support the basic ability of the method and the quality of the substitution matrix.

Table 2

Mining SCOP for similar structures using PB alignment

SCOP class	Top10 (%)	Top5 (%)	Top1 (%)
Alpha (1312)	86.1 (1130)	82.6 (1087)	75.0 (985)
Beta (2076)	92.9 (1930)	91.4 (1897)	87.2 (1811)
AlphaBeta (1386)	93.6 (1298)	92.0 (1275)	88.4 (1226)
AplusB (1500)	88.3 (1325)	86.3 (1294)	81.3 (1219)
Small (700)	87.7 (614)	84.3 (590)	70.3 (492)
Membrane (139)	91.4 (127)	89.2 (124)	81.3 (113)
MultiDomain (82)	85.4 (70)	84.1 (69)	81.7 (67)

Hit rates (in percentage) for identifying similar folds within each SCOP class. Rates are given for the Top10, Top5 and Top1 ranking alignments. Exact numbers for each case are given in parentheses.

These results illustrate that the PB substitution matrix, used with a simple DP algorithm and a naïve scoring function, is efficient at extracting proteins that share structural similarities from a large dataset. PBE-ALIGNm provides this facility for mining structural similarities from a databank using a reduced representation of protein structures. The user can upload a protein structure in PDB format and decide against which databank the structure is to be queried. PBE gives the option to select a local or global alignment algorithm and to set parameters such as the minimum length of the proteins against which the query should be aligned. The user can also choose to align against the whole databank or only against proteins of a specific SCOP CLASS. Typically, the runtime for a query is less than one minute.

INTERFACE TO PBE DATABASE

The PBE server provides another feature for protein structure analysis using structural alphabets. We have created two databases of protein structures, which are grouped under the PBE-SAdb facility. The first is a database of 9392 protein domains extracted from SCOP95 and translated into PB sequences. The second is a database of all possible pairwise PB sequence alignments within each SCOP class. An interface allows either of these two databases to be queried at superfamily or family level by entering the appropriate SCOP code. A list of all family and superfamily codes present in our database, with their descriptions, is available in the help section. In addition, PB sequences or alignments can be accessed by specifying the PDB id of a protein. Because the structures were filtered at a 95% sequence identity cutoff, not every PDB entry is present; the list of available PDB structures can be checked in the help section. Outputs, in the form of PB sequences or PB alignments in Fasta format, can be easily downloaded. The facility to query and download PB sequences or PB alignments at family or superfamily level is expected to be of great help in studying protein structure conservation. It can also aid studies of variations in homologous proteins in terms of structural alphabets, which might provide better insight into the sequence to structure relationship. Analysis of PB alignments to study the conservation or variability of local structures is expected to provide a better understanding of the relationship between structure and function in homologous proteins.

TECHNICAL ASPECTS

Pairwise PB alignments for each class were calculated on a 32-processor IBM AIX52 machine. The databases of PB sequences and alignments are maintained using a MySQL server. The web server front end and back-end processing are handled using HTML, CGI and Perl scripting, along with Java (PB encoding of proteins) and C (DP) programs. Job requests in PBE-ALIGNm are queued and assigned a randomly generated job-id, which makes jobs inaccessible to other users of the server. The PBE server is maintained on a Linux-based single processor machine and is accessible at http://bioinformatics.univ-reunion.fr/PBE/.

DISCUSSION AND PERSPECTIVES

Reducing the complexity of protein space from three dimensions to one dimension, combined with sequence analysis methods to study protein structure, has opened a simple and exciting way of looking at protein structure space. Initial results in mining similar fold structures in databases and finding local structural similarities between proteins are a promising start, though still far from exploiting the full potential of such a methodology. The low confusion rate across the various SCOP classes and the high efficiency in mining similar fold proteins from a large database with a naïve scoring scheme indicate that the PB alignment method is efficient enough to discriminate between different topologies despite the lack of topological information. This success can be attributed both to the ability of PBs to represent local structural properties in a more refined form than a simple SSE representation and to the quality of the substitution matrix. The sequences of PBs between regular SSEs, and their alignment or misalignment, might play an important role in discriminating between correct and incorrect hits. Pairwise comparison of proteins using PBE-ALIGNc performed reasonably well when compared to standard methods like DALI, and this was further validated on a large scale here (7195 × 7195 pairwise alignments). Structure alignment using PBs has also shown its ability to locate subtle similarities at the local level and to mine very efficiently for local structural similarities in large structural databases. PB alignment is moreover expected to be very advantageous in cases of distantly related proteins, where residue–residue alignment is difficult to obtain. Finally, because prediction of the protein backbone in terms of PB sequences is possible from the amino acid sequence (14), this work opens up interesting perspectives for large-scale structural annotation of genomic data.

26 in total

1. The ASTRAL compendium for protein structure and sequence analysis.

Authors: S E Brenner; P Koehl; M Levitt
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. PALI-a database of Phylogeny and ALIgnment of homologous protein structures.

Authors: S Balaji; S Sujatha; S S Kumar; N Srinivasan
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

3. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks.

Authors: A G de Brevern; C Etchebest; S Hazout
Journal: Proteins Date: 2000-11-15

4. Protein structure alignment using environmental profiles.

Authors: J Jung; B Lee
Journal: Protein Eng Date: 2000-08

5. Small libraries of protein fragments model native protein structures accurately.

Authors: Rachel Kolodny; Patrice Koehl; Leonidas Guibas; Michael Levitt
Journal: J Mol Biol Date: 2002-10-18 Impact factor: 5.469

6. Integration of related sequences with protein three-dimensional structural families in an updated version of PALI database.

Authors: V S Gowri; Shashi B Pandit; P S Karthik; N Srinivasan; S Balaji
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

7. Extension of a local backbone description using a structural alphabet: a new approach to the sequence-structure relationship.

Authors: Alexandre G de Brevern; Hélène Valadié; Serge Hazout; Catherine Etchebest
Journal: Protein Sci Date: 2002-12 Impact factor: 6.725

8. SA-Search: a web tool for protein structure mining based on a Structural Alphabet.

Authors: Frédéric Guyon; Anne-Claude Camproux; Joëlle Hochez; Pierre Tufféry
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

9. New assessment of a structural alphabet.

Authors: Alexandre G de Brevern
Journal: In Silico Biol Date: 2005-03-16

10. Use of a structural alphabet for analysis of short loops connecting repetitive structures.

Authors: Laurent Fourrier; Cristina Benros; Alexandre G de Brevern
Journal: BMC Bioinformatics Date: 2004-05-12 Impact factor: 3.169
