CPred: a web server for predicting viable circular permutations in proteins.

Wei-Cheng Lo, Li-Fen Wang, Yen-Yi Liu, Tian Dai, Jenn-Kang Hwang, Ping-Chiang Lyu.

Abstract

Circular permutation (CP) is a protein structural rearrangement phenomenon through which nature allows structural homologs to have different locations of termini and thus varied activities, stabilities and functional properties. It can be applied in many fields of protein research and bioengineering. The application of CP is limited by its technical complexity, high cost and the uncertain viability of the resulting protein variants. Not every position in a protein can be used to create a viable circular permutant, yet practical computational tools for evaluating the positional feasibility of CP before costly experiments are carried out are still lacking. We have previously designed a comprehensive method for predicting viable CP cleavage sites in proteins. In this work, we implement that method as an efficient and user-friendly web server named CPred (CP site predictor), which should help promote fundamental research and biotechnological applications of CP. CPred is accessible at http://sarst.life.nthu.edu.tw/CPred.

Year: 2012 PMID: 22693212 PMCID: PMC3394280 DOI: 10.1093/nar/gks529

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The protein structural rearrangement phenomenon termed circular permutation (CP) can be viewed as if the amino- and carboxyl-termini of a protein were relocated along the circularized sequence of the protein. Although the mechanisms underlying natural CP cases are not fully understood (1–5), many CPs have been observed in well-known protein families [see (6) for summaries of proposed mechanisms for CP and natural CP cases]. To study CP, many artificial circular permutants have been generated. These studies have indicated that as long as the CP site, i.e. the position for creating new termini, is not at a residue essential for protein folding or function, circular permutants usually retain native structures and biological functions (1,3,7–9), although the structural stabilities, folding mechanisms and enzymatic activities might be changed (10–15). These discoveries have made CP a new mutagenesis method for studying protein structure and function (16–18) and a bioengineering technique to modify the stability, solubility and activities of proteins (13,19–21). Moreover, the CP technique allows two proteins to be covalently linked at positions other than their native termini, facilitating the creation of several useful protein switches, molecular biosensors and fusion proteins (22–24). Despite these interesting applications, implementing CP is much more difficult, expensive and time-consuming than traditional mutagenesis. Most importantly, not all positions in a protein structure are permissible for CP. However, since practical software for predicting viable CP sites (i.e. positions leading to correctly folded and stable permutants) is still unavailable, researchers interested in CP-based protein engineering may have to rely on costly trial and error to find appropriate CP sites.
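To make the definition concrete, the following minimal sketch builds the sequence of a circular permutant by joining the native termini and reopening the chain at a chosen CP site. The function name, the 1-based site convention (the residue at `site` becomes the new N-terminus) and the optional linker are illustrative assumptions and are not taken from CPred.

```python
def circular_permutant(seq: str, site: int, linker: str = "") -> str:
    """Return the sequence of a circular permutant.

    Assumed convention: `site` is 1-based and the residue at `site`
    becomes the new N-terminus. The native C- and N-termini are joined,
    optionally through `linker`.
    """
    if not 1 <= site <= len(seq):
        raise ValueError("site must lie within the sequence")
    cut = site - 1
    # Everything from the cut onward comes first, then the old N-terminal part.
    return seq[cut:] + linker + seq[:cut]


if __name__ == "__main__":
    native = "MKTAYIAKQR"  # toy sequence, not a real protein
    print(circular_permutant(native, 4))         # AYIAKQRMKT
    print(circular_permutant(native, 4, "GGS"))  # AYIAKQRGGSMKT
```

With `site=1` the function returns the native sequence unchanged, which is a quick sanity check on the indexing convention.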
To facilitate fundamental research based on CP and biotechnological applications of CP-based mutagenesis, we aim in this work to develop an effective web-based tool for predicting viable CP sites. CPs tend to occur at positions with high solvent accessibility (25), low sequence conservation and low ‘closeness’ (26), a structure-derived residue feature describing the number of residues with which a given residue may interact directly or indirectly (27). However, predicting viable CP sites based on these properties yielded only marginal performance; the area under the receiver operating characteristic curve (AUC) values were all ≤0.7 (26). The major difficulty in developing CP viability predictors was the insufficiency of data, particularly data on inviable CP sites (i.e. negative cases). In fact, the aforementioned predictors were developed and assessed with a data set composed of only one protein, dihydrofolate reductase (DHFR), the entire 159-residue polypeptide of which had been subjected to systematic CP (25). Although large data sets of CP-related protein structural homologs, such as the CP Database (CPDB) (28) and the database of GANGSTA+ Internet Services (GIS) (29), have been available since 2009, there is still a lack of negative cases. This is because most wet-lab studies reported only viable CP sites, and bioinformatics CP-detecting methods can identify only existent (meaning viable) circular permutants. The DHFR data set contained only 73 negative cases, far from enough for developing reliable predictors. Recently, we established several highly different data sets for developing viable CP site prediction methods (30). Among them, nrCPDB-40 and nrGIS-40 contained thousands of proteins with machine-identified viable CP sites, whereas Data set T consisted of six proteins other than DHFR with both experimentally verified viable and inviable CP sites, expanding the number of negative cases by 2.4-fold (30).
Based on these data sets, the sequence and structural preferences of CP were extensively examined (30). The identified preferences were quantified into numerous features to develop a CP viability prediction method that combined four machine learning algorithms: artificial neural networks, the support vector machine, a random forest and a hierarchical feature integration procedure (30). When trained with Data set T, this method achieved an AUC of 0.91 for the DHFR data set and a large-scale prediction sensitivity of ≥0.72 for both nrCPDB-40 and nrGIS-40. However, this effective CP site prediction method was not efficient. Because of the heavy computational loads caused by several structural features and the time-consuming data flow through numerous prediction models, it took minutes to process one protein. In the present work, we have implemented the CP viability prediction method as a user-friendly and quick-response web server named CPred. Distributed computation techniques are used to accelerate the whole procedure, which now takes only seconds to make predictions. CPred is currently the most accurate method and the only web server for predicting viable CP sites. We hope that it will be a useful aid for scientists and bioengineers who study and apply CP.

MATERIALS AND METHODS

The flowchart of CPred is illustrated in Figure 1a. After receiving the query protein structure from the input module, the main program distributes to several processors the computation tasks of feature values, which are collected again by the main program. The main program then creates four threads running different machine learning predictors, the results of which are integrated and processed by the main program and are finally delivered to the output interface. If the protein structure is input by specifying a PDB [Protein Data Bank (31)] or a Structural Classification of Proteins (32) entry identifier, the calculated feature values and final results will be cached to ensure a quick response once the same protein is queried again in the future.
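The data flow described above can be sketched as follows. This is only an illustration of the scatter, gather and cache pattern: the function `run_pipeline`, the in-memory cache and the toy feature and predictor callables are all hypothetical, and Python threads are used for brevity, whereas a real deployment of this kind would spread CPU-bound work over separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

_cache = {}  # hypothetical in-memory cache keyed by structure identifier


def run_pipeline(entry_id, residues, feature_funcs, predictors, workers=4):
    """Compute features in parallel, run each predictor in its own worker,
    and cache the per-predictor, per-residue scores under `entry_id`."""
    if entry_id in _cache:
        return _cache[entry_id]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 1: every feature function is evaluated by a separate worker.
        columns = list(pool.map(lambda f: f(residues), feature_funcs))
        # Regroup the columns so that each residue has one feature vector.
        vectors = list(zip(*columns))
        # Step 2: every predictor scores all residues in its own worker.
        scores = list(pool.map(lambda p: [p(v) for v in vectors], predictors))
    _cache[entry_id] = scores
    return scores


if __name__ == "__main__":
    residues = ["M", "K", "T"]  # toy input
    features = [
        lambda rs: [float(i) for i, _ in enumerate(rs)],
        lambda rs: [1.0 if r == "K" else 0.0 for r in rs],
    ]
    predictors = [lambda v: min(1.0, v[0] / 2), lambda v: v[1]]
    print(run_pipeline("toy", residues, features, predictors))
```

A second call with the same `entry_id` returns the cached result without recomputation, mirroring the caching of queries made by PDB or SCOP identifier.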

Figure 1.

The flowchart and output of CPred. (a) CPred is a web server for predicting viable circular permutation cleavage sites that is built on distributed computation techniques. After receiving the query protein data, the main program of CPred extracts feature values, executes machine learning subroutines, integrates the prediction results and delivers the final results to the output interface. The computation loads of many steps are distributed to several processors, as indicated by the radial arrow lines. Some structural features and machine learning methods require much more computation power than others; the subroutines responsible for them, represented by multicelled boxes, also apply distributed techniques. (b) The output interface of CPred provides a list (lower right) and a graphic profile (lower left) of the probability scores of all residues in the input protein. The structure, along with predicted viable CP sites, is presented by an interactive Jmol (33) object (upper left), which allows the user to change the display mode (cartoon, spacefilled, etc.) and to rotate, resize and dissect the structure. A downloadable text version of the CPred results is provided as well (upper right). The structures shown in panels (a) and (b) were rendered using PyMol (45) and Jmol (33), respectively.

Experimental data sets

Literature-derived data sets: Data set T and the DHFR data set

Information on inviable CP sites is rare, and it is extremely difficult to find a protein with both experimentally verified viable and inviable CP sites. Before our previous work (30), DHFR was the only data set for training and evaluating CP site predictors. By screening the literature, we additionally collected six such proteins and established Data set T. Together, Data set T (76 viable and 100 inviable CP sites) and the DHFR data set (86 viable and 73 inviable CP sites) form the largest collection of CP sites currently available with both viable and inviable sites. Data set T and the DHFR data set share very low sequence identity (<9%); the former was used to train and test our prediction system and the latter was used as an independent evaluation data set. These data sets are available in (30).

Database-extracted data sets: nrCPDB-40 and nrGIS-40

CPDB (28) and GIS (29) are the largest protein structural databases providing information about CP-related structural homologs. Previously, we reduced these databases to 40% sequence identity non-redundant subsets, nrCPDB-40 and nrGIS-40 (30). Any protein in these two data sets that shared >40% sequence identity with any protein in Data set T or the DHFR data set was further eliminated. Finally, the nrCPDB-40 and nrGIS-40 data sets contained 1059 and 2814 proteins, respectively, and any two data sets among nrCPDB-40, nrGIS-40, Data set T and the DHFR data set shared <40% sequence identity [see (30) for details].

Non-redundant data set of CP site: nrCPsitecpdb-40

All CP sites of the proteins in nrCPDB-40 had been extracted (30). Each CP site was represented by a 20-residue segment. These 20-residue CP site representative segments were reduced to a 40% sequence identity non-redundant subset named nrCPsitecpdb-40 (1087 CP sites). Note that a protein may possess more than one viable CP site, and thus the number of non-redundant CP sites (1087) in the nrCPsitecpdb-40 data set is larger than the number of proteins (1059) in the nrCPDB-40 data set. The aforementioned data sets were released as part of the supporting data of (30).

Computation of feature values

The CPred system extracts 46 features from an input PDB file (see Supplementary Table S1). Based on a statistical significance test known as the permutation test (33), we had previously examined the sequence and secondary structural propensities of CP by comparing the compositions of single-, oligo- and coupled-residue patterns of amino acid sequence and several secondary structural strings between the CP site segments of nrCPsitecpdb-40 and the whole protein sequences of nrCPDB-40 (30). A secondary structural string, for instance, a Ramachandran structural string (34), is a text description of the secondary structure or backbone conformation of a protein. In CPred, these propensities are quantified by using the propensity scoring algorithm proposed in (30). Before CPred extracts tertiary structural features, the reduce program (35) is used to restore hydrogen atoms to the PDB file. Structure-derived residue measures and properties, e.g. the closeness (26), relative solvent accessibility, centroid distance measure (36), weighted contact number (37), farness (30) and the Gaussian Network Model-derived mean-square fluctuation (38–40), are then computed. All the obtained propensity scores and residue measures are standardized based on the conventional Z-score method (30) into real number features suitable for applying machine learning methods.
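The final standardization step can be illustrated with a short sketch of the conventional Z-score. The function name is hypothetical, and two details are assumptions made here for illustration: the mean and standard deviation are taken over the values passed in (for example, one feature across all residues of a protein), and the population standard deviation is used. The reference distribution actually used by CPred is described in (30).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize raw feature values to zero mean and unit variance.

    Assumption: the population standard deviation of `values` is used.
    A constant feature carries no information and is mapped to zeros.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Toy relative solvent accessibility values for five residues.
    print(z_scores([0.10, 0.35, 0.80, 0.05, 0.60]))
```

Standardizing each feature this way puts measures with very different natural scales, such as solvent accessibility and contact numbers, on a common footing before they reach the machine learning methods.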

Application of machine learning methods

Four machine learning methods are applied in the CPred system: (i) a three-layered and back propagation-based (41,42) artificial neural network; (ii) a support vector machine established with LIBSVM (43); (iii) a random forest of 500 decision trees generated by the C4.5 package (44); and (iv) a hierarchical feature integration procedure, in which features are hierarchically classified into a rooted tree that directs how the feature values are integrated into a single value (30). To efficiently integrate the prediction results from these methods, the output of every method for each residue has been designed to be a real number score between 0 and 1 [see (30) for algorithms]. Because these scores range from 0 to 1 and are conceptually proportional to the chance that a case is positive, we have termed them ‘probability scores’ for convenience (30). Since the scores share the same range, we integrate the prediction power of the various methods by simply averaging their probability scores into an integrated score, on which predictions are based. The feasibility of these individual and integrated methods for predicting viable CP sites was established in (30), where the detailed algorithms, parameter settings and performance data are available. In the current work, a major problem in applying these methods is that the data flow through the 500 decision trees of the random forest is very time-consuming. To solve this problem, as illustrated in Figure 1a, distributed computation techniques are used.
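The integration step amounts to a per-residue arithmetic mean of the four probability scores. The sketch below illustrates it; the function names, the dictionary keys and the toy scores are hypothetical, while the 0.5 decision threshold is the one used in this article to call a residue a predicted viable CP site.

```python
def integrated_scores(per_method_scores):
    """Average per-residue probability scores across methods.

    `per_method_scores` maps a method name to a list of scores in [0, 1],
    one score per residue, with all lists of equal length.
    """
    columns = list(per_method_scores.values())
    n = len(columns[0])
    if any(len(c) != n for c in columns):
        raise ValueError("every method must score the same residues")
    return [sum(residue) / len(residue) for residue in zip(*columns)]


def predict_viable(scores, threshold=0.5):
    """Flag residues whose integrated score reaches the threshold."""
    return [s >= threshold for s in scores]


if __name__ == "__main__":
    toy = {  # hypothetical scores for two residues
        "neural_network": [0.9, 0.2],
        "svm": [0.8, 0.4],
        "random_forest": [0.7, 0.1],
        "hierarchical": [1.0, 0.3],
    }
    merged = integrated_scores(toy)
    print(merged, predict_viable(merged))
```

An unweighted mean is only meaningful because every method emits scores on the same 0 to 1 scale, which is why the outputs were designed that way.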

PERFORMANCE

Evaluations of the prediction system with cross-validation techniques and independent data sets

Before our work, the best viable CP site prediction method was based on the closeness measure, which achieved an AUC of 0.7 on the DHFR data set (26) and sensitivity values of 0.62 and 0.61 on the nrCPDB-40 and nrGIS-40 data sets, respectively (30). In our previous study (30), in which the core method of the current CPred system was developed, Data set T was used to establish the prediction model, and the 10-fold cross-validated AUC, sensitivity, specificity and Matthews correlation coefficient values for this training data set were 0.91, 0.86, 0.79 and 0.63, respectively. When the established model was evaluated with the independent DHFR data set, these four performance measures were 0.91, 0.71, 0.92 and 0.64, respectively. A large-scale prediction test on this system registered sensitivity values of 0.75 and 0.72 for nrCPDB-40 and nrGIS-40, respectively [refer to (30) for details]. These data indicate that the core method of CPred outperformed previous methods with little data set dependence or overfitting. Since the CPred server runs the same core programs, its performance measures assessed with these data sets are the same as the values listed above. In the actual CPred web server, the prediction model is trained with a combined data set of Data set T and DHFR. Evaluations based on this combined data set with 10-fold cross-validation and on the independent data sets nrCPDB-40 and nrGIS-40 show that the accuracy of the actual server improves as the amount of training data increases (Table 1).

Table 1.

Performance of CP viability prediction of CPred

Data set	Performance measure	Closeness	CPred
Training set (Data set T + DHFR data set)^a	AUC	0.753	0.940
	Sensitivity	0.741	0.889
	Specificity	0.687	0.898
	Matthews correlation coefficient	0.428	0.787
nrCPDB-40	Sensitivity	0.622	0.746
nrGIS-40	Sensitivity	0.614	0.719

^a These results were obtained with 10-fold cross-validation.
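For readers less familiar with the measures in Table 1, the sketch below computes sensitivity, specificity and the Matthews correlation coefficient from a confusion matrix using their standard definitions. The function name and the counts in the example are hypothetical toy values, not the confusion matrix behind the published figures.

```python
from math import sqrt


def performance(tp, fn, tn, fp):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    tp/fn: viable sites predicted viable/inviable;
    tn/fp: inviable sites predicted inviable/viable.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


if __name__ == "__main__":
    # Toy counts chosen for illustration only.
    sens, spec, mcc = performance(tp=80, fn=20, tn=90, fp=10)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

Only sensitivity can be reported for nrCPDB-40 and nrGIS-40 because those data sets contain viable sites but no verified inviable ones, so `tn` and `fp` are undefined there.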

Evaluations of the developed probability score with information retrieval techniques

To help users interpret the results obtained with CPred, here we examine the average precisions of predictions at various decision thresholds of the probability score by performing 10-fold cross-validated information retrieval experiments. Table 2 demonstrates that a high threshold of probability score would retrieve fewer residues (i.e. a lower recall rate) but obtain a higher proportion of correct positive predictions (i.e. a higher precision) than a low threshold would. In the combined data set of Data set T and DHFR, any residue with a probability score ≥0.85 was an actual CP site (precision = 1). Since 82% of the residues predicted as viable CP sites (i.e. probability scores ≥0.5) were actual CP sites, this system is quite reliable. Experimenters expecting a high certainty about the viability of the created permutants may choose residues with probability scores ≥0.85 to apply CP; at this threshold, only 16% of all residues in a protein will be predicted as viable CP sites (i.e. the predicted positive fraction = 0.16).

Table 2.

Performance of CPred at various decision thresholds of the probability score

Probability score	PPF^a	Recall	Precision
≥0.90	0.06	0.13	1.00
≥0.85	0.16	0.33	1.00
≥0.80	0.26	0.52	0.99
≥0.75	0.33	0.66	0.96
≥0.70	0.39	0.74	0.92
≥0.65	0.43	0.81	0.92
≥0.60	0.49	0.88	0.87
≥0.50	0.54	0.92	0.82
≥0.40	0.61	0.97	0.77
≥0.30	0.69	1.00	0.70
≥0.20	0.78	1.00	0.62
≥0.10	0.90	1.00	0.54
≥0.00	1.00	1.00	0.48

^a PPF: predicted positive fraction, meaning the proportion of residues predicted as viable CP sites among all residues in the data set.
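The three columns of Table 2 can be computed from per-residue scores and known labels as sketched below. The function name and the toy data are hypothetical; the definitions of predicted positive fraction, recall and precision follow the text and the table footnote.

```python
def threshold_stats(scores, labels, threshold):
    """Return (PPF, recall, precision) at a probability-score threshold.

    `labels` holds 1 for a viable CP site and 0 for an inviable one.
    Precision is None when no residue reaches the threshold.
    """
    predicted = [s >= threshold for s in scores]
    n_predicted = sum(predicted)
    n_positive = sum(labels)
    true_pos = sum(1 for p, y in zip(predicted, labels) if p and y)
    ppf = n_predicted / len(scores)
    recall = true_pos / n_positive if n_positive else 0.0
    precision = true_pos / n_predicted if n_predicted else None
    return ppf, recall, precision


if __name__ == "__main__":
    scores = [0.95, 0.80, 0.60, 0.40, 0.10]  # toy integrated scores
    labels = [1, 1, 0, 1, 0]                 # toy viability labels
    for t in (0.85, 0.50, 0.00):
        print(t, threshold_stats(scores, labels, t))
```

At a threshold of 0.00 every residue is predicted positive, so precision equals the fraction of viable sites in the data. This agrees with the last row of Table 2: the combined data set holds 76 + 86 = 162 viable sites among 335, i.e. 0.48.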

Speed evaluations

Since some structural features used by CPred, such as the closeness (26) and the weighted contact number (37), have a high computational time complexity and the random forest of CPred comprises many subpredictors, reducing the running time is an important task in developing a quick-response server. By applying distributed computation techniques, the computation loads of several time-consuming steps are shared by many processors, greatly enhancing the efficiency of the server. As shown in Supplementary Figure S2, CPred takes only ∼3.4 s to make predictions for a protein with 150–200 residues; even for proteins as large as 600 residues, the average running time is <22 s. Without distributed computation, the running times for proteins with 150–200 and approximately 600 residues are around 48.8 and 513.6 s, respectively. These assessments were performed on the actual CPred server machine, which is a Linux computer with two 3.33 GHz octa-core Intel Xeon CPUs and 128 GB RAM.
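As a worked illustration, the speed-up implied by these running times can be written as the ratio of the single-process time to the distributed time. The symbol S and the subscripts are notation introduced here for convenience; because the 600-residue distributed time is reported only as an upper bound (<22 s), the corresponding speed-up is a lower bound.

```latex
S = \frac{T_{\text{single}}}{T_{\text{distributed}}}, \qquad
S_{150\text{--}200} \approx \frac{48.8\ \text{s}}{3.4\ \text{s}} \approx 14.4, \qquad
S_{\sim 600} > \frac{513.6\ \text{s}}{22\ \text{s}} \approx 23.3
```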

WEB SERVER DESCRIPTION

The query interface of CPred accepts three types of input: a PDB entry, a Structural Classification of Proteins entry or a PDB file. After the user submits the query data, a notification page appears to show the status of the computation and to provide a URL through which the results can be retrieved later if the user decides not to wait. The outputs of CPred include a list of probability scores for all residues of the input protein and an interactive Jmol (33) graphical display of the protein structure that shows the predicted CP sites (Figure 1b). The list of results can be reordered by residue number, amino acid type or probability score.

APPLICATIONS AND FUTURE WORKS

CPred is a user-friendly web server for predicting possible cleavage sites for creating correctly folded and stable circular permutants of proteins. It provides a convenient probability score to help the user select suitable CP cutting sites. An interesting application of CP is to create fusion proteins with tethered sites different from the native termini (22–24). To our knowledge, in every CP-involved fusion protein that has been created, CP was introduced into just one of the fused polypeptides. This is perhaps because of the difficulty of generating two viable circular permutants at the same time. The probability score of CPred may facilitate the production of fusion circular permutants. CP has long been applied to study protein folding. The ability of CPred to predict CP sites implies that it can be used in reverse to predict residues important to folding. Improving protein function is also a useful application of CP. Since residues with low probability scores are unlikely to form viable (let alone functionally improved) permutants, CPred holds promise for bioengineering by screening out improbable cases. To improve CPred, additional data will be continually collected for training the prediction model. The current CPred server requires protein structures for making predictions, yet many proteins have no determined structure. We expect that a sequence-based viable CP site predictor will further facilitate the application of CP.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Table 1, Supplementary Figure 2 and Supplementary References [34,46-50].

FUNDING

Funding for open access charge: The National Science Council, Taiwan [100-2745-B-009-001-ASP, Academic Summit Program of National Science Council to J.-K.H.]. National Science Council, Taiwan [100-2627-B-007-005 and 100-2319-B-400-001 to P.-C.L.]; The ‘Center for Bioinformatics Research of Aiming for the Top University Program’ of National Chiao Tung University and the Ministry of Education, Taiwan, R.O.C.

Conflict of interest statement. None declared.

41 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Circular permutation and receptor insertion within green fluorescent proteins.

Authors: G S Baird; D A Zacharias; R Y Tsien
Journal: Proc Natl Acad Sci U S A Date: 1999-09-28 Impact factor: 11.205

3. Different circular permutations produced different folding nuclei in proteins: a computational study.

Authors: L Li; E I Shakhnovich
Journal: J Mol Biol Date: 2001-02-09 Impact factor: 5.469

4. Systematic circular permutation of an entire protein reveals essential folding elements.

Authors: M Iwakura; T Nakamura; C Yamane; K Maki
Journal: Nat Struct Biol Date: 2000-07

5. Naturally occurring circular permutations in proteins.

Authors: S Uliel; A Fliess; R Unger
Journal: Protein Eng Date: 2001-08

6. The ASTRAL Compendium in 2004.

Authors: John-Marc Chandonia; Gary Hon; Nigel S Walker; Loredana Lo Conte; Patrice Koehl; Michael Levitt; Steven E Brenner
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

7. Crystal structure of a natural circularly permuted jellyroll protein: 1,3-1,4-beta-D-glucanase from Fibrobacter succinogenes.

Authors: Li-Chu Tsai; Lie-Fen Shyur; Shu-Hua Lee; Su-Shiang Lin; Hanna S Yuan
Journal: J Mol Biol Date: 2003-07-11 Impact factor: 5.469

Review 8. Plasticity of enzyme active sites.

Authors: Annabel E Todd; Christine A Orengo; Janet M Thornton
Journal: Trends Biochem Sci Date: 2002-08 Impact factor: 13.807

9. Circular permutation analysis as a method for distinction of functional elements in the M20 loop of Escherichia coli dihydrofolate reductase.

Authors: T Nakamura; M Iwakura
Journal: J Biol Chem Date: 1999-07-02 Impact factor: 5.157

10. Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains.

Authors: V Anantharaman; E V Koonin; L Aravind
Journal: J Mol Biol Date: 2001-04-13 Impact factor: 5.469
