
DBCP: a web server for disulfide bonding connectivity pattern prediction without the prior knowledge of the bonding state of cysteines.

Abstract

Accurate prediction of the location of disulfide bridges helps in solving the protein folding problem. Most previous work on predicting the disulfide connectivity pattern relies on prior knowledge of the bonding state of cysteines. The DBCP web server predicts the disulfide bonding connectivity pattern without prior knowledge of the bonding state of cysteines. The method used in this server improves the accuracy of disulfide connectivity pattern prediction (Q(p)) over the previous studies reported in the literature. The DBCP server can be accessed at http://120.107.8.16/dbcp or http://140.120.14.136/dbcp.

Year: 2010 PMID: 20530534 PMCID: PMC2896133 DOI: 10.1093/nar/gkq514

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Disulfide bonds play an important structural role in stabilizing protein conformations. In protein folding prediction, a correct prediction of disulfide bridges can greatly reduce the search space (1,2). Predicting the disulfide bonding pattern helps, to a certain degree, to predict the 3D structure of a protein and hence its function, because disulfide bonds impose geometrical constraints on the protein backbone. Recent work has shown a close relation between disulfide bonding patterns and protein structures (3,4). Disulfide bond prediction comprises four problems: (i) protein chain classification, which decides whether the protein contains any disulfide bridge; (ii) residue classification, which predicts the bonding state of each cysteine; (iii) bridge classification; and (iv) prediction of the disulfide bonding pattern. Over the past years, significant progress has been made on the prediction of the disulfide bonding states (5–8) and the disulfide bonding pattern (9–17). For disulfide bonding pattern prediction, most methods assume that the bonding states are known; the exceptions are the methods proposed by Ferrè and Clote (11,12) and Cheng et al. (15), which can be applied whether the bonding states are known or not. In this study, the coordinate (X, Y, Z) of the Cα of each amino acid in the protein, as predicted by MODELLER (18), is used as the feature. A support vector machine (SVM) is then trained to compute the connectivity probabilities of cysteine pairs. The Edmonds–Gabow maximum weight perfect matching algorithm (19) is used to find the connectivity pattern.

SYSTEM

The flowchart of this server illustrated by an example is shown in Figure 1.

Figure 1.

The flowchart of the DBCP illustrated by an example.

FEATURE

With the exception of the protein’s secondary structure, the features used in previous studies on disulfide bonding connectivity prediction are protein sequence features that are not related to the protein structure. In this study, we propose to use a structure-related feature. MODELLER (18) is used to predict the coordinate (X, Y, Z) of the Cα of each amino acid in the protein sequence. Having the coordinates, we can compute the Euclidean distance $D_{i,j}$ between the amino acid at the i-th position and the amino acid at the j-th position. We further extend the definition of Euclidean distance to the pair distance (PD). Let the positions of cysteine i and cysteine j be $P_i$ and $P_j$, respectively. The PD between cysteine i and cysteine j is defined to be the vector $(D_{P_i-\lfloor w/2\rfloor,\,P_j-\lfloor w/2\rfloor}, \ldots, D_{P_i-1,\,P_j-1}, D_{P_i,\,P_j}, D_{P_i+1,\,P_j+1}, \ldots, D_{P_i+\lfloor (w-1)/2\rfloor,\,P_j+\lfloor (w-1)/2\rfloor})$ that contains w Euclidean distances, where w is the window size. If we have k cysteines in the protein, there are as many as $k(k-1)/2$ cysteine pairs. Since most cysteine pairs will not constitute a disulfide bond, by examining $D_{P_i,P_j}$ of the cysteine pairs that constitute a disulfide bond, we set a threshold of value 15 for $D_{P_i,P_j}$. In other words, if $D_{P_i,P_j}$ is >15, this pair of cysteines will not be considered as a candidate that may have a disulfide bond. To bring the values into the input range of the SVM, which is −1 to 1, each component D of the vector PD is normalized by the equation $(D - 7.5)/7.5$. The resultant vector is called the normalized PD (NPD) and is the input to the SVM.
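The feature construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server's implementation: the function names are hypothetical, residue indices are 0-based, and the text does not say how window positions that fall beyond the ends of the chain are handled, so the sketch pads them with the threshold value.

```python
import math

THRESHOLD = 15.0        # cutoff on the central distance, as stated in the text
HALF = THRESHOLD / 2.0  # 7.5, used by the normalization (D - 7.5) / 7.5


def pair_distance(coords, p_i, p_j, w):
    """Return the PD vector (w distances) for cysteines at indices p_i and p_j.

    coords is a list of (x, y, z) C-alpha coordinates, one per residue.
    """
    left, right = w // 2, (w - 1) // 2  # floor(w/2) and floor((w-1)/2)
    pd = []
    for off in range(-left, right + 1):
        a, b = p_i + off, p_j + off
        if 0 <= a < len(coords) and 0 <= b < len(coords):
            pd.append(math.dist(coords[a], coords[b]))
        else:
            # ASSUMPTION: pad offsets that run past the chain ends.
            pd.append(THRESHOLD)
    return pd


def normalized_pair_distance(pd):
    """Apply (D - 7.5) / 7.5 to every component of a PD vector."""
    return [(d - HALF) / HALF for d in pd]


def is_candidate(coords, p_i, p_j):
    """A pair is kept only if its central distance does not exceed the cutoff."""
    return math.dist(coords[p_i], coords[p_j]) <= THRESHOLD


# Toy chain: residues spaced 1.0 apart along the x axis (illustrative only).
coords = [(float(k), 0.0, 0.0) for k in range(30)]
pd = pair_distance(coords, 2, 5, 4)
npd = normalized_pair_distance(pd)
```

With the window size w = 4 the offsets run from −2 to +1, so the vector has exactly four components, matching the count given in the definition.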

INPUT

The inputs to the DBCP include three parts:
(1) Query name: an optional name that lets the user distinguish his or her queries. If a user sends more than one query and chooses to receive the results by e-mail, the order in which the results arrive may not match the order in which the queries were sent.
(2) Protein sequence: a plain sequence without the header, with a maximum size of 9000 residues, entered in the input box. Pressing the [Sample Data] button displays the sample input sequence in the protein sequence input box.
(3) E-mail address: needed only if ‘send the result by Email’ is checked; otherwise, the results are displayed on the web page.

METHODOLOGY

Step 1. Run BLAST to get the template sequence of the input sequence. The parameters of BLAST are set as follows: the expectation value (E) threshold for saving hits is set to a very large value, 10 000, and the database is set to pdb, which contains sequences derived from the 3D structure records of the Protein Data Bank. If the E-value of the template sequence is >10 or the template sequence shares <25% identity with the input sequence, then instead of going to Step 2 we use the method previously developed by us (20) to predict the disulfide bonding pattern.
Step 2. Align the input sequence and the template sequence.
Step 3. Feed the alignment file into MODELLER and run the procedure to evaluate the model of the input sequence using the template sequence.
Step 4. Get the coordinate (X, Y, Z) of the Cα (α carbon) of each residue.
Step 5. Code each cysteine pair as its NPD; this is the input to the SVM.
Step 6. Feed the coding file into the SVM to predict the bonding probability of each cysteine pair with the trained model. The multiple trajectory search (21) is tightly integrated with the SVM training. For more details, please refer to the Supplementary Data on the DBCP web server.
Step 7. Code the input file with the probabilities from the SVM output and use the modified weighted perfect matching algorithm to get the first-level disulfide bonding connectivity.
Step 8. Verify the first-level disulfide bonding connectivity against the thresholds to get the final result.
Step 9. Display the result on the web page or send the result to the user.

In Step 1, if the E-value of the template sequence is >10 or the template sequence shares <25% identity with the input sequence, a previously proposed method (20) is used for prediction. In this method, the position-specific scoring matrix, the normalized bond lengths, the predicted secondary structure of the protein and the physicochemical properties index of the amino acids were used as features. The multiple trajectory search and the SVM training were tightly integrated to train the predictor. For more details, please refer to (20).
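To make the matching and thresholding stages of the pipeline concrete, the sketch below pairs cysteines so that the summed bonding probability is maximal and then drops weak pairs. It is an illustration only: it uses exhaustive search rather than the Edmonds–Gabow algorithm, which is practical only for a handful of cysteines; it does not reproduce the server's "modified" matching; the cutoff of 0.5 is a hypothetical value, since the text does not state the thresholds; and leaving one cysteine unpaired when the count is odd is likewise an assumption.

```python
def best_matching(cys, prob):
    """Exhaustively find the pairing of `cys` with the largest total probability.

    cys  : sorted list of cysteine positions
    prob : dict mapping (i, j) with i < j to the predicted bonding probability
    Returns (total_score, list_of_pairs).
    """
    if len(cys) < 2:
        return 0.0, []
    first, rest = cys[0], cys[1:]
    best_score, best_pairs = -1.0, []
    if len(cys) % 2 == 1:
        # ASSUMPTION: with an odd count, one cysteine may stay unpaired.
        best_score, best_pairs = best_matching(rest, prob)
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        score, pairs = best_matching(remaining, prob)
        score += prob.get((first, partner), 0.0)
        if score > best_score:
            best_score, best_pairs = score, [(first, partner)] + pairs
    return best_score, best_pairs


def apply_threshold(pairs, prob, cutoff=0.5):
    """Keep only pairs whose probability reaches the (hypothetical) cutoff."""
    return [p for p in pairs if prob.get(p, 0.0) >= cutoff]


# Made-up probabilities for four cysteines at positions 3, 10, 25 and 40.
prob = {
    (3, 10): 0.2, (3, 25): 0.9, (3, 40): 0.1,
    (10, 25): 0.3, (10, 40): 0.8, (25, 40): 0.4,
}
_, first_level = best_matching([3, 10, 25, 40], prob)
final = apply_threshold(first_level, prob)
```

For this toy input the three possible pairings score 0.6, 1.7 and 0.4, so the first-level result is (3, 25) and (10, 40), and both pairs survive the cutoff. For realistic numbers of cysteines a polynomial-time matching algorithm would be needed instead of this recursion.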

ENVIRONMENT

The DBCP web server is free and open to all users, and there is no login requirement. The prediction software was implemented in the C language and the server-side scripting language PHP, and its web pages are served by the Apache web server.

OUTPUT

The DBCP reports the following results:
(1) Job ID: the ID assigned by the server for this query job.
(2) Query name: the name entered by the user.
(3) SEQUENCE: the input protein sequence.
(4) Positions of the cysteines.
(5) E-value: the E-value of the template sequence found by BLAST. If the E-value is >10, the sequence identity and the E-value will be marked in red to indicate a warning.
(6) Identity: the identity between the template sequence and the input sequence. This value is provided by MODELLER. If the template sequence shares <25% identity with the input sequence, the sequence identity and the E-value will be marked in red to indicate a warning.
(7) Probability: the prediction probability of each cysteine pair.
(8) Metal binding site score: a rough estimate of whether the cysteine pair may be involved in a metal binding site.
(9) Positions of oxidized cysteines.
(10) Predicted disulfide bonding connectivity pattern.
(11) Predicted positions of cysteines in metal binding sites: this is only a rough prediction, and users should consult other methods or web services for more accurate prediction of cysteines involved in metal binding sites (22).

EVALUATION OF WEB SERVER

We found four web sites that provide prediction of the disulfide bonding connectivity pattern without prior knowledge of the bonding state of cysteines (12,14–16). Cheng et al. (15) tested their prediction method by 10-fold cross validation on the data set SPX (15). For comparison, we also tested our method by 10-fold cross validation on the same data set; the results are shown in the Supplementary Data. The method proposed by Song et al. (16) can process only protein sequences that have fewer than 12 cysteines. Therefore, we conducted a test to compare our method only with the other three methods. We took 56 protein sequences from the SWISS-PROT database release no. 56.3 that are neither in the SWISS-PROT release no. 39 nor in the data set SPX. This set of sequences is denoted as ‘SP56NS’. The prediction accuracies of our method and the other three methods on this data set are shown in Table 1.

Table 1.

Comparison of the prediction accuracies on the data set SP56NS

Number of bonds	Number of sequences	DBCP (%)	Dipro (15) (%)	DiANNA (12) (%)	DISULFIND (14) (%)
2	10	50	10	10	0
3	10	80	40	10	30
4	10	70	20	0	0
5	10	60	10	0	0
6–9	16	50	0	0	n.a
All	56	60.7	14.3	3.6	5.4
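The ‘All’ row of Table 1 can be checked against the per-row figures with a short calculation. The sketch below assumes the usual reading of pattern accuracy, namely the percentage of sequences whose whole connectivity pattern is predicted correctly, and it assumes that the ‘n.a’ entry contributes no correct predictions while its 16 sequences still count in the total. Both are inferences from the table rather than statements in the text, and the function name is hypothetical.

```python
def overall_accuracy(rows):
    """Combine per-row (number_of_sequences, accuracy_percent) into one percentage."""
    correct = sum(round(n * pct / 100) for n, pct in rows)
    total = sum(n for n, _ in rows)
    return 100 * correct / total


# Rows of Table 1 as (number of sequences, accuracy in %).
dbcp = [(10, 50), (10, 80), (10, 70), (10, 60), (16, 50)]
dipro = [(10, 10), (10, 40), (10, 20), (10, 10), (16, 0)]
dianna = [(10, 10), (10, 10), (10, 0), (10, 0), (16, 0)]
# ASSUMPTION: the 'n.a' row is counted as 0 correct out of 16 sequences.
disulfind = [(10, 0), (10, 30), (10, 0), (10, 0), (16, 0)]

for name, rows in [("DBCP", dbcp), ("Dipro", dipro),
                   ("DiANNA", dianna), ("DISULFIND", disulfind)]:
    print(name, round(overall_accuracy(rows), 1))
# prints 60.7, 14.3, 3.6 and 5.4, matching the 'All' row
```

Under these assumptions the four totals correspond to 34, 8, 2 and 3 correctly predicted sequences out of 56.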

Since the present version of the web server was trained on the data set SPX, we also took 50 sequences from the SWISS-PROT database release no. 56.3 that are neither in the SWISS-PROT release no. 39 nor in the data set SPX. Furthermore, the pairwise sequence identity between these 50 sequences and the sequences in SPX is <25%. This set of sequences is denoted as ‘SP56NS_25’. The prediction accuracies of our method and the other three methods on this data set are shown in Table 2.

Table 2.

Comparison of the prediction accuracies on the data set SP56NS_25

Number of bonds	Number of sequences	DBCP (%)	Dipro (15) (%)	DiANNA (12) (%)	DISULFIND (14) (%)
2	10	60	20	20	0
3	10	50	30	0	30
4	10	50	30	0	10
5	10	50	0	0	0
6–14	10	30	0	0	n.a
All	50	48	16	4	8

To check the prediction accuracy when the input sequence has low identity to the overall set of PDB proteins, we took 32 sequences from the SWISS-PROT database release no. 56.3 for which either the sequence shares <25% identity with the template sequence found by BLAST or the E-value of the template sequence is >10. This set of sequences is denoted as ‘CHK25’. The prediction accuracy of DBCP for this data set is shown in Table 3.

Table 3.

The prediction accuracy on the data set CHK25

Number of bonds	Number of sequences	Qp (%)	Qc (%)
2	15	53.3	76.7
3	11	54.5	69.7
4	3	33.3	33.3
6–9	3	0	43.5
All	32	46.9	61.2

LIMITATIONS

The prediction accuracy may degrade if the template sequence found by BLAST has an E-value >10 or the identity between the template sequence and the input sequence is <25%. The web server is designed to predict the disulfide bonding connectivity pattern of a sequence that does not have cysteines involved in metal binding sites. The prediction accuracy will degrade if the input sequence contains cysteines involved in metal binding sites. The metal binding site score is provided only to indicate possible metal binding sites. If this score is >0.5, users are strongly advised to use other methods or web services to predict more accurately the cysteines that are involved in metal binding sites. The Metal Detector server (http://metaldetector.dsi.unifi.it/) is one such web service.
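A user who post-processes DBCP results could encode these caveats as simple checks. The sketch below is hypothetical: the function name, argument names and message wording are not part of the server, and only the three cutoffs (E-value >10, identity <25%, metal binding site score >0.5) come from the text.

```python
def dbcp_caveats(e_value, identity_pct, metal_score):
    """Return the caveats from the text that apply to one prediction result."""
    caveats = []
    # Weak template: the server switches to the earlier method (20).
    if e_value > 10 or identity_pct < 25:
        caveats.append("weak template: accuracy may degrade")
    # Possible metal binding site: consult a dedicated predictor as well.
    if metal_score > 0.5:
        caveats.append("possible metal binding site: check other methods")
    return caveats


# Illustrative values only.
clean = dbcp_caveats(e_value=1e-30, identity_pct=62.0, metal_score=0.1)
flagged = dbcp_caveats(e_value=12.0, identity_pct=40.0, metal_score=0.7)
```

Here `clean` is an empty list and `flagged` contains both caveats, because either template condition alone is enough to trigger the first one.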

CONCLUSION

A web-based application system called the DBCP is provided for the prediction of the disulfide bonding connectivity pattern without prior knowledge of the bonding state of cysteines. To the best of our knowledge, in previous work that does not use prior knowledge of the bonding state of cysteines, the best accuracy of disulfide connectivity pattern prediction (Qp) and that of disulfide bridge prediction (Qc) are 51% and 52%, respectively, on the data set SPX with 10-fold cross validation. The method used in this server improved the prediction accuracies on the same data set SPX to 84.4% (Qp) and 94.6% (Qc) with 10-fold cross validation. The comparison of the prediction accuracy of the DBCP with that of three other state-of-the-art web services on the data sets SP56NS and SP56NS_25 also shows that the DBCP outperforms the other three methods. If the template sequence found by BLAST has an E-value >10 or the identity between the template sequence and the input sequence is <25%, another method previously proposed by us is used for prediction. In this case, the prediction accuracy may degrade slightly. The DBCP is designed to predict the disulfide bonding connectivity pattern of a sequence that does not have cysteines involved in metal binding sites. For protein sequences that contain such cysteines, other methods that can predict both the disulfide bonds and the metal binding sites will be more suitable. A high metal binding site score (e.g. >0.5) indicates that there may be cysteines involved in metal binding sites. In this case, users are strongly advised to use other methods in addition to the DBCP and to base their conclusion on the results of all methods.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Council of ROC (contract number NSC98-2221-E005-049-MY3, partial); Ministry of Education, Taiwan, ROC under ATU plan (partial); Central Taiwan University of Science and Technology (grant CTU99-P-33). Funding for open access charge: National Chung Hsing University. Conflict of interest statement. None declared.

19 in total

1. Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions.

Authors: E S Huang; R Samudrala; J W Ponder
Journal: J Mol Biol Date: 1999-07-02 Impact factor: 5.469

2. Relationship between protein structures and disulfide-bonding patterns.

Authors: Chao-Chun Chuang; Chun-Yin Chen; Jinn-Moon Yang; Ping-Chiang Lyu; Jenn-Kang Hwang
Journal: Proteins Date: 2003-10-01

3. MONSSTER: a method for folding globular proteins with a small number of distance restraints.

Authors: J Skolnick; A Kolinski; A R Ortiz
Journal: J Mol Biol Date: 1997-01-17 Impact factor: 5.469

4. Comparative protein modelling by satisfaction of spatial restraints.

Authors: A Sali; T L Blundell
Journal: J Mol Biol Date: 1993-12-05 Impact factor: 5.469

5. Predicting the oxidation state of cysteines by multiple sequence alignment.

Authors: A Fiser; I Simon
Journal: Bioinformatics Date: 2000-03 Impact factor: 6.937

6. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks.

Authors: Pier Luigi Martelli; Piero Fariselli; Luca Malaguti; Rita Casadio
Journal: Protein Eng Date: 2002-12

7. Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences.

Authors: Yu-Ching Chen; Yeong-Shin Lin; Chih-Jen Lin; Jenn-Kang Hwang
Journal: Proteins Date: 2004-06-01

8. MetalDetector: a web server for predicting metal-binding sites and disulfide bridges in proteins from sequence.

Authors: Marco Lippi; Andrea Passerini; Marco Punta; Burkhard Rost; Paolo Frasconi
Journal: Bioinformatics Date: 2008-07-16 Impact factor: 6.937

9. A novel database of disulfide patterns and its application to the discovery of distantly related homologs.

Authors: Herman W T van Vlijmen; Abhas Gupta; Lakshmi S Narasimhan; Juswinder Singh
Journal: J Mol Biol Date: 2004-01-23 Impact factor: 5.469

10. Disulfide connectivity prediction using recursive neural networks and evolutionary information.

Authors: Alessandro Vullo; Paolo Frasconi
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937
