DiANNA 1.1: an extension of the DiANNA web server for ternary cysteine classification.

Abstract

DiANNA is a recent state-of-the-art artificial neural network and web server, which determines the cysteine oxidation state and disulfide connectivity of a protein, given only its amino acid sequence. Version 1.0 of DiANNA uses a feed-forward neural network to determine which cysteines are involved in a disulfide bond, and employs a novel neural network architecture to predict which half-cystines are covalently bound to which other half-cystines. In version 1.1 of DiANNA, described here, we extend functionality by applying a support vector machine with spectrum kernel to the cysteine classification problem: to determine whether a cysteine is reduced (free in sulfhydryl state), half-cystine (involved in a disulfide bond) or bound to a metallic ligand. In the last case, DiANNA predicts the ligand among iron, zinc, cadmium and carbon. Available at: http://bioinformatics.bc.edu/clotelab/DiANNA/.

Year: 2006 PMID: 16844987 PMCID: PMC1538812 DOI: 10.1093/nar/gkl189

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Cysteine residues play a unique role in determining protein stability and function. Cysteines may be reduced (free, where sulfur occurs in the reactive sulfhydryl form) or oxidized; the latter may be involved in a disulfide bond, i.e. a half-cystine, or instead covalently bound to a metallic ligand that is part of a prosthetic group. Experimental determination of cysteine species (free, half-cystine, ligand-bound) is non-trivial, and often only knowledge of the three-dimensional structure indicates the species. For this reason, cysteine classification is an important bioinformatics problem that may be approached by using machine learning methods. In this paper, we apply support vector machines (SVM) to the ternary cysteine classification problem, to determine whether a given cysteine is free, a half-cystine or ligand-bound. To the best of our knowledge, the present paper describes the only existing ternary cysteine classification program. It is reasonable to assume that each species of cysteine resides in a distinct micro-environment which influences the cysteine redox potential and its steric accessibility. This hypothesis is confirmed and exploited in several machine learning approaches for cysteine classification that, while different, share the common feature that the discrimination is based on the analysis of the cysteine sequence context, using a symmetric sequence window of length w centered about each cysteine. Particular effort has been spent on the binary classification problem of discriminating intra-chain half-cystines from free cysteines, the latter being the most represented species. For this problem, various methods have yielded steadily increasing prediction accuracies (1,2). Nevertheless, other species of cysteines exist, namely ligand-bound cysteines and half-cystines involved in inter-chain disulfide bonds. Such cysteines reside in possibly different micro-environments, and hence may be discernible from other species.
Only one attempt has been made to discriminate ligand-bound cysteines; specifically, Passerini and Frasconi (3) obtained prediction accuracy of ∼90% for the binary classification problem of distinguishing ligand-bound cysteines from half-cystines. DiANNA 1.1 is the only software which performs ternary cysteine classification; all other cysteine classification web servers consider only the binary classification problem of discriminating free cysteines from intra-chain half-cystines. In this paper, we apply a SVM with (a variant of) the spectrum kernel (4) to classify cysteines into three different species: free, half-cystine or ligand-bound. For predicted ligand-bound cysteines, we further refine the classification by predicting the bound ligand to be iron, zinc, cadmium or carbon. Although we have some results concerning inter-chain disulfide bonds (data not shown), the DiANNA web server is intended only for use with single-chain proteins.

DATASET

To train and test a ternary SVM predictor for cysteine classification, it was necessary to build a dataset in which each cysteine species is well represented. This was done as follows. From the Protein Data Bank (5), we extracted the set of single-chain proteins containing ligand-bound cysteines, and produced a non-redundant collection by using the program UniqueProt (6) with HSSP distance set to 0. This produced a list of 202 chains, denoted by UP. To enrich the small number (60) of half-cystine examples (which is probably not representative), we considered the 967 non-redundant protein chains used in (1) for training and testing a neural network to predict cysteine oxidation state (dataset MA). We merged the UP and MA datasets, and re-applied UniqueProt to eliminate redundancy between the two lists. From each redundancy cluster, we selected one member containing ligand-bound cysteines, if available (if not, we selected the representative member proposed by UniqueProt). In this fashion, we obtained a dataset (denoted UPMA) of 526 chains, with adequate representation of each of the three cysteine classes. Table 1 displays the number of cysteines in each species, and Table 2 presents the number of chains containing each species. From each protein in UPMA, we extracted symmetric windows of size w centered around each cysteine. Different values of w were tested, and the best results were obtained for w = 17 [the same value led to the best performance in (3)]. The annotated UPMA list is available at URL .
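The window extraction step can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `cysteine_windows` and the padding symbol `X` used near the chain termini are assumptions, since the text does not say how windows that run past the ends of the sequence were handled.

```python
def cysteine_windows(sequence, w=17, pad="X"):
    """Return (position, window) pairs, one per cysteine in the sequence.

    Each window has length w and is centered on the cysteine; positions are
    1-based. Padding with `pad` near the termini is an assumption of this
    sketch, not something stated in the paper.
    """
    if w % 2 == 0:
        raise ValueError("w must be odd so that the window is symmetric")
    half = w // 2
    sequence = sequence.upper()
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "C":
            # In the padded string this residue sits at index i + half,
            # so the centered window is padded[i : i + w].
            windows.append((i + 1, padded[i:i + w]))
    return windows
```

With the default w = 17, every returned window has eight flanking positions on each side of the central cysteine.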

Table 1

Total number of different cysteine species in datasets considered in this paper

Dataset	LC	IA HC	IE HC	FC	Total
UP	624	60	2	546	1230
MA	216	1481	37	2412	4109
UPMA	624	608	24	1199	2455

The description of each dataset can be found in the section ‘Dataset’. Legend: IA, intra-chain disulfide bonds; IE, inter-chain disulfide bonds; HC, half-cystines; FC, free cysteines; LC, ligand-bound cysteines.

Table 2

Number of protein chains containing half-cystines (HC), free cysteines (FC) and ligand-bound cysteines (LC), alone or in combination, for each of the three datasets considered in this paper

Chains	UPMA	UP	MA
Total	526	202	967
w/ HC	140	19	291
w/ FC	363	139	716
w/ LC	189	202	52
w/ both HC and FC	28	9	65
w/ both HC and LC	17	19	1
w/ both FC and LC	128	139	26
w/ HC, FC and LC	7	9	0

SVM PREDICTION USING STRING KERNELS

SVMs were introduced by Vapnik within the context of a mathematically rigorous statistical learning theory; for a very clear exposition of this topic see (7). Often demonstrating better prediction accuracy than neural networks, SVMs have become increasingly popular in bioinformatics, with applications including translation initiation site determination (8), remote homology detection in proteins (9), viral protease cleavage site prediction (10) and fast computation of Z-scores for minimum free energy of RNA (11). To apply SVMs to the ternary cysteine classification problem, we use the spectrum representation (4), which describes an amino acid sequence by the vector of counts of the k-mers that occur in it; i.e. for peptide p, define Φ_k(p) = 〈φ_a(p) : a ∈ A^k〉, where φ_a(p) is the number of occurrences of the k-mer a in p, and A is the set of 1-letter codes of the 20 amino acids. Leslie et al. use the terms spectrum kernel and mismatch kernel in (4,13), and Busuttil et al. use the term profile-based kernel in (14). More rigorously speaking, these authors actually apply classical kernels [e.g. the linear kernel in (4,13)] to new representations of amino acid sequences: the spectrum representation, the mismatch representation and the profile-based spectrum representation. In this paper, we obtained the best results when k = 3, so that the amino acid sequence p in each size w window is encoded by the vector Φ_3(p) of 20^3 = 8000 coordinates, giving the number of occurrences of each 3-mer in p. With the spectrum representation, we used the software libSVM (12) with a degree 2 polynomial kernel and cost parameter C = 1; for an explanation of these parameters see (12). To train and test the SVMs we used 5-fold cross-validation, splitting positive and negative datasets into five random subsets of approximately the same size.
Using libSVM, the SVM multiclass classifier outputs, for each cysteine in the input sequence, the probability of being a free cysteine (FC), a half-cystine (HC) and ligand-bound (LC). To measure the performance of the algorithm we used the Q3 score, which is the ratio between correctly predicted examples and the total number of examples. The Q3 score is commonly used for the performance evaluation of three-state (sheet, helix, coil) secondary structure predictors, e.g. see (15). Additionally, we computed the Q score (Q_p in Table 3), which is the fraction of proteins for which all cysteines are correctly classified. The results (Table 3) show that the highest Q3 and Q scores are obtained using the spectrum representation with a degree 2 polynomial kernel (scores of 0.78 and 0.53, respectively). Although the papers (13) and (14,16) report that the mismatch and profile-based kernels outperform the spectrum kernel in protein classification experiments, we found that this is not the case for cysteine oxidation state prediction. Additional data describing the results of binary classification experiments can be found in the web supplement at the DiANNA web site.
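As an illustration of the spectrum representation for k = 3 and of a degree 2 polynomial kernel evaluated on it, the following Python sketch may help. It is a hedged reconstruction rather than the authors' implementation: the function names are hypothetical, k-mers containing non-standard symbols are skipped by assumption, and the kernel follows the general libSVM form (gamma·〈x, y〉 + coef0)^degree with illustrative values of gamma and coef0, which the paper does not report.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard 1-letter codes


def spectrum_counts(peptide, k=3):
    """Sparse k-spectrum: map each k-mer occurring in the peptide to its count."""
    counts = Counter()
    for i in range(len(peptide) - k + 1):
        kmer = peptide[i:i + k]
        # Assumption: k-mers with non-standard symbols (e.g. padding) are skipped.
        if all(residue in AMINO_ACIDS for residue in kmer):
            counts[kmer] += 1
    return counts


def spectrum_vector(peptide, k=3):
    """Dense k-spectrum with 20**k coordinates, in lexicographic k-mer order."""
    counts = spectrum_counts(peptide, k)
    return [counts["".join(kmer)] for kmer in product(AMINO_ACIDS, repeat=k)]


def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree on sparse spectra.

    gamma and coef0 are illustrative defaults, not values from the paper.
    """
    dot = sum(count * y.get(kmer, 0) for kmer, count in x.items())
    return (gamma * dot + coef0) ** degree
```

Working with the sparse counts avoids materialising all 8000 coordinates: a window of length 17 contains at most 15 distinct 3-mers, so the inner product only needs to visit those.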

Table 3

Performance measure (Q3 and Q scores) for the three-class prediction of LC, HC, FC using different kernels and input representation

Kernel	Q₃ SpR	Q₃ MmR	Q₃ PrfR	Q_p SpR	Q_p MmR	Q_p PrfR
Linear	0.75	0.64	0.63	0.45	0.43	0.43
Polynomial (2)	0.78	0.74	0.74	0.53	0.46	0.45
Polynomial (3)	0.76	0.72	0.72	0.5	0.47	0.47
RBF	0.75	0.73	0.72	0.43	0.43	0.43

The Q3 score is the ratio between the number of correct predictions and the total number of examples. The Q score is the fraction of proteins for which all cysteines are correctly predicted. Q3 and Q scores are obtained by averaging the results of a 5-fold cross-validation. Optimal values of the C parameter and the γ parameter for the radial basis function (RBF) kernel are estimated by a grid search. Legend: SpR, Spectrum representation; MmR, Mismatch representation; PrfR, Profile representation.
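The two scores defined in this legend can be computed as in the following minimal Python sketch. The function names and the input layout (one pair of true and predicted label lists per protein) are assumptions made for illustration, and the labels in the usage note are made up.

```python
def q3_score(proteins):
    """Fraction of cysteines, pooled over all proteins, whose class is correct.

    `proteins` is a list of (true_labels, predicted_labels) pairs, one pair
    per protein, with one label per cysteine.
    """
    correct = 0
    total = 0
    for true_labels, predicted_labels in proteins:
        correct += sum(t == p for t, p in zip(true_labels, predicted_labels))
        total += len(true_labels)
    return correct / total


def qp_score(proteins):
    """Fraction of proteins for which every cysteine is correctly classified."""
    perfect = sum(
        1 for true_labels, predicted_labels in proteins
        if list(true_labels) == list(predicted_labels)
    )
    return perfect / len(proteins)
```

For two hypothetical proteins, one predicted perfectly (FC, HC, HC) and one with a single error among three cysteines, `q3_score` gives 5/6 while `qp_score` gives 1/2, which shows why the per-protein score is the stricter of the two.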

Table 4 displays the number of examples in dataset UPMA for each distinct ligand type in ligand-bound cysteines. For the ligands for which we have at least 39 examples (i.e. Zn, Fe, Cd, C) we investigated whether machine learning can be used to discriminate the atomic species bound, i.e. whether the sequence context of each type of ligand is significantly different. Experiments were performed where the positive set consisted of amino acid sequences symmetrically flanking those cysteines bound to a specific ligand (say iron), while the negative set consisted of sequences flanking cysteines bound to a different ligand. In the case of cadmium (Cd) and carbon (C), we randomly resampled the positive training set (which is substantially smaller than the negative training set) until the number of positive and negative examples was the same (note that the test set is unchanged). As in ternary cysteine classification, we found that the best discrimination was obtained using the degree 2 polynomial kernel with the spectrum representation. Results are reported in Table 5 and Figure 1.
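The construction of the positive and negative sets for one ligand, and the balancing of a small positive training set, might look like the Python sketch below. The paper does not specify the resampling procedure in detail, so sampling with replacement while keeping every original positive example is an assumption, as are the function names and the fixed seed.

```python
import random


def one_vs_rest(examples, ligand):
    """Split (window, ligand) pairs into positive and negative windows."""
    positives = [window for window, bound in examples if bound == ligand]
    negatives = [window for window, bound in examples if bound != ligand]
    return positives, negatives


def oversample_positives(positives, negatives, seed=0):
    """Resample positives with replacement until both sets have equal size.

    Assumption of this sketch: all original positives are kept, and extra
    copies are drawn uniformly at random. Only training data should be
    passed in; the test set is left unchanged, as stated in the text.
    """
    if not positives:
        raise ValueError("need at least one positive example")
    rng = random.Random(seed)
    balanced = list(positives)
    while len(balanced) < len(negatives):
        balanced.append(rng.choice(positives))
    return balanced
```

If the positive set is already at least as large as the negative set, `oversample_positives` returns it unchanged.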

Table 4

Total number of distinct atomic ligands found covalently bound to cysteine residues in the UPMA dataset.

Cys-bound atom	Examples
As	2
Au	1
C	89
Cd	39
Cu	10
Fe	185
H	1
Hg	24
Mn	1
Ni	6
Pb	1
S	27
U	2
Zn	225

Table 5

Performance measures for the prediction of cysteines bound to specific ligands

Measure	Zn	Cd	Fe	C
Acc	0.93	0.99	0.91	0.96
Sen	0.8	0.97	0.67	0.74
Spe	0.99	1	0.98	0.99
MCC	0.84	0.99	0.74	0.83
AUC	0.97	0.97	0.94	0.94

Legend: Acc, accuracy; Sen, sensitivity; Spe, specificity; MCC, Matthews correlation coefficient; AUC, area under the ROC curve.
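For reference, the first four measures in this legend follow from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) of a binary predictor. The Python sketch below uses the standard definitions; the function name is hypothetical, the convention of returning 0 for an undefined ratio is an assumption, and no counts from the paper are reproduced.

```python
import math


def binary_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Convention assumed here: a ratio with zero denominator is reported as 0.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Acc": accuracy, "Sen": sensitivity, "Spe": specificity, "MCC": mcc}
```

Because negatives outnumber positives in each one-versus-rest experiment, accuracy and specificity can be high even when sensitivity is modest, which is why MCC is a useful complement.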

Figure 1

ROC curves for the prediction of cysteines covalently bound to specific ligands. [For an explanation of receiver operating characteristic (ROC) curves see (20)].

WEB SERVER

DiANNA 1.1 has a simple user-friendly web interface, which allows the user to obtain a prediction of the state (free, half-cystine or ligand-bound) for each cysteine in an input protein. The ternary SVM predictor outputs the highest probability class, and, for those cysteines predicted as ligand-bound, the most likely ligand is displayed (among iron, zinc, cadmium, carbon), by a winner-takes-all decision. Additionally, as described previously (17,18), DiANNA 1.1 uses a state-of-the-art method to predict the disulfide connectivity—i.e. which cysteines form a disulfide bond with which other cysteines. A screen shot of the DiANNA 1.1 web server output for a ternary classification prediction is shown in Figure 2. Additionally, DiANNA 1.1 allows all possible binary classification predictions for the three cysteine classes (free, half-cystine, ligand-bound). The web server interface is largely self-explanatory. The upper panel of Figure 2 displays the input form, including the pull-down menu, which allows the user to choose the classifier used for cysteine state prediction (ternary classifier, or one of three binary classifiers). The lower panel of Figure 2 displays the output of the ternary cysteine state classifier, indicating the probability of each class (half-cystine, free cysteine, ligand-bound). In the case of predicted ligand-bound cysteines, the predicted ligand is listed in the right-most column. The user enters a protein in FASTA format, possibly including a FASTA comment, and chooses either to predict the cysteine state for each cysteine, or to determine the disulfide connectivity. The latter function has already been described in (17).
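The decision rule described above, highest-probability class followed by a winner-takes-all choice of ligand, can be sketched in Python as follows. This is an illustration only: the function name, the dictionary layout and the assumption that the four ligand predictors produce directly comparable scores are not taken from the server's implementation.

```python
def call_cysteine(class_probs, ligand_scores=None):
    """Return (state, ligand) for one cysteine.

    `class_probs` maps "FC", "HC" and "LC" to probabilities. `ligand_scores`
    maps ligand names (e.g. "Fe", "Zn", "Cd", "C") to the outputs of the
    per-ligand predictors; it is only consulted when the predicted state
    is ligand-bound.
    """
    state = max(class_probs, key=class_probs.get)
    ligand = None
    if state == "LC" and ligand_scores:
        # Winner-takes-all: report the ligand with the highest score.
        ligand = max(ligand_scores, key=ligand_scores.get)
    return state, ligand
```

For a cysteine with hypothetical probabilities FC 0.1, HC 0.2, LC 0.7 and ligand scores in which zinc is highest, the call returns the pair ("LC", "Zn"); for a cysteine predicted free or half-cystine the ligand is reported as `None`.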

Figure 2

DiANNA ternary cysteine classification prediction input and output example. Upper panel: The DiANNA web-server update allows the user to choose between disulfide connectivity prediction and cysteine classification (ternary cysteine classification is only available in the 1.1 update). In the latter case, the user can type or paste a FASTA sequence in a text box, then choose among four different classification predictions by means of a drop down menu (i.e. the ternary LC versus HC versus FC classification, and the three binary classifications LC versus HC, LC versus FC and HC versus FC). Lower panel: Output for the ternary classification. For each cysteine in the submitted sequence, the SVM model predicts the probability of being half-cystine, free cysteine or ligand-bound. The class having the highest probability is highlighted. If a specific cysteine is predicted as ligand bound, a tentative prediction about the putative ligand (out of four possible ligands) is attempted.

CONCLUSION

Given the amino acid sequence of a protein, DiANNA (17) is a state-of-the-art method to predict disulfide connectivity topology. Version 1.0 of the DiANNA web server, described in (18), additionally predicts the oxidation state of each cysteine (free or half-cystine), by using our implementation of the neural network of Fariselli et al. (19). In version 1.1 of the DiANNA web server, described in this paper, we replace the binary classifier of (19) by a SVM with degree 2 polynomial kernel for the spectrum representation (4). Using libSVM, we obtain a ternary classifier, capable of discriminating between free cysteines, half-cystines and ligand-bound cysteines. Moreover, for the latter, DiANNA 1.1 predicts the type of ligand. To the best of our knowledge, this is the first application of string-based kernels to sequence windows; until this paper, such kernels had been used only for protein classification.

18 in total

1. Using the Fisher kernel method to detect remote protein homologies.

Authors: T Jaakkola; M Diekhans; D Haussler
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1999

2. The spectrum kernel: a string kernel for SVM protein classification.

Authors: Christina Leslie; Eleazar Eskin; William Stafford Noble
Journal: Pac Symp Biocomput Date: 2002

3. The Protein Data Bank.

Authors: Helen M Berman; Tammy Battistuz; T N Bhat; Wolfgang F Bluhm; Philip E Bourne; Kyle Burkhardt; Zukang Feng; Gary L Gilliland; Lisa Iype; Shri Jain; Phoebe Fagan; Jessica Marvin; David Padilla; Veerasamy Ravichandran; Bohdan Schneider; Narmada Thanki; Helge Weissig; John D Westbrook; Christine Zardecki
Journal: Acta Crystallogr D Biol Crystallogr Date: 2002-05-29

4. Mining viral protease data to extract cleavage knowledge.

Authors: Ajit Narayanan; Xikun Wu; Z Rong Yang
Journal: Bioinformatics Date: 2002 Impact factor: 6.937

5. UniqueProt: Creating representative protein sequence sets.

Authors: Sven Mika; Burkhard Rost
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

6. Learning to discriminate between ligand-bound and disulfide-bound cysteines.

Authors: Andrea Passerini; Paolo Frasconi
Journal: Protein Eng Des Sel Date: 2004-05-27 Impact factor: 1.650

7. Mismatch string kernels for discriminative protein classification.

Authors: Christina S Leslie; Eleazar Eskin; Adiel Cohen; Jason Weston; William Stafford Noble
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

8. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching.

Authors: M Gribskov; N L Robinson
Journal: Comput Chem Date: 1996-03

9. Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks.

Authors: Pier Luigi Martelli; Piero Fariselli; Luca Malaguti; Rita Casadio
Journal: Protein Eng Date: 2002-12

10. Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences.

Authors: Yu-Ching Chen; Yeong-Shin Lin; Chih-Jen Lin; Jenn-Kang Hwang
Journal: Proteins Date: 2004-06-01
