
CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs.

Ian Walsh, Alberto J M Martin, Tomàs Di Domenico, Alessandro Vullo, Gianluca Pollastri, Silvio C E Tosatto.

Abstract

CSpritz is a web server for the prediction of intrinsic protein disorder. It combines the previous Spritz with two novel orthogonal systems developed by our group (Punch and ESpritz). Punch is based on sequence and structural templates trained with support vector machines. ESpritz is an efficient single sequence method based on bidirectional recursive neural networks. Spritz was extended to filter predictions based on structural homologues. After extensive testing, predictions are combined by averaging their probabilities. The CSpritz website can process single or multiple predictions for either short or long disorder. The server provides a global output page offering downloads and summary statistics for all predictions. Links are provided to each individual protein, where the amino acid sequence and disorder prediction are displayed along with statistics for the individual protein. As a novel feature, CSpritz provides information about structural homologues as well as secondary structure and short functional linear motifs in each disordered segment. Benchmarking was performed on the very recent CASP9 data, where CSpritz would have ranked consistently well with a Sw measure of 49.27 and AUC of 0.828. The server, together with help and methods pages including examples, is freely available at URL: http://protein.bio.unipd.it/cspritz/.

Year: 2011 PMID: 21646342 PMCID: PMC3125791 DOI: 10.1093/nar/gkr411

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The 3D native structure of proteins has been considered the major determinant of function for many years. Over the last decade there has been a growing realization of an alternative mechanism whereby non-folding regions are both widespread and also carry functional significance (1,2). These non-folding regions within a protein, coming in various guises ranging from fully extended to molten globule-like and partially folded structures (3), are collectively known as intrinsically disordered regions (4). Such regions often become structured upon binding to a target molecule and have been shown to be involved in various biological processes such as cell signaling or regulation (5), DNA binding (6) and molecular recognition in general (3,7). An interesting observation is that the amount of disorder within a proteome seems to correlate with complexity of the organism, with an apparent increase in disorder for eukaryotic organisms (8,9). The conservation of disorder (10,11) and specific amino acid patterns (12,13) (e.g. PxPxP) have also been studied. Indeed, there is a growing realization that intrinsically disordered regions are widely used as hubs for protein–protein interactions (14), for which structural data can be accessed in the ComSin database (15). Functional linear motifs (16,17), which are mostly hidden in disordered regions (18), have been characterized in resources such as ELM (19), an online repository of linear motifs. The experimental determination of native disorder, once considered an anomaly, can be time consuming, difficult and expensive. As a result, computational approaches have largely driven our understanding of disorder over the last decade (14). The bi-yearly Critical Assessment of Techniques for protein Structure Prediction (CASP) experiment has included a disorder category since CASP5 in 2002 (20). Previously published methods can be roughly divided into biophysical and machine learning approaches. 
The former rely on the unique amino acid distribution associated with protein disorder (21–23). Machine learning methods use either neural networks (24–26) or support vector machines (9,27) and are commonly based on sequence profiles, predicted secondary structure and, more recently, template structures (28). Meta servers combining several biophysical and machine learning methods have also been published (29–31). All these methods have shown promising results, possibly for two reasons: (i) as the amino acid sequence contains all the information to determine structure, it is reasonable to assume that unstructured regions have specific amino acid propensities and (ii) disorder is important in many biological functions and therefore unstructured protein segments should be conserved by evolution. Because disordered segments have a biased sequence composition, machine learning techniques are well suited to the task. In this paper we describe and benchmark CSpritz, an extension of our previous Spritz server (27) based on three distinct modules for the prediction of intrinsically disordered regions in proteins. The performance of the method is benchmarked on the latest available data for short and long disordered segments. A novel addition to the CSpritz server is information about homologous structures found from PSI-BLAST searches, secondary structure and linear motifs, contributing to the functional annotation of disordered segments.

MATERIALS AND METHODS

CSpritz predicts intrinsic disorder from protein sequences through a combination of three machine learning systems, which will be described in the following sections. Most methods consider short and long disorder separately, as they have different characteristics. Short disorder can be derived from residues missing backbone atoms in X-ray crystallographic structures deposited in the Protein Data Bank (PDB) (32). Long disorder is taken from the Disprot database (33) because it is largely missing from the PDB. All data sets used throughout training are appropriately redundancy reduced using UniqueProt (34) and in all cases contain only sequences available before May 2008 (i.e. the start of CASP8).

Spritz

The original Spritz (27) is based on PSI-BLAST (35) multiple sequence profiles and predicted secondary structure. Support Vector Machines (SVMs) were used on a local sequence window to train two specialized binary classifiers, for long and short regions of disorder. A description of the data sets can be found in the previous publication (27). In addition to the original ab initio version of Spritz, a filter removing PDB structural homologues from predicted disorder is implemented. This works by performing a PSI-BLAST search against a redundancy reduced sequence database. The generated sequence profile is then used in a final PSI-BLAST round against a filtered PDB. Residues matching a structural template are assigned a Spritz score below the disorder threshold.
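The last step of the homologue filter can be illustrated with a minimal Python sketch. It assumes the PSI-BLAST search has already produced a per-residue mask of template coverage; the function name, the `margin` parameter and the default threshold of 0.5 are hypothetical and are not taken from the Spritz implementation.

```python
from typing import List


def apply_template_filter(scores: List[float],
                          template_mask: List[bool],
                          threshold: float = 0.5,
                          margin: float = 0.01) -> List[float]:
    """Cap scores of template-covered residues just below the threshold.

    scores        -- per-residue disorder scores (assumed to lie in [0, 1])
    template_mask -- True where a residue aligns to a PDB structural template
    threshold     -- disorder decision threshold (hypothetical default)
    margin        -- how far below the threshold covered residues are capped
    """
    if len(scores) != len(template_mask):
        raise ValueError("scores and template_mask must have equal length")
    cap = max(threshold - margin, 0.0)
    # Covered residues keep their score if it is already below the cap.
    return [min(s, cap) if covered else s
            for s, covered in zip(scores, template_mask)]


if __name__ == "__main__":
    print(apply_template_filter([0.9, 0.2, 0.8], [True, True, False]))
```

Residues without template coverage keep their ab initio score, so the filter can only remove predicted disorder, never add it.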

Punch

Punch is an SVM-based predictor extending Spritz. Sequence and structural homologues are detected as in Spritz. In addition, Porter secondary structure (36) and PaleAle relative solvent accessibility (37) are included. Unlike Spritz, information about structural templates is encoded and fed directly to the SVM together with the other inputs. The two data sets used for learning (see Supplementary Data) are a large set of disordered X-ray chains derived from the PDB (December 2007) and a publicly available data set (24) based on disordered X-ray segments from the PDB (May 2004). The assignment of disorder differs between the two data sets, and the two assignments do not necessarily overlap.

ESpritz

ESpritz is a fast predictor using bidirectional recursive neural networks (BRNNs) (38). BRNNs do not require contextual windows because they extract this information dynamically from the sequence. ESpritz has 20 input units, one for each of the 20 amino acids. Although the method is very simple, the BRNN is able to extract the patterns relevant for disorder without the use of PSI-BLAST sequence alignments (results not shown). As for Spritz, two data sets are built, one for long and one for short disorder (see Supplementary Material). The short disorder set is built from X-ray PDB structures (May 2008). Long disorder segments are extracted from Disprot (version 3.7) with identical sequences removed.
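The 20-unit input described above corresponds to a one-hot encoding of each residue. The following Python sketch shows one way to build such an input; the alphabet ordering and the all-zero vector for non-standard residues are assumptions made for illustration, not details of ESpritz.

```python
from typing import List

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> List[List[float]]:
    """Return one 20-dimensional vector per residue.

    Exactly one unit is set to 1.0 for a standard amino acid. Any other
    symbol (e.g. 'X') is encoded as all zeros, which is an assumption.
    """
    encoded = []
    for residue in sequence.upper():
        vector = [0.0] * len(AMINO_ACIDS)
        position = _INDEX.get(residue)
        if position is not None:
            vector[position] = 1.0
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    for residue, vector in zip("ACX", one_hot_encode("ACX")):
        print(residue, vector)
```

Because no alignment is needed to build this input, encoding is linear in the sequence length, which is consistent with the speed claimed for a single sequence method.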

Linear motifs and secondary structure

It can be useful to unify the following information for disordered segments: (i) amino acids involved; (ii) secondary structure; and (iii) important linear motifs. CSpritz offers this predicted information in various forms (see output section). Secondary structure propensities are predicted from Porter (36). Linear motifs (LMs) are selected from ELM (19) as the ligand binding subset (names starting with LIG). ELM is a resource for predicting functional sites in eukaryotic proteins where functional sites are identified by patterns. These motifs are supposed to be representative of the more studied LM–protein binding examples. The selected LMs are returned when sub-sequences are matched by their regular expressions in ELM.
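Matching linear motifs by regular expression inside predicted disordered segments can be sketched as follows in Python. The motif name `LIG_EXAMPLE_PxPxP` and its pattern are hypothetical placeholders built from the PxPxP pattern mentioned in the Introduction; they are not actual ELM entries, and the `D`/`O` state string is an assumed representation of the disorder prediction.

```python
import re
from typing import Dict, List, Tuple


def disordered_segments(states: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of consecutive 'D' residues."""
    return [m.span() for m in re.finditer(r"D+", states)]


def find_motifs(sequence: str, states: str,
                motifs: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Report (motif name, 1-based start, matched text) within disorder.

    Only matches lying entirely inside one disordered segment are
    returned; overlapping matches of the same motif are not reported.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    hits = []
    for start, end in disordered_segments(states):
        segment = sequence[start:end]
        for name, pattern in motifs.items():
            for match in re.finditer(pattern, segment):
                hits.append((name, start + match.start() + 1, match.group()))
    return hits


if __name__ == "__main__":
    toy_motifs = {"LIG_EXAMPLE_PxPxP": r"P.P.P"}  # hypothetical motif
    print(find_motifs("MKAPLPAPGSTW", "OOODDDDDDOOO", toy_motifs))
```

Searching each segment separately keeps the annotation restricted to disordered residues, at the cost of missing motifs that straddle a segment boundary.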

PERFORMANCE EVALUATION

Combination

Experiments were carried out to find the best procedure for combining Punch, Spritz and ESpritz. After trying majority voting, unanimous votes and combination with neural networks, the simplest method of averaging the probabilities produced by each system was found to be the best (data not shown). The optimal decision threshold was determined on data independent from the benchmarking set by maximizing the Sw measure (39). CASP8 data (39) were used for short and Disprot (version 3.7) for long disorder. Regular expressions are used to merge disordered regions separated by fewer than three residues. The Pearson correlation of the probabilities produced on CASP9 disorder targets was calculated to test how different the three predictors are. Table 1 shows this correlation and indicates that the three systems are indeed sufficiently different. This is important for combining the three systems, since it is well known that ensembling predictions which are different or uncorrelated improves generalization performance considerably (40). In particular, combination is especially beneficial when the wrongly predicted residues for each predictor do not correlate (i.e. their probabilities do not correlate) (41,42).
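The combination step (averaging, thresholding and regular-expression gap filling) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the server code: the function name, the `D`/`O` state alphabet and the example threshold of 0.5 are hypothetical, and gaps of one or two ordered residues flanked by disorder are taken as the meaning of "fewer than three residues".

```python
import re
from typing import List, Sequence, Tuple


def combine_predictions(prob_lists: Sequence[Sequence[float]],
                        threshold: float) -> Tuple[List[float], str]:
    """Average per-residue probabilities and derive a state string.

    prob_lists -- one probability list per predictor, all of equal length
    threshold  -- decision threshold on the averaged probability
    Returns the averaged probabilities and a string of 'D' (disorder)
    and 'O' (order) after short ordered gaps have been filled.
    """
    length = len(prob_lists[0])
    if any(len(p) != length for p in prob_lists):
        raise ValueError("all predictors must cover the same residues")
    mean = [sum(column) / len(column) for column in zip(*prob_lists)]
    states = "".join("D" if p >= threshold else "O" for p in mean)
    # Fill ordered gaps of one or two residues flanked by disorder.
    filled = re.sub(r"(?<=D)O{1,2}(?=D)",
                    lambda m: "D" * len(m.group()), states)
    return mean, filled


if __name__ == "__main__":
    espritz = [0.9, 0.2, 0.8, 0.1, 0.1, 0.1, 0.9]
    spritz = [0.8, 0.3, 0.7, 0.2, 0.1, 0.2, 0.8]
    punch = [0.7, 0.1, 0.9, 0.3, 0.1, 0.3, 0.7]
    _, states = combine_predictions([espritz, spritz, punch], threshold=0.5)
    print(states)
```

In the toy example the single ordered residue at position 2 is absorbed into the surrounding disorder, while the three-residue ordered stretch is left untouched.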

Table 1.

Pearson correlation of the three systems on CASP9 targets

	ESpritz	Spritz	Punch
ESpritz	1.00	0.51	0.59
Spritz		1.00	0.42
Punch			1.00

The probabilities are produced by each component on all residues for 117 CASP9 targets. Since the correlations are low, combining the three systems improves performance over the individual systems.
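For reference, the Pearson correlation between two predictors can be computed from their per-residue probabilities as in the Python sketch below. The function is a plain textbook implementation and the toy probability lists are invented for illustration; they are not CASP9 data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented per-residue probabilities from two hypothetical predictors,
    # concatenated over all targets.
    predictor_a = [0.9, 0.2, 0.8, 0.1, 0.4]
    predictor_b = [0.7, 0.4, 0.3, 0.2, 0.6]
    print(round(pearson(predictor_a, predictor_b), 2))
```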

Benchmarking sets

Validation of short disorder segments is performed on the 117 CASP9 targets (URL: http://www.predictioncenter.org/casp9/), comparing with other groups taking part in the disorder category experiment according to their official CASP results. In order to validate the long disorder segments we choose DisProt entries enriched with PDB annotation from the SL data set defined in (43). Unfortunately, selecting sequences with <40% sequence identity to our training set leaves only 29 proteins. We also define a set of 569 X-ray sequences (Xray569) deposited in the PDB (resolution at most 2.5 Å and R-free <0.25) between May 2008 and September 2010 reduced by sequence identity using UniqueProt (34) to an HSSP value of 0 to our training data and among each other. Supplementary Table S1 shows the size and composition of the validation data sets. Note that to ensure a fair comparison to other methods on our benchmarking sets, CSpritz was in all cases run with sequence and PDB databases frozen prior to May 2008.

CASP short disorder

To assess the performance of our server for the short disorder option, we rank all groups participating in the CASP9 experiment. Table 2 shows the top 5 (out of 32) groups plus CSpritz and Spritz ranked by Sw, a commonly used measure at CASP. For Sw, as in the CASP8 assessment (39), the statistical significance of the evaluation scores was determined by bootstrapping: 80% of the targets were randomly selected 1000 times, and the standard error of the scores was calculated (for normally distributed scores, 1.96 × the standard error gives the 95% confidence interval around the mean). For a full list of rankings see the online methods page. Our results suggest a consistently good performance of our server, especially when taking into account that some of the top five are meta-servers and some are not publicly available.
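The evaluation procedure can be sketched in Python as follows. Two assumptions are made and should be checked against the CASP8 assessment (39): Sw is computed as sensitivity + specificity − 1 (in percent), which is consistent with the values in Tables 2 and 3 where Sw = 2 × ACC − 100 (for example 2 × 75.22 − 100 = 50.44), and the 80% subsets are drawn without replacement. All function names are hypothetical.

```python
import random
from typing import List, Tuple

# One target = (observed states, predicted states), 'D' or 'O' per residue.
Target = Tuple[str, str]


def sw_and_acc(targets: List[Target]) -> Tuple[float, float]:
    """Return (Sw, ACC) in percent, pooled over all residues."""
    tp = fp = tn = fn = 0
    for observed, predicted in targets:
        for obs, pred in zip(observed, predicted):
            if obs == "D":
                tp += pred == "D"
                fn += pred != "D"
            else:
                tn += pred != "D"
                fp += pred == "D"
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need both ordered and disordered residues")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    acc = 100.0 * (sensitivity + specificity) / 2.0
    sw = 100.0 * (sensitivity + specificity - 1.0)  # assumed form of Sw
    return sw, acc


def bootstrap_se(targets: List[Target], fraction: float = 0.8,
                 rounds: int = 1000, seed: int = 0) -> float:
    """Standard deviation of Sw over random subsets of the targets."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(targets)))
    scores = [sw_and_acc(rng.sample(targets, k))[0] for _ in range(rounds)]
    mean = sum(scores) / rounds
    variance = sum((s - mean) ** 2 for s in scores) / (rounds - 1)
    return variance ** 0.5


if __name__ == "__main__":
    toy = [("DDOO", "DDOO"), ("DDOO", "DOOD"), ("DOOO", "DDOO"),
           ("DDDO", "DDOO"), ("DOOO", "DOOO")]
    sw, acc = sw_and_acc(toy)
    print(f"Sw={sw:.2f} ACC={acc:.2f} SE={bootstrap_se(toy):.2f}")
```

Under these assumptions, a 95% confidence interval follows as the score ± 1.96 × the returned standard error.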

Table 2.

Results for the top five performing groups at the CASP9 experiment, CSpritz and the original Spritz

GroupID: Name	Sw (±SE)	ACC	AUC
291: PRDOS2	50.44 (±1.08)	75.22	0.852
119: MULTICOM-REFINE	49.53 (±1.00)	74.77	0.818
000: CSpritz	49.27 (±1.02)	74.64	0.828
351: BIOMINE_DR_PDB	48.21 (±1.25)	74.11	0.818
374: GSMETADISORDERMD	47.13 (±0.96)	73.57	0.815
193: MASON	45.98 (±1.17)	73.00	0.740
000: Spritz	24.91 (±1.18)	62.46	0.716

Disordered segments of fewer than three residues were removed (results unchanged if included, see Supplementary Table S3). The standard error (SE) for Sw is shown in brackets. ACC is the accuracy, i.e. (sensitivity + specificity)/2, and AUC the area under the receiver operating characteristic (ROC) curve. A total of 32 groups participated in the CASP9 disorder prediction category.

DisProt long disorder

The long disorder type performance of CSpritz was benchmarked by comparing Sw, accuracy and AUC with the original Spritz and state-of-the-art predictors PONDR-FIT (30), Disopred (9) and IUPred (23). Table 3 shows CSpritz performing significantly better than the other predictors for this type of disorder. In addition CSpritz improves over the long disorder predictions made by our previous server Spritz.

Table 3.

Comparison for DisProt disordered regions

Method	Sw (±SE)	ACC	AUC
CSpritz (short)	54.64 (±3.58)	77.32	0.837
CSpritz (long)	65.70 (±3.52)	82.85	0.891
Spritz (short)	12.12 (±6.16)	56.06	0.685
Spritz (long)	35.55 (±3.58)	67.78	0.734
PONDR-FIT	51.53 (±4.34)	75.77	0.817
Disopred2	46.20 (±4.00)	73.10	0.806
IUPred (short)	37.65 (±4.77)	68.83	0.814
IUPred (long)	42.57 (±4.75)	71.29	0.818

CSpritz is compared with the original Spritz, PONDR-FIT, Disopred and IUPred. Where applicable both short and long options are reported. The standard error (SE) for Sw is shown in brackets. ACC is the accuracy, i.e. (sensitivity + specificity)/2, and AUC the area under the receiver operating characteristic (ROC) curve. The decision threshold and best Sw were found to be 0.26 and 51.85, respectively, on the training set.

Large-scale performance

To estimate the run time of CSpritz compared to others and validate the predictions on a larger set of PDB structures we use the Xray569 set. The results (Supplementary Table S2) are similar to the DisProt set and confirm the performance of CSpritz compared to the other methods. As can be expected, all methods are better at predicting disorder at the N- and C-termini than in the central part of the protein sequences. The execution time for CSpritz is largely determined by the PSI-BLAST search and comparable to the original Spritz and Disopred2, with ca. 15 min for an average protein. When executing multiple predictions, the CSpritz web server will run up to five proteins in parallel, reducing the overall time significantly.

SERVER DESCRIPTION

The CSpritz input page is designed with simplicity in mind. One or more sequences in FASTA format are the only input required and can be either pasted or uploaded as a file. Pasting is limited to 32 000 characters but uploading has no restrictions. User email address and a query title are optional. Either short (default) or long disorder options can be selected, with the appropriate decision thresholds determined on data not involved in the benchmarking. To facilitate navigation, help and methods pages are available at the top of the interface. The CSpritz output is presented in two main pages. The first page, displaying statistics, links to individual pages and a downloadable archive for all user supplied proteins, is present only if more than one sequence was submitted. A histogram of disordered segments and an archive for download containing all generated data are also available. Figure 1 shows a sample global page for the 117 CASP9 targets.
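As an illustration of the input format, the Python sketch below parses pasted FASTA text and enforces the 32 000 character paste limit stated above. It is a client-side style check written for this example only; the function names are hypothetical and nothing here reflects how the server itself validates input.

```python
from typing import List, Tuple

MAX_PASTE_CHARS = 32_000  # paste limit stated for the input page


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_pasted_input(text: str) -> List[Tuple[str, str]]:
    """Reject pasted text over the limit, otherwise parse it."""
    if len(text) > MAX_PASTE_CHARS:
        raise ValueError("pasted input exceeds 32 000 characters; upload a file")
    return parse_fasta(text)


if __name__ == "__main__":
    example = ">protein_a\nMKAPLPAP\nGSTW\n>protein_b\nMGG\n"
    for name, seq in check_pasted_input(example):
        print(name, len(seq))
```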

Figure 1.

Global output page for multiple sequences. Summary statistics are displayed for some interesting values about the disorder segments of all query sequences. An archive is offered for download containing all disorder predictions, linear motifs and statistics for each protein the user supplied. The inset shows a graph displaying the length distribution of disorder segments among all proteins.

The second output displays predicted disorder and annotation for individual proteins. In addition to showing the sequence with predicted secondary structure and disorder, several statistics regarding the distribution of disorder are presented. An extensive description of the output is available as part of the online help page. Two graphs plot the probability of disorder and the number of available structural templates versus disordered regions in homologous PDB structures. The last part of the output concerns the presence of putative linear motifs and secondary structure propensity for disordered segments. This can be a useful source of functional annotation, as shown in Figure 2 for Drosophila melanogaster Cryptochrome (dCRY). Following computational analysis, functional linear motifs were experimentally confirmed in the disordered C-terminus of dCRY (44). CSpritz aims to speed up this type of analysis by providing additional clues. In dCRY the putative linear motifs (Figure 2) match the disordered residues having a favorable alpha helical propensity. It is known that many such interactions involve disorder to secondary structure transitions upon binding (45).

Figure 2.

Individual output page for D. melanogaster Cryptochrome. The main figure shows the list of available files and the actual disorder prediction. The latter is composed of the amino acid sequence, its predicted secondary structure and the CSpritz disorder classification, with disordered residues highlighted in red font. Disorder statistics for the protein are presented on the right. Two insets show the graphs for the disorder propensity plot (top right) and the number of available structural coordinates versus disordered segments in homologous sequences. The inset on the bottom part shows the annotated disordered segment covering the C-terminus of Cryptochrome (residues 513–542). The propensities for secondary structure and the location of putative functional motifs are shown. Links to the ELM description of the motif and the amino acids involved in the motif are supplied on the right. A graph and the probabilities of secondary structure propensity are also supplied.

CONCLUSIONS

We have described CSpritz, a novel web server for the prediction of intrinsically disordered protein segments from sequence. It allows the batch prediction of many sequences simultaneously, providing overview statistics. The single protein sequence is annotated with disorder and useful information regarding local secondary structure and possible interaction motifs, providing a first step towards the functional interpretation of disorder. Future work will concentrate on improving the functional description of disordered regions by including other types of related information such as repeats (46) and aggregation (47).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

University of Padova (CPDA098382, CPDR097328 to S.T.); FIRB Futuro in Ricerca (RBFR08ZSXY to S.T.). Funding for open access charge: FIRB Futuro in Ricerca grant from the Italian Ministry of Education, University and Research (MIUR).

Conflict of interest statement. None declared.

43 in total

1. Evolution of structurally disordered proteins promotes neostructuralization.

Authors: Jessica Siltberg-Liberles
Journal: Mol Biol Evol Date: 2010-10-29 Impact factor: 16.240

2. Cell regulation: determined to signal discrete cooperation.

Authors: Toby J Gibson
Journal: Trends Biochem Sci Date: 2009-09-08 Impact factor: 13.807

3. Assessment of disorder predictions in CASP8.

Authors: Orly Noivirt-Brik; Jaime Prilusky; Joel L Sussman
Journal: Proteins Date: 2009

4. Protein-peptide interactions adopt the same structural motifs as monomeric protein folds.

Authors: Peter Vanhee; Francois Stricher; Lies Baeten; Erik Verschueren; Tom Lenaerts; Luis Serrano; Frederic Rousseau; Joost Schymkowitz
Journal: Structure Date: 2009-08-12 Impact factor: 5.006

5. Parameterization of disorder predictors for large-scale applications requiring high specificity by using an extended benchmark dataset.

Authors: Fernanda L Sirota; Hong-Sain Ooi; Tobias Gattermayer; Georg Schneider; Frank Eisenhaber; Sebastian Maurer-Stroh
Journal: BMC Genomics Date: 2010-02-10 Impact factor: 3.969

6. Library of disordered patterns in 3D protein structures.

Authors: Michail Yu Lobanov; Eugeniya I Furletova; Natalya S Bogatyreva; Michail A Roytberg; Oxana V Galzitskaya
Journal: PLoS Comput Biol Date: 2010-10-14 Impact factor: 4.475

7. PONDR-FIT: a meta-predictor of intrinsically disordered amino acids.

Authors: Bin Xue; Roland L Dunbrack; Robert W Williams; A Keith Dunker; Vladimir N Uversky
Journal: Biochim Biophys Acta Date: 2010-01-25

8. Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources.

Authors: Marcin J Mizianty; Wojciech Stach; Ke Chen; Kanaka Durga Kedarisetti; Fatemeh Miri Disfani; Lukasz Kurgan
Journal: Bioinformatics Date: 2010-09-15 Impact factor: 6.937

9. Protein secondary structure appears to be robust under in silico evolution while protein disorder appears not to be.

Authors: Christian Schaefer; Avner Schlessinger; Burkhard Rost
Journal: Bioinformatics Date: 2010-01-16 Impact factor: 6.937

10. ELM: the status of the 2010 eukaryotic linear motif resource.

Authors: Cathryn M Gould; Francesca Diella; Allegra Via; Pål Puntervoll; Christine Gemünd; Sophie Chabanis-Davidson; Sushama Michael; Ahmed Sayadi; Jan Christian Bryne; Claudia Chica; Markus Seiler; Norman E Davey; Niall Haslam; Robert J Weatheritt; Aidan Budd; Tim Hughes; Jakub Pas; Leszek Rychlewski; Gilles Travé; Rein Aasland; Manuela Helmer-Citterich; Rune Linding; Toby J Gibson
Journal: Nucleic Acids Res Date: 2009-11-17 Impact factor: 16.971
