
PreDisorder: ab initio sequence-based prediction of protein disordered regions.

Xin Deng, Jesse Eickholt, Jianlin Cheng.

Abstract

BACKGROUND: Disordered regions are segments of the protein chain which do not adopt stable structures. Such segments are often of interest because they have a close relationship with protein expression and functionality. As such, protein disorder prediction is important for protein structure prediction, structure determination and function annotation.
RESULTS: This paper presents our protein disorder prediction server, PreDisorder. It is based on our ab initio prediction method (MULTICOM-CMFR) which, along with our meta (or consensus) prediction method (MULTICOM), was recently ranked among the top disorder predictors in the eighth edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP8). We systematically benchmarked PreDisorder along with 26 other protein disorder predictors on the CASP8 data set and assessed its accuracy using a number of measures. The results show that it compared favourably with other ab initio methods and its performance is comparable to that of the best meta and clustering methods.
CONCLUSION: PreDisorder is a fast and reliable server that can be used to predict protein disordered regions on a genomic scale. It is available at http://casp.rnet.missouri.edu/predisorder.html.

Year:  2009        PMID: 20025768      PMCID: PMC3087350          DOI: 10.1186/1471-2105-10-436

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

While most regions of a protein adopt localized, stable structures, there are some segments of the protein chain which do not. These are regions whose coordinates are hard to determine by experimental techniques or that simply do not fold into stable structures [1,2]. Such regions are known as disordered regions. Proteins with disordered regions are capable of binding to multiple partners and participating in various reactions and pathways [3-5]. Disordered regions can also give rise to poor expression of a protein, making it difficult to produce for crystallization or other purposes [6]. Consequently, the prediction of disordered regions in proteins has implications for protein production, structure prediction and determination, function annotation and cellular process recognition. Measuring native disorder experimentally is time-consuming and expensive, and thus computational approaches for the prediction of protein disordered regions have received considerable attention in recent years [7]. As a result, a number of disorder prediction programs and web services, together with their underlying methods, are quickly becoming valuable tools for protein structure prediction, determination, and function annotation [8-18]. To stimulate further development of disorder prediction, CASP has dedicated a category to blindly benchmarking the current state of the art. Here we benchmark our ab initio and consensus (or meta) disorder predictors along with dozens of other predictors that participated in the CASP8 experiment. The good performance of our PreDisorder server makes it a valuable and accurate tool for protein structure prediction, structure determination and protein engineering.

Implementation

Ab initio neural network method

Our server, PreDisorder, is based on our ab initio method, which participated in CASP8 under the group name MULTICOM_CMFR. It is a machine learning approach that uses 1-D Recursive Neural Networks. A target protein sequence is first aligned against several template profiles using PSI-BLAST, which creates an input profile of the sequence. This profile, along with the predicted secondary structure and solvent accessibility, is fed into a 1D Recursive Neural Network (1D-RNN) that makes the disorder predictions [6]. More specifically, for each protein sequence the input is a one-dimensional array $I$ whose length is the total number of residues in the sequence. Each element $I_i$ of the array is a vector of 25 values that represents residue $i$. Of these 25 values, 20 represent the frequencies of each amino acid at the corresponding position of the PSI-BLAST profile [19]. The other five are binary values used to encode the predicted secondary structure (Helix, Strand or Coil) and solvent accessibility of the residue [20-22]. Based on the input $I$, the 1D-RNN produces an array of real numbers $O$, where the $i$-th element $O_i$ is the probability that the $i$-th residue is disordered. A large curated dataset was randomly divided into ten subsets of approximately equal size, and the 1D-RNN was then trained and tested on these subsets by ten-fold cross-validation [23].

Finally, the predicted disorder probabilities of the residues were re-scaled so that the ratio of residues with disorder probability greater than or equal to 0.5 is close to the ratio of disordered residues in the training dataset [23]. Specifically, the scaling method first identified a probability threshold $t$ (e.g. 0.1) for selecting predicted disordered residues such that the ratio (the number of predicted disordered residues / the total number of residues in the test dataset) is equal to the ratio of disordered residues in the training dataset (e.g. 5%). Each predicted disorder probability $x$ was then re-scaled to $x'$ as follows:

$$x' = \begin{cases} 0.5 \cdot \dfrac{x}{t} & \text{if } x \le t \\[2ex] 0.5 + 0.5 \cdot \dfrac{x - t}{1 - t} & \text{if } x > t \end{cases}$$
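The re-scaling step can be illustrated with a minimal Python sketch. It is not the server's code: the function names `find_threshold` and `rescale_probability` are hypothetical, and choosing the threshold as the k-th largest raw probability is only one assumed way of matching the target ratio.

```python
def find_threshold(probs, target_ratio):
    """Pick a threshold t so that about target_ratio of residues score >= t.

    Assumption: t is taken as the k-th largest raw probability, where
    k = round(target_ratio * number of residues). The paper does not say
    how t was searched for, so this is only one plausible choice.
    """
    if not probs:
        raise ValueError("probs must not be empty")
    ranked = sorted(probs, reverse=True)
    k = max(1, round(target_ratio * len(ranked)))
    return ranked[k - 1]


def rescale_probability(x, t):
    """Map a raw probability x so that the threshold t lands on 0.5.

    Implements the piecewise-linear rule from the text:
    x/t * 0.5 if x <= t, else 0.5 + 0.5 * (x - t)/(1 - t).
    """
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    if x <= t:
        return 0.5 * x / t
    return 0.5 + 0.5 * (x - t) / (1.0 - t)


if __name__ == "__main__":
    # Illustrative raw outputs for ten residues (made-up numbers).
    raw = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.2, 0.6]
    t = find_threshold(raw, target_ratio=0.2)
    for x in raw:
        print(f"{x:.2f} -> {rescale_probability(x, t):.3f}")
```

Because both pieces are increasing and meet at 0.5, the mapping preserves the ranking of residues; it only moves the decision boundary from t to 0.5.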

Meta method

A meta method is a consensus approach that makes predictions based on the output of other predictors. Similar ideas have been applied to many prediction problems, such as protein fold recognition, and have achieved much better performance than individual predictors. One example of this approach is 3D-Jury, an automated protein structure meta prediction system available through Meta Server, which generates meta-predictions from a variety of models produced by different methods [24-26]. Our new meta predictor MULTICOM makes predictions based on a consensus formed from other CASP8 disorder predictors. It removes a few very inaccurate disorder predictors and then averages the output of the remaining disorder predictors. Our simple averaging approach is different from other meta methods based on consensus voting.
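The averaging idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MULTICOM implementation: the function name `consensus_average` is hypothetical, and because the text does not say how the very inaccurate predictors are identified, the caller simply passes the names to exclude.

```python
def consensus_average(predictions, excluded=()):
    """Average per-residue disorder probabilities over the kept predictors.

    predictions: dict mapping a predictor name to a list of per-residue
                 probabilities for the same target sequence.
    excluded:    names of predictors to drop before averaging (assumed to
                 be chosen beforehand; the selection rule is not modelled).
    """
    kept = [p for name, p in predictions.items() if name not in excluded]
    if not kept:
        raise ValueError("no predictors left after exclusion")
    if len({len(p) for p in kept}) != 1:
        raise ValueError("all predictions must cover the same residues")
    # zip(*kept) walks the sequence residue by residue across predictors.
    return [sum(column) / len(kept) for column in zip(*kept)]


if __name__ == "__main__":
    # Made-up outputs from three hypothetical predictors for four residues.
    outputs = {
        "predictor_a": [0.2, 0.8, 0.6, 0.1],
        "predictor_b": [0.4, 0.6, 0.7, 0.2],
        "predictor_c": [1.0, 0.0, 0.0, 1.0],
    }
    print(consensus_average(outputs, excluded={"predictor_c"}))
```

Unlike consensus voting, which counts binary order/disorder calls, averaging keeps the probability scale, so the result can still be thresholded or ranked for a ROC analysis.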

Results and discussion

We evaluated 27 disorder predictors that participated in CASP8, including our ab initio predictor (MULTICOM-CMFR) and our meta predictor (MULTICOM). They were evaluated on 117 protein targets whose structures were available when our evaluation was conducted. These targets contain 25431 residues, and all the disorder predictions for them were downloaded from the CASP8 web site [27]. When evaluating the disorder predictions against the protein targets, target residues that did not have corresponding coordinates in the PDB file were considered to be disordered. The disorder annotations for the targets were curated by Dr. McGuffin [28]. Each residue in the target sequence is tagged with a binary label of "O" (order) or "D" (disorder). We evaluated the methods on all 117 targets and on two subsets (97 X-ray structures and 20 NMR structures). It is worth pointing out that our evaluation serves as a complementary, comparative benchmark of our methods. Readers should refer to the CASP8 assessment paper for the official assessment of disorder predictions [29].

In evaluating the disorder predictors, we considered a number of different, commonly used measurements of performance for binary classifiers. One such measurement was the ROC score. This value is the area under the Receiver Operating Characteristic (ROC) curve (AUC) and measures the performance of a classifier system and its dependence upon its discrimination threshold. Ranking the predictors using ROC curves is a widely used method in bioinformatics and CASP competitions [7,30,31]. Another set of commonly used measurements for classifier systems is sensitivity and specificity. For each disorder predictor, we calculated the Positive Sensitivity ($TP/(TP+FN)$), Positive Specificity ($TP/(TP+FP)$), Negative Sensitivity ($TN/(TN+FP)$) and Negative Specificity ($TN/(TN+FN)$) [31]. Here, TP is the number of true positives (residues correctly identified as disordered) and FP is the number of false positives (residues predicted as disordered, but experimentally ordered). TN is the number of true negatives (residues correctly identified as ordered) and FN is the number of false negatives (residues predicted as ordered, but experimentally disordered). While in principle it is possible for a system to achieve high values for both positive and negative sensitivity, in practice this does not happen often. Usually, a sharp increase in one results in a decrease in the other. An extreme example would be a predictor which identifies all residues as disordered. Such a system would have a positive sensitivity of 100% and a negative sensitivity of 0%. In an attempt to join several of these measurements into one, we considered the product of positive sensitivity and negative sensitivity, and the harmonic mean, or F-measure, of the positive sensitivity and positive specificity [32].

We also calculated a weighted score for each predictor. This measure was introduced in CASP6 and is defined as

$$\text{Score} = \frac{W_{disorder} \cdot TP - W_{order} \cdot FP + W_{order} \cdot TN - W_{disorder} \cdot FN}{TP + FP + TN + FN}$$

where $W_{disorder}$ is set to 92.63 and $W_{order}$ to 7.37 [31]. As defined, this measure greatly rewards disordered residues that are correctly classified as disordered while heavily penalizing any disordered residue that is misclassified. Table 1 reports the ROC score, weighted score, positive sensitivity, positive specificity, negative sensitivity, negative specificity, product of positive sensitivity and negative sensitivity, and F-measure of all the disorder predictors. Table 1 also shows the total number of residues predicted by each predictor. For comparison, we repeated the evaluation for the "only x-ray" and the "only NMR" sets, and the results are shown in Table 2 and Table 3. Figure 1 shows the ROC curves for the predictors. The predictors are ordered by ROC score since the ROC measure is probably the most balanced measurement.
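The threshold-based measures can be computed from per-residue labels as in the Python sketch below. It assumes the usual confusion-matrix definitions (positive specificity taken as TP/(TP+FP), negative specificity as TN/(TN+FN)) and a weighted score normalised by the total number of residues; these choices are consistent with the F-measure and sensitivity-product columns of Table 1, but the function names and the 0.5 cut-off default are illustrative assumptions rather than the evaluation scripts used for the paper.

```python
def confusion_counts(labels, probs, threshold=0.5):
    """Count TP, FP, TN, FN for 'D' (disorder) / 'O' (order) residue labels."""
    tp = fp = tn = fn = 0
    for label, prob in zip(labels, probs):
        predicted_disordered = prob >= threshold
        if label == "D":
            if predicted_disordered:
                tp += 1
            else:
                fn += 1
        elif predicted_disordered:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def disorder_measures(tp, fp, tn, fn, w_disorder=92.63, w_order=7.37):
    """Return the measures discussed in the text as a dictionary."""
    pos_sens = _ratio(tp, tp + fn)
    pos_spec = _ratio(tp, tp + fp)
    neg_sens = _ratio(tn, tn + fp)
    neg_spec = _ratio(tn, tn + fn)
    weighted_sum = (w_disorder * tp - w_order * fp
                    + w_order * tn - w_disorder * fn)
    return {
        "pos_sens": pos_sens,
        "pos_spec": pos_spec,
        "neg_sens": neg_sens,
        "neg_spec": neg_spec,
        "sens_product": pos_sens * neg_sens,
        # Harmonic mean of positive sensitivity and positive specificity.
        "f_measure": _ratio(2 * pos_sens * pos_spec, pos_sens + pos_spec),
        "weighted_score": _ratio(weighted_sum, tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # The extreme case from the text: every residue is called disordered.
    labels = ["D", "O", "O", "O"]
    probs = [1.0, 1.0, 1.0, 1.0]
    print(disorder_measures(*confusion_counts(labels, probs)))
```

Running the example reproduces the extreme case described above: positive sensitivity is 1.0 and negative sensitivity is 0.0, so their product collapses to zero even though every disordered residue was found.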
Table 1

Results for protein disorder predictors that participated in CASP8 on 117 targets.

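| Disorder Predictor | ROC Score | Weighted Score | Total Res. | Pos. Sens. | Pos. Spec. | Neg. Sens. | Neg. Spec. | Sens. Prod. | F-meas. |
|---|---|---|---|---|---|---|---|---|---|
| MULTICOM | 0.879 | 6.89 | 25148 | 0.532 | 0.619 | 0.973 | 0.962 | 0.518 | 0.572 |
| DISOclust | 0.862 | 7.822 | 24021 | 0.753 | 0.248 | 0.813 | 0.976 | 0.612 | 0.373 |
| fais-server | 0.861 | 6.613 | 24021 | 0.522 | 0.533 | 0.963 | 0.961 | 0.503 | 0.528 |
| MULTICOM-CMFR* | 0.859 | 8.06 | 25431 | 0.724 | 0.299 | 0.859 | 0.974 | 0.622 | 0.423 |
| 3Dpro* | 0.855 | 4.826 | 23934 | 0.378 | 0.727 | 0.988 | 0.949 | 0.373 | 0.497 |
| GS-MetaServer | 0.852 | 5.356 | 24976 | 0.41 | 0.711 | 0.986 | 0.953 | 0.404 | 0.52 |
| mariner1* | 0.846 | 7.469 | 23148 | 0.622 | 0.396 | 0.923 | 0.968 | 0.574 | 0.484 |
| Distill-Punch* | 0.846 | 0.798 | 23311 | 0.067 | 0.75 | 0.998 | 0.93 | 0.067 | 0.123 |
| Metaprdos | 0.842 | 6.431 | 22363 | 0.505 | 0.539 | 0.966 | 0.961 | 0.488 | 0.522 |
| Distill-Punch* | 0.839 | 0.823 | 23730 | 0.063 | 0.881 | 0.999 | 0.93 | 0.063 | 0.118 |
| CBRC_POODLE* | 0.839 | 6.259 | 24021 | 0.522 | 0.403 | 0.937 | 0.96 | 0.489 | 0.455 |
| GeneSilicoMeta | 0.839 | 6.338 | 24976 | 0.489 | 0.627 | 0.976 | 0.959 | 0.478 | 0.55 |
| DISOPRED* | 0.831 | 6.497 | 24021 | 0.522 | 0.485 | 0.955 | 0.961 | 0.498 | 0.503 |
| Biomine* | 0.825 | 6.394 | 23472 | 0.517 | 0.47 | 0.952 | 0.96 | 0.492 | 0.493 |
| OnD-CRF* | 0.82 | 5.214 | 24021 | 0.47 | 0.31 | 0.914 | 0.955 | 0.429 | 0.373 |
| Spritz* | 0.791 | 5.431 | 24021 | 0.434 | 0.514 | 0.966 | 0.954 | 0.42 | 0.471 |
| GSMetaDisorder | 0.772 | 6.31 | 24504 | 0.579 | 0.289 | 0.88 | 0.961 | 0.51 | 0.386 |
| Distill* | 0.758 | 5.637 | 24021 | 0.671 | 0.173 | 0.738 | 0.965 | 0.495 | 0.276 |
| LEE-SERVER* | 0.741 | 3.595 | 21955 | 0.31 | 0.449 | 0.968 | 0.943 | 0.3 | 0.367 |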

Results of our evaluation of all the protein disorder predictors ranked by ROC Score (AUC). Total Res. represents the total number of residues predicted. Pos. Spec. and Neg. Spec. are the positive and negative specificities. Pos. Sens. and Neg. Sens. are the positive and negative sensitivities. F-meas. is the F-Measure of the positive sensitivity and positive specificity and Sens. Prod. is the product of the positive sensitivity and negative sensitivity. * denotes ab initio methods. MULTICOM-CMFR is the predictor name of PreDisorder in CASP8.

Table 2

Results for protein disorder predictors that participated in CASP8 on the 20 NMR targets (T0437, T0460, T0462, T0464, T0466, T0467, T0468, T0469, T0471, T0472, T0473, T0474, T0475, T0476, T0480, T0482, T0484, T0492, T0498, T0499).

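| Disorder Predictor | ROC Score | Weighted Score | Total Res. | Sens. | Spec. | Sens. Prod. | F-measure |
|---|---|---|---|---|---|---|---|
| GSMetaDisorder | 0.800 | 3.801 | 1654 | 0.786 | 0.700 | 0.550 | 0.082 |
| MULTICOM-CMFR* | 0.792 | 4.393 | 1895 | 0.816 | 0.700 | 0.572 | 0.125 |
| fais-server | 0.769 | 5.407 | 1601 | 0.510 | 0.874 | 0.446 | 0.186 |
| OnD-CRF* | 0.761 | 5.671 | 1601 | 0.429 | 0.925 | 0.397 | 0.226 |
| MULTICOM | 0.760 | 5.283 | 1895 | 0.469 | 0.878 | 0.412 | 0.155 |
| Biomine* | 0.753 | 4.753 | 1489 | 0.465 | 0.845 | 0.393 | 0.139 |
| CBRC_POODLE* | 0.744 | 3.864 | 1601 | 0.490 | 0.774 | 0.379 | 0.113 |
| DISOPRED* | 0.741 | 5.458 | 1601 | 0.531 | 0.870 | 0.462 | 0.188 |
| Metaprdos | 0.731 | 5.971 | 1601 | 0.449 | 0.938 | 0.421 | 0.263 |
| 3Dpro* | 0.729 | 6.115 | 1411 | 0.458 | 0.948 | 0.434 | 0.312 |
| Distill* | 0.720 | 3.042 | 1601 | 0.694 | 0.636 | 0.441 | 0.105 |
| GS-MetaServer | 0.715 | 5.525 | 1775 | 0.367 | 0.933 | 0.343 | 0.197 |
| DISOclust | 0.709 | 3.011 | 1601 | 0.571 | 0.682 | 0.390 | 0.098 |
| Distill-Punch* | 0.703 | 4.107 | 1601 | 0.000 | 0.986 | 0.000 | 0.000 |
| GeneSilicoMeta | 0.703 | 5.540 | 1775 | 0.469 | 0.897 | 0.421 | 0.185 |
| Spritz* | 0.679 | 4.772 | 1601 | 0.224 | 0.943 | 0.212 | 0.149 |
| Mariner1* | 0.666 | 4.636 | 1601 | 0.592 | 0.788 | 0.466 | 0.143 |
| LEE-SERVER* | 0.654 | 5.486 | 1684 | 0.349 | 0.932 | 0.325 | 0.176 |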

* denotes ab initio methods.

Table 3

Results for protein disorder predictors that participated in CASP8 on the 97 x-ray targets.

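| Disorder Predictor | ROC Score | Weighted Score | Total Res. | Sens. | Spec. | Sens. Prod. | F-measure |
|---|---|---|---|---|---|---|---|
| MULTICOM | 0.887 | 7.021 | 23253 | 0.534 | 0.981 | 0.523 | 0.610 |
| DISOclust | 0.868 | 8.165 | 22420 | 0.758 | 0.823 | 0.624 | 0.397 |
| fais-server | 0.867 | 6.699 | 22420 | 0.523 | 0.969 | 0.506 | 0.555 |
| MULTICOM-CMFR* | 0.865 | 8.355 | 23536 | 0.721 | 0.873 | 0.630 | 0.455 |
| 3Dpro* | 0.862 | 4.745 | 22523 | 0.375 | 0.990 | 0.372 | 0.507 |
| GS-MetaServer | 0.861 | 5.343 | 23201 | 0.411 | 0.991 | 0.407 | 0.541 |
| Mariner1* | 0.857 | 7.679 | 21547 | 0.623 | 0.933 | 0.582 | 0.518 |
| Distill-Punch* | 0.856 | 0.554 | 21710 | 0.069 | 0.999 | 0.069 | 0.128 |
| CBRC_POODLE* | 0.848 | 6.430 | 22420 | 0.523 | 0.949 | 0.496 | 0.493 |
| GeneSilicoMeta | 0.848 | 6.399 | 23201 | 0.490 | 0.982 | 0.481 | 0.579 |
| Metaprdos | 0.846 | 6.467 | 20762 | 0.507 | 0.968 | 0.491 | 0.536 |
| DISOPRED* | 0.837 | 6.571 | 22420 | 0.521 | 0.961 | 0.501 | 0.528 |
| Biomine* | 0.829 | 6.506 | 21983 | 0.519 | 0.959 | 0.498 | 0.522 |
| OnD-CRF* | 0.822 | 5.181 | 22420 | 0.471 | 0.913 | 0.430 | 0.379 |
| Spritz* | 0.800 | 5.478 | 22420 | 0.440 | 0.968 | 0.426 | 0.486 |
| GSMetaDisorder | 0.778 | 6.492 | 22850 | 0.576 | 0.894 | 0.515 | 0.417 |
| Distill* | 0.761 | 5.822 | 22420 | 0.670 | 0.746 | 0.499 | 0.289 |
| LEE-SERVER* | 0.749 | 3.438 | 20271 | 0.309 | 0.971 | 0.300 | 0.379 |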

* denotes ab initio methods.

Figure 1

ROC curves of CASP8 predictors (ordered by ROC score) on the CASP8 dataset which consisted of 117 protein targets.
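For readers who want to reproduce a ROC score from per-residue predictions, the following Python sketch computes the area under the ROC curve through its pairwise-ranking interpretation: the fraction of (disordered, ordered) residue pairs in which the disordered residue receives the higher probability, with ties counted as one half. The function name `roc_auc` is hypothetical, and the quadratic loop is chosen for clarity rather than speed; it is not the procedure used to produce Figure 1.

```python
def roc_auc(labels, probs):
    """Area under the ROC curve for 'D' (disorder) vs 'O' (order) labels.

    Equivalent to the normalised Mann-Whitney U statistic: the probability
    that a randomly chosen disordered residue is scored above a randomly
    chosen ordered residue, counting ties as half a win.
    """
    disordered = [p for label, p in zip(labels, probs) if label == "D"]
    ordered = [p for label, p in zip(labels, probs) if label == "O"]
    if not disordered or not ordered:
        raise ValueError("both classes must be present")
    wins = 0.0
    for d in disordered:
        for o in ordered:
            if d > o:
                wins += 1.0
            elif d == o:
                wins += 0.5
    return wins / (len(disordered) * len(ordered))


if __name__ == "__main__":
    # Made-up labels and probabilities for four residues.
    print(roc_auc(["D", "D", "O", "O"], [0.9, 0.4, 0.5, 0.1]))
```

Because the score depends only on how residues are ranked, it is unaffected by any monotonic re-scaling of the probabilities, which is one reason it is a convenient basis for comparing predictors with different output calibrations.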

The CASP8 disorder prediction methods can be classified into four main categories [33]: (1) Meta method. Predictors like MULTICOM, GS-MetaServer, Metaprdos, GeneSilico, GSMetaDisorder and Distill use this approach to predict disorder. (2) Clustering method. This approach is used, for instance, by DISOclust, which first obtains multiple 3D models from the nFOLD3 server and then makes disorder predictions by combining the results of the DISOclust method and the DISOPRED3 method. (3) Ab initio method. A large number of predictors in CASP8 adopt this approach; examples include 3Dpro, Mariner, Spritz, Biomine, CBRC_POODLE, DISOPRED, OnD-CRF and our predictor MULTICOM-CMFR. (4) Hybrid method. Fais-server is a hybrid method that combines ab initio predictions with homology-based template information. Both ab initio and hybrid methods usually exist as standalone packages, while meta methods rely on other predictors.

In examining the results, no one method appears to perform decisively better than the rest according to all the measures. Predictors from each of the three types of methods (ab initio, meta and clustering) are represented in the top seven when comparing the predictors only on the basis of ROC score, weighted score, specificity or sensitivity. The meta method MULTICOM, the clustering method DISOclust, the hybrid method fais-server and the ab initio methods MULTICOM-CMFR and 3Dpro are the top five in terms of ROC score. Other ab initio predictors such as mariner1 and Distill-Punch also performed well. Interestingly, our ab initio predictor MULTICOM-CMFR also ranks first in weighted score and in the product of positive and negative sensitivity. Being an ab initio method, it also has the benefit of being able to make predictions solely from an input sequence. The other types of methods need additional information such as output from other predictors (meta methods), tertiary structure models (clustering methods), or homologous structure templates (hybrid methods). Consequently, our PreDisorder server based on MULTICOM-CMFR is generally an accurate predictor that can be applied to the genome-scale annotation of protein disordered regions.

Notably, given the limits on the predictability of intrinsically disordered residues determined from crystallographic experiments [34], both of our methods performed well on the X-ray targets shown in Table 3. Several methods (e.g., MULTICOM, DISOclust, fais-server, MULTICOM-CMFR, 3Dpro, mariner1 and Distill-Punch) yield similarly good AUC scores (>= 0.846), suggesting that the accuracy of disorder predictions might be close to the limit [34]. All of the predictors do quite well with respect to negative specificity and negative sensitivity. This is not too surprising, as most of the residues in a protein are ordered and hence the number of true negatives (TN) is very close to the true negatives plus false positives (TN+FP) and to the true negatives plus false negatives (TN+FN).

Conclusion

This paper presents our disorder prediction web server, PreDisorder, and evaluates its performance against other disorder predictors. We benchmarked MULTICOM-CMFR, the method employed by PreDisorder, and our meta method MULTICOM, along with the other protein disorder predictors that participated in CASP8, on the 117 CASP8 targets. The results show that our method is among the best and provides reliable predictions of protein disordered regions. Therefore, our server (PreDisorder) is a useful tool for structural and functional genomics.

Availability and Requirements

Project name: PreDisorder
Project Home Page: http://casp.rnet.missouri.edu/predisorder.html
Operating system(s): Platform independent (web server)
Programming languages: Perl, C, C++
Other requirements: None
License: Web application is freely accessible for all users.
Any restrictions to use by non-academics: None

The use of PreDisorder is straightforward and takes place through a simple input form. The form requires only three inputs: email address, target name and protein sequence. PreDisorder makes predictions in a very short time and sends the results back to users via email. Disorder prediction results include the user-defined target name, the author, any predictor remarks and the disorder predictions. These predictions are in CASP format and occupy several lines. Each line contains the residue code, an order or disorder assignment code and a number specifying the associated probability of disorder. We also return the results in graphical form, as seen in Figure 2. In this graph, users can visualize changes in the likelihood of disorder from residue to residue over the submitted sequence. The red curve shows our predicted probability of disorder for each residue in the target sequence, and the green curve represents the experimentally determined disorder annotation for the target. In addition, the blue line y = 0.5 marks the threshold we chose for classifying a residue as disordered.
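Results returned by email can be post-processed with a short script. The Python sketch below assumes only what is described above, namely that each prediction line holds a one-letter residue code, an "O" or "D" assignment and a probability, separated by whitespace; every other line (target name, author, remarks and so on) is skipped. The function names and the sample text in the usage example are hypothetical.

```python
def parse_disorder_lines(text):
    """Extract (residue, state, probability) tuples from a result text.

    A line is treated as a prediction only if it has exactly three fields:
    a one-letter residue code, 'O' or 'D', and a number. Header and remark
    lines do not match this pattern and are ignored.
    """
    residues = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or len(fields[0]) != 1:
            continue
        if fields[1] not in ("O", "D"):
            continue
        try:
            probability = float(fields[2])
        except ValueError:
            continue
        residues.append((fields[0], fields[1], probability))
    return residues


def disordered_segments(residues):
    """Return 1-based (start, end) positions of consecutive 'D' residues."""
    segments = []
    start = None
    for position, (_, state, _) in enumerate(residues, start=1):
        if state == "D":
            if start is None:
                start = position
        elif start is not None:
            segments.append((start, position - 1))
            start = None
    if start is not None:
        segments.append((start, len(residues)))
    return segments


if __name__ == "__main__":
    # Illustrative result text; header lines and values are made up.
    sample = """TARGET example
M D 0.91
K D 0.73
L O 0.12
V O 0.08
G D 0.55
END
"""
    parsed = parse_disorder_lines(sample)
    print(disordered_segments(parsed))
```

Collapsing the per-residue assignments into segments is often more convenient than the raw lines when deciding, for example, which terminal regions to trim from an expression construct.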
Figure 2

Example output from PreDisorder showing probability of disorder for each residue in a sequence (CASP8 target T0470). The red curve represents predicted disorder probabilities. The green curve denotes real disorder annotations (1 - disorder; 0 - not disorder).

Authors' contributions

JC designed and implemented the disorder prediction methods and conducted the CASP8 experiments. XD evaluated the predictors. XD and JE wrote the first draft of the manuscript. XD, JE and JC set up the web server. All the authors edited the manuscript and approved it.

References (28 in total)

1.  The protein trinity--linking function and disorder.

Authors:  A K Dunker; Z Obradovic
Journal:  Nat Biotechnol       Date:  2001-09       Impact factor: 54.908

2.  Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles.

Authors:  Gianluca Pollastri; Dariusz Przybylski; Burkhard Rost; Pierre Baldi
Journal:  Proteins       Date:  2002-05-01

3.  Intrinsic disorder and protein function.

Authors:  A Keith Dunker; Celeste J Brown; J David Lawson; Lilia M Iakoucheva; Zoran Obradović
Journal:  Biochemistry       Date:  2002-05-28       Impact factor: 3.162

4.  Evaluation of disorder predictions in CASP5.

Authors:  Eugene Melamud; John Moult
Journal:  Proteins       Date:  2003

5.  3D-Jury: a simple approach to improve protein structure predictions.

Authors:  Krzysztof Ginalski; Arne Elofsson; Daniel Fischer; Leszek Rychlewski
Journal:  Bioinformatics       Date:  2003-05-22       Impact factor: 6.937

Review 6.  Intrinsically unstructured proteins.

Authors:  Peter Tompa
Journal:  Trends Biochem Sci       Date:  2002-10       Impact factor: 13.807

7.  The DISOPRED server for the prediction of protein disorder.

Authors:  Jonathan J Ward; Liam J McGuffin; Kevin Bryson; Bernard F Buxton; David T Jones
Journal:  Bioinformatics       Date:  2004-03-25       Impact factor: 6.937

8.  Assessment of disorder predictions in CASP8.

Authors:  Orly Noivirt-Brik; Jaime Prilusky; Joel L Sussman
Journal:  Proteins       Date:  2009

Review 9.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

10.  Prediction and functional analysis of native disorder in proteins from the three kingdoms of life.

Authors:  J J Ward; J S Sodhi; L J McGuffin; B F Buxton; D T Jones
Journal:  J Mol Biol       Date:  2004-03-26       Impact factor: 5.469
