OnD-CRF: predicting order and disorder in proteins using [corrected] conditional random fields.

Lixiao Wang, Uwe H Sauer

Abstract

MOTIVATION: Order and Disorder prediction using Conditional Random Fields (OnD-CRF) is a new method for accurately predicting the transition between structured and mobile or disordered regions in proteins. OnD-CRF applies CRFs to features that are generated from the amino acid sequence and from secondary structure prediction. Benchmarking results based on CASP7 targets, and evaluation with respect to several CASP criteria, rank the OnD-CRF model highest among the fully automatic server group.

AVAILABILITY: http://babel.ucmp.umu.se/ond-crf/

Year:  2008        PMID: 18430742      PMCID: PMC2387219          DOI: 10.1093/bioinformatics/btn132

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Many proteins carry out important biological functions by means of intrinsically unstructured sequence intervals (Dunker et al., 2002; Romero et al., 1999). Identification of disordered regions in protein sequences can help to reduce bias in sequence similarity analyses and to delineate boundaries of protein domains to guide structural and functional studies (Ferron et al., 2006). Several state-of-the-art approaches have been proposed for the prediction of ordered and disordered residues, such as neural networks (NNs) and support vector machines (SVMs). Like NNs and SVMs, conditional random fields (CRFs) (Lafferty et al., 2001) are discriminative supervised machine learning methods that must be trained on labeled empirical data in order to learn the classification. However, unlike NNs and SVMs, CRFs are able to take into account the dependencies between the labels of neighboring residues. Here, we describe how to apply CRFs to build an order and disorder prediction model. We then compare the OnD-CRF method to prediction methods that successfully participated in CASP7 (Bordoli et al., 2007).
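To illustrate how a linear-chain CRF couples the labels of neighboring residues, the following minimal sketch computes per-residue label probabilities with the forward-backward algorithm. It is a generic illustration, not the OnD-CRF model: the function name `marginals`, the two-label setup (0 = ordered, 1 = disordered) and all scores are assumptions made up for the example.

```python
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginals(emission, transition):
    """Per-position label probabilities of a linear-chain CRF.

    emission[t][y]   : log-score of label y at position t (hypothetical values)
    transition[a][b] : log-score of label a being followed by label b
    """
    n, k = len(emission), len(emission[0])
    # Forward pass: total log-score of all label paths ending in label b at t.
    fwd = [list(emission[0])]
    for t in range(1, n):
        fwd.append([
            emission[t][b]
            + _logsumexp([fwd[t - 1][a] + transition[a][b] for a in range(k)])
            for b in range(k)
        ])
    # Backward pass: total log-score of all label paths leaving label a at t.
    bwd = [[0.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [
            _logsumexp([transition[a][b] + emission[t + 1][b] + bwd[t + 1][b]
                        for b in range(k)])
            for a in range(k)
        ]
    log_z = _logsumexp(fwd[-1])
    return [[math.exp(fwd[t][y] + bwd[t][y] - log_z) for y in range(k)]
            for t in range(n)]

# Five residues; only the middle one has local evidence favoring disorder.
emission = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
independent = marginals(emission, [[0.0, 0.0], [0.0, 0.0]])
coupled = marginals(emission, [[2.0, 0.0], [0.0, 2.0]])
```

With all transition scores set to zero, each residue is classified independently, as a per-residue classifier would do. With transition scores that reward keeping the same label, the isolated disordered residue in the middle receives a lower disorder probability because its ordered neighbors pull it toward order.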

2 METHODS

The OnD-CRF training dataset is derived from high-resolution crystal structures. It contains 215 612 residues, of which 13 909 are defined as disordered (Cheng et al., 2005a) because they are part of a crystallized protein but lack a coordinate entry in the PDB file. The training dataset does not contain any of the CASP7 targets. Performance is assessed with respect to the area under the ROC curve (AUC), the average of sensitivity and specificity (ACC) and a weighted score, Sw, that considers the rates of ordered and disordered residues in the dataset (Jin and Dunbrack, 2005). The AUC is a measure of the overall predictor quality, with a value of 1.0 for a perfect predictor and 0.5 for a random predictor. The weighted score Sw and the ACC, introduced in CASP6 and CASP7, are used to evaluate the overall prediction accuracy on an imbalanced dataset. Throughout, we use CRF++ 0.49 (http://crfpp.sourceforge.net/) to generate the OnD-CRF model. The template file contains the rules for selecting the features that we use for training the OnD-CRF model. The features are extracted only from the amino acid sequence and, using SSpro (Cheng et al., 2005b), from the predicted secondary structure. After 10-fold cross-validation, we find that a sliding window size of nine amino acids yields the best template file. The parameters that give rise to an AUC value of 0.864 for the OnD-CRF model built on the training dataset are 1.018 for the hyper-parameter ‘C’, which controls the balance between overfitting and underfitting, and 5 for the parameter ‘f’, which sets the cut-off threshold for the features. For all other parameters we use the default CRF++ 0.49 values. As a result of 10-fold cross-validation, we find an optimal P-value cut-off of P < 0.05 for ordered and P ≥ 0.05 for disordered amino acids. Using this cut-off, the OnD-CRF model achieves an ACC of 0.790 and a weighted score Sw of 0.580 on the training dataset.
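As a rough illustration of what a CRF++ template for a nine-residue sliding window could look like, the sketch below generates unigram feature rules in the CRF++ `%x[row,col]` syntax and assembles a `crf_learn` call with the `-c` and `-f` options. The actual OnD-CRF template is not given in the text, so the column layout (column 0 = amino acid, column 1 = predicted secondary structure), the added bigram rule `B`, the file names and the helper names are all assumptions.

```python
def build_template(window=9, n_columns=2):
    """Return a CRF++ template with one unigram rule per window offset and column.

    Assumed training-file layout (hypothetical): column 0 holds the amino acid,
    column 1 the predicted secondary structure, and the last column the label.
    """
    half = window // 2
    rules = []
    for col in range(n_columns):
        for offset in range(-half, half + 1):
            # %x[offset,col] refers to column `col` of the residue at `offset`
            # relative to the current position.
            rules.append(f"U{len(rules):02d}:%x[{offset},{col}]")
    # A bare "B" adds label-bigram features, i.e. neighboring-label coupling.
    rules.append("B")
    return "\n".join(rules) + "\n"

def crf_learn_command(template, train, model, cost=1.018, freq=5):
    """Argument list for CRF++ training with the C and f values quoted above."""
    return ["crf_learn", "-c", str(cost), "-f", str(freq), template, train, model]

template_text = build_template()
# File names below are placeholders, not the authors' files.
command = crf_learn_command("template.txt", "train.data", "ond_crf.model")
```

The command list can be passed to `subprocess.run` once CRF++ is installed and the placeholder files exist; the sketch itself only builds the strings.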

3 BENCHMARKING RESULTS

For benchmarking, we use all 96 targets available during CASP7 and compare the results obtained with OnD-CRF to the results of the 15 methods that predicted 93 or more targets. Evaluation is done with respect to the AUC, the sensitivity, Ssens, the specificity, Sspec, their product, Sprod, the ACC and Sw. The sensitivity and specificity can be interpreted as the fractions of correctly identified disordered and ordered residues, respectively. The benchmarking results for all 16 disorder prediction methods, subdivided into the fully automated server group and the human expert group, are listed in Table 1. Within the automatic server group, the OnD-CRF method reaches the best overall performance, with the highest scores for AUC, Ssens, Sprod, ACC and Sw. The performance of the OnD-CRF method is comparable to that of the best human expert methods, such as ISTZORAN and fais. The results show that OnD-CRF is an accurate and effective method for the fully automated prediction of disorder in proteins.
Table 1.

Comparing OnD-CRF with prediction methods that participated in CASP7

Method                        AUC    Ssens  Sspec  Sprod  ACC    Sw
CASP7 Automatic Server Group
    OnD-CRF (a)               0.839  0.688  0.813  0.560  0.750  0.501
    DISpro (a)                0.822  0.597  0.854  0.510  0.726  0.451
    GeneSilicoMetaServer (d)  0.804  0.527  0.912  0.481  0.720  0.440
    BIME@NTU_serv (a)         0.798  0.591  0.839  0.496  0.715  0.430
    DISOPRED (a)              0.837  0.425  0.953  0.405  0.689  0.378
    Distill (a)               0.724  0.558  0.788  0.440  0.673  0.346
    MBI-NTU-serv (a)          0.796  0.327  0.971  0.318  0.649  0.298
    DRIPPRED (b)              0.758  0.383  0.908  0.348  0.646  0.291
CASP7 Human Expert Group
    ISTZORAN (b)              0.860  0.725  0.837  0.607  0.781  0.562
    fais (a)                  0.844  0.556  0.924  0.514  0.740  0.481
    CBRC-DR (a)               0.850  0.454  0.966  0.439  0.710  0.420
    BIME@NTU (c)              0.804  0.536  0.883  0.473  0.710  0.419
    IUPred (b)                0.777  0.396  0.947  0.375  0.672  0.343
    CBRC-DP_DR (a)            0.704  0.338  0.971  0.328  0.655  0.309
    Oka (b)                   0.609  0.280  0.937  0.262  0.609  0.218
    Softberry (a)             0.704  0.201  0.971  0.195  0.586  0.172

The entries are sorted with respect to the weighted score Sw.

Number of predicted targets: (a) 96; (b) 95; (c) 94; (d) 93. AUC: area under the ROC curve (Bordoli et al., 2007); Ssens = TP/(TP + FN); Sspec = TN/(TN + FP); Sprod = Ssens × Sspec; ACC = (Ssens + Sspec)/2; $S_w = (W_{disorder} N_{TP} - W_{order} N_{FP} + W_{order} N_{TN} - W_{disorder} N_{FN}) / (W_{disorder} N_{disorder} + W_{order} N_{order})$.
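The scores defined in the table footnote can be computed from confusion counts as in the following minimal sketch. The function name `disorder_metrics` and the example counts are hypothetical. The default weights (W_disorder equal to the fraction of ordered residues, W_order equal to the fraction of disordered residues) are an assumption; the text only states that Sw considers the rates of ordered and disordered residues. Under this assumption Sw reduces to Ssens + Sspec − 1, which agrees with the rows of Table 1 (for OnD-CRF, 0.688 + 0.813 − 1 = 0.501).

```python
def disorder_metrics(tp, fn, tn, fp, w_disorder=None, w_order=None):
    """Ssens, Sspec, Sprod, ACC and Sw from per-residue confusion counts.

    tp/fn count disordered residues predicted correctly/incorrectly,
    tn/fp count ordered residues predicted correctly/incorrectly.
    """
    n_disorder = tp + fn
    n_order = tn + fp
    total = n_disorder + n_order
    # Assumed default: weight each class by the frequency of the other class,
    # so that the rare disordered class is not swamped by ordered residues.
    if w_disorder is None:
        w_disorder = n_order / total
    if w_order is None:
        w_order = n_disorder / total
    s_sens = tp / n_disorder
    s_spec = tn / n_order
    s_w = (
        w_disorder * tp - w_order * fp + w_order * tn - w_disorder * fn
    ) / (w_disorder * n_disorder + w_order * n_order)
    return {
        "Ssens": s_sens,
        "Sspec": s_spec,
        "Sprod": s_sens * s_spec,
        "ACC": (s_sens + s_spec) / 2,
        "Sw": s_w,
    }

# Made-up counts chosen to give Ssens = 0.688 and Sspec = 0.813.
example = disorder_metrics(tp=688, fn=312, tn=8130, fp=1870)
```

Passing explicit `w_disorder` and `w_order` values overrides the assumed defaults, for example to reuse fixed weights across several predictors.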


4 OND-CRF PREDICTION EXAMPLE

We demonstrate the power of the OnD-CRF method on a particular example. The structure of the human cancer-related signaling adaptor protein CRK was recently determined by NMR. The protein harbors one SH2 and two SH3 domains, SH2-nSH3-cSH3 (pdb codes 2EYZ, 2EYV, 2EYW, 2EYX) (Kobashigawa et al., 2007). We compare the OnD-CRF prediction, blue curve, with the experimentally determined domains of CRK (Fig. 1) and superimpose their boundaries in the form of colored boxes onto the OnD-CRF curve. The OnD-CRF prediction of the ordered and disordered regions of CRK is in close agreement with the solution NMR structure of this molecule. This is remarkable, since the training dataset includes only high-resolution crystal structures. As shown in Figure 1, the SH2 domain (residues 10–120), the nSH3 domain (134–191) and the cSH3 domain (238–293) are located in the regions of the OnD-CRF plot with highest probability for ordered residues. Interestingly, the amino-acid interval with a high probability for disorder, located roughly in the middle of the CRK-SH2 domain, corresponds precisely to the highly dynamic loop (residues 65–85) connecting the βD and βE strands. The DE-loop can change its conformation to provide a crucial inter-domain contact surface when binding to the Abl SH3 domain (Donaldson et al., 2002). Besides the prediction of disordered sequence intervals, we suggest that the accuracy of the OnD-CRF predictions can be used to determine domain boundaries for 3D structure analysis.
Fig. 1.

OnD-CRF prediction analysis for CRK. The blue curve represents the predicted disorder probability at each amino acid position. The horizontal red line at a probability of 0.05 represents the boundary between order and disorder. The NMR structures of the three CRK domains are shown above the graph. Their boundaries are marked as magenta, green and blue bars, respectively, and overlap with the mostly ordered regions of the OnD-CRF prediction. Note the accurately predicted flexible ‘DE loop’ in the SH2 domain between residues 65–85 (dashed line).
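A per-residue probability curve such as the one in Fig. 1 can be turned into ordered and disordered sequence intervals, and hence candidate domain boundaries, by applying the 0.05 cut-off described in the Methods. The sketch below is one simple way to do this; the function name `label_segments` and the example probabilities are made up for illustration and are not CRK predictions.

```python
from itertools import groupby

def label_segments(probabilities, cutoff=0.05):
    """Group residues into runs of equal label.

    A residue is called disordered if its disorder probability is >= cutoff
    and ordered otherwise. Returns (label, start, end) tuples with 1-based,
    inclusive residue numbers.
    """
    labels = ["disordered" if p >= cutoff else "ordered" for p in probabilities]
    segments = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments

# Hypothetical disorder probabilities for an eight-residue toy sequence.
toy_probabilities = [0.30, 0.20, 0.01, 0.02, 0.04, 0.05, 0.60, 0.01]
toy_segments = label_segments(toy_probabilities)
```

In practice very short runs would probably need to be smoothed or merged before interpreting segment ends as domain boundaries; that step is left out here.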

REFERENCES

1.  Folding minimal sequences: the lower bound for sequence complexity of globular proteins.

Authors:  P Romero; Z Obradovic; A K Dunker
Journal:  FEBS Lett       Date:  1999-12-03       Impact factor: 4.124

2.  Intrinsic disorder and protein function.

Authors:  A Keith Dunker; Celeste J Brown; J David Lawson; Lilia M Iakoucheva; Zoran Obradović
Journal:  Biochemistry       Date:  2002-05-28       Impact factor: 3.162

3.  Assessment of disorder predictions in CASP6.

Authors:  Yumi Jin; Roland L Dunbrack
Journal:  Proteins       Date:  2005

4.  A practical overview of protein disorder prediction methods.

Authors:  François Ferron; Sonia Longhi; Bruno Canard; David Karlin
Journal:  Proteins       Date:  2006-10-01

5.  Assessment of disorder predictions in CASP7.

Authors:  Lorenza Bordoli; Florian Kiefer; Torsten Schwede
Journal:  Proteins       Date:  2007

6.  Structure of a regulatory complex involving the Abl SH3 domain, the Crk SH2 domain, and a Crk-derived phosphopeptide.

Authors:  Logan W Donaldson; Gerald Gish; Tony Pawson; Lewis E Kay; Julie D Forman-Kay
Journal:  Proc Natl Acad Sci U S A       Date:  2002-10-16       Impact factor: 11.205

7.  Structural basis for the transforming activity of human cancer-related signaling adaptor protein CRK.

Authors:  Yoshihiro Kobashigawa; Mieko Sakai; Masato Naito; Masashi Yokochi; Hiroyuki Kumeta; Yoshinori Makino; Kenji Ogura; Shinya Tanaka; Fuyuhiko Inagaki
Journal:  Nat Struct Mol Biol       Date:  2007-05-21       Impact factor: 15.369

8.  SCRATCH: a protein structure and structural feature prediction server.

Authors:  J Cheng; A Z Randall; M J Sweredoski; P Baldi
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971
