NNcon: improved protein contact map prediction using 2D-recursive neural networks.

Allison N Tegge, Zheng Wang, Jesse Eickholt, Jianlin Cheng.

Abstract

Protein contact map prediction is useful for protein folding rate prediction, model selection and 3D structure prediction. Here we describe NNcon, a fast and reliable contact map prediction server and software. NNcon was ranked among the most accurate residue contact predictors in the Eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. Both NNcon server and software are available at http://casp.rnet.missouri.edu/nncon.html.

Year: 2009 PMID: 19420062 PMCID: PMC2703959 DOI: 10.1093/nar/gkp305

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Predicting residue contacts is an important problem in protein structure prediction. Contact maps, a matrix representation of protein residue–residue contacts within a distance threshold, provide an avenue for predicting protein 3D structure (1,2). There have been several algorithms developed to reconstruct protein 3D structure from an accurate contact map using distance-based algorithms developed for protein structure prediction and nuclear magnetic resonance (NMR) structure determination (3–7). Even though contact prediction is presumably as hard as ab initio 3D structure prediction, it can be readily formulated as a classification problem, which can be tackled by knowledge-based reasoning methods, such as correlated mutation (8–14) and machine learning (15–27). As more and more evidence shows that sequence-based contact predictions can be used to infer protein folding rates, evaluate protein models (28), and improve 3D structure prediction (29), contact map prediction is becoming increasingly important and useful. To date, however, only a few contact prediction servers [e.g. SCRATCH, Distill, SVMcon, SAM, RECON (30–34)] and a software package (SVMcon) are publicly available. To fill the gap, we describe a fast, state-of-the-art neural network-based contact map predictor NNcon that was ranked among the best methods in the Eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008 (35).

HYBRID CONTACT PREDICTION METHODS

We used 2D-Recursive Neural Network (2D-RNN) models to predict both general residue-residue contacts and specific beta contacts (i.e. beta-residue pairs in beta sheets).

General contact prediction

2D-RNN is a 2D machine learning method designed to map 2D input information into 2D output targets (36). The basic architecture of 2D-RNN contact predictions is illustrated in Figure 1.

Figure 1.

The 2D-RNN architecture for contact prediction. For a protein sequence of length n, the input to a 2D-RNN is an n × n input matrix and the output is an n × n probability matrix of residue contacts.

The 2D-RNNs in NNcon were trained on a large data set of 482 proteins and validated on a data set of 48 proteins. The real contacts were calculated as those residue pairs with C-α atoms within a set distance threshold. Ten 2D-RNN models were trained to create an ensemble of models that predict contacts. We trained two sets of 2D-RNNs to predict contacts at 8 Å and 12 Å thresholds, respectively.
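As an illustration of the contact definition above, the following minimal Python sketch derives a binary contact map from C-α coordinates at a chosen threshold. The function name `contact_map`, the use of `<=` for "within", and the toy coordinates are assumptions made for illustration; this is not the NNcon code.

```python
import math

def contact_map(ca_coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 when the C-alpha atoms
    of residues i and j are at most `threshold` angstroms apart.
    The diagonal is left at 0 (a residue is not counted as its own contact)."""
    n = len(ca_coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: "within the threshold" is read as distance <= threshold.
            if math.dist(ca_coords[i], ca_coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Hypothetical toy coordinates: four points on a line, 3.8 angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap8 = contact_map(coords, 8.0)    # targets for the 8 angstrom models
cmap12 = contact_map(coords, 12.0)  # targets for the 12 angstrom models
```

In this toy case the first and last points are 11.4 Å apart, so they are a contact at the 12 Å threshold but not at 8 Å.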

Beta-contact prediction

The general residue–residue contacts are defined based on standard distance thresholds of 8 and 12 Å. To take advantage of physicochemical constraints (i.e. hydrogen bonds) in beta sheets, we use 2D-RNNs to directly predict beta-residue pairings within beta sheets (37). NNcon treats the prediction of inter-strand residue pairings as an additional binary classification problem and refines these regions locally. The 2D-RNNs were trained and validated using 10-fold cross-validation on a large data set of 916 chains and 2533 beta sheets (37). The ensemble of these 10 models is used to make predictions.

Combination of general and specific contact maps

Since the specific beta-contact predictor models predict beta contacts more accurately than the general contact map predictor models, we combined the predictions from these two methods for proteins containing beta sheets. If the probability of a beta-residue pairing from the general contact model is lower than that predicted by the beta-specific contact predictor, the general prediction is replaced by the beta-specific prediction value. The revised general contact map is then used as the final contact map.
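The replacement rule described above can be sketched as follows. This is a minimal illustration, not the NNcon implementation: the function name `combine_maps` and the `is_beta_residue` flags (here assumed to mark residues that belong to beta strands) are hypothetical, and the sketch assumes both maps are n × n probability matrices.

```python
def combine_maps(general, beta, is_beta_residue):
    """Merge a general contact probability map with a beta-specific one.
    For pairs where both residues are flagged as beta residues, the general
    probability is replaced by the beta-specific value whenever it is lower."""
    n = len(general)
    combined = [row[:] for row in general]  # leave the input map untouched
    for i in range(n):
        for j in range(n):
            # Assumption: a "beta-residue pairing" means both residues are in strands.
            if i != j and is_beta_residue[i] and is_beta_residue[j]:
                if general[i][j] < beta[i][j]:
                    combined[i][j] = beta[i][j]
    return combined

# Hypothetical 3-residue example; only residues 0 and 1 are beta residues.
general = [[0.0, 0.2, 0.6], [0.2, 0.0, 0.1], [0.6, 0.1, 0.0]]
beta = [[0.0, 0.5, 0.3], [0.5, 0.0, 0.9], [0.3, 0.9, 0.0]]
combined = combine_maps(general, beta, [True, True, False])
```

Only the pair (0, 1) is updated here, from 0.2 to 0.5; pairs involving the non-beta residue keep their general probabilities.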

IMPLEMENTATION OF WEB SERVER

Both the NNcon web server and the executable are freely available to all users at http://casp.rnet.missouri.edu/nncon.html, with no login requirement. The input for the web server consists of an e-mail address to which the results will be sent, a target name, and an unformatted protein sequence. The e-mail output includes a main message and several attachments. The main message contains the selected residue–residue contacts, in CASP format, at an 8 Å threshold, with sequence separation ≥6 and a predicted probability ≥0.1, together with the average contact order and the average contact number derived from the predicted contact probability matrix at 8 Å. The attachments include a contact map image file, the full contact probability matrix at 8 Å, and the full contact probability matrix at 12 Å. Users can select contacts from these probability matrices according to any probability threshold.

The server can accept multiple submissions concurrently through a task queue. NNcon predictions are much faster than those of support vector machine contact map predictors, such as SVMcon, which contain hundreds of thousands of support vectors. NNcon can make a prediction for a protein of average size (250 residues) in just a few minutes. The server can also make predictions for large proteins of up to 1000 residues in under an hour.

A Linux version of the contact prediction software is also available for download at the web site, and the readme file contains the necessary installation instructions. This version of NNcon requires two input parameters at the command prompt: the name of a FASTA file and an output directory. The prediction results in the output directory include name.cm8a and name.cm12a, which are the predicted contact probability matrices for 8 and 12 Å, respectively.
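The contact selection applied to the e-mail output (sequence separation ≥6, probability ≥0.1) can be reproduced on a full probability matrix with a sketch like the one below. The function names are hypothetical, the matrix is assumed to be already loaded into memory (the on-disk layout of the .cm8a and .cm12a files is not described here), and the output lines are only loosely modeled on the CASP residue-residue layout of `i j d_min d_max probability`, which is itself an assumption.

```python
def select_contacts(prob, min_sep=6, min_prob=0.1):
    """Pick residue pairs (1-based) with sequence separation >= min_sep and
    predicted probability >= min_prob, sorted by decreasing probability."""
    n = len(prob)
    picked = []
    for i in range(n):
        for j in range(i + min_sep, n):  # upper triangle, separation filter
            if prob[i][j] >= min_prob:
                picked.append((i + 1, j + 1, prob[i][j]))
    picked.sort(key=lambda c: c[2], reverse=True)
    return picked

def format_rr(contacts, upper=8):
    """Format contacts as 'i j 0 upper p' lines (assumed CASP-like layout)."""
    return ["%d %d 0 %d %.2f" % (i, j, upper, p) for i, j, p in contacts]

# Hypothetical 8-residue probability matrix.
n = 8
prob = [[0.0] * n for _ in range(n)]
for i, j, p in [(0, 6, 0.45), (0, 7, 0.05), (1, 7, 0.30), (0, 3, 0.90)]:
    prob[i][j] = prob[j][i] = p
lines = format_rr(select_contacts(prob))
```

Raising `min_prob` gives a stricter selection, which corresponds to choosing contacts from the full matrices at a different probability threshold.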

EVALUATION OF WEB SERVER

NNcon was blindly tested on the CASP8 data set. We first evaluated NNcon against SVMcon, one of the top-ranked contact map predictors in CASP7, on 116 CASP8 protein targets (Table 1). Both NNcon and SVMcon use pure ab initio methods to predict contacts within a protein. Next, we compared NNcon with all the CASP8 contact predictors on the 11 ab initio CASP8 domains, as shown in Table 2. All the contact predictions for these predictors and the 3D structures of the protein targets were downloaded from http://predictioncenter.org/casp8/.

Table 1.

Results of NNcon and SVMcon on 116 CASP8 targets

Method	Acc6	Cov6	Acc12	Cov12	Acc24	Cov24
NNcon (L/5)	0.58	0.07	0.51	0.06	0.31	0.05
SVMcon (L/5)	0.5	0.06	0.42	0.06	0.27	0.05

Acc6, Acc12, Acc24 denote prediction accuracy (specificity) at sequence separation ≥6, 12, 24 residues, respectively. Cov6, Cov12, Cov24 denote prediction coverage (sensitivity) at sequence separation ≥6, 12, 24 residues, respectively.

Table 2.

Multiple contact map predictors evaluated on 11 CASP8 ab initio domains

Method	Acc6	Cov6	Acc12	Cov12	Acc24	Cov24
NNcon	0.68	0.11	0.51	0.09	0.18	0.05
SVMcon	0.68	0.09	0.39	0.09	0.18	0.05
SAM08_2stage	0.28	0.05	0.26	0.06	0.17	0.05
SAM06	0.26	0.04	0.24	0.05	0.16	0.06
Fang	0.44	0.07	0.31	0.06	0.16	0.05
MUprot	0.59	0.09	0.37	0.08	0.15	0.05
Distill	0.32	0.05	0.16	0.03	0.14	0.05
3Dpro	0.05	0.01	0.33	0.07	0.14	0.05
SAM08_server	0.24	0.04	0.21	0.05	0.13	0.05
SVMSEQ	0.56	0.09	0.34	0.07	0.13	0.05
Hamilton	0.08	0.01	0.12	0.02	0.12	0.02
Spine	0.09	0.01	0.09	0.02	0.07	0.02
Lee	0.1	0.01	0.09	0.02	0.07	0.02
Pairings	0.36	0.05	0.35	0.06	0.05	0.01

For each domain, select top L/5 predicted contacts ranked by contact probabilities. Acc6, Acc12, Acc24 denote prediction accuracy (specificity) at sequence separation ≥6, 12, 24 residues, respectively. Cov6, Cov12, Cov24 denote prediction coverage (sensitivity) at sequence separation ≥6, 12, 24 residues, respectively.

Comparison of NNcon and SVMcon

Both NNcon and SVMcon were evaluated on 116 CASP8 targets. For each target, the top L/5 predicted contacts were selected, where L is the length of the protein in residues. We then calculated prediction coverage (sensitivity [TP/(TP + FN)]) and accuracy (specificity [TP/(TP + FP)]) for sequence separations of at least 6, 12 and 24 residues, where TP, FP, TN and FN are true positive, false positive, true negative and false negative predictions, respectively. NNcon had higher accuracy than SVMcon at all sequence separations and equal or higher coverage (Table 1). The sensitivities are, overall, lower than the specificities in all predictions because only a small number of predicted contacts (L/5) are selected.
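The accuracy and coverage calculation for the top L/5 predictions can be sketched as below. This is a minimal illustration under stated assumptions: the function name `evaluate_top` is hypothetical, L/5 is rounded down with a minimum of one contact (the rounding rule is not specified in the text), and ties in probability are broken by residue order.

```python
def evaluate_top(prob, true_map, min_sep, fraction=5):
    """Accuracy TP/(TP+FP) and coverage TP/(TP+FN) of the top L/fraction
    predicted contacts with sequence separation >= min_sep."""
    n = len(prob)
    pairs = [(prob[i][j], i, j) for i in range(n) for j in range(i + min_sep, n)]
    pairs.sort(key=lambda t: t[0], reverse=True)  # highest probability first
    top = pairs[: max(1, n // fraction)]          # assumed rounding of L/5
    tp = sum(true_map[i][j] for _, i, j in top)
    fp = len(top) - tp
    total_true = sum(true_map[i][j] for i in range(n) for j in range(i + min_sep, n))
    fn = total_true - tp
    accuracy = tp / (tp + fp) if top else 0.0
    coverage = tp / (tp + fn) if total_true else 0.0
    return accuracy, coverage

# Hypothetical 10-residue example (only the upper triangle is used).
n = 10
prob = [[0.0] * n for _ in range(n)]
true_map = [[0] * n for _ in range(n)]
for i, j, p in [(0, 6, 0.9), (1, 8, 0.8), (2, 9, 0.7)]:
    prob[i][j] = p
for i, j in [(0, 6), (2, 9), (3, 9), (0, 9)]:
    true_map[i][j] = 1
acc6, cov6 = evaluate_top(prob, true_map, min_sep=6)
```

With L = 10 the top two predictions are kept; one of them is a true contact out of four true contacts, so the sketch yields an accuracy of 0.5 and a coverage of 0.25. The example also shows why coverage stays low when only L/5 contacts are selected.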

Comparison with other predictors on CASP8 ab initio domains

NNcon and the other CASP8 predictors were evaluated on 11 CASP8 ab initio domains and then compared. The top L/5 predicted contacts were again used in the calculations. As Table 2 shows, NNcon performed favorably when compared with other predictors, especially at sequence separations ≥12 residues.

A good CASP8 contact prediction example

Figure 2 shows the predictions from the NNcon server for the target T0507. NNcon correctly identified key contacts in the beta sheets which can be very useful for predicting the final structure of the protein.

Figure 2.

(a) The 3D structure of CASP8 target T0507. The protein has five strands (S1–S5) that form a parallel beta sheet. (b) The true contact map (upper triangle, blue) and predicted contact map (lower triangle, red). Each dot denotes a contact. Some key contacts in four strand pairs (S1–S2, S2–S3, S1–S4, S4–S5) are correctly predicted. (c) Selective visualization of four correctly predicted residue–residue contacts (2–30, 3–31, 4–90, 7–34).

INFERENCE OF CONTACT ORDER AND CONTACT NUMBER

For each of the 48 proteins in the test data set, the average contact number and the average contact order of all the residues were calculated, and then correlated with the actual values. The actual (resp. predicted) contact number for each residue at 8 Å threshold was calculated as the total number of actual (resp. predicted) contacts with sequence separation greater than five residues. The actual (resp. predicted) contact order for each residue is the sum of sequence separations of actual (resp. predicted) contacts with sequence separation greater than five residues, and then normalized by the protein sequence length. The Pearson correlations between the average actual and predicted contact number (0.85) and contact order (0.65) were strong, indicating that NNcon can successfully infer the actual average contact number and contact order of each protein from the predicted contact map. In the web server, the average contact number and order for the entire query protein are reported.
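The per-residue definitions above translate into a short sketch. The function names are hypothetical, the input is assumed to be a symmetric binary contact map (how predicted probabilities are turned into predicted contacts is not specified here), and "separation greater than five" is implemented as `>= 6`.

```python
import math

def residue_stats(cmap, min_sep=6):
    """Per-residue contact number and contact order from a symmetric 0/1 map.
    Contact order is the sum of sequence separations divided by the length."""
    n = len(cmap)
    numbers, orders = [], []
    for i in range(n):
        seps = [abs(i - j) for j in range(n) if abs(i - j) >= min_sep and cmap[i][j]]
        numbers.append(len(seps))
        orders.append(sum(seps) / n)
    return numbers, orders

def protein_averages(cmap, min_sep=6):
    """Average contact number and average contact order over all residues."""
    numbers, orders = residue_stats(cmap, min_sep)
    n = len(cmap)
    return sum(numbers) / n, sum(orders) / n

def pearson(xs, ys):
    """Pearson correlation, e.g. between actual and predicted per-protein averages."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 10-residue map with two contacts: (0, 6) and (1, 9).
n = 10
cmap = [[0] * n for _ in range(n)]
for i, j in [(0, 6), (1, 9)]:
    cmap[i][j] = cmap[j][i] = 1
avg_number, avg_order = protein_averages(cmap)
```

Computing `protein_averages` once on the actual map and once on the predicted map for each protein gives the two value lists that `pearson` would compare.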

CONCLUSION

We have described NNcon, a fast and reliable web server and software package for protein contact map prediction. NNcon was ranked among the most accurate methods in the CASP8 experiment in 2008. The contact map predicted by NNcon can be used to estimate the contact order and contact number of a protein. On average, a contact map prediction can be made within a few minutes on a single-processor PC, making the method a valuable tool for large-scale contact map prediction.

FUNDING

MU Bioinformatics consortium, a MU research board grant and a MU research council grant to J.C. and a NLM fellowship to A.N.T. Funding for open access charge: MU faculty startup grant. Conflict of interest statement. None declared.

33 in total

1. Improved prediction of the number of residue contacts in proteins by recurrent neural networks.

Authors: G Pollastri; P Baldi; P Fariselli; R Casadio
Journal: Bioinformatics Date: 2001 Impact factor: 6.937

2. Predicting interresidue contacts using templates and pathways.

Authors: Yu Shao; Christopher Bystroff
Journal: Proteins Date: 2003

3. Automated structure prediction of weakly homologous proteins on a genomic scale.

Authors: Yang Zhang; Jeffrey Skolnick
Journal: Proc Natl Acad Sci U S A Date: 2004-05-04 Impact factor: 11.205

4. Protein contact prediction using patterns of correlation.

Authors: Nicholas Hamilton; Kevin Burrage; Mark A Ragan; Thomas Huber
Journal: Proteins Date: 2004-09-01

5. Striped sheets and protein contact prediction.

Authors: Robert M MacCallum
Journal: Bioinformatics Date: 2004-08-04 Impact factor: 6.937

6. Global fold determination from a small number of distance restraints.

Authors: A Aszódi; M J Gradwell; W R Taylor
Journal: J Mol Biol Date: 1995-08-11 Impact factor: 5.469

7. Improving contact predictions by the combination of correlated mutations and other sources of sequence information.

Authors: O Olmea; A Valencia
Journal: Fold Des Date: 1997

8. MONSSTER: a method for folding globular proteins with a small number of distance restraints.

Authors: J Skolnick; A Kolinski; A R Ortiz
Journal: J Mol Biol Date: 1997-01-17 Impact factor: 5.469

9. Correlated mutations and residue contacts in proteins.

Authors: U Göbel; C Sander; R Schneider; A Valencia
Journal: Proteins Date: 1994-04

10. Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations?

Authors: I N Shindyalov; N A Kolchanov; C Sander
Journal: Protein Eng Date: 1994-03
