Literature DB >> 26963911

Comparison of Algorithms for Prediction of Protein Structural Features from Evolutionary Data.

Robert P Bywater1.   

Abstract

Proteins have many functions and predicting these is still one of the major challenges in theoretical biophysics and bioinformatics. Foremost amongst these functions is the need to fold correctly thereby allowing the other genetically dictated tasks that the protein has to carry out to proceed efficiently. In this work, some earlier algorithms for predicting protein domain folds are revisited and they are compared with more recently developed methods. In dealing with intractable problems such as fold prediction, when different algorithms show convergence onto the same result there is every reason to take all algorithms into account such that a consensus result can be arrived at. In this work it is shown that the application of different algorithms in protein structure prediction leads to results that do not converge as such but rather they collude in a striking and useful way that has never been considered before.

Entities:  

Mesh:

Substances:

Year:  2016        PMID: 26963911      PMCID: PMC4786192          DOI: 10.1371/journal.pone.0150769

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Although protein structure determination by biophysical techniques such as X-ray crystallography, cryoelectron microscopy and NMR has become highly automated, there will, for several reasons, continue to be interest in pursuing theoretical predictions of protein structure. Despite the high productivity of the mentioned experimental methods, the rate at which genomics and proteomics data are generated still outstrips the rate at which structures can be determined experimentally. Performing mutant studies for planning protein engineering experiments or screening for proteomic therapeutics (for example: immunotherapeutics) are most rapidly done in silico. Further, it is not simply the case that any given gene has a (singular) function. The protein prescribed by its gene sequence has many functions [1,2]. This implies, in turn, that these are also encoded in the gene. Somewhere, but where and how? As earlier shown [1,2] this is done in a disjoint fashion. So the problem becomes an issue of how to partition the protein sequence information and map these subsets of the entire gene sequence onto this set of functions (the inherent assumption in all this, which dealt with in more detail elsewhere, is that the mapping of sequence loci into function space is both surjective and injective). While many of these issues have been addressed in recent [1] and earlier [2] publications the focus here is on folding and contact prediction and not on any of the other functions that any given protein most certainly has. Theoretical/computational protein folding studies have undergone a steady series of developments in recent years. These have included significant accomplishments in protein dynamics [3-5], methods based on a collage of overlapping peptide fragments [6], and a variety bioinformatics approaches [7-12]. The latter have usually involved finding patterns of coevolution within multiple sequence alignments. The so-called correlated mutation analysis (CMA) approach identifies residue positions that show a common pattern of conservation and are deemed to signify the maintenance of some key structural feature, a “contact”. Typically, what one had in mind in these studies was protein folding, the need for the protein to fold into domains with a compact (predominantly “hydrophobic”) core. A similar argument was used to propose that protein-protein interactions could likewise be predicted [13,14]. As an extension of the CMA idea, studies of patterns of sequence variability (VAR) and Shannon entropy (ENT) [15,16] allowed a distinction to be made between sites in the protein core, or surrounding ligand-binding sites, for example. The first steps towards unravelling the multifunctional nature of proteins [1,2] had been taken. This was recently supplemented by an alternative approach based on Kolmogorov complexity (KOL) [1] which represents a new way to partition protein sequence information into its constituent functionalities. In this paper, the focus will be restricted to protein folding, or more specifically, the folding of individual domains. The extent to which KOL, and its antecedents [15,16] VAR/ENT (here considered jointly and called VRN), can be used to predict these domain structures will be considered, as well as alternative methods. Foremost among these are methods which have been based on frequency of contacts between amino acid residue sidechains [17,18]. In the present work distinction is made between an earlier method [17] in which a PDB-derived likelihood matrix was used to predict intradomain contacts (referred to herein as the SVB method) and a later development based on pair-to-pair contacts [18] (P2P). The P2PConPred program [18] calculates correlations between sites based on a predefined P2P matrix which in turn is based on the Blocks database [19]. The P2P website states: “The P2P is currently designed to reflect probabilities of pair to pair substitutions at two positions with physical contact. The ultimate goal is to detect residue-residue contact solely based on the evolutionary information stored in multiple sequence alignment.”. The present paper includes results from the use of the P2P program but proceeds towards the same ultimate goal in ways that P2P probably did not envisage.

Methods

Before the VRN and KOL measurements can be made it is important to decide the range of values that give results that are relevant to the type of contact being studied. This question has been studied earlier for VRN [15,16] and KOL [1]. For VRN, the values obtained in the original work [15,16] and used elsewhere [1] were used. In the case of KOL the results of making these investigations were not published before so this is done here. Reference is made to Fig 1 which shows how the MCC values for KOL calculated at two different distance cutoffs, 6Å and 10Å (these turn out to be good choices as the later results show, but other values could have been chosen) vary as a function of the range of KOL values is varied, in the range 0.2 to 0.5, the width of each slice of that range is 0.05. As is shown in Fig 1, the optimal MCC value is the same for 6Å and 10Å (and intervening values, data not shown).
Fig 1

Selection of cutoff ranges for KOL.

The abscissa of each data point indicates the center of the range, all ranges have a width of ± 0.05 on the KOL scale. The measurements were made at cutoff distances 6Å (KOL06) and 10Å (KOL10). The total number of hits is shown in red.

Selection of cutoff ranges for KOL.

The abscissa of each data point indicates the center of the range, all ranges have a width of ± 0.05 on the KOL scale. The measurements were made at cutoff distances 6Å (KOL06) and 10Å (KOL10). The total number of hits is shown in red. The data to produce Fig 2 and S1 Fig and the Tables 1 and 2 come from the following sources:
Fig 2

Contact distance plots for the nonredundant set of 10 proteins for CMA, KOL, VRN, P2P and SVB (text colours here correspond to the colours in the figures).

The identities of the proteins for each plot are listed in Table 2 - column 1: PDB I.d., column 2: working name for the protein, column3: figure number.

Table 1

Full results for all methods for the protein 1a4v (item 1 in Table 2).

AlgorithmCutoff ÅTPFPFNTNMCCACCYPRECSENSYMCC corr
SVB47549386504-.00310.8660.1150.0070.0032
62215683764880.00450.8620.1240.0260.0248
83427971364770.00630.8590.1090.0460.0441
106249449864490.03970.8510.1120.1110.1073
1210779619764030.14630.8390.1180.3520.2589
14146112313061040.18760.7940.1150.5290.3583
16198151452152700.03660.6760.1160.2750.2831
1825618708774500-.05370.5660.1200.2260.2715
20310225312603680-.15640.4490.1210.1970.2673
P2P47612374120.14990.9870.1030.2330.1050
671627872560.04320.9660.0410.0820.0635
8928620270060.00290.9330.0310.0430.0407
10135014176572-.03740.8740.0250.0300.0302
12238027185960-.08350.7910.0280.0310.0301
1432113010465295-.14170.7010.0280.0300.0290
1638152014364509-.22170.5960.0240.0260.0248
1843187617923792-.30300.5000.0220.0230.0222
2047226021763020-.40260.3960.0200.0210.0211
KOL4134733171120.07330.9460.2170.0380.0796
63514822970910.13390.9400.1910.1330.1542
85827210670670.22580.9340.1760.3540.2636
1010248710968050.25610.8930.1730.4830.3237
1215378841061520.12590.8000.1630.2720.2385
14200111673854490.03760.7000.1520.2130.2083
16266150611284603-.05100.5780.1500.1910.1955
18304186214843853-.14650.4730.1400.1700.1787
20340224618683049-.25910.3610.1310.1540.1646
VRN4106137470580.03980.9390.1410.0260.0461
63516227270340.11340.9330.1780.1140.1337
86628614970020.21120.9240.1880.3070.2490
101205016668160.32540.8920.1930.6450.4200
1218780236761470.17170.7940.1890.3380.2843
14245113069554330.07570.6910.1780.2610.2464
16306152010854592-.02600.5710.1680.2200.2205
18350187614413836-.12410.4650.1570.1950.2011
20387226018253031-.24060.3520.1460.1750.1921
CMA4061887354-.00990.9800.0000.000-0.0036
631621373250.05220.9760.0180.1880.0725
852861377075-.00260.9420.0170.0350.0352
1075013526643-.04300.8840.0140.0190.0246
1288026536040-.09600.8040.0100.0120.0166
14911309815383-.15510.7160.0080.0090.0156
1615152013714597-.22860.6110.0100.0110.0179
1824187617273876-.30390.5130.0130.0140.0382
2028226021113104-.40030.4100.0120.0130.0234

In this table the following statistical checks were carried out, in accordance with recently established practice [1,21]:

MCC (Matthew's Correlation Coefficient):

(TP*TN–FP*FN)/√((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))

Accuracy (ACCY): (TN–TP)/(TP+FP+FN+TN)

Precision (PREC): TP/(TP+FP)

Sensitivity (SENSY): TP/(TP+FN)

Table 2

Summary of results for all 10 protein families (parent protein identified in columns 1 and 2).

Protein domain (4-letter code)Protein typeFigure (A-I in S1 Fig)CATH class (click on links for details including 3D structure)Peaks in contact distances vs MCC plots (secondary peaks)
CMAKOLVRNP2PSVB
1a4v_α-lactalbumin21.10.530.106.0(8.0) 10.08.0 12.0~14.0
5tim_TIM barrelA3.20.20.705.0 (8.0)6.0 (5.0)10.0 (8.0)12.0 (10.0)~
1ewkareceptor ligand binding domainB3.40.50.2300-5.5 (4.0)6.0-4.5 & -5.0 (5.5)10.0~
1fw0areceptor membrane domainC3.40.190.105.55.55.5~~
1ulkblectinD3.30.60.105.54.0 (6.0)5.56.010.0
1kx5ehistoneE1.10.20.105.55.08.012.04.5
2b4sbinsulin receptor TK domainF2.60.40.14108.0 (5.0)8.010.014.0~
1xckaGroELG3.30.260.108.0 (6.0)8.08.012.0~
1bpyaDNA β polymeraseH3.30.210.108.0 (5.5)3.8 (5.5)8.0(5.0)~
1n8kaalcohol dehydrogenaseI3.40.50.7208.0 (5.0 & 6.0)6.0 (5.0)8.08.0~
VRN and KOL: were determined using previously published methods [1] for dealing with a specially designed nonredundant database of protein domains. Briefly, for each protein a multiple sequence alignment was produced using the PredictProtein program [20]. This program generates VAR and ENT data, although these need to be parsed and extracted into a usable form. The KOL data are not provided directly but can be calculated based on the complexity of the alignments at each position in the consensus sequence. These methods are all described in detail in [1]. SVB: previously published [17] contact matrix. P2P: http://ignmtest.ccbb.pitt.edu/p2pdocs/p2p_doc.html CMA: http://gremlin.bakerlab.org/sub.php

Contact distance plots for the nonredundant set of 10 proteins for CMA, KOL, VRN, P2P and SVB (text colours here correspond to the colours in the figures).

The identities of the proteins for each plot are listed in Table 2 - column 1: PDB I.d., column 2: working name for the protein, column3: figure number. In this table the following statistical checks were carried out, in accordance with recently established practice [1,21]: MCC (Matthew's Correlation Coefficient): (TP*TN–FP*FN)/√((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)) Accuracy (ACCY): (TN–TP)/(TP+FP+FN+TN) Precision (PREC): TP/(TP+FP) Sensitivity (SENSY): TP/(TP+FN) For all of the above methods, comparisons were made between “hits” identified by the method and those in the 2D contact map for each target protein (listed in Table 2, columns 1 & 2). Contact maps at cutoff values 3.8, 4.0, 4.5, 5.0, 5.5, 6.0, 8,0, 10,0, 12.0, 14,0, 16.0, 18,0 and 20.0 were calculated and used for these comparisons. A count was made of true and false positives to calculate the Matthews correlation coefficients (MCCs) ([1,21] See also caption to Table 1) for each method at each cutoff value. The MCCs are plotted along the abscissae and the cutoffs form the ordinates of the plots in Fig 2 and S1 Fig. The results of applying these methods to the target proteins in this work are shown in Fig 2 and S1 Fig. Table 1 records the data for a single member of the set of proteins PDB i.d.:1a4v. (Corresponding data for all the others is available from the author) and Table 2. The members of the studied protein set (Table 2 columns 1 & 2) were chosen according to dual requirements for wide coverage of domain fold space (Table 2 column 4) and accuracy of the crystal structures (R and B-values obtainable through the links in column 4 of Table 2). Structural data including rotatable figures are also reachable through the same links. The question of noise and random effects in all this data has not been ignored; quite the converse. For each of the above metrics, the behaviour of a set of predictions based on completely random inputs was calculated and used to correct the metrics (subtraction of RND). Statistical correlations between CMA, KOL, VRN, P2P and SVB (corrected for RND) displayed (Fig 3) as a principle component analysis diagram. The first two components ((dominant–see insert) are plotted. This diagram was produced using the statistics package R (http://www.r-project.org/).
Fig 3

Principle component analysis of CMA, KOL, VRN, P2P and SVB for the entire set of protein domains.

Results and Discussion

The data arising out of this work can be used to construct 2D contact maps from which 3D structures can, in principle, always be derived using distance geometry [22-26]. There are ample precedents for presenting folding predictions in this way [1,10,11,17]. But now there is a new and better definition of “contact” because now we can define contacts in terms of preferred rather than just assumed (see below) distances, depending on the method used (Table 1). In addition to this enhanced contact data there is a wealth of other data which have earlier been enlisted in these endeavours such as predictions of secondary structure [1,10] and accessibility [1] which confer additional credence to the results of any attempts to compute the 3D structure. Once these juxtaposed results have been presented, the next step is to decide how best to combine the 2D contact predictions from SVB, P2P, VRN, KOL and CMA in such a way as to provide the best consensus set, “best” being here defined as leading to rapid convergence towards a structure for the protein that corresponds to the crystal structure for this protein. As stated, there has been widespread interest in trying to predict intra-protein (as well as inter-protein) contacts. But very little is ever said about the nature of those contacts. The definition of “contact” can be very vague or ambiguous, often referring to “hydrogen bond” or “Van der Waals” contacts. Neither of these include the possibility of and maybe even need for long range effects that are not contacts as such. A standard (http://www.ccp4.ac.uk/html/contact.html) range for short contacts is 2.0Å-3.66Å while a considerably wider range, 6Å-12Å, is considered significant in order to cater for all contact types (http://en.wikipedia.org/wiki/Protein_contact_map). Given that there is such a wide spread of distances which are involved in defining a “contact” it now becomes interesting and, as it turns out, important to ask the question: in each of the algorithms for contact prediction listed above and summarised in Table 1, what is the characteristic contact distance for each of these algorithms? The answer is provided by the cutoff distance for each case where MCC is a maximum. When this is done it becomes apparent that the various algorithms predict contacts having different characteristic distances. A clear conclusion from this work is that there is no “one-size-fits-all” algorithm for inter-residue contact prediction. One would clearly not choose the SVB or P2P alternatives since their behaviour is somewhat erratic and often confined to predicting rather long spacings (>10Å). There might be correlated behaviour over such long distances, but they can hardly be considered a “contact”. But it obviously makes sense to use CMA, VRN and KOL. One may legitimately ask why, given that CMA got off to such a good start in fold prediction, is there any need to consider other methods? Do the apparent correlations in CMA really correspond to events in coevolution? There have been many discussions on this question [12,27,28] and more remains to be discovered. In particular, there are clear indications [28] that CMA “hits” may reflect the rate of coevolution in relation to preserving arenaceous (i.e. low resolution) structural features such as the protein “core” rather than acting as a predictor of specific pairs of contacting residues as such. But insofar that CMA can with appropriate noise filters be used to predict contacts the Gremlin approach [29] is most useful and it produces results of very high fidelity. This paper has its main focus on protein folding, or rather, domain folding. Several different methods for predicting domain folds were compared and it was found that these methods work in subtly different ways in that they predict contacts with different values. There is every reason therefore to use more than one, but not all, of these methods. Together they provide a more robust and information-rich prediction model and, while they do not “converge” as such, they “collude” in a way that could to lead to a more reliable result (at least as far as VRN and KOL are concerned). From Fig 3 it would appear that KOL, VRN and CMA are controlled by similar underlying factors and all three correlate in an almost antiparallel fashion with cutoff. Of course, there is no linear correlation as is made abundantly clear in Fig 2 and S1 Fig. P2P and SVB are almost orthogonal to the cutoff indicating little or no dependency in that sense. One of the missing items in much published work is a clear definition of what is meant by “contact”. A mention of this has been made [28] which amounts to a general assumption throughout the CMA debate that, for example a “hydrophobic-hydrophobic” contact can be replaced by a hydrogen bond or a salt-bridge or an “aromatic-aromatic” contact. As if these were freely interchangeable. But they are not interchangeable in such a simple way [30]. These interactions are based on entirely different mechanisms and replacement of one by the other is not to be regarded as a “compensatory mutation” [30]. Another missing item in previously published work is that there has been such a focus on “contacts”, however these are defined and/or measured, that other most important protein functions seem to have been forgotten. Exceptions to this is a precursor to the present paper [1] and a most important earlier paper [31] that sets out to consider the ability to disentangle direct and indirect correlations and to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures. Together with the efforts of the present author, this seems to be a valid and useful way forward. To this end, future extensions of this work will give further consideration to these other protein functions [1] that are encoded in the gene (the ability to fold into two (or more) conformational states, to be able to reach one state from the other, arriving at the correct locus inside or outside the cell, or within the cell membrane, recognition/binding to other proteins, recognition/binding of small ligands (orthosteric and/or allosteric agents)). Indeed, much of the difficulty surrounding the use of these contact prediction methods arises out of the fact that so many different functions are encoded in the gene and attempts to partition them lead to the kind of results that have been revealed in this work. Thus the use of the verb “disentangle” [31] is highly appropriate in this context.

Conclusions

This paper has dealt with the question of which inputs to use when conducting ab initio predictions of domain folds. Five methods were compared and it was found that they all make predictions in different ways. Different in respect to which interatomic distance or displacement that they best predict. CMA, VRN and KOL emerge as being the most useful methods for predicting “contacts”. The first two are already well established, while the Kolmogorov approach [1, 32] represents a novel and promising addition to the arsenal of techniques. As for the interatomic contacts themselves, no account has been taken here of the nature of the atom types involved (but it already is one of the ongoing extensions from this work). Here, a standard “CA-CA” proximity metric is assumed as a definition for all “contacts”. But the issue is an important one. Depending on the chemical nature of the participating atom types the problem (generally) of finding matching pairs amounts is a case of the mathematically well defined “marriage problem” (Gale-Shapley algorithm). This is applicable to “+/- type” interactions or wherever there is a duality or asymmetry in the interaction. But there are also interactions of a more neutral or symmetric kind such as “hydrophobic” interactions typical of the way that aliphatic, and to some extent aromatic, sidechains interact. These have more the character of the “stable-roommate problem” (Irving algorithm). It is intended that these distinctions will form the basis of yet another extension of this work.

This file contains figures A, B, C, D, E, F, G, H and I.

The identities of the proteins is stated in columns 1 and 2 of Table 2. Software developed for this paper is all available free of charge to interested users. The source code for the fortran program (predcon.f) which calculates the MCC scores from the input data is provided and users are strongly advised to peruse this file before use making note of the comments and the default names of the input files. Examples of the input files are provided for the purposes of getting the correct format. The program should be run once for each cutoff value and it is advisable to rename the output files using a filename that incorporates that value A shellscript (renumber) is provided to help with that and an awkscript (awkward) is provided to enable the different measurements (CMA, VRN, etc.) to be corrected for the random (RND) scores. The correct command for compilation of predcon.f under Linux is: gfortran -Wall -O3 -ffixed-line-length-132 -o predcon.exe predcon.f (GZ) Click here for additional data file.

The software is available in S1 Software.

It goes without saying that no commercial use may be made of these programs or scripts in whole or in part without express permission of the author. (GZ) Click here for additional data file.
  27 in total

1.  Correlated mutation analyses on very large sequence families.

Authors:  L Oliveira; A C M Paiva; G Vriend
Journal:  Chembiochem       Date:  2002-10-04       Impact factor: 3.164

2.  Identification of functionally conserved residues with the use of entropy-variability plots.

Authors:  Laerte Oliveira; Paulo B Paiva; Antonio C M Paiva; Gerrit Vriend
Journal:  Proteins       Date:  2003-09-01

3.  Relationship between protein structure and geometrical constraints.

Authors:  O Lund; J Hansen; S Brunak; J Bohr
Journal:  Protein Sci       Date:  1996-11       Impact factor: 6.725

4.  Correlated mutations contain information about protein-protein interaction.

Authors:  F Pazos; M Helmer-Citterich; G Ausiello; A Valencia
Journal:  J Mol Biol       Date:  1997-08-29       Impact factor: 5.469

5.  Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions.

Authors:  K T Simons; C Kooperberg; E Huang; D Baker
Journal:  J Mol Biol       Date:  1997-04-25       Impact factor: 5.469

6.  Correlated mutations and residue contacts in proteins.

Authors:  U Göbel; C Sander; R Schneider; A Valencia
Journal:  Proteins       Date:  1994-04

7.  Compensating changes in protein multiple sequence alignments.

Authors:  W R Taylor; K Hatrick
Journal:  Protein Eng       Date:  1994-03

8.  Improved prediction of protein secondary structure by use of sequence profiles and neural networks.

Authors:  B Rost; C Sander
Journal:  Proc Natl Acad Sci U S A       Date:  1993-08-15       Impact factor: 11.205

9.  Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus.

Authors:  D Altschuh; A M Lesk; A C Bloomer; A Klug
Journal:  J Mol Biol       Date:  1987-02-20       Impact factor: 5.469

10.  Covariation Is a Poor Measure of Molecular Coevolution.

Authors:  David Talavera; Simon C Lovell; Simon Whelan
Journal:  Mol Biol Evol       Date:  2015-05-04       Impact factor: 16.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.