
How many 3D structures do we need to train a predictor?

Pantelis G Bagos, Georgios N Tsaousis, Stavros J Hamodrakas

Abstract

It has been shown that the number of determined membrane protein structures grows exponentially, with approximately the same growth rate as that of water-soluble proteins. In order to investigate the effect of this growth on the performance of prediction algorithms for both alpha-helical and beta-barrel membrane proteins, we conducted a prospective study based on historical records. We trained separate hidden Markov models with training sets of different sizes and evaluated their performance on topology prediction for the two classes of transmembrane proteins. We show that the existing top-scoring algorithms for predicting the transmembrane segments of alpha-helical membrane proteins perform slightly better than those for beta-barrel outer membrane proteins in all measures of accuracy. With the same rationale, a meta-analysis of the performance of secondary structure prediction algorithms indicates that existing algorithmic techniques cannot be further improved by just adding more non-homologous sequences to the training sets. The upper limit for secondary structure prediction is estimated to be no more than 70% and 80% of correctly predicted residues for single sequence based methods and multiple sequence based ones, respectively. Therefore, we should concentrate our efforts on utilizing new techniques for the development of even better scoring predictors.


Year:  2009        PMID: 19944385      PMCID: PMC5054404          DOI: 10.1016/S1672-0229(08)60041-8

Source DB:  PubMed          Journal:  Genomics Proteomics Bioinformatics        ISSN: 1672-0229            Impact factor:   7.691


Introduction

The three-dimensional (3D) structure of a protein is determined by its amino acid sequence in a given environment and, consequently, determines its exact biological function. However, experimental methods for determining the structure of a given protein, such as X-ray crystallography and nuclear magnetic resonance, are expensive, time-consuming and in many cases (i.e., concerning membrane proteins) difficult for a number of reasons. Thus, from the early days of computational biology, several attempts were made to develop algorithms that can predict the secondary structure of a protein using only information encoded in its primary sequence. Later on, similar algorithms were developed for predicting more specialized secondary structure features, such as the transmembrane helices and β-strands of transmembrane (TM) proteins. In a typical case, a limited number of non-homologous sequences with known 3D structure are used for training the algorithm, and the method is then supposed to predict the secondary structure of newly discovered and unrelated proteins. Thus, we expect that as newly solved 3D structures accumulate, the prediction methods should become better. It has been shown that the progress of protein structure determination is approximately the same for membrane proteins and water-soluble ones, taking into consideration the year of the first published structure. However, membrane proteins lag behind water-soluble proteins by about 25 years in the appearance of the first published structure. Moreover, for training a predictor, a non-redundant dataset is usually used. In the early years of structure prediction algorithms, it was anticipated that an increase in the number of non-homologous sequences with known structure would enhance prediction accuracy. However, it later became evident that beyond a certain point, prediction accuracy could not be further improved merely by increasing the size of the training set.
In this work, we try to answer empirically the question regarding the relationship between the size of the training set and prediction accuracy. We address separately the general problem of predicting the secondary structure of proteins, and that of predicting the TM segments of membrane proteins (α-helical and β-barrel). The methods used for secondary structure prediction of water-soluble proteins appeared much earlier in the progress of biological research and continue to improve, taking advantage of the increasing number of unique structures determined year by year. However, even using the most advanced computational techniques devoted to this task (neural networks, support vector machines, etc.) and including evolutionary information in the form of multiple alignments as input, it is generally accepted that their prediction performance cannot exceed an upper limit, no matter how much the training set grows. In order to quantify this common belief, we performed a meta-analysis of published results using data from the existing literature. Concerning membrane proteins, we conducted a historical prospective study in order to illustrate the potential impact of newly determined 3D structures on topology prediction by state-of-the-art machine learning methods. Along these lines, we used the hidden Markov model (HMM)-based computational methods recently proposed by our group, namely PRED-TMBB and HMM-TM, as platforms to estimate the improvement of computational predictive methods as more (unique) structures become available for both α-helical and β-barrel TM proteins.

Results and Discussion

The literature search for secondary structure prediction algorithms identified 59 studies that fulfilled our criteria (Table 1). The methods are classified into two classes according to the input they use: those using single sequence information (23 methods) and those using evolutionary information in the form of multiple alignments (36 methods). The methods are highly heterogeneous in the algorithmic technique they utilize; we encountered feedforward neural networks with various fixed topologies (FFNNs), cascaded correlation neural networks (CC-NNs), recurrent neural networks (RNNs), partially recurrent neural networks (PRNNs), bidirectional recurrent neural networks (BRNNs), hybrid methods such as hidden neural networks (HNNs), linear regression classifiers, support vector machines (SVMs), nearest neighbor methods, Bayesian networks (BNs) and various propensity based statistical methods. The size of the training set also varied dramatically among methods, from 27 sequences in the earliest works to 3,925 sequences in the most recent one. The datasets used for the historical prospective study (for both α-helical and β-barrel TM proteins) are listed in the supplementary material at http://bioinformatics.biol.uoa.gr/historical/.
Table 1

Studies included in the meta-analysis for the accuracy of the secondary structure prediction algorithms

Year   Reference   Training set (No. of proteins)   Q3      Evolutionary information
1978   32          29                               53      NO
1978   33          25                               57      NO
1986   34          61                               62.2    NO
1987   35          59                               61.3    NO
1987   36          68                               63      NO
1987   37          25                               66      YES
1988   38          62                               58.7    NO
1988   39          106                              64.3    NO
1989   40          48                               63      NO
1990   41          62                               64      NO
1992   42          107                              66.4    NO
1993   43          91                               64.5    NO
1993   44          126                              72      YES
1993   45          110                              68      NO
1996   46          318                              72.9    YES
1996   46          318                              67      NO
1996   47          267                              64.4    NO
1996   48          126                              71.3    YES
1996   48          126                              66.3    NO
1997   49          556                              75      YES
1997   50          402                              67.5    NO
1997   51          512                              68      NO
1997   51          512                              72.4    YES
1997   52          90                               73.5    YES
1997   53          304                              72      YES
1997   53          473                              67      NO
1999   54          1,180                            76.6    YES
1999   55          681                              76.6    YES
1999   56          396                              72.9    YES
1999   57          187                              76.5    YES
2000   58          480                              76.4    YES
2000   59          496                              76.7    YES
2000   60          1,032                            80.6    YES
2000   61          452                              68.8    NO
2001   62          513                              73.5    YES
2001   63          396                              73.7    YES
2001   63          396                              68.8    NO
2001   16          126                              75.1    YES
2002   64          513                              73.5    YES
2002   64          513                              67.5    NO
2002   65          1,180                            78.13   YES
2003   66          480                              78.5    YES
2003   67          126                              72.8    YES
2003   68          1,460                            77.07   YES
2004   69          513                              75.2    YES
2004   70          1,612                            70.2    NO
2004   71          513                              77      YES
2004   72          513                              78.44   YES
2004   73          513                              76.5    YES
2005   74          3,553                            77.1    YES
2005   75          860                              78.4    YES
2005   76          396                              76.3    YES
2005   77          2,171                            79      YES
2005   78          513                              79.4    YES
2005   79          513                              69      NO
2005   79          513                              76.4    YES
2005   80          374                              76      YES
2005   7           3,925                            81.8    YES
2005   81          297                              70      YES
In Table 2, we list the detailed results of fitting the linear and non-linear curves to the measures of performance (Q, C and SOV) for α-helical and β-barrel TM proteins, as well as to the Q3 statistic for secondary structure prediction algorithms (see Materials and Methods). From the root mean squared error (RMSE) statistics, it is clear that for β-barrel TM proteins and for secondary structure (both with and without multiple alignments), the non-linear model fits the data better. For α-helical TM proteins, the RMSEs are nearly equivalent in all three cases. However, the growth rate represented by the β1 coefficients is very small, as also indicated by their large standard errors (resulting in marginally statistically significant slopes for Q and C, and an insignificant one for SOV). Concerning secondary structure predictions, the estimates correspond to an upper limit of around 70% for single sequence methods, whereas for multiple sequence methods this limit is around 80% (Figure 1). The differences between single sequence based methods and those using multiple alignments are reflected in the estimated β1 coefficient of the model for each class (0.022 vs 0.002). This parameter expresses the shape of the fitted line: larger values correspond to rapid initial growth and faster saturation, whereas smaller values correspond to a smoother increase. Even though one has to bear in mind that we are comparing entirely different methods, there appear to be differences between the two distributions. Thus, the linear phase of the growth in performance for single sequence based methods is estimated to span datasets of <200 proteins, whereas for multiple alignment based methods (mostly using NNs and SVMs) it spans datasets of <1,000 proteins.
It seems that methods depending on multiple alignments are more sensitive to the size of the training set, perhaps as a consequence of the fact that they utilize many more trainable parameters.
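To illustrate the saturation behaviour implied by these estimates, the fitted non-linear curve for Q3 of multiple-alignment methods can be evaluated at a few training-set sizes. This is a minimal sketch using the von Bertalanffy form from Materials and Methods with the coefficients reported in Table 2; the chosen sample sizes are simply representative values from Table 1:

```python
import math

def von_bertalanffy(x, b0, b1, b2):
    """Growth curve y = b0 * (1 - exp(-b1 * (x - b2))).

    b0 is the asymptotic (maximal) performance, b1 the growth rate,
    and b2 the hypothetical training-set size at which y would be zero.
    """
    return b0 * (1.0 - math.exp(-b1 * (x - b2)))

# Coefficients reported in Table 2 for Q3, multiple-alignment methods.
B0, B1, B2 = 0.790, 0.002, -976.918

# Evaluate the fitted curve at a few training-set sizes from Table 1.
for n in (126, 513, 1180, 3925):
    print(n, round(von_bertalanffy(n, B0, B1, B2), 3))
```

With these coefficients, the curve is essentially flat beyond a few thousand sequences, approaching the β0 ≈ 0.79 asymptote, which is the upper limit discussed in the text.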
Table 2

Results obtained from the linear and non-linear regression for secondary structure, α-helical membrane and β-barrel membrane proteins

Non-linear model

β-barrel TM proteins
                β0 (SE)          β1 (SE)            β2 (SE)               RMSE
Qβ              0.869 (0.010)    0.153 (0.034)      −7.876 (2.579)        0.0070
Cβ              0.734 (0.0217)   0.153 (0.036)      −2.183 (1.516)        0.0149
SOV             0.874 (0.0121)   0.216 (0.041)      −1.398 (1.184)        0.0132

α-helical TM proteins
Qα              0.884 (0.018)    0.019 (0.020)      −140.0415 (145.267)   0.0093
Cα              0.776 (0.098)    0.012 (0.018)      −139.895 (183.753)    0.0213
SOV             0.904 (0.013)    1.984 (−)          14.583 (0.366)        0.0376

Secondary structure
Q3 (single)     0.679 (0.006)    0.022 (0.004)      −50.405 (17.613)      0.0182
Q3 (multiple)   0.790 (0.011)    0.002 (7.4×10−4)   −976.918 (351.151)    0.0219

Linear model

β-barrel TM proteins
                β0 (SE)          β1 (SE)                RMSE
Qβ              0.735 (0.014)    0.007 (9.7×10−4)       0.0157
Cβ              0.462 (0.028)    0.013 (0.002)          0.0322
SOV             0.646 (0.034)    0.012 (0.002)          0.0389

α-helical TM proteins
Qα              0.843 (0.006)    3.3×10−4 (8.6×10−5)    0.0093
Cα              0.655 (0.013)    7.9×10−4 (1.9×10−4)    0.0206
SOV             0.879 (0.025)    3.3×10−4 (3.7×10−4)    0.0398

Secondary structure
Q3 (single)     0.627 (0.009)    7.1×10−5 (2.2×10−5)    0.0349
Q3 (multiple)   0.740 (0.007)    2.0×10−5 (7.0×10−6)    0.0269
Figure 1

The prediction accuracy (Q3) of secondary structure prediction algorithms in relation to the size of the training set. Single sequence methods are depicted with squares and multiple alignment-based ones are depicted with triangles. The non-linear regression curves for single sequence and multiple alignment ones are depicted with solid and dotted lines respectively.

Comparing α-helical TM proteins with β-barrels (Figure 2, Figure 3), we also observe that the former achieve borderline better performance in every measure studied. This is expected, since it is well known that predicting the TM regions of α-helical membrane proteins is a much easier task than for β-barrels. Furthermore, β-barrels need fewer 3D structures in order to train a predictor. This has to be interpreted in light of the smaller number of parameters used in the models, as well as the existence of fewer structural folds of TM β-barrels. Comparing the prediction of TM proteins (α-helical ones and β-barrels) to secondary structure prediction, we also note the superior performance of algorithms for TM protein topology prediction. Once again, this is expected, since TM protein topology prediction can be seen as a very specialized case of secondary structure prediction: the limitations imposed by the lipid bilayer restrict the possible conformations of a polypeptide chain, making the prediction relatively easier. By contrast, predicting secondary structure in general is harder, since the algorithm has to cover all the conformations arising from the large number of structural folds. Recent studies [8,9] suggest that the possible "folds" of membrane proteins are limited (in the same way that the number of folds of soluble proteins is limited). Thus, we expect that the findings reported in this work can be extrapolated into the future, provided that no fold completely different from those already seen appears. Given that the basic principles governing membrane protein folding (such as hydrophobicity) have already been taken into account when designing these algorithms, we have no reason to expect a dramatic change in the future.
Figure 2

The prediction accuracy (Q) of prediction algorithms for α-helical membrane proteins in relation to the size of the training set. The non-linear and linear regression curves are depicted with solid and dotted lines respectively.

Figure 3

The prediction accuracy (Q) of prediction algorithms for β-barrel membrane proteins in relation to the size of the training set. The non-linear and linear regression curves are depicted with solid and dotted lines respectively.

In our historical prospective study, we used only methods relying on single sequence information. For α-helical TM proteins, HMM-TM has been shown to outperform the top-scoring single-sequence methods currently available, such as TMHMM and HMMTOP, and compares favorably with newly developed methods that use multiple alignments. Had we used multiple alignments in our method, a higher plateau might have been reached and the shape of the growth curve might have differed slightly; however, the general conclusions would remain unaffected. For β-barrels, PRED-TMBB has been shown to be one of the most successful prediction algorithms, outperforming even methods that use evolutionary information. Furthermore, it has been shown that HMM methods outperform methods based on NNs and SVMs in topology prediction of both α-helical and β-barrel TM proteins. Thus, the results of this study are not likely to be inflated by the type of prediction method used. The algorithmic technique used for secondary structure prediction has a direct impact on performance, and the experience accumulated over the years has provided researchers with useful heuristic rules that increase it. Furthermore, for algorithms using evolutionary information derived from multiple alignments, the choice of a particular tool, such as BLAST or PSI-BLAST, HMMER, or CLUSTAL, to perform the database search and the alignment may influence the results. In addition, the size of the database searched has been shown to influence the results greatly, thus favoring the more recently published methods [15,16]. However, the results reported here clearly indicate that, with existing algorithmic techniques, the performance of secondary structure prediction algorithms cannot be further improved by increasing the size of the training set.
In the case of membrane proteins, the study we conducted eliminates all possible sources of variation (different training methods, different dataset selection criteria, etc.), and is thus expected to produce unbiased estimates of the dependence on the size of the training set. The total number of freely estimated parameters in the model used for β-barrel membrane proteins is 175, whereas the respective number for the model used for α-helical membrane proteins is 304. These numbers are adequate for training a prediction method using some dozens of proteins (i.e., thousands of amino acids as observations) and are in any case significantly smaller than the number of freely estimated parameters (weights) needed by an NN method. Perhaps an NN method would have produced different estimates. However, HMMs have proved to be not only the most parsimonious of the machine learning algorithms, but also the most efficient for predicting the topology of TM proteins. Furthermore, the particular HMM methods used here have been found to be among the top-scoring ones in the literature [5,11]. The major finding of this work is the identification of an upper limit for the performance of prediction algorithms. We have shown that, with existing algorithmic techniques, prediction performance cannot be further improved by simply adding sequences to the training set. Thus, we need to develop algorithmic techniques entirely different from those used up to now. Such methods will definitely need to exploit long-range interactions (correlations) along the sequence.
All currently available techniques rely, one way or another, on the statistical properties of neighboring amino acid residues along the sequence. Thus, they all use local information and ignore long-range dependencies, which are highly important for the stability of secondary structure elements and, in some cases such as the β-sheet, are responsible for their formation. A few methods have already been proposed for incorporating long-range interactions into secondary structure prediction, using neural networks or variations of stochastic context-free grammars [19,20], whereas other methods, mainly based on neural networks, are devoted solely to predicting the long-range interactions themselves [21-24]. Such techniques are computationally more demanding, but given that computational power continues to grow, they should be explored further in the context of structure prediction algorithms in the near future.

Materials and Methods

Performing a literature search in PubMed (www.pubmed.gov), we identified studies describing an algorithm for secondary structure prediction that reported: (1) explicit use of a non-redundant training set, and (2) prediction performance as the percentage of correctly predicted residues (Q3) in a three-state mode (H-helix, E-extended, C-coil), on a test set having no significant similarity with the set used for training. For the latter, we accepted either a test on an independent set or the results of a cross-validation or jackknife test. We further classified the algorithms into two classes: those that depend on single sequence information, and those that use evolutionary information derived from multiple alignments. If a prediction method reported results using both single sequences and multiple alignments, these results were counted separately. Finally, if a method reported its performance on two or more large independent sets, we kept only the one with the highest accuracy. For the analysis regarding TM proteins, in order to eliminate the inherent variability of different methods applied to different datasets, we decided to conduct a prospective study based on historical records (a so-called "historical prospective study"). We used PDB_TM [25,26] to collect all available high-resolution structures of α-helical and β-barrel TM proteins deposited in the Protein Data Bank, and ranked these structures by year of publication. We were thus able to create datasets corresponding to the structures available in each year of the range 1995-2005. Since there was a delay between the elucidation of the first structure of an α-helical membrane protein (1986) and that of the first structure of a porin (1992), we decided to subtract this offset of 6 years, and thus obtained datasets for each year following the first published structure of each class.
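The per-year dataset construction described above can be sketched as follows. The PDB chain identifiers and years here are hypothetical placeholders, standing in for the PDB_TM-derived records used in the study:

```python
# Hypothetical (PDB chain, year of structure publication) records.
structures = [
    ("1BRD_A", 1995), ("1AF6_A", 1997), ("1QJ8_A", 1999),
    ("1K24_A", 2001), ("1T16_A", 2004), ("2F1C_A", 2005),
]

def cumulative_datasets(records, first_year, last_year):
    """For each year in the range, collect every structure published
    up to and including that year (the cumulative per-year datasets)."""
    datasets = {}
    for year in range(first_year, last_year + 1):
        datasets[year] = sorted(pdb for pdb, y in records if y <= year)
    return datasets

sets_by_year = cumulative_datasets(structures, 1995, 2005)
```

Each per-year set is then redundancy-reduced and used to train a separate model, so that the growth curve can be traced year by year.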
For each dataset, we performed a redundancy check using algorithm 2 from Hobohm et al. Non-redundant datasets were created by removing all chains for which a putative homologous entry was already in the set, with the threshold defined as <30% pairwise sequence similarity (over a length of more than 80 residues) in a BLAST alignment. For sequences shorter than 80 residues, which are frequent among single-spanning membrane proteins, we used a similarity threshold of less than 50% over a length of more than 30 residues. For each such set, we trained a separate HMM to predict the TM segments. The model used for β-barrels was identical to the one introduced with the PRED-TMBB method, whereas the model for α-helical membrane proteins was the same as that used in HMM-TM. For β-barrels, we evaluated performance with the jackknife test (i.e., removing a protein from the training set, training the model on the remaining proteins and testing on the removed protein). For α-helical TM proteins, where the training sets were larger, we used a seven-fold cross-validation procedure. Since the sequences do not show any significant similarity (no more than 30% identity in a BLAST comparison), the results of the study approximate what would have been observed if such an algorithm had been applied at that particular time. We used the Matthews correlation coefficient (Cα and Cβ for α-helical and β-barrel TM proteins, respectively) and the percentage of correctly predicted residues (Qα and Qβ, respectively), as well as the segment overlap measure (SOV), computed against the structures used for training each HMM. In both cases (α-helical and β-barrel TM proteins), the observed structures against which the comparisons were performed were obtained by visual inspection of the 3D structures.
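The greedy redundancy-reduction step can be sketched as below. This is a minimal illustration of Hobohm et al.'s algorithm 2, with the BLAST-based similarity check (identity and alignment-length thresholds) abstracted into a user-supplied predicate `is_similar`, which is an assumption of this sketch:

```python
def hobohm2(names, is_similar):
    """Greedy redundancy reduction (Hobohm et al., algorithm 2).

    Repeatedly discard the sequence with the most remaining neighbours
    above the similarity threshold, until no similar pair is left.
    `is_similar(a, b)` stands in for the pairwise BLAST comparison
    described in the text.
    """
    kept = set(names)
    # Precompute each sequence's neighbours above the threshold.
    neighbours = {a: {b for b in names if b != a and is_similar(a, b)}
                  for a in names}
    while True:
        # Pick the kept sequence with the most kept neighbours.
        worst = max(kept, key=lambda a: len(neighbours[a] & kept))
        if not (neighbours[worst] & kept):
            return sorted(kept)  # no similar pair remains
        kept.remove(worst)
```

For example, if three chains are mutually similar and a fourth is unrelated, the procedure keeps one representative of the cluster plus the unrelated chain.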
Especially for α-helical TM proteins, as explained in detail in the respective paper, a procedure for refining the boundaries of the TM segments was performed prior to training the final model. The relationship between the size of the training set and the performance of the prediction algorithms was assessed using linear and non-linear models. We fitted a simple linear regression line to each of the performance measures (C, Q and SOV, denoted here as y) against the number of proteins in the training set (x):

y = β0 + β1x

Here, the coefficient of interest is β1, which denotes the increase in predictive performance achieved by adding one more protein to the training set. In order to check for non-linearity with respect to the training set, we used the non-linear growth model of von Bertalanffy:

y = β0[1 − exp(−β1(x − β2))]

This model requires the estimation of three parameters: β0 corresponds to the maximal prediction performance, β1 corresponds to the growth rate, and β2 is an offset corresponding to the hypothetical size of a training set required for y to equal zero. The parameters were estimated iteratively by non-linear least squares. In order to decide which model fits the data better (linear vs non-linear), we used the RMSE statistic:

RMSE = sqrt[(1/n) Σi (yi − ŷi)²]

where ŷi is the model-predicted value for the ith observation. Smaller values of RMSE denote a better fit.
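A minimal sketch of the model-comparison machinery described above: closed-form ordinary least squares for the linear model, and the RMSE statistic used to compare fits. The iterative non-linear least squares used for the von Bertalanffy parameters is not reproduced here:

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

def rmse(ys, yhat):
    """Root mean squared error between observed and fitted values."""
    return math.sqrt(sum((y - h) ** 2 for y, h in zip(ys, yhat)) / len(ys))
```

Given (training-set size, accuracy) pairs, one fits both models, computes the RMSE of each against the observed accuracies, and keeps the model with the smaller RMSE, exactly the comparison reported in Table 2.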

Authors’ contributions

PGB conceived the study, designed the algorithms, performed the statistical analysis and wrote the manuscript. GNT collected the datasets, performed the training procedure and participated in writing the manuscript. SJH supervised the project and co-wrote the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.
References: 79 in total

1.  Cascaded multiple classifiers for secondary structure prediction.

Authors:  M Ouali; R D King
Journal:  Protein Sci       Date:  2000-06       Impact factor: 6.725

2.  The progress of membrane protein structure determination.

Authors:  Stephen H White
Journal:  Protein Sci       Date:  2004-07       Impact factor: 6.725

3.  Prediction of protein secondary structure based on residue pairs.

Authors:  Xin Liu; Li-Mei Zhang; Wei-Mou Zheng
Journal:  J Bioinform Comput Biol       Date:  2004-06       Impact factor: 1.122

4.  Porter: a new, accurate server for protein secondary structure prediction.

Authors:  Gianluca Pollastri; Aoife McLysaght
Journal:  Bioinformatics       Date:  2004-12-07       Impact factor: 6.937

5.  A bi-recursive neural network architecture for the prediction of protein coarse contact maps.

Authors:  Alessandro Vullo; Paolo Frasconi
Journal:  Proc IEEE Comput Soc Bioinform Conf       Date:  2002

6.  Predicting protein secondary structure and solvent accessibility with an improved multiple linear regression method.

Authors:  Sanbo Qin; Yun He; Xian-Ming Pan
Journal:  Proteins       Date:  2005-11-15

7.  The effect of long-range interactions on the secondary structure formation of proteins.

Authors:  Daisuke Kihara
Journal:  Protein Sci       Date:  2005-06-29       Impact factor: 6.725

8.  A limited universe of membrane protein families and folds.

Authors:  Amit Oberai; Yungok Ihm; Sanguk Kim; James U Bowie
Journal:  Protein Sci       Date:  2006-07       Impact factor: 6.725

9.  The formation and stabilization of protein structure.

Authors:  C B Anfinsen
Journal:  Biochem J       Date:  1972-07       Impact factor: 3.857

10.  A Hidden Markov Model method, capable of predicting and discriminating beta-barrel outer membrane proteins.

Authors:  Pantelis G Bagos; Theodore D Liakopoulos; Ioannis C Spyropoulos; Stavros J Hamodrakas
Journal:  BMC Bioinformatics       Date:  2004-03-15       Impact factor: 3.169

