Marc Corrales, Pol Cuscó, Dinara R Usmanova, Heng-Chang Chen, Natalya S Bogatyreva, Guillaume J Filion, Dmitry N Ivankov.
Abstract
The prediction of protein folding rates is a necessary step towards understanding the principles of protein folding. Due to the increasing amount of experimental data, numerous protein folding models and predictors of protein folding rates have been developed in the last decade. The problem has also attracted the attention of scientists from computational fields, which led to the publication of several machine learning-based models to predict the rate of protein folding. Some of them claim to predict the logarithm of protein folding rate with an accuracy greater than 90%. However, there are reasons to believe that such claims are exaggerated due to large fluctuations and overfitting of the estimates. When we confronted three selected published models with new data, we found a much lower predictive power than reported in the original publications. Overly optimistic predictive powers arise from violations of the basic principles of machine learning. We highlight common misconceptions in the studies claiming excessive predictive power and propose to use learning curves as a safeguard against those mistakes. As an example, we show that the current amount of experimental data is insufficient to build a linear predictor of logarithms of folding rates based on protein amino acid composition.
Year: 2015 PMID: 26606303 PMCID: PMC4659572 DOI: 10.1371/journal.pone.0143166
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Correlation of Huang and Tian’s model.
The correlation between Ω, the sum of amino acid foldabilities proposed in [26], and the log folding rates for two-state proteins. Blue dots represent proteins from the data set of Huang and Tian [26]. Red symbols show two-state proteins from data set 113. Correlation coefficients were calculated using only proteins between 30 and 200 residues long, depicted as circles (0.82 for Huang and Tian’s set and 0.63 for two-state proteins from data set 113). Proteins with fewer than 30 amino acid residues are shown as triangles, while those with more than 200 residues are shown as squares. The line shows the prediction from the original model by Huang and Tian [26].
Fig 2. Correlation coefficient of Huang and Tian’s model for different samples.
Forty data points were randomly sampled from a meta data set and the model described by Huang and Tian [26] was refitted; this procedure was repeated 10,000 times. The meta data set consists of two-state proteins from 30 to 200 residues combined from [26] and data set 113, without duplicates. The histogram of the obtained correlation coefficients was then plotted. The correlation coefficient ranges from approximately 0.5 to 0.8, which shows that the correlation cannot be estimated robustly with 40 proteins.
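The resampling idea in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the data are synthetic stand-ins, the pool size of 80 is an assumption, and the names `pearson` and `subsample_correlations` are hypothetical. The sketch also skips the refitting step and only recomputes the Pearson correlation on each subsample of 40 points.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def subsample_correlations(x, y, sample_size=40, n_rounds=10_000, seed=0):
    """Correlation on repeated random subsamples drawn without replacement."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_rounds):
        chosen = rng.sample(range(len(x)), sample_size)
        results.append(pearson([x[i] for i in chosen], [y[i] for i in chosen]))
    return results

# Synthetic stand-in data (assumption): NOT the folding-rate measurements.
rng = random.Random(1)
omega = [rng.gauss(0, 1) for _ in range(80)]
log_rate = [0.7 * o + rng.gauss(0, 0.7) for o in omega]

# Fewer rounds than the 10,000 in the caption, to keep the example quick.
r_values = subsample_correlations(omega, log_rate, n_rounds=2000)
print(f"r ranges from {min(r_values):.2f} to {max(r_values):.2f}")
```

Plotting a histogram of `r_values` would give the kind of picture the caption describes; the spread shows how much a correlation estimated from 40 points depends on which points happen to be drawn.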
Fig 3. Cross-validation results for two independent Gaussian samples.
In this toy model, we try to predict a variable from an uncorrelated predictor. The predictive power is null, but the model can be overtrained and give the illusion that the variables are correlated. We performed 5-fold cross-validation 1,000,000 times on the same data set (n = 100). The plot shows the distribution of the obtained correlation coefficients. The highest value is 0.202, and the lowest is -0.472 (associated p-values without multiple-hypothesis correction equal to 0.044 and $7 \times 10^{-7}$, respectively).
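A minimal sketch of this toy experiment is given below. It assumes a one-variable least-squares line as the model and pools the held-out predictions of each 5-fold split before computing the correlation; the caption does not specify these details, so both are assumptions, and the helper names are hypothetical. The number of repetitions is reduced from 1,000,000 to keep the run short.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def cv_correlation(x, y, k, rng):
    """One k-fold cross-validation run with a fresh random partition.

    Returns the correlation between pooled held-out predictions and y.
    """
    idx = list(range(len(x)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    predicted = [0.0] * len(x)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            predicted[i] = slope * x[i] + intercept
    return pearson(predicted, y)

# Two independent Gaussian samples: by construction there is nothing to predict.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [rng.gauss(0, 1) for _ in range(100)]

# Same data set every time; only the fold assignment changes.
scores = [cv_correlation(x, y, 5, rng) for _ in range(2000)]
print(f"lowest r = {min(scores):.3f}, highest r = {max(scores):.3f}")
```

Reporting only the best of many cross-validation runs on the same data is the kind of selection that can make a model with no predictive power look informative.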
Fig 4. Learning curves of the linear regression model.
The mean (n = 1000) correlation coefficient between the predicted and observed log folding rates is plotted for the training and test sets (blue and red lines, respectively) as a function of the data set size, together with the standard deviations of both sets (blue and red regions, respectively). Sixty percent of the examples are assigned to the training set and 40% to the test set. a. Log folding rates were fitted with 20 features corresponding to the absolute amino acid frequency of each protein. A clear overfit can be seen as a gap between the two correlation lines. b. Log folding rates were fitted using a single feature corresponding to the length of each protein in amino acids to the power of 2/3, $\ln(k) \sim -L^{2/3}$ [13]. There is a nearly perfect correspondence between training and test sets, and a slightly higher correlation on the test set than in Fig 4a.
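The learning-curve procedure for the single-feature model of panel b can be sketched roughly as follows. The protein lengths, the coefficients used to generate log folding rates, the noise level, and the function names are all illustrative assumptions, not values from the paper; only the 60/40 split and the $L^{2/3}$ feature come from the caption.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def learning_curve(feature, target, sizes, n_repeats=200, train_fraction=0.6, seed=0):
    """Mean and standard deviation of train/test correlation per data set size."""
    rng = random.Random(seed)
    curve = []
    for size in sizes:
        train_r, test_r = [], []
        n_train = round(train_fraction * size)
        for _ in range(n_repeats):
            chosen = rng.sample(range(len(feature)), size)
            train, test = chosen[:n_train], chosen[n_train:]
            slope, intercept = fit_line([feature[i] for i in train],
                                        [target[i] for i in train])
            for subset, store in ((train, train_r), (test, test_r)):
                predicted = [slope * feature[i] + intercept for i in subset]
                store.append(pearson(predicted, [target[i] for i in subset]))
        curve.append((size,
                      statistics.mean(train_r), statistics.stdev(train_r),
                      statistics.mean(test_r), statistics.stdev(test_r)))
    return curve

# Synthetic stand-in data (assumption): lengths and rates are made up.
rng = random.Random(1)
lengths = [rng.randint(30, 200) for _ in range(100)]
feature = [L ** (2 / 3) for L in lengths]
log_rate = [10.0 - 0.5 * f + rng.gauss(0, 2) for f in feature]

curve = learning_curve(feature, log_rate, sizes=[10, 20, 40, 80])
for size, tr_mean, tr_sd, te_mean, te_sd in curve:
    print(f"n={size:3d}  train r={tr_mean:.2f}±{tr_sd:.2f}  test r={te_mean:.2f}±{te_sd:.2f}")
```

A persistent gap between the training and test columns as the size grows would indicate overfitting; the 20-feature model of panel a would need a multivariate fit in place of `fit_line`.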
Fig 5. Learning curves of the contact order models.
a. Relative contact order model with fixed parameters d and ΔL (two atoms are in contact in the three-dimensional protein structure if they are closer than d = 6 Å and belong to residues separated along the chain by ΔL ≥ 1). b. Absolute contact order model with fixed parameters d and ΔL. Relative (c) and absolute (d) contact order models with varying parameters d and ΔL. For the relative contact order model, we restrict the data set to two-state proteins with fewer than 150 residues.
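As a rough illustration of the contact definition in panel a, the sketch below computes absolute and relative contact order from a list of atoms. The caption gives only the contact criterion; the normalization used here (mean sequence separation over all contacts, divided by chain length for the relative version) follows the commonly used definition and is an assumption about the model, as are the function name and the toy coordinates.

```python
import math

def contact_orders(atoms, n_residues, d=6.0, min_separation=1):
    """Absolute and relative contact order.

    atoms: list of (residue_index, x, y, z) tuples.
    Two atoms are in contact if they are closer than d (in angstroms) and
    their residues are at least min_separation apart along the chain.
    """
    total_separation = 0
    n_contacts = 0
    for a in range(len(atoms)):
        res_a, *pos_a = atoms[a]
        for b in range(a + 1, len(atoms)):
            res_b, *pos_b = atoms[b]
            separation = abs(res_a - res_b)
            if separation >= min_separation and math.dist(pos_a, pos_b) < d:
                total_separation += separation
                n_contacts += 1
    if n_contacts == 0:
        raise ValueError("no contacts found with the given d and min_separation")
    absolute_co = total_separation / n_contacts
    return absolute_co, absolute_co / n_residues

# Toy structure (assumption): one atom per residue at the corners of a square
# with 3.8 angstrom sides, so all six atom pairs are within 6 angstroms.
toy_atoms = [
    (0, 0.0, 0.0, 0.0),
    (1, 3.8, 0.0, 0.0),
    (2, 3.8, 3.8, 0.0),
    (3, 0.0, 3.8, 0.0),
]
abs_co, rel_co = contact_orders(toy_atoms, n_residues=4)
print(abs_co, rel_co)
```

For panels c and d, where d and ΔL vary, the same function could be called over a grid of `d` and `min_separation` values.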