Brett M Kroncke, Amanda M Duran, Jeffrey L Mendenhall, Jens Meiler, Jeffrey D Blume, Charles R Sanders.
Abstract
There is a compelling and growing need to accurately predict the impact of amino acid mutations on protein stability for problems in personalized medicine and other applications. Here the ability of 10 computational tools to accurately predict mutation-induced perturbation of folding stability (ΔΔG) for membrane proteins of known structure was assessed. All methods for predicting ΔΔG values performed significantly worse when applied to membrane proteins than when applied to soluble proteins, yielding estimated concordance, Pearson, and Spearman correlation coefficients of <0.4 for membrane proteins. Rosetta and PROVEAN showed a modest ability to classify mutations as destabilizing (ΔΔG < -0.5 kcal/mol), with a 7 in 10 chance of correctly discriminating a randomly chosen destabilizing variant from a randomly chosen stabilizing variant. However, even this performance is significantly worse than for soluble proteins. This study highlights the need for further development of reliable and reproducible methods for predicting thermodynamic folding stability in membrane proteins.
Year: 2016 PMID: 27564391 PMCID: PMC5024705 DOI: 10.1021/acs.biochem.6b00537
Source DB: PubMed Journal: Biochemistry ISSN: 0006-2960 Impact factor: 3.162
Figure 1. Boxplot of experimental (reference) and predicted value distributions. The middle line in the box is the median, and the upper and lower bounds of the boxes are the upper and lower quartiles, respectively. Nonoutlier extrema are bracketed with dashed lines above and below the upper and lower quartiles, respectively. Dots are outliers beyond 1.5 times the upper or lower quartile.
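The quantities drawn in such a boxplot can be sketched in a few lines of Python. This is only an illustration, not the authors' plotting code: it assumes the common Tukey convention in which points lying more than 1.5 times the interquartile range beyond a quartile are drawn as outliers, and the function name `boxplot_summary` and the sample values are hypothetical.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Return the numbers a boxplot draws (assumed Tukey convention).

    Assumption: outliers are points farther than `whisker` times the
    interquartile range (IQR) below the lower quartile or above the
    upper quartile.
    """
    # Quartiles by linear interpolation over the observed data.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme values that are not outliers.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical values, not data from the study.
example = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = boxplot_summary(example)
```

Plotting libraries differ in how they interpolate quartiles, so the box edges from another tool may differ slightly from this sketch.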
Summary of Methods Evaluated
| name | brief description | method | calibrated | sequence | Pearson | stability data sets |
|---|---|---|---|---|---|---|
| Rosetta | Structure knowledge-based potential. Score terms considered: van der Waals, electrostatics, solvation, hydrogen bond, rotamer probability. ddG_monomer application | N/A | 0.69 (high), 0.68 (low) | ProTherm | | |
| I-Mutant 3.0 | Support vector machine (SVM)-based predictor; can use sequence information and structure information to classify mutations as destabilizing, neutral, or stabilizing | SVM | X | X | 0.69 | Thermodynamic Database for Proteins and Mutants ProTherm (September 2005) |
| FoldX | Empirical force field calibrated with experimental ddG values. Score terms considered: van der Waals, solvation, hydrogen bonding, water bridges, electrostatic, entropy of backbone and side chain, and atomic clashes | grid search | X | 0.8 | derived from ProTherm | |
| mCSM | Graph-based structural signatures: distance patterns between atoms to represent the environment. Also considers pharmacophore changes and experimental conditions. Supervised machine learning methods trained for regression and classification | ANN | X | 0.82 | derived from ProTherm | |
| SDM | Statistical potential energy function (structure): evaluates amino acid structural propensities in homologous protein families | N/A | X | 0.58 | derived from ProTherm | |
| DUET | SVM that combines mCSM and SDM methods | SVM | X | X | 0.71 | ProTherm (low-redundancy set) |
| PPSC (M8) | SVM with eight attributes: hydropathy, isotropic surface area, electronic charge, volume, contact energy | SVM | X | 0.65 | derived from ProTherm | |
| PPSC (M47) | SVM trained with 8 + 40 additional protein features from ref ( | SVM | X | 0.82 | derived from ProTherm | |
| PROVEAN | Pairwise sequence alignment scores to predict effects of a mutation, including deletions, insertions, and multiple substitutions | N/A | X | 0.71 | derived from UniProtKB and Swiss-Prot databases | |
| ELASPIC | Machine learning approach that combines semiempirical force fields, sequence conservation scores, and structural information through stochastic gradient boosting of decision trees | SGBT-DT | X | X | 0.77 | ProTherm |
| EASE-MM | Sequence-based SVM model that evaluates the predicted secondary structure and accessible surface area of the region of interest | SVM | X | X | 0.56 | derived from ProTherm |
method: Type of machine learning method used: artificial neural network (ANN), support vector machine (SVM), and stochastic gradient boosting of decision trees (SGBT-DT).
calibrated: The predictive method is calibrated to experimental ΔΔG values.
Pearson: Reported Pearson correlation coefficient.
stability data sets: Used to derive both training and testing sets unless otherwise noted.
Activity correlation.
Summary of Statistical Methods Used To Evaluate Predictive Methods
| quantification method | description |
|---|---|
| concordance CC | The concordance correlation coefficient measures the degree to which the predicted ΔΔ |
| Pearson CC | The Pearson correlation coefficient measures the degree to which a uniform linear transformation of the predicted ΔΔ |
| Spearman rank CC | The Spearman rank correlation coefficient measures the degree to which the rank ordering of the predicted ΔΔ |
| ROC and AUC | The area-under-the-receiver operating characteristic (ROC) curve tests several cutoff values for binning mutations as neutral or destabilizing between the most negative calculated ΔΔ |
CC indicates correlation coefficient.
Figure 2. Reference (experimental) ΔΔG values vs calculated ddG values (x-axis) from each method tested (see also Table S1). Red lines are simple linear regressions from which Pearson correlations are derived; blue lines are flexible nonparametric trend lines. For the Rosetta and FoldX plots, a few predicted points were outliers that fall outside of the plotted window. The dashed line is the y = x line measuring perfect agreement between the predicted ΔΔG and the experimental values and is plotted for methods constructed to make direct predictions.
Figure 3. (A) Performance of each evaluated method in predicting true ΔΔG values (concordance correlation coefficient), linearly correlated ddG values (Pearson correlation coefficient), and rank order (Spearman rank order correlation coefficient). The hash marks in the upper portions of this plot indicate the published results for each method. We also evaluated the concordance, Pearson, and Spearman correlation coefficients using the calculated and experimental data previously reported[37] for a mostly water-soluble protein data set to control for processing differences, shown as triangles. (B) Receiver operating characteristic curves of the classification of variants that are more destabilized or less destabilized than 0.5 kcal/mol. We generated the black bold trace using data from a previous ΔΔG calculation effort[37] involving mostly soluble proteins.
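The area under an ROC curve such as those in panel B can be read as a pairwise probability, which is how the abstract's "7 in 10 chance" statement arises. The sketch below illustrates that reading and is not the authors' evaluation code. It assumes the abstract's sign convention (a variant counts as destabilizing when its experimental ΔΔG is below -0.5 kcal/mol) and assumes that more negative predicted scores mean more destabilizing; the function name and the example numbers are hypothetical.

```python
def auc_destabilizing(experimental, predicted, cutoff=-0.5):
    """AUC as the chance that a destabilizing variant is ranked correctly.

    Assumptions: a variant is destabilizing when its experimental value
    is below `cutoff` (kcal/mol), and lower predicted scores indicate
    stronger destabilization. Ties in predicted score count as half.
    """
    positives = [p for e, p in zip(experimental, predicted) if e < cutoff]
    negatives = [p for e, p in zip(experimental, predicted) if e >= cutoff]
    if not positives or not negatives:
        raise ValueError("need at least one variant in each class")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos < neg:
                wins += 1.0  # destabilizing variant scored lower: correct
            elif pos == neg:
                wins += 0.5  # tie: counted as a coin flip
    return wins / (len(positives) * len(negatives))


# Hypothetical values, not data from the study.
experimental_ddg = [-2.0, -1.0, -0.6, -0.2, 0.1, 0.4]
predicted_score = [-1.5, -0.3, -0.9, -0.5, 0.2, -1.0]
auc = auc_destabilizing(experimental_ddg, predicted_score)
```

In this toy case 6 of the 9 destabilizing/other pairs are ordered correctly, giving an AUC of about 0.67; a value of 0.5 corresponds to random ranking and 1.0 to perfect discrimination.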