Gilad Wainreb, Lior Wolf, Haim Ashkenazy, Yves Dehouck, Nir Ben-Tal.
Abstract
MOTIVATION: Accurate prediction of protein stability is important for understanding the molecular underpinnings of diseases and for the design of new proteins. We introduce a novel approach for the prediction of changes in protein stability that arise from a single-site amino acid substitution; the approach uses available data on mutations occurring in the same position and in other positions. Our algorithm, named Pro-Maya (Protein Mutant stAbilitY Analyzer), combines a collaborative filtering baseline model, Random Forests regression and a diverse set of features. Pro-Maya predicts the stability free energy difference of mutant versus wild type, denoted as ΔΔG.
Year: 2011 PMID: 21998155 PMCID: PMC3223369 DOI: 10.1093/bioinformatics/btr576
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Prediction scheme for a query mutation with known ΔΔG values for additional mutations at the same position. (A) The input for this prediction scheme includes the query (Q) and known (M) mutations at the query position. The ΔΔG. (B–E) Calculate the predicted ΔΔG of Q using the Random Forests algorithm. (F) Add the ΔΔG values of M to the appropriate elements in the energy matrix r, according to the MU identity and position of M. (G) Given the training set (matrix r) and the features, including the ΔΔG predicted by Random Forests (ΔΔGRF), start the stochastic gradient descent and (H) calculate the ΔΔG of Q.
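As a rough illustration of steps (F) to (H), the sketch below fits a generic bias-only collaborative filtering baseline by gradient descent over the known entries of a small energy matrix. It is not Pro-Maya's actual model. The matrix layout (position by mutant residue), the names `fit_baseline` and `predict`, the learning rate, the epoch count and the toy ΔΔG values are all assumptions made for this example, and the additional features (including ΔΔGRF) are left out.

```python
# Hypothetical sketch: bias-only collaborative filtering fitted by gradient steps.
# Assumption: `known` maps (position, mutant residue) -> measured ddG in kcal/mol.

def fit_baseline(known, n_epochs=2000, lr=0.1, reg=0.0):
    """Fit ddG ~ mu + b_pos[position] + b_res[residue] on the known entries."""
    mu = sum(known.values()) / len(known)      # global mean of the known ddG values
    b_pos = {p: 0.0 for (p, _) in known}       # per-position offsets
    b_res = {a: 0.0 for (_, a) in known}       # per-residue offsets
    # A fixed visiting order is used instead of random shuffling so that the
    # example is reproducible; a stochastic variant would shuffle each epoch.
    entries = sorted(known.items())
    for _ in range(n_epochs):
        for (p, a), ddg in entries:
            err = ddg - (mu + b_pos[p] + b_res[a])
            # Gradient step on the squared error with optional L2 shrinkage.
            b_pos[p] += lr * (err - reg * b_pos[p])
            b_res[a] += lr * (err - reg * b_res[a])
    return mu, b_pos, b_res


def predict(model, position, residue):
    """Predict ddG; unseen positions or residues fall back to a zero offset."""
    mu, b_pos, b_res = model
    return mu + b_pos.get(position, 0.0) + b_res.get(residue, 0.0)


# Toy, made-up values that follow an exactly additive pattern.
pos_effect = {10: 0.5, 20: -1.0, 30: 0.2}
res_effect = {"A": 0.3, "G": -0.8, "L": 0.1}
known = {(p, a): -1.0 + p_eff + r_eff
         for p, p_eff in pos_effect.items()
         for a, r_eff in res_effect.items()}
query = (20, "L")
true_ddg = known.pop(query)                    # hold out the query mutation Q

model = fit_baseline(known)
print(round(predict(model, *query), 3), round(true_ddg, 3))
```

Because the toy values are exactly additive, the held-out entry can be recovered from the other mutations at the same position and from the same residue at other positions. Real ΔΔG data are noisy, so a nonzero `reg` would usually be needed.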
Cross-validation results
| Subset | Number of mutations | Dataset | Performance measure | Pro-Maya ΔΔGRF | Pro-Maya CFCB | Pro-Maya ΔΔGRF ∪ CFCB | Prethermut | PoPMuSiC-2.0 | FoldX |
|---|---|---|---|---|---|---|---|---|---|
| Entire dataset | 2155 | Potapov-DB | PCC | 0.74±0.01 | | 0.77±0.01 | 0.72±0.01 | 0.62±0.01 | 0.55±0.02 |
| Entire dataset | 2155 | Potapov-DB | RMSE (kcal/mol) | 1.13 | | 1.09 | 1.20 | 1.35 | 1.64 |
| Entire dataset | 2648 | PoPMuSiC-DB | PCC | 0.74±0.01 | | 0.77±0.01 | 0.71±0.01 | 0.62±0.01 | 0.52±0.02 |
| Entire dataset | 2648 | PoPMuSiC-DB | RMSE (kcal/mol) | 0.99 | | 0.94 | 1.05 | 1.15 | 1.71 |
| SRPM | 752 | Potapov-DB | PCC | 0.59±0.03 | | | 0.57±0.03 | 0.48±0.04 | 0.50±0.03 |
| SRPM | 752 | Potapov-DB | RMSE (kcal/mol) | 1.28 | | | 1.30 | 1.39 | 1.57 |
| SRPM | 913 | PoPMuSiC-DB | PCC | 0.64±0.02 | | | 0.61±0.02 | 0.55±0.02 | 0.44±0.03 |
| SRPM | 913 | PoPMuSiC-DB | RMSE (kcal/mol) | 1.11 | | | 1.14 | 1.21 | 1.74 |
| MRPM | 1403 | Potapov-DB | PCC | 0.80±0.01 | 0.83±0.01 | | 0.77±0.01 | 0.69±0.01 | 0.58±0.02 |
| MRPM | 1403 | Potapov-DB | RMSE (kcal/mol) | 1.07 | 0.98 | | 1.14 | 1.32 | 1.67 |
| MRPM | 1735 | PoPMuSiC-DB | PCC | 0.79±0.01 | 0.82±0.01 | | 0.75±0.01 | 0.66±0.01 | 0.55±0.02 |
| MRPM | 1735 | PoPMuSiC-DB | RMSE (kcal/mol) | 0.92 | 0.85 | | 0.99 | 1.12 | 1.69 |
The PCC and RMSE of current methods and of Pro-Maya's CFCB and Random Forests (ΔΔGRF) prediction schemes on the PoPMuSiC-DB and Potapov-DB datasets and their subsets. The two subsets are mutations at positions absent from the training set (SRPM) and mutations at positions found in the training set (MRPM). The ΔΔGRF ∪ CFCB column reports the total performance obtained by combining the ΔΔGRF results on the SRPM subset with the CFCB results on the MRPM subset. The average and SD of the performance measures were obtained by a bootstrap procedure run for 1000 iterations on the cross-validation predictions. As can be seen, Pro-Maya outperforms the other methods. Moreover, the results for the MRPM set indicate that incorporating experimental data on mutations at the query position improved the prediction accuracy.
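The bootstrap estimate of the PCC mean and SD mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes that mutations are resampled with replacement as (observed, predicted) pairs, the names `pearson` and `bootstrap_pcc` are hypothetical, and the ΔΔG values are made up.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient; returns None if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_pcc(observed, predicted, n_iter=1000, seed=0):
    """Mean and sample SD of the PCC over n_iter resamples with replacement."""
    rng = random.Random(seed)                  # seeded for reproducibility
    n = len(observed)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([observed[i] for i in idx], [predicted[i] for i in idx])
        if r is not None:                      # skip degenerate resamples
            stats.append(r)
    mean = sum(stats) / n_iter
    sd = math.sqrt(sum((s - mean) ** 2 for s in stats) / (n_iter - 1))
    return mean, sd


# Made-up observed and predicted ddG values (kcal/mol), for illustration only.
observed = [-2.1, -0.4, 0.3, -1.5, -3.0, 0.8, -0.9, -1.2, 0.1, -2.6]
predicted = [-1.8, -0.9, 0.1, -1.1, -2.2, 0.2, -1.3, -0.7, -0.3, -2.0]
mean_pcc, sd_pcc = bootstrap_pcc(observed, predicted)
print(f"PCC = {mean_pcc:.2f} ± {sd_pcc:.2f}")
```

The same resampling loop could report the RMSE by swapping the statistic computed on each resample.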
Performance over the validation set
| Subset | Number of mutations | Performance measure | Pro-Maya | Prethermut | PoPMuSiC-2.0 |
|---|---|---|---|---|---|
| Entire dataset | 350 | PCC | 0.79 | 0.72 | 0.69 |
| Entire dataset | 350 | RMSE (kcal/mol) | 0.96 | 1.12 | 1.16 |
| SRPM | 196 | PCC | 0.69 | 0.65 | 0.65 |
| SRPM | 196 | RMSE (kcal/mol) | 1.09 | 1.15 | 1.15 |
| MRPM | 154 | PCC | 0.89 | 0.79 | 0.75 |
| MRPM | 154 | RMSE (kcal/mol) | 0.77 | 1.09 | 1.18 |
The PCC and RMSE of the Pro-Maya, Prethermut and PoPMuSiC-2.0 prediction schemes on the whole validation set and on its MRPM and SRPM subsets. Pro-Maya's final performance is the total performance for the Random Forests and collaborative filtering results on the SRPM and MRPM subsets, respectively (ΔΔGRF ∪ CFCB). As can be seen, Pro-Maya performs better on the entire validation set and on both subsets.
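Because the SRPM and MRPM subsets are disjoint and together make up the validation set, the whole-set RMSE can be cross-checked from the subset rows by pooling squared errors weighted by subset size. The sketch below does this with the values reported in the table. The helper name `pooled_rmse` is hypothetical, and since the reported numbers are rounded to two decimals, agreement is only expected to that precision.

```python
import math


def pooled_rmse(parts):
    """Combine (n_mutations, rmse) pairs from disjoint subsets into one RMSE."""
    total_n = sum(n for n, _ in parts)
    total_sq_err = sum(n * rmse ** 2 for n, rmse in parts)
    return math.sqrt(total_sq_err / total_n)


# (number of mutations, RMSE in kcal/mol) for the SRPM and MRPM subsets,
# copied from the validation-set table.
validation = {
    "Pro-Maya": [(196, 1.09), (154, 0.77)],
    "Prethermut": [(196, 1.15), (154, 1.09)],
    "PoPMuSiC-2.0": [(196, 1.15), (154, 1.18)],
}

pooled = {method: pooled_rmse(parts) for method, parts in validation.items()}
for method, value in pooled.items():
    print(f"{method}: {value:.2f} kcal/mol")
```

The pooled values come out at 0.96, 1.12 and 1.16 kcal/mol, matching the whole-set row. This shortcut applies to the RMSE only: a PCC over the full set cannot be recovered from subset PCCs in the same way.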
Fig. 2. The PCC of Pro-Maya on the PoPMuSiC-DB versus the number of known mutations at the query position, using the LOO-all and LOO-neglect procedures. The number of mutations in each group is shown in parentheses. For example, the second data point of the black curve indicates the performance of Pro-Maya on 327 query mutations at positions that have two additional mutations with a known ΔΔG in the training set. The first data point of the grey curve was calculated using the ΔΔGRF. The difference between the grey and black curves indicates the PCC improvement achieved by the addition of a single known mutation in the query position. The results suggest that the improvement in accuracy is facilitated by the incorporation of as few as 1–2 known ΔΔG values in the query position.