Abstract
The ability to improve protein thermostability via protein engineering is of great scientific interest and also has significant practical value. In this report we present PROTS-RF, a robust model based on the Random Forest algorithm that can predict thermostability changes induced not only by single-point mutations, but also by double- or multiple-point mutations. The model is built using 41 features, including evolutionary information, secondary structure, solvent accessibility and a set of fragment-based features. It achieves accuracies of 0.799, 0.782 and 0.787, and areas under receiver operating characteristic (ROC) curves of 0.873, 0.868 and 0.862, for the single-, double- and multiple-point mutation datasets, respectively. Contrary to previous suggestions, our results clearly demonstrate that a robust predictive model trained to predict thermostability changes induced by single-point mutations can also predict those induced by double- and multiple-point mutations. The model also shows high levels of robustness in tests using hypothetical reverse mutations. We demonstrate that testing datasets created based on physical principles can be highly useful for testing the robustness of predictive models.
Year: 2012 PMID: 23077576 PMCID: PMC3471942 DOI: 10.1371/journal.pone.0047247
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The features and their distributions in the training dataset.
| Feature class | Feature | Median (SM) | Median (DM) | Mean (SM) | Mean (DM) | p-value (K-S test) | Description |
| Secondary structure & solvent accessibility | Helix | 0 | 0 | 0.418 | 0.335 | 7.5×10⁻³ | The secondary structure of the wild-type residue. |
| | Sheet | 0 | 0 | 0.201 | 0.308 | 2.5×10⁻⁴ | |
| | Coil | 0 | 0 | 0.381 | 0.357 | 0.95 | |
| | Exposed | 1 | 0 | 0.685 | 0.478 | 6.0×10⁻¹⁵ | The solvent accessibility of the wild-type residue. |
| | Buried | 0 | 1 | 0.315 | 0.522 | 6.0×10⁻¹⁵ | |
| Relative difference | POSI | 0 | 0 | −0.00528 | −0.0327 | 0.259 | Composition difference of positively charged residues (RKH) |
| | CHAR | 0 | 0 | −0.0609 | −0.0469 | 1.3×10⁻⁴ | Composition difference of charged residues (RKHDE) |
| | SMAL | 0 | 0 | −0.113 | −0.0427 | 1.8×10⁻⁴ | Composition difference of small residues (T and D) |
| | TINY | 0 | 0 | 0.0661 | 0.167 | 2.2×10⁻¹⁴ | Composition difference of tiny residues (A, G, P, S) |
| | dASA | 0.000350 | −0.0159 | −0.00153 | −0.0179 | 2.2×10⁻¹⁶ | Difference of the average of the maximum solvent accessible surface area. |
| | pIa | 0.0023 | 0.000400 | 0.00657 | 0.00114 | 1.1×10⁻⁸ | Difference of the average pI over all residues. |
| Evolutionary information | Wtlo | 0.0300 | 0.0400 | 0.0329 | 0.0394 | 2.7×10⁻¹² | The log-odds of the wild-type residue in the PSSM |
| | Wtwt | 0.170 | 0.280 | 0.306 | 0.367 | 1.6×10⁻¹⁵ | The weighted score of the wild-type residue in the PSSM |
| | Mulo | 0 | −0.00990 | 0.000163 | −0.0102 | 2.4×10⁻¹³ | The log-odds of the mutant residue in the PSSM |
| | Muwt | 0.0100 | 0.0200 | 0.0969 | 0.0562 | 3.1×10⁻¹¹ | The weighted score of the mutant residue in the PSSM |
| | wtlo5 | 3.80 | 4.00 | 3.82 | 3.74 | 0.56 | The average of the log-odds of the 5 residues neighboring the wild-type residue. |
| | wtwt5 | 33.7 | 33.6 | 35.5 | 36.6 | 0.25 | The average of the weighted scores of the 5 residues neighboring the wild-type residue. |
| | wtlo9 | 3.90 | 4.00 | 3.74 | 3.84 | 0.13 | The average of the log-odds of the 9 residues neighboring the wild-type residue. |
| | wtwt9 | 34.8 | 34.8 | 35.4 | 36.7 | 0.25 | The average of the weighted scores of the 9 residues neighboring the wild-type residue. |
| | wtlo15 | 4.07 | 4.00 | 3.86 | 3.79 | 5.5×10⁻³ | The average of the log-odds of the 15 residues neighboring the wild-type residue. |
| | wtwt15 | 35.4 | 34.5 | 36.2 | 36.1 | 0.069 | The average of the weighted scores of the 15 residues neighboring the wild-type residue. |
| PROTS features | FBocc | 0.0134 | −0.0286 | 0.012 | −0.0271 | 2.2×10⁻²² | The potential difference from the occurrence of continuous tetra-peptide fragments |
| | FBhel | 0.00430 | 0.00100 | 0.00292 | 0.00210 | 0.085 | The potential difference from the occurrence of continuous tetra-peptide fragments that are in helix, sheet, coil, buried, exposed or intermediate status. |
| | FBshe | 0.00250 | −0.0152 | 0.00296 | 0.00147 | 0.73 | |
| | FBcoi | 0.00350 | −0.000400 | 0.00141 | −0.00111 | 0.012 | |
| | FBexp | 0.00540 | −0.000300 | 0.00428 | −0.00101 | 5.11×10⁻⁸ | |
| | FBbur | 0.00100 | 0.00100 | 0.000984 | 0.00108 | 0.95 | |
| | FBint | 0.00410 | 0.00205 | 0.00336 | 0.00165 | 0.042 | |
| | FDhel | 0.0320 | 0.0792 | 0.0612 | 0.0917 | 0.82 | The propensity difference of continuous tetra-peptide fragments that are in helix, sheet, coil, buried, exposed or intermediate status. |
| | FDshe | −0.0246 | −0.00115 | −0.0443 | 0.00210 | 0.77 | |
| | FDcoi | 0.0550 | −0.0243 | 0.0443 | 0.00530 | 0.28 | |
| | FDexp | 0.0737 | −0.0460 | 0.0773 | −0.0186 | 2.5×10⁻⁴ | |
| | FDbur | −0.0788 | 0.0213 | −0.0876 | 0.0467 | 0.043 | |
| | FDint | 0.0606 | 0.0590 | 0.0715 | 0.0710 | 0.86 | |
| | FBDTocc | 0.0112 | −0.0608 | 0.0188 | −0.0719 | 9.0×10⁻¹⁵ | The entropy difference from the occurrence of Delaunay four-residue fragments |
| | FBDTD43 | 0.00975 | −0.0953 | 0.0117 | −0.103 | 2.2×10⁻¹⁶ | The entropy difference from the occurrence of Delaunay four-residue fragments with at least 3 sequentially continuous residues, only 2 continuous residues and four non-neighboring residues, respectively. |
| | FBDTD2 | 0.00140 | −0.0287 | 0.0134 | −0.0365 | 9.2×10⁻¹³ | |
| | FBDTD1 | 0 | 0 | 0.000680 | −0.0100 | 1.6×10⁻⁷ | |
| | FBDTDD43 | −0.00345 | 0.00300 | −0.00271 | 0.00271 | 1.4×10⁻³ | The propensity difference of Delaunay four-residue fragments with at least 3 sequentially continuous residues, only 2 continuous residues and four non-neighboring residues, respectively. |
| | FBDTDD2 | −0.00805 | 0.00650 | 0.00736 | 0.00742 | 0.16 | |
| | FBDTDD1 | 0 | 0 | 0.00550 | −0.00746 | 0.067 | |
The p-values are calculated using the Kolmogorov-Smirnov test (K-S test). Boxplots of these features are available in Figure S1.
Structure-based features. SM: stabilizing mutations; DM: destabilizing mutations.
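As an illustration of the comparison behind the p-value column, the two-sample K-S statistic is the largest gap between the empirical distribution functions of a feature in the SM and DM groups. The sketch below is a minimal pure-Python version of that statistic only; the function name and the toy values are hypothetical, and turning the statistic into a p-value (which also depends on the two sample sizes) is left to a statistics library such as SciPy's `ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical toy values of one feature in the two groups (not from the paper).
sm_values = [0.1, 0.2, 0.3, 0.4]
dm_values = [0.5, 0.6, 0.7, 0.8]
d_separated = ks_statistic(sm_values, dm_values)
d_identical = ks_statistic(sm_values, sm_values)
```

A statistic near 1 means the feature separates the two groups almost completely, while a value near 0 means the two distributions are practically indistinguishable, which corresponds to the large p-values in the table.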
Figure 1. The importance of each feature to the regression models in cross-validation.
The error bars denote the variation in five-fold cross validation.
Comparison of prediction performance in cross-validation test.
| Methods | AUC (WT→MT) | ACC (WT→MT) | R (WT→MT) | AUC (MT→WT) | ACC (MT→WT) | R (MT→WT) |
| MUpro | 0.687 | 0.813 | 0.483 | 0.564 | 0.273 | 0.167 |
| I-Mutant2.0 | 0.694 | 0.775 | 0.540 | 0.557 | 0.683 | 0.069 |
| LSE | 0.577 | 0.614 | 0.155 | 0.577 | 0.614 | 0.155 |
| FoldX | 0.738 | 0.714 | 0.497 | - | - | - |
| EGAD | 0.745 | 0.732 | 0.595 | - | - | - |
| PROTS (Structure based) | 0.819 | 0.788 | 0.402 | 0.819 | 0.788 | 0.402 |
| PROTS (Sequence based) | 0.815 | 0.788 | 0.387 | 0.815 | 0.788 | 0.387 |
| PROTS_RF (Structure based) | | | | | | |
| PROTS_RF (Sequence based) | | | | | | |
Prediction values were provided by Potapov et al. [32].
AUC: area under ROC curve; ACC: accuracy; R: Pearson Correlation Coefficient.
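The three measures can be computed from paired measured and predicted ΔΔG values. The sketch below is a minimal pure-Python version under two assumptions that are not stated in the source: a mutation is treated as stabilizing when its ΔΔG is positive, and the classification threshold is zero. The function names and the toy values are hypothetical.

```python
import math


def accuracy_by_sign(measured, predicted):
    """ACC: fraction of mutations whose predicted sign matches the measured sign."""
    hits = sum((m > 0) == (p > 0) for m, p in zip(measured, predicted))
    return hits / len(measured)


def auc(measured, predicted):
    """AUC: chance that a stabilizing mutation is scored above a destabilizing one.

    Ties between the two scores count as half a win.
    """
    pos = [p for m, p in zip(measured, predicted) if m > 0]
    neg = [p for m, p in zip(measured, predicted) if m <= 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson_r(x, y):
    """R: Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (not from the paper), positive = stabilizing by assumption.
measured = [1.2, -0.8, 0.5, -2.0, -0.3]
predicted = [0.9, -0.5, -0.1, -1.5, 0.2]
acc_value = accuracy_by_sign(measured, predicted)
auc_value = auc(measured, predicted)
r_value = pearson_r(measured, predicted)
```

ACC and AUC only use the sign or ranking of the predictions, whereas R also reflects how well the predicted magnitudes track the measured ones, which is why a method can score well on one measure and poorly on another.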
The performance of ΔΔG prediction by PROTS-RF for mutations and hypothetical reversed mutations in the D180 dataset, compared with the WET model.
| Predictions | Metric | D180, WT→MT | D180, MT→WT |
| Structure-based predictions | AUC | 0.868 | 0.863 |
| | ACC | 0.782 | 0.780 |
| | R | 0.775 | 0.774 |
| Sequence-based predictions | AUC | 0.869 | 0.868 |
| | ACC | 0.798 | 0.797 |
| | R | 0.755 | 0.757 |
| WET | AUC | 0.961 | 0.518 |
| | ACC | 0.85 | 0.572 |
| | R | 0.930 | 0.110 |
AUC: area under ROC curve; ACC: accuracy; R: Pearson Correlation Coefficient.
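The hypothetical reverse mutations rest on a simple physical constraint: if a mutation WT→MT changes stability by ΔΔG, the reverse mutation MT→WT must change it by −ΔΔG. The sketch below shows how such a reversed test set can be built and why it exposes a directionally biased predictor. Everything in it is illustrative: the `Mutation` record, the residue scale, both toy predictors and the data are hypothetical, and negative ΔΔG is assumed to mean destabilizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    wild_type: str
    position: int
    mutant: str
    ddg: float  # measured change for wild_type -> mutant (negative = destabilizing, assumed)


def reverse(m):
    """Hypothetical reverse mutation: swap the residues and negate the measured value."""
    return Mutation(m.mutant, m.position, m.wild_type, -m.ddg)


def sign_accuracy(predict, mutations):
    """Fraction of mutations whose predicted sign matches the measured sign."""
    hits = sum((predict(m) > 0) == (m.ddg > 0) for m in mutations)
    return hits / len(mutations)


# Hypothetical per-residue scale, used only to build toy predictors.
SCALE = {"A": 0.0, "G": -1.0, "L": 1.5, "V": 1.0}


def antisymmetric_predictor(m):
    """Toy predictor whose output flips sign when the mutation is reversed."""
    return SCALE[m.mutant] - SCALE[m.wild_type]


def biased_predictor(m):
    """Toy predictor with a constant offset, so it always predicts destabilizing."""
    return antisymmetric_predictor(m) - 3.0


# Hypothetical toy data (not from the paper), mostly destabilizing.
forward = [
    Mutation("L", 10, "A", -1.2),
    Mutation("V", 25, "G", -2.0),
    Mutation("L", 40, "G", -2.4),
    Mutation("G", 55, "A", 0.6),
]
reversed_set = [reverse(m) for m in forward]

acc_anti_forward = sign_accuracy(antisymmetric_predictor, forward)
acc_anti_reverse = sign_accuracy(antisymmetric_predictor, reversed_set)
acc_biased_forward = sign_accuracy(biased_predictor, forward)
acc_biased_reverse = sign_accuracy(biased_predictor, reversed_set)
```

In this toy setting the antisymmetric predictor scores the same in both directions, while the biased one looks acceptable on the forward set and degrades sharply on the reversed set. This is the kind of gap between the WT→MT and MT→WT columns that the reversed datasets are designed to reveal.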
Figure 2. Linear regression and classification of the 180 double point mutations.
Figure 3. Linear regression and classification of the 141 multiple point mutations.
The performance of ΔΔG prediction by PROTS-RF for mutations and hypothetical reversed mutations in the D141 dataset.
| Predictions | Metric | D141, WT→MT | D141, MT→WT |
| Structure-based predictions | AUC | 0.862 | 0.858 |
| | ACC | 0.787 | 0.789 |
| | R | 0.663 | 0.659 |
| Sequence-based predictions | AUC | 0.855 | 0.844 |
| | ACC | 0.779 | 0.746 |
| | R | 0.637 | 0.629 |
AUC: area under ROC curve; ACC: accuracy; R: Pearson Correlation Coefficient.
Figure 4. Structure- and sequence-based prediction of mutations of staphylococcal nuclease.
Empty symbols are predictions for mutations with experimental data, and the corresponding crossed symbols are predictions for the hypothetical reverse mutations. The structural figure is based on the PDB entry 1STN.