Jian Tian, Ningfeng Wu, Xiaoyu Chu, Yunliu Fan.
Abstract
BACKGROUND: An important aspect of protein design is the ability to predict changes in protein thermostability arising from single- or multi-site mutations. Protein thermostability is reflected in the change in free energy (ΔΔG) of thermal denaturation.
Year: 2010 PMID: 20598148 PMCID: PMC2906492 DOI: 10.1186/1471-2105-11-370
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Classification and regression performance of Prethermut on the M-dataset
| Method^a | Mutation | n^b | MCC | Q2 (%) | Sensitivity | Specificity | r |
|---|---|---|---|---|---|---|---|
| RF | 1 | 2765 | 0.46 | 77.3 | 71.3 | 79.7 | 0.70 |
| RF | 2 | 441 | 0.66 | 84.8 | 81.0 | 86.5 | 0.79 |
| RF | 3 | 93 | 0.86 | 96.8 | 84.6 | 98.8 | 0.87 |
| RF | ≥4 | 67 | 0.92 | 97.0 | 93.8 | 98.0 | 0.86 |
| RF | ≥1 | 3366 | 0.50 | 79.7 | 73.6 | 81.1 | 0.72 |
| SVM | 1 | 2765 | 0.39 | 79.8 | 41.2 | 92.1 | 0.64 |
| SVM | 2 | 441 | 0.59 | 83.0 | 51.1 | 97.4 | 0.74 |
| SVM | 3 | 93 | 0.45 | 89.7 | 23.1 | 100.0 | 0.79 |
| SVM | ≥4 | 67 | 0.66 | 88.1 | 50.0 | 100.0 | 0.78 |
| SVM | ≥1 | 3366 | 0.43 | 79.7 | 42.7 | 93.2 | 0.67 |
All results were obtained by 10-fold cross validation on the M-dataset. See Methods for definitions of overall accuracy (Q2), Matthews correlation coefficient (MCC), sensitivity, specificity, and Pearson correlation coefficient (r).
^a The number of trees in the random forests (RF) method is 10000; the parameters for the support vector machine (SVM) method are gamma (g) = 2, cost (c) = 8, and the weight for the positive samples (w) = 3.
^b n is the number of mutant proteins in the sample; the total number of mutant proteins in the M-dataset was 3366.
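The classification metrics named in the note above are defined in the paper's Methods section, which is not reproduced here. As a rough guide, the following sketch uses the standard textbook definitions computed from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and do not come from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives (illustrative inputs only).
    """
    total = tp + tn + fp + fn
    # Overall accuracy (Q2), as a percentage.
    q2 = 100.0 * (tp + tn) / total if total else 0.0
    # Sensitivity: share of actual positives that were predicted positive.
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives that were predicted negative.
    specificity = 100.0 * tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Q2": q2, "sensitivity": sensitivity,
            "specificity": specificity, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts, not values from the M-dataset.
    print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```

Because MCC uses all four counts, it stays informative when the two classes are unbalanced, which is one reason it is commonly reported next to Q2.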
Structural features used in Prethermut
| Feature | Program^a | Feature | Program |
|---|---|---|---|
| Total energy | FoldX | Stereochemical improper dihedral potential | Modeller 9.7 |
| Backbone H-bond | FoldX | Frequency_[0,2.1)^b | Modeller 9.7 |
| Sidechain H-bond | FoldX | Frequency_[2.1,2.2) | Modeller 9.7 |
| Van der Waals forces | FoldX | Frequency_[2.2,2.3) | Modeller 9.7 |
| Electrostatic attractions | FoldX | Frequency_[2.3,2.4) | Modeller 9.7 |
| Solvation polar | FoldX | Frequency_[2.4,2.5) | Modeller 9.7 |
| Solvation hydrophobic | FoldX | Frequency_[2.5,2.6) | Modeller 9.7 |
| Van der Waals clashes | FoldX | Frequency_[2.6,2.7) | Modeller 9.7 |
| Entropy side chain | FoldX | Frequency_[2.7,2.8) | Modeller 9.7 |
| Entropy main chain | FoldX | Frequency_[2.8,2.9) | Modeller 9.7 |
| Torsional clash | FoldX | Frequency_[2.9,3.0) | Modeller 9.7 |
| Backbone clash | FoldX | Frequency_[3.0,3.1) | Modeller 9.7 |
| Helix dipole | FoldX | Frequency_[3.1,3.2) | Modeller 9.7 |
| Current energy | Modeller 9.7 | Frequency_[3.2,3.3) | Modeller 9.7 |
| Bond energy | Modeller 9.7 | | |
^a The program used to calculate the corresponding feature (FoldX [11,28] or Modeller 9.7 [29]).
^b The frequency of short non-covalent contacts with a distance of less than 2.1 Å.
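The `Frequency_[lo,hi)` features in the table are counts of short non-covalent contacts falling into half-open distance bins. A minimal sketch of that binning is shown below, assuming the contact distances (in Å) have already been extracted from a structure model. The function name `contact_frequencies` and the sample distances are hypothetical, and whether Prethermut uses raw counts or normalized frequencies is not stated here, so the sketch returns raw counts.

```python
import bisect

# Bin edges taken from the feature names: [0, 2.1), [2.1, 2.2), ..., [3.2, 3.3).
EDGES = [0.0] + [round(2.1 + 0.1 * i, 1) for i in range(13)]


def contact_frequencies(distances):
    """Count contact distances (in angstroms) per half-open bin [lo, hi).

    Distances outside [0, 3.3) are ignored.
    """
    labels = []
    for lo, hi in zip(EDGES[:-1], EDGES[1:]):
        lo_txt = "0" if lo == 0 else f"{lo:.1f}"
        labels.append(f"Frequency_[{lo_txt},{hi:.1f})")
    counts = dict.fromkeys(labels, 0)
    for d in distances:
        # bisect_right places a value equal to an edge into the bin starting there.
        idx = bisect.bisect_right(EDGES, d) - 1
        if 0 <= idx < len(labels):
            counts[labels[idx]] += 1
    return counts


if __name__ == "__main__":
    # Made-up distances for illustration only.
    for name, n in contact_frequencies([1.9, 2.1, 2.15, 2.2, 3.25, 3.3, 5.0]).items():
        print(name, n)
```

Building the edges with `round(..., 1)` avoids accumulated floating-point drift, so a distance of exactly 2.2 Å lands in `[2.2,2.3)` rather than in the bin below it.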
Figure 1. Receiver operating characteristic curves for random prediction and the prediction of Prethermut using the random forests (RF) and support vector machines (SVM) methods. The curves were obtained from the 10-fold cross validation test on the M-dataset.
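For readers unfamiliar with how a receiver operating characteristic curve such as the one in Figure 1 is built, the sketch below sweeps a decision threshold over prediction scores and records the false-positive and true-positive rates at each step. It is a generic illustration, not the authors' code. The names `roc_points` and `auc` and the sample labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), from (0, 0) to (1, 1).

    labels: 1 for positive, 0 for negative.
    scores: higher means "more likely positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    # Made-up labels and scores for illustration only.
    pts = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
    print(pts, auc(pts))
```

A random predictor follows the diagonal from (0, 0) to (1, 1), which corresponds to an area of 0.5. Curves that bow toward the top-left corner indicate better discrimination.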
Figure 2. Pearson correlation coefficient (r) [...]. The results were calculated on the M-dataset with 10-fold cross validation by the random forests method.
Figure 3. Pearson correlation coefficient (r) [...]. The results were calculated on the M-dataset with 10-fold cross validation by the random forests regression method (left panel) and support vector regression method (right panel).
Performance of Prethermut on the M-dataset with different ranges of absolute ΔΔG
| Method^a | Range of absolute ΔΔG | m^b | MCC | Q2 (%) | Sensitivity | Specificity | r |
|---|---|---|---|---|---|---|---|
| RF | [0, 1) | 1466 | 0.33 | 66.8 | 68.9 | 65.5 | 0.39 |
| RF | [1, 2) | 873 | 0.57 | 84.0 | 78.7 | 85.2 | 0.56 |
| RF | [2, 3) | 509 | 0.66 | 91.0 | 88.1 | 91.3 | 0.69 |
| RF | [3, 14) | 518 | 0.77 | 94.8 | 87.9 | 95.7 | 0.72 |
| SVM | [0, 1) | 1466 | 0.28 | 68.3 | 36.9 | 87.1 | 0.31 |
| SVM | [1, 2) | 873 | 0.52 | 86.3 | 49.7 | 95.0 | 0.55 |
| SVM | [2, 3) | 509 | 0.64 | 93.3 | 57.6 | 98.0 | 0.65 |
| SVM | [3, 14) | 518 | 0.62 | 93.4 | 44.8 | 99.6 | 0.63 |
All results were obtained by 10-fold cross validation on the M-dataset. See Methods for definitions of overall accuracy (Q2), Matthews correlation coefficient (MCC), sensitivity, specificity, and Pearson correlation coefficient (r).
^a The number of trees in the random forests (RF) method is 10000; the parameters for the support vector machine (SVM) method are gamma (g) = 2, cost (c) = 8, and the weight for the positive samples (w) = 3.
^b m is the number of mutant proteins in the M-dataset that have the same range of absolute ΔΔG.
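The stratification used in this table can be sketched as follows: each mutant is assigned to a half-open range of absolute experimental ΔΔG, and accuracy is computed per range. This is an illustrative sketch only. The function name `accuracy_by_abs_ddg` and the sample values are hypothetical, and it assumes a prediction counts as correct when the predicted and experimental ΔΔG lie on the same side of zero, with zero treated as non-negative.

```python
# Ranges of absolute ΔΔG taken from the table rows.
RANGES = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 14.0)]


def accuracy_by_abs_ddg(pairs, ranges=RANGES):
    """Group (experimental, predicted) ΔΔG pairs by |experimental ΔΔG|.

    Returns {(lo, hi): (accuracy_percent_or_None, count)}.
    """
    stats = {r: [0, 0] for r in ranges}  # range -> [n_correct, n_total]
    for exp, pred in pairs:
        for lo, hi in ranges:
            if lo <= abs(exp) < hi:
                stats[(lo, hi)][1] += 1
                # Assumption: "correct" means the same sign class (zero counts as non-negative).
                if (exp >= 0) == (pred >= 0):
                    stats[(lo, hi)][0] += 1
                break
    return {r: ((100.0 * c / n) if n else None, n)
            for r, (c, n) in stats.items()}


if __name__ == "__main__":
    # Made-up values for illustration only.
    demo = [(0.5, 0.2), (-0.5, 0.3), (1.5, 2.0),
            (-2.5, -1.0), (-3.5, 0.1), (-4.0, -2.0)]
    for rng, (acc, n) in accuracy_by_abs_ddg(demo).items():
        print(rng, acc, n)
```

As a consistency check on the table itself, the four range counts for one method (1466 + 873 + 509 + 518) add up to the 3366 mutants of the M-dataset.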
Figure 4. Average prediction accuracy calculated cumulatively with a reliability index (RI) above a given value. The results were based on the M-dataset with 10-fold cross validation by the random forests (RF, squares) method and support vector machine (SVM, circles) method.
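The cumulative calculation described in the Figure 4 caption can be illustrated with the sketch below: for each cutoff, only predictions whose RI reaches the cutoff are kept, and the accuracy of that subset is reported. How the RI itself is derived is not given in this excerpt, so it is treated as a precomputed input. The function name `cumulative_accuracy`, the use of "greater than or equal to" for the cutoff, and the sample records are assumptions.

```python
def cumulative_accuracy(records, thresholds):
    """Accuracy of predictions whose reliability index is >= each threshold.

    records: iterable of (reliability_index, is_correct) pairs.
    Returns {threshold: (accuracy_percent_or_None, n_selected)}.
    """
    records = list(records)
    out = {}
    for t in thresholds:
        selected = [bool(ok) for ri, ok in records if ri >= t]
        if selected:
            out[t] = (100.0 * sum(selected) / len(selected), len(selected))
        else:
            # No prediction reaches this cutoff.
            out[t] = (None, 0)
    return out


if __name__ == "__main__":
    # Made-up (RI, correct?) records for illustration only.
    demo = [(1, False), (3, True), (5, True), (5, False), (8, True), (9, True)]
    for t, (acc, n) in cumulative_accuracy(demo, [0, 5, 8, 10]).items():
        print(t, acc, n)
```

Reporting the subset size next to each accuracy makes the trade-off visible: a stricter cutoff typically raises accuracy but covers fewer mutants.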
Performance of Prethermut and other computational methods on the S-dataset
| Method | r | Q2 (%) | n^a |
|---|---|---|---|
| CC/PBSA | 0.56 | 78.6 | 478 |
| EGAD | 0.59 | 71.0 | 1065 |
| FoldX | 0.5 | 69.5 | 1200 |
| Hunter | 0.45 | 69.4 | 1594 |
| I-Mutant2.0 | 0.54 | 77.5 | 933 |
| Rosetta | 0.26 | 73.4 | 1913 |
| Combining method | 0.64 | 80.8 | 407 |
| Prethermut (RF)b | 0.72 | 78.6 | 2156 |
| Prethermut (SVM)c | 0.70 | 83.2 | 2156 |
See Methods for definitions of overall accuracy (Q2) and Pearson correlation coefficient (r). The prediction results of CC/PBSA, EGAD, FoldX, Hunter, I-Mutant2.0, Rosetta, and Combining method were obtained from Potapov et al. [2].
^a n is the number of mutant proteins for which the method correctly predicted the change in thermostability.
^b The number of trees in the random forests (RF) method is 10000. The results were obtained by 10-fold cross validation on the S-dataset.
^c The parameters for the support vector machine (SVM) method are gamma (g) = 2, cost (c) = 4, and the weight for the positive samples (w) = 5. The results were obtained by 10-fold cross validation on the S-dataset.