Ramin Dehghanpoor, Evan Ricks, Katie Hursh, Sarah Gunderson, Roshanak Farhoodi, Nurit Haspel, Brian Hutchinson, Filip Jagodzinski.
Abstract
Predicting how a point mutation alters a protein's stability can guide pharmaceutical drug design initiatives that aim to counter the effects of serious diseases. Mutagenesis studies on physical proteins can give insight into the effects of amino acid substitutions, but such wet-lab work is often prohibitive because of the time and financial resources needed to assess the effect of even a single amino acid substitution. Computational methods for predicting the effects of a mutation on a protein structure can complement wet-lab work, and a variety of approaches are available with promising accuracy rates. In this work we compare and assess the utility of several machine learning methods and their ability to predict the effects of single and double mutations. We generate mutant protein structures in silico and compute several rigidity metrics for each of them. We use these metrics as features for our Support Vector Regression (SVR), Random Forest (RF), and Deep Neural Network (DNN) methods. We validate the predictions for our in silico mutations against experimental ΔΔG stability data, and attain Pearson correlation values upwards of 0.71 for single mutations and 0.81 for double mutations. We perform ablation studies to assess which features contribute most to a model's success, and also introduce a voting scheme to synthesize a single prediction from the individual predictions of the three models.
Keywords: DNN; RF; SVR; machine learning; protein mutational study; rigidity analysis
Year: 2018 PMID: 29382060 PMCID: PMC6017198 DOI: 10.3390/molecules23020251
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Cartoon (left) and rigidity analysis (right) of PDB file 1hvr. Atoms in different rigid clusters are colored by cluster membership. The largest rigid cluster (red-brown) spans both halves of the protein.
Test set results for regression models for single and double mutants, as well as the union of the two (combined). RD = Rigidity Distance. The best results are shown in bold font.
| RD | Measure | Single SVR | Single RF | Single DNN | Double SVR | Double RF | Double DNN | Combined SVR | Combined RF | Combined DNN |
|---|---|---|---|---|---|---|---|---|---|---|
| lm | RMSE | 1.53 | **1.34** | 1.60 | 1.61 | 1.41 | 1.74 | 1.54 | 1.39 | 1.71 |
| | R | 0.60 | 0.71 | 0.58 | 0.76 | 0.79 | 0.66 | 0.65 | 0.72 | 0.52 |
| sm1 | RMSE | 1.52 | 1.35 | 1.60 | 1.60 | 1.37 | 1.64 | 1.54 | 1.39 | 1.80 |
| | R | 0.60 | 0.71 | 0.57 | 0.76 | | 0.71 | 0.65 | 0.72 | 0.46 |
| sm2 | RMSE | 1.53 | 1.35 | 1.71 | 1.61 | 1.36 | 1.90 | 1.54 | 1.40 | 1.87 |
| | R | 0.60 | 0.71 | 0.55 | 0.76 | | 0.60 | 0.65 | 0.72 | 0.46 |
| sm3 | RMSE | 1.52 | 1.35 | 1.60 | 1.60 | 1.38 | 1.93 | 1.54 | 1.39 | 1.81 |
| | R | 0.60 | 0.71 | 0.58 | 0.76 | 0.80 | 0.60 | 0.65 | 0.72 | 0.44 |
| sm4 | RMSE | 1.52 | **1.34** | 1.57 | 1.56 | 1.38 | 1.83 | 1.54 | 1.38 | 1.77 |
| | R | 0.60 | 0.71 | 0.57 | 0.77 | 0.80 | 0.64 | 0.66 | 0.73 | 0.55 |
| sm5 | RMSE | 1.53 | 1.35 | 1.70 | 1.60 | 1.35 | 1.89 | 1.54 | 1.39 | 1.74 |
| | R | 0.60 | 0.71 | 0.52 | 0.76 | | 0.52 | 0.65 | 0.72 | 0.51 |
| Avg. | RMSE | 1.53 | 1.35 | 1.63 | 1.60 | 1.38 | 1.82 | 1.54 | 1.39 | 1.78 |
| | R | 0.60 | 0.71 | 0.56 | 0.76 | 0.80 | 0.62 | 0.65 | 0.72 | 0.49 |
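The two accuracy measures reported in the table above can be computed directly from paired predicted and experimental ΔΔG values. The following is a minimal sketch using the standard definitions of RMSE and Pearson correlation; the function names and the sample values are illustrative assumptions, not taken from the authors' code or data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def pearson_r(predicted, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)

# Made-up ΔΔG values, for illustration only (not from the paper's dataset).
actual_ddg = [-1.2, 0.4, -2.5, 0.9, -0.3]
predicted_ddg = [-0.8, 0.1, -1.9, 1.4, -0.6]

print(f"RMSE = {rmse(predicted_ddg, actual_ddg):.2f}")
print(f"R    = {pearson_r(predicted_ddg, actual_ddg):.2f}")
```

Lower RMSE and higher R both indicate better agreement, which is why the two measures are reported side by side for every model and rigidity distance.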
Figure 2. Test set predicted versus actual ΔΔG for single (a) and multiple (b) mutation data.
Figure 3. Test set predicted vs. actual ΔΔG, R = 0.73, for the best RF model using the sm4 rigidity distance metric, for the combined single and double mutations dataset.
Comparing training and test set results for regression models, averaged over the six RD metrics.
| Dataset | Accuracy Measure | SVR | RF | DNN |
|---|---|---|---|---|
| Single | Avg. Train RMSE | 1.08 | 0.47 | 1.04 |
| | Avg. Test RMSE | 1.53 | 1.35 | 1.67 |
| | Avg. Train R | 0.79 | 0.97 | 0.79 |
| | Avg. Test R | 0.60 | 0.71 | 0.56 |
| Double | Avg. Train RMSE | 1.08 | 0.70 | 1.20 |
| | Avg. Test RMSE | 1.60 | 1.38 | 1.82 |
| | Avg. Train R | 0.83 | 0.93 | 0.69 |
| | Avg. Test R | 0.76 | 0.80 | 0.62 |
| Combined | Avg. Train RMSE | 1.11 | 0.50 | 1.33 |
| | Avg. Test RMSE | 1.52 | 1.39 | 1.78 |
| | Avg. Train R | 0.79 | 0.96 | 0.62 |
| | Avg. Test R | 0.65 | 0.72 | 0.49 |
Test set results for voting schemes for single mutants using SVR and DNN predictions. The voting schemes that showed the best improvement are highlighted in bold font.
| RD | Measure | SVR | DNN-AVG | VSuwa | VSrmse-wa | VSc-wa | VScombined-wa |
|---|---|---|---|---|---|---|---|
| lm | RMSE | 1.53 | 1.46 | 1.43 | | | |
| | R | 0.60 | 0.64 | 0.65 | | | |
| sm1 | RMSE | 1.52 | 1.55 | 1.46 | 1.46 | 1.46 | 1.46 |
| | R | 0.60 | 0.59 | 0.64 | 0.64 | 0.64 | 0.64 |
| sm2 | RMSE | 1.53 | 1.57 | 1.45 | 1.45 | 1.45 | 1.45 |
| | R | 0.60 | 0.60 | 0.64 | 0.65 | 0.65 | 0.65 |
| sm3 | RMSE | 1.52 | 1.57 | 1.46 | 1.46 | 1.46 | 1.46 |
| | R | 0.60 | 0.59 | 0.64 | 0.64 | 0.64 | 0.64 |
| sm4 | RMSE | 1.52 | 1.56 | 1.48 | 1.48 | 1.48 | 1.48 |
| | R | 0.60 | 0.57 | 0.62 | 0.63 | 0.63 | 0.63 |
| sm5 | RMSE | 1.53 | 1.63 | 1.50 | 1.49 | 1.49 | 1.49 |
| | R | 0.60 | 0.55 | 0.61 | 0.62 | 0.62 | 0.62 |
Figure 4. Pearson Correlation (R) values for voting schemes using SVR and DNN predictions for a single mutation using various rigidity distances.
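The voting schemes are named here but their weights are not defined in this excerpt. One plausible reading, stated purely as an assumption, is that VSuwa is an unweighted average while the other three weight each model by its development-set RMSE, its development-set correlation, or both. A minimal sketch under that assumption follows; all function names and numbers are hypothetical.

```python
def weighted_vote(predictions, weights):
    """Weighted average of per-model predictions for one mutation."""
    total = sum(weights[m] for m in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[m] * predictions[m] for m in predictions) / total

def uniform_weights(models):
    # Assumed reading of VSuwa: every model counts equally.
    return {m: 1.0 for m in models}

def inverse_rmse_weights(dev_rmse):
    # Assumed reading of VSrmse-wa: lower development RMSE gives a larger weight.
    return {m: 1.0 / err for m, err in dev_rmse.items()}

def correlation_weights(dev_r):
    # Assumed reading of VSc-wa: higher development R gives a larger weight;
    # negative correlations are clipped to zero.
    return {m: max(r, 0.0) for m, r in dev_r.items()}

def combined_weights(dev_rmse, dev_r):
    # Assumed reading of VScombined-wa: product of the two weightings above.
    inv = inverse_rmse_weights(dev_rmse)
    cor = correlation_weights(dev_r)
    return {m: inv[m] * cor[m] for m in dev_rmse}

# Hypothetical development-set metrics and predictions for one mutation.
dev_rmse = {"SVR": 1.5, "RF": 1.3, "DNN": 1.6}
dev_r = {"SVR": 0.60, "RF": 0.70, "DNN": 0.55}
preds = {"SVR": -1.1, "RF": -0.7, "DNN": -1.4}

schemes = {
    "unweighted": uniform_weights(preds),
    "rmse-weighted": inverse_rmse_weights(dev_rmse),
    "correlation-weighted": correlation_weights(dev_r),
    "combined": combined_weights(dev_rmse, dev_r),
}
for name, weights in schemes.items():
    print(f"{name}: {weighted_vote(preds, weights):.3f}")
```

Deriving the weights from a development set rather than the test set keeps the reported test metrics unbiased, whatever the exact weighting formula is.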
Test set results for voting schemes for single mutants using SVR, DNN, and RF predictions. The best results are highlighted in bold font.
| RD | Measure | DNN-AVG | SVR | RF | VSuwa | VSrmse-wa | VSc-wa | VScombined-wa |
|---|---|---|---|---|---|---|---|---|
| lm | RMSE | 1.46 | 1.53 | 1.34 | 1.37 | 1.35 | 1.35 | |
| | R | 0.64 | 0.60 | 0.71 | 0.70 | 0.71 | 0.71 | |
| sm1 | RMSE | 1.55 | 1.52 | 1.35 | 1.39 | 1.37 | 1.37 | 1.34 |
| | R | 0.59 | 0.60 | 0.71 | 0.69 | 0.70 | 0.70 | 0.71 |
| sm2 | RMSE | 1.57 | 1.53 | 1.35 | 1.37 | 1.35 | 1.35 | |
| | R | 0.60 | 0.60 | 0.71 | 0.69 | 0.71 | 0.71 | |
| sm3 | RMSE | 1.56 | 1.52 | 1.35 | 1.38 | 1.36 | 1.36 | 1.34 |
| | R | 0.59 | 0.60 | 0.71 | 0.69 | 0.70 | 0.71 | 0.72 |
| sm4 | RMSE | 1.56 | 1.52 | 1.34 | 1.40 | 1.37 | 1.38 | 1.34 |
| | R | 0.57 | 0.60 | 0.71 | 0.68 | 0.70 | 0.70 | 0.72 |
| sm5 | RMSE | 1.63 | 1.53 | 1.35 | 1.41 | 1.37 | 1.37 | 1.35 |
| | R | 0.55 | 0.60 | 0.71 | 0.67 | 0.70 | 0.70 | 0.71 |
Figure 5. Pearson Correlation (R) values for machine learning models and voting schemes with information from SVR, DNN, and RF predictions for a single mutation using various rigidity distances.
Test set RF ablation results on single mutations.
| Accuracy Measure | Feature 1 | Feature 2 | Feature 3 | No Ablation |
|---|---|---|---|---|
| RMSE | 1.48 | 1.39 | 1.38 | 1.34 |
| R | 0.63 | 0.68 | 0.69 | 0.71 |
| RMSE | 1.50 | 1.39 | 1.38 | 1.35 |
| R | 0.62 | 0.66 | 0.69 | 0.71 |
| RMSE | 1.50 | 1.39 | 1.38 | 1.35 |
| R | 0.62 | 0.68 | 0.69 | 0.71 |
| RMSE | 1.49 | 1.38 | 1.37 | 1.36 |
| R | 0.62 | 0.68 | 0.69 | 0.70 |
| RMSE | 1.49 | 1.38 | 1.38 | 1.35 |
| R | 0.63 | 0.69 | 0.69 | 0.71 |
| RMSE | 1.50 | 1.40 | 1.38 | 1.36 |
| R | 0.62 | 0.68 | 0.69 | 0.70 |
Test set RF ablation results on double mutations.
| Accuracy Measure | Feature 1 | Feature 2 | Feature 3 | No Ablation |
|---|---|---|---|---|
| RMSE | 1.47 | 1.41 | 1.40 | 1.39 |
| R | 0.77 | 0.79 | 0.79 | 0.80 |
| RMSE | 1.46 | 1.39 | 1.39 | |
| R | 0.77 | 0.80 | 0.80 | |
| RMSE | 1.46 | 1.39 | 1.38 | 1.37 |
| R | 0.77 | 0.80 | 0.80 | 0.81 |
| RMSE | 1.46 | 1.40 | 1.38 | 1.37 |
| R | 0.77 | 0.80 | 0.80 | 0.80 |
| RMSE | 1.45 | 1.40 | 1.40 | 1.37 |
| R | 0.78 | 0.79 | 0.80 | 0.80 |
| RMSE | 1.47 | 1.39 | 1.39 | 1.36 |
| R | 0.77 | 0.80 | 0.80 | 0.81 |
Test set RF ablation results on combined mutations.
| Accuracy Measure | Feature 1 | Feature 2 | Feature 3 | No Ablation |
|---|---|---|---|---|
| RMSE | 1.47 | 1.43 | 1.42 | 1.40 |
| R | 0.68 | 0.70 | 0.70 | 0.72 |
| RMSE | 1.48 | 1.42 | 1.41 | 1.39 |
| R | 0.68 | 0.71 | 0.71 | 0.72 |
| RMSE | 1.48 | 1.42 | 1.42 | 1.39 |
| R | 0.68 | 0.70 | 0.71 | 0.72 |
| RMSE | 1.48 | 1.42 | 1.42 | 1.39 |
| R | 0.68 | 0.70 | 0.71 | 0.72 |
| RMSE | 1.47 | 1.42 | 1.41 | 1.39 |
| R | 0.69 | 0.71 | 0.71 | 0.72 |
| RMSE | 1.48 | 1.43 | 1.41 | 1.40 |
| R | 0.68 | 0.70 | 0.71 | 0.72 |
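An ablation study of the kind summarized in the tables above retrains a model with one feature removed and compares its test error against the full-feature baseline. The sketch below is a minimal illustration under stated assumptions: a k-nearest-neighbour regressor stands in for the Random Forest so that the example needs only the standard library, and the data, function names, and `k` are hypothetical rather than the authors' setup.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training rows closest to x (stand-in model)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def drop_column(rows, j):
    """Return a copy of rows with feature column j removed."""
    return [row[:j] + row[j + 1:] for row in rows]

def ablation_study(train_X, train_y, test_X, test_y, k=1):
    """Test RMSE with all features, then with each feature removed in turn."""
    def score(tr, te):
        preds = [knn_predict(tr, train_y, x, k) for x in te]
        return rmse(preds, test_y)

    results = {"no ablation": score(train_X, test_X)}
    for j in range(len(train_X[0])):
        results[f"without feature {j + 1}"] = score(
            drop_column(train_X, j), drop_column(test_X, j)
        )
    return results

# Toy data: the target depends only on feature 1; feature 2 is constant.
train_X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
train_y = [0.0, 1.0, 2.0, 3.0]
test_X = [[0.0, 5.0], [3.0, 5.0]]
test_y = [0.0, 3.0]

for setting, err in ablation_study(train_X, train_y, test_X, test_y).items():
    print(f"{setting}: RMSE = {err:.2f}")
```

A feature whose removal raises the error most is the one the model relies on most, which is how ablation tables are usually read.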
Figure 6. Sigmoid functions for scaling RD metric values. The green sigmoid acts much like a step function: rigid clusters made up of 10 or fewer atoms are weighted by a factor of 0. Using the violet sigmoid, atoms in clusters of up to 200 atoms are assigned a weight near 0, atoms in clusters of 200–300 atoms are weighted by 0.1–0.8, and atoms in clusters of 300+ atoms are weighted by 0.8 or more. Reproduced from [37].
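The caption gives two reference points for the violet curve (a weight of about 0.1 at 200 atoms and about 0.8 at 300 atoms) but not its formula. Assuming a standard logistic form, w(s) = 1 / (1 + exp(-k (s - m))), those two points are enough to recover a midpoint m and steepness k. The sketch below does that; the logistic form and the function names are assumptions, not the authors' stated definition.

```python
import math

def fit_logistic(x1, y1, x2, y2):
    """Midpoint m and steepness k of a logistic curve through two points.

    Uses logit(w) = k * (x - m), which is linear in x.
    """
    logit1 = math.log(y1 / (1.0 - y1))
    logit2 = math.log(y2 / (1.0 - y2))
    k = (logit2 - logit1) / (x2 - x1)
    m = x1 - logit1 / k
    return m, k

def cluster_weight(size, midpoint, steepness):
    """Weight in (0, 1) assigned to atoms in a rigid cluster of the given size."""
    return 1.0 / (1.0 + math.exp(-steepness * (size - midpoint)))

# Reference points read from the caption for the violet sigmoid.
m, k = fit_logistic(200, 0.1, 300, 0.8)
print(f"midpoint = {m:.1f} atoms, steepness = {k:.4f}")

for size in (100, 200, 250, 300, 400):
    print(f"cluster of {size} atoms -> weight {cluster_weight(size, m, k):.3f}")
```

A much larger steepness with a midpoint just above 10 atoms would give the step-like behaviour described for the green curve.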
Data set sizes.
| Dataset | Training | Development | Test | Total |
|---|---|---|---|---|
| Single | 1488 | 331 | 320 | 2139 |
| Double | 147 | 60 | 107 | 314 |
| Combined | 1635 | 391 | 427 | 2453 |