Emmi Jokinen, Markus Heinonen, Harri Lähdesmäki.
Abstract
Motivation: Proteins are widely used in the biochemical industry for numerous processes. Refining a protein's properties via mutations, however, also affects its stability. An accurate computational method for predicting how mutations affect protein stability is therefore necessary to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data.
Year: 2018 PMID: 29949987 PMCID: PMC6022679 DOI: 10.1093/bioinformatics/bty238
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Pipeline illustration for mGPfusion. (a) M = 21 substitution matrices utilise different information sources and give scores to pairwise amino acid substitutions. (b) The wild-type structures from the Protein Data Bank are modelled as contact graphs. (c) The graph kernel measures the similarity of two sequences by a substitution model S over all positions p and their neighbourhoods in the contact graph. (d) Each substitution matrix is used to create a separate covariance matrix. (e) Multiple kernel learning (MKL) finds the optimal combination of the base kernels; the resulting kernel matrix measures variant similarities. (f) Experimentally measured stability values are gathered from ProTherm, and Rosetta's ddg_monomer application is used to simulate the stability effects of all single point mutations. (g) Bayesian scaling of the simulated values (x-axis): possible scalings are coloured green, the chosen scaling is marked by black dots, and the scaling is fitted to a subset of experimentally measured stabilities (circles). (h) The stability-predicting GP model is trained on the experimental and simulated data through the kernel matrix.
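Panels (c)–(e) describe how variant similarity is computed from substitution scores and the contact graph. As a rough sketch of the idea only (not the paper's implementation), using a toy three-letter alphabet, a toy substitution matrix standing in for one of the M = 21 matrices, and a hand-made contact graph:

```python
import numpy as np

# Toy alphabet and a symmetric toy substitution matrix standing in for one of
# the M = 21 substitution matrices (values here are illustrative only).
AA = "ARN"
S = np.array([[4.0, -1.0, -2.0],
              [-1.0, 5.0, 0.0],
              [-2.0, 0.0, 6.0]])
IDX = {a: i for i, a in enumerate(AA)}

def seq_kernel(x, y, contacts):
    """Simplified position-plus-neighbourhood similarity of two sequences.

    For each position p, the substitution score S[x_p, y_p] is summed together
    with the scores over p's neighbours in the wild-type contact graph,
    mimicking panel (c). `contacts` (a dict from position to neighbour list)
    is an assumed representation, not the paper's code.
    """
    k = 0.0
    for p in range(len(x)):
        k += S[IDX[x[p]], IDX[y[p]]]
        for q in contacts.get(p, ()):
            k += S[IDX[x[q]], IDX[y[q]]]
    return k

# Two variants of a 4-residue toy protein; positions 0 and 2 are in contact.
contacts = {0: [2], 2: [0]}
wild_type, mutant = "ARNA", "ARNR"
print(seq_kernel(wild_type, wild_type, contacts))  # self-similarity: 29.0
print(seq_kernel(wild_type, mutant, contacts))     # 24.0
```

A substitution at a contact position would lower the score at both ends of the edge, which is how the contact graph makes structurally coupled positions matter more than isolated ones.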
The 15 proteins from the ProTherm database, with counts of all mutations, point mutations and simulated point-mutation stability changes
| Protein (organism) | PDB | All mutations | Point mutations | Point mutations (sim.) |
|---|---|---|---|---|
| T4 Lysozyme (Enterobacteria phage T4) | 2LZM | 349 | 264 | 3116 |
| Barnase | 1BNI | 182 | 163 | 2052 |
| Gene V protein | 1VQB | 124 | 92 | 1634 |
| Glycosyltransferase A | 1LZI | 116 | 114 | 2470 |
| Chymotrypsin inhibitor 2 | 2CI2 | 98 | 77 | 1235 |
| Protein G | 1PGA | 89 | 34 | 1064 |
| Ribonuclease H | 2RN2 | 83 | 65 | 2945 |
| Cold shock protein B | 1CSP | 80 | 50 | 1273 |
| Apomyoglobin | 1BVC | 80 | 56 | 2907 |
| Hen egg white lysozyme | 4LYZ | 63 | 50 | 2451 |
| Ribonuclease A | 1RTB | 57 | 50 | 2356 |
| Peptidyl-prolyl cis-trans isomerase | 1PIN | 56 | 56 | 2907 |
| Ribonuclease T1 isozyme | 1RN1 | 53 | 48 | 1957 |
| Ribonuclease | 1RGG | 54 | 45 | 1824 |
| Bovine pancreatic trypsin inhibitor | 1BPI | 53 | 47 | 1102 |
| Total | | 1537 | 1211 | 31293 |
Comparison of different methods on the 15-protein dataset with respect to correlation (ρ) and rmse
| Method | Correlation | | | | | | | | | rmse | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Point mutations | | | Multiple mutations | | | All mutations | | | Point mutations | | | Multiple mutations | | | All mutations | | |
| | mut. | pos. | prot. | mut. | pos. | prot. | mut. | pos. | prot. | mut. | pos. | prot. | mut. | pos. | prot. | mut. | pos. | prot. |
| mGPfusion | 0.61 | 0.49 | 0.64 | 0.52 | 1.07 | 2.45 | 2.53 | 1.87 | 1.84 | |||||||||
| mGPfusion, only B62 | 0.79 | 0.69 | 0.56 | 0.86 | 0.82 | 1.11 | 1.30 | 1.62 | 1.43 | 1.18 | ||||||||
| mGP | 0.81 | 0.51 | – | 0.86 | 0.52 | – | 0.83 | 0.50 | – | 1.54 | – | 1.44 | 2.65 | – | 1.14 | 2.09 | – | |
| mGP, only B62 | 0.76 | 0.34 | – | 0.86 | 0.55 | – | 0.80 | 0.49 | – | 1.26 | 1.95 | – | 1.45 | 2.56 | – | 1.30 | 2.23 | – |
| Rosetta scaled | 0.65 | 0.63 | – | 0.51 | 0.39 | – | 0.60 | 0.48 | – | 1.35 | 1.38 | – | 2.49 | 2.99 | – | 1.66 | 2.22 | – |
| Predictions from off-the-shelf implementations with no cross-validation | ||||||||||||||||||
| Rosetta | 0.55 | 0.40 | 0.49 | 1.63 | 2.74 | 1.92 | ||||||||||||
| mCSM | 0.61 | – | – | 1.40 | – | – | ||||||||||||
| PoPMuSiC | 0.64 | – | – | 1.37 | – | – | ||||||||||||
Note: The cross-validation levels mutation, position and protein are abbreviated as mut., pos. and prot., respectively. Predictions from the off-the-shelf implementations of Rosetta, mCSM and PoPMuSiC are used directly, without cross-validation.
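The table compares methods by correlation ρ and rmse between predicted and measured stability changes. As a hedged sketch of how such columns can be computed (assuming ρ is a Spearman rank correlation; the numbers below are synthetic, not from the paper):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, as reported in the rmse columns."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def spearman_rho(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Double argsort yields ranks; this simple form assumes no ties.
    """
    rank = lambda v: np.argsort(np.argsort(np.asarray(v)))
    return float(np.corrcoef(rank(y_true), rank(y_pred))[0, 1])

# Synthetic measured vs predicted stability changes for four variants.
y = [0.5, 1.2, -0.3, 2.0]
yhat = [0.4, 1.0, -0.1, 2.2]
print(rmse(y, yhat))          # ~0.1803
print(spearman_rho(y, yhat))  # 1.0 (identical ordering)
```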
Fig. 2. Average weights for kernels utilising the described substitution matrices from AAindex2, when GP models were trained with mutation-level cross-validation. The bases for the substitution matrices are obtained from Tomii and Kanehisa (1996). Matrices marked with * were added to AAindex2 in a later release, and their bases were not determined by Tomii and Kanehisa (1996).
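The weights in Fig. 2 combine the per-substitution-matrix base kernels of Fig. 1(d) into one covariance, which the GP then uses for prediction. A minimal sketch of this combination, with toy random positive-definite kernels and hand-picked weights in place of the learned ones (the real model uses M = 21 base kernels):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(n):
    """A random symmetric positive-definite matrix to stand in for a base kernel."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)

# Three toy base kernels over 4 variants, standing in for the
# per-substitution-matrix covariance matrices of Fig. 1(d).
base_kernels = [random_psd(4) for _ in range(3)]

# Illustrative non-negative MKL weights; in the paper these are learned,
# and Fig. 2 reports their averages.
w = np.array([0.5, 0.3, 0.2])

# Combined kernel: a convex combination of the base kernels.
K = sum(wi * Ki for wi, Ki in zip(w, base_kernels))

# Standard GP regression posterior mean at the training inputs,
# assuming observation noise variance sigma2.
y = rng.normal(size=4)  # toy stability measurements
sigma2 = 0.1
alpha = np.linalg.solve(K + sigma2 * np.eye(4), y)
posterior_mean = K @ alpha
print(posterior_mean.round(3))
```

Because each base kernel is positive definite and the weights are non-negative, the combined K is a valid GP covariance by construction.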
Fig. 3.Scatter plot for the mutation level (leave-one-out) predictions made for all 15 proteins (see Table 1). The colour indicates the number of simultaneous mutations
Root-mean-square errors for different numbers of simultaneous mutations across all 15 proteins, for models trained with leave-one-out cross-validation
| Mutations | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Occurrences | 1211 | 207 | 52 | 42 | 4 | 8 | 3 | 3 | 6 | 1 |
| mGPfusion | 1.07 | 1.06 | 0.80 | 0.51 | 0.40 | 1.01 | 3.02 | 5.89 | 5.16 | 0.25 |
| mGPfusion, only B62 | 1.11 | 1.12 | 0.77 | 0.59 | 0.29 | 1.14 | 3.00 | 6.78 | 5.56 | 0.11 |
| mGP | 1.04 | 1.03 | 0.61 | 0.50 | 0.18 | 0.92 | 3.23 | 6.18 | 6.75 | 0.08 |
| mGP, only B62 | 1.26 | 0.96 | 0.65 | 0.83 | 0.26 | 1.14 | 2.95 | 6.90 | 6.57 | 0.05 |
| Rosetta scaled | 1.35 | 2.10 | 1.92 | 2.94 | 2.29 | 2.32 | 2.93 | 6.75 | 7.28 | 2.69 |
| Rosetta | 1.63 | 2.27 | 2.11 | 3.78 | 2.93 | 2.21 | 2.92 | 5.80 | 7.45 | 3.42 |
Note: Rosetta is added for comparison.
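The "Rosetta scaled" rows apply the scaling of Fig. 1(g) to the raw simulated values before comparison. The paper's scaling is Bayesian; as a simplified stand-in under that caveat, a plain least-squares linear map y ≈ a·z + b can be fitted on the subset of variants that have experimental measurements and then applied to all simulated values (all numbers below are synthetic):

```python
import numpy as np

# Toy simulated stability effects z and experimental values y for the same
# variants (synthetic; the real data come from ProTherm and ddg_monomer).
z = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.2, 0.6, 1.3, 2.1, 2.8])

# Subset of variants with experimental measurements (cf. circles in Fig. 1g).
measured = [0, 2, 4]

# Least-squares fit of y ~ a*z + b on the measured subset.
A = np.vstack([z[measured], np.ones(len(measured))]).T
(a, b), *_ = np.linalg.lstsq(A, y[measured], rcond=None)

# Apply the fitted scaling to every simulated value.
z_scaled = a * z + b
print(np.round(z_scaled, 3))
```

The Bayesian version in the paper additionally accounts for uncertainty over the candidate scalings (the green curves in Fig. 1g) rather than committing to a single point estimate.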
Fig. 4. (a) Correlation and (b) root-mean-square error of predictions made by models trained with different numbers of experimental training samples for T4 Lysozyme (2LZM). The results of Rosetta, mCSM and PoPMuSiC are invariant to the training data (mCSM and PoPMuSiC are pre-trained) and thus appear as constant lines. In both panels, each point is an average over 100 randomly selected training sets.