Abstract
The prediction of protein-protein kinetic rate constants provides a fundamental test of our understanding of molecular recognition, and will play an important role in the modeling of complex biological systems. In this paper, a feature selection and regression algorithm is applied to mine a large set of molecular descriptors and construct simple models for association and dissociation rate constants using empirical data. Using separate test data for validation, the predicted rate constants can be combined to calculate binding affinity with accuracy matching that of state-of-the-art empirical free energy functions. The models show that the rate of association is linearly related to the proportion of unbound proteins in the bound conformational ensemble relative to the unbound conformational ensemble, indicating that the binding partners must adopt a geometry close to the bound geometry prior to binding. Mirroring the conformational selection and population shift mechanism of protein binding, the models provide a strong separate line of evidence for the preponderance of this mechanism in protein-protein binding, complementing structural and theoretical studies.
Year: 2012 PMID: 22253587 PMCID: PMC3257286 DOI: 10.1371/journal.pcbi.1002351
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Molecular descriptors.
| Term | Description |
| DFIRE | The DFire atomistic distance potential |
| OPUS_PSP | The OPUS-PSP orientational atomistic contact potential |
| OPUS_CA | The OPUS-CA combined residue level potential |
| DDFIRE | The DDFire orientational atomistic distance potential |
| ATOM_P | The proportion of polar atoms at the interface |
| RES_C | The proportion of charged residues at the interface |
| QP_PP | The REFINER residue level contact potential |
| MJPL_PP | The residue level contact potential reported in |
| RO_PP | The residue level contact potential reported in |
| MJ2H_PP | The residue level contact potential reported in |
| GEN_4_BODY | A four-body residue level contact potential |
| SASA | The SASA solvation model |
| LK_SOLV | The EEF1 solvation model |
| NUM_HB | The number of interfacial hydrogen bonds |
| H_BOND | The hydrogen bonding potential implemented in FireDock |
| ROS_HBOND | The hydrogen bonding potential implemented in PyRosetta |
| ROS_FA_ATR | The London dispersion energy implemented in PyRosetta |
| ROS_CG | The PyRosetta coarse-grain potential |
| ROS_CG_BETA | The PyRosetta coarse-grain C |
| ROS_CG_VDW | The PyRosetta coarse-grain Van der Waals potential |
| NIP | An interface packing score |
| STC_H | A simple binding enthalpy score |
| STC_S_SC | A side-chain entropy model |
| S_WLC_INT2 | A disorder to order transition entropy model |
Descriptions of the basic molecular descriptors highlighted in this work. Where descriptors appear in the text without a suffix, this indicates that values are either computed directly or as changes upon complexation, calculated as the difference between the bound complex and the unbound protein in the bound conformation. Those suffixed with _UB pertain to the conformational changes upon binding, and are calculated as the difference between unbound proteins in the bound and unbound conformations. The suffixes _ENS and _EBU correspond respectively to the interaction and conformation descriptors averaged over conformational ensembles. Briefly, CONCOORD 2.1 was used to generate 100 conformations surrounding the complex and its unbound constituents [23]. Descriptors are calculated using mean values derived from these ensembles.
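The suffix conventions above can be sketched as simple differences. This is a minimal illustration only: the function names are hypothetical, each descriptor is treated as a single score per structure, and the exact way the ensemble means are combined for _ENS and _EBU is an assumption based on the caption.

```python
from statistics import mean

def complexation_change(complex_score, unbound_in_bound_conf_score):
    """Unsuffixed descriptor: bound complex minus unbound proteins held in the bound conformation."""
    return complex_score - unbound_in_bound_conf_score

def conformational_change(unbound_in_bound_conf_score, unbound_score):
    """_UB descriptor: unbound proteins in the bound conformation minus the unbound conformation."""
    return unbound_in_bound_conf_score - unbound_score

def ensemble_difference(first_ensemble_scores, second_ensemble_scores):
    """Assumed form of _ENS and _EBU: the same differences, taken between ensemble means."""
    return mean(first_ensemble_scores) - mean(second_ensemble_scores)
```

Under this reading, an _EBU value would be `ensemble_difference` applied to the scores of the unbound proteins across the bound-conformation ensemble and across the unbound ensemble.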
Figure 1. An early stopping surface.
The surface shows how the RMSE of the predicted binding free energies of the test set, calculated via equation 1, varies with the number of features used in the rate constant models. This surface corresponds to scheme 2 in Table 4. The selected association and dissociation rate models, which use two features each, correspond to the RMSE minimum.
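Equation 1 is not reproduced in this section. As a hedged sketch, assuming it is the standard thermodynamic relation ΔG = RT ln(k_off / k_on) and that the models predict log10 rate constants (both assumptions, with hypothetical function names), the combination of rate constants into a binding free energy and its RMSE could look like this:

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(log10_kon, log10_koff, temperature=298.15):
    """Return dG in kcal/mol, assuming k_on in M^-1 s^-1 and k_off in s^-1."""
    log10_kd = log10_koff - log10_kon  # K_d = k_off / k_on
    return R_KCAL * temperature * math.log(10) * log10_kd

def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted))
```

Each point on an early stopping surface would then be the `rmse` of such predictions for one pair of feature counts.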
Results for training, model selection and validation.
| Sel. criterion | Scheme | Assoc. # | Assoc. Corr. | Assoc. RMSE | Dissoc. # | Dissoc. Corr. | Dissoc. RMSE | Selection RMSE | Selection Corr. | Validation RMSE | Validation Corr. | Validation p |
| RMSE | 1 | 2 | 0.70 | 0.89 | 5 | 0.79 | 1.17 | 2.45 | 0.69 | 3.59 | 0.09 | 0.45 |
| RMSE | 2 | 2 | 0.70 | 0.89 | 2 | 0.56 | 1.58 | 3.36 | 0.10 | 2.61 | 0.59 | |
| RMSE | 3 | 8 | 0.77 | 0.86 | 2 | 0.45 | 1.47 | 2.50 | 0.60 | 3.67 | 0.19 | 0.14 |
| RMSE | 4 | 2 | 0.53 | 1.14 | 2 | 0.45 | 1.47 | 3.26 | 0.17 | 2.80 | 0.51 | |
| Corr. | 1 | 2 | 0.70 | 0.89 | 6 | 0.82 | 1.10 | 2.54 | 0.69 | 3.54 | 0.12 | 0.29 |
| Corr. | 2 | 5 | 0.83 | 0.69 | 4 | 0.72 | 1.31 | 3.94 | 0.22 | 3.27 | 0.39 | |
| Corr. | 3 | 3 | 0.61 | 1.06 | 18 | 0.90 | 0.73 | 2.80 | 0.72 | 3.84 | 0.03 | 0.85 |
| Corr. | 4 | 10 | 0.80 | 0.80 | 2 | 0.45 | 1.47 | 3.67 | 0.27 | 2.87 | 0.43 | |
Results for feature selection, model selection and validation, using the two selection criteria and the four data partitioning schemes. The number of features for the association and dissociation rate models is shown (#), alongside their leave-one-out cross-validation correlations and RMSE. The RMSE and correlation of the values used for selecting these models are also shown, as are those when the model is applied to the validation set, along with the significance of the correlation.
Significant correlations between association rates and molecular descriptors.
| Descriptor | Correlation |
| DFIRE_EBU | −0.47 |
| OPUS_PSP_EBU | −0.40 |
| OPUS_CA_EBU | −0.40 |
| DDFIRE_EBU | −0.38 |
| H_BOND_ENS | −0.35 |
| ROS_HBOND_UB | −0.35 |
| ATOM_P | 0.39 |
| NUM_HB | 0.39 |
Significant (p<0.01) correlations between association rates and molecular descriptors using the 44 complexes for which kinetic data is available.
Significant correlations between association rates and molecular descriptors for the validated set.
| Descriptor | Correlation |
| OPUS_PSP_EBU | −0.60 |
| H_BOND_ENS | −0.59 |
| ROS_HBOND_ENS | −0.56 |
| H_BOND | −0.56 |
| DFIRE_EBU | −0.56 |
| QP_PP | −0.52 |
| ROS_FA_ATR_ENS | −0.49 |
| ROS_HBOND | −0.49 |
| STC_S_SC_ENS | −0.48 |
| MJPL_PP | −0.48 |
| ROS_FA_ATR | −0.48 |
| SASA | 0.48 |
| LK_SOLV | 0.49 |
| LK_SOLV_ENS | 0.51 |
| RO_PP | 0.52 |
| NUM_HB | 0.57 |
Significant (p<0.01) correlations between association rates and molecular descriptors using the 27 complexes for which kinetic data is available and the binding affinity is known with high confidence.
Figure 2. A Venn diagram showing the four combinations of training, model selection and validation sets.
The rectangle corresponds to all 137 complexes in the binding affinity benchmark [12]. The left circle corresponds to the 44 complexes for which kinetic data could be found. The right circle corresponds to the set of 57 complexes with high confidence affinities. These are the complexes for which similar affinities have been determined in multiple experimental setups, as previously determined [13]. The intersection of these sets contains 27 complexes.
Results for training, model selection and validation (2OZA omitted).
| Sel. criterion | Scheme | Assoc. # | Assoc. Corr. | Assoc. RMSE | Dissoc. # | Dissoc. Corr. | Dissoc. RMSE | Selection RMSE | Selection Corr. | Validation RMSE | Validation Corr. | Validation p |
| RMSE | 1 | 1 | 0.48 | 1.06 | 4 | 0.80 | 1.15 | 2.84 | 0.51 | 3.76 | 0.08 | 0.48 |
| RMSE | 2 | 1 | 0.48 | 1.06 | 2 | 0.58 | 1.54 | 3.66 | 0.00 | 2.91 | 0.48 | |
| RMSE | 3 | 9 | 0.80 | 0.78 | 5 | 0.73 | 1.11 | 2.36 | 0.72 | 3.46 | 0.25 | |
| RMSE | 4 | 7 | 0.72 | 0.91 | 1 | 0.38 | 1.51 | 3.16 | 0.32 | 2.66 | 0.59 | |
| Corr. | 1 | 1 | 0.48 | 1.06 | 5 | 0.85 | 1.01 | 2.95 | 0.52 | 3.94 | 0.09 | 0.43 |
| Corr. | 2 | 2 | 0.65 | 0.92 | 21 | 1.00 | 0.00 | 4.12 | 0.31 | 3.86 | 0.39 | |
| Corr. | 3 | 9 | 0.80 | 0.78 | 5 | 0.73 | 1.11 | 2.36 | 0.72 | 3.46 | 0.25 | |
| Corr. | 4 | 7 | 0.72 | 0.91 | 2 | 0.51 | 1.43 | 3.18 | 0.33 | 2.55 | 0.60 | |
Results for feature selection, model selection and validation, using the two selection criteria and the four data partitioning schemes. The outlier, 2OZA, was omitted from these runs. The number of features for the association and dissociation rate models is shown (#), alongside their leave-one-out cross-validation correlations and RMSE. The RMSE and correlation of the values used for selecting these models are also shown, as are those when the model is applied to the validation set, along with the significance of the correlation.
Selected models.
| Model | Assoc. feat. | Weight | Norm. weight | RMS error | LOO error | Dissoc. feat. | Weight | Norm. weight | RMS error | LOO error |
| a | CONSTANT | 4.29 | - | 0.81 | 0.89 | CONSTANT | −2.11 | - | 1.41 | 1.58 |
| a | NUM_HB | 7.29e-2 | 0.52 | | | ROS_CG_BETA | −6.77e-1 | −0.73 | | |
| a | DFIRE_EBU | −3.60e-3 | −0.50 | | | OPUS_CA_ENS | 3.77e-2 | 0.67 | | |
| b | CONSTANT | 4.18 | - | 1.05 | 1.14 | CONSTANT | −6.32 | - | 1.39 | 1.47 |
| b | NUM_HB | 7.09e-2 | 0.39 | | | ROS_CG_BETA | −4.89e-1 | −0.52 | | |
| b | DFIRE_EBU | −3.19e-3 | −0.47 | | | NIP | 8.61e3 | 0.51 | | |
| c | CONSTANT | 5.80 | - | 0.76 | 0.90 | CONSTANT | −0.87 | - | 1.44 | 1.52 |
| c | RES_C | −6.87e-2 | −0.53 | | | MJ2H_PP | 1.20e-2 | 0.46 | | |
| c | NUM_HB | 7.99e-2 | 0.42 | | | | | | | |
| c | ROS_CG_VDW | −1.01 | −0.27 | | | | | | | |
| c | STC_H | −5.84e-2 | −0.28 | | | | | | | |
| c | GEN_4_BODY_UB | 1.43e-2 | 0.39 | | | | | | | |
| c | DFIRE_EBU | −2.76e-3 | −0.41 | | | | | | | |
| c | S_WLC_INT2 | −2.77e-1 | −0.19 | | | | | | | |
| d | CONSTANT | 5.80 | - | 0.76 | 0.90 | CONSTANT | −0.67 | - | 1.29 | 1.43 |
| d | RES_C | −6.87e-2 | −0.53 | | | MJ2H_PP | 1.36e-2 | 0.53 | | |
| d | NUM_HB | 7.99e-2 | 0.42 | | | MJPL_PP_UB | 3.98e-3 | 0.40 | | |
| d | ROS_CG_VDW | −1.01 | −0.27 | | | | | | | |
| d | STC_H | −5.84e-2 | −0.28 | | | | | | | |
| d | GEN_4_BODY_UB | 1.43e-2 | 0.39 | | | | | | | |
| d | DFIRE_EBU | −2.76e-3 | −0.41 | | | | | | | |
| d | S_WLC_INT2 | −2.77e-1 | −0.19 | | | | | | | |
The four models that were selected for further analysis. For each feature, the absolute weight and the normalized weight, found after converting to z-scores, are shown. The term CONSTANT refers to the constant determined during regression. The root mean square error (RMS) and the leave-one-out cross-validated error are also shown.
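To make the table concrete, the sketch below evaluates the first model listed (the one built from NUM_HB and DFIRE_EBU) as a plain linear combination of descriptors using the absolute weights. The function name and the descriptor values are hypothetical, and treating the output as a log-scale rate constant is an assumption, since the response variable is not defined in this section.

```python
# Absolute weights copied from the first model in the table above.
FIRST_MODEL = {"CONSTANT": 4.29, "NUM_HB": 7.29e-2, "DFIRE_EBU": -3.60e-3}

def predict(model, descriptors):
    """Linear model: constant plus the weighted sum of descriptor values."""
    return model["CONSTANT"] + sum(
        weight * descriptors[name]
        for name, weight in model.items()
        if name != "CONSTANT"
    )

# Hypothetical descriptor values for a single complex (not from the paper).
example = {"NUM_HB": 10, "DFIRE_EBU": -100.0}
prediction = predict(FIRST_MODEL, example)
```

With these illustrative inputs, more interfacial hydrogen bonds and a more negative DFIRE_EBU both raise the predicted value, in line with the signs of the weights.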
Figure 3. Models a, b, c and d.
The association and dissociation rate models, applied to all the complexes for which kinetic data is available (with outlier 2OZA omitted from models c and d). Complexes in the intersection with the high confidence interactions are shown as circles, with the remainder shown as triangles. Points are coloured according to binding affinity. The combined predictions, applied to the validation set, are also shown. These correspond to the set of high confidence affinities for which the rate constants are not known.
Figure 4. A flowchart of the feature selection algorithm.
The algorithm can be divided into two parts. In the first, a set of descriptor subsets, T, is constructed by iterating over the set of descriptor subsets kept in the previous iteration, S. In the first iteration, S contains only the empty set. For each member of S, new descriptor subsets are created by combining that member with each descriptor not already in it. These are collected into T, and evaluated by their 5-fold cross-validated RMSE in the second part of the algorithm. The 20 best performing subsets are kept for the next iteration, and the subset with the lowest RMSE is stored for later model selection and validation. If the lowest RMSE in the current iteration is higher than the lowest RMSE found in all previous iterations, then the speculative round counter is incremented. Otherwise it is reset to 0. The algorithm terminates after 10 consecutive speculative rounds.
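The search described in the caption can be sketched as a beam search over descriptor subsets. This is an illustration under stated assumptions, not the authors' implementation: ordinary least squares (with a tiny ridge term for numerical stability) stands in for the unspecified regression method, folds are assigned by row index, and all names and the synthetic data are hypothetical.

```python
import math
import random

def fit_ols(rows, targets):
    """Least-squares weights (intercept first) via Gauss-Jordan on the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (1e-9 if i == j else 0.0) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, targets))] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(n):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]

def cv_rmse(data, targets, subset, folds=5):
    """5-fold cross-validated RMSE of a linear model restricted to `subset`."""
    cols, sq_err = sorted(subset), 0.0
    for k in range(folds):
        train = [i for i in range(len(data)) if i % folds != k]
        test = [i for i in range(len(data)) if i % folds == k]
        w = fit_ols([[data[i][c] for c in cols] for i in train], [targets[i] for i in train])
        for i in test:
            pred = w[0] + sum(wj * data[i][c] for wj, c in zip(w[1:], cols))
            sq_err += (pred - targets[i]) ** 2
    return math.sqrt(sq_err / len(data))

def select_features(data, targets, beam=20, patience=10):
    """Return the best (rmse, subset) found in each iteration of the search."""
    n_features = len(data[0])
    kept = [frozenset()]  # S: only the empty set in the first iteration
    best_per_iteration, best_so_far, speculative = [], math.inf, 0
    while speculative < patience:
        # T: every kept subset extended by one descriptor it does not yet contain
        candidates = {s | {f} for s in kept for f in range(n_features) if f not in s}
        if not candidates:
            break  # every descriptor is already in use
        scored = sorted((cv_rmse(data, targets, t), sorted(t)) for t in candidates)
        kept = [frozenset(t) for _, t in scored[:beam]]
        best_per_iteration.append(scored[0])
        if scored[0][0] > best_so_far:
            speculative += 1  # no improvement: a speculative round
        else:
            best_so_far, speculative = scored[0][0], 0
    return best_per_iteration

# Synthetic demonstration: the target depends only on descriptors 0 and 2.
rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
targets = [2 * row[0] - 3 * row[2] + rng.gauss(0, 0.1) for row in data]
result = select_features(data, targets)
```

The list returned by `select_features` holds one stored model per feature count, which is the input a later model selection step (such as the early stopping surface of Figure 1) would choose from.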