Yoshifumi Fukunishi, Satoshi Yamasaki, Isao Yasumatsu, Koh Takeuchi, Takashi Kurosawa, Haruki Nakamura.
Abstract
To improve docking score correction, we developed several structure-based quantitative structure-activity relationship (QSAR) models using protein-drug docking simulations and applied these models to public affinity data. The prediction models used descriptor-based regression, in which the compound descriptor was a set of docking scores against multiple (∼600) proteins, including nontargets. The binding free energy corresponding to the docking score was approximated by a weighted average of docking scores for multiple proteins, and we tried linear, weighted linear, and polynomial regression models considering compound similarities. In addition, we tried a combination of these regression models for individual data sets such as IC50, Ki, and %inhibition values. The cross-validation results showed that the weighted linear model was more accurate than the simple linear regression model. Thus, QSAR approaches based on the affinity data of public databases should improve docking scores.
Keywords: Binding free energy; ChEMBL; Docking score; Protein-compound docking
Year: 2016 PMID: 28001004 PMCID: PMC5297997 DOI: 10.1002/minf.201600013
Source DB: PubMed Journal: Mol Inform ISSN: 1868-1743 Impact factor: 3.353
Figure 1. Schematic representation of each principal component regression (PCR) model.
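As a rough illustration of the descriptor-based regression described in the abstract, the sketch below fits a plain principal component regression: each compound is described by its docking scores against many proteins, the score matrix is reduced to a few principal components, and the binding free energy is regressed on those components. The function names, the matrix layout, and the number of components are assumptions made for illustration; the record does not specify how the authors implemented their models.

```python
import numpy as np

def fit_pcr(scores, dg, n_components):
    """Fit a principal component regression (illustrative sketch only).

    scores: (n_compounds, n_proteins) docking-score matrix (assumed layout).
    dg:     (n_compounds,) experimental binding free energies in kcal/mol.
    """
    mean = scores.mean(axis=0)
    centered = scores - mean
    # Principal axes of the centered descriptor matrix via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components].T                 # (n_proteins, n_components)
    pcs = centered @ axes                      # compound coordinates in PC space
    # Multilinear regression of dG on the PC coordinates, with an intercept.
    design = np.column_stack([np.ones(len(dg)), pcs])
    coef, *_ = np.linalg.lstsq(design, dg, rcond=None)
    return mean, axes, coef

def predict_pcr(model, scores):
    """Predict dG for new compounds from their docking-score descriptors."""
    mean, axes, coef = model
    pcs = (scores - mean) @ axes
    return coef[0] + pcs @ coef[1:]
```

Because the fitted ΔG is a linear function of the principal components, which are themselves linear combinations of the docking scores, the prediction is effectively a weighted sum of docking scores over the protein set plus a constant.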
Statistics of ΔG values (kcal/mol) converted from ChEMBL data.
| Data type | Average | σ [a] | ΔGmin [b] | ΔGmax [c] | RMSD [d] |
|---|---|---|---|---|---|
| Whole | −9.67 | 2.41 | −18.51 | −0.55 | – |
| Ki/Kd | −9.30 | 1.69 | −16.30 | −1.34 | 1.55 |
| IC50 | −9.37 | 2.18 | −18.51 | −0.55 | 1.87 |
| Activity | −0.88 | 5.74 | −9.35 | −4.11 | 2.58 |
| %inhibition | −2.66 | 4.14 | −9.35 | −4.11 | 2.99 |
[a] The standard deviation of the whole observed data (kcal/mol). [b] The minimum ΔG value of the data set (kcal/mol). [c] The maximum ΔG value of the data set (kcal/mol). [d] The root mean square deviation (RMSD) of the multiply‐observed data for the same protein‐ligand pairs (kcal/mol)
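The table above reports affinities as ΔG in kcal/mol. A common way to obtain such values from a dissociation or inhibition constant is the thermodynamic relation ΔG = RT ln K, with K in mol/L. The sketch below applies that relation; the temperature of 298.15 K, the function name, and the idea of passing an IC50 in place of a true equilibrium constant are assumptions, since the record does not state the conversion the authors used.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol K)

def dg_from_affinity(k_nanomolar, temperature=298.15):
    """Convert an affinity constant (Ki, Kd, or an IC50 used as a proxy)
    given in nM to a binding free energy in kcal/mol via dG = RT ln K.

    The temperature default is an assumption, not taken from the source.
    """
    if k_nanomolar <= 0:
        raise ValueError("affinity constant must be positive")
    k_molar = k_nanomolar * 1e-9
    return GAS_CONSTANT * temperature * math.log(k_molar)
```

Under these assumptions a 1 nM binder corresponds to about −12.3 kcal/mol and a 1 µM binder to about −8.2 kcal/mol, so each tenfold change in K shifts ΔG by roughly 1.4 kcal/mol.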
Correlation coefficient (R) and area under the curve (AUC) of the receiver operating characteristic (ROC) curve by mathematical simulation.
| Error exp [a] | Error calc [b] | R | AUC (%) |
|---|---|---|---|
| 0.58 | 0.82 | 0.90 | 97 |
| 0.58 | 1.29 | 0.79 | 94 |
| 0.58 | 1.51 | 0.74 | 91 |
| 0.58 | 1.83 | 0.67 | 88 |
| 0.70 | 0.91 | 0.88 | 97 |
| 0.70 | 1.35 | 0.78 | 94 |
| 0.70 | 1.55 | 0.73 | 92 |
| 0.70 | 1.87 | 0.66 | 88 |
| 1.16 | 1.30 | 0.79 | 97 |
| 1.16 | 1.64 | 0.70 | 93 |
| 1.16 | 1.81 | 0.66 | 92 |
| 1.16 | 2.09 | 0.59 | 88 |
| 1.39 | 1.51 | 0.75 | 97 |
| 1.39 | 1.81 | 0.66 | 94 |
| 1.39 | 1.97 | 0.62 | 92 |
| 1.39 | 2.23 | 0.56 | 88 |
[a] Simulated experimental error (kcal/mol). [b] Simulated prediction error (kcal/mol).
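One way to see how experimental and prediction errors limit R and AUC is a small Monte Carlo experiment: draw "true" ΔG values, add independent Gaussian noise for the measurement and for the prediction, and score the result. The sketch below does this under several assumptions that are not taken from the source: Gaussian errors, a true ΔG distribution borrowed from the "Whole" row of the ΔG statistics table, and actives defined as the top 10 % of observed binders. It is not expected to reproduce the tabulated values.

```python
import numpy as np

def simulate_r_auc(err_exp, err_calc, mean_true=-9.67, sigma_true=2.41,
                   active_fraction=0.1, n=20000, seed=0):
    """Simulate the correlation coefficient and ROC AUC for given error levels.

    err_exp / err_calc: standard deviations (kcal/mol) of the simulated
    experimental and prediction errors (interpretation assumed).
    """
    rng = np.random.default_rng(seed)
    true_dg = rng.normal(mean_true, sigma_true, n)
    observed = true_dg + rng.normal(0.0, err_exp, n)
    predicted = true_dg + rng.normal(0.0, err_calc, n)

    r = np.corrcoef(observed, predicted)[0, 1]

    # Actives: the most negative observed dG values (assumed definition).
    active = observed <= np.quantile(observed, active_fraction)
    # AUC from the rank-sum identity; a more negative prediction scores higher.
    score = -predicted
    ranks = np.argsort(np.argsort(score)) + 1
    n_pos = int(active.sum())
    n_neg = n - n_pos
    auc = (ranks[active].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return float(r), float(auc)
```

Calling `simulate_r_auc(0.58, 0.82)` and `simulate_r_auc(1.39, 2.23)` shows the qualitative trend in the table: both R and AUC fall as either error grows, with AUC degrading more slowly than R.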
Figure 2. Schematic representation of the leave‐one‐out (LOO) cross‐validation procedure for models 1–6. The similarity calculation and the copying of assay data (in model 6) were applied only for the weighted principal component regression (PCR) model (gray boxes). Nc is the number of compounds. For the replica and partial‐replica PCR models, the initial docking scores and set of ΔG were replicated. *: Regression type was polynomial only for the polynomial PCR model. Otherwise, the regression was a multilinear regression (MLR).
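The leave‐one‐out procedure in the caption can be sketched as follows: each compound is held out in turn, a multilinear regression is fitted on the remaining compounds, and the held-out ΔG is predicted; Q and the cross-validated RMSE are then computed from those predictions. The optional `weight_fn` argument stands in for a similarity-based weighting of training compounds; its form is hypothetical, as is the choice to pass precomputed descriptors (for example, principal component coordinates) as `x`. This is an illustrative sketch, not the authors' pipeline.

```python
import numpy as np

def loo_predictions(x, y, weight_fn=None):
    """Leave-one-out predictions from (optionally weighted) multilinear regression.

    x: (n_compounds, n_features) descriptors; y: (n_compounds,) dG values.
    weight_fn(x_train, x_query) -> non-negative weights, one per training row
    (hypothetical hook for similarity weighting).
    """
    n = len(y)
    design = np.column_stack([np.ones(n), x])
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        a, b = design[train], y[train]
        if weight_fn is not None:
            # Weighted least squares: scale each row by sqrt(weight).
            w = np.sqrt(weight_fn(x[train], x[i]))
            a, b = a * w[:, None], b * w
        coef, *_ = np.linalg.lstsq(a, b, rcond=None)
        preds[i] = design[i] @ coef
    return preds

def q_and_rmse(y, preds):
    """Cross-validated correlation coefficient (Q) and RMSE."""
    q = np.corrcoef(y, preds)[0, 1]
    rmse = np.sqrt(np.mean((y - preds) ** 2))
    return float(q), float(rmse)
```

A stricter protocol would also recompute the principal components inside each fold so that the held-out compound never influences the descriptors; the sketch omits this for brevity.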
Average correlation coefficient (R/Q) between the experimental data and the calculated data obtained by the six regression models over all 97 proteins.
| Model | x | NR [e] | R [a] | RMSE [b] | Q [c] | RMSE [d] |
|---|---|---|---|---|---|---|
| Simple PCR model | | | 0.81 | 1.17 | 0.63 | 1.58 |
| Polynomial PCR model | | | 0.69 | 1.47 | 0.58 | 1.66 |
| Replica PCR model | | 1 | 0.81 | 1.17 | 0.63 | 1.59 |
| | | 2 | 0.81 | 1.17 | 0.63 | 1.59 |
| | | 5 | 0.81 | 1.17 | 0.62 | 1.60 |
| | | 10 | 0.81 | 1.17 | 0.60 | 1.64 |
| Partial‐replica PCR model | | 1 | 0.82 | 1.19 | 0.63 | 1.59 |
| | | 2 | 0.82 | 1.16 | 0.63 | 1.59 |
| | | 5 | 0.82 | 3.26 | 0.62 | 1.60 |
| | | 10 | 0.82 | 1.19 | 0.60 | 1.63 |
| Weighted PCR model | 0.1 | 1 | 0.89 | 0.87 | 0.66 | 1.54 |
| | | 2 | 0.89 | 0.87 | 0.66 | 1.54 |
| | | 4 | 0.89 | 0.87 | 0.66 | 1.54 |
| | | 8 | 0.89 | 0.87 | 0.65 | 1.56 |
| | 0.2 | 1 | 0.89 | 0.87 | 0.66 | 1.54 |
| | | 2 | 0.89 | 0.87 | 0.65 | 1.55 |
| | | 4 | 0.89 | 0.87 | 0.65 | 1.55 |
| | | 8 | 0.89 | 0.87 | 0.64 | 1.57 |
| | 0.3 | 1 | 0.89 | 0.87 | 0.65 | 1.54 |
| | | 2 | 0.89 | 0.87 | 0.65 | 1.55 |
| | | 4 | 0.89 | 0.87 | 0.65 | 1.56 |
| | | 8 | 0.89 | 0.87 | 0.64 | 1.58 |
| | 0.5 | 1 | 0.89 | 0.87 | 0.65 | 1.54 |
| | | 2 | 0.89 | 0.87 | 0.65 | 1.55 |
| | | 4 | 0.89 | 0.87 | 0.64 | 1.56 |
| | | 8 | 0.89 | 0.87 | 0.63 | 1.58 |
| Classified PCR | | | 0.92 | 0.42 | 0.71 | 0.98 |
| Combined PCR | | | 0.92 | 0.42 | 0.61 [f] | ND [f] |
[a] Average correlation coefficient between the experimental and calculated data. [b] Average root mean square error (RMSE) between the experimental and calculated data (kcal/mol). [c] Average correlation coefficient between the experimental and calculated data obtained by the leave‐one‐out (LOO) cross‐validation test. [d] Average RMSE between the experimental and calculated data obtained by the LOO cross‐validation test (kcal/mol). [e] Number of replicas. [f] In 12 cases out of 97 proteins, the RMSE was > 10^6.
Figure 3. Correlation between the experimental and predicted data for all 97 kinase proteins obtained by the weighted PCR model with x=0.1 and the number of replicas=1. Black dots represent the least‐squares fitting line, and the correlation coefficient is 0.76. The dotted line represents the fitted result.