Hongjian Li, Kwong-Sak Leung, Man-Hon Wong, Pedro J Ballester.
Abstract
BACKGROUND: State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and are thus vital for discriminating between active and inactive compounds. Despite intensive research over the years, classical scoring functions have reached a plateau in their predictive performance. These functions assume a predetermined additive functional form for some sophisticated numerical features, and use standard multivariate linear regression (MLR) on experimental data to derive the coefficients.
Year: 2014 PMID: 25159129 PMCID: PMC4153907 DOI: 10.1186/1471-2105-15-291
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The three combinations of three different sets of features used to train RF models in this study
| Model | Features |
|---|---|
| RF::Cyscore | 4 Cyscore features |
| RF::CyscoreVina | 4 Cyscore features + 6 AutoDock Vina features |
| RF::CyscoreVinaElem | 4 Cyscore features + 6 AutoDock Vina features + 36 RF-Score features |
The statistics of the five partitions of PDBbind v2013 refined set (N = 2959)
| # | Complexes | Lowest pKd | Highest pKd |
|---|---|---|---|
| 1 | 592 | 2.00 | 11.74 |
| 2 | 592 | 2.00 | 11.80 |
| 3 | 592 | 2.00 | 11.85 |
| 4 | 592 | 2.00 | 11.92 |
| 5 | 591 | 2.05 | 11.72 |
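The partition sizes above, and the training-set sizes listed for the v2013 benchmark, are consistent with a round-robin split of the 2959 complexes. The following is a minimal sketch under stated assumptions, not the authors' procedure: sorting by affinity before dealing is an assumption, the function name `round_robin_partitions` is hypothetical, and the affinity values are evenly spaced placeholders rather than PDBbind data.

```python
from itertools import accumulate


def round_robin_partitions(values, k=5):
    """Sort the values, then deal them into k partitions in turn.

    Sorting first is an assumption; it would give every partition a
    similar range of affinities.
    """
    ordered = sorted(values)
    return [ordered[i::k] for i in range(k)]


# Placeholder affinities: 2959 evenly spaced values, NOT real PDBbind data.
n_complexes = 2959
placeholder_pkd = [2.0 + 9.92 * i / (n_complexes - 1) for i in range(n_complexes)]

partitions = round_robin_partitions(placeholder_pkd, k=5)
sizes = [len(p) for p in partitions]

# Holding out the first partition as the test set and adding the remaining
# partitions one at a time gives a growing series of training-set sizes.
training_sizes = list(accumulate(sizes[1:]))

print(sizes)
print(training_sizes)
```

Dealing the sorted list in turn leaves the last partition one complex short whenever the total is not a multiple of five, which is the pattern seen in the table.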
The numbers of test samples and training samples for the PDBbind v2007, v2012 and v2013 benchmarks used in this study
| Benchmark | Test samples | Training samples |
|---|---|---|
| v2007 | 195 | 247, 1105 |
| v2012 | 201 | 247, 2696 |
| v2013 | 592 | 592, 1184, 1776, 2367 |
Figure 1. Prediction performance of MLR::Cyscore, RF::Cyscore, RF::CyscoreVina and RF::CyscoreVinaElem trained with varying numbers of samples. First row: root mean square error RMSE. Second row: standard deviation SD in linear correlation. Third row: Pearson correlation coefficient Rp. Fourth row: Spearman correlation coefficient Rs. Left column: PDBbind v2007 benchmark (N = 195). Center column: PDBbind v2012 benchmark (N = 201). Right column: PDBbind v2013 round-robin benchmark (N = 592).
Cross validation results of the four models on the five partitions of PDBbind v2013 refined set (N = 2959) in terms of root mean square error RMSE, standard deviation SD in linear correlation, Pearson correlation coefficient Rp and Spearman correlation coefficient Rs
| | | MLR::Cyscore | | | | RF::Cyscore | | | | RF::CyscoreVina | | | | RF::CyscoreVinaElem | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # | N | RMSE | SD | Rp | Rs | RMSE | SD | Rp | Rs | RMSE | SD | Rp | Rs | RMSE | SD | Rp | Rs |
| 1 | 592 | 1.66 | 1.66 | 0.560 | 0.555 | 1.60 | 1.60 | 0.601 | 0.588 | 1.41 | 1.41 | 0.708 | 0.709 | 1.33 | 1.33 | 0.748 | 0.746 |
| 2 | 592 | 1.62 | 1.62 | 0.589 | 0.600 | 1.51 | 1.51 | 0.657 | 0.641 | 1.38 | 1.37 | 0.730 | 0.725 | 1.30 | 1.29 | 0.764 | 0.766 |
| 3 | 592 | 1.69 | 1.70 | 0.531 | 0.529 | 1.66 | 1.66 | 0.561 | 0.545 | 1.49 | 1.49 | 0.668 | 0.665 | 1.41 | 1.41 | 0.711 | 0.709 |
| 4 | 592 | 1.68 | 1.68 | 0.542 | 0.557 | 1.63 | 1.63 | 0.580 | 0.576 | 1.51 | 1.51 | 0.657 | 0.661 | 1.41 | 1.41 | 0.711 | 0.722 |
| 5 | 591 | 1.65 | 1.65 | 0.559 | 0.553 | 1.57 | 1.57 | 0.615 | 0.586 | 1.42 | 1.42 | 0.701 | 0.692 | 1.30 | 1.30 | 0.758 | 0.749 |
| avg | | 1.66 | 1.66 | 0.556 | 0.559 | 1.59 | 1.59 | 0.603 | 0.587 | 1.44 | 1.44 | 0.693 | 0.690 | 1.35 | 1.35 | 0.738 | 0.738 |
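The four metrics reported in these tables can be computed from paired predicted and measured affinities roughly as follows. This is a hedged sketch: the source names the metrics but does not give formulas, so the SD convention used here (residual standard deviation of a least-squares line fitting measured to predicted values, with N - 2 in the denominator) is an assumption, as are the function names and the toy numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def evaluate(predicted, measured):
    """Return RMSE, SD in linear correlation, Rp and Rs."""
    n = len(measured)
    rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(predicted, measured)) / n)
    # Least-squares line: measured ~ a + b * predicted.
    mp, my = sum(predicted) / n, sum(measured) / n
    b = (sum((p - mp) * (y - my) for p, y in zip(predicted, measured))
         / sum((p - mp) ** 2 for p in predicted))
    a = my - b * mp
    # Assumed convention: residual SD with N - 2 degrees of freedom.
    sd = math.sqrt(sum((y - (a + b * p)) ** 2
                       for p, y in zip(predicted, measured)) / (n - 2))
    rp = pearson(predicted, measured)
    rs = pearson(average_ranks(predicted), average_ranks(measured))
    return {"RMSE": rmse, "SD": sd, "Rp": rp, "Rs": rs}


# Toy values for illustration only, not taken from the paper.
toy_predicted = [5.1, 6.3, 7.0, 8.2, 4.4]
toy_measured = [5.0, 6.9, 6.5, 8.8, 4.0]
print(evaluate(toy_predicted, toy_measured))
```

Under this convention RMSE penalizes any systematic offset between predicted and measured values, whereas SD, Rp and Rs do not, because they are computed after a linear fit or on ranks.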
Leave-cluster-out cross validation results of the four models on the 23 protein families (A to W) and 3 multi-family (X to Z) clusters of PDBbind v2009 refined set (N = 1740) in terms of root mean square error RMSE, standard deviation SD in linear correlation, Pearson correlation coefficient Rp and Spearman correlation coefficient Rs
| | | | MLR::Cyscore | | | | RF::Cyscore | | | | RF::CyscoreVina | | | | RF::CyscoreVinaElem | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster name | Cluster | N | RMSE | SD | Rp | Rs | RMSE | SD | Rp | Rs | RMSE | SD | Rp | Rs | RMSE | SD | Rp | Rs |
| HIV protease | A | 188 | 1.65 | 1.53 | 0.259 | 0.216 | 1.70 | 1.51 | 0.310 | 0.201 | 1.76 | 1.56 | 0.182 | 0.105 | 1.77 | 1.56 | 0.166 | 0.129 |
| trypsin | B | 74 | 1.24 | 1.11 | 0.612 | 0.695 | 1.10 | 1.11 | 0.610 | 0.636 | 0.96 | 0.97 | 0.723 | 0.700 | 0.93 | 0.93 | 0.751 | 0.715 |
| carbonic anhydrase | C | 57 | 2.47 | 1.35 | 0.473 | 0.343 | 2.44 | 1.43 | 0.368 | 0.264 | 2.60 | 1.37 | 0.448 | 0.372 | 2.33 | 1.35 | 0.481 | 0.234 |
| thrombin | D | 53 | 1.52 | 1.40 | 0.702 | 0.676 | 1.50 | 1.44 | 0.680 | 0.611 | 1.47 | 1.45 | 0.675 | 0.675 | 1.46 | 1.40 | 0.699 | 0.680 |
| protein tyrosine phosphatase | E | 32 | 1.23 | 1.06 | 0.411 | 0.313 | 1.30 | 1.10 | 0.338 | 0.268 | 1.36 | 0.98 | 0.538 | 0.542 | 1.23 | 0.89 | 0.643 | 0.615 |
| factor Xa | F | 32 | 1.18 | 0.96 | 0.604 | 0.634 | 1.54 | 1.13 | 0.367 | 0.356 | 1.53 | 1.02 | 0.533 | 0.498 | 1.61 | 1.07 | 0.470 | 0.470 |
| urokinase | G | 29 | 1.15 | 1.14 | 0.643 | 0.602 | 1.10 | 1.14 | 0.642 | 0.645 | 1.25 | 1.27 | 0.516 | 0.436 | 1.05 | 1.06 | 0.699 | 0.624 |
| different similar transporters | H | 29 | 0.96 | 0.96 | 0.285 | 0.122 | 1.27 | 0.99 | 0.056 | -0.040 | 1.10 | 0.98 | 0.188 | 0.077 | 1.01 | 0.93 | 0.354 | 0.123 |
| c-AMP dependent kinase | I | 17 | 1.32 | 1.15 | 0.537 | 0.537 | 1.16 | 1.11 | 0.582 | 0.602 | 0.94 | 0.91 | 0.748 | 0.664 | 1.06 | 0.91 | 0.747 | 0.644 |
| | J | 17 | 1.03 | 0.78 | 0.383 | 0.316 | 1.04 | 0.76 | 0.444 | 0.365 | 0.92 | 0.72 | 0.518 | 0.443 | 1.05 | 0.68 | 0.597 | 0.649 |
| antibodies | K | 16 | 1.41 | 1.43 | 0.693 | 0.706 | 1.67 | 1.76 | 0.455 | 0.466 | 1.47 | 1.51 | 0.645 | 0.643 | 1.36 | 1.33 | 0.739 | 0.777 |
| casein kinase II | L | 16 | 0.75 | 0.58 | 0.538 | 0.358 | 0.76 | 0.58 | 0.535 | 0.330 | 0.90 | 0.60 | 0.493 | 0.322 | 0.97 | 0.61 | 0.454 | 0.309 |
| ribonuclease | M | 15 | 1.12 | 1.20 | 0.230 | 0.340 | 1.07 | 1.06 | 0.505 | 0.281 | 1.11 | 0.99 | 0.595 | 0.481 | 1.23 | 1.03 | 0.551 | 0.493 |
| thermolysin | N | 14 | 1.15 | 1.14 | 0.680 | 0.635 | 0.98 | 1.03 | 0.748 | 0.648 | 1.04 | 1.12 | 0.696 | 0.565 | 0.97 | 1.05 | 0.738 | 0.636 |
| CDK2 kinase | O | 13 | 1.06 | 0.80 | 0.841 | 0.812 | 1.14 | 1.01 | 0.733 | 0.817 | 1.14 | 1.02 | 0.729 | 0.661 | 1.12 | 1.14 | 0.640 | 0.525 |
| glutamate receptor 2 | P | 13 | 1.08 | 0.85 | 0.070 | 0.096 | 1.09 | 0.85 | 0.120 | 0.097 | 1.08 | 0.85 | 0.116 | 0.121 | 1.00 | 0.84 | 0.123 | 0.016 |
| P38 kinase | Q | 13 | 0.55 | 0.57 | 0.834 | 0.896 | 0.76 | 0.66 | 0.762 | 0.757 | 0.95 | 0.62 | 0.799 | 0.764 | 0.59 | 0.51 | 0.870 | 0.896 |
| | R | 12 | 1.44 | 1.33 | 0.892 | 0.725 | 1.57 | 1.51 | 0.858 | 0.620 | 1.54 | 1.51 | 0.860 | 0.687 | 1.43 | 1.31 | 0.895 | 0.687 |
| tRNA-guanine transglycosylase | S | 12 | 0.90 | 0.95 | 0.463 | 0.544 | 1.06 | 1.04 | 0.212 | 0.375 | 0.87 | 0.95 | 0.457 | 0.403 | 0.87 | 0.95 | 0.457 | 0.522 |
| endothiapepsin | T | 11 | 1.18 | 1.30 | 0.435 | 0.215 | 1.28 | 1.35 | 0.358 | 0.210 | 1.35 | 1.36 | 0.345 | 0.215 | 1.36 | 1.27 | 0.480 | 0.210 |
| | U | 10 | 1.67 | 1.63 | -0.004 | 0.248 | 1.65 | 1.62 | 0.116 | 0.188 | 1.73 | 1.62 | 0.089 | 0.176 | 1.83 | 1.63 | 0.053 | 0.103 |
| carboxypeptidase A | V | 10 | 2.13 | 1.99 | 0.479 | 0.523 | 1.90 | 1.89 | 0.556 | 0.370 | 1.82 | 1.76 | 0.632 | 0.467 | 1.77 | 1.54 | 0.734 | 0.685 |
| penicillopepsin | W | 10 | 1.71 | 1.87 | 0.339 | 0.188 | 1.78 | 1.94 | 0.236 | 0.188 | 1.81 | 1.96 | 0.183 | 0.030 | 1.91 | 1.99 | 0.078 | -0.030 |
| families with 4-9 complexes | X | 386 | 1.73 | 1.71 | 0.500 | 0.577 | 1.61 | 1.60 | 0.587 | 0.598 | 1.58 | 1.56 | 0.610 | 0.612 | 1.54 | 1.53 | 0.630 | 0.632 |
| families with 2-3 complexes | Y | 340 | 1.64 | 1.64 | 0.510 | 0.495 | 1.64 | 1.63 | 0.522 | 0.505 | 1.55 | 1.55 | 0.583 | 0.580 | 1.51 | 1.52 | 0.608 | 0.595 |
| singletons | Z | 321 | 1.76 | 1.74 | 0.407 | 0.417 | 1.81 | 1.75 | 0.397 | 0.395 | 1.70 | 1.68 | 0.476 | 0.467 | 1.67 | 1.65 | 0.503 | 0.507 |
| average | | | 1.35 | 1.24 | 0.493 | 0.470 | 1.38 | 1.27 | 0.465 | 0.414 | 1.37 | 1.23 | 0.515 | 0.450 | 1.33 | 1.18 | 0.545 | 0.479 |
| standard deviation | | | 0.41 | 0.38 | 0.216 | 0.217 | 0.38 | 0.37 | 0.209 | 0.212 | 0.39 | 0.36 | 0.211 | 0.211 | 0.39 | 0.35 | 0.228 | 0.251 |
Prediction performance of 25 scoring functions evaluated on PDBbind v2007 core set (N = 195) in terms of Pearson correlation coefficient Rp, Spearman correlation coefficient Rs and standard deviation SD in linear correlation on the test set
| Scoring function | Rp | Rs | SD |
|---|---|---|---|
| RF::CyscoreVinaElem | 0.803 | 0.798 | 1.42 |
| RF-Score::Elem-v2 | 0.803 | 0.797 | 1.54 |
| SFCscoreRF | 0.779 | 0.788 | 1.56 |
| RF-Score | 0.774 | 0.762 | 1.59 |
| ID-Score | 0.753 | 0.779 | 1.63 |
| RF::CyscoreVina | 0.749 | 0.759 | 1.58 |
| SVR-Score | 0.726 | 0.739 | 1.70 |
| RF::Cyscore | 0.687 | 0.694 | 1.73 |
| Cyscore | 0.660 | 0.687 | 1.79 |
| X-Score::HMScore | 0.644 | 0.705 | 1.83 |
| DrugScoreCSD | 0.569 | 0.627 | 1.96 |
| SYBYL::ChemScore | 0.555 | 0.585 | 1.98 |
| DS::PLP1 | 0.545 | 0.588 | 2.00 |
| GOLD::ASP | 0.534 | 0.577 | 2.02 |
| SYBYL::G-Score | 0.492 | 0.536 | 2.08 |
| DS::LUDI3 | 0.487 | 0.478 | 2.09 |
| DS::LigScore2 | 0.464 | 0.507 | 2.12 |
| GlideScore-XP | 0.457 | 0.435 | 2.14 |
| DS::PMF | 0.445 | 0.448 | 2.14 |
| GOLD::ChemScore | 0.441 | 0.452 | 2.15 |
| SYBYL::D-Score | 0.392 | 0.447 | 2.19 |
| DS::Jain | 0.316 | 0.346 | 2.24 |
| GOLD::GoldScore | 0.295 | 0.322 | 2.29 |
| SYBYL::PMF-Score | 0.268 | 0.273 | 2.29 |
| SYBYL::F-Score | 0.216 | 0.243 | 2.35 |
The scoring functions are sorted in descending order of Rp. RF::CyscoreVinaElem and Cyscore rank 1st and 9th respectively in terms of Rp. The statistics for the other 21 scoring functions are collected from [8, 22, 31].
Figure 2. Correlation plots of predicted binding affinities against measured ones. Top row: Cyscore. Bottom row: RF::CyscoreVinaElem. Left column: PDBbind v2007 benchmark (N = 195), with RF::CyscoreVinaElem trained on 1105 complexes. Center column: PDBbind v2012 benchmark (N = 201), with RF::CyscoreVinaElem trained on 2696 complexes. Right column: PDBbind v2013 round-robin benchmark (N = 592), with RF::CyscoreVinaElem trained on 2367 complexes.
Figure 3. RF::Cyscore feature importance estimated on internal OOB data of the 1105 complexes from PDBbind v2007 refined set. The four features are hydrophobic free energy (Hydrophobic), van der Waals interaction energy (Vdw), hydrogen bond interaction energy (HBond) and ligand's conformational entropy (Ent). The %IncMSE value of a particular feature was computed as the percentage increase in mean square error observed in OOB prediction when that feature was randomly permuted.
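The %IncMSE calculation described in the caption can be illustrated with a small permutation sketch. It rests on several assumptions: a fixed stand-in predictor (`toy_model`) replaces the random forest, a single held-out data set replaces the per-tree OOB samples, and the feature values and weights are invented, so the printed numbers say nothing about the actual importance of the four Cyscore features.

```python
import random


def mse(model, rows, targets):
    """Mean square error of model predictions over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def percent_inc_mse(model, rows, targets, feature, rng):
    """Percentage increase in MSE after shuffling one feature column."""
    base = mse(model, rows, targets)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return 100.0 * (mse(model, permuted, targets) - base) / base


rng = random.Random(0)
feature_names = ["Hydrophobic", "Vdw", "HBond", "Ent"]

# Invented data: four uniform random features per row (hypothetical values).
rows = [[rng.uniform(0.0, 1.0) for _ in feature_names] for _ in range(200)]


def toy_model(row):
    """Stand-in predictor that uses only the first two features."""
    return 3.0 * row[0] + 1.0 * row[1]


# Targets are the stand-in predictions plus a little Gaussian noise.
targets = [toy_model(r) + rng.gauss(0.0, 0.1) for r in rows]

for index, name in enumerate(feature_names):
    print(name, round(percent_inc_mse(toy_model, rows, targets, index, rng), 1))
```

A feature the predictor ignores yields no increase at all when permuted, while a heavily weighted feature yields a large one; this is the property that makes the measure usable as an importance ranking.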