Oghenejokpeme I Orhobor, Abbi Abdel Rehim, Hang Lou, Hao Ni, Ross D King.
Abstract
The representation of the protein-ligand complexes used in building machine learning models plays an important role in the accuracy of binding affinity prediction. The Extended Connectivity Interaction Features (ECIF) is one such representation. We report that (i) including the discretized distances between protein-ligand atom pairs in the ECIF scheme improves predictive accuracy, and (ii) in an evaluation using gradient boosted trees, the resampling method used to select the best hyperparameters has a strong effect on predictive performance, especially for benchmarking purposes.
Keywords: machine learning; protein binding affinity prediction; scoring functions
Year: 2022 PMID: 35573039 PMCID: PMC9066299 DOI: 10.1098/rsos.211745
Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 3.653
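The first finding in the abstract concerns adding discretized distances to protein-ligand atom-pair counts. The sketch below illustrates that general idea only: the simplified atom-type labels, the 1 Å bin width, the 6 Å cutoff, and the function name `pair_distance_features` are all illustrative assumptions and are not taken from the authors' PDECIF definition.

```python
import math
from collections import Counter

def pair_distance_features(protein_atoms, ligand_atoms, cutoff=6.0, bin_width=1.0):
    """Count (protein type, ligand type, distance bin) triples within a cutoff.

    Each atom is a (type_label, (x, y, z)) tuple with coordinates in angstroms.
    The bin width and labelling scheme are assumptions for illustration.
    """
    counts = Counter()
    for p_type, p_xyz in protein_atoms:
        for l_type, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)  # Euclidean distance (Python 3.8+)
            if d <= cutoff:
                distance_bin = int(d // bin_width)  # discretize the distance
                counts[(p_type, l_type, distance_bin)] += 1
    return counts

# Hypothetical toy complex: two protein atoms and two ligand atoms.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0))]
ligand = [("O", (0.0, 0.0, 2.5)), ("C", (12.0, 0.0, 0.0))]
features = pair_distance_features(protein, ligand)
```

Dropping the distance bin from the key would collapse the counts to plain atom-pair counts within the cutoff. Keeping it splits each pair type into several features, which is in line with the larger PDECIF feature counts reported in the table that follows.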
The number of features generated for each of the benchmark datasets using the ECIF and PDECIF approaches at different distances (angstroms).
| benchmark | distance | ECIF | PDECIF |
|---|---|---|---|
| CASF 2007 | 4 | 856 | 2178 |
| | 6 | 1161 | 2482 |
| | 8 | 1244 | 2563 |
| | 10 | 1285 | 2595 |
| CASF 2013 | 4 | 996 | 2290 |
| | 6 | 1226 | 2520 |
| | 8 | 1268 | 2561 |
| | 10 | 1288 | 2579 |
| CASF 2016 | 4 | 1078 | 2485 |
| | 6 | 1332 | 2739 |
| | 8 | 1376 | 2781 |
| | 10 | 1399 | 2803 |
| CASF 2019 | 4 | 1176 | 2584 |
| | 6 | 1362 | 2770 |
| | 8 | 1389 | 2795 |
| | 10 | 1402 | 2807 |
P-values from paired t-test statistical testing of the difference in predictive performance (R) between the considered representations across the different resampling methods.
| representation pairs | train–test | CV5 | CV10 |
|---|---|---|---|
| ECIF − ECIF + Ligand | 9.369 × 10⁻⁵ | 1.918 × 10⁻⁴ | 1.502 × 10⁻⁴ |
| ECIF − PDECIF | 2.133 × 10⁻³ | 6.220 × 10⁻³ | 6.419 × 10⁻³ |
| ECIF − PDECIF + Ligand | 5.471 × 10⁻⁵ | 1.271 × 10⁻⁴ | 1.391 × 10⁻⁴ |
| ECIF + Ligand − PDECIF | 5.364 × 10⁻¹ | 1.388 × 10⁻¹ | 1.188 × 10⁻¹ |
| ECIF + Ligand − PDECIF + Ligand | 6.175 × 10⁻³ | 5.564 × 10⁻⁴ | 3.924 × 10⁻³ |
| PDECIF − PDECIF + Ligand | 3.688 × 10⁻⁵ | 6.243 × 10⁻⁸ | 1.052 × 10⁻⁶ |
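The p-values above come from paired t-tests. As a rough illustration of the statistic behind such a test, the sketch below computes the paired t statistic and its degrees of freedom. It assumes the pairs are R values for two representations under the same benchmark, distance, and resampling method; the record does not spell out the exact pairing. The example uses only the four CASF 2019 train–test R values from the complete ECIF and ECIF + ligand rows of the performance table, so it is not expected to reproduce the reported p-values.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    std_err = math.sqrt(var / n)
    return mean_diff / std_err, n - 1

# R values, train-test column, CASF 2019 at 4, 6, 8 and 10 angstroms.
ecif = [0.793, 0.832, 0.831, 0.832]
ecif_ligand = [0.833, 0.847, 0.854, 0.856]
t_stat, dof = paired_t_statistic(ecif, ecif_ligand)
```

A two-sided p-value would then follow from the Student t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` performs both steps if SciPy is available.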
Predictive performance (R/RMSE) for the ECIF and PDECIF representations with and without the ligand features for the CASF 2007 and 2013 benchmark datasets when the hyperparameters for the predictive model are selected using the train–test and cross-validation (k = {5, 10}) resampling methods. For each benchmark year and distance pair, the best performing representation (with and without ligand features) and resampling method is in italics. The overall best performing combination for the given benchmark dataset is in boldface.
| year, distance (Å) | representation | train–test | CV5 | CV10 |
|---|---|---|---|---|
| CASF 2007, 4 | ECIF | 0.739/1.663 | 0.736/1.665 | 0.729/1.692 |
| | PDECIF | 0.807/1.468 | 0.802/1.482 | |
| | ECIF + ligand | 0.759/1.583 | 0.787/1.562 | 0.783/1.562 |
| | PDECIF + ligand | 0.811/1.467 | 0.812/1.471 | |
| CASF 2007, 6 | ECIF | 0.812/1.472 | 0.808/1.498 | |
| | PDECIF | 0.803/1.494 | 0.808/1.467 | 0.806/1.472 |
| | ECIF + ligand | 0.814/1.459 | 0.820/1.455 | 0.812/1.460 |
| | PDECIF + ligand | 0.823/1.430 | 0.817/1.446 | |
| CASF 2007, 8 | ECIF | 0.805/1.468 | 0.813/1.449 | 0.815/1.446 |
| | PDECIF | 0.812/1.450 | 0.811/1.460 | |
| | ECIF + ligand | 0.815/1.472 | 0.818/1.443 | 0.820/1.442 |
| | PDECIF + ligand | 0.827/1.418 | 0.825/1.418 | |
| CASF 2007, 10 | ECIF | 0.811/1.473 | 0.811/1.448 | |
| | PDECIF | 0.811/1.476 | 0.808/1.476 | 0.807/1.481 |
| | ECIF + ligand | 0.802/1.496 | 0.820/1.461 | 0.817/1.438 |
| | PDECIF + ligand | 0.814/1.486 | 0.818/1.440 | |
| CASF 2013, 4 | ECIF | 0.708/1.629 | 0.694/1.655 | 0.717/1.613 |
| | PDECIF | 0.762/1.522 | 0.773/1.499 | |
| | ECIF + ligand | 0.777/1.484 | 0.776/1.480 | 0.778/1.481 |
| | PDECIF + ligand | 0.800/1.432 | 0.798/1.431 | |
| CASF 2013, 6 | ECIF | 0.772/1.484 | 0.779/1.475 | 0.774/1.478 |
| | PDECIF | 0.786/1.461 | 0.783/1.467 | |
| | ECIF + ligand | 0.801/1.419 | 0.791/1.437 | 0.790/1.439 |
| | PDECIF + ligand | 0.811/1.405 | 0.802/1.420 | |
| CASF 2013, 8 | ECIF | 0.772/1.487 | 0.769/1.497 | 0.772/1.483 |
| | PDECIF | 0.774/1.485 | 0.784/1.459 | |
| | ECIF + ligand | 0.799/1.420 | 0.797/1.423 | 0.799/1.422 |
| | PDECIF + ligand | 0.800/1.420 | 0.804/1.410 | |
| CASF 2013, 10 | ECIF | 0.781/1.464 | 0.779/1.469 | |
| | PDECIF | 0.780/1.469 | 0.775/1.478 | 0.778/1.472 |
| | ECIF + ligand | 0.800/1.420 | 0.798/1.416 | |
| | PDECIF + ligand | 0.798/1.421 | 0.796/1.424 | 0.797/1.423 |
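The CV5 and CV10 columns refer to selecting hyperparameters by k-fold cross-validation rather than by a single train–test split. A minimal sketch of that selection loop follows. The function names and the `score_fn` callback are hypothetical, and the toy scorer stands in for fitting and evaluating a gradient boosted tree model, which the record does not detail.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, near-equal validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def select_by_cv(candidates, n_samples, k, score_fn, seed=0):
    """Pick the candidate with the highest mean validation score over k folds.

    score_fn(candidate, train_idx, valid_idx) is a hypothetical callback that
    fits a model on train_idx and returns a score (higher is better) on valid_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    best, best_score = None, float("-inf")
    for candidate in candidates:
        scores = []
        for i, valid_idx in enumerate(folds):
            # Training indices are all folds except the current validation fold.
            train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
            scores.append(score_fn(candidate, train_idx, valid_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best, best_score = candidate, mean_score
    return best, best_score

# Toy usage: this stand-in scorer simply prefers max_depth == 6.
def toy_score(candidate, train_idx, valid_idx):
    return -abs(candidate["max_depth"] - 6)

grid = [{"max_depth": d} for d in (3, 6, 9)]
best, best_score = select_by_cv(grid, n_samples=100, k=5, score_fn=toy_score)
```

A train–test selection scheme would instead score each candidate once on a single held-out split, so its choice depends more heavily on which samples land in that split.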
Predictive performance (R/RMSE) for the ECIF and PDECIF representations with and without the ligand features for the CASF 2016 and 2019 benchmark datasets when the hyperparameters for the predictive model are selected using the train–test and cross-validation (k = {5, 10}) resampling methods. For each benchmark year and distance pair, the best performing representation (with and without ligand features) and resampling method is in italics. The overall best performing combination for the given benchmark dataset is in boldface.
| year, distance (Å) | representation | train–test | CV5 | CV10 |
|---|---|---|---|---|
| CASF 2016, 4 | ECIF | 0.752/1.497 | 0.752/1.495 | 0.748/1.501 |
| | PDECIF | 0.816/1.334 | 0.818/1.329 | |
| | ECIF + ligand | 0.818/1.335 | 0.822/1.319 | 0.822/1.323 |
| | PDECIF + ligand | 0.840/1.273 | 0.839/1.275 | |
| CASF 2016, 6 | ECIF | 0.808/1.343 | 0.811/1.335 | 0.802/1.353 |
| | PDECIF | 0.833/1.280 | 0.828/1.293 | |
| | ECIF + ligand | 0.840/1.263 | 0.840/1.260 | 0.829/1.284 |
| | PDECIF + ligand | 0.843/1.252 | 0.840/1.258 | |
| CASF 2016, 8 | ECIF | 0.806/1.343 | 0.804/1.350 | 0.797/1.361 |
| | PDECIF | 0.823/1.303 | 0.824/1.305 | |
| | ECIF + ligand | 0.831/1.281 | 0.832/1.275 | 0.838/1.263 |
| | PDECIF + ligand | 0.831/1.276 | 0.842/1.256 | |
| CASF 2016, 10 | ECIF | 0.815/1.320 | 0.812/1.328 | 0.816/1.314 |
| | PDECIF | 0.825/1.298 | 0.823/1.300 | |
| | ECIF + ligand | 0.842/1.252 | 0.842/1.256 | |
| | PDECIF + ligand | 0.842/1.252 | 0.844/1.246 | 0.839/1.260 |
| CASF 2019, 4 | ECIF | 0.793/1.424 | 0.795/1.417 | 0.791/1.426 |
| | PDECIF | 0.853/1.239 | 0.851/1.249 | |
| | ECIF + ligand | 0.833/1.294 | 0.833/1.289 | 0.832/1.290 |
| | PDECIF + ligand | 0.859/1.217 | 0.855/1.223 | |
| CASF 2019, 6 | ECIF | 0.832/1.284 | 0.837/1.272 | 0.833/1.284 |
| | PDECIF | 0.850/1.240 | 0.849/1.241 | |
| | ECIF + ligand | 0.847/1.237 | 0.848/1.230 | 0.853/1.223 |
| | PDECIF + ligand | 0.859/1.208 | 0.860/1.204 | |
| CASF 2019, 8 | ECIF | 0.831/1.290 | 0.836/1.281 | 0.839/1.268 |
| | PDECIF | 0.845/1.251 | 0.848/1.244 | |
| | ECIF + ligand | 0.854/1.222 | 0.852/1.230 | 0.851/1.227 |
| | PDECIF + ligand | 0.857/1.215 | 0.854/1.222 | |
| CASF 2019, 10 | ECIF | 0.832/1.282 | 0.842/1.258 | 0.837/1.271 |
| | PDECIF | 0.848/1.245 | 0.849/1.248 | |
| | ECIF + ligand | 0.856/1.211 | 0.858/1.208 | 0.854/1.217 |
| | PDECIF + ligand | 0.855/1.215 | 0.857/1.203 | |
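Each cell in the performance tables is reported as R/RMSE. The sketch below shows how such a pair could be computed from measured and predicted binding affinities. It assumes R denotes Pearson's correlation coefficient, which the record does not state explicitly, and the affinity values are hypothetical numbers chosen only to exercise the functions.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Hypothetical measured and predicted affinities (illustrative only).
measured = [4.0, 5.5, 6.0, 7.5]
predicted = [4.5, 5.0, 6.5, 7.0]
r_value = pearson_r(measured, predicted)
rmse_value = rmse(measured, predicted)
print(f"{r_value:.3f}/{rmse_value:.3f}")
```

Higher R and lower RMSE both indicate better agreement, which is why the two numbers are read together in each table cell.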