Melanie Schneider, Jean-Luc Pons, William Bourguet, Gilles Labesse.
Abstract
MOTIVATION: Virtual screening (VS) now plays a major role in drug development. Nonetheless, accurate estimation of binding affinities, which is crucial at all stages, is not trivial and may require target-specific fine-tuning. Furthermore, drug design also requires improved predictions for putative secondary targets, among them Estrogen Receptor alpha (ERα).
Year: 2020 PMID: 31350558 PMCID: PMC6956784 DOI: 10.1093/bioinformatics/btz538
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Structure-based dataset generation approach. The ligand dataset was extracted from the BindingDB (BDB), which uses VConf for 3D conformation generation and VCharge for charge assignment. Two more partial charge models (MMFF and Gasteiger) and two other 3D conformation generators (OpenBabel and Frog2) were employed to generate a total of five ligand sets. These were submitted to the @TOME server for docking and complex evaluation. The @TOME output datasets ‘MMFF’, ‘Gast’, ‘BDB’, ‘OB3D’ and ‘Frog3D’ (containing the results of 20 dockings per ligand in different structures) were grouped into three combined datasets: a different-charge dataset ‘dCharge’, a different-conformation dataset ‘dConf’, and an ‘ALL’ dataset.
Structure-based docking metrics
| Metric name | Short description |
|---|---|
| PlantsFull | PLANTS score (with anchor weight) |
| Plants | PLANTS ChemPLP score (without weight) |
| PlantsLR | PLANTS pKa (calculated by linear regression on PDBbind) |
| MedusaScore | Medusa original score |
| MedusaLR | MedusaScore pKa (calculated by linear regression on PDBbind) |
| XScore | XScore affinity score (pKa) |
| DSX | DSX original score |
| DSXLR | DSX pKa (calculated by linear regression on PDBbind) |
| AtomeScore | @TOME pKa = mean(PLANTS, XScore, MedusaScore, DSX) |
| Tanimoto | Similarity between candidate ligand and anchor ligand |
| AtomSA | S.A. @TOME score |
| QMean | QMean score of receptor model |
| AnchKd | Affinity calculated between receptor/anchor (pKa) |
| AnchorFit | Candidate/ligand superimposition score (PLANTS software) |
| LigandEnergy | Internal energy of ligand (AMMP force field) |
| LPC | LPC software score (receptor/ligand complementarity function) |
| PSim | Similarity to receptor/ligand interaction profile in PDB template |
| CpxQuality | Complex quality consensus score |
| LPE | Ligand Position Error (SVM multi-variable linear regression) |
Ligand-based molecular descriptors
| Abbrv. | CDK descriptor name | Short description |
|---|---|---|
| MW | Weight | molecular weight |
| VABC | VABC | volume descriptor |
| nAtom | AtomCount | number of atoms |
| nBond | BondCount | number of bonds |
| nRotBond | RotatableBondsCount | number of rotatable bonds |
| nAromBond | AromaticBondsCount | number of aromatic bonds |
| nHBDon | HBondDonorCount | number of hydrogen bond donors |
| nHBAcc | HBondAcceptorCount | number of hydrogen bond acceptors |
| TPSA | TPSA | Topological Polar Surface Area |
| XLogP | XLogP | prediction of logP based on the atom-type method called XLogP |
| HybRatio | HybridizationRatio | fraction of sp3 carbons to sp2 carbons |
Pearson correlations (r) between experimental affinities and scores from four scoring functions (Plants, MedusaScore, DSX and XScore) on all five datasets, using (1) the score of the best pose selected by @TOME and (2) the median score of each scoring function, calculated over 20 dockings per ligand
| Dataset name | Plants | MedusaScore | DSX | XScore |
|---|---|---|---|---|
| (1) Best pose selected by @TOME | | | | |
| Gast | 0.042 | 0.154 | 0.129 | 0.060 |
| MMFF | 0.063 | 0.182 | 0.157 | 0.082 |
| BDB | 0.038 | 0.111 | 0.118 | 0.076 |
| OB3D | 0.109 | 0.180 | 0.143 | 0.129 |
| Frog3D | 0.022 | 0.132 | 0.118 | 0.040 |
| (2) Median over 20 dockings per ligand | | | | |
| Gast | −0.031 | 0.204 | 0.019 | 0.049 |
| MMFF | −0.025 | 0.192 | 0.038 | 0.054 |
| BDB | −0.019 | 0.087 | 0.022 | 0.059 |
| OB3D | −0.017 | 0.175 | 0.008 | 0.050 |
| Frog3D | −0.048 | 0.199 | 0.005 | 0.036 |
Fig. 2. Variable importance of the top 30 variables, tracked during model training for the model trained on the ‘ALL’ dataset with the full variable set. Structure-based docking metrics carry a suffix (_Med or _SD). The suffix _Med stands for the median of the variable over a ligand’s 20 dockings and _SD is the corresponding standard deviation of this variable.
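The `_Med` and `_SD` suffixes described in the Fig. 2 caption can be illustrated with a minimal sketch. It assumes the docking results are available as a list of per-docking records; the field name `ligand`, the function name `aggregate_dockings` and the use of the sample standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict
from statistics import median, stdev


def aggregate_dockings(rows, metric):
    """Collapse per-docking values of `metric` into one median and one SD per ligand.

    `rows` is a list of dicts, one per docking, each holding a hypothetical
    'ligand' identifier and a numeric value stored under `metric`.
    """
    values_by_ligand = defaultdict(list)
    for row in rows:
        values_by_ligand[row["ligand"]].append(row[metric])

    aggregated = {}
    for ligand, values in values_by_ligand.items():
        aggregated[ligand] = {
            f"{metric}_Med": median(values),
            # Assumption: sample SD (n - 1). It is undefined for a single
            # docking, so 0.0 is used as a fallback in that case.
            f"{metric}_SD": stdev(values) if len(values) > 1 else 0.0,
        }
    return aggregated


if __name__ == "__main__":
    # Toy values, not data from the study.
    dockings = [
        {"ligand": "lig_A", "Plants": 1.0},
        {"ligand": "lig_A", "Plants": 3.0},
        {"ligand": "lig_A", "Plants": 2.0},
        {"ligand": "lig_B", "Plants": 5.0},
    ]
    print(aggregate_dockings(dockings, "Plants"))
```

In the setting described above, each ligand would contribute 20 values per metric, and the resulting `_Med` and `_SD` columns would serve as model input variables.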
Fig. 3. Performance evaluation of extended models on their respective 20% left-out test sets. The initial dataset of 281 ligands is extended by a set of 66 xenochemicals. The heatmap shows Pearson correlations between predictions and measurements for all combinations of training model and prediction set. The training models are listed as rows and the test sets on which the predictions were made are listed as columns. RF models were trained on each dataset separately (‘MMFF’, ‘Gast’, ‘BDB’, ‘OB3D’, ‘Frog3D’), on the combination of the three different 3D conformation datasets ({‘BDB’, ‘OB3D’, ‘Frog3D’} = ‘dConf’), on the combination of the three different partial charge datasets ({‘MMFF’, ‘Gast’, ‘BDB’} = ‘dCharge’) and on all five datasets combined (= ‘ALL’). The prediction whose Pearson correlation is highlighted in the heatmap (black box) is shown in detail in the scatter plot below. The scatter plot shows predicted versus measured affinities together with a regression line (dashed line), the optimal prediction line (solid diagonal) and the evaluation metrics: Pearson correlation coefficient (r), coefficient of determination (R²) and root-mean-square error (RMSE). All evaluation metrics were calculated with respect to the actual values (solid diagonal), not the regression line.
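The distinction drawn in the Fig. 3 caption, metrics computed against the identity line rather than against a fitted regression line, can be made concrete with a short sketch. The function names are hypothetical, and the formula R² = 1 − SS_res/SS_tot with residuals taken directly as measured minus predicted is an assumption about the definition used; the caption does not spell it out.

```python
import math


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)


def r2_identity(measured, predicted):
    """R^2 relative to the identity line (predicted == measured).

    Assumed definition: 1 - SS_res / SS_tot, where residuals are
    measured - predicted, with no regression fit in between.
    """
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root-mean-square error of the raw predictions."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


if __name__ == "__main__":
    # Toy values, not data from the study: predictions are perfectly
    # correlated with the measurements but systematically twice as large.
    measured = [1.0, 2.0, 3.0, 4.0]
    predicted = [2.0, 4.0, 6.0, 8.0]
    print(pearson_r(measured, predicted))
    print(r2_identity(measured, predicted))
    print(rmse(measured, predicted))
```

With such systematically scaled predictions the correlation stays at 1 while R² against the identity line becomes negative, which is why reporting R² and RMSE alongside r gives a stricter picture of prediction quality.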
Model performances on the FDA ER-EDKB test set
| Algorithm | Training set | Variable type | Pearson correlation |
|---|---|---|---|
| RF | ALL+Xeno | @TOME+LD | 0.748 |
| RF | ALL+Xeno | @TOME+LD+MACCS | 0.740 |
| RF | ALL | @TOME+LD | 0.663 |
| RF | ALL | @TOME+LD+MACCS | 0.648 |
| RF | BDB+Xeno | @TOME+LD | 0.712 |
| RF | BDB+Xeno | @TOME+LD+MACCS | 0.688 |
| RF | BDB | @TOME+LD | 0.584 |
| RF | BDB | @TOME+LD+MACCS | 0.542 |
| RF | BDB+Xeno | MACCS only | 0.487 |
Note: All presented models employ the RF algorithm and differ in the molecules included in the training set and in the type of variables used. @TOME+LD = docking evaluation variables from the @TOME server + ligand descriptors calculated with CDK.
Comparison of cross-predictions between the Ki and IC50 models and datasets
| | BDB Ki training set | BDB Ki test set | BDB IC50 training set | BDB IC50 test set |
|---|---|---|---|---|
| number of compounds | 225 | 56 | 1319 | 322 |
| Ki ALL model | 0.99 | 0.77 | 0.64 | 0.69 |
| IC50 ALL model | 0.64 | 0.49 | 1.00 | 0.87 |
Note: Pearson correlations between experimental affinities and the random forest predictions are reported.
Evaluation of best RF models on various datasets
| Prediction set | Xeno | FDA | IC50 | Ki |
|---|---|---|---|---|
| RF ALL model | | | | |
| Ki+Xeno | 0.98 | 0.75 | 0.65 | 0.96 |
| Ki | 0.48 | 0.66 | 0.65 | 0.77* |
| IC50 | 0.25 | 0.35 | 0.87* | 0.61 |
Note: Pearson correlations between experimental affinities and the RF predictions are reported for the whole datasets, except for values marked with ‘*’, which were calculated on a 20% test set.