Piotr F J Lipiński, Przemysław Szurmak.
Abstract
A common practice in modern QSAR modelling is to derive models by variable selection methods working on large descriptor pools. As pointed out previously, this practice carries an intrinsic risk of finding chance correlations. It is therefore desirable to run tests that show how well models built on random data perform. In this contribution, we introduce SCRAMBLE'N'GAMBLE, a simple and freely available software tool that facilitates data preparation for y-randomization and pseudo-descriptors tests. Four close-to-real-world modelling situations are then analysed. The tests show how the quality of the obtained QSAR models compares with that of chance models derived from random data. Non-randomness is not the only requirement for a good QSAR model; however, it is good practice to consider it together with internal statistical parameters and possible physical interpretations of a model.
Keywords: Chance correlations; QSAR; QSAR validation; y-Randomization
Year: 2017 PMID: 29104352 PMCID: PMC5655615 DOI: 10.1007/s11696-017-0215-7
Source DB: PubMed Journal: Chem Zvesti ISSN: 0366-6352 Impact factor: 2.097
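The data preparation described in the abstract can be sketched as follows. This is a minimal illustration, not the SCRAMBLE'N'GAMBLE implementation: it assumes that y-randomization shuffles the response vector, that pseudo-descriptors with original distributions are obtained by shuffling each descriptor column independently, and that pseudo-descriptors with uniform distributions are drawn uniformly between each column's minimum and maximum. All function names and the toy data are hypothetical.

```python
import random


def y_randomize(y, rng):
    """Return a shuffled copy of the response vector (y-randomization)."""
    shuffled = list(y)
    rng.shuffle(shuffled)
    return shuffled


def pseudo_descriptors_original(X, rng):
    """Shuffle every descriptor column independently.

    Each column keeps its original value distribution but loses any
    relation to the molecules (assumed meaning of 'original distributions').
    """
    columns = [list(col) for col in zip(*X)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def pseudo_descriptors_uniform(X, rng):
    """Replace every column by uniform random numbers within its own range."""
    columns = list(zip(*X))
    random_columns = [
        [rng.uniform(min(col), max(col)) for _ in col] for col in columns
    ]
    return [list(row) for row in zip(*random_columns)]


# Toy data: 4 molecules x 2 descriptors (values are illustrative only).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
y = [0.1, 0.2, 0.3, 0.4]

rng = random.Random(0)  # fixed seed so the random data sets are reproducible
print(y_randomize(y, rng))
print(pseudo_descriptors_original(X, rng))
print(pseudo_descriptors_uniform(X, rng))
```

Each call produces one random data set; repeating the call with the same `rng` yields the series of data sets on which chance models are then trained.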
Details of the workflow for the individual cases
| Case | I | II | III | IV |
|---|---|---|---|---|
| QSAR task | Linear equation of up to 3 variables, descriptors based on 2D structure | Linear equation of up to 3 variables, descriptors based on 3D structure | Linear equation, Fujita-Ban model (Fujita and Ban | Classification model |
| Molecules | Various steroids (Fig. | Various steroids (Fig. | Fentanyl derivatives (3-methyl-1,4-disubstituted piperidines) (Table SI-1 in Electronic supporting material) | Various compounds (Tables SI-5 to SI-8) |
| Dependent variable | Logarithm of binding affinity to sex-hormone-binding globulin (SHBG) (Fig. | Logarithm of binding affinity to corticosteroid-binding globulin (CBG) (Fig. | Effective dose ED50 in mouse hot plate test (analgesic activity test) (Table SI-1) | Whether a molecule binds or does not bind the glucocorticoid receptor (Tables SI-5 to SI-8) |
| Training set | 21 steroids of the benchmark Cramer data set (Cramer et al. | 21 steroids of the benchmark Cramer data set (Cramer et al. | 36 active derivatives (Lalinde et al. | 100 active molecules and 3600 decoys randomly chosen from Directory of Useful Decoys, Enhanced (DUD-E). (Mysinger et al. |
| Test set | Up to 12 molecules (within the applicability domains of the models found) taken from the extended benchmark steroid data set (Cherkasov et al. | 10 steroids of the benchmark Cramer data set (Cramer et al. | 10 inactive derivatives (Lalinde et al. | Other 50 active molecules and 1800 decoys randomly chosen from Directory of Useful Decoys, Enhanced (DUD-E). (Mysinger et al. |
| Calculation of molecular descriptors | 2D descriptors in DRAGON 6 (Talete Srl | Structure optimization (B3LYP/6-31G*) in Gaussian 09 (Frisch et al. | Indicator variables | 2D descriptors in DRAGON 6 (Talete Srl |
| Descriptors calculated | 3764 | 127 | 9 | 3764 |
| Descriptors included in training the models | 89 | 49 | 9 | 1090 |
| Training QSAR model | Genetic function approximation (GFA) algorithm (Rogers and Hopfinger | Genetic function approximation (GFA) algorithm (Rogers and Hopfinger | Linear regression routine in Microsoft Excel 2016 | In-house script based on a Python scikit-learn library for machine learning. ( |
| Number of random data sets generated | 300 ( | 300 ( | 300 ( | 25 (pseudo-descriptors with original distributions, pseudo-descriptors with uniform distributions) |
Fig. 1 General view of SCRAMBLE’N’GAMBLE interface
Fig. 2 Steroid molecules used in Cases I and II. The figures under names are binding affinities to corticosteroid-binding globulin (upper figures, Case II) and sex-hormone-binding globulin (lower figures, Case I)
Top 10 QSAR models obtained by the GFA procedure (Case I)
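| No | Equation for K aff | r 2 | q 2 | R 2, n | Non-randomness parameter (Mitra et al. 2010) | Test-set prediction parameter (Pratim Roy et al. 2009) |
|---|---|---|---|---|---|---|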
| 1 | −69.185 − 1.103 × X5v + 38.155 × MATS5m + 44.524 × SpMin3_Bh(m) | 0.811 | 0.690 | 0.051, | 0.339 | 0.023 |
| 2 | −60.380 + 28.860 × MATS5m − 45.301 × MATS3v + 34.208 × SpMin3_Bh(m) | 0.792 | 0.706 | 0.015, | 0.312 | <0.001 |
| 3 | −89.817 + 160.850 × X2A + 36.696 × MATS5m + 30.049 × SpMin3_Bh(m) | 0.784 | 0.701 | 0.027, | 0.300 | 0.017 |
| 4 | 22.870 + 127.78 × VE2_Dt − 5.549 × SpDiam_AEA(dm) − 0.559 × NsssCH | 0.782 | 0.673 | 0.322, | 0.297 | 0.289 |
| 5 | 14.787 − 2.262 × IDDE + 157.890 × VE2_Dt − 3.801 × SpDiam_AEA(dm) | 0.777 | 0.669 | 0.030, | 0.290 | 0.024 |
| 6 | −25.180 + 30.414 × MATS5m + 23.871 × SpMin3_Bh(m) − 1.935 × SpDiam_AEA(dm) | 0.771 | 0.679 | 0.146, | 0.280 | 0.097 |
| 7 | 2.426 + 186.700 × VE2_Dt + 0.070 × P_VSA_s_4 − 4.433 × SpDiam_AEA(dm) | 0.766 | 0.620 | 0.062, | 0.273 | 0.059 |
| 8 | −67.765 + 34.796 × MATS5m − 8.129 × MATS8m + 40.696 × SpMin3_Bh(m) | 0.766 | 0.615 | 0.1085, | 0.273 | 0.084 |
| 9 | 61.058 + 170.730 × VE2_Dt − 178.400 × ChiA_B(p) − 5.089 × SpDiam_AEA(dm) | 0.765 | 0.613 | 0.057, | 0.271 | 0.050 |
| 10 | 35.620 + 34.296 × MATS5m − 25.263 × SpMax4_Bh(i) + 32.296 × SpMin3_Bh(m) | 0.762 | 0.633 | 0.016, | 0.266 | 0.003 |
K aff: logarithm of the affinity for SHBG; ChiA_B(p): average Randic-like index from Burden matrix weighted by polarizability; IDDE: mean information content on the distance degree equality; MATS3v: Moran autocorrelation of lag 3 weighted by van der Waals volume; MATS5m: Moran autocorrelation of lag 5 weighted by mass; MATS8m: Moran autocorrelation of lag 8 weighted by mass; NsssCH: number of atoms of type sssCH; P_VSA_s_4: P_VSA-like on I-state, bin 4; SpDiam_AEA(dm): spectral diameter from augmented edge adjacency matrix weighted by dipole moment; SpMax4_Bh(i): largest eigenvalue n. 4 of Burden matrix weighted by ionization potential; SpMin3_Bh(m): smallest eigenvalue n. 3 of Burden matrix weighted by mass; VE2_Dt: average coefficient of the last eigenvector from detour matrix; X2A: average connectivity index of order 2; X5v: valence connectivity index of order 5
r 2 coefficient of determination in the training set
q 2 cross-validated coefficient of determination in the training set (internal validation)
R 2 coefficient of determination in the test set (external validation)
n number of molecules used for the external validation (that is: found in the applicability domain of a given model)
a parameter of model non-randomness proposed in (Mitra et al. 2010)
a parameter describing the prediction of the absolute response data of the test set, proposed in (Pratim Roy et al. 2009)
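The values in the second-to-last column of the Case I table can be reproduced numerically. The sketch below assumes that the non-randomness parameter of Mitra et al. (2010) has the commonly cited form sqrt(r2) × sqrt(r2 − r2_random), and that r2_random is the mean highest r2 of the uniform pseudo-descriptors test reported for Case I (0.669). Both points are inferences from the tabulated numbers rather than statements in the text, and the function name `c_rp2` is illustrative.

```python
import math


def c_rp2(r2, r2_random):
    """Non-randomness parameter, assuming the form sqrt(r2) * sqrt(r2 - r2_random).

    Only defined here for r2 >= r2_random, i.e. when the real model fits
    better than the average chance model.
    """
    return math.sqrt(r2) * math.sqrt(r2 - r2_random)


R2_RANDOM_CASE_I = 0.669  # assumed reference: uniform pseudo-descriptors, Case I

# Training-set r2 of the first three Case I models.
for r2 in (0.811, 0.792, 0.784):
    print(r2, round(c_rp2(r2, R2_RANDOM_CASE_I), 3))
```

Under these assumptions the function returns 0.339, 0.312 and 0.300 for models 1 to 3, matching the tabulated values.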
Predictive power of the chance models (Case I)
| | mhr2 | SD | +1 SD | +2.3 SD | +3 SD | mhq2 | SD | +2.3 SD | +3 SD |
|---|---|---|---|---|---|---|---|---|---|
| y-Randomization | 0.544 | 0.123 | 0.667 | 0.827 | 0.913 | 0.316 | 0.183 | 0.737 | 0.865 |
| Pseudo-descriptors (original distributions) | 0.656 | 0.066 | 0.723 | 0.809 | 0.855 | 0.546 | 0.101 | 0.778 | 0.849 |
| Pseudo-descriptors (uniform distributions) | 0.669 | 0.068 | 0.737 | 0.825 | 0.873 | 0.571 | 0.091 | 0.780 | 0.843 |
mhr 2 mean highest coefficient of determination from 300 test runs
SD standard deviation
mhq 2 mean highest cross-validated coefficient of determination from 300 test runs
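Why the mean highest r2 of chance models is far from zero can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions: it fits only single-descriptor linear models (the paper uses GFA with up to three variables), draws both the response and the descriptors at random, and uses hypothetical function names. It records the best r2 found in each run and averages over runs, which mirrors the definition of mhr2 given above.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. r2 of a one-variable linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def mean_highest_r2(n_molecules, n_descriptors, n_runs, seed=0):
    """Mean and SD of the best r2 found among random descriptors in each run."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_molecules)]
    highest = []
    for _ in range(n_runs):
        best = 0.0
        for _ in range(n_descriptors):
            x = [rng.random() for _ in range(n_molecules)]  # uniform pseudo-descriptor
            best = max(best, r_squared(x, y))
        highest.append(best)
    mean = sum(highest) / n_runs
    sd = (sum((h - mean) ** 2 for h in highest) / (n_runs - 1)) ** 0.5
    return mean, sd


# 21 molecules as in the Cramer training set; pool sizes are illustrative.
for pool in (5, 89):
    mean, sd = mean_highest_r2(n_molecules=21, n_descriptors=pool, n_runs=100)
    print(pool, round(mean, 3), round(sd, 3))
```

The larger pool gives a clearly higher mean best r2 even though no descriptor carries any information, which is the selection effect the chance-model tests are meant to quantify.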
Top 10 QSAR models obtained by the GFA procedure (Case II)
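| No | Equation for K aff | r 2 | q 2 (leave-one-out) | q 2 (leave-three-out) | R 2, n | Non-randomness parameter (Mitra et al. 2010) | Test-set prediction parameter (Pratim Roy et al. 2009) |
|---|---|---|---|---|---|---|---|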
| 1 | −9.394 + 5.644 × | 0.892 | 0.823 | 0.826 | 0.365, | 0.546 | 0.131 |
| 2 | −9.934 − 5.780 × q3 + 2.055 × Shadow_Ylength − 0.263 × Shadow_YZ | 0.885 | 0.802 | 0.739 | 0.325, | 0.538 | 0.178 |
| 3 | −4.871 − 6.046 × q3 + 5.795 × JX − 1.168 × Shadow_Zlength | 0.879 | 0.805 | 0.786 | 0.732, | 0.531 | 0.630 |
| 4 | 8.117 − 6.065 × q3 + 1.013 × CHI_3_C − 2.035 × Shadow_Zlength | 0.877 | 0.790 | 0.778 | 0.635, | 0.529 | 0.366 |
| 5 | −1.520 − 6.636 × q3 + 0.840 × Shadow_Ylength − 1.209 × Shadow_Zlength | 0.877 | 0.775 | 0.750 | 0.255, | 0.529 | 0.078 |
| 6 | 4.437 + 1.986 × q2 − 5.312 × q3 − 1.112 × Shadow_Zlength | 0.871 | 0.803 | 0.802 | 0.654, | 0.522 | 0.484 |
| 7 | 9.151 − 6.785 × q3 + 2.103 × srcm2 − 2.035 × Shadow_Zlength | 0.867 | 0.789 | 0.736 | 0.691, | 0.518 | 0.371 |
| 8 | 3.079 + 2.976 × q2 − 4.246 × q3 − 0.074 × ALogP_MR | 0.867 | 0.788 | 0.800 | 0.619, | 0.518 | 0.425 |
| 9 | 3.451 − 2.334 × q1 − 5.156 × q3 − 1.034 × Shadow_Zlength | 0.866 | 0.805 | 0.102 | 0.579, | 0.516 | 0.334 |
| 10 | −5.392 − 6.237 × q3 − 0.076 × ALogP_MR + 1.113 × Shadow_Ylength | 0.863 | 0.694 | 0.600 | 0.636, | 0.513 | 0.514 |
K aff: negative logarithm of the affinity for CBG; ALogP_MR: the Ghose and Crippen estimate of molar refractivity; CHI_3_C: Kier and Hall molecular connectivity index, cluster subgraph of order 3; JX: Balaban index JX; q1: CHELPG atomic charge at the C1 atom of the steroid skeleton (IUPAC numbering); q2: CHELPG atomic charge at the C2 atom of the steroid skeleton (IUPAC numbering); q3: CHELPG atomic charge at the C3 atom of the steroid skeleton (IUPAC numbering); Shadow_Ylength: length of molecule in the y dimension; Shadow_YZ: area of the molecular shadow in the yz plane; Shadow_Zlength: length of molecule in the z dimension; srcm2: Sinister-Rectus Chirality Measure weighted by mass
r 2 coefficient of determination in the training set
q 2 (leave-one-out) cross-validated coefficient of determination in the training set (internal validation, leave-one-out procedure)
q 2 (leave-three-out) cross-validated coefficient of determination in the training set (internal validation, leave-three-out procedure)
R 2 coefficient of determination in the test set (external validation)
n number of molecules used for the external validation (that is: found in the applicability domain of a given model)
a parameter of model non-randomness proposed in (Mitra et al. 2010)
a parameter describing the prediction of the absolute response data of the test set, proposed in (Pratim Roy et al. 2009)
Predictive power of the chance models (Case II)
| | mhr2 | SD | +1 SD | +2.3 SD | +3 SD | mhq2 | SD | +2.3 SD | +3 SD |
|---|---|---|---|---|---|---|---|---|---|
| y-Randomization | 0.475 | 0.105 | 0.580 | 0.716 | 0.789 | 0.341 | 0.148 | 0.681 | 0.785 |
| Pseudo-descriptors (original distributions) | 0.570 | 0.090 | 0.660 | 0.776 | 0.839 | 0.456 | 0.118 | 0.728 | 0.811 |
| Pseudo-descriptors (uniform distributions) | 0.558 | 0.098 | 0.656 | 0.784 | 0.853 | 0.431 | 0.128 | 0.725 | 0.815 |
mhr 2 mean highest coefficient of determination from 300 test runs
SD standard deviation
mhq 2 mean highest cross-validated coefficient of determination from 300 test runs
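The "+k SD" columns are the chance-model mean plus k standard deviations, and a real model can be compared against them directly. The sketch below uses the Case II means and standard deviations for r2 and the training-set r2 of the top Case II model (0.892). The function names are hypothetical, and the label "y-randomization" for the first row of the chance-model table is an assumption. Reading +2.3 SD as roughly a one-sided 99% limit holds only if the chance r2 values are approximately normally distributed.

```python
def chance_thresholds(mean, sd, multipliers=(1.0, 2.3, 3.0)):
    """Return {k: mean + k * sd} for the requested SD multipliers."""
    return {k: mean + k * sd for k in multipliers}


def exceeds_chance(model_value, mean, sd, k=2.3):
    """True if the model statistic lies above mean + k * sd of the chance models."""
    return model_value > mean + k * sd


# (mhr2, SD) for Case II; the first label is assumed, see lead-in.
CASE_II_R2 = {
    "y-randomization (assumed label)": (0.475, 0.105),
    "pseudo-descriptors, original distributions": (0.570, 0.090),
    "pseudo-descriptors, uniform distributions": (0.558, 0.098),
}

model_r2 = 0.892  # training-set r2 of model 1, Case II
for test_name, (mean, sd) in CASE_II_R2.items():
    limits = chance_thresholds(mean, sd)
    print(test_name, {k: round(v, 3) for k, v in limits.items()},
          exceeds_chance(model_r2, mean, sd, k=3.0))
```

With these inputs the model's r2 lies above the +3 SD limit of all three chance-model tests.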
Fig. 3 Progesterone in the binding site of CBG (PDB accession code: 4BB2)
Fig. 4 Plot of q3 and K aff
Fujita-Ban QSAR model of fentanyls activity (Case III)
| Equation terms | | Coefficient | Standard error |
|---|---|---|---|
| 3-CH3 | Intercept (parent) | 2.604 | 0.330 |
| | Cis/trans | −0.001 | 0.225 |
| L | Phenylethyl | −0.281 | 0.237 |
| | Tetrazolylethyl | −1.657 | 0.413 |
| | Thienylethyl | 0.000 | 0.000 |
| R | CH2OCH3 | 0.000 | 0.000 |
| | CH(CH3)OCH3 | −1.116 | 0.262 |
| | Furoyl | −0.722 | 0.247 |
| X | F | 0.163 | 0.262 |
| | Cl | −0.918 | 0.348 |
Fig. 5 Plot of predicted and experimental activities for the QSAR model in Case III
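In a Fujita-Ban model, the predicted activity is the parent value (intercept) plus the contributions of the substituents present. The sketch below applies the Case III coefficients in that additive way. The helper name and the example substituent combination are hypothetical and are not claimed to correspond to a compound in the data set; the activity scale and the coding of the cis/trans indicator are not given in the text above, so the output is only meaningful on the scale of the fitted model.

```python
# Coefficients from the Case III Fujita-Ban table.
INTERCEPT = 2.604  # parent compound
CONTRIBUTIONS = {
    "Cis/trans": -0.001,
    "Phenylethyl": -0.281,
    "Tetrazolylethyl": -1.657,
    "Thienylethyl": 0.000,
    "CH2OCH3": 0.000,
    "CH(CH3)OCH3": -1.116,
    "Furoyl": -0.722,
    "F": 0.163,
    "Cl": -0.918,
}


def predict_activity(substituents):
    """Add the group contributions of the listed substituents to the intercept."""
    return INTERCEPT + sum(CONTRIBUTIONS[name] for name in substituents)


# Hypothetical combination, used only to show the arithmetic.
example = ["Tetrazolylethyl", "CH(CH3)OCH3", "Cl"]
print(round(predict_activity([]), 3))       # parent: intercept only
print(round(predict_activity(example), 3))  # 2.604 - 1.657 - 1.116 - 0.918
```

A lookup with an unknown substituent name raises `KeyError`, which reflects that a Fujita-Ban model cannot predict for groups absent from the training set.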
Predictive power of the chance models (Case III)
| | mhr2 | SD | +1 SD | +2.3 SD | +3 SD | mhq2 | SD | +2.3 SD | +3 SD |
|---|---|---|---|---|---|---|---|---|---|
| Pseudo-descriptors test (original distributions) | 0.293 | 0.108 | 0.401 | 0.541 | 0.617 | – | – | – | – |
| Pseudo-descriptors test (uniform distributions) | 0.285 | 0.112 | 0.397 | 0.542 | 0.621 | – | – | – | – |
| y-Randomization test | 0.235 | 0.092 | 0.327 | 0.446 | 0.511 | – | – | – | – |
Most q 2 were negative
mhr 2 mean highest coefficient of determination from 300 test runs
SD standard deviation
mhq 2 mean highest cross-validated coefficient of determination from 300 test runs
Fig. 6 Scheme of the classification tree (Case IV). Eig02_EA(ri) eigenvalue n. 2 from edge adjacency matrix weighted by resonance integral, GATS7e Geary autocorrelation of lag 7 weighted by Sanderson electronegativity, nRCONR2 number of tertiary amides (aliphatic), NssssC number of atoms of type ssssC (>C<), where < or > are two single bonds, qnmax maximum negative charge
Statistical parameters of the decision tree (Case IV) and comparison with different random models and no-model predictions
| | TP | TN | FP | FN | ACC | PREC | SENS | SPEC | FALL | F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Training set | | | | | | | | | | |
| QSAR model trained on real data | 90 | 3371 | 226 | 10 | 0.94 | 0.28 | 0.90 | 0.94 | 0.06 | 0.43 |
| No-model: all binders | 100 | 0 | 3600 | 0 | 0.03 | 0.03 | 1.00 | 0.00 | 1.00 | 0.06 |
| No-model: all non-binders | 0 | 3600 | 0 | 100 | 0.97 | NaN | 0.00 | 1.00 | 0.00 | NaN |
| No-model: coin toss | 50 | 1800 | 1800 | 50 | 0.50 | 0.03 | 0.50 | 0.50 | 0.50 | 0.06 |
| QSAR models trained on pseudo-descriptors (original distributions) | | | | | | | | | | |
| Mean | 80.00 | 1785.40 | 1811.60 | 19.00 | 0.50 | 0.04 | 0.81 | 0.50 | 0.50 | 0.08 |
| SD | 9.20 | 402.30 | 402.30 | 9.20 | 0.11 | 0.01 | 0.09 | 0.11 | 0.11 | 0.02 |
| +1 SD | 89.20 | 2187.80 | 2213.90 | 28.20 | 0.61 | 0.05 | 0.90 | 0.61 | 0.62 | 0.10 |
| +2.3 SD | 101.20 | 2710.80 | 2736.90 | 40.20 | 0.75 | 0.06 | 1.02 | 0.75 | 0.76 | 0.13 |
| QSAR models trained on pseudo-descriptors (uniform distribution) | | | | | | | | | | |
| Mean | 82.30 | 1666.20 | 1930.80 | 16.70 | 0.47 | 0.04 | 0.83 | 0.46 | 0.54 | 0.08 |
| SD | 13.50 | 461.70 | 461.70 | 13.50 | 0.12 | 0.01 | 0.14 | 0.13 | 0.13 | 0.02 |
| +1 SD | 95.80 | 2127.90 | 2392.60 | 30.30 | 0.60 | 0.05 | 0.97 | 0.59 | 0.67 | 0.10 |
| +2.3 SD | 113.40 | 2728.10 | 2992.80 | 47.90 | 0.75 | 0.06 | 1.15 | 0.76 | 0.83 | 0.13 |
| Test set | | | | | | | | | | |
| QSAR model trained on real data | | | | | | | | | | |
| Mean | 45 | 1674 | 126 | 5 | 0.93 | 0.26 | 0.90 | 0.93 | 0.07 | 0.40 |
TP true positives
TN true negatives
FP false positives
FN false negatives
ACC accuracy
PREC precision
SENS sensitivity
SPEC specificity
FALL fall-out
F1 F1-score
NaN not a number
SD standard deviation
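The statistics in the Case IV table follow from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, sensitivity, specificity, fall-out and F1-score and checks them against the training-set row of the model trained on real data. The function name is hypothetical. Computing F1 from precision and sensitivity, so that an undefined precision propagates as NaN, is an assumption chosen to match the NaN entries of the "all non-binders" row.

```python
import math


def safe_div(a, b):
    """Division that yields NaN for a zero denominator instead of raising."""
    return float("nan") if b == 0 else a / b


def classification_metrics(tp, tn, fp, fn):
    """Return the Case IV statistics computed from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    return {
        "ACC": safe_div(tp + tn, tp + tn + fp + fn),
        "PREC": precision,
        "SENS": sensitivity,
        "SPEC": safe_div(tn, tn + fp),
        "FALL": safe_div(fp, fp + tn),
        # Harmonic mean of precision and sensitivity; NaN if precision is NaN.
        "F1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


real_model = classification_metrics(tp=90, tn=3371, fp=226, fn=10)
print({name: round(value, 2) for name, value in real_model.items()})

all_non_binders = classification_metrics(tp=0, tn=3600, fp=0, fn=100)
print(math.isnan(all_non_binders["PREC"]), math.isnan(all_non_binders["F1"]))
```

The "all non-binders" case shows why accuracy alone is a poor guide on such an imbalanced set: it reaches 0.97 while retrieving no binders at all.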