Gabriela Falcón-Cano1, Christophe Molina2, Miguel Ángel Cabrera-Pérez1,3,4.
Abstract
In silico prediction of aqueous solubility plays an important role in drug discovery and development. For many years, the limited performance of in silico solubility models has been attributed to the lack of high-quality solubility data for pharmaceutical molecules. However, some studies suggest that the poor accuracy of solubility prediction is not related to the quality of the experimental data, and that more precise methodologies (algorithms and/or sets of descriptors) are required to predict the aqueous solubility of pharmaceutical molecules. In this study, a large and diverse database was built from aqueous solubility values collected from two public sources; two new recursive machine-learning approaches were developed for data cleaning and variable selection; and a consensus model based on regression and classification algorithms was created. The modeling protocol, which includes the curation of chemical and experimental data, was implemented in KNIME to obtain an automated workflow for predicting new databases. Finally, we compared several methods and models available in the literature with our consensus model, which gave results comparable to, or even better than, those of previously published models.
Keywords: ADME; KNIME; Quantitative Structure-Property Relationship (QSPR); Random Forest; aqueous solubility; machine learning; supervised recursive selection
Year: 2020 PMID: 35300309 PMCID: PMC8915604 DOI: 10.5599/admet.852
Source DB: PubMed Journal: ADMET DMPK ISSN: 1848-7718
Figure 1. Protocol for data curation.
Figure 2. Schematic description of the supervised recursive variable selection methodology.
Figure 3. Schematic description of the supervised recursive data cleaning methodology.
Figure 4. Schematic description of the modeling protocol.
Performance of published models in predicting our Test Set 2 (from the Solubility Challenge, N = 26)
| Models | r2 | RMSE |
|---|---|---|
| PP-solubility | 0.29 | 1.66 |
| PP-ADMET solubility | 0.28 | 1.49 |
| ACD | 0.53 | 1.14 |
| MOE | 0.44 | 1.30 |
| QikProp | 0.34 | 1.24 |
| QikProp-CI | 0.57 | 1.12 |
| ADMET PREDICTOR | 0.32 | 1.25 |
| Volsurf + | 0.44 | 1.09 |
| FAF | 0.18 | 1.19 |
| VCC lab | 0.39 | 1.18 |
| ISIDA | 0.35 | 1.09 |
| RFR [ | 0.57[ | 0.92[ |
| This study[ | 0.56 | 0.93 |
| | 0.72[ | 0.73[ |
a In these studies, 28 compounds were included in the second dataset.
b Values calculated from the original data reported in the article (Table A2) [29].
c Results obtained when the intrinsic solubility value of folic acid (outlier) was replaced by the aqueous solubility value. PP: Pipeline Pilot
Distribution of aqueous solubility data in the chemical space defined by six molecular properties
| Solubility Class | ||||
|---|---|---|---|---|
| Physicochemical descriptor | log | log | log | log |
| MW | 159.0 (95.3) | 191.2 (83.5) | 249.5 (89.00) | 348.7 (134.4) |
| TPSA | 62.4 (51.5) | 57.7 (40.1) | 58.1 (38.1) | 51.1 (43.9) |
| ALOGP | -0.3 (1.5) | 1.0 (1.2) | 2.2 (1.2) | 4.7 (2.5) |
| RBN | 2.8 (3.1) | 2.9 (2.7) | 4.0 (3.3) | 6.0 (6.6) |
| HBD | 1.8 (2.1) | 1.3 (1.4) | 1.1 (1.2) | 0.8 (1.2) |
| HBA | 3.5 (2.8) | 3.3 (2.3) | 3.5 (2.2) | 3.5 (2.7) |
| Compounds (N) | 924 | 3519 | 4710 | 3521 |
| Compounds (%) | 7 | 28 | 37 | 28 |
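The last two rows of the table read as per-class compound counts and their percentage shares. A minimal Python check, assuming that reading (the variable names are illustrative and not taken from the source), reproduces the reported 7, 28, 37 and 28 %:

```python
# Per-class values from the table, read here as compound counts (an assumption).
counts = [924, 3519, 4710, 3521]

# Total number of compounds across the four solubility classes.
total = sum(counts)

# Percentage share of each class, rounded to the nearest integer.
shares = [round(100 * c / total) for c in counts]

print(total, shares)
```

The four counts sum to 12674 compounds, so the percentages in the table are rounded shares of that total.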
Figure 5. Correlation among RBN, TPSA, MW, and ALOGP properties and experimental aqueous solubility (log S). The colour scale refers to solubility ranges.
Performance of the final consensus model for the training, internal test and external test sets
| | r2 | | RMSE | | MAE | |
|---|---|---|---|---|---|---|
| | Mean | Std | Mean | Std | Mean | Std |
| Training set | 0.97 | 0.00 | 0.39 | 0.00 | 0.25 | 0.00 |
| Internal test set | 0.87 | 0.00 | 0.75 | 0.01 | 0.55 | 0.01 |
| External Set I | 0.64 | 0.02 | 0.97 | 0.03 | 0.77 | 0.01 |
| TS1 | 0.83 | 0.02 | 0.81 | 0.06 | 0.62 | 0.05 |
| TS2 | 0.47 | 0.03 | 1.02 | 0.03 | 0.78 | 0.02 |
| TS3 | 0.40 | 0.03 | 0.98 | 0.02 | 0.78 | 0.02 |
| TS4 | 0.78 | 0.04 | 1.00 | 0.08 | 0.81 | 0.05 |

| | r2 | | RMSE | | MAE | |
|---|---|---|---|---|---|---|
| | Mean | Std | Mean | Std | Mean | Std |
| External Set I | 0.69 | 0.02 | 0.92 | 0.03 | 0.71 | 0.02 |
| TS1 | 0.84 | 0.02 | 0.79 | 0.04 | 0.60 | 0.01 |
| TS2 | 0.56 | 0.04 | 0.93 | 0.04 | 0.70 | 0.03 |
| TS3 | 0.47 | 0.03 | 0.92 | 0.03 | 0.73 | 0.02 |
| TS4 | 0.80 | 0.02 | 0.96 | 0.04 | 0.74 | 0.03 |
r2: squared correlation coefficient of the regression, Std: standard deviation obtained from a five-iteration loop, RMSE: root mean squared error, MAE: mean absolute error. TS1-TS4: test sets 1 to 4.
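As a minimal sketch of the statistics defined in the footnote, the following Python code computes r2 (as the squared Pearson correlation), RMSE and MAE for one set of predictions, and the mean and standard deviation over repeated runs. The function names are hypothetical, and the use of the sample standard deviation is an assumption; the source does not state which estimator was used.

```python
import math
from statistics import mean, stdev


def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) for paired experimental and predicted log S values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Squared Pearson correlation between experimental and predicted values.
    # Note: this divides by zero if either series is constant.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = cov ** 2 / (var_t * var_p)
    # Root mean squared error and mean absolute error of the predictions.
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae


def mean_and_std(values):
    """Summarise one metric over repeated runs (sample standard deviation assumed)."""
    return mean(values), stdev(values)


# Toy data: every prediction is 0.5 log units below the experimental value.
r2, rmse, mae = regression_metrics([-1.0, -2.0, -3.0, -4.0], [-1.5, -2.5, -3.5, -4.5])
print(r2, rmse, mae)
```

A constant offset leaves the squared correlation at 1 while RMSE and MAE both equal the offset, which is why the tables report all three statistics rather than r2 alone.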
Figure 6. Prediction of aqueous solubility (log S) for the external set I, using the final consensus model.
Aqueous solubility values for the five outlier compounds detected with the consensus model
| Compound | Log S[a] | Log S[b] | Log S[c] | Reference |
|---|---|---|---|---|
| Folic acid | -5.96 | -2.96 | > -2.87 | [ |
| Antipyrine | 0.45 | -1.95 | -0.58 | [ |
| Amiodarone | -10.4 | -7.51 | < -7.17 | [ |
| Cisapride | -6.78 | -4.23 | -4.7 | [ |
| Enalapril | -1.36 | -3.48 | -3.33 | [ |
a Experimental intrinsic solubility values [29];
b Aqueous solubility predicted with the consensus model;
c Experimental aqueous solubility values collected from the literature.
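To make the comparison in the table concrete, the following Python sketch computes the absolute error of the consensus prediction against the intrinsic value and, where the literature reports an exact number rather than a bound, against the literature aqueous value. The data structure and function name are illustrative and not part of the source.

```python
# Rows copied from the table: (compound, intrinsic log S, predicted log S,
# literature aqueous log S). None marks a literature value that is reported
# only as a bound (> -2.87 for folic acid, < -7.17 for amiodarone).
OUTLIERS = [
    ("Folic acid", -5.96, -2.96, None),
    ("Antipyrine", 0.45, -1.95, -0.58),
    ("Amiodarone", -10.4, -7.51, None),
    ("Cisapride", -6.78, -4.23, -4.7),
    ("Enalapril", -1.36, -3.48, -3.33),
]


def absolute_errors(rows):
    """Map compound -> (|pred - intrinsic|, |pred - literature| or None)."""
    result = {}
    for name, intrinsic, predicted, literature in rows:
        err_intrinsic = round(abs(predicted - intrinsic), 2)
        err_literature = (
            None if literature is None else round(abs(predicted - literature), 2)
        )
        result[name] = (err_intrinsic, err_literature)
    return result


errors = absolute_errors(OUTLIERS)
for name, (e_int, e_lit) in errors.items():
    shown = "n/a" if e_lit is None else f"{e_lit:.2f}"
    print(f"{name:12s} vs intrinsic: {e_int:.2f}  vs literature: {shown}")
```

For the three compounds with an exact literature value (antipyrine, cisapride and enalapril), the prediction lies closer to the literature aqueous solubility than to the intrinsic value.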
Performance of Pipeline Pilot (PP) in the prediction of aqueous solubility for the External Set I (N=181)
| External Set I | r2 | RMSE | MAE |
|---|---|---|---|
| | 0.73 | 1.01 | 0.79 |
| | 0.08 | 1.34 | 1.03 |
| | -0.71 | 1.65 | 1.12 |
| | 0.45 | 1.59 | 1.12 |
| | 0.12 | 1.53 | 1.07 |
r2: squared correlation coefficient of the regression, RMSE: root mean squared error, MAE: mean absolute error. TS1-TS4: test sets 1 to 4.
Figure 7. Descriptors most frequently used in the prediction of aqueous solubility (log S).
Performance of the classification model for the training, internal test and external test sets
| | Mean | Std | Mean | Std | Mean | Std | Mean | Std |
|---|---|---|---|---|---|---|---|---|
| Training set | 0.94 | 0.00 | 0.90 | 0.01 | 0.98 | 0.01 | 0.92 | 0.01 |
| Internal test set | 0.88 | 0.01 | 0.78 | 0.01 | 0.96 | 0.01 | 0.80 | 0.01 |
| External Set I | 0.80 | 0.01 | 0.60 | 0.04 | 0.89 | 0.01 | 0.71 | 0.02 |
| TS1 | 0.85 | 0.01 | 0.61 | 0.01 | 0.83 | 0.01 | 0.78 | 0.01 |
| TS2 | 0.79 | 0.01 | 0.71 | 0.01 | 1.00 | 0.01 | 0.71 | 0.01 |
| TS3 | 0.74 | 0.01 | 0.46 | 0.05 | 0.85 | 0.01 | 0.61 | 0.01 |
| TS4 | 0.96 | 0.01 | 0.86 | 0.01 | 0.91 | 0.01 | 0.91 | 0.01 |

| | Mean | Std | Mean | Std | Mean | Std | Mean | Std |
|---|---|---|---|---|---|---|---|---|
| External Set I | 0.81 | 0.01 | 0.61 | 0.04 | 0.92 | 0.01 | 0.70 | 0.01 |
| TS1 | 0.85 | 0.01 | 0.70 | 0.01 | 0.92 | 0.01 | 0.78 | 0.01 |
| TS2 | 0.86 | 0.01 | 0.71 | 0.01 | 1.0 | 0.01 | 0.71 | 0.01 |
| TS3 | 0.76 | 0.01 | 0.50 | 0.05 | 0.88 | 0.02 | 0.64 | 0.01 |
| TS4 | 0.91 | 0.01 | 0.74 | 0.01 | 1.00 | 0.01 | 0.82 | 0.01 |
Std: standard deviation obtained from a five-iteration loop, RMSE: root mean squared error. TS1-TS4: test set 1 to 4.
Figure 8. Workflow for modeling aqueous solubility. From top to bottom, the first rectangle covers the development and validation of the QSPR protocol; the second rectangle repeats the same sequence of steps to automate the prediction of a new external dataset with the consensus model obtained in the development section.
Performance of the final consensus model for the external set II (N=30)
| | r2 | | RMSE | | MAE | |
|---|---|---|---|---|---|---|
| External Set II | Mean | Std | Mean | Std | Mean | Std |
| | 0.43 | 0.01 | 0.73 | 0.02 | 0.56 | 0.01 |
| | 0.66 | 0.01 | 0.59 | 0.01 | 0.46 | 0.01 |
r2: squared correlation coefficient of the regression, Std: standard deviation obtained from a five-iteration loop, RMSE: root mean squared error, MAE: mean absolute error.
a Molecules with Solubility Prediction Variance < 3.
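Footnote a restricts the evaluation to molecules whose Solubility Prediction Variance is below 3. How that variance is computed is not stated here; the Python sketch below assumes it is the sample variance of the individual model predictions for a molecule, and that the consensus value is their mean. All names and the toy predictions are hypothetical.

```python
from statistics import mean, variance

# Cutoff taken from footnote a; everything else in this sketch is assumed.
VARIANCE_CUTOFF = 3.0


def consensus(predictions):
    """Return (consensus log S, prediction variance) for one molecule.

    Assumption: consensus = mean of the individual model predictions,
    variance = sample variance of those predictions.
    """
    return mean(predictions), variance(predictions)


def filter_reliable(molecules, cutoff=VARIANCE_CUTOFF):
    """Keep only molecules whose prediction variance is below the cutoff."""
    kept = {}
    for name, predictions in molecules.items():
        value, spread = consensus(predictions)
        if spread < cutoff:
            kept[name] = value
    return kept


# Toy predictions from three hypothetical models per molecule.
example = {
    "mol_a": [-3.0, -3.2, -2.8],  # models agree: variance 0.04
    "mol_b": [-0.5, -5.5, -3.0],  # models disagree: variance 6.25
}
reliable = filter_reliable(example)
print(reliable)
```

Under these assumptions, molecules on which the individual models disagree strongly are excluded before the statistics are computed, which is one way the second row of the table could reach a higher r2 than the first.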
Summary of the QSPR models (Regression and Classification) for aqueous solubility prediction published in the last ten years
| Ref. | Model type | Method | Ntrain | Ntest | Next | Descriptors | Better Model performance for external sets |
|---|---|---|---|---|---|---|---|
| [ | Regression | MLR, ANN and SVM | 60 | 14 | | Topological and structural descriptors | ANN: |
| [ | Classification | DT, NB, NN, SVM and DF | 762 | 412 | 102 | 1D-3D descriptors and fingerprints | Tset: LS (86.7 %), MS (83.9 %), HS (88.7 %) |
| [ | Classification | RF, SVM, BRANNs | 711 | 131 | T1:747 | pharmacophore fingerprint, MOE, Volsurf and ParaSurf08 descriptors | External test set 1(accuracy): 64.7 % (three classes) |
| [ | Classification | SVM | 41501 | 4510 | 32 | Fingerprints and physicochemical properties | Test set (accuracy): 84 % |
| [ | Regression | RF | 3970 | D1: 26 | | Fingerprints | D2: |
| [ | Regression | PLS | 1004 | 252 | | Log P, TPSA and melting point | |
| [ | Regression | RNN | D1: 1029 | D1: 115 | | Feature vectors | R2 = 0.92; RMSE = 0.58 |
| [ | Regression | MLR, MLREM, BRANNLP | 3567 | 911 | 32 | 86 Volsurf descriptors | MLR: |
| [ | Regression | AMP, LoRep | 2093 | 522 | 43 | 32 HYBOT and DRAGON descriptors (2 descriptors) | |
| [ | Classification/ Regression | Global: MLR, RF, SVM | D1: 818 | D1: 204 | | 12 2D-physicochemical descriptors | D1: Consensus: R2test = 0.93; RMSE = 0.58 |
| [ | Regression | Global: MLR, | 349 | 38 | | 15 physicochemical descriptors (HYBOT, DRAGON, SYBYL and VolSurf+) | RF: R2test = 0.74; RMSE = 0.72 |
| [ | Regression | CNN | 1116 | | | Fingerprints | R2 = 0.93; RMSE = 0.56 |
| [ | Regression | RF, GBDT, MT-DNN | 1708 | D1:1207 | | Topological and 2D descriptors | MT-ESTD R2test-1 = 0.94; RMSE = 0.69 |
| [ | Regression | DNN | 8949 | 994 | 62 | RDKit descriptors | R2ext = 0.39; RMSE = 0.68 |
| [ | Regression | MLR, RF | 4449 | 196 | D1: 21 | RDKit and ABSOLV descriptors | |
| This study | Regression/ Classification | RF, GBT | T2: 12431 | 1834 | | 65 | Consensus model: |
| | | | T1: 10592 | | Ext I | 65 | Consensus model: |
ANN: Artificial neural network; SVM: Support vector machines; MLR: Multilinear regression; RF: Random Forest; PLS: Partial Least-Squares; RNN: Recursive Neural Networks; MLREM: multiple linear regression employing an expectation maximization algorithm and a sparse prior method; BRANNLP: Bayesian regularized artificial neural network with a Laplacian prior; AMP: arithmetic mean property; LoRep: local one-parameter regression; RCNN: regression corrected by nearest neighbors; GBDT: Gradient boosting decision tree; MT-DNN: Multitask deep neural network; DNN: Deep neural network; GBT: Gradient Boosting Tree; DT: Decision Tree; NB: Naïve Bayes; NN: Neural Network; DF: Decision Forest.
a Results obtained when the intrinsic solubility values of the five outliers were replaced by the aqueous solubility values.
Performance of the classification model for the external set II (N=30)
| | Mean | Std | Mean | Std | Mean | Std | Mean | Std |
|---|---|---|---|---|---|---|---|---|
| | 0.83 | 0.01 | 0.67 | 0.01 | 0.73 | 0.01 | 0.93 | 0.01 |

| | Mean | Std | Mean | Std | Mean | Std | Mean | Std |
|---|---|---|---|---|---|---|---|---|
| | 0.80 | 0.01 | 0.60 | 0.02 | 0.73 | 0.01 | 0.87 | 0.01 |
Std: standard deviation obtained from a five-iteration loop, RMSE: root mean squared error.