Gabriela Falcón-Cano, Christophe Molina, Miguel Ángel Cabrera-Pérez.
Abstract
Computational models that predict aqueous solubility from molecular structure are a promising strategy in drug design and discovery. Since the first "Solubility Challenge", these initiatives have marked the state of the art of the modelling algorithms used to predict drug solubility. In this regard, the quality of the input experimental data and its influence on model performance have been frequently discussed. In our previous study, we developed a computational model for aqueous solubility based on recursive random forest approaches. The aim of the current commentary is to analyse the performance of this already trained predictive model on the molecules of the second "Solubility Challenge". Even though our training set has inconsistencies related to the pH, solid form and temperature conditions of the solubility measurements, the model was able to predict the two sets from the second "Solubility Challenge" with statistics comparable to those of the top-ranked models. Finally, to ensure the applicability and reproducibility of our model, we provide an automated KNIME workflow for predicting the aqueous solubility of new drug candidates during the early stages of drug discovery and development.
Keywords: ADME; KNIME; Quantitative Structure-Property Relationship (QSPR); Random Forest; Second Solubility Challenge; aqueous solubility; machine learning; supervised recursive variable selection
Year: 2021 PMID: 35300359 PMCID: PMC8920098 DOI: 10.5599/admet.979
Source DB: PubMed Journal: ADMET DMPK ISSN: 1848-7718
Performance of the final consensus model for the molecules of the second “Solubility Challenge”
| Test | r2 Mean | r2 Std | r2 Mean | r2 Std | RMSE Mean | RMSE Std | MAE Mean | MAE Std | Bias Mean | Bias Std | % ± 0.5 log Mean | % ± 0.5 log Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test Set 1 | | 0.01 | 0.58 | 0.01 | | 0.03 | 0.74 | 0.03 | -0.234 | 0.01 | 40 | 1 |
| Test Set 2 | | 0.02 | 0.78 | 0.01 | | 0.1 | 0.77 | 0.1 | -0.278 | 0.02 | 40 | 6 |
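The statistics in this table can be computed from paired experimental and predicted log S values. The following is a minimal sketch, not the authors' implementation: the function name `solubility_metrics` is hypothetical, r2 is assumed to be the squared Pearson correlation (the source does not say which definition each r2 column uses), and the bias is assumed to be the mean of predicted minus experimental values.

```python
import math


def solubility_metrics(observed, predicted, tolerance=0.5):
    """Summarise agreement between experimental and predicted log S values."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")

    # Assumed residual convention: predicted minus observed.
    residuals = [p - o for o, p in zip(observed, predicted)]

    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("r2 is undefined when either series is constant")

    return {
        # Assumed definition: squared Pearson correlation coefficient.
        "r2": cov ** 2 / (var_obs * var_pred),
        # Root mean squared error and mean absolute error, in log units.
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "mae": sum(abs(r) for r in residuals) / n,
        # Mean signed error; negative means under-prediction on average.
        "bias": sum(residuals) / n,
        # Percentage of molecules predicted within +/- tolerance log units.
        "pct_within": 100.0 * sum(abs(r) <= tolerance for r in residuals) / n,
    }
```

With the default `tolerance=0.5`, `pct_within` corresponds to the share of molecules whose absolute residual does not exceed 0.5 logarithm units, the threshold used to flag molecules in Figure 1.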
Figure 1. Plot of log S (predicted) vs log S0 (experimental) for both test sets. Molecules with residual values higher than 0.5 (logarithm units) are highlighted in red.
Figure 2. Comparison between the top-ranked models of the Second Solubility Challenge and our results (according to RMSE).
Figure 3. Overlap analysis of log S0 against log Sw for the molecules shared between the second “Solubility Challenge” and the training set. For modelling purposes, these overlapping molecules were removed from the training set.
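Removing the overlapping molecules from the training set, as described in the Figure 3 caption, can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' KNIME workflow: the source does not say how overlapping molecules were matched, so the `key` field (some unique structure identifier) and the function name `remove_overlap` are hypothetical.

```python
def remove_overlap(training_records, test_keys):
    """Return the training records whose structure key is absent from the test set."""
    # Strip surrounding whitespace so formatting noise does not hide a match.
    blocked = {key.strip() for key in test_keys}
    # Build a new list; the input list is left unchanged.
    return [
        record for record in training_records
        if record["key"].strip() not in blocked
    ]


# Placeholder identifiers and values, for illustration only (not data from the study).
training = [
    {"key": "KEY-A", "log_s": -1.0},
    {"key": "KEY-B", "log_s": -2.0},
]
filtered = remove_overlap(training, ["KEY-B"])
```

Filtering before training keeps the challenge molecules unseen by the model, so the reported test statistics are not inflated by molecules the model has already learned.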
Summary of solubility values for the outliers
| Structure | Name | log | log | log | log Sw[ |
|---|---|---|---|---|---|
| | Amiodarone | -10.4 | -9.35 | -7.54 | -7.17 [ |
| | Cisapride | -6.78 | -5.23 | -4.27 | -4.7 [ |
| | Folic Acid | -5.96 | -5.44 | -3.12 | > -2.87 [ |
a Intrinsic aqueous solubility reported in the second “Solubility Challenge”.
b Aqueous solubility reported for the three outliers in the initial source set.
c Aqueous solubility reported in other sources.
Mean (Std) statistics for predicting the second test set of the second “Solubility Challenge” with our method (Recursive Random Forest, consensus) versus a single RRF, using two training sets: reliable solubility measurements (data challenge) and literature solubility data.
| Model | r2 (validation), reliable solubility measurements (data challenge) | RMSE (validation), reliable solubility measurements (data challenge) | r2 (validation), literature solubility data (reported in Initial Data Source) | RMSE (validation), literature solubility data (reported in Initial Data Source) |
|---|---|---|---|---|
| Recursive Random Forest (consensus) | 0.30 (0.05) | 1.79 (0.06) | 0.29 (0.05) | 1.80 (0.05) |
| Single Random Forest Regression | 0.19 (0.01) | 1.93 (0.02) | 0.14 (0.06) | 1.98 (0.06) |
*The results are reported as Mean (Std). The Std was computed by repeating the modelling procedure 10 times.
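The Mean (Std) format used in this table can be produced from the per-repetition values of a statistic. This is a minimal sketch: the helper name `summarise_runs` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the source does not state which estimator was used.

```python
import statistics


def summarise_runs(values, digits=2):
    """Format repeated-run results as 'Mean (Std)'."""
    if len(values) < 2:
        raise ValueError("at least two repetitions are needed for a Std")
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(values)
    return f"{mean:.{digits}f} ({std:.{digits}f})"
```

Each call takes the values of one statistic (for example, the RMSE from each of the 10 repetitions) and returns the string for one table cell.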