Samuel Boobier, David R J Hose, A John Blacker, Bao N Nguyen.
Abstract
Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational interpretation of the dissolution process as a numerical problem led to a small set of selected descriptors and to predictions that are independent of the applied machine learning method. These models gave significantly more accurate predictions than benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in the training data (LogS ± 0.7). Finally, they reproduced the physicochemical relationships between solubility and molecular properties in different solvents, which led to rational approaches to improve the accuracy of each model.
Year: 2020 PMID: 33188226 PMCID: PMC7666209 DOI: 10.1038/s41467-020-19594-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Concepts of solubility prediction and data availability.
a Physical aspects of the dissolution process of a solid and the corresponding descriptors. b Curated solubility datasets for this study and their LogS distributions (N = number of datapoints, T = number of datapoints in training set, S = number of datapoints in test set).
List of descriptors and how they were calculated.
| No. | Nameᵉ | Description | No. | Nameᵉ | Description |
|---|---|---|---|---|---|
| 1 | **E0_gas** | Zero-point energy of optimised gas structure (Hartrees) | 12 | | Dipole moment of solution structure (Debye) |
| 2 | **E0_solv** | Zero-point energy of optimised solution structure (Hartrees) | 13 | | Sum of charges on solution structure oxygen atoms |
| 3 | **DeltaE0_sol** | Solvation energy calculated as E0_solv − E0_gas (Hartrees) | 14 | | Sum of charges on solution structure carbon atoms |
| 4 | **G_gas** | Gibbs free energy of optimised gas structure (Hartrees) | 15 | | Charge on most negative atom of solution structure |
| 5 | | Gibbs free energy of optimised solution structure (Hartrees) | 16 | | Charge on most positive atom of solution structure |
| 6 | | Solvation energy calculated as G_solv − G_gas (Hartrees) | 17 | | Sum of charges on solution structure non-hydrogen/carbon atoms |
| 7 | **HOMO** | HOMO energy of gas phase structure of the solute (eV) | 18 | | Molar volume (cm³ mol⁻¹) |
| 8 | **LUMO** | LUMO energy of gas phase structure of the solute (eV) | 19 | | Solvent Accessible Surface Area (Å²) |
| 9 | | Energy gap between solute LUMO and solvent HOMO (eV) | 20 | | Molecular weight (Daltons) |
| 10 | | Energy gap between solvent LUMO and solute HOMO (eV) | 21 | **N_atoms** | Number of all atoms in molecule |
| 11 | **gas_dip** | Dipole moment of gas structure (Debye) | 22 | | Experimental melting point (°C) |
ᵃGaussian 09-derived descriptors were computed using DFT at the B3LYP/6-31+G(d) level; solution structures were calculated using the Polarizable Continuum Model (IEFPCM) for the solvent.
ᵇDerived using PyMOL and the molecular structure optimised with Gaussian 09.
ᶜDerived with Python.
ᵈFrom Reaxys.
ᵉThe descriptors N_atoms, E0_gas, E0_solv, DeltaE0_sol, G_gas, gas_dip, HOMO and LUMO (in bold font) were removed from the descriptor list.
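The derived descriptors in the list above (No. 6, 9 and 10) are simple differences of quantities taken from the quantum chemical output. The following is a minimal sketch of that arithmetic, not the authors' code: the function names, the example energies and the sign convention for the energy gaps (LUMO minus HOMO) are assumptions made for illustration.

```python
# Illustrative only: function names, inputs and the LUMO-minus-HOMO sign
# convention are assumptions, not taken from the paper.

HARTREE_TO_KJ_PER_MOL = 2625.4996  # standard conversion factor, rounded


def solvation_energy(g_solution: float, g_gas: float) -> float:
    """Descriptor 6: G_solv - G_gas, with both energies in Hartrees."""
    return g_solution - g_gas


def hartree_to_kj_per_mol(energy_hartree: float) -> float:
    """Convert an energy from Hartrees to kJ/mol for easier reading."""
    return energy_hartree * HARTREE_TO_KJ_PER_MOL


def frontier_gaps(solute_homo: float, solute_lumo: float,
                  solvent_homo: float, solvent_lumo: float) -> tuple:
    """Descriptors 9 and 10 (eV), assuming each gap is LUMO minus HOMO."""
    gap_solute_lumo_solvent_homo = solute_lumo - solvent_homo
    gap_solvent_lumo_solute_homo = solvent_lumo - solute_homo
    return gap_solute_lumo_solvent_homo, gap_solvent_lumo_solute_homo


if __name__ == "__main__":
    # Hypothetical energies, chosen only to exercise the functions.
    dg = solvation_energy(g_solution=-0.512, g_gas=-0.500)
    print(f"Solvation energy: {dg:.3f} Hartree "
          f"({hartree_to_kj_per_mol(dg):.1f} kJ/mol)")
    print(frontier_gaps(solute_homo=-6.0, solute_lumo=-1.0,
                        solvent_homo=-8.0, solvent_lumo=0.5))
```

A negative solvation energy under this convention means the solution structure is lower in free energy than the gas structure.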
Fig. 2 Results of initial machine learning prediction models.
a Descriptor correlation analysis; b principal component analysis of the descriptors with Water_set_wide; plots of predicted vs experimental LogS, with predicted errors, using the GP algorithm for c Water_set_wide, d Water_set_narrow, e Ethanol_set, f Benzene_set and g Acetone_set; h distributions of predicted errors (1 standard deviation) for each dataset with GP; i impact of the removal of a single descriptor on ET prediction models (blue: Water_set_wide, orange: Benzene_set); j feature importance plot for ET prediction models (blue: Water_set_wide, orange: Benzene_set).
Table of prediction model metrics using machine learning methods with five datasetsᵃ.
| No. | Dataset | Metricᵇ | MLR | PLS | ANN | SVM | GP | RF | ET | Bag | Stdevᶜ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R² | 0.80 | 0.80 | 0.90 | 0.89 | 0.88 | 0.90 | 0.90 | 0.02 | ||
| 2 | RMSE | 1.15 | 1.16 | 0.84 | 0.85 | 0.89 | 0.83 | 0.82 | 0.06 | ||
| 3 | %LogS±0.7 | 50.5 | 51.6 | 58.9 | 68.4 (91.6)ᵉ | 60.0 | 66.3 | 58.9 | 5.64 | ||
| 4 | %LogS±1.0 | 65.2 | 66.3 | 78.9 | 78.9 | 74.7 (94.7)ᵉ | 75.8 | 76.8 | 3.24 | ||
| 5 | R² | 0.58 | 0.57 | 0.73 | 0.69 | 0.68 | 0.69 | 0.69 | 0.03 | ||
| 6 | RMSE | 1.07 | 1.08 | 0.77 | 0.87 | 0.86 | 0.81 | 0.81 | 0.07 | ||
| 7 | %LogS±0.7 | 62.1 | 62.1 | 67.2 | 72.4 (93.1)ᵉ | 67.2 | 72.4 | 65.5 | 4.35 | ||
| 8 | %LogS±1.0 | 74.1 | 72.4 | 77.6 | 75.9 (96.6)ᵉ | 81.0 | 81.0 | 2.89 | |||
| 9 | R² | 0.68 | 0.68 | 0.74 | 0.72 | 0.75 | 0.72 | 0.02 | |||
| 10 | RMSE | 0.82 | 0.83 | 0.74 | 0.76 | 0.73 | 0.77 | 0.03 | |||
| 11 | %LogS±0.7 | 61.7 | 61.7 | 68.7 | 65.2 | 60.9 | 66.1 | 60.9 | 3.41 | ||
| 12 | %LogS±1.0 | 80.0 | 80.0 | 81.7 | 81.7 (98.3)ᵉ | 82.6 | 80.9 | 81.7 | 1.30 | ||
| 13 | R² | 0.29 | 0.29 | 0.49 | 0.51 | 0.51 | 0.50 | 0.52 | 0.02 | ||
| 14 | RMSE | 0.98 | 0.99 | 0.88 | 0.81 | 0.80 | 0.81 | 0.80 | 0.04 | ||
| 15 | %LogS±0.7 | 50.7 | 51.4 | 64.1 | 64.1 | 66.2 (93.7)ᵉ | 64.8 | 62.7 | 1.04 | ||
| 16 | %LogS±1.0 | 72.5 | 71.8 | 76.8 | 78.9 | 77.5 (95.1)ᵉ | 78.9 | 79.6 | 2.20 | ||
| 17 | R² | 0.64 | 0.64 | 0.67 | 0.71 | 0.70 | 0.72 | 0.72 | 0.03 | ||
| 18 | RMSE | 0.66 | 0.66 | 0.63 | 0.58 | 0.58 | 0.57 | 0.57 | 0.03 | ||
| 19 | %LogS±0.7 | 75.5 | 74.5 | 77.7 | 76.6 | 76.6 | 76.6 | 75.5 | 0.78 | ||
| 20 | %LogS±1.0 | 86.2 | 85.1 | 88.3 | 89.4 | 89.4 | 0.87 | ||||
| 21 | R² | 0.36 | 0.35 | 0.42 | 0.40 | 0.40 | 0.41 | 0.01 | |||
| 22 | RMSE | 0.87 | 0.87 | 0.87 | 0.84 | 0.84 | 0.02 | ||||
| 23 | %LogS±0.7 | 60.9 | 62.0 | 67.4 | 68.5 (91.3)ᵉ | 62.0 | 63.0 | 62.0 | 4.68 | ||
| 24 | %LogS±1.0 | 78.3 | 80.4 | 79.3 | 81.5 | 80.4 | 78.2 | 80.4 | 1.25 |
ᵃMachine learning methods were applied using the scikit-learn and GPy packages in Python[49,50].
ᵇThe best model for each metric with each dataset is in bold.
ᶜStandard deviation of the metrics for ANN, SVM, GP, RF, ET and Bag.
ᵈMetrics obtained by limiting the evaluation to the LogS = −4 to 1 zone only.
ᵉMetrics in brackets are calculated including the entire predicted error range of each predicted solubility.
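The four metrics reported in the table can be reproduced from paired experimental and predicted LogS values. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it takes R2 to be the coefficient of determination, and it reads the bracketed GP figures as counting a prediction as a hit when any point of its predicted error range falls within the tolerance. The function names and the example values are hypothetical.

```python
import math

# Illustrative only: metric definitions and the treatment of predicted
# errors are assumptions; the data below are made up for demonstration.


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error in LogS units."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def percent_within(y_true, y_pred, tol, pred_err=None):
    """Percentage of predictions within +/- tol of the experimental LogS.

    If pred_err is given (one predicted error per point), the residual is
    reduced by that error before comparison, so the whole predicted range
    is considered (assumed reading of the bracketed metrics).
    """
    if pred_err is None:
        pred_err = [0.0] * len(y_true)
    hits = sum(
        1 for t, p, e in zip(y_true, y_pred, pred_err)
        if max(abs(p - t) - e, 0.0) <= tol
    )
    return 100.0 * hits / len(y_true)


if __name__ == "__main__":
    experimental = [-3.0, -1.0, 0.0, -5.0]
    predicted = [-2.5, -1.9, 0.2, -3.5]
    errors = [0.1, 0.3, 0.1, 0.6]  # hypothetical 1-sigma predicted errors
    print("R2:", round(r_squared(experimental, predicted), 2))
    print("RMSE:", round(rmse(experimental, predicted), 2))
    print("%LogS±0.7:", percent_within(experimental, predicted, 0.7))
    print("%LogS±1.0:", percent_within(experimental, predicted, 1.0))
    print("%LogS±0.7 with errors:",
          percent_within(experimental, predicted, 0.7, errors))
```

The %LogS ± 0.7 threshold corresponds to the expected noise level in the training data quoted in the abstract, which is why it is reported alongside R2 and RMSE.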
Model metrics for Water_set_wide and Water_set_narrow using HF/SMD descriptors.
| Dataset | Method | %LogS ± 0.7ᵃ | %LogS ± 1.0ᵃ |
|---|---|---|---|
| Water_set_wide | ANN | 68.4 (+9.5) | 84.2 (+5.3) |
| | SVM | 72.6 (+1.1) | 83.2 (+4.2) |
| | ET | 69.5 (+3.2) | 84.2 (+0.0) |
| | GP | 70.5 (+2.1) | 82.1 (+8.4) |
| Water_set_narrow | ANN | 70.4 (+1.7) | 82.6 (−1.7) |
| | SVM | 68.7 (+3.5) | 85.2 (+3.5) |
| | ET | 67.0 (+0.9) | 81.7 (+0.9) |
| | GP | 73.9 (+0.9) | 81.7 (+0.0) |
ᵃThe changes compared to those obtained using DFT/PCM descriptors are in brackets.
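Each cell in this table pairs the HF/SMD result with its change, in percentage points, relative to the DFT/PCM result, so the DFT/PCM baseline can be recovered as value minus change. The helper below is a hypothetical sketch of that bookkeeping; the function names are illustrative, and it accepts the Unicode minus sign that appears in the table as well as the ASCII hyphen.

```python
import re

# Illustrative only: parse_cell and baseline are hypothetical helper names.
# Matches cells such as "68.4 (+9.5)" or "82.6 (−1.7)".
_CELL = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*\(\s*([+\-\u2212])\s*(\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_cell(cell: str) -> tuple:
    """Split a table cell into (value, signed change in percentage points)."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"Unrecognised cell format: {cell!r}")
    value = float(match.group(1))
    magnitude = float(match.group(3))
    # Treat both the ASCII hyphen and the Unicode minus as negative.
    change = magnitude if match.group(2) == "+" else -magnitude
    return value, change


def baseline(cell: str) -> float:
    """Recover the DFT/PCM figure implied by an HF/SMD cell."""
    value, change = parse_cell(cell)
    return round(value - change, 1)


if __name__ == "__main__":
    for cell in ["68.4 (+9.5)", "82.6 (\u22121.7)", "84.2 (+0.0)"]:
        print(cell, "->", baseline(cell))
```

Rounding to one decimal place mirrors the precision of the table and hides floating-point noise from the subtraction.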
Fig. 3 Benchmarking results against other predictive models.
Predicted vs experimental LogS for a ET model; b GSE model; c AquaSol model; d EPI Suite 1 model, e EPI Suite 2 model; f COSMOtherm calculations; for g, ET model; h COSMOtherm calculations; for i ET model; j COSMOtherm calculations; for k ET model; l COSMOtherm calculations; and prediction results using datasets from AstraZeneca m functional group distribution analysis for dataset from AstraZeneca and ; predicted vs experimental LogS for n ET model for (without m.p.); o ET model for (without m.p.); p ET model for (without m.p.); q COSMOtherm calculations for ; r COSMOtherm calculations for ; and s COSMOtherm calculations for .