Jintao Meng1,2,3, Peng Chen4,5, Mohamed Wahib6,7, Mingjun Yang8, Liangzhen Zheng1, Yanjie Wei9, Shengzhong Feng10, Wei Liu3.
Abstract
Intrinsic solubility is a critical property in the pharmaceutical industry that affects the in-vivo bioavailability of small-molecule drugs. However, solubility prediction with artificial intelligence (AI) faces insufficient data, poor data quality, and the lack of unified measurements for AI and physics-based approaches. We collect 7 aqueous solubility datasets and present a dataset curation workflow. Evaluating the curated data with two expanded deep learning methods, we observe improved RMSE scores on all curated thermodynamic datasets. We also compare expanded Chemprop, enhanced with curated data, against a state-of-the-art physics-based approach using Pearson and Spearman correlation coefficients. The expanded Chemprop achieves similar performance, with a Pearson coefficient of 0.930 and a Spearman coefficient of 0.947. Pearson and Spearman values also improve steadily as the number of data points increases. In addition, the computational advantage of AI models enables quick evaluation of a large set of molecules during the hit identification or lead optimization stages, which supports decision making within the time cycle of the drug discovery stage.
Year: 2022 PMID: 35241693 PMCID: PMC8894363 DOI: 10.1038/s41597-022-01154-3
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Statistical information of the number of records in the 7 collected datasets.
| Dataset | No. of Records (Org) | No. of Records (Cln) | No. of Records (Cure) | Weights | Additional Columns of Org Dataset |
|---|---|---|---|---|---|
| AQUA | 1311 | 1311 | 1354 | 1.0 | |
| PHYS | 2010 | 2001 | 2001 | 1.0 | star_flag |
| ESOL | 1128 | 1116 | 1157 | 1.0 | |
| OCHEM | 6525 | 4218 | 3766 | 0.85 | |
| AQSOL | 9982 | 8701 | 9061 | 0.4 | group |
| CHEMBL | 30099 | 30099 | 28675 | 0.8 | comment |
| KINECT | 164273 | 82057 | 81935 | — | temperature, pH value |
“Org” is the original dataset, “Cln” denotes the dataset after Data Filtering, “Cure” is the dataset after Data Curation using the clustering algorithm across multiple datasets, “Weights” denotes the assigned weights for each dataset to identify the dataset quality, and “Additional Columns of Org Dataset” includes special properties reserved by some of the datasets.
Fig. 1 The data curation workflow of filtering, evaluating, and clustering on the 7 collected datasets.
List of information for all cleaned and curated datasets in terms of the name, description, and type of each column.
| Column Name | Description | Type |
|---|---|---|
| Smiles | SMILES representation of compound | String |
| LogS | Experimental aqueous solubility value (LogS) | String |
| Weight | weighted quality score in [0, 1] | Float |
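A minimal sketch of reading a file with the three columns listed above. The CSV layout, the `load_records` helper, and the two sample rows are assumptions made for illustration; they are not taken from the published datasets. Because the table lists LogS as a string, the sketch parses it to a float before numeric use.

```python
import csv
import io

def load_records(handle):
    """Read rows with the columns Smiles, LogS, Weight and validate each one."""
    records = []
    for row in csv.DictReader(handle):
        log_s = float(row["LogS"])      # stored as text, parsed for numeric use
        weight = float(row["Weight"])
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight outside [0, 1]: {weight}")
        records.append({"Smiles": row["Smiles"], "LogS": log_s, "Weight": weight})
    return records

# Two made-up rows for illustration only (not real measurements).
sample = io.StringIO("Smiles,LogS,Weight\nCCO,0.5,1.0\nc1ccccc1,-1.6,0.85\n")
records = load_records(sample)
```

Rejecting weights outside [0, 1] at load time keeps a malformed row from silently skewing any later weighted training step.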
Fig. 2 Redundancy matrices showing the percentage of repetitive molecules between two datasets. The upper table A summarizes the percentages of molecules with the same solubility values, and the lower table B describes the percentages of molecules with different solubility values.
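One entry of each redundancy matrix could be computed roughly as follows. This is a sketch under stated assumptions: molecules are matched by exact SMILES string, "same value" means equal within a small tolerance, and the percentage is taken relative to the first dataset. The source does not specify any of these choices, and the toy dictionaries are placeholders.

```python
def redundancy(a, b, tol=1e-6):
    """Return (% of a's molecules also in b with the same LogS,
    % of a's molecules also in b with a different LogS)."""
    same = sum(1 for s, v in a.items() if s in b and abs(b[s] - v) <= tol)
    diff = sum(1 for s, v in a.items() if s in b and abs(b[s] - v) > tol)
    n = len(a)
    return 100.0 * same / n, 100.0 * diff / n

# Toy datasets mapping SMILES -> LogS (placeholder values).
a = {"CCO": 0.5, "CCC": -1.0, "CCN": 0.2, "CCCl": -0.9}
b = {"CCO": 0.5, "CCC": -1.4, "c1ccccc1": -1.6}
same_pct, diff_pct = redundancy(a, b)
```

Because the denominator is the size of the first dataset, the resulting matrices are not symmetric in general.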
The collected RMSE and confidence intervals of Chemprop or weighted Chemprop trained on the 7 datasets.
| Split Type | Dataset | RMSE ± CI (Org) | RMSE ± CI (Cln) | RMSE ± CI (Cure) |
|---|---|---|---|---|
| Random | AQUA | 0.573 ± 0.037 | 0.583 ± 0.057 | 0.536 ± 0.042 |
| | PHYS | 0.550 ± 0.026 | 0.600 ± 0.032 | 0.515 ± 0.018 |
| | ESOL | 0.596 ± 0.075 | 0.619 ± 0.044 | 0.512 ± 0.047 |
| | OCHEM | 0.548 ± 0.024 | 0.639 ± 0.044 | 0.522 ± 0.017 |
| | AQSOL | 1.023 ± 0.035 | 0.820 ± 0.036 | 0.518 ± 0.022 |
| | CHEMBL | 0.917 ± 0.017 | 0.811 ± 0.016 | 0.499 ± 0.011 |
| | KINECT | 0.401 ± 0.003 | 0.431 ± 0.003 | 0.432 ± 0.003 |
| Scaffold | AQUA | 0.850 ± 0.086 | 0.849 ± 0.075 | 0.697 ± 0.043 |
| | PHYS | 0.833 ± 0.058 | 0.813 ± 0.115 | 0.691 ± 0.092 |
| | ESOL | 0.854 ± 0.097 | 0.808 ± 0.090 | 0.711 ± 0.073 |
| | OCHEM | 0.847 ± 0.067 | 0.808 ± 0.075 | 0.695 ± 0.061 |
| | AQSOL | 1.073 ± 0.062 | 0.968 ± 0.045 | 0.596 ± 0.033 |
| | CHEMBL | 1.040 ± 0.038 | 0.900 ± 0.049 | 0.555 ± 0.031 |
| | KINECT | 0.433 ± 0.015 | 0.461 ± 0.008 | 0.460 ± 0.008 |
The data partition strategies include both random and scaffold splits. Five models are ensembled to improve the model accuracy. We average the RMSE over 8 runs of each model and then calculate the corresponding confidence interval. The original Chemprop is used on the “Org” datasets, and the weighted Chemprop is applied to both the “Cln” and “Cure” datasets.
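The averaging step described above can be sketched as follows. The source does not state how the confidence interval is computed, so this example assumes a two-sided 95% Student t interval over the 8 runs. The RMSE values in `run_rmses` are placeholders, not results from the paper.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mean_and_half_width(values, t_crit):
    """Mean of the per-run scores and the half-width of the t interval."""
    mean = statistics.mean(values)
    half = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half

# Assumption: two-sided 95% Student t value for 8 runs (7 degrees of freedom).
T_CRIT_95_DF7 = 2.365

# Placeholder per-run RMSE values for illustration only.
run_rmses = [0.50, 0.52, 0.54, 0.51, 0.53, 0.55, 0.49, 0.52]
mean, half = mean_and_half_width(run_rmses, T_CRIT_95_DF7)
print(f"{mean:.3f} ± {half:.3f}")
```

With only 8 runs, the t critical value is noticeably larger than the normal value of 1.96, so a normal approximation would understate the interval width.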
Fig. 3 Data curation schedule for the 6 thermodynamic datasets. The datasets are divided into 2 groups: a high-quality group and a low-quality group. Two curation operations, inter-group curation and intra-group curation, are illustrated. The feasible curation operations for each dataset are denoted by the lines. For example, AQUA can be curated with the AQUA, PHYS, and ESOL datasets, and AQSOL can be curated with all datasets in the high-quality group and with CHEMBL.
The collected RMSE and confidence intervals of AttentiveFP when trained on the 7 datasets [29].
| Split Type | Dataset | RMSE ± CI (Org) | RMSE ± CI (Cln) | RMSE ± CI (Cure) |
|---|---|---|---|---|
| Random | AQUA | 0.616 ± 0.027 | 0.639 ± 0.014 | 0.579 ± 0.020 |
| | PHYS | 0.649 ± 0.019 | 0.643 ± 0.013 | 0.551 ± 0.024 |
| | ESOL | 0.642 ± 0.017 | 0.641 ± 0.025 | 0.594 ± 0.022 |
| | OCHEM | 0.6018 ± 0.012 | 0.651 ± 0.020 | 0.6016 ± 0.010 |
| | AQSOL | 0.826 ± 0.027 | 0.760 ± 0.012 | 0.593 ± 0.004 |
| Scaffold | AQUA | 0.743 ± 0.038 | 0.747 ± 0.031 | 0.676 ± 0.038 |
| | PHYS | 0.782 ± 0.037 | 0.789 ± 0.037 | 0.687 ± 0.038 |
| | ESOL | 0.761 ± 0.048 | 0.801 ± 0.043 | 0.731 ± 0.073 |
| | OCHEM | 0.746 ± 0.011 | 0.779 ± 0.019 | 0.703 ± 0.016 |
| | AQSOL | 0.872 ± 0.017 | 0.842 ± 0.019 | 0.630 ± 0.008 |
The data partition strategies include both random and scaffold partitioning, and the partition ratio is [0.8, 0.1, 0.1] for training, testing, and evaluation. In this experiment, 5 models are ensembled, and each run is repeated 8 times to average the RMSE values and calculate the corresponding confidence interval. Because AttentiveFP is time-consuming on very large datasets, results for the CHEMBL and KINECT datasets are not recorded, as their training times exceed 150 hours. The original AttentiveFP is used on the “Org” datasets, and the weighted AttentiveFP is applied to both the “Cln” and “Cure” datasets.
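A random partition with the [0.8, 0.1, 0.1] ratio could look like the sketch below. The `random_split` helper, the fixed seed, and the rounding rule (truncate the first two parts, give the remainder to the third) are assumptions. Scaffold partitioning is not shown because it needs a cheminformatics toolkit to compute molecular scaffolds.

```python
import random

def random_split(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the items and cut them into three consecutive parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_first = int(ratios[0] * n)
    n_second = int(ratios[1] * n)
    return (shuffled[:n_first],
            shuffled[n_first:n_first + n_second],
            shuffled[n_first + n_second:])

# Example: indices of the 1128 records in the original ESOL dataset.
parts = random_split(range(1128))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.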
Fig. 4 Comparison of r² values for ensembled models with the best RMSE scores in Table 3 for Chemprop (left figure) or weighted Chemprop (right figure) when predicting 48 molecules.
Fig. 5 Comparison of R values on ensembled models with the best RMSE scores in Table 3 for Chemprop (left figure) or weighted Chemprop (right figure) when predicting 48 molecules.
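The Pearson and Spearman coefficients named in the abstract can be computed as in this standard-library sketch. Spearman is implemented here as the Pearson correlation of average ranks, which is the usual definition when ties are present. The `measured` and `predicted` lists are placeholders, not values from the 48-molecule evaluation set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder LogS values for illustration only.
measured = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.1, -1.9, -3.5, -3.4]
r = pearson(measured, predicted)
rho = spearman(measured, predicted)
```

Pearson rewards a linear relationship between predicted and measured values, while Spearman only checks that compounds are ranked in the right order, which is often the more relevant question in lead optimization.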
Statistical time-usage (averaged over 100 rounds) of predicting compounds in evaluation, ESOL, and AQSOL datasets with weighted Chemprop on three computers.
| Desktop CPU | Desktop GPU | Time in seconds: Evaluation (48) | Time in seconds: ESOL (1128) | Time in seconds: AQSOL (9982) |
|---|---|---|---|---|
| E3-1225 v6 | — | 1.28 | 8.11 | 86.56 |
| E3-1225 v6 | Quadro P400 | 1.34 | 8.49 | 86.07 |
| Platinum 8180 | — | 0.70 | 9.98 | 107.93 |
| Platinum 8180 | GTX 1050Ti | 0.61 | 8.27 | 86.28 |
| Platinum 8180 | Tesla T4 | 0.62 | 8.42 | 91.20 |
The numbers of molecules contained in these datasets are 48, 1128, and 9982, respectively. The time usage is measured in seconds. The efficiency of the prediction workload is about 4% on the Tesla T4, 6% on the GTX 1050Ti, and 9% on the Quadro P400, so for such an unsaturated workload the running time depends little on the GPU card.
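Timing averaged over 100 rounds, as in the table above, can be measured along these lines. The `average_seconds` helper and the `dummy_predict` stand-in are hypothetical; a real measurement would call the trained model's prediction routine instead.

```python
import time

def average_seconds(predict, batch, rounds=100):
    """Average wall-clock time of one predict(batch) call over several rounds."""
    start = time.perf_counter()
    for _ in range(rounds):
        predict(batch)
    return (time.perf_counter() - start) / rounds

def dummy_predict(batch):
    # Hypothetical stand-in for a model's prediction call.
    return [float(len(smiles)) for smiles in batch]

# 48 placeholder SMILES strings, matching the size of the evaluation set.
avg = average_seconds(dummy_predict, ["CCO"] * 48, rounds=100)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.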
Differences between the issues of concern to AI experts and to drug design experts.
| AI experts | Drug design experts |
|---|---|
| Data volume | Correlation coefficients in compound lead optimization |
| Data quality | Generalization ability on different series of compounds |
| Measurement standard | Computing resource and its running time |
| Predictive accuracy | Open source availability |