Floriane Montanari, Lara Kuhnke, Antonius Ter Laak, Djork-Arné Clevert.
Abstract
Simple physicochemical properties, like logD, solubility, or melting point, can reveal a great deal about how a compound under development might later behave. These data are typically measured for most compounds in drug discovery projects in a medium-throughput fashion. Collecting and assembling all the Bayer in-house data related to these properties allowed us to apply powerful machine learning techniques to predict the outcome of those assays for new compounds. In this paper, we report our finding that, especially for predicting physicochemical ADMET endpoints, a multitask graph convolutional approach appears to be a highly competitive choice. For seven endpoints of interest, we compared the performance of that approach to fully connected neural networks and different single task models. The new model shows increased predictive performance compared to previous modeling methods and will allow early prioritization of compounds even before they are synthesized. In addition, our model follows the generalized solubility equation without being explicitly trained under this constraint.
Keywords: ADMET prediction; QSAR; graph convolutional networks; multitask learning; solubility
Year: 2019 PMID: 31877719 PMCID: PMC6982787 DOI: 10.3390/molecules25010044
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
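The generalized solubility equation mentioned in the abstract can be illustrated with a short sketch. It assumes the commonly cited Yalkowsky form, log S = 0.5 − 0.01 (MP − 25) − log P, with S in mol/L and the melting point MP in °C; the exact variant the authors compared against is not given in this excerpt, and the function name and example values are hypothetical.

```python
def gse_log_solubility(log_p: float, melting_point_c: float) -> float:
    """Estimate log10 aqueous solubility (mol/L) from logP and melting point.

    Assumed form (Yalkowsky): log S = 0.5 - 0.01 * (MP - 25) - log P.
    The melting-point term is dropped for compounds melting at or below 25 °C.
    """
    mp_term = max(melting_point_c - 25.0, 0.0)
    return 0.5 - 0.01 * mp_term - log_p


if __name__ == "__main__":
    # Hypothetical compound: logP = 2.0, melting point = 125 °C.
    print(gse_log_solubility(2.0, 125.0))
```

A model that "follows" this equation would be expected to produce solubility, lipophilicity, and melting point predictions that are roughly consistent with this relationship, even though the relationship was never imposed during training.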
Table 1. ADMET datasets used to train the models.
| Endpoint | Code | # Compounds | Data Transformation | Helper Task |
|---|---|---|---|---|
| LogD (pH7.5) | LOD | 76,548 | none | no |
| LogD (pH2.3) | LOA | 236,280 | none | no |
| Membrane affinity | LOM | 64,506 | log10 | no |
| Human serum albumin binding | LOH | 61,398 | log10 | no |
| Melting point | LMP | 90,589 | none | no |
| Solubility (DMSO) | LOO | 38,841 | log10(mol/L) | no |
| Solubility (powder) | LOP | 2334 | log10(mol/L) | no |
| Solubility (nephelometry) | LON | 88,301 | log10(mol/L) | yes |
| Solubility (DMSO not fully dissolved) | LOX | 7392 | log10(mol/L) | yes |
| Solubility (no assay annotation) | LOQ | 50,016 | log10(mol/L) | yes |
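The log10(mol/L) transformation listed in the table can be sketched as follows. The raw unit (mg/L here) and the use of molecular weight for the conversion are assumptions for illustration only; the excerpt does not state in which units the solubility assays report their results.

```python
import math


def to_log_molar(solubility_mg_per_l: float, mol_weight_g_per_mol: float) -> float:
    """Convert a solubility in mg/L (assumed raw unit) to log10(mol/L)."""
    if solubility_mg_per_l <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    grams_per_l = solubility_mg_per_l / 1000.0
    mol_per_l = grams_per_l / mol_weight_g_per_mol
    return math.log10(mol_per_l)


if __name__ == "__main__":
    # Hypothetical compound: 18 mg/L at a molecular weight of 180 g/mol.
    print(to_log_molar(18.0, 180.0))
```

Working on the log scale keeps the regression target on a comparable range across compounds whose solubilities differ by several orders of magnitude.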
Figure 1. Pearson’s correlation coefficients between pairs of endpoints. When fewer than 25 compounds were measured for both members of a pair, no correlation is reported. Endpoint codes are listed in Table 1.
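The pairwise analysis described in the Figure 1 caption can be sketched in plain Python. The threshold of 25 shared compounds comes from the caption; the data layout (a dictionary of per-endpoint measurements keyed by compound identifier) and the function names are hypothetical.

```python
import math
from itertools import combinations

MIN_SHARED = 25  # minimum number of shared compounds, from the caption


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (None if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(measurements, min_shared=MIN_SHARED):
    """measurements maps endpoint code -> {compound_id: value}.

    Returns {(code_a, code_b): r}, with None where too few compounds overlap.
    """
    result = {}
    for a, b in combinations(sorted(measurements), 2):
        shared = sorted(measurements[a].keys() & measurements[b].keys())
        if len(shared) < min_shared:
            result[(a, b)] = None  # not reported
            continue
        xs = [measurements[a][c] for c in shared]
        ys = [measurements[b][c] for c in shared]
        result[(a, b)] = pearson(xs, ys)
    return result
```

Restricting each coefficient to compounds measured in both assays matters because the endpoints are sparsely populated: most compounds have values for only a few of them.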
Table 2. Performance of different learning algorithms on the ten ADMET endpoints. We report the average over the cluster split cross-validation folds (not used for parameter tuning). The best performing method is given in bold, as are those whose standard deviations overlap with it (see supplementary Table S1 for the standard deviations of the folds).
| Endpoint | Random Forest R² | Random Forest Spearman | STNN a R² | STNN a Spearman | STNN Graph Conv b R² | STNN Graph Conv b Spearman | MTNN c R² | MTNN c Spearman | MTNN Graph Conv d R² | MTNN Graph Conv d Spearman |
|---|---|---|---|---|---|---|---|---|---|---|
| LOD e | 0.63 | 0.81 | 0.78 | 0.89 | **…** | **…** | 0.75 | 0.88 | **…** | **…** |
| LOA f | 0.49 | 0.76 | 0.72 | 0.89 | **…** | **…** | 0.64 | 0.86 | 0.91 | 0.96 |
| LOM g | 0.43 | 0.68 | 0.53 | 0.75 | **…** | **…** | **…** | **…** | **…** | **…** |
| LOH h | 0.39 | 0.65 | 0.49 | 0.73 | 0.56 | 0.76 | 0.49 | 0.73 | **…** | **…** |
| LMP i | **…** | 0.63 | 0.31 | 0.66 | **…** | **…** | 0.35 | 0.64 | **…** | **…** |
| LOO j | 0.43 | **…** | **…** | **…** | **…** | **…** | **…** | **…** | **…** | **…** |
| LOP k | **…** | 0.49 | **…** | 0.48 | **…** | **…** | **…** | **…** | **…** | **…** |
| LON l | 0.50 | 0.69 | 0.53 | 0.74 | **…** | 0.75 | 0.54 | 0.73 | **…** | **…** |
| LOX m | 0.33 | 0.61 | 0.37 | 0.64 | 0.33 | 0.65 | **…** | 0.72 | **…** | **…** |
| LOQ n | 0.46 | 0.70 | 0.51 | 0.74 | **…** | 0.77 | **…** | 0.75 | **…** | **…** |
a single task neural network, b single task graph convolutional network, c multitask neural network, d multitask graph convolutional network, e logD, f logD in acidic pH, g membrane affinity, h human serum albumin binding, i melting point, j solubility from DMSO, k solubility from powder, l solubility from nephelometry, m solubility from DMSO not fully dissolved, n solubility no assay information.
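The two metrics reported in the table can be computed per fold and then averaged, as in the sketch below. It assumes that R² is the coefficient of determination (1 − SS_res / SS_tot) rather than a squared Pearson correlation, which this excerpt does not specify; all function names are hypothetical.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_t * var_p)


def mean_over_folds(fold_scores):
    """Average a metric over cross-validation folds."""
    return sum(fold_scores) / len(fold_scores)
```

Reporting both metrics is informative because R² penalizes errors in the predicted values themselves, while Spearman only measures how well the compounds are ranked, which is often what matters for prioritization.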
Table 3. Performance of the multitask graph convolutional model without helper tasks. Average over the cluster split cross-validation folds. In parentheses, the difference from the results of the multitask graph convolutional model in Table 2.
| Endpoint | R² | Spearman |
|---|---|---|
| LOD a | 0.87 (−0.01) | 0.94 |
| LOA b | 0.92 (+0.01) | 0.96 |
| LOM c | 0.71 | 0.84 |
| LOH d | 0.65 | 0.83 (+0.01) |
| LMP e | 0.52 (+0.01) | 0.73 |
| LOO f | 0.57 (−0.02) | 0.76 (−0.01) |
| LOP g | 0.56 | 0.74 (−0.02) |
a logD, b logD in acidic pH, c membrane affinity, d human serum albumin binding, e melting point, f solubility from DMSO, g solubility from powder.
Table 4. Performance of the multitask graph convolutional model on a time-split dataset.
| Endpoint | R² | Spearman | RMSE | Test Set Size |
|---|---|---|---|---|
| LOD a | 0.87 | 0.93 | 0.33 | 32,794 |
| LOA b | 0.91 | 0.96 | 0.35 | 46,481 |
| LOM c | 0.69 | 0.86 | 0.49 | 197 |
| LOH d | 0.55 | 0.78 | 0.61 | 614 |
| LMP e | 0.35 | 0.59 | 45 °C | 55 |
| LOO f | 0.48 | 0.74 | 0.90 | 22,803 |
| LOP g | 0.54 | 0.75 | 0.84 | 935 |
a logD, b logD in acidic pH (random split), c membrane affinity, d human serum albumin binding, e melting point, f solubility from DMSO, g solubility from powder.
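A time split of the kind evaluated in the table above can be sketched as follows: measurements before a cutoff date form the training set and later ones form the test set, on which the RMSE is then computed. The record layout, the cutoff date, and the function names are hypothetical; the excerpt does not say which date was used.

```python
import math
from datetime import date


def time_split(records, cutoff):
    """Split (compound_id, measurement_date, value) records at a cutoff date.

    Records measured before the cutoff go to training, the rest to testing.
    """
    train = [r for r in records if r[1] < cutoff]
    test = [r for r in records if r[1] >= cutoff]
    return train, test


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the endpoint."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical measurements; the cutoff is chosen only for illustration.
    records = [
        ("cmpd-1", date(2017, 5, 1), 1.2),
        ("cmpd-2", date(2018, 3, 1), 2.4),
        ("cmpd-3", date(2019, 1, 15), 3.1),
    ]
    train, test = time_split(records, date(2018, 1, 1))
    print(len(train), len(test))
```

Compared with a random or cluster split, a time split mimics prospective use: the model is asked to predict compounds that were made after everything it was trained on.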
Figure 2. Distribution of molecular properties (number of rotatable bonds, number of aromatic rings, molecular weight, number of H bond acceptors, and number of H bond donors) in the aggregated dataset containing 537,443 unique molecules tested in at least one of the endpoints of interest.
Figure 3. Input feature preprocessing and architecture of the fully connected neural networks. When there is only one output unit, the network is a single task neural network (STNN).