Josep Arús-Pous, Simon Viet Johansson, Oleksii Prykhodko, Esben Jannik Bjerrum, Christian Tyrchan, Jean-Louis Reymond, Hongming Chen, Ola Engkvist.
Abstract
Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models using LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately. Specifically, a model trained with randomized SMILES was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models that better represent the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the amount of unique molecules with the same distribution of properties compared to one trained with canonical SMILES.
Keywords: Chemical databases; Deep learning; Generative models; Randomized SMILES; Recurrent Neural Networks; SMILES
Year: 2019 PMID: 33430971 PMCID: PMC6873550 DOI: 10.1186/s13321-019-0393-0
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Traversal of the molecular graph of aspirin using three methods: a the canonical ordering of the molecule; b atom order randomization without RDKit restrictions; c atom order randomization with RDKit restrictions, using the same atom ordering as b. Atom ordering is specified with a number from 1 to 13 for each atom and the arrows show the molecular graph traversal process. Notice that the atom ordering is altered in c, prioritizing the sidechains (red arrows) when traversing a ring and preventing SMILES substrings like c1cc(c(cc1))
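The traversal idea behind randomized SMILES can be illustrated with a toy depth-first writer over an acyclic molecular graph. This is a simplified sketch, not RDKit's algorithm: it ignores rings, bond orders and aromaticity, and the `dfs_smiles` helper and isopropanol example are our own illustration. Different starting atoms (orderings) yield different, equally valid SMILES for the same molecule:

```python
def dfs_smiles(adj, labels, atom, parent=None):
    """Emit a SMILES-like string by depth-first traversal from `atom`.

    Acyclic graphs only: every neighbor except the parent becomes a
    branch; all but the last branch are wrapped in parentheses.
    """
    out = labels[atom]
    children = [n for n in adj[atom] if n != parent]
    for i, child in enumerate(children):
        sub = dfs_smiles(adj, labels, child, atom)
        out += sub if i == len(children) - 1 else "(" + sub + ")"
    return out

# Isopropanol: atom 1 is the central carbon bonded to two carbons and an oxygen.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
labels = {0: "C", 1: "C", 2: "C", 3: "O"}

print(dfs_smiles(adj, labels, 0))  # start at a methyl carbon -> CC(C)O
print(dfs_smiles(adj, labels, 3))  # start at the oxygen      -> OC(C)C
```

Shuffling the atom ordering before such a traversal is what turns one molecule into many training strings.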
Fig. 2 Architecture of the RNN model used in this study. At every step, the input one-hot encoded token goes through an embedding layer, followed by GRU/LSTM layers with dropout in between, and then a linear layer whose output dimensionality is the size of the vocabulary. Lastly, a softmax is used to obtain the token probability distribution. The hidden state matrix is passed between consecutive steps
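One decoding step of the architecture in Fig. 2 can be sketched in plain Python. This is an illustrative miniature, not the paper's implementation: a vanilla tanh recurrence stands in for the GRU/LSTM cell, and the vocabulary, dimensions and random weights are all hand-picked assumptions:

```python
import math
import random

random.seed(0)
VOCAB = ["^", "C", "O", "(", ")", "$"]  # toy token vocabulary
EMB, HID = 4, 5                          # embedding / hidden sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(len(VOCAB), EMB)   # embedding table
W = rand_matrix(HID, EMB)          # input-to-hidden weights
U = rand_matrix(HID, HID)          # hidden-to-hidden weights
V = rand_matrix(len(VOCAB), HID)   # hidden-to-vocabulary projection

def step(token, h):
    """One step: embed token, update hidden state, project, softmax."""
    x = E[VOCAB.index(token)]
    h = [math.tanh(sum(W[i][j] * x[j] for j in range(EMB))
                   + sum(U[i][j] * h[j] for j in range(HID)))
         for i in range(HID)]
    logits = [sum(V[k][j] * h[j] for j in range(HID)) for k in range(len(VOCAB))]
    z = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]     # token probability distribution
    return probs, h

probs, h = step("^", [0.0] * HID)         # feed the start token
```

Sampling a SMILES string amounts to repeating `step` and drawing the next token from `probs` until the end token is produced.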
Training and validation set sizes for the different benchmarks
| Model | Training set size | Validation set size |
|---|---|---|
| GDB-13 1M | 1,000,000 | 10,000 |
| GDB-13 10K | 10,000 | 1000 |
| GDB-13 1K | 1000 | 1000 |
| ChEMBL | 1,483,943 | 78,102 |
Notice that depending on the expected size of the target chemical space and the total amount of molecules, different ratios have been used
Hyperparameter combinations used in the grid search
| Model | l | w | d | b | RNN |
|---|---|---|---|---|---|
| GDB-13 1M | 3 | 512 | 0, 25, 50 | 64, 128, 256, 512 | GRU, LSTM |
| GDB-13 10K | 2, 3, 4 | 256, 384, 512 | 0, 25, 50 | 8, 16, 32 | LSTM |
| GDB-13 1K | 2, 3, 4 | 128, 192, 256 | 0, 25, 50 | 4, 8, 16 | LSTM |
| ChEMBL | 3 | 512 | 0, 25, 50 | 64, 128, 256, 512 | LSTM |
Number of layers (l), dimensions of the RNN layers (w), dropout rate % (d), batch size (b), RNN cell type (RNN)
Best models trained on subsets of GDB-13 after the hyperparameter optimization
| Set | SMILES | Time | % GDB-13 | Valid | Unif | Comp | Closed | UCC |
|---|---|---|---|---|---|---|---|---|
| 1M | Canonical | 4:08 | 72.8 | 0.994 | 0.879 | 0.861 | 0.633 | |
| | Rand. unr. | 31:47 | 80.9 | 0.995 | 0.970 | 0.929 | 0.876 | 0.790 |
| | Rand. unr. no DA | 1:37 | 77.0 | 0.987 | 0.957 | 0.795 | 0.883 | 0.672 |
| | Rand. rest. no DA | 1:21 | 78.2 | 0.992 | 0.957 | 0.829 | 0.898 | 0.712 |
| | DS branch | 1:33 | 72.1 | 0.987 | 0.881 | 0.828 | 0.834 | 0.608 |
| | DS rings | 1:11 | 68.6 | 0.979 | 0.852 | 0.788 | 0.798 | 0.535 |
| | DS both | 1:05 | 68.4 | 0.979 | 0.851 | 0.785 | 0.796 | 0.532 |
| 10K | Canonical | 0:04 | 38.8 | 0.905 | 0.666 | 0.445 | 0.426 | 0.126 |
| 1K | Canonical | 0:01 | 14.5 | 0.504 | 0.611 | 0.167 | 0.133 | 0.014 |
See “Methods” section for a description of the ratios
The best result for each training set size is indicated in italics
Set benchmark training set size; SMILES SMILES variant, including randomized variants with and without data augmentation (DA); Time training time in hh:mm; % GDB-13 percent of unique molecules from GDB-13 generated in a 2 billion sample with replacement; Valid fraction of valid SMILES; Unif uniformity ratio; Comp completeness ratio; Closed closedness ratio; UCC UCC ratio
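The rows of the table above are numerically consistent with the UCC ratio being the plain product of the uniformity, completeness and closedness ratios. This definition is an assumption inferred from the table values, not restated from the text, and can be checked directly:

```python
def ucc(unif, comp, closed):
    # Assumed definition: UCC as the product of the three component ratios,
    # consistent (to rounding) with every complete row of the table.
    return unif * comp * closed

# (Unif, Comp, Closed, reported UCC) taken from the table above
rows = [
    (0.970, 0.929, 0.876, 0.790),  # 1M, randomized unrestricted
    (0.957, 0.795, 0.883, 0.672),  # 1M, rand. unr. no DA
    (0.666, 0.445, 0.426, 0.126),  # 10K, canonical
    (0.611, 0.167, 0.133, 0.014),  # 1K, canonical
]
for unif, comp, closed, reported in rows:
    assert abs(ucc(unif, comp, closed) - reported) < 1e-3
```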
Fig. 3 Plot illustrating the percent of GDB-13 sampled alongside the sample size of the ideal model (blue) and the best of the canonical (yellow), randomized restricted (green) and randomized unrestricted (orange) models. Notice that the ideal model is always an upper bound and, given an arbitrarily large sample, would eventually sample the entire GDB-13. The trained models would reach the same point much later
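The ideal-model curve in Fig. 3 corresponds to uniform sampling with replacement, whose expected coverage has a simple closed form. A sketch (the function name and the approximate GDB-13 size of about 975 million molecules are our assumptions):

```python
def expected_coverage(sample_size, space_size):
    """Expected fraction of a `space_size`-molecule space seen after
    `sample_size` uniform draws with replacement: 1 - (1 - 1/N)**n."""
    return 1.0 - (1.0 - 1.0 / space_size) ** sample_size

GDB13_SIZE = 975_000_000  # approximate size of GDB-13 (assumption)
print(expected_coverage(2_000_000_000, GDB13_SIZE))  # 2 billion samples
```

Because draws repeat, even the ideal model covers well under 100% of the space at the 2 billion sample size used in the benchmark.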
Fig. 4 Histograms of different statistics from the randomized SMILES models. a Kernel Density Estimates (KDEs) of the number of randomized SMILES per molecule for a sample of 1 million molecules from GDB-13. The plot has the x axis cut at 5000, but the unrestricted randomized variant plot has outliers up to 15,000. b KDEs of the molecule negative log-likelihood (NLL) for each molecule (summing the probabilities of each randomized SMILES) for the same sample of 1 million molecules from GDB-13. The plot is likewise cropped to a limited NLL range. c Histograms of the NLLs of all the restricted randomized SMILES of two molecules from GDB-13
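Panel b's molecule NLL aggregates the probability mass of all randomized SMILES of a molecule; numerically this is a log-sum-exp over the per-SMILES NLLs. A minimal sketch (the helper name is ours):

```python
import math

def molecule_nll(smiles_nlls):
    """NLL of a molecule from the NLLs of its individual SMILES strings:
    P(mol) = sum_i P(smiles_i)  =>  NLL(mol) = -log(sum_i exp(-nll_i)),
    computed stably by factoring out the smallest NLL."""
    m = min(smiles_nlls)
    return m - math.log(sum(math.exp(m - nll) for nll in smiles_nlls))

# Two equally likely SMILES, each with probability 0.25 -> P(mol) = 0.5
print(molecule_nll([math.log(4), math.log(4)]))
```

The stable form matters in practice: a molecule can have thousands of randomized SMILES, each with a very large NLL, so summing raw probabilities would underflow.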
Fig. 5 Linear regression plots between the UC-JSD and the UCC ratio. a Canonical SMILES. b Restricted randomized SMILES. c Unrestricted randomized SMILES
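The UC-JSD underlying Fig. 5 is built on the Jensen-Shannon divergence between the sampled-set, training-set and validation-set NLL distributions. A generic discrete JSD over equally weighted distributions conveys the core computation (histogram binning of the NLLs is left out, and the function names are ours):

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jsd(dists):
    """Jensen-Shannon divergence among equally weighted discrete
    distributions: entropy of the mixture minus the mean entropy."""
    mix = [sum(col) / len(dists) for col in zip(*dists)]
    return entropy(mix) - sum(entropy(p) for p in dists) / len(dists)

same = [0.2, 0.3, 0.5]
print(jsd([same, same, same]))  # identical distributions -> ~0
```

A UC-JSD near zero means the three NLL distributions overlap, which is why it correlates with the UCC ratio in the regression plots.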
Best models from the ChEMBL benchmark for both SMILES variants
| SMILES | Time | % Valid | % Unique | FCD |
|---|---|---|---|---|
| Canonical | 131:32 | 98.26 | 34.67 | 0.0712 |
| Rest. Random. | 84:22 | 98.33 | 64.09 | 0.1265 |
SMILES SMILES variant; Time time used to train the model in hhh:mm; % Valid percent of valid molecules; % Unique percent of unique molecules in a 2 billion SMILES sample; FCD Fréchet ChemNet Distance between the validation set and a sample of 75,000 generated molecules
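The FCD reported above is the Fréchet distance between Gaussians fitted to ChemNet activations of the two molecule sets. The univariate special case conveys the formula; the real metric is multivariate over the activation vectors and needs a matrix square root, so this is only an illustrative sketch:

```python
import statistics

def frechet_1d(xs, ys):
    """Frechet distance between 1-D Gaussians fitted to two samples:
    (mu1 - mu2)^2 + s1^2 + s2^2 - 2*s1*s2."""
    mu1, mu2 = statistics.fmean(xs), statistics.fmean(ys)
    s1, s2 = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mu1 - mu2) ** 2 + s1 ** 2 + s2 ** 2 - 2.0 * s1 * s2

print(frechet_1d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical samples -> 0.0
```

The distance is zero only when both fitted Gaussians coincide, which is why a lower FCD indicates a generated set closer to the validation set.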
Fig. 6 Kernel Density Estimates (KDEs) of the molecule negative log-likelihoods (NLLs) of the ChEMBL models for the canonical SMILES variant (left) and the randomized SMILES variant (right). Each line symbolizes a different subset of 50,000 molecules from: the training set (green), the validation set (orange), the randomized SMILES model (blue) and the canonical SMILES model (yellow). Notice that the molecule NLLs for the randomized SMILES model (right) are obtained by summing the probabilities of all the randomized SMILES of each of the 50,000 molecules (adding up to 320 million randomized SMILES), whereas those from the canonical model are the NLLs of the canonical SMILES of the 50,000 molecules