Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, Alán Aspuru-Guzik.
Abstract
We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.
Year: 2018 PMID: 29532027 PMCID: PMC5833007 DOI: 10.1021/acscentsci.7b00572
Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553
Figure 1. (a) A diagram of the autoencoder used for molecular design, including the joint property prediction model. Starting from a discrete molecular representation, such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. A multilayer perceptron network estimates the value of target properties associated with each molecule. (b) Gradient-based optimization in continuous latent space. After training a surrogate model f(z) to predict the properties of molecules based on their latent representation z, we can optimize f(z) with respect to z to find new latent representations expected to have high values of desired properties. These new latent representations can then be decoded into SMILES strings, at which point their properties can be tested empirically.
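The optimization loop in panel (b) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `surrogate` is a toy concave function standing in for a trained predictor f(z), the gradient is estimated by central finite differences rather than by backpropagation, and `step_size` and `n_steps` are hypothetical settings. The decoding step back to SMILES is omitted.

```python
def surrogate(z):
    """Toy stand-in for f(z): a concave bowl that peaks at z = (1, -2, 0.5)."""
    peak = (1.0, -2.0, 0.5)
    return -sum((zi - pi) ** 2 for zi, pi in zip(z, peak))


def numerical_gradient(f, z, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((f(z_plus) - f(z_minus)) / (2 * eps))
    return grad


def gradient_ascent(f, z0, step_size=0.1, n_steps=200):
    """Move z uphill on f by repeatedly stepping along the gradient."""
    z = list(z0)
    for _ in range(n_steps):
        g = numerical_gradient(f, z)
        z = [zi + step_size * gi for zi, gi in zip(z, g)]
    return z


z_start = [0.0, 0.0, 0.0]   # hypothetical latent vector of a known molecule
z_opt = gradient_ascent(surrogate, z_start)
```

In a real setting the optimized vector would then be passed to the decoder, and the properties of the decoded molecule checked empirically, as the caption describes.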
Figure 2. Representations of the sampling results from the variational autoencoder. (a) Kernel density estimation (KDE) of each latent dimension of the autoencoder, i.e., the distribution of encoded molecules along each dimension of our latent space representation; (b) histogram of sampled molecules for a single point in the latent space; the distances of the molecules from the original query are shown by the lines corresponding to the right axis; (c) molecules sampled near the location of ibuprofen in latent space. The values below the molecules are the distances in latent space from the decoded molecule to ibuprofen; (d) spherical linear interpolation (slerp) between two molecules in latent space using six steps of equal distance.
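The interpolation in panel (d) can be sketched as follows. This is a generic spherical linear interpolation routine, not the authors' code; the function names, the two-dimensional example vectors, and the fallback to linear interpolation for (anti)parallel inputs are illustrative assumptions. Inputs are assumed to be non-zero vectors.

```python
import math


def slerp(z0, z1, t):
    """Spherical linear interpolation between z0 and z1 at fraction t in [0, 1]."""
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = sum(a * b for a, b in zip(z0, z1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)                # angle between the two vectors
    s = math.sin(omega)
    if abs(s) < 1e-8:
        # Parallel or antiparallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


def interpolate(z0, z1, n_steps=6):
    """Return n_steps + 1 points, endpoints included, at equal angular spacing."""
    return [slerp(z0, z1, k / n_steps) for k in range(n_steps + 1)]


# Hypothetical 2-D latent vectors standing in for two encoded molecules.
path = interpolate([1.0, 0.0], [0.0, 1.0])
```

Each point on `path` would be decoded to a SMILES string to obtain the intermediate molecules.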
Figure 3. Two-dimensional PCA of the latent space of the variational autoencoder. The two axes are the principal components selected by the PCA; the color bar shows the value of the selected property. The first column shows the representation of all molecules from the listed data set using autoencoders trained without joint property prediction. The second column shows the representation of molecules using an autoencoder trained with joint property prediction. The third column shows a representation of random points in the latent space of the autoencoder trained with joint property prediction; the property values for these points are predicted using the property predictor network. The first three rows show the results of training on molecules from the ZINC data set for the logP, QED, and SAS properties; the last two rows show the results of training on the QM9 data set for the LUMO energy and the electronic spatial extent (⟨R²⟩).
Comparison of Molecule Generation Results to Original Datasets
| source | data set | samples | logP | SAS | QED | % in ZINC | % in emol |
|---|---|---|---|---|---|---|---|
| Data | ZINC | 249k | 2.46 (1.43) | 3.05 (0.83) | 0.73 (0.14) | 100 | 12.9 |
| GA | ZINC | 5303 | 2.84 (1.86) | 3.80 (1.01) | 0.57 (0.20) | 6.5 | 4.8 |
| VAE | ZINC | 8728 | 2.67 (1.46) | 3.18 (0.86) | 0.70 (0.14) | 5.8 | 7.0 |
| Data | QM9 | 134k | 0.30 (1.00) | 4.25 (0.94) | 0.48 (0.07) | 0.0 | 8.6 |
| GA | QM9 | 5470 | 0.96 (1.53) | 4.47 (1.01) | 0.53 (0.13) | 0.018 | 3.8 |
| VAE | QM9 | 2839 | 0.30 (0.97) | 4.34 (0.98) | 0.47 (0.08) | 0.0 | 8.9 |
source: the origin of the molecules. Data refers to the original data set, GA to the genetic algorithm baseline, and VAE to our variational autoencoder trained without property prediction.
data set: the data set used, either ZINC or QM9.
samples: the number of samples generated for comparison; for Data rows, this value simply reflects the size of the data set.
logP, SAS, QED: the mean and, in parentheses, the standard deviation of each property for the generated molecules, for comparison with the mean and standard deviation of the same property in the original data set.
logP: the water–octanol partition coefficient.[36]
SAS: the synthetic accessibility score.[37]
QED: the Quantitative Estimate of Drug-likeness,[38] ranging from 0 to 1.
% in ZINC, % in emol: the percentage of molecules generated by each method that are found in two major molecule databases, ZINC and E-molecules,[39] compared against the values for the original data set.
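The table's two kinds of entries, a "mean (standard deviation)" summary and a percentage of generated molecules found in a database, can be computed as in the sketch below. The helper names and the toy values are hypothetical. Two further assumptions: the sample standard deviation is used (the source does not say whether sample or population), and membership is tested on exact string equality, which presumes SMILES strings were canonicalized beforehand.

```python
import statistics


def summarize(values):
    """Format a property column as 'mean (std)' with two decimals."""
    return f"{statistics.mean(values):.2f} ({statistics.stdev(values):.2f})"


def percent_found(generated, reference):
    """Percentage of generated SMILES strings that also occur in a reference set."""
    reference = set(reference)
    hits = sum(1 for smiles in generated if smiles in reference)
    return 100.0 * hits / len(generated)


# Toy inputs, not data from the paper.
toy_logp = [1.0, 2.0, 3.0]
toy_generated = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
toy_database = {"CCO", "CCN", "CCC"}

summary = summarize(toy_logp)                          # "2.00 (1.00)"
overlap = percent_found(toy_generated, toy_database)   # 50.0
```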
MAE Prediction Error for Properties Using Various Methods on the ZINC and QM9 Datasets
| database/property | mean | ECFP | CM | GC | 1-hot SMILES | Encoder | VAE |
|---|---|---|---|---|---|---|---|
| ZINC250k/logP | 1.14 | 0.38 | | 0.05 | 0.16 | 0.13 | 0.15 |
| ZINC250k/QED | 0.112 | 0.045 | | 0.017 | 0.041 | 0.037 | 0.054 |
| QM9/HOMO, eV | 0.44 | 0.20 | 0.16 | 0.12 | 0.12 | 0.13 | 0.16 |
| QM9/LUMO, eV | 1.05 | 0.20 | 0.16 | 0.15 | 0.11 | 0.14 | 0.16 |
| QM9/Gap, eV | 1.07 | 0.30 | 0.24 | 0.18 | 0.16 | 0.18 | 0.21 |
mean: baseline, mean prediction.
ECFP, CM, GC: as implemented in the DeepChem benchmark (MoleculeNet);[40] ECFP, circular fingerprints; CM, Coulomb matrix; GC, graph convolutions.
1-hot SMILES: 1-hot encoding of SMILES used as input to the property predictor.
Encoder: the network trained without decoder loss.
VAE: full variational autoencoder network trained for individual properties.
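For reference, the error metric in the table and its "mean" baseline column can be written out as a short sketch. The function names and the toy numbers are hypothetical, and the baseline is assumed to predict the training-set mean for every test molecule.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_baseline_mae(y_train, y_test):
    """MAE of a baseline that always predicts the training-set mean."""
    train_mean = sum(y_train) / len(y_train)
    return mean_absolute_error(y_test, [train_mean] * len(y_test))


# Toy values, not data from the paper.
y_train = [1.0, 2.0, 3.0]
y_test = [2.0, 4.0]
model_predictions = [2.5, 3.5]

baseline_error = mean_baseline_mae(y_train, y_test)            # 1.0
model_error = mean_absolute_error(y_test, model_predictions)   # 0.5
```

A useful model should score well below the baseline, as every method in the table does relative to the "mean" column.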
Figure 4. Optimization results for the jointly trained autoencoder using 5 × QED − SAS as the objective function. (a) Violin plot comparing the distributions of molecules obtained from normal random sampling, from SMILES optimization via a common chemical transformation with a genetic algorithm, and from optimization on the trained Gaussian process model with varying numbers of training points. To offset differences in computational cost between the random search and the optimization on the Gaussian process model, the results of 400 iterations of random search were compared against the results of 200 iterations of optimization. This graph shows the combined results of four sets of trials. (b) Starting and ending points of several optimization runs on a PCA plot of latent space colored by the objective function. Highlighted in black is the path illustrated in part (c). (c) Spherical interpolation between the actual start and finish molecules using a constant step size. The QED, SAS, and percentile score are reported for each molecule.
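The objective named in the caption, and one plausible reading of the percentile score, can be sketched as below. The `objective` function follows the caption directly; `percentile_score` is an assumption, defined here as the percentage of reference objective values at or below a given value, since the caption does not define it. The reference list is toy data.

```python
def objective(qed, sas):
    """Objective from the caption: reward drug-likeness, penalize synthetic difficulty."""
    return 5.0 * qed - sas


def percentile_score(value, reference_values):
    """Assumed definition: percentage of reference values at or below `value`."""
    at_or_below = sum(1 for r in reference_values if r <= value)
    return 100.0 * at_or_below / len(reference_values)


# Mean QED and SAS of the ZINC data set as reported in the comparison table.
score = objective(qed=0.73, sas=3.05)   # approximately 0.60

# Toy reference distribution of objective values, not data from the paper.
toy_reference = [-1.0, 0.0, 0.5, 2.0]
rank = percentile_score(score, toy_reference)   # 75.0
```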