Elman Mansimov1, Omar Mahmood2, Seokho Kang3, Kyunghyun Cho4,5,6,7 (kyunghyun.cho@nyu.edu). 1. Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, 60 5th Avenue, New York, New York, 10011, United States. 2. Center for Data Science, New York University, 60 5th Avenue, New York, New York, 10011, United States. 3. Department of Systems Management Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, 16419, Republic of Korea. 4. Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, 60 5th Avenue, New York, New York, 10011, United States. 5. Center for Data Science, New York University, 60 5th Avenue, New York, New York, 10011, United States. 6. Facebook AI Research, 770 Broadway, New York, New York, 10003, United States. 7. CIFAR Azrieli Global Scholar, Canadian Institute for Advanced Research, 661 University Avenue, Toronto, ON, M5G 1M1, Canada.
Abstract
A molecule's geometry, also known as its conformation, is one of a molecule's most important properties, determining the reactions it participates in, the bonds it forms, and the interactions it has with other molecules. Conventional conformation generation methods minimize hand-designed molecular force field energy functions that are often not well correlated with the true energy function of a molecule observed in nature. They generate geometrically diverse sets of conformations, some of which are very similar to the lowest-energy conformations and others of which are very different. In this paper, we propose a conditional deep generative graph neural network that learns an energy function by directly learning, in a data-driven manner, to generate molecular conformations that are energetically favorable and more likely to be observed experimentally. On three large-scale datasets containing small molecules, we show that our method generates a set of conformations that on average is far more likely to be close to the corresponding reference conformations than are those obtained from conventional force field methods. Our method maintains geometrical diversity by generating conformations that are not too similar to each other, and is also computationally faster. We also show that our method can be used to provide initial coordinates for conventional force field methods. On one of the evaluated datasets we show that this combination allows us to combine the best of both methods, yielding generated conformations that are on average close to the reference conformations, with some being very similar to the reference conformations.
The three-dimensional (3-D) coordinates of atoms in a molecule are commonly referred to as the molecule's geometry or conformation. The task of predicting possible valid coordinates of a molecule, known as conformation generation, is important for determining a molecule's chemical and physical properties[1]. Conformation generation is also a vital part of applications such as generating 3-D quantitative structure-activity relationships (QSAR), structure-based virtual screening and pharmacophore modeling[2]. Conformations can be determined in a physical setting using instrumental techniques such as X-ray crystallography, as well as using other experimental techniques. However, these methods are typically time-consuming and costly.

A number of computational methods have been developed for conformation generation over the past few decades[2]. Typically this problem is approached by using a force field energy function to calculate a molecule's energy, and then minimizing this energy with respect to the molecule's coordinates. This hand-designed energy function yields an approximation of the molecule's true potential energy observed in nature, based on the molecule's atoms, bonds and coordinates. The minimum of this energy function corresponds to the molecule's most stable configuration. Although this approach has been commonly used to generate a geometrically diverse set of conformations, with certain conformations being similar to the lowest-energy conformations, it has been shown that molecular force field energy functions are often a crude approximation of the actual molecular energy[3].

In this paper, we propose a deep generative graph neural network that learns the energy function from data in an end-to-end fashion by generating molecular conformations that are energetically favorable and more likely to be observed experimentally[4]. This is done by maximizing the likelihood of the reference conformations of the molecules in the dataset. We evaluate and compare our method with conventional molecular force field methods on three databases of small molecules by calculating the root-mean-square deviation (RMSD) between generated and reference conformations. We show that conformations generated by our model are on average far more likely to be close to the reference conformation than those generated by conventional force field methods, i.e., the variance of the RMSD between generated and reference conformations is lower for our method. Despite having lower variance, our method does not generate geometrically similar conformations. We also show that our approach is computationally faster than force field methods.

A disadvantage of our model is that, in general, for a given molecule the best conformation generated by our model lies further from the reference conformation than the best conformation generated by force field methods. We show that for the QM9 small molecule dataset, the best of both methods can be combined by using the conformations generated by the deep generative graph neural network as an initialization for the force field method.
Conformation Generation
We consider a molecule as an undirected, complete graph G = (V, E), where V is a set of vertices corresponding to atoms, and E is a set of edges representing the interactions between pairs of atoms from V. Each atom is represented as a vector v_i ∈ ℝ^{d_v} of node features, and the edge between the i-th and j-th atoms is represented as a vector e_{ij} ∈ ℝ^{d_e} of edge features. There are M vertices and M(M − 1)/2 edges. We define a plausible conformation as one that may correspond to a stable configuration of a molecule. Given the graph of a molecule, the task of molecular geometry prediction is the generation of a set of plausible conformations X = {X_1, …, X_S}, where X_a = (x_1^a, …, x_M^a) and x_i^a ∈ ℝ³ is a vector of the 3-D coordinates of the i-th atom in the a-th conformation.

Molecules can transition between conformations and end up in different local minima based on the stability of the respective conformations and environmental conditions. As a result, there is more than one plausible conformation associated with each molecule; it is hence natural to formulate conformation generation as finding (local) minima of an energy function E(G, X) defined on a pair of molecule graph and conformation:

{X̂_1, …, X̂_S} = arg local-min_X E(G, X).    (1)

Alternatively, we could sample from a Gibbs distribution:

X̂_s ∼ p(X|G), s = 1, …, S,    (2)

where

p(X|G) = (1/ζ) exp{−E(G, X)},    (3)

where ζ is a normalizing constant. We use S to indicate the number of conformations we generate for each molecule.

Under this view, the problem of conformation generation is decomposed into two stages. In the first stage, a computationally-efficient energy function is constructed. The second stage involves either performing optimization as in Eq. (1) or sampling as in Eq. (2) to generate a set of conformations from this energy function.
Energy function construction
A conventional approach is to define an energy function semi-automatically. The functional form of an energy function is designed carefully to incorporate various chemical properties, whereas the detailed parameters of the energy function are either computationally or experimentally estimated. Two examples of widely used energy functions are the Universal Force Field (UFF)[5] and the Merck Molecular Force Field (MMFF)[6]. In contrast to these methods, here we will describe how to estimate the energy function or probability distribution directly from data using the latest techniques from deep learning.
Energy minimization/sampling
Once the energy function is defined, a conventional approach is to run the minimization many times starting from different initial conformations. Due to the non-convexity of the energy function, each run is likely to end up in a unique local minimum, allowing us to collect a set of many conformations. A typical approach is to use distance geometry (DG)[7] or its variants, such as experimental-torsion basic knowledge distance geometry (ETKDG)[8], to randomly generate an initial conformation that satisfies various geometric constraints, such as lower and upper bounds on the distances between atoms. Starting from the initial conformation, an iterative optimization algorithm, such as L-BFGS[9], gradually updates the conformation until it finds a minimum of the energy function. In this paper, we instead propose an approach based on deep generative models that allows us to sample directly from a distribution over all possible conformations given a molecule graph.
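For concreteness, this conventional two-stage pipeline can be sketched with RDKit's ETKDG embedding and force field minimization routines. The molecule and the conformer count below are illustrative placeholders, not choices made in this paper.

```python
# A minimal sketch of the conventional baseline: embed initial conformations
# with ETKDG, then minimize a hand-designed force field from each one.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, as an example

# Stage 1: random initial conformations satisfying distance-geometry constraints.
params = AllChem.ETKDG()
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=100, params=params)

# Stage 2: iterative minimization of the force field energy, one run per
# initial conformation. Returns (not_converged, energy) pairs per conformer.
results = AllChem.MMFFOptimizeMoleculeConfs(mol)  # use UFFOptimizeMoleculeConfs for UFF
```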
Deep Generative Model for Molecular Geometry
We propose to “learn” an energy function from a database containing many pairs of a molecule and its experimentally obtained conformation. Let D = {(G_1, X_1*), …, (G_N, X_N*)} be a set of examples from such a database, where X_n* is “a” reference conformation, often obtained and verified empirically in a certain environment. These reference conformations may not necessarily correspond to the lowest energy configurations of the molecules, but are energetically favorable and more likely to be observed experimentally. Learning an energy function can then be expressed as the following optimization problem:

max_θ (1/N) Σ_{n=1}^N log p_θ(X_n*|G_n),    (4)

where p_θ(X|G) is a Gibbs distribution defined using E_θ(G, X) as in Eq. (3). In other words, we can learn the energy function by maximizing the log-likelihood of the data D. In principle, the term “energy” has a very specific meaning in each context (e.g., potential energy, statistical free energy, etc.). In our case, “energy” refers to an objective function that reflects the likelihood of a conformation given a molecular graph.
Conditional variational graph autoencoders
We use a conditional version of a variational autoencoder[10] to model the distribution p(X|G) in Eq. (4). This choice enables an underlying model to capture the complicated, multi-modal nature of this distribution, while allowing us to efficiently sample from this distribution. This is done by introducing a set of latent variables Z = {z_1, …, z_M}, where z_i ∈ ℝ^d, and rewriting the conditional log-probability as

log p(X|G) = log ∫ p(X|Z, G) p(Z|G) dZ,    (5)

where we omit the subscript θ for brevity. The marginal log-probability in Eq. (5) is generally intractable to compute, and we instead maximize the stochastic approximation to its lower bound, as is standard practice in problems involving variational inference:

log p(X|G) ≥ E_{Z∼Q(Z|G,X)}[log P(X|Z, G)] − KL(Q(Z|G, X) ‖ P(Z|G))    (6)
≈ (1/K) Σ_{k=1}^K log P(X|Z^k, G) − KL(Q(Z|G, X) ‖ P(Z|G)),    (7)

where Z^k is the k-th sample from the (approximate) posterior distribution Q above. We assume that we can compute the KL divergence analytically, for instance by constructing Q and P to be normal distributions.
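The bound in Eqs. (6-7) translates directly into a training objective. Below is a minimal PyTorch sketch, assuming diagonal Gaussians for Q and P (so the KL term is analytic), K = 1, and a placeholder `decode` callable standing in for the likelihood network; all names and shapes are illustrative.

```python
import torch

def elbo(mu_q, logvar_q, mu_p, logvar_p, decode, x_ref):
    """Stochastic lower bound (Eq. 7) with K = 1 and an analytic KL term."""
    # One reparameterized sample from the approximate posterior Q(Z|G, X).
    z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)
    x_mean = decode(z)  # mean coordinates from the likelihood model (placeholder)
    # Log-likelihood of the reference coordinates under a unit-variance
    # Gaussian, up to an additive constant.
    recon = -0.5 * ((x_ref - x_mean) ** 2).sum()
    # Analytic KL(Q || P) between two diagonal Gaussians.
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1.0).sum()
    return recon - kl
```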
Modeling the graph using a message passing neural network
We use a message passing neural network (MPNN)[11], a variant of a graph neural network[12,13], which operates on a graph G directly and is invariant to graph isomorphism. The MPNN consists of L layers. At each layer l, we update the hidden vector h^l(v) of each node and the hidden vector h(e) of each edge using the equation

h^{l+1}(v) = GRU(h^l(v), J(h^l(v), {h^l(w)}_{w∈V∖{v}}, {h(e_{v,w})}_{w∈V∖{v}})),    (8)

where J is a linear one-layer neural network that aggregates the information from neighboring nodes according to the hidden vectors of the respective nodes and edges. GRU is a gated recurrent network that combines the newly aggregated information with the corresponding hidden vector from the previous layer[14]. The weights of the message passing function J and the GRU are shared across the L layers of the MPNN.
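A minimal PyTorch sketch of one such layer on a complete graph follows. The tensor shapes, the averaging of messages over all other nodes, and the unmasked self-pairs are assumptions made for brevity; in the model described above the same layer weights are reused L times.

```python
# A sketch of one message-passing layer (Eq. 8): a linear network J aggregates
# node/edge states, and a GRU cell merges each node's message with its state.
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.J = nn.Linear(3 * d, d)   # acts on concatenated (h_v, h_w, h_evw)
        self.gru = nn.GRUCell(d, d)

    def forward(self, h, h_e):
        # h: (M, d) node states; h_e: (M, M, d) edge states of a complete graph.
        M, d = h.shape
        pair = torch.cat([h.unsqueeze(1).expand(M, M, d),   # h_v at [v, w]
                          h.unsqueeze(0).expand(M, M, d),   # h_w at [v, w]
                          h_e], dim=-1)
        messages = self.J(pair).mean(dim=1)   # aggregate over the other nodes
        return self.gru(messages, h)          # updated node states, shape (M, d)
```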
Prior parameterization
We use the MPNN described above to model the prior distribution P(Z|G) in Eq. (6). We initialize h^0(v) and h(e) in Eq. (8) as linear transformations of the feature vectors v and e of the nodes and edges respectively:

h^0(v) = U_node^prior v, h(e) = U_edge^prior e,    (9)

where U_node^prior and U_edge^prior are matrices representing the linear transformations for the nodes and edges respectively. The final hidden vector h^L(v) of each node is passed through a two-layer neural network with hidden size d, whose output is transformed into the mean and variance vectors of a Normal distribution with a diagonal covariance matrix:

μ_i = W_μ^prior g(h^L(v_i)) + b_μ^prior,    (10)
σ_i² = exp{W_σ^prior g(h^L(v_i)) + b_σ^prior},    (11)

where W_μ^prior and W_σ^prior are the weight matrices and b_μ^prior and b_σ^prior are the bias terms of the transformations. These are used to form the prior distribution:

P(Z|G) = Π_{i=1}^M Π_{j=1}^d N(z_{i,j}; μ_{i,j}, σ_{i,j}²),    (12)

where μ_{i,j} and σ_{i,j}² are the j-th components of the mean and variance vectors of the i-th node respectively. In other words, we parameterize the prior distribution as a factorized Normal distribution, factored over the vertices and the dimensions in the 3-D coordinate.
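The Gaussian head after the MPNN can be sketched as below. The ReLU nonlinearity is an assumption (the text does not specify one), the variance is produced as a log-variance for numerical stability, and `out_dim` would be the latent dimensionality for the prior or 3 for the likelihood described next.

```python
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps final node states to a diagonal Gaussian (cf. Eqs. 10-12)."""
    def __init__(self, d=50, hidden=100, out_dim=3):
        super().__init__()
        # First layer of the two-layer post-MPNN network; the mu/logvar
        # transformations below act as the second layer.
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, out_dim)       # W_mu, b_mu
        self.logvar = nn.Linear(hidden, out_dim)   # W_sigma, b_sigma (log-variance)

    def forward(self, h):                          # h: (M, d) final node states
        g = self.net(h)
        return self.mu(g), self.logvar(g)          # per-node Gaussian parameters
```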
Likelihood parameterization
We use a similar MPNN to model the likelihood distribution P(X|Z, G) in Eq. (6). The only difference is that this distribution is conditioned not only on the molecular graph G = (V, E) but also on the latent set Z = {z_1, …, z_M}. We incorporate the latent set Z by adding the linear transformation of each node feature vector v_i to its corresponding latent variable z_i. This result is used to initialize the hidden vector:

h^0(v_i) = U_node^likelihood v_i + z_i, h(e) = U_edge^likelihood e,    (13)

where U_node^likelihood and U_edge^likelihood are matrices representing the linear transformations for the nodes and edges respectively. From there on, we run neural message passing as in Eqs. (8–11), with a new set of parameters θ^likelihood, W_μ^likelihood, b_μ^likelihood, W_σ^likelihood and b_σ^likelihood. The final mean and variance vectors are now three-dimensional, representing the 3-D coordinates of each atom, and we can compute the log-probability of the coordinates using Eq. (12).
Posterior parameterization
As computing the exact posterior P(Z|G, X) is intractable, we resort to amortized inference using a parameterized, approximate posterior Q(Z|G, X) in Eq. (6). We use a similar approach to our parameterization of the prior distribution above. However, we replace the input to the MPNN with the concatenation of an edge feature vector e and the corresponding entry of the distance (proximity) matrix D(X*) of the reference 3-D conformation X*:

h(e_{i,j}) = U_edge^posterior [e_{i,j}; D_{i,j}(X*)].    (14)

With a new set of parameters θ^posterior, W_μ^posterior, b_μ^posterior, W_σ^posterior and b_σ^posterior, the MPNN outputs a Normal distribution for each latent variable z_i. The linear weight embeddings of the nodes, U_node, are shared between the prior, likelihood and posterior.
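Building the posterior's edge input amounts to appending the reference pairwise distances to the edge features, as in the following sketch (shapes are assumptions):

```python
# A minimal sketch of Eq. (14): concatenate each edge feature vector with the
# corresponding pairwise distance of the reference conformation.
import torch

def posterior_edge_input(e, x_ref):
    # e: (M, M, d_e) edge features; x_ref: (M, 3) reference coordinates.
    dist = torch.cdist(x_ref, x_ref)                  # (M, M) distance matrix D(X*)
    return torch.cat([e, dist.unsqueeze(-1)], dim=-1) # (M, M, d_e + 1)
```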
Training the conditional variational graph autoencoder
With the choice of Gaussian latent variables z_i, we can use the reparameterization trick[10] to compute the gradient of the stochastic approximation to the lower bound in Eq. (7) with respect to all the parameters of the three distributions. This property allows us to train this model on a large dataset using stochastic gradient descent (SGD). However, there are two major considerations that must be made before training this model on a large molecule database.
Post-alignment likelihood
An important property of conformation generation, compared to a usual regression problem, is that we must take into account rotation and translation. Let R be an alignment function that takes as input a target conformation and a predicted conformation. The function aligns the reference conformation to the predicted conformation and returns the aligned reference conformation: X̂* = R(X, X*) is the conformation obtained by rotating and translating the reference conformation X* to have the smallest distance to the predicted conformation X according to a predefined metric such as the RMSD:

RMSD(X, X̂) = sqrt((1/M) Σ_{i=1}^M ‖x_i − x̂_i‖²).    (15)

This alignment function R is selected according to the problem at hand, and we present below its use in a general form without exact specification. We implement this invariance to rotation and translation by parameterizing the output of the likelihood distribution above to be aligned to the target molecule. That is,

log P(X*|Z, G) = Σ_{i=1}^M log N(x̂_i*; μ_i, σ_i²),    (16)

where x̂_i* is the coordinate of the i-th atom of the reference conformation aligned to the mean conformation {μ_1, …, μ_M}. That is,

(x̂_1*, …, x̂_M*) = R({μ_1, …, μ_M}, X*).    (17)

In other words, we rotate and translate the reference conformation X* to be best aligned to the predicted conformation (or its mean) before computing the log-probability. This encourages the model to assign high probability to a conformation that is easily aligned to the reference conformation X*, which is precisely the goal of maximum log-likelihood.
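The text leaves R unspecified; one standard choice for the optimal rotation under the RMSD metric is the Kabsch algorithm, sketched below in NumPy as an illustration (not necessarily the exact implementation used in the paper).

```python
# A minimal sketch of an alignment function R via the Kabsch algorithm: rotate
# and translate the reference conformation onto the predicted one, then score
# with RMSD (Eq. 15).
import numpy as np

def align(x_ref, x_pred):
    # Center both conformations at the origin.
    ref = x_ref - x_ref.mean(axis=0)
    pred = x_pred - x_pred.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    u, _, vt = np.linalg.svd(ref.T @ pred)
    d = np.sign(np.linalg.det(u @ vt))          # guard against reflections
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    # Rotated reference, translated to the predicted conformation's centroid.
    return ref @ rot + x_pred.mean(axis=0)

def rmsd(x, x_hat):
    return np.sqrt(((x - x_hat) ** 2).sum(axis=1).mean())
```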
Unconditional prior regularization
The second term in the lower bound in Eq. (6), which is the KL divergence between the approximate posterior and the prior, does not have a point minimum but an infinitely long valley of minimum values. Consider the KL divergence between two univariate Normal distributions:

KL(N(μ_1, σ_1²) ‖ N(μ_2, σ_2²)) = log(σ_2/σ_1) + (σ_1² + (μ_1 − μ_2)²)/(2σ_2²) − 1/2.    (18)

When both distributions are shifted by the same amount, the KL divergence remains unchanged. This could lead to a difficulty in optimization, as the means of the posterior and prior distributions could both diverge. In order to prevent this pathological behavior, we introduce an unconditional prior distribution P(Z), which is a factorized Normal distribution:

P(Z) = Π_{i=1}^M N(z_i; 0, I),    (19)

where N(·; 0, I) computes a Normal probability density and I is a d × d identity matrix. We minimize the KL divergence between the original prior distribution P(Z|G) and this unconditional prior distribution P(Z), in addition to maximizing the lower bound, leading to the following final objective function for each molecule:

log P(X|Z^1, G) − KL(Q(Z|G, X) ‖ P(Z|G)) − α KL(P(Z|G) ‖ P(Z)),    (20)

where we assume K = 1 and introduce a coefficient α ≥ 0.
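The shift invariance behind this pathology is easy to verify numerically; the small check below (with arbitrary illustrative parameters) shows the KL divergence is unchanged when both means drift together, which is what the unconditional prior term anchors against.

```python
# Numeric check of Eq. (18)'s shift invariance: shifting both Gaussians by the
# same amount leaves the KL divergence unchanged, so the means can diverge
# without penalty unless anchored to a fixed prior.
import torch
from torch.distributions import Normal, kl_divergence

p = Normal(torch.tensor(0.0), torch.tensor(1.0))
q = Normal(torch.tensor(2.0), torch.tensor(0.5))
p_shift = Normal(p.loc + 100.0, p.scale)
q_shift = Normal(q.loc + 100.0, q.scale)

print(kl_divergence(q, p))              # identical value...
print(kl_divergence(q_shift, p_shift))  # ...despite both means sitting near 100
```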
Inference: predicting molecular geometry
Learning a conditional variational autoencoder as above corresponds to the first stage of conformation generation, that is, the stage of energy function construction. Once the energy function is constructed, we need to sample multiple conformations from the Gibbs distribution defined using the energy function, which is log P(X|G) in Eq. (5). Our parameterization of the Gibbs distribution using a directed graphical model[15] allows us to efficiently sample from this distribution. We first sample from the prior distribution, Z ∼ P(Z|G), and then sample from the likelihood distribution, X ∼ P(X|Z, G). In practice, we fix the output variance σ² of the likelihood distribution to 1 and take the mean set {μ_1, …, μ_M} as a sample from the model.
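This two-step sampling procedure can be sketched as follows; `prior_net` and `likelihood_net` are placeholders for the trained MPNNs, and the interfaces are assumptions.

```python
# A minimal sketch of inference: sample Z from the prior's Gaussians, then
# take the likelihood network's mean coordinates as the generated conformation.
import torch

@torch.no_grad()
def generate_conformations(graph, prior_net, likelihood_net, n_samples=100):
    mu_p, logvar_p = prior_net(graph)                 # (M, d) each
    conformations = []
    for _ in range(n_samples):
        z = mu_p + torch.randn_like(mu_p) * torch.exp(0.5 * logvar_p)
        mu_x, _ = likelihood_net(graph, z)            # (M, 3) mean coordinates
        conformations.append(mu_x)
    return conformations
```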
Experimental Setup
Data
We experimentally verify the effectiveness of the proposed approach using three databases of molecules: QM9[16,17], COD[18] and CSD[19]. These datasets are selected as they possess properties distinct from each other, which allows us to carefully study various aspects of the proposed approach. There is an overlap between the COD and CSD databases, since both are based on published crystallography data. We only keep molecules from each database that can be processed by RDKit. We further remove disconnected compounds, i.e., those whose Simplified Molecular-Input Line-Entry System[20] (SMILES) representation contains '.'. See Fig. 1 for some other properties of these three datasets.
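The filtering step described above reduces to a simple predicate, sketched here with RDKit (the function name is illustrative):

```python
# A minimal sketch of the dataset filter: keep only molecules RDKit can parse
# and drop disconnected compounds (SMILES containing '.').
from rdkit import Chem

def keep_molecule(smiles: str) -> bool:
    if "." in smiles:                               # disconnected compound
        return False
    return Chem.MolFromSmiles(smiles) is not None   # parseable by RDKit
```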
Figure 1
Dataset Characteristics: information regarding the atoms, bonds, molecular mass and symmetry of molecules in each dataset.
QM9
The filtered QM9 dataset contains 133,015 molecules, each of which contains up to 9 heavy atoms of types C, N, O and F. Each molecule is paired with a reference conformation obtained by optimizing the molecular geometry with density functional theory (DFT) at the B3LYP/6-31G(2df,p) level of theory, which implies that the reference conformations are obtained from the same environment. We hold out two separate sets of 5,000 randomly selected molecules each as validation and test sets, respectively.
COD
We use the organic part of the COD dataset. We further filter out any molecule that contains more than 50 heavy atoms of types B, C, N, O, F, Si, P, S, Cl, Ge, As, Se, Br, Te and I. This results in 66,663 molecules, out of which we hold out two separate sets of 3,000 randomly selected molecules each for validation and test purposes, respectively. Reference conformations are voluntarily contributed to the dataset and are often determined either experimentally or by DFT calculations[21]. Thus, the reference conformations are obtained from different environments.
CSD
Similarly to COD, we remove any molecule that contains more than 50 heavy atoms, resulting in a total of 236,985 molecules. We hold out two separate sets of 3,000 randomly selected molecules each for validation and test purposes, respectively. This dataset contains organic and metal-organic crystallographic structures which have been observed experimentally[19]. The atom types in this dataset are S, N, P, Be, Tc, Xe, Br, Rh, Os, Zr, In, As, Mo, Dy, Nb, La, Te, Th, Ga, Tl, Y, Cr, F, Fe, Sb, Yb, Tb, Pu, Am, Re, Eu, Hg, Mn, Lu, Nd, Ce, Ge, Sc, Gd, Ca, Ti, Sn, Ir, Al, K, Tm, Ni, Er, Co, Bi, Pr, Rb, Sm, O, Pt, Hf, Se, Np, Cd, Pd, Pb, Ho, Ag, Mg, Zn, Ta, V, B, Ru, W, Cl, Au, U, Si, Li, C and I. The reference conformations are obtained from crystal structures.
Models
Baselines
As a point of reference, we minimize a force field starting from a conformation created using ETKDG[8]. We test both UFF and MMFF, and call the resulting approaches ETKDG + UFF and ETKDG + MMFF, respectively. The environment from which each conformation is obtained affects the force field calculations. To keep comparisons fair and to abstract away the effects of the environment, we use the implementations in RDKit[22] with the default hyperparameters. These default implementations have often been used in the literature when comparing different conformation generation methods[23-25].
Conditional variational graph autoencoder
We build one conditional variational graph autoencoder for each dataset. We use d = 50 hidden units at each layer of neural message passing (Eq. 8) in each of the three MPNNs corresponding to the prior, likelihood and posterior distributions, and 100 hidden units in the two-layer neural network that follows the MPNN. As described earlier, we fix the variance of the output of the likelihood distribution to 1. We use L = 3 layers per network for QM9 and L = 5 layers per network for COD and CSD. We chose these hyperparameter values by carrying out a grid search and choosing the values that performed best on the validation set. The grid search procedure and the performance of models with different hyperparameters are shown in the Supplementary Information.
Learning
For all models, we use dropout[26] at each layer of the neural network that follows the MPNN, with a dropout rate of 0.2, to regularize learning. We set the coefficient α in Eq. (20) to 10⁻⁵. We train each model using Adam[27] with a fixed learning rate of 3 × 10⁻⁴. All models were trained with a batch size of 20 molecules on a single Nvidia GPU with 12 GB of RAM.
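A single optimization step on the objective of Eq. (20) could then look like the sketch below; `model` and its return values are placeholders for the three-MPNN conditional VAE and are not part of the paper's code.

```python
# A minimal sketch of one SGD step on Eq. (20), assuming a `model` that
# returns the reconstruction term and the two KL terms for a batch.
import torch

def train_step(model, optimizer, batch, alpha=1e-5):
    recon, kl_posterior_prior, kl_prior_uncond = model(batch)
    # Negate the objective since optimizers minimize.
    loss = -(recon - kl_posterior_prior - alpha * kl_prior_uncond)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4), batch size 20.
```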
Inference
There are two modes of inference with the proposed approach. The first is to sample from a trained conditional variational graph autoencoder by first sampling from the prior distribution and then taking the mean vectors from the likelihood distribution; we refer to this as CVGAE. We can then use these samples as initializations for MMFF minimization; we refer to this as CVGAE + MMFF. The latter approach can be thought of as a trainable approach to initializing a conformation in place of DG or ETKDG.
Evaluation
In principle, the quality of the sampled conformations should be evaluated based on their molecular energies, for instance by DFT, which is often more accurate than force field methods[3]. However, the computational complexity of the DFT calculation is superlinear with respect to the number of electrons in a molecule, and so is often impractical[28]. Instead, we follow prior work on conformation generation[1] and evaluate the baselines and the proposed method using the RMSD (Eq. 15) of the heavy atoms between a reference conformation and a predicted conformation, which is fast and simple to calculate.
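One convenient way to compute this metric is RDKit's best-match RMSD, sketched below; whether the paper used this exact routine is an assumption, but it implements the same heavy-atom RMSD after optimal alignment.

```python
# A minimal sketch of the evaluation metric: heavy-atom RMSD between a
# generated conformer and a reference conformer, with RDKit searching over
# symmetry-equivalent atom matchings and alignments.
from rdkit import Chem
from rdkit.Chem import rdMolAlign

def heavy_atom_rmsd(probe_mol, ref_mol):
    # Removing hydrogens restricts the comparison to heavy atoms.
    return rdMolAlign.GetBestRMS(Chem.RemoveHs(probe_mol), Chem.RemoveHs(ref_mol))
```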
Results
When evaluating each method, we first sample 100 conformations per molecule for each method in the test set. We can make several observations from Table 1. First, compared to the other methods, our proposed CVGAE always succeeds at generating the specified number of conformations for every molecule in the test set. UFF and MMFF fail to generate conformations for some molecules, as they only handle a predefined set of elements. Since all other evaluated approaches were unsuccessful at generating at least one conformation for a very small number of test molecules, we report results for the molecules for which all evaluated methods generated at least one conformation. We report the median of the mean RMSD, the median of the standard deviation of the RMSD and the median of the best (lowest) RMSD among all generated conformations for each test molecule. Across all three datasets, every evaluated method achieves roughly the same median of the mean RMSD. More importantly, the standard deviation of the RMSD achieved by CVGAE is significantly lower than that achieved by ETKDG + Force Field. After the initial generation stage, conformations are usually further evaluated and optimized by running the computationally expensive DFT optimization. Reducing the standard deviation can lower the number of conformations on which DFT optimization has to be run in order to obtain a valid conformation. On the other hand, the best RMSD achieved by the ETKDG + UFF/MMFF methods is lower than that achieved by CVGAE. Using MMFF initialized by CVGAE (CVGAE + MMFF) instead of ETKDG (ETKDG + MMFF) improves the mean results on the QM9 dataset over CVGAE alone, and yields a lower standard deviation and a similar best RMSD compared to ETKDG + MMFF. Unfortunately, CVGAE + MMFF worsens the results achieved by CVGAE alone on the COD and CSD datasets. We additionally evaluate the single-point DFT energy for a subset of 1,000 molecules in the QM9 test set for all 100 generated conformations. We find that the three methods ETKDG + MMFF, CVGAE and CVGAE + MMFF achieve similar median energy values of −411.52, −410.87 and −411.50, respectively. The energy was calculated using the GAMESS software[29] with default parameters.
Table 1
Number of successfully processed molecules in the test set (success per test set ↑), number of successfully generated conformations out of 100 (success per molecule ↑), median of mean RMSD (mean ↓), median of standard deviation of RMSD (std. dev. ↓) and median of best RMSD (best ↓) per molecule on the QM9, COD and CSD datasets. ETKDG stands for Distance Geometry with experimental torsion-angle preferences.
Dataset | Metric               | ETKDG + UFF | ETKDG + MMFF | CVGAE | CVGAE + MMFF
--------|----------------------|-------------|--------------|-------|-------------
QM9     | success per test set | 96.440%     | 96.440%      | 100%  | 99.760%
QM9     | success per molecule | 98.725%     | 98.725%      | 100%  | 98.684%
QM9     | mean                 | 0.425       | 0.415        | 0.390 | 0.367
QM9     | std. dev.            | 0.176       | 0.189        | 0.017 | 0.074
QM9     | best                 | 0.126       | 0.092        | 0.325 | 0.115
COD     | success per test set | 99.133%     | 99.133%      | 100%  | 95.367%
COD     | success per molecule | 99.627%     | 99.627%      | 100%  | 99.071%
COD     | mean                 | 1.389       | 1.358        | 1.331 | 1.656
COD     | std. dev.            | 0.407       | 0.415        | 0.099 | 0.425
COD     | best                 | 0.429       | 0.393        | 1.206 | 0.635
CSD     | success per test set | 97.400%     | 97.400%      | 100%  | 99.467%
CSD     | success per molecule | 99.130%     | 99.130%      | 100%  | 97.967%
CSD     | mean                 | 1.537       | 1.488        | 1.506 | 1.833
CSD     | std. dev.            | 0.421       | 0.418        | 0.115 | 0.434
CSD     | best                 | 0.508       | 0.478        | 1.343 | 0.784
UFF and MMFF are force field methods and stand for Universal Force Field and Merck Molecular Force Field, respectively. CVGAE stands for Conditional Variational Graph Autoencoder. CVGAE + Force Field represents running the MMFF force field optimization initialized by CVGAE predictions.
We also report the diversity of the conformations generated by all evaluated methods in Table 2. Diversity is measured by calculating the mean and standard deviation of the pairwise RMSD between each pair of generated conformations per molecule. Overall, we can see that despite having a smaller median standard deviation of the RMSD between generated and reference conformations, CVGAE does not collapse to generating extremely similar conformations, although it does generate relatively less diverse samples than the ETKDG + MMFF baseline on all datasets. The conformations generated by CVGAE + MMFF are less diverse on the QM9 dataset and more diverse on the COD/CSD datasets compared to the ETKDG + MMFF baseline.
Table 2
Conformation Diversity.
Dataset | Metric    | ETKDG + MMFF | CVGAE | CVGAE + MMFF
--------|-----------|--------------|-------|-------------
QM9     | mean      | 0.400        | 0.106 | 0.238
QM9     | std. dev. | 0.254        | 0.061 | 0.209
COD     | mean      | 1.148        | 0.239 | 1.619
COD     | std. dev. | 0.699        | 0.181 | 0.537
CSD     | mean      | 1.244        | 0.567 | 1.665
CSD     | std. dev. | 0.733        | 0.339 | 0.177
Mean and std. dev. represent the mean and standard deviation of the pairwise RMSD between at most 100 generated conformations per molecule.
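The diversity metric of Table 2 reduces to a pairwise computation over a molecule's generated set; a sketch follows, reusing the illustrative `rmsd` function from the alignment sketch above.

```python
# A minimal sketch of the diversity metric: mean and standard deviation of
# pairwise RMSDs over a molecule's generated conformations. Assumes `rmsd`
# from the earlier Kabsch alignment sketch (or any aligned-RMSD function).
import itertools
import numpy as np

def diversity(conformations):
    pair_rmsds = [rmsd(a, b) for a, b in itertools.combinations(conformations, 2)]
    return np.mean(pair_rmsds), np.std(pair_rmsds)
```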
The computational efficiency of each of the evaluated approaches on the QM9 and COD datasets is shown in Fig. 2. For consistency, we generated one conformation for one molecule at a time using each of the evaluated methods on an Intel(R) Xeon(R) E5-2650 v4 CPU. On the QM9 dataset, CVGAE is 2× more efficient than ETKDG + UFF/MMFF, while CVGAE + MMFF is slightly slower than ETKDG + UFF/MMFF. On the COD dataset, which contains a larger number of atoms per molecule, CVGAE is almost 10× as fast as ETKDG + UFF/MMFF, while CVGAE + MMFF is about 2× as fast as ETKDG + UFF/MMFF. This shows that CVGAE scales much better than the baseline ETKDG + UFF/MMFF methods as the size of the molecule grows.
Figure 2
Computational efficiency of various approaches on QM9 and COD datasets.
Figures 3 and 4 visualize the median, standard deviation and best RMSD as a function of the number of heavy atoms in a molecule on the COD/CSD and QM9 datasets, respectively. For all approaches, we can see that the best and median RMSD both increase with the number of heavy atoms. The standard deviation of the median RMSD for CVGAE and CVGAE + MMFF is lower than that for ETKDG + MMFF across molecules of almost all sizes. The standard deviation of the best RMSD is slightly higher for CVGAE and CVGAE + MMFF than for ETKDG + MMFF on molecules with at most 12 atoms, but is lower for larger molecules, particularly for CVGAE. Overall, CVGAE yields a lower or similar median RMSD compared to ETKDG + MMFF across molecules of all sizes, and a lower standard deviation, whereas ETKDG + MMFF provides a lower best RMSD, particularly for the larger molecules observed in the COD/CSD datasets.
Figure 3
This figure shows the means and standard deviations of the best and median RMSDs on the union of COD and CSD datasets as a function of number of heavy atoms. The molecules were grouped by number of heavy atoms, and the mean and standard deviation of the median and best RMSDs were calculated for each group to obtain these plots. Groups at the left hand side of the graph with less than 1% of the mean number of molecules per group were omitted.
Figure 4
This figure shows the means and standard deviations of the best and median RMSDs on the QM9 dataset as a function of number of heavy atoms. The molecules were grouped by number of heavy atoms, and the mean and standard deviation of the median and best RMSDs were calculated for each group to obtain these plots. Groups at the left hand side of the graph with less than 1% of the mean number of molecules per group were omitted.
Figures 5 and 6 qualitatively compare the results of CVGAE against MMFF and of CVGAE + MMFF against CVGAE, respectively. For each dataset, each figure shows the three molecules for which the first method in each figure outperforms the second method by the greatest amount, and the three molecules for which the second method outperforms the first by the greatest amount. The reference molecules are shown alongside the conformations resulting from each of the methods for comparison.
Figure 5
This figure shows the three molecules in each dataset for which the differences between the RMSDs of the neural network predictions and the baseline ETKDG + MMFF predictions were greatest in favour of the neural network predictions (max(RMSD_MMFF − RMSD_CVGAE)), and the three for which this difference was greatest in favour of the ETKDG + MMFF predictions (max(RMSD_CVGAE − RMSD_MMFF)). The top row of each subfigure contains the reference molecules, the middle row contains the neural network predictions and the bottom row contains the conformations generated by applying MMFF to the reference conformations.
Figure 6
This figure shows the three molecules in each dataset whose RMSD decreased the most and the three whose RMSD increased the most on applying MMFF to the conformations predicted by the neural network. The top row of each subfigure contains the reference molecules, the middle row contains the neural network predictions and the bottom row contains the conformations generated by applying MMFF to the neural network predictions.
We can see some general trends from both these figures. The conformations produced by the neural network are qualitatively much more similar to the reference in the case of the QM9 dataset than in the cases of the COD and CSD datasets. In the case of the COD and CSD datasets, the CVGAE predictions appear squashed or compressed in comparison to the reference molecules. For example, in almost every case we can see the absence of visible rings and the absence of bonds protruding from the lengthwise dimension of the molecule. At the same time, we can see that on COD and CSD, CVGAE does better than ETKDG + MMFF in cases where ETKDG + MMFF creates loops and protrusions in the wrong places.
Analysis and Future Work
Overall, we observe that CVGAE performs better relative to ETKDG + MMFF on QM9 than on COD and CSD. One possible reason for this phenomenon is that COD and CSD contain a much larger number of heavy atoms per molecule than QM9. In the absence of an adequate number of neural message passing steps and an adequate number of hidden units, the network may converge to outputting a conformation whose atoms lie largely along a single non-linear dimension in order to minimize outliers, which would be heavily penalized by the sum of squared distances term in the loss function. A neural network architecture with a larger number of neural message passing steps and a larger number of hidden units may be needed to generate less conservative conformations and achieve results comparable to those for QM9. This is a recommended direction of future work, which will require more computational resources, including distributed training on multiple GPUs with sufficient memory.

Another concern for COD and CSD is the inconsistency in the environments from which the reference conformations are obtained. This inconsistency would not be a serious concern for small molecules, but it can result in performance degradation for larger molecules. Further investigation should be performed with a dataset of larger molecules whose reference conformations are obtained from identical environments. Additionally, conditioning deep generative graph neural networks on the environment could be explored in the future.

We also observe that our CVGAE method has a lower variance than the baseline methods, so a relatively small number of samples needs to be drawn before obtaining a conformation with a good RMSD. In addition, CVGAE is faster than force field methods and uses fewer computational resources once trained. Using conformations generated by CVGAE as an initialization to a force field method showed promising results on the QM9 dataset, allowing us to combine the best of two distinct methods. However, applying a force field method to the conformations generated by CVGAE leads to an increase in RMSD on the COD and CSD datasets; future work could explore why this is the case. Another avenue of future inquiry could be the joint training of CVGAE and a force field method, which would involve implementing force field minimization in a deep learning framework, connecting it to CVGAE and backpropagating through this aggregate model. This joint training could yield better results than either method alone.