Matthew K Matlock1, Na Le Dang1, S Joshua Swamidass1. 1. Department of Pathology and Immunology, School of Medicine, Washington University in St. Louis, St. Louis, Missouri 63130, United States.
Abstract
A collection of new approaches to building and training neural networks, collectively referred to as deep learning, are attracting attention in theoretical chemistry. Several groups aim to replace computationally expensive ab initio quantum mechanics calculations with learned estimators. This raises questions about the representability of complex quantum chemical systems with neural networks. Can local-variable models efficiently approximate nonlocal quantum chemical features? Here, we find that convolutional architectures, those that only aggregate information locally, cannot efficiently represent aromaticity and conjugation in large systems. They cannot represent long-range nonlocality known to be important in quantum chemistry. This study uses aromatic and conjugated systems computed from molecule graphs, though reproducing quantum simulations is the ultimate goal. This task, by definition, is both computable and known to be important to chemistry. The failure of convolutional architectures on this focused task calls into question their use in modeling quantum mechanics. To remedy this heretofore unrecognized deficiency, we introduce a new architecture that propagates information back and forth in waves of nonlinear computation. This architecture is still a local-variable model, and it is both computationally and representationally efficient, processing molecules in sublinear time with far fewer parameters than convolutional networks. Wave-like propagation models aromatic and conjugated systems with high accuracy, and even models the impact of small structural changes on large molecules. This new architecture demonstrates that some nonlocal features of quantum chemistry can be efficiently represented in local variable models.
A surge of interest
in deep learning has encouraged its application
to a range of problems in theoretical chemistry and chemical biology.
Deep learning is a collection of new techniques that have substantially
increased the power, flexibility, and applicability of neural networks.[1−4] There is currently a resurgence in chemistry research groups using
neural networks to predict the properties of small molecules.[5−11] The best deep learning models consistently outperform other machine
learning approaches in modeling drug metabolism,[5,7,12,13] electrophilic
and nucleophilic reactivity,[8,14] chemical reactions,[15] logP,[16] and pKa.[17]

This track
record raises hope that deep learning might approximate
solutions to the Schrödinger equation, thereby efficiently
estimating atomic and molecular properties. Large data sets of quantum
chemical calculations have been compiled to enable researchers to
test machine learning methods.[18] To date,
machine learning techniques have been used to estimate atomization
energies,[19−22] bond energies,[23] molecular orbital energies,[24] ground state Hamiltonians,[25] and ab initio molecular dynamics,[26] among others. For small molecules with fewer
than 10 heavy atoms, deep learning models can approximate experimental
observations with accuracy comparable to density functional theory
and other methods.[27,28] Encouraged by this success, several
groups aim to replace computationally expensive ab initio calculations with accurate approximations in molecular simulations.

This hope is severely curtailed if nonlocal features, like aromaticity,
cannot be efficiently represented by deep learning models. Fully connected
networks might represent nonlocal features, but they are not computationally
or representationally efficient.[25] Likewise,
graph-walking architectures proposed thus far are also inefficient,
and require collapsing rings into pseudoatoms.[15] Consequently, most effort has focused on different types
of convolutional networks, which are efficient because they locally
aggregate information in molecular graphs[14,29] (Figures 1A and 5A–D). Convolutional networks, however, do
not efficiently propagate long-range information. This raises a fundamental
question: are the nonlocal features of chemistry representable in
efficient, local-variable models?
Figure 1
Wave-like propagation of information more
accurately represents
aromatic and conjugated system size. (A) Four architectures were tested,
including two convolutional models, the neighborhood model (red),
which aggregates features over nearby atoms, and the weave model (orange),
which repeatedly aggregates information over a local neighborhood
of atoms and bonds; the WAVE model (blue) propagates waves of nonlinear
computation forward and backward across a molecule; and a hybrid N-WAVE
(purple) architecture combines the neighborhood and WAVE models. (B,
C) Models of all four architectures were trained to label each atom
with the size of the aromatic and conjugated systems of which it was
a part. Both the WAVE and N-WAVE models outperform neighborhood and
weave models. Dashed green lines indicate 95% confidence intervals
of the interalgorithmic agreement on aromatic system size between
RDKit and OpenBabel.
Figure 5
Current state-of-the-art neighborhood and weave deep learning architectures
aggregate over local neighborhoods, while the WAVE architecture efficiently
propagates information across a molecule using a breadth-first-search.
(A, B) The neighborhood convolution architecture sums atom descriptors
over increasing neighborhood depth. These neighborhood sums are then
used as input to a fully connected neural network. (C, D) The undirected
graph or weave architecture also aggregates over local neighborhoods.
However, instead of a simple summation, aggregation is learnable and
performed over pairs of atoms in a local neighborhood. Furthermore,
multiple aggregation steps are performed. (E) In contrast, the WAVE
model propagates information across a molecule in waves using a breadth-first
search with a recurrent neural network. For the example acetaminophen,
colored arrows indicate separate paths of information propagation.
(F) The WAVE architecture uses gated recurrent units (GRU) to combine
information from multiple input bonds and compute an output. This
output then flows along bonds to the next atom in the BFS, and also
flows up to the next layer. (G) The GRU uses a carefully constructed
mix gate to combine information from multiple input bonds. The mix
gate computes both linear-weighted sums and softmax-weighted sums
of input bond data. In addition, tree and cross edges enter at separate
ports, with separate weight computations.
Local-variable models of quantum mechanics have a long history,
of both proposals and proofs of nonexistence.[30,31] Recently, hydrodynamics experiments have reawakened interest in
these models by showing that many nonlocal features of quantum mechanics
can arise in classical systems from waves, which are clearly governed
exclusively by local variables and interactions.[31] In these experiments, bouncing oil droplets interact with
the waves they create on a vibrating bath. The droplets’ dynamics
exhibit several nonlocal features, with correspondence to pilot-wave
theory interpretations of quantum mechanics. Though not a perfect
quantum analogue, these systems exhibit nonlocal behavior like double-slit
diffraction, suggesting that local-variable models might explain more
of quantum mechanics than first appreciated. With this in view, we
aimed to test whether commonly used definitions of aromatic and conjugated
systems are representable in a local-variable model.
Propagating Information
in Waves
We hypothesize that
aromatic and conjugated systems are representable with local variables
when information is propagated in waves, back and forth, across a
molecule. Information is propagated locally, but can travel across
the entire molecule in a single pass.

This hypothesis is motivated
by two lines of reasoning. The first line of reasoning is inspired
by chemical informatics. Efficient algorithms to compute both aromaticity
and conjugated system size are usually implemented as depth-first
searches and complex nonlinear rules.[32,33] Though speculative,
we imagine it is possible to reformulate these algorithms as multiple
passes along a breadth-first search, visiting atoms in a wave-like
order to compute local interactions. A breadth-first search is a way
of efficiently bookkeeping local interactions across arbitrarily complex
Euclidean graphs. If our intuition is correct, this would prove an
efficient way of representing chemicals to capture long-range interactions.
The second line of reasoning is inspired by the pilot-wave interpretation
of quantum mechanics.[31,34] Nonlocal waves can arise in local-variable
models as local interactions propagate across a system, and these
waves can describe a great deal of quantum mechanics; this guides
our intuition that a local variable model of quantum chemistry could
propagate information nonlocally. Neither chemical informatics algorithms
nor the mathematical details of pilot-wave theory are directly encoded
in the algorithm. These lines of reasoning are offered merely to motivate
our intuition that information propagated in waves of local interactions
might give rise to the nonlocal behavior required to model quantum
mechanics.

Guided by this intuition, we constructed
(WAVE
and N-WAVE) to pass information in waves of local interactions across
a molecule’s graph (Figures 1A and 5E–G). Local variables
are maintained at each atom, and a nonlinear computation updates them
in each pass using information from local interactions on one side
of the wavefront. In this way, this architecture is a local-variable
model, just like pilot-wave models of quantum mechanics. The precise
details of the interactions and local variables are not prespecified,
and these details are learned from data during a training phase.
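To make this scheme concrete, a minimal sketch follows (an illustration, not the published implementation); update_fn stands in for the learned recurrent unit detailed in Data and Methods, and the tree- and cross-edge distinction used by the full model is omitted here:

```python
def wave_pass(states, layers, neighbors, update_fn):
    # One forward-backward pass: update atom states layer by layer along a
    # breadth-first ordering, then again in reverse, so information can
    # traverse the entire molecule within a single pass.
    for ordering in (layers, list(reversed(layers))):
        for layer in ordering:
            for atom in layer:
                incoming = [states[j] for j in neighbors[atom]]
                states[atom] = update_fn(states[atom], incoming)
    return states
```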
Results and Discussion
We focused on aromatic and conjugated
systems, both of which arise
from nonlocal interactions across a molecule. The training and validation
sets included both small and large molecules, and several adversarial
examples (see Data and Methods). The modeling task is to annotate each atom
and bond with the size of its aromatic or conjugated system, which
in turn is determined by nonlocal features. This is a challenging
task that requires architectures to represent nonlocality; small changes
to a molecule can dramatically alter the system size at distant sites.
Architectures must be capable, at minimum, of representing Hückel’s
rule[35] and the aromaticity algorithms in
chemical informatics software.[36,37] This is a key test
case, because architectures that cannot represent aromaticity are
not suitable for modeling quantum chemistry.

Though there is
ongoing controversy over the precise definition
of aromaticity,[38] conjugated system size
is more reliably defined and equally important to quantum chemistry.
Both aromatic and conjugated systems are characterized by electron
delocalization across an array of aligned pi-bonds. This delocalization
determines several properties of molecules, most noticeably by imparting
some with brilliant colors.[41,42] It is delocalization
that enables some molecules to transfer charge or interact with light
with precisely tuned energetics. For this reason, large aromatic and
conjugated systems are important in both chemistry and biology.[36−38,43,44] For example, metalloporphyrins, nicknamed the pigments of life,
contain large aromatic systems that are necessary for photosynthesis,
oxygen transport, and electron transfer in several enzymes.[45] Effective models of quantum chemistry should
be able to represent aromaticity in complex chemicals like these.

Aromaticity is a real phenomenon, but its definition in quantum
mechanics and analytical chemistry is not settled, especially for
many boundary cases and rings with non-carbon atoms.[38] For this reason, all models of aromaticity are subject
to intractable debate. We define both aromatic and conjugated system
sizes using the RDKit[39] software package,
which defines aromaticity using a graph-based algorithm. Error in
this definition was estimated by comparing to the OpenBabel[40] aromaticity model, which is also commonly used
but is less documented than RDKit. Using an established and documented
algorithm ensures reproducibility of our results while also quantifying
confidence intervals on the training targets. Using a quantum mechanics
definition of aromaticity would introduce additional problems without
reducing this ambiguity. Simulation dependent definitions are sensitive
to initial conditions and are difficult to exactly reproduce, while
still remaining subject to the same ambiguity in defining aromaticity.
Ultimately, prediction of quantum chemical properties as determined
by experiment or simulation is most important, but at minimum architectures
used in quantum chemistry should be able to reproduce aromaticity
and conjugation calculated deterministically from molecular graphs.
Defining aromaticity by a well-documented, graph-based algorithm establishes
a straightforward and reproducible diagnostic of the limits of deep
learning models. Ambiguity notwithstanding, architectures incapable
of modeling aromaticity computed with a graph-based algorithm are
not expected to effectively model quantum mechanics, which will include
similar long-range behavior. As we will see, convolutional networks
fail this test, but wave-like propagation succeeds.
Representing Aromatic and
Conjugated Systems
We assessed
wave-like propagation by comparing it to convolutional architectures.
Controlling for differences unrelated to the information propagation
model, each architecture was evaluated within a common framework.
The same input features, output architecture, and training regimen
were used in all cases, with the tested architecture inserted between
the standardized input and output layers (Figure S1).

After training, the accuracy of each model was assessed.
The two convolutional models, neighborhood convolution and weave,
did not closely fit the data (Figure 1B,C). The output of the neighborhood convolution was
nonlinearly related to the true values and exhibited high error on
both aromatic (7.34 root-mean-square error (RMSE)) and conjugated
systems (23.1 RMSE). The weave model was more effective, but still
exhibited high error that increased with molecule size. In contrast,
the two wave-like models represented the data with much better accuracy
on both aromatic (4.36 RMSE for weave vs 3.68 and 3.25 for WAVE and N-WAVE) and conjugated system
size (13.3 vs 3.87 and 3.44 RMSE). The output of wave-like models
is linear, and the error is constant with molecule size. Moreover,
the error of the best wave-like model, N-WAVE, was within the interalgorithm
error of two commonly used aromaticity detection algorithms (RMSE 3.58, Figure S2).

These results demonstrate
a critical and heretofore unrecognized
limitation of convolutional networks; they cannot represent long-range
interactions known to be important to chemical properties. This demonstrated failure point should guide future work in deep learning architectures. To this end, these results also support the
hypothesis that wave-like propagation of information can better represent
properties requiring propagation of long-range information.

The N-WAVE and weave models were selected for further study, because
they achieved the best performances in their classes, wave-like and
convolutional, respectively.
Modeling Long-Range Effects
Wave-like
models also more
accurately represent changes in aromatic and conjugated system sizes
due to modifications at distant locations. We generated adversarial
examples by converting double bonds to single bonds. Converted bonds
were chosen to disrupt the aromatic or conjugated system (Figure 2A,D). For these adversarial
examples, the convolutional model estimates aromatic and conjugated
system sizes with poor accuracy (RMSE of 3.18 and 11.8, respectively, Figure 2B,E). In contrast,
the wave-like model’s estimate is more accurate for both aromatic
and conjugated system sizes (RMSE of 2.48 and 4.08, respectively).
Figure 2
Wave-like
propagation correctly models adversarial examples. (A,
D) Adversarial aromatic and conjugated systems were generated from
PubChem molecules by changing a double bond to a single bond. (B,
E) Wave-like models more accurately estimate the size of aromatic
and conjugated systems than convolutional models in these adversarial
examples. (C, F) Critically, wave-like models more accurately estimate
the change in aromatic and conjugated system sizes between adversarial
molecules and their unmodified parents. Dashed green lines indicate
95% confidence intervals of the interalgorithmic agreement on aromatic
system size between RDKit and OpenBabel.
Critically, wave-like models more accurately estimate the
change
in aromatic and conjugated system size of the adversarial molecules
when compared to their unmodified parent. The convolutional model
does not accurately model the change in aromatic and conjugated system
sizes between a molecule and its modified counterparts (RMSE of 3.31
and 10.5, respectively, Figure 2C,F). In contrast, the wave-like model estimates these pairwise
changes much more accurately (RMSE of 2.50 and 3.75, respectively).
Consistency on Individual Systems
Our hypothesis was
further supported by comparing the consistency of estimates within
individual aromatic and conjugated systems. Effective models compute
the same estimate for all atoms in the same system (Figure 3). The wave-like estimate was
more consistent than convolutional models (within system average standard
deviation of 0.56 vs 0.83). As a typical example, the convolutional
model overestimated aromatic system size for atoms near the center
of a large polycyclic hydrocarbon system and underestimated size for
atoms near the edge. In contrast, this type of error is not visible
in wave-like estimates.
Figure 3
The error from wave-like models is better behaved
than convolutional
models. Three molecules were painted with their aromatic system size
(top) alongside the error of convolutional (middle) and wave-like
(bottom) models. These examples include (left) a large polycyclic
compound, (center) a porphyrin, and (right) an adversarial porphyrin,
with a key double bond (circled in black) switched to a single bond
to disrupt the aromatic macrocycle. The errors are higher for the
convolutional model than the wave-like model. (far right) Moreover,
the convolutional model often does not produce the same prediction
across large systems, causing a larger variance in predictions that
increases with aromatic system size.
Likewise, the wave-like model accurately identified cases
where
aromatic status is defined by long-range interactions. Changing a
double bond to a single bond in porphyrin derivatives obliterates
the large aromatic system (Figure 3). Convolutional models overestimate the size of the
aromatic system for atoms far from the modified bond, but the wave-like
model estimates were more accurate.

We further quantified the
long-range effect visible in the porphyrin
case using the full set of adversarial examples. We quantified the
accuracy of aromatic status predictions on atoms more than five bonds
away from the modified bond. This test evaluates the long-range propagation
of information about subtle changes in a molecule. On these cases,
the wave-like model is much more accurate than convolutional models
(91.6% vs 85.8% area under the receiver operating characteristic curve, respectively, Table S1).
Representational Efficiency
We hypothesize that wave-like
propagation efficiently represents aromatic and conjugated system
size. In most modeling tasks, there is a trade-off between model error
and complexity, which is quantified by the number of parameters in
a model. All else being equal, models with more parameters will fit
the data better. Simple models should be preferred, unless a more
complex model improves generalization accuracy.

Comparing representational
efficiency, therefore, requires measuring the performance of a wide
range of models of each architectural class, to determine which architecture
has the best trade-off between complexity and error. The architecture
with the best trade-off most efficiently represents the task. Put
another way, representational efficiency can be assessed by controlling
either for complexity or for accuracy. Controlling for complexity,
models using efficient architectures will perform better with the
same number of parameters as those using less efficient architectures.
Controlling for accuracy, models using more efficient architectures
will have fewer parameters than those with equivalent performance
but less efficient architectures.

We compared the trade-off
between model complexity as defined by
the number of tunable model parameters and accuracy with hyperparameter
sweeps on the structural parameters of each architecture (Table S2). Each combination of hyperparameters
was used to train and evaluate a model. In total, we trained and tested
567 convolutional models and 324 wave-like models (Figure 4). We found that wave-like
models were more accurate with fewer parameters than convolutional
models.
Figure 4
Wave-like propagation is more representationally efficient than
convolution. A large search was conducted over structural parameters
of the convolutional (orange) and wave-like (purple) models (Table S2). In total, 567 weave architectures
and 324 wave-like architectures were tested. (A, D) Most wave-like
architectures with more than one forward–backward pass exhibited
better validation accuracy than the best convolutional models. Furthermore,
wave-like models require far fewer parameters. (B, E) Depth is a critical
determinant of convolutional model accuracy, with increasing depth
exhibiting diminishing returns on validation accuracy. Wave-like models
with depth greater than one have nearly equivalent performance. (C,
F) In contrast, the number of local variables (defined by the width
of the feature vector at each atom and bond) is critical for wave-like
model accuracy, but does not improve convolutional models.
Patterns of performance in the best models of each
architecture
support this hypothesis. The best wave-like models had lower test
error and lower parameter counts than the best convolutional models
(140,024 parameters vs 453,099 parameters, aromatic system size RMSE
2.34 vs 2.88, conjugated system size RMSE 2.88 vs 7.61). Furthermore,
wave-like models outperformed convolutional models 2 orders of magnitude
more complex (e.g., 6,704 parameters vs 453,099 parameters, conjugated
system size RMSE 6.04 vs 7.61). Similar patterns of performance are
observed globally. Most wave-like models with more than one pass outperformed
the best convolutional model (232 of 243 models). These experiments
provide strong evidence that wave-like propagation is more representationally
efficient than local aggregation.
Critical Determinants of
Representational Power
The
parameter sweep also tests what components of an architecture are
critical determinants of representational efficiency. The convolutional
architecture requires increasing depth to propagate information across
a molecule, but this only slowly improves accuracy with diminishing
returns while substantially increasing the number of tunable parameters
(Figure 4B,E). The
performance of convolutional models improves only slightly between
depth 30 and 50 (mean RMSE of 9.07 vs 8.63), but depth 50 models have
nearly twice as many parameters (mean parameter count of 172,171 vs
286,731). In contrast, increasing the number of local variables did
not improve convolutional model estimates (Figure 4C,F).

In wave-like propagation, the
most important determinant of performance was depth. At least two
forward–backward passes were required for accurate models.
However, additional passes did not improve performance substantially.
After depth, the number of local variables most strongly influenced
model accuracy (average conjugated system size RMSE of 8.39, 6.46,
and 5.89 for widths of 10, 20, and 30, Figure 4F). This suggests that performance is primarily
influenced by the wave-like information propagation and by the number
of local variables.

The WAVE model is defined by wave-like propagation
of information,
but it encodes local interactions with recurrent units making use
of several components. Most components are standard in deep learning,
but they also include a new component, a mix gate, which mixes information
from multiple inputs together. The relative importance of each component
of the model was assessed by studying the performance of the model
as each component was removed or altered, one at a time (Table S3). The largest increases in error are
observed when using only one forward–backward pass and when
removing the mix gate. Other components only subtly affect error.
We conclude that these two features of the WAVE model, along with
the number of local variables, are critical determinants of its representational
power. This supports the hypothesis that wave-like propagation is
a critical determinant of efficient representation of nonlocal features.
Computational Complexity of Wave-like Propagation
Wave-like
propagation is sublinear in computational complexity, improving substantially
on the O(N^3) complexity
of ab initio calculations and fully connected networks.
Most organic molecules are 1D or 2D, but condensed phase simulations
are 3D. Moreover, parallelism is easily exploitable with the current
generation of computational hardware. Consequently, the computational
complexity of an efficient parallel implementation of wave-like propagation
is proportional to a system’s width, or O(N), O(N^(1/2)),
and O(N^(1/3)), respectively, for 1D, 2D, and
3D systems
(see Supporting Information). In contrast,
the weave and neighborhood convolutions have constant computational
complexity, but do not propagate information across entire systems.
With sublinear complexity on large systems, wave-like propagation
may be practically applied to modeling nonlocal properties in much
larger systems, including perhaps macromolecules and condensed phase
simulations.
Conclusion
It may be surprising
that convolutional neural networks struggle
to represent aromatic and conjugated systems. This finding initially
appears to contradict several universal approximation proofs.[46−48] These proofs, however, only apply to fully connected networks, with
infinite training data, and with arbitrarily large numbers of hidden
units. Under these conditions, neural networks can approximate simple
quantum systems to arbitrary accuracy.[25] These proofs do not extend to convolutional networks, which are
not fully connected networks. Demonstrating the limitations of convolutional
networks in chemistry is a foundational finding for the field.

It may be equally surprising that wave-like propagation efficiently
represents nonlocal features with local variables. As demonstrated
by hydrodynamic analogues of quantum mechanics,[31] surprising nonlocal behavior can emerge from local interactions.
Hydrodynamics models, however, are not perfect analogues of quantum
mechanics, and they may never scale to large many-body problems. Our
findings, nonetheless, empirically add to the argument by demonstrating
that important nonlocal features in more complex chemical systems
can, in principle, emerge from local variable models when information
is propagated in waves.

While this study focused on aromatic
and conjugated system size
computed using graph-based algorithms, the findings here have immediate
practical relevance in ongoing efforts to model quantum chemistry
with deep learning. Convolutional networks are not efficient models
of nonlocal properties, but other efficient models are possible and
might be preferred. It remains an open question how much of quantum
chemistry is representable with local-variable models, but we may
soon find out as deep learning is extended to quantum mechanical simulations
of large molecules. As efforts to model quantum chemistry with deep
learning progress, they could become more than engineering efforts
solving a purely practical problem. They might also empirically test
whether quantum chemistry is fully representable in local-variable
models.
Data and Methods
Large Aromatic and Conjugated Systems from
PubChem
Diverse training and test sets with many types and
sizes of aromatic
and conjugated systems were extracted from the PubChem database.[49] First, a random sample of 200,000 compounds
was collected (Figure S3A,B). Most compounds
in PubChem were small (fewer than 50 heavy atoms), with small conjugated
systems (fewer than 25 atoms) and small aromatic systems (fewer than
10 atoms). To ensure representation of larger, more difficult molecules,
we selected a stratified subsample of training and test sets from
the full range of aromatic and conjugated system sizes (Figure S3C–F). Only molecules with fewer
than 200 atoms were included. Aromatic system sizes in the final data
set ranged from 0 to 80 heavy atoms, while conjugated system sizes
ranged between 0 and 165 heavy atoms. The final data set included
2,678 molecules with 139,138 heavy atoms for the training set and
471 molecules with 24,240 heavy atoms for the test set.
Adversarial
Modifications to Aromatic and Conjugated Systems
To study
the efficiency of information propagation in deep learning
chemistry architectures, we added adversarial examples to both the
training and test sets. For each PubChem compound in the data sets,
between two and four adversarial examples were constructed. In each
example, a single double bond from a kekulized form of the molecule
was chosen and then changed to a single bond. Bonds were chosen so
that the change in aromatic or conjugated system sizes was large.
For aromatic system sizes, two adversarial examples were constructed
with bond modifications leading to (1) the maximum change in aromatic
system size and (2) the median change in aromatic system size (Figure S4A). The RMSE difference in aromatic
system sizes between the normal and adversarial examples was 6.61
and 7.29 for the training and test sets, respectively (Figure S4B,C). For conjugated system sizes, the
same strategy was used to generate two additional adversarial examples
(Figure S5A). The RMSE difference in conjugated
system sizes between normal and adversarial examples was 20.1 and
19.3 for the training and test sets, respectively (Figure S5B,C). Including adversarial examples, the final training
set contained 11,123 molecules with 605,000 heavy atoms, and the final
test set contained 1,990 molecules with 105,726 atoms.
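A minimal RDKit sketch of the bond-switching step follows; ranking the resulting variants by their change in aromatic or conjugated system size, as described above, is omitted:

```python
from rdkit import Chem

def adversarial_variants(smiles):
    # Enumerate variants of the kekulized molecule in which one double bond is
    # switched to a single bond; variants that fail sanitization are skipped.
    kek = Chem.MolFromSmiles(smiles)
    Chem.Kekulize(kek, clearAromaticFlags=True)
    variants = []
    for bond in kek.GetBonds():
        if bond.GetBondType() != Chem.BondType.DOUBLE:
            continue
        rw = Chem.RWMol(kek)
        rw.GetBondWithIdx(bond.GetIdx()).SetBondType(Chem.BondType.SINGLE)
        try:
            variant = rw.GetMol()
            Chem.SanitizeMol(variant)  # re-perceives aromaticity and conjugation
            variants.append(variant)
        except Exception:
            pass
    return variants
```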
Common Atom-Level
Input Descriptors
All models used
the same minimal set of numerical descriptors which were computed
across the heavy atoms in each molecule. These descriptors included
11 binary indicators denoting the element of the atom, the number
of covalently bonded hydrogens, and the formal charge of the atom.
Atom indicators included boron, carbon, nitrogen, oxygen, fluorine,
phosphorus, sulfur, chlorine, arsenic, bromine, and iodine.
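A sketch of this featurization with RDKit follows; encoding the hydrogen count and formal charge as raw numeric values, rather than additional indicators, is an assumption:

```python
from rdkit import Chem

ELEMENTS = ["B", "C", "N", "O", "F", "P", "S", "Cl", "As", "Br", "I"]

def atom_descriptors(mol):
    # 13 numbers per heavy atom: an 11-way element indicator, the count of
    # covalently bonded hydrogens, and the formal charge.
    rows = []
    for atom in mol.GetAtoms():
        indicator = [1.0 if atom.GetSymbol() == e else 0.0 for e in ELEMENTS]
        rows.append(indicator + [float(atom.GetTotalNumHs()),
                                 float(atom.GetFormalCharge())])
    return rows
```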
Common
Atom-Level and Bond-Level Targets
All models
were trained to predict the same set of targets at the atom and bond
level. For each atom, 11 atom-level targets were calculated: (1) membership
in any ring, (2) membership in an aromatic ring, (3–8) membership
in a ring of size three, four, five, six, seven, or eight heavy atoms,
(9) the number of atoms in the largest ring of which this atom is
a member, (10) the number of atoms in the aromatic system of which
this atom is a member, and (11) the number of atoms in the conjugated
system of which this atom is a member. Aromatic and conjugated systems
were defined as the set of all atoms reachable by walking from atom
to atom on aromatic or conjugated bonds, respectively. Bond aromaticity
was included as a bond-level target. RDKit[39] was used to label aromatic and conjugated bonds. OpenBabel[40] was used to determine ring sizes and membership.
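These targets can be sketched with RDKit's bond flags and a union-find over flagged bonds; assigning size 0 to atoms that sit on no flagged bond is an assumption:

```python
from rdkit import Chem

def system_sizes(mol, bond_flag):
    # Group atoms reachable from one another by walking bonds for which
    # bond_flag is True, then report each atom's group size.
    n = mol.GetNumAtoms()
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    flagged = [False] * n
    for bond in mol.GetBonds():
        if bond_flag(bond):
            a, b = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
            flagged[a] = flagged[b] = True
            parent[find(a)] = find(b)
    sizes = {}
    for i in range(n):
        if flagged[i]:
            sizes[find(i)] = sizes.get(find(i), 0) + 1
    return [sizes[find(i)] if flagged[i] else 0 for i in range(n)]

mol = Chem.MolFromSmiles("c1ccc2ccccc2c1")  # naphthalene
aromatic = system_sizes(mol, lambda b: b.GetIsAromatic())      # 10 for every atom
conjugated = system_sizes(mol, lambda b: b.GetIsConjugated())
```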
Common Input and Output Architecture
To evaluate each
architecture’s ability to propagate information across a molecule,
a common architecture was used at the first input layer and final
output (Figure S1A). At the input, a single
fully connected layer with rectified linear activation scales the
atom-level input size to the input size of the test network. The test
network produces a vector of atom features for each atom in the input
data. Two separate networks use these features to predict atom or
bond targets. The atom network consists of a single hidden layer of
ten units, eight output units with sigmoid activations, and three
output units with rectified linear activations. The bond network consists
of an atom feature conversion layer, a single hidden layer of ten
units, and one output unit with a sigmoid activation. Atom features
are converted to bond features by application of an order invariant
transformation. Specifically, features for each atom in a bond are
processed by a fully connected layer with rectified linear activation,
and the output for both atoms is summed to produce the bond-level
features.
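A sketch of the order-invariant transformation, with hypothetical weights W and b:

```python
import numpy as np

def bond_features(feat_a, feat_b, W, b):
    # Each atom's features pass through a shared rectified linear layer and the
    # two outputs are summed, so the result is invariant to atom order.
    relu = lambda x: np.maximum(0.0, x)
    return relu(W @ feat_a + b) + relu(W @ feat_b + b)
```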
Common Training Protocol
All models were trained in
the Tensorflow[50] framework using the same
protocol, using cross-entropy loss[51] on
the binary targets and normalized sum of squared differences loss
on the integer targets (Figure S1B). All
models were trained with 15,000 iterations of mini-batch gradient
descent using the Adam optimizer[52] and
batches of 100 molecules. An initial learning rate of 10^-3 was used, followed by a decay to 10^-4 after 7,500
iterations.
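A minimal TensorFlow 2-style sketch of this protocol follows (the original work predates this API, and the exact normalization of the squared-difference loss is an assumption):

```python
import tensorflow as tf

def train(model, minibatches):
    # Learning rate 10^-3 for the first 7,500 steps, then 10^-4.
    schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries=[7500], values=[1e-3, 1e-4])
    opt = tf.keras.optimizers.Adam(learning_rate=schedule)
    bce = tf.keras.losses.BinaryCrossentropy()
    for step, (inputs, binary_t, integer_t) in enumerate(minibatches):
        if step >= 15000:
            break
        with tf.GradientTape() as tape:
            binary_p, integer_p = model(inputs)  # model assumed to return both heads
            # Cross-entropy on binary targets; a mean squared difference stands
            # in for the paper's normalized sum of squared differences.
            loss = bce(binary_t, binary_p) + tf.reduce_mean(
                tf.square(tf.cast(integer_t, tf.float32) - integer_p))
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
```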
Neighborhood Convolution (NC or XenoSite)
Architecture
The NC architecture is used by several groups,
including ours, to
predict drug metabolism and other atom and bond level properties.
The NC linearly sums descriptors of neighboring atoms up to a depth
of five bonds (d = 5) once for each neighborhood
depth (Figure 5A,B). The resulting feature vector has a
width d times the number of input descriptors. The
feature vector is input to a neural network. We used two hidden layers
of size 40 and 20, sequentially, to compute an output feature vector
for each atom of dimension 20. The performance of NC depends on passing
quality descriptors known to be important in chemistry, like aromaticity
and ring membership. In this study, however, only minimal descriptors
are used. It is not expected that NC will perform well in these tests,
and it serves as a baseline method against which to compare other
architectures. Hyperparameters for this model include the neighborhood
depth d over which to aggregate descriptors, the
number of hidden layers, the sizes of those hidden layers, and the
number of output features. The baseline model had 1,104 parameters.
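A sketch of the neighborhood summation follows; treating the atom's own descriptors as the k = 0 slot and summing over atoms at exactly each distance k are assumptions on our part:

```python
import numpy as np
from collections import deque

def bond_distances(adj, start):
    # Bond-count distances from one atom to all others (plain BFS);
    # adj is a list of neighbor-index lists.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def neighborhood_features(adj, X, depth=5):
    # Concatenate, per atom, the sums of descriptors over atoms at each bond
    # distance k = 0..depth; output shape (n_atoms, (depth + 1) * n_descriptors).
    n, f = X.shape
    out = np.zeros((n, (depth + 1) * f))
    for a in range(n):
        for j, k in bond_distances(adj, a).items():
            if k <= depth:
                out[a, k * f:(k + 1) * f] += X[j]
    return out
```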
Weave Architecture
Weave is a state-of-the-art undirected
graph model that is used by DeepChem[29,53] to predict
the biochemical properties of small molecules. The weave architecture
aggregates information locally over pairs of atoms in a neighborhood
using multiple layers (Figure 5C,D). Weave is fully described in the literature,[29] but a brief description is included here with
default parameters noted. We used 15 weave layers. Each layer nonlinearly
aggregates atoms, using a fully connected rectified linear layer,
to produce a new set of pair outputs. Then, it does the same over
pairs of atoms to produce a set of atom outputs. Atom output vectors
were dimension 30, and bond output vectors were dimension 10. At each
step, pairs of atoms up to a depth of two are aggregated. After aggregation,
batch normalization was applied to both the atom and pair outputs
at each layer. The exact implementation details are explained in the
references. There are six hyperparameters: the number of layers, the
bond and atom width, the bond and atom hidden layer width, and the
aggregation depth. The baseline model had 71,164 parameters. These
hyperparameters were chosen for the baseline model to bring the number
of parameters onto the same order as the baseline WAVE architecture
(see next section).
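As a rough illustration (not the published layer, which also aggregates atom pairs up to depth two, learns biases, and applies batch normalization), a single simplified weave step might look like:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def weave_layer(atom_feats, pair_feats, pairs, W_pair, W_atom):
    # Pair features are updated from the pair and an order-invariant sum of its
    # two atoms' features; atom features are updated from the atom and the sum
    # of its incident pair features. W_pair and W_atom are hypothetical weights.
    n = atom_feats.shape[0]
    new_pairs = np.stack([
        relu(W_pair @ np.concatenate([pair_feats[k],
                                      atom_feats[i] + atom_feats[j]]))
        for k, (i, j) in enumerate(pairs)])
    incident = np.zeros((n, new_pairs.shape[1]))
    for k, (i, j) in enumerate(pairs):
        incident[i] += new_pairs[k]
        incident[j] += new_pairs[k]
    new_atoms = np.stack([
        relu(W_atom @ np.concatenate([atom_feats[i], incident[i]]))
        for i in range(n)])
    return new_atoms, new_pairs
```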
Multi-Pass Breadth-First
(WAVE) Architecture
This architecture
propagates information across a molecule back and forth in waves (Figure 5E). Atoms and bonds
are processed in order as determined by a breadth-first search. Information
is passed along bonds. At each atom, information from previously visited
atoms is used to update the local variables (the state vector), which
are then passed on to the next set of neighboring atoms. The backward
pass reverses the order in which atoms are visited, reversing the
flow of information. In each pass, the neural network updates the
current state of the atom based on all the information passed to it
from neighboring atoms.

The breadth-first search is initiated
at an atom selected near the center of the molecule. From here, layers
of atoms that are one bond away, two bonds away, and so on are identified.
Atom states are updated layer by layer, in order, passing on information
along bonds from one layer to the next. In the backward pass, layers
are updated in the opposite order and the information flow is reversed.
The bonds between layers are labeled “tree edges”, and
bonds between atoms in the same layer are labeled “cross edges”.[54] For tree edges, one atom of the edge is updated
first, and then its updated state is sent to the next atom. For cross
edges, each atom in the cross edge receives the non-updated state
of the other atom; consequently, all atoms in a layer can be processed
simultaneously. In the backward pass, the information transfer and
order of updates is reversed. In this way, information propagates
in waves, back and forth across the molecule, from the central atom outward, then back again.

A gated recurrent unit (GRU)[55]—a
repeated neural network—integrates information from bonds and
updates the atom’s local state (Figure 5F). This unit is mathematically detailed
in the next section, but a narrative description
is included here. The weights for this network are identical for all
atoms in the molecule, but different for each pass of the algorithm.
A mix gate integrates information from multiple bonds. It calculates
two sums of the inputs along each tree-edge bond, a softmax sum and
a weighted sum (Figure 5G). Weighting for each sum is computed independently for each component
of the vector by a single-layer network. Cross edge inputs enter in
their own port with their own set of mixing weights different from
tree-edge inputs. The complexity of the mix gate is necessary for
the network to treat tree and cross edges differently, while learning
dimension and context aggregations that can range from weighted averaging,
to weighted sums, to the weighted maximum or minimum. Next, the mixed
input is concatenated with the atom’s local state and fed forward
into the update and output gates. The rest of the unit is a standard
GRU architecture. The output gate computes a preliminary output vector.
This output vector is mixed with the local variables using a component-wise,
weighted sum tuned by the update gate. The update gate is a single
fully connected layer with a sigmoid activation, which tunes how much
the output vector should update the local variables. The updated memory
vector then becomes the final output of the GRU, which replaces the
current state and is handed forward to update additional atom states.

There are four hyperparameters: the number of forward–backward
passes, the memory and output width, the number of output network
layers, and the number of mix network layers. Within a forward or
backward pass, all GRUs share the same weights, and all GRUs across
all passes share the same structure. Default parameters are noted
here, which should be assumed unless stated otherwise. The default
WAVE uses three forward–backward passes. The GRUs return vectors
of dimension 25. The output gate uses a single fully connected layer
with exponential linear activations to compute a new output. The mix
gate uses a separate fully connected layer for each type of edge (cross
or tree) and each weighting (softmax and softsign). The baseline model
had 50,289 parameters.
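The breadth-first bookkeeping can be sketched as follows (a connected molecule and an adjacency list are assumed); because atoms joined by a bond can differ by at most one BFS layer, every bond falls into exactly one of the two classes:

```python
from collections import deque

def bfs_layers_and_edges(adj, root):
    # Layer atoms by bond distance from the root atom; label each bond as a
    # tree edge (spanning consecutive layers) or a cross edge (within a layer).
    depth = {root: 0}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in depth:
                depth[j] = depth[i] + 1
                queue.append(j)
    layers = [[] for _ in range(max(depth.values()) + 1)]
    for atom, d in depth.items():
        layers[d].append(atom)
    tree, cross = [], []
    for i in range(len(adj)):
        for j in adj[i]:
            if i < j:  # visit each undirected bond once
                (cross if depth[i] == depth[j] else tree).append((i, j))
    return layers, tree, cross
```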
The WAVE Model GRU
The schematic and narrative overview of the WAVE model’s GRU (Figure 5E–G) depicts an information flow that is formalized by the following set of equations. The GRU computes an updated state from the current state of the atom and the states of bond-connected atoms. This updated state is passed to neighboring atoms in the next layer, which have not been updated. Using column vectors, all its inputs are states of dimension d. The updated state s* is the weighted average of the mix gate output m and a computed output vector y,

s* = u ⊗ y + (1 − u) ⊗ m

Here, u is an update vector that ranges from zero to one, y is the preliminary output vector, and m is the output of the mix gate, all of which are vectors of length d. The operator ⊗ is the element-wise multiply. The update and output vectors are computed as

u = σ(W_u [m; s] + b_u)
y = relu(W_y [m; s] + b_y)

where s is the current state
of the atom and the subscripted W and b variables denote tunable matrices and vectors of appropriate dimension
and size. The relu function is the rectified linear activation, and
σ is the logistic activation, both of which are element-wise
functions. Both m and s are vectors
of d length, which are concatenated into a vector
of length 2d; m is the mixed neighbor
state, computed as

m = (A ⊗ N*)·1 + (B ⊗ N*)·1

where 1 is a unity vector of
dimension i that collapses its operand into a vector
of size d, N* is the previously
updated states from neighboring atoms, stacked as columns alongside
each other in a d by i matrix, and i is the number of input states. A and B are computed weighting matrices, and N* is composed of both tree and cross edge states, as determined by
the breadth-first search,

N* = [T*, C]

where T* and C are the column stacked tree and
cross edge states, which are integrated using different sets of weights from each other.
To enable parallel processing of all atoms at a given depth from the
central atom, we use the updated states for the tree edge connected
atoms, and non-updated states from the cross edge connected atoms.
The weight matrices A and B are
shaped the same as N*. A is computed
by

A = softmax([ {w_aT} ⊗ T* + {v_aT} ⊗ {s} + {b_aT},  {w_aC} ⊗ C + {v_aC} ⊗ {s} + {b_aC} ])

In this nonstandard
notation,
the braces are broadcast operators that replicate and horizontally
stack the s, b, w, and v vectors to match the correct dimensions for concatenation or element-wise
arithmetic. The softmax is defined in the usual way, ranges from zero
to one, and is oriented so as to normalize each row to sum to one,
normalizing each component of the state independently. The matrix B is computed with a similar formula,

B = softsign([ {w_bT} ⊗ T* + {v_bT} ⊗ {s} + {b_bT},  {w_bC} ⊗ C + {v_bC} ⊗ {s} + {b_bC} ])

where the softsign is the
standard element-wise function that ranges from zero to one. In this
way, the tree and cross edges are integrated into the same aggregate
state m using different parameters. The combined
use of both softmax and softsign enables a context dependent and nonlinear
aggregation that can range from a weighted average, to weighted sum,
to a minimum or maximum. The GRU computes its update from the updated
states of atoms in the previous layer and non-updated atoms from the
current layer, and then passes the newly updated state s* to atoms in the next layer to be updated.
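A numpy sketch of these equations follows; the parameter names (w_aT, v_aT, W_u, and so on) are hypothetical labels, and mapping the softsign onto (0, 1) is one interpretation of the text:

```python
import numpy as np

def softmax_rows(x):
    # Normalize across columns (neighbors), independently for each row (component).
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softsign01(x):
    # Softsign mapped onto (0, 1), per the text's description of its range.
    return 0.5 * (x / (1.0 + np.abs(x)) + 1.0)

def mix_gate(s, T, C, p):
    # s: (d,) current state; T: (d, t) updated tree-edge states; C: (d, c)
    # non-updated cross-edge states; p: dict of hypothetical parameters.
    col = lambda v: v[:, None]
    N = np.hstack([T, C])                                   # N* = [T*, C]
    A = softmax_rows(np.hstack([
        col(p["w_aT"]) * T + col(p["v_aT"] * s + p["b_aT"]),
        col(p["w_aC"]) * C + col(p["v_aC"] * s + p["b_aC"])]))
    B = softsign01(np.hstack([
        col(p["w_bT"]) * T + col(p["v_bT"] * s + p["b_bT"]),
        col(p["w_bC"]) * C + col(p["v_bC"] * s + p["b_bC"])]))
    return (A * N).sum(axis=1) + (B * N).sum(axis=1)        # m = (A⊗N*)·1 + (B⊗N*)·1

def gru_update(s, T, C, p):
    # Assumes at least one tree and one cross neighbor, for brevity.
    m = mix_gate(s, T, C, p)
    ms = np.concatenate([m, s])                             # [m; s], length 2d
    u = 1.0 / (1.0 + np.exp(-(p["W_u"] @ ms + p["b_u"])))   # update gate (logistic)
    y = np.maximum(0.0, p["W_y"] @ ms + p["b_y"])           # preliminary output (relu)
    return u * y + (1.0 - u) * m                            # s* = u ⊗ y + (1 − u) ⊗ m
```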
Hybrid N-WAVE Architecture
In addition to the WAVE
architecture, we experimented with hybrid architectures, which combined
the neighborhood convolution and WAVE architectures. In this architecture,
neighborhood convolution is applied to compute an atom representation
which is then used as the input to multiple passes of WAVE.
Variants
of the WAVE Architecture
In addition to the
standard WAVE, we studied several variants of the gated recurrent
unit architecture used to propagate information. Optional architectural
components included (1) the read gate (a single fully connected layer
with sigmoid activation, which selects among the output features of
the mix gate for input to the unit); (2) the update gate; (3) the
mix gate (replacing with an unweighted sum of input bond features);
(4) use of the atom input features in the mix gate weight computation;
(5) the mix gate’s softmax weighted sum; (6) the mix gate’s
linear weighted sum; (7) use of the same recurrent unit for both forward
and backward passes; and (8) a hybrid model with a neighborhood convolution
as the first layer before the WAVE (N-WAVE). Furthermore, we
studied three ways for the mix gate to process bonds labeled as cross
edges by the breadth-first search: (1) the same weight computations
are used for tree or cross edges indiscriminately; (2) weights on
cross edges are computed by separate networks, or ports, from those
used on tree edges; or (3) there are separate ports for tree and cross
edges, treating cross edges as undirected, with cross edge memory
fetched from the previous layer. The standard model treated cross
edges with the latter strategy.
Hyperparameter Sweeps
We assessed validation accuracy
for many values of each model’s hyperparameters. For each combination
of hyperparameter values, three models were trained. Of these three,
the model with the lowest error on the training set was chosen for
evaluation on the validation set. A complete list of studied hyperparameters
and tested values can be found in Table S2. In total, 567 weave model architectures and 324 N-WAVE model architectures
were evaluated.