Jeonghee Jo1, Bumju Kwak2, Byunghan Lee3, Sungroh Yoon4. 1. Bio-MAX Institute, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea. 2. Recommendation Team, Kakao Corporation, 235 Pangyoyeok-ro, Bundang-gu, Seongnam-si, Gyeonggi-do 13494, Republic of Korea. 3. Department of Electronic and IT Media Engineering, Seoul National University of Science and Technology, 232 Gongneung-ro, Nowon-gu, Seoul 01811, Republic of Korea. 4. Department of Electrical and Computer Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea.
Abstract
A molecule is a complex of heterogeneous components, and the spatial arrangement of these components determines the overall molecular properties and characteristics. With the advent of deep learning in computational chemistry, several studies have focused on how to predict molecular properties based on molecular configurations. A message-passing neural network provides an effective framework for capturing molecular geometric features by treating a molecule as a graph. However, most of these studies have assumed that all heterogeneous molecular features, such as atomic charge, bond length, or other geometric features, always contribute equally to the target prediction, regardless of the task type. In this study, we propose a dual-branched neural network for molecular property prediction based on both the message-passing framework and standard multilayer perceptrons. Our model learns heterogeneous molecular features at different scales, which are trained flexibly according to each prediction target. In addition, we introduce a discrete branch that learns single-atom features without local aggregation, apart from the message-passing steps. We verify that this novel structure improves the model performance, and the proposed model outperforms other recent models while using sparser representations. Our experimental results indicate that, in chemical property prediction tasks, the diverse chemical nature of the targets should be carefully considered for both model performance and generalizability. Finally, we provide an intuitive analysis connecting the experimental results to the chemical meaning of each target.
To design de novo compounds, all possible candidate structures should be explored throughout a combinatorial space called chemical compound space (CCS). Quantum mechanics (QM) functions as a guide that narrows down this search space based on the first principles of molecular dynamics. Density functional theory (DFT) is the standard QM method for analyzing a molecular electronic structure from its numerical wave functions to predict molecular behaviors.

In the past decade, machine
learning (ML)-based methods have been developed and have played a key role in various QM tasks. For example, support vector machines (SVMs),[1] Gaussian processes,[2] kernel ridge regression models,[3,4] and other ML-based approaches[5] have been applied to predict molecular properties or to model molecular dynamics. Recently, with the advent of large-scale quantum-mechanical
databases, deep learning (DL)-based models have become crucial in
quantum machine learning (QML) studies.[6] Deep neural network (DNN)-based methods have made it possible to model hidden, complex relationships between molecular structures and their properties.

Early DNN-based QM studies explored various
tasks such as DFT prediction,[7] a simple
molecule dynamics simulation,[8] atomization
energy prediction,[9] or learning electronic
properties.[10] These models treated a single atom as a neuron of the input layer, within the frame of a one-body representation of a given system. Given a molecular geometry, a single atom can be represented by its atomic number and position, and multiple atoms can be represented in the same way together with their interactions, using their mutual distances or angles. However, standard MLP-based models cannot handle chemicals of varying sizes.[2,11] This limitation harms model generalizability.

Graph neural networks (GNNs) can address this limitation by describing
a molecule as a graph. GNNs can train both node and edge representations,
which are naturally equivalent to atoms and atom–atom relationships
present in a molecule. This graph-based molecular representation has
become widely adopted in solving various molecule learning tasks.
Several GNN-based models[12−16] adopted a localized convolution kernel to learn each component of
a molecule. These models are called graph convolutional networks (GCNs).
Spatial embedding-based GCN models train each local attribute iteratively and then aggregate these features to represent molecular properties.

The message-passing neural network (MPNN) framework[19] has taken these approaches one step further.
Building on graph convolution, MPNNs learn node features in two phases: a message-passing phase and an update phase. In the message-passing phase, the hidden
representations of nodes, neighboring nodes, and their edges are aggregated
to represent the locality of a given node, called a message. Subsequently, the hidden features of these nodes are updated with
the message and other features in the following update phase. The
final step in a network is called a readout phase;
all of the features are combined to yield the predictions. In the
MPNN framework, both the message aggregation and the readout phase
use a sum operator to gather local attributes and all atomic attributes,
respectively. All MPNN models can be considered a type of spatial GCN because, in these models, each atom feature is updated with localized convolutional filters.

Since the late 2010s, several message-passing-based DNN models have been developed, such as SchNet,[12,20] GDML,[13] PhysNet,[15] MegNet,[21] and DimeNet.[22] SchNet
and the subsequent model architectures suggested a block structure, which consists of a set of specific message-passing functions.
They also adopted a continuous-filter convolution to learn intramolecular
interactions that take the form of a continuous distribution. These
models used atom types and atom–atom distances as inputs of
the model. Some studies used more sophisticated functions to represent angles in a rotationally equivariant approach.[23,24] Many of the MPNN-based models exhibited competitive performances in various QM research areas, even in molecular property prediction for nonequilibrium states.

However, most previous works did not consider the natural heterogeneity among chemical properties. Although subtle, the target properties originate from different chemical natures. For example, among the 12 targets in QM9, nine describe types of molecular energy, while the other three do not. Conventional DL optimization can select parameters that describe all of the targets reasonably well. However, it cannot determine which feature type is more important for predicting each target. In this case, the optimal parameters for a given model architecture may vary by target type. Almost all previous models have no choice but to optimize parameters under the assumption that all types of features, such as node features or edge features, contribute equally to predicting all targets. This may limit model generalizability and transferability because it ignores the diversity of the chemical factors that determine various properties.

In addition, GNNs have difficulty differentiating nodes, especially in deep architectures, a phenomenon known as the oversmoothing problem.[17,25] This problem emerges because
iterative aggregations of neighboring features are equivalent to a
repeated mixing of local features, eventually making all node representations
too similar and reducing the model performance. Therefore, stacking more layers onto a network is not a solution in the GNN (MPNN) case without careful consideration.

To overcome these limitations, we propose a novel GNN for molecular property prediction, namely, a dual-branched Legendre message-passing neural network (DL-MPNN), with simple and powerful scalability. We used two types of features: atom types and positions in three-dimensional space. From the atomic coordinates, we calculated the distance between two atoms and the angle between three atoms. In detail, we applied modified Legendre rational functions as radial basis functions for atom–atom distances. For the representation of angles between atoms, orthogonal Legendre polynomials were used. Finally, we attached trainable scalars to each type of feature for model flexibility in multitask QM learning. Our approach exhibited performance superior to other recent models on quantum-mechanical benchmark tasks. In addition, we validate the effectiveness of the proposed method via ablation studies.

To
the best of our knowledge, this study is the first attempt to automatically explore the relative importance of features for each target type without increasing the computational cost. The introduction of trainable scalars has several advantages over adding an extra layer. First, the overall computational cost is not significantly increased, because it involves only scalar multiplications. Second, it makes the model flexible with respect to heterogeneous feature types in multitarget tasks.

In addition, we introduce a novel MLP-based block type called a single-body block. These blocks are stacked to form a discrete branch of the model and do not communicate with the message-passing branch, as they train only atom embeddings. The outcome of this MLP-based pathway is aggregated with that of the MPNN at the final stage. Accordingly, we obtain an atom feature that is not overmixed by incoming messages during training.

In summary, our contributions are as follows:
(1) We propose a novel dual-branched network for molecular property prediction, which comprises a message-passing pathway for locally aggregated features and a fully connected pathway for discrete atoms. (2) We introduce trainable scalar-valued parameters that enable the model to enhance the more important feature signals for each label. (3) Our experimental results are better than those of most previous works on public benchmarks, with sparser representations. In addition, we verify that various quantum-chemical factors contribute differently according to the target molecular property.
Experimental Section
Our neural network is dual-branched, comprising two different types of networks. One is an MPNN, a type of spatial GCN, and the other is an MLP-based network. The details are described in the following sections.
Preliminaries
We briefly introduce the notation and the data used in this study. We describe a molecule as a set of atom points V = {v_i}, where v_i = (z_i, p_i). Z = {z_i} is the set of scalar-valued nuclear charges identifying each atom type, and P = {p_i}, where p_i = (p_i^x, p_i^y, p_i^z), is the set of xyz coordinates of the atom points in Euclidean space. We denote N(v_i) as the set of neighboring atoms of v_i.
Data Preprocessing
We used two types of information about the atoms in a molecule: the atom type z_i and the coordinates p_i of each atom. We calculated distances d_ij and angles α_ijk from the atom coordinates p_i, p_j, and p_k. Because the coordinates describe the whole molecular structure, the position of a single atom in isolation is meaningless. We initialized random trainable atom-type embeddings X from Z and trained them. All d_ij and α_ijk values were calculated from the molecules before model training. More details are described in Figure 1.
Figure 1
Data preprocessing. The atom types Z and atomic coordinates p_i were used in the model. We created trainable embeddings X from Z and calculated distances d_ij and angles α_ijk from the coordinates p_i, p_j, and p_k.
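To make this preprocessing concrete, the following is a minimal NumPy sketch of the step, assuming the 4.0 Å cutoff introduced later; the function and variable names are ours, not from the original implementation.

```python
import numpy as np

def preprocess(positions, cutoff=4.0):
    """Compute pairwise distances d_ij within a cutoff and the cosines of
    angles alpha_ijk between every two edges sharing atom j."""
    n = len(positions)
    diff = positions[None, :, :] - positions[:, None, :]  # (n, n, 3) vectors p_j - p_i
    dist = np.linalg.norm(diff, axis=-1)                  # (n, n) distance matrix
    # Edges: ordered pairs (i, j), i != j, closer than the cutoff.
    src, dst = np.where((dist < cutoff) & (dist > 0.0))
    d_ij = dist[src, dst]
    # Angles: for every pair of edges attached to the same atom j.
    cos_angles = []
    for j in range(n):
        nbrs = dst[src == j]
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                u = positions[nbrs[a]] - positions[j]
                v = positions[nbrs[b]] - positions[j]
                cos_angles.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return src, dst, d_ij, np.asarray(cos_angles)
```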
Input Embedding: Distance and Angle Representations
Distance Representation: Radial Basis Function
We used radial basis functions to expand a scalar-valued edge distance into a vector. In particular, we adopted sequences of Legendre rational polynomials, a family of continuous and bounded orthogonal polynomials. The Legendre rational function R_n(x) is obtained from the Legendre polynomial of the first kind Q_n(x) (defined in the following subsection) through the mapping x ↦ (x − 1)/(x + 1):

R_n(x) = (√2 / (x + 1)) Q_n((x − 1)/(x + 1))

where n denotes the degree. The nth-order Legendre rational functions are generated recursively as follows:

R_{n+1}(x) = ((2n + 1)/(n + 1)) ((x − 1)/(x + 1)) R_n(x) − (n/(n + 1)) R_{n−1}(x)

We selected the first n polynomials R_1(x), R_2(x), ..., R_n(x), such that a scalar-valued distance d_ij is embedded as an n-dimensional vector (R_1(d_ij), ..., R_n(d_ij)). Plots of the functions of degree 1, 2, ..., 12 are shown at the top right of Figure 2. We modified the equations to set the maximum value to 1.0, making the distribution of the bases similar to that of the initial atom embedding distribution.
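A short sketch of this distance embedding follows, using the standard (unmodified) Legendre rational functions and the three-term recurrence above; the paper's additional rescaling of the maximum value to 1.0 is omitted here.

```python
import numpy as np

def legendre_rational(x, n_max):
    """Evaluate Legendre rational functions R_1..R_{n_max} at x >= 0
    via the three-term recurrence inherited from Legendre polynomials."""
    x = np.asarray(x, dtype=np.float64)
    t = (x - 1.0) / (x + 1.0)           # maps [0, inf) into [-1, 1)
    r_prev = np.sqrt(2.0) / (x + 1.0)   # R_0
    r_curr = r_prev * t                 # R_1
    basis = [r_curr]
    for n in range(1, n_max):
        r_next = ((2 * n + 1) * t * r_curr - n * r_prev) / (n + 1)
        r_prev, r_curr = r_curr, r_next
        basis.append(r_curr)
    return np.stack(basis, axis=-1)     # shape (..., n_max): the distance embedding
```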
Figure 2
Basis functions for data preprocessing. For any pair of atoms located closer than a given cutoff, we created an edge representation between the two atoms, regardless of molecular bond information. A scalar-valued distance is expanded into an n-dimensional vector by radial basis functions; for these, we used Legendre rational polynomials. Angle representations are depicted for any two edges sharing one atom. A cosine-valued angle is represented by Legendre polynomials of degree 1, 2, ..., m (top right).
Recall that we did not use any
molecular bond information. Following previous studies,[12,15,20,22−24,27] we set a single scalar cutoff c and assumed, before training, that any atom pair v_i and v_j located closer than the cutoff can interact with each other, and that pairs farther apart cannot. In other words, we built edge representations between any two atoms close to each other. The cutoff value is analogous to the kernel height or width in a conventional convolutional layer because it determines the size of the receptive field of the local feature map.
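For larger systems, the same cutoff-based neighbor list can be built without the full O(n²) distance matrix; a sketch using SciPy's k-d tree (our choice of tool, not necessarily the paper's):

```python
import numpy as np
from scipy.spatial import cKDTree

def neighbor_pairs(positions, cutoff=4.0):
    """Return all unordered atom pairs closer than the cutoff, using a k-d tree
    instead of the dense O(n^2) distance matrix."""
    tree = cKDTree(positions)
    pairs = np.array(sorted(tree.query_pairs(r=cutoff)))  # (n_pairs, 2) indices
    return pairs
```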
Angle Representation: Cosine Basis Function
Similarly, we adopted another sequence of orthogonal polynomials, the Legendre polynomials of the first kind, to represent a scalar-valued angle α_ijk as a vector. Legendre polynomials of the first kind are orthogonal polynomials that exhibit several useful properties: they are the solutions of the Legendre differential equation,[37] and they are commonly used to describe various physical systems and dynamics. Their formula is given by the Rodrigues representation

Q_n(x) = (1/(2^n n!)) d^n/dx^n (x² − 1)^n

The polynomials of degree n are orthogonal to each other, such that

∫_{−1}^{1} Q_m(x) Q_n(x) dx = (2/(2n + 1)) δ_mn

We calculated the angle α_ijk between any pair of edges (i, j) and (j, k) sharing one node. We selected the first m polynomials Q_1(x), Q_2(x), ..., Q_m(x), such that a scalar-valued angle is embedded as an m-dimensional vector (Q_1(cos α_ijk), ..., Q_m(cos α_ijk)). The scheme is presented on the left side of Figure 2. We calculated the cosine value of each angle and embedded it with Legendre polynomials.
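A minimal sketch of this angle embedding using scipy.special.eval_legendre (the paper reports using SciPy for the Legendre polynomials; the function name and the default m = 12 follow the setting described later):

```python
import numpy as np
from scipy.special import eval_legendre

def angle_embedding(cos_alpha, m=12):
    """Expand cosine-valued angles into the first m Legendre polynomials
    of the first kind, Q_1(x)..Q_m(x), evaluated on [-1, 1]."""
    cos_alpha = np.asarray(cos_alpha)
    return np.stack([eval_legendre(n, cos_alpha) for n in range(1, m + 1)], axis=-1)
```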
Comparison of Two Different Types of Legendre Polynomials
We briefly compare the two different types of Legendre polynomials. First, the sequences of Legendre rational functions {R_n(x)} are adopted as the encoding method for distances and have different distributions over the distances. The standard deviations of the distance encodings are higher at shorter distances. Therefore, they can approximate interatomic potentials, in the sense that an atom pair exhibits a stronger interatomic relationship when the atoms are located close to each other. In other words, the sequences of Legendre rational functions {R_n(x)} give shorter distances richer representations.

Next, all Legendre polynomials of the first kind {Q_n(x)} are defined on the range [−1, 1]. These functions are symmetric (even functions) or antisymmetric (odd functions) about the zero point. Because the cosine function is an even function ranging from −1 to 1, {Q_n(x)} can cover the cosine of angles over 0–2π and is symmetric about π. In three-dimensional Euclidean space, the angle θ (0 ≤ θ ≤ π) between two nonparallel vectors on a plane is equivalent to its supplementary angle (π − θ). Therefore, the Legendre polynomials of the first kind {Q_n(x)} can represent the cosine of angles properly.
Model Architecture
The overall model architecture is described in Figure 3. There are two discrete pathways in the network: the MLP-based pathway comprising single body blocks, and the MPNN-based pathway comprising output blocks and interaction blocks. First, an atom type Z is embedded by the embedding block. The distances are calculated from the atomic coordinates p_i, and if a distance d_ij is less than the predefined cutoff value, then d_ij is represented as an edge attribute of the molecular graph. The scalar d_ij is expanded into a vector by the radial basis functions {R_n(x)}. The angles α_ijk are calculated from the edges, and each α_ijk is encoded as an m-dimensional vector before training via the Legendre polynomials of the first kind {Q_n(x)}. In this study, we set m = 12.
Figure 3
Model architecture. The network comprises two separate branches: MLP-based (red arrow) and MPNN-based (blue arrow) pathways. Atom types Z are embedded as trainable matrices. Distances d_ij and angles α_ijk are calculated from the atomic coordinates. Radial basis representations are calculated from d_ij, and angular representations from α_ijk. All blocks except the embedding block are stacked multiple times (without sharing weights). In this model, we stacked six single body blocks, seven output blocks, and six interaction blocks. For simplicity, skip connections are not shown.
In the MPNN, the message-passing and readout steps
of our model can be summarized as

m_i^{t+1} = Σ_{j ∈ N(i)} f_M(h_i^t, h_j^t, e_ij, α_ijk)
h_i^{t+1} = f_U(h_i^t, m_i^{t+1})
ŷ = f_R({h_i^T})

where m_i, h_i, e_ij, and α_ijk denote the message of atom v_i aggregated over its neighbors, the single-atom representation of atom v_i, the interaction between two neighboring atoms (v_i, v_j), and the angle information among three atoms (v_i, v_j, v_k), respectively. t is the time step or layer order in the model, and ŷ represents the predicted target value. Both f_M and f_U represent graph convolutions, and f_R is the sum operation.

In the
proposed model, there are four types of blocks: the embedding
block, output block, single body block, and interaction block. All
blocks except the embedding block are sequentially stacked multiple
times. Detailed explanations are presented in the next section.
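For clarity, a schematic TensorFlow sketch of one message-passing phase and the sum readout follows; f_msg and f_upd stand for the learned graph-convolution functions f_M and f_U above, and the sketch abstracts away the paper's specific block structure.

```python
import tensorflow as tf

def message_passing_step(h, edges, e_feat, f_msg, f_upd):
    """One message-passing phase: aggregate neighbor/edge features into a
    message m_i, then update each node state h_i (schematic only)."""
    src, dst = edges[:, 0], edges[:, 1]
    # Per-edge message from the two endpoint states and the edge feature.
    msg = f_msg(tf.concat([tf.gather(h, src), tf.gather(h, dst), e_feat], axis=-1))
    # Sum-aggregate incoming messages at each receiving node.
    m = tf.math.unsorted_segment_sum(msg, dst, num_segments=tf.shape(h)[0])
    return f_upd(tf.concat([h, m], axis=-1))  # update phase

def readout(h):
    """Readout phase: sum all atom states into a molecule-level representation."""
    return tf.reduce_sum(h, axis=0)
```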
Block Details
We adopted four types of blocks in the model: the embedding block E, the output blocks O_t, the single body blocks S_t, and the interaction blocks I_t, where t = 0, ..., (T − 1) indexes the multiple sequential blocks. The atom types Z enter the embedding block E at the first step, and the embedded X moves to the other three block types at time step t = 0. Distance embeddings d_ij are used in all blocks E, O_t, and I_t, except the single body blocks S_t. Angle embeddings α_ijk are used solely in I_t. All blocks are sequentially stacked over time steps t = 0, ..., (T − 1), including some skip connections. The number of time steps T may differ between block types. Detailed structures are shown in Figure 4.
Figure 4
Block structures. (a)
Embedding block (top left). (b) Interaction
block (top right). (c) Single body block (bottom left). (d) Output
block (bottom right). We denote the dense layer with the activation
as σ(Wx + b). We also denote
the trainable scalars as γ. All blue boxed items, including
dense layers and features, are γ-trainable.
Subscripts are omitted for simplicity. (e) Weighted aggregation formula
of the two final outputs from the single body block and output block pathways. c_S and c_P denote the trainable scalar-valued coefficients.
Embedding Block
In the embedding block E, the categorical atom types Z are represented as float-valued trainable features. z_i, z_j, and d_ij from each atom pair are used to make the atom embeddings X.
Output Block
In the output block O_t, the output of I_t is trained together with the radial basis representation of d_ij, except in the first block, which takes the embedded input X from E. ⊕ and ⊗ denote the direct sum and direct product, respectively. All blue-boxed objects are γ-trainable. These blocks train the two-body representation.
Single Body Block
Message-passing functions mix all representations across a molecule. We assumed that this is not always beneficial for predicting molecular properties. Therefore, we introduced a path separated from the message-passing pathway. In the single body block S_t, the output X from E enters and is trained through multiple MLP layers. The last layer is γ-trainable. Neither edge nor angle representations are used. These blocks handle per-atom representations.
Interaction Block
In the interaction block I_t, the output of O_t is combined with the radial basis representation of d_ij and the angular representation from {Q_n(x)}. The γ-trainable blue boxes and the ⊕ and ⊗ operators are also used in these blocks. These blocks train the three-body interaction.
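The final aggregation of the two pathways (Figure 4e) can be sketched as a layer with the two trainable scalar coefficients c_S and c_P; the initializer choice here is our assumption.

```python
import tensorflow as tf

class WeightedAggregation(tf.keras.layers.Layer):
    """Combine the single body block output and the output block output with
    trainable scalar coefficients c_S and c_P, as in Figure 4e."""
    def build(self, _):
        self.c_s = self.add_weight(name="c_s", shape=(), initializer="ones")
        self.c_p = self.add_weight(name="c_p", shape=(), initializer="ones")

    def call(self, out_single, out_output):
        return self.c_s * out_single + self.c_p * out_output
```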
Trainable Scalars: γ
In molecular property prediction tasks, various factors determine multiple properties, with different contribution weights for each target. To tackle this issue, we did not add any extra layer or attention mechanism, which could expose the model to the oversmoothing problem and be cost-intensive.

Instead, we propose a simpler solution: we introduced trainable scalars γ in each layer. This makes the contributions of heterogeneous factors flexible according to each target. In other words, the importance of heterogeneous features, such as atom types and atom–atom distances, can vary over targets, and our model can therefore focus on the features most important to the target. The mathematical formulation is described below.

We introduced γ into some of the layers in the output blocks, single body blocks, and interaction blocks. All γ values are trained independently of each other (subscripts are omitted for simplicity). We initialized the γ values from the exponential of a random normal distribution (μ = 0.0, σ = 0.1), namely, γ = exp(ε) with ε ~ N(0.0, 0.1). This was because we observed that this slightly right-skewed distribution performed better than a normal distribution with μ = 1.0, without skewness. The distribution is presented in Figure 5.
Figure 5
Gaussian distribution and the initialization of the γ distribution. Both distributions attain their maximum values at x = 1.0.
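A sketch of this initialization and of a γ-scaled dense layer follows; the activation choice and class names are our assumptions, not the paper's exact implementation.

```python
import numpy as np
import tensorflow as tf

def gamma_initializer(shape=(), mu=0.0, sigma=0.1):
    """Initialize gamma = exp(N(mu, sigma)): a slightly right-skewed
    distribution peaking near 1.0, as described above."""
    return tf.constant(np.exp(np.random.normal(mu, sigma, size=shape)),
                       dtype=tf.float32)

class GammaDense(tf.keras.layers.Layer):
    """A dense layer whose output is rescaled by a trainable scalar gamma."""
    def __init__(self, units, activation="swish"):  # activation is our assumption
        super().__init__()
        self.dense = tf.keras.layers.Dense(units, activation=activation)

    def build(self, _):
        self.gamma = self.add_weight(
            name="gamma", shape=(),
            initializer=lambda shape, dtype=None: gamma_initializer(shape))

    def call(self, x):
        return self.gamma * self.dense(x)
```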
Related Works
Starting from DTNN,[26] several GNNs have adopted the perspective of a molecule as a graph. Under this perspective, a molecular system is a combination of atoms and many-body interatomic interactions, which correspond to graph nodes and edges. MPNN[19] introduced the message concept, an aggregation of the incident edges and corresponding neighbor nodes. SchNet[12] is a multiblock network based on the message-passing framework; its gradients flow atomwise, and the convolution over atom–atom interaction features occurs in the interaction block. Subsequently, PhysNet,[15] MegNet,[21] and MGCN[27] extended these works and improved performance based on the message-passing multiblock framework.

More recent works introduced angle information to describe the geometry
of a molecule. With angle information, we can inspect up to three-body
interactions. Spherical harmonics were used to represent an original
angle to be rotationally equivariant.[28,29] Several studies[14,23,30,31] introduced a Clebsch–Gordan decomposition to represent an angle as a linear combination of irreducible representations. Theoretically, this can be expanded to arbitrary n-body networks,[14,23,31] but most networks do not explore more than three-body interactions, owing to limited computational resources. In summary, most recent GNNs for molecular learning use an atom type (one-body), the distance between an atom pair (two-body), and the angle between three atoms (three-body). As aforementioned, we used the same input features as these previous GNNs, for a fair evaluation.
Training and Evaluation
Data Set
We used the QM9[38] and molecular dynamics (MD) simulation data sets[13,39] for the experiments. QM9, created in 2014, is the most popular quantum-mechanics database; it comprises about 134,000 small molecules made up of five elements: carbon, oxygen, hydrogen, nitrogen, and fluorine. Each molecule has 12 numerical properties, so the benchmark consists of 12 regression tasks.

The MD simulation data set was created for the task of predicting energies from molecular conformational geometries. The simulation data are given as trajectories, and the subtasks are divided by molecule type. The energy values are given as scalars (kcal/mol), and the forces are given as three-dimensional vectors (in xyz format) for each atom. The energies can be predicted from the molecular geometry alone or by using the additional forces. We used the most recent subsets,[39] whose properties were created from CCSD(T) calculations, a more reliable method than conventional DFT.
Implementation
The TensorFlow package, version 2.2, was used to implement the proposed framework. To calculate the Legendre polynomials, we utilized the SciPy package, version 1.5.2.

For the QM9 data set, we trained the model for at least 300 epochs per target and terminated training when the validation mean absolute error (MAE) did not decrease for 30 epochs; the total number of training epochs therefore differs slightly between labels. We randomly split the data into training, validation, and test sets of 80%, 10%, and 10% of the whole data set, respectively, in accordance with the guideline.[40] The initial learning rate and its decay rate were set to 10^−3 and 0.96 per 20 epochs, respectively. Adam[41] was used as the optimizer, and the batch size was set to 24.

For the MD simulation data set, most of the training configurations were the same as in the QM9 experiment. However, we modified the loss function, because the energies were trained using both energies and forces, unlike QM9. We obtained the predicted forces from the gradients with respect to the atom positions, following previous works.[15,22] We also followed the original guideline[39] for data splitting: we used 1000 samples to train each subtask and 500 samples for testing (1000 for ethanol). We set the decay rate to 0.9, which decreases the learning rate faster than in the QM9 training.
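A sketch of such a joint energy–force loss with TensorFlow's GradientTape follows; the model signature and the relative weighting rho are assumptions, since the paper does not spell them out here.

```python
import tensorflow as tf

def energy_force_loss(model, z, pos, e_true, f_true, rho=1.0):
    """Joint MD loss: predicted forces are the negative gradients of the
    predicted energy with respect to the atom positions. The weighting rho
    between the energy and force terms is a hypothetical hyperparameter."""
    with tf.GradientTape() as tape:
        tape.watch(pos)
        e_pred = model(z, pos)            # hypothetical model signature
    f_pred = -tape.gradient(e_pred, pos)  # forces from position gradients
    return tf.reduce_mean(tf.abs(e_true - e_pred)) + \
           rho * tf.reduce_mean(tf.abs(f_true - f_pred))
```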
Evaluation
First, we evaluated the model performance with MAE, the standard metric for QM9.[40] We compared the performance of our model with those of the previous models SchNet,[12] Cormorant,[23] LieConv,[42] DimeNet,[22] and DimeNet++.[32] We also analyzed the proposed methods in terms of the effect of the γ values and of the single body block for each target. To assess the single body block, we evaluated the performance of the model without single body blocks (i.e., without the dual-branched architecture).

In particular, we observed that the single body block contributed to the model performance differently for each target type. We inferred that these differences can be explained by the different natures of the properties. We discuss the relationship between some of the results and their chemical interpretations in the next section.

We compared the change over epochs in the ratio of the sum of the γ values from the last single body block to that from the output blocks throughout training. We extracted the γ values and calculated the ratio ∑(γ values from the single body block)/∑(γ values from the output block) every 30 epochs, up to 200 epochs. After this point, the ratios varied negligibly, so we do not depict them beyond that time. Note that we did not compare the γ values directly among different layers or targets: the magnitudes of these values are determined by several intertwined sources, including the original feature scale, layer weights, layer biases, activation functions, and uncontrollable random noise. Therefore, the γ values themselves cannot be evaluated directly.
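A sketch of this γ-ratio bookkeeping follows; the weight-name prefixes are hypothetical and would depend on how the blocks are named in the actual implementation.

```python
def gamma_ratio(model):
    """Ratio of summed gamma values in the last single body block to those in
    the last output block (assumes weights carry these hypothetical name prefixes)."""
    s = sum(float(w.numpy()) for w in model.trainable_weights
            if "single_body_last/gamma" in w.name)
    p = sum(float(w.numpy()) for w in model.trainable_weights
            if "output_last/gamma" in w.name)
    return s / p
```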
Results and Discussion
Model Performance
The Comparative Results
We compared the performance of our model with that of other MPNN-based models that also use atom types and positions as inputs.[12,21−23,32] The MAEs for QM9 and the MD simulations are reported in Tables 1 and 2, respectively. For QM9, our model achieved the best performance among the compared models on six of the 12 targets. For the MD simulations, our model exhibited the best performance on four of the five targets.
Table 1. Mean Absolute Error on QM9 Compared with Previous Works^a

target | unit | SchNet | Cormorant | LieConv | DimeNet | DimeNet++ | DL-MPNN | MP only
μ | D | 0.033 | 0.038 | 0.032 | 0.0286 | 0.0297 | 0.0256 | 0.0238
α | bohr³ | 0.235 | 0.085 | 0.084 | 0.0469 | 0.0435 | 0.0444 | 0.0457
ϵHOMO | eV | 0.041 | 0.034 | 0.030 | 0.0278 | 0.0246 | 0.0223 | 0.0238
ϵLUMO | eV | 0.034 | 0.038 | 0.025 | 0.0197 | 0.0195 | 0.0169 | 0.0163
Δϵ | eV | 0.063 | 0.061 | 0.049 | 0.0348 | 0.0326 | 0.0391 | 0.0403
⟨R²⟩ | bohr² | 0.073 | 0.961 | 0.800 | 0.331 | 0.331 | 0.414 | 0.385
ZPVE | meV | 1.7 | 2.027 | 2.280 | 1.29 | 1.2 | 1.2 | 1.2
U0 | eV | 0.014 | 0.022 | 0.019 | 0.00802 | 0.0063 | 0.0074 | 0.0084
U | eV | 0.019 | 0.021 | 0.019 | 0.00789 | 0.0063 | 0.0074 | 0.0085
H | eV | 0.014 | 0.021 | 0.024 | 0.00811 | 0.0065 | 0.0076 | 0.0092
G | eV | 0.014 | 0.020 | 0.022 | 0.00898 | 0.0076 | 0.0076 | 0.0083
Cv | cal/(mol·K) | 0.033 | 0.026 | 0.038 | 0.0249 | 0.0249 | 0.0234 | 0.0235

^a DL-MPNN and MP only denote our dual-branched model and the model without the MLP pathway, respectively.
Table 2. Mean Absolute Error on MD Simulation Compared with Previous Works^a

target | train method | sGDML | SchNet | DimeNet | DL-MPNN
aspirin | forces | 0.68 | 1.35 | 0.499 | 0.590
benzene | forces | 0.06 | 0.31 | 0.187 | 0.053
ethanol | forces | 0.33 | 0.39 | 0.230 | 0.10
malonaldehyde | forces | 0.41 | 0.66 | 0.383 | 0.225
toluene | forces | 0.14 | 0.57 | 0.216 | 0.200

^a DL-MPNN denotes our dual-branched model.

In general, our model
exhibited better performance on the QM9 targets that are more related to molecular interactions, such as the dipole moment (μ), the molecular orbital energies (ϵHOMO and ϵLUMO), and the Gibbs free energy (G). These properties are relevant to how molecules react to external effects. However, predictions for other targets, such as the electronic spatial extent (⟨R²⟩) and the internal energies (U0, U, H), were not superior to those of other models.

We found that this may be related to the distribution of each target value. We observed that prediction performance is better when the mean of the target value is closer to zero and its standard deviation is smaller. For example, the 95% confidence intervals (CIs) of the distributions of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) energies are −6.5 ± 1.2 and 0.3 ± 1.3, respectively. For the internal energies U, U0, and H, the 95% CIs are −76.5 ± 10.4, −76.1 ± 10.3, and −77.0 ± 10.5, respectively. This issue, triggered by the diverse target distributions of QM9, has already been reported in the literature.[24]
The Effect of Basis Functions
We briefly discuss the effect of the Legendre polynomials as radial basis functions in our model. The results are shown in Table 3. We compared our radial basis functions with the Bessel basis functions of DimeNet[22] and DimeNet++[32] on QM9, analyzing six of the 12 targets: μ, ϵHOMO, ϵLUMO, Δϵ, ⟨R²⟩, and U. To summarize the comparison, only μ showed a clearly improved validation performance (approximately 30% lower MAE), while the other five targets showed slightly increased MAE values.
Table 3. Relative Validation MAE of Our Model Using Legendre Polynomials as Radial Basis Functions, Compared with the Same Model with the Radial Basis Functions Replaced by Bessel Basis Functions (Used in DimeNet and DimeNet++[22,32])^a

target | (MAE using Legendre polynomials)/(MAE using Bessel basis functions)
μ | 0.77
ϵHOMO | 1.02
ϵLUMO | 1.03
Δϵ | 1.01
⟨R²⟩ | 1.11
U | 1.03

^a A value less than 1.0 means that the Legendre polynomials performed better as radial basis functions.

This inconsistency
over targets could not be clearly explained. Therefore, we conclude that the Legendre polynomials in our model are no worse than Bessel basis functions. Note that our cutoff value is smaller than those of all previous works while maintaining model performance; this is discussed in the next section.
The Effect of Trainable Scalars
The Role of γ in the Model Performance
We trained our model with and without trainable γ values for all 12 labels in the QM9 data set. We found that the validation performance was significantly improved (by over 10%) for μ, ϵLUMO, and ⟨R²⟩ when γ values were trained. Overall, the validation MAE values with trained γ were lower than those without for nine of the QM9 targets. For U0 and U, the MAEs without training γ were 3% lower than those with training γ. The results are shown in Figure 6.
Figure 6
Ratio of the validation MAE without training γ to that with training γ. A ratio greater than 1.0 means that performance with trained γ is superior to that without. Nine of the 12 targets showed a lower validation MAE with trained γ coefficients.
Adaptive Dependency on Pathways by Targets
We analyzed the ratio of the sum of the trained coefficients from the MLP pathway (the last single body block) to that from the MPNN pathway (the last output block) for each target. We plotted the change in this ratio over training steps in Figure 7. If the ratio increases during training, the MLP pathway is more critical than the MPNN pathway for predicting the target, and vice versa. Note that the weights of the MLP pathway are trained only on the atom types Z.
Figure 7
Ratio of the sum of the trained coefficients from the last single body block to that from the last output block. If the ratio goes to zero, the model can be interpreted as depending more on features from atom–atom relationships than on individual atom information. The ratios for U, U0, and H, which are closely related through the internal energy term, showed similar patterns.
We found that, in training the model on μ, the single
body blocks were less important. During training on the target μ, the trainable scalars in S_t became close to zero, whereas this was not the case for the other targets. We found that this exceptional result can be explained by the construction of μ, which is the sum of the dipole moments over all atom pairs in a molecule. We repeated the experiment for μ three times and observed the same pattern every time (results not shown).

The formulation of the dipole moment is given by[33]

μ = Σ_i q_i R_i

where q_i and R_i are the charges and position vectors,
respectively. A dipole moment
occurs between two polarized atoms, especially in a covalent bond
or an ionic bond.[34,35] By definition, dipole moments
are represented by the charge separation between atom pairs (dipoles)
in a molecule. The atom-only representations did not play a key role
in the case of predicting μ in our experiments, which is consistent
with the formulation. The overall performances also support these observations (Table 1). For μ, the performance was better when the model used the MPNN pathway only (MAE = 0.0238) than when the full model was used (MAE = 0.0256), whereas the full model outperformed the MPNN-only model on most other targets. This also confirms that the atom-only representations from the MLP pathway do not help in predicting the μ of a given molecule.

In addition, the curves in Figure 7 for U0, U, and H are similar to one another. U0 and U are the internal energies at 0 and 298.15 K, respectively, and H is the enthalpy at 298.15 K, which contains the internal energy terms; these targets are closely related by construction. Considering the experimental results, we conclude that the model trains differently for each target and that the pattern of the trained results is consistent with the underlying chemistry.
Density of a Molecule Representation
As mentioned before, the cutoff value c determines the node connections in a molecular graph. With a high cutoff value, the molecular graph representation is dense; with a low cutoff, it is sparse.

Dense graphs are generally better able to capture the connectivity information of a graph than sparse graphs. However, dense structures carry a higher risk of gathering excessive features even when little of the information is necessary. Moreover, in an MPNN, the messages of every node from its neighboring edges are mixed repeatedly throughout the network. If a graph is excessively dense, most messages become indistinguishable, because every atom refers to all of its neighbors at every step. This may increase the risk of the oversmoothing problem,[25] which is prone to occur in deep GCNs.

It is challenging to determine
the most appropriate cutoff value for a graph data set; however, we found a clue in the molecular conformations. We observed that the average distance between atom pairs in QM9 molecules is 3.27 Å. Atom pairs within 4.0 Å, the cutoff value used in this work, account for 72% of all pairs in the QM9 data set. The previous models SchNet,[12] MegNet,[21] and DimeNet[22]/DimeNet++[32] adopted cutoff values of 10.0, 5.0, and 5.0 Å, respectively; with cutoffs of 10.0 and 5.0 Å, 99.99% and 89% of the distances, respectively, are represented in the molecular graphs.

Note that the angle representation is defined by two different
the angle representation is defined with two different
edges sharing one atom. This indicates that, for the angle representation,
the required number of computations increases linearly with the squared
number of edges. If the number of nodes of a graph are |VG|, then the upper bound of the number of representations
including all nodes, edges, and angles of the graph is given as even if
the message directions were not
considered. We compared graph density values according to different
cutoff values. If the edges are within a cutoff value of 4.0, then
the number of edge representations are reduced to half ((72%)2 = 52.8%) of the overall possible number of edges . If the
cutoff value is 5.0, then the proportion
of the number of the edge representations is increased to 80% ((89%)2 = 79.2%).We demonstrate that the cutoff value of 4.0
Å is sufficient to describe a molecular graph in QM9. As mentioned earlier, the molecules in QM9 comprise five elements, namely, C, H, O, N, and F. The lengths of the covalent bonds between any two of these atoms are always less than 2.0 Å.[36] Because the maximum length of any two successive covalent bonds is less than 4.0 Å, the model can capture any two-hop neighboring atoms simultaneously with a cutoff value of 4.0 Å. This property makes it possible to identify all angles between two covalent bonds; a detailed illustration is presented in Figure 8. From these observations, we argue that 4.0 Å is the best choice of cutoff value for efficient training and for capturing the chemical nature. The same holds for the MD simulation data set, in which all of the atoms are carbon, hydrogen, and oxygen (C, H, and O). Finally, we argue that 4.0 Å is also an adequate value for real-world organic molecules, because most molecular bond lengths are shorter than 2.0 Å.[36]
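The pair-coverage figures above (72% at 4.0 Å, 89% at 5.0 Å) can be reproduced with a short script of this form (a sketch; molecules stands for a list of per-molecule coordinate arrays):

```python
import numpy as np

def cutoff_coverage(molecules, cutoff=4.0):
    """Fraction of all atom pairs, over a list of coordinate arrays, whose
    distance falls below the cutoff, i.e., the share of pairs kept as edges."""
    within = total = 0
    for pos in molecules:
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        iu = np.triu_indices(len(pos), k=1)        # unique (i, j) pairs, i < j
        within += int(np.count_nonzero(d[iu] < cutoff))
        total += len(iu[0])
    return within / total
```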
Figure 8
Example of bond and angle representation. All covalent bond lengths between two atoms among C, H, O, N, and F are less than 2.0 Å, so any two-hop neighbor v_k reached from an atom v_i via covalent bonds is always located inside the cutoff of 4.0 Å. Therefore, an atom v_i can always capture two successive covalent bonds (v_i, v_j) and (v_j, v_k).
Conclusion
In this study, we developed a novel dual-branched network comprising a message-passing-based pathway for atom–atom interactions and fully connected layers for single-atom representations. We represented a molecule as a graph in three-dimensional space with atom types and atom positions. We embedded each scalar-valued atom–atom distance as a vector using Legendre rational functions and, similarly, embedded each scalar-valued angle as a vector using orthogonal Legendre polynomials. Both function families are complete orthogonal series and incur low computational costs thanks to their recursive relations. Our model exhibited remarkable performance on two quantum-chemical data sets. In addition, we proposed trainable scalar values, such that the model can attend to the more significant features according to the various natures of the targets during training. By analyzing the trained scalars, we also showed that the model offers a useful degree of interpretability: the trained scalar values can reflect the chemical nature of the target. Furthermore, we adopted a smaller cutoff value than those used in previous MPNN models and showed that, with this value, we can save computational resources without loss of performance, while arguing that this cutoff is sufficiently long to identify the local structure of a molecule. Although we demonstrated both model performance and a degree of interpretability, our model was applied only to the restricted domain of small molecules. In future work, we will enhance the model's scalability to broader applications, including biomolecules, drugs, and crystals; conduct further analysis of the model's predictability with respect to target distributions; and extend the approach to other molecular property predictions and to more complex molecules in various fields.