Jeonghee Jo1, Bumju Kwak2, Byunghan Lee3, Sungroh Yoon4. 1. Bio-MAX Institute, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea. 2. Recommendation Team, Kakao Corporation, 235 Pangyoyeok-ro, Bundang-gu, Seongnam-si, Gyeonggi-do 13494, Republic of Korea. 3. Department of Electronic and IT Media Engineering, Seoul National University of Science and Technology, 232 Gongneung-ro, Nowon-gu, Seoul 01811, Republic of Korea. 4. Department of Electrical and Computer Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea.
Abstract
A molecule is a complex of heterogeneous components, and the spatial arrangement of these components determines the overall molecular properties and characteristics. With the advent of deep learning in computational chemistry, several studies have focused on how to predict molecular properties based on molecular configurations. A message-passing neural network provides an effective framework for capturing molecular geometric features by treating a molecule as a graph. However, most of these studies have assumed that all heterogeneous molecular features, such as atomic charge, bond length, or other geometric features, always contribute equally to the target prediction, regardless of the task type. In this study, we propose a dual-branched neural network for molecular property prediction based on both the message-passing framework and standard multilayer perceptrons. Our model learns heterogeneous molecular features at different scales, which are trained flexibly according to each prediction target. In addition, we introduce a discrete branch that learns single-atom features without local aggregation, apart from the message-passing steps. We verify that this novel structure improves the model performance, and the proposed model outperforms other recent models while using sparser representations. Our experimental results indicate that, in chemical property prediction tasks, the diverse chemical nature of the targets should be carefully considered for both model performance and generalizability. Finally, we provide an intuitive analysis connecting the experimental results to the chemical meaning of each target.
To design de novo compounds, all possible candidate structures should be explored throughout a combinatorial space called chemical compound space (CCS). Quantum mechanics (QM) functions as a guide that narrows down this search space based on the first principles of molecular dynamics. Density functional theory (DFT) is the standard QM method for analyzing a molecular electronic structure from its numerical wave functions to predict molecular behaviors.

In the past decade, machine
learning (ML)-based methods have been developed and have played a key role in various QM tasks. For example, support vector machines (SVMs),[1] Gaussian processes,[2] kernel ridge regression models,[3,4] and other ML-based approaches[5] have been applied to predict molecular properties or to model molecular dynamics. Recently, with the advent of large-scale quantum-mechanical
databases, deep learning (DL)-based models have become crucial in
quantum machine learning (QML) studies.[6] Deep neural network (DNN)-based methods have made it possible to model hidden, complex relationships between molecular structures and their properties.

Early DNN-based QM studies explored various
tasks such as DFT prediction,[7] a simple
molecule dynamics simulation,[8] atomization
energy prediction,[9] or learning electronic
properties.[10] These models treated a single atom as a neuron of the input layer, within the frame of a one-body representation of a given system. Given a molecular geometry, a single atom can be represented by its atomic number and position, and multiple atoms can be represented in the same way together with their interactions, using their mutual distances or angles. However, standard MLP-based models cannot handle chemicals of varying sizes.[2,11] This limitation harms model generalizability.

Graph neural networks (GNNs) can address this limitation by describing
a molecule as a graph. GNNs can train both node and edge representations,
which are naturally equivalent to atoms and atom–atom relationships
present in a molecule. This graph-based molecular representation has
become widely adopted in solving various molecule learning tasks.
Several GNN-based models[12−16] adopted a localized convolution kernel to learn each component of
a molecule. These models are called graph convolutional networks (GCNs).
Spatial embedding-based GCN models train each local attribute iteratively and then aggregate these features to represent molecular properties.

The message-passing neural network (MPNN) framework[19] has taken these approaches one step further.
Building on graph convolution, MPNNs learn node features in two phases: a message-passing phase and an update phase. In the message-passing phase, the hidden
representations of nodes, neighboring nodes, and their edges are aggregated
to represent the locality of a given node, called a message. Subsequently, the hidden features of these nodes are updated with
the message and other features in the following update phase. The
final step in a network is called a readout phase;
all of the features are combined to yield the predictions. In the
MPNN framework, both the message aggregation and the readout phase
use a sum operator to gather local attributes and all atomic attributes,
respectively. All MPNN models can be considered a type of spatial GCN because, in these models, each atom feature is updated with localized convolutional filters.

Since the late 2010s, several message-passing-based DNN models have been developed, such as SchNet,[12,20] GDML,[13] PhysNet,[15] MegNet,[21] and DimeNet.[22] SchNet
and the subsequent model architectures suggested a block structure, which consists of a set of specific message-passing functions.
They also adopted a continuous-filter convolution to learn intramolecular
interactions that take the form of a continuous distribution. These
models used atom types and atom–atom distances as inputs of
the model. Some studies used more sophisticated functions to represent angles in a rotationally equivariant approach.[23,24] Many of the MPNN-based models exhibited competitive performances in various QM research areas, even in molecular property prediction for nonequilibrium states.

However, most previous works did not consider the natural heterogeneity among chemical properties. Although subtle, the target properties originate from different chemical natures. For example, among the 12 targets in QM9, nine describe types of molecular energy, while the other three do not. Conventional DL optimization can select parameters that describe all of the targets reasonably well. However, it cannot determine which feature type is more important for predicting each target. In this case, the optimal parameters for a given model architecture may vary by target type. Almost all previous models have no choice but to optimize parameters under the assumption that all types of features, such as node features or edge features, contribute equally to predicting all targets. This may limit model generalizability and transferability because it ignores the diversity of the chemical factors that determine various properties.

In addition, GNNs have difficulty differentiating nodes, especially in deep architectures, a phenomenon known as the oversmoothing problem.[17,25] This problem emerges because
iterative aggregations of neighboring features are equivalent to a
repeated mixing of local features, eventually making all node representations
too similar and reducing the model performance. Therefore, stacking more layers onto a network is not a solution in the GNN (MPNN) case without careful consideration.

To overcome these limitations, we propose a novel GNN for molecular property prediction, namely, a dual-branched Legendre message-passing neural network (DL-MPNN), with simple and powerful scalability. We used two types of features: atom types and positions in three-dimensional space. From the atomic coordinates, we calculated the distance between two atoms and the angle between three atoms. In detail, we applied modified Legendre rational functions as radial basis functions for atom–atom distances. For the representation of angles between atoms, orthogonal Legendre polynomials were used. Finally, we attached trainable scalars to each type of feature for model flexibility in multitask QM learning. Our approach exhibited performance superior to other recent models on quantum-mechanical benchmark tasks. In addition, we validate the effectiveness of the proposed method via ablation studies.

To
the best of our knowledge, this study is the first attempt to automatically explore the relative importance of features for each target type without increasing the computational cost. The introduction of trainable scalars has several advantages over adding an extra layer. First, the overall computational cost is not significantly increased, because it involves only scalar multiplications. Second, it makes the model flexible with respect to heterogeneous feature types in multitarget tasks.

In addition, we introduce a novel MLP-based block type called a single-body block. These blocks are stacked to form a discrete branch of the model and do not communicate with the message-passing branch, as they train only atom embeddings. The outcome of this MLP-based pathway is aggregated with that of the MPNN at the final stage. Accordingly, we obtain an atom feature that is not overmixed by incoming messages during training.

In summary, our contributions are as follows:
(1) We propose a novel dual-branched network for molecular property prediction, which comprises a message-passing pathway for locally aggregated features and a fully connected pathway for discrete atoms. (2) We introduce trainable scalar-valued parameters that enable the model to enhance the more important feature signals for each label. (3) Our experimental results are better than those of most previous works on public benchmarks, with sparser representations. In addition, we verify that various quantum-chemical factors contribute differently according to the target molecular property.
Experimental Section
Our neural network is dual-branched, comprising two different types of networks. One is an MPNN, a type of spatial GCN, and the other is an MLP-based network. The details are described in the following sections.
Preliminaries
We briefly introduce the notation and the data used in this study. We describe a molecule as a set of atom points V = {v_i}, where v_i = (z_i, p_i). Z = {z_i} is the set of scalar-valued nuclear charges identifying each atom type, and P = {p_i}, where p_i = (p_i^x, p_i^y, p_i^z), is the set of xyz coordinates of the atom points in Euclidean space. We denote N(v_i) as the set of neighboring atoms of v_i.
Data Preprocessing
We used two types of information about the atoms in a molecule: the atom type z_i and the coordinates p_i of each atom. We calculated distances d_ij and angles α_ijk from the atom coordinates p_i, p_j, and p_k. Because the coordinates describe the whole molecular structure, the position of a single atom in isolation is meaningless. We initialized random trainable atom-type embeddings X from Z and trained them. All d_ij and α_ijk values were calculated from the molecules before model training. More details are described in Figure 1.
Figure 1
Data preprocessing. The atom types Z and atomic coordinates p_i were used in the model. We created trainable embeddings X from Z and calculated distances d_ij and angles α_ijk from the coordinates p_i, p_j, and p_k.
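To make this preprocessing concrete, the following is a minimal NumPy sketch of the step, assuming the 4.0 Å cutoff introduced later; the function and variable names are ours, not from the original implementation.

```python
import numpy as np

def preprocess(positions, cutoff=4.0):
    """Compute pairwise distances d_ij within a cutoff and the cosines of
    angles alpha_ijk between every two edges sharing atom j."""
    n = len(positions)
    diff = positions[None, :, :] - positions[:, None, :]  # (n, n, 3) vectors p_j - p_i
    dist = np.linalg.norm(diff, axis=-1)                  # (n, n) distance matrix
    # Edges: ordered pairs (i, j), i != j, closer than the cutoff.
    src, dst = np.where((dist < cutoff) & (dist > 0.0))
    d_ij = dist[src, dst]
    # Angles: for every pair of edges attached to the same atom j.
    cos_angles = []
    for j in range(n):
        nbrs = dst[src == j]
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                u = positions[nbrs[a]] - positions[j]
                v = positions[nbrs[b]] - positions[j]
                cos_angles.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return src, dst, d_ij, np.asarray(cos_angles)
```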
Input Embedding: Distance and Angle Representations
Distance Representation: Radial Basis Function
We used radial basis functions to expand a scalar-valued edge distance into a vector. In particular, we adopted sequences of Legendre rational polynomials, a family of continuous and bounded orthogonal polynomials. The Legendre rational function R_n(x) is obtained from the Legendre polynomial of the first kind Q_n(x) (defined in the following subsection) through the mapping x ↦ (x − 1)/(x + 1):

R_n(x) = (√2 / (x + 1)) Q_n((x − 1)/(x + 1))

where n denotes the degree. The nth-order Legendre rational functions are generated recursively as follows:

R_{n+1}(x) = ((2n + 1)/(n + 1)) ((x − 1)/(x + 1)) R_n(x) − (n/(n + 1)) R_{n−1}(x)

We selected the first n polynomials R_1(x), R_2(x), ..., R_n(x), such that a scalar-valued distance d_ij is embedded as an n-dimensional vector (R_1(d_ij), ..., R_n(d_ij)). Plots of the functions of degree 1, 2, ..., 12 are shown at the top right of Figure 2. We modified the equations to set the maximum value to 1.0, making the distribution of the bases similar to that of the initial atom embedding distribution.
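A short sketch of this distance embedding follows, using the standard (unmodified) Legendre rational functions and the three-term recurrence above; the paper's additional rescaling of the maximum value to 1.0 is omitted here.

```python
import numpy as np

def legendre_rational(x, n_max):
    """Evaluate Legendre rational functions R_1..R_{n_max} at x >= 0
    via the three-term recurrence inherited from Legendre polynomials."""
    x = np.asarray(x, dtype=np.float64)
    t = (x - 1.0) / (x + 1.0)           # maps [0, inf) into [-1, 1)
    r_prev = np.sqrt(2.0) / (x + 1.0)   # R_0
    r_curr = r_prev * t                 # R_1
    basis = [r_curr]
    for n in range(1, n_max):
        r_next = ((2 * n + 1) * t * r_curr - n * r_prev) / (n + 1)
        r_prev, r_curr = r_curr, r_next
        basis.append(r_curr)
    return np.stack(basis, axis=-1)     # shape (..., n_max): the distance embedding
```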
Figure 2
Basis functions for data preprocessing. For any pair of atoms located closer than a given cutoff, we created an edge representation between the two atoms, regardless of molecular bond information. A scalar-valued distance is expanded into an n-dimensional vector by radial basis functions; for these, we used Legendre rational polynomials. Angle representations are depicted for any two edges sharing one atom. A cosine-valued angle is represented by Legendre polynomials of degree 1, 2, ..., m (top right).
Recall that we did not use any
molecular bond information. Following previous studies,[12,15,20,22−24,27] we set a single scalar cutoff c and assumed, before training, that any atom pair v_i and v_j located closer than the cutoff can interact with each other, and that pairs farther apart cannot. In other words, we built edge representations between any two atoms close to each other. The cutoff value is analogous to the kernel height or width in a conventional convolutional layer because it determines the size of the receptive field of the local feature map.
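For larger systems, the same cutoff-based neighbor list can be built without the full O(n²) distance matrix; a sketch using SciPy's k-d tree (our choice of tool, not necessarily the paper's):

```python
import numpy as np
from scipy.spatial import cKDTree

def neighbor_pairs(positions, cutoff=4.0):
    """Return all unordered atom pairs closer than the cutoff, using a k-d tree
    instead of the dense O(n^2) distance matrix."""
    tree = cKDTree(positions)
    pairs = np.array(sorted(tree.query_pairs(r=cutoff)))  # (n_pairs, 2) indices
    return pairs
```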
Angle Representation: Cosine Basis Function
Similarly, we adopted another sequence of orthogonal polynomials, the Legendre polynomials of the first kind, to represent a scalar-valued angle α_ijk as a vector. Legendre polynomials of the first kind are orthogonal polynomials that exhibit several useful properties: they are the solutions of the Legendre differential equation,[37] and they are commonly used to describe various physical systems and dynamics. Their formula is given by the Rodrigues representation

Q_n(x) = (1/(2^n n!)) d^n/dx^n (x² − 1)^n

The polynomials of degree n are orthogonal to each other, such that

∫_{−1}^{1} Q_m(x) Q_n(x) dx = (2/(2n + 1)) δ_mn

We calculated the angle α_ijk between any pair of edges (i, j) and (j, k) sharing one node. We selected the first m polynomials Q_1(x), Q_2(x), ..., Q_m(x), such that a scalar-valued angle is embedded as an m-dimensional vector (Q_1(cos α_ijk), ..., Q_m(cos α_ijk)). The scheme is presented on the left side of Figure 2. We calculated the cosine value of each angle and embedded it with Legendre polynomials.
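A minimal sketch of this angle embedding using scipy.special.eval_legendre (the paper reports using SciPy for the Legendre polynomials; the function name and the default m = 12 follow the setting described later):

```python
import numpy as np
from scipy.special import eval_legendre

def angle_embedding(cos_alpha, m=12):
    """Expand cosine-valued angles into the first m Legendre polynomials
    of the first kind, Q_1(x)..Q_m(x), evaluated on [-1, 1]."""
    cos_alpha = np.asarray(cos_alpha)
    return np.stack([eval_legendre(n, cos_alpha) for n in range(1, m + 1)], axis=-1)
```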
Comparison of Two Different Types of Legendre Polynomials
We briefly compare the two different types of Legendre polynomials. First, the sequences of Legendre rational functions {R_n(x)} are adopted as the encoding method for distances and have different distributions over the distances. The standard deviations of the distance encodings are higher at shorter distances. Therefore, they can approximate interatomic potentials, in the sense that an atom pair exhibits a stronger interatomic relationship when the atoms are located close to each other. In other words, the sequences of Legendre rational functions {R_n(x)} give shorter distances richer representations.

Next, all Legendre polynomials of the first kind {Q_n(x)} are defined on the range [−1, 1]. These functions are symmetric (even functions) or antisymmetric (odd functions) about the zero point. Because the cosine function is an even function ranging from −1 to 1, {Q_n(x)} can cover the cosine of angles over 0–2π and is symmetric about π. In three-dimensional Euclidean space, the angle θ (0 ≤ θ ≤ π) between two nonparallel vectors on a plane is equivalent to its supplementary angle (π − θ). Therefore, the Legendre polynomials of the first kind {Q_n(x)} can represent the cosine of angles properly.
Model Architecture
The overall model architecture is described in Figure 3. There are two discrete pathways in the network: the MLP-based pathway comprising single body blocks, and the MPNN-based pathway comprising output blocks and interaction blocks. First, an atom type Z is embedded by the embedding block. The distances are calculated from the atomic coordinates p_i, and if a distance d_ij is less than the predefined cutoff value, then d_ij is represented as an edge attribute of the molecular graph. The scalar d_ij is expanded into a vector by the radial basis functions {R_n(x)}. The angles α_ijk are calculated from the edges, and each α_ijk is encoded as an m-dimensional vector before training via the Legendre polynomials of the first kind {Q_n(x)}. In this study, we set m = 12.
Figure 3
Model architecture. The network comprises two separate branches: MLP-based (red arrow) and MPNN-based (blue arrow) pathways. Atom types Z are embedded as trainable matrices. Distances d_ij and angles α_ijk are calculated from the atomic coordinates. Radial basis representations are calculated from d_ij, and angular representations from α_ijk. All blocks except the embedding block are stacked multiple times (without sharing weights). In this model, we stacked six single body blocks, seven output blocks, and six interaction blocks. For simplicity, skip connections are not shown.
In the MPNN, the message-passing and readout steps
of our model can be summarized as

m_i^{t+1} = Σ_{j ∈ N(i)} f_M(h_i^t, h_j^t, e_ij, α_ijk)
h_i^{t+1} = f_U(h_i^t, m_i^{t+1})
ŷ = f_R({h_i^T})

where m_i, h_i, e_ij, and α_ijk denote the message of atom v_i aggregated over its neighbors, the single-atom representation of atom v_i, the interaction between two neighboring atoms (v_i, v_j), and the angle information among three atoms (v_i, v_j, v_k), respectively. t is the time step or layer order in the model, and ŷ represents the predicted target value. Both f_M and f_U represent graph convolutions, and f_R is the sum operation.

In the
proposed model, there are four types of blocks: the embedding
block, output block, single body block, and interaction block. All
blocks except the embedding block are sequentially stacked multiple
times. Detailed explanations are presented in the next section.
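For clarity, a schematic TensorFlow sketch of one message-passing phase and the sum readout follows; f_msg and f_upd stand for the learned graph-convolution functions f_M and f_U above, and the sketch abstracts away the paper's specific block structure.

```python
import tensorflow as tf

def message_passing_step(h, edges, e_feat, f_msg, f_upd):
    """One message-passing phase: aggregate neighbor/edge features into a
    message m_i, then update each node state h_i (schematic only)."""
    src, dst = edges[:, 0], edges[:, 1]
    # Per-edge message from the two endpoint states and the edge feature.
    msg = f_msg(tf.concat([tf.gather(h, src), tf.gather(h, dst), e_feat], axis=-1))
    # Sum-aggregate incoming messages at each receiving node.
    m = tf.math.unsorted_segment_sum(msg, dst, num_segments=tf.shape(h)[0])
    return f_upd(tf.concat([h, m], axis=-1))  # update phase

def readout(h):
    """Readout phase: sum all atom states into a molecule-level representation."""
    return tf.reduce_sum(h, axis=0)
```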
Block Details
We adopted four types of blocks in the model: the embedding block E, the output blocks O_t, the single body blocks S_t, and the interaction blocks I_t, where t = 0, ..., (T − 1) indexes the multiple sequential blocks. The atom types Z enter the embedding block E at the first step, and the embedded X moves to the other three block types at time step t = 0. Distance embeddings d_ij are used in all blocks E, O_t, and I_t, except the single body blocks S_t. Angle embeddings α_ijk are used solely in I_t. All blocks are sequentially stacked over time steps t = 0, ..., (T − 1), including some skip connections. The number of time steps T may differ between block types. Detailed structures are shown in Figure 4.
Figure 4
Block structures. (a)
Embedding block (top left). (b) Interaction
block (top right). (c) Single body block (bottom left). (d) Output
block (bottom right). We denote the dense layer with the activation
as σ(Wx + b). We also denote
the trainable scalars as γ. All blue boxed items, including
dense layers and features, are γ-trainable.
Subscripts are omitted for simplicity. (e) Weighted aggregation formula
of the two final outputs from the single body block and output block pathways. c_S and c_P denote the trainable scalar-valued coefficients.
Embedding Block
In the embedding block E, the categorical atom types Z are represented as float-valued trainable features. z_i, z_j, and d_ij from each atom pair are used to make the atom embeddings X.
Output Block
In the output block O_t, the output of I_t is trained together with the radial basis representation of d_ij, except in the first block, which takes the embedded input X from E. ⊕ and ⊗ denote the direct sum and direct product, respectively. All blue-boxed objects are γ-trainable. These blocks train the two-body representation.
Single Body Block
Message-passing functions mix all representations across a molecule. We assumed that this is not always beneficial for predicting molecular properties. Therefore, we introduced a path separated from the message-passing pathway. In the single body block S_t, the output X from E enters and is trained through multiple MLP layers. The last layer is γ-trainable. Neither edge nor angle representations are used. These blocks handle per-atom representations.
Interaction Block
In the interaction block I_t, the output of O_t is combined with the radial basis representation of d_ij and the angular representation from {Q_n(x)}. The γ-trainable blue boxes and the ⊕ and ⊗ operators are also used in these blocks. These blocks train the three-body interaction.
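The final aggregation of the two pathways (Figure 4e) can be sketched as a layer with the two trainable scalar coefficients c_S and c_P; the initializer choice here is our assumption.

```python
import tensorflow as tf

class WeightedAggregation(tf.keras.layers.Layer):
    """Combine the single body block output and the output block output with
    trainable scalar coefficients c_S and c_P, as in Figure 4e."""
    def build(self, _):
        self.c_s = self.add_weight(name="c_s", shape=(), initializer="ones")
        self.c_p = self.add_weight(name="c_p", shape=(), initializer="ones")

    def call(self, out_single, out_output):
        return self.c_s * out_single + self.c_p * out_output
```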
Trainable Scalars: γ
In molecular property prediction tasks, various factors determine multiple properties, with different contribution weights for each target. To tackle this issue, we did not add any extra layer or attention mechanism, which could expose the model to the oversmoothing problem and be cost-intensive.

Instead, we propose a simpler solution: we introduced trainable scalars γ in each layer. This makes the contributions of heterogeneous factors flexible according to each target. In other words, the importance of heterogeneous features, such as atom types and atom–atom distances, can vary over targets, and our model can therefore focus on the features most important to the target. The mathematical formulation is described below.

We introduced γ into some of the layers in the output blocks, single body blocks, and interaction blocks. All γ values are trained independently of each other (subscripts are omitted for simplicity). We initialized the γ values from the exponential of a random normal distribution (μ = 0.0, σ = 0.1), namely, γ = exp(ε) with ε ~ N(0.0, 0.1). This was because we observed that this slightly right-skewed distribution performed better than a normal distribution with μ = 1.0, without skewness. The distribution is presented in Figure 5.
Figure 5
Gaussian distribution and the initialization of the γ distribution. Both distributions attain their maximum values at x = 1.0.
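A sketch of this initialization and of a γ-scaled dense layer follows; the activation choice and class names are our assumptions, not the paper's exact implementation.

```python
import numpy as np
import tensorflow as tf

def gamma_initializer(shape=(), mu=0.0, sigma=0.1):
    """Initialize gamma = exp(N(mu, sigma)): a slightly right-skewed
    distribution peaking near 1.0, as described above."""
    return tf.constant(np.exp(np.random.normal(mu, sigma, size=shape)),
                       dtype=tf.float32)

class GammaDense(tf.keras.layers.Layer):
    """A dense layer whose output is rescaled by a trainable scalar gamma."""
    def __init__(self, units, activation="swish"):  # activation is our assumption
        super().__init__()
        self.dense = tf.keras.layers.Dense(units, activation=activation)

    def build(self, _):
        self.gamma = self.add_weight(
            name="gamma", shape=(),
            initializer=lambda shape, dtype=None: gamma_initializer(shape))

    def call(self, x):
        return self.gamma * self.dense(x)
```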
Related Works
Starting from DTNN,[26] several GNNs have adopted the perspective of a molecule as a graph. Under this perspective, a molecular system is a combination of atoms and many-body interatomic interactions, which correspond to graph nodes and edges. MPNN[19] introduced the message concept, an aggregation of the incident edges and corresponding neighbor nodes. SchNet[12] is a multiblock network based on the message-passing framework; its gradients flow atomwise, and the convolution over atom–atom interaction features occurs in the interaction block. Subsequently, PhysNet,[15] MegNet,[21] and MGCN[27] extended these works and improved performance based on the message-passing multiblock framework.

More recent works introduced angle information to describe the geometry
of a molecule. With angle information, we can inspect up to three-body
interactions. Spherical harmonics were used to represent an original
angle to be rotationally equivariant.[28,29] Several studies[14,23,30,31] introduced a Clebsch–Gordan decomposition to represent an angle as a linear combination of irreducible representations. Theoretically, this can be expanded to arbitrary n-body networks,[14,23,31] but most networks do not explore more than three-body interactions, owing to limited computational resources. In summary, most recent GNNs for molecular learning use an atom type (one-body), the distance between an atom pair (two-body), and the angle between three atoms (three-body). As aforementioned, we used the same input features as these previous GNNs, for a fair evaluation.
Training and Evaluation
Data Set
We used the QM9[38] and molecular dynamics (MD) simulation data sets[13,39] for the experiments. QM9, created in 2014, is the most popular quantum-mechanics database; it comprises about 134,000 small molecules made up of five elements: carbon, oxygen, hydrogen, nitrogen, and fluorine. Each molecule has 12 numerical properties, so the benchmark consists of 12 regression tasks.

The MD simulation data set was created for the task of predicting energies from molecular conformational geometries. The simulation data are given as trajectories, and the subtasks are divided by molecule type. The energy values are given as scalars (kcal/mol), and the forces are given as three-dimensional vectors (in xyz format) for each atom. The energies can be predicted from the molecular geometry alone or by using the additional forces. We used the most recent subsets,[39] whose properties were created from CCSD(T) calculations, a more reliable method than conventional DFT.
Implementation
The TensorFlow package, version 2.2, was used to implement the proposed framework. To calculate the Legendre polynomials, we utilized the SciPy package, version 1.5.2.

For the QM9 data set, we trained the model for at least 300 epochs per target and terminated training when the validation mean absolute error (MAE) did not decrease for 30 epochs; the total number of training epochs therefore differs slightly between labels. We randomly split the data into training, validation, and test sets of 80%, 10%, and 10% of the whole data set, respectively, in accordance with the guideline.[40] The initial learning rate and its decay rate were set to 10^−3 and 0.96 per 20 epochs, respectively. Adam[41] was used as the optimizer, and the batch size was set to 24.

For the MD simulation data set, most of the training configurations were the same as in the QM9 experiment. However, we modified the loss function, because the energies were trained using both energies and forces, unlike QM9. We obtained the predicted forces from the gradients with respect to the atom positions, following previous works.[15,22] We also followed the original guideline[39] for data splitting: we used 1000 samples to train each subtask and 500 samples for testing (1000 for ethanol). We set the decay rate to 0.9, which decreases the learning rate faster than in the QM9 training.
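A sketch of such a joint energy–force loss with TensorFlow's GradientTape follows; the model signature and the relative weighting rho are assumptions, since the paper does not spell them out here.

```python
import tensorflow as tf

def energy_force_loss(model, z, pos, e_true, f_true, rho=1.0):
    """Joint MD loss: predicted forces are the negative gradients of the
    predicted energy with respect to the atom positions. The weighting rho
    between the energy and force terms is a hypothetical hyperparameter."""
    with tf.GradientTape() as tape:
        tape.watch(pos)
        e_pred = model(z, pos)            # hypothetical model signature
    f_pred = -tape.gradient(e_pred, pos)  # forces from position gradients
    return tf.reduce_mean(tf.abs(e_true - e_pred)) + \
           rho * tf.reduce_mean(tf.abs(f_true - f_pred))
```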
Evaluation
First, we evaluated the model performance with MAE, the standard metric for QM9.[40] We compared the performance of our model with those of the previous models SchNet,[12] Cormorant,[23] LieConv,[42] DimeNet,[22] and DimeNet++.[32] We also analyzed the proposed methods in terms of the effect of the γ values and of the single body block for each target. To assess the single body block, we evaluated the performance of the model without single body blocks (i.e., without the dual-branched architecture).

In particular, we observed that the single body block contributed to the model performance differently for each target type. We inferred that these differences can be explained by the different natures of the properties. We discuss the relationship between some of the results and their chemical interpretations in the next section.

We compared the change over epochs in the ratio of the sum of the γ values from the last single body block to that from the output blocks throughout training. We extracted the γ values and calculated the ratio ∑(γ values from the single body block)/∑(γ values from the output block) every 30 epochs, up to 200 epochs. After this point, the ratios varied negligibly, so we do not depict them beyond that time. Note that we did not compare the γ values directly among different layers or targets: the magnitudes of these values are determined by several intertwined sources, including the original feature scale, layer weights, layer biases, activation functions, and uncontrollable random noise. Therefore, the γ values themselves cannot be evaluated directly.
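A sketch of this γ-ratio bookkeeping follows; the weight-name prefixes are hypothetical and would depend on how the blocks are named in the actual implementation.

```python
def gamma_ratio(model):
    """Ratio of summed gamma values in the last single body block to those in
    the last output block (assumes weights carry these hypothetical name prefixes)."""
    s = sum(float(w.numpy()) for w in model.trainable_weights
            if "single_body_last/gamma" in w.name)
    p = sum(float(w.numpy()) for w in model.trainable_weights
            if "output_last/gamma" in w.name)
    return s / p
```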
Results and Discussion
Model Performance
The Comparative Results
We compared the performance of our model with that of other MPNN-based models that also use atom types and positions as inputs.[12,21−23,32] The MAEs for QM9 and the MD simulations are reported in Tables 1 and 2, respectively. For QM9, our model achieved the best performance among the compared models on six of the 12 targets. For the MD simulations, our model exhibited the best performance on four of the five targets.
Table 1. Mean Absolute Error on QM9 Compared with Previous Works^a

target | unit | SchNet | Cormorant | LieConv | DimeNet | DimeNet++ | DL-MPNN | MP only
μ | D | 0.033 | 0.038 | 0.032 | 0.0286 | 0.0297 | 0.0256 | 0.0238
α | bohr³ | 0.235 | 0.085 | 0.084 | 0.0469 | 0.0435 | 0.0444 | 0.0457
ϵHOMO | eV | 0.041 | 0.034 | 0.030 | 0.0278 | 0.0246 | 0.0223 | 0.0238
ϵLUMO | eV | 0.034 | 0.038 | 0.025 | 0.0197 | 0.0195 | 0.0169 | 0.0163
Δϵ | eV | 0.063 | 0.061 | 0.049 | 0.0348 | 0.0326 | 0.0391 | 0.0403
⟨R²⟩ | bohr² | 0.073 | 0.961 | 0.800 | 0.331 | 0.331 | 0.414 | 0.385
ZPVE | meV | 1.7 | 2.027 | 2.280 | 1.29 | 1.2 | 1.2 | 1.2
U0 | eV | 0.014 | 0.022 | 0.019 | 0.00802 | 0.0063 | 0.0074 | 0.0084
U | eV | 0.019 | 0.021 | 0.019 | 0.00789 | 0.0063 | 0.0074 | 0.0085
H | eV | 0.014 | 0.021 | 0.024 | 0.00811 | 0.0065 | 0.0076 | 0.0092
G | eV | 0.014 | 0.020 | 0.022 | 0.00898 | 0.0076 | 0.0076 | 0.0083
Cv | cal/(mol·K) | 0.033 | 0.026 | 0.038 | 0.0249 | 0.0249 | 0.0234 | 0.0235

^a DL-MPNN and MP only denote our dual-branched model and the model without the MLP pathway, respectively.
Table 2. Mean Absolute Error on MD Simulation Compared with Previous Works^a

target | train method | sGDML | SchNet | DimeNet | DL-MPNN
aspirin | forces | 0.68 | 1.35 | 0.499 | 0.590
benzene | forces | 0.06 | 0.31 | 0.187 | 0.053
ethanol | forces | 0.33 | 0.39 | 0.230 | 0.10
malonaldehyde | forces | 0.41 | 0.66 | 0.383 | 0.225
toluene | forces | 0.14 | 0.57 | 0.216 | 0.200

^a DL-MPNN denotes our dual-branched model.

In general, our model
exhibited better performance on the QM9 targets that are more related to molecular interactions, such as the dipole moment (μ), the molecular orbital energies (ϵHOMO and ϵLUMO), and the Gibbs free energy (G). These properties are relevant to how molecules react to external effects. However, predictions for other targets, such as the electronic spatial extent (⟨R²⟩) and the internal energies (U0, U, H), were not superior to those of other models.

We found that this may be related to the distribution of each target value. We observed that prediction performance is better when the mean of the target value is closer to zero and its standard deviation is smaller. For example, the 95% confidence intervals (CIs) of the distributions of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) energies are −6.5 ± 1.2 and 0.3 ± 1.3, respectively. For the internal energies U, U0, and H, the 95% CIs are −76.5 ± 10.4, −76.1 ± 10.3, and −77.0 ± 10.5, respectively. This issue, triggered by the diverse target distributions of QM9, has already been reported in the literature.[24]
The Effect of Basis Functions
We briefly discuss the effect of the Legendre polynomials as radial basis functions in our model. The results are shown in Table 3. We compared our radial basis functions with the Bessel basis functions of DimeNet[22] and DimeNet++[32] on QM9, analyzing six of the 12 targets: μ, ϵHOMO, ϵLUMO, Δϵ, ⟨R²⟩, and U. To summarize the comparison, only μ showed a clearly improved validation performance (approximately 30% lower MAE), while the other five targets showed slightly increased MAE values.
Table 3. Relative Validation MAE of Our Model Using Legendre Polynomials as Radial Basis Functions, Compared with the Same Model with the Radial Basis Functions Replaced by Bessel Basis Functions (Used in DimeNet and DimeNet++[22,32])^a

target | (MAE using Legendre polynomials)/(MAE using Bessel basis functions)
μ | 0.77
ϵHOMO | 1.02
ϵLUMO | 1.03
Δϵ | 1.01
⟨R²⟩ | 1.11
U | 1.03

^a A value less than 1.0 means that the Legendre polynomials performed better as radial basis functions.

This inconsistency
over targets could not be clearly explained. Therefore, we conclude that the Legendre polynomials in our model are no worse than Bessel basis functions. Note that our cutoff value is smaller than those of all previous works while maintaining model performance; this is discussed in the next section.
The Effect of Trainable Scalars
The Role of γ in the Model Performance
We trained our model with and without trainable γ values for all 12 labels in the QM9 data set. We found that the validation performance was significantly improved (by over 10%) for μ, ϵLUMO, and ⟨R²⟩ when γ values were trained. Overall, the validation MAE values with trained γ were lower than those without for nine of the QM9 targets. For U0 and U, the MAEs without training γ were 3% lower than those with training γ. The results are shown in Figure 6.
Figure 6
Ratio of the validation MAE without training γ to that with training γ. A ratio greater than 1.0 means that performance with trained γ is superior to that without. Nine of the 12 targets showed a lower validation MAE with trained γ coefficients.
Adaptive Dependency on Pathways by Targets
We analyzed the ratio of the sum of the trained coefficients from the MLP pathway (the last single body block) to that from the MPNN pathway (the last output block) for each target. We plotted the change in this ratio over training steps in Figure 7. If the ratio increases during training, the MLP pathway is more critical than the MPNN pathway for predicting the target, and vice versa. Note that the weights of the MLP pathway are trained only on the atom types Z.
Figure 7
Ratio of the sum of the trained coefficients from the last single body block to that from the last output block. If the ratio goes to zero, the model can be interpreted as depending more on features from atom–atom relationships than on individual atom information. The ratios for U, U0, and H, which are closely related through the internal energy term, showed similar patterns.
We found that, in training the model on μ, the single
body blocks were less important. During training on the target μ, the trainable scalars in S_t became close to zero, whereas this was not the case for the other targets. We found that this exceptional result can be explained by the construction of μ, which is the sum of the dipole moments over all atom pairs in a molecule. We repeated the experiment for μ three times and observed the same pattern every time (results not shown).

The formulation of the dipole moment is given by[33]

μ = Σ_i q_i R_i

where q_i and R_i are the charges and position vectors,
respectively. A dipole moment
occurs between two polarized atoms, especially in a covalent bond
or an ionic bond.[34,35] By definition, dipole moments
are represented by the charge separation between atom pairs (dipoles)
in a molecule. The atom-only representations did not play a key role
in the case of predicting μ in our experiments, which is consistent
with the formulation. The overall performances also support these observations (Table 1). For μ, the performance was better when the model used the MPNN pathway only (MAE = 0.0238) than when the full model was used (MAE = 0.0256), whereas the full model outperformed the MPNN-only model on most other targets. This also confirms that the atom-only representations from the MLP pathway do not help in predicting the μ of a given molecule.

In addition, the curves in Figure 7 for U0, U, and H are similar to one another. U0 and U are the internal energies at 0 and 298.15 K, respectively, and H is the enthalpy at 298.15 K, which contains the internal energy terms; these targets are closely related by construction. Considering the experimental results, we conclude that the model trains differently for each target and that the pattern of the trained results is consistent with the underlying chemistry.
Density of a Molecule Representation
As mentioned before, the cutoff value c determines the node connections in a molecular graph. With a high cutoff value, the molecular graph representation is dense; with a low cutoff, it is sparse.

Dense graphs are generally better able to capture the connectivity information of a graph than sparse graphs. However, dense structures carry a higher risk of gathering excessive features even when little of the information is necessary. Moreover, in an MPNN, the messages of every node from its neighboring edges are mixed repeatedly throughout the network. If a graph is excessively dense, most messages become indistinguishable, because every atom refers to all of its neighbors at every step. This may increase the risk of the oversmoothing problem,[25] which is prone to occur in deep GCNs.

It is challenging to determine
the most appropriate cutoff value for a graph data set; however, we found a clue in the molecular conformations. We observed that the average distance between atom pairs in QM9 molecules is 3.27 Å. Atom pairs within 4.0 Å, the cutoff value used in this work, account for 72% of all pairs in the QM9 data set. The previous models SchNet,[12] MegNet,[21] and DimeNet[22]/DimeNet++[32] adopted cutoff values of 10.0, 5.0, and 5.0 Å, respectively; with cutoffs of 10.0 and 5.0 Å, 99.99% and 89% of the distances, respectively, are represented in the molecular graphs.

Note that the angle representation is defined by two different
the angle representation is defined with two different
edges sharing one atom. This indicates that, for the angle representation,
the required number of computations increases linearly with the squared
number of edges. If the number of nodes of a graph are |VG|, then the upper bound of the number of representations
including all nodes, edges, and angles of the graph is given as even if
the message directions were not
considered. We compared graph density values according to different
cutoff values. If the edges are within a cutoff value of 4.0, then
the number of edge representations are reduced to half ((72%)2 = 52.8%) of the overall possible number of edges . If the
cutoff value is 5.0, then the proportion
of the number of the edge representations is increased to 80% ((89%)2 = 79.2%).We demonstrate that the cutoff value of 4.0
Å is sufficient to describe a molecular graph in QM9. As mentioned earlier, the molecules in QM9 comprise five elements, namely, C, H, O, N, and F. The lengths of the covalent bonds between any two of these atoms are always less than 2.0 Å.[36] Because the maximum length of any two successive covalent bonds is less than 4.0 Å, the model can capture any two-hop neighboring atoms simultaneously with a cutoff value of 4.0 Å. This property makes it possible to identify all angles between two covalent bonds; a detailed illustration is presented in Figure 8. From these observations, we argue that 4.0 Å is the best choice of cutoff value for efficient training and for capturing the chemical nature. The same holds for the MD simulation data set, in which all of the atoms are carbon, hydrogen, and oxygen (C, H, and O). Finally, we argue that 4.0 Å is also an adequate value for real-world organic molecules, because most molecular bond lengths are shorter than 2.0 Å.[36]
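The pair-coverage figures above (72% at 4.0 Å, 89% at 5.0 Å) can be reproduced with a short script of this form (a sketch; molecules stands for a list of per-molecule coordinate arrays):

```python
import numpy as np

def cutoff_coverage(molecules, cutoff=4.0):
    """Fraction of all atom pairs, over a list of coordinate arrays, whose
    distance falls below the cutoff, i.e., the share of pairs kept as edges."""
    within = total = 0
    for pos in molecules:
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        iu = np.triu_indices(len(pos), k=1)        # unique (i, j) pairs, i < j
        within += int(np.count_nonzero(d[iu] < cutoff))
        total += len(iu[0])
    return within / total
```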
Figure 8
Example of bond and angle representation. All covalent bond lengths between two atoms among C, H, O, N, and F are less than 2.0 Å, so any two-hop neighbor v_k reached from an atom v_i via covalent bonds is always located inside the cutoff of 4.0 Å. Therefore, an atom v_i can always capture two successive covalent bonds (v_i, v_j) and (v_j, v_k).
Conclusion
In this study, we developed a novel dual-branched network comprising a message-passing-based pathway for atom–atom interactions and fully connected layers for single-atom representations. We represented a molecule as a graph in three-dimensional space with atom types and atom positions. We embedded each scalar-valued atom–atom distance as a vector using Legendre rational functions and, similarly, embedded each scalar-valued angle as a vector using orthogonal Legendre polynomials. Both function families are complete orthogonal series and incur low computational costs thanks to their recursive relations. Our model exhibited remarkable performance on two quantum-chemical data sets. In addition, we proposed trainable scalar values, such that the model can attend to the more significant features according to the various natures of the targets during training. By analyzing the trained scalars, we also showed that the model offers a useful degree of interpretability: the trained scalar values can reflect the chemical nature of the target. Furthermore, we adopted a smaller cutoff value than those used in previous MPNN models and showed that, with this value, we can save computational resources without loss of performance, while arguing that this cutoff is sufficiently long to identify the local structure of a molecule. Although we demonstrated both model performance and a degree of interpretability, our model was applied only to the restricted domain of small molecules. In future work, we will enhance the model's scalability to broader applications, including biomolecules, drugs, and crystals; conduct further analysis of the model's predictability with respect to target distributions; and extend the approach to other molecular property predictions and to more complex molecules in various fields.