Hideo Doi1, Kazuaki Z Takahashi1, Takeshi Aoyagi1. 1. Research Center for Computational Design of Advanced Functional Materials, National Institute of Advanced Industrial Science and Technology Tsukuba Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan.
Abstract
A combination of atomic numbers and bond-orientational order parameters is considered a candidate for a simple representation that involves information on both the atomic species and their positional relation. The 504 candidates are applied as the fingerprint of the molecules stored in QM9, a data set of computed geometric, energetic, electronic, and thermodynamic properties for 133 885 stable small organic molecules made up of carbon, hydrogen, oxygen, nitrogen, and fluorine atoms. To screen the fingerprints, a regression analysis of the atomic charges given by Open Babel was performed by supervised machine learning. The regression results indicate that the 60 fingerprints successfully estimate Open Babel charges. The results of the dipole moments, an example of a property expressed by charge and position, also had a high accuracy in comparison with the values computed from Open Babel charges. Therefore, the screened 60 fingerprints have the potential to precisely describe the chemical and structural information on the atomic environment of molecules.
A combination of atomic numbers and bond-orientational order parameters is considered a candidate for a simple representation that involves information on both the atomic species and their positional relation. The 504 candidates are applied as the fingerprint of the molecules stored in QM9, a data set of computed geometric, energetic, electronic, and thermodynamic properties for 133 885 stable small organic molecules made up of carbon, hydrogen, oxygen, nitrogen, and fluorine atoms. To screen the fingerprints, a regression analysis of the atomic charges given by Open Babel was performed by supervised machine learning. The regression results indicate that the 60 fingerprints successfully estimate Open Babel charges. The results of the dipole moments, an example of a property expressed by charge and position, also had a high accuracy in comparison with the values computed from Open Babel charges. Therefore, the screened 60 fingerprints have the potential to precisely describe the chemical and structural information on the atomic environment of molecules.
A representation of atomic environments
has become more needed
than in the past decade because of the rapid development of material
informatics and related technology.[1−13] Chemical representations without real-space information are mainstream
because they are compact, lightweight, and usually adopted for molecular
informatics. Indeed, some simple fingerprints, such as the simplified
molecular-input line-entry system (SMILES)[14] and SYBYL line notation (SLN),[15] are
useful, for instance, for similarity measurements between molecules
as the simplest example. Structure representations have been developed
separately from the chemical representations, e.g., Voronoi polyhedron
of a central atom, angular Fourier series,[1] partial and generalized radial distribution functions, and the bond-orientational
order parameter.[16] These were originally
not intended for machine learning (ML) applications but are promising.
However, separating the chemistry and structure of molecules is open
to debate. For example, the Hohenberg–Kohn theorem,[17−19] which states that all physical quantities of a system can in principle
be calculated from the electron density of the ground state, implies
that there is an inseparable relationship between information about
the electronic properties of atoms and the direct function of the
geometric location of nuclei. Therefore, the species of atoms and
their positional relations can hardly be separated to provide information
on the atomic environment. This fact means that there is a need to
consider a more proper representation of both chemistry and structure.
One of the simplest ideas for such a representation is a combination
of the existing chemical and structural representations. For example,
a Coulomb matrix[20] is a generalization
of an adjacency matrix representation and has been extended as the
Ewald sum and the Sine matrix.[5] More advanced
methods include histograms of distances (HDs), HD angles (HDAs), HDA
dihedrals,[21] bag of bonds,[22] and the smooth overlap of atomic positions.[23]The bond-orientational order parameter
that uses spherical harmonics
exhibits a highly generic performance for describing molecular local
structures. It was first developed by Steinhardt and co-workers to
investigate the structures of supercooled liquids and metallic glasses.[16] Lechner and co-workers improved the bond-orientational
order parameter by locally averaging neighborhood molecules.[24] Their parameters successfully distinguished
body-centered cubic (bcc), face-centered cubic (fcc), hexagonal close-packed
(hcp), and liquid-like local structures and have been widely used
to classify complex local structures.[25−32] Further modifications have been attempted to extend the entire accuracy
or for a specific use.[33,34] The capability of their parameters
has been reported for the identification of the crystal-like structures
of a Lennard–Jones fluid,[35−37] water,[38−45] polyethylene,[46] and a liquid crystal
and its polymer.[33,34] Such a high sensitivity for local
structures is desirable for the precise expression of the structural
contributions of the atomic environment.In this work, we consider
a combination of atomic numbers and bond-orientational
order parameters as a candidate for the simple representation that
involves information on both the atomic species and their positional
relationship. This is applied as the fingerprint of the molecules
stored in the QM9 data set,[47,48] which contains geometric,
energetic, electronic, and thermodynamic properties for 133 885
stable small organic molecules made up of carbon, hydrogen, oxygen,
nitrogen, and fluorine. A total of 504 fingerprints were systematically
designed and then screened by the supervised ML for the regression
analysis for the atomic charges of molecules. The results indicate
that the 60 fingerprints successfully estimate the atomic charges.
These fingerprints were used to compute the dipole moments of molecules
as an example of a property expressed by both charge and position,
and a high accuracy was obtained compared to the values computed from
atomic charges. Therefore, the 60 screened fingerprints have the potential
to precisely describe the chemical and structural information about
the atomic environments of molecules.
Methodology
Real-Space
Fingerprint Using a Generic Local-Order Parameter
Here we
introduce the real-space fingerprint of the atomic environment
considered in this work. First, the chemical information to be embedded
is defined as the implant function F(a, r, rc), where a is the atomic number of atom i, r is the
interatomic distance between atoms i and j, and rc is the cutoff radius
for judging whether particle j is a neighbor of particle i. Here we use four cutoff radii of 1.50, 1.75, 2.00, and
2.50 Å to accommodate various molecular geometries. We emphasize
that F(a, r, rc) is basically a simple combination of atomic numbers
and distances among neighboring atoms. In this work, a total of 18
species of F(a, r, rc) were calculated for each rc, as shown in Table , where b(rc, i) is an array
that involves identifiers of neighboring atoms of atom i and N is the number of neighboring atoms contained
in b(rc, i). F was designed based
on the following rules: (i) it should be a function of atomic number,
(ii) the interaction distance to be emphasized should change depending
on the weighting, and (iii) differences in the weighting with and
without normalization should be considered. Rule (i) is essential
as a fingerprint for the constituent elements of a molecule, but 0,
1, and 1.5 powers were considered in order to have variations in the
effect of the atomic number. Rule (ii) is a concept similar to that
of the radial basis function G2 in Parrinello–Behler
type descriptors,[1,49] where the weights are maximized
for interaction distances of 0.0, 1.5, 2.0, and 2.5 Å. Rule (iii)
was considered because the form of the function is clearly different
depending on the presence or absence of normalization. From the above
rules (i–iii), the number of implant functions became 18. Note
that other expressions of F are possible for the
chemical and physical properties as long as these are defined as a
single scalar function. Second, the implant function was embedded
in the bond-orientational order parameter developed by Steinhardt
and co-workers[16] aswhere l is an arbitrary positive
integer denoting the degree of the harmonic function, m is an integer that runs from – l to + l, Y is a
spherical harmonics function, and is a vector from particle j to particle i. In this work, we set l values as 4, 6, 8, 12, 15, 18, and 20. Therefore, the total number
of representations for the atomic environment for atom i becomes 504 (equal to the number of l × the
number of rc × the number of F). The atomic number and all the representations for atom i are stored in the atomic environment vector for atom i, , which has 505 (504 + 1) elements. Finally, the atomic environment
vectors for all atoms are merged to the descriptor array D. In this work, the number of elements of D was
505 × 133 885. This was almost the same scale of data
as that for the data set of a conventional representation for the
atomic environment.[13] However, in order
to attain an effective atomic environment it is desirable to have
fewer elements. Therefore, we attempted to screen the fingerprints
from 504 to a smaller number by means of supervised ML for a regression
analysis of the atomic charges of molecules.
Table 1
Description
of the Implant Function F(a, r, rc)
name of function
formula
F1
1
F2
1/N
F3
rij
F4
F5
aj
F6
F7
ajrij–1
F8
F9
aj1.5rij–1
F10
F11
aj1.5rij
F12
F13
aj exp[−4(rij – 1.5)2]
F14
F15
aj exp[−4(rij – 2.0)2]
F16
F17
aj exp[−4(rij – 2.5)2]
F18
Machine Learning
To examine the capability of each
fingerprint and to screen the fingerprints, a regression analysis
of the atomic charges was performed in supervised ML. The molecular
geometry and atomic charge for each molecule stored in QM9 were preliminarily
optimized to the generalized amber force field (GAFF)[50] by Open Babel.[51]Figure shows the actual ML flow used
in this work. First, b(rc, i) was determined for each
atom. Second, Q(rc, F, i) values
were computed for each atom and were stored in . The atomic number a was also stored in . The values for all atoms were merged
with D. Third, the atomic charges for all atoms were
merged with the response variable vector . Note that the response variable became the supervisor of the ML.
Then, the operator vector , which satisfies
the relation D = , was estimated through ML. The term was estimated using the random forest method[52] implemented on Scikit-learn.[53] The random forest method has some useful characteristics
as a learning algorithm of ML: (i) the learning routine is simple
and thus has high performance of computing, (ii) the method prevents
an overlearning, (iii) little or no data cleansing is needed, and
(iv) the significance of data descriptors can be easily quantified.
Note that overlearning is the situation where the learning results
only fit to data used in the learning and do not fit to any new data. was checked through a k-fold cross-validation implemented on Scikit-learn, where k denotes the number of cross-validations. Note that k-fold cross-validation is the method for checking overlearning.
The fivefold cross-validation was done considering the quality and
quantity of our data in this work. Namely, 1/5 of 2 407 756
local coordinates (∼481 551) were used for each of a
total of five cross-validations. Finally, the fingerprints were screened
based on the function of the random forest method, which quantified
the importance of the data descriptors.
Figure 1
Machine learning flow
for the regression analysis that predicts
the atomic charges of molecules stored in QM9.
Machine learning flow
for the regression analysis that predicts
the atomic charges of molecules stored in QM9.
Results and discussion
The time consumption to compute our
fingerprint for all the molecules
included in QM9 was approximately 72 h using an AMD Ryzen Threadripper
1950X CPU. Note that the fingerprints were eternally usable if these
were prepared for ML once. Table shows fingerprints in order of importance based on
the random forest method. There is a bias in the importance of the
fingerprints, which means that the top ranking fingerprints represent
the typical chemical environments. Importantly, the development of
highly sensitive fingerprints that contain such biases can be left
to machine learning. Our proposed screening makes this possible, allowing
us to select a relatively small set of candidates for the best fingerprint
at any given time depending on the constantly changing (and typically
increasing) amount of data in the database. The fingerprints ranked
from first to 60th account for 97.3% of the importance. Figure shows the importance of fingerprints
from the first rank to the 60th rank. After the 55th rank, the importance
level is generally saturated at about 0.0006. Therefore, in the following
sections, only fingerprints from the first rank to the 60th rank were
used.
Table 2
Fingerprints in Order of Importance
Based on the Random Forest Method
rank
l
rc (Å)
F
importance
1
12
1.75
F13
0.688213177
2
6
1.50
F8
0.092358318
3
12
2.50
F11
0.028534423
4
6
1.75
F15
0.023667266
5
12
1.50
F13
0.016632251
6
12
2.50
F9
0.014181471
7
12
1.50
F8
0.012526058
8
6
1.75
F8
0.011079143
9
4
2.50
F8
0.006942812
10
6
1.50
F15
0.004692287
11
12
2.50
F17
0.004414363
12
12
1.75
F12
0.003995392
13
12
1.75
F8
0.003751858
14
4
1.75
F16
0.003325598
15
12
2.50
F10
0.002796547
16
6
2.50
F17
0.002704569
17
4
2.50
F13
0.002336904
18
4
2.50
F14
0.002325804
19
12
2.50
F15
0.002142383
20
8
2.50
F10
0.002134398
21
6
2.50
F13
0.002012958
22
12
2.50
F8
0.001898805
23
4
2.50
F10
0.001880449
24
6
2.50
F8
0.001702691
25
6
2.00
F14
0.001672374
26
8
2.50
F8
0.001621629
27
6
2.50
F9
0.001594413
28
4
1.75
F8
0.001559824
29
12
1.75
F16
0.001545018
30
8
1.75
F8
0.001538792
31
12
2.50
F14
0.001531455
32
8
2.50
F9
0.001471098
33
6
2.50
F11
0.001456022
34
6
2.50
F10
0.001254301
35
8
2.00
F8
0.001153517
36
4
1.50
F8
0.001143127
37
8
2.50
F16
0.001135169
38
4
2.50
F11
0.001103377
39
12
1.75
F10
0.001053001
40
4
2.50
F9
0.000982597
41
8
2.50
F11
0.000970519
42
4
1.75
F17
0.000890206
43
8
2.50
F17
0.000877437
44
12
2.00
F8
0.000846427
45
4
2.50
F15
0.000787806
46
4
2.50
F17
0.000771686
47
6
2.00
F15
0.000742759
48
4
2.50
F16
0.000736547
49
6
2.50
F14
0.000732413
50
4
1.75
F14
0.000720596
51
12
2.50
F16
0.000718018
52
12
2.00
F10
0.000707675
53
6
2.00
F8
0.000698370
54
8
2.50
F15
0.000695468
55
4
2.00
F8
0.000657664
56
8
1.50
F17
0.000655981
57
4
2.00
F14
0.000639511
58
6
2.50
F15
0.000630099
59
12
1.75
F11
0.000611518
60
4
1.50
F16
0.000594030
Figure 2
A logarithmic plot of
the importance of the fingerprints from the
first to the 60th rank.
A logarithmic plot of
the importance of the fingerprints from the
first to the 60th rank.For the most basic assessment
of the capability of our fingerprint,
the ML flow was performed using the GAFF molecular geometries and
atomic charges derived from OpenBabel. Figure shows the resulting regression curves for
(a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine,
and (f) total charges. The results for every atomic species demonstrated
the high coefficient of determination (R2) value, which indicated that the fingerprint captured well the charge
variations derived from the difference of the atomic environment.
The statistic scores for the regression results are shown in Table . The worst R2 value was for the nitrogen charges. The mean
squared error (MSE) was also the worst, and the mean absolute error
(MAE) was the second worst. This implies that the variations in the
charges of nitrogen atoms assigned by GAFF include systematic outliers.
In contrast, the R2 value and MSE for
hydrogen charges were the best, and the MAE for hydrogen charges was
second best. The regression results were also examined from the values
of the molecular dipole moment. Figure shows the resulting regression curve for the molecular
dipole moments. The results demonstrated a high R2 value of 0.990, which indicated that the fingerprint
captured well the variations of dipole moment values derived from
the difference of the atomic environment.
Figure 3
Regression
curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d)
oxygen, (e) fluorine, and (f) total charges by the fingerprint of
the GAFF structures.
Table 3
Statistic
Scores for the Regression
Results of the Fingerprint of the GAFF Structures
atom
R2
MAE
MSE
H
0.984
0.002
0.00004
C
0.974
0.007
0.00020
N
0.951
0.006
0.00030
O
0.983
0.003
0.00006
F
0.967
0.001
0.00008
total
0.993
0.004
0.00012
Figure 4
Regression curve for
molecular dipole moments from the fingerprint
of the GAFF structures.
Regression
curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d)
oxygen, (e) fluorine, and (f) total charges by the fingerprint of
the GAFF structures.Regression curve for
molecular dipole moments from the fingerprint
of the GAFF structures.To evaluate the robustness
of the fingerprints to variations in
molecular geometries, the ML flow was performed using the DFT molecular
geometries stored in QM9 before they were optimized in OpenBabel.
The atomic charges, as the response variables, were the OpenBabel
charges. We emphasize that response variables can be used for more
than just OpenBabel charges. For example, not only the restrained
electrostatic potential (RESP) charges[54] given by precise DFT calculations but also the charge and Lennard–Jones
potential parameters given by CHARMM,[55] COMPASS,[56] and other force fields are
possible candidates. However, as a simple and easy way to evaluate
an example, we used the OpenBabel charge here. Figure shows the resulting regression curves for
(a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine,
and (f) total charges. The results for every atomic species demonstrated
high R2 values, which indicated that the
selected 60 fingerprints demonstrated a high robustness against the
deviation of the molecular geometries. The statistic scores for the
regression results are shown in Table . The worst R2 value was
that for the fluorine charges. However, MAE and MSE values for fluorine
charges were the smallest. This was because the charge variations
of fluorine atoms were basically small. Thus, the regression accuracy
for the fluorine charges was never low. The trends for the statistic
scores for nitrogen charges were almost the same as those the regression
results from the GAFF structures and OpenBabel charges. The regression
results were also examined for the molecular dipole moment. Figure shows the regression
curve for molecular dipole moments. The results demonstrated an R2 value of 0.990, which is comparable to that
for the regression from GAFF structures and OpenBabel charges. Again,
the accurate prediction of OpenBabel charges and dipole moments for
DFT structures shows the high robustness of the 60 fingerprints for
the molecular geometries.
Figure 5
Regression curves for (a) hydrogen, (b) carbon,
(c) nitrogen, (d)
oxygen, (e) fluorine, and (f) total charges from the fingerprint of
the DFT structures.
Table 4
Statistic
Scores for the Regression
Results from the Fingerprint of the DFT Structures
atom
R2
MAE
MSE
H
0.995
0.003
0.00009
C
0.983
0.005
0.00010
N
0.947
0.006
0.00040
O
0.971
0.003
0.00010
F
0.944
0.001
0.00001
total
0.995
0.003
0.00009
Figure 6
Regression
curve for molecular dipole moments from the fingerprint
of the DFT structures.
Regression curves for (a) hydrogen, (b) carbon,
(c) nitrogen, (d)
oxygen, (e) fluorine, and (f) total charges from the fingerprint of
the DFT structures.Regression
curve for molecular dipole moments from the fingerprint
of the DFT structures.Finally, to examine the
prediction capability of the trained model
from GAFF structures and OpenBabel charges, the charges were predicted
from the trained operator vector GAFF, and the descriptor array was predicted from DFT structures DDFT. Figure shows a comparison of the predicted charges with OpenBabel
charges for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e)
fluorine, and (f) total charges. The results for carbon and hydrogen
had high R2 values, and the other atomic
species exhibited reasonable R2 values.
The statistical scores for the comparison results are shown in Table . The worst R2, MAE, and MSE values were found for nitrogen
charges, a feature consistent with the other results. Overall, the
model trained from GAFF structures and OpenBabel charges can adequately
predict the OpenBabel charges for DFT molecular structures. However,
the molecular dipole moments were not determined accurately when the
predicted charges were used (data not shown). Despite the success
of machine learning in predicting charges at a certain level, the
low prediction accuracy of the dipole moment implies that the gap
in molecular geometries between GAFF and DFT is not small.
Figure 7
Comparison
of predicted charges with OpenBabel charges for (a)
hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and
(f) total charges.
Table 5
Statistical
Scores for the Comparison
of Predicted Charges with OpenBabel Charges
atom
R2
MAE
MSE
H
0.913
0.004
0.00020
C
0.936
0.014
0.00050
N
0.764
0.022
0.00100
O
0.871
0.009
0.00050
F
0.860
0.004
0.00003
total
0.977
0.009
0.00040
Comparison
of predicted charges with OpenBabel charges for (a)
hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and
(f) total charges.
Conclusions
We developed a candidate for simple representation that combines
atomic species and the bond-orientational order parameter. The 504
candidates were applied as the fingerprint of molecules stored in
QM9. To screen the fingerprints, a regression analysis of the atomic
charges given by OpenBabel was performed by a supervised ML flow.
The 60 fingerprints were selected based on importance value of the
random forest method. The selected fingerprints was applied for the
regression analysis of OpenBabel charges, which used GAFF molecular
structures. The results exhibited high statistic scores for both the
atomic charge and the dipole moment, which indicated that the fingerprint
captured well the variations of charge derived from the difference
of the atomic environment. Then, the regression analysis was performed
using the DFT structures to assess the robustness of 60 fingerprints
against the deviation of the molecular structures. The results exhibited
high statistical scores, which indicated that the fingerprints demonstrated
a high robustness against the deviation of the molecular structures.
Finally, the atomic charges were predicted from DFT and DGAFF to examine
the prediction capability of the trained model from GAFF structures
and OpenBabel charges. The results showed that the trained model can
adequately predict the OpenBabel charges for DFT molecular structures.
Therefore, the screened 60 fingerprints have the potential to precisely
describe the chemical and structural information on the atomic environment
of molecules. Importantly, only 60 chemical representations give the
exact atomic charges for 133 885 different molecular geometries.
This fact means that our chemical representation successfully compresses
the information content of fingerprints based on 3D molecular geometries.
The robustness of the fingerprints to molecular geometries and the
robustness of the learning model to charge prediction also indicate
that our chemical representation can be used for iterations that simultaneously
determine molecular geometries and charges. Furthermore, our representation
can be used in combination with many other conventional atomic representations.[2−5,7,13] Thus,
the prediction capability of the above-mentioned trained model may
be improved for the deviation of molecular structures and molecules
never included in training data. In addition to constructing precise
big data for molecular structures, the development of atomic representations
might assist in the high-throughput design of molecular models that
include intra- and intermolecular interaction parameters without computationally
expensive first-principles simulation techniques.