Literature DB >> 35155951

Screening toward the Development of Fingerprints of Atomic Environments Using Bond-Orientational Order Parameters.

Hideo Doi1, Kazuaki Z Takahashi1, Takeshi Aoyagi1.   

Abstract

A combination of atomic numbers and bond-orientational order parameters is considered a candidate for a simple representation that involves information on both the atomic species and their positional relation. The 504 candidates are applied as the fingerprint of the molecules stored in QM9, a data set of computed geometric, energetic, electronic, and thermodynamic properties for 133 885 stable small organic molecules made up of carbon, hydrogen, oxygen, nitrogen, and fluorine atoms. To screen the fingerprints, a regression analysis of the atomic charges given by Open Babel was performed by supervised machine learning. The regression results indicate that the 60 fingerprints successfully estimate Open Babel charges. The results of the dipole moments, an example of a property expressed by charge and position, also had a high accuracy in comparison with the values computed from Open Babel charges. Therefore, the screened 60 fingerprints have the potential to precisely describe the chemical and structural information on the atomic environment of molecules.
© 2022 The Authors. Published by American Chemical Society.

Entities:  

Year:  2022        PMID: 35155951      PMCID: PMC8829853          DOI: 10.1021/acsomega.1c06587

Source DB:  PubMed          Journal:  ACS Omega        ISSN: 2470-1343


Introduction

A representation of atomic environments has become more needed than in the past decade because of the rapid development of material informatics and related technology.[1−13] Chemical representations without real-space information are mainstream because they are compact, lightweight, and usually adopted for molecular informatics. Indeed, some simple fingerprints, such as the simplified molecular-input line-entry system (SMILES)[14] and SYBYL line notation (SLN),[15] are useful, for instance, for similarity measurements between molecules as the simplest example. Structure representations have been developed separately from the chemical representations, e.g., Voronoi polyhedron of a central atom, angular Fourier series,[1] partial and generalized radial distribution functions, and the bond-orientational order parameter.[16] These were originally not intended for machine learning (ML) applications but are promising. However, separating the chemistry and structure of molecules is open to debate. For example, the Hohenberg–Kohn theorem,[17−19] which states that all physical quantities of a system can in principle be calculated from the electron density of the ground state, implies that there is an inseparable relationship between information about the electronic properties of atoms and the direct function of the geometric location of nuclei. Therefore, the species of atoms and their positional relations can hardly be separated to provide information on the atomic environment. This fact means that there is a need to consider a more proper representation of both chemistry and structure. One of the simplest ideas for such a representation is a combination of the existing chemical and structural representations. For example, a Coulomb matrix[20] is a generalization of an adjacency matrix representation and has been extended as the Ewald sum and the Sine matrix.[5] More advanced methods include histograms of distances (HDs), HD angles (HDAs), HDA dihedrals,[21] bag of bonds,[22] and the smooth overlap of atomic positions.[23] The bond-orientational order parameter that uses spherical harmonics exhibits a highly generic performance for describing molecular local structures. It was first developed by Steinhardt and co-workers to investigate the structures of supercooled liquids and metallic glasses.[16] Lechner and co-workers improved the bond-orientational order parameter by locally averaging neighborhood molecules.[24] Their parameters successfully distinguished body-centered cubic (bcc), face-centered cubic (fcc), hexagonal close-packed (hcp), and liquid-like local structures and have been widely used to classify complex local structures.[25−32] Further modifications have been attempted to extend the entire accuracy or for a specific use.[33,34] The capability of their parameters has been reported for the identification of the crystal-like structures of a Lennard–Jones fluid,[35−37] water,[38−45] polyethylene,[46] and a liquid crystal and its polymer.[33,34] Such a high sensitivity for local structures is desirable for the precise expression of the structural contributions of the atomic environment. In this work, we consider a combination of atomic numbers and bond-orientational order parameters as a candidate for the simple representation that involves information on both the atomic species and their positional relationship. This is applied as the fingerprint of the molecules stored in the QM9 data set,[47,48] which contains geometric, energetic, electronic, and thermodynamic properties for 133 885 stable small organic molecules made up of carbon, hydrogen, oxygen, nitrogen, and fluorine. A total of 504 fingerprints were systematically designed and then screened by the supervised ML for the regression analysis for the atomic charges of molecules. The results indicate that the 60 fingerprints successfully estimate the atomic charges. These fingerprints were used to compute the dipole moments of molecules as an example of a property expressed by both charge and position, and a high accuracy was obtained compared to the values computed from atomic charges. Therefore, the 60 screened fingerprints have the potential to precisely describe the chemical and structural information about the atomic environments of molecules.

Methodology

Real-Space Fingerprint Using a Generic Local-Order Parameter

Here we introduce the real-space fingerprint of the atomic environment considered in this work. First, the chemical information to be embedded is defined as the implant function F(a, r, rc), where a is the atomic number of atom i, r is the interatomic distance between atoms i and j, and rc is the cutoff radius for judging whether particle j is a neighbor of particle i. Here we use four cutoff radii of 1.50, 1.75, 2.00, and 2.50 Å to accommodate various molecular geometries. We emphasize that F(a, r, rc) is basically a simple combination of atomic numbers and distances among neighboring atoms. In this work, a total of 18 species of F(a, r, rc) were calculated for each rc, as shown in Table , where b(rc, i) is an array that involves identifiers of neighboring atoms of atom i and N is the number of neighboring atoms contained in b(rc, i). F was designed based on the following rules: (i) it should be a function of atomic number, (ii) the interaction distance to be emphasized should change depending on the weighting, and (iii) differences in the weighting with and without normalization should be considered. Rule (i) is essential as a fingerprint for the constituent elements of a molecule, but 0, 1, and 1.5 powers were considered in order to have variations in the effect of the atomic number. Rule (ii) is a concept similar to that of the radial basis function G2 in Parrinello–Behler type descriptors,[1,49] where the weights are maximized for interaction distances of 0.0, 1.5, 2.0, and 2.5 Å. Rule (iii) was considered because the form of the function is clearly different depending on the presence or absence of normalization. From the above rules (i–iii), the number of implant functions became 18. Note that other expressions of F are possible for the chemical and physical properties as long as these are defined as a single scalar function. Second, the implant function was embedded in the bond-orientational order parameter developed by Steinhardt and co-workers[16] aswhere l is an arbitrary positive integer denoting the degree of the harmonic function, m is an integer that runs from – l to + l, Y is a spherical harmonics function, and is a vector from particle j to particle i. In this work, we set l values as 4, 6, 8, 12, 15, 18, and 20. Therefore, the total number of representations for the atomic environment for atom i becomes 504 (equal to the number of l × the number of rc × the number of F). The atomic number and all the representations for atom i are stored in the atomic environment vector for atom i, , which has 505 (504 + 1) elements. Finally, the atomic environment vectors for all atoms are merged to the descriptor array D. In this work, the number of elements of D was 505 × 133 885. This was almost the same scale of data as that for the data set of a conventional representation for the atomic environment.[13] However, in order to attain an effective atomic environment it is desirable to have fewer elements. Therefore, we attempted to screen the fingerprints from 504 to a smaller number by means of supervised ML for a regression analysis of the atomic charges of molecules.
Table 1

Description of the Implant Function F(a, r, rc)

name of functionformula
F11
F21/N
F3rij
F4
F5aj
F6
F7ajrij–1
F8
F9aj1.5rij–1
F10
F11aj1.5rij
F12
F13aj exp[−4(rij – 1.5)2]
F14
F15aj exp[−4(rij – 2.0)2]
F16
F17aj exp[−4(rij – 2.5)2]
F18

Machine Learning

To examine the capability of each fingerprint and to screen the fingerprints, a regression analysis of the atomic charges was performed in supervised ML. The molecular geometry and atomic charge for each molecule stored in QM9 were preliminarily optimized to the generalized amber force field (GAFF)[50] by Open Babel.[51]Figure shows the actual ML flow used in this work. First, b(rc, i) was determined for each atom. Second, Q(rc, F, i) values were computed for each atom and were stored in . The atomic number a was also stored in . The values for all atoms were merged with D. Third, the atomic charges for all atoms were merged with the response variable vector . Note that the response variable became the supervisor of the ML. Then, the operator vector , which satisfies the relation D = , was estimated through ML. The term was estimated using the random forest method[52] implemented on Scikit-learn.[53] The random forest method has some useful characteristics as a learning algorithm of ML: (i) the learning routine is simple and thus has high performance of computing, (ii) the method prevents an overlearning, (iii) little or no data cleansing is needed, and (iv) the significance of data descriptors can be easily quantified. Note that overlearning is the situation where the learning results only fit to data used in the learning and do not fit to any new data. was checked through a k-fold cross-validation implemented on Scikit-learn, where k denotes the number of cross-validations. Note that k-fold cross-validation is the method for checking overlearning. The fivefold cross-validation was done considering the quality and quantity of our data in this work. Namely, 1/5 of 2 407 756 local coordinates (∼481 551) were used for each of a total of five cross-validations. Finally, the fingerprints were screened based on the function of the random forest method, which quantified the importance of the data descriptors.
Figure 1

Machine learning flow for the regression analysis that predicts the atomic charges of molecules stored in QM9.

Machine learning flow for the regression analysis that predicts the atomic charges of molecules stored in QM9.

Results and discussion

The time consumption to compute our fingerprint for all the molecules included in QM9 was approximately 72 h using an AMD Ryzen Threadripper 1950X CPU. Note that the fingerprints were eternally usable if these were prepared for ML once. Table shows fingerprints in order of importance based on the random forest method. There is a bias in the importance of the fingerprints, which means that the top ranking fingerprints represent the typical chemical environments. Importantly, the development of highly sensitive fingerprints that contain such biases can be left to machine learning. Our proposed screening makes this possible, allowing us to select a relatively small set of candidates for the best fingerprint at any given time depending on the constantly changing (and typically increasing) amount of data in the database. The fingerprints ranked from first to 60th account for 97.3% of the importance. Figure shows the importance of fingerprints from the first rank to the 60th rank. After the 55th rank, the importance level is generally saturated at about 0.0006. Therefore, in the following sections, only fingerprints from the first rank to the 60th rank were used.
Table 2

Fingerprints in Order of Importance Based on the Random Forest Method

ranklrc (Å)Fimportance
1121.75F130.688213177
261.50F80.092358318
3122.50F110.028534423
461.75F150.023667266
5121.50F130.016632251
6122.50F90.014181471
7121.50F80.012526058
861.75F80.011079143
942.50F80.006942812
1061.50F150.004692287
11122.50F170.004414363
12121.75F120.003995392
13121.75F80.003751858
1441.75F160.003325598
15122.50F100.002796547
1662.50F170.002704569
1742.50F130.002336904
1842.50F140.002325804
19122.50F150.002142383
2082.50F100.002134398
2162.50F130.002012958
22122.50F80.001898805
2342.50F100.001880449
2462.50F80.001702691
2562.00F140.001672374
2682.50F80.001621629
2762.50F90.001594413
2841.75F80.001559824
29121.75F160.001545018
3081.75F80.001538792
31122.50F140.001531455
3282.50F90.001471098
3362.50F110.001456022
3462.50F100.001254301
3582.00F80.001153517
3641.50F80.001143127
3782.50F160.001135169
3842.50F110.001103377
39121.75F100.001053001
4042.50F90.000982597
4182.50F110.000970519
4241.75F170.000890206
4382.50F170.000877437
44122.00F80.000846427
4542.50F150.000787806
4642.50F170.000771686
4762.00F150.000742759
4842.50F160.000736547
4962.50F140.000732413
5041.75F140.000720596
51122.50F160.000718018
52122.00F100.000707675
5362.00F80.000698370
5482.50F150.000695468
5542.00F80.000657664
5681.50F170.000655981
5742.00F140.000639511
5862.50F150.000630099
59121.75F110.000611518
6041.50F160.000594030
Figure 2

A logarithmic plot of the importance of the fingerprints from the first to the 60th rank.

A logarithmic plot of the importance of the fingerprints from the first to the 60th rank. For the most basic assessment of the capability of our fingerprint, the ML flow was performed using the GAFF molecular geometries and atomic charges derived from OpenBabel. Figure shows the resulting regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges. The results for every atomic species demonstrated the high coefficient of determination (R2) value, which indicated that the fingerprint captured well the charge variations derived from the difference of the atomic environment. The statistic scores for the regression results are shown in Table . The worst R2 value was for the nitrogen charges. The mean squared error (MSE) was also the worst, and the mean absolute error (MAE) was the second worst. This implies that the variations in the charges of nitrogen atoms assigned by GAFF include systematic outliers. In contrast, the R2 value and MSE for hydrogen charges were the best, and the MAE for hydrogen charges was second best. The regression results were also examined from the values of the molecular dipole moment. Figure shows the resulting regression curve for the molecular dipole moments. The results demonstrated a high R2 value of 0.990, which indicated that the fingerprint captured well the variations of dipole moment values derived from the difference of the atomic environment.
Figure 3

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges by the fingerprint of the GAFF structures.

Table 3

Statistic Scores for the Regression Results of the Fingerprint of the GAFF Structures

atomR2MAEMSE
H0.9840.0020.00004
C0.9740.0070.00020
N0.9510.0060.00030
O0.9830.0030.00006
F0.9670.0010.00008
total0.9930.0040.00012
Figure 4

Regression curve for molecular dipole moments from the fingerprint of the GAFF structures.

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges by the fingerprint of the GAFF structures. Regression curve for molecular dipole moments from the fingerprint of the GAFF structures. To evaluate the robustness of the fingerprints to variations in molecular geometries, the ML flow was performed using the DFT molecular geometries stored in QM9 before they were optimized in OpenBabel. The atomic charges, as the response variables, were the OpenBabel charges. We emphasize that response variables can be used for more than just OpenBabel charges. For example, not only the restrained electrostatic potential (RESP) charges[54] given by precise DFT calculations but also the charge and Lennard–Jones potential parameters given by CHARMM,[55] COMPASS,[56] and other force fields are possible candidates. However, as a simple and easy way to evaluate an example, we used the OpenBabel charge here. Figure shows the resulting regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges. The results for every atomic species demonstrated high R2 values, which indicated that the selected 60 fingerprints demonstrated a high robustness against the deviation of the molecular geometries. The statistic scores for the regression results are shown in Table . The worst R2 value was that for the fluorine charges. However, MAE and MSE values for fluorine charges were the smallest. This was because the charge variations of fluorine atoms were basically small. Thus, the regression accuracy for the fluorine charges was never low. The trends for the statistic scores for nitrogen charges were almost the same as those the regression results from the GAFF structures and OpenBabel charges. The regression results were also examined for the molecular dipole moment. Figure shows the regression curve for molecular dipole moments. The results demonstrated an R2 value of 0.990, which is comparable to that for the regression from GAFF structures and OpenBabel charges. Again, the accurate prediction of OpenBabel charges and dipole moments for DFT structures shows the high robustness of the 60 fingerprints for the molecular geometries.
Figure 5

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges from the fingerprint of the DFT structures.

Table 4

Statistic Scores for the Regression Results from the Fingerprint of the DFT Structures

atomR2MAEMSE
H0.9950.0030.00009
C0.9830.0050.00010
N0.9470.0060.00040
O0.9710.0030.00010
F0.9440.0010.00001
total0.9950.0030.00009
Figure 6

Regression curve for molecular dipole moments from the fingerprint of the DFT structures.

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges from the fingerprint of the DFT structures. Regression curve for molecular dipole moments from the fingerprint of the DFT structures. Finally, to examine the prediction capability of the trained model from GAFF structures and OpenBabel charges, the charges were predicted from the trained operator vector GAFF, and the descriptor array was predicted from DFT structures DDFT. Figure shows a comparison of the predicted charges with OpenBabel charges for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges. The results for carbon and hydrogen had high R2 values, and the other atomic species exhibited reasonable R2 values. The statistical scores for the comparison results are shown in Table . The worst R2, MAE, and MSE values were found for nitrogen charges, a feature consistent with the other results. Overall, the model trained from GAFF structures and OpenBabel charges can adequately predict the OpenBabel charges for DFT molecular structures. However, the molecular dipole moments were not determined accurately when the predicted charges were used (data not shown). Despite the success of machine learning in predicting charges at a certain level, the low prediction accuracy of the dipole moment implies that the gap in molecular geometries between GAFF and DFT is not small.
Figure 7

Comparison of predicted charges with OpenBabel charges for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges.

Table 5

Statistical Scores for the Comparison of Predicted Charges with OpenBabel Charges

atomR2MAEMSE
H0.9130.0040.00020
C0.9360.0140.00050
N0.7640.0220.00100
O0.8710.0090.00050
F0.8600.0040.00003
total0.9770.0090.00040
Comparison of predicted charges with OpenBabel charges for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges.

Conclusions

We developed a candidate for simple representation that combines atomic species and the bond-orientational order parameter. The 504 candidates were applied as the fingerprint of molecules stored in QM9. To screen the fingerprints, a regression analysis of the atomic charges given by OpenBabel was performed by a supervised ML flow. The 60 fingerprints were selected based on importance value of the random forest method. The selected fingerprints was applied for the regression analysis of OpenBabel charges, which used GAFF molecular structures. The results exhibited high statistic scores for both the atomic charge and the dipole moment, which indicated that the fingerprint captured well the variations of charge derived from the difference of the atomic environment. Then, the regression analysis was performed using the DFT structures to assess the robustness of 60 fingerprints against the deviation of the molecular structures. The results exhibited high statistical scores, which indicated that the fingerprints demonstrated a high robustness against the deviation of the molecular structures. Finally, the atomic charges were predicted from DFT and DGAFF to examine the prediction capability of the trained model from GAFF structures and OpenBabel charges. The results showed that the trained model can adequately predict the OpenBabel charges for DFT molecular structures. Therefore, the screened 60 fingerprints have the potential to precisely describe the chemical and structural information on the atomic environment of molecules. Importantly, only 60 chemical representations give the exact atomic charges for 133 885 different molecular geometries. This fact means that our chemical representation successfully compresses the information content of fingerprints based on 3D molecular geometries. The robustness of the fingerprints to molecular geometries and the robustness of the learning model to charge prediction also indicate that our chemical representation can be used for iterations that simultaneously determine molecular geometries and charges. Furthermore, our representation can be used in combination with many other conventional atomic representations.[2−5,7,13] Thus, the prediction capability of the above-mentioned trained model may be improved for the deviation of molecular structures and molecules never included in training data. In addition to constructing precise big data for molecular structures, the development of atomic representations might assist in the high-throughput design of molecular models that include intra- and intermolecular interaction parameters without computationally expensive first-principles simulation techniques.
  32 in total

1.  Universal variational functionals of electron densities, first-order density matrices, and natural spin-orbitals and solution of the v-representability problem.

Authors:  M Levy
Journal:  Proc Natl Acad Sci U S A       Date:  1979-12       Impact factor: 11.205

2.  Accurate determination of crystal structures based on averaged local bond order parameters.

Authors:  Wolfgang Lechner; Christoph Dellago
Journal:  J Chem Phys       Date:  2008-09-21       Impact factor: 3.488

3.  SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries.

Authors:  R Webster Homer; Jon Swanson; Robert J Jilek; Tad Hurst; Robert D Clark
Journal:  J Chem Inf Model       Date:  2008-12       Impact factor: 4.956

4.  Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.

Authors:  Lars Ruddigkeit; Ruud van Deursen; Lorenz C Blum; Jean-Louis Reymond
Journal:  J Chem Inf Model       Date:  2012-11-01       Impact factor: 4.956

5.  Roles of icosahedral and crystal-like order in the hard spheres glass transition.

Authors:  Mathieu Leocmach; Hajime Tanaka
Journal:  Nat Commun       Date:  2012-07-24       Impact factor: 14.919

6.  Structural evolution in the aging process of supercooled colloidal liquids.

Authors:  Takeshi Kawasaki; Hajime Tanaka
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2014-06-26

Review 7.  Representing potential energy surfaces by high-dimensional neural network potentials.

Authors:  J Behler
Journal:  J Phys Condens Matter       Date:  2014-04-23       Impact factor: 2.333

8.  Searching for local order parameters to classify water structures at triple points.

Authors:  Hideo Doi; Kazuaki Z Takahashi; Takeshi Aoyagi
Journal:  J Comput Chem       Date:  2021-06-24       Impact factor: 3.376

9.  Mining of effective local order parameters for classifying crystal structures: A machine learning study.

Authors:  Hideo Doi; Kazuaki Z Takahashi; Takeshi Aoyagi
Journal:  J Chem Phys       Date:  2020-06-07       Impact factor: 3.488

10.  The microscopic pathway to crystallization in supercooled liquids.

Authors:  John Russo; Hajime Tanaka
Journal:  Sci Rep       Date:  2012-07-12       Impact factor: 4.379

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.