Literature DB >> 35155951

Screening toward the Development of Fingerprints of Atomic Environments Using Bond-Orientational Order Parameters.

Hideo Doi¹, Kazuaki Z Takahashi¹, Takeshi Aoyagi¹.

Abstract

A combination of atomic numbers and bond-orientational order parameters is considered a candidate for a simple representation that involves information on both the atomic species and their positional relation. The 504 candidates are applied as the fingerprint of the molecules stored in QM9, a data set of computed geometric, energetic, electronic, and thermodynamic properties for 133 885 stable small organic molecules made up of carbon, hydrogen, oxygen, nitrogen, and fluorine atoms. To screen the fingerprints, a regression analysis of the atomic charges given by Open Babel was performed by supervised machine learning. The regression results indicate that the 60 fingerprints successfully estimate Open Babel charges. The results of the dipole moments, an example of a property expressed by charge and position, also had a high accuracy in comparison with the values computed from Open Babel charges. Therefore, the screened 60 fingerprints have the potential to precisely describe the chemical and structural information on the atomic environment of molecules.

Entities: Chemical

Year: 2022 PMID： 35155951 PMCID： PMC8829853 DOI： 10.1021/acsomega.1c06587

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

A representation of atomic environments has become more needed than in the past decade because of the rapid development of material informatics and related technology.[1−13] Chemical representations without real-space information are mainstream because they are compact, lightweight, and usually adopted for molecular informatics. Indeed, some simple fingerprints, such as the simplified molecular-input line-entry system (SMILES)[14] and SYBYL line notation (SLN),[15] are useful, for instance, for similarity measurements between molecules as the simplest example. Structure representations have been developed separately from the chemical representations, e.g., Voronoi polyhedron of a central atom, angular Fourier series,[1] partial and generalized radial distribution functions, and the bond-orientational order parameter.[16] These were originally not intended for machine learning (ML) applications but are promising. However, separating the chemistry and structure of molecules is open to debate. For example, the Hohenberg–Kohn theorem,[17−19] which states that all physical quantities of a system can in principle be calculated from the electron density of the ground state, implies that there is an inseparable relationship between information about the electronic properties of atoms and the direct function of the geometric location of nuclei. Therefore, the species of atoms and their positional relations can hardly be separated to provide information on the atomic environment. This fact means that there is a need to consider a more proper representation of both chemistry and structure. One of the simplest ideas for such a representation is a combination of the existing chemical and structural representations. For example, a Coulomb matrix[20] is a generalization of an adjacency matrix representation and has been extended as the Ewald sum and the Sine matrix.[5] More advanced methods include histograms of distances (HDs), HD angles (HDAs), HDA dihedrals,[21] bag of bonds,[22] and the smooth overlap of atomic positions.[23] The bond-orientational order parameter that uses spherical harmonics exhibits a highly generic performance for describing molecular local structures. It was first developed by Steinhardt and co-workers to investigate the structures of supercooled liquids and metallic glasses.[16] Lechner and co-workers improved the bond-orientational order parameter by locally averaging neighborhood molecules.[24] Their parameters successfully distinguished body-centered cubic (bcc), face-centered cubic (fcc), hexagonal close-packed (hcp), and liquid-like local structures and have been widely used to classify complex local structures.[25−32] Further modifications have been attempted to extend the entire accuracy or for a specific use.[33,34] The capability of their parameters has been reported for the identification of the crystal-like structures of a Lennard–Jones fluid,[35−37] water,[38−45] polyethylene,[46] and a liquid crystal and its polymer.[33,34] Such a high sensitivity for local structures is desirable for the precise expression of the structural contributions of the atomic environment. In this work, we consider a combination of atomic numbers and bond-orientational order parameters as a candidate for the simple representation that involves information on both the atomic species and their positional relationship. This is applied as the fingerprint of the molecules stored in the QM9 data set,[47,48] which contains geometric, energetic, electronic, and thermodynamic properties for 133 885 stable small organic molecules made up of carbon, hydrogen, oxygen, nitrogen, and fluorine. A total of 504 fingerprints were systematically designed and then screened by the supervised ML for the regression analysis for the atomic charges of molecules. The results indicate that the 60 fingerprints successfully estimate the atomic charges. These fingerprints were used to compute the dipole moments of molecules as an example of a property expressed by both charge and position, and a high accuracy was obtained compared to the values computed from atomic charges. Therefore, the 60 screened fingerprints have the potential to precisely describe the chemical and structural information about the atomic environments of molecules.

Methodology

Real-Space Fingerprint Using a Generic Local-Order Parameter

Here we introduce the real-space fingerprint of the atomic environment considered in this work. First, the chemical information to be embedded is defined as the implant function F(a, r, rc), where a is the atomic number of atom i, r is the interatomic distance between atoms i and j, and rc is the cutoff radius for judging whether particle j is a neighbor of particle i. Here we use four cutoff radii of 1.50, 1.75, 2.00, and 2.50 Å to accommodate various molecular geometries. We emphasize that F(a, r, rc) is basically a simple combination of atomic numbers and distances among neighboring atoms. In this work, a total of 18 species of F(a, r, rc) were calculated for each rc, as shown in Table , where b(rc, i) is an array that involves identifiers of neighboring atoms of atom i and N is the number of neighboring atoms contained in b(rc, i). F was designed based on the following rules: (i) it should be a function of atomic number, (ii) the interaction distance to be emphasized should change depending on the weighting, and (iii) differences in the weighting with and without normalization should be considered. Rule (i) is essential as a fingerprint for the constituent elements of a molecule, but 0, 1, and 1.5 powers were considered in order to have variations in the effect of the atomic number. Rule (ii) is a concept similar to that of the radial basis function G2 in Parrinello–Behler type descriptors,[1,49] where the weights are maximized for interaction distances of 0.0, 1.5, 2.0, and 2.5 Å. Rule (iii) was considered because the form of the function is clearly different depending on the presence or absence of normalization. From the above rules (i–iii), the number of implant functions became 18. Note that other expressions of F are possible for the chemical and physical properties as long as these are defined as a single scalar function. Second, the implant function was embedded in the bond-orientational order parameter developed by Steinhardt and co-workers[16] aswhere l is an arbitrary positive integer denoting the degree of the harmonic function, m is an integer that runs from – l to + l, Y is a spherical harmonics function, and is a vector from particle j to particle i. In this work, we set l values as 4, 6, 8, 12, 15, 18, and 20. Therefore, the total number of representations for the atomic environment for atom i becomes 504 (equal to the number of l × the number of rc × the number of F). The atomic number and all the representations for atom i are stored in the atomic environment vector for atom i, , which has 505 (504 + 1) elements. Finally, the atomic environment vectors for all atoms are merged to the descriptor array D. In this work, the number of elements of D was 505 × 133 885. This was almost the same scale of data as that for the data set of a conventional representation for the atomic environment.[13] However, in order to attain an effective atomic environment it is desirable to have fewer elements. Therefore, we attempted to screen the fingerprints from 504 to a smaller number by means of supervised ML for a regression analysis of the atomic charges of molecules.

Table 1

Description of the Implant Function F(a, r, rc)

name of function	formula
F₁	1
F₂	1/N
F₃	r_ij
F₄
F₅	a_j
F₆
F₇	a_jr_ij^–1
F₈
F₉	a_j^1.5r_ij^–1
F₁₀
F₁₁	a_j^1.5r_ij
F₁₂
F₁₃	a_j exp[−4(r_ij – 1.5)²]
F₁₄
F₁₅	a_j exp[−4(r_ij – 2.0)²]
F₁₆
F₁₇	a_j exp[−4(r_ij – 2.5)²]
F₁₈

Machine Learning

To examine the capability of each fingerprint and to screen the fingerprints, a regression analysis of the atomic charges was performed in supervised ML. The molecular geometry and atomic charge for each molecule stored in QM9 were preliminarily optimized to the generalized amber force field (GAFF)[50] by Open Babel.[51]Figure shows the actual ML flow used in this work. First, b(rc, i) was determined for each atom. Second, Q(rc, F, i) values were computed for each atom and were stored in . The atomic number a was also stored in . The values for all atoms were merged with D. Third, the atomic charges for all atoms were merged with the response variable vector . Note that the response variable became the supervisor of the ML. Then, the operator vector , which satisfies the relation D = , was estimated through ML. The term was estimated using the random forest method[52] implemented on Scikit-learn.[53] The random forest method has some useful characteristics as a learning algorithm of ML: (i) the learning routine is simple and thus has high performance of computing, (ii) the method prevents an overlearning, (iii) little or no data cleansing is needed, and (iv) the significance of data descriptors can be easily quantified. Note that overlearning is the situation where the learning results only fit to data used in the learning and do not fit to any new data. was checked through a k-fold cross-validation implemented on Scikit-learn, where k denotes the number of cross-validations. Note that k-fold cross-validation is the method for checking overlearning. The fivefold cross-validation was done considering the quality and quantity of our data in this work. Namely, 1/5 of 2 407 756 local coordinates (∼481 551) were used for each of a total of five cross-validations. Finally, the fingerprints were screened based on the function of the random forest method, which quantified the importance of the data descriptors.

Figure 1

Machine learning flow for the regression analysis that predicts the atomic charges of molecules stored in QM9.

Results and discussion

The time consumption to compute our fingerprint for all the molecules included in QM9 was approximately 72 h using an AMD Ryzen Threadripper 1950X CPU. Note that the fingerprints were eternally usable if these were prepared for ML once. Table shows fingerprints in order of importance based on the random forest method. There is a bias in the importance of the fingerprints, which means that the top ranking fingerprints represent the typical chemical environments. Importantly, the development of highly sensitive fingerprints that contain such biases can be left to machine learning. Our proposed screening makes this possible, allowing us to select a relatively small set of candidates for the best fingerprint at any given time depending on the constantly changing (and typically increasing) amount of data in the database. The fingerprints ranked from first to 60th account for 97.3% of the importance. Figure shows the importance of fingerprints from the first rank to the 60th rank. After the 55th rank, the importance level is generally saturated at about 0.0006. Therefore, in the following sections, only fingerprints from the first rank to the 60th rank were used.

Table 2

Fingerprints in Order of Importance Based on the Random Forest Method

rank	l	r_c (Å)	F	importance
1	12	1.75	F₁₃	0.688213177
2	6	1.50	F₈	0.092358318
3	12	2.50	F₁₁	0.028534423
4	6	1.75	F₁₅	0.023667266
5	12	1.50	F₁₃	0.016632251
6	12	2.50	F₉	0.014181471
7	12	1.50	F₈	0.012526058
8	6	1.75	F₈	0.011079143
9	4	2.50	F₈	0.006942812
10	6	1.50	F₁₅	0.004692287
11	12	2.50	F₁₇	0.004414363
12	12	1.75	F₁₂	0.003995392
13	12	1.75	F₈	0.003751858
14	4	1.75	F₁₆	0.003325598
15	12	2.50	F₁₀	0.002796547
16	6	2.50	F₁₇	0.002704569
17	4	2.50	F₁₃	0.002336904
18	4	2.50	F₁₄	0.002325804
19	12	2.50	F₁₅	0.002142383
20	8	2.50	F₁₀	0.002134398
21	6	2.50	F₁₃	0.002012958
22	12	2.50	F₈	0.001898805
23	4	2.50	F₁₀	0.001880449
24	6	2.50	F₈	0.001702691
25	6	2.00	F₁₄	0.001672374
26	8	2.50	F₈	0.001621629
27	6	2.50	F₉	0.001594413
28	4	1.75	F₈	0.001559824
29	12	1.75	F₁₆	0.001545018
30	8	1.75	F₈	0.001538792
31	12	2.50	F₁₄	0.001531455
32	8	2.50	F₉	0.001471098
33	6	2.50	F₁₁	0.001456022
34	6	2.50	F₁₀	0.001254301
35	8	2.00	F₈	0.001153517
36	4	1.50	F₈	0.001143127
37	8	2.50	F₁₆	0.001135169
38	4	2.50	F₁₁	0.001103377
39	12	1.75	F₁₀	0.001053001
40	4	2.50	F₉	0.000982597
41	8	2.50	F₁₁	0.000970519
42	4	1.75	F₁₇	0.000890206
43	8	2.50	F₁₇	0.000877437
44	12	2.00	F₈	0.000846427
45	4	2.50	F₁₅	0.000787806
46	4	2.50	F₁₇	0.000771686
47	6	2.00	F₁₅	0.000742759
48	4	2.50	F₁₆	0.000736547
49	6	2.50	F₁₄	0.000732413
50	4	1.75	F₁₄	0.000720596
51	12	2.50	F₁₆	0.000718018
52	12	2.00	F₁₀	0.000707675
53	6	2.00	F₈	0.000698370
54	8	2.50	F₁₅	0.000695468
55	4	2.00	F₈	0.000657664
56	8	1.50	F₁₇	0.000655981
57	4	2.00	F₁₄	0.000639511
58	6	2.50	F₁₅	0.000630099
59	12	1.75	F₁₁	0.000611518
60	4	1.50	F₁₆	0.000594030

Figure 2

A logarithmic plot of the importance of the fingerprints from the first to the 60th rank.

A logarithmic plot of the importance of the fingerprints from the first to the 60th rank. For the most basic assessment of the capability of our fingerprint, the ML flow was performed using the GAFF molecular geometries and atomic charges derived from OpenBabel. Figure shows the resulting regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges. The results for every atomic species demonstrated the high coefficient of determination (R2) value, which indicated that the fingerprint captured well the charge variations derived from the difference of the atomic environment. The statistic scores for the regression results are shown in Table . The worst R2 value was for the nitrogen charges. The mean squared error (MSE) was also the worst, and the mean absolute error (MAE) was the second worst. This implies that the variations in the charges of nitrogen atoms assigned by GAFF include systematic outliers. In contrast, the R2 value and MSE for hydrogen charges were the best, and the MAE for hydrogen charges was second best. The regression results were also examined from the values of the molecular dipole moment. Figure shows the resulting regression curve for the molecular dipole moments. The results demonstrated a high R2 value of 0.990, which indicated that the fingerprint captured well the variations of dipole moment values derived from the difference of the atomic environment.

Figure 3

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges by the fingerprint of the GAFF structures.

Table 3

Statistic Scores for the Regression Results of the Fingerprint of the GAFF Structures

atom	R²	MAE	MSE
H	0.984	0.002	0.00004
C	0.974	0.007	0.00020
N	0.951	0.006	0.00030
O	0.983	0.003	0.00006
F	0.967	0.001	0.00008
total	0.993	0.004	0.00012

Figure 4

Regression curve for molecular dipole moments from the fingerprint of the GAFF structures.

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges by the fingerprint of the GAFF structures. Regression curve for molecular dipole moments from the fingerprint of the GAFF structures. To evaluate the robustness of the fingerprints to variations in molecular geometries, the ML flow was performed using the DFT molecular geometries stored in QM9 before they were optimized in OpenBabel. The atomic charges, as the response variables, were the OpenBabel charges. We emphasize that response variables can be used for more than just OpenBabel charges. For example, not only the restrained electrostatic potential (RESP) charges[54] given by precise DFT calculations but also the charge and Lennard–Jones potential parameters given by CHARMM,[55] COMPASS,[56] and other force fields are possible candidates. However, as a simple and easy way to evaluate an example, we used the OpenBabel charge here. Figure shows the resulting regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges. The results for every atomic species demonstrated high R2 values, which indicated that the selected 60 fingerprints demonstrated a high robustness against the deviation of the molecular geometries. The statistic scores for the regression results are shown in Table . The worst R2 value was that for the fluorine charges. However, MAE and MSE values for fluorine charges were the smallest. This was because the charge variations of fluorine atoms were basically small. Thus, the regression accuracy for the fluorine charges was never low. The trends for the statistic scores for nitrogen charges were almost the same as those the regression results from the GAFF structures and OpenBabel charges. The regression results were also examined for the molecular dipole moment. Figure shows the regression curve for molecular dipole moments. The results demonstrated an R2 value of 0.990, which is comparable to that for the regression from GAFF structures and OpenBabel charges. Again, the accurate prediction of OpenBabel charges and dipole moments for DFT structures shows the high robustness of the 60 fingerprints for the molecular geometries.

Figure 5

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges from the fingerprint of the DFT structures.

Table 4

Statistic Scores for the Regression Results from the Fingerprint of the DFT Structures

atom	R²	MAE	MSE
H	0.995	0.003	0.00009
C	0.983	0.005	0.00010
N	0.947	0.006	0.00040
O	0.971	0.003	0.00010
F	0.944	0.001	0.00001
total	0.995	0.003	0.00009

Figure 6

Regression curve for molecular dipole moments from the fingerprint of the DFT structures.

Regression curves for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges from the fingerprint of the DFT structures. Regression curve for molecular dipole moments from the fingerprint of the DFT structures. Finally, to examine the prediction capability of the trained model from GAFF structures and OpenBabel charges, the charges were predicted from the trained operator vector GAFF, and the descriptor array was predicted from DFT structures DDFT. Figure shows a comparison of the predicted charges with OpenBabel charges for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges. The results for carbon and hydrogen had high R2 values, and the other atomic species exhibited reasonable R2 values. The statistical scores for the comparison results are shown in Table . The worst R2, MAE, and MSE values were found for nitrogen charges, a feature consistent with the other results. Overall, the model trained from GAFF structures and OpenBabel charges can adequately predict the OpenBabel charges for DFT molecular structures. However, the molecular dipole moments were not determined accurately when the predicted charges were used (data not shown). Despite the success of machine learning in predicting charges at a certain level, the low prediction accuracy of the dipole moment implies that the gap in molecular geometries between GAFF and DFT is not small.

Figure 7

Comparison of predicted charges with OpenBabel charges for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges.

Table 5

Statistical Scores for the Comparison of Predicted Charges with OpenBabel Charges

atom	R²	MAE	MSE
H	0.913	0.004	0.00020
C	0.936	0.014	0.00050
N	0.764	0.022	0.00100
O	0.871	0.009	0.00050
F	0.860	0.004	0.00003
total	0.977	0.009	0.00040

Comparison of predicted charges with OpenBabel charges for (a) hydrogen, (b) carbon, (c) nitrogen, (d) oxygen, (e) fluorine, and (f) total charges.

Conclusions

We developed a candidate for simple representation that combines atomic species and the bond-orientational order parameter. The 504 candidates were applied as the fingerprint of molecules stored in QM9. To screen the fingerprints, a regression analysis of the atomic charges given by OpenBabel was performed by a supervised ML flow. The 60 fingerprints were selected based on importance value of the random forest method. The selected fingerprints was applied for the regression analysis of OpenBabel charges, which used GAFF molecular structures. The results exhibited high statistic scores for both the atomic charge and the dipole moment, which indicated that the fingerprint captured well the variations of charge derived from the difference of the atomic environment. Then, the regression analysis was performed using the DFT structures to assess the robustness of 60 fingerprints against the deviation of the molecular structures. The results exhibited high statistical scores, which indicated that the fingerprints demonstrated a high robustness against the deviation of the molecular structures. Finally, the atomic charges were predicted from DFT and DGAFF to examine the prediction capability of the trained model from GAFF structures and OpenBabel charges. The results showed that the trained model can adequately predict the OpenBabel charges for DFT molecular structures. Therefore, the screened 60 fingerprints have the potential to precisely describe the chemical and structural information on the atomic environment of molecules. Importantly, only 60 chemical representations give the exact atomic charges for 133 885 different molecular geometries. This fact means that our chemical representation successfully compresses the information content of fingerprints based on 3D molecular geometries. The robustness of the fingerprints to molecular geometries and the robustness of the learning model to charge prediction also indicate that our chemical representation can be used for iterations that simultaneously determine molecular geometries and charges. Furthermore, our representation can be used in combination with many other conventional atomic representations.[2−5,7,13] Thus, the prediction capability of the above-mentioned trained model may be improved for the deviation of molecular structures and molecules never included in training data. In addition to constructing precise big data for molecular structures, the development of atomic representations might assist in the high-throughput design of molecular models that include intra- and intermolecular interaction parameters without computationally expensive first-principles simulation techniques.

32 in total

1. Universal variational functionals of electron densities, first-order density matrices, and natural spin-orbitals and solution of the v-representability problem.

Authors: M Levy
Journal: Proc Natl Acad Sci U S A Date: 1979-12 Impact factor: 11.205

2. Accurate determination of crystal structures based on averaged local bond order parameters.

Authors: Wolfgang Lechner; Christoph Dellago
Journal: J Chem Phys Date: 2008-09-21 Impact factor: 3.488

3. SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries.

Authors: R Webster Homer; Jon Swanson; Robert J Jilek; Tad Hurst; Robert D Clark
Journal: J Chem Inf Model Date: 2008-12 Impact factor: 4.956

Screening toward the Development of Fingerprints of Atomic Environments Using Bond-Orientational Order Parameters.

Introduction

Methodology

Real-Space Fingerprint Using a Generic Local-Order Parameter

Machine Learning

Results and discussion

Conclusions

1. Universal variational functionals of electron densities, first-order density matrices, and natural spin-orbitals and solution of the v-representability problem.

2. Accurate determination of crystal structures based on averaged local bond order parameters.

3. SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries.

4. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.

5. Roles of icosahedral and crystal-like order in the hard spheres glass transition.

6. Structural evolution in the aging process of supercooled colloidal liquids.

Review 7. Representing potential energy surfaces by high-dimensional neural network potentials.

8. Searching for local order parameters to classify water structures at triple points.

9. Mining of effective local order parameters for classifying crystal structures: A machine learning study.

10. The microscopic pathway to crystallization in supercooled liquids.