tmQM Dataset-Quantum Geometries and Properties of 86k Transition Metal Complexes.

David Balcells¹, Bastian Bjerkem Skjelstad².

Abstract

We report the transition metal quantum mechanics (tmQM) data set, which contains the geometries and properties of a large transition metal-organic compound space. tmQM comprises 86,665 mononuclear complexes extracted from the Cambridge Structural Database, including Werner, bioinorganic, and organometallic complexes based on a large variety of organic ligands and 30 transition metals (the 3d, 4d, and 5d from groups 3 to 12). All complexes are closed-shell, with a formal charge in the range {+1, 0, -1}e. The tmQM data set provides the Cartesian coordinates of all metal complexes optimized at the GFN2-xTB level, and their molecular size, stoichiometry, and metal node degree. The quantum properties were computed at the DFT(TPSSh-D3BJ/def2-SVP) level and include the electronic and dispersion energies, highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energies, HOMO/LUMO gap, dipole moment, and natural charge of the metal center; GFN2-xTB polarizabilities are also provided. Pairwise representations showed the low correlation between these properties, providing nearly continuous maps with unusual regions of the chemical space, for example, complexes combining large polarizabilities with wide HOMO/LUMO gaps and complexes combining low-energy HOMO orbitals with electron-rich metal centers. The tmQM data set can be exploited in the data-driven discovery of new metal complexes, including predictive models based on machine learning. These models may have a strong impact on the fields in which transition metal chemistry plays a key role, for example, catalysis, organic synthesis, and materials science. tmQM is an open data set that can be downloaded free of charge from https://github.com/bbskjelstad/tmqm.

Year: 2020 PMID: 33166143 PMCID: PMC7768608 DOI: 10.1021/acs.jcim.0c01041

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956

Introduction

Machine learning (ML) is revolutionizing several research fields in which chemistry plays a central role.[1−4] By minimizing the error relative to reference data (i.e., training data set), ML algorithms deliver predictive models mapping a set of descriptors (i.e., features) into one or more properties of interest (i.e., targets). These models can robustly handle data sets that are both very large and complex and, once compiled, can compute accurate predictions on a simple laptop within a fraction of a second. The fast execution of ML predictions enables the exploration of the vast chemical compound space (CCS)[5−7] with different approaches, including multi-objective optimization[8] and inverse design.[9−11] Neural networks[12−16] and other ML models have been used successfully in a wide range of applications, with numerous examples in materials science[17−21] and drug discovery.[22−26] ML and data-driven approaches are also making rapid progress in catalytic,[27−41] organic,[42−47] inorganic,[48,49] and theoretical[50−56] chemistry. Despite the high potential of ML, a major challenge in its application is the need for big data sets for the training and validation of the models. There are fields of high interest, for example, catalysis, in which the size and scope of the experimental data are limited. An efficient solution is to use computational results as training data.[57−60] This is one of the fundamental concepts underlying quantum-based ML (QML),[61] in which the ML models are trained with data from quantum mechanical (QM) calculations. QML models are used to predict highest occupied molecular orbital (HOMO)/lowest unoccupied molecular orbital (LUMO) energies and gaps, dipole moments, polarizabilities, and other quantum properties governing the macroscopic behavior of chemical systems.
State-of-the-art QML models, including atomistic[62] and message-passing neural networks,[63] yield predictions approaching chemical accuracy.[64] However, the training of these models requires quantum data sets that must be large and comprehensive to avoid overfitting and to ensure the unbiased exploration of the CCS. These data sets are scarce, and their generation remains hampered by the high computational cost of quantum mechanics calculations, thus limiting the scope of QML. Quantum data set examples include the Materials Project,[65] PubChemQC,[66] and the GDB13[67]-based QM series for organic chemistry (QM7,[54] QM7b,[68] QM8,[69,70] and QM9[71]). Ab initio molecular dynamics trajectories and off-equilibrium conformations are also available from the MD17[72] and ANI-1[73] data sets, respectively. Quantum data sets for transition metal (TM) complexes cover either small[27] or large but specific[8] regions of the chemical space. Other data-driven approaches to organometallic chemistry have focused on the isolated ligands.[74] We herein report the transition metal quantum mechanics data set (tmQM), which contains a curated collection of TM compounds, including Werner, bioinorganic, and organometallic complexes. The computational protocol used in the generation of the tmQM data set consists of filtering structures from the Cambridge Structural Database (CSD), followed by xTB geometry optimizations and density functional theory (DFT) single points (Figure 1). In total, tmQM contains 86,665 complexes extracted from the CSD, representing the diversity of the TM–organic chemical space with a large variety of organic ligands bound to all the 3d, 4d, and 5d TMs from groups 3 to 12. tmQM provides the Cartesian coordinates optimized at the GFN2-xTB level and a set of quantum properties computed at the DFT(TPSSh-D3BJ/def2-SVP) level, including the electronic and dispersion energies, metal center natural charge, HOMO/LUMO energies and gap, and dipole moment. Polarizabilities are also provided at the GFN2-xTB level. The pairwise representations of the properties revealed unusual regions within the CCS, for example, TM complexes with large polarizabilities and wide HOMO/LUMO gaps.

Figure 1

Computational protocol used to generate the tmQM data set. CSD = Cambridge Structural Database; xTB = extended tight-binding; DFT = density functional theory; μ = dipole moment; α = polarizability; q = charge.

The tmQM data set will enable the training of ML models, which can be exploited in the data-driven discovery of new catalysts and functional materials. Traditional predictive models, including multivariate linear regression[75−77] and quantitative structure–activity relationships,[78,79] will also benefit from the availability of the tmQM data set, which can be downloaded free of charge from https://github.com/bbskjelstad/tmqm and http://quantum-machine.org/datasets/.

Chemical Subspace Extracted from the CSD

The tmQM data set consists entirely of structures extracted from the 2020 release of the CSD by using the seven filters listed below. The filters were implemented by means of the CSD Python API.

1. Composition filter (metal elements): Excluded all structures except those containing a single TM.[80]
2. Composition filter (nonmetal elements): Excluded all structures except those containing at least one C atom and one H atom. The other elements allowed in the structures were as follows: B, Si, N, P, As, O, S, Se, F, Cl, Br, and I.
3. Component filter: Excluded the structure of all molecular components, except that of the metal complex.
4. Polymer filter: Excluded all polymeric structures.
5. Spatial coordinates filter: Excluded all structures without three-dimensional coordinates.
6. Disorder filter: Excluded all structures with disordered atoms.
7. Charge filter: Excluded all structures with a charge higher than +1 or lower than −1.

Filters 1–2 extract mononuclear TM–organic compounds from the CSD, including Werner, organometallic, and bioinorganic complexes. Filter 3 removes the solvent molecules and counterions that are found in many crystal structures. Filters 4–6 ensure the correctness of the structures passed to the software used in the QM calculations. Filter 7 removes highly charged species, which may cause charge-separation artifacts in the gas-phase QM calculations. In total, 116,332 structures were extracted from the CSD with filters 1–7. Figure 2 shows the distribution of different molecular properties over the TM series. The number of bonds involving the metal center (Figure 2A) peaks at 4, 5, and 6 (31, 12, and 33% of the total, respectively). The latter is the most abundant instance and dominates with most TMs. Notable exceptions to this trend are Ni, Pd, Pt, and Cu, which show a preference for making four bonds. These observations can be associated with the prevalence of the tetrahedral (4 bonds), square planar (4 bonds), trigonal bipyramidal (5 bonds), square pyramidal (5 bonds), and octahedral (6 bonds) coordination geometries. However, it should be noted that the number of metal bonds was extracted from the connectivity table of the CSD mol2 files. Thus, this number is equal to the degree of the metal center node in the molecular graph of the complex, which is not necessarily equal to the coordination number.[81] For example, the η5-Cp ligand counts as five bonds but, in an octahedral complex and from a molecular orbital perspective, it occupies only three coordination sites. With Ti and other early TMs forming stable arene complexes, 8 is one of the most abundant metal bond counts (i.e., octahedral complexes with three monodentate ligands and one Cp ligand). In contrast, at the extreme of the late TM groups, the number of metal bonds peaks at the lowest possible values. For example, 2 is the most common metal bond count with Au.
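The seven CSD filters described above can be sketched as a single predicate over a simplified record. This is only an illustration: the `Entry` class and its fields are hypothetical stand-ins, it does not use the CSD Python API, and it assumes the component filter (filter 3) has already isolated the metal complex. Taking La as the 5d group 3 element is also an assumption.

```python
from __future__ import annotations
from dataclasses import dataclass

# The 30 transition metals of groups 3-12 (3d, 4d, 5d).
# Assumption: La is used as the 5d group 3 element.
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "La", "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}
# Nonmetal elements allowed by filter 2.
ALLOWED_NONMETALS = {"C", "H", "B", "Si", "N", "P", "As",
                     "O", "S", "Se", "F", "Cl", "Br", "I"}


@dataclass
class Entry:
    """Hypothetical record for one metal complex (filter 3 already applied)."""
    elements: list[str]        # one element symbol per atom
    charge: int                # total charge of the complex
    is_polymeric: bool
    has_3d_coordinates: bool
    has_disorder: bool


def passes_filters(entry: Entry) -> bool:
    """Return True if the entry survives filters 1, 2, and 4-7."""
    metals = [s for s in entry.elements if s in TRANSITION_METALS]
    if len(metals) != 1:                       # filter 1: exactly one TM
        return False
    others = [s for s in entry.elements if s not in TRANSITION_METALS]
    if any(s not in ALLOWED_NONMETALS for s in others):
        return False                           # filter 2: allowed elements only
    if "C" not in others or "H" not in others:
        return False                           # filter 2: at least one C and one H
    if entry.is_polymeric:                     # filter 4
        return False
    if not entry.has_3d_coordinates:           # filter 5
        return False
    if entry.has_disorder:                     # filter 6
        return False
    return abs(entry.charge) <= 1              # filter 7: charge in {-1, 0, +1}
```

For instance, a ferrocene-like record (one Fe, ten C, ten H, neutral, ordered, with coordinates) passes, whereas a dinuclear or doubly charged record does not.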

Figure 2

Distributions over the 3–5d TM series by (A) metal node degree, i.e., number of bonds to the metal, (B) molecular charge q, and (C) size in number of atoms. The insets show the totals. The data are for the 116,332 structures extracted from the CSD with filters 1–7.

Figure 2 also shows the distribution of the TM complex charges (Figure 2B) and sizes (Figure 2C). The former distribution clearly shows the dominance of q = 0 for all TMs, without exception, with the neutral complexes comprising 82% of the total. The molecular size distribution, measured in number of atoms, is balanced between the small (1–50 atoms) and medium-size (50–100 atoms) classes, which include 34 and 57% of the total, respectively. The large class (>100 atoms) includes a smaller portion of structures (9%) and is the smallest fraction for all TMs.

Figure 3 reflects the strong organic component of the TM complexes extracted from the CSD. C and H account for 87% of the chemical composition of the entire space (Figure 3D). After these two elements, N, O, P, Cl, and F are, in this order, the most abundant. These elements are found in the most common ligands, including amines, carboxylates, heterocycles, phosphines, and halides. The nature of the chemical space was also explored by computing Morgan fingerprints, using radius = 3 and a large number of bits (i.e., 32,768) to avoid hash collisions. The connectivity needed to generate the fingerprints is available from the CSD and can be retrieved by using the CSD code provided for all entries of the tmQM data set. Figure 3E shows the 30 most abundant fingerprints, which account for conjugated C–C bonds (e.g., bits 21,860 and 24,401), aromatic rings based on C (e.g., 15,535 and 1947) and N (e.g., 22,946), amines (e.g., 23,463), and other fragments that are commonly found in organic ligands. Other groups and ligands, including chloride, alkoxy, oxo, and phosphines, can also be easily recognized in fingerprints 18,067, 25,271, 31,370, and 2049, respectively.
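The choice of 32,768 bits to avoid hash collisions can be illustrated with a short estimate. The sketch below assumes that each distinct atom environment is hashed uniformly and independently onto the bit vector, which is an idealization rather than a description of any specific fingerprint implementation. The count of 50 environments per complex is a hypothetical value chosen only for illustration.

```python
def expected_colliding_pairs(n_fragments: int, n_bits: int) -> float:
    """Expected number of fragment pairs that land on the same bit,
    assuming uniform, independent hashing (idealized model)."""
    return n_fragments * (n_fragments - 1) / (2 * n_bits)


def collision_probability(n_fragments: int, n_bits: int) -> float:
    """Probability that at least two fragments share a bit
    under the same uniform-hashing assumption."""
    p_all_unique = 1.0
    for k in range(n_fragments):
        p_all_unique *= (n_bits - k) / n_bits
    return 1.0 - p_all_unique


if __name__ == "__main__":
    n = 50  # hypothetical number of distinct environments in one complex
    for bits in (2048, 32768):
        print(bits, expected_colliding_pairs(n, bits), collision_probability(n, bits))
```

Under this model, moving from 2,048 to 32,768 bits reduces the expected number of colliding pairs by a factor of 16, since the estimate scales inversely with the number of bits.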

Figure 3

Composition by the number of non-TM atoms in the chemical formula (D), with the inset excluding C and H, and the 30 most abundant Morgan fingerprints (E). The data are for the structures extracted from the CSD with filters 1–7.[82] Fingerprint legend: All nonlabeled atoms are C, and the gray fragments show the fingerprint connectivity but are not part of it; fingerprint label = bit number, blue circle = central atom in the fingerprint, yellow circle = aromatic atom, star = arbitrary atom.

Figures 4 and 5, which show one random example for each of the 30 TM elements, give a glimpse of the vast diversity of the chemical space extracted from the CSD. The 30 complexes in the two figures (i.e., a mere 0.03% of the full space) include 48 ligands, which are bound to the metal center in five different coordination modes (κ1, κ2, κ3, η2, and η5), four different coordination numbers (2, 4, 5, and 6), and six different coordination geometries (linear, tetrahedral, square planar, trigonal bipyramidal, square pyramidal, and octahedral). Interestingly, the further extension of these variables by considering all the 116,332 structures extracted with filters 1–7 would allow for a combinatorial explosion yielding a massive number of TM complexes. Thus, despite the large size of the CSD, this database represents a minuscule fraction of the full TM–organic compound space, which also underlines the need for predictive models enabling the efficient exploration of this vast space.

Figure 4

Randomly selected structures, and their CSD codes, for each TM in groups 3–7. The selection was made among the 116,332 structures extracted from the CSD with filters 1–7.

Figure 5

Randomly selected structures and their CSD codes for each TM in groups 8–12. The selection was made among the 116,332 structures extracted from the CSD with filters 1–7.

Quantum Geometries and Properties

The structures of the TM complexes extracted from the CSD with filters 1–7 were used as the basis to construct the tmQM data set. The advantage of using the CSD as the source of structures is that the TM complexes in the resulting data set can be accessed experimentally through documented synthesis procedures. Thus, ML models trained with the tmQM data set will embed synthetic accessibility in their internal representations used for prediction and generation tasks. The CSD structures were fully optimized in gas phase with the extended tight-binding xTB method.[83] The second-generation parametrization for geometries, frequencies, and noncovalent interactions (GFN2-xTB[84]) was used. The GFN2-xTB parametrization is less empirical than GFN1, and it was proven to be more robust in geometry optimization.[85] The tight optimization level was used in the GFN2-xTB calculations to set the convergence thresholds to 1 × 10⁻⁶ Eh (energy) and 8 × 10⁻⁴ Eh·α⁻¹ (gradient). The calculations were carried out with the xtb program. Before passing the geometries to the software used for the DFT calculations, the following three filters were applied:

1. Convergence filter: Excluded all geometries that did not reach the convergence thresholds.
2. Geometry quality: The GFN2-xTB-optimized geometries were ranked based on their deviation from the initial CSD crystal structure. The deviation was measured for each geometry by computing a structure quality index Sq with eq 1,

$$S_q = \frac{\sum_{n}^{ocyc} \lVert d_n \rVert}{N_{At} \, R} \tag{1}$$

in which the norm of the displacement (dn) is summed over all optimization cycles (ocyc) and divided by the size of the system in atoms (NAt) and the CSD R factor. The 7% of geometries yielding the largest Sq values were excluded.[85]
3. Electron-count filter: Excluded all structures with an odd number of electrons.

The first two filters excluded geometries with major flaws (e.g., erroneous coordination number and geometry). The third filter excluded all TM complexes that are forced to have an open-shell ground state due to an odd number of electrons (i.e., 22,325 of the 116,332 structures extracted from the CSD). This filter avoids the errors and high computational cost associated with QM calculations on open-shell systems. In total, 86,699 geometries passed filters 1–3.

The GFN2-xTB optimized Cartesian coordinates of all TM complexes are included in the tmQM data set. By using chemoinformatics software like RDKit, molSimplify, and Open Babel, these coordinates can be easily transformed into features for ML models, including Morgan fingerprints,[86] SMILES,[87−89] and autocorrelation functions.[90] All geometries are provided together with their CSD code, molecular size, charge, spin multiplicity, stoichiometry, and metal node degree (i.e., number of bonds involving the metal center).

The quantum properties of the tmQM data set were obtained from single-point calculations at the DFT level on the GFN2-xTB optimized geometries. All properties were computed for the closed-shell singlet state. The calculations were performed in gas phase with the hybrid meta-GGA TPSSh functional[91] and the double-ζ polarized def2-SVP basis set,[92] including effective core potentials for Z > 36. Dispersion was introduced by means of the D3BJ model.[93] The calculations were carried out with the Gaussian16 program, using the ultrafine pruned (99,590) grid for high numerical accuracy. This level of theory was used to compute the following properties: electronic and dispersion energies, HOMO and LUMO energies, HOMO/LUMO gap, dipole moment, and metal center charge, which was derived from natural population analysis.[94] In total, the computation of the quantum properties converged for 86,665 TM complexes. In addition to the GFN2-xTB geometries, the tmQM data set provides these DFT properties for all TM complexes. Polarizabilities are also provided at the GFN2-xTB level based on the self-consistent D4 model using a Gaussian-weighting scheme.[84]
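The geometry-quality and electron-count filters can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the input layout (a mapping from CSD code to Sq), and the rounding rule used to turn 7% into a whole number of structures are all hypothetical. The per-cycle displacement norms are taken as given rather than read from any particular optimizer output.

```python
def structure_quality_index(displacement_norms, n_atoms, r_factor):
    """Sq: displacement norms summed over all optimization cycles,
    divided by the number of atoms and by the CSD R factor."""
    if n_atoms <= 0 or r_factor <= 0:
        raise ValueError("n_atoms and r_factor must be positive")
    return sum(displacement_norms) / (n_atoms * r_factor)


def drop_worst_fraction(sq_by_code, fraction=0.07):
    """Return the codes that remain after removing the given fraction
    of structures with the largest Sq values.
    Assumption: the number removed is round(fraction * total)."""
    n_drop = round(fraction * len(sq_by_code))
    ranked = sorted(sq_by_code, key=sq_by_code.get)  # ascending Sq
    return ranked[: len(ranked) - n_drop]


def has_even_electron_count(atomic_numbers, charge):
    """Electron-count filter: total electrons = sum of Z minus charge."""
    return (sum(atomic_numbers) - charge) % 2 == 0
```

As a usage example, a neutral complex with one Fe (Z = 26), ten C, and ten H atoms has 96 electrons and is kept, whereas its monocation has 95 electrons and would be excluded.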

Pairwise Property Representations

The nature of the tmQM data set was explored by representing quantum property pairs in scatter plots. Figure 6 includes a selection of four plots showing the poor correlation between the HOMO/LUMO gap and the polarizability (Figure 6F) and between the metal natural charge and the dipole moment (Figure 6G), the HOMO energy (Figure 6H), and the LUMO energy (Figure 6I). The plots have blob shapes with an almost continuous variation of the two properties represented in each case. This lack of correlation was also observed in the pairwise representations of the HOMO/LUMO gap versus the dipole moment (Figure S1), polarizability versus dipole moment (Figure S2), HOMO/LUMO gap versus metal center natural charge (Figure S3), and polarizability versus metal center natural charge (Figure S4). Interestingly, these representations also show that unusual regions of the chemical space have small, yet significant, populations, for example, complexes with large polarizabilities and wide HOMO/LUMO gaps, complexes with small dipole moments and highly charged metal centers, complexes with low HOMO energies and electron-rich metal centers, and complexes with high LUMO energies and electron-poor metal centers.
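The weak correlation seen in the scatter plots could be quantified with a linear correlation coefficient. The sketch below computes the Pearson coefficient for two property lists using only the standard library. Its use here is a suggestion for readers working with the data, not a statistic reported in the article.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length
    sequences. Raises ValueError for mismatched or too-short input,
    or when either sequence has zero variance."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)
```

A value near 0 would correspond to the blob-shaped plots described above, while values near ±1 would indicate a strong linear relationship between the two properties.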

Figure 6

Pairwise correlations, with color gradients based on property values; α vs HOMO/LUMO gap (F), μ vs qM (G), HOMO energy vs qM (H), and LUMO energy vs qM (I). Level of theory: TPSSh-D3BJ/def2-SVP, except GFN2-xTB for α.

The pairwise scatters were also plotted by using the color of the data points to encode the periodic table group of the metal center (Figure 7). For the sake of clarity, the plots were divided into two sets, one accounting for groups 3, 5, 7, 9, and 11, and one accounting for groups 4, 6, 8, 10, and 12. The data points were added to the scatter plots in a random order; that is, regions with a dominant color are mostly associated with a given TM group. Most of the plots have no color structure, that is, any metal can give any combination of properties with the appropriate choice of ligands. This is the case, for instance, of polarizability versus dipole moment (Figure 7J). However, there are property pairs with some structure, for example, the HOMO versus LUMO energies (Figure 7K), in which group 12 yields the largest gaps. The most structured property pairs are those involving the metal natural charge, with the scatter plots yielding color bands (Figure 7L,M). Following the periodic trends, the groups closest to the d0 configuration, or exceeding the d10 configuration, yielded the highest positive charges, whereas the groups closest to the d10 configuration yielded the lowest (most negative) charges. More pairwise representations of the quantum properties are available in the Supporting Information (Figures S5–S8).

Figure 7

Pairwise correlations colored by the periodic table group; α vs μ (J), LUMO vs HOMO energies (K), HOMO/LUMO gap vs qM (L), and α vs qM (M). Level of theory: TPSSh-D3BJ/def2-SVP, except GFN2-xTB for α.

Data Benchmarks

The tmQM data set was assessed by performing three different benchmarks for a set of quantum properties including the metal center natural charge (qM), dipole moment (μ), HOMO/LUMO gap, and polarizability (α).

Benchmark 1: The qM, μ, and HOMO/LUMO gaps of the GFN2-xTB-optimized geometries, computed at the TPSSh-D3BJ/def2-SVP level, were compared to their values recomputed at the B2PLYP-D3/def2-SVP level.[95]
Benchmark 2: The qM, μ, and HOMO/LUMO gaps of the GFN2-xTB-optimized geometries, computed at the TPSSh-D3BJ/def2-SVP level, were compared to their values recomputed after reoptimizing the geometries at the same TPSSh-D3BJ/def2-SVP level.
Benchmark 3: The α of the GFN2-xTB-optimized geometries, computed at the same GFN2-xTB level, were compared to their values recomputed at the TPSSh-D3BJ/def2-SVP level.

Benchmark 1 showed how the quantum properties vary upon lifting the DFT level from the meta-GGA TPSSh hybrid functional to the B2PLYP-D3 double-hybrid functional. Benchmark 2 showed how sensitive the quantum properties are to the level of theory used in the geometry optimization of the CSD structures, by comparing GFN2-xTB to DFT(TPSSh-D3BJ). Benchmark 3 showed the deviation of the GFN2-xTB polarizabilities relative to the DFT(TPSSh) values. Table 1 gives the mean absolute error (MAE) and r² score for each benchmark.[96]

Table 1

Data Benchmarks and Their Associated MAE (in Atomic Units, Except for μ, in D) and r² Scores

property	q_M		μ		gap		α
benchmark	MAE	r²	MAE	r²	MAE	r²	MAE	r²
1 (B2PLYP-D3)	0.12	0.99	0.53	0.98	0.124	0.69
2 (DFT-Opt)	0.05	0.99	0.56	0.94	0.007	0.92
3 (DFT-α)							19.8	0.81

Table 1 shows that in both benchmarks 1 and 2, qM and μ yielded the smallest MAEs, with r² → 1. The largest deviations were found for the HOMO/LUMO gaps, in line with the strong dependence of this property on the theory levels used in the single-point and geometry optimization calculations. This scenario is illustrated for benchmark 1 with qM (Figure 8N) and the HOMO/LUMO gap (Figure 8O). However, despite the larger uncertainty of the HOMO/LUMO gap relative to qM, the pairwise correlations of these two properties at the TPSSh-D3BJ (Figure 8P) and B2PLYP-D3 (Figure 8Q) levels have essentially the same shapes, with three adjacent clusters at qM ≈ −1.50e, −0.75e, and 0.50e that increase in size from qM ≈ −2 to qM ≈ +2. In benchmark 3, the deviation of the GFN2-xTB α values relative to the DFT(TPSSh-D3BJ) values is larger than those of qM and μ in benchmarks 1 and 2, though significantly smaller than that of the HOMO/LUMO gap in benchmark 1. More pairwise representations of the quantum property benchmarks are provided in the Supporting Information (Figures S9–S12).
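The two metrics reported in Table 1 can be sketched in a few lines. The MAE definition is standard. For r², the sketch assumes the coefficient of determination, which is only one possible reading because the exact definition is not spelled out in the text above. Which level of theory plays the role of the reference is likewise an assumption left to the caller.

```python
def mean_absolute_error(reference, predicted):
    """Mean of |reference_i - predicted_i| over all samples."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def r2_score(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Assumption: this is the r2 definition intended in Table 1."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    if ss_tot == 0:
        raise ValueError("r2 undefined for a constant reference")
    return 1.0 - ss_res / ss_tot
```

With this definition a small MAE does not guarantee a high r², since r² also depends on the spread of the reference values. This may help explain how the gap in benchmark 2 combines the smallest MAE in Table 1 with an r² of 0.92.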

Figure 8

Pairwise property correlations from the B2PLYP-D3 benchmark 1; qM (N), HOMO/LUMO gap (O), TPSSh-D3BJ HOMO/LUMO gap vs qM (P), and B2PLYP-D3 HOMO/LUMO gap vs qM (Q).

Data Availability

tmQM is an open data set freely available at GitHub (https://github.com/bbskjelstad/tmqm) and from Quantum-Machine (http://quantum-machine.org/datasets/). Quantum features, geometries and properties computed at the GFN2-xTB and TPSSh-D3BJ/def2-SVP levels of theory are provided in the xyz and csv file formats.
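A minimal sketch for reading geometries from an xyz file is shown below. It assumes the conventional XYZ layout (an atom-count line, a comment line, then one `symbol x y z` line per atom), with several structures concatenated and optionally separated by blank lines. The exact layout and comment-line contents of the tmQM files should be checked against the repository, and the comment line is returned here without interpretation.

```python
def parse_xyz_blocks(text):
    """Parse concatenated XYZ blocks from a string.
    Returns a list of dicts with the raw comment line and a list of
    (symbol, x, y, z) tuples. Assumes the conventional XYZ layout."""
    lines = text.splitlines()
    molecules = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():       # skip blank separator lines
            i += 1
            continue
        n_atoms = int(lines[i].strip())
        if i + 1 >= len(lines):
            raise ValueError("missing comment line")
        comment = lines[i + 1].strip()
        atoms = []
        for row in lines[i + 2 : i + 2 + n_atoms]:
            symbol, x, y, z = row.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        if len(atoms) != n_atoms:
            raise ValueError("truncated XYZ block")
        molecules.append({"comment": comment, "atoms": atoms})
        i += 2 + n_atoms
    return molecules
```

The function takes the file contents as a string, so a file would first be read with `open(path).read()`, where `path` is wherever the downloaded xyz file is stored locally.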

Conclusions

This article reported the tmQM data set, which provides the quantum geometries and properties of a large number of TM complexes. The complexes were extracted from the CSD with a series of filters imposing constraints on chemical composition, structure, and charge. After optimization at the GFN2-xTB level, additional filters were applied to control geometry quality and electronic structure. A total of 86k TM complexes passed these filters and were included in the tmQM data set after computing their quantum properties at the TPSSh-D3BJ/def2-SVP level, including the electronic and dispersion energies, HOMO/LUMO energies and gap, dipole moment, and metal center natural charge. Polarizabilities are also provided at the GFN2-xTB level. The pairwise representations of these properties allowed for mapping regions of the chemical space with unusual properties, for example, TM complexes combining electron-rich metal centers with low HOMO energies. The tmQM data set, which is open and freely available at https://github.com/bbskjelstad/tmqm, will enable the training of ML models for the discovery of new molecular materials based on TMs.
