David Balcells1, Bastian Bjerkem Skjelstad2. 1. Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033, Blindern, 0315 Oslo, Norway. 2. Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Sapporo 001-0021, Japan.
Abstract
We report the transition metal quantum mechanics (tmQM) data set, which contains the geometries and properties of a large transition metal-organic compound space. tmQM comprises 86,665 mononuclear complexes extracted from the Cambridge Structural Database, including Werner, bioinorganic, and organometallic complexes based on a large variety of organic ligands and 30 transition metals (the 3d, 4d, and 5d from groups 3 to 12). All complexes are closed-shell, with a formal charge in the range {+1, 0, -1}e. The tmQM data set provides the Cartesian coordinates of all metal complexes optimized at the GFN2-xTB level, and their molecular size, stoichiometry, and metal node degree. The quantum properties were computed at the DFT(TPSSh-D3BJ/def2-SVP) level and include the electronic and dispersion energies, highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energies, HOMO/LUMO gap, dipole moment, and natural charge of the metal center; GFN2-xTB polarizabilities are also provided. Pairwise representations showed the low correlation between these properties, providing nearly continuous maps with unusual regions of the chemical space, for example, complexes combining large polarizabilities with wide HOMO/LUMO gaps and complexes combining low-energy HOMO orbitals with electron-rich metal centers. The tmQM data set can be exploited in the data-driven discovery of new metal complexes, including predictive models based on machine learning. These models may have a strong impact on the fields in which transition metal chemistry plays a key role, for example, catalysis, organic synthesis, and materials science. tmQM is an open data set that can be downloaded free of charge from https://github.com/bbskjelstad/tmqm.
Machine
learning (ML) is revolutionizing several research fields
in which chemistry plays a central role.[1−4] By minimizing the error relative to reference
data (i.e., training data set), ML algorithms deliver predictive models
mapping a set of descriptors (i.e., features) into one or more properties
of interest (i.e., targets). These models can robustly handle data
sets that can be both very large and complex and, once compiled, can
compute accurate predictions on a simple laptop within a fraction
of a second. The fast execution of ML predictions enables the exploration
of the vast chemical compound space (CCS)[5−7] with different
approaches, including multi-objective optimization[8] and inverse design.[9−11] Neural networks[12−16] and other ML models have been used successfully in a wide range
of applications, with numerous examples in materials science[17−21] and drug discovery.[22−26] ML and data-driven approaches are also making rapid progress in
catalytic,[27−41] organic,[42−47] inorganic,[48,49] and theoretical[50−56] chemistry.

Despite the high potential of ML, a major challenge
in its application
is the need for big data sets for the training and validation of the
models. There are fields of high interest, for example, catalysis,
in which the size and scope of experimental data are small. An efficient
solution is to use computational results as training data.[57−60] This is one of the fundamental concepts underlying quantum-based
ML (QML),[61] in which the ML models are
trained with data from quantum mechanical (QM) calculations. QML models
are used to predict highest occupied molecular orbital (HOMO)/lowest
unoccupied molecular orbital (LUMO) energies and gaps, dipole moments,
polarizabilities, and other quantum properties governing the macroscopic
behavior of chemical systems. State-of-the-art QML models, including
atomistic[62] and message-passing neural
networks,[63] yield predictions approaching
chemical accuracy.[64] However, the training
of these models requires quantum data sets that must be large and
comprehensive to avoid overfitting and to ensure the unbiased exploration
of the CCS. These data sets are scarce, and their generation remains
hampered by the high computational cost of quantum mechanics calculations,
thus limiting the scope of QML. Quantum data set examples include
the Materials Project,[65] PubChemQC,[66] and the GDB13[67]-based
QM series for organic chemistry (QM7,[54] QM7b,[68] QM8,[69,70] and QM9[71]). Ab initio molecular dynamics
trajectories and off-equilibrium conformations are also available
from the MD17[72] and ANI-1[73] data sets, respectively. Quantum data sets for transition
metal (TM) complexes cover either small[27] or large but specific[8] regions of the
chemical space. Other data-driven approaches to organometallic chemistry
have focused on the isolated ligands.[74]

We herein report the transition metal quantum mechanics data set
(tmQM), which contains a curated collection of TM compounds, including
Werner, bioinorganic and organometallic complexes. The computational
protocol used in the generation of the tmQM data set consists of filtering
structures from the Cambridge Structural Database (CSD), followed
by xTB geometry optimizations and density functional theory (DFT)
single points (Figure 1). In total, tmQM contains 86,665 complexes extracted from the CSD,
representing the diversity of the TM–organic chemical space
with a large variety of organic ligands bound to all the 3d, 4d, and
5d TMs from groups 3 to 12. tmQM provides the Cartesian coordinates
optimized at the GFN2-xTB level and a set of quantum properties computed
at the DFT(TPSSh-D3BJ/def2-SVP) level, including the electronic and
dispersion energies, metal center natural charge, HOMO/LUMO energies
and gap, and dipole moment. Polarizabilities are also provided at
the GFN2-xTB level. The pairwise representations of the properties
revealed unusual regions within the CCS, for example, TM complexes
with large polarizabilities and wide HOMO/LUMO gaps.
Figure 1
Computational protocol
used to generate the tmQM data set. CSD
= Cambridge Structural Database; xTB = extended tight-binding; DFT
= density functional theory; μ = dipole moment; α = polarizability; q = charge.
The tmQM data set will
enable the training of ML models, which
can be exploited in the data-driven discovery of new catalysts and
functional materials. Traditional predictive models, including multivariate
linear regression,[75−77] and quantitative structure–activity relationships,[78,79] will also benefit from the availability of the tmQM data set, which
can be downloaded free of charge from https://github.com/bbskjelstad/tmqm and http://quantum-machine.org/datasets/.
Chemical Subspace Extracted
from the CSD
The tmQM data
set fully comprises structures extracted from the 2020 release of
the CSD by using the seven filters listed below. The filters were
implemented by means of the CSD Python API.

1. Composition filter (metal elements): Excluded all structures except those containing a single TM.[80]
2. Composition filter (nonmetal elements): Excluded all structures except those containing at least one C atom and one H atom. The other elements allowed in the structures were B, Si, N, P, As, O, S, Se, F, Cl, Br, and I.
3. Component filter: Excluded the structure of all molecular components, except that of the metal complex.
4. Polymer filter: Excluded all polymeric structures.
5. Spatial coordinates filter: Excluded all structures without three-dimensional coordinates.
6. Disorder filter: Excluded all structures with disordered atoms.
7. Charge filter: Excluded all structures with a charge higher than +1 or lower than −1.

Filters 1–2 extract mononuclear TM–organic compounds from the CSD, including Werner, organometallic, and bioinorganic complexes. Filter 3 removes the solvent molecules and counterions that are found in many crystal structures. Filters 4–6 ensure the correctness of the structures passed to the software used in the QM calculations. Filter 7 removes highly charged species, which may cause charge-separation artifacts in the gas-phase QM calculations.

In total, 116,332 structures were extracted from the CSD with filters
1–7. Figure 2 shows the distribution of different molecular properties over the
TM series. The number of bonds involving the metal center (Figure 2A) peaks at 4, 5,
and 6 (31, 12, and 33% of the total, respectively). The latter is
the most abundant instance and dominates with most TMs. Notable exceptions
to this trend are Ni, Pd, Pt, and Cu, which show a preference for
making four bonds. These observations can be associated with the prevalence
of the tetrahedral (4 bonds), square planar (4 bonds), trigonal bipyramidal
(5 bonds), square pyramidal (5 bonds), and octahedral (6 bonds) coordination
geometries. However, it should be noted that the number of metal bonds
was extracted from the connectivity table of the CSD mol2 files. Thus,
this number is equal to the degree of the metal center node in the
molecular graph of the complex, which is not necessarily equal to
the coordination number.[81] For example,
the η⁵-Cp ligand counts five bonds but, in an octahedral
complex, and from a molecular orbital perspective, it only occupies
three coordination sites. With Ti and other early TMs forming stable
arene complexes, 8 is one of the most abundant metal bond counts (i.e.,
octahedral complexes with three monodentate ligands and one Cp ligand).
In contrast, at the extreme of the late TM groups, the number of metal
bonds peaks at the lowest possible values. For example, 2 is the most
common metal bond count with Au.
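The difference between the metal node degree and the coordination number can be illustrated with a minimal Python sketch. It does not use the CSD Python API or parse mol2 files; the atom labels, the hand-written bond list, and the function name `metal_node_degree` are hypothetical and only mimic what a connectivity table might contain for a half-sandwich Ti fragment with one η⁵-Cp ligand and three chlorides.

```python
from collections import Counter


def metal_node_degree(bonds, metal):
    """Return the degree of the metal node, i.e. the number of bonds involving it."""
    degree = Counter()
    for atom_a, atom_b in bonds:
        degree[atom_a] += 1
        degree[atom_b] += 1
    return degree[metal]


# Hypothetical connectivity table (labels are illustrative, not CSD data).
cp_carbons = ["C1", "C2", "C3", "C4", "C5"]

# The eta5-Cp ligand contributes five Ti-C bonds to the molecular graph.
bonds = [("Ti1", carbon) for carbon in cp_carbons]
# Three monodentate chlorides contribute three Ti-Cl bonds.
bonds += [("Ti1", chloride) for chloride in ("Cl1", "Cl2", "Cl3")]
# The C-C bonds of the Cp ring do not involve the metal and are not counted.
bonds += [(cp_carbons[i], cp_carbons[(i + 1) % 5]) for i in range(5)]

print(metal_node_degree(bonds, "Ti1"))  # 8
```

In this toy graph the metal node degree is 8, although the Cp ligand would be assigned only three coordination sites from a molecular orbital perspective.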
Figure 2
Distributions over the 3–5d TM
series by (A) metal node
degree; i.e. number of bonds to the metal, (B) molecular charge q, and (C) size in number of atoms. The insets show the
totals. The data are for the 116,332 structures extracted from the
CSD with filters 1–7.
The figure also shows the distribution of the TM complex charges
(Figure 2B) and sizes
(Figure 2C). The former
distribution clearly shows the dominance of q = 0
for all TMs, without any exception, and with the neutral complexes
comprising 82% of the total. The molecular size distribution, measured
in number of atoms, is balanced between the small (1–50 atoms)
and medium-size (50–100 atoms) classes, which include 34 and
57% of the total, respectively. The large class (>100 atoms) includes
a smaller portion of structures (9%), being the smallest fraction
with all TMs.

Figure 3 reflects
the strong organic component of the TM complexes extracted from the
CSD. C and H account for 87% of the chemical composition of the entire
space (Figure 3D).
After these two elements, N, O, P, Cl, and F are, in this order, the
most abundant. These elements are found in the most common ligands,
including amines, carboxylates, heterocycles, phosphines, and halides.
The nature of the chemical space was also explored by computing Morgan
fingerprints, using radius = 3, and a large number of bits (i.e.,
32,768) to avoid hash collisions. The connectivity needed to generate
the fingerprints is available from the CSD and can be retrieved
by using the CSD code provided for all entries of the tmQM data set. Figure 3E shows the 30 most
abundant fingerprints, which account for conjugated C–C bonds
(e.g., bits 21,860 and 24,401), aromatic rings based on C (e.g., 15,535
and 1947) and N (e.g., 22,946), amines (e.g., 23,463), and other fragments
that are commonly found in organic ligands. Other groups and ligands,
including chloride, alkoxy, oxo, and phosphines, can also be easily
recognized in fingerprints 18,067, 25,271, 31,370, and 2049, respectively.
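The benefit of a long bit vector can be estimated with a short Python sketch. It is not the fingerprinting code used for this work; it assumes an idealized hash that places each distinct atom environment uniformly at random into one of the available bits, and the value of 150 environments per complex is a hypothetical figure chosen only for illustration.

```python
def collision_probability(n_environments, n_bits):
    """Probability that one environment shares its bit with at least one other.

    Assumes an idealized hash that assigns each distinct environment
    to a bit uniformly at random (an assumption, not a property of any
    specific fingerprint implementation).
    """
    if n_environments < 2:
        return 0.0
    return 1.0 - (1.0 - 1.0 / n_bits) ** (n_environments - 1)


# Hypothetical number of distinct atom environments in one complex.
n_environments = 150

for n_bits in (2048, 32768):
    p = collision_probability(n_environments, n_bits)
    print(f"{n_bits:>6} bits: collision probability per environment = {p:.4f}")
```

Under these assumptions, increasing the vector length from 2048 to 32,768 bits lowers the per-environment collision probability by roughly one order of magnitude.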
Figure 3
Composition
by the number of non-TM atoms in the chemical formula
(D), with the inset excluding C and H, and the 30 most abundant Morgan
fingerprints (E). The data are for the structures extracted from the
CSD with filters 1–7.[82] Fingerprint
legend: All nonlabeled atoms are C, and the gray fragments show the
fingerprint connectivity but are not part of it; fingerprint label
= bit number, blue circle = central atom in the fingerprint, yellow
circle = aromatic atom, star = arbitrary atom.
Figures 4 and 5, which show one random example for each of the
30 TM elements, give a glimpse of the vast diversity of the chemical
space extracted from the CSD. The 30 complexes in the two figures
(i.e., a mere 0.03% of the full space) include 48 ligands, which are
bound to the metal center in five different coordination modes (κ¹, κ², κ³, η², and η⁵), four different coordination numbers (2,
4, 5, and 6) and six different coordination geometries (linear, tetrahedral,
square planar, trigonal bipyramidal, square pyramidal, and octahedral).
Interestingly, the further extension of these variables by considering
all the 116,332 structures extracted with filters 1–7 would
allow for a combinatorial explosion yielding a massive number of TM
complexes. Thus, despite the large size of the CSD, this database
represents a minuscule fraction of the full TM–organic compound
space, which also underlines the need for predictive models enabling
the efficient exploration of this vast space.
Figure 4
Randomly selected structures,
and their CSD codes, for each TM
in groups 3–7. The selection was made among the 116,332 structures
extracted from the CSD with filters 1–7.
Figure 5
Randomly
selected structures and their CSD codes for each TM in
groups 8–12. The selection was made among the 116,332 structures
extracted from the CSD with filters 1–7.
Quantum Geometries and Properties
The structures of
the TM complexes extracted from the CSD with filters 1–7 were
used as the basis to construct the tmQM data set. The advantage of
using the CSD as the source of structures is that the TM complexes
in the resulting data set can be accessed experimentally through documented
synthesis procedures. Thus, ML models trained with the tmQM data set
will embed synthetic accessibility in their internal representations
used for prediction and generation tasks.

The CSD structures
were fully optimized in gas phase with the extended tight-binding
xTB method.[83] The second-generation parametrization
for geometries, frequencies, and noncovalent interactions (GFN2-xTB[84]) was used. The GFN2-xTB parametrization is less
empirical than the GFN1, and it was proven to be more robust in geometry
optimization.[85] The tight optimization
level was used in the GFN2-xTB calculations to set the convergence
thresholds to 1 × 10⁻⁶ Eh (energy) and 8 × 10⁻⁴ Eh·α⁻¹ (gradient). The calculations
were carried out with the xtb program. Before passing the geometries
to the software used for the DFT calculations, the following three
filters were applied:

1. Convergence filter: Excluded all geometries that did not reach the convergence thresholds.
2. Geometry quality: The GFN2-xTB-optimized geometries were ranked based on their deviation from the initial CSD crystal structure. The deviation was measured for each geometry by computing a structure quality index Sq with eq 1,

$$S_q = \frac{\sum_{n}^{\mathrm{ocyc}} d_n}{N_{\mathrm{At}} \cdot R} \qquad (1)$$

in which the norm of the displacement (dn) is summed over all optimization cycles (ocyc) and divided by the size of the system in atoms (NAt) and the CSD R factor. The 7% of geometries yielding the largest Sq values were excluded.[85]
3. Electron-count filter: Excluded all structures with an odd number of electrons.

The first two filters excluded geometries with major flaws (e.g., erroneous coordination number and geometry). The third filter excluded all TM complexes that are forced to have an open-shell ground state, due to an odd number of electrons (i.e., 22,325 of the 116,332 structures extracted from the CSD). This filter avoids the errors and high computational cost associated with QM calculations on open-shell systems. In total, 86,699 geometries passed filters 1–3.

The GFN2-xTB optimized Cartesian coordinates of all TM complexes
are included in the tmQM data set. By using chemoinformatics software
like RDKit, molSimplify, and Open Babel, these coordinates can be
easily transformed into features for ML models, including Morgan fingerprints,[86] SMILES,[87−89] and autocorrelation functions.[90] All geometries are provided together with their
CSD code, molecular size, charge, spin multiplicity, stoichiometry,
and metal node degree (i.e., number of bonds involving the metal center).
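The three post-optimization filters described earlier in this section (convergence, geometry quality, and electron count) can be sketched in Python as follows. This is an illustrative reading of the text, not the script used to build tmQM: the record fields, the complex labels, the order in which the filters are chained, and all numbers in the toy data are assumptions. The structure quality index is taken as the sum of the displacement norms divided by the number of atoms and the R factor.

```python
def structure_quality_index(displacement_norms, n_atoms, r_factor):
    """Sq: displacement norms summed over optimization cycles, / (NAt * R)."""
    return sum(displacement_norms) / (n_atoms * r_factor)


def apply_post_optimization_filters(records, excluded_fraction=0.07):
    """Return the records passing the three filters (assumed order: 1, 2, 3)."""
    # Filter 1: keep only converged optimizations.
    converged = [r for r in records if r["converged"]]

    # Filter 2: rank by Sq and drop the fraction with the largest values.
    ranked = sorted(
        converged,
        key=lambda r: structure_quality_index(
            r["d_norms"], r["n_atoms"], r["r_factor"]
        ),
    )
    n_excluded = int(excluded_fraction * len(ranked))
    kept = ranked[: len(ranked) - n_excluded]

    # Filter 3: keep only structures with an even number of electrons.
    return [r for r in kept if r["n_electrons"] % 2 == 0]


# Toy records with invented values (not taken from the CSD or tmQM).
records = [
    {"code": "cpx_A", "converged": True, "d_norms": [0.3, 0.2],
     "n_atoms": 10, "r_factor": 5.0, "n_electrons": 100},
    {"code": "cpx_B", "converged": True, "d_norms": [6.0, 4.0],
     "n_atoms": 10, "r_factor": 5.0, "n_electrons": 120},
    {"code": "cpx_C", "converged": False, "d_norms": [0.1],
     "n_atoms": 10, "r_factor": 5.0, "n_electrons": 90},
    {"code": "cpx_D", "converged": True, "d_norms": [1.0],
     "n_atoms": 10, "r_factor": 5.0, "n_electrons": 101},
    {"code": "cpx_E", "converged": True, "d_norms": [1.5],
     "n_atoms": 10, "r_factor": 5.0, "n_electrons": 110},
]

# A 25% cutoff is used here only because the toy set is tiny.
passed = apply_post_optimization_filters(records, excluded_fraction=0.25)
print([r["code"] for r in passed])  # ['cpx_A', 'cpx_E']
```

In the toy set, cpx_C fails the convergence filter, cpx_B has the largest Sq and is removed by the geometry quality filter, and cpx_D is removed for its odd electron count.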
The quantum properties of the tmQM data set were obtained from
single-point calculations at the DFT level on the GFN2-xTB optimized
geometries. All properties were computed for the closed-shell singlet
state. The calculations were performed in gas phase with the hybrid
meta-GGA TPSSh functional[91] and the double-ζ
polarized def2-SVP basis set,[92] including
effective core potentials for Z > 36. Dispersion
was introduced by means of the D3BJ model.[93] The calculations were carried out with the Gaussian16 program, using
the ultrafine pruned (99,590) grid for high numerical accuracy. This
level of theory was used to compute the following properties: electronic
and dispersion energies, HOMO and LUMO energies, HOMO/LUMO gap, dipole
moment, and metal center charge, which was derived from natural population
analysis.[94] In total, the computation of
the quantum properties converged for 86,665 TM complexes. In addition
to the GFN2-xTB geometries, the tmQM data set provides these DFT properties
for all TM complexes. Polarizabilities are also provided at the GFN2-xTB
level based on the self-consistent D4 model using a Gaussian-weighting
scheme.[84]
Pairwise Property Representations
The nature of the
tmQM data set was explored by representing quantum property pairs
in scatter plots. Figure 6 includes a selection of four plots showing the poor correlation
between the HOMO/LUMO gap and the polarizability (Figure 6F) and between the metal natural
charge and the dipole moment (Figure 6G), the HOMO energy (Figure 6H), and the LUMO energy (Figure 6I). The plots have blob shapes
with an almost continuous variation of the two properties represented
in each case. This lack of correlation was also observed in the pairwise
representations of the HOMO/LUMO gap versus the dipole moment (Figure S1), polarizability versus dipole moment
(Figure S2), HOMO/LUMO gap versus metal
center natural charge (Figure S3), and
polarizability versus metal center natural charge (Figure S4). Interestingly, these representations also show
that unusual regions of the chemical space have small, yet significant,
populations, for example, complexes with large polarizabilities and
wide HOMO/LUMO gaps, complexes with small dipole moments and highly
charged metal centers, complexes with low HOMO energies and electron-rich
metal centers, and complexes with high LUMO energies and electron-poor
metal centers.
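The lack of correlation discussed above was judged from the scatter plots. As a complementary illustration, a property pair could also be summarized with a Pearson correlation coefficient, which is a swapped-in metric and not part of the analysis reported here. The sketch below uses plain Python, and the two short lists are invented numbers that do not come from tmQM.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Invented values standing in for two properties of four complexes.
property_a = [1.0, 2.0, 3.0, 4.0]
property_b = [1.0, -1.0, -1.0, 1.0]

print(pearson_r(property_a, property_b))  # 0.0
```

A coefficient near zero only rules out a linear relationship, so it complements the visual inspection of the scatter plots rather than replacing it.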
Figure 6
Pairwise correlations, with color gradients based on property
values;
α vs HOMO/LUMO gap (F), μ vs qM (G), HOMO energy vs qM (H), and LUMO
energy vs qM (I). Level of theory: TPSSh-D3BJ/def2-SVP,
except GFN2-xTB for α.
The pairwise scatters were also plotted by using the color of the
data points to encode the periodic table group of the metal center
(Figure 7). For the
sake of clarity, the plots were divided into two sets, one accounting
for groups 3, 5, 7, 9, and 11, and one accounting for groups 4, 6,
8, 10, and 12. The data points were added to the scatter plots in
a random order; that is, regions with a dominant color are mostly associated
with a given TM group. Most of the plots have no color structure, that
is, any metal can give any combination of properties with the appropriate
choice of ligands. This is the case, for instance, of polarizability
versus dipole moment (Figure 7J). However, there are property pairs with some structure,
for example, the HOMO versus LUMO energies (Figure 7K), in which group 12 yields the largest
gaps. The most structured property pairs are those involving the metal
natural charge, with the scatter plots yielding color bands (Figure 7L,M). Following the
periodic trends, the groups closest to the d⁰ configuration,
or exceeding the d¹⁰ configuration, yielded the highest
positive charges, whereas the groups closest to the d¹⁰ configuration yielded the most negative charges. More pairwise
representations of the quantum properties are available in the Supporting Information (Figures S5–S8).
Figure 7
Pairwise
correlations colored by the periodic table group; α
vs μ (J), LUMO vs HOMO energies (K), HOMO/LUMO gap vs qM (L), and α vs qM (M). Level of theory: TPSSh-D3BJ/def2-SVP, except GFN2-xTB
for α.
Data Benchmarks
The tmQM data set was assessed by performing
three different benchmarks for a set of quantum properties including
the metal center natural charge (qM),
dipole moment (μ), HOMO/LUMO gap, and polarizability (α).

Benchmark 1: The qM, μ, and HOMO/LUMO gaps of the GFN2-xTB-optimized geometries, computed at the TPSSh-D3BJ/def2-SVP level, were compared to their values recomputed at the B2PLYP-D3/def2-SVP level.[95]
Benchmark 2: The qM, μ, and HOMO/LUMO gaps of the GFN2-xTB-optimized geometries, computed at the TPSSh-D3BJ/def2-SVP level, were compared to their values recomputed after reoptimizing the geometries at the same TPSSh-D3BJ/def2-SVP level.
Benchmark 3: The α values of the GFN2-xTB-optimized geometries, computed at the same GFN2-xTB level, were compared to their values recomputed at the TPSSh-D3BJ/def2-SVP level.

Benchmark 1 showed how the quantum properties vary upon
lifting the DFT level from the hybrid meta-GGA TPSSh functional to
the B2PLYP-D3 double-hybrid functional. Benchmark 2 showed how
sensitive the quantum properties are to the level of theory used in
the geometry optimization of the CSD structures, by comparing GFN2-xTB
to DFT(TPSSh-D3BJ). Benchmark 3 showed the deviation of the GFN2-xTB
polarizabilities relative to DFT(TPSSh). Table 1 gives the mean absolute error (MAE) and r2 score for each benchmark.[96]
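The two benchmark metrics can be written out in a few lines of Python. The sketch below is a hedged illustration: it assumes that the r2 score is the coefficient of determination and that the recomputed values act as the reference, neither of which is stated explicitly in this section, and the short lists contain invented numbers rather than benchmark data.

```python
def mean_absolute_error(reference, predicted):
    """MAE: average of the absolute deviations from the reference values."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def r2_score(reference, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Invented values: "reference" mimics the higher-level recomputation,
# "predicted" mimics the values stored in the data set.
reference = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.5]

print(mean_absolute_error(reference, predicted))
print(r2_score(reference, predicted))  # 0.75
```

With this definition, a perfect agreement gives an MAE of 0 and an r2 of 1, so the two metrics read in opposite directions.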
Table 1
Data Benchmarks and Their Associated MAE (in Atomic Units, Except for μ, in D) and r2 Scores

benchmark        qM MAE   qM r2   μ MAE   μ r2   gap MAE   gap r2   α MAE   α r2
1 (B2PLYP-D3)    0.12     0.99    0.53    0.98   0.124     0.69
2 (DFT-Opt)      0.05     0.99    0.56    0.94   0.007     0.92
3 (DFT-α)                                                           19.8    0.81
Table 1 shows that
in both benchmarks 1 and 2, qM and μ yielded
the smallest MAEs, with r2 → 1.
The largest deviations were found for the HOMO/LUMO gaps, in line
with the strong dependence of this property on the theory levels used
in the single-point and geometry optimization calculations. This scenario
is illustrated for benchmark 1 with qM (Figure 8N) and the
HOMO/LUMO gap (Figure 8O). However, despite the larger uncertainty of the HOMO/LUMO gap
relative to qM, the pairwise correlations
of these two properties at the TPSSh-D3BJ (Figure 8P) and B2PLYP-D3 (Figure 8Q) levels have essentially the same shapes,
with three adjacent clusters at qM ≈
−1.50e, −0.75e, and
0.50e, that increase in size from qM ≈ −2 to qM ≈ +2. In benchmark 3, the deviation of the GFN2-xTB α
values relative to the DFT(TPSSh-D3BJ) is larger than those of qM and μ in benchmarks 1 and 2, though
significantly smaller than that of the HOMO/LUMO gap in benchmark
1. More pairwise representations of the quantum property benchmarks
are provided in the Supporting Information (Figures S9–S12).
Figure 8
Pairwise property correlations from the B2PLYP-D3
benchmark 1; qM (N), HOMO/LUMO gap (O),
TPSSh-D3BJ HOMO/LUMO
gap vs qM (P), and B2PLYP-D3 HOMO/LUMO
gap vs qM (Q).
Data Availability
tmQM is an open data set freely available
at GitHub (https://github.com/bbskjelstad/tmqm) and from Quantum-Machine (http://quantum-machine.org/datasets/).
Quantum features, geometries and properties computed at the GFN2-xTB
and TPSSh-D3BJ/def2-SVP levels of theory are provided in the xyz and
csv file formats.
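As an example of how the csv properties might be used, the following Python sketch selects complexes from an unusual region of the chemical space, namely those combining a wide HOMO/LUMO gap with a large polarizability. The column names, the semicolon delimiter, the complex codes, and all values in the inline sample are assumptions made for illustration; the actual layout should be checked against the downloaded files.

```python
import csv
import io

# Inline stand-in for a property file. Column names, delimiter, codes,
# and values are hypothetical and are not copied from tmQM.
SAMPLE_CSV = """CSD_code;HL_Gap;Polarizability
XXXX01;0.05;300.0
XXXX02;0.15;450.0
XXXX03;0.16;120.0
"""


def load_properties(handle, delimiter=";"):
    """Read the property table into a list of dictionaries with float values."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        {
            "code": row["CSD_code"],
            "gap": float(row["HL_Gap"]),
            "alpha": float(row["Polarizability"]),
        }
        for row in reader
    ]


def wide_gap_high_alpha(rows, min_gap, min_alpha):
    """Return the codes of complexes above both thresholds."""
    return [
        row["code"]
        for row in rows
        if row["gap"] >= min_gap and row["alpha"] >= min_alpha
    ]


rows = load_properties(io.StringIO(SAMPLE_CSV))
selected = wide_gap_high_alpha(rows, min_gap=0.12, min_alpha=400.0)
print(selected)  # ['XXXX02']
```

For a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, and the thresholds would be chosen from the property distributions.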
Conclusions
This article reported
the tmQM data set, which provides the quantum
geometries and properties of a large number of TM complexes. The complexes
were extracted from the CSD with a series of filters imposing
constraints on chemical composition, structure, and charge. After
optimization at the GFN2-xTB level, additional filters were applied
to control geometry quality and electronic structure. A total of 86k
TM complexes passed these filters and were included in the tmQM data
set after computing their quantum properties at the TPSSh-D3BJ/def2-SVP
level, including the electronic and dispersion energies, HOMO/LUMO
energies and gap, dipole moment and metal center natural charge. Polarizabilities
are also provided at the GFN2-xTB level. The pairwise representations
of these properties allowed for mapping regions of the chemical space
with unusual properties; for example, TM complexes combining electron-rich
metal centers with low HOMO energies. The tmQM data set, which is
open and freely available at https://github.com/bbskjelstad/tmqm, will enable the training of ML models for the discovery of new
molecular materials based on TMs.