Literature DB >> 32694541

Quantum chemical calculations for over 200,000 organic radical species and 40,000 associated closed-shell molecules.

Peter C St John1, Yanfei Guan2, Yeonjoon Kim3, Brian D Etz3, Seonah Kim4, Robert S Paton5.   

Abstract

The stabilities of radicals play a central role in determining the thermodynamics and kinetics of many reactions in organic chemistry. In this data descriptor, we provide consistent and validated quantum chemical calculations for over 200,000 organic radical species and 40,000 associated closed-shell molecules containing C, H, N and O atoms. These data consist of optimized 3D geometries, enthalpies, Gibbs free energy, vibrational frequencies, Mulliken charges and spin densities calculated at the M06-2X/def2-TZVP level of theory, which was previously found to have a favorable trade-off between experimental accuracy and computational efficiency. We expect this data to be useful in the further development of machine learning techniques to predict reaction pathways, bond strengths, and other phenomena closely related to organic radical chemistry.

Entities:  

Year:  2020        PMID: 32694541      PMCID: PMC7374734          DOI: 10.1038/s41597-020-00588-x

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

Accurate determination of reaction energies is a central step in exploring organic chemistry mechanisms. The majority of chemical reactions consist of multiple elementary steps involving reactive intermediates. Short-lived reactive intermediates are difficult to isolate and analyze experimentally, resulting in increased dependence on accurate mechanistic insight gained from computational techniques[1]. Calculating reaction energies with quantum chemistry techniques, such as density functional theory (DFT), is therefore a central effort of computational organic chemistry. However, the combinatorial complexity of potential reaction pathways requires significant experience on the part of the computational chemist to determine which pathways are most likely to be feasible, and considerable computational resources to ensure enough pathways are explored that nonintuitive reactive intermediates and products are not missed. Enthalpies of radicals in particular, as important intermediates in combustion[2,3], atmospheric[4], redox[5], (bio)-polymer chemistry[6,7], and the functionalization of medicinally-relevant aromatic compounds[8], are frequently calculated to determine the thermodynamics and kinetics of reaction pathways. Fast and accurate predictions for the enthalpy changes of radical reactions will substantially improve the throughput of computational chemistry research and allow detailed calculations to be targeted towards pathways that have the highest likelihood of being experimentally relevant. The accuracy of Machine Learning (ML) models in predicting the results of quantum mechanical calculations has increased substantially in recent years as techniques for connecting molecular structures to deep neural networks have improved[9-11]. These approaches, known as graph neural networks (GNNs)[12], replace the traditional featurization of molecules using fingerprints or descriptors with a framework in which molecular representations are learned from the underlying data[13]. These frameworks therefore continue to increase in accuracy as more data is collected far beyond traditional machine learning approaches. ML approaches to quickly and accurately predict enthalpy[14], ground state energy[15], bond dissociation energy[16], and even transition-state activation energies[17] have been developed by leveraging increasingly large databases of DFT calculations. The public distribution of large quantum chemistry databases, such as ioChem-BD[18], is an important part of advancing the field of machine learning research in computational chemistry, as prior publications of equilibrium[19], off-equilibrium[20], and transition-state structures[21] have found applicability beyond their original purpose. Datasets such as QM7[22], QM9[19], ANI-1x[23], and ANI-1ccx[24] consist of closed-shell organic molecules and so a dataset containing reactive intermediates, such as radical species, is required for the further development of machine learning models. In this data descriptor, we report a quantum chemistry dataset focused on determining the enthalpies of radical reactions for small organic molecules. The database contains over 200,000 organic radical species and more than 40,000 associated closed-shell molecules, which were generated by breaking single, non-cyclic bonds in molecules taken from the PubChem Compound database[25]. Carbon, nitrogen and oxygen centered radicals are represented. Geometry optimizations and enthalpy calculations were performed at the M06-2X/def2-TZVP level of theory[26], which was previously found to have a favorable trade-off between experimental accuracy and computational efficiency. For calculating the bond dissociation enthalpies specifically, results from this DFT methodology were benchmarked against experimental bond dissociation energies and calculations at higher levels of theory[16]. The calculation pipeline showed similar performance to CCSD(T), and is able to capture changes in enthalpy relative to experiment with an accuracy of approximately 2 kcal mol−1. The resulting database consists of optimized 3D geometries, vibrational frequencies, IR intensities, Mulliken atomic charges, spin densities, enthalpies and free energies for each molecule, calculated using Gaussian 16[27]. While this data was developed primarily for calculating bond strengths of organic molecules[28], we expect this comprehensive database of radical and closed-shell calculations to be useful for a wide range of applications in chemistry.

Methods

Selection of closed-shell molecules and radicals

SMILES strings for closed-shell molecules were selected from the PubChem Compound database[25] where the entry had a valid CAS number, ten or fewer heavy atoms, consisted of only C, H, N, and O atoms, did not contain formal charges on any atoms, and for which all atoms were connected via covalent bonds (i.e., entries containing multiple molecules or ionic bonds were removed). From these parent molecules, SMILES strings for child radicals were generated by iteratively breaking all single, non-ring bonds in the parent molecule. The resulting list of SMILES strings was canonicalized and de-duplicated using RDKit[29].

Conformer optimization

An initial guess for the lowest-energy conformer was performed via a search using the MMFF94s force-field, as implemented in RDKit[30]. The number of sampled conformers was determined by min (max (3, 100), 1000, where n is the number of rotatable bonds in the molecule. The lowest-energy conformer was then used as the initial geometry guess for subsequent DFT calculations. In order to obtain realistic geometry guesses for radical species (on which MMFF94s was not parameterized), H-atoms were added to radical centers prior to conformer generation. Basic knowledge of chemical structure, including ring conformations, was also incorporated into initial guesses for conformer structure[31].

Density functional theory calculations

Gaussian input files were created from the lowest-energy conformer using OpenBabel[32]. DFT calculations were performed using Gaussian 16[27] with the M06-2X functional and def2-TZVP basis set with the default ultra-fine grid for all numerical integrations. For radical calculations, additional care was taken to ensure the correct electronic structure. Specifically, spatial and spin symmetry of orbitals were broken through an initial guess of mixed HOMO-LUMO and assuming no point-group symmetry. Stability of the DFT “wavefunction” was also tested, and the geometry was reoptimized if an instability was found. Results were parsed in Python using the cclib package[33].

Parallel QM calculations

Calculations were distributed across a high-performance computing (HPC) cluster (Fig. 1). A PostgreSQL database was used to coordinate calculations for a pool of worker nodes. Each worker, in a loop, selects a single SMILES entry, locks the row to prevent duplicate calculations, performs the force field optimization and DFT calculation, validate the resulting calculations, write results to the database, and repeat for a new molecule.
Fig. 1

Overview of the calculation pipeline and associated software. On each worker, closed-shell molecules and radicals (in SMILES format) are pulled from a central database. Optimized, validated 3D geometries are stored in the database after completed, and a new molecule is started.

Overview of the calculation pipeline and associated software. On each worker, closed-shell molecules and radicals (in SMILES format) are pulled from a central database. Optimized, validated 3D geometries are stored in the database after completed, and a new molecule is started.

Data Records

The data set is provided in a chemical table file format, specifically an SDF molfile containing all optimized geometries with additional property fields including SMILES, Enthalpy, FreeEnergy, SCFEnergy, AtomCharges, RotConstants, VibFreqs, and IRIntensity. All raw Gaussian M062X/def2TZVP logfiles for optimization and frequency calculations are also provided[34]. Code to read the dataset and process the associated data in Python is provided in an associated github repository (https://github.com/pstjohn/bde). A description of the data fields in the SDF file are given in Table 1. In addition to the processed SDF file, raw Gaussian logfiles are provided in a separate zipped directory.
Table 1

Description of the associated data fields, their formats, and units.

Data FieldDescription
SMILESString representation of the 2D connectivity of the molecule. Radicals are denoted using the bracket notation.
EnthalpyMolecular enthalpies, specified to six decimal places. In Hartree
FreeEnergyGibbs energy at standard temperature (298.15 K) and pressure (1 atm). In Hartree
SCFEnergyTotal SCF energy (electronic + nuclear). In Hartree
AtomChargesMulliken atomic charges, one for each atom. The values are formatted as a python list, beginning and ending with brackets and separated with commas. Values correspond to the atom order as given in the 3D coordinates.
AtomSpinsAtomic spin densities (for radicals only). In the same format as AtomCharges.
VibFreqsVibrational frequencies in wavenumbers (cm−1). Formatted as a python list of length 3N-6 (or 3N-5 for linear molecules)
RotConstantsRotational constants (GHz). A formatted python list of length 3.
IRIntensityInfrared intensities (km/mol). In the same format as VibFreqs.
Description of the associated data fields, their formats, and units. As molecules with more heavy atoms allow a greater number of possible arrangements, the database contains more examples of larger molecules than smaller molecules. A complete breakdown of the number of calculations in the database by number of heavy atoms is given in Table 2 and a breakdown of the formal radical center by element and degree is given in Table 3. Further characterization of the radical database to determine proximity of the radicals to stabilizing substituents. SMILES arbitrary target specification (SMARTS) patterns were used to determine whether each radical contains neighboring stabilizing features. Radicals are classified as allylic (adjacent to a C=C double bond), propargylic (adjacent to a CC triple bond), benzylic (adjacent to an aromatic carbon), adjacent to a π-acceptor group (an electron-withdrawing group, EWG), adjacent to a lone-pair (an electron-donating group, EDG), and captodative (alpha to both a π-acceptor and a lone-pair donor). Counts of radicals by neighboring substituents is given in Table 4.
Table 2

Number of optimized closed-shell molecules and radicals by number of heavy atoms.

# Heavy AtomsMoleculesRadicals
001
134
21117
35089
4167404
54851867
613266570
7345219931
8757346163
91359486499
101661584818
Table 3

Distribution of the 246,363 radicals by location of the unpaired electron.

ElementPrimarySecondaryTertiary
C56,067121,36928,135
N11,34914,048
O15,354

Primary, secondary, and tertiary refers to atoms having 1, 2, or 3 non-hydrogen neighbors.

Table 4

Characterization of carbon-centered radicals by neighboring substituents.

NameSMARTSCount
Allylic[#6;X3v3 + 0]-[#6] = [#6 × 3]16,229
Propargylic[#6;X3v3 + 0]-[#6]#[#6]1,887
Benzylic[#6;X3v3 + 0]-[c]8,286
α- to π-acceptor[#6;X3v3 + 0]-[C,N] = ,#[N,O]18,758
α-to lone-pair[#6;X3v3 + 0]-[O,N]55,136
Captodative[#6;X3v3 + 0](-[O,N])-[C,N]=,#[N,O]43,86
Number of optimized closed-shell molecules and radicals by number of heavy atoms. Distribution of the 246,363 radicals by location of the unpaired electron. Primary, secondary, and tertiary refers to atoms having 1, 2, or 3 non-hydrogen neighbors. Characterization of carbon-centered radicals by neighboring substituents. As a set of consistent enthalpies between closed-shell molecules and radicals, this data has been used to calculate a large number of bond dissociation energies (BDEs)[28]. With calculated bond strengths and 3D atomic structures of the parent molecules, we can examine bond strength vs. bond length curves for several common single bonds. Figure 2 shows bond strengths versus bond lengths as calculated using this dataset. Correlation coefficients indicate that bond length vs. strength correlations are strongest for CC single bonds, followed by similar, slightly weaker correlations for C–N, C–H, and C–O bonds. As expected, longer bond lengths are symptomatic of weaker dissociation enthalpies. In contrast, almost no correlation between bond length and strength exists for N–H, N–N, N–O or O–H bonds.
Fig. 2

Bond strength versus bond length for several common single bonds. Bond dissociation enthalpies are inversely correlated with bond lengths for carbon-containing bonds, but less so for other species.

Bond strength versus bond length for several common single bonds. Bond dissociation enthalpies are inversely correlated with bond lengths for carbon-containing bonds, but less so for other species. We also demonstrate how the dataset could also be used to investigate if radical stabilization (through delocalization of the resulting electron’s spin density) affects bond strength. Figure 3a plots BDE versus the maximum spin density on any atom in the resulting radical for hydrogen atom abstraction reactions. For O-H and C-H bonds, there is a strong correlation (ρ = 0.76 and 0.71, respectively) between maximum spin density and BDE, suggesting that radical delocalization plays an important role in determining bond enthalpies. However, for N-H bonds, correlations are much weaker.
Fig. 3

Radical stabilization and enthalpy validation (a). BDE versus maximum atom spin density. Maximum spin density is calculated across all atoms in the resulting radical, with lower numbers indicating a more even distribution of electron spin across all atoms. (b) Distribution of calculated minus predicted enthalpies (in kcal mol−1) following a linear model of atomic composition. The shaded grey region indicates the inner quartile range. Vertical grey dashed lines indicate the thresholds used for outlier detection, defined as ±3 inner quartile ranges away from the first or third quartile.

Radical stabilization and enthalpy validation (a). BDE versus maximum atom spin density. Maximum spin density is calculated across all atoms in the resulting radical, with lower numbers indicating a more even distribution of electron spin across all atoms. (b) Distribution of calculated minus predicted enthalpies (in kcal mol−1) following a linear model of atomic composition. The shaded grey region indicates the inner quartile range. Vertical grey dashed lines indicate the thresholds used for outlier detection, defined as ±3 inner quartile ranges away from the first or third quartile.

Technical Validation

A series of convergence checks were performed to ensure the calculated enthalpies are as reliable as possible. Of the 322,871 total enthalpy calculations performed for this work, 30,327 (9.39%) were discarded due to various validation checks. Molecules with failures of any steps of the calculation pipeline, including the conformer embedding (5,079 molecules) and not reaching normal termination of the DFT calculations (2,213 molecules) were discarded. This was either due to a failure to converge the geometry optimization within the maximum number of steps of the Berny algorithm or due to a failure to converge the SCF procedure within the maximum number of cycles. Vibrational frequencies of the optimized molecules computed at the same level of theory as geometry optimization and were checked to ensure that the optimized stationary point was an energy minimum, with zero imaginary frequencies. If any frequencies were imaginary, the optimization was discarded (18,263 optimizations resulted in at least one imaginary frequency). The 3D structure of the resulting optimization was also inspected to ensure that the connectivity matched the Lewis structure of the input structure. The interatomic distances of formally bonded atom pairs were checked to ensure no bonds were greater than 0.4 Å plus the sum of the covalent radii of the two participating atoms (2,134 molecules failed the covalent radii check)[35]. Finally, molecules were checked for an unreasonably high enthalpy per atom. A linear model was fit to each result’s enthalpy, with the number of C, H, N, and O atoms as the independent variables. Residuals were close to normally distributed (Fig. 3b). Outliers were defined as those calculations that were more than 3 inner quartile ranges away from the first or third quartile. No molecules were outliers in the more stable direction, but 235 molecules had higher enthalpy residuals than the maximum cutoff, indicating they likely converged to highly unstable conformers and were removed.

Usage Notes

While the SDF file containing optimized geometry and extracted properties can be read with a number of different cheminformatic tools, we provide a simple example of processing the file with Python 3 and RDKit and using the data to calculate bond dissociation energies at https://github.com/pstjohn/bde.
Measurement(s)SMILES string • Standard Transformed Enthalpy Change • Standard Gibbs Free Energy Change • vibrational frequency • Charge • spin density • infrared spectrum • organic radical • closed-shell molecules
Technology Type(s)computational modeling technique
Factor Type(s)molecular structure
  21 in total

1.  Fast and accurate modeling of molecular atomization energies with machine learning.

Authors:  Matthias Rupp; Alexandre Tkatchenko; Klaus-Robert Müller; O Anatole von Lilienfeld
Journal:  Phys Rev Lett       Date:  2012-01-31       Impact factor: 9.161

2.  Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.

Authors:  Sereina Riniker; Gregory A Landrum
Journal:  J Chem Inf Model       Date:  2015-11-30       Impact factor: 4.956

3.  cclib: a library for package-independent computational chemistry algorithms.

Authors:  Noel M O'Boyle; Adam L Tenderholt; Karol M Langner
Journal:  J Comput Chem       Date:  2008-04-15       Impact factor: 3.376

4.  Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error.

Authors:  Felix A Faber; Luke Hutchison; Bing Huang; Justin Gilmer; Samuel S Schoenholz; George E Dahl; Oriol Vinyals; Steven Kearnes; Patrick F Riley; O Anatole von Lilienfeld
Journal:  J Chem Theory Comput       Date:  2017-10-10       Impact factor: 6.006

5.  Open Babel: An open chemical toolbox.

Authors:  Noel M O'Boyle; Michael Banck; Craig A James; Chris Morley; Tim Vandermeersch; Geoffrey R Hutchison
Journal:  J Cheminform       Date:  2011-10-07       Impact factor: 5.514

6.  Quantum-chemical insights from deep tensor neural networks.

Authors:  Kristof T Schütt; Farhad Arbabzadah; Stefan Chmiela; Klaus R Müller; Alexandre Tkatchenko
Journal:  Nat Commun       Date:  2017-01-09       Impact factor: 14.919

7.  ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules.

Authors:  Justin S Smith; Olexandr Isayev; Adrian E Roitberg
Journal:  Sci Data       Date:  2017-12-19       Impact factor: 6.444

8.  Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning.

Authors:  Justin S Smith; Benjamin T Nebgen; Roman Zubatyuk; Nicholas Lubbers; Christian Devereux; Kipton Barros; Sergei Tretiak; Olexandr Isayev; Adrian E Roitberg
Journal:  Nat Commun       Date:  2019-07-01       Impact factor: 14.919

9.  Prediction of organic homolytic bond dissociation enthalpies at near chemical accuracy with sub-second computational cost.

Authors:  Peter C St John; Yanfei Guan; Yeonjoon Kim; Seonah Kim; Robert S Paton
Journal:  Nat Commun       Date:  2020-05-11       Impact factor: 14.919

10.  PubChem 2019 update: improved access to chemical data.

Authors:  Sunghwan Kim; Jie Chen; Tiejun Cheng; Asta Gindulyte; Jia He; Siqian He; Qingliang Li; Benjamin A Shoemaker; Paul A Thiessen; Bo Yu; Leonid Zaslavsky; Jian Zhang; Evan E Bolton
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971

View more
  4 in total

1.  A benchmark dataset for Hydrogen Combustion.

Authors:  Xingyi Guan; Akshaya Das; Christopher J Stein; Farnaz Heidar-Zadeh; Luke Bertels; Meili Liu; Mojtaba Haghighatlari; Jie Li; Oufan Zhang; Hongxia Hao; Itai Leven; Martin Head-Gordon; Teresa Head-Gordon
Journal:  Sci Data       Date:  2022-05-17       Impact factor: 8.501

2.  Investigation on the Synthesis, Application and Structural Features of Heteroaryl 1,2-Diketones.

Authors:  Robert J Wehrle; Alexander Rosen; Thu Vu Nguyen; Kalyn Koons; Eric Jump; Mason Bullard; Natalie Wehrle; Adam Stockfish; Patrick M Hare; Abdurrahman Atesin; Tülay A Ateşin; Lili Ma
Journal:  ACS Omega       Date:  2022-07-20

3.  MolE8: finding DFT potential energy surface minima values from force-field optimised organic molecules with new machine learning representations.

Authors:  Sanha Lee; Kristaps Ermanis; Jonathan M Goodman
Journal:  Chem Sci       Date:  2022-05-28       Impact factor: 9.969

4.  Dataset Construction to Explore Chemical Space with 3D Geometry and Deep Learning.

Authors:  Jianing Lu; Song Xia; Jieyu Lu; Yingkai Zhang
Journal:  J Chem Inf Model       Date:  2021-03-08       Impact factor: 4.956

  4 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.