Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, O Anatole von Lilienfeld.
Abstract
Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Year: 2014 PMID: 25977779 PMCID: PMC4322582 DOI: 10.1038/sdata.2014.22
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Illustration of the scaling of chemical space with system size.
For the smallest 134k molecules, with up to 9 heavy atoms CONF (not counting hydrogens) taken from the chemical universe GDB-17[11], the distribution of molecular size is shown as a function of the number of occupied electron orbitals, i.e. the number of electron pairs, Nep = Ne/2. Each black box denotes the number of constitutional isomers for one of the 621 stoichiometries present in the 134k molecules. The two left-hand side insets zoom in on smaller compounds. The right-hand side inset zooms in on the predominant stoichiometry, C7H10O2, and features a scatter plot of G4MP2 relative (w.r.t. the global minimum) potential energies of atomization, E, versus molecular radius of gyration, R. Joined projected distributions are shown as well.
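As a small worked example of the x-axis quantity in Figure 1, the number of electron pairs can be computed directly from a stoichiometry. The sketch below assumes neutral, closed-shell molecules built only from C, H, O, N, and F; the function name `electron_pairs` is hypothetical and introduced here for illustration only.

```python
import re

# Atomic numbers of the five elements that occur in the data set.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}


def electron_pairs(formula: str) -> int:
    """Return the number of electron pairs of a neutral formula such as 'C7H10O2'.

    Assumes a closed-shell molecule, i.e. an even total electron count.
    """
    n_electrons = 0
    # Each match is (element symbol, optional count), e.g. ("H", "10").
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        n_electrons += ATOMIC_NUMBER[symbol] * (int(count) if count else 1)
    if n_electrons % 2:
        raise ValueError(f"{formula} has an odd number of electrons")
    return n_electrons // 2


print(electron_pairs("C7H10O2"))  # 34
```

For the predominant stoichiometry, C7H10O2, this gives 7·6 + 10·1 + 2·8 = 68 electrons, i.e. 34 electron pairs.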
Validation of atomization enthalpies at the B3LYP/6-31G(2df,p) level.
For 100 molecules randomly drawn out of the pool of 134k molecules, mean absolute error (MAE), root mean square error (RMSE), and maximal absolute error (maxAE) with respect to more accurate reference methods are reported. All values are in kcal/mol.

| Reference method | MAE | RMSE | maxAE |
|---|---|---|---|
| G4MP2 | 5.0 | 6.1 | 16.0 |
| G4 | 4.9 | 5.9 | 14.4 |
| CBS-QB3 | 4.5 | 5.5 | 13.4 |
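The three error measures reported in this table follow standard definitions. The sketch below shows one way to compute them for paired lists of predicted and reference values; the function name `error_stats` and the numbers in the usage lines are illustrative placeholders, not values from the data set.

```python
import math


def error_stats(predicted, reference):
    """Return (MAE, RMSE, maxAE) for two equally long sequences of numbers."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - r for p, r in zip(predicted, reference)]
    abs_errors = [abs(e) for e in errors]
    mae = sum(abs_errors) / len(errors)                       # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean square error
    max_ae = max(abs_errors)                                  # maximal absolute error
    return mae, rmse, max_ae


# Toy values in kcal/mol, chosen only to make the arithmetic easy to follow.
mae, rmse, max_ae = error_stats([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 5.0, 4.0])
print(mae, rmse, max_ae)
```

By construction MAE ≤ RMSE ≤ maxAE for any set of errors, which matches the ordering of the three columns in every row of the table.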
XYZ-like file format for molecular structure and properties.

| Line | Content |
|---|---|
| 1 | Number of atoms, na |
| 2 | Scalar properties (see the "Calculated properties" table) |
| 3, …, na+2 | Element type, coordinates (x, y, z, in Å), Mulliken partial charge (in e) |
| na+3 | Harmonic vibrational frequencies (3na−5 or 3na−6, in cm−1) |
| na+4 | SMILES strings from GDB-17 and from B3LYP relaxation |
| na+5 | InChI strings for Corina and B3LYP geometries |
Calculated properties.
Properties are stored in the order given by the first column.

| No. | Property | Unit | Description |
|---|---|---|---|
| 1 | tag | - | ‘gdb9’ string to facilitate extraction |
| 2 | i | - | Consecutive, 1-based integer identifier |
| 3 | A | GHz | Rotational constant |
| 4 | B | GHz | Rotational constant |
| 5 | C | GHz | Rotational constant |
| 6 | μ | D | Dipole moment |
| 7 | α | a0³ | Isotropic polarizability |
| 8 | ϵHOMO | Ha | Energy of HOMO |
| 9 | ϵLUMO | Ha | Energy of LUMO |
| 10 | ϵgap | Ha | Gap (ϵLUMO−ϵHOMO) |
| 11 | 〈R²〉 | a0² | Electronic spatial extent |
| 12 | zpve | Ha | Zero point vibrational energy |
| 13 | U0 | Ha | Internal energy at 0 K |
| 14 | U | Ha | Internal energy at 298.15 K |
| 15 | H | Ha | Enthalpy at 298.15 K |
| 16 | G | Ha | Free energy at 298.15 K |
| 17 | Cv | cal/(mol K) | Heat capacity at 298.15 K |
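Because the properties are stored in a fixed order, the scalar-property line can be turned into a named record by position. The sketch below is one possible way to do this; the key names in `FIELDS` are labels chosen here from the row descriptions (they are not identifiers defined by the data set), the Hartree-to-eV factor is a rounded CODATA value, and the sample line contains made-up numbers.

```python
# Hypothetical field names, in the same order as the rows of the table.
FIELDS = [
    "tag", "index", "rot_a", "rot_b", "rot_c", "dipole", "polarizability",
    "homo", "lumo", "gap", "spatial_extent", "zpve",
    "u0", "u298", "h298", "g298", "cv",
]

HARTREE_TO_EV = 27.211386  # 1 Ha in eV, rounded


def parse_properties(line: str) -> dict:
    """Map the 17 whitespace-separated tokens of the property line to names."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    record = {"tag": tokens[0], "index": int(tokens[1])}
    for name, token in zip(FIELDS[2:], tokens[2:]):
        # Assumption: exponents may appear as '*^'; read them as 'e'.
        record[name] = float(token.replace("*^", "e"))
    return record


def gap_is_consistent(record: dict, tol: float = 1e-4) -> bool:
    """Check that the stored gap equals LUMO energy minus HOMO energy."""
    return abs(record["lumo"] - record["homo"] - record["gap"]) <= tol


# Toy line with placeholder values (not taken from the data set).
line = ("gdb9 1 157.7 157.7 157.7 0.0 13.2 -0.39 0.12 0.51 "
        "35.4 0.045 -40.479 -40.476 -40.475 -40.499 6.47")
rec = parse_properties(line)
print(gap_is_consistent(rec), rec["gap"] * HARTREE_TO_EV)
```

The consistency check mirrors the definition given in row 10 of the table, and the unit conversion is only a convenience for readers who prefer eV over Hartree.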
Figure 2. Schematic flow chart used for geometry consistency check.
Figure 3. Histogram of Coulomb-matrix distances.
For the 3,054 molecules that failed the consistency test shown in Fig. 2, the Coulomb-matrix distances, DIJ in Ha, between the B3LYP and Corina geometries are shown.
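To make the notion of a Coulomb-matrix distance between two geometries of the same molecule concrete, here is a hedged sketch. It assumes the commonly used Coulomb-matrix definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/|R_i − R_j| in atomic units), identical atom ordering in both geometries, coordinates given in Å, and a Euclidean (Frobenius) norm of the matrix difference. This section does not state which definition or norm the authors used, so all of these are assumptions, and the function names are hypothetical.

```python
import math

ANGSTROM_TO_BOHR = 1.8897259886  # 1 Å in Bohr radii


def coulomb_matrix(charges, coords_angstrom):
    """Build a Coulomb matrix (in Ha) from nuclear charges and coordinates in Å."""
    n = len(charges)
    coords = [[c * ANGSTROM_TO_BOHR for c in xyz] for xyz in coords_angstrom]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                matrix[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return matrix


def cm_distance(m1, m2):
    """Euclidean (Frobenius) distance between two equally sized matrices.

    Assumption: the norm is a choice made for this sketch.
    """
    return math.sqrt(sum((a - b) ** 2
                         for row1, row2 in zip(m1, m2)
                         for a, b in zip(row1, row2)))


# Toy example: H2 with bond lengths of 1 Bohr and 2 Bohr (placeholder geometries).
one_bohr = 1.0 / ANGSTROM_TO_BOHR
geom_a = [(0.0, 0.0, 0.0), (0.0, 0.0, one_bohr)]
geom_b = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0 * one_bohr)]
d = cm_distance(coulomb_matrix([1, 1], geom_a), coulomb_matrix([1, 1], geom_b))
print(d)
```

In the toy example only the two off-diagonal entries differ (1.0 versus 0.5 Ha), so the distance is the square root of 0.5; identical geometries give a distance of zero.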