Annika Stuke, Christian Kunkel, Dorothea Golze, Milica Todorović, Johannes T Margraf, Karsten Reuter, Patrick Rinke, Harald Oberhofer.
Abstract
Data science and machine learning in materials science require large datasets of technologically relevant molecules or materials. Currently, publicly available molecular datasets with realistic molecular geometries and spectral properties are rare. Here we supply a diverse benchmark spectroscopy dataset of 61,489 molecules extracted from organic crystals in the Cambridge Structural Database (CSD), denoted OE62. Molecular equilibrium geometries are reported at the Perdew-Burke-Ernzerhof (PBE) level of density functional theory (DFT) including van der Waals corrections for all 62 k molecules. For these geometries, OE62 supplies total energies and orbital eigenvalues at the PBE and the PBE hybrid (PBE0) functional level of DFT for all 62 k molecules in vacuum as well as at the PBE0 level for a subset of 30,876 molecules in (implicit) water. For 5,239 molecules in vacuum, the dataset provides quasiparticle energies computed with many-body perturbation theory in the G0W0 approximation with a PBE0 starting point (denoted GW5000 in analogy to the GW100 benchmark set (M. van Setten et al. J. Chem. Theory Comput. 12, 5076 (2016))).
Year: 2020 PMID: 32071311 PMCID: PMC7029047 DOI: 10.1038/s41597-020-0385-y
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1. Chemical space spanned by OE62. (a) Molecular size distributions (including hydrogen atoms) for the OE62 dataset and its 31 k and 5 k subsets. (b) Distribution of the 16 different element types in the datasets. (c) Typical structures found in the 62 k dataset, with chemical diversity arising from a rich combinatorial space of scaffold-functional group pairings: the dataset contains aliphatic molecules, as well as molecules with conjugated and complex aromatic backbones and diverse functional groups of technological relevance. The refcode_csd identifiers of depicted molecules are (from left to right): ZZTVO01, VOCMIK, FATVEC, WASVAN, BIDLUW, KETZAL, EHORAU.
Fig. 2. Schematic overview of the three datasets and the applied computational methods. The 31 k set includes all structures from the 5 k set, and the 62 k set includes all structures from the 31 k and 5 k sets.
Overview of the data (sub)sets in OE62: Applied computational method, resulting molecular properties and DOI-based references to the input and output files of corresponding calculations deposited in the NOMAD repository.
| Set | Method | Computed properties | Access to data records on NOMAD |
|---|---|---|---|
| 62 k | DFT PBE + vdW (vacuum) | • relaxed geometry • occupied & unoccupied MO energies • total energy • Hirshfeld charges | [ |
| 62 k | DFT PBE0 (vacuum) | • geometry fixed at the PBE + vdW level • occupied & unoccupied MO energies • total energy • Hirshfeld charges | [ |
| 31 k | DFT PBE0 (water) | • geometry fixed at the PBE + vdW level • occupied & unoccupied MO energies • total energy • Hirshfeld charges | [ |
| 5 k | DFT PBE0 (vacuum) | • geometry fixed at the PBE + vdW level • occupied & unoccupied MO energies • total energy | [ |
| 5 k | G0W0@PBE0 (vacuum) | • geometry fixed at the PBE + vdW level • occupied & unoccupied MO energies • CBS energies of occupied & unoccupied MOs | [ |
Fig. 3. The GW5000 subset compared to the other (sub)sets in OE62. Panel (a) shows distributions of HOMO energies from G0W0@PBE0 (vacuum), PBE + vdW, PBE0 (vacuum) and PBE0 (water) computations. Panel (b) shows the distribution of solvation free energies Δ. In (a) and (b), distribution medians are marked by dotted lines. Panel (c) depicts a correlation plot for the approximately linear relationship between the G0W0@PBE0 CBS quasiparticle energies and the DFT HOMO energies (PBE and PBE0 in vacuum).
Dataframe structure of all three dataframes df_62k, df_31k and df_5k.
| No. | Column name | Unit | Method | Dataframes | Description |
|---|---|---|---|---|---|
| 1 | refcode_csd | — | — | 62 k, 31 k, 5 k | CSD reference code, unique identifier for the crystal from which the molecule was extracted |
| 2 | canonical_smiles | — | Open Babel | 62 k, 31 k, 5 k | Molecular string representations derived from DFT PBE + vdW relaxed geometries. |
| 3 | inchi | — | Open Babel | 62 k, 31 k, 5 k | |
| 4 | number_of_atoms | — | — | 62 k, 31 k, 5 k | Number of atoms in the molecule |
| 5 | xyz_pbe_relaxed | Å | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | String in XYZ-file format of DFT PBE + vdW relaxed geometry. Line 1 contains the number of atoms. Line 2 is empty. The remaining lines contain atomic type and coordinate (x, y, z). |
| 6 | energies_occ_pbe | eV | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | List of eigenvalues of occupied molecular Kohn-Sham orbitals. Given in ascending order, the last value is the HOMO energy. |
| 7 | energies_occ_pbe0_vac_tier2 | eV | PBE0 (vacuum) | 62 k, 31 k, 5 k | |
| 8 | energies_occ_pbe0_water | eV | PBE0 (water) | 31 k, 5 k | |
| 9 | energies_occ_pbe0_vac_tzvp | eV | PBE0 (vacuum) | 5 k | |
| 10 | energies_occ_pbe0_vac_qzvp | eV | PBE0 (vacuum) | 5 k | |
| 11 | energies_occ_gw_tzvp | eV | G0W0@PBE0 (vacuum) | 5 k | |
| 12 | energies_occ_gw_qzvp | eV | G0W0@PBE0 (vacuum) | 5 k | |
| 13 | cbs_occ_gw | eV | G0W0@PBE0 (vacuum) | 5 k | List of CBS energies of occupied states computed from |
| 14 | energies_unocc_pbe | eV | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | List of eigenvalues of virtual (unoccupied) molecular Kohn-Sham orbitals. Given in ascending order, the first value is the LUMO energy. Only virtual states below the vacuum level (i.e. with negative eigenvalue) are listed. If the LUMO energy is positive, only the LUMO energy is listed. If |
| 15 | energies_unocc_pbe0_vac_tier2 | eV | PBE0 (vacuum) | 62 k, 31 k, 5 k | |
| 16 | energies_unocc_pbe0_water | eV | PBE0 (water) | 31 k, 5 k | |
| 17 | energies_unocc_pbe0_vac_tzvp | eV | PBE0 (vacuum) | 5 k | |
| 18 | energies_unocc_pbe0_vac_qzvp | eV | PBE0 (vacuum) | 5 k | |
| 19 | energies_unocc_gw_tzvp | eV | G0W0@PBE0 (vacuum) | 5 k | |
| 20 | energies_unocc_gw_qzvp | eV | G0W0@PBE0 (vacuum) | 5 k | |
| 21 | cbs_unocc_gw | eV | G0W0@PBE0 (vacuum) | 5 k | List of CBS energies of unoccupied states computed from |
| 22 | total_energy_pbe | eV | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | Total energy of the DFT calculations. Note, for consistency with |
| 23 | total_energy_pbe0_vac_tier2 | eV | PBE0 (vacuum) | 62 k, 31 k, 5 k | |
| 24 | total_energy_pbe0_water | eV | PBE0 (water) | 31 k, 5 k | |
| 25 | total_energy_pbe0_vac_tzvp | eV | PBE0 (vacuum) | 5 k | |
| 26 | total_energy_pbe0_vac_qzvp | eV | PBE0 (vacuum) | 5 k | |
| 27 | hirshfeld_pbe | | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | List of Hirshfeld partial charges on atoms. Same order as atoms in xyz_pbe_relaxed. |
| 28 | hirshfeld_pbe0_vac_tier2 | | PBE0 (vacuum) | 62 k, 31 k, 5 k | |
| 29 | hirshfeld_pbe0_water | | PBE0 (water) | 31 k, 5 k | |
Columns 1 to 3 contain molecular identifiers. Columns 5 to 29 contain molecular properties computed at the respective level of theory. All mentioned energies are given in eV.
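As an illustration of the column layout described above, the following minimal sketch parses an `xyz_pbe_relaxed` string and reads the HOMO and LUMO energies from the ordered eigenvalue lists. It assumes one dataframe row is available as a plain Python dict keyed by column name; the helper names `parse_xyz` and `homo_lumo_gap` and all numbers in `toy_record` are hypothetical and are not taken from OE62.

```python
def parse_xyz(xyz_string):
    """Parse an XYZ-format string: line 1 is the atom count, line 2 is
    empty, and each remaining line holds an element symbol and x, y, z in Å."""
    lines = xyz_string.strip().splitlines()
    n_atoms = int(lines[0])
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count in line 1 does not match the coordinate lines")
    return atoms


def homo_lumo_gap(record, suffix="pbe"):
    """Return (HOMO, LUMO, gap) in eV for one level of theory.

    Occupied eigenvalues are in ascending order, so the last entry is the
    HOMO; the first entry of the unoccupied list is the LUMO.
    """
    homo = record["energies_occ_" + suffix][-1]
    lumo = record["energies_unocc_" + suffix][0]
    return homo, lumo, lumo - homo


# Toy record with made-up values, for illustration only (not OE62 data).
toy_record = {
    "number_of_atoms": 3,
    "xyz_pbe_relaxed": "3\n\nO 0.000 0.000 0.117\nH 0.000 0.757 -0.469\nH 0.000 -0.757 -0.469",
    "energies_occ_pbe": [-25.0, -13.0, -9.0, -7.0],
    "energies_unocc_pbe": [-1.0],
}

atoms = parse_xyz(toy_record["xyz_pbe_relaxed"])
homo, lumo, gap = homo_lumo_gap(toy_record, suffix="pbe")
```

Other levels of theory would be selected by changing `suffix`, for example `"pbe0_vac_tier2"`, provided the corresponding columns exist in the chosen dataframe.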
Fig. 4. Coulomb matrix distances between initial crystal geometries and PBE + vdW relaxed geometries. Panel (a) shows the distribution of Coulomb matrix distances for all 62 k molecules and panel (b) shows the distribution of Coulomb matrix distances for the 284 cases that did not pass the consistency check. Two example molecules are shown in (a) for short and large distances between Coulomb matrices (the refcode_csd identifiers are CILWUP (1) and ODAHUW (2)). In (b), 2D structures of three example molecules that failed the consistency check are shown (DAZIND (3), YOMDUA (4) and FODBAC (5)).
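A Coulomb matrix distance of the kind shown in Fig. 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the commonly cited Coulomb matrix definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/|R_i − R_j|), and it assumes that both geometries list the atoms in the same order, that the distance is the Euclidean (Frobenius) norm of the matrix difference, and that coordinates are used in whatever length unit they are given. The function names and the two-atom toy geometries are hypothetical.

```python
import math


def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix from nuclear charges and Cartesian coordinates."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fitted atomic-energy term 0.5 * Z^2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j
                matrix[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return matrix


def coulomb_distance(m_a, m_b):
    """Euclidean (Frobenius) distance between two equally sized matrices."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(m_a, m_b) for a, b in zip(row_a, row_b))
    )


# Toy example: two hydrogen atoms at two arbitrary separations (made-up values).
charges = [1, 1]
initial = coulomb_matrix(charges, [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)])
relaxed = coulomb_matrix(charges, [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
distance = coulomb_distance(initial, relaxed)
```

Because the matrices are compared entry by entry, this sketch is only meaningful when the atom ordering is identical in both geometries; a permutation-invariant comparison would need an additional sorting step.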
Fig. 5. Accuracy assessment of HOMO and atomization energies computed at the PBE0 (vacuum) DFT level of theory. (a) Four example molecules and their refcode_csd identifiers. (b) For the example molecules, the HOMO energy convergence of the Tier1 and Tier2 basis sets is compared against the Tier4 basis set provided with FHI-aims, always employing tight integration settings. (c) Difference in HOMO energy between the Tier2 (T2) and QZVP basis sets for all molecules of the 5 k set. The distribution median is given by a dotted line, located at −0.008 eV. (d) Same as (b), but for atomization energies Ef.
Fig. 6. Accuracy assessment of G0W0 quasiparticle energies. (a) Convergence of the HOMO G0W0 energies with respect to the inverse of the number of basis functions NBF for the four example molecules shown in Fig. 5. Dashed lines represent linear fits through the def2-QZVP and def2-TZVP points. The intersection of the fitted line with the ordinate gives an estimate for the complete basis set (CBS) limit, as indicated for BMLTAA. (b) Deviation of the HOMO G0W0 energies from the CBS limit for the 5 k subset. Median values of the distributions are indicated by black dashed lines. (c) Percentage of states with negative slope of the CBS fit. (d) Average G0W0@PBE0 quasiparticle spectrum, where each energy state was artificially broadened by a Gaussian distribution.
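The two-point extrapolation described in panel (a) can be written out as a short sketch: a straight line E(x) = E_CBS + a·x in x = 1/NBF is drawn through the def2-TZVP and def2-QZVP points, and its intercept at x = 0 is taken as the CBS estimate. The function name `cbs_extrapolate` and the numbers in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cbs_extrapolate(e_tzvp, n_tzvp, e_qzvp, n_qzvp):
    """Two-point linear extrapolation in 1/N_BF.

    Returns (e_cbs, slope), where e_cbs is the intercept of the straight
    line through (1/n_tzvp, e_tzvp) and (1/n_qzvp, e_qzvp) at 1/N_BF = 0.
    """
    x_t = 1.0 / n_tzvp
    x_q = 1.0 / n_qzvp
    slope = (e_tzvp - e_qzvp) / (x_t - x_q)
    e_cbs = e_qzvp - slope * x_q
    return e_cbs, slope


# Made-up example: 200 and 400 basis functions, energies in eV.
e_cbs, slope = cbs_extrapolate(e_tzvp=-7.0, n_tzvp=200, e_qzvp=-7.2, n_qzvp=400)
```

With these toy inputs the slope is 80 eV and the intercept is −7.4 eV, so the extrapolated value lies 0.2 eV below the larger-basis result. The sign of `slope` is the quantity counted in panel (c).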
| Measurement(s) | organic molecule |
| Technology Type(s) | digital curation • spectroscopy |
| Factor Type(s) | computational method |