Clemens Isert, Kenneth Atz, José Jiménez-Luna, Gisbert Schneider.
Abstract
Machine learning approaches in drug discovery, as well as in other areas of the chemical sciences, benefit from curated datasets of physical molecular properties. However, there is currently a lack of data collections featuring large bioactive molecules alongside first-principles quantum chemical information. The open-access QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset fills this void. The QMugs collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling ~2 M conformers. QMugs contains optimized molecular geometries and thermodynamic data obtained via the semi-empirical method GFN2-xTB. Atomic and molecular properties are provided at both the GFN2-xTB and the density-functional (DFT, ωB97X-D/def2-SVP) levels of theory. QMugs features molecules of significantly larger size than previously reported collections and comprises their respective quantum mechanical wave functions, including DFT density and orbital matrices. This dataset is intended to facilitate the development of models that learn from molecular data on different levels of theory while also providing insight into the corresponding relationships between molecular structure and biological activity.
Year: 2022 PMID: 35672335 PMCID: PMC9174255 DOI: 10.1038/s41597-022-01390-7
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 (a) Principal-moments-of-inertia plot[39] for molecules in the QMugs dataset. NPR_x = x-th normalized principal moment, I_x = x-th smallest principal moment of inertia. (b) Venn diagram showing overlap between QMugs and other well-known datasets with DFT-level computed properties: QM9[18], PubChemQC[20], and ANI-1[19]. Overlap was computed based on the uniqueness of the InChI representations of the contained molecules. Numbers do not add up to those reported in Table 1 because of InChI strings that occur multiple times.
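For orientation, the axes of a principal-moments-of-inertia plot are conventionally obtained by dividing the two smaller principal moments by the largest one. The caption does not spell out the normalization, so the definition below is the standard convention and is stated here as an assumption rather than taken from the paper.

```latex
\mathrm{NPR}_1 = \frac{I_1}{I_3}, \qquad
\mathrm{NPR}_2 = \frac{I_2}{I_3}, \qquad
I_1 \le I_2 \le I_3
```

Under this convention, the corners of the resulting triangle correspond to idealized shapes: rod-like at (0, 1), disc-like at (0.5, 0.5), and sphere-like at (1, 1).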
Table 1. Descriptive statistics of the dataset reported herein in the context of other DFT-level molecular datasets and the information provided by each.
| Dataset | Unique compounds | Total conformations | Heavy atoms max (mean) | Method | Δ-learning possible | Wave functions |
|---|---|---|---|---|---|---|
| QM9 | 133,885 | 133,885 | 9 (8.8) | B3LYP/6-31G(2df,p) | ✗ | ✗ |
| ANI-1 | 57,462 | 22,057,374 | 8 (7.1) | ωB97X/6-31G(d) | ✗ | ✗ |
| PubChemQC | 3,982,436 | 3,982,436 | 51 (14.1) | B3LYP/6-31G(d) | ✗ | ✓ |
| QMugs | 665,911 | 1,992,984 | 100 (30.6) | GFN2-xTB + ωB97X-D/def2-SVP | ✓ | ✓ |
The number of molecules for PubChemQC corresponds to that available on the website of the project[57]. Heavy atom averages are weighted by the number of conformations.
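The conformation-weighted average mentioned above can be illustrated with a minimal Python sketch. The function name and the toy numbers are hypothetical and are not taken from any of the datasets; the point is only that compounds with more conformations count proportionally more.

```python
def weighted_mean_heavy_atoms(records):
    """Mean heavy-atom count, weighted by conformations per compound.

    records: list of (heavy_atom_count, n_conformations) tuples.
    """
    total_conformations = sum(n for _, n in records)
    if total_conformations == 0:
        raise ValueError("at least one conformation is required")
    return sum(h * n for h, n in records) / total_conformations


# Made-up toy data (hypothetical): three compounds with 3, 1, and 2 conformers.
toy_records = [(10, 3), (40, 1), (25, 2)]
print(weighted_mean_heavy_atoms(toy_records))
```

For this toy input the weighted mean (20.0) is lower than the unweighted per-compound mean (25.0), because the smallest compound contributes the most conformations.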
Fig. 2 Distribution of properties for the molecules contained in the QMugs dataset.
Fig. 3 Overview of the data generation process. Molecules were extracted from the ChEMBL database, standardized, and filtered, and starting conformers were generated using the RDKit software package. Metadynamics (MTD) simulations were performed using the GFN2-xTB semi-empirical method to generate three diverse conformations before final geometry optimization. Molecules that did not pass a series of geometric sanity checks were removed. DFT-level properties (ωB97X-D/def2-SVP) were computed using the Psi4 software.
Calculated properties as stored in the SDFs of the QMugs data collection.
| Property | Symbol | Unit | Key | Δ-ML |
|---|---|---|---|---|
| ChEMBL identifier | — | — | CHEMBL_ID | |
| Conformer identifier | — | — | CONF_ID | |
| Total energy | | | GFN2:TOTAL_ENERGY | ♦ |
| Internal atomic energy | | | GFN2:ATOMIC_ENERGY | |
| Formation energy | | | GFN2:FORMATION_ENERGY | ♦ |
| Total enthalpy | | | GFN2:TOTAL_ENTHALPY | |
| Total free energy | | | GFN2:TOTAL_FREE_ENERGY | |
| Dipole | | D | GFN2:DIPOLE | ♦ |
| Quadrupole | | D Å | GFN2:QUADRUPOLE | |
| Rotational constants | | cm−1 | GFN2:ROT_CONSTANTS | ♦ |
| Enthalpy (vib., rot., transl., total) | Δ | cal mol−1 | GFN2:ENTHALPY | |
| Heat capacity (vib., rot., transl., total) | | cal K−1 mol−1 | GFN2:HEAT_CAPACITY | |
| Entropy (vib., rot., transl., total) | Δ | cal K−1 mol−1 | GFN2:ENTROPY | |
| HOMO energy | | | GFN2:HOMO_ENERGY | ♦ |
| LUMO energy | | | GFN2:LUMO_ENERGY | ♦ |
| HOMO-LUMO gap | | | GFN2:HOMO_LUMO_GAP | ♦ |
| Fermi level | | | GFN2:FERMI_LEVEL | |
| Mulliken partial charges | | | GFN2:MULLIKEN_CHARGES | ♦ |
| Covalent coordination number | | — | GFN2:COVALENT_COORDINATION_NUMBER | |
| Molecular dispersion coefficient | | a.u. | GFN2:DISPERSION_COEFFICIENT_MOLECULAR | |
| Atomic dispersion coefficients | | a.u. | GFN2:DISPERSION_COEFFICIENT_ATOMIC | |
| Molecular polarizability | | a.u. | GFN2:POLARIZABILITY_MOLECULAR | |
| Atomic polarizabilities | | a.u. | GFN2:POLARIZABILITY_ATOMIC | |
| Wiberg bond orders | | — | GFN2:WIBERG_BOND_ORDER | ♦ |
| Total Wiberg bond orders | | — | GFN2:TOTAL_WIBERG_BOND_ORDER | ♦ |
| Total energy | | | DFT:TOTAL_ENERGY | ♦ |
| Total internal atomic energy | | | DFT:ATOMIC_ENERGY | |
| Formation energy | | | DFT:FORMATION_ENERGY | ♦ |
| Electrostatic potential | V | | DFT:ESP_AT_NUCLEI | |
| Löwdin partial charges | | | DFT:LOWDIN_CHARGES | |
| Mulliken partial charges | | | DFT:MULLIKEN_CHARGES | ♦ |
| Rotational constants | | cm−1 | DFT:ROT_CONSTANTS | ♦ |
| Dipole | | D | DFT:DIPOLE | |
| Exchange correlation energy | | | DFT:XC_ENERGY | |
| Nuclear repulsion energy | | | DFT:NUCLEAR_REPULSION_ENERGY | |
| One-electron energy | | | DFT:ONE_ELECTRON_ENERGY | |
| Two-electron energy | | | DFT:TWO_ELECTRON_ENERGY | |
| HOMO energy | | | DFT:HOMO_ENERGY | ♦ |
| LUMO energy | | | DFT:LUMO_ENERGY | ♦ |
| HOMO-LUMO gap | | | DFT:HOMO_LUMO_GAP | ♦ |
| Mayer bond orders | | — | DFT:MAYER_BOND_ORDER | |
| Wiberg-Löwdin bond orders | | — | DFT:WIBERG_LOWDIN_BOND_ORDER | ♦ |
| Total Mayer bond orders | | — | DFT:TOTAL_MAYER_BOND_ORDER | |
| Total Wiberg-Löwdin bond orders | | — | DFT:TOTAL_WIBERG_LOWDIN_BOND_ORDER | ♦ |
Abbreviations: a.u., atomic units; vib., vibrational; rot., rotational; transl., translational. Properties that enable Δ machine learning are labelled with ♦.
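As a rough illustration of how the ♦-labelled keys could be paired for Δ machine learning, the sketch below reads the data items of one SDF record with the Python standard library and forms the difference between the DFT and GFN2-xTB values of a property. The helper names, the toy record, and its numeric values are hypothetical. It also assumes that each property is stored as a single scalar per data item, which this excerpt does not confirm.

```python
def parse_sdf_data_items(sdf_text):
    """Return {key: value} for the data items of the first SDF record."""
    items, key, values, in_data = {}, None, [], False
    for line in sdf_text.splitlines():
        if line.startswith("$$$$"):  # end of record
            break
        if not in_data:
            # Skip the molblock; data items start after "M  END".
            in_data = line.startswith("M  END")
            continue
        if line.startswith(">"):  # data header, e.g. ">  <DFT:TOTAL_ENERGY>"
            if key is not None:
                items[key] = "\n".join(values)
            start, end = line.find("<"), line.rfind(">")
            key = line[start + 1:end] if 0 <= start < end else None
            values = []
        elif key is not None:
            if line.strip():
                values.append(line)
            else:  # a blank line closes the current data item
                items[key] = "\n".join(values)
                key = None
    if key is not None:
        items[key] = "\n".join(values)
    return items


def delta_target(items, prop):
    """Difference DFT minus GFN2-xTB for a property present at both levels."""
    return float(items["DFT:" + prop]) - float(items["GFN2:" + prop])


# Hypothetical toy record with made-up values (not real QMugs data).
toy_sdf = """toy
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <CHEMBL_ID>
CHEMBL0000000

>  <GFN2:TOTAL_ENERGY>
-10.5

>  <DFT:TOTAL_ENERGY>
-12.0

$$$$
"""

items = parse_sdf_data_items(toy_sdf)
print(items["CHEMBL_ID"], delta_target(items, "TOTAL_ENERGY"))
```

A model trained on such differences only has to learn the correction from the cheaper level of theory to the more expensive one. In practice a cheminformatics toolkit would normally be used to read the files; the hand-written parser here only keeps the sketch dependency-free.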
Calculated molecular properties stored in the wave function files provided in the QMugs data collection.
| Property | Symbol | Key |
|---|---|---|
| Alpha density matrix | | matrix, Da |
| Beta density matrix | | matrix, Db |
| Alpha orbitals | | matrix, Ca |
| Beta orbitals | | matrix, Cb |
| Atomic-orbital-to-symmetry-orbital transformer | | matrix, aotoso |
| Mayer bond orders | | MAYER_INDICES |
| Wiberg-Löwdin bond orders | | WIBERG_LOWDIN_INDICES |
The Mayer and Wiberg-Löwdin bond orders included here are a superset of the bond orders in the SDFs: they additionally comprise bond orders for non-covalent bonds.
Fig. 4 (a) Distributions of mean pairwise RMSD of atom positions between conformations of each molecule in the QMugs dataset at different stages along the pipeline. While the k-means sampling process selects conformations that are, on average, more geometrically diverse than the average pair of structures generated by MTD simulations, geometry optimization reduces the geometrical diversity between the optimized conformers. (b) Change in atom positions during geometry optimization vs. mean pairwise RMSD of conformations before optimization. Molecules with initially more diverse conformations displayed a greater change in atom positions than those with initially less diverse conformations. (c) Distribution of RMSD of structures prior to and after optimization with the semi-empirical GFN2-xTB method, and of structures optimized with the same approach vs. with ωB97X-D/def2-SVP. The structures of three molecules with varying differences between the two methods are shown as illustrative examples (black and gray correspond to GFN2-xTB and ωB97X-D/def2-SVP-optimized structures, respectively). For illustrative purposes, the example molecules are aligned on their substructures.
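The mean pairwise RMSD used as a diversity measure in this figure can be sketched as follows. This is a simplified illustration and not the authors' protocol: it assumes that all conformers list the same atoms in the same order and that they are already superimposed, so no alignment or symmetry handling is performed. The function names and toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("conformers must have the same, non-zero atom count")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def mean_pairwise_rmsd(conformers):
    """Average RMSD over all unordered pairs of conformers."""
    pairs = list(combinations(conformers, 2))
    if not pairs:
        raise ValueError("at least two conformers are required")
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)


# Made-up two-atom "conformers" (hypothetical coordinates).
toy_conformers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],
]
print(mean_pairwise_rmsd(toy_conformers))
```

With three conformers per molecule, as in QMugs, the mean is taken over three pairs.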
Fig. 5 Comparison of molecular properties computed at the two levels of theory considered herein (GFN2-xTB, ωB97X-D/def2-SVP) for the molecules contained in QMugs. The molecular formation energy E_Form in (a) was calculated by subtracting the atomic contributions U_Atom from the total molecular energies U. Only the rotational constants A are shown in (c), as their B and C counterparts showed highly similar values. Twenty-two conformations of small molecules have very large rotational constants and are not shown. If those structures are included, the RMSE and PCC for rotational constant A are 845.834 cm−1 and 0.091, respectively. Abbreviations: RMSE, root mean squared error; PCC, Pearson’s correlation coefficient.
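The quantities named in this caption can be written out in a few lines of Python. The sketch below implements the formation energy as the total energy minus the sum of atomic contributions, together with RMSE and Pearson's correlation coefficient between two lists of values. All function names and numbers are hypothetical illustrations, not values from the dataset.

```python
import math


def formation_energy(total_energy, atomic_energies):
    """Total molecular energy U minus the summed atomic contributions."""
    return total_energy - sum(atomic_energies)


def rmse(x, y):
    """Root mean squared error between two equally long sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must have the same, non-zero length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def pearson(x, y):
    """Pearson's correlation coefficient between two sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values standing in for one property at two levels of theory.
low_level = [1.0, 2.0, 3.0, 4.0]
high_level = [1.1, 2.1, 2.9, 4.2]
print(rmse(low_level, high_level), pearson(low_level, high_level))

# A single extreme pair can dominate both metrics, which is why a handful
# of outlying conformations can change the reported RMSE and PCC so strongly.
print(rmse(low_level + [5.0], high_level + [500.0]))
```

Both metrics are computed over paired values, so the two input lists must refer to the same conformations in the same order.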
Fig. 6 Atom-type-specific partial charge correlations (GFN2-xTB, ωB97X-D/def2-SVP) for the QMugs dataset (see Table S1 in the Supporting Information for additional metrics).
Fig. 7 Comparison of Wiberg bond orders between GFN2-xTB and ωB97X-D/def2-SVP for the 15 most frequently occurring bond types in the QMugs dataset. The latter level of theory uses Löwdin orthogonalization. See Table S2 in the Supporting Information for additional metrics. For bond types that occurred more than 1 M times in the dataset, a randomly chosen sample of 1 M bonds is plotted.
| Measurement(s) | Quantum Mechanics |
| Technology Type(s) | density functional theory |