Youngchun Kwon, Dongseon Lee, Youn-Suk Choi, Kyoham Shin, Seokho Kang.
Abstract
Recently, deep learning has been successfully applied to molecular graph generation. Nevertheless, mitigating the computational complexity, which increases with the number of nodes in a graph, has been a major challenge. This has hindered the application of deep learning-based molecular graph generation to large molecules with many heavy atoms. In this study, we present a molecular graph compression method to alleviate the complexity while maintaining the capability of generating chemically valid and diverse molecular graphs. We designate six small substructural patterns that are prevalent between two atoms in real-world molecules. These relevant substructures in a molecular graph are then converted to edges by regarding them as additional edge features along with the bond types. This reduces the number of nodes significantly without any information loss. Consequently, a generative model can be constructed in a more efficient and scalable manner with large molecules on a compressed graph representation. We demonstrate the effectiveness of the proposed method for molecules with up to 88 heavy atoms using the GuacaMol benchmark.
Keywords: Compressed graph representation; Deep learning; Graph variational autoencoder; Molecular graph generation
Year: 2020 PMID: 33431050 PMCID: PMC7513488 DOI: 10.1186/s13321-020-00463-2
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Substructural patterns that commonly appear between two atoms in molecules
Fig. 2 Example of compressed graph representation
Fig. 3 Schematic diagram of model architecture
Node features of compressed graph representation
| Feature | Type | Dimensionality |
|---|---|---|
| Atom type | One-hot (B, C, N, O, F, Si, P, S, Cl, Se, Br, I) | 12 |
| Formal charge | One-hot (-1, 1, 2, 3) | 4 |
| No. explicit hydrogens | One-hot (1, 2, 3) | 3 |
| Total | | 19 |
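
As a rough illustration of how the node features in the table above could be assembled into a single 19-dimensional vector, here is a minimal Python sketch. The function names `one_hot` and `encode_node` are hypothetical, and the convention that an unlisted value (for example a formal charge of 0, or no explicit hydrogens) maps to an all-zero block is an assumption made for this sketch, not something the table states.

```python
# Category lists copied from the node feature table.
ATOM_TYPES = ["B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"]
FORMAL_CHARGES = [-1, 1, 2, 3]
EXPLICIT_HS = [1, 2, 3]


def one_hot(value, categories):
    """Return a 0/1 list over categories; all zeros if value is not listed."""
    return [1 if value == category else 0 for category in categories]


def encode_node(symbol, formal_charge=0, num_explicit_hs=0):
    """Concatenate the three feature blocks into one vector (12 + 4 + 3 = 19).

    Assumption: a charge of 0 or zero explicit hydrogens is represented by
    an all-zero block, since neither value appears in the table.
    """
    if symbol not in ATOM_TYPES:
        raise ValueError(f"unsupported atom type: {symbol}")
    return (
        one_hot(symbol, ATOM_TYPES)
        + one_hot(formal_charge, FORMAL_CHARGES)
        + one_hot(num_explicit_hs, EXPLICIT_HS)
    )


# Example: a positively charged nitrogen with one explicit hydrogen.
vec = encode_node("N", formal_charge=1, num_explicit_hs=1)
print(len(vec), vec)
```

Under this convention a neutral carbon with no explicit hydrogens has exactly one non-zero entry, in the atom type block.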
Edge features of compressed graph representation
| Feature | Type | Dimensionality |
|---|---|---|
| Bond type | One-hot (single, double, triple, or none) | 3 |
| Pattern 1 count | One-hot (1, 2, 3, or none) | 3 |
| Pattern 2 count | One-hot (1, 2, 3, or none) | 3 |
| Pattern 3 count | One-hot (1, 2, or none) | 2 |
| Pattern 4 count | One-hot (1, or none) | 1 |
| Pattern 5 count | One-hot (1, 2, or none) | 2 |
| Pattern 6 count | One-hot (1, or none) | 1 |
| Total | | 15 |
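
The edge features in the table above can be sketched in the same way. The snippet below is a hedged illustration rather than the authors' implementation: `encode_edge` is a hypothetical helper, and "none" (no bond, or a pattern count of 0) is assumed to map to an all-zero block, which is how the listed dimensionalities add up to 15.

```python
BOND_TYPES = ["single", "double", "triple"]
# Maximum count per pattern (patterns 1 to 6), taken from the table.
PATTERN_MAX_COUNTS = [3, 3, 2, 1, 2, 1]


def encode_edge(bond_type=None, pattern_counts=(0, 0, 0, 0, 0, 0)):
    """Build a 15-dimensional edge vector: 3 bond dims + 12 pattern dims.

    Assumption: bond_type=None and a pattern count of 0 both encode
    'none' as an all-zero block.
    """
    if bond_type is not None and bond_type not in BOND_TYPES:
        raise ValueError(f"unsupported bond type: {bond_type}")
    if len(pattern_counts) != len(PATTERN_MAX_COUNTS):
        raise ValueError("expected one count per pattern (6 in total)")

    features = [1 if bond_type == b else 0 for b in BOND_TYPES]
    for count, max_count in zip(pattern_counts, PATTERN_MAX_COUNTS):
        if not 0 <= count <= max_count:
            raise ValueError(f"count {count} outside 0..{max_count}")
        # One-hot over the counts 1..max_count; count 0 leaves all zeros.
        features += [1 if count == c else 0 for c in range(1, max_count + 1)]
    return features


# Example: a double bond plus two occurrences of pattern 2 and one of pattern 6.
edge = encode_edge("double", (0, 2, 0, 0, 0, 1))
print(len(edge), edge)
```

Because each substructure is recorded as a count on an edge instead of as extra nodes, the node count shrinks while the edge vector keeps the information needed to restore the substructure.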
Fig. 4 Molecular graph compression results on training dataset: (a) histogram of the number of nodes with the original representation; (b) histogram of the number of nodes with the compressed representation; (c) scatterplot between original and compressed representations
Summary of molecular graph compression results
| Statistic | Original rep. | Compressed rep. | Reduction rate (%) |
|---|---|---|---|
| Avg. no. nodes | 27.89 | 18.49 | 33.70 |
| Max. no. nodes | 88 | 52 | 40.91 |
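
The reduction rates in the table above are consistent with the usual relative-decrease formula, (original − compressed) / original × 100. A short check, with `reduction_rate` as a hypothetical helper name:

```python
def reduction_rate(original, compressed):
    """Relative decrease from original to compressed, in percent."""
    return (original - compressed) / original * 100.0


avg_rate = reduction_rate(27.89, 18.49)  # average number of nodes
max_rate = reduction_rate(88, 52)        # maximum number of nodes
print(f"{avg_rate:.2f} {max_rate:.2f}")  # prints "33.70 40.91"
```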
Molecular graph generation results of baseline and proposed models
| Metric | SMILES-based | | | | Graph-based | | | |
|---|---|---|---|---|---|---|---|---|
| | LSTM | VAE | AAE | ORGAN | GraphMCTS | JTVAE | NAGVAE | NAGVAE |
| Validity | 0.959 | 0.870 | 0.822 | 0.379 | 1.000 | N/A | N/A | 0.927 |
| Uniqueness | 1.000 | 0.999 | 1.000 | 0.841 | 1.000 | N/A | N/A | 0.955 |
| Novelty | 0.912 | 0.974 | 0.998 | 0.687 | 0.994 | N/A | N/A | 1.000 |
| KLD | 0.991 | 0.982 | 0.886 | 0.267 | 0.522 | N/A | N/A | 0.384 |
| FCD | 0.913 | 0.863 | 0.529 | 0.000 | 0.015 | N/A | N/A | 0.009 |