Rustam Zhumagambetov1, Ferdinand Molnár2, Vsevolod A Peshkov3, Siamac Fazli1.
Abstract
Recent advances in convolutional neural networks have inspired the application of deep learning to other disciplines. Even though image processing and natural language processing have turned out to be the most successful, many other domains have also benefited, among them the life sciences in general and chemistry and drug design in particular. In line with this observation, since 2018 the scientific community has seen a surge of methodologies for generating diverse molecular libraries using machine learning. However, to date, attention mechanisms have not been employed for the problem of de novo molecular generation. Here we employ a variant of transformers, an architecture recently developed for natural language processing, for this purpose. Our results indicate that the adapted Transmol model is indeed applicable to the task of generating molecular libraries and leads to statistically significant increases in some of the core metrics of the MOSES benchmark. The presented model can be tuned to either input-guided or diversity-driven generation modes by applying a standard one-seed and a novel two-seeds approach, respectively. Accordingly, the one-seed approach is best suited for the targeted generation of focused libraries composed of close analogues of the seed structure, while the two-seeds approach allows us to dive deeper into under-explored regions of the chemical space by attempting to generate molecules that resemble both seeds. To gain more insight into the scope of the one-seed approach, we devised a new validation workflow that involves the recreation of known ligands for an important biological target, the vitamin D receptor (VDR). To further benefit the chemical community, the Transmol algorithm has been incorporated into our cheML.io web database of ML-generated molecules as a second-generation on-demand methodology. This journal is © The Royal Society of Chemistry.
Year: 2021 PMID: 35479483 PMCID: PMC9037129 DOI: 10.1039/d1ra03086h
Source DB: PubMed Journal: RSC Adv ISSN: 2046-2069 Impact factor: 4.036
Fig. 1 A vanilla transformer architecture.
Fig. 2 Overview of the beam search with a beam width of N = 3.
Fig. 3 Structural representations of the main VDR ligand groups: (A) the secosteroid 1,25D3 bound to ratVDR (PDBID: 1RK3), (B) a steroid acid, e.g. lithocholic acid (LCA), bound to ratVDR (PDBID: 3W5P) and (C) the non-steroidal analogue YR301 bound to ratVDR (PDBID: 2ZFX). All crystal structures were superimposed on 1RK3; the critical amino acid contacts are highlighted in all structures.
Fig. 4 The general pipeline of the sampling process for the one-seed approach.
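The beam search summarized in Fig. 2 can be illustrated with a minimal, generic sketch. It is not the Transmol implementation: the `next_log_probs` callback, the `<eos>` token name, the `max_len` cut-off and the toy model are all assumptions introduced for illustration, and details such as length normalization are left out.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[str, float]],
    beam_width: int = 3,
    max_len: int = 10,
    eos: str = "<eos>",  # hypothetical end-of-sequence token
) -> List[Tuple[Sequence, float]]:
    """Return up to `beam_width` sequences with their summed log-probabilities."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_len):
        # Expand every open hypothesis by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + logp))
        # Keep only the `beam_width` highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical stand-in for a trained decoder: fixed token probabilities."""
    if len(prefix) >= 3:
        return {"<eos>": 0.0}
    return {"C": math.log(0.5), "O": math.log(0.3), "N": math.log(0.2)}


results = beam_search(toy_model, beam_width=3)
```

With a beam width of 3, three hypotheses survive each decoding step, so a single seed yields several ranked candidate strings rather than one greedy output.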
Performance metrics for baseline models: fraction of valid molecules, fraction of unique molecules from 1000 and 10 000 molecules, internal diversity, fraction of molecules passing filters (MCF, PAINS, ring sizes, charge, atom types), and novelty. Reported (mean ± std) over three independent model initializations. Arrows next to the metrics indicate preferable metric values (higher is better for all). CharRNN – character-level recurrent neural network, AAE – adversarial autoencoder, VAE – variational autoencoder, JTN-VAE – junction tree variational autoencoder, LatentGAN – latent vector based generative adversarial network, Transmol – transformer for molecules
| Model | Valid (↑) | Unique@1k (↑) | Unique@10k (↑) | IntDiv (↑) | IntDiv2 (↑) | Filters (↑) | Novelty (↑) |
|---|---|---|---|---|---|---|---|
| Train | 1 | 1 | 1 | 0.8567 | 0.8508 | 1 | 1 |
| HMM | 0.076 ± 0.0322 | 0.623 ± 0.1224 | 0.5671 ± 0.1424 | 0.8466 ± 0.0403 | 0.8104 ± 0.0507 | 0.9024 ± 0.0489 | |
| NGram | 0.2376 ± 0.0025 | 0.974 ± 0.0108 | 0.9217 ± 0.0019 | 0.8738 ± 0.0002 | 0.8644 ± 0.0002 | 0.9582 ± 0.001 | 0.9694 ± 0.001 |
| Combinatorial | | 0.9983 ± 0.0015 | 0.9909 ± 0.0009 | 0.8732 ± 0.0002 | 0.8666 ± 0.0002 | 0.9557 ± 0.0018 | 0.9878 ± 0.0008 |
| CharRNN | 0.9748 ± 0.0264 | | 0.9994 ± 0.0003 | 0.8562 ± 0.0005 | 0.8503 ± 0.0005 | 0.9943 ± 0.0034 | 0.8419 ± 0.0509 |
| AAE | 0.9368 ± 0.0341 | | 0.9973 ± 0.002 | 0.8557 ± 0.0031 | 0.8499 ± 0.003 | 0.996 ± 0.0006 | 0.7931 ± 0.0285 |
| VAE | 0.9767 ± 0.0012 | | 0.9984 ± 0.0005 | 0.8558 ± 0.0004 | 0.8498 ± 0.0004 | | 0.6949 ± 0.0069 |
| JTN-VAE | | | | 0.8551 ± 0.0034 | 0.8493 ± 0.0035 | 0.976 ± 0.0016 | 0.9143 ± 0.0058 |
| LatentGAN | 0.8966 ± 0.0029 | | 0.9968 ± 0.0002 | 0.8565 ± 0.0007 | 0.8505 ± 0.0006 | 0.9735 ± 0.0006 | 0.9498 ± 0.0006 |
| Transmol | 0.0694 ± 0.0004 | 0.9360 ± 0.0036 | 0.9043 ± 0.0036 | | | 0.8437 ± 0.0015 | 0.9815 ± 0.0004 |
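To make the Valid, Unique@k and Novelty columns concrete, the sketch below computes them on toy data. It assumes the usual benchmark-style definitions (uniqueness over the first k valid molecules, novelty over the unique valid molecules absent from the training set) and assumes that every string is already a canonical SMILES, with `None` marking a string that failed parsing. The helper names are hypothetical, and a real pipeline would use a cheminformatics toolkit for parsing and canonicalization.

```python
from typing import Iterable, List, Optional


def fraction_valid(generated: List[Optional[str]]) -> float:
    """Share of generated strings that parsed successfully (None = invalid)."""
    if not generated:
        return 0.0
    return sum(s is not None for s in generated) / len(generated)


def unique_at_k(valid: List[str], k: int) -> float:
    """Share of distinct molecules among the first k valid molecules."""
    head = valid[:k]
    if not head:
        return 0.0
    return len(set(head)) / len(head)


def novelty(valid: Iterable[str], training: Iterable[str]) -> float:
    """Share of distinct valid molecules that do not occur in the training set."""
    unique_generated = set(valid)
    if not unique_generated:
        return 0.0
    return len(unique_generated - set(training)) / len(unique_generated)


# Toy data: assumed to be canonical SMILES already.
generated = ["CCO", "CCO", None, "c1ccccc1", "CCN"]
training = ["CCO", "CCC"]
valid = [s for s in generated if s is not None]

validity_score = fraction_valid(generated)
uniqueness_score = unique_at_k(valid, k=1000)
novelty_score = novelty(valid, training)
```

Because uniqueness and novelty are computed only over valid molecules, a model can score well on them while producing few valid strings overall, so the columns are best read together.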
Performance metrics for baseline models: Fréchet ChemNet Distance (FCD), similarity to a nearest neighbor (SNN), fragment similarity (Frag), and scaffold similarity (Scaff); reported (mean ± std) over three independent model initializations. Results for random test set (Test) and scaffold split test set (TestSF). Arrows next to the metrics indicate preferable metric values. CharRNN – character-level recurrent neural network, AAE – adversarial autoencoder, VAE – variational autoencoder, JTN-VAE – junction tree variational autoencoder, LatentGAN – latent vector based generative adversarial network, Transmol – transformer for molecules
| Model | FCD (↓), Test | FCD (↓), TestSF | SNN (↑), Test | SNN (↑), TestSF | Frag (↑), Test | Frag (↑), TestSF | Scaff (↑), Test | Scaff (↑), TestSF |
|---|---|---|---|---|---|---|---|---|
| Train | 0.008 | 0.4755 | 0.6419 | 0.5859 | 1 | 0.9986 | 0.9907 | 0 |
| HMM | 24.4661 ± 2.5251 | 25.4312 ± 2.5599 | 0.3876 ± 0.0107 | 0.3795 ± 0.0107 | 0.5754 ± 0.1224 | 0.5681 ± 0.1218 | 0.2065 ± 0.0481 | 0.049 ± 0.018 |
| NGram | 5.5069 ± 0.1027 | 6.2306 ± 0.0966 | 0.5209 ± 0.001 | 0.4997 ± 0.0005 | 0.9846 ± 0.0012 | 0.9815 ± 0.0012 | 0.5302 ± 0.0163 | 0.0977 ± 0.0142 |
| Combinatorial | 4.2375 ± 0.037 | 4.5113 ± 0.0274 | 0.4514 ± 0.0003 | 0.4388 ± 0.0002 | 0.9912 ± 0.0004 | 0.9904 ± 0.0003 | 0.4445 ± 0.0056 | 0.0865 ± 0.0027 |
| CharRNN | | | 0.6015 ± 0.0206 | 0.5649 ± 0.0142 | | 0.9983 ± 0.0003 | 0.9242 ± 0.0058 | |
| AAE | 0.5555 ± 0.2033 | 1.0572 ± 0.2375 | 0.6081 ± 0.0043 | 0.5677 ± 0.0045 | 0.991 ± 0.0051 | 0.9905 ± 0.0039 | 0.9022 ± 0.0375 | 0.0789 ± 0.009 |
| VAE | 0.099 ± 0.0125 | 0.567 ± 0.0338 | | | 0.9994 ± 0.0001 | | | 0.0588 ± 0.0095 |
| JTN-VAE | 0.3954 ± 0.0234 | 0.9382 ± 0.0531 | 0.5477 ± 0.0076 | 0.5194 ± 0.007 | 0.9965 ± 0.0003 | 0.9947 ± 0.0002 | 0.8964 ± 0.0039 | 0.1009 ± 0.0105 |
| LatentGAN | 0.2968 ± 0.0087 | 0.8281 ± 0.0117 | 0.5371 ± 0.0004 | 0.5132 ± 0.0002 | 0.9986 ± 0.0004 | 0.9972 ± 0.0007 | 0.8867 ± 0.0009 | 0.1072 ± 0.0098 |
| Transmol | 4.3729 ± 0.0466 | 5.3308 ± 0.0428 | 0.6160 ± 0.0005 | 0.4614 ± 0.0007 | 0.9564 ± 0.0009 | 0.9496 ± 0.0009 | 0.7394 ± 0.0009 | 0.0183 ± 0.0065 |
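The SNN column (similarity to a nearest neighbor) can be read as follows: for each generated molecule, take its highest Tanimoto similarity to any molecule in the reference set, then average over the generated set. The sketch below assumes that definition and represents fingerprints as plain sets of "on" bit indices. The toy fingerprints are invented for illustration and do not come from real molecules.

```python
from typing import FrozenSet, List

Fingerprint = FrozenSet[int]  # assumed representation: indices of set bits


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity of two bit-set fingerprints."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def snn(generated: List[Fingerprint], reference: List[Fingerprint]) -> float:
    """Average, over generated molecules, of the best similarity to the reference set."""
    if not generated or not reference:
        return 0.0
    best = [max(tanimoto(g, r) for r in reference) for g in generated]
    return sum(best) / len(best)


# Invented toy fingerprints, for illustration only.
generated_fps = [frozenset({1, 2, 3}), frozenset({4, 5})]
reference_fps = [frozenset({1, 2, 3, 4}), frozenset({5, 6})]

snn_score = snn(generated_fps, reference_fps)
```

Under this reading, a high SNN on the Test split combined with a lower value on TestSF suggests that generated molecules stay close to scaffolds seen during training.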
Fig. 5 Plots of Wasserstein-1 distance between distributions of molecules in the generated and test sets.
Fig. 6 Proportions of molecules that satisfy five different medicinal chemistry filters.
Fig. 7 The general pipeline of the sampling process for the two-seeds approach.
Fig. 8 Expanding scaffold diversity with the two-seeds approach.
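For readers unfamiliar with the Wasserstein-1 distance used in Fig. 5, the following sketch computes it for two one-dimensional samples as the area between their empirical cumulative distribution functions. It assumes the comparison is made on a scalar per-molecule property (molecular weight is used here purely as an example), and the numbers are invented.

```python
from typing import List, Sequence


def wasserstein_1(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two 1D samples via their empirical CDFs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs_sorted: List[float] = sorted(xs)
    ys_sorted: List[float] = sorted(ys)
    points = sorted(set(xs_sorted) | set(ys_sorted))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Count how many values of each sample are <= left.
        while i < len(xs_sorted) and xs_sorted[i] <= left:
            i += 1
        while j < len(ys_sorted) and ys_sorted[j] <= left:
            j += 1
        cdf_gap = abs(i / len(xs_sorted) - j / len(ys_sorted))
        total += cdf_gap * (right - left)
    return total


# Invented molecular weights for a generated set and a test set.
generated_weights = [300.0, 301.0, 303.0]
test_weights = [305.0, 306.0, 308.0]

distance = wasserstein_1(generated_weights, test_weights)
```

A distance of zero means the two empirical distributions coincide; a uniform shift of every value by some amount gives a distance equal to that amount.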