| Literature DB >> 35356675 |
Matthew Ragoza, Tomohide Masuda, David Ryan Koes.
Abstract
The goal of structure-based drug discovery is to find small molecules that bind to a given target protein. Deep learning has been used to generate drug-like molecules with certain cheminformatic properties, but has not yet been applied to generating 3D molecules predicted to bind to proteins by sampling the conditional distribution of protein-ligand binding interactions. In this work, we describe for the first time a deep learning system for generating 3D molecular structures conditioned on a receptor binding site. We approach the problem using a conditional variational autoencoder trained on an atomic density grid representation of cross-docked protein-ligand structures. We apply atom fitting and bond inference procedures to construct valid molecular conformations from generated atomic densities. We evaluate the properties of the generated molecules and demonstrate that they change significantly when conditioned on mutated receptors. We also explore the latent space learned by our generative model using sampling and interpolation techniques. This work opens the door for end-to-end prediction of stable bioactive molecules from protein structures with deep learning. This journal is © The Royal Society of Chemistry.
Year: 2022 PMID: 35356675 PMCID: PMC8890264 DOI: 10.1039/d1sc05976a
Source DB: PubMed Journal: Chem Sci ISSN: 2041-6520 Impact factor: 9.825
Fig. 1 Overview of generative modeling pipeline. First, a docked protein–ligand complex is converted to an atomic density representation through atom typing and gridding operations. Density grids are then provided as input to a conditional variational autoencoder (CVAE). The CVAE input branch encodes the full complex density, while its conditional branch encodes only the receptor density. The complex density is mapped to a probabilistic latent space, which is then sampled as a latent vector z ∼ N(μ, σ). This is combined with the conditional vector output by the conditional encoder, and together they are provided to the decoder. The decoder generates an output ligand density that is converted into the final 3D molecular structure through atom fitting and bond adding.
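The gridding step above rasterizes each atom onto a 3D density grid. A minimal sketch in plain Python, assuming a simple isotropic Gaussian kernel per atom (the paper's exact kernel, a truncated Gaussian as in libmolgrid-style gridding, differs in detail):

```python
import math

def atom_density_grid(coords, radii, grid_dim=9, resolution=0.5):
    """Rasterize atoms onto a cubic density grid centered at the origin.
    Each atom contributes a Gaussian whose width is set by its radius."""
    ticks = [(i - (grid_dim - 1) / 2) * resolution for i in range(grid_dim)]
    grid = [[[0.0] * grid_dim for _ in range(grid_dim)] for _ in range(grid_dim)]
    for (cx, cy, cz), r in zip(coords, radii):
        for i, gx in enumerate(ticks):
            for j, gy in enumerate(ticks):
                for k, gz in enumerate(ticks):
                    d2 = (gx - cx) ** 2 + (gy - cy) ** 2 + (gz - cz) ** 2
                    grid[i][j][k] += math.exp(-2.0 * d2 / r ** 2)
    return grid

# one carbon-like atom (radius 1.6 Å, an illustrative value) at the origin
g = atom_density_grid([(0.0, 0.0, 0.0)], [1.6])
```

In the full pipeline a grid like this is produced per atom-type channel, so the network input is a 4D tensor of shape (channels, D, D, D).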
Atom typing scheme. The atomic properties and associated ranges of values that were represented in our atom type vectors
| Atomic property | Value range | Num. values |
|---|---|---|
| Ligand element | B, C, N, O, F, P, S, Cl, Br, I, Fe | 11 |
| Receptor element | C, N, O, Na, Mg, P, S, Cl, K, Ca, Zn | 11 |
| Aromatic | False, True | 2 |
| H-bond acceptor | True | 1 |
| H-bond donor | True | 1 |
| Formal charge | −1, 0, 1 | 3 |
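Per the table above, a ligand atom is described by 11 element channels, 2 aromaticity channels, 1 acceptor channel, 1 donor channel, and 3 formal-charge channels, giving an 18-dimensional type vector. A sketch of that encoding (the exact channel ordering here is an assumption for illustration):

```python
LIG_ELEMENTS = ['B', 'C', 'N', 'O', 'F', 'P', 'S', 'Cl', 'Br', 'I', 'Fe']  # 11
FORMAL_CHARGES = [-1, 0, 1]                                                # 3

def ligand_type_vector(element, aromatic, acceptor, donor, charge):
    """One-hot element (11) + aromatic (2) + acceptor (1) + donor (1)
    + one-hot formal charge (3) = 18 channels per ligand atom."""
    vec = [0.0] * 18
    vec[LIG_ELEMENTS.index(element)] = 1.0
    vec[11 + int(aromatic)] = 1.0        # channels: [not aromatic, aromatic]
    vec[13] = float(acceptor)            # single indicator channel
    vec[14] = float(donor)               # single indicator channel
    vec[15 + FORMAL_CHARGES.index(charge)] = 1.0
    return vec

# an aromatic, neutral carbon with no H-bond roles
v = ligand_type_vector('C', True, False, False, 0)
```

Receptor atoms get an analogous vector over their own 11-element alphabet, so receptor and ligand densities occupy separate channel blocks.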
Fig. 2 Generative model architecture. The input encoder maps a protein–ligand complex to a set of means and standard deviations defining latent variables, which are sampled to produce a latent vector z. The conditional encoder maps a receptor to a conditional encoding vector c. The latent vector and conditional vector are concatenated and provided to the decoder, which maps them to a generated ligand density grid. The input encoder and conditional encoder consist of 3D convolutional blocks with leaky ReLU activation functions and residual connections[46] (see detail of Conv3DBlock), alternated with average pooling. The decoder uses a similar architecture in reverse, with transposed convolutions and nearest-neighbor upsampling instead of pooling. U-Net skip connections[47] were included between the convolutional features of the conditional encoder and the decoder to enhance the processing of receptor context. Spectral normalization[48] was applied to all learnable parameters during training. The value displayed after module names in the diagram indicates the number of outputs (or feature maps, for convolutional modules). If not specified, the number of outputs did not change from the previous layer.
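The sampling pathway between the two encoders and the decoder can be sketched as follows. This is a minimal illustration of the reparameterization and concatenation steps only; the vector sizes and helper names are placeholders, not the paper's code:

```python
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1),
    the standard VAE reparameterization trick."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def decoder_input(z, cond):
    """The latent vector and the conditional encoding vector
    are concatenated before being passed to the decoder."""
    return z + cond

rng = random.Random(0)
mu, sigma = [0.5, -0.2], [0.1, 0.3]     # from the input encoder
cond = [1.0, 2.0, 3.0]                  # from the conditional encoder
z = reparameterize(mu, sigma, rng)
x = decoder_input(z, cond)
```

Prior sampling corresponds to drawing z from N(0, 1) directly, bypassing the input encoder, while still conditioning on the receptor through `cond`.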
Test set targets. The proteins that were selected for test evaluations and the associated ligands that were docked to them. Each of the test set proteins has a binding site from a different pocket cluster
| PDB ID | Ligand IDs | Num. ligands |
|---|---|---|
| 2ah9 | bgn, udp, udh, cto, ud2, upg | 6 |
| 5lvq | aly, 5wv, 5wz, 2lx, 5ws, 5wu, 2qc, 78y, 5wy, 5x0, 5wt, p2l, 82i, 5wx | 14 |
| 5g3n | x28, oap, 8in, 6in, u8d, bhp, i3n, gel | 8 |
| 1u0f | g6p, 6pg, s6p, der, f6p, a5p | 6 |
| 4bnw | 36k, nkh, 36i, j2t, fxe, q7u, 3x3, 9kq, 36p, 8m5, 34x, 36e, 36g | 13 |
| 4i91 | cpz, 85d, cae, sne, tmh, 3v4, 82s | 7 |
| 2ati | avf, ave, ihu, 055, 25d, mrd, avd | 7 |
| 2hw1 | tr4, lj9, a4j, tr2, anp, a4g, a3y, a3j, quz, a1y, a2j | 11 |
| 1bvr | xt5, tcu, 3kx, 3ky, 2tk, i4i, uud, geq, 665, nai, nad | 11 |
| 1zyu | adp, skm, anp, acp, s3p, dhk, k2q | 7 |
Fig. 3 Properties of generated molecules. The percent of generated molecules that were valid, novel, unique, moved less than 2 Å RMSD during UFF minimization, had lower Vina energy, or had higher CNN predicted affinity than the reference molecule. These metrics are reported separately for molecules from posterior and prior sampling. Also shown are the distributions of molecular weight, Tanimoto fingerprint similarity, RMSD from UFF minimization, difference in Vina energy, and difference in CNN affinity. The fingerprint similarity, difference in Vina energy, and difference in CNN affinity were computed with respect to the reference molecule (lower ΔVina energy is better, higher ΔCNN affinity is better).
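The Tanimoto fingerprint similarity reported in Fig. 3 is the standard set-overlap measure on binary fingerprints: |A ∩ B| / |A ∪ B|. A sketch on bit lists (the paper's actual fingerprints, e.g. RDKit-derived, are an assumption here):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    union = on_a | on_b
    if not union:
        return 1.0  # two empty fingerprints: identical by convention
    return len(on_a & on_b) / len(union)

same = tanimoto([1, 0, 1, 1], [1, 0, 1, 1])   # identical -> 1.0
half = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])   # 1 shared of 3 set bits -> 1/3
```

A similarity of 1.0 to the reference would indicate the generator reproduced the input molecule; lower values indicate novelty.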
Fig. 4 Controlling the variability of generated molecules. This figure depicts the effect of sampling molecules using different multipliers on the standard deviation of the latent distribution. The leftmost image shows the real ligand that was input to the model for posterior sampling. The first row shows posterior molecules sampled using different variability factors. The second row shows prior samples with different variability factors.
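The variability factor described above simply scales the standard deviation of the latent distribution before sampling. A sketch (with a fixed noise vector `eps` to make the effect visible):

```python
def sample_with_variability(mu, sigma, factor, eps):
    """Sample z = mu + (factor * sigma) * eps: the variability
    factor multiplies the latent standard deviation, so larger
    factors spread samples further from the distribution mean."""
    return [m + factor * s * e for m, s, e in zip(mu, sigma, eps)]

eps = [1.0, -1.0]
low = sample_with_variability([0.0, 0.0], [1.0, 1.0], 0.5, eps)
high = sample_with_variability([0.0, 0.0], [1.0, 1.0], 2.0, eps)
```

With a factor near 0, posterior samples collapse toward the encoded reference; with larger factors they become increasingly diverse.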
Fig. 5 Controlling bias towards the reference molecule. This figure shows the effect of sampling molecules from latent distributions that interpolate between the posterior and prior. On the far left is the real molecule that was used to define the posterior distribution, followed by molecules sampled using different bias factors. A bias factor of 1.0 indicates the full posterior distribution and 0.0 indicates the full prior distribution.
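One plausible way to realize the bias factor is to linearly interpolate the parameters of the posterior N(μ, σ) and the prior N(0, 1); the paper's exact parameterization may differ, so treat this as an illustrative sketch:

```python
def biased_latent_params(mu, sigma, bias):
    """Interpolate between posterior N(mu, sigma) at bias=1.0
    and prior N(0, 1) at bias=0.0, linearly in the mean and
    standard deviation (an assumed parameterization)."""
    mu_b = [bias * m for m in mu]
    sd_b = [bias * s + (1.0 - bias) for s in sigma]
    return mu_b, sd_b

post = biased_latent_params([2.0], [0.5], 1.0)   # full posterior
prior = biased_latent_params([2.0], [0.5], 0.0)  # full prior
```

Sampling then proceeds as usual from N(μ_b, σ_b), so intermediate bias factors yield molecules that partially resemble the reference.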
Fig. 6 Conditioning generated molecules on shikimate kinase mutants. This figure displays posterior molecules that were generated using shikimate as the input ligand, shown in the top left corner. They were each conditioned on mutated versions of the shikimate kinase receptor. After the reference molecule, the first row shows molecules generated from the cognate receptor (wild type) and four different multi-residue mutants. The next three rows show molecules conditioned on receptors with different single-residue mutations. The mutations highlighted in blue involve residues identified in previous work as making important binding interactions with shikimate. Mutations that inverted the charge of the residue are highlighted in red.
Fig. 7 Latent interpolation between shikimate kinase ligands. This figure depicts a series of spherical interpolations in latent space between four different known actives for shikimate kinase. Starting with a prior molecule, each row displays an interpolation to the next ligand in the sequence, with the real molecule shown at the end of the row. The interpolated molecules are labelled with the weights that were used to combine the two endpoints of the latent interpolation. The molecules in this graphic were not minimized with any force field.
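Spherical linear interpolation (slerp) is the standard choice for traversing a Gaussian latent space, since it keeps intermediate points at a plausible norm rather than cutting through the low-density interior. A sketch, assumed to match the "spherical interpolation" used in Fig. 7:

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between latent vectors a and b,
    with t in [0, 1] weighting the two endpoints."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    if omega < 1e-8:  # nearly parallel vectors: fall back to lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

p = slerp([1.0, 0.0], [0.0, 1.0], 0.5)  # midpoint on the unit circle
```

Each interpolated latent vector is decoded (conditioned on the same receptor) to produce the intermediate molecules shown in the figure.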