Literature DB >> 35960209

Molecular Design Method Using a Reversible Tree Representation of Chemical Compounds and Deep Reinforcement Learning.

Ryuichiro Ishitani, Toshiki Kataoka, Kentaro Rikimaru.

Abstract

Automatic design of molecules with specific chemical and biochemical properties is an important process in materials informatics and computational drug discovery. In this study, we designed a novel coarse-grained tree representation of molecules (reversible junction tree; "RJT") for these purposes, which is reversibly convertible to the original molecule without external information. Leveraging this representation, we further formulated the molecular design and optimization problem as tree-structure construction using deep reinforcement learning ("RJT-RL"). In this method, all of the intermediate and final states of reinforcement learning are convertible to valid molecules, which can efficiently guide the optimization process in simple benchmark tasks. We further examined multiobjective optimization and fine-tuning of the reinforcement learning models using RJT-RL, demonstrating the applicability of our method to more realistic tasks in drug discovery.


Year:  2022        PMID: 35960209      PMCID: PMC9472278          DOI: 10.1021/acs.jcim.2c00366

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   6.162


Introduction

Exploration of novel chemical compounds is an important process in drug and material discovery. The process entails iterative trials, including the proposal of candidate compounds and their experimental evaluation, which are usually expensive and time-consuming. Furthermore, while the chemical space of feasible and chemically synthesizable compounds is astronomically large (estimated to be on the order of 10^60), only a small portion of this space has been explored for drugs or functional materials. As it is impossible to identify compounds with the desired properties in this vast chemical space through ad hoc experimental trials, computational exploration of chemical space has been proposed and is now gaining considerable attention owing to recent advances in computational power. In computational exploration, all of the steps in the iterative trials, including the proposal of novel chemical compounds and their property evaluation, are performed in silico. Using extensive computational resources, this approach remarkably extends the exploration range within the vast chemical space. However, owing to the discrete nature of chemical structure representations (i.e., graphs in discrete mathematics), it remains difficult to explore the chemical space efficiently. To address this problem, new exploration methods using deep neural networks[1] (DNNs; for comprehensive reviews, see refs (2, 3)) have been proposed. One example is the latent-space-based method, which involves a variational autoencoder (VAE[4]). In this method, a DNN known as a generative model learns the mapping between the discrete space of chemical compounds and a continuous latent manifold.
After the generative model is learned, exploration is performed over the continuous latent manifold.[5] Bayesian optimization or metaheuristics-based methods are used for the optimization because the molecular properties used as scoring functions are usually not differentiable with respect to the latent manifold, even though the latent space itself is continuous. However, this method has several limitations. The first concerns the dimensionality of the latent manifold. For example, with a practical training dataset size (10^5–10^6 compounds), learned latent manifolds with a reasonable reconstruction rate usually have more than 50 dimensions, which is challenging for the available optimization algorithms. Second, no reliable indicator exists for assessing the soundness of the learned mapping. Although the model can be evaluated by its reconstruction error on the training dataset, it is unclear whether a low reconstruction error suffices to guide chemical structure optimization with a given scoring function. The third problem is that the learning process of the generative model is detached from the optimization of the chemical compounds with respect to the scoring function: the pretrained generative model remains unchanged during optimization and does not learn the scoring function. Recently, to address this third problem, retraining the generative model based on the optimization results has been proposed.[6] This method requires an iterative cycle of optimization and training, which is computationally expensive and time-consuming. Another example is the reinforcement learning (RL)-based method, in which the molecular design problem is formulated as a Markov decision process wherein an agent learns the optimal policy based on the rewards offered by its environment.
The RL-based method can partly alleviate the problems of latent-space-based methods because the policy approximated by the DNN model learns the scoring function during optimization. RL-based methods can be classified into two types based on the representation of chemical compounds. The first type uses a text representation (e.g., SMILES), which allows techniques from natural language processing to be leveraged. Although this representation is widely used (including in latent-space-based methods[5]), the generated text may contain grammatically invalid results that degrade optimization performance. This type includes REINVENT.[7] The second type uses a graph representation of the chemical structure; examples include GCPN[8] and MolDQN.[9] In this representation, the generated graphs are theoretically decodable to valid molecules, thereby avoiding the grammatically invalid results of the text representation. However, the molecular properties can change drastically even with a single RL action, such as the last step of closing a conjugated aromatic ring. This makes it difficult to guide the agent directly to the optimal action at each RL step. Thus, a molecular representation using chemical groups (e.g., phenyl groups) as building blocks seems effective for RL-based molecular design. Molecular structure generation algorithms using predefined fragments as building blocks have been studied for decades, and many approaches have been proposed.[10−15] However, it is difficult to integrate such methods with learning-based optimization algorithms, including RL. Recently, Jin et al.[16] proposed a novel coarse-grained molecular representation that applies junction tree (JT) decomposition to the molecular graph. In this method, a molecule is represented as a tree (i.e., a graph without circular or closed paths) with nodes corresponding to rings or bonds.
Like the all-atom graph representation, the JT representation always corresponds to a valid molecule because it composes valid chemical fragments as building blocks. This method has been successfully applied to several molecular generative models.[16,17] However, the JT representation cannot be reversibly converted to the original molecule without auxiliary information. Therefore, additional neural networks (such as atom-based graph convolution networks) are required to supply this information when decoding molecules from the generated JT representation, making the technique complicated and computationally intensive. In this study, we propose a novel RL-based molecular generation and optimization framework that leverages a JT-based representation. For this purpose, we tailored the JT representation to be reversibly convertible to the original molecule without auxiliary neural networks (reversible JT; RJT). This reversible molecular representation enabled us to formulate the molecular generation and optimization problem as tree-structure construction using RL (RJT-RL). The reversible nature of the RJT enables the evaluation of molecules decoded from the intermediate states of the RL episodes. We tested the method on simple molecular design benchmarks and subsequently applied it to more realistic drug discovery tasks involving multiobjective scoring functions.

Methods

Reversible Tree Representation of Molecules

In the original JT representation, the molecular graph is converted into a tree representation using the JT algorithm.[16] Briefly, the molecular graph is fragmented into bonds and rings using the smallest set of smallest rings (SSSR) algorithm implemented in RDKit.[18] Then, a node is assigned to each bond or ring, and two nodes are connected by an edge when they share atoms, to construct the tree representation (Figure S1). If more than two nodes intersect at a single atom, connecting them pairwise would create a cyclic structure. This situation is avoided by inserting the single atom shared among these nodes as a “singleton node” (see the original study[16] for a detailed description of the algorithm). Hereafter, we term the nodes representing bonds and rings “bond nodes” and “ring nodes,” respectively. The set of unique atom clusters in the dataset forms the vocabulary, and each node N_i is assigned a word ID w_i based on its atom cluster. This coarse-grained representation allows us to build a molecule from chemically valid fragments, thereby avoiding atom-by-atom molecular creation through chemically invalid intermediates. However, the JT representation cannot be reversibly converted into the original molecule because the connection information between the nodes is lost in this coarse-grained representation. In previous studies,[16,17] this problem was avoided by introducing supplementary neural networks, which complicated the JT representation framework. In this study, we introduce the reversible JT (RJT) representation of molecules by extending the JT decomposition. To eliminate arbitrariness in the node connections (i.e., to determine the atoms shared between two nodes connected by an edge), we record the IDs of the atoms shared between two adjacent nodes N_i and N_j as the “site information” σ. The site information is defined as follows (Figure 1).
In the original JT implementation, adjacent nodes connected by a JT edge share one or two atoms. Hereafter, a JT edge sharing one atom is denoted a “type-1 edge” (Figure 1A), while one sharing two atoms is denoted a “type-2 edge” (Figure 1B). Type-1 edges can be further classified into three cases: edges connecting (i) bond and singleton nodes, (ii) two bond nodes, and (iii) bond and ring nodes. In all three cases, the indices of the shared atom in both nodes are recorded as the site information σ. In contrast, type-2 edges always connect two ring nodes, and the indices of the two shared atoms in both ring nodes must be recorded. Because the atoms shared between the nodes are always adjacent in the ring, the index of the first atom and the direction ID (+1 or −1) of the second atom are recorded as the site information σ. A spiro connection between two rings is treated as a special case of the type-2 edge by assigning a direction ID of 0, although only one atom is shared between the nodes. The most natural location for storing this site information is the edge E_ij, which represents the existence of atoms shared between N_i and N_j. Thus, in this study, the site information is encoded as edge features.
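The node and edge records described above can be sketched as plain data structures. This is an illustrative sketch, not the authors' implementation; the class and field names (RJTNode, RJTEdge, etc.) are assumptions.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the RJT data structures described above.
# Class and field names are assumptions, not the authors' code.

@dataclass
class RJTNode:
    word_id: int      # index of the node's atom cluster in the vocabulary
    kind: str         # "bond", "ring", or "singleton"

@dataclass
class RJTEdge:
    parent: int       # index of the parent node
    child: int        # index of the child node
    site: tuple       # site information sigma:
                      #   type-1 edge: (atom index in parent, atom index in child)
                      #   type-2 edge: (first shared atom index, direction ID +1/-1,
                      #                 with 0 reserved for a spiro connection)

@dataclass
class RJT:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, word_id, kind):
        self.nodes.append(RJTNode(word_id, kind))
        return len(self.nodes) - 1

    def add_edge(self, parent, child, site):
        self.edges.append(RJTEdge(parent, child, site))

# Two fused rings connected by a type-2 edge: the shared atoms are atom 0
# of the parent ring and its neighbor in the +1 direction.
tree = RJT()
r1 = tree.add_node(word_id=12, kind="ring")
r2 = tree.add_node(word_id=12, kind="ring")
tree.add_edge(r1, r2, site=(0, +1))
```

Storing σ on the edge, as in this sketch, mirrors the choice above of encoding the site information as edge features.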
Figure 1

Definition of the site information that enables reversible conversion to the original molecular structure. (A) Site information for type-1 edges. In the left panels, circles and arrows indicate the nodes and edges of the tree representation, respectively, and numbers near the arrows indicate the site information. The right panels show the possible molecular structures assembled from the tree representation; using the site information, the original structure indicated by the dashed line can be selected without ambiguity. (B) Site information for type-2 edges. The numbers and (+)/(−) signs near the arrows indicate the site information.

The algorithm for converting the RJT representation to a molecular graph (Algorithm 1) can be described as follows: the method is similar to the assembly algorithm of Jin et al.,[16] except that enumerating all combinations of node-to-node attachments is not necessary. In our method, the predicted tree is traversed in depth-first order, attaching the subgraph corresponding to each child node to the partially constructed graph. Using the site information, the atom(s) shared between two nodes can be uniquely determined, and the nodes can be deterministically assembled into the original molecular graph. However, it is possible that the predicted site information is incompatible with the node types (e.g., more than four atoms connected to one carbon atom). In such cases, the second (or third) most probable site information could be utilized based on the softmax logits of the neural network (described later). Alternatively, an exception could be raised to indicate that the RJT contains invalid information. We used the latter implementation to simplify the code.
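The depth-first assembly with the exception-raising fallback can be sketched as follows. In the paper the fragments are RDKit molecule objects and the attachment merges them at the atoms named by the site information; here strings and a toy attach function stand in, so only the traversal and error handling are shown. All names are assumptions.

```python
# Minimal sketch of Algorithm 1 (depth-first RJT-to-molecule assembly).
# fragments/attach are placeholders for RDKit fragments and the real
# site-information-guided attachment.

class InvalidRJTError(Exception):
    """Raised when predicted site information is incompatible with the node types."""

def assemble(fragments, children, attach, root=0):
    """Traverse the tree depth-first, attaching each child's subgraph to
    the partially constructed molecule via the recorded site information."""
    mol = fragments[root]
    stack = [root]
    while stack:
        parent = stack.pop()
        for child, site in children.get(parent, []):
            try:
                mol = attach(mol, fragments[child], site)
            except ValueError as exc:  # e.g., a carbon exceeding four bonds
                raise InvalidRJTError(
                    f"invalid site info on edge {parent}->{child}") from exc
            stack.append(child)
    return mol

# Toy usage: a ring node with one singleton child, attached at atom 0.
fragments = {0: "C1CCCCC1", 1: "O"}
children = {0: [(1, (0, 0))]}
toy_attach = lambda mol, frag, site: mol + "." + frag  # placeholder attachment
result = assemble(fragments, children, toy_attach)
```

Raising InvalidRJTError on an infeasible attachment corresponds to the simplified implementation chosen in the text.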

RJT-Based Neural Network

The neural network architecture (hereafter RJTNN) that encodes the RJT representation into fixed-size hidden vectors {h} can be described as follows. Each node N_i is represented by a one-hot vector of its word ID w_i, while each edge E_ij is represented by a one-hot vector of its site information σ_ij. These one-hot representations of w_i and σ_ij are converted to the node and edge features x_i and y_ij using the learnable embedding matrices E_node and E_edge, respectively:

x_i = E_node · w_i,  y_ij = E_edge · σ_ij

To encode the RJT, including both node and edge features, we extended the tree-based gated recurrent unit (GRU)[16] to accept the edge features; the tree-based long short-term memory (LSTM)[17] could be extended in a similar manner. The message vectors m_ij are updated in two phases (bottom-up and top-down), as in the original tree-based GRU. Finally, the hidden vector h_i of node N_i is computed by aggregating the incoming message vectors m_ki together with the node feature x_i through a linear layer followed by a ReLU activation, where ReLU stands for the rectified linear unit, ReLU(x) = max(0, x).
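The two-phase message passing can be illustrated with a toy numeric sketch. The gated GRU update is replaced here by simple addition so the bottom-up/top-down traversal itself is visible; the real RJTNN also mixes in the edge features y and learned weights, so this is a sketch of the flow, not the network.

```python
# Toy illustration of the two-phase (bottom-up, then top-down) message
# passing used by the tree-based GRU. The GRU cell is replaced by addition.

def tree_messages(parent, children, feat):
    msgs = {}

    def up(v):  # bottom-up phase: leaves -> root
        total = feat[v] + sum(up(c) for c in children.get(v, []))
        if parent.get(v) is not None:
            msgs[(v, parent[v])] = total
        return total

    def down(v, incoming):  # top-down phase: root -> leaves
        for c in children.get(v, []):
            m = feat[v] + incoming + sum(
                msgs[(s, v)] for s in children.get(v, []) if s != c)
            msgs[(v, c)] = m
            down(c, m)

    roots = [v for v in feat if parent.get(v) is None]
    for r in roots:
        up(r)
        down(r, 0)

    # Hidden value: aggregate the node feature with all incoming messages.
    return {v: feat[v] + sum(m for (s, d), m in msgs.items() if d == v)
            for v in feat}

# On a path tree 0 -> 1 -> 2, additive messages let every node "see" the
# whole tree, so each hidden value equals the total feature sum (111).
hidden = tree_messages(parent={0: None, 1: 0, 2: 1},
                       children={0: [1], 1: [2]},
                       feat={0: 1, 1: 10, 2: 100})
```

That every node ends up with the same aggregate is a property of the toy additive cell; the gated cell produces node-specific hidden vectors.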

RL Using the RJT Representation

In this study, we formulated the molecular design task as tree generation using RL.[19] In the general RL setting, an agent receives a state s_t from the environment and selects an action a_t according to its policy π(a|s) at each time step t, where S is the set of possible states and A(s) is the set of possible actions at state s. After taking an action, the agent receives the next state s_{t+1} and a scalar reward r_t, and proceeds to time step t + 1. The episode ends when the agent reaches a terminal state at time step T, after which the next episode starts from t = 0. The return R_t is the total accumulated reward from time step t to T with discount rate γ:

R_t = Σ_{k=t}^{T} γ^{k−t} r_k

The action value Q_π(s, a) is defined as the expected return on selecting action a in state s and thereafter following policy π, whereas the state value V_π(s) is defined as the expected return from state s following policy π. The goal of the agent is to maximize the expected return from each state s. Policy-based RL methods directly parameterize the policy π_θ and optimize its parameters θ using a gradient estimator of the expected return J(θ). The policy gradient theorem states that an unbiased estimate of ∇_θ J(θ) is given by[20]

∇_θ J(θ) ≈ E_t[∇_θ log π_θ(a_t|s_t) R_t]

To reduce the variance of this gradient estimate, an estimate of the advantage Â (A(s, a) = Q(s, a) − V(s)) is used in place of R_t.[21,22] Although the use of the advantage improves stability, it can still lead to destructively large policy updates. The proximal policy optimization (PPO) algorithm[23] therefore optimizes the clipped surrogate objective L^CLIP instead of the plain policy gradient objective L^PG:

L^CLIP(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t)]

where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) and clip(·) clips r_t(θ) to the interval [1 − ϵ, 1 + ϵ]. Then, the total loss function

L^PPO(θ) = E_t[−L^CLIP(θ) + c_1 L^VF(θ) − c_2 S[π_θ](s_t)]

is minimized, where L^VF(θ) = (V_θ(s_t) − V_t^targ)^2 is the squared error loss of the value function and S[π_θ](s_t) is an entropy bonus term that encourages efficient exploration.
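The clipped surrogate can be made concrete with a one-sample numeric sketch. The function below is a minimal illustration of the clipping rule, not the PPO implementation used in the paper (which operates on batches of log-probabilities).

```python
# Numeric sketch of the clipped surrogate objective L^CLIP for a single
# sampled action. The ratio r(theta) compares the current and old policy
# probabilities; clipping keeps the ratio within [1 - eps, 1 + eps].

def clipped_surrogate(p_new, p_old, advantage, eps=0.2):
    ratio = p_new / p_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # PPO takes the minimum of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# With ratio = 3 and a positive advantage, the objective is capped at
# (1 + eps) * A; with a negative advantage, the smaller (unclipped) term wins.
surr_pos = clipped_surrogate(0.9, 0.3, advantage=1.0)
surr_neg = clipped_surrogate(0.9, 0.3, advantage=-1.0)
```

The asymmetry (capping gains but not losses) is what makes the surrogate a pessimistic lower bound on the policy improvement.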
For Â, the truncated version of generalized advantage estimation is used, as in the original PPO implementation:

Â_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T−t−1} δ_{T−1}

where δ_t = r_t + γV(s_{t+1}) − V(s_t), and λ is the generalized advantage estimator parameter.[22]
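The truncated GAE recursion above can be sketched as a backward pass over a finished episode. This is a generic illustration of the estimator, assuming per-step rewards and value estimates are already available.

```python
# Sketch of truncated generalized advantage estimation (GAE): the TD
# residuals delta_t are accumulated backward with weight (gamma * lambda).

def gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value        # value bootstrap after the final step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Sanity check: with gamma = lam = 1 and a terminal reward of 1, the
# advantage at each step reduces to (total return - value estimate).
adv = gae([0.0, 0.0, 1.0], [0.1, 0.2, 0.3], gamma=1.0, lam=1.0)
```

The gamma = lam = 1 limit is a useful check because the estimator then collapses to the plain Monte Carlo advantage.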

Definition of Policy and Value Function Networks

The RJT representation of a molecule was used as the state s in RL, where the state space is the set of molecules that can be converted to an RJT. The agent takes an action to modify the RJT of s_t and constructs the RJT of the next step s_{t+1} (Figure 2A). Here, the action a is defined using the following four components: (i) selection of the word ID w of the new node, (ii) selection of the existing node N_i to which the new node is attached, (iii) prediction of the site information σ (for connecting the two subgraphs represented by N_i and the new node), and (iv) determination of whether the episode ends (Figure 2B). The policy can then be expressed as a probability distribution function over each component of the action. In this study, these probability distribution functions were approximated by RJTNN, using the RJT of s_t as an input (Figure 2A).
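The four-component action can be sketched as sampling from four categorical distributions. The probability values below are made up for illustration, and the function name and tuple layout are assumptions following the (a_node, a_word, a_site, a_stop) ordering used later in the text.

```python
import random

# Hedged sketch: each action component is drawn from its own categorical
# distribution produced by the corresponding policy head.

def sample_action(p_node, p_word, p_site, p_stop, rng=random.Random(0)):
    choose = lambda probs: rng.choices(range(len(probs)), weights=probs)[0]
    return (choose(p_node),   # existing node to attach the new node to
            choose(p_word),   # word ID of the new node
            choose(p_site),   # site information for the connection
            choose(p_stop))   # 1 = end the episode, 0 = continue

# Illustrative probabilities for a 2-node tree, 3-word vocabulary,
# 2 feasible site assignments, and a stop/continue flag.
action = sample_action(p_node=[0.7, 0.3],
                       p_word=[0.1, 0.2, 0.7],
                       p_site=[0.5, 0.5],
                       p_stop=[0.9, 0.1])
```

In the actual method the four distributions come from MLP heads over the RJTNN hidden vectors rather than fixed lists.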
Figure 2

Network architecture of RJT-RL and the actions taken by the agent. (A) Calculation flow of the action and value function estimates. The RJT representing the state is converted to hidden vectors {h} by the RJTNN network defined in the main text, using which the policy distribution πθ and value function estimate Vθ are calculated using subsequent networks. The action taken by the agent is determined by sampling from this policy distribution. The numbers in the tree nodes indicate the word ID in the vocabulary table. (B) Example of an action sampled from the policy distribution and modification to an RJT representing the state. The state before the application of the action represents the molecule shown in panel (A). After applying the action, the state (i.e., RJT) is modified to represent the molecule shown in the leftmost panel.

First, we calculated the hidden vectors {h} for each node of the RJT representing state s using RJTNN. Each component of the policy distribution was then calculated from {h} by separate output networks, where n_node is the number of nodes in the tree, n_voc is the size of the vocabulary, and n_site is the maximum number of possible combinations of site information. The MLP function stands for a multilayer perceptron with two layers, including the ReLU activation function. The action (a_node, a_word, a_site, a_stop) taken by the agent is determined by sampling from this policy distribution. The value function is approximated by a network sharing the RJTNN with the policy network. Using these neural network functions, the PPO target loss function L^PPO(θ) was minimized using the Adam optimizer during RL training. PyTorch[24] and PFRL[25] were used as the deep learning and RL frameworks, respectively, for the experiments.

Expert Learning

To guide the exploration space of RL toward drug-like molecules, the policy network was pretrained using a given dataset (“expert dataset”), similar to previous studies.[7,8] For pretraining, the dataset molecules were converted to RJTs; for each RJT sampled from this dataset, the root node was randomly selected. A path traversing the tree from the root node was generated in breadth (or depth)-first order, from which one edge (connecting N_i and N_j) was randomly selected. This edge represents the step that creates a new tree from a state comprising N_i and its ancestral nodes. The state s and action a calculated from this edge are used to minimize the negative log likelihood of the policy function, L^exp(θ) = −log π_θ(a|s). During the RL training phase, the learned policy should not deviate significantly from the pretrained policy and generate non-drug-like molecules. The use of an augmented likelihood in the REINVENT algorithm[7] prevents the learned policy from deviating significantly from the pretrained model. In this study, we instead jointly trained the policy with the expert dataset and the PPO target by minimizing the target function

L(θ) = L^PPO(θ) + c_exp L^exp(θ)

where the coefficient c_exp is a hyperparameter that controls the effect of the expert dataset.
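The joint objective can be sketched numerically. The scalar inputs below (a precomputed PPO loss and the policy probabilities assigned to a few expert state-action pairs) are made-up illustrative values; only the combination rule follows the text.

```python
import math

# Sketch of the joint objective: the PPO loss plus c_exp times the expert
# negative log likelihood. c_exp trades off reward maximization against
# staying close to the expert (drug-like) policy.

def joint_loss(ppo_loss, expert_probs, c_exp=0.1):
    # Mean negative log likelihood of the expert state-action pairs.
    nll = -sum(math.log(p) for p in expert_probs) / len(expert_probs)
    return ppo_loss + c_exp * nll

# Illustrative values: the policy assigns fairly high probability to the
# expert actions, so the expert term adds only a small amount to the loss.
loss = joint_loss(ppo_loss=0.5, expert_probs=[0.8, 0.6, 0.9], c_exp=0.1)
```

Raising c_exp pulls the optimum toward reproducing the expert dataset; lowering it lets the reward dominate.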

Experiment Settings

In all experiments, the valences of the atoms were checked during the assembly process (conversion of an RJT to a molecule). If an excess valence was detected (e.g., a carbon atom with more than four bonds), a small negative score (c_invalid) was given as the reward for step t to discourage actions producing invalid molecules. Next, because the RJTs corresponding to intermediate states are convertible to valid molecules, any molecular property can be used in RJT-RL as a reward function r_t calculated at every RL step. We therefore examined the effectiveness of this “step reward” and compared it with the “final reward” case. Here, R_step and R_fin are the step and final reward functions, respectively, defined according to the specific task, and t_max is the last time step of the episode; in the step-reward setting, the reward at step t is the change in the step reward function between consecutive states, with R_step(−1) ≡ 0. Finally, the effect of a “duplication count penalty” on the reward function was examined. One problem in RL is the balance between exploration and exploitation: poor exploration that maximizes short-term rewards results in trapping in local minima. To mitigate this, the entropy bonus term S[π_θ](s) is included in the loss function of the PPO.[23] To further facilitate the exploration of chemical space, we introduced a duplication count penalty, which modifies the final reward R_fin calculated at the end of the episode as

R_fin = s if n_dup ≤ N_lim, and R_fin = 0 otherwise

where n_dup is the duplication count of the molecule generated in the episode and s is the score calculated for the generated molecule. This function resets R (the total return of the episode) to zero when the duplication count exceeds the specified threshold. A threshold of N_lim = 2 was used for all experiments in this study. This penalty term is inspired by previous studies,[26−28] wherein count-based bonus terms were shown to facilitate exploration in various RL tasks.
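The duplication count penalty can be sketched as a small stateful scorer. This is a simplified illustration: it zeroes only the final reward of the offending episode, which approximates resetting the episode return described above; the class name and interface are assumptions.

```python
from collections import Counter

# Sketch of the duplication count penalty: once the same molecule has been
# generated more than N_lim times, its final reward is zeroed, pushing the
# agent toward unexplored structures.

class DuplicationPenalty:
    def __init__(self, n_lim=2):
        self.n_lim = n_lim
        self.counts = Counter()     # molecule -> times generated so far

    def __call__(self, smiles, score):
        self.counts[smiles] += 1
        if self.counts[smiles] > self.n_lim:
            return 0.0              # reset the episode's final reward
        return score

# With N_lim = 2 (the value used in the paper), the third generation of
# the same molecule receives no reward.
penalty = DuplicationPenalty(n_lim=2)
rewards = [penalty("c1ccccc1", 1.5) for _ in range(3)]
```

Unlike the entropy bonus, which acts on the policy distribution, this penalty acts directly on the reward signal of repeated outputs.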
All of the experiments described in the Results and Discussion section used the ethane molecule (“CC” in SMILES representation), the simplest atom cluster in the vocabulary, as the initial state unless otherwise stated. The policy network, with a hidden vector size of 128 dimensions, was pretrained using a dataset derived from the ZINC250k dataset.[5] In the derived dataset, macrocyclic compounds with rings containing more than eight atoms were removed, as were bridged compounds containing more than eight substitution sites. The resulting dataset contains 246,416 molecules. We verified that the RJT representations of all molecules in this dataset could be reversibly converted to the original molecules. For comparison, REINVENT,[7] a SMILES-based molecular generator using RL, was employed. Its policy network was pretrained using the same dataset, and RL training was performed using the same scoring function with hyperparameters k = 1 and σ = 20. In addition, CReM,[10] a rule-based structure generation method using fragments as building blocks, was employed. The fragment library for CReM was generated from the ZINC250k dataset, and the optimization method described in this study was used for structure generation. Similar computational resources (CPU: one core of an Intel Xeon Gold 6254 @ 3.10 GHz; GPU: NVIDIA Tesla V100) were used for all experiments.

Results and Discussion

To examine the effectiveness of RJT-RL, we performed several experiments using molecular design benchmarks. All of the experiments and their abbreviations are summarized in Tables 1, 2, and 3. We calculated several metrics defined in the MOSES benchmark suite[29] for the molecules generated in the last one-fifth of the episodes.
Table 1

Summary of the Penalized Log P Experiments^a

ID  method    reward  dup. penalty  Novel  Valid  Uniq   Frag   Scaf   IntDiv  Filt
P1  RJT-RL    step    off           1      1      0.991  0.078  0.016  0.593   0.602
P2  RJT-RL    final   on            1      1      1      0.132  0.024  0.521   0.807
P3  RJT-RL    step    on            1      1      0.996  0.077  0.021  0.593   0.553
P4  REINVENT  -       -             1      0.952  0.944  0.153  0.002  0.603   0.824

^a The MOSES metrics[29] were calculated for the molecules generated in the last 1000 episodes (Novel: novelty, Valid: validity, Uniq: uniqueness, Frag: fragment similarity, Scaf: scaffold similarity, IntDiv: internal diversity (p = 1), and Filt: fraction of molecules passing the unwanted structure filter).

Table 2

Summary of the Similarity Experiments^a

ID  target        method    reward  dup. penalty  Novel  Valid  Uniq   Frag   Scaf   IntDiv  Filt
S1  vortioxetine  RJT-RL    step    off           1      1      0.069  0.169  0.027  0.348   0.543
S2  vortioxetine  RJT-RL    final   on            1      1      0.917  0.514  0.077  0.715   0.922
S3  vortioxetine  RJT-RL    step    on            1      1      0.849  0.245  0.002  0.641   0.829
S4  vortioxetine  REINVENT  -       -             1      0.970  0.970  0.970  0.689  0.837   0.693
S5  vortioxetine  CReM      -       -             1      1      1      0.608  0.034  0.724   0.488
C1  celecoxib     RJT-RL    step    off           1      1      0.100  0.119  0.098  0.387   0.982
C2  celecoxib     RJT-RL    final   on            1      1      0.951  0.183  0.011  0.776   0.809
C3  celecoxib     RJT-RL    step    on            1      1      0.926  0.357  0.004  0.745   0.777
C4  celecoxib     REINVENT  -       -             1      0.942  0.942  0.975  0.682  0.843   0.712
C5  celecoxib     CReM      -       -             1      1      1      0.608  0.034  0.724   0.488

^a The metrics of the molecules generated during the last 4000 episodes were calculated as in Table 1.

Table 3

Summary of the Docking Experiments^a

ID  target          method  reward      3D conf  Novel  Valid  Uniq   Frag   Scaf   IntDiv  Filt
D1  interaction     RJT-RL  final       random   1      1      0.934  0.062  0      0.643   0.135
D2  interaction     RJT-RL  step        random   1      1      0.875  0.374  0      0.682   0.096
D3  interaction     RJT-RL  final       Enum     1      1      0.845  0.627  0      0.731   0.333
D4  interaction     CReM    -           random   1      1      0.999  0.580  0      0.562   0.378
M1  multiobjective  RJT-RL  final/step  random   1      1      0.933  0.176  0.002  0.721   0.950
M2  multiobjective  CReM    -           random   1      1      1      0.803  0      0.640   0.586
F1  fine-tuning     RJT-RL  final/step  random   1      1      0.876  0.147  0      0.685   0.458

^a The metrics of the molecules generated in the last 10,000, 8000, and 2000 episodes for D1–D4, M1–M2, and F1, respectively, were calculated as in Table 1.


Penalized Log P Optimization

First, we verified the effectiveness of our method on a single chemical property that is easily calculated from the chemical structure alone. In the first experimental setup, we evaluated the penalized Log P score,[30] which is the water–octanol partition coefficient (Log P) penalized by ring size and the synthetic accessibility (SA) score.[31] Although the maximization of the penalized Log P score itself has no actual application in drug discovery, we first attempted this task to assess whether our method can optimize a computationally simple objective beyond the example molecules in the training dataset. Furthermore, many previous studies of optimization in chemical space have employed this target score, allowing comparison of our method with these studies. For RJT-RL, we examined three cases (P1, P2, and P3 in Table 1) that differed in the reward evaluation setting (step or final) and the duplication penalty. In P1 and P3, the penalized Log P score was used as the step reward function R_step, but the duplication penalty was turned off in P1. In P2, the penalized Log P score was used as R_fin, and no reward was given for the intermediate steps. The results are shown in Figures 3 and S2 and Table 1. In all cases (P1, P2, and P3), molecules with high rewards were continuously generated, even in the absence of the duplication penalty (Figure 3A–C). Diverse molecules were generated without any penalty (uniqueness in Table 1); thus, the effect of the duplication penalty may be limited in this experimental setup. A comparison of the final and step reward cases (P2 vs P1/P3) showed that the reward tended to rise faster in P1/P3 than in P2 (Figure 3A–C). A similar tendency was observed in additional runs starting from different random states (Figure S3). These results suggest that rewarding the intermediate RL states facilitated the optimization process.
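The penalized Log P objective can be sketched as follows, under the commonly used definition in this literature (Log P minus the SA score minus a penalty for rings larger than six atoms); the exact normalization varies between papers, and the inputs here are assumed to be precomputed, e.g., with RDKit.

```python
# Hedged sketch of the penalized Log P score. logp, sa_score, and
# largest_ring_size are assumed precomputed descriptors; the values in the
# usage example are made up for illustration.

def penalized_logp(logp, sa_score, largest_ring_size):
    ring_penalty = max(largest_ring_size - 6, 0)  # penalize rings > 6 atoms
    return logp - sa_score - ring_penalty

# A molecule with moderate Log P, an average SA score, and an 8-membered
# ring ends up with a negative penalized score.
score = penalized_logp(logp=2.5, sa_score=3.0, largest_ring_size=8)
```

Because both the SA score and the ring penalty are subtracted, maximizing the score favors lipophilic, easily synthesizable molecules without large rings.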
The REINVENT model (P4) also continuously generated high-scoring molecules, but the scores did not exceed those of our method (Figure 3D). The top three penalized Log P scores obtained in the experiments are summarized in Table 4, along with results from previous studies. The comparison shows that our method performed comparably to the weighted retraining method,[6] which also updates the generative model during optimization according to the target function.
Figure 3

Results of penalized Log P optimization by (A–C) RJT-RL (P1–P3; see Table 1) and (D) REINVENT. The penalized Log P values for each episode are plotted in blue, whereas their moving average and maximum values are plotted in orange and green, respectively.

Table 4

Summary of the Results of the Penalized Log P Optimization Experiments Performed in This and Previous Studies^a

method               reward type  dup. penalty  best 1  best 2  best 3
RJT-RL (P1)          step         off           26.45   26.39   26.34
RJT-RL (P2)          final        on            26.23   26.22   26.21
RJT-RL (P3)          step         on            26.45   26.39   26.34
REINVENT (P4)        -            -             16.04   16.03   15.71
CharVAE              -            -             1.98    1.42    1.19
JT-VAE               -            -             5.30    4.93    4.49
GCPN                 -            -             7.98    7.85    7.80
MolDQN               -            -             11.71   11.63   11.63
Weighted retraining  -            -             27.84   27.59   27.21

^a Previous studies include the following methods: CharVAE,[5] JT-VAE,[16] GCPN,[8] MolDQN,[9] and weighted retraining.[6]


Similarity-Guided Molecule Generation

Next, we verified the effectiveness of our method on a simple task: generating molecules similar to a given query structure. To measure the similarity between molecules i and j, we used the Jaccard index[32] J computed on the RDKit implementation of the FCFP4 fingerprints.[33] Although the similarity-guided optimization problem may seem trivial to human intuition, the task is nontrivial for computational optimization using fingerprints as the similarity measure. For example, when attempting to generate an ether group, an alcohol moiety must be generated before reaching the final state. Because the fingerprint bits of alcohol and ether oxygens differ in the definition of FCFP4,[33] the optimization target score decreases in the step that generates the intermediate alcohol moiety. Therefore, the landscape of the fingerprint-based similarity score is not smooth but contains many local minima in which simple optimization algorithms can get trapped. This local-minima problem of fingerprint similarity could be avoided by using the final reward (i.e., evaluating only the final molecule of each episode). However, this makes it impossible to evaluate the value of each RL state within the episodes. To evaluate the effectiveness of different reward settings, we examined three cases of RJT-RL (Table 2).
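The Jaccard (Tanimoto) index itself is straightforward. The sketch below computes it on plain Python sets standing in for the on-bit indices of FCFP4 fingerprints; in practice the bits come from RDKit (feature-based Morgan fingerprints), and the bit values here are made up.

```python
# Jaccard index J on fingerprint bits: |intersection| / |union| of the
# sets of on-bits. Illustrative stand-in for the FCFP4-based similarity.

def jaccard(bits_i, bits_j):
    bits_i, bits_j = set(bits_i), set(bits_j)
    if not bits_i and not bits_j:
        return 1.0   # convention: two empty fingerprints are identical
    return len(bits_i & bits_j) / len(bits_i | bits_j)

# Two molecules sharing 2 of 4 distinct on-bits have similarity 0.5.
sim = jaccard({1, 2, 3}, {2, 3, 4})
```

The discreteness of this measure is the source of the local minima discussed above: flipping one substructure bit (e.g., alcohol to ether) can lower J even when the molecule moves chemically closer to the target.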

Vortioxetine Rediscovery Experiment

In the first experiment (S1–S5; Table ), vortioxetine was used as the query structure (Figure A). The most similar molecule in the pretraining dataset is shown in Figure B, with a similarity of 0.58. The results of the vortioxetine rediscovery experiment are summarized in Figures C–G and S4. In case S1, optimization was performed with a step reward and without a duplication penalty (Table ), achieving a similarity score of 0.724 (Figure C). However, after convergence to a local minimum, the exploration efficiency suddenly deteriorated, and similar or identical molecules were repeatedly generated (uniqueness in Table and Figure S4A). In case S2, when only the final reward and the duplication penalty were applied (Table ), molecules with similarity scores of approximately 0.5 were continuously generated (Figure D), and the highest similarity achieved was 0.771 (Figure D). In case S3, with both the step reward and the duplication penalty applied, diverse molecules with high similarity scores were continuously generated (Figure E). The average similarity score over the last 1000 episodes was 0.681, and a structure almost identical to the query structure was generated (Figure E). The best and average performances for S3 outperformed those for S2, demonstrating that the step-by-step evaluation of molecular properties successfully guided the generation of optimum molecules. Additionally, several runs were conducted with different random seeds for S1–S3 (Figure S5), and the duplication penalty strongly facilitated optimization in all cases. The step reward (S3) tends to produce better results than the final reward (S2), although it is influenced by the random state to some extent. These results suggest that the advantage of the step reward outweighs the disadvantage arising from the local minima in this problem.
Figure 4

Compounds generated by the similarity optimization experiment targeting vortioxetine. (A) Chemical structure of vortioxetine used in the query structure in the experiment. (B) Compound most similar to vortioxetine in the ZINC250k dataset with its similarity score in parentheses. (C–G) Results of similarity optimization experiments, S1–S5 (see Table ). The similarity scores for each episode (or iteration) are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. The chemical structures of compounds with the highest scores (top three) are shown. Similarity scores are noted below the chemical structures.

In S4, REINVENT also generated structures that were similar, but not identical, to the query structure, with the highest similarity score of 0.732 (Figure F). However, structures with high rewards were generated infrequently, and an average similarity score of approximately 0.32 was obtained at the end of training (Figure F). This suggests that the agent failed to adequately explore the chemical space around the query structure. In S5, CReM also generated similar structures, with the highest similarity score of 0.776 (Figure G). The average similarity score at the end of the optimization was approximately 0.5 (Figure G). However, CReM required many more evaluations of the score function (approximately 50,000) than the other experiments (approximately 10,000).

Celecoxib Rediscovery Experiment

Next, similarity-guided optimization of celecoxib, which is also included in the GuacaMol benchmark suite,[34] was performed to observe the effects of the query structures (Figure A). For the experiment, RJT-RL, REINVENT, and CReM were used under conditions similar to those for vortioxetine (C1–C5; Table ). The molecule most similar to celecoxib in the pretraining dataset is shown in Figure B, with a similarity of 0.606.
Figure 5

Compounds generated by the similarity optimization experiment targeting celecoxib. (A) Chemical structure of celecoxib used in the query structure of the experiment. (B) Compound most similar to celecoxib in the ZINC250k dataset with its similarity score in parentheses. (C–G) Results of the similarity optimization experiments, C1–C5 (see Table ). The similarity scores for each episode (or iteration) are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. The chemical structures of compounds with the highest scores (top three) are shown. Similarity scores are noted below the chemical structures.

The results of the experiment are summarized in Figures C–G and S6. In RJT-RL, C3 (step reward) performed better than C1 and C2, as observed in the vortioxetine experiments (S1–S3). C3 generated a structure identical to celecoxib with a similarity of 1.0, although some effect of the random state on the generated structures was observed (Figure S7). CReM (C5) also successfully generated a structure identical to celecoxib, by the 8th generation of optimization and after evaluating approximately 60,000 compounds (Figure G). This result suggests that CReM is also a powerful method but requires a large number of score evaluations. In contrast, the best molecule generated by REINVENT (C4) had a similarity of 0.62 (Figure F), which is only slightly better than the most similar structure in the dataset (Figure B). The original paper on REINVENT[7] reported that it generated a structure identical to celecoxib under similar conditions but with a larger pretraining dataset (1.5 million compounds). REINVENT may thus require a larger pretraining dataset to achieve its best performance.

Structure-Based Scaffold Hopping

We further examined the efficiency of RJT-RL in the realistic task of structure-based scaffold hopping, wherein known drug candidate compounds with structural information (i.e., important interactions between the target protein and compounds) are available. B-Raf kinase with its inhibitor complex structure[35] (PDB ID: 3TV6, Figure A) was used as an example. In this structure, the pyrimidine nitrogen of the inhibitor hydrogen bonds with the main-chain nitrogen atom of Gly 596 (Figure B) in the kinase hinge region. In addition to this hinge interaction, this inhibitor interacts with the main-chain nitrogen atom of Cys 532 through its sulfonamide group (Figure C). In practical structure-based drug discovery applications, it is crucial to search for molecules with different scaffolds that retain known important interactions with target proteins. Thus, we attempted to design novel compounds that preserved the interaction between the hinge region of the enzyme and the pyrimidine group in the compounds while changing the scaffold in the compound interacting with Cys 532.
Figure 6

B-Raf kinase with its inhibitor complex[35] (PDB ID: 3TV6) was used as the target structure in the structure-based scaffold-hopping experiments. (A) Overall structure, (B) interactions between the kinase hinge motif and the inhibitor, and (C) interactions around the sulfonamide moiety of the inhibitor. The protein backbone and bound inhibitor are presented using ribbon and ball-and-stick models, respectively. Molecular graphics were prepared using CueMol (http://www.cuemol.org/).

In this task (D1–D4; Table ), we used the docking score and an interaction score pertaining to the hydrogen-bond acceptor of the compound and the main-chain nitrogen atoms of Gly 596 and Cys 532. The ETKDG algorithm implemented in RDKit was used to generate three-dimensional (3D) conformations from two-dimensional (2D) structures.[36] In this process, we randomly selected one stereoisomer from all possible stereoisomers, except in D3, wherein we enumerated all possible stereoisomers up to 16, attempted docking simulations, and then selected the isomer with the best docking score. The interaction score is defined as follows, where d(p, c) is the distance between the specified atoms of the protein and the compound. In this experiment, dmax = 6.0 Å and dmin = 3.0 Å were used. The total reward function is defined as follows, where HBA is the hydrogen-bond acceptor atom in the compound, ΔG is the AutoDock Vina score[37] of the docking pose of the compound, and cintr is a hyperparameter. Given the scale of the typical docking score (around −10) and the interaction score (0–1), cintr = 10.0 was used in this experiment. The docking score was clipped at 500 because the PPO algorithm was destabilized by extremely high docking-score values. The pyrimidine group was used as the initial state for the RL training, and its initial binding pose to the protein was restrained to match that in the crystal structure.
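The printed equations were lost in extraction. The sketch below shows one plausible reading consistent with the stated quantities (an interaction score in the range 0–1, dmin = 3.0 Å, dmax = 6.0 Å, weight cintr = 10.0, and clipping of the docking term); the linear ramp and the exact way the two terms combine are assumptions, not the authors' published definitions:

```python
def interaction_score(d, d_min=3.0, d_max=6.0):
    """Assumed form: clipped linear ramp, 1.0 at or below d_min, 0.0 at or
    beyond d_max, where d is the protein-compound atom distance in Angstroms."""
    if d <= d_min:
        return 1.0
    if d >= d_max:
        return 0.0
    return (d_max - d) / (d_max - d_min)

def total_reward(delta_g, s_intr, c_intr=10.0, clip=500.0):
    """Assumed combination: negated (lower-is-better) Vina score, clipped to
    stabilize PPO as described in the text, plus the weighted interaction term."""
    return min(-delta_g, clip) + c_intr * s_intr

# e.g. a docking pose with Vina score -9.5 and a 3.6 A hydrogen-bond distance:
reward = total_reward(-9.5, interaction_score(3.6))
```

With cintr = 10.0, a fully satisfied hydrogen bond contributes on the same order (+10) as a typical docking score, which matches the balancing rationale given in the text.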
Two types of reward functions were compared, as in the other experiments (Table ). In cases D1 and D3, the aforementioned reward function was used for the final step, while Rstep(t) = 0 was maintained for all intermediate steps. In case D2, the same function was also used for Rstep(t). Furthermore, for comparison with rule-based methods, CReM was examined using the same scoring function, including the docking and interaction scores (D4; Table ). The results for D1–D3 demonstrate that RJT-RL successfully generated optimized molecules with the desired interactions (Figure A–C). Both the interaction and docking scores were gradually optimized during the RL training episodes, and high-score compounds were continuously sampled during the later stages of training. However, the distributions of the Log P, SA, and QED[38] scores of the generated compounds tended to deviate from those of the training dataset (Figure S8A–C). Compounds with high docking scores that satisfied the desired interactions are shown in Figure S9. Next, we compared the results of the final and step rewards (D1 and D2, respectively). The step reward was marginally better than the final reward: D2 obtained a higher reward than D1 for the same number of episodes and converged to the best score faster (Figure A,B). This result again demonstrates the potential of the step-by-step evaluation of molecular properties. However, D2 ran much more slowly than D1 because a docking simulation was performed at every RL step; notably, D2 took approximately seven times longer than D1 to reach the same number of episodes (Table ).
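The cost asymmetry between the two settings can be made concrete: with a final reward, the scoring function (here, the expensive docking call) runs once per episode, whereas with a step reward it runs at every state. A minimal sketch, where the call counter stands in for the real docking cost:

```python
def episode_return(states, score, step_reward=False):
    """Undiscounted return of one episode.
    final reward: only the terminal molecule is scored (one evaluation);
    step reward: every state is scored (len(states) evaluations)."""
    if step_reward:
        return sum(score(s) for s in states)
    return score(states[-1])

calls = 0
def counting_score(state):
    """Stand-in for an expensive docking-based reward; counts invocations."""
    global calls
    calls += 1
    return 1.0

episode = ["frag1", "frag2", "frag3", "final_mol"]  # hypothetical RL states
episode_return(episode, counting_score, step_reward=False)
final_calls = calls                                  # 1 evaluation
episode_return(episode, counting_score, step_reward=True)
step_calls = calls - final_calls                     # len(episode) evaluations
```

The factor of len(episode) extra score calls per episode is consistent with D2 taking roughly seven times longer than D1 in Table 5.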
Figure 7

Results of the structure-based scaffold-hopping experiments, D1–D4 (see Table ). The reward (left panel), interaction (middle panel), and docking (right panel) scores for each episode are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively.

Table 5

Average Computational Time Required for Generating One Molecule

     reward    3D conf    average time (s)
D1   final     random     1.74
D2   step      random     12.9
D3   final     enum       8.96
In D3, we enumerated all stereoisomers, selected the one with the best docking score, and used the final reward function. This result demonstrates that the enumeration of stereoisomers accelerated optimization compared to D1 (Figure C). However, D3 was approximately four times slower than D1, as it performed up to 16 docking simulations per episode (Table ). In D1 and D2, the randomly determined stereoisomers in the 3D-embedding process may yield different docking scores even when the agent performs the same actions, and this random selection may affect the optimization speed. Next, we compared the results for RJT-RL (D1) and CReM (D4). Because the score evaluation, including the docking simulation, is the bottleneck in this task, we treated one score evaluation in CReM as corresponding to one episode of RJT-RL with a final reward. The results show that CReM optimized the generated molecules against the target score function at a speed comparable to that of D1 (Figure D). However, the distribution of each score term is quite different from that of RJT-RL (D1; Figure A). In D4, the docking score term is well optimized, while the interaction term is less optimized than in D1 (Figure D). One possible reason for this unbalanced optimization is the tendency of CReM to generate large molecules (Figures S8D and S9D). Large molecules increase the van der Waals interaction terms in the Vina scoring function,[37] which can bias the optimization. This could be avoided by tuning cintr, the hyperparameter balancing the score terms, or by using the ligand efficiency[39] instead of the raw docking score.
Finally, we examined several runs with different random seeds for D1, D2, and D3 (Figure S10). In all cases, the optimization proceeds similarly, but the convergence speeds differ depending on the runs in D1 and D2. In particular, in another run of D2, the speed of convergence is observed to be similar to that of D1 (Figure S10B). Next, we examined the differences in the molecules generated from different runs. We calculated the pairwise similarities of the top 100 molecules from the two runs and plotted their distributions (Figure S10D). The results demonstrate that diverse molecules are generated from different runs with the same hyperparameters. This could be an advantage of RJT-RL because it is important to obtain a variety of compounds with high target scores in actual drug design tasks. Consequently, given the level of improvement and deterioration by the step reward (D2) and stereoisomer enumeration (D3), we conclude that the final reward (D1) is best under the current experimental settings.

Multiobjective Optimization

While the above cases targeted simple rewards with one or two objectives, a typical drug design task requires the improvement of multiple objective functions. In particular, the compounds generated in cases D1–D3 tended to contain highly hydrophobic and/or possibly unstable structures (Figures S8A–D and S9), as is also evident from the low "filter" scores of D1–D3 in Table . Such infeasible structures can be suppressed using multiobjective target functions that include penalties for unwanted structures. In M1, we introduced SA[31] and Log P scores to penalize such infeasible structures and examined the performance of our method in multiobjective optimization tasks. We define two penalty terms as follows, where SA and Log P are the SA and Log P scores, respectively, of the target molecule. These terms are intended to restrict the SA score to less than 4 and the Log P score to the range of 0–5. The total reward function is defined as follows, where cintr = 10, cSA = 1, and cLogP = 5 were used. We standardized and neutralized the compounds using the functions implemented in RDKit[18] before generating the 3D conformations. Additionally, we examined the CReM generator with similar settings, except that the target score function was defined as Rfin above. The results of M1 demonstrate that the multiobjective reward increased similarly to cases D1–D3, while the distributions of the SA and Log P values were restricted to the specified ranges (Figures A and S11). In D1, molecules with high SA scores (SA > 4) were frequently generated (Figure S8A), whereas the generation of such molecules was suppressed in M1 (Figure A). The Log P values of the generated molecules were also restricted to the specified range compared to those in the D1 experiments (Figure A vs Figure S8A). Consequently, the "filter" score in M1 was drastically improved compared to that of D1 (Table ). The top three compounds in terms of reward and the docking pose of the best compound are shown in Figure .
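The penalty equations did not survive extraction. A simple hinge-style reading consistent with the stated intent (zero penalty while SA < 4 and 0 < Log P < 5, weights cSA = 1 and cLogP = 5) would be the following; the linear hinge form and the way the terms combine are assumptions, not the authors' exact definitions:

```python
def sa_penalty(sa, sa_max=4.0):
    """Assumed hinge penalty: zero for SA < 4, growing linearly above."""
    return max(0.0, sa - sa_max)

def logp_penalty(logp, lo=0.0, hi=5.0):
    """Assumed hinge penalty: zero inside 0 < Log P < 5, linear outside."""
    return max(0.0, lo - logp) + max(0.0, logp - hi)

def multiobjective_reward(neg_docking, s_intr, sa, logp,
                          c_intr=10.0, c_sa=1.0, c_logp=5.0):
    """Assumed total: the D1-style reward minus the weighted penalty terms,
    with the stated weights c_intr = 10, c_SA = 1, c_LogP = 5."""
    return (neg_docking + c_intr * s_intr
            - c_sa * sa_penalty(sa) - c_logp * logp_penalty(logp))

# A molecule inside both windows incurs no penalty:
assert sa_penalty(3.5) == 0.0 and logp_penalty(2.0) == 0.0
```

Because the penalties are zero inside the feasible windows, they shape the reward landscape without distorting the docking and interaction objectives for well-behaved molecules.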
Although the second and third compounds in Figure are similar, the diversity of the generated structures of M1 can be observed from their statistics and plots (Table and Figure S11). Collectively, these results demonstrate that RJT-RL can be successfully applied to multiobjective optimization. Moreover, another run of M1 was performed, and a similar tendency in the property distributions was observed (Figure S12). A similarity plot between these runs (Figure S12D) demonstrated that diverse molecules were generated from the different runs in this case.
Figure 8

Results of the multiobjective reward experiments (A) M1 and (B) M2 (see Table ). In the upper panel, the reward, interaction, and docking scores for each episode are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. In the lower panel, histograms of the Log P and SA score distributions of the training dataset (blue) and episodes 0–1000 (orange), 20,000–21,000 (green), and 39,000–40,000 (red) are plotted.

Figure 9

Compounds were generated in a multiobjective reward experiment, M1 (see Table ). The chemical structures of the top three compounds, with their rewards and docking scores, are shown in the top panels. The binding poses of the best compounds are shown in the bottom panel. The protein backbone and bound inhibitor are shown using the ribbon and transparent ball-and-stick models, respectively. The original compound bound to the crystal structure is shown as semitransparent. The possible hydrogen-bonding interactions are indicated by yellow dashed lines.

Next, the results for RJT-RL (M1) and CReM (M2) were compared. CReM also successfully improved the overall score (Figure B). However, an unbalanced optimization of the score terms was observed, similar to the case of D4 (Figure D). Although the docking score was improved by the optimization, the other terms (i.e., the interaction, SA, and Log P terms) improved less than those of M1 (Figure ). Even though the increasing tendency of the SA score was suppressed, a large number of compounds with SA > 4.0 were still generated (Figure B). As for the Log P property, D4 generated more molecules with Log P in the specified range (0 < Log P < 5) than M2 (Figures S8D and 8B). To perform multiobjective optimization with CReM, an intensive hyperparameter search for the best weighting terms (cintr, cSA, and cLogP) in the scoring function, and/or an additional penalty term for large molecules, may be required.

Fine-Tuning to Different Reward Functions

One of the advantages of learning-based methods is their transferability to various tasks. To assess the transferability of the RJT-RL models (policy and value function), a fine-tuning experiment, F1, using a different reward function (Table ) was performed. In this experiment, we fine-tuned the model trained in D1 to the multiobjective reward function defined in M1. The results of F1 are shown in Figures and S13. From the early training episodes onward, the agent generated molecules with high docking and interaction scores, which were maintained throughout training (Figure ). The SA and Log P scores also gradually improved as training proceeded, and their distributions fell within the range specified by the penalty score (i.e., SA < 4 and 0 < Log P < 5) (Figure ). The filter score of F1 improved over that of D1 (Table ), suggesting that the distribution of the generated compounds contained more feasible structures than before fine-tuning. In contrast, the diversity of molecules generated by F1 was slightly lower than that of M1, as evident from their statistics (uniqueness and IntDev in Table ). The top three compounds in terms of reward and the docking pose of the best compound are shown in Figure . These results demonstrate that the agent adapted to a different reward function containing the penalty terms after fewer training steps than those required in M1.
Figure 10

Results of fine-tuning experiment F1 (see Table ). (A) In the upper panel, the reward, interaction, and docking scores for each episode are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. In the lower panel, histograms of the Log P and SA score distributions of episodes 0–1000 (orange), 5000–6000 (green), and 9000–10,000 (red) are plotted. (B) Chemical structures of the top three compounds with their rewards and docking scores are shown in the top panels. The binding poses of the best compounds are shown in the bottom panel as in Figure .


Conclusions

In this study, we introduced RJT, a coarse-grained representation of molecules that is directly convertible to the original chemical structure. We leveraged this representation for drug discovery, formulating the molecular design task as reinforcement learning that constructs an RJT. The proposed method exhibited better or comparable performance relative to other state-of-the-art methods in simple benchmark tasks. The results indicate the potential of the step-by-step evaluation of molecular properties, which is only possible in our framework owing to the RJT representation. The step reward proved advantageous when the score calculation was computationally cheap, because it requires multiple score evaluations to generate a single compound. We further demonstrated that our method is applicable to real-world tasks such as structure-based scaffold hopping with a multiobjective reward function. Another advantage of RJT-RL is the ability to fine-tune the policy and value function models, which is not possible with rule-based fragment-growth methods. This feature may be useful for tuning scoring functions in the compound design process. The experiments also revealed several limitations of the proposed method. For example, for 3D generation involving docking simulations, efficient training of the RJT-RL model requires proper handling of the stereoisomers of the generated compounds. In this study, we enumerated the possible stereoisomers and searched for the best one by brute force. However, the search was performed on a single CPU, which reduced the overall performance (Table ). Parallelizing the 3D conformer generation and docking simulations could improve performance, because this calculation is embarrassingly parallelizable. Another solution is to extend the site information in the RJT representation to properly handle the chirality flag of each atom, so that the agent can generate molecules including their stereoisomers.
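The parallelization suggested above can be sketched with the standard library. Here `dock` is a placeholder for the real conformer-generation-plus-Vina call (which typically shells out to an external binary and releases the GIL, so a thread pool suffices); the SMILES strings and mock scoring are purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def dock(smiles):
    """Placeholder for 3D embedding + docking; returns a mock score here.
    (Lower is better, as with AutoDock Vina.)"""
    return -0.1 * len(smiles)  # stand-in for the real docking score

def best_stereoisomer_score(stereoisomers, max_workers=8):
    """Dock each enumerated stereoisomer independently and keep the best
    (lowest) score. Each run is independent: embarrassingly parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return min(pool.map(dock, stereoisomers))

# e.g. the enumerated stereoisomers (up to 16) of one generated molecule:
isomers = ["C[C@H](N)C(=O)O", "C[C@@H](N)C(=O)O"]  # hypothetical SMILES
best = best_stereoisomer_score(isomers)
```

With up to 16 isomers per episode (as in D3), parallel docking could recover most of the roughly fourfold slowdown reported in Table 5.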
Finally, in this study, we only considered the de novo design of compounds from a relatively small starting fragment size. To apply the proposed method for the optimization of lead compounds with larger and more complex structures, it is necessary to extend the action to include node deletion and mutation to enable the modification of the starting scaffolds.

1. Riniker, S.; Landrum, G. A. Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation. J. Chem. Inf. Model. 2015.

2. Chéron, N.; Jasty, N.; Shakhnovich, E. I. OpenGrowth: An Automated and Rational Algorithm for Finding New Protein Ligands. J. Med. Chem. 2015.

3. Bickerton, G. R.; Paolini, G. V.; Besnard, J.; Muresan, S.; Hopkins, A. L. Quantifying the Chemical Beauty of Drugs. Nat. Chem. 2012.

4. Böhm, H. J. The Computer Program LUDI: A New Method for the de Novo Design of Enzyme Inhibitors. J. Comput. Aided Mol. Des. 1992.

5. Trott, O.; Olson, A. J. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010.

6. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015.

7. Brown, N.; Fiscato, M.; Segler, M. H. S.; Vaucher, A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. J. Chem. Inf. Model. 2019.

8. Clark, D. E.; Frenkel, D.; Levy, S. A.; Li, J.; Murray, C. W.; Robson, B.; Waszkowycz, B.; Westhead, D. R. PRO-LIGAND: An Approach to de Novo Molecular Design. 1. Application to the Design of Organic Molecules. J. Comput. Aided Mol. Des. 1995.

9. Zhou, Z.; Kearnes, S.; Li, L.; Zare, R. N.; Riley, P. Optimization of Molecules via Deep Reinforcement Learning. Sci. Rep. 2019.

10. Polishchuk, P. CReM: Chemically Reasonable Mutations Framework for Structure Generation. J. Cheminform. 2020.
