AI-Driven Synthetic Route Design Incorporated with Retrosynthesis Knowledge.

Shoichi Ishida^1, Kei Terayama^2,3, Ryosuke Kojima^3, Kiyosei Takasu^1, Yasushi Okuno^3,4.

Abstract

Computer-aided synthesis planning (CASP) aims to assist chemists in performing retrosynthetic analysis for which they utilize their experiments, intuition, and knowledge. Recent breakthroughs in machine learning (ML) techniques, including deep neural networks, have significantly improved data-driven synthetic route designs without human intervention. However, learning chemical knowledge by ML for practical synthesis planning has not yet been adequately achieved and remains a challenging problem. In this study, we developed a data-driven CASP application integrated with various portions of retrosynthesis knowledge called "ReTReK" that introduces the knowledge as adjustable parameters into the evaluation of promising search directions. The experimental results showed that ReTReK successfully searched synthetic routes based on the specified retrosynthesis knowledge, indicating that the synthetic routes searched with the knowledge were preferred to those without the knowledge. The concept of integrating retrosynthesis knowledge as adjustable parameters into a data-driven CASP application is expected to enhance the performance of both existing data-driven CASP applications and those under development.

Year: 2022 PMID: 35258953 PMCID: PMC8965881 DOI: 10.1021/acs.jcim.1c01074

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956

Introduction

Since the 1960s, various computer-aided synthesis planning (CASP) applications have been developed to emulate chemists’ thinking and help organic synthesis chemists in their work.[1−9] CASP applications have played an important role in the definable parts of synthesis (e.g., the characteristics of chemical structures and retrosynthetic tree size), whereas the indefinable parts of synthesis (e.g., chemists’ intuition) and opportunities to contribute to creativity in retrosynthetic analysis have been left to chemists.[1] As an underlying chemists’ intuition, Corey formalized the concept of retrosynthesis (retrosynthesis knowledge) and major types of strategies (e.g., transform- and topology-based strategies). He stated that retrosynthetic analysis is most efficiently performed through the simultaneous use of as many different independent strategies as possible.[10] For the selection of optimal strategies, the chemists’ knowledge of chemistry and their experiments are essential; the optimal strategies for a particular synthesis problem depend on the molecules, persons, and situations involved (e.g., lead optimization and large-scale synthesis of drug candidates).[11] CASP approaches are generally classified into two types: knowledge-based[8,12] and data-driven approaches.[6,9] Knowledge-based approaches employ manually encoded (human-curated) transformations considering information, such as stereochemical and electronic effects.[8] For instance, one excellent knowledge-based CASP application, Chematica[8] (now rebranded as Synthia), provides a considerable discretion for chemists to perform retrosynthetic analysis based on their own ways of thinking using their own scoring functions (e.g., SMALLER, SELECTIVITY, and RINGS variables), and it is now used globally.[8,13,14] However, knowledge-based approaches still require the great efforts of many experts, as the number of new reaction types discovered per year has been in the low few thousands.[15] In contrast, 
data-driven CASP aims to automatically extract knowledge related to transformations from numerous reaction records to discover synthetic routes.[16] Recent breakthroughs in deep learning (DL),[17,18] along with the availability of reaction records[19,20] and open-source codes,[21−24] have improved the core techniques of data-driven CASP such as 1-step (retro)synthetic reaction prediction[25−28] and multistep synthetic route searches.[9,29−32] In the existing reaction prediction methods, various representations of molecules (e.g., fingerprints,[25] Simplified Molecular Input Line Entry System (SMILES) strings,[26,27,33] and graphs[28,34]) and their corresponding suitable DL techniques have been used, showing promising performance. Regarding search algorithms, the Monte Carlo tree search (MCTS),[9,24,35,36] depth-first proof number search,[31,37,38] and graph-based exploration methods[30,32] have been used to obtain possible synthetic routes efficiently. Several outstanding data-driven CASP applications are being used practically in industries and laboratories,[9,30,39] and these applications have led to a remarkable revival of interest in CASP research.[40−42] However, in the case of actual chemical synthesis, most data-driven CASP applications are lacking in their ability to reflect or support flexible adaptation to individual chemists’ ways of thinking. The search algorithms used in such applications depend on naive scoring functions for evaluating whether one synthetic position found during a search is preferable to another.[9,35] In addition, as for the 1-step retrosynthetic reaction prediction, large repositories of highly biased published reactions[19,20] prevent the data-driven approaches from acquiring the chemical knowledge sufficiently because imbalanced data training is inherently difficult for AI.[43,44] This implies that there are few opportunities to learn diverse strategies for retrosynthetic analysis. 
Moreover, the data-driven CASP approaches incorporating generally used various retrosynthesis knowledge have not been developed, and the effects of knowledge on search performance have yet to be investigated. In this study, we developed a data-driven CASP application integrated with rule-based techniques called “Retrosynthesis planning application using retrosynthesis knowledge (ReTReK),” which introduces retrosynthesis knowledge into the evaluation of promising search directions to obtain promising synthetic routes considering the knowledge of synthetic chemists. ReTReK is based on a data-driven framework of retrosynthetic reaction prediction by deep learning and path search by MCTS. To explicitly introduce retrosynthesis knowledge into ReTReK, referring to previous works,[4,7,8,13,14] we formulated four scores that aimed to explore the ideally shortest synthetic route or select a reaction that ideally provides only the desired product. A graph convolutional network (GCN) technique, which tolerates the biased reactions data set,[28] was used to build the retrosynthetic reaction prediction model. The Reaxys reaction database[19] was used to construct the ReTReK model. We evaluated the performance of ReTReK using drug-like molecules[45−50] for demonstrations and molecules from the ChEMBL database[51] for quantitative evaluations. We successfully demonstrated that synthetic routes designed using ReTReK with retrosynthesis knowledge were preferable to those designed without retrosynthesis knowledge. Furthermore, we quantitatively showed that retrosynthesis knowledge improved the performance when solving certain target molecules, and it successfully guided the search direction in MCTS. The ReTReK application is publicly available on GitHub at https://github.com/clinfo/ReTReK. 
The proposed concept of integrating retrosynthesis knowledge, in the form of adjustable parameters, into a data-driven CASP application is expected to enhance the performance of both existing data-driven CASP applications and those under development.

Results and Discussion

Construction of ReTReK

To implement a data-driven CASP application that can reflect retrosynthesis knowledge, ReTReK was constructed using MCTS, the GCN technique, and the four retrosynthesis knowledge scores introduced earlier (Figure ). When designing a synthetic route for a target molecule, the following three factors are described as basic points:[52] the construction of the required carbon skeleton considering regiochemistry and stereochemistry; ideally the shortest synthetic route to the molecule; reaction that ideally gives only the desired product in each step. Thus, we formulated the four scores: a convergent disconnection score (CDScore), an available substances score (ASScore), a ring disconnection score (RDScore), and a selective transformation score (STScore). The CDScore and ASScore are designed to favor convergent synthesis, which is an efficient strategy in multistep chemical synthesis. The RDScore is designed to reflect a ring construction strategy. This strategy is preferred if the target compound has complex ring structures because the construction of ring structures in a synthetic route tends to result in simple and easily available starting materials. The STScore is designed to reflect the number of possible products from a reaction because a synthetic reaction with few byproducts is preferred, considering the yield. Additionally, to handle stereochemistry and regiochemistry when applying a retrosynthetic reaction predicted by the GCN to a molecule, the Reactor in the ChemAxon API[53] was used. The basic MCTS algorithm comprises four steps: selection, expansion, rollout, and update. For the selection step, a tree policy is used to select a promising retrosynthetic tree position. The policy of the ReTReK also considers the retrosynthesis knowledge scores. A GCN-based model (Figure a) was used for the 1-step retrosynthetic reaction prediction as a policy network in the expansion and rollout steps. 
Reaxys reaction records[19] were used to train the model and to prepare the starting materials, and compounds obtained from the ZINC database[54] were also used as starting materials. By iterating through the four steps listed above, the retrosynthetic tree is expanded in an attempt to identify a promising synthetic route. ReTReK without the retrosynthesis knowledge scores can be regarded as a basic data-driven CASP model, similar to the approach proposed by Segler et al.[9] and AiZynthFinder,[24] although there are some minor differences in the policy networks and evaluation terms of the MCTS. Moreover, when comparing the performance of CASP approaches, it should be noted that the choice of reaction templates and starting materials affects the results.

Figure 1

Complete workflow of ReTReK. ReTReK combines a path-finding algorithm (MCTS) and GCN technique, and retrosynthesis knowledge is incorporated into the selection step of the MCTS procedure. The retrosynthesis knowledge is formalized using four scores: the CDScore, STScore, RDScore, and ASScore.

Figure 2

(a) Model architecture of the GCN-based policy network. (b) Top-n accuracies of the model for n values ranging from 1 to 1000. Specifically, the top-1, top-50, top-100, top-300, and top-500 accuracies are 0.361, 0.906, 0.938, 0.968, and 0.976, respectively.

Top-n Accuracies of the GCN-Based Policy Network

To determine the effective size for the expansion step in the MCTS procedure, the top-n accuracies (for n up to 1000) of the GCN-based policy network are calculated as shown in Figure b. The 1-step retrosynthetic reaction prediction model aimed to prioritize 19 633 reaction templates for application to an input molecule. In this study, we used reaction templates considering a reaction center, first-degree neighbors, and protecting groups because a previous study proposed this type of reaction template to maintain chemical integrity.[35] Accordingly, the top-1, top-50, top-100, top-300, and top-500 accuracies were found to be 0.361, 0.906, 0.938, 0.968, and 0.976, respectively. Beyond the top-500 accuracies, an increase in the prediction performance was not significant. Considering the results of a previous study[28] on reaction templates of different sizes, the prediction performance is assumed to be equivalent to or better than that of the previous template-based 1-step retrosynthetic reaction prediction model.[9] Based on the results for the top-n accuracies, we evaluated the effect of the MCTS expansion sizes and the retrosynthesis knowledge on the performance of solving for target molecules using the top 50, 100, 300, and 500 predicted templates.
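The top-n accuracy used above can be illustrated with a minimal sketch. It assumes that the policy network returns one score per reaction template and that each test example has a single correct template index; the function name `top_n_accuracy` and the toy data are hypothetical and are not taken from the ReTReK code.

```python
from typing import Sequence


def top_n_accuracy(scores: Sequence[Sequence[float]],
                   true_idx: Sequence[int],
                   n: int) -> float:
    """Fraction of examples whose true template is among the n highest-scored templates."""
    hits = 0
    for row, target in zip(scores, true_idx):
        # Rank template indices by descending score and keep the first n.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:n]
        hits += target in ranked
    return hits / len(true_idx)


# Toy example: three molecules, three candidate templates (hypothetical scores).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_truth = [1, 1, 0]

for n in (1, 2, 3):
    print(n, top_n_accuracy(toy_scores, toy_truth, n))
```

In this picture, a larger n corresponds to a larger expansion size in the MCTS: more candidate templates are tried per node, at the cost of a wider search tree.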

Effects of Expansion Sizes and Retrosynthesis Knowledge on the Performance of Solving for Target Molecules

Figure shows how the expansion sizes and retrosynthesis knowledge influenced the performance of solving for target molecules. The 161 molecules from the preprocessed ChEMBL data set were used as the target molecules. The searches were performed using different expansion sizes (50, 100, 300, and 500) and six retrosynthesis knowledge patterns: no retrosynthesis knowledge (no knowledge), the STScore, CDScore, ASScore, RDScore, and all the four retrosynthesis knowledge scores (all knowledge). In most of the cases, the solution performance was improved in proportion to the expansion size. However, the case of the STScore pattern and an expansion size of 100 resulted in a lower number of solved molecules than the case of the same pattern and an expansion size of 50. This result is attributed to a relative lack of MCTS iterations because of the increase in the expansion size. Regarding the retrosynthesis knowledge, all the knowledge patterns, except the STScore pattern, resulted in an increase in the number of solved molecules compared to the no-knowledge pattern. The CDScore pattern with an expansion size of 500 showed the best solution performance, yielding 90 solved molecules, whereas the no-knowledge pattern with the same expansion size resulted in 59 solved molecules. Although the STScore pattern resulted in fewer solved molecules than the no-knowledge pattern, this result was considered reasonable because the STScore focused on reactions with few byproducts. This often leads to strict conditions for retrosynthesis. To compare the ReTReK with the other data-driven CASP model, ASKCOS[23] was applied to the 161 molecules, and the solving performance was shown in Figure S1. This result shows that the solving performances of ReTReK without retrosynthesis knowledge and ASKCOS were comparable, which suggests that the retrosynthesis knowledge score may improve other data-driven CASP approaches.

Figure 3

Comparison of the numbers of solved molecules with different expansion sizes and retrosynthesis knowledge patterns. The gray, black, blue, orange, green, and red bars correspond to the no-knowledge, all-knowledge, STScore, CDScore, ASScore, and RDScore patterns, respectively.

Moreover, the search times required to find solutions were compared across the expansion sizes and the six retrosynthesis knowledge patterns, as shown in Figure S2. The search time increases with the expansion size because a larger expansion size enlarges the search space for MCTS. The median search times for expansion sizes of 50, 100, 300, and 500 are 32, 45, 133, and 294 s, respectively. The STScore pattern requires shorter search times than the no-knowledge pattern, whereas all the other knowledge patterns result in longer search times. These results suggest that synthetic routes can be identified more efficiently under the STScore pattern than under the other patterns, although the STScore pattern results in lower solution performance.

Effects of Retrosynthesis Knowledge on the Search Directions in MCTS

Figure a shows how the six retrosynthesis knowledge patterns influence the characteristics of the searched synthetic routes, in terms of four route scores (rSTScore, rCDScore, rASScore, and rRDScore). Each route score was defined as the average corresponding retrosynthesis knowledge score in each step of the searched synthetic route. For ease of comparison, each route score for each of the five knowledge patterns was standardized with respect to the corresponding score for the no-knowledge pattern. The standardized mean values of the rSTScore for the STScore pattern, rCDScore for the CDScore pattern, rASScore for the ASScore pattern, and rRDScore for the RDScore pattern were 0.178, 0.555, 0.130, and 0.309, respectively. All the values were positively shifted compared to the values for the no-knowledge pattern, indicating that all the four retrosynthesis knowledge scores successfully guided the search directions in the MCTS according to the characteristics of each type of knowledge. The CDScore pattern caused MCTS to select more transformation-oriented searches compared to the STScore pattern. The mean values of the CDScore and STScore were 0.299 and 0.178, respectively. A convergent-disconnection-oriented search is assumed to have more chances of splitting the reactive centers into divided molecules because the CDScore attempts to minimize the sizes of each divided molecule simultaneously. Figure b–d shows the parts of the exemplary synthetic routes found by the ReTReK using all-knowledge and no-knowledge patterns. Each case shows that the ReTReK with retrosynthesis knowledge successfully chooses the preferable retrosynthetic reactions than the ReTReK without the knowledge, in terms of the STScore, CDScore, ASScore, and RDScore. 
Considering that the all-knowledge pattern shows a higher rSTScore than the CDScore and STScore patterns, these results suggest the existence of synergistic effects of retrosynthesis knowledge; however, this hypothesis needs further analysis.
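The route scores and their standardization can be sketched as follows. This is a minimal illustration that assumes the route score is the plain arithmetic mean of the per-step knowledge scores and that standardization is a z-score against the no-knowledge routes; whether the sample or population standard deviation was used is not stated in the text, so the sample version here is an assumption. The function names and values are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def route_score(step_scores: Sequence[float]) -> float:
    """Average of one knowledge score over all steps of a synthetic route."""
    return mean(step_scores)


def standardize(values: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Z-scores of `values` relative to the mean and standard deviation of `baseline`.

    `baseline` stands for the route scores obtained with the no-knowledge pattern.
    Assumption: the sample standard deviation is used.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    return [(v - mu) / sigma for v in values]


# Hypothetical rCDScore values for routes found without and with the CDScore.
no_knowledge = [0.2, 0.4, 0.6]
with_cdscore = [route_score([0.5, 0.7]), route_score([0.4, 0.4])]

print(standardize(with_cdscore, no_knowledge))
```

A positive standardized mean, as reported for all four scores, indicates that routes found with a given knowledge score rank higher on that score than the no-knowledge baseline.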

Figure 4

Evaluation of the effects of retrosynthesis knowledge on the search directions of MCTS, in terms of the four route scores. Synthetic routes solved with an expansion size of 500 were used for this evaluation. (a) Each route score standardized based on the corresponding mean and standard deviation of the no-knowledge pattern. The gray, black, blue, orange, green, and red plots represent the standardized route scores for the no-knowledge, all-knowledge, STScore, CDScore, ASScore, and RDScore patterns, respectively. The rhombuses represent the mean values for each case, and the confidence intervals at the 95% confidence level are also shown. (b–d) Parts of the exemplary synthetic routes found by ReTReK using the all-knowledge and no-knowledge patterns. The circled symbol “S” indicates that a molecule is in the list of starting materials. All the synthetic routes are shown in Figure S3. (b) Example of parts of the routes showing the STScore effect. ReTReK with retrosynthesis knowledge chose a reaction with fewer reactive centers than ReTReK without the knowledge. (c) Example of parts of the routes showing the CDScore and ASScore effects. ReTReK with retrosynthesis knowledge guided the search toward a more convergent synthetic route than ReTReK without the knowledge. (d) Example of parts of the route showing the RDScore effect. ReTReK with retrosynthesis knowledge chose a ring-opening retrosynthetic reaction, whereas ReTReK without the knowledge found no routes.

Considering the results of Figure and Figure , the following strategy for adjusting the weight parameters can be considered. First, a search without retrosynthesis knowledge is recommended to establish a baseline result. If the target is not solved with these parameters, applying the CDScore is a reasonable choice because the CDScore showed the best solution performance (Figure ). Alternatively, if the target appears to require ring opening, applying the RDScore is a promising choice (Figure d). Additionally, if higher-quality synthetic routes are desired, applying the STScore and/or ASScore could improve the routes in line with their underlying concepts. The score weights should be adjusted gradually, using positive integers, to avoid combinatorial explosion.

Demonstrations of ReTReK for Drug-like Molecules

To demonstrate retrosynthesis planning using ReTReK, we applied ReTReK to six drug-like molecules in the all-knowledge and no-knowledge patterns, and the results are presented in this section and Figure S4. The detailed parameters used in these demonstrations were described in the Methods section. Figure panels a and b illustrate an exemplary retrosynthetic route to a molecule 1 known as a hepatitis B virus capsid inhibitor,[45] found by ReTReK with and without retrosynthesis knowledge. The exploration with retrosynthesis knowledge suggested a convergent route, successfully reflecting the specified knowledge scores. In this route, the target molecule 1 is disconnected into two main segments, iodophthalazinone 7 and pyridylboronic acid 12, which can be converted into 1 by the Suzuki coupling reaction. The key intermediates 7 and 12 could be retrosynthetically divided into three representative materials: hydroxyphthalazine 2, benzyl alcohol 3, and trihalogenated pyridine 8. Iodination of 2 and subsequent N-benzylation with p-iodobenzyl iodide 5, which can be obtained from 3, would provide 2-(iodobenzyl)phthalazin-1-one 6. A reaction of 6 with copper cyanide would provide the intermediate 7. The other intermediate 12 would be prepared from 8 by three-step sequences (i.e., chlorination, incorporation of a boronic acid moiety, and amination with aminoalcohol 11). In contrast, a straightforward route was presented through the exploration without retrosynthesis knowledge. Friedel–Crafts acylation of trihalopyridine 14 with 2-(chlorocarbonyl)benzoic acid (13) would provide 2-(pyridinecarbonyl)benzoic acid 15, which is further reacted with aminoalcohol 11 to afford tricyclic ketone 16. The construction of the phthalazine ring could be performed by the reaction of 16 with p-bromobenzylhydrazine (17) to yield the precursor 18. Finally, the introduction of a nitrile group into the benzyl group of 18 would provide the target molecule 1. 
Additional demonstrations for two other drug-like molecules, kwakhurin[47] and α7 nicotinic acetylcholine receptor silent agonist,[48] are shown in Figure c,d and Figure e,f, respectively. To confirm the difference in each step’s score between ReTReK with and without retrosynthesis knowledge, four retrosynthesis knowledge scores were added to each step in the synthetic routes (Figure S5).

Figure 5

Comparison of the synthetic routes for three target compounds ((a,b) hepatitis B virus capsid inhibitor,[45] (c,d) kwakhurin,[47] and (e,f) α7 nicotinic acetylcholine receptor silent agonist[48]) found by ReTReK with retrosynthesis knowledge (a, c, and e) and the corresponding routes found without retrosynthesis knowledge (b, d, and f).

Furthermore, the effectiveness of retrosynthesis knowledge was confirmed in several cases of retrosynthetic analyses. In these cases, ReTReK with retrosynthesis knowledge succeeded in finding retrosynthetic routes to the target molecules, whereas no route was found using ReTReK without retrosynthesis knowledge (Figure a). Figure a shows the retrosynthetic route to molecule 19, known as a Mycobacterium tuberculosis thymidylate kinase (MtbTMPK) inhibitor.[46] In the suggested route, 19 is disconnected at the center of the molecule, giving imidazo[1,2-a]pyridine-3-carboxamide 25 and 1-(piperidin-4-yl)pyrimidine-2,4-dione 29. Compound 25 would be obtained from dibromide 24 by selective SNAr reaction with an organometallic reagent such as ethylmagnesium bromide, which is prepared from ethyl bromide. Dibromide 24 can be obtained from benzylamide 22 by stepwise SEAr bromination. Amide 22 would be provided from imidazopyridine-3-carboxylic acid (20) and N-Boc-benzylamine (21). The other intermediate 29 would be synthesized from 4-iodopiperidine (27) and pyrimidine-2,4-dione 28 by SN2 reaction. Compound 27 would be easily prepared from N-Boc 26. From the viewpoint of practical synthesis, deprotection of the N-Boc group of the piperidine ring should be performed after N-alkylation of 28 with 26 to avoid oligomerization of 27 by self N-alkylation. Additional demonstrations for two other drug-like molecules, propolone[49] and an EGFR kinase inhibitor,[50] are shown in Figure b and Figure S4, respectively. Further demonstrations for the molecules reported by Segler et al.[9] are shown in Figure S6.
To evaluate the performance of ReTReK and another data-driven CASP approach on more advanced targets, ReTReK and ASKCOS were applied to 15 targets reported in previous research using Chematica.[13,14,55] In terms of solving performance, ReTReK found retrosynthetic routes for 11 targets, whereas ASKCOS found routes for 4 targets. However, the routes proposed by the two data-driven applications contained some steps for which it is doubtful whether they would actually proceed, and the solutions were not close to the mature ones proposed by Chematica (Figure S7). According to these results, finding sophisticated routes for advanced targets, as Chematica did, is still a challenging task for data-driven CASP applications.

Figure 6

For two target compounds (a, MtbTMPK inhibitor[46] and b, propolone[49]), synthetic routes were found using ReTReK with retrosynthesis knowledge, whereas no synthetic routes were found using ReTReK without retrosynthesis knowledge.

These results show that retrosynthesis knowledge effectively contributes to retrosynthetic analyses using ReTReK. However, experimental validations[56] with targets of interest to chemists and blind assessments[9] by trained chemists were not performed in this study; thus, we plan to perform these validations in future work to evaluate the performance of ReTReK on actual chemical synthesis. Considering these evaluations and demonstrations, the ReTReK framework with integrated retrosynthesis knowledge has the potential to further improve the performance of data-driven CASP applications.

Conclusions

We developed ReTReK, a data-driven CASP application integrated with rule-based techniques, and it can flexibly reflect and apply retrosynthesis knowledge. Through the evaluation of ReTReK with and without retrosynthesis knowledge, we showed that the integration of such knowledge into data-driven CASP applications helps improve their performance and enhance the quality of the explored synthetic routes. We expect the concept of ReTReK to contribute to the further developments and improvements in data-driven CASP applications. To allow for more realistic and preferable synthetic routes to be obtained in the future, we will address the further development of automatic reaction template extraction methods while maintaining the chemical integrity. In this study, orphan atoms (atoms appearing on only one side of the reaction arrow) were included in the reaction templates to automatically retain the protecting and leaving groups in the templates because these groups are often not recorded as reactants or products. Because such groups were manually defined in a previous study,[22] this template definition (considering the orphan atoms) is expected to contribute to the further development of automatic template extraction methods. In addition, we may define the additional retrosynthesis knowledge scores to allow ReTReK to represent the chemists’ ways of thinking more extensively than in the current model. Furthermore, to facilitate the use of ReTReK, we will prepare a user-friendly interactive interface with functions such as range sliders for adjusting each retrosynthesis knowledge score and other tools for displaying the explored synthetic routes.

Methods

Data Sets

To create the ReTReK model, Reaxys reaction records[19] and compounds from the ZINC 15 database[54] were used. To evaluate the performance of ReTReK, compounds were obtained from the ChEMBL 27 database[51] and the literature.[45−50]

Reaction Template Extraction

A set of approximately 50 million reaction records from Reaxys[19] (1795–2019) was used to construct the 1-step retrosynthetic reaction prediction model. The model was designed to take a target or an intermediate molecule as input and was trained to predict a suitable reaction template for the input molecule. The purpose of a reaction template is to represent a generalized chemical reaction. In this study, a reaction template consists of a reactive center, orphan atoms, and their first-degree neighbors. An orphan atom is one that appears on only one side of the reaction arrow in ChemAxon.[53] These atoms were identified using the Automapper in the ChemAxon API. The reaction template extraction procedure comprises four steps. Figure S8 shows the workflow of the reaction template extraction. In the first step, the reaction records were standardized by removing explicit hydrogen, aromatizing, and retaining the largest fragments. In the second step, the reaction records were filtered based on three conditions: (1) the reaction was required to consist of a single step, (2) it must have a product and up to three reactants, and (3) the number of heavy atoms in the product was limited to 50 or fewer. Thereafter, the number of remaining reaction records was 22 337 137. In the third step, the reaction templates were extracted from the reaction records, and sets consisting of a product and the corresponding reaction template were retained if the reaction template occurred at least 50 times. To prevent the occurrence of two or more fragments, a reaction template was retained only when all atoms on the product side of the template were connected. In the final step, the sets consisting of a product and the corresponding reaction template were filtered on condition that the reaction template could be reversibly applied to the product and derived reactants. 
On the basis of this requirement, 7 589 744 product-template sets remained, and the number of unique reaction templates was 19 633. Referring to a previous study,[57] a time-splitting strategy was employed to evaluate the neural network model performance. The sets published before 2017 were used for the training, and those published in 2017 and later were used for testing.
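The record-level filters, the template-frequency threshold, and the time split described above can be sketched in a few lines. This is a simplified illustration rather than the authors' pipeline: the `Record` fields are hypothetical, templates are assumed to be already extracted, and the connectivity and reversibility checks of the third and final steps are omitted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical, already-standardized reaction record."""
    product: str
    template: str
    n_steps: int
    n_reactants: int
    n_heavy_atoms: int  # heavy atoms in the product
    year: int


def passes_basic_filters(r: Record) -> bool:
    # Single-step reaction, one product with up to three reactants,
    # and at most 50 heavy atoms in the product.
    return r.n_steps == 1 and 1 <= r.n_reactants <= 3 and r.n_heavy_atoms <= 50


def filter_and_split(records: List[Record],
                     min_count: int = 50,
                     split_year: int = 2017) -> Tuple[List[Record], List[Record]]:
    kept = [r for r in records if passes_basic_filters(r)]
    # Keep only product-template sets whose template occurs at least min_count times.
    counts = Counter(r.template for r in kept)
    frequent = [r for r in kept if counts[r.template] >= min_count]
    # Time split: records before split_year for training, the rest for testing.
    train = [r for r in frequent if r.year < split_year]
    test = [r for r in frequent if r.year >= split_year]
    return train, test


toy = [
    Record("P1", "T1", 1, 2, 10, 2010),
    Record("P2", "T1", 1, 1, 20, 2018),
    Record("P3", "T2", 1, 2, 10, 2015),  # template too rare
    Record("P4", "T1", 2, 2, 10, 2016),  # multistep
    Record("P5", "T1", 1, 4, 10, 2016),  # too many reactants
    Record("P6", "T1", 1, 2, 60, 2016),  # product too large
]
train, test = filter_and_split(toy, min_count=2)
print([r.product for r in train], [r.product for r in test])
```

The toy call lowers `min_count` to 2 only so that the small example retains some records; the threshold stated in the text is 50.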

Preparation of Molecules for ReTReK Evaluations and Demonstrations

The molecules used for the ReTReK evaluations were obtained from ChEMBL 27[51] and preprocessed via the following procedures. First, molecules whose United States Adopted Name (USAN) years ranged from 2017 to 2019 and for which chemical structure records were available were selected, resulting in a total of 219 compounds. Thereafter, the compounds were preprocessed by removing explicit hydrogens, aromatizing, retaining the largest fragments, removing compounds with more than 50 atoms, and removing duplicates. The remaining 161 compounds were used for the evaluations (ChEMBL data set). For further evaluation of ReTReK, six drug-like compounds[45−50] were used for synthetic route search demonstrations.

Starting Materials

A set of compounds obtained from the ZINC database and the Reaxys reaction records was used as the starting materials. A subset of 100 023 building blocks from major suppliers (Sigma-Aldrich, Alfa Aesar, and Acros) was obtained from the ZINC database. From the Reaxys reaction records, 649 130 compounds recorded as reactants with at least five occurrences before 2017 were used. All the compounds were stored in the canonical SMILES format calculated using RDKit.[21]

MCTS for Retrosynthesis

MCTS has been implemented in various CASP studies based on the achievements of Segler et al.[9] MCTS is a search algorithm for exploring optimal solutions and comprises four steps: selection, expansion, rollout, and update.[58] Following Segler’s implementation,[9] a state consists of a set of molecules and is solved (the optimal solution) if all the molecules in the state are starting materials. In this study, retrosynthesis knowledge scores were incorporated into the evaluation term used in the selection step. The same policy network was used for both the expansion and rollout steps, similar to a previous study.[35]
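How knowledge scores might enter the selection step can be illustrated with a short sketch. The exact evaluation term used in ReTReK is not given in this section, so the form below is an assumption: a PUCT-style value (mean value plus a prior-weighted exploration bonus) to which a weighted sum of knowledge scores is added. The names `Node`, `selection_value`, and `select_child` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    prior: float                      # probability from the policy network
    visits: int = 0
    total_value: float = 0.0          # accumulated rollout rewards
    knowledge: Dict[str, float] = field(default_factory=dict)  # e.g. {"CDScore": 0.5}
    children: List["Node"] = field(default_factory=list)


def selection_value(parent: Node, child: Node,
                    weights: Dict[str, float], c: float = 1.0) -> float:
    """Assumed evaluation term: exploitation + exploration + weighted knowledge scores."""
    q = child.total_value / child.visits if child.visits else 0.0
    u = c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    k = sum(w * child.knowledge.get(name, 0.0) for name, w in weights.items())
    return q + u + k


def select_child(parent: Node, weights: Dict[str, float], c: float = 1.0) -> Node:
    """Pick the child with the highest evaluation term."""
    return max(parent.children, key=lambda ch: selection_value(parent, ch, weights, c))


# Two otherwise identical children; only their CDScore differs.
a = Node(prior=0.5, visits=1, total_value=0.5, knowledge={"CDScore": 0.0})
b = Node(prior=0.5, visits=1, total_value=0.5, knowledge={"CDScore": 1.0})
root = Node(prior=1.0, visits=4, children=[a, b])

print(select_child(root, {"CDScore": 1.0}) is b)
```

Setting all weights to zero recovers a search without retrosynthesis knowledge, which matches the idea of treating the scores as adjustable parameters.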

Retrosynthesis Knowledge Used in ReTReK

We define four scores representing four types of retrosynthesis knowledge, namely, the CDScore, ASScore, RDScore, and STScore, inspired by previous studies.[4,7,8]

Convergent Disconnection Score

The CDScore is designed to favor convergent synthesis, which is known to be an efficient strategy in multistep chemical synthesis. The CDScore is calculated by evaluating how equally a product is divided among the reactants of a reaction {R1 + R2 + ... + Rn → P}, where each R is a reactant and P denotes the product. In this calculation, a(P) and a(R) represent the number of atoms in the product and reactant, respectively, and MAE is the mean absolute error.

Available Substances Score

The ASScore, which serves a similar purpose as the CDScore, is defined to reflect the number of available substances generated in a reaction step. It is calculated from b(S) and b(R), which represent the numbers of available substances (starting materials) and reactants, respectively.

Ring Disconnection Score

A ring construction strategy is preferred if the target compound has complex ring structures because the construction of ring structures in a synthetic route tends to result in simple and easily available starting materials. The RDScore is calculated by checking whether ring construction occurs in a reaction step. In this calculation, d(P) and d(Ri) represent the number of rings in the product and reactant, respectively.

Selective Transformation Score

A synthetic reaction with few byproducts is preferred, considering the yield. To reflect the number of possible products from a reaction, the STScore is calculated by focusing on the number of reactive centers in the reactants. In this calculation, e(∑R) represents the number of product patterns that can be enumerated using the reactants and a given reaction template.
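The four scores can be sketched in code. The exact functional forms below are assumptions chosen to match the verbal descriptions (an inverse of the MAE against an equal split for the CDScore, a ratio for the ASScore, a binary indicator for the RDScore, and a reciprocal for the STScore); they are not confirmed by the text, and all function names are hypothetical.

```python
def cd_score(product_atoms, reactant_atoms):
    """Assumed form: 1 / (1 + MAE), with the MAE taken between each reactant's
    atom count a(R) and an equal share a(P) / n of the product."""
    n = len(reactant_atoms)
    equal_share = product_atoms / n
    mae = sum(abs(equal_share - a) for a in reactant_atoms) / n
    return 1.0 / (1.0 + mae)


def as_score(num_available, num_reactants):
    """Assumed form: fraction of reactants that are available substances, b(S) / b(R)."""
    return num_available / num_reactants


def rd_score(product_rings, reactant_rings):
    """Assumed form: 1 if the product has more rings than all reactants combined."""
    return 1.0 if product_rings > sum(reactant_rings) else 0.0


def st_score(num_possible_products):
    """Assumed form: reciprocal of the number of enumerated product patterns."""
    return 1.0 / num_possible_products
```

Under these assumptions every score lies in the interval from 0 to 1, with higher values indicating a disconnection that better matches the corresponding strategy. For example, splitting a 20-atom product into two 10-atom reactants gives the maximum assumed CDScore of 1.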

Policy Network

In this study, a policy network is a template-based retrosynthetic reaction prediction model, and the same model is used in both the expansion and rollout steps. We employed a GCN model (a promising model for retrosynthesis found in a previous study[28]) as the retrosynthetic reaction prediction model. The model was trained using the data set prepared as described in the reaction template extraction section and comprised three graph convolutional layers with Leaky ReLU activation and a dropout ratio of 0.3, a graph-dense layer with Leaky ReLU activation, a graph-gather layer with hyperbolic tangent activation, and a dense layer with softmax activation. To confirm the effect of the expansion sizes on the MCTS performance, the top-n accuracies were calculated for n values in the range from 1 to 1000. To implement this model, the graph-based deep learning framework kGCN[59] was used.

Selection

Starting from the root node, a tree policy was recursively applied to select the subsequent action, meaning that the simulation descended through the search tree gradually until an unvisited node with a nonterminal state was reached. The tree policy was based on the upper confidence bound (UCB) score, and retrosynthesis knowledge was incorporated into the policy as an additional weighted term. Here, w represents the weights, with values of w1 = 5.0, w2 = 0.5, w3 = 2.0, and w4 = 2.0, and n denotes the number of retrosynthesis knowledge scores used in a search (e.g., n is four if all four types of retrosynthesis knowledge are used). Q denotes an action value calculated in the update step; N and N–1 are the visit counts of the child and parent nodes, respectively; c denotes a constant value that is set to 10; P denotes the softmax probability obtained from the policy network; and K represents the mean of the retrosynthesis knowledge scores.
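A minimal sketch of such a selection score follows. The combination used here, an exploitation term Q/N, a prior-weighted exploration term, and the mean of the weighted knowledge scores, is an assumption based on the symbol definitions above and on common UCB variants; it is not confirmed as the exact formula. The function names and the score-name keys are hypothetical, and the weights are those listed per score for the evaluation experiments.

```python
import math

# Weights keyed by score name (an illustrative convention, not from the text).
WEIGHTS = {"CD": 5.0, "AS": 2.0, "RD": 0.5, "ST": 2.0}


def knowledge_term(scores, weights=WEIGHTS):
    """Mean of the weighted knowledge scores over the n scores in use."""
    if not scores:
        return 0.0  # the no-knowledge pattern contributes nothing
    return sum(weights[name] * value for name, value in scores.items()) / len(scores)


def selection_score(q, n_child, n_parent, prior, scores, c=10.0):
    """Assumed UCB-style score with an added knowledge term."""
    exploit = q / n_child if n_child > 0 else 0.0
    explore = c * prior * math.sqrt(n_parent) / (1 + n_child)
    return exploit + explore + knowledge_term(scores)


# Toy call: a child visited 4 times under a parent visited 16 times.
example = selection_score(q=2.0, n_child=4, n_parent=16, prior=0.1,
                          scores={"CD": 1.0, "RD": 0.0})
```

Because the knowledge term is simply added to the UCB score, changing a weight shifts the preference among sibling nodes without altering how action values or visit counts are accumulated.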

Expansion

Child nodes, the states of which are selected by the policy network, are added to the node selected by the tree policy. On the basis of the top-n accuracies of the policy network, the top 50, 100, 300, and 500 reaction templates were selected in independent trials, and they were filtered on the condition that the reaction templates could be successfully applied to an unsolved molecule in the state of the selected node.
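The expansion step can be sketched as a top-n ranking followed by an applicability filter. The function name `expand` and the `is_applicable` callback are hypothetical stand-ins for the policy network output and the template application check.

```python
def expand(probabilities, n, is_applicable):
    """Return the indices of the top-n templates that can actually be applied.

    probabilities: softmax outputs of the policy network, one per template.
    is_applicable: hypothetical callback, True if the template index applies
                   to the unsolved molecule of the selected node.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [i for i in ranked[:n] if is_applicable(i)]


# Toy example: four templates, template 2 cannot be applied to the molecule.
children = expand([0.1, 0.5, 0.3, 0.1], n=2, is_applicable=lambda i: i != 2)
```

The filter runs after the top-n cut, so a larger expansion size raises the upper bound on the number of child nodes but does not guarantee that many children.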

Rollout

A simulation is implemented using the policy network if the state of a node is neither proven nor terminal.[9] During the simulation, the following steps are repeated up to five times: an unresolved molecule (one not included in the starting materials) is randomly sampled from the state, the top 10 reaction templates for the molecule are obtained from the policy network, and a randomly sampled reaction template is applied to the molecule. At the end of each step, it is checked whether the state is proven. A reward function, r, returns one of three values as reward z, depending on the simulation result. If the state is already proven before the simulation starts, the reward is 10; if the state is terminal, the reward is −1. Otherwise, after the simulation, the reward is equal to the ratio of the number of resolved molecules in the state to the total number of molecules.
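The rollout and its three-valued reward can be sketched as follows. The helper callbacks `top_templates` and `apply_template`, the string-based toy molecules, and the early exit when no template is available are assumptions made for illustration; only the step limit, the top-10 sampling, and the reward values follow the description above.

```python
import random


def is_proven(state, stock):
    """A state is proven if every molecule is a starting material."""
    return all(mol in stock for mol in state)


def rollout(state, stock, top_templates, apply_template,
            terminal=False, max_steps=5, rng=random):
    if is_proven(state, stock):
        return 10.0   # already proven before the simulation
    if terminal:
        return -1.0   # terminal state, no simulation is run
    state = list(state)
    for _ in range(max_steps):
        unresolved = [mol for mol in state if mol not in stock]
        mol = rng.choice(unresolved)
        templates = top_templates(mol, 10)
        if not templates:
            break     # assumption: stop when nothing can be applied
        precursors = apply_template(rng.choice(templates), mol)
        state.remove(mol)
        state.extend(precursors)
        if is_proven(state, stock):
            break
    resolved = sum(1 for mol in state if mol in stock)
    return resolved / len(state)


# Toy chemistry: each rule maps a molecule to its precursors.
RULES = {"AB": ["A", "B"], "AX": ["A", "X"]}
STOCK = {"A", "B"}


def toy_top_templates(mol, k):
    return ["t"] if mol in RULES else []


def toy_apply(template, mol):
    return RULES[mol]
```

In this toy setting, `rollout(["AB"], STOCK, toy_top_templates, toy_apply)` resolves both precursors and yields a ratio of 1, whereas starting from `["AX"]` leaves one unresolved molecule and yields 0.5.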

Update

The reward obtained from the rollout step is backpropagated through the selected nodes to update their action values Q. On the basis of a previous study,[9] the value of Q is defined in terms of the following quantities: Lmax denotes the maximal branch length and is set to 10, L denotes the current branch length, and ∑P denotes the sum of the softmax probabilities of the reaction templates in the selected nodes.

Evaluating the Effects of Expansion Sizes and Retrosynthesis Knowledge on MCTS Solution Performance

To investigate the effect of expansion sizes on the performance of MCTS in solving for target molecules, both the number of solved molecules in the ChEMBL data set and the times required to solve the molecules were compared for different expansion sizes and six retrosynthesis knowledge patterns. The expansion sizes were 50, 100, 300, and 500 and were determined by the policy network’s top-n accuracies. The six knowledge patterns were as follows: no retrosynthesis knowledge (no knowledge), the CDScore, ASScore, RDScore, STScore, and all four retrosynthesis knowledge scores (all knowledge). In these experiments, the maximum number of iterations was set to 500 and the score weights for the CDScore, ASScore, RDScore, and STScore were fixed to 5.0, 2.0, 0.5, and 2.0, respectively.

Evaluating the Effects of Retrosynthesis Knowledge on the Search Directions in MCTS

To quantify the effects of the retrosynthesis knowledge on the search directions in MCTS, we defined a route score as the average value of the corresponding retrosynthesis knowledge scores in each step of a solved synthetic route. We calculated four types of route scores (rCDScore, rASScore, rRDScore, and rSTScore) for the solved synthetic routes under the corresponding retrosynthesis knowledge patterns. For comparisons, each route score for the five retrosynthesis knowledge patterns was standardized based on the corresponding mean and standard deviation for the no-knowledge pattern. In these experiments, the synthetic routes solved under the condition of an expansion size of 500 were used. The maximum number of iterations was set to 500, and the score weights for the CDScore, ASScore, RDScore, and STScore were fixed to 5.0, 2.0, 0.5, and 2.0, respectively.
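The route score and its standardization can be sketched as follows. The use of the population standard deviation is an assumption (the text does not state which estimator was used), and the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def route_score(step_scores):
    """Average of one knowledge score over the steps of a solved route."""
    return mean(step_scores)


def standardize(route_scores, baseline_route_scores):
    """Z-scores relative to the routes found with no knowledge.

    Assumes the baseline scores are not all identical (nonzero deviation).
    """
    mu = mean(baseline_route_scores)
    sigma = pstdev(baseline_route_scores)
    return [(score - mu) / sigma for score in route_scores]


# Toy values: route scores under no knowledge and under one knowledge pattern.
baseline = [0.2, 0.4, 0.6]
with_knowledge = [0.4, 0.6]
z_scores = standardize(with_knowledge, baseline)
```

A positive standardized value indicates a route whose steps score higher on that type of knowledge than the average route found without it.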

Data and Software Availability

The ReTReK application is publicly available on GitHub at https://github.com/clinfo/ReTReK under the MIT License. The application is distributed with a model based on the US Patent data set (10.6084/m9.figshare.5104873.v1) because Reaxys is a commercial database that cannot be provided to the public. All the compounds used for the evaluations are available at https://github.com/clinfo/ReTReK/tree/master/data/evaluation_compounds. The README file in the GitHub repository provides information about how to set up and use the application.

40 in total

Review 1. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962