Shoichi Ishida1, Kei Terayama2,3, Ryosuke Kojima3, Kiyosei Takasu1, Yasushi Okuno3,4. 1. Graduate School of Pharmaceutical Sciences, Kyoto University, 46-29 Yoshidashimo-Adachicho, Sakyo-ku 606-8501, Kyoto, Japan. 2. Graduate School of Medical Life Science, Yokohama City University, 1-7-29, Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Kanagawa, Japan. 3. Graduate School of Medicine, Kyoto University, 53 Shogoin-Kawaharacho, Sakyo-ku 606-8507, Kyoto, Japan. 4. HPC- and AI-driven Drug Development Platform Division, RIKEN Center for Computational Science, 7-1-26, Minatojima-minami-machi, Chuo-ku, Kobe 650-0047, Hyogo, Japan.
Abstract
Computer-aided synthesis planning (CASP) aims to assist chemists in performing retrosynthetic analysis, for which they draw on their experience, intuition, and knowledge. Recent breakthroughs in machine learning (ML) techniques, including deep neural networks, have significantly improved data-driven synthetic route design without human intervention. However, learning chemical knowledge by ML for practical synthesis planning has not yet been adequately achieved and remains a challenging problem. In this study, we developed "ReTReK," a data-driven CASP application that integrates various kinds of retrosynthesis knowledge by introducing the knowledge as adjustable parameters into the evaluation of promising search directions. The experimental results showed that ReTReK successfully searched synthetic routes based on the specified retrosynthesis knowledge, indicating that the synthetic routes searched with the knowledge were preferred to those without the knowledge. The concept of integrating retrosynthesis knowledge as adjustable parameters into a data-driven CASP application is expected to enhance the performance of both existing data-driven CASP applications and those under development.
Since
the 1960s, various computer-aided synthesis planning (CASP)
applications have been developed to emulate chemists’ thinking
and help organic synthesis chemists in their work.[1−9] CASP applications have played an important role in the definable
parts of synthesis (e.g., the characteristics of chemical structures
and retrosynthetic tree size), whereas the indefinable parts of synthesis
(e.g., chemists’ intuition) and opportunities to contribute
to creativity in retrosynthetic analysis have been left to chemists.[1] As a basis for chemists’ intuition,
Corey formalized the concept of retrosynthesis (retrosynthesis knowledge)
and major types of strategies (e.g., transform- and topology-based
strategies). He stated that retrosynthetic analysis is most efficiently
performed through the simultaneous use of as many different independent
strategies as possible.[10] For the selection
of optimal strategies, the chemists’ knowledge of chemistry
and their experiments are essential; the optimal strategies for a
particular synthesis problem depend on the molecules, persons, and
situations involved (e.g., lead optimization and large-scale synthesis
of drug candidates).[11]
CASP approaches
are generally classified into two types: knowledge-based[8,12] and data-driven approaches.[6,9] Knowledge-based approaches
employ manually encoded (human-curated) transformations considering
information, such as stereochemical and electronic effects.[8] For instance, one excellent knowledge-based CASP
application, Chematica[8] (now rebranded
as Synthia), gives chemists considerable discretion to perform
retrosynthetic analysis based on their own ways of thinking using
their own scoring functions (e.g., SMALLER, SELECTIVITY, and RINGS
variables), and it is now used globally.[8,13,14] However, knowledge-based approaches still require
great effort from many experts, as the number of new reaction types
discovered per year has been in the low thousands.[15]
In contrast, data-driven CASP aims to
automatically extract knowledge
related to transformations from numerous reaction records to discover
synthetic routes.[16] Recent breakthroughs
in deep learning (DL),[17,18] along with the availability of
reaction records[19,20] and open-source codes,[21−24] have improved the core techniques of data-driven CASP such as 1-step
(retro)synthetic reaction prediction[25−28] and multistep synthetic route
searches.[9,29−32] In the existing reaction prediction
methods, various representations of molecules (e.g., fingerprints,[25] Simplified Molecular Input Line Entry System
(SMILES) strings,[26,27,33] and graphs[28,34]) and their corresponding suitable
DL techniques have been used, showing promising performance. Regarding
search algorithms, the Monte Carlo tree search (MCTS),[9,24,35,36] depth-first proof number search,[31,37,38] and graph-based exploration methods[30,32] have been used to obtain possible synthetic routes efficiently.
Several outstanding data-driven CASP applications are being used practically
in industries and laboratories,[9,30,39] and these applications have led to a remarkable revival of interest
in CASP research.[40−42]
However, in the case of actual chemical synthesis,
most data-driven
CASP applications are lacking in their ability to reflect or support
flexible adaptation to individual chemists’ ways of thinking.
The search algorithms used in such applications depend on naive scoring
functions for evaluating whether one synthetic position found during
a search is preferable to another.[9,35] In addition,
as for the 1-step retrosynthetic reaction prediction, large repositories
of highly biased published reactions[19,20] prevent the
data-driven approaches from acquiring the chemical knowledge sufficiently
because training on imbalanced data is inherently difficult for AI.[43,44] This implies that there are few opportunities to learn diverse strategies
for retrosynthetic analysis. Moreover, data-driven CASP approaches
that incorporate a variety of commonly used retrosynthesis knowledge have
not been developed, and the effects of such knowledge on search performance
have yet to be investigated.
In this study, we developed a data-driven
CASP application integrated
with rule-based techniques called “Retrosynthesis planning
application using retrosynthesis knowledge (ReTReK),” which
introduces retrosynthesis knowledge into the evaluation of promising
search directions to obtain promising synthetic routes considering
the knowledge of synthetic chemists. ReTReK is based on a data-driven
framework of retrosynthetic reaction prediction by deep learning and
path search by MCTS. To explicitly introduce retrosynthesis knowledge
into ReTReK, referring to previous works,[4,7,8,13,14] we formulated four scores that aimed to explore the
ideally shortest synthetic route or select a reaction that ideally
provides only the desired product. A graph convolutional network (GCN)
technique, which tolerates the biased reactions data set,[28] was used to build the retrosynthetic reaction
prediction model. The Reaxys reaction database[19] was used to construct the ReTReK model. We evaluated the
performance of ReTReK using drug-like molecules[45−50] for demonstrations and molecules from the ChEMBL database[51] for quantitative evaluations. We successfully
demonstrated that synthetic routes designed using ReTReK with retrosynthesis
knowledge were preferable to those designed without retrosynthesis
knowledge. Furthermore, we quantitatively showed that retrosynthesis
knowledge improved the performance when solving certain target molecules,
and it successfully guided the search direction in MCTS. The ReTReK
application is publicly available on GitHub at https://github.com/clinfo/ReTReK. The proposed concept of integrating retrosynthesis knowledge, in
the form of adjustable parameters, into a data-driven CASP application
is expected to enhance the performance of both existing data-driven
CASP applications and those under development.
Results and Discussion
Construction of ReTReK
To implement a data-driven CASP
application that can reflect retrosynthesis knowledge, ReTReK was
constructed using MCTS, the GCN technique, and the four retrosynthesis
knowledge scores introduced earlier (Figure ). When designing a synthetic route for a
target molecule, the following three factors are described as basic
points:[52] (1) construction of the required
carbon skeleton considering regiochemistry and stereochemistry; (2) ideally,
the shortest synthetic route to the molecule; and (3) reactions that ideally
give only the desired product in each step. Thus, we formulated the
four scores: a convergent disconnection score (CDScore), an available
substances score (ASScore), a ring disconnection score (RDScore),
and a selective transformation score (STScore). The CDScore and ASScore
are designed to favor convergent synthesis, which is an efficient
strategy in multistep chemical synthesis. The RDScore is designed
to reflect a ring construction strategy. This strategy is preferred
if the target compound has complex ring structures because the construction
of ring structures in a synthetic route tends to result in simple
and easily available starting materials. The STScore is designed to
reflect the number of possible products from a reaction because a
synthetic reaction with few byproducts is preferred, considering the
yield. Additionally, to handle stereochemistry and regiochemistry
when applying a retrosynthetic reaction predicted by the GCN to a
molecule, the Reactor in the ChemAxon API[53] was used. The basic MCTS algorithm comprises four steps: selection,
expansion, rollout, and update. For the selection step, a tree policy
is used to select a promising retrosynthetic tree position. The tree policy
of ReTReK also considers the retrosynthesis knowledge scores.
A GCN-based model (Figure a) was used for the 1-step retrosynthetic reaction prediction
as a policy network in the expansion and rollout steps. Reaxys reaction
records[19] were used to train the model
and to prepare the starting materials, and the compounds obtained
from the ZINC database[54] were used as the
starting materials. By iterating through the four steps listed above,
a retrosynthetic tree is expanded, thus attempting to identify a promising
synthetic route. ReTReK without the retrosynthesis knowledge scores
can be regarded as a basic data-driven CASP model, such as the approach
proposed by Segler et al.[9] and AiZynthFinder,[24] although there are some minor differences in
the policy networks and evaluation terms of the MCTS. Moreover, when
comparing the performance of CASP approaches, it should be noted
that the choice of reaction templates and starting materials
affects the results.
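The selection step described above can be illustrated with a minimal sketch. It is not the ReTReK implementation: the UCT-style exploration term, the additive weighted sum of knowledge scores, the `Node` class, and all field and weight names are assumptions introduced here for illustration. The text states only that the tree policy also considers the four retrosynthesis knowledge scores as adjustable parameters.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One position in the retrosynthetic tree (hypothetical structure)."""
    prior: float                 # policy-network probability of the template
    visits: int = 0
    total_reward: float = 0.0
    # Knowledge scores of the step leading to this node (assumed to lie in [0, 1]).
    knowledge: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def selection_value(parent, child, weights, c=1.0):
    """UCT-style value plus a weighted sum of knowledge scores (assumed form)."""
    exploit = child.total_reward / child.visits if child.visits else 0.0
    explore = c * child.prior * math.sqrt(parent.visits + 1) / (1 + child.visits)
    bonus = sum(w * child.knowledge.get(name, 0.0) for name, w in weights.items())
    return exploit + explore + bonus


def select_child(parent, weights):
    """Selection step: pick the child with the highest selection value."""
    return max(parent.children, key=lambda ch: selection_value(parent, ch, weights))


# Two candidate disconnections with equal priors; only the CDScore differs.
linear = Node(prior=0.5, knowledge={"CDScore": 0.1})
convergent = Node(prior=0.5, knowledge={"CDScore": 0.9})
root = Node(prior=1.0, visits=1, children=[linear, convergent])

no_knowledge = {}            # all weights zero: plain data-driven search
cd_only = {"CDScore": 1.0}   # positive weight favors convergent disconnections
chosen = select_child(root, cd_only)
```

With `cd_only`, the convergent disconnection is selected; with `no_knowledge`, the two candidates tie and the knowledge scores play no role, which mirrors the no-knowledge pattern used as the baseline.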
Figure 1
Complete workflow of ReTReK. ReTReK combines
a path-finding algorithm
(MCTS) and GCN technique, and retrosynthesis knowledge is incorporated
into the selection step of the MCTS procedure. The retrosynthesis
knowledge is formalized using four scores: the CDScore, STScore, RDScore,
and ASScore.
Figure 2
(a) Model architecture of the GCN-based policy
network. (b) Top-n accuracies of the model for n values
ranging from 1 to 1000. Specifically, the top-1, top-50, top-100,
top-300, and top-500 accuracies are 0.361, 0.906, 0.938, 0.968, and
0.976, respectively.
Top-n Accuracies of the GCN-Based Policy Network
To determine the effective
size for the expansion step in the MCTS
procedure, the top-n accuracies (for n up to 1000) of the GCN-based policy network were calculated, as shown
in Figure b. The 1-step
retrosynthetic reaction prediction model aims to prioritize 19 633
reaction templates for application to an input molecule. In this study,
we used reaction templates considering a reaction center, first-degree
neighbors, and protecting groups because a previous study proposed
this type of reaction template to maintain chemical integrity.[35] Accordingly, the top-1, top-50, top-100, top-300,
and top-500 accuracies were found to be 0.361, 0.906, 0.938, 0.968,
and 0.976, respectively. Beyond top-500, the increase
in prediction accuracy was not significant. Considering the
results of a previous study[28] on reaction
templates of different sizes, the prediction performance is assumed
to be equivalent to or better than that of the previous template-based
1-step retrosynthetic reaction prediction model.[9] Based on the results for the top-n accuracies,
we evaluated the effect of the MCTS expansion sizes and the retrosynthesis
knowledge on the performance of solving for target molecules using
the top 50, 100, 300, and 500 predicted templates.
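For reference, the top-n accuracy used here is the fraction of test molecules whose recorded template appears among the n highest-ranked predictions. The following minimal sketch computes it on toy data; the function name and the template IDs are hypothetical and are not taken from the ReTReK code base.

```python
def top_n_accuracy(ranked_predictions, true_labels, n):
    """Fraction of samples whose true template is among the first n predictions."""
    if not true_labels:
        raise ValueError("no samples")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:n]
    )
    return hits / len(true_labels)


# Toy data: for each molecule, template IDs sorted by predicted score.
ranked = [
    [3, 7, 1],
    [2, 3, 9],
    [5, 4, 8],
    [6, 0, 2],
]
truth = [3, 9, 4, 1]  # recorded template for each molecule

top1 = top_n_accuracy(ranked, truth, 1)
top2 = top_n_accuracy(ranked, truth, 2)
top3 = top_n_accuracy(ranked, truth, 3)
```

Because the accuracy can only grow with n, the curve flattens once most recorded templates are already covered, which is the reason for limiting the expansion size.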
Effects of Expansion Sizes and Retrosynthesis Knowledge on the Performance of Solving for Target Molecules
Figure shows how the expansion sizes
and retrosynthesis knowledge influenced the performance of solving
for target molecules. The 161 molecules from the preprocessed ChEMBL
data set were used as the target molecules. The searches were performed
using different expansion sizes (50, 100, 300, and 500) and six retrosynthesis
knowledge patterns: no retrosynthesis knowledge (no knowledge), the
STScore, CDScore, ASScore, RDScore, and all the four retrosynthesis
knowledge scores (all knowledge). In most cases, the number of solved
molecules increased with the expansion size. However,
the STScore pattern with an expansion size of 100 resulted
in fewer solved molecules than the same pattern
with an expansion size of 50. This result is attributed to a relative
lack of MCTS iterations because of the increase in the expansion size.
Regarding the retrosynthesis knowledge, all the knowledge patterns,
except the STScore pattern, resulted in an increase in the number
of solved molecules compared to the no-knowledge pattern. The CDScore
pattern with an expansion size of 500 showed the best solution performance,
yielding 90 solved molecules, whereas the no-knowledge pattern with
the same expansion size resulted in 59 solved molecules. Although
the STScore pattern resulted in fewer solved molecules than the no-knowledge
pattern, this result was considered reasonable because the STScore
focused on reactions with few byproducts. This often leads to strict
conditions for retrosynthesis. To compare ReTReK with another
data-driven CASP model, ASKCOS[23] was applied
to the 161 molecules; the solving performance is shown in Figure S1. This result shows that the solving
performances of ReTReK without retrosynthesis knowledge and ASKCOS
were comparable, which suggests that the retrosynthesis knowledge
scores may also improve other data-driven CASP approaches.
Figure 3
Comparison of the numbers
of solved molecules with different expansion
sizes and retrosynthesis knowledge patterns. The gray, black, blue,
orange, green, and red bars correspond to the no-knowledge, all-knowledge,
STScore, CDScore, ASScore, and RDScore patterns, respectively.
Moreover, the search times necessary for solution
are compared
between the different expansion sizes and six retrosynthesis knowledge
patterns as shown in Figure S2. The search
time increases in proportion to the expansion size because an increase
in the expansion size expands the search space for MCTS. The median
search times for expansion sizes of 50, 100, 300, and 500 are 32,
45, 133, and 294 s, respectively. The STScore pattern requires shorter
search times than the no-knowledge pattern, whereas all the other
knowledge patterns result in longer search times.
These results suggest that synthetic routes can be more efficiently
identified under the STScore pattern than the other patterns, although
the STScore pattern results in lower solution performance.
Effects of Retrosynthesis Knowledge on the Search Directions in MCTS
Figure a shows how the six retrosynthesis knowledge patterns influence the
characteristics of the searched synthetic routes, in terms of four
route scores (rSTScore, rCDScore, rASScore, and rRDScore). Each route
score was defined as the average corresponding retrosynthesis knowledge
score in each step of the searched synthetic route. For ease of comparison,
each route score for each of the five knowledge patterns was standardized
with respect to the corresponding score for the no-knowledge pattern.
The standardized mean values of the rSTScore for the STScore pattern,
rCDScore for the CDScore pattern, rASScore for the ASScore pattern,
and rRDScore for the RDScore pattern were 0.178, 0.555, 0.130, and
0.309, respectively. All the values were positively shifted compared
to the values for the no-knowledge pattern, indicating that all the
four retrosynthesis knowledge scores successfully guided the search
directions in the MCTS according to the characteristics of each type
of knowledge. The CDScore pattern caused MCTS to select more selective-transformation-oriented
searches than the STScore pattern: the standardized mean rSTScore values for the CDScore
and STScore patterns were 0.299 and 0.178, respectively. A convergent-disconnection-oriented
search is assumed to have more chances of splitting the reactive centers
into divided molecules because the CDScore attempts to minimize the
sizes of each divided molecule simultaneously. Figure b–d shows the parts of the exemplary
synthetic routes found by ReTReK using the all-knowledge and no-knowledge
patterns. Each case shows that ReTReK with retrosynthesis knowledge
chooses more preferable retrosynthetic reactions than
ReTReK without the knowledge, in terms of the STScore, CDScore,
ASScore, and RDScore. Considering that the all-knowledge pattern shows
a higher rSTScore than the CDScore and STScore patterns, these results
suggest the existence of synergistic effects of retrosynthesis knowledge;
however, this hypothesis needs further analysis.
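The route scores and their standardization can be written out as a short sketch. A route score is the mean of one knowledge score over the steps of a route, and each value is then expressed relative to the mean and standard deviation of the no-knowledge routes. The function names and toy values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def route_score(step_scores):
    """Average of one knowledge score over the steps of a single route."""
    return mean(step_scores)


def standardize(route_scores, baseline_scores):
    """Express route scores relative to the no-knowledge mean and SD."""
    mu = mean(baseline_scores)
    sigma = stdev(baseline_scores)  # sample SD (assumption)
    return [(s - mu) / sigma for s in route_scores]


# Toy per-step STScore values for routes found without and with knowledge.
baseline_routes = [[0.2, 0.4], [0.5, 0.5], [0.6, 0.8]]
knowledge_routes = [[0.6, 0.8], [0.9, 0.9]]

baseline = [route_score(r) for r in baseline_routes]
with_knowledge = [route_score(r) for r in knowledge_routes]

z = standardize(with_knowledge, baseline)
mean_shift = mean(z)  # a positive value indicates a shift toward the knowledge
```

By construction, the no-knowledge pattern has a standardized mean of zero, so a positive mean for a knowledge pattern indicates that the search was shifted in the intended direction.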
Figure 4
Evaluation of the effects
of retrosynthesis knowledge on the search
directions of MCTS, in terms of the four route scores. Synthetic routes
solved with an expansion size of 500 were used for this evaluation.
(a) Each route score standardized based on the corresponding mean
and standard deviation of the no-knowledge pattern. The gray, black,
blue, orange, green, and red plots represent the standardized route
scores for the no-knowledge, all-knowledge, STScore, CDScore, ASScore,
and RDScore patterns, respectively. The rhombuses represent the mean
values for each case, and the confidence intervals at the 95% confidence
level are also shown. (b–d) Parts of the exemplary synthetic
routes found by the ReTReK using the all-knowledge and no-knowledge
patterns. The circled symbol “S” indicates that a molecule
is in the starting materials’ list. All the synthetic routes
are shown in Figure S3. (b) Example of
parts of the routes showing the STScore effect. The ReTReK with retrosynthesis
knowledge successfully chose the reaction with fewer reactive centers
than the ReTReK without the knowledge. (c) Example of parts of the
routes showing the CDScore and ASScore effects. The ReTReK with retrosynthesis
knowledge more successfully guided the convergent synthetic route
than the ReTReK without knowledge. (d) Example of parts of the route
showing the RDScore effect. The ReTReK with retrosynthesis knowledge
successfully chose ring-opening retrosynthetic reaction, whereas the
ReTReK without the knowledge found no routes.
Considering the results of Figure and Figure , the following strategy to adjust the weight parameters can
be considered. First, a search without retrosynthesis knowledge
is recommended to establish a baseline result. If the target is not solved
with these parameters, applying the CDScore is a reasonable choice
because the CDScore showed the best solution performance
(Figure ). Alternatively, if
the target seems to require ring-opening, applying the RDScore is
a promising choice (Figure d). Additionally, if better quality synthetic
routes are desired, applying the STScore and/or ASScore could improve
the routes based on their concepts. The score weights should be adjusted
gradually with positive integers to avoid combinatorial explosion.
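The weight-adjustment strategy above can be expressed as a small loop. This is only a sketch under stated assumptions: `search` is a hypothetical callable standing in for a ReTReK run that returns a route or `None`, and the weight names, the upper limit `max_weight`, and the one-score-at-a-time escalation are illustrative choices rather than part of the published procedure.

```python
from typing import Callable, Dict, Optional, Tuple

Weights = Dict[str, int]


def tune_weights(
    search: Callable[[Weights], Optional[list]],
    needs_ring_opening: bool = False,
    max_weight: int = 5,
) -> Tuple[Optional[list], Weights]:
    """Run the baseline first, then raise one score weight in integer steps."""
    weights = {"CDScore": 0, "RDScore": 0, "STScore": 0, "ASScore": 0}
    route = search(weights)  # baseline: no retrosynthesis knowledge
    if route is not None:
        return route, weights
    # RDScore for suspected ring-opening targets, CDScore otherwise.
    key = "RDScore" if needs_ring_opening else "CDScore"
    for w in range(1, max_weight + 1):
        weights = dict(weights, **{key: w})
        route = search(weights)
        if route is not None:
            return route, weights
    return None, weights


def fake_search(weights):
    """Stand-in for a real search: 'solves' once the CDScore weight reaches 2."""
    return ["step"] if weights["CDScore"] >= 2 else None


route, used = tune_weights(fake_search)
```

Raising a single weight by one per attempt keeps the number of searches linear in `max_weight`, whereas scanning all four weights jointly would grow combinatorially.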
Demonstrations of ReTReK for Drug-like Molecules
To
demonstrate retrosynthesis planning using ReTReK, we applied ReTReK
to six drug-like molecules in the all-knowledge and no-knowledge patterns,
and the results are presented in this section and Figure S4. The detailed parameters used in these demonstrations
are described in the Methods section. Figure panels a and b illustrate
exemplary retrosynthetic routes to molecule 1, known
as a hepatitis B virus capsid inhibitor,[45] found by ReTReK with and without retrosynthesis knowledge. The exploration
with retrosynthesis knowledge suggested a convergent route, successfully
reflecting the specified knowledge scores. In this route, the target
molecule 1 is disconnected into two main segments, iodophthalazinone 7 and pyridylboronic acid 12, which can be converted
into 1 by the Suzuki coupling reaction. The key intermediates 7 and 12 could be retrosynthetically divided
into three representative materials: hydroxyphthalazine 2, benzyl alcohol 3, and trihalogenated pyridine 8. Iodination of 2 and subsequent N-benzylation
with p-iodobenzyl iodide 5, which can
be obtained from 3, would provide 2-(iodobenzyl)phthalazin-1-one 6. A reaction of 6 with copper cyanide would
provide the intermediate 7. The other intermediate 12 would be prepared from 8 by a three-step sequence
(i.e., chlorination, incorporation of a boronic acid moiety, and amination
with aminoalcohol 11). In contrast, a straightforward
route was presented through the exploration without retrosynthesis
knowledge. Friedel–Crafts acylation of trihalopyridine 14 with 2-(chlorocarbonyl)benzoic acid (13) would
provide 2-(pyridinecarbonyl)benzoic acid 15, which is
further reacted with aminoalcohol 11 to afford tricyclic
ketone 16. The construction of the phthalazine ring could
be performed by the reaction of 16 with p-bromobenzylhydrazine (17) to yield the precursor 18. Finally, the introduction of a nitrile group into the
benzyl group of 18 would provide the target molecule 1. Additional demonstrations for two other drug-like molecules,
kwakhurin[47] and α7 nicotinic acetylcholine
receptor silent agonist,[48] are shown in Figure c,d and Figure e,f, respectively.
To confirm the difference in each step’s score between ReTReK
with and without retrosynthesis knowledge, four retrosynthesis knowledge
scores were added to each step in the synthetic routes (Figure S5).
Figure 5
Comparison of the synthetic routes for
three target compounds ((a,b)
hepatitis B virus capsid inhibitor,[45] (c,d)
kwakhurin,[47] and (e,f) α7 nicotinic
acetylcholine receptor silent agonist[48]) found by ReTReK with retrosynthesis knowledge (a, c, and e) and
the corresponding routes found without retrosynthesis knowledge (b,
d, and f).
Furthermore, the effectiveness
of retrosynthesis knowledge was
confirmed in several cases of retrosynthetic analyses. In these cases,
ReTReK with retrosynthesis knowledge succeeded in finding retrosynthetic
routes to the target molecules, whereas no route was found using ReTReK
without retrosynthesis knowledge (Figure a). Figure a shows the retrosynthetic route to a molecule 19 known as a Mycobacterium tuberculosis thymidylate
kinase (MtbTMPK) inhibitor.[46] In the suggested
route, 19 is disconnected at the center of the molecule,
giving imidazo[1,2-a]pyridine-3-carboxamide 25 and 1-(piperidin-4-yl)pyrimidine-2,4-dione 29. Compound 25 would be obtained from dibromide 24 by selective SNAr reaction with an organometallic
reagent such as ethylmagnesium bromide, which is prepared from ethyl
bromide. Dibromide 24 can be obtained from benzylamide 22 by stepwise SEAr bromination. Amide 22 would be provided from imidazopyridine-3-carboxylic acid (20) and N-Boc-benzylamine (21). Another intermediate 29 would be synthesized from
4-iodopiperidine (27) with pyrimidine-2,4-dione 28 by SN2 reaction. 27 would be easily
prepared from N-Boc 26. From the viewpoint of practical
synthesis, deprotection of the N-Boc group of the piperidine ring should
be performed after N-alkylation of 28 with 26 to avoid oligomerization of 27 by self N-alkylation.
Additional demonstrations for two other drug-like molecules, propolone[49] and EGFR kinase inhibitor,[50] are shown in Figure b and Figure S4, respectively.
Further demonstrations for the molecules reported by Segler et al.[9] are shown in Figure S6. To evaluate the performance of ReTReK and another data-driven
CASP approach on more advanced targets, ReTReK and ASKCOS were applied
to 15 targets reported in previous research using Chematica.[13,14,55] In terms of the solving performance,
ReTReK found retrosynthetic routes for 11 targets, while ASKCOS
found routes for 4 targets. As for the solved routes, the routes proposed
by the two data-driven applications included some steps whose feasibility
in practice is doubtful, and the solutions were not close to the mature ones proposed
by Chematica (Figure S7). According to
these results, finding sophisticated routes to advanced targets,
as Chematica does, is still a challenging task for data-driven CASP
applications.
Figure 6
For two target compounds (a, MtbTMPK inhibitor[46] and b, Propolone[49]) synthetic
routes were found using the ReTReK with retrosynthesis knowledge,
whereas no synthetic routes were found using the ReTReK without retrosynthesis
knowledge.
These results clearly show that
retrosynthesis knowledge effectively
contributes to retrosynthetic analyses using ReTReK. However,
experimental validations[56] with targets
that chemists are interested in and blind assessments[9] by trained chemists were not performed in this study; thus,
we plan to perform these validations to evaluate the performance of
ReTReK on actual chemical synthesis in future work. Considering these
evaluations and demonstrations, the ReTReK framework with integrated
retrosynthesis knowledge has the potential to further improve the
performance of data-driven CASP applications.
Conclusions
We developed ReTReK, a data-driven CASP application integrated
with rule-based techniques, which can flexibly reflect and apply
retrosynthesis knowledge. Through the evaluation of ReTReK with and
without retrosynthesis knowledge, we showed that the integration of
such knowledge into data-driven CASP applications helps improve their
performance and enhance the quality of the explored synthetic routes.
We expect the concept of ReTReK to contribute to further development
and improvement of data-driven CASP applications.
To allow
for more realistic and preferable synthetic routes to
be obtained in the future, we will address the further development
of automatic reaction template extraction methods while maintaining
the chemical integrity. In this study, orphan atoms (atoms appearing
on only one side of the reaction arrow) were included in the reaction
templates to automatically retain the protecting and leaving groups
in the templates because these groups are often not recorded as reactants
or products. Because such groups were manually defined in a previous
study,[22] this template definition (considering
the orphan atoms) is expected to contribute to the further development
of automatic template extraction methods. In addition, we may define
additional retrosynthesis knowledge scores to allow ReTReK to
represent the chemists’ ways of thinking more extensively than
in the current model. Furthermore, to facilitate the use of ReTReK,
we will prepare a user-friendly interactive interface with functions
such as range sliders for adjusting each retrosynthesis knowledge
score and other tools for displaying the explored synthetic routes.
Methods
Data Sets
To create the ReTReK model, compounds obtained
from the Reaxys reaction records[19] and the
ZINC 15 database[54] were used; to evaluate the performance of ReTReK,
compounds obtained from the ChEMBL 27 database[51] and the literature[45−50] were used.
Reaction Template Extraction
A set of approximately
50 million reaction records from Reaxys[19] (1795–2019) was used to construct the 1-step retrosynthetic
reaction prediction model. The model was designed to take a target
or an intermediate molecule as input and was trained to predict a
suitable reaction template for the input molecule. The purpose of
a reaction template is to represent a generalized chemical reaction.
In this study, a reaction template consists of a reactive center,
orphan atoms, and their first-degree neighbors. An orphan atom is
one that appears on only one side of the reaction arrow in ChemAxon.[53] These atoms were identified using the Automapper
in the ChemAxon API.
The reaction template extraction procedure
comprises four steps. Figure S8 shows the
workflow of the reaction template extraction.
In the first step,
the reaction records were standardized by removing
explicit hydrogen, aromatizing, and retaining the largest fragments.
In the second step, the reaction records were filtered based on
three conditions: (1) the reaction was required to consist of a single
step, (2) it must have a product and up to three reactants, and (3)
the number of heavy atoms in the product was limited to 50 or fewer.
Thereafter, the number of remaining reaction records was 22 337 137.
In the third step, the reaction templates were extracted from the
reaction records, and sets consisting of a product and the corresponding
reaction template were retained if the reaction template occurred
at least 50 times. To prevent the occurrence of two or more fragments,
a reaction template was retained only when all atoms on the product
side of the template were connected.
In the final step, the
sets consisting of a product and the corresponding
reaction template were filtered on condition that the reaction template
could be reversibly applied to the product and derived reactants.
On the basis of this requirement, 7 589 744 product-template
sets remained, and the number of unique reaction templates was 19 633.
Referring to a previous study,[57] a time-splitting
strategy was employed to evaluate the neural network model performance.
The sets published before 2017 were used for the training, and those
published in 2017 and later were used for testing.
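The record filtering, template frequency cutoff, and time split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Record` layout and all function names are hypothetical, and the connectivity and reversibility checks of the third and final steps are omitted because they require a cheminformatics toolkit.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified reaction record (not the Reaxys schema)."""
    n_steps: int
    n_reactants: int
    n_products: int
    product_heavy_atoms: int
    template: str
    year: int


def passes_filter(r: Record) -> bool:
    # Second step: single-step reaction, one product, up to three
    # reactants, and at most 50 heavy atoms in the product.
    return (
        r.n_steps == 1
        and r.n_products == 1
        and 1 <= r.n_reactants <= 3
        and r.product_heavy_atoms <= 50
    )


def frequent_templates(records, min_count=50):
    # Third step: keep templates that occur at least `min_count` times.
    counts = Counter(r.template for r in records)
    return {t for t, c in counts.items() if c >= min_count}


def time_split(records, split_year=2017):
    # Records before `split_year` go to training, the rest to testing.
    train = [r for r in records if r.year < split_year]
    test = [r for r in records if r.year >= split_year]
    return train, test
```

Splitting by publication year rather than at random keeps later reactions out of the training set, which is the point of the time-splitting strategy.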
Preparation of Molecules for ReTReK Evaluations and Demonstrations
The molecules used for the ReTReK evaluations were obtained from
ChEMBL 27[51] and preprocessed via the following
procedures. First, the molecules whose United States Adopted Names
(USAN) years ranged from 2017 to 2019 and for which chemical structure
records were available were selected, resulting in a total of 219
compounds. Thereafter, the compounds were preprocessed by removing
the explicit hydrogen, aromatizing, retaining the largest fragments,
removing compounds with more than 50 atoms, and removing duplicates.
The remaining 161 compounds were used for the evaluations (ChEMBL
data set). For further evaluation of ReTReK, six drug-like compounds[45−50] were used for synthetic route search demonstrations.
Starting Materials
Compounds obtained from
the ZINC database and Reaxys reaction records were used as the starting
materials. A subset of 100 023 building blocks from major suppliers
(Sigma-Aldrich, Alfa Aesar, and Acros) was obtained from the ZINC
database. From the Reaxys reaction records, 649 130 compounds recorded
as reactants with at least five occurrences before 2017 were used.
All the compounds were stored in the canonical SMILES format calculated
using RDKit.[21]
MCTS for Retrosynthesis
MCTS has been implemented in
various CASP studies based on the achievements of Segler et al.[9] MCTS is a search algorithm for exploring optimal
solutions and comprises four steps: selection, expansion, rollout,
and update.[58] Following Segler’s
implementation,[9] a state consists of a
set of molecules and is solved (the optimal solution) if all the molecules
in the state are starting materials. In this study, retrosynthesis
knowledge scores were incorporated into the evaluation term used in
the selection step. The same policy network was used for both the
expansion and rollout steps, similar to a previous study.[35]
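The notion of a solved state can be illustrated with a short sketch. It assumes, for illustration only, that a state is a collection of canonical SMILES strings and that the starting materials are held in a set; the function names are hypothetical.

```python
def unresolved(state, starting_materials):
    """Molecules in the state that are not starting materials."""
    return [m for m in state if m not in starting_materials]


def is_solved(state, starting_materials):
    """A state is solved when every molecule is a starting material."""
    return not unresolved(state, starting_materials)
```

Storing the starting materials as a set of canonical SMILES makes each membership test a constant-time string lookup.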
Retrosynthesis Knowledge Used in ReTReK
We define four
scores representing four types of retrosynthesis knowledge, namely,
the CDScore, ASScore, RDScore, and STScore, inspired by previous studies.[4,7,8]
Convergent Disconnection Score
The CDScore is designed
to favor convergent synthesis, which is known to be an efficient strategy
in multistep chemical synthesis. The CDScore is calculated by evaluating
how equally a product is divided among the reactants of a reaction
{R1 + R2 + ... + Rn → P}, where Ri is a reactant
and P denotes the product. Here, a(P) and a(Ri) represent the number of atoms in
the product and reactant, respectively, and MAE is the mean absolute
error.
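As a minimal sketch of such a score, the code below assumes that the CDScore takes the form 1/(1 + MAE), where the MAE is computed between each reactant's atom count and the ideal equal share a(P)/n. Both this functional form and the zero score for single-reactant steps are assumptions made for illustration and are not confirmed by the text.

```python
def cd_score(product_atoms, reactant_atoms):
    """Illustrative convergent-disconnection score (assumed form)."""
    n = len(reactant_atoms)
    if n < 2:
        # Assumption: a step with one reactant is not a disconnection.
        return 0.0
    ideal = product_atoms / n  # equal share of the product per reactant
    mae = sum(abs(a - ideal) for a in reactant_atoms) / n
    return 1.0 / (1.0 + mae)
```

Under this assumed form, an even split of a 20-atom product into two 10-atom reactants scores 1.0, whereas an 18 + 2 split gives an MAE of 8 and a much lower score.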
Available Substances Score
The ASScore, which serves
a similar purpose to the CDScore, is defined to reflect the number
of available substances generated in a reaction step. It is calculated
from b(S) and b(R), which represent the numbers of available substances (starting
materials)
and reactants, respectively.
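One reading consistent with these definitions is the ratio b(S)/b(R). The sketch below uses that ratio as an assumption. The function name and the representation of molecules as SMILES strings are hypothetical.

```python
def as_score(reactants, starting_materials):
    """Fraction of reactants that are available substances (assumed form)."""
    if not reactants:
        return 0.0
    available = sum(1 for r in reactants if r in starting_materials)
    return available / len(reactants)
```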
Ring Disconnection Score
A ring construction strategy
is preferred if the target compound has complex ring structures because
the construction of ring structures in a synthetic route tends to
result in simple and easily available starting materials. The RDScore
is calculated by checking whether ring construction occurs in
a reaction step. Here, d(P) and d(Ri) represent the number of rings in the product
and reactant, respectively.
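A simple way to express this check is sketched below. It assumes a binary score that is 1 when the product has more rings than all reactants combined and 0 otherwise. The binary form is an assumption, and ring counts are passed in as plain integers to keep the example toolkit-independent.

```python
def rd_score(product_rings, reactant_rings):
    """1.0 if the step constructs at least one ring, else 0.0 (assumed form).

    Ring counts could be obtained from a toolkit such as RDKit, for example
    via mol.GetRingInfo().NumRings(); here they are supplied directly.
    """
    return 1.0 if product_rings > sum(reactant_rings) else 0.0
```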
Selective Transformation Score
A synthetic reaction
with few byproducts is preferred, considering the yield. To reflect
the number of possible products from a reaction, the STScore is calculated
by focusing on the number of reactive centers in the reactants.
Here, e(∑R) represents the number of applicable product patterns
enumerated using the reactants and a certain reaction template.
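The sketch below assumes that the STScore is the reciprocal of the number of distinct product patterns, so that a reaction with a single possible outcome scores highest. The reciprocal form, the function name, and the representation of each outcome as a collection of product strings are assumptions made for illustration.

```python
def st_score(outcomes):
    """Reciprocal of the number of distinct outcomes (assumed form).

    `outcomes` lists the product sets obtained by applying one template
    to the reactants in every possible way; duplicates are merged.
    """
    distinct = {frozenset(o) for o in outcomes}
    return 1.0 / len(distinct) if distinct else 0.0
```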
Policy Network
In this study, a policy network is a
template-based retrosynthetic reaction prediction model, and the same
model is used in both the expansion and rollout steps. We employed
a GCN model (a promising model for retrosynthesis found in a previous
study[28]) as the retrosynthetic reaction
prediction model. The model was trained using the data set prepared
as described in the reaction template extraction section and comprised
three graph convolutional layers with Leaky ReLU activation and a
dropout ratio of 0.3, a graph-dense layer with Leaky ReLU activation,
a graph-gather layer with hyperbolic tangent activation, and a dense layer
with softmax activation. To assess the effect of the expansion
sizes on the MCTS performance, the top-n accuracies
were calculated for n values in the range from 1
to 1000. To implement this model, the graph-based deep learning framework
kGCN[59] was used.
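The top-n accuracy used here is the fraction of test molecules whose recorded template is among the n highest-ranked predictions. A minimal sketch, with hypothetical argument names and plain Python lists in place of network outputs:

```python
def top_n_accuracy(prob_rows, true_indices, n):
    """Fraction of rows whose true template index is in the top n."""
    hits = 0
    for probs, truth in zip(prob_rows, true_indices):
        # Template indices sorted by decreasing predicted probability.
        ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        hits += truth in ranked[:n]
    return hits / len(true_indices)
```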
Selection
Starting from the root node, a tree policy
was recursively applied to select the subsequent action, meaning
that the simulation descended through the search tree gradually until
an unvisited node with a nonterminal state was reached. The tree policy
was based on the upper confidence bound (UCB) score, and retrosynthesis
knowledge was incorporated into the policy through weighted scores, where w represents the weights, with values of w1 = 5.0, w2 = 0.5, w3 = 2.0, and w4 = 2.0, and n denotes the number of retrosynthesis knowledge scores
used in a search (e.g., n is four if all four types
of retrosynthesis knowledge are used). In the UCB score, Q denotes an
action value calculated in the update step; N and N–1 are the visit counts of the child
and parent nodes, respectively; c denotes a constant
value that is set to 10; P denotes the softmax probability
obtained from the policy network; and K represents
the mean of the retrosynthesis knowledge scores.
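To make the roles of these symbols concrete, the sketch below assumes a selection score of the form Q/N + c·P·√(N_parent)/(1 + N) + K, that is, a standard UCB term to which the weighted mean K is added. The additive form and the function names are assumptions. The weights in the usage example are the values that the evaluation settings assign to each score.

```python
import math


def knowledge_term(scores, weights):
    """Weighted mean K over the knowledge scores used in the search."""
    if not scores:
        return 0.0
    return sum(weights[name] * value for name, value in scores.items()) / len(scores)


def selection_score(q, n_child, n_parent, prior, k, c=10.0):
    """Assumed form: exploitation + exploration + knowledge term."""
    exploit = q / n_child if n_child else 0.0
    explore = c * prior * math.sqrt(n_parent) / (1 + n_child)
    return exploit + explore + k
```

For example, with weights of 5.0 for the CDScore and 2.0 for the ASScore, a step with CDScore 1.0 and ASScore 0.5 gives K = (5.0 × 1.0 + 2.0 × 0.5)/2 = 3.0.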
Expansion
Child nodes, the states of which are selected
by the policy network, are added to the node selected by the tree
policy. On the basis of the top-n accuracies in the
policy network, in independent trials, the top 50, 100, 300, and 500
reaction templates were selected, and they were filtered on condition
that the reaction templates could be successfully applied to an unsolved
molecule in the state of the selected node.
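The expansion step can be sketched as a ranking followed by an applicability filter. The `is_applicable` callback stands in for the template application check and is hypothetical. In practice it would attempt to apply the template to the unsolved molecule.

```python
def expand(probabilities, is_applicable, top_n=50):
    """Return indices of the top-n templates that pass the filter."""
    ranked = sorted(
        range(len(probabilities)), key=probabilities.__getitem__, reverse=True
    )
    return [t for t in ranked[:top_n] if is_applicable(t)]
```

Because the filter runs after truncation, the number of child nodes added can be smaller than the expansion size.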
Rollout
A simulation is implemented using the policy
network if the state of a node is not proven or terminal.[9] During the simulation, the following steps are
repeated up to five times: an unresolved
molecule of the state (one that is not included in the starting materials)
is randomly sampled, the top 10 reaction templates for the molecule
are obtained from the policy network, and a randomly sampled reaction
template is applied to the molecule. At the end of each step, it is
checked whether the state is proven.
A reward function, r, returns one of three values as the reward z, depending on the simulation result. Before the simulation is started,
the reward is 10 if the state is proven and −1 if the state
is terminal. After the simulation, the reward is equal to the ratio
of the number of resolved molecules in the state to the total number
of molecules.
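The rollout loop and the three reward cases can be sketched as follows. The `propose` callback, which returns candidate reactant sets for a molecule, is a hypothetical stand-in for the policy network plus template application. Stopping the simulation when no candidate exists is an assumption.

```python
import random


def reward_without_rollout(state, starting_materials, terminal):
    """10 if proven, -1 if terminal, None if a rollout is needed."""
    if all(m in starting_materials for m in state):
        return 10.0
    if terminal:
        return -1.0
    return None


def rollout(state, starting_materials, propose, rng, max_steps=5, top_k=10):
    """Simulate up to `max_steps` random retrosynthetic steps."""
    state = set(state)
    for _ in range(max_steps):
        open_mols = sorted(m for m in state if m not in starting_materials)
        if not open_mols:
            break  # state is proven
        mol = rng.choice(open_mols)
        candidates = propose(mol, top_k)
        if not candidates:
            break  # assumption: stop when no template applies
        state.remove(mol)
        state.update(rng.choice(candidates))
    return state


def reward_after_rollout(state, starting_materials):
    """Ratio of resolved molecules to all molecules in the state."""
    return sum(m in starting_materials for m in state) / len(state)
```

Passing a seeded `random.Random` instance as `rng` makes a simulation reproducible.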
Update
The reward obtained from the rollout step is
backpropagated through the selected nodes to update their action values Q. On the basis of a previous study,[9] the value of Q is defined using Lmax, L, and ∑P, where Lmax denotes
the maximal branch length and is set to 10, L denotes
the current branch length, and ∑P denotes the sum of the softmax probabilities of the reaction
templates in the selected nodes.
Evaluating the Effects of Expansion Sizes and Retrosynthesis Knowledge on MCTS Solution Performance
To investigate the
effect of expansion sizes on the MCTS’ performance in solving
for target molecules, both the number of solved molecules in the ChEMBL
data set and times required to solve the molecules were compared for
different expansion sizes and six retrosynthesis knowledge patterns.
The expansion sizes were 50, 100, 300, and 500 and were determined
by the policy network’s top-n accuracies.
The six knowledge patterns were as follows: no retrosynthesis knowledge
(no knowledge), the CDScore, ASScore, RDScore, STScore, and all the
four retrosynthesis knowledge scores (all knowledge). In these experiments,
the maximum number of iterations was set to 500 and the score weights
for the CDScore, ASScore, RDScore, and STScore were fixed to 5.0,
2.0, 0.5, and 2.0, respectively.
Evaluating the Effects of Retrosynthesis Knowledge on the Search Directions in MCTS
To quantify the effects of the retrosynthesis
knowledge on the search directions in MCTS, we defined a route score
as the average value of the corresponding retrosynthesis knowledge
scores in each step of a solved synthetic route. We calculated four
types of route scores (rCDScore, rASScore, rRDScore, and rSTScore)
for the solved synthetic routes under the corresponding retrosynthesis
knowledge patterns. For comparisons, each route score for the five
retrosynthesis knowledge patterns was standardized based on the corresponding
mean and standard deviation for the no-knowledge pattern. In these
experiments, the synthetic routes solved under the condition of an
expansion size of 500 were used. The maximum number of iterations
was set to 500, and the score weights for the CDScore, ASScore, RDScore,
and STScore were fixed to 5.0, 2.0, 0.5, and 2.0, respectively.
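The route score and its standardization against the no-knowledge pattern can be sketched as follows. The use of the population standard deviation is an assumption, as the text does not specify which estimator was used, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def route_score(step_scores):
    """Average of one knowledge score over the steps of a solved route."""
    return mean(step_scores)


def standardize(route_scores, baseline_scores):
    """Z-scores relative to the no-knowledge (baseline) route scores."""
    mu = mean(baseline_scores)
    sigma = pstdev(baseline_scores)  # assumption: population estimator
    if sigma == 0:
        raise ValueError("baseline route scores have zero variance")
    return [(s - mu) / sigma for s in route_scores]
```

A positive standardized value then indicates a route that scores higher on the corresponding knowledge than the average route found without any knowledge.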
Data and Software Availability
The ReTReK application
is publicly available on GitHub at https://github.com/clinfo/ReTReK under the MIT License. The application is distributed with a model
based on a US patent data set (10.6084/m9.figshare.5104873.v1) because Reaxys is a commercial database that cannot be provided
to the public. All the compounds used for the evaluations are available
at https://github.com/clinfo/ReTReK/tree/master/data/evaluation_compounds. The README file in the GitHub repository provides information about
how to set up and use the application.