Literature DB >> 34939786

Influence of Template Size, Canonicalization, and Exclusivity for Retrosynthesis and Reaction Prediction Applications.

Esther Heid¹, Jiannan Liu¹, Andrea Aude¹, William H Green¹.

Abstract

Heuristic and machine learning models for rank-ordering reaction templates comprise an important basis for computer-aided organic synthesis regarding both product prediction and retrosynthetic pathway planning. Their viability relies heavily on the quality and characteristics of the underlying template database. With the advent of automated reaction and template extraction software and consequently the creation of template databases too large for manual curation, a data-driven approach to assess and improve the quality of template sets is needed. We therefore systematically studied the influence of template generality, canonicalization, and exclusivity on the performance of different template ranking models. We find that duplicate and nonexclusive templates, i.e., templates which describe the same chemical transformation on identical or overlapping sets of molecules, decrease both the accuracy of the ranking algorithm and the applicability of the respective top-ranked templates significantly. To remedy the negative effects of nonexclusivity, we developed a general and computationally efficient framework to deduplicate and hierarchically correct templates. As a result, performance improved considerably for both heuristic and machine learning template ranking models, as well as multistep retrosynthetic planning models. The canonicalization and correction code is made freely available.

Entities: Chemical

Mesh：

Year: 2021 PMID： 34939786 PMCID： PMC8757433 DOI： 10.1021/acs.jcim.1c01192

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Retrosynthesis, i.e., the proposal of precursors for a desired product, and forward reaction prediction, i.e., the proposal of possible products given a set of reactants, are central topics of organic chemistry. With the surge of computer-aided reaction prediction approaches, numerous models for retrosynthesis based on heuristics[1] and machine learning were developed, such as rule- or template-based models,[2−5] transformer models adapted from natural language processing,[6−10] or conditional graph logic networks.[11] Despite the limitations of template-based approaches to generalize to new chemistries due to missing templates, their ability to fully specify precursors and to easily compare a proposed reaction to known reactions with similar transformations makes them a useful and valuable tool for synthesis planning software.[1] Template-based models usually take a molecule as input and propose a ranked list of chemical transformations, often via flat multiclass classification. Since the training of multiclass models becomes more difficult with a larger number of classes,[12] the performance of template-based models is influenced by the number of templates, and thus by their size, canonicalization, and exclusivity. Larger, more specific reaction templates include more atoms around the reaction center and only apply to a small number of molecules. Smaller, more general templates are applicable to more molecules and decrease the overall number of classes, potentially increasing model performance. However, they may lead to a large number of proposed precursors, some of which may not be chemically meaningful. Finding the optimal size of templates for an application is therefore an important and often ambiguous, undetermined problem. In contrast, poorly canonicalized templates (different templates describing the exact same transformation on the same set of molecules) or nonexclusive templates (different templates describing the same chemical transformation on overlapping sets of molecules) unnecessarily increase the number of templates, thus adding noise to the data and should be avoided if possible. However, data-driven approaches to retrosynthesis usually rely on the automated extraction of reaction templates from reaction databases, for example, via the open-source package RDChiral.[13] Such template sets are, by nature, not as well curated and validated as manually crafted reaction rules. They can contain duplicate and nonexclusive templates and may also suffer from too large or too small template sizes. This necessitates the development of efficient and scalable canonicalization and correction routines. Despite these challenges, the effects of template size, canonicalization, and exclusivity on model performance are hardly investigated. A recent study on the influence of the choice of data sets and template size found that smaller templates lead to a lower top-1 prediction accuracy, despite increasing model performance for multistep retrosynthesis and increasing applicability,[14] which is counterintuitive from a machine learning point of view (where fewer classes should increase model accuracy). To the best of our knowledge, no methodical studies of the effects of template exclusivity and canonicalization have been published. The aim of this study is therefore two-fold. First, we aim to characterize and quantify the effects of template size, canonicalization, and exclusivity on heuristic and machine learning template-ranking algorithms for retrosynthesis and forward prediction applications. Second, we aim to establish new canonicalization and hierarchical correction algorithms to remedy the two most important issues identified: duplicate and nonexclusive templates. The resulting code is freely available on Github.[15,16]

Methods

Data Sets

This study relies on two data sets, which were both prepared from reactions extracted from the United States Patent and Trademark Office (USPTO) by Lowe and coworkers.[17,18] First, a subset of 50,000 pharmaceutically relevant reactions, curated by Schneider et al.[19] and employed in a variety of previous studies,[1,5] was used as provided by ref (1) and the preprocessing steps described therein. Second, a set of 480,000 USPTO reactions collected for forward reaction prediction[20] and employed in various studies[21,22] was utilized. This data set was further processed by removing molecules from the reaction that did not contribute a heavy atom to the product (code from ref (5)), as well as removing reactions with more than one product molecule. For both data sets, templates in the retrosynthetic direction were extracted via the RDChiral Python package[13] with default settings (radius 1 and inclusion of certain special groups), as well as at radius 3, 2, 1, and 0 without any special groups after modifying the code slightly.[23] Leaving groups were included in all extracted templates. Only reactions yielding a valid SMARTS string as template, as well as reproducing the reactants after application to the products, were kept, yielding 48,531 reactions in the smaller data set, termed USPTO-50k throughout the remainder of this article, as well as 461,541 reactions for the larger data set, termed USPTO-460k. Templates for forward reaction prediction were obtained by simply reversing the direction of change of the retro templates. Since RDChiral is designed to extract templates in the retrosynthetic direction and imposes certain chirality restrictions which are only meaningful in that direction, reversing the templates may make them less general than necessary. As this study is primarily concerned with the relative effects of canonicalization and exclusivity, the forward (reversed retrotemplates) were not sanitized further.

Model Details

Each data set was split into training, validation, and test reactions via a random 80%/10%/10% split. Three different models were then trained to recommend templates which reproduce the observed reaction at high ranks. Either the product molecule was employed as input for the task of predicting retrosynthetic disconnections or the reactant molecule(s) for the task of predicting the reaction outcome in the forward direction. First, the heuristic template ranking procedure described in ref (1) was utilized as a baseline (referred to as “Sim”), which ranks precedent reactions based on Tanimoto similarities of Morgan fingerprints,[24] as implemented in RDKit.[25] Second, a neural network, similar to the baseline model in ref (5) was trained on Morgan fingerprint bit vectors of the products or reactants (referred to as “ML-fixed”). Lastly, a graph convolutional neural net, Chemprop,[26] was employed to encode a molecule and predict a template class (referred to as “ML-learned”). This model does not rely on precomputed fingerprints, but creates its own, task-specific learned embedding of the molecule via a directed message passing neural net, which is then followed by a standard feed forward neural net. Further details on each model and their hyperparameters are given in the Supporting Information. For each model, the number of applicable templates in the top-N ranked templates, as well as the top-N accuracy, were calculated. A recent study found that top-N accuracy should not be used as a single metric to evaluate the fitness of a template ranking model and should instead be accompanied by a measure of template applicability, as well as the ability of the model to produce viable synthetic routes for target products.[14] Therefore, the number of applicable templates recommended by each model was analyzed in addition to top-N accuracy, i.e., the number of templates that produce any valid precursor or product. In our top-N accuracy metrics, two different definitions of success were considered. Most machine learning models define success as recommending the exact template associated with a test reaction in the data set. Less commonly, success can be defined via recovering the actual molecules in the test reaction (precursors or products, depending on the direction of the template) after application of the recommended template. If all templates are mutually exclusive, both metrics of success yield the exact same results, which is highly desirable but often not achieved in practice, as shown later in this article. Below, metrics based on the “exact template” criterion are labeled T and those based on “exact precursors” or “exact product” are labeled P. We furthermore retrained the policy neural network of the Monte Carlo tree search retrosynthesis planner AiZynthFinder[27] with the reactions and templates from the current study to evaluate the ability of different template sets and their corresponding template recommendation models to create synthetic routes. Default parameters were used as described in ref (27), where we only retrained and utilized the policy model, not the filter model. The processed template and model files are available on Github.[15]

Canonicalization of Templates

Canonicalization of a template is a process that generates a unique string representation of the chemical transformation, which typically consists of a pair of SMARTS connected by the atom mapping. Due to the uniqueness of the canonical form, this process is crucial in template indexing and deduplication. The problem is a generalization of canonicalization of SMILES/SMARTS and mathematically equivalent to computing the canonical form of a graph.[28,29] The problem is NP, but its exact time complexity is still unknown.[30−33] In this section, instead of discussing the exact solutions, we attempt to present a practical, open-source approach to the template canonicalization problem and discuss its limitations and alternative solutions. We furthermore note that we only attempt to canonicalize templates as output by RDChiral, which significantly simplifies the task, since RDChiral only produces simple SMARTS patterns without negations, recursions, or wildcards. For comparisons between more complex SMARTS patterns, we refer the interested reader to the seminal work of Rarey and coworkers on detecting equality and hierarchy between SMARTS patterns.[34,35] Compared to the SMILES canonicalization problem, the template canonicalization problem has two major differences: (1) A template comprises reactant and product SMARTS, in other words, two graphs instead of one. (2) The two graphs are further connected through the atom mapping. Hence, it is natural to merge the two graphs into one condensed graph[36] if two nodes have the same atom mapping number. Here, we note that standardizing the atom mapping is also necessary to produce a unique representation. The key to the canonicalization problem is the node ranking algorithm, which computes the canonical rank of each node or atomic query in SMARTS for this particular problem. We adopted the Weisfeiler–Lehman refinement[37] as our ranking method. It iteratively relabels each node v ∈ V on the graph G(V, E) using the features from the node v itself and its neighbors , until the partition of all the labels becomes stable or the cycles are repeated at least |V| times, where |V| is the number of nodes. We chose node degree and canonical atomic SMARTS without the atom mapping number as the node features and bond SMARTS as the edge features, both with chirality included. Since the condensed graph was employed, the node features from the reactant and the product graphs were simply concatenated if they shared the same atom mapping number. In addition, a tie-breaking tag was also included. A detailed explanation of the canonicalization process is given in the Supporting Information. A major limitation in our algorithm is the ranking method itself, which is equivalent to the 1-dimensional Weisfeiler-Lehman refinement, and cannot distinguish certain highly symmetric graphs.[33,38,39] As a consequence, our canonicalization algorithm may not produce a unique representation for some templates. Despite this limitation, it does not affect this study because a) the number of nonunique representations is very small and b) the template correction algorithm applies a subgraph isomorphism test to compute the hierarchy. The frequency of nonunique representations can be further reduced by including more heuristic invariants,[33] e.g. as the ones used in RDKit.[28] If it is required to guarantee the uniqueness of the template string, the Weisfeiler-Lehman refinement needs to be replaced by a more general graph canonicalization algorithm with the price of potentially higher time complexity, e.g. the nauty algorithm.[40] In the end, we note that our current implementation of the canonicalization algorithm in the RDChiral C++ package[16] only supports the And operator in atomic SMARTS query, which is sufficient to canonicalize template patterns output by RDChiral. However, it should be possible to extend our code to support other SMARTS queries or use a more general SMARTS comparison algorithm such as SMARTScompare.[34,35] We furthermore note that the C++ version[16] of RDChiral is only used for template canonicalization; in all other parts of the workflow, we used a modified RDChiral Python package.[23]

Hierarchical Correction of Templates

In a manual examination of the extracted templates, we found that nonexclusivity of templates can stem from multiple sources. With the default template extraction parameters of RDChiral (radius 1, with special groups), for example, templates are extracted around the reactive center up to one bond away, and further atoms are attached if they match a set of expert-crafted special groups. Multiple reactions might thus yield templates with the same chemical transformation within the same context up to radius 1 but might include different, or no, special groups. A number of examples can be found in the Supporting Information, Figure S1. If a set of templates includes one instance without special groups and one with a special group, the templates are not mutually exclusive anymore; i.e., the template without the special group is applicable to all other reactions with the same template at radius 1. Templates at large radii, especially at radius 2 and 3, can furthermore be nonexclusive if some of the reactions they were extracted from contained a branched side chain. An example is given in Figure , where the left abstracted structure fits the same and more molecules than the right abstracted structure. Therefore, the template on the right possesses a subgraph which is isomorphic to the template on the left, and only the more general template should be kept. Further examples of this behavior are given in the Supporting Information, Figure S2. If templates describe the same transformation on an overlapping set of molecules, as in Figure , mutual exclusivity can be enforced by keeping only the most general template. To identify the template encompassing all relevant reactions, subgraph isomorphisms need to be calculated between pairs of templates. Due to the computational inefficiency inherent to subgraph isomophism evaluations, an exhaustive comparison of all possible pairs of templates in a set is unfeasible; instead, we developed a hierarchical alternative which narrows down the number of necessary subgraph searches. Toward this aim, we utilized RDKit to detect subset relations between templates output by RDChiral, which only comprise simple SMARTS patterns. For a comparison of more complex SMARTS patterns, packages such as SMARTScompare[34,35] may be employed. We further note that an exhaustive comparison of all templates in a set can result in unwanted matches for templates with different reaction centers (example shown in the Supporting Information), which is circumvented by our hierarchical approach as explained in the following.

Figure 1

Example of a template (right) with a subgraph that is isomoporphic to another template (left). Within the correction algorithm, only the left template is kept, since it encompasses all possible reactions of the right template. Figure schematically depicts how the hierarchical correction algorithm detects and eliminates nonexclusive templates. Templates at the desired level of specificity (with SMARTS strings B1–B6, red shades), for example, at radius 1, are clustered according to their respective templates at a lower level of specificity (with SMARTS strings A1, A2, and A3, gray shades), for example, at radius 0. The general templates (gray) are only used to group the more specific templates and thus lower the number of subgraph isomorphism evaluations but are not taken into account beyond that. The grouping also makes a comparison of atom map number between the matches obsolete because it enforces that the same chemical transformation is encoded in each template pair. Furthermore, templates extracted by RDChiral follow a specific syntax, where more strict SMARTS strings are assigned to atoms in the reaction center and more general SMARTS strings to all other atoms,[13] which ensures that the correct substructures of a pattern are matched to each other without inspecting atom map numbers. For templates extracted with different software packages, we recommend including a matching of the atom map number in each subgraph isomorphism search and clustering according to the minimal, most general template/reaction center. Within each cluster, subgraph isomorphisms are calculated between pairs of templates, leading to a tree structure of subgraph isomorphism relationships as depicted in the third column in Figure . For example, for templates in the A1 group, B1 is more general than both B2 and B3 and encompasses both of them. Thus, only B1 is kept. The second group, A2, contains only a single template, and is kept as is. The third group, A3, contains two templates that do not encompass each other, i.e., are unique, so both B5 and B6 are kept. In practice, a combination of these cases can occur, leading to more complex trees. To further lower the computational load, subgraph comparisons do not need to be exhaustively calculated between all templates in a group. Instead, the trees are built iteratively, where a new template is only compared to the current most general template(s) but not to templates already labeled as nonexclusive or duplicates. Further details of the hierarchical correction algorithm, as well as a pseudocode representation, are given in the Supporting Information. We note that an analogous, pairwise comparison algorithm can furthermore be easily implemented via other SMARTS pattern comparison algorithms. The hierarchical correction code used in this study is available on Github.[15]

Figure 2

Schematic depiction of the hierarchical correction algorithm. Templates are clustered using a general template representation (for example, templates B1–B3), and subgraph isomorphisms are computed within each cluster to identify the most general, exclusive patterns (for example, B1). Templates with a more general parent within a cluster are then replaced. In the following, “corrected templates” indicates that template sets were pruned according to the hierarchical correction algorithm. Radius 1 templates were corrected first with clustering at radius 0; then, default and radius 2 templates were corrected with clustering at corrected radius 1. Finally, radius 3 templates were corrected with clustering at corrected radius 2. Since the templates at radius 0 cannot be corrected with the developed algorithm, proper canonicalization is especially important at radius 0; otherwise, errors propagate up the hierarchical scheme, where the clustering does not correctly group together all relevant templates. One may argue that the exclusivity issues, at least for default templates (at radius 1 with special groups), could be resolved by omitting some or all special groups. However, the hierarchical correction features a useful side effect for this class of templates. Namely, special groups which are not important in a specific chemical transformation according to the database, i.e., which are not specified in all reactions, are removed automatically. On the other hand, special groups that are important (one special group occurs in all reactions or different special groups occur exclusively) or special groups in rare templates (with only one reaction precedent) are kept. Thus, the inclusion of special groups is handled automatically based on the context around the reaction center as extracted from a set of reactions and does not depend on an a priori choice, such as to remove selected special groups to allow for more general templates. Furthermore, an alternative path to enforce exclusivity was explored, namely, to specify the number of hydrogen atoms for each template atom, which solves some of the issues encountered at radius 2 and 3 due to branched versus linear side chains. However, this decreased model performance and template applicability considerably (data shown in the Supporting Information) and did not resolve a number of nonexclusivity issues, for example, due to special groups, so this approach was not pursued further.

Results and Discussion

In order to assess the effects of duplicate and nonexclusive templates on the performance of template ranking algorithms, it is necessary to first create a clean set of templates. To clear out duplicates, the newly developed iterative canonicalization process for SMARTS templates extracted with RDChiral was applied. To filter out nonexclusive templates, our novel hierarchical correction scheme was utilized to arrive at exclusive template sets. We now explore the effects of canonicalization and correction on the number and popularity of unique templates, as well as the performance of template ranking algorithms.

Popularity and Number of Unique Templates

Figure depicts the number of templates as a function of data set size, calculated on the USPTO-460k data set and subsets thereof. More specific templates extracted at larger radii naturally yield a larger number of extracted templates for a given data set. For the largest templates studied (radius 3, with no correction, no canonicalization, labeled “reg Radius 3”), about every second new reaction leads to a new template, whereas only every twentieth reaction actually encodes a new transformation, visible from the number of templates at radius 0. Since the accuracy of a machine learning template recommendation scheme usually suffers from a large number of templates (corresponding to a large number of classes in the multiclass classification task), it is desirable to keep the number of templates as low as possible, without sacrificing chemical plausibility of the recommended reactions. Additionally, it is desirable to keep the number of templates associated with only one or a few reactions as low as possible, since the ability of machine learning models to recommend such rare templates is usually low.[5]Figure depicts the popularity of the extracted templates from USPTO-50k and USPTO-460k for different template sizes. More general, smaller templates lower the fraction of rare templates occurring only once in the whole data set, especially for the smaller USPTO-50k data set.

Figure 3

Figure 4

Histogram of number of reactions associated with each template for different template sizes and canonicalization/correction schemes for USPTO-50k (top) and USPTO-460k (bottom).

Comparison of the number of templates (top, regular and canonical; bottom, hierarchically corrected and canonical + corrected) per number of reactions for subsets of the UPSTO-460k data set. The “cor” and “can-cor” curves overlap on this scale. Histogram of number of reactions associated with each template for different template sizes and canonicalization/correction schemes for USPTO-50k (top) and USPTO-460k (bottom). Figures and 4 furthermore depict the influence of hierarchical template correction (labeled “cor”) and canonicalization (labeled “can”). The correction procedure lowers the number of unique templates as well as lowers the number of rare templates. For default RDChiral templates (radius 1 + special groups), the correction scheme removes most of the special groups. Without the correction schemes, the extracted templates contain many duplicates (same transformation, but encoded as different SMARTS string), so that canonicalization decreases the number of unique template strings (top panel in Figure ). After hierarchical correction, canonicalization does not change the number of templates considerably for radii larger than 0; i.e., there is nearly no difference between corrected, “cor”, and canonical corrected “can-cor” templates. Since duplicates at radius 0 can propagate up the hierarchical correction scheme, ideally canonicalization and correction are combined (“can-cor”). From the absolute number of templates, and the number of reactions associated with each template, we can thus conclude that the hierarchical correction scheme not only efficiently reduces the overall number of templates but also the fraction of templates associated with only a single reaction. If a hierarchical correction is not possible or wanted, canonicalizing templates is an alternative, cheaper possibility to decrease the overall number of templates, but to a much smaller extent.

Influence of Template Characteristics on Model Performance

In the following, we examine the model performance of the similarity, ML-fixed, and ML-learned model across different template sizes, correction, and canonicalization schemes. The ranking ability of a model was evaluated via top-N accuracy, where the fraction of test reactions correctly recovered in the top-N ranked suggestions is measured. Two different success criteria were employed, namely, recommending either (a) the exact template extracted for a test reaction or (b) the exact molecules in the test reaction produced by template application. For a set of exclusive templates, these definitions lead to identical results, but a discrepancy arises for nonexclusive templates, where different templates lead to the same outcome, thus rated as success when comparing outcomes but as failure when comparing templates. We use this discrepancy as a measure of nonexclusivity in the following. Figure depicts the top-5 accuracies across the different models, template sizes, correction/canonicalization, and success criteria. The bars with darker shade correspond to success via identical templates and the bars with lighter shade to success via reaction outcome after template application, so that the visible lighter area quantifies the effects of nonexclusive templates. For all models and correction/canonicalization schemes, smaller template sizes lead to higher model performance. Correction/canonicalization increases model performance for both evaluation criteria but especially for success defined via identical templates, where the hierarchical correction remedies the effects of nonexclusive templates and boosts model performance across all models for default, radius 2, and radius 3 templates. We furthermore note that the ML-fixed model performs equally well or better than the ML-learned model across all systems and outperforms the similarity model for some systems only after template correction. The ML-learned model performs especially poorly for large sets of templates, for example, uncorrected templates at radius 3 or 2 or at radius 1 with special groups and thus classification tasks with a large number of classes. The ML-learned model is faced with a more difficult learning objective than the ML-fixed model since it learns both the molecular embedding from the molecular graph and the template class from the embedding, as opposed to the ML-fixed model, which only learns the latter. Both the larger number of parameters and the additional task make the ML-learned model more prone to overfitting, which requires more data to find a meaningful relation between a template and a molecular graph. For template sets with a large number of classes, most templates are only associated with a single reaction, so that the data are insufficient to learn a meaningful molecular embedding (see SI for further details). The ML-learned model therefore profits greatly from template correction. The positive effects of template correction are larger for both machine learning models than for the heuristic model, which is to be expected from the reduction of classes and thus a more effective training of classification models. Top-N accuracies for N = 1 and N = 50 are shown in the Supporting Information, as well as similar figures for the USPTO-460k data set. Figure depicts top-N accuracies for different values of N for the uncorrected templates (“reg”) and canonical + corrected templates (“can-cor”) across different template sizes for the ML-fixed model. For regular templates, the discrepancy between success via templates (dashed) and success via outcomes (continuous) is very large for default, radius 2, and radius 3 templates and hampers model performance. Canonicalizing and correcting the templates hierarchically resolves the discrepancies, which boosts model performance considerably. Similar trends were identified for the USPTO-460k data set (Supporting Information), although the performance increase is not as large.

Figure 5

Figure 6

Dependence of top-N accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) on the template scheme for the USPTO-50k data set, ML-fixed model. Here, “reg” corresponds to uncorrected and “can-cor” to canonical and corrected templates. P means evaluated by precursors or products (continuous line); T means evaluated by template match (dashed line).

Top-5 accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-50k data set using the “sim”, “ml-fixed”, and “ml-learned” models. The darker shade in a bar corresponds to evaluation via comparing templates and the lighter shade to comparing precursors or products. Each set of four bars shows the effects of canonicalizing (“can”) or hierarchically correcting (“cor”) the regular uncorrected templates (“reg”) or both (“can-cor”). Dependence of top-N accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) on the template scheme for the USPTO-50k data set, ML-fixed model. Here, “reg” corresponds to uncorrected and “can-cor” to canonical and corrected templates. P means evaluated by precursors or products (continuous line); T means evaluated by template match (dashed line). Canonicalizing and correcting templates thus offers a simple and accessible route toward smaller, more efficient template sets. But which template size is ideal? From Figure , one may conclude that the smaller the template is, the higher the model performance is. However, this evaluation does not take into account the number of produced precursors. For large radii, the application of a template usually produces a single (or no) reaction outcome. In contrast, small templates tend to produce a large number of reaction outcomes. For example, if the top-ranked template produces three precursors, and one of them coincides with the true reaction outcome, the three suggestions are not inherently ranked and should count toward top-1 accuracy only in 33.3% percent of cases. Figure compares the observed top-N accuracies after accounting for this effect, namely, averaging the ranks over the produced outcomes (right) instead of simply checking whether the correct outcome was produced at a specified rank (left). In that case, templates at radius 0 become undesirable for both forward and retro models, as their top-1 accuracy drops considerably below default and radius 1 templates. A similar behavior was found for USPTO-460k (Supporting Information). We therefore recommend the use of default templates together with the canonicalization and correction schemes developed in this study. Templates at radius 1 lead to an acceptable performance, too, but we do not recommend them as unequivocally due to the following reasons. The hierarchical correction of default templates enables an automated pruning of special groups without an a priori, expert-guided selection. Since some special groups in RDChiral are necessary for a correct processing of stereochemistry,[13] simply removing all special groups to arrive at radius 1 templates may have undesired side effects. The correction scheme only removes special groups which are not necessary or unique, i.e., correspond to a transformation that can be described fully and without information loss by another, more general template. It thus constitutes a more general, data-driven approach to select important special groups, while keeping the set of templates as small as possible.

Figure 7

Top-N accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-50k data set, canonical-corrected templates, ranking via the ML-fixed model. Left: ranking via the number of templates. Right: ranking via the number of precursors. Interestingly, canonicalization and correction do not exclusively boost the performance of machine learning ranking models, which was expected since their training loss relies on the assumption that classes are mutually exclusive. Rather, the performance of the heuristic ranking model increases too, despite the fact that the heuristic ranking algorithm is not affected by template characteristics at all. This is a direct effect of the increased applicability of canonical and corrected templates. Figure depicts the fraction of the five highest ranked templates that are applicable to the input molecule, i.e., produces one or more reaction outcomes. Applicabilities for the top-1 and top-50 templates, as well as for the USPTO-460k data set, are shown in the Supporting Information. For both forward and retro models, the hierarchical correction scheme significantly increases the applicability of default, radius 2, and radius 3 templates. We note that the ML-fixed algorithm performs best in ranking applicable templates highest, across all template sizes.

Figure 8

Fraction of applicable templates of the five highest ranked templates of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-50k data set using the “sim”, “ml-fixed”, and “ml-learned” models. Each set of four bars shows the effects of canonicalizing (“can”) or hierarchically corecting (“cor”) the regular uncorrected templates (“reg”) or both (“can-cor”).

Comparison to Other Template Ranking Models

As shown in Table , without any optimization of the model, the performance boost due to correction (retrosynthesis direction, default templates) leads our simple machine learning model to outperform the template-based models NeuralSym[2] and Retrosim[1] in top-N accuracy for all N, as well as GLN[11] for N ≥ 5. Compared to semitemplate based models, this work outperforms the models G2Gs[43] and RetroXPert[44] for N ≥ 3, as well as GraphRetro[41] for N ≥ 5. The template-free models SCROP[45] and LV-Transformer[46] are outperformed for all N, as well as DualTF[47] and RetroPrime[48] for N ≥ 5 (data from refs (41 and 42), USPTO-50k). Only the current state-of-the-art models DualTB,[47] MEGAN,[42] and AT (100x)[10] yield better performances. Without correction, the employed model outperforms none of the mentioned models, highlighting the power and influence of the developed template correction algorithm. A full table for top-N accuracies and applicabilities for all investigated systems is given in the Supporting Information.

Table 1

Comparison of Performance of ML-Fixed Model (Evaluation via Template Identity) Trained on Default Templates with and without Correction and Canonicalization to Performance of Selected Retrosynthesis Models in Literaturea

	Top-N accuracy (%)
Model	N = 1	N = 3	N = 5	N = 10
This work, reg	34.4	51.4	57.8	63.7
This work, can-cor	46.4	68.2	76.0	82.9
NeuralSym[2]	44.4	65.3	72.4	78.9
G2Gs[43]	48.9	67.6	72.5	75.5
RetroXPert[44]	50.4	61.1	62.3	63.4
SCROP[45]	43.7	60.0	65.2	68.7
LV-Transformer[46]	40.5	65.1	72.8	79.4
DualTF[47]	53.6	70.7	74.6	77.0
Retrosim[1]	37.3	54.7	63.3	74.1
GLN[11]	52.5	69.0	75.6	83.7
DualTB[47]	55.2	74.6	80.5	86.9
GraphRetro[41]	53.7	68.3	72.2	75.5
MEGAN[42]	48.1	70.7	78.4	86.1
RetroPrime[48]	51.4	70.8	74.0	76.1
AT (100x)[10]	53.2		80.5	85.2

Data from refs (41 and 42), USPTO-50k.

Influence of Template Characteristics on Retrosynthesis Performance

To showcase the influence of deduplication and exclusivity of template sets on computer-aided retrosynthesis platforms, we retrained the policy network of AiZynthFinder[14,27] using the regular and canonical-corrected template sets of USPTO-50k and USPTO-460k. The canonical-corrected template sets lead to higher top-N accuracies, as well as faster training of the policy network, as shown in Table . We then used the retrained policy network to perform a Monte Carlo tree search-based retrosynthesis pathway search for the 100 ChEMBL compounds used as the benchmark in ref (27), for which AiZynthfinder trained on 1.2 M reactions from USPTO produced valid pathways for 55 compounds.[27] For USPTO-50k, we found pathways for 34 and 41 molecules for the regular and canonical-corrected template sets, respectively. For USPTO-460k, we found pathways for 42 and 52 molecules for the regular and canonical-corrected template sets, respectively, as shown in Table . Correcting template sets for duplicates and nonexclusive templates therefore not only considerably increases the performance of single-step retrosynthesis models as highlighted in previous sections but also for multistep retrosynthesis approaches. With only 460,000 reactions, we obtain nearly the same amount of resolved pathways as the models trained on 1.2 M reactions in ref (27). We note that the canonical-corrected template sets describe the exact same chemical transformations as the regular template sets. The performance boost therefore comes solely from a better template ranking, and thus a better policy in the Monte Carlo tree search, highlighting the importance of the presented template correction algorithm. The processed template and model files are available online.[15]

Table 2

Performance of Policy Neural Network and Retrosynthesis Search of a Retrained AiZynthfinder Model Using Template Sets Reported in This Studya

	USPTO-50k reg	USPTO-50k can-cor	USPTO-460k reg	USPTO-460k can-cor
Policy model
Top-10-accuracy (%)	83.9	88.5	87.5	89.2
Training time per epoch (s)	2	1	63	40

Retrosynthesis search
Resolved pathways	34	41	42	52
Search time per pathway (s)	69	62	65	58

Regular or canonical-corrected templates in the retrodirection from USPTO-50k or USPTO-460k.

Limitations

We note that although the presented template correction and deduplication approach increases model performance for reaction prediction, as well as single-step and multistep retrosynthesis considerably, it suffers from the same problems as all template-based approaches. Namely, templates small enough to efficiently train recommendation models may miss important functional groups further away from the reactive center, which are necessary for the reaction to proceed. Some of these relations might be learned by the template recommendation model but often only to a very general extent. We note that the template canonicalization and correction approach developed in this study does not introduce new, more general templates to a set but instead removes more specific or duplicate templates that are already covered by the set. This might remove information about functional groups further away from the reaction center for some templates but only if the reaction is also known to proceed without these functional groups. A further general limitation of template-based models is their inability to learn or explore transformations unknown to the template set. Also, correct atom mappings of reactions are an important prerequisite for template extraction, which can be cumbersome to create. Template-free retrosynthesis approaches[45−47] alleviate some of these limitations but introduce others, such as limited interpretability as to why a specific transformation was suggested or which known reactions correspond to a suggested transformation. We therefore believe that even given the limitations of template-based approaches, it is still worthwhile to extend, optimize, and explore template-based retrosynthesis and forward prediction, alongside template-free approaches.

Conclusion

We developed new canonicalization and hierarchical template correction algorithms as well as systematically studied the influence of template size, canonicalization, and exclusivity on the performance of various template-ranking algorithms. We find that duplicate and nonexclusive templates significantly impact the performance of all models across different template sizes for reaction prediction, single-step retrosynthesis, and multistep retrosynthesis. The number of nonexclusive templates is especially high in templates including special groups or large radii. Large performance boosts in both applicability and top-N accuracy for machine learning and heuristic models, as well as multistep retrosynthesis approaches, can be achieved by hierarchically correcting template sets for exclusivity. Smaller increases in performance can also be achieved by canonicalizing templates. The correction algorithm can furthermore be used to prune unnecessary special groups from templates automatically, which reduces the need of human interaction, and is thus an important step toward the automated curation of high-quality, exclusive, template sets.

Data and Software Availability

The hierarchical correction code is available as a Python package on Github,[15] along with the processed USPTO-50k and USPTO-460k data sets as CSV files and the complete set of Python scripts to reproduce all results in this study. The repository furthermore contains the AiZynthFinder template sets and policy models for USPTO-50k and USPTO-460k. The canonicalization code is available on Gitlab[16] as part of the RDChiral C++ package. The modified RDChiral version to produce radius 0–3 templates without special groups is available on Github.[23]

21 in total

1. Get Your Atoms in Order--An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm.

Authors: Nadine Schneider; Roger A Sayle; Gregory A Landrum
Journal: J Chem Inf Model Date: 2015-10-15 Impact factor: 4.956

2. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity.

Authors: Nadine Schneider; Daniel M Lowe; Roger A Sayle; Gregory A Landrum
Journal: J Chem Inf Model Date: 2015-01-13 Impact factor: 4.956