We demonstrate molecular similarity to be a surprisingly effective metric for proposing and ranking one-step retrosynthetic disconnections based on analogy to precedent reactions. The developed approach mimics the retrosynthetic strategy defined implicitly by a corpus of known reactions without the need to encode any chemical knowledge. Using 40 000 reactions from the patent literature as a knowledge base, the recorded reactants are among the top 10 proposed precursors in 74.1% of 5000 test reactions, providing strong quantitative support for our methodology. Extension of the one-step strategy to multistep pathway planning is demonstrated and discussed for two exemplary drug products.
We demonstrate molecular similarity to be a surprisingly effective metric for proposing and ranking one-step retrosynthetic disconnections based on analogy to precedent reactions. The developed approach mimics the retrosynthetic strategy defined implicitly by a corpus of known reactions without the need to encode any chemical knowledge. Using 40 000 reactions from the patent literature as a knowledge base, the recorded reactants are among the top 10 proposed precursors in 74.1% of 5000 test reactions, providing strong quantitative support for our methodology. Extension of the one-step strategy to multistep pathway planning is demonstrated and discussed for two exemplary drug products.
In order to synthesize
a target chemical compound, it is necessary
to identify a series of suitable reaction steps beginning from available
starting materials. This analysis—starting from the target
compound and working backward—dates as far back as Robert Robinson’s
seminal 1917 work on the synthesis of tropinone.[1] It was later formalized as retrosynthesis by E. J. Corey,
ultimately leading to his receiving the 1990 Nobel Prize.[2] This formalization prompted the development of
computer assistance with the intent of allowing chemists to focus
on what to make, rather than how to make it; much of the field’s development in the following
years was led by J. Gasteiger.[3] Computer
assisted synthesis planning has been well-reviewed over the years.[4−7]From the very first attempt at computer-assistance in retrosynthesis
planning,[8] the vast majority of automated
retrosynthesis programs have relied on encoding reaction templates,
or generalized subgraph matching rules. These template-based approaches
require a decision to be made about the extent of generalization and
abstraction, whether extracted algorithmically from reaction databases[9−17] or encoded by hand.[16,18−20] Various techniques
have been developed to extract the likely meaningful context around
the reaction center, including through the consideration of nonstructural
reactivity descriptors, but the trade-off of specificity and coverage
is inevitable. Moreover, application of templates is computationally
expensive due to the cost of solving the subgraph isomorphism problem,
and so these approaches do not scale well for large template sets.[14,21] Similar considerations apply to the task of forward prediction,[22] which has been the subject of several recent
studies.[14,16,23,24]Liu et al.[21] report
a neural model based
on the seq2seq architecture, inspired by a similar
study examining the goal of forward synthesis.[25] The problem of one-step retrosynthesis is treated as a
translation task, converting one sequence of characters (i.e., a product
SMILES[26] string without atom mapping) to
another sequence of characters (i.e., a reactant(s) SMILES string).
They report comparable performance to a baseline model that applies
a library of algorithmically extracted reaction templates and ranks
candidate precursors in order of decreasing template popularity.Cadeddu et al.[27] treat retrosynthesis
in terms of chemical linguistics, where the rarest bonds are proposed
as the sites of disconnections. This is similar to other techniques
where an attempt is made to reduce molecular complexity as rapidly
as possible.[28] However, identifying the
reaction site is only sufficient to propose synthons,[29] or nonphysical fragments of precursors. Given one or more
synthons resulting from a proposed retrosynthetic step, it is still
necessary to propose specific functionalities to create synthetic
equivalents (i.e., specify leaving groups). Hereafter we use the term
“leaving group” to mean any functionality added to a
synthon to yield its synthetic equivalent.Despite their limitations,
reaction templates still provide a very
useful way of encoding transformations, particularly in their ability
to fully specify chemical precursors. For example, cleaving a single
bond between two aromatic carbons is associated with 57 different
leaving group pairs (not all of which are unique, due to symmetry)
in our ca. 40 000 reaction training data set, described later;
the most common are depicted in Figure . The reaction site, consisting of the atom–bond–atom
subgraph pattern, can be encoded in a SMARTS[30] string representation as [cH0]–[cH0]. The abundance of distinct
leaving groups for equivalent reaction sites has been described as
a limitation of template-based approaches, as it necessitates a proliferation
of distinct templates corresponding to each set of leaving groups.[21]
Figure 1
Six most-frequent precursors for the disconnection of
a single
bond between two aromatic carbons. Once a strategic disconnection
is identified (SMARTS: [cH0]–[cH0]), there may still be dozens
of locally plausible precursors to accomplish the transformation,
including different combinations of halides and boronic acid/esters.
(1) Bromide and acid; (2) bromide and ester; (3) chloride and acid;
(4) iodide and acid; (5) chloride and ester; (6) iodide and ester.
Six most-frequent precursors for the disconnection of
a single
bond between two aromatic carbons. Once a strategic disconnection
is identified (SMARTS: [cH0]–[cH0]), there may still be dozens
of locally plausible precursors to accomplish the transformation,
including different combinations of halides and boronic acid/esters.
(1) Bromide and acid; (2) bromide andester; (3) chloride and acid;
(4) iodide and acid; (5) chloride and ester; (6) iodide and ester.Herein, we propose and validate
a similarity-based approach whereby
strategic disconnections are performed based solely on analogy to
known reaction precedents. Reaction templates are used only at the
most rudimentary level to generate chemically valid precursor molecules,
circumventing the need to specify precise levels of generalization.
This is a purely data-driven approach to retrosynthesis, where model
suggestions can be thought of as an interpolation of known reactions to novel substrates, rather than an extrapolation to novel chemistries. In other words, this approach is intended
to mimic the “average retrosynthetic strategy” implicit
in a reaction corpus. It is purely deterministic, acting directly
on the available data, and does not require tuning or training of
any model parameters.
Approach
Overview
Our approach
is motivated by the first question
a chemist might ask when tasked with developing a synthesis plan to
a target molecule: how have similar molecules been synthesized? If a route to the molecule has been previously published, it may
be appropriate to use that route without modification. If it is a
novel compound, then one might look at routes to other compounds with
similar structural motifs and determine whether that synthetic strategy
is applicable. This analysis is formalized into an automated workflow
in the following paragraphs. A more detailed description of its implementation
can be found in the Supporting Information.First, reaction precedents are retrieved from the knowledge
base based on product similarity, sprod, scored between 0 and 1. Molecular similarity is described in the
following section. In our previous work,[14] we saw quantitative evidence that similar products tend to be produced
by similar reactions. This is not that surprising, as often the first
approach in a manual retrosynthesis is examining how molecules with
similar functionalities are produced (e.g., by searching Reaxys[31] or SciFinder[32]).
We restrict the number of precedent reactions to be 100 to limit the
computational time required in subsequent steps.Second, a highly
local transform containing fully specified leaving
groups is extracted from each precedent reaction and applied to the
target compound. In contrast to traditional template extraction approaches
that attempt to include neighboring atoms as necessary context,[12−15] these templates contain only the atoms that are immediately involved
in the reaction (specified by atomic identity, aromaticity, number
of hydrogen atoms, and chirality if applicable). Using the example
of Figure , the template
for a Suzuki reaction would consist only of the two aromatic carbons
that are bridged in the product and the unmapped halogen and boronic
acid/ester leaving groups. This template is applied to the target
compound, which may yield several candidate precursors or yield none.
Importantly, because templates are only applied when the precedent’s
product is similar to the target compound, it is not as important
to heuristically determine the important context around the reaction
center or manually encode reactivity conflicts (as done by Szymkuc
et al.[20] among others); that is implicitly
handled by the previous and upcoming similarity calculations.Third, candidate precursors are further scored by their similarity
to that precedent’s reactants, sreac, between 0 and 1. Precursors are analyzed as if they were a single
molecule, so that it is possible to use intramolecular reactions as
the basis for intermolecular suggestions (and vice versa). Comparing
reactant similarity ensures that not only are the product molecules
similar, but the precursors themselves are as well. The resulting
candidates are ranked by the overall similarity score as calculated
by multiplying product similarity and reactants similarity, s = sprod·sreac. This overall score measures the extent of the match
between the proposed reaction and the information upon which the suggestion
is based; a score of 1.0 would indicate an exact match to a known
disconnection in the database.An example prediction of a retrosynthetic
heteroatom alkylation/arylation
reaction is displayed in Figure . The recorded precursors (highlighted in green) are
recovered and predicted with rank 2; however, all of the top five
precursor suggestions are chemically reasonable. Of particular note
is reaction precedent 11, which is recalled from the knowledge base
due to a high product similarity but is disfavored when considering
the precursor similarity, as the precedent’s bromonitrile is
highly dissimilar to the proposed acid bromide. Precedent reactions
3 and 13 both lead to multiple precursor suggestions, which are then
differentiated by their disparate reactant similarity scores.
Figure 2
Example prediction
of retrosynthetic heteroatom alkylation/arylation
reactions for 1-phenyl-3,4-dihydroquinolin-2(1H)-one.
After recalling up to 20 reaction precedents in order of decreasing
product similarity, the precedent reaction site (highlighted in red
and displayed as a SMARTS string) is extracted and matched against
the target compound. Of the precedent reactions with the most similar
products, not all involve a reaction site that matches the target
compound and thus not all produce candidate precursors. Aside from
the first reaction, precedents with inapplicable reaction sites are
not shown for brevity. The recorded reactants for this target compound
(highlighted inside a green box) are recovered and predicted with
rank 2; however, all of the top five precursor suggestions are chemically
reasonable. Similarity scores are shown using Morgan2noFeat and Tanimoto (see the section on Similarity Calculation).
Example prediction
of retrosynthetic heteroatom alkylation/arylation
reactions for 1-phenyl-3,4-dihydroquinolin-2(1H)-one.
After recalling up to 20 reaction precedents in order of decreasing
product similarity, the precedent reaction site (highlighted in red
and displayed as a SMARTS string) is extracted and matched against
the target compound. Of the precedent reactions with the most similar
products, not all involve a reaction site that matches the target
compound and thus not all produce candidate precursors. Aside from
the first reaction, precedents with inapplicable reaction sites are
not shown for brevity. The recorded reactants for this target compound
(highlighted inside a green box) are recovered and predicted with
rank 2; however, all of the top five precursor suggestions are chemically
reasonable. Similarity scores are shown using Morgan2noFeat and Tanimoto (see the section on Similarity Calculation).
Similarity Calculation
An example of quantitative similarity
scores is shown in Figure . The reference compound appears in the test data set, and
the four other compounds appear as products in the training data set.
Scores can range from 1.0 (exact match) to 0.0 (absolutely no commonality)
and reflect the extent to which a pair of molecules contains overlapping
substructures of various sizes. The benzenesulfonamide motif in the
first three compounds results in a high degree of similarity, while
the similarity of the fourth compound is primarily due to benzonitrile.
The presence of the second piperazinenitrogen in the first compound
and the fluorine in the third and fourth compounds decreases their
corresponding similarity scores, as these functionalities are not
found in the reference molecule.
Figure 3
Example similarity score calculation using Morgan2Feat fingerprints and the Tanimoto metric. Colors indicate
atom-level contributions to the overall similarity (green: increases
similarity score, red: decreases similarity score, uncolored: has
no effect).
Example similarity score calculation using Morgan2Feat fingerprints and the Tanimoto metric. Colors indicate
atom-level contributions to the overall similarity (green: increases
similarity score, red: decreases similarity score, uncolored: has
no effect).Molecular similarity
plays a key role in the selection of reaction
precedents and ranking of candidate precursors. Beyond its use for
information retrieval, molecular similarity also provides an indication
of the presence or absence of functional groups in the target compound
as compared to a precedent reaction product. The presence of functional
groups that do not appear in the precedent may lead to a competing
reaction channel—this will lead to a measurable decrease in
molecular similarity. The absence of functional groups that do appear in the precedent may indicate that some enabling
context or activation is absent—this also leads to a decrease
in similarity. We recognize that the implicit detection of functional
group conflicts using this similarity approach is not as robust as
other reaction prediction methods,[14] but
is very attractive due to its speed and simplicity.Quantifying
molecular similarity on the basis of two-dimensional
(2D) structure generally requires a fingerprinting technique (to represent
a molecule as a vector) and a similarity metric (to compare the two
vectors of two molecules).[33] There are
a number of studies examining different approaches to fingerprinting,[34−38] including learned fingerprints using graph neural networks,[39−42] and to calculating molecular similarity.[43−50] This study was not intended to exhaustively explore these different
metrics, but rather demonstrate the proof of concept using a few common
implementations.We focus our evaluation on Morgan circular
fingerprints[36] as implemented in RDKit.[51] A circular fingerprint is molecular representation
obtained
through an enumeration of submolecular neighborhoods. Initially, atoms
are encoded by an integer identifier (a hashed encoding of simple
structural properties like atomic number). Neighborhoods of larger
sizes are iteratively assigned their own numerical identifiers based
on their constituent atoms and bonds. The “radius” of
a circular fingerprint refers to the size of the largest neighborhood
surrounding each atom that is considered during enumeration. The combination
of all unique identifiers comprises the fingerprint, which is often
folded into a binary vector of fixed length by converting integer
identifiers into indeces of the vector. We refer the reader to Rogers
and Hahn[36] for a thorough explanation of
extended-connectivity fingerprints (ECFPs), which the RDKit implementation
of Morgan fingerprints aims to replicate. Four similar fingerprinting
techniques were attempted:Morgan2noFeat, Morgan fingerprints
of radius 2 without features,Morgan3noFeat, Morgan fingerprints
of radius 3 without features,Morgan2Feat, Morgan fingerprints of
radius 2 with features, andMorgan3Feat, Morgan fingerprints of
radius 3 with features.Fingerprinting
“with features” refers to the inclusion
of information in the initial atom encoding beyond atomic identity
to, for example, take into consideration the similarity between different
halogens; these are documented in RDKit and are based on the work
of Gobbi and Poppinger.[52] Similarity scores
were calculated without explicitly folding the fingerprint down to
a fixed length.We also evaluate several similarity metrics.
The Dice similarity,[53] shown in eq , quantifies the similarity
between two fingerprint
vectors x and y by calculating the ratio
between the prevalence of overlapping substructures (as measured by
nonzero values of xy for each vector index i) and the number of distinct substructures observed in
each (as measured by the summation over x2 and y2 for each fingerprint separately). The
Tanimoto metric,[54] shown in eq , instead normalizes the prevalence
of overlapping substructures (in x and y) by the total number of unique substructures (in x or y). The Tversky similarity[55] (eq ) is a generalization of
the Tanimoto similarity that is parametrized by α and β
to enable an asymmetrically weighted normalization.We choose to test four similarity metrics:Dice, the Dice similarity,Tanimoto, the Tanimoto
similarity,TverskyA, the Tversky similarity with
α = 1.5 and β = 1.0,TverskyB, the Tversky similarity with
α = 1.0 and β = 1.5.Qualitatively,
within the context of our approach, α and
β in the Tversky similarity metric can be thought of as punishing
potential reactivity conflicts (groups present in x but
not y) and punishing missing molecular context (groups
present in y but not x), respectively.
Evaluation Procedure
There is rarely a single correct
answer in retrosynthesis, but rather disconnections that are considered
productive, yielding precursors that are more synthetically accessible,
and those that are unproductive. Proposed reactions should have a
high likelihood of success in the forward direction[14] and fit into a broader synthesis plan connecting back to
buyable reactants with an acceptably high overall yield. There have
been many attempts to quantify synthetic accessibility, primarily
involving heuristic scoring functions trained on subjective expert
ratings.[56] Here, we use a success criterion
that enables a more objective evaluation: that when given the products
of reactions in the United States patent literature, the program recovers
and ranks highly the recorded reactants without having seen that reaction
previously.We use the open source ca. 50k reaction data set
previously used by Liu et al.[21] for the
same task of one-step retrosynthesis prediction. This data set was
derived from a larger collection from the U.S. patent literature;[57] the reactions of this particular subset have
been classified by Schneider et al.[58] into
10 reaction classes. These are described in Table . We follow the same data cleaning procedure
as Liu et al., whereby examples with multiple products are split into
multiple distinct examples. Products with a SMILES length less than
five characters (e.g., byproduct salts) are discarded. Also following
Liu et al., we use an 80%/10%/10% training/validation/testing split;
the ca. 40 000 training reactions comprise our knowledge base.
The full data set with the fixed split is available in the Supporting Information; limitations of this data
set are discussed later.
Table 1
Descriptions of Each
of the 10 Classes
and the Fraction of the ca. 50k Reactions They Represent, Adapted
from Schneider et al.[58]a
class
description
fraction of data set (%)
1
heteroatom alkylation and arylation
30.3
2
acylation and
related processes
23.8
3
C–C bond formation
11.3
4
heterocycle formation
1.8
5
protections
1.3
6
deprotections
16.5
7
reductions
9.2
8
oxidations
1.6
9
functional group
interconversion (FGI)
3.7
10
functional group addition (FGA)
0.5
These were randomly sampled from
the patent literature and should approximate the true distribution
of reaction types reported in the full USPTO literature.
These were randomly sampled from
the patent literature and should approximate the true distribution
of reaction types reported in the full USPTO literature.In Liu et al.’s study, evaluation
was performed within each
class as if the reaction class of the intended transformation was
known a priori. This is useful for cases when a chemist
knows what type of reaction step they would like to perform. However,
for general retrosynthesis planning, a proposed step can come from
any reaction class. We evaluate our approach using both the former
approach—to enable comparison—and the latter—as
a more realistic formulation of the prediction task. Unfortunately,
no direct comparison can be made to the proposed method of Segler
and Waller[16] due to their lack of open-source
code and use of commercial data sets. Performance is quantified using
the top-n accuracy for n = {1, 3,
5, 10, 20, 50}, defined as the fraction of examples where the recorded
precursors are suggested by the program with rank ≤ n. Atom-mapping is excluded from this comparison, but we
do require the chirality of proposed precursors to exactly match that
of recorded precursors.All scripts were written in Python 2.7
using the open source RDKit.[51] We have
written an additional package to improve
handling of stereochemistry when simulating reactions. Details are
available in the Supporting Information.
Results
One-Step Evaluation
Each combination
of similarity
metric and fingerprint was tested on the validation set using the
training set as the knowledge base. The aggregated accuracies across
all classes are shown in Figure S1 for
the case of known reaction class; the accuracies when the reaction
class is excluded from consideration are shown in Figure S2. We find that model performance is relatively insensitive
to the choice of fingerprint and similarity metric, demonstrating
that our approach is robust to changes in how similarity is quantified.
From the result of this validation study, we select the Morgan2Feat fingerprint and Tanimoto similarity for evaluation
on the test set.Quantitative model performance is shown in Table when the reaction
class is known in advance; details of the top-n accuracy
within each class are reported in Table S1. Model performance aggregated across all classes is shown in Table in addition to a
second evaluation when the reaction class is not provided to the model.
When making predictions within a specific reaction class, the top
recommendation by the program exactly matches the reactants used in
the recorded reaction 52.9% of the time. The recorded reactants are
found within the top 3, top 5, and top 10 suggestions 73.8%, 81.2%,
and 88.1% of the time, respectively. Without prior knowledge of the
reaction class, recorded reactants are found in the top 10 suggestions
for 74.1% of test cases.
Table 2
Model Top-10 Accuracy
within Each
Class When the Reaction Class Is Known a Priori
reaction
class, top-10 accuracy (%)
model
1
2
3
4
5
6
7
8
9
10
Liu et al. baseline[21]
77.2
84.9
53.4
54.4
6.2
26.9
74.7
68.4
46.7
73.9
Liu et al. seq2seq(21)
57.5
74.6
46.1
27.8
80.0
62.8
67.8
69.1
47.3
56.5
similarity
(this work)
86.7
94.2
74.6
67.0
97.1
95.5
88.3
98.8
71.2
91.3
Table 3
Model Performance
Aggregated Across
All Classes
top-n accuracy (%), n =
model
1
3
5
10
20
50
Liu et al. baseline[21]
35.4
52.3
59.1
65.1
68.6
69.5
Liu et al. seq2seq(21)
37.4
52.4
57.0
61.7
65.9
70.7
similarity (this work)
52.9
73.8
81.2
88.1
91.8
92.9
similarity (this work)a
37.3
54.7
63.3
74.1
82.0
85.3
Denotes that reaction class information
was not provided to the model, which represents a much harder prediction
task.
Denotes that reaction class information
was not provided to the model, which represents a much harder prediction
task.The similarity-based
model outperforms the baseline and seq2seq models
of Liu et al. by a large margin in every
class. In particular, we see a tremendous improvement in classes 5
and 6 over the baseline approach (retro protections and retro deprotections);
this is a result of our fully specifying leaving groups when extracting
and applying templates from precedent reactions. Naively generalizing
an ester deprotection reaction might result in a forward synthetic
template allowing any alkyl side chain (SMARTS: [C]),
which effectively prevents the retrosynthetic template from suggesting
any ester other than the methyl ester. Our focused template application
strategy suggests specific protecting groups by preserving the full
leaving group functionalities found in precedent reactants. Using
a proper template extraction strategy overcomes the “maximum
possible test accuracy” of 69.5% cited by Liu et al. for the
template-based baseline model.The top nine retrosynthetic predictions
for an exemplary compound
found in the test set is shown in Figure . The highest ranked suggestion from the
model is an aldol condensation to bring together the pyrazolopyrimidine
and the thiazolidinedione ring systems, which exactly matches what
is recorded for this product. The other recommendations are (1) to
form the thiazolidinedione through various ring-closing amidation
reactions, (2) to install the cyclopropylamino functionality using
an SNAr reaction with either the chloro or fluoro substrate,
and (3) to deprotect either of the two amines that appear as secondary
amines in the target compound. The diversity of these recommendations
highlights the power of using the collective knowledge contained in
a reaction database to identify strategic retrosynthetic steps that
might otherwise be overlooked, particularly by a less experienced
chemist. Several additional example predictions are shown in Figures S3 to S18.
Figure 4
Example retrosynthetic
predictions when pooling all reaction classes.
The model successfully proposes the recorded reactants with rank 1,
corresponding to an aldol condensation. Other suggestions among the
top nine include three ring-closing amidations to build the five-membered
ring, two SNAr reactions to install cyclopropamine, and
three amine deprotections.
Example retrosynthetic
predictions when pooling all reaction classes.
The model successfully proposes the recorded reactants with rank 1,
corresponding to an aldol condensation. Other suggestions among the
top nine include three ring-closing amidations to build the five-membered
ring, two SNAr reactions to install cyclopropamine, and
three amine deprotections.
Application to Multi-Step Planning
This one-step strategy
is easily extended to full pathway design by recursive suggestion
of retrosynthetic disconnections. Lenalidomide and salmeterol serve
as two model compounds of significant medicinal importance[59] that can be synthesized using common chemistries
we would expect to exist in our small knowledge base of 40k reactions.
Example pathways for each are shown in Figure . Note that none of these compounds appears
as a product in the knowledge base from which suggestions are made.
Figure 5
Multistep
synthesis plans. Routes are constructed by recursively
applying the one-step retrosynthetic methodology to (a) lenalidomide
and (b) salmeterol. The suggested disconnections are consistent with
published pathways, highlighted with green and blue backgrounds for
lendalidomide and salmeterol, respectively.[60−62] Slight differences
are described in the main text.
Multistep
synthesis plans. Routes are constructed by recursively
applying the one-step retrosynthetic methodology to (a) lenalidomide
and (b) salmeterol. The suggested disconnections are consistent with
published pathways, highlighted with green and blue backgrounds for
lendalidomide and salmeterol, respectively.[60−62] Slight differences
are described in the main text.The first suggestion for lenalidomide (Figure a) is a retro amidation ring opening. Following
closely at rank 3 is a nitro reduction, consistent with published
literature pathways.[60,61] The subsequent retro alkylation
to open the five-membered ring is the next step in both literature
pathways, although one begins from the nitrophthalic anhydride,[60] and the other uses the methyl ester, rather
than the acid chloride.[61] The latter reference
begins with the bromination by N-bromosuccinimide
(NBS), precisely as suggested.The retrosynthesis for salmeterol
(Figure b) is perhaps
more interesting due to the
presence of a chiral center. Following only the rank 1 suggestions,
the similarity-based approach suggests a benzyl deprotection, preceded
by an alkylation, preceded by a reduction of a methyl ester to an
alcohol, preceded by an asymmetric ketone reduction. This exactly
matches the published synthesis[62] except
for the order of the alkylation and reduction steps and the choice
of amine protecting group. A notable alternate albeit low-ranked suggestion
from our approach is to introduce the chirality by means of an enantioselective
organometallic addition of bromochloromethane to the benzaldehyde,
although this would likely present a lack of selectivity in practice.The success of the approach in finding viable synthetic pathways
is particularly impressive when considering that we have not defined
any explicit retrosynthetic strategy. Computer-assisted retrosynthesis
typically involves some high-level strategy to help guide the search
toward simpler, buyable chemicals (e.g., favoring smaller precursors),
just as chemists manually identify disconnections to simplify compounds.[5] In this program, the goal is to mimic the implicit
strategy contained within the reactions in a knowledge base. In other
words, the tendency of the program to lead to smaller, simpler precursors
is solely due to that same tendency being present
in the data. Forgoing an explicit search heuristic allows the program
to rely solely on analogy to precedent reactions and—in the
case of salmeterol—recover a known pathway following the top
suggestions at each step. With a guiding heuristic explicitly favoring
smaller molecules, the first proposed step would have been a retro
alkylation without the retro deprotection first,
which would have led to nonselective overalkylation.
Discussion
Limitations
of the Approach
By design, the similarity-based
approach is meant to apply existing reaction knowledge to novel substrates.
This strategy inherently disfavors making creative disconnections.
Retrosynthetic suggestions do not offer major insights beyond what
could be achieved by a trained synthetic chemist familiar with the
types of reactions found in the knowledge base. We emphasize that
this is an intentional result of using an empirical, data-driven approach
to automated retrosynthesis. The suggestions made by a model extrapolating
outside of its training data carry a significant amount of uncertainty;
as described, the model is effectively restricted to operate within
the scope of known chemistry.While the goal of this study was
to recover the true precursors used in reactions from the patent literature,
this method can be trivially adapted to encourage pathway diversity.
Rather than retrieving precedents and ranking candidate precursors
deterministically, one might intentionally add a random value to the
calculated similarity scores to introduce stochasticity and sample
more dissimilar precedents. The absolute score values used to determine
the rankings of suggestions are included with the various examples
in Figures S3–S18. A small score
perturbation would lead to more creative disconnections but—as
alluded to earlier—introduces more uncertainty into the quality
of recommendations.There are obviously many additional considerations
in synthetic
route planning, not limited to cost, process complexity, reaction
yield, workup difficulty, safety, and toxicity of intermediates. Because
this information is either incomplete or unavailable in public databases,
we focus on the disconnections themselves, which makes this methodology
more suitable for research-scale discovery applications. We expect
that additional considerations could be incorporated by weighting
the scores assigned to precedent reactions by an additional “process
suitability” function to balance the similarity metric with
these other considerations. This method could also be restricted to
use a particular subset of available reaction data to provide domain-specific
suggestions, e.g., only from process chemistry journals.
Limitations
of the Data set
While this methodology
is easily applied to other data sources (e.g., Reaxys[31] or an electronic lab notebook), use of unpublishable data
would prevent future performance comparisons; for this reason, we
have made use of open source data originating from the patent literature.
This data set is well-suited for quantitative performance comparisons
but does present quality concerns, as patented syntheses may not have
been validated experimentally or may have had a very low yield. This
concern, however, does not negate the fact that these examples reflect
an implicit retrosynthetic strategy (with regard both to the types
of reactions commonly employed and to when certain disconnections
are applied on the basis of present or absent structural motifs) contained
within the patent literature.
Quality of Suggestions
There are certain patterns that
chemists follow when performing a retrosynthetic analysis, including
consideration of reaction classes, viable synthons, and hierarchies
of functional group reactivities. On the basis of the quantitative
performance on the USPTO data set, it is clear that the model is successful
in proposing retrosynthetic disconnections that match actual patented
syntheses without the need for any explicit chemical
knowledge. When making suggestions within a specific reaction class,
the model makes a perfect recommendation 52.9% of the time; even without
specifying the reaction class, perfect recommendations are made 37.3%
of the time. When 10 disconnections are proposed, the success rates
increase to 88.1% and 74.1%, respectively. The approach is successful
even when extended to pathway planning for high-value, medicinally
relevant drug compounds.However, some suggestions, particularly
lower-ranked or lower-scored ones, may not be synthetically viable
when attempted experimentally. The use of similarity for prioritizing
suggestions partially mitigates this issue of “false positive”
recommendations while still generating potential synthetic routes
rapidly; to generate more conservative recommendations with a more
guaranteed rate of success, slower methods can be applied for forward
reaction prediction.[14]
Conclusion
We have demonstrated an approach for automated retrosynthesis based
on analogy to known reactions. Molecular similarity, both between
products and between reactants, is a sufficient metric for determining
relevant precedents and applying the corresponding highly local retrosynthetic
transform. Because a relatively small number of templates are applied
when they are thought to be relevant, it is not necessary to define
heuristics for their extraction, nor is the speed limited by the computational
bottleneck of full template library application. Calculating a target
molecule’s similarity to a set of known products is an “embarrassingly
parallel”, computationally inexpensive problem and there exist
numerous means of doing so. By design, suggested precursors are necessarily
linked to specific precedents as supporting evidence. And although
this data set does not contain contextual information, using one that
does would further enrich suggestions by including information about
reagents, catalysts, solvents, and temperatures of precedent reactions
to assist in experimental validation.We describe our workflow
in full detail and open source our code
to enable use of other data sets as knowledge bases, for example,
in-house electronic lab notebook data.
Authors: Connor W Coley; Regina Barzilay; William H Green; Tommi S Jaakkola; Klavs F Jensen Journal: J Chem Inf Model Date: 2017-07-25 Impact factor: 4.956
Authors: Steven Kearnes; Kevin McCloskey; Marc Berndl; Vijay Pande; Patrick Riley Journal: J Comput Aided Mol Des Date: 2016-08-24 Impact factor: 3.686
Authors: Connor W Coley; Regina Barzilay; Tommi S Jaakkola; William H Green; Klavs F Jensen Journal: ACS Cent Sci Date: 2017-04-18 Impact factor: 14.553
Authors: William M Barrett; Sudhakar Takkellapati; Kidus Tadele; Todd M Martin; Michael A Gonzalez Journal: ACS Sustain Chem Eng Date: 2019-04-15 Impact factor: 8.198
Authors: Mingjian Wen; Samuel M Blau; Evan Walter Clark Spotte-Smith; Shyam Dwaraknath; Kristin A Persson Journal: Chem Sci Date: 2020-12-08 Impact factor: 9.825