Connor W Coley1, Regina Barzilay1, Tommi S Jaakkola1, William H Green1, Klavs F Jensen1. 1. Department of Chemical Engineering and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
Abstract
Computer assistance in synthesis design has existed for over 40 years, yet retrosynthesis planning software has struggled to achieve widespread adoption. One critical challenge in developing high-quality pathway suggestions is that proposed reaction steps often fail when attempted in the laboratory, despite initially seeming viable. The true measure of success for any synthesis program is whether the predicted outcome matches what is observed experimentally. We report a model framework for anticipating reaction outcomes that combines the traditional use of reaction templates with the flexibility in pattern recognition afforded by neural networks. Using 15 000 experimental reaction records from granted United States patents, a model is trained to select the major (recorded) product by ranking a self-generated list of candidates where one candidate is known to be the major product. Candidate reactions are represented using a unique edit-based representation that emphasizes the fundamental transformation from reactants to products, rather than the constituent molecules' overall structures. In a 5-fold cross-validation, the trained model assigns the major product rank 1 in 71.8% of cases, rank ≤3 in 86.7% of cases, and rank ≤5 in 90.8% of cases.
Computer assistance in synthesis design has existed for over 40 years, yet retrosynthesis planning software has struggled to achieve widespread adoption. One critical challenge in developing high-quality pathway suggestions is that proposed reaction steps often fail when attempted in the laboratory, despite initially seeming viable. The true measure of success for any synthesis program is whether the predicted outcome matches what is observed experimentally. We report a model framework for anticipating reaction outcomes that combines the traditional use of reaction templates with the flexibility in pattern recognition afforded by neural networks. Using 15 000 experimental reaction records from granted United States patents, a model is trained to select the major (recorded) product by ranking a self-generated list of candidates where one candidate is known to be the major product. Candidate reactions are represented using a unique edit-based representation that emphasizes the fundamental transformation from reactants to products, rather than the constituent molecules' overall structures. In a 5-fold cross-validation, the trained model assigns the major product rank 1 in 71.8% of cases, rank ≤3 in 86.7% of cases, and rank ≤5 in 90.8% of cases.
Synthesis planning
is often referred to as an art. The process
of identifying a suitable pathway (i.e., series of reaction steps)
which transforms some set of available reactants into a target compound
is typically performed by expert chemists with years or decades of
experience. To assist chemists with this task, computer-aided synthesis
design was introduced over 40 years ago in the form of retrosynthetic
planning software.Retrosynthesis was originally formalized
by Corey and Wipke[1,2] in their efforts to introduce
computer assistance to synthesis with
Logic and Heuristics Applied to Synthetic Analysis (LHASA).[3] Corey’s approach to codifying retrosynthesis
involved the explicit identification of molecular structures which
lend themselves to disconnection or, rather, can be produced by known
reactions in the forward direction. Almost all approaches to automated
retrosynthesis, LHASA included, involve the use of reaction templates—submolecular
patterns that encode changes in atom connectivity. Recursively applying
retrosynthetic templates to a target molecule produces a candidate
synthesis tree. However, a synthetic route based on retrosynthetic
templates does not always lead to a successful forward synthesis.
Templates are locally defined pattern-matching rules, inherently naive
to what is present in the rest of the molecule. It is common, therefore,
for a proposed retrosynthetic disconnection to be unviable in the
forward direction. Once a synthetic route has been proposed, it is
critical to evaluate each step in the forward direction to identify
these challenges.Forward analysis is such an important part
of pathway evaluation
that even the very first retrosynthesis program, Corey’s LHASA,
could identify functional group conflicts which might lead to a lack
of specificity or selectivity.[3] Another
early program, Computer-Assisted Mechanistic Evaluation of Organic
Reactions (CAMEO),[4] implemented a similar
approach where nucleophilic and electrophilic sites were analyzed
pairwise to determine qualitative reactivities. Other programs like
SOPHIA[5] and Eros[6] identify potentially reactive functional groups using manually curated
reactivity rules and empirical calculations. Chematica’s Syntaurus[7] contains explicitly encoded lists of incompatible
functional groups for each retrosynthetic template.Manual encoding
of these rules has obvious disadvantages. First,
it relies on the intuition and experience of a small number of chemists.
Second, it is not scalable—it is not realistic to exhaustively
define the full substrate scope and incompatibilities for every possible
reaction. Third, conflicting reactivity is rarely black and white;
incompatibility depends on the exact nature of the reacting molecules.
These factors motivate the development of an automated approach to
forward reaction evaluation.Work by Kayala et al. considers
the problem of forward synthesis
mechanistically, rather than using end-to-end templates.[8,9] They use graph-based representations of molecules and assign approximate
molecular orbitals to each so that a mechanistic step can be considered
an interaction between a donor and an acceptor orbital. Although they
show very promising results, the need for manual encoding of mechanistic
rules to generate training data could be problematic and limit scalabity.
Wei et al.[10] describe the use of neural
networks to predict the outcome of reactions based on reactant fingerprints,
but limit their study to 16 types of reactions covering a very narrow
scope of possible alkyl halide and alkene reactions. Given two reactants
and one reagent, the model was trained to identify which of 16 templates
was most applicable. The data set used for cross-validation comes
from artificially generated examples with limited chemical functionality,
rather than experimental data.Quite recently, Segler and Waller
describe two approaches to forward
synthesis prediction. The first is a knowledge-graph approach that
uses the concept of half reactions to generate possible products given
exactly two reactants by looking at the known reactions in which each
of those reactants participates.[11] In one
validation, the recorded product is found in the list of candidate
products (with a median size of 3, mean 5.3) 67.5% of the time using
a knowledge-base of eight million reactions. With “only”
one million reactions, performance drops to just 15%. The second approach
uses neural networks to rank reaction templates given reactant fingerprints.
For reactions that are known to be contained in their automatically
extracted set of 8720 templates, the authors report accuracies as
high as 78%, but do not quantify the coverage of their template set;
an average of just 44.5 matches per query suggests poor coverage.[12] The data used for training and testing are not
precisely defined, nor is the code/model available for comparative
purposes.Previous studies have not relied on published experimental
reaction
examples due to the challenge of only having “positive”
examples, excepting Segler and Waller.[12] The literature is heavily biased toward reactions with high yields,
and reactions with negligible or zero yields are rarely reported except
for illustrative purposes, e.g., highlighting the necessity of a catalyst.
Proprietary electronic lab notebooks can contain numerous unproductive
reactions, including results of high-throughput screening, but these
data are not available in public or even commercial databases. This
limitation of reaction databases precludes many supervised learning
approaches for forward synthetic prediction: one cannot train a model
directly on literature data to classify a certain reaction as productive
or unproductive, since there are almost no unproductive examples available.In this work, we describe a model that learns to predict the major
products of chemical reactions given a set of reactant molecules by
combining rigid reaction templates and machine learning. Specifically,
we report the following contributions: (1) a data augmentation strategy
whereby reaction databases are supplemented with chemically plausible
negative reaction examples; (2) the successful application of that
strategy using automatically extracted forward synthesis templates,
where poor specificity is not a hindrance and no manual curation is
required; (3) a new reaction representation focused on the fundamental
transformation at the reaction site rather than constituent reactant
and product fingerprints; (4) the implementation and validation of
a neural network-based model that learns when certain modes of reactivity
are more or less likely to occur than other potential modes. Despite
the literature bias toward reporting only high-yielding reactions,
we develop a successful workflow that can be performed without any
manual curation using actual reactions reported in the USPTO literature.
Approach
Overview
Our model predicts the outcome of a chemical
reaction in a two-step manner: (1) applying overgeneralized forward
reaction templates to a pool of reactants to generate a set of chemically
plausible products, and (2) estimating which candidate product is
the major product as a multiway classification problem using machine
learning. This is shown schematically in Figure .
Figure 1
Model framework combining forward enumeration
and candidate ranking.
The primary aim of this work is the creation of the parametrized scoring
model, which is trained to maximize the probability assigned to the
recorded experimental outcome.
Model framework combining forward enumeration
and candidate ranking.
The primary aim of this work is the creation of the parametrized scoring
model, which is trained to maximize the probability assigned to the
recorded experimental outcome.In the first stage, we apply a library of forward synthetic
templates
to define which products could be produced based on the initial reactants.
Rule-based enumeration has been applied to forward synthesis analysis
previously,[13,14] but because many distinct templates
can match a given reactant set and generate hundreds or thousands
of products, simple enumeration is not inherently useful. If a particular
reaction is plausible, but it proceeds at a rate insignificant compared
to other reactions, then we do not need to consider it when evaluating
the viability of a forward reaction step. In the second stage, each
candidate reaction is scored individually by the machine learning
model. This model assesses the likelihood of reactivity, akin to a
reaction rate, in isolation from competing reactions. The scores from
all candidates are compared in a softmax network layer (i.e., an exponential
activation function that maps a list of numbers to a list of probabilities
that sum to one) to generate probabilities describing which product
is predicted to be most abundant.A key component of our approach
is this two-step formalization.
By generating candidate products in the first step, in effect, existing
reaction databases are augmented with negative reaction examples.
This circumvents the limitation of only having high-yielding reaction
data. Implicit in a reaction example A + B → C with greater
than 50% yield is that or at least that D and E were formed to a lesser extent than the
reported C, where D and E are plausible alternate products. Including
these alternate products in the training set allows us to extract
more information from each reaction entry than we otherwise could
by, e.g., training to predict the yield of only recorded reactions.The recorded product of a reaction in the patent database is the
“true” product that the model learns to predict, while
the chemically plausible alternate products generated via templates
are the “false” products which were not reported in
the literature. A model can then be trained to identify the “true”
product as a multiway classification or ranking problem. Treating
recorded products as true outcomes is consistent with the literature
bias toward reporting high-yield reactions.
Data
The source
of reaction examples is the set of
USPTO patents granted between 1976 and 2013, preparsed by Lowe.[15] Contextual information (e.g., temperature, solvent,
reaction time) is inconsistently present, so reaction examples were
reduced to reactants and products only. Forward templates were extracted
from the prefiltered set of 1,122,662 atom-mapped reaction SMILES
in 1976–2013_USPTOgrants_reactionSmiles_feb2014filters.rsmi.[16] For training and testing major product identification,
a 15 000-member subset of the full set of reactions in 1976–2013_USPTOgrants_CML.7z[16] was used. These examples are formatted as CML
documents where distinct roles (reactant, product, reagent, solvent,
catalyst) are assigned to each molecule in the reaction. This data
set was chosen over the previous one so that spectator molecules (e.g.,
reagents, solvents) could be given the opportunity to react, despite
not contributing atoms to the reported products. No two reaction examples
in the data set have an identical reactant pool.
Forward Enumeration
To build the database of forward
templates, we use a heuristics-driven algorithm inspired by Law et
al.[17] and Bogevig et al.[18] For each atom-mapped reaction example found in Lowe’s
USPTO database, the reaction core is defined by determining which
product atoms have a different connectivity than the corresponding
reactant atoms. The reaction core is expanded to include adjacent
unmapped leaving groups and immediately neighboring atoms. Neighboring
atoms are fully generalized into any non-hydrogen substituent for
maximal generality to achieve high coverage at the expense of low
specificity. A SMARTS string encoding the submolecular pattern at
the reaction core can be generated for the reactants and for the products,
which together define a reaction SMARTS string. A total of 140 284
unique reaction SMARTS strings are extracted from 1,122,662 reaction
SMILES strings. Figure S1 shows the popularity
of forward templates as a function of their rank. The five most popular
templates are shown in Figure with ex post facto labels. Although these
five happen to be unimolecular functional group conversions, we have
not predefined common functional groups or common transformations;
moreover, the model does not rely on manual curation, labeling, or
sorting of these extracted templates.
Figure 2
Depiction of the top five most popular
forward synthetic templates
extracted from 1.1 million USPTO reactions. C[al] denotes any aliphatic
carbon.
Depiction of the top five most popular
forward synthetic templates
extracted from 1.1 million USPTO reactions. C[al] denotes any aliphatic
carbon.Due to imperfect canonicalization,
the 140 284 templates
contain some duplicates (e.g., same patterns with different numbering)
and other redundancies (e.g., hydrolysis of ester overlaps with hydrolysis
of alkyl ether/ester). Moreover, application of templates is computationally
expensive (Figure S2), and the marginal
coverage benefit of including additional templates decreases rapidly
with rank (Figure S1). To focus the model
on the most prevalent reaction types, only templates with more than
50 precedents were included in subsequent steps; this corresponds
to the top 1689 templates.For each reaction example, all molecules
(reactants, reagents,
catalysts, and solvents) were combined into a single reactant pool.
The removal of annotations labeling species’ roles was motivated
by the fact that they were originally assigned using information about
the recorded product. The 1689 templates were applied to the reactant
pool to generate a list of potential products. The candidate products
were each reduced to the product fragment with the longest SMILES
string to neglect any byproduct salts and approximate the “major
product”. Atom mapping is preserved so that a candidate product
corresponds to a fully atom-mapped candidate reaction. The inclusion
of reagents, catalysts, and solvents makes the task of product prediction
more challenging due to the competing candidate products they produce.
The recorded product was found within the candidate sets in roughly
76% of reaction examples. The imperfect coverage (24%) reflects the
use of the 1689 most popular templates rather than the full set of
140 284. The 15 000-member data set used for training
and testing consists only of reactions where the recorded product
was found within the candidate set generated by this template set.
A histogram showing the number of candidate atom-mapped reactions
for each example can be found in Figure S6. The peak occurs around 150 candidates with a median of 246 and
mean of 353.Prior to the subsequent candidate ranking step,
during training,
we determine which candidates match the “true” recorded
reaction outcome. Atom-mapping is excluded from this comparison to
limit the impact of its inaccuracies on model performance. When multiple
candidates match the recorded product, the candidate corresponding
to the most popular template is kept and the remaining matching candidates
are discarded so only one “true” candidate exists.
Candidate Ranking
The parametrized reaction scoring
function, a central component of the workflow in Figure , is challenging to design
for many reasons, one of which is representation. A reaction is a
complex data structure for which there is no universal vector-based
description. We employ a new edit-based reaction representation strategy
that emphasizes the change in atom connectivity that occurs at the
reaction core during a chemical reaction. An atom-mapped reaction
candidate is parsed into four different types of edits:An atom a loses a hydrogenAn atom a gains a hydrogenTwo atoms, a and a, lose a connecting bond bTwo atoms, a and a, gain a connecting bond bChanges
in bond order are marked as a loss of the original
bond order and a gain of the new bond order. With this representation,
less fundamental changes (e.g., formal charge, association/disassociation
of salts) are neglected. Loss or gain of a hydrogen is represented
by 32 easy-to-compute features of that reactant atom alone, . Loss or gain of a bond is represented
by a concatenation of the features of the atoms involved and four
features of the bond, .Because edits occur at the reaction center
by definition, the overall
representation of a candidate reaction depends only on the atoms and bonds at the reaction core. There is no explicit
inclusion of other molecular features, e.g., adjacency to certain
functional groups. However, in our featurization, we include rapidly
calculable structural and electronic features of the reactants’
atoms that reflect the local chemical environment first and foremost,
but also reflect the surrounding molecular context.[19−22] The chosen features can be found
in Table and Table S2 and are discussed in more detail in
the Supporting Information.
Table 2
Edit-Based Model
Performance When
Certain Input Features Are Set to Their Average Value for True Edits
in the Training Seta
type
of edit masked, relative test accuracy
ind.b
masked index (ces)
H loss (%)
H gain (%)
bond loss (%)
bond gain
(%)
0
Crippen
logP contribution
98.3
98.3
97.0
96.0
1
Crippen MR contribution
98.0
96.6
96.0
95.9
2
TPSA contribution
96.6
96.4
95.7
96.2
3
Labute ASA contribution
98.4
96.7
95.9
96.6
4
Estate index
97.8
95.9
95.6
95.9
5
Gasteiger partial charge
97.1
98.0
97.4
99.5
6
Gasteiger H partial charge
98.7
100.4
99.7
96.5
7–17
atomic number (1-hot)
98.7
98.2
94.8
96.8
18–23
number
of neighbors (1-hot)
98.5
98.2
94.9
95.4
24–28
number of hydrogens (1-hot)
97.8
96.9
95.2
95.6
39
formal charge
100.4
100.2
100.2
100.4
30
is in ring
99.8
98.5
95.8
96.6
31
is aromatic
99.9
100.1
97.7
95.4
Performance
is reported in terms
of relative accuracy on the test set. The three most significant features
in each column are bolded for emphasis.
Indices in the atom representation.
The design
of the neural network is motivated by the likelihood
of a reaction being a function of the atom/bond changes that are required
for it to occur. Individual edits are first analyzed in isolation
using a neural network unique to the corresponding edit type. Three
fully connected dense layers are used to embed initial features into
an intermediate feature vector representation. The intermediate features
vectors of all edits are summed and passed through a final neural
network to produce a scalar score. This score, assigned to each candidate
separately, represents the propensity of the proposed reaction (i.e.,
set of edits) to occur, akin to the negative free energy of reaction.
The absolute likelihood scores of all candidates are compared in a
final softmax layer, which produces a vector of probabilities from
a vector of numbers by treating their values as pseudoenergies in
a Boltzmann distribution with kBT = 1. An example using simplified atom- and bond-level
features is described in the Supporting Information for the reaction
shown in Figure S4. When trained with this
architecture, the four distinct neural networks become tailored to
assessing their corresponding edit type. The architecture is depicted
in Figure . Layer
sizes were fixed prior to cross-validation tests after an initial
screening to ensure sufficiently flexibility to describe the training
data. All hidden layers are fully connected with bias and tanh activation.
The final model required 144 001 parameters, described in Table S3.
Figure 3
Edit-based model architecture for scoring
candidate reactions.
Reactions are represented by four types of edits. Initial atom- and
bond-level attributes are converted into feature representations,
which are summed and used to calculate that candidate reaction’s
likelihood score.
Edit-based model architecture for scoring
candidate reactions.
Reactions are represented by four types of edits. Initial atom- and
bond-level attributes are converted into feature representations,
which are summed and used to calculate that candidate reaction’s
likelihood score.We also implement a baseline
model, which attempts to rank candidate
products based on the products alone; no consideration is given to
the reactants or corresponding reaction edits. For this model, product
molecules are represented by radius-2 Morgan circular fingerprints
of length 1024. A single hidden layer with tanh activation is used
prior to the linear output layer. The baseline scoring model is shown
in S5 with its 51 301 parameters (described in Table S4).And finally, we implement a
hybrid model, which trains the full
edit-based and baseline architectures simultaneously and uses the
sum of their scores for each candidate reaction.
Results
Following the aforementioned procedures, 15 000 recorded
reaction examples where the true product was found by applying the
1689 most popular templates were taken from the USPTO literature and
augmented by adding the nonrecorded products to create a set of 5,335,669
examples. The model was trained and tested using a 5-fold cross-validation
with the Adadelta optimizer[23] and early
stopping. Each fold used a 70%/10%/20% training/validation/testing
split and ceased training once the validation loss did not improve
for five epochs. The edit-based model achieves an test accuracy of
68.5%, averaged across all folds. In this context, accuracy refers
to the percentage of reaction examples where the recorded product
was assigned a rank of 1. The baseline model was similarly trained
and tested in a 5-fold CV, reaching an accuracy of 33.3%, suggesting
that the set of recorded products in the data set is fairly homogeneous.
The hybrid model, combining the edit-based representation with the
proposed products’ fingerprint representations, achieves an
accuracy of 71.8%. These results are displayed in Table .
Table 1
Comparison
between Baseline, Edit-Based,
and Hybrid Models in Terms of Categorical Crossentropy Loss and Accuracya
model
loss
acc. (%)
top-3 (%)
top-5 (%)
top-10 (%)
random guess
5.46
0.8
2.3
3.8
7.6
baseline
3.28
33.3
48.2
55.8
65.9
edit-based
1.34
68.5
84.8
89.4
93.6
hybrid
1.21
71.8
86.7
90.8
94.6
Top-n refers
to the percentage of examples where the recorded product was ranked
within the top n candidates.
Top-n refers
to the percentage of examples where the recorded product was ranked
within the top n candidates.Accuracy is a simplified metric of model performance;
the actual
objective during training is minimization of the categorical crossentropy
loss, −log p(xtrue), the average of the negative natural logarithm of the probability
assigned to the true candidate. By this metric, which reflects both
model accuracy and model confidence, the edit-based model (1.34) is
vastly superior to the baseline model (3.28); the hybrid model (1.21)
offers an additional improvement.A direct comparison of prediction
distributions is shown in Figure . As indicated by
the baseline model’s histogram of assigned probabilities in Figure a, the true outcome
was assigned a near-zero probability in a majority of examples; in
comparison, the edit-based model exhibits a more favorable distribution
shifted toward higher probabilities—the hybrid model even more
so. This is also reflected in Figure b, where the distribution of assigned ranks is short-tailed
in the edit-based and hybrid models and long-tailed in the baseline
model. The shape of this tail affects overall model success as the
success criterion is relaxed from rank = 1 to rank ≥ n. Success rates of each model are shown in Figure c as a function of the minimum
acceptable rank (i.e., the top-n accuracy). Figure S12 in the Supporting Information shows
model performance as a function of how similar recorded products are
to others in the data set.
Figure 4
Performance of the three reaction prediction
models as indicated
by the (a) histogram of probabilities assigned to true outcomes; (b)
histogram of ranks assigned to true outcomes, truncated to ranks 1–10;
and (c) overall success rate as a function of the minimum acceptable
assigned rank. In each case, the model is attempting to select the
true product out of several hundred possible reaction products.
Performance of the three reaction prediction
models as indicated
by the (a) histogram of probabilities assigned to true outcomes; (b)
histogram of ranks assigned to true outcomes, truncated to ranks 1–10;
and (c) overall success rate as a function of the minimum acceptable
assigned rank. In each case, the model is attempting to select the
true product out of several hundred possible reaction products.It is important to understand
the significance of the probability
the model assigns to each candidate. Figure depicts the performance of each model as
a function of that model’s “confidence”. In this
context, confidence refers to the probability assigned to the highest-ranked
candidate (i.e., the highest probability assigned to any candidate
product). There is a very strong correlation of the model’s
accuracy with the prediction confidence; this signifies that the probability
assigned to a candidate product does indicate the actual likelihood
of that being the major product; this is a highly desirable characteristic
for a predictive model. The distributions of prediction confidence
are shown in Figure S10.
Figure 5
Mean model accuracy as
a function of the binned model confidence,
where the model confidence refers to the probability assigned to the
highest-ranked candidate.
Mean model accuracy as
a function of the binned model confidence,
where the model confidence refers to the probability assigned to the
highest-ranked candidate.
Prediction Examples
Analysis of individual predictions
gives greater insight into model behavior than statistical measures. Figures and 7 show the details of predictions from the hybrid model on
test data.
Figure 6
Reaction examples where the hybrid model assigned rank 1 to the
recorded product. Recorded/predicted reactions: (a) chlorination;
(b) amide synthesis; (c) isoxazole synthesis; (d) sulfamide synthesis;
(e) etherfication; (f) Suzuki coupling; (g) Grignard addition; (h)
azidation; (i) alkylation.
Figure 7
Reaction examples where the hybrid model did not assign rank 1
to the recorded product. Recorded [predicted] reactions: (a) amidation
[amidation of different substrate]; (b) hydrolysis [hydrolysis at
different ether]; (c) deprotection [nitration]; (d) oxidation [bromination];
hydrogenation [dehalogenation]; (f) iodination [iodination at different
site].
Reaction examples where the hybrid model assigned rank 1 to the
recorded product. Recorded/predicted reactions: (a) chlorination;
(b) amide synthesis; (c) isoxazole synthesis; (d) sulfamide synthesis;
(e) etherfication; (f) Suzuki coupling; (g) Grignard addition; (h)
azidation; (i) alkylation.Reaction examples where the hybrid model did not assign rank 1
to the recorded product. Recorded [predicted] reactions: (a) amidation
[amidation of different substrate]; (b) hydrolysis [hydrolysis at
different ether]; (c) deprotection [nitration]; (d) oxidation [bromination];
hydrogenation [dehalogenation]; (f) iodination [iodination at different
site].Figure a depicts
a functional group conversion from an alcohol to the corresponding
chloride using thionyl chloride (SOCl2). The recorded product
is accurately predicted with an assigned probability of 94.8%. Sodium
bicarbonate is also present, which introduces some competing reactivity
channels (e.g., chlorination of bicarbonate), although these are all
assigned a lower probability than the true outcome.Figure b depicts
a reaction between a carbamate and a secondary amine that leads to
the carbamide, assigned a probability of 84.8%. The model recognizes
that the tertiary amine is not a likely candidate for reactivity and
that ethoxy is a plausible leaving group.Figure c depicts
a ring-forming reaction between an aryl alkyne and an iminol that
leads to the isoxazole, assigned a high probability that rounds up
to 100.0%.Figure d depicts
an S–N coupling between a primary arylamine and a sulfonyl
chloride assigned a probability of 98.1%. Although sulfonamides can
be prepared using secondary amines, the model recognizes that the
diarylamine nitrogen is less reactive than the arylamine nitrogen.
It also ranks the various possible Friedel–Crafts reactions
(S–C coupling) lower.Figure e depicts
a simple SN2 etherfication reaction between a phenol and
an alkyl bromide, correctly predicted.Figure f depicts
a correctly predicted Suzuki Coupling between a pyridyl boronic acid
and a pyridyl bromide. This outcome is assigned a probability of 98.8%
and a rank of 1. Even though the necessary context (e.g., palladium
catalyst) is missing and cannot be perceived by the model, the model
implicitly assumes that the reaction would be be run under typical
coupling conditions.Figure g depicts
a Grignard reaction using isobutyl magnesium bromide. The alkylated
substrate contains two potential carbonyl targets: an aldehyde and
a nearby cyclic amide. The model correctly predicts the addition to
occur at the aldehyde to form the secondary alcohol with high confidence.Figure h depicts
the preparation of an azide using sodium azide to replace the mesylate
group. This recorded outcome is assigned a rank of 1 with a probability
of 77.7%.Figure i depicts
a correctly predicted propylation. Propyl iodide reacts with the α
carbon adjacent to a carboxylic ester. While the reaction may actually
proceed through the enolate form, the model never explicitly constructs
this intermediate species.Turning to reaction examples for
which the recorded product was
not predicted correctly, Figure a depicts an amidation reaction between a carboxylic
acid and a primary amine, where two separate primary amine substrates
are available. It is likely that in the original document this example
comes from, two separate reactions were reported for the two substrates,
but these were somehow combined into one example during parsing and
misrecorded in the database. Due to the similar properties at the
reaction core (i.e., the two amines’ nitrogen atoms), these
two candidate outcomes are assigned very similar probabilities of
47.6% (recorded outcome) and 48.3% (alternate outcome); this produces
a narrow misprediction. Remarkably, in the original CML file, the
recorded product yield is 43%—quite close to our assigned probability.
The uncertainty in our model is well-justified in this case.Figure b depicts
a reaction example where, in the recorded outcome, a methyl ester
is hydrolyzed to form a carboxylic acid. On the same substrate, there
is a t-butyldimethyl silyl ether, which is a commonly
employed alkoxy protecting group highly resistant to hydrolysis. This
resistance is not captured in our model, which mistakenly predicts
the silyl ether to be the site of hydrolysis. The outcomes are assigned
similar probabilities (23.6% and 36.1% with ranks 2 and 1), however,
indicating that the model believes either of these two outcomes to
be likely. This example highlights the importance of capturing sterics
in candidate representations; in our atom-level featurization, two
relevant attributes are included: the Labute Approximate Surface Area
contribution and the Total Polar Surface Area contribution. Either
the model has not seen enough examples in the training set where steric
hindrance drives selectivity to understand its effects or—for
this example—the perceived electronic differences between the
two hydrolysis sites outweigh any perceived steric effects.Figure c depicts
a reaction where a secondary amine is dealkylated to form a primary
amine in the presence of ammonium nitrate; this outcome is assigned
a low probability of 1.14% and rank of 19, while the predicted outcome
is assigned a still-low probability of 17.2%. In this case, because
the ammonium counterion to nitrate is not present in the edit-based
representation unless it is involved in the reaction, the model cannot
distinguish weakly acidic conditions of the experiment from highly
acidic nitric acid. The predicted outcome is thus an aromatic nitration
on a substituted benzene ring ortho to chlorine.Figure d depicts
a reaction example where the recorded outcome, oxidation of a substituted
toluene to form a benzaldehyde, is assigned an extremely low probability
of 1 × 10–5%. This reaction is plausible due
to the strength of the oxidizing agent, benzoyl peroxide (BPO), but
that is likely missed in our edit-based representation. The primary
reason for this is that the radical mechanism by which BPO operates—and
its corresponding reactivity—cannot be predicted based on the
atom-level features used to describe the oxygen atom it contributes.
Instead, the model predicts that N-bromosuccinimide
(NBS) brominates the highly substituted aromatic ring. That prediction
is consistent with the use of NBS as a brominating agent and is a
plausible reaction outcome, particularly if BPO were not present.
This example highlights the importance of order of reagent addition,
which is not specified in the database used for training the model.Figure e depicts
a mispredicted reduction of a brominated quinolone species in the
presence of sodium borohydride. In both the recorded and predicted
outcomes, the presence of sodium borohydride is not captured as it
does not contribute heavy atoms to the product molecule. The recorded
hydrogenation of the pyridine ring is assigned a probability of 0.04%
and a correspondingly low rank of 19. The model predicts instead the
debromination with a relatively high probability of 61.1%, which has
been reported for similar substrates.Figure f depicts
an iodination reaction by N-iodosuccinimide (NIS)
where—due to the presence of three aromatic rings—there
are many plausible reactive sites. The recorded outcome, assigned
a rank of 7, is ortho to two fluorines, meta to a nitro group, and
para to a secondary amine; these four substituents all direct reactivity
to this site. While Gasteiger partial charges attempt to capture electron
donating and electron withdrawing effects in aromatic rings, it is
clear that the trained model does not properly capture the impact
of directing groups. As with the previous example of silyl ether hydrolysis,
this could be due to an insufficient input featurization or an insufficient
number of reaction examples from which to learn this nuance.
Input
Feature Analysis
To probe how the trained edit-based
model depends on the input atom features, model performance is evaluated
while masking certain feature indices (i.e., overriding the feature
values of the candidate edits). Average values for each set of indices
are calculated from the true candidates of the 10 500 reaction
examples in the training set. These averaged values are then used
during model evaluation on the test set. The resulting decrease in
model performance measures how strongly the trained model relies on
values of those attributes to discriminate candidates, although it
does not reflect how the model might have developed had that attribute
been missing during training. The results of this analysis for the
edit-based model are reported in Table in terms of relative
test accuracy (averaged over the 5-fold CV). Masking tests were performed
for each edit type individually.Performance
is reported in terms
of relative accuracy on the test set. The three most significant features
in each column are bolded for emphasis.Indices in the atom representation.The importance of atom-level
attributes is not constant across
the four edit types. Interestingly, the worst performance observed
when masking a single input feature only represents a 5.2% decrease
in accuracy. This signifies that the model is making use of information
from many features and from all edit types to learn the nuances of
chemical reactivity rather than relying on one or two attributes.
For loss and gain of hydrogen, we see that the model relies on “functional”
descriptors for evaluation, particularly the total polar surface area
(TPSA) contribution, Estate index, Crippen contribution to molar refractivity
(MR), and the Gasteiger partial charge. Loss or gain of a bond is
also dependent on these features, but more so on the “structural”
features describing the atomic number, number of neighbors, and number
of hydrogens. Aromaticity is another important feature for assessing
bond gain, likely due to the prevalence of aryl halide coupling reactions.
These structural atom-level descriptors are similar to what have been
used in convolutional molecular embedding.[24,25] There is significant room for improvement in identifying suitable
atom-level descriptors that provide additional indications of reactivity
or otherwise reflect the molecular context.
Discussion
Forward
Template Quality
An inherent shortcoming in
the automatic extraction of reaction templates is their locality.
Certain heuristic-driven techniques can be applied after template
extraction[17] to improve applicability and
consolidate many similar templates into fewer generalized ones, but
this process is challenging without manual intervention. To begin
to address the issue of locality, Soh et al.[26] describe an analysis of functional group occurrence in reaction
examples to infer which groups promote/inhibit certain types of reactivity.
While this is a more scalable approach than manually listing incompatible
groups,[7] it has not been adapted into a
directly translatable template-improvement algorithm.Extracted
templates are also highly reliant on atom-mapping to describe the
correspondence between reactant and product atoms. Modern algorithms
have evolved beyond simple maximum common substructure searches,[27,28] yet atom-mapping is certainly not a solved problem. In our approach,
inaccurate atom-mapping is a less significant issue, as our templates
are designed to be overgeneral to achieve high coverage of potential
product species. The overall model performance does not depend strongly
on atom mapping quality.The coverage afforded by the forward
templates affects the generalizability
of the overall two-step model: for a yet-unseen reaction example,
the model must be able to identify the true product as a candidate;
otherwise the true product cannot be evaluated by the neural network.
This same limitation is discussed by Segler and Waller.[12] Although they do not quantify the coverage of
their automatically extracted 8720 templates, the average number of
matches per query of 44.5 (compared to our 353) suggests correspondingly
lower coverage. Morerover, our neural network model can evaluate any candidate reaction, even if the corresponding template
and/or substrates have never been seen before. This makes the overall
model highly extensible, as the template library can be expanded independently
of neural network training. We do not observe a strong dependence
of model performance on the number of candidates (Figure S11).
Candidate Reaction Representation
The vast majority
of existing machine learning algorithms require fixed-length vectors
or one-dimensional sequences as inputs. For single molecules, fixed-length
vector representations include pharmacore fingerprints, Morgan or
Extended-Connectivity fingerprints,[29] descriptor
vectors, and—more recently—learned fingerprints.[24,25] Reactions could be represented as (a) the concatenation of reactant
and product fingerprints, (b) the difference between reactant and
product fingerprints, previously used for a reaction classification
task,[30] or (c) some other representation
with greater focus on the reaction core, like that of Kayala et al.[8,9]In general, the performance of a neural network depends strongly
on the choice of input representation. Representing reactions as concatenations
of constituent molecules’ fingerprints (optionally, reagents/catalysts
too) proved viable in Wei et al.’s analysis of 16 reaction
families for small compounds with 10 or fewer carbon atoms.[10] However, these types of molecule-level representation
of reactants are insufficient for the types of real, multifunctional
molecules used in organic synthesis; interactions between different
combinations of substructures must all be considered. This is similar
in philosophy to the combinatorial enumeration of possible mechanistic
steps in Kayala et al.’s ReactionPredictor.[8,9]The edit-based reaction representation is not without its flaws.
From our analysis of which features the trained model is most dependent
on, we find that the “functional” descriptors obtained
through rapid molecule-level calculations are important in addition
to the structural descriptors. With the current feature set, perhaps
too much attention is placed on the reaction core, so important information
about the nonreacting atoms is not sufficiently captured. There is
opportunity for improvement in identifying suitable atom-level descriptors
that provide additional indications of reactivity or otherwise reflect
the molecular context.
Context Awareness
The current model
assesses candidates
only in terms of the molecules that contribute atoms to the final
product, i.e., the reaction core. There is no consideration of reaction
conditions, such as reactant concentration, catalyst identity, catalyst
concentration, reagent concentrations, solvent(s), temperature, pressure,
or reaction time. These factors can obviously have a tremendous effect
on reaction outcome. In its current form, the lack of context means
that the model weighs reaction candidates under implicitly defined
typical conditions. The unavailability of contextual information also
means that the model learns to limit the confidence with which it
makes some predictions.Preliminary unpublished work on context-aware
models suggests that learning contextual effects will be challenging
due to the sparseness of data. Even in commercial databases (e.g.,
Reaxys) with partial manual curation, most reaction entries do not
contain fully specified conditions. Moreover, reactions within a specific
family are often reported under similar conditions, so extrapolation
to atypical conditions would be challenging. Future work on context-aware
models must address the question of how to train on and extrapolate
from this limited data.Reagent representation, a subproblem
of context-awareness, poses
another significant challenge. Representing a highly reactive reagent,
e.g., a brominating agent, as a fingerprint or through a generic molecular
representation is a poor use of information. From the reaction literature,
we can identify the most popular agents for different additions (e.g.,
Br2, NBS, PBr3, CBr4, LiBr for bromination),
but it is not clear how to apply that knowledge in a scalable and
fully automated way and capture that historical information in a vector
representation. There are far too many reagents and catalysts to enumerate
and encode each one in a one-hot manner, though this approach is feasible
if confined to a small subset of chemistries.[10]
Conclusion
Using both a novel model framework for generating
and ranking candidate
reaction outcomes and a novel edit-based representation, we are able
to reproduce in silico the qualitative results of actual experimental
reactions. The unique framework combining candidate enumeration with
ranking enables augmentation of existing reaction databases with hundreds
of negative reaction examples implicit in every record. In a 5-fold
CV of 15 000 such examples from the USPTO literature where
the recorded product is found in a self-generated set of candidates,
our hybrid model assigns the recorded product rank 1 in 71.8% of cases,
rank 1–3 in 86.7% of cases, and rank 1–5 in 90.8% of
cases; moreover, incorrect predictions are often chemically reasonable
given the lack of detailed contextual information.Through expanded
atom- and bond-level featurization of the reaction
core, especially descriptors with more direct relevance to chemical
reactivity, this model framework can be expanded upon to achieve even
higher predictive performance. And through the use of reaction examples
with more complete information from commercial sources, a context-dependent
model could be trained to understand reactivity in a more nuanced
manner. There is a tremendous role for machine learning to play in
computer assisted synthesis design, not only as a key component of
automated retrosynthesis planning, but as a standalone tool for chemists
to assess reaction viability. This work represents a significant step
toward the long-term goal of virtual reaction screening and validation
as a complement to experimental organic synthesis.
Methods
All scripts were written in Python; RDKit[31] was used for molecule/reaction parsing, applying templates, and
various cheminformatics calculations; Keras[32] using the Theano[33] backend was used for
building the machine learning architecture.
Authors: Sara Szymkuć; Ewa P Gajewska; Tomasz Klucznik; Karol Molga; Piotr Dittwald; Michał Startek; Michał Bajczyk; Bartosz A Grzybowski Journal: Angew Chem Int Ed Engl Date: 2016-04-08 Impact factor: 15.336
Authors: Steven Kearnes; Kevin McCloskey; Marc Berndl; Vijay Pande; Patrick Riley Journal: J Comput Aided Mol Des Date: 2016-08-24 Impact factor: 3.686
Authors: Vinit K Mittal; Sidney C Bailin; Michael A Gonzalez; David E Meyer; William M Barrett; Raymond L Smith Journal: ACS Sustain Chem Eng Date: 2017-12-06 Impact factor: 8.198
Authors: William M Barrett; Sudhakar Takkellapati; Kidus Tadele; Todd M Martin; Michael A Gonzalez Journal: ACS Sustain Chem Eng Date: 2019-04-15 Impact factor: 8.198
Authors: Eric Walker; Joshua Kammeraad; Jonathan Goetz; Michael T Robo; Ambuj Tewari; Paul M Zimmerman Journal: J Chem Inf Model Date: 2019-08-19 Impact factor: 4.956
Authors: Lukáš Grajciar; Christopher J Heard; Anton A Bondarenko; Mikhail V Polynski; Jittima Meeprasert; Evgeny A Pidko; Petr Nachtigall Journal: Chem Soc Rev Date: 2018-11-12 Impact factor: 54.564