Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction.

Philippe Schwaller¹,², Teodoro Laino¹, Théophile Gaudin¹, Peter Bolgar³, Christopher A Hunter³, Costas Bekas¹, Alpha A Lee².

Abstract

Organic synthesis is one of the key stumbling blocks in medicinal chemistry. A necessary yet unsolved step in planning synthesis is solving the forward problem: Given reactants and reagents, predict the products. Similar to other work, we treat reaction prediction as a machine translation problem between simplified molecular-input line-entry system (SMILES) strings (a text-based representation) of reactants, reagents, and the products. We show that a multihead attention Molecular Transformer model outperforms all algorithms in the literature, achieving a top-1 accuracy above 90% on a common benchmark data set. Molecular Transformer makes predictions by inferring the correlations between the presence and absence of chemical motifs in the reactant, reagent, and product present in the data set. Our model requires no handcrafted rules and accurately predicts subtle chemical transformations. Crucially, our model can accurately estimate its own uncertainty, with an uncertainty score that is 89% accurate in terms of classifying whether a prediction is correct. Furthermore, we show that the model is able to handle inputs without a reactant-reagent split and including stereochemistry, which makes our method universally applicable.

Year: 2019 PMID: 31572784 PMCID: PMC6764164 DOI: 10.1021/acscentsci.9b00576

Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553

Introduction

Organic synthesis, the making of complex molecules from simpler building blocks, remains one of the key stumbling blocks in drug discovery.[1] Although the number of reported molecules has reached 135 million, this still represents only a small proportion of the estimated 10^60 feasible drug-like compounds.[2,3] The lack of a synthetic route hinders access to potentially fruitful regions of chemical space. Tackling the challenge of organic synthesis with data-driven approaches is particularly timely as generative models in machine learning for molecules are coming of age.[4−10] These generative models enrich the toolbox of medicinal chemistry by suggesting potentially promising molecules that lie outside of known scaffolds. There are three salient challenges in predicting chemical reactivity and designing organic synthesis. First, simple combinatorics would suggest that the space of possible reactions is even greater than the already intractable space of possible molecules. As such, strategies that involve handcrafted rules quickly become intractable. Second, reactants seldom contain only one reactive functional group. Designing a synthesis requires one to predict which functional group will react with a particular reactant and where a reactant will react within a functional group. Predicting those subtle reactivity differences is challenging because they often depend on which other functional groups are nearby. In addition, for chiral organic molecules, predicting the relative and absolute configuration of chiral centers adds another layer of complexity. Third, organic synthesis is almost always a multistep process where one failed step could invalidate the entire synthesis. For example, the pioneering total synthesis of the antibiotic tetracycline takes 18 steps;[11] even a hypothetical method that would be correct 80% of the time would have only a roughly 2% chance of getting 18 predictions correct in a row (0.8^18 ≈ 0.018, assuming independence).
Therefore, tackling the synthesis challenge requires methods that are both accurate and have good uncertainty estimates. This would crucially allow us to estimate the “risk” of the proposed synthesis path and put the riskier steps at the beginning of the synthesis so that one can fail fast and fail cheap. The long history of computational chemical reaction prediction has been extensively reviewed in refs (12) and (13). Methods in the literature may be divided into two different groups, namely, template-based and template-free. Template-based methods[14−16] use a library of reaction templates or rules. These templates describe the atoms and their bonds in the neighborhood of the reaction center before and after the chemical reaction has occurred. Template-based methods then consider all possible reaction centers in a molecule and enumerate the possible transformations based on the templates together with how likely each transformation is to take place. As such, the key steps in all template-based methods are the construction of templates and the evaluation of how likely the template is to apply. The focus of the literature has thus far been on the latter question of predicting whether a template applies.[15,16] However, the problem with the template-based paradigm is that templates themselves are often of questionable validity. Previous methods generated templates by hand using chemical intuition.[17−19] Handcrafting is obviously not scalable because the number of reported organic reactions constantly increases, and a significant time investment is needed to keep up with the literature. Recent machine-learning approaches employ template libraries that are automatically extracted from data sets of reactions.[15,16] Unfortunately, automatic template extraction algorithms still suffer from having to rely on meta-heuristics to define different “classes” of reactions.
More problematically, all automatic template extraction algorithms rely on pre-existing atom mapping, a scheme that maps atoms in the reactants to atoms in the product. However, correctly mapping the product back to the reactant atoms is still an unsolved problem,[20] and, more disconcertingly, commonly used tools to find the atom mapping (e.g., NameRXN[21,22]) are themselves based on libraries of expert rules and templates. This creates a vicious circle. Atom mapping is based on templates, and templates are based on atom mapping; ultimately, seemingly automatic techniques are actually premised on handcrafted and often artisanal chemical rules. To overcome the limitations of template-based approaches, several template-free methods have emerged in recent years. Those methods can, in turn, be categorized into graph-based and sequence-based. Jin et al. characterize chemical reactions by graph edits that lead from the reactants to the products.[23] Their reaction prediction is a two-step process. The first network takes a graph representation of the reactants as input and predicts reactivity scores. On the basis of those reactivity scores, product candidates are generated and then ranked by a second network. An improved version, where candidates with up to five bond changes are taken into account and multidimensional reactivity matrices are generated, was recently presented.[24] Whereas a previous version of the model included both reactants and reagents in the reaction center determination step, the accuracy was significantly improved by excluding the reagents from the reactivity score prediction in the more recent versions. This requires the user to know the identities of the reagents, which implicitly means that the user must already know the product because the reagent is defined as a chemical species that does not appear in the product!
Similarly, Bradshaw et al.[25] separated reactants and reagents and included the reagents only in a context vector for their gated graph neural network. They represented the reaction prediction problem as a stepwise rearrangement of electrons in the reactant molecules. A side effect of phrasing reaction prediction as predicting electron flow is that a preprocessing step must be applied to eliminate reactions where the electron flow cannot easily be identified. Bradshaw et al. considered only the subset of the USPTO_MIT data set with a linear electron flow (LEF) topology, which contains only 73% of the reactions, thus by definition excluding pericyclic reactions and other important workhorse organic reactions. A more general version of the algorithm was recently presented in ref (26). Perhaps most intriguingly, all graph-based template-free methods in the literature require atom-mapped data sets to generate the ground truth for training, and atom mapping algorithms make use of reaction templates. Sequence-based techniques have emerged as an alternative to graph-based methods. The key idea is to use a text representation of the reactants, reagents, and products (usually simplified molecular-input line-entry system (SMILES)) and treat reaction prediction as machine translation from one language (reactants–reagents) to another language (products). The idea of applying sequence-based models to the reaction prediction problem was first explored by Nam and Kim.[27] Schwaller et al.[28] have shown that, using analogies between organic chemistry and human language, sequence-to-sequence (seq-2-seq) models could compete against graph-based methods.
Both previous seq-2-seq works were based on recurrent neural networks (RNNs) for the encoder and the decoder, with one single-head attention layer in between.[29,30] Moreover, both previous seq-2-seq forward prediction works separated reactants and reagents in the inputs using the atom mapping, and ref (28) tokenized the reagent molecules as individual tokens. To increase the interpretability of the model, Schwaller et al.[28] used attention weight matrices and confidence scores that were generated together with the most likely product. In this work, we focus on the question of predicting products given reactants and reagents. We show that a fully attention-based model adapted from ref (31) with the SMILES[32,33] representation, the Molecular Transformer, outperforms all previous methods while being completely atom-mapping independent and not requiring splitting the input into reactants and reagents. Our model reaches 90.4% top-1 accuracy (93.7% top-2 accuracy) on a common benchmark data set. Importantly, our model does not make use of any handcrafted rules. It can accurately predict subtle and selective chemical transformations, getting the correct chemoselectivity, regioselectivity, and, to some extent, stereoselectivity. In addition, our model can estimate its own uncertainty. The uncertainty score predicted by the model has an ROC–AUC of 0.89 in terms of classifying whether a reaction is correctly predicted. Our model has been available since August 2018 in the backend of IBM RXN for Chemistry,[34] a free web-based graphical user interface, and has been used by several thousand organic chemists worldwide to perform more than 40 000 predictions so far.

Data

Most of the publicly available reaction data sets were derived from the patent mining work of Lowe,[35] where the chemical reactions were described using a text-based representation called SMILES.[32,33] To compare to previous work, we focus on four data sets. The USPTO_MIT data set was filtered and split by Jin et al.[23] This data set was also used in ref (28) and adapted to a smaller subset called USPTO_LEF by Bradshaw et al.[25] to make it compatible with their algorithm. In contrast with the MIT and LEF data sets, USPTO_STEREO[28] underwent less filtering, and the stereochemical information was kept. To date, only seq-2-seq models have been used to predict on USPTO_STEREO. Stereochemistry adds an additional level of complexity because it requires the models to predict not only molecular graph edge changes but potentially also changes in node labels. Additionally, we used a nonpublic time-split test set, extracted from the Pistachio database,[36] to compare the performance on a set containing more diverse reactions against a previous seq-2-seq model.[28] Table 1 shows an overview of the data sets used in this work and points out the two different preprocessing methods. The separated reagent preprocessing means that the reactants (educts), which contribute atoms to the product, are weakly separated by a > token from the reagents (e.g., solvents and catalysts). Reagents take part in the reaction but do not contribute any atom to the product. So far, in most of the work, the reagents have been separated from the reactants. Jin et al.[23] increased their top-1 accuracy by almost 6% when they removed the reagents from the first step, where the reaction centers were predicted. In Schwaller et al.,[28] the reagents were represented not as individual atoms but as separate reagent tokens and included only the 76 most common reagents.[38] Bradshaw et al. passed the reagent information as a context vector to their model.
In ref (26), it was shown that the model performs better when the reagents are tagged as such. Unfortunately, the separation of reactants and reagents is not always obvious. Different tools classify different input molecules as the reactants, and hence the reagents will also differ.[38] For this reason, we decided to train and test on inputs where the reactants and the reagents were mixed and no distinction was made between the two. We called this method of preprocessing mixed. The mixed preprocessing makes the reaction prediction task significantly harder because the model has to determine the reaction center from a larger number of molecules.
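
As a rough illustration of the two preprocessing variants, the sketch below builds the model input from a reaction SMILES of the form reactants>reagents>products. The function names, the example reaction, and the choice to simply join all precursor molecules with "." in the mixed case are assumptions made for illustration; this is not the authors' pipeline, which also canonicalizes the molecules with RDKit.

```python
def split_reaction(rxn_smiles):
    """Split 'reactants>reagents>products' into its three parts."""
    reactants, reagents, products = rxn_smiles.split(">")
    return reactants, reagents, products


def separated_source(rxn_smiles):
    """Source string that keeps the '>' token between reactants and reagents."""
    reactants, reagents, _ = split_reaction(rxn_smiles)
    return f"{reactants}>{reagents}"


def mixed_source(rxn_smiles):
    """Source string in which reactants and reagents are not distinguished."""
    reactants, reagents, _ = split_reaction(rxn_smiles)
    molecules = [m for part in (reactants, reagents) for m in part.split(".") if m]
    return ".".join(molecules)


# Hypothetical esterification: acetic acid + ethanol, sulfuric acid as reagent.
rxn = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
print(separated_source(rxn))  # CC(=O)O.OCC>OS(=O)(=O)O
print(mixed_source(rxn))      # CC(=O)O.OCC.OS(=O)(=O)O
```

In the mixed form the model receives no hint about which molecules contribute atoms to the product, which is what makes the task harder.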

Table 1

Data-Set Splits and Preprocessing Methods Used for the Experiments

reactions in	train	valid	test	total
USPTO_MIT set[23]	409 035	30 000	40 000	479 035
-No stereochemical information
USPTO_LEF[25]	*	*	29 360	349 898
-Nonpublic subset of USPTO_MIT, without e.g. multistep reactions
USPTO_STEREO[28]	902 581	50 131	50 258	1 002 970
-Patent reactions until Sept. 2016, includes stereochemistry
Pistachio_2017[28]			15 418	15 418
-Nonpublic time split test set, reactions from 2017 taken from Pistachio database[36,37]

All of the reactions used in this work were canonicalized using RDKit.[39] The inputs for our model were tokenized with the regular expression found in ref (28). In contrast with Schwaller et al.,[28] the reagents were not replaced by reagent tokens but tokenized in the same way as the reactants.
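
A regex-based SMILES tokenizer of this kind might look like the sketch below. The pattern is an assumption modeled on commonly used SMILES tokenization expressions and is not guaranteed to match the exact expression of ref (28); the function name and the example string are likewise illustrative.

```python
import re

# Assumed token pattern: bracket atoms, two-letter halogens, organic-subset
# atoms, bonds, branches, ring-closure digits, and the '>' separator.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens and check that nothing was dropped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


print(" ".join(tokenize("CC(=O)Cl.[Na+].c1ccccc1Br")))
# C C ( = O ) Cl . [Na+] . c 1 c c c c c 1 Br
```

Because reagents go through the same tokenizer as reactants, the vocabulary stays at the level of atoms and bonds rather than growing with every new reagent.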

Molecular Transformer

The model used in this work is based on the transformer architecture.[31] The model was originally constructed for neural machine translation (NMT) tasks. The main architectural difference compared with seq-2-seq models previously used for the reaction prediction[27,28] is that the RNN component was completely removed, and it is fully based on the attention mechanism. The transformer is a stepwise autoregressive encoder–decoder model composed of a combination of multihead attention layers and positional feed forward layers. In the encoder, the multihead attention layers attend to the input sequence and encode it into a hidden representation. The decoder consists of two types of multihead attention layers. The first is masked and attends only to the preceding outputs of the decoder. The second multihead attention layer attends to the encoder outputs as well as the output of the first decoder attention layer. It basically combines the information of the source sequence with the target sequence that has been produced so far.[31] A multihead attention layer itself consists of several scaled-dot attention layers running in parallel, which are then concatenated. The scaled-dot attention layers take three inputs, the keys, K, the values, V, and the queries, Q, and compute the attention as follows:
$$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
The dot product of the queries and the keys computes how closely aligned the keys are with the queries. If the query and the key are aligned, then their dot product will be large and vice versa. Each key has an associated value vector, which is multiplied by the output of the softmax, through which the dot products were normalized and the largest components were emphasized. d is a scaling factor depending on the layer size.
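
The attention computation described above can be sketched in plain Python. This is a minimal single-head illustration that assumes the standard scaled dot-product form of ref (31), softmax(QK^T/√d)V, with d taken as the key dimension. The function names and the toy matrices are made up for the example, and real implementations use batched tensor operations instead of nested lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_attention(Q, K, V):
    """Return one output vector per query: softmax(q.K^T / sqrt(d)) V."""
    d = len(K[0])  # key dimension, used as the scaling factor
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


# Toy example: the query is aligned with the first key.
Q = [[10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(scaled_dot_attention(Q, K, V))
```

Since the query points along the first key, almost all of the softmax weight falls on the first value vector. A multihead layer would run several such computations on different learned projections and concatenate the results.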
The encoder computes interesting features from the input sequence, which are then queried by the decoder depending on its preceding outputs.[31] One main advantage of the transformer architecture compared with the seq-2-seq models used in refs (27 and 28) is the multihead attention, which allows the encoder and decoder to peek at different tokens simultaneously. Because the recurrent component is missing in the transformer architecture, the sequential nature of the data is encoded with positional encodings.[40] Positional encodings add position-dependent trigonometric signals to the token embeddings of size $d_{emb}$ and allow the network to know where the different tokens are situated in the sequence:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{emb}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{emb}}}\right)$$
The top-k outputs are decoded via a beam search. We set the beam size to 5 for all of the experiments. We based this work on the PyTorch implementation provided by OpenNMT.[41] All of the components of the transformer model are explained and illustrated graphically in ref (42). Whereas the base transformer model had 65 M parameters,[31] we decreased the number of trainable weights to 12 M by going from six layers of size 512 to four layers of size 256. We experimented with label smoothing[43] and the number of attention heads. In contrast with the NMT model,[31] we set the label smoothing parameter to 0.0. As seen below, a nonzero label smoothing parameter encourages the model to be less confident and therefore negatively affects its ability to discriminate between correct and incorrect predictions. Moreover, we observed that at least four attention heads were required to achieve peak accuracies. We, however, kept the original eight attention heads because this configuration achieved superior validation performance.
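
The position-dependent trigonometric signals can be sketched as follows, assuming the sinusoidal form used by the original transformer of ref (31): sine on even dimensions and cosine on odd dimensions, with wavelengths growing geometrically up to a base of 10000. The function names and toy inputs are illustrative only.

```python
import math


def positional_encoding(position, d_emb):
    """Sinusoidal encoding vector for one sequence position (d_emb assumed even)."""
    encoding = []
    for i in range(d_emb // 2):
        angle = position / (10000 ** (2 * i / d_emb))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding


def add_positional_encodings(token_embeddings):
    """Add the encoding of each position to the corresponding token embedding."""
    d_emb = len(token_embeddings[0])
    return [
        [x + pe for x, pe in zip(embedding, positional_encoding(pos, d_emb))]
        for pos, embedding in enumerate(token_embeddings)
    ]


# Three identical toy embeddings become distinguishable once positions are added.
embeddings = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]
for row in add_positional_encodings(embeddings):
    print([round(x, 3) for x in row])
```

Without this step, an attention-only model would treat the input as an unordered bag of tokens, which would lose the connectivity that the SMILES ordering encodes.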
For the training, we used the ADAM optimizer[44] and varied the learning rate as described in ref (31) using 8000 warm-up steps. The batch size was set to ∼4096 tokens, and the gradients were accumulated over four batches and normalized by the number of tokens. The model and results can be found online.[45]
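
The warm-up schedule of ref (31) increases the learning rate linearly for the first warm-up steps and then decays it with the inverse square root of the step number. The sketch below assumes that form together with the layer size of 256 and the 8000 warm-up steps stated above. The `scale` multiplier is a hypothetical knob included for illustration, not a value taken from the paper.

```python
def transformer_learning_rate(step, d_model=256, warmup_steps=8000, scale=1.0):
    """Linear warm-up followed by inverse-square-root decay (assumed form)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1000, 4000, 8000, 16000, 64000):
    print(step, f"{transformer_learning_rate(step):.6f}")
```

The rate peaks exactly at the end of the warm-up phase, where the two arguments of `min` coincide.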

Results and Discussion

Table 2 shows the performance of the model as a function of different training variations. SMILES data augmentation[46] leads to a significant increase in accuracy. We double the training data by generating a copy of every reaction in the training set in which the molecules are replaced by an equivalent random SMILES (augm.), and we do so across the range of data sets and preprocessing methods. Results are also improved by averaging the weights over multiple checkpoints, as suggested in ref (31), as well as increasing the training time. Our best single models are obtained by training for 48 h on one GPU (Nvidia P100), saving one checkpoint every 10 000 time steps, and averaging the last 20 checkpoints. Ensembling different models is known to increase the performance of NMT models;[47] however, the performance increase (ens. of 5/10/20) is marginal compared with parameter averaging. Nonetheless, ensembling two models that contain the weight average of 20 checkpoints of two independently initialized training runs leads to a top-1 accuracy of 91%. Whereas a higher accuracy and better uncertainty estimation can be obtained by model ensembles, they come at an additional cost of training or test time. The top-5 accuracies of our best single models (weight average of the 20 last checkpoints) on the different data sets are shown in Table 3. The top-2 accuracy is significantly higher than the top-1 accuracy, reaching >93% accuracy.
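
Checkpoint averaging amounts to taking the element-wise mean of the saved parameters. The sketch below shows the idea on toy data, with each checkpoint represented as a dictionary of flat parameter lists. This representation and the function name are assumptions for illustration; a real implementation would average framework tensors loaded from checkpoint files.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts of the form {name: [floats]}."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Toy example: three saved 'checkpoints' of a two-parameter model.
saved = [{"w": [1.0, 2.0]}, {"w": [2.0, 4.0]}, {"w": [3.0, 6.0]}]
print(average_checkpoints(saved[-2:]))  # {'w': [2.5, 5.0]}
```

Unlike an ensemble, the averaged model is a single set of weights, so the test time stays the same as for one model.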

Table 2

Ablation Study of Molecular Transformer on the USPTO_MIT Data Set with Separated Reagents (a)

	top-1 (%)	top-2 (%)	top-3 (%)	top-5 (%)	training	testing
Single Models
baseline	88.8	92.6	93.7	94.4	24 h	20 min
baseline augm.	89.6	93.2	94.2	95.0	24 h	20 min
baseline augm.	90.1	93.5	94.4	95.2	48 h	20 min
augm. av. 20	90.4	93.7	94.6	95.3	48 h	20 min
Ensemble Models
ens. of 5	90.5	93.8	94.8	95.5	48 h	1 h 25 min
ens. of 10	90.6	93.9	94.8	95.5	48 h	2 h 40 min
ens. of 20	90.6	93.8	94.9	95.6	48 h	5 h 3 min
ens. of 2 av. 20	91.0	94.3	95.2	95.8	2 × 48 h	32 min

(a) Training and test times were measured on a single Nvidia P100 GPU. The test set contained 40k reactions.
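
For reference, a top-k accuracy of the kind reported in these tables can be computed from ranked beam-search outputs as in the sketch below. It assumes that predictions and ground-truth products are compared as exact strings after identical canonicalization. The function name and the toy data are illustrative.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets found among the first k ranked predictions."""
    hits = sum(
        target in predictions[:k]
        for predictions, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy beams (best candidate first) for three reactions.
beams = [["CCO", "CCN"], ["CC=O", "CCO"], ["CCC", "CCCl"]]
truth = ["CCO", "CCO", "CCBr"]
print(top_k_accuracy(beams, truth, 1))  # 0.3333333333333333
print(top_k_accuracy(beams, truth, 2))  # 0.6666666666666666
```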

Table 3

Single-Model Top-k Accuracy of the Molecular Transformer

USPTO*		top-1 (%)	top-2 (%)	top-3 (%)	top-5 (%)
_MIT	separated	90.4	93.7	94.6	95.3
_MIT	mixed	88.6	92.4	93.5	94.2
_STEREO	separated	78.1	84.0	85.8	87.1
_STEREO	mixed	76.2	82.4	84.3	85.8

Comparison with Previous Work

Because all previous works used single models, we consider only single models trained on the data-augmented versions of the data sets rather than ensembles for the remainder of this paper to have a fair comparison. Table 4 shows that the Molecular Transformer clearly outperforms all methods in the literature across the different data sets. Crucially, although separating reactants and reagents yields the best model (perhaps unsurprisingly because this separation implies knowledge of the product already), the Molecular Transformer still outperforms the literature when reactants and reagents are mixed. Moreover, our model achieves a reasonable accuracy in the _STEREO data set, where stereochemical information is taken into account, whereas all prior graph-based methods in the literature cannot account for stereochemistry. We note that if one were to use a reaction prediction algorithm to plan an N-step synthesis, then the probability of getting the scheme right would be p^N, where p is the probability of a single-step prediction being correct (assuming independence). Therefore, the performance gap between models becomes exponentially amplified when one deploys them to solve synthesis planning problems.
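
The compounding effect can be made concrete with a few lines of Python. The sketch below uses the separated USPTO_MIT top-1 accuracies from Table 4 as stand-ins for the per-step success probability p and assumes independent steps. Treating a benchmark accuracy as a per-step probability is itself a simplification, and the function name is illustrative.

```python
def route_success_probability(p_single, n_steps):
    """Probability that all n_steps independent predictions are correct."""
    return p_single ** n_steps


models = [("WLDN5", 0.856), ("Molecular Transformer", 0.904)]
for n_steps in (1, 5, 10):
    row = ", ".join(
        f"{name}: {route_success_probability(acc, n_steps):.3f}"
        for name, acc in models
    )
    print(f"N = {n_steps:2d} -> {row}")
```

Under these assumptions, a gap of about five percentage points per step grows to roughly 36% versus 21% for a ten-step route.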

Table 4

Comparison of Top-1 Accuracy (in %) Obtained by the Different Single-Model Methods on the Current Benchmark Data Sets

USPTO*		S2S[28]	WLDN[23]	ELECTRO[25]	GTPN[26]	WLDN5[24]	our work
_MIT	separated	80.3	79.6		82.4	85.6	90.4
_MIT	mixed		74				88.6
_LEF	separated		84.0	87.0	87.4	88.3	92.0
_LEF	mixed						90.3
_STEREO	separated	65.4					78.1
_STEREO	mixed						76.2

Coley et al.[24] reported their prediction performance by dividing the reactions of the USPTO_MIT test set into template popularity bins. The template popularity of the test set reactions was computed by counting how many times the corresponding reaction templates were observed in the training set. In Figure 1, we compare the top-1 accuracy of our USPTO_MIT models with the model of Coley et al.[24] Although Coley et al. had separated the reagents in this experiment, we outperform them across all popularity bins, even with our model predicting on a mixed reactants–reagents input, and the accuracy gap becomes larger as the template popularity decreases. These findings suggest that the Molecular Transformer is not simply memorizing the data and can leverage information inferred from more common reactions to make predictions on rarer reactions.

Figure 1

Molecular Transformer outperforms the state-of-the-art model across both common and rare reactions. The figure shows the top-1 accuracy of our augmented mixed and separated USPTO_MIT single model compared with the model from ref (24) on the USPTO_MIT test set, divided into template popularity bins (the number of times a particular reaction type is seen in the data set). The dashed lines show the average across all bins.

A looming question is how the Molecular Transformer performs by reaction type. Table 5 shows that the weakest predictions of the Molecular Transformer are on resolutions (the transformation of absolute configuration of chiral centers, where the reagents are often not recorded in the data) and the ominous label of “unclassified” (where many mistranscribed reactions will end up). Moreover, the Molecular Transformer outperforms the S2S model of ref (28) in virtually every single reaction class. This is because the multihead attention layer in the Molecular Transformer can process long-range interactions between tokens, whereas RNN models impose the inductive bias that tokens far apart in sequence space are less related. This bias is erroneous because the token location in SMILES space bears no relation to the distance between atoms in 3D space.

Table 5

Prediction of the Augm. Mixed STEREO Single Model on the Pistachio_2017 Test Set Compared with Ref (28), Where the Reactants and Reagents Were Separated

	count	S2S acc. (%)[28]	our acc. (%)
Pistachio_2017	15418	60.0	78.0
-classified	11817	70.2	87.6
-heteroatom alkylation and arylation	2702	72.8	86.6
-acylation and related processes	2601	81.5	90.0
-deprotections	1232	69.0	88.6
-C–C bond formation	329	55.6	81.2
-functional group interconversion (FGI)	315	54.0	91.7
-reductions	1996	71.6	86.1
-functional group addition (FGA)	1090	71.8	89.3
-heterocycle formation	310	57.7	90.0
-protections	868	52.9	87.4
-oxidations	339	41.3	85.0
-resolutions	35	34.3	28.6
-unrecognized	3601	26.8	46.3
with stereochemistry	4103	48.2	67.9
without stereochemistry	11315	64.3	81.6
invalid smiles		2.8	0.5
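
As a quick consistency check on Table 5, the overall accuracy should be recoverable as the count-weighted mean of any complete partition of the test set. The sketch below does this for the classified/unrecognized split and the with/without stereochemistry split, using only numbers from the table. The function name is illustrative, and small deviations are expected because the table entries are rounded.

```python
def pooled_accuracy(groups):
    """Count-weighted mean accuracy over (count, accuracy_percent) pairs."""
    total = sum(count for count, _ in groups)
    return sum(count * acc for count, acc in groups) / total


# (count, our acc. %) pairs taken from Table 5.
by_class = [(11817, 87.6), (3601, 46.3)]    # classified, unrecognized
by_stereo = [(4103, 67.9), (11315, 81.6)]   # with, without stereochemistry

print(round(pooled_accuracy(by_class), 1))   # 78.0
print(round(pooled_accuracy(by_stereo), 1))  # 78.0
```

Both partitions cover all 15 418 reactions and reproduce the reported overall accuracy of 78.0%.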

Figure 2 qualitatively illustrates the systematic pitfalls of the S2S RNN model[28] because of its erroneous inductive bias of assuming that only tokens close together in the SMILES string are chemically related. Figure 2A is a nucleophilic substitution. Although the reaction is simple, the RNN model predicts an erroneous product that makes little chemical sense where distal groups are joined together, an artifact of the location of those groups in the SMILES representation. Figure 2B is a simple Buchwald–Hartwig coupling reaction. The RNN again predicts a chemically nonsensical product with chemically unreasonable bonds.

Figure 2

Erroneous inductive bias of the S2S RNN model[28] of assuming that only tokens close together in the SMILES string are chemically related leads to systematic pitfalls for reagents with a long SMILES representation. Molecular Transformer correctly predicts the product for both (A) and (B), whereas the RNN model predicts a product that is not only incorrect but also chemically unreasonable.

Examples of Chemical Challenges That Molecular Transformer Tackles

In the following section, we demonstrate the ability of Molecular Transformer to predict the outcome of a wide range of organic reactions with nontrivial selectivity involved. For some of the reactions discussed below, an organic chemist familiar with that particular class of reaction could predict the outcome after thorough reasoning. However, Molecular Transformer can immediately provide us with the ground-truth answer. None of the reactions discussed in this section and shown in Figure 3 are in the training set.

Figure 3

Examples of challenging chemo-, regio-, and stereoselective transformations that Molecular Transformer successfully predicts. Although the figure separates reactants and reagents for clarity, the predictions were done without making this distinction using the model hosted on IBM RXN.[34]

We first consider challenges in chemoselectivity. As Molecular Transformer predicts, the treatment of the fused polycycle 1 with peracetic acid results in the epoxidation of the alkene and not the Baeyer–Villiger oxidation of the ketone.[48] Molecular Transformer also successfully predicts the stereochemistry around the two newly forming stereocenters in 2. Selective esterification of the dicarboxylic acid 3 is possible by the sequential addition of acetyl chloride and an alcohol.[49,50] Careful thinking about the role of each reagent and the reactivity of the cyclic anhydride intermediate suggests the esterification of the unconjugated carboxylic acid. This is indeed what is observed and what Molecular Transformer predicts. The outcome of this reaction is the consequence of the 1,5-relationship between the two carboxylic acids and the presence of the conjugated double bond. Whereas it takes time and experience for an organic chemist to recognize the concurrent presence of these functional groups and their implications for the reaction outcome, Molecular Transformer can furnish the right product by inferring the reactivity of this complex pattern of distant functional groups. The reduction of 5 using excess DIBAL-H was expected to lead to the unselective reduction of the secondary and the tertiary amides.[51] However, 6 was observed as the major product, in agreement with the prediction of Molecular Transformer. This shows how Molecular Transformer can help design new syntheses, ultimately saving many hours of human labor in the laboratory. We next consider challenges in regioselectivity.
Predicting the regioselectivity of electrophilic aromatic substitutions is straightforward in many cases. However, the concomitant presence of multiple directing groups and steric crowding can sometimes make human predictions ambiguous. Molecular Transformer can deal with complicated examples such as the bromination of 7 with N-bromosuccinimide, affording 8.[52] Molecular Transformer successfully deals with transition-metal-catalyzed reactions as well. It can predict the relative reactivity of the different C–Cl bonds in 2,4,5-trichloropyrimidine 9 in the successive Suzuki coupling reactions with phenylboronic acid.[53] Our last examples illustrate the power of Molecular Transformer in predicting the stereoselectivity of organic reactions. The reduction of the fused bicyclic ketone 13 by lithium aluminum hydride gives the major diastereoisomer 14, successfully predicted by Molecular Transformer.[54] The formation of the (E)-alkene in 16 by the treatment of 15 with tosyl chloride and lithium tert-butoxide is also successfully predicted.[55]

Comparing Molecular Transformer with Quantum-Chemistry-Based Predictors

Having qualitatively discussed a series of challenging examples of chemical selectivity that Molecular Transformer successfully predicts, we next turn to quantitatively explore whether Molecular Transformer has inferred the physical principles that underlie chemical selectivity. The general question of distilling interpretable rationales from machine-learning models is still an active area of research. As such, we attempt to address a more limited question: Can Molecular Transformer, trained on diverse reactions harvested from patents, make accurate predictions on a specific class of challenging reactions where the state-of-the-art predictors are quantum-chemistry calculations motivated by physical organic chemistry insights? To this end, we consider the regioselectivity of electrophilic aromatic substitution reactions in heteroaromatics, a key reaction in medicinal chemistry. Although the reaction mechanism is simple, regioselectivity is controlled by a subtle balance of electronic and steric effects of substituents. We also focus on this reaction because recent pioneering work has systematically curated a large set of examples of halogenation of heteroaromatics from the literature and developed a quantum-chemistry model that quantitatively predicts selectivity,[56] and thus there is a clear benchmark. The state-of-the-art model, RegioSQM,[56] employs quantum-chemistry calculations and achieves a top-1 accuracy of 81% in predicting the site of halogenation. Surprisingly, Molecular Transformer achieves a top-1 accuracy of 83% and top-2 accuracy of 91% on the same data set when predicting on the 445 reactions that are not in the training set of the Molecular Transformer and have a single reactive site. Molecular Transformer is also significantly less computationally expensive than quantum-chemistry calculations.
Figure 4 shows examples where quantum-chemistry calculations fail to predict the correct site of bromination, whereas Molecular Transformer makes the correct prediction.

Figure 4

Molecular Transformer achieves a higher accuracy than quantum-chemistry calculations in predicting the regioselectivity of electrophilic aromatic substitution reactions in heteroaromatics. The figure shows examples where RegioSQM,[56] the state of the art, fails, whereas Molecular Transformer makes the correct prediction.

The observation that Molecular Transformer correctly predicts those challenging reactions suggests that it might have distilled specific physical chemistry principles from an assortment of diverse reactions, a necessary condition underlying a successful chemical modeling framework.

Comparison with Human Organic Chemists

Coley et al.[24] conducted a study where 80 random reactions from eight different rarity bins were selected from the USPTO_MIT test set and presented to 11 chemists (graduate students to professors) to predict the most likely outcome. The predictions of the human chemists were then compared against those of the model. We performed the same test with our model trained on the mixed USPTO_MIT data set and achieved a top-1 accuracy of 87.5%, significantly higher than the average of the best human (76.5%) and the best graph-based model (72.5%). Additionally, as seen in Figure 5, Molecular Transformer generalizes well and remains accurate, even for the less common reactions.

Figure 5

Top-1 accuracy of our model (mixed, USPTO_MIT) on 80 chemical reactions across eight reaction popularity bins in comparison with a human study and their graph-based model (WLDN5).[24]

Figure 6 shows the 6 of the 80 reactions for which our model did not output the correct prediction in its top-2 choices. Even though our model does not predict the ground truth, it usually predicts a reasonable most likely outcome: In RXN 14, our model predicts that a primary amine acts as the nucleophile in an amide formation reaction rather than a secondary amine, which is reasonable on the grounds of sterics. In RXN 68, the reaction yielding the reported ground truth proceeds via a nucleophilic substitution of Cl⁻ by OH⁻ by the addition–elimination mechanism, followed by lactim–lactam tautomerism. For the reaction to work, there must have been a source of hydroxide ions, which is not indicated among the reactants. In the absence of hydroxide ions, the best nucleophile in the reaction mixture is the phenolate ion generated from the phenol by deprotonation by sodium hydride. In RXN 72, the correct product is predicted, but the ground truth additionally reports a byproduct (which is mechanistically dubious because HCl will react with excess amine to form the ammonium salt). In RXN 76, a carbon atom is clearly missing in the ground truth. In RXN 61, we predict an SN2 reaction where the anion of the alcohol of the beta hydroxy ester acts as a nucleophile, whereas the mechanism of the ground truth is presumably ester hydrolysis, followed by the nucleophilic attack of the carboxylate group. Proton transfers in protic solvents are extremely fast, and thus deprotonation of the alcohol OH is much faster than ester hydrolysis. Moreover, the carboxylate anion is a poor nucleophile.

Figure 6

Six reactions in the human test set[24] not predicted within top-2 using our model trained on the augmented mixed USPTO_MIT set.

Uncertainty Estimation and Reaction Pathway Scoring

Because organic synthesis is a multistep process, for a reaction predictor to be useful, it must be able to estimate its own uncertainty. The Molecular Transformer model provides a natural way to achieve this: The product of the probabilities of all predicted tokens can be used as a confidence score. Figure 7 plots the receiver operating characteristic (ROC) curve and shows that the AUC–ROC is 0.89 if we use this confidence score as a threshold to predict whether a reaction is mispredicted. To obtain the ROC curves, we used a threshold on the confidence score to decide whether a reaction was mispredicted. We counted the predictions that matched the products reported in the patent with a confidence score above the threshold as true-positives (TPs), the predictions that did not match the reported products and were below the threshold as true-negatives (TNs), the predictions that matched the reported products but were below the threshold as false-negatives (FNs), and finally, the predictions that did not match the reported products but were above the threshold as false-positives (FPs). Then, we plotted the false-positive rate (= FP/(FP + TN)) against the true-positive rate (= TP/(TP + FN)) for thresholds between 0.0 and 1.0. Interestingly, Figure 7 reveals that a subtle change in the training method, label smoothing, has a minimal effect on the accuracy but a surprisingly significant impact on the uncertainty quantification. Label smoothing was introduced by Vaswani et al.[31] for NMT models. Instead of comparing the output of the model at a given time step during training with a one-hot encoded target vector, label smoothing reduces the mass of the correct token in the target vector and distributes the smoothing mass across all other tokens in the vocabulary. Therefore, the model learns to be less confident about its predictions.
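The construction of a label-smoothed target vector can be illustrated with a minimal sketch. It assumes that the smoothing mass is spread uniformly over the incorrect tokens, which is one common convention and is not confirmed by the text; the function name `smoothed_target` and the toy vocabulary size are hypothetical.

```python
def smoothed_target(correct_index, vocab_size, smoothing):
    """Build a label-smoothed target distribution for one decoding step.

    Assumption: the smoothing mass is shared uniformly by all tokens
    other than the correct one. With smoothing = 0.0 this reduces to
    the usual one-hot target vector.
    """
    if vocab_size < 2:
        raise ValueError("need at least two tokens in the vocabulary")
    if not 0.0 <= smoothing < 1.0:
        raise ValueError("smoothing must lie in [0, 1)")
    off_value = smoothing / (vocab_size - 1)
    target = [off_value] * vocab_size
    target[correct_index] = 1.0 - smoothing
    return target


# Toy example: five-token vocabulary, correct token at index 2.
print(smoothed_target(2, 5, 0.01))
print(smoothed_target(2, 5, 0.0))
```

With a nonzero smoothing parameter the target never assigns full probability to the correct token, which is the mechanism by which the trained model becomes less confident.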
Label smoothing helps to generate higher-scoring translations in terms of the accuracy and the BLEU score[57] for human languages and also helps to reach a higher top-1 accuracy in reaction prediction. The top-1 accuracy on the validation set (mixed, USPTO_MIT) with the label smoothing parameter set to 0.01 is 87.44% compared with 87.28% for no smoothing. However, Figure 7 shows that this small increase in accuracy comes at the cost of no longer being able to discriminate between a good and a bad prediction. Therefore, no label smoothing was used during the training of our models. The AUC–ROC of our single mixed USPTO_MIT model measured on the test set was also 0.89. The uncertainty estimation metric allows us to estimate the likelihood of a given reactant–product combination, rather than only predicting products given reactants, and this could be used as a score to rank reaction pathways.[58,59]
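The confidence score and the ROC construction described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the per-token log-probabilities, the toy scores, and the function names are invented, "above the threshold" is taken to include equality, and the area is computed with the trapezoidal rule.

```python
import math


def confidence_score(token_log_probs):
    """Product of the predicted token probabilities, computed in log space."""
    return math.exp(sum(token_log_probs))


def roc_points(scores, is_correct, thresholds):
    """Return (false-positive rate, true-positive rate) for each threshold.

    A prediction counts as positive when its score is at or above the
    threshold (assumption: ties are treated as 'above').
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, c in zip(scores, is_correct) if c and s >= t)
        fn = sum(1 for s, c in zip(scores, is_correct) if c and s < t)
        fp = sum(1 for s, c in zip(scores, is_correct) if not c and s >= t)
        tn = sum(1 for s, c in zip(scores, is_correct) if not c and s < t)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((fpr, tpr))
    return points


def auc(points):
    """Trapezoidal area under the curve defined by the given ROC points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented toy data: four predictions, two correct and two incorrect.
scores = [0.9, 0.8, 0.3, 0.2]
is_correct = [True, True, False, False]
thresholds = [0.0, 0.25, 0.5, 0.85, 1.0]
print(auc(roc_points(scores, is_correct, thresholds)))
```

Working in log space avoids numerical underflow for long product SMILES, where the product of many probabilities below 1 can become very small.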

Figure 7

Receiver operating characteristic curve for different label smoothing values for a model trained on the mixed USPTO_MIT data set when evaluated on the validation set.

Within our uncertainty estimation framework, which is based on the product of probabilities of all predicted tokens, a potential unwanted bias is a bias against long-product SMILES; a large molecule should not necessarily imply “difficult” predictions. Figure 8 provides reassuring empirical evidence that this bias is absent. There is no correlation between the confidence score and the length of the SMILES string.

Figure 8

Length of the predicted sequences plotted against the confidence score of the sequence for a model trained on the mixed USPTO_MIT data set with a label smoothing parameter of 0.0. The Pearson product moment correlation coefficient between the length and the confidence score is 0.06.
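A check of this kind can be reproduced with a few lines of code. The sketch below computes the Pearson correlation coefficient between sequence length and confidence score from first principles; the function name `pearson` and the sample values are invented for illustration and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented toy data: predicted SMILES lengths and their confidence scores.
lengths = [12, 35, 48, 60, 27, 81]
confidences = [0.97, 0.42, 0.99, 0.88, 0.65, 0.93]
print(pearson(lengths, confidences))
```

A coefficient close to zero, as reported in the caption, indicates that longer predicted sequences are not systematically assigned lower confidence.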

Chemically Constrained Beam Search

Because no chemical knowledge was integrated into the model, technically, the model could perform “alchemy”, for example, turning a fluoride atom in the reactants into a bromide atom in the products, which was not in the reactants at all. As such, an interesting question is whether the model has learned to avoid alchemy. To this end, we implemented a constrained beam search, where the probabilities of atomic tokens not observed in the reactants were set to 0.0 and hence not predicted. However, there was no change in accuracy, showing that the model had successfully inferred this constraint from the examples shown during training.
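The masking step of such a constrained search can be sketched as follows. This is a simplified illustration, not the authors' implementation: the atom regular expression covers only common organic-subset and bracket atoms, tokens are compared by element symbol (so an aromatic `c` in the reactants permits an aliphatic `C` in the product), and the names `ATOM_PATTERN`, `element_of`, and `mask_unseen_atoms` are hypothetical.

```python
import re

# Simplified atom tokens: bracket atoms, Br, Cl, and single-letter atoms.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]")


def element_of(atom_token):
    """Extract the element symbol from an atom token, e.g. '[nH]' -> 'N'."""
    match = re.search(r"[A-Za-z][a-z]?", atom_token)
    return match.group(0).capitalize()


def elements_in(smiles):
    """Set of element symbols that occur in a SMILES string."""
    return {element_of(tok) for tok in ATOM_PATTERN.findall(smiles)}


def mask_unseen_atoms(token_probs, reactant_smiles):
    """Set to 0.0 the probability of atom tokens whose element is absent
    from the reactants; non-atom tokens (bonds, brackets, ring digits)
    are left untouched."""
    allowed = elements_in(reactant_smiles)
    masked = {}
    for token, prob in token_probs.items():
        is_atom = ATOM_PATTERN.fullmatch(token) is not None
        if is_atom and element_of(token) not in allowed:
            masked[token] = 0.0
        else:
            masked[token] = prob
    return masked


# Invented next-token distribution for a reactant mixture without bromine.
probs = {"Br": 0.6, "F": 0.3, "C": 0.05, "(": 0.05}
print(mask_unseen_atoms(probs, "Fc1ccccc1.CO"))
```

In a beam search, this mask would be applied to the output distribution at every decoding step before the candidate tokens are ranked, so that hypotheses containing unseen elements are never extended.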

Conclusions

We show that a multihead attention Transformer network, the Molecular Transformer, outperforms all known algorithms in the reaction prediction literature, achieving 90.4% top-1 accuracy (93.7% top-2 accuracy) on a common benchmark data set. The model requires no handcrafted rules and accurately predicts subtle chemical transformations. Moreover, the Molecular Transformer can also accurately estimate its own uncertainty, with an uncertainty score that is 89% accurate in terms of classifying whether a prediction is correct. The uncertainty score can be used to rank reaction pathways. We point out that previous work has considered an unrealistically generous setting of separated reactants and reagents. We demonstrate an accuracy of 88.6% when no distinction is drawn between reactants and reagents in the inputs, a score that outperforms previous work as well. For the noisier USPTO_STEREO data set, our top-1 accuracies are 78.1% (separated) and 76.2% (mixed), respectively. The Molecular Transformer has been freely available since August 2018 through a graphical user interface on the IBM RXN for Chemistry platform[34] and has so far been used by several thousand organic chemists worldwide to perform more than 40 000 chemical reaction predictions.
