
Using Machine Learning To Predict Suitable Conditions for Organic Reactions.

Hanyu Gao¹, Thomas J Struble¹, Connor W Coley¹, Yuran Wang¹, William H Green¹, Klavs F Jensen¹.

Abstract

Reaction condition recommendation is an essential element for the realization of computer-assisted synthetic planning. Accurate suggestions of reaction conditions are required for experimental validation and can have a significant effect on the success or failure of an attempted transformation. However, de novo condition recommendation remains a challenging and under-explored problem and relies heavily on chemists' knowledge and experience. In this work, we develop a neural-network model to predict the chemical context (catalyst(s), solvent(s), reagent(s)), as well as the temperature most suitable for any particular organic reaction. Trained on ∼10 million examples from Reaxys, the model is able to propose conditions where a close match to the recorded catalyst, solvent, and reagent is found within the top-10 predictions 69.6% of the time, with top-10 accuracies for individual species reaching 80-90%. Temperature is accurately predicted within ±20 °C from the recorded temperature in 60-70% of test cases, with higher accuracy for cases with correct chemical context predictions. The utility of the model is illustrated through several examples spanning a range of common reaction classes. We also demonstrate that the model implicitly learns a continuous numerical embedding of solvent and reagent species that captures their functional similarity.


Year: 2018 PMID: 30555898 PMCID: PMC6276053 DOI: 10.1021/acscentsci.8b00357

Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553

Introduction

Retrosynthetic planning is the process of proposing pathways to synthesize target molecules from available starting chemicals, and has demonstrated its importance and success in the chemical industry.[1,2] While retrosynthesis traditionally requires extensive training and expertise of a chemist, recent years have seen renewed interest in computer-assisted synthetic planning (CASP).[3−7] With the application of powerful machine learning techniques to large data sets of organic reactions like Reaxys[8] and the USPTO database,[9] there have been major advances both in searching for possible retrosynthetic pathways[10−16] and in evaluating the feasibility of the proposed reactions.[5,17−22] While existing tools have been demonstrated to predict the likelihood of success of reactions with good accuracy,[20] one obstacle to experimentally validating computer-proposed reactions is the specification of reaction conditions, including chemical context (catalyst, reagent, solvent) and other operating conditions (e.g., temperature, pressure). In some cases, small changes in reaction conditions can lead to drastically different reaction outcomes. Therefore, recent work on reaction outcome prediction has started to include reaction conditions to improve the accuracy and specificity of predictions.[20,21] More importantly, reaction conditions are necessary to evaluate opportunities for one-pot synthesis, telescoping in flow, and amenability to the use of “green” solvents for sustainability. Extensive work has been done on the optimization of conditions for specific reaction classes, using a combination of domain knowledge and empirical optimization techniques to automatically identify the best reaction condition.[23−27] Nevertheless, the initial guess of reaction conditions for a new reaction is predominantly considered a human task. Chemists use heuristics and perceived similarity of new reactions to ones they are familiar with to propose candidate conditions. 
However, this approach has its limitations and challenges. The recommendation might be biased by chemists’ preference and familiarity with certain types of reactions; the heuristic rules might not be all-encompassing or too abstract to narrow down to specific chemicals, and conditions of a precedent reaction may not be applicable to the new reactions even if the reactants are structurally similar. In addition to the potential utility of in silico condition recommendation tools to practicing synthetic chemists, they are a necessary component of computer-aided synthesis planning. With thousands of plausible reactions generated in a few minutes or even seconds, it would be impossible to rely on manual input for suggesting reaction conditions. Research has shown that failure to specify appropriate reaction conditions might result in false prediction of reaction outcomes.[21] However, computational condition recommendation remains a rarely addressed and complex challenge. Most existing work focuses only on specific elements of the chemical context (e.g., only reagents or only solvents), or specific reaction classes. Solvent selection, for instance, has been widely studied as a standalone problem.[28] Struebing et al. combined quantum mechanical (QM) calculations with a computer-aided molecular design procedure to identify solvents that accelerate reaction kinetics.[29] This approach was demonstrated to be effective for specific examples, yet it is difficult to apply at a larger scale due to the high computational cost of QM calculations. Data-driven approaches have been employed to recommend conditions for specific types of reactions. Marcou et al.[30] built an expert system to predict the type of catalyst and solvent used for Michael additions, trained on 198 known reactions. The problem was formulated as multiple binary classification subproblems of whether a certain type of solvent/catalyst would be suitable for a specific Michael reaction. 
However, on an external test set, only 8 out of 52 reactions had both predicted solvent and catalyst matching the true context. Lin et al. used a similarity-based approach to recommend catalysts for desired deprotection reactions, and demonstrated the approach in catalytic hydrogenation reactions.[31] A study by Segler and Waller tackles a broader scope of reactions using a knowledge graph model of organic chemistry to infer complementary and analogous reactivity.[32] Novel reactions are treated as missing links in this graph. Reaction context is taken as the combination of the first reactions that are linked with reactant molecules. They tested this approach on 11 reactions from the literature, and for most of them the model was able to identify the exact same or similar reagent/catalyst as used in the literature. This work demonstrated the feasibility of reaction context inference based on reaction patterns, yet context compatibility and temperature prediction are not taken into consideration. Similar to the aforementioned approach, one straightforward method for identifying reasonable reaction conditions is to find a similar reaction in the literature and simply employ exactly the same reaction conditions reported for that precedent, referred to as the nearest-neighbor approach. Indeed this is an approach that many chemists may use implicitly. This can be successful with a database of known reactions that is sufficiently large and densely populated, but computationally, a nearest-neighbor search against millions of species is RAM- and CPU-intensive, even with optimized search strategies (e.g., using a ball tree). Furthermore, if some information in the nearest-neighbor reaction is not present (i.e., data is incomplete), that information cannot be inferred. The rigidity of this approach precludes asking questions essential to synthesis planning, such as whether the reaction could proceed in a particular replacement solvent. 
In summary, we identify some primary limitations of existing approaches: There has not been a published method that accurately predicts complete reaction conditions (catalysts, solvents, reagents, and temperature) suitable for use with a very large reaction corpus. The compatibility and interdependence of chemical context and temperature are not taken into account in previous approaches. No previous studies have performed quantitative evaluation of reaction condition predictions on a large-scale reaction data set. There are two major challenges which have impeded progress: (i) There is not a machine readable large data set available with catalysts/solvents/reagents classified into different types. (ii) For the similarity-based approaches it is difficult to quantitatively assess the level of “correctness” of conditions when comparing entire sets of conditions associated with different literature reactions. Closer attention should be paid to balancing the generality/specificity of representing chemical context. If the representation is too general, such as manually encoded types/groups, it might not fully characterize functionality, and if it is too specific, e.g., copy–pasting the entire conditions from other reactions, it does not provide further information about chemical similarity. New tools are needed that propose reaction conditions intelligently and can handle a broad scope of reaction classes. In this work we develop a neural-network-based model to predict suitable reaction conditions for arbitrary organic transformations. The model is trained on roughly 10 million examples from the Reaxys[8] database to predict the chemical species used as catalysts, solvents, reagents, and an appropriate temperature for the reaction. Prediction results are evaluated both quantitatively, using a variety of accuracy metrics, and qualitatively, using multiple sets of representative examples. 
It is also demonstrated that the model learns the similarity of the chemical context (e.g., different solvents/reagents) exclusively from reaction data.

Results and Discussion

A neural network model is trained to predict up to one catalyst, two solvents, two reagents, and the temperature for a given organic reaction. Detailed data processing and model formulation steps are described in the Methods section. The training process is essentially a multiobjective optimization that minimizes the overall loss function, which is a weighted sum of the losses for the individual objectives (namely, catalyst, solvent 1, solvent 2, reagent 1, reagent 2, and temperature). The progress of training is reflected in the change of the overall and individual losses, which is shown in Figure 1. Validation loss (dashed red line) decreases for 2 epochs, then reaches a plateau and stays higher than the training loss (solid red line). Based on the plotted losses in Figure 1, the first solvent (s1, yellow lines) and the first reagent (r1, orange lines) are the most difficult to predict, with a significantly higher loss value than the other objectives (not including temperature). A large fraction of reactions do not have a second solvent (s2, blue lines) or reagent (r2, gray lines), in which case the model only needs to predict the NULL class, making these second predictions easier to classify. The same principle applies to the catalyst, where many reactions do not use a catalyst, and others have catalysts frequently recorded as reagents. The mean squared error for temperature (T loss, green lines) decreases steadily over the epochs, ending at 0.46 for the training set and 0.50 for the validation set (after scaling by a factor of 0.001 K⁻²).
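The weighted-sum objective described above can be illustrated with a minimal sketch. It assumes categorical cross-entropy for the five classification tasks and squared error for temperature; the unit weights for the chemical-context tasks are placeholders (the text does not give them), while the 0.001 factor on the temperature term follows the scaling mentioned above. The function names are hypothetical and not taken from the authors' code.

```python
import math

# Placeholder weights: the context weights are assumptions; 0.001 mirrors the
# temperature scaling factor quoted in the text.
TASK_WEIGHTS = {"c": 1.0, "s1": 1.0, "s2": 1.0, "r1": 1.0, "r2": 1.0, "T": 0.001}


def cross_entropy(probs, true_index):
    """Categorical cross-entropy for one example, given softmax probabilities."""
    return -math.log(max(probs[true_index], 1e-12))


def squared_error(predicted, recorded):
    """Squared error for the temperature regression objective (in K^2)."""
    return (predicted - recorded) ** 2


def overall_loss(context_probs, context_truth, temp_pred, temp_true,
                 weights=TASK_WEIGHTS):
    """Weighted sum of the five classification losses and the temperature loss."""
    total = 0.0
    for task, probs in context_probs.items():
        total += weights[task] * cross_entropy(probs, context_truth[task])
    total += weights["T"] * squared_error(temp_pred, temp_true)
    return total
```

In this sketch a reaction without a second solvent or reagent would simply carry the index of the NULL class as its target, which is why those tasks contribute little loss once the model learns to predict NULL.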

Figure 1

Change of the loss functions with the number of epochs (left figure, overall; right figure, chemical context and temperature).

Statistical Analysis

Evaluating the results of chemical context prediction is a nontrivial task, mainly because it is a combination of individual chemicals, and because we lack a standard way of quantifying the “closeness” of the prediction when each exact chemical is not predicted. Since there is often more than one possible context combination suitable for a reaction, we do not want to focus exclusively on the top-one prediction, but also want to examine other highly ranked suggestions. However, the total number of combinations grows polynomially (to the fifth power) with the number of candidates included for each individual element; e.g., if the top-three candidates are to be examined for the catalyst, both solvents, and both reagents, the total number of combinations is 243 (3⁵), which is almost impossible to evaluate manually, and difficult to analyze. Since the data for catalyst, solvent 2, and reagent 2 are much more sparse than for solvent 1 and reagent 1, there is likely more value in examining a longer candidate list for the latter two. Therefore, we use the top-three reagent 1 predictions and top-three solvent 1 predictions along with the top-two catalyst, top-one solvent 2, and top-one reagent 2 predictions to construct 18 top combinations (3 × 3 × 2 × 1 × 1), from which we can pick the top-three or top-10 combinations with the highest overall scores, calculated as the product of softmax probabilities for each individual element. The number of top candidates is a heuristic choice and can be tailored by model users for specific needs (e.g., the user can explore a longer list of catalyst candidates instead of only the top-two choices). Evaluation is performed on the entire test set, and the accuracy values described in the Methods section are shown in Table 1. For the most difficult tasks of predicting solvent 1 and reagent 1, the frequency with which the recorded chemical is found in the top-10 combinations is 83.0% and 83.1%, respectively. 
After including those close match predictions (defined in the Methods section), the accuracy for solvent 1 increases by a margin of 2.4%, and the accuracy for reagent 1 increases by 1.8%. This suggests that, though not explicitly coded, the model learns the chemical similarity of solvents/reagents/catalysts which tend to be used in closely related reactions.
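The combination scheme described above (three candidates each for solvent 1 and reagent 1, two for the catalyst, one each for solvent 2 and reagent 2, scored by the product of softmax probabilities) can be sketched as follows. The species names and probabilities in `EXAMPLE` are invented for illustration, and the function name is hypothetical. The sketch also enumerates independent per-task candidate lists, whereas the model described in the paper predicts the elements sequentially.

```python
from itertools import product
from math import prod

# Number of candidates kept per task, as stated in the text.
DEPTH = {"c": 2, "s1": 3, "s2": 1, "r1": 3, "r2": 1}

# Illustrative (made-up) softmax outputs, sorted by descending probability.
EXAMPLE = {
    "c": [("NULL", 0.80), ("Pd catalyst", 0.15), ("Cu catalyst", 0.05)],
    "s1": [("THF", 0.50), ("dichloromethane", 0.30), ("toluene", 0.10)],
    "s2": [("NULL", 0.90), ("water", 0.10)],
    "r1": [("triethylamine", 0.40), ("K2CO3", 0.35), ("NaH", 0.15)],
    "r2": [("NULL", 0.70), ("DMAP", 0.30)],
}


def rank_combinations(candidates, depth, top_n=10):
    """Enumerate candidate combinations and rank them by probability product."""
    tasks = list(depth)
    pools = [candidates[task][: depth[task]] for task in tasks]
    scored = []
    for combo in product(*pools):
        species = {task: name for task, (name, _) in zip(tasks, combo)}
        score = prod(p for _, p in combo)
        scored.append((score, species))
    # Sort on the score only, so that the species dictionaries are never compared.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]
```

Calling `rank_combinations(EXAMPLE, DEPTH)` returns the ten highest-scoring of the 18 combinations; changing the entries of `DEPTH` is the kind of user tailoring mentioned in the text.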

Table 1

Accuracy of Prediction of the Chemical Context by the Condition Recommendation Model

Prediction task	Top-3 exact matches	Top-10 exact matches	Top-3 close matches	Top-10 close matches
c	93.6%	94.9%	94.9%	96.4%
s1	75.8%	83.0%	78.2%	85.4%
s2	90.1%	91.7%	90.2%	91.9%
r1	73.2%	83.1%	74.8%	84.9%
r2	89.3%	91.8%	89.3%	92.1%
c, s1, r1	57.3%	66.0%	60.4%	69.6%
c, s1, s2, r1, r2	50.1%	57.3%	53.2%	60.3%

c, s1, s2, r1, and r2 refer to catalyst, solvent 1, solvent 2, reagent 1, and reagent 2, respectively.

Compared to the high accuracies for the individual prediction tasks, the top-three accuracy for the full condition recommendation (catalyst, two solvents, two reagents) is 50.1%, and the top-10 accuracy is 57.3% (53.2% and 60.3% when including close match predictions). However, given that these numbers represent the requirement to predict the full combination (all five elements) of the exact recorded context, it is expected to be more challenging than predicting individual elements. We further computed the top-10 accuracy of a subset of the combinations (catalyst, solvent 1, and reagent 1) as 66.0% and 69.6% for exact matches and close match predictions, respectively. To evaluate the meaningfulness of the accuracies given in Table 1, we compared the trained model with a baseline model, where top-10 combinations are chosen based on the frequencies of the catalysts, solvents, and reagents (Supporting Information, Table S1). Detailed comparisons and statistical parameters are given in the Supporting Information, Table S2. It can be seen that 87.3% of the reactions do not use a catalyst, 85.6% of the reactions do not use a second solvent, and 82.3% of the reactions do not use a second reagent. For these tasks, predicting the NULL class can achieve relatively high accuracy, but the trained model still performs better by a significant margin. Meanwhile, the baseline's top-3 accuracy for predicting the correct combination of chemical context is only 4.7%, compared to 50.1% for the trained model, indicating that simply using the most frequent combination of chemical context is not an effective method. 
The evaluation of temperature is less straightforward than it seems, because the prediction of temperature is dependent on the chemical context, which means that, in the top-10 temperature predictions, at least nine are based on chemical contexts that are at least partially different from the recorded context. Practically speaking, however, temperature is relatively easy to change and test in experiments, so suggesting an approximate initial guess of the temperature would be sufficiently helpful for the setup of experiments. The top-one temperature prediction falls within the ±10 or ±20 °C range of the recorded temperature in 36.7% and 57.7% of test cases, respectively. If we isolate reactions whose predicted chemical context matches the recorded chemical context, these accuracies increase to 42.6% and 65.9%. The mean absolute error (MAE) of temperature prediction for all reactions in the data set is 25.5 °C, and when the correct chemical context is found the MAE is 19.4 °C. Figure 2 visualizes this by plotting the predicted temperature against the recorded temperature for a 1% sample of the testing set. The fact that the quality of temperature prediction is significantly improved with correctly predicted chemical context demonstrates that the prediction of temperature accounts for the compatibility with the chemical context. This performance is also compared with a baseline model that predicts the most frequent temperature (20 °C) for every reaction in the Supporting Information. Figure S4 shows the distribution of temperature for the test set. While a majority of reactions use 20 °C, the distribution spans a wide range, and simply predicting room temperature (20 °C) results in a mean absolute error of 35.3 °C, which is significantly larger than the error of the trained model and would be misleading for reactions that require high or low temperatures.
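The temperature metrics quoted above (fraction of predictions within ±10 or ±20 °C, and mean absolute error) are simple to compute, as is the constant 20 °C baseline. The following is a minimal sketch with hypothetical function names; the numbers in the usage note are toy values, not data from the paper.

```python
def temperature_metrics(predicted, recorded, windows=(10.0, 20.0)):
    """Return (MAE, {window: fraction of predictions within +/- window})."""
    if len(predicted) != len(recorded) or not predicted:
        raise ValueError("predicted and recorded must be non-empty and equal in length")
    errors = [abs(p - r) for p, r in zip(predicted, recorded)]
    mae = sum(errors) / len(errors)
    within = {w: sum(e <= w for e in errors) / len(errors) for w in windows}
    return mae, within


def constant_baseline(recorded, value=20.0):
    """Metrics for a baseline that predicts the same temperature everywhere."""
    return temperature_metrics([value] * len(recorded), recorded)
```

For toy recorded temperatures `[20, 80, 35, 100]` and predictions `[25, 60, 0, 110]`, `temperature_metrics` gives an MAE of 17.5 °C with half of the predictions within ±10 °C, while `constant_baseline` gives an MAE of 38.75 °C. Restricting both lists to reactions whose predicted context matches the recorded one would reproduce the second set of statistics described in the text.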

Figure 2

Relationship between the true temperature and the top-one predicted temperature (left panel), and predicted temperature if the predicted context matches the chemical context (right panel).

Qualitative Evaluation of Reaction Examples

In addition to the statistical analyses, qualitative evaluation of reaction examples helps provide chemical insight into the model predictions. We select reactions from a variety of common types of organic reactions to evaluate the quality of model predictions. We randomly select example reactions that are labeled by reaction type (around five for each type) and compare the true condition, top-one prediction, and the closest prediction within the top-10 candidates. The closest prediction is defined as the prediction that has the largest number of chemical elements exactly matching the true chemical context. The reaction types we choose to test include hydrolysis, esterification, alkylation, epoxidation, Wittig, reduction, oxidation, deprotection, Suzuki–Miyaura coupling, Grubbs metathesis, and Buchwald–Hartwig amination. Due to space limitations, we place only a small part of the examples here in the main text with the top-one prediction (Figure 3) and the rest in the Supporting Information (Table S3) with both the top-one prediction and the closest prediction.

Figure 3

Example of model predictions compared with recorded context (temperature rounded to the closest integer; black text represents the recorded conditions, and blue text represents the predicted conditions). (A) Nucleophilic epoxidation. (B) Deprotection of fluorenylmethyloxycarbonyl (Fmoc). (C) Luche reduction of enone, TBS = tert-butyl(dimethyl)silyl. (D) Buchwald–Hartwig aryl amination, BINAP = 2,2′-bis(diphenylphosphino)-1,1′-binaphthyl. (E) Suzuki–Miyaura coupling, CyJohnPhos = (2-biphenyl)dicyclohexylphosphine. (F) Hoveyda–Grubbs cross metathesis.

Figure 3A is an epoxidation reaction where two C=C bonds are present.[33] The recorded reagents for a nucleophilic epoxidation, selective for the electron-deficient alkene, are correctly predicted by the model. Figure 3B shows an Fmoc deprotection reaction where either acidic or basic conditions can be used. In this case, the model proposes basic conditions for removal of the Fmoc group, which do not affect the other acetate protecting groups in the molecule.[34] Notably, the predicted reagent (piperidine) is highly similar structurally and functionally to the recorded reagent (morpholine), demonstrating the model’s ability to capture chemical similarity. Figure 3C is a Luche reduction that needs a Lewis acid catalyst to selectively reduce the carbonyl, and the model recognizes this specificity by suggesting cerium(III).[35] Figure 3D is a Buchwald–Hartwig aryl amination that uses BINAP as the ligand.[36] The metal atoms, ligands, the base, and the solvents are all correctly predicted by the model. Figure 3E,F shows a Suzuki–Miyaura coupling reaction[37] and a Grubbs metathesis,[38] respectively, for which the model also successfully predicts the exact chemical context. It is worth pointing out that the success of predicting the context of catalytic reactions is partially attributable to a data cleaning step that moves transition metal compounds from the reagents field to the catalyst field. 
This only increases the overall accuracy by a small margin (∼0.5%), but it significantly improves the quality of model predictions for catalytic reactions in the case studies. For example, before data cleaning, Figure 3D had the catalyst predicted as a reagent, and Figure 3E missed the catalyst entirely. A complete list of 62 reactions from 11 reaction types can be found in the Supporting Information, Table S3. Finally, temperature is predicted within a reasonable error from the true temperature (Figure 3 and Supporting Information, Table S3), and examples with large deviations are mostly cases where the predicted chemical context differs from the recorded one. As mentioned in the Introduction, we also performed a nearest-neighbor search for these reaction examples, shown in the Supporting Information. The nearest-neighbor search is analogous to searching for similar reactions in reaction databases such as Reaxys or SciFinder. Results in the Supporting Information, Table S6, show that the nearest-neighbor method works comparably well in giving the first suggestion for many examples, but it is more than 10 000 times slower than the neural network method presented here. Therefore, it is prohibitively expensive to evaluate the nearest-neighbor approach on the entire test set, and only a qualitative comparison is given on the examples described above. This suggests that the trained neural-network model finds reasonable condition recommendations significantly faster than the nearest-neighbor search method, and thus can be better integrated with computer-assisted retrosynthetic programs. The results in Figure 3 and the Supporting Information, Table S3, show selected results pulled out of the test set for some common classes of reactions such as oxidations, reductions, and coupling reactions. 
Although these are common reaction classes, there is much diversity in the specific conditions needed to achieve the desired reactivity, depending on the structure of the starting materials. The results in Figure 3 demonstrate that specific conditions can be chosen by the model for many reaction classes. We also include in the Supporting Information, Table S4, 100 completely random examples from the test set that are not restricted to the 11 reaction classes from which we initially pulled examples. It can be seen that the prediction performance is good for most of these random cases. Nevertheless, it is important to also analyze examples with the most incorrect predictions. We calculate the sum of binary variables indicating whether the correct or similar c, s1, s2, r1, and r2 are predicted and use it as a crude measure of “correctness”. The test results are sorted in ascending order of this sum, and examples are drawn from the first 10 000 entries (this is about 1% of reactions from the test set with the quantitatively worst predictions) in the reaction types described above (five for each type, for some types fewer than five as the reactions with labeled reaction types are sparse in the database). A list of 40 reactions is generated and given in the Supporting Information, Table S5 (some of the examples are shown in Figure 4). Even in these ostensibly “worst” cases, many predictions are not unreasonable. Figure 4A represents a common type of data quality issue in the data set, where water is used to quench the reaction but recorded in the same way as other reagents, despite being incompatible and explosive with alkali metals in their pure form.[39] However, the model recommends reagents that are commonly used for a dissolving metal reduction reaction. 
Figure 4B represents another type of problem observed in the data set, where a multistage reaction is recorded as a single transformation.[40] In the original reference, the Grubbs reaction and reduction are two isolated steps, which is probably a rare case in the database, and the model fails to recognize these two reactions simultaneously. The top suggestion is indeed poor, whereas the 6th to 10th suggestions all recognize the Grubbs reaction and suggest a Grubbs catalyst (see the Supporting Information, Table S5). In Figure 4C, the model suggests using tetrakis palladium(0) where the true context is palladium(II) and triphenylphosphine, which is commonly used to form palladium(0) in situ, so the prediction seems to be a viable context combination for the reaction.[41] Figure 4D is a case of an azide reduction reaction.[42] The model recognizes that a reductant is needed, and suggests hydrogen as a first choice. The fourth suggestion of the model includes two reagents, 1,3-propanedithiol and triethylamine (Supporting Information, Table S5), which is a more plausible alternative to the recorded reagent. More examples can be found in the Supporting Information, Table S5; many of the “poorly matching” results are not close to the published procedures but are not altogether unreasonable suggestions.
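The crude “correctness” measure used above to find the worst predictions (a sum of five binary indicators, sorted in ascending order) can be sketched as below. The definition of a close match is given in the Methods section and is not reproduced here, so `close_pairs` is a hypothetical stand-in: a set of unordered species pairs that the caller treats as similar. The function names are illustrative as well.

```python
TASKS = ("c", "s1", "s2", "r1", "r2")


def correctness(predicted, recorded, close_pairs=frozenset()):
    """Count the tasks whose prediction matches, or is close to, the record."""
    score = 0
    for task in TASKS:
        p, r = predicted[task], recorded[task]
        if p == r or frozenset((p, r)) in close_pairs:
            score += 1
    return score


def worst_cases(results, n, close_pairs=frozenset()):
    """Return the n results with the lowest correctness score (ascending)."""
    return sorted(
        results,
        key=lambda item: correctness(item["predicted"], item["recorded"], close_pairs),
    )[:n]
```

Passing, for example, `{frozenset(("piperidine", "morpholine"))}` as `close_pairs` would count the Fmoc deprotection reagent prediction as a match under this sketch; the real similarity criterion may differ.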

Figure 4

Examples of the reactions with the fewest chemical elements matching the recorded context (temperature rounded to the nearest integer; black text represents the recorded conditions, and red text represents the predicted conditions). (A) Birch alkylation. (B) Hoveyda–Grubbs cross metathesis, TBS = tert-butyl(dimethyl)silyl. (C) Suzuki–Miyaura coupling. (D) Azide reduction.

In addition, we test context prediction on the same Michael addition reactions used by Marcou et al.,[30] and the model shows significant improvement; results are presented in the Supporting Information.

Learned Embedding of Solvents and Reagents

While no explicit relationship between chemical species is included in the model, it implicitly learns the functional similarity of solvents and reagents through training, as it can suggest similar chemicals for the same reaction. Taking solvent as an example, the similarity information can be extracted from the neural network, specifically the weight matrix in the last hidden layer before the final likelihood prediction and softmax activation. If two rows in the weight matrix are similar, the model will tend to predict similar scores for the corresponding two solvents. In other words, each solvent can be represented by its corresponding row from the weight matrix. This representation contains information about how the solvent is used in different reactions and can be used to characterize functional similarity, implicitly averaged over all training reactions. This is analogous to word embedding in natural language modeling, where discrete words are converted to vectors of real numbers which contain similarity information (word2vec),[43] though we arrive at the representation indirectly, so we call the vector a “solvent embedding”. To visualize the embedding of solvents, the top 50 solvents with the highest frequency in the data set are selected and labeled manually into four types (nonpolar, polar nonprotic, protic, and halogenated). The embeddings of the solvents are normalized to account for their frequency of use and projected into a 2-D space using the t-distributed stochastic neighbor embedding (t-SNE)[44] technique, shown in Figure 5. It is worth noting that the solvent embedding vectors do not have a physical interpretation, and t-SNE is a technique that aims to preserve the similarity of the data points in a visualized low-dimensional space. Therefore, the axes in Figure 5 do not have a direct physical meaning. 
Nevertheless, it can be seen that the solvents of the same type are clustered in the same part of the graph, and we can even observe some chemically reasonable trend across clusters (e.g., increasing polarity from the bottom right to upper left). Some “close-neighbor” pairs agree well with chemical knowledge (e.g., benzene and toluene, methanol and ethanol). The model is not supplied with any information about polarity or other electronic properties, but it learns the difference by the ways that solvents are used in various reactions.
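The idea of reading species embeddings off the rows of the final weight matrix can be illustrated with a toy example. The three-dimensional vectors below are invented for illustration and are not taken from the trained model. Unit-length normalization with cosine similarity is an assumed choice of similarity measure; the text states only that the embeddings are normalized before the t-SNE projection.

```python
import math

# Toy stand-ins for rows of the final-layer weight matrix (values are made up).
TOY_EMBEDDINGS = {
    "benzene": [1.0, 0.1, 0.0],
    "toluene": [0.9, 0.2, 0.0],
    "methanol": [0.0, 0.2, 1.0],
    "ethanol": [0.1, 0.3, 0.9],
    "dichloromethane": [0.3, 1.0, 0.2],
}


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def nearest_neighbor(name, embeddings):
    """Return the other species whose embedding is most similar to `name`."""
    others = (other for other in embeddings if other != name)
    return max(others, key=lambda o: cosine_similarity(embeddings[name], embeddings[o]))
```

With these toy vectors, `nearest_neighbor("methanol", TOY_EMBEDDINGS)` returns `"ethanol"` and `nearest_neighbor("benzene", TOY_EMBEDDINGS)` returns `"toluene"`, mirroring the close-neighbor pairs mentioned in the text. A lookup of this kind is one way the embeddings could be used to search for alternative solvents.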

Figure 5

Embedding of the most common 50 solvents projected onto a two-dimensional space using t-SNE. Solvents are naturally clustered into their corresponding classes (manually annotated).

The same plot for the embeddings of the 50 most common reagents is shown in Figure 6. The functionality of reagents is more diverse than that of solvents, and the label list includes inorganic acid, organic acid, Lewis acid, inorganic base, organic base, reductant, oxidant, and activating reagents. While a 2-D projection might be insufficient to fully represent the variance of functionality, it can be observed that reagents of similar function are reasonably clustered in the same area of the plot. Additionally, incompatible reagents are segregated from each other (e.g., acids and bases, reductants and oxidants), demonstrating the ability of the model not only to recommend similar reagents but also to lower the chance of recommending reagents that are likely to react with one another.

Figure 6

Embedding of the most common 50 reagents projected onto a two-dimensional space using t-SNE. Reagents are naturally clustered into their corresponding classes (manually annotated).

The solvent and reagent embeddings extracted from the condition recommendation model demonstrate the possibility of understanding chemical functionality through reaction data, and have many potential applications. Most directly, they can be used as a tool to identify the closest alternatives to currently used solvents/reagents that are, for example, less toxic or cheaper. They can also be used as input features for other machine learning problems that are context dependent, e.g., evaluation of reaction outcomes and solvent/reagent property estimation.

Strengths and Limitations

The neural-network model developed in this work overcomes many of the challenges described in the Introduction, as summarized below: By training on ∼10 million reactions from Reaxys, the model covers a wide range of organic reactions. With a hierarchical neural network, the model predicts all the elements in the reaction condition sequentially with interdependence and relatively high accuracy. The chemicals are not precategorized into classes, so predictions can point to the specific chemicals that might, for example, be used as either an acid or an oxidant depending on the reaction (e.g., sulfuric acid). On the other hand, individual chemical species are modeled as separate entities so that functional similarity can be learned during training and extracted from the model. The model recommends reaction conditions much faster (less than 100 ms for one reaction) than nearest-neighbor search methods, and allows quantitative evaluation of model predictions on a large scale. It can also be used for efficient condition recommendation for a large number of reactions suggested by computer-assisted retrosynthetic analysis. The reaction conditions can be utilized by forward evaluation tools to better predict reactivity, especially for condition-dependent reactions. In addition, the learned embeddings of solvents and reagents can be used to quantify the similarity of conditions of sequential reactions to estimate separation requirements, and to find potential green alternatives to toxic solvents/reagents, both of which are helpful for pathway-level route screening and prioritization.

Meanwhile, the current model also has some limitations, as summarized below: Since the chemical context is predicted in a sequential manner, we must limit the number of predictions at each stage to obtain approximate top-10 combinations in a short time period (similar to a beam search). 
Truncating the data based on minimal frequencies of catalysts, reagents, and solvents lowers the total number of trainable parameters and avoids data sparsity issues during training, but also limits the ability to predict rare contexts that are used by highly specific reactions. There are various other limitations imposed by the imperfection of the training data. For example, even after filtering, some reactions with multiple transformations remain which confuses model prediction, and the labeling of reagents is sometimes misleading (e.g., quenching chemicals included as a reagent); there are some duplicated records or different labels for the same chemical. While these situations are relatively uncommon in the entire data set, a better curated data set can potentially further improve the model performance.
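
The first limitation refers to a beam-search-like truncation. The exact procedure is not given in the text, so the following is only a generic beam search over sequential stages, offered as a sketch. The scorer functions, their probabilities, and the parameter names `beam_width` and `top_k` are hypothetical stand-ins for the trained classification heads.

```python
import math

def beam_search(stage_scorers, beam_width=3, top_k=10):
    """Approximate the top-k combinations of a sequential prediction.

    Each scorer maps the partial context chosen so far (a tuple) to a dict
    {candidate: probability} for the next element.
    """
    beams = [((), 0.0)]  # (partial context, cumulative log-probability)
    for scorer in stage_scorers:
        expanded = []
        for context, logp in beams:
            probs = scorer(context)
            best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
            # Only the most likely candidates of this stage are expanded.
            for candidate, p in best[:beam_width]:
                if p > 0:
                    expanded.append((context + (candidate,), logp + math.log(p)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:top_k]
    return beams

# Toy scorers with invented probabilities (assumption, for illustration).
def catalyst_scorer(context):
    return {"NULL": 0.7, "Pd/C": 0.3}

def solvent_scorer(context):
    if context[0] == "Pd/C":
        return {"methanol": 0.8, "THF": 0.2}
    return {"THF": 0.6, "methanol": 0.4}

results = beam_search([catalyst_scorer, solvent_scorer], beam_width=2, top_k=3)
for context, logp in results:
    print(context, round(math.exp(logp), 2))
```

Because low-ranked candidates are dropped at every stage, a combination whose early elements are individually unlikely can be missed even if its joint probability is high, which is why the resulting top-10 list is only approximate.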

Methods

Overview

The task of condition prediction can be divided into two parts: chemical context prediction (catalysts, solvents, reagents) is treated as a set of multiclass classification problems (i.e., choosing chemical species from a fixed list), while temperature prediction is treated as a regression problem. Pressure is not considered in this scope because the majority of published reactions of interest are run at atmospheric pressure, and databases often do not record pressure for such cases. Concentration is also not considered because it is excluded from tabulated reaction databases, although sparse information is available in the form of mass or volume of some chemical species. As mentioned in the introduction, all these parts should be linked together in one model to solve these classification and regression problems in a hierarchical formulation to account for the interconnectedness of, e.g., solvent and catalyst selection. We choose to use neural networks to construct the model architecture, because of their flexibility to recognize highly nonlinear relationships and because the size and diversity of the data warrant a high-capacity model. The model is trained on published reaction data from Reaxys[8] to predict the recorded reaction conditions, after which we are able to infer suitable reaction conditions for novel reactions.

Data

We construct our training set starting from the Reaxys reaction database that consists of 53 million reaction records. The information we use includes the simplified molecular-input line-entry system (SMILES) strings of the reactants and products, the Reaxys chemical ID and SMILES (if available) or name for the catalysts, solvents and reagents, and the temperature for the reactions. Note that in this context reagents are explicitly distinct from reactants, with the former generally not contributing carbon atoms to the reaction (typical reagents include acids, bases, oxidants, reductants, etc.). We restrict our analysis to single-product and single-step reactions to better align with the application to computer-aided synthesis planning. In this context, single-product reactions are defined as Reaxys reaction entries with only one recorded major product. Few entries have multiple products (e.g., specifying all outcomes in cases of ambiguous site selectivity), only 6.3% in this data set. Reactions are preannotated to specify the number of reaction steps associated with the recorded transformation; this attribute is used to filter out all multistep reactions. Some reaction examples passing this filter could still be considered as requiring multiple steps (as shown in the Results section), but it is otherwise hard to systematically distinguish between single- and multi-step reactions purely based on reactant and product structures. Reactions without recorded structures (half reactions) or with structures that could not be parsed by RDKit[45] are removed. We also discard reaction entries with no reaction condition information (i.e., no catalyst, no solvent, no reagent, and no temperature), and restrict the maximum number of unique solvents and reagents per reaction to two per category, which will be further explained in the Model Structure section. 
We also note that there is some ambiguity between catalysts and reagents in Reaxys: many catalysts are recorded as reagents, causing the data to be sparser for catalysts and increasing the number of distinct reagents. This issue can hardly be eliminated completely since a strict separation between reagents and catalysts is difficult to achieve. As a data cleaning step to mitigate this issue, all chemicals that appear as catalysts or reagents that include transition metals in the name are marked as “catalyst exclusive” and consolidated to the catalyst field. For the remaining 12.1 million reaction examples, we analyze the frequency of each solvent, reagent, and catalyst species. A minimum frequency filter is applied to remove rare catalysts, solvents, and reagents. Rare chemical species significantly increase the number of classes over which the condition prediction will be made, introducing concerns over data sparsity, and including them contributes little to model coverage (frequency vs rank plots for catalysts, reagents, and solvents are provided in the Supporting Information, Figures S1–S3). A minimum frequency of 100 is applied to solvents, reagents, and catalysts. The final number of classes (distinct chemical species) for solvents is 232, for reagents is 2247, and for catalysts is 803; this is an appreciable reduction from the original 11 246, 151 214, and 10 323, respectively. Meanwhile, the number of reactions filtered out by this criterion is only 676 848, around 5% of the total reactions. The number of reactions filtered out by each criterion is listed in Table 2.

Table 2

Number of Reactions by Each Filter Criterion

criterion	number of reactions
originally from Reaxys	53 143 003
temperature out of range	56 235
multistep	23 536 281
multiproduct	3 335 439
missing SMILES (including half reactions)	92 472
cannot be sanitized by RDKit	1 693 625
no condition information	9 684 738
exceeding one catalyst, two solvents, or two reagents	2 645 058
using rare catalysts, solvents, or reagents	676 848
final data set	11 422 307
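
The minimum-frequency criterion in the table above can be sketched as a two-pass filter. This is an illustrative sketch only: the record layout (dicts with lists of species per field) is hypothetical, and counting frequency as the number of occurrences across records is an assumption about how the threshold of 100 was applied.

```python
from collections import Counter

def filter_rare_species(reactions, fields=("catalyst", "solvent", "reagent"),
                        min_count=100):
    """Drop reactions that use any species seen fewer than `min_count` times."""
    # First pass: count occurrences of each species, separately per field.
    counts = {field: Counter() for field in fields}
    for rxn in reactions:
        for field in fields:
            counts[field].update(rxn.get(field, ()))
    # Second pass: keep a reaction only if all of its species are frequent.
    kept = []
    for rxn in reactions:
        if all(counts[f][s] >= min_count for f in fields for s in rxn.get(f, ())):
            kept.append(rxn)
    return kept

# Toy records with a lowered threshold so the effect is visible.
toy_reactions = [
    {"solvent": ["THF"], "reagent": ["NaH"]},
    {"solvent": ["THF"], "reagent": ["NaH"]},
    {"solvent": ["THF", "HMPA"], "reagent": []},
]
kept = filter_rare_species(toy_reactions, min_count=2)
print(len(kept))
```

Counts are taken once, before any reaction is removed, so the filter is not applied iteratively in this sketch.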

Temperature is treated as a continuous variable. For reactions whose temperature is recorded as a range (e.g., 0–20 °C), the midpoint of the range is used. We require the reaction temperature to be between −100 and 500 °C, which comfortably contains the vast majority of organic transformations and only excludes chemistry outside the scope of this study (e.g., hydrocarbon cracking). After the full data preprocessing pipeline, the final data set contains a total of 11.4 M reaction records.
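
The temperature handling described here amounts to a small normalization step, sketched below. The input convention (a number, or a `(min, max)` tuple for ranges) and the function name are hypothetical, and treating the bounds as inclusive is an assumption.

```python
def parse_temperature(record, low=-100.0, high=500.0):
    """Return a single temperature in °C, or None if it falls out of range.

    `record` is either a number or a (t_min, t_max) tuple for a recorded range.
    """
    if isinstance(record, tuple):
        t_min, t_max = record
        value = (t_min + t_max) / 2.0  # midpoint of the recorded range
    else:
        value = float(record)
    return value if low <= value <= high else None

print(parse_temperature((0, 20)))
print(parse_temperature(800))
```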

Molecular Representation

Morgan circular fingerprints, as implemented in RDKit (radius 2, calculated as bit vectors of length 16 384, with stereochemistry information included and without feature-based invariants),[45] are used to represent reactant and product species, as they are a commonly used descriptor of organic molecules.[11,15,46] Catalysts, solvents, and reagents are directly represented as one-hot vectors, with each different chemical species (more precisely, each chemical entity with a unique ID in Reaxys) representing a unique class. A NULL class is added for each element to represent reactions where the corresponding element is not recorded (e.g., no reagent). The lack of a well-defined chemical structure for certain species (e.g., air) precludes a richer descriptor-based representation, and we find a one-hot representation to work well in practice.
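
The one-hot representation with a NULL class can be sketched in a few lines. The helper names are hypothetical, species names stand in for Reaxys IDs, and placing NULL at index 0 is an arbitrary choice made for this sketch.

```python
def build_vocabulary(species):
    """Map each species to an index, reserving index 0 for the NULL class."""
    vocab = {"NULL": 0}
    for name in sorted(set(species)):
        vocab[name] = len(vocab)
    return vocab

def one_hot(species, vocab):
    """Encode one species, or None for 'not recorded', as a one-hot list."""
    vector = [0] * len(vocab)
    vector[vocab["NULL" if species is None else species]] = 1
    return vector

vocab = build_vocabulary(["THF", "methanol", "THF"])
print(one_hot("methanol", vocab))
print(one_hot(None, vocab))
```

Because the encoding depends only on an identifier, species without a well-defined structure (such as air) are handled in the same way as any other class.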

Model Structure

The neural network takes the product fingerprint and reaction fingerprint as two inputs. A reaction fingerprint is calculated as the difference between the product fingerprint and the reactant fingerprint, which represents the substructures that change during the reaction.[46] Predictions are made sequentially so that information from preceding elements can be incorporated into the prediction of subsequent elements (e.g., the prediction of solvent will depend on what catalyst is chosen). Temperature is the final output of the model, such that it relies on the chemical context recommendations. The workflow of the model (shown graphically in Figure 7) is as follows:

Figure 7

Graphical representation of the neural-network model for context recommendation (“Hard Selection” refers to setting the value of the maximal element to one and zero for the rest, although the output of each classification task is a probability distribution).

1. Reaction and product fingerprints are concatenated and passed through two fully connected layers (ReLU activation, size 1000; ReLU activation, size 1000, with a 0.5 dropout) to generate a dense representation of the fingerprints (referred to as Dense FP).
2. Dense FP is passed through two fully connected layers (ReLU activation, size 300; Softmax activation, size 803) to predict the catalyst (or NULL) for the reaction.
3. The one-hot vector of the catalyst prediction is then concatenated with Dense FP and passed through two fully connected layers (ReLU activation, size 300; Softmax activation, size 232) to predict the first solvent (or NULL).
4. Step 3 is repeated for prediction of the second solvent (size 228), the first reagent (size 2240), and the second reagent (size 1979). The numbers are smaller than the total number of solvent/reagent classes because some solvents/reagents are only present in one of the fields (i.e., only as Solvent/Reagent 1 or Solvent/Reagent 2).
5. One-hot vectors of the catalyst, solvents, and reagents and Dense FP are all concatenated and passed through two fully connected layers (ReLU activation, size 300; Linear activation, size 1) to predict the temperature.

Notable features of the model construction are as follows:

One feature is the order of the prediction tasks. The earlier a task appears in the model, the more it should be possible to perform that task based solely on the reaction, independent of the other predictions. We experimented with predicting single elements using fingerprint information only and found that the validation accuracy (top-one accuracy) is highest for catalyst (92.1%), and similar for solvent and reagent (60.6% and 60.6%, respectively). This is consistent with how chemists generally approach this problem manually, i.e., by first identifying whether the reaction requires a catalyst. Reagents are placed last in the sequential prediction, so that information about catalyst and solvent selection is included in predicting reagents, which have the most unique possibilities and a greater level of flexibility even when the catalysts and solvents are fixed.

Another feature is the number of catalysts, solvents, and reagents for each reaction. Most of the reactions in the data set have no catalyst or at most one catalyst recorded, so the number of catalysts is limited to one. A majority of reactions use one solvent, but there are still many examples that use multiple solvent or multiple reagent combinations. Few reactions in the data set use three or more solvents or reagents, so limiting the number of solvents and reagents to two for each category keeps the model at a reasonable size. The final model has 38 M parameters.
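
The sequential structure with hard selection can be illustrated with a toy forward pass. This is a sketch under several assumptions, not the trained model: the dense layers are skipped, each classification head is a random linear map followed by a softmax (`make_head` is a hypothetical stand-in), class counts are tiny, and every head is assumed to see all earlier selections.

```python
import math
import random

def hard_select(probabilities):
    """Set the maximal element to one and the rest to zero."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return [1 if i == best else 0 for i in range(len(probabilities))]

def reaction_fingerprint(product_fp, reactant_fp):
    """Difference between product and reactant fingerprints."""
    return [p - r for p, r in zip(product_fp, reactant_fp)]

def make_head(n_inputs, n_classes, rng):
    """Toy stand-in for a trained head: random linear scores plus softmax."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
               for _ in range(n_classes)]
    def head(features):
        scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]
    return head

def predict_context(features, heads):
    """Run heads in order, feeding each hard selection to the later heads."""
    features = list(features)
    selections = []
    for head in heads:
        choice = hard_select(head(features))
        selections.append(choice)
        features = features + choice  # concatenate the one-hot choice
    return selections

rng = random.Random(0)
product_fp = [1, 0, 1, 1]
reactant_fp = [1, 1, 0, 1]
inputs = product_fp + reaction_fingerprint(product_fp, reactant_fp)

class_sizes = [3, 4, 5]  # toy catalyst, solvent, reagent class counts
heads, n_inputs = [], len(inputs)
for size in class_sizes:
    heads.append(make_head(n_inputs, size, rng))
    n_inputs += size

selections = predict_context(inputs, heads)
print(selections)
```

The growing input width of each head mirrors how the prediction order determines which earlier choices a later element can depend on.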

Training and Evaluation

The data set is split randomly into training/validation/test sets with a ratio of 80/10/10. The time-split strategy is not used because, in practice, the model for condition recommendation is likely to be mainly used for interpolating and generalizing historical condition information to new substrates, but not for designing fundamentally novel conditions that have not been previously discovered. It is worth noting that there can be multiple records for the same reaction, possibly run under different conditions. When multiple records of the same reaction exist (i.e., if multiple reactions share the same reactants and products, which is quite common), they are grouped during shuffling to guarantee that the reactions in the test set are not present in the training set.

Our model has intermediate outputs (catalysts, solvents, and reagents) that are used as input for the next prediction, which resembles the features of a recurrent neural network, so we apply the teacher forcing technique[47] during training. It takes the ground truth outputs, instead of the predicted ones, as the input for the next prediction task. This technique has been shown to increase the stability and accuracy of training. Categorical cross-entropy is used as the loss function for the classification problems (i.e., catalysts, solvents, and reagents), and mean squared error is used for regression (i.e., temperature). A weighting factor of 0.001 is applied to temperature so that the numeric values of the loss functions are approximately on the same scale. Training continues until the validation loss does not improve over five epochs.

Evaluation is performed both quantitatively and qualitatively. We calculated how often the true combined chemical context, as well as each individual element, is found within the top-three and top-10 predicted combinations. Additionally, we extend this accuracy calculation to include some “close match” predictions. The similarity of solvents is characterized by the Euclidean distance between the Abraham parameters[28] of the two solvents if the parameters are available; otherwise only exact matches are considered. The methanol–ethanol pair is used as a threshold to identify solvents that are close matches. Catalysts and reagents are classified as close matches if they have the same metal atoms (for organometallic compounds) or if their feature-based Morgan fingerprints are exactly the same. The feature definitions are as implemented in RDKit, adapted from the definitions in Gobbi et al.,[48] which define some invariants that share the same feature, such as “halogen” (e.g., −Cl and −Br), “hydrogen bond donor” (−SH and −OH), and unusual atoms (not H, C, N, O, F, S, Cl, Br, I; e.g., Na+ and K+). A complete table of feature definitions is included in the Supporting Information (Table S8). For temperature, we calculated the percentage of cases in which the temperature is predicted to be within ±10 and ±20 °C of the recorded temperature.

Besides the quantitative analysis, examples were chosen from common types of chemical reactions to demonstrate the wide applicability of the model and provide more insight into the characteristics of the model predictions. Successful and unsuccessful predictions in the test set are presented and analyzed to demonstrate the performance of the model. For comparison, the performance of the nearest-neighbor model on these reactions is also tested, and results are discussed in the Supporting Information.

This is a computational study, and we do not expect high safety hazards to be encountered.
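
The grouped random split described in this section (records sharing the same reactants and products are kept in the same subset) can be sketched as follows. The record layout and function name are hypothetical, and this sketch applies the 80/10/10 ratio to groups rather than to individual records, which is an assumption.

```python
import random
from collections import defaultdict

def grouped_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records so that identical reactions never span two subsets."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["reactants"], rec["product"])].append(rec)
    keys = sorted(groups)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(keys)
    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    parts = (keys[:n_train],
             keys[n_train:n_train + n_val],
             keys[n_train + n_val:])
    return [[rec for key in part for rec in groups[key]] for part in parts]

# Ten toy reactions, each recorded twice at different temperatures.
records = [
    {"reactants": f"R{i}", "product": f"P{i}", "temperature": t}
    for i in range(10)
    for t in (20, 80)
]
train, val, test = grouped_split(records)
print(len(train), len(val), len(test))
```

Splitting by group rather than by record means that a reaction appearing under several conditions cannot leak from the training set into the test set.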

Conclusion

A neural network model was developed for the task of reaction condition recommendation. Using a hierarchical design and training on about 10 million reactions from Reaxys, the model gives recommendations on the catalyst, solvent, reagent, and temperature to be used for any organic reaction. The model is tested on 1 million reactions outside the training set, and is able to recover a context combination with the catalyst and at least one solvent and reagent close to the true context in the top-10 predictions in 69.6% of those cases. Qualitative evaluation on common types of reactions reveals that the model can predict the exact conditions or predict conditions that have the same functionality as the true conditions. Many failed predictions are due to highly specific reactivity or data inconsistencies. Solvent and reagent embeddings are extracted from the trained model, and the visualization of them demonstrates that these representations capture the functional similarity. The context information generated by this tool can be used to aid experimental design, improve accuracy of in silico evaluation of reactivity and pathway-level evaluation, and improve chemical synthesis processes.
