Wiktor Beker, Rafał Roszak, Agnieszka Wołos, Nicholas H. Angello, Vandana Rathore, Martin D. Burke, Bartosz A. Grzybowski.
Abstract
Applications of machine learning (ML) to synthetic chemistry rely on the assumption that large numbers of literature-reported examples should enable construction of accurate and predictive models of chemical reactivity. This paper demonstrates that an abundance of carefully curated literature data may be insufficient for this purpose. Using the example of Suzuki-Miyaura coupling with heterocyclic building blocks, and a carefully selected database of >10,000 literature examples, we show that ML models cannot offer any meaningful predictions of optimum reaction conditions, even if the search space is restricted to only solvents and bases. This result holds irrespective of the ML model applied (from simple feed-forward to state-of-the-art graph-convolutional neural networks) or the representation used to describe the reaction partners (various fingerprints, chemical descriptors, latent representations, etc.). In all cases, the ML methods fail to perform significantly better than naive assignments based on the sheer frequency of certain reaction conditions reported in the literature. These unsatisfactory results likely reflect the subjective preferences of various chemists for certain protocols, other biasing factors as mundane as the availability of certain solvents/reagents, and/or a lack of negative data. These findings highlight the likely importance of systematically generating reliable and standardized data sets for algorithm training.
Year: 2022 PMID: 35258973 PMCID: PMC8949728 DOI: 10.1021/jacs.1c12005
Source DB: PubMed Journal: J Am Chem Soc ISSN: 0002-7863 Impact factor: 15.419
Figure 1. Formulation of the prediction problem and literature-based statistics of reaction conditions. (a) The data set of literature-reported reactions we consider comprises heteroaryl-heteroaryl and aryl-heteroaryl Suzuki couplings (additionally restricted to only bromides and boronic acids). The objective is to use AI models to predict “optimal” reaction conditions for a given pair of substrates. Literature-based statistics of (b) most common Pd sources used in heteroaromatic Suzuki couplings [>50% of all published reactions used Pd(PPh3)4 as a catalyst]; (c) reaction temperatures (almost 50% of reactions were performed between 80 and 109 °C; ∼20% of the records do not report temperature); (d) bases (the five most common bases cover >80% of the reaction space; additionally, carbonate bases were used in almost 70% of reactions); (e) solvents and solvent mixtures (the five most common solvent mixtures cover only 45% of the reaction space). Legends color-code the specific types of substrates used: “X = any”—any type of halide; “X = hetero”—heteroaromatic halide; “X = aryl”—aryl halide; “B = any”—any type of boronic acid; “B = hetero”—heteroaromatic boronic acid; “B = aryl”—aryl boronic acid.
Table 1. Summary of Accuracies Obtained by Standard Feed-Forward Networks
(a) Prediction accuracy of base (7 classes) and solvent (6 classes):

| input | base top-1 | base top-2 | base top-3 | solvent top-1 | solvent top-2 | solvent top-3 |
|---|---|---|---|---|---|---|
| “popularity” baseline | 76.8 | 89.6 | 93.8 | 29.8 | 57.4 | 75.5 |
| Morgan fingerprint | 80.6 (3.1) | 91.0 (2.7) | 94.4 (1.9) | 51.7 (7.8) | 69.4 (5.0) | 81.2 (2.8) |
| RDKit descriptors | 74.8 (2.2) | 88.6 (1.9) | 92.8 (1.6) | 42.6 (5.4) | 62.9 (4.4) | 76.9 (4.3) |
| Morgan + descriptors | 76.9 (3.3) | 89.1 (2.1) | 93.0 (1.9) | 45.2 (7.3) | 64.4 (6.0) | 78.1 (4.4) |
| autoencoder | 77.7 (2.7) | 90.2 (1.6) | 93.5 (1.3) | 42.2 (5.5) | 62.3 (3.7) | 77.2 (2.3) |
The top-k accuracy metric is the probability (in %) of finding the actual class within the top k classes ordered according to the model's predictions (values in parentheses are standard deviations from fivefold cross-validation). Part (a) is for the model taking into account six solvent classes. Part (b) is for 13 solvent classes.
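To make the evaluation metric concrete, the top-k accuracy and the frequency-based (“popularity”) baseline used throughout these tables can be sketched as follows; this is a minimal sketch, with function names and toy data of our own choosing, not code from the paper:

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true class is among the k classes
    with the highest predicted scores."""
    top_k = np.argsort(probs, axis=1)[:, ::-1][:, :k]
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

def popularity_baseline(train_labels, n_classes):
    """Class frequencies in the training data, used as one fixed
    'prediction' for every test sample."""
    counts = np.bincount(train_labels, minlength=n_classes)
    return counts / counts.sum()

# toy example: 3 condition classes, class 0 dominates the literature
train_labels = np.array([0, 0, 0, 1, 2])
baseline = popularity_baseline(train_labels, 3)   # [0.6, 0.2, 0.2]
scores = np.tile(baseline, (4, 1))                # identical ranking for all samples
test_labels = np.array([0, 1, 2, 0])
print(top_k_accuracy(scores, test_labels, k=1))   # 0.5: only class-0 samples hit
```

Because the baseline assigns the same ranking to every sample, its top-1 accuracy equals the literature share of the most common class, which is exactly why a model must beat this number to demonstrate any substrate-specific learning.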
Table 2. Coarse-Grained Solvent Classification by Advanced NN Models
| model architecture | input | top-1 | top-2 | top-3 |
|---|---|---|---|---|
| “popularity”-based baseline | | 29.2 | 53.8 | 73.1 |
| GCNN | molecular graph of substrates | 40.6 (6.3) | 61.0 (5.2) | 74.7 (3.4) |
| PU-NN | ECFP6 of substrates | 42.1 (6.1) | 60.9 (4.6) | 74.0 (2.5) |
| feed-forward | ECFP6 of substrates | 45.8 (6.5) | 63.5 (5.5) | 75.9 (4.1) |
| feed-forward | ECFP6 of substrates + base class | 46.4 (5.6) | 64.2 (5.1) | 76.6 (5.3) |
| feed-forward | Mol2Vec of substrates | 34.9 (3.9) | 54.9 (3.1) | 70.1 (2.7) |
GCNN: graph convolutional neural network.[34] PU-NN: NN classifier with PU correction.[35,36] ECFP6: extended-connectivity fingerprints with diameter 6.[41] The top-k accuracy metric is the probability (in %) of finding the actual class within the top k classes ordered according to the model's predictions (values in parentheses are standard deviations from fivefold cross-validation). The baseline values refer to the ordering produced by the corresponding frequencies in the literature. Note that, to mitigate class imbalance, all models used sample weights inversely proportional to class frequency (e.g., if a given solvent class was rarely used in the literature, the error on the corresponding “matching” examples was multiplied by a weight inversely proportional to the class size; this adjustment is meant to treat large and small classes on an equal footing, without size-induced bias).
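The inverse-frequency sample weighting described in this footnote can be sketched as follows; a minimal sketch under our own naming, since the paper does not publish its exact weighting code:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-sample weights inversely proportional to the size of each
    sample's class, normalized so the mean weight is 1.0; rare and
    common condition classes then contribute equally to the loss."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0            # guard against empty classes
    weights = (1.0 / counts)[labels]
    return weights / weights.mean()

# toy example: class 0 appears three times as often as class 1,
# so every class-1 sample receives 3x the weight of a class-0 sample
w = inverse_frequency_weights(np.array([0, 0, 0, 1]), 2)
```

Such weights would typically be passed as per-sample loss multipliers (e.g., the `sample_weight` argument of Keras's `Model.fit`); the normalization to mean 1 keeps the overall loss scale comparable to the unweighted case.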
Table 3. Accuracy of Yield Prediction Using Feed-Forward Neural Networks with Different Input Representations
| input data | loss | MAE | top-1 | top-2 | top-3 | Mdiff |
|---|---|---|---|---|---|---|
| popularity-based baseline | | 16.3 | 25.1 | 44.7 | 59.4 | |
| fine classes | MSD | 16.2 (2.3) | 0.8 (0.4) | 0.9 (0.4) | 1.1 (0.6) | 9.4 |
| fine classes (with ligand) | MSD | 16.0 (1.9) | 1.5 (0.7) | 1.8 (0.9) | 2.5 (1.7) | 6.1 |
| “coarse-grained” classes | MSD | 16.3 (2.2) | 0.6 (0.7) | 0.8 (0.7) | 1.1 (0.8) | 6.1 |
| “coarse-grained” (with ligand) | MSD | 15.6 (2.0) | 1.0 (0.8) | 1.8 (1.6) | 3.1 (2.8) | 4.6 |
| embedded conditions | MSD | 16.3 (2.7) | | | | |
| embedded coarse-grained classes | MSD | 16.6 (2.4) | 7.6 (11.7) | 12.9 (12.4) | 14.7 (11.3) | 5.4 |
| classifier | | | 37.0 | 48.8 | 56.9 | |
MAE = mean absolute error; top-k values as in Tables 1 and 2, in %; values in parentheses are standard deviations from fivefold cross-validation; Mdiff = mean difference between the conditions predicted to be the best and the worst for particular coupling partners. The popularity baseline is defined according to the most popular literature-reported conditions (though, unlike in Table 1, here both base and solvent are considered). The last entry, labeled “classifier”, refers to the combined predictions of separate base and solvent classifiers based on the fingerprint representation. “Learnable embedding” was performed separately for each of the three components (ligand, solvent, and base). Tokenization took place before NN training and involved selecting the top-X most frequent entries in the literature data (54 solvents, 72 bases, and 81 ligands), each assigned a number (an index in the model's “dictionary”); usually, one of those numbers covered all less significant, null, or unknown entries. Bases, ligands, and solvents were each assigned single tokens, whereas solvent mixtures of up to four components were represented by tuples of four tokens representing pure solvents (ordered according to predominance in the mixture, with null/zero tokens denoting “missing” solvents in binary and ternary mixtures). The embedding layer in the NN kept a “dictionary” translating each token into a D-dimensional vector whose components were optimized during training. Here, each token was assigned a 3D vector, resulting in a 24D representation of reaction conditions (a concatenation of two 3D vectors for ligand and base, as well as four 3D vectors for solvent components).
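The tokenization-plus-embedding scheme described in this footnote can be sketched as follows; this is a minimal sketch in which the reserved-token convention and all function names are our illustrative choices, and a static lookup table stands in for the learnable embedding layer:

```python
import numpy as np
from collections import Counter

NULL, UNK = 0, 1   # reserved tokens: missing mixture component / rare entry

def build_vocab(entries, top_x):
    """Index the top_x most frequent entries; everything else maps to UNK."""
    most_common = [e for e, _ in Counter(entries).most_common(top_x)]
    return {e: i + 2 for i, e in enumerate(most_common)}   # 0 and 1 reserved

def tokenize_conditions(ligand, base, solvents,
                        lig_vocab, base_vocab, solv_vocab):
    """One ligand token, one base token, and four solvent tokens: mixtures
    of up to four components, ordered by predominance, padded with NULL."""
    solv = [solv_vocab.get(s, UNK) for s in solvents[:4]]
    solv += [NULL] * (4 - len(solv))
    return [lig_vocab.get(ligand, UNK), base_vocab.get(base, UNK)] + solv

def embed(tokens, table):
    """Translate each token through a (vocab_size, 3) lookup table and
    concatenate; in training, the table entries would be optimized."""
    return np.concatenate([table[t] for t in tokens])
```

With six tokens of 3 dimensions each, this toy version produces an 18-D vector; the paper's full scheme concatenates the per-token vectors into a 24-D representation of the reaction conditions.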
Table 4. Accuracy of Condition Prediction Using Previously Reported Models
| task type | model | Reaxys top-1 | Reaxys top-2 | Reaxys top-3 | Reaxys MAE | USPTO top-1 | USPTO top-2 | USPTO top-3 | USPTO MAE |
|---|---|---|---|---|---|---|---|---|---|
| | popularity-based baseline | 25.1 | 44.7 | 59.4 | 16.3 | 29.8 | 51.8 | 62.7 | 21.1 |
| classification | reaction conditions recommender | 38.7 | 46.1 | 50.7 | | 26.4 | 31.0 | 34.0 | |
| classification | Rel-GAT | 39.6 | 53.6 | 62.6 | | 46.3 | 60.9 | 70.6 | |
| regression | yield-BERT | 13.3 | 14.1 | 14.7 | 14.1 | 5.6 | 8.0 | 10.9 | 19.2 |
Top-k values as in Tables 1–3, in %. The popularity baseline is defined according to the most popular literature-reported conditions (though, unlike in Table 1, here both base and solvent are considered).