Wiktor Beker, Rafał Roszak, Agnieszka Wołos, Nicholas H. Angello, Vandana Rathore, Martin D. Burke, Bartosz A. Grzybowski.
Abstract
Applications of machine learning (ML) to synthetic chemistry rely on the assumption that large numbers of literature-reported examples should enable construction of accurate and predictive models of chemical reactivity. This paper demonstrates that an abundance of carefully curated literature data may be insufficient for this purpose. Using the example of Suzuki-Miyaura coupling with heterocyclic building blocks, and a carefully selected database of >10,000 literature examples, we show that ML models cannot offer any meaningful predictions of optimum reaction conditions, even if the search space is restricted to only solvents and bases. This result holds irrespective of the ML model applied (from simple feed-forward to state-of-the-art graph-convolutional neural networks) or the representation used to describe the reaction partners (various fingerprints, chemical descriptors, latent representations, etc.). In all cases, the ML methods fail to perform significantly better than naive assignments based on the sheer frequency of certain reaction conditions reported in the literature. These unsatisfactory results likely reflect the subjective preferences of various chemists for certain protocols, other biasing factors as mundane as the availability of certain solvents/reagents, and/or a lack of negative data. These findings highlight the likely importance of systematically generating reliable and standardized data sets for algorithm training.
Year: 2022 PMID: 35258973 PMCID: PMC8949728 DOI: 10.1021/jacs.1c12005
Source DB: PubMed Journal: J Am Chem Soc ISSN: 0002-7863 Impact factor: 15.419
Figure 1. Formulation of the prediction problem and literature-based statistics of reaction conditions. (a) The data set of literature-reported reactions we consider comprises heteroaryl-heteroaryl and aryl-heteroaryl Suzuki couplings (additionally restricted to only bromides and boronic acids). The objective is to use AI models to predict “optimal” reaction conditions for a given pair of substrates. Literature-based statistics of (b) most common Pd sources used in heteroaromatic Suzuki couplings [>50% of all published reactions used Pd(PPh3)4 as a catalyst]; (c) reaction temperatures (almost 50% of reactions were performed between 80 and 109 °C; ∼20% of the records do not report temperature); (d) bases (the five most common bases cover >80% of the reaction space; additionally, carbonate bases were used in almost 70% of reactions); (e) solvents and solvent mixtures (the five most common solvent mixtures cover only 45% of the reaction space). Legends color-code the specific types of substrates used: “X = any”—any type of halide; “X = hetero”—heteroaromatic halide; “X = aryl”—aryl halide; “B = any”—any type of boronic acid; “B = hetero”—heteroaromatic boronic acid; “B = aryl”—aryl boronic acid.
Table 1. Summary of Accuracies Obtained by Standard Feed-Forward Networks
(a) Prediction accuracy of base (7 classes) and solvent (6 classes):

| input | base top-1 | base top-2 | base top-3 | solvent top-1 | solvent top-2 | solvent top-3 |
|---|---|---|---|---|---|---|
| “popularity” baseline | 76.8 | 89.6 | 93.8 | 29.8 | 57.4 | 75.5 |
| Morgan fingerprint | 80.6 (3.1) | 91.0 (2.7) | 94.4 (1.9) | 51.7 (7.8) | 69.4 (5.0) | 81.2 (2.8) |
| RDKit descriptors | 74.8 (2.2) | 88.6 (1.9) | 92.8 (1.6) | 42.6 (5.4) | 62.9 (4.4) | 76.9 (4.3) |
| Morgan + descriptors | 76.9 (3.3) | 89.1 (2.1) | 93.0 (1.9) | 45.2 (7.3) | 64.4 (6.0) | 78.1 (4.4) |
| autoencoder | 77.7 (2.7) | 90.2 (1.6) | 93.5 (1.3) | 42.2 (5.5) | 62.3 (3.7) | 77.2 (2.3) |
The top-k accuracy metric is the probability (in %) of finding the actual class within the top k classes ordered according to the model's predictions (values in parentheses are standard deviations from fivefold cross-validation). Part (a) is for the model taking into account six solvent classes. Part (b) is for 13 solvent classes.
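To make the evaluation metric concrete, the top-k accuracy and the frequency-based (“popularity”) baseline used throughout these tables can be sketched as follows; this is a minimal sketch, with function names and toy data of our own choosing, not code from the paper:

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true class is among the k classes
    with the highest predicted scores."""
    top_k = np.argsort(probs, axis=1)[:, ::-1][:, :k]
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

def popularity_baseline(train_labels, n_classes):
    """Class frequencies in the training data, used as one fixed
    'prediction' for every test sample."""
    counts = np.bincount(train_labels, minlength=n_classes)
    return counts / counts.sum()

# toy example: 3 condition classes, class 0 dominates the literature
train_labels = np.array([0, 0, 0, 1, 2])
baseline = popularity_baseline(train_labels, 3)   # [0.6, 0.2, 0.2]
scores = np.tile(baseline, (4, 1))                # identical ranking for all samples
test_labels = np.array([0, 1, 2, 0])
print(top_k_accuracy(scores, test_labels, k=1))   # 0.5: only class-0 samples hit
```

Because the baseline assigns the same ranking to every sample, its top-1 accuracy equals the literature share of the most common class, which is exactly why a model must beat this number to demonstrate any substrate-specific learning.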
Table 2. Coarse-Grained Solvent Classification by Advanced NN Models
| model architecture | input | top-1 | top-2 | top-3 |
|---|---|---|---|---|
| “popularity”-based baseline | | 29.2 | 53.8 | 73.1 |
| GCNN | molecular graph of substrates | 40.6 (6.3) | 61.0 (5.2) | 74.7 (3.4) |
| PU-NN | ECFP6 of substrates | 42.1 (6.1) | 60.9 (4.6) | 74.0 (2.5) |
| feed-forward | ECFP6 of substrates | 45.8 (6.5) | 63.5 (5.5) | 75.9 (4.1) |
| feed-forward | ECFP6 of substrates + base class | 46.4 (5.6) | 64.2 (5.1) | 76.6 (5.3) |
| feed-forward | Mol2Vec of substrates | 34.9 (3.9) | 54.9 (3.1) | 70.1 (2.7) |
GCNN: graph convolutional neural network.[34] PU-NN: NN classifier with PU correction.[35,36] ECFP6: extended-connectivity fingerprints with diameter 6.[41] The top-k accuracy metric is the probability (in %) of finding the actual class within the top k classes ordered according to the model's predictions (values in parentheses are standard deviations from fivefold cross-validation). The baseline values refer to the ordering produced by the corresponding frequencies in the literature. Note that, to mitigate class imbalance, all models used sample weights inversely proportional to class frequency (e.g., if a given solvent class was rarely used in the literature, the error on the corresponding “matching” examples was multiplied by a weight inversely proportional to the class size; this adjustment is meant to treat large and small classes on an equal footing, without size-induced bias).
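The inverse-frequency sample weighting described in this footnote can be sketched as follows; a minimal sketch under our own naming, since the paper does not publish its exact weighting code:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-sample weights inversely proportional to the size of each
    sample's class, normalized so the mean weight is 1.0; rare and
    common condition classes then contribute equally to the loss."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0            # guard against empty classes
    weights = (1.0 / counts)[labels]
    return weights / weights.mean()

# toy example: class 0 appears three times as often as class 1,
# so every class-1 sample receives 3x the weight of a class-0 sample
w = inverse_frequency_weights(np.array([0, 0, 0, 1]), 2)
```

Such weights would typically be passed as per-sample loss multipliers (e.g., the `sample_weight` argument of Keras's `Model.fit`); the normalization to mean 1 keeps the overall loss scale comparable to the unweighted case.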
Table 3. Accuracy of Yield Prediction Using Feed-Forward Neural Networks with Different Input Representations
| input data | loss | MAE | top-1 | top-2 | top-3 | Mdiff |
|---|---|---|---|---|---|---|
| popularity-based baseline | | 16.3 | 25.1 | 44.7 | 59.4 | |
| fine classes | MSD | 16.2 (2.3) | 0.8 (0.4) | 0.9 (0.4) | 1.1 (0.6) | 9.4 |
| fine classes (with ligand) | MSD | 16.0 (1.9) | 1.5 (0.7) | 1.8 (0.9) | 2.5 (1.7) | 6.1 |
| “coarse-grained” classes | MSD | 16.3 (2.2) | 0.6 (0.7) | 0.8 (0.7) | 1.1 (0.8) | 6.1 |
| “coarse-grained” (with ligand) | MSD | 15.6 (2.0) | 1.0 (0.8) | 1.8 (1.6) | 3.1 (2.8) | 4.6 |
| embedded conditions | MSD | 16.3 (2.7) | | | | |
| embedded coarse-grained classes | MSD | 16.6 (2.4) | 7.6 (11.7) | 12.9 (12.4) | 14.7 (11.3) | 5.4 |
| classifier | | | 37.0 | 48.8 | 56.9 | |
MAE = mean absolute error; top-k values as in Tables 1 and 2, in %; values in parentheses are standard deviations from fivefold cross-validation; Mdiff = mean difference between the conditions predicted to be the best and the worst for particular coupling partners. The popularity baseline is defined according to the most popular literature-reported conditions (though, unlike in Table 1, here both base and solvent are considered). The last entry, labeled “classifier”, refers to the combined predictions of separate base and solvent classifiers based on the fingerprint representation. “Learnable embedding” was performed separately for each of the three components (ligand, solvent, and base). Tokenization took place before NN training and involved selecting the top-X most frequent entries in the literature data (54 solvents, 72 bases, and 81 ligands), each assigned a number (an index in the model's “dictionary”); usually, one of those numbers covered all less significant, null, or unknown entries. Bases, ligands, and solvents were each assigned single tokens, whereas solvent mixtures of up to four components were represented by tuples of four tokens representing pure solvents (ordered according to predominance in the mixture, with null/zero tokens denoting “missing” solvents in binary and ternary mixtures). The embedding layer in the NN kept a “dictionary” translating each token into a D-dimensional vector whose components were optimized during training. Here, each token was assigned a 3D vector, resulting in a 24D representation of reaction conditions (a concatenation of two 3D vectors for ligand and base, as well as four 3D vectors for solvent components).
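The tokenization-plus-embedding scheme described in this footnote can be sketched as follows; this is a minimal sketch in which the reserved-token convention and all function names are our illustrative choices, and a static lookup table stands in for the learnable embedding layer:

```python
import numpy as np
from collections import Counter

NULL, UNK = 0, 1   # reserved tokens: missing mixture component / rare entry

def build_vocab(entries, top_x):
    """Index the top_x most frequent entries; everything else maps to UNK."""
    most_common = [e for e, _ in Counter(entries).most_common(top_x)]
    return {e: i + 2 for i, e in enumerate(most_common)}   # 0 and 1 reserved

def tokenize_conditions(ligand, base, solvents,
                        lig_vocab, base_vocab, solv_vocab):
    """One ligand token, one base token, and four solvent tokens: mixtures
    of up to four components, ordered by predominance, padded with NULL."""
    solv = [solv_vocab.get(s, UNK) for s in solvents[:4]]
    solv += [NULL] * (4 - len(solv))
    return [lig_vocab.get(ligand, UNK), base_vocab.get(base, UNK)] + solv

def embed(tokens, table):
    """Translate each token through a (vocab_size, 3) lookup table and
    concatenate; in training, the table entries would be optimized."""
    return np.concatenate([table[t] for t in tokens])
```

With six tokens of 3 dimensions each, this toy version produces an 18-D vector; the paper's full scheme concatenates the per-token vectors into a 24-D representation of the reaction conditions.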
Table 4. Accuracy of Condition Prediction Using Previously Reported Models
| task type | model | Reaxys top-1 | Reaxys top-2 | Reaxys top-3 | Reaxys MAE | USPTO top-1 | USPTO top-2 | USPTO top-3 | USPTO MAE |
|---|---|---|---|---|---|---|---|---|---|
| | popularity-based baseline | 25.1 | 44.7 | 59.4 | 16.3 | 29.8 | 51.8 | 62.7 | 21.1 |
| classification | reaction conditions recommender | 38.7 | 46.1 | 50.7 | | 26.4 | 31.0 | 34.0 | |
| classification | Rel-GAT | 39.6 | 53.6 | 62.6 | | 46.3 | 60.9 | 70.6 | |
| regression | yield-BERT | 13.3 | 14.1 | 14.7 | 14.1 | 5.6 | 8.0 | 10.9 | 19.2 |
Top-k values as in Tables 1–3, in %. The popularity baseline is defined according to the most popular literature-reported conditions (though, unlike in Table 1, here both base and solvent are considered).