Literature DB >> 35936487

Extended Connectivity Fingerprints as a Chemical Reaction Representation for Enantioselective Organophosphorus-Catalyzed Asymmetric Reaction Prediction.

Abstract

Predicting the outcomes of organic reactions using data-driven approaches aids in the acceleration of research. In laboratory-scale experiments, only a small number of reaction data can be accessed for machine learning model construction, where reaction representations play a pivotal role in the success of model construction. Nevertheless, representation comparison for a small data set is not adequate. Herein, focusing on the enantioselectivity of phosphoric-acid-catalyzed reactions, various two-dimensional and three-dimensional reaction representations (descriptors) were compared. Overall, the concatenated form of the extended connectivity fingerprints showed the best predictive capability for the two types of data sets: high-throughput experimental data and manually collected literature data sets. Furthermore, highlighting the substructure contribution to the prediction outcome was shown to be informative for guiding catalyst development.

Entities: Chemical

Year: 2022 PMID： 35936487 PMCID： PMC9352214 DOI： 10.1021/acsomega.2c03812

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Developing chemical reactions to produce desired substances is one of the ultimate goals of organic chemistry. Trial-and-error approaches combined with expert knowledge have traditionally dominated the optimization of chemical reactions. Recently, chemoinformatics[1,2] or data-driven approaches have emerged to fully utilize experimental data to guide the design of chemical reactions, where the reaction outcomes (e.g., yields) are quantitatively predicted from numerical representations of the reactants or catalysts.[3,4] These representations consist of molecular descriptors, which are abstract expressions of the compounds involved in the reaction.[5] Molecular descriptors can be experimentally tested or computationally derived physicochemical parameters,[6,7] topological descriptors derived from chemical formulas,[8] and geometrical descriptors based on molecular conformations.[9,10] Physicochemical descriptors or descriptors based on the hypothetical reaction mechanism combined with a simple multivariate linear regression model are able to give an interpretation of the reaction mechanism.[11−13] On the other hand, topological descriptors accompanied by nonlinear machine learning (ML) models have sufficient predictive capability when trained on high-throughput experimental (HTE) data.[14,15] Although HTE data[16−18] provide the opportunity to analyze the comprehensive reaction space with high precision, the exhaustive combinations of substances under uniformly controlled experimental conditions are not usually available in laboratory-scale experiments for novel reaction development. Thus, methods for constructing highly predictive ML models trained on a small number of reactions are highly demanded. Asymmetric catalyst reactions are paramount in organic chemistry for selectively synthesizing one of the enantiomers. Approximately 50% of the currently marketed pharmaceuticals contain compounds with at least one chiral center.[19] Without catalysts, separation of the racemic mixture is necessary, in which the theoretical yield is limited to 50%. Because the energy barrier between enantiomers remains almost the same, it is difficult to bias the reaction to one enantiomer in a thermodynamics manner. To solve this issue, asymmetric catalysts, such as transition-metal catalysts and organocatalysts, have been developed.[20,21] Highly predictive ML models of the enantioselectivity for asymmetric catalysts have been investigated.[22−24] 1,1′-Bi-2-naphthol (BINOL)-based phosphoric acid catalysis reactions have been used for molecular descriptor development.[22,23] Zahrt et al.[22] proposed novel reaction representations that do not depend on a single conformation of the compounds. The representation includes the steric effect of a molecule by estimating the grid-occupancy probability of the feasible conformers. Although ML models with this representation have high predictive capability when they are trained on a sufficient number of reactions, comparison with other molecular descriptors, including traditional two-dimensional (2D) and three-dimensional molecular descriptors (3D), is not adequate. Furthermore, this sophisticated descriptor is only applicable to a series of analogous substrates or catalysts owing to the necessity of molecular alignment. In addition to the difficulty of deployment, it is not straightforward to use the descriptors for the design of high enantioselectivity because space filling at particular grids is the interpretation of the descriptors. Methods for interpreting the outcomes of nonlinear ML models have been developed for biological activity prediction. The contributions of chemical structural fragments to the model outcomes can be color-coded on the chemical structures as model interpretation.[25] When a graph neural network is used as a ML model, sensitivity analysis by changing the input molecular graph attributions gives the important atoms (substructures) of molecules.[26] The method in the latter category has already been reported for reaction prediction using the publicly available large reaction data set of USPTO.[27] However, to the best of our knowledge, there have been no reports on the former type for reaction–outcome prediction models. Here, we tested various 2D and 3D molecular descriptors in enantioselectivity ML models for phosphoric acid catalyst design and propose model construction schemes with high predictive capability focusing on a small number of reactions. Descriptor comparison was performed using a HTE data set to fully evaluate the descriptors with statistical significance. To mimic a real-world situation when the reaction data from various data sources are combined to make ML models, different reaction mechanisms including substrates and phosphoric acid catalysts were collected. Data-fusion approaches were tested for this data set. We found that 2D chemical structural fingerprints were useful for both highly predictive ML models and interpretation. In data-fusion approaches, reaction mechanism similarity was not as important as chemical structural similarity for the model predictive capability.

Materials and Methods

Chemical Reaction Data Sets

Two data sets of organophosphorus-catalyzed asymmetric reactions were prepared. The first was the HTE data set reported by Zahrt et al.,[22] which is called the HTE data set. The second was a collection of chemical reactions extracted from several research articles from several research groups, which is called the literature data set. The chemical reactions in the two data sets were annotated with the experimentally validated percentage enantiomeric excess (% ee) as the enantioselectivity of the product. The HTE data set consisted of 1075 reactions of enantioenriched N,S-acetal formation catalyzed by chiral BINOL phosphoric acid derivatives. These reactions were the exhaustive combinations of five imines, five thiols, and 43 chiral phosphoric acid catalysts, as summarized in Figure .

Figure 1

HTE data set. The reaction data set was extracted from the publication by Zahrt et al.[22] This data set contained 1075 reactions of the exhaustive combinations from the imine group [five substituents for aryl (Ar)], the thiol group (five substituents for R1), and the chiral phosphoric acid catalyst group (43 substituents for R2). The literature data set consisted of 116 imine-involved enantioselective reactions manually extracted from seven research articles.[28−34] These reactions can be regarded as a general reaction form of conversion of an imine electrophile and a nucleophile with a catalyst to a product. These research articles satisfied the following criteria: the chemical structures of the substances and catalysts were present, multiple reaction types were present, and catalysts with axial chirality (R or S) or chiral centers were present. A collection of chemical reactions from different groups contains different scaffolds, making the % ee prediction task beyond the scope of analogous substances and catalysts. When nucleophiles or catalysts consist of different core substructures (nucleophiles and catalysts in Figure ), superimposition of the conformers for each reaction component becomes a nontrivial task, making the calculation of the conventional 3D reaction descriptors difficult. In this data set, the chiral phosphoric acid catalysts consisted of four core scaffolds, and the nucleophiles consisted of six core scaffolds. The electrophiles consisted of an imine core with various substituents (electrophiles in Figure ). Although all the reactions in the data set can be represented by a general reaction form, electrophile + nucleophile → product, the putative reaction mechanisms are different based on the nucleophile cores: catalytic asymmetric addition of an amide or an imide by nucleophilic attack of the nitrogen lone pair (37 reactions), the direct aza-Mannich reaction (66 reactions), and the Friedel–Crafts reaction (13 reactions). The electrophiles, nucleophiles, and catalysts of the reactions are shown in Figure , and the profile of the literature data set is given in Table .

Figure 2

Table 1

Literature Data Set Profile

			ΔΔG^‡ [kcal mol^–1]
literature ID	number of reactions	putative reaction mechanism	average	minimum	maximum
0[28]	20	Mannicha	0.92	0	1.51
1[29]	24	Mannicha	1.03	0.11	1.53
2[30]	14	Mannicha	0.68	0	1.23
3[31]	13	Additionb	1.65	0.06	3.13
4[32]	8	Mannicha	0.22	0.06	0.75
5[33]	13	Friedel–Craftsc	1.73	1.22	1.98
6[34]	24	Additionb	1.51	0	3.13

Direct aza-Mannich reaction.

Catalytic asymmetric addition of an amide (imide) by nucleophilic attack.

Friedel–Crafts reaction.

Literature data set. The reaction data set was extracted from seven publications.[28−34] This data set contained 116 reactions from combinations of electrophiles (31 substituents for the two R groups), nucleophiles (27 substituents for the six R groups), and catalysts (21 substituents for the four building blocks). Direct aza-Mannich reaction. Catalytic asymmetric addition of an amide (imide) by nucleophilic attack. Friedel–Crafts reaction. As the objective variable for enantioselectivity, the difference in the Gibbs free energy of activation between the R and S isomers was used, which can be converted from the % ee values by the equation ΔΔG‡ = −RT ln ,[35,36] where T is the reaction temperature and R is the gas constant (8.314 J mol–1 K–1). Although this metric uses an absolute enantiomeric excess and does not distinguish between the R and S isomers, this elimination, as well as temperature incorporation, is necessary for integrating the reaction data sets consisting of diverse chemical structures with different reaction temperatures from different sources. The distributions of ΔΔG‡ for both data sets are shown as histograms in Figure . The median ΔΔG‡ for the HTE data set was 1.05 kcal mol–1 and that for the literature data set was 1.10 kcal mol–1.

Figure 3

Enantioselectivity: ΔΔG‡ values of the reactions in the data sets. The histograms show the distributions of the objective variables for the (A) HTE data set and (B) literature data set.

Reaction Descriptors

The reaction descriptors were based on the chemical structures of the main components of the chemical reactions: the electrophiles, nucleophiles, catalysts, and products. For each component, a molecular descriptor vector was derived, and concatenation of the vectors formed a reaction descriptor for the reaction.[37] To calculate the 3D descriptors, the force-field-based energy-minimized conformation was used, which was identified by a conformation search using MOE software (version 2020.09, Chemical Computing Group Inc.) as follows.[38] An input chemical structure, represented as a SMILES string, was loaded into the software, and explicit hydrogen atoms were added to the structure, followed by manual curation of the structure, in particular, fixing the axial chirality when the structure was a catalyst. A conformation search was then performed by the command “Conformations: search” using the default settings with the Amber10 force field. The most stable conformation in terms of the total energy was selected. For each reaction component, the most stable conformation was used for the calculation of the 3D descriptors.

Molecular Descriptors

One of the aims of this study is to evaluate the reaction descriptors in terms of their predictive capability for the enantioselectivity of organophosphorus-catalyzed reactions. Various 2D and 3D molecular descriptors were used for fair comparison.

3D Descriptors

Dragon Descriptors

Dragon descriptors are the molecular descriptors implemented in Dragon 7 software (version 7.0) provided by Kode Chemoinformatics.[39] The descriptors in the software are a collection of descriptors proposed in the scientific literature. More than 5000 descriptors are available, which are categorized based on the descriptor type. Owing to the completeness and diversity of the descriptor types, Dragon descriptors have been frequently used for quantitative structure–activity relationship analyses.[40−42] In this study, 10 3D descriptor groups were considered (the number of descriptors is given in parentheses): geometrical descriptors (38), 3D matrix-based descriptors (99), 3D autocorrelations (80), RDF descriptors (210), 3D-MoRSE descriptors (224), WHIM descriptors (114), GETAWAY descriptors (273), Randic molecular profiles (41), 3D atom pairs (36), and CATS 3D descriptors (300). Estrada-like index (EE_M(w)) descriptors were not considered because they cannot be calculated for chiral phosphoric acid catalysts.

MOE Descriptors

The MOE descriptors were the descriptors calculated by MOE software (version 2020.09). The implemented descriptors mainly focus on the properties for molecular interaction in medicinal chemistry, such as the polar surface area and molecular shape. The calculated molecular properties (energies) at the semiempirical quantum mechanics level were also included. In this study, all the available 3D descriptors (138) were considered. The names and definitions of the descriptors are given in Table S1. For the chiral phosphoric acid catalysts, 18 out of the 138 descriptors could not be calculated (Table S2), leading to a 120-dimensional vector for a reaction component.

Average Steric Occupancy Descriptors

The average steric occupancy (ASO) descriptor of a molecule is a numerical vector in which each variable represents the average occupancy of a grid cell by an ensemble of conformers of the molecule.[22] Because the descriptor values are averaged over the feasible conformers, conformational flexibility can be taken into account, and thus, it is a 3.5-dimensional (3.5D) descriptor. To calculate the ASO descriptors, grid cells surrounding a molecule are first defined, and conformational alignment (superimposition) of analogous molecules is then performed. Although it is difficult to superimpose nonanalogous molecules inside the predefined grid cells in a consistent manner, ASO descriptors can give an interpretation of a prediction model when the modeling algorithm is linear regression. This is similar to molecular interaction-field-based descriptors, such as CoMFA combined with linear regression models (partial least-square regression).[43−45] In this study, ASO descriptors were only used for the HTE data set. The calculated ASO descriptor values for the HTE data set were downloaded from a website[46] as csv files provided by Zahrt et al.[22]

2D Descriptors

Extended Connectivity Fingerprints

Extended connectivity fingerprints (ECFPs) are atom-centric circular fingerprints. The atom environment for each atom becomes a feature component of an ECFP. Each feature generates a hash number, and a collection of the hash numbers forms an ECFP. In this study, the ECFP with a bond diameter of 6 (ECFP6) was used.[47] To avoid bit collision when making a fixed-sized bit vector (assigning different hash values to the same bit), a feature set generated by ECFP6 was first folded into a large bit vector by modulo operation (1,000,000 bits), followed by the operation of removing the bits consisting of zero. This bit reduction process was performed on a component basis: electrophiles, nucleophiles, catalysts, and products. The ECFP6 descriptors were calculated by in-house Python scripts with the aid of the OEChem toolkit.[48]

Mol2vec Descriptors

The Mol2vec descriptor is the descriptor generated by a Mol2vec model. This descriptor is a fixed-sized numerical vector obtained by summing the feature vectors of the atom environments. An atom environment is represented by an ECFP hash value, and the Mol2vec model is trained to find vector embeddings for the hash values taking the similarity among the different atom environments into account. For a molecule, all the feature vectors of the atom environments are summed, which becomes a Mol2vec descriptor. The Mol2vec model is usually trained on a large compound data set to learn the unbiased similarity among the atom environments. To generate the Mol2vec descriptors, the ChEMBL24 database (1,646,866 compounds),[49] part of the Reaxys database (232,104 compounds),[50] and the HTE data set (78 compounds) were prepared to train a Skipgram-based Mol2vec model.[51] To train the Mol2vec model, a hash value of an ECFP atom environment formed a word, and it was assigned to its central atom. The aligned hash values on the canonical SMILES string order formed a molecular sentence. A collection of the sentences was used to train the model, and the word (hash value) embedding was extracted from a hidden layer vector of the model. All the feature vectors in the molecule were summed to form the final Mol2vec descriptor.

Machine Learning

ML models can estimate ΔΔG‡ of a reaction by inputting a value of the chemical reaction descriptor. As modeling algorithms, extreme gradient boosting (XGBoost) and support vector regression (SVR) were used. XGBoost models were constructed with the XGBoost libraries (version 1.4.0),[52] and SVR models were constructed with the scikit-learn (version 0.24.2) libraries. For SVR, the hyperparameters C and ε were set to 10.0 and 0.1, respectively, which were the settings in scikit-learn.[53,54] Two kernel functions were also tested: rbf and polynomials with degrees of 2 and 3. To interpret the outcome of a SVR model, the Tanimoto kernel (vide infra) was used in combination with the ECFP descriptor.[55]

Evaluation Metrics

The ML model performance was evaluated by the mean absolute error (MAE)where n is the number of data, ŷ is the predicted value for the ith sample, and y is the measured value. The MAE is a more rigorous metric for a data set containing outliers than the root-mean-squared error, and thus, it was used in this study.

Interpretation of the SVR Prediction Outcome Based on the Fragment Contributions

Interpretation of the prediction outcome generated by a ML model is important for practical applications. This gives more information than a mere prediction of the enantioselectivity (ΔΔG‡). Because SVR (support vector machine) models combined with the Tanimoto kernel generally achieve high prediction accuracy in activity prediction, only SVR with the Tanimoto kernel was considered in this study. The prediction outcome by a SVR model with the Tanimoto kernel function can be decomposed into the contributions of the descriptors.[56] The outcome of a SVR model is as followswhere K is the kernel function, α and α* are Lagrange multipliers on the dual problem of the objective function, and i is an index of the support vectors (SVs). f() is a linear combination of the kernel functions on the SVs and (weighted on α or α*, αα* = 0). The Tanimoto kernel iswhere d is a dimension of the D dimensional fingerprint. Inserting the Tanimoto kernel equation into the SVR prediction function gives For the d dimension contribution Therefore, the outcome of the SVR model can be decomposed into individual descriptor contributions without approximation. Because each bit of an ECFP vector represents an atom environment, the normalized summation of the contributions matching the atom becomes the total contribution of the atom. Visualization of the atom contributions to the molecular structure was performed by the procedure proposed by Bajorath et al.[56,57]

Results and Discussion

Study Design

HTE Data Set

Using the HTE data set, the MAE values in ΔΔG‡ prediction by several ML models were compared for several reaction representations: Dragon, MOE, Mol2vec, ASO, and ECFP6. Four data-splitting strategies were tested to understand the prediction models from different perspectives: random splitting, 80% ee splitting, silyl splitting, and out-of-sample splitting. For random splitting, the training reactions were randomly selected from the HTE data set. The number of training reactions varied (10, 20, ···, 100 and 100, 200, ···, 1000), and the rest of the reactions became the test reactions. For each number of training reactions, the prediction trial was repeated 100 times by different random splitting to assure statistical validity. For 80% ee splitting, the reactions with less than 80% ee were added to a training pool, and the reactions with greater than or equal to 80% ee were added to a test pool, as in a previous study.[22] The training pool consisted of 718 reactions with low % ee values, and the test pool consisted of 357 reactions with high % ee values. For this splitting, each training data set was constructed by randomly sampling a predefined number of reactions from the training pool. These numbers varied (10, 20, ···, 100 and 100, 200, ···, 700). For silyl splitting, the reactions with catalysts containing a silyl group were the test data set, and the reactions with catalysts not containing a silyl group were the training pool. The test-data-set size was 100 (four catalysts), and the training-pool size was 975 (39 catalysts). A predefined number of reactions was sampled from the training pool for each training data set. The number of training reactions varied (10, 20, ···, 100 and 100, 200, ···, 900). For the out-of-sample splitting, a training data set and three test data sets were prepared in the same way as in previous studies.[22,58] The training data were 384 reactions which were the combinations of 16 substrates and 24 catalysts. The remaining 691 reactions, which were not overlapped in both substrates and catalysts with the training data set, were further classified into three test sets. The first data set consisted of reactions with the same substrates but different in catalysts (termed test sub), the second with the same catalysts but different in substrates (test cat), and the last with both different substrates and catalysts (test sub-cat). The number of training reactions varied (10, 20, ···, 100, 200, 300, and 384). After evaluating the predictive capability of the ML models with various reaction representations and modeling methods, interpretation of the prediction outcome was performed to identify the chemical structural features responsible for high (low) % ee as examples of several reactions in a test data set. This was performed to understand the model, as well as to gain insight into the reactions for further optimization of the components of the reactions. For interpretation, a SVR model was trained on 100 reactions obtained by random splitting, which showed sufficient predictive capability. The model was selected from a set of 100 models trained on the same number of reactions based on the closeness to the average MAE over the 100 trials for fair comparison.

Literature Data Set

The predictive capability of ΔΔG‡ prediction models was investigated in terms of the training-data-set size, reaction representations, and ML algorithms in the same way as in random splitting using the HTE data set. The number of training reactions was varied from 10 to 100, and the MAE values for the test reactions that were not included in the training data set were calculated. For each training-data-set size, 100 trials were performed. Next, we investigated how much of the reaction data with different putative reaction mechanisms contributed to the improvement of the predictive ability of the ML models. For this purpose, reactions in each paper formed a test data set, and four training data sets were prepared for each test data set: all, the same reaction mechanism, in the cluster (cluster), and outside the cluster (outside the cluster). The “all” data set consisted of the rest of the literature data set: the remaining reactions after the test data set was extracted. The same reaction mechanism data set consisted of the reactions of the same putative reaction mechanism (Table ). The cluster data set consisted of the reactions belonging to the same cluster in a reaction descriptor space. The outside cluster data set consisted of the reactions belonging to the different clusters in the reaction descriptor space. The ECFP6-based reaction representation was used for this purpose, and the number of clusters was determined to be three based on an ELBOW plot (Figure S8). These benchmark calculations were investigated to reveal a reasonable way to integrate data sets from different sources, that is, mechanism-based or reaction similarity-based.

Prediction Capability of the ML Models for the HTE Data Set

General Performance

The five descriptors listed in the Materials and Methods section, Dragon (5659 dimensions), MOE (534 dimensions), Mol2vec (1600 dimensions), ASO (8548 dimensions), and ECFP6 (839 dimensions), were evaluated using the data sets obtained by random splitting with, partial least-squares regression, the decision tree, SVR, and XGBoost. Overall, XGBoost showed the best predictive capability in terms of the MAE (Tables S3–S7). Therefore, further discussion is limited to XGBoost with the five reaction representations. A recently proposed neural network-based fingerprint, bidirectional encoder representations from the transformer (BERT)-based reaction fingerprint (rxnfp),[59] was also evaluated for predictive capability for this data set using the default setting for fingerprint generation. However, the performance was worse than other representations (Table S8); thus, we did not consider this fingerprint for further analysis. The MAE values for the test data sets plotted against the training-data-set size (up to 300) are shown in Figure . Overall, the MAE values decreased as the training-data-set size increased. The differences in the MAE values among the representations became smaller as the number of training reactions increased. The average MAE range of the five representations for a training-data-set size of 100 was 0.0630 kcal mol–1 and that for a training-data-set size of 300 (1000) was 0.00911 kcal mol–1 (0.00164 kcal mol–1). The order of the predictive capability for the different representations was consistent for different training-data-set sizes. Ranking the representations by the MAE, ECFP6 > ASO > Dragon (Mol2vec) > MOE (paired t-test at a significance level of 0.05). The results for training-data-set sizes larger than 300 were not significantly different in the MAE values among the different representations (Figure S1). Notably, the ECFP6 representation showed acceptable accuracy even for a training-data-set size of 100. The ECFP representation is atom-centered structural features extracted on chemical graphs (2D representations). This simple feature was better than more complicated 3D or 3.5D representations in terms of the MAE.

Figure 4

Model performance against the data-set size. Prediction accuracy against the training-data-set size for the HTE data set by XGBoost. The average of the MAEs for the test data sets is plotted against the training-data-set size. For each size, the training data sets were randomly sampled from a total of 1075 reactions, and the rest of the reactions formed the test data set. An XGBoost regression model was constructed using a training data set, and the MAE was measured for the test data set. This procedure was repeated 100 times. The error bars correspond to the standard deviations of the MAEs. The descriptors for model construction were Dragon (3D), MOE (3D), ASO (3.5D), Mol2vec (2D), and ECFP6 (2D).

Predictive Capability for Extrapolated Reactions

For the 80% ee splitting data sets, the MAE values for the test data sets for training-data-set sizes of up to 300 are shown in Figure A and those for training-data-set sizes of larger than 300 are shown in Figure S2. The baseline of the MAE values was higher than that for the random splitting data sets (Figure ). ECFP6 was the best representation, and it was statistically significantly better than the second-best representation ASO (paired t-test, significance level of 0.05). The order of the other three descriptors was not consistent as the training-data-set size varied.

Figure 5

Extrapolated prediction capability for the HTE data set. The average of the MAEs for the test data sets is plotted against the training-data-set size for the (A) 80% ee splitting and (B) silyl splitting data sets. For the silyl splitting data set, the performance using the ECFP6 representation was the best irrespective of the training-data-set size (Figures B and S3). For the ECFP6 representation, the trend of the MAE value against the training-data-set size was different from those of the other representations: the MAE value decreased as the training-data-set size increased, while the other representations showed the opposite trend. An ECFP is a collection of bits, and each bit corresponds to an atom-centric fragment (atom environment). The environments of the atoms of the silyl group were ignored in the prediction models using the ECFP6 representation. The superiority of the ECFP representation can be explained by the presence of silyl groups at the 3 and 3′ positions of BINOL-derived catalysts in this reaction might not be important for the enantioselectivity. Although the ECFP approach of not using the atom information of the silyl group seems to be reasonable as it does not add any uncertainty or speculation to the model, the performance might change if the silyl groups have a significant effect on the enantioselectivity. This finding is supported by the Dragon and MOE descriptors showing poorer performance as the number of training data increased. These models performed better for random splitting as the training-data-set size increased (Figure ). In other words, the models constructed with more training reactions predicted ΔΔG‡ better for a test data set exhibiting the same distribution as the training data set. Both descriptors distinguish atom types in various ways and reflect the differences in the descriptor values (e.g., the molecular weight). The finely tuned models might have an adverse effect on ΔΔG‡ prediction when encountering outlier values in the independent variables. Prediction capability for extrapolated reactions was further investigated using the previously reported data sets by the out-of-sample splitting,[22,58] which was a data-splitting strategy for combinatorial reactions. Table shows the MAE values of prediction models for the three test data sets. As a control, one-hot encoding representation was also tested. Overall, the ECFP6 representation consistently performed well among the tested representations including previously reported ones [ASO and multiple fingerprint features (MFF)]. It should be noted that in Table MFF and ASO in the previous studies used different modeling methods from XGBoost used in this study.

Table 2

ΔΔG‡ Prediction Performance for the Out-of-Sample Data Setsa

descriptor	test sub (216)	test cat (304)	test sub-cat (171)
Dragon	0.141	0.419	0.471
MOE	0.150	0.480	0.545
ASO	0.144 (0.161b)	0.285 (0.211b)	0.311 (0.238b)
Mol2vec	0.131	0.344	0.404
ECFP6	0.135	0.215	0.212
One-hot	0.253 (0.178d)	0.575 (0.447d)	0.444 (0.507d)
MFFc	0.137d	0.254d	0.262d

MAE [kcal mol–1].

Reported performances in a previous study.[22]

Multiple fingerprint features.

Reported performance in a previous study.[37]

MAE [kcal mol–1]. Reported performances in a previous study.[22] Multiple fingerprint features. Reported performance in a previous study.[37] Like the silyl splitting data set, the performance using the ECFP6 representation was the best irrespective of the training-data-set size as shown in the MAE values for test data sets for training-data-set sizes of up to 300 in Figure S4. Furthermore, differences in MAE values among descriptors were confirmed by the Y–Y plots (Figures S5–S7). The Y–Y plots for the Dragon and MOE showed a similar trend to the one-hot encoding, implying that these representations did not contain meaningful reaction information for the prediction. The ASO and Mol2vec descriptors failed to predict high ΔΔG‡ values, suggesting that they had a narrow adaptive range. On the other hand, ECFP6 could predict high ΔΔG‡ values better than the other representations (Figures S5–S7). From this analysis focusing on the extrapolation ability of chemical reaction representations, an atom environmental-based chemical structural fingerprint: ECFP6 showed better performance than other representations irrespective of sizes of training data sets.

Interpretation of the Model Outcomes for the HTE Data Set

The ECFP6 representation showed the best predictive ability among the five representations for all of the splitting strategies using the HTE data set. The average MAE value of the 100 prediction trials using 100 training reactions was 0.199 kcal mol–1 by the XGBoost models and 0.233 kcal mol–1 by the SVR model. The performance of the SVR models was slightly poorer than that of the XGBoost models, but they were still comparable. The SVR model for model interpretation was chosen from the 100 constructed models based on the closeness to the average MAE value (0.233 kcal mol–1). Visualization of the example reaction outcomes by the SVR model is shown in Figure , in which the atoms and bonds on the components of the chemical reactions are highlighted. The only difference between ID1 and ID2 was the substituted benzene rings at the 3, 3′ positions of the chiral phosphoric acid catalyst. ID1 showed higher enantioselectivity than ID2. The difference in the chemical structures clearly highlights the effect of 2,4,6-trichlorobenzene. The two reactions ID3 and ID4 (Figure ) differed only by the direct attachment of methylene groups to the 3, 3′ positions of the chiral phosphoric acid catalyst. The enantioselectivity decreased by 0.9 in ΔΔG‡ when inserting methylene groups at the 3, 3′ positions. The atom contributions clearly highlighted this effect. This led to the interpretation that the rigidness of the catalyst conformation may be involved in the enantioselectivity, although ECFP is based on chemical structures (2D). The two reactions ID5 and ID6 differed in the nucleophile: an aromatic and an aliphatic thiol, respectively. In the HTE data set, aromatic thiols generally gave higher enantioselectivity than aliphatic thiols. A difference between these two functional groups is the degree of nucleophilicity: aromatic thiols are less nucleophilic than the corresponding aliphatic thiols. For aromatic thiols to obtain comparative enantioselectivity to aliphatic thiols, 5 mol % of the catalyst is required instead of 2 mol %.[60] This led to the speculation that chemical reactions with aliphatic thiols were completed before the catalyst was actually involved, and thus, they lose enantioselectivity. In this manner, highlighting the contribution of reaction components’ substructure to the prediction outcome gave the opportunity of interpreting the reaction based on the chemical knowledge that chemists possess. To propose a hypothesis, the ECFP representation with SVR is powerful as a first step to analyze the reaction data set.

Figure 6

Model interpretation for the HTE data set. For six example reactions, the contributions of the fragments to the SVM prediction outcome are visualized with colors. The positive contributions are highlighted in orange, and the negative contributions are highlighted in blue. The degree of nontransparency reflects the intensity of the contribution. The reaction IDs are given in the second column from the left. The rightmost column shows the prediction outcomes.

Analysis of the Literature Data Set for Data Fusion

Three descriptors, Dragon (5659 dimensions), MOE (534 dimensions), and ECFP6 (1731 dimensions), were evaluated using the randomly split data sets with the XGBoost regression model. Mol2vec was not tested owing to its poor performance for the HTE data set. The MAE values for the test data sets are plotted against the size of the training data set (up to 100) in Figure . Overall, the MAE values decreased with increasing training-data-set size. The differences between the MAE values among the representations became small as the number of training reactions increased. The MAE range for a training-data-set size of 100 was 0.0398 kcal mol–1. The order of the predictive capability for the different representations was consistent regardless of the training-data-set size. The ECFP6 representation performed the best among the selected reaction representations, which was the same trend as for the HTE data set.

Figure 7

Prediction capability for the literature data set. The average of the MAEs for the test data sets is plotted against the training-data-set size.

Prediction capability for the literature data set. The average of the MAEs for the test data sets is plotted against the training-data-set size. Y–Y plots of the three representations for 70 training reactions are shown in Figure . Overall, the ECFP6 representation performed better (R2 = 0.69, MAE = 0.275 kcal mol–1) than the Dragon (R2 = 0.57, MAE = 0.306 kcal mol–1) and MOE (R2 = 0.63, MAE = 0.277 kcal mol–1) representations. Using Dragon and MOE, the enantioselectivities for some of the test reactions were reasonably well predicted, while the enantioselectivities for the other test reactions were poorly predicted (Figure ). In contrast, using the ECFP6 representation did not produce such outliers for the test data set. Thus, even using a mixed data set from different sources, the ECFP6 representation was found to be preferable to the Dragon and MOE representations.

Figure 8

Y–Y plots for the literature data set using the XGBoost models. The predicted ΔΔG‡ values are plotted against the observed values. The training reactions are in blue, and the test reactions are in red. ECFP6, Dragon, and MOE were tested as reaction representations. Like the HTE data set, model interpretation of a SVR model with the Tanimoto kernel was performed using the ECFP6 representation. The predictive ability of the SVR models was comparable with that of the XGBoost models: the average MAE of ΔΔG‡ prediction for the 100 SVR models trained on 70 randomly selected reactions was 0.275 kcal mol–1, and it was 0.268 kcal mol–1 for the XGBoost model. The selected SVR model for model interpretation showed the closest predictive ability to the average MAE value (0.275 kcal mol–1). Two example reactions along with the fragment contributions to the model outcomes are shown in Figure . ID7 was a Mannich-type reaction from literature 0,[28] while ID8 was also a Mannich-type reaction catalyzed by α,α,α,α-tetraaryl-1,3-dioxolane-4,5- dimethanols (TADDOL)-derived phosphoric acid from the literature.[30] The hydroxyl group on the phenyl ring next to the imine nitrogen has been reported to be important for high enantioselectivity in this type of Mannich reaction.[28,30,61] This information can be derived from the highlighted fragments by the SVR feature contribution method. Furthermore, for ID8, the trifluoro groups at the 4-positions of the phenyl rings in the catalyst were also reported to be important for the selectivity.[30] Small positive contributions were given to the substructures (ID8). Extracting the fragment contributions to the model outcome and their visualization were found to be important by retrospective validation using the two data sets.

Figure 9

Model outcome interpretation. For two example reactions, the fragment contributions to the SVM prediction outcome are visualized on the chemical structures. The color scheme is the same as in Figure .

Data Fusion of Different Reaction Mechanisms and Sources

Visualization of the reaction space for the literature data set is provided in Figure using three reaction representations: ECFP6, Dragon, and MOE. On the three plots, three clusters are represented in different colors, which were determined by k-means clustering in the ECFP6-based reaction descriptor space. The number of clusters was determined based on the ELBOW plots. Although the reactions were obtained from seven different research papers, similar reactions existed based on the ECFP6 reaction descriptors, resulting in the three clusters in the reaction space.

Figure 10

Visualization of the literature data set on principal component analysis (PCA) maps. The dots represent chemical reactions, and the colors represent the three clusters classified by the k-means. Three reaction representations were tested: ECFP6 (top), Dragon (middle), and MOE (bottom). Each literature sub-data set, identified by the literature ID, became a test data set in turn. Four strategies of training-data-set compilation for the XGBoost models were tested: all, the same reaction mechanism, in the cluster, and outside the cluster, as explained in the Study Design section. The predictive capability in terms of the MAE is summarized in Table . Overall, the same reaction mechanism and outside the cluster showed lower prediction capability than all and in the cluster. Outside the cluster was a type of negative control. Collecting and using putative reaction data for model construction did not improve the performance. However, using similar reaction data (i.e., in the cluster) showed notable performance improvement for ID3 and ID4. The reactions in ID4 were Mannich type with an acetylacetone as the nucleophile. The other reactions in the same cluster (clusters ID0, ID1, and ID2) were also Mannich type but with a silyl enol ether nucleophile. The same core scaffolds were not necessary for performance improvement. This also indicated that reaction similarity in terms of the ECFP6 representation was important to collect and identify the training data. Prediction for the ID5 data set was not good by any of the strategies. This might be due to the absence of reactions with similar mechanisms: the reactions in ID5 were Friedel–Crafts-type reactions, also due to the descriptor dissimilarity indicated by the Dragon and MOE PCA maps (Figure S9).

Table 3

Predictive Capability for the Data-Fusion Strategiesa

		training data set
literature ID as test data	cluster ID	same reaction mechanismb	allc	in the clusterd	outside the clustere
0[28]	0	0.460	0.362	0.457	0.385
1[29]	1	0.573	0.440	-f	(0.440)
2[30]	0	0.432	0.400	0.411	0.445
3[31]	2	0.842	0.378	0.312	0.792
4[32]	2	0.949	0.500	0.171	0.925
5[33]	2	-f	0.776	0.731	0.700
6[34]	2	0.512	0.505	0.539	0.857

The MAE values [kcal mol–1] for the literature data set by data-fusion strategies are shown. The best performance is highlighted in bold.

Only reactions classified under the same putative reaction mechanism are shown in Table .

All of the reactions in the literature data set.

Reactions in the same cluster in the ECFP6-based reaction descriptor space.

Reactions not belonging to the cluster where the test reactions belonged.

No training data.

The MAE values [kcal mol–1] for the literature data set by data-fusion strategies are shown. The best performance is highlighted in bold. Only reactions classified under the same putative reaction mechanism are shown in Table . All of the reactions in the literature data set. Reactions in the same cluster in the ECFP6-based reaction descriptor space. Reactions not belonging to the cluster where the test reactions belonged. No training data.

Conclusions

A retrospective study of predictive ML models for phosphoric-acid-catalyzed asymmetric reactions has been performed. From analysis using the HTE data set (enantioenriched N,S-acetal formation catalyzed by chiral BINOL phosphoric acid derivatives), the ECFP6 representation was found to be the best representation among the Dragon, MOE, Mol2vec, ASO, and ECFP6 representations. The ECFP6 representation combined with SVR and the Tanimoto kernel function was able to give reasonable interpretations of the model outcomes for several example test reactions, which would lead to further catalyst development. From analysis using a small-size literature data set consisting of different reaction mechanisms from different sources, ECFP6 was also found to be the best representation. This analysis also revealed that reaction-mechanism-oriented data-fusion approaches were not as good as simple similarity-based approaches in the ECFP6 reaction descriptor space.

40 in total

Review 1. Asymmetric catalysis: an enabling science.

Authors: Barry M Trost
Journal: Proc Natl Acad Sci U S A Date: 2004-02-27 Impact factor: 11.205

2. Conformational analysis of rhazinilam and three-dimensional quantitative structure-activity relationships of rhazinilam analogues.

Authors: Hiroshi Morita; Khalijah Awang; A Hamid A Hadi; Koichi Takeya; Hideji Itokawa; Jun'ichi Kobayashi
Journal: Bioorg Med Chem Lett Date: 2005-02-15 Impact factor: 2.823

3. Do the terms "% ee" and "% de" make sense as expressions of stereoisomer composition or stereoselectivity?

Authors: Robert E Gawley
Journal: J Org Chem Date: 2006-03-17 Impact factor: 4.354

4. Coloring Molecules with Explainable Artificial Intelligence for Preclinical Relevance Assessment.

Authors: José Jiménez-Luna; Miha Skalic; Nils Weskamp; Gisbert Schneider
Journal: J Chem Inf Model Date: 2021-02-25 Impact factor: 4.956

5. The Development of Multidimensional Analysis Tools for Asymmetric Catalysis and Beyond.

Authors: Matthew S Sigman; Kaid C Harper; Elizabeth N Bess; Anat Milo
Journal: Acc Chem Res Date: 2016-05-24 Impact factor: 22.384

6. VAPOL phosphoric acid catalysis: the highly enantioselective addition of imides to imines.

Authors: Yuxue Liang; Emily B Rowland; Gerald B Rowland; Jason A Perman; Jon C Antilla
Journal: Chem Commun (Camb) Date: 2007-08-21 Impact factor: 6.222

Review 7. Quantitative Structure-Selectivity Relationships in Enantioselective Catalysis: Past, Present, and Future.

Authors: Andrew F Zahrt; Soumitra V Athavale; Scott E Denmark
Journal: Chem Rev Date: 2019-12-30 Impact factor: 60.622

8. Computational studies of chiral catalysts: a comparative molecular field analysis of an asymmetric Diels-Alder reaction with catalysts containing bisoxazoline or phosphinooxazoline ligands.

Authors: Kenny B Lipkowitz; Meeta Pradhan
Journal: J Org Chem Date: 2003-06-13 Impact factor: 4.354

9. Prediction of Reaction Yield for Buchwald-Hartwig Cross-coupling Reactions Using Deep Learning.

Authors: Akinori Sato; Tomoyuki Miyao; Kimito Funatsu
Journal: Mol Inform Date: 2021-09-29 Impact factor: 3.353

10. ChEMBL: towards direct deposition of bioassay data.

Authors: David Mendez; Anna Gaulton; A Patrícia Bento; Jon Chambers; Marleen De Veij; Eloy Félix; María Paula Magariños; Juan F Mosquera; Prudence Mutowo; Michal Nowotka; María Gordillo-Marañón; Fiona Hunter; Laura Junco; Grace Mugumbate; Milagros Rodriguez-Lopez; Francis Atkinson; Nicolas Bosc; Chris J Radoux; Aldo Segura-Cabrera; Anne Hersey; Andrew R Leach
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971