Holistic prediction of enantioselectivity in asymmetric catalysis.

Abstract

When faced with unfamiliar reaction space, synthetic chemists typically apply the reported conditions (reagents, catalyst, solvent and additives) of a successful reaction to a desired, closely related reaction using a new substrate type. Unfortunately, this approach often fails owing to subtle differences in reaction requirements. Consequently, an important goal in synthetic chemistry is the ability to transfer chemical observations quantitatively from one reaction to another. Here we present a holistic, data-driven workflow for deriving statistical models of one set of reactions that can be used to predict out-of-sample reactions. As a validating case study, we combined published enantioselectivity datasets that employ 1,1'-bi-2-naphthol (BINOL)-derived chiral phosphoric acids for a range of nucleophilic addition reactions to imines and developed statistical models. These models reveal the general interactions that impart asymmetric induction and allow the quantitative transfer of this information to new reaction components. This technique creates opportunities for translating comprehensive reaction analysis to diverse chemical space, streamlining both catalyst and reaction development.

Year: 2019 PMID: 31316193 PMCID: PMC6641578 DOI: 10.1038/s41586-019-1384-z

Source DB: PubMed Journal: Nature ISSN: 0028-0836 Impact factor: 49.962

The efficacy of a catalytic process is dictated by the possible transition states (TS), which feature core non-covalent interactions that determine their geometries and energies.[1-2] Such interactions are often difficult to identify and define since they are energetically weak and sensitive to the molecular properties of every reaction component (catalyst, substrate(s), reagents, solvent, etc.).[3-4] This overarching issue in reaction optimization is often exacerbated by subtle connections across several reaction variables, wherein modest structural changes to one or a few of these can have a profound effect on the experimental outcome.[5-6,7] These factors, combined with the number of dimensions under study in most reactions, are the underlying reasons why optimization is decidedly empirical.[8-9] This situation is particularly common in the area of asymmetric catalysis, wherein seemingly minor structural variations to any reaction component can have acute and non-intuitive influences on the observed enantioselectivity.[10] However, it is possible that such mechanistic outliers may be concealed within larger data sets, as our pattern recognition skills do not perceive pivotal generalities when reaction situations change. On this basis, we hypothesized that connecting common mechanistic features through the simultaneous interrogation of all reaction components would provide a holistic view of the key non-covalent interactions responsible for reaction performance. This would enable the transfer of experimental observations to genuinely different substrate combinations with unique catalysts. Herein, we develop and deploy a workflow to parameterize all reaction variables of >350 distinct reaction combinations, which allows development of comprehensive correlations with the ultimate ability to predict reaction performance for entirely different structural motifs.
The workflow includes techniques to probe general mechanistic principles, which provide the basis for transfer learning or generalized identification of the key interactions imparting asymmetric induction. Asymmetric catalysis is replete with examples of catalysts that can promote disparate reactions through a common mode of activation.[11,12-13,14] However, when one surveys “similar reactions”, many changes to the precise reaction conditions are often required to obtain the desired reaction performance.[15-16] In many cases, these changes can be subtle (e.g., one aromatic solvent for another) or more profound (one catalyst class for another). This leads to three questions: 1) Is mechanistic insight truly transferable to a new reaction in the same subclass, considering that a standard mechanistic paradigm may exist with a general mode of activation? 2) If so, how can a workflow be used in a quantitative manner for diverse and multiple reaction profiles? 3) If achievable, can the observations of one or more reactions be deployed to predict the performance of another? Such analysis strategies could provide mechanistic understanding of why certain conditions are effective for a general reaction type, and the ability to quantitatively transfer this information to out-of-sample predictions, streamlining reaction optimization.[17-18] To vet a specific workflow required to probe the questions posed above, it would be pragmatic to compare transformations within a reaction class facilitated by a single catalyst chemotype. Although multifarious reports of the same catalyst class for different transformations exist in enantioselective catalysis, comparative studies, even in a qualitative manner, have been sparse. Such an interrogation would be challenging as a consequence of incomplete datasets generated under non-uniform conditions and the need to develop readily comprehensible descriptors for each varying reaction component.
To address this correlation challenge, we envisioned a strategy for the interrogation of enantioselective catalysis involving the application of modern data-analysis methods and advanced parameter sets. In this approach, integrated descriptor sets (derived from Quantitative Structure Activity Relationships (QSAR), Molecular Mechanics (MM), and Density Functional Theory (DFT))[19] are related to a relatively large library of outputs collected from a general reaction and catalyst type, which are data-mined from multiple literature sources. By combining the appropriate data-organization and trend analysis techniques, general relationships between reactions can be established. The statistical model's ability to predict the performance of a new reaction type is used as a validation of mechanistic transferability (Fig. 1).

Figure 1 |

Workflow for interrogating and applying mechanistic transferability.

(A) BINOL-based phosphoric acid catalyzed nucleophilic additions to imines as a general reaction for workflow development. (B) Streamline reaction performance predictions by employing a mechanistic transferability strategy implemented through correlation of all reaction variables to enantioselectivity. General correlations can be built to reveal the interactions between any reaction component in the relevant TS and enantioselectivity. The mechanistic principles leading to enantioselective catalysis captured by the statistical models can be transferred to genuinely different structural motifs not contained in the training dataset.

Reaction Platform Selection:

As a proof-of-concept reaction class, the addition of various nucleophiles to imines was identified owing to the ubiquity of this type of transformation in asymmetric catalysis.[20-21] The commonality of this reaction is due to both the simplicity of accessing imine starting materials and the broad applicability of the resulting amines in both synthetic and biosynthetic settings.[22-23] The second criterion in reaction selection is a catalyst chemotype that has been used widely enough for these processes that significant data exist, providing diversity in data range and structure from published sources. Considering these constraints, we selected the field of chiral phosphoric acid catalysis, in particular the addition of protic nucleophiles to imines catalyzed by chiral 1,1’-bi-2-naphthol (BINOL)-derived phosphoric acids bearing aromatic groups at the 3 and 3’ positions (Fig. 1).[24] To initiate this workflow, an expanded inventory of 367 reactions with varied components was curated from multiple reports (for list of references see SI). From this survey, we categorized the dataset by imine TS geometry (E or Z), wherein reactions proceeding via an E-imine TS are assigned a positive ee value and those proceeding via a Z-imine TS a negative ee value. Imine stereochemistry was determined by the enantiomer of the product formed if the imine was derived from an aldehyde. However, if ketimines (imines derived from ketones) were employed, substituent size must also be considered when the smaller C-substituent has higher Cahn-Ingold-Prelog (CIP) priority.[25-26] For the reactions under study, this only affects ketimines that have either a trifluoromethyl or an ester C-substituent, which are considered to have lower priority for the purpose of assigning E or Z-TS. This is important in understanding product enantioselectivities, because nucleophile addition to the same face will yield opposite enantiomers for E and Z configurations.
Therefore, developed models will not be capable of predicting product stereochemistry but can be deployed to predict whether a reaction will proceed via an E or Z type mechanism and this information can be used to determine absolute configurations. Simultaneously, a diverse array of molecular descriptor values was collected from DFT optimized geometries to describe the structural features of each imine, nucleophile, catalyst, and solvent. Unfortunately, the lack of structural commonality for particular molecular subsets creates a challenge in identifying readily comprehensible and extensive parameter sets for each component. For example, on comparing substrates and catalyst structures, it is apparent that they have overlapping and distinctive features likely required for determining selectivity patterns (Extended Data Fig. 1). In contrast, the solvents do not have common substructures, yet are critical for enantioselectivity. To address this limitation two approaches were explored: (1) parameters derived from DFT calculations were collected, which have proven well-equipped to describe molecules containing common structural features including Sterimol parameters, bond lengths, angle measurements, molecular vibrations and intensities, Natural Bond Orbitals (NBO) charges, polarizabilities, Highest Occupied Molecular Orbital (HOMO) and Lowest Unoccupied Molecular Orbitals (LUMO) energies.[27-28] This array was collected for the reaction partners and the catalysts. (2) 2D descriptors (e.g., topological and connectivity as exemplified by molecular shape, size and number of heteroatoms) are used as this is a traditional method to assess structurally disparate molecules such as solvents.[29-30] Other reaction variables, such as concentration of reagents/catalysts and inclusion of molecular sieves, were also included as categorical descriptors (see SI).

Extended Data Figure 1 |

Reaction component comparison.

Parameterization challenges for the identification of numerical descriptors in reaction dimension, demonstrated using two reactions representing the extremes of multidimensional feature space. DCM, dichloromethane; MS, molecular sieves; ee, enantioselectivity.

Comprehensive Model Development:

Linear regression algorithms (see SI) were then applied to the entire dataset (367 reactions) to identify correlations between the molecular structure of every reaction variable, defined by the parameters collected in the previous step of the workflow, and the experimentally determined enantioselectivity. ΔΔG (where ΔΔG = −RT ln(e.r.) and T is the temperature at which the reaction was performed) was regressed to an equation to reveal a surprisingly good correlation despite the significant structural variance included in the training set. Both cross-validation analysis (leave-one-out (LOO) and k-fold) and external validation, in which the dataset is partitioned pseudorandomly into 50:50 training:validation sets, suggest a relatively robust model (see SI). The model emphasizes solvent (black), imine (blue), nucleophile (green) and catalyst (red) terms distributed over six parameters as contributors to the enantioselectivity across these seventeen reaction types (Fig. 2A). A slope approaching unity and an intercept approaching zero over the training set indicate an accurate and predictive model, with an R² value of 0.88 demonstrating a high degree of precision. The largest coefficients in this normalized model belong to the imine NBO descriptors, indicating the significant role of the imine substrate in the quantification of enantioselectivity, as highlighted by the formation of both enantiomeric products, a consequence of active E and Z configurations (vide infra). A comparison of two Strecker reactions performed under uniform conditions gives values ranging from +99% ee for the enantiomer that proceeds through the E-imine TS to −80% ee for the Z. Remarkably, this represents a 3.5 kcal/mol range based solely on imine structure.
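The conversion between a signed enantiomeric excess and the regressed quantity ΔΔG = −RT ln(e.r.) can be illustrated with a minimal Python sketch. The function names, the definition e.r. = (100 + ee)/(100 − ee) for a signed % ee, and the choice of kcal/mol units are assumptions made here for illustration; the source states only the ΔΔG relation itself.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ee_to_ddg(ee_percent, temp_k):
    """Convert a signed % ee into ddG (kcal/mol) using ddG = -RT ln(e.r.)."""
    if not -100.0 < ee_percent < 100.0:
        raise ValueError("ee must lie strictly between -100 and 100")
    # Assumed definition: e.r. = (100 + ee) / (100 - ee) for a signed ee
    er = (100.0 + ee_percent) / (100.0 - ee_percent)
    return -R_KCAL * temp_k * math.log(er)


def ddg_to_ee(ddg, temp_k):
    """Invert ee_to_ddg: recover the signed % ee from ddG (kcal/mol)."""
    er = math.exp(-ddg / (R_KCAL * temp_k))
    return 100.0 * (er - 1.0) / (er + 1.0)


if __name__ == "__main__":
    # Illustrative temperature only; the source uses each reaction's own T
    for ee in (99.0, -80.0):
        print(ee, round(ee_to_ddg(ee, 298.15), 2))
```

Because T enters the relation directly, the same ee corresponds to different ΔΔG values at different temperatures, which is why each reaction's own temperature is needed when literature data are combined.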

Figure 2 |

Comprehensive model development.

(A) Regression model containing 367 data entries facilitated by parameterization of every reaction variable. A positive %ee value indicates E-imine TS, a negative %ee Z-imine TS. LOO, leave-one-out cross-validation score; k-fold, average fourfold cross-validation score. (B) Illustration of mechanistic transferability in the data set via “leave one reaction out” (LORO) analysis, in which distinct reactions (as determined by individual publications) are defined as the validation set. (C) Visual analysis and interpretation of the model terms.

We postulated that the ability to correlate and predict using a single model for an array of reactions suggests that the transition state features are fundamentally similar within this reaction range. Perhaps the best test of this hypothesis is a “leave one reaction out” (LORO) analysis. In this statistical evaluation, the catalyst, imine, and nucleophile structures are varied as a validation set and assessed through the ability of the model to predict with sufficient accuracy. This would report on the model’s capacity to match patterns across a general reaction type. Using this analysis, each distinct reaction (as determined by individual publications) in the data field was evaluated, and most were predicted well (see SI). As an illustration of model robustness, we could exclude up to seven reactions with little change in the correlation statistics (Fig. 2B). However, not surprisingly, some reactions were poorly predicted in the LORO protocol, which can be attributed to the model’s inability to capture specific structural changes if they are not adequately expressed in the training set. In sum, the descriptor definitions coupled to the model and validation strategies showcase that patterns can be matched. This is consistent with the hypothesis that a defined set of key non-covalent interactions imparts asymmetric induction across a general reaction type. Essentially, this workflow provides evidence that one reaction can be used to predict the results of another, quantitatively.
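The LORO idea can be sketched in Python as grouped cross-validation: every entry from one publication is held out, a linear model is fitted to the rest, and the held-out entries are predicted. This is a minimal sketch, not the authors' implementation. The ordinary least-squares fit via NumPy, the function name `loro_errors`, and the synthetic data are all assumptions, and the descriptor selection performed in the source is not reproduced.

```python
import numpy as np


def loro_errors(X, y, groups):
    """Return {group: mean absolute error} with each group held out in turn."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    A = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    errors = {}
    for g in np.unique(groups):
        held_out = groups == g
        # Fit on every reaction except the held-out publication
        coef, *_ = np.linalg.lstsq(A[~held_out], y[~held_out], rcond=None)
        pred = A[held_out] @ coef
        errors[g] = float(np.mean(np.abs(pred - y[held_out])))
    return errors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 3))            # synthetic descriptors
    y = X @ np.array([1.0, -0.5, 0.2]) + 0.3 + rng.normal(scale=0.1, size=60)
    groups = np.repeat(np.arange(6), 10)    # six hypothetical publications
    for g, err in loro_errors(X, y, groups).items():
        print(g, round(err, 3))
```

Grouping by publication rather than by individual entry is what distinguishes this from LOO: a held-out reaction shares no entries with the training data, so a low error reflects transfer between reactions, not interpolation within one.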

Trend analysis:

Although the comprehensive model in Fig. 2 establishes the capacity of the selected parameters to describe general aspects of this system, the ultimate goal of our workflow is to discern subtle underlying mechanistic phenomena. This objective could not be achieved by using the above correlation because it was produced by using the entire dataset, which provides only an overview of the mechanistic patterns. We hypothesized that a series of focused correlations, coupled with an evaluation of the overall trends, might serve to reveal fundamental features of the systems. To this end, we truncated the dataset into subsets, categorized by imine TS geometry (E or Z) determined by the relative sign of the ee determined previously, as these are hypothesized to lead to structurally distinct interactions with the other reaction components. This organizational scheme was viewed as a means to facilitate the identification of catalyst features that affect particular mechanistic pathways and therefore, reactant combinations (and vice versa). Linear regression algorithms were then applied to this data classification to identify correlations between molecular structure and the experimentally determined enantioselectivity. Subsequently, analysis and refinement of the resultant models were used to produce explicit mechanistic hypotheses (Fig. 3).

Figure 3 |

Development of focused correlations.

(A) Regression model containing 204 entries data-mined from nine literature sources. (B) Model emphasizes the importance of both steric and electronic factors. Reasonably large catalyst and imine substituents lead to high levels of enantioselectivity; if these two components are matched, any nucleophile should be compatible. (C) Regression model containing 147 entries data-mined from eight literature sources. (D) Overlapping steric terms describing catalyst and imine reinforce the notion that similar interactions remain within the two geometric imine stereoisomers. However, this model emphasizes the importance of steric contributions predominantly from the nucleophile for high enantioselectivities.

The correlation depicted in Fig. 3 was identified from a set of 204 reactions (evenly split into training and validation sets) that proceed via E-imine TS. The relationship includes two solvent, two imine, one nucleophile and three catalyst terms. Overall, the statistical model suggests a mechanistic scenario in which the imine adopts an arrangement that minimizes energetically penalizing repulsive interactions with reasonably large catalyst substituents.[31] Perhaps most telling is that the steric profile of the nucleophile does not have a significant impact on the stereoselectivity prediction, despite the large structural variance. The included parameters (LUMO and the P‒O asymmetric stretching intensity, iPOas) suggest that H-bonding contacts between catalyst and nucleophile play a minor role and that essentially any nucleophile should be compatible with the reaction if the imine and catalyst are matched. In the model for Z-imines, determined from 147 reactions, a number of overlapping terms reinforce the notion that similar interactions between catalyst and substrates remain within the two geometric imine stereoisomers. Two of these terms, the size of the catalyst aryl substituent as measured by the Sterimol B1 term, B1cat, and the imine NBOPG parameter, essentially describe the repulsive interactions between proximal sterics and the imine N-substituent, a common critical catalyst-substrate interaction with both TS imine configurations. The most significant contrast between the two models is that the Z-imine model includes a significant nucleophile steric descriptor, B5Nu, which is the most highly weighted term in the equation. This suggests that larger nucleophiles introduce enhanced repulsive interactions with the catalyst substituents in the TS, leading to the competing product, which ultimately favors the observed enantiomer.
This claim is further supported by the observation of higher enantioselectivities when using catalysts with smaller substituents (e.g., Ar = 3,5-CF3). The proposed physical meaning behind each term in the mathematical equations has been summarized in Fig. 3.

Evaluation of prediction capabilities:

As a final step in the workflow, we evaluated the ability to transfer the mechanistic principles leading to enantioselective catalysis captured by the statistical models to genuinely different structural motifs not contained in the training dataset. If effective out-of-sample prediction were possible, the model could predict the impact of a new imine, nucleophile, and/or catalyst. Initially, reaction performance was evaluated using the comprehensive model to determine the mechanistic pathway under operation; these predictions could then be further refined with the specific models (E or Z). This two-tiered workflow is imperative, as the process avoids mechanistic assumptions regarding whether the reaction proceeds via an E or Z TS and therefore ensures that the results of the test reactions are unknown. The comprehensive model does not immediately allow prediction of stereochemistry; however, product configuration can be assigned from the simple models shown in Figure 4. These are based on the amine product yielded from a reaction proceeding via an E or Z TS and catalyzed by the (R)-CPA. The opposite enantiomer will be formed if the (S)-catalyst is employed. As a first case study, we evaluated fifteen additional reactions involving enecarbamates, a nucleophile not contained in the training set, and benzoyl imines, an imine subclass that is part of our initial training set (Fig. 4).[32] Each result was predicted using the comprehensive model, with an average absolute ΔΔG error of 0.37 kcal/mol (13 examples within 5% ee), and the absolute stereochemistry was correctly assigned as R, demonstrating the ability of the model to extrapolate effectively to a new nucleophile. A slightly improved outcome is observed using the E-imine mechanistic model, with a 0.24 kcal/mol average error (all examples within 5% ee).
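The two-tiered logic described above can be sketched in Python as a simple routing step: a comprehensive model gives a first-pass signed prediction, and its sign decides whether the E- or Z-specific model refines the value. Everything in this sketch is hypothetical, including the `LinearModel` class, the function name, and the convention that a positive value denotes an E-imine TS (mirroring the +ee/−ee grouping used for the dataset). No coefficients from the source are reproduced.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LinearModel:
    """Linear model: value = intercept + sum(coef_i * descriptor_i)."""
    intercept: float
    coefs: Sequence[float]

    def predict(self, x: Sequence[float]) -> float:
        if len(x) != len(self.coefs):
            raise ValueError("descriptor vector has the wrong length")
        return self.intercept + sum(c * v for c, v in zip(self.coefs, x))


def two_tier_predict(x_all, x_e, x_z, comprehensive, e_model, z_model):
    """Tier 1 picks the pathway from the sign; tier 2 refines the value."""
    first_pass = comprehensive.predict(x_all)
    # Assumed convention: positive -> E-imine TS, negative -> Z-imine TS
    if first_pass >= 0:
        return "E", e_model.predict(x_e)
    return "Z", z_model.predict(x_z)
```

Separate descriptor vectors are passed for each tier because the comprehensive and focused models in the source use different parameter sets.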

Figure 4 |

Out-of-sample predictions using two-tiered prediction workflow.

Comprehensive model first determines E or Z TS; configuration-specific models are then utilized to refine predictions. A generic amine product denotes the stereochemical outcome predicted if the reaction proceeds via the E or Z TS and is catalyzed by an (R)-CPA. Product stereochemistry is reversed if the (S)-catalyst is used. (A) Application to addition of enecarbamates to benzoyl imines and transfer hydrogenation of alkynyl ketimines. (B) Prediction of TCYP, which has cyclohexyl groups at the 2,4,6 positions of the aromatic ring, as a highly selective catalyst for the addition of thiol to benzoyl imines.

As the second case study, the hydrogenation of alkynyl ketimines catalyzed by H8-BINOL where the 3,3’ groups = 3,5-CF3(C6H4) was predicted.[33] This is a more challenging scenario, as neither the imine nor the catalyst component is included in the training set. Again, accurate prediction of the outcomes was achieved using the Z-imine mechanistic model, with an average absolute error of 0.30 kcal/mol and 13 examples predicted within 2% ee (Fig. 4). The stereochemical outcome was correctly determined to be R with the (S)-catalyst. Although the comprehensive model assesses the mechanistic scenario and therefore assigns the stereochemical outcome, it was not as accurate since the nucleophile information was categorical (symmetrical or displaced). Thus, the beneficial effect of a large nucleophile for a Z reaction was not adequately captured. These examples showcase that the model’s predictive capabilities are not limited to classifying the vast literature, but can be applied to analyze and predict new reactions even in situations where multiple components are varied. As a final case study, we evaluated a recently reported reaction that was rendered highly predictable by application of machine learning (ML) algorithms. The study reported by Denmark and coworkers involved the addition of thiols to benzoyl imines, a distinct reaction included in our training set.[34] To utilize ML approaches, they performed 2,150 separate experiments using 43 catalysts to yield 25 different products (5×5 nucleophile/electrophile matrix). We postulated that our approach could reliably predict their results, including the best catalyst, TCYP, which has cyclohexyl groups at the 2,4,6 positions of the aromatic ring and is not in our training set. To test this hypothesis, all experimental results of this reaction type were removed from our original training data, and the model was retrained and deployed to predict their new dataset (34 reactions) collected with the best catalyst, TCYP.
We conclude that our model, with no experimental data on this reaction, can also predict the enantioselectivities (average absolute ΔΔG error = 0.65 kcal/mol for the comprehensive model (26 examples within 5% ee) and 0.67 kcal/mol for the E-imine-only model (25 examples within 5% ee)), confidently determining the stereochemical outcome to be R and identifying TCYP as a highly selective catalyst. Overall, through the combination of results generated from the out-of-sample prediction platforms, we can conclude that the E- and Z-focused correlations generate more accurate predictions, but the comprehensive model is valuable as it determines which equation should be deployed. In this report, we have introduced a workflow to model enantioselectivity in assorted catalytic systems. The value of this approach is that complicated reaction conditions can be accounted for and successfully evaluated for multiple and diverse reactions. The ability to correlate and predict using a single model for many reactions suggests that general transition state features are fundamentally similar across the reaction range, allowing the transfer of observations from one reaction to another. This finding highlights a likely general phenomenon in asymmetric catalysis, whereby various transformations may be found to perform in the same manner when exposed to similar reaction conditions. However, such reaction similarities may be unmasked, and reaction-specific mechanistic principles emerge from the development of focused correlations.