
Enhancing computational enzyme design by a maximum entropy strategy.

Wen Jun Xie¹, Mojgan Asadi², Arieh Warshel¹.

Abstract

Although computational enzyme design is of great importance, the advances utilizing physics-based approaches have been slow, and further progress is urgently needed. One promising direction is using machine learning, but such strategies have not been established as effective tools for predicting the catalytic power of enzymes. Here, we show that the statistical energy inferred from homologous sequences with the maximum entropy (MaxEnt) principle significantly correlates with enzyme catalysis and stability at the active site region and the more distant region, respectively. This finding decodes enzyme architecture and offers a connection between enzyme evolution and the physical chemistry of enzyme catalysis, and it deepens our understanding of the stability-activity trade-off hypothesis for enzymes. Overall, the strong correlations found here provide a powerful way of guiding enzyme design.

Entities: Chemical

Keywords: catalysis; enzyme design; evolution; maximum entropy

Year: 2022 PMID: 35135886 PMCID: PMC8851541 DOI: 10.1073/pnas.2122355119

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 12.779

Enzymes are extraordinary catalysts that play vital roles in nearly all biochemical processes. Designing efficient enzymes could help address threats to humankind, including the energy crisis, environmental pollution, and food shortages (1). The use of computational modeling for enzyme design is very promising (2–6). However, such approaches are not yet at the stage where they can guide sufficiently reliable enzyme design (7–9). Thus, it is crucial to exploit additional options for improving design predictability. This work explores the potential of statistical analysis of homologous enzyme sequences for enhancing computational enzyme design.

Naturally evolving enzymes can speed up chemical reactions by many orders of magnitude (e.g., Fig. 1). Such great catalytic power reflects a very long evolutionary process that started at the emergence of life. In principle, it is tempting to study the origin of the catalytic power of enzymes using physics-based models. However, machine learning methods may provide an invaluable guide.

The maximum entropy (MaxEnt) principle (10) offers the least-biased model for the sequence distribution by maximizing the information entropy subject to the statistics obtained from a multiple sequence alignment (MSA). The MaxEnt model taking epistasis into account has been proposed to distill evolutionary information within a protein family, which was then correlated with residue–residue contacts (11–13) and fitness (14, 15), partly leading to the breakthrough in protein structure prediction (16). For enzymes, a high correlation between the statistical energy derived from the MaxEnt model and enzyme efficiency was found for beta-lactamase, but it did not seem to hold for trypsin and dihydrofolate reductase (DHFR) (14). This generative model has recently been used to design enzymes (17). The MaxEnt model in its current form can classify designed sequences in a binary way as functional or nonfunctional based on a regression model trained with the statistical energy and additional high-throughput experimental data. However, designed sequences chosen for biochemical analysis do not show improved catalytic power compared with natural sequences (17). Therefore, the MaxEnt approach has not reached the stage of rational enzyme design, where one should be able to accurately predict the effect of mutations on catalytic power and design better enzymes. This might not be that surprising considering the complex interplay among the various selection pressures applied to enzyme evolution (18). In particular, enzyme stability and activity may trade off with each other (19–22).

Fig. 1.

The MaxEnt model for enzyme sequences connects enzyme evolution and function. (A and B) The enzyme accelerates the chemical reaction by lowering the activation energy, mainly using the residues in the catalytic center. Haloalkane dehalogenase (PDB ID code 2dhc) is used as an example to illustrate enzyme catalysis and the reaction mechanism. (A) The residues within a distance of 7.0 Å from the substrate are highlighted. (B) The scheme of the nucleophilic substitution (SN2) step is illustrated using the substrate 1,2-dichloroethane. (C) The MaxEnt model connects enzyme evolution to the physical chemistry of enzyme catalysis. A pairwise MaxEnt model is learned from the MSA, and each protein sequence is associated with a statistical energy following the Boltzmann distribution. We found that decreasing the statistical energy significantly correlates with increasing enzyme efficiency and stability in the catalytic center and enzyme surface, respectively.

This work explores the hypothesis that the enzyme catalytic center, which is involved in catalysis and transition-state stabilization, directly correlates with the selection pressure on enzyme efficiency in a way that can be captured by the MaxEnt model. This idea is confirmed by finding a significant correlation for the catalytic center between enzyme catalysis and the statistical energy derived by applying the MaxEnt model (Fig. 1). In contrast, the statistical energy correlates well with protein stability for remote regions (referred to here as the enzyme surface), suggesting that a stable enzyme surface may be needed for optimal enzyme function. Therefore, the results here show that evolutionary information can be used to decode enzyme architecture and understand biocatalysis. Furthermore, we demonstrate that the widely used consensus design is a special case of the MaxEnt model. The correlations and insights thus offer a powerful way to guide enzyme design.

The MaxEnt Model

Homologous enzyme sequences from different species share the same evolutionary origin (23). The natural sequence variation within an enzyme family is constrained by different factors, including its physical chemistry (24). Therefore, distilling evolutionary information from the MSA of an enzyme family could shed light on enzyme three-dimensional structure and function. Due to limited homologous sequences and high computational cost, the MaxEnt model is usually truncated to consider only pairwise epistatic effects.

The MaxEnt model provides a Boltzmann distribution over sequences: each sequence is assigned a statistical energy, with the effective temperature set to unity, and the distribution is normalized by the partition function. The model parameters are the site energies and the pairwise couplings between amino acids at two different residue sites. The statistical energy is shifted by a constant so that the wild type (WT) has a zero value, which does not affect any results due to gauge invariance. A lower statistical energy for a sequence indicates a higher probability of appearing during evolution and might reflect a particular evolutionary advantage.

The statistical energy corresponds to a spin-glass Hamiltonian, which has enormous local frustration (25). The parameterization, which requires extensive sampling of the model, is thus highly nontrivial, especially for large proteins with hundreds of residues. Instead of using the popular pseudolikelihood (PLL) approximation (13, 14), we have previously developed an efficient code that combines different computational advances to sample the Hamiltonian rigorously (26). The derivation and parameterization details of the MaxEnt model can be found in . The PLL approximation is unable to reproduce the statistics of natural sequences (27). Here, the excellent reproduction and prediction of natural MSA statistics validate our implementation ().
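
For orientation, a pairwise MaxEnt (Potts-type) model of the kind described above is commonly written in the following form. The symbols used here ($S = (A_1, \dots, A_L)$ for a sequence of length $L$, $E(S)$ for the statistical energy, $h_i$ for site energies, $J_{ij}$ for pairwise couplings, $Z$ for the partition function) are notation assumed for illustration, and the sign convention is chosen so that a lower energy means a higher probability, as stated in the text; the original article's notation may differ.

```latex
P(S) = \frac{1}{Z} \exp\bigl(-E(S)\bigr), \qquad
E(S) = \sum_{i=1}^{L} h_i(A_i) + \sum_{1 \le i < j \le L} J_{ij}(A_i, A_j), \qquad
Z = \sum_{S'} \exp\bigl(-E(S')\bigr)

% Shift by a constant so that the wild type has zero energy
\Delta E(S) = E(S) - E(S_{\mathrm{WT}})
```

Because the shift subtracts the same constant from every sequence, it cancels between the numerator and $Z$ and leaves $P(S)$ unchanged.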

Results

A critical obstacle to examining the enzyme evolution–catalysis relationship is the lack of enzyme catalytic data covering sufficient mutants for the target enzyme. Although deep mutational scanning can measure the consequences of mutation at scale, the relation between its readout and enzyme physical properties is uncertain (28). Directly measuring the catalytic parameters (kcat, KM, and kcat/KM) requires laborious biochemical assays (29). The experimental data are thus relatively sparse. To address this, we first manually curated a database of enzyme efficiency upon mutation from the published literature (), focusing on systems with many mutations either in the catalytic center (here defined as within 7.0 Å of the substrate) or on the enzyme surface (here defined as beyond 9.0 Å from the substrate). Many of the enzymes here are model systems in computational chemistry studies. The database contains 12 enzyme–substrate pairs, and for each pair, at least seven mutations measured under similar conditions (pH and temperature) are collected. Protein stability data were also included whenever available; the database includes many higher-order mutations (up to the 10th order). Meanwhile, we made sure that each enzyme has thousands of homologous sequences in the MSA to obtain statistically meaningful evolutionary information (). The enzymes studied here cover various types of reactions, identified by their different Enzyme Commission class numbers ().

We started with the haloalkane dehalogenase from Xanthobacter autotrophicus, which catalyzes the conversion of toxic haloalkanes to alcohols (Fig. 1) (30). We evaluated the correlation between the statistical energy and the observed enzyme catalytic power (expressed by both kcat and kcat/KM). All the mutations are located in the catalytic center, with a mean distance of 3.4 Å from the substrate (Fig. 2). Six of the mutants are single mutations, and one is a double mutation. The enzymatic rates span more than six orders of magnitude, posing great challenges for prediction methods. Nevertheless, as seen from Fig. 2, the statistical energy shows impressive Pearson correlations with kcat and kcat/KM, with values of −0.87 and −0.95, respectively.
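
The correlation analysis described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `pearson`, the variable names, and all numerical values are hypothetical, and taking the base-10 logarithm of the rate before correlating is an assumption made here because the measured rates span several orders of magnitude.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative values, NOT data from the article:
# statistical energy of each mutant relative to the wild type (WT = 0)
delta_e = [0.0, 1.2, 2.5, 3.1, 4.8]
# hypothetical catalytic rates for the same mutants
kcat = [10.0, 3.0, 0.4, 0.05, 0.001]

# Assumption: correlate against log10 of the rate
log_kcat = [math.log10(k) for k in kcat]
r = pearson(delta_e, log_kcat)
print(f"Pearson r = {r:.2f}")
```

A negative `r` in this toy setting corresponds to the trend reported in the text: lower statistical energy goes with higher catalytic power.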

Fig. 2.

The MaxEnt model for enzyme sequences correlates with enzyme efficiency at the catalytic center. (A and B) Haloalkane dehalogenase. (C and D) Chorismate mutase. (E and F) Alcohol dehydrogenase. (A, C, and E) Substrates and mutated residues in the dataset are shown in red and blue, respectively; only one unit of the dimeric chorismate mutase and the tetrameric alcohol dehydrogenase is highlighted. For alcohol dehydrogenase, the cofactor NADP+ and catalytic triad are colored in red because of the absence of substrate. PDB ID codes used in rendering the structures are (A) 2dhc, (C) 1ecm, and (E) 6tq5. Substrates are (A) 1,2-dichloroethane, (C) chorismate, and (E) cyclohexanol. (B, D, and F) Correlations between the statistical energy and experimental catalytic power. The least-squares regression line is plotted for each enzyme; the WT enzyme has a statistical energy of zero.

We then explored the catalytic center of chorismate mutase, which is widely used in enzyme mechanism and design studies (31). This enzyme transforms chorismate to prephenate in the pathway that produces tyrosine and phenylalanine, which is essential for plants, fungi, and bacteria (29). The mutations in chorismate mutase from Escherichia coli are 3.7 Å from the substrate on average (Fig. 2). Here again, the correlations are significant, and the statistical energy has a correlation value of −0.68 with the measured catalytic power (Fig. 2). The A32S mutation stands out as the only mutant with increased efficiency relative to the WT, and our approach also captures this unique experimental result. The MaxEnt model has recently been applied to chorismate mutase to explore the sequence space of the whole enzyme (17). However, the statistical energy for the whole enzyme is not that informative for catalytic power; only after it is combined with fitness data from high-throughput experiments can a logistic regression model be trained to classify whether a designed sequence is functional or not (17).

For both haloalkane dehalogenase and chorismate mutase, the mutants are mainly single mutations. One may wonder how the MaxEnt model performs on higher-order mutations. We therefore considered alcohol dehydrogenase from Starmerella magnoliae; nine of the 20 mutants with experimental kinetic data are higher-order mutations, up to the 10th order (32). The average distance between the mutations and the substrate is 6.2 Å (Fig. 2). Here, the cofactor NADP+ and catalytic triad were used as the reference point in calculating the distance because of the absence of substrate in the Protein Data Bank (PDB) structure; note that not every residue involved in this case is very close to the substrate. The correlation between the statistical energy and catalytic efficiency is −0.74 (Fig. 2). The successful prediction for such higher-order mutations underscores the potential of our approach. Interestingly, the statistical energy strongly correlates with the melting temperature Tm (correlation value of 0.91) but with the opposite trend to the catalytic efficiency, supporting the activity–stability trade-off proposal (19–22). In this case, the independent model without epistasis shows the opposite trend to the MaxEnt model (). If we further split the dataset into two subsets, one containing all single mutations and the other containing the remaining mutants, the independent model shows opposite trends between the two subsets. The results demonstrate the importance of considering epistasis in extracting evolutionary information.

Next, we examined the generality of our finding by considering an extensive set of enzymes summarized in Table 1. For all the mutants in the catalytic center, we observed a strong correlation between the MaxEnt model and the catalytic effect. Although the collected data for some enzymes reflect a biased choice of mutants in the experiments, the correlations are consistently strong. The correlation seems insensitive to the substrate for ketosteroid isomerase. However, for DHFR, the strong correlation disappears when moving from the catalytic center to the enzyme surface. Such results confirm our hypothesis that the catalytic center evolved under the selection pressure of optimizing enzyme catalysis.
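
The difference between the independent model and a model with pairwise couplings can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the authors' implementation: it assumes the common pairwise form E(S) = Σ h_i(a_i) + Σ J_ij(a_i, a_j), and the function names, the two-site "protein", and every parameter value are hypothetical.

```python
from itertools import combinations


def statistical_energy(seq, h, J):
    """Assumed pairwise form: E(S) = sum_i h[i][a_i] + sum_{i<j} J[(i, j)][(a_i, a_j)].

    Couplings missing from J are treated as zero.
    """
    energy = sum(h[i][a] for i, a in enumerate(seq))
    for i, j in combinations(range(len(seq)), 2):
        energy += J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return energy


def delta_e(seq, wt, h, J):
    """Energy shifted so that the wild type has a value of zero."""
    return statistical_energy(seq, h, J) - statistical_energy(wt, h, J)


# Hypothetical two-site model; all numbers are invented for illustration.
h = [{"A": 0.0, "S": 0.5}, {"L": 0.0, "V": 0.3}]   # site energies
J = {(0, 1): {("S", "V"): -1.0}}                   # one favorable coupling
wt = "AL"

single_1 = delta_e("SL", wt, h, J)    # first site mutated
single_2 = delta_e("AV", wt, h, J)    # second site mutated
double = delta_e("SV", wt, h, J)      # both sites mutated
additive = single_1 + single_2        # what a model without couplings predicts
epistasis = double - additive         # contribution of the coupling term

print(f"additive prediction: {additive:+.2f}, pairwise model: {double:+.2f}")
```

In this toy case the additive prediction for the double mutant is positive while the pairwise model gives a negative value, which is one way an independent model and a model with epistasis can show opposite trends for higher-order mutants.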

Table 1.

Correlation between the MaxEnt model and enzyme efficiency/Tm

In addition, the correlation obtained here using the rigorous sampling of the Hamiltonian is slightly stronger than that obtained using the PLL approximation, but both are better than the independent model (). The consistent results obtained from the PLL approximation again confirm our findings.

For the enzyme surface regions, which are at least 9.0 Å away from the substrate, the correlation between the statistical energy and enzyme efficiency is not that strong or systematic, although in general there seems to be a negative correlation. Using beta-lactamase as an example (and disregarding the substrate difference between the two enzyme–substrate pairs), the statistical energy has a stronger correlation with enzyme catalysis for mutations closer to the substrate. This is again consistent with our finding above for the catalytic center. The rationale is that the surface region is not directly subject to the evolutionary pressure on enzyme catalysis.

To better understand the physical nature of the statistical energy, we also considered in Table 1 the correlation between the statistical energy and the observed Tm (which is inversely related to the protein folding energy). As seen from Table 1, we have a systematically negative correlation between the statistical energy and Tm for the enzyme surface, indicating that the MaxEnt model does reflect the protein stability for regions far away from where catalysis happens. This is reasonable since the MaxEnt model reflects the contact probability (11–13), which can be considered a generalized free energy function for protein folding (35).

It appears that the catalytic center and enzyme surface face different selection pressures. The statistical energy inferred from the MSA strongly and inversely correlates with enzyme efficiency in the catalytic center and with enzyme stability on the enzyme surface. The finding that a more stable enzyme surface might help in promoting enzyme catalysis could also rationalize the growing evidence that it is possible to engineer remote mutations to improve catalysis (Table 1) (33, 36).

To demonstrate our approach to enzyme design, we redesigned the catalytic center of haloalkane dehalogenase after parameterization of the MaxEnt model; 37% of the designed sequences have a lower statistical energy than the WT (), suggesting possible enhanced catalysis. Interestingly, one of the top five designs is a consensus design, where the residue is replaced by the most frequently observed amino acid in the natural MSA (). Consensus design has already been shown to be effective in protein engineering (37, 38), and it turns out to be a special case of the MaxEnt model in which epistasis is not considered. Our results thus provide a statistical basis for the consensus design and suggest that the consensus amino acids near the substrate are likely to improve enzyme catalytic power.
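
The link between consensus design and a model without couplings can be illustrated with a short sketch. It assumes an independent-site energy of the form E(S) = −Σ ln f_i(a_i), where f_i(a) is the frequency of amino acid a in MSA column i; under that assumption the consensus sequence is the lowest-energy sequence by construction. The function names, the pseudocount, and the toy alignment are hypothetical, and gap characters are not handled.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(msa, pseudocount=1.0):
    """Per-column amino acid frequencies, smoothed with a uniform pseudocount."""
    total = len(msa) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for column in zip(*msa):
        counts = Counter(column)
        freqs.append({a: (counts.get(a, 0) + pseudocount) / total
                      for a in AMINO_ACIDS})
    return freqs


def independent_energy(seq, freqs):
    """Assumed independent-site (no epistasis) energy: -sum_i ln f_i(a_i)."""
    return -sum(math.log(freqs[i][a]) for i, a in enumerate(seq))


def consensus(freqs):
    """Most frequent amino acid at each position."""
    return "".join(max(f, key=f.get) for f in freqs)


# Toy alignment invented for illustration (not real enzyme sequences)
msa = ["ACD", "ACE", "GCD", "ACD"]
freqs = site_frequencies(msa)
cons = consensus(freqs)

for s in msa:
    print(s, round(independent_energy(s, freqs), 3))
print("consensus:", cons, round(independent_energy(cons, freqs), 3))
```

With couplings included, the lowest-energy sequence no longer has to coincide with the consensus, which is why the pairwise model can propose designs that a purely consensus-based approach would miss.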

Discussion

This work explored the relationship between enzyme evolution and catalysis by correlating the statistical energy obtained from natural homologous sequences with the catalytic power of different enzymes. We found that the correlation is significant for the catalytic center, and adopting our finding to guide enzyme design is straightforward. The catalytic center and enzyme surface face different selection pressures. Therefore, when using evolutionary information, enzyme catalysis is more likely to be improved by optimizing the catalytic center than by optimizing the enzyme surface. This also explains why previous studies found no consistent correlations between catalytic power and evolutionary information (14, 17); the complex physical constraints make it highly nontrivial to predict enzyme efficiency from sequence data. For trypsin and DHFR, the mutations in the dataset (14) are on the enzyme surface, where a strong correlation between the statistical energy and catalytic power is not expected; such an explanation also applies to cases in which the whole enzyme is studied (17). Adopting machine learning in protein studies is promising (14, 15, 17, 39–46); here, we incorporate an understanding of the physical chemistry of enzymes into machine learning and thus could obtain consistent predictions for biocatalysis. For simple proteins (e.g., proteins without domains for catalysis and other complex functions), evolutionary information may simply correlate with protein stability.

While we can use the correlation to estimate enzyme catalytic power, it is interesting to look for a possible rationalization. Thus, we checked the correlation between the statistical energy and Tm. Alcohol dehydrogenase provides strong evidence for the enzyme activity–stability trade-off. However, for the enzyme surface, the statistical energy and Tm are inversely correlated, while the statistical energy is roughly correlated with enzyme efficiency. This seems to contradict the idea that catalytic preorganization costs folding energy (19, 20). 
However, the folding energy as expressed by Tm is related to the stability of the entire enzyme, whereas the preorganization can be determined by the folding of a limited part of the enzyme (31). As shown in our previous study on DHFR, reducing the reorganization energy may or may not reduce the protein stability, since it requires protein restraints in specific directions along the reaction coordinate and not necessarily restraints in all directions (20). Interestingly, the role of the protein surface in helping catalysis found here may not be directly related to the reduction in the reorganization energy (47).

In addition to striving to understand the nature of the correlation found in this work, we can also take a very pragmatic “engineering” approach employing the correlation between and . That is, regardless of whether the correlation is negative or positive, we can generate mutants, determine their , choose those with increased , and further screen them with empirical valence bond (EVB) calculations (2, 7) for design experiments.

We anticipate that the emerging evolutionary information, with the rapidly accumulating genomic sequence data, will facilitate studies of more enzyme families. For cases where such information is not sufficient or for new catalytic reactions, the performance of the MaxEnt model might not be adequate. In such cases, we can further combine it with EVB calculations (2, 7) to model the catalytic power and further screen the designs to increase the success rate. The findings here seem to reflect a general principle for enzymes but require a thorough examination of more enzymes. We indeed found more enzymes supporting the conclusions presented here while this manuscript was under review; the results will be presented in a forthcoming manuscript. Whether the enzyme architecture can be further decoded into more categories with evolutionary information also needs to be investigated. 
Furthermore, enzymes may be far more intricate than we currently know; it would be exciting to understand the coupling between different enzyme parts from evolutionary information.

The great potential of laboratory-directed evolution has been demonstrated (48). In this respect, we believe that our approach can help in extending design by directed evolution. Moreover, our approach can be used to trace the moves in directed evolution by following the prediction of the MaxEnt model to understand how enzymes evolve. It could also be applied to understand natural evolution, considering that ancestral sequences are homologs of the extant sequences.

In summary, we decoded enzyme architecture using evolutionary information and connected enzyme evolution with enzyme catalysis. Such a connection can help to bridge evolutionary biology and enzymology. The high throughput and predictability of the MaxEnt model (or other generative models), combined with experimental validation and computational modeling, could push enzyme studies to a systems level. Significantly, the results here call attention to integrating domain knowledge in physical chemistry into machine learning models for protein engineering.

45 in total

1. Iterative approach to computational enzyme design.

Authors: Heidi K Privett; Gert Kiss; Toni M Lee; Rebecca Blomberg; Roberto A Chica; Leonard M Thomas; Donald Hilvert; Kendall N Houk; Stephen L Mayo
Journal: Proc Natl Acad Sci U S A Date: 2012-02-22 Impact factor: 11.205

Review 2. Computational enzyme design.

Authors: Gert Kiss; Nihan Çelebi-Ölçüm; Rocco Moretti; David Baker; K N Houk
Journal: Angew Chem Int Ed Engl Date: 2013-03-25 Impact factor: 15.336

3. Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection.

Authors: Faruck Morcos; Nicholas P Schafer; Ryan R Cheng; José N Onuchic; Peter G Wolynes
Journal: Proc Natl Acad Sci U S A Date: 2014-08-11 Impact factor: 11.205

4. How Pairwise Coevolutionary Models Capture the Collective Residue Variability in Proteins?

Authors: Matteo Figliuzzi; Pierre Barrat-Charlaix; Martin Weigt
Journal: Mol Biol Evol Date: 2018-04-01 Impact factor: 16.240

5. Soluble oligomerization provides a beneficial fitness effect on destabilizing mutations.

Authors: Shimon Bershtein; Wanmeng Mu; Wanmeng Wu; Eugene I Shakhnovich
Journal: Proc Natl Acad Sci U S A Date: 2012-03-12 Impact factor: 11.205

6. Structural bases of stability-function tradeoffs in enzymes.

Authors: Beth M Beadle; Brian K Shoichet
Journal: J Mol Biol Date: 2002-08-09 Impact factor: 5.469

7. Deep mutational scanning: a new style of protein science.

Authors: Douglas M Fowler; Stanley Fields
Journal: Nat Methods Date: 2014-08 Impact factor: 28.547

8. Low-N protein engineering with data-efficient deep learning.

Authors: Surojit Biswas; Grigory Khimulya; Ethan C Alley; Kevin M Esvelt; George M Church
Journal: Nat Methods Date: 2021-04-07 Impact factor: 28.547

Review 9. Consensus protein design.

Authors: Benjamin T Porebski; Ashley M Buckle
Journal: Protein Eng Des Sel Date: 2016-06-05 Impact factor: 1.650

10. Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1.

Authors: Matteo Figliuzzi; Hervé Jacquier; Alexander Schug; Oliver Tenaillon; Martin Weigt
Journal: Mol Biol Evol Date: 2015-10-06 Impact factor: 16.240
