Literature DB >> 29698417

The response to selection in Glycoside Hydrolase Family 13 structures: A comparative quantitative genetics approach.

Jose Sergio Hleap^1,2, Christian Blouin³.

Abstract

The Glycoside Hydrolase Family 13 (GH13) is both evolutionarily diverse and relevant to many industrial applications. Its members hydrolyze starch into smaller carbohydrates and members of the family have been bioengineered to improve catalytic function under industrial environments. We introduce a framework to analyze the response to selection of GH13 protein structures given some phylogenetic and simulated dynamic information. We find that the TIM-barrel (a conserved protein fold consisting of eight α-helices and eight parallel β-strands that alternate along the peptide backbone, common to all amylases) is not selectable since it is under purifying selection. We also show a method to rank important residues with higher inferred response to selection. These residues can be altered to effect change in properties. In this work, we define fitness as inferred thermodynamic stability. We show that under the developed framework, residues 112Y, 122K, 124D, 125W, and 126P are good candidates to increase the stability of the truncated α-amylase protein from Geobacillus thermoleovorans (PDB code: 4E2O; α-1,4-glucan-4-glucanohydrolase; EC 3.2.1.1). Overall, this paper demonstrates the feasibility of a framework for the analysis of protein structures for any other fitness landscape.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Substances：

Year: 2018 PMID： 29698417 PMCID： PMC5919626 DOI： 10.1371/journal.pone.0196135

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

The Glycoside Hydrolase Family 13 (GH13) is a multi-reaction catalytic family and its members perform hydrolysis, transglycosylation, isomerization [1], condensation, and cyclization reactions [2], and even animal amino acid transport [3] with no glycolase activity [4]. The initial definition for this family was formulated in the early 1990’s [5-7]. According to this definition, a member of this family must [6]: 1) hydrolyse or form (by transglycosylation) α-glucosidic linkages; 2) have four conserved amino-acidic regions [8]; 3) contain the catalytic triad: Asp, Glu, and Asp.; and 4) have a TIM-barrel-like fold in its structure. Since then, the number of family members have increased [9] to include α-1,1-, α-1,2-, α-1,3- and α-1,5-glucosidic linkages [1]. Also, the number of conserved regions has been updated to 7 [10, 11]. The catalytic activity and substrate binding residues in the GH13 family members occur at the C-termini of the β-strands and in the loops that extend from these strands [12]. The catalytic site includes aspartate as a catalytic nucleophile, glutamate as an acid/base, and a second aspartate for stabilization of the transition state [13]. The catalytic triad plus an arginine residue are conserved in this family across all catalytic members [14, 15]. The GH13 family has many characterized enzymes with diverse functions which are summarized and clustered in the CAZY database [16]. GH13 is a highly diverse family in function and is also ubiquitous, being found in all kingdoms of life [15]. The GH13 family has been subdivided into over 40 subfamilies [17, 18] by their sequence motif and enzyme specificity [9], but they are all related both in sequence and structure. To date, thousands of sequences comprise this family for which hundreds of structures have been solved with more than 30 different enzymatic specificities [9]. Many comprehensive reviews on their mechanisms, sequences, abundance, phylogeny and concept have been performed [12, 15, 19–24]. Part of the interest in researching this family lies in its industrial importance [25, 26]. The GH13 family is the target of engineering efforts, focusing on factors such as thermal and alkaline stability [27-30], specific activity [27, 31], and other diverse biochemical properties that are important to the industrial context [28, 32, 33]. Many strategies have been used to engineer this family including different “rational design” approaches [34] such as B-fitter [35], proline theory [36], PoPMuSiC-2.1 [37], and sequence consensus [34]. However, to our knowledge, there has been no attempt to leverage both phylogenetic and molecular dynamics signals to quantify the potential of a structure in response to selection. Exploring how selection acts on protein structures is not a trivial problem. One approach is to assume that protein structures are shape phenotypes and that their 3D structures respond to both genetic and environmental factors, thereby falling under a quantitative genetics framework. Proteins and other shapes are highly multivariate in nature [38], and the model for their phenotype (y) can be expressed as [39]: where X and Z represent design matrices for the fixed and random effects in vectors b and a respectively, and e is the residual component that cannot be explained by the model. Here, y is the phenotype of one structure and contains the x, y, and z coordinates of each homologous residue. The residue’s homology is inferred with respect to the rest of the structures being analyzed. For a protein structure t that has 100 homologous residues, the length of y is 300. The more detailed explanation of the abstraction of the protein structure as a shape can be seen in section “Abstracting protein structures as shapes”. With this model, the phylogenetic contribution to phenotype can be estimated. In a multivariate setting such estimation is called the G-matrix, or genetic variance-covariance matrix, which summarizes the genetic contribution and the interaction of all traits. In the example above, G is a 300 by 300 matrix. Lande and Arnold [40] proposed a multivariate strategy to estimate the response to selection given G as: where is a vector of changes in traits, and β is a vector of selection gradients. The latter quantity is the effect of a particular trait on the relative fitness, and therefore depends on its definition. Here we define fitness of a molecule as its function. In enzymes, for example, this term could include the stability, effectiveness, and efficiency necessary for the protein to perform the required function. Then, the selection gradient can be understood as the change in fitness when the trait (in this case geometry) varies. To apply the framework, the estimation of a G-matrix is required [41]. To deal with the fact that the number of samples is limited, this inversion of matrices requires expensive computation and an eigen-decomposition of the covariance matrices carried out using the restricted maximum likelihood (REML) approach is typically employed to perform the variance decomposition. When applied to univariate data, REML is more accurate than maximum likelihood methods because it better handles missing data (i.e. unknown parents, arbitrary breeding designs, etc…) and can account for selection processes. However, REML has good properties only asymptotically. The reliability of the estimates is questionable when data is scarce. One way to deal with complex cases that might bias the REML estimates is to use Bayesian inference of the animal model. This approach uses Markov chain Monte Carlo simulations and produces more robust estimations than REML, with equivalent results in less complex cases [42]. This robustness assumes that the Bayesian model has enough information in the prior probability distribution. A given set of priors considerably affect the estimation of the variance components. In particular, uninformative priors, such as flat priors (all parameters are equally likely), can lead to biases in the estimation.

Lynch’s comparative quantitative genetic model: Applications to protein structures

Lynch [43] developed the phylogenetic mixed model (PMM). In this model, the correlation of phylogenetically heritable components is the length of the path from the most recent common ancestor among two species and the root of the phylogenetic tree (i.e. time to the shared common ancestor) in the phylogeny [44]. The PMM can be described as [43]: where X is an np × p incidence matrix, p being the number of traits and n the number of observations. An assumption of the model is that μ is shared among all taxa in the phylogeny. This is a sensible assumption to make when analyzing truly homologous protein structures, since the mean effect on the phenotype is shared by common ancestry. This also means that μ+ a can be interpreted as the heritable component of the mean phenotype for the ith taxon [43]. Here, the phylogenetic effects are the portion of the variation that has been inherited from ancestral species [45]. It does not only contain the genetic component, but also some environmental contributions given the shared evolutionary history of the taxa [46]. In PMM the ratio between the additive component and the total variance is the heritability (h2) in a univariate approach. Housworth et al. [46] pointed out that a univariate h2 in a PMM is actually equivalent to Freckleton et al. [47]’s and Pagel [48]’s phylogenetic correlation (λ). Despite the robustness of the models, the REML technique, employed to estimate them, has two major drawbacks: assumption of normality of the data and high sample size requirements. It is widely known that REML poorly estimates genetic correlation when overparameterized (multi-trait inference), when the sample size is small (Martins, personal communication), and when the normality assumption is violated [44]. These violations can be handled in a Bayesian framework using Markov Chain Monte Carlo techniques. In such techniques, the higher complexity of the joint probability calculation needed for the likelihood estimation can be broken down in lower dimensional conditionals. From those conditionals the MCMC sampling can be performed and marginal distributions can be extracted [44]. A discussion of the use of Bayesian MCMC techniques is beyond the scope of this work. We refer the interested reader to Sorensen and Gianola [49] for a good description of likelihood and Bayesian methods in quantitative genetics. Despite its strengths, the Bayesian framework also has weaknesses. The most important one is that it requires proper and informative priors. Uninformative priors lead to biases with high variation in results. The sensitivity to the choice of prior distribution should always be assessed [50]. Given that in evolutionary biology datasets the amount of knowledge on the estimator is scarce, well-informed priors are normally not available and by informing priors with partial information, the estimation can become ill-conditioned. To explore the feasibility of a comparative quantitative genetics (CQG) framework in protein structures, we simulated a dataset with variable numbers of traits and observations. We show that the current implementations of the CQG framework are not feasibly applied to the dimensionality required for protein structures. We devised a method that functions as a proxy for the CQG framework and show that it is feasible and accurate. By applying this framework using the energy of unfolding (ΔG°) as fitness function to the GH13 family, we are able to show how purifying selection has fixed the geometry of the TIM-barrel. We also demonstrate how, by changing the fitness function, the response to selection propensity changes accordingly. Finally, a proxy for the amount of dynamic deformation happening in the protein, given a vector of selection, is explored. Overall, we present here a starting framework to explore protein structure evolution and design. This approach has the potential to inform researchers on the potential of a given structural family to respond under selective pressures. This is especially important in protein engineering and structural evolution. It also has the potential to inform possible pathways to be visited in a given fitness landscape. This approach can narrow down the structural variants that a particular structural group followed (or could/can potentially followed) in evolution or engineering, and therefore become an important tool in structural biology inquiring.

Materials and methods

GH13 dataset

The GH13 family has 111 structures reported in the CAZY database, with 386 representatives in the PDB. Given that molecular dynamic simulations are very time consuming, we used a subset of the proteins classified as Glycoside Hydrolases Family 13 (GH13). A stratified selection (from each CAZY structural grouping) of 35 protein structures was performed maximizing taxonomic diversity (i.e. avoiding sampling the same species). The structure of the maltotetraose-forming exo-amylase from Pseudomonas stutzer (PDB code 1GCY) failed during the MD simulation due to structural abnormalities in the crystal, perhaps steric clashes, which induced unacceptably large forces, causing the integrator to fail. After lengthy energy minimizations, the structure could not be resolved and therefore was removed from the analysis. We chose to remove the structure instead of artificially modifying it, in order to avoid bias in the data. A final set of 34 protein structures (Table A1 in S1 File) was used in further analyses.

Molecular dynamics (MD) simulations

Each of the 34 protein structures was simulated in solution using the software GROMACS 4 [51]. The force field modes used for the simulations were GROMOS96 for the protein, and SPCE for the water molecules. Data were collected every two picoseconds for at least 40 nanoseconds, discarding the first 10 nanoseconds of simulation to achieve stability. This process was performed using a workstation with 24 CPU cores and an NVIDIA TESLA™ GPU. The analysis of these simulations will provide information on the flexibility (or within protein variance) of the protein, as opposed to the analysis across homologs which would provide phylogenetic information (or between structures variance). By 40 ns all proteins analyzed have achieved equilibrium and therefore most of the intrinsic variance has been captured.

Aligning the structures and MD simulations

The alignment of homologous proteins was performed using MATT software [52]. To align the snapshots from MD simulations a General Procrustes Superimposition (GPS) was performed using the R package shapes [53].

Abstracting protein structures as shapes

On a set of aligned protein structures, the abstraction is performed in a similar way to that in Adams and Naylor [54]. However, they do not fully describe the abstraction. Here we assign a landmark to the centroids of residues defined by: where A will be the number of heavy atoms (C, O, N) that constitute the side chain of a residue including the alpha carbon (C). This procedure takes into account only the homologous residues. It captures the variance of both the backbone and the side chain. In the case of glycine, the centroid is the C. Once the structure is abstracted as a shape, the resulting n (number of observations) by l (number of coordinates of homologous residues) matrix is referred to as the phenotypic matrix (P). For example, let us assume that we have a protein structure composed of 150 residues. Let’s imagine that 100 different taxa share an ortholog of this protein. After aligning the protein structures let’s assume that 100 residues are homologous across all 100 taxa. The resulting phenotypic matrix (P) will be composed of 100 rows of observations (n) and 300 coordinates (l). These dimensions correspond to the x, y, and z axis of each of the 100 homologous residues. To estimate the variation of this phenotype, the phenotypic variance V can be estimated by computing the variance-covariance matrix of P as V = var(P), or G in a multivariate scenario.

Pooled-within group covariance matrix estimation

After the MD simulations up to 500 samples per simulation were obtained. The estimation of the pooled-within covariance matrix was performed as follows: Align every model within each MD simulation using General Procrustes Superimposition (GPS): Remove extra rotations and translations that could occur during MD simulation. Select an ambassador structure that is closest to the mean structure (the geometrical mean of the dataset). Align all ambassadors using MATT flexible structure aligner to identify homologous sites: Multiple structure alignment to identify structural homology. Extract the centroid of fully homologous sites: Identify shared information among all structures, as explained in section “Aligning the structures and MD simulations”. Concatenate the centroids’ three dimensions for all trajectories. Perform a GPS on the entire set of shapes to bring all pre-aligned structures into the same reference plane. Compute the pooled-within-species covariance matrix (W) by first computing the deviation from the mean in each class/group (individual homologs in our case) as: then computing the sum over the classes of the products of D as: Finally, compute the pooled-within covariance matrix: where S is the number of categorical variables describing the groups or species, ω is an instance where f(ω) corresponds to the class value of the instance, and is the mean of the variable i for individuals belonging to s. Finally, n is the sample size. Here, W contains the covariance matrix of the within-homolog (i.e. Molecular dynamic data). To estimate the evolutionary component of P, the between structures/species covariance matrix (B) has to be taken into account. B will be simply the difference between the V and W.

Estimating as proxy for fitness

The on each model for each protein was estimated using the command line version of FoldX [55]. It is important to notice that the computed is not comparable in proteins of different size, therefore we computed the average per residue as: r being the number of residues. With this as proxy for fitness we can try to explore the fitness surface. To do this, we used the first two principal components (PC) of a PC analysis of the shapes as X and Y axes; in the Z axis (S1 Fig). Is important to state here that the fitness is a relative quantity and depends on the objective of the analysis. Here we chose for ease of computation, but this function only measures structural stability. In proteins, function is the main selective trait. However, function in proteins depends on several properties of the structure such as the aforementioned stability, as well as the Michaelis constant (for enzymes), activation energy, free energy of the system, among others. In this manuscript we will assume that the fitness function we are modeling is stability, and therefore works as a good proxy for it.

Propensity to respond to selection

Arnold [56] showed that, despite high additive variances, G might not be aligned with the fitness surface. This implies that even though βλ can be non-zero, the response to selection might send the phenotype in a different direction than the fitness surface. Blows and Walsh [57] and Hansen and Houle [58] developed an approach to measure the angle between β and the predicted response to selection from the multivariate breeders equation, as: would be zero when there is no genetic constraint, whereas an angle of 90° would represent an absolute constraint [59]. In simpler terms, will tell us how responsive to selection the protein is. Let’s imagine we are trying to push one particular protein structure towards a higher fitness by selection. If is low (close to zero), most of our selection effort will result in a shift in the structures towards the desired goal. If is closer to 90°, a very small fraction of the selection effort will go towards the desired outcome. is therefore measuring the propensity of the structure to respond to selection.

Results and discussion

In Supplementary Material and Methods, and in Supplementary Results (S1 File) we have shown that the traditional PMM models and their Bayesian counterparts are not feasible when the number of traits and observations are in the order of those obtained in protein science when MD simulations are taken into account. Here, we applied a simple method to overcome this over-parameterization.

Overcoming over-parameterization: Approaching the G-matrix by means of the P-matrix

Given the previous results, the estimation of the G-matrix within the Lynch’s PMM is not feasible. This is not a new observation since in comparative evolutionary biology it is widely known that accurate measures of G are difficult or impossible to obtain [60]. This pattern is even more evident when dimensionality is high. On average, protein structures are composed of over 200 residues in a three-dimensional system, which means over 600 variables. Also, the sample size at the species level is typically small. For these reasons, a full and stable estimation of the G-matrix is not possible. However, an increased number of samples can be achieved by means of molecular dynamic simulations. This increases n considerably depending on the length of the simulation. We have shown the infeasibility of the GLMM to deal with the dimensionality and very large sample size. However, it has been shown that phenotypic (V) covariance matrices can be estimated with more confidence with large sample sizes [61]. It is also shown that in some cases, V can be used as surrogate for G when the two are proportional [60, 62]. To test this, we performed a shape simulation explained in S1 File. The simulation was performed with 500 replicates as molecular dynamics snapshots, 100 taxa, and the traits were varied from 2 to 1024 in a geometric series increase. Since the within-homolog matrix structure is known, a pooled-within covariance matrix (W) was computed as exposed in the section “Pooled-within group covariance matrix estimation”. Table 1 shows the feasibility and accuracy of the pooled-within species co-variance estimation method. Here the Cheverud’s Random Skewer (RS) test [61, 63] implemented in the R package phytools [64] were used to test the accuracy. A discussion of the appropriateness of the usage of this metric can be found in S1 File and references therein.

Table 1

Accuracy and feasibility of the pooled-within covariance estimation.

Traits	Time (secs.)	Memory (Mb)	RS_B		RS_W
Traits	Time (secs.)	Memory (Mb)	p-val	ρ	p-val	ρ
2	0.60	182.9	0.002	1.000	0.021	0.999
4	0.80	238.2	0.000	0.999	0.007	0.952
8	1.00	387.6	0.000	0.998	0.000	0.983
16	1.82	407.5	0.000	0.998	0.000	0.963
32	6.08	428.5	0.000	0.998	0.000	0.966
64	20.32	465.9	0.000	0.999	0.000	0.953
128	91.14	539.4	0.000	0.999	0.000	0.947
256	341.90	686.8	0.000	0.999	0.000	0.950
512	1342.36	982.2	0.000	0.999	0.000	0.938
1024	5268.82	1843.7	0.000	0.999	0.000	0.937

Accuracy and feasibility of the pooled-within covariance estimation.

Memory (Mb), time (sec) and accuracy (random skewer correlation) of the pooled-within covariance estimation approach. RS corresponds to the random skewer test for the phylogenetic covariance and RS to the dynamic component. Even with highly multivariate data (1024 traits), the memory requirement is manageable (less than 2 Gb), the evaluation is completed in under an hour, and the accuracy of the estimation is high. The estimated G-matrix is almost identical to the simulated matrix in most of the runs, and the estimated MD have over 0.97 correlated responses to random vectors than the actual MD. This is a surprising result since this method cannot completely separate the error terms from the genetic and dynamic components. However, the split of the error term between the two other components can render it negligible. Moreover, it seems that error does not significantly affect the structure of G and MD, allowing them to behave almost identically in comparison to the simulated counterparts. Given these results, and the fact that the application to real datasets can only be made with this approach, it is reasonable to keep using the described method from this point forward. However, the biological and evolutionary meaning of this approach is less clear than in the other methods since there is no explicit use of a phylogeny.

Meaning of the pooled within-structure covariance matrix

V-matrices can be used as surrogates of G-matrices in cases were they are proportional or sufficiently similar [65]. Proa et al. [65] showed that this assumption can be relaxed if the correlation between G and V ≥ 0.6. In protein structures, we can assume that given the strong selective pressures and long divergence times, the relationship between V and G is standardized. Assuming that this is true in protein structures, the estimated pooled variance-covariance (V/CV) matrices in real datasets might have a specific biological meaning. This was described in Haber [66] for morphological integration in mammals. Following Haber’s [66] logic, the within-structure/species (i.e. thermodynamic V/CV) matrix refers to integration of residues in a thermodynamic and functional manner. It also contains information about environmental factors affecting the physical chemistry of the structure. Haber [66] includes a genetic component for his estimation of the within population variation, since populations follow a filial design. Our data, on the other hand, have a controlled amount of genetic component given that the sampling is done in a time series instead of a static population. Our approach would be more related to an estimation of within repeated measures design. The among-structure/species (i.e additive or evolutionary V/CV) matrix refers to the concerted evolution of traits given integration and selection [66].

Response to selection in the GH13 family

As defined in Eq 2, the response to selection of a phenotype depends on the within-species change in mean due to selection, the correlation between different traits, and the amount of heritable component of the shape. The first component can be referred to as , and also known as the vector of selection gradients [67] or directional selection gradient. The second and third elements are summarized in the G-matrix. As expressed in Eq 2, this covariance matrix represents the genetic component of the variation in the diagonal, and the correlated response of every trait to each other in the off-diagonal. Another extension from Eq 2 is to compute the long-term selection gradient assuming that G is more or less constant over long periods of time: Here would be proportional to the differences in mean between two diverging populations. It is important to stress the relationship between these concepts and fitness. Given that fitness (w) is directly related to selection, its mathematical relationship can be expressed as [57], and so it behaves as the weight of a multiple regression of f on the vector of phenotypes z. In proteins, the definition of fitness is not trivial, and can vary depending on the hypothesis being tested. If the analysis is done comparatively (i.e. across different protein structures from different sources), a fitness analysis including exclusively structural measures, such as Gibbs free energy (ΔG), can be misleading. The fitness surface that can arise from this data would only represent departures from every individual native state. Nevertheless, ΔG and the energy of unfolding (ΔG°), are important measures to determine the stability of the protein which is important for the fitness of a protein structure. The stability of the structure allows it to perform a function and is therefore under selection because it is necessary for the particular biochemical function [68]. We are aware that there is a limitation to the protein structure stability role in fitness. To improve this fitness landscape, f can be defined by ΔG° coupled with a functional measure. In proteins, function is the main selective trait; therefore, including a term accounting for this would create a more realistic fitness surface. In enzymes this can be achieved by using the K/K for each of the enzymes for a common substrate. The fitness function (F) can be expressed as: where is the free energy of unfolding of the structure i, is the turnover number for structure i in substrate s, and is the Michaelis constant of protein i working on substrate s. In the case of the α-amylase family (GH13), one might try to apply the framework developed in previous sections and try to estimate the response to selection of a subset of them. However, Eq 11 cannot be applied since the information of the relative efficiency given a common substrate is not consistently available across all proteins in the dataset. For this reason we are going to work exclusively with , keeping in mind two caveats, 1) that only represents structural stability and 2) that it has been shown that ΔG or are not optimized for during evolution [69].

Estimating dynamic and genetic variance-covariance matrices in the α-amylase dataset

The structure depicted with the higher fitness was model 1 of the Taka-amylase A structure (PDB code 2TAA; EC 3.2.1.1; α-1,4-glucan 4-glucanohydrolase; henceforth referred to with its PDB code) (S1 Fig), from Aspergillus oryzae assuming as fitness. The model 1 of structure 2TAA can be assumed to be the result of the goal of selection. The realized response to selection can be defined as μ⊕ − μ0, where μ⊕ is the target or after-selection mean structure and μ0 is the starting or before-selection structure. To estimate it is essential to have the fitness defined based on the questions to be asked, given that the interpretation of the realized response to selection depends on it. In an engineering perspective, let’s assume that μ⊕ is the mean of a population of structures with the desired stability. On the other hand, μ0 is the mean of a population of structures created by a desired vector. One might ask the question of how does μ0 have to change towards the stability of μ⊕. This can be achieved by computing βλ (Eq 10), and replacing with . In the particular case of the GH13 dataset, let’s assume that the model 1 of the structure 2TAA is the desired phenotype (with the higher fitness in S1 Fig), and the model 643 of the α-amylase protein (PDB code: 4E2O; α-1,4-glucan-4-glucanohydrolase; EC 3.2.1.1; henceforth referred to by its PDB code) from Geobacillus thermoleovorans CCB_US3_UF5 (with the lower fitness in S1 Fig) corresponds to the source phenotype. We have selected the protein with the lowest stability (as ; S1 Fig), a functional yet truncated form of an α-amylase lacking both the N- and C-terminal domains [70], as the source phenotype and pose the hypothetical scenario in which we wish to improve its stability towards the more stable structure of the Taka-amylase A (PDB code 2TAA; EC 3.2.1.1; α-1,4-glucan 4-glucanohydrolase). The latter structure was shown to be the most stable (as per ; S1 Fig) of the set. In our framework, however, there is no requirement to use the extremes of the fitness distribution, as any gradient from the source to the target will suffice to asses the response to selection in that particular hypothesis. In the posed scenario, βλ would have a length corresponding to the dimensions of the shape. In the GH13 case 297 homologous residues were identified, which means that these shapes have a dimensionality of 891 traits. This dimension-per-dimension output is important since it reflects the amount of pressure in each dimension per each residue. However, it makes the visualization more difficult. For the sake of visualization simplicity, Fig 1 shows the absolute value of the sum of βλ per residue, standardized from 0 to 1.

Fig 1

rendered in the source structure 4E2O.

White represents the lowest magnitude (0), while red the highest (1). Blue depicts the non-homologous residues. A. Selection gradient on G. B. Dynamic gradient on M.

rendered in the source structure 4E2O.

White represents the lowest magnitude (0), while red the highest (1). Blue depicts the non-homologous residues. A. Selection gradient on G. B. Dynamic gradient on M. Fig 1A shows the selection gradient using the estimated G. Not surprisingly, the selection gradient for the TIM-barrel is low. This means that there is not much directional selection in this sub-structure. However, it is somewhat surprising that there is not any purifying selection either. This can be explained by the fixation of the trait in the evolution. Since the TIM-barrel is a widespread sub-structure that has been strongly selected during evolution, it might have reached a point of fixation of its geometry. Therefore, the G-matrix shows little covariation among these residues since the geometric variability is also low. It is important to stress here that the phenotype measured is the geometry of the structure more than that of the sequence. Therefore, despite some variation that may have occurred at the sequence level, it might not have meaningfully affected the positional information. However, one must be cautious with the approach employed in Fig 1 since the signs are overlooked, thereby ignoring the direction of selection and the correlated response to selection. Nevertheless, this approach allows for a coarse-grained visual exploration of . Individual instances identified by this method should be analysed afterwards in each dimension. Table 2 shows the actual values of βλ for the top 5 positive values (directional selection) and top 5 negative values (purifying selection).

Table 2

Selection gradient in the top 5 residues.

ResIndex	Residue	β_X	β_Y	β_z	Δz¯X	Δz¯Y	Δz¯Z
Directional
112	TYR	-5.225	1.082	11.138	-5.106	2.043	10.248
122	LYS	12.333	-2.321	-0.964	12.452	-1.360	-1.854
124	ASP	14.28	-6.963	-10.036	14.399	-6.002	-10.926
125	TRP	18.001	-0.984	0.336	18.121	-0.022	-0.554
126	PHE	11.53	-0.833	3.253	11.650	0.128	2.363
Purifying
80	HIS	-5.580	-2.148	4.023	-5.461	-1.187	3.13
121	THR	2.508	-4.644	-5.731	2.627	-3.683	-6.621
223	TYR	-0.010	-7.631	-7.634	0.110	-6.670	-8.524
358	SER	-8.647	-3.461	1.963	-8.527	-2.500	1.073
394	GLU	-4.561	-0.449	-4.002	-4.442	0.512	-4.892

Selection gradient in the top 5 residues.

Top panel shows the residues where at least one of its coordinates is under directional selection and the sum of their absolute values is the highest. Bottom panel contains the information of residues where at least one of its coordinates is under purifying selection, and the sum of the raw values are the lowest. Fig 1B and Table 3 show the mean difference between target and source when effects of correlated dynamic differentials are removed. Given that effectively G acts as a rotation matrix in Eq 10 to remove the selection differentials, one may posit that the same can be achieved with the dynamic (M) matrix. This concept is more difficult to interpret than the actual response to selection. Once G is replaced by M in Eq 10, we might call it dynamic gradient to differentiate it from the selection gradient already explained. In this case, if the gradient is zero for a given trait, this can be interpreted that the dynamic component of the phenotype does not contribute significantly to the difference in shape for that particular trait. In the case of non-zero gradients, these can be interpreted as contributions of the dynamics to the differential, either towards the target (positive gradient) or away from the target (negative gradient).

Table 3

Dynamics gradient in the top 5 residues.

ResIndex	Residue	β_X	β_Y	β_z	Δz¯X	Δz¯Y	Δz¯Z
Directional
117	LEU	13.028	37.149	11.848	2.130	3.521	4.437
125	TRP	29.019	33.605	6.857	18.121	-0.022	-0.554
126	PHE	22.548	33.755	9.774	11.650	0.128	2.363
262	LYS	12.972	38.081	11.412	2.073	4.454	4.001
367	LEU	13.590	34.561	15.609	2.692	0.933	8.197
Purifying
124	ASP	25.297	27.625	-3.515	14.399	-6.002	-10.926
223	TYR	11.008	26.958	-1.113	0.110	-6.670	-8.524

Dynamics gradient in the top 5 residues.

Top panel shows the residues where at least one of its coordinates is under positive gradient. Bottom panel contains the information of residues where at least one of its coordinates is under a negative gradient. In the GH13 subset, most dynamic gradients were positive having only two residues that had one coordinate under a negative gradient (Table 3). This can also be inferred by Fig 1B. The values of the dynamic gradient are high but sensible given the definition of fitness. Since we defined fitness as the energy of unfolding (ΔG°), most of the information used to select the target and source structures comes from stability, and therefore thermodynamic information. The results depicted in Table 3 and Fig 1B suggest that most of the variation that explains the difference in phenotype between the structure 4E2O and 2TAA is contained within the molecular dynamic component rather than the approximation to the phylogenetic component.

Orientation of G

The GH13 θ was 1.4 degrees, which means that the direction of optimal response is 1.4 degrees away from the total genetic variation of 99% explained by the projection. According to this, the Geobacillus thermoleovorans structure is susceptible to the selection in the actual direction of the fitness landscape towards the structure of Aspergillus oryzae to achieve maximum stability. The extent of such change is given by , which means that the centroid position of the residue i should be displaced by . In the case of the dynamics, the same approach can be taken. Here, θ was 1.5 degrees which means that the optimal dynamic response is 1.5 degrees away from the optimal response. This can be interpreted in a similar way as that of the regular θ. However, manipulating the structure along the dynamics gradient is not feasible. The GH13 dataset was 0.3. This means that the genetic constraints on 4E2O are not affecting the direction of selection. This posits the possibility that a strong directional selection will drive the source structure towards the target. The same pattern happens when this approach is applied to M. is 1.46 degrees, which is almost identical to θ. Thus, there are almost no within-variation or dynamic constraints to the vector of response given the dynamic gradient.

Concluding remarks

We have introduced the application of the approximation of comparative quantitative genetics framework, by means of a pooled-within group covariance matrix in a subset of the GH13 proteins, and demonstrated this application is feasible and provides sensible results, given the definition of fitness. This definition is essential in the interpretation of the results since it is the interpretation that gives polarity to . Therefore, all conclusions about the response to selection and the selection gradient itself must be analyzed under this light. The usage of M in the determination of the dynamic gradient could be controversial. This is due to the fact that, in the partition of the phenotypic variance, M is expected to be the environmental variance plus an error term. However, since the source data for the estimation of G and M come from repeated measures by MD, M contains information about the thermodynamics and folding stability of the protein. It is therefore also contributing to selection. It is important to stress the fact that this is an approximation to the true G and true M, since we have shown in previous sections that these cannot be estimated given the dimensionality of the phenotype. However, we have shown that the pooled-within group approach gives consistent results. We have also shown that, in a stability perspective, the TIM-barrel shows a small phylogenetic/genetic component to the selection gradient when a less stable structure (4E2O) is analyzed with respect to a more stable structure (2TAA). In an engineering perspective, this means that most of the changes in shape come from the dynamics. Nevertheless, the small shows that most of the changes applied to 4E2O would directly result in increasing the stability towards the structure 2TAA. 4E2O is a truncated protein, and therefore some loss of stability is expected. It seems that residues 112Y, 122K, 124D, 125W, and 126P, are good candidates to increase the stability of the molecule given their . In these cases, the goal will be to shift the position of their centroids by the resulting vector of the three dimensions. Ultimately, we have shown a framework to analyse protein structures’ response to selection. This framework have incredible potential in industry (protein engineering), structural biology, and evolutionary biology, since allow us to narrow the search space within a given fitness landscape and potentially predict the extent of the propensity of a protein structure to be selected towards a desired target.

Supplementary note.

Supplementary methods (A) and results (B) testing the feasibility of the of the linear mixed model and Bayesian mixed model implementations. (PDF) Click here for additional data file.

Fitness surface of the MD simulations in the 34 structures dataset.

Fitness in the Z axis is defined as . (PNG) Click here for additional data file.

49 in total

1. The phylogenetic mixed model.

Authors: Elizabeth A Housworth; Emília P Martins; Michael Lynch
Journal: Am Nat Date: 2004-01-28 Impact factor: 3.926

2. Stability and the evolvability of function in a model protein.

Authors: Jesse D Bloom; Claus O Wilke; Frances H Arnold; Christoph Adami
Journal: Biophys J Date: 2004-05 Impact factor: 4.033

3. Protein engineering of Bacillus acidopullulyticus pullulanase for enhanced thermostability using in silico data driven rational design methods.

Authors: Ana Chen; Yamei Li; Jianqi Nie; Brian McNeil; Laura Jeffrey; Yankun Yang; Zhonghu Bai
Journal: Enzyme Microb Technol Date: 2015-06-22 Impact factor: 3.493

4. How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS.

Authors: Paul C Lambert; Alex J Sutton; Paul R Burton; Keith R Abrams; David R Jones
Journal: Stat Med Date: 2005-08-15 Impact factor: 2.373

5. Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins.

Authors: Mark R Stam; Etienne G J Danchin; Corinne Rancurel; Pedro M Coutinho; Bernard Henrissat
Journal: Protein Eng Des Sel Date: 2006-11-02 Impact factor: 1.650

Review 6. Estimation of quantitative genetic parameters.

Authors: Robin Thompson
Journal: Proc Biol Sci Date: 2008-03-22 Impact factor: 5.349

7. In silico rational design and systems engineering of disulfide bridges in the catalytic domain of an alkaline α-amylase from Alkalimonas amylolytica to improve thermostability.

Authors: Long Liu; Zhuangmei Deng; Haiquan Yang; Jianghua Li; Hyun-dong Shin; Rachel R Chen; Guocheng Du; Jian Chen
Journal: Appl Environ Microbiol Date: 2013-11-08 Impact factor: 4.792