Literature DB >> 29062121

Sequence entropy of folding and the absolute rate of amino acid substitutions.

Abstract

Adequate representations of protein evolution should consider how the acceptance of mutations depends on the sequence context in which they arise. However, epistatic interactions among sites in a protein result in hererogeneities in the substitution rate, both temporal and spatial, that are beyond the capabilities of current models. Here we use parallels between amino acid substitutions and chemical reaction kinetics to develop an improved theory of protein evolution. We constructed a mechanistic framework for modelling amino acid substitution rates that uses the formalisms of statistical mechanics, with principles of population genetics underlying the analysis. Theoretical analyses and computer simulations of proteins under purifying selection for thermodynamic stability show that substitution rates and the stabilization of resident amino acids (the 'evolutionary Stokes shift') can be predicted from biophysics and the effect of sequence entropy alone. Furthermore, we demonstrate that substitutions predominantly occur when epistatic interactions result in near neutrality; substitution rates are determined by how often epistasis results in such nearly neutral conditions. This theory provides a general framework for modelling protein sequence change under purifying selection, potentially explains patterns of convergence and mutation rates in real proteins that are incompatible with previous models, and provides a better null model for the detection of adaptive changes.

Entities: Chemical

Mesh：

Substances：
Proteins

Year: 2017 PMID： 29062121 PMCID： PMC5701738 DOI： 10.1038/s41559-017-0338-9

Source DB: PubMed Journal: Nat Ecol Evol ISSN： 2397-334X Impact factor: 15.460

Introduction

Proteins continuously change as mutations are fixed or eliminated depending on their effect on the protein’s structure, stability, functionality, and intermolecular interactions. These holistic properties result from interactions between amino acids throughout the protein, inducing epistatic (non-additive) fitness interactions among mutations, and leading to long-term coevolution such that a substitution at one site alters the relative probability of substitutions at other sites1–9. Because of the complexity of epistatic interactions, it has been difficult to identify what determines substitution rates at a site, characterise how these rates depend on the rest of the sequence, and understand how they vary with time and location in the protein. Standard empirical substitution rate models neglect epistatic interactions and the resultant rate heterogeneity beyond simple scaling factors10–12. Although these models have had a major impact throughout the life sciences, they cannot estimate the effect of epistatic interactions on structural stability, function or fitness, predict the role of compensatory substitutions in protein evolution13,14, predict which of the 10% of deleterious mutations in humans are harmless in other species15, or accurately represent the rate and time dependence of convergence and homoplasy14. More complicated empirical models have been developed16–21, but their utility is limited by the large number of required parameters. More mechanistic substitution models that represent the underlying process of molecular evolution hold the promise of increased accuracies using fewer biologically meaningful parameters, but require a deeper understanding of the process of sequence change, especially the characteristics and effects of epistasis. We previously demonstrated that computational simulations of protein evolution, with fitness determined by thermodynamic stability, can reproduce many of the puzzling aspects of protein evolution including the rate- and time-dependence of convergence14 and the site- and time-dependence of substitution rates22. These models exhibit a phenomenon called the ‘evolutionary Stokes shift’6, the tendency for newly resident amino acids at a site to be stabilised, or ‘entrenched’8 over evolutionary time following a substitution. We also observed a tendency for the new amino acid to be pre-stabilised prior to the substitution by chance or contingency8,9. The pre- and post-adjustments of the protein to the new amino acid occur without corresponding changes in fitness, distinguishing this process from paired compensatory substitutions where the constituent substitutions have non-zero fitness effects23. Despite debate about the relationship between structural stability and fitness and the frequency of large reversals in amino acid propensities, the role of epistasis in protein evolution has become increasingly recognised1–9. To understand how selection for structural stability determines amino acid substitution rates, and what drives the pre- and post-substitution structural stabilisations in the absence of changes in fitness, we developed a mechanistic framework employing the formalisms of statistical mechanics. We find that average substitution rates in proteins selected for structural stability can be explained by the evolutionary equivalent of transition state theory, with fluctuations in amino acid preferences due to epistatic interactions representing an essential aspect of the substitution process. Just as entropy plays a preeminent role in statistical mechanics, the sequence entropy of folding, defined as the log of the number of possible sequences that fold with the evolutionarily determined degree of structural stability, is central to evolutionary mechanics. We test our mathematical approximations and predictions using computational simulations of protein evolution. We demonstrate that average substitution rates at a site can be predicted from site-specific structural stability distributions estimated in the absence of selection on that site and the dependence of the sequence entropy of folding on overall protein stability. The effect of other global factors such as effective population size, protein structure designability, and selective strength are combined in the entropy term and do not need to be considered or estimated separately.

Results

Site-specific stabilities and relative substitution rates

To develop a mechanical theory of protein evolution, we considered how purifying selection for stability determines site-specific substitution rates. To make the material more accessible, detailed equations underlying our results are in the Methods section; however, we endeavor here to provide an outline of the main concepts. Although real proteins are under selection for a range of properties, this specific form of selective pressure is well defined, theoretically tractable, and a common constraint. Analysing selection on stability also provides a ‘null model’ to examine other forms of selection. The stability Φ(X) of protein sequence X was defined as the negative of the free energy of folding, with more positive values indicating greater stability. The fitness m(X) = m(Φ(X)) was set equal to the probability that the encoded protein would be folded at thermodynamic equilibrium (Equation (1))6,24,25. Thus, increases in stability correspond to increases in fitness. Consider site k, occupied by amino acid α. The protein stability can be partitioned into two contributions Φ(X) = ϕ(X∌ ) + Φ(X∌ ). The first term includes interactions between the amino acid α resident at site k and those resident at other sites; in statistical mechanics, this term represents the system of interest. The second sequence-dependent term includes the much larger set of interactions amongst amino acids at all sites not including k, and corresponds to the thermodynamic bath in statistical mechanics. Both terms include interactions in the folded and unfolded states. For simplicity, we use Φ, ϕα and ΦBath to represent Φ(X), ϕ(X∌ ) and Φ(X∌ ), respectively. In the low mutation rate regime, the instantaneous substitution rate from resident amino acid α to new amino acid β is equal to the corresponding mutation rate times the fixation probability, which in our model depends on the impact of the two amino acids on the protein’s stability. We note that real and simulated proteins usually evolve within a narrow range of stabilities around an average value Φ̄24,26–29 where selection and the genetic drift of large numbers of slightly destabilising mutations balance30,31, dependent on factors such as temperature24 and effective population size. Under this condition, the change in fitness Δmα→β, and therefore the probability of fixation, is determined by the difference between ϕ and ϕ. Values of ϕ and ϕ will vary as substitutions occur in the rest of the protein, which will be affected by the amino acid occupying position k6. We assume other sites are sufficiently numerous and change sufficiently rapidly that the protein is always equilibrated with the amino acid occupying site k. The joint probability distribution of ϕ and ϕ when site k is occupied by α can be described by the stationary distribution ρ(ϕ, ϕ|α). The average substitution rate from α to β is completely determined by the mutation rate and this resident-dependent joint probability distribution. Because of its importance, we focus on characterising this joint distribution. With our approximations, all sequences with a specific ρ(ϕα, ϕβ|α) have identical fitnesses. ρ(ϕα, ϕβ|α) is then proportional to the number of sequences with specific values of ϕα and ϕβ; in analogy to Boltzmann’s description of entropy as the log of the number of microscopic configurations corresponding to a specified macroscopic state, we characterise the log of the number of sequences corresponding to specific values of ϕα, ϕβ, x = α and Φ̄ as the ‘sequence entropy of folding’ S(ϕα, ϕβ|α). This quantity is notably different from the ‘sequence entropy’ used to represent site-specific variability in a set of aligned sequences32. To explore and evaluate our theoretical analysis, we simulated the evolution of a 300-residue protein with fitness equal to the probability of the protein being folded at thermodynamic equilibrium, matching our theoretical model. These simulations are not meant to make specific quantitative predictions, but rather to predict general characteristics of evolutionary behaviour for proteins that require a native confirmation to carry out their biological function, and have demonstrated their ability to reproduce fundamental aspects of protein evolution6,24,25. By using a simple pair-contact model of protein thermodynamics, we were able to perform replicate simulations corresponding to about 5 billion years given typical eukaryotic substitution rates. We grouped sites with similar substitution patterns into four different site classes, with class 1 the most exposed and 4 the most buried. Figs. 1a-d illustrate ρℂ(ϕGlu, ϕLys|Glu) and ρℂ(ϕGlu, ϕLys|Lys), the joint probability distribution of site-specific contributions of glutamic acid and lysine when one or the other is resident, for sites belonging to site class ℂ = {1,2,3,4}. Corresponding distributions of population-scaled selective coefficients are shown in Supplementary Fig. S2. The distributions are broad, consistent with earlier results demonstrating that selective pressures vary over a wide range as substitutions occur elsewhere in the protein6. Exposed, rapidly evolving sites with few selective constraints (site class 1, Fig. 1a) have more compact distributions with smaller variances compared to buried, slowly evolving sites (site class 4, Fig. 1d). The potential contribution of an amino acid to protein stability is usually greater when that amino acid is resident at a site, a reflection of the ‘evolutionary Stokes shift’6. The amount of this increase appears correlated with the variance in ϕα.

Figure 1

Relative stabilities of amino acid pairs. The stability values of two amino acid residues (Res, labelled with standard three-letter abbreviations) are shown on the two axes (ϕ(Res)). In a-l, relative local contributions to stability are shown for glutamic acid and lysine in different site classes (a-d), or for different population sizes for site class 1 and 3 (e-h), or various amino acid pairs in site class 3 (i-l). Points were sampled when the amino acid on the x-axis was resident (green), when the amino acid on the y-axis was resident (pink), or during transitions between the two (yellow). In m-p, distributions of local contributions to stability in reference state are shown when the non-interacting null amino acid was present (ρ(ϕα, ϕβ|∅), pink), when the amino acid on the x-axis was present as predicted using Equation (7) (ρ̃(ϕα, ϕβ|α), cyan), or as observed (ρ(ϕα, ϕβ|α), green). Grey diagonal lines mark the boundaries of regions of near-neutral substitutions. These and all other stability values are in kcal mol-1.

The bivariate distributions are surprisingly independent of population size (Figs. 1e-h, S2), but highly dependent on amino acid pair (Figs. 1i-l, S2). Distributions for physicochemically similar amino acids (e.g., glutamic acid versus aspartic acid, Fig. 1I) are highly correlated, while those for dissimilar amino acids (e.g., glutamic acid versus alanine, Fig. 1J) are anti-correlated. A non-resident amino acid is generally stabilised if the distributions are correlated (e.g. ϕGlu is positive when aspartic acid is present), but destabilised if the distributions are anti-correlated (e.g. ϕGlu is negative when alanine is present).

Predicting relative substitution rates

Substitution rates should be predictable from knowledge of ρ(ϕ, ϕ|α). To test this, we modelled ρ(ϕ, ϕ|α) as a bivariate normal distributions and numerically integrated over these distributions to calculate substitution rates33–35 (Equation (2)). There is extremely good agreement between expected substitution rates and those obtained by counting substitutions that occurred during simulations, as shown in Figs. 2a-c. The population size independence of the predicted (Figs. 2a-c) and observed substitution rates (Fig. S1b) matches previous observations36, and arises from our use of a concave-down fitness function (Equation 1)37.

Figure 2

Comparison of predicted and observed substitution rates. Predicted rates (x-axis) and observed rates (y-axis) are shown for all pairs of amino acids separated by a single base change for all sites in the different site classes (Class 1, blue circles; Class 2, pink squares; Class 3, black triangles; Class 4, orange diamonds). In a-c: predicted substitution rates calculated by integrating over ρ(ϕα, ϕβ|α) for three different population sizes; d-f: Predicted substitution rates calculated using transition state theory (Equation (6)), which assumes only near-neutral substitutions occur; g-i: Predicted substitution rates calculated using transition state theory with parameters estimated using Equation (7).

We next investigated whether substitution rate calculations could be simplified by considering the dynamics of the substitution process. As described above, the values of ϕα and ϕβ vary as the rest of the protein sequence changes. A part of the evolutionary trajectory before and after a glutamic acid to lysine substitution is shown in Fig. 3. Prior to substitution when glutamic acid is resident, glutamic acid is generally stabilised by the evolutionary Stokes shift, while lysine is slightly destabilised, reflecting the physicochemical differences between these two amino acids. The pattern is reversed after the substitution. Strikingly, substitutions occur in a narrow region along the diagonal ϕGlu = ϕLys, where substitutions are nearly neutral (Figs. 1 and S2). These observations suggest the applicability of transition state theory (TST), a method for predicting the rate of chemical reactions38. TST focuses on how the energies of the reactants and products vary as the reactants undergo conformational fluctuations. The reaction occurs when the reactants are in a ‘transition state’ in which the energies of reactant and product are equal. The predicted reaction rate is equal to the probability that the reactants are in the transition state, times the conversion rate from reactants in the transition state to products.

Figure 3

Example of a trajectory before and after a substitution from glutamic acid to lysine. Stabilities on the x-axis and y-axis are shown as in Figure 1. Local contribution to stability when either is resident is shown for before (green) and after the substitution (pink) (green). Values during the substitution shown in yellow; beginning and end points are shown as black circles. The observed distributions over the simulations when glutamic acid or lysine is resident shown as shaded region. Grey diagonal lines mark regions of near-neutral substitutions.

Adapting this theory, the substitution rate from α to β was estimated from the probability that the protein is in the nearly neutral region (the ‘transition state’) times the rate of neutral substitution. The width of the neutral zone is approximately where γ describes γ the dependence of the number of sequences on protein stability (ρ(Φ)~eγ), which can be estimated by the relative numbers of destabilising and stabilising mutations. We obtained a closed-form expression for substitution rates (Equation (6)) that produces strikingly accurate substitution rate predictions (Fig. 2d-f). Notably, because this calculation considers only neutral substitutions, Kimura’s fixation probability formula is no longer needed, greatly simplifying the calculations.

The equilibrium distributions of site-specific stabilities and the evolutionary ‘Stokes Shift’

As described above, the rate of amino acid substitutions is determined by ρ(ϕ, ϕ|α) in the region where ϕ ≈ ϕ. A full mechanistic description requires that we understand how these distributions, and therefore the substitution rates, are determined; we approach this goal using the principles of statistical mechanics and sequence entropy of folding. ρ(ϕα, ϕβ|α) reflects the number of sequences corresponding to specified values of ϕα, ϕβ, x = α and Φ̄. We approximate ρ(ϕα, ϕβ|α) by the product of two terms, ρLoc(ϕα, ϕβ) × ρBath(ΦBath = Φ̄ – ϕα). The local term ρLoc(ϕα, ϕβ) represents the fraction of sequences with site-specific ϕα and ϕβ, while the second term ρBath(Φ̄ – ϕα) represents the fraction of sequences where the bath interactions provide sufficient contributions to the stability so that ϕα + ΦBath = Φ̄. We approximated the first ‘absolute’ reference term by performing simulations in which a non-interacting amino acid ∅ was fixed at that site and all other sites were allowed to evolve under selection, as before. We then calculated the values of ϕα and ϕβ that would result if amino acids α and β were substituted for ∅ in the resulting sequences. Interactions involving the focal amino acid represent a small fraction of total stability contributions, so the second term ρ(ΦBath) was approximated by the distribution of protein sequences with total stability Φ, ρ(Φ)~eγ. The product of these distributions suggests that the evolutionary Stokes shift alters the average value of ϕα by an amount where is the variance in the distribution of ρ(ϕα|∅), while the variance itself is unaltered (Equation (7)). We can understand this shift by comparing the relative contributions of ϕα and ΦBath to Φ̄. Increasing values of ϕα decrease the value of ΦBath necessary to fulfill ϕα + ΦBath = Φ̄. As the number of possible sequences increases rapidly with decreasing ΦBath, there is a strong bias towards increased values of ϕα. This stabilisation resulting from the large increase in sequence entropy of folding with decreasing ΦBath is precisely the evolutionary Stokes shift. The predicted distributions of ρ(ϕα, ϕβ|α) versus those observed in thermodynamic simulations are shown in Figs. 1m-p. Estimated ζα|α values match observations surprisingly well given the approximations made (Fig. 4a-c). As predicted, the entropic stabilisation is approximately linear with , with slope close to the estimated value of γ = 1.26 (kcal mol-1)-1 (Fig. 4d-f), confirming the trends evident in Fig. 1. The observed entropic stabilisation is smaller than predicted for the largest shifts in the two slowest rate classes, involving the charged lysine, arginine, aspartic acid and glutamic acid. Earlier work demonstrated that equilibration for the most buried states can be extremely slow6, so deviations may result from insufficient time to adjust to the presence of the new amino acid.

Figure 4

Accuracy of site-specific stability and evolutionary Stokes shift predictions. Observed (y-axis) versus estimated values of the evolutionary Stokes shift (ζα|α, x – axis) are shown in a-c for all four site rate classes (Class 1, blue circles; Class 2, pink squares; Class 3, black triangles; Class 4, orange diamonds), for three different population sizes. The linear relationship between the observed evolutionary Stokes shift (y-axis) and the variance in amino acid-specific stability contributions in the absence of selection on the site are shown in d-f. The lines shown are theoretical predictions with γ = 1.26. Outliers are identified.

In earlier work, we described how the evolutionary Stokes shift results in stabilisation of amino acids that are similar to the current resident, and destabilisation of amino acids that have large physicochemical differences. The basis of this effect, according to our theory, is that the presence of α at the site shifts values of ϕβ by ζβ|α = γ φαβ|∅ σα|∅ σβ|∅, where φαβ|∅ is the correlation between ϕα and ϕβ in ρ(ϕα, ϕβ|∅); these shifts can be to higher or lower values depending on whether the physicochemical properties of the amino acids are similar or different (positive or negative φαβ|∅, respectively), increasing or decreasing the density of the distribution in the region ϕα ≈ ϕβ and the corresponding substitution rate. Substitution rates estimated with the TST approximation (Equation (6)) and site-specific stabilities calculated using Equation (7) are remarkably accurate for all four site classes over four orders of magnitude (Fig. 2g-i).

Discussion

The probability of fixation of an amino acid-altering mutation in proteins selected for stability depends on the relative contributions to stability made by the resident (ϕα) and mutant amino acid (ϕβ), and the effect of the resulting stability change (ϕβ−ϕα) on organismal fitness33–35. The situation is complicated by epistatic interactions connecting the mutating site to other sites throughout the protein, resulting in fluctuations in these contributions and corresponding fixation probabilities as substitutions occur throughout the protein. Surprisingly, this apparent complication leads to a simplification; substitutions that occur when stability contributions are similar dominate the evolutionary process, and the nearly neutral zone can be modeled using transition state theory. This represents a fundamental change in understanding of substitution rates; the focus shifts from estimating the fitness change resulting from a substitution to calculating the fraction of the time that substitution is nearly neutral. Fluctuations in stability contributions cannot be ignored because they create the conditions under which substitutions occur. The frequency of nearly neutral conditions depends on the joint distribution of ϕα and ϕβ. Amino acids with similar physicochemical properties make correlated contributions to stability, increasing the probability of near-neutrality resulting in higher rates of conservative change, a phenomenon first described by Fisher39. The multiplicity of interactions at buried sites increase the variances of ϕα and ϕβ, reducing the probability of near neutrality and thus the substitution rate (Figs. 1a-1d), consistent with observed slower substitution rates observed at internal sites. The joint stability distribution is affected by the tendency for the resident amino acid (and similar amino acids) to be stabilised by substitutions at other sites, the ‘evolutionary Stokes shift’6. This shift can be understood purely in terms of biophysics and the sequence entropy of folding; increases in the stabilising contributions of resident amino acids reduce the amount of stabilisation required from interactions amongst the remaining amino acids (the ‘bath’). Because more sequences are able to fulfill this reduced stabilisation requirement, the contributions of the bath to the sequence entropy of folding is larger, and higher affinities for the resident amino acid are entropically favoured. Although describable as an adjustment, this evolutionary mechanism can be fully reversible, as are the simulations described here, with similar processes of moving into and away from the neutral zone6. These processes, called ‘contingency’ and ‘entrenchment’ by Plotkin and colleagues8, are mirrors of each other, so that a tape of the dissipation (entrenchment) process, played backwards, would have the same statistical properties as the pre-adaptation (contingency) process played forwards. The predicted average substitution rates for sites under purifying selection for stability can be estimated solely based on the mutation rate and the joint distribution of ϕα and ϕβ, with no adjustable parameters. Thus, when we show that, as long as the assumptions and approximations of the analysis remain valid, ρ(ϕα, ϕβ) depends only on details of the protein structure and function (which affect ρ(ϕα|∅) and γ), we can infer that Kimura’s formula is not required to predict and explain substitution rates amongst amino acids. Although evolutionary mechanics theory fully incorporates population genetics theory, substitution rates and the evolutionary Stokes shift do not depend on population size (Figure 2, Figure 4), despite its centrality to Kimura’s formulation. Here, we addressed only the theoretical predictions and simulations near equilibrium. Some discrepancies between the predicted and observed Stokes shifts for charged residues in buried sites, however, may be explained by inadequate time for equilibration. Individual sites at specific time points may be constrained by conserved neighbouring sites as well as conserved structural features. Such effects may influence the time-dependent probability of back and subsequent substitutions, an important topic for investigation. The interaction of fluctuating selection and fluctuating population size also requires further investigation. We have considered purifying selection with fixed population size and fitness based purely on folding stability. We and others have previously shown that evolutionary simulations based on protein thermodynamics produce patterns of epistasis, convergence, and entrenchment that are qualitatively similar to patterns in real proteins4–7,14,40; we now have a clear explanation why these patterns may have been produced, from statistical mechanics considerations. Selection for other properties involving contributions from multiple amino acid sites would define their own nearly neutral landscape. We argue that other forms of constant selection such as interactions with other proteins, ligand binding, catalysis, and avoiding proteolysis and aggregation should restrict the number of acceptable sequences but not otherwise affect the theory or calculations. Other selective pressures may force occasionally adaptive, non-neutral substitutions when external pressures change. An evolutionary Stokes shift would still occur following such a forced substitution, but the process would no longer be reversible6. The simple theory described here provides an improved conceptual basis to understand what would happen in the absence of further complications, and should allow more confident prediction of non-structural functional constraints, adaptation, and fluctuating population sizes when they do occur. Analogous models can be applied to other forms of selection acting at higher levels such as development41. The work described here establishes a theory of evolutionary mechanics, and simulations demonstrate that this theory can be used to predict substitution rates from the basic properties of how amino acids interact. Assuming that most proteins under constant selective pressures are dominated by thermodynamic (or similar) processes, we provide a mechanistic explanation for the known predominance of nearly neutral evolution and a better understanding of what happens during purifying selection. As with statistical mechanics and thermodynamics, the theory of evolutionary mechanics allows us to connect the microscopic events of evolutionary mechanics (mutation rates, fitness differences, and fixation probabilities) with the macroscopic events of molecular evolution (relative rates of substitution, and distributions of fluctuating rates across sequences and over time).

Methods

Simulations of protein evolution

The methods used to simulate protein evolution have been described previously6,24,25. Our simulations modelled proteins evolving under selection for a common requirement for globular proteins, stability of the native conformation. The free energy (X, r) of a protein sequence X = {, , … } in conformation r was calculated by summing the pairwise energies of amino acids in contact in that conformation, using the contact potentials derived by Miyazawa and Jernigan42. The free energy of folding ΔFolding(X) was computed by first determining the free energy of the sequence in a pre-chosen native state, the conformation of the 300-residue purple acid phosphatase, PDB 1QHW43. The energies of unfolded states were assumed to follow a Gaussian distribution; the parameters characterising this distribution were estimated by calculating the free energies of the sequence in a widely diverse set of 55 different protein structures. The energy of the unfolded state was then calculated by assuming a large set (10160) of possible unfolded structures with free energies drawn from this distribution. The free energy of folding Δ(X) was calculated as the expected difference between the two, and stability was Φ(X) = −Δ(X). The Malthusian fitness of a sequence m(X) was defined as the fraction of that sequence that would be folded to the native state at equilibrium where T is temperature in units of energy, 0.6 kcal mol-1. The simulations implemented a Gillespie algorithm44 representing the evolution of a protein in the low mutation rate limit where the monomorphic population is represented by a single sequence. Starting from a single randomly chosen nucleotide sequence encoding a 300 amino-acid protein, we simulated evolution by considering in each step all possible nucleotide mutations with rates given by the K80 nucleotide model (κ = 2)45. The fixation probability of each mutation was calculated based on the Kimura formula for diploid organisms33–35, where X and X′ are the sequences before and after the mutation. The total substitution rate was set equal to the product of the mutation rate times the fixation probability, summed over all possible mutations. At each step, the evolutionary time was advanced by an amount chosen from an exponential distribution based on the total substitution rate, and one substitution was chosen to be fixed at random with relative probabilities determined by the product of the mutation rates times the acceptance probabilities. Sequence evolution was simulated for a sufficient number of generations such that protein stability was roughly constant, representing mutation-selection-drift selection balance. 100 equilibrated proteins were chosen, and two longer simulations were performed using each these equilibrated proteins as initial starting sequences, for a total of 200 simulations. The evolution of each lineage was simulated for an evolutionary distance of approximately seven amino acid replacements per amino acid position. The sequence and energy were sampled at regular time intervals.

Grouping of sites

For ease of analysis, we divided protein sites into four classes with similar substitution rates. Substitution matrices were calculated individually for each site; due to the length of simulations, we had on average over 1400 substitutions at each site. Sites were then clustered based on the off-diagonal elements of the substitution matrices using K-means clustering46,47. The resulting clusters were approximately equal in size, and class membership strongly depended on how buried or exposed sites were in the native state (as indicated by number of contacts). We ranked clusters by surface exposure, where class 1 is the most exposed and 4 is the most buried.

Calculating the site-specific contribution to protein stability

The site-specific contribution ϕ(X∌ ) of amino acid α at focal site k as a function of the amino acids X∌ at all sites excluding k is equal to Φ{x1, x2, x3 … x, α, x … x}, the stability when the focal site is occupied by α, minus Φ{x1, x2, x3 … x, ∅, x … x}, the stability of a reference state when α is replaced by a non-interacting amino acid ∅, with the rest of the sequence and thus all other interactions unchanged. The part of the stability unaffected by this replacement is represented by the ‘bath’ interactions Φ(X∌ ) = Φ(X) − ϕ(X∌ ) so that Φ(X) = ϕ (X∌ ) + Φ (X∌ ).

Determining the change in fitness

Prior to a mutation, when amino acid α is resident, the protein stability is equal to Φ = ϕα + ΦBath with corresponding fitness m(Φ), given by Equation 1 (Methods). After a mutation to amino acid β, the stability is equal to Φ′ = ϕβ + ΦBath = Φ + ϕβ – ϕα, corresponding to fitness m(Φ′) = m(Φ + ϕβ – ϕα), where we have used the fact that ΦBath is unchanged by the mutation. The situation is complicated by the non-linear relationship between fitness and stability (Equation 1, Methods), but can be greatly simplified by noting that real proteins, as well as proteins from this and other evolutionary simulations under purifying selection for thermostability, evolve within a narrow range of stability values around an average value Φ̄24,26–29; see Supplementary Fig. S1. This narrow stability range occurs where the effectiveness of selection for greater stability is balanced by large numbers of slightly destabilising mutations fixed by genetic drift30,31. We therefore approximate the protein’s stability prior to the mutation as equal to Φ = Φ̄; the resulting change in fitness is then equal to Δmα→β = m(Φ̄ + ϕβ – ϕα) – m(Φ̄). The value of Φ̄ depends on factors such as temperature24, effective population size (as shown in24 and Fig. S1), and protein structure and function, but will be constant as long as these factors are approximately constant. With these assumptions, the change in fitness and thereby the probability of fixation of the mutation is therefore determined by the difference between the current values of ϕα and ϕβ. While the total stability value of Φ̄ is a constant, the manner in which this stability is distributed amongst the various interactions, and therefore the values of ϕα and ϕβ as well as the corresponding substitution rate, will vary as substitutions occur along the rest of the protein sequence. The nature of this variation depends on which amino acid occupies position k because that amino acid affects the evolution in the rest of the protein6. In order to compute the estimated average substitution rate, we assume that the other sites are sufficiently numerous and change sufficiently rapidly that the protein is always fully adjusted to the current amino acid at site k. (This assumption is most likely to break down following non-conservative substitutions, as discussed below.) The joint probability distribution of ϕα and ϕβ given total stability Φ̄ and the occupation of site k by amino acid α can then be described by the stationary distribution ρ(ϕα, ϕβ|Φ(X) = Φ̄, x = α), which we simplify to ρ(ϕα, ϕβ|α).

Calculating the substitution rate integrating over distributions of local contributions

The average rate for substitution α → β at site k, Q, is equal to the neutral substitution rate υ times the average probability of fixation, which is a function of the stability of the protein before and after the substitution. The standard deviation of observed values of protein stabilities Φ, for example 0.71 kcal mol-1 for population size N set to 106, was small compared with the range of values of ϕ∌ ), allowing us to represent the distribution Φ by its average, Φ ≃ Φ̄ = 9.26 kcal mol-1. We assumed that the stability before the substitution was Φ̄ and afterwards was Φ̄ + (ϕ∌ ) – ϕ∌ )). The average substitution rate was then estimated as where Δm(ϕ, ϕ) = m (Φ̄ + (ϕ, – ϕ)) – m(Φ̄) and ρ(ϕ, ϕ|x = α) is the joint distribution of ϕ(X∌ ) and ϕ(X∌ ) for the equilibrium distribution of sequences X∌ when α occupies site k, and we use the fixation probabilities assuming small values of 2NΔm. Based on observations in Fig. 1, ρ(ϕ, ϕ|x = α) was modeled as a bivariate normal distribution of the form Parameters were calculated directly from evolutionary simulation, and Equation (3) integrated numerically. The neutral substitution rate was calculated using the same K80 nucleotide model (κ = 2)45 as used in the simulation, with all non-nonsense codons considered equally likely.

Calculating the substitution rate integrating assuming only neutral substitutions

As observed in Fig. 1, substitutions generally occur in a neutral region in which ΔΦ = ϕ∌ ) – ϕ∌ ) ≈ 0, so that This condition is satisfied in a band of width 2ε centred on ϕ(X∌ ) – ϕ(X∌ ), where ε represents a deviation from strict neutrality that is sufficiently close for Equation (4) to be sufficiently accurate. A natural scale for ε was obtained by considering the ‘free fitness’ Γ(Φ) of the protein equal to where S(Φ) is the sequence entropy of folding, equal to the log of the number of sequences corresponding to a given total stability Φ48,49. Free fitness is analogous to thermodynamic free energy but with temperature T replaced by 4N, and encompasses contributions from both fitness and sequence entropy to determine the distribution of states; evolutionary dynamics moves towards maximising this quantity. As the stability represents the sum of many small interactions, we would expect the distribution of stabilities to obey the central limit theorem and to resemble a Gaussian distribution. We are, however, on the tail of the distribution where the Gaussian is indistinguishable from an exponential, with one additional unidentifiable parameter. We instead assume S(Φ) = ln(Ω0 e−γΦ) where Ω0 is constant. Noting that the system is at equilibrium with when Φ = Φ̄, it can be demonstrated that a good approximation at the mode is Thus, γ defines the rate of change of the population-weighted fitness 4N(Φ) with stability. Alternatively, a change in stability of corresponds to a unit change in population-weighted fitness. In our calculations, we equated the estimation of γ is described below. Note that this calculation demonstrates that ε is, surprisingly, independent of effective population size N. This is a result of the balance between selection and mutational drift at equilibrium; for fixed effect of mutational drift, the degree of selection adjusts to changes in effective population size so that their product is constant36,37. If we assume that ρ(ϕ, ϕ|x = α) is broader than ε, and that Equation (4) is satisfied, Recalling that the marginal probability density for a multivariate normal distribution is still normal, Equation (3) becomes where δ(ϕ – ϕ) is the Dirac delta function. For highly similar amino acids the entire distribution of ρ(ϕ, ϕ|x = α) may be contained in a region significantly narrower than the neutral zone, resulting in an overestimation of Q > υα→β. For this reason, the estimated rate was capped at the neutral rate υα→β.

Estimating ρ(ϕ, ϕ)

As described in the Results section, we approximate ρ(ϕ, ϕ|x = α) as the product of two terms, ρLoc(ϕ, ϕ) × ρBath(Φ = Φ̄ – ϕ), where ρLoc(ϕ, ϕ) represents the fraction of sequences with given values of ϕ and ϕ independently of how the rest of the protein adjusts to the current amino acid resident at site k, while ρBath(Φ = Φ̄ – ϕ), represents the fraction of sequences where the bath interactions contribute sufficiently to the stability so that ϕ + Φ = Φ̄. ρLoc(ϕ, ϕ) was approximated by ρ(ϕ, ϕ|x = ∅), the observed distribution observed when site k was occupied by a non-interacting amino acid ∅. We assumed that the contribution to the stability was small and approximated the distribution of Bath contributions with the distribution of total protein stabilities, ρBath(Φ = Φ̄ – ϕ) ≃ ρΦ(Φ = Φ̄ – ϕ) ∝ exp (−γ(Φ̄ – ϕ)). Because the number of possible sequences is immense, and because ϕ and ϕk,β are the result of many interactions, the central limit theorem suggests that ρ(ϕ, ϕ|x = ∅) can be approximated by a bivariate normal distribution The normalised product of ρ(ϕ, ϕ|x = ∅) and ρBath(Φ = Φ̄ – ϕ) ∝ exp (−γ(Φ̄ – ϕ)) results in an estimated shifted bivariate normal distribution with Substituting these results into Equation (6) yields

Characterising the bath state distribution

As described above, we assume that the number of protein sequences with a given value of Φ in the range of interest around Φ = Φ̄ is approximately exponential Ω(Φ)~ e−γΦ. We estimated γ from the average change in stability resulting from random mutations, 〈ρmut(ΔΦ)〉, which is negative due to the greater number of sequences coding for proteins with lower stability. This suggests that by correcting for the dependence of Ω on Φ by multiplying ρmut(ΔΦ) and eγΔΦ, this bias would disappear. We adjusted γ so that 〈ΔΦeγΔΦ〉 = 0 where the average was over all possible mutations during the simulations with N equal to 106, yielding γ = 1.26 (kcal mol-1)-1. The bath state distribution determines the equilibrium stabilities through Equation (5). Substituting Equation (1) into Equation (5) yields . This expression results in estimations for Φ̄ of 6.53, 9.26, and 12.05 for N equal to 104, 106, and 108, respectively. These agree well with the average of the distributions shown in Supplementary Figure S1: 6.40, 9.15, and 11.90. We note that under this model, the population scaled fixed load 2N(1 – m(Φ̄)) is equal to that is, it only depends on the dependence of the sequence entropy on the stability and the temperature. For our system, 2N(1 – m(Φ̄)) ≈ 0.38.

Data and Code Availability

Data, including structures used, contact potentials, tables of outcomes, and raw program data output, is available on Dryad (doi:10.5061/dryad.7b8vb). All simulations and analysis software is available on GitHub (https://github.com/EvolutionaryMechanics/Goldstein-Pollock-2017).

40 in total

1. On the probability of fixation of mutant genes in a population.

Authors: M KIMURA
Journal: Genetics Date: 1962-06 Impact factor: 4.562

2. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process.

Authors: Nicolas Lartillot; Hervé Philippe
Journal: Mol Biol Evol Date: 2004-03-10 Impact factor: 16.240

3. Amino acid coevolution induces an evolutionary Stokes shift.

Authors: David D Pollock; Grant Thiltgen; Richard A Goldstein
Journal: Proc Natl Acad Sci U S A Date: 2012-04-30 Impact factor: 11.205

Review 4. Understanding protein evolution: from protein physics to Darwinian selection.

Authors: Konstantin B Zeldovich; Eugene I Shakhnovich
Journal: Annu Rev Phys Chem Date: 2008 Impact factor: 12.703

5. Statistical mechanics of convergent evolution in spatial patterning.

Authors: Bhavin S Khatri; Tom C B McLeish; Richard P Sear
Journal: Proc Natl Acad Sci U S A Date: 2009-06-02 Impact factor: 11.205

6. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene.

Authors: R Nielsen; Z Yang
Journal: Genetics Date: 1998-03 Impact factor: 4.562

7. Extensively Parameterized Mutation-Selection Models Reliably Capture Site-Specific Selective Constraint.

Authors: Stephanie J Spielman; Claus O Wilke
Journal: Mol Biol Evol Date: 2016-08-10 Impact factor: 16.240

8. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome.

Authors: S V Muse; B S Gaut
Journal: Mol Biol Evol Date: 1994-09 Impact factor: 16.240

9. Strong evidence for protein epistasis, weak evidence against it.

Authors: David D Pollock; Richard A Goldstein
Journal: Proc Natl Acad Sci U S A Date: 2014-03-20 Impact factor: 11.205

10. The tangled bank of amino acids.

Authors: Richard A Goldstein; David D Pollock
Journal: Protein Sci Date: 2016-05-12 Impact factor: 6.725

15 in total

1. mtProtEvol: the resource presenting molecular evolution analysis of proteins involved in the function of Vertebrate mitochondria.

Authors: Anastasia A Kuzminkova; Anastasia D Sokol; Kristina E Ushakova; Konstantin Yu Popadin; Konstantin V Gunbin
Journal: BMC Evol Biol Date: 2019-02-26 Impact factor: 3.260