Literature DB >> 34218472

A thermodynamic model of protein structure evolution explains empirical amino acid substitution matrices.

Christoffer Norn¹, Ingemar André¹, Douglas L Theobald².

Abstract

Proteins evolve under a myriad of biophysical selection pressures that collectively control the patterns of amino acid substitutions. These evolutionary pressures are sufficiently consistent over time and across protein families to produce substitution patterns, summarized in global amino acid substitution matrices such as BLOSUM, JTT, WAG, and LG, which can be used to successfully detect homologs, infer phylogenies, and reconstruct ancestral sequences. Although the factors that govern the variation of amino acid substitution rates have received much attention, the influence of thermodynamic stability constraints remains unresolved. Here we develop a simple model to calculate amino acid substitution matrices from evolutionary dynamics controlled by a fitness function that reports on the thermodynamic effects of amino acid mutations in protein structures. This hybrid biophysical and evolutionary model accounts for nucleotide transition/transversion rate bias, multi-nucleotide codon changes, the number of codons per amino acid, and thermodynamic protein stability. We find that our theoretical model accurately recapitulates the complex yet universal pattern observed in common global amino acid substitution matrices used in phylogenetics. These results suggest that selection for thermodynamically stable proteins, coupled with nucleotide mutation bias filtered by the structure of the genetic code, is the primary driver behind the global amino acid substitution patterns observed in proteins throughout the tree of life.

Entities: Chemical

Keywords: amino acid substitution; exchangeabilities; protein evolution; protein stability; replacement matrices

Mesh：

Substances：
Proteins

Year: 2021 PMID： 34218472 PMCID： PMC8442976 DOI： 10.1002/pro.4155

Source DB: PubMed Journal: Protein Sci ISSN： 0961-8368 Impact factor: 6.725

INTRODUCTION

Protein amino acid sequences change due to spontaneous mutations at the DNA level. Amino acid exchange rates depend not only on the background mutation rate, but also on how the mutation at the protein level impacts overall organismal fitness. Empirical models that incorporate global amino acid substitution matrices have wide‐ranging applications in bioinformatics and evolutionary science, including homology search, phylogenetics, ancestral sequence reconstruction, and the prediction of functional residues. While such empirical models have tremendous practical utility, the importance of various fitness pressures in shaping the patterns of amino acid substitutions is still unknown. Amino acid mutations can impact fitness by modifying protein properties such as catalysis, binding, expression, aggregation, non‐specific interactions, and protein stability, the latter of which is known to be a major selection pressure at most sites for all proteins., , , Here we develop from first principles a biophysical and genetic model that predicts amino acid substitution rates by assuming that the fitness effects of amino acid mutations arise from changes in protein stability alone. We then evaluate the ability of thermodynamic stability to explain empirical amino acid substitution patterns. The prevailing phylogenetic models of protein evolution explicitly describe amino acid substitution processes using instantaneous rate matrices whose parameters are inferred from observed sequences using statistical methods. Substitutions are modeled as an aggregated Markov process in which sites evolve independently with a substitution rate that depends on the identity of the current amino acid. All widely used phylogenetic programs (e.g., PhyML, IQ‐TREE, RAxML, MrBayes, BAli‐Phy, and PhyloBayes) incorporate global substitution (or exchangeability) matrices as a key component of the default evolutionary model. Some of the most common and successful global matrices include the JTT, WAG, and LG exchangeability matrices. Similarly, NCBI blastp uses the BLOSUM62 matrix by default for remote homology detection. Global substitution matrices are empirically constructed by averaging rate processes over numerous sites in many proteins. These global matrices contain amino acid rate information that has been incredibly useful and effective in practice, and they are essential to modern phylogenetics and computational biology. All these global amino acid substitution matrices have highly similar, consistent patterns that call out for an explanation. The pairwise amino acid rate constants (known as the “exchangeabilities”) in global substitution matrices are a function of the inherent physical properties of amino acids and folded proteins, and hence they should be predictable from first principles. To understand the origin of the variation in global amino acid exchangeabilities, several studies have established correlations with biophysical amino acid descriptors such as hydrophobicity, secondary structure propensity, charge, and codon table structure., , , While these phenomenological correlations help identify factors that influence amino acid substitutions, biophysical and genetic models are required for a mechanistic understanding of the underlying fitness constraints that give rise to the empirical substitution patterns. Over the past two decades, mechanistic models that combine evolutionary dynamics and protein biophysics have led to significant advances in our understanding of the causes of evolutionary rate variation in proteins., , , Biophysical models of rate variation typically treat molecular fitness as dominated by fold stability or the stability of an active structure. Evolutionary trajectories simulated with biophysical fitness models have shown how fluctuations in the structural environment within proteins influence substitution rates and produce epistatic effects between sites. However, such structure‐based evolutionary models have yet to be applied to understand the origin of global amino acid substitution patterns. Building on this previous work, here we develop a biophysical model of amino acid substitution rate variation to investigate the evolutionary and thermodynamic basis for global substitution patterns in proteins. Since global substitution matrices are constructed by averaging over protein sites, our theoretical analysis must necessarily also average over sites to offer a plausible explanation of these ubiquitous global substitution patterns. We first calculate the thermodynamic effects of amino acid mutations in native protein structures and combine a fitness function solely based on protein stability with a position‐specific codon‐level model of sequence evolution to predict global amino acid substitution rates. Using this model, we find that our calculated amino acid substitution rates are strongly correlated with experimental values described by global empirical substitution matrices such as the widely used LG matrix. Most of the empirical amino acid substitution pattern can be explained wholly by mutation combined with selection pressure to maintain thermodynamically stable protein structures.

RESULTS

A parametric all‐atom thermodynamic model of protein evolution

We selected a non‐redundant, curated set of 52 high‐quality protein structures for the basis of our analysis., These 52 protein structures were subjected to a computational analog of saturated mutagenesis. For each site in every protein, we calculated the thermodynamic effect (ΔΔG) of all possible amino acid mutations with the Rosetta macromolecular modeling suite. We assume that fitness is proportional to the fraction of folded protein, which is determined by the ΔΔG value, and then calculate the probability of fixation in a finite population using the Kimura equation, (Figure 1). A global amino acid substitution matrix can then be constructed from these amino acid fixation probabilities and a codon‐level model of nucleotide mutation. There are only four free parameters in this mechanistic evolutionary model: (a) the free energy of the native protein, ΔG nat, (b) the effective population size, N , (c) the nucleotide transition rate, κ, and (d) a whole‐codon rate parameter, ρ. Values of these four parameters are required to calculate a substitution matrix, because we do not know the values of these parameters a priori, we optimize them with the method of maximum likelihood. The end‐result of this method is an amino acid exchangeability matrix (189 free exchangeability rate constants) and an amino acid equilibrium frequency vector (19 free probabilities) that has been fit using only three or four free parameters, each with a clear physical and evolutionary interpretation.

FIGURE 1

Overview of method illustrated for a M to V mutation. The rate of substitution between any two amino acids in a protein is evaluated based on an underlying codon mutation Markov process. Mutations arise according to the mutation proposal model and fix depending on the fitness difference according to an evolutionary dynamics model. Fitness is proportional to the fraction folded (i.e., the probability of the native state) based on an all‐atom energy function. The model contains four free parameters (ΔG nat, N , κ, ρ) that are simultaneously optimized

Thermodynamic effects of amino acid mutations

A full atomistic model of protein structure and energetics is necessary to realistically model site‐specific behavior. However, computing the folding stability of proteins at this level of detail is currently intractable for simulations of protein evolution, as it requires extensive sampling of alternative conformations for every sequence evaluated. Instead, we treat the folding stability of native sequences (ΔG nat) as a global free parameter in the model. In contrast to folding stability, the free energy change upon amino acid mutation (ΔΔG) is both reasonably fast to compute and fairly accurate (r 2 = .56 between prediction and experiment). We therefore calculate the folding stability of a sequence variant at site L as:where i indicates the index of the native amino acid at site L and j indicates the index of the proposed amino acid mutation at that site. The equilibrium properties of each position are conditioned only on the fitness pressure exerted by the native structural environment in the selected proteins. In contrast to using a mutation‐after‐mutation simulation method, this eliminates the need for sequence pre‐equilibration, avoids compounding errors of model and energy function, and allows computation of a site‐specific Q‐matrix from just 20 ΔΔG calculations.

Protein fitness as a function solely of thermodynamic stability

To understand to what degree thermodynamic stability explains the process of protein evolution, we base our fitness function on folding stability alone. Following previous work, we assume that a protein's contribution to fitness is proportional to the fraction of the protein folded into its functional conformation. The fraction of protein folded for a two‐state folding model with stability is given by

Fixation probability in a finite population

Given enough time, a proposed mutation will either spread throughout the population and fix or be purged due to negative selection or genetic drift. We assume throughout a monomorphic population evolving under weak‐mutation. The probability of fixation when amino acid i mutates to amino acid j at site L depends on the relative change in fitness due to that mutation (the selection coefficient ) and the effective population size N : The effective population size N is another global free parameter in the model. The fixation probability we use above is for diploid populations with non‐overlapping generations in which a new mutation arises as a single copy. Similar equations, with exponential terms differing by a factor of two, apply to haploids and overlapping generations, but the effects on our analysis are negligible for large population sizes (e.g., N > 100).

Nucleotide and codon mutation proposals

Most spontaneous mutations involve changes to single‐base pairs. Due to the intrinsic chemical properties of nucleotides,, point mutations occur with greater rates for transitions (pyrimidine to pyrimidine or purine to purine) than transversions. To capture this bias in our model, the transversion rate is fixed at 1.0 and the transition rate κ is treated as a free parameter. A smaller subset of mutations involve changes of multi‐nucleotides or whole codons, for instance as may result from insertions, deletions, or tandem mutations due to error‐prone replication or UV damage. These types of mutations have not been modeled previously in structure‐based simulations of evolution, but they play an important role in the evolution of natural sequences. We model the rate of whole‐codon mutations, scaled relative to transversions, with a single parameter ρ. The mutation proposal probability P at a given site is:

Construction of amino acid rate matrices from codon mutations and protein fixation

We model the site‐specific evolutionary process as a continuous‐time Markov process, where the unit of change is the codon and the relative instantaneous substitution rate from codon u with amino acid i to codon v with amino acid j is the product of the mutation proposal rate, , and the fixation probability, : Codons representing the same amino acids are then aggregated, allowing determination of site‐specific amino acid flux matrices. The flux matrices are averaged across sites and proteins to construct a global 20 × 20 amino acid instantaneous rate matrix Q. For comparison with empirical global substitution matrices such as JTT, WAG, and LG, and for use in phylogenetic analyses, the Q‐matrix was decomposed into a diagonal amino acid equilibrium probability matrix, π, and an independent symmetric 20 by 20 exchangeability matrix R:, The rate constants in the symmetric exchangeability matrix R can be considered normalized equilibrium fluxes between amino acids, where the fluxes have been normalized by the appropriate equilibrium amino acid probabilities (see Methods). The equilibrium probabilities, in turn, describe the expected frequency distribution of amino acids when the substitution process has progressed long enough that it reaches equilibrium.

Optimized model parameters provide typical values

Using the computed mutation energies (ΔΔG), an optimal exchangeability matrix R was found by optimizing values of the four parameters of the model by maximizing the phylogenetic likelihood for the 52 non‐redundant protein families. With grid‐search optimization, we found an optimal parameter set of ΔG nat = −6.0 ± 0.1 kcal/mol, log(N ) = 3.8 ± 0.1, κ = 2.1 ± 0.1, and ρ = 0.11 ± 0.01 (standard error estimated by bootstrap, see supporting methods). We refer to this optimized matrix as the Thermodynamic Mutation‐Selection (TMS) matrix. Alternatively, parameter values can also be determined by direct optimization against, say, the LG matrix, a procedure that gives nearly identical results as the maximum likelihood method (Figure S1). Because the TMS model parameters are based on fundamental quantities in population genetics (N ), spontaneous mutation processes (κ, ρ) and protein thermodynamics (ΔG nat), the optimal parameter values can be compared with independent empirical measurements from protein biochemistry and population genetics. As discussed further below, these optimal values from our thermodynamic, evolutionary model correspond surprisingly well to representative physical and biological values. The stability of natural proteins is typically between −5 and −10 kcal/mol, while effective population sizes vary over many orders of magnitude. For instance, the effective population size for humans is on the order of 1013, while Escherichia coli have an population size of approximately 103. The optimal ΔG nat and N TMS values compare well with these ranges. By inspecting the likelihood surface (see Figure S2), the stability ΔG nat and population N parameters are seen to be highly correlated, with optimal parameters found along a line given by Along this line, substitution parameters do not vary substantially (Figure S3). A similar dependence of protein stability ΔG on N at equilibrium has been noted previously., The strong correlation between ΔG and N suggests that they represent a single latent parameter, reducing the effective parameter space of our model from four to three. In contrast to stability and population size, the parameters describing the mutation process (κ and ρ) are largely independent (Figure S2). The optimal transition/transversion bias κ = 2.1 is consistent with experimental data for spontaneous mutation rates from E. coli, where κ ≃ 2.6 (assuming a K80‐type Markov model, see Appendix S1). For the whole codon mutation parameter ρ, the MLE is 0.1 relative to the transversion rate. This corresponds to that 29% of all proposed mutations are multi‐nucleotide mutations. Obtaining direct empirical estimates of ρ is difficult, but for indels of 2–4 nucleotides in E. coli, the spontaneous mutation rate is approximately 10% of the single‐nucleotide mutation rate. In eukaryotes, multi‐nucleotide mutations comprise approximately 3% of all mutations and have been shown to be important in phylogenetic tests. Given that processes other than indels can result in multi‐nucleotide mutations, and that indels typically result in multiple codon mutations, our ρ opt value appears reasonable.

The thermodynamic TMS model reproduces empirical substitution patterns

Most of the variation in empirical amino acid substitution matrices can be explained by our thermodynamic evolutionary model, as judged by the logarithmic correlation coefficient with the ML TMS matrix (Figure 2). The TMS matrix appears to be a rather typical exchangeability matrix, as the average correlation of TMS with many widely used empirical matrices is r 2 = .54, whereas the correlation of those empirical matrices with each other (excluding TMS) is r 2 = .52 (Figure 2). The average correlation with globular matrices is higher (r 2 = .59), whereas the correlation with mitochondrial matrices is substantially lower (r 2 = .42). For example, the TMS exchangeability matrix has an r 2 of .67 and .64 with the widely used WAG (Figure S4) and LG substitution matrices (Figure 3), respectively. The lower correlation with mitochondrial matrices is likely due to the preponderance of transmembrane proteins in the mitochondrial datasets,, as the Rosetta energy function used in our thermodynamic free energy calculations is intended for soluble proteins. Breaking exchangeabilities down by individual amino acids, we see the lowest correlation for cysteine and proline (r 2 = .50), likely reflecting shortcomings of the energy function and limited modeling of backbone flexibility. The other amino acids have higher correlations, up to r 2 = .85 (Figure 4).

FIGURE 2

FIGURE 3

The TMS model recapitulates the mean substitution behavior between amino acids. (a) Comparison of the amino acid exchangeability matrix predicted by TMS (black circles) and the LG phylogenetic empirical exchangeability matrix (red circles). Inset, correlation between exchangeabilities from TMS (x‐axis) versus LG (y‐axis). (b) Correlation between amino acid equilibrium probabilities predicted by TMS (x‐axis) and values from LG (y‐axis). Identity shown as dashed line. (c) The “mutation‐only” exchangeability matrix, without selection for stability, vs LG. (d) Correlation between mutation‐only equilibrium probabilities and those from LG. Identity shown as dashed line

FIGURE 4

The TMS model recapitulates amino acid‐specific exchangeabilities. Identity is shown as a dashed line

Heatmap of correlations between popular exchangeability matrices. Squared Pearson correlations (r 2) were computed in log‐space over the 190 exchangeability parameters. Matrices were clustered based on r 2 using hierarchically clustering. The TMS matrix is indicated by asterisks The TMS model recapitulates the mean substitution behavior between amino acids. (a) Comparison of the amino acid exchangeability matrix predicted by TMS (black circles) and the LG phylogenetic empirical exchangeability matrix (red circles). Inset, correlation between exchangeabilities from TMS (x‐axis) versus LG (y‐axis). (b) Correlation between amino acid equilibrium probabilities predicted by TMS (x‐axis) and values from LG (y‐axis). Identity shown as dashed line. (c) The “mutation‐only” exchangeability matrix, without selection for stability, vs LG. (d) Correlation between mutation‐only equilibrium probabilities and those from LG. Identity shown as dashed line The TMS model recapitulates amino acid‐specific exchangeabilities. Identity is shown as a dashed line The high correlation between TMS and other experimental matrices like LG is a result of both the thermodynamic model, in which more chemically similar amino acids have higher fixation probabilities, and the mutation model, which biases amino acid replacements in the genetic code due to preferred nucleotide mutations, codon number, and codon connectivity. To understand the impact of the genetic component on the correlations, we computed a “mutation‐only” R‐matrix, assuming only the connectivity of the codon table and the transition‐transversion bias by setting all fixation probabilities in Equation 5 to 1 and setting ρ = 0.10 and κ; = 2.1 to their optimal values. This mutation‐only R‐matrix accounts for 32% of the variation in the LG substitution pattern (Figure 3c), suggesting that genetic and thermodynamic factors contribute roughly equally to the patterns of empirical amino acid substitutions. Correct recapitulation of amino acid exchangeabilities is strongly dependent on the thermodynamic component of the model. For example, using LG as the standard for comparison, the empirical exchangeability for tryptophan to tyrosine is underestimated by 7.4‐fold by the mutation‐only R‐matrix but is underestimated by 1.1‐fold by TMS. Similarly, the mutation‐only model overestimates the empirical exchangeability of asparagine to phenylalanine by 4.8‐fold, while TMS overestimates it by 1.3‐fold. Our ML TMS matrix was optimized over the exchangeabilities and not the amino acid equilibrium frequencies. Nevertheless, from the decomposition of the Q‐matrix (given in Equation 6), we can provide the implied TMS equilibrium frequencies and compare them with those provided with common exchangeability matrices (Figure S5). For example, the calculated TMS equilibrium frequencies correlate well with those from LG (linear r 2 = .65), but suggest unmodelled fitness effects. For instance, we do not model disulfide formation, and unsurprisingly the stationary frequency of cysteine calculated from TMS is 2.8‐fold too low. Likewise, we ignore codon usage biases, which could explain the overestimation of the background frequency of arginine and leucine, both of which have exceptionally skewed codon usage. Although our model correlates strongly with empirical exchangeability matrices like LG, some of the variation in the rate constants remains unexplained. One possible contribution to this discrepancy stems from the fact that we selected a set of proteins that are different from those used to infer the LG‐matrix. Our set of proteins will likely have a slightly different substitution process than the LG set. To investigate this possibility, we used a maximum likelihood phylogenetic inference method, similar to that used for the construction of the LG matrix, to infer exchangeabilities from the sequence alignments for our set of benchmark proteins (Figure S6). The inferred phylogenetic exchangeability matrix for our proteins is highly correlated with LG (r 2 = .97), indicating that our benchmark proteins are similar to those that LG was derived from, and that stochastic errors originating from the rate‐inference itself are minimal. A second source of the unexplained variation could be errors in the energy function used for ΔΔG prediction, as Rosetta energies have an imperfect correlation with empirical values (r 2 = .56). We explored this possibility by simulating noisy data using empirically determined ΔΔG Rosetta prediction errors. This analysis suggested at least 21% of the unexplained variance could be caused by energy function errors (see Appendix S1). Finally, our mutation‐selection model operates at the codon level, producing rates that are subsequently aggregated to the amino acid level—a reduction in dimensionality that can affect the assumed Markov property of codon and amino acid evolution. It is possible that our codon‐level model could explain codon exchangeability matrices better than amino acid matrices. Additionally, when determining an amino acid exchangeability matrix like LG by phylogenetic methods, all 189 exchangeability rate constants are free parameters that are estimated independently from the sequence data. In contrast, our codon model is highly constrained by only two parameters (κ and ρ) and the K80 assumption of equal nucleotide frequencies. More complex mutational models could provide higher correlations at the expense of model simplicity and interpretability. To explore this possibility, we extended our codon model with the T92 nucleotide mutation proposal model. Compared to the K80 model, the T92 model adds a single additional parameter that addresses G+C content bias (%‐GC), which captures most of the variation in nucleotide frequencies per Chargaff's second rule. The optimal %‐GC bias for the exchangeabilities R is 0.51 and for the equilibrium frequencies π is 0.52, which barely deviates from the K80 value fixed at 0.5 (see Figure S7). At the optimal %‐GC bias values, the T92 model does not improve the correlation of the TMS matrix with empirical substitution matrices (such as LG), suggesting that equal nucleotide frequencies in the K80 model is a relatively weak assumption that does not significantly limit the fit of the TMS matrix.

TMS explains some phylogenies better than empirical substitution matrices

The purpose of our study is to see how far a simple thermodynamic and mutation‐selection model could go in explaining empirical amino acid substitutions; it was not intended to provide a better substitution matrix for practical use. Nevertheless, it is interesting to see how well our TMS matrix fares in phylogenetic analyses. For each of the 52 individual phylogenetic trees in our benchmark set, we compared the TMS maximum likelihood values to the LG maximum likelihood values for the same alignments. The mean TMS likelihood was −13,664 while the mean LG likelihood was −13,492, a difference of 172 on average in favor of LG. For most trees, the LG matrix is better, but for three of the 52 trees TMS has a higher likelihood than LG. Using a larger set of 500 Pfam alignments,, independent of both our model parameters and LG, we similarly found that in eight alignments TMS had higher likelihood than LG (see Table S1). Thus, for a certain subset of protein families, TMS does appear to provide a better substitution model than LG.

DISCUSSION

We present an approach to predict relative rates of global amino acid substitutions in evolution from first principles by combining population genetics and protein biophysics. We find that most of the global substitution behavior of amino acids in proteins can be explained by a simple evolutionary fitness model that captures (a) the thermodynamic effects of mutations, (b) the biases in spontaneous mutations, and (c) the structure of the genetic code. Our model is based on an extremely simple assumption: the fitness of a gene variant is controlled only by constraints on protein stability. Nevertheless, this naive model can remarkably recapitulate the complex amino acid substitution patterns seen in empirically derived substitution matrices. Our model incorporates transition/transversion mutation biases, multi‐nucleotide codon changes, variation in codon counts due to the genetic code, and free energy changes calculated from the detailed atomic interactions at specific sites in proteins. By calculating the fitness effects of mutations in the structural environment of native sequences and averaging over many sites in many proteins, we are able to reproduce the majority of the amino acid substitution patterns quantified in common global exchangeability matrices. Previous work by others has used structure‐based models of evolution to predict amino acid substitution rates and equilibrium frequencies at specific sites in proteins., , , , , Such methods have been applied to predict site‐specific substitution matrices, rather than global matrices, but the results are difficult to validate against empirical site‐specific matrices due to the lack‐of‐data problem at specific sites. Arenas and Bastolla have emphasized that there are currently two incomplete kinds of structure‐based evolutionary models described in the literature: stability‐constrained fitness models and structurally‐constrained fitness models., Stability‐constrained models calculate fitness from the effect of a mutation on protein stability, but ignore the structural changes due to the mutation., , , , On the other hand, structurally constrained models calculate fitness from the effect of a mutation on the structure, but ignore the change in stability., , Notably, our method bridges both categories of structure‐based fitness model, as we simultaneously estimate the change in structure due to a mutation (allowing local backbone rearrangements and side chain repacking) and calculate the free energy change of the resulting perturbed structure relative to the native state using a state‐of‐the‐art energy function. Like the previous structure‐based evolutionary methods described above, we also calculate site‐specific rates. However, in our analysis these rates are then averaged over sites and proteins to arrive at the predicted global TMS exchangeability matrix. What then is the physical interpretation of a global substitution matrix? The exchangeabilities in amino acid substitution matrices are rate constants that can be considered to be instantaneous probabilities. The probability of a substitution at a given site is a function that is conditional on the particular physical environment at that protein site and depends on the chemical and physical properties of the current and mutated amino acids. These amino acid properties are largely constant regardless of the different chemical environments at different sites in the different proteins found throughout life. Amino acid mutations with similar chemical and physical properties are expected to have little effect on fitness. Highly dissimilar mutations will likely decrease stability and be eliminated quickly, while mutations that increase stability due to favorable interactions in the site‐specific environment will fix quickly. Since the exchangeabilities in global matrices are calculated by averaging over many different sites, exchangeabilities are seen to be marginal probabilities in which all site‐specific variation has been integrated out. In most common applications of substitution matrices, like phylogenetic analyses and homology detection, we lack detailed physical information at specific protein sites. Hence, in the absence of atomic‐level site‐specific information, global (i.e., marginal) exchangeability matrices should provide the maximum information possible about amino acid substitution behavior. After site‐specific variation has been removed, the remaining information in a global exchangeability matrix quantifies the degree of similarity between amino acids in their inherent biological, chemical, and physical properties, as seen by selection in the context of folded proteins. The fact that we can use site‐specific physics and genetics to accurately predict empirical global substitution matrices helps explain why global matrices have been so successful in practice in phylogenetics, bioinformatics, homology detection, and ancestral sequence reconstruction.

MATERIALS AND METHODS

Protein dataset

We selected a non‐redundant subset of 52 proteins structures from a curated set of high‐quality protein structures., All structural modeling was done with the Rosetta macromolecular modeling suite. To ensure structural diversity, sequence redundancy was decreased so that no sequence shared more than 60% identical positions with any other sequence. Before analysis, each structure was adapted to the energy function using the FastRelax protocols as described by Nivon et al. allowing cartesian space minimization.

Prediction of the energetic effects of amino acid mutations

The ΔΔG prediction method is based on a modified version of the method presented by Park et al., but with a cutoff in the Lennard‐Jones potential set to 6.0 Å. This ΔΔG method samples backbone degrees of freedom for the mutated and neighboring residues in the sequence and allows repacking of all side chains in energetic contact (>0.1 kcal/mol) with the mutated residue.

Markov chain aggregation and averaging

To calculate the amino acid substitution rate matrix for site L from the codon substitution rate matrix , we first determine the frequency of each codon:, Next, we determine the rate between two amino acids with codons and using the aggregation approach presented by Yang et al.: where is the rate between codon v to u at site L. The flux between a pair of amino acids at a site L is:The site‐specific rate (the total flux) is:We normalize the site‐specific flux matrix, , so that a unit of time corresponds to one expected amino acid substitution per site.To calculate the mean instantaneous rate for amino acid i to j we average over sites by normalizing by the equilibrium frequency of an amino acid i at that site:Sites with rates with μ < 10−10 were excluded from the analysis to avoid numerical errors. Next, we determine the exchangeability matrix asFrom this, it can be seen that the exchangeability matrix is equivalent to the equilibrium flux matrix normalized by the appropriate amino acid equilibrium frequencies:

Model parameterization based on phylogenetic trees

Optimal values for the four free parameters of the model were determined by maximizing the sum log‐likelihood of phylogenetic trees for the 52 non‐redundant protein alignments with a total of 8,907 sites:The maximum likelihood trees (optimizing branch lengths, topology, equilibrium frequencies, and site rate variation parameter α) were determined with IQ‐TREE (model TMS + FO + G4, where TMS is an exchangeability matrix calculated from specific values of ). Note that because we use the IQ‐TREE “FO” model option, rather than specifying the equilibrium frequencies predicted by our TMS model, this method optimizes over the TMS exchangeabilities alone. To speed the calculation the tree search was seeded with a tree inferred using LG. To find the parameterization that maximizes the sum log‐likelihood, we performed a grid‐search over the parameter space. Grid search optimization was performed over linearly spaced steps in the ranges and over log‐linearly spaced steps in the ranges , . The maximum likelihood TMS matrix is available in supplementary materials.

Correlations between substitution matrices

The agreement between exchangeability matrices was quantified using a Pearson correlation coefficient calculated in log‐space for the 190 exchangeability parameters.

CONFLICT OF INTEREST

The authors declare no competing financial interest.

AUTHOR CONTRIBUTIONS

Christoffer Norn: Conceptualization; data curation; formal analysis; funding acquisition; investigation; methodology; software; visualization; writing‐original draft; writing‐review & editing. Ingemar Andre: Conceptualization; data curation; formal analysis; funding acquisition; investigation; methodology; software; supervision; visualization; writing‐original draft; writing‐review & editing. Douglas L. Theobald: Conceptualization; data curation; formal analysis; funding acquisition; investigation; methodology; software; supervision; visualization; writing‐original draft; writing‐review & editing. Appendix S1. Supporting Information. Click here for additional data file.

68 in total

1. Pervasive multinucleotide mutational events in eukaryotes.

Authors: Daniel R Schrider; Jonathan N Hourmozdi; Matthew W Hahn
Journal: Curr Biol Date: 2011-06-21 Impact factor: 10.834

2. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

Review 3. The interface of protein structure, protein biophysics, and molecular evolution.

Authors: David A Liberles; Sarah A Teichmann; Ivet Bahar; Ugo Bastolla; Jesse Bloom; Erich Bornberg-Bauer; Lucy J Colwell; A P Jason de Koning; Nikolay V Dokholyan; Julian Echave; Arne Elofsson; Dietlind L Gerloff; Richard A Goldstein; Johan A Grahnen; Mark T Holder; Clemens Lakner; Nicholas Lartillot; Simon C Lovell; Gavin Naylor; Tina Perica; David D Pollock; Tal Pupko; Lynne Regan; Andrew Roger; Nimrod Rubinstein; Eugene Shakhnovich; Kimmen Sjölander; Shamil Sunyaev; Ashley I Teufel; Jeffrey L Thorne; Joseph W Thornton; Daniel M Weinreich; Simon Whelan
Journal: Protein Sci Date: 2012-04-23 Impact factor: 6.725

4. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases.

Authors: K Tamura
Journal: Mol Biol Evol Date: 1992-07 Impact factor: 16.240