The Context-Dependence of Mutations: A Linkage of Formalisms.

Frank J Poelwijk¹, Vinod Krishna¹, Rama Ranganathan².

Year: 2016 PMID: 27337695 PMCID: PMC4919011 DOI: 10.1371/journal.pcbi.1004771

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Overview

Defining the extent of epistasis (the nonindependence of the effects of mutations) is essential for understanding the relationship of genotype, phenotype, and fitness in biological systems. The applications cover many areas of biological research, including biochemistry, genomics, protein and systems engineering, medicine, and evolutionary biology. However, the quantitative definitions of epistasis vary among fields, and its analysis beyond just pairwise effects remains problematic in general. Here, we bring together a number of previous results that show that different definitions of epistasis are versions of a single mathematical formalism: the weighted Walsh-Hadamard transform. We demonstrate that one of the definitions, the background-averaged epistasis, may be the most informative for describing the epistatic structure of a biological system. Key issues are the choice of effective ensembles for averaging and the practical challenge of contending with the vast combinatorial complexity of mutations. In this regard, we discuss strategies for optimally learning the epistatic structure of biological systems.

Introduction

There has been much recent interest in the prevalence of epistasis in the relationships between genotype, phenotype, and fitness in biological systems [1-7]. Epistasis here is defined as the nonindependence (or context-dependence) of the effect of a mutation, which is a generalization of Bateson’s original definition of epistasis as a genetic interaction in which a mutation “masks” the effect of variation at another locus [8]. It is also in line with Fisher’s broader definition of “epistacy” [9]. Epistasis limits our ability to predict the function of a system that harbors several mutations, given knowledge of the effects of those mutations taken independently [10-13], and makes these relationships increasingly more complex [14-19]. From an evolutionary perspective, the presence of epistatic interactions may limit or entirely preclude trajectories of single-mutation steps towards peaks in the fitness landscape [20-29]. With regard to human health, epistasis complicates our understanding of the origin and progression of disease [30-37]. Thus, interest in the extent of epistatic interactions in biological systems has originated from the fields of protein biochemistry, protein engineering, medicine, systems biology, and evolutionary biology alike. Originally, epistasis was considered in the context of two genes, but we can define it more broadly as the nonindependence of mutational effects in the genome, whether the effects are within, between, or even outside protein coding regions (e.g., in regulatory regions). The perturbations may go beyond point mutagenesis, but we limit the discussion here for clarity of presentation. Importantly, the definition of epistasis can be extended beyond pairwise effects to comprise a hierarchy of three-way, four-way, and higher-order terms that represent the complete theoretical description of epistasis between the parts that make up a biological system. 
How can we quantitatively assign an epistatic interaction given experimentally determined effects of mutations? Because epistasis is deviation from independence, it is crucial to first explicitly state the null hypothesis—asserting what exactly it means to have independent contributions of mutations. This by itself is typically nontrivial. In some cases the phenotype is directly related to a thermodynamic state variable, and the issue is straightforward: independence implies additivity in the state variable. For example, for equilibrium binding reactions between two proteins, independence means additivity in the free energy of binding ΔGbind, such that the energetic effect of a double mutation is the sum of the energetic effects of each single mutation taken independently. However, in general, many phenotypes cannot be so directly linked to a thermodynamic state variable, and quantification of epistasis needs to be accompanied by a proper rationale for the choice of null hypothesis. In what follows, we will assume this step has already been carried out and we will equate independence with additivity of mutational effects. Epistasis between two mutations is then defined as the degree to which the effect of both mutations together differs from the sum of the effects of the single mutations. In this paper, we describe three theoretical frameworks that have been proposed for characterizing the epistasis between components of biological systems; these frameworks originate in different fields and use seemingly different calculations to describe the nonindependence of mutations [2,14,24,33,38-46]. We extend previous observations [47-50] to show that these formalisms are different manifestations of a common mathematical principle, which explains their conceptual similarities and distinctions. Each of these formalisms has its value depending on depth of coverage and nature of sampling in the experimental data and the objective of the analysis. 
In the end, the fundamental issue is to develop practical approaches for optimally learning the epistatic structure of biological systems in the face of the explosive combinatorial complexity of possible epistatic interactions between mutations. Understanding the mathematical relationships between the different frameworks for analyzing epistasis is a key step in this process.

Results

Basic definitions

We begin with a formal definition of genotype, phenotype, and the representation of mutational effects. Consider a specific sequence comprised of N positions as a binary string $g = \{g_N, \ldots, g_1\}$ with $g_i \in \{0,1\}$, where “0” and “1” represent the "wild-type" and mutant state of each position, respectively. This defines a total space of $2^N$ genotypes. The analysis could be expanded to the case of multiple substitutions per position, but we consider just the binary case for clarity here. Each genotype g has an associated phenotype $y_g$, which is of the form that the independent action of two mutations means additivity in y. For notational simplicity, we will simply write the genotype in a k-bit binary form, where k is the order of the mutations that are considered. For example, the effect of a single mutation is simply $y_1 - y_0$, the difference in the phenotype between the mutant and “wild-type” states (Fig 1A). The effect of a double mutant is given by $y_{11} - y_{00}$ (red arrow, Fig 1B), and its linkage through paths of single mutations is defined by a two-dimensional graph (a square network) with four total genotypes. Similarly, a triple mutant effect is $y_{111} - y_{000}$ (red arrow, Fig 1C), and its linkage through paths of single mutations is enumerated on a three-dimensional graph (a cube) with eight total genotypes. More generally, and as described by Horowitz and Fersht [51], the phenotypic effect of any arbitrary n-dimensional mutation can be represented by an n-dimensional graph with $2^n$ total genotypes. Understanding the relationship of the phenotypes of multiple mutants to that of the underlying lower-order mutant states is the essence of epistasis and is described below.
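The bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function names and the toy phenotype values are assumptions made for the example. Genotypes are written as bit strings in binary order, and the pairwise term follows the expression $(y_{11} - y_{10}) - (y_{01} - y_{00})$ used in Fig 1.

```python
from itertools import product

def genotypes(n):
    """All 2**n binary genotypes as strings, in binary order (00..0 to 11..1)."""
    return ["".join(bits) for bits in product("01", repeat=n)]

def order(genotype):
    """Number of mutated positions ('1' bits) in a genotype string."""
    return genotype.count("1")

def pairwise_epistasis(y):
    """Second-order epistasis for two sites: (y11 - y10) - (y01 - y00).

    y maps genotype strings to phenotypes for which independence means additivity.
    """
    return (y["11"] - y["10"]) - (y["01"] - y["00"])

# Hypothetical phenotypes in arbitrary units, chosen only for illustration.
additive = {"00": 0.0, "01": 1.0, "10": 2.0, "11": 3.0}
coupled = {"00": 0.0, "01": 1.0, "10": 2.0, "11": 4.5}

print(genotypes(2))
print(pairwise_epistasis(additive), pairwise_epistasis(coupled))
```

In the additive toy landscape the pairwise term vanishes, while in the coupled one the double mutant deviates from the sum of the single-mutant effects.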

Fig 1

Definitions of genotype, phenotype, and effects of mutations.

Representation of (A) single mutant, (B) double mutant, and (C) triple mutant experiments. Phenotypes are denoted by $y_g$, where g is the underlying genotype. $g = \{g_N, \ldots, g_1\}$ with $g_i \in \{0,1\}$; “0” or “1” indicates the state of the mutable site (e.g., amino acid position). The effect of a single, double, and triple mutation is given by the red arrows. Pairwise (or second-order) epistasis is defined as the differential effect of a mutation depending on the background in which it occurs; for example, in (B) it is the degree to which the effect of one mutation (e.g., $y_{10} - y_{00}$) deviates in the background of the second mutation ($y_{11} - y_{01}$). Thus, the expression for second-order epistasis is $(y_{11} - y_{10}) - (y_{01} - y_{00})$. The third order and higher cases are considered in the main text.

The biochemical view of epistasis

A well-known approach in biochemistry for analyzing the cooperativity of amino acids in specifying protein structure and function is to use the formalism of thermodynamic mutant cycles [10,51-53], one manifestation of the general principle of epistasis. In this approach, the "phenotype" is typically an equilibrium free energy ΔG (e.g., of thermodynamic stability or biochemical activity), and the goal is to obtain information about the structural basis of this phenotype through mutations that represent subtle perturbations of the “wild-type” state. For pairs of mutations, the analysis involves measurements of four variants: “wild-type” ($\Delta G^{o}_{0}$), each single mutant ($\Delta G^{o}_{1}$ and $\Delta G^{o}_{2}$), and the double mutant ($\Delta G^{o}_{1,2}$), where the lower indices designate the mutated positions and the upper index “o” indicates that free energies are relative to the usual biochemical standard state (Fig 1B). From this, we can compute a coupling free energy between the two mutations ($\Delta^{2} G_{1,2}$) as the degree to which the effect of one mutation ($\Delta^{1} G_{1}$) is different when the same mutation occurs in the background of the other ($\Delta^{1} G_{1|2}$):

$$\Delta^{2} G_{1,2} = \Delta^{1} G_{1|2} - \Delta^{1} G_{1} = (\Delta G^{o}_{1,2} - \Delta G^{o}_{2}) - (\Delta G^{o}_{1} - \Delta G^{o}_{0}) \tag{1}$$

Whereas the $\Delta G^{o}$ terms are individual measurements and $\Delta^{1} G$ terms are the effects of single mutations relative to “wild-type,” $\Delta^{2} G$ is a second-order epistatic term describing the cooperativity (or nonindependence) of two mutations with respect to the “wild-type” state. This analysis can be expanded to higher order (see [53]). For example, the third-order epistatic term describing the cooperative action of three mutations 1, 2, and 3 ($\Delta^{3} G_{1,2,3}$) is defined as the degree to which the second-order epistasis of any two mutations is different in the background of the third mutation:

$$\Delta^{3} G_{1,2,3} = \Delta^{2} G_{1,2|3} - \Delta^{2} G_{1,2} = (\Delta G^{o}_{1,2,3} - \Delta G^{o}_{2,3} - \Delta G^{o}_{1,3} + \Delta G^{o}_{3}) - (\Delta G^{o}_{1,2} - \Delta G^{o}_{2} - \Delta G^{o}_{1} + \Delta G^{o}_{0}) \tag{2}$$

Note that $\Delta^{3} G$ requires measurement of eight individual genotypes (Fig 1C).
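More generally, we can define an nth-order epistatic term ($\Delta^{n} G$), describing the cooperativity of n mutations,

$$\Delta^{n} G_{1,\ldots,n} = \Delta^{n-1} G_{1,\ldots,n-1|n} - \Delta^{n-1} G_{1,\ldots,n-1}$$

It is possible to write this expansion in a compact matrix form:

$$\bar{\lambda} = \mathbf{G}\,\bar{y}$$

where $\bar{\lambda}$ is the vector of $2^n$ epistasis terms of all orders and $\bar{y}$ is the vector of $2^n$ free energies corresponding to phenotypes of all the individual variants listed in binary order. To illustrate, for three mutations (n = 3), we obtain

$$\begin{pmatrix} \lambda_{000} \\ \lambda_{001} \\ \lambda_{010} \\ \lambda_{011} \\ \lambda_{100} \\ \lambda_{101} \\ \lambda_{110} \\ \lambda_{111} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ -1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & -1 & 1 & 0 & 0 \\ 1 & 0 & -1 & 0 & -1 & 0 & 1 & 0 \\ -1 & 1 & 1 & -1 & 1 & -1 & -1 & 1 \end{pmatrix} \begin{pmatrix} y_{000} \\ y_{001} \\ y_{010} \\ y_{011} \\ y_{100} \\ y_{101} \\ y_{110} \\ y_{111} \end{pmatrix}$$

In this representation, lower indices in $\bar{y}$ represent combinations of mutations (e.g., $y_{011}$, a double mutant) and lower indices in $\bar{\lambda}$ represent epistatic order (e.g., $\lambda_{011}$, pairwise epistasis between mutations 1 and 2). Thus, Eqs 1 and 2 correspond to multiplying $\bar{y}$ by the fourth or eighth row of $\mathbf{G}$, respectively, to specify $\lambda_{011}$ and $\lambda_{111}$. Note that $\bar{y}$ and $\bar{\lambda}$ contain precisely the same information re-written in a different form. The matrix $\mathbf{G}$ represents an operator linking these two representations of the mutation data. We will return to the nature of the operation in a later section. We can write a recursive definition for $\mathbf{G}$ that defines the mapping between $\bar{y}$ and $\bar{\lambda}$ for all epistatic orders n:

$$\mathbf{G}_{n+1} = \begin{pmatrix} \mathbf{G}_{n} & 0 \\ -\mathbf{G}_{n} & \mathbf{G}_{n} \end{pmatrix}, \qquad \mathbf{G}_{0} = 1$$

The inverse mapping is defined by $\bar{y} = \mathbf{G}^{-1}\bar{\lambda}$. This relationship gives the effect of any combination of mutants (in $\bar{y}$) as a sum over epistatic terms (in $\bar{\lambda}$). This yields, for example, for the energetic effect of three mutations 1, 2, and 3 ($\Delta G^{o}_{1,2,3}$):

$$\Delta G^{o}_{1,2,3} = \Delta G^{o}_{0} + \Delta^{1} G_{1} + \Delta^{1} G_{2} + \Delta^{1} G_{3} + \Delta^{2} G_{1,2} + \Delta^{2} G_{1,3} + \Delta^{2} G_{2,3} + \Delta^{3} G_{1,2,3}$$

Thus, in the most general case, the free energy value of a multiple mutation requires knowledge of the effect of the single mutations and all associated epistatic terms. For the triple mutant, this means the “wild-type” phenotype, the three single mutant effects, the three two-way epistatic interactions, and the single three-way epistatic term.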
This analysis highlights two important properties of epistasis: (1) the lack of any epistatic interactions between mutations dramatically simplifies the description of multiple mutations to just the sum over the underlying single mutation effects, and (2) the absence of lower-order epistatic interactions (e.g., Δ2G = 0) does not imply absence of higher-order epistatic terms.
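The mutant-cycle operator can be built with a short recursion and applied to a complete set of measurements. The sketch below is illustrative only: the function names are assumptions, and the recursion used is the block form $\mathbf{G}_{k+1} = [[\mathbf{G}_k, 0], [-\mathbf{G}_k, \mathbf{G}_k]]$ with $\mathbf{G}_0 = 1$, which reproduces the second- and third-order cycles of Eqs 1 and 2. The input values are the rounded PDZ free energies listed in Table 1.

```python
def mutant_cycle_operator(n):
    """Build the 2**n x 2**n mutant-cycle operator from G_0 = [[1]]
    via G_{k+1} = [[G_k, 0], [-G_k, G_k]]."""
    G = [[1]]
    for _ in range(n):
        zeros = [0] * len(G)
        top = [row + zeros for row in G]
        bottom = [[-v for v in row] + row for row in G]
        G = top + bottom
    return G

def apply(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Free energies from Table 1 (kcal/mol), genotypes 000..111 in binary order.
y = [-8.17, -7.58, -6.13, -6.24, -5.96, -7.70, -7.67, -8.45]

G3 = mutant_cycle_operator(3)
lam = apply(G3, y)
for i, value in enumerate(lam):
    print(format(i, "03b"), round(value, 2))
```

Because the inputs are the rounded table entries, a few results differ from the Mutant Cycle column of Table 1 in the last digit, presumably because the published terms were computed from unrounded data.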

The ensemble view of epistasis

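In contrast to the biochemical definition, the significance of a mutation (and its epistatic interactions) may also be defined not solely with regard to a single reference state as the "wild-type", but as an average over many possible genotypes. As we show below, such averaging more clearly identifies epistatic units within a protein and, in principle, can separate mutant effects that are idiosyncratic to particular proteins from those that generally hold over the selected ensemble of genotypes. The concept of averaging epistasis over genotypic backgrounds is related to "statistical epistasis" in evolutionary biology, in which the effects of combinations of mutations are averaged over genotypes present in a population [2]. It is also analogous to the idea of the “schema average fitness” in the field of genetic algorithms (GA) [54], but as applied in a biological context (see e.g., [45]). In its complete form, background-averaged epistasis considers averages over all possible genotypes for the remaining positions in the ensemble. For example, if n = 3, the epistasis between two positions 1 and 2 is computed as an average over both states of the third position ($\varepsilon_{*11}$, with the averaging denoted by a subscript “*”) (see Fig 1C):

$$\varepsilon_{*11} = \tfrac{1}{2}\left[(y_{011} - y_{010} - y_{001} + y_{000}) + (y_{111} - y_{110} - y_{101} + y_{100})\right]$$

Thus for n = 3, we can write all epistatic terms:

$$\begin{pmatrix} \varepsilon_{***} \\ \varepsilon_{**1} \\ \varepsilon_{*1*} \\ \varepsilon_{*11} \\ \varepsilon_{1**} \\ \varepsilon_{1*1} \\ \varepsilon_{11*} \\ \varepsilon_{111} \end{pmatrix} = \mathbf{V} \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{pmatrix} \begin{pmatrix} y_{000} \\ y_{001} \\ y_{010} \\ y_{011} \\ y_{100} \\ y_{101} \\ y_{110} \\ y_{111} \end{pmatrix}$$

where $\mathbf{V}$ is a diagonal weighting matrix to account for averaging over different numbers of terms as a function of the order of epistasis; $v_{ii} = (-1)^{q_i} / 2^{\,n - q_i}$, where $q_i$ is the order of the epistatic contribution in row i.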
More generally, for any number of mutations n:

$$\bar{\varepsilon} = \mathbf{V}\,\mathbf{H}\,\bar{y}$$

where $\bar{y}$ is the same vector of phenotypes of variants as defined above, $\bar{\varepsilon}$ is the vector of background-averaged epistatic terms, and $\mathbf{H}$ is the operator for background-averaged epistasis, defined recursively as

$$\mathbf{H}_{n+1} = \begin{pmatrix} \mathbf{H}_{n} & \mathbf{H}_{n} \\ \mathbf{H}_{n} & -\mathbf{H}_{n} \end{pmatrix}, \qquad \mathbf{H}_{0} = 1$$

The recursive definition for the weighting matrix is

$$\mathbf{V}_{n+1} = \begin{pmatrix} \tfrac{1}{2}\mathbf{V}_{n} & 0 \\ 0 & -\mathbf{V}_{n} \end{pmatrix}, \qquad \mathbf{V}_{0} = 1$$

The matrix $\mathbf{H}$ has special significance; its action mathematically corresponds to a generalized Fourier decomposition [55,56] known as the Walsh-Hadamard transform and, therefore, this operation can also be seen as a spectral analysis of the high-dimensional phenotypic landscape defined by the genotypes studied [47-50]. In this transform, the phenotypic effects of combinations of mutations are represented as sums over averaged epistatic terms. We note that strong parallels exist between the Fourier decomposition of a landscape and ANOVA, a statistical analysis based on partitioning of variance among effects and interactions of different orders (see S5 Text for details). In summary, the definition of epistasis laid out in this section is a global definition over sequence space, averaging the epistatic effects of mutations over the ensemble of all possible variants. In contrast, the biochemical definition given in the previous section is a local one, treating a particular variant as a reference for determining the epistatic effect of mutations.
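As a rough sketch of the weighted Walsh-Hadamard transform described in this section, the following Python code builds the Hadamard matrix by the Sylvester recursion and applies the weights $(-1)^{q}/2^{n-q}$, where q is the number of mutated positions in the row index. The function names are illustrative assumptions; the data are the rounded PDZ free energies from Table 1. The result for $\varepsilon_{*11}$ is cross-checked against its definition as the average of two double-mutant cycles.

```python
def hadamard(n):
    """Sylvester construction: H_0 = [[1]], H_{k+1} = [[H_k, H_k], [H_k, -H_k]]."""
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def weights(n):
    """Diagonal of V: (-1)**q / 2**(n - q), with q the number of 1-bits of the row index."""
    out = []
    for i in range(2 ** n):
        q = bin(i).count("1")
        out.append((-1) ** q / 2 ** (n - q))
    return out

def background_averaged(y):
    """Return V H y for a complete list of 2**n phenotypes in binary order."""
    n = len(y).bit_length() - 1
    H, v = hadamard(n), weights(n)
    return [v[i] * sum(h * yj for h, yj in zip(H[i], y)) for i in range(len(y))]

# Free energies from Table 1 (kcal/mol), genotypes 000..111 in binary order.
y = [-8.17, -7.58, -6.13, -6.24, -5.96, -7.70, -7.67, -8.45]
eps = background_averaged(y)

# Cross-check: eps_{*11} is the mean of the pairwise cycle in both states of site 3.
cycle_bg0 = y[0b011] - y[0b010] - y[0b001] + y[0b000]
cycle_bg1 = y[0b111] - y[0b110] - y[0b101] + y[0b100]
direct = (cycle_bg0 + cycle_bg1) / 2

for i, value in enumerate(eps):
    print(format(i, "03b"), round(value, 2))
```

The zeroth-order term is simply the mean phenotype over all eight genotypes, and the highest-order term coincides with the three-way mutant cycle because no backgrounds remain to average over.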

Estimating epistasis with linear regression

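A third approach for analyzing epistasis is linear regression. For example, when we have a complete dataset of phenotypes of all $2^n$ genotypes, we can use regression to define the extent to which epistasis is captured by only considering terms to some order r:

$$\bar{y} = \mathbf{X}\,\bar{\beta} + \bar{\epsilon}_{\mathrm{res}}$$

where $\mathbf{X}$ is the matrix of ones and zeros that indicates which epistatic terms contribute to each genotype, $\bar{\beta}$ is the vector of regression coefficients, and $\bar{\epsilon}_{\mathrm{res}}$ is the vector of residuals. It is worth noting that the inverse of $\mathbf{X}$ is $\mathbf{X}^{-1} = \mathbf{G}$, the operator for biochemical epistasis (Eq 5; see also S1 Text). Thus, the multidimensional mutant-cycle analysis is indistinguishable from regression to full order (r = n), which is an exact mapping without residual noise ($\bar{\epsilon}_{\mathrm{res}} = 0$). However, the usual aim of regression is to approximate the data with fewer coefficients than there are data points, i.e., r < n. In that case, the columns of $\mathbf{X}$ that correspond to epistatic orders higher than r are dropped: $\mathbf{X}$ is multiplied by a $2^n$-by-m matrix $\mathbf{Q}$, the identity matrix with columns corresponding to epistatic orders higher than r removed. m is the number of epistatic terms up to r and is given by $m = \sum_{k=0}^{r} \binom{n}{k}$. Thus for regression to order r, we can define $\hat{\mathbf{X}} = \mathbf{X}\mathbf{Q}$, and write

$$\bar{y} = \hat{\mathbf{X}}\,\hat{\beta} + \bar{\epsilon}_{\mathrm{res}}$$

The linear regression is performed by solving the so-called normal equations

$$\hat{\beta} = (\hat{\mathbf{X}}^{T}\hat{\mathbf{X}})^{-1}\,\hat{\mathbf{X}}^{T}\,\bar{y} \tag{15}$$

where $\hat{\mathbf{X}}^{T}$ is the transpose of $\hat{\mathbf{X}}$. The product $\hat{\mathbf{X}}^{T}\hat{\mathbf{X}}$ is square, and it is invertible as long as $\hat{\mathbf{X}}$ has full column rank, which holds here because $\mathbf{X}$ itself is invertible. Note that in this analysis we compute epistatic terms only up to the rth order, but use phenotype/fitness data of all $2^n$ combinations of mutants. The more general case, in which we estimate epistatic terms with fewer than $2^n$ data points, is distinct and is discussed below. If the biochemical definition of epistasis is a local one, exploring the coupling of mutations of all orders with regard to one "wild-type" reference, and the ensemble view of epistasis is a global one, assessing the coupling of mutations of all orders averaged over all possible genotypes, then the regression view of epistasis is an attempt to project to a lower dimension, capturing epistasis as much as possible with low-order terms.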
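Regression to order r on a complete dataset can be sketched with the normal equations and exact rational arithmetic. The code below is an illustration under stated assumptions, not the authors' implementation: the design matrix is taken to have entry 1 where all mutations of a term are present in a genotype (so that its full-order inverse is the mutant-cycle operator), the helper names are hypothetical, and the small Gauss-Jordan solver stands in for any linear solver. The data are the rounded PDZ free energies from Table 1.

```python
from fractions import Fraction

def design_matrix(n, r):
    """Columns of the 0/1 design matrix restricted to epistatic orders <= r.

    Entry (i, j) is 1 when every mutation in term j is present in genotype i.
    """
    cols = [j for j in range(2 ** n) if bin(j).count("1") <= r]
    X_hat = [[1 if (i & j) == j else 0 for j in cols] for i in range(2 ** n)]
    return X_hat, cols

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A square and invertible)."""
    m = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(m):
        pivot = next(k for k in range(c, m) if aug[k][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for k in range(m):
            if k != c and aug[k][c] != 0:
                factor = aug[k][c]
                aug[k] = [vk - factor * vc for vk, vc in zip(aug[k], aug[c])]
    return [row[-1] for row in aug]

def regress(y, r):
    """Least-squares coefficients up to order r via the normal equations."""
    n = len(y).bit_length() - 1
    X_hat, cols = design_matrix(n, r)
    m = len(cols)
    XtX = [[sum(row[a] * row[b] for row in X_hat) for b in range(m)] for a in range(m)]
    Xty = [sum(row[a] * yi for row, yi in zip(X_hat, y)) for a in range(m)]
    return dict(zip(cols, solve(XtX, Xty)))

# Free energies from Table 1 (kcal/mol), genotypes 000..111 in binary order.
y = [Fraction(s) for s in
     ["-8.17", "-7.58", "-6.13", "-6.24", "-5.96", "-7.70", "-7.67", "-8.45"]]

beta2 = regress(y, 2)   # regression to second order
beta3 = regress(y, 3)   # full order: an exact fit
for j, b in beta2.items():
    print(format(j, "03b"), float(b))
```

With r = 2 the second-order coefficients come out as 0.13, -1.50, and -2.92 kcal/mol, in line with the Regression Terms column of Table 1; with r = 3 the fit is exact and the coefficients reduce to the mutant-cycle terms.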

Link between the formalisms

The analysis presented above leads to a simple unifying concept underlying the calculations of epistasis. In general, all the calculations are a mapping from the space of phenotypic measurements of genotypes to epistatic coefficients in a general form $\bar{\omega} = \Omega_{\mathrm{epi}}\,\bar{y}$, where $\Omega_{\mathrm{epi}}$ is the epistasis operator. We give the bottom line of the different operators below; their formal mathematical derivations can be found in S1 Text. The most general situation is that of the background-averaged epistasis with averaging over the complete space of possible genotypes. In this case

$$\Omega_{\mathrm{epi}} = \mathbf{V}\,\mathbf{H} \tag{16}$$

where $\mathbf{H}$ is a $2^n \times 2^n$ matrix corresponding to the Walsh-Hadamard transform (n is the number of mutated sites) and $\mathbf{V}$ is a matrix of weights to normalize for the different numbers of terms for epistasis of different orders. The biochemical definition of epistasis using one "wild-type" sequence as a reference is a sub-sampling of terms in the Hadamard transform. In this case

$$\Omega_{\mathrm{epi}} = \mathbf{V}\,\mathbf{X}^{T}\,\mathbf{H} \tag{17}$$

where $\mathbf{X}$ (the regression matrix whose inverse is the operator for biochemical epistasis) is as defined in Eq 13 and $\mathbf{X}^{T}$ is its transpose. In essence, $\mathbf{X}^{T}$ picks out the terms in $\mathbf{H}$ that concern the “wild-type” background. Note that both these mappings are one-to-one, such that the number of epistatic terms (in $\bar{\omega}$) is equal to the number of phenotypic measurements (in $\bar{y}$) and no information is lost. In contrast, regression to lower orders necessarily implies fewer epistatic terms than data points, which means the mapping is compressive and information is lost. In this case

$$\Omega_{\mathrm{epi}} = \mathbf{V}\,\mathbf{X}^{T}\,\mathbf{S}\,\mathbf{H} \tag{18}$$

where $\mathbf{S}$ ($\equiv \mathbf{Q}\mathbf{Q}^{T}$) is the identity matrix but with zeros on the diagonal at the orders that are higher than those over which we regress. From a computational point of view, it is interesting to note that regression using the Hadamard transform makes matrix inversion unnecessary (compare with Eq 15). The fundamental point is that all three formalisms for computing epistasis are just versions of the Walsh-Hadamard transform, with weights selected as appropriate for the choice of a single reference sequence or restrictions on the order of epistatic terms considered.
The mathematical underpinnings of these relationships have been previously noted and explained [45,47-50], though the connections to experimental studies in biochemistry and evolutionary biology have been incomplete and underappreciated by the broader scientific community. For example, ensemble and biochemical views of epistasis correspond to Fourier and Taylor expansions, respectively, of multi-dimensional landscapes [47]. The former captures global landscape properties and the latter evaluates the local structure around a particular genotype. Interestingly, the two representations are also mathematically interchangeable (up to weighting factors) by simply changing the representation of genotypes from g ∈{0,1} to σ ∈{−1,1} in an expansion of the form of the regression equation (see Eq 11 and S4 Text). Understanding the connection between the mathematical descriptions and experimental studies of phenotype landscapes as practiced in different fields is important in guiding future work.
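The link between the three operators can be checked numerically. The sketch below assumes the forms $\mathbf{V}\mathbf{H}$, $\mathbf{V}\mathbf{X}^{T}\mathbf{H}$, and $\mathbf{V}\mathbf{X}^{T}\mathbf{S}\mathbf{H}$ for background-averaged, biochemical, and regression epistasis, with the design matrix having entry 1 where a term's mutations are contained in a genotype; all function and variable names are illustrative. Exact fractions are used so that the matrix identities can be tested for equality.

```python
from fractions import Fraction

def popcount(i):
    return bin(i).count("1")

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def diag(values):
    return [[v if i == j else 0 for j in range(len(values))] for i, v in enumerate(values)]

def operators(n, r):
    """Return (V H, V X^T H, V X^T S H) for n sites and regression order r."""
    size = 2 ** n
    H = [[(-1) ** popcount(i & j) for j in range(size)] for i in range(size)]
    V = diag([Fraction((-1) ** popcount(i), 2 ** (n - popcount(i))) for i in range(size)])
    # X^T[i][j] = 1 when the mutations of term i are all present in genotype j.
    Xt = [[1 if (i & j) == i else 0 for j in range(size)] for i in range(size)]
    S = diag([1 if popcount(i) <= r else 0 for i in range(size)])
    omega_avg = matmul(V, H)
    omega_bio = matmul(V, matmul(Xt, H))
    omega_reg = matmul(V, matmul(Xt, matmul(S, H)))
    return omega_avg, omega_bio, omega_reg

def apply(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

omega_avg, omega_bio, omega_reg = operators(3, 2)

# Mutant-cycle operator in closed form: sign (-1)**(|i| - |j|) when j is contained in i.
G = [[(-1) ** (popcount(i) - popcount(j)) if (i & j) == j else 0 for j in range(8)]
     for i in range(8)]

# Free energies from Table 1 (kcal/mol), genotypes 000..111 in binary order.
y = [Fraction(s) for s in
     ["-8.17", "-7.58", "-6.13", "-6.24", "-5.96", "-7.70", "-7.67", "-8.45"]]
beta = apply(omega_reg, y)
print([float(b) for b in beta])
```

No linear system is solved here: the regression coefficients to second order follow from matrix products alone, and the rows of the regression operator at the highest retained order coincide with the corresponding rows of the background-averaged operator.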

Empirical examples

To illustrate the different analyses of epistasis, we begin with a small case study of three spatially proximal mutations that define a switch in ligand specificity in PSD95pdz3, a member of the PDZ family of protein interaction modules (Fig 2A). Two mutations are located in the PDZ3 domain itself (G330T and H372A) and one mutation is in its cognate ligand peptide (T-2F). The phenotype is the binding affinity, $K_d$, and the absence of epistasis implies additivity in the corresponding free energy, expressed as $\Delta G^{o} = RT \ln K_d$. (Binding affinities for this system are measured in [57] and given in Fig 2B.) These quantitative phenotypes are then transformed into epistatic terms using Eqs 16–18 (Table 1).

Fig 2

Examples of epistasis in a PDZ domain (A, B) and a K+ ion channel (C).

Table 1

Interaction terms after applying the three different transforms to the PDZ–ligand dataset with three mutable positions: three-way mutant cycle, background-averaged epistasis, and regression (to second order).

Genotype¹	Free Energy²	Interaction Terms³	Mutant Cycle	Background-Averaged Epistasis	Regression Terms
THG	y¯		λ¯	ε¯	β¯
000	-8.17 (0.07)	***	-8.17 (0.07)	-7.24 (0.03)	-7.96 (0.06)
001	-7.58 (0.09)	**1	0.59 (0.11)	-0.51 (0.06)	0.17 (0.10)
010	-6.13 (0.14)	*1*	2.05 (0.15)	0.23 (0.06)	1.63 (0.13)
011	-6.24 (0.07)	*11	-0.70 (0.19)	0.13 (0.12)	0.13 (0.12)
100	-5.96 (0.03)	1**	2.22 (0.07)	-0.41 (0.06)	1.80 (0.08)
101	-7.70 (0.11)	1*1	-2.33 (0.16)	-1.50 (0.12)	-1.50 (0.12)
110	-7.67 (0.09)	11*	-3.76 (0.18)	-2.92 (0.12)	-2.92 (0.12)
111	-8.45 (0.06)	111	1.67 (0.25)	1.67 (0.25)	0 (0.00)

1 The three mutations are T-2F in the ligand and H372A and G330T in the protein, respectively. They are designated in this column as “THG.”

2 Free energies are in kcal/mol, with standard deviation in parentheses.

3 Interacting positions are in the same order as genotypes, e.g., “*11” indicates the epistasis between amino acid positions 372 and 330 in PSD95-PDZ3.

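Standard deviations in epistatic terms are given in parentheses and calculated according to $\delta\bar{\omega} \circ \delta\bar{\omega} = (\Omega_{\mathrm{epi}} \circ \Omega_{\mathrm{epi}})(\delta\bar{y} \circ \delta\bar{y})$, where the δs designate the error vectors of the epistatic terms ($\delta\bar{\omega}$) and of the measured phenotypes ($\delta\bar{y}$), and ∘ stands for the element-wise product (see also S2 Text).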
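The error propagation in the footnote can be illustrated as follows, assuming that measurement errors on the individual free energies are independent, so that variances add with the squared operator entries as weights. This is a sketch with hypothetical helper names, using the rounded standard deviations of the Free Energy column of Table 1.

```python
import math

def popcount(i):
    return bin(i).count("1")

def propagate_sd(omega, sd_y):
    """Standard deviations of omega @ y for independent errors sd_y:
    sd_i = sqrt(sum_j omega[i][j]**2 * sd_y[j]**2)."""
    return [math.sqrt(sum((w * s) ** 2 for w, s in zip(row, sd_y))) for row in omega]

n = 3
size = 2 ** n
# Mutant-cycle operator (closed form) and background-averaging operator V H.
G = [[(-1) ** (popcount(i) - popcount(j)) if (i & j) == j else 0 for j in range(size)]
     for i in range(size)]
VH = [[(-1) ** popcount(i) / 2 ** (n - popcount(i)) * (-1) ** popcount(i & j)
       for j in range(size)] for i in range(size)]

# Standard deviations from Table 1 (kcal/mol), genotypes 000..111 in binary order.
sd_y = [0.07, 0.09, 0.14, 0.07, 0.03, 0.11, 0.09, 0.06]

sd_lambda = propagate_sd(G, sd_y)
sd_eps = propagate_sd(VH, sd_y)
for i in range(size):
    print(format(i, "03b"), round(sd_lambda[i], 2), round(sd_eps[i], 2))
```

Computed from the rounded inputs, the propagated values agree with the parenthesized entries of Table 1 to within about 0.01 kcal/mol.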

(A) PDZ domains are small, mixed αβ proteins that bind target peptide ligand (in yellow stick bonds) in a groove formed between the β2 and α2 elements (PSD95pdz3 shown, Protein Data Bank (PDB) accession 1BE9). The study discussed in the main text and in Table 1 is focused on the epistatic interactions between three amino acid positions: two in the PDZ domain (H372 and G330) and one in the ligand (T-2) (red spheres). (B) A thermodynamic cube representing the energetics of mutations at the three positions; values are equilibrium dissociation constants ($K_d$) for the target ligand (CRIPT [58]) in μM for all eight possible combinations of mutations; errors represent standard deviation. (C) Structure of the homotetrameric KcsA K+ ion channel (PDB accession 1K4C), showing the four positions selected for mutation in Sadovsky and Yifrach (in red spheres, shown only for one subunit for clarity) [60]. Note that the experiments were carried out in the Shaker K+ ion channel, and the positions in Shaker numbering are given in parentheses. The positions form a network that roughly links the intracellular activation gate and the selectivity filter.

A number of simple mathematical relationships are evident in the data. First, regression is carried out only to the second order, and therefore the third-order epistatic term for this analysis does not exist (or, equivalently, is set to zero if the epistatic vector is defined to be of full length $2^n$). Second, some numerical equalities exist. The regression terms at the highest order (second, in this case) are equal to the corresponding terms for the averaged epistasis. This is because $\mathbf{S}$ sets columns representing orders higher than the regression order to zero, leaving rows corresponding to the highest regression order with only one non-zero element on the diagonal. For these rows, the entries in the epistasis operators $\mathbf{V}\mathbf{X}^{T}\mathbf{S}\mathbf{H}$ and $\mathbf{V}\mathbf{H}$ are equal. Another more trivial equality is the highest-order term for the mutant-cycle and averaged epistasis formalisms; there is only one contribution for the highest order and, therefore, no backgrounds over which to average. The data also illustrate the key properties of the different formalisms. The G330T, H372A, and T-2F mutations represent a collectively cooperative set of perturbations, as indicated by a significant third-order epistatic term by both mutant cycle and background-averaged definitions ($\lambda_{111} = \varepsilon_{111} = 1.67$ kcal mol−1). But the three formalisms differ in the energetic value of the lower-order epistatic terms. For example, G330T is essentially neutral for “wild-type” ligand binding but shows a dramatic gain in affinity in the context of the T-2F ligand; thus, there is a large second-order epistatic term by the biochemical definition ($\lambda_{101} = −2.33$ kcal mol−1). However, the coupling between G330T and T-2F is much weaker in the background of H372A; as a consequence, the background-averaged second-order epistasis term $\varepsilon_{1*1}$ is smaller (−1.5 kcal mol−1). Similarly, both biochemical and regression formalisms assign a large first-order effect to the T-2F (1**) and H372A (*1*) single mutations, while the corresponding background-averaged terms are much smaller.
For example, the free energy effect of mutating the ligand (T-2F, $\lambda_{100}$) is 2.22 kcal mol−1 in the “wild-type” background, but is −1.54 kcal mol−1 in the background of the H372A mutation, a nearly complete reversal of the effect of this mutation depending on context. Thus, with background averaging, the first-order term for T-2F ($\varepsilon_{1**}$) is close to zero. This makes sense given the experiment described in Fig 2. More broadly, given the known specificities in the PDZ family [59], the mutation should not be thought of as a general determinant of ligand affinity. T-2F may have a disrupting effect on the function from the perspective of a specific PDZ domain (the “wild-type”), but from the perspective of the protein family (in which various different functional domain–ligand combinations are found) a phenylalanine at position -2 in the ligand is not necessarily detrimental to binding affinity. Instead, it is a conditional determinant with an effect that depends on the identity of the proximal amino acid in the PDZ domain. The analysis of other combinatorial mutation datasets reinforces these conclusions. For example, high-quality measurements comprising a fourth-order thermodynamic analysis of epistasis are available for the Shaker potassium channel (data from [60] and [61], Fig 2C, Table 2), where the phenotype observed is the activation free energy for opening of the ion channel pore [61]. Using the standard biochemical formalism for epistasis, the work of Sadovsky and Yifrach [60] demonstrates large high-order epistasis between four mutations at sites forming a path between the intracellular crossing of transmembrane helices (the so-called “activation gate”) and the selectivity filter for ions (Fig 2C, [61]).
The biologically interesting finding is that for this system of mutations, the magnitude of epistasis rises with increasing order of the epistasis; that is, $\Delta^{4} G > \Delta^{3} G > \Delta^{2} G > \Delta^{1} G$ (Table 3, $\bar{\lambda}$), a result that suggests the collective action of this system of residues with regard to pore opening. We compared the biochemical and background-averaged epistasis for this system of four mutations (Fig 2C, Table 2, and complete analysis in S3 Text). The analysis shows that the background-averaged epistasis enhances the essential point of Sadovsky and Yifrach; the fourth-order epistatic term dominates (Table 3, $\bar{\varepsilon}$), and all lower terms are weak. As in the case of the PDZ domain, the reason for this is that the lower-order epistatic effects are conditional on the background of other mutations and are correspondingly assigned less significance. This analysis clarifies the notion that this system of residues comprises a collectively acting, cooperative network underlying channel gating.

Table 2

Interaction terms based on the standard mutant cycle formalism ($\bar{\lambda}$) and on background-averaged epistasis ($\bar{\varepsilon}$) for pore-opening free energies in the Shaker K+ voltage-gated channel.

As for the PDZ domain (Table 1), background averaging modulates epistasis at each level given the existence of higher-order terms. Primary data are from [60] and [61].

Genotype¹	ΔG_open²	Interaction Terms	Mutant Cycle²	Background-Averaged Epistasis²
	y¯		λ¯	ε¯
0000	-1.97 (0.05)	****	-1.97 (0.05)	-8.33 (0.05)
0001	-7.05 (0.12)	***1	-5.08 (0.13)	-0.64 (0.10)
0010	-13.57 (0.29)	**1*	-11.60 (0.29)	-3.52 (0.10)
0011	-9.47 (0.25)	**11	9.18 (0.40)	2.97 (0.20)
0100	-7.97 (0.34)	*1**	-6.00 (0.34)	-1.09 (0.10)
0101	-8.11 (0.19)	*1*1	4.94 (0.41)	-1.13 (0.20)
0110	-10.01 (0.33)	*11*	9.56 (0.56)	1.46 (0.20)
0111	-13.50 (0.32)	*111	-12.53 (0.73)	-3.00 (0.40)
1000	-7.04 (0.21)	1***	-5.07 (0.22)	1.25 (0.10)
1001	-6.58 (0.08)	1**1	5.54 (0.26)	1.02 (0.20)
1010	-8.42 (0.13)	1*1*	10.22 (0.38)	3.68 (0.20)
1011	-8.20 (0.16)	1*11	-9.42 (0.51)	0.12 (0.40)
1100	-5.05 (0.12)	11**	7.99 (0.42)	1.58 (0.20)
1101	-8.80 (0.09)	11*1	-9.15 (0.49)	0.39 (0.40)
1110	-10.07 (0.11)	111*	-13.20 (0.63)	-3.67 (0.40)
1111	-7.52 (0.04)	1111	19.07 (0.81)	19.07 (0.81)

1 The four mutations are T469A, A465V, E395A, and A391V (corresponding to the bits in the first column in left-to-right order).

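2 Standard deviations of epistatic terms are given in parentheses and computed according to $\delta\bar{\omega} \circ \delta\bar{\omega} = (\Omega_{\mathrm{epi}} \circ \Omega_{\mathrm{epi}})(\delta\bar{y} \circ \delta\bar{y})$, where the δs designate the error vectors and ∘ stands for the element-wise product (see S2 Text).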

Table 3

Mean absolute values of interaction terms for the four-mutation network in the Shaker K+ channel.

This analysis recapitulates the basic finding of Sadovsky and Yifrach [60] that these positions comprise a cooperative unit, a result that is further clarified with background averaging.

Epistatic Order¹	Mutant Cycle²	Background-Averaged Epistasis²
	|λ¯| mean	|ε¯| mean
0	1.97 (0.05)	8.33 (0.05)
1	6.94 (0.26)	1.63 (0.10)
2	7.91 (0.42)	1.98 (0.20)
3	11.08 (0.60)	1.79 (0.40)
4	19.07 (0.81)	19.07 (0.81)

1 Order over which the absolute values of epistatic terms are averaged.

2 Errors on the mean are given in parentheses.
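The entries of Table 3 can be recomputed from the interaction terms printed in Table 2 by grouping terms according to their epistatic order (the number of mutated positions in the interaction label) and averaging their absolute values. The short sketch below uses the rounded values from Table 2; the function name is an illustrative assumption.

```python
# Columns of Table 2 (kcal/mol), genotypes 0000..1111 in binary order.
mutant_cycle = [-1.97, -5.08, -11.60, 9.18, -6.00, 4.94, 9.56, -12.53,
                -5.07, 5.54, 10.22, -9.42, 7.99, -9.15, -13.20, 19.07]
background_avg = [-8.33, -0.64, -3.52, 2.97, -1.09, -1.13, 1.46, -3.00,
                  1.25, 1.02, 3.68, 0.12, 1.58, 0.39, -3.67, 19.07]

def mean_abs_by_order(terms):
    """Mean absolute value of interaction terms, grouped by epistatic order."""
    n = len(terms).bit_length() - 1
    sums = [0.0] * (n + 1)
    counts = [0] * (n + 1)
    for i, value in enumerate(terms):
        q = bin(i).count("1")  # epistatic order = number of 1-bits in the index
        sums[q] += abs(value)
        counts[q] += 1
    return [s / c for s, c in zip(sums, counts)]

lam_means = mean_abs_by_order(mutant_cycle)
eps_means = mean_abs_by_order(background_avg)
for q, (a, b) in enumerate(zip(lam_means, eps_means)):
    print(q, round(a, 3), round(b, 3))
```

For the mutant-cycle terms the means rise steadily with order, whereas for the background-averaged terms only the fourth-order term is large, which is the pattern summarized in Table 3.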

These examples show that background averaging has the effect of “correcting” mutational effects for the existence of higher-order epistatic interactions. Without background averaging, the effect of a mutation (at any order) idiosyncratically depends on a particular reference genotype and will fail to account for higher-order epistasis that modulates the observed mutational effect. Thus, background averaging provides a measure of the effects of mutation that represents its general value over many related systems and, more appropriately, represents the cooperative unit within which the mutation operates. Note that the degree of averaging depends on the number of mutated sites and, thus, the interpretation of mutational effects will depend on the scale of the experimental study. As we will discuss in the next section, finding good averaging ensembles is crucial for background-averaged epistasis to be a useful quantity. This is not only in terms of elucidating general physical mechanisms at play in the system but also for being able to accurately predict the effects of mutations in an individual system.

The epistatic structure of larger systems

The analytical expressions in Eqs 16–18 involve the measurement of phenotypes (y) for all $2^n$ combinatorial mutants, a fact that exposes two fundamental problems. First, it is only practical when n is small. In such cases (e.g., Fig 2, n = 3 or 4), the data can be combinatorially complete, permitting a full analysis of the local and global structure of epistasis, possible evolutionary trajectories, and adaptive trade-offs [62,63]. But for the typical size of protein domains (n∼150), the combinatorial complexity of mutations precludes the collection of complete datasets. Second, even if it were possible, sampling all genotypes is not desirable; indeed, the majority of systems in such an ensemble are unlikely to be functional, and averages over them are not meaningful with regard to learning the epistatic structure of native systems.

How then can we apply these epistasis formalisms in practice, especially with regard to background averaging? To develop general principles, we begin with two obvious approaches that lead to well-defined alternative expressions for averaged epistasis.

First, consider the case in which the data are only "locally complete"; that is, we have all possible mutants up to a certain order p≤n. We can then define a measure that is intermediate between epistasis with a single reference genotype and epistasis with full background averaging, which we will refer to as the partial background-averaged epistasis. For example, for three positions (n = 3) with data complete only up to order p = 2, the partial background-averaged effect of the first position (rightmost lower index) is calculated as $\varepsilon_{**1} = (y_{001} - y_{000} + y_{011} - y_{010} + y_{101} - y_{100})/3$. Compared to the full background-averaged epistasis, the partial average leaves out the last term, $y_{111} - y_{110}$, which involves the unavailable phenotype of the triple mutant, $y_{111}$, and accordingly divides by three backgrounds rather than four.
More generally, we can define this measure of epistasis as another special case of the Hadamard transform:

$$\varepsilon_p = V_p \, (H \circ Z_p) \, y \qquad (19)$$

where ∘ designates the element-wise product. $V_p$ is again a diagonal weighting matrix, now given by $v_{ii} = (-1)^{q_i} / C_{n,p,q_i}$, where $q_i$ is the epistatic order associated with row i, as defined earlier, and $C_{n,p,q} = \sum_{k=0}^{p-q} \binom{n-q}{k}$. Note that p≥q because mutants of order higher than p are considered absent in the dataset. The matrix $Z_p$ simply serves to multiply by zero the terms in the Hadamard matrix that include orders higher than p. Interestingly, the matrices $Z_p$ display a self-similar hierarchical pattern (Fig 3) and are related to Sierpinski triangles (see [64]). This permits a recursive definition in both n and p for the product $H \circ Z_p$, which we will designate as $F_{n,p}$:

$$F_{n+1,p} = \begin{pmatrix} F_{n,p} & F_{n,p-1} \\ F_{n,p-1} & -F_{n,p-1} \end{pmatrix} \qquad (20)$$

with $F_{n,p} = H_n$ for n≤p, and $F_{n,0}$ is a $2^n \times 2^n$ matrix of zeros, except for a 1 in the upper left corner. This analysis assumes that data are complete up to order p. If not, analytical schemes for background-averaged epistasis such as Eqs 19 and 20 are not obvious.
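The partial background averaging described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, genotypes and epistatic terms are indexed by their binary labels, and the zero/one mask is assumed to keep entry (i, j) of the Hadamard matrix only when the union of the sites in term i and genotype j has at most p mutations. The recursive builder additionally assumes that the masked matrix is all zeros when p drops below zero.

```python
from math import comb


def popcount(i):
    """Number of 1-bits in i (order of a genotype or of an epistatic term)."""
    return bin(i).count("1")


def masked_hadamard(n, p):
    """Element-wise product of the Hadamard matrix with the order-p mask."""
    size = 2 ** n
    return [[(-1) ** popcount(i & j) if popcount(i | j) <= p else 0
             for j in range(size)] for i in range(size)]


def masked_hadamard_recursive(n, p):
    """Same matrix built block-wise as [[F(n-1,p), F(n-1,p-1)], [F(n-1,p-1), -F(n-1,p-1)]]."""
    if p < 0:                       # assumption: no admissible entries
        return [[0] * 2 ** n for _ in range(2 ** n)]
    if n == 0:
        return [[1]]
    A = masked_hadamard_recursive(n - 1, p)
    B = masked_hadamard_recursive(n - 1, p - 1)
    top = [a + b for a, b in zip(A, B)]
    bottom = [b + [-x for x in b] for b in B]
    return top + bottom


def partial_background_averaged(y, n, p):
    """Partial background-averaged terms; terms of order > p are returned as None."""
    eps = []
    for i, row in enumerate(masked_hadamard(n, p)):
        q = popcount(i)
        if q > p:
            eps.append(None)
            continue
        # Number of backgrounds with at most p - q mutations at the other n - q sites.
        n_backgrounds = sum(comb(n - q, k) for k in range(p - q + 1))
        # Masked entries are skipped, so unmeasured phenotypes may be None.
        total = sum(f * v for f, v in zip(row, y) if f != 0)
        eps.append((-1) ** q * total / n_backgrounds)
    return eps


# Hypothetical phenotypes y000 ... y110; the triple mutant y111 is unmeasured.
y = [0.0, 1.0, 2.0, 4.0, 3.0, 5.0, 6.0, None]
eps = partial_background_averaged(y, 3, 2)
# eps[1] equals (y001 - y000 + y011 - y010 + y101 - y100) / 3 = 5/3
```

Building the masked matrix both directly and recursively gives a simple consistency check on the block structure; for n = 3 and p = 2 the two constructions agree entry by entry.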

Fig 3

Examples of matrices Z introduced to calculate the partial background-averaged epistasis for n = 3.


(A) $Z_2$, for when data for mutants up to second order are available, and (B) $Z_1$, for when only first-order mutants are available. Both matrices are self-similar, which allows their generation for arbitrary order, and are related to the logic Sierpinski triangle. For example, $Z_2$ equals the all-ones matrix minus the product of the Sierpinski matrix Σ (i.e., multigrade AND in Boolean logic) for three inputs with the anti-diagonal identity matrix.

A second analytically tractable case for incomplete data arises in regression, in which the idea is to estimate epistatic terms up to a specified order from available data. This involves solving a set of equations similar to the normal equations (Eq 21). In these equations, the available data are selected by an $s \times 2^n$ matrix constructed from the $2^n \times 2^n$ identity matrix by deleting the $2^n - s$ rows corresponding to the unavailable phenotypic data; the remaining quantities are as defined above. In order for this system of equations to be solvable, a necessary constraint is that s≥m; that is, the number of data points available (s) should be larger than or equal to the number of regression parameters (m). In addition, the data must be such that it is possible to uniquely solve for all epistatic terms in the regression. For example, if two mutations always co-occur in the data, it is impossible to calculate their independent effects. In such cases, the number of solutions to Eq 21 is infinite (the matrix to be inverted is singular).

In practice, even with "high-throughput" assays, we can only hope to measure a tiny fraction of all combinatorial mutants due to the vast number of possibilities. In this situation, the problem of inferring epistasis by regression may be further constrained by imposing additional conditions, termed regularization. For example, kernel ridge regression [65] and least absolute shrinkage and selection operator (LASSO) [66] include a weighted norm of the regression coefficients in the minimization procedure.
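The solvability conditions for regression on incomplete data can be illustrated with a short Python sketch. It is written under explicit assumptions and does not reproduce the operators of Eq 21: the model is an ordinary least-squares fit in which each regression term is an indicator of whether a genotype carries a given set of mutations relative to the all-zeros reference, the function names are hypothetical, and the phenotype values are invented. Exact rational arithmetic is used so that a singular system is detected unambiguously.

```python
from fractions import Fraction


def popcount(i):
    return bin(i).count("1")


def design_matrix(genotypes, n, p):
    """Rows: measured genotypes. Columns: terms of order <= p (indicator coding)."""
    terms = [t for t in range(2 ** n) if popcount(t) <= p]
    X = [[1 if (g & t) == t else 0 for t in terms] for g in genotypes]
    return terms, X


def solve_normal_equations(X, y):
    """Solve (X^T X) beta = X^T y exactly; return None if X^T X is singular."""
    s, m = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[Fraction(sum(X[r][a] * X[r][b] for r in range(s))) for b in range(m)]
         + [Fraction(sum(X[r][a] * y[r] for r in range(s)))] for a in range(m)]
    for col in range(m):            # Gauss-Jordan elimination
        pivot = next((r for r in range(col, m) if A[r][col] != 0), None)
        if pivot is None:
            return None             # no unique solution
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                A[r] = [vr - A[r][col] * vc for vr, vc in zip(A[r], A[col])]
    return [A[r][m] for r in range(m)]


# Five measured genotypes (s = 5), first-order model for n = 3 (m = 4).
genotypes = [0b000, 0b001, 0b010, 0b100, 0b011]
phenotypes = [1, 3, 4, 6, 6]        # hypothetical, additive: 1 + 2*b1 + 3*b2 + 5*b3
terms, X = design_matrix(genotypes, 3, 1)
beta = solve_normal_equations(X, phenotypes)        # [1, 2, 3, 5]

# Sites 1 and 2 always co-occur: their separate effects cannot be resolved.
_, X_bad = design_matrix([0b000, 0b011, 0b100, 0b111], 3, 1)
beta_bad = solve_normal_equations(X_bad, [1, 6, 6, 11])   # None
```

The second dataset satisfies s≥m yet still has no unique solution, which shows that the counting condition is necessary but not sufficient.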
Regularization comes with its own set of caveats [67], but its application is, unlike the approaches in Eqs 19 and 21, not conditional on a specific structure of the data or depth of coverage.

However, none of these approaches directly addresses the problem of optimally defining appropriate ensembles of genotypes over which averages should be taken. In principle, the idea should be to perform background averaging over a representative ensemble of systems that show invariance of functional properties of interest. How can we generally find such ensembles without the impractical notion of exhaustive functional analysis of the space of possible genotypes? One idea is motivated by the empirical finding of sparsity in the pattern of key epistatic interactions within biological systems. Indeed, evidence suggests that proteins typically contain a small subset of amino acids that show strong and distributed epistatic couplings, surrounded by a majority of amino acids that are more weakly and locally coupled [60,68-71]. Thus, protein sequences can show extraordinary divergence while preserving folding and function, and only a small set of epistatic constraints can suffice to computationally build synthetic proteins that recapitulate these properties [72,73]. More generally, the notion of a sparse core of strong couplings surrounded by a milieu of weak couplings has been argued to be a signature of evolvable systems [74]. If it can be more generally verified, the notion of sparsity can be exploited to define relevant strategies for optimally learning the epistatic structure of natural systems.
For example, one approach is to minimize the so-called ℓ1-norm (the sum of absolute values of the epistatic coefficients [66]) in a constrained optimization, while projecting onto background-averaged epistatic terms (Eq 22). This procedure is akin to the technique of compressive sensing [75,76], a powerful approach used in signal processing to recognize the low-dimensional space in which the relevant features of a high-dimensional dataset occur, given sparsity of these features. The application of this theory for mapping biological epistasis has, to our knowledge, not been reported before, but its value might be explored with focused high-order mutational analyses in specific well-chosen model systems. This has the potential to link the study of epistasis to a formal theory of signal reconstruction [75,76], which may help define optimal strategies for data collection. The necessary technologies for developing these ideas are now becoming available.

It is worth pointing out that a class of approaches that use ensemble-averaged information to understand complex biological systems has been developed and experimentally tested. Statistical methods that operate on multiple sequence alignments [71,77-82] calculate quantities that estimate the coevolution of amino acids in the sampling of sequences comprising the alignment. In this regard, coevolution can be seen as a form of background-averaged pairwise epistasis in which the ensemble of genotypes for averaging is defined by homology. Importantly, these approaches have been successful at revealing a hierarchy of cooperative interactions between amino acids that range from local structural contacts in protein tertiary structures [81-83] to more global functional modes [71,84,85].
Coevolution only provides averaged pairwise epistatic terms, but studies show that it is possible to use this information to computationally design artificial sequences that fold and biochemically function in a manner similar to their natural counterparts [72,73]. Thus, for defining good experimental approaches to elucidating epistatic structures, a conceptual advance may come from formally mapping the constrained optimization problem described in Eq 22 to the kind of ensemble averaging that underlies the statistical coevolution approaches.
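The sparsity-based recovery idea discussed above can be illustrated in its smallest nontrivial form. The Python sketch below is an assumption-laden toy, not the optimization of Eq 22: it treats the case of a single unmeasured phenotype, takes the background-averaged terms to be a Hadamard transform with row weights (−1)^q / 2^(n−q), and chooses the missing value that minimizes the sum of absolute values of all terms while leaving the measured phenotypes fixed. Because every term is then an affine function of the one unknown, the ℓ1-norm is convex and piecewise linear, and its minimum lies at a value where some term vanishes. The function names and the phenotype values are hypothetical.

```python
def popcount(i):
    return bin(i).count("1")


def background_averaged(y, n):
    """Background-averaged terms from a complete phenotype vector y."""
    size = 2 ** n
    eps = []
    for i in range(size):
        q = popcount(i)
        total = sum((-1) ** popcount(i & j) * y[j] for j in range(size))
        eps.append((-1) ** q * total / 2 ** (n - q))
    return eps


def l1_fill_single_missing(y, n, missing):
    """Pick the one unmeasured phenotype so that sum(|eps_i|) is minimal."""
    def eps_for(t):
        trial = list(y)
        trial[missing] = t
        return background_averaged(trial, n)

    base = eps_for(0.0)
    slope = [e1 - e0 for e0, e1 in zip(base, eps_for(1.0))]
    # eps_i(t) = base_i + slope_i * t; the l1-norm can only bend where a term is zero.
    candidates = [-b / s for b, s in zip(base, slope) if s != 0]
    best = min(candidates, key=lambda t: sum(abs(e) for e in eps_for(t)))
    return best, eps_for(best)


# Hypothetical data generated by a sparse model (mean 2, first-site effect 3);
# the triple mutant (index 7) is treated as unmeasured.
y = [0.5, 3.5, 0.5, 3.5, 0.5, 3.5, 0.5, None]
t, eps = l1_fill_single_missing(y, 3, 7)
# t is 3.5, and eps has only two nonzero entries: eps[0] = 2.0 and eps[1] = 3.0
```

In this toy case the sparse model is recovered exactly from seven of the eight phenotypes. With many unmeasured genotypes the same objective becomes a linear program, and whether recovery succeeds depends on how sparse the true epistatic structure is and on which genotypes are measured.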

Discussion

A fundamental problem is to define the epistatic structure of biological systems, which holds the key to understanding how phenotype arises from genotype. Here we describe a unified mathematical foundation for epistasis in which different approaches are versions of a single mathematical formalism, the weighted Walsh-Hadamard transform. In the most general case, this transform corresponds to an averaging of mutant effects over all possible genetic backgrounds at every order of epistasis. This approach corrects the effect of mutations at every level of epistasis for higher-order terms. Importantly, it represents the degree to which the effects of mutations are transferable from one model system to another, which is the usual purpose of most mutagenesis studies. In contrast, the thermodynamic mutant cycle (commonly used in biochemistry) [51] constitutes a special case of taking a single reference genotype and thus no averaging [60,61,86-90]. This analysis represents the effects of mutations that are specific to a particular model system. Regression (commonly used in evolutionary biology) is an attempt to capture features of a system with epistatic terms up to a defined lower order, often to bound the extent of epistasis or to predict the effects of combinations of mutations [33,91]. The similarity of the regression operator to that of the mutant cycle (see Eq 13) indicates that this approach is also focused on the local mutational environment of a chosen reference sequence.

Overall, background averaging would seem to provide the most informative representation of the general effect of a mutation. However, with the exception of very small-scale studies focused on the local mutational environment of extant systems, it is both impractical and logically flawed to collect combinatorially complete mutation datasets for any system. Thus, the essence of the problem is to define optimal strategies for collecting data on ensembles of genotypes that are sufficient for discovering the biologically relevant epistatic structure of systems. The notion of sparsity of epistatic interactions provides a general basis for developing such a strategy, and it will be interesting to test practical applications of this concept (e.g., Eq 22) in future work. Defining optimal data collection strategies will not only provide practical tools to probe specific systems but also might guide us to principles underlying the "design" of these systems through the process of evolution and help the rational design of new systems. The mathematical relations discussed here provide a foundation to advance such understanding.

Additional proofs; expressing epistasis operators as Hadamard transforms.