Relation Between the Number of Peaks and the Number of Reciprocal Sign Epistatic Interactions.

Raimundo Saona¹, Fyodor A Kondrashov¹, Ksenia A Khudiakova².

Abstract

Empirical assays of fitness landscapes suggest that they may be rugged, that is, that they have multiple fitness peaks. Such fitness landscapes, those that have multiple peaks, necessarily have special local structures, called reciprocal sign epistasis (Poelwijk et al. in J Theor Biol 272:141-144, 2011). Here, we investigate the quantitative relationship between the number of fitness peaks and the number of reciprocal sign epistatic interactions. Previously, it has been shown (Poelwijk et al. in J Theor Biol 272:141-144, 2011) that pairwise reciprocal sign epistasis is a necessary but not sufficient condition for the existence of multiple peaks. Applying discrete Morse theory, which to our knowledge has never been used in this context, we extend this result by giving the minimal number of reciprocal sign epistatic interactions required to create a given number of peaks.

Keywords: Fitness landscapes; Morse theory; Multiple peaks; Reciprocal sign epistasis

Year: 2022 PMID: 35713756 PMCID: PMC9205815 DOI: 10.1007/s11538-022-01029-z

Source DB: PubMed Journal: Bull Math Biol ISSN: 0092-8240 Impact factor: 3.871

Introduction

The fitness landscape is the relationship between genotypes and their fitness. The availability of high-throughput methods and next-generation sequencing has made it possible to begin characterizing aspects of different fitness landscapes experimentally. Due to the enormity of the underlying genotype space (Maynard Smith 1970; Wright 1932), the experimental approaches are limited to assaying the fitness of (a) closely related genotypes (Melamed et al. 2013; Romero and Arnold 2009; Sarkisyan et al. 2016; de Visser and Krug 2014); or (b) very restricted genotype spaces such as the interaction of a small number of protein sites (Kuo et al. 2020; Pokusaeva et al. 2019; Wittmann et al. 2021). Nevertheless, the number of assayed genotypes in a single landscape is becoming larger in recent studies (Bryant et al. 2021; Russ et al. 2020) and it appears that the experimental characterization of a sufficiently large fitness landscape with multiple fitness peaks may be attainable within the next decade. The possibility of characterizing multiple fitness peaks will always be restricted by the boundaries of the studied genotype space, thus what appears to be two unconnected fitness peaks may be found along the same fitness ridge when a larger section of the genotype space is analyzed (Whitlock et al. 1995). Therefore, there is a need for the development of computational methods (Alley et al. 2019; Bryant et al. 2021; Russ et al. 2020; Biswas et al. 2021; Wittmann et al. 2021) and theory (Zhou and McCandlish 2020) that can improve the description of experimental fitness landscape datasets, such as obtaining an estimate of the number of isolated peaks. Here, we use Morse theory to calculate the minimal number of reciprocal sign epistatic interactions for a given number of peaks on a landscape.

Epistasis is the interaction of allele states of the genotype, which shapes the fitness landscape. When the impact of allele states on fitness is independent of each other, there is no epistasis and the resulting fitness landscape is smooth and has a single peak. Epistasis can lead to a more rugged fitness landscape and decrease the number of paths of high fitness between genotypes. Epistasis that makes the impact of an allele state on fitness stronger or weaker is called magnitude epistasis. On the other hand, epistasis that causes the contribution of an allele state on fitness to change its sign (e.g., a beneficial mutation becomes deleterious) is called sign epistasis (Weinreich et al. 2005). When the two allele states at different loci change the sign of their respective contribution to fitness, then this interaction is called reciprocal sign epistasis. In a simple example of this principle, in a two-loci two-allele model, there are four genotypes, 00, 01, 10 and 11. The following landscape is shaped by sign epistasis when genotypes 00, 01, 10 and 11 have fitnesses of 1,−1,1 and 1, respectively. Reciprocal sign epistasis (RSE) is present when the fitnesses of 00, 01, 10 and 11 genotypes are 1,−1,−1 and 1, respectively. See Fig. 1 for an illustration.
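The distinction between magnitude, sign and reciprocal sign epistasis in the two-loci two-allele model can be illustrated with a short sketch. The function name `classify_epistasis` and the numeric landscapes other than the RSE example (1, −1, −1, 1) are hypothetical choices made for illustration; the sketch assumes that no single mutation is strictly neutral.

```python
def classify_epistasis(f00, f01, f10, f11):
    """Classify the interaction between two bi-allelic loci.

    fXY is the fitness of the genotype with allele X at the first locus
    and allele Y at the second locus. Assumes no strictly neutral
    mutations, i.e. all four single-mutation effects are non-zero.
    """
    # Effect of mutating the first locus (0 -> 1) on each background.
    locus1_effects = (f10 - f00, f11 - f01)
    # Effect of mutating the second locus (0 -> 1) on each background.
    locus2_effects = (f01 - f00, f11 - f10)

    def sign_flips(effects):
        on_background_0, on_background_1 = effects
        return on_background_0 * on_background_1 < 0

    flips = sign_flips(locus1_effects) + sign_flips(locus2_effects)
    if flips == 2:
        return "reciprocal sign epistasis"
    if flips == 1:
        return "sign epistasis"
    # No sign change: unequal effect sizes indicate magnitude epistasis.
    if locus1_effects[0] != locus1_effects[1]:
        return "magnitude epistasis"
    return "no epistasis"


# The RSE example from the text: fitnesses of 00, 01, 10, 11.
print(classify_epistasis(1, -1, -1, 1))
# Hypothetical landscapes for the remaining categories.
print(classify_epistasis(1, 2, 3, 2.5))
print(classify_epistasis(1, 2, 3, 5))
print(classify_epistasis(1, 2, 3, 4))
```

Only the sign of each single-mutation effect matters for the two sign-epistasis categories; effect sizes are compared only to separate magnitude epistasis from a purely additive landscape.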

Fig. 1

Types of sign epistasis. Vertices represent genotypes (the sequence is outside the vertex and the fitness inside); edges are present between sequences at mutation distance one; and filled faces represent sign epistatic interactions

Of course, the effect of an allele state can depend on more than just one other locus, or site, in the genome. When allele states in different loci impact each other, then the epistasis is higher-order. Higher-order epistasis is found frequently in the characterized fitness landscapes (Weinreich et al. 2013), and it is clear that it has important evolutionary consequences (Canale et al. 2018; Crona et al. 2019; Kondrashov and Kondrashov 2001; de Visser and Krug 2014). However, models that allow studying such epistasis are at an early stage of their development (Crona et al. 2013; Crona 2020), see also Crona et al. 2021 (preprint).

The evolutionary consequences of epistasis may be especially important when it leads to multiple local peaks. In that case, a population can get stuck on a suboptimal peak, decreasing the ability of evolution to find an optimal solution. Using a combinatorial argument, Poelwijk et al. (2011) showed the following qualitative property: reciprocal sign epistasis is necessary for the existence of multiple peaks. In contrast, using Morse theory, we derive a more quantitative description of this relationship. This work might be the first formal use of Morse theory to study fitness landscapes.

Outline of the Method

Morse theory studies the properties of some discrete structures (such as graphs) and special functions defined on them. In particular, the strong Morse inequality relates topological characteristics of a structure with the number of critical points of any function defined on it. Therefore, to use Morse Theory, we define a discrete structure that highlights reciprocal sign epistatic interactions and a function based on the given fitness landscape. The discrete structure is a graph: vertices are binary sequences (genotypes) and edges connect those genotypes within one-mutation distances. Moreover, we include edges between those vertices that are separated by a reciprocal sign epistatic interaction. In the case of graphs, the only requirement for (Morse) functions is to assign a number to both nodes and edges. Naturally, the value on the vertices corresponds to the genotype’s fitness. On the other hand, the value on the edges is tailored for Theorem 1.

Theorem 1

(Quantification of epistatic interactions) Let genotypes be encoded as binary sequences. Consider a fitness landscape, i.e., a function that assigns a number to each genotype, with no strictly neutral mutations. Then,

$$\#\{\text{reciprocal sign epistatic interactions}\} \;\geq\; \#\{\text{peaks}\} - 1.$$

Because we model genotypes as binary sequences, the sequence space is a hypercube. Also, we only consider fitness landscapes with no strictly neutral mutation, i.e., all direct neighbors of a vertex must have a different value than this vertex. These two assumptions allow us to unambiguously define RSE instances.
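The bound relating RSE instances to peaks can be checked by brute force on small landscapes. The following sketch is only an illustration under stated assumptions: genotypes are tuples of 0/1, the landscape is a dictionary `F`, and the helper names (`neighbors`, `count_peaks`, `count_rse`, `random_landscape`) are hypothetical. Random landscapes are built from distinct integer fitness values, which guarantees that no mutation is neutral.

```python
import itertools
import random


def neighbors(g):
    """All genotypes at Hamming distance one from the tuple g."""
    for i in range(len(g)):
        yield g[:i] + (1 - g[i],) + g[i + 1:]


def count_peaks(F):
    """Number of genotypes fitter than all of their single mutants."""
    return sum(all(F[g] > F[h] for h in neighbors(g)) for g in F)


def count_rse(F, n):
    """Number of squares in which one diagonal pair beats both other corners."""
    count = 0
    for g in F:
        for i, j in itertools.combinations(range(n), 2):
            if g[i] or g[j]:
                continue  # visit each square once, from its 00 corner
            a = g                                  # alleles 0,0 at loci i,j
            b = g[:i] + (1,) + g[i + 1:]           # alleles 1,0
            c = g[:j] + (1,) + g[j + 1:]           # alleles 0,1
            d = b[:j] + (1,) + b[j + 1:]           # alleles 1,1
            if (min(F[a], F[d]) > max(F[b], F[c])
                    or min(F[b], F[c]) > max(F[a], F[d])):
                count += 1
    return count


def random_landscape(n, rng):
    """Landscape with distinct integer fitnesses (no neutral mutations)."""
    genotypes = list(itertools.product((0, 1), repeat=n))
    values = list(range(len(genotypes)))
    rng.shuffle(values)
    return dict(zip(genotypes, values))


rng = random.Random(0)
for n in range(2, 6):
    F = random_landscape(n, rng)
    print(n, count_peaks(F), count_rse(F, n))
```

Such a check does not replace the proof; it only shows how the two quantities in the statement could be computed for a fully characterized landscape.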

Informal Proof

Let us first briefly explain the combinatorial argument used in Poelwijk et al. (2011) to show the following qualitative property: reciprocal sign epistasis is a necessary condition for the existence of multiple peaks. The main idea is that between any two peaks (i.e., genotypes with locally maximal fitness), there must be a path consisting of single mutations connecting them. In particular, if the path is chosen well, the minimum fitness along this path is part of a RSE instance. Theorem 1 is the corresponding quantitative version of this statement. In particular, if there are three peaks, we conclude that there is not only one instance of RSE in the fitness landscape, but there must exist at least two of them. Intuitively, our result is explained by induction over the number of peaks as follows. In the base case, already explored in Poelwijk et al. (2011), there are only two peaks. For the inductive case, consider a fitness landscape and introduce a new peak to it. This new peak must be connected to all previous peaks through some reciprocal sign epistatic interaction. The question is if any of these interactions was not there before. We show that a new peak must introduce at least one more such interaction. To make this last step in the proof formal, we use discrete Morse theory. If we allow to introduce a peak together with a new dimension, it is easy to illustrate the induction. For the case of two dimensions, we would introduce a third dimension together with a new peak. See Fig. 2b for a representation.

Fig. 2

Introducing a peak introduces a RSE interaction. Vertices represent genotypes; arrows represent fitness increments; filled vertices represent peaks; and filled faces represent reciprocal sign epistatic interactions

Formal Proof

The strong Morse inequality is a general tool that relates characteristics of a space with properties of special functions defined on it. To motivate subsequent definitions, let us present the original statement (Forman 1998, Corollary 3.6, page 107) applied to graphs (instead of more general discrete structures).

Theorem 2

(Strong Morse inequality) Consider a graph $G = (V, E)$ and a function $f: V \cup E \to \mathbb{R}$. Let $b_0$ and $b_1$ denote the first two Betti numbers of the graph G and let $m_0$ and $m_1$ denote the number of critical nodes and edges of f. Then, we have that

$$m_1 - m_0 \;\geq\; b_1 - b_0.$$

To use this result, we must define the following terms: Betti numbers and critical nodes and edges. But before we do that, note that if the number of RSE instances can be represented as a structural property of a graph (inside $b_1$), and the number of peaks can be encoded in a function (inside $m_0$), then this inequality allows us to quantify the necessary condition for the existence of multiple peaks. We introduce all the necessary concepts before explaining the proof step by step.

Necessary Definitions

In this section, we introduce the terms used in Theorem 2 (Betti numbers, critical nodes and critical edges), as well as RSE instances. All definitions coincide with those given in the general literature.

Definition 1

(Betti numbers) Let $G = (V, E)$ be a graph. The zeroth Betti number ($b_0$) is the number of connected components in G. The first Betti number ($b_1$) equals $|E| - |V| + b_0$, usually called cyclomatic number.

Remark 1

(Betti numbers in connected graphs) Let $G = (V, E)$ be a connected graph. Then, $b_0 = 1$ and $b_1 \geq 0$. Since G is connected, $|E| \geq |V| - 1$, therefore $b_1 = |E| - |V| + 1 \geq 0$.
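For a finite graph, both Betti numbers can be computed directly from the definition. The sketch below is a minimal illustration; the function name `betti_numbers` and the union-find implementation are choices made here, not part of the original text.

```python
def betti_numbers(vertices, edges):
    """Return (b0, b1) for an undirected graph without repeated edges.

    b0 is the number of connected components and
    b1 = |E| - |V| + b0 is the cyclomatic number.
    """
    vertices = list(vertices)
    edges = list(edges)
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the component representative,
        # shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, w in edges:
        parent[find(u)] = find(w)

    b0 = len({find(v) for v in vertices})
    b1 = len(edges) - len(vertices) + b0
    return b0, b1


# A 4-cycle is connected and has one independent cycle.
print(betti_numbers("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]))
```

For a connected graph the first component of the result is 1 and the second is non-negative, in line with Remark 1.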

Definition 2

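(Critical nodes and edges) Let $G = (V, E)$ be a graph and $f: V \cup E \to \mathbb{R}$ a function. We say that a vertex $v \in V$ is critical if, for all edges e containing v, we have that $f(e) > f(v)$. We say that an edge $e = \{v_1, v_2\} \in E$ is critical if $f(e) > \max\{f(v_1), f(v_2)\}$. We denote $m_0$ the number of critical vertices and $m_1$ the number of critical edges.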

Definition 3

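(RSE instance) Consider a fitness landscape represented by $F: \{0,1\}^n \to \mathbb{R}$, where n is the length of the genotype. An instance of RSE is a collection of four different sequences $s_{00}, s_{01}, s_{10}, s_{11} \in \{0,1\}^n$ such that both sequences $s_{01}$ and $s_{10}$ are one single mutation away from $s_{00}$ and $s_{11}$ and it holds that

$$\max\{F(s_{01}), F(s_{10})\} \;<\; \min\{F(s_{00}), F(s_{11})\}.$$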

Proof

Proof of Theorem 1

Let a fitness landscape be represented by a function $F: \{0,1\}^n \to \mathbb{R}$, where n is the length of the genotype. Our proof consists of the following steps:

1. Define a graph.
2. Show that this graph is connected.
3. Define a function on the graph.
4. Apply the strong Morse inequality.

During the proof, we will instantiate our constructions in the following example.

Example 1

(Fitness landscape in a cube) Consider $n = 3$ and a fitness function $F$ on the cube $\{0,1\}^3$. A representation is given in Fig. 3.

Fig. 3

Fitness landscape in a cube. Vertices represent genotypes; edges connect genotypes at one-mutation distance; filled vertices represent peaks; and fitness is indicated in each vertex

Consider a graph $G = (V, E)$. Let $V = \{0,1\}^n$. Let the set of edges $E = E_1 \cup E_2$ be defined in two steps: $E_1$ and $E_2$ have edges involving only sequences at Hamming distance one and two, respectively. The set $E_1$ contains only edges connecting a sequence with one of its fittest beneficial mutations, if one exists (peaks have no beneficial mutation). Formally, for every $v \in V$ that is not a peak, $E_1$ contains one edge $\{v, w\}$ such that

$$d(v, w) = 1 \quad \text{and} \quad F(w) = \max\{F(u) : d(u, v) = 1\} > F(v),$$

where $F$ denotes the fitness and $d$ denotes the Hamming distance. On the other hand, $E_2$ contains edges that connect the two highest points separated by an instance of RSE. Formally,

$$E_2 = \big\{\{v, w\} : d(v, w) = 2 \ \text{and} \ \min\{F(v), F(w)\} > \max\{F(u) : d(u, v) = d(u, w) = 1\}\big\}.$$

Following Example 1, we represent the corresponding graph in Fig. 4. Note that if a vertex has multiple neighbors with maximal fitness, one may choose one arbitrarily. Edges in $E_1$ (resp. $E_2$) are represented by solid (resp. dashed) lines.

Fig. 4

Graph in a cube. Vertices represent genotypes; filled edges connect sequences at one mutation distance; dashed edges connect sequences of high fitness in a RSE instance; and filled vertices represent peaks

We now prove that G is connected, and therefore its zeroth Betti number ($b_0$, the number of connected components) is one. First, note that any vertex is connected to a peak. Indeed, from any vertex, by following the path of fittest mutations, we can go to a peak by edges in $E_1$. Therefore, we only need to prove that all peaks are connected.

By contradiction, assume that there are $k \geq 2$ connected components $C_1, \dots, C_k$ of G. Note that in each component there might be multiple peaks. We will define a special path of single mutations. Consider the “usual” sequence graph, formally the hypercube $H = (V, E_H)$, where $E_H$ contains all edges connecting sequences at Hamming distance one. Take the path $p^*$ in $H$ that connects two peaks in different components and has the highest minimum value, i.e.,

$$p^* \in \arg\max_{p} \; \min_{v \in p} F(v),$$

where the maximum is taken over all paths $p$ in $H$ connecting two peaks in different components. Without loss of generality, assume that $p^*$ connects $C_1$ and $C_2$. We will show that the two connected components $C_1$ and $C_2$ are in fact connected, which is a contradiction.

Denote $v^*$ the vertex in $p^*$ that achieves the minimum fitness. Divide $p^*$ into the path before and after $v^*$, formally: $p^* = p_1 \, v^* \, p_2$. Our first observation is the following: all vertices in $p_1$ are in $C_1$, i.e., $p_1 \subseteq C_1$, and similarly all vertices in $p_2$ are in $C_2$. Indeed, if it was not the case, consider $u \in p_1$ with $u \notin C_1$. Since $u \notin C_1$, by following the fittest mutation, it is connected to a peak $q$ which is not in $C_1$. Consider a new path $p'$ that goes from the initial peak of $p^*$ to $u$ and then to $q$. Note that the minimum fitness value in $p'$ is higher than the one in $p^*$ and $p'$ also connects two different connected components, which is a contradiction. Therefore, $p_1 \subseteq C_1$. Similarly, we get that $p_2 \subseteq C_2$.

Having identified this property of $p_1$ and $p_2$, we can construct a path in our graph of interest G, instead of $H$. Denote $v_1$ the vertex in $p_1$ closest to $v^*$, similarly denote $v_2$ the vertex in $p_2$ closest to $v^*$. First notice that $\{v_1, v_2\} \in E_2$, i.e., there is a reciprocal sign epistatic interaction between vertices with high fitness. Indeed, if this were not the case, we could connect them through another mutation that does not involve $v^*$ and create a path with a higher minimum value, which is a contradiction. Since $\{v_1, v_2\} \in E_2$, i.e., $v_1$ and $v_2$ are connected in G, all we need to do to construct our desired path connecting $C_1$ and $C_2$ is showing that $v_1$ is connected to a peak in $C_1$ and similarly $v_2$ is connected to a peak in $C_2$. Since $v_1 \in C_1$, we can follow the fittest mutation path until a peak $r_1$ (similarly for $v_2$ to a peak $r_2$). Consider the paths $\pi_1$ and $\pi_2$, where $\pi_1$ goes from $r_1$ to $v_1$ and $\pi_2$ goes from $v_2$ to $r_2$. By definition of $E_1$, we have that $\pi_1, \pi_2 \subseteq E_1$. Finally, the path connecting two different peaks (assumed to be disconnected) is $\pi = \pi_1 \, \{v_1, v_2\} \, \pi_2$. Indeed, note that $\pi$ is a path in $G$ since $\pi_1 \subseteq E_1$, $\{v_1, v_2\} \in E_2$ and $\pi_2 \subseteq E_1$. Therefore, the peaks $r_1$ and $r_2$ are connected. But this is a contradiction because $C_1$ and $C_2$ were two different connected components. This concludes the proof that G is connected.

Consider the function $f: V \cup E \to \mathbb{R}$ given by the following. Following Example 1, we represent the corresponding function in Fig. 5. Note that edges have values and we have chosen .
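The connectivity argument above relies on a path of single mutations whose minimum fitness is as high as possible. One way such a path could be computed on a fully known landscape is a bottleneck variant of Dijkstra's algorithm, sketched below. The function name `widest_path`, the dictionary representation of the landscape and the example fitness values are hypothetical and are not taken from the paper.

```python
import heapq


def neighbors(g):
    """All genotypes at Hamming distance one from the tuple g."""
    for i in range(len(g)):
        yield g[:i] + (1 - g[i],) + g[i + 1:]


def widest_path(F, start, goal):
    """Single-mutation path from start to goal with the highest minimum fitness.

    Returns the path (list of genotypes) and its minimum fitness.
    """
    best = {start: F[start]}   # best minimum fitness found so far per genotype
    prev = {}
    heap = [(-F[start], start)]
    while heap:
        neg_value, g = heapq.heappop(heap)
        value = -neg_value
        if value < best[g]:
            continue           # stale heap entry
        if g == goal:
            break
        for h in neighbors(g):
            candidate = min(value, F[h])
            if candidate > best.get(h, float("-inf")):
                best[h] = candidate
                prev[h] = g
                heapq.heappush(heap, (-candidate, h))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], best[goal]


# Hypothetical cube landscape with peaks at 000 and 110: the direct routes
# through 100 or 010 dip to fitness 1, a detour through the third locus does not.
F = {(0, 0, 0): 10, (1, 1, 0): 10, (1, 0, 0): 1, (0, 1, 0): 1,
     (0, 0, 1): 6, (1, 0, 1): 7, (1, 1, 1): 8, (0, 1, 1): 4}
print(widest_path(F, (0, 0, 0), (1, 1, 0)))
```

In this hypothetical landscape the lowest point of the returned path is the genotype that plays the role of the minimum-fitness vertex in the argument.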

Fig. 5

Function in the graph. Vertices represent genotypes; filled edges connect sequences at one mutation distance; dashed edges connect sequences of high fitness in a RSE instance; edges and vertices are labelled with the value of the defined function; and filled vertices represent peaks

For all , For all , For all , where .

By Theorem 2, we have that

$$m_1 - m_0 \;\geq\; b_1 - b_0.$$

Since G is connected, we have that $b_0 = 1$. By definition of Betti numbers, and since G is connected, $b_1 \geq 0$ (see Remark 1). The number of critical vertices is $m_0 = \#\{\text{peaks}\}$ and the number of critical edges is $m_1 = |E_2|$. By construction, the only critical vertices are peaks. Indeed, a vertex is critical if all edges containing it have greater value. Since all single mutations and edges in $E_2$ have greater values than the peak value, all peaks are critical. Moreover, non-peak vertices have a beneficial mutation and therefore an edge in $E_1$ whose value is not greater than their own. Therefore, non-peak vertices are not critical. On the other hand, the only critical edges are those in $E_2$, i.e., edges that represent reciprocal sign epistatic interaction, since these edges are the only ones whose endpoints have both smaller values. Therefore,

$$\#\{\text{reciprocal sign epistatic interactions}\} = |E_2| = m_1 \;\geq\; m_0 + b_1 - b_0 \;\geq\; \#\{\text{peaks}\} - 1.$$
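The four steps of the proof can be traced numerically on small landscapes. The sketch below is an illustration under explicit assumptions and is not the authors' construction: the edge values in `morse_function` (the negated fitness on vertices, the larger endpoint value on edges of the first kind, and the larger endpoint value plus a constant `eps` on RSE edges) are one hypothetical choice that makes exactly the peaks and the RSE edges critical. All function names are likewise hypothetical, and fitness values are assumed to be pairwise distinct.

```python
import itertools


def neighbors(g):
    """All genotypes at Hamming distance one from the tuple g."""
    for i in range(len(g)):
        yield g[:i] + (1 - g[i],) + g[i + 1:]


def build_graph(F, n):
    """Return (E1, E2): fittest-beneficial-mutation edges and RSE edges."""
    E1, E2 = set(), set()
    for g in F:
        fittest = max(neighbors(g), key=F.get)
        if F[fittest] > F[g]:                      # g is not a peak
            E1.add(frozenset((g, fittest)))
        for i, j in itertools.combinations(range(n), 2):
            flip_i = g[:i] + (1 - g[i],) + g[i + 1:]
            flip_j = g[:j] + (1 - g[j],) + g[j + 1:]
            opposite = flip_i[:j] + (1 - g[j],) + flip_i[j + 1:]
            if min(F[g], F[opposite]) > max(F[flip_i], F[flip_j]):
                E2.add(frozenset((g, opposite)))   # fitter diagonal of the square
    return E1, E2


def morse_function(F, E1, E2, eps=1.0):
    """Hypothetical edge and vertex values (an assumption of this sketch)."""
    f = {g: -F[g] for g in F}
    for e in E1:
        f[e] = max(f[v] for v in e)        # value of the less fit endpoint
    for e in E2:
        f[e] = max(f[v] for v in e) + eps  # strictly above both endpoints
    return f


def critical_counts(F, edges, f):
    """Return (m0, m1): numbers of critical vertices and critical edges."""
    incident = {g: [] for g in F}
    for e in edges:
        for v in e:
            incident[v].append(e)
    m0 = sum(all(f[e] > f[v] for e in incident[v]) for v in F)
    m1 = sum(all(f[e] > f[v] for v in e) for e in edges)
    return m0, m1


def count_components(vertices, edges):
    """Number of connected components, via union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for e in edges:
        u, w = tuple(e)
        parent[find(u)] = find(w)
    return len({find(v) for v in vertices})


def analyse(F, n):
    """Return (m0, m1, b0, b1) for the graph and function built from F."""
    E1, E2 = build_graph(F, n)
    edges = E1 | E2
    f = morse_function(F, E1, E2)
    m0, m1 = critical_counts(F, edges, f)
    b0 = count_components(F, edges)
    b1 = len(edges) - len(F) + b0
    return m0, m1, b0, b1


# The two-locus RSE example from the Introduction.
print(analyse({(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1}, 2))
```

With these assumed values, the number of critical vertices should coincide with the number of peaks and the number of critical edges with the number of RSE instances, so the final inequality can be read off the returned tuple.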

Discussion

We have shown that a multipeaked fitness landscape necessarily has no fewer pairwise reciprocal sign epistatic interactions than the number of fitness peaks minus one. This extends the result of Poelwijk et al. (2011) stating that reciprocal sign epistasis is a necessary condition for multiple peaks. Additionally, our study showcases the application of discrete Morse theory to fitness landscapes.

While our paper was in review, a different way to prove the same result was posted by Chenette et al. (preprint), providing extra confidence in this result. The main difference in our approaches is that we based our proof on Morse theory, while the proof by Chenette et al. relies on explicit constructions of fitness landscapes. In that work, the authors also prove that multipeaked fitness landscapes must have at least as many reciprocal sign epistatic interactions as the number of fitness peaks minus one. They also studied how many instances of RSE can exist when there is only one fitness peak. Moreover, having upper and lower bounds on the number of instances of RSE given a certain number of fitness peaks, they studied whether all numbers in between the bounds can be realized by some fitness landscape.

More empirically characterized fitness landscapes are becoming available, driven by high-throughput mutational scan studies. One straightforward way to analyze them is to determine the number of fitness peaks in the landscape. Our results may allow biologists to deduce the minimum number of reciprocal sign epistatic interactions in their data based on the number of observed fitness peaks. As discussed in Poelwijk et al. (2011), in the general case, reciprocal sign epistasis is not a sufficient condition for multiple peaks. Similarly, we do not show how to estimate the number of peaks from the number of RSE instances. The task of deducing global properties of the landscape from its local properties was accomplished in Crona et al. (2013): they showed that reciprocal sign epistasis is a sufficient condition for the existence of multiple peaks if there is no sign epistasis in any other pair of loci.

In our proof, we considered bi-allelic genotypes, which may not reflect the biology of DNA or protein sequences. The application of our theory to genotypes with more than two alleles depends on how epistasis is defined for such genotypes. Epistatic interactions may be found between alleles at different loci, which may lead to instances where some allele states between two sites are in an epistatic interaction, while two different allele states at the same sites may not show any epistasis. Alternatively, epistasis may be defined when there is an epistatic interaction between any alleles at two different sites. Such epistasis may be called allelic and locus epistasis, respectively. See Fig. 6 for an illustration. Our proof is immediately generalizable without modifications for fitness landscapes determined by allelic epistasis but not necessarily to landscapes determined by locus epistasis. We believe that allelic epistasis is biologically more realistic than locus epistasis and, therefore, our proof is relevant for fitness landscapes determined by DNA or protein sequences.

Fig. 6

Reciprocal sign epistasis with more than two alleles. Vertices represent genotypes; their label is their genotype sequence; filled vertices represent peaks; and arrows represent fitness increments

For our proof we assumed that the fitnesses of all genotypes are different, while empirically some fitness landscapes may be “neutral” in that many genotypes may have the same fitness, as has been observed in some empirical landscapes (Aguilar-Rodríguez et al. 2017; Schaper et al. 2012). However, the difference between fitnesses in our model can be arbitrarily small, many orders of magnitude smaller than the experimental error. Therefore, the difference in fitnesses we introduce in our proof does not impact the application of our results to empirically characterized fitness landscapes. Generally speaking, reciprocal sign epistasis is value-agnostic, in that any magnitude of the effect is taken into account as long as the sign of the effect changes. Therefore, the small variance in the values of fitnesses that we introduced does not influence the detection of sign epistasis in real data.

The complication of deducing the global properties of fitness landscapes from the local properties of epistasis between specific sites arises due to the multidimensionality of the fitness landscape: local peaks formed by a pairwise epistatic interaction can be bypassed through a different dimension. Therefore, the condition formulated in terms of the pairwise epistatic interaction cannot be sufficient. One needs to know the full fitness landscape: to deduce that the fitness landscape has multiple peaks, one has to know that there is no sign epistasis in any other pairwise interaction (Crona et al. 2013). For a quantitative result converse to ours, we anticipate that higher-order epistatic interactions have to be considered, which leads to the requirement of full information about the fitness landscape. We expect that this result can be obtained with a suitable definition of the higher-order epistasis. Such a result could be useful, for example, to study the empirical fitness landscapes if the number of mutations under consideration is small enough to make an almost complete description of the landscape feasible.

23 in total

1. Epistasis can lead to fragmented neutral spaces and contingency in evolution.

Authors: Steffen Schaper; Iain G Johnston; Ard A Louis
Journal: Proc Biol Sci Date: 2011-12-07 Impact factor: 5.349

Review 2. Perspective: Sign epistasis and genetic constraint on evolutionary trajectories.

Authors: Daniel M Weinreich; Richard A Watson; Lin Chao
Journal: Evolution Date: 2005-06 Impact factor: 3.694

Review 3. Empirical fitness landscapes and the predictability of evolution.

Authors: J Arjan G M de Visser; Joachim Krug
Journal: Nat Rev Genet Date: 2014-06-10 Impact factor: 53.242

4. A thousand empirical adaptive landscapes and their navigability.

Authors: José Aguilar-Rodríguez; Joshua L Payne; Andreas Wagner
Journal: Nat Ecol Evol Date: 2017-01-23 Impact factor: 15.460

5. Deep diversification of an AAV capsid protein by machine learning.

Authors: Drew H Bryant; Ali Bashir; Sam Sinai; Nina K Jain; Pierce J Ogden; Patrick F Riley; George M Church; Lucy J Colwell; Eric D Kelsic
Journal: Nat Biotechnol Date: 2021-02-11 Impact factor: 54.908

Review 6. Exploring protein fitness landscapes by directed evolution.

Authors: Philip A Romero; Frances H Arnold
Journal: Nat Rev Mol Cell Biol Date: 2009-12 Impact factor: 94.444

7. Global fitness landscapes of the Shine-Dalgarno sequence.

Authors: Syue-Ting Kuo; Ruey-Lin Jahn; Yuan-Ju Cheng; Yi-Lan Chen; Yun-Ju Lee; Florian Hollfelder; Jin-Der Wen; Hsin-Hung David Chou
Journal: Genome Res Date: 2020-05-18 Impact factor: 9.043

8. An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape.

Authors: Victoria O Pokusaeva; Dinara R Usmanova; Ekaterina V Putintseva; Lorena Espinar; Karen S Sarkisyan; Alexander S Mishin; Natalya S Bogatyreva; Dmitry N Ivankov; Arseniy V Akopyan; Sergey Ya Avvakumov; Inna S Povolotskaya; Guillaume J Filion; Lucas B Carey; Fyodor A Kondrashov
Journal: PLoS Genet Date: 2019-04-10 Impact factor: 5.917

9. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein.

Authors: Daniel Melamed; David L Young; Caitlin E Gamble; Christina R Miller; Stanley Fields
Journal: RNA Date: 2013-09-24 Impact factor: 4.942