Literature DB >> 29712824

Green function of correlated genes in a minimal mechanical model of protein evolution.

Sandipan Dutta1, Jean-Pierre Eckmann2, Albert Libchaber3, Tsvi Tlusty4,5.   

Abstract

The function of proteins arises from cooperative interactions and rearrangements of their amino acids, which exhibit large-scale dynamical modes. Long-range correlations have also been revealed in protein sequences, and this has motivated the search for physical links between the observed genetic and dynamic cooperativity. We outline here a simplified theory of protein, which relates sequence correlations to physical interactions and to the emergence of mechanical function. Our protein is modeled as a strongly coupled amino acid network with interactions and motions that are captured by the mechanical propagator, the Green function. The propagator describes how the gene determines the connectivity of the amino acids and thereby, the transmission of forces. Mutations introduce localized perturbations to the propagator that scatter the force field. The emergence of function is manifested by a topological transition when a band of such perturbations divides the protein into subdomains. We find that epistasis-the interaction among mutations in the gene-is related to the nonlinearity of the Green function, which can be interpreted as a sum over multiple scattering paths. We apply this mechanical framework to simulations of protein evolution and observe long-range epistasis, which facilitates collective functional modes.
Copyright © 2018 the Author(s). Published by PNAS.

Entities:  

Keywords:  Green function; dimensional reduction; epistasis; genotype-to-phenotype map; protein evolution

Mesh:

Substances:

Year:  2018        PMID: 29712824      PMCID: PMC5960285          DOI: 10.1073/pnas.1716215115

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


A common physical basis for the diverse biological functions of proteins is the emergence of collective patterns of forces and coordinated displacements of their amino acids (1–13). In particular, the mechanisms of allostery (14–18) and induced fit (19) often involve global conformational changes by hinge-like rotations, twists, or shear-like sliding of protein subdomains (20–22). An approach to examine the link between function and motion is to model proteins as elastic networks (23–26). Decomposing the dynamics of the network into normal modes revealed that low-frequency “soft” modes capture functionally relevant large-scale motion (27–30), especially in allosteric proteins (31–33). Recent works associate these soft modes with the emergence of weakly connected regions in the protein (Fig. 1 )—“cracks,” “shear bands,” or “channels” (21, 22, 34–36)—that enable viscoelastic motion (37, 38). Such patterns of “floppy” modes (39–42) emerge in models of allosteric proteins (36, 43–45) and networks (46–48).
Fig. 1.

Protein as an evolving machine and propagation of mechanical forces. (A) Formation of a softer shear band (red) separating the protein into two rigid subdomains (light blue). When a ligand binds, the biochemical function involves a low-energy hinge-like or shear motion (arrows). (B) Shear band and large-scale motion in a real protein: the arrows show the displacement of all amino acids in human glucokinase when it binds glucose (Protein Data Bank ID codes 1v4s and 1v4t). The coloring shows a high-shear region (red) separating two low-shear domains that move as rigid bodies (shear calculated as in refs. 21 and 36). (C) The mechanical model. The protein is made of two species of amino acids, polar (; red) and hydrophobic (; blue), with a sequence that is encoded in a gene. Each amino acid forms weak or strong bonds with its near neighbors (Right) according to the interaction rule in the table (Left). (D) The protein is made of amino acids with positions that are randomized from a regular triangular lattice. Strong bonds are shown as gray lines. Evolution begins from a random configuration (Left) and evolves by mutating one amino acid at each step, switching between and . The fitness is the mechanical response to a localized force probe (pinch) (2). After mutations (Center; intermediate stage), the evolution reaches a solution (Right). The green arrows show the mechanical response: a hinge-like, low-energy motion with a shear band starting at the probe and traversing the protein, qualitatively similar to B. L, left; R, right.

Protein as an evolving machine and propagation of mechanical forces. (A) Formation of a softer shear band (red) separating the protein into two rigid subdomains (light blue). When a ligand binds, the biochemical function involves a low-energy hinge-like or shear motion (arrows). (B) Shear band and large-scale motion in a real protein: the arrows show the displacement of all amino acids in human glucokinase when it binds glucose (Protein Data Bank ID codes 1v4s and 1v4t). The coloring shows a high-shear region (red) separating two low-shear domains that move as rigid bodies (shear calculated as in refs. 21 and 36). (C) The mechanical model. The protein is made of two species of amino acids, polar (; red) and hydrophobic (; blue), with a sequence that is encoded in a gene. Each amino acid forms weak or strong bonds with its near neighbors (Right) according to the interaction rule in the table (Left). (D) The protein is made of amino acids with positions that are randomized from a regular triangular lattice. Strong bonds are shown as gray lines. Evolution begins from a random configuration (Left) and evolves by mutating one amino acid at each step, switching between and . The fitness is the mechanical response to a localized force probe (pinch) (2). After mutations (Center; intermediate stage), the evolution reaches a solution (Right). The green arrows show the mechanical response: a hinge-like, low-energy motion with a shear band starting at the probe and traversing the protein, qualitatively similar to B. L, left; R, right. Like their dynamic phenotypes, proteins’ genotypes are remarkably collective. When aligned, sequences of protein families show long-range correlations among the amino acids (49–61). The correlations indicate epistasis, the interaction among mutations that takes place among residues linked by physical forces or common function. By inducing nonlinear effects, epistasis shapes the protein’s fitness landscape (62–68). Provided with sufficiently large data, analysis of sequence variation can predict the 3D structure of proteins (50–52), allosteric pathways (53–55), epistatic interactions (56, 57), and coevolving subsets of amino acids (58–60, 69). Still, the mapping between sequence correlation and collective dynamics—and in particular, the underlying epistasis—is not fully understood. Experiments and simulations provide valuable information on protein dynamics, and extensive sequencing accumulates databases required for reliable analysis; however, there remain inherent challenges: the complexity of the physical interactions and the sparsity of the data. The genotype-to-phenotype map of proteins connects spaces of huge dimension, which are hard to sample, even by high-throughput experiments or natural evolution (70–72). A complementary approach is the application of simplified coarse-grained models, such as lattice proteins (73–75) or elastic networks (24), which allow one to extensively survey the map and examine basic questions of protein evolution. Such models have been recently used to study allosteric proteins (35, 36, 43–45) and in networks (46–48). Our aim here is different: to construct a simplified model of how the collective dynamics of functional proteins directs their evolution and in particular, to give a mechanical interpretation of epistasis. This paper introduces a coarse-grained theory that treats protein as an evolving amino acid network with topology that is encoded in the gene. Mutations that substitute one amino acid with another tweak the interactions, allowing the network to evolve toward a specific mechanical function: in response to a localized force, the protein will undergo a large-scale conformational change (Fig. 1 ). We show that the application of a Green function (76, 77) is a natural way to understand the protein’s collective dynamics. The Green function measures how the protein responds to a localized impulse via propagation of forces and motion. The propagation of mechanical response across the protein defines its fitness and directs the evolutionary search. Thus, the Green function explicitly defines the map: gene amino acid network protein dynamics function. We use this map to examine the effects of mutations and epistasis. A mutation perturbs the Green function and scatters the propagation of force through the protein (Fig. 2). We quantify epistasis in terms of “multiple scattering” pathways. These indirect physical interactions appear as long-range correlations in the coevolving genes.
Fig. 2.

Force propagation, mutations, and epistasis. (A) The Green function measures the propagation of the mechanical signal, depicted as a “diffraction wave,” across the protein (blue) from the force source (pinch) to the response site . (B) A mutation deflects the propagation of force. The effect of the mutation on the propagator can be described as a series of multiple scattering paths (6). (C) The epistasis between two mutations, and , is equivalent to a series of multiple scattering paths (9).

Force propagation, mutations, and epistasis. (A) The Green function measures the propagation of the mechanical signal, depicted as a “diffraction wave,” across the protein (blue) from the force source (pinch) to the response site . (B) A mutation deflects the propagation of force. The effect of the mutation on the propagator can be described as a series of multiple scattering paths (6). (C) The epistasis between two mutations, and , is equivalent to a series of multiple scattering paths (9). Using a Metropolis-type evolution algorithm, solutions are quickly found, typically after steps. Mutations add localized perturbations to the amino acid network, which are eventually arranged by evolution into a continuous shear band. Protein function is signaled by a topological transition, which occurs when a shearable band of weakly connected amino acids separates the protein into rigid subdomains. The set of solutions is sparse: there is a huge reduction of dimension between the space of genes to the spaces of force and displacement fields. We find a tight correspondence between correlations in the genotype and phenotype. Owing to its mechanical origin, epistasis becomes long ranged along the high-shear region of the channel.

Model: Protein as an Evolving Machine

The Amino Acid Network and Its Green Function.

We use a coarse-grained description in terms of an elastic network (23–27, 39) with connectivity and interactions that are encoded in a gene (Fig. 1 ). Similar vector elasticity models were considered in refs. 35 and 36 (app. B3 therein). The protein is a chain of amino acids: () folded into a 2D hexagonal lattice (). We follow the HP model (73, 74) with its two species of amino acids, hydrophobic () and polar (). The amino acid chain is encoded in a gene , a sequence of binary codons, where encodes an amino acid and encodes a amino acid. We consider a constant fold, and therefore, any particular codon in the gene encodes an amino acid at a certain constant position in the protein. The positions are randomized to make the network amorphous. These dfs are stored in a vector . Except the ones at the boundaries, every amino acid is connected by harmonic springs to nearest and next nearest neighbors. There are two flavors of bonds according to the chemical interaction, which is defined as an gate: a strong bond and weak and bonds. The strength of the bonds determines the mechanical response of the network to a displacement field , when the amino acids are displaced as . The response is captured by Hooke’s law that gives the force field induced by a displacement field, The analogue of the spring constant is the Hamiltonian , a matrix, which records the connectivity of the network and the strength of the bonds. is a nonlinear function of the gene , reflecting the amino acid interaction rules of Fig. 1 (Eq. , ). Evolution searches for a protein that will respond by a prescribed large-scale motion to a given localized force (“pinch”). In induced fit, for example, specific binding of a substrate should induce global deformation of an enzyme. The response is determined by the Green function (76): is the mechanical propagator that measures the transmission of signals from the force source across the protein (Fig. 2). Eq. constitutes an explicit genotype-to-phenotype map from the genotype to the mechanical phenotype : . This reflects the dual nature of the Green function : in the phenotype space, it is the linear mechanical propagator that turns a force into motion, , whereas it is also the nonlinear function that maps the gene into a propagator, . When the protein is moved as a rigid body, the lengths of the bonds do not change, and the elastic energy cost vanishes. A 2D protein has such zero modes (Galilean symmetries), two translations, and one rotation, and is, therefore, always singular. Hence, Hooke’s law and [] imply that is the pseudoinverse of the Hamiltonian, (78, 79), which amounts to inversion of in the nonsingular subspace of the nonzero modes (). A related quantity is the resolvent, , with poles at the energy levels of , . The fitness function rewards strong mechanical response to a localized probe (pinch in Fig. 1): a force dipole at two neighboring amino acids and on the left side of the protein (L in Fig. 1), . The prescribed motion is specified by a displacement vector , with a dipolar response, , on the right side of the protein (R in Fig. 1). The protein is fitter if the pinch produces a large deformation in the direction specified by . To this end, we evolve the amino acid network to increase a fitness function , which is the projection of the displacement on the prescribed response :Eq. defines the fitness landscape . Here, we examine particular examples for a localized pinch and prescribed response , which drive the emergence of a hinge-like mode. This approach is general and can as well treat more complex patterns of force and motion.

Evolution Searches in the Mechanical Fitness Landscape.

Our simulations search for a prescribed response induced by a force applied at a specific site on the left side (pinch). The prescribed dipolar response may occur at any of the sites on the right side. This gives rise to a wider shear band that allows the protein to perform general mechanical tasks (unlike specific allostery tasks of communicating between specified sites on L and R). We define the fitness as the maximum of [] over all potential locations of the channel’s output (typically 8–10 sites) (). The protein is evolved via a point mutation process where, at each step, we flip a randomly selected codon between zero and one. This corresponds to exchanging and at a random position in the protein, thereby changing the bond pattern and the elastic response by softening or stiffening the amino acid network. Evolution starts from a random protein configuration encoded in a random gene. Typically, we take a small fraction of amino acid of type (about ) randomly positioned within a majority of (Fig. 1, Left). The high fraction of strong bonds renders the protein stiff and therefore, of low initial fitness . At each step, we calculate the change in the Green function (by a method explained below) and use it to evaluate from [] the fitness change :The fitness change determines the fate of the mutation: we accept the mutation if ; otherwise, the mutation is rejected. Since fitness is measured by the criterion of strong mechanical response, it induces softening of the amino acid network. The typical evolution trajectory lasts about steps. Most are neutral mutations () and deleterious ones (); the latter are rejected. About a dozen or so beneficial mutations () drive the protein toward the solution (Fig. 3). The increase in the fitness reflects the gradual formation of the channel, while the jump in the shear signals the emergence of the soft mode. The first few beneficial mutations tend to form weakly bonded -enriched regions near the pinch site on the left side and close to the right boundary of the protein. The following ones join these regions into a floppy channel (a shear band), which traverses the protein from left to right. We stop the simulation when the fitness reaches a large positive value . The corresponding gene encodes the functional protein. The ad hoc value signals slowing down of the fitness curve toward saturation at , as the channel has formed and now only continues to slightly widen. In this regime, even a tiny pinch will easily excite a large-scale motion with a distinct high-shear band (Fig. 1, Right).
Fig. 3.

The mechanical Green function and the emergence of protein function. (A) Progression of the fitness during the evolution run shown in Fig. 1 (black) together with the fitness trajectory averaged over runs (red). Shown are the last 16 beneficial mutations toward the formation of the channel. The contribution of the emergent low-energy mode (blue) dominates the fitness []. (B) Landscape of the fitness change [] averaged over solutions for all 200 possible positions of point mutations at a solution. Underneath, the average amino acid configuration of the protein is shown in shades of red () and blue (). In most sites, mutations are neutral, while mutations in the channel are deleterious on average. L, left. (C) The average magnitude of the two-codon correlation [] in the shear band (amino acids in rows 7–13; red) and in the whole protein (black) as a function of the number of beneficial mutations, . (Inset) Profile of the spatial correlation within the shear band (after beneficial mutations). (D) The mean shear in the protein in a single run (black) and averaged over solutions (red) as a function of the fraction of amino acids, . The values of are shifted by the position of the jump, . (Inset) Distribution of .

The mechanical Green function and the emergence of protein function. (A) Progression of the fitness during the evolution run shown in Fig. 1 (black) together with the fitness trajectory averaged over runs (red). Shown are the last 16 beneficial mutations toward the formation of the channel. The contribution of the emergent low-energy mode (blue) dominates the fitness []. (B) Landscape of the fitness change [] averaged over solutions for all 200 possible positions of point mutations at a solution. Underneath, the average amino acid configuration of the protein is shown in shades of red () and blue (). In most sites, mutations are neutral, while mutations in the channel are deleterious on average. L, left. (C) The average magnitude of the two-codon correlation [] in the shear band (amino acids in rows 7–13; red) and in the whole protein (black) as a function of the number of beneficial mutations, . (Inset) Profile of the spatial correlation within the shear band (after beneficial mutations). (D) The mean shear in the protein in a single run (black) and averaged over solutions (red) as a function of the fraction of amino acids, . The values of are shifted by the position of the jump, . (Inset) Distribution of .

Results

Mechanical Function Emerges at a Topological Transition.

The hallmark of evolution reaching a solution gene is the emergence of a new zero-energy mode, , in addition to the three Galilean symmetry modes. Near the solution, the energy of this mode almost closes the spectral gap, , and has a pole at . As a result, the emergent mode dominates the Green function, . The response to a pinch will be mostly through this soft mode, and the fitness [] will diverge asOn average, we find that the fitness increases exponentially with the number of beneficial mutations (Fig. 3). However, beneficial mutations are rare and are separated by long stretches of neutral mutations. This is evident from the fitness landscape (Fig. 3), which shows that, in most sites, the effect of mutations is practically neutral. The vanishing of the spectral gap, , manifests as a topological change in the system: the amino acid network is now divided into two domains that can move independently of each other at low energetic cost. The soft mode appears at a dynamical phase transition, where the average shear in the protein jumps abruptly as the channel is formed and the protein can easily deform in response to the force probe (Fig. 3). As the shear band is taking shape, the correlation among codons builds up. To see this, we align genes from the simulations in analogy to sequence alignment of real protein families (49–61), albeit without the phylogenetic correlation that hampers the analysis of real sequences. At each time step, we calculate the two-codon correlation between all pairs of codons and :where brackets denote ensemble averages. We find that most of the correlation is concentrated in the forming channel (Fig. 3), where it is 10-fold larger than in the whole protein. In the channel, there is significant long-range correlation shown in the spatial profile of the correlation (Fig. 3, Inset). Analogous regions of coevolving residues appear in real protein families (53–55, 58–60) as well as in coarse-grained models of protein allostery (35, 36, 43, 44) and allosteric networks (46, 47).

Point Mutations Are Localized Mechanical Perturbations.

A mutation may vary the strength of no more than bonds around the mutated amino acid (Fig. 2). The corresponding perturbation of the Hamiltonian is, therefore, localized, akin to a defect in a crystal (80, 81). The mechanics of mutations can be further explored by examining the perturbed Green function, , which obeys the Dyson equation (77, 82) ():The latter can be iterated into an infinite seriesThis series has a straightforward physical interpretation as a sum over multiple scatterings: as a result of the mutation, the elastic force field is no longer balanced by the imposed force , leaving a residual force field . The first scattering term in the series balances by the deformation . Similarly, the second scattering term accounts for further deformation induced by and so forth. In practice, we calculate the mutated Green function using the Woodbury formula [], which exploits the localized nature of the perturbation to accelerate the computation by a factor of ().

Epistasis Links Protein Mechanics to Genetic Correlations.

Our model provides a calculable definition of epistasis, the nonlinearity of the fitness effect of interacting mutations (Fig. 2). We take a functional protein obtained from the evolution algorithm and mutate an amino acid at a site . This mutation induces a change in the Green function (calculated by []) and hence, in the fitness function []. One can similarly perform another independent mutation at a site , producing a second deviation, and . Finally, starting again from the original solution, one mutates both and simultaneously, with a combined effect and . The epistasis measures the departure of the double mutation from additivity of two single mutations:To evaluate the average epistatic interaction among amino acids, we perform the double-mutation calculation for all solutions and take the ensemble average . Landscapes of show significant epistasis in the channel (Fig. 4). Amino acids outside the high-shear region show only small epistasis, since mutations in the rigid domains hardly change the elastic response. The epistasis landscapes (Fig. 4 ) are mostly positive, since the mutations in the channel interact antagonistically (83): after a strongly deleterious mutation, a second mutation has a smaller effect.
Fig. 4.

Mechanical epistasis. The epistasis [], averaged over solutions , between a fixed amino acid at position (black arrow) and all other positions . Here, is located at (A) the binding site, (B) the center of the channel, and (C) slightly off the channel. Underneath, the average amino acid configuration of the protein is drawn in shades of red () and blue (). Significant epistasis mostly occurs along the -rich channel, where mechanical interactions are long ranged. Although epistasis is predominantly positive, negative values also occur, mostly at the boundary of the channel (C). Landscapes are plotted for specific output site at right. L, left. (D) The two-codon correlation function [] measures the coupling between mutations at positions and []. The epistasis and the gene correlation show similar patterns. Axes are the positions of and loci. Significant correlations and epistasis occur mostly in and around the channel region (positions 70–130, rows 7–13).

Mechanical epistasis. The epistasis [], averaged over solutions , between a fixed amino acid at position (black arrow) and all other positions . Here, is located at (A) the binding site, (B) the center of the channel, and (C) slightly off the channel. Underneath, the average amino acid configuration of the protein is drawn in shades of red () and blue (). Significant epistasis mostly occurs along the -rich channel, where mechanical interactions are long ranged. Although epistasis is predominantly positive, negative values also occur, mostly at the boundary of the channel (C). Landscapes are plotted for specific output site at right. L, left. (D) The two-codon correlation function [] measures the coupling between mutations at positions and []. The epistasis and the gene correlation show similar patterns. Axes are the positions of and loci. Significant correlations and epistasis occur mostly in and around the channel region (positions 70–130, rows 7–13). Definition [] is a direct link between epistasis and protein mechanics: the nonlinearity (“curvature”) of the Green function measures the deviation of the mechanical response from additivity of the combined effect of isolated mutations at and , . The epistasis is simply the inner product value of this nonlinearity with the pinch and the response:Relation [] shows how epistasis originates from mechanical forces among mutated amino acids. In the gene, epistatic interactions are manifested in codon correlations (56, 57) shown in Fig. 4, which depicts two-codon correlations from the alignment of functional genes []. We find a tight correspondence between the mean epistasis and the codon correlations . Both patterns exhibit strong correlations in the channel region with a period equal to channel’s length: 10 amino acids. The similarity in the patterns of and indicates that a major contribution to the long-range correlations observed among aligned protein sequences stems from the mechanical interactions propagating through the amino acid network.

Epistasis as a Sum over Scattering Paths.

One can classify epistasis according to the interaction range. Neighboring amino acids exhibit contact epistasis (49–51), because two adjacent perturbations, and , interact nonlinearly via the gate of the interaction table (Fig. 1), (where is the perturbation by both mutations). The leading term in the Dyson series [] of is a single scattering from an effective perturbation with an energy , which yields the epistasisLong-range epistasis among nonadjacent, noninteracting perturbations () is observed along the channel (Fig. 4). In this case, [] expresses the nonlinearity as a sum over multiple scattering paths, which include both and (Fig. 2):The perturbation expansion directly links long-range epistasis to shear deformation: near the transition, the Green function is dominated by the soft mode, , with fitness given by []. From [] and [], we find a simple expression for the mechanical epistasis as a function of the shear:The factor in [] is the ratio of the change in the shear energy due to mutation at (the expectation value of ) and the energy of the soft mode, and it is similar for . Thus, and are significant only in and around the shear band, where the bonds varied by the perturbations are deformed by the soft mode. When both sites are outside the channel, , the epistasis [] is small, . It remains negligible even if one of the mutations, , is in the channel, , and . Epistasis can only be long ranged along the channel when both mutations are significant, , and . We conclude that epistasis is maximal when both sites are at the start or end of the channel as illustrated in Fig. 4. The nonlinearity of the fitness function gives rise to antagonistic epistasis.

Geometry of Fitness Landscape and Gene-to-Function Map.

With our mechanical evolution algorithm, we can swiftly explore the fitness landscape to examine its geometry. The genotype space is a D hypercube with vertices that are all possible genes . The phenotypes reside in a D space of all possible mechanical responses . The Green function provides the genotype-to-phenotype map []. A functional protein is encoded by a gene with fitness that exceeds a large threshold, , and the functional phenotype is dominated by the emergent zero-energy mode, (Fig. 3). We also characterize the phenotype by the shear field (). The singular value decomposition (SVD) of the solutions returns a set of eigenvectors with ordered eigenvalues that show their significance in capturing the data (). The SVD spectra reveal strong correspondence between the genotype and the phenotype, and (Fig. 5). In all three datasets, the largest eigenvalues are discrete and stand out from the bulk continuous spectrum. These are the collective dfs, which show loci in the gene and positions in the “flow” (i.e., displacement) and shear fields that tend to vary together.
Fig. 5.

From gene to mechanical function: spectra and dimensions. (A) The first four SVD eigenvectors (in the text) of the gene (Top), the displacement flow field (Middle), and the shear (Bottom). (B) Cross-sections through the set of solutions in the genotype space (Upper) and the phenotype space (Lower). Density of solutions is color coded. The genotype cross-section is the plane defined by the eigenvectors , and in the phenotype space, it is defined by the eigenvectors (in the text). The dimensional reduction is manifested by the discoid geometry of the phenotype cloud compared with the spheroid shape of the genotype cloud. (C) Genetic correlations show similarity to correlations in the shear field, (color coded in log scale). Corr, correlation.

From gene to mechanical function: spectra and dimensions. (A) The first four SVD eigenvectors (in the text) of the gene (Top), the displacement flow field (Middle), and the shear (Bottom). (B) Cross-sections through the set of solutions in the genotype space (Upper) and the phenotype space (Lower). Density of solutions is color coded. The genotype cross-section is the plane defined by the eigenvectors , and in the phenotype space, it is defined by the eigenvectors (in the text). The dimensional reduction is manifested by the discoid geometry of the phenotype cloud compared with the spheroid shape of the genotype cloud. (C) Genetic correlations show similarity to correlations in the shear field, (color coded in log scale). Corr, correlation. We examine the correspondence among three sets of eigenvectors: of the flow, of the gene, and of the shear. The first eigenvector of the flow, , is the hinge motion caused by the pinch, with two eddies rotating in opposite directions (Fig. 5). The next two modes, shear () and breathing (), also occur in real proteins, such as glucokinase (Fig. 1). The first eigenvectors of the shear and of the gene sequence show that the high-shear region is mirrored as a -rich region, where a mechanical signal may cause local rearrangement of the amino acids by deforming weak bonds. In the rest of the protein, the -rich regions move as rigid bodies with insignificant shear. The higher gene eigenvectors, (), capture patterns of correlated genetic variations. The striking similarity between the sequence correlation patterns and the shear eigenvectors shows a tight genotype-to-phenotype map, as is further shown in the likeness of the correlation matrices of the amino acid and shear flow (Fig. 5). In the phenotype space, we represent the displacement field in the SVD basis, (Fig. 5). Since of the data are explained by the first , we can compress the displacement field without much loss into the first 15 coordinates. This implies that the set of solutions is a D discoid, which is flat in most directions. In contrast, representation of the genes in the SVD frame of reference (with the basis) reveals that, in genotype space, the solution set is an incompressible D spheroid (Fig. 5). The dramatic dimensional reduction in mapping genotypes to phenotypes stems from the different constraints that shape them (36, 84–89).

Discussion

Theories of protein need to combine the many-body physics of the amino acid matter with the evolution of genetic information, which together, give rise to protein function. We introduced a simplified theory of protein, where the mapping between genotype and phenotype takes an explicit form in terms of mechanical propagators (Green functions), which can be efficiently calculated. As a functional phenotype, we take cooperative motion and force transmission through the protein []. This allows us to map genetic mutations to mechanical perturbations, which scatter the force field and deflect its propagation [ and ] (Fig. 2). The evolutionary process amounts to solving the inverse scattering problem: given prescribed functional modes, one looks for network configurations that yield this low end of the dynamical spectrum. Epistasis, the interaction among loci in the gene, corresponds to a sum over all multiple scattering trajectories or equivalently, the nonlinearity of the Green function [ and ]. We find that long-range epistasis signals the emergence of a collective functional mode in the protein. The results of this theory (in particular, the expressions for epistasis) follow from the basic geometry of the amino acid network and the localized mutations and are, therefore, applicable to general tasks and fitness functions with multiple inputs and responses.

Materials and Methods

The Mechanical Model of Protein.

We model the protein as an elastic network made of harmonic springs (23, 24, 39, 90). The connectivity of the network is described by a hexagonal lattice with vertices that are amino acids and edges that correspond to bonds. There are amino acids indexed by Roman letters and bonds indexed by Greek letters. We use the HP model (73) with two amino acid species, hydrophobic () and polar (). The amino acid chain is encoded in a gene , where encodes and encodes , . The degree of each amino acid is the number of amino acids to which it is connected by bonds. In our model, most amino acids have the maximal degree, which is , while amino acids at the boundary have fewer neighbors, (Fig. 1). The connectivity of the graph is recorded by the adjacency matrix , where if there is a bond from to and otherwise. The gradient operator relates the spaces of bonds and vertices (and is, therefore, of size ): if vertices and are connected by a bond , then and . As in the continuum case, the Laplace operator is the product . The nondiagonal elements are if and are connected and otherwise. The diagonal part of is the degree . Hence, we can write the Laplacian as , where is the diagonal matrix of the degrees . We embed the graph in Euclidean space () by assigning positions to each amino acid. We concatenate all positions in a vector of length . Finally, to each bond, we assign a spring with constant , which we keep in a diagonal matrix . The strength of the spring is determined by the rule of the HP model’s interaction table (Fig. 1), , where and are the codons of the amino acid connected by bond . This implies that a strong bond has , whereas the weak bonds and have . We usually take and . This determines a spring network. We also assume that the initial configuration is such that all springs are at their equilibrium length, disregarding the possibility of “internal stresses” (39), so that the initial elastic energy is . We define the “embedded” gradient operator (of size ), which is obtained by taking the graph gradient and multiplying each nonzero element by the corresponding direction vector . Thus, is a tensor (), which we store as a matrix ( is the bond connecting vertices and ). In each row vector of , which we denote as , there are only nonzero elements. To calculate the elastic response of the network, we deform it by applying a force field , which leads to the displacement of each vertex by to a new position (39). For small displacements, the linear response of the network is given by Hooke’s law, . The elastic energy is , and the Hamiltonian, , is the Hessian of the elastic energy , . By rescaling, , which amounts to scaling all distances by , we obtain . It follows that the Hamiltonian is a function of the gene , which has the structure of the Laplacian multiplied by the tensor product of the direction vectors. Each block () is a function of the codons and :The diagonal blocks complete the row and column sums to zero, .

The Inverse Problem: Green Function and Its Spectrum.

The Green function is defined by the inverse relation to Hooke’s law, []. If were invertible (nonsingular), would have been just . However, is always singular owing to the zero-energy (Galilean) modes of translation and rotation. Therefore, one needs to define as the Moore–Penrose pseudoinverse (78, 79), , on the complement of the space of Galilean transformations. The pseudoinverse can be understood in terms of the spectrum of . There are at least zero modes: translation modes and rotation modes. These modes are irrelevant and will be projected out of the calculation (note that these modes do not come from missing connectivity of the graph itself but from its embedding in ). is singular but is still diagonalizable (since it has a basis of dimension ), and it can be written as the spectral decomposition, , where is the set of eigenvalues and are the corresponding eigenvectors (note that denotes the index of the eigenvalue, while and denote amino acid positions). For a nonsingular matrix, one may calculate the inverse simply as . Since is singular, we leave out the zero modes and get the pseudoinverse , . It is easy to verify that, if is orthogonal to the zero modes, then . The pseudoinverse obeys the four requirements (78): (i) , (ii) , (iii) , and (iv) . In practice, as the projection commutes with the mutations, the pseudoinverse has most virtues of a proper inverse. The reader might prefer to link and through the heat kernel, . Then, and .

Pinching the Network.

A pinch is given as a localized force applied at the boundary of the “protein.” We usually apply the force on a pair of neighboring boundary vertices, and . It seems reasonable to apply a force dipole (i.e., two opposing forces ), since a net force will move the center of mass. This pinch is, therefore, specified by the force vector (of size ), with the only nonzero entries being . Hence, it has the same structure as a bond vector of a “pseudobond” connecting and A normal pinch has a force dipole directed along the line (the direction). Such a pinch is expected to induce a hinge motion. A shear pinch will be in a perpendicular direction and is expected to induce a shear motion. Evolution tunes the spring network to exhibit a low-energy mode, in which the protein is divided into two subdomains moving like rigid bodies. This large-scale mode can be detected by examining the relative motion of two neighboring vertices, and , at another location at the boundary (usually at the opposite side). Such a desired response at the other side of the protein is specified by a response vector , and the only nonzero entries correspond to the directions of the response at and . Again, we usually consider a “dipole” response .

Evolution and Mutation.

The quality of the response (i.e., the biological fitness) is specified by how well the response follows the prescribed one . In the context of our model, we chose the (scalar) observable as []. In an evolution simulation, one would exchange amino acids between and , while demanding that the fitness change is positive or nonnegative. By this, we mean is thanks to a beneficial mutation, whereas corresponds to a neutral one. Deleterious mutations are generally rejected. A version that accepts mildly deleterious mutations (a finite temperature Metropolis algorithm) gave similar results. We may impose a stricter minimum condition with a small positive , say . An alternative, stricter criterion would be the demand that each of the terms in , and , increases separately. The evolution is stopped when , which signals the formation of a shear band. When simulations ensue beyond , the band slightly widens, and the fitness slows down and converges at a maximal value, typically .

Evolving the Green Function Using the Dyson and Woodbury Formulas.

The Dyson formula follows from the identity , which is multiplied by on the left and on the right to yield []. The formula remains valid for the pseudoinverses in the nonsingular subspace. One can calculate the change in fitness by evaluating the effect of a mutation on the Green function, , and then examining the change, []. Using [] to calculate the mutated Green function is an impractical method, as it amounts to inverting at each step a large matrix. However, the mutation of an amino acid at has a localized effect. It may change only up to bonds among the bonds with the neighboring amino acids. Thanks to the localized nature of the mutation, the corresponding defect Hamiltonian is, therefore, of a small rank, , equal to the number of switched bonds (the average is about 9.3). can be decomposed into a product . The diagonal matrix records whether a bond is switched from weak to strong () or vice versa (), and is a matrix with columns that are the bond vectors for the switched bonds . This allows one to calculate changes in the Green function more efficiently using the Woodbury formula (91, 92): The two expressions for the mutation impact , [ and ], are equivalent, and one may get the scattering series of [] by a series expansion of the pseudoinverse in []. The practical advantage of [] is that the only (pseudo-)inversion that it requires is of a low-rank tensor (the term in parentheses). This accelerates our simulations by a factor of .

Pathologies and Broken Networks.

A network broken into disjoint components exhibits floppy modes owing to the low energies of the relative motion of the components with respect to each other. The evolutionary search might end up in such nonfunctional unintended modes. The common pathologies that we observed are (i) isolated nodes at the boundary that become weakly connected via mutations, (ii) “sideways” channels that terminate outside the target region (which typically includes around 8–10 sites), and (iii) channels that start and end at the target region without connecting to the binding site. All of these are some easy to understand floppy modes, which can vibrate independent of the location of the pinch and cause the response to diverge () without producing a functional mode. We avoid such pathologies by applying the pinch force to the protein network symmetrically: pinch the binding site on face left, look at responses on face right, and vice versa. Thereby, we not only look for the transmission of the pinch from the left to right but also, from right to left. The basic algorithm is modified to accept a mutation only if it does not weaken the two-way responses and enables hinge motion of the protein. This prevents the vibrations from being localized at isolated sites or unwanted channels.

Dimension and SVD.

To examine the geometry of the fitness landscape and the genotype-to-phenotype map, we looked at the correlation among numerous solutions, typically . Each solution is characterized by three vectors: (i) the gene of the functional protein, (a vector of length codons); (ii) the flow field (displacement), (a vector of length of and velocity components); and (iii) the shear field (a vector of length ). We compute the shear as the symmetrized derivative of the displacement field using the method in ref. 21. The value of the field is the sum of squares of the traceless part of the strain tensor (Frobenius norm). These three types of vectors are stored along the rows of three matrices , , and . We calculate the eigenvectors of these matrices, , , and , via SVD (as in ref. 36). The corresponding SVD eigenvalues are the square roots of the eigenvalues of the covariance matrix , while the eigenvectors are the same. In typical spectra, most eigenvalues reside in a continuum bulk region that resembles the spectra of random matrices. A few larger outliers, typically around a dozen or so, carry the nonrandom correlation information.

The Protein Backbone.

A question may arise as to what extent the protein’s backbone might affect the results described so far. Proteins are polypeptides, linear heteropolymers of amino acids, linked by covalent peptide bonds, which form the protein backbone. The peptide bonds are much stronger than the noncovalent interactions among the amino acids and do not change when the protein mutates. We, therefore, augmented our model with a “backbone”: a linear path of conserved strong bonds that passes once through all amino acids. We focused on two extreme cases: a serpentine backbone either parallel to the shear band or perpendicular to it (Fig. 6).
Fig. 6.

The effect of the backbone on evolution of mechanical function. The backbone induces long-range mechanical correlations, which influence protein evolution. We examine two configurations: parallel (A and B) and perpendicular (C and D) to the channel. (A and B) Parallel. (A) The backbone directs the formation of a narrow channel along the fold (compared with Fig. 5). (B) The first four SVD eigenvectors of the gene (Top), the flow (Middle), and the shear (Bottom). (C and D) Perpendicular. (C) The formation of the channel is “dispersed” by the backbone. (D) The first four SVD eigenvectors of (Top), (Middle), and shear (Bottom).

The effect of the backbone on evolution of mechanical function. The backbone induces long-range mechanical correlations, which influence protein evolution. We examine two configurations: parallel (A and B) and perpendicular (C and D) to the channel. (A and B) Parallel. (A) The backbone directs the formation of a narrow channel along the fold (compared with Fig. 5). (B) The first four SVD eigenvectors of the gene (Top), the flow (Middle), and the shear (Bottom). (C and D) Perpendicular. (C) The formation of the channel is “dispersed” by the backbone. (D) The first four SVD eigenvectors of (Top), (Middle), and shear (Bottom). The presence of the backbone does not interfere with the emergence of a low-energy mode of the protein with a flow pattern (i.e., displacement field) that is similar to the backboneless case with two eddies moving in a hinge-like fashion. In the parallel configuration, the backbone constrains the channel formation to progress along the fold (Fig. 6). The resulting channel is narrower than in the model without backbone (Figs. 1 and 5). In the perpendicular configuration, the evolutionary progression of the channel is much less oriented (Fig. 6). While the flow patterns are similar, closer inspection shows noticeable differences, as can be seen in the flow eigenvectors (Fig. 6 ). The shear eigenvectors represent the derivative of the flow and therefore, highlight more distinctly these differences. As for the correspondence between gene eigenvectors and shear eigenvectors , the backbone affects the shape of the channel in concert with the sequence correlations around it. Transmission of mechanical signals seems to be easier along the orientation of the fold (parallel configuration) (Fig. 6). Transmission across the fold (perpendicular configuration) necessitates significant deformation of the backbone and leads to “dispersion” of the signal at the output (Fig. 6). We propose that the shear band will be roughly oriented with the direction of the fold, but this requires further analysis of structural data. Overall, we conclude from our examination that the backbone adds certain features to patterns of the field and sequence correlation without changing the basic results of our model. The presence of the backbone might constrain the evolutionary search, but this has no significant effect on the fast convergence of the search and on the long-range correlations among solutions.
  71 in total

1.  Protein flexibility and dynamics using constraint theory.

Authors:  M F Thorpe; M Lei; A J Rader; D J Jacobs; L A Kuhn
Journal:  J Mol Graph Model       Date:  2001       Impact factor: 2.518

Review 2.  Molecular dynamics simulations of biomolecules.

Authors:  Martin Karplus; J Andrew McCammon
Journal:  Nat Struct Biol       Date:  2002-09

3.  How mutational epistasis impairs predictability in protein evolution and design.

Authors:  Charlotte M Miton; Nobuhiko Tokuriki
Journal:  Protein Sci       Date:  2016-01-22       Impact factor: 6.725

Review 4.  Relating protein motion to catalysis.

Authors:  Sharon Hammes-Schiffer; Stephen J Benkovic
Journal:  Annu Rev Biochem       Date:  2006       Impact factor: 23.643

5.  Crystal structure of an ancient protein: evolution by conformational epistasis.

Authors:  Eric A Ortlund; Jamie T Bridgham; Matthew R Redinbo; Joseph W Thornton
Journal:  Science       Date:  2007-08-16       Impact factor: 47.728

6.  Cross-species analysis traces adaptation of Rubisco toward optimality in a low-dimensional landscape.

Authors:  Yonatan Savir; Elad Noor; Ron Milo; Tsvi Tlusty
Journal:  Proc Natl Acad Sci U S A       Date:  2010-02-08       Impact factor: 11.205

7.  Hot spots for allosteric regulation on protein surfaces.

Authors:  Kimberly A Reynolds; Richard N McLaughlin; Rama Ranganathan
Journal:  Cell       Date:  2011-12-23       Impact factor: 41.582

Review 8.  Epistasis--the essential role of gene interactions in the structure and evolution of genetic systems.

Authors:  Patrick C Phillips
Journal:  Nat Rev Genet       Date:  2008-11       Impact factor: 53.242

9.  Large-scale allosteric conformational transitions of adenylate kinase appear to involve a population-shift mechanism.

Authors:  Karunesh Arora; Charles L Brooks
Journal:  Proc Natl Acad Sci U S A       Date:  2007-11-13       Impact factor: 11.205

10.  Stability-mediated epistasis constrains the evolution of an influenza protein.

Authors:  Lizhi Ian Gong; Marc A Suchard; Jesse D Bloom
Journal:  Elife       Date:  2013-05-14       Impact factor: 8.140

View more
  9 in total

Review 1.  Allosteric communication in molecular machines via information exchange: what can be learned from dynamical modeling.

Authors:  Dimitri Loutchko; Holger Flechsig
Journal:  Biophys Rev       Date:  2020-03-20

2.  Mechanics of Allostery: Contrasting the Induced Fit and Population Shift Scenarios.

Authors:  Riccardo Ravasio; Solange Marie Flatt; Le Yan; Stefano Zamuner; Carolina Brito; Matthieu Wyart
Journal:  Biophys J       Date:  2019-10-09       Impact factor: 4.033

Review 3.  Designed Elastic Networks: Models of Complex Protein Machinery.

Authors:  Holger Flechsig; Yuichi Togashi
Journal:  Int J Mol Sci       Date:  2018-10-13       Impact factor: 5.923

4.  Revealing evolutionary constraints on proteins through sequence analysis.

Authors:  Shou-Wen Wang; Anne-Florence Bitbol; Ned S Wingreen
Journal:  PLoS Comput Biol       Date:  2019-04-24       Impact factor: 4.475

5.  Simple mechanics of protein machines.

Authors:  Holger Flechsig; Alexander S Mikhailov
Journal:  J R Soc Interface       Date:  2019-06-19       Impact factor: 4.118

6.  Allostery and Epistasis: Emergent Properties of Anisotropic Networks.

Authors:  Paul Campitelli; S Banu Ozkan
Journal:  Entropy (Basel)       Date:  2020-06-16       Impact factor: 2.524

7.  The Statistical Trends of Protein Evolution: A Lesson from AlphaFold Database.

Authors:  Qian-Yuan Tang; Weitong Ren; Jun Wang; Kunihiko Kaneko
Journal:  Mol Biol Evol       Date:  2022-10-07       Impact factor: 8.800

8.  Catalytic enzymes are active matter.

Authors:  Ah-Young Jee; Yoon-Kyoung Cho; Steve Granick; Tsvi Tlusty
Journal:  Proc Natl Acad Sci U S A       Date:  2018-11-01       Impact factor: 11.205

9.  Direct coupling analysis of epistasis in allosteric materials.

Authors:  Barbara Bravi; Riccardo Ravasio; Carolina Brito; Matthieu Wyart
Journal:  PLoS Comput Biol       Date:  2020-03-02       Impact factor: 4.475

  9 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.