Literature DB >> 27189545

A Simple, General Result for the Variance of Substitution Number in Molecular Evolution.

Bahram Houchmandzadeh, Marcel Vallade.

Abstract

The number of substitutions (of nucleotides, amino acids, etc.) that take place during the evolution of a sequence is a stochastic variable of fundamental importance in the field of molecular evolution. Although the mean number of substitutions during molecular evolution of a sequence can be estimated for a given substitution model, no simple solution exists for the variance of this random variable. We show in this article that the computation of the variance is as simple as that of the mean number of substitutions for both short and long times. Apart from its fundamental importance, this result can be used to investigate the dispersion index R, that is, the ratio of the variance to the mean substitution number, which is of prime importance in the neutral theory of molecular evolution. By investigating large classes of substitution models, we demonstrate that although R ≥ 1, obtaining R significantly larger than unity in general necessitates additional hypotheses on the structure of the substitution model.
© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.


Keywords:  dispersion index; substitution matrix; substitution number; variance


Year:  2016        PMID: 27189545      PMCID: PMC4915360          DOI: 10.1093/molbev/msw063

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

Evolution at the molecular level is the process by which random mutations change the content of some sites of a given sequence (of nucleotides, amino acids, etc.) over time. The number of substitutions n that occur during this time is of prime importance in the field of molecular evolution, and its characterization is the first step in deciphering the history of evolution and its many branchings. The main observable in molecular evolution, on comparing two sequences, is p, the fraction of sites at which the two sequences differ. In order to estimate the statistical moments of n, the usual approach is to postulate a substitution model Q through which p can be related to the statistical moments of n. The simplest and most widely used models assume that Q is site independent, although this constraint can be relaxed (Graur and Li 1999; Yang 2006). Once a substitution model Q has been specified, it is straightforward to deduce the mean number of substitutions ⟨n⟩, and the process is detailed in many textbooks. However, the mean is only the first step in the characterization of a random variable and by itself is a rather poor indicator. The next step in the investigation of a random variable is to obtain its variance V. Surprisingly, no simple expression for V can be found in the literature for an arbitrary substitution model Q. The first purpose of this article is to overcome this shortcoming. We show that computing V is as simple as computing ⟨n⟩, both for short and long times. We then apply this fundamental result to the investigation of the dispersion index R, the ratio of the variance to the mean number of substitutions.
The neutral theory of molecular evolution introduced by Kimura (1984) supposes that the majority of mutations are neutral (i.e., have no effect on the phenotypic fitness) and that substitutions in protein or DNA sequences therefore accumulate at a "constant rate" during evolution, a hypothesis that plays an important role in the foundation of the "molecular clock" (Bromham and Penny 2003; Ho and Duchêne 2014). The original neutral theory postulated that the substitution process is Poissonian, that is, that R = 1. Since the earliest work on the index of dispersion, however, it became evident that R is usually much larger than unity (Gillespie 1989; Ohta 1995; see Cutler [2000] for a review of data). Many alternatives have been suggested to reconcile the "overdispersion" observation with the neutral theory (Cutler 2000). Among these various models, a promising alternative, that of fluctuating neutral space, was suggested by Takahata (1991) and has been extensively studied in various frameworks (Zheng 2001; Bastolla et al. 2002; Wilke 2004; Bloom et al. 2007; Raval 2007). The fluctuating neutral space model states that the substitution rate from state i to state j is a function of both i and j. States i and j can be nucleotides or amino acids, in which case we recover the usual substitution models of molecular evolution discussed above. The states can also be nodes of a neutral graph used to study global protein evolution (Huynen et al. 1996; Bornberg-Bauer and Chan 1999; van Nimwegen et al. 1999). For neutral networks used in the study of protein evolution, Bloom et al. (2007) devised an elegant procedure to estimate the substitution rates. We will show in this article that in general R ≥ 1, where the equality is reached only for the most trivial cases. Producing large R, however, requires additional hypotheses on the structure of the substitution rates.
In summary, the problem we investigate in this article is to find a simple and general solution for the variance and dispersion index of any substitution matrix of dimension K. A substitution matrix Q collects the transition rates Q^i_j from state i to state j (i ≠ j); its diagonal elements Q^i_i = −m_i, where m_i designates the rate of leaving state i, are set such that its columns sum to zero (see below for notations). Because of this condition, Q is singular. Zheng (2001) was the first to use Markov chains to investigate the variance of the substitution number as a solution of a set of differential equations. His investigation was further developed by Bloom et al. (2007), who gave the general solution in terms of the spectral decomposition of the substitution matrix; this solution was extended by Raval (2007) for a specific class of matrices used for random walks on neutral graphs. Minin and Suchard (2008) used the same spectral method to derive an analytical form for the generating function of a binary process. The first step in characterizing the substitution number, as is well known, is to find the equilibrium probabilities π_i of being in state i, which are obtained by solving the linear system Q|π⟩ = 0 with the additional condition Σ_i π_i = 1. Once the π_i are obtained, the mean substitution number as a function of time is simply ⟨n⟩ = m̄t, where m̄ = Σ_i m_i π_i is the weighted average of the "leaving" rates. We show here that finding the variance necessitates a similar computation. Denoting by b_i = (m̄ − m_i)π_i the weighted deviations of the diagonal elements of Q from the mean, we have to find the solution |u⟩ of the linear system Q|u⟩ = |b⟩ with the additional condition Σ_i u_i = 0. For long times, the dispersion index is then simply

R = 1 + (2/m̄) Σ_i m_i u_i.   (1)

For short times, that is, when the mean number of substitutions is small, the result is even simpler:

R = 1 + (v/m̄²)⟨n⟩,   (2)

where v = Σ_i π_i (m_i − m̄)²; in other words, v is the variance of the diagonal elements of the substitution matrix, weighted by the equilibrium probabilities.
This article is organized as follows: The "Materials and Methods" section contains the theoretical formulation of the problem and its solution, which leads to the simple results (1) and (2). The validity of the method is confirmed by Monte Carlo numerical simulations. Generalization to rate heterogeneity is also considered. The "Materials and Methods" section also contains a "Numerical Methods" subsection which describes verbally the numerical algorithm provided in supplementary file S1, Supplementary Material online. In the "Results" section, the simplicity of expressions (1) and (2) is used to study the dispersion index for specific models of nucleotide substitutions widely used in the literature and for general models. We investigate in particular the conditions necessary to produce large R. The last section is devoted to a general discussion of these results and to conclusions. Technical details, such as the proof of R ≥ 1, are given in the appendices and in supplementary file S2, Supplementary Material online (solution for general initial conditions).

Materials and Methods

Background and Definitions

The problem we investigate in this article is mainly that of counting the transitions of a random variable (fig. 1). Consider a random variable X that can occupy K distinct states and let A^i_j (i ≠ j) be the transition rate from state i to state j. The probability density p_i(t) of being in state i at time t is governed by the master equation

dp_i/dt = Σ_{j≠i} A^j_i p_j − m_i p_i,

where m_i = Σ_{j≠i} A^i_j is the "leaving" rate from state i. We can collect the p_i into a (column) vector |p⟩ and write the above equations in matrix notation

d|p⟩/dt = (A − D)|p⟩,

where D is the diagonal matrix of the m_i and A collects the detailed transition rates from state i to state j (the A^i_j above) and has zeros on its diagonal. In our notations, the upper (lower) index designates the column (row) of a matrix. The matrix Q = A − D is called the substitution matrix and its columns sum to zero.
Fig. 1.

The random variables X can switch between K states; the counter N of the number of transitions is incremented at each transition of the variable X. The figure above shows one realization of these random variables as a function of time.

Before proceeding, we explain the notations used in this article. As the matrix Q is not in general symmetric, a clear distinction must be made between right (column) and left (row) vectors. The Dirac notations are standard and useful for handling this distinction: a column vector is denoted |v⟩, a row vector is denoted ⟨u|, and ⟨u|v⟩ is their scalar product. In some of the literature (see Yang 2006), the substitution matrix is the transpose of the matrix used here; the master equation is then written for a row vector of probabilities and the rows of the substitution matrix sum to zero. By construction, the matrix Q is singular and has one zero eigenvalue, whereas all the others are negative. Therefore, as time flows, |p(t)⟩ → |π⟩, where |π⟩ is the equilibrium occupation probability and the zero-eigenvector of the substitution matrix: Q|π⟩ = 0 with ⟨1|π⟩ = 1, where ⟨1| = (1, 1, …, 1), and the second condition expresses that the sum of the probabilities must be 1. Note that by definition, ⟨1|Q = 0, and thus ⟨1| is a zero row eigenvector of the substitution matrix.
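As an illustration (an independent sketch, not part of the original article, whose own code is Julia), the equilibrium computation just described takes a few lines of NumPy; the 4 × 4 matrix below is a randomly generated example, and the convention follows the article (columns sum to zero):

```python
import numpy as np

# A random 4x4 substitution matrix: Q[i, j] is the rate from state j
# to state i (i != j); the diagonal Q[j, j] = -m_j is set so that
# every column sums to zero.
rng = np.random.default_rng(0)
Q = rng.uniform(0.0, 1.0, size=(4, 4))
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=0))      # columns now sum to zero

# Equilibrium |pi>: Q|pi> = 0 with <1|pi> = 1.  Because <1|Q = 0,
# one row of the singular system can be replaced by the normalization.
M = Q.copy()
M[0, :] = 1.0
rhs = np.zeros(4); rhs[0] = 1.0
pi = np.linalg.solve(M, rhs)

print(pi)          # equilibrium occupation probabilities
print(Q @ pi)      # ~ zero vector
```

Replacing one row by the normalization condition is legitimate because ⟨1|Q = 0 makes the discarded equation a linear combination of the others.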

Problem Formulation

To count the number of substitutions (fig. 1), we consider the probability densities p_i(n, t) of being in state i after n substitutions at time t. These probabilities are governed by the master equation

dp_i(n)/dt = Σ_{j≠i} A^j_i p_j(n − 1) − m_i p_i(n).

We can combine the above equations by setting p_i(n) = 0 if n < 0. Collecting the elements p_i(n) into the vector |p(n)⟩, the above equation can then be written as

d|p(n)⟩/dt = A|p(n − 1)⟩ − D|p(n)⟩.

The quantities of interest for the computation of the dispersion index are the mean and the variance of the number of substitutions. The mean number of substitutions at time t is ⟨n⟩ = Σ_{i,n} n p_i(n, t). Let us define the partial means n_i = Σ_n n p_i(n, t) and collect them into the vector |n⟩. The mean is then defined simply as ⟨n⟩ = ⟨1|n⟩. By the same token, the second moment can be written in terms of partial second moments as ⟨n²⟩ = ⟨1|s⟩, where s_i = Σ_n n² p_i(n, t). It is straightforward to show (see Appendix A), for the initial condition |p(0)⟩ = |π⟩, that |n⟩ and |s⟩ obey the linear differential equations

d|n⟩/dt = Q|n⟩ + A|π⟩,   (9)
d|s⟩/dt = Q|s⟩ + 2A|n⟩ + A|π⟩.   (10)

The choice of the equilibrium initial condition simplifies the computations and is used throughout the literature; it is, however, not a necessary requirement. We provide the solution for general initial conditions in supplementary file S2, Supplementary Material online. Let us define the row vector ⟨m| = (m_1, …, m_K), which collects the leaving rates. By definition, ⟨1|A = ⟨m|. Multiplying equations (9) and (10) by the row vector ⟨1|, and noting that ⟨1|Q = 0, we get simple relations for the moments:

d⟨n⟩/dt = ⟨m|π⟩,   (12)
d⟨n²⟩/dt = 2⟨m|n⟩ + ⟨m|π⟩.   (13)

We observe that the mean number of substitutions involves only a trivial integration. Defining the weighted average of the leaving rates as m̄ = ⟨m|π⟩, the mean number of substitutions is simply

⟨n⟩ = m̄t.   (15)

To compute the second moment of the substitution number, on the other hand, we must solve for |n⟩ using equation (9) and then perform one integration. The next subsection is devoted to the efficient solution of this procedure.
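The moment equation for |n⟩ can be checked numerically. The following sketch (Python rather than the authors' Julia, with a hypothetical random matrix) integrates d|n⟩/dt = Q|n⟩ + A|π⟩ with a plain Euler scheme and recovers ⟨n⟩ = m̄t:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
A_off = rng.uniform(0.0, 1.0, size=(K, K))   # detailed rates, zero diagonal
np.fill_diagonal(A_off, 0.0)
m = A_off.sum(axis=0)                        # leaving rates m_i
Q = A_off - np.diag(m)                       # substitution matrix, columns sum to 0

# equilibrium |pi>
M = Q.copy(); M[0, :] = 1.0
e0 = np.zeros(K); e0[0] = 1.0
pi = np.linalg.solve(M, e0)
mbar = m @ pi                                # weighted average of leaving rates

# Euler integration of d|n>/dt = Q|n> + A|pi> with |n(0)> = 0
n = np.zeros(K)
dt, T = 1e-4, 5.0
for _ in range(int(round(T / dt))):
    n += dt * (Q @ n + A_off @ pi)
print(n.sum(), mbar * T)                     # <n> = <1|n> grows as mbar * t
```

The agreement is exact up to rounding: summing the update over components, the ⟨1|Q|n⟩ term vanishes at every step, so ⟨1|n⟩ increases by m̄ dt per step.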

Solution of the Equation for the Moments

One standard way of solving equation (9) would be to express the matrix Q in its eigenbasis; equation (9) is then diagonalized and can be formally solved. This is the method used by Bloom et al. (2007) and further refined by Raval (2007) for the specific class of substitution matrices associated with random walks on neutral graphs. Bloom et al. (2007, eq. 9) also give a general procedure for finding the variance which can be simplified when an eigenbasis is available. The problem with the eigenbasis approach is that even when Q can be diagonalized, this is not the most efficient procedure for finding V, as it necessitates the computation of all eigenvalues and row and column eigenvectors of Q and then the cumbersome summation of their binomial products. The procedure we follow involves some straightforward, albeit cumbersome, linear algebraic operations, but the end result is quite simple. We note that the matrix Q is singular and has exactly one zero eigenvalue, associated with the row and column eigenvectors ⟨1| and |π⟩. The method we use is to isolate the zero eigenvalue by making a round-trip to a new basis. Thus, if we can find a new basis in which the substitution matrix takes a lower block-triangular form, with the zero eigenvalue in the first diagonal block and a (K − 1) × (K − 1) block Q̃ below it, we will have achieved our goal of isolating the zero eigenvalue. The nonsingular matrix Q̃ is of rank K − 1 and has the nonzero and negative eigenvalues of Q. As ⟨1| is the known zero row eigenvector of Q, we can split the vector space into the direction spanned by |π⟩ and its complement. It is then straightforward to find the transfer matrices X and X⁻¹ for such a transformation. Under such a transformation, a column vector |v⟩ transforms into the pair (⟨1|v⟩, |ṽ⟩), where |ṽ⟩ is a (K − 1)-dimensional vector. In general, we will designate by a ∼ all vectors that belong to the (K − 1)-dimensional space in which the linear application Q̃ operates. A row vector ⟨w| transforms in the same way into a scalar and a (K − 1)-dimensional row vector ⟨w̃|, whose elements are indexed from 2 to K.
Expressing now equation (9) for the evolution of the first moments in the new basis, we find a pair of equations: equation (19) for the scalar component ⟨1|n⟩ and equation (20) for the (K − 1)-dimensional component |ñ⟩. Equation (19) is the same as equation (12) and implies that ⟨n⟩ = m̄t. As Q̃ is nonsingular (and negative definite), equation (20) can now readily be solved: with the initial condition |ñ(0)⟩ = 0, it integrates into the sum of a stationary term and exponentially decaying transients. To compute the second moment (eq. 13) and the variance, we must integrate the above expression one more time. We finally obtain

V = ⟨n⟩ + δV,   (22)

where the second term on the right-hand side of the above equation is the excess variance δV with respect to a Poisson process.

Long-Time Behavior

As all the eigenvalues of Q̃ are negative, for large times the transients vanish and the leading term of the excess variance is therefore linear in time, δV ≈ 2⟨m̃|ũ⟩t (23), where |ũ⟩ is the solution of a (K − 1)-dimensional linear equation. Returning to the original basis, relation (23) becomes

δV = 2⟨m|u⟩t,   (24)

where ⟨m| is the row vector of the leaving rates and |u⟩ is the solution of the linear equation

Q|u⟩ = |b⟩, with ⟨1|u⟩ = 0,   (25)

in which |b⟩ is the vector of the weighted deviations from m̄ of the leaving rates m: b_i = (m̄ − m_i)π_i. Finally, for large times, the dispersion index is

R = 1 + 2⟨m|u⟩/m̄,   (28)

which is the relation (1) given in the introduction. Figure 2 shows the agreement between the above theoretical results and stochastic numerical simulations.
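As a worked example (derived here from the recipe above, not quoted from the article), relation (28) can be evaluated in closed form for the simplest nontrivial case, a two-state process with rate a from state 1 to state 2 and rate b from state 2 to state 1:

```latex
Q = \begin{pmatrix} -a & b \\ a & -b \end{pmatrix},\qquad
|\pi\rangle = \frac{1}{a+b}\begin{pmatrix} b \\ a \end{pmatrix},\qquad
\langle m| = (a,\ b),\qquad \bar m = \frac{2ab}{a+b}.
% b_i = (\bar m - m_i)\pi_i, and Q|u\rangle = |b\rangle with
% \langle 1|u\rangle = 0 give
u_1 = -u_2 = -\frac{ab\,(b-a)}{(a+b)^3},\qquad
\langle m|u\rangle = \frac{ab\,(a-b)^2}{(a+b)^3},
% so that
R = 1 + \frac{2\langle m|u\rangle}{\bar m}
  = 1 + \frac{(a-b)^2}{(a+b)^2}.
```

For a = b the diagonal elements are equal and R = 1, as expected; for very asymmetric rates, R approaches 2.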
Fig. 2.

Comparison between the theoretical result (28) and numerical simulations. 4 × 4 random (uniform(0, 1)) matrices were generated. For each matrix, a Gillespie algorithm was used to generate 10⁶ random paths as a function of time (t_final = 1,000), from which the dispersion index was computed. In the above figure, each dot corresponds to one random matrix.

Two important consequences should be noted. First, it is not difficult to show that R ≥ 1 for continuous-time random processes, a property called the overdispersion of the molecular clock. A demonstration of this theorem for symmetric substitution matrices whose elements are 0 or 1 (adjacency matrices) was given by Raval (2007). We give the demonstration for general time-reversible (GTR) substitution matrices in Appendix B. For arbitrary matrices, extensive numerical results on 10⁷ random matrices (fig. 4) have shown R ≥ 1 with no exception.
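A toy version of this comparison (a sketch far smaller than the article's 10⁶-path C++ simulations) can be run on the two-state process with rates a = 1 and b = 3, for which the general result (28) evaluates to R = 1 + (a − b)²/(a + b)² = 1.25 (a value derived here, not quoted in the article):

```python
import random

# Two-state toy model: leaving rates (a, b); each substitution flips
# the state.  Long-time theory gives R = 1 + (a-b)^2/(a+b)^2 = 1.25.
a, b = 1.0, 3.0
rates = (a, b)
pi = (b / (a + b), a / (a + b))          # equilibrium probabilities
rng = random.Random(42)

T, paths = 50.0, 4000
counts = []
for _ in range(paths):
    state = 0 if rng.random() < pi[0] else 1   # start at equilibrium
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(rates[state])     # exponential waiting time
        if t > T:
            break
        state = 1 - state                      # substitution
        n += 1
    counts.append(n)

mean = sum(counts) / paths
var = sum((c - mean) ** 2 for c in counts) / (paths - 1)
print(mean / (1.5 * T), var / mean)      # ~1 and ~1.25
```

With only 4,000 paths the estimate of R carries a statistical error of a few percent, which is why the article's large-scale comparison used 10⁶ paths per matrix.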
Fig. 4.

Statistical study of 4 × 4 substitution matrices for 1) GTR matrices ("G," black curves), 2) arbitrary matrices ("R," red curves), and 3) sparse GTR matrices ("S," blue curves). In each case, 10⁷ matrices are generated and for each matrix, its dispersion index R and its eccentricity e are computed, where e quantifies the bias of the equilibrium probabilities π_i of the states. In each case, the data are sorted by R value to compute the cumulative histogram (a). (b) The relation between e and R. To make visible the statistical relation between e and R, a moving average of size 1,000 data points is applied to the 10⁷ sorted (R, e) data in each data set and the result is displayed in the lower plot (b).

The second consequence of relation (28) is that if all the diagonal elements of the substitution matrix are equal (i.e., m_i = m̄ for all i), then the dispersion index is exactly 1 and we recover the property of a normal Poisson process, regardless of the fine structure of Q and the equilibrium probabilities π. This is a sufficient condition. We show that the necessary condition for R = 1 is π_i(m_i − m̄) = 0 for all i, which, except for the trivial case where some π_i = 0, again implies the equality of the diagonal elements of Q (see Appendix B). Note that the expression (28) was obtained for the initial condition |p(0)⟩ = |π⟩. As mentioned before, the solution for a general initial condition is provided in supplementary file S2, Supplementary Material online, where it is shown that for long times, expression (28) remains valid and does not depend on the choice of initial conditions.

Short-Time Behavior

For short times, that is, when the mean number of substitutions is small, we can expand δV given by expression (22) to second order in time:

δV ≈ t² Σ_i π_i m_i (m_i − m̄).   (30)

Note that the above summation is over the deviations of the leaving rates from their weighted mean. By definition, Σ_i π_i (m_i − m̄) = 0, and hence the sum in relation (30) can be rearranged as

δV ≈ t² Σ_i π_i (m_i − m̄)².   (31)

The sum, which we will denote by v, represents the variance of the diagonal elements of Q, weighted by the equilibrium probabilities. It is more meaningful to express the variance in terms of the mean substitution number. Using relation (15), we therefore have V = ⟨n⟩ + (v/m̄²)⟨n⟩². The dispersion index for short times is therefore

R = 1 + (v/m̄²)⟨n⟩.   (32)

We observe that for short times, the dispersion index increases with the mean number of substitutions. This is a generic feature of nontrivial substitution matrices (where the diagonal elements are not identical) and has been observed by Ohta (1995). The dispersion index for all times can also be computed, and an example is given in Appendix C.
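The short-time expansion V ≈ m̄t + vt² can be checked against a direct Euler integration of the moment equations (a sketch with a hypothetical random matrix; Python rather than the authors' Julia):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3
A_off = rng.uniform(0.0, 1.0, size=(K, K))
np.fill_diagonal(A_off, 0.0)
m = A_off.sum(axis=0)                    # leaving rates
Q = A_off - np.diag(m)                   # columns sum to zero

# equilibrium probabilities and the weighted variance of the diagonal
M = Q.copy(); M[0, :] = 1.0
e0 = np.zeros(K); e0[0] = 1.0
pi = np.linalg.solve(M, e0)
mbar = m @ pi
v = pi @ (m - mbar) ** 2                 # weighted variance of the m_i

# Euler integration of the moment equations from the equilibrium start
n = np.zeros(K)                          # partial means
s = np.zeros(K)                          # partial second moments
dt, T = 1e-6, 0.01
for _ in range(int(round(T / dt))):
    n_new = n + dt * (Q @ n + A_off @ pi)
    s = s + dt * (Q @ s + 2.0 * A_off @ n + A_off @ pi)
    n = n_new
V = s.sum() - n.sum() ** 2
v_est = (V - mbar * T) / T**2            # quadratic coefficient of V(t)
print(v_est, v)                          # close for small T
```

The residual discrepancy is of order t³ (plus the Euler discretization error), so the agreement improves as T is decreased.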

Rate Heterogeneity

The substitution models we have considered here can be applied to sequences where the Q matrix is identical for all sites. For nucleotide sequences, this can describe the evolution of noncoding sequences or synonymous substitutions. On the other hand, for amino acid substitutions, it is well known that some sites are nearly constant or evolve at a slower pace than other sites. A natural extension of the present work would be to take into account substitution matrices that vary among sites, drawing, for example, their scaling factor from a Gamma distribution (Yang 2006). The formalism we have developed in this article can be adapted to such an extension. Let us define the substitution number for the sequence as n_S = Σ_β n_β, where the sum runs over the L sites of the sequence and n_β is the substitution number at site β. If we suppose that sites are not correlated, that is, that a transition at one site does not modify the transition rates at other sites, the n_β are independent random variables and therefore

⟨n_S⟩ = Σ_β ⟨n_β⟩,   (33)
V_S = Σ_β V_β.   (34)

Consider the simplest case where only the scaling factor varies across sites: Q_β = λ_β Q, where Q is a fixed substitution matrix and λ_β is drawn from a given distribution f(λ) with mean λ̄. The actual form of f is not important for the discussion here. Recalling that for large times ⟨n_β⟩ = λ_β m̄t and V_β = λ_β(m̄ + 2⟨m|u⟩)t (expressions 15 and 25), where m̄, ⟨m| and |u⟩ are defined for the matrix Q, expressions (33) and (34) reduce to

⟨n_S⟩ = Lλ̄m̄t, V_S = Lλ̄(m̄ + 2⟨m|u⟩)t,

where we have supposed the sequence length L to be large. We observe that the dispersion index in this case, R_S = 1 + 2⟨m|u⟩/m̄, is not sensitive to rate variation among sites: it is the same expression (28) we had before. Therefore, in this simplest case, rate variation among sites cannot increase the dispersion index compared with the homogeneous-rate case. For a more general case where all the coefficients of the substitution matrix are drawn at random from a given distribution, the statistical moments of n_S have to be evaluated from expressions (33) and (34) using the exact distribution law of the matrix elements.
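The key fact underlying this argument, that R is insensitive to an overall rescaling of Q, is easy to verify numerically (a sketch; the matrix and the scaling factors are arbitrary examples, and the solver implements relation (28)):

```python
import numpy as np

def dispersion_index(Q):
    # relation (28): R = 1 + 2<m|u>/mbar, with Q|u> = |b>, <1|u> = 0
    K = Q.shape[0]
    m = -np.diag(Q)
    M = Q.copy(); M[0, :] = 1.0
    e0 = np.zeros(K); e0[0] = 1.0
    pi = np.linalg.solve(M, e0)
    mbar = m @ pi
    rhs = (mbar - m) * pi; rhs[0] = 0.0  # b_i, with <1|u> = 0 on row 0
    u = np.linalg.solve(M, rhs)
    return 1.0 + 2.0 * (m @ u) / mbar

rng = np.random.default_rng(3)
Q = rng.uniform(0.0, 1.0, size=(4, 4))
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=0))

# Rescaling Q -> lam * Q changes the clock rate but not R, which is
# why simple rate heterogeneity leaves the dispersion index unchanged.
for lam in (0.1, 1.0, 7.3):
    print(dispersion_index(lam * Q))     # same value for every lam
```

Under Q → λQ, both m̄ and ⟨m|u⟩ scale by λ while |u⟩ is unchanged, so their ratio, and hence R, is invariant.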
A similar approach can be used to evaluate the short-time behavior of the dispersion index. The hypothesis that sites are independent is a simplification and correlations between sites have been used to discover sectors, that is, functional domains inside a protein sequence (Halabi et al. 2009). The method presented in this article can in principle be used to study the more general case by considering groups of correlated amino acids as the basic units, which will significantly increase the dimensionality K of the substitution matrix.

Numerical Methods

Numerical simulation of the stochastic equations uses the Gillespie (1977) algorithm and is written in the C++ language. To compute the dispersion index of a given matrix Q, we generate 10⁶ random paths over a time period of 1,000. To compare the analytical solutions given in this article with stochastic simulations (fig. 2), we generated random matrices and numerically computed their dispersion index by the above method. This numerical simulation took approximately 10 days on a 60-core cluster. All linear algebra numerical computations and all data processing were performed with the high-level Julia language (Bezanson et al. 2014). Computing the analytical dispersion index for the above random matrices took about 5 s on a desktop computer, using only one core. To generate random GTR matrices, we use a factorization of Q into a symmetric matrix S and the diagonal matrix Π of equilibrium probabilities (see Appendix B), which allows for the independent generation of the elements of the symmetric matrix S and of K − 1 elements of Π. For arbitrary matrices, we directly draw the elements of the matrix. All random generators used in this work are uniform(0, 1). The supplementary file S1, Supplementary Material online, contains the Julia code (15 lines) for computing the dispersion index as given in equation (1). The algorithm can be described as:

1. Find the equilibrium column vector |π⟩ by solving the linear system Q|π⟩ = 0, with the supplementary condition ⟨1|π⟩ = 1.
2. Extract the row vector ⟨m| from the diagonal of the Q matrix: m_i = −Q^i_i.
3. Compute m̄ = ⟨m|π⟩, the rate of variation of the mean substitution number ⟨n⟩ = m̄t.
4. Define the column vector |b⟩ whose elements are given by b_i = (m̄ − m_i)π_i.
5. Find the column vector |u⟩ by solving the linear system Q|u⟩ = |b⟩, with the supplementary condition ⟨1|u⟩ = 0.
6. Compute ⟨m|u⟩ and the long-time dispersion index R = 1 + 2⟨m|u⟩/m̄.
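The six steps translate almost line by line into code. The following is an independent Python/NumPy sketch (the article's own implementation is the 15-line Julia code of supplementary file S1; the function name and the test matrices are illustrative):

```python
import numpy as np

def dispersion_index(Q):
    """Long-time dispersion index R, following the six steps above.
    Q is a K x K substitution matrix whose columns sum to zero."""
    K = Q.shape[0]
    # 1. equilibrium: Q|pi> = 0 with <1|pi> = 1 (one row of the
    #    singular system is replaced by the normalization)
    M = Q.copy(); M[0, :] = 1.0
    e0 = np.zeros(K); e0[0] = 1.0
    pi = np.linalg.solve(M, e0)
    # 2. leaving rates from the diagonal
    m = -np.diag(Q)
    # 3. rate of the mean: <n> = mbar * t
    mbar = m @ pi
    # 4. the vector b_i = (mbar - m_i) * pi_i
    b = (mbar - m) * pi
    # 5. solve Q|u> = |b> with <1|u> = 0
    rhs = b.copy(); rhs[0] = 0.0
    u = np.linalg.solve(M, rhs)
    # 6. dispersion index
    return 1.0 + 2.0 * (m @ u) / mbar

# Equal diagonal elements imply R = 1 (a Jukes-Cantor-like matrix) ...
JC = np.full((4, 4), 1.0)
np.fill_diagonal(JC, -3.0)
print(dispersion_index(JC))              # -> 1.0

# ... while a random matrix with unequal diagonals gives R > 1
rng = np.random.default_rng(4)
Q = rng.uniform(0.0, 1.0, size=(4, 4))
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=0))
print(dispersion_index(Q))
```

The cost is two dense K × K linear solves, which is why scanning millions of matrices is a matter of minutes.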

Results

Application to Specific Nucleotides Substitution Models

Nucleotide substitution models are widely used in molecular evolution (Graur and Li 1999; Yang 2006), for example, to deduce distances between sequences. Some of these models have few parameters or particular symmetries. For these models, it is worthwhile to cast relation (28) for large times into an even more explicit form and compute the dispersion index as an explicit function of the parameters. We provide below such a computation for some of the most commonly used models. For the K80 model proposed by Kimura (1980), all the diagonal elements of the substitution matrix are equal; hence, relation (28) implies that R = 1.

T92 Model

Tamura (1992) introduced a two-parameter model (T92) extending the K80 model to take into account biases in G + C content. Solving relation (28) explicitly for this model, we find the dispersion index as an explicit function of the G + C content θ and of k = α/β, where α and β are the two parameters of the original T92 model. A similar expression was found by Zheng (2001). For a given k, the maximum value of R can be computed explicitly, and it is straightforward to show that in this case 1 ≤ R ≤ 3/2, although even approaching the maximum value R = 1.5 necessitates strong asymmetries in the substitution rates and in the G + C content (such as k = 18.8).

TN93 Model

Tamura and Nei (1993) proposed a generalization of the F81 (Felsenstein 1981) and HKY85 (Hasegawa et al. 1985) models which allows for biases in the equilibrium probabilities, different rates of transitions versus transversions, and different rates for the two kinds of transitions (k₁ and k₂). The HKY85 model is the specific case in which k₁ = k₂, whereas k₁ = k₂ = 1 corresponds to F81 (also called "equal input"). Solving equation (28) leads to an explicit expression (36) for R in terms of the equilibrium probabilities π_i and the leaving rates m_i, the negatives of the diagonal elements of Q. For the specific case of the equal input (F81) model, expression (36) takes a particularly simple form:

R = 1 + 2(Σ_i π_i³ − (Σ_i π_i²)²)/(1 − Σ_i π_i²).   (38)

It is then straightforward to show that 1 ≤ R < 2, since (Σπ_i²)² ≤ Σπ_i³ ≤ Σπ_i². The lower bound is reached for uniform equilibrium probabilities, whereas the upper bound is approached when one of the π_i approaches 1. Zheng (2001) has also computed an expression for the dispersion index of the F81 model; his solution however is rather complicated. For the general TN93 model, this bound no longer holds: R can become arbitrarily large with appropriate values of k₂. The simplicity of relation (36) allows for the comprehensive exploration of the hyperplane Σ_i π_i = 1. The results are displayed in figure 3; obtaining large values of R such as R > 1.5 necessitates high asymmetries in the transition rates and/or strong biases in the equilibrium probabilities of the states.
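Under the equal-input rates of F81 (the rate toward state i is π_i whatever the current state), the general recipe of relation (28) reduces to the closed form R = 1 + 2(Σπ³ − (Σπ²)²)/(1 − Σπ²). The sketch below (an independent check, not the authors' code, with hypothetical equilibrium frequencies) compares this closed form with the general solver:

```python
import numpy as np

def dispersion_index(Q):
    # relation (28): R = 1 + 2<m|u>/mbar, with Q|u> = |b>, <1|u> = 0
    K = Q.shape[0]
    m = -np.diag(Q)
    M = Q.copy(); M[0, :] = 1.0
    e0 = np.zeros(K); e0[0] = 1.0
    pi = np.linalg.solve(M, e0)
    mbar = m @ pi
    rhs = (mbar - m) * pi; rhs[0] = 0.0
    u = np.linalg.solve(M, rhs)
    return 1.0 + 2.0 * (m @ u) / mbar

# F81 ("equal input"): the rate toward state i is pi_i, whatever the
# current state; diagonal elements are -(1 - pi_j).
pi = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical equilibrium frequencies
Q = np.tile(pi[:, None], (1, 4))         # Q[i, j] = pi_i for i != j
np.fill_diagonal(Q, pi - 1.0)

s2, s3 = np.sum(pi**2), np.sum(pi**3)
R_closed = 1.0 + 2.0 * (s3 - s2**2) / (1.0 - s2)
print(dispersion_index(Q), R_closed)     # the two coincide
```

For uniform frequencies the two sums cancel and R = 1, consistent with the equal-diagonal criterion.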
Fig. 3.

Cumulative histogram of the dispersion index R of the TN93 model and its specific cases. The three-dimensional space of equilibrium probabilities is scanned in small steps; for each point, the dispersion index of the corresponding substitution matrix is computed from relation (28). Black circles: the F81 model; red diamonds and green left triangles: the HKY85 model for two values of k, the larger being 10; blue squares, cyan right triangles and magenta up triangles: the TN93 model for three choices of (k₁, k₂). Permutations of the π_i lead to the same results and are not displayed. For each substitution matrix, it has been checked that the solution (36) and the general solution (28) are identical.


Statistical Investigation of the Dispersion Index and the Influence of Sparseness

The relation (28) can be solved explicitly for general substitution matrices. However, a general substitution matrix of dimension 4 has 11 free parameters (substitution matrices are defined up to a scaling parameter); the explicit solution of (28) as a function of the substitution matrix parameters is rather cumbersome and does not provide insightful information. An exchangeable (time-reversible, GTR) substitution matrix has the additional detailed-balance constraint π_i Q^i_j = π_j Q^j_i (Tavaré 1986). GTR matrices are widely used in the literature as they are convenient for computation and are diagonalizable (Yang 2006, Section 2.6). Considering only exchangeable matrices reduces the number of free parameters to 9, but the parameter space is still too large to be explored systematically. We can however sample the parameter space by generating a statistically significant ensemble of substitution matrices Q and get an estimate of the probability distribution of the dispersion index R. The simplicity of relation (28) allows us to generate 10⁷ random matrices for each class and compute their associated R in a few minutes on an ordinary desktop computer: depending on the dimension of Q (from 4 to 20), this computation takes between 2 and 10 min. Figure 4 shows the cumulative probability P(R) for both arbitrary (R) and GTR (G) matrices, computed from 10⁷ matrices in each case. We observe that arbitrary matrices produce statistically low dispersion indices, whereas the GTR matrices have statistically higher ones. Still, values larger than R = 5, as have been reported in the literature (Cutler 2000), have a very low probability for both random and GTR matrices.
We observed in the preceding section that for each class of matrices, high values of R are generally associated with large biases in the equilibrium probabilities, that is, a given state must have a very low equilibrium probability in order to allow for a large R. We can investigate how this observation holds for general and GTR matrices. For each K × K matrix Q that is generated, we quantify its relative eccentricity e, which measures the deviation of the equilibrium probabilities π_i from the uniform value 1/K. The relation between R and e is statistical: matrices with dispersion index in a small interval around a given R display a range of e, and we display the average e for each small interval (fig. 4). We observe again that high values of the dispersion index in each class of matrices require a high bias in the equilibrium probabilities of the states. Another effect that can increase the dispersion index of a matrix is its sparseness. This effect was investigated by Raval (2007) for random walks on neutral networks, and we generalize it here. Until now, we have examined fully connected graphs, that is, substitution processes where the random variable X can jump from any state i to any other state j. For a four-state random variable, each node of the connectivity graph is then of degree 3 (d_G = 3). This assumption may however be too restrictive. Consider, for example, a 4 × 4 nucleotide substitution matrix for synonymous substitutions. Depending on the identity of the codon to which it belongs, a nucleotide can only mutate to a subset of the other nucleotides.
For example, at the third position of a tyrosine codon, only transitions are allowed, whereas at the third position of an alanine codon, all substitutions are synonymous. For a given protein sequence, the mean nucleotide synonymous substitution graph is therefore of degree smaller than 3. In general, the degree of each state (node) i is given by the number (minus one) of nonzero elements of the ith column of the associated substitution matrix. We can investigate the effect of the sparseness of substitution matrices on the dispersion index with the formalism developed above. Figure 4 shows the probability distribution of R for GTR matrices with d = 2. As can be observed, the dispersion index distribution is shifted to higher values compared with the fully connected case, and the probability of large R increases roughly 6-fold, from 0.007 (for d = 3) to 0.044 (for d = 2). A more insightful model is that of 20 × 20 GTR matrices for amino acid substitutions. We compare the case of fully connected graphs (F), where any amino acid can replace any other one, with the case where only amino acids one nucleotide mutation apart can replace each other (nonsynonymous substitutions, NS); the average degree of the graph in the latter case is much smaller. As before, we generate 10⁷ random matrices in each class and compute their statistical properties. We observe again (fig. 5) that the distribution of R is shifted to the right for the NS graphs, whose median is larger than that of the fully connected graphs.
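A small-scale version of this numerical experiment (a sketch with 500 matrices instead of 10⁷, and a hypothetical GTR generator built from symmetric exchangeabilities) already displays the R ≥ 1 property of the sampled ensemble:

```python
import numpy as np

def dispersion_index(Q):
    # relation (28): R = 1 + 2<m|u>/mbar, with Q|u> = |b>, <1|u> = 0
    K = Q.shape[0]
    m = -np.diag(Q)
    M = Q.copy(); M[0, :] = 1.0
    e0 = np.zeros(K); e0[0] = 1.0
    pi = np.linalg.solve(M, e0)
    mbar = m @ pi
    rhs = (mbar - m) * pi; rhs[0] = 0.0
    u = np.linalg.solve(M, rhs)
    return 1.0 + 2.0 * (m @ u) / mbar

def random_gtr(K, rng):
    # rate j -> i set to s_ij * pi_i with symmetric exchangeabilities s,
    # which guarantees detailed balance pi_j q(j->i) = pi_i q(i->j)
    s = rng.uniform(0.0, 1.0, size=(K, K))
    s = (s + s.T) / 2.0
    pi = rng.uniform(0.0, 1.0, size=K); pi /= pi.sum()
    Q = s * pi[:, None]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=0))  # columns sum to zero
    return Q

rng = np.random.default_rng(5)
Rs = [dispersion_index(random_gtr(4, rng)) for _ in range(500)]
print(min(Rs), np.median(Rs))            # the minimum stays >= 1
```

Enforcing d = 2 sparseness would amount to zeroing the exchangeabilities s outside a ring of allowed transitions before building Q.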

Fig. 5. Cumulative probability of the dispersion index for fully connected (black, F) and nonsynonymous (blue, NS) 20 × 20 amino acid substitution matrices. For the fully connected matrices, any amino acid can be replaced by any other; for the NS matrices, only amino acids one nucleotide mutation apart can replace each other. For each case, 10^7 20 × 20 GTR matrices are generated and their dispersion index R is computed. For the NS matrices, transitions are weighted by the number of single-nucleotide mutations that can lead from one amino acid to another: there are, for example, six such mutations that transform a phenylalanine into a leucine, but only one that transforms a lysine into an isoleucine.

For specific amino acid substitution matrices used in the literature, such as WAG (Whelan and Goldman 2001), LG (Le and Gascuel 2008), and IDR (Szalkowski and Anisimova 2011), the index of dispersion is 1.253, 1.196, and 1.242, respectively.
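The connectivity underlying the NS matrices can be reconstructed directly from the standard genetic code. The sketch below is our own construction (the paper gives no code): it builds the graph in which two amino acids are adjacent when some codon of one is a single nucleotide mutation away from some codon of the other, and reports the average node degree of the resulting 20-node graph.

```python
from itertools import product

# Standard genetic code: codon -> one-letter amino acid ('*' = stop),
# codons enumerated in TCAG order at each position.
bases = "TCAG"
aas = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
code = {b1 + b2 + b3: aa
        for (b1, b2, b3), aa in zip(product(bases, bases, bases), aas)}

def one_apart(c1, c2):
    """True if codons c1 and c2 differ at exactly one position."""
    return sum(a != b for a, b in zip(c1, c2)) == 1

# Adjacency: amino acids i, j are neighbors if a single nucleotide
# mutation turns some codon of i into some codon of j (nonsynonymous).
amino = sorted(set(aas) - {"*"})
adj = {a: set() for a in amino}
for c1, c2 in product(code, repeat=2):
    a1, a2 = code[c1], code[c2]
    if a1 != a2 and "*" not in (a1, a2) and one_apart(c1, c2):
        adj[a1].add(a2)

degrees = [len(adj[a]) for a in amino]
print("average degree:", sum(degrees) / 20)
```

Phenylalanine, for instance, has exactly six neighbors (L, I, V, S, Y, C) in this graph, so every node has degree well below the 19 of the fully connected case, consistent with the restricted connectivity discussed above.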

Discussion and Conclusion

The substitution process (of nucleotides, amino acids, etc.), and therefore the number of substitutions n that take place during a time t, is stochastic. One of the most fundamental tasks in molecular evolution is to characterize the random variable n from molecular data. A given model of the substitution process, in the form of a substitution matrix Q, enables us to estimate the mean number of substitutions that occur during a time t. The mean depends only on the diagonal elements of Q and the equilibrium probabilities π_i of the states:

⟨n⟩ = −t Tr(QΠ),   (39)

where Tr designates the trace operator and Π = diag(π_1, …, π_K).

In molecular evolution, the main observable is the probability p that two sequences are different at a given site. Denoting this probability p(t), and assuming that both sequences are at equilibrium (Yang 2006),

p(t) = 1 − Tr(Π e^{2Qt}).   (40)

One can estimate p from the fraction p̂ of observed differences between the two sequences. By eliminating time between relations (39) and (40), it is then possible to relate the estimators n̂ (of ⟨n⟩) and p̂. For sequences of length L, the number of observed differences is given by a binomial distribution B(L, p), and the variance of the distance estimator can be deduced from relation (41). This quantity, however, is very different from the intrinsic variance of the substitution number.

The mean substitution number, its estimator n̂, and the variance of that estimator are only the first step in characterizing a random variable. The next crucial step is to evaluate the variance V of this number. What we have achieved in this article is to find a simple expression for V. In particular, we have shown that on both short and long time scales, the variance V can easily be deduced from Q. For long times, the procedure is similar to deriving the equilibrium probabilities π from Q: we only need to solve a linear equation associated with Q [relation (25)]. For short times, only the diagonal elements of Q are required to compute V [relation (31)]. A long-standing debate in the neutral theory of evolution concerns the value of the dispersion index R = V/⟨n⟩.
On the one hand, the exact solution obtained in this article demonstrates that, in general, any substitution process given by a matrix Q is overdispersed, that is, R ≥ 1, with equality observed only for trivial models in which all the diagonal elements of Q are equal. On the other hand, a comprehensive investigation of various substitution models (see Results) shows that models producing R much larger than unity generally require strong biases in the equilibrium probabilities of the states. Another way to produce a higher dispersion index is to use sparse matrices, in which the ensemble of possible transitions is reduced.
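As a minimal numerical illustration of relation (39) (using the column convention of the text, and a Jukes-Cantor matrix of our own choosing rather than one from the article), the equilibrium probabilities and the mean substitution number can be computed as follows:

```python
import numpy as np

def equilibrium(Q):
    """Stationary distribution of a rate matrix Q in column convention
    (columns sum to zero, Q @ pi = 0), found by solving the linear
    system Q pi = 0 augmented with the normalization sum(pi) = 1."""
    K = Q.shape[0]
    A = np.vstack([Q, np.ones(K)])
    b = np.zeros(K + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def mean_substitutions(Q, t):
    """Mean substitution number <n> = -t * Tr(Q diag(pi)):
    only the diagonal of Q and the equilibrium probabilities enter."""
    pi = equilibrium(Q)
    return -t * float(np.sum(np.diag(Q) * pi))

# Example: 4-state Jukes-Cantor matrix with total leaving rate 1 per state
mu = 1.0 / 3.0
Q = np.full((4, 4), mu)
np.fill_diagonal(Q, -1.0)
print(f"{mean_substitutions(Q, t=2.0):.3f}")  # -> 2.000
```

With a uniform leaving rate of 1, the mean is simply t, as the example confirms; for a general Q, the same two-line recipe (solve for π, then take the weighted trace) applies.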

Supplementary Material

Supplementary files S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
References (first 10 of 25 shown)

1. van Nimwegen E, Crutchfield JP, Huynen M. Neutral evolution of mutational robustness. Proc Natl Acad Sci U S A. 1999.
2. Bornberg-Bauer E, Chan HS. Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc Natl Acad Sci U S A. 1999.
3. Zheng Q. On the dispersion index of a Markovian molecular clock. Math Biosci. 2001.
4. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001.
5. Bastolla U, Porto M, Roman HE, Vendruscolo M. Lack of self-averaging in neutral evolution of proteins. Phys Rev Lett. 2002.
6. Houchmandzadeh B. Theory of neutral clustering for growing populations. Phys Rev E Stat Nonlin Soft Matter Phys. 2009.
7. Tamura K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol Biol Evol. 1992.
8. Minin VN, Suchard MA. Counting labeled transitions in continuous-time Markov models of evolution. J Math Biol. 2007.
9. Ho SYW, Duchêne S. Molecular-clock methods for estimating evolutionary rates and timescales. Mol Ecol. 2014. (Review)
10. Szalkowski AM, Anisimova M. Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS One. 2011.
