Approximations to the expectations and variances of ratios of tree properties under the coalescent.

Abstract

Properties of gene genealogies such as tree height (H), total branch length (L), total lengths of external (E) and internal (I) branches, mean length of basal branches (B), and the underlying coalescence times (T) can be used to study population-genetic processes and to develop statistical tests of population-genetic models. Uses of tree features in statistical tests often rely on predictions that depend on pairwise relationships among such features. For genealogies under the coalescent, we provide exact expressions for Taylor approximations to expected values and variances of ratios $X_n/Y_n$, for all 15 pairs among the variables $\{H_n, L_n, E_n, I_n, B_n, T_k\}$, considering $n$ leaves and $2 \le k \le n$. For expected values of the ratios, the approximations match closely with empirical simulation-based values. The approximations to the variances are not as accurate, but they generally match simulations in their trends as $n$ increases. Although $E_n$ has expectation 2 and $H_n$ has expectation 2 in the limit as $n \to \infty$, the approximation to the limiting expectation for $E_n/H_n$ is not 1, instead equaling $\pi^2/3 - 2 \approx 1.28987$. The new approximations augment fundamental results in coalescent theory on the shapes of genealogical trees.

Keywords: coalescent theory; external branches; internal branches; time to the most recent common ancestor

Year: 2022 PMID: 35951748 PMCID: PMC9526068 DOI: 10.1093/g3journal/jkac205

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.542

Introduction

Coalescent theory models random genealogies conditional on assumptions about the evolutionary process (Hein et al.; Wakeley 2009). In coalescent theory, a gene genealogy is a tree or network structure that represents a random draw from a coalescent model. Genealogies in coalescent theory can be summarized using a variety of quantities. For example, for random tree-like genealogies with $n$ lineages, the tree height $H$ records the sum of branch lengths on a path from a leaf to the root, and the tree length $L$ sums all branch lengths in the tree. The total length $E$ of external branches sums over leaves the lengths of paths from leaves to their nearest internal nodes, and the total length of internal branches, $I = L - E$, sums the lengths of all remaining branches. Studies in coalescent theory have often investigated the properties of tree summaries conditional on assumptions of coalescent models, with the goal of understanding how shapes of the genealogies relate to processes such as population growth and migration (e.g. Slatkin 1996; Rosenberg and Feldman 2002). Because mutations can be viewed as occurring conditionally on underlying genealogies (Hudson 1990), features of genealogical shape affect the patterns of genetic variation produced by coalescent models that permit mutation. Thus, the understanding of summaries of tree shape predicted by coalescent models is a component of the interpretation of patterns of genetic variation in relation to evolutionary processes.

Initial results concerning summaries of genealogical shape focused on single quantities, producing results on quantities such as $H$ and $L$ (Kingman 1982; Hudson 1983, 1990; Tajima 1983). Studies soon examined the information that resides in the relationships between pairs of summaries; genetic variation statistics such as those of Tajima (1989) and Fu and Li (1993) can be viewed as assessing whether or not one aspect of a tree contains long branches in relation to another.

Recently, Arbisser et al. performed a detailed investigation of the relationship between $H$ and $L$ under coalescent models. They studied the mathematical relationship between these two quantities, computing the covariance and correlation coefficient of $H$ and $L$ under a standard coalescent model with a constant-sized population. Extending the work of Arbisser et al. on $H$ and $L$, we (Alimpiev and Rosenberg 2022) reported covariances and correlations for all pairs of variables among $\{H_n, L_n, E_n, I_n, B_n, T_k\}$, where $B$ is the mean of the lengths of the two basal branches of a genealogy and $T_k$ is the coalescence time from $k$ to $k-1$ lineages, $2 \le k \le n$. Our compendium in Tables 1 and 2 of Alimpiev and Rosenberg (2022) summarizes pairwise relationships for several of the most commonly used features of coalescent tree shape, recording both new and previously known results.

Table 1.

Definitions of random variables associated with various tree summaries.

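Variable	Definition
$H_n$	$\sum_{k=2}^{n} T_k$
$L_n$	$\sum_{k=2}^{n} k T_k$
$E_n$	$\sum_{i=1}^{n} e_i^{(n)}$
$I_n$	$L_n - E_n$
$B_n$	$\frac{1}{2}T_2 + \left[\sum_{j=3}^{n-1}\sum_{k=2}^{j}\frac{1}{j(j-1)}T_k\right] + \left(\sum_{k=2}^{n}\frac{1}{n-1}T_k\right)$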

Here, $T_k$ is the random variable representing the coalescence time from $k$ to $k-1$ lineages, and $e_i^{(n)}$ is the (random) length of the $i$th external branch of a tree with $n$ leaves. We define H, L, and E for , I for , and B for . The expression for $B_n$ follows a form that incorporates terms associated with all of its contributing branches, following p. 1400 of Uyenoyama (1997) and Section 2.6 of Alimpiev and Rosenberg (2022), and it can be simplified to $B_n = \sum_{k=2}^{n} \frac{1}{k-1} T_k$.
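
The simplification of $B_n$ can be checked by collecting the coefficient of each $T_k$ in the long form given in Table 1. The derivation below is a sketch that assumes $n \ge 3$ and uses only the telescoping identity $\frac{1}{j(j-1)} = \frac{1}{j-1} - \frac{1}{j}$ together with $\mathbb{E}[T_k] = \frac{2}{k(k-1)}$; the symbol $S_{2,n-1}$ denotes $\sum_{i=1}^{n-1} 1/i^2$.

```latex
\begin{align*}
\text{coefficient of } T_2
  &= \frac{1}{2} + \sum_{j=3}^{n-1}\frac{1}{j(j-1)} + \frac{1}{n-1}
   = \frac{1}{2} + \left(\frac{1}{2} - \frac{1}{n-1}\right) + \frac{1}{n-1} = 1, \\
\text{coefficient of } T_k \ (3 \le k \le n)
  &= \sum_{j=k}^{n-1}\frac{1}{j(j-1)} + \frac{1}{n-1}
   = \left(\frac{1}{k-1} - \frac{1}{n-1}\right) + \frac{1}{n-1} = \frac{1}{k-1}, \\
B_n &= \sum_{k=2}^{n}\frac{T_k}{k-1}, \qquad
\mathbb{E}[B_n] = \sum_{k=2}^{n}\frac{2}{k(k-1)^2}
   = 2\left[S_{2,n-1} - \left(1 - \frac{1}{n}\right)\right]
   = 2S_{2,n-1} - 2 + \frac{2}{n}.
\end{align*}
```

The last expression agrees with the entry for $\mathbb{E}[B_n]$ in Table 2.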

Table 2.

Expectations and variances of properties of tree branch lengths.

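$X_n$	$\mathbb{E}[X_n]$	$\lim_{n\to\infty}\mathbb{E}[X_n]$	$\mathrm{Var}[X_n]$	$\lim_{n\to\infty}\mathrm{Var}[X_n]$
$H_n$	$\frac{2(n-1)}{n}$	$2$	$8(S_{2,n}-1)-4\left(\frac{n-1}{n}\right)^2$	$\frac{4\pi^2}{3}-12\approx 1.15947$
$L_n$	$2S_{1,n-1}$	$\infty$	$4S_{2,n-1}$	$\frac{2\pi^2}{3}\approx 6.57974$
$E_n$	$2$	$2$	$\begin{cases}4, & n=2,\\ \frac{8}{(n-1)(n-2)}\left[S_{1,n-1}n-2(n-1)\right], & n>2.\end{cases}$	$0$
$I_n$	$2S_{1,n-1}-2$	$\infty$	$4\left[\frac{2[S_{1,n-1}n-2(n-1)]}{(n-1)(n-2)}-\frac{2S_{1,n-1}}{n-1}+S_{2,n-1}\right]$	$\frac{2\pi^2}{3}\approx 6.57974$
$B_n$	$2S_{2,n-1}-2+\frac{2}{n}$	$\frac{\pi^2}{3}-2\approx 1.28987$	$\frac{2(3S_{2,n-1}n^2-2S_{2,n-1}^2n^2+n^2-4S_{2,n-1}n+3n-4)}{n^2}$	$-\frac{\pi^4}{9}+\pi^2+2\approx 1.04637$
$T_k$	$\frac{2}{k(k-1)}$	$\frac{2}{k(k-1)}$	$\frac{4}{k^2(k-1)^2}$	$\frac{4}{k^2(k-1)^2}$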

These expressions can be found in Alimpiev and Rosenberg (2022). Note that for $L_n$ and $I_n$, although the limiting variance is finite, the limiting expectation is infinite (Tavaré et al.; Wakeley 2009, p. 76).

In addition to computing the covariance and correlation coefficient of $H_n$ and $L_n$, Arbisser et al. also found approximations to the expectation and variance of the ratio $H_n/L_n$ under the coalescent model. This ratio gives a summary of the joint distribution of $H_n$ and $L_n$ that characterizes the relative magnitudes of the variables, a feature not captured by their covariance or correlation. Arbisser et al. found that although the approximation to $\mathrm{Var}[H_n/L_n]$ differed noticeably from the exact value, as obtained by numerical integration and simulations of the coalescent model, the approximation to $\mathbb{E}[H_n/L_n]$ was quite accurate.

In this article, we extend the work of Arbisser et al. to compute approximations to the expectations and variances for ratios of the 14 remaining pairs among $\{H_n, L_n, E_n, I_n, B_n, T_k\}$. The study thus extends Arbisser et al. for the expectation and variance of coalescent ratios in the same way that Alimpiev and Rosenberg (2022) extended it for the covariance and correlation coefficient.

Materials and methods

Tree variables

We work with a haploid population of constant size $N$ that follows a standard coalescent model. Time is measured in units of $N$ generations. In this section, we recall the definitions of the coalescence time $T_k$ and tree properties $H_n$, $L_n$, $E_n$, $I_n$, and $B_n$ for sample size $n$ and $2 \le k \le n$. $T_k$ is defined to be a random variable representing the time to coalescence of $k$ to $k-1$ lineages, for $2 \le k \le n$. Variable $T_k$ has exponential probability density function

$$f_{T_k}(t) = \binom{k}{2} e^{-\binom{k}{2} t}, \quad t \ge 0. \tag{1}$$

The expectation and variance of $T_k$ are

$$\mathbb{E}[T_k] = \frac{2}{k(k-1)}, \qquad \mathrm{Var}[T_k] = \frac{4}{k^2(k-1)^2}. \tag{2}$$

The tree properties $H_n$, $L_n$, $E_n$, $I_n$, and $B_n$ are defined in terms of the $T_k$. Visual depictions of these properties appear in Fig. 1, and mathematical definitions of these quantities appear in Table 1.
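
The definitions of $H_n$ and $L_n$ as sums of independent exponential coalescence times can be illustrated numerically. The following is a minimal sketch, not the procedure used in the article: the function names, the seed, and the replicate count are illustrative assumptions, and only the Python standard library is used. It draws $T_k$ with rate $k(k-1)/2$ and compares sample means with $2(n-1)/n$ and $2S_{1,n-1}$ from Table 2.

```python
import random

def sample_coalescence_times(n, rng):
    """Return {k: T_k} for k = 2..n, with T_k exponential with rate k(k-1)/2."""
    return {k: rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1)}

def tree_height(times):
    """H_n: sum of the coalescence times."""
    return sum(times.values())

def tree_length(times):
    """L_n: each T_k is weighted by the k lineages present during that interval."""
    return sum(k * t for k, t in times.items())

def estimate_means(n, replicates=20000, seed=1):
    """Monte Carlo estimates of E[H_n] and E[L_n] (hypothetical helper)."""
    rng = random.Random(seed)
    total_h = total_l = 0.0
    for _ in range(replicates):
        times = sample_coalescence_times(n, rng)
        total_h += tree_height(times)
        total_l += tree_length(times)
    return total_h / replicates, total_l / replicates

n = 10
mean_h, mean_l = estimate_means(n)
exact_h = 2 * (n - 1) / n                          # E[H_n] from Table 2
exact_l = 2 * sum(1 / i for i in range(1, n))      # E[L_n] = 2 S_{1,n-1}
print("H:", mean_h, "expected", exact_h)
print("L:", mean_l, "expected", exact_l)
```

Because $E_n$, $I_n$, and $B_n$ expressed through external branches depend on tree topology, this sketch covers only the two summaries that are functions of the $T_k$ alone.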

Fig. 1.

Properties of genealogical trees. The tree height is H. The sum of the lengths of all branches is L. External branches have total length E (green). Internal branches have total length I (orange). Basal branches have mean length B (blue).

We define $S_{a,m} = \sum_{i=1}^{m} 1/i^a$ as a useful shorthand. The limit $\lim_{m\to\infty} S_{a,m}$ is the Riemann zeta function, usually denoted $\zeta(a)$. In particular, $\zeta(1)$ diverges, $\zeta(2) = \pi^2/6$, and $\zeta(3)$ is Apéry’s constant, approximately 1.20206.
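
The shorthand for partial sums of the zeta function can be evaluated directly. The snippet below is a small sketch under the assumption that the shorthand is $S_{a,m} = \sum_{i=1}^{m} i^{-a}$; the function name `S` is chosen here for illustration.

```python
import math

def S(a, m):
    """Partial sum S_{a,m} = sum_{i=1}^{m} 1 / i**a."""
    return sum(1.0 / i**a for i in range(1, m + 1))

m = 10**5
print("S(2, m) =", S(2, m), " pi^2/6 =", math.pi**2 / 6)
print("S(3, m) =", S(3, m), " (Apery's constant is about 1.20206)")
print("S(1, m) =", S(1, m), " grows like log(m) =", math.log(m))
```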

Taylor approximations to expectations and variances of ratios

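To compute approximate expressions for expected values and variances of the ratios of various tree properties, we rely on Taylor approximations. In particular, consider random variables $X$ and $Y$ with $\mathbb{E}[Y] \neq 0$. For the expectation, we have (second-order) approximation (Elandt-Johnson and Johnson 1999, eq. 3.88):

$$\mathbb{E}\left[\frac{X}{Y}\right] \approx \frac{\mathbb{E}[X]}{\mathbb{E}[Y]} - \frac{\mathrm{Cov}[X,Y]}{\mathbb{E}[Y]^2} + \frac{\mathrm{Var}[Y]\,\mathbb{E}[X]}{\mathbb{E}[Y]^3}. \tag{3}$$

For the variance, we have (first-order) approximation (Stuart and Ord 1994, eq. 10.17):

$$\mathrm{Var}\left[\frac{X}{Y}\right] \approx \left(\frac{\mathbb{E}[X]}{\mathbb{E}[Y]}\right)^2 \left[\frac{\mathrm{Var}[X]}{\mathbb{E}[X]^2} - \frac{2\,\mathrm{Cov}[X,Y]}{\mathbb{E}[X]\,\mathbb{E}[Y]} + \frac{\mathrm{Var}[Y]}{\mathbb{E}[Y]^2}\right]. \tag{4}$$

We use $\tilde{\mathbb{E}}[X/Y]$ and $\widetilde{\mathrm{Var}}[X/Y]$ to denote approximations from equations (3) and (4). For both the expectation and the variance, we also take the limit of the approximations as $n \to \infty$.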
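
The two Taylor approximations are simple functions of five moments. The sketch below encodes the standard second-order formula for the mean of a ratio and the first-order formula for its variance, then applies the mean formula to the pair $(H_n, T_k)$ with exact rational arithmetic. The function names are illustrative assumptions; the moments used are $\mathbb{E}[H_n] = 2(n-1)/n$, $\mathbb{E}[T_k] = 2/(k(k-1))$, $\mathrm{Var}[T_k] = \mathbb{E}[T_k]^2$, and $\mathrm{Cov}[H_n, T_k] = \mathrm{Var}[T_k]$, the last of which follows from the independence of the coalescence times.

```python
from fractions import Fraction

def ratio_mean_approx(ex, ey, var_y, cov_xy):
    """Second-order Taylor approximation to E[X/Y]."""
    return ex / ey - cov_xy / ey**2 + var_y * ex / ey**3

def ratio_var_approx(ex, ey, var_x, var_y, cov_xy):
    """First-order Taylor approximation to Var[X/Y]."""
    return (ex / ey) ** 2 * (
        var_x / ex**2 - 2 * cov_xy / (ex * ey) + var_y / ey**2
    )

# Worked example for the pair (H_n, T_k) with n = 10, k = 3.
n, k = 10, 3
e_h = Fraction(2 * (n - 1), n)        # E[H_n]
e_t = Fraction(2, k * (k - 1))        # E[T_k]
var_t = e_t**2                        # variance of an exponential = mean squared
cov_ht = var_t                        # H_n = sum of independent T_j, so Cov = Var[T_k]

approx = ratio_mean_approx(e_h, e_t, var_t, cov_ht)
# Closed form for (H_n, T_k): ((2k^2 - 2k - 1) n - 2k(k - 1)) / n
closed = Fraction((2 * k * k - 2 * k - 1) * n - 2 * k * (k - 1), n)
print(approx, closed)
```

Using `Fraction` rather than floats makes the comparison with the closed form exact, which is convenient when checking algebra.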

Exact expectations, variances, and covariances of tree properties

Expected values and variances of variables $H_n$, $L_n$, $E_n$, $I_n$, $B_n$, and $T_k$ that are used in equations (3) and (4) are known, in many cases, from early studies in coalescent theory (Fu and Li 1993; Tavaré et al.; Wakeley 2009). We summarize these expectations and variances in Table 2. The covariances compiled by Alimpiev and Rosenberg (2022) appear in Table 3. In the case of pairs $(E_n, B_n)$ and $(I_n, B_n)$, the covariances are approximate, as described by Alimpiev and Rosenberg (2022).

Table 3.

Covariances of pairs of variables that summarize genealogical trees.

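($X_n$, $Y_n$)	$\mathrm{Cov}[X_n,Y_n]$	$\lim_{n\to\infty}\mathrm{Cov}[X_n,Y_n]$
$H_n, T_k$	$\frac{4}{k^2(k-1)^2}$	$\frac{4}{k^2(k-1)^2}$
$H_n, L_n$	$4S_{2,n-1}-4+\frac{4}{n}$	$\frac{2\pi^2}{3}-4\approx 2.57974$
$H_n, E_n$	$\frac{4}{n}$	$0$
$H_n, I_n$	$4S_{2,n-1}-4$	$\frac{2\pi^2}{3}-4\approx 2.57974$
$H_n, B_n$	$\frac{4[S_{3,n-1}n^2-3S_{2,n-1}n^2+(4n+1)(n-1)]}{n^2}$	$-2\pi^2+4\zeta(3)+16\approx 1.06902$
$L_n, T_k$	$\frac{4}{k(k-1)^2}$	$\frac{4}{k(k-1)^2}$
$L_n, E_n$	$\frac{4S_{1,n-1}}{n-1}$	$0$
$L_n, I_n$	$4S_{2,n-1}-\frac{4S_{1,n-1}}{n-1}$	$\frac{2\pi^2}{3}\approx 6.57974$
$L_n, B_n$	$\frac{4[S_{3,n-1}n-S_{2,n-1}n+n-1]}{n}$	$-\frac{2\pi^2}{3}+4\zeta(3)+4\approx 2.22849$
$E_n, T_k$	$\frac{4}{k(k-1)(n-1)}$	$0$
$E_n, I_n$	$\frac{4S_{1,n-1}}{n-1}-\frac{8S_{1,n-1}n}{(n-1)(n-2)}+\frac{16}{n-2}$	$0$
$E_n, B_n$	$\frac{4(S_{2,n-1}n-n+1)}{n(n-1)}$	$0$
$I_n, T_k$	$\frac{4(n-k)}{k(k-1)^2(n-1)}$	$\frac{4}{k(k-1)^2}$
$I_n, B_n$	$\frac{4(S_{3,n-1}n-S_{2,n-1}n+n-S_{3,n-1}-1)}{n-1}$	$-\frac{2\pi^2}{3}+4\zeta(3)+4\approx 2.22849$
$B_n, T_k$	$\frac{4}{k^2(k-1)^3}$	$\frac{4}{k^2(k-1)^3}$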

For pairs involving E or I, expressions apply for ; expressions involving B apply for . The expressions can be found in Alimpiev and Rosenberg (2022).

Evaluating the approximations

For each of 15 pairs of random variables, considering H, L, E, I, and B as well as T, we substitute expressions from Tables 2 and 3 into equations (3) and (4) to obtain approximate expectations and variances for ratios of pairs of variables. For each pair, we choose one variable for the numerator and the other for the denominator; approximate expectations and variances for the reciprocals can be obtained similarly. We present the approximations in Tables 4 and 5, and we plot them in Figs. 2–5.

Table 4.

Approximations to expectations of ratios of pairs of variables.

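($X_n$, $Y_n$)	$\tilde{\mathbb{E}}[X_n/Y_n]$	$\lim_{n\to\infty}\tilde{\mathbb{E}}[X_n/Y_n]$
$H_n, T_k$	$\frac{(2k^2-2k-1)n-2k(k-1)}{n}$	$2k^2-2k-1$
$H_n, L_n$	$\frac{n-1}{S_{1,n-1}n}-\frac{S_{2,n-1}n-n+1}{S_{1,n-1}^2n}+\frac{S_{2,n-1}(n-1)}{S_{1,n-1}^3n}$	$0$
$E_n, H_n$	$\frac{n(2S_{2,n}n^2-2n^2-n+1)}{(n-1)^3}$	$\frac{\pi^2}{3}-2\approx 1.28987$
$H_n, I_n$	$\frac{n-1}{(S_{1,n-1}-1)n}-\frac{S_{2,n-1}-1}{(S_{1,n-1}-1)^2}+\frac{S_{2,n-1}(n-1)(n-2)-4n+4S_{1,n-1}+4}{(S_{1,n-1}-1)^3n(n-2)}$	$0$
$B_n, H_n$	$\frac{S_{2,n-1}n-n+1}{n-1}+\frac{3S_{2,n-1}n^2-S_{3,n-1}n^2-4n^2+3n+1}{(n-1)^2}+\frac{(S_{2,n-1}n-n+1)(2S_{2,n}n^2-3n^2+2n-1)}{(n-1)^3}$	$\frac{\pi^4}{18}-\frac{\pi^2}{6}-\zeta(3)-2\approx 0.56463$
$L_n, T_k$	$2S_{1,n-1}k^2-(2S_{1,n-1}+1)k$	$\infty$
$E_n, L_n$	$\frac{(S_{1,n-1}^2+S_{2,n-1})n-2S_{1,n-1}^2-S_{2,n-1}}{S_{1,n-1}^3(n-1)}$	$0$
$L_n, I_n$	$\frac{(S_{1,n-1}^3+S_{2,n-1})(n-1)(n-2)-S_{1,n-1}^2(2n^2-7n+2)+S_{1,n-1}(n^2-8n+8)}{(S_{1,n-1}-1)^3(n-1)(n-2)}$	$1$
$B_n, L_n$	$\frac{S_{2,n-1}n-n+1}{S_{1,n-1}n}+\frac{S_{2,n-1}n-S_{3,n-1}n-n+1}{S_{1,n-1}^2n}+\frac{S_{2,n-1}(S_{2,n-1}n-n+1)}{S_{1,n-1}^3n}$	$0$
$E_n, T_k$	$\frac{k(k-1)(2n-3)}{n-1}$	$2k(k-1)$
$E_n, I_n$	$\frac{S_{1,n-1}^2(n^2-2n+4)-S_{1,n-1}(2n^2-n-2)+(S_{2,n-1}+1)(n-1)(n-2)}{(S_{1,n-1}-1)^3(n-1)(n-2)}$	$0$
$B_n, E_n$	$\frac{(n^2+2S_{1,n-1}n-8n+8)(S_{2,n-1}n-n+1)}{n(n-1)(n-2)}$	$\frac{\pi^2}{6}-1\approx 0.64493$
$I_n, T_k$	$2k(k-1)(S_{1,n-1}-1)-\frac{k(n-k)}{n-1}$	$\infty$
$B_n, I_n$	$\frac{S_{2,n-1}n-n+1}{(S_{1,n-1}-1)n}+\frac{(S_{2,n-1}n-n+1)[S_{2,n-1}(n-1)(n-2)-4n+4S_{1,n-1}+4]}{(S_{1,n-1}-1)^3n(n-1)(n-2)}+\frac{S_{2,n-1}n-(S_{3,n-1}+1)(n-1)}{(S_{1,n-1}-1)^2(n-1)}$	$0$
$B_n, T_k$	$\frac{2k(k-1)(S_{2,n-1}n-n+1)}{n}-\frac{1}{k-1}$	$\frac{1}{3}(\pi^2-6)k(k-1)-\frac{1}{k-1}$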

Expressions involving E or I apply for ; expressions involving B apply for . The value for $(H_n, L_n)$ follows equation 15 of Arbisser et al. The expressions are obtained using equation (3) and Tables 2 and 3.
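
An entry of Table 4 can be reproduced by substituting moments into the second-order approximation for the mean of a ratio. The sketch below does this for $E_n/H_n$ under stated assumptions: $\mathbb{E}[E_n] = 2$, $\mathbb{E}[H_n] = 2(n-1)/n$, $\mathrm{Var}[H_n] = 8(S_{2,n}-1) - 4((n-1)/n)^2$, and $\mathrm{Cov}[H_n, E_n] = 4/n$, as listed in Tables 2 and 3. The function names are hypothetical, and exact fractions are used so that the comparison with the closed form is an equality rather than a floating-point check.

```python
from fractions import Fraction
import math

def S(a, m):
    """Exact partial sum S_{a,m} = sum_{i=1}^{m} 1 / i**a."""
    return sum(Fraction(1, i**a) for i in range(1, m + 1))

def ratio_mean_approx(ex, ey, var_y, cov_xy):
    """Second-order Taylor approximation to E[X/Y]."""
    return ex / ey - cov_xy / ey**2 + var_y * ex / ey**3

def approx_E_over_H(n):
    """Approximate E[E_n/H_n] assembled from the individual moments."""
    e_e = Fraction(2)
    e_h = Fraction(2 * (n - 1), n)
    var_h = 8 * (S(2, n) - 1) - 4 * Fraction(n - 1, n) ** 2
    cov_he = Fraction(4, n)
    return ratio_mean_approx(e_e, e_h, var_h, cov_he)

def table4_E_over_H(n):
    """Closed form n (2 S_{2,n} n^2 - 2 n^2 - n + 1) / (n - 1)^3."""
    return n * (2 * S(2, n) * n**2 - 2 * n**2 - n + 1) / Fraction((n - 1) ** 3)

for n in (5, 20, 100):
    assert approx_E_over_H(n) == table4_E_over_H(n)
    print(n, float(approx_E_over_H(n)))
print("limit:", math.pi**2 / 3 - 2)
```

The same pattern applies to the other 14 pairs by swapping in the corresponding rows of Tables 2 and 3.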

Table 5.

Approximations to variances of ratios of pairs of variables.

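($X_n$, $Y_n$)	$\widetilde{\mathrm{Var}}[X_n/Y_n]$	$\lim_{n\to\infty}\widetilde{\mathrm{Var}}[X_n/Y_n]$
$H_n, T_k$	$\frac{2k(k-1)[k(k-1)S_{2,n}n-(k^2-k+1)n+1]}{n}$	$\frac{1}{3}k(k-1)[(\pi^2-6)k^2-(\pi^2-6)k-6]$
$H_n, L_n$	$\left(\frac{n-1}{S_{1,n-1}n}\right)^2\left[\frac{2(S_{2,n}-1)n^2-(n-1)^2}{(n-1)^2}-\frac{2[S_{2,n-1}n-(n-1)]}{S_{1,n-1}(n-1)}+\frac{S_{2,n-1}}{S_{1,n-1}^2}\right]$	$0$
$E_n, H_n$	$\frac{[2S_{1,n-1}n(n-1)+2S_{2,n}n^2(n-2)-(n^2-3)(3n-2)]n^2}{(n-1)^4(n-2)}$	$\frac{\pi^2}{3}-3\approx 0.28987$
$H_n, I_n$	$\frac{2S_{2,n}n^2-3n^2+2n-1}{(S_{1,n-1}-1)^2n^2}+\frac{1}{(S_{1,n-1}-1)^4}\left[\frac{[[S_{2,n-1}(n-2)-4](n-1)+4S_{1,n-1}](n-1)}{n^2(n-2)}-\frac{2(S_{1,n-1}-1)(S_{2,n-1}-1)(n-1)}{n}\right]$	$0$
$B_n, H_n$	$\frac{(4S_{3,n-1}n^2+4S_{2,n}n^2+11n^2-5n-10)(n-1)^2-S_{2,n-1}(4S_{3,n-1}n^2+8S_{2,n}n^2+13n^2-9n-12)n(n-1)+4S_{2,n-1}^2(S_{2,n}n^2+n^2-n-1)n^2}{2(n-1)^4}$	$\frac{\pi^6}{108}-\frac{\pi^4}{18}-\frac{3\pi^2}{4}+\frac{11}{2}+2\zeta(3)-\frac{\pi^2\zeta(3)}{3}\approx 0.03744$
$L_n, T_k$	$k^2(k-1)^2S_{1,n-1}^2\left[\frac{S_{2,n-1}}{S_{1,n-1}^2}-\frac{2}{(k-1)S_{1,n-1}}+1\right]$	$\infty$
$E_n, L_n$	$\frac{2S_{1,n-1}^3n-S_{1,n-1}^2(6n-8)+S_{2,n-1}(n-1)(n-2)}{S_{1,n-1}^4(n-1)(n-2)}$	$0$
$L_n, I_n$	$\frac{2S_{1,n-1}^3n-S_{1,n-1}^2(6n-8)+S_{2,n-1}(n-1)(n-2)}{(S_{1,n-1}-1)^4(n-1)(n-2)}$	$0$
$B_n, L_n$	$\frac{S_{1,n-1}^2[-2S_{2,n-1}^2n^2+S_{2,n-1}(3n-4)n+n^2+3n-4]+4S_{1,n-1}(S_{2,n-1}n-n+1)(S_{2,n-1}n-S_{3,n-1}n-n+1)+2S_{2,n-1}(S_{2,n-1}n-n+1)^2}{2S_{1,n-1}^4n^2}$	$0$
$E_n, T_k$	$\frac{k^2(k-1)^2(n^2+2S_{1,n-1}n-9n+10)}{(n-1)(n-2)}$	$k^2(k-1)^2$
$E_n, I_n$	$\frac{2S_{1,n-1}^3n-S_{1,n-1}^2(6n-8)+S_{2,n-1}(n-1)(n-2)}{(S_{1,n-1}-1)^4(n-1)(n-2)}$	$0$
$B_n, E_n$	$\frac{4S_{1,n-1}n(S_{2,n-1}n-n+1)^2-2S_{2,n-1}^2(n^2+3n-6)n^2+S_{2,n-1}(3n-4)(n+6)n(n-1)+(n^2-10n+8)(n-1)^2}{2n^2(n-1)(n-2)}$	$-\frac{\pi^4}{36}+\frac{\pi^2}{4}+\frac{1}{2}\approx 0.26159$
$I_n, T_k$	$\frac{k^2(k-1)[(k-1)S_{1,n-1}^2(n-1)(n-2)-2S_{1,n-1}(kn^2-4kn+n+2k)+(k-1)S_{2,n-1}(n-1)(n-2)+kn^2+n^2-9kn+3n+10k-6]}{(n-1)(n-2)}$	$\infty$
$B_n, I_n$	$\frac{[S_{2,n-1}(n-1)(n-2)-4n+4S_{1,n-1}+4](S_{2,n-1}n-n+1)^2}{(S_{1,n-1}-1)^4n^2(n-1)(n-2)}+\frac{2[S_{2,n-1}n-(S_{3,n-1}+1)(n-1)](S_{2,n-1}n-n+1)}{(S_{1,n-1}-1)^3n(n-1)}-\frac{2S_{2,n-1}^2n^2-S_{2,n-1}(3n-4)n-(n+4)(n-1)}{2(S_{1,n-1}-1)^2n^2}$	$0$
$B_n, T_k$	$\frac{k}{2}\left[\frac{[k(k-1)^2(3n+2)+4n](n-1)}{n^2}-(k+1)(k^2-3k+4)S_{2,n-1}\right]$	$\frac{1}{12}k[(18-\pi^2)k^3-2(18-\pi^2)k^2+(18-\pi^2)k-4(\pi^2-6)]$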

Expressions involving E or I apply for ; expressions involving B apply for . The value for $(H_n, L_n)$ follows equation 18 of Arbisser et al. The expressions are obtained using equation (4) and Tables 2 and 3.

Fig. 2.

Simulated and theoretical approximations of expectations of ratios of pairs of variables, plotted as functions of sample size n. Expressions for theoretical values are taken from Table 4.

For pairs $(X, Y)$, we simulate the values of $\mathbb{E}[X_n/Y_n]$ and $\mathrm{Var}[X_n/Y_n]$ under the coalescent model using ms (Hudson 2002), performing 100,000 replicate simulations for each tree size $n$ considered. We plot the simulated values alongside the approximate values from Tables 4 and 5 in Figs. 2 and 4.
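
The comparison between simulated and approximate values can be illustrated without ms. The sketch below is an assumption-laden stand-in, not the article's pipeline: it simulates genealogies directly in Python by drawing exponential waiting times and merging a uniformly chosen pair of lineages at each event, tracking which lineages are still leaves so that the external length $E_n$ can be accumulated. The function names, the seed, and the choice of 20,000 replicates (rather than 100,000) are illustrative.

```python
import random

def simulate_tree(n, rng):
    """One genealogy under the standard coalescent; returns (H, L, E, I)."""
    is_leaf = [True] * n      # one flag per active lineage
    now = 0.0                 # time elapsed since the present
    total = 0.0               # running total branch length L
    external = 0.0            # running external branch length E
    for k in range(n, 1, -1):
        wait = rng.expovariate(k * (k - 1) / 2)   # T_k, mean 2/(k(k-1))
        now += wait
        total += k * wait
        i, j = sorted(rng.sample(range(k), 2))    # uniformly chosen pair
        # A leaf lineage that coalesces now has external branch length `now`.
        external += now * (is_leaf[i] + is_leaf[j])
        is_leaf.pop(j)                            # remove the larger index first
        is_leaf.pop(i)
        is_leaf.append(False)                     # the merged lineage is internal
    return now, total, external, total - external

def estimate(n, replicates=20000, seed=7):
    """Monte Carlo means of H_n, E_n, and the ratio E_n/H_n."""
    rng = random.Random(seed)
    sums = {"H": 0.0, "E": 0.0, "E/H": 0.0}
    for _ in range(replicates):
        h, _length, e, _internal = simulate_tree(n, rng)
        sums["H"] += h
        sums["E"] += e
        sums["E/H"] += e / h
    return {name: value / replicates for name, value in sums.items()}

n = 10
means = estimate(n)
s2n = sum(1 / i**2 for i in range(1, n + 1))
# Table 4 closed form for the approximation to E[E_n/H_n]
table4 = n * (2 * s2n * n**2 - 2 * n**2 - n + 1) / (n - 1) ** 3
print("simulated:", means)
print("Table 4 approximation to E[E_n/H_n]:", table4)
```

Averaging the ratio within each replicate, rather than dividing the two means, is what distinguishes $\mathbb{E}[E_n/H_n]$ from $\mathbb{E}[E_n]/\mathbb{E}[H_n]$.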

Fig. 4.

Simulated and theoretical approximations of variances of ratios of pairs of variables, plotted as functions of sample size n. Expressions for theoretical values are taken from Table 5.

Results

Expectations of the ratios

The approximate expected values in Table 4, as approximations of ratios, have the form of rational functions. As $n$ grows, the approximate expectations of $H_n/L_n$, $H_n/I_n$, $E_n/L_n$, $E_n/I_n$, $B_n/L_n$, and $B_n/I_n$ approach 0. This behavior is sensible when considering the properties of the coalescent model: in the numerators, $E_n$ has expectation 2 and $H_n$ and $B_n$ have bounded expectation in the limit as $n \to \infty$; in the denominators, $L_n$ and $I_n$ have expectations that grow without bound (Table 2). Similarly, approximate expectations of ratios $L_n/T_k$ or $I_n/T_k$, with $L_n$ and $I_n$ in the numerator and $T_k$ in the denominator, grow to infinity as $n$ increases. The approximation to $L_n/I_n$ approaches 1 in the limit as $n \to \infty$: as the number of leaves in the tree grows, internal branches occupy an increasingly large fraction of the total branch length.

For pairs of variables that both have finite expectation, the approximate expectations of their associated ratios ($H_n/T_k$, $E_n/H_n$, $B_n/H_n$, $E_n/T_k$, $B_n/E_n$, and $B_n/T_k$) also approach finite values in the limit as $n \to \infty$. It is interesting to observe that although $\lim_{n\to\infty}\mathbb{E}[E_n] = \lim_{n\to\infty}\mathbb{E}[H_n] = 2$ (Table 2), $\lim_{n\to\infty}\tilde{\mathbb{E}}[E_n/H_n] = \pi^2/3 - 2 \approx 1.28987$. In other words, although expectations of the individual variables approach the same value, we expect $E_n/H_n$ to be somewhat larger than 1 on average.

For each of the 10 pairs of variables among $\{H_n, L_n, E_n, I_n, B_n\}$, the approximate expectations from Table 4 are plotted in Fig. 2 together with the simulated values. Although some divergences are present for small $n$, the approximate and simulated values match closely. The approximate ratios involving $T_k$ are shown in Fig. 3 as functions of $k$ for each of three values of $n$. $L_n$ is the fastest-growing variable according to the expression for its expectation (Table 2), and the graph for $L_n/T_k$ is topmost in all three plots. As expectations of $H_n$ and $E_n$ are close (Table 2), the graphs for $H_n/T_k$ and $E_n/T_k$ are close in Fig. 3.
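
The limiting value for $E_n/H_n$ can be traced term by term. The worked computation below is a sketch that substitutes the limits $\mathbb{E}[E_n] \to 2$, $\mathbb{E}[H_n] \to 2$, $\mathrm{Var}[H_n] \to 4\pi^2/3 - 12$ (Table 2), and $\mathrm{Cov}[H_n, E_n] \to 0$ (Table 3) into the second-order approximation $\mathbb{E}[X/Y] \approx \mathbb{E}[X]/\mathbb{E}[Y] - \mathrm{Cov}[X,Y]/\mathbb{E}[Y]^2 + \mathrm{Var}[Y]\,\mathbb{E}[X]/\mathbb{E}[Y]^3$.

```latex
\begin{align*}
\lim_{n\to\infty}\tilde{\mathbb{E}}\!\left[\frac{E_n}{H_n}\right]
  &= \frac{2}{2} - \frac{0}{2^2} + \frac{\left(\frac{4\pi^2}{3} - 12\right)\cdot 2}{2^3} \\
  &= 1 - 0 + \left(\frac{\pi^2}{3} - 3\right)
   = \frac{\pi^2}{3} - 2 \approx 1.28987.
\end{align*}
```

The excess over 1 therefore comes entirely from the term involving the variance of the denominator $H_n$, which does not vanish in the limit.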

Fig. 3.

Theoretical approximations $\tilde{\mathbb{E}}[X_n/T_k]$ for variables $X$ in $\{H, L, E, I, B\}$, plotted as functions of $k$ for $n = 10$, $n = 20$, and $n = 50$. The expressions plotted are taken from Table 4.

Variances of the ratios

The limits of approximations of variances of ratios are presented in Table 5. They behave similarly to the expectations in Table 4. Because $L_n$ and $I_n$ have expectations that grow without bound, for ratios $H_n/L_n$, $H_n/I_n$, $E_n/L_n$, $E_n/I_n$, $B_n/L_n$, and $B_n/I_n$ (with $L_n$ or $I_n$ in the denominator), the limits of the variance approximations are 0. As $n$ grows, the denominators grow much faster than the numerators, and the values of the ratios are therefore increasingly concentrated around 0. Hence, the variances also approach 0. Because $L_n$ and $I_n$ are much larger than the coalescence times $T_k$, approximations to variances of $L_n/T_k$ and $I_n/T_k$ diverge to infinity as $n$ increases. Interestingly, however, the approximate variance of $L_n/I_n$, a ratio of two quantities with diverging expectations, approaches 0. The variance approximations with finite nonzero limits are those for $H_n/T_k$, $E_n/H_n$, $B_n/H_n$, $E_n/T_k$, $B_n/E_n$, and $B_n/T_k$. All give ratios of two variables with finite expectation and variance as $n \to \infty$ (Table 2).

Figure 4 shows the expressions from Table 5 together with the simulated values. Compared to the plots of expectations of ratios (Fig. 2), differences between the simulated and approximate variances are prominent at small $n$. For the variances of , and , the simulated and approximate values differ substantially even as $n$ increases. Because the theoretical value of that contributes to the approximate variance of is itself an approximation, one of the larger differences between simulation and approximation occurs for the plot for .

Figure 5 shows variances of ratios involving $T_k$ for varying $k$, for each of three values of $n$. Qualitatively, the values for approximate variances behave similarly to expectations in Fig. 3: in particular, the vertical placement of the curves follows the same order. Our approximations to the variances of $L_n/T_k$ and $I_n/T_k$ grow fastest, as the numerators are typically large and the expected value of the denominator $T_k$ decreases as $k$ grows. Approximations to variances of $H_n/T_k$, $E_n/T_k$, and $B_n/T_k$ all display much slower growth; for these quantities, the expectations of numerators of the ratios are bounded above by 2 for all $n$.
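
The slow approach of a variance approximation to its limit can be seen numerically. The sketch below evaluates the first-order approximation $\mathrm{Var}[X/Y] \approx (\mathbb{E}[X]/\mathbb{E}[Y])^2\,[\mathrm{Var}[X]/\mathbb{E}[X]^2 - 2\,\mathrm{Cov}[X,Y]/(\mathbb{E}[X]\mathbb{E}[Y]) + \mathrm{Var}[Y]/\mathbb{E}[Y]^2]$ for $E_n/H_n$, using the moments listed in Tables 2 and 3 for $n > 2$, and compares it with the closed form in Table 5 and the limit $\pi^2/3 - 3$. Function names are illustrative assumptions, and floating-point arithmetic is used for speed.

```python
import math

def S(a, m):
    """Partial sum S_{a,m} = sum_{i=1}^{m} 1 / i**a."""
    return sum(1.0 / i**a for i in range(1, m + 1))

def ratio_var_approx(ex, ey, var_x, var_y, cov_xy):
    """First-order Taylor approximation to Var[X/Y]."""
    return (ex / ey) ** 2 * (
        var_x / ex**2 - 2 * cov_xy / (ex * ey) + var_y / ey**2
    )

def var_E_over_H(n):
    """Approximate Var[E_n/H_n] assembled from individual moments (n > 2)."""
    s1 = S(1, n - 1)
    e_e, e_h = 2.0, 2.0 * (n - 1) / n
    var_e = 8.0 * (s1 * n - 2 * (n - 1)) / ((n - 1) * (n - 2))
    var_h = 8.0 * (S(2, n) - 1) - 4.0 * ((n - 1) / n) ** 2
    cov = 4.0 / n
    return ratio_var_approx(e_e, e_h, var_e, var_h, cov)

def table5_E_over_H(n):
    """Closed form from Table 5 for the pair (E_n, H_n)."""
    s1, s2n = S(1, n - 1), S(2, n)
    numerator = (2 * s1 * n * (n - 1) + 2 * s2n * n**2 * (n - 2)
                 - (n**2 - 3) * (3 * n - 2)) * n**2
    return numerator / ((n - 1) ** 4 * (n - 2))

for n in (3, 10, 100, 10**4):
    print(n, var_E_over_H(n), table5_E_over_H(n))
print("limit:", math.pi**2 / 3 - 3)
```

The term involving $S_{1,n-1}$ decays only like $(\log n)/n$, which is one reason convergence to the limit is gradual.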

Fig. 5.

Theoretical approximations $\widetilde{\mathrm{Var}}[X_n/T_k]$ for variables $X$ in $\{H, L, E, I, B\}$, plotted as functions of $k$ for $n = 10$, $n = 20$, and $n = 50$. The expressions plotted are taken from Table 5.

Discussion

In this article, we have computed approximations to expected values and variances of ratios of various branch lengths under the standard coalescent model. We have considered all 15 possible pairs of variables in $\{H_n, L_n, E_n, I_n, B_n, T_k\}$, a set of variables whose properties have been studied in detail individually. We have also assessed the accuracy of approximations to the expectation and variance by comparing them with values computed by simulation. We have observed that the approximate expressions behave in a way that matches mathematical intuition about the behavior of random variables associated with the branch lengths.

In plots of the various approximations, we have illustrated how the random variables relate to each other, both among $\{H_n, L_n, E_n, I_n, B_n\}$ (Figs. 2 and 4) as well as between pairs including one of $\{H_n, L_n, E_n, I_n, B_n\}$ along with $T_k$ (Figs. 3 and 5). As $n$ grows large, the ratios involving $L_n$ and $I_n$ have nearly identical behavior in the plots, an observation that is explained by the fact that internal branches take up increasingly large fractions of the total branch length. In the limit as $n \to \infty$, expectations of both $H_n$ and $E_n$ approach a constant value of 2 (Table 2), and $\mathrm{Cov}[H_n, E_n]$ approaches 0 (Table 3). However, we observed that $\lim_{n\to\infty}\tilde{\mathbb{E}}[E_n/H_n]$ is not equal to $\lim_{n\to\infty}\mathbb{E}[E_n] / \lim_{n\to\infty}\mathbb{E}[H_n] = 1$. For the ratio , the approximation aligns with the naive prediction, , even though is also zero in the limit (Table 2). For B and H, which possess a high correlation, , whereas .

Previously, we evaluated covariances and correlation coefficients under the coalescent model for the pairs of variables that we consider here, obtaining exact covariances and correlations for 13 of 15 pairs and approximations for the other two. We obtained limiting expressions for these covariances and correlations as $n \to \infty$. The approximate values that we have provided here for expectations and variances of ratios make use of these previous results concerning covariances, adding to the understanding of the properties of joint distributions of pairs of genealogical variables in coalescent theory.

Many statistical tests of population-genetic models rely on a model prediction of an equivalence between two quantities, framed as a null hypothesis that a test statistic equals a particular value. The prediction is often formulated as a null hypothesis that a difference between two quantities equals 0 or that their ratio equals a null value such as 1. In coalescent theory, tests that evaluate site-frequency spectra for agreement with predictions of coalescent models tend to use differences or other linear combinations (Zeng et al.; Achaz 2009; Ferretti et al., 2017; Ronen et al.; Fu 2022). However, several modeling studies and inference procedures in coalescent theory do emphasize ratios (Slatkin 1996; Uyenoyama 1997; Schierup and Hein 2000; Rosenberg and Hirsh 2003; Eldon 2011; Arbisser et al.), as do some test statistics (Schlötterer 2002; Lohse and Kelleher 2009). Widely used tests in the area of molecular evolution, such as tests of the relative count of nonsynonymous and synonymous substitutions and the McDonald–Kreitman test of polymorphism and divergence, also make use of ratios (Yang 2014).

The choice of a difference or a ratio in formulating a test statistic can rely on several factors. Ratios are unitless, so that their values do not depend on conventions chosen during computation (e.g. scaling time in units of $N$ or $2N$). Ratios might take values in a prescribed range that can be simply interpreted, such as the range of the coalescent ratio $H_n/L_n$ from $1/n$ to $1/2$ (Arbisser et al.). However, the statistical properties of a difference are generally easier to compute from the properties of its two component random variables than are the properties of the corresponding ratio.

In general, corresponding differences and ratios in coalescent theory have not been formally compared for features such as their power to reject the null hypothesis when processes such as natural selection or population or species divergence affect the shapes of evolutionary trees. Our work to obtain approximate expectations and variances of ratios can augment understanding of scenarios in which coalescent ratios are considered, and it can assist in evaluating the relative utility of difference-based and ratio-based statistics.

We have found that approximations for fixed $n$ and in the limit as $n \to \infty$ are quite accurate in predicting the expected values seen in coalescent simulations of the ratios (Fig. 2). For the variances, the approximations are generally less accurate, although in most cases, graphs of the approximations and simulated values have similar shape (Fig. 4). These approximations are obtained from a Taylor approximation for the variance of a ratio (equation 4), and higher-order approximations of this variance could potentially be applied by use of Taylor’s theorem; as the order of the approximation increases, however, the complexity of the resulting formula also increases. For those variances for which the approximation and simulation are not close in Fig. 4, we advise caution in using the variances in settings in which a precise approximation is needed.

21 in total

1. Consequences of recombination on traditional phylogenetic analysis.

2. On the use of star-shaped genealogies in inference of coalescence times.

3. Statistical tests for detecting positive selection by utilizing high-frequency variants.

4. Measuring the degree of starshape in genealogies--summary statistics and demographic inference.

5. Estimation of parameters in large offspring number models and ratios of coalescence times.

6. Genealogical structure among alleles regulating self-incompatibility in natural populations of flowering plants.

7. Inferring coalescence times from DNA sequence data.

8. TESTING THE CONSTANT-RATE NEUTRAL ALLELE MODEL WITH PROTEIN SEQUENCE DATA.

9. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

10. Learning natural selection from the site frequency spectrum.