Literature DB >> 35430886

Covariance of pairwise differences on a multi-species coalescent tree and implications for F_ST.

Abstract

The multi-species coalescent (MSC) provides a theoretical foundation for modern phylogenetics and comparative population genetics. Its theoretical properties have been heavily studied but there are still aspects of the MSC that are largely unknown, including the covariances in pairwise coalescence times, which are fundamental for understanding the properties of statistics that combine data from multiple species, such as the fixation index (FST). The major contribution of this study is the derivation and implementation of exact expressions for the covariances of pairwise coalescence times under phylogenetic models with piecewise constant changes in population size, assuming no gene flow after species divergence. We use these expressions to derive the variance in average pairwise differences within and between populations. We then derive approximations for the expectation and bias of a sequence-based estimator of FST, a commonly used genetic measurement of population differentiation, when it is applied to a non-recombining region of the genome. We show that the estimator of FST is generally biased downward. A freely available software package is provided, STCov, to calculate the mean, variances and covariances in coalescence times presented here under user-defined piecewise-constant species trees. This article is part of the theme issue 'Celebrating 50 years since Lewontin's apportionment of human diversity'.

Entities: Chemical

Keywords: FST; covariance; multi-species coalescent; pairwise differences; population differentiation

Mesh：

Year: 2022 PMID： 35430886 PMCID： PMC9014196 DOI： 10.1098/rstb.2020.0415

Source DB: PubMed Journal: Philos Trans R Soc Lond B Biol Sci ISSN： 0962-8436 Impact factor: 6.671

Introduction

The multi-species coalescent (MSC) is a generalization of Kingman’s coalescent [1] that describes the joint coalescence process in multiple species, or populations, as they diverge from each other. The MSC provides a theoretical foundation for phylogenetic analyses as it fully describes and characterizes the process of incomplete lineage sorting [2-5]. It is, therefore, central in the unification of the fields of population genetics and phylogenetics. It is also central for understanding divergence between populations and allows the theoretical prediction of the amount of variance within and between populations. In this sense, it provides a theoretical framework for relating apportionment of genetic variance within and between populations, as proposed by Lewontin [6], to specific models of population divergence. One of the important utilities of theoretical models, such as the MSC, is to provide predictions regarding observed statistics, eventually leading to the development of estimators of population-level parameters. In this regard, an important use of the MSC has been to understand the properties of pairwise nucleotide differences within and between species, which is one of the most commonly used statistics to analyse population genetic data. Takahata & Nei [7] derived expressions for the variance in average pairwise nucleotide differences and Nei and Li’s ‘net number of differences’ [8], (d). They assumed a Kingman’s coalescent model [1] of two diverging populations, and an infinite sites model of mutation [9,10]. These classical results provided insights into when the net number of differences can be used as a reliable estimator for species divergence, and the appropriate sampling schemes to reduce the variance. It is also one of the first uses of the MSC. Takahata & Nei [7] defined d and d to be the mean number of nucleotide differences between two (haploid) individuals sampled from within population X or Y, respectively. Similarly, d is the average number of nucleotide differences between two individuals randomly sampled from populations X and Y. The statistics d, d and d are then calculated based on sample sizes of n and n from populations X and Y, respectively, as follows: where k is the number of pairwise nucleotide differences between individuals (haplotype genomic sequences) i and i′. Henceforth, in this study, an ‘individual’ is a non-recombining haploid genomic sequence. To measure the net number of nucleotide differences between two populations, Nei & Li’s [8] d is defined as The relationship between differences within and between populations gives an indication of the degree of population subdivision. d specifically measures the excess number of substitutions between populations, which quantifies the extent of divergence. These measures of species divergence form the basis for many evolutionary analyses and are among the most basic and commonly used inferential tools in modern population genetics. The pairwise differences d, d and d provide measures of genetic variability within and between species/populations that are applicable to DNA sequencing data and have been fundamental in analyses of such data since the 1980s. However, since their invention, the question quickly arose of how they relate to older measures of genetic divergence and variability originally derived for independent loci such as allozymes, in particular, how are they related to Wright’s FST? Furthermore, how should FST appropriately be calculated for DNA sequencing data? These questions were answered by Slatkin [11], who argued that FST is equivalent to a ratio of average coalescence times of different pairs of genes. Assuming an infinite sites model, he then showed that Wright’s FST in the context of DNA sequencing data could be expressed in terms of d, d, and d (see equation (7.2) below). The statistics d, d and d have been, and continue to be, a cornerstone of the analysis of DNA sequence data. Understanding their mean, variances and covariances under arbitrary genetic and species tree models is essential for their biological interpretability, and considerable previous work has been devoted to understanding their properties. Tajima [12] and Takahata & Nei [7] studied the variance of average pairwise differences in a panmictic population and in a split model with constant population size. In a series of papers, Wakeley studied the variance in pairwise differences in a general model of population sub-division [13] and the average pairwise differences in a model with migration [14], and later demonstrated the impact of recombination on the numerical stability of such estimates [15]. Tang et al. [16] derived an estimator for the time to most recent common ancestor (TMRCA) of a sample of DNA sequences along with quantification of sampling error by leveraging pairwise differences, free of population structure assumptions. The multi-species coalescent has received renewed attention in the age of genomics because of its applicability in phylogenetic analyses using multiple loci. Efromovich & Kubatko [17] presented a method to calculate the distribution of coalescent times at the root of a species tree with an arbitrary number of populations. In a pair of papers, Wilkinson-Herbots provided unified analytic results for both the distribution of coalescence times and pairwise differences under models of isolation with migration [18,19] under assumptions of constant population size. Heled [20] helped to further marry previously pairwise difference quantification and the multispecies coalescent by deriving closed-form exact results for the ‘average sequence dissimilarity’ between pairs of sequences drawn at random under a simple two-species coalescent process with constant population size. Many methods have also been developed to use pairwise differences under the MSC while leveraging large genomics datasets to infer species tree topologies and divergence times (e.g. [21-23]). Takahata & Nei’s [7] original results on d, d and d relied on the assumption of constant and equal population sizes among populations and through time. Using the MSC, we here extend these results to arbitrary piecewise constant population size histories along a phylogeny. To do so, we derive and present general equations for calculating the covariance of pairwise coalescence times, for any two, three or four haploid individuals, arbitrarily chosen within the phylogeny. We also derive expressions for the expected shared branch length between sets of lineages. We provide a software package, STCov, for calculating these theoretical MSC quantities. We then use these results to demonstrate the effects of various demographic, mutational and sampling size changes on the distribution of d, and extend the discussion to specifically investigate the statistical properties of Slatkin’s FST estimator [11], and some of its various applications [24-26], as it is the most commonly used measure of FST using sequence data. We investigate the effects of bottlenecks, sampling variance and demographic changes on various FST-based measurements, and present the magnitude of downward bias when using FST estimated from a ‘ratio of averages’ approach to Slatkin’s estimator, as is typical in single gene analyses.

Mean, variance and covariance of average pairwise differences

We first review previous results for the mean, variance and covariance of average pairwise nucleotide differences for individuals sampled from two populations, X and Y, as functions of the individual pairwise difference terms (k, k · · ·). Suppose i, i′, i″, i″′ are individuals from population X, and j, j′, j″, j″′ are individuals from population Y. By definition we have, and likewise for population Y. Suppose i, j are individuals from X, Y, respectively, then, Following the derivations in Tajima [12], Takahata & Nei [7] and Wakeley [14], under an infinite-site model of mutation, the variance and covariance of d, d, d and d can be written as follows: Further, formulae for the covariance of average pairwise difference terms can also be reduced to functions of individual pairwise terms This simple result is due to the fact that the covariance of sums can be decomposed into the sums of covariances. As presented in Takahata & Nei (equations 18a–d) [7], covariance equations involving the cross population can be expressed as follows: and These expressions are all functions of the individual pairwise differences, e.g. k. In what proceeds we demonstrate that these expressions can be further generalized as functions of pairwise coalescence times, e.g. t.

Pairwise mutational differences

In this section, we generalize previous work [7,12] by deriving expressions for the covariance of pairwise differences under arbitrary piecewise-constant demographic settings using the MSC. Throughout this section, we will assume an infinite sites model [9,10], with no recombination. We first review results on the mean and variance from previous work (e.g. [7,12,14]), and then extend results to the covariance.

Mean and variance

Note, given a coalescence time t between two individuals, i and j, the expected number of nucleotide differences between the pair is equal to 2μt, for i.e. Under the assumption that the number of mutations conditional on a genealogy is Poisson, the conditional expectation and variance of pairwise differences are equal. By applying the law of total variance, we can decompose the unconditional variance of pairwise differences as We can obtain the second moment of the distribution of pairwise nucleotide differences, , from the definition of variance,

Covariance

Let i, i′, j, j′ be four individuals sampled from arbitrary populations. Let T be a local coalescent tree relating the four individuals restricted to a non-recombining region. Here, we show that Consequently, we further derive the unconditional quantity where denotes the amount of branch length on T shared between the branch connecting pair i, i′ and the branch connecting pair j, j′. Figure 1 provides an illustrative example of this quantity, and electronic supplementary material, §F, provides a more technical treatment.

Figure 1

(a–c) Explanation of expected shared branch length for four unique individuals. Bolded blue lines indicate the branch length between individuals a and b. Bolded red lines indicate branch length between c and d. Overlapping blue and red lines (along with α terms) indicate shared branch length. The four tree topologies are representative of the possible gene tree orderings, but it should be noted that these representative trees assume a and b are exchangeable, as well as c and d. The expected shared branch length is a weighted sum of the shared branch lengths across all possible topology orderings. (Online version in colour.) To prove these results, we start by revisiting the idea that under the infinite-site model, the mutational process given a branch length is Poisson. Given local tree, T, with coalescence times t and t from T, conditional pairwise differences follow a Poisson distribution, written as where 2t is the amount of total branch length locally between the two individuals. A key feature of the Poisson distribution is that the sum of Poisson random variables is also Poisson. To exploit this, let denote the amount of branch length on T shared by pairs i, i′ and j, j′ (figure 1). The branch length between i, i′ not shared with pair j, j′ is denoted by , with similar notation for pair j, j′ by swapping labels. We can decompose the branch lengths into the shared and non-shared segments as Notice that , and are therefore independent Poisson random variables. Similarly, and , where , and are independent of each other conditionally on T. We can expand Cov(k, k|T), (equation 3.5) as follows: The overall result is that the covariance of pairwise differences given the coalescent tree T is equal to the mutation rate times the shared branch length. To get the unconditional quantity, Cov(k, k) (equation 3.6), we apply the law of total covariance: The case when for only three unique individuals (k, k) has the same form, by replacing j′ with i in the equations above. Takahata & Nei [7] have previously derived formulas for the covariance under constant population size; see electronic supplementary material, §C, which presents a visualization of their results as a comparison to the generalized results presented here.

Mean, variance and covariance in pairwise coalescence times

We assume species evolution follows a bifurcating species tree , with no migration (see figure 2a). Each branch, i, of is parameterized by constant diploid population size η, start time τ, and end time τ, where p(i) is the parent branch of i. Let μ be the mutation rate (constant across the genome/species) per sequence per generation. Time is measured in units of generations in the past. We implicitly assume that all coalescent calculations here are conditioned on a fixed species tree , although the tree is not always indicated in the notation for the sake of simplicity and compactness.

Figure 2

Species tree notation. (a) Example of notation used for a four-species tree with topology, divergence times and constant population sizes within each population which can vary between species. (b) Example of a marginal species tree, the result of subsetting a larger species tree. As a consequence, the population size histories are no longer constant within each species, but instead are piecewise constant.

Mean and variance in coalescence times

Let t be the coalescence time of two individuals, i and j, sampled from species X and Y, respectively, in a non-recombining region of the genome. For species tree , denote the marginal tree of two species (see figure 2b). Here, represents the set of divergence times of species ancestral to both X and Y, indexed by (τ1, τ2, …), where τ1 : = τ, the divergence time for species X and Y. Similarly, represents the corresponding population sizes. Suppose there are V ≥ 1 intervals in . Under this marginal tree, we can analytically calculate the first two moments of the distribution of t as and P22(τ1, τ) represents the probability that lineages i and j fail to coalesce in the time interval (τ1, τ), (two lineages in, two lineages out). Formally, this is the probability that two lineages which exist in the same population at time interval τ1 have not coalesced by time τ (backwards in time) Note that the mean and variance of pairwise coalescence times under the standard piecewise constant coalescent process are just simply weighted sums over coalescence intervals.

Covariance in pairwise coalescence times

The challenge in calculating the covariance terms from a species tree, , comes from the combinatorial problem of integrating over all of the possible times and orderings of the coalescent events along the multi-species tree. The general formula for covariance in this case is given by where the last term is simply a product of independent expectations. The first term on the right-hand side of the equation is what we will focus on; in particular, we write D is the species divergence time between individuals i, i′ from , where D = 0 if i, i′ are of the same species (similarly for D). We assume all coalescence events must be at least as ancient as the species divergence time (e.g. t ≥ D), i.e. we assume no introgression, migration or admixture, etc. To evaluate this quantity, , we consider six separate conditional cases. For a bifurcating tree of four individuals, there are three unique coalescence events. The six cases correspond to the possible orderings of coalescence events for this local tree of four individuals, given that we structure the joint likelihood as : Here, ‘first event’ implies most recent, and ‘third’ implies most ancient. These events are further illustrated in detail in figure 3. Conditioning on each of these six events, and evaluating each expectation separately, the expression for the joint expectation becomes In the presence of no population isolation (all individuals from the same species), but piecewise constant population size history, the set of recursions and integrals is presented in its entirety in the electronic supplementary material, §G. This calculation is useful in the instance that all four lineages survive to a common population without having coalesced with one another, which occurs with some probability in each case.

Figure 3

Ordered topologies to consider when calculating t|t. Given four individuals, i, i′, j, j′, the six cases presented outline the necessary labelled/ordered local trees essential for the conditional calculation of P(t|t). The cases can be grouped into three general scenarios based on the timing of t in relation to the conditional t. All 18 possible ordered tree topologies are considered. (Online version in colour.)

C1. t is the first coalescent event. C2. t is the second event, t is the third. C3. t = t as the third coalescent event. C4. t is the second event, t is the third. C5. t is the first event, t is the second. C6. t is the first event, t is the third. Ordered topologies to consider when calculating t|t. Given four individuals, i, i′, j, j′, the six cases presented outline the necessary labelled/ordered local trees essential for the conditional calculation of P(t|t). The cases can be grouped into three general scenarios based on the timing of t in relation to the conditional t. All 18 possible ordered tree topologies are considered. (Online version in colour.) Introducing a species tree structure on top of the six cases multiplies the number of cases to consider. There are five general possible species tree configurations that can arise (see electronic supplementary material, figure S13). We have derived exact equations and recursions to evaluate all six cases (C1, …, C6) across the five general possible tree configurations, and have implemented them in C++ code (STCov) which is freely available to use (more information in the code availability section). From this implementation, we are able to calculate exact theoretical quantities for these statistics under any piecewise constant scenario.

Accuracy of coalescent calculations

To demonstrate the accuracy of the coalescent equations above, as implemented in our software STCov, we compare the theoretical results (assuming infinite-sites) against empirical estimates from gene trees under a finite-sites model using ms [27]. We first test two simple demographic scenarios for a tree of two species, X and Y: η = η, and η = 2η (figures 4 and 5), where η represents scaled effective population size. We assume η = η in both scenarios. Let lineages i1, i2, i3 originate in population X, and lineages j1, j2, j3 originate in Y. We generate 1500 independent gene trees from ms for each demographic scenario (with specified population sizes and single divergence time which we vary from 0–20 in units of 2η generations), and calculate sample mean, variance and covariance terms. The figures demonstrate that the theoretical calculations from STCov match simulations (dots) well, while variation in the empirical estimates can be attributed to a finite sample size.

Figure 4

Figure 5

Assessing the accuracy of theoretical pairwise coalescent time calculations against simulated values, for population sizes: η = 2η. Theoretical results from STCov are plotted as black curves, with dots representing empirical estimates of the quantity on the y-axis using 4500 independently simulated local trees. (Online version in colour.)

Assessing the accuracy of theoretical pairwise coalescent time calculations against simulated values, for population sizes: η = η. Theoretical results from STCov are plotted as black curves, with dots representing empirical estimates of the quantity on the y-axis using 4500 independently simulated local trees. Assessing the accuracy of theoretical pairwise coalescent time calculations against simulated values, for population sizes: η = 2η. Theoretical results from STCov are plotted as black curves, with dots representing empirical estimates of the quantity on the y-axis using 4500 independently simulated local trees. (Online version in colour.)

Accuracy of pairwise difference calculations

In this section, we evaluate the accuracy of our results under varying mutation rates, divergence times and population sizes. We compare our results to simulated datasets. We compare three population size change models, denoted by η = 1η, η = 2η and η = 10η, along with three mutation rates 2μη = 10, 1, 0.1, for a total of nine simulation scenarios. We present one of those scenarios here (figure 6), and leave the full set of results to the electronic supplementary material, figures S2–S10. While allowing for variance in the empirical estimates from sample size, coalescent and mutational variation, there is strong agreement between the theoretical and simulated results. Note that the theoretical quantities assume an infinite-sites model of mutation, whereas our simulations are performed assuming a realistic, finite-sites model (1500 independent genes of 10 000 bp each; see electronic supplementary material for full simulation details). We choose to compare this finite-sites model over simulations using a model of infinite sites to demonstrate the applicability of the results to the types of data that will be used in practice, and to demonstrate when there are limitations. We leave a demonstration of the accuracy of our variance/covariance calculations in relation to the previous results derived for constant population size in Takahata & Nei [7] to the electronic supplementary material, §C.

Figure 6

Assessing the accuracy of average pairwise difference results, 2μη = 1, η = 10η. We compare our theoretical results based on coalescence theory using equations presented here (black line) with empirical estimates using 1500 independently simulated gene sequences (red dots), n = n = 10 sampled individuals. (Online version in colour.)

Accuracy in estimating FST

A direct extension of our discussion on the mean and variance of average pairwise nucleotide differences is to the measurement FST for a given species tree, mutation rate and sample size. Slatkin (1991, equation 8) [11] presented a coalescent-based definition of FST as a function of the difference in expected time to coalescence for a collection of subpopulations. Specializing to two sub populations of interest, X and Y, Slatkin’s FST can be expressed as where i, i′ are from population X, and j, j′ are individuals sampled from population Y. This definition of FST relies on a ratio of estimates of average coalescence times, where average pairwise differences in DNA sequence data are used as the proxy to estimate the unknown coalescence times. Discussed in Slatkin and Hudson et al. [11,26], for two populations X and Y, FST can be estimated from a non-recombining portion of the genome using For the sake of this paper, we differentiate FST and as the exact measurement from unobservable coalescence times and the estimate from pairwise differences across multiple sequences, respectively. As we have shown above, the expectation, variance and covariance of these sample average pairwise differences contained in equation (7.2) can be derived using coalescent theory, for a given mutation parameter μ and sample sizes. We can use these to study the accuracy of the estimator to Slatkin’s FST under an arbitrary species tree, . To begin, it is important to note that the mean of a ratio is not the ratio of means, specifically it is the case that This implies that the estimator is potentially a biased estimator of FST, such that . To study this bias, we need an expression for the mean of . In general, there is no closed form for the mean of a ratio of dependent random variables, so we will first simplify our terms, and then approximate the mean and variance using a Taylor expansion. We can first simplify the expressions for and We are now interested in the mean and variance of the ratio (d + d)/d. As generally discussed in Stuart & Kendall [28], we can use a second-order Taylor expansion of f(A, B) = A/B around the mean values to get an approximation to the mean, and a first-order expansion around the means to get an approximation of the variance of the ratio term. We can approximate the mean as By rearranging terms, observe that is a function of FST, along with other mean, variance and covariance terms Using this, we can get an expression for the bias of Similarly, we can get a first-order approximation for the variance of : Figure 7 shows the accuracy of the two Taylor approximations under a constant population size model for mutation rate 2μη = 1. The approximation for the mean is a good one, however the first-order approximation to the variance is insufficient for low divergence times, as it can be seen there are higher-order terms involved. From this, we decide that we cannot approximate the variance in well with this method, and do not pursue this aspect further. Electronic supplementary material, figures S11 and S12, demonstrate the accuracy of the Taylor approximations under alternate mutation rates, and it can be seen that the approximation to breaks down under a 10× reduction in the mutation rate (2μη = 0.1) due to the high variance in estimating variance/covariance terms of the d statistics.

Figure 7

Accuracy of approximations to the mean and variance of , 2μη = 1, η = η. A comparison of the approximations in equations (7.7) and (7.9) (black curves) to values estimated from empirical simulations (red dots). (a) The approximated value to as a function of divergence time, D, for equal sample sizes n, n accurately approximates simulated estimates. (b) The first-order approximation for the variance as a function of D is a poor approximation for more recent divergence time models. (Online version in colour.)

Results for the mean and bias of

In this section, we study the effects of varying demographic and genetic parameters on the expectation of and consequently its bias as an estimator of FST. First, we start with a discussion on the differences between and FST, both as described above. Supposing we knew the true values, we calculate FST using only the individual expectations of d, d and d. We can write Immediately we can note that FST is not dependent on sample sizes n, n or the mutation rate, μ. Instead, it is solely a function of mean coalescence times, and is only variable in the demographic parameter space. Also, note the fundamental difference between and FST is the term It is known that ratio estimators are in general biased [29]. Jensen’s inequality [30] tells us, for a convex function f(t), that Letting f(t) = (d + d)/d and observing that d ≥ 1/2(d + d), the inequality implies Thus we expect to be a negatively biased estimate of FST. As the divergence time between X and Y becomes deeper (more ancient), we expect d + d to become increasingly independent from d and to become increasingly closer to FST. Also, letting the number of mutations increase in an infinite-sites model, the estimates of d, d and d become closer to their expectations, bringing equation (7.13) closer to equality. Figure 8 demonstrates the relationship between and FST under varying divergence times D, population sizes and mutation rates μ. As discussed above, the relative bias of is much less under a deep divergence model (D = 20.0, in units of 2η generations) as d, d and d are more independent, compared to a more shallow divergence (D = 1.0), where we see in our example FST is three times as large as . It is clear that is a faithful estimator of FST under very high mutation rates, however, it is biased downward for small values of μ, although the bias is reduced for deep divergence models. When estimating FST from multiple genes across the genome, one approach used to reduce the estimation bias is to estimate each term in equation (7.10) individually and apply a ‘ratio of averages’ approach [31], as further highlighted in the discussion.

Figure 8

FST approximation bias using across divergence times. Under varying population size scenarios (rows), we study the difference between theoretical FST and the expected estimate calculated from pairwise differences, , to highlight the potential biases in doing so. (a,c,e) On the y-axis are values and FST as functions of divergence time D. We plot the true value of FST in black, and approximations using equation (7.7) under three mutation rates. (b,d,e) The difference between the true FST (black line in adjacent plot) and the expected sample quantity, to represent the bias in estimation. We simulated assuming equal sample sizes n = n = 10. In all figures, dots represent simulated estimates from 1500 independent genes. (Online version in colour.)

Effect of bottleneck timing on FST

Population bottlenecks can drastically affect the genetic diversity of populations over evolutionarily short periods of time. In the context of FST, the question of when a bottleneck occurred in a history of evolution is key in understanding its impact on population differentiation. In this section, we use the flexibility of STCov to explore the effect of a population bottleneck placed at various times in the history of two theoretical species, X and Y, on FST. Here, we model a population bottleneck as a 10 × reduction in the population size η0 for a fixed length of time (1.0 in units of 2η0 generations). We study four scenarios as described in figure 9. For varying divergence times D, we use STCov to calculate FST under each scenario, and use empirical simulations via ms and SeqGen to validate our results. We find that a recent bottleneck has the largest impact on FST at every divergence time tested (figure 10), demonstrating an increased level of differentiation as compared to the scenario with no bottleneck. Both scenarios of deeper bottlenecks have much less effect on overall FST despite their bottlenecks being identical in size and length. This illustrates that the timing of variation-reducing events such as a bottleneck plays a large role in the impact to measured genetic differentiation using FST, where the impact can be effectively lost given sufficient time post-bottleneck.

Figure 9

Figure 10

The effect of different population bottlenecks on FST. Four different bottleneck scenarios were considered in the genetic history of two species X and Y, as described in figure 9. Curves represent theoretical results from STCov, open circles are empirical estimates from 1500 independently simulated sequences under 2μη = 1.0 mutation rate. Note that as bottleneck lengths in X were fixed to be min (1.0, D), for D ≤ 1, the population histories of recent and deep bottlenecks in X are identical. The vertical dashed line at D = 1 indicates this boundary. (Online version in colour.)

Bottlenecks considered in a tree of two species. (a) Constant population size tree with no bottleneck. (b) A bottleneck occurring in the recent history of species X. (c) A bottleneck which occurred directly after speciation in population X. (d) A bottleneck which occurred in the population ancestral to both X and Y. In all trees, the non-bottleneck population sizes are a fixed constant. (Online version in colour.) The effect of different population bottlenecks on FST. Four different bottleneck scenarios were considered in the genetic history of two species X and Y, as described in figure 9. Curves represent theoretical results from STCov, open circles are empirical estimates from 1500 independently simulated sequences under 2μη = 1.0 mutation rate. Note that as bottleneck lengths in X were fixed to be min (1.0, D), for D ≤ 1, the population histories of recent and deep bottlenecks in X are identical. The vertical dashed line at D = 1 indicates this boundary. (Online version in colour.)

Bias in the FST estimator for gene flow

The value of FST is often used to estimate levels of gene flow between populations. Wright [32] first derived the relationship between FST to estimate Nm in an Island model, where N is the number of individuals in each deme (sub-population), and m is the fraction of migrants into the deme in each generation. Hudson et al. [26] used this relationship to estimate Nm using the following expression: where FST is an estimate from sequence data, i.e. in our notation. The results of the simulations presented there show estimates using 〈Nm〉 are upward-biased using an estimate of FST from sequence data in place of the unknown FST based on coalescence times. There are two potential sources of this bias, the estimator function, 〈Nm〉, and the estimate, . The scope of this study concerns the role of estimator , and we can investigate the effect of this estimator compared to using the true value, FST. We note that we do not intend to estimate or study gene flow in this manuscript, but simply evaluate the accuracy of the function 〈Nm〉 when an estimate of FST is used. To start, we can once again use a Taylor expansion to get an approximation for the expected value of 〈Nm〉, when using We can use this expression to study the difference between using the estimator, , and the (unknown) true value, FST, in the expression for 〈Nm〉. Figure 11 shows the difference between using FST and in 〈Nm〉 under different mutation rates, population sizes and species divergence times. From the figure, we see that the expectations are, in fact, overestimates. In this figure, 10 individuals are sampled from each population. When the divergence time D is low, the bias relative to the true value is substantial, resulting in an estimate twice as large as that which would have been obtained using an accurate estimate of FST. For high mutation rates, this bias decreases rapidly as D increases. For a low mutation rate, 2μη = 0.01, a bias of greater than 50% overestimation persists. Even at high mutation rates, an upwards bias of about approximately 5% exists even at large divergence time values. Note, however, that we do not see a large difference in the bias across different population size models. The results here can explain (at least a portion of) the bias seen in Hudson et al. [26], that using an estimate of FST can result in an artificial increase in the function 〈Nm〉.

Figure 11

〈Nm〉 approximation bias across divergence times and mutation rates. Under varying population size scenarios (rows), we demonstrate the difference between theoretical 〈Nm〉 and the expected estimate when calculating from pairwise differences using equation 7.15. (a,c,e) On the y-axis are values 〈Nm〉 as functions of divergence time D. We plot the value when using the true FST, and approximations , for mutation rates 2μη = 10.0,1.0 and 0.1. (b,d,f) The per cent difference between 〈Nm〉 using FST (black line in a,c,e) and the expected sample quantity to represent the bias in estimation. We simulated assuming equal sample sizes n = n = 10, and population size structure as indicated at the top of each plot. For a fixed sample size, the expected sample quantity tends to overestimate the ‘true’ value, with the amount of overestimation a function of μ and D.

Accuracy of log transform for linearizing FST

Under a neutral divergence model, FST has also commonly been transformed as a linear approximation to the population divergence time, D. Discussed in Cavalli–Sforza [25], and later Nielsen et al. [24], is that given an estimate of FST, D can be estimated by the transformation Another commonly used transformation, presented in Slatkin [33], relates the time of divergence to a ratio of FST values Here, we evaluate the accuracy of these transformations by approximating the expected value of each using similar Taylor expansions, as earlier. Without having an accurate approximation of , we can only make a first-order approximation of equation (7.16) such that For equation (7.17), by plugging in the estimator for FST from equation (7.2), we find Taking the expectation of this quantity By deriving a similar second-order Taylor approximation for the expectation on the right-hand side, as we did earlier with , we get and we have a second-order Taylor approximation of the expectation of equation (7.17). In figure 12, we evaluate the linearity between these expressions and divergence time, and the accuracy of our approximations against simulated data (line versus dots), under two different population size models. It is clear that Slatkin’s [33] linear FST is a linear predictor of divergence time under the constant population size model assumed in its derivation. However, under a model where the population size of species Y is 10 times higher than X, the linearity expectedly disappears. The log transformation of Nielsen et al. and Cavalli-Sforza [24,25] performs worse and can only be used as a local-linear approximation. Across large values of D, it demonstrates clear nonlinear behaviour and Slatkin’s [33] transformation is preferable under the conditions investigated here.

Figure 12

Linearized FST estimates. Testing the linearity of two FST transformations plotted against species divergence time, D. On the left (a,c) is the approximate mean log transformed value. On the right (b,d) is the approximated mean fraction transformed value. Both use as a proxy for the unknown FST. Plotted on the x-axis of all is the simulated divergence time. The red circles correspond to empirical values of and to verify the accuracy of the approximation (line in black). (a,b) correspond to the approximations under a constant population size model. (c,d) correspond to the η = 10η imbalanced population size model. (Online version in colour.)

Discussion

In this study, we have derived the equations and recursions needed to calculate exact values for the covariance between pairs of coalescence times in a species tree model, allowing for piecewise constant changes in population sizes throughout the tree. Using these expressions, we are able to build on previous theory to get exact values for the mean, variance and covariance of the average number of pairwise differences for a given mutation rate and sample size. We have demonstrated that in the constant population size scenario, we can exactly recreate the covariance results of Takahata & Nei [7]. The equations and recursions derived here are implemented in a freely available software package, STCov, which allows for exact calculations under any piecewise constant model of divergence for arbitrary numbers of species/populations. While the covariance results presented here are interesting on their own, we imagine there are many further applications of the summary statistics presented here. One such application we explored is the properties of Slatkin’s FST and its approximation using sequence data, , under a divergence model. Under the infinite-sites model with no recombination, we demonstrate the known negative bias in estimating FST using sequence data and the ‘average of ratios’ approach. We show that the magnitude of the bias is a function of both mutation rate and population divergence time, with the amount of bias decreasing as both mutation rates and divergence times increase. The bias, however, is non-vanishing for low mutation rates, even as simulated divergence time increases, and is further exaggerated for imbalanced population sizes. As such, the results of the transformation for FST used for gene-flow estimation can be biased upwards when using empirical estimates, which reaffirms discussion in Hudson et al. [26] and provides further insight to the source of the bias. We therefore advocate that when looking at FST in a gene-by-gene fashion, such as when performing local FST scans, to consider that empirical estimates of Slatkin’s FST are generally accurate for high values of mutation and deep divergence, but warn against its over-interpretation in low mutation or recent divergence scenarios, where the FST estimate can be uninformative. We recommend using equation (7.8) to estimate the expected level of bias upon application. Throughout the theoretical equations presented here, we assumed an infinite-sites model of mutation with no recombination between sites. However, allowing for recombination between sites provides more stable estimates of the expectations of pairwise differences. As discussed in Wakeley [15], allowing for an increasing amount of recombination between loci decreases the error in estimates of expectations of d, d and d. At the limit of infinitely free recombination between loci, estimates of equation (7.13) tend towards equality and thus the estimator would converge to the value of FST, mitigating the negative bias seen here. Therefore, aligning with conclusions drawn in Bhatia et al. [31], in the age of whole-genome estimates of FST, taking a ‘ratio of averages’ across independent loci rather than the ‘average of ratios’ approach to FST can sidestep the bias we have presented when estimating FST from loci across an entire genome; the former also having the advantage of being a more numerically stable estimator. Independent of bias, our equations demonstrate that the timing of a bottleneck can drastically impact measured levels of FST. Specifically, that the impact of population variation can vanish given enough time. Finally, we study the accuracy of a couple of commonly used linear transformations of FST as approximate measures of population divergence times, and find, for equal population sizes, the estimator proposed in Slatkin [33] has the best performance, but when population sizes are no longer equal, expectedly, even this transformation shows deviations from linearity. There are many interesting properties to study with the covariance of pairwise coalescent times and pairwise differences. We hope that the software provided, STCov, will allow for further investigation into the properties and usefulness of these quantities for estimating various aspects of species trees, such as topology reconstruction, divergence time and population size estimation, gene flow and admixture detection.

Software availability

Along with this manuscript, we provide software (implemented in C++) freely available for download which calculates the various coalescent quantities presented here (means, variances, covariances and shared branch length). We have designed the code to be very flexible to user inputted species trees. The program outputs exact quantities for any user-defined rooted, bifurcating, piecewise-constant population size species tree. Download the code at https://github.com/gaguerra/STCov.

26 in total

1. The probability of topological concordance of gene trees and species trees.

Authors: Noah A Rosenberg
Journal: Theor Popul Biol Date: 2002-03 Impact factor: 1.570

2. Estimation of levels of gene flow from DNA sequence data.

Authors: R R Hudson; M Slatkin; W P Maddison
Journal: Genetics Date: 1992-10 Impact factor: 4.562

3. Estimating species phylogenies using coalescence times among sequences.

Authors: Liang Liu; Lili Yu; Dennis K Pearl; Scott V Edwards
Journal: Syst Biol Date: 2009-07-16 Impact factor: 15.683

4. The genetical structure of populations.

Authors: S WRIGHT
Journal: Ann Eugen Date: 1951-03

5. Discordance of species trees with their most likely gene trees: the case of five taxa.

Authors: Noah A Rosenberg; Randa Tao
Journal: Syst Biol Date: 2008-02 Impact factor: 15.683

6. The distribution of the coalescence time and the number of pairwise nucleotide differences in the "isolation with migration" model.

Authors: Hilde M Wilkinson-Herbots
Journal: Theor Popul Biol Date: 2007-11-23 Impact factor: 1.570