Literature DB >> 26300910

A new method for estimating the demographic history from DNA sequences: an importance sampling approach.

Sadoune Ait Kaci Azzou¹, Fabrice Larribe¹, Sorana Froda¹.

Abstract

The effective population size over time (demographic history) can be retraced from a sample of contemporary DNA sequences. In this paper, we propose a novel methodology based on importance sampling (IS) for exploring such demographic histories. Our starting point is the generalized skyline plot with the main difference being that our procedure, skywis plot, uses a large number of genealogies. The information provided by these genealogies is combined according to the IS weights. Thus, we compute a weighted average of the effective population sizes on specific time intervals (epochs), where the genealogies that agree more with the data are given more weight. We illustrate by a simulation study that the skywis plot correctly reconstructs the recent demographic history under the scenarios most commonly considered in the literature. In particular, our method can capture a change point in the effective population size, and its overall performance is comparable with the one of the bayesian skyline plot. We also introduce the case of serially sampled sequences and illustrate that it is possible to improve the performance of the skywis plot in the case of an exponential expansion of the effective population size.

Entities: Chemical Gene Species

Keywords: coalescent process; effective population size; importance sampling; serially sampled sequences; skywis plot

Year: 2015 PMID： 26300910 PMCID： PMC4528260 DOI： 10.3389/fgene.2015.00259

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

1. Introduction

The demographic history of a population leaves its signature in the genome, which means that DNA sequences contain information about the demographic history of the population from which they are sampled. Therefore, it is possible to use genetic data to infer demographic parameters, an issue with important implications in many fields such as public health, epidemiology and conservation biology (Minin et al., 2008). The first methods for estimating the demographic history from gene sequences were parametric and used coalescent theory. Such methods require a simple demographic model in order to describe the changes in the population size over time in terms of one or more parameters. They are based on importance sampling, e.g., (Slatkin and Hudson, 1991; Stephens and Donnelly, 2000), or Markov Chain Monte Carlo (MCMC) sampling, e.g., (Kuhner et al., 1995, 1998). For example, in the case of exponential growth, the size of the population at any time t measured from the present to the past is given by N(t) = N(0)exp(−βt), and the unknown parameters are N(0) and β. Usually, in practice, it is not known in advance which demographic model fits the sampled gene sequences. Further, population histories are often more complex than those described by simple parametric models. This has motivated the development of non-parametric and semi-parametric methods for inferring the demographic history from sequence data or from an estimated genealogy (e.g., Fu, 1994; Pybus et al., 2000) without resorting to some previous information about the demographic model. Our method is nonparametric and is closely related to the family of skyline plot methods. The first method in this family was introduced by Pybus et al. (2000), and is referred to as the classical skyline plot. The classical skyline plot involves two separate steps, see (Ho and Shapiro, 2011): (1) estimating the genealogy from the sequence data and (2) estimating the population history from the estimated genealogy. Step 1 gives an estimated genealogy that includes the relationships among the individuals (tree topology) as well as their times of divergence. Genealogical estimation is done using standard phylogenetic methods under the so-called strict molecular clock. The strict molecular clock condition means that the branch lengths of the tree are proportional to time, with time being measured in mutations, and all lineages evolve at the same rate. It is also possible to estimate a genealogy in a relaxed-clock framework (Drummond et al., 2006). Further, in step 2 in order to estimate the population history from the estimated genealogy, Pybus et al. (2000) apply coalescent theory in a specific way by considering the times of divergence (node times) as coalescent times. When the true population size is constant, this assumption is equivalent to estimating the mean of an exponential distribution using a single realization from this distribution (Minin et al., 2008). This uncertainty is referred to as coalescent error. Further, the single phylogeny of the sequences is assumed to be known without error (i.e., phylogenetic error is assumed to be negligible). Thus, Pybus et al. (2000) estimate the population size , for each coalescent interval γ = μt, by the product of and γ, where μ is the mutation rate per site per generation and γ is measured in substitutions. Thus, the classical skyline plot produces a piecewise reconstruction of the demographic history that is quite noisy, especially in the presence of small intervals when the sampled sequences are similar. To improve the classical skyline plot estimation, several extensions have been proposed. Without being exhaustive, we discuss the extensions that are most relevant to our work. Strimmer and Pybus (2001) developed a generalized skyline plot estimate based on the Akaike Information Criterion correction (AIC) in order to reduce the number of free parameters in the classical skyline plot. This method allows multiple coalescent events, i.e., for which little divergence time information is available, to be grouped together. Important developments were obtained in a Bayesian framework. Thus, Drummond et al. (2005) and Opgen-Rhein et al. (2005) use multiple change-point (MCP) models to estimate population size dynamics. In particular, Drummond et al. (2005) use a Markov chain Monte Carlo (MCMC) sampling procedure that efficiently samples a variant of the generalized skyline plot, given sequence data, and combines these plots in order to generate a posterior distribution of the effective population size through time. Due to the averaging effect of the MCMC sampling, the Bayesian skyline plot introduced by Drummond et al. (2005) produces smoother estimates than previous skyline plot methods. Also in the Bayesian framework, Minin et al. (2008) propose an alternative to change-point modeling that exploits Gaussian Markov random fields to achieve temporal smoothing of the effective population size. The advantage of the skyride method is that in contrast to estimates given by MCP models, explicit temporal smoothing does not require strong prior decisions like fixing the total number of change points a priori. Finally, Heled and Drummond (2008) introduced the extended Bayesian skyline plot, which permits the analysis of multiple unlinked loci. Increasing the number of independent loci allows the uncertainty in the coalescent to be assessed, leading to an improvement in the reliability of the demographic inference and a substantial reduction in estimation error (Ho and Shapiro, 2011). Further, unlike previous skyline plot methods that use a piecewise-constant model, the extended Bayesian skyline plot permits the use of a piecewise-linear model to describe the demographic history, allowing the population size to change continuously along each interval. In order to estimate the effective population size, we propose a new method in a likelihood-based perspective. Unlike some skyline methods that use a single estimated phylogeny of the sequences, or others that use MCMC approaches, we resort to an efficient importance sampling scheme and our estimate comes to an weighted average over a large number of simulated genealogies, each with a different set of coalescence times. The methodology is described in detail in Section The Skywis Method.

2. Background

2.1. Coalescent theory

In this section, we present the basic ideas behind the standard coalescent, as well as its extension to the case of fluctuating population size. An introduction to coalescent theory can be found in Nordborg (2003). Coalescent theory allows one to produce genealogies relating the sampled sequences according to a large class of population genetic models. In particular, the classical coalescent process assumes a single, isolated and panmictic population (e.g., a Wright-Fisher model), which evolves with constant (haploid) size N over many generations. For sufficiently large N and a sample size n such that n≪N, the ancestral relationships between the gene sequences can be approximated by Kingman's coalescent (Kingman, 1982). In short, the ancestry of a sample of sequences is modeled back in time, starting from the current sample and until the most recent common ancestor (MRCA) of the sample is found. At each step in the genealogical tree, one of the following events can occur: (1) two sequences coalesce if they share a common ancestor; (2) one sequence mutates. In the coalescent framework, time is measured in units of N generations, and N is large. The mutation rate μ per sequence per generation is rescaled so that θ = 2Nμ. Further, one can consider that each pair of lineages coalesces independently as a Poisson process with rate 1, and so, when there are k ancestral lines, coalescent events occur as in a Poisson process with total rate k(k − 1)/2 (Stephens, 2000). In the classical coalescent process, and in the presence of k gene sequences, the waiting time T to the next coalescent event is exponentially distributed with rate , while the distribution of the time until the first mutation event in any of the k lineages is exponential with parameter kθ/2. Since mutations are assumed to occur independently of coalescence, the waiting time until a mutation or coalescent event is exponentially distributed with parameter The classical coalescent framework can be extended to include simple deviations from the idealized Wright-Fisher model, like recombination, fluctuating population size, population structure, and selection. In our paper, we focus on a single extension of the coalescent, namely variable population size. In the case of non-constant population size, the number of descendants of a sequence in one generation does not follow the Poisson distribution with intensity one (Hein et al., 2005). As a result, when the basic coalescent is used to model a real physical population, the size N of the population in the (haploid) Wright-Fisher model cannot be assumed to be equal to the size of the real population. Let N(t) denote the effective population size at time t with N(0) = N. The effective population size reflects the number of individuals that contribute offsprings to the descendant generation and is almost always smaller than the census population size. The variable population size coalescent model for contemporary gene sequences was introduced by Griffiths and Tavaré (1994c) and Donnelly and Tavaré (1995). In this case, the coalescence times T2, T3, …, T do not follow independent exponential distributions. Let V = T + … + T be the accumulated waiting time so that the number of sequences pass from n to k − 1 sequences, i.e., and let Λ(t) the cumulative coalescent rate over time measured relative to the rate at time t = 0: where ν(t) = N(t)/N, the relative size of N(t) to N. The waiting time until the next event depends only on the time of the previous event by the Markov property. The survival function of the time T conditional on V = v is where v = 0. We note that when replacing Λ(t) by t (i.e., in the case N(t) = N, t > 0) in Equation (4), we get the survival function of the exponential distribution. From Equation (4), we obtain the density It is precisely from this equation that Pybus et al. (2000) derived the estimation of the effective population size in the presence of k sequences.

2.2. Importance sampling

Parameter estimation in population genetic models require optimization of the likelihood of the data given the parameters, P(|θ). The likelihood is then evaluated by: where θ is the collection of parameters (such as population size and migration rates) for the population process. Typically, the objective of the analysis is to estimate these parameters by averaging the likelihood over all possible genealogies. A naive Monte Carlo method for the integral in Equation (6) is given by where (1),(2), …,( are an independent sample from P(|θ). Importance Sampling (IS) allows us to improve the efficiency of the Monte Carlo integration. The main idea of the IS approach is to reduce the inefficiency of the approximation (Equation 7) by concentrating the simulation on the trees that are more likely with the observed data. Instead of choosing histories from the distribution Pθ(), we want to sample genealogies from a proposal distribution Q() that better supports the observed data, . The IS method is based on rewriting (Equation 6) as The Monte Carlo approximation of Equation (8) gives where (1),(2), …,( ~ Q(). Good choices of the distribution Q(.) make this method of approximation much more efficient than (Equation 7). Ideally, we would like to sample from Q() = P(|D). However, this is impossible because it supposes perfect knowledge of the likelihood which is not true in practice. Importance sampling (IS) was first used in this context by Griffiths and Tavaré (1994a,b,c). Stephens and Donnelly (2000) proposed improvements to the method by suggesting an approximation to an optimal proposal distribution for IS, P(|D).

3. The skywis method

In this section, we describe our estimation method of the effective population size, when n gene sequences are available. The main idea behind our method is to simulate a large number of genealogies and create a weighted average of the effective population sizes, where the most probable genealogies are given larger weight. In short, reconstructing the demographic history from these sequences involves four distinct steps: simulate J genealogies: (1),(2), …,(; compute where , represents the estimated effective population size for the genealogy ( for each coalescent time (in the presence of k sequences); compute the weights w(1), w(2), …, w(, where w( represents the weight of the genealogy ( in the likelihood of the data; estimate based on genealogies (1),(2), …,(, by the weighted mean of , for j = 1, 2, …, J, and k = 2, 3, …, n, i.e., For example, with a variable population size that is expanding from the past to the present, as we progress toward the MRCA one can expect the population size to be smaller, or coalescence times to be shorter, than in the case of a constant population size. This fact, of shorter coalescence times, should be reflected more faithfully by the most probable genealogies. Since such genealogies receive the largest weights, one can see that through the weighting system the estimator is adapting itself to the information contained in the data. In what follows we describe our method in full detail, namely: how to simulate genealogies; how to set the weights; how to estimate the effective population size.

3.1. Skywis plot for homochronous sampling

3.1.1. Simulation of genealogies

In order to generate genealogies we use the proposal distribution Q(.) introduced by Stephens and Donnelly (2000) assuming a constant population size and a finite sites model with known mutation parameters. Given the Stephens and Donnelly (2000) method is crucial to our approach, we describe it briefly. Let: E: the set of possible types of gene sequences; H−: the set of all sequences when event i occurs (coalescence or mutation) where i decreases from the present to the past in steps of 1 for each event (see Figure 1);

Figure 1

Example of a realization of the coalescent process viewed from past to the present with .

= {H0, H−1, …, H−}: a history of sequences where H0 = , m is the total number of events in the history , and H− is a singleton (the MRCA); P: the mutation transition matrix; Example of a realization of the coalescent process viewed from past to the present with . In the Stephens and Donnelly (2000) method, the H− are viewed as states of a Markov process starting at genetic type H− ∈ E and ending with H0 ∈ E. Let P be the mutation transition matrix. Let Pαβ be the probability of a DNA sequence of type α to mutate to a DNA sequence of type β, and let denote a mutation of a gene sequence from type α to type β according to ; let denote a coalescence of two gene sequences of type α. Then, the forward transition probabilities pθ(H|H), are defined by Equation (11): where is the number of sequences of type α in H, n is the number of sequences in H. Stephens and Donnelly (2000) consider randomly constructing histories backward in time in a Markov way, from the sample H0 to an MRCA (single type), according to some backward transition probabilities qθ(H|H) in the class = {H|Pθ(H|H)>0} with the constraint qθ(H|H) ∝ pθ(H|H). Their proposed backward transition probabilities which define Q(.) are given by Equation (12), namely: where is the number of sequences of type α in H, n is the number of sequences in H, {H − α} is the set of all sequences in H without the chosen sequence α, and C = n(n − 1 + θ) / 2 is a constant of proportionality. The estimated conditional probability is described below. In the proposed reconstruction, when H contains n chromosomes, the new type α is obtained by choosing a chromosome from H at random and then mutating it a geometric number of times. If is the number of chromosomes of type β in H, then (Stephens and Donnelly, 2000), In our approach, the genealogies are simulated backwards in time by the following algorithm based on Equation (12): initialize n: = n, where n is the number of DNA sequences at time t = 0 (present), and i = 0; simulate the time to the next event, W−, as an exponential distribution with parameter ; randomly choose a sequence from H; the chosen sequence type is denoted α; for each type β∈E for which Pαβ > 0, compute ; compute the quantities x1 and x2, where Then, choose: a coalescence event with probability ; a mutation event (to β) with probability . depending on the type of event chosen in step 5, we continue as follows: if there is a coalescence event, choose another sequence of type α randomly, and let n: = n − 1; if there is a mutation event, mutate the sequence α into a sequence β, without changing n, i.e., let n: = n; let i: = i − 1 and continue until n = 1. After implementing the above algorithm, the coalescence times that are at the core of our method can be deduced. In the genealogy given in Figure 1, we can deduce the coalescent times from the event times. For example, T7 = W−1 whereas T6 = W−2 + W−3 because we have a mutation event before a coalescence event.

3.1.2. Weights of genealogies

After generating genealogies using the Stephens and Donnelly (2000) proposal distribution, it is possible to compute the importance weight w( for each genealogy (, with j = 1, 2, …, J. Then w( is given by: where with and

3.1.3. Estimation of the effective population size

When building genealogies backwards in time, as we move backwards in time, fewer coalescence events occur. As a result, coalescence times close to the present are very short and become larger gradually going back in time. These short coalescence times create an undesirable variability in the estimation of the effective population size. Therefore, we propose to cumulate small coalescence times in order to improve the estimation of the effective population size. These cumulated time intervals are called epochs. To define epochs that get larger as we go backwards in time, we followed (Durbin and Li, 2011), and used a special time scale based on the TMRCA. Forest (2014) adopted the same method. Finally, we note that the idea of cumulating small coalescence times in order to smooth the graph of the estimator of the effective population size was first proposed by Strimmer and Pybus (2001); it has since become quite standard in the related literature. Once the genealogies have been simulated using the method described in Section 3.1.1, we cumulate the coalescence times as follows: we fix the total number of epochs, ncum, i.e., the total number of time intervals where we estimate the effective population size; for each simulated genealogy (, we compute the MRCA time, ; we use formula (Equation 18) proposed by Durbin and Li (2011) in order to define epochs where estimates of the effective population size are computed. In other words, the following time cutting points in a genealogy (, j = 1, 2, .., J are used: where . For each genealogy, formula (Equation 18) gives the boundaries of the epochs, measured from the present to the past where b = 0, 1, 2, …ncum (in units of N generations). The boundaries of epochs are different for each genealogy ( and depend on the length of the tree. For example if in units of N generations and ncum = 5, then according to Equation (18), the boundaries of the intervals are 0.0615, 0.1609, 0.3215, 0.581 (backward in time). For example, for the first epoch, this means that we must cumulate coalescence times from T until reaching 0.0615 N generations. The skyline plot can be viewed as a method of moments estimator based on the standard coalescence distributions (Strimmer and Pybus, 2001). For a genealogy (, we have: because is exponentially distributed as . Therefore, we use the estimate The expectation of the accumulated waiting time in order to pass from n to ℓ lineages, , is given by (see, for example, Rodrigo et al., 1999) where c = n−ℓ represents the number of coalesced sequences. From Equation (21), we can see that we can estimate, using the method of moments, the effective population size for the cumulated time of c coalescence times by: where , and c = n−ℓ. In our case, the cumulated waiting times for each genealogy ( are deduced from Equation (18): once the boundaries of the intervals of epochs are computed, the cumulated waiting times, numbered from present to the past, are derived as: where b = 1, 2, …, ncum, j = 1, 2, …, J, and . It follows from Equations (22, 18) that the estimated effective population size for an epoch b, and genealogy (, j = 1, 2, …, J, is given by: where is the number of sequences at the beginning of the interval, and is the number of cumulated coalescence times in the epoch , b = 1, 2, …, ncum, j = 1, 2, …, J. The distribution of importance weights of genealogies described by the Equation (15) is an approximation to the posterior distribution P(|, θ). As a result, one can approximate quantities of interest related to the tree by forming a weighted average of these quantities over the sampled trees as suggested in Stephens (2001). In our case, we are interested in the estimation of E(Ne), b = 1, 2, …, ncum from the J estimates , j = 1, 2, …, J and we let In our algorithm, the weighted average of is computed for the same time interval for all j = 1, 2, …, J that represent the intersections of epochs for the J simulated genealogies. This way of proceeding gives us weighted estimates of effective population sizes under the assumption that the effective population size is constant for an epoch. The reason for taking common intervals across genealogies is that estimates the integral (see Pybus et al., 2000) Therefore, to estimate the integral Equation (26) by a weighted average of estimates from J genealogies, we must use the same time intervals. For illustration, in Figure 2, we assume that two genealogies (1) and (2) are simulated using the method described in Section 3.1.1 with respective weights w(1) and w(2). Further, we assume that we cumulate coalescence times to obtain ncum = 3 epochs. The limits of epochs for a genealogy ( are denoted , and the time to the MRCA by TMRCA(, j = 1, 2. The detailed calculation of the weighted effective population size per epochs is summarized in the following table:

Figure 2

Division of time axis in the presence of two genealogies.

3.2. Skywis for heterochronous sampling

The algorithm described in Section 3.1 can be generalized to the case of serially sampled sequences i.e., sequences sampled at different moments in time. Such samples are also called heterochronous. Figure 3 illustrates a case where we sampled sequences at times t0 < t1 < t2, and the time is measured from the present to the past. Let S be the number of instants where we sampled sequences (S = 3 in Figure 3). Rodrigo and Felsenstein (1999) extend the coalescent likelihood for such heterochronous sequences, a very important issue in the case of rapidly evolving viruses such as HIV. For example, Rodrigo et al. (1999) have estimated, using heterochronous sequences, the viral generation time of HIV type1 (HIV-1). Also, serially sampling rapidly evolving populations is used for dating evolutionary events and divergence times (see e.g., Drummond et al., 2003).

Figure 3

Example of serially sampled sequences with . The red squares are the sampled sequences and the blue squares are the sequences derived from coalescence.

Example of serially sampled sequences with . The red squares are the sampled sequences and the blue squares are the sequences derived from coalescence. In the presence of serially sampled sequences, we have to adapt the method of Stephens and Donnelly (2000) in order to simulate genealogies in this case. This necessarily involves developing new formulas for the probabilities pθ(H|H) and , as given below.

3.2.1. Backward and forward probabilities, and weights of genealogies

Let n( be the number of additional sampled sequences at time t, with s = 1, 2, …, S−1. The main difference between the algorithm for homochronous sequences presented in Section 3.1, and the new algorithm for heterochronous sequences is that the number of lineages increases at the (known) instants t, s = 1, 2, …, S−1 where samples of sequences are added. Further, it is necessary to use event times, because the embedded chain differs according to the relative position of these event times with respect to t, s = 0, 1, 2, …, S − 1. In other words, the probabilities pθ(H|H) and are calculated differently from the case of a single sample of sequences, which has an impact on how the weights of genealogies, w(, j = 1, 2, …, J, are computed. In order to present our results, we introduce these additional notations: D = {H, v}: represents the set of all sequences present in the population after the i event at time v; this is a generalization of H with the specification of the time of event i; : represents the set of all sequences added at time t; Our proposal distribution is an adapted version of the Stephens and Donnelly (2000) method for simulating genealogies, to the case of heterochronous sequences. In this case, as mentioned above, we consider that there is a list of pre-specified sampling times t, s = 0, 1, 2…, S−1 which are dividing the time axis. In what follows, time is measured from the present to the past and by event we mean either a coalescence or a mutation. If an event time v is such that t < v Case 1: . Case 2: . Normally (i.e., in homochronous sampling), the waiting time W− from the state D with t < v < t to the next event has an exponential distribution with rate , where n is the number of lineages at time v. Thus, the probability that there are no events in the interval (v, v′]≡(v, ts] is given by the survival function where W− is the waiting time from state H to state H in a process with homochronous sampling. In the case of heterochronous sequences, the algorithm for simulating the genealogies backward in time is the following: initialize n = n and s = 0, where n is the number of sampled sequences at time t0 = 0 (present), and s is the index of times where we perform the sampling. Further, initialize the cumulated time tcum to 0; simulate the time to the next event, W−, as an exponential distribution with parameter ; let tevt be the observed value; compute ; if and , then let ; let (add a sample of sequences at time t); let s: = s+1 and i: = i−1, and go to step 2; otherwise, go to step 5; let and randomly choose a sequence from n; the chosen sequence type is denoted α; compute the quantities x1 and x2, where Then, choose: a coalescence event with probability ; a mutation event (to β) with probability . depending on the result in step 6: if there is a coalescence event, choose another sequence of type α randomly, and let n: = n − 1; if there is a mutation event, mutate the sequence α into a sequence β, without changing n, i.e., let n: = n; let i: = i−1 and continue until n = 1. After the definition of how to build a genealogy in the case of serially sampled sequences, and the proposal distribution Q, we introduce the probability P of the genealogy by specifying the probability of passing from the state to the state D = {H, v} when there are n sequences, and we suppose that an event time v′ is such that (a coalescence corresponds to a split when viewed from the past to the present). Therefore, as for the backward transition probabilities, we consider separately the case v>t and the case v = t. Case 1: . Case 2: and v = t. where: the probability that there are no events in the interval is given by: represents the number of sequences of type α in ; H = H − : represents the event of adding the set of sequences at time t. As in the case of homochronous sequences, after computing the probabilities , and for a genealogy G(, j = 1, 2, …, J, the importance weights may be derived from Equations (14–17).

3.2.2. Estimation of the effective population size for heterochronous sequences

For heterochronous sequences, the method for producing a skywis plot is similar to the one defined in Section 3.1.3. The main difference lies in the definition of epochs in this case . In the presence of S serially sampled sequences, we cumulate the coalescence times as follows: for each simulated genealogy (, we compute the MRCA time, , j = 1, 2, …, J; we fix the number of epochs at in each time interval (t; t) where no new sample is added, s = 1, 2, …, S, , and t0 = 0 (present); in order to define the epochs, the time cutting points in a genealogy (, j = 1, 2, .., J are computed as follows: where and s = 1, 2, …, S − 1. For each genealogy and for each time interval (t; t), s = 1, 2, …, S, formula Equation (33) gives the limits of the epochs from the present to the past in units of N generations. In practice, we performed minor smoothing at times t, because the addition of new sequences creates an artificial discontinuity at t, s = 1, 2, …, S. Therefore, the population size in the first epoch after t is set to be equal to the effective population size in the epoch preceding the addition of new sequences.

4. Results

To test the ability of our method to capture the demographic signal contained in the DNA sequences, we simulated several demographics scenarios. Further, we compared the results of the skywis plot with those of the generalized skyline plot that uses single tree, and the Bayesian skyline plot that uses MCMC approach. These methods are the closest to our approach. The DNA sequences were simulated using the fastsimcoal program (Excoffier and Foll, 2011) which allows us to consider several demographic scenarios and different mutation models. The genealogies were simulated using the method described in Section 3.1.1. In all our simulations, the coalescence times were cumulated into epochs according to the method described in Section 3.1.3, where n represents the number of simulated DNA sequences. After that, we derive the skywis plot using Equations (24, 25). The generalized skyline plot was performed as follows: From the DNA sequences generated by fastsimcoal, we estimated a phylogeny using the PHYLIP program (the PHYLogeny Inference Package Felsenstein, 1989) using the maximum likelihood method with a molecular clock constraint (we used the dnamlk program). Based on the estimated tree produced by PHYLIP, we used the APE package (Paradis et al., 2004) to produce the generalized skyline plot according to the optimal strategy for grouping adjacent coalescent intervals introduced by Strimmer and Pybus (2001). The Bayesian skyline plot was performed using the BEAST program, version 1.8.1. In order to reproduce a parametrization which is as close as possible to ours, we used (Hasegawa et al., 1985) substitution model with equal base frequencies, and a strict clock with rate 1. Below, we present our results according to the demographic models we considered.

4.1. Constant effective population size

In this case, we consider 50 simulated DNA sequences with parameters: number of nucleotides: 10,000; constant effective population size: 2000 generations; no recombination and no population structure; mutation rate equals to 2·10−7: therefore (θ = 8); JC69 (Jukes and Cantor, 1969) finite sites model. The estimate of the effective population size (skywis plot) is shown in Figure 4A. We observe that the skywis plot (orange line) gives a relatively smooth curve of the effective population size. Further, the estimation turns around the real value N, with a slight over-estimation close to the present, which can be explained by the fact that when the mutation rate θ is large, the sampled sequences are all different, and we have many mutations before one coalescence; thus, coalescence times are longer, and the corresponding population sizes are larger (see Section Simulation of Genealogies.)

Figure 4

Constant effective population size. (A) skywis plot, (B) generalized skyline plot, (C) Bayesian skyline plot.

Constant effective population size. (A) skywis plot, (B) generalized skyline plot, (C) Bayesian skyline plot. In Figure 4B we present the generalized skyline plot (in substitution units). In this case, the form of the graph is not recognizable as a constant line. The Bayesian skyline plot is given in Figure 4C. In this case, the graph is very smooth and is easily recognizable as a constant line.

4.2. Piecewise constant function

In this section, we present results where 25 DNA sequences of length 10,000 nucleotides and mutation rate μ = 5·10−4 were simulated under the JC69 mutation model. We assume that the effective population size follows the piecewise constant model function: where N = N(0) = 104, x = 5000 generations, a = 0.25 (see Figure 5), and the time t is measured from present to the past.

Figure 5

Skywis plot for data simulated from the population model where . (A) skywis plot, (B) generalized skyline plot, (C) Bayesian skyline plot.

Skywis plot for data simulated from the population model where . (A) skywis plot, (B) generalized skyline plot, (C) Bayesian skyline plot. Figure 5A represents the non-parametric estimate (skywis plot) of the effective population size for a number of epochs equal to ncum = 4. We note that the skywis plot was able to detect well enough the change-point of the size of the actual population which dates back 5000 generations. However, the skywis plot seems to overestimate the effective population size for t > 5000 generations. In Figure 5B we present the generalized skyline plot. The skywis plot gives a better result than the generalized skyline plot close to the present, while the estimate given by the generalized skyline plot is closer to the true value when we approach the MRCA. The Bayesian skyline plot presented in Figure 5C is very smooth and generally reproduces the true history except closer to the present, where the Bayesian skyline plot over-smoothes the effective population size.

4.3. Exponential population growth

In this section, we suppose that the effective population growth is exponential assuming an instantaneous growth rate that is proportional to the current population size according to the equation N(t) = Nexp(−βt) from present to the past. Using the fastsimcoal program, we simulate 50 DNA sequences with the following parameters: Number of nucleotides: 1000; N = N(0) at time t = 0: 10,000; no recombination, and no population structure; mutation rate: 5·10−7 (θ = 1); JC69 finite sites model; β = 1 (in generations). The skywis plot for the simulated DNA sequences from the exponential model described above is given in Figure 6A.

Figure 6

skywis plot, (B) generalized skyline plot, (C) Bayesian skyline plot.

skywis plot, (B) generalized skyline plot, (C) Bayesian skyline plot. The result given in Figure 6A is quite good in the sense that the size of the effective population decreases steadily from the present to the past and follows the exponential curve quite closely most of the time. However, we note that the estimated effective population size is almost constant from some point in time when approaching the TMRCA. This is explained by the fact that for the last two sequences the theoretical average time to coalesce represents half the length of the tree, and from this point in time there is no much variability in the estimate of the population size. In particular, this remark led us to consider heterochronous sampling in order to improve the effective population size estimate. In Figure 6B the time is measured in substitution units and we present the generalized skyline plot. As before, the generalized skyline plot has a fluctuating shape but it exhibits a certain tendency to decrease toward the past. In the end, when we approach the time of the MRCA, the generalized skyline plot gives an estimate that is close to the true value. In Figure 6C, we present the Bayesian skyline plot. As in the other scenarios, the Bayesian skyline plot produces a very smooth curve; in this case it suggests that the population had a mild exponential expansion. However, we note that the curve remains constant closer to the MRCA.

4.4. Exponential population growth and heterochronous sequences

In order to test the methodology proposed in Section 3.2, we use the same parameters as in Section 4.3, but by assume that the 50 sequences were collected at different moments in time such as: n0 = 25 (present); n1 = 15 at time t1 = 0.5 in units of N generations (measured from present to the past); n2 = 10 at time t1 = 1 (N generations). The result given in Figure 7 suggests that the effective population decreases exponentially from present to the past. Further, we note that the estimated effective population size continues to decrease when approaching the time of the MRCA, which is a net improvement over the homochronous case. This could be explained by the fact that as more sequences are added over time, more information is available as one approaches the MRCA.

Figure 7

Skywis plot for DNA sequences simulated from an exponential model with 3 serial samples at times .

5. Discussion

The skywis plot is a new flexible method for exploring the demographic history of a sample of DNA sequences based on coalescent theory. Our nonparametric method is likelihood-based and uses IS. More precisely, we generate a large number of genealogies, both their times and their topology; further, we use the importance weights of these genealogies to compute a weighted average of the effective population size per epoch. This allows us to produce estimates that exhibit clear cut population growth tendencies over time, which is the main purpose of this approach, given that it is nonparametric. In practice, we expect our method to be used as a preliminary procedure that could be supplemented by a parametric analysis. We present a framework of the new method and test by simulation its ability to capture the demographic signal contained in the DNA sequences under several demographic scenarios. Moreover, we consider both homochronous and heterochronous data using a simple substitution model, JC69 (Jukes and Cantor, 1969). We could also have considered more complicated substitution models, except those that allow variation in evolutionary rates among lineages. For illustration we present the results given by the generalized skyline plot that uses a single genealogy, and those obtained by the Bayesian skyline plot that uses an MCMC approach. Although the Bayesian skyline plot is smoother than the skywis plot, our estimator is able to capture the shape of the effective population size Ne(t), as well as its main change points, but in some examples it had a (slight) tendency to overestimate the population size as we approached the MRCA. This is not surprising, given that the simulation our estimation method entails first setting a constant population size (where coalescence times are longer) and further operating an adjustment through a weighting system. Further, note that, unlike the methods based on a single tree, it is possible to extend the skywis plot and include recombination. Indeed, recombination induces a graph structure rather than a tree, and IS methods in this context already exist (e.g., Fearnhead and Donnelly, 2001). As a future development, we expect the method to be improved by considering an iterative procedure, in which the present approach would be the first estimation step. As a new approach, the skywis plot remains to be tested on more complex demographic models, and models of substitution that could be more realistic, especially for rapidly evolving RNA viruses. Also, the skywis plot can be easily extended to include multilocus data because, when there is no recombination, the same IS scheme can be applied.

Funding

The Ph.D. studies of SA were supported in part by scholarships awarded by the ISM (Institut des Sciences Mathématiques) and stipends out of NSERC (Natural Sciences and Engineering Research Council) research grants.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Time interval	N^eℓ,ℓ=1,2,…,2·ncum
[0;tcut,1(2))	w(1)N^e1(1) + w(2)N^e1(2)
[tcut,1(2);tcut,1(1))	w(1)N^e1(1) + w(2)N^e2(2)
[tcut,1(1);tcut,2(2))	w(1)N^e2(1) + w(2)N^e2(2)
[tcut,2(2);tcut,2(1))	w(1)N^e2(1) + w(2)N^e3(2)
[tcut,2(1);TMRCA(2))	w(1)N^e3(1)+w(2)N^e3(2)
[TMRCA⁽²⁾; TMRCA⁽¹⁾]	N^e3(1)

22 in total

1. Times on trees, and the age of an allele.

Authors: M Stephens
Journal: Theor Popul Biol Date: 2000-03 Impact factor: 1.570

Review 2. Coalescents and genealogical structure under neutrality.

Authors: P Donnelly; S Tavaré
Journal: Annu Rev Genet Date: 1995 Impact factor: 16.830

Review 3. Skyline-plot methods for estimating demographic history from nucleotide sequences.

Authors: Simon Y W Ho; Beth Shapiro
Journal: Mol Ecol Resour Date: 2011-02-06 Impact factor: 7.090

4. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios.

Authors: Laurent Excoffier; Matthieu Foll
Journal: Bioinformatics Date: 2011-03-12 Impact factor: 6.937