Patrick Hoscheit1, Oliver G Pybus2. 1. MaIAGE, INRA, Université Paris-Saclay, Domaine de Vilvert, Jouy-en-Josas 78350, France. 2. Department of Zoology, University of Oxford, Peter Medawar Building, South Parks Road, Oxford OX1 3SY, UK.
Abstract
A variety of methods based on coalescent theory have been developed to infer demographic history from gene sequences sampled from natural populations. The 'skyline plot' and related approaches are commonly employed as flexible prior distributions for phylogenetic trees in the Bayesian analysis of pathogen gene sequences. In this work we extend the classic and generalized skyline plot methods to phylogenies that contain one or more multifurcations (i.e. hard polytomies). We use the theory of Λ-coalescents (specifically, Beta ( 2 - α , α ) -coalescents) to develop the 'multifurcating skyline plot', which estimates a piecewise constant function of effective population size through time, conditional on a time-scaled multifurcating phylogeny. We implement a smoothing procedure and extend the method to serially sampled (heterochronous) data, but we do not address here the problem of estimating trees with multifurcations from gene sequence alignments. We validate our estimator on simulated data using maximum likelihood and find that parameters of the Beta ( 2 - α , α ) -coalescent process can be estimated accurately. Furthermore, we apply the multifurcating skyline plot to simulated trees generated by tracking transmissions in an individual-based model of epidemic superspreading. We find that high levels of superspreading are consistent with the high-variance assumptions underlying Λ-coalescents and that the estimated parameters of the Λ-coalescent model contain information about the degree of superspreading.
A variety of methods based on coalescent theory have been developed to infer demographic history from gene sequences sampled from natural populations. The 'skyline plot' and related approaches are commonly employed as flexible prior distributions for phylogenetic trees in the Bayesian analysis of pathogen gene sequences. In this work we extend the classic and generalized skyline plot methods to phylogenies that contain one or more multifurcations (i.e. hard polytomies). We use the theory of Λ-coalescents (specifically, Beta ( 2 - α , α ) -coalescents) to develop the 'multifurcating skyline plot', which estimates a piecewise constant function of effective population size through time, conditional on a time-scaled multifurcating phylogeny. We implement a smoothing procedure and extend the method to serially sampled (heterochronous) data, but we do not address here the problem of estimating trees with multifurcations from gene sequence alignments. We validate our estimator on simulated data using maximum likelihood and find that parameters of the Beta ( 2 - α , α ) -coalescent process can be estimated accurately. Furthermore, we apply the multifurcating skyline plot to simulated trees generated by tracking transmissions in an individual-based model of epidemic superspreading. We find that high levels of superspreading are consistent with the high-variance assumptions underlying Λ-coalescents and that the estimated parameters of the Λ-coalescent model contain information about the degree of superspreading.
Entities:
Keywords:
coalescent; maximum likelihood; multifurcating; phylodynamics; phylogenetics
The field of phylodynamics is concerned with the study of how processes acting at the population-level shape the genetic diversity of gene or genome sequences sampled from natural populations. Phylodynamic methods are frequently applied to pathogen populations and used to test hypotheses concerning the epidemiology and transmission of infectious disease (e.g. Pybus and Rambaut 2009). Viruses in particular have been the subject of great attention, since their high mutation rates rapidly generate genetic diversity, even on short time scales, and because increasingly large numbers of virus genome sequences from viral epidemics are available for analysis (e.g. Biek et al. 2015). Phylodynamic approaches are also used in other fields, such as macroevolution, anthropology, and ancient DNA research (e.g. Drummond et al. 2003), in order to understand how dynamical processes gave rise to the patterns of ancestry and diversity observed in biological systems.Phylodynamic analysis of sampled gene sequences relies crucially on ‘tree-generating models’, which describe how phylogenies, genealogies or trees (we use these terms interchangeably) are related to the population dynamic processes that generated them. Among these models, coalescent approaches are widely used because they provide a mathematically simple framework that is capable of relating the demography of a viral population to its sample genealogy. Mathematically, coalescent theory relies on asymptotic properties of Wright–Fisher reproduction models that represent large, constant-sized populations. The distribution of sample genealogies from such populations is described by the so-called Kingman coalescent process (Kingman 1982). The Kingman coalescent has been shown to describe the genealogy of many models in population genetics and has been extended to incorporate a number of biological processes, including population size change (Griffiths and Tavare 1994), selection (Kaplan, Darden, and Hudson 1988), recombination (Hudson and Kaplan 1988), longitudinal sampling (Rodrigo et al. 1999), and population structure (Takahata 1988).The Kingman coalescent has been used to develop a variety of statistical methods that aim to infer the history of population size from an observed sample phylogeny. One such approach that is commonly used is the ‘skyline plot’ and related methods (Ho and Shapiro 2011), which model population size change as a piecewise constant function through time. In this paper, we extend and generalize the skyline plot family of methods beyond the Kingman coalescent.The standard Kingman coalescent (Kingman 1982) describes the genealogy of n individuals sampled at random from a population of size N ≫ n, using a bifurcating ultrametric tree T with n leaves (tree tips). In this paper, we will consider genealogies obtained by sampling from large populations whose sizes vary over time. In contrast to standard Kingman coalescent theory, we will consider populations with high variance in the number of offspring per individual (sometimes called the offspring distribution) which may lead to multifurcations (i.e. nodes with degree >3) in the sample genealogy that can no longer be ignored. To achieve this we consider a more general class of models called Λ-coalescent processes, a family of random trees discovered by Sagitov (1999) and fully described by Pitman (1999). In contrast to the Kingman coalescent, under which trees are binary and strictly bifurcating, Λ-coalescent trees can contain multifurcations. This has led to their application to genetic data from populations undergoing high-variance reproduction (sometimes called sweepstakes reproduction), such as the Atlantic codGadus morhua (Birkner, Blath, and Eldon 2013) or Pacific oystersCrassostrea gigas (Sargsyan and Wakeley 2008). Beyond high-variance reproduction, many other phenomena can lead to multifurcations in the ancestral process of a sample, such as repeated selective sweeps (Durrett and Schweinsberg 2004), strong selective pressure (Neher and Hallatschek 2013), or oversampling (Bhaskar, Clark, and Song 2014). Using classical methods to analyze such datasets can lead to systematic biases (Hallatschek 2018; Sackman, Harris, and Jensen 2019).Most work in this area has been focused on finding statistical signatures of multiple-merger coalescences, and distinguishing them from confounding factors such as population growth (Eldon et al. 2015; Koskela 2018). The statistics most commonly used are based on the site-frequency spectrum (Spence, Kamm, and Song 2016) and, as such, are nonphylogenetic in nature.Here we show how to calculate the likelihood of multifurcating genealogies under Λ-coalescent models, given a function describing effective population size. We derive a new estimate of effective population size, which we call the multifurcating skyline plot, and extend this to longitudinal (serial) sampling. Using simulated data, we show that, in the case of -coalescents, we can estimate the key α parameter that describes the propensity of the sample genealogy to contain multifurcations (α = 2 corresponds to the Kingman coalescent, whilst α = 1 represents the Bolthausen–Sznitman coalescent). Our analyses include data simulated under an empirically informed model of epidemiological super-spreading. We also show that effective population sizes can be estimated accurately. Note that this study aims to undertake inference from a single, pre-specified multifurcating genealogy. We leave for future work the problem of statistically inferring such multifurcating trees from an empirical gene sequence alignment.
2. Methods
2.1 Mathematical properties of Λ-coalescents
Coalescents are random trees that describe the genealogy of a small number of individuals sampled from a larger population. The class of Λ-coalescents is parametrized by a probability measure Λ(dx) on the interval [0, 1]. In most cases, these probability measures will be absolutely continuous with respect to the Lebesgue measure, i.e. Λ(dx) = λ(x)dx, for some non-negative function λ(x) such that .Under the Kingman coalescent, sample trees are almost surely binary, However, trees under the Λ-coalescent are in general multifurcating (i.e. contain at least one node with degree larger than 3). The distribution of such trees can be described as follows: start with n sample lineages at time 0. Note that time is represented backwards hence time 0 represents the present (or, more generally, the time of the most recent tree tip). Consider, for each k-tuple of lineages with 2 ≤ k ≤ n, an independent exponential random variable with parameterThen, if T is the minimum of these random variables, T is exponentially distributed with parameter . At time T, choose the number K of coalescing lineages with probabilityFinally, choose K lineages at random from the set of n extant lineages, and collapse them into a single lineage. Continue this process with the remaining n – K + 1 lineages until only one lineage is left. In other words, each k-tuple of lineages coalesces at rate λ,, independently of all other tuples. To better understand the role played by the Λ measure in the distribution of multifurcations, Equation (2) can also be interpreted in the following way: assume that , so that is a probability measure. Then, at coalescence time T, let be sampled according to F1; select coalescing lineages from the n extant lineages independently with probability P. For more details on this construction (sometimes dubbed the paintbox construction by analogy with the construction used by Kingman) see Pitman 1999.In Equation (1), it is easy to see that when Λ = δ0, the Dirac mass at 0, then λ, = 0 for , and λ,2 =1. Thus, if the probability measure Λ is concentrated solely at 0, then the Λ-coalescent process is strictly binary and identical to the Kingman coalescent. However, the paintbox construction scheme does not apply to this case, since .
2.2 Variable population size
For a given set of n individuals sampled from a large populations, we can define the coalescent effective population size, Ne, which is the size of an ideal population exhibiting the same amount of genetic diversity under the Wright–Fisher reproduction model as the population under study. Indeed, one can recover a Kingman coalescent from the genealogy of n individuals in a Wright–Fisher model of size N by rescaling time by a factor of Ne = N (Wakeley 2009). For more details on the different notions of effective population size see Sjödin et al. 2005.For Λ-coalescents, the analog of the Wright–Fisher model is the Cannings model (Cannings 1974). In the Cannings model, all N individuals in one generation reproduce according to the same offspring distribution (see Fig. 1). If are the number of offspring of individuals 1,…, N, respectively, we need to have in order to keep population size constant. A further condition is exchangeability, meaning that the distribution of the vector is invariant under permutations. This implies in particular that all individuals have the same offspring distribution (i.e. the same propensity to reproduce). Hence, their common expectation is . Let σ2(N) be their common variance. We can recover the Wright–Fisher model as a specific case of the Cannings model by taking to be the multinomial distributions with parameters .
Figure 1.
An ancestral genealogy (in red) of n = 4 lineages sampled from a Cannings model with constant population size N = 8 (black dots). The horizontal axis represents time and the lines indicate ancestry.
An ancestral genealogy (in red) of n = 4 lineages sampled from a Cannings model with constant population size N = 8 (black dots). The horizontal axis represents time and the lines indicate ancestry.Suppose that the population size is large (N →∞). If the variance in offspring number converges to a fixed finite value as N increases (), then the genealogy of n lineages randomly sampled from the Cannings model of size N, rescaled by Ne =N∕σ2, still converges to the Kingman coalescent as in the Wright–Fisher case. If we relax this convergence condition, but still keep σ2(N) =o(N), under some additional technical conditions (see Theorem 3.1 in Sagitov 1999), then the sample genealogy converges to a Λ-coalescent when rescaled by the factor N∕σ2(N). By analogy with the Wright–Fisher case, we can call this the Λ-effective population size. Since the Cannings model counts time in units of generations, this means that, when dealing with a real population that exhibits Cannings-like reproduction, it is necessary to count time in units of generations in order to observe Λ-coalescent-like behaviors. We can thus define a Λ-coalescent process in continuous time, given an effective population size function by analogy with the variable population size coalescent described in (Griffiths and Tavare 1994).Given a tree with n tips we define as the number of coalescences, that is, the number of internal nodes of . If is a binary tree, then . We denote the coalescence times as . For convenience, we denote present as . During the interval , with , let n be the number of extant lineages. By definition, we always have . Finally, for each , let k be the number of lineages involved in the coalescence at time t (see Fig. 2 for an example of this notation). Again by definition, we have .
Figure 2.
A multifurcating, ultrametric sample tree with n = 10 tips. In this case, the number of lineages involved in each coalescence event is , . The number of extant lineages is during during ,…, during .
A multifurcating, ultrametric sample tree with n = 10 tips. In this case, the number of lineages involved in each coalescence event is , . The number of extant lineages is during during ,…, during .The likelihood of a tree under the Λ-coalescent model, given the Λ distribution, is easy to write down thanks to the Markovian description of the process given above. It can in fact be decomposed in product form, since the waiting times between coalescent events are conditionally independent. Given the number of lineages n on an intercoalescent interval , the waiting time is distributed as an exponential random variable with parameterhence the likelihood of observing an interval of length isThen, at each coalescent time, the likelihood of seeing exactly k of the n lineages coalesce is, according to (2),This gives, for the complete likelihood:
with the depending on Λ as in (1). Note that when taking (i.e. the Kingman coalescent) the are all zero except , so that, as expected, we get a zero likelihood for nonbinary trees (i.e. trees with at least one ) under the Kingman coalescent. Following Griffiths and Tavare (1994), we can now take into account the effective population size as a time-change of the Λ-coalescent, and write the likelihood of the tree given an effective population size function under a Λ-coalescent model:In this formula, the population size function is assumed to be deterministic (see Kaj and Krone 2003) for a treatment of Kingman coalescents with stochastically varying population size). The continuous-time model defined by (5) can be obtained as scaling limit of a Cannings model with variable population size, as shown in Theorem 2.2 of Möhle (2002). Indeed, this result, as in the constant population size case, shows that the effective population size function Ne is related to the parameters of the offspring distribution of a Cannings model through , where N(t) and are the census population size and the offspring variance, respectively, of the Cannings model at time t in the coalescent timescale. This relationship will be used later when considering the interpretation of estimated effective population sizes. Under the variable population size model, the change in timescale necessary to transform from generation counts to coalescent time is a little more involved than that for constant population size; details can be found in Möhle (2002) and Freund (2019).
3. Results
Using the likelihood formula (5) and given a time-scaled nonbinary tree , we will now show how to estimate certain features of the process that generated it using maximum likelihood. In Section 4, we will explore other uses of this likelihood, for instance its use as a tree prior in Bayesian phylogenetic inference frameworks, such as that implemented in BEAST (Drummond and Rambaut 2007).
3.1 Maximum-likelihood estimation of the Λ measure
We first consider the problem of estimating the Λ parameter directly from an observed genealogy. In the general case the relevant parameter space (i.e. the space of probability measures on ) is too large to enable direct inference of Λ (see Koskela, Jenkins, and Spanò 2018) for more advanced techniques). Therefore, we will limit ourselves here to the one-parameter family of -coalescents, with . This corresponds to the case of Λ-coalescents where the Λ measure is the Beta distribution with parameters and α, with :
where is the Beta function. By continuity, this can be extended to α = 2, in which case the Beta distribution collapses to the Dirac measure at 0, thus recovering the Kingman coalescent. For , the coalescence rates are given by:We show in Fig. 3 how the parameter influences the coalescence probabilities, going from strictly binary coalescence events (for α = 2) to having coalescence events involving a high number of lineages with high probability when . Indeed, the limit corresponds to the star-shaped coalescent, which has only a single coalescence event involving all lineages.
Figure 3.
Probability of a coalescence event involving lineages among extant lineages in a -coalescent.
Probability of a coalescence event involving lineages among extant lineages in a -coalescent.Beta-coalescent processes have been extensively studied for , starting with Schweinsberg (2003), who described how they can be recovered from the genealogy of certain supercritical branching processes. Similarly, the connection with the so-called α-stable continuous-state branching processes (Birkner, Blath, and Eldon 2005) has enabled the study of many features of the family of Beta-coalescents (Berestycki, Berestycki, and Schweinsber 2007, 2008; Kersting, Schweinsberg, and Wakolbinger 2014). For , the coalescent loses the so-called coming down from infinity property, which makes its mathematical analysis more difficult. For more details see e.g. Berestycki, Berestycki, and Limic 2010.We investigated the problem of inferring the α parameter by using trees simulated under the -coalescent process with constant effective population size, for three different α values, specifically . In each case, we simulated 1,000 trees with and 1,000 simultaneously sampled tips. We then estimated the α parameter independently for each tree by maximizing the likelihood (5):The results of this maximum-likelihood estimation are summarized in Table 1 and shown in Fig. 4. The procedure yields an effectively unbiased estimator, with decreasing variance as the number of tips increases. Furthermore, for a fixed number of tips, the lowest variance was obtained for true values of α close to 2, since the likelihood of trees with nonbinary nodes converges to 0 as , so that likelihood surfaces become more peaked. These results are encouraging because they indicate that features of the Λ measure might be identifiable from phylogenies.
Table 1.
Empirical mean, bias and variance of maximum-likelihood estimates of the α parameter of the Beta-coalescent process.
Number of tips
Mean
Bias
Variance
α=1.2
100
1.2004
–4×10–4
2.6×10−3
500
1.1999
–4.8×10−5
2.8×10−4
1,000
1.1995
–4.7×10−4
1.2×10−4
α = 1.5
100
1.499
–1.3×10−2
1.7×10−3
500
1.499
–1.7×10−4
1.7×10−4
1,000
1.500
1.7×10−4
6.3×10−5
α = 1.8
100
1.797
–2.5×10−3
1.1×10−3
500
1.800
4.2×10−4
1.1×10−4
1,000
1.799
–2.0×10−4
4.0×10−5
Figure 4.
Empirical probability density of maximum-likelihood estimates of for 1,000 trees simulated under -coalescents, with (a) , (b) , and (c) , respectively. The color of the density plots corresponds to the number n of tips in the simulated trees.
Empirical probability density of maximum-likelihood estimates of for 1,000 trees simulated under -coalescents, with (a) , (b) , and (c) , respectively. The color of the density plots corresponds to the number n of tips in the simulated trees.Empirical mean, bias and variance of maximum-likelihood estimates of the α parameter of the Beta-coalescent process.
3.2 Demographic inference using the skyline plot
We now turn to the problem of estimating ancestral effective population size, using a sample tree as data. Following the ‘classic skyline plot’ estimator (Pybus, Rambaut, and Harvey 2000), we assume that the duration of inter-coalescent intervals are known without error. We further assume that the degree of each internal node is known precisely. Although both of these variables are uncertain when trees are estimated from empirical data, we leave for future work the problem of incorporating uncertainty in node degree into phylodynamic inference (see Section 4).It is possible, for a given probability measure Λ, and assuming that population size is constant between coalescent events, to maximize the likelihood function (5). This is made easy by the product form of the likelihood and yields the multifurcating skyline plot estimator:Note that when , the Dirac measure at 0, all the are equal to 0, except for j = 2, for which , so that we recover the ‘classic skyline plot’ estimate (Pybus, Rambaut, and Harvey 2000) for binary trees:Finally, note that Equation (8) implies that the skyline plot estimator is expressed in the same units as the intercoalescent lengths . Hence, if the underlying tree is scaled in real time units, it is necessary to divide the skyline plot estimate by the generation time to recover an estimate in population size units. For a detailed analysis of how varying generation time can affect skyline estimates see Volz 2012.
3.3 Composite internode intervals
The multifurcating skyline plot (8) is a piecewise constant function on inter-coalescent intervals and there can be a large number of such intervals when the number of tips is large, potentially leading to very noisy estimates. To mitigate this over-fitting, we can use the same interval-merging technique as that employed by the ‘generalized skyline plot’ (Strimmer and Pybus 2001). Given a threshold parameter , we consider all inter-coalescent intervals with length smaller than . We then join those intervals with their neighboring intervals earlier in time (closer to the root), starting with the interval closest to the tips. If the ensuing interval is still of length smaller than , we continue this procedure until there are no more intervals with length smaller than . Note that the degree of internal nodes is unchanged. This yields an interval partition of which we denote by . With this notation, is the number of such composite intervals, and for is the number of inter-coalescent intervals joined together to form .For each , we then independently estimate a single effective population size value for the whole composite interval , using (5). Due to the product form of the likelihood, it is easy to see that the maximum-likelihood estimator for the composite interval is the mean of the maximum-likelihood estimators (8) for each of the inter-coalescent intervals (note that the original ‘generalized skyline plot’ of Strimmer and Pybus (2001) uses a method-of-moments estimator, which leads to a harmonic mean rather than an arithmetic mean).Thus this composite-interval approach can compute an estimate of effective population size change with fewer parameters. As suggested in Strimmer and Pybus (2001), we can then optimize over by a model selection statistic such as the corrected Akaike Information Criterion (Hurvich and Tsai 1989):
where is the likelihood of the tree given the composite interval skyline estimate with parameter , as computed by (5).
3.4 Serially sampled trees
In some biological contexts, such as rapidly evolving pathogens or ancient DNA, sequences are sampled at different times and measurable amounts of sequence change occur between the sampling times (Drummond et al. 2003). As described in Rodrigo et al. (1999), coalescent estimates can be extended to this framework, by making the assumption that the sampling process is independent of the population dynamics. For the purpose of simplicity, in this paper we also make this independence assumption. We further assume that effective population size does not change at sampling times and that sampling never occurs at coalescence times.To be precise, let us consider an inter-coalescent interval , and assume we have n extant lineages at time . Assume we have sampling times , and for , let n(j) be the number of extant lineages on the interval ending with (with ). Finally, let be the number of lineages on the coalescence interval . Figure 5 provides a graphical example of a serially sampled tree and its associated indices.
Figure 5.
A serially sampled multifurcating tree, with 10 tips in total. There are coalescent events. During the inter-coalescent interval , the number of extant lineages is (going back in time) and . Using the composite-interval notation, assuming for example that , the ensuing interval partition would have a component, with corresponding to the two intervals and having been joined together. There would be intervals in the merged interval partition.
A serially sampled multifurcating tree, with 10 tips in total. There are coalescent events. During the inter-coalescent interval , the number of extant lineages is (going back in time) and . Using the composite-interval notation, assuming for example that , the ensuing interval partition would have a component, with corresponding to the two intervals and having been joined together. There would be intervals in the merged interval partition.When computing the likelihood of a serially sampled tree under a specified Λ-coalescent model, given an effective population size function, we need to take into account the intervals of the form and that end with a sampling event. On those intervals there are no coalescences, hence we need to consider the likelihood of no coalescences occurring during that time. The contribution to the likelihood of the inter-coalescent interval is then:Again, maximizing this in Ne gives the maximum-likelihood skyline estimator on the interval :
3.5 Inference using the multifurcating skyline plot
We evaluated the ability of the multifurcating skyline plot to estimate Ne by applying it to trees simulated under the Beta-coalescent process. We considered two different demographic histories (constant population size and exponential growth) and two values of the interval length threshold parameter . We used the same α parameter in both cases, namely α = 1.5. The results are shown in Fig. 6.
Figure 6.
Top row: individual trees simulated with 100 tips under a -coalescent, under constant (left) and exponential population size (right). Red lines represent the true effective population size function. Middle row: multifurcating skyline plots of the trees, with fixed and with . Bottom row: generalized multifurcating skyline plots, with threshold length parameter (left) and (right). Time is expressed in number of generations, and the estimated effective population size is in population size units.
Top row: individual trees simulated with 100 tips under a -coalescent, under constant (left) and exponential population size (right). Red lines represent the true effective population size function. Middle row: multifurcating skyline plots of the trees, with fixed and with . Bottom row: generalized multifurcating skyline plots, with threshold length parameter (left) and (right). Time is expressed in number of generations, and the estimated effective population size is in population size units.The results are in line with well-known properties of skyline estimators (Strimmer and Pybus 2001; Drummond et al. 2005): they recover fluctuations in effective population size reasonably well, at the cost of being quite noisy. As expected, noise decreases as the interval-merging parameter increases (Fig. 6).
3.6 Estimation of superspreading parameters
To demonstrate the potential usefulness of the multifurcating skyline framework, we generated phylogenetic trees using a previously published individual-based simulation model (Li, Grassly, and Fraser 2017) that implements a well-known model of epidemiological superspreading (Lloyd-Smith et al. 2005). These simulated trees were then analyzed using the Λ-coalescent model presented above.The Lloyd-Smith superspreading model assumes that each infectious individual i gives rise to a Poisson-distributed number Z of secondary infections, with random parameter ν. These ν are in turn distributed according to a Gamma distribution with shape and scale k, where is the population-level basic reproductive number, and k > 0 is a parameter characterizing the degree of superspreading prevalent in the epidemic. This gives rise to the following (negative binomial) individual transmission distribution:
for . As a consequence, each infectious individual transmits the infection to an average of individuals, with a variance . Hence, for a given value of R0, low values of k correspond to highly variable onward transmission, such that some individuals (the so-called superspreaders) are responsible for a disproportionate number of transmission events. The Lloyd-Smith model has been fitted to observed secondary cases obtained from real epidemics using contact tracing. Estimated parameter values range from for the 2003 SARS outbreak in Singapore to for an Ebola outbreak in Uganda. Generally, k values lower than 1 are considered to be evidence of high levels of superspreading.In order to explore the effect of superspreading dynamics on multifurcating sample phylogenies, we simulated trees under the Lloyd-Smith model using the EpiGenR R package (Li, Grassly, and Fraser 2017). This package tracks the transmission history of an infected population within a host population of constant size N. Each infectious individual remains infectious for an exponentially distributed time, then gives rise to a number of secondary cases distributed according to (12). Details of the implementation can be found in the Supplementary Data section of Li, Grassly, and Fraser (2017).Github: https://github.com/lucymli/EpiGenRWe investigated three superspreading scenarios, namely k = 0.15, k = 0.37, and k = 5, which represent high, moderate, and low levels of superspreading, respectively. These values are consistent with the published parameter estimates for SARS and Ebola outbreaks (Lloyd-Smith et al. 2005; Lau et al. 2017). We chose the same value of R0 =2.5 for all three scenarios. In each scenario, we simulated 100 epidemics in a host population of size N = 10,000. A fraction s of the tips in the resulting simulated complete transmission trees were then subsampled randomly to generate sample phylogenies. This sampling was undertaken uniformly across all tips of the tree, so that most tips were sampled during the times where prevalence is high. We explored four values of sampling intensity; s = 0.01, 0.05, 0.1, and 0.5.We then estimated the parameter of the -coalescent using maximum-likelihood estimation, accounting for demographic variation by using the multifurcating skyline plot as introduced above.The results of this simulation analysis are shown in Fig. 7. In all sampling regimes, we find a positive relationship between the superspreading parameter k and the α parameter of the -coalescent. Even for low sampling (s = 0.01, corresponding to trees with ≤100 tips), estimated α was significantly higher when k = 5 than when k = 0.15 and 0.37. This shows that the timing of coalescent events in multifurcating Λ-coalescent trees contains information about the extent of superspreading in an infectious population. This is consistent with the findings of Li, Grassly, and Fraser (2017), who used particle filtering techniques to show that there is information about transmission heterogeneity in estimated phylogenies that is not present in the time series data alone. Furthermore, Fig. 7 shows the multifurcating phylogenetic models used here perform adequately under high levels of sampling. Indeed, by sampling a large portion of the infectious population, there is an increased probability of seeing deep and large multifurcations in the phylogeny, which likely drives down estimates of α.
Figure 7.
Estimated for phylogenies simulated under the Lloyd-Smith superspreading model, for R0=2.5 and k = 0.15, 0.37, and 5, respectively. Increasing levels of tip sampling are shown: (a) s = 0.01, (b) s = 0.05, (c) s = 0.1 and (d) s = 0.5.
Estimated for phylogenies simulated under the Lloyd-Smith superspreading model, for R0=2.5 and k = 0.15, 0.37, and 5, respectively. Increasing levels of tip sampling are shown: (a) s = 0.01, (b) s = 0.05, (c) s = 0.1 and (d) s = 0.5.
4. Discussion
We have shown that Λ-coalescents can be used to infer demographic trends from multifurcating trees, extending the range of ‘skyline plot’-like estimators beyond binary trees and the standard Kingman coalescent. Using simulated data, we showed that the α parameter of the family of Λ-coalescents can be accurately estimated, which interpolates between the Kingman coalescent (α = 2) and the Bolthausen–Sznitman coalescent . We introduce the multifurcating skyline plot, which estimates effective population size from time-scaled nonbinary trees, the tips of which may be sampled either longitudinally or concurrently. We validated this estimation method on simulated trees.We applied the framework of Λ-coalescents to trees simulated using two different approaches, including an individual-based simulation of epidemic superspreading. Crucially, we demonstrated that the temporal distribution of bifurcations and multifurcations in a sample tree contains information about the extent of superspreading. However, the simulation studies presented here neglect the role played by phylogenetic uncertainty, as we assumed perfect knowledge of the transmission trees. Further work is required to implement the inference of multifurcating trees from sequence alignments in Bayesian phylogenetic frameworks such as BEAST. This will necessitate the definition of tree operators capable of exploring the larger space of nonbinary trees; whether this can be done without affecting computational performance remains an open question. Joint evaluation of molecular clock phylogenetic likelihoods and multifurcating tree prior probabilities has the potential to discriminate between genuine multifurcations, and short tree branches on which no mutations are observed.Several popular Bayesian implementations of the skyline plot approach exist, which treat the skyline plot likelihood as a ‘prior distribution’ on trees. These methods include the ‘Bayesian skyline plot’ (Drummond et al. 2005), which uses the composite-interval procedure described in this paper to reduce noise, and the skyride (Minin, Bloomquist, and Suchard 2008) and skygrid (Gill et al. 2013) plots, which use sophisticated, time-aware, smoothing procedures to penalize population size changes. We expect similar approaches could be applied to the multifurcating skyline plot. Recent work (Möller, du Plessis, and Stadler 2018) has illustrated that mis-specification of the tree prior during Bayesian phylogenetic inference may lead to erroneous estimation of other parameters, highlighting the need to implement a wider range of computationally tractable tree priors. The multifurcating skyline plot may help to address this deficit. We hope it will prove useful in the analysis of populations that exhibit patterns of propagation leading to true polytomies in their phylogeny, such as superspreading (as presented here), strong selective regimes (Neher and Hallatschek 2013), oversampling (Bhaskar, Clark, and Song 2014) or even adaptive radiation (Schluter 2000). The estimated parameters of the Λ distribution might be informative about these propagation processes, which are not taken into account in current phylodynamic approaches. However, we accept the possibility that the -coalescents used here are not the best models for multifurcating phylodynamic inference and that other Λ-coalescent families can be potentially used to estimate parameters of epidemiological interest.Although skyline plot methods based on the Kingman coalescent are commonplace in phylodynamic analysis, the actual values of estimated effective population size are not often interpreted directly. In some cases, ‘true’, census population size and effective population size are related only through nonlinear relations. For example, it was shown (Frost and Volz 2010; Volz 2012) that under SIR-type epidemiological models (Kingman) coalescent effective population size through time evolves proportionally to , where is the rate at which susceptible individuals become infected (typically, ). In this case, the relationship between effective population size and disease incidence I at time t is different at the beginning and end of the epidemic. The multifurcating skyline methods will pose similar problems for interpretation: the actual value of the estimate is affected by the Λ distribution in a nonlinear (and not completely understood) way. It is thus necessary to caution against improper interpretations of the Λ-effective population size, as defined in this paper and as computed by the multifurcating skyline plot. Other authors (Hall, Woolhouse, and Rambaut 2016) have already pointed out the dangers of thinking about effective population size as a number of individuals and have advised to use it as a measure of genetic diversity of the population instead. Such advice should also be heeded in the context of multifurcating coalescents, for which the relationship between census population size and effective population size is even more complicated.