The probability distribution of the ancestral population size conditioned on the reconstructed phylogenetic tree with occurrence data.

Marc Manceau¹, Ankit Gupta², Timothy Vaughan², Tanja Stadler³.

Abstract

We consider a homogeneous birth-death process with three different sampling schemes. First, individuals can be sampled through time and included in a reconstructed phylogenetic tree. Second, they can be sampled through time and only recorded as a point 'occurrence' along a timeline. Third, extant individuals can be sampled and included in the reconstructed phylogenetic tree with a fixed probability. We further consider that sampled individuals can be removed or not from the process, upon sampling, with fixed probability. We derive the probability distribution of the population size at any time in the past conditional on the joint observation of a reconstructed phylogenetic tree and a record of occurrences not included in the tree. We also provide an algorithm to simulate ancestral population size trajectories given the observation of a reconstructed phylogenetic tree and occurrences. This distribution can be readily used to draw inferences about the ancestral population size in the field of epidemiology and macroevolution. In epidemiology, these results will allow data from epidemiological case count studies to be used in conjunction with molecular sequencing data (yielding reconstructed phylogenetic trees) to coherently estimate prevalence through time. In macroevolution, it will foster the joint examination of the fossil record and extant taxa to reconstruct past biodiversity.

Keywords: Birth-death process; Epidemiology; Fossilized birth-death model; Macroevolution; Phylogenetics

Year: 2020 PMID: 32739241 PMCID: PMC7733867 DOI: 10.1016/j.jtbi.2020.110400

Source DB: PubMed Journal: J Theor Biol ISSN: 0022-5193 Impact factor: 2.691

Introduction

Owing to seminal papers by Yule (1925), Kendall (1948), and much later by Nee et al. (1994), birth-death models have become ubiquitous in evolutionary biology. They are used as a population dynamic model, parameterized via a birth and death rate, in studies spanning fields as diverse as paleontology, macroevolution, linguistics, and epidemiology (see e.g. Foote, 2000, Heath et al., 2014, Gray et al., 2009, Stadler et al., 2013). A major aim when using these models is to reliably estimate the ancestral number of species, languages or infected individuals, i.e. past biodiversity, past prevalence, or more generally past population sizes.

In both macroevolution and epidemiology, population dynamics inferences can rely on occurrence data, i.e. the fossil record and the case counts record. This data is modeled as a sampling of individuals from the full population through time (Foote, 2000, Starrfelt and Liow, 2016).

In recent years, impressive sequencing efforts targeting present-day species and pathogens have enabled the reconstruction of phylogenies. Two main modeling approaches allow past population sizes to be quantified using these trees. First, phylodynamic tools have been developed to fit the birth and death rates of a birth-death process on the reconstructed phylogenetic tree of interest, while integrating over past population sizes (Stadler, 2011, Morlon et al., 2011). In order to quantify past population sizes, typically the expected population sizes based on these estimated birth and death rates are calculated (Morlon et al., 2011, Ratmann et al., 2016, Billaud et al., 2019). Thus, such population sizes are not directly conditioned on the reconstructed phylogenetic tree. Instead, the statistical signal in the tree is only used to compute rate estimates. Second, phylodynamic tools have been developed to fit the expected population size of a coalescent model on a reconstructed phylogenetic tree. This modeling approach may appear as a better alternative, for it is directly parametrized with the population size that we wish to estimate. However, this comes at the cost of ignoring stochastic fluctuations in small populations (Morlon et al., 2010, Ratmann et al., 2016).

Statistical approaches stemming from the analysis of case count data or from the analysis of reconstructed evolutionary trees have been part of separate bodies of work for many years, historically yielding conflicts between biodiversity estimates based on the fossil record and estimates based on reconstructed phylogenies of extant taxa (Quental and Marshall, 2010 but see also Morlon et al., 2011).

A first path towards merging these disparate data was introduced by the fossilized birth-death model of Stadler (2010), which considered a birth-death model with sampling and inclusion of individuals in the tree through time. This allowed taking into account infection trees reconstructed from pathogen sequences sampled throughout an epidemic (Stadler et al., 2011). In macroevolution, it paved the way to more precise phylogenetic dating using well-conserved fossil taxa which could be placed on a reconstructed phylogeny using morphological characters (Gavryushkina et al., 2016). Not so well-conserved fossils (i.e. occurrences) have also been used with this model, using a Markov Chain Monte Carlo (MCMC) scheme to integrate over all possible placements along a fixed tree (Heath et al., 2014). Analytical developments around this new model have been made by Gupta et al. (2019), who derived an analytical formula for the probability density of an outcome of the process, which consists of a reconstructed phylogenetic tree along with a record of occurrences. Again, none of these methods quantifies population sizes directly: they estimate birth and death rates while analytically integrating over population sizes.

Very recently, Vaughan et al. (2019) introduced a Monte-Carlo particle filtering algorithm allowing direct quantification of past population sizes and birth and death rates conditioned on reconstructed phylogenetic trees and occurrences (see Andrieu et al., 2010 for details about particle filtering methods). As such, it can produce more accurate population size estimates than the methods mentioned above, as the estimates directly condition on all data, i.e. the occurrence record (e.g. poorly preserved fossils, or case count epidemiological record) and the reconstructed phylogenetic tree.

In this paper, we build on the analytical developments presented by Gupta et al. (2019) to calculate the past population size distribution as originally targeted by Vaughan et al. (2019). Our approach here is more analytic, leading to much faster numerical calculations compared to the particle filtering method previously developed. The efficiency of our method paves the way towards considering much bigger datasets, and towards extending the method to multi-type or density-dependent birth-death processes.

In Section 2, we present the model, notation, and an overview of the strategy to express the targeted distribution. In Section 3, we adapt the main results of Gupta et al. (2019) to compute the probability density of observations made after a given time, conditioned on the past population size. In Section 4, we provide a way to compute the joint density of the past population size and observations made before a given time. Combining results of Sections 3 and 4 in Section 5, we compute the distribution of past population sizes conditional on the full outcome of the process, and perform sanity checks against previously published methods achieving similar tasks (Stadler, 2010, Vaughan et al., 2019, Gupta et al., 2019). We finally discuss applications and potential extensions of the model.

Model and notation

Parameters of the process

We consider a population of individuals, any of which can give birth to another individual at rate $\lambda$ or die at rate $\mu$. The process starts at time $t_{or}$ in the past with one individual, and evolves until reaching present time 0, i.e. time is oriented from the present towards the past. In the rest of the manuscript, something happening at time t will thus always refer to an event taking place t units before present. We superimpose on this background population dynamics three different sampling schemes. First, individuals can be $\psi$-sampled at rate $\psi$ throughout their lifetime. When $\psi$-sampled, the individual will be included in the reconstructed phylogenetic tree. Second, individuals can be $\omega$-sampled at rate $\omega$ throughout their lifetime. When $\omega$-sampled, the individual is not included in the reconstructed phylogenetic tree, but its sampling time is nevertheless recorded and called ‘an occurrence’. Last, the process finishes upon reaching the present time 0, and each extant individual at that time is $\rho$-sampled with fixed probability $\rho$, leading to their inclusion in the reconstructed phylogenetic tree. The sum of all per-capita rates will be called for short $\gamma := \lambda + \mu + \psi + \omega$. Following Vaughan et al. (2019), we also include in the model an effect of the $\psi$- and $\omega$-sampling through time on the population dynamics. We consider that, upon sampling, an individual is either removed from the process with probability $r$, or is unaffected by the sampling with probability $1 - r$. The overall number of individuals, denoted $(I_t)$, thus follows a linear birth-death process with birth rate $\lambda$ and death rate $\mu + r(\psi + \omega)$. Note that, because the $\rho$-sampling step occurs here at the end of the process, it does not matter whether or not individuals are removed upon $\rho$-sampling.
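The process described above can be simulated directly, which is a convenient way to generate test data. The following is a minimal sketch, not the authors' simulator: it only tracks the population size and the sampling records (no genealogy), and the function name `simulate_process`, its argument names and the returned dictionary keys are assumptions made for illustration.

```python
import random

def simulate_process(lam, mu, psi, omega, rho, r, t_or, seed=None):
    """Simulate one realisation from the origin (t_or before present) to time 0.

    Returns the psi- and omega-sampling records as (time, removed) pairs, the
    population size at present and the number of rho-sampled individuals.
    """
    rng = random.Random(seed)
    gamma = lam + mu + psi + omega          # total per-capita event rate
    t, size = t_or, 1                       # time runs from t_or down to 0
    psi_samples, occurrences = [], []
    while size > 0 and gamma > 0:
        t -= rng.expovariate(gamma * size)  # waiting time to the next event
        if t <= 0:
            break                           # present reached before the event
        x = rng.random() * gamma            # choose the type of event
        if x < lam:
            size += 1                       # birth
        elif x < lam + mu:
            size -= 1                       # death
        else:
            removed = rng.random() < r      # removal upon sampling
            record = psi_samples if x < lam + mu + psi else occurrences
            record.append((t, removed))
            if removed:
                size -= 1
    # Each individual alive at present is rho-sampled independently.
    n_rho = sum(rng.random() < rho for _ in range(size))
    return {"psi_samples": psi_samples, "occurrences": occurrences,
            "size_at_present": size, "n_rho_sampled": n_rho}
```

Calling `simulate_process(1.0, 0.5, 0.3, 0.2, 0.6, 0.5, t_or=5.0, seed=42)` returns one realisation; reconstructing the tree itself would additionally require tracking parent-child relationships, which this sketch deliberately leaves out.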

Introducing useful probabilities

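Some aspects of this process have been previously investigated thoroughly. We now use two key probabilities. First, we will call $u(t, z)$ the probability that a process starting at time t with only one individual remains unsampled up to and including the present time (time 0). We recall that $u$ satisfies the ordinary differential equation (ODE) (Maddison et al., 2007)

$$\frac{\partial u}{\partial t}(t, z) = \mu - \gamma\, u(t, z) + \lambda\, u(t, z)^2. \tag{2.1}$$

The solution of this ODE for a particular initial condition $u(0, z) = z$ is

$$u(t, z) = \frac{x_1 (x_2 - z) - x_2 (x_1 - z)\, e^{-\sqrt{\Delta}\, t}}{(x_2 - z) - (x_1 - z)\, e^{-\sqrt{\Delta}\, t}}, \tag{2.2}$$

where $x_1$ and $x_2$ are the two roots of the polynomial $\lambda x^2 - \gamma x + \mu$,

$$x_1 = \frac{\gamma - \sqrt{\Delta}}{2\lambda}, \qquad x_2 = \frac{\gamma + \sqrt{\Delta}}{2\lambda}, \qquad \Delta = \gamma^2 - 4\lambda\mu.$$

Second, we call $p(t, z)$ the probability that a process starting at time t with one individual leads to precisely one sampled individual at present time 0. Writing the ODE governing the evolution of this quantity leads to

$$\frac{\partial p}{\partial t}(t, z) = \big(2\lambda\, u(t, z) - \gamma\big)\, p(t, z). \tag{2.3}$$

The solution of this ODE with initial condition $p(0, z) = 1 - z$ is

$$p(t, z) = \frac{(1 - z)(x_2 - x_1)^2\, e^{-\sqrt{\Delta}\, t}}{\big[(x_2 - z) - (x_1 - z)\, e^{-\sqrt{\Delta}\, t}\big]^2}. \tag{2.4}$$

These formulas are well known, and correspond respectively to quantities called $p_0$ and $p_1$ in Stadler (2010). When $z = 1 - \rho$, we will drop the dependence on z and use the shorter notation $u_t$ and $p_t$. We recall standard ways to derive these expressions in Appendix A.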
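The two probabilities can be evaluated numerically from their closed forms. The sketch below assumes the standard solution of the Riccati equation $\dot u = \mu - \gamma u + \lambda u^2$ with $u(0) = z$ and of the linear equation $\dot p = (2\lambda u - \gamma) p$ with $p(0) = 1 - z$; the function names `u` and `p` and the helper `_roots` are chosen here for illustration and are not taken from any published implementation.

```python
import math

def _roots(lam, mu, psi, omega):
    """Roots x1 <= x2 of lam*x^2 - gamma*x + mu, and sqrt of the discriminant."""
    gamma = lam + mu + psi + omega
    sqrt_delta = math.sqrt(gamma * gamma - 4.0 * lam * mu)
    x1 = (gamma - sqrt_delta) / (2.0 * lam)
    x2 = (gamma + sqrt_delta) / (2.0 * lam)
    return x1, x2, sqrt_delta

def u(t, z, lam, mu, psi, omega):
    """Probability that one individual at time t leaves no sampled descendant,
    when individuals alive at present are left unsampled with probability z."""
    x1, x2, sd = _roots(lam, mu, psi, omega)
    e = math.exp(-sd * t)
    return (x1 * (x2 - z) - x2 * (x1 - z) * e) / ((x2 - z) - (x1 - z) * e)

def p(t, z, lam, mu, psi, omega):
    """Probability that one individual at time t leaves exactly one sampled
    individual at present and no sample through time."""
    x1, x2, sd = _roots(lam, mu, psi, omega)
    e = math.exp(-sd * t)
    return (1.0 - z) * (x2 - x1) ** 2 * e / ((x2 - z) - (x1 - z) * e) ** 2
```

With `z = 1 - rho` these give $u_t$ and $p_t$. The sketch requires $\lambda > 0$ and does not handle the degenerate case $x_1 = x_2$, which only arises when $\psi + \omega = 0$ and $\lambda = \mu$.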

Strategy of the paper

The process with sampling leads to the observation of two distinct objects illustrated in Fig. 1.

Fig. 1

General setting of the method. a) the full process with sampling. Pink dots translate as dots in $\mathcal{O}$ and correspond to $\omega$-sampling (sampling through time without sequencing). Blue dots translate as dots in $\mathcal{T}$ and correspond to $\psi$-sampling (sampling through time with sequencing). Yellow dots correspond to all present-day $\rho$-sampling events. Filled or unfilled dots correspond respectively to sampling with or without removal. b) Population size through time. c) Observed occurrences through time. d) Reconstructed phylogenetic tree. e) Number of individuals in reconstructed phylogenetic tree through time. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The reconstructed phylogenetic tree $\mathcal{T}$, on the one hand, represents the evolutionary relationships between all $\psi$-sampled and $\rho$-sampled individuals. We further consider that $\psi$-sampled individuals are labeled either as ‘removed’ or ‘non-removed’. All $\psi$-sampled removed individuals are necessarily leaves of $\mathcal{T}$, whereas $\psi$-sampled non-removed ones can either stand as leaves (when the descent of the individual is not sampled) or as vertices along a branch (when the descent of the individual is further sampled), in which case they are referred to as sampled ancestors. The record of occurrences $\mathcal{O}$, on the other hand, is an ordered list of all $\omega$-sampling times. We also consider that these sampling times are labeled as either ‘removed’ or ‘non-removed’.

In this paper, we are interested in computing the probability distribution of the number of individuals in the past, conditioned on the observed outcome of the process. If $k_t$ denotes the number of sampled lineages in $\mathcal{T}$ at time t, we call $K_t$ our target distribution,

$$K_t^{(i)} := \mathbb{P}\big(I_t = k_t + i \mid \mathcal{T}, \mathcal{O}\big), \qquad i \geq 0.$$

We will refer to epochs as the maximal time slices within which no sampling event in $\mathcal{O}$, nor branching event in $\mathcal{T}$, happened. These epochs are delimited by the union of sampling times in $\mathcal{O}$, branching times in the tree $\mathcal{T}$, and sampling times of leaves and sampled ancestors in $\mathcal{T}$. All pooled together, we call these ordered times $(t_h)_{h=0}^{n}$, starting at present time $t_0 = 0$ and ending at the origin time $t_n = t_{or}$. At any time $t$ we also introduce $\mathcal{T}_t^{\downarrow}$ and $\mathcal{O}_t^{\downarrow}$, the parts of the tree and of the record of occurrences observed between time t and the present (below t), and $\mathcal{T}_t^{\uparrow}$ and $\mathcal{O}_t^{\uparrow}$, the parts observed between the origin and time t (above t).

The general strategy, and outline, of the paper is the following. We will traverse the tree and record of occurrences breadth-first, i.e. level-by-level through time. In a backward traversal we will compute the probability density of observations made between time t and 0 conditioned on the population size at time t. We call this probability density,

$$L_t^{(i)} := \mathbb{P}\big(\mathcal{T}_t^{\downarrow}, \mathcal{O}_t^{\downarrow} \mid I_t = k_t + i\big).$$

In a forward traversal we will then compute the joint probability density of the observations made prior to time t and the population size at time t. We call this density,

$$M_t^{(i)} := \mathbb{P}\big(I_t = k_t + i,\ \mathcal{T}_t^{\uparrow}, \mathcal{O}_t^{\uparrow}\big).$$

Provided we get expressions of $L_t$ and $M_t$, our target distribution $K_t$ can then be expressed by combining both, noting that

$$K_t^{(i)} = \frac{\mathbb{P}\big(I_t = k_t + i, \mathcal{T}, \mathcal{O}\big)}{\mathbb{P}(\mathcal{T}, \mathcal{O})} = \frac{\mathbb{P}\big(\mathcal{T}_t^{\downarrow}, \mathcal{O}_t^{\downarrow} \mid I_t = k_t + i, \mathcal{T}_t^{\uparrow}, \mathcal{O}_t^{\uparrow}\big)\, M_t^{(i)}}{\mathbb{P}(\mathcal{T}, \mathcal{O})} = \frac{L_t^{(i)} M_t^{(i)}}{\sum_{j \geq 0} L_t^{(j)} M_t^{(j)}},$$

where the last line holds because, conditionally on $I_t$, the future of the (Markov) process is independent of what happened before. In the process of getting the probability density of $(\mathcal{T}, \mathcal{O})$ under the same model, Gupta et al. (2019) provided an analytical formula and an algorithm to compute the first ingredient $L_t$ in the case where all individuals are removed upon sampling (i.e. $r = 1$). We thus recall their main result, and adapt it to our slightly different framework, in the next section.

Calculation of $L_t$ – The density of observations below t conditioned on past population size

We start this section by presenting the ODEs satisfied by the probability density $L_t$. This provides us with a numerical algorithm to compute $L_t$, which we subsequently simplify with analytical results for specific sets of parameters.

Set of ODEs satisfied by $L_t$

We can derive the probability density $L_t$ by studying its evolution through time. First, observe that we can express $L_0$ at present time 0. Indeed, provided we know the exact number of individuals living at time 0, the probability to see the $k_0$ tips of the tree is directly driven by the $\rho$-sampling,

$$L_0^{(i)} = \rho^{k_0} (1 - \rho)^{i}. \tag{3.1}$$

We now derive the ODE driving the evolution of $L_t$ through time across any given epoch. We consider an infinitesimal time step $\delta t$ and list the events which could have happened in the full process between $t + \delta t$ and t, leading to our observations. Suppose the number of observed lineages in this epoch is k, and the total number of individuals alive is $k + i$. We emphasize three cases, illustrated in Fig. 2:

Fig. 2

Four unobservable scenarios taken into account to derive the ODEs (3.2), (4.1).

(i) nothing happened, with probability $1 - \gamma (k + i)\, \delta t$;

(ii) a birth event happened, either among the k sampled lineages in $\mathcal{T}$, in which case it leads to an extinct or unsampled subtree to the left or to the right, with probability $2 \lambda k\, \delta t$, or among the i other individuals, with probability $\lambda i\, \delta t$;

(iii) a death event happened among the i particles, with probability $\mu i\, \delta t$.

These allow us to write,

$$L_{t + \delta t}^{(i)} = \big(1 - \gamma (k + i)\, \delta t\big) L_t^{(i)} + \lambda (2k + i)\, \delta t\, L_t^{(i+1)} + \mu i\, \delta t\, L_t^{(i-1)} + o(\delta t).$$

Note that for $i = 0$, $L_t^{(i-1)}$ is not defined, but the term cancels out thanks to the factor i. Subtracting $L_t^{(i)}$ from both sides, dividing by $\delta t$ and letting $\delta t \to 0$, we get the following set of ODEs driving the evolution of $L_t$,

$$\frac{d L_t^{(i)}}{dt} = -\gamma (k + i) L_t^{(i)} + \lambda (2k + i) L_t^{(i+1)} + \mu i L_t^{(i-1)}. \tag{3.2}$$

Last, we need to study how $L_t$ changes at punctual events. We call unsampled lineages the lineages that do not appear on the reconstructed phylogenetic tree, i.e. have not been $\psi$- or $\rho$-sampled. Note that these unsampled lineages might still be subject to $\omega$-sampling events. There are 6 types of punctual events that we can come across at time t in the past, listed below and illustrated in Fig. 3. We denote $L_{t^+}$ the probability just before (i.e. up) the punctual event and $L_{t^-}$ the probability immediately after (i.e. down). One directly gets $L_{t^+}$ by decomposing it into what must occur below the event, $L_{t^-}$, multiplied by the rate of the specific event happening on an infinitesimal time window around t. We can either find,

Fig. 3

Six observable punctual events in the data.

(i) a leaf of $\mathcal{T}$, labeled as removed. This is a $\psi$-sampling with removal event for which the number of unsampled lineages remains constant, and the number of sampled lineages increases by one (going backward in time). It thus gives,

$$L_{t^+}^{(i)} = \psi r\, L_{t^-}^{(i)}. \tag{3.3}$$

(ii) a leaf of $\mathcal{T}$, labeled as non-removed. This is a $\psi$-sampling without removal event for which one of the unsampled lineages becomes a sampled one (going backward in time). It thus gives,

$$L_{t^+}^{(i)} = \psi (1 - r)\, L_{t^-}^{(i+1)}. \tag{3.4}$$

(iii) a sampled ancestor along a branch of $\mathcal{T}$, necessarily labeled as non-removed. This is a $\psi$-sampling without removal event, not impacting the number of sampled or unsampled lineages. It thus gives,

$$L_{t^+}^{(i)} = \psi (1 - r)\, L_{t^-}^{(i)}. \tag{3.5}$$

(iv) an occurrence in $\mathcal{O}$, labeled as removed. This is an $\omega$-sampling with removal event, for which the number of unsampled lineages increases by one (going backward in time). It thus gives,

$$L_{t^+}^{(i)} = \omega r\, i\, L_{t^-}^{(i-1)}. \tag{3.6}$$

Note that here also, for $i = 0$, $L_{t^-}^{(i-1)}$ is not defined but the term cancels out thanks to the factor i.

(v) an occurrence in $\mathcal{O}$, labeled as non-removed. This is an $\omega$-sampling without removal event, not impacting the number of sampled or unsampled lineages. With k the number of sampled lineages at time t, it thus gives,

$$L_{t^+}^{(i)} = \omega (1 - r)(k + i)\, L_{t^-}^{(i)}. \tag{3.7}$$

(vi) a branching event between two branches of $\mathcal{T}$. The number of sampled lineages decreases by one (going backward in time). It thus gives,

$$L_{t^+}^{(i)} = \lambda\, L_{t^-}^{(i)}. \tag{3.8}$$

Note that these updates can be adapted to the case when we do not observe the removal status of individuals. The update corresponding to a leaf of $\mathcal{T}$ is the sum of updates (3.3), (3.4), the update corresponding to an occurrence event is the sum of updates (3.6), (3.7), while updates (3.5), (3.8) are unchanged.

This set of ODEs (3.2) together with update Eqs. (3.3), (3.4), (3.5), (3.6), (3.7), (3.8) can be numerically approximated. To do so, we fix a finite upper bound N on the number of hidden individuals and numerically integrate a truncated ODE system. We detail this in the following algorithm to compute an approximation of $L_t$ at any time t.

We also define a slight variation of this algorithm, that we will refer to as Algorithm 1’, where no set of time points is required, and the values of $L_t$ are not recorded through time (i.e. matrix B disappears). Instead, when reaching $t_{or}$ we simply return $L_{t_{or}}^{(0)}$, which by definition is an estimate of the probability density of $(\mathcal{T}, \mathcal{O})$. Note that this strategy is identical to what has been used to compute the probability density of a reconstructed phylogenetic tree under a logistic birth-death process (Leventhal et al., 2013). These two algorithms will prove useful to deal with the general case. Furthermore, we may obtain analytical expressions for $L_t$ when $\omega = 0$ as well as when $r = 1$ (Gupta et al., 2019). We reveal these in the next two subsections.
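The truncated backward scheme can be sketched in a few functions. This is an illustrative sketch rather than the paper's Algorithm 1: it assumes the within-epoch system $\dot L^{(i)} = -\gamma(k+i)L^{(i)} + \lambda(2k+i)L^{(i+1)} + \mu i L^{(i-1)}$ with $L^{(N+1)}$ set to 0, uses a fixed-step Runge-Kutta integrator instead of a library solver, and the names `dL_dt`, `integrate_epoch`, `apply_event` and the event labels are hypothetical.

```python
def dL_dt(L, k, lam, mu, gamma):
    """Right-hand side of the truncated ODE system for L^(0..N) in one epoch."""
    n = len(L)
    out = []
    for i in range(n):
        above = L[i + 1] if i + 1 < n else 0.0   # truncation: L^(N+1) := 0
        below = L[i - 1] if i > 0 else 0.0
        out.append(-gamma * (k + i) * L[i]
                   + lam * (2 * k + i) * above
                   + mu * i * below)
    return out

def integrate_epoch(L, k, duration, lam, mu, gamma, steps=2000):
    """Classical RK4 integration of L backward in time over one epoch."""
    h = duration / steps
    L = list(L)
    for _ in range(steps):
        k1 = dL_dt(L, k, lam, mu, gamma)
        k2 = dL_dt([a + 0.5 * h * b for a, b in zip(L, k1)], k, lam, mu, gamma)
        k3 = dL_dt([a + 0.5 * h * b for a, b in zip(L, k2)], k, lam, mu, gamma)
        k4 = dL_dt([a + h * b for a, b in zip(L, k3)], k, lam, mu, gamma)
        L = [a + h / 6.0 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(L, k1, k2, k3, k4)]
    return L

def apply_event(L, k, event, lam, psi, omega, r):
    """Update L across one punctual event; k is the number of sampled
    lineages at the event time (only used for non-removed occurrences)."""
    n = len(L)
    if event == "leaf_removed":
        return [psi * r * x for x in L]
    if event == "leaf_not_removed":
        return [psi * (1 - r) * (L[i + 1] if i + 1 < n else 0.0) for i in range(n)]
    if event == "sampled_ancestor":
        return [psi * (1 - r) * x for x in L]
    if event == "occurrence_removed":
        return [omega * r * i * L[i - 1] if i > 0 else 0.0 for i in range(n)]
    if event == "occurrence_not_removed":
        return [omega * (1 - r) * (k + i) * x for i, x in enumerate(L)]
    if event == "branching":
        return [lam * x for x in L]
    raise ValueError("unknown event: " + event)
```

A full traversal would start from `[rho**k0 * (1 - rho)**i for i in range(N + 1)]`, then alternate `integrate_epoch` and `apply_event` while updating `k` at leaves and branching events. Within the first epoch the result can be checked against the product $p_t^{k} u_t^{i}$, which solves the same system.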

Special case $\omega = 0$

Suppose we can express $L_t^{(i)}$ as the product of $u_t^{i}$, where $u_t$ is defined as in Eq. (2.2), and of a factor that is a function of time only. We first get, from the initialization in Eq. (3.1), that this factor equals $\rho^{k_0}$ at time 0. Moreover, substituting this product in the ODE (3.2) and using the ODE (2.1) satisfied by $u_t$ leads to an ODE for the time-dependent factor, on any epoch where the number of sampled lineages remains fixed and equal to k. This is very close to the ODE (2.3) governing the evolution of $p_t$, and it implies that the factor evolves proportionally to $p_t^{k}$ across the epoch (Eq. (3.9), see derivation in Appendix A). Last, because $\omega = 0$, updates (3.3), (3.4), (3.5), (3.6), (3.7), (3.8) simplify to only the following $\psi$- and $\lambda$-events (Eqs. (3.10), (3.11), (3.12), (3.13)): going backward in time, the factor is multiplied by $\psi r$ at a removed leaf, by $\psi (1 - r)\, u_t$ at a non-removed leaf, by $\psi (1 - r)$ at a sampled ancestor, and by $\lambda$ at a branching event. Combining these updates with Eq. (3.9) leads to the following proposition.

Proposition 3.1. When $\omega = 0$, at any time t, considering that we observed so far, i.e. on $(0, t)$, v sampled ancestors, w removed leaves, x branching events and y non-removed leaves, $L_t^{(i)}$ is given in closed form as a product of powers of $\lambda$, $\psi$, $r$ and $1 - r$ and of values of $u$ and $p$ taken at time t and at the event times.

We prove this proposition by induction across the epochs in Appendix E, using as the main arguments the equation updates (3.10), (3.11), (3.12), (3.13), combined with Eq. (3.9). Note that this proposition is very similar to what is presented in Section 3 by Gupta et al. (2019). We nevertheless need to highlight two differences. The first one is that we allow here for removal or not of the individual upon sampling, with a given probability r, whereas Gupta et al. (2019) considered that all individuals were removed upon sampling ($r = 1$), and Stadler (2010) considered that individuals were not removed upon sampling ($r = 0$). The second difference concerns the underlying framework under which we derive our results. In Gupta et al. (2019), individuals were distinguishable (say, each one is assigned a number and they can be ordered), whereas in the present paper they are not. When individuals are ordered, the probability density is changed by a combinatorial factor counting the number of ordered configurations of hidden individuals.

Note that, when reaching the origin of the tree, the formula in Proposition 3.1 reduces to a very similar formula for the probability density of $\mathcal{T}$ because $k_{t_{or}} = 1$ and $i = 0$. We summarize this as the following corollary.

Corollary 3.1.1. When $\omega = 0$, the probability density of a reconstructed tree $\mathcal{T}$ with v sampled ancestors, w removed leaves, y non-removed leaves and x branching events is obtained in closed form. It directly follows from Proposition 3.1, by noting that $\mathbb{P}(\mathcal{T}) = L_{t_{or}}^{(0)}$. Note also that a rooted binary tree with n leaves necessarily shows $n - 1$ branching times. Note that this formula is a straightforward generalization of formulas provided in Stadler (2010) (where $r = 0$) or Stadler et al. (2011) (where $r = 1$).

Special case $r = 1$

When $r = 1$, only three kinds of punctual events, corresponding to updates (3.3), (3.6), (3.8), need to be taken into account. Because the number of unsampled individuals i goes into formula (3.6), the simple product form used when $\omega = 0$ cannot be considered anymore, and one needs to find another expression. This has already been done in Gupta et al. (2019) and we only need to adapt their result to our slightly different framework.

Proposition 3.2. When $r = 1$, we can compute the values $L_t^{(i)}$ at any time t from a q-dimensional time-varying vector which can be computed following Algorithm 2 in Gupta et al. (2019).

The proof relies on the definition of a distinguishable version of the probability $L_t^{(i)}$, which allows us to use results previously derived in Gupta et al. (2019). Details are provided in Appendix B. Note that when there is no $\omega$-sampling, this expression reduces to the one obtained in the previous section.

This ends our section on the computation of $L_t$. It thus remains to (i) present a way to compute $M_t$ and (ii) combine $L_t$ and $M_t$ to get the target distribution $K_t$ at any time t. We do this in turn in the next two sections.
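As a worked illustration of the special case without occurrences, the following sketch writes out one closed form that is consistent with the within-epoch behaviour (a factor proportional to $p_t^{k}$) and with the four event updates for removed leaves, non-removed leaves, sampled ancestors and branching events. It is a reconstruction under stated assumptions, not a quotation of Proposition 3.1: the symbol $g_t$ and the sets $B_t$, $R_t$, $N_t$ (branching times, removed-leaf times and non-removed-leaf times in $(0, t)$) are names introduced here.

```latex
% Assumed names: g_t (time-only factor), B_t, R_t, N_t (sets of event times in (0,t)).
% v, w, x, y count sampled ancestors, removed leaves, branching events, non-removed leaves.
\begin{align*}
L_t^{(i)} &= g_t \, u_t^{\,i}, \\
g_t &= \lambda^{x}\, \psi^{\,v+w+y}\, r^{\,w}\, (1-r)^{\,v+y}\; p_t^{\,k_t}
       \prod_{s \in B_t} p_s \;
       \prod_{s \in R_t} \frac{1}{p_s} \;
       \prod_{s \in N_t} \frac{u_s}{p_s}, \\
\mathbb{P}(\mathcal{T}) &= L_{t_{or}}^{(0)} = g_{t_{or}}
       \qquad \text{(since } k_{t_{or}} = 1 \text{ and } i = 0\text{)}.
\end{align*}
```

Each factor can be traced to one step of the backward traversal: crossing a branching time $s$ lowers the exponent of $p$ by one and contributes $\lambda p_s$, crossing a leaf raises it by one and contributes $\psi r / p_s$ or $\psi (1-r) u_s / p_s$, and a sampled ancestor contributes $\psi (1-r)$. The $\rho$-sampling does not appear explicitly because $p_0 = \rho$ and $u_0 = 1 - \rho$ are absorbed in $p$ and $u$.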

Calculation of $M_t$ – the joint density of observations above t and past population size

Recall that we are now interested in computing the joint density of observations above time t and past population size at time t, i.e. $M_t^{(i)} = \mathbb{P}(I_t = k_t + i, \mathcal{T}_t^{\uparrow}, \mathcal{O}_t^{\uparrow})$. We start by presenting the ODEs satisfied by $M_t$, before turning to its resolution for specific parameter sets. The approach is very similar to the one presented in the previous section to compute $L_t$, with the slight difference that we will need to traverse the tree forward in time instead of backward in time. At the time of origin of the process $t_{or}$, we only observe one starting lineage in $\mathcal{T}$. This provides us with the following initialization condition on M,

$$M_{t_{or}}^{(i)} = \mathbb{1}_{i = 0}.$$

We then derive the ODEs driving the evolution of $M_t$ across an epoch on which the number of observed lineages is fixed and equal to k. Suppose we know $M_t$, and we observe no punctual event on the infinitesimal time interval $(t - \delta t, t)$. Unobservable events have already been illustrated in Fig. 2. It allows us to get

$$M_{t - \delta t}^{(i)} = \big(1 - \gamma (k + i)\, \delta t\big) M_t^{(i)} + \lambda (2k + i - 1)\, \delta t\, M_t^{(i-1)} + \mu (i + 1)\, \delta t\, M_t^{(i+1)} + o(\delta t).$$

Subtracting $M_t^{(i)}$ from both sides, multiplying by $-1$, dividing by $\delta t$ and letting $\delta t \to 0$, we get the following set of ODEs driving the evolution of $M_t$, with the convention $M_t^{(-1)} = 0$,

$$\frac{d M_t^{(i)}}{dt} = \gamma (k + i) M_t^{(i)} - \lambda (2k + i - 1) M_t^{(i-1)} - \mu (i + 1) M_t^{(i+1)}. \tag{4.1}$$

Last, we need to take into account the evolution of $M_t$ at punctual events. Again, there are 6 types of punctual events that we can come across at time t in the past, listed below and illustrated in Fig. 3. We denote $M_{t^-}$ the probability just after (i.e. below) the punctual event and $M_{t^+}$ the probability immediately before (i.e. up). Because we are here deriving forward in time, one needs to carefully note differences with results derived in Section 3 relating to the number of lineages before and after the event. We can indeed find the same punctual events, namely,

(i) a leaf of $\mathcal{T}$, labeled as removed. This is a $\psi$-sampling with removal event for which the number of sampled lineages decreases by one and the number of unsampled lineages remains unchanged. This gives,

$$M_{t^-}^{(i)} = \psi r\, M_{t^+}^{(i)}. \tag{4.2}$$

(ii) a leaf of $\mathcal{T}$, labeled as non-removed. This is a $\psi$-sampling without removal event for which one sampled lineage becomes unsampled. This gives,

$$M_{t^-}^{(i)} = \psi (1 - r)\, M_{t^+}^{(i-1)}. \tag{4.3}$$

(iii) a sampled ancestor along a branch of $\mathcal{T}$, necessarily labeled as non-removed. This is a $\psi$-sampling without removal event which does not affect the number of lineages. It gives,

$$M_{t^-}^{(i)} = \psi (1 - r)\, M_{t^+}^{(i)}. \tag{4.4}$$

(iv) an occurrence in $\mathcal{O}$, labeled as removed. This is an $\omega$-sampling with removal event, for which the number of unsampled lineages decreases by one. This gives,

$$M_{t^-}^{(i)} = \omega r\, (i + 1)\, M_{t^+}^{(i+1)}. \tag{4.5}$$

(v) an occurrence in $\mathcal{O}$, labeled as non-removed. This is an $\omega$-sampling without removal event which does not affect the number of lineages. It gives,

$$M_{t^-}^{(i)} = \omega (1 - r)(k + i)\, M_{t^+}^{(i)}. \tag{4.6}$$

(vi) a branching event between two branches of $\mathcal{T}$. This is a $\lambda$-event increasing the number of sampled lineages by one. This gives,

$$M_{t^-}^{(i)} = \lambda\, M_{t^+}^{(i)}. \tag{4.7}$$

Finally, upon reaching present time 0, one needs to take into account the $\rho$-sampling, leading to the following update,

$$M_{0^-}^{(i)} = \rho^{k_0} (1 - \rho)^{i}\, M_{0^+}^{(i)}. \tag{4.8}$$

Note, as for $L_t$, that these updates can be adapted to the case when we do not observe the removal status of individuals. The update corresponding to a leaf of $\mathcal{T}$ is the sum of updates (4.2), (4.3), the update corresponding to an occurrence event is the sum of updates (4.5), (4.6), while updates (4.4), (4.7) are unchanged.

As already exhibited for $L_t$, we can build a similar algorithm to compute $M_t$ in the general case, relying on a numerical ODE solver for approximating Eq. (4.1). As for Algorithm 1’ previously introduced to compute the probability density of $(\mathcal{T}, \mathcal{O})$, a slight variation of this algorithm would allow one to compute an estimate of the probability density of $(\mathcal{T}, \mathcal{O})$ by summing the $M_{0^-}^{(i)}$’s over all i. Note that this strategy is identical to what has been used to compute the probability density of a reconstructed phylogenetic tree under a logistic birth-death process (Etienne et al., 2012, Laudanno et al., 2020). While this approach is in theory a good approximation, it requires arbitrarily fixing a truncation parameter N, and exponentiating matrices of dimension $(N + 1) \times (N + 1)$, leading to potential speed or accuracy issues. In the remainder of this section, we derive analytical results to avoid resorting to a numerical ODE solver in specific cases.
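The forward counterpart of the truncated scheme can be sketched in the same way. The sketch below assumes that, in forward time $\tau = t_{or} - t$, the within-epoch system reads $dM^{(i)}/d\tau = -\gamma(k+i)M^{(i)} + \lambda(2k+i-1)M^{(i-1)} + \mu(i+1)M^{(i+1)}$, which is the adjoint of the backward system for $L_t$ so that $\sum_i L_t^{(i)} M_t^{(i)}$ stays constant within an epoch. The names `dM_dtau` and `integrate_forward` are hypothetical, and a fixed-step Runge-Kutta scheme stands in for a library ODE solver.

```python
def dM_dtau(M, k, lam, mu, gamma):
    """Right-hand side of the truncated forward system for M^(0..N)."""
    n = len(M)
    out = []
    for i in range(n):
        below = M[i - 1] if i > 0 else 0.0
        above = M[i + 1] if i + 1 < n else 0.0   # truncation: M^(N+1) := 0
        out.append(-gamma * (k + i) * M[i]
                   + lam * (2 * k + i - 1) * below
                   + mu * (i + 1) * above)
    return out

def integrate_forward(M, k, duration, lam, mu, gamma, steps=2000):
    """Classical RK4 integration of M towards the present over one epoch."""
    h = duration / steps
    M = list(M)
    for _ in range(steps):
        k1 = dM_dtau(M, k, lam, mu, gamma)
        k2 = dM_dtau([a + 0.5 * h * b for a, b in zip(M, k1)], k, lam, mu, gamma)
        k3 = dM_dtau([a + 0.5 * h * b for a, b in zip(M, k2)], k, lam, mu, gamma)
        k4 = dM_dtau([a + h * b for a, b in zip(M, k3)], k, lam, mu, gamma)
        M = [a + h / 6.0 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(M, k1, k2, k3, k4)]
    return M
```

Starting from `[1.0] + [0.0] * N` at the origin with `k = 1`, the first component after a duration $\tau$ should match $p(\tau, 0)$, the probability that the initial lineage is alone and unsampled after $\tau$ time units, which gives a simple sanity check of the integrator.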

The corresponding generating function

We introduce now the generating function corresponding to the density $M_t$, which will prove useful to get analytical results,

$$\mathcal{M}_t(z) := \sum_{i \geq 0} z^{i} M_t^{(i)}.$$

The initial condition on M translates into $\mathcal{M}_{t_{or}}(z) = 1$. The ODE (4.1) furthermore translates into a partial differential equation (PDE). Our target generating function is thus the solution of the following PDE problem across a given epoch $(t_h, t_{h+1})$, on which the number of observed lineages remains constant and equal to k,

$$\partial_t \mathcal{M}_t(z) + (\lambda z^2 - \gamma z + \mu)\, \partial_z \mathcal{M}_t(z) + k (2 \lambda z - \gamma)\, \mathcal{M}_t(z) = 0, \qquad \mathcal{M}_{t_{h+1}}(z) \text{ known}. \tag{4.9}$$

Solving this PDE problem allows us to obtain an analytical expression of $\mathcal{M}_t$ for any time across an epoch, provided we know the expression of $\mathcal{M}_{t_{h+1}}$ at the end of the epoch. Proposition 4.1 gives the solution to the PDE problem (4.9) in closed form. We used the method of characteristics to solve this first order linear PDE, see derivations in Appendix C.

Between epochs, one must also update $\mathcal{M}_t$ according to punctual events taking place. Previously presented updates of M (Eqs. (4.2), (4.3), (4.4), (4.5), (4.6), (4.7)) translate into the following updates for $\mathcal{M}_t$,

if t is a removed leaf, $\mathcal{M}_{t^-}(z) = \psi r\, \mathcal{M}_{t^+}(z)$, (4.10)

if t is a non-removed leaf, $\mathcal{M}_{t^-}(z) = \psi (1 - r)\, z\, \mathcal{M}_{t^+}(z)$, (4.11)

if t is a sampled ancestor, $\mathcal{M}_{t^-}(z) = \psi (1 - r)\, \mathcal{M}_{t^+}(z)$, (4.12)

if t is a removed occurrence, $\mathcal{M}_{t^-}(z) = \omega r\, \partial_z \mathcal{M}_{t^+}(z)$, (4.13)

if t is a non-removed occurrence, $\mathcal{M}_{t^-}(z) = \omega (1 - r) \big( k\, \mathcal{M}_{t^+}(z) + z\, \partial_z \mathcal{M}_{t^+}(z) \big)$, (4.14)

if t is a branching event, $\mathcal{M}_{t^-}(z) = \lambda\, \mathcal{M}_{t^+}(z)$. (4.15)

If we are interested in the distribution $M_t$ at some point, we can thus start the formula at $t_{or}$ with $\mathcal{M}_{t_{or}}(z) = 1$, and then iteratively alternate between the updates at punctual events and the use of Proposition 4.1 over each epoch. When reaching present time 0, the step of $\rho$-sampling expressed in Eq. (4.8) moreover translates into,

$$\mathcal{M}_{0^-}(z) = \rho^{k_0}\, \mathcal{M}_{0^+}\big((1 - \rho) z\big). \tag{4.16}$$

While this procedure in theory allows us to get the analytical formula of $\mathcal{M}_t$ at any time, updates (4.13), (4.14) require differentiating the generating function, greatly complicating the expression of the function after a few occurrences. When $\omega = 0$, these two updates disappear and a nice recursion leads to a closed-form formula that we will detail in Proposition 4.3. We implemented this procedure in the SageMath programming language, which is able to deal with symbolic calculus. We were however not able to make it find concise expressions, and computing these successive derivatives was too time-consuming to be applicable to standard datasets in the field.

Instead, when $\omega \neq 0$, we suggest another strategy for computing the $M_t^{(i)}$’s, namely approximating $\mathcal{M}_t$ across punctual events by a polynomial of order N, $\mathcal{M}_t(z) \approx \sum_{i=0}^{N} z^{i} M_t^{(i)}$, while still relying on Proposition 4.1 to drive the evolution of the probability generating function between events. This is a more efficient alternative to numerically solving the ODE system. We only need to derive the expression of the generating function at punctual events, as given in Proposition 4.2: the derivatives of a generating function which can be expressed in the form given by Proposition 4.1 can be numerically computed using an explicit formula. The derivation is detailed in Appendix D.1. This derivation is at the heart of Algorithm 2, allowing us to follow the evolution of the $M_t^{(i)}$’s through each epoch, as well as at times when we want to record them. We will refer to Algorithm 2’ as the slight variation of this algorithm aimed at computing the density of $(\mathcal{T}, \mathcal{O})$. No set of time points is required, and the values of $M_t$ are not recorded through time (i.e. the matrix storing them disappears). Instead, when reaching the present we simply return the sum of the $M_{0^-}^{(i)}$’s. Note that we tried to follow an analogous generating function approach as an alternative to Algorithm 1 to compute $L_t$ as well. This leads to another PDE problem, described in Appendix F, that will require further work to be solved.

We were not able to come up with any analytical simplification, as in the previous section, for the case $r = 1$. However, for the special case $\omega = 0$, corresponding to the special case leading to the observation of $\mathcal{T}$ only, a nice recursion leads to a closed-form formula for $\mathcal{M}_t(z)$.

Proposition 4.3. When $\omega = 0$, at any time t, considering that we have observed so far, i.e. on $(t, t_{or})$, v sampled ancestors, w removed leaves, x branching events and y non-removed leaves, $\mathcal{M}_t(z)$ is given by a closed-form formula.

We prove this result by induction across the epochs of $\mathcal{T}$ in Appendix E, using as the main arguments the update Eqs. (4.10), (4.11), (4.12), (4.15), combined with Proposition 4.1 driving the evolution across an epoch. As a simple corollary of this result, when t is the present, we get back the same probability density formula of $\mathcal{T}$ as provided, e.g. in Theorem 3.5 in Stadler (2010) (when $r = 0$), in Section 3 in Gupta et al. (2019) (when $r = 1$), or in our previous Corollary 3.1.1. Indeed, Proposition 4.3 offers yet another proof of Corollary 3.1.1 by expressing the density of $\mathcal{T}$ from the generating function at present, using Eq. (4.16) to take into account the $\rho$-sampling at present. Note that this alternative proof is also presented in Laudanno et al. (2020).

When $\omega = 0$, Proposition 4.3 also offers an alternative to Algorithm 2 for deriving $M_t$. Indeed, resorting to the generating function to get back the probability density, one can get the following corollary.

Corollary 4.3.1. When $\omega = 0$, at any time t, considering that we have observed so far, i.e. on $(t, t_{or})$, v sampled ancestors, w removed leaves, x branching events and y non-removed leaves, we can compute $M_t^{(i)}$ using a recursion on the successive derivatives of $\mathcal{M}_t$ in z. The result follows from the derivation of these derivatives in Appendix D.2.

This special case ends the section. In the next section, we will combine results from Sections 3 and 4 and use our ability to compute $L_t$ and $M_t$ to compute $K_t$, the probability distribution of the population size given $(\mathcal{T}, \mathcal{O})$.
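To make the method of characteristics concrete, here is a worked sketch of how the within-epoch transport of the generating function can be solved. It is a derivation under stated assumptions and is not claimed to match the notation of Proposition 4.1: the symbol $\mathcal{M}_t(z)$ for the generating function of the $M_t^{(i)}$, the epoch bounds $t_h < t_{h+1}$, and the PDE written on the first line (obtained by summing the forward ODE system for $M_t^{(i)}$ against $z^{i}$) are all assumptions of this sketch.

```latex
% Assumed PDE on an epoch (t_h, t_{h+1}) with k sampled lineages:
%   \partial_t M + F(z) \partial_z M + k F'(z) M = 0,  F(z) = \lambda z^2 - \gamma z + \mu.
\begin{align*}
&\text{Characteristics: } \frac{dz}{ds} = F(z(s)), \quad z(t) = z
   \;\Longrightarrow\; z(s) = u(s - t, z), \\
&\text{along which } \frac{d}{ds} \log \mathcal{M}_s\big(z(s)\big) = -k\, F'\big(z(s)\big)
   = -k\, \frac{d}{ds} \log p(s - t, z), \\
&\text{so that } \mathcal{M}_t(z)
   = \mathcal{M}_{t_{h+1}}\!\big(u(t_{h+1} - t,\, z)\big)
     \left( \frac{p(t_{h+1} - t,\, z)}{1 - z} \right)^{k},
   \qquad t \in (t_h, t_{h+1}).
\end{align*}
```

The ratio $p(\tau, z) / (1 - z)$ equals $\partial_z u(\tau, z)$, so the bracket stays finite at $z = 1$. For the first epoch below the origin ($k = 1$, $\mathcal{M}_{t_{or}} \equiv 1$) this gives $\mathcal{M}_t(z) = \partial_z u(t_{or} - t, z)$, whose value at $z = 0$ is $p(t_{or} - t, 0)$.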

The distribution of past population size conditioned on observations

The distribution at fixed times

In Section 3, we explained how to compute L_t, the probability density of the observations below time t conditioned on the population size at time t. This relies either on Algorithm 1 in the general case, or on the more optimized Proposition 3.1 in case , or Proposition 3.2 in the case . In Section 4, we explained how to compute M_t, the probability density of the observations above time t and the population size at time t. This relies either on Algorithm 2 in the general case, or on the more optimized Corollary 4.3.1 when . We now combine L_t and M_t to derive the probability distribution of the population size given . Provided we have stored numerical values and for a set of time points , recall from the first section that we obtain Note that the denominator need only be computed once, by evaluating for example at time or as described in previous sections. Depending on the parameter space that one wants to consider, it thus remains to arrange pieces stemming from the previous sections. We provide a flowchart in Fig. 4 to guide the reader in choosing the most efficient path.
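The combination step can be sketched numerically as follows. This is a minimal illustration, assuming the truncated values of L_t(i) and M_t(i) at one time point are available as two lists of length N+1; the function name and the toy numbers are hypothetical, and the normalizing constant is obtained here simply by summing the products over the truncated state space.

```python
def population_size_distribution(L, M):
    """Normalize the elementwise product L_t(i) * M_t(i) over i = 0..N.

    L[i]: density of the observations below time t given population size i.
    M[i]: joint density of the observations above time t and population size i.
    """
    if len(L) != len(M):
        raise ValueError("L and M must cover the same truncated state space")
    joint = [l * m for l, m in zip(L, M)]
    total = sum(joint)  # truncated approximation of the density of the data
    if total <= 0.0:
        raise ValueError("the product of L and M carries no probability mass")
    return [x / total for x in joint]


# Toy values for illustration only (not taken from the paper).
K = population_size_distribution([0.2, 0.1, 0.05], [0.3, 0.4, 0.1])
print(K)
```

Because the normalizing constant does not depend on the time point, an implementation could compute it once and reuse it across all stored times rather than re-summing at each one.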

Fig. 4

The most efficient results depending on the parameter space considered. In red, results already described in Stadler (2010) and Gupta et al. (2019). In blue, the new contribution of this manuscript. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Generator of trajectories

The previous result gives us the distribution of the population size at any time in the past, but does not state anything about population size trajectories. We now provide an approximate way of simulating population size trajectories conditioned on . Indeed, recall we have, We thus get, We introduced in the last line the following notation, Using these, we see that . This allows us to draw trajectories of the number of ancestors in the past as a time-continuous Markov process with the (inhomogeneous) rates written above. Observe that we could equally write these ODE coefficients using the ’s. This gives, where we introduced in the last line the following notation, This is a standard result for Markov chains that are conditioned on a final state, and the shape of the newly derived transition kernel is called a Doob transform (Levin and Peres, 2017). Note that these transitions simplify for special cases when we have an analytical expression of either or .
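For reference, the generic form of such a conditioned kernel can be written as below. This is the textbook statement for a continuous-time Markov chain rather than the specific expressions of this section: $q(i,j)$ denotes a hypothetical unconditioned jump rate from state $i$ to state $j$, and $h_t(i)$ the probability (or density) of the conditioning event given that the chain is in state $i$ at time $t$. Both symbols are introduced here for illustration only.

```latex
\tilde{q}_t(i,j) \;=\; q(i,j)\,\frac{h_t(j)}{h_t(i)},
\qquad i \neq j,\quad h_t(i) > 0 .
```

In the present setting, the role of $h_t$ would be played by one of the densities computed in the previous sections, which is why the conditioned rates are inhomogeneous in time even though the underlying birth-death rates are constant.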

Numerical implementation

Results of this paper have been implemented numerically and the code is freely available on GitLab: https://gitlab.com/MMarc/popsize-distribution/. We used the numerical implementation to verify the correctness of the results in several ways: We verified that the values of the probability density of computed using and (i.e. respectively using Algorithms 1’ and 2’) were equivalent to values computed using already known formulas when (Stadler, 2010) or when (Gupta et al., 2019). See result in Fig. 5AB.

Fig. 5

Assessment of the accuracy of the methods presented in this paper, on toy datasets. First row, probability density of data, A) against known analytical formula when and ; B) against known analytical formula when and ; C) obtained using Algorithms 1’ or 2’ otherwise, with . Second row, quantiles of the population size distribution, against the particle filter in Vaughan et al. (2019), with parameters . D) quantile of level 0.2; E) median; F) quantile of level 0.8.

We verified that the values of the probability density of computed using or (Algorithms 1’ and 2’) were identical on examples for which no previous formula was known. See result in Fig. 5C. We assessed the distribution of the population size against the only numerical method serving the same purpose, the particle filtering developed in Vaughan et al. (2019). We compared values of a few quantiles computed using the two methods, see result in Fig. 5DEF. Note that Vaughan et al. (2019) considered that we never have data on the removal status of individuals. We thus adapted our developments to this scenario in this specific comparison, by summing updates corresponding to the removal or not of the sampled individuals. On each of these sanity checks, we verified that different quantities match across different values. Note that we could equivalently have chosen any other parameter to be varied. We also illustrate in Fig. 6 our target distribution of the past population size conditioned on , on a few simulated examples.
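The quantiles compared in Fig. 5DEF can be read off a discrete population size distribution as in the following sketch. It assumes the distribution at a given time is stored as a list of probabilities indexed by population size 0..N; the function name and the toy distribution are hypothetical.

```python
def discrete_quantile(probs, level):
    """Smallest population size whose cumulative probability reaches `level`."""
    total = sum(probs)  # renormalize in case truncation lost some mass
    cumulative = 0.0
    for size, p in enumerate(probs):
        cumulative += p / total
        if cumulative >= level:
            return size
    return len(probs) - 1  # guard against rounding when level is close to 1


# Toy distribution over population sizes 0..3, for illustration only.
toy = [0.1, 0.2, 0.4, 0.3]
for level in (0.2, 0.5, 0.8):
    print(level, discrete_quantile(toy, level))
```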

Fig. 6

Inferred population size distribution using matches the simulated population size trajectory under three different processes: A) A homogeneous birth-death with -sampling at present; B) A homogeneous birth-death with -sampling at present and -sampling through time; C) A homogeneous birth-death process with -, - and -sampling. Note that we plot on the same graph , the number of observed lineages in the tree, as this is an obvious lower bound in our population size inference.

Discussion

The results we have derived in this paper fit into two main categories. The first category concerns results allowing one to compute the probability density of a tree and occurrences, while the second category concerns results allowing one to compute the probability distribution of the population size in the past. We discuss these two categories below, before presenting ideas for future extensions of the model.

Using the probability density of the data

We present in this article new ways to compute the probability density of the data, . For the special cases or , efficient calculations are available in Stadler (2010) and Gupta et al. (2019). Our two Algorithms 1’ and 2’ have the potential to improve the computation time of also when and . When analysing data, as described below, often this probability density is conditioned on sampling at least one individual, using (Stadler, 2012). In the case that the tree is known, we can use (with conditioning on sampling at least one individual) to obtain maximum likelihood parameter estimates for the birth-death parameters as well as the sampling parameters . For special cases of this model, it has been shown that not all sampling parameters are identifiable (see e.g. Stadler et al., 2019). Future work will involve investigating which of the sampling parameters in the general model can be estimated. On the other hand, data may consist of sequencing data and occurrence data . Bayesian tools are then typically employed to obtain a sample from the posterior distribution of the parameters using Markov chain Monte Carlo methods. The posterior distribution is, with summarizing the parameters of the model of molecular evolution and being the prior distribution on the model parameters.

Probability distribution of past population sizes

The main results of this paper allow one to compute the probability distribution of the population size in the past and to generate population size trajectories conditioned on (Section 5). Given a tree and occurrences together with birth-death parameters (which may be the maximum likelihood parameters obtained based on the tree and record of occurrences), we can simulate the distribution of past population sizes as described in Section 5.2. Furthermore, we can calculate the probability of a population size at any time in the past as described in Section 5.1. If we are instead provided with sequencing data and occurrence data , and want to generate a simulated ensemble characterizing the posterior distribution of past population size trajectories , we can use the following strategy. The posterior distribution is, We have described above how to obtain a sample from the posterior distribution using Markov chain Monte Carlo. For each sample of thus obtained, we can simulate an appropriately conditioned population size trajectory as described in Section 5.2. The ensemble of trajectories thus generated has the required distribution. We can employ an analogous procedure if we are interested in the posterior probability distribution of the population size at a particular time t. For each posterior sample of , we can calculate the population size distribution at time t using Section 5.1. The posterior population size at time t is then the average over all these conditional distributions.
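The final averaging step can be sketched as follows. This is an illustration only, assuming that the population size distribution at time t has already been computed for each posterior sample and stored as a list of probabilities over 0..N; the function name and the toy inputs are hypothetical, and equal weights are used on the assumption that the samples are draws from the posterior.

```python
def average_distribution(conditional_dists):
    """Average per-sample population size distributions with equal weights."""
    n = len(conditional_dists)
    if n == 0:
        raise ValueError("at least one posterior sample is required")
    size = len(conditional_dists[0])
    if any(len(d) != size for d in conditional_dists):
        raise ValueError("all distributions must share the same truncation N")
    return [sum(d[i] for d in conditional_dists) / n for i in range(size)]


# Two toy conditional distributions over population sizes 0..2.
posterior = average_distribution([[0.5, 0.5, 0.0], [0.1, 0.6, 0.3]])
print(posterior)
```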

Increased efficiency opens new research avenues

Both the density and the probability distribution of the population size in the past can be obtained using the Monte-Carlo particle filtering algorithm developed in Vaughan et al. (2019). The new approach presented in this paper is nevertheless appealing for two reasons. First, it provides a direct link with previous analytical formulas developed in Stadler (2010) and Gupta et al. (2019), thus improving our understanding of these processes and leading to very efficient results in the specific case where . Second, Algorithms 1 and 2 have the potential to be more efficient alternatives to the Monte-Carlo particle filtering algorithm. Computing quantiles shown in Fig. 5DEF using the particle filtering took a few days, as compared to a few minutes with our method, mainly because it can be applied directly on a fixed tree and does not need to be part of an MCMC. A more thorough quantitative comparison of both approaches would require implementing this work in an MCMC framework, which is beyond the scope of this paper.

This increased efficiency could open up the possibility to analyse much bigger datasets in the near future. In macroevolution, the study of clades with a huge fossil record like cetaceans could benefit from our approach. This dataset is characterized by a rather small number of extant species and fossils with morphological data available (respectively -sampled and -sampled species), but includes a huge number of fossils without morphological data (-sampled species) (Morlon et al., 2011, Barido-Sottani et al., 2019). For the cetaceans as well as many other clades, it will be of great interest to compute diversity estimates under the modelling framework presented here (assuming ). Ultimately, all -samples could be taken into account to inform the tree and diversity estimates. In the context of epidemiology, typically, the genetic sequences of the pathogen are only available for a fraction of the infected individuals. These correspond to -samples, while other sampled infected individuals correspond to -samples. Further developing our approach in a Bayesian framework, both the genetic sequences and the record of occurrences could be jointly used to estimate the underlying transmission tree and prevalence of the disease through time. Depending on the cost of sequencing and the ability of numerical methods to handle some critical amount of both genetic sequences and number of occurrences, optimal sampling procedures could be investigated, to make the most of both types of data.

Finally, while improving on current methods, Algorithms 1 and 2 still only provide approximations of, respectively, and , that critically rely on the truncation parameter of the state space N. Increasing N leads to a more accurate approximation, while increasing the runtime of the method. If the probability mass of the number of hidden individuals is non-negligible above N, both algorithms will lead to very poor approximations of and . This value should thus be carefully chosen in empirical applications, depending on what is expected with the data at hand. We point out that the behaviour of these algorithms strongly relies on the runtime and accuracy of the matrix exponentiation steps. Numerous matrix exponentiation methods have been proposed in the literature (Moler and Van Loan, 2003). In our current implementation, we rely on a recent matrix exponentiation method already implemented in scipy (Al-Mohy and Higham, 2010). Future avenues towards improving this specific step could focus on new theoretical results adapted to tridiagonal matrices (Smith and Shahrezaei, 2015) or alternatively try to adapt Laplace transform approximations derived in Crawford et al. (2014), who present theoretical results bounding the errors made in their approximation.

Future extensions

Our proposed modelling framework lends itself well to various biologically realistic extensions to allow closer fit to empirical data in a variety of situations. The first extension that we envision is to relax the assumption of rate homogeneity and instead work with time-varying rates. This has already been considered in different studies relying on birth-death processes, either with exponentially varying functions (Morlon et al., 2011) or with piecewise constant rates (a model dubbed the skyline birth-death process, see Stadler et al., 2013, Gavryushkina et al., 2016). As all our results can be straightforwardly adapted to such a framework, this would not require much theoretical work. However, the challenge would be to do so without overfitting the data. Another popular extension that has been described in the literature on birth-death processes for phylodynamics is to consider multi-type birth-death processes (Maddison et al., 2007). Each individual is assigned a type, which impacts its propensity to give birth to other types. All sampling-related parameters can also be considered type-dependent. The main challenge here boils down to dealing with an increase of dimensionality, because we would be interested in the joint distribution of all subpopulation sizes. This extension is particularly interesting for epidemiological applications, when different populations of infected individuals, clustered according to some characteristic (e.g. patient behaviour or geography) might have very different dynamics (Stadler and Bonhoeffer, 2013). Finally, we are very hopeful that this piece of work could be applied as well to density-dependent birth-death processes, also known as logistic birth-death models. Indeed, very similar ideas to the breadth-first forward and backward traversals as applied in Algorithms 1’ and 2’ appear in the context of logistic birth-death models (Etienne et al., 2012, Leventhal et al., 2013, Laudanno et al., 2020). Preliminary results obtained by adapting our numerical algorithms to this framework are very encouraging, and we are currently in the process of deriving as many analytical results as we can to speed up the method. We are hoping to present this in a subsequent paper.

Conclusion

This manuscript presents a way to efficiently compute the distribution of the past population size in a linear birth-death process, conditioned on the observation of a reconstructed phylogenetic tree and a record of occurrences through time. Such data are very common in macroevolution where the reconstructed phylogenetic tree of extant species is available together with occurrences from the fossil record. In epidemiology, pathogen genetic sequencing data and case count data are a common data source. Our method thus promises to allow efficient quantification of past population sizes, representing past biodiversity or past prevalence, from these rich datasets. We believe that this method also paves the way for the consideration of more complex and more realistic demographic scenarios, assuming either time-dependent (Morlon et al., 2011, Stadler et al., 2013, Gavryushkina et al., 2016) or density-dependent parameters (Etienne et al., 2012, Leventhal et al., 2013), potentially catering for populations with multiple demographic categories/types (Maddison et al., 2007, Stadler and Bonhoeffer, 2013, Freyman and Höhna, 2018). It is our hope that this manuscript will foster important research advances for unravelling demographic histories in epidemiology, macroevolution, and any other fields where birth-death processes form a relevant model framework.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Marc Manceau: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft. Ankit Gupta: Validation, Investigation, Writing - original draft, Visualization. Timothy Vaughan: Validation, Investigation, Writing - review & editing, Supervision. Tanja Stadler: Conceptualization, Resources, Writing - review & editing, Supervision, Funding acquisition.

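Algorithm 1: Computes a numerical approximation of $L_t$ for a specific set of times
Input:
Observed tree and occurrence data $(T, O)$,
parameters $(t_{or}, \lambda, \mu, \psi, \omega, \rho, r)$,
set of time points $(\tau_j)_{j=1}^{S}$ for which we want to compute the density $L_{\tau_j}(i)$,
and the truncation $N$ setting the accuracy of the algorithm.
Output: A numerical approximation of $L_t$ at times $(\tau_j)_{j=1}^{S}$, $(\tilde{L}_{\tau_j}(i))_{i \in \{0,1,\dots,N\},\, j \in \{1,2,\dots,S\}}$.
1: Pool all $(\tau_j)$ and all branching and sampling times of $(T, O)$ in an ordered list $(t_h)_{h=1}^{n}$
2: Set $j = 1$ and initialize $B$ as an $S \times (N+1)$ empty matrix
3: Set $\forall i \in \{0,1,\dots,N\}$, $\tilde{L}_0(i) = \rho^{k_0} (1-\rho)^{i}$
4: for $h = 1, 2, \dots, n$ do
5:   Numerically solve the ODE $\dot{\tilde{L}}_t = A \tilde{L}_t$ on $(t_{h-1}, t_h)$, by computing $\tilde{L}_{t_h} = e^{(t_h - t_{h-1}) A} \tilde{L}_{t_{h-1}}$,
6:   where matrix $A$ is an $(N+1) \times (N+1)$ tridiagonal matrix with entries given by
       $\forall i \in \{0,1,\dots,N\}$, $A(i,i) = -\gamma (k+i)$
       $\forall i \in \{0,1,\dots,N-1\}$, $A(i,i+1) = \lambda (2k+i)$
       $\forall i \in \{1,2,\dots,N\}$, $A(i,i-1) = \mu i$
7:   if $t_h = \tau_j$ then
8:     Record $\forall i$, $B(j,i) = \tilde{L}_{t_h}(i)$
9:     Set $j = j + 1$
10:   end if
11:   if $t_h = t_n$ or $t_h = \tau_S$ then
12:     return $B$
13:   else if $t_h$ is a removed leaf then
14:     Set $\tilde{L}_{t_h^+} = \psi r \tilde{L}_{t_h^-}$
15:   else if $t_h$ is a non-removed leaf then
16:     Set $\forall i < N$, $\tilde{L}_{t_h^+}(i) = \psi (1-r) \tilde{L}_{t_h^-}(i+1)$ and $\tilde{L}_{t_h^+}(N) = 0$
17:   else if $t_h$ is a sampled ancestor then
18:     Set $\tilde{L}_{t_h^+} = \psi (1-r) \tilde{L}_{t_h^-}$
19:   else if $t_h$ is a removed occurrence then
20:     Set $\forall i > 0$, $\tilde{L}_{t_h^+}(i) = \omega r i \tilde{L}_{t_h^-}(i-1)$ and $\tilde{L}_{t_h^+}(0) = 0$
21:   else if $t_h$ is a non-removed occurrence then
22:     Set $\tilde{L}_{t_h^+}(i) = \omega (1-r) (k+i) \tilde{L}_{t_h^-}(i)$
23:   else $t_h$ is a branching event
24:     Set $\tilde{L}_{t_h^+} = \lambda \tilde{L}_{t_h^-}$
25:   end if
26: end for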
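The epoch step of Algorithm 1 (building the tridiagonal matrix and applying the matrix exponential) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the aggregate rate `gamma` is passed in as a number because its definition lies outside this excerpt, and `scipy.linalg.expm` is used as one readily available dense matrix exponential.

```python
def build_generator(N, k, lam, mu, gamma):
    """Tridiagonal (N+1) x (N+1) matrix A of Algorithm 1, as a list of rows.

    k is the number of observed lineages during the epoch; index i is the
    number of hidden individuals, truncated at N.
    """
    A = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        A[i][i] = -gamma * (k + i)
        if i < N:
            A[i][i + 1] = lam * (2 * k + i)
        if i > 0:
            A[i][i - 1] = mu * i
    return A


def propagate(L_prev, A, dt):
    """Advance the truncated density vector across one epoch of length dt."""
    # Imported here so that build_generator stays usable without NumPy/SciPy.
    import numpy as np
    from scipy.linalg import expm

    return expm(dt * np.asarray(A, dtype=float)) @ np.asarray(L_prev, dtype=float)
```

A call such as `propagate(L_prev, build_generator(N, k, lam, mu, gamma), t_h - t_prev)` would then correspond to step 5 for one epoch. For large N, a routine that computes the action of the exponential on a vector without forming the full matrix may be preferable to a dense exponential.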

Algorithm 2: Computes a numerical approximation of $M_t$ for a specific set of times
Input: Observed tree and occurrence data $(T, O)$,
parameters $(t_{or}, \lambda, \mu, \psi, \omega, \rho)$,
set of time points $(\tau_j)_{j=1}^{S}$ for which we want to compute the density,
and the truncation $N$ setting the accuracy of the algorithm.
Output: A numerical approximation of $M_t$ at times $(\tau_j)_{j=1}^{S}$, $(\tilde{M}_{\tau_j}(i))_{i \in \{0,1,\dots,N\},\, j \in \{1,2,\dots,S\}}$.
1: Pool all $(\tau_j)$ and all branching and sampling times of $(T, O)$ in an ordered list $(t_h)_{h=1}^{n}$
2: Set $j = S$ and initialize $B'$ as an $S \times (N+1)$ empty matrix
3: Set $\forall i \in \{0,1,\dots,N\}$, $\tilde{M}(i) = \mathbb{1}_{i=0}$
4: Set $k = 1$
5: for $h = n-1, n-2, \dots, 0$ do
6:   Compute the values right before the punctual event,
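$$\tilde{\tilde{M}}(i) = \left( \frac{\Delta}{\lambda^2} e^{-\sqrt{\Delta}(t_h - t)} \right)^{k} \sum_{\alpha=0}^{i} \sum_{l=\alpha}^{N} \tilde{M}_{t_h}(l) \binom{l}{\alpha} \frac{1}{(i-\alpha)!} \prod_{m=0}^{i-\alpha-1} (2k+l+m) \, \left( -x_1 + x_2 e^{-\sqrt{\Delta}(t_h - t)} \right)^{\alpha} (x_1 x_2)^{l-\alpha} \left( 1 - e^{-\sqrt{\Delta}(t_h - t)} \right)^{l+i-2\alpha} \left( x_2 - x_1 e^{-\sqrt{\Delta}(t_h - t)} \right)^{-(2k+l+i-\alpha)}$$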
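7:   if $t_h = \tau_j$ then
8:     Record the result in $B'$: $\forall i$, $B'(j,i) = \tilde{\tilde{M}}(i)$
9:     Set $j = j - 1$
10:   end if
11:   if $t_h = 0$ or $t_h = \tau_S$ then
12:     return $B'$
13:   else if $t_h$ is a removed leaf then
14:     Update $\forall i$, $\tilde{M}(i) = \psi r \tilde{\tilde{M}}(i)$
15:     Set $k = k - 1$
16:   else if $t_h$ is a non-removed leaf then
17:     Update $\tilde{M}(0) = 0$ and $\forall i > 0$, $\tilde{M}(i) = \psi (1-r) \tilde{\tilde{M}}(i-1)$
18:     Set $k = k - 1$
19:   else if $t_h$ is a sampled ancestor then
20:     Update $\forall i$, $\tilde{M}(i) = \psi (1-r) \tilde{\tilde{M}}(i)$
21:   else if $t_h$ is a removed occurrence then
22:     Update $\forall i < N$, $\tilde{M}(i) = \omega r (i+1) \tilde{\tilde{M}}(i+1)$ and $\tilde{M}(N) = 0$
23:   else if $t_h$ is a non-removed occurrence then
24:     Update $\forall i$, $\tilde{M}(i) = \omega (1-r) (k+i) \tilde{\tilde{M}}(i)$
25:   else $t_h$ is a branching event
26:     Update $\forall i$, $\tilde{M}(i) = \lambda \tilde{\tilde{M}}(i)$
27:     Set $k = k + 1$
28:   end if
29: end for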
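The punctual-event updates of Algorithm 2 can be collected in a single function, as in the sketch below. It transcribes the update rules listed in the algorithm; the function name, the string labels for event types and the toy values are hypothetical, and the vector passed in is assumed to hold the values computed right before the event for population sizes 0..N.

```python
def apply_event(M, k, event, lam, psi, omega, r):
    """Return (M_after, k_after) for one punctual event of Algorithm 2.

    M[i] holds the value right before the event for i hidden individuals;
    k is the current number of observed lineages.
    """
    N = len(M) - 1
    if event == "removed_leaf":
        return [psi * r * m for m in M], k - 1
    if event == "non_removed_leaf":
        shifted = [psi * (1 - r) * M[i - 1] for i in range(1, N + 1)]
        return [0.0] + shifted, k - 1
    if event == "sampled_ancestor":
        return [psi * (1 - r) * m for m in M], k
    if event == "removed_occurrence":
        shifted = [omega * r * (i + 1) * M[i + 1] for i in range(N)]
        return shifted + [0.0], k
    if event == "non_removed_occurrence":
        return [omega * (1 - r) * (k + i) * M[i] for i in range(N + 1)], k
    if event == "branching":
        return [lam * m for m in M], k + 1
    raise ValueError(f"unknown event type: {event}")


# Toy example: a removed occurrence shifts mass down by one hidden individual.
M_after, k_after = apply_event([1.0, 2.0, 3.0], 2, "removed_occurrence",
                               lam=2.0, psi=0.5, omega=0.25, r=0.5)
print(M_after, k_after)
```

Note how the truncation at N appears in the removed-occurrence case: the entry for N hidden individuals would need the value at N+1, which is not tracked, so it is set to zero.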

26 in total

1. Inferring speciation and extinction processes from extant species data.

Authors: Tanja Stadler
Journal: Proc Natl Acad Sci U S A Date: 2011-09-19 Impact factor: 11.205

2. Stochastic Character Mapping of State-Dependent Diversification Reveals the Tempo of Evolutionary Decline in Self-Compatible Onagraceae Lineages.

Authors: William A Freyman; Sebastian Höhna
Journal: Syst Biol Date: 2019-05-01 Impact factor: 15.683

3. How can we improve accuracy of macroevolutionary rate estimates?

Authors: Tanja Stadler
Journal: Syst Biol Date: 2012-09-08 Impact factor: 15.683

4. Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods.

Authors: Tanja Stadler; Sebastian Bonhoeffer
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2013-02-04 Impact factor: 6.237

5. Estimating Diversity Through Time Using Molecular Phylogenies: Old and Species-Poor Frog Families are the Remnants of a Diverse Past.

Authors: O Billaud; D S Moen; T L Parsons; H Morlon
Journal: Syst Biol Date: 2020-03-01 Impact factor: 15.683

6. Bayesian Total-Evidence Dating Reveals the Recent Crown Radiation of Penguins.

Authors: Alexandra Gavryushkina; Tracy A Heath; Daniel T Ksepka; Tanja Stadler; David Welch; Alexei J Drummond
Journal: Syst Biol Date: 2017-01-01 Impact factor: 15.683

7. Estimation for general birth-death processes.

Authors: Forrest W Crawford; Vladimir N Minin; Marc A Suchard
Journal: J Am Stat Assoc Date: 2014-04 Impact factor: 5.033

8. Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV).

Authors: Tanja Stadler; Denise Kühnert; Sebastian Bonhoeffer; Alexei J Drummond
Journal: Proc Natl Acad Sci U S A Date: 2012-12-17 Impact factor: 11.205

9. Ignoring stratigraphic age uncertainty leads to erroneous estimates of species divergence times under the fossilized birth-death process.

Authors: Joëlle Barido-Sottani; Gabriel Aguirre-Fernández; Melanie J Hopkins; Tanja Stadler; Rachel Warnock
Journal: Proc Biol Sci Date: 2019-05-15 Impact factor: 5.349