Literature DB >> 29550451

The fossilized birth-death model for the analysis of stratigraphic range data under different speciation modes.

Tanja Stadler¹, Alexandra Gavryushkina², Rachel C M Warnock², Alexei J Drummond³, Tracy A Heath⁴.

Abstract

A birth-death-sampling model gives rise to phylogenetic trees with samples from the past and the present. Interpreting "birth" as branching speciation, "death" as extinction, and "sampling" as fossil preservation and recovery, this model - also referred to as the fossilized birth-death (FBD) model - gives rise to phylogenetic trees on extant and fossil samples. The model has been mathematically analyzed and successfully applied to a range of datasets on different taxonomic levels, such as penguins, plants, and insects. However, the current mathematical treatment of this model does not allow for a group of temporally distinct fossil specimens to be assigned to the same species. In this paper, we provide a general mathematical FBD modeling framework that explicitly takes "stratigraphic ranges" into account, with a stratigraphic range being defined as the lineage interval associated with a single species, ranging through time from the first to the last fossil appearance of the species. To assign a sequence of fossil samples in the phylogenetic tree to the same species, i.e., to specify a stratigraphic range, we need to define the mode of speciation. We provide expressions to account for three common speciation modes: budding (or asymmetric) speciation, bifurcating (or symmetric) speciation, and anagenetic speciation. Our equations allow for flexible joint Bayesian analysis of paleontological and neontological data. Furthermore, our framework is directly applicable to epidemiology, where a stratigraphic range is the observed duration of infection of a single patient, "birth" via budding is transmission, "death" is recovery, and "sampling" is sequencing the pathogen of a patient. Thus, we present a model that allows for incorporation of multiple observations through time from a single patient.

Entities: Chemical Disease Gene Species

Keywords: Birth-death process; Fossils; Macroevolution; Phylogenetic tree; Sampling-through-time; Tree prior

Mesh：

Year: 2018 PMID： 29550451 PMCID： PMC5931795 DOI： 10.1016/j.jtbi.2018.03.005

Source DB: PubMed Journal: J Theor Biol ISSN： 0022-5193 Impact factor: 2.691

Introduction

Inferring species phylogenies and ultimately the tree of life is one of the main goals of systematics and evolutionary biology. Based on inferred species phylogenies, biologists then aim to uncover the dynamics of speciation and extinction (including rates and times). Recovered fossils and sampled extant species are outcomes of a single diversification process of speciation and extinction, and thus share the same evolutionary history. Ideally, paleontological and neontological data should be used in combination for reconstructing species phylogenies and estimating speciation and extinction dynamics (Hunt, Slater, 2016, Quental, Marshall, 2010, Slater, Harmon, 2013, Wagner, 1995). Joint inference of a time-calibrated phylogeny of living and extinct taxa together with the rates of speciation and extinction requires a model for lineage diversification that gives rise to extant species and fossil samples. Such a model defines the probability density of a rooted phylogenetic tree of extant species and fossils, conditioned on the speciation, extinction, and sampling parameters of the model. This probability density then directly allows us to infer the parameters of the model given a phylogenetic tree, using maximum likelihood or Bayesian inference methods. Furthermore, based on molecular or morphological information for the extant species and fossil samples, this probability density – together with models of molecular sequence and morphological character evolution – allows us to infer the dated phylogeny of observed extant species and fossils (Gavryushkina, Heath, Ksepka, Stadler, Welch, Drummond, 2017, Zhang, Stadler, Klopfstein, Heath, Ronquist, 2016). This latter inference was initially introduced as the total-evidence dating approach by Ronquist et al. (2012a), where the model on the phylogenetic tree was an extension of the uniform prior on ultrametric clock trees to trees with terminals of different ages, meaning no speciation, extinction, and sampling parameters were specified. A popular model giving rise to extant species and fossil samples is the fossilized birth-death (FBD) process (Heath, Huelsenbeck, Stadler, 2014, Stadler, 2010). The process starts with one lineage (the initial species) at some time in the past. Each lineage has a rate of branching speciation (birth) and a rate of extinction (death). Further, each lineage has a rate of producing a fossil sample. At the present, each extant lineage has a probability of being sampled. The tree displaying all extant and extinct lineages of the FBD process together with the samples is called the complete tree. Pruning all lineages without sampled descendants from the complete tree gives rise to the “sampled tree” on extant and extinct samples (Stadler, 2010). A sampled tree is a model for a phylogenetic tree inferred from empirical data. The sampled tree has a degree-one node at the start of the initial lineage, degree-three nodes corresponding to branching events, degree-two nodes corresponding to fossil samples being ancestors of other samples, and degree-one nodes corresponding to extant samples or fossil samples without sampled descendants. In our terminology a branch in a complete or sampled tree always connects two adjacent degree-one or degree-three nodes, that is, any branching node necessarily terminates a branch and starts two new branches. Note that we assume that degree-two nodes (fossil samples) do not subdivide lineages into branches, unless otherwise stated in the subsection. Stadler (2010) provides an example of a complete and a sampled tree in Fig. 1 of that paper. The probability of a sampled tree on extant and fossil samples was calculated in Stadler (2010) and later in Didier et al. (2012). The equations have been implemented in a Bayesian framework for phylogenetic inference as a stand-alone tool (Heath et al., 2014), and as part of the Bouckaert et al. (2014), Gavryushkina et al. (2014), Gavryushkina et al. (2017), Huelsenbeck et al. (2001), Ronquist et al. (2012b), Zhang et al. (2016), and Höhna et al. (2016) software packages. Recently, Didier et al. (2017) provided a method to evaluate the probability of a sampled tree topology, rather than the sampled tree with branch lengths.

Fig. 1

Three speciation modes as described in Foote (1996). The gray and white rectangles represent distinct species. In (i) asymmetric or budding speciation, the ancestral species (gray rectangle) survives after the speciation event whereas in the (ii) symmetric or bifurcating and (iii) anagenetic cases, the ancestral species is replaced by two or one descendant species. In the FBD model definition, we use the word “lineage” rather than “species”. Branching speciation gives rise to an additional species, i.e., co-existing lineages in the sampled tree correspond to different species. However, the FBD model does not assign species to lineages through time. In particular, a branching speciation event can be considered to occur either via budding (asymmetric) speciation, where a single descendant species branches off the ancestral species and both species exist after the speciation event, or via bifurcating (symmetric) speciation where the ancestral species goes extinct and the event gives rise to two new descendant species (Fig. 1, (i) and (ii)). This means that the assignment of species to branches is not specified, and, in particular, branches in the sampled tree do not necessarily correspond to unique species. Several branches in a complete or sampled tree may correspond to the same species due to budding speciation. For example, consider the tree in Fig. 2 showing budding speciation: Sp. 1 and Sp. 2 are each represented by 2 branches. However, a single branch in a sampled tree may also correspond to several different species due to unobserved branching speciation events, i.e., speciation events leading to unsampled lineages. For example, in Fig. 2, assume that Sp. 2 is not sampled. Then the branch from the observed speciation event (i.e., the budding event from Sp. 1) to the tip Sp. 3 would represent 2 species, namely Sp. 2 and Sp. 3. At these unobserved branching speciation events, the species assignment of a branch may change depending on the mode of speciation as defined in Fig. 1 (e.g., the species assignment always changes under symmetric speciation). In summary, while the FBD model assumes that every co-existing lineage at a particular instant in time belongs to a different species, the FBD model does not make statements about species assignments for lineages through time.

Fig. 2

A complete species tree of three species that originated through asymmetric speciation is shown on the left. In the middle, an “oriented” species tree is shown with asymmetric speciation corresponding to the species tree of the same three taxa. At each speciation event, one of the two new branches is labeled with A, because it represents a continuation of the ancestral species, and the other with D, designating the new descendant species. In an oriented tree, every species is identified by a unique sequence of A and D branches. Thus, the oldest species is identified by DA, the one that diverges next by DDA, and the most recent by DDD. On the right, a labeled species tree is shown where the orientations are omitted and every species is assigned with a label (taxon name) instead. The labeled tree representation is more common for existing phylogenetic software. In all three representations the same-colored segments represent the same species. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) In reality, the fossil record often contains multiple observations of the same species over distinct time intervals specifying stratigraphic ranges. In order to use this stratigraphic range information from the fossil record, we extend the FBD model to allow for speciation events of three different speciation modes asymmetric (budding), symmetric (bifurcating), and anagenetic (Fig. 1). Using this extension, we derive the probability density of a sampled tree with stratigraphic ranges. As in previous versions of the FBD model, we do not need to assume that we sampled all species, instead the sampling rate explicitly acknowledges incomplete sampling. Our equations for the FBD model extension allow analysis of stratigraphic range data in a phylogenetic framework. This paper follows a particular structure to present our new extensions of the FBD model. First, we formally define the three speciation modes extending the classic FBD model in “The FBD model under three modes of speciation”. Second, we derive the probability density of a sampled tree on stratigraphic ranges under asymmetric speciation in Section “Mathematics of the asymmetric speciation FBD model”. Third, we derive the probability density under the three modes of speciation in Section “Mathematics of the mixed speciation FBD model”. Based on the mathematical results, we use the section “Marginalizing over the number of fossils within a stratigraphic range” to describe the derivation of the probability density of a sampled tree on stratigraphic ranges, given that we only know the time of the first and the last fossil sample, rather than the total number of fossil samples within a stratigraphic range. In section “Marginalizing over the number of fossils within a stratigraphic interval”, we derive the probability density of a sampled tree on stratigraphic ranges, given that we only know whether a fossil species was present or absent within a stratigraphic interval (i.e., a time interval), rather than the total number of fossil samples within each interval. We summarize the main results in the discussion, highlighting the conceptual use of such equations in a statistical inference framework. Further, we discuss the potential of these new equations for contributing to advances in the field of macroevolution. Finally, we highlight how the equations for the asymmetric speciation case, can be directly employed in molecular epidemiology.

The FBD model under three modes of speciation

Here we extend the FBD model towards assigning species to lineages through time. For species assignment, we need to specify the speciation mode. We consider three modes of speciation as defined in Foote (1996) (see Fig. 1). (i) Asymmetric speciation where an ancestral species gives rise to a new species via budding, i.e., the descendant species branches off the ancestral species and both species exist after the speciation event. (ii) Symmetric and (iii) anagenetic speciation where the ancestral species goes extinct at the speciation event and gives rise to two (in the symmetric case) and to one (in the anagenetic case) new species. Thus, in addition to branching speciation (birth), extinction (death), and sampling events in the FBD model, each branching speciation event is assigned to a mode of speciation (asymmetric or symmetric), and anagenetic changes are marked along lineages. Thus, these three modes of speciation events partition all lineages into segments representing distinct species. All fossil samples that come from the same segment are assigned to a single species corresponding to that segment. A “stratigraphic range” defines a continuous lineage between the first and last fossil appearance of a species. The FBD model with an assignment of species to lineages through time gives rise to a probability density of a sampled tree on stratigraphic ranges, i.e., on extant and fossil samples where each sample is assigned to a species. Thus, using the FBD model with speciation modes for empirical analysis allows us to assign several fossils to each species in a data analysis. At its most conservative interpretation, a stratigraphic range is a single morphospecies observed in multiple stratigraphic layers (Oliver and Beattie, 1996). Since morphospecies identifications across stratigraphic intervals are the primary data used in paleontology to study diversity and diversification rates, the FBD model with specification of speciation modes allows us to use the primary paleontological data (stratigraphic range data), jointly with extant species data for phylogenetic analysis.

Mathematics of the asymmetric speciation FBD model

In this section, we formally define the FBD model under asymmetric speciation as illustrated in Fig. 2. As a model for speciation and extinction, we assume a birth-death process with each lineage having a branching speciation rate λ and extinction rate μ. The process starts with one lineage at time x0 in the past (also called the origin time) and terminates at the present time 0. Table 1 gives an overview of these and all other parameter definitions used throughout this paper.

Table 1

Notation used throughout this paper.

λ	Rate of branching speciation
λ_a	Rate of anagenetic speciation
β	Probability of symmetric (vs. asymmetric) speciation
ψ	Fossil sampling rate
ρ	Extant species sampling probability
μ	Extinction rate
η	(λ, β, λ_a, μ, ψ, ρ)
x₀	Time of origin of a tree
x1,…,xn−j−1	Speciation times in a sampled tree
A/D	Orientation of the two branches descending a budding branching event
left/right	Orientation of the two branches descending a general branching event
n	Number of sampled stratigraphic ranges, i.e., number of sampled species (some stratigraphic ranges may only be represented by a single sample in the past or present)
m	Number of sampled stratigraphic ranges where the associated species goes extinct before present
l	Number of sampled stratigraphic ranges with an extant species sample
j	Number of sampled-ancestor-stratigraphic ranges
k	Total number of sampled fossils
κ′	Total number of sampled fossils that represent the start and end times of stratigraphic ranges (including ranges represented by a single occurrence)
κ	Total number of sampled fossils within the stratigraphic ranges (i.e., k=κ+κ′)
κs¯	Indicates the presence of a fossil within a stratigraphic interval if =1, and absence if =0
v	Number of branching speciation events in the labeled tree where we know the orientation
w	Number of budding speciation events (out of the n−j−1 speciation events) in a sampled tree
d_i	Extinction time of species associated with stratigraphic range i
o_i	Time of first observed fossil corresponding to the species represented by stratigraphic range i, i.e., start time of stratigraphic range i
y_i	Time of last observed fossil corresponding to the species represented by stratigraphic range i, i.e., end time of stratigraphic range i
b_i	Branching event in extended sampled tree giving rise to the straight line on which stratigraphic range i lies, also called birth time of i
γ_i	Number of lineages co-existing at the birth time b_i
a(i)	Most recent stratigraphic range ancestral to stratigraphic range i
t_i	Time of augmented unobserved speciation event that gave rise to the species associated with stratigraphic range i, meaning t_i is speciation time of that species
I	Set of stratigraphic ranges, with i ∈ I if i is in the same straight line as its most recent ancestral stratigraphic range a(i) in the graphical representation of the sampled tree
L_s	Sum of all stratigraphic range lengths
Ls¯	Length of a sub-branch spanning a stratigraphic interval
s_i	Start time of branch i
e_i	End time of branch i
Teo	Oriented extended sampled tree
Tel	Labeled extended sampled tree
Tso	Oriented sampled tree
Tr	Tree when ignoring the κ fossils within stratigraphic ranges
Tl	Tree when ignoring the number of fossils within a stratigraphic interval
D	Summary of fossil occurrence data with k sampled fossils, l sampled extant species, and n sampled stratigraphic ranges with times o_i, b_i, d_i

Notation used throughout this paper. To model “asymmetric” or “budding” speciation, we assume that one of the two descendant branches of a speciation event belongs to the “ancestral” species and the other belongs to the “descendant” species, thus the two descendant branches may be assigned label A for ancestral and label D for descendant species (Fig. 2). Following Ford et al. (2009), we call the label assignment for each pair of descendant branches of a speciation event an “orientation”. In an oriented tree every species is represented by a path that starts with a D-branch and may be continued by several (or none) A-branches. For example, species 2 in the middle tree in Fig. 2 comprises two branches: the initial D-branch and one A-branch, because it is ancestral to another species (species 3). Species 3 consists of only the starting D-branch, because it does not give rise to any other species. Typically, the graphical representation of trees used in the paleontological literature (e.g., Bapst, 2013) draws all A- and D-branches that belong to the same species in a single straight line (Fig. 2, left). This graphical representation implicitly contains the information introduced by the A and D orientation (Fig. 2, middle). Therefore in the remaining figures of the text we use this graphical representation and omit reference to the A and D orientation. In addition to the birth-death process with species assignment, we assume a sampling process for fossils and extant tips. We assume fossil sampling occurs along each lineage with rate ψ, and an extant species is sampled with probability ρ. The tree that includes all extant and extinct species that evolved from a single ancestor during time interval x0 together with all the samples is called the “complete tree”. Fig. 3, left (ignoring the blue and grey colors for the moment), displays a complete tree on eight species with the extant and fossil samples shown using black diamonds. Five species have sampled fossils (species 1,2,4,5,6), and two species (species 3,4) have an extant sample. Two species (7 and 8) are not sampled.

Fig. 3

Example of a complete tree (left) and its extended sampled tree (middle) and sampled tree (right). We mark all fossil and extant species’ samples with a diamond. The stratigraphic ranges are marked in blue, the extended stratigraphic ranges in grey. We remind the reader that a straight line in these trees represents our graphical representation, meaning the oldest branch in a line is the D branch and all subtending branches are A branches. We omit D/A here for clarity of the figure. Furthermore, we omit in the extended sampled tree the fossils within stratigraphic range i that are younger than o, and in the sampled trees the fossils that appear between o and y, as the times of these fossils do not contribute to the probability density of the respective tree. The numbering of species and bifurcation events is chosen to simplify the notation and does not reflect the chronological order of the events. Theorem 2 provides the probability density for the oriented extended sampled tree, Corollary 5 for the labeled extended sampled tree, and Corollary 6 for the extended sampled tree when summing over the possible tree topologies. Theorem 8 provides the probability density for the oriented sampled tree. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Stratigraphic ranges and extended stratigraphic ranges

We now assign all samples belonging to the same species to a stratigraphic range, where o (time of the oldest sample of stratigraphic range i) and y (time of the youngest sample of stratigraphic range i) are the first and last sampled appearances, respectively. Note that a stratigraphic range is a segment of a lineage that does not contain D-branches, with the exception of the first branch belonging to the segment. In other words, it is simply a segment of a straight line in the graphical representation of the tree. The stratigraphic ranges in Fig. 3 are marked in blue. For all species where only one sample is collected, we have . For species where we only have an extant sample, the stratigraphic range is only represented by that particular extant sample (species 3 in Fig. 3); for species where we only have one fossil sample and no extant samples, the stratigraphic range is only represented by that particular fossil sample (species 5 in Fig. 3). We denote the extinction time of a species associated with stratigraphic range i with d (d for death). We set for the associated species being an extant species. An “extended stratigraphic range” defines a continuous lineage between the first appearance of a species (time o for stratigraphic range i) and the extinction of the species (time d for stratigraphic range i). The extended stratigraphic range for the six sampled species in Fig. 3 are highlighted in grey. We have in the case of the sampled species surviving to the present (species 3,4 in Fig. 3). If the extended stratigraphic range is equivalent to the stratigraphic range (species 3 and 4 in Fig. 3). Note that species 6 in Fig. 3 has values o6, y6, d6 displayed, but we dropped these values for some of the other species for clarity of the figures.

Sampled trees and extended sampled trees

Lineages without sampled descendants are deleted from the complete tree to obtain a sampled tree. In this section, we discuss two types of sampled trees: the “sampled tree” and the “extended sampled tree” (see Fig. 3). To obtain the “sampled tree” (or phylogenetic tree) on stratigraphic ranges, all lineages without sampled descendants are deleted from the complete tree. Each branching event that is maintained in the extended sampled tree inherits the labels A and D from the corresponding branching event in the complete tree. As above, going from root to tip, at each speciation event we draw an A-branch as a straight line directly below its ancestral D-branch, while we draw the D-branch to the right of its ancestral branch (Fig. 3, see left for the complete tree, and right for the sampled tree). Thus we generalize the graphical representation to sampled trees, where some species may be missing, meaning straight lines may now correspond to several species. A stratigraphic range remains as a segment of a straight line in the graphical representation of the sampled tree. To obtain the “extended sampled tree” on extended stratigraphic ranges, we delete lineages with no descendant samples from the complete tree keeping the lineages leading to the extinction times d of each sampled species i (Fig. 3, see left for the complete tree, and middle for the extended sampled tree). Note that the extended sampled tree and the sampled tree are oriented trees. Oriented trees facilitate derivations of probability densities, while most phylogenetic trees inferred from empirical data are “labeled” trees, i.e., trees where each sample has a unique label but no orientations are assigned. We obtain a labeled tree from an oriented tree by omitting all A/D orientations, and labelling the sampled species uniformly at random with unique labels. However, despite ignoring the orientation in a labeled tree, we may still know the ancestor-descendant relationships of some branches in the extended sampled tree or in the sampled tree, namely a new species budding off from a stratigraphic range is known to be the descendant (e.g., species 2 in Fig. 3 is a descendant). We will show below how to transform the probability density of an extended sampled tree into a probability density of a labeled extended sampled tree, such that our results can be applied to labeled trees. For sampled trees, we could not find such a transformation. Further, note that typically we have information about the stratigraphic range only, and not the extended stratigraphic range, since we do not know the extinction time d for each stratigraphic range i. Nevertheless, marginalizing over unknown d using numerical techniques (e.g., Markov chain Monte Carlo, MCMC) may be advantageous compared to considering sampled trees when we have stratigraphic range data but no information about their phylogenetic relationships. In this case, we cannot infer the underlying tree topology, however, using extended stratigraphic ranges, we can integrate over tree topologies analytically (Section “Probability density of the extended sampled stratigraphic ranges” below). In what follows, first, we calculate the probability density of an extended sampled tree, including expressions for when we only know the extended stratigraphic ranges but lack information on the phylogenetic relationship of these ranges (thus all tree topologies which have the range data embedded may be possible). Using results for the extended sampled tree, we then calculate the probability density of a sampled tree.

Probability density of the extended sampled tree

We now calculate the probability density of an extended sampled tree. This allows us to estimate the parameters of the FBD model under asymmetric speciation, λ, μ, ψ, ρ, from the extended sampled tree. Additionally, this probability density can be used as the tree prior in Bayesian inference to estimate the extended sampled tree, given observed stratigraphic range data. For the derivation of the probability density, we need some notation. Throughout this paper, n is the number of sampled species, i.e., in the context of extended sampled trees, n is the number of extended stratigraphic ranges. The extended sampled tree on n sampled species is a binary tree with branching events. One stratigraphic range i has birth time meaning the stratigraphic range did not originate via speciation but started the process (species 6 in Fig. 3). All other stratigraphic ranges originate via branching off another lineage, or, more formally, via the branching event giving rise to the most recent D-branch ancestral to that particular stratigraphic range. For a given stratigraphic range i, this time is denoted by b (b for birth). Note that b is the speciation time of the species associated with stratigraphic range i in the case of a fully sampled tree, but may be the speciation time of a non-sampled ancestor of the sampled stratigraphic range i in the case of incomplete sampling (such as species 4 in Fig. 3). Let k be the total number of sampled fossils and m represents the number of sampled species going extinct before the present, where d > 0. Of the number of stratigraphic ranges with let l stratigraphic ranges have an extant sample. In Fig. 3, (species 1, 2 and 6 go extinct), and (species 3 and 4 have an extant sample). Note that if and we sample all species, as we have no extinct species and sample every extant species. We call this case the “guaranteed complete sampling” case. All other parameter combinations are referred to as “potential incomplete sampling”. In the case of guaranteed complete sampling ( and ), the probability density of the oriented extended sampled tree, is, We know that a lineage from b to the extended stratigraphic range i is associated with a single species, as we have no unobserved branching events. Further, as we have no extinction events. The probability of no event happening on a lineage for time b is . The rate for each of the speciation events is λ, the rate for each of the k fossilization events is ψ. Multiplying these components establishes the theorem. □ We note that it follows from the last theorem that, under guaranteed complete sampling, the probability density of an extended sampled tree (Fig. 3, middle, with k sampled fossils), only depends on k, n and the birth times b for each stratigraphic range i. In the following theorem, we show that under potential incomplete sampling, the probability density of the extended sampled tree in fact only depends on k, n and the times b and d for each stratigraphic range and not on the times of each of the k fossils. In the case of potential incomplete sampling, the probability density of the oriented extended sampled tree, is, with, This proof follows in logic the derivations in Stadler (2010). First, define p(t) to be the probability that an individual at time t in the past does not leave any sampled fossils or sampled extant descendants. We note that at time in the past, within a small time interval Δt, an extinction event of a lineage happens with probability (μΔt), no event happens with probability and a speciation event happens with probability λΔt, thus, Rearranging and letting Δt → 0 leads to, The initial value is . Differentiation of Eq. (2) and plugging it into this differential equation establishes the expression for p(t) (this was derived in our earlier work (Stadler, 2010)). The probability density of an individual associated with stratigraphic range i at time t ∈ [o) producing an extended stratigraphic range as observed within (t, d] is described by the differential equation, This differential equation is derived analogous to the differential equation for p(t). The initial value for stratigraphic range i is where if d > 0, if and the extant species is sampled, and if and the extant species is not sampled. Differentiation of Eq. (3) and plugging it into the differential equation shows that is a solution of the differential equation, with . Thus for stratigraphic range i, . Next, stratigraphic range i is traced back into the past from o to b during which time we do not know if the lineage belongs to the same species, as there may be unobserved speciation events. During that interval [b) the probability density of an individual at time t with b ≥ t > o producing an extended stratigraphic range i as observed is described by the differential equation, The initial value for stratigraphic range i is . Note that the additional 2 in the differential equation for Q(t) compared to allows for unobserved speciation events where the ancestor species or the descendant species are not sampled, while only considers unobserved events where the descendant species is not sampled. The differential equation for Q(t) has been already solved in Stadler (2010): our expression for q(t) in Eq. (4) is a solution of the differential equation for Q(t) with initial value . Analogous to we write The rate of a lineage originating via budding from another lineage is λ and sampling of each one of the k fossils happens with rate ψ. Multiplying the probability densities Q(b) for all extended stratigraphic ranges i, the speciation rates, and sampling rates, and dividing by the probability of obtaining a sample (i.e., conditioning on sampling via ), establishes the theorem. □ We note that in the case of guaranteed complete sampling, where and we have and the expression for simplifies to an expression that we encountered already in Theorem 1. If we were to know the oriented, complete tree with the fossil samples and extant species samples, meaning there are no unobserved events, regardless of what fossil or extant samples were collected, then we could calculate the probability density of an oriented, complete tree as, where is the probability of observing a single species in the time interval (b). As a theoretical side note, we further conclude . For establishing we note that the left hand side is the probability density of a given stratigraphic range, with any number of hidden speciation events (including no hidden events); the right hand side is the probability density of the stratigraphic range, without hidden speciation events – this is a special case of the left hand side. For establishing we note that the right hand side is the probability density of a stratigraphic range, meaning the lineage between b and d belongs to the same species, while the left hand side is the probability of a lineage allowing for unobserved speciation events, thus the lineage may correspond to different species before and after unobserved speciation events. Again, the right hand side is a special case of the left hand side. Rather than oriented trees, most software packages perform inference over labeled trees (see Fig. 2, right). That means all n sampled species are labeled uniformly at random with n labels (n! possibilities), and the orientations A and D are summed over, unless we know the orientation. We know the orientation if a stratigraphic range produces a new descendant species: A is the label of the descending branch associated with the stratigraphic range. We denote with v the number of branching speciation events where we know the orientation. This leads to the following corollary. In the case of potential incomplete sampling, the probability density of the labeled extended sampled tree, is, In the case of guaranteed complete sampling ( and ), the probability density of the labeled extended sampled tree, is,

Probability density of the extended sampled stratigraphic ranges

Next we assume that instead of an extended sampled tree, we only know the n sampled stratigraphic ranges with start times o and end times y (), of which l contain a sampled extant species, and there are k sampled fossils. For each stratigraphic range, we augment our data with the values b > o and d < y such that there are no gaps, that is, for each there is j such that b ∈ (b), with the exception of i for which . We aim to calculate the probability density of these stratigraphic ranges with the corresponding b and d. Using this probability density, one can estimate speciation and extinction rates based on fossil occurrence data (i.e., stratigraphic ranges) by marginalizing numerically over all possible speciation and extinction times (b) using methods such as MCMC. This has been done previously assuming all extinct species have at least one fossil sample in Silvestro et al. (2014). In summary, given o, y, b and d () together with k and l (we summarize ), and the parameters λ, μ, ψ, ρ, we need to evaluate the probability density of given the parameters. The probability density of is obtained from Theorems 1 and 2 by integrating over all possible tree topologies which have embedded. The following theorem states this probability density. Let γ be the number of lineages co-existing at the birth time b of stratigraphic range i. For the oldest stratigraphic range i (with birth time ), we have . In Fig. 3, we have . The probability density for under potential incomplete sampling is, The probability density for under guaranteed complete sampling ( and ) is, This corollary is a direct consequence of Theorems 1 and 2 by noting that an extended stratigraphic range i has rate λγ to be initiated via speciation by one of the γ coexisting lineages (while in Theorems 1 and 2 the rate of a branching event along a particular lineage happens with rate λ). As for the extended sampled tree, the probability density of the extended sampled stratigraphic ranges only depends on k, n and the times b and o for each stratigraphic range and not on the times of each of the k fossils. The probability density of is obtained by integrating over oriented trees. Note that each tree topology giving rise to has a known orientation at each branching event (as we augmented each stratigraphic range i with b), implying . Thus, the probability density of where f is a mapping of the intervals to some labels, integrated over labeled trees is obtained by multiplying with . Theorem 2 for the FBD model on extended sampled trees with stratigraphic ranges is the analog of Eq. (1) in Gavryushkina et al. (2014) for the FBD model on sampled trees without stratigraphic ranges, considering fossil phylogenetic relationships explicitly. Equivalently, Corollary 6 is the analog of Eq. (1) in Heath et al. (2014) integrating over fossil phylogenetic relationships analytically.

Probability density of the sampled tree

For the extended sampled tree with stratigraphic ranges described above, we infer extinction times d and avoid considering stratigraphic ranges that are sampled ancestors of other stratigraphic ranges. In this section, we consider the sampled tree spanning the sampled fossils and extant species, without the extinction times d (Fig. 3, right). In a sampled tree, the stratigraphic range i may be a “tip-stratigraphic range”, meaning the fossil at time y is a tip in the sampled tree, or may be a “sampled-ancestor-stratigraphic range”, meaning the fossil at time y has sampled descendants. Species 1–5 correspond to tip-stratigraphic ranges, and species 6 corresponds to a sampled-ancestor-stratigraphic range in Fig. 3. Again, as before, we use n to denote the number of sampled species, i.e., the number of stratigraphic ranges. Recall that an extended sampled tree had branching events. Due to the sampled-ancestor-stratigraphic ranges, a sampled tree may have fewer than branching events. Let j stratigraphic ranges be sampled-ancestor-stratigraphic ranges (in Fig. 3, ). The sampled tree has branching times and origin time x0. Note that of a sampled tree is a subset of of an extended sampled tree. For derivations only, we consider the oldest and youngest fossils as explicit nodes that subdivide branches in the sampled tree (in contrast to the sections above where a node had degree three or degree one, and in contrast to Stadler (2010) where all fossils were treated as nodes in the sampled tree under the classic FBD model without stratigraphic ranges). Stratigraphic ranges where are assumed to have a branch between o and y with length 0. The sampled tree then consists of the following nodes: degree-three nodes, with the branching times at n degree-two nodes, at the time of the oldest fossils (i.e., the start of a stratigraphic range) j degree-two nodes, at the sampled-ancestor-stratigraphic range times y with i being a sampled-ancestor-stratigraphic range (in our example in Fig. 3, and ), degree-one nodes (tips), at the tip-stratigraphic range times y with i being a tip-stratigraphic range. Of these nodes, l nodes are at time . For ease of notation in what follows, we label the stratigraphic ranges that are tip-stratigraphic ranges with (in our example in Fig. 3 stratigraphic ranges 1–5), one degree-one node, the origin of the tree at time x0. Each branch connects two nodes which may be of degree one, two, or three. Thus in the sampled tree, each branch is either fully part of a stratigraphic range, or not at all part of a stratigraphic range. A branch belonging fully to a stratigraphic range is called a “stratigraphic-range branch”. If a stratigraphic-range branch gives rise to a speciation event, precisely one descendant branch is a stratigraphic range-branch. In total, v stratigraphic-range branches give rise to a speciation event. In our example (Fig. 3), as only stratigraphic range 1 gives rise to one additional species (i.e., in the sampled tree, there is only one speciation event that occurs along the sampled stratigraphic range of a given species). In total the sampled tree has branches. Let i ∈ I if stratigraphic range i and its most recent ancestral stratigraphic range, a(i), lie on a straight line in the graphical representation of the sampled tree. In our example Fig. 3, we have . By definition, stratigraphic range i ∈ I and its most recent ancestral stratigraphic range a(i) belong to different species, thus we need to ensure that there is an unobserved speciation event between y and o. We assume that the the species corresponding to stratigraphic range i originated at time t ∈ (y, o), and first augment our data with these times t (see t3 in Fig. 3, right). Second, we analytically integrate over t. We refer to the oriented, sampled tree as . To obtain the probability density of an oriented sampled tree, we multiply the contribution of each branch in the sampled tree, as in Theorem 2. For a branch with start time s and end time e (forward in time), the contribution is if it is a stratigraphic range-branch, and otherwise. We further need to specify the initial values at the tips of the tree. The initial value of each tip-stratigraphic range with y > 0 is the probability of having no sampled descendants multiplied by the rate of fossil sampling, ψp(y), and the initial value of each tip-stratigraphic range with is the probability of sampling an extant species, ρ. Additionally, we need to correct for the unobserved speciation times of species from I. First, we need to multiply by λp(t), — the probability of a speciation event at the unobserved speciation time t and the fact that one of the lineages descending the speciation event was not sampled. Second, we need to account for the fact that all the branches belonging to the lineage starting at t (for a moment we assume that t also subdivides a lineage into branches) and ending at o are stratigraphic range-branches (as they all belong to the same species associated with stratigraphic range i), although we treated them as non-stratigraphic range-branches in the previous paragraph. That means that we first need to multiply by and then by . Thus we obtain the following directly from Theorem 2: In the case of potential incomplete sampling, the probability density of the oriented sampled tree is, where the contribution of branch B Rather than augmenting the state space by t ∈ I, we can integrate analytically over all t. The integral over t is, We evaluate this integral over the interval [o], with a(i) being the most recent ancestral stratigraphic range of i as above, and thus obtain: In the case of potential incomplete sampling, the probability density of the oriented sampled tree is, The term within the right product can be written as . The right bracket is the probability of zero or more unobserved speciation events that change a species along the lineage starting at y and ending at o minus the probability of zero unobserved speciation events that change the species along this lineage. For labeled trees, the term that accounts for unobserved speciation events between ancestor-descendant stratigraphic ranges ( and above) becomes more complex. A labeled tree does not show the graphical representation, meaning we do not know which ancestor-descendant stratigraphic ranges lie on a straight line and thus we do not know which ranges need to be separated by a speciation event. Instead we need to integrate over the possibilities of these ranges lying on a straight line or not, which is non-trivial. Suppose there are two ancestor-descendant stratigraphic ranges that are separated by an observed branching event. As we do not know the orientation of this event, there could have been two possible scenarios: either the two ranges lie on the same straight line, and thus the separating branching event is a budding speciation event giving rise to an additional new species. In this case we need to enforce an unobserved speciation event between the two ranges such that it is guaranteed that they belong to different species. Alternatively, the observed speciation event causes the two ranges not to lie on the same straight line, then we do not have to force an unobserved speciation event. When several speciation events separate a pair of ancestor-descendant stratigraphic ranges or when the same stratigraphic range is the most recent sampled-ancestor of several stratigraphic ranges we could not find a simple expression for the number of different possible scenarios. Thus we cannot provide an expression for the probability density of labeled trees here. In the case of guaranteed complete sampling, the sampled tree equals the extended sampled tree, as for all species. Thus the probability densities from Theorem 1 and Corollary 5 apply. An expression for the probability density of sampled trees when ignoring tree topology (analogous to Section “Probability density of the extended sampled stratigraphic ranges”) does not seem to be straightforward. In fact, it seems more straightforward to integrate over tree topologies of sampled trees using MCMC methods. Ignoring tree topology can further be achieved by estimating parameters based on the extended sampled stratigraphic ranges and integrating over d using MCMC. The complication can be attributed to the sampled-ancestor-stratigraphic ranges. The γ for the number of possible attachment points of a stratigraphic range i in the extended sampled tree scenario is independent of the placement of the other stratigraphic ranges. In the case of the sampled tree, if we ignore the sampled-ancestor-stratigraphic ranges (i.e., replacing them with normal branches in the sampled tree), we can sum over tip-stratigraphic range topologies, analogous to the extended sampled tree scenario where we only have tip-stratigraphic ranges. However, we then have to additionally account for the number of placements of the sampled-ancestor-stratigraphic ranges. This number does not seem to follow a simple formula. Consider two sampled-ancestor-stratigraphic ranges with range (4, 3) (call it SA1) and (3.5, 2) (call it SA2) (see Fig. 4). Assume there is one tip-stratigraphic range X with and then there is space for a sampled-ancestor-stratigraphic range on the interval (5, 2.5). Additionally, assume there is one tip-stratigraphic range Y with and leaving space on the interval (5.5, 1.5), and one tip-stratigraphic range Z with and leaving space on the interval (3.8, 1).

Fig. 4

Illustration of sampled-ancestor-stratigraphic range assignment to non-stratigraphic range lineages X, Y, Z of a sampled tree. If sampled ancestor SA1 is assigned to lineage X, then SA2 can be assigned to Y or Z, while if SA1 is assigned to lineage Y, then SA2 can be assigned only to Z. Thus both sampled-ancestor-stratigraphic ranges fit on two lineages (SA1 on lineages leading to X and to Y; SA2 on lineages leading to Y and Z), and we could be tempted to multiply by for both sampled-ancestor-stratigraphic ranges (meaning we would have in total 4 possible sampled trees). However, given SA1 is assigned to lineage Y (out of X and Y) then SA2 can only be assigned to Z. On the other hand, if SA1 is assigned to X, then SA2 can be assigned to Y or Z. Meaning the number of choices for each sampled-ancestor-stratigraphic range are not independent of the other sampled-ancestor-stratigraphic ranges (here we have in total three possible sampled trees; see Fig. 4). This non-independence of sampled-ancestor-stratigraphic range placement in a sampled tree makes the analytic integration over tree topologies non-trivial. If all sampled-ancestor-stratigraphic ranges have length 0 (i.e., ), we can analytically sum over topologies following Heath et al. (2014), as the different sampled ancestor fossils do not influence each other when being assigned to branches.

Mathematics of the mixed speciation FBD model

After having discussed the FBD model under asymmetric speciation, we now allow for three speciation modes: asymmetric, symmetric, and anagenetic speciation. First we assume that the probability of a branching speciation event being symmetric is β. That is, we extend the FBD model with rates λ, μ, ψ, ρ assigning to each branching event an asymmetric speciation event with probability and a symmetric speciation with probability β. Further, each lineage has rate λ of producing an anagenetic speciation event, i.e., a speciation event without branching. The mixed speciation model has parameters λ, μ, ψ, ρ, β, λ and setting the additional parameters β and λ to zero converts to the initial asymmetric speciation FBD model. This mixed speciation FBD model induces oriented trees where each branch is labeled either left or right. A complete tree produced by this process will be represented by an oriented tree where all nodes have degree-three at most, and all degree-three nodes are of two types reflecting the mode of speciation: asymmetric or symmetric speciation nodes. At nodes representing an asymmetric speciation event, we always assume that the new species starts with the right branch (which would correspond to a D-branch in the previous sections and the left branch would correspond to an A-branch). The left and right descendant branches of a symmetric speciation event are equivalent and we need the orientation only for the convenience of derivations. Further, we have degree-two nodes that represent anagenetic speciation events that also subdivide branches. A branch that descends from an anagenetic speciation event has the same orientation as its ancestor branch. A species in the complete tree under the mixed speciation process is represented by a lineage consisting of a starting branch, which can be: and several (or none) left descending branches (analogous to the A-branches) produced by asymmetric speciation. We define stratigraphic ranges in the complete tree as before. the initial branch starting at time x0, a branch produced by symmetric or anagenetic speciation, or the right branch (analogous to the D-branch) of an asymmetric speciation event, Following the previous sections, we draw the branches belonging to the same species as straight lines in the complete tree (Fig. 5, left). Thus, at an asymmetric speciation event the left branch (analogous to the A-branch) continues the ancestral branch and the right branch (analogous to the D-branch) is drawn on the righthand side of the ancestral branch. For the symmetric speciation event both descendant branches correspond to a new species and are drawn on both sides of the ancestral branch. To designate an anagenetic speciation event we draw the descendant branch slightly shifted to the right of the ancestral branch in the complete tree.

Fig. 5

A complete species tree with three speciation modes (mixed speciation) is shown on the left. A sampled tree with mixed speciation is shown on the right.

A complete species tree with three speciation modes (mixed speciation) is shown on the left. A sampled tree with mixed speciation is shown on the right. A sampled tree (Fig. 5, right) is obtained by deleting all lineages without sampled descendants and ignoring anagenetic speciation nodes. Each branching node inherits its type (asymmetric or symmetric) from the complete tree. We draw branches produced by asymmetric and symmetric speciation nodes in the same way as in the complete tree. Finally, as in the asymmetric speciation case, a straight line in the sampled tree does not necessarily represent a single species in the sampled tree, as there may be unobserved speciation events. Analogous to the asymmetric case, a stratigraphic range in the sampled tree is a segment of a lineage that does not contain unobserved symmetric speciation events and asymmetric speciation events where the species associated with the lineage changes. In other words, it is simply a segment of a straight line in the graphical representation of the sampled tree (see Fig. 5). Suppose again we have sampled n species, i.e. n stratigraphic ranges, and consider a sampled tree describing the phylogenetic relationship of these stratigraphic ranges. We define a sampled-ancestor-stratigraphic range as before. Let j stratigraphic ranges be sampled-ancestor-stratigraphic ranges, the remaining stratigraphic ranges are tip-stratigraphic ranges. The sampled tree has asymmetric branching times symmetric branching times and a time of origin x0. For derivations only, we again consider the oldest and youngest fossil of each stratigraphic range as explicit nodes that subdivide branches in the sampled tree. For convenience, as before, we count sampled nodes that represent a stratigraphic range consisting of a single fossil (i.e., ) twice, as well as counting zero-length branches that begin and end at these sampled nodes. The sampled tree then consists of the following nodes: w degree-three nodes, with the asymmetric branching times at degree-three nodes, with the symmetric branching times at n degree-two nodes, at the time of the oldest fossils j degree-two nodes, at the sampled-ancestor-stratigraphic range times y, with degree-one nodes (tips), at the tip-stratigraphic range times y, and one degree-one node, the origin of the tree at time x0. In total, the sampled tree has branches (also counting the initial branch beginning at the time of origin). As before, we define a set I consisting of stratigraphic ranges that have their most recent sampled ancestors on a straight line in the graphical representation. In the case of potential incomplete sampling, the probability density of the oriented sampled tree with w asymmetric branching events, under mixed speciation is, where p(t), c1, c2, q(t) are defined as in Theorem 2, and the contribution of branch B with Note that the probability densities p(t) and q(t) are the same as in the asymmetric case, as they do not depend on β and λ. For p(t), we note that the type of branching event does not influence the probability density of not sampling any descendants and only the total rate λ will contribute to the expression for p(t). The possibility of having a speciation event without branching does not influence the probability density of not sampling any descendants either. The equation for q(t) also does not depend on the types of branching events that may have happened along the branch (i.e., asymmetric or symmetric speciation), because in both cases one lineage must not have been sampled and the other must have given rise to the observed tree. The possibility of having anagenetic speciation events along a lineage does not influence q(t) either, because anagenetic speciation does not change the sampled tree. The probability density of an individual associated with stratigraphic range i at time t ∈ [o) producing an stratigraphic range as observed within (t, y] is described by the differential equation, Here, given that we know that the whole branch belongs to the same species we can eliminate the possibility of anagenetic or symmetric speciation events, along with sampling or death events. There may still be asymmetric speciation events along this branch, but the descendant species must not have been sampled, which is accounted for by the second term. For stratigraphic range i, the initial condition is with if y > 0 and i is a tip-stratigraphic range, if y > 0 and i is a sampled-ancestor-stratigraphic range, and for . Plugging the expression for into the differential equation proves that this is a solution with . Thus, and the contribution of stratigraphic range i and its descendants to the probability density of the sampled tree is . In other words, we can write the probability density of the oriented sampled tree with fossil samples partitioned into stratigraphic ranges by multiplying the contribution of all branches () together with the initial values at the tips, the speciation rates, the fossilization rates, and the term for conditioning on a sample, as before. The right-most product in Eq. (10) accounts for the required unobserved speciation events prior to the stratigraphic ranges in I. Again, for each stratigraphic range i ∈ I, we multiply by as we do not want to continue assuming that the interval within [o] is associated with an arbitrary number of unobserved events. Then we take the difference of (i) the probability that in this interval any number of unobserved speciation events that change a species along that lineage happened () and (ii) the probability that no unobserved speciation event that change a species along that lineage happened (). Simplifying yields the expression stated in the theorem. Note that as an alternative proof for the right-most product, we could have integrated over t as in Theorem 8. □ Under mixed speciation, in the case of guaranteed complete sampling, the probability density of the oriented sampled tree, is Note that the second to last product accounts for no anagenetic speciation events within stratigraphic ranges, and the last product accounts for at least one anagenetic speciation event occurring between y and o. Setting β to one and λ to zero (that is, allowing for only symmetric speciation events) one can obtain the corresponding probability densities for the FBD process with symmetric speciation only. In the case of potential incomplete sampling, the probability density of the oriented sampled tree under symmetric speciation is, where p(t), c1, c2, and q(t) are defined as in Theorem 2, and with . As expected the expressions for densities and can be obtained from by setting λ to zero and β to the extreme values, that is, and . The probability densities derived here can also be used for extinct clades by setting acknowledging the fact that we would include all extant species but there are none.

Marginalizing over the number of fossils within a stratigraphic range

There may be a degree of uncertainty associated with the number of fossil specimens that were sampled throughout the stratigraphic range of a given species. In many cases, more effort has gone into researching the age of the oldest (o) and youngest (y) fossils (the first and last appearances) of a given species, and it is rare that fossils have been sampled within the stratigraphic range with a constant rate ψ. Thus, we now derive an expression for the probability density of a tree, given the oldest and youngest fossils of each sampled species, marginalizing over the number of fossils within this range. In other words, we integrate over the number of fossil samples, k, for the probability densities derived above. Let κ′ be the total number of sampled fossils that represent the start and end times of a stratigraphic range. If a stratigraphic range is represented by a single fossil then this fossil only contributes one towards κ′. Let κ be the total number of sampled fossils that are found within any given stratigraphic range. In our example in Fig. 3 we have . Note that . Let the sum of all stratigraphic range lengths be . The symbol denotes an extended sampled tree (under asymmetric speciation) or a set of extended sampled stratigraphic ranges D (under asymmetric speciation) or a sampled tree (under asymmetric, symmetric, or mixed speciation). Further, may be oriented or labeled. Let be ignoring the κ fossils sampled within stratigraphic ranges. We further denote the parameters of the FBD model (λ, β, λ) with η. Both in the case of potential incomplete sampling and guaranteed complete sampling, the probability density of is, Note that κ is unknown, however cancels out with ψ meaning does not depend on κ while it depends on κ′. Note that can be obtained from by adding the times, of the κ fossils sampled within the stratigraphic ranges. Then and can be written as: with . From Theorem 2 (resp. Corollary 5) for extended oriented (resp. labeled) sampled trees under asymmetric speciation, Corollary 6 for extended sampled stratigraphic ranges under asymmetric speciation, and Theorem 11 for oriented sampled trees under mixed speciation (and thus in particular under asymmetric or symmetric speciation), we observe that H is independent of κ and under potential incomplete sampling, while H depends on the value κ′. In the case of guaranteed complete sampling (i.e., and ), Theorem 1, Corollary 5, Corollary 6, and 12 show that H is independent of κ and τ, and again H depends on the value κ′. We now want to integrate over all τ to obtain and then sum over all κ, to eliminate κ. Note that each of the κ fossil may be placed anywhere along the stratigraphic ranges with total length L. Thus, In the last equation we employed the fact that is the probability density of a realization of a Poisson process with κ events over time period L. Summing over all κ leads to, which establishes the theorem. □ Under asymmetric speciation with guaranteed complete sampling for oriented trees, based on Theorem 1, with L being the sum of all branch lengths, Theorem 14 simplifies to, Thus, This probability density can also be proven in a direct way. The term is the probability density of the fossils at the start and end of a stratigraphic range being sampled. The term is the rate for the branching events. The probability that no branching events happened along any of the branches is . The probability that no sampling event happened along any of the branches outside the stratigraphic ranges is . Under mixed speciation with guaranteed complete sampling for oriented trees, based on Corollary 12, with L being the sum of all branch lengths, Theorem 14 simplifies to, Thus, This probability density can also be proven in a direct way. The term is the probability density of the fossils at the start and end of a stratigraphic range being sampled. The term is the rate for the branching events, w of which are asymmetric, while the remaining are symmetric, which is accounted for by . The probability that no branching events happen along any of the branches is . The probability that no sampling events happen along any of the branches outside the stratigraphic ranges is . The probability that no anagenetic speciation events happen along the stratigraphic ranges is . The term accounts for unobserved anagenetic speciation events that must have taken place between pairs of ancestor-descendant stratigraphic ranges that lie along the same line.

Marginalizing over the number of fossils within a stratigraphic interval

Instead of recording the age of the oldest and youngest fossils precisely (see previous section), some datasets may only record whether a fossil species was present or not within a given stratigraphic interval spanning the time interval [x, y]. Thus, a branch of a sampled tree within the time interval [x, y] has either one or no fossil “samples” assigned to it, meaning only the presence or absence of a species is recorded. In other words, an assignment of one means that at least one fossil specimen of a particular species was found, but in fact any number k fossil specimens may have been found within that interval. We will now derive equations accounting for only recording presence / absence of fossil specimens for a species rather than the exact number of fossil specimens k in each stratigraphic interval. As in the last section, the symbol denotes an extended sampled tree (under asymmetric speciation) or a set of extended sampled stratigraphic ranges D (under asymmetric speciation) or a sampled tree (under asymmetric, symmetric, or mixed speciation). Further, may be oriented or labeled. A branch in connects speciation nodes and/or tip nodes; fossil samples do not induce new branches. Now we subdivide all branches in into sub-branches, the time points for the start and end of the sub-branches are the start and end points of stratigraphic intervals. Since we do not know the timing for the first and last fossil (o and y) for each stratigraphic range, we have to estimate it. We suggest two options. Either we numerically integrate over o and y using MCMC methods. Alternatively, we make an approximation assuming that if a fossil is found in a particular time interval, the corresponding species existed throughout that time interval, meaning o (resp. y) are the start (resp. end) times of the stratigraphic intervals where a species was found first (resp. last). An exception is that if the fossil is ancestral (resp. descendant) of a speciation node within that interval, then the speciation node is the new time y (resp. o). Let be the (unknown) number of fossil specimens along sub-branch . We set if and otherwise, meaning indicates the presence / absence of a species. Let be using the information on instead of and using the potentially altered o, with l referring to intervals in the stratigraphic record. Let be the length of sub-branch . Both in the case of potential incomplete sampling and guaranteed complete sampling, the probability density of is, Note that k is unknown, however cancels out with ψ meaning does not depend on k. The proof is very similar to the proof of Theorem 14. First, note that can be obtained from by adding the times, of the k fossils sampled along sub-branches with . Then and can be written as: with . Analog to Theorem 14, it can be shown that is independent of k and . Analog to Eq. (11), we obtain, with being the fossil sampling times on sub-branch Since summing over for all leads to, □ Under asymmetric speciation with guaranteed complete sampling for oriented trees, based on Theorems 1 and 17, with L being the sum of all branch lengths, we have, analog to Remark 15, Note that is the probability of no fossil samples along sub-branches with and is the probability of at least one fossil sample on each sub-branch with . Under mixed speciation with guaranteed complete sampling for oriented trees, based on Corollary 12 and Theorem 17, with L being the sum of all branch lengths, we have, analog to Remark 16,

Discussion

Due to the lack of statistical models combining neontological data (such as molecular sequence data) and paleontological data (such as stratigraphic ranges), these data are typically not analyzed within a single framework. Here, we formulate the FBD model under different modes of speciation giving rise to phylogenies and stratigraphic ranges, allowing for incomplete sampling of extinct and extant species. We introduce novel macroevolutionary models where we explicitly model the mode of speciation through time in a phylogenetic context. As part of these new models, we derived the probability density, (with ), of a phylogenetic tree (referred to as the sampled tree in the mathematical derivations) on fossil and extant species samples. Specifically, several samples may be assigned to a single species, yielding so-called stratigraphic ranges in the phylogenetic tree. Thus, our equations will allow for a coherent and flexible analysis of paleontological and neontological data. In particular, we derived the probability density of the phylogenetic tree under asymmetric (budding) speciation in Theorem 8, and for speciation being either asymmetric (budding), symmetric (bifurcating), or anagenetic, in Theorem 11. These phylogenetic trees may have sampled-ancestor-stratigraphic ranges, where the entire stratigraphic range is an ancestor of a descendant sampled species. Treatment of sampled ancestors is computationally challenging, requiring novel operators (i.e., proposal mechanisms) in Bayesian analyses (Gavryushkina, Welch, Stadler, Drummond, 2014, Heath, Huelsenbeck, Stadler, 2014, Zhang, Stadler, Klopfstein, Heath, Ronquist, 2016). In the case of asymmetric speciation, the extended stratigraphic range (meaning the species from its first sample to its extinction time) can never be a sampled-ancestor-stratigraphic range, as the extinction event terminates a lineage. Thus we explore the extended stratigraphic range further under asymmetric speciation. Corollary 5 states the probability density of the tree connecting all samples, while knowing the extinction times for each sampled species (i.e., the extended sampled tree). Corollary 6, taking advantage of the absence of sampled-ancestor-stratigraphic ranges, additionally integrates analytically over all tree topologies. Since under symmetric speciation, a speciation event coincides with the extinction of the ancestor species and thus in the extended sampled tree we may also have sampled-ancestor-stratigraphic ranges, we do not explore the extended sampled tree further under mixed speciation. We envision that Theorems 8 and 11 will be useful when tree inference is based on molecular and morphological data (such as for mammals). These expressions consider oriented trees rather than labeled trees. Because the analytical solution for the probability density of labeled trees is not possible with our equations, this motivates adapting phylogenetic software to oriented trees in order to use the provided equations. In the case of asymmetric speciation, if the inferred extinction times for each sampled species are of interest, then Corollary 5 for labeled trees is appropriate. Many fossil datasets contain only fossil occurrence times without any morphological or molecular information and the tree topology cannot be inferred, therefore, Corollary 6 can be employed for such cases. Silvestro et al. (2014) also considers fossil occurrences, however, the equations assume that at least one fossil per extinct species is sampled, while we allow for non-sampled extinct species. For some rare and well studied groups (e.g., terrestrial vertebrates, dinosaurs) we may know the number of specimens collected through time, a number required by the equations above. In many circumstances, however, we only have information about the first and last occurrence times of a species, but not necessarily how many fossils were sampled in between (e.g., many marine invertebrates). Thus we further provide Theorem 14 to integrate over the number of fossils within a stratigraphic range for any of the settings mentioned above. In other circumstances, we may only have information about the presence or absence of a fossil species within a given stratigraphic interval or layer, but not the total number of specimens of a particular species sampled within each interval. We take the presence / absence data into account in Theorem 17. We focussed on a thorough mathematical treatment of the FBD model under different modes of speciation in this paper. The results, namely the probability of a tree, with η being the FBD model parameters will be crucial for inferring posterior distributions of trees and model parameters (such as speciation and extinction rates) based on molecular and morphological data from extant and fossil species, or fossil occurrence data. Denoting all those data with data, and summarizing all parameters from models of evolution for the molecular and morphological data with θ, a Bayesian method aims to infer, Thus, with the FBD model under different modes of speciation implemented as prior densities in Bayesian inference tools, we can readily infer trees and parameters from both paleontological and neontological data. Bayesian inference under these models may also allow us to assess the common modes of speciation by estimating their rate parameters λ, β, λ. We use the word “may” in the previous sentence as it is not clear if all parameters can be identified based on the available data. While we know that we can infer λ from enough neontological and paleontological data, it will be exciting to explore the extent to which we can estimate further details of the speciation process from these data, i.e., estimate β and λ. Further, our mathematical results open the door to performing species-tree/gene-tree inference incorporating several fossils through time from the same species by using the probability density from Theorems 8 and 11 as a species-tree prior. We explicitly model the mode of speciation within a phylogenetic framework. Following the paleontological literature, we model asymmetric (budding), symmetric (bifurcating), and anagenetic speciation (Foote, 1996). A branching event in the phylogeny gives rise to either an asymmetric or a symmetric speciation event. Thus, these two branching speciation modes reflect the divergence of populations. In particular, these two modes of speciation do not require statements about the morphological change along these lineages; in fact, divergence may be driven by molecular rather than morphological change as observed in cryptic species. While anagenetic speciation may also be driven by molecular or morphological change, this speciation mode can typically only be identified if morphological change has occurred along a lineage to distinguish younger members from earlier ancestral forms, and if the fossil record of a given group has been sufficiently densely sampled to observe this morphological change directly (e.g. planktic forams). Thus, from a phylogenetic perspective, one might argue that we only want to model speciation processes that do not rely on morphological change, i.e., we only model branching speciation and set . This, however, would require us to associate uncertainty with the stratigraphic range data in the sense of allowing for the possibility that different stratigraphic ranges actually belong to the same species despite being morphologically distinct. In addition, this would also require us to allow for the possibility that a single stratigraphic range actually represents multiple morphologically very similar species. We leave it for future work to extend our model such that uncertainty in stratigraphic range data can be considered. The asymmetric speciation scenario, and in particular Theorem 8, can also be useful in epidemiology for modeling transmission trees. For some patients from whom we take a pathogen sequence at time s, we may know that they have already been infected at time o. We may also know that they are still infected at some time y more recently than s. We assume for patient i the “stratigraphic” range (o) and obviously, and/or is possible. Theorem 8 provides the probability density of a sampled tree (i.e., sampled transmission tree). Furthermore, when applying the species-tree/gene-tree framework to pathogens, yielding to a transmission-tree/gene-tree framework, we can incorporate multiple sequences per patient to infer transmission trees. We note that an oriented sampled transmission tree provides us with ancestor-descendant relationships between patients, however, an ancestor may not be the direct donor due to unsampled intermediate patients, unless we can assume guaranteed complete sampling. In the case of guaranteed complete sampling, our method may be considered as an alternative to classic methods on transmission tree reconstruction from genetic data (see Hall, Woolhouse, Rambaut, 2015, Jombart, Eggo, Dodd, Balloux, 2011). In the case of potential incomplete sampling, Didelot et al. (2017) recently proposed a method for inferring transmission trees. Compared to our method, their method can provide donor-recipient pairs. However, their approach cannot integrate over unobserved patients analytically, but requires data augmentation. The latter can be very slow in the case of many unobserved patients. In summary, more explicit treatment of paleontological and neontological data in a phylogenetic framework, as presented here, has the potential to yield more robust and accurate inferences of macroevolutionary parameters, such as phylogenetic relationships, divergence times, rates of diversification, and rates of fossil recovery. Furthermore, our mathematical results also offer potentially promising approaches for detailed analysis of pathogen sequence data from an epidemic. We end by highlighting that our approaches focus on processes inducing trees via processes such as speciation, extinction, transmission, and recovery. For the future, it will be a great challenge to further incorporate reticulation processes such as hybridization, horizontal gene transfer, and recombination.

18 in total

1. MRBAYES: Bayesian inference of phylogenetic trees.

Authors: J P Huelsenbeck; F Ronquist
Journal: Bioinformatics Date: 2001-08 Impact factor: 6.937

2. Diversity dynamics: molecular phylogenies need the fossil record.

Authors: Tiago B Quental; Charles R Marshall
Journal: Trends Ecol Evol Date: 2010-08 Impact factor: 17.712

3. The reconstructed evolutionary process with the fossil record.

Authors: Gilles Didier; Manuela Royer-Carenzi; Michel Laurin
Journal: J Theor Biol Date: 2012-09-13 Impact factor: 2.691

4. The fossilized birth-death process for coherent calibration of divergence-time estimates.

Authors: Tracy A Heath; John P Huelsenbeck; Tanja Stadler
Journal: Proc Natl Acad Sci U S A Date: 2014-07-09 Impact factor: 11.205

5. RevBayes: Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language.

Authors: Sebastian Höhna; Michael J Landis; Tracy A Heath; Bastien Boussau; Nicolas Lartillot; Brian R Moore; John P Huelsenbeck; Fredrik Ronquist
Journal: Syst Biol Date: 2016-05-28 Impact factor: 15.683

6. Bayesian Total-Evidence Dating Reveals the Recent Crown Radiation of Penguins.

Authors: Alexandra Gavryushkina; Tracy A Heath; Daniel T Ksepka; Tanja Stadler; David Welch; Alexei J Drummond
Journal: Syst Biol Date: 2017-01-01 Impact factor: 15.683

7. Total-Evidence Dating under the Fossilized Birth-Death Process.

Authors: Chi Zhang; Tanja Stadler; Seraina Klopfstein; Tracy A Heath; Fredrik Ronquist
Journal: Syst Biol Date: 2015-10-22 Impact factor: 15.683

8. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space.

Authors: Fredrik Ronquist; Maxim Teslenko; Paul van der Mark; Daniel L Ayres; Aaron Darling; Sebastian Höhna; Bret Larget; Liang Liu; Marc A Suchard; John P Huelsenbeck
Journal: Syst Biol Date: 2012-02-22 Impact factor: 15.683

9. Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration.

Authors: Alexandra Gavryushkina; David Welch; Tanja Stadler; Alexei J Drummond
Journal: PLoS Comput Biol Date: 2014-12-04 Impact factor: 4.475

10. BEAST 2: a software platform for Bayesian evolutionary analysis.

Authors: Remco Bouckaert; Joseph Heled; Denise Kühnert; Tim Vaughan; Chieh-Hsi Wu; Dong Xie; Marc A Suchard; Andrew Rambaut; Alexei J Drummond
Journal: PLoS Comput Biol Date: 2014-04-10 Impact factor: 4.475

18 in total

1. A Simulation-Based Evaluation of Tip-Dating Under the Fossilized Birth-Death Process.

Authors: Arong Luo; David A Duchêne; Chi Zhang; Chao-Dong Zhu; Simon Y W Ho
Journal: Syst Biol Date: 2020-03-01 Impact factor: 15.683

2. Calibrating phylogenies assuming bifurcation or budding alters inferred macroevolutionary dynamics in a densely sampled phylogeny of bivalve families.

Authors: Nicholas M A Crouch; Stewart M Edie; Katie S Collins; Rüdiger Bieler; David Jablonski
Journal: Proc Biol Sci Date: 2021-12-01 Impact factor: 5.349

3. The Occurrence Birth-Death Process for Combined-Evidence Analysis in Macroevolution and Epidemiology.

Authors: Jérémy Andréoletti; Antoine Zwaans; Rachel C M Warnock; Gabriel Aguirre-Fernández; Joëlle Barido-Sottani; Ankit Gupta; Tanja Stadler; Marc Manceau
Journal: Syst Biol Date: 2022-10-12 Impact factor: 9.160

4. Estimating bite force in extinct dinosaurs using phylogenetically predicted physiological cross-sectional areas of jaw adductor muscles.

Authors: Manabu Sakamoto
Journal: PeerJ Date: 2022-07-12 Impact factor: 3.061