
Quartet-Based Inference is Statistically Consistent Under the Unified Duplication-Loss-Coalescence Model.

Abstract

MOTIVATION: The classic multispecies coalescent (MSC) model provides a theoretical justification for species tree inference methods that account for incomplete lineage sorting. This has motivated an extensive body of work on phylogenetic methods that are statistically consistent under MSC. One particularly popular such method is ASTRAL, a quartet-based species tree inference method. Recent simulation studies suggest that ASTRAL also performs well when given multi-locus gene trees. Further, Legried et al. recently demonstrated that ASTRAL is statistically consistent under the gene duplication and loss model (GDL). GDL is prevalent in evolutionary histories and is the first core process in the powerful duplication-loss-coalescence evolutionary model (DLCoal) by Rasmussen and Kellis.
RESULTS: In this work we prove that ASTRAL is statistically consistent under the general DLCoal model. Therefore, our result supports the empirical evidence from the simulation-based studies. More broadly, we prove that the quartet-based inference approach is statistically consistent under DLCoal.


Year: 2021 PMID: 34048529 PMCID: PMC9113308 DOI: 10.1093/bioinformatics/btab414

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

The accurate inference of evolutionary histories of species is a grand challenge in evolutionary biology because the true evolutionary histories are rarely known (Bininda-Emonds, 2004). Consequently, the common strategy in the phylogenetic community is to rely on established statistical models of evolution when evaluating phylogenetic inference methods. One of the most prominent such models is the multispecies coalescent (MSC) model (Rannala and Yang, 2003), which accounts for incomplete lineage sorting (ILS), also known as deep coalescence. ILS is a prevalent factor that causes discordance between the observed gene tree topologies and the host species tree (Allman ). In fact, a large body of work in phylogenetics is dedicated to the design of species tree inference methods that are statistically consistent under MSC. Statistical consistency implies that as the number of observed gene trees grows, the species tree estimate converges to the true species tree that ‘generated’ the observed data. Multiple phylogenetic inference methods have been demonstrated to be statistically consistent, cf. GLASS (Mossel and Roch, 2010), R* (Degnan ), STEM (Kubatko ), MP-EST (Liu ), BUCKy (Larget ), STAR/USTAR (Allman ; Liu ), NJst (Liu and Yu, 2011), ASTRID (Vachaspati and Warnow, 2015), ASTRAL (Zhang ), other rooted triplet and unrooted quartet methods (Ewing ; Rhodes, 2020; Yourdkhani and Rhodes, 2020) and others. In recent years ASTRAL has become one of the most popular species tree inference methods among practitioners. Note that ASTRAL’s objective function is built on the notion of quartets (see Fig. 1). In particular, the proof that ASTRAL is statistically consistent under MSC stems from two observations. First, Allman demonstrated that if a species tree displays a quartet q then q is also the most likely observed (unrooted) gene tree topology. Second, it can be seen that every species tree clade will eventually appear in at least one of the observed gene trees.

Fig. 1.

All three possible quartets on leaves a, b, c and d

More recently, Legried et al. studied two natural extensions of ASTRAL that can process multi-locus gene trees. Multi-locus gene trees can have multiple leaves with the same species label (that is, the respective species has multiple copies of the same gene). These extensions allow one to apply ASTRAL to a much broader class of phylogenetic gene trees and are referred to as ASTRAL-one and ASTRAL-multi. Given four species (e.g. A, B, C and D), a multi-locus gene tree can have multiple copies of each of the species and therefore can suggest multiple (conflicting) quartets on these species. In that case, ASTRAL-one chooses a single random copy for each species label and considers the respective quartet type, whereas ASTRAL-multi considers all gene copies and all the respective quartets. Focusing on these two extensions of ASTRAL, Legried et al. proved that both ASTRAL-one and ASTRAL-multi are statistically consistent under the gene duplication and loss model (GDL) (Legried ). Note that GDL is a part of the broader and well-recognized unified duplication-loss-coalescence (DLCoal) model of gene tree evolution by Rasmussen and Kellis (Rasmussen and Kellis, 2012). DLCoal simultaneously accounts for three crucial types of evolutionary factors that shape gene family evolution: duplications, losses and incomplete lineage sorting. The DLCoal process involves two steps: (i) a birth/death process within the branches of the species tree creates a locus tree (i.e. the GDL process), and (ii) a bounded multispecies coalescence process acting on the locus tree generates the observed gene tree. See Figure 2 for an example.
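The difference between the two extensions can be illustrated with a small sketch. The tree encoding, the example tree and all function names below are hypothetical and chosen only for illustration; they are not ASTRAL's data structures or API. The sketch extracts the unrooted quartet induced by four chosen gene copies, then either samples one copy per species (in the spirit of ASTRAL-one) or enumerates every combination of copies (in the spirit of ASTRAL-multi).

```python
import itertools
import random
from collections import Counter

# Toy encoding (an assumption of this sketch): a leaf is a (species, copy_id)
# pair of strings; an internal node is a tuple of child subtrees.
EXAMPLE_TREE = (
    (("A", "1"), ("B", "1")),
    ((("A", "2"), ("C", "1")), ("D", "1")),
)

def is_leaf(node):
    return isinstance(node[0], str)

def leaves(node):
    if is_leaf(node):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def quartet(p, q, r, s):
    """Unrooted quartet pq|rs on species names, independent of ordering."""
    return frozenset([frozenset([p, q]), frozenset([r, s])])

def quartet_topology(tree, four):
    """Quartet induced by `tree` on four chosen leaves (one per species)."""
    four = set(four)
    cherry = []  # first subtree found that holds exactly two chosen leaves

    def visit(node):
        if is_leaf(node):
            return {node} & four
        below = set()
        for child in node:
            below |= visit(child)
        if len(below) == 2 and not cherry:
            cherry.append(below)
        return below

    visit(tree)
    side, other = cherry[0], four - cherry[0]
    return frozenset([frozenset(sp for sp, _ in side),
                      frozenset(sp for sp, _ in other)])

def copies_by_species(tree):
    groups = {}
    for leaf in leaves(tree):
        groups.setdefault(leaf[0], []).append(leaf)
    return groups

def quartet_one(tree, species, rng):
    """One random copy per species, then a single quartet."""
    groups = copies_by_species(tree)
    return quartet_topology(tree, [rng.choice(groups[s]) for s in species])

def quartets_multi(tree, species):
    """All combinations of copies, counted by quartet type."""
    groups = copies_by_species(tree)
    counts = Counter()
    for combo in itertools.product(*(groups[s] for s in species)):
        counts[quartet_topology(tree, combo)] += 1
    return counts
```

In the example tree, species A has two copies, so the all-copies count sees two quartets that conflict with each other, while the single-copy variant reports only one of them per draw.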

Fig. 2.

An example of a gene tree G, locus tree L and species tree S. Note that the arrows in the locus tree represent the duplication events, and the cross represents a loss event. Further, the red circles on the gene tree represent the duplication-points. As the bounded coalescent (b-MSC) runs on the locus tree, the coalescence of the new and the original loci is likely to happen above a duplication event; therefore, the duplication-points can appear in the middle of gene tree edges, as shown in the figure

In this work, for the first time, we prove that ASTRAL-one is statistically consistent under the general DLCoal model. First, we derive gene tree probabilities (constrained to quartets) under the bounded multispecies coalescent model and draw core observations from that analysis. Second, we build on an idea from Legried et al. to systematically separate different duplication-loss scenarios. Then, for each such scenario, we prove that a random quartet from the gene tree is more likely to agree with the species tree quartet than with either of the two other quartets. Finally, we extend our result for ASTRAL-one to ASTRAL-multi and demonstrate that ASTRAL-multi is also consistent under DLCoal (Our extension of the consistency result to ASTRAL-multi was developed independently from Hill .). Our results provide a theoretical justification for the findings in Du , which showcased the accuracy of ASTRAL-one in the presence of duplications, losses and incomplete lineage sorting.

2 Preliminaries

We denote a rooted (phylogenetic) tree by $P = (T, \omega)$. Here T is the tree topology, a binary rooted tree with a designated root vertex of degree two, all other internal nodes of degree three and leaves bijectively labeled by the elements of a label set. For convenience, we identify leaves with their labels. Further, tree topologies are planted, implying that an additional root edge is attached to the root vertex. Then, ω specifies the lengths of edges in T in coalescent units [i.e. the number of generations normalized by the effective population size (Allman )]. More formally, $\omega\colon E(T) \to \mathbb{R}_{>0}$, where $E(T)$ denotes the edge set of T. In particular, we assume that all edge lengths are strictly positive. When the phylogenetic tree P is not clear from the context, we will often use the notation $T_P$ and $\omega_P$ to refer to its tree topology and its edge-length function, respectively. An unrooted (phylogenetic) tree topology T is similar to the rooted tree topology, but without a designated root and the root edge. That is, in an unrooted tree T all non-leaf vertices have degree three. We say that an edge e is external if it is incident with a leaf vertex, and otherwise we call e internal. Further, given a set Y of leaves, the tree topology $T|_Y$ is obtained from T by restricting the leaf-set to Y. A restricted phylogenetic tree $P|_Y$ is then obtained by choosing the edge-length function that maintains the same leaf-to-root path lengths as in P (with respect to the leaves in Y). A rooted topology T defines a partial order on its nodes: given two nodes x and y we say $x \le y$ if x is a descendant of y (and $x < y$ if additionally $x \ne y$). We say that two edges in a rooted tree are parallel if neither edge is located on the path from the other edge to the root.

Quartets. A quartet is an unrooted tree topology with exactly four leaves. Assuming that the leaves are a, b, c and d, we denote the quartets in Figure 1(left), (middle) and (right) as $ab|cd$, $ac|bd$ and $ad|bc$, respectively (based on the two cherries separated by the internal edge).
We say that a quartet q is displayed in a phylogenetic tree P, if the unrooted tree topology of P restricted to the leaves in q (i.e. ) is equivalent to q. In this case, we write .

2.1 Unified DLCoal model

We now give an overview of the unified duplication-loss-coalescence (DLCoal) model (Rasmussen and Kellis, 2012).

Species tree. A species tree represents an evolutionary history of species. Leaves of T are labeled by the extant species names.

Locus tree. A locus tree represents a duplication/loss history of a fixed gene. A locus tree is obtained from a species tree by running the duplication/loss process (Legried ; Rasmussen and Kellis, 2012) top-down along the edges of the species tree. More specifically, the duplication/loss process is a birth-death process with a fixed birth (duplication) rate λ and death (loss) rate μ (Arvestad ). The birth-death process starts in the root edge of the species tree; whenever it reaches a speciation point, the process splits into two copies and continues independently in the children edges. See Figure 2 for an example. Note that locus tree leaves are labeled by gene names. A locus tree node is always one of the following two types:

Speciation. Such a node corresponds to a speciation event/node from the species tree.

Duplication. Such a node corresponds to a new locus creation event.

Remark. A duplication event is asymmetric, as one child (the mother duplicate) follows the parent locus, and the other child (the daughter duplicate) corresponds to a novel locus (see Fig. 3 for an example). This will ensure a consistent depiction of duplications for Section 3. Further, we will refer to these points as duplication-points.
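The duplication/loss step along a single species tree edge can be sketched as a standard event-by-event (Gillespie-style) simulation of a linear birth-death process. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, the rates are assumed to be expressed per unit of edge length, and only the number of loci is tracked rather than the full locus tree.

```python
import random

def simulate_locus_count(n_start, length, dup_rate, loss_rate, rng):
    """Number of loci at the bottom of an edge of the given length.

    Starts with `n_start` loci at the top of the edge. Each locus
    independently duplicates at rate `dup_rate` and is lost at rate
    `loss_rate` (hypothetical parameter names for lambda and mu).
    """
    n = n_start
    elapsed = 0.0
    while n > 0:
        total_rate = n * (dup_rate + loss_rate)
        if total_rate == 0:
            break  # no events can happen
        elapsed += rng.expovariate(total_rate)  # waiting time to next event
        if elapsed >= length:
            break  # the next event would fall below the end of the edge
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            n += 1  # duplication: a new locus is created
        else:
            n -= 1  # loss: one locus disappears
    return n
```

At a speciation point, each of the surviving loci would be passed to both child edges, and the same function would then be run independently on each child edge.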

Fig. 3.

An alternative depiction of a locus tree from Figure 2 with red dots representing duplication-points slightly shifted toward a novel locus

Gene tree. A gene tree represents a gene family’s evolutionary history. The gene tree is obtained from a locus tree by running the bounded multispecies coalescent (b-MSC) process bottom-up along the edges of the locus tree (Rasmussen and Kellis, 2012) (see Section 2.3 for a more detailed description of that process). Figure 2 provides an example of that process.

2.2 Multispecies coalescent (MSC) model

In the standard multispecies coalescent model (Rannala and Yang, 2003) gene lineages are followed backwards in time (from the leaves to the root). For simplicity, we assume that there is exactly one gene lineage starting in every extant locus tree leaf. If two or more lineages enter the same locus tree edge, then the coalescence history of these lineages is determined by an exponential distribution. In particular, for any two lineages a, b that entered the same edge, the probability that they coalesce within time x (specified in terms of coalescent units) is as follows:

$$P[a, b \text{ coalesce within } x] = 1 - e^{-x}.$$

More generally, we denote the probability that i lineages coalesce into j lineages within time x ($1 \le j \le i$) by $g_{ij}(x)$. This value can be computed using the following formula (Tavaré, 1984):

$$g_{ij}(x) = \sum_{k=j}^{i} e^{-k(k-1)x/2}\, \frac{(2k-1)(-1)^{k-j}}{j!\,(k-j)!\,(j+k-1)} \prod_{y=0}^{k-1} \frac{(j+y)(i-y)}{i+y}.$$

Further, note that if at any given moment in time multiple lineages co-exist in the same edge, then any pair of these lineages has an equal probability of being the next to coalesce. That is, the process is symmetric.
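The coalescence probabilities described above can be evaluated numerically. The following is a minimal sketch that uses the standard closed form for the number of ancestral lineages usually attributed to Tavaré (1984); the function name `g` and its argument order are choices made for this illustration.

```python
import math

def g(i, j, x):
    """Probability that i lineages coalesce into exactly j lineages
    within time x (in coalescent units), for 1 <= j <= i."""
    if not 1 <= j <= i:
        raise ValueError("need 1 <= j <= i")
    total = 0.0
    for k in range(j, i + 1):
        coeff = ((2 * k - 1) * (-1) ** (k - j)
                 / (math.factorial(j) * math.factorial(k - j) * (j + k - 1)))
        prod = 1.0
        for y in range(k):
            prod *= (j + y) * (i - y) / (i + y)
        total += math.exp(-k * (k - 1) * x / 2) * coeff * prod
    return total
```

For two lineages this reduces to the pairwise case: `g(2, 1, x)` equals 1 - e^{-x}, and for any fixed i the values `g(i, j, x)` over all j sum to one.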

2.3 Bounded MSC (b-MSC) model

The constraints on MSC in the unified DLCoal model appear due to the duplication points. In particular, all lineages originating below a daughter duplicate must coalesce below the respective duplication node. For example, in Figure 3, the gene tree lineages corresponding to leaves a2, c2 and c3 must coalesce below the root node. More formally, assume that a duplication occurred at time-point d. Note that, for convenience, we assume that all leaves are aligned in time and are associated with time-point 0; further, we consider time to increase as we go up the trees away from leaves. Now, let a and b be locus tree leaves that are located below the duplication, which is at time-point d (i.e. a and b belong to the new locus created by the duplication). Then we know that lineages a and b must coalesce prior to time-point d. Therefore, generally, the probability that any two lineages a, b, which entered the same edge below a duplication dup at time d, coalesce within time x is as follows (see Fig. 4 for a respective locus tree example): where is determined by the original, unbounded MSC model.
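One way to read this constraint is as conditioning the unbounded pairwise coalescence on happening before the duplication point. The sketch below follows that reading; it is an assumption made for illustration, and the paper's own expression may be parameterized differently. Here `bound` is a hypothetical name for the time available between the point where the two lineages enter the edge and the duplication point.

```python
import math

def coal_within(x):
    """Unbounded MSC: two lineages in one edge coalesce within time x."""
    return 1.0 - math.exp(-x)

def bounded_coal_within(x, bound):
    """Probability of coalescing within x, conditioned on coalescing
    before the duplication point that lies `bound` time units ahead."""
    if bound <= 0 or not 0 <= x <= bound:
        raise ValueError("need 0 <= x <= bound and bound > 0")
    return coal_within(x) / coal_within(bound)
```

Under this reading the bounded probability is never smaller than the unbounded one, and it reaches one exactly at the duplication point.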

Fig. 4.

An example of a locus tree that illustrates the b-MSC constraints for Section 2.3

3 Quartet probabilities under b-MSC

To obtain our main result we need to compute the probabilities of each quartet appearing in the gene tree based on a fixed locus tree topology. Note that Allman explicitly computed these probabilities for unbounded MSC. In our case we additionally need to account for duplications (locus creation events) that appear along the edges of the locus tree.

Remark. From now on, for convenience, we restrict locus trees to four leaves sampled from different species. That is, choosing (any) four genes a, b, c and d from distinct species A, B, C and D, we consider the tree $L|_{\{a,b,c,d\}}$. Note that considering only four leaves may suppress other duplication nodes along the locus tree edges. Therefore, we need to allow for additional duplication-points along the locus tree edges. Further, if there are multiple duplication-points along a single edge of $L|_{\{a,b,c,d\}}$, it is sufficient to only consider the lowest duplication-point on that edge since it indicates the lowest point, below which gene lineages must coalesce.

Without loss of generality assume that the locus tree L displays the quartet $ab|cd$. Then there are two cases: either (i) L is a balanced rooted tree or (ii) L is a caterpillar tree. We now explore both of these cases. Throughout this section, we sometimes use the abbreviations ‘coal.’ for ‘coalesce(d)’ and ‘dup.’ for ‘duplication’. Further, we abbreviate ‘obtained in time t’ as simply ‘in t’.

3.1 L is balanced

For convenience, we set x and y to be the lengths of edges X and Y, respectively (see Fig. 5A). We now explore all possibilities of duplication placements on edges of L.

Fig. 5.

(A) The balanced quartet representing the locus tree and displaying quartet $ab|cd$. The dotted circles indicate potential duplication locations that can affect gene tree probabilities. (B–D) Specific duplication scenarios corresponding to Sections 3.1.1, 3.1.2 and 3.1.3, respectively

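3.1.1 No duplications (unbounded MSC)

In this case, quartet probabilities are given by Allman . That is, with x and y denoting the lengths of edges X and Y,

$$P[ab|cd] = 1 - \tfrac{2}{3} e^{-(x+y)}, \qquad P[ac|bd] = P[ad|bc] = \tfrac{1}{3} e^{-(x+y)}.$$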
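For the duplication-free balanced case, the quartet probabilities can be written down directly from the well-known unrooted quartet result under the unbounded MSC, where only the total internal length x + y matters. The function name and the string keys below are hypothetical and used only for this sketch.

```python
import math

def balanced_quartet_probs(x, y):
    """Gene tree quartet probabilities for a balanced locus tree
    ((a,b),(c,d)) without duplications; x and y are the lengths of the
    two internal edges X and Y in coalescent units."""
    minor = math.exp(-(x + y)) / 3.0  # each of the two discordant quartets
    return {"ab|cd": 1.0 - 2.0 * minor, "ac|bd": minor, "ad|bc": minor}
```

The displayed quartet is strictly the most probable one whenever x + y is positive, and the three probabilities become equal only in the limit of zero internal length.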

3.1.2 Duplications along the X or Y edges

Assume that a duplication has occurred along the X and/or Y edge (see Fig. 5C). Recall that a duplication-point indicates that gene lineages below it in the locus tree must coalesce prior to the duplication (when looking backwards in time). Therefore, if there is a duplication along the X edge, then lineages corresponding to genes a and b must coalesce on that edge. That is, the gene tree must display quartet $ab|cd$. Similarly, the same is true if a duplication is located on the Y edge. Hence, $P[ab|cd] = 1$ and $P[ac|bd] = P[ad|bc] = 0$.

3.1.3 Root edge duplications

Assume that a duplication occurred on the root edge as shown in Figure 5(D), and no duplications appear on X and Y edges. Then the following holds.

3.1.4 Duplication at the root vertex

In Sections 4 and 5 we mainly consider cases when the locus tree root corresponds to a locus creation event (i.e. there is a duplication-point on one of the X or Y edges right below the root). In that case the gene tree quartet probabilities are given by the following lemma.

Lemma 3.1. Let L be a balanced locus tree displaying a quartet q with the root of L corresponding to a duplication. Then $P[q] = 1$.

Proof. Since the root of L is a duplication, we place a duplication-point immediately below the root on one of the children edges (i.e. the edge that corresponds to a novel locus). Therefore, quartet probabilities for L are described in Section 3.1.2. That is, $P[q] = 1$. □

Remark. Note that potential duplications along the external edges do not affect the coalescence process.

3.2 L is a caterpillar

As above, we set x and y to be the lengths of edges X and Y, respectively (see Fig. 6A). We now similarly explore all possible duplication placements on the edges of L.

Fig. 6.

(A) The caterpillar quartet representing the locus tree and displaying quartet $ab|cd$. (B–D) Specific duplication scenarios corresponding to Sections 3.2.2, 3.2.3 and 3.2.4, respectively

3.2.1 No duplications (unbounded MSC)

In this case, the quartet probabilities are given by Allman . In particular, with x denoting the length of edge X,

$$P[ab|cd] = 1 - \tfrac{2}{3} e^{-x}, \qquad P[ac|bd] = P[ad|bc] = \tfrac{1}{3} e^{-x}.$$

3.2.2 X edge duplication

Assume that there is a duplication on the X edge (and potentially more duplications on other internal edges) as shown in Figure 6B. Then, similarly to the balanced case, lineages a and b must coalesce on the X edge, and it is not difficult to see that $P[ab|cd] = 1$.

3.2.3 Y edge duplication

Assume that there is a duplication on Y and there are no duplications on X as shown in Figure 6C.

3.2.4 Root edge duplication

Assume that a duplication occurred on the root edge and no duplications occurred along the X and Y edges (see Fig. 6D). We start with computing the probability of the quartet. Further, by symmetry . Therefore, equals

3.3 Core observations

It is not difficult to see from the above derivations that for a fixed locus tree topology that displays $ab|cd$ (balanced or caterpillar), if one increases the length of edge X then the probability of $ab|cd$ grows. More formally, see the following lemma.

Lemma 3.2. Let L with and as shown in if L otherwise.

Fig. 7.

Locus trees L1 and L2 with equal lengths of the Y edges and different lengths of the X edges. The dashed lines highlight that the duplication-points are located identically on the two trees relative to their roots

Further, from the above derivations we observe the following lemma.

Lemma 3.3. For a locus tree L that displays $ab|cd$ (regardless of duplication locations) we have .

The proofs of Lemmas 3.2 and 3.3 are given in Supplementary Section S1 of Supplementary Material.

4 Consistency of ASTRAL-one

We now prove that ASTRAL-one is statistically consistent under the DLCoal model.

Theorem 4.1. Let $S = (T, \omega)$ be a fixed species tree and let $\mathcal{G}$ be a collection of gene trees that independently evolved within S according to the DLCoal process. Then, as the number of trees in $\mathcal{G}$ goes to infinity, the probability that the unrooted tree estimate by ASTRAL-one is equal to the unrooted tree topology T approaches 1.

For this result, it is sufficient (see Legried ) to prove the following:

Theorem 4.2. Let S be a species tree with four leaves that displays quartet $ab|cd$, and let G be a gene tree that evolved in S according to the DLCoal process. If one picks genes a, b, c and d (that correspond to species A, B, C and D, respectively) uniformly at random (assuming they exist) from G, then the gene tree restricted to these genes is more likely to display $ab|cd$ than either of the two other quartets.

Theorem 4.2 is sufficient to prove Theorem 4.1, because ASTRAL, as a distance-minimization method, ‘prefers’ the most dominant quartets among the input trees. Then, by Theorem 4.2, as the number of input trees goes to infinity, the most dominant quartet among input trees for each 4-tuple of species becomes (almost surely) the true species tree quartet; hence, it is almost surely picked by ASTRAL-one (see Legried for a formal proof). Therefore, the remainder of the section is dedicated to the proof of Theorem 4.2. We first prove the theorem for S being balanced and then for S being a caterpillar. Remark. To prove Theorem 4.2, we will use some of the results from

4.1 S is balanced

Similarly to Legried , we first implicitly condition our probability space on the event that at least one of each a, b, c and d genes must be present in G. Further, we condition our probability space on a fixed number of locus tree lineages existing at the speciation point at the root of S. That is, consider the duplication/loss (birth/death) process occurring within the root branch of S. Then, let RL be the random variable denoting the number of locus lineages at the speciation point (see Fig. 8). We are going to prove that for any fixed value of . Therefore, for convenience, we do not explicitly write the condition RL = l in probability equations throughout the rest of the proof. Further, we refer to the set of these l locus lineages as root lineages.

Fig. 8.

An example of the partial embedding of a locus tree into balanced S. The blue lineages correspond to the locus tree. Note that the five locus lineages crossing the dashed speciation line are root lineages

Now let $i_a$ be the index of the root lineage from which gene a has descended. Similarly, we define $i_b$, $i_c$ and $i_d$. For better readability of the remainder of the proof, we introduce a shorthand notation to describe scenarios of the type $i_a = i_b = i_c \ne i_d$. In particular, we write (abc, d) for that scenario, we write (ab, cd) to denote the scenario $i_a = i_b \ne i_c = i_d$, and we write (a, b, c, d) to denote the scenario where all four indices are distinct. Then, by the law of total probability, we have where I is one of the above scenarios (i.e. a partition of the set {a, b, c, d} or a combination of such partitions). In particular . Observe that we cover all possible scenarios/partitions here. Our goal is to prove that . Note that follows from the fact that swapping c and d leaf labels does not affect the probabilities. Let us carry out the proof by considering different values of I. That is, our strategy is to prove that for all of the above I, and at least in one case the strict inequality holds. To facilitate the proofs in each case, first consider the following observations:

Observation 4.1. Random variables $i_x$ and $i_y$ are independent for any $x \in \{a, b\}$ and $y \in \{c, d\}$. However, $i_a$ and $i_b$ (and similarly $i_c$ and $i_d$) can be dependent.

Proof. Observe that the duplication/loss process runs independently in the parallel branches of the species tree. Therefore, once we condition the probability space on a fixed number of lineages at the divergence point (i.e. fixed l), the random variables $i_a$ and $i_c$ become independent. In particular, consider any specific realization of the duplication/loss process below the root lineages and let i be the root lineage that a randomly picked locus a belongs to (i.e. $i_a = i$). Then, we can swap the ‘left’ subtrees below two distinct root lineages i and j (the subtrees that lead to species A and B) so that $i_a = j$ and the probability of that event is not altered due to symmetry. Note that $i_c$ in that case remains the same.
Since we can always reshuffle root lineages like that, we can think of a as ‘choosing’ one of the l root lineages uniformly at random, regardless of the realization of $i_c$. The same is also true for all other pairs with one gene from {a, b} and the other from {c, d}. However, since a and b develop (at least partially) in the same species tree branch, random variables $i_a$ and $i_b$ can be dependent. Similarly for $i_c$ and $i_d$. □

Observation 4.2. Due to the symmetry of the duplication/loss process, we have for any and . Then, by Claim 4.1, for any and . (Due to Lemma 1 in Legried ). and are greater than or equal to .

Case I = (a,b,c,d)

By the symmetry of the duplication/loss process, reshuffling the and i labels will not change the probability of a fixed duplication/loss history in the root edge. Therefore, we have . Hence, .

Case

We need to show that Observe the following. . Proof. Consider the locus trees and for the (ab, cd) and (ac, bd) cases respectively (see Fig. 9). Note that we only consider the part of the locus tree restricted to the four selected genes . It is not difficult to see that both and are balanced. Therefore, by Lemma 3.1, . □

Fig. 9.

Left: the embedding of a locus tree . Right: the embedding of a locus tree

Corollary 4.1. . Proof. Our proof is similar to the proof of Lemma 1 in Legried . In particular, let $N_i$ be the number of locus lineages that descended from root lineage i and that existed immediately after the speciation into species A and B. Similarly, we define variables $M_i$ denoting the number of lineages that existed immediately after the speciation at the parent of C and D. See Figure 10 for an example of N variables. By $\mathbf{N}$ and $\mathbf{M}$ we denote the vectors of N and M variables, respectively.

Fig. 10.

An example of a partial locus tree embedding in the left part of the species tree below the root speciation. The two shown root lineages expand (through duplication) into and lineages at the moment of A/B speciation, respectively

Observe that and . Further, note that when conditioned on specific values of N and M, i = i and i = i events become independent. That is, similarly to Claim 4.1, conditioning on the number of lineages at the divergence point for species A and B eliminates the dependency between $i_a$ and $i_b$ (and similarly for $i_c$ and $i_d$). After conditioning on N and M, random variables $i_a$, $i_b$, $i_c$ and $i_d$ are all independent. In particular, we can think of lineages a and b as choosing one of the lineages independently and uniformly at random. Similarly, c and d choose one of the lineages independently and uniformly at random. Then, for fixed values of the N and M vectors we have The last equality is due to . That is, as mentioned above, due to the symmetry of the duplication/loss process a has a uniform probability of being ‘sampled’ from any of the lineages existing at the divergence point of species A and B. Similar relations can then be easily derived for $i_b$, $i_c$ and $i_d$. Further, following the same idea, we have Then, by Cauchy-Schwarz, and therefore for any realization of vectors N and M. That is, . □ Using the above results, we have For convenience, from now on we denote the event by and the event by . We prove that Consider the following results. . Proof. Note that fixing the number of root lineages allows us to treat the duplication/loss processes independently for the root edge and for the lower edges. Let be a duplication/loss scenario (i.e. a fixed realization of the duplication/loss process) in the root edge conditioned on RL = l. Then, without loss of generality assume that in case (ab, c, d), we have , i = 2 and i = 3; in case (cd, a, b) we assume , i = 2 and i = 3. Similarly, under (ac, b, d) we assume and under (bd, a, c) we assume that .
Then, a fixed scenario forces the same ‘top’ structure of the locus trees in all four cases. Given that (ab, c, d) and (cd, a, b) cases are virtually identical for the remainder of the proof (since they are symmetric), for simplicity, we will only consider the (ab, c, d) case. Similarly, under the event, we will only consider case (ac, b, d). Then, Figures 11 and 12 depict two possible topologies of the scenario when acting on the root lineages 1, 2 and 3. Observe that the third topology, where root lineages 1 and 3 form a cherry, is identical in terms of analysis to the topology depicted in Figure 11, and therefore is not considered.

Fig. 11.

Caterpillar locus trees (left) and (right) embedded into the species tree. The red circles represent the potential duplication locations that could influence the gene tree probabilities. Note that the scenarios in the root edges are identical. That is, lengths are equal, and the duplication locations above the dashed speciation lines are identical

Fig. 12.

Balanced locus trees (left) and (right) embedded into the species tree

Note that in Figure 11, the resulting locus trees and are both caterpillars, while in Figure 12, the locus trees are both balanced. This separation is achieved because we condition on a fixed scenario. We now consider these two cases individually.

(i) and are caterpillars (see Fig. 11). Let x be the distance (in coalescent units) from the root speciation event to the divergence of a and b in the locus tree under the (ab, c, d) case (as shown in the figure). Note that . There are two cases to consider.
• There is a duplication along x. Then, as shown in Section 3.2.2, . That is, .
• No duplications along x. Since and are both caterpillars, we denote their edges by X and Y as shown in Figure 6A. In particular we denote the X edge in by and the X edge in by . Then, , whereas (note that is as depicted in Fig. 11). Further, the two locus trees are identical in terms of the duplication locations in their internal edges. Then, by Lemma 3.2, it is not difficult to see that for any fixed . Therefore, the lemma holds.

(ii) and are balanced (see Fig. 12). By Lemma 3.1, and . Note that we can apply Lemma 3.1, since the roots of the locus trees in these cases must be duplications. □

. Proof. This result follows from Lemma 4.4 (i.e. ) and the following relations: □ Observation 4.3. By Lemma 3.3, we have . Then, combining this with Lemma 4.5, we have . . Proof. We give the proof in Supplementary Section S2 of Supplementary Material. □ Summarizing the above results we have. Note that the first inequality is due to Lemmas 4.4 and 4.5.
The last inequality is due to Lemma 4.6 and Claim 4.3. That is, our main statement holds. In all five cases locus tree L displays the quartet . Therefore, by Lemma 3.3 . Observe that we obtain the strict inequality in this case. In this case it is not difficult to see that L displays quartet . Therefore (as can be seen from the derivations in Section 3), . This concludes the proof for balanced S.

4.2 S is a caterpillar

Without loss of generality assume that S is as it appears in Figure 13. Similarly to the balanced case, we implicitly condition the probability space on a fixed number of loci (lineages) existing at the moment of speciation as shown in the figure. Note that, while in the balanced case we considered root lineages, in the caterpillar scenario we consider lineages at the least common ancestor of A, B and C. That is, we refer to these lineages/loci as ABC-lineages. Finally, as in the balanced case, we denote the number of ABC-lineages by l.

Fig. 13.

An example of the locus tree embedding into a caterpillar species tree. The three locus lineages crossing the dashed speciation line are the ABC-lineages

We then use the notation in the same way as in the previous section (while referring to indices of ABC-lineages). Further, scenarios describe relations between $i_a$, $i_b$ and $i_c$. We now prove that for all I in . Moreover, for at least one such I, the strict inequality holds; in particular, see case 4.2.4 below. By the symmetry of the duplication/loss process, reshuffling the labels will not affect the probability of a fixed duplication/loss history in the root edge. Therefore, we have . Then, . The proof in this case is similar to case 4.1.3 for balanced S. In particular, observe the following. . Proof. It is sufficient to show that . By Claim 4.2, . Further, Legried et al. showed that (see Lemma 1 in Legried ). □ Lemma 4.8. The following holds. ; ; . Proof. The proofs of these statements are similar to the proofs of the respective statements in Section 4.1.3. In particular, (i) corresponds to Lemma 4.4, (ii) corresponds to Lemma 4.5 and (iii) corresponds to Claim 4.3 from Section 4.1.3. □ Then, similarly to Section 4.1.3 we have In this case , since the locus tree displays the third quartet, . The locus tree displays quartet ; therefore, by Lemma 3.3 and the law of total probability, we have .

5 Consistency of ASTRAL-multi

We now extend our consistency result for ASTRAL-one to another variant of ASTRAL adapted to multi-locus input trees, called ASTRAL-multi.

Theorem 5.1. Let S be a fixed species tree and let be a collection of gene trees that independently evolved within S according to the DLCoal process. Then, as the number of trees in goes to infinity, the unrooted tree estimate by ASTRAL-multi converges almost surely to T.

Let S be a species tree with 4 leaves that displays , and let G be a gene tree that evolved in S according to the DLCoal process. Let (respectively and ) be the number of (respectively and ) quartets in G. Then, to prove Theorem 5.1 it is sufficient to show that the following result holds (Legried et al.):

Theorem 5.2. .

The remainder of the section is dedicated to the proof of Theorem 5.2. In fact, due to symmetry, it is sufficient to show that . The general structure of the proof is similar to the proof of consistency for ASTRAL-one in the previous section. We present the proof for balanced S, and then briefly discuss the proof for caterpillar S.

Remark. Some results in this section hold almost surely. Since this is sufficient for the proof of the theorem, we do not specify this explicitly.
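The quartet counts that Theorem 5.2 compares can be made concrete with a small sketch. It assumes a multi-copy gene tree given as an unrooted edge list, with every leaf mapped to one of the species a, b, c, d, and it counts the topology induced by each choice of one copy per species. The function names, the input format and the example tree are illustrative assumptions; this is not the ASTRAL-multi implementation.

```python
from collections import deque
from itertools import product


def bfs_distances(adj, source):
    """Edge-count distance from `source` to every node of the tree."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def quartet_topology(dist, a, b, c, d):
    """Classify the quartet induced on four leaves via the four-point condition."""
    sums = {
        "ab|cd": dist[a][b] + dist[c][d],
        "ac|bd": dist[a][c] + dist[b][d],
        "ad|bc": dist[a][d] + dist[b][c],
    }
    best = min(sums.values())
    winners = [name for name, total in sums.items() if total == best]
    # In a tree the two larger sums coincide; a three-way tie means no resolved split.
    return winners[0] if len(winners) == 1 else "unresolved"


def count_quartets(edges, species_of):
    """Count induced quartet topologies over all choices of one copy per species."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {leaf: bfs_distances(adj, leaf) for leaf in species_of}
    copies = {sp: [leaf for leaf, s in species_of.items() if s == sp] for sp in "abcd"}
    counts = {"ab|cd": 0, "ac|bd": 0, "ad|bc": 0, "unresolved": 0}
    for a, b, c, d in product(copies["a"], copies["b"], copies["c"], copies["d"]):
        counts[quartet_topology(dist, a, b, c, d)] += 1
    return counts


# Hypothetical gene tree with two copies of species a: cherries (a1, b1) and (a2, c1).
edges = [("a1", "x"), ("b1", "x"), ("x", "y"), ("y", "d1"),
         ("y", "z"), ("z", "a2"), ("z", "c1")]
species_of = {"a1": "a", "a2": "a", "b1": "b", "c1": "c", "d1": "d"}
counts = count_quartets(edges, species_of)
```

In this example the copy a1 yields an ab|cd quartet and the copy a2 yields an ac|bd quartet, so a single gene tree contributes to more than one count. The theorem concerns the expected values of such counts, not any one tree.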

5.1 Proof of Theorem 5.2

As mentioned above, we assume that S is balanced. As before, we implicitly condition the probability space (and the expected values) on a fixed number of root lineages l. That is, we claim that Theorem 5.1 holds for any fixed value of l.

We now introduce our core notation for the proof. Similarly to the notation, we let (respectively and ) denote the number of (respectively and ) quartets in the locus tree L. Further, for a fixed scenario I (e.g. scenario ) let be the number of quartets in the locus tree that follow the scenario I. Further, let be the number of quartets in G that ‘appeared’ from one of the quartets. Similarly we define and .

Consider any . Note that the root of the locus tree must be a duplication for such I (because I involves at least two root lineages). Then, if I always defines balanced quartets, we have for any by Lemma 3.1. In particular, we note the following:

Observation 5.1. For any we have

Further, we will only consider scenarios that uniquely determine the quartet types in the locus tree; therefore, we will typically omit the subscript in the notation. For example, we write instead of , since is the only type of quartets that can appear under scenario (ab, cd).

Given a fixed root lineage i, let be the random variable denoting the number of a leaves generated by that lineage. Similarly, we define random variables and . By symmetry, (with similar relations holding for ). Then, observe the following:

Observation 5.2. Since the duplication/loss process runs independently in the parallel branches of the species tree, is independent from for any and .

Observation 5.3. By the symmetry of the duplication/loss process, we have for all and .

Further, the following lemma is due to Legried et al.

Lemma 5.1 (Lemma 2 in Legried et al.).

We now outline several key corollary statements.

Corollary 5.1.

Proof. To prove the first relationship, note that the duplication/loss process occurs independently below distinct root lineages. Then, we have The other two relationships can be established similarly. □

We now consider the following comprehensive set of scenarios: . For each I we will prove that and for at least one I the strict inequality holds.

By the symmetry of the duplication/loss process in the root edge, we have By Claim 5.1, and . Then, combining this with Corollary 5.1, we have

Consider a fixed duplication-loss scenario, , in the root edge of S. In this section we implicitly condition the probability space on . That is, we prove that for each . Due to symmetry, we consider the following two core scenarios: and . It is then sufficient to show the following:

Lemma 5.2.

Proof. Due to Lemma 4.4, it is not difficult to see that for any quartet on that evolved according to scenario AB or AC, we have . Therefore, we have Similarly, due to Lemma 4.5, we know that . Therefore, Further, note that and do not depend on the choice of the lineages, but only depend on the scenario (see Figs 11 and 12 (right)). Hence, Combining all of the above relations we have We can now conclude the proof by noting that (by Corollary 5.1) and (by Lemma 3.3). □

This case is symmetric to . Therefore, the proof is similar.

All quartets in the locus tree under each of these scenarios are . Then, by Lemma 3.3, for each of the quartets. Therefore,

All quartets in the locus tree under each of these scenarios are . It is then not difficult to see that .
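The per-scenario quartet counts used in this proof can be tied to the per-lineage leaf counts by a short sketch. Two assumptions are made here that are not confirmed by the text: a scenario is read as the partition of {a, b, c, d} induced by which root lineage each chosen leaf descends from, and A_i, B_i, C_i, D_i (hypothetical symbols) denote the numbers of a, b, c and d leaves below root lineage i. Under that reading, the number of locus-tree quartets following scenario (ab, cd) is the sum of A_i B_i C_j D_j over ordered pairs i ≠ j, which the code checks by brute force.

```python
from collections import Counter
from itertools import product


def scenario_label(lineage_of):
    """Partition of {a, b, c, d} by root lineage, e.g. 'ab, cd' or 'a, b, c, d'."""
    groups = {}
    for species in "abcd":
        groups.setdefault(lineage_of[species], []).append(species)
    return ", ".join(sorted("".join(block) for block in groups.values()))


def count_by_scenario(copies):
    """Enumerate every choice of one leaf per species and tally its scenario.

    `copies[s]` lists, for each leaf of species s, the index of the root
    lineage that the leaf descends from.
    """
    counts = Counter()
    for ia, ib, ic, id_ in product(copies["a"], copies["b"], copies["c"], copies["d"]):
        counts[scenario_label({"a": ia, "b": ib, "c": ic, "d": id_})] += 1
    return counts


def ab_cd_by_formula(copies):
    """Closed form for scenario (ab, cd): sum over i != j of A_i * B_i * C_j * D_j."""
    n = {sp: Counter(copies[sp]) for sp in "abcd"}  # n['a'][i] plays the role of A_i
    lineages = set().union(*copies.values())
    return sum(
        n["a"][i] * n["b"][i] * n["c"][j] * n["d"][j]
        for i in lineages
        for j in lineages
        if i != j
    )


# Hypothetical example with two root lineages (0 and 1).
copies = {"a": [0, 0, 1], "b": [0, 1], "c": [1, 1], "d": [0, 1]}
by_scenario = count_by_scenario(copies)
```

Writing the count as a sum of products is what lets independence below distinct root lineages split the expected value into a product of expectations, one factor per lineage.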

5.2 Caterpillar species tree

We now briefly discuss the proof strategy for Theorem 5.2 when S is a caterpillar. Similarly to Section 4.2, we condition the duplication/loss process on a fixed number of ABC-lineages (l); see Figure 13. Adopting notation similar to that of Section 5.1, let denote the random variables for the number of a, b and c genes, respectively, below the ith ABC-lineage (in the locus tree). Further, let denote the total number of d leaves. It is then not difficult to show that is independent from for any . Further, and are independent from for any (analogously to Claim 5.2). Claim 5.3 also holds when we restrict to . Finally, Lemma 5.1 is applicable in the caterpillar case as well; i.e. . We now need to consider the following scenarios: and prove that for all such I. It is then not difficult to do so, since is analogous to Case 5.1.1 from Section 5.1, is analogous to Case 5.1.3, is analogous to Case 5.1.6, and is analogous to Case 5.1.5. Further, under the inequality is strict, similarly to Case 5.1.5. That is, Theorem 5.2 holds.

6 Conclusion

For the first time, we investigated and established statistical properties of a popular species tree inference method under the powerful duplication-loss-coalescence model. We proved that two natural versions of ASTRAL (adapted for the duplication-loss shaped gene families) are statistically consistent under DLCoal. Our result reinforces the practical value of ASTRAL and other quartet-based methods in the area of evolutionary inference. In addition to our work, Hill et al. studied the rate of convergence of ASTRAL under DLCoal. In the future, we anticipate that other statistically consistent methods under DLCoal will be discovered, and that these methods will be compared based on their theoretical rates of convergence and on simulation studies, advancing the accuracy of evolutionary inference.

Financial Support: This material is based upon work supported by the National Science Foundation under Grant No. 1617626 and by the Department of Defense, Defense Advanced Research Projects Agency, Preventing Emerging Pathogenic Threats program (HR00112020034 to OE). During the revision and editing, AM was funded by the USDA Agricultural Research Service Research Participation Program of the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and USDA Agricultural Research Service (contract number DE-AC05-06OR23100). Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA, DOE or ORISE. USDA is an equal opportunity provider and employer.

Conflict of Interest: The authors declare that they have no conflict of interest.

14 in total

1. Bayesian gene/species tree reconciliation and orthology analysis using MCMC.

Authors: Lars Arvestad; Ann-Charlotte Berglund; Jens Lagergren; Bengt Sennblad
Journal: Bioinformatics Date: 2003 Impact factor: 6.937

2. Unified modeling of gene duplication, loss, and coalescence using a locus tree.

Authors: Matthew D Rasmussen; Manolis Kellis
Journal: Genome Res Date: 2012-01-23 Impact factor: 9.043

3. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent.

Authors: Elizabeth S Allman; James H Degnan; John A Rhodes
Journal: J Math Biol Date: 2010-07-23 Impact factor: 2.259

4. Properties of consensus methods for inferring species trees from gene trees.

Authors: James H Degnan; Michael DeGiorgio; David Bryant; Noah A Rosenberg
Journal: Syst Biol Date: 2009-06-04 Impact factor: 15.683

5. Estimating species trees from unrooted gene trees.

Authors: Liang Liu; Lili Yu
Journal: Syst Biol Date: 2011-03-28 Impact factor: 15.683

6. Split Probabilities and Species Tree Inference Under the Multispecies Coalescent Model.

Authors: Elizabeth S Allman; James H Degnan; John A Rhodes
Journal: Bull Math Biol Date: 2017-11-10 Impact factor: 1.758

7. Line-of-descent and genealogical processes, and their applications in population genetics models.

Authors: S Tavaré
Journal: Theor Popul Biol Date: 1984-10 Impact factor: 1.570

8. STEM: species tree estimation using maximum likelihood for gene trees under coalescence.

Authors: Laura S Kubatko; Bryan C Carstens; L Lacey Knowles
Journal: Bioinformatics Date: 2009-02-10 Impact factor: 6.937

9. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model.

Authors: Liang Liu; Lili Yu; Scott V Edwards
Journal: BMC Evol Biol Date: 2010-10-11 Impact factor: 3.260

10. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees.

Authors: Chao Zhang; Maryam Rabiee; Erfan Sayyari; Siavash Mirarab
Journal: BMC Bioinformatics Date: 2018-05-08 Impact factor: 3.169
