Literature DB >> 26968785

Does Gene Tree Discordance Explain the Mismatch between Macroevolutionary Models and Empirical Patterns of Tree Shape and Branching Times?

Tanja Stadler¹, James H Degnan², Noah A Rosenberg³.

Abstract

Classic null models for speciation and extinction give rise to phylogenies that differ in distribution from empirical phylogenies. In particular, empirical phylogenies are less balanced and have branching times closer to the root compared to phylogenies predicted by common null models. This difference might be due to null models of the speciation and extinction process being too simplistic, or due to the empirical datasets not being representative of random phylogenies. A third possibility arises because phylogenetic reconstruction methods often infer gene trees rather than species trees, producing an incongruity between models that predict species tree patterns and empirical analyses that consider gene trees. We investigate the extent to which the difference between gene trees and species trees under a combined birth-death and multispecies coalescent model can explain the difference in empirical trees and birth-death species trees. We simulate gene trees embedded in simulated species trees and investigate their difference with respect to tree balance and branching times. We observe that the gene trees are less balanced and typically have branching times closer to the root than the species trees. Empirical trees from TreeBase are also less balanced than our simulated species trees, and model gene trees can explain an imbalance increase of up to 8% compared to species trees. However, we see a much larger imbalance increase in empirical trees, about 100%, meaning that additional features must also be causing imbalance in empirical trees. This simulation study highlights the necessity of revisiting the assumptions made in phylogenetic analyses, as these assumptions, such as equating the gene tree with the species tree, might lead to a biased conclusion.

Entities: Disease Species

Keywords: Birth–death process; genealogy; multispecies coalescent; phylogeny

Mesh：

Year: 2016 PMID： 26968785 PMCID： PMC4911941 DOI： 10.1093/sysbio/syw019

Source DB: PubMed Journal: Syst Biol ISSN： 1063-5157 Impact factor: 15.683

Which macroevolutionary processes give rise to empirical phylogenies? This question has puzzled biologists for almost as long as empirical phylogenies have been inferred. It can be argued that neither the discrete tree shapes nor the numerical branching times of empirical trees are explained well by current null models of macroevolution (Blum and François 2006; Etienne and Rosindell 2012). For the discrete tree shape, approaches to testing macroevolutionary null models typically rely on tree-balance statistics, measuring the extent to which sizes of sister clades differ at internal nodes of phylogenies (Sackin 1972; Colless 1982; Mooers and Heard 1997; Aldous 2001; Felsenstein 2004). In balanced trees, sister clades have similar numbers of taxa, whereas in unbalanced trees, their numbers of taxa differ substantially. Tests of a macroevolutionary model compare theoretical- or simulation-based predictions of the model about tree balance to observations from empirical trees (Heard 1996; Agapow and Purvis 2002; Heard and Mooers 2002; Blum and François 2006; Bortolussi et al. 2006). Tests of predictions about branching times proceed similarly, examining representations of the number of lineages through time (Harvey et al. 1994) and evaluating the extent to which lineages accumulate nearer the present rather than early in the phylogeny. Perhaps the simplest model describing the shapes of phylogenies is the constant-rate birth–death model, in which speciations are represented by birth events and extinctions by death events (Kendall 1948, 1949; Nee et al. 1994). Under this model, each species at each point in time has the same rate for speciation and the same rate for extinction. When examining theoretical phylogenies under the model and empirical phylogenies constructed primarily from molecular data, studies typically observe that empirical phylogenies are much less balanced than is predicted by the constant-rate birth–death model (Aldous and Pemantle 1996; Blum and François 2006; Hagen et al. 2015). As all the so-called species-speciation-exchangeable models (Stadler 2013)—including the Yule pure-birth model, diversity-dependent models, and environment-dependent models—predict the same tree shape distribution as the constant-rate birth–death process, a large class of models predicts phylogenies to be more balanced than those that have been reported. Furthermore, branching times in empirical phylogenies are generally closer to the root of the tree than is predicted by the constant-rate birth–death model (Etienne and Rosindell 2012). The mismatch of a simple null model such as the constant-rate birth–death process with empirical phylogenies built from molecular sequences has been described with two classes of explanations: the null model might be a poor description of the macroevolutionary process (Aldous and Pemantle 1996; Heard 1996; Heard and Mooers 2002), or alternatively, it might be a reasonable model that fails because it is applied to nonrepresentative sets of empirical phylogenies that possess various forms of bias, including selection bias and taxon-sampling bias (Mooers and Heard 1997; Heath et al. 2008). We investigate yet a third possibility: the model and data are both reasonable, but the species evolution process that the models describe is not the same as the gene lineage evolution process that the molecular sequences represent. When testing macroevolutionary hypotheses on empirical phylogenies, are the models and data commensurable? In typical models for macroevolution, species trees are considered, representing the branching order of species. Frequently, however, empirical species trees are inferred from one or a small number of concatenated sequence alignments, and the inferred gene tree—the tree of genetic lineages at a particular region of the genome—is implicitly treated as an estimated species tree. Gene trees can be highly discordant with their underlying species tree (Degnan (2013, Table 2), even when gene trees are estimated with high accuracy. Therefore, it is not clear that models of species evolution correctly describe properties of accurately inferred gene trees. Here, using a hierarchical model, we investigate the difference in tree balance and branching times between gene trees and species trees. In our model, the process of species evolution—speciation and extinction—employs the simple birth–death process. Gene trees for a particular species tree, however, are described by the multispecies coalescent model of gene lineage evolution conditional on the species tree (Rannala and Yang 2003; Degnan and Rosenberg 2009). The hierarchical model enables us to investigate the extent to which tree balance differs in gene trees—the data source of empirical phylogenies—and species trees, the source of predictions about the data. Under our model, we find that gene trees typically have greater imbalance compared to species trees. We investigate if the imbalance in empirical phylogenies—which exceeds that of birth–death species trees—can be explained with the hierarchical model under the assumption that empirical phylogenies are, in fact, gene trees. The multispecies coalescent null model assumes that gene lineages merge within the species tree branches according to a coalescent process (Degnan and Rosenberg 2009). Typical analyses of gene trees under the multispecies coalescent treat a fixed species tree as a parameter (Degnan and Salter 2005; Degnan et al. 2012; Wu 2012); here, we permit the species tree to vary as in empirical macroevolutionary studies, examining the distribution of gene trees given a birth–death distribution of species trees. We perform a simulation study over a range of parameter combinations. The discrete tree shape, the discrete temporal ordering of the branching events, and the continuous branching times uniquely describe a phylogenetic tree. We study the gene tree and species tree distributions under the nested model, focusing on tree shape and branching times. As these quantities are high-dimensional objects, we calculate summary statistics. For tree shape, we examine the well-known Colless statistic (Colless 1982); we also consider the Sackin statistic (Sackin 1972) and a statistic recording the number of cherries in a tree (McKenzie and Steel 2000). These statistics measure the imbalance of tree shapes, the Colless and Sackin statistics increasing with increasing imbalance, and the cherry statistic decreasing with increasing imbalance. For the branching times, we consider the statistic (Pybus and Harvey 2000), measuring the temporal locations of branching events. Increasing corresponds to moving branching times in a tree closer to the tips. A constant-rate pure-birth tree has an expected of 0, and increases with an increasing amount of extinction. Under the hierarchical model, our simulation poses three questions. (i) How different are the shapes of gene trees compared to species trees? (ii) How different are the branching times of gene trees and species trees? (iii) How different are the model gene trees from empirical gene trees? We first formally define the species tree and gene tree models. We then discuss our simulation results and compare the simulated gene trees to a database of empirical phylogenies.

The Hierarchical Model

The Birth–Death Model of Speciation and Extinction

The constant-rate birth–death model of speciation and extinction begins at time in the past with a single species. Each species has a birth rate and a death rate with . The values of and apply to all species. At the present, extant species lineages are independently sampled, each with probability , , for inclusion in the final species phylogeny. We assume an improper uniform- distribution on and condition on the final phylogeny having sampled tips. In other words, the resulting simulated tree set is analogous to the following procedure: we draw a time from the uniform- distribution; we simulate for time starting with a single species; we keep the tree if we obtain extant sampled present-day species; we repeat the procedure until we obtain the required number of trees. However, we employ mathematical theory to make the simulations efficient (Aldous and Popovic 2005; Gernhard 2008a). Our simulations vary three parameters: the speciation rate , “turnover” , and sampling probability . To facilitate interpretations, we note that different parameter values for , , and can give rise to exactly the same species tree distribution. When decreasing the sampling probability while increasing the speciation rate and turnover , we can obtain the same distribution of phylogenetic trees (Stadler 2009). We recall the parameter transformations that generate identical phylogenetic tree distributions. Consider arbitrary and with , and let . Choose a sampling probability , with . The increased values of and producing the same distribution as are (Stadler 2009) Note that by equation (1), increases with decreasing . For turnover, noting that , it follows that , so that . Furthermore, equation (2) reveals that turnover increases with decreasing . In the case of , decreasing increases the speciation rate , whereas turnover is fixed at 1. Beginning from choices for with , , and , the parameter values of a partially sampled speciation–extinction process give rise to the same phylogenetic tree distribution as a process with complete sampling if and only if ; if , then no birth–death process producing the identical phylogenetic tree distribution with complete sampling exists (the second requirement on , viz. , is satisfied for all permissible , following from ).

The Coalescent Model for Gene Lineages

Within a species lineage, we assume that gene lineages coalesce backward in time according to Kingman’s coalescent (Kingman 1982a, 1982b). Under Kingman’s coalescent, the waiting time in calendar units for two gene lineages to find their common ancestor is exponentially distributed with rate , where is the haploid effective size of the population along the species lineage and is the length of a generation in calendar units (Hudson 1990; Drummond et al. 2005). Following the assumptions of the multispecies coalescent, gene lineages that do not coalesce along a species tree branch persist into ancestral species branches, where they also have the opportunity to coalesce with other gene lineages entering the ancestral species from other descendant species (Degnan and Rosenberg 2009).

Simulation Design

We simulated species phylogenies under a constant-rate birth–death model with speciation rate , extinction rate , and sampling probability for each extant species. We simulated 100,000 species trees on tips for each parameter combination , for . Next, conditional on species trees, we simulated one gene tree per species tree, assuming a sample of one gene lineage per extant species. We assumed a constant effective population size and a constant generation time for a species, with for each species (meaning and may differ across species, but with a constant product). One coalescent time unit—the expected time to coalescence of two lineages—is generations, or calendar time units. A speciation rate of events per coalescent time unit means that in expectation, a species splits into two species after coalescent time units, or equivalently, after calendar time units (in our setting, ). We compared the distributions of tree shape and branching times of the gene trees to those of the species trees. We summarized the gene tree and species tree distributions using three summary statistics of tree shape, applied separately to both gene trees and species trees: the Colless index , the Sackin index , and the number of cherries . We denote the gene tree statistics by and , and the species tree statistics by and . For these statistics, we report ratios, , , and , where the numerator represents the mean value of the statistic computed across gene trees and the denominator is the corresponding mean across species trees. The higher the ratios and , and the lower the ratio , the more imbalanced the gene trees are in relation to the species trees. Because these statistics are correlated, we present only the Colless statistic in the main text and provide the other two statistics in the supplement. The statistic we report is equivalent to the average across simulations of , where is the difference in the Colless statistic for species tree–gene tree pairs. The value of depends on both the birth–death parameters and the sample size, so that dividing by helps to standardize it. For branching times, we summarized the gene tree and species tree distributions using the branching-time statistic . As is already normalized for tree size and in fact has expectation 0 for a range of species tree models, we reported the average of the difference between values computed on gene trees and species trees. We denote the average difference by . The smaller the difference , the closer the branching times of the gene trees are to the root compared to the corresponding branching times of the species tree. The simulations and analyses were performed in R unless otherwise indicated. The code was added to the R package TreeSim v2.2 (Stadler 2011).

Simulation Results: Tree Shape

Figure 1 and Supplementary Figures 1 and 2 present the ratios , , and , respectively, of the summary statistics for simulated gene trees and species trees. We briefly summarize the results for shapes of gene trees compared to species trees.

Mean statistic of gene trees minus mean statistic of species trees (). Solid lines correspond to complete species sampling , dashed lines to sampling probability , and dot-dashed lines to sampling probability . Plots are obtained based on 100,000 simulated species tree–gene tree pairs at each choice of parameter values, taking means separately for the gene trees and the species trees.

Mean Colless statistic of gene trees divided by mean Colless statistic of species trees (). Solid lines correspond to complete species sampling , dashed lines to sampling probability , and dot-dashed lines to sampling probability . Plots are obtained based on 100,000 simulated species tree–gene tree pairs at each choice of parameter values, taking means separately for the gene trees and the species trees. Mean statistic of gene trees minus mean statistic of species trees (). Solid lines correspond to complete species sampling , dashed lines to sampling probability , and dot-dashed lines to sampling probability . Plots are obtained based on 100,000 simulated species tree–gene tree pairs at each choice of parameter values, taking means separately for the gene trees and the species trees. Both for very small and very large , the gene trees and species trees have approximately the same average tree shape. For intermediate , however, in the biologically plausible range, gene trees evolving on species trees have a different shape distribution from the species trees themselves. For high turnover , the imbalance was greatest in our simulations for , representing an average of five speciation events in each -generation unit of coalescent time. For low turnover, the maximal imbalance was observed for , two speciation events per generations. The effect was larger for trees with many taxa, producing an increase of 8% for the Colless statistic (Fig. 1) and 1.8% for Sackin (Supplementary Fig. 1), and a 1.3% decrease for the cherry statistic (Supplementary Fig. 2). Thus, we might expect to overestimate the tree imbalance from empirical data when using gene trees instead of species trees. We next discuss differences in gene tree and species tree properties in detail, as a function of speciation rate , turnover , sampling probability , and the number of species used in the simulations. First, we examine the limits of very small and very large , and we then consider the roles of the parameters in the biologically relevant intermediate cases.

Extreme Values of

The extreme cases of and illustrate the limiting behavior of the statistics. We use to represent , and for . .—For small , speciation is rare, and therefore, species tree branches are very long. Consequently, sufficient time exists for each gene lineage coalescence to occur on the most recent species tree branch for which the coalescence is possible. Each gene tree then has the same shape as the species tree on which it has evolved. Thus, the ratios of the mean Colless, Sackin, and cherry statistics for simulated gene trees and for the underlying simulated species trees all approach . .—For large , speciation is frequent, and species tree branches are infinitesimally short. Thus, all gene lineage coalescences occur prior to the root of the species tree. Gene tree shapes then follow the shapes of gene trees under the Kingman coalescent. It can be shown that Kingman’s coalescent and constant-rate birth–death trees produce the same distribution of tree shapes (Aldous and Pemantle 1996). Thus, as in the case, but for a different reason, the ratios of the mean Colless, Sackin, and cherry statistics for gene trees and species trees equal .

Intermediate

For intermediate values of , we observed in our simulations that gene trees were less balanced than species trees, as the Colless and Sackin ratios exceeded 1, and the cherry ratio was less than 1 (Fig. 1 and Supplementary Fig. 1 and 2). Further, these ratios move farther from 1 for larger trees.

Small

Varying , fixed turnover , and complete sampling .—In our simulations, the difference between gene trees and species trees in tree balance increases with for these parameter values. As species tree branches become shorter with increasing , gene coalescences might not happen on the first allowed branch, and therefore, they might not follow the same pattern as speciation events. Fixed , varying turnover , and complete sampling .—Here, the difference between gene trees and species trees is larger for small turnover compared to large turnover. For , species tree branches are relatively long, so that most, though not all, gene coalescences happen on the first branch allowed. Trees with small have younger root ages and, therefore, shorter branches compared to trees with large (Figures 3 and 4 of Stadler (2008)). Thus, the probability that gene coalescences do not happen on the first species tree branch—so that they might not follow the same pattern as speciation events—increases with decreasing turnover.

Distributions of and and the joint distribution of and . All plots are for the birth process only with no extinction and are based on 10,000 independent gene tree–species tree pairs simulated in Hybrid-Lambda (Zhu et al. 2015). Gray lines in the scatterplots represent the line ; above the line, based on the Colless statistic, the gene tree has less balance than the species tree.

The Colless statistic for empirical trees from TreeBASE. Each black dot represents a tree. We normalized each empirical Colless value by dividing it by the expected species tree Colless value. The expected species tree Colless value is independent of speciation rate , turnover , and species sampling . The red line represents the mean of the normalized Colless statistic for each fixed tree size.

Fixed , fixed turnover , and varying sampling probability .— Sparser sampling, as represented by smaller , decreases the difference in balance between gene trees and species trees. Recall that a process with sampling probability , speciation rate , and extinction rate is equivalent to a process with complete sampling and a smaller speciation rate and smaller turnover , provided . The smaller speciation rate produces longer species-tree branch lengths compared to a process with parameters , , and , and thus decreases tree shape differences between gene trees and species trees. On the other hand, the smaller turnover produces shorter trees compared to a process with parameters , , and , and thus increases the difference of gene trees and species trees. We observe from the figures that the effect of a smaller speciation rate—meaning longer branches and thus less difference between gene trees and species trees—dominates, so that for fixed and , decreasing the sampling fraction increases the agreement between gene tree and species tree shape. Note that for a turnover , we have . Thus, arbitrary and produces the same tree balance ratio as and . This property can be verified in our figures by comparing and , and , and , or and .

Large

Varying , fixed turnover , and complete sampling .—The difference in balance between gene trees and species trees decreases with increasing , particularly for the larger values (). As increases, species tree branches become so short that most coalescences happen prior to the species tree root. Such coalescences occur according to the Kingman coalescent, inducing the same tree shapes as the constant-rate birth–death process. Thus, as increases, the gene tree shape distribution approaches the same distribution as that of species trees. Fixed , varying turnover , and complete sampling .—We observe a larger difference between gene trees and species trees for high turnover compared to low turnover; for , this result begins at . Species trees become very short for large , and for fixed , low-turnover species trees are shorter than high-turnover trees. Thus, more gene coalescences happen above the root for low-turnover trees, so that the approach of the distribution of gene tree shapes to the same distribution seen for species trees is faster at low turnover. Fixed , fixed turnover , and varying sampling probability .—For these values, incomplete sampling minimally changes tree balance: the effects of incomplete sampling, amounting to a process with complete sampling and both a decrease in that produces a greater difference between gene trees and species trees as well as a decrease in turnover that produces a smaller difference, cancel. Fixed , fixed turnover , and varying sampling probability .—In this case, incomplete sampling can be seen as a process with the same turnover and complete sampling, but a decreased speciation rate—meaning that species trees are longer for smaller . Consequently, decreasing increases the difference between gene trees and species trees. Note again that statistics in the different plots are the same for different pairs with the same value of .

Simulation Results: Branching Times

Figure 2 presents the difference of the statistics for simulated gene trees and species trees. Briefly, gene trees tend to have a smaller statistic than species trees for low to medium values of , and a larger for large , depending on the turnover. As was observed for tree shapes, all effects increased in magnitude with the number of taxa . A value of means that a speciation occurs on average after calendar time units, which seems very high. Because is smaller for gene trees than for species trees for realistic values of , we expect to underestimate from empirical data when using gene trees instead of species trees. We discuss below differences in branching times between gene trees and species trees in detail, as a function of , , and (Fig. 2). .—For small and hence long species tree branch lengths compared to the coalescent rate for gene lineages, each gene tree coalescence occurs immediately prior to its associated speciation event. Thus, the branching times are nearly identical for gene trees and species trees, and is close to . .—For large and hence short branch lengths compared to the coalescent rate, gene tree coalescences happen prior to the root of the species tree, so that the gene trees are Kingman-coalescent trees. Kingman-coalescent branch lengths are in expectation, up to a scaling constant, equal to constant-rate birth–death branch lengths with (Gernhard 2008b). Thus, is in expectation equal to for constant-rate birth–death trees with . The value of depends on and , so that for large , the behavior of depends on the other parameters. , varying turnover , and complete sampling .—For these parameter values, decreases as decreases (Pybus and Harvey 2000). Thus, is increasingly positive with decreasing turnover. As for turnover , the constant-rate birth–death trees equal in expectation Kingman-coalescent trees up to a scaling constant; thus, we obtain ≈0 (Fig. 2, ). , fixed turnover , varying sampling probability .—In this case, species trees can be interpreted to arise from a process with complete sampling and decreased turnover. Thus, decreases for decreasing sampling probability , so that increases. , fixed turnover , varying sampling probability .—At , incomplete sampling does not change relative branch lengths (Stadler 2008), Figure 3d). Incomplete sampling can be interpreted as a process with decreased speciation rate , turnover , and complete sampling . Thus, with , is the same for all sampling probabilities. Varying , fixed turnover , and complete sampling .—As increases, first becomes more negative, then switches () and becomes more positive. Fixed , varying turnover , and complete sampling .—We observe a decrease of with increasing turnover, meaning gene trees have branching events closer to the root compared to species trees for increasing turnover. Note that and small is an exception; these trees are very long, and gene trees are almost equal to species trees. Because changes from negative to positive for increasing , for small , gene trees and species trees are most similar in for small turnover, whereas for large , they are most similar for large turnover. By contrast, recall that for shape statistics, for increasing turnover, a switch occurred from decreasing to increasing values for in . exceeded 1 for all . Thus, for small , gene trees and species trees were most similar in shape for large turnover, whereas for large , they were most similar for small turnover. Fixed , fixed turnover , and varying sampling probability .—The value increases with decreasing sampling, meaning gene trees had branching events closer to the tips compared to species trees. Recall that a process with decreased sampling is equivalent to a complete sampling process and decreased birth rate and turnover. A decrease in leads to an increase in for small and a decrease for large (see paragraph “Varying in this section). A decrease in turnover leads to an increase in (see paragraph “Fixed in this section). The effect of turnover dominates. Fixed , fixed turnover , and varying sampling probability .—Incomplete sampling increases for small and decreases for large . The reason is that for , a process with decreased sampling is equivalent to a complete sampling process with decreased birth rate and turnover . Recall that a decrease in increases for small and decreases it for large .

Simulation Results: Comparing Gene Trees to Their Species Trees

We have reported average gene tree balance compared to average species-tree balance (Fig. 1). This approach does not give an indication of the joint distribution of shape statistics for gene trees and species trees and, therefore, of the extent to which the shape can differ for a gene tree and its underlying species tree. To illustrate this joint variability, we simulate distributions of and and the joint distribution of and for , , , and , with , for taxa. As discussed above, for small , gene tree balance closely accords with species tree balance (), as species tree branches are very long and the gene tree and species tree are hence highly correlated (Fig. 3). For increasing , the correlation decreases as the species tree branches become shorter, and in the limit, gene tree balance is independent of species tree balance. Because of this independence, the gene trees give rise to the same shape distribution as the species trees, and thus for large , again —but now, with a low correlation coefficient between and . Distributions of and and the joint distribution of and . All plots are for the birth process only with no extinction and are based on 10,000 independent gene tree–species tree pairs simulated in Hybrid-Lambda (Zhu et al. 2015). Gray lines in the scatterplots represent the line ; above the line, based on the Colless statistic, the gene tree has less balance than the species tree. For , 49.8% of gene trees have a higher Colless statistic than the underlying species trees and 48.1% have a lower value, the remaining cases having identical values for the gene tree and species tree (Fig. 3). For , 62% of gene trees have a higher Colless statistic than the underlying species tree. The percentage drops to 53% for and is again nearly 50% for . For and , the average value of is 1.12, somewhat larger than the corresponding value for and (Fig. 1).

Empirical Trees

To determine whether the difference in tree balance between gene trees and species trees under the model can explain the excess imbalance in empirical trees, we reanalyzed a set of empirical phylogenies from TreeBASE (Sanderson et al. 1994; Hagen et al. 2015). This set of phylogenies included 2759 fully resolved trees, 156 of which possessed calendar-time branch-length information. We hypothesize that many of these phylogenies are not species trees, but are either gene trees or trees that result from concatenation of genes. Recall that the species tree Colless value for each tree size is independent of speciation rate, turnover, and sampling probability, as all constant-rate birth–death processes induce the same distribution on tree shapes (Aldous and Pemantle 1996). We calculated the average Colless statistics for all empirical phylogenies for all sizes up to , and we report for each tree. This ratio is on average about 2 (Fig. 4), so that empirical phylogenies have about twice the Colless value as constant-rate birth–death species trees. Although our simulations detected the correct direction for the deviation from the baseline value of 1, they also revealed that the multispecies coalescent with the constant-rate birth–death model can only explain an increase of the Colless statistic in gene trees compared to species trees by a factor of 1.08. The Colless statistic for empirical trees from TreeBASE. Each black dot represents a tree. We normalized each empirical Colless value by dividing it by the expected species tree Colless value. The expected species tree Colless value is independent of speciation rate , turnover , and species sampling . The red line represents the mean of the normalized Colless statistic for each fixed tree size. For the empirical phylogenies that reported branch lengths scaled in calendar time, although relatively few data points were available, we further calculated the statistic for completeness, plotting the empirical values together with simulated mean values for different and (Supplementary Fig. 3). We did not plot , as such a calculation would yield an excessive 15 points (five turnover values, three sampling probabilities) for each empirical data point. Because relatively few trees with branch length information are available for each value of the number of species, it is not feasible to take an expectation of empirical for each tree size, as we did for the simulations and empirical Colless statistic. The relationship between the empirical and simulated trends in is, therefore, difficult to discern.

SUMMARY

Using simulations, we have quantified the difference in tree shape and branching times between gene trees and species trees under a simple hierarchical model, incorporating a constant-rate birth–death process for species trees, and a multispecies coalescent for gene trees conditional on species trees. The results suggest that although in limiting cases of very low and very high speciation rate, gene trees and species trees have the same distribution of shapes, for a variety of intermediate parameter values, gene trees are in expectation less balanced than the species trees. Branching times in gene trees and species trees differ except in the limiting case of very low speciation rate. Depending on the question of interest, either of two effect sizes could be reported for the balance ratio for gene trees and species trees: obtained from the average of the ratios, we which denote (Fig. 3), or obtained from (Fig. 1). If we compare a species tree to its embedded gene tree, the effect size based on is appropriate; for our data application, however, we compared a set of empirical trees to a set of model species trees. Thus, we do not consider pairs, but averages of two distributions, which calls for the latter effect size, . If gene trees and species trees follow the same shape distribution, then the ratio of the expected shape statistics is equal to 1; however, the mean value of the ratio, , does not generally equal 1 under the null hypothesis that and have the same distribution. In particular, for two random variables and , both the expectations and can exceed 1. Thus, we suggest that is less appropriate than as a measure of the difference in shape distributions. The observed difference between gene trees and species trees highlights a problem in tests of species tree models that make use of empirical phylogenies, demonstrating that empirical phylogenies obtained by taking gene trees as estimates of species trees follow a different tree shape distribution than that predicted for species trees themselves. It is thus problematic to equate an inferred gene tree to the species tree when testing for the most appropriate species tree model. Gene trees are expected to be less balanced compared to the underlying species tree, with branching events closer to the root for most biologically relevant parameter regions that do not involve implausibly large speciation rates. It is noteworthy that our comparison of model gene trees to model species trees yields qualitatively similar patterns to the comparison of empirical trees to model species trees: empirical phylogenies are less balanced than predicted by birth–death models (Blum and François 2006), and they have branching events closer to the root compared to birth–death trees (Etienne and Rosindell 2012). Under the model, the differences in tree shape and branching times between gene trees and species trees depend on a speciation rate , a turnover rate , and a sampling rate . In particular, the relative timing of branching events in gene trees compared to species trees depends mainly on the speciation rate : gene tree branching events are closer to the root than in species trees for small , and closer to the tips for large . This result reflects the fact that for higher speciation rates, species tree branches are short, and thus, coalescences occur in more ancestral populations, making gene trees more like Kingman-coalescent trees. We emphasize that our model is a neutral model: speciation rates, extinction rates, and coalescent rates are assumed to be the same through time and across lineages. However, relaxing this assumption to allow for rate heterogeneity will not eliminate incomplete lineage sorting and thus, as in the constant-rate case, we expect that gene trees will continue to differ in balance from species trees. Are our parameter settings in the range of empirically observed parameter values? We can use the great ape tree to examine if our model parameters are sensible in light of empirical observations. Recent estimates of the branch lengths in the great ape tree, for which there is considerable evidence of incomplete lineage sorting (Ebersberger et al. 2007; Burgess and Yang 2008; Hobolth et al. 2011), lie between 0.7 and 3.7 coalescent time units (Schrago 2014). Consider a birth–death model for a species tree. The pure-birth model has the property that the mean branch length in the species tree is coalescent units (Stadler and Steel 2012), meaning induces a mean branch length of 1. Thus, with , setting to —a value among those on which our analysis has focused—places branch lengths within the range observed in the great ape tree. For , a mean branch length of 1 suggests higher ; for , the expected pendant branch length under a birth–death process is (Mooers et al. (2012), so that the expected pendant branch length is 1 at . Bokma et al. (2012) estimated the mean for the hominoid primate tree to be per myr (95% confidence interval 0.12–1.37). Assuming and years—approximate values from Schrago (2014) for the ancestor of humans and chimpanzees—produces speciations per coalescent unit (95% confidence interval 0.072–0.822). Turnover was estimated close to 1, as the mean was myr (95% confidence interval 0.01–1.44). These similarities of empirical trees to a model with and on the order of 0.1–1 indicate that our approach of centering parameter choices around such values is reasonable. Obtaining unbiased empirical species trees requires using appropriate methods for inferring species trees. Recent developments in estimation methods permit joint inference of species trees and gene trees, or inference of species trees from multiple gene trees (Degnan and Rosenberg 2009; Edwards 2009; Liu et al. 2015; Szöllősi et al. 2015; Ogilvie et al. 2016). Species trees estimated by such methods take into account the hierarchical production of gene trees from species trees, and they do not rely on an implicit or explicit identification of species trees with gene trees. Thus, the shapes of species trees obtained by these methods would be expected to follow a distribution appropriate to species trees. In our empirical analysis, however, the set of previously published empirical phylogenies that we used to determine the difference between empirical and model species trees dates as far back as 1994—prior to the widespread use of phylogenetic tools that distinguish between gene trees and species trees. The hypothesis that many of the empirical trees are in fact gene trees rather than species trees explains some of the excess imbalance observed in empirical tree shape distributions; however, because our inflation of the Colless statistic is only for gene trees compared to species trees and the empirical inflation of the statistic is , other factors are required for explaining the imbalance in empirical trees. Because our number of time-calibrated empirical trees is low, our temporal computations have been less exhaustive compared to those we performed for tree shape; unlike for shape, at present, the empirical values—of which there are fewer—are explained reasonably well by species tree values. We comment on two of the many factors that could influence the difference between empirical trees and gene trees and species trees under our model. First, we assumed in our analyses that the gene trees and species trees are known without error. It is possible that reconstruction biases in tree estimation (Mooers and Heard 1997; Holton et al. 2014) could contribute to a difference between empirical and theoretical distributions for trees. Second, even when species tree inference is informed by gene tree discordance, species tree inference methods might generate shape biases. For example, the minimize deep coalescence criterion (Maddison and Knowles 2006; Than and Nakhleh 2009) is expected to produce highly balanced tree estimates (Than and Rosenberg 2014) and indeed its empirical estimates are more balanced than those obtained by other methods from the same data (DeGiorgio et al. 2014). We hope that this article stimulates analytic and simulation-based investigations of more complex nested species tree–gene tree models, thereby linking extensive traditions modeling species trees (Nee et al. 1994; Stadler 2013) and modeling gene trees conditional on fixed species trees (Degnan and Rosenberg 2009). Only if we understand the predictions produced by plausible null models—and the relationships between those models and the assumptions underlying empirical trees—can we produce a proper account of the macroevolutionary phenomena that give rise to species tree patterns.

Supplementary Material

Supplementary material can be found in the Dryad data repository at http://dx.doi.org/10.5061/dryad.8f97r.

39 in total

1. Distributions of cherries for two models of trees.

Authors: A McKenzie; M Steel
Journal: Math Biosci Date: 2000-03 Impact factor: 2.144

2. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci.

Authors: Bruce Rannala; Ziheng Yang
Journal: Genetics Date: 2003-08 Impact factor: 4.562

3. The probability distribution of ranked gene trees on a species tree.

Authors: James H Degnan; Noah A Rosenberg; Tanja Stadler
Journal: Math Biosci Date: 2011-10-31 Impact factor: 2.144

4. apTreeshape: statistical analysis of phylogenetic tree shape.

Authors: Nicolas Bortolussi; Eric Durand; Michael Blum; Olivier François
Journal: Bioinformatics Date: 2005-12-01 Impact factor: 6.937

5. New analytic results for speciation times in neutral models.

Authors: Tanja Gernhard
Journal: Bull Math Biol Date: 2008-01-03 Impact factor: 1.758

Review 6. Gene tree discordance, phylogenetic inference and the multispecies coalescent.

Authors: James H Degnan; Noah A Rosenberg
Journal: Trends Ecol Evol Date: 2009-03-21 Impact factor: 17.712

7. Estimation of hominoid ancestral population sizes under bayesian coalescent models incorporating mutation rate variation and sequencing errors.

Authors: Ralph Burgess; Ziheng Yang
Journal: Mol Biol Evol Date: 2008-07-04 Impact factor: 16.240

8. Unexpectedly many extinct hominins.

Authors: Folmer Bokma; Valentijn van den Brink; Tanja Stadler
Journal: Evolution Date: 2012-05-09 Impact factor: 3.694

9. Prolonging the past counteracts the pull of the present: protracted speciation can explain observed slowdowns in diversification.

Authors: Rampal S Etienne; James Rosindell
Journal: Syst Biol Date: 2011-08-26 Impact factor: 15.683

10. Species tree inference by minimizing deep coalescences.

Authors: Cuong Than; Luay Nakhleh
Journal: PLoS Comput Biol Date: 2009-09-11 Impact factor: 4.475

4 in total

1. Probabilities of Unranked and Ranked Anomaly Zones under Birth-Death Models.

Authors: Anastasiia Kim; Noah A Rosenberg; James H Degnan
Journal: Mol Biol Evol Date: 2020-05-01 Impact factor: 16.240

2. Comprehensive Phylogenetic Analysis of Bovine Non-aureus Staphylococci Species Based on Whole-Genome Sequencing.

Authors: Sohail Naushad; Herman W Barkema; Christopher Luby; Larissa A Z Condas; Diego B Nobrega; Domonique A Carson; Jeroen De Buck
Journal: Front Microbiol Date: 2016-12-20 Impact factor: 5.640

3. An analytical upper bound on the number of loci required for all splits of a species tree to appear in a set of gene trees.

Authors: Lawrence H Uricchio; Tandy Warnow; Noah A Rosenberg
Journal: BMC Bioinformatics Date: 2016-11-11 Impact factor: 3.169

4. Taxonomic Uncertainty and the Anomaly Zone: Phylogenomics Disentangle a Rapid Radiation to Resolve Contentious Species (Gila robusta Complex) in the Colorado River.

Authors: Tyler K Chafin; Marlis R Douglas; Max R Bangs; Bradley T Martin; Steven M Mussmann; Michael E Douglas
Journal: Genome Biol Evol Date: 2021-09-01 Impact factor: 3.416

4 in total