Literature DB >> 29893954

Theoretical Foundation of the RelTime Method for Estimating Divergence Times from Variable Evolutionary Rates.

Koichiro Tamura1,2, Qiqing Tao3,4, Sudhir Kumar3,4,5.   

Abstract

RelTime estimates divergence times by relaxing the assumption of a strict molecular clock in a phylogeny. It shows excellent performance in estimating divergence times for both simulated and empirical molecular sequence data sets in which evolutionary rates varied extensively throughout the tree. RelTime is computationally efficient and scales well with increasing size of data sets. Until now, however, RelTime has not had a formal mathematical foundation. Here, we show that the basis of the RelTime approach is a relative rate framework (RRF) that combines comparisons of evolutionary rates in sister lineages with the principle of minimum rate change between evolutionary lineages and their respective descendants. We present analytical solutions for estimating relative lineage rates and divergence times under RRF. We also discuss the relationship of RRF with other approaches, including the Bayesian framework. We conclude that RelTime will be useful for phylogenies with branch lengths derived not only from molecular data, but also morphological and biochemical traits.

Entities:  

Mesh:

Year:  2018        PMID: 29893954      PMCID: PMC5995221          DOI: 10.1093/molbev/msy044

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

The inference of divergence times is accomplished by assuming a constant rate throughout the tree (a strict molecular clock) or by relaxing the strict molecular clock (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). Bayesian approaches are widely applied to relaxed clock dating analyses. They require specification of a probability distribution of evolutionary rates in the tree (e.g., lognormal distribution), as well as an assumption of the presence or absence of rate autocorrelation among branches. In contrast, the RelTime approach does not require specification of such priors, and produces relative node ages that can then be transformed into absolute dates by using calibration constraints for one or more nodes (Tamura et al. 2012, 2013). RelTime performs well for estimating divergence times in analyses of many large empirical data sets (Mello et al. 2017) and simulated data sets (Tamura et al. 2012; Filipski et al. 2014). RelTime’s computational speed (fig. 1) and accuracy have led to its use for estimating divergence time for large data sets (Tamura et al. 2012; Mahler et al. 2013; Bond et al. 2014; Bonaldo et al. 2016). However, a mathematical foundation for the RelTime method has not yet been provided, which is needed not only to understand basic properties and assumptions of RelTime, but also to reveal its relationship with other molecular dating methods (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). In the following, we present the theoretical foundation of the RelTime method. We also assess the absolute performance of RelTime and compare it with other methods by analyzing data generated using computer simulations in which sequences were evolved according to three different branch rate models: independent (Drummond et al. 2006), autocorrelated (Kishino et al. 2001), and hybrid (Beaulieu et al. 2015).
. 1.

(a) Computational time taken by RelTime and MCMCTree (Bayesian method) to estimate divergence times for data sets containing increasing number of sequences (n). The tested sequence alignment consisted of 4,493 sites in which sequence evolution was simulated with extensive rate variation (RR50 data from Tamura et al. 2012). RelTime’s speed advantage increases with data volume by O(n2). (b) Distribution of computation speed ratio of RelTime to MCMCTree for 70 data sets, each containing 100 ingroup sequences that were simulated as described in the Materials and Methods.

(a) Computational time taken by RelTime and MCMCTree (Bayesian method) to estimate divergence times for data sets containing increasing number of sequences (n). The tested sequence alignment consisted of 4,493 sites in which sequence evolution was simulated with extensive rate variation (RR50 data from Tamura et al. 2012). RelTime’s speed advantage increases with data volume by O(n2). (b) Distribution of computation speed ratio of RelTime to MCMCTree for 70 data sets, each containing 100 ingroup sequences that were simulated as described in the Materials and Methods.

Mathematical Theory

Theoretical Analysis for a Phylogeny with Three Taxa and an Outgroup

We begin with the simplest case in which the phylogeny contains a clade with three ingroup taxa (subtree at node 5) and one outgroup taxon (fig. 2). In this tree, branch lengths b1 and b2 represent the amount of evolutionary change that has occurred in lineages emanating from node 4 and leading to taxon 1 and taxon 2, respectively. We assume that taxon 1 and 2 are sampled at the same evolutionary time (t1 = 0 and t2 = 0), which is usually the case in phylogenetic analysis of data sampled from living species. The contemporaneous sampling of data produces sampling times equal to 0, which serve as calibration points (Tamura et al. 2012). By using branch lengths in this phylogeny (b’s), we can estimate the relative evolutionary rates (r’s) for all the lineages as well as relative divergence times (t’s). Here, a lineage refers to a branch and all the taxa (and branches) in the descendant subtree. For example, lineage a in figure 2 contains two taxa (1 and 2) and three branches with lengths b1, b2, and b4.
. 2.

The relative rate framework for RelTime method. (a) A tree containing three ingroup sequences with an outgroup. Branch lengths are bi’s and lineage rates are ri’s. La = b4 + ½(b1 + b2). (b) The case of four ingroup sequences with an outgroup. Here, La = b5 + ½(b1 + b2). Lb = b6 + ½(b3 + b4), when using the arithmetic mean. See figure 3 and its legend for a simple procedure outlining the calculation of relative rates using RRF.

The relative rate framework for RelTime method. (a) A tree containing three ingroup sequences with an outgroup. Branch lengths are bi’s and lineage rates are ri’s. La = b4 + ½(b1 + b2). (b) The case of four ingroup sequences with an outgroup. Here, La = b5 + ½(b1 + b2). Lb = b6 + ½(b3 + b4), when using the arithmetic mean. See figure 3 and its legend for a simple procedure outlining the calculation of relative rates using RRF.
. 3.

Calculating relative lineage rates in a phylogeny. (a) A phylogeny containing eight ingroup sequences (taxa) that consists of three taxa nodes (x and w) and four taxa nodes (y and z). (b) Reduced phylogeny and branch lengths after applying RRF to nodes x and y. (c) Final phylogeny after applying RRF to node z in panel b, which produces node w in a three taxa configuration. After applying RRF to node w, a preorder traversal scales descendant lineage rates by multiplying them by their ancestral lineage rates to generate final relative rates. The ingroup root node (w) has an average relative lineage rate equal to 1. Multisequence taxa are indicated by Roman numerals.

The following system of equations formalizes the RelTime approach mathematically by linking relative rates for lineages (ri) with branch lengths (bi) in figure 2. Here,, where La is depth of node 5 on lineage a that contains taxon 1 and 2; La = b4 + ½(b1 + b2). We set: The setting of equalities in equations (3) and (4) leads to preference for the minimum rate change between the lineage originating at node 5 and the descendant lineage originating at node 4. The selection of lineage rate ra will be constrained by rate ratios imposed by equations (1) and (2), which relaxes the strict molecular clock. Because all the lineage rates are relative, one unknown is reduced by setting the group rate at the most recent common ancestor of the ingroup node to be 1, that is, We solve for each ri by using equations (1–5) and get: The estimates of relative lineage rates and lengths produce an ultrametric tree with relative times for node 4 (t4) and node 5 (t5): The above equations (6–11) constitute the relative rate framework (RRF) for phylogenies with three taxa and one outgroup. These equations yield point estimates for lineage rates and node ages. Because branch lengths have variances, the resulting lineage rate estimates have variances. The variance of relative node ages will be a function of branch length variances and the variance contributed by evolutionary rate differences among branches (see Discussion).

Theoretical Analysis for Phylogenies Containing Four Taxa and an Outgroup

To estimate six evolutionary rates (r1–r4, ra, and rb) using six branch length estimates (b1–b6) for a phylogeny containing four ingroup taxa (fig. 2) we write a set of seven equations: Here, and . Equations (14–16) above lead to a preference for minimum changes in rates between ancestor–descendant lineage pairs. In this system, ri ’s are not required to be equal to one another and the rate assignments are constrained by the rate ratios among lineages (eqs. 12, 13, and 17), which relaxes the strict molecular clock. We solve for each ri from equations (12–18) to estimate relative rates and node ages. The estimates of relative node ages t5, t6, t7 for nodes 5, 6, and 7, respectively, are: The above equations (19–27) establish RRF of the RelTime approach for a tree containing four taxa and one outgroup. As previously mentioned, point estimates of node ages and lineage rates have variances because branch lengths have variances and because evolutionary rates are not equal among lineages (see Discussion).

Relative Rate Framework with Geometric Means

In both the original RelTime approach (Tamura et al. 2012) and the mathematical formulations above, we considered an arithmetic mean when averaging branch lengths to minimize evolutionary rate changes. This approach does not assume an equal rate, but is rather a natural way to calculate node depths by averaging branch lengths. We have now developed analytical formulas for an alternative RRF using the geometric mean that balances the rate changes between two descendant lineages. For example, if b1 = 1 and b2 = 4 in figure 2, then the arithmetic mean will give 2.5. Thus, evolutionary rate r1 is 2.5 times slower and r2 is 1.6 times faster when compared with the average rate. The difference in rate change (between 2.5 and 1.6, in the present case) becomes larger as the difference between b1 and b2 becomes larger when using the arithmetic mean. In contrast, the geometric mean would give 2.0, which results in two times slower rate in b1 and two times faster rate in b2, as compared with the average rate. That is, the difference in rate between the ancestral and descendant lineages is always equal for sister lineages when using the geometric mean, which is not the case if the arithmetic mean is used. Using a geometric means approach, we obtain the following analytical formulas for a phylogeny containing three ingroup taxa (fig. 2): For the 4-taxon case in figure 2, the equations are as follows:

Relative Rate Framework for a General Case

Next, we consider a general case of a phylogeny with more than four ingroup taxa. In this case, RelTime applies RRF with a bottom-up approach, starting from the tips (external branches) of the phylogeny and moving toward the root. In the first step, we generate local relative lineage rates for the subtrees containing three and four taxa (subtree x and y, respectively) in a phylogeny containing eight taxa and an outgroup (fig. 3). Subtree y contains four taxa, so we apply equations (34–39) to generate relative rates. Equations (28–31) will be used to estimate rates for clade x which contains three ingroup taxa. Calculating relative lineage rates in a phylogeny. (a) A phylogeny containing eight ingroup sequences (taxa) that consists of three taxa nodes (x and w) and four taxa nodes (y and z). (b) Reduced phylogeny and branch lengths after applying RRF to nodes x and y. (c) Final phylogeny after applying RRF to node z in panel b, which produces node w in a three taxa configuration. After applying RRF to node w, a preorder traversal scales descendant lineage rates by multiplying them by their ancestral lineage rates to generate final relative rates. The ingroup root node (w) has an average relative lineage rate equal to 1. Multisequence taxa are indicated by Roman numerals. In the next step, we consider the parent clade (z) that has four taxa: composite taxon I consisting of two taxa (1 and 2), taxon II consisting of one taxon (3), composite taxon III consisting of two taxa (4 and 5), and composite taxon IV containing two taxa (6 and 7) (fig. 3). We estimate branch lengths (bI, bII, bIII, and bIV) by using the geometric means: , , , and . Then, we use equations (34–39) to compute relative rates for all the lineages using these branch lengths. In bigger phylogenies, this process is carried out for every internal node in a postorder traversal. In the current example, node w is the common ancestor of all the ingroup taxa and is in a 3-taxon configuration (fig. 3). We now have and and apply equations (28–31). At this stage, we have local (relative) lineage rate estimates for nodes in the ingroup tree. Finally, all the rates in the tree are computed by multiplying descendant lineage rates by their respective ancestral lineage rates in preorder traversal to generate final relative rates such that the ingroup root node (e.g., node w in fig. 3) has an average relative lineage rate equal to 1.

Results

We evaluated the performance of RRF for correctly estimating lineage rates and divergence times by analyzing data generated using computer simulation in which sequences were evolved according to the autocorrelated rate (AR) model (Kishino et al. 2001), the independent rate (IR) model (Drummond et al. 2006), as well as a model that contains multiple distributions of rates (hybrid rates [HR]). We present results from the analysis of a collection of small data sets (three ingroup sequences; AR and IR models) and two collections of large data sets: one containing 100 ingroup sequences (AR and IR models) and another containing 91-ingroup sequences (HR model).

Analysis of Small Data Sets (Phylogeny with Three Ingroup Sequences)

We first tested the accuracy of divergence times estimated via RRF by analyzing data sets containing three ingroup sequences. We used RRF with geometric means and compared the modeled (true) values with estimated lineage rates as well as node ages (fig. 4). The lineage rates produced by RRF were similar to the true rates for data sets that were evolved under the AR model (fig. 4); the relationship showed a linear regression slope of 0.99. RRF estimates for external lineages, were similarly good (blue circles in fig. 4, r2 = 0.68). The relationship was also strong for internal branches, but with greater dispersion (red circles in fig. 4, r2 = 0.29). RRF performance for AR data sets was similar to that observed for IR data sets (fig. 4). Overall, the success in estimating lineage-specific evolutionary rates translates into robust estimates of relative times for AR (fig. 4) and IR data sets (fig. 4).
. 4.

Performance of RRF and MCMCTree Bayesian analyses for three ingroup sequences with an outgroup (topology in fig. 2). RRF lineage rate estimates are compared with the true lineage rate estimates for sequences that evolved under (a) autocorrelated and (b) independent rate models. Blue circles represent external lineages (single taxon, r1, r2, and r3) and red circles represent the internal lineage (ra). RRF estimates of divergence times for (c) autocorrelated rate and (d) independent rate data sets. Bayesian (MCMCTree) estimates of branch rates are compared with the true branch rates for sequences evolved under (e) autocorrelated and (f) independent rate models. Blue circles indicate external branches (r1, r2, and r3) and red circles show the internal branch (r4). Bayesian estimates of divergence times for (g) autocorrelated rate and (h) independent rate data sets. Each panel contains results from 50 simulated data sets. All rates and divergence time estimates are normalized to allow direct comparison between true and estimated values. Regression slope and correlation coefficient (r2) are shown for each panel.

Performance of RRF and MCMCTree Bayesian analyses for three ingroup sequences with an outgroup (topology in fig. 2). RRF lineage rate estimates are compared with the true lineage rate estimates for sequences that evolved under (a) autocorrelated and (b) independent rate models. Blue circles represent external lineages (single taxon, r1, r2, and r3) and red circles represent the internal lineage (ra). RRF estimates of divergence times for (c) autocorrelated rate and (d) independent rate data sets. Bayesian (MCMCTree) estimates of branch rates are compared with the true branch rates for sequences evolved under (e) autocorrelated and (f) independent rate models. Blue circles indicate external branches (r1, r2, and r3) and red circles show the internal branch (r4). Bayesian estimates of divergence times for (g) autocorrelated rate and (h) independent rate data sets. Each panel contains results from 50 simulated data sets. All rates and divergence time estimates are normalized to allow direct comparison between true and estimated values. Regression slope and correlation coefficient (r2) are shown for each panel. Bayesian methods produce branch-specific rates under a given statistical distribution of rates (e.g., lognormal), which differs from RRF where relative lineage rates are considered. Therefore, we compared the true values of branch rates with the branch rates derived using a Bayesian method. We provided correct priors (based on simulation parameters) and conducted the analyses using MCMCTree software (Yang 2007). Bayesian branch rate estimates showed a more diffuse relationship with true rates in internal branches for AR data sets (fig. 4r2 = 0.00) as compared with RRF (fig. 4r2 = 0.29). Fundamental reasons underlying these patterns are presented in the Discussion section. These patterns notwithstanding, Bayesian estimates of times for AR data sets showed a slope of 0.99 with true time estimates (fig. 4), which means that node age estimates are generally robust to difficulties in estimating branch-specific rates. This robustness was also seen for IR data sets, where Bayesian branch rates showed a more diffuse relationship with the true rates (fig. 4), but estimated times showed a slope close to 1 with a high r2 (0.91). Overall, both Bayesian and RRF approaches showed similar or lower accuracy in estimation of rates and node ages for IR data sets as compared with AR data sets, potentially because rate independence requires the estimation of a greater number of free parameters.

Analysis of Large Data Sets (Phylogeny with 100 Ingroup Sequences)

We next analyzed data sets consisting of 100 ingroup sequences, which were evolved over a range of empirical rate variation parameters. As observed for data sets containing only three ingroup sequences, RRF lineage rate estimates were highly correlated with the true rates for AR data sets (fig. 5). Lineage rate correlations were generally lower for IR data sets and these correlations were higher for external (tip) lineages (fig. 5). Importantly, distributions of the slopes of RelTime node age estimates and true times were centered at close to 1 for both AR and IR data sets (fig. 5).
. 5.

Performance of RRF in the analysis of data sets with 100 ingroup sequences and an outgroup. Fraction of data sets for which RRF inferred lineage rates are correlated with true rates at different levels of correlation for data sets simulated with (a) autocorrelated rates and (b) independent rates. Dotted lines represent external branches and solid lines indicate the internal branches. (c) Distribution of the linear regression slopes of RRF estimates and true times for different data sets. The results presented are from analyses of 35 data sets in which sequences evolved with autocorrelated branch rates (blue lines) and another 35 data sets that were evolved with independent branch rates (red lines). In regression analysis, the intercept was set to zero, because the estimated node age is expected to be zero when the true node age is zero. The regression slopes generated with and without this assumption produced similar patterns.

Performance of RRF in the analysis of data sets with 100 ingroup sequences and an outgroup. Fraction of data sets for which RRF inferred lineage rates are correlated with true rates at different levels of correlation for data sets simulated with (a) autocorrelated rates and (b) independent rates. Dotted lines represent external branches and solid lines indicate the internal branches. (c) Distribution of the linear regression slopes of RRF estimates and true times for different data sets. The results presented are from analyses of 35 data sets in which sequences evolved with autocorrelated branch rates (blue lines) and another 35 data sets that were evolved with independent branch rates (red lines). In regression analysis, the intercept was set to zero, because the estimated node age is expected to be zero when the true node age is zero. The regression slopes generated with and without this assumption produced similar patterns.

Analysis of Hybrid Rate Data Sets (Phylogeny Containing 91 Ingroup Sequences)

We also examined the performance of RRF in an analysis of simulated data from Beaulieu et al. (2015), who simulated two lognormal distributions (hybrid rate model) for an angiosperm phylogeny in which herbaceous clades exhibited higher and more variable evolutionary rates than woody clades (fig. 6). They reported that one rate model Bayesian methods produced considerably more ancient date estimates for the divergence of herbaceous and woody clades. This overestimation of divergence time became more severe as the difference between the two rates increased (fig. 6). Application of RRF produced divergence time estimates that were much closer to true times (fig. 6), which shows that RRF can be useful in cases where the rate distribution differs among clades (Smith and Donoghue 2008; Dornburg et al. 2012; Beaulieu et al. 2015) or when clocks are local (Drummond and Suchard 2010; Crisp et al. 2014).
. 6.

(a) Hybrid distribution of rates for branches leading to woody taxa (brown) and herbaceous taxa (green), with the former evolving three times (3×) slower than the latter. (b) Bayesian estimates reported by Beaulieu et al. (2015) when the rate difference between clades was 3-fold (3×, solid line) and 6-fold (6×, dashed line), with the simulated angiosperm age of 140 Ma shown by a red line. RelTime estimates of angiosperm age for Beaulieu et al. (2015)’s alignments with (c) 3× mean rate difference and (d) 6× mean rate difference. Median and SD for age estimates are shown. Beaulieu et al. (2015) simulated 100 replicates (1,000 bases) under a GTR model for each scenario. Bayesian analyses were conducted using a single uncorrelated lognormal rate prior in Beaulieu et al. (2015). The same alignments, topology, and ingroup calibrations were used in RRF analyses.

(a) Hybrid distribution of rates for branches leading to woody taxa (brown) and herbaceous taxa (green), with the former evolving three times (3×) slower than the latter. (b) Bayesian estimates reported by Beaulieu et al. (2015) when the rate difference between clades was 3-fold (3×, solid line) and 6-fold (6×, dashed line), with the simulated angiosperm age of 140 Ma shown by a red line. RelTime estimates of angiosperm age for Beaulieu et al. (2015)’s alignments with (c) 3× mean rate difference and (d) 6× mean rate difference. Median and SD for age estimates are shown. Beaulieu et al. (2015) simulated 100 replicates (1,000 bases) under a GTR model for each scenario. Bayesian analyses were conducted using a single uncorrelated lognormal rate prior in Beaulieu et al. (2015). The same alignments, topology, and ingroup calibrations were used in RRF analyses. Furthermore, Tamura et al. (2012) found that RelTime produced accurate time estimates in simulations with a very large number of sequences even when one clade possessed accelerated evolutionary rates, where penalized likelihood methods did not perform as well. In general, we expect that the limitations of on rate model Bayesian analyses will be overcome by local clock methods (Drummond and Suchard 2010; Höhna et al. 2016; Lartillot et al. 2016), but the computational time required to analyze even modestly sized data sets via these approaches can be prohibitive. So, the current RRF approach, which does not assume a specific model for rate variation, may be suitable for such data in its current implementation or as a foundation for future methodological refinements.

Discussion

We have presented a mathematical foundation for the relative rate framework (RRF) underlying the RelTime method. In the following, we compare RRF with other approaches for estimating divergence times and present the motivation behind RRF.

Lineage Rates versus Branch Rates

In RRF, evolutionary rate heterogeneity in a phylogeny is considered by comparing rates between sister lineages emanating from internal nodes in a phylogeny. A lineage rate is an average of all the lineage rates that belong to that lineage, including all the descendants (e.g., node x in fig. 3). RRF’s estimation of evolutionary lineage rates is fundamentally different from the comparison and modeling of branch rates in other approaches. For example, the penalized likelihood methods consider differences in rates between ancestral and descendant branches (Sanderson 1997, 2003), and Bayesian methods model branch rates to share a probabilistic distribution (e.g., lognormal distribution). The distinction between lineage and branch rates complicates direct comparisons of evolutionary rates produced by using RelTime and Bayesian methods, except for external (tip) branches for which the lineages consist of only one branch. In this case, RelTime and Bayesian estimates of rates show similar trends (fig. 4, blue circles). This trend was also observed in the analysis of 100 sequence data sets, where the correlation of the estimated external branch rates with true branch rates was high for both RelTime (median correlation = 0.76 and 0.69) and Bayesian (median correlation = 0.72 and 0.86) analyses of AR and IR data sets, respectively. For internal branches, computer simulations showed greater similarity between RRF estimates of lineage rates and the true lineage rates (fig. 4) as compared with the similarity observed between Bayesian estimates of branch rates and the true branch rates (fig. 4, respectively). This was also the case in the analysis of 100 sequence data sets: internal branch rates from Bayesian methods were less strongly correlated with the true rates (median correlation = 0.42 and 0.36 for AR and IR data sets, respectively), as compared with those observed for lineage rates from RRF (median correlation = 0.79 and 0.55 for AR and IR data sets, respectively). These trends are due to the fact that the estimate of a branch rate is a function of two time estimates: one for the ancestral node and another for the descendant node. For example, the variance of the rate on branch with length b4 in figure 2 is a function of the variance of two time estimates (t4 and t5), in addition to the variance of b4. In contrast, the variance of a lineage rate (e.g., ra in fig. 2) is a function of the variance of only one time estimate (t5), in addition to the lineage depth (La), because the other time point is zero in contemporary sequence sampling. Thus, branch rates are estimated with greater variance than lineage rates, which results in lower correlations seen for Bayesian approaches.

Underlying Evolutionary Rate Model in RRF

RRF exploits the fact that the estimation of the ratio of lineage rates at any node in a phylogeny is independent of that node’s age. For example, the ratio of evolutionary rates at node 4, r1/r2, does not depend on t4 (Fig. 7a). It is clear that the rate of evolution is higher in the lineage leading from node 4 to taxon 2 than to taxon 1 (r2 > r1), because b2 is longer than b1 in figure 7. In fact, we can estimate r1/r2 (= b1/b2) without knowing anything about the probability distribution of evolutionary rates throughout the tree. Similarly, the other rate ratio in this tree does not depend on knowledge of the distribution of rates among branches or lineages, it is simply [(b1 + b2)/2 + b4]/b3 when using the arithmetic means and when using the geometric means.
. 7.

A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (b’s), which are necessary to estimate node times (t’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (s4) is much (c) faster or (d) slower than that of its ancestral branch (r4).

A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (b’s), which are necessary to estimate node times (t’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (s4) is much (c) faster or (d) slower than that of its ancestral branch (r4). However, the node-by-node specification of relative lineage rates is not sufficient to estimate relative times t4 and t5. For that, we need to know the relationship of subtree rate s4 and branch rate r4, where s4 is the overall evolutionary rate of the subtree originating at node 4 (contains taxon 1 and 2) and r4 is the evolutionary rate on branch b4 (fig. 7). Without assuming a specific distribution of rates, s4/r4 cannot be determined uniquely and t4 can be at any point between 0 and t5. Figure 7 present two extreme possibilities. In one, if the subtree rate (s4) is much higher after the divergence event at node 4 (s4 ≫ r4), then the estimate of t4 will be small and the divergence event recent (fig. 7). Alternately, if the subtree rate is much slower after the divergence event at node 4 (s4 ≪ r4), then t4 will be much more ancient (fig. 7). In its mathematical formulation, RRF considers the best estimate of the rate of evolution of an ancestral lineage to be the average of the rate of evolution of its two descendant lineages (e.g., eqs. 3, 14, 15, and 16), while accommodating the relative rates among lineages. In the current example, this would result in the timetree shown in figure 7. This is the principle of minimum rate change from the ancestor to its immediate descendants, where we do not favor extreme rate assignments, for example, those in figure 7 and . Probabilities of such extreme rate assignments are also low in commonly used branch rate distributions (e.g., lognormal, normal, and exponential distributions), so Bayesian methods also tend to favor the smallest rate change needed to explain the data.

Relationship of RRF with Other Molecular Clock Methods

The treatment of evolutionary rates is conceptually different between Bayesian methods and RRF, because RRF does not assume a specific statistical distribution for modeling (lineage) rate variation at the outset and Bayesian methods model branch rates rather than the lineage rates. The application of the principle of minimum rate change in RRF is different from nonparametric and semiparametric approaches based on the idea of Sanderson (1997), because RRF minimizes lineage rate changes rather than the branch rate changes from an ancestor to its immediate descendants. In addition, RRF does not attempt to estimate a universal penalty for the speed of rate change throughout the tree. RRF is distinct from strict clock approaches because strict clock methods only apply rate averaging (e.g., eqs. 3 and 4) at each node in the tree, but RRF imposes an additional constraint that the ratio of sister lineage rates be the ratio of their lineage lengths (e.g., eqs. 1 and 2). These additional constraints relax the strict clock and allow rates to vary throughout the tree. For this reason, RelTime is also different from some molecular clock methods that assume the ratio of ages between two nodes to be proportional to the ratio of their average node-to-tip distances (Purvis 1995; Britton et al. 2002, 2007). This assumption is tantamount to assuming equality of rates among lineages, and, thus, a strict molecular clock. For this reason, these approaches require pruning of taxa or lineages for which the rate equality does not apply (Takezaki et al. 1995). RRF does not assume equality among any lineage rates at any time, and it does not require the removal of rate heterogeneous lineages.

Statistical Distribution of Relative Lineage Rates

Relative lineage rates produced by RRF will show extensive correlation, because the evolutionary rate of a lineage is a function of evolutionary rates of all its descendant lineages. This would result in both local and global correlation, which is expected to be present in phylogenies with autocorrelated as well as independent branch rates. As expected, the analysis of 100 sequence data sets showed correlation between ancestral and descendant lineage rates when branch rates were autocorrelated (median correlation = 0.88) or independent (median correlation = 0.77). Therefore, RRF is fundamentally different from the autocorrelated (branch) rate model of Kishino et al. (2001) as well as the independent (branch) rate model of Drummond et al. (2006), as they deal with branches rather than lineages, as defined here. For this reason, the lineage rates produced by RRF are not directly comparable with those observed for branch rates produced by Bayesian methods. Even though RRF does not require a statistical distribution of lineage rates at the outset, the resulting estimates of lineage rates may follow a statistical distribution. We examined this relationship in an analysis of 100 ingroup sequence data sets in which branch rates were simulated with lognormal, exponential, or uniform distributions. When branch rates followed a lognormal distribution, the distribution of true lineage rates was also lognormal, as was the distribution of RRF lineage rate estimates (fig. 8). When the branch rates followed an exponential distribution, the RRF and true lineage rates showed a similar distribution (fig. 8). In the case of a uniform distribution of branch rates, the lineage rates showed a normal-like distribution and the RRF rate estimates were lognormally distributed (fig. 8f). All of these results suggest that a flexible lognormal distribution will generally fit the distribution of RRF lineage rates. Importantly, time estimates showed a linear relationship with the true times, with slopes close to 1.0 (fig. 8).
. 8.

Distributions of true and RRF-derived estimates of lineage rates. Branch rates were simulated under autocorrelated lognormal rate models with (a) low and (b) high dispersions, independent lognormal rate models with (c) low and (d) high dispersions, and independent rate models with (e) exponential and (f) uniform distributions. Green lines represent the fitted curves of the true lineage rate distributions and blue bars show the distributions of RRF rates. Rate unit is substitutions per site per million years. (g–i) Relationships between true times and RRF times in six rate scenarios. Black circles and lines represent the average times of 5 replicates simulated under rate scenarios in (a), (c), and (e), and gray triangles and lines represent the average times of 5 replicates simulated under rate scenarios in (b), (d), and (f). All times are normalized to the sum of ingroup divergence times. Regression slopes and correlation coefficients (r2) are shown.

Distributions of true and RRF-derived estimates of lineage rates. Branch rates were simulated under autocorrelated lognormal rate models with (a) low and (b) high dispersions, independent lognormal rate models with (c) low and (d) high dispersions, and independent rate models with (e) exponential and (f) uniform distributions. Green lines represent the fitted curves of the true lineage rate distributions and blue bars show the distributions of RRF rates. Rate unit is substitutions per site per million years. (g–i) Relationships between true times and RRF times in six rate scenarios. Black circles and lines represent the average times of 5 replicates simulated under rate scenarios in (a), (c), and (e), and gray triangles and lines represent the average times of 5 replicates simulated under rate scenarios in (b), (d), and (f). All times are normalized to the sum of ingroup divergence times. Regression slopes and correlation coefficients (r2) are shown.

Point Estimates and Their Variances

RRF yields point estimates for (relative) lineage rates and divergence times based on branch lengths. As is the common practice in classical statistics, estimates of dispersion such as SEs and confidence intervals accompany all estimates. The variance of a lineage rate estimate is a function of branch length variances. It can be obtained analytically by using the equations for lineage rate (e.g., eqs. 28–31 for the case of three taxa with outgroup by the delta method) or simply by using a bootstrap sampling procedure. However, the estimation of confidence intervals around the node ages depends on branch length variances as well as the degree of inequality of evolutionary rates among lineages (Kumar and Hedges 2016). Tamura et al. (2013) proposed a method to estimate confidence intervals within RelTime, which produces rather wide confidence intervals. We are currently investigating an advanced approach to narrow confidence intervals while maintaining appropriate coverage probabilities, but this subject is beyond the scope of the current article.

Computational Speed of RelTime

RRF scales well with increasing numbers of sequences and is much faster than Bayesian methods for analyzing of molecular sequence data (fig. 1). The fast computational speed is due to the innovation that RRF uses all the data first to map a large alignment onto a phylogeny, and then it uses the resulting branch lengths to generate relative divergence times and evolutionary lineage rates. The computational time taken by RRF is the sum of time taken to generate maximum likelihood estimates of branch lengths for a given sequence alignment and phylogeny and the time taken to estimate rates and dates using RRF. The latter is negligible compared with the former, because of the analytical nature of RRF. In comparison, Bayesian methods are computationally demanding because they require a substantial exploration of likelihood space using prior distributions to generate posterior estimates of rates and divergence times.

RelTime for Phylogenies with Branch Lengths

The above decomposition has a positive side effect, in addition to making RelTime computationally speedy. RRF may be applied to any phylogeny where branch lengths reflect the amount of change. For example, RRF is directly applicable when branch lengths are estimated by using pairwise evolutionary distances and a least squares approach for a given tree topology (Rzhetsky and Nei 1993). In addition to multiple sequence alignments, such distances can come from unaligned and locally aligned genome or genomic segments (Otu and Sayood 2003; Henz et al. 2005; Auch et al. 2006; Deng et al. 2006; Gao and Qi 2007; Lin et al. 2009; Xu and Hao 2009). In fact, RRF can be applied to any phylogeny where branch lengths are generated using other types of molecular data (e.g., gene expression patterns and breakpoint distances) or nonmolecular data (e.g., morphological and life history traits) (Herniou et al. 2001; Gramm and Niedermeier 2002; King et al. 2016; Cooney et al. 2017). Of course, the accuracy of the relative rate and time inferences made for such data depends directly on the accuracy of the phylogenetic tree and the branch lengths, so the biological interpretations of the results obtained will require utmost care.

Usefulness of Relative Node Ages

RRF’s accurate estimation of relative node ages without assuming a speciation model or calibration priors can benefit many applications (Tamura et al. 2012). For example, relative node ages from molecular data are directly comparable with those from fossil data. This allows evaluation of biological hypotheses without the circularity created by the use of calibration priors and densities inferred from molecular data (Battistuzzi et al. 2015; Gold et al. 2017). Along these lines, RRF has been used to develop a protocol to identify calibration priors that have the strongest influence on the final time estimates in Bayesian dating (Battistuzzi et al. 2015), because the cross-validation methods are unlikely to be effective (Warnock et al. 2012, 2015).

Inferring Absolute Times from Relative Node Ages

By placing calibration constraints on one or more nodes in the tree, we can generate an absolute timetree from the ultrametric tree containing relative node ages. Tamura et al. (2013) presented an algorithmic approach for adjusting relative rates to ensure that the estimated times for calibrated nodes are within the boundaries. This process uses the maximum and minimum boundaries only for calibration, which is preferable when the uncertainty distribution of calibrations is not known precisely. Otherwise, there is high probability of biased time estimation (Hedges and Kumar 2004; Ho and Phillips 2009; Inoue et al. 2010; Heath et al. 2014; Ho and Duchêne 2014; dos Reis et al. 2015). Tamura et al.’s approach worked well in the analysis of large data sets, because Bayesian time estimates reported in multiple large-scale studies are similar to those produced by RRF using ultrametric trees with relative times that were transformed into timetrees using many calibration constraints (Mello et al. 2017).

Conclusions

We have presented a mathematical foundation for the RelTime method and elucidated its relationship with other relaxed and strict clock methods. We have shown that the relative rate framework (RRF) produces excellent estimates of rates and divergence times for evolutionary lineages. It is, however, important to note that estimates of divergence times in a phylogeny are only biologically meaningful when the reconstruction of evolutionary relationships is robust. Therefore, the best practice is to first obtain a reliable phylogeny and then estimate divergence times. We must also consider the confidence intervals associated with node ages to assess the precision of time estimates prior to making biological inferences.

Materials and Methods

Computer Simulations and Analysis

We simulated 200 multisequence alignments: 50 each for two models of evolutionary rates (independent and autocorrelated among branches) for two model topologies containing three- and four-ingroup taxa (fig. 2, respectively). The node height of the ingroup subtree was set to be 10 time units, while the node heights of all descendent subtrees varied randomly between 0 and 10 time units. For each resulting model timetree, branch rates were sampled from a lognormal distribution, where the mean rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the SD varied from 0.25 to 0.75 for all branches independently. For the autocorrelated rates, the initial rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the autocorrelation parameter was varied from 0.1 to 0.3. This rate sampling resulted in a phylogram with branch lengths that could be used as input for SeqGen (Grassly et al. 1997). We used the Hasegawa–Kinshino–Yano (HKY) model (Hasegawa et al. 1985) with 4 gamma categories and empirically derived GC content and transition/transversion ratio (Rosenberg and Kumar 2003) as simulation parameters. We generated simulated multispecies alignments where each sequence was 3, 000 base pairs long. The results from three-ingroup sequence analysis (fig. 4) were similar to those from the analysis of four-ingroup sequences (not shown). Using the same simulation strategy, we created 35 alignments each under independent and autocorrelated rate scenarios following a master phylogeny of 100 taxa that was sampled from the bony-vertebrate clade in the Timetree of Life (Hedges and Kumar 2009). In the independent rate case, the SD varied from 0.3 to 0.5. In the autocorrelated rate case, the autocorrelation parameter varied from 0.01 to 0.04. All other simulation parameters (GC contents, transition/transversion ratio, and sequence length) were derived from empirical distributions (Rosenberg and Kumar 2003). Using the same 100 ingroup taxon master phylogeny of bony vertebrates, we simulated five additional sequence data sets under (1) an autocorrelated lognormal rate model with low dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.02, respectively; (2) an autocorrelated lognormal rate model with high dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.05, respectively; (3) an independent lognormal rate model with low dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.25, respectively; (4) an independent lognormal rate model with high dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.75, respectively; (5) an independent exponential rate model, where the mean rate was set to be r; and (6) an independent uniform rate model, where the branch-specific rates were sampled from an uniform distribution from 0 to 2r. GC contents, transition/transversion ratio, sequence length, and evolutionary rate r were derived from empirical distributions (Rosenberg and Kumar 2003). MEGA software (Kumar et al. 2012, 2016) was used to obtain maximum likelihood estimates of branch lengths from simulated sequence alignments, where the correct substitution model and tree topology were used. RRF (eqs. 28–42) was applied to the resulting phylogram with branch lengths, and relative lineage rates and times were obtained. One calibration (true age ± 10 Ma) at the crown node of the ingroup was used to convert relative time estimates for comparison with true times (fig. 8). No calibrations were used in other RRF analyses. All Bayesian analyses were conducted using MCMCTree (Yang 2007) using correct priors; two independent runs of five million generations were carried out. Results were checked for convergence using Tracer (Rambaut et al. 2014). ESS values were higher than 200 after removing 10% burn-in samples in each run. MCMCTree analyses used one root calibration (true age ± 0.1 time units).

Analysis of Hybrid Rate Models

Simulated data sets and BEAST results were provided by Beaulieu et al. (2015), or retrieved from the Dryad Repository. All outgroup and root calibrations are automatically disregarded in RelTime, because the assumption of equal rates of evolution between the ingroup and outgroup sequences is not testable (Kumar et al. 2016). Lognormal distributions with fixed median values of “true ages” were used as calibration densities in the original study (Beaulieu et al. 2015). Because RelTime does not require specific density distributions for calibrations, we used true age ± 5 Ma for all 15 ingroup calibrated nodes in the reanalysis to directly compare RelTime divergence time estimates with those from BEAST. Calibrations employed in RelTime (true age ± 5 Ma) had boundaries similar to 99% probability densities of lognormal distributions originally employed as calibrations. The same alignments, topology, and ingroup calibrations were used in RRF analyses. Estimates of angiosperm age were obtained by summarizing estimates of 100 data sets each in which the herbaceous clades have 3 times (3×) and 6 times (6×) higher rates than those of woody clades.
  51 in total

1.  The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times.

Authors:  Jun Inoue; Philip C J Donoghue; Ziheng Yang
Journal:  Syst Biol       Date:  2009-11-25       Impact factor: 15.683

2.  Relaxed clocks and inferences of heterogeneous patterns of nucleotide substitution and divergence time estimates across whales and dolphins (Mammalia: Cetacea).

Authors:  Alex Dornburg; Matthew C Brandley; Michael R McGowen; Thomas J Near
Journal:  Mol Biol Evol       Date:  2011-09-15       Impact factor: 16.240

3.  The fossilized birth-death process for coherent calibration of divergence-time estimates.

Authors:  Tracy A Heath; John P Huelsenbeck; Tanja Stadler
Journal:  Proc Natl Acad Sci U S A       Date:  2014-07-09       Impact factor: 11.205

Review 4.  Molecular-clock methods for estimating evolutionary rates and timescales.

Authors:  Simon Y W Ho; Sebastián Duchêne
Journal:  Mol Ecol       Date:  2014-10-30       Impact factor: 6.185

Review 5.  Bayesian molecular clock dating of species divergences in the genomics era.

Authors:  Mario dos Reis; Philip C J Donoghue; Ziheng Yang
Journal:  Nat Rev Genet       Date:  2015-12-21       Impact factor: 53.242

6.  Bayesian random local clocks, or one rate to rule them all.

Authors:  Alexei J Drummond; Marc A Suchard
Journal:  BMC Biol       Date:  2010-08-31       Impact factor: 7.431

7.  ComPhy: prokaryotic composite distance phylogenies inferred from whole-genome gene sets.

Authors:  Guan Ning Lin; Zhipeng Cai; Guohui Lin; Sounak Chakraborty; Dong Xu
Journal:  BMC Bioinformatics       Date:  2009-01-30       Impact factor: 3.169

8.  CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes.

Authors:  Zhao Xu; Bailin Hao
Journal:  Nucleic Acids Res       Date:  2009-04-26       Impact factor: 16.971

9.  Isolation of Infective Zika Virus from Urine and Saliva of Patients in Brazil.

Authors:  Myrna C Bonaldo; Ieda P Ribeiro; Noemia S Lima; Alexandre A C Dos Santos; Lidiane S R Menezes; Stephanie O D da Cruz; Iasmim S de Mello; Nathália D Furtado; Elaine E de Moura; Luana Damasceno; Kely A B da Silva; Marcia G de Castro; Alexandra L Gerber; Luiz G P de Almeida; Ricardo Lourenço-de-Oliveira; Ana Tereza R Vasconcelos; Patrícia Brasil
Journal:  PLoS Negl Trop Dis       Date:  2016-06-24

10.  Mega-evolutionary dynamics of the adaptive radiation of birds.

Authors:  Christopher R Cooney; Jen A Bright; Elliot J R Capp; Angela M Chira; Emma C Hughes; Christopher J A Moody; Lara O Nouri; Zoë K Varley; Gavin H Thomas
Journal:  Nature       Date:  2017-02-01       Impact factor: 49.962

View more
  43 in total

1.  Golden Orbweavers Ignore Biological Rules: Phylogenomic and Comparative Analyses Unravel a Complex Evolution of Sexual Size Dimorphism.

Authors:  Matjaž Kuntner; Chris A Hamilton; Ren-Chung Cheng; Matjaž Gregorič; Nik Lupše; Tjaša Lokovšek; Emily Moriarty Lemmon; Alan R Lemmon; Ingi Agnarsson; Jonathan A Coddington; Jason E Bond
Journal:  Syst Biol       Date:  2019-07-01       Impact factor: 15.683

2.  Two strains of a novel begomovirus encoding Rep proteins with identical β1 strands but different β5 strands are not compatible in replication.

Authors:  Jesús Aarón Avalos-Calleros; Guillermo Pastor-Palacios; Omayra C Bolaños-Martínez; Armando Mauricio-Castillo; Josefat Gregorio-Jorge; Nadia Martínez-Marrero; Bernardo Bañuelos-Hernández; Jesús Méndez-Lozano; Gerardo Rafael Arguello-Astorga
Journal:  Arch Virol       Date:  2021-04-14       Impact factor: 2.574

3.  Pseudonocardia cytotoxica sp. nov., a novel actinomycete isolated from an Arctic fjord with potential to produce cytotoxic compound.

Authors:  Mohandas Dhaneesha; Md Umar; Titus Susan Merlin; Kottekkatu Padinchati Krishnan; Vrinda Sukumaran; Rupesh Kumar Sinha; Abdulaziz Anas; Peng Fu; John Bryan MacMillan; Thavarool Puthiyedathu Sajeevan
Journal:  Antonie Van Leeuwenhoek       Date:  2020-11-23       Impact factor: 2.271

4.  Complete genome sequence of cnidium closterovirus 1, a novel member of the genus Closterovirus infecting Cnidium officinale.

Authors:  Workitu Firomsa Gudeta; Mesele Tilahun Belete; Davaajargal Igori; Se Eun Kim; Jae Sun Moon
Journal:  Arch Virol       Date:  2022-04-26       Impact factor: 2.574

5.  Systematic analysis of HD-ZIP transcription factors in sesame genome and gene expression profiling of SiHD-ZIP class I entailing drought stress responses at early seedling stage.

Authors:  Maryam Mehmood; Muhammad Jadoon Khan; Muhammad Jawad Khan; Nadeem Akhtar; Fizza Mughal; Syed Tahir Abbas Shah; Muhammad Zeeshan Hyder; Sumaira Farrakh; Irfan Sadiq
Journal:  Mol Biol Rep       Date:  2022-01-07       Impact factor: 2.316

6.  Comparative genome analyses suggest a hemibiotrophic lifestyle and virulence differences for the beech bark disease fungal pathogens Neonectria faginata and Neonectria coccinea.

Authors:  Catalina Salgado-Salazar; Demetra N Skaltsas; Tunesha Phipps; Lisa A Castlebury
Journal:  G3 (Bethesda)       Date:  2021-04-15       Impact factor: 3.154

7.  Role of HLA-B exon 1 in graft-versus-host disease after unrelated haemopoietic cell transplantation: a retrospective cohort study.

Authors:  Effie W Petersdorf; Mary Carrington; Colm O'hUigin; Mats Bengtsson; Dianne De Santis; Valerie Dubois; Ted Gooley; Mary Horowitz; Katharine Hsu; J Alejandro Madrigal; Martin J Maiers; Mari Malkki; Caroline McKallor; Yasuo Morishima; Machteld Oudshoorn; Stephen R Spellman; Jean Villard; Phil Stevenson
Journal:  Lancet Haematol       Date:  2019-10-25       Impact factor: 18.959

8.  Confronting Sources of Systematic Error to Resolve Historically Contentious Relationships: A Case Study Using Gadiform Fishes (Teleostei, Paracanthopterygii, Gadiformes).

Authors:  Adela Roa-Varón; Rebecca B Dikow; Giorgio Carnevale; Luke Tornabene; Carole C Baldwin; Chenhong Li; Eric J Hilton
Journal:  Syst Biol       Date:  2021-06-16       Impact factor: 15.683

9.  Protein targeting to starch 1, a functional protein of starch biosynthesis in wheat (Triticum aestivum L.).

Authors:  Vinita Sharma; Vikas Fandade; Prashant Kumar; Afsana Parveen; Akansha Madhawan; Manik Bathla; Ankita Mishra; Himanshu Sharma; Vikas Rishi; Santosh B Satbhai; Joy Roy
Journal:  Plant Mol Biol       Date:  2022-03-24       Impact factor: 4.076

10.  Genomes of Anguillid Herpesvirus 1 Strains Reveal Evolutionary Disparities and Low Genetic Diversity in the Genus Cyprinivirus.

Authors:  Owen Donohoe; Haiyan Zhang; Natacha Delrez; Yuan Gao; Nicolás M Suárez; Andrew J Davison; Alain Vanderplasschen
Journal:  Microorganisms       Date:  2021-05-05
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.