Jaehee Kim1, Noah A Rosenberg1, Julia A Palacios2,3. 1. Department of Biology, Stanford University, Stanford, CA 94305. 2. Department of Statistics, Stanford University, Stanford, CA 94305; juliapr@stanford.edu. 3. Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA 94305.
Abstract
Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.
Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.
Gene genealogies are rooted binary trees that describe the ancestral relationships of a sample of molecular sequences at a locus. The properties of these genealogies are influenced by the nature of the evolutionary forces experienced by the sample’s ancestry. Hence, assessing differences among gene genealogies can provide information about differences in these forces. In this article, we propose a distance on genealogies that enables biologically meaningful comparisons between genealogies of nonoverlapping samples. Our proposed distance, in its more general form, is a distance on ranked genealogies.Ranked genealogies are rooted binary unlabeled trees with branch lengths and ordered internal nodes (Fig. 1). These genealogies are also known as Tajima’s genealogies (1, 2), as they were examined by Tajima (3). Unlabeled ranked genealogies are coarser than labeled ranked genealogies but finer than unlabeled unranked genealogies (Fig. 1); in a labeled or unlabeled unranked genealogy, multiple bifurcations may occur simultaneously (4).
Fig. 1.
Types of tree topology. (A) Labeled ranked tree shape. (B) Ranked tree shape (). (C) Labeled unranked tree shape. (D) Tree shape. Genealogies corresponding to each topology include branch lengths.
Types of tree topology. (A) Labeled ranked tree shape. (B) Ranked tree shape (). (C) Labeled unranked tree shape. (D) Tree shape. Genealogies corresponding to each topology include branch lengths.Recently, there has been increasing interest in modeling ranked genealogies for studying evolutionary dynamics (1, 5, 6). One method for inferring evolutionary parameters from molecular data is based on the Tajima coalescent for ranked genealogies (2): Modeling of ranked genealogies, as opposed to labeled ranked genealogies as in the Kingman coalescent, reduces the dimensionality of the inference problem. Thus, for example, in studying macroevolution, Maliet et al. (7) proposed a model on ranked tree shapes—ranked genealogies without branch lengths but with internal node orders retained—to investigate factors influencing nonrandom extinction and the loss of phylogenetic diversity.Many metrics on labeled trees have been proposed, such as the Robinson–Foulds (RF) metric (8), the Billera–Holmes–Vogtmann (BHV) metric (9), and the Kendall–Colijn (KC) metric (10). These metrics have been used for summarizing posterior distributions and bootstrap distributions of trees on the same set of taxa (11, 12), for comparing estimated genealogies of the same organisms obtained with different procedures, and for quantifying uncertainty in inferring evolutionary histories (13). The Colijn–Plazzotta (CP) metric (14) on the coarser-resolution space of tree shapes—rooted binary unlabeled unranked trees without branch lengths—has recently been devised. To our knowledge, no other metrics on ranked tree shapes or ranked unlabeled genealogies have been proposed to date.A tree metric on the space of ranked genealogies can facilitate evaluations of the quality of an estimation procedure, by enabling measurements of the distance between an estimated ranked genealogy and the true ranked genealogy. It can assist in comparing different estimators from different procedures and in comparing estimated ranked genealogies of different organisms living at different geographical locations and different time periods. Moreover, a useful tree metric not only provides a quantitative measure for ranked genealogy comparison, but can also discriminate between key aspects of different evolutionary processes (15). We show that our metrics separate samples of genealogies originating from different sampling distributions of ranked tree shapes and ranked genealogies. Our distances enable the computation of summary statistics, such as the mean and the variance, from samples of ranked genealogies. When ranked genealogy samples are obtained from posterior distributions of genealogies, such as those obtained from Bayesian Evolutionary Analysis by Sampling Trees (BEAST) (16), our tree metrics enable the construction of credible sets and Markov chain Monte Carlo (MCMC) convergence assessment.We first define a metric on ranked tree shapes. Our metric relies on an integer-valued triangular matrix representation of ranked tree shapes. This matrix representation allows us to use metrics on matrices, such as the norm and (Frobenius) norm, to define metrics on ranked tree shapes. The choice of these two distances produces computational benefits, as computations of the metrics are quadratic in the number of leaves. Our metrics on ranked tree shapes retain more information than metrics based on unlabeled unranked tree shapes alone. We expand our metric definition to ranked genealogies, including branch lengths by weighting the matrix representation of ranked tree shapes by branch lengths.We define metrics on isochronous ranked tree shapes and ranked genealogies, with all samples obtained at the same point in time. We next define metrics on heterochronous ranked tree shapes and ranked genealogies, in which samples are time stamped. While modeling isochronous genealogies is the standard practice for macroscopic organisms, such as animals and plants, modeling heterochronous genealogies is the standard approach for rapidly evolving organisms, such as viruses and bacteria (16–21). Heterochronous genealogies are also increasingly relevant in the study of ancient DNA samples (22–26).We analyze properties of our proposed metrics and compare them to other tree-valued metrics—such as the BHV (9), KC (10), and CP (14) metrics—by projecting these other metrics into the space of ranked tree shapes and ranked genealogies. We then demonstrate the performance of our metrics on simulations from various tree topology distributions and demographic scenarios. We use our distances to compare posterior distributions of genealogies of humaninfluenza A/H3N2 virus from different geographic regions and to assess MCMC convergence. An implementation of our metrics is available for download at https://github.com/JuliaPalacios/phylodyn.
Definitions of Tree Topologies and Genealogies
All of the trees we consider are rooted and binary. We first assume that trees are isochronously sampled; that is, all tips of the trees start at the same time. A ranked tree shape with leaves is a rooted binary unlabeled tree with an increasing ordering of the interior nodes, starting at the root with label 2 (Fig. 1). We use the symbol to denote a ranked tree shape. A ranked genealogy is a ranked tree shape equipped with branch lengths. We use the symbol to denote a ranked genealogy. Although we focus mainly on ranked tree shapes and ranked genealogies, we will compare our metrics to metrics defined on other rooted tree spaces. A labeled ranked tree shape (Fig. 1) and a labeled ranked genealogy are the corresponding labeled counterparts of unlabeled ranked trees. A labeled unranked tree shape (Fig. 1) and a labeled unranked genealogy are a rooted binary labeled tree shape and a rooted binary labeled genealogy without ranking of internal nodes, respectively. A tree shape (Fig. 1) is a rooted binary unlabeled unranked tree. A genealogy is a tree shape equipped with branch lengths.
Tree Metrics for Comparative Analysis
Metrics on Labeled Trees.
Many tree metrics have been proposed for phylogenetic studies (8–10, 27–30). We consider several of these. Billera et al. (9) introduced a metric on labeled unranked genealogies that is now commonly known as BHV space. Owen and Provan (31) and Chakerian and Holmes (11) provided polynomial algorithms and implementations for calculating the geodesic distance metric proposed by Billera et al. (9). Recently, Kendall and Colijn (10) proposed a metric on labeled unranked tree shapes and labeled unranked genealogies, representing each tree as a convex combination of two vectors—one encoding the number of edges from the root to the most recent common ancestor of every pair of tips in the tree and the other encoding the sum of the branch lengths of the corresponding paths. The most popular metric that can be computed in polynomial time is the symmetric difference of Robinson and Foulds (RF) (8) on labeled unranked tree shapes. This measure has an associated branch-length measure RFL on labeled genealogies (27).
Metrics on Unlabeled Trees.
A metric on unlabeled trees is desirable because it enables comparison between trees of different samples. However, few such metrics are available. Colijn and Plazzotta (14) proposed a metric on tree shapes which is the Euclidean norm of the difference between two integer vectors that uniquely describe the two trees. Poon et al. (32) developed a kernel function that measures the similarity between two genealogies by accounting for differences in branch lengths and matching the number of descendants over all nodes for both trees. Lewitus and Morlon (33) proposed the Jensen–Shannon distance between the spectral density profiles of the corresponding modified graph Laplacians of the genealogies.We are introducing metrics for unlabeled ranked tree shapes and unlabeled ranked genealogies. To define our metrics on ranked tree shapes, we need to introduce a unique encoding of a ranked tree shape as an integer-valued triangular matrix.
New Approaches
Unique Encoding of Ranked Tree Shapes and Matrix.
Let be a ranked tree shape with leaves sampled at time . Let be the coalescence times. Here, time increases into the past (rootward), , and and correspond to the most recent sampling time and the time to the most recent common ancestor (root), respectively. An matrix that encodes is an lower triangular matrix of integers with elements for all , and for , is the number of extant lineages in that do not bifurcate during the entire time interval traversed forward in time (tipward).In Fig. 2, we show the corresponding matrix to the ranked tree shape depicted in Fig. 2. In the interval , there are two lineages, so . Traversing tipward, one of the two lineages extant at time branches at time , while the other lineage does not branch throughout the entire interval . This gives the first column of the matrix: . For the second column, we start with three lineages in the interval , . Traversing tipward, of the three lineages extant at , one branches at (), and one branches at (). We construct the third column by starting from four lineages in , . One lineage branches at , and thus, . Finally, in the interval , there are five lineages, which gives .
Fig. 2.
Unique encoding of isochronous ranked tree shapes and matrices. (A) Example of a ranked genealogy with isochronous sampling. (B) The corresponding matrix that encodes the ranked tree shape of the tree in A.
Unique encoding of isochronous ranked tree shapes and matrices. (A) Example of a ranked genealogy with isochronous sampling. (B) The corresponding matrix that encodes the ranked tree shape of the tree in A.If is a heterochronous ranked tree shape with leaves and sampling events, then the corresponding -matrix representation is an lower triangular matrix of integers, where for is the number of extant lineages in that do not bifurcate or become extinct during the entire time interval , traversing forward in time. Here, , such that , are the ordered change points of , at each of which the number of branches changes either by a coalescent event or by a sampling event. We show an example of a heterochronous ranked tree shape and its -matrix encoding in .Although the branch lengths and the actual values of coalescent and sampling times are irrelevant for the specification of the ranked tree shape, we rely on the to identify the order and type of change points: coalescence or sampling in and .
Theorem 1 (Unique Encoding of Ranked Tree Shapes).
The map by which ranked tree shapes with
samples and
sampling events are encoded as
matrices of size
uniquely associates a ranked tree shape with an
matrix.In other words, given an matrix, if it encodes a ranked tree shape, then it encodes exactly one ranked tree shape. The proof appears in . In the next section, we leverage the -matrix representation of ranked tree shapes to define a distance between ranked tree shapes. From this point onward, we assume a matrix represents a ranked tree shape.
Metrics on Ranked Tree Shapes and Ranked Genealogies.
We define two distance functions and on the space of ranked tree shapes with leaves as follows. For a pair of ranked tree shapes and and their corresponding -matrix representations and of size , where ,The two distances are metrics as they inherit and element-wise matrix metrics with well-understood properties (34). The definitions are valid for both isochronous and heterochronous ranked tree shapes, as both types of trees have unique matrix encodings.In the following definition, we include branch lengths to define distances between ranked genealogies. We define two weighted distance functions and on the space of ranked genealogies with samples and sampling events. We first define the weight matrix of a ranked genealogy as a lower triangular matrix of size with entries for and otherwise. Here, , such that , is the vector of coalescent times and sampling times , ordered with time increasing into the past (rootward). For a pair of ranked genealogies and , and their corresponding -matrix representations and ,where and are the weight matrices associated with and , respectively. shows an example weight matrix , associated with the example heterochronous ranked genealogy and its matrix in .
Proposition 2.
The weighted distances
and
are metrics.The proof appears in . Our distances on ranked tree shapes and ranked genealogies are distances between trees with the same number of leaves and, additionally in the heterochronous case, the same number of sampling events. An extension to cases in which the numbers of sampling events differ but the total numbers of leaves remain the same is described in .In the following section, we propose sample summary statistics based on our metrics and for ranked tree shapes and and for ranked genealogies.
Ranked Tree Shape and Ranked Genealogy Summary Statistics.
We first use our proposed distances to define a notion of mean value and dispersion value from a finite sample of ranked tree shapes with leaves.Our proposed measures of centrality are the -medoid sets defined asfor . We note that our definition of the -medoid set corresponds to the ranked tree shape(s) in the sample minimizing the sum of squared distances as opposed to the sum of distances. In addition, when the sample is replaced by the complete population of ranked tree shapes with leaves or when we allow the medoid to belong to the population of trees but do not require it to be a sampled tree, Eq. corresponds to the Fréchet mean or barycenter under uniform sampling probabilities (35).We use the following as a measure of dispersion around the medoid for :When the medoid is not unique, the dispersion is defined with respect to any chosen -medoid ranked tree shape from the set of medoids.Similarly, given a finite sample of ranked genealogies with leaves, the -medoid set is defined asfor , and our empirical measure of dispersion for ranked genealogies is the mean sum of squared distances to the medoid.
Statistical Comparison of Ranked Tree Shape and Ranked Genealogy Sampling Distributions.
To assess the utility of a metric in distinguishing between two different ranked tree shape (or ranked genealogy) sampling distributions, we propose a nonparametric test of equality of tree sampling distributions as follows. Given two sets of independent samples of ranked tree shapes (or ranked genealogies) from distributions and , we are interested in testing the null hypothesis .Assume are independent and identically distributed (i.i.d.) ranked tree shapes (or ranked genealogies) with leaves according to a probability distribution , and independently, are i.i.d. ranked tree shapes (or ranked genealogies) with leaves according to . For a given distance function defined on the space of ranked tree shapes (or ranked genealogies), let be the medoid of the samples and let be the medoid of the samples, as defined in Eq. . The mean confusion is defined asThe closer the value of the mean confusion statistic is to 0, the more easily the two tree distributions can be distinguished from each other. The mean confusion as defined in Eq. assumes the two medoids are unique, each observed only once, and ignores ties (trees equidistant to and ). When the sample space of ranked tree shapes is large or when the sample space is the space of ranked genealogies, the expected value of the mean confusion statistic under the null hypothesis is close to 0.5. In , we define a more general mean confusion statistic that accounts for nonunique medoids and ties. We note that for our simulations and real data analyses, Eq. is used since we observe a unique medoid in each ranked tree shape or ranked genealogy distribution, and no ties are present in pairwise comparisons.An -level test can be constructed for testing the null hypothesis using random permutations. The permutation P value is computed as (36)where is a confusion statistic from the th replicate and is the observed value from data.For comparing more than two sampling distributions, we compute the confusion matrix , where entry corresponds to the percentage of trees in the th sampling distribution that are closest to the medoid of the th sampling distribution. Each row of the confusion matrix sums to 100%, and the diagonal elements represent the percentage of trees that are closer to the medoid of their originating distribution than medoids of the other sampling distributions. The average diagonal of the confusion matrix, which we term mean separation, provides a measure of overall separation between sampling distributions based on the distance used. It ranges from 0 to 100, with a larger value indicating better separation between sampling distributions.
Adapting Other Tree Metrics to Ranked Tree Shapes and Ranked Genealogies.
Although there are no other metrics designed specifically for ranked tree shapes and ranked genealogies, we can adapt other tree distances defined on finer or coarser tree spaces to the space of ranked tree shapes and ranked genealogies, and we can compare them to our metrics. We adapted metrics that are originally defined on the space of labeled genealogies—the BHV distance and the KC distance—to the space of ranked genealogies by artificially assigning uniquely defined leaf labels to ranked genealogies. The unique assignment of the leaf labels on a ranked tree shape consists of assigning labels in increasing index order starting with leaves subtended by an internal node closest to the tips and ending with leaves subtended closest to the root. For adapting the BHV distance and the KC distance to the space of ranked tree shapes, we assigned a unit length to each change point time interval and unique leaf labels to tips of a ranked tree shape. We also compared our distances on ranked tree shapes to the CP distance defined on the space of tree shapes by ignoring internal node labels. Details of these adaptations can be found in .We term the adapted BHV, KC, and CP distances on the space of ranked tree shapes , , and , respectively. We denote the adapted BHV and KC distances on ranked genealogies by and , respectively. Table 1 shows a summary of the distances defined on the spaces of ranked tree shapes and ranked genealogies, as used in this paper.
Table 1.
Summary of distances in the spaces of ranked tree shapes and ranked genealogies
Name
Tree space
Original tree space
Reference
d1,d2
Ranked tree shape
Ranked tree shape
This paper
d1w,d2w
Ranked genealogy
Ranked genealogy
This paper
dBHV-RTS
Ranked tree shape
Labeled genealogy
Billera, Holmes, and Vogtmann (9)
dBHV-RG
Ranked genealogy
Labeled genealogy
Billera, Holmes, and Vogtmann (9)
dKC-RTS
Ranked tree shape
Labeled genealogy
Kendall and Colijn (10)
dKC-RG
Ranked genealogy
Labeled genealogy
Kendall and Colijn (10)
dCP-RTS
Ranked tree shape
Tree shape
Colijn and Plazzotta (14)
A brief description of existing distances that are originally defined on the space of tree shapes and the space of labeled genealogies, and their adaptations to the spaces of ranked tree shapes and ranked genealogies, can be found in .
Summary of distances in the spaces of ranked tree shapes and ranked genealogiesA brief description of existing distances that are originally defined on the space of tree shapes and the space of labeled genealogies, and their adaptations to the spaces of ranked tree shapes and ranked genealogies, can be found in .
Results
Having introduced the metrics on the spaces of ranked tree shapes and ranked genealogies, we first examine the behavior of our metrics in an illustrative example for . We then show their utility in simulated data under various evolutionary models. Finally, we apply them to analyze real humaninfluenza A/H3N2 virus data.
Interpreting Proposed Distances between Ranked Tree Shapes of Leaves.
Before demonstrating the utility of our metrics in distinguishing different sampling distributions by simulations and a real data application, as a simple illustrative example, we first explore our distances on ranked tree shapes of leaves—the smallest number of taxa giving a nontrivial space of ranked tree shapes—and compare them with the adaptations of other metrics.In Fig. 3, we show all five possible ranked tree shapes and their corresponding pairwise distances. Trees , and are trees with the same tree shape but different ranked tree shapes. The trees in each of the pairs and differ by one rank switch, and their pairwise distance is . The distance between the pair is 2; indeed, to go from to , we need two rank switches. All pairwise distances are shown in .
Fig. 3.
distances between all pairs of ranked tree shapes of leaves. Rank labels of internal nodes are removed for ease of visualization. There are three different distance values among the 10 pairs of distances.
distances between all pairs of ranked tree shapes of leaves. Rank labels of internal nodes are removed for ease of visualization. There are three different distance values among the 10 pairs of distances.The pairs and have the maximum distance. In both cases, we need a rank switch and a change of tree shapes to move from one tree to the other. The pairs of trees with the maximum distance are the same two pairs with the maximum distance. However, an additional pair, , also has the same maximum, even though and differ by two rank change events but have the same tree shape.Qualitative behaviors of distances among the trees , , and mirror . , however, differs in that the trees , , and are equidistant from but not from the caterpillar tree , whereas, using or , all pairwise distances to the trees , , and from or are distinct. We also note that because is a distance originally defined on tree shapes, all pairwise distances between the trees , and are 0, and all distances to the trees , , and from or have the same value of 2.
Separation between Ranked Tree Shapes with Different Distributions.
Having demonstrated that our proposed metrics, and , effectively differentiate both tree topologies and internal node rankings in the case, we now evaluate by simulation how and can distinguish different ranked tree shape distributions from models with biological relevance.We consider isochronous ranked tree shapes from a two-parameter branching model to which we refer as the alpha–beta-splitting model (7). These parameters capture two phenomena that have been widely used for phylogenetic studies. The parameter , as in the beta-splitting model (37), determines the degree of balance, an important tree attribute that carries information on underlying evolutionary or epidemiological processes, including speciation and extinction (38–41), natural selection (42–45), and infectious disease transmission (46–48). The other parameter regulates the relationship between clade family size and proximity to the root (clade size–age relation), which is fundamental to understanding the effects of ecological, evolutionary, geographic, and other factors on biodiversity (7, 49–52). More details about these models can be found in .We evaluate and compare our metrics, and , with the adaptations of other metrics to the space of ranked tree shapes (Table 1)—, , and —by the following methods. We summarize each sampling distribution by the summary statistics, the medoid (Eq. ) and the dispersion (Eq. ). We then assess the performance of the metrics in distinguishing different ranked tree shape distributions by several measures: the confusion matrix, mean separation, and mean confusion (Eq. ) and its permutation P value (Eq. ) for testing equality between pairs of sampling distributions ( permutations). Additionally, for ease of visualization and interpretation, we use multidimensional scaling (MDS) (53) to embed our distance metrics into Euclidean spaces of two dimensions.
Distinguishing Different Tree Balance Distributions.
To show how our proposed metrics can be used to differentiate ranked tree shapes sampled from distributions with different degrees of tree balance, we simulated ranked tree shapes under the alpha–beta-splitting model with different values of the tree balance parameter () while keeping the clade size–age relation parameter () fixed. We considered , representing a sequence from unbalanced to balanced ranked tree shapes. We refer to the ranked tree shape distributions with their corresponding well-known models of speciation: the Yule model (54, 55) (), the proportional-to-distinguishable-arrangements (PDA) model (41) (), and the Aldous branching (AB) model (40) (). Additionally, we refer to and as “unbalanced” and “balanced”, respectively. We simulated 1,000 ranked tree shapes with leaves for each value, generating 5,000 simulated ranked tree shapes. We then computed the pairwise distance matrices of size with , , , , and .The confusion matrices displayed in show the greatest mean separation across the five distributions for and : About of the trees are closest to the medoid of their originating distribution with and distances, with , with , and only with . Our and distances discriminate tree distributions with different tree balance parameters to a greater extent than ; and show more similar performance.To test for equality in sampling distributions, we computed the pairwise mean confusions and associated P values. shows mean confusions computed using our metrics, and . All off-diagonal values are smaller than 0.21 and statistically significant (P
). In contrast, comparisons with and () produce higher pairwise mean confusion values across all pairs—with off-diagonal values larger than 0.45 with and values between 0.04 and 0.28 with —and thus, lower discrimination. Mean confusion values closer to 0 are indicative of good discrimination. Comparisons with () display similar pairwise mean confusion values and significance to our distances (all P values are significant at P
).shows the dispersion statistic for each tree distribution and for each distance function. The unbalanced sample shows the smallest dispersion value (all sampled trees are closer to each other) with our distances and , whereas, with and , it displays the largest dispersion value (more separation among trees within the same distribution).Our findings are confirmed visually through the MDS plots displayed in Fig. 4. In the MDS plots, each dot corresponds to a tree, and each color corresponds to the sampling distribution for a specified value. Fig. 4 shows the -medoid trees using our distance. The total distances explained by the first two MDS dimensions are using (Fig. 4) and using (Fig. 4). The two-dimensional MDS mappings using the other distances explain less than .
Fig. 4.
MDS representation of distances between 5,000 simulated isochronous ranked tree shapes of leaves, aggregated from five different beta-splitting models. A total of 1,000 isochronous ranked tree shapes were simulated from each of the following models: balanced model (), Yule model (), Aldous branching (AB) model (), proportional-to-distinguishable-arrangements (PDA) model (), and unbalanced model (). (A) MDS of the metric. (B) MDS of the metric. (C) -medoid trees from each distribution using the metric. (D–F) MDS plots for (D) , (E) , and (F) . In each MDS plot, the triangle represents the -medoid tree of 1,000 points for a specified model.
MDS representation of distances between 5,000 simulated isochronous ranked tree shapes of leaves, aggregated from five different beta-splitting models. A total of 1,000 isochronous ranked tree shapes were simulated from each of the following models: balanced model (), Yule model (), Aldous branching (AB) model (), proportional-to-distinguishable-arrangements (PDA) model (), and unbalanced model (). (A) MDS of the metric. (B) MDS of the metric. (C) -medoid trees from each distribution using the metric. (D–F) MDS plots for (D) , (E) , and (F) . In each MDS plot, the triangle represents the -medoid tree of 1,000 points for a specified model.
Distinguishing Different Internal Node Ranking Distributions.
To show how our proposed metrics can be used to differentiate ranked tree shapes sampled from distributions with different internal node rankings, we generated 1,000 random trees from the alpha–beta-splitting model with tips, fixed , and varying values of in , producing 5,000 total trees. By varying the value of while keeping the balance parameter () fixed, we test the performance of our metrics in distinguishing between ranked tree shapes with small family sizes closer to the root () and ranked tree shapes with small family sizes closer to the tips ().The confusion matrices displayed in show greater mean separation for and than for any of the other distances. More than of the trees are closest to the medoid of their originating distributions with and , and less than have this property with the other distances.The pairwise mean confusions computed using and () have values strictly smaller than the mean confusions computed with the other distances. All comparisons with our distances are statistically significant (P
), indicating good distinguishability of distributions with different values.The dispersion statistics in indicate that, according to our and metrics, trees from the alpha–beta-splitting model with are the most dispersed group among the five values considered. This, however, is not the case with the other distances. In particular, no variations in dispersion across different distributions are observed using , and a positive correlation between and dispersion is seen with .Our two-dimensional MDS representations confirm the findings observed in the confusion matrix and mean confusions: and (Fig. 5 ), when compared to the other three metrics (Fig. 5 ), distinguish to a greater extent between trees with positive and negative parameters. In addition, our and distances show tighter embeddings in two-dimensional MDS, with a high proportion of the total distance explained. Although the MDS visualizations in Fig. 5 show that most clusters of trees are well separated according to their sampling distributions with and , a large overlap exists between groups of trees with and . The observed similarity is evident from the -medoid trees (Fig. 5).
Fig. 5.
MDS representation of distances between 5,000 simulated isochronous ranked tree shapes of leaves, aggregated from five different alpha–beta-splitting models. A total of 1,000 isochronous ranked tree shapes were simulated for each value in . Different values generate different distributions of internal node ranking while keeping the same tree shape distribution at . (A) MDS of the metric. (B) MDS of the metric. (C) -medoid trees from each distribution using the metric. (D–F) MDS plots for (D) , (E) , and (F) . In each MDS plot, the triangle represents the -medoid tree of 1,000 points for a specified model.
MDS representation of distances between 5,000 simulated isochronous ranked tree shapes of leaves, aggregated from five different alpha–beta-splitting models. A total of 1,000 isochronous ranked tree shapes were simulated for each value in . Different values generate different distributions of internal node ranking while keeping the same tree shape distribution at . (A) MDS of the metric. (B) MDS of the metric. (C) -medoid trees from each distribution using the metric. (D–F) MDS plots for (D) , (E) , and (F) . In each MDS plot, the triangle represents the -medoid tree of 1,000 points for a specified model.Our results confirm that our metrics capture both tree balance and clade size–age relation more effectively than the other adapted metrics on ranked tree shapes, in that simulation models differing in these features are more easily discriminated with our distances than with the others. To further demonstrate this point, we simulated 1,000 trees in each of the varying tree balance and internal node rankings: . The groups with the same value share the same tree balance distribution, and the groups with the same value share the same internal node ranking distribution. In show that and effectively separate all four distributions, while and , designed to measure differences in tree shapes only, separate tree distributions with the same tree balance but with different internal node rankings less effectively.
Separation between Ranked Genealogies with Different Distributions.
Having investigated the utility of our metrics in separating different ranked tree shape distributions, we now demonstrate our metrics on the space of ranked genealogies. We consider two scenarios with evolutionary and epidemiological importance: isochronous ranked genealogy distributions resulting from different demographic histories and heterochronous ranked genealogies inferred from humaninfluenza A/H3N2 virus data from two geographical regions. We evaluate and compare our metrics, and , with the adaptations of other metrics to the space of ranked genealogies (Table 1)— and —using the same measures we employed in the previous section. We note that there is no adaptation of the CP metric to the space of ranked genealogies, as the CP metric is defined on the space of tree shapes and, thus, does not incorporate tree branch lengths.
Distinguishing Different Demographic Histories.
We assess how our proposed weighted metrics can differentiate genealogies with different branch length distributions. In coalescent models, the rate of coalescence (or branching) depends on the effective population size , as shown in Eq. . Under the neutral coalescent, the tree topology distribution () is independent of the branch lengths (see for details). To investigate the performance of our weighted metrics or on genealogies in separating trees according to their branch length distribution, we generated 1,000 random ranked genealogies with leaves from the Tajima coalescent with each of the following demographic scenarios:The coalescent trees under each specified population trajectory were simulated with . The functional form and parameter values chosen for the seasonal logistic trajectory mimic estimated trajectories of human influenza A virus in temperate regions (56). We removed the leaf labels and retained branch lengths to produce isochronous ranked genealogies. We produced pairwise distance matrices of using the weighted metrics and .Constant effective population size ;Exponential growth ;Seasonal logistic trajectoryThe dispersion statistics in show that the sampled genealogies under the constant population size trajectory have the greatest dispersion, and the sampled genealogies under the exponential growth trajectory have the least dispersion.Our metrics’ ability to distinguish different demographic scenarios is evident in the confusion matrix in , where they display good performance. Across all three distributions, of the trees are closest to the medoid of their originating distribution with and ; corresponding values are with and with .To test for equality in sampling distributions, we computed the mean confusion statistics and P values. In displays all off-diagonal mean confusion statistics that are less than 0.002 and are statistically significant (P
) with and , indicating a high level of discrimination between ranked genealogy distributions with different demographic histories. The mean confusions computed with and have far higher values than the values with and ().The two-dimensional MDS plots, Fig. 6 , display that our weighted metrics distinguish the three simulated ranked genealogy distributions, visually confirming the same finding using the confusion matrix and the mean confusion statistic. The first two dimensions of MDS account for 72.1% of the total distance, with the first MDS coordinate separating the constant trajectory from the others and the second coordinate distinguishing the exponential trajectory from the logistic trajectory. A comparison of panels within Fig. 6 shows that is far less informative than either of our metrics.
Fig. 6.
MDS representation of distances between 3,000 simulated isochronous ranked genealogies of leaves under different demographic models. The number of simulated genealogies per population model is 1,000. (A) metric. (B) metric. (C) -medoid genealogies from each distribution using the metric. (D) . (E) .
MDS representation of distances between 3,000 simulated isochronous ranked genealogies of leaves under different demographic models. The number of simulated genealogies per population model is 1,000. (A) metric. (B) metric. (C) -medoid genealogies from each distribution using the metric. (D) . (E) .
Analysis of Human Influenza A/H3N2 Virus from Different Geographical Regions.
We next apply our metrics to ranked genealogies sampled from the posterior distributions of humaninfluenza A/H3N2 genealogies of hemagglutinin (HA) gene sequences from two geographic regions—New York and Southeast Asia. In temperate climates, such as New York, influenza epidemics display seasonality, with peaks occurring during the winter months, whereas in regions with tropical or subtropical climates, such as Southeast Asia, influenza activity persists throughout the year (57). We reanalyzed a subset of the sequences used in a previous study of Bahl et al. (58) that showed patterns of seasonal dynamics of influenza in New York and stable dynamics in Southeast Asia. We use our metrics to assess the differences in genealogical posterior distributions between the two regions.The frequency of the collected samples by collection date and the estimated effective population size trajectories (Fig. 7) show distinct patterns consistent with those in Bahl et al. (58): a seasonal effective population size trajectory in New York that peaks during the winter seasons and a flat effective population size trajectory in Southeast Asia.
Fig. 7.
Human influenza A/H3N2 virus collection dates and inferred effective population size trajectories. (A) Collection date histogram, New York. (B) Collection date histogram, Southeast Asia. (C) Inferred effective population size trajectory with BEAST, New York. (D) Inferred effective population size trajectory with BEAST, Southeast Asia.
Humaninfluenza A/H3N2 virus collection dates and inferred effective population size trajectories. (A) Collection date histogram, New York. (B) Collection date histogram, Southeast Asia. (C) Inferred effective population size trajectory with BEAST, New York. (D) Inferred effective population size trajectory with BEAST, Southeast Asia.For each geographical region, we sampled 1,000 ranked genealogies per region from posterior distributions obtained from BEAST. We refer to the resulting samples from New York and Southeast Asia as NY–NY and SEA–SEA, respectively.To investigate the effect of the different sampling schedules, we simulated two additional groups of 1,000 trees: one with fixed New York sampling schedule (Fig. 7) and Southeast Asia estimated effective population size trajectory (Fig. 7) and the other with fixed Southeast Asia sampling schedule (Fig. 7) and New York estimated effective population size trajectory (Fig. 7). We denote the former by NY–SEA and the latter by SEA–NY.We then aggregated the 4,000 ranked genealogies—1,000 from each of NY–NY, SEA–SEA, NY–SEA, and SEA–NY—and computed pairwise distance matrices of with , , , and . The corresponding confusion matrices () show that and have good performance in discriminating all four distributions, both with mean separation of 99.9%, while and display less discrimination, with mean separations of 28.9 and 65.6%, respectively.As shown in , the observed mean confusion statistics between all sampling distribution pairs with our distances are near zero and statistically significant (P
). While all observed off-diagonal mean confusion values with () are larger than the values with or , all P values are significant, showing somewhat comparable performance with our distances. The test of equality in sampling distributions is not rejected at the significance level of with between the pair (NY–SEA, NY–NY) and between the pair (SEA–NY, NY–NY) ().Fig. 8 shows that all four distributions with different sampling and evolutionary histories are also well separated in the two-dimensional MDS plots using and . However, we note that the first two dimensions of MDS account for only 51.7% () and 50.5% () of the total distance.
Fig. 8.
MDS representation of distances between 4,000 heterochronous ranked genealogies sampled from two posterior distributions (1,000 trees from each distribution) and two sets of simulated trees (1,000 trees from each simulation). Observed data consist of 122 HA sequences of human influenza A/H3N2 virus from each geographic region, New York (NY–NY) and Southeast Asia (SEA–SEA). An additional two sets of trees were simulated using sampling times of New York and the effective population size trajectory of Southeast Asia (NY–SEA) and using sampling times of Southeast Asia and the effective population size trajectory of New York (SEA–NY). (A) metric. (B) metric. (C) -medoid trees from each distribution using the metric. (D) metric. (E) metric.
MDS representation of distances between 4,000 heterochronous ranked genealogies sampled from two posterior distributions (1,000 trees from each distribution) and two sets of simulated trees (1,000 trees from each simulation). Observed data consist of 122 HA sequences of humaninfluenza A/H3N2 virus from each geographic region, New York (NY–NY) and Southeast Asia (SEA–SEA). An additional two sets of trees were simulated using sampling times of New York and the effective population size trajectory of Southeast Asia (NY–SEA) and using sampling times of Southeast Asia and the effective population size trajectory of New York (SEA–NY). (A) metric. (B) metric. (C) -medoid trees from each distribution using the metric. (D) metric. (E) metric.displays the dispersion values computed with all distances on ranked genealogies. In particular, SEA–SEA ranked genealogies with relatively uniform sampling history and evolutionary trajectory have larger dispersion compared to the other distributions. The large dispersion of SEA–SEA ranked genealogies is consistent with the simulated results in Fig. 6 and , where the distribution of isochronous ranked genealogies with constant population size trajectory has larger dispersion than distributions with exponential or seasonal logistic trajectory.Our proposed distances confirm that the HA segments of humaninfluenza A/H3N2 experienced different evolutionary processes in New York and Southeast Asia between 2001 and 2005. While different sampling schedules of different datasets may obfuscate signals of underlying evolutionary dynamics in the application of our distances, we note that in this particular analysis, sampling frequencies are highly correlated with effective population sizes and are thus informative about the underlying evolutionary processes. This situation of preferential sampling (59–61) is common in molecular phylodynamic studies of influenza viruses: More samples are sequenced when the effective population size is larger.
Convergence Diagnostic.
In the previous section, we compared posterior samples of ranked genealogies derived from MCMC procedures that introduce correlation in the samples. Although we thinned our posterior samples, it is important to test the mixing of the Markov chains. We used our metrics to assess convergence of each of the three MCMC chains for the two geographical regions. We computed the distance between the running posterior -medoid ranked genealogy and the global posterior -medoid ranked genealogy after every iterations for each chain. That is, in each chain, we sampled 1,000 ranked genealogies from the posterior distribution so that the th sample corresponds to iteration of the chain. For each , we computed a running -medoid of all samples up to and including the th ranked genealogy using our metric. We then computed distances from the running medoids to the overall medoid of all of the 1,000 sampled ranked genealogies ().Fig. 9 shows as a function of the chain iteration index for the three independent runs for each geographical region. The initial variability of the running-mean plot is relatively high, but, after an initial burn-in period (), the distance from the running mean to the overall mean stabilizes quickly with an increasing number of states indicating convergence.
Fig. 9.
Assessment of convergence of MCMC BEAST chains. Each curve corresponds to the distance between the running posterior -medoid ranked genealogy and the global posterior -medoid ranked genealogy after every iterations for each chain. (A) New York, . (B) New York, . (C) Southeast Asia, . (D) Southeast Asia, .
Assessment of convergence of MCMC BEAST chains. Each curve corresponds to the distance between the running posterior -medoid ranked genealogy and the global posterior -medoid ranked genealogy after every iterations for each chain. (A) New York, . (B) New York, . (C) Southeast Asia, . (D) Southeast Asia, .
Discussion
Ranked tree shapes and ranked genealogies are used to model the dependencies between samples of molecular data in the fields of population genetics and phylogenetics. These tree structures have a particular value in studying heterochronous data and in providing a resolution that is fine enough to be informative but is computationally more feasible than finer resolutions.In this article, we equipped the spaces of ranked tree shapes and ranked genealogies with a metric. By defining a bijection between ranked tree shapes and a triangular matrix of integers, we defined distance functions between ranked tree shapes as the and norms of the difference between two matrices. Our distances on ranked genealogies are the and norms of the difference between two matrices weighted by the genealogy branch lengths. To apply our distances to real data for comparing ranked genealogies or ranked tree shapes with different sampling events, we proposed an augmentation scheme, increasing the dimension of the matrix representations, and we defined our distances as and norms of the difference between extended matrices of the same size. We used our distances to summarize the distribution of trees and proposed using the -medoid tree set, a sample version of the Fréchet mean, as the central set of a sample of trees. We proposed a sample version of the Fréchet variance to quantify uncertainty. We demonstrated the utility of our metrics using simulated data from models with biological relevance and humaninfluenza A/H3N2 virus between temperate and tropical regions.The performance of the proposed metrics was evaluated through their ability to distinguish different tree distributions quantified by the confusion matrix, mean separation, and mean confusion statistic. We statistically assessed equality in distributions with a nonparametric test. This feature is particularly relevant where analytic expressions for underlying probability distributions are unknown. We note that our nonparametric test assumes samples are independent, and a future direction is to account for the nonindependence of samples. Another important future direction is to develop statistical tests based on our metrics for uncovering specific evolutionary events, for example, inferring mode and strength of natural selection and identifying specific genomic regions undergoing selective pressure, beyond detecting differences in evolutionary histories from tree distributions.Our metrics provide the basis for a decision-theoretic statistical inference that can be constructed by finding the best-ranked genealogy that minimizes the expected error or loss function, which, in turn, is a function of the tree distance. Further, our proposed metric provides a tool for evaluating convergence and the mixing of MCMC procedures on ranked genealogies. We demonstrate the applicability of our metrics as an MCMC convergence diagnostic with influenza data.Our proposed distances inherit the properties of and norms, but other higher-order norms can also be applied to our matrix representation of ranked tree shapes. In general, the norm magnifies large differences more than the norm. Under the scenarios we investigated, no noticeable dissimilarities were observed between - and -based metrics in assessing different tree distributions, but -based metrics overall provided better two-dimensional MDS embeddings by the distortion measure (). For general scenarios beyond those we examined, while we expect both and ( and ) would perform similarly well in distinguishing tree distributions, rigorous theoretical investigations of their properties will be needed.We note that our matrix representation can be modified to a symmetric positive definite (SPD) matrix by replacing the upper matrix triangle with the transposed entries of the original lower triangular matrix. In this case, our new distance will be twice the original distance and will not change the observed properties. This view of an SPD matrix representation of ranked tree shapes will make it possible to explore additional distances (62) on ranked tree shapes and ranked genealogies.
Materials and Methods
Random Ranked Tree Shape Generation.
To generate random isochronous ranked tree shapes, we used the simulate_tree function in the R package apTreeshape (7) with the following parameters. In distinguishing different tree balance distributions, we simulated 1,000 ranked tree shapes with leaves for each value in with fixed . In distinguishing different internal node ranking distributions, we simulated 1,000 ranked tree shapes with leaves for each value in with fixed . All other parameters were kept at their default values. Heterochronous ranked tree shapes with different sampling events were generated using phylodyn::coalsim (63) in R.
Random Ranked Genealogy Generation.
For all our simulations, we assume neutral evolution. Hence, the distribution of branching or coalescence waiting times depends on the number of extant branches and is independent of the particular topology. In the isochronous coalescent model, when there are lineages, the coalescent time backward in time from tips to root (Fig. 2) has conditional densitywhere is the effective population size at time and . In the heterochronous coalescent, sampling times are deterministic and the rate of coalescence depends on the number of extant lineages and the effective population size . More details about coalescent models can be found in .We used phylodyn::coalsim (63) in R to generate random isochronous ranked genealogies with different branch length distributions and random heterochronous ranked genealogies with different sampling events and different branch length distributions.
Comparison with Other Metrics.
For comparing our metrics to adaptations of other tree distances, we used the Geodesic Treepath Problem (GTP) software implemented in Java (31) for the BHV distance, multiDist implemented in the R package treespace (10) for the KC distance, and the vecMultiDistUnlab in the R package treetop (14) for the CP distance.
Human Influenza A/H3N2 Virus Analysis.
We selected 122 complete HA sequences at random with overlapping collection dates between New York and Southeast Asia—from January 2002 to September 2005. The National Center for Biotechnology Information (NCBI) GenBank accession numbers for the 122 HA sequences from New York and Southeast Asia used in this article are provided in . We aligned the sequences using the FFT-NS-i option of Multiple Alignment using Fast Fourier Transform (MAFFT) v7.427 software (64, 65).We used BEAST v1.10.4 (16) to sample from the posterior distribution of model parameters with the following prior settings: Shapiro–Rambaut–Drummond codon-partition substitution model (66) to accommodate rate heterogeneity among sites, the uncorrelated log-normal relaxed molecular clock, and a time-aware Bayesian Skyride prior model (67) on evolutionary histories. For each geographical region, we ran three independent MCMC chains with 100 million steps, with 10% of burn-in, and thinned to obtain a total of 1,000 samples per region. We assessed convergences and effective sample sizes of the estimates using Tracer v1.7.1 (68) and with our distances.
Code and Data Availability.
An implementation of our metrics is available at https://github.com/JuliaPalacios/phylodyn. The NCBI GenBank accession numbers for the humaninfluenza A/H3N2 HA sequences are provided in .
Authors: Barbara Mühlemann; Terry C Jones; Peter de Barros Damgaard; Morten E Allentoft; Irina Shevnina; Andrey Logvin; Emma Usmanova; Irina P Panyushkina; Bazartseren Boldgiv; Tsevel Bazartseren; Kadicha Tashbaeva; Victor Merz; Nina Lau; Václav Smrčka; Dmitry Voyakin; Egor Kitov; Andrey Epimakhov; Dalia Pokutta; Magdolna Vicze; T Douglas Price; Vyacheslav Moiseyev; Anders J Hansen; Ludovic Orlando; Simon Rasmussen; Martin Sikora; Lasse Vinner; Albert D M E Osterhaus; Derek J Smith; Dieter Glebe; Ron A M Fouchier; Christian Drosten; Karl-Göran Sjögren; Kristian Kristiansen; Eske Willerslev Journal: Nature Date: 2018-05-09 Impact factor: 49.962
Authors: Gabriel E Leventhal; Roger Kouyos; Tanja Stadler; Viktor von Wyl; Sabine Yerly; Jürg Böni; Cristina Cellerai; Thomas Klimkait; Huldrych F Günthard; Sebastian Bonhoeffer Journal: PLoS Comput Biol Date: 2012-03-08 Impact factor: 4.475
Authors: J Voznica; A Zhukova; V Boskova; E Saulnier; F Lemoine; M Moslonka-Lefebvre; O Gascuel Journal: Nat Commun Date: 2022-07-06 Impact factor: 17.694