Literature DB >> 32109980

On the Matrix Condition of Phylogenetic Tree.

Dwueng-Chwuan Jhwueng¹, Brian C O'Meara².

Abstract

Phylogenetic comparative analyses use trees of evolutionary relationships between species to understand their evolution and ecology. A phylogenetic tree of n taxa can be algebraically transformed into an n by n squared symmetric phylogenetic covariance matrix C where each element c ij in C represents the affinity between extant species i and extant species j. This matrix C is used internally in several comparative methods: for example, it is often inverted to compute the likelihood of the data under a model. However, if the matrix is ill-conditioned (ie, if κ , defined by the ratio of the maximum eigenvalue of C to the minimum eigenvalue of C , is too high), this inversion may not be stable, and thus neither will be the calculation of the likelihood or parameter estimates that are based on optimizing the likelihood. We investigate this potential issue and propose several methods to attempt to remedy this issue.

Entities: Chemical Disease Gene Species

Keywords: Brownian motion; condition number; covariance matrix inversion; phylogenetic comparative analysis; phylogenetic tree

Year: 2020 PMID： 32109980 PMCID： PMC7019399 DOI： 10.1177/1176934320901721

Source DB: PubMed Journal: Evol Bioinform Online ISSN： 1176-9343 Impact factor: 1.625

Introduction

A main role for phylogenetic comparative studies is to test evolutionary hypotheses.[1] To conduct analyses, phylogenetic comparative studies use phylogenetic trees, which represent evolutionary relationships among, typically, various biological species. However, a relatively unexamined potential issue is the condition number of the phylogenetic covariance matrix. To do so, we use a compilation of empirical trees from the TreeBASE database[2-4] as well as of trees simulated in different ways[5] Given the hierarchical property of phylogenetic tree, is a positive definite matrix and the condition number of a phylogenetic covariance matrix is defined as the ratio of the maximum eigenvalue to the minimum eigenvalue of that matrix: where and and is a positive eigenvalue of that satisfies . The condition number is essentially a measure of how stable the matrix is for subsequent operations[6] A matrix with condition number much greater than 1 such as (ie, ) for a Hilbert matrix[7] is often said to be ill-conditioned. Matrices with small condition numbers are more stable matrices, whereas larger condition numbers are less stable. More stable matrices have less error in downstream algebraic operations, using that matrix (or its inverse) such as data multiplication, projection, linear model prediction, and even simulating data using that matrix. By contrast, large condition numbers mean these operations are unstable and more prone to error propagation. In other work,[8] we found evidence of ill-conditioned from some actual phylogenetic trees (though more commonly in phylogenetic networks), and sought to investigate this, and potential solutions, in more detail. This has also been explored by Adams and Collyer.[9] In phylogenetic comparative studies, the matrix from a given rooted ultrametric phylogenetic tree of n extant taxa has element in measured by the shared branch length between a pair of species on the tips of tree. For instance, a phylogenetic tree of 5 taxa shown in Figure 1 can be represented as a phylogenetic covariance matrix in equation (2). The maximum and minimum eigenvalues of are and , respectively. Hence, the condition number defined by equation (1) for the matrix of the tree in Figure 1 is .

Figure 1.

A phylogenetic tree of 5 taxa with tip labels and internal nodes labels . The root to tip tree height is 1. The corresponding phylogenetic covariance matrix is displayed in equation (2). An element in equation (2) is measured by the shared branch length of tip A and tip B, whereas another element can be measured by the shared branch lengths (0.36 + 0.34). Commonly in phylogenetic comparative analysis, the model used assumes traits evolve along the tree under Brownian motion (BM).[10] Phenotypic values on the tips of tree of n species are treated as a set of random variables. The joint distribution for random vector of n species is a multivariate normal distribution with common mean , and variance-covariance matrix . The statistical model is displayed in equation (3): The likelihood function given trait and tree with branch lengths is a multivariate normal, represented as where is the ancestral status at the root of the phylogeny, is the rate of evolution, is the determinant of and is a vector of 1s. In fact, the best solution for the likelihood function in trait model in equation (4) depends on the phylogenetic covariance matrix itself. For instance, the maximum likelihood estimators (MLEs) for the and in BM model in equation (3) are respectively, and they both depend on computing the inverse of (ie, ). There is an extensive literature of methods precisely trying to avoid the actual computation of this inverse , to gain speed and numerical stability. Starting with Felsenstein’s[11] pruning algorithm, there are many extensions in many contexts.[12-19] These approaches have been implemented in several popular R packages such as Diversitree,[14]phylolm,[19]Rphylopars,[20]MCMCglmm,[21]PhyloNetworks,[22]mvMORPH,[23] and PCMBase[24] where Felsenstein’s pruning algorithm is extended to support all other Gaussian models assuming independently evolving branches. Overall, using these kind of efficient pruning algorithms, the phylogenetic community indeed have come up with more robust and efficient solutions than the matrix inversion. However, not all models have pruning or other algorithms that can avoid matrix inversion; even when there is such an algorithm, not all software has implemented it. Manceau et al[25] developed models involving dependent co-evolution between traits on different branches in the tree; Jhwueng and O’Meara[26] developed model that allows species evolved on the phylogenetic network. Both works have the models that do not avoid inversion of the covariance matrix between all tips in the tree/network. We focus on studying the condition number of matrices on its own and explore the impact on subsequent analyses. To calculate the inverse of the phylogenetic covariance matrix , one can use a Moore-Penrose (MP) inverse that makes the algebra tractable.[27] However, the MP inverse fails to give the exact inverse when the is an ill-conditioned matrix (see supplemental material MPfail.pdf). On the contrary, the regular methods such as Cholesky decomposition implemented in R base package: function solve for solving the inverse of matrix returns an error when the condition number is large for most matrices, but becomes unstable at some even lower condition numbers. Considering that the ill-conditioned matrix problem may impact further analysis in the aspect of parameter estimation and statistical inference in phylogenetic comparative studies, we investigate 3 methods to appropriately adjust the phylogeny when it falls in the area where it is poorly conditioned. Our first goal is to search on a range of acceptable value of condition number of phylogeny with n extant species; our next goal is to find the best, well-conditioned estimate of observed phylogenetic covariance matrix, when the observed matrix is ill-conditioned. The ultimate goal is to provide the community with a series of tree transformations which help ameliorate the issues produced by ill-conditioned matrices. Note that these tree transformations and other potential solutions do not fix the problem—they involve a loss of data or a modification of the tree structure, such that the likelihood does not match what it would be if we were able to invert a matrix with absolute precision. However, they could result in answers closer to the true estimate than by ignoring this issue and continuing to invert very ill-conditioned matrices.

Data collection

For this analysis, we needed a set of empirical trees with branch lengths to understand the risk of this issue in practice. The R package datelife[28,29] stores a cache of the OpenTree chronogram,[2] pulled in and processed using tools from rotl[30] and phylotastic.[31] The trees represent trees from TreeBASE[3,4] as well as directly from many studies. The sizes of trees in the cache range from 6 taxa to 48 016 taxa; branch lengths were normalized so that the root to tip height was one for all trees. There were 3 trees from the same study of 4510 taxa[32] as well as one tree of 48 016 taxa[33] all other trees were 815 taxa or fewer. We excluded these 4 outlier trees for computational convenience. After excluding these trees, the median size of the remaining 126 trees is 72 taxa and the mean size is 140 taxa.

Preliminary analysis

The left panel of Figure 2 shows the condition numbers vs number of taxa for all 126 chronograms in the database. Overall, increases with number of taxa. The right panel in Figure 2 shows the average calculated from 50 simulations for each empirical tree, using TreeSim[5] where birth rate and death rate are estimated from the associated tree by R package ape function birthdeath[34] where some trees with multichotomies issue are resolved using function multi2di, simulating to have the same number of extant species as each empirical tree. Note that, even though the number of taxa are the same between the simulated and empirical trees, the simulated trees had worse (higher) on average.

Figure 2.

The 126 condition numbers vs number of taxa for trees from the literature (left panel) and for trees from simulation (right panel). A simple linear regression yields 2 line equations for the trees from literature and for the simulated trees. The interval for the regression line is shown in gray area. From the plots, the condition of real trees and simulated trees can be viewed via their matrix, and it becomes larger as number of taxa increases, but does so somewhat slowly. The simulated trees generally have higher (worse) condition than empirical trees. We next compared condition numbers vs number of taxa using trees from more extensive simulations. In Figure 3, there are 3 lines where each line represents the average of 100 runs of simulated phylogenies for different number of taxa. The purple line was obtained from random trees using coalescent trees method (created using R package: rcoal)[35] and the orange line was obtained from trees simulated by birth-death process with a given age on a fixed number of extant taxa[36] using birth rate where larger birth rates are used for larger taxa, death rate . The blue line is obtained from trees under a pure-birth process (birth , death ). The root to tip height was one for all trees.

Figure 3.

The condition number vs number of taxa for trees from simulation. Each dot in the lines represents the average of the value for 100 trees. The orange line is for birth-death model, the purple line is for coalescent model, and the blue line is for pure-birth model. In Figure 3, the coalescent trees (purple lines) have the highest values, whereas the birth-death trees (orange lines) overall have condition numbers slightly higher than the pure-birth trees (blue lines). A positive finding from this exploration is that all trees simulated in this ideal situation as well as the empirical trees from the literature do not fall into the numerical limit of approximately where LAPACK[37] considers the matrix singular and thus infeasible for analysis using Cholesky decomposition. However, there still exist trees with ill-conditioned matrices that could impact the subsequent analysis. To clarify this situation, consider the trees with 1 or 2 clades with short terminal branch lengths shown in Figure 4. The matrix condition for the tree in left panel is , whereas the matrix condition for the tree in the right panel is . Both trees have much larger matrix condition numbers than of the tree in Figure 1. They also far exceed the bound of that a Hilbert matrix of the same size would have which is known to be ill-conditioned.[7]

Figure 4.

Two cases of 5 taxa trees with a pair of short tips (taxa A and taxa B) (left panel) and with 2 clades of short tips (right panel).

Two cases of 5 taxa trees with a pair of short tips (taxa A and taxa B) (left panel) and with 2 clades of short tips (right panel). Consider a more extreme case (generated by setting a fairly short terminal branch ) where LAPACK considers the corresponding matrix singular (see supplemental material solvefail.pdf). Figure 5 shows the proportion of unsolvable matrices vs their condition numbers for 3 types of commonly used trees.

Figure 5.

Three types of trees (birth-death tree, coalescent tree, and pure-birth tree) of taxa size 50, 100, 500, and 800 are used. For each taxa size, 100 trees were simulated under each type of tree and are equipped with a short terminal branches ranging from −19 to −8 (in scale). The condition numbers of their matrices as well as their inverse (if numerically feasible) by LAPACK are computed. The horizontal axis shows the proportion (using scale of power of 100 of the raw proportion) of unsolvable matrices over 100 trees vs their condition numbers shown in the vertical axis. These type of trees could be unstable statistically when their matrix has no exact inverse by LAPACK, and the subsequent phylogenetic comparative analyses which use the inverse of [8,38] could be unstable. This does not just mean that the likelihood would be impossible to calculate, throwing an annoying but at least transparent error. At better but still bad condition, there would be a finite number returned for the likelihood but it is unstable: a slight change in a parameter value in the model or to a branch length of the tree could result in a very different likelihood value due to accumulated errors in the matrix inversion, not rugosity of the true likelihood surface. A researcher would get a likelihood and parameter estimates back rather than an error, but these numbers are not accurate (see MLE estimator for in BM model in Figure 14 in supplemental material). These consequences would be expected to be more acute for multivariate data, though that remains an area for further investigation. These issues would affect any method that uses calculation of likelihood: both analyses that use likelihood to optimize parameter values or Bayesian approaches that sample a space using information from priors and likelihoods. The empirical analyses above suggest that, at least for most trees biologists have encountered, there are reliable matrices (though transformations of these matrices from models to make variance-covariance matrices that are then inverted can still have issues, which remain to be investigated). However, there be dragons here: there are trees with poor matrix condition that could affect PCM calculations, even the simple trees of Figure 4. We investigate the potential impact on the statistical stability of the BM model[10] in equation (3) when the associated phylogenetic tree has ill-conditioned covariance matrix .

Methods

One way of dealing with ill-conditioned matrices is just to reject matrices, , that are poorly conditioned. However, the true estimate of the matrix (from the tree and/or from the transformation of the matrix from a comparative methods model) can fall in that region, and users would not be happy to hear that their hard-earned tree cannot be analyzed. That is far better than quietly returning a wrong result, but still far from ideal. We propose several possible approaches to remediate the issue of an ill-conditioned matrix from the tree. Our goal is to use some of the following methods to estimate the best version of the observed phylogenetic covariance matrix. These all involve modifying from what it should be based on the tree and model, which of course is not ideal, but given that the goal is to use this to compute estimates (such as the rate of BM), we will investigate whether the resulting estimates are better from a modified matrix than from the original ill-conditioned matrix. Three approaches are examined: (1) shrinkage matrix regularization: lengthen the tip lengths with respect to the tree, (2) pruning tips of the tree: removing tips from the tree, and (3) lengths stretching: lengthen/shorten all branch lengths of the tree.

Shrinkage matrix regularization

An approach by Schafer and Strimmer[39] was developed for regularizing covariance matrices in molecular biology (including some network covariance matrices), later improved and generalized by Theiler.[40] Let and , define the shrinkage matrix estimator of by where . Let be an estimate of the mean of the Mahalanobis distance by recognizing that is generating from a Gaussian distribution with covariance (ie, ). Theiler[40] showed that the negative log likelihood function based on the mean Mahalanobis distance approximation for the shrinkage estimated covariance matrix as a function of the shrinkage parameter is The best shrinkage estimate is to search the optima and the matrix is updated variance-covariance matrix for the next step of the analysis. Although the shrinkage matrix used here is mathematical equivalent (differs up to a constant multiplier ) to the very broadly used Pagel’s lambda[41,42] transformation where and is an identity matrix, the parameters and have different meaning. While is estimated through the BM likelihood and is used for testing phylogenetic signal, is estimated through likelihood function based on the mean Mahalanobis distance approximation for the shrinkage estimated covariance matrix. Figure 6 compares the raw, untransformed tree with transformed tree computed by taking the shrinkage matrix and converting back into a tree using the unweighted pair group method of arithmetic mean[43] by R package: upgma.[44] The transformed tree has much longer tip branch lengths relative to internal branch lengths relative to the untransformed tree.

Figure 6.

Tree transformation under shrinkage matrix regularization method. A simulated birth-death tree of 100 taxa is shown in the left panel. The phylogenetic covariance matrix of the shrunk tree is obtained by setting shrinkage parameter to such that where is the phylogenetic covariance matrix of the simulated tree. The shrunk tree is shown in the right panel and compared with the untransformed tree. Both trees are plotted with the same taxon order. Using 5 different numbers of taxa , the average of the shrinkage estimator across 100 replicate trees are (, respectively). It appears that the magnitude of required shrinkage decreases with the number of taxa.

Pruning tips of the tree

We further investigate what other factors, whether a property of the tree or parameters of the simulations used to generate it (which then, of course, results in trees of particular distribution of branch lengths), affects the condition of the tree. We consider the tree properties that could affect the condition of the tree as follows: (1) taxa size n, (2) min branch length , (3) max branch length, (4) ratio between max branch length and min branch length, (5) variance of branch length, (6) median of branch lengths, (7) min tip length, (8) max tip length, (9) min internal branch length, (10) max internal branch length, (11) generating birth rate , (12) generating death rate , and (13) turnover rate . We simulated birth-death trees under uniformly varying number of taxa between size of 10 and 800, birth rate between and , death rate between 0 and where larger tree are simulated with higher value of . Note that one can simulate from uniform distribution with upper limit (ie, . In this case and have dependency. Their condition numbers are calculated and compared with a variety of measures using multiple linear regression. The R package: MuMIn[45] was used to generate a set of models to correlate with combination of parameters or tree measures in the global model of 13 predictors. The maximum number of variables is set to 4 accounting for 1093 models (intercept model: model, 1 predictor: models, 2 predictors: models, 3 predictors: models, 4 predictors: models where is the number of combination that selects r distinct items from a collection of n items). Table 1 shows the regression estimates for the covariates for the top 5 models that accounts for majority of weights.

Table 1.

Regression estimates, log likelihood, AICc, , and Akaike weights for the top 5 multiple linear regression models (M1-M5) out of the 1093 models.

Model	M1	M2	M3	M4	M5
min tip ι	−0.19	−0.19	−0.24	−0.24	−0.24
death rate μ	17.77	17.77	16.05	16.06	16.05
birth rate λ					10.45
brlen median	−0.02	−0.02
max brlen	3.6e−5		3.8e−5		3.8e−5
max internal		3.6e−5		3.8e−5
Ntip			1.2e−3	1.2e−3
logLik	−11 879.64	−11 879.81	−11 882.73	−11 882.99	−11 883.92
AICc	23 771.29	23 771.62	23 777.48	23 777.99	23 779.84
ΔAICc	0.00	0.33	6.19	6.70	8.55
weight	0.51	0.43	0.02	0.02	0.01

Abbreviation: AIC, Akaike information criterion.

Regression estimates, log likelihood, AICc, , and Akaike weights for the top 5 multiple linear regression models (M1-M5) out of the 1093 models. Abbreviation: AIC, Akaike information criterion. From Table 1, we have several comparisons: (1) condition number vs min tip length (ie, vs ), (2) condition number vs death rate (ie, vs ), and (3) condition number vs birth rate (ie, vs ) are interesting. For (2) and (3), it is known that the expected waiting time to the next event of birth-death model is exponential with parameter (so the regression estimates for and for are both positive values), it remains to see how the birth and death parameters affect the branch length (and hence matrix). For (1), having a larger minimum tip branch length seems to lead to better (lower) , whereas smaller minimum tip may yield worse (higher) . This is consistent with Figure 4 above, where short tip lengths seemed to lead to bad matrix condition. Here, we focus on exploring the relation between the condition number and the minimum tip length. Given that smaller trees tend to have better condition than larger ones, it could be that removing a taxon at random would improve matrix condition. However, another possibility is that it is just the presence of very small terminal branches: these are less likely on a smaller tree, but we could just try to remove the short branches directly by removing one of the taxa with the shortest terminal branch. Taxon removal does tend to help matrix condition, but removing the taxon with the shortest branch helps far more in Figure 7. This points to a potential solution: dropping one of the tips with the shortest branch length as a quick run suggests a much bigger improvement than dropping a tip at random. mvMORPH[23] and PCMBase[24] allow pruning tips with tiny or zero lengths, allowing this to be done easily by users.

Figure 7.

The condition numbers for dropping the shortest tip vs the condition numbers for the raw tree. 100 birth-death trees are generated with random size ranging from 100 to 800 with birth rate and death rate . The horizontal axis is the condition number for raw trees, whereas the condition numbers for the removing a tip tree are shown in the vertical axis. The vertical lines connect points corresponding to that given tree, but with one taxon removed: the blue square is removing a taxon at random, and the red dot removing the taxon with the shortest branch tip length. We found that, in fact, the problem of having ill-conditioned matrix is highly related to terminal branch lengths of taxon. To illustrate this issue, a simple example of 2 taxa is shown in Figure 8 (right panel). The corresponding phylogenetic matrix is

Figure 8.

A 3-taxa tree (left panel) and a 2-taxa tree (right panel) for illustration.

Note that has 2 eigenvalues and . The condition number of defined by the ratio of the largest eigenvalues to the smallest eigenvalues: where is the big O notation that describes the limit behavior of a function. When tree has tiny tips (very small ), the value of will be fairly large and the matrix is more likely to be ill-conditioned. For instance, with the condition number , whereas with , . The problem becomes serious as is very close to zero where matrix has 2 almost identical columns/rows which makes a singular matrix with . In general, for a tree with a clade of very short tips relatively to the tree height from the root to the tips, the corresponding matrix is more ill-conditioned. A 3-taxa tree (left panel) and a 2-taxa tree (right panel) for illustration. Moreover, the phylogenetic covariance matrix has a nested structure and possess some special matrix properties. Ané[46] determined eigenvalues of the covariance matrix for symmetric trees for the purpose of studying the behavior of the estimator. Here, we show that the shortest tip of a ultrametric tree is equal to the smallest eigenvalue of the matrix. We start using a 3-taxa example in Figure 8 (left panel) where the tip lengths for species and Z are b, b and , respectively. The shortest terminal branch length is b. The phylogenetic matrix for the tree in Figure 8 is shown in equation (8). Let be the characteristic polynomial of where is a 3 by 3 identity matrix. As the roots of are the eigenvalues of , solving and simplifying yield . As are both positive numbers, the smallest eigenvalue for is b which is the shortest tip length on the tree in Figure 8 left panel. For a general case, a property of ultrametric tree is provided in Lemma 1.

Lemma 1

The shortest tip length of an ultrametric phylogenetic tree is the smallest eigenvalue of , ie, where b is the smallest tip length and I is an n by n identity matrix. The general proof in Lemma 1 shows up the pruning approach from a theoretical perspective. Researchers may object to losing a taxon in their analysis: getting data for a species to put it on a tree and include trait information may have entailed a significant effort. They may also worry about biased estimates that come from such pruning. Below, we show that such pruning does improve matrix condition substantially and later show that this can provide reliable parameter estimates with sufficient remaining taxa. The following lemma shows that the new tree obtained from dropping the shortest tip of the original tree has a better (lower) .

Lemma 2

Let be the n by n strictly ultrametric matrix from the tree and be the condition number of . Let be the by matrix obtained by dropping the shortest tip from the tree and be the condition number of . Then, .

Remark

Above 2 lemmas have a link with the result obtained in Ané[46] where the whole spectrum of the matrix is derived for the special case of a symmetric tree and has been extended in Ho and Ané[47] for an Ornstein-Uhlenbeck (OU) model.[48] It would be interesting to explore whole spectrum for arbitrary ultrametric tree for generalization, but that remains as future work.

Length stretching

Another possible solution is adopting the method in Jhwueng[38,49] which stretches the branch lengths of the raw tree without changing its topology. For an ultrametric tree, let be the tree height from the root to the tip. Without loss of generality, is scaled into a unit and is decomposed into d components. That is, where represents the length between the and speciation events. For instance, is the length from the root to the first speciation event since the root and is the minimum tip length for the species evolved from its most recent common ancestor. Next, consider the matrix obtained from the raw tree. Let the -tuple elements be the distinct entries in satisfying . The relation between and can be represented as . The following lemma describes the relationship between and .

Lemma 3

The set where is the length between the m and the m speciation event has d elements if and only if the number of distinct elements in of an ultrametric tree is . For example, in Figure 8 (left panel), if setting and , then the matrix is and has 3 distinct elements , and . The 2 lengths and can be determined accordingly by , and . To stretch the lengths but retain the topology of the raw tree, as and , we can treat as a d-dimensional random vector from a Dirichlet distribution. Then, can be generated by first drawing d independent gamma random variables, each with different shape parameters and rate parameter 1 where m is an arbitrary but positive constant. Then, the d-tuple vector is a Dirichlet random vector with , , and concentration parameters . Here, the positive constant m is an arbitrary scaling variable that always preserves the correct mean. By the property of Dirichlet distribution, we have , and the mode is given by . The choice of m is thus determined by . A positive integer m is chosen to satisfy where returns the least integer greater than or equal to a. The choice of m here is designed to be the minimal needed to prevent the phylogenetic tree from varying too wildly from the given one while still adequately testing robustness.[49] Figure 9 shows the phylogeny and their condition number for a raw tree and its transformed trees.

Figure 9.

The raw tree with is shown in upper left panel, the shortest tip pruned tree with is shown in upper right panel, the shrunk tree with is shown in lower left panel, and the stretched tree with is shown in lower right tree. We implemented the shrinkage method, length stretching method and pruning tips (drop shortest tip) method to transform trees. We simulated 200 birth-death trees of taxa size between 300 and 1000 using R package: TreeSim[36] with speciation rate and extinction rate . The condition numbers of the phylogenetic covariance matrix for the raw trees and transformed trees are calculated and are plotted in Figure 10. Among the 3 methods, shrinkage method and pruning tip (drop the shortest tip) method help to reduce the condition number. As seen in Figure 10 (left most panel), the shrinkage method provides a large amount of reduction in condition numbers for all trees. The condition number of raw tree ranges from 3 to 8, whereas all matrices of shrunk trees have condition number of value less than 4 (in scale). The pruning tip method, shown in Figure 10 (right most panel), is similar to Figure 7 and contributed to a smaller condition numbers for all trees. The length stretching method, shown in Figure 10 (middle panel), in general does not improve the condition number when comparing with the raw method without transformation.

Figure 10.

Comparison of condition number between the raw tree and transformed tree. The horizontal axis as well as the vertical axis shows the range of condition number in scale of the simulated trees. The diagonal dashed line is 1:1 relationship to demonstrate the pattern. Scatter plot of condition number of the simulated trees vs the condition number for the transformed trees is shown in each panel.

Assessment of the Methods

There is a metaphor for parameter estimation that is about searching a lost key in a region of street with no light. A man who lost his keys in the night. A friend came across the man searching under a streetlight, and asked, where did you lose them? Down there, the man said, but there’s no light there, so we are looking for them here. In our case, the tree is transformed in some way (taxa deleted, branches stretched) which lets parameter search work more easily—but have we moved so far away from where the parameters are that the ease of search does not make for better results? To address this, we examined whether estimates of rate of evolution the root state under BM are better with transformed trees than on the original tree. There are 2 aspects to this: are these estimates calculable at all, and, if they are, are the estimates good? Performance is assessed by examining normalized root-mean-square deviation (RMSD) defined by where are the true parameters and are the MLEs. As the study of interest is the impact from the ill-conditioned tree, trees are simulated with more ill-conditioned matrix where the Cholesky decomposition by R package solve fails to find the inverse for some trees. One case of an ill-conditioned tree can be built by adding up tiny lengths to all tips right after the tips of tree are trimmed to the most common ancestor of the shortest tips. This can be done by adding a relative small number to the diagonal of the matrix after the diagonal elements are replaced with the the second largest elements in the (the height from the root to the most recent common ancestor of the shortest tips). Ultrametric trees are simulated from TreeSim[5] with speciation rate and extinction rate where is a uniform distribution. Trait of size is simulated using 2 set of true parameters and given a tree. For each tree, we simulated 50 traits and repeated this procedure 700 times for each combination of parameters. Among 700 trees, 345 trees have their matrices invertible by the Cholesky decomposition (ie, solve returns with no error). The other 355 trees are transformed under the 3 tree transformation methods. For the purpose of comparison of the 3 methods and the raw method without transformation, traits simulated under BM model from raw trees are fixed and used for parameter estimation across transformed trees. Each tree can be in a region where the matrix is sufficiently well-conditioned to be used with solve or in an area where it would fail. There are 2 analyses done: Solve works: Estimates the parameters on just the well-conditioned trees using the raw tree and after the various transformations. This evaluates whether, for trees that are already somewhat feasible, does transformation still help even though the raw tree could be used and which transformation works best. Solve fails: Estimates the parameters on just the poorly conditioned trees (where the raw tree is not feasible) after transformations. This evaluates which transformation performs best in the hard cases. If the best approach in the first case is to just use the raw tree, then a good protocol is to use the raw tree when possible and use a transformation if needed. If one of the transformations works best in both the first and second cases, then it is best to use that transformation in general, at least for the trees in this tricky but sometimes solvable region. Results for comparison of trees with size 100 and 500 are reported in Figure 11 for and in Figure 12 for .

Figure 11.

Figure 12.

Evaluation of performance of parameter estimation under trees transformations for rate parameter . The box plots under each panel compare the RMSD values under each tree transformation method. The labels are the same as in Figure 11. The box plots are reported in scale. Among those transformations, the shrink method has significantly larger RMSD than other methods in both of the is solvable (sol) and unsolvable (unsol).

Evaluation of performance of parameter estimation under trees transformation for ancestral status parameter . The left panel is for true parameter and the right panel is for with tree size (upper panel) and (lower panel). The box plots of RMSD of defined in equation (9) under each method are shown in each panel. The labels of the horizontal axis starting from the left to the right are droptip: prune the shortest tip, prun: the pruning algorithm,[11] raw: untransformed tree, stretch: length stretched tree, shrink: shrinkage matrix regularization method and lambda: Pagel’s lambda method. Each label contains 2 groups: sol and unsol where sol represents the estimates of parameters on well-conditioned trees using raw tree and after various transformation, and unsol represents the group of estimates of the parameters on just the poorly conditioned trees after various transformations are reported. Because raw trees cannot be evaluated when their matrices fall in ill-condition, only the box plot for raw tree is reported when the matrix is invertible under Cholesky method. Graphs are plotted in scale. Evaluation of performance of parameter estimation under trees transformations for rate parameter . The box plots under each panel compare the RMSD values under each tree transformation method. The labels are the same as in Figure 11. The box plots are reported in scale. Among those transformations, the shrink method has significantly larger RMSD than other methods in both of the is solvable (sol) and unsolvable (unsol). In Figure 11, comparison of using various tree transformation methods to search MLEs for in BM model shows that those methods returns consistent parameter estimates for . The medians of the values are around 0.1 or lower across all methods. For comparison of methods using RMSD of shown in Figure 12, the pruning tip method, raw method, length stretching method, and lambda method perform well. However, the shrinkage method has significantly larger RMSDs in both sol and unsol cases. From earlier work, it suggests that the tip length of shrunk trees after tree transformation has relationship with the proportion of tree height. For instance, the average of the shrinkage estimator across 100 replicates for birth-death tree of 100 taxa and 800 taxa is . Consider a more extreme case: a star tree obtained from the shrinkage method with is used for parameter estimation, then is a diagonal matrix. Then, the trait is indeed analyzed under independent normal distribution. The bias of parameters in this case can be seen from theoretical approach. Without loss of generality, let be an identity matrix and let trait data be simulated under BM model (ie, . Then, under the shrinkage method with , is analyzed under the i.i.d. normal distribution with MLEs and . Given true parameters , , the is bounded by . And if the root to tip tree height is 1 (ie, for all ), then has a natural upper bound (see Lemma 4 in supplemental material). However, the RMSD for has a nontrivial lower bound. For mathematical convenience, is used instead to show that a lower bound is (see Lemma 5 in supplemental material). However, tips of the tree may overly lengthen by the shrinkage method so the parameter cannot be estimated well. To explore the utility of the shrinkage method, a simulation comparing the RMSD under different and taxa size n is shown in Figure 13 which suggests that RMSD of in general increases with . The model parameter can be estimated better with smaller . If the goal is to obtain a reliable estimate of , then a relatively small number of shall be considered to attain a better estimate for with lower RMSD value. Therefore, users may manually set up the shrinkage parameter as small as possible to obtain reliable estimate when the transformed tree has invertible matrix.

Figure 13.

Evaluation of the shrunk tree of taxa sizes 100 and 500 with different shrinkage parameter values . For each taxa size, RMSDs are computed using 300 birth-death trees where for each tree 50 traits are simulated under Brownian motion model. The 2 plots in upper panel investigate the RMSD for and when the inverse of (sol C) can be computed directly, whereas the other 2 plots in lower panel investigate the RMSD for and for the case where the matrix is ill-conditioned (unsol C). Overall, the 4 panels suggest a relatively small shrinkage value may be used to achieve a better estimate for . To sum up, although the tree transformation methods proposed here do not provide significantly better improvement than the raw tree, they are still reliable options for users to choose when the matrix of tree is ill-conditioned. Moreover, even when the matrix is fairly well-conditioned, applying transformations does not have a substantially worse effect. One suggestion is that when the phylogenetic covariance matrix is solvable, it may be the best to use the raw tree. Meanwhile, when is so poorly conditioned that cannot be inverted exactly by Cholesky decomposition, tree transformation methods provide alternative options and can give reliable estimates for most cases.

Conclusions

In this article, we explore the condition number of the phylogenetic covariance matrix transformed from a phylogenetic tree. We found with fairly short terminal branch (eg, a tip with length of or smaller for tree of 100 taxa) the phylogenetic matrix fails to give the exact inverse by the Cholesky decomposition method by current software. The failure of returning exact inverse of also depends on the number of taxa and range of condition numbers . We proposed 3 methods (shrinkage matrix regularization, pruning the tips of tree, and length stretching) to alleviate the ill-conditioned matrix issue arising from the phylogenetic tree and obtain a well-conditioned estimate for matrix. Simulations here are similar in spirit to what is performed in Figure 5 in the work by Adams and Collyer[9] where the condition number of phylogenetic covariance matrices at different level of sample size was shown. Their work was interested in the condition number relative to type I error of comparative methods, whereas our examinations use different aspects of the effect of tree condition. Another common approach is to add a small constant to the eigenvalues of the phylogenetic covariance matrix and reestimate the phylogenetic covariance matrix from the eigenvectors and adjusted eigenvalues. However, this method would alter the topology of the phylogenies, so we do not consider to implement it here. Current R software packages that implement Felsenstein’s pruning algorithm work effectively. For example, the R package mvMORPH can compute the square root of the phylogenetic covariance matrix and its determinant with arbitrary small tips, whereas the R package PCMBase implements the pruning algorithm to return the likelihood even with zero-length branches (see supplemental material pmmfelprunzerobranch.pdf). Note that the analysis of parameter estimation under the maximum likelihood estimation for the univariate BM model used in this work is performed without implementing measurement error. When assuming that each species in the tree has a measurement error variance, technically, this is also equivalent to extending each branch in the tree by a constant. Then, the variance-covariance matrix that includes measurement error and an ill-conditioned matrix could have better matrix condition, especially in our cases where the poor matrix condition typically came from near zero-length tips. Measurement error is also included as part of the model for the phylogenetic mixed model.[50] However, measurement error can lead to bias in the inferred parameters.[51] Also, some software with hundreds of citations, such as OUCH[52] and SURFACE,[53] does not yet allow for incorporation of measurement error, rendering this potential solution impossible. As we focus on studying the condition number of matrices on its own and explore the impact on subsequent analyses, we do not include the measurement error in this study. There are a couple of extensions from this work. One possible work is to look at the multirate evolutionary BM model[54] where the rate of phenotypic evolution is assumed to change throughout history of life. As the rate parameters are embedded in the phylogenetic covariance matrix, the condition number of the phylogenetic covariance matrix hence depends on both the tree and rate parameters. Another possible extension from this work is to look at the condition number of the phylogenetic variance-covariance matrix for more general trait models. For instance, if one assumes OU process for trait evolution,[48] then the phylogenetic variance-covariance matrix is where is the constraining force, is the rate of evolution, is the time that separating species i and j, and is the time that species i and j shared a common ancestor. The statistical model for the OU process trait evolution is . In particular, when , is the BM model. Larger constraining force parameter in OU model has the effect of lengthening the tip relative to the internal edges and may help the ill-conditioned matrix issue. There are more complex models developed under different assumptions based on this OU process, for instance, the multiforce, multioptimum, and multirate model in the work by Beaulieu et al[55] has fairly complicated phylogenetic variance-covariance matrix . Its poor performance with especially complex models could stem from issues with matrix condition, though this needs to be investigated. In addition, Bastide et al[12] and Jhwueng and O’Meara[8] developed a phylogenetic comparative method (PCM) using phylogenetic networks, rather than trees, it would be interesting to further explore the matrix condition in this case. We hope that our tree transformation methods provide the community options to ameliorate the issues produced by ill-conditioned . All analysis and simulations are done by computer Mac Pro: macOS Mojave (Late 2013), processor: 3.7 GHz Quad-Core Intel Xeon E5, RAM:12 GB 1866 MHz DDR3. The R scripts and relevant files to generate results can be accessed at https://tonyjhwueng.info/KappaPCM. Click here for additional data file. Supplemental material, supplemental_material for On the Matrix Condition of Phylogenetic Tree by Dwueng-Chwuan Jhwueng and Brian C O’Meara in Evolutionary Bioinformatics

31 in total

1. Inferring the historical patterns of biological evolution.

Authors: M Pagel
Journal: Nature Date: 1999-10-28 Impact factor: 49.962

2. Phylogenetic analysis and comparative data: a test and review of evidence.

Authors: R P Freckleton; P H Harvey; M Pagel
Journal: Am Nat Date: 2002-12 Impact factor: 3.926

3. On incomplete sampling under birth-death models and connections to the sampling-based coalescent.

Authors: Tanja Stadler
Journal: J Theor Biol Date: 2009-07-23 Impact factor: 2.691

4. A Unifying Comparative Phylogenetic Framework Including Traits Coevolving Across Interacting Lineages.

Authors: Marc Manceau; Amaury Lambert; Héléne Morlon
Journal: Syst Biol Date: 2017-07-01 Impact factor: 15.683

5. Monte Carlo algorithms for Brownian phylogenetic models.

Authors: Benjamin Horvilleur; Nicolas Lartillot
Journal: Bioinformatics Date: 2014-07-22 Impact factor: 6.937

6. A linear-time algorithm for Gaussian and non-Gaussian trait evolution models.

Authors: Lam si Tung Ho; Cécile Ané
Journal: Syst Biol Date: 2014-02-04 Impact factor: 15.683

7. PhyloNetworks: A Package for Phylogenetic Networks.

Authors: Claudia Solís-Lemus; Paul Bastide; Cécile Ané
Journal: Mol Biol Evol Date: 2017-12-01 Impact factor: 16.240

8. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution.

Authors: Marguerite A Butler; Aaron A King
Journal: Am Nat Date: 2004-12 Impact factor: 3.926

9. Unifying the spatial epidemiology and molecular evolution of emerging epidemics.

Authors: Oliver G Pybus; Marc A Suchard; Philippe Lemey; Flavien J Bernardin; Andrew Rambaut; Forrest W Crawford; Rebecca R Gray; Nimalan Arinaminpathy; Susan L Stramer; Michael P Busch; Eric L Delwart
Journal: Proc Natl Acad Sci U S A Date: 2012-08-27 Impact factor: 11.205

10. Assessing the Goodness of Fit of Phylogenetic Comparative Methods: A Meta-Analysis and Simulation Study.

Authors: Dwueng-Chwuan Jhwueng
Journal: PLoS One Date: 2013-06-27 Impact factor: 3.240