Dwueng-Chwuan Jhwueng1, Brian C O'Meara2. 1. Department of Statistics, Feng Chia University, Taichung, Taiwan R.O.C. 2. Department of Ecology and Evolutionary Biology, The University of Tennessee, Knoxville, Knoxville, TN, USA.
Abstract
Phylogenetic comparative analyses use trees of evolutionary relationships between species to understand their evolution and ecology. A phylogenetic tree of n taxa can be algebraically transformed into an n by n squared symmetric phylogenetic covariance matrix C where each element c ij in C represents the affinity between extant species i and extant species j. This matrix C is used internally in several comparative methods: for example, it is often inverted to compute the likelihood of the data under a model. However, if the matrix is ill-conditioned (ie, if κ , defined by the ratio of the maximum eigenvalue of C to the minimum eigenvalue of C , is too high), this inversion may not be stable, and thus neither will be the calculation of the likelihood or parameter estimates that are based on optimizing the likelihood. We investigate this potential issue and propose several methods to attempt to remedy this issue.
Phylogenetic comparative analyses use trees of evolutionary relationships between species to understand their evolution and ecology. A phylogenetic tree of n taxa can be algebraically transformed into an n by n squared symmetric phylogenetic covariance matrix C where each element c ij in C represents the affinity between extant species i and extant species j. This matrix C is used internally in several comparative methods: for example, it is often inverted to compute the likelihood of the data under a model. However, if the matrix is ill-conditioned (ie, if κ , defined by the ratio of the maximum eigenvalue of C to the minimum eigenvalue of C , is too high), this inversion may not be stable, and thus neither will be the calculation of the likelihood or parameter estimates that are based on optimizing the likelihood. We investigate this potential issue and propose several methods to attempt to remedy this issue.
A main role for phylogenetic comparative studies is to test evolutionary hypotheses.[1] To conduct analyses, phylogenetic comparative studies use phylogenetic trees,
which represent evolutionary relationships among, typically, various biological
species. However, a relatively unexamined potential issue is the condition number of
the phylogenetic covariance matrix. To do so, we use a compilation of empirical
trees from the TreeBASE database[2-4] as well as of trees simulated in
different ways[5] Given the hierarchical property of phylogenetic tree, is a positive definite matrix and the condition number of a phylogenetic covariance matrix is defined as the ratio of the maximum eigenvalue to the minimum eigenvalue
of that matrix:where and and is a positive eigenvalue of that satisfies .The condition number is essentially a measure of how stable the matrix is for
subsequent operations[6] A matrix with condition number much greater than 1 such as (ie, ) for a Hilbert matrix[7] is often said to be ill-conditioned. Matrices with small condition numbers
are more stable matrices, whereas larger condition numbers are less stable. More
stable matrices have less error in downstream algebraic operations, using that
matrix (or its inverse) such as data multiplication, projection, linear model
prediction, and even simulating data using that matrix. By contrast, large condition
numbers mean these operations are unstable and more prone to error propagation. In
other work,[8] we found evidence of ill-conditioned from some actual phylogenetic trees (though more commonly in
phylogenetic networks), and sought to investigate this, and potential solutions, in
more detail. This has also been explored by Adams and Collyer.[9]In phylogenetic comparative studies, the matrix from a given rooted ultrametric phylogenetic tree of
n extant taxa has element in measured by the shared branch length between a pair of species on the tips
of tree. For instance, a phylogenetic tree of 5 taxa shown in Figure 1 can be represented as a phylogenetic
covariance matrix in equation (2). The maximum and minimum eigenvalues of are and , respectively. Hence, the condition number defined by equation (1) for the matrix of the tree in Figure 1 is .
Figure 1.
A phylogenetic tree of 5 taxa with tip labels and internal nodes labels . The root to tip tree height is 1. The corresponding
phylogenetic covariance matrix is displayed in equation (2). An element
in equation (2) is measured by
the shared branch length of tip A and tip B, whereas another element
can be measured by the shared branch lengths
(0.36 + 0.34).
A phylogenetic tree of 5 taxa with tip labels and internal nodes labels . The root to tip tree height is 1. The corresponding
phylogenetic covariance matrix is displayed in equation (2). An element
in equation (2) is measured by
the shared branch length of tip A and tip B, whereas another element
can be measured by the shared branch lengths
(0.36 + 0.34).Commonly in phylogenetic comparative analysis, the model used assumes traits evolve
along the tree under Brownian motion (BM).[10] Phenotypic values on the tips of tree of n species are treated as a
set of random variables. The joint distribution for random vector of n species is a multivariate normal
distribution with common mean , and variance-covariance matrix . The statistical model is displayed in equation
(3):The likelihood function given trait and tree with branch lengths is a multivariate normal, represented aswhere is the ancestral status at the root of the phylogeny,
is the rate of evolution, is the determinant of and is a vector of 1s. In fact, the best solution for the likelihood
function in trait model in equation (4) depends on the
phylogenetic covariance matrix itself. For instance, the maximum likelihood estimators (MLEs) for the
and in BM model in equation (3) arerespectively, and they both depend on computing the inverse of (ie, ).There is an extensive literature of methods precisely trying to avoid the actual
computation of this inverse , to gain speed and numerical stability. Starting with Felsenstein’s[11] pruning algorithm, there are many extensions in many contexts.[12-19] These approaches have been
implemented in several popular R packages such as Diversitree,[14]phylolm,[19]Rphylopars,[20]MCMCglmm,[21]PhyloNetworks,[22]mvMORPH,[23] and PCMBase[24] where Felsenstein’s pruning algorithm is extended to support all other
Gaussian models assuming independently evolving branches. Overall, using these kind
of efficient pruning algorithms, the phylogenetic community indeed have come up with
more robust and efficient solutions than the matrix inversion. However, not all
models have pruning or other algorithms that can avoid matrix inversion; even when
there is such an algorithm, not all software has implemented it. Manceau et al[25] developed models involving dependent co-evolution between traits on different
branches in the tree; Jhwueng and O’Meara[26] developed model that allows species evolved on the phylogenetic network. Both
works have the models that do not avoid inversion of the covariance matrix between
all tips in the tree/network.We focus on studying the condition number of matrices on its own and explore the impact on subsequent analyses.
To calculate the inverse of the phylogenetic covariance matrix , one can use a Moore-Penrose (MP) inverse that makes the algebra tractable.[27] However, the MP inverse fails to give the exact inverse when the
is an ill-conditioned matrix (see supplemental material MPfail.pdf). On the contrary, the regular
methods such as Cholesky decomposition implemented in R base
package: function solve for solving the inverse of matrix
returns an error when the condition number is large for most matrices, but becomes unstable at some even lower
condition numbers.Considering that the ill-conditioned matrix problem may impact further analysis in
the aspect of parameter estimation and statistical inference in phylogenetic
comparative studies, we investigate 3 methods to appropriately adjust the phylogeny
when it falls in the area where it is poorly conditioned. Our first goal is to
search on a range of acceptable value of condition number of phylogeny with
n extant species; our next goal is to find the best,
well-conditioned estimate of observed phylogenetic covariance matrix, when the
observed matrix is ill-conditioned. The ultimate goal is to provide
the community with a series of tree transformations which help ameliorate the issues
produced by ill-conditioned matrices. Note that these tree transformations and other potential solutions
do not fix the problem—they involve a loss of data or a modification of the tree
structure, such that the likelihood does not match what it would be if we were able
to invert a matrix with absolute precision. However, they could result in answers
closer to the true estimate than by ignoring this issue and continuing to invert
very ill-conditioned matrices.
Data collection
For this analysis, we needed a set of empirical trees with branch lengths to
understand the risk of this issue in practice. The R package
datelife[28,29] stores a cache of the
OpenTree chronogram,[2] pulled in and processed using tools from rotl[30] and phylotastic.[31] The trees represent trees from TreeBASE[3,4] as well as directly from
many studies. The sizes of trees in the cache range from 6 taxa to 48 016 taxa;
branch lengths were normalized so that the root to tip height was one for all
trees. There were 3 trees from the same study of 4510 taxa[32] as well as one tree of 48 016 taxa[33] all other trees were 815 taxa or fewer. We excluded these 4 outlier trees
for computational convenience. After excluding these trees, the median size of
the remaining 126 trees is 72 taxa and the mean size is 140 taxa.
Preliminary analysis
The left panel of Figure
2 shows the condition numbers vs number of taxa for all 126 chronograms in the database.
Overall, increases with number of taxa. The right panel in Figure 2 shows the average
calculated from 50 simulations for each empirical tree, using
TreeSim[5] where birth rate and death rate are estimated from the associated tree by R package
ape function birthdeath[34] where some trees with multichotomies issue are resolved using function
multi2di, simulating to have the same number of
extant species as each empirical tree. Note that, even though the number of taxa
are the same between the simulated and empirical trees, the simulated trees had
worse (higher) on average.
Figure 2.
The 126 condition numbers vs number of taxa for trees from the
literature (left panel) and for trees from simulation (right panel). A
simple linear regression yields 2 line equations for the trees from literature and for the simulated trees. The interval for the regression line is shown in gray
area. From the plots, the condition of real trees and simulated trees
can be viewed via their matrix, and it becomes larger as number of taxa increases, but
does so somewhat slowly. The simulated trees generally have higher
(worse) condition than empirical trees.
The 126 condition numbers vs number of taxa for trees from the
literature (left panel) and for trees from simulation (right panel). A
simple linear regression yields 2 line equations for the trees from literature and for the simulated trees. The interval for the regression line is shown in gray
area. From the plots, the condition of real trees and simulated trees
can be viewed via their matrix, and it becomes larger as number of taxa increases, but
does so somewhat slowly. The simulated trees generally have higher
(worse) condition than empirical trees.We next compared condition numbers vs number of taxa using trees from more extensive simulations.
In Figure 3, there are 3
lines where each line represents the average of 100 runs of simulated
phylogenies for different number of taxa. The purple line was obtained from
random trees using coalescent trees method (created using
R package: rcoal)[35] and the orange line was obtained from trees simulated by birth-death
process with a given age on a fixed number of extant taxa[36] using birth rate where larger birth rates are used for larger taxa, death rate
. The blue line is obtained from trees under a pure-birth
process (birth , death ). The root to tip height was one for all trees.
Figure 3.
The condition number vs number of taxa for trees from simulation. Each dot
in the lines represents the average of the value for 100 trees. The orange line is for
birth-death model, the purple line is for coalescent model, and the blue
line is for pure-birth model.
The condition number vs number of taxa for trees from simulation. Each dot
in the lines represents the average of the value for 100 trees. The orange line is for
birth-death model, the purple line is for coalescent model, and the blue
line is for pure-birth model.In Figure 3, the
coalescent trees (purple lines) have the highest values, whereas the birth-death trees (orange lines) overall
have condition numbers slightly higher than the pure-birth trees (blue
lines).A positive finding from this exploration is that all trees simulated in this
ideal situation as well as the empirical trees from the literature do not fall
into the numerical limit of approximately where LAPACK[37] considers the matrix singular and thus infeasible for analysis using
Cholesky decomposition. However, there still exist trees with ill-conditioned
matrices that could impact the subsequent analysis. To clarify this situation,
consider the trees with 1 or 2 clades with short terminal branch lengths shown
in Figure 4. The matrix
condition for the tree in left panel is , whereas the matrix condition for the tree in the right panel
is . Both trees have much larger matrix condition numbers than
of the tree in Figure 1. They also far exceed the bound
of that a Hilbert matrix of the same size would have which is
known to be ill-conditioned.[7]
Figure 4.
Two cases of 5 taxa trees with a pair of short tips (taxa A and taxa B)
(left panel) and with 2 clades of short tips (right panel).
Two cases of 5 taxa trees with a pair of short tips (taxa A and taxa B)
(left panel) and with 2 clades of short tips (right panel).Consider a more extreme case (generated by setting a fairly short terminal branch
) where LAPACK considers the corresponding matrix singular (see supplemental material solvefail.pdf). Figure 5 shows the proportion of
unsolvable matrices vs their condition numbers for 3 types of commonly used
trees.
Figure 5.
Three types of trees (birth-death tree, coalescent tree, and pure-birth
tree) of taxa size 50, 100, 500, and 800 are used. For each taxa size,
100 trees were simulated under each type of tree and are equipped with a
short terminal branches ranging from −19 to −8 (in scale). The condition numbers of their matrices as well as their inverse (if numerically feasible) by
LAPACK are computed. The horizontal axis shows the proportion (using
scale of power of 100 of the raw proportion) of unsolvable matrices over 100 trees vs their condition numbers shown in the
vertical axis.
Three types of trees (birth-death tree, coalescent tree, and pure-birth
tree) of taxa size 50, 100, 500, and 800 are used. For each taxa size,
100 trees were simulated under each type of tree and are equipped with a
short terminal branches ranging from −19 to −8 (in scale). The condition numbers of their matrices as well as their inverse (if numerically feasible) by
LAPACK are computed. The horizontal axis shows the proportion (using
scale of power of 100 of the raw proportion) of unsolvable matrices over 100 trees vs their condition numbers shown in the
vertical axis.These type of trees could be unstable statistically when their matrix has no exact inverse by LAPACK, and the subsequent
phylogenetic comparative analyses which use the inverse of [8,38] could be
unstable. This does not just mean that the likelihood would be impossible to
calculate, throwing an annoying but at least transparent error. At better but
still bad condition, there would be a finite number returned for the likelihood
but it is unstable: a slight change in a parameter value in the model or to a
branch length of the tree could result in a very different likelihood value due
to accumulated errors in the matrix inversion, not rugosity of the true
likelihood surface. A researcher would get a likelihood and parameter estimates
back rather than an error, but these numbers are not accurate (see MLE estimator
for in BM model in Figure 14 in supplemental material). These consequences would be expected to
be more acute for multivariate data, though that remains an area for further
investigation. These issues would affect any method that uses calculation of
likelihood: both analyses that use likelihood to optimize parameter values or
Bayesian approaches that sample a space using information from priors and
likelihoods.The empirical analyses above suggest that, at least for most trees biologists
have encountered, there are reliable matrices (though transformations of these matrices from models to make
variance-covariance matrices that are then inverted can still have issues, which
remain to be investigated). However, there be dragons here: there are trees with
poor matrix condition that could affect PCM calculations, even the simple trees
of Figure 4. We
investigate the potential impact on the statistical stability of the BM model[10] in equation (3) when the associated phylogenetic tree has
ill-conditioned covariance matrix .
Methods
One way of dealing with ill-conditioned matrices is just to reject matrices,
, that are poorly conditioned. However, the true estimate of the
matrix (from the tree and/or from the transformation of the matrix from a
comparative methods model) can fall in that region, and users would not be happy to
hear that their hard-earned tree cannot be analyzed. That is far better than quietly
returning a wrong result, but still far from ideal. We propose several possible
approaches to remediate the issue of an ill-conditioned matrix from
the tree. Our goal is to use some of the following methods to estimate the
best version of the observed phylogenetic covariance
matrix.These all involve modifying from what it should be based on the tree and model, which of course is not
ideal, but given that the goal is to use this to compute estimates (such as the rate
of BM), we will investigate whether the resulting estimates are better from a
modified matrix than from the original ill-conditioned matrix. Three approaches are
examined: (1) shrinkage matrix regularization: lengthen the tip lengths with respect
to the tree, (2) pruning tips of the tree: removing tips from the tree, and (3)
lengths stretching: lengthen/shorten all branch lengths of the tree.
Shrinkage matrix regularization
An approach by Schafer and Strimmer[39] was developed for regularizing covariance matrices in molecular biology
(including some network covariance matrices), later improved and generalized by Theiler.[40] Let and , define the shrinkage matrix estimator of by where . Let be an estimate of the mean of the Mahalanobis distance
by recognizing that is generating from a Gaussian distribution with covariance (ie, ). Theiler[40] showed that the negative log likelihood function based on the mean
Mahalanobis distance approximation for the shrinkage estimated covariance matrix
as a function of the shrinkage parameter isThe best shrinkage estimate is to search the optima and the matrix is updated variance-covariance matrix for the next step of the
analysis. Although the shrinkage matrix used here is mathematical equivalent (differs up to a constant
multiplier ) to the very broadly used Pagel’s lambda[41,42]
transformation where and is an identity matrix, the parameters and have different meaning. While is estimated through the BM likelihood and is used for testing
phylogenetic signal, is estimated through likelihood function based on the mean
Mahalanobis distance approximation for the shrinkage estimated covariance
matrix.Figure 6 compares the
raw, untransformed tree with transformed tree computed by taking the shrinkage
matrix and converting back into a tree using the unweighted pair group method of
arithmetic mean[43] by R package: upgma.[44] The transformed tree has much longer tip branch lengths relative to
internal branch lengths relative to the untransformed tree.
Figure 6.
Tree transformation under shrinkage matrix regularization method. A
simulated birth-death tree of 100 taxa is shown in the left panel. The
phylogenetic covariance matrix of the shrunk tree is obtained by setting
shrinkage parameter to such that where is the phylogenetic covariance matrix of the simulated tree. The
shrunk tree is shown in the right panel and compared with the
untransformed tree. Both trees are plotted with the same taxon
order.
Tree transformation under shrinkage matrix regularization method. A
simulated birth-death tree of 100 taxa is shown in the left panel. The
phylogenetic covariance matrix of the shrunk tree is obtained by setting
shrinkage parameter to such that where is the phylogenetic covariance matrix of the simulated tree. The
shrunk tree is shown in the right panel and compared with the
untransformed tree. Both trees are plotted with the same taxon
order.Using 5 different numbers of taxa , the average of the shrinkage estimator across 100 replicate
trees are (, respectively). It appears that the magnitude of required
shrinkage decreases with the number of taxa.
Pruning tips of the tree
We further investigate what other factors, whether a property of the tree or
parameters of the simulations used to generate it (which then, of course,
results in trees of particular distribution of branch lengths), affects the
condition of the tree. We consider the tree properties that could affect the
condition of the tree as follows: (1) taxa size n, (2) min
branch length , (3) max branch length, (4) ratio between max branch length
and min branch length, (5) variance of branch length, (6) median of branch
lengths, (7) min tip length, (8) max tip length, (9) min internal branch length,
(10) max internal branch length, (11) generating birth rate , (12) generating death rate , and (13) turnover rate .We simulated birth-death trees under uniformly varying number of taxa
between size of 10 and 800, birth rate between and , death rate between 0 and where larger tree are simulated with higher value of
. Note that one can simulate from uniform distribution with upper limit (ie, . In this case and have dependency. Their condition numbers are calculated and compared with a variety of measures using
multiple linear regression. The R package:
MuMIn[45] was used to generate a set of models to correlate with combination of parameters or tree measures in the global
model of 13 predictors. The maximum number of variables is set to 4 accounting
for 1093 models (intercept model: model, 1 predictor: models, 2 predictors: models, 3 predictors: models, 4 predictors: models where is the number of combination that selects r
distinct items from a collection of n items). Table 1 shows the
regression estimates for the covariates for the top 5 models that accounts for
majority of weights.
Table 1.
Regression estimates, log likelihood, AICc, , and Akaike weights for the top 5 multiple linear
regression models (M1-M5) out of the 1093 models.
Model
M1
M2
M3
M4
M5
min tip ι
−0.19
−0.19
−0.24
−0.24
−0.24
death rate μ
17.77
17.77
16.05
16.06
16.05
birth rate λ
10.45
brlen median
−0.02
−0.02
max brlen
3.6e−5
3.8e−5
3.8e−5
max internal
3.6e−5
3.8e−5
Ntip
1.2e−3
1.2e−3
logLik
−11 879.64
−11 879.81
−11 882.73
−11 882.99
−11 883.92
AICc
23 771.29
23 771.62
23 777.48
23 777.99
23 779.84
ΔAICc
0.00
0.33
6.19
6.70
8.55
weight
0.51
0.43
0.02
0.02
0.01
Abbreviation: AIC, Akaike information criterion.
Regression estimates, log likelihood, AICc, , and Akaike weights for the top 5 multiple linear
regression models (M1-M5) out of the 1093 models.Abbreviation: AIC, Akaike information criterion.From Table 1, we have
several comparisons: (1) condition number vs min tip length (ie, vs ), (2) condition number vs death rate (ie, vs ), and (3) condition number vs birth rate (ie, vs ) are interesting. For (2) and (3), it is known that the
expected waiting time to the next event of birth-death model is exponential with
parameter (so the regression estimates for and for are both positive values), it remains to see how the birth and
death parameters affect the branch length (and hence matrix).For (1), having a larger minimum tip branch length seems to lead to better (lower) , whereas smaller minimum tip may yield worse (higher)
. This is consistent with Figure 4 above, where short tip lengths
seemed to lead to bad matrix condition. Here, we focus on exploring the relation
between the condition number and the minimum tip length. Given that smaller
trees tend to have better condition than larger ones, it could be that removing
a taxon at random would improve matrix condition. However, another possibility
is that it is just the presence of very small terminal branches: these are less
likely on a smaller tree, but we could just try to remove the short branches
directly by removing one of the taxa with the shortest terminal branch. Taxon
removal does tend to help matrix condition, but removing the taxon with the
shortest branch helps far more in Figure 7. This points to a potential
solution: dropping one of the tips with the shortest branch length as a quick
run suggests a much bigger improvement than dropping a tip at random.
mvMORPH[23] and PCMBase[24] allow pruning tips with tiny or zero lengths, allowing this to be done
easily by users.
Figure 7.
The condition numbers for dropping the shortest tip vs the condition
numbers for the raw tree. 100 birth-death trees are generated with
random size ranging from 100 to 800 with birth rate and death rate . The horizontal axis is the condition number for raw
trees, whereas the condition numbers for the removing a tip tree are
shown in the vertical axis. The vertical lines connect points
corresponding to that given tree, but with one taxon removed: the blue
square is removing a taxon at random, and the red dot removing the taxon
with the shortest branch tip length.
The condition numbers for dropping the shortest tip vs the condition
numbers for the raw tree. 100 birth-death trees are generated with
random size ranging from 100 to 800 with birth rate and death rate . The horizontal axis is the condition number for raw
trees, whereas the condition numbers for the removing a tip tree are
shown in the vertical axis. The vertical lines connect points
corresponding to that given tree, but with one taxon removed: the blue
square is removing a taxon at random, and the red dot removing the taxon
with the shortest branch tip length.We found that, in fact, the problem of having ill-conditioned matrix is highly
related to terminal branch lengths of taxon. To illustrate this issue, a simple
example of 2 taxa is shown in Figure 8 (right panel). The corresponding phylogenetic matrix is
Figure 8.
A 3-taxa tree (left panel) and a 2-taxa tree (right panel) for
illustration.
Note that has 2 eigenvalues and . The condition number of defined by the ratio of the largest eigenvalues to the smallest
eigenvalues: where is the big O notation that describes the
limit behavior of a function. When tree has tiny tips (very small
), the value of will be fairly large and the matrix is more likely to be
ill-conditioned. For instance, with the condition number , whereas with , . The problem becomes serious as is very close to zero where matrix has 2 almost identical columns/rows which makes a singular matrix with . In general, for a tree with a clade of very short tips
relatively to the tree height from the root to the tips, the corresponding matrix is more ill-conditioned.A 3-taxa tree (left panel) and a 2-taxa tree (right panel) for
illustration.Moreover, the phylogenetic covariance matrix has a nested structure and possess some special matrix properties. Ané[46] determined eigenvalues of the covariance matrix for symmetric trees for
the purpose of studying the behavior of the estimator. Here, we show that the
shortest tip of a ultrametric tree is equal to the smallest eigenvalue of the matrix. We start using a 3-taxa example in Figure 8 (left panel) where the tip
lengths for species and Z are b,
b and , respectively. The shortest terminal branch length is
b. The phylogenetic matrix for the tree in Figure 8 is shown in equation (8).Let be the characteristic polynomial of where is a 3 by 3 identity matrix. As the roots of are the eigenvalues of , solving and simplifying yield . As are both positive numbers, the smallest eigenvalue for is b which is the shortest tip length on the tree in
Figure 8 left panel.
For a general case, a property of ultrametric tree is provided in Lemma 1.
Lemma 1
The shortest tip length of an ultrametric phylogenetic tree is the smallest
eigenvalue of , ie, where b is the smallest tip length and
I is an n by n
identity matrix.The general proof in Lemma 1 shows up the pruning approach from a theoretical
perspective. Researchers may object to losing a taxon in their analysis:
getting data for a species to put it on a tree and include trait information
may have entailed a significant effort. They may also worry about biased
estimates that come from such pruning. Below, we show that such pruning does
improve matrix condition substantially and later show that this can provide
reliable parameter estimates with sufficient remaining taxa. The following
lemma shows that the new tree obtained from dropping the shortest tip of the
original tree has a better (lower) .
Lemma 2
Let be the n by n strictly ultrametric
matrix from the tree and be the condition number of . Let be the by matrix obtained by dropping the shortest tip from the tree
and be the condition number of . Then, .
Remark
Above 2 lemmas have a link with the result obtained in Ané[46] where the whole spectrum of the matrix is derived for the special case of a symmetric tree and has been
extended in Ho and Ané[47] for an Ornstein-Uhlenbeck (OU) model.[48] It would be interesting to explore whole spectrum for arbitrary
ultrametric tree for generalization, but that remains as future work.
Length stretching
Another possible solution is adopting the method in Jhwueng[38,49] which
stretches the branch lengths of the raw tree without changing its topology. For
an ultrametric tree, let be the tree height from the root to the tip. Without loss of
generality, is scaled into a unit and is decomposed into
d components. That is, where represents the length between the and speciation events. For instance, is the length from the root to the first speciation event
since the root and is the minimum tip length for the species evolved from its
most recent common ancestor. Next, consider the matrix obtained from the raw tree. Let the -tuple elements be the distinct entries in satisfying . The relation between and can be represented as . The following lemma describes the relationship between
and .
Lemma 3
The set where is the length between the m and the
m speciation event has d elements if
and only if the number of distinct elements in of an ultrametric tree is .For example, in Figure
8 (left panel), if setting and , then the matrix is and has 3 distinct elements , and . The 2 lengths and can be determined accordingly by , and .To stretch the lengths but retain the topology of the raw tree, as
and , we can treat as a d-dimensional random vector from a
Dirichlet distribution. Then, can be generated by first drawing d
independent gamma random variables, each with different shape parameters and rate parameter 1 where m is an
arbitrary but positive constant. Then, the d-tuple vector
is a Dirichlet random vector with , , and concentration parameters . Here, the positive constant m is an
arbitrary scaling variable that always preserves the correct mean. By the
property of Dirichlet distribution, we have , and the mode is given by . The choice of m is thus determined by
. A positive integer m is chosen to
satisfy where returns the least integer greater than or equal to
a. The choice of m here is designed to
be the minimal needed to prevent the phylogenetic tree from varying too
wildly from the given one while still adequately testing robustness.[49]Figure 9 shows the
phylogeny and their condition number for a raw tree and its transformed
trees.
Figure 9.
The raw tree with is shown in upper left panel, the shortest tip
pruned tree with is shown in upper right panel, the shrunk tree
with is shown in lower left panel, and the stretched
tree with is shown in lower right tree.
The raw tree with is shown in upper left panel, the shortest tip
pruned tree with is shown in upper right panel, the shrunk tree
with is shown in lower left panel, and the stretched
tree with is shown in lower right tree.We implemented the shrinkage method, length stretching method and pruning
tips (drop shortest tip) method to transform trees. We simulated 200
birth-death trees of taxa size between 300 and 1000 using
R package: TreeSim[36] with speciation rate and extinction rate . The condition numbers of the phylogenetic covariance
matrix for the raw trees and transformed trees are calculated and are
plotted in Figure
10. Among the 3 methods, shrinkage method and pruning tip (drop
the shortest tip) method help to reduce the condition number. As seen in
Figure 10 (left
most panel), the shrinkage method provides a large amount of reduction in
condition numbers for all trees. The condition number of raw tree ranges from 3 to 8, whereas
all matrices of shrunk trees have condition number of value less than 4
(in scale). The pruning tip method, shown in Figure 10 (right most
panel), is similar to Figure 7 and contributed to a smaller condition numbers for all
trees. The length stretching method, shown in Figure 10 (middle panel), in general
does not improve the condition number when comparing with the raw method
without transformation.
Figure 10.
Comparison of condition number between the raw tree and transformed
tree. The horizontal axis as well as the vertical axis shows the
range of condition number in scale of the simulated trees. The diagonal dashed
line is 1:1 relationship to demonstrate the pattern. Scatter plot of
condition number of the simulated trees vs the condition number for
the transformed trees is shown in each panel.
Comparison of condition number between the raw tree and transformed
tree. The horizontal axis as well as the vertical axis shows the
range of condition number in scale of the simulated trees. The diagonal dashed
line is 1:1 relationship to demonstrate the pattern. Scatter plot of
condition number of the simulated trees vs the condition number for
the transformed trees is shown in each panel.
Assessment of the Methods
There is a metaphor for parameter estimation that is about searching a lost key in a
region of street with no light. A man who lost his keys in the night. A friend came
across the man searching under a streetlight, and asked, where did you lose
them? Down there, the man said, but there’s no light there, so
we are looking for them here.In our case, the tree is transformed in some way (taxa deleted, branches stretched)
which lets parameter search work more easily—but have we moved so far away from
where the parameters are that the ease of search does not make for better results?
To address this, we examined whether estimates of rate of evolution the root state under BM are better with transformed trees than on the original
tree. There are 2 aspects to this: are these estimates calculable at all, and, if
they are, are the estimates good? Performance is assessed by examining normalized
root-mean-square deviation (RMSD) defined bywhere are the true parameters and are the MLEs.As the study of interest is the impact from the ill-conditioned tree, trees are
simulated with more ill-conditioned matrix where the Cholesky decomposition by R package
solve fails to find the inverse for some trees. One case
of an ill-conditioned tree can be built by adding up tiny lengths to all tips right
after the tips of tree are trimmed to the most common ancestor of the shortest tips.
This can be done by adding a relative small number to the diagonal of the matrix after the diagonal elements are replaced with the the second largest
elements in the (the height from the root to the most recent common ancestor of the shortest
tips). Ultrametric trees are simulated from TreeSim[5] with speciation rate and extinction rate where is a uniform distribution. Trait of size is simulated using 2 set of true parameters and given a tree. For each tree, we simulated 50 traits and repeated
this procedure 700 times for each combination of parameters. Among 700 trees, 345
trees have their matrices invertible by the Cholesky decomposition (ie,
solve returns with no error). The other 355 trees are transformed under the 3
tree transformation methods. For the purpose of comparison of the 3 methods and the
raw method without transformation, traits simulated under BM model from raw trees
are fixed and used for parameter estimation across transformed trees. Each tree can
be in a region where the matrix is sufficiently well-conditioned to be used with
solve or in an area where it would fail. There are 2
analyses done:Solve works: Estimates the parameters on just the
well-conditioned trees using the raw tree and after the various
transformations. This evaluates whether, for trees that are already somewhat
feasible, does transformation still help even though the raw tree could be
used and which transformation works best.Solve fails: Estimates the parameters on just the
poorly conditioned trees (where the raw tree is not feasible) after
transformations. This evaluates which transformation performs best in the
hard cases.If the best approach in the first case is to just use the raw tree, then a good
protocol is to use the raw tree when possible and use a transformation if needed. If
one of the transformations works best in both the first and second cases, then it is
best to use that transformation in general, at least for the trees in this tricky
but sometimes solvable region. Results for comparison of trees with size 100 and 500
are reported in Figure 11
for and in Figure
12 for .
Figure 11.
Evaluation of performance of parameter estimation under trees transformation
for ancestral status parameter . The left panel is for true parameter and the right panel is for with tree size (upper panel) and (lower panel). The box plots of RMSD of defined in equation (9) under each
method are shown in each panel. The labels of the horizontal axis starting
from the left to the right are droptip: prune the shortest tip, prun: the
pruning algorithm,[11] raw: untransformed tree, stretch: length stretched tree, shrink:
shrinkage matrix regularization method and lambda: Pagel’s lambda method.
Each label contains 2 groups: sol and unsol where
sol represents the estimates of parameters on
well-conditioned trees using raw tree and after various transformation, and
unsol represents the group of estimates of the parameters
on just the poorly conditioned trees after various transformations are
reported. Because raw trees cannot be evaluated when their matrices fall in ill-condition, only the box plot for raw tree is
reported when the matrix is invertible under Cholesky method. Graphs are plotted in
scale.
Figure 12.
Evaluation of performance of parameter estimation under trees transformations
for rate parameter . The box plots under each panel compare the RMSD values
under each tree transformation method. The labels are the same as in Figure 11. The box
plots are reported in scale. Among those transformations, the shrink method has
significantly larger RMSD than other methods in both of the is solvable (sol) and unsolvable (unsol).
Evaluation of performance of parameter estimation under trees transformation
for ancestral status parameter . The left panel is for true parameter and the right panel is for with tree size (upper panel) and (lower panel). The box plots of RMSD of defined in equation (9) under each
method are shown in each panel. The labels of the horizontal axis starting
from the left to the right are droptip: prune the shortest tip, prun: the
pruning algorithm,[11] raw: untransformed tree, stretch: length stretched tree, shrink:
shrinkage matrix regularization method and lambda: Pagel’s lambda method.
Each label contains 2 groups: sol and unsol where
sol represents the estimates of parameters on
well-conditioned trees using raw tree and after various transformation, and
unsol represents the group of estimates of the parameters
on just the poorly conditioned trees after various transformations are
reported. Because raw trees cannot be evaluated when their matrices fall in ill-condition, only the box plot for raw tree is
reported when the matrix is invertible under Cholesky method. Graphs are plotted in
scale.Evaluation of performance of parameter estimation under trees transformations
for rate parameter . The box plots under each panel compare the RMSD values
under each tree transformation method. The labels are the same as in Figure 11. The box
plots are reported in scale. Among those transformations, the shrink method has
significantly larger RMSD than other methods in both of the is solvable (sol) and unsolvable (unsol).In Figure 11, comparison of
using various tree transformation methods to search MLEs for in BM model shows that those methods returns consistent parameter
estimates for . The medians of the values are around 0.1 or lower across all methods.For comparison of methods using RMSD of shown in Figure
12, the pruning tip method, raw method, length stretching method, and
lambda method perform well. However, the shrinkage method has significantly larger
RMSDs in both sol and unsol cases. From earlier work, it
suggests that the tip length of shrunk trees after tree transformation has
relationship with the proportion of tree height. For instance, the average of the
shrinkage estimator across 100 replicates for birth-death tree of 100 taxa and 800
taxa is . Consider a more extreme case: a star tree obtained from the
shrinkage method with is used for parameter estimation, then is a diagonal matrix. Then, the trait is indeed analyzed under
independent normal distribution. The bias of parameters in this case can be seen
from theoretical approach. Without loss of generality, let be an identity matrix and let trait data be simulated under BM model (ie, . Then, under the shrinkage method with , is analyzed under the i.i.d. normal distribution with MLEs and . Given true parameters , , the is bounded by . And if the root to tip tree height is 1 (ie, for all ), then has a natural upper bound (see Lemma 4 in supplemental material). However, the RMSD for has a nontrivial lower bound. For mathematical convenience,
is used instead to show that a lower bound is (see Lemma 5 in supplemental material).However, tips of the tree may overly lengthen by the shrinkage method so the
parameter cannot be estimated well. To explore the utility of the shrinkage
method, a simulation comparing the RMSD under different and taxa size n is shown in Figure 13 which suggests that RMSD of
in general increases with . The model parameter can be estimated better with smaller . If the goal is to obtain a reliable estimate of , then a relatively small number of shall be considered to attain a better estimate for
with lower RMSD value. Therefore, users may manually set up the
shrinkage parameter as small as possible to obtain reliable estimate when the
transformed tree has invertible matrix.
Figure 13.
Evaluation of the shrunk tree of taxa sizes 100 and 500 with different
shrinkage parameter values . For each taxa size, RMSDs are computed using 300
birth-death trees where for each tree 50 traits are simulated under Brownian
motion model. The 2 plots in upper panel investigate the RMSD for
and when the inverse of (sol C) can be computed directly, whereas the other 2
plots in lower panel investigate the RMSD for and for the case where the matrix is ill-conditioned (unsol C). Overall, the 4
panels suggest a relatively small shrinkage value may be used to achieve a better estimate for
.
Evaluation of the shrunk tree of taxa sizes 100 and 500 with different
shrinkage parameter values . For each taxa size, RMSDs are computed using 300
birth-death trees where for each tree 50 traits are simulated under Brownian
motion model. The 2 plots in upper panel investigate the RMSD for
and when the inverse of (sol C) can be computed directly, whereas the other 2
plots in lower panel investigate the RMSD for and for the case where the matrix is ill-conditioned (unsol C). Overall, the 4
panels suggest a relatively small shrinkage value may be used to achieve a better estimate for
.To sum up, although the tree transformation methods proposed here do not provide
significantly better improvement than the raw tree, they are still reliable options
for users to choose when the matrix of tree is ill-conditioned. Moreover, even when the matrix is fairly well-conditioned, applying transformations does not have a
substantially worse effect. One suggestion is that when the phylogenetic covariance matrix is solvable, it may be the best to use the raw tree. Meanwhile, when is so poorly conditioned that cannot be inverted exactly by Cholesky decomposition, tree transformation
methods provide alternative options and can give reliable estimates for most
cases.
Conclusions
In this article, we explore the condition number of the phylogenetic covariance matrix transformed from a phylogenetic tree. We found with fairly short terminal
branch (eg, a tip with length of or smaller for tree of 100 taxa) the phylogenetic matrix fails to give the exact inverse by the Cholesky decomposition method
by current software. The failure of returning exact inverse of also depends on the number of taxa and range of condition numbers
.We proposed 3 methods (shrinkage matrix regularization, pruning the tips of tree, and
length stretching) to alleviate the ill-conditioned matrix issue arising from the
phylogenetic tree and obtain a well-conditioned estimate for matrix. Simulations here are similar in spirit to what is performed in Figure 5 in the work by Adams
and Collyer[9] where the condition number of phylogenetic covariance matrices at different
level of sample size was shown. Their work was interested in the condition number
relative to type I error of comparative methods, whereas our examinations use
different aspects of the effect of tree condition. Another common approach is to add
a small constant to the eigenvalues of the phylogenetic covariance matrix and
reestimate the phylogenetic covariance matrix from the eigenvectors and adjusted
eigenvalues. However, this method would alter the topology of the phylogenies, so we
do not consider to implement it here.Current R software packages that implement Felsenstein’s pruning algorithm work
effectively. For example, the R package mvMORPH can compute
the square root of the phylogenetic covariance matrix and its determinant with
arbitrary small tips, whereas the R package PCMBase
implements the pruning algorithm to return the likelihood even with zero-length
branches (see supplemental material pmmfelprunzerobranch.pdf). Note that the
analysis of parameter estimation under the maximum likelihood estimation for the
univariate BM model used in this work is performed without implementing measurement
error. When assuming that each species in the tree has a measurement error variance,
technically, this is also equivalent to extending each branch in the tree by a
constant. Then, the variance-covariance matrix that includes measurement error and
an ill-conditioned matrix could have better matrix condition, especially in our cases where the
poor matrix condition typically came from near zero-length tips. Measurement error
is also included as part of the model for the phylogenetic mixed model.[50] However, measurement error can lead to bias in the inferred parameters.[51] Also, some software with hundreds of citations, such as
OUCH[52] and SURFACE,[53] does not yet allow for incorporation of measurement error, rendering this
potential solution impossible. As we focus on studying the condition number of matrices on its own and explore the impact on subsequent analyses, we do not
include the measurement error in this study.There are a couple of extensions from this work. One possible work is to look at the
multirate evolutionary BM model[54] where the rate of phenotypic evolution is assumed to change throughout
history of life. As the rate parameters are embedded in the phylogenetic covariance
matrix, the condition number of the phylogenetic covariance matrix hence depends on
both the tree and rate parameters. Another possible extension from this work is to
look at the condition number of the phylogenetic variance-covariance matrix for more
general trait models. For instance, if one assumes OU process for trait evolution,[48] then the phylogenetic variance-covariance matrix is where is the constraining force, is the rate of evolution, is the time that separating species i and
j, and is the time that species i and j
shared a common ancestor. The statistical model for the OU process trait evolution
is . In particular, when , is the BM model. Larger constraining force parameter
in OU model has the effect of lengthening the tip relative to the
internal edges and may help the ill-conditioned matrix issue.There are more complex models developed under different assumptions based on this OU
process, for instance, the multiforce, multioptimum, and multirate model in the work
by Beaulieu et al[55] has fairly complicated phylogenetic variance-covariance matrix
. Its poor performance with especially complex models could stem
from issues with matrix condition, though this needs to be investigated. In
addition, Bastide et al[12] and Jhwueng and O’Meara[8] developed a phylogenetic comparative method (PCM) using phylogenetic
networks, rather than trees, it would be interesting to further explore the matrix
condition in this case. We hope that our tree transformation methods provide the
community options to ameliorate the issues produced by ill-conditioned . All analysis and simulations are done by computer Mac Pro: macOS Mojave
(Late 2013), processor: 3.7 GHz Quad-Core Intel Xeon E5, RAM:12 GB 1866 MHz DDR3.
The R scripts and relevant files to generate results can be accessed at https://tonyjhwueng.info/KappaPCM.Click here for additional data file.Supplemental material, supplemental_material for On the Matrix Condition of
Phylogenetic Tree by Dwueng-Chwuan Jhwueng and Brian C O’Meara in Evolutionary
Bioinformatics
Authors: Oliver G Pybus; Marc A Suchard; Philippe Lemey; Flavien J Bernardin; Andrew Rambaut; Forrest W Crawford; Rebecca R Gray; Nimalan Arinaminpathy; Susan L Stramer; Michael P Busch; Eric L Delwart Journal: Proc Natl Acad Sci U S A Date: 2012-08-27 Impact factor: 11.205