Rooting gene trees without outgroups: EP rooting.

Janet S Sinsheimer, Roderick J A Little, James A Lake.

Abstract

Gene sequences are routinely used to determine the topologies of unrooted phylogenetic trees, but many of the most important questions in evolution require knowing both the topologies and the roots of trees. However, general algorithms for calculating rooted trees from gene and genomic sequences in the absence of gene paralogs are few. Using the principles of evolutionary parsimony (EP) (Lake JA. 1987a. A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol. 4:167-181) and its extensions (Cavender, J. 1989. Mechanized derivation of linear invariants. Mol Biol Evol. 6:301-316; Nguyen T, Speed TP. 1992. A derivation of all linear invariants for a nonbalanced transversion model. J Mol Evol. 35:60-76), we explicitly enumerate all linear invariants that solely contain rooting information and derive algorithms for rooting gene trees directly from gene and genomic sequences. These new EP linear rooting invariants allow one to determine rooted trees, even in the complete absence of outgroups and gene paralogs. EP rooting invariants are explicitly derived for three taxon trees, and rules for their extension to four or more taxa are provided. The method is demonstrated using 18S ribosomal DNA to illustrate how the new animal phylogeny (Aguinaldo AMA et al. 1997. Evidence for a clade of nematodes, arthropods, and other moulting animals. Nature 387:489-493; Lake JA. 1990. Origin of the metazoa. Proc Natl Acad Sci USA 87:763-766) may be rooted directly from sequences, even when they are short and paralogs are unavailable. These results are consistent with the current root (Philippe H et al. 2011. Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature 470:255-260).

Substances: RNA, Ribosomal, 18S

Year: 2012 PMID: 22593551 PMCID: PMC3509888 DOI: 10.1093/gbe/evs047

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

Introduction

Fully understanding the evolution of life on Earth requires identifying the ancestor from which all others on the tree evolved, that is, the root. A number of important unresolved questions in evolution, including the origins of modern humans (Reich et al. 2011), the rise of placental mammals (Waddell et al. 1999; Madsen et al. 2001; Murphy et al. 2001a, 2001b; Scally et al. 2001), animal evolution (Aguinaldo et al. 1997; Philippe et al. 2011), prokaryotic evolution (Cox et al. 2008; Lake 2009), and even the beginnings of life (Lake et al. 2009; Ragan et al. 2009), would benefit from phylogenetic methods that can root a tree without having to specify outgroups. And yet, there are no general methods available for determining roots. Trees can be rooted using outgroups and ancient gene duplications (Dayhoff et al. 1972), insertions and deletions (Rivera and Lake 1992), and occasionally gene presences and absences (Lake and Rivera 2004; Simonson et al. 2005). But these methods cannot be applied to some of the most widely used molecular sequences, including ribosomal RNAs (rRNAs), because useful gene duplications are rare. Furthermore, usable gene duplications and indels are, in general, so infrequent that it is the exception when trees can be rooted using them. Phylogenetic reconstruction algorithms typically delete rooting information because they assume, either implicitly or explicitly, that evolutionary distances are symmetric, that is, $d_{ij} = d_{ji}$, where $d_{ij}$ is the evolutionary distance between taxon i and taxon j. Strictly speaking, the assumption of symmetric distances is only valid if evolutionary processes are time reversible, which they are not (Lake 1997). Assuming symmetric distances thus removes valuable rooting information from many types of sequence analyses. Here we show how to root nucleotide sequences without making this assumption.
An alternative for rooting without an outgroup is to use a likelihood-based approach with a nonreversible transition state matrix (Barry and Hartigan 1987a, 1987b; Ferretti et al. 1994; Hendy and Penny 1996; Evans and Zhou 1998). That approach allows a general model of evolutionary change to be used; however, branch lengths, transition rates, and nucleotide distributions at the root must be estimated or integrated out. In contrast, our approach does not need to estimate these parameters, is fast and simple to implement, and may be easily extended to larger numbers of taxa. Evolutionary parsimony (EP) (Lake 1987a, 1987b) and its extensions (Cavender 1989; Nguyen and Speed 1992) contain rooting information because they do not assume that evolutionary distances are symmetric. Instead, they are based on the balanced transversion assumption, namely that transversions from a purine (or pyrimidine) produce approximately equal numbers of the two pyrimidines (or purines). Because DNA copying and repair mechanisms can most readily distinguish differences between the larger purines (A and G) and the smaller pyrimidines (C and U/T), those exchanges which substitute one purine for another or one pyrimidine for another (transitions) occur more frequently than those that exchange purines and pyrimidines (transversions), as has been known for nearly 30 years (Brown et al. 1982). In addition, because transversions occur less frequently than transitions, they are more desirable for investigating early events in evolution. Thus, methods based on balanced transversions, or their modifications, provide both rooting and topological information, and the slow evolution of transversions makes them better suited for rooting. Here we explicitly describe how to root trees using EP rooting invariants.
The literature on both linear and polynomial phylogenetic invariants, and on the related topic of Hadamard conjugations, is rich; see, for example, Cavender and Felsenstein 1987; Lake 1987a; Cavender 1989; Nguyen and Speed 1992; Steel et al. 1993, 1998; Ferretti et al. 1994; Sinsheimer 1994; Steel and Fu 1995; Hendy and Penny 1996; Sinsheimer et al. 1996; Waddell et al. 1997; Evans and Zhou 1998; Allman and Rhodes 2003; Yap and Speed 2005. To our knowledge, however, there has been no work on using linear invariants to root a phylogenetic tree. Extending and simplifying Nguyen and Speed’s (1992) results, we show that EP invariants can be classified into three major categories. These classes can be used to: (1) test the EP assumptions (the simplest example is a two-taxon test), (2) distinguish between rooted trees (the simplest example is a three-taxon test), and (3) distinguish between unrooted trees (the simplest example is a four-taxon test). The EP invariants that solely contain root information are explicitly derived, and posterior probabilities are developed to determine three-taxon rooted trees. The method is illustrated using a simple but difficult example: rooting the new animal phylogeny directly from short 18S rRNA sequences in the absence of outgroups.

Results

Principles of EP

For s aligned nucleic acid sequences, there are $4^s$ different nucleotide patterns that can be observed at any nucleotide position, ignoring insertions and deletions. Thus when three nucleic acid sequences are aligned there are $4^3 = 64$ possible patterns at each position of the alignment, for example, (AAA, AAG, …). For EP, the set of aligned sequences is represented as the number of times each of these patterns is observed over the entire sequence; the patterns are denoted xyz, where x, y, and z can take on nucleotide values A, G, C, and T (or U). N3 is a vector with 64 entries containing the observed nucleotide count spectrum for three taxa, N3 = (#AAA, #AAG, … , #TTC, #TTT), where # refers to the number of occurrences of each pattern. The components of N3 for a hypothetical 30-mer sequence are shown in figure 1. In this example, the pattern TTT is observed three times so that #TTT = 3.
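The counting step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `pattern_spectrum` and the toy alignment are assumptions made for the example (the 30-mer of figure 1 is not reproduced here), and columns containing gaps or ambiguity codes are simply skipped.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "AGCT"  # ordering of the spectrum: AAA, AAG, ..., TTC, TTT

def pattern_spectrum(seqs):
    """Count every column pattern of s equally long aligned sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for column in zip(*seqs):
        col = "".join(column).upper().replace("U", "T")
        # Columns with gaps or ambiguity codes are ignored (indels are skipped).
        if all(c in NUCLEOTIDES for c in col):
            counts[col] += 1
    # Report all 4**s patterns, including those never observed.
    return {"".join(p): counts["".join(p)]
            for p in product(NUCLEOTIDES, repeat=len(seqs))}

# Hypothetical toy alignment of three taxa, five sites long.
toy = ["ACGTT", "ACGTT", "ATGTC"]
n3 = pattern_spectrum(toy)
print(len(n3), n3["TTT"], n3["TTC"])
```

The returned dictionary plays the role of N3: its 64 keys follow the A, G, C, T ordering used in the text, so the last two entries are TTC and TTT.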

EP’s major assumption, balanced transversions, is a constraint placed on the transversion probabilities (Lake 1987a). It can be represented as four conditional probability statements:

$$P(C \mid A) = P(T \mid A), \quad P(C \mid G) = P(T \mid G), \quad P(A \mid C) = P(G \mid C), \quad P(A \mid T) = P(G \mid T),$$

where P(X|Y) denotes the probability of observing nucleotide X at some site in an existing organism’s sequence given there was a Y at that site in the ancestral sequence. This constraint implies that for infinitely long sequences, an equal number of both possible transversions will occur along the path from an ancestor to an existing taxon. For example, in figure 2, the expected number of A to C transversions will equal the expected number of A to T transversions between the root and taxon 1. In the next section, we illustrate how balanced transversions constrain some invariants to equal zero.
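As a small illustration of the balanced transversions constraint, the following sketch checks whether a substitution probability table satisfies it. The function name `is_balanced`, the nested-dictionary layout `p[ancestral][observed]`, and both example tables are assumptions made for this example; the numbers are invented and are not estimates from any data set.

```python
def is_balanced(p, tol=1e-9):
    """Check balanced transversions: from each ancestral nucleotide,
    the two possible transversions must be equally probable.
    p[y][x] is P(x | y), the probability of observing x given ancestral y."""
    checks = [("A", "C", "T"), ("G", "C", "T"), ("C", "A", "G"), ("T", "A", "G")]
    return all(abs(p[anc][x1] - p[anc][x2]) <= tol for anc, x1, x2 in checks)

# Hypothetical table: transitions are more likely than transversions, and
# each ancestral state has its own (balanced) transversion probability.
balanced = {
    "A": {"A": 0.86, "G": 0.10, "C": 0.02, "T": 0.02},
    "G": {"A": 0.08, "G": 0.86, "C": 0.03, "T": 0.03},
    "C": {"A": 0.01, "G": 0.01, "C": 0.88, "T": 0.10},
    "T": {"A": 0.02, "G": 0.02, "C": 0.12, "T": 0.84},
}

# Same table, but A -> C and A -> T are no longer equally likely.
unbalanced = {anc: dict(row) for anc, row in balanced.items()}
unbalanced["A"].update({"C": 0.03, "T": 0.01})

print(is_balanced(balanced), is_balanced(unbalanced))
```

Note that the constraint does not require the same transversion probability for every ancestral nucleotide, and it places no restriction on the transition probabilities.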

Figure 1. Hypothetical aligned gene sequences for three taxa. An example illustrating how three aligned sequences can be reduced to a list of the informative EP subpatterns that specifies the number of counts supporting each pattern. These counts can then be used to reconstruct the most likely rooted tree for this hypothetical three-taxon comparison. The data in this example strongly support the G tree (rooted in taxon 3), as discussed in the text.

Figure 2. Two-taxon EP invariants. An example illustrating how balanced transversions require that the two-taxon invariants, U1 and U2, have expected values of zero. Patterns in U1 = #R1Y2 and U2 = #Y1R2 arise from a net transversion in branch 1 or a net transversion in branch 2, but not net transversions in both branches. This is true for all choices of ancestral nucleotide states. Under balanced transversions, the expected number of AC patterns equals the expected number of AT patterns, the expected number of GCs equals the expected number of GTs, the expected number of CAs equals the expected number of TAs, and the expected number of CGs equals the expected number of TGs. Therefore, U1 and U2 have expected values of zero, as is further explained in the text.

Illustrating the Balanced Transversions Assumption: Two-Taxon Invariants

We demonstrate the application of EP rooting invariants by considering the simplest possible rooted tree, the two-taxon tree shown in figure 2. For two taxa there are two EP invariants and one possible rooted tree. These are:

$$U_1 = \#AC - \#AT - \#GC + \#GT$$

and

$$U_2 = \#CA - \#CG - \#TA + \#TG.$$

We use the shorthand notation #(x−y)1(w−z)2 to represent #xw − #xz − #yw + #yz, so that U1 can then be expressed as #(A−G)1(C−T)2 and U2 as #(C−T)1(A−G)2. This notation can be extended to any number of taxa. For three sequences #(u − v)1(w − x)2(y − z)3 = #uwy − #uwz − #uxy + #uxz − #vwy + #vwz + #vxy − #vxz. In figure 2, we illustrate why U1 and U2 have expected value zero under the balanced transversions assumption. First note that an odd number of transversions at a nucleotide site leads to a net transversion along the path, and an even number of transversions at a nucleotide site leads to either a net transition or no net change along the path. Consider the two-taxon, rooted tree in figure 2. The patterns AC, AT, GT, GC, CA, CG, TA, and TG can only be present if a net transversion has occurred along the path from the root to taxon 2 (branch 2), or along the path from the root to taxon 1 (branch 1), but not both (Lake 1987a). U1 and U2 can be partitioned into four groups indexed by the unknown nucleotide present in the ancestral sequence. That is, U1 = U1(A) + U1(G) + U1(C) + U1(T) and U2 = U2(A) + U2(G) + U2(C) + U2(T), where U(X) denotes the counts derived from ancestral sequence positions containing nucleotide X. If one can show that the expected values of U1(X) and U2(X) are zero for all nucleotides X, then U1 and U2 have expected value zero. Using figure 2, we demonstrate that the expected value of the partition U1(A) is zero. In the first tree in figure 2, the ancestral state is A, so that the nucleotide patterns that comprise U1(A) = #(A−G)1(A)(C−T)2(A) must have resulted from one net transversion in branch 2 (A to C, or A to T) and a net transition or no net change in branch 1 (A to G, or A to A).
To simplify the discussion, we introduce the notation Y = (C−T) = C − T for the pyrimidine difference operator, R = (A−G) for the purine difference operator. Using this notation the two taxon invariants can be written as U1 = #R1Y2 = #(A−G)1(C−T)2 and U2 = #Y1R2. (Note also that multiplication of these operators is commutative.) By the balanced transversions assumption, the number of net A to C transversions equals the number of net A to T transversions in branch 2. Consequently, the expected value of #Y2(A) is zero, and the expected value of #R1(A)Y2(A) is zero. By similar reasoning one can show that the expected values of #Y2(G), #R1(C), #R1(T) are zero, and therefore, U1(G), U1(C), and U1(T) have expected value zero. Hence, the sum U1 = U1(A) + U1(G) + U1(C) + U1(T) has expected value zero under the balanced transversion assumptions. By symmetry, as shown on the right tree in figure 2, U2 also has expected value zero when the balanced transversion assumptions hold.
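The two-taxon invariants can be computed directly from a pairwise alignment by expanding the shorthand #(x−y)1(w−z)2 = #xw − #xz − #yw + #yz. The sketch below is illustrative only: the function name `two_taxon_invariants` and the short toy sequences are assumptions made for this example.

```python
from collections import Counter

def two_taxon_invariants(seq1, seq2):
    """Return (U1, U2) with U1 = #(A-G)1(C-T)2 and U2 = #(C-T)1(A-G)2."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    n = Counter(a + b for a, b in zip(seq1.upper(), seq2.upper()))
    u1 = n["AC"] - n["AT"] - n["GC"] + n["GT"]  # purine in taxon 1, pyrimidine in taxon 2
    u2 = n["CA"] - n["CG"] - n["TA"] + n["TG"]  # pyrimidine in taxon 1, purine in taxon 2
    return u1, u2

# Hypothetical toy alignments.
print(two_taxon_invariants("AAGGCT", "CTCTAA"))  # patterns AC, AT, GC, GT, CA, TA
print(two_taxon_invariants("AAG", "CCT"))        # patterns AC, AC, GT
```

In the first toy alignment the positive and negative patterns cancel, which is the behavior expected on average under balanced transversions; the second is deliberately lopsided and yields a nonzero U1.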

Determining the Roots of Three-Taxon Trees

The principles and reasoning used in the previous section can also be applied to derive the EP statistics for rooting trees. For three taxa, there are three possible rooted trees and three possible operators. The E tree is rooted in taxon 1, the F tree is rooted in taxon 2, and the G tree is rooted in taxon 3. In addition to the Y and the R operators discussed above, we utilize a third operator, Z = (A+G−C−T), the transversion difference operator. The derivation of the expected value of the rooted EP invariant, UF2, is illustrated for the F and G trees in figure 3. Invariant UF2 evaluates nucleotide patterns, R1Y2R3, at positions where the nucleotides in sequences 1 and 3 are purines (Pu) and the nucleotide in sequence 2 is a pyrimidine (Py). When the G tree is correct, then 1 and 2 are sister taxa and UF2 has expected value zero as will be demonstrated with reference to figure 3A.

Figure 3. Three-taxon EP rooting invariants. An illustration of how three-taxon EP rooting invariants provide rooting information. In this example, for the UF2 rooting invariant, the unknown nucleotides at the root of the tree and at the interior node of the tree, as described in the text, must be either pyrimidines at the root and purines at the node, Py/Pu, or vice versa. When the possible partitions of the roots that result in a transversion are computed, we find that UF2 is unconstrained for the F tree, and that it is constrained to zero expected value for the G tree and, in like manner, for the E tree (not shown). The expected values for all 12 invariants are summarized in table 1.

Table 1

Expected Values of the Three-Taxon EP Rooting Invariants

TREE	U_E1	U_E2	U_F1	U_F2	U_G1	U_G2	U_EF1	U_EF2	U_EG1	U_EG2	U_FG1	U_FG2
E	U	U	0	0	0	0	U	U	U	U	0	0
F	0	0	U	U	0	0	U	U	0	0	U	U
G	0	0	0	0	U	U	0	0	U	U	U	U

Note.—U: unconstrained expected value.

We know from the two-taxon EP invariants that a transversion must occur on either branch 1 or branch 2 in order to produce the Pu1Py2 pattern. This is true for all values of the nucleotide present at the node connecting taxa 1 and 2, as in figure 3A. Because the two-taxon EP invariant U1 = #R1Y2 has expected value zero for all four possible values, A, G, C, and T, of the most recent common ancestor of any two-taxon tree, we conclude that the 1, 2 clade of the three-taxon EP invariant, UF2 = #(R1Y2)R3, must also have zero expected value. Furthermore, because this value depends only on the value of the most recent common ancestor, located at the interior node of the G tree, it is independent of any earlier ancestral values present at the root of the three-taxon tree. Thus, UF2 has zero expected value for all possible combinations of nucleotides at the interior node and at the root of the G tree (and similarly for the E tree, not shown). Thus for the E and G trees (but not for the F tree):

$$E(U_{F2}) = E(\#R_1Y_2R_3) = 0.$$

In contrast, when the F tree is correct (see figure 3B), the three-taxon rooting invariant UF2 = #R1Y2R3 is not constrained to zero. This happens because the terminal two-taxon tree relating taxa 1 and 3 corresponds to the operator R1R3, which is not an EP invariant and hence is unconstrained. Furthermore, the operator related to the branch leading to taxon 2, Y2, is also unconstrained because transversions have not necessarily occurred in branch 2. Thus their product, UF2 = #R1Y2R3, is unconstrained for the F tree. Twelve three-taxon EP statistics contain root information: UE1, UE2, UF1, UF2, UG1, UG2, UEF1, UEF2, UEG1, UEG2, UFG1, and UFG2 (table 1). There are two types of rooting statistics for three-taxon trees. UE1 is a representative of one type and UEF1 is a representative of the other type. It is easily seen from their definitions that UE1, UF1, and UG1 are permutations of a common pattern, and similarly for UE2, UF2, and UG2. Likewise, UEF1, UEF2, UEG1, UEG2, UFG1, and UFG2 are permutations of the other common pattern.
Unlike the two-taxon EP invariants, the expected values of these three-taxon EP invariants depend on the position of the root. If the E (or F, or G) trees are correct and if the assumptions of EP are met, then the 12 invariants will have the expected values that are summarized in table 1.
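Any statistic written in the operator shorthand, such as UF2 = #R1Y2R3, can be evaluated from a pattern count spectrum by multiplying the per-taxon operator coefficients. The following is a minimal sketch under stated assumptions: the function name `invariant`, the coefficient table `OPERATORS`, and the toy counts are introduced here for illustration, and the summation operator S = A+G+C+T is the one defined in Materials and Methods.

```python
# Coefficient of each nucleotide under each operator:
# R = (A-G), Y = (C-T), Z = (A+G-C-T), S = (A+G+C+T).
OPERATORS = {
    "R": {"A": 1, "G": -1, "C": 0, "T": 0},
    "Y": {"A": 0, "G": 0, "C": 1, "T": -1},
    "Z": {"A": 1, "G": 1, "C": -1, "T": -1},
    "S": {"A": 1, "G": 1, "C": 1, "T": 1},
}

def invariant(counts, ops):
    """Evaluate #X1X2...Xs, where ops is a string such as "RYR" and
    counts maps column patterns (e.g. "ACA") to observed counts."""
    total = 0
    for pattern, n in counts.items():
        if len(pattern) != len(ops):
            raise ValueError("pattern length must equal the number of taxa")
        coeff = 1
        for op, nucleotide in zip(ops, pattern):
            coeff *= OPERATORS[op][nucleotide]
        total += coeff * n
    return total

# Hypothetical three-taxon counts (unlisted patterns have count zero).
toy_counts = {"ACA": 3, "ATA": 1, "GCG": 2, "ACG": 1, "TTT": 5}
u_f2 = invariant(toy_counts, "RYR")  # U_F2 = #R1Y2R3
print(u_f2)
```

For the toy counts, the expansion #ACA − #ACG − #ATA + #ATG − #GCA + #GCG + #GTA − #GTG gives 3 − 1 − 1 + 0 − 0 + 2 + 0 − 0 = 3, which matches the value returned by the function.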

Analyzing the Hypothetical Example in Figure 1

If we return to the data generated in the hypothetical example shown in figure 1, we can now estimate the root location from the hypothetical sequences. From these data, we calculate the values of each of the invariants, as shown in table 2. The expected values of these EP root invariants under each of the possible trees are also shown in the table. Clearly, the observed values most closely match the values expected if the G tree is correct, so we predict that the G tree is the most probable tree. In Materials and Methods, we formalize this prediction by determining the posterior probability, the probability of a tree given the observed sequence data. We find that the probability of the G tree is 99.68%, the probability of the F tree is 0.20%, and the probability of the E tree is 0.12%, results that overwhelmingly support the G tree.

Table 2

Expected and Observed EP Rooting Invariants for Hypothetical Sequence

If E tree is correct			If F tree is correct			If G tree is Correct
Invariant	Expect	Observe	Invariant	Expect	Observe	Invariant	Expect	Observe
U_F1	0	0	U_E1	0	0	U_E1	0	0
U_F2	0	1	U_E2	0	1	U_E2	0	1
U_G1	0	4	U_G1	0	4	U_F1	0	0
U_G2	0	4	U_G2	0	4	U_F2	0	1
U_FG1	0	8	U_EG1	0	6	U_EF1	0	1
U_FG2	0	−1	U_EG2	0	3	U_EF2	0	−1

EP invariants are not restricted to two-, three-, and four-taxon trees (Lake 1987a; Cavender 1989; Nguyen and Speed 1992; Sinsheimer 1994), but can be extended to any number of taxa using the notation developed here (see Sinsheimer 1994). We classify the EP statistics for any number of taxa into three major categories according to their use in statistical inference, namely: (1) for tests of EP assumptions, (2) for distinguishing between rooted trees, and (3) for distinguishing between unrooted trees, that is, topology (see Materials and Methods for details).

A Second Example: Rooting rRNA Trees

Ribosomal RNA sequences are particularly difficult to root because the genes are short, paralogous genes are lacking, and nucleotides evolve faster than amino acids, often making nucleotide sequences too divergent to be useful. Here we demonstrate the usefulness of EP rooting using a particularly difficult example: rooting a deep, three-taxon metazoan tree using only partial 18S rRNA sequences. Today the root of the multicellular animals is fairly well known, and even the most challenging, deeper branching parts of the metazoan tree are being reconstructed (Philippe et al. 2011), unlike when the lophophorates were initially shown to be protostomes (Halanych et al. 1995). Here, using short 18S ribosomal DNA sequences and in the complete absence of an outgroup, we show that the lophophorates are protostomes, using three-taxon EP rooting analyses. In figure 4, three rooted trees correspond to the hypotheses that the lophophorates are protostomes (G tree), deuterostomes (F tree), or an independent, earlier branching lineage (E tree). We use slowly evolving 18S rRNA sequences of representatives of these three groups, namely the inarticulate brachiopod lophophorate, Glottidia pyramidata; the bivalve protostome, Placopecten magellanicus; and the echinoderm deuterostome, Antedon serrata. We first test the balanced transversion assumption using the six goodness-of-fit invariants for three taxa (see Materials and Methods). The test statistic has a chi-square distribution with 6 degrees of freedom (Mood et al. 1974; Nguyen and Speed 1992). The null hypothesis of balanced transversions is not rejected at the 5% significance level in these data.

Figure 4. Rooting a three-taxon metazoan tree. Posterior Bayesian support for the three possible, three-taxon, rooted metazoan trees obtained by analysis of short, nonparalogous, 18S rDNA sequences is listed beneath each of the three rooted trees. In this example, the G tree, rooted in the branch leading to the echinoderm, is strongly supported by the data (P = 99.51%), consistent with the current multicellular animal root, whereas the E and F trees are not significantly supported, P = 0.46% and 0.03%, respectively.

We then determine the posterior probability of each of the rooted trees. There are 76 informative positions, out of a total of 1457 aligned 18S ribosomal DNA positions, that may be used for determining the rooted tree that relates these three organisms. The rooting statistics are shown in table 3 for those invariants that are constrained to zero expected value for the E, F, and G trees, respectively. As described in the Materials and Methods, we find that the posterior probability is 99.51% for the G tree, the tree rooted in the branch leading to the echinoderm; 0.03% for the F tree, the tree rooted in the branch leading to the bivalve; and 0.46% for the E tree, the tree rooted in the branch leading to the inarticulate brachiopod. Because EP rooting uses a different type of sequence information than methods designed to test topologies, these analyses provide independent support for the now well-known result that the lophophorates are protostomes, and not deuterostomes (Halanych et al. 1995; Philippe et al. 2011).

Table 3

Expected and Observed Invariant Values for G. pyramidata, P. magellanicus, and A. serrata

If E tree is correct			If F tree is correct			If G tree is correct
Invariant	Expected	Observed	Invariant	Expected	Observed	Invariant	Expected	Observed
U_F1	0	1	U_E1	0	−4	U_E1	0	−4
U_F2	0	−1	U_E2	0	0	U_E2	0	0
U_G1	0	−7	U_G1	0	−7	U_F1	0	1
U_G2	0	5	U_G2	0	5	U_F2	0	−1
U_FG1	0	−2	U_EG1	0	13	U_EF1	0	−1
U_FG2	0	10	U_EG2	0	−3	U_EF2	0	1

Discussion

EP rooting can recover rooted trees directly from nucleotide sequences in the absence of outgroups, even when sequences are relatively short. Given the large amounts of sequence data available, this raises the possibility that outstanding problems in rooted bilateral animal relationships may be resolved using EP rooting. Furthermore, these rooting analyses are well suited to Bayesian interpretation. The three-taxon test, important and useful in its own right, also illustrates the use of EP rooting invariants to target specific aspects of phylogenetic reconstruction. EP invariants can be classified into three groups based on the phylogenetic information they contain (Sinsheimer 1994). One class is the goodness-of-fit invariants. A second class contains the invariants with topological information but no root information, and a third class contains the statistics with both root and topological information. As is now well known (Sinsheimer et al. 1996), Bayesian predictions, such as those presented here, provide an attractive alternative to classical hypothesis tests for phylogenetic reconstruction. Multiple hypotheses, as in the example here, where three possible trees exist, are better suited to Bayesian analysis than to classical hypothesis testing (Sinsheimer 1994; Sinsheimer et al. 1996). Further, posterior probabilities estimate the probability of each of the trees given the observed data, a result that is far easier to interpret than the P value outcomes of classical hypothesis testing (Burnham and Anderson 1998). A major shortcoming of current phylogenetic analyses is that the roots of many major groups are currently unknown, due primarily to the difficulty of obtaining useful outgroups. And in some cases, such as the origin of life, outgroups are simply not available. We hope that the EP rooting invariants make even deeper explorations of the origin of life possible.

Materials and Methods

General Rules for Defining the Classes of EP Invariants

The same simplifying notation introduced in the previous sections can also be used for EP statistics of any number of taxa. EP statistics are a form of linear invariants. All the work described in this article refers to linear invariants, that is, invariants which allow summing over nucleotide positions. The term linear invariants will therefore be shortened to invariants. In the general s-taxon case, EP statistics are linear combinations of counts made up of the building blocks $U = \#X_1X_2X_3 \ldots X_s$, where for at least one i ∈ {1, … , s} $X_i$ = (A−G) = R, for at least one j ∈ {1, … , s} $X_j$ = (C−T) = Y, and for all other k ∈ {1, … , s} $X_k$ ∈ {R, Y, S, Z}, where S = A+G+C+T and Z = A+G−C−T (Cavender 1989; Nguyen and Speed 1992; Sinsheimer 1994; Sinsheimer et al. 1996). When $X_k$ = (A+G+C+T) = S, the nucleotides for the k-th taxon are summed over, effectively ignoring that branch. Additional EP assumptions are necessary to preserve balanced transversions when the k-th taxon is ignored. These restrictions are P(A|T) + P(C|T) = P(A|C) + P(T|C) and P(T|A) + P(G|A) = P(T|G) + P(A|G), and similar constraints exist for the proportionally balanced transversion model (Cavender 1989; Nguyen and Speed 1992). In the s-taxon case, where s > 2, EP statistics can be partitioned into three classes. The first class consists of goodness-of-fit statistics that are linear combinations of counts $U = \#X_1X_2X_3 \ldots X_s$, where for exactly one i ∈ {1, … , s} $X_i$ = R, for exactly one j ∈ {1, … , s} $X_j$ = Y, and for all other k ∈ {1, … , s} $X_k$ = (A+G+C+T) = S. The second class consists of statistics containing topological information but no rooting information; these are linear combinations of counts $U = \#X_1X_2X_3 \ldots X_s$, where for exactly two taxa i, j ∈ {1, … , s} $X_i = X_j$ = R, for exactly two taxa m, n ∈ {1, … , s} $X_m = X_n$ = Y, and for all other k ∈ {1, … , s} $X_k$ = S. The statistics containing both root position and topological information comprise the third class and include all EP statistics that are not in Class 1 or 2.
Examples of Class 3 statistics include the three-taxon statistics (4)–(15), the four-taxon statistic #R1Y2S3Z4, and the five-taxon statistic #S1R2Y3R4S5. In the interest of saving space, these classifications are not proven here; proofs can be found elsewhere (Sinsheimer 1994; Sinsheimer et al. 1996). The hypothesis tests for goodness-of-fit using the Class 1 statistics (Nguyen and Speed 1992), and for topology using Class 2 statistics, are easily generalized to s taxa (Sinsheimer 1994). For more than three taxa, the s-taxon statistics containing rooting information can also be used to infer the correct rooted tree. In practice, however, inference becomes much more complicated. The expected value patterns of these statistics contain topological information as well as root information. In addition, the number of statistics increases rapidly as the number of taxa increases. For example, there are 92 statistics that contain information to infer the correct four-taxon rooted tree.
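The classification rules above are simple enough to state as a short function. This sketch only encodes the counting rules given in this section (exactly one R and one Y with all other taxa S for Class 1; exactly two R and two Y with all other taxa S for Class 2; everything else Class 3). The function name `ep_class` and the operator-string encoding (for example "RYSZ" for #R1Y2S3Z4) are assumptions made for illustration.

```python
def ep_class(ops):
    """Classify an EP building block given as an operator string.

    Returns 1 (goodness-of-fit), 2 (topology only), or 3 (root and topology).
    """
    ops = ops.upper()
    if set(ops) - set("RYSZ"):
        raise ValueError("operators must be drawn from R, Y, S, Z")
    n_r, n_y = ops.count("R"), ops.count("Y")
    if n_r < 1 or n_y < 1:
        raise ValueError("an EP statistic needs at least one R and one Y")
    # True when every taxon that is not R or Y carries the summation operator S.
    others_are_s = ops.count("S") == len(ops) - n_r - n_y
    if n_r == 1 and n_y == 1 and others_are_s:
        return 1
    if n_r == 2 and n_y == 2 and others_are_s:
        return 2
    return 3

for block in ["RYS", "RRYY", "RYR", "RYSZ", "SRYRS"]:
    print(block, ep_class(block))
```

The last three strings correspond to UF2 = #R1Y2R3 and to the four- and five-taxon examples quoted in the text, and all three fall in Class 3.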

Statistical Inference Using EP Rooting Invariants

In this section we derive the posterior probabilities for inference of the correct three-taxon rooted tree. The three possible rooted trees correspond to three alternative hypotheses (figs. 3 and 4), namely $H_E$: tree E is the true tree, $H_F$: tree F is the true tree, and $H_G$: tree G is the true tree. Under the principles of EP, the hypotheses can be expressed in terms of the expected values of the components of U (table 1). We assume all three trees are equally probable in the absence of prior sequence data, that is, $P(H_r) = 1/3$ where r ∈ {E, F, G}. By Bayes theorem, the probability of a rooted tree given the observed data can be expressed as:

$$P(H_r \mid U) = \frac{P(U \mid H_r)\,P(H_r)}{\sum_{t \in \{E,F,G\}} P(U \mid H_t)\,P(H_t)}, \tag{17}$$

where $P(U \mid H_r)$ is the probability of the observed sequence data given tree r is the true tree. Because we assume each of the rooted trees is equally probable a priori, these posterior probabilities are equivalent to Akaike weights (Burnham and Anderson 1998). Let $M_r$ be the vector of expected values, $E(U)$, under $H_r$, and let $\mu_k$ be the expected value of the k-th statistic comprising U, $U_k$. One approach to inference is to approximate $P(U \mid H_r)$ by the multivariate normal density,

$$P(U \mid H_r) \approx (2\pi)^{-6}\,|\hat{\Omega}_r|^{-1/2} \exp\left[-\tfrac{1}{2}(U - \hat{M}_r)^{T}\,\hat{\Omega}_r^{-1}\,(U - \hat{M}_r)\right], \tag{18}$$

where $\hat{M}_r$ is the sample estimate of $E(U)$ under $H_r$ and $\hat{\Omega}_r$ is the sample estimate of the 12 by 12 variance–covariance matrix under $H_r$ (Mood et al. 1974). Equation (17) then yields

$$P(H_r \mid U) = \frac{|\hat{\Omega}_r|^{-1/2} \exp\left[-\tfrac{1}{2}(U - \hat{M}_r)^{T}\,\hat{\Omega}_r^{-1}\,(U - \hat{M}_r)\right]}{\sum_{t \in \{E,F,G\}} |\hat{\Omega}_t|^{-1/2} \exp\left[-\tfrac{1}{2}(U - \hat{M}_t)^{T}\,\hat{\Omega}_t^{-1}\,(U - \hat{M}_t)\right]}.$$

The matrix $\hat{\Omega}_r$ is composed of estimates of the variances and covariances of the EP statistics. Expressions for these estimates are derived in the next section.
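The final normalization step of the Bayesian calculation can be sketched independently of how the likelihoods are obtained. The example below assumes that the log of P(U | H) has already been computed for each tree (for instance from a multivariate normal approximation); the function name `posteriors` and the numerical log-likelihoods are hypothetical, chosen only to show the mechanics. Working on the log scale and subtracting the maximum avoids numerical underflow.

```python
import math

def posteriors(log_likelihoods, priors=None):
    """Posterior probability of each tree from log P(U | H_r) via Bayes theorem.

    log_likelihoods: dict mapping tree label -> log-likelihood.
    priors: optional dict of prior probabilities; defaults to equal priors.
    """
    trees = list(log_likelihoods)
    if priors is None:
        priors = {t: 1.0 / len(trees) for t in trees}
    log_joint = {t: log_likelihoods[t] + math.log(priors[t]) for t in trees}
    shift = max(log_joint.values())  # stabilizes the exponentials
    weights = {t: math.exp(v - shift) for t, v in log_joint.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

# Hypothetical log-likelihoods for the three rooted trees.
example = posteriors({"E": -30.0, "F": -30.0, "G": -30.0 + math.log(2.0)})
print(example)
```

With equal priors, a tree whose likelihood is twice that of each of the other two receives posterior probability 0.5, and the remaining two receive 0.25 each.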

Variances and Covariances of the Three-Taxon Rooting Invariants

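We first formalize the definition of an EP statistic. For s aligned taxa, there are $4^s$ possible combinations of the four nucleotides at each site. The data consist of the observed nucleotide spectrum $\mathbf{N}$, a $4^s$-vector with components $N_i$ tallying each of these combinations over the observed sites. The sum of the counts, $N = \sum_i N_i$, is equal to the total number of aligned nucleotides. Ignoring site-to-site variation, $\mathbf{N}$ is modeled as a multinomial with $4^s$ categories and cell probabilities $p_i$. We restrict attention to statistics that are linear combinations of these counts, with coefficients −1, 0, or 1. Let U denote such a statistic. U can be written concisely as a vector product, $U = V^{T}\mathbf{N}$, where V is a vector of length $4^s$ whose components are −1, 0, or 1. U is an EP statistic of tree τ if its expected value is zero when tree τ is the true tree. An invariant vector of tree τ, $V_\tau$, is any non-zero vector that, when multiplied by $\mathbf{N}$, generates an EP statistic U when tree τ is the true tree. For example, the two-taxon statistic U1 is the vector product of $\mathbf{N}$ and $V_1$ = (0,0,1,−1,0,0,−1,1,0,0,0,0,0,0,0,0), and U2 is the product of $\mathbf{N}$ and $V_2$ = (0,0,0,0,0,0,0,0,1,−1,0,0,−1,1,0,0). For linear invariants $U_a = V_a^{T}\mathbf{N}$ and $U_b = V_b^{T}\mathbf{N}$, the expected value of $U_a$, the variance of $U_a$, and the covariance of $U_a$ and $U_b$ can be calculated by recalling that for multinomial counts $N_i$ and $N_j$ (i ≠ j), $E(N_i) = Np_i$, $\mathrm{Var}(N_i) = Np_i(1-p_i)$, and $\mathrm{Cov}(N_i, N_j) = -Np_ip_j$ (Mood et al. 1974):

$$\mu_a = E(U_a) = N\sum_i V_{a,i}\,p_i, \tag{19}$$

$$\mathrm{Var}(U_a) = N\left[\sum_i V_{a,i}^{2}\,p_i - \Big(\sum_i V_{a,i}\,p_i\Big)^{2}\right], \tag{20}$$

$$\mathrm{Cov}(U_a, U_b) = N\left[\sum_i V_{a,i}V_{b,i}\,p_i - \Big(\sum_i V_{a,i}\,p_i\Big)\Big(\sum_i V_{b,i}\,p_i\Big)\right]. \tag{21}$$

Estimates of the expected value, variance, and covariance follow from substituting the sample estimate of $p_i$, $\hat{p}_i = N_i/N$, into equations (19–21):

$$\hat{\mu}_a = \sum_i V_{a,i}\,N_i = U_a, \tag{22}$$

$$\widehat{\mathrm{Var}}(U_a) = \sum_i V_{a,i}^{2}\,N_i - \frac{U_a^{2}}{N}, \tag{23}$$

$$\widehat{\mathrm{Cov}}(U_a, U_b) = \sum_i V_{a,i}V_{b,i}\,N_i - \frac{U_a U_b}{N}. \tag{24}$$

When µ is constrained to be zero, (22)–(24) reduce to:

$$\hat{\mu}_a = 0, \tag{25}$$

$$\widehat{\mathrm{Var}}(U_a) = \sum_i V_{a,i}^{2}\,N_i, \tag{26}$$

$$\widehat{\mathrm{Cov}}(U_a, U_b) = \sum_i V_{a,i}V_{b,i}\,N_i. \tag{27}$$

Equations (25–27) were used to construct the test statistics (eqs. 5–16, 17) for the three-taxon case.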
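The plug-in estimates that follow from the multinomial moments quoted above can be sketched as follows. This is an illustrative reading, not the authors' code: for a statistic U = Σ V_i N_i, substituting N_i/N for p_i gives a variance estimate of Σ V_i² N_i − U²/N, and the second term is dropped when the expected value is constrained to zero. The function names and the 16-entry toy spectrum are assumptions; the vectors V1 and V2 are the two-taxon invariant vectors given in the text, with patterns ordered AA, AG, AC, AT, GA, …, TT.

```python
V1 = (0, 0, 1, -1, 0, 0, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0)  # U1 = #AC - #AT - #GC + #GT
V2 = (0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, -1, 1, 0, 0)  # U2 = #CA - #CG - #TA + #TG

def linear_stat(v, counts):
    """U = sum_i v_i * N_i."""
    return sum(c * n for c, n in zip(v, counts))

def cov_estimate(v_a, v_b, counts, mean_zero=False):
    """Plug-in estimate of Cov(U_a, U_b); pass the same vector twice for a variance.

    With mean_zero=True the expected values are constrained to zero,
    so the product-of-means term is dropped.
    """
    total = sum(counts)
    cross = sum(a * b * n for a, b, n in zip(v_a, v_b, counts))
    if mean_zero:
        return cross
    return cross - linear_stat(v_a, counts) * linear_stat(v_b, counts) / total

# Hypothetical two-taxon spectrum (30 sites), ordered AA, AG, AC, AT, GA, ..., TT.
toy = [5, 1, 2, 1, 0, 4, 1, 1, 1, 0, 6, 1, 2, 0, 1, 4]
print(linear_stat(V1, toy), linear_stat(V2, toy))
print(cov_estimate(V1, V1, toy, mean_zero=True), cov_estimate(V1, V2, toy, mean_zero=True))
```

Under the zero-mean constraint the variance estimate of U1 is simply the number of sites showing any of the four patterns AC, AT, GC, GT, and the covariance of U1 and U2 vanishes because the two vectors share no nonzero positions.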
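The expressions for the variances and covariances can be easily calculated by introducing rules of multiplication for the simplified notation, based on the Kronecker product representation of the invariant vectors. For two statistics $U_a = \#W_1W_2 \ldots W_s$ and $U_b = \#V_1V_2 \ldots V_s$,

$$U_a \bullet U_b = \#(W_1 * V_1)(W_2 * V_2) \cdots (W_s * V_s),$$

where • is multiplication branch by branch. Within any branch i, the rules of multiplication, *, follow from multiplying the nucleotide coefficients of the two operators term by term:

$$R * R = A+G, \quad Y * Y = C+T, \quad Z * Z = S, \quad R * Y = 0, \quad R * Z = R, \quad Y * Z = -Y, \quad S * X = X,$$

where X is any operator and where we have dropped the subscript i to reduce the clutter in the notation. If for any branch i $(W_i * V_i) = 0$, then $U_a \bullet U_b = 0$. Using these rules, any of the variance or covariance terms of matrix Ω (eq. 18) can be determined. For example, when its expected value is constrained to zero, the estimated variance of UF2 = #R1Y2R3 is UF2 • UF2 = #(A+G)1(C+T)2(A+G)3, the number of sites with purines in taxa 1 and 3 and a pyrimidine in taxon 2. Following the same logic, expressions for the variances and covariances of the other three-taxon invariants can be derived; for any pair $U_a$ and $U_b$ whose branch-by-branch product contains a zero factor, $U_a \bullet U_b = 0$.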
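The within-branch multiplication can be checked numerically under one assumption, stated here explicitly: that the product * acts term by term on the coefficients each operator assigns to A, G, C, and T, which is what the Kronecker product representation implies for sums of the form Σ V_a,i V_b,i N_i. The names `OPS`, `star`, `bullet`, and `is_zero_product` are introduced only for this sketch.

```python
# Coefficients of (A, G, C, T) for each operator.
OPS = {
    "R": (1, -1, 0, 0),   # A - G
    "Y": (0, 0, 1, -1),   # C - T
    "Z": (1, 1, -1, -1),  # A + G - C - T
    "S": (1, 1, 1, 1),    # A + G + C + T
}

def star(w, v):
    """Within-branch product: multiply the operator coefficients term by term."""
    return tuple(a * b for a, b in zip(OPS[w], OPS[v]))

def bullet(ops_a, ops_b):
    """Branch-by-branch product of two statistics given as operator strings."""
    if len(ops_a) != len(ops_b):
        raise ValueError("statistics must involve the same number of taxa")
    return [star(w, v) for w, v in zip(ops_a, ops_b)]

def is_zero_product(ops_a, ops_b):
    """True if some branch product vanishes, so the whole product is zero."""
    return any(not any(branch) for branch in bullet(ops_a, ops_b))

print(star("R", "R"), star("Y", "Y"), star("R", "Y"))
print(star("R", "Z") == OPS["R"], star("Z", "Z") == OPS["S"])
print(is_zero_product("RYR", "YRY"), is_zero_product("RYR", "RYR"))
```

Under this reading, R*R selects the purines (A+G), Y*Y selects the pyrimidines (C+T), R*Y vanishes, and any pair of statistics that places R against Y on some branch has a zero branch-by-branch product.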

Goodness-of-Fit Invariants and Test Statistics

For three taxa, there are 6 goodness-of-fit invariants, U12A = #R1Y2S3, U12B = #Y1R2S3, U13A = #R1S2Y3, U13B = #Y1S2R3, U23A = #S1R2Y3, and U23B = #S1Y2R3 (Nguyen and Speed 1992; Sinsheimer 1994). Let U be the vector of goodness-of-fit invariants. We approximate the density of U with a multivariate normal density and denote the sample estimate of the six by six variance–covariance matrix as $\hat{\Sigma}$. The test statistic has a quadratic form, $U^{T}\hat{\Sigma}^{-1}U$, and has a chi-square density with 6 degrees of freedom (Rao 1973), which provides a test of the null hypothesis that all the goodness-of-fit invariants have expected value zero. (More generally, when there are r taxa there are r(r−1) goodness-of-fit invariants and the corresponding test statistic has a chi-square density with r(r−1) degrees of freedom.) As in the case of the rooting invariants, we use the within branch multiplication rules (in particular eqs. (29–34)) to derive the entries of $\hat{\Sigma}$. The variance of U12A is #(A+G)1(C+T)2S3 and the variance of U12B is #(C+T)1(A+G)2S3. The variances for the other goodness-of-fit invariants have the same forms, just with permuted taxon labels. The dependence among invariants is reflected in the covariances, the off-diagonal terms of this matrix, which follow from the same multiplication rules; for example, U12A • U12B = 0 because R * Y = 0 (Nguyen and Speed 1992; Sinsheimer 1994).
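The quadratic-form test statistic can be sketched without external libraries by solving a linear system instead of inverting the covariance matrix. Everything below is illustrative: the function names, the invented vector of six invariant values, and the invented diagonal covariance matrix are assumptions, and the constant 12.592 is the standard 5% critical value of a chi-square distribution with 6 degrees of freedom.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def gof_statistic(u, cov):
    """Quadratic form u^T cov^{-1} u."""
    return sum(ui * xi for ui, xi in zip(u, solve(cov, u)))

CHI2_CRIT_6DF_5PCT = 12.592  # upper 5% point of chi-square with 6 df

# Hypothetical goodness-of-fit invariants and a made-up diagonal covariance.
u = [1, -2, 0, 3, -1, 2]
variances = [4, 4, 4, 9, 4, 4]
cov = [[variances[i] if i == j else 0 for j in range(6)] for i in range(6)]
stat = gof_statistic(u, cov)
print(stat, stat > CHI2_CRIT_6DF_5PCT)
```

For these invented numbers the statistic is 3.5, below the critical value, so the null hypothesis of balanced transversions would not be rejected at the 5% level.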

References

1. Evans SN, Zhou X. 1998. Constructing and counting phylogenetic invariants. J Comput Biol.
2. Cavender JA. 1989. Mechanized derivation of linear invariants. Mol Biol Evol. 6:301-316.
3. Rivera MC, Lake JA. 1992. Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science.
4. Sinsheimer JS, Lake JA, Little RJ. 1996. Bayesian hypothesis testing of four-taxon topologies using molecular sequence data. Biometrics.
5. Ferretti V, Lang BF, Sankoff D. 1994. Skewed base compositions, asymmetric transition matrices, and phylogenetic invariants. J Comput Biol.
6. Hendy MD, Penny D. 1996. Complete families of linear invariants for some stochastic models of sequence evolution, with and without the molecular clock assumption. J Comput Biol.