Minimum contradiction matrices in whole genome phylogenies.

Marc Thuillard

Abstract

Minimum contradiction matrices are a useful complement to distance-based phylogenies. A minimum contradiction matrix represents phylogenetic information in the form of an ordered distance matrix $Y^n_{i,j}$. A matrix element corresponds to the distance from a reference vertex n to the path (i, j). For an X-tree or a split network, the minimum contradiction matrix is a Robinson matrix. It therefore fulfills all the inequalities defining perfect order: $Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$ ($i \le j \le k < n$). In real phylogenetic data, some taxa may contradict the inequalities for perfect order. Contradictions to perfect order correspond to deviations from a tree or from a split network topology. Efficient algorithms that search for the best order are presented and tested on whole genome phylogenies with 184 taxa including many Bacteria, Archaea and Eukaryota. After optimization, taxa are classified in their correct domain and phyla. Several significant deviations from perfect order correspond to well-documented evolutionary events.

Keywords:  minimum contradiction; phylogenetic trees; split network; whole genome phylogeny

Year:  2008        PMID: 19204821      PMCID: PMC2614196          DOI: 10.4137/ebo.s909

Source DB:  PubMed          Journal:  Evol Bioinform Online        ISSN: 1176-9343            Impact factor:   1.625


Introduction

The discovery of the importance of lateral transfer, loss and duplication events in the evolution of genetic sequences has motivated the development of new approaches to graphically represent phylogenies. Methods like NeighborNet (Bryant and Moulton, 2004), T-Rex (Makarenkov et al. 2006), SplitsTree (Bandelt and Dress, 1992; Dress and Huson, 2004; Huson, 1998), Qnet (Grünewald et al. 2006), Pyramids (Bertrand and Diday, 1985) and Tree of Life (Kunin et al. 2005a) allow visualizing deviations from a tree topology. All these methods have in common that they summarize the information in the form of a planar network. Deviations from an X-tree are often represented by supplementary edges (Makarenkov et al. 2006; Nakhleh et al. 2004) that create cycles in the graph. Phylogenetic information can be represented by a distance matrix $Y^n_{i,j}$. For an X-tree, the elements of the distance matrix $Y^n_{i,j}$ correspond to the distance from a reference taxon n to the path (i, j). The taxa can be ordered through permutations, so that the distance matrix is a Robinson matrix (Bertrand and Diday, 1985), with values of both rows and columns decreasing away from the diagonal. The corresponding circular order is defined as a perfect order. We have shown with a probabilistic model that perfect order is quite robust against lateral transfer and crossover (Thuillard, 2007). The search for the order minimizing a measure of the deviation from perfect order can be efficiently done with a multi-resolution algorithm (Thuillard, 2001, 2007). The method has been tested on SSU rRNA data for Archaea. The matrix with the best order corresponds quite well to a Robinson matrix.
In this article, the minimum contradiction approach is further developed and applied to whole genome phylogenies. With the availability of complete genomes, many methods have been proposed to determine the evolution of whole genomes (for reviews see Galperin et al. 2006; Delsuc et al. 2005; Henz et al. 2005). The construction of trees from whole genomes has proved over recent years to be a quite difficult task. This is mainly because of the very limited number of genes shared by Archaea, Eukaryota and Bacteria. Furthermore, gene evolution can sometimes be very different from species evolution. The main difficulty consists in finding a good operator to estimate the distance between genomes. Distances have been estimated with measures based on gene order or arrangement (Wolf et al. 2002; Wang et al. 2006), gene content (Fitz-Gibbon and House, 1999; Snel et al. 1999; Korbel et al. 2002), protein domain organization (Fukami-Kobayashi et al. 2007; Yang et al. 2005), folds (Lin and Gerstein, 2007), by combining the information from many genes in a supertree or a superdistance (see Dutilh et al. 2007 for a comparative study) or by using a local alignment search tool such as Blast (Kunin et al. 2005b; Clarke et al. 2002). Among genome distances obtained with Blast, the genome conservation (Kunin et al. 2005b) has furnished some of the best trees to date, if the quality of a whole genome phylogeny is measured by its concordance with broadly accepted classifications. The genome conservation estimates the distance between two taxa using the sum of BlastP reciprocal best hits between two genomes. The method is capable of quite correctly recovering all main phyla. At the phylum level, the evolution of the different genes is sufficiently similar to form a distinct cluster. The main uncertainties in whole genome phylogenies concern the relationships between phyla. Different evolution rates of the genes, gene losses or duplications, and lateral gene transfer may result in large deviations of the distance matrix from a tree topology. In this context, minimum contradiction matrices can furnish information not contained in a single tree or a split network.
The paper is organized as follows. After introducing minimum contradiction matrices in section 2 and their connection to Robinson matrices and Kalmanson inequalities, section 3 explains why the identification of deviations from perfect order is a useful complement to phylogenetic studies. Section 4 presents an algorithm to search for the order minimizing a measure of the deviation from perfect order over all taxa. This order can be interpreted as an average best order over all reference taxa in $Y^n_{i,j}$ (n = 1, …, N). The algorithm is applied in section 5 to distance matrices for whole genome phylogenies obtained with the genome conservation method.

Circular Order and the Minimum Contradiction Approach

Definitions

Let us start by recalling a number of definitions that are necessary to introduce the notion of circular order. A graph G is defined by a set of vertices V(G) and a set of edges E(G). Let us write e(x, y) for the edge between the two vertices x and y. In a graph G, a path P between two vertices x and y is a sequence of non-repeating edges $e(x, z_1), e(z_1, z_2), \ldots, e(z_k, y)$ connecting x to y. The degree of a vertex x is the number of edges e ∈ E(G) to which x belongs. A leaf x of a graph is a vertex of degree one. A vertex of degree larger than one is called an internal vertex. A valued X-tree T is a graph with X as its set of leaves and a unique path between any two distinct vertices x and y, with internal vertices of at most degree 3. The distance d between leaves satisfies the classical triangular inequality, with $d(x_i, x_j)$ representing the sum of the weights on the edges of T in the path connecting $x_i$ and $x_j$. A central problem in phylogeny is to determine if there is an X-tree T and a real-valued weighting of the edges of T that fits a dissimilarity matrix δ. Typically, a dissimilarity matrix δ corresponds to an estimation of the pairwise distance $d(x_i, x_j)$ between all elements in X. A necessary and sufficient condition for the existence of a unique tree is that the dissimilarity matrix δ satisfies the so-called 4-point condition (Buneman, 1971). For any four elements $x_i, x_j, x_k, x_n$ in X, the 4-point condition requires that $d(x_i, x_j) + d(x_k, x_n) \le \max\big(d(x_i, x_k) + d(x_j, x_n),\ d(x_i, x_n) + d(x_j, x_k)\big)$.
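
The 4-point condition can be checked numerically on a small distance matrix. The following is a minimal sketch in Python; the matrix `TREE_D` (a 5-leaf caterpillar tree with unit edge weights), the perturbed matrix `NOISY_D` and the function name `four_point_holds` are hypothetical illustrations and are not taken from the article.

```python
from itertools import combinations

# Path-length distances between the 5 leaves of a small caterpillar tree
# with unit edge weights (hypothetical example data).
TREE_D = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]


def four_point_holds(d, tol=1e-9):
    """Return True if every quadruple of taxa satisfies the 4-point condition."""
    for i, j, k, l in combinations(range(len(d)), 4):
        # The three ways of splitting four taxa into two pairs.
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        # The condition holds for all labelings iff the two largest sums are equal.
        if sums[2] - sums[1] > tol:
            return False
    return True


# Perturbing a single distance breaks the tree structure.
NOISY_D = [row[:] for row in TREE_D]
NOISY_D[0][3] = NOISY_D[3][0] = 6

print(four_point_holds(TREE_D), four_point_holds(NOISY_D))
```

Sorting the three pair sums and comparing the two largest is a common compact way to test the condition for all labelings of a quadruple at once.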

Circular order and Kalmanson inequalities

Consider a planar representation of a tree T or a split network S. A circular order corresponds to an indexing of the n leaves according to a circular (clockwise or anti-clockwise) scanning of the leaves (Barthélemy and Guénoche, 1991; Makarenkov and Leclerc, 1997, 2000; Yushmanov, 1984). In an X-tree, a circular order has the property that for any integer k (modulo n), all the branches on the path $P(x_k, x_{k+1})$ between $x_k$ and $x_{k+1}$ correspond to the left branch (or right branch if anti-clockwise). A circular order can be obtained by considering the distance matrix $Y^n_{i,j}$. As illustrated in Figure 1, the matrix element $Y^n_{i,j} = \tfrac{1}{2}\,\big(d(x_i, x_n) + d(x_j, x_n) - d(x_i, x_j)\big)$ corresponds to the distance between a reference leaf n and the path $P(x_i, x_j)$. A circular order can be computed by ordering the distance matrix $Y^n_{i,j}$ so that it fulfils the inequalities defining a perfect order: $Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$ ($i \le j \le k < n$).
Figure 1

The distance matrix $Y^n_{i,j}$ corresponds to the distance between the leaf n and the path P(i, j).

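The above inequalities also characterize a Robinson matrix (Christopher et al. 1996; Thuillard, 2007). Using the definition of $Y^n_{i,j}$, the inequalities become $d(x_i, x_j) + d(x_k, x_n) \le d(x_i, x_k) + d(x_j, x_n)$ and $d(x_i, x_n) + d(x_j, x_k) \le d(x_i, x_k) + d(x_j, x_n)$ ($i \le j \le k \le n$). These inequalities have a similar form to the 4-point condition and are known as the Kalmanson inequalities.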
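
The relation between the matrix Y, perfect order and the Kalmanson inequalities can be illustrated on a toy example. The sketch below assumes the definition Y = ½ (d(i, n) + d(j, n) − d(i, j)) used in this section; the example matrix `TREE_D`, the helper names and the choice of rotating the circular order so that the reference comes last are assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical example: leaf-to-leaf distances on a 5-leaf caterpillar tree
# with unit edge weights, indexed along a circular order of the leaves.
TREE_D = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]


def reorder(d, order):
    """Return the distance matrix with rows and columns permuted by `order`."""
    return [[d[a][b] for b in order] for a in order]


def y_matrix(d, ref):
    """Y[i][j] = (d(i, ref) + d(j, ref) - d(i, j)) / 2 over all taxa except ref."""
    n = len(d)
    # Rotate the circular order so that it starts right after the reference.
    others = [(ref + 1 + t) % n for t in range(n - 1)]
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in others] for i in others]


def is_perfect_order(y, tol=1e-9):
    """Check Y[i][j] >= Y[i][k] and Y[k][j] >= Y[k][i] for all i <= j <= k."""
    m = len(y)
    for i in range(m):
        for j in range(i, m):
            for k in range(j, m):
                if y[i][j] < y[i][k] - tol or y[k][j] < y[k][i] - tol:
                    return False
    return True


def is_kalmanson(d, tol=1e-9):
    """Check both Kalmanson inequalities for all index quadruples i < j < k < l."""
    for i, j, k, l in combinations(range(len(d)), 4):
        cross = d[i][k] + d[j][l]
        if d[i][j] + d[k][l] > cross + tol or d[i][l] + d[j][k] > cross + tol:
            return False
    return True


# Exchanging taxa 1 and 2 destroys the circular order of the tree.
SWAPPED_D = reorder(TREE_D, [0, 2, 1, 3, 4])

print(is_perfect_order(y_matrix(TREE_D, 4)), is_kalmanson(TREE_D))
print(is_perfect_order(y_matrix(SWAPPED_D, 4)), is_kalmanson(SWAPPED_D))
```

On this example both tests agree: the tree order passes and the swapped order fails.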

Minimum contradiction matrix

In real applications, the distance matrix $Y^n_{i,j}$ often fulfills the inequalities corresponding to a perfect order only partially. The contradiction C on the order of the taxa measures how strongly these inequalities are violated. The best order of a distance matrix is, by definition, the order minimizing the contradiction. The ordered matrix $Y^n_{i,j}$ corresponding to the best order is defined as the minimum contradiction matrix for the reference taxon n. For a perfectly ordered X-tree, the contradiction C is zero. A tree with a low contradiction value C is a tree that can be trusted, while a high contradiction value C is the indication of a distance matrix deviating significantly from an X-tree.
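
One plausible way to turn the violated inequalities into a single number is sketched below. The exact functional form of the contradiction is not reproduced in this text, so the penalty used here (positive part of each violated inequality raised to a power `beta`) is an assumption; the names `contradiction` and `y_matrix` and the example data are also hypothetical.

```python
# Hypothetical example data: 5-leaf caterpillar tree, unit edge weights.
TREE_D = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]


def y_matrix(d, order):
    """Y for the taxa in `order`, using the LAST taxon of `order` as reference."""
    ref, others = order[-1], order[:-1]
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in others] for i in others]


def contradiction(y, beta=2.0):
    """ASSUMED penalty: sum over i <= j <= k of the violations of
    Y[i][j] >= Y[i][k] and Y[k][j] >= Y[k][i], each raised to the power beta."""
    total = 0.0
    m = len(y)
    for i in range(m):
        for j in range(i, m):
            for k in range(j, m):
                total += max(y[i][k] - y[i][j], 0.0) ** beta
                total += max(y[k][i] - y[k][j], 0.0) ** beta
    return total


tree_order = [0, 1, 2, 3, 4]      # a circular order of the tree
swapped_order = [0, 2, 1, 3, 4]   # taxa 1 and 2 exchanged

print(contradiction(y_matrix(TREE_D, tree_order)))
print(contradiction(y_matrix(TREE_D, swapped_order)))
```

With this penalty the tree order has zero contradiction, and any order violating an inequality gets a strictly positive value.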

Why Is Perfect Order an Important Property?

Kalmanson inequalities are at the center of a number of important results relating convexity (Kalmanson, 1975), the Traveling Salesman Problem (TSP) (Deineko et al. 1995; Korostensky and Gonnet, 2000), and phylogenetic trees and networks (Christopher et al. 1996; Dress and Huson, 2004). Let us explain why perfect order is an important property.
– If the error on the distance in an X-tree is not greater than $x_{min}/2$, with $x_{min}$ the shortest edge on the tree, then the Neighbor-Joining algorithm will recover the correct tree topology and Kalmanson inequalities hold (Atteson, 1999; Korostensky and Gonnet, 2000).
– If a distance matrix d fulfills Kalmanson inequalities, then the distance matrix can be exactly represented by a split network (Bandelt and Dress, 1992).
– If Kalmanson inequalities are fulfilled, then the tour (1, 2, …, n) corresponds to a solution of the Traveling Salesman Problem (Christopher et al. 1996).
The last result can be demonstrated starting from the sum $\sum_{i=1}^{n-2} Y^n_{i,i+1}$. When Kalmanson inequalities are fulfilled, the sum is maximized, as $Y^n_{i,i+1} \ge Y^n_{i,i+m}$ ($i + m \le n$, $m > 1$). Developing the sum with the definition of $Y^n_{i,j}$, one gets $\sum_{i=1}^{n-2} Y^n_{i,i+1} = \sum_{i=1}^{n-1} d(x_i, x_n) - \tfrac{1}{2}\big(\sum_{i=1}^{n-1} d(x_i, x_{i+1}) + d(x_n, x_1)\big)$. The first sum is independent of the order and one concludes that a perfect order minimizes the tour length $\sum_{i=1}^{n-1} d(x_i, x_{i+1}) + d(x_n, x_1)$. The tour (1, 2, …, n) is therefore a solution of the TSP. The solution to the TSP has the Master Tour property (Deineko et al. 1995). A Master Tour is a solution of the TSP with the property that the optimal tour restricted to a subset of points is also a solution of the reduced TSP. This result follows directly from the inequalities for perfect order $Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$ ($i \le j \le k < n$). Any restriction of a perfectly ordered distance matrix $Y^n_{i,j}$ to a subset of taxa is perfectly ordered and consequently is a solution to the reduced TSP.
In contrast to this result, one finds with numerical experiments that, if the minimum contradiction matrix does not fulfill the inequalities for perfect order, the best order is not always preserved when a number of taxa are removed. The order minimizing the contradiction over n taxa does not always minimize the contradiction when restricted to a subset of taxa. It follows that one cannot exclude that the topology of a tree or a split network may change when taxa contradicting perfect order are removed. Deviations from perfect order correspond to problematic regions that have to be interpreted very carefully. For that reason we suggest that minimum contradiction matrices are a useful complement to any distance-based phylogeny.
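
The TSP and Master Tour statements can be verified by exhaustive search on a toy matrix. The sketch below is only feasible for a handful of taxa and is not the method of the article; the example matrix and all function names are hypothetical.

```python
from itertools import combinations, permutations

# Hypothetical example data: 5-leaf caterpillar tree, unit edge weights,
# indexed along a circular order (so the Kalmanson inequalities hold).
TREE_D = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]


def tour_length(d, tour):
    """Length of the closed tour visiting the taxa in the given order."""
    return sum(d[tour[t]][tour[(t + 1) % len(tour)]] for t in range(len(tour)))


def shortest_tour_length(d, taxa):
    """Brute-force TSP over the given taxa (first taxon fixed as start)."""
    first, rest = taxa[0], taxa[1:]
    return min(tour_length(d, (first,) + p) for p in permutations(rest))


def has_master_tour(d):
    """True if the index order, restricted to ANY subset, is an optimal tour."""
    n = len(d)
    for size in range(3, n + 1):
        for subset in combinations(range(n), size):
            if tour_length(d, subset) > shortest_tour_length(d, subset):
                return False
    return True


# Same distances with taxa 1 and 2 exchanged: the index order is no longer circular.
SWAPPED_D = [[TREE_D[a][b] for b in (0, 2, 1, 3, 4)] for a in (0, 2, 1, 3, 4)]

print(tour_length(TREE_D, (0, 1, 2, 3, 4)), has_master_tour(TREE_D))
print(tour_length(SWAPPED_D, (0, 1, 2, 3, 4)), has_master_tour(SWAPPED_D))
```

For the tree order, the tour length equals twice the total edge weight of the tree, which is the smallest value any tour can reach on a tree metric.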

Searching for the Best Order in Whole Genome Phylogenies

Fast algorithm to search for the best order

The choice of the reference taxon n in $Y^n_{i,j}$ can significantly influence the best order when the distance matrix cannot be perfectly ordered. For that reason, an average best order is determined by minimizing the contradiction over all reference taxa. The contradiction over all n reference taxa is expressed with the indices $i(m) = \mathrm{mod}(m + i_0 - 2, n) + 1$, $j(m) = \mathrm{mod}(m + j_0 - 2, n) + 1$, $n(m) = n_0 - m + 1$ and $\beta = 2$. The best order is the order $(1, \ldots, i_0, \ldots, j_0, \ldots, n_0)$ minimizing the contradiction. The computation of the contradiction requires $O(n^4)$ operations. For a large ensemble of taxa, the computational cost may become quite high. We will therefore introduce below an algorithm requiring only $O(n^3)$ operations to compute a (slightly different) measure of the contradiction. Let us start by considering an X-tree and the 3 vertices i, j, k as in Figure 2. The distance matrix fulfills the inequalities for perfect order. The order between the vertices i, j, k is preserved for any reference vertex not in the interval (i, k), and the inequalities $Y^n_{i,j} \ge Y^n_{i,k}$ and $Y^n_{k,j} \ge Y^n_{k,i}$ hold for n = 1, …, i, k, …, N. The inequalities can be summed up over all these n and one obtains two new inequalities: $\sum_{n \le i \,\mathrm{or}\, n \ge k} Y^n_{i,j} \ge \sum_{n \le i \,\mathrm{or}\, n \ge k} Y^n_{i,k}$ and $\sum_{n \le i \,\mathrm{or}\, n \ge k} Y^n_{k,j} \ge \sum_{n \le i \,\mathrm{or}\, n \ge k} Y^n_{k,i}$.
Figure 2

The inequalities $Y^n_{i,j} \ge Y^n_{i,k}$ are fulfilled for any reference vertex n with n ≥ k or n ≤ i.

Based on these summed inequalities, the contradiction $c_{i,j}$ between the vertices i, j is defined as the sum of two terms, and the best order is the order minimizing the resulting total contradiction. Computing the contradiction requires $O(n^3)$ operations. As the computation of the contradiction is the most computer-intensive step, the algorithm requires approximately n times less computing time than the $O(n^4)$ algorithm. The quantities $S_a$ and $S_b$ in Eq. (6) can be related to the NJ algorithm. For 3 consecutive vertices (i, j = i + 1, k = i + 2), Eq. (6a) can be rewritten, assuming perfect order, in terms of the quantity $S_{i,j} = r_i + r_j - (N - 2)\, d(x_i, x_j)$ with $r_i = \sum_k d(x_i, x_k)$. The value $S_{i,j}$ is central to the NJ algorithm (Saitou and Nei, 1987; Gascuel and Steel, 2006). Two vertices i, j are joined by the NJ algorithm if they maximize S (i.e. $\max(S) = S_{i,j}$). From the above discussion, it seems natural to initialize the search for the best order on the NJ tree. The search for the best order of $Y^n_{i,j}$ is initialized with the NJ algorithm and a small supplementary procedure that we describe below. Consider two vertices a and b that are joined by the NJ algorithm and the leaves $a_1, a_2, \ldots, a_p$ (resp. $b_1, b_2, \ldots, b_q$) that have the vertex a (resp. b) as first ancestor. The best order of the leaves is chosen so as to minimize the contradiction among 4 possibilities ($ab$, $\bar{a}b$, $a\bar{b}$, $\bar{a}\bar{b}$), with $ab$ the order $a_1, a_2, \ldots, a_p, b_1, b_2, \ldots, b_q$ and $\bar{a}$ the inverted order $a_p, a_{p-1}, \ldots, a_1$. Once the order is optimized over the NJ tree, the best order is refined with a multiresolution search algorithm (Thuillard, 2001, 2007).
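
The NJ selection criterion mentioned above can be sketched in a few lines. The code assumes the standard form S(i, j) = r(i) + r(j) − (N − 2) d(i, j), with r(i) the sum of the distances from taxon i to all other taxa; the example matrix and the function names `nj_scores` and `best_pair` are hypothetical. It covers only the selection of the pair to join, not the full NJ algorithm or the orientation step.

```python
# Hypothetical example data: 5-leaf caterpillar tree, unit edge weights.
# Leaves 0 and 1 form one cherry, leaves 3 and 4 the other.
TREE_D = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]


def nj_scores(d):
    """S[i, j] = r_i + r_j - (N - 2) * d[i][j], with r_i the i-th row sum."""
    n = len(d)
    r = [sum(row) for row in d]
    return {
        (i, j): r[i] + r[j] - (n - 2) * d[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    }


def best_pair(d):
    """Pair of taxa maximizing S (the first such pair in case of ties)."""
    scores = nj_scores(d)
    return max(scores, key=scores.get)


print(nj_scores(TREE_D))
print(best_pair(TREE_D))
```

On this toy tree the two cherries obtain the same maximal score, so either of them could be joined first.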

Similarity matrix for whole genomes phylogenies

For whole genome phylogenies, the search for appropriate measures to estimate the evolutionary distance between taxa is still the subject of significant research efforts (Korbel et al. 2002; Kunin et al. 2005b; Yang et al. 2005; Fukami-Kobayashi et al. 2007). Distance matrices obtained from BlastP scores have been quite successful in generating good trees. The similarity score obtained with BlastP programs can be given a probabilistic interpretation. The statistics of high scoring segments in the absence of gaps tends to an extreme value distribution (Karlin and Altschul, 1990). The probability P of finding at least one high scoring segment is well approximated, for small values of P, by the formula $P = m_1 \cdot m_2 \cdot 2^{-\mathrm{Score}}$, with $m_1$, $m_2$ the lengths of the 2 sequences. It follows that $\mathrm{Score} = -\log_2 P + \log_2(m_1 \cdot m_2)$. Defining the distance d between two sequences as d = −Score and assuming equal lengths m, one has $d = \log_2(P/m^2)$. Using that definition, the distance matrix $Y^n_{i,j}$ becomes, for 3 sequences, $Y^n_{i,j} = \tfrac{1}{2} \log_2\big(P_{i,n} \cdot P_{j,n} / (P_{i,j} \cdot m^2)\big)$. The log term has the form of a mutual information and furnishes a measure of the similarity of the genomes i and j in reference to the genome n. Different approaches have been proposed to normalize the distance matrix using the marginal entropy (Kraskov et al. 2005), the self-score (Kunin et al. 2005b), Korbel normalization (Korbel et al. 2002) or the average score. The normalization by the self-score in the genome conservation gives some of the best results. It is based on a nonlinear weighted sum of the BlastP scores. The genome conservation method computes the distance between two taxa by normalizing the sum of reciprocal best hits between genomes i and j by the self-score. The effect of duplication is limited by using only reciprocal best hits. The normalization by the self-score is important to correct, at least partially, the effect of different genome sizes. The genome conservation similarity matrix $S_{i,j}$ is computed from $\sum(i,j)$, the sum of reciprocal best hits between the genomes of the two taxa, normalized by the self-score.
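
The step from score-based distances to the log term can be checked numerically. The sketch below assumes d = log2(P/m²) for sequences of equal length m, as in the text, and the general definition Y = ½ (d(i, n) + d(j, n) − d(i, j)); the probability values and function names are hypothetical.

```python
import math


def blast_distance(p, m):
    """d = -Score = log2(P / m^2) for two sequences of equal length m."""
    return math.log2(p / m**2)


def y_from_probabilities(p_in, p_jn, p_ij, m):
    """Y for three sequences i, j, n written directly as a single log term."""
    return 0.5 * math.log2(p_in * p_jn / (p_ij * m**2))


# Hypothetical probabilities for three sequences of length 1000.
m = 1000
p_in, p_jn, p_ij = 1e-30, 1e-25, 1e-40

d_in = blast_distance(p_in, m)
d_jn = blast_distance(p_jn, m)
d_ij = blast_distance(p_ij, m)

y_from_distances = 0.5 * (d_in + d_jn - d_ij)
y_closed_form = y_from_probabilities(p_in, p_jn, p_ij, m)
print(y_from_distances, y_closed_form)
```

Both routes give the same value up to rounding, which confirms that the three distances collapse into one logarithm of a probability ratio.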

Whole Genome Phylogenies

Search for the best average order

The algorithms described in section 4 have been used to search for the best order. The distance matrix was computed using the data furnished by the genome phylogeny server (Kunin et al. 2005b) obtained with an e-value cut-off set to $10^{-10}$. The contradiction is significantly lower with the score $(1 - S_{i,j})$ than with the logarithm of the score. Figure 3 shows the best order after optimization with the algorithms described in section 4 followed by 5000 steps of the multiresolution search algorithm using Eq. (7) to compute the contradiction.
Figure 3

Minimum contradiction matrices corresponding to the best order found after optimization with Eqs. (6) and (7). The contradiction is minimized over the lines of the matrix (left) and the columns (right).
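
For a handful of taxa, the idea of an average best order can be illustrated by exhaustive search. The sketch below is a stand-in and is not the NJ-initialized multiresolution search used in the article; the penalty (squared positive part of each violated inequality, summed over all reference taxa) is an assumed form, and the example data and function names are hypothetical.

```python
from itertools import permutations

# Hypothetical example data: 5-leaf caterpillar tree, unit edge weights.
TREE_D = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]


def contradiction_for_reference(d, order, ref_pos, beta=2.0):
    """ASSUMED penalty for one reference taxon, taken at position ref_pos."""
    n = len(order)
    ref = order[ref_pos]
    # Read the remaining taxa along the circular order, starting after ref.
    others = [order[(ref_pos + 1 + t) % n] for t in range(n - 1)]
    y = [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in others] for i in others]
    total = 0.0
    for i in range(n - 1):
        for j in range(i, n - 1):
            for k in range(j, n - 1):
                total += max(y[i][k] - y[i][j], 0.0) ** beta
                total += max(y[k][i] - y[k][j], 0.0) ** beta
    return total


def total_contradiction(d, order, beta=2.0):
    """Sum of the penalties over all reference taxa (O(n^4) operations)."""
    return sum(contradiction_for_reference(d, order, p, beta) for p in range(len(order)))


def brute_force_best_order(d):
    """Exhaustive search; taxon 0 is fixed because rotations are equivalent."""
    n = len(d)
    candidates = ((0,) + p for p in permutations(range(1, n)))
    return min(candidates, key=lambda order: total_contradiction(d, order))


# Shuffle the taxa, then try to recover a circular order of the tree.
shuffle = [2, 0, 4, 1, 3]
SHUFFLED_D = [[TREE_D[a][b] for b in shuffle] for a in shuffle]

best = brute_force_best_order(SHUFFLED_D)
print(best, total_contradiction(SHUFFLED_D, best))
```

The factorial cost makes this usable only as a check on tiny inputs, which is why a heuristic search is needed for 184 taxa.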

Table 1 gives the order of the different taxa corresponding to the best order. Archaea and Eukaryota are grouped into two adjacent clusters of taxa. One observes, for Bacteria, that all the members of a class or a phylum are neighbors. All proteobacteria (together with Aquifex?) are grouped together. The best order obtained with the minimum contradiction approach differs from the NJ tree in the following respect: all spirochetes and δ-proteobacteria form a cluster. This is not the case in the NJ tree.
Table 1

Best Order (Fig. 3, 4).

1. α-Proteobacteria: taxa 1–14
2. γ-Proteobacteria: taxa 15–18
3. β-Proteobacteria: taxa 19–29
4. γ-Proteobacteria: taxa 30–54
5. ɛ-Proteobacteria: taxa 55–59
6. Aquificae: taxon 60
7. δ-Proteobacteria: taxa 61–63
8. Chlorobi: taxon 64
9. Bacteroidetes: taxa 65–66
10. Spirochetes: taxa 67–71
11. Thermotogae: taxon 72
12. Fusobacteria: taxon 73
13. Firmicutes: taxa 74–116
14. Eukaryota: taxa 117–135
15. Archaea: taxa 136–152
16. Actinobacteria: taxa 153–166
17. Deinococcus-Thermus: taxa 167–168
18. Cyanobacteria: taxa 169–176
19. Planctomycetes: taxon 177
20. Chlamydiae: taxa 178–184

(see annex for detailed list of taxa).
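
When working with the best order, it can be convenient to map a taxon index back to its group. The following sketch simply encodes the index ranges of Table 1; the names `BEST_ORDER_GROUPS` and `group_of` are hypothetical.

```python
# Index ranges of Table 1 (first index, last index, group name).
BEST_ORDER_GROUPS = [
    (1, 14, "α-Proteobacteria"),
    (15, 18, "γ-Proteobacteria"),
    (19, 29, "β-Proteobacteria"),
    (30, 54, "γ-Proteobacteria"),
    (55, 59, "ɛ-Proteobacteria"),
    (60, 60, "Aquificae"),
    (61, 63, "δ-Proteobacteria"),
    (64, 64, "Chlorobi"),
    (65, 66, "Bacteroidetes"),
    (67, 71, "Spirochetes"),
    (72, 72, "Thermotogae"),
    (73, 73, "Fusobacteria"),
    (74, 116, "Firmicutes"),
    (117, 135, "Eukaryota"),
    (136, 152, "Archaea"),
    (153, 166, "Actinobacteria"),
    (167, 168, "Deinococcus-Thermus"),
    (169, 176, "Cyanobacteria"),
    (177, 177, "Planctomycetes"),
    (178, 184, "Chlamydiae"),
]


def group_of(taxon):
    """Return the group of Table 1 that contains the given taxon index."""
    for first, last, name in BEST_ORDER_GROUPS:
        if first <= taxon <= last:
            return name
    raise ValueError(f"taxon index {taxon} is outside 1..184")


print(group_of(177))
```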

Interpreting minimum contradiction matrices

This article focuses on the mathematical aspects of minimum contradiction matrices. We will limit the discussion to 3 examples showing how to interpret minimum contradiction matrices. The matrix $Y^n_{i,j}$ can be imaged for different reference taxa using the best order of Figure 3 given in the annex. Figure 4 shows the matrix $Y^n_{i,j}$ using Pirellula (taxon 177) as reference taxon. The scale on the right of the figure gives the color code used to represent $Y^n_{i,j}$ after rescaling. The minimum value of $Y^n_{i,j}$ corresponds to dark blue, while the largest values are coded red. Low values of $Y^n_{i,j}$ are associated with two vertices (i, j) having a first common ancestor vertex close to the reference taxon. A cluster of adjacent taxa with large values (red cluster) can be interpreted as a group of close taxa. One observes that Archaea and Eukaryota are not only adjacent but also form a cluster.
Figure 4

Distance matrix $Y^n_{i,j}$ using the best order in Figure 3 and Pirellula (taxon 177) as reference taxon.

The best order in Figure 3 is obtained by minimizing the contradiction using all taxa as reference vertex at least once. The best order is therefore a kind of “average” best order. The matrix $Y^n_{i,j}$, with n corresponding to a unique taxon, or its counterpart computed over a group of reference taxa belonging to some phylum, allows the identification of large contradictions from the best order. These contradictions can often be specifically related to the reference taxon. A loss of a gene, a lateral gene transfer or a crossover in the reference taxon modifies all elements of the distance matrix $Y^n_{i,j}$. A similar perturbation on a taxon that is not a reference taxon affects at most the row and the column corresponding to that taxon. Many contradictions in Figure 5 can be associated with well accepted endosymbiotic events (chloroplasts in plants or mitochondria in Eukaryota). Figure 5a shows $Y^n_{i,j}$ for Archaea, Eukaryota and some Bacteria (taxa 72–116) using Rickettsiales (taxa 1–4 in annex) as reference taxa. The average best order is used to order the taxa. Contradictions on the order of the taxa are identified by looking for regions with $Y^n_{i,j}$ increasing away from the diagonal (i.e. $Y^n_{i,j} < Y^n_{i,k}$, $i < j < k < n$). Contradictions are observed for i = Bacteria (without Mycoplasma), j = Eukaryota. One observes that $Y^n_{i,j}$ decreases away from the diagonal except between Eukaryota and Archaea (dark blue compared to light blue for Archaea). This result is, at first glance, somewhat surprising. Similar values of $Y^n_{i,j}$ for Archaea and Eukaryota are expected when i, n correspond to Bacteria. The low values for Eukaryota can be explained by a lateral transfer between the Rickettsiales and Eukaryota. We have shown with a probabilistic model (Thuillard, 2007) that a lateral transfer between the reference taxa and some taxa reduces the expected values of $Y^n_{i,j}$ for those taxa.
In this model, the expected value $\hat{Y}$ after an α-lateral transfer is given by $\hat{Y}^R_{E_1,E_2} = (1 - \alpha) \cdot Y^R_{E_1,E_2} + \alpha \cdot Y^R_{R_1,R_2} \le Y^R_{E_1,E_2}$, with α the proportion of the genome laterally transferred (α ≤ 1) from the reference taxa R, and $R_1$, $R_2$ the laterally transferred sequence after further evolution into the Eukaryota genomes $E_1$, $E_2$. The observed contradiction and the small values of $Y^n_{i,j}$ for Eukaryota are consistent with a lateral transfer between the reference taxa (Rickettsiales) and Eukaryota. Let us recall here that mitochondria are believed to be the result of an endosymbiotic event involving Rickettsia (Timmis et al. 2004), an event that also resulted in the transfer of some Rickettsia genes into the nucleus of the host.
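
As a purely illustrative numerical example of this mixture, write $Y_E$ for the value before the transfer and $Y_R$ for the value associated with the transferred sequences. This shorthand and the numbers below are hypothetical and are not taken from the data. With α = 0.2, $Y_E$ = 0.50 and $Y_R$ = 0.10:

```latex
\hat{Y} = (1-\alpha)\,Y_E + \alpha\,Y_R
        = 0.8 \times 0.50 + 0.2 \times 0.10
        = 0.42 \;\le\; 0.50 = Y_E
```

A transfer of 20% of the genome would thus lower the expected matrix element by 16% in this example, and the reduction grows linearly with α as long as $Y_R < Y_E$.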
Figure 5

Distance matrix $Y^n_{i,j}$ for a) Rickettsiales (taxa 1–4) as reference taxa and taxa 72–152 in Figure 3; b) Eukaryota using Cyanobacteria as reference taxa. The arrow points to Arabidopsis and Cyanidioschyzon.

Figure 5b shows the distance matrix using all Cyanobacteria as reference taxa. The elements associated with Arabidopsis and Cyanidioschyzon have lower values than both adjacent lines (resp. columns). The observed contradictions for Arabidopsis and Cyanidioschyzon merolae (a plant and a red alga) may be explained by the many genes that are found in both Cyanobacteria and plants/red algae but are absent in other Eukaryota, a hypothesis that is supported by the small value of the distance between Cyanobacteria and (Arabidopsis, Cyanidioschyzon). Chloroplasts in plants and red algae are generally considered to have originated as endosymbiotic Cyanobacteria. The low values of $Y^n_{i,j}$ for i = Arabidopsis, Cyanidioschyzon are compatible with the hypothesis that some Cyanobacteria genes have been transferred into the host.

Conclusions

For an X-tree or a split network, the minimum contradiction matrix fulfills all the inequalities defining perfect order (i.e. $Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$, $i \le j \le k \le n$). In real applications, a number of taxa may typically be in contradiction to the inequalities for perfect order. In that case, the Master Tour property does not hold. It follows that the removal or the addition of taxa in contradiction to the inequalities may change the topology of the associated NJ tree or split network. An average best order can be obtained by searching for the best circular order over $Y^n_{i,j}$ (n = 1, …, N). The matrix $Y^n_{i,j}$ can be used to localize a problematic taxon, as large deviations from the average best order are often related to the reference taxon n. This approach was applied to whole genome phylogenies using distances computed with the genome conservation method. Several large deviations from the average best order were found to correspond to well-documented evolutionary events.

References (10 of 28 shown)

1. S T Fitz-Gibbon; C H House. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999-11-01.
2. G D Paul Clarke; Robert G Beiko; Mark A Ragan; Robert L Charlebois. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol. 2002-04.
3. Yuri I Wolf; Igor B Rogozin; Nick V Grishin; Eugene V Koonin. Genome trees and the tree of life (Review). Trends Genet. 2002-09.
4. Song Yang; Russell F Doolittle; Philip E Bourne. Phylogeny determined by protein domain content. Proc Natl Acad Sci U S A. 2005-01-03.
5. Frédéric Delsuc; Henner Brinkmann; Hervé Philippe. Phylogenomics and the reconstruction of the tree of life (Review). Nat Rev Genet. 2005-05.
6. Olivier Gascuel; Mike Steel. Neighbor-joining revealed (Review). Mol Biol Evol. 2006-07-28.
7. B Snel; P Bork; M A Huynen. Genome phylogeny based on gene content. Nat Genet. 1999-01.
8. S Karlin; S F Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 1990-03.
9. N Saitou; M Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987-07.
10. Marc Thuillard. Minimizing contradictions on circular order of phylogenic trees. Evol Bioinform Online. 2007-10-11.
