
The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete.

Gabriel Cardona, Mercè Llabrés, Francesc Rosselló, Gabriel Valiente.

Abstract

Several polynomial time computable metrics on the class of semibinary tree-sibling time consistent phylogenetic networks are available in the literature; in particular, the problem of deciding if two networks of this kind are isomorphic is in P. In this paper, we show that if we remove the semibinarity condition, then the problem becomes much harder. More precisely, we prove that the isomorphism problem for generic tree-sibling time consistent phylogenetic networks is polynomially equivalent to the graph isomorphism problem. Since the latter is believed not to belong to P, the chances are that it is impossible to define a metric on the class of all tree-sibling time consistent phylogenetic networks that can be computed in polynomial time.

Year:  2014        PMID: 24982934      PMCID: PMC3996867          DOI: 10.1155/2014/254279

Source DB:  PubMed          Journal:  ScientificWorldJournal        ISSN: 1537-744X


1. Introduction

After the realization that reticulation processes, like hybridizations, recombinations, or lateral gene transfers, have been more relevant in the evolution of life on Earth than previously thought [1], there has been a growing interest in the development of algorithms for the reconstruction of phylogenetic networks: graphical models of evolutionary histories that go beyond phylogenetic trees by including hybrid nodes of in-degree greater than one representing reticulation events. As the number of such algorithms increases, so does the need for methods for the comparison of phylogenetic networks, as they are used, for instance, to assess the reliability and robustness of these algorithms [2, 3]. One of the types of phylogenetic networks for which reconstruction methods exist [4, 5] is the class of tree-sibling time consistent networks (TSTC networks, for short; see [6] for a formal definition). Two metrics on the class of all semibinary TSTC networks, where all hybrid nodes have in-degree two, have been proposed in recent years. Both metrics are based on encodings of phylogenetic networks that turn out to single out any TSTC network among all such networks: the μ-vectors (where each node in the network is represented by the vector of numbers of paths from it to each leaf) [7] and the nested labels (where each node in the network is represented by a certain Newick-like representation of the subnetwork rooted at it) [8, 9]. Actually, this last metric turns out to be sound also for the class of all semibinary tree-sibling networks, without the time-consistency restriction [6]. But, although there have been several attempts to define a metric on the class of all TSTC networks on a given set of taxa [10], none of the metrics for phylogenetic networks computable in polynomial time proposed so far satisfies the separation axiom (distance 0 means isomorphism) for TSTC networks [8, 11, 12].

In this paper we show why this should come as no surprise: such a metric would solve the graph isomorphism problem in polynomial time. The graph isomorphism problem is one of the most important decision problems whose computational complexity is not yet known [13, 14]. It is believed to be neither in P nor NP-complete, and subexponential time solutions for it are known. A problem is said to be graph isomorphism-complete when it is polynomially equivalent to the graph isomorphism problem [15]. In this paper we show that, for every set S with more than two elements, the isomorphism problem for TSTC phylogenetic networks with taxa bijectively labeled in S is graph isomorphism-complete.

2. Preliminaries

Let G = (V, E) be a nonempty rooted directed acyclic graph (an rDAG, for short). A node of G is a leaf if it has out-degree 0, internal if its out-degree is ⩾1, of tree type if its in-degree is ⩽1, of hybrid type if its in-degree is >1, and elementary if it is a tree node of out-degree 1. A node v is a child of another node u (and, hence, u is a parent of v) if (u, v) ∈ E. Two nodes u and v are siblings of each other if they share a parent. An arc (u, v) in an rDAG is a tree arc when v is a tree node and a hybridization arc when v is a hybrid node. The height of a node v is the length of a longest directed path from v to a leaf, and the depth of v is the length of a longest directed path from the root to v.

Given a finite set S of labels, an S-rDAG is an rDAG with its leaves injectively labelled by S. By an isomorphism of S-rDAGs we understand an isomorphism of directed graphs that preserves and reflects the labelling, that is, that matches each leaf in one network with the leaf with the same label in the other network. In an S-rDAG, we will always identify without any further reference every leaf with its label. A phylogenetic network on a set S of taxa is an S-rDAG such that no tree node is elementary, every hybrid node has out-degree 1, and the single child of every hybrid node is a tree node. A phylogenetic tree is a phylogenetic network without hybrid nodes. We will say that a phylogenetic network is tree-sibling if every hybrid node has at least one sibling that is a tree node.

A temporal assignment [16] on a network N = (V, E) is a mapping τ : V → ℕ such that if v is a hybrid node and (u, v) ∈ E, then τ(u) = τ(v), and if v is a tree node and (u, v) ∈ E, then τ(u) < τ(v). We will say that a phylogenetic network is time-consistent if it admits a temporal assignment. The following alternative characterization of time consistency will be used later. For a proof, see [16, 17].

Proposition 1

Let N = (V, E) be a phylogenetic network, let E_H be its set of hybridization arcs, and let N* = (V, E*) be the directed graph with the same set V of nodes as N and set of arcs E* = E ∪ {(v, u) | (u, v) ∈ E_H}. Then, N is time consistent if, and only if, N* does not have any cycle containing some tree arc of N.

For short, we will refer henceforth to tree-sibling time consistent phylogenetic networks simply as TSTC networks.

The underlying biological motivation for the definitions on phylogenetic networks introduced so far is the following. In a phylogenetic network, tree nodes model species (either extant, the leaves, or nonextant, the internal tree nodes), while hybrid nodes model reticulation events, where different species interact to create new species. The parents of a hybrid node represent the species involved in this event, and its single child represents the resulting species. The tree children of a tree node represent direct descendants through mutation. The first condition in the definition of phylogenetic network says that every nonextant species is assumed to have at least two different direct descendants. This is a very common restriction in any definition of phylogeny (be it a tree or a network), since species with only one child cannot be reconstructed from biological data. The tree-sibling condition says then that, for every reticulation event, at least one of the species involved in it must have some descendant through mutation. This condition was introduced with the name class I in L. Nakhleh's Ph.D. thesis [10], and it has reappeared in several phylogenetic network reconstruction methods [4, 5]. As far as time consistency goes, we understand that the time assigned to a node represents the time when the corresponding species existed, or when the reticulation event took place. The first condition in time consistency means then that the species involved in a reticulation event must coexist in time in order to interact, while the second condition means that speciation takes some amount of time to take place.
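
The conditions defined in this section can be checked mechanically. The following Python sketch is only an illustration and is not part of the original paper: it assumes a network is given as a collection of nodes plus a list of arcs (u, v) without repeated arcs, it does not verify rootedness, acyclicity, or the leaf labelling, and all function names are hypothetical. The last function tests time consistency through the characterization of Proposition 1: a tree arc (u, v) lies on a cycle of N* exactly when u is reachable from v in N*.

```python
from collections import defaultdict


def degrees(nodes, arcs):
    """Return the in-degree and out-degree dictionaries of a directed graph."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg


def is_phylogenetic(nodes, arcs):
    """No tree node is elementary; each hybrid node has one child, a tree node."""
    indeg, outdeg = degrees(nodes, arcs)
    child = defaultdict(list)
    for u, v in arcs:
        child[u].append(v)
    for v in nodes:
        if indeg[v] <= 1 and outdeg[v] == 1:
            return False  # elementary tree node
        if indeg[v] > 1 and (outdeg[v] != 1 or indeg[child[v][0]] > 1):
            return False  # hybrid node without a single tree child
    return True


def is_tree_sibling(nodes, arcs):
    """Every hybrid node shares some parent with at least one tree node."""
    indeg, _ = degrees(nodes, arcs)
    child, parent = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        child[u].append(v)
        parent[v].append(u)
    return all(
        any(indeg[s] <= 1 for p in parent[h] for s in child[p] if s != h)
        for h in nodes if indeg[h] > 1
    )


def is_temporal_assignment(nodes, arcs, tau):
    """tau is constant along hybridization arcs and increases along tree arcs."""
    indeg, _ = degrees(nodes, arcs)
    return all(
        tau[u] == tau[v] if indeg[v] > 1 else tau[u] < tau[v]
        for u, v in arcs
    )


def is_time_consistent(nodes, arcs):
    """Proposition 1: no cycle of N* contains a tree arc of N."""
    indeg, _ = degrees(nodes, arcs)
    succ = {v: set() for v in nodes}
    for u, v in arcs:
        succ[u].add(v)
        if indeg[v] > 1:
            succ[v].add(u)  # reversed hybridization arc of N*

    def reaches(start, target):
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            if x == target:
                return True
            for y in succ[x] - seen:
                seen.add(y)
                stack.append(y)
        return False

    return not any(reaches(v, u) for u, v in arcs if indeg[v] <= 1)


# Hypothetical example: hybrid node "h" with parents "a" and "b" and leaf child 2.
nodes = ["r", "a", "b", "h", 1, 2, 3]
good = [("r", "a"), ("r", "b"), ("a", 1), ("a", "h"), ("b", "h"), ("b", 3), ("h", 2)]
# Here "a" is both a parent of "h" and a tree parent of "b", the other parent of "h".
bad = [("r", "a"), ("r", 3), ("a", "b"), ("a", "h"), ("b", "h"), ("b", 1), ("h", 2)]
print(is_time_consistent(nodes, good), is_time_consistent(nodes, bad))
```

In the second example, the two parents of the hybrid node would need the same time, while the tree arc between them forces a strict increase, so no temporal assignment can exist.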

3. Main Results

It is well known [13, 18] that the isomorphism problem for rDAGs is graph isomorphism-complete. It turns out that the isomorphism problem for rDAGs with their leaves injectively labeled in any given set of labels is also graph isomorphism-complete; since we have not been able to find a proof of this easy result in the literature, we provide one here.

Proposition 2

For every nonempty set S of labels, the isomorphism problem for S-rDAGs is graph isomorphism-complete.

Proof

Without any loss of generality, we assume that S = {1,…, n} ⊆ ℕ. Let us prove first that the isomorphism of S-rDAGs reduces to the isomorphism of rDAGs. For every S-rDAG G, let G′ be the rDAG obtained from G by unlabelling its leaves and then, for each k = 1,…, n, if G contained a leaf labeled with k, adding to this leaf k tree-children leaves; see Figure 1. The construction of G′ from G = (V, E) adds O(n^2) ⩽ O(|V|^2) nodes and arcs, and therefore it is polynomial in the size of G. And G can be reconstructed from G′ by simply replacing, for each k = 1,…, n, the node of height 1 with k leaf children by a leaf labeled with k. Then, it is straightforward to check that, for every pair of S-rDAGs G_1 and G_2 over S, G_1 ≅ G_2 as S-rDAGs if, and only if, G_1′ ≅ G_2′ as rDAGs.
Figure 1

The construction involved in the reduction of the isomorphism of S-rDAGs to the isomorphism of rDAGs.
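
As an illustration of this first reduction, the following Python sketch builds G′ from G and recovers G back. It is a minimal sketch under stated assumptions, not the authors' code: the graph is a set of nodes plus a list of arcs, `label` maps every leaf to its positive integer label, and the function names and the ("gadget", leaf, i) node names are hypothetical.

```python
def unlabel_with_gadgets(nodes, arcs, label):
    """Build G' from G: the leaf labelled k receives k new leaf children."""
    new_nodes, new_arcs = set(nodes), list(arcs)
    for leaf, k in label.items():
        for i in range(k):
            child = ("gadget", leaf, i)  # hypothetical fresh node name
            new_nodes.add(child)
            new_arcs.append((leaf, child))
    return new_nodes, new_arcs


def recover_labels(nodes, arcs):
    """Rebuild G from G': a node whose children are all leaves encodes a label."""
    children = {v: [] for v in nodes}
    for u, v in arcs:
        children[u].append(v)
    leaves = {v for v in nodes if not children[v]}
    # Nodes of height 1 are exactly the former labelled leaves.
    label = {v: len(children[v]) for v in nodes
             if children[v] and all(c in leaves for c in children[v])}
    removed = {c for v in label for c in children[v]}
    kept_arcs = [(u, v) for u, v in arcs if v not in removed]
    return set(nodes) - removed, kept_arcs, label


# Hypothetical example: a root with two leaves labelled 1 and 3.
g_nodes, g_arcs = {"r", "x", "y"}, [("r", "x"), ("r", "y")]
g_label = {"x": 1, "y": 3}
p_nodes, p_arcs = unlabel_with_gadgets(g_nodes, g_arcs, g_label)
print(len(p_nodes), len(p_arcs))
print(recover_labels(p_nodes, p_arcs)[2] == g_label)
```

The number of nodes added is the sum of the labels that occur in G, which is at most n(n + 1)/2.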

Let us prove now that the isomorphism of rDAGs reduces to the isomorphism of S-rDAGs. For every rDAG G, let G′′ be the S-rDAG obtained from G by adding a new node a, arcs from each leaf of G to a, and finally labeling the new node a with 1; see Figure 2. The construction of G′′ from G = (V, E) adds 1 node and O(|V|) arcs, and therefore it is polynomial. And G can be reconstructed from G′′ by simply removing its only leaf a and all arcs pointing to it. It is straightforward to check that, for every pair of rDAGs G_1 and G_2, G_1 ≅ G_2 as rDAGs if, and only if, G_1′′ ≅ G_2′′ as S-rDAGs.
Figure 2

The construction involved in the reduction of the isomorphism of rDAGs to the isomorphism of S-rDAGs.
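
The converse reduction is even shorter to write down. The sketch below is illustrative only: it assumes the same node-set and arc-list representation, and the function name and the default name "a" for the new node (which must not already occur in G) are hypothetical.

```python
def add_common_sink(nodes, arcs, sink="a"):
    """Build G'' from G: every leaf of G gets an arc to a new leaf labelled 1."""
    has_child = {u for u, _ in arcs}
    leaves = [v for v in nodes if v not in has_child]
    new_nodes = set(nodes) | {sink}
    new_arcs = list(arcs) + [(leaf, sink) for leaf in leaves]
    return new_nodes, new_arcs, {sink: 1}


# Hypothetical example: a root with two unlabelled leaves.
s_nodes, s_arcs, s_label = add_common_sink({"r", "x", "y"}, [("r", "x"), ("r", "y")])
print(sorted(s_arcs), s_label)
```

After the construction the new node is the only leaf, so removing it together with its incoming arcs gives back G.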

Let us see now that the isomorphism problem for S-rDAGs reduces to the isomorphism problem for TSTC networks on a new set of labels consisting of S and two extra labels. This entails that the isomorphism of TSTC networks on sets with at least three labels is graph isomorphism-complete.

Theorem 3

For every set S with |S| ⩾ 3, the isomorphism of TSTC networks on S is graph isomorphism-complete.

Proof

Without any loss of generality, we assume that S = {1,…, n} ⊆ ℕ. The isomorphism of TSTC networks on S clearly reduces to the isomorphism of S-rDAGs, since the former is a special case of the latter. Let us prove now the converse reduction. We will associate to each S-rDAG N = (V, E) a TSTC network N̄ on S ∪ {n + 1, n + 2}.

If N is a phylogenetic tree, then it is already a TSTC network, and in this case we take N̄ = N. Consider now the case when N has some hybrid node or some elementary node, and let m be the largest label actually appearing in N. In this case, we define the TSTC network N̄ as follows.

(1) For every hybrid node h in N, remove all arcs from h to its children, and then add a new (tree) node u_h, an arc from h to u_h, and new arcs from u_h to the children of h in N. If h was a leaf, say with label k, then u_h becomes the new leaf labeled with k.

(2) For every hybridization arc e = (v, h) in the resulting S-rDAG, split it into arcs (v, v_e) and (v_e, h), with v_e a new (tree and, for the moment, elementary) node. Let N′ denote the resulting S-rDAG after these two first steps.

(3) For every elementary node v in N′, add a new (tree) node v′ and an arc (v, v′).

(4) Split the arc (w, m) in N′ pointing to the leaf m into two arcs (w, w_m) and (w_m, m).

(5) Add two new nodes a and b, and, for every node v′ added in step (3), add arcs (v′, a) and (v′, b). Add also arcs (w_m, a) and (w_m, b). Notice that the nodes a and b will be hybrid.

(6) Add a tree leaf child labelled n + 1 to a and another one labelled n + 2 to b.

An example of this construction is displayed in Figure 3.
Figure 3

An example of the construction involved in the reduction of the isomorphism of S-rDAGs to the isomorphism of TSTC networks.
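
The six steps of the construction can be followed in code. The Python sketch below is an illustration under stated assumptions rather than the authors' implementation: N is given as a set of nodes, a list of arcs without repetitions, and a dictionary `label` from leaves to labels in {1,…, n}; the tuple names created for new nodes (such as ("u", h) or ("prime", v)) and the function names are hypothetical. Step (3) is applied to the elementary nodes present after steps (1) and (2), which is what the correctness argument relies on.

```python
def degrees(nodes, arcs):
    """Return the in-degree and out-degree dictionaries of a directed graph."""
    indeg, outdeg = dict.fromkeys(nodes, 0), dict.fromkeys(nodes, 0)
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg


def to_tstc(nodes, arcs, label, n):
    """Sketch of the construction of a TSTC network from an S-rDAG N."""
    nodes, arcs, label = set(nodes), list(arcs), dict(label)
    indeg, outdeg = degrees(nodes, arcs)
    hybrids = [v for v in nodes if indeg[v] > 1]
    if not hybrids and all(outdeg[v] != 1 for v in nodes):
        return nodes, arcs, label  # N is already a phylogenetic tree
    m = max(label.values())  # largest label actually appearing in N

    # Step (1): give every hybrid node h a single new tree child u_h.
    for h in hybrids:
        u_h = ("u", h)
        arcs = [(u_h, v) if u == h else (u, v) for u, v in arcs]
        arcs.append((h, u_h))
        nodes.add(u_h)
        if h in label:
            label[u_h] = label.pop(h)

    # Step (2): split every hybridization arc e = (v, h) with a new node v_e.
    hybrid_set, split = set(hybrids), []
    for u, v in arcs:
        if v in hybrid_set:
            v_e = ("v", u, v)
            nodes.add(v_e)
            split += [(u, v_e), (v_e, v)]
        else:
            split.append((u, v))
    arcs = split

    # Step (3): every elementary node v gets a new tree child v'.
    indeg, outdeg = degrees(nodes, arcs)
    primes = [("prime", v) for v in nodes if indeg[v] <= 1 and outdeg[v] == 1]
    nodes.update(primes)
    arcs += [(p[1], p) for p in primes]

    # Step (4): split the arc (w, m) pointing to the leaf labelled m.
    leaf_m = next(v for v, k in label.items() if k == m)
    w = next(u for u, v in arcs if v == leaf_m)
    w_m = ("w", leaf_m)
    arcs.remove((w, leaf_m))
    arcs += [(w, w_m), (w_m, leaf_m)]
    nodes.add(w_m)

    # Step (5): new hybrid nodes a and b, children of w_m and of every v'.
    a, b = ("hyb", "a"), ("hyb", "b")
    nodes.update([a, b])
    for x in primes + [w_m]:
        arcs += [(x, a), (x, b)]

    # Step (6): new leaves labelled n + 1 and n + 2.
    leaf_a, leaf_b = ("leaf", n + 1), ("leaf", n + 2)
    nodes.update([leaf_a, leaf_b])
    arcs += [(a, leaf_a), (b, leaf_b)]
    label[leaf_a], label[leaf_b] = n + 1, n + 2
    return nodes, arcs, label


# Hypothetical example: a hybrid leaf "h" labelled 1 with elementary parents.
N_nodes = {"r", "x", "y", "h"}
N_arcs = [("r", "x"), ("r", "y"), ("x", "h"), ("y", "h")]
T_nodes, T_arcs, T_label = to_tstc(N_nodes, N_arcs, {"h": 1}, n=1)
print(len(T_nodes), len(T_arcs), sorted(T_label.values()))
```

In this small example the output network has three hybrid nodes (the original one and the two added in step (5)), each with a single tree child, and no elementary node.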

Let us prove now that N̄ is a tree-sibling time consistent phylogenetic network.

It is rooted (with the same root as N) and acyclic, because all new arcs are either used to split arcs in N into pairs of consecutive arcs, or to define paths that end in the new leaves n + 1 or n + 2 without forming cycles.

It has no elementary node. Indeed, any elementary node in N gets an extra child in step (3), and the tree nodes that are added to N either get an extra child in step (3) or they get two children in (5).

Its hybrid nodes have only one child, and it is a tree node; this is ensured for the hybrid nodes in N in step (1), and for the new hybrid nodes a and b by construction.

It is tree-sibling. All hybrid nodes in N get a tree sibling in steps (2) and (3) (for every hybrid node h in N, if e is any arc pointing to h, then the tree child v_e′ of the new node v_e added in the middle of e is such a tree sibling of h), and the hybrid nodes a and b have the tree sibling m.

It is time consistent. To check this, we use Proposition 1 (and the notations introduced therein). Since we already know that N̄ is acyclic, any cycle in N̄* must contain some inverse of a hybridization arc. There are two possibilities for this inverse. If it has the form (h, x), with h one of the new hybrid nodes a or b introduced in step (5) and x one of the tree nodes v′ introduced in step (3) or the tree node w_m introduced in step (4), then the only tree arcs that can be reached from x in N̄* are those pointing to the leaves m, n + 1 or n + 2, and therefore no cycle in N̄* contains this arc (h, x) together with a tree arc. And if this inverse is of the form (h, v_e), with h a hybrid node in N and v_e one of the tree nodes introduced in step (2), then it must be followed in the cycle by the arc (v_e, v_e′) added in step (3), and, as we have just said, the only tree arcs that can be reached from v_e′ point to a leaf, and hence no cycle in N̄* contains this arc (h, v_e) and a tree arc, either.

It is clear that the construction of N̄ from N adds O(|V| + |E|) nodes and arcs to N, and thus it is polynomial in the size of N. Notice also that in this case N̄ always contains hybrid nodes, and in particular that it is never a phylogenetic tree. Moreover, in this nontree case, the S-rDAG N can be easily reproduced from N̄ by simply undoing its construction as follows.

Remove the leaves n + 1 and n + 2 and their hybrid parents a and b, together with all arcs pointing to them.

Remove the elementary parent of the leaf m (which will be the remaining leaf with largest label in S) and replace it by an arc from the parent of the removed node to m.

Remove all nonlabeled leaves of the resulting rDAG together with the arcs pointing to them.

Remove each parent v_e of every hybrid node, and replace it by an arc from the parent of v_e to the hybrid child of v_e.

Remove the only tree child of each hybrid node, and replace it by an arc from the hybrid node to each one of the children of the removed node.

The resulting S-rDAG is N. It is straightforward to check now that, for every pair of S-rDAGs N_1 and N_2, N_1 ≅ N_2 if, and only if, N̄_1 ≅ N̄_2 as phylogenetic networks over S ∪ {n + 1, n + 2}.

We cannot remove the condition |S| ⩾ 3 in the previous result because there are only two TSTC networks with fewer than 3 leaves (up to the actual names of the labels). In particular, this implies that, in the proof of the previous result, we cannot add fewer than 2 new leaves in the construction of N̄ from N.

Proposition 4

There is only one TSTC network with one leaf, and only one TSTC network with two leaves (up to relabeling), and in both cases they are trees.

Proof

The {1}-rDAG consisting of a single node, labeled 1, and the {1,2}-rDAG consisting of the phylogenetic tree with Newick code (1,2); are clearly TSTC networks. Let us check now that any other (up to relabeling) TSTC network has at least 3 leaves.

Let N = (V, E) be a TSTC network other than those described in the last paragraph, let τ : V → ℕ be a temporal assignment, and let v be an internal node with largest τ-value and, among those with this largest τ-value, of largest depth.

If v is a tree node, then all its children are either leaves or hybrid nodes with leaf children (because any tree descendant node of v has τ-value larger than τ(v)). But v's hybrid children would have the same τ-value as v and depth larger than v's depth, against the assumption. Therefore all children of v are leaves, and it has at least 2 children, because it cannot be elementary. Now, if v has more than 2 children, we are done, while if it has only two children, say the leaves 1 and 2, then v will have a parent in N (because N is not the tree (1,2);). If the parent of v is a tree node, let w be this node, and let z be another child of w. Since N does not contain cycles, and any path to 1 or 2 must contain w, we deduce that any descendant leaf of z must be different from 1 and 2; this gives at least 3 leaves. If, on the contrary, the parent of v is a hybrid node x, let w be a parent of x that has a tree child, say z. Time consistency prevents x from being a descendant of z (because τ(z) > τ(w) = τ(x)) and, therefore, since any path leading to 1 or 2 must contain x, any leaf that is a descendant of z will be different from 1 and 2; this gives again at least 3 leaves.

If v is a hybrid node, then its child is a leaf, say 1. Let v_1 be a parent of v that has a tree child. Since τ(v_1) = τ(v) is the largest τ-value of an internal node of N, this tree child must be a leaf, say 2. Now let v_2 be another parent of v. Since it is a tree node, it must have another child other than v, say x. If x is a tree node, it is a leaf, as we have just seen. If x is hybrid, then since τ(x) = τ(v_2) = τ(v), the tree child of x must be a leaf. In both cases, we obtain a leaf that is different from 1 and 2; that is, N contains at least 3 leaves.

It is usual in the literature to define a phylogenetic network on a set S of taxa as an rDAG with its leaves bijectively labeled in S. Theorem 3 also holds in this case.

Corollary 5

For every set S with |S| ⩾ 3, the isomorphism of TSTC networks with leaves bijectively labeled on S is graph isomorphism-complete.

Proof

The isomorphism of TSTC networks with leaves bijectively labeled on S clearly reduces to the isomorphism of TSTC networks with leaves injectively labeled on S, since the former is a special case of the latter. For the converse reduction, let N_1 and N_2 be two TSTC networks with leaves injectively labeled on S, let S_1 ⊆ S be the set of leaf labels of N_1, and let S_2 ⊆ S be the set of leaf labels of N_2. If S_1 ≠ S_2, then N_1 and N_2 are not isomorphic. If S_1 = S_2, let Ñ_1 and Ñ_2 be the TSTC networks obtained by adding to the roots of N_1 and N_2, respectively, |S∖S_1| leaf children bijectively labeled on S∖S_1. These TSTC networks Ñ_1 and Ñ_2 have their leaves bijectively labeled on S, their construction from N_1 and N_2 is polynomial in the size of N_1, N_2, and S, and it is clear that N_1 ≅ N_2 if, and only if, Ñ_1 ≅ Ñ_2. This shows that the isomorphism problem for TSTC networks with leaves bijectively labeled on S is polynomially equivalent to the isomorphism problem for TSTC networks with leaves injectively labeled on S, which is graph isomorphism-complete by Theorem 3.
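
The padding step used in this last reduction can be sketched as follows. This is an illustrative Python fragment, not the authors' code: it assumes a network given as a set of nodes, a list of arcs, and a dictionary from leaves to labels, that the root is an internal node, and that the hypothetical names ("pad", s) do not already occur as nodes.

```python
def pad_to_bijective(nodes, arcs, label, S):
    """Add to the root one new leaf child for every label of S not in use."""
    has_parent = {v for _, v in arcs}
    root = next(v for v in nodes if v not in has_parent)
    new_nodes, new_arcs, new_label = set(nodes), list(arcs), dict(label)
    for s in sorted(set(S) - set(label.values())):
        leaf = ("pad", s)  # hypothetical fresh node name
        new_nodes.add(leaf)
        new_arcs.append((root, leaf))
        new_label[leaf] = s
    return new_nodes, new_arcs, new_label


# Hypothetical example: the tree (1,2); padded to the label set {1, 2, 3, 4}.
c_nodes, c_arcs, c_label = pad_to_bijective(
    {"r", "x", "y"}, [("r", "x"), ("r", "y")], {"x": 1, "y": 2}, {1, 2, 3, 4}
)
print(sorted(c_label.values()))
```

Since the new leaves are tree nodes attached to the root, they create no hybrid node and no elementary node, and a temporal assignment extends to them by giving them any value larger than that of the root.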

4. Conclusion

We have proved that, unless the graph isomorphism problem belongs to P, there is no hope of defining a polynomially computable metric on the class of all TSTC networks on a set S of at least 3 taxa. The problem of defining polynomially computable metrics on the class of all TSTC networks on a given set S with all their hybrid nodes of in-degree bounded by some d ∈ ℕ remains open. When d = 2, the μ-distance [7] and Nakhleh's m metric [8, 9] are such metrics, but they are no longer metrics for d = 3 (Figure 4 in [8]). Actually, we do not even know whether the isomorphism problem for TSTC networks on a given set S of taxa with globally bounded in-degree hybrid nodes (but without bounding the out-degree of the tree nodes; otherwise, Luks' theorem [19] would apply) is always in P, but we conjecture that this is the case.

1.  Phylogenetic classification and the universal tree.

Authors:  W F Doolittle
Journal:  Science       Date:  1999-06-25       Impact factor: 47.728

2.  Towards the development of computational tools for evaluating phylogenetic network reconstruction methods.

Authors:  Luay Nakhleh; Jerry Sun; Tandy Warnow; C Randal Linder; Bernard M E Moret; Anna Tholse
Journal:  Pac Symp Biocomput       Date:  2003

3.  A metric on the space of reduced phylogenetic networks.

Authors:  Luay Nakhleh
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2010 Apr-Jun       Impact factor: 3.710

4.  Hybrids in real time.

Authors:  Mihaela Baroni; Charles Semple; Mike Steel
Journal:  Syst Biol       Date:  2006-02       Impact factor: 15.683

5.  Efficient parsimony-based methods for phylogenetic network reconstruction.

Authors:  Guohua Jin; Luay Nakhleh; Sagi Snir; Tamir Tuller
Journal:  Bioinformatics       Date:  2007-01-15       Impact factor: 6.937

6.  Tripartitions do not always discriminate phylogenetic networks.

Authors:  Gabriel Cardona; Francesc Rosselló; Gabriel Valiente
Journal:  Math Biosci       Date:  2007-12-03       Impact factor: 2.144

7.  Metrics for phylogenetic networks I: generalizations of the Robinson-Foulds metric.

Authors:  Gabriel Cardona; Mercè Llabrés; Francesc Rosselló; Gabriel Valiente
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2009 Jan-Mar       Impact factor: 3.710

8.  Metrics for phylogenetic networks II: nodal and triplets metrics.

Authors:  Gabriel Cardona; Mercè Llabrés; Francesc Rosselló; Gabriel Valiente
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2009 Jul-Sep       Impact factor: 3.710

9.  Maximum likelihood of phylogenetic networks.

Authors:  Guohua Jin; Luay Nakhleh; Sagi Snir; Tamir Tuller
Journal:  Bioinformatics       Date:  2006-08-23       Impact factor: 6.937

10.  A distance metric for a class of tree-sibling phylogenetic networks.

Authors:  Gabriel Cardona; Mercè Llabrés; Francesc Rosselló; Gabriel Valiente
Journal:  Bioinformatics       Date:  2008-05-12       Impact factor: 6.937
