Beatrice Donati, Christian Baudet, Blerina Sinaimeri, Pierluigi Crescenzi, Marie-France Sagot.
Abstract
BACKGROUND: Phylogenetic tree reconciliation is the approach of choice for investigating the coevolution of sets of organisms such as hosts and parasites. It consists of mapping the parasite tree onto the host tree using event-based maximum parsimony. Given a cost model for the events, however, many optimal reconciliations are possible. Any further biological interpretation must take this into account, which makes the ability to enumerate all optimal solutions crucial. Only two algorithms currently attempt such an enumeration: one does not produce all possible solutions, while the other does not handle all cost vectors. The objective of this paper is two-fold: first, to fill this gap, and second, to test whether the number of solutions generally observed can be an issue for interpretation.
Keywords: Cophylogeny; Enumeration algorithm; Host-parasite systems; Polynomial delay; Reconciliation
Year: 2015 PMID: 25648467 PMCID: PMC4310143 DOI: 10.1186/s13015-014-0031-3
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Recoverable events for a coevolutionary reconstruction. Schematic representation of cospeciation, duplication, host switch and loss events. The tube represents the host phylogenetic tree, while the dotted line represents the parasite tree.
Figure 2. Local tree structure for a given cell of the dynamic programming matrix. Schematic representation of the content of such a cell. Suppose the cell corresponds to the association p:h, and let p1 and p2 be the two children of p. A single cell-root node is created to represent the association p:h (the circular node in the picture). This association has a local minimum cost c that can be obtained in different ways, that is, by choosing different associations for p1 and p2. Each equivalent alternative is represented by a node (square in the picture). The number of alternatives varies. In this example, there are three alternatives: in (i) and (ii), p1 and p2 are mapped into two different pairs of host nodes; in (iii), p1 and p2 are both mapped into the same host node. Each of these alternatives, combined with the mapping of p into h, gives the same local minimum cost c. Notice that h and the host nodes used in the alternatives are all distinct nodes of the host tree.
Figure 3. Multiple sub-solutions. The tree structure allows the information to be stored efficiently. Each sub-solution corresponds to a subtree, and there is no need to duplicate it each time it appears in a solution. In particular, only one node is created for each association, and if two different alternatives share this association, the respective (square) nodes point to exactly the same (circular) node. In this example, the mapping of p into h has the same alternatives (i), (ii) and (iii) as depicted in Figure 2. The association of p with h′ has a local minimum cost c′ and can be obtained by two mappings of p1 and p2, labelled (iv) and (v), each of which maps p1 and p2 into its own pair of host nodes. Notice that h, h′ and the host nodes used in the alternatives are all distinct nodes of the host tree.
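The structure described in Figures 2 and 3 can be illustrated with a small sketch. This is not EUCALYPT's implementation: the class names `CellRoot` and `Alternative`, the host labels `h1` to `h5`, and the costs are all hypothetical, and the associations of p1 and p2 are treated as terminals for brevity. The sketch only shows how circular nodes can be shared between square nodes, so that sub-solutions are stored once while solutions are still counted and enumerated.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class CellRoot:
    """Circular node: one association p:h with its local minimum cost."""
    parasite: str
    host: str
    cost: float
    alternatives: List["Alternative"] = field(default_factory=list)


@dataclass
class Alternative:
    """Square node: one choice of associations for the two children of p."""
    left: CellRoot
    right: CellRoot


def count_solutions(root: CellRoot, memo: Optional[Dict[int, int]] = None) -> int:
    """Count solutions below `root`, visiting each shared node only once."""
    if memo is None:
        memo = {}
    key = id(root)
    if key not in memo:
        if not root.alternatives:  # terminal association
            memo[key] = 1
        else:
            memo[key] = sum(
                count_solutions(alt.left, memo) * count_solutions(alt.right, memo)
                for alt in root.alternatives
            )
    return memo[key]


def enumerate_solutions(root: CellRoot) -> Iterator[tuple]:
    """Yield every solution below `root` as a nested tuple of associations."""
    label = f"{root.parasite}:{root.host}"
    if not root.alternatives:
        yield (label,)
        return
    for alt in root.alternatives:
        for left in enumerate_solutions(alt.left):
            for right in enumerate_solutions(alt.right):
                yield (label, left, right)


# Hypothetical associations for the children p1 and p2 (labels are illustrative).
p1_h1, p1_h3, p1_h5 = (CellRoot("p1", h, 0.0) for h in ("h1", "h3", "h5"))
p2_h2, p2_h4, p2_h5 = (CellRoot("p2", h, 0.0) for h in ("h2", "h4", "h5"))

# p:h has three alternatives, as in Figure 2.
p_h = CellRoot("p", "h", 2.0, [
    Alternative(p1_h1, p2_h2),
    Alternative(p1_h3, p2_h4),
    Alternative(p1_h5, p2_h5),
])

# p:h' has two alternatives that reuse the same circular nodes, as in Figure 3.
p_h_prime = CellRoot("p", "h'", 3.0, [
    Alternative(p1_h1, p2_h4),
    Alternative(p1_h3, p2_h2),
])
```

Here `p_h.alternatives[0].left` and `p_h_prime.alternatives[0].left` are the same object, which is the sharing the caption describes: the sub-solution for that association exists once, however many alternatives point to it.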
Number of solutions found by each one of the programs CORE-PA, NOTUNG and EUCALYPT
| Dataset | | | Cost vector | CORE-PA #T | CORE-PA #C | CORE-PA #A | EUCALYPT #T | EUCALYPT #C | EUCALYPT #A | EUCALYPT #CA |
|---|---|---|---|---|---|---|---|---|---|---|
| EC | 7 | 10 | 〈0,1,1,1〉 | 16 | 6 | 10 | 16 | 6 | 10 | 5 |
| | | | 〈0,1,2,1〉 | | 0 | | 18 | 0 | 18 | 6 |
| | | | 〈0,2,3,1〉 | | 0 | | 16 | 0 | 16 | 4 |
| GL | 8 | 10 | 〈0,1,1,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| | | | 〈0,1,2,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| | | | 〈0,2,3,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| SC | 11 | 14 | 〈0,1,1,1〉 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| | | | 〈0,1,2,1〉 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| | | | 〈0,2,3,1〉 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| RP | 13 | 13 | 〈0,1,1,1〉 | 18 | 2 | 16 | 18 | 2 | 16 | 3 |
| | | | 〈0,1,2,1〉 | 3 | 1 | 2 | 3 | 1 | 2 | 1 |
| | | | 〈0,2,3,1〉 | 3 | 1 | 2 | 3 | 1 | 2 | 1 |
| SFC | 15 | 16 | 〈0,1,1,1〉 | 184 | 40 | 144 | 184 | 40 | 144 | 1 |
| | | | 〈0,1,2,1〉 | 40 | 40 | 0 | 40 | 40 | 0 | 0 |
| | | | 〈0,2,3,1〉 | 40 | 40 | 0 | 40 | 40 | 0 | 0 |
| PLML | 18 | 18 | 〈0,1,1,1〉 | | 0 | | 180 | 0 | 180 | 4 |
| | | | 〈0,1,2,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| | | | 〈0,2,3,1〉 | 11 | 0 | 11 | 11 | 0 | 11 | 2 |
| PLMP | 18 | 18 | 〈0,1,1,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| | | | 〈0,1,2,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| | | | 〈0,2,3,1〉 | | 0 | | 18 | 0 | 18 | 2 |
| RH | 34 | 42 | 〈0,1,1,1〉 | | 0 | | 42 | 0 | 42 | 4 |
| | | | 〈0,1,2,1〉 | | | 0 | 2208 | 2208 | 0 | 0 |
| | | | 〈0,2,3,1〉 | | | 0 | 288 | 288 | 0 | 0 |
| PP | 36 | 41 | 〈0,1,1,1〉 | | 0 | | 5120 | 0 | 5120 | 4 |
| | | | 〈0,1,2,1〉 | | 0 | | 72 | 0 | 72 | 2 |
| | | | 〈0,2,3,1〉 | | 0 | | 72 | 0 | 72 | 2 |
| FD | 20 | 51 | 〈0,1,1,1〉 | | | | 25184 | 1792 | 23392 | 11 |
| | | | 〈0,1,2,1〉 | | | | 408 | 132 | 276 | 5 |
| | | | 〈0,2,3,1〉 | | | 0 | 80 | 80 | 0 | 0 |
| COG2085 | 100 | 44 | 〈0,1,1,1〉 | | | | 44544 | 2304 | 42240 | 3 |
| | | | 〈0,1,2,1〉 | | | | 37568 | 480 | 37088 | 7 |
| | | | 〈0,2,3,1〉 | | 0 | | 46656 | 0 | 46656 | 4 |
| COG3715 | 100 | 40 | 〈0,1,1,1〉 | | | | 1172598 | 1155958 | 16640 | 6 |
| | | | 〈0,1,2,1〉 | 9 | 9 | 0 | 9 | 9 | 0 | 0 |
| | | | 〈0,2,3,1〉 | | | 0 | 33 | 33 | 0 | 0 |
| COG4964 | 100 | 27 | 〈0,1,1,1〉 | | | 0 | 224 | 224 | 0 | 0 |
| | | | 〈0,1,2,1〉 | | | 0 | 36 | 36 | 0 | 0 |
| | | | 〈0,2,3,1〉 | | | 0 | 54 | 54 | 0 | 0 |
| COG4965 | 100 | 30 | 〈0,1,1,1〉 | | | | 17408 | 5632 | 11776 | 2 |
| | | | 〈0,1,2,1〉 | | 0 | | 640 | 0 | 640 | 2 |
| | | | 〈0,2,3,1〉 | | | | 6528 | 1408 | 5120 | 2 |
Number of solutions found by each of the programs CORE-PA, NOTUNG and EUCALYPT for each dataset and each cost vector 〈c_c, c_d, c_s, c_l〉 (costs of cospeciation, duplication, host switch and loss, respectively). For EUCALYPT and CORE-PA, the columns are: #T = total number of optimal solutions, #C = number of cyclic solutions and #A = number of acyclic solutions. In all cases, #A is the same for NOTUNG and EUCALYPT. For EUCALYPT, the column #CA gives the number of event classes in the set of acyclic solutions. CORE-PA limits the total number of enumerated solutions to 1000; these cases are denoted by the symbol ∗. Bold numbers indicate the cases where the number of solutions produced by CORE-PA differs from the one found by EUCALYPT.
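To make the role of the cost vector concrete, the following minimal sketch scores a reconciliation under event-based parsimony as the weighted sum of its event counts. It assumes the four components are ordered as cospeciation, duplication, host switch and loss, following the order in which the Figure 1 caption lists the events. The names `EventCounts` and `reconciliation_cost` and the event counts in the example are hypothetical and are not taken from any of the datasets.

```python
from typing import NamedTuple, Sequence


class EventCounts(NamedTuple):
    """Number of events of each type in one reconciliation (assumed order)."""
    cospeciations: int
    duplications: int
    host_switches: int
    losses: int


def reconciliation_cost(cost_vector: Sequence[float], counts: EventCounts) -> float:
    """Total cost: each event count weighted by the matching vector entry."""
    if len(cost_vector) != len(counts):
        raise ValueError("cost vector must have one entry per event type")
    return sum(cost * n for cost, n in zip(cost_vector, counts))


# Illustrative event counts (not from the paper's datasets).
example = EventCounts(cospeciations=5, duplications=1, host_switches=2, losses=3)

# The five cost vectors that appear in the tables.
vectors = [(0, 1, 1, 1), (0, 1, 2, 1), (0, 2, 3, 1), (-1, 1, 1, 1), (0, 1, 1, 0)]
costs = {vector: reconciliation_cost(vector, example) for vector in vectors}
```

Because the same reconciliation receives a different total under each vector, the set of optimal reconciliations, and hence the counts reported in the tables, can change from one vector to the next.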
Number of solutions found by the programs CORE-PA and EUCALYPT
| Dataset | | | Cost vector | CORE-PA #T | CORE-PA #C | CORE-PA #A | EUCALYPT #T | EUCALYPT #C | EUCALYPT #A | EUCALYPT #CA |
|---|---|---|---|---|---|---|---|---|---|---|
| EC | 7 | 10 | 〈−1,1,1,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| | | | 〈0,1,1,0〉 | | 0 | | 24 | 0 | 24 | 8 |
| GL | 8 | 10 | 〈−1,1,1,1〉 | 2 | 0 | 2 | 2 | 0 | 2 | 1 |
| | | | 〈0,1,1,0〉 | 12 | 0 | 12 | 12 | 0 | 12 | 5 |
| SC | 11 | 14 | 〈−1,1,1,1〉 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| | | | 〈0,1,1,0〉 | | 2 | | 113 | 3 | 110 | 18 |
| RP | 13 | 13 | 〈−1,1,1,1〉 | 3 | 1 | 2 | 3 | 1 | 2 | 1 |
| | | | 〈0,1,1,0〉 | | | | 117 | 45 | 72 | 29 |
| SFC | 15 | 16 | 〈−1,1,1,1〉 | 40 | 40 | 0 | 40 | 40 | 0 | 0 |
| | | | 〈0,1,1,0〉 | | | | 6332 | 5069 | 1263 | 81 |
| PLML | 18 | 18 | 〈−1,1,1,1〉 | 2 | 0 | 0 | 2 | 0 | 2 | 1 |
| | | | 〈0,1,1,0〉 | | | | 448 | 28 | 420 | 16 |
| PLMP | 18 | 18 | 〈−1,1,1,1〉 | 2 | 0 | 0 | 2 | 0 | 2 | 1 |
| | | | 〈0,1,1,0〉 | | 0 | | 262 | 0 | 262 | 34 |
| RH | 34 | 42 | 〈−1,1,1,1〉 | | | 0 | 1056 | 1056 | 0 | 0 |
| | | | 〈0,1,1,0〉 | | | | 4080384 | 310284 | 3770100 | 275 |
| PP | 36 | 41 | 〈−1,1,1,1〉 | | 0 | | 144 | 0 | 144 | 2 |
| | | | 〈0,1,1,0〉 | | | | 498960 | 55440 | 443520 | 129 |
| FD | 20 | 51 | 〈−1,1,1,1〉 | | | | 944 | 368 | 576 | 7 |
| | | | 〈0,1,1,0〉 | | | | 1.5×10^15 | * | * | * |
| COG2085 | 100 | 44 | 〈−1,1,1,1〉 | | | | 109056 | 26496 | 82560 | 3 |
| | | | 〈0,1,1,0〉 | | | | 3.5×10^11 | * | * | * |
| COG3715 | 100 | 40 | 〈−1,1,1,1〉 | | | 0 | 63360 | 63360 | 0 | 0 |
| | | | 〈0,1,1,0〉 | | | | 1.2×10^12 | * | * | * |
| COG4964 | 100 | 27 | 〈−1,1,1,1〉 | | | 0 | 36 | 36 | 0 | 0 |
| | | | 〈0,1,1,0〉 | | | | 8586842 | 2603598 | 5983244 | 300 |
| COG4965 | 100 | 30 | 〈−1,1,1,1〉 | | | | 44800 | 13312 | 31488 | 5 |
| | | | 〈0,1,1,0〉 | | | | 907176 | 387192 | 519984 | 208 |
Number of solutions found by the programs CORE-PA and EUCALYPT for each dataset and each cost vector 〈c_c, c_d, c_s, c_l〉 (costs of cospeciation, duplication, host switch and loss, respectively). The columns are: #T = total number of optimal solutions, #C = number of cyclic solutions and #A = number of acyclic solutions. For EUCALYPT, the column #CA gives the number of event classes in the set of acyclic solutions. CORE-PA limits the total number of enumerated solutions to 1000; these cases are denoted by the symbol ∗. Bold numbers indicate the cases where the number of solutions produced by CORE-PA differs from the one found by EUCALYPT.
Searching for time-feasible solutions by varying k

| Dataset | 〈−1,1,1,1〉: k | 〈−1,1,1,1〉: o | 〈−1,1,1,1〉: #A | 〈0,1,2,1〉: k | 〈0,1,2,1〉: o | 〈0,1,2,1〉: #A | 〈0,2,3,1〉: k | 〈0,2,3,1〉: o | 〈0,2,3,1〉: #A |
|---|---|---|---|---|---|---|---|---|---|
| SFC | 7→6 | 6→7 | 16 | 7→6 | 21→22 | 16 | 7→5 | 31→35 | 12 |
| RH | 6→5 | 8→12 | 16 | 6→5 | 43→48 | 192 | 6→5 | 62→68 | 48 |
| COG3715 | 13→12 | 10→11 | 288 | 22→6 | 51→176 | 6 | 22→6 | 80→206 | 2 |
| COG4964 | 22→4 | 20→208 | 30 | 13→12 | 33→34 | 288 | 13→12 | 49→50 | 288 |
For some datasets (SFC, RH, COG3715 and COG4964), the number of optimal time-feasible solutions is zero when reconciliations are obtained using a given cost vector and unbounded k. After identifying the minimum k whose optimal cost o equals the optimal cost obtained for unbounded k, we decremented k until we reached the maximum k that generates acyclic solutions. For each pair (dataset, cost vector), the following values are given: the decrement of the bound (from the initial k to the final one), the new optimum found (from the initial o to the final one) and the new number of acyclic solutions (#A).
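The search described in this caption can be sketched as a simple loop. This is only an illustration under stated assumptions: `solve` is a hypothetical callable standing in for a reconciliation run with bound k (`None` meaning unbounded) that returns the optimal cost and the number of acyclic optimal solutions; it is not an actual EUCALYPT interface, and the toy solver's numbers are invented.

```python
from typing import Callable, Optional, Tuple

# Hypothetical solver type: bound k (None = unbounded) -> (optimal cost, #A).
Solver = Callable[[Optional[int]], Tuple[float, int]]


def search_feasible_bound(
    solve: Solver, k_limit: int
) -> Optional[Tuple[int, int, float, int]]:
    """Return (initial k, final k, new optimum, #A), or None if nothing is found."""
    unbounded_cost, _ = solve(None)

    # Minimum k whose optimal cost equals the unbounded optimum
    # (exact comparison, assuming integer-valued costs).
    k_start = next(
        (k for k in range(k_limit + 1) if solve(k)[0] == unbounded_cost), None
    )
    if k_start is None:
        return None

    # Decrement k until some optimal solution is acyclic (time-feasible).
    for k in range(k_start, -1, -1):
        cost, acyclic = solve(k)
        if acyclic > 0:
            return k_start, k, cost, acyclic
    return None


def toy_solve(k: Optional[int]) -> Tuple[float, int]:
    """Invented behaviour: tighter bounds raise the cost but allow acyclic solutions."""
    if k is None or k >= 7:
        return (6.0, 0)
    if k == 6:
        return (7.0, 16)
    return (9.0, 4)


result = search_feasible_bound(toy_solve, k_limit=10)
```

With the toy solver, the initial bound is 7 and the first bound that yields acyclic solutions is 6, at a slightly higher optimal cost. This mirrors the trade-off the table reports: time-feasibility is recovered by accepting a new optimum.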
Reducing the number of optimal time-feasible solutions by bounding k

| Dataset | 〈−1,1,1,1〉: k/k′ | 〈−1,1,1,1〉: #AC/#AC′ | 〈−1,1,1,1〉: #T/#T′ | 〈0,1,2,1〉: k/k′ | 〈0,1,2,1〉: #AC/#AC′ | 〈0,1,2,1〉: #T/#T′ | 〈0,2,3,1〉: k/k′ | 〈0,2,3,1〉: #AC/#AC′ | 〈0,2,3,1〉: #T/#T′ |
|---|---|---|---|---|---|---|---|---|---|
| EC | 3/3 | 2/2 | 2/2 | 3/3 | 18/16 | 18/16 | 3/3 | 16/16 | 16/16 |
| GL | 4/4 | 2/2 | 2/2 | 4/4 | 2/2 | 2/2 | 4/4 | 2/2 | 2/2 |
| SC | 6/6 | 1/1 | 1/1 | 6/6 | 1/1 | 1/1 | 6/6 | 1/1 | 1/1 |
| RP | 9/9 | 2/2 | 3/3 | 8/8 | 2/2 | 8/8 | 2/2 | 2/2 | 3/3 |
| PMP | 6/6 | 2/2 | 2/2 | 6/6 | 2/2 | 2/2 | 5/5 | 11/4 | 11/4 |
| PML | 5/5 | 2/2 | 2/2 | 5/5 | 2/2 | 2/2 | 3/3 | 18/6 | 18/6 |
| PP | 4/4 | 144/96 | 144/96 | 4/4 | 72/48 | 72/48 | 4/4 | 72/48 | 72/48 |
| FD | 9/10 | 576/240 | 944/512 | 9/9 | 276/4 | 408/8 | − | − | − |
| COG2085 | 14/14 | 82560/9408 | 109056/9408 | 14/14 | 37088/4032 | 37568/4032 | 14/14 | 46656/5184 | 46656/5184 |
| COG4965 | 16/16 | 31448/15744 | 44800/22400 | 16/16 | 640/320 | 640/320 | 13/16 | 5120/2560 | 6528/3328 |
For some datasets, the number of optimal time-feasible solutions may be huge when k is unbounded. In some cases, however, introducing a bound on k greatly reduces the number of time-feasible solutions while preserving their optimality. For every pair (dataset, cost vector) whose number of optimal acyclic solutions is positive for unbounded k, we identified the minimum k whose optimal cost equals the optimal cost obtained for unbounded k, and then searched for the minimum k′ ≥ k whose number of acyclic solutions is nonzero. In the first column for each cost vector, the values of k and k′ are given. #AC/#AC′ denotes the number of optimal acyclic solutions when the switches are unbounded and when they are bounded by k′, respectively. The column #T/#T′ shows the same relation for the total number of optimal solutions.