C Baudet, B Donati, B Sinaimeri, P Crescenzi, C Gautier, C Matias, M-F Sagot.
Abstract
Despite an increasingly vast literature on cophylogenetic reconstructions for studying host-parasite associations, understanding the common evolutionary history of such systems remains a problem that is far from being solved. Most algorithms for host-parasite reconciliation use an event-based model, where the events include in general (a subset of) cospeciation, duplication, loss, and host switch. All known parsimonious event-based methods then assign a cost to each type of event in order to find a reconstruction of minimum cost. The main problem with this approach is that the cost of the events strongly influences the reconciliation obtained. Some earlier approaches attempt to avoid this problem by finding a Pareto set of solutions and hence by considering event costs under some minimization constraints. To deal with this problem, we developed an algorithm, called Coala, for estimating the frequency of the events based on an approximate Bayesian computation approach. The benefits of this method are 2-fold: (i) it provides more confidence in the set of costs to be used in a reconciliation, and (ii) it allows estimation of the frequency of the events in cases where the data set consists of trees with a large number of taxa. We evaluate our method on simulated and on biological data sets. We show that in both cases, for the same pair of host and parasite trees, different sets of frequencies for the events lead to equally probable solutions. Moreover, often these solutions differ greatly in terms of the number of inferred events. It appears crucial to take this into account before attempting any further biological interpretation of such reconciliations. More generally, we also show that the set of frequencies can vary widely depending on the input host and parasite trees. Indiscriminately applying a standard vector of costs may thus not be a good strategy.
Keywords: approximate Bayesian computation; cophylogeny; host/parasite systems; likelihood-free inference
Year: 2014 PMID: 25540454 PMCID: PMC4395844 DOI: 10.1093/sysbio/syu129
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
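The abstract describes estimating event frequencies with approximate Bayesian computation (ABC). As a rough illustration of the general idea only, the sketch below runs plain ABC rejection sampling over a vector of four event probabilities. It is not Coala: the simulator (`simulate_counts`), the distance, the flat prior, and the tolerance are all hypothetical stand-ins chosen for brevity, whereas the method in the abstract simulates parasite trees and proceeds in several rounds.

```python
import random

EVENTS = ("cospeciation", "duplication", "host_switch", "loss")


def sample_prior(rng):
    """Draw an event-probability vector uniformly from the simplex."""
    # Normalised exponential draws give a flat Dirichlet(1, 1, 1, 1) sample.
    draws = [rng.expovariate(1.0) for _ in EVENTS]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_counts(theta, n_events, rng):
    """Toy simulator (assumption): draw n_events event labels with probabilities theta."""
    counts = [0] * len(EVENTS)
    for idx in rng.choices(range(len(EVENTS)), weights=theta, k=n_events):
        counts[idx] += 1
    return counts


def distance(observed, simulated):
    """L1 distance between two count vectors with equal totals, scaled to [0, 1]."""
    return sum(abs(x - y) for x, y in zip(observed, simulated)) / (2 * sum(observed))


def abc_rejection(observed, n_draws, tolerance, seed=0):
    """Keep the prior draws whose simulated counts fall within the tolerance."""
    rng = random.Random(seed)
    n_events = sum(observed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        simulated = simulate_counts(theta, n_events, rng)
        if distance(observed, simulated) <= tolerance:
            accepted.append(theta)
    return accepted


# Hypothetical observed event counts, in the order given by EVENTS.
observed_counts = [11, 2, 3, 11]
accepted = abc_rejection(observed_counts, n_draws=5000, tolerance=0.2)
```

The accepted vectors approximate a posterior over event probabilities under these toy assumptions; averaging them per component gives a point estimate, and their spread shows how loosely the data constrain each event type.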
Figure. Recoverable events for a coevolutionary reconstruction. The tube represents the host tree and the dotted lines the parasite tree.
Notation
| Symbol | Description |
| | Host tree. |
| | Parasite tree. |
| | Function from the leaves of |
| | Function from the vertices of |
| | Sets of parasite vertices associated with, respectively, cospeciation, duplication, and host switch events. |
| | Set containing arcs of the parasite tree that are associated with host switch events. |
| | Multiset containing all vertices |
| | Observed data. |
| | Generated data. |
| | Parameter space. |
| | Parameter value. |
| | Simulated parasite tree. |
| | Probability of the event |
| | Cost of the event |
| | Number of observed events of the type |
Note: The event types are cospeciation, duplication, host switch, and loss.
Figure. Events during the generation of the parasite tree. The host tree has white vertices and the parasite tree gray vertices. One type of association indicates that an unmapped parasite vertex is positioned on an arc of the host tree; the other indicates that a parasite vertex is mapped to a host vertex. a) initial configuration; b) unmapped vertex; c) cospeciation; d) duplication; e) host switch; f) loss.
Figure. For each simulated data set, we ran Coala 50 times and, at the end of each round (from 2 to 5), we noted the cluster whose representative parameter vector had the smallest distance to the probability vector used to generate the simulated data set. The histograms show the distribution of the smallest distance observed in each of the 50 runs at the end of each round, for the four simulated data sets. The solid and dotted vertical lines indicate median and mean values, respectively.
Figure. For each simulated data set, we ran Coala 50 times and, at the end of each round (from 2 to 5), we noted the cluster whose representative parameter vector had the smallest distance to the probability vector used to generate the simulated data set. The histograms show the distribution of the event probabilities observed in the list of parameter vectors that have the smallest distance in each run, for one of the simulated data sets. The solid and dotted vertical lines indicate median and mean values, respectively. The dashed vertical line indicates the “target” value.
Table 2. Representative probability vectors produced by Coala at Round 3
| Data set | Cluster | Cospeciation | Duplication | Host switch | Loss | No. of vectors |
| 1 | 0 | 0.030 | 0.000 | 0.557 | 0.413 | 1 |
| 1 | 1 | 0.461 | 0.258 | 0.000 | 0.281 | 24 |
| 1 | 2 | 0.554 | 0.000 | 0.270 | 0.176 | 20 |
| 1 | 3 | 0.910 | 0.016 | 0.058 | 0.016 | 5 |
| 2 | 1 | 0.851 | 0.082 | 0.000 | 0.066 | 25 |
| 2 | 2 | 0.473 | 0.204 | 0.000 | 0.323 | 10 |
| 2 | 3 | 0.238 | 0.349 | 0.000 | 0.413 | 8 |
| 2 | 4 | 0.580 | 0.002 | 0.282 | 0.136 | 7 |
Figure. Distribution of the probability values for each event type observed in the parameter values accepted in the third round while processing the biological data sets 1 and 2.
Event vectors obtained by transforming the probability vectors (Table 2) into cost vectors
| Data set | Cluster | Cost: cospeciation | Cost: duplication | Cost: host switch | Cost: loss | Total cost | No. of cospeciations | No. of duplications | No. of host switches | No. of losses | Acyclic scenarios | Cyclic scenarios |
| 1 | 0 | 3.517 | 13.816 | 0.584 | 0.885 | 14.044 | 1 | 0 | 15 | 2 | 2944 | 0 |
| 1 | 1 | 0.775 | 1.355 | 7.824 | 1.270 | 48.664 | 11 | 2 | 3 | 11 | 2 | 0 |
| 1 | 2 | 0.591 | 8.517 | 1.310 | 1.736 | 16.217 | 9 | 0 | 7 | 1 | 1 | 0 |
| 1 | 3 | 0.094 | 4.160 | 2.844 | 4.154 | 24.892 | 9 | 0 | 7 | 1 | 1 | 0 |
| 2 | 1 | 0.161 | 2.496 | 9.210 | 2.717 | 153.544 | 22 | 11 | 8 | 18 | 0 | 12 |
| 2 | 2 | 0.748 | 1.592 | 9.210 | 1.130 | 105.393 | 22 | 19 | 0 | 52 | 1 | 0 |
| 2 | 3 | 1.436 | 1.053 | 8.112 | 0.884 | 97.548 | 22 | 19 | 0 | 52 | 1 | 0 |
| 2 | 4 | 0.545 | 6.266 | 1.265 | 1.996 | 72.588 | 17 | 5 | 19 | 4 | 4 | 0 |
Note: The four event-count columns give the number of events of each type observed among the enumerated scenarios. The last two columns give, respectively, the total number of acyclic and cyclic scenarios.
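The tabulated costs are numerically consistent with taking the negative natural logarithm of each event probability, with zero probabilities floored at roughly 1e-6 (since -ln(1e-6) ≈ 13.816). This is an inference from the numbers, not a formula stated in this record, so the sketch below should be read as an assumption. The function names and the floor value are hypothetical. The example uses the first tabulated row (probabilities 0.030, 0.000, 0.557, 0.413) and its event counts (1, 0, 15, 2).

```python
import math

# Assumed floor for zero probabilities; -ln(1e-6) is about 13.816,
# which matches the largest cost in the table.
P_FLOOR = 1e-6


def probabilities_to_costs(probs, floor=P_FLOOR):
    """Map event probabilities to costs as -ln(p), flooring tiny values (assumed rule)."""
    return [-math.log(max(p, floor)) for p in probs]


def scenario_cost(costs, counts):
    """Total cost of a scenario: sum over event types of count times unit cost."""
    return sum(c * n for c, n in zip(costs, counts))


# Order: cospeciation, duplication, host switch, loss.
costs = probabilities_to_costs([0.030, 0.000, 0.557, 0.413])
total = scenario_cost(costs, [1, 0, 15, 2])
# Both results land close to the tabulated values (3.517, 13.816, 0.584, 0.885
# and 14.044); small gaps are expected because the printed probabilities are rounded.
```

Under this reading, a rare event gets a high cost and a frequent event a low one, so a minimum-cost reconciliation favors the events that the estimated probabilities consider likely.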
Table 4. Representative probability vectors produced by Coala, at the end of the third round, while processing the Wolbachia-arthropods data sets
| Cluster | Cospeciation | Duplication | Host switch | Loss | No. of vectors |
| 1 | 0.866 | 0.006 | 0.055 | 0.073 | 26 |
| 2 | 0.771 | 0.078 | 0.010 | 0.141 | 22 |
| 3 | 0.964 | 0.022 | 0.014 | 0.000 | 2 |
Total number of solutions obtained by transforming the probability vectors (Table 4) into cost vectors for Wolbachia-arthropods data sets
| Cluster | Cost: cospeciation | Cost: duplication | Cost: host switch | Cost: loss | Total cost | Solutions | Acyclic solutions |
| 1 | 0.144 | 5.116 | 2.899 | 2.623 | 917.475 | | No |
| 2 | 0.260 | 2.551 | 4.595 | 1.961 | 1407.877 | | No |
| 3 | 0.037 | 3.817 | 4.269 | 13.816 | 1375.725 | | Yes |