Literature DB >> 25380957

Joint amalgamation of most parsimonious reconciled gene trees.

Celine Scornavacca¹, Edwin Jacox², Gergely J Szöllősi².

Abstract

MOTIVATION: Traditionally, gene phylogenies have been reconstructed solely on the basis of molecular sequences; this, however, often does not provide enough information to distinguish between statistically equivalent relationships. To address this problem, several recent methods have incorporated information on the species phylogeny in gene tree reconstruction, leading to dramatic improvements in accuracy. Although probabilistic methods are able to estimate all model parameters but are computationally expensive, parsimony methods-generally computationally more efficient-require a prior estimate of parameters and of the statistical support.
RESULTS: Here, we present the Tree Estimation using Reconciliation (TERA) algorithm, a parsimony based, species tree aware method for gene tree reconstruction based on a scoring scheme combining duplication, transfer and loss costs with an estimate of the sequence likelihood. TERA explores all reconciled gene trees that can be amalgamated from a sample of gene trees. Using a large scale simulated dataset, we demonstrate that TERA achieves the same accuracy as the corresponding probabilistic method while being faster, and outperforms other parsimony-based methods in both accuracy and speed. Running TERA on a set of 1099 homologous gene families from complete cyanobacterial genomes, we find that incorporating knowledge of the species tree results in a two thirds reduction in the number of apparent transfer events.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2014 PMID： 25380957 PMCID： PMC4380024 DOI： 10.1093/bioinformatics/btu728

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Molecular phylogenetics infers gene trees based on the information contained in molecular sequences. Unfortunately, individual sequences may contain limited signal, and, as a result, phylogenetic reconstruction often involves choosing between statistically equivalent or weakly distinguishable evolutionary relationships. Although each homologous gene family has its own unique story, these are all related by a shared species history—which can be helpful for gene tree inference (Maddison, 1997; Szöllősi ). In the past decade, several methods have been developed that model the evolutionary processes that generate gene trees within the species tree (Akerborg ; Arvestad, 2003; Hallett and Lagergren, 2001; Rannala and Yang, 2003; Rasmussen and Kellis, 2007, 2012; Sjöstrand ; Suchard, 2005; Szöllősi , 2013a; Than and Nakhleh, 2009). From an inference perspective, these methods attempt to find the optimal way to explain the phylogenetic signal in extant sequences—represented as a gene tree—given the species tree. They explore the set of possible reconciliations, i.e. different ways to draw the gene tree into the species tree given some combination of macro evolutionary events, such as gene duplications, gene transfers, gene losses and incomplete lineage sorting. Studies that incorporate such events into gene tree inference have shown that information on the species phylogeny significantly improves the accuracy of gene tree inference (Akerborg ; Boussau ; Rasmussen and Kellis, 2010; Szöllősi ). To design such species tree aware methods for reconstructing gene phylogenies, the space of reconciled gene trees must be explored using information from both a model of sequence evolution and a reconciliation model, in order to optimize a joint sequence-reconciliation score. Such exploration is computationally expensive with traditional optimization approaches that rely on the local search of the space of gene trees. To circumvent this problem, David and Alm (2010) introduced the amalgamation algorithm, described in detail in Section 2.3 below and illustrated in Figure 1. Furthermore, Szöllősi recently developed an approach to exhaustively explore all reconciled gene trees that can be amalgamated from a sample of gene trees, i.e. obtainable by combining clades observed in the sample. Additionally, their method—ALE, for Amalgamated Likelihood Estimation—combines the amalgamation algorithm of David and Alm (2010) with conditional clade probabilities (CCPs) introduced by Höhna and Drummond (2012) and reconstructs the gene phylogenies by optimizing a joint sequence-reconciliation likelihood score, resulting in gene trees that are dramatically more accurate than those reconstructed using molecular sequences alone.

Fig. 1.

CCPs can be used to estimate the posterior probability of any tree that can be amalgamated from clades present in a sample of gene trees (David and Alm, 2010; Höhna and Drummond, 2012). Conditional clade frequencies can be used to approximate CCPs and are computed as the proportion of occurrences of a particular split of a clade according to a tripartition π, e.g. (abc de) among all trees in which the clade, e.g. (abcde), is found. Estimates based on the sample of trees on the left are shown as fractions for two different gene trees that can be amalgamated. The estimate for a gene tree is given by the sum of the reconciliation score and the logarithm of the tree CCPs. Based on the sample on the left, the tree with the highest posterior probability is the third tree (blue online). Reconciling it with the species tree requires one transfer and one loss event. It is, however, possible to combine clades present in the second (green online) and third (blue online) trees to produce a gene tree that is not present in the original sample but is identical to the species tree, i.e. it requires no events to draw it into the species tree. Depending on the costs of transfer and loss events, and the self-consistently estimated c parameter, the scenario without transfer might be optimal for the joint score ALE overcomes a fundamental limitation of recent parsimony based methods that improve gene trees given a putative species tree (David and Alm, 2010; Nguyen ; Wu ). Unlike those methods, it does not require the user to specify a cost for each type of event or a threshold on statistical supports. However, ALE faces the drawbacks associated with probabilistic methods. In particular: (i) when computing the reconciliation score, ALE has an increased computational cost compared with a parsimony algorithm (e.g. Conow ; Doyon ), which is due to a potentially large constant factor resulting from the numerical integration of the likelihood; (ii) ALE’s results are contingent on difficult to estimate time-like branch lengths of the species tree, while parsimony methods can reconcile gene trees relying only on the order of speciations in time (e.g. Doyon ), and even deal with undated species trees (e.g. Bansal ). Parsimony methods in general, despite lacking an explicit connection to a generative probabilistic model and relying on other heuristics, have been shown to be highly accurate, comparable to sophisticated probabilistic reconciliation methods, with reduced runtime (Wu , 2014). Here we present the TERA algorithm (Tree Estimation using Reconciliation and Amalgamation) that amalgamates the most parsimonious reconciled gene tree from a set of gene trees reconstructed from a unique gene alignment, according to a joint sequence-reconciliation score. Although TERA, like other parsimony based methods, requires the prior specification of the costs associated with duplication, transfer and loss (DTL) events, it does not require prior assumptions about a statistical support threshold, as it estimates a self-consistent support threshold from its input. Furthermore, TERA considers explicitly the possibility of transfer from extinct or unsampled branches of the species tree, which is expected to be the case for practically all transfers (Szöllősi ). TERA does not, however, consider incomplete lineage sorting. The self-consistent score estimation scheme used by TERA, introduced in Section 2.4, should be applicable to other parsimony methods, while amalgamation is in theory compatible with any reconciliation algorithm that assumes branches of the gene tree to be independent.

2 Materials and methods

2.1 Preliminaries

Given a binary rooted tree T, we respectively denote by V(T), E(T), L(T) and r(T), its node, edge, leaf node sets and root node. The label of each leaf u is denoted by , while the set of labels of leaves of T is denoted by . Given a node , we denote respectively by up, us and the father, the sibling and the children of u (if they are defined). Note that in this article all trees are considered as unordered, so ul and ur are interchangeable. For a node u of T, Tu denotes the subtree of T rooted at u. Given two nodes u and v of T, (, respectively) if and only if v is on the unique path from u to r(T) (respectively, and ); in such a case, u is said to be a (strict) descendant of v. Given a node u of T, we define the clade associated with u, denoted by C(u), as the set . If u is an internal node, we define the tripartition associated with u, denoted by , as the triple (). For leaf nodes, the trivial tripartition is defined as the triple (). Finally, the height of T is denoted by h(T). In this article, unless stated otherwise, we assume that gene and species trees are rooted, binary and uniquely leaf labeled, i.e. within each tree there is a bijection between leaves and labels. Due to this bijectivity we will refer to leaves and labels interchangeably. We define a gene tree G as a tree where each leaf represents an extant gene. Similarly, a species tree S is defined as a tree in which each leaf represents a distinct extant species. Note that several leaves of a gene tree can be associated to the same species due to gene birth corresponding to duplication and transfer events. Formally, we indicate this by a surjective function , called the species labeling of G. The set of species labels of the leaves of G is denoted . A tree T is said to be time ordered when it is associated with a time function that associates each of its nodes with a non-negative value so that, for any two nodes , if x′ is a strict descendant of x then . Moreover, , we have that . A subdivision T′ of a time-ordered tree T is the tree obtained from T by adding a new node y with on each edge such that there exists with . For nodes corresponding to nodes already present in T, we set .

2.2 Species tree-gene tree reconciliation

Here, we consider the problem of finding the most parsimonious reconciliation (MPR) when considering—as possible macro-events that result in the birth and death of gene copies—speciation, gene duplication, gene transfers and gene loss (Szöllősi and Daubin, 2012). The general problem of finding an MPR is known to be NP-complete, even for reconciling two binary trees (Tofigh ). The complexity of the problem is due to the difficulty of ensuring the time consistency of gene transfers, i.e. satisfying the chronological constraints among nodes of the species tree that are induced by transfer events. However, the problem becomes polynomially solvable when accepting a time-ordered species tree as input (among others Conow ; Doyon Tofigh, 2009, see Doyon for a review). In this article, we build upon the combinatorial reconciliation model introduced by Doyon , which can be used to solve this special case of the problem. Some parsimony methods (e.g. Bansal ) do not need information on the order of speciations in time. This allows a more efficient recursion over reconciliations, but at the cost of considering reconciliations that contain transfer events that are not consistent with any ordering of the species tree (Tofigh ). The model of Doyon can be used to reconcile a time-ordered binary species tree S with a binary gene tree G by constructing a mapping α that maps each node into an ordered list of nodes in , namely the ancestral and/or extant species in which the sequence corresponding to u evolved. This model takes into account four kinds of biological events: speciation, gene duplication, gene transfer and gene loss. The atomic events of this model are: a speciation (), a duplication (), a transfer (), a transfer followed immediately by the loss of the non-transferred child (), a speciation followed by the loss of one of the two resulting children (), and a contemporary event () that associates an extant gene to its corresponding species. Finally, a null event (), is used to model a gene lineage crossing a time boundary. Note that duplication-loss events and transfer followed by the loss of the transferred gene, unlike a transfer followed by the loss of the non-transferred gene and speciation-loss events, leave no trace and are therefore undetectable. This is why, in the model, losses are never considered alone. We refer the reader to Doyon for the formal definition of a reconciliation. Let θ, τ, λ be, respectively, the costs of a duplication, a transfer and a loss. Given a reconciliation, we define the cost of α, denoted by , as the sum , where d, t and l are respectively the number of events, of and events, and of and events in α. In Doyon the authors give an efficient algorithm to compute c(G, S) for a time-ordered species tree S and a gene tree G, where c(G, S) is defined as the minimum cost over all possible reconciliations between G and S.

2.3 Choosing a reliable gene tree among several competing alternatives

Even though our aim is to reconstruct reliable gene trees from a multiple sequence alignment and a species phylogeny, our approach does not directly take sequence alignments as an input, but requires a sample of gene trees, typically produced from the alignment by either a Markov Chain Monte Carlo (MCMC) methods such as PhyloBayes (Lartillot ) and MrBayes (Ronquist ), or bootstrap resampling. To find the optimal gene tree, clades found in the input sample of gene trees are combined using the amalgamation approach in order to recover an optimal tree with respect to our scoring scheme. The optimal tree recovered will only contain clades found in the input sample of gene trees, but it will not in general be found in the sample itself. Figure 1 provides a schematic illustration of the amalgamation approach. Clades present in the sample of trees (the unrooted trees on the left) can be combined to obtain a tree such that each clade is found in the sample, but the tree itself is not. For example, one can produce a green-blue tree consisting of a green subtree with genes a, b and c, and a blue subtree with genes d, e and f. The sequence score of each tree is obtained using CCPs that depend on the number of times different trees are seen in the sample and is described in detail in the next section. The reconciliation score for each tree corresponds to the MPR of the gene tree with the species tree. The amalgamation algorithm itself is a joint dynamic programming recursion over (i) all trees that can be produced from clades present in the input sample and (ii) all possible ways to reconcile each of these trees with the species tree, to recover a gene tree with the smallest joint sequence-reconciliation score. As shown in Figure 1 in the online appendix, amalgamation permits us to explore a vastly larger set of trees than those contained in the sample. Conceptually, both ALE (Szöllősi ) and TERA are based on the amalgamation approach of AnGST (David and Alm, 2010), and all three methods are—at the level of the dynamic programming recursion—closely related. TERA differs from AnGST in the underlying reconciliation model (Doyon ) and because it allows transfers going/coming from extinct or unsampled species (Szöllősi ). Moreover, the AnGST scoring scheme is solely based on the reconciliation score. ALE differs from TERA in that it relies on a complex underlying probabilistic model; the results of which, in contrast to TERA, are contingent on time-like branch lengths of the species tree. TERA’s amalgamation algorithm can be regarded as a generalization of the gene tree reconciliation algorithm of Doyon , which iterates over reconciliations by mapping each node of a gene tree to branches of the species tree. In the joint recursion presented in this article, instead of nodes of a gene tree, the clades found in the input sample of gene trees are mapped into branches of the species tree. More formally, assume we are given a set of (unrooted) gene trees on the same leaf set reconstructed from a unique sequence alignment. We denote respectively by and the union of all the clades, and the union of all tripartitions in . For each tripartition π, we denote by ( and , respectively) the first (second and third, respectively) element of π. If contains unrooted trees we consider all possible rootings for each tree when computing and . Furthermore, for a given clade C of , we denote by the set of tripartitions for which . When focusing only on the reconciliation score, the optimization problem consists of computing , where is the set of gene trees such that for all . The pseudocode is given in Algorithm 1 in the appendix. Roughly speaking, our algorithm starts by computing the subdivision S′ of S, and the sets and . Then, it performs a joint traversal of all gene tree clades and species tree branches wherein clades C in are considered in order of increasing size, and nodes x′ of S in order of increasing height. For each pair (C, x′) the algorithm computes the cost of reconciling clade C with x′ by testing all possible tripartitions π in . Because each non-trivial tripartition π can be seen as an internal node of an amalgamated tree, with children and , the cost of reconciling a tripartition π with x′ can be computed according to Algorithm 1 of Doyon . We refer the reader to Algorithm 1 of Doyon for a better understanding of the pseudocode. The correctness of our approach is proven in the appendix. Note that—for ease of writing—the pseudocode of the algorithm does not contain the transfers from the dead, i.e. the transfers going/coming from extinct or unsampled species (Szöllősi ). However, Algorithm 1 can be easily modified to accommodate this kind of event by adding to the species tree S a sister group of the root clade such that, within this group, duplications and losses are free, speciations are not permitted, and transfers to this new group (formally corresponding to unrepresented speciations) cost zero—similar to what is done in the likelihood framework by Szöllősi .

2.4 Taking into account the CCP

As described in the introduction, our goal is to create a species tree aware method for reconstructing gene phylogenies that uses information from both gene sequences and from the reconciliation with a species tree. That is, we wish to construct a method that optimizes a joint sequence-reconciliation score. In order to do this, we must find an efficient manner to incorporate a sequence based cost in addition to the reconciliation cost of Doyon in the amalgamation scheme. AnGST, the seminal algorithm of David and Alm (2010) that introduced the idea of amalgamation, does not distinguish between trees that can be amalgamated. The problem with this approach is that, as the number of input trees—and thus the amount of information given as input—increases, the set of possible trees that can be amalgamated also increases—until all possible tree topologies can be amalgamated. At this point, since all possible tree topologies can be amalgamated, the most parsimonious reconciled gene tree will only depend on the reconciliation score. In practice this introduces the problem that the topology of the amalgamated gene tree may vary significantly when adding only a few trees to the sample of trees (in the worst case only one tree). In a probabilistic framework, conditional clade probabilities (CCPs, cf. Fig. 1) provide an accurate approximation of posterior probabilities for a very large number of tree topologies from a smaller MCMC sample (Höhna and Drummond, 2012; Larget, 2013; Szöllősi ). The CCP of a rooted tree (Höhna and Drummond, 2012), denoted by , is defined as the product of the conditional probabilities of all partitions in . The of the partition of clade C according to the tripartition π is denoted and is approximated by the ratio , where for each clade and for each tripartition and is the frequency of C and π in . Here, in order to construct a parsimony method that optimizes a joint sequence-reconciliation score, we choose to minimize the joint cost over , where the parameter c weights the contribution of the sequence alignment NA to the cost, defined as where corresponds to the posterior probability of the gene tree with the highest posterior probability according to the sequence alignment. The logarithm of the CCP provides an additive cost for deviation from the phylogeny preferred by the sequence alignment alone, similar to the additive cost for deviation from species phylogeny provided by the DTL event costs. The parameter c is analogous to a statistical support threshold, corresponding to a cost c for each point of log posterior probability difference between the log posterior probability of a given phylogeny and the gene tree with highest posterior probability. As illustrated in Figure 1, Algorithm 1 in online appendix is easily modified by adding to while filling the dynamic programming matrix (on line 15, 17 and 18 of Algorithm 1, respectively). The term , corresponding to the gene tree with the highest posterior probability, can be neglected during cost minimization as it simply corresponds to an additive constant. Given estimates for the DTL costs (available for example in David and Alm, 2010; Nguyen ), the parameter c can be estimated in a self-consistent manner. However, finding the proper weight between the disagreement with the species tree (increase in DTL events) and the disagreement with the sequence alignment (decrease in ) is difficult. Our estimation approach consists of looking for the set of costs that are the most self-consistent, i.e. the ratio of costs that best corresponds to the ratio of events. We assume a simple model for how costs determine the number of events: each type of event, i.e. DTL events as well as the disagreement with the alignment counted by NA, are considered to occur independently, such that events with smaller costs are expected to occur more frequently. In particular, the expected amount of disagreement with the species tree due to, respectively duplications, transfers and losses, is proportional to and , while the expected amount of disagreement with the sequence alignment is propotional to . The observed amount of disagreement with the species tree is given by the sum of the number of DTL events, i.e. N, N and N, while the observed amount of disagreement with the sequence alignment is given by NA. We then employ an expectation maximization like recursion equating at each step the observed frequencies with the expected frequencies: Algorithm 1 is then run until is larger then a threshold ϵ.

2.5 Implementation and validation

TERA is implemented in C++ and is freely available from http://mbb.univ-montp2.fr/MBB/download_sources/16__TERA. Posterior samples of gene trees, for both simulated and real alignments, were downloaded together with the ‘true’ gene trees used to simulate alignments from the dryad data repository (doi:10.5061/dryad.pv6df) provided by Szöllősi . For all the parsimony-based species tree aware methods, including TERA, we used the DTL costs δ = 2, τ = 3, λ = 1 obtained by David and Alm (2010) using a criteria based on minimizing the change in ancestral genome sizes on a large biological dataset. We ran TreeFix-DTL with default parameters, JTT/GTR with a gamma distribution as models of evolution, and as a starting tree the PhyML tree. MowgliNNI was run with default parameters, a threshold of 50 for weak edges, and with the PhyML tree—with bootstrap values—as a starting gene tree. AnGST was run with default parameters using the dated species tree on samples of 1000 gene trees, whereas JPrIME-DLTRS was run with JTT with a gamma distribution as model of evolution, 100 000 iterations, a thinning factor of 10 and a time out of 10 h. Finally, we ran TERA with a starting c of 0.1 and for samples from 10 up to 10 000 gene trees for each simulated alignment. The gene trees reconstructed by ALE were downloaded from the above mentioned data repository. Note that, from a practical perspective, the DTL costs we use are the default parameters for all the parsimonious methods described in the article, and seem to work well for several parts of the Tree of Life. If the user suspects that these values are not suited for the analyses, these parameters should be estimated beforehand, e.g. using the ALE method.

3 Results

To test the accuracy of gene trees reconstructed using TERA we chose a dataset based on 1099 homologous gene families present in 36 cyanobacterial genomes. This dataset, published in Szöllősi , was constructed using homologous families from the HOGENOM database (Penel ) and contains both real and simulated alignments as well as the gene trees used to simulate sequences. The mean number of genes per family in this dataset is 36.66, the largest family has 114 genes and the smallest 21 genes; the mean number of species in which a family is found is 31.49, with a minimum of 4 and a maximum of 36; the mean copy number per genome—counting as zero genomes in which a family is absent—is 1.012, with a minimum of 0.5833 and a maximum of 3.17. We chose this dataset, because (i) it contains a diverse set of gene families from a reasonably large and divergent set of species, and (ii) the parametric bootstrap-like simulation procedure used attempts to retain as much of the complexity of the underlying biological dataset as possible (Szöllősi ). Furthermore, to emulate the relative complexity of real data compared with available models of sequence evolution, we used a complex model of sequence evolution to simulate sequences—an LG model (Le and Gascuel, 2008) with across-site rate variation and invariant sites—and used PhyloBayes (Lartillot ) with a simpler model—a Poisson model (Felsenstein, 1981) with no rate variation—to produce the sample of gene trees used by both TERA and AnGST (see below for more details).

3.1 Validation on simulated data

For the simulated alignments, both the ‘true’ gene tree used to generate the sequences and the species tree—along which the gene trees evolved—are known. Consequently, it is possible to directly assess the accuracy of different reconstruction methods in recovering the correct gene tree. As shown in Figure 1a in the online appendix, the number of possible amalgamations increases roughly exponentially with increasing sample size in the simulated dataset, but the median reconstruction accuracy achieved by TERA begins to saturate (Figure 1b in the online appendix). To compare the accuracy of our method to that of others, we reconstructed gene trees using six different ‘species tree aware’ methods: (i) the TERA algorithm described here, (ii) ALE (Szöllősi ), (iii) TreeFix-DTL (Bansal et al., 2014, submitted for publication, http://compbio.mit.edu/treefix-dtl/), (iv) MowgliNNI (Nguyen ), (v) AnGST (David and Alm, 2010) and (vi) JPrIME-DLTRS (Sjöstrand ) as well as the species tree unaware method, PhyML (Guindon ). In Figure 2a, we plot the normalized Robinson-Foulds (defined as the Robinson-Foulds distance divided by its maximum possible value, and denoted as n-R-F in the following) distance of the reconstructed gene trees to the true tree. These results show that all of the species tree aware methods achieve better accuracy than the species tree unaware method PhyML, which is to be expected as they are given additional information in the form of the species tree. Among the species tree aware methods, with an input of 10 000 samples TERA’s accuracy is statistically indistinguishable from the more complex maximum likelihood based results from ALE (paired Wilcoxon test P > 0.1) and is significantly more accurate than TreeFix-DTL (Bansal et al., 2014, submitted for publication) (paired Wilcoxon test P < 10−8) as well as the other species tree aware method MowgliNNI (Nguyen ). TERA also outperformed jPrIME-DLTRS, although the accuracy of the latter may have been limited by the available run time (recall that a time out of 10 h per each data set was given). For an input of 1000 samples, TERA is less accurate than either TERA or ALE with 10 000 input samples (paired Wilcoxon tests P < 10−8), statistically indistinguishable from jPrIME-DLTRS, slightly more accurate then TreeFix-DTL (paired Wilcoxon test P = 0.026), and still significantly more accurate than MowgliNNI and AnGST (paired Wilcoxon tests P < 10−8).

Fig. 2.

(a) To compare the accuracy of TERA and other methods we used the simulated data set of Szöllősi et al. (2013b). We find that TERA achieves statistically equivalent accuracy to ALE and better accuracy than the other methods, see main text for details. (b) To test for over and underfitting of the species tree we examined the 431 gene families with exactly one copy in each of the 36 cyanobacterial species. For each family we plot the difference of the R-F distance of the true tree to the species tree and the R-F distance of the reconstructed gene tree from the species tree. Negative values for the difference indicate overfitting, while in the case of underfitting we expect a positive value Results for AnGST are only shown for sample sizes of 1000 gene trees, due to the very large memory requirement of the AnGST implementation. To investigate the effect of using a joint sequence-reconciliation score we also ran TERA with c = 0, i.e. emulating AnGST in only optimizing the reconciliation score. We found that on a sample of 1000 trees AnGST was more accurate then TERA with c = 0 with an n-R-F of, 0.156 and 0.166, respectively. However, using TERA with only 1000 samples, but estimating c, resulted in a mean n-R-F of 0.146. The average c estimated by TERA was 0.49 while the average N was 6.57. An important difference of TreeFix-DTL compared with the other methods considered here, is that it does not use information of the time order of speciation events in the species tree (note that AnGST can also run without information on time ordering). Therefore TreeFix-DTL uses less information, which may explain the difference in performance in comparison to TERA on the simulated dataset. Nonetheless, we ran TERA with 10 random time orderings of the species tree and this resulted in statistically identical n-R-F values when using the correct time order of speciations (Wilcoxon rank sum test P = 0.6). A potential concern regarding methods that optimize a joint reconciliation-sequence score is that we may overfit the species tree. If overfitting of the species tree occurs we expect the reconstructed gene trees to become too similar to the species tree. In the context of the simulated dataset used here, we expect that the reconstructed gene trees will become more similar to the species tree than the true trees used to simulate alignments. To test for such a signal of overfitting, we require a measure of similarity between gene trees and the species tree. The most straightforward solution is to restrict our analysis to gene families that have exactly one copy in each species. In this case, we can simply use the n-R-F distance between the species tree and each of the gene trees as our similarity measure. In Figure 2b, we show the results for the 431 single copy universal gene families in our simulated dataset. We measure the extent of over and underfitting as the difference in n-R-F distance between the species tree and the reconstructed gene tree and the n-R-F distance between the species tree and the true gene tree. We observe that the species tree unaware method, PhyML, as expected, reconstructs trees that are more distant to the species tree than the true tree. The results for the species tree aware methods are more variable: ALE, based on an explicit probabilistic approach, exhibits a median difference of zero and produces only a few examples of overfitting. TERA, which estimates the c parameter giving the relative weight of the sequence and reconciliation component of the joint score, also achieves a median difference of zero when 10 000 samples are given as input, but produces a somewhat larger number of slightly overfitted trees. When only 1000 samples are used, both TERA and AnGST underfit the species tree, similar to jPRIME-DLTRS, suggesting that it may be a lack of convergence of the sampling in all cases. TreeFix-DTL, which relies on a fixed support threshold, shows signs of more significant overfitting; while MowgliNNI substantially underfits the species tree, at least with the default parameters used here. The runtimes for the methods discussed in this section are given in Table 1. We can see that TERA has the fastest stand-alone runtime. However, if the runtime necessary to generate the input tree(s) are considered, Mowgli is the fastest method, but it is also the least accurate (cf. Fig. 2a). For an input size of 1000 samples TERA achieves comparable accuracy to TreeFix-DTL, but with a seven fold reduced mean runtime. For an input size of 10 000 samples TERA achieves similar accuracy to ALE and outperforms TreeFix-DTL, but considering the time to generate the required inputs TreeFix-DTL is 1.3 times faster on average.

Table 1.

Mean runtimes in seconds for the methods discussed in the main text on a cluster of 2.1 GHz Intel Xeon processors with 24 GB of RAM with maximum runtime limited to 10 h per family

	Stand-alone [s]	Input [s]

		PhyloBayes
		1000 samples	10 000 samples
TERA	3.65	756.6	7566
AnGST	54.9	756.6	—
ALE	159.2	756.6	7566

		PhyML

MowgliNNI	6.3	182.5
TreeFix	5718.0	182.5
		No input tree needed
jPrIME	32 137.3	0

The time required to compute inputs is given by the runtime of PhyloBayes for 1000 and 10 000 samples and for the time required for PhyML to compute an ML tree with SH branch supports. Stand-alone runtimes are given for 10 000 samples for TERA and ALE and 1000 samples for AnGST.

Mean runtimes in seconds for the methods discussed in the main text on a cluster of 2.1 GHz Intel Xeon processors with 24 GB of RAM with maximum runtime limited to 10 h per family The time required to compute inputs is given by the runtime of PhyloBayes for 1000 and 10 000 samples and for the time required for PhyML to compute an ML tree with SH branch supports. Stand-alone runtimes are given for 10 000 samples for TERA and ALE and 1000 samples for AnGST.

3.2 Results on real data

In order to test TERA on biological data, we again used the dataset published in Szöllősi , but focusing on real alignments. As inputs to TERA we used: (i) the tree samples obtained from real alignments and (ii) the ML species tree unaware gene trees obtained using PhyML from the same alignments. Similar to the results of ALE (Szöllősi ), we find that the number of transfer and loss events (but not duplication events) in most parsimonious reconciled gene trees is substantially lower than those found in the most parsimonious reconciliations of PhyML trees: the mean and median number of transfers per family was 3.914 and 3 compared with 10.38 and 9, respectively; the mean and median number of losses per family was 5.088 and 4 compared with 7.542 and 7, respectively, while the mean and median number of duplications per family was 1.071 and 0 compared with 1.042 and 0, respectively.

4 Discussion

We have presented a detailed description of the TERA algorithm, a parsimony-based species tree aware method of gene tree reconstruction. We demonstrate that TERA reconstructs gene trees with nearly identical accuracy as the more complex ML based ALE method and, at least on the simulated datasets considered here, outperforms the other parsimony based species tree aware methods. Examining a subset of single copy universal gene families we show that TERA does not overfit or underfit the species tree. This result lends credibility to TERA’s results on biological data, whereby two thirds of apparent gene transfers in gene trees reconstructed without taking into consideration the species tree are not recovered given knowledge of the species phylogeny. Although parsimony based methods are fundamentally limited in some aspects compared with model based probabilistic methods, in the case of species tree aware gene tree reconstruction our results indicate that parsimony based methods can closely approach their accuracy. A further advantage of TERA compared with the corresponding probabilistic method ALE is that it is faster (if only up to a constant factor), does not require explicit time-like branch lengths that are difficult to estimate, and due to its relative simplicity, in particular the lack of numerical integration, is more robust in practice. Compared with parsimony based methods that require prior assumptions about statistical support, TERA is distinguished by its ability to estimate a statistical support threshold from its input. In contrast to the methods considered here it does require more elaborate upstream analysis, taking as its input a sample of trees from e.g. an MCMC-based tree inference methods, while in contrast MowgliNNI requires a single tree with branch supports, and TreeFix-DTL a multiple sequence alignment. Finally, while we have shown that it is possible to estimate the c parameter, we have been less successful in estimating all four costs () simultaneously, due to the tendency of the cost estimates to diverge toward a very low transfer cost, and a correspondingly large number of transfers. We expect that relaxing the, in general, unrealistic assumption of independence between events could ameliorate this problem.

31 in total

1. Bayesian gene/species tree reconciliation and orthology analysis using MCMC.

Authors: Lars Arvestad; Ann-Charlotte Berglund; Jens Lagergren; Bengt Sennblad
Journal: Bioinformatics Date: 2003 Impact factor: 6.937

2. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci.

Authors: Bruce Rannala; Ziheng Yang
Journal: Genetics Date: 2003-08 Impact factor: 4.562

Review 3. Models, algorithms and programs for phylogeny reconciliation.

Authors: Jean-Philippe Doyon; Vincent Ranwez; Vincent Daubin; Vincent Berry
Journal: Brief Bioinform Date: 2011-09 Impact factor: 11.622

4. Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations.

Authors: Gergely J Szöllosi; Bastien Boussau; Sophie S Abby; Eric Tannier; Vincent Daubin
Journal: Proc Natl Acad Sci U S A Date: 2012-10-04 Impact factor: 11.205

5. A Bayesian method for analyzing lateral gene transfer.

Authors: Joel Sjöstrand; Ali Tofigh; Vincent Daubin; Lars Arvestad; Bengt Sennblad; Jens Lagergren
Journal: Syst Biol Date: 2014-02-20 Impact factor: 15.683

6. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss.

Authors: Mukul S Bansal; Eric J Alm; Manolis Kellis
Journal: Bioinformatics Date: 2012-06-15 Impact factor: 6.937

7. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space.

Authors: Fredrik Ronquist; Maxim Teslenko; Paul van der Mark; Daniel L Ayres; Aaron Darling; Sebastian Höhna; Bret Larget; Liang Liu; Marc A Suchard; John P Huelsenbeck
Journal: Syst Biol Date: 2012-02-22 Impact factor: 15.683

8. Species tree inference by minimizing deep coalescences.

Authors: Cuong Than; Luay Nakhleh
Journal: PLoS Comput Biol Date: 2009-09-11 Impact factor: 4.475

9. Lateral gene transfer from the dead.

Authors: Gergely J Szöllosi; Eric Tannier; Nicolas Lartillot; Vincent Daubin
Journal: Syst Biol Date: 2013-01-25 Impact factor: 15.683

10. Efficient exploration of the space of reconciled gene trees.

Authors: Gergely J Szöllõsi; Wojciech Rosikiewicz; Bastien Boussau; Eric Tannier; Vincent Daubin
Journal: Syst Biol Date: 2013-08-06 Impact factor: 15.683

14 in total

1. Deciphering Microbial Gene Family Evolution Using Duplication-Transfer-Loss Reconciliation and RANGER-DTL.

Authors: Mukul S Bansal
Journal: Methods Mol Biol Date: 2022

2. Counting and sampling gene family evolutionary histories in the duplication-loss and duplication-loss-transfer models.

Authors: Cedric Chauve; Yann Ponty; Michael Wallner
Journal: J Math Biol Date: 2020-02-15 Impact factor: 2.259

3. Consistency and convergence rate of phylogenetic inference via regularization.

Authors: Vu Dinh; Lam Si Tung Ho; Marc A Suchard; Frederick A Matsen
Journal: Ann Stat Date: 2018-06-27 Impact factor: 4.028

4. Reconstructing a SuperGeneTree minimizing reconciliation.

Authors: Manuel Lafond; Aïda Ouangraoua; Nadia El-Mabrouk
Journal: BMC Bioinformatics Date: 2015-10-02 Impact factor: 3.169

5. Integrated pipeline for inferring the evolutionary history of a gene family embedded in the species tree: a case study on the STIMATE gene family.

Authors: Jia Song; Sisi Zheng; Nhung Nguyen; Youjun Wang; Yubin Zhou; Kui Lin
Journal: BMC Bioinformatics Date: 2017-10-03 Impact factor: 3.169

6. A Comprehensive Evolutionary Scenario of Cell Division and Associated Processes in the Firmicutes.

Authors: Pierre S Garcia; Wandrille Duchemin; Jean-Pierre Flandrois; Simonetta Gribaldo; Christophe Grangeasse; Céline Brochier-Armanet
Journal: Mol Biol Evol Date: 2021-05-19 Impact factor: 16.240

7. The Great Oxidation Event expanded the genetic repertoire of arsenic metabolism and cycling.

Authors: Song-Can Chen; Guo-Xin Sun; Yu Yan; Konstantinos T Konstantinidis; Si-Yu Zhang; Ye Deng; Xiao-Min Li; Hui-Ling Cui; Florin Musat; Denny Popp; Barry P Rosen; Yong-Guan Zhu
Journal: Proc Natl Acad Sci U S A Date: 2020-04-29 Impact factor: 11.205