Literature DB >> 32502238

GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss.

Benoit Morel¹, Alexey M Kozlov¹, Alexandros Stamatakis^1,2, Gergely J Szöllősi^3,4,5.

Abstract

Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).

Entities: Chemical Disease Species

Keywords: gene duplication; gene family tree; horizontal gene transfer; maximum likelihood; reconciliation

Mesh：

Year: 2020 PMID： 32502238 PMCID： PMC8312565 DOI： 10.1093/molbev/msaa141

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Introduction

Reconstructing the evolutionary history of homologous genes constitutes a fundamental problem in phylogenetics, as the phylogenetic trees of gene families (henceforth, gene family trees [GFTs]) play a prominent role in numerous biological studies. For instance, GFTs are essential to understand genome dynamics (Touchon et al. 2009), to study specific traits (Musilova et al. 2019), or to infer the species tree (Boussau et al. 2013; Mirarab et al. 2014). Standard phylogenetic methods infer trees from multiple sequence alignments (MSAs), for instance using the maximum likelihood (ML) criterion (Nguyen et al. 2015; Kozlov et al. 2019). Under the correct substitution model, ML methods are statistically consistent (Yang 1994), that is, they converge to the true tree when the sequences are long enough. However, this condition is often violated for GFTs: typical per-gene MSAs are short (50–1,000 sites) and can comprise a large number of sequences representing a large number of taxa (hundreds or thousands for large gene families). As a result, there is typically insufficient signal in the MSA to reconstruct a well-supported phylogeny. In other words, the tree with the highest likelihood might not correspond to the true tree. Species-tree-aware (STA) approaches aim to compensate for this insufficient phylogenetic signal by relying on a putative species tree. Indeed, GFTs and the species tree exhibit an intricate relationship: genes evolve within a (species) genome and undergo biological processes such as duplication, horizontal gene transfer (HGT), loss, or speciation (fig. 1). Therefore, although GFTs can be incongruent with the species tree, their own evolutionary history is, to a substantial degree, determined by the species tree. STA methods exploit this relationship between the GFTs and the species tree to leverage additional information for GFT inference. In the following, we denote gene duplication, gene loss, and HGT events as duplication, transfer, loss (DTL) events .

Fig. 1.

A gene tree evolving along the species tree, and several possible inferred trees. (a) The true history. The gene tree (blue lines) evolves within the species tree (gray area) and undergoes speciations (S), duplications (D), losses (L), and HGT (T). (b) The true rooted gene tree. (c) An unrooted gene tree inferred with a sequence-aware method. The splits between gene lineages are very close in time, and there is not enough signal in the sequences to correctly infer the unrooted gene tree topology. (d) Rooted tree inferred from the species tree only (without accounting for the sequences), assuming that HGTs are less likely than duplications A common approach used by STA methods (Chen et al. 2000; Scornavacca et al. 2015; Noutahi et al. 2016) consists of contracting weakly supported GFT branches into polytomies, which are subsequently resolved using the species tree. These heuristics limit the set of GFTs explored to trees that can be obtained as combinations of alternative resolutions of the contracted branches. Most existing implementations (Chen et al. 2000; Noutahi et al. 2016) are based on parsimony and require a priori specification of arbitrary DTL parsimony costs. This is particularly problematic if the substitution model is misspecified, or if it fails to adequately capture the complexity of the data. This is commonly the case for shorter gene alignments where parameter-rich substitution models are more difficult to use. In addition, the user must define what a “low support value” for branch contraction is, often by setting an arbitrary threshold. Treerecs (Comte et al. 2020) addresses this last limitation by exploring several thresholds and returning the GFT that maximizes a likelihood score that is based on both the MSAs and the species tree. Finally, obtaining branch support values usually requires a substantial amount of computational effort (e.g., 1–2 orders of magnitude more than for a simple ML tree search on the original MSA, if the classic Felsenstein Bootstrap is used [Felsenstein 1985]). Other STA methods utilize a hierarchical probabilistic model of sequence level substitutions and gene level events, such as duplication, transfer, and loss. This allows the definition of the joint likelihood as the product of the probability of observing the alignments given the GFTs (phylogenetic likelihood) and the probability of observing the GFTs given the species tree (reconciliation likelihood): where S is the species tree, is the set of GFTs, and the set of corresponding MSAs. Phyldog (Boussau et al. 2013) coestimates the GFTs and the species tree by conducting a tree search that is based on such a joint likelihood score. However, Phyldog does not model HGT. ALE (Szöllősi, Rosikiewicz, et al. 2013) calculates the joint likelihood using a dynamic programing scheme that requires the phylogenetic likelihood to be approximated via conditional clade probabilities (Larget 2013). In order to calculate conditional clade probabilities, ALE requires a sample of GFTs as input that are typically obtained via Markov Chain Monte Carlo (MCMC) sampling. This approach has two shortcomings. First, the conditional clade probability approximation inevitably limits the set of GFTs explored to trees that are composed of clades observed in an tree sample, as the phylogenetic likelihood of all other trees is approximated to be 0 (Szöllősi, Rosikiewicz, et al. 2013). Although being less restrictive, conceptually this limitation is nonetheless analogous to those induced by the branch contraction methods discussed above. It is also similarly sensitive to model miss-specification and inadequacy. Second, obtaining a tree sample, either via Bayesian phylogenetic MCMC methods or via bootstrap methods for a set of gene families is computationally expensive. For an in depth review of GFT inference methods, see Szöllősi et al. (2015) and El-Mabrouk and Noutahi (2019). Probabilistic frameworks to model both, sequence (Felsenstein 1981) and gene evolution events (Åkerborg et al. 2009; Sennblad and Lagergren 2009; Szöllősi, Tannier, et al. 2013), can be found in the literature. However, no ML tool can currently directly infer GFTs from MSAs by simultaneously accounting for sequence substitutions and DTL events. We believe that such a method can substantially improve the accuracy of GFT inference. A common argument against using STA ML approaches is the amount of time and computational resources required to conduct such analyses (El-Mabrouk and Noutahi 2019). However, a joint (phylogenetic and reconciliation likelihood) ML approach does not require expensive preprocessing and can therefore decrease the overall computational cost substantially, while increasing accuracy at the same time. Tree search heuristics are widely used to infer phylogenies from sequence data (Nguyen et al. 2015; Kozlov et al. 2019) using the phylogenetic likelihood. Thus, extending these methods by joint likelihood calculations represents a natural way of improving the accuracy of GFT inference. Here, we introduce GeneRax, our novel software to infer ML reconciled GFTs based on a joint reconciliation and phylogenetic likelihood. We use the term reconciled GFT to designate both, the GFT topology and its reconciliation with the species tree. The input for GeneRax consists of a rooted, but undated binary (fully bifurcating) species tree, a set of per-family MSAs (DNA or amino-acid), and corresponding gene-to-species leaf name mappings. Several genes from the same gene family can be mapped to the same species. In addition, the user can provide initial GFTs, typically inferred via standard phylogenetic methods (Nguyen et al. 2015; Kozlov et al. 2019). GeneRax is easy to use, models DTL events, and can process gene families in parallel. Employing a hierarchical probabilistic model allows it to simultaneously account for both, the signal from the gene family MSAs and from the species tree. It estimates all substitution and DTL events intensity parameters and does not require any ad hoc threshold nor any arbitrary DTL event parsimony costs. Nonetheless, one should keep in mind that incomplete lineage sorting (ILS) constitutes another important source of discordance between GFTs and the species tree. A recent study suggests that ILS can bias reconciliation inference (Zheng and Zhang 2014). To this end, we also assess the impact of ILS on the reconstruction accuracy of STA methods and discuss the limitations of GeneRax in the presence of ILS.

New Approaches

In this section, we outline the joint likelihood computation, our tree search algorithm, and our parallelization scheme.

Reconciliation Likelihood

In this subsection, we derive the reconciliation likelihood for a rooted GFT given an undated, yet rooted species tree, as implemented in ALE. The “undated” DTL model, in contrast to the continuous time model described in Szöllősi, Tannier, et al. (2013), is a discrete state model, which starts with a single gene copy on a branch of the species tree. Subsequently, gene copies evolve independently until, either all copies are observed at the leaves or every gene copy becomes extinct. On an arbitrary branch of the species tree, a gene copy either duplicates and is replaced by two corresponding gene copies on the same branch (with probability p); a new copy is transferred to a random branch that is not ancestral to the donor branch, but otherwise drawn uniformly at random from the species tree, while a copy also remains on the donor branch (with probability p); is lost (with probability p); undergoes a speciation event on internal branches, in which case it is replaced by a copy on each descendant branch (with probability ); and is observed for terminal branches, that is, arrives in the present and is observed, thus terminating the process (again with probability ). By δ, λ, and τ, we denote the duplication, loss, and transfer intensity parameters that parametrize the above event probabilities as follows: and The probability of observing a rooted GFT G under the undated DTL model defined above can be calculated by summing over all possible series of D, T, L, and S events (henceforth “scenarios”) that yield a rooted topology that is congruent with G. The sum over all possible scenarios is computed in two steps (Sjöstrand et al. 2013; Szöllősi, Tannier, et al. 2013). First, we calculate the extinction probability of a gene copy that was initially present on some branch of the species tree. The extinction probability is the sum over all scenarios that do not yield descendants. Second, we sum over all reconciliations of G, where a reconciliation of G corresponds to a specific sequence of D, T, S, and gene copy extinction events, and its probability corresponds to the product of the specific sequence of events (cf., fig. 2).

Fig. 2.

Calculating the probability of G along S. (a) The probability that the rooted GFT G is generated along the rooted species tree S according to the “undated” DTL process can be calculated by summing over all reconciliations. Here, we show the leading terms in the sum over all reconciliations that start with a single gene copy on branch 1 of S and obtain a rooted gene tree that is congruent with G. (b) More generally, to calculate , that is, the sum over all reconciliations generating the subtree below some internal node u of G starting from a single gene present on the internal branch e of S, we must consider the following events (i) if e is an internal branch of S, speciation with probability p such that the descendants of v on G are observed on f of S and of w on G are observed on g of S, or vice versa. If e is a terminal branch, with probability p gene u will be observed at the terminal branch e; (ii) if e is an internal branch of S, speciation with probability p such that the descendants of u are observed on f of S and the copy on g goes extinct with probability E, or vice versa; (iii) duplication with probability p such that v and w are both observed on e; (iv) duplication with probability p such that either the first or the second copy goes extinct, each with probability E and u is observed on e; (v) transfer with probability p, such that the respective branches v and w correspond to the copy on the donor branch e of S, whereas the other copy corresponds to the recipient copy on branch h of S that is not an ancestor of e; and finally (vi) transfer with probability p followed by the extinction of either, the copy in the donor linage e with probability E of the extinction of the copy in the recipient, with probability E. These correspond to the terms of equation (2). To begin, let e be a branch of the species tree S, and let f and g be its descendant branches (remember that the species tree is rooted). Let be the set of species tree branches that can receive an HGT from e. Because we do not assume any time information on the species tree other than the order of descent induced by the rooted tree topology, we consider that corresponds to all nodes that are not ancestors of e. We allow transfers from e to its descendants, because a gene could have evolved along an extinct or unsampled lineage and could subsequently have been transferred back to a descendant of e (Szöllősi, Tannier, et al. 2013). The extinction probability, that is, the probability that a gene copy observed on an internal branch e becomes extinct before being observed at the tips of the specie tree is The terms correspond to the 1) loss probability, 2) speciation and subsequent extinction probability in both descending lineages (this term must be omitted for terminal branches), 3) duplication and subsequent extinction probability of both copies, and finally 4) transfer and subsequent extinction probability of both, the donor copy on branch e and the transferred copy on branch h. For the latter event, we have introduced the notation: In (6), the value of E depends on , and thus on the extinction probabilities of all species in the species tree. We iteratively estimate and E for all nodes e in the species tree, by initializing and computing: We do not prove the existence of the limit nor that this limit is the solution of equation (6) here. In simulations, we observed that five iterations are sufficient to estimate E, and we have thus set the number of iterations to 5 in our implementation. In the special case where τ = 0 (no HGT), the contribution of the term zero, and we can directly compute E from E and E. To calculate the probability of a rooted GFT G, we have to sum over all reconciliations of G. This includes D, T, S, and gene extinction events that may have generated the observed, rooted GFT along the species tree. The algorithm to calculate the sum over all reconciliation histories proceeds from the tips of the rooted species tree and rooted GFT toward their respective roots. Let v and w be descendants of u on G, and f as well as g be descendants of e on the species tree S. For calculating the recursive sum over reconciliations, consider , as the sum over all reconciliations that generate the subtree below some internal node u of G starting from a single gene being present on the internal branch e of the species tree S. We calculate by enumerating all possible single D, T, and S events that can result from u on e. These are shown in figure 2 and yield where we have introduced the notation: where denotes the branches of S that are not ancestors of e. If both e and u are terminal branches, we set . Similar to the expression for the extinction probability, depends on itself. We solve this through fixed point iteration analogously to (6). Aside of the self-dependence, every other term involves either descendant branches in G (u and w), descendant branches in S (f and g), or both. This allows to devise a bottom-up dynamic programing recursion starting at the leaves, such that for leaf g of the GFT and leaf s of the species tree , if gene g maps to species s, and zero otherwise. Given the above, to calculate the reconciliation likelihood, let G be a rooted GFT, r its root, S a rooted species tree, the set of nodes of S, and the set of DTL intensity parameters. The reconciliation likelihood can then be expressed as where we divide by to condition on survival, as extinct gene families cannot be observed.

Joint Likelihood Evaluation

GeneRax attempts to maximize the joint likelihood defined as where is the set of GFTs, S is the species tree, N are the DTL event intensity parameters, and is the set of gene family MSAs. GeneRax estimates the reconciliation likelihood based on the dynamics programing recursion described above. It uses the highly optimized pll-modules library (Darriba et al. 2019) to compute the phylogenetic likelihood . Hence, GeneRax offers all substitution models also supported by RAxML-NG (Kozlov et al. 2019).

Joint Likelihood Optimization

Given a set of MSAs and a species tree, GeneRax searches for the set of rooted GFTs and DTL intensity parameters that maximizes the joint likelihood. We illustrate the search procedure in figure 3.

Fig. 3.

GeneRax pipeline. In each step, we draw in red the parameters that GeneRax optimizes and in gray the fixed parameters that GeneRax uses to compute the likelihoods. GeneRax performs Step 0 only when starting from random GFTs, to infer ML GFTs from the MSAs. Step 1 optimizes the DTL event rates from the GFTs and the species tree. Step 2 optimizes the GFTs from the MSAs, the species tree and the DTL rates. GeneRax repeats Step 1 and Step 2 with increasing SPR radius, until it reaches the maximum radius. Then, it applies Step 3 to reconcile the GFTs with the species tree. GeneRax starts either from user-specified GFTs or from random GFTs. Our joint likelihood search algorithm needs to start from GFTs with high phylogenetic likelihood, preferably inferred with phylogenetic ML tools such as RAxML-NG (Kozlov et al. 2019). We provide a rationale for this in the Results section. When starting from random GFTs, GeneRax performs an initial search (Step 0 in fig. 3) that solely maximizes the phylogenetic likelihood, without accounting for the reconciliation likelihood. After this optional step, GeneRax starts optimizing the joint likelihood, by alternating between optimizing the GFTs and the DTL event intensity parameters. When optimizing the GFTs (Step 1 in fig. 3), GeneRax processes each family independently, and applies a tree search heuristic to each of them separately: for a given tree, it tests all possible Subtree Prune and Regraft (SPR) moves within a given radius and subsequently applies the SPR move that yields the tree with the highest joint likelihood. Then, it iterates by again applying SPR moves to this new tree, until the joint likelihood cannot be further improved. At the end of the GFT optimization, GeneRax increases the SPR radius by 1 until a certain maximum value is reached (see further below). GeneRax optimizes the DTL intensity parameters globally over all gene families (Step 2 in fig. 3). To this end, we apply the gradient descent method to find a set of DTL intensity parameters that maximizes the reconciliation likelihood over all gene families. We numerically approximate the gradient via finite differences. The entire procedure stops when the SPR radius (starting from 1) exceeds a user-defined value. When the user does not define this maximum SPR radius, we set it to 5, as we did not observe any improvement above this value in our experiments.

GFT and Species Tree Reconciliation

The reconciliation likelihood computation algorithm conducts a postorder traversal of both, the species tree and the GFT, and sums over all possible scenarios at each step of the traversal. To infer the ML reconciliation (Step 3 in fig. 3), GeneRax keeps track of the ML path during the traversal. GeneRax can export the reconciled GFTs into both Notung (Chen et al. 2000) and RecPhyloXML (Duchemin et al. 2018) formats (fig. 4).

Fig. 4.

Reconciled GFT and species tree. Users can easily visualize reconciliations inferred with GeneRax using the online tool RecPhyloXML-visu (Duchemin et al. 2018). This example illustrates one HGT and one duplication events.

Parallelization

Achieving “good” parallel efficiency given a large number of gene families is challenging: The most straightforward solution consists in assigning a subset of gene families to each core (Boussau et al. 2013). However, gene family MSAs are highly heterogeneous in terms of size and are hence hard to evenly distribute over cores (Morel et al. 2018) such as to achieve “good” load balance. In particular, large gene family MSAs can easily generate a parallel performance bottleneck. Our solution allows to split up individual inferences on such large gene family MSAs across several cores. Thus, we parallelize over, but also within gene families, in analogy to our ParGenes (Morel et al. 2018) tool. However, unlike ParGenes, GeneRax parallelizes individual GFT searches over the possible SPR moves and not over MSA sites. For a given GFT, we distribute the SPR moves we intend to apply among the cores assigned to the reconciliation of the GFT and apply them simultaneously. We adopted this parallelization approach for two reasons: 1) unlike the phylogenetic likelihood, the time for computing a reconciliation likelihood does not depend on the number of sites (i.e., a parallelization will not scale with the number of sites in contrast to the phylogenetic likelihood) and 2) per-MSA gene sequences are typically not long enough to efficiently parallelize the phylogenetic likelihood calculations over the sites.

Results

In the following, we present the results of our experiments. For all methods, we report GFT quality (measured by Robinson–Foulds [RF] distance to the true trees on simulated data sets and joint likelihood on empirical data sets) and computational efficiency (measured by sequential runtime and parallel efficiency). All data and all inferred trees are available at https://cme.h-its.org/exelixis/material/generax_data.tar.gz.

RF Distances to True Trees

We show the relative RF distances between the 1,099 simulated Cyanobacteria true GTRs and the respective inferred GTRs in figure 5. For methods that yield more than one potential GFT per gene family (ALE and RAxML-NG), we average the distance over all inferred trees.

Fig. 5.

Relative RF distances to true trees, by inferring gene trees with the true substitution model (LG + Γ + I) and a misspecified substitution model (WAG).

Relative RF distances to true trees, by inferring gene trees with the true substitution model (LG + Γ + I) and a misspecified substitution model (WAG). GeneRax and ALE perform better than all other methods, except in the case of the misspecified substitution model where Treerecs performs equally well. Under the true model, STA methods that do not account for HGT but use a joint likelihood score (Phyldog and Treerecs) perform better than the purely sequence-based method (RAxML-NG), but worse than methods accounting for HGT. Although EcceTERA accounts for transfers, it only performs as good as Treerecs, presumably because the EcceTERA algorithm only uses parsimony. We hypothesize that Notung performs worse than all the other methods because it rearranges trees based on a parsimony score and an arbitrary support value threshold. We summarize the results of the GenPhyloData simulations where we vary parameters (DTL intensity parameters, etc.) in the presence of HGT in figure 6 and the results of the simulations in the absence of HGT in the Supplementary Material online . GeneRax finds the best trees in 90% of our simulation scenarios, but ALE finds trees that are almost as good in most simulations. Treerecs and Phyldog perform almost as well as GeneRax and ALE in the absence of HGT, but worse under HGT. Notung performs significantly worse than all SPA methods.

Fig. 6.

Comparison of different GTF correction tools on simulated data sets, in the presence of horizontal gene transfers. (a) Species taxa number, (b) sites number, (c) increasingly wrong species tree, (d) average DTL rates, (e) ratio between duplication and transfers rates, and (f) gene tree branch lengths. All STA methods show an analogous accuracy pattern when we vary parameters: They perform better with increasing gene sequence signal strength (fig. 6) and perform worse with increasing discordance between the species tree and the GFTs (fig. 6). We show the results of the SimPhy simulations over varying ILS discordance scores in figure 7. GeneRax outperforms all other STA tools. It finds better GFTs than the only non-STA method (RAxML-NG) up to an ILS discordance score of 0.6. Our findings suggest that GeneRax can be deployed for analyzing data sets that exhibit a moderate degree of ILS.

Fig. 7.

RF distance to true trees on simulated data sets with increasing discordance due to ILS.

Branch Score Distances to True Trees

To compare the quality of the gene branch lengths in terms of expected number of substitutions per site, we measured the average branch score distance (Kuhner and Felsenstein 1994) between the inferred trees and the true trees (fig. 8) with the phangorn R library (Schliep 2011). GeneRax performs better than all competing tools. In particular, GeneRax shows a better average branch score distance (1.02) than ALE (1.48). A possible explanation for this is that ALE does not infer the branch lengths by optimizing the phylogenetic likelihood score, as opposed to GeneRax, Treerecs, and RAxML-NG. When using ALE, Notung, Phyldog, or EcceTERA, users interested in branch length accuracy would need to include an additional tool into their pipeline (e.g., RAxML-NG).

Fig. 8.

Branch score distance to true trees. We excluded from the plot methods that do not infer the branch lengths.

Joint Likelihood

We report the joint ML scores of the GFTs obtained with the different tools in figure 9. As the true tree is generally not know for empirical data, and given that we are willing to accept the ML criterion, we must assume that the tree yielding the best joint ML is also the one that best explains the data. This approach of benchmarking ML tools on empirical data sets has been used repeatedly for assessing standard tree inference tools (Nguyen et al. 2015; Kozlov et al. 2019). The rationale for this is that standard tree searches based on the phylogenetic likelihood are inherently more difficult on empirical than on smooth and perfect simulated data. That is, differences between tree search algorithms might sometimes only be observable on empirical data. As expected, GeneRax finds the highest joint likelihood score. ALE is close to GeneRax, because it strives to approximate the same model. As the remaining tools implement distinct models, our comparison might appear as being unfair. However, we mainly regard this as a means of verifying that GeneRax properly maximizes the likelihood under its specific reconciliation model. Treerecs and Phyldog are also very close to GeneRax in the absence of transfers, because they deploy a similar joint likelihood model. ALE performs better than Treerecs and Phyldog in the presence of HGT, because Treerecs and Phyldog only account for gene duplication and loss. RAxML-NG, EcceTERA, and Notung do not implement a joint reconciliation likelihood model, which explains their low scores.

Fig. 9.

Log-likelihoods (the higher the better) evaluated with GeneRax. When evaluating the joint likelihood for Primates, we set the HGT rate to 0. (a) Primates and (b) Cyanobacteria.

Log-likelihoods (the higher the better) evaluated with GeneRax. When evaluating the joint likelihood for Primates, we set the HGT rate to 0. (a) Primates and (b) Cyanobacteria. In addition, when running GeneRax on the empirical Cyanobacteria data set, we recorded both, the reconciliation likelihood and the phylogenetic likelihood during the tree search (fig. 10). We observe that the joint likelihood optimization occurs through an increase of the reconciliation likelihood in conjunction with a decrease of the phylogenetic likelihood. We observed this consistently on all simulated and empirical data sets we experimented with. In general, we observed that our joint likelihood tree search heuristic is not efficient in improving the phylogenetic likelihood score, and thus needs to start from trees with a high phylogenetic likelihood. For this reason, when the user does not provide a starting tree, we initially only optimize the phylogenetic likelihood, and only subsequently start the joint likelihood optimization.

Fig. 10.

Reconciliation and sequence log-likelihoods during GeneRax tree search on the Cyanobacteria data set. The sequence likelihood decreases while the reconciliation likelihood increases.

Sequential Runtimes

We measured the sequential runtimes of all tools on the empirical Cyanobacteria data set. Comparing runtimes is not straightforward: Some tools are very fast but require an external preprocessing step, as described in table 1 . For instance, Notung is the fastest tool, but it requires GFTs with support values as input, and obtaining those can be extremely time-consuming. For a fair comparison, we plot both the time spent in the GFT inference tools alone, and the time spent in their respective preprocessing steps (fig. 11).

Table 1.

Software	Method Type	Input Trees	STA	HGT	Ref.
RAxML-NG	ML	Random	No	No	Kozlov et al. (2019)
Notung	Parsimony	Supported ML	Yes	No	Chen et al. (2000)
Treerecs	Parsimony + ML	Supported ML	Yes	No	Comte et al. (2020)
Phyldog	ML	ML	Yes	No	Boussau et al. (2013)
EcceTERA	Parsimony	Supported ML or MCMC samples	Yes	Yes	Scornavacca et al. (2015)
ALE	ML	MCMC samples	Yes	Yes	Szöllősi, Rosikiewicz, et al. (2013)
GeneRax	ML	Random or ML	Yes	Yes	This paper

Fig. 11.

Sequential runtimes and additional overhead from precomputation steps (bootstrap trees with RAxML-NG for Notung and Treerecs, MCMC samples with MrBayes for ALE and EcceTERA, and RAxML-NG starting trees for GeneRax-raxml). The RAxML-NG column corresponds to the time spent in one single tree search. We represent times with a logarithmic scale.

Softwares Used in Our Benchmark, with the Type of Method (ML, parsimony, or both), the Nature of the Input Trees (random tree, ML tree, tree with bootstrap-support values, or MCMC sample of trees), Whether the Method Is STA and Whether the Method Accounts for HGT. Sequential runtimes and additional overhead from precomputation steps (bootstrap trees with RAxML-NG for Notung and Treerecs, MCMC samples with MrBayes for ALE and EcceTERA, and RAxML-NG starting trees for GeneRax-raxml). The RAxML-NG column corresponds to the time spent in one single tree search. We represent times with a logarithmic scale. When only considering the stand-alone runtimes of the tools, GeneRax is the slowest method. However, when including the preprocessing cost, GeneRax becomes the fastest STA approach. In addition, using only a single tool for the entire inference process substantially improves usability and reproducibility of the analyses.

Parallel Efficiency

We measured the parallel runtimes of GeneRax for different numbers of cores. For this experiment, we executed GeneRax on the empirical Cyanobacteria data set (1,099 families), starting from RAxML-NG trees. We used 4 up to 512 cores. Despite the highly heterogeneous gene family MSA sizes (in terms of both number of sites and number of taxa, see Supplementary Material online), GeneRax achieves a high parallel efficiency of 70% on 512 cores. We plot the speedup as a function of the number of cores in the Supplementary Material online. We also measured the parallel efficiency of running the competing methods as described in the Materials and Methods section and plot them in the Supplementary Material online. GeneRax is the only tool that achieves good efficiency (70%) because it parallelizes both, over, and within gene families, thereby achieving a “good” load balance. Despite a similar two-level parallelization scheme, the parallel efficiency of RAxML-NG (scheduled with ParGenes, with one starting tree per family) is below 20%. The reason for this is that ParGenes parallelizes individual tree searches over the sites whereas GeneRax parallelizes them over the SPR moves. Gene MSAs are often short, and there is typically not a sufficient number of sites to allocate several cores per tree search with RAxML-NG. Other competing tools also fail to attain good parallel efficiency (40%), because they do not parallelize individual GFT inferences and are thus limited by the longest individual per-tree inference time. The parallel efficiency of GeneRax decreases when starting from random trees, because the initial phylogenetic likelihood optimization step is based on RAxML-NG code, which does not implement our aforementioned two-level parallelization scheme yet.

Discussion

An Accurate, Robust, and Fast Approach

We present GeneRax, an open source STA GFT inference software. GeneRax can simultaneously account for substitution and DTL events. It performs a tree search to optimize a joint likelihood, that is, the product of the phylogenetic likelihood and the reconciliation likelihood. It can handle multiple gene families in parallel. To the best of our knowledge, GeneRax is the first STA tool that does not require any preprocessing of the MSAs. Also, it does not require any arbitrary threshold settings or parsimony weights, and it can account for HGT. On simulated data sets, we demonstrate that GeneRax and ALE find trees that are closer to the true trees than those inferred by competing tools. We show that GeneRax can provide more accurate GFTs even when the species tree is inaccurate and the substitution model is misspecified. Using two empirical data sets (Cyanobacteria and Primates), we confirm that GeneRax finds the best-scoring ML trees under its specific model among the tested tools, both, with and without HGT. Finally, we show that GeneRax not only is faster than the tested competing methods (when accounting for the computational cost of the preprocessing steps) but also has a substantially higher parallel efficiency, making it suitable for seamless large-scale analyses. GeneRax is a production-level code. We released it on BioConda (Grüning et al. 2018) to facilitate installation, and we kept its interface as simple as possible. Although most competing STA methods require input GFTs, sometimes, including additional information (e.g., support values), GeneRax can directly infer the GFTs from a set of given MSAs. This simplified analysis process reduces the number of ad hoc choices that users have to make: GeneRax does not require bootstrap-support thresholds, parsimony weights, MCMC convergence criteria, chain settings, proposal tuning, or priors. Reducing the number of arbitrary choices does not only yield the tool easier to run but also substantially improve the reproducibility of the results. One could contest the parameters we used in our experiments for the preprocessing steps: Treerecs and Notung might not need 100 bootstrap trees to obtain reliable support values. ALE and EcceTERA might not need as many MrBayes runs, chains, or generations to correctly approximate the phylogenetic likelihood. In general, it is possible to run the preprocessing steps faster than in our experiments. When running the competing methods, we tried to use the parameters that favor result quality/confidence over short runtimes, as we would have done in a real analysis.

Limitations of GeneRax

GeneRax relies on two important assumptions: first, that the rooted species tree is known, and second, that the observed discordance between the GFTs and the species tree is mainly due to D, T, and L events. Our experiments suggest that, when those assumptions are violated, GeneRax can only improve the quality of the GFTs up to a certain degree. In particular, users should be cautious when using GeneRax in the presence of ILS. Furthermore, GeneRax is not suitable for improving GFT topologies in the presence of hybridization. Nonetheless, GeneRax might be deployed for detecting potential hybridization events, by identifying species pairs exhibiting an “abnormally high” number of HGT events.

Future Work

Despite the favorable evaluation results, GeneRax still faces several challenges. First, the GeneRax reconciliation model does not take into account the branch lengths, neither in the species tree nor in the GFTs. This leads to information loss and furthermore allows for transfers between noncontemporary species. We believe that further adapting and extending the reconciliation model could improve the quality of the results. For instance, one could exploit an ultrametric dated species tree and use speciation events to slice the species tree, as done in Szöllosi et al. (2012). However, slicing the species tree increases the number of inner species nodes quadratically and thus incurs a substantial increase in computational cost. Second, the GeneRax reconciliation model assumes that ILS does not occur. Some promising work (Rasmussen and Kellis 2012; Chan et al. 2017) has been conducted to combine DTL events and ILS in a single model. We believe that a computationally efficient software that can account for ILS, DTL events, and substitutions in a probabilistic framework would represent a major breakthrough in phylogenetic inference. Finally, GeneRax needs a known/given species tree to estimate the GFTs. To this end, we plan to extend GeneRax to coestimate both, the GFTs and the species tree, as done in Boussau et al. (2013). An approach to solving this challenge consists in inferring initial GFTs with non-STA methods and then inferring an initial species tree that maximizes the reconciliation likelihood given these GFTs. Then, in a second step, one can propose new species tree topologies, optimize the GFTs and DTL intensity parameters on the proposed new species tree topology, and update the species tree if the joint likelihood improves.

Materials and Methods

We compared GeneRax with competing GFT inference methods on both, simulated and empirical data sets.

Tested Software

This subsection describes the settings we used for executing the competing tools (summarized in table 1) in all of our experiments. We used ParGenes (Morel et al. 2018) to run RAxML-NG with ten random and ten parsimony starting trees and 100 bootstrap trees. For methods requiring starting GFTs, we selected the tree with the best likelihood found by RAxML-NG. We used 100 bootstrap trees to compute GFTs with branch support values as required for Notung and Treerecs. As Notung does not provide any explicit recommendation for setting the bootstrap-support threshold, we used the default value (90%). We executed Treerecs with its automatic threshold selection from seven threshold values (seven is the default value). We executed Phyldog with a fixed species tree using a maximum SPR radius of 5, as in GeneRax, since Phyldog does not have a recommended setting. To execute ALE, we first generated posterior tree samples with MrBayes (Ronquist et al. 2012), using two independent runs, four chains, 1,000,000 generations, a sampling frequency of 1,000 and a burn-in of 100 trees. We used the undated ALE model to produce 100 tree samples per gene family. We used the same MrBayes tree samples to execute EcceTERA with the amalgamate option, without transfer from the dead, and with the dated species tree option. Note that, Treerecs, Notung, MrBayes, EcceTERA, and ALE do not provide a parallelization over gene families for typical distributed memory compute cluster systems. To execute them on large data sets, we scheduled them with a dedicated MPI program, by dynamically assigning jobs (with one job per gene family) to the available MPI ranks, starting from the most expensive jobs with the largest gene family MSAs. Henceforth, we refer to sequential runtime as the sum of the time required by each program and to parallel runtime as the elapsed time spent for the entire MPI run. For a given number of cores, the parallel efficiency is the sequential runtime divided by the product of the parallel runtime and the number of cores. We executed GeneRax with default parameters and with both, random (GeneRax-random) and RAxML-NG (GeneRax-raxml) starting trees. When not stated otherwise, we present GeneRax results for random starting trees. When working on simulated data sets that were not expected to contain HGT, we executed both, ALE and GeneRax, with an HGT rate set to 0 and denote these runs as ALE-DL and GeneRax-DL. When accounting for HGT, we denote them as ALE-DTL and GeneRax-DTL.

Simulated Data Sets

We executed all tools listed in table 1 on the data set originally used to benchmark ALE (Szöllősi, Rosikiewicz, et al. 2013). Szöllosi et al. initially inferred GFTs for 1,099 Cyanobacteria gene families using ALE. Then, they simulated new sequences under the LG + Γ + I model along these trees, retaining both, the MSA sizes and branch lengths. In our experiments, we inferred GFTs once under LG + Γ + I (true substitution model) and once under WAG without rate heterogeneity (misspecified substitution model). In addition, we generated additional simulated data sets to investigate the influence of various parameters on the methods and their respective accuracy. The parameters we studied are the number of sites, the average gene branch lengths, the species tree size, and the DTL intensity parameters. We also used putative species trees that were increasingly different from the true species tree to quantify the robustness of the methods with respect to topological errors in the species tree. We simulated the species tree and GFTs using GenPhyloData (Sjöstrand et al. 2013) and the sequences using Seq-Gen (Rambaut and Grass 1997), which simulates a continuous time birth and death process along a time-like species tree. Finally, we executed simulations using SimPhy (Mallo et al. 2016) with increasing population sizes to assess the impact of ILS. We define the ILS discordance of a simulated data set as being the average relative RF distance (Robinson and Foulds 1981) between the true species tree and the true GFTs obtained when running the same simulations without D, T, or L events.

Empirical Data Sets

We executed all methods in table 1 on the empirical data sets listed in table 2. We measured both, sequential and parallel runtimes. We also used GeneRax to evaluate the joint likelihood of the trees inferred with each method, to assess the quality of our tree search algorithm whose goal is to maximize this likelihood.

Table 2.

Description of the Empirical Data Sets Used in Our Benchmarks.

Data Set	Database	Species	Families	Avg. Sites	Avg. Genes	Max. Genes
Primates	ENSEMBL	13	1,523	84	45	349
Cyanobacteria	HOGENOM	36	1,099	239	37	130

Note.—We extracted the Primates data set from the release 96 of the Ensembl Compara database (Zerbino et al. 2018). The Cyanobacteria data set was originally used in a previous study (Szöllősi, Rosikiewicz, et al. 2013) and was extracted from the HOGENOM database (Penel et al. 2009).

Description of the Empirical Data Sets Used in Our Benchmarks. Note.—We extracted the Primates data set from the release 96 of the Ensembl Compara database (Zerbino et al. 2018). The Cyanobacteria data set was originally used in a previous study (Szöllősi, Rosikiewicz, et al. 2013) and was extracted from the HOGENOM database (Penel et al. 2009).

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online. Click here for additional data file.

33 in total

1. Inferring incomplete lineage sorting, duplications, transfers and losses with reconciliations.

Authors: Yao-Ban Chan; Vincent Ranwez; Céline Scornavacca
Journal: J Theor Biol Date: 2017-08-09 Impact factor: 2.691

2. Vision using multiple distinct rod opsins in deep-sea fishes.

Authors: Zuzana Musilova; Fabio Cortesi; Michael Matschiner; Wayne I L Davies; Jagdish Suresh Patel; Sara M Stieb; Fanny de Busserolles; Martin Malmstrøm; Ole K Tørresen; Celeste J Brown; Jessica K Mountford; Reinhold Hanel; Deborah L Stenkamp; Kjetill S Jakobsen; Karen L Carleton; Sissel Jentoft; Justin Marshall; Walter Salzburger
Journal: Science Date: 2019-05-10 Impact factor: 47.728

3. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space.

Authors: Fredrik Ronquist; Maxim Teslenko; Paul van der Mark; Daniel L Ayres; Aaron Darling; Sebastian Höhna; Bret Larget; Liang Liu; Marc A Suchard; John P Huelsenbeck
Journal: Syst Biol Date: 2012-02-22 Impact factor: 15.683

4. Joint amalgamation of most parsimonious reconciled gene trees.

Authors: Celine Scornavacca; Edwin Jacox; Gergely J Szöllősi
Journal: Bioinformatics Date: 2014-11-06 Impact factor: 6.937

5. GenPhyloData: realistic simulation of gene family evolution.

Authors: Joel Sjöstrand; Lars Arvestad; Jens Lagergren; Bengt Sennblad
Journal: BMC Bioinformatics Date: 2013-06-27 Impact factor: 3.169

6. ParGenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes.

Authors: Benoit Morel; Alexey M Kozlov; Alexandros Stamatakis
Journal: Bioinformatics Date: 2019-05-15 Impact factor: 6.937

7. Lateral gene transfer from the dead.

Authors: Gergely J Szöllosi; Eric Tannier; Nicolas Lartillot; Vincent Daubin
Journal: Syst Biol Date: 2013-01-25 Impact factor: 15.683

8. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths.

Authors: Marie Touchon; Claire Hoede; Olivier Tenaillon; Valérie Barbe; Simon Baeriswyl; Philippe Bidet; Edouard Bingen; Stéphane Bonacorsi; Christiane Bouchier; Odile Bouvet; Alexandra Calteau; Hélène Chiapello; Olivier Clermont; Stéphane Cruveiller; Antoine Danchin; Médéric Diard; Carole Dossat; Meriem El Karoui; Eric Frapy; Louis Garry; Jean Marc Ghigo; Anne Marie Gilles; James Johnson; Chantal Le Bouguénec; Mathilde Lescat; Sophie Mangenot; Vanessa Martinez-Jéhanne; Ivan Matic; Xavier Nassif; Sophie Oztas; Marie Agnès Petit; Christophe Pichon; Zoé Rouy; Claude Saint Ruf; Dominique Schneider; Jérôme Tourret; Benoit Vacherie; David Vallenet; Claudine Médigue; Eduardo P C Rocha; Erick Denamur
Journal: PLoS Genet Date: 2009-01-23 Impact factor: 5.917

9. Efficient Gene Tree Correction Guided by Genome Evolution.

Authors: Emmanuel Noutahi; Magali Semeria; Manuel Lafond; Jonathan Seguin; Bastien Boussau; Laurent Guéguen; Nadia El-Mabrouk; Eric Tannier
Journal: PLoS One Date: 2016-08-11 Impact factor: 3.240

10. SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees.

Authors: Diego Mallo; Leonardo De Oliveira Martins; David Posada
Journal: Syst Biol Date: 2015-11-01 Impact factor: 15.683

17 in total

1. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition.

Authors: James Willson; Mrinmoy Saha Roddur; Baqiao Liu; Paul Zaharias; Tandy Warnow
Journal: Syst Biol Date: 2022-04-19 Impact factor: 9.160

2. Deciphering Microbial Gene Family Evolution Using Duplication-Transfer-Loss Reconciliation and RANGER-DTL.

Authors: Mukul S Bansal
Journal: Methods Mol Biol Date: 2022

3. Rooting Species Trees Using Gene Tree-Species Tree Reconciliation.

Authors: Brogan J Harris; Paul O Sheridan; Adrián A Davín; Cécile Gubry-Rangin; Gergely J Szöllősi; Tom A Williams
Journal: Methods Mol Biol Date: 2022

4. A consensus view of the proteome of the last universal common ancestor.

Authors: Andrew J Crapitto; Amy Campbell; A J Harris; Aaron D Goldman
Journal: Ecol Evol Date: 2022-06-03 Impact factor: 3.167

5. A Linear Time Solution to the Labeled Robinson-Foulds Distance Problem.

Authors: Samuel Briand; Christophe Dessimoz; Nadia El-Mabrouk; Yannis Nevers
Journal: Syst Biol Date: 2022-10-12 Impact factor: 9.160

6. Evolutionary Rate Variation among Lineages in Gene Trees has a Negative Impact on Species-Tree Inference.

Authors: Mezzalina Vankan; Simon Y W Ho; David A Duchêne
Journal: Syst Biol Date: 2022-02-10 Impact factor: 15.683

7. Comparative genomics reveals high rates of horizontal transfer and strong purifying selection on rhizobial symbiosis genes.

Authors: Brendan Epstein; Peter Tiffin
Journal: Proc Biol Sci Date: 2021-01-06 Impact factor: 5.349