
Lateral gene transfer from the dead.

Gergely J Szöllősi, Eric Tannier, Nicolas Lartillot, Vincent Daubin.

Abstract

In phylogenetic studies, the evolution of molecular sequences is assumed to have taken place along the phylogeny traced by the ancestors of extant species. In the presence of lateral gene transfer, however, this may not be the case, because the species lineage from which a gene was transferred may have gone extinct or not have been sampled. Because it is not feasible to specify or reconstruct the complete phylogeny of all species, we must describe the evolution of genes outside the represented phylogeny by modeling the speciation dynamics that gave rise to the complete phylogeny. We demonstrate that if the number of sampled species is small compared with the total number of existing species, the overwhelming majority of gene transfers involve speciation to and evolution along extinct or unsampled lineages. We show that the evolution of genes along extinct or unsampled lineages can to good approximation be treated as those of independently evolving lineages described by a few global parameters. Using this result, we derive an algorithm to calculate the probability of a gene tree and recover the maximum-likelihood reconciliation given the phylogeny of the sampled species. Examining 473 near-universal gene families from 36 cyanobacteria, we find that nearly a third of transfer events (28%) appear to have topological signatures of evolution along extinct species, but only approximately 6% of transfers trace their ancestry to before the common ancestor of the sampled cyanobacteria.

Year:  2013        PMID: 23355531      PMCID: PMC3622898          DOI: 10.1093/sysbio/syt003

Source DB:  PubMed          Journal:  Syst Biol        ISSN: 1063-5157            Impact factor:   15.683


From the first growth of the tree, many a limb and branch has decayed and dropped off; and these lost branches of various sizes may represent those whole orders, families, and genera which have now no living representatives, and which are known to us only from having been found in a fossil state.
Charles Darwin, On the Origin of Species. London, 1859

Most of the diversity of life that ever existed on earth has gone extinct and can only be glimpsed from the fossil record. Although the comparative approach allows the reconstruction of some morphological and genetical characteristics of ancestral species, it is only informative for species that have founded extant lineages. Yet, the information enclosed in genome sequences is abundant and particularly meaningful for the reconstruction of the descent and evolution of their carriers (Zuckerkandl and Pauling 1965; Boussau and Daubin 2010; David and Alm 2011), so much so that it may have recorded accounts of extinct lineages. This possibility exists because the success of lateral gene transfer (LGT) as an evolutionary process implies that each gene possesses its own, unique history, which is not necessarily confined to the history of those species that have survived (Maddison 1997; Galtier and Daubin 2008; Fournier et al. 2009; Abby et al. 2012). Several models have recently been developed to reconcile seemingly contradictory gene phylogenies with the species phylogeny by tracing the path on the species phylogeny along which they evolved as a result of a series of speciations, gene duplications, LGT, and losses (Tofigh 2009; Doyon et al. 2010; David and Alm 2011; Szöllősi and Daubin 2012; Szöllősi et al. 2012). None of these models, however, takes into consideration the fact that, in the presence of LGT, gene trees record evolutionary paths along the complete species tree, including extinct and unsampled branches, and not only along the phylogeny of the species in which they reside today.
This is the case because, as first noted by Maddison (1997) and later elaborated by Gogarten and coworkers (Zhaxybayeva and Gogarten 2004; Fournier et al. 2009), while LGT events imply that the donor and receiver lineages existed at the same time, the donor lineage might have subsequently become extinct, or more generally, might not have been sampled. Here, we demonstrate that, if the number of species considered in the species phylogeny is small compared with the total number of species, the overwhelming majority of gene transfers involve speciation to and evolution along extinct or unsampled species. Furthermore, we show that, if this condition is met, the evolution of genes along the unrepresented parts of the species phylogeny can to good approximation be treated as those of independently evolving lineages, the behavior of which depends only on the global parameters of the speciation dynamics. This in turn allows us to derive the probability of observing a gene phylogeny by extending the ODT model introduced previously (Szöllősi et al. 2012). Applying our model to a data set derived from 36 cyanobacterial species, we perform a preliminary assessment of the phylogenetic signal for the evolution of transferred genes along extinct species.

A Minimal Model of Speciation and Gene Birth and Death

It is not feasible to specify, much less to reconstruct, the complete phylogeny of all species that ever existed. To describe the evolution of genes outside the represented phylogeny—along lineages that have become extinct or whose descendants have not been sampled—we must resort to modeling the speciation dynamics that gave rise to the complete phylogeny. Modeling the dynamics of speciation provides a stochastic model of the evolution of unrepresented lineages that can be used to describe gene histories given knowledge of the represented phylogeny and a few global parameters. As a minimal model of speciation, here, we assume that the number of species N is constant, and that the dynamics of speciation is modeled by a continuous time Moran process (Moran 1962). That is, for each species at rate σ, a speciation occurs during which the species gives rise to two descendants and a randomly chosen species goes extinct (cf. Fig. 1a). The central assumption we make is that, of the N species existing at present (i.e., t = 0), we sample only a small fraction n ≪ N. In general, the validity of this assumption depends on the phylogenetic problem considered, but should almost always be met for major groups of bacteria and archaea, where the number of species that potentially exchange genes by LGT is inevitably much larger than the number of sampled species, even in large-scale studies (Ochman et al. 2000; Torsvik et al. 2002).
Figure 1

Gene trees are the result of the combination of speciation and gene birth and death. As a minimal description we consider: a) that for each of the N species at a rate σ, a speciation occurs, during which the species is succeeded by two descendants, and a random species suffers extinction; b) at a rate δ per gene, a gene duplicates, that is, it is succeeded by two gene copies in the same genome, at a rate τ/(N – 1) per gene per host species, a gene is transferred, resulting in one copy each in the donor and host species, and finally, with a rate λ per gene, a gene is lost. The represented phylogeny c) corresponds to the tree spanned by the n sampled species. A branch of the represented tree corresponds to a series of speciation events, but only the last of these, the speciation event that gives rise to two represented lineages (filled circles, green online) is explicitly present for internal branches as the speciation node terminating the branch. The number of unrepresented species (dashed circles) is always much larger than the number of represented species (full circles).

To describe the evolution of genes within the genomes of species, we assume genes to evolve independently according to a birth-and-death process that consists of gene duplication, transfer, and loss (Tofigh 2009; Szöllősi and Daubin 2012; Szöllősi et al. 2012). As shown in Figure 1b, a gene in the genome of any of the N species can: (i) be duplicated at rate δ; (ii) be transferred from a donor species to any of the other N – 1 possible host species at a rate τ/(N – 1); or (iii) be lost at a rate λ. Gene copies can also be born and be lost as a result of the speciation dynamics: at the species level, lineages (iv) experience speciation at a rate σ, in which case the gene is replaced by two copies in the two new species, or (v) suffer extinction at an identical rate σ.
A branch e of the represented tree S in general corresponds to a series of speciation events; however, as shown in Figure 1c, only the last of these, the speciation event that gave rise to two represented lineages, is explicitly present for internal branches as the speciation node (green online) terminating the branch.
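The five event types (i)–(v) above can be read as competing exponential clocks acting on a single gene copy. The following is a minimal sketch under that reading, not the authors' implementation: the function name `next_gene_event` is hypothetical, the values δ = τ = 0.2 and λ = 1 are the cyanobacterial estimates quoted later in the text, and σ = 2N with N = 1000 is an illustrative assumption.

```python
import random

def next_gene_event(delta, tau, lam, sigma, rng):
    """Sample (waiting time, event type) for one gene copy in one species."""
    rates = {
        "duplication": delta,   # (i) rate delta per gene
        "transfer": tau,        # (ii) tau/(N - 1) per host, summed over N - 1 hosts
        "loss": lam,            # (iii) rate lambda per gene
        "speciation": sigma,    # (iv) the host species speciates
        "extinction": sigma,    # (v) the host species goes extinct
    }
    total = sum(rates.values())
    wait = rng.expovariate(total)  # exponential waiting time to the first event
    event = rng.choices(list(rates), weights=list(rates.values()), k=1)[0]
    return wait, event

# Illustrative (assumed) values: N = 1000 species, unit-height tree, sigma = 2N.
rng = random.Random(0)
draws = [next_gene_event(0.2, 0.2, 1.0, 2000.0, rng)[1] for _ in range(1000)]
species_level = sum(e in ("speciation", "extinction") for e in draws) / len(draws)
print(species_level)
```

With rates of this magnitude, nearly every event a gene copy experiences is a speciation or extinction of its host, which anticipates the argument of the next section.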

Almost All Transfers Involve Speciation

To understand what fraction of transfers involves evolution along unrepresented species, we must compare the relative rates of direct transfers between branches of the represented phylogeny S and of indirect transfers that result in a gene returning to S after exiting it via speciation or transfer to unrepresented species. We first consider only direct transfers and indirect transfers that involve a speciation to an unrepresented species. To describe the shape of the species tree generated by the Moran process introduced above, we can use the coalescent approach. Under Kingman's coalescent, the time to the most recent common ancestor of the n sampled species is of the order of (2N/σ)(1 – 1/n) ≈ 2N/σ (Kingman 1982). This implies that the expected number of unrepresented speciation events per branch of the species tree is much larger than one, being of the order of σ × (2N/σ)/(2n – 2) ≈ N/n ≫ 1, as there are (2n – 2) branches of S. This suggests that for any pair of coexisting branches of the represented tree, a gene that descends from one of the branches and is transferred to the other is likely to have experienced a speciation event “away” from the represented phylogeny spanned by the n sampled species before being transferred back to it. To quantify the above argument, we can compare the expected number of transfers from branch f to branch e of the represented phylogeny, resulting from either a direct transfer or a more complex history involving a speciation event. Clearly, if the branches do not overlap in time, the expected number of direct transfers is zero. To consider overlapping branches, let us assume for simplicity that both e and f are terminal branches (similar results can be derived for any other pair of overlapping branches). The expected branch lengths are then E(t_e) = E(t_f) ≈ N/(σn), with overlap min(t_e, t_f) ≲ N/(σn).
Integrating over possible transfer times gives the expected number of direct transfers, T_direct [Equation (1)]. To estimate the expected number of indirect transfers that are topologically indistinguishable from the above direct transfers, we can reason backwards in time as illustrated in Figure 2: (i) the rate at which a transfer occurs from each of the (N – n) unrepresented species to branch e is τ/(N – 1), (ii) the probability of this gene lineage not coalescing back to any of the n branches of the represented tree during a time interval t is exp(−nσt/N), and (iii) the rate at which it coalesces with branch f is σ/N. Integrating over possible speciation and transfer times gives the expected number of indirect transfers, T_indirect [Equation (2)]. Equations (1) and (2) show that if the number of sampled species is small compared with the total number of species (n ≪ N), then the expected number of direct transfers is small compared with indistinguishable indirect ones (T_direct ≪ T_indirect), that is, the contribution of direct transfers to observed gene histories is negligible.
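The comparison of direct and indirect transfers can be checked numerically from the three ingredients (i)–(iii) listed above. The sketch below is an order-of-magnitude illustration, not a reproduction of the paper's Equations (1) and (2): it assumes the branch overlap equals its upper bound N/(σn), integrates the time spent outside the represented tree from zero to infinity, and uses hypothetical values N = 10,000 and n = 36. All function names are illustrative.

```python
import math

def expected_direct(tau, sigma, N, n):
    """Direct transfers f -> e: rate tau/(N - 1) integrated over the branch overlap."""
    overlap = N / (sigma * n)  # upper bound on the overlap quoted in the text
    return tau / (N - 1) * overlap

def coalescence_probability(sigma, N, n, steps=20000):
    """Midpoint-rule integral of exp(-n*sigma*s/N) * sigma/N over s >= 0."""
    a = n * sigma / N          # rate of coalescing with any represented branch
    s_max = 40.0 / a           # the integrand is negligible beyond this point
    h = s_max / steps
    return sum(math.exp(-a * (i + 0.5) * h) * sigma / N for i in range(steps)) * h

def expected_indirect(tau, sigma, N, n):
    """Indirect transfers: speciation away from f, then transfer back to e."""
    t_e = N / (sigma * n)                     # expected length of branch e
    transfer_back = (N - n) * tau / (N - 1)   # summed over all unrepresented species
    return transfer_back * t_e * (1.0 / n)    # 1/n is the closed form of the integral

# Assumed values for illustration only.
N, n, tau = 10_000, 36, 0.2
sigma = 2.0 * N
ratio = expected_indirect(tau, sigma, N, n) / expected_direct(tau, sigma, N, n)
print(ratio, (N - n) / n)
```

Under these assumptions the ratio T_indirect/T_direct equals (N – n)/n, so it grows without bound as the sampled fraction n/N shrinks, and with σ = 2N the indirect estimate itself no longer depends on N.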
Figure 2

The overwhelming majority of transfers involve evolution along unrepresented species. A direct transfer (dark gray, blue online) between two terminal branches of the represented phylogeny occurs with rate τ/(N – 1) and involves a single transfer event. An indirect transfer (light gray, red online) leaves an indistinguishable record in the gene tree topology. To count indirect transfers, we trace their history backwards in time: transfers back to the host branch on the represented tree (branch e) occur with a rate τ/(N – 1) from each of the N – n unrepresented species; of these, we are only concerned with the ones that descend from the relevant donor branch (branch f), the number of which can be calculated using the exponential coalescence probability and the rate of unrepresented speciations σ/N from the donor branch.

To compare the two types of possible indirect transfers back to S (those exiting via speciation and those exiting via transfer), we must contrast the rate σ at which gene copies exit branch f as a result of speciation and the rate τ/(N – 1)×(N – 1 – n) ≈ τ at which gene copies exit as a result of transfer. Estimates of τ, and more generally of gene birth and death rates, are available from several sources, all of which agree that the expected number of gene birth and death events per branch is below unity. Models that consider the dynamics of the number of homologous gene copies along a species phylogeny (referred to as phylogenetic profiles) (Csűrös and Miklós 2009) have consistently found that birth and death rates are of the same order, with an excess of loss compensated by origination of new families, in agreement with phenomenological models of gene family size distribution (Karev et al. 2002; Szöllősi and Daubin 2012). In a detailed study of 28 archaea, Csűrös and Miklós (2009) found that the expected number of birth events (duplication and gain) is 0.12 and that the expected number of losses is 0.36 per branch per gene.
More recently, the ODT model, which attempts to explicitly explain the evolution of multicopy gene trees (representative of complete genomes) along an ultrametric species tree, has arrived at similar results (Szöllősi et al. 2012), finding for 36 cyanobacterial genomes δ ≈ τ ≈ 0.2, λ ≈ 1, in units corresponding to a tree with unit height. Assuming, as above, that the time to the most recent common ancestor of the sampled species is of the order of 2N/σ, the expected number of gene copies (per gene) exiting a branch of S as a result of speciation is proportional to N/n, while the number exiting as a result of transfer is less than one. Since the rate at which a gene that has exited the represented phylogeny returns to S as a result of transfer at some point in the future is independent of the mode of exit from S, we can conclude that indirect transfers are dominated by paths that include a speciation. In summary, if the number of sampled species is small compared with the total number of species, transfers in observed gene histories are dominated by paths that include a speciation to an unrepresented species and subsequent transfer back to the represented tree.
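The contrast between the two modes of exit can be made concrete with a small calculation. This sketch assumes a unit-height tree (so that σ = 2N), a typical branch length of N/(σn), the cyanobacterial estimate τ ≈ 0.2 quoted above, and a hypothetical total of N = 10,000 species; the function name is illustrative.

```python
import math

def exits_per_branch(N, n, tau):
    """Expected exits of a gene copy from one branch of S, per gene."""
    sigma = 2.0 * N                      # unit-height tree: 2N/sigma = 1
    branch_length = N / (sigma * n)      # typical branch length, about 1/(2n)
    by_speciation = sigma * branch_length
    by_transfer = tau * (N - 1 - n) / (N - 1) * branch_length
    return by_speciation, by_transfer

by_spec, by_transfer = exits_per_branch(N=10_000, n=36, tau=0.2)
print(by_spec, by_transfer)
```

Under these assumptions the first number equals N/n, while the second stays far below one for any N much larger than n.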

The Probability of Observing a Gene Tree

Reconciling gene trees with the species tree requires iterating over possible paths along which a gene tree may have been generated by a series of speciations, duplications, transfers, and losses (Fig. 3). In existing methods (Tofigh 2009; Doyon et al. 2010; Szöllősi and Daubin 2012; Szöllősi et al. 2012), this is accomplished by only considering paths along the represented phylogeny and using a dynamic programming approach exploiting the independence of gene birth and death events and by extension gene lineages.
Figure 3

Reconciling gene trees with the complete phylogeny. a) An evolutionary scenario that involves a transfer event from an unrepresented species. The represented phylogeny is shown as a solid tube with filled circles (green online) corresponding to represented speciations. The unrepresented phylogeny is indicated by dashed tubes, with white circles corresponding to unrepresented speciations (cf. Fig. 1c). The continuous line traces the gene tree spanned by genes in sampled species that is the result of a series of birth and death events along the complete phylogeny; b) a reconciliation of the gene phylogeny from (a), corresponding to the evolutionary scenario depicted in (a). In general, we do not know the evolutionary scenario that has generated the gene phylogeny. However, we can use the dynamic programming algorithm described in the text to calculate the likelihood of the gene tree by summing over all possible reconciliations, that is, all ways to draw the gene tree into the species tree using speciation, duplication, transfer, and loss events [cf. Eqs. (4)–(7) and Fig. A1 in the Appendix]. The likelihood calculation uses the rates of the different events (σ, δ, τ, and λ) together with functions describing the extinction (E and Ē) and the propagation (G and Ḡ) of gene lineages [cf. Eqs. (A.1)–(A.4)].

Figure A1

Diagrams corresponding to reconciliation events. Each diagram corresponds to a term in Equations (4)–(7), with diagrams following each other in the same order as terms in the indicated equation. a) Depicts events that start with a gene lineage u in represented branch e of S at time t + Δt; b) events that start with a gene lineage u in an unrepresented species at time t + Δt; and finally, c) corresponds to represented speciation events in S. To illustrate the correspondence between terms and equations, consider the third diagram in the top row (a) depicting an unrepresented speciation and the corresponding (third) term in Equation (4). This term describes the probability that gene lineage u seen at time t + Δt is succeeded as a result of an unrepresented speciation by two gene lineages (v and w), one of which (w) is present in the same branch e as u while the other (v) resides in an unrepresented species.

Although gene duplication, transfer, and loss can reasonably be modeled as independent birth and death events, speciation and extinction necessarily involve the simultaneous birth and death of many genes. Along the represented phylogeny, speciation events are fully specified and can be explicitly taken into account (Szöllősi et al. 2012). This is not the case, however, for speciation and extinction events that occur in the unrepresented part of the phylogeny, or that do not correspond to speciation nodes of the represented phylogeny. Therefore, unrepresented speciations result in nonindependence of gene lineages. Consider for instance the probability Ē_k(t) that k genes present at time t in a species not ancestral to the sample of n extant species leave no observed descendant. Conditional on the complete phylogeny ϕ, including all extinct species lineages, gene lineages are independent, and therefore Ē_k(t|ϕ) = {Ē(t|ϕ)}^k. Averaging over all complete phylogenies compatible with the phylogeny reconstructed based on the n species, however, results in 〈Ē_k(t|ϕ)〉 = 〈{Ē(t|ϕ)}^k〉 ≠ 〈Ē(t|ϕ)〉^k, which is not a product of k independent factors. On the other hand, n ≪ N implies that Ē(t) ≈ 1. Introducing the notation Ē(t|ϕ) = 1 – ϵ(t|ϕ) and Ē(t) = 〈Ē(t|ϕ)〉 = 1 – ϵ(t), and neglecting second- and higher-order terms in ϵ(t|ϕ) and ϵ(t), we have
Ē_k(t) = 〈{1 – ϵ(t|ϕ)}^k〉 ≈ 1 – k〈ϵ(t|ϕ)〉 = 1 – kϵ(t) ≈ {1 – ϵ(t)}^k = {Ē(t)}^k. (3)
A similar argument can be derived for the k-gene propagator (see the Appendix). Therefore, if n ≪ N, then to good approximation, the evolution of two genes observed in the same unrepresented species can be treated as independent without specifying the full phylogeny. Under the above assumption that unrepresented speciation and extinction events can be considered in a genewise independent manner, we can describe the evolution of gene copies that appear as single gene lineages when observed from the present.
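The independence approximation above can be illustrated numerically by comparing the average of the k-th power with the k-th power of the average. The sketch below uses made-up values of the survival probability ϵ(t|ϕ) for a handful of hypothetical complete phylogenies; the function name and the sample values are assumptions for illustration only.

```python
def k_gene_extinction(eps_samples, k):
    """Return (average of (1 - eps)^k, (1 - average eps)^k) over sampled phylogenies."""
    mean_eps = sum(eps_samples) / len(eps_samples)
    exact = sum((1.0 - e) ** k for e in eps_samples) / len(eps_samples)
    independent = (1.0 - mean_eps) ** k   # product of k independent factors
    return exact, independent

# Small eps, as expected when n << N: the two quantities nearly coincide.
small = k_gene_extinction([0.0, 0.001, 0.002, 0.003], k=3)
# Large, highly variable eps: independence across genes clearly fails.
large = k_gene_extinction([0.0, 1.0], k=2)
print(small, large)
```

The gap between the two quantities is second order in ϵ, which is why the approximation holds when almost no gene in an unrepresented species leaves an observed descendant, and fails when ϵ is large and varies strongly between phylogenies.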
We can calculate: (i) the extinction probability E_e(t) that a gene seen at time t on branch e of S leaves no observed descendant, that is, no descendant exists at time t = 0 in the genome of any of the n sampled species; (ii) the extinction probability Ē(t) that a gene seen at time t in an unrepresented species leaves no observed descendant; (iii) the single-gene propagation probability G_e(s,t) that all observed descendants of a gene seen at time s on branch e descend from a descendant seen at a later time t < s on branch e; and (iv) the probability Ḡ(s,t) that all observed descendants of a gene seen at time s in an unrepresented species descend from a descendant seen at time t < s in an unrepresented species. Each of the above functions can be obtained from differential equations describing evolution backwards in time by considering the set of possible events that change the relevant probability. These can be derived analogously to Tofigh (2009), Stadler (2011), and Szöllősi et al. (2012) and can be found in the Appendix. Given a rooted gene tree topology G, we can now calculate the probability p(G|S,ℳ) of observing G, where ℳ denotes the parameters of the model, by summing over all possible paths along S and over all complete phylogenies compatible with the species tree spanning the n species of the sample. We can sum over all paths by recursively mapping the branches of G onto branches of S, generalizing the ODT model's algorithm (Szöllősi et al. 2012) to include evolution along unrepresented species (cf. Figs. 3 and A1 in the Appendix). A branch of G represents the evolution of a gene copy for which (i) if the branch is nonterminal, all observed descendants descend from one of the two daughter gene lineages that emerge from the gene tree node in which the branch terminates or (ii) if the branch is terminal, a gene is observed in one of the genomes mapping to a leaf of S.
To describe possible paths along S that this gene copy may take before arriving at the gene tree node in which it terminates, we must consider five events: (i) single-copy evolution along branch e of S described by G_e, (ii) single-copy evolution outside S described by Ḡ, (iii) speciation from a branch of S to an unrepresented species such that only descendants of this copy are observed, (iv) transfer such that only descendants of the transferred copy are observed, and (v) speciation represented in S such that only one of the descending copies leaves an observed descendant. Each of these events leads to a single gene copy with observed descendants. The gene tree node in which the branch terminates can correspond to four possible events: (i) a duplication; (ii) a speciation represented in S; (iii) a speciation not represented in S; or (iv) a transfer. Each of these events leads to two gene copies with observed descendants. To derive the recursion expressing the probability of G as the sum over possible paths along S, we discretize time along S, keeping track of the speciation times t_i along S. Speciations represented in S define the time intervals [0, t_1), …, [t_i, t_{i+1}), …, referred to as time slices (Tofigh 2009; Doyon et al. 2010), with indices 0, …, i, …, n. We further divide each time slice into D equal time intervals of height Δt = (t_{i+1} – t_i)/D. The probability of the gene lineage leading to node u of G being seen on branch e of S at time t + Δt, given the probabilities at time t, is expressed by Equation (4), which also involves the probability of the gene lineage leading to node u of G being seen in an unrepresented species at time t; here v and w descend from u in G.
As shown in Figure A1a in the Appendix, the terms of Equation (4) correspond to (i) no event with an observed descendant; (ii) birth of two gene lineages by duplication, such that both leave observed descendants; (iii) and (iv) birth of two gene lineages with observed descendants as a result of an unrepresented speciation; and finally, (v) unrepresented speciation followed by the loss of the copy in branch e such that only the copy in the unrepresented phylogeny leaves an observed descendant. In the above expression, we only consider indirect transfers that involve a speciation; see the Appendix for the full expression. The probability of being seen in an unrepresented species is expressed by Equation (5), in which ℰ_i(S) denotes the set of branches of S in time slice i. As shown in Figure A1b, the terms of Equation (5) correspond to (i) no event with an observed descendant; (ii) birth of two gene lineages by speciation, duplication, or transfer, such that both leave observed descendants; (iii) and (iv) birth of two gene lineages with observed descendants as a result of transfer back to the represented phylogeny; and finally, (v) transfer back to the represented phylogeny following which the copy in the unrepresented donor lineage does not leave an observed descendant. Terms involving gene lineages v and w are zero in both of the above expressions if u is a leaf of G. At speciation times t = t_i, where branches f and g descend from e in S, a represented speciation takes place that may be followed by a loss, as expressed by Equation (6). Its terms (cf. Fig. A1c) correspond to (i) and (ii) represented speciation such that both resulting gene lineages lead to observed descendants; and (iii) and (iv) represented speciation such that only one of them does. Finally, at time t = 0 on each terminal branch e of S, the presence of observed genes is expressed by Equation (7). As illustrated in Figures 3b and A1, each term in Equations (4)–(7) above corresponds to a series of speciation, duplication, and transfer events that recursively draw the gene phylogeny into the species tree.
The recursion calculates the probability of a gene tree with m genes in O(Dn²m) steps, as there are fewer than n branches in each time slice and n time slices. Summing over roots of G can be accomplished with identical complexity using double recursion. The most likely reconciliation can be recovered by tracing back along the sum, choosing at each step the event with the highest probability. Calculating the probability of a gene tree requires knowledge of the ultrametric species tree S, with branch lengths corresponding to time, the rate of duplication δ, transfer τ, and loss λ, as well as the parameters of the speciation dynamics, the species replacement rate σ, and the total number of species N. The number of parameters is reduced if we assume the time to the common ancestor of the sampled species to correspond to its expected value under the speciation dynamics. Choosing units such that S is of unit height, this corresponds to the choice σ = 2N. Furthermore, under the present choice of parameters and time scale, the probability of a gene tree and its maximum-likelihood reconciliation depend only very weakly on N, as long as the condition n ≪ N is satisfied. This is the case because the expected number of transfers between branches of S is nearly independent of N. In particular, if we assume that a gene lineage returns at most once to S, we arrive at the result derived in Equation (2), according to which the number of transfers is independent of N.
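The time discretization and the step count quoted above can be sketched as follows. This is an illustrative helper, not the published implementation: the names `time_grid` and `step_count` and the example speciation times are hypothetical, and the grid covers only the slices bounded by the supplied speciation times.

```python
def time_grid(speciation_times, D):
    """Return (slice index, time) pairs: each time slice is cut into D equal steps."""
    bounds = [0.0] + sorted(speciation_times)
    grid = []
    for i in range(len(bounds) - 1):
        dt = (bounds[i + 1] - bounds[i]) / D   # step height within slice i
        for d in range(D):
            grid.append((i, bounds[i] + d * dt))
    return grid

def step_count(D, n, m):
    """Order-of-magnitude work: D steps x n slices x n branches x m gene tree nodes."""
    return D * n * n * m

# Toy ultrametric tree with represented speciations at heights 0.5 and 1.0.
grid = time_grid([0.5, 1.0], D=4)
print(grid)
print(step_count(D=4, n=36, m=36))
```

Increasing D refines the resolution at which unrepresented speciations and transfers can be placed in time, at a cost that grows only linearly in D.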

Routes to Cyanobacterial Genomes

To carry out a preliminary analysis of the signal for evolution outside the represented phylogeny in real data, we considered a set of 473 single-copy gene families present in the genome of at least 34 of 36 cyanobacteria and use the dated species tree reconstructed in Szöllősi et al. (2012). We choose single-copy near-universal gene families as they are expected to be (i) relatively slowly evolving and hence to harbor a strong signal of homology and yield high-quality alignments, and (ii) they can be assumed to be well described by a single set of uniform duplication, transfer, and loss rates, at least in contrast to more complex data sets composed of multicopy families. For each family, gene tree topologies and duplication, transfer, and loss rates that maximize the joint likelihood (Maddison 1997; Szöllősi and Daubin 2012) were inferred as described in the Appendix. Using these results, 1000 reconciliations per family were sampled by stochastic backtracking along the sum over reconciliations. On average, we found 0 duplication, 2.15 transfers, and 2.56 losses per family. The distribution in time of transfer events and the preceding speciations to unrepresented species are shown in Figure 4a. The majority of transfers occur between branches of S that overlap in time, hence the resulting gene tree carries no topological signature of the length of time spent evolving along unrepresented lineages. Transfers between branches that do not overlap in time, for which the gene tree topologies explicitly record evolution outside the represented tree, correspond to 27.8% of all transfers. About a fifth of these (5.9% of all transfers) branch above the root indicating transfer from outside the sampled diversity of cyanobacteria. The median interval of time spent evolving in unrepresented lineages is 0.083 (or 222 million years, henceforth myr) for transfers between overlapping branches and 0.39 (or 1000 myr) for transfers between nonoverlapping branches. 
Similar values are obtained if we consider only the maximum-likelihood reconciliations, except for the median interval of time spent evolving in unrepresented lineages for transfers between overlapping branches, which is only 0.0028 (or 8.1 myr, corresponding to the minimum length allowed by time discretization). The corresponding value for transfers between nonoverlapping branches, 0.36 (or 990 myr), is nearly identical to the value above.
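The difference between the maximum-likelihood reconciliation and reconciliations sampled by stochastic backtracking comes down to how a single term is chosen at each step of the traceback. The sketch below shows one such step under the assumption that the contributions of the competing events to the sum are already available; the function name, the event labels, and the numbers are hypothetical.

```python
import random

def choose_term(terms, rng=None):
    """Pick one event at a traceback step.

    terms maps an event label to its contribution to the summed probability.
    Without rng: maximum-likelihood traceback (largest contribution).
    With rng: stochastic backtracking (probability proportional to contribution).
    """
    if rng is None:
        return max(terms, key=terms.get)
    events = list(terms)
    return rng.choices(events, weights=[terms[e] for e in events], k=1)[0]

# Hypothetical contributions at one step of the recursion.
terms = {"no_event": 0.6, "unrepresented_speciation": 0.3, "duplication": 0.1}
best = choose_term(terms)
sampled = [choose_term(terms, random.Random(seed)) for seed in range(5)]
print(best, sampled)
```

Repeating the stochastic variant from the root to the leaves yields one sampled reconciliation; drawing many of them gives a sample in which each reconciliation appears in proportion to its probability, whereas the maximum-likelihood traceback always returns the same single scenario.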
Figure 4

LGT events for 36 cyanobacteria. For 473 near universal single-copy families from 36 cyanobacterial genomes gene trees that maximize the joint likelihood were reconstructed. For the trees obtained 1000 reconciliations were sampled. a) The distribution of transfer events (light bars, green online) and the preceding speciation events (dark bars, blue online). The final bin summarizes all events occurring above the root of S. b) The distribution of the time spent by transferred genes evolving along unrepresented species for transfers between overlapping branches (dark bars, red online, 72.2% of transfers) and transfers between nonoverlapping branches (light bars, yellow online, 27.8% of all transfers). Both sets of bins sum to unity. Time units are chosen such that the height of the root of S is 1.0. The age of the root falls in the 3500–2700 My interval (Falcón et al. 2010; Szöllősi et al. 2012). Data are available from Dryad under doi:10.5061/dryad.27d0g.

We emphasize that an important caveat of these results is that the accuracy of our method to infer correct reconciliations and gene topologies has not been assessed. This could be accomplished by explicit simulations of gene family evolution along the complete phylogeny. Such simulations are, however, outside the scope of the current publication, as they are technically challenging due to the large number of species in the complete phylogeny, and because they must address a potentially long list of possible questions. In lieu of simulation, it is possible to examine the posterior support of individual transfer events, which, as described in the Appendix, can be calculated as the fraction of times we find a given transfer event among the sampled reconciliations for each family. Using this measure, we find that transfers are well supported with 66.8% of transfer events having support over 0.95. 
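The support measure described above, the fraction of sampled reconciliations in which a given transfer event appears, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure of the Appendix: the function name and the encoding of an event as a hashable tuple `(gene_tree_branch, donor, recipient)` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def transfer_support(
    sampled_reconciliations: Sequence[Iterable[Hashable]],
) -> Dict[Hashable, float]:
    """Fraction of sampled reconciliations that contain each transfer event.

    Each element of `sampled_reconciliations` is the collection of transfer
    events found in one sampled reconciliation of a single gene family.
    """
    n_samples = len(sampled_reconciliations)
    if n_samples == 0:
        raise ValueError("at least one sampled reconciliation is required")
    counts: Counter = Counter()
    for events in sampled_reconciliations:
        # Count an event at most once per sampled reconciliation.
        counts.update(set(events))
    return {event: count / n_samples for event, count in counts.items()}


# Toy example with four samples (illustrative, not data from the paper).
# Events are hypothetical (gene_tree_branch, donor, recipient) tuples.
samples = [
    {("g1", "A", "B")},
    {("g1", "A", "B"), ("g2", "C", "D")},
    {("g1", "A", "B")},
    set(),
]
support = transfer_support(samples)
print(support)
```

With 1000 sampled reconciliations per family, as used here, an event with support over 0.95 is one that appears in more than 950 of the samples.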
It is also important to discuss to what extent we can expect observed transfers between nonoverlapping branches to be robust to increasing the number of sampled species. Consider the extreme case that all N extant species are sampled. It is clear that transfers between overlapping branches of S (red in Fig. 4b) may correspond to transfer between nonoverlapping branches of the full phylogeny spanned by all N extant species. To ascertain how often we expect the opposite to occur, to have a transfer between nonoverlapping branches of S correspond to transfers between overlapping branches of the full phylogeny spanned by all N extant species, we need to estimate how often we expect to sample an extant descendant of the unrepresented donor lineage involved in a transfer between nonoverlapping branches of S (light bars, yellow online, in Fig. 4b). Assuming a tree with unit height, the total branch length of the full phylogeny under Kingman's coalescent is of the order of log(N), while the total branch length including extinct species is of the order N. Thus, we expect that only a vanishing fraction of the order log(N)/N of donor lineages have left extant descendants. This implies that not only do most transfers involve speciation to, and evolution along branches of the complete phylogeny, but also the majority of these donor lineages have gone extinct. Consequently, most transfers between nonoverlapping branches of S correspond to transfers between nonoverlapping branches of the full phylogeny where the donor lineage has gone extinct. In summary, we find that nearly a third (27.8%) of transfers evolve on average a billion years along lineages unrepresented in the phylogeny—most often, in fact, along extinct lineages, and only a moderate fraction of transfers originates from outside the cyanobacteria. 
Furthermore, both of these estimates are conservative, as increasing the number of sampled species is expected to lead to an increase in the ratio of transfers between nonoverlapping branches, and to a decrease in the fraction of transfers from outside of cyanobacteria. The first of the above results, however, applies only to transfers between branches of S, that is, transfers observed for the n = 36 cyanobacteria considered. For the complete set of transfers between branches of the full phylogeny, the fraction of transfers evolving along extinct lineages is potentially different; for example, a macroscopic fraction of transfers is expected to correspond to direct transfers between its branches.
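The scaling argument above, that only a fraction of order log(N)/N of donor lineages leave extant descendants, can be checked numerically with a short Python sketch. It uses the standard expected waiting times of Kingman's coalescent, E[T_k] = 2/(k(k − 1)) while k lineages remain (the choice of time unit cancels in the ratio), and compares the expected total branch length per unit of expected tree height with log(N). The function names are hypothetical; the ratio of expectations is used as a stand-in for a tree rescaled to unit height, and the order-N length of the phylogeny including extinct species is taken from the text rather than derived.

```python
import math


def kingman_length_per_unit_height(n_species: int) -> float:
    """Expected total branch length divided by expected height (Kingman).

    While k lineages remain, the expected waiting time is 2 / (k (k - 1)),
    and k branches accumulate length during that time.
    """
    if n_species < 2:
        raise ValueError("need at least two species")
    waits = [2.0 / (k * (k - 1)) for k in range(2, n_species + 1)]
    total_length = sum(k * t for k, t in zip(range(2, n_species + 1), waits))
    height = sum(waits)
    return total_length / height


def surviving_donor_fraction_order(n_species: int) -> float:
    """Order-of-magnitude estimate log(N) / N from the text (not exact)."""
    return math.log(n_species) / n_species


for n in (10**2, 10**4, 10**6):
    print(
        n,
        round(kingman_length_per_unit_height(n), 2),
        round(math.log(n), 2),
        surviving_donor_fraction_order(n),
    )
```

The length per unit height tracks log(N) up to an additive constant, so for large N the estimated fraction of donor lineages with extant descendants becomes very small, consistent with the argument that most donor lineages have gone extinct.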

Discussion

The results developed above are conditional on two crucial assumptions: (i) that the number of sampled species is small compared with the total number of species, and (ii) that the evolution of gene lineages can be treated as independent, both in the represented and in the unrepresented part of the phylogeny. As we argue above, if genes are duplicated, transferred, and lost independently, the former assumption (i.e., n ≪ N) implies that the evolution of genes outside the represented phylogeny can also be treated as independent, even if the complete phylogeny is not specified. We also make the assumptions that (iii) transfer occurs at an identical rate between any two species and (iv) the time to the last common ancestor of the sampled species corresponds to its expected value under the speciation dynamics. These two assumptions serve to simplify the development of the above arguments and can be relaxed without affecting our conclusion that the majority of transfers involve evolution along extinct or unsampled species. Relaxing Assumption (iv) is straightforward. Concerning Assumption (iii), if, for example, transfer occurs preferentially between species that are more closely related (Andam and Gogarten 2011), the scenarios shown in Figure 2 are affected to an identical extent because the last common ancestor of branch e and either branch f (the donor lineage for dark gray paths, blue online) or any extinct species that descends from an unrepresented speciation along f (a donor lineage along light gray paths, red online) is the same. Conversely, there are known cases, for example, the transfer of thermostable enzymes from thermophilic archaea to thermophilic bacteria (Nelson et al. 1999; Nesbo et al. 2001; Brochier-Armanet and Forterre 2007), of preferential transfer between distantly related taxa due to shared ecology. 
In this second case, we expect preferential transfer from phylogenetically distant taxa to lead to an excess of transfers descending from above the root of the sampled species, for which topologically equivalent direct transfers do not exist. On more practical grounds, however, relaxing the assumption of homogeneous rates of transfer between lineages might seriously complicate the computation of the likelihood, as it would require modeling the distribution of the rates of transfers from and to unrepresented lineages. More importantly, as long as these conditions are met, it is possible to extend the above results to more general models of speciation. Modeling variation in N, the total number of species, over geological time could be of particular interest. Indeed, a corollary of the observation that LGT events record evolutionary paths along the complete species tree is that the phylogenies of genes from a limited sample of extant species carry information about extinct lineages, and therefore about the size and dynamics of ancient biodiversity. In fact, patterns of gene transfer may be even more informative about past biodiversity than the species tree itself. Drawing an analogy with population genetics, inferring biodiversity dynamics based on species trees (Nee 2001; Morlon et al. 2010; Stadler 2011) is similar to inferring past demography based on single-locus data. Single-locus inference is limited by the intrinsic stochasticity of Kingman's coalescent, in particular in the deep part of the genealogy. LGTs, on the other hand, are analogous to multiple loci (Heled and Drummond 2008), and as such, have the potential to increase the statistical power for inferring past biodiversity.

Supplementary Material

Data files related to this paper have been deposited at Dryad under doi:10.5061/dryad.27d0g.

Funding

This work was supported by the Marie Curie Fellowship 253642 “Geneforest” (to G.J.Sz.); the Institut National de Physique Nucléaire et de Physique des Particules (IN2P3) computing centre; and the French Agence Nationale de la Recherche (ANR) through grant [ANR-10-BINF-01-01] “Ancestrome”.
  28 in total

Review 1.  Lateral gene transfer and the nature of bacterial innovation.

Authors:  H Ochman; J G Lawrence; E A Groisman
Journal:  Nature       Date:  2000-05-18       Impact factor: 49.962

2.  Inferring speciation rates from phylogenies.

Authors:  S Nee
Journal:  Evolution       Date:  2001-04       Impact factor: 3.694

3.  Prokaryotic diversity--magnitude, dynamics, and controlling factors.

Authors:  Vigdis Torsvik; Lise Øvreås; Tron Frede Thingstad
Journal:  Science       Date:  2002-05-10       Impact factor: 47.728

4.  MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors:  Robert C Edgar
Journal:  Nucleic Acids Res       Date:  2004-03-19       Impact factor: 16.971

5.  Cladogenesis, coalescence and the evolution of the three domains of life.

Authors:  Olga Zhaxybayeva; J Peter Gogarten
Journal:  Trends Genet       Date:  2004-04       Impact factor: 11.639

6.  Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations.

Authors:  Gergely J Szöllosi; Bastien Boussau; Sophie S Abby; Eric Tannier; Vincent Daubin
Journal:  Proc Natl Acad Sci U S A       Date:  2012-10-04       Impact factor: 11.205

7.  Evolutionary trees from DNA sequences: a maximum likelihood approach.

Authors:  J Felsenstein
Journal:  J Mol Evol       Date:  1981       Impact factor: 2.395

8.  Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima.

Authors:  K E Nelson; R A Clayton; S R Gill; M L Gwinn; R J Dodson; D H Haft; E K Hickey; J D Peterson; W C Nelson; K A Ketchum; L McDonald; T R Utterback; J A Malek; K D Linher; M M Garrett; A M Stewart; M D Cotton; M S Pratt; C A Phillips; D Richardson; J Heidelberg; G G Sutton; R D Fleischmann; J A Eisen; O White; S L Salzberg; H O Smith; J C Venter; C M Fraser
Journal:  Nature       Date:  1999-05-27       Impact factor: 49.962

9.  Molecules as documents of evolutionary history.

Authors:  E Zuckerkandl; L Pauling
Journal:  J Theor Biol       Date:  1965-03       Impact factor: 2.691

10.  Birth and death of protein domains: a simple model of evolution explains power law behavior.

Authors:  Georgy P Karev; Yuri I Wolf; Andrey Y Rzhetsky; Faina S Berezovskaya; Eugene V Koonin
Journal:  BMC Evol Biol       Date:  2002-10-14       Impact factor: 3.260

