
Xenolog classification.

Charlotte A Darby, Maureen Stolzer, Patrick J Ropp, Daniel Barker, Dannie Durand.

Abstract

Motivation: Orthology analysis is a fundamental tool in comparative genomics. Sophisticated methods have been developed to distinguish between orthologs and paralogs and to classify paralogs into subtypes depending on the duplication mechanism and timing, relative to speciation. However, no comparable framework exists for xenologs: gene pairs whose history, since their divergence, includes a horizontal transfer. Further, the diversity of gene pairs that meet this broad definition calls for classification of xenologs with similar properties into subtypes.
Results: We present a xenolog classification that uses phylogenetic reconciliation to assign each pair of genes to a class based on the event responsible for their divergence and the historical association between genes and species. Our classes distinguish between genes related through transfer alone and genes related through duplication and transfer. Further, they separate closely-related genes in distantly-related species from distantly-related genes in closely-related species. We present formal rules that assign gene pairs to specific xenolog classes, given a reconciled gene tree with an arbitrary number of duplications and transfers. These xenology classification rules have been implemented in software and tested on a collection of ∼13 000 prokaryotic gene families. In addition, we present a case study demonstrating the connection between xenolog classification and gene function prediction.
Availability and Implementation: The xenolog classification rules have been implemented in Notung 2.9, a freely available phylogenetic reconciliation software package (http://www.cs.cmu.edu/~durand/Notung). Gene trees are available at http://dx.doi.org/10.7488/ds/1503.
Contact: durand@cmu.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2017        PMID: 27998934      PMCID: PMC5860392          DOI: 10.1093/bioinformatics/btw686

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Homology analysis, classifying gene pairs according to the evolutionary process by which they diverged, is a fundamental tool of comparative genomics. Identifying orthologs is integral to the functional annotation of novel genes (Wu ) and prediction of gene function by various methods, including phylogenetic profiling (Pellegrini ) and gene fusion (Enright ; Marcotte ). Phylostratigraphic investigations linking the age of a gene to its functions, disease associations, or ecological distribution exploit the fact that orthologs from the same pair of species diverged at roughly the same time (Capra ). Orthologs are used as markers for homologous chromosomal regions for comparative mapping (Nadeau and Sankoff, 1998; O’Brien ), phylogenetic footprinting (Dickmeis and Muller, 2005; Duret and Bucher, 1997) and operon prediction (Chen ; Ermolaeva ; Price ; Westover ). Identification of paralogs is a prerequisite for studying processes of gene duplication, the major source of genetic novelty in eukaryotes. Comparison of paralogous pairs with a pre-duplication ortholog reveals patterns and rates of diversification following duplication (Lynch, 2007, and work cited therein), as well as the functional fates of duplicated genes (Lynch, 2007). Spatial patterns of orthologs and paralogs are used to infer the specific duplication process that gave rise to a given set of paralogs (Durand and Hoberman, 2006; Simillion ; Van de Peer, 2004). Homology identification is a highly active research area, comprising methodological approaches ranging from sequence comparison to phylogenetic reconciliation. More recent innovations include the exploitation of shared synteny (Shi ) and specialized methods for identifying multidomain homologs (Ali ; Song , 2008). Most work on homology analysis to date has not considered genes related through horizontal transfer. 
Studies of horizontal transfer commonly use approaches that seek to identify genes of foreign origin in a given genome, rather than homologous gene pairs that are related through horizontal transfer (reviewed in Azad and Lawrence, 2012). A few methods, such as gene tree–species tree reconciliation, do infer gene pairs that correspond to the donor and recipient of a transfer. Reconciliation algorithms that account for transfer events are relatively new (reviewed by Huson and Scornavacca, 2011; Nakhleh, 2010, 2013), computationally more complex, and are only recently coming into use for genomic analyses (e.g. David and Alm, 2011; Richards ). Appropriate terminology for describing gene pairs related through horizontal transfer is a fundamental requirement for extending the homology analysis framework to include this evolutionary process. The term ‘xenolog’, proposed by Gray and Fitch (1983) to describe horizontally transferred genes, is in use, but not widely, and there is no consensus on a precise definition. Further, there has been little discussion of differentiating between xenologs to convey distinctions between horizontally transferred genes with different properties (see Koonin , for a notable exception). Such xenolog classes would be analogous to paralog subtypes proposed to convey the relative timing of duplications and speciations (e.g. in-paralogs versus out-paralogs, Sonnhammer and Koonin, 2002) or distinguish between different mechanisms of duplication (e.g. ohnologs and tandem duplications, reviewed by Durand and Hoberman, 2006; Ramos and Ferrier, 2012). Background: Fitch (1970) introduced the terms orthology (‘homology [that] is the result of speciation’) and paralogy (‘homology [that] is the result of gene duplication’) and proposed that ‘foreign genes … since they are neither orthologous nor paralogous but are clearly homologous … should be called xenologous’ (Gray and Fitch, 1983). 
These definitions, which are framed in terms of the event that caused the divergence, have been widely adopted. In 2000, Fitch proposed more precise definitions of orthology and xenology: Orthology includes the requirement that the ‘common ancestor lies in the cenancestor of the taxa from which the two sequences were obtained,’ where a cenancestor is the ‘most recent common ancestor of the [species] taxa under consideration,’ and xenology is the ‘relationship of any two homologous characters whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters.’ In other words, a pair of genes, and , are xenologs, if there is a transfer on the path connecting and in the gene tree. In this updated definition, orthology is defined not just in terms of a speciation event, but in terms of the association of nodes in the gene and species trees. Under a duplication-loss event model, the event-based definition of orthology and this definition are equivalent. However, when transfers are included in the event model, the sets of orthologs predicted using the two definitions are not identical. Moreover, the event-based definition leads to predicted orthologs that have properties that are not usually associated with orthologs. For example, nodes g and in Figure 1 are orthologs according to the event-based definition, because the event at their most recent common ancestor (g4) is a speciation. Yet g and are genes in the same present-day species X, violating the assumption that genes in the same species cannot be orthologous. Gene pairs in species X and Z also exhibit surprising behavior according to the event-based definition. The most recent common ancestor (MRCA) of g and g is a speciation node, as is the most recent common ancestor of and g, implying that both pairs are orthologs in the species X and Z. 
However, these pairs arose at very different times in the species tree, violating the assumption that orthologs drawn from the same pair of species are associated with the same species divergence and are roughly the same age (Capra ; Goodman ). Neither of these problems arises when the cenancestor-based definition is used, because neither g nor g are orthologs of according to that definition. In both cases, the MRCA of the genes does not lie in their cenancestor.
Fig. 1.

Gene tree (thin black lines), with a duplication and a transfer from species Y to species X, embedded in a species tree (shown in gray). The cenancestor of the transfer is designated as a. Species sets D, R and O are labeled below the leaves.

The additional cenancestor requirement results in a restricted set of orthologs that excludes these problematic cases. However, a consequence of defining orthologs narrowly is that xenologs are defined broadly: the set of gene pairs whose history, since their divergence, includes a transfer is substantially larger than the set of genes that diverged through a transfer event at their MRCA. Xenologs, when broadly defined, exhibit diverse properties. First, not all xenologs have the same event at their MRCA in the gene tree. We observe xenologs where this divergence arose via transfer (e.g. and g), speciation (e.g. and g) and duplication (e.g. and h). Second, xenologs can occur in the same species (e.g. and g). Third, xenologs may vary greatly in how closely they are related, and the divergence of a pair of xenologs may pre- or post-date the divergence of their associated species. For example, genes and g diverged more recently than species X and species Z, whereas genes and g diverged before species X and species W.
Our contributions: This broad definition of xenologs does not convey important distinctions between the diverse and complex xenologous relationships that arise due to horizontal gene transfer. To address this, we propose xenolog classes that reflect the events associated with the divergence of a xenologous gene pair, and the relative timing of transfer and speciation events. We present formal definitions of these classes in the context of a reconciled gene tree and rules to assign xenologous gene pairs to classes. Further, we show that these classes form a hierarchy, connecting the relationship of xenologs to their placement in the gene and species trees. 
An algorithm implementing these rules has been integrated into the Notung 2.9 software package. An analysis of ∼13 000 prokaryotic gene families demonstrates that all of the proposed classes arise in real gene tree data. We further present a case study that illustrates the potential functional implications of xenolog classification. Finally, we discuss how this framework could be used in future research to explore the evolutionary and functional fates of transferred genes. Notation: Before stating formal definitions of the xenolog subtypes, we introduce the following notation. For a binary, rooted tree with node set V and edge set E, designates the leaf set of T. denotes vertices in set V that are not in set U, where . p(v) refers to the parent of node v. If v is an ancestor (resp., descendant) of u in T, we write (resp., ). The set represents the improper descendants of node u, i.e. u and all nodes in the subtree rooted at u. If and , then we say that u and v are incomparable (denoted ). The most recent common ancestor of u and v is denoted . Given , we say that v1 is more closely related to v2 than to v3, if .
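The notation above corresponds to a handful of tree primitives. The following is a minimal Python sketch, assuming a rooted tree stored as a child-to-parent dictionary. The representation, the function names and the toy tree are illustrative assumptions and are not taken from Notung or from the paper. In particular, `closer` encodes one reading of "more closely related": the MRCA of the closer pair is a proper descendant of the MRCA of the more distant pair.

```python
from typing import Dict, List, Optional

# A rooted tree stored as child -> parent; the root maps to None.
# This representation and all names below are illustrative assumptions.
Tree = Dict[str, Optional[str]]


def path_to_root(tree: Tree, v: str) -> List[str]:
    """Return v followed by its ancestors, ending at the root."""
    path = []
    node: Optional[str] = v
    while node is not None:
        path.append(node)
        node = tree[node]
    return path


def is_ancestor(tree: Tree, u: str, v: str) -> bool:
    """True if u is a proper ancestor of v."""
    return u != v and u in path_to_root(tree, v)


def incomparable(tree: Tree, u: str, v: str) -> bool:
    """True if u and v differ and neither is an ancestor of the other."""
    return u != v and not is_ancestor(tree, u, v) and not is_ancestor(tree, v, u)


def mrca(tree: Tree, u: str, v: str) -> str:
    """Most recent common ancestor of u and v."""
    above_u = set(path_to_root(tree, u))
    for w in path_to_root(tree, v):
        if w in above_u:
            return w
    raise ValueError("nodes are not in the same tree")


def leaves(tree: Tree) -> List[str]:
    """Leaf set: nodes that are not the parent of any node."""
    internal = {p for p in tree.values() if p is not None}
    return sorted(v for v in tree if v not in internal)


def closer(tree: Tree, v1: str, v2: str, v3: str) -> bool:
    """True if v1 is more closely related to v2 than to v3, read here as:
    mrca(v1, v2) is a proper descendant of mrca(v1, v3)."""
    return is_ancestor(tree, mrca(tree, v1, v3), mrca(tree, v1, v2))


# Toy tree ((A, B), C) with internal nodes "ab" and "root".
toy: Tree = {"root": None, "ab": "root", "C": "root", "A": "ab", "B": "ab"}
```

For the toy tree, `mrca(toy, "A", "B")` is `"ab"`, leaves `A` and `C` are incomparable, and `A` is closer to `B` than to `C`.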

2 Methods

Our classification takes as input a gene tree, , that has been reconciled with a species tree, , using a duplication-transfer model. The model may also include losses; losses have no impact on xenolog classification and we do not discuss them further. Reconciliation infers a mapping, , between genes and species, where indicates that gene was present in the genome of species . Each internal node, g, is annotated with , the event that caused the divergence at g, where can be a duplication (δ), a transfer (τ), or a speciation (σ). Transfer edges are denoted by , where is the recipient gene node, is the donor gene node, and . We say transfer t is on the path from to , if the path from to passes through both and . The output of our classification scheme is a homology table . In this classification, which is based on the definitions introduced by Fitch (2000), genes and are orthologs iff and there is no transfer on the path from to ; paralogs iff and there is no transfer on the path from to ; xenologs iff there is at least one transfer on the path from to . Note that by explicitly defining orthologs to be gene pairs that are not connected by a transfer, this definition of ortholog ensures that the ancestor of orthologous genes lie in their cenancestor; i.e. . If and are orthologs, then . If they are paralogs, . If and are xenologs, then , where is the xenolog class of genes and . In contrast to orthology and paralogy, xenology is not symmetric, due to the directional nature of horizontal transfer. In the remainder of this section, we define new xenolog classes and give formal rules for determining the xenolog class, . In Section 2.1, we consider the case where there is a single transfer on the path from to and they did not diverge by duplication (i.e. ). 
In Section 2.2, we provide xenolog classification rules for the case where the MRCA of and is a duplication and introduce a subclass of xenologs, called paraxenologs, for designating genes that are related through both duplication and transfer. Finally, in Section 2.3, we extend these definitions to allow an arbitrary number of transfers on the path from to .
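The three-way split into orthologs, paralogs and xenologs described above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the Notung implementation: the gene tree is a child-to-parent dictionary, events are encoded as the letters `"S"`, `"D"` and `"T"` (standing in for σ, δ and τ), and each transfer edge is identified by its recipient node, with the donor taken to be that node's parent. The toy tree and all identifiers are hypothetical.

```python
from typing import Dict, List, Optional, Set

Tree = Dict[str, Optional[str]]  # child -> parent; the root maps to None


def path_to_root(tree: Tree, v: str) -> List[str]:
    """Return v followed by its ancestors, ending at the root."""
    path = []
    node: Optional[str] = v
    while node is not None:
        path.append(node)
        node = tree[node]
    return path


def homology(tree: Tree, event: Dict[str, str], recipients: Set[str],
             g1: str, g2: str) -> str:
    """Classify two distinct leaves as ortholog, paralog or xenolog.

    `recipients` holds the recipient node of every transfer edge; the
    donor end of each edge is assumed to be the recipient's parent.
    """
    up1, up2 = path_to_root(tree, g1), path_to_root(tree, g2)
    above_g1 = set(up1)
    m = next(w for w in up2 if w in above_g1)  # MRCA of g1 and g2
    # Nodes strictly below the MRCA on the path between g1 and g2.
    below = up1[:up1.index(m)] + up2[:up2.index(m)]
    if any(v in recipients for v in below):
        return "xenolog"  # at least one transfer edge lies on the path
    return "ortholog" if event[m] == "S" else "paralog"


# Hypothetical reconciled gene tree: duplication n1, speciation n2,
# transfer n3 whose recipient child is gX.
gene_tree: Tree = {"n1": None, "n2": "n1", "hY": "n1", "n3": "n2",
                   "gZ": "n2", "gY": "n3", "gX": "n3"}
event = {"n1": "D", "n2": "S", "n3": "T"}
recipients = {"gX"}

for pair in [("gY", "gZ"), ("gZ", "hY"), ("gX", "gY"), ("gX", "hY")]:
    print(pair, homology(gene_tree, event, recipients, *pair))
```

In this toy tree, `gY` and `gZ` come out as orthologs and `gZ` and `hY` as paralogs. Every pair involving `gX` is xenologous, including `gX` and `hY`, whose MRCA is a duplication.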

2.1 Xenolog classification with a single transfer

Consider a gene tree with a single transfer from donor species to recipient species . Let be the cenancestor of t and let A be the set of nodes in the subtree of T rooted at . Transfer t defines three, non-overlapping sets of species tree nodes: , i.e. the species that are more closely related to the donor than the recipient; , i.e. the species that are more closely related to the recipient than the donor; , i.e. the nodes in the species tree equally related to the donor and recipient. We define four mutually exclusive xenolog classes based on these sets. Xenolog classes are defined with respect to a reference gene that is a descendant of the recipient of the transfer; i.e. ). For every , t is on the path from to and is a Primary xenolog iff ; Sibling Donor xenolog iff ; = SDX Sibling Recipient xenolog iff ;    = SRX Outgroup xenolog iff . Xenologs are classified relative to a reference gene; therefore, xenolog class assignments are not symmetric. In the homology table, when is used to indicate that is the xenolog of the reference gene, , and that its class is given by . In Figure 1, all genes are xenologous to . Both g and g are in set D; g is a Primary xenolog () and g is a Sibling Donor xenolog (), because g is a descendant of the donor (i.e. ) and g is not. Genes g and g are in set R and are Sibling Recipient xenologs (). Gene g is an Outgroup xenolog () because g is in set O. Genes h and h are paraxenologs and will be discussed in Section 2.2. A xenologous gene pair can be further annotated to indicate cases where the genes are found in the same species: is an autoxenolog of , iff . We designate this  = X. Autoxenologs will also be assigned to a subclass. In Figure 1, g and are both in species X; g is a Sibling Recipient autoxenolog (). Xenolog class hierarchies: The xenolog classes form a hierarchy that can elucidate how xenologs are related in both the gene and species trees. 
Primary xenologs are closest in the xenolog hierarchy and Outgroup xenologs are most distant. We denote this hierarchy by where , if and g1 are closer in the hierarchy than and g2. Genes that are more closely related in the hierarchy are also more closely related in the gene tree. Let genes g1 and g2 in be xenologs of such that there is no transfer ancestral to either g1 or g2. Then, , if . This hierarchy, which is illustrated in Figure 2, is stated formally as follows:
Fig. 2.

Xenolog class hierarchy: (top) Gene tree with one transfer, shown in the context of the species tree. (bottom) The reconciled gene tree. Each leaf is annotated with its xenolog class. Nodes and g4 are the common ancestors of and, respectively, the Primary, Sibling Donor, Sibling Recipient and Outgroup xenologs in the tree, as indicated by the labels on internal nodes. The labels on the path from to the root satisfy the hierarchy, , consistent with Theorem 2.1

Theorem 2.1. (Xenolog class hierarchy in the gene tree) Given, for any Primary xenolog,, Sibling Donor xenolog,, Sibling Recipient xenolog,, and Outgroup xenolog,, of
Proof. See Supplementary Section S.1. □
We sketch the basis of this theorem informally, here. For every xenolog of , the common ancestor of and is a node on the path from to the root of T; i.e. there exists , such that and . If , then and is therefore a Primary xenolog. For , the descendants of , the child of that is incomparable to the transfer, must satisfy two requirements. First, since all xenologs in are equally related to , all xenologs in must be assigned to the same xenolog class. This will be true if all descendants of are in the same species set, D, R or O. Second, for any , the xenologs in are more distantly related to than the xenologs in ; therefore, consistency requires that the class of xenologs in not be closer in the hierarchy than the class of xenologs in . Both of these conditions are satisfied when there is no transfer that is ancestral to either or . This is always true in a reconciled tree with a single transfer and no duplications. We will reexamine the hierarchical properties of xenolog classes in trees with more complex event histories in the following sections. The proposed xenolog classes also convey information about the relationship of a xenolog pair in the gene tree relative to their relationship in the species tree. For xenologs, the cenancestor of and g can predate or postdate the species containing . 
Our xenolog classes distinguish between these three cases and are summarized in Supplementary Table S1. Primary and Sibling Donor xenologs are more closely related in the gene tree than in the species tree, whereas Sibling Recipient xenologs are more closely related in the species tree than in the gene tree. Outgroup xenologs are equally related in both trees.
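The single-transfer rules of this section can be illustrated with a short Python sketch. It rests on several assumptions: trees are child-to-parent dictionaries; D and R are taken to be the subtrees below the two children of the cenancestor a that lead to the donor and recipient species, with O being every other species node; and the class codes follow the SDX and SRX abbreviations used in the text, with `PX` and `OX` written by analogy. The toy species tree, gene tree and node names are hypothetical, and the code is not the Notung implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

Tree = Dict[str, Optional[str]]  # child -> parent; the root maps to None


def path_to_root(tree: Tree, v: str) -> List[str]:
    """Return v followed by its ancestors, ending at the root."""
    path = []
    node: Optional[str] = v
    while node is not None:
        path.append(node)
        node = tree[node]
    return path


def species_sets(sp: Tree, donor: str,
                 recipient: str) -> Tuple[str, Set[str], Set[str], Set[str]]:
    """Return the cenancestor a and the species sets D, R and O."""
    up_d, up_r = path_to_root(sp, donor), path_to_root(sp, recipient)
    if donor in up_r or recipient in up_d:
        raise ValueError("donor and recipient species must be incomparable")
    a = next(w for w in up_r if w in up_d)
    child_d = up_d[up_d.index(a) - 1]  # child of a leading to the donor
    child_r = up_r[up_r.index(a) - 1]  # child of a leading to the recipient
    D = {v for v in sp if child_d in path_to_root(sp, v)}
    R = {v for v in sp if child_r in path_to_root(sp, v)}
    return a, D, R, set(sp) - D - R


def classify(genes: Tree, sigma: Dict[str, str], sp: Tree,
             d: str, r: str, g: str) -> str:
    """Class of g relative to a reference gene below recipient node r,
    assuming the single transfer (d, r) is on the path between them and
    the pair did not diverge by duplication."""
    _, D, R, _ = species_sets(sp, sigma[d], sigma[r])
    up_g = path_to_root(genes, g)
    if r in up_g:
        raise ValueError("g descends from the recipient: not a xenolog here")
    if d in up_g:
        return "PX"   # Primary: g descends from the donor gene node
    if sigma[g] in D:
        return "SDX"  # Sibling Donor
    if sigma[g] in R:
        return "SRX"  # Sibling Recipient
    return "OX"       # Outgroup


# Hypothetical species tree ((( Y, Z )c, X )a, W )root.
species: Tree = {"root": None, "a": "root", "W": "root",
                 "c": "a", "X": "a", "Y": "c", "Z": "c"}
# Hypothetical gene tree: vertical copies gW, gX, gZ, gY, plus gXt, a copy
# transferred from species Y into species X at transfer node n_t.
genes: Tree = {"n_root": None, "n_a": "n_root", "gW": "n_root",
               "n_c": "n_a", "gX": "n_a", "n_t": "n_c", "gZ": "n_c",
               "gY": "n_t", "gXt": "n_t"}
sigma = {"n_t": "Y", "gXt": "X", "gY": "Y", "gZ": "Z", "gX": "X", "gW": "W"}

for g in ["gY", "gZ", "gX", "gW"]:
    auto = sigma[g] == sigma["gXt"]  # same species as the reference gene
    print(g, classify(genes, sigma, species, "n_t", "gXt", g), "auto" if auto else "")
```

With `gXt` as the reference gene, the four other leaves fall into the four classes in hierarchy order: `gY` is Primary, `gZ` Sibling Donor, `gX` Sibling Recipient (and an autoxenolog, since it shares species X with the reference) and `gW` Outgroup.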

2.2 Xenolog classification with transfers and duplications

We next consider the classification of genes and when there is a single transfer on the path from to and they diverged by duplication (i.e. ). Such gene pairs satisfy both the paralog and the xenolog criteria proposed by Fitch (2000), leading to potential terminological confusion. To avoid this confusion, we introduce the explicit designation, paraxenolog, for xenologs that diverged via a duplication at their common ancestor. Note that Patterson (1988) used ‘paraxenolog' to refer to a different phenomenon. Formally, let be a duplication node in the gene tree with a transfer, , in one of its two subtrees, and let be a descendant of that transfer. Then, every gene in the other subtree of is a paraxenolog of , to be denoted X. For example, in the gene tree in Figure 1, is a duplication node with two subtrees; the g subtree contains a transfer with reference gene . All genes in the other subtree (that is, h and h) are paraxenologs of . Paraxenologs are also assigned to a specific xenolog class when it is both possible to do so and preserve the xenolog class hierarchy, as specified in Theorem 2.1. This depends on when the duplication occurred relative to , the cenancestor of the transfer. If the species in which the duplication occurred is a descendant of , then all descendants of are more closely related to the donor than to the recipient; i.e. all paraxenologs are in species in D and must be Sibling Donor xenologs. They cannot be Primary xenologs, as, by definition, Primary xenologs are the descendants of a transfer. In this case, paraxenologs satisfy the requirements of Theorem 2.1, because all paraxenologs of are equally related to and are assigned to the same xenolog class; the hierarchy is preserved. When the duplication predates or coincides with the cenancestor of the transfer, then the descendants of both children of will be inherited by species in D, R and potentially O. 
These paraxenologs are equally related in the gene tree, but would be assigned different classes based on their location, thus violating the requirements of Theorem 2.1. To avoid violating the hierarchy, for every paraxenolog, g, of , we assign to X, i.e. and g are untyped paraxenologs. A scenario where this occurs is shown in Supplementary Figure S1. Xenolog hierarchy with paraxenologs: The xenolog hierarchy in Theorem 2.1 holds for paraxenologs if we ignore the distinction between xenologs and paraxenologs of the same class and consider X to be on a par with the OX class in the hierarchy. If and are a Sibling Donor xenolog and a Sibling Donor paraxenolog, respectively, of , then may be either ancestral to or a descendant of (Fig. 3). Similarly, may be an ancestor or a descendant of , where is an Outgroup xenolog of and is an untyped paraxenolog. These results are stated formally in Theorem S.2.
Fig. 3.

Paraxenolog classification: The gene tree in Figure 1, which contains a duplication followed by a transfer. Each leaf is annotated with its xenolog class. Each internal node on the path from to the root is labeled with the xenolog class of all genes in its right subtree (i.e. the subtree that does not contain a transfer). The progression of labels satisfies the xenolog hierarchy, consistent with Theorem 2.1.

The species hierarchy in Supplementary Table S1 is also preserved, with the additional observations that Sibling Donor paraxenologs behave like Sibling Donor xenologs and .
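The typing rule for paraxenologs in this section depends only on where the duplication sits relative to the cenancestor of the transfer. A minimal Python sketch follows, assuming a child-to-parent species tree and hypothetical function and label names: a duplication in a species that is a proper descendant of the cenancestor yields Sibling Donor paraxenologs, and a duplication that coincides with or predates the cenancestor yields untyped paraxenologs.

```python
from typing import Dict, List, Optional

Tree = Dict[str, Optional[str]]  # child -> parent; the root maps to None


def path_to_root(tree: Tree, v: str) -> List[str]:
    """Return v followed by its ancestors, ending at the root."""
    path = []
    node: Optional[str] = v
    while node is not None:
        path.append(node)
        node = tree[node]
    return path


def paraxenolog_class(sp: Tree, dup_species: str,
                      donor: str, recipient: str) -> str:
    """Type the paraxenologs created by a duplication in `dup_species`
    that precedes a single transfer from `donor` to `recipient`."""
    up_d = path_to_root(sp, donor)
    a = next(w for w in path_to_root(sp, recipient) if w in up_d)  # cenancestor
    below_a = dup_species != a and a in path_to_root(sp, dup_species)
    if below_a:
        # Every gene below the duplication is inherited by species in D.
        return "Sibling Donor paraxenolog"
    # The duplication coincides with or predates the cenancestor, so its
    # descendants can land in D, R and O; leave the class untyped.
    return "untyped paraxenolog"


# Hypothetical species tree ((( Y, Z )c, X )a, W )root; transfer from Y to X.
species: Tree = {"root": None, "a": "root", "W": "root",
                 "c": "a", "X": "a", "Y": "c", "Z": "c"}

for dup in ["c", "a", "root"]:
    print(dup, "->", paraxenolog_class(species, dup, "Y", "X"))
```

Here a duplication in species `c` gives Sibling Donor paraxenologs, while duplications in `a` or `root` give untyped paraxenologs.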

2.3 Xenolog classification with multiple transfers

With a single transfer, xenolog classes are defined in terms of the sets of species tree nodes, D, R and O, which are determined by the positions of the donor and recipient species and their common ancestor, a. The key issue in extending the framework to multiple transfers is how to obtain a single D, R and O given multiple donor and recipient species. We first describe a xenolog classification procedure for a pair of genes connected by a path containing transfer edges, when all k transfers are mutually comparable. Transfers, and , are comparable, iff and are comparable in the gene tree. Then, we describe a procedure for the case where the gene pair is separated by incomparable transfers. The remainder of this section applies to both xenologs and paraxenologs; for simplicity, we use ‘xenolog’ to refer to both. Comparable transfers: Let be an ordered sequence of comparable transfers on the path from to . We say that is ancestral to (denoted ), iff . Any set of comparable transfers, , can be ordered such that . In particular, , and . Since are comparable, they must be on the path between and . These transfers can be summarized by a single super-transfer, , where and . With one exception, discussed below, t* behaves like a single transfer that could occur in a reconciled tree: the cenancestor of the super-transfer, , induces sets D*, R* and O* (Fig. 4). These are used to determine , using the single-transfer procedure previously described.
Fig. 4.

Xenolog classification with multiple transfers: (top) Gene tree with two comparable transfers (solid arrows) and the associated super-transfer (dashed arrow), shown in the context of the species tree. (bottom) The reconciled gene tree. Each leaf is annotated with its xenolog class. Genes g and g are classified with respect to t2 and obey the hierarchy: and . All other genes are classified with respect to the super-transfer, . Their xenolog classes are consistent with the hierarchy (Theorem 2.1): and

The exceptional case arises when the recipient species of the super-transfer is a descendant of the donor species (). This scenario (Supplementary Fig. S2) cannot occur with a single transfer because the donor and recipient species of a transfer must be incomparable. With multiple transfers, however, may be in . In this case, the cenancestor of the super-transfer is also its donor (). Since all descendants of are also descendants of , all xenologs in A* are Primary xenologs. All other xenologs are in O* and are Outgroup xenologs. A possible concern about replacing k transfers with a single super-transfer is that the intermediate species are not considered. However, these intermediate species are represented by xenologous pairs that only pass through a subset of the k transfers, namely, . Information about where ancestral forms of spent time as traveled from to is captured by the complete set of xenologs of .
Incomparable transfers: We first consider the special case where k = 2 and the transfers are incomparable. Given a pair of genes, and , connected by two incomparable transfers, and (Fig. 5), one gene is a descendant of one transfer recipient (), and the other gene is a descendant of the other transfer recipient (). Since and are both descendants of a transfer recipient, xenolog can be classified with respect to , and vice versa.
Fig. 5.

Xenolog classification with incomparable transfers: (top) Gene tree with two incomparable transfers shown in the context of the species tree. Species sets associated with transfers and are shown below the leaves. (bottom) The reconciled gene tree. Each leaf is annotated with its xenolog class in reference to t (top row) and t (bottom row). Genes and are separated by both transfers. Since . In contrast, since . Xenolog classes for other genes are consistent with their relatedness in the gene tree (Theorem 2.1): and

With incomparable transfers, the xenolog classes do not satisfy the hierarchical properties of Theorem 2.1. Let and let be the child of that is ancestral to but not (i.e. and ). Recall that the first condition for preservation of the hierarchy is that all xenologs in must be in the same species set. Satisfaction of this condition is not guaranteed for incomparable xenologs because contains a transfer, , that can move to a species that is not in the same set as . Suppose, for example, the donor of is in a species in O, but its recipient is in a species in D. Since both and are in , more than one species set is represented in , violating the first condition. Primary xenologs are the one exception to this problem. Primary xenologs are defined in terms of and not in terms of D, R and O, and are therefore unaffected by incomparable transfers. To avoid a classification that violates the hierarchy, we do not assign xenologs separated by incomparable transfers to specific subclasses. Given two genes separated by incomparable transfers, and , without loss of generality, let be the reference gene, be the xenolog under classification, and be their common ancestor. Then is a Primary xenolog iff ; Incomparable xenolog iff and ; Incomparable paraxenolog iff and . In the incomparable case, is the classification of with respect to and is the classification of with respect to . Either and (or vice versa), or . 
We now address the case where k > 2 by reducing the problem to one involving two incomparable super-transfers and applying the protocol just described. Let be the transfers, in descending order, on the path from to and be the set of transfers on the path from to . Since must be mutually comparable, they can be replaced with super-transfer , where and . Similarly, we replace with super-transfer , where and . Xenolog hierarchy for multiple transfers: With multiple comparable transfers, the hierarchical properties in Theorem 2.1 hold for xenologs that share the same super-transfer from to . For example, in Figure 4, the xenolog class hierarchy is preserved for nodes g and g, which are xenologs of with respect to only. Similarly, xenologs , , and g, which are all defined with respect to the super-transfer , also obey the hierarchy. However, g and g do not share the super-transfer and thus, do not obey the hierarchy; , yet . Primary xenologs, including those connected by incomparable transfers, are more closely related than any other class of xenologs. Incomparable xenologs that are not Primary may fall anywhere in the hierarchy; that is, a given pair of Incomparable xenologs may be more closely related, or more distantly related, than a given pair of Sibling or Outgroup xenologs. Thus, the non-specific Incomparable xenolog class provides less information about relatedness than the specific Sibling and Outgroup classes, but guarantees a classification in which relatedness in the gene tree is consistent with the hierarchy. The species tree hierarchy for single transfers (Supplementary Table S1) also holds for multiple comparable transfers summarized by a super-transfer, with one exception. When the recipient species of the super-transfer is a descendant of the donor species (as in Supplementary Fig. S2), Primary xenologs, with respect to this super-transfer, are more or equally related in the species tree than in the gene tree. 
The species tree hierarchy is not guaranteed for multiple, incomparable transfers, even when the pair are classified as Primary xenologs. The reasoning for this is that the recipient of can be in any of the sets, D1, R1, or O1, defined by . Therefore the cenancestor of and can be in any species in V. Any relationship, even an incomparable relationship, is possible between the cenancestor and the ancestor containing .
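The super-transfer construction for mutually comparable transfers, including the exceptional case in which the recipient species descends from the donor species, can be sketched as follows. The sketch assumes that the super-transfer runs from the donor species of the most ancestral transfer to the recipient species of the most recent one, which is one reading of the text above. Membership of a gene in A* is passed in as a boolean rather than computed. The function names, class codes and toy species tree are hypothetical, and incomparable transfers are not covered.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, Optional[str]]  # child -> parent; the root maps to None


def path_to_root(tree: Tree, v: str) -> List[str]:
    """Return v followed by its ancestors, ending at the root."""
    path = []
    node: Optional[str] = v
    while node is not None:
        path.append(node)
        node = tree[node]
    return path


def super_transfer(transfers: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Summarize comparable transfers, given as (donor species, recipient
    species) pairs ordered from most ancestral to most recent."""
    return transfers[0][0], transfers[-1][1]


def classify_super(sp: Tree, transfers: List[Tuple[str, str]],
                   gene_species: str, in_a_star: bool) -> str:
    """Class of a xenolog separated from the reference gene by all of the
    given transfers. `in_a_star` says whether the gene descends from the
    donor gene node of the super-transfer."""
    donor, recipient = super_transfer(transfers)
    up_d, up_r = path_to_root(sp, donor), path_to_root(sp, recipient)
    if donor in up_r:
        # Exceptional case: the recipient species descends from the donor
        # species, so only Primary and Outgroup classes are assigned.
        return "PX" if in_a_star else "OX"
    if recipient in up_d:
        raise ValueError("donor below recipient is not covered by this sketch")
    if in_a_star:
        return "PX"
    a = next(w for w in up_r if w in up_d)  # cenancestor of the super-transfer
    up_g = path_to_root(sp, gene_species)
    if up_d[up_d.index(a) - 1] in up_g:
        return "SDX"  # species lies on the donor side of a (set D*)
    if up_r[up_r.index(a) - 1] in up_g:
        return "SRX"  # species lies on the recipient side of a (set R*)
    return "OX"       # species lies in O*


# Hypothetical species tree ((( Y, Z )c, X )a, W )root.
species: Tree = {"root": None, "a": "root", "W": "root",
                 "c": "a", "X": "a", "Y": "c", "Z": "c"}

chain = [("W", "Z"), ("Z", "X")]  # W -> Z, then Z -> X
loop = [("c", "W"), ("W", "Y")]   # c -> W, then W -> Y; Y descends from c
print(super_transfer(chain), classify_super(species, chain, "Y", False))
print(super_transfer(loop), classify_super(species, loop, "Z", False))
```

For `chain`, the super-transfer runs from W to X with the root as cenancestor, so a vertically inherited gene in species Y is a Sibling Recipient xenolog. For `loop`, the recipient Y descends from the donor c, so a gene outside A* is an Outgroup xenolog.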

3 Algorithms and implementation

The classification procedure for the xenolog classes described in Section 2 is shown as pseudocode in Supplementary Section S.4. We have implemented this procedure and integrated it in Notung 2.9, a freely available software package that implements gene tree-species tree reconciliation with transfers in a parsimony framework (Stolzer ). Upon reconciling a gene tree with a species tree, Notung 2.9 generates a homology table, H, for all pairs of leaves in the gene tree. There may be more than one minimum-cost event history that reconciles the gene and species trees. A homology table is generated for each optimal, temporally feasible reconciliation reported. Transfers imply temporal constraints because the donor and recipient of a transfer must have co-existed; a reconciliation is temporally feasible if all temporal constraints imposed by the inferred transfers are mutually compatible. In particular, temporal consistency requires that s and s be incomparable to all transfers. Notung 2.9 reports all optimal reconciliations that are temporally feasible, up to a user-specified limit (Stolzer, 2012). Homology tables can be viewed in the graphical user interface or exported from the command line in a tab-delimited, CSV, or HTML format. Row H contains the homology relationships between reference gene, , and all other genes in V. For orthologs and paralogs, . For xenologs, gives the xenolog class of g with respect to , a reference gene that is the recipient of at least one transfer on the path from to . If there is also a transfer on the path from to g, then gives the xenolog class of g with respect to reference . Otherwise, H. The classification procedure is generally applicable to reconciled gene trees and can be implemented in any reconciliation software package that enforces temporal consistency. When temporal consistency is not enforced, reconciliations with transfers between ancestor and descendant species can arise. 
Since this scenario is similar to super-transfers that form a loop (Supplementary Fig. S2), the classification proposed here could easily be adapted for programs that do not enforce consistency.
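The ancestor-descendant case mentioned above can be detected with a simple comparability test on the species tree. The following is a minimal sketch, not Notung's implementation: the `parent` map, the species labels and the function names are all hypothetical, and the check covers only a single transfer whose donor and recipient are comparable, not the full test that all temporal constraints are mutually compatible.

```python
def ancestors(node, parent):
    """Return the set of strict ancestors of `node` in a rooted tree."""
    seen = set()
    current = parent.get(node)
    while current is not None:
        seen.add(current)
        current = parent.get(current)
    return seen


def are_comparable(a, b, parent):
    """True if a == b, or one species is an ancestor of the other."""
    return a == b or a in ancestors(b, parent) or b in ancestors(a, parent)


def flag_ancestor_descendant_transfers(transfers, parent):
    """Return the (donor, recipient) pairs whose endpoints are comparable."""
    return [(d, r) for d, r in transfers if are_comparable(d, r, parent)]


# Hypothetical species tree: root -> (AB -> (A, B), C)
parent = {"root": None, "AB": "root", "C": "root", "A": "AB", "B": "AB"}

# Hypothetical inferred transfers as (donor species, recipient species)
transfers = [("A", "C"), ("AB", "A"), ("C", "B")]

flagged = flag_ancestor_descendant_transfers(transfers, parent)
print(flagged)  # [('AB', 'A')]
```

Only the transfer from AB to its descendant A is flagged; the other two connect incomparable species and could be contemporaneous.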

4 Empirical results

Genomic study: As a proof of principle, we analyzed 13 623 gene families from a dataset of 65 genomes of Proteobacteria and Cyanobacteria (Latysheva et al.). Phylogeny was reconstructed as described in Supplementary Section S.5. To control for spurious inference of transfers due to phylogenetic error, weakly supported branches were rearranged using a species-tree aware method as described in Supplementary Section S.5.1. The resulting rooted, rearranged trees were then reconciled with the species tree with default costs (, ). These costs are consistent with costs used in other recent phylogenomic analyses (David and Alm, 2011; Richards et al., 2014), which were selected to minimize the total net change in genome content. The time required to reconcile the 13 623 trees, including generating all optimal reconciliations and testing them for temporal feasibility, was 7.25 min on an Intel Xeon 2.3 GHz processor (128 GB RAM). The computational complexity of calculating the homology table, once the gene tree has been reconciled, is negligible.

Homology tables were computed for the 13 194 trees possessing at least one temporally feasible solution. From these, homologs of all categories were tabulated. For families with more than one optimal reconciliation, the number of pairs in each category was averaged over all reported, optimal event histories.

Orthologs, paralogs and xenologs are all represented in this dataset, and every xenolog class is also observed (Fig. 6 and Supplementary Tables S2–S6). More than a quarter of homologous gene pairs were xenologs. Of these pairs, 85.7% are xenologs with only one reference gene, where all transfers on the path from the reference to its xenolog are mutually comparable. Of these xenologs, 60.2% are either Primary or Sibling Donor (para)xenologs; thus, the majority of the inferred xenologs are closer to the donor than the recipient.
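The tabulation step, in which pair counts are averaged over the optimal reconciliations of a family before being summed across families, can be sketched as follows. This is an illustrative sketch only: the input layout (one dictionary of category counts per reconciliation), the category labels and the function names are assumptions, not the format of Notung's homology tables.

```python
from collections import Counter, defaultdict


def average_counts(reconciliations):
    """Average per-category pair counts over one family's optimal reconciliations.

    A category absent from a reconciliation contributes zero to the mean.
    """
    totals = Counter()
    for counts in reconciliations:
        totals.update(counts)  # adds counts category by category
    n = len(reconciliations)
    return {category: total / n for category, total in totals.items()}


def tabulate(families):
    """Sum the per-family averages over all families with at least one solution."""
    overall = defaultdict(float)
    for reconciliations in families.values():
        if not reconciliations:
            continue  # no temporally feasible reconciliation: family is skipped
        for category, mean in average_counts(reconciliations).items():
            overall[category] += mean
    return dict(overall)


# Hypothetical toy input: family -> list of per-reconciliation pair counts
families = {
    "fam1": [{"ortholog": 4, "xenolog": 2}, {"ortholog": 2, "xenolog": 4}],
    "fam2": [{"ortholog": 3, "paralog": 1}],
    "fam3": [],
}

totals = tabulate(families)
print(totals)
```

Averaging within a family first means that a family with many equally optimal event histories carries the same weight as a family with a single one.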
Fig. 6.

(left) Proportions of orthologs, paralogs and xenologs (all classes) in the 13 194-tree bacterial dataset. (right) Proportions of xenolog classes

Gene pairs separated by incomparable transfers are fairly rare compared with all types of xenologs separated by any number of transfers. Such pairs have two xenologs, one for each reference gene; at most one member of each pair can be classified as a Primary xenolog (PX), otherwise they are untyped (IX). The fraction of Incomparable xenologs for which the hierarchy provides no information is quite small: 72.0% of incomparable (para)xenologs are (PX, IX) pairs; the rest are (IX, IX) or () pairs.

Less than 1% of all xenologous pairs are autoxenologs, which could be due to preferential transfer of novel genes or a high incidence of xenologous gene displacement (Koonin et al.). Paralogs constitute 2.2% of all homologs, and paraxenologs are 4.8% of all xenologs. The low level of paralogy observed is consistent with prior reports that in prokaryotes transfer is a greater source of genetic novelty than duplication (Treangen and Rocha, 2011). Interestingly, the vast majority of paraxenologs, 73.4%, are Sibling Donor paraxenologs. Recall that paraxenologs that diverged after the cenancestor of the transfer can be unambiguously classified and are always more closely related to the donor than to the recipient of the transfer. Paraxenologs that diverged before the cenancestor, i.e. closer to the root, cannot be assigned a specific class without breaking the hierarchy. As with Incomparable xenologs, the low fraction of untyped paraxenologs (s) suggests that, at least for this dataset, there are relatively few pairs for which it is impossible to extract some information from the xenolog classification.

Methodological factors may also contribute to the trends we observe. Gene families were inferred with OrthoMCL (Li et al.), which tends to place paralogous subfamilies in separate clusters.
This could be a factor in the low level of paralogs, paraxenologs and autoxenologs in this study. It could also contribute to the preponderance of pairs, relative to pairs, as the tendency to break up paralogous subfamilies would result in relatively few inferred duplications near the root of the gene tree.

We considered to what extent the empirical parameters influenced the outcome of the analysis presented here. First, we investigated the impact of OrthoMCL on subsequent xenolog classification in a small set of curated families (Supplementary Section S.5.5). In most cases, the OrthoMCL clusters agreed with the curated family definitions. However, when OrthoMCL did split up paralogous subfamilies, the number and type of paraxenologs predicted changed dramatically.

In order to assess the impact of taxonomic breadth on our results, we also applied our classification procedure to two taxonomically-restricted subsets: families found only in the Cyanobacteria phylum (C: 49 species, 7485 trees) and families found only in the Synechococcales class (S: 30 species, 1429 trees), respectively. Orthologs, paralogs and all xenolog classes are present, and the observed trends are similar to those reported above for the full dataset (Supplementary Section S.5.4, Figs S8 and S9, and Tables S7–S16). Overall, the agreement between the full and restricted datasets suggests that our method is not highly sensitive to taxon sampling.

Finally, to probe the impact of event costs on xenolog classes observed in this study, we repeated this analysis with an increased transfer cost, , as described in Supplementary Section S.5.3. All xenolog classes were, again, observed. The higher transfer cost resulted in a moderate increase in the number of paralogs and paraxenologs of all classes, and a decrease in the number of non-paralogous xenologs inferred.
The change in the relative frequencies of the other various classes was generally small (less than 15%) with one exception: the proportion of Outgroup xenologs decreased by more than 50%. The increase in para(xeno)logs and decrease in Outgroup xenologs, taken together, suggest that more duplications may be inferred near the roots of gene trees, when a higher transfer cost is used. Thus, in this analysis, the trade-off between duplications and transfers does not affect all xenolog classes equally.

BIO4 case study: To explore the connection between xenolog classes and protein function, we applied our approach to the BIO4 gene family; several BIO4 genes have been horizontally transferred and have been characterized experimentally (Hall and Dietrich, 2007). BIO4 is part of the biotin (vitamin B7) biosynthesis pathway (Supplementary Fig. S11). Plants and some fungi possess a BIO4 homolog that encodes a bi-functional enzyme that acts as both a 7,8-diaminopelargonic acid synthase (DAPAS) and a dethiobiotin synthetase (DTBS), steps 3 and 4 in the pathway, respectively. In bacteria, the BIO4 homolog only performs the DTBS function; the 3rd step is carried out by an unrelated protein. Unlike other fungi, however, the BIO4 homolog in some yeasts (Saccharomyces cerevisiae, and its close relatives) also encodes a DTBS-only protein. Phylogenetic analysis shows that a horizontal transfer from bacteria to yeast replaced the ancestral bi-functional homolog (Hall and Dietrich, 2007). Using Notung 2.9, we reconciled the gene and species trees (Supplementary Figs S12 and S13) constructed by Hall and Dietrich (2007) and inferred xenolog classes (Fig. 7 and Supplementary Fig. S14).
Fig. 7.

Summary of the BIO4 gene family event history. Dashed lines represent lineages with a putative dual-function DTBS + DAPAS enzyme; solid lines represent lineages with a putative DTBS-only function. With respect to the gene in S.cerevisiae, all other fungal genes are SRX, α-proteobacterial genes are PX, and genes in Firmicutes are SDX

The hierarchical nature of the xenolog classification aids in the interpretation of the functional evolution of the family in this case study. The molecular function of yeast BIO4 is closer to that of its Sibling Donor xenologs, which encode the DTBS-only enzyme, than its Sibling Recipient xenologs, which encode bi-functional enzymes. In contrast, the Sibling Recipient xenologs provide information about genomic context. The fact that the Sibling Recipient xenologs encode a bi-functional enzyme raises a red flag: the replacement of a bi-functional enzyme with a DTBS-only enzyme in yeast suggests loss of the DAPAS function. Either a different enzyme must be carrying out the DAPAS function or yeast no longer has a functional biotin synthesis pathway. In fact, the former is true; the DAPAS function is performed by an unrelated gene, which was also acquired horizontally (Hall and Dietrich, 2007). In this example, a closely related gene (a DTBS-only enzyme) in a distantly related (α-proteobacterial) species is a better predictor of BIO4 enzymatic function than a distantly related gene (the dual function homolog) in a closely related species (Yarrowia lipolytica). The distantly related homolog in a closely related species provides information about the genetic background; i.e. the genome could be lacking a gene encoding the DAPAS function. These insights are linked to the hierarchical structure of the xenolog classes and may represent general trends, suggesting hypotheses for future investigation.
If it proves generally true, for example, that Sibling Donors are better predictors of molecular function and Sibling Recipients are better predictors of cellular context, then this system of xenolog classification could support large scale, automated analyses in comparative, evolutionary genomics.
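As an illustration of how such an automated analysis might use the classes, the sketch below transfers a molecular-function label from donor-side xenologs (PX and SDX) and reports recipient-side xenologs (SRX) separately as context. It encodes the hypothesis stated above, not a validated method; the gene labels, annotation strings and function names are hypothetical, and the majority vote is one simple choice among many.

```python
from collections import Counter

# Classes treated here as lying on the donor side or the recipient side of
# the transfer, following the BIO4 case study (an assumption of this sketch).
DONOR_SIDE = {"PX", "SDX"}
RECIPIENT_SIDE = {"SRX"}


def split_by_side(classes):
    """Partition genes into donor-side and recipient-side xenologs of the reference."""
    donor = sorted(g for g, c in classes.items() if c in DONOR_SIDE)
    recipient = sorted(g for g, c in classes.items() if c in RECIPIENT_SIDE)
    return donor, recipient


def predict_function(classes, annotations):
    """Majority vote over annotated donor-side xenologs; None if none is annotated."""
    donor, _ = split_by_side(classes)
    votes = Counter(annotations[g] for g in donor if g in annotations)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical gene labels; classes are relative to the S. cerevisiae BIO4 gene.
classes = {
    "alpha_proteobacterium_gene": "PX",
    "firmicute_gene": "SDX",
    "yarrowia_gene": "SRX",
}
annotations = {
    "alpha_proteobacterium_gene": "DTBS",
    "firmicute_gene": "DTBS",
    "yarrowia_gene": "DAPAS+DTBS",
}

predicted = predict_function(classes, annotations)
_, context_genes = split_by_side(classes)
print(predicted)      # DTBS
print(context_genes)  # ['yarrowia_gene']
```

In this toy input the donor-side genes predict a DTBS-only function, while the bi-functional annotation of the recipient-side gene signals that a DAPAS activity is expected elsewhere in the genome.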

5 Discussion

Distinguishing orthologs from paralogs, as well as the division of paralogs into subclasses based on the timing and nature of the events by which they arose, has proved to be a valuable analytical approach in molecular evolution, systematics, comparative genomics, and homology-based function prediction. Here, we examine the challenges associated with the expansion of this framework to include horizontally transferred genes. The term ‘xenolog’ has been introduced to describe gene pairs related through horizontal transfer (Fitch, 2000; Gray and Fitch, 1983). However, the set of genes that share a history that includes at least one transfer encompasses a very broad set of relationships. In this work, we propose subtypes that provide a more nuanced classification of xenologs. We provide formal rules for classification, given a reconciled gene tree with an arbitrary number of transfers and duplications. These rules have been implemented in Notung 2.9, a freely available phylogenetic reconciliation software package. Phylogenetic reconciliation captures information about the historical association between genes and species, as well as the divergence events that characterize the xenologs in each class. A potential limitation of this approach is that it requires that species evolution be modeled as a tree. While some have argued against tree-like models, given the prevalence of horizontal gene transfer in bacteria, a tree can provide a useful heuristic, despite the reticulate nature of prokaryotic evolution (Mindell, 2013, and work cited therein). As with most theoretical work on reconciliation, our classification assumes that the gene tree and the inferred events are correct. In practice, errors in gene tree reconstruction or incongruence due to unrecognized incomplete lineage sorting could lead to downstream errors in xenolog classification. 
Methods that account for phylogenetic uncertainty offer an approach to bridging this gap, and are an important direction for future work. For example, the xenolog classification proposed here could be embedded in a probabilistic reconciliation framework (e.g. Akerborg et al.), which would support an explicit and quantitative model of uncertainty.

Missing data is another potential source of error. If the dataset does not contain at least one descendant of the donor, a transfer will be inferred from a putative donor that is actually an ancestor of the donor species. When this occurs, some genes that are actually Sibling Donor xenologs may be incorrectly classified as Primary xenologs. The classification of Sibling Recipient, Outgroup, and all other Sibling Donor xenologs will be unaffected. Thus, classification errors due to missing taxa do not result in major changes in interpretation; these xenologs will still be correctly classified as being more closely related to the donor than to the recipient of the transfer.

Our classification is an extension of Fitch’s classic framework and is based solely on information that can be extracted from gene tree–species tree reconciliation. The incorporation of other sources of information, such as synteny, sequence alignments, or structural comparison, could be used to develop richer accounts of xenology relationships. For example, Koonin et al. have proposed that horizontal gene transfer can result in the acquisition of a new gene family, expansion of an existing gene family, or allelic replacement without change in copy number.

Our classification provides a context for stating general hypotheses about the functional and evolutionary fates of different classes of xenologs. Since Sibling Donor xenologs are more closely related to the reference gene than Sibling Recipients, they may be more likely to share molecular functions with the reference gene.
In contrast, the cellular environment of the reference gene may be more similar to that of Sibling Recipient xenologs. This could also convey information about the process of amelioration following transfer (Lawrence and Ochman, 1997). For example, the prokaryotic homologs of a fungal gene of prokaryotic origin are likely not informative with regard to the cellular compartment in which the encoded protein is active.

The functional fate of genes that have experienced both duplication and transfer is a largely unexplored question. Selective pressures are likely to change following both gene duplication (Lynch, 2007, and work cited therein) and horizontal gene transfer (Boto, 2010, 2016; Treangen and Rocha, 2011 and work cited therein). Little is known about the combined effect of these changes on rates of divergence and functional specialization. Recent attempts to test the ortholog conjecture, which posits that orthologs are more functionally similar than paralogs, have demonstrated the challenges presented by confounding factors in high-throughput data, and especially in the use of ontologies (Chen and Zhang, 2012; Nehrt et al.). Testing analogous xenolog conjectures will be even more challenging: probing all four xenolog classes would require large-scale, unbiased functional datasets for at least five species. Nevertheless, with the current pace of functional genomics, genomic-scale investigations of xenolog function are not far in the future.
  43 in total

1.  Prediction of operons in microbial genomes.

Authors:  M D Ermolaeva; O White; S L Salzberg
Journal:  Nucleic Acids Res       Date:  2001-03-01       Impact factor: 16.971

2.  Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.

Authors:  M Pellegrini; E M Marcotte; M J Thompson; D Eisenberg; T O Yeates
Journal:  Proc Natl Acad Sci U S A       Date:  1999-04-13       Impact factor: 11.205

Review 3.  Computational approaches to unveiling ancient genome duplications.

Authors:  Yves Van de Peer
Journal:  Nat Rev Genet       Date:  2004-10       Impact factor: 53.242

Review 4.  Horizontal gene transfer in evolution: facts and challenges.

Authors:  Luis Boto
Journal:  Proc Biol Sci       Date:  2009-10-28       Impact factor: 5.349

Review 5.  Searching for regulatory elements in human noncoding sequences.

Authors:  L Duret; P Bucher
Journal:  Curr Opin Struct Biol       Date:  1997-06       Impact factor: 6.809

6.  Distinguishing homologous from analogous proteins.

Authors:  W M Fitch
Journal:  Syst Zool       Date:  1970-06

7.  MultiMSOAR 2.0: an accurate tool to identify ortholog groups among multiple genomes.

Authors:  Guanqun Shi; Meng-Chih Peng; Tao Jiang
Journal:  PLoS One       Date:  2011-06-21       Impact factor: 3.240

8.  The ortholog conjecture is untestable by the current gene ontology but is supported by RNA sequencing data.

Authors:  Xiaoshu Chen; Jianzhi Zhang
Journal:  PLoS Comput Biol       Date:  2012-11-29       Impact factor: 4.475

9.  Mechanisms of Gene Duplication and Translocation and Progress towards Understanding Their Relative Contributions to Animal Genome Evolution.

Authors:  Olivia Mendivil Ramos; David E K Ferrier
Journal:  Int J Evol Biol       Date:  2012-08-07

10.  Phylogenomics and the dynamic genome evolution of the genus Streptococcus.

Authors:  Vincent P Richards; Sara R Palmer; Paulina D Pavinski Bitar; Xiang Qin; George M Weinstock; Sarah K Highlander; Christopher D Town; Robert A Burne; Michael J Stanhope
Journal:  Genome Biol Evol       Date:  2014-04       Impact factor: 3.416

