RecPhyloXML: a format for reconciled gene trees.

Wandrille Duchemin, Guillaume Gence, Anne-Muriel Arigon Chifolleau, Lars Arvestad, Mukul S Bansal, Vincent Berry, Bastien Boussau, François Chevenet, Nicolas Comte, Adrián A Davín, Christophe Dessimoz, David Dylus, Damir Hasic, Diego Mallo, Rémi Planel, David Posada, Celine Scornavacca, Gergely Szöllosi, Louxin Zhang, Éric Tannier, Vincent Daubin.

Abstract

Motivation: A reconciliation is an annotation of the nodes of a gene tree with evolutionary events (for example, speciation, gene duplication, transfer or loss), along with a mapping onto a species tree. Many algorithms and software packages produce or use reconciliations, but often in different formats that vary in the types of events considered and in whether the species tree is dated. This complicates comparison and communication between programs.
Results: Here, we gather a consortium of software developers in gene tree/species tree reconciliation to propose and endorse a format that aims to promote an integrative, albeit flexible, specification of phylogenetic reconciliations. This format, named recPhyloXML, is accompanied by several tools such as a reconciled tree visualizer and conversion utilities.
Availability and implementation: http://phylariane.univ-lyon1.fr/recphyloxml/.

Year:  2018        PMID: 29762653      PMCID: PMC6198865          DOI: 10.1093/bioinformatics/bty389

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The relationships between the history of genomes or species and the history of their constituent genes are often described through reconciliation. A reconciliation consists of an association between the nodes of a gene tree and the nodes or branches of a species tree, along with the evolutionary events undergone by the gene. For comprehensive reviews of reconciliations and their inference, see for example Nakhleh (2013) or Szöllősi et al. Reconciliations can be used to understand the history of a specific gene family and to study the evolutionary and functional relationships between several families. They can also be used to infer genome-wide parameters such as rates of gene duplication, loss or lateral gene transfer (Sjöstrand et al.; Szöllősi et al.), or population parameters such as divergence time and ancestral population size (Dutheil et al.). Furthermore, reconciliation-based metrics can be used as a criterion to construct better gene trees (Durand et al.; Scornavacca et al.; Sjöstrand et al.; Szöllősi et al.; Wu et al.) or better species trees (Boussau et al.; Nakhleh, 2013). There are many algorithms and software packages to infer reconciliations (Nakhleh, 2013; Szöllősi et al.), and while they share many features, each has some unique characteristics. Some methods work according to a parsimony principle (see for instance Durand et al., Bansal et al. and Chan et al.) while others rely on a probabilistic approach (Åkerborg et al.; Sjöstrand et al.; Szöllősi et al.). Reconciliation methods may also differ in the types of events they consider. Some methods require a dated species tree (i.e. a species tree where the relative timing of internal speciations is known) while others do not. Because reconciliation programs (or rather each program family) use different formats to represent reconciliations, it is difficult to compare, switch between or combine reconciliations inferred by different pieces of software, which can hamper proper comparison and validation studies.
This also means that any post-analysis or visualization software will either have limited scope (it will only accept the reconciliations of specific pieces of software) or be burdened by the implementation of readers for several formats. In this paper, we propose a generic reconciliation format encompassing the specificities of different reconciliation programs. This will make reconciliation-based analysis more accessible to scientists without the need to develop or use multiple format conversion scripts. In order to include all properties described in the scientific literature about gene tree/species tree reconciliation, we should first be able to annotate gene tree nodes with events related to species tree nodes, such as speciations, and events associated with species tree branches, such as gene duplication (D), gene loss (L), lateral gene transfer (T), transfer with replacement (TR), gene conversion (C) and incomplete lineage sorting (ILS) (Mallo et al.; Rasmussen and Kellis, 2012; Than et al.). Reconciliations can be carried out with dated or undated species trees. In a dated species tree, the relative order of speciations is known, and it is desirable to be able to record the relative time at which the different events occurred in the reconciliation. Transfers are written as two separate events: a gene lineage leaving a species tree branch (branching out) and then entering another species tree branch (transfer reception). As noted in Szöllősi et al., most transfers originate from extinct or unsampled lineages (i.e. branches absent from the species tree). This implies that the bifurcation in the gene tree when a lineage leaves the species tree is not the transfer itself but actually a speciation toward an unsampled/extinct lineage. Our format nevertheless reflects the generality of this event by adopting a neutral label compatible with the different representations of transfers.
Moreover, this notion of evolution in unsampled lineages implies the possibility of a bifurcation in the gene tree within such a lineage. The children of the bifurcation can undergo transfers back to the sampled lineages. The unseen bifurcation might be a duplication, a speciation or a transfer between two unsampled lineages. Existing models are as yet unable to discriminate between these events. Our format reflects this idea by providing a specific way to describe a bifurcation in an unsampled lineage. There have been previous attempts to develop formats able to represent evolutionary events along a phylogeny. The PhyloXML format (Han and Zmasek, 2009) can depict various annotations along a tree. It already has some way of representing evolutionary events along a phylogeny, but with limitations. For example, PhyloXML lacks a means to specify the species associated with the different events and only includes a rudimentary representation of transfers. Adapting the existing PhyloXML tags for evolutionary events would require a near-complete overhaul; instead, we propose a new format (recPhyloXML) with entirely new tags, ensuring no confusion with PhyloXML.

1.1 State of the art

Existing reconciliation formats can be broadly categorized into two groups. The first group describes reconciliation events as labels in a Newick or NHX tree, either in place of the nodal support (e.g. bootstrap) information or in a dedicated NHX comment field. Programs like ALE (Szöllősi et al.), NOTUNG (Durand et al.; Stolzer et al.), DrML (Górecki and Eulenstein, 2014), phylotoo2 (Zheng and Zhang, 2017) or PrIME (Åkerborg et al.; Sjöstrand et al.) belong to this group. The Newick-based reconciliation formats have the advantage of representing the phylogeny. However, the reconciliation information often takes the place of other measures such as bootstrap values (as in Szöllősi et al. or Górecki and Eulenstein, 2014). The NHX-based format solves this by allocating a specific space for the reconciliation. A common problem with NHX and Newick-based formats is that some characters are forbidden in leaf names and annotations (in Newick and NHX these are , : ( ) ; [ ]), while species or gene annotations sometimes contain these characters (whereas they rarely contain whole XML tags). In addition, there is no formal format for the information contained in NHX comment fields; thus, this information may not be accessible across software platforms. The second group represents reconciliations as lists of gene tree nodes mapped to species tree nodes, making references to an implicit or external gene tree (meaning that the gene tree structure might not be included in the reconciliation). Such output formats are used by ranger-DTL (Bansal et al.), ecceTERA (Jacox et al.), DLcoalRecon (Rasmussen and Kellis, 2012), Mowgli (Doyon et al.), the visualization software SylvX (Chevenet et al.) and the simulation software SimPhy (Mallo et al.).
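The character problem described above can be illustrated with a short Python sketch. The function name reserved_chars_in and the sample labels are hypothetical; the reserved character set is the one listed in the text. The second part shows that a standard XML library escapes markup characters in a label, which is one reason an XML-based format avoids the issue.

```python
import xml.etree.ElementTree as ET

# Characters with structural meaning in Newick/NHX, as listed in the text.
NEWICK_RESERVED = set(",:();[]")

def reserved_chars_in(label: str) -> list:
    """Return the reserved Newick/NHX characters found in a label,
    in order of first appearance (hypothetical helper)."""
    found = []
    for ch in label:
        if ch in NEWICK_RESERVED and ch not in found:
            found.append(ch)
    return found

print(reserved_chars_in("geneA"))               # []
print(reserved_chars_in("E.coli(K12):gene;1"))  # ['(', ')', ':', ';']

# In XML, the same label is stored as element text; the library escapes
# the characters that have a meaning in XML itself.
name = ET.Element("name")
name.text = "E.coli(K12):gene;1 <draft>"
serialized = ET.tostring(name, encoding="unicode")
print(serialized)  # <name>E.coli(K12):gene;1 &lt;draft&gt;</name>
```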

2 Format presentation

To describe reconciliations, we present recPhyloXML, recGeneTreeXML and recSpeciesTreeXML, three grammars extending the PhyloXML format. We also introduce recGeoXML, a grammar to annotate reconciliations with geographic information. They all rely on an XML structure composed of hierarchical tags. A given tag may have several attributes, each of which is either obligatory or optional. In this section we briefly describe the parts of PhyloXML used in our format. We then expand on the tags that are specific to reconciliations.

2.1 Elements in common with PhyloXML

In PhyloXML, a tree is delimited by the tag <phylogeny>, which is included in a <phyloxml> root tag that specifies that the file follows the PhyloXML format. Inside the <phylogeny> tag, each clade is recursively inscribed in a <clade> tag. This clade tag has an optional attribute to describe branch length. The name or identifier of the node is given in the <name> tag. Further information can be included, such as a support value (<confidence>) or miscellaneous information (<property>).
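As an illustration of this nesting, the following Python sketch builds a minimal PhyloXML-like tree with the standard library. The helper make_clade and the node names are hypothetical, and the XML namespace declaration that real PhyloXML files carry is omitted for brevity.

```python
import xml.etree.ElementTree as ET

def make_clade(name, branch_length=None, children=()):
    """Build a <clade> element with a <name> child, an optional
    branch_length attribute and nested child clades (hypothetical helper)."""
    clade = ET.Element("clade")
    if branch_length is not None:
        clade.set("branch_length", str(branch_length))
    ET.SubElement(clade, "name").text = name
    for child in children:
        clade.append(child)
    return clade

# <phyloxml> is the root tag; each tree lives in a <phylogeny> tag.
root = ET.Element("phyloxml")
phylogeny = ET.SubElement(root, "phylogeny", rooted="true")
phylogeny.append(
    make_clade("AB", children=[make_clade("A", 0.1), make_clade("B", 0.2)])
)

print(ET.tostring(root, encoding="unicode"))
```

Clades nest recursively, so the same helper can describe a tree of any depth.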

2.2 New elements

In our format, a reconciliation (<recPhylo> tag) is defined as a set comprising one or more reconciled gene trees (<recGeneTree> tag) and a species tree (<spTree> tag). These tags are described in the following sections. Reconciled gene trees are always rooted, and this is specified by using the tag <phylogeny rooted="true">. A recPhyloXML file allows you to store and share one or more reconciled gene trees and the associated species tree. A recGeneTreeXML file allows you to add a list of evolutionary events to the description of gene tree nodes (referred to as clades in PhyloXML), possibly also containing detailed geographic information thanks to the recGeoXML grammar. The recGeoXML annotation can also be used in a recSpeciesTreeXML file, which currently differs from a PhyloXML file only in this respect.

2.3 recGeneTreeXML

recGeneTreeXML enriches the PhyloXML vocabulary by adding the complex tag <eventsRec>, which must be included inside a <clade> tag. The <eventsRec> tag contains the sequence of evolutionary events that occur along a gene tree branch. Each type of evolutionary event is represented by a specific tag. These are of two types, according to whether they concern a branch or a node of the gene tree:
Non-terminal event: <transferBack>. This tag can be used as many times as necessary. This event does not cause any bifurcation in the gene tree.
Terminal events: <leaf>, <speciation>, <duplication>, <loss>, <branchingOut> and <bifurcationOut>. There is exactly one of these tags at the end of the sequence of events contained in the <eventsRec> tag. Terminal events cause either a bifurcation in the gene tree (<speciation>, <branchingOut>, <duplication>, <bifurcationOut>) or the end of a lineage (<leaf>, <loss>).
Aside from the <bifurcationOut> and <transferBack> tags, all tags have an obligatory speciesLocation attribute that specifies in which species the event takes place. For <bifurcationOut>, the event always takes place in an unsampled/extinct lineage. <transferBack> events instead have a destinationSpecies attribute that specifies the species that receives the transfer. All event tags also have an optional confidence attribute intended to store a support value for the event (Nguyen et al.). Additionally, all event tags have an optional timeSlice attribute that can provide information on the timing of the event, for instance in models where the species tree is dated and subdivided (as done for example in Doyon et al.). Finally, the <leaf> tag has an optional geneName attribute that can specify to which extant gene it corresponds. We now describe each event tag in detail. The <leaf> tag indicates that the branch ends on a gene tree leaf; see Figure 1A.
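The rules above can be checked mechanically. The Python sketch below is a minimal illustration and not the official schema validation: it assumes the event tag names of the recPhyloXML specification (transferBack, leaf, speciation, duplication, loss, branchingOut, bifurcationOut), and the function name check_events is hypothetical.

```python
import xml.etree.ElementTree as ET

NON_TERMINAL = {"transferBack"}
TERMINAL = {"leaf", "speciation", "duplication", "loss",
            "branchingOut", "bifurcationOut"}
# Every terminal event except bifurcationOut needs a speciesLocation.
NEEDS_LOCATION = TERMINAL - {"bifurcationOut"}

def check_events(events_rec):
    """Return a list of problems found in one <eventsRec> element
    (an empty list means the sequence follows the rules in the text)."""
    events = list(events_rec)
    if not events:
        return ["empty eventsRec"]
    problems = []
    *body, last = events
    if last.tag not in TERMINAL:
        problems.append(f"last event <{last.tag}> is not a terminal event")
    for ev in body:
        if ev.tag not in NON_TERMINAL:
            problems.append(f"<{ev.tag}> is not allowed before the final event")
    for ev in events:
        if ev.tag == "transferBack" and "destinationSpecies" not in ev.attrib:
            problems.append("<transferBack> lacks destinationSpecies")
        elif ev.tag in NEEDS_LOCATION and "speciesLocation" not in ev.attrib:
            problems.append(f"<{ev.tag}> lacks speciesLocation")
    return problems

good = ET.fromstring(
    '<eventsRec><transferBack destinationSpecies="B"/>'
    '<leaf speciesLocation="B" geneName="g1"/></eventsRec>'
)
bad = ET.fromstring(
    '<eventsRec><speciation speciesLocation="A"/><leaf/></eventsRec>'
)
print(check_events(good))  # []
print(check_events(bad))   # two problems: misplaced speciation, leaf without location
```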
Fig. 1.

A. Representation of the <leaf> tag. B. Representation of the <speciation> tag. C. Representation of the <loss> tag. D. Representation of the <duplication> tag. The species tree is drawn using black tube-like branches. The part of the gene tree in which the event occurs is drawn using plain lines. Additional parts of the gene tree are drawn as dotted lines. Stars, squares and crosses respectively represent the beginning of a gene lineage, a gene duplication and a gene loss.

Associated recGeneTreeXML code: a clade named gene_seq_1.
<speciation> tag: The <speciation> tag describes a gene lineage undergoing a bifurcation due to a speciation; see Figure 1B. Associated recGeneTreeXML code: a clade named n1.
The <loss> tag describes the loss of a gene copy and is a terminal tag (as with the <leaf> tag, no tag can follow it). Typically, it follows a speciation event. See Figure 1C for an example. The associated recGeneTreeXML code involves clades named n1, gene_seq_1 and LOST, and uses the attribute speciesLocation="A".
The <duplication> tag represents a gene duplication inside a species tree branch; see Figure 1D. Associated recGeneTreeXML code: a clade named n1.
The <branchingOut> tag represents an event where a gene lineage splits and one gene copy exits the species tree branch while the other gene copy remains in the species branch. It is the first step of a horizontal gene transfer event: a gene lineage leaving a species tree branch; see Figure 2A. Figure 2C also represents the case of a <branchingOut> where the child that remained in the same species was lost (<loss> tag).
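To make the event tags of Figure 1 concrete, here is a hedged Python sketch that parses a small hand-written recGeneTreeXML fragment and lists the terminal event of each clade. The fragment is an illustrative reconstruction, not the listing from the original article: the tag and attribute names follow the recPhyloXML specification, while the tree shape, the species names A, B and AB, and the helper walk are assumptions.

```python
import xml.etree.ElementTree as ET

# A speciation in ancestral species AB, followed by a duplication in A
# (two extant copies) and a loss in B.
EXAMPLE = """
<clade>
  <name>n1</name>
  <eventsRec><speciation speciesLocation="AB"/></eventsRec>
  <clade>
    <name>n2</name>
    <eventsRec><duplication speciesLocation="A"/></eventsRec>
    <clade>
      <name>gene_seq_1</name>
      <eventsRec><leaf speciesLocation="A" geneName="gene_seq_1"/></eventsRec>
    </clade>
    <clade>
      <name>gene_seq_2</name>
      <eventsRec><leaf speciesLocation="A" geneName="gene_seq_2"/></eventsRec>
    </clade>
  </clade>
  <clade>
    <name>LOST</name>
    <eventsRec><loss speciesLocation="B"/></eventsRec>
  </clade>
</clade>
"""

def walk(clade, depth=0):
    """Yield (depth, clade name, terminal event tag, species) in pre-order."""
    terminal = list(clade.find("eventsRec"))[-1]
    yield depth, clade.findtext("name"), terminal.tag, terminal.get("speciesLocation")
    for child in clade.findall("clade"):
        yield from walk(child, depth + 1)

rows = list(walk(ET.fromstring(EXAMPLE)))
for depth, name, event, species in rows:
    print("  " * depth + f"{name}: {event} in {species}")
```

Bifurcating events (speciation, duplication) have two child clades, while leaf and loss end the lineage and have none.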
Fig. 2.

A. Representation of the <branchingOut> tag. B. Representation of the <transferBack> tag. C. Representation of a <branchingOut> tag followed by a <loss> tag. D. Representation of the <bifurcationOut> tag. This figure uses the same conventions as Figure 1, with the following additions. For the <bifurcationOut> tag (D.), which is specific to the model of Szöllősi et al., extinct/unsampled lineages are drawn as branches of the species tree that do not extend all the way to the right. Diamonds and triangles respectively represent a transfer leaving and entering a branch of the species tree (note that the transfers leave a branch of the species tree that corresponds to an extinct/unsampled lineage).

Associated recGeneTreeXML code: a clade named n1.
The <transferBack> tag represents a horizontal gene transfer toward a branch of the species tree; see Figure 2B. Associated recGeneTreeXML code: a clade named gene_seq_2.
The <bifurcationOut> tag represents a bifurcation that happens while the gene evolves along an unsampled/extinct species (i.e. one that is not represented in the species tree; see the <branchingOut> and <transferBack> tags above); see Figure 2D. Associated recGeneTreeXML code: a clade named n1.

2.4 Note on the lateral gene transfer representation

A lateral gene transfer is represented in two steps: one that specifies the species where the transfer originates, and another that specifies the species receiving the transfer. These two successive steps are respectively represented by the <branchingOut> and <transferBack> tags. See the different parts of Figure 2, along with Figures 3 and 4, for illustrations of these concepts.
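The two-step representation means that a complete transfer has to be reassembled from a <branchingOut> event and the <transferBack> events found below it. The Python sketch below shows one possible way to do this under stated assumptions: tag and attribute names follow the recPhyloXML specification, the example tree and species names are invented, and the functions receptions and transfers are hypothetical helpers that only follow <bifurcationOut> clades when searching for receptions.

```python
import xml.etree.ElementTree as ET

# A lineage leaves species A, bifurcates in an unsampled lineage, and both
# descendants are transferred back, to species B and C respectively.
EXAMPLE = """
<clade>
  <name>n1</name>
  <eventsRec><branchingOut speciesLocation="A"/></eventsRec>
  <clade>
    <name>gene_seq_1</name>
    <eventsRec><leaf speciesLocation="A"/></eventsRec>
  </clade>
  <clade>
    <name>n2</name>
    <eventsRec><bifurcationOut/></eventsRec>
    <clade>
      <name>gene_seq_2</name>
      <eventsRec>
        <transferBack destinationSpecies="B"/>
        <leaf speciesLocation="B"/>
      </eventsRec>
    </clade>
    <clade>
      <name>gene_seq_3</name>
      <eventsRec>
        <transferBack destinationSpecies="C"/>
        <leaf speciesLocation="C"/>
      </eventsRec>
    </clade>
  </clade>
</clade>
"""

def receptions(clade):
    """Yield destination species of transfers received below a clade whose
    lineage has left the sampled species tree."""
    for child in clade.findall("clade"):
        events = list(child.find("eventsRec"))
        if events[0].tag == "transferBack":
            yield events[0].get("destinationSpecies")
        elif events[-1].tag == "bifurcationOut":
            # Still in an unsampled lineage: keep looking further down.
            yield from receptions(child)

def transfers(root_clade):
    """Return (origin species, destination species) pairs."""
    pairs = []
    for clade in root_clade.iter("clade"):
        last = list(clade.find("eventsRec"))[-1]
        if last.tag == "branchingOut":
            origin = last.get("speciesLocation")
            pairs.extend((origin, dest) for dest in receptions(clade))
    return pairs

pairs = transfers(ET.fromstring(EXAMPLE))
print(pairs)  # [('A', 'B'), ('A', 'C')]
```

The child that stays in species A is ignored because it carries neither a <transferBack> nor a <bifurcationOut> event.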
Fig. 3.

A <recPhylo> object containing a species tree and one reconciled gene tree

Fig. 4.

A visualization of the reconciled gene tree of Figure 3

2.5 recGeoXML

Geographical annotations can be attached to gene and species tree nodes thanks to a dedicated recGeoXML tag. Such an annotation mainly consists of an area, KML information for displaying areas in GIS software, and geographic information as defined in the usual PhyloXML grammar. An area is specified by a name, a description, a value such as a support, and a source (e.g. ‘observed’ or ‘inferred by Beast’).

2.6 recPhyloXML

recPhyloXML facilitates the packaging of several gene families reconciled to the same species tree. Its structure is fairly simple. A <recPhylo> root tag contains the following sequence: (i) 0 to 1 species tree in recSpeciesTreeXML format, contained in the <spTree> tag; (ii) 1 to n gene family trees in recGeneTreeXML format, each defined in a separate <recGeneTree> tag. A complete example of a <recPhylo> object containing a species tree and a reconciled gene tree can be seen in Figure 3, and a visualization of this reconciled gene tree can be seen in Figure 4.
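The container structure can be sketched and checked in a few lines of Python. This is an illustration only: the tag names follow the recPhyloXML specification, the tiny trees and the function check_recphylo are invented for the example, and a real file would normally be validated against the published schema definition instead.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """
<recPhylo>
  <spTree>
    <phylogeny>
      <clade>
        <name>AB</name>
        <clade><name>A</name></clade>
        <clade><name>B</name></clade>
      </clade>
    </phylogeny>
  </spTree>
  <recGeneTree>
    <phylogeny rooted="true">
      <clade>
        <name>n1</name>
        <eventsRec><speciation speciesLocation="AB"/></eventsRec>
        <clade>
          <name>gene_a</name>
          <eventsRec><leaf speciesLocation="A"/></eventsRec>
        </clade>
        <clade>
          <name>gene_b</name>
          <eventsRec><leaf speciesLocation="B"/></eventsRec>
        </clade>
      </clade>
    </phylogeny>
  </recGeneTree>
</recPhylo>
"""

def check_recphylo(root):
    """Check the top-level layout: at most one <spTree> and at least one
    <recGeneTree>, each gene tree being a rooted phylogeny."""
    problems = []
    if root.tag != "recPhylo":
        problems.append(f"root tag is <{root.tag}>, expected <recPhylo>")
    if len(root.findall("spTree")) > 1:
        problems.append("more than one <spTree>")
    gene_trees = root.findall("recGeneTree")
    if not gene_trees:
        problems.append("no <recGeneTree>")
    for i, tree in enumerate(gene_trees):
        phylogeny = tree.find("phylogeny")
        if phylogeny is None or phylogeny.get("rooted") != "true":
            problems.append(f"gene tree {i} is not a rooted phylogeny")
    return problems

doc = ET.fromstring(EXAMPLE)
print(check_recphylo(doc))  # []
```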

3 Availability

A detailed description of the recPhyloXML format, as well as a schema definition file (a file formally describing the format, used by many XML tools), is available at http://phylariane.univ-lyon1.fr/recphyloxml/. This website also presents a tool that can generate a visual representation of any reconciled tree or group of reconciled trees in the recPhyloXML format. The generated file is an SVG (.svg) file, which easily allows further manipulation, such as changing the color scheme. This tool was used to generate the basis for the figures showing reconciled gene trees in this manuscript. The recPhyloXML format has already been implemented as an output option in the reconciliation software ALE (Szöllősi et al.), in the Treerecs software (https://gitlab.inria.fr/Phylophile/Treerecs; this program corrects gene trees with a species tree using principles described in Noutahi and Lafond), in the reconciliation web server http://phylotoo2.appspot.com/ (Zheng and Zhang, 2017), in the genome simulation software Zombi (https://github.com/AADavin/ZOMBI) and, as both an input and an output option, in the adjacency history computing software DeCoSTAR (Duchemin et al.). Furthermore, scripts have been developed to convert the reconciliations produced by ecceTERA (Jacox et al.), NOTUNG (Durand et al.) and PrIME (Åkerborg et al.) into recPhyloXML, and a script for converting reconciliations produced by RANGER-DTL (Bansal et al.) is currently under development. Additional scripts are available to convert a recPhyloXML reconciled tree into the Newick format, count the different events represented in a recPhyloXML file, combine different files into one, or extract specific trees from a file. APIs have been written to import and export recPhyloXML for the C++ library Bio++ (Gueguen et al.) and for the Python libraries ETE3 (Huerta-Cepas et al.) and Biopython (Cock et al.). All these scripts and APIs are available at https://github.com/WandrilleD/recPhyloXML.
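As a rough idea of what an event-counting utility involves, the Python sketch below tallies event tags with the standard library. It is not the script distributed in the repository above: the function names are hypothetical, the sample document is invented, and the namespace URI is a placeholder. Stripping the namespace prefix makes the sketch work whether or not a file declares one.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def count_events(xml_text):
    """Count the event tags found inside every <eventsRec> element."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        if local(element.tag) == "eventsRec":
            counts.update(local(event.tag) for event in element)
    return counts

# Placeholder namespace, used only to show that prefixes are handled.
SAMPLE = """
<recPhylo xmlns="http://example.org/rec">
  <recGeneTree>
    <phylogeny rooted="true">
      <clade>
        <name>n1</name>
        <eventsRec><speciation speciesLocation="AB"/></eventsRec>
        <clade>
          <name>gene_a</name>
          <eventsRec><leaf speciesLocation="A"/></eventsRec>
        </clade>
        <clade>
          <name>LOST</name>
          <eventsRec><loss speciesLocation="B"/></eventsRec>
        </clade>
      </clade>
    </phylogeny>
  </recGeneTree>
</recPhylo>
"""

counts = count_events(SAMPLE)
print(dict(counts))  # {'speciation': 1, 'leaf': 1, 'loss': 1}
```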

4 Conclusion

With the growing number of available reconciliation models and pieces of software, it becomes crucial to be able to exchange and compare their results. recPhyloXML is a format that can accommodate many reconciliation features (dated or undated species trees; with or without lateral gene transfers). It relies on XML, a standard format for nested data that already has API libraries in many programming languages. We provide a detailed description of the recPhyloXML format on a website, along with a tool to visualize it. We designed the format to be flexible, so that extensions can be created to represent different forms of reconciliations. We plan future extensions of the format that would include a representation of the coalescent process that underlies ILS. recPhyloXML could also be extended to support gene conversion by a paralog or horizontal gene transfer with replacement.
  29 in total

1.  SylvX: a viewer for phylogenetic tree reconciliations.

Authors:  François Chevenet; Jean-Philippe Doyon; Celine Scornavacca; Edwin Jacox; Emmanuelle Jousselin; Vincent Berry
Journal:  Bioinformatics       Date:  2015-10-29       Impact factor: 6.937

2.  Bio++: efficient extensible libraries and tools for computational molecular evolution.

Authors:  Laurent Guéguen; Sylvain Gaillard; Bastien Boussau; Manolo Gouy; Mathieu Groussin; Nicolas C Rochette; Thomas Bigot; David Fournier; Fanny Pouyet; Vincent Cahais; Aurélien Bernard; Céline Scornavacca; Benoît Nabholz; Annabelle Haudry; Loïc Dachary; Nicolas Galtier; Khalid Belkhir; Julien Y Dutheil
Journal:  Mol Biol Evol       Date:  2013-05-21       Impact factor: 16.240

3.  Support measures to estimate the reliability of evolutionary events predicted by reconciliation methods.

Authors:  Thi-Hau Nguyen; Vincent Ranwez; Vincent Berry; Celine Scornavacca
Journal:  PLoS One       Date:  2013-10-04       Impact factor: 3.240

4.  Inferring incomplete lineage sorting, duplications, transfers and losses with reconciliations.

Authors:  Yao-Ban Chan; Vincent Ranwez; Céline Scornavacca
Journal:  J Theor Biol       Date:  2017-08-09       Impact factor: 2.691

5.  Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss.

Authors:  Mukul S Bansal; Eric J Alm; Manolis Kellis
Journal:  Bioinformatics       Date:  2012-06-15       Impact factor: 6.937

6.  DeCoSTAR: Reconstructing the Ancestral Organization of Genes or Genomes Using Reconciled Phylogenies.

Authors:  Wandrille Duchemin; Yoann Anselmetti; Murray Patterson; Yann Ponty; Sèverine Bérard; Cedric Chauve; Celine Scornavacca; Vincent Daubin; Eric Tannier
Journal:  Genome Biol Evol       Date:  2017-05-01       Impact factor: 3.416

7.  Lateral gene transfer from the dead.

Authors:  Gergely J Szöllosi; Eric Tannier; Nicolas Lartillot; Vincent Daubin
Journal:  Syst Biol       Date:  2013-01-25       Impact factor: 15.683

8.  Efficient exploration of the space of reconciled gene trees.

Authors:  Gergely J Szöllõsi; Wojciech Rosikiewicz; Bastien Boussau; Eric Tannier; Vincent Daubin
Journal:  Syst Biol       Date:  2013-08-06       Impact factor: 15.683

9.  PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships.

Authors:  Cuong Than; Derek Ruths; Luay Nakhleh
Journal:  BMC Bioinformatics       Date:  2008-07-28       Impact factor: 3.169

10.  SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees.

Authors:  Diego Mallo; Leonardo De Oliveira Martins; David Posada
Journal:  Syst Biol       Date:  2015-11-01       Impact factor: 15.683
