GraphML specializations to codify ancestral recombinant graphs.

James R McGill, Elizabeth A Walkup, Mary K Kuhner.

Abstract

Software which simulates, infers, or analyzes ancestral recombination graphs (ARGs) faces the problem of communicating them. Existing formats omit information either about the location of recombinations along the chromosome or about the position of recombinations relative to the branching topology. We present a specialization of GraphML, an XML-based standard for mathematical graphs, for communication of ARGs. The GraphML <node> type is specialized to contain the node type, time, recombination location, and name. The GraphML <edge> type is specialized to contain the ancestral material passed along that edge. This approach, which we call ArgML, retains all information in the original ARG. Because it builds on established formats, ArgML can be parsed, checked, and displayed by existing software.

Keywords:  ARG; Newick; XML; ancestral recombination graph; graphML

Year:  2013        PMID: 23967010      PMCID: PMC3735989          DOI: 10.3389/fgene.2013.00146

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


Introduction

Phylogenetic trees used to represent the histories of species or populations are usually communicated using the Newick format described in Olsen (1990). Ancestral recombinant graphs (ARGs) (Griffiths and Marjoram, 1997) are directed acyclic graphs which generalize phylogenetic trees to allow recombination. Griffiths' specification of the ARG gives both the time and branching structure associated with each recombination event and the ancestral material inherited along each branch. However, this has not been easy to accommodate within current formats for communicating phylogenies. Two approaches have been used. Interval-tree representations break the chromosome into non-recombining segments, specifying the Newick tree of each segment along with the segment boundaries. This approach is used by the ms program of Hudson (2002) and other data simulators, as it provides sufficient information for simulation of data on an ARG, but it loses information about the number, time, and topological location of recombination events. Directed-graph representations [used by Extended Newick (Cardona et al., 2008), among others; for a review see Arenas et al. (2010)] store the ARG as a directed graph with no specification of which material is inherited along each edge. This is useful in analysis of hybridization, but it loses information about which parts of the chromosome were inherited from each ancestor. While the NeXML standard (Vos et al., 2012) discusses the potential use of NeXML for ARGs, it does not specify the tags needed to add ancestral information, so it currently offers only the directed-graph representation. In this paper we propose a format based on the directed-graph approach but specifying the ancestral material inherited along each edge. All details of the ARG can be reconstructed from this format.
The GraphML standard (Brandes et al., 2001) was developed to codify graph structures in terms of nodes and edges. Tools such as Mathematica (Wolfram, 2003) and Gephi (Bastian et al., 2009) provide methods for reading and plotting GraphML files, though they display only connectivity as they have no concept of time ordering. Since GraphML is based on XML (Bray et al., 2008), GraphML files can be parsed and error-checked by XML-handling software. Thus, programs wishing to read or write GraphML can make use of existing XML libraries such as TinyXML (Thomason, 2013). Motivated by the need of our program LAMARC (Kuhner, 2006) to store and communicate ARGs, we have developed ArgML, a specialization of GraphML which adds time and ancestral material information. We propose it as a standard format for communicating ARGs between programs. ArgML files can be read directly by Mathematica (an example is shown in Figure A1) and will be read and written by an upcoming version of LAMARC.
Figure A1

Mathematica version of Figure 1.

Methods

To express coalescent times, node types, and sites transmitted, we leveraged GraphML's general-purpose node and edge annotation capability as follows. To the <node> tag we added four fields: the kind of node (Tip, Rec, Coal); the (optional) name of the node; the time of the node (relative to the time at the tips); and <rec_location>, the chromosomal location of the recombination represented by this node, if any. To the <edge> tag we added a field giving the ancestral material transmitted along that <edge>. The contents of this field are one or more entries of the form [firstsite:lastsite+1). This [x:y) notation is a standard convention for half-open intervals (e.g., Austern, 1999) and indicates that the first site of the interval is x and the last site is one site before y; site y itself is not included. If the ancestral material contains more than one discontinuous segment, this is written as [w:x) [y:z). These new keys are defined within the GraphML source file (see Appendix) and can be handled by an XML parser such as TinyXML (Thomason, 2013) without further intervention. Time information could be expressed either as a branch length (as in Newick format) or as a node time. We have found that the branch-length representation of a strict clocklike tree is prone to numerical precision issues, leading to violations of the clock when branch lengths are summed. Use of node times avoids this problem.
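As a rough illustration of how annotations of this kind can ride on GraphML's standard <key>/<data> mechanism, the Python sketch below parses a small hand-written fragment with the standard library's xml.etree.ElementTree. Only rec_location is a name taken from the text; the key ids node_type, node_time, and live_sites, the node ids, the edge direction, and the time values are assumptions made for this example, and the actual key definitions are those given in the Appendix.

```python
import xml.etree.ElementTree as ET

# GraphML's XML namespace; ElementTree needs it to match tag names.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical fragment: key ids other than rec_location, the node ids,
# the edge direction, and the time values are assumptions for illustration.
DOC = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="node_type" for="node" attr.name="node_type" attr.type="string"/>
  <key id="node_time" for="node" attr.name="node_time" attr.type="double"/>
  <key id="rec_location" for="node" attr.name="rec_location" attr.type="int"/>
  <key id="live_sites" for="edge" attr.name="live_sites" attr.type="string"/>
  <graph id="arg" edgedefault="directed">
    <node id="n6">
      <data key="node_type">Rec</data>
      <data key="node_time">0.25</data>
      <data key="rec_location">17</data>
    </node>
    <node id="n8">
      <data key="node_type">Coal</data>
      <data key="node_time">0.5</data>
    </node>
    <edge source="n8" target="n6">
      <data key="live_sites">[17:21)</data>
    </edge>
  </graph>
</graphml>
"""


def data_fields(element):
    """Collect the <data key="..."> children of a node or edge into a dict."""
    return {
        d.get("key"): (d.text or "").strip()
        for d in element.findall("g:data", NS)
    }


def read_graph(xml_text):
    """Return (nodes, edges) with their annotations as plain dictionaries."""
    root = ET.fromstring(xml_text)
    graph = root.find("g:graph", NS)
    nodes = {n.get("id"): data_fields(n) for n in graph.findall("g:node", NS)}
    edges = [
        (e.get("source"), e.get("target"), data_fields(e))
        for e in graph.findall("g:edge", NS)
    ]
    return nodes, edges


nodes, edges = read_graph(DOC)
print(nodes["n6"])
print(edges[0])
```

Because the annotations are ordinary GraphML data elements, a generic XML library is enough to read them; all values arrive as strings, so a real reader would still convert times and locations to numbers.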

Limitations

We assume that an ARG is time-ordered and clocklike. Non-clocklike trees are difficult to use in the ARG context, as time information is needed to distinguish the lineages contributing to a recombination from the resulting recombinant lineage. Therefore, violation of the molecular clock in an ARG is best represented by a multiplier on the time-based branch length, not by a mutation-based non-clocklike branch length. Such a multiplier could readily be added to ArgML. Users of ArgML should be aware that the clock requirement cannot be checked by GraphML parsers and should be checked by special-purpose code in programs reading or writing ARGs. We also assume that the ARG is fully specified with the locations of all recombinations. Graphs without locational information can arise from hybridization where the contribution of each parental species to the hybrid is not known. They could be straightforwardly coded in GraphML but will not be substitutable for ARGs in most applications (for example, an ARG can be decomposed into interval trees, whereas a hybridization graph cannot). The ArgML format does not represent gene conversion or multiple crossovers in the same meiosis. These events could be coded as two or more recombinations occurring at the same time, although this would impose a fictitious ordering among what are actually components of the same event. Currently no tool exists to display time-ordered ArgML trees. It is to be hoped that someone will create such a tool in the future.
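A program reading or writing ARGs could enforce the time-ordering requirement with a small check of its own. The Python sketch below is one hedged way to do so, assuming node times are held in a dictionary and each edge is stored as an (ancestor, descendant) pair. The function name time_order_violations, the direction convention, and the sample times are assumptions of this example rather than part of ArgML. Equal times are accepted because the text allows recombinations coded as occurring at the same time.

```python
def time_order_violations(node_times, edges):
    """Return the edges whose ancestor is younger than its descendant.

    node_times maps node id -> time measured backward from the tips.
    edges is an iterable of (ancestor, descendant) pairs (assumed direction).
    Equal times are allowed, so only strictly reversed edges are reported.
    """
    bad = []
    for ancestor, descendant in edges:
        if node_times[ancestor] < node_times[descendant]:
            bad.append((ancestor, descendant))
    return bad


# Hypothetical times: the tip is at 0 and deeper nodes have larger values.
times = {"tip": 0.0, "rec": 0.25, "coal": 0.5}
ok_edges = [("coal", "rec"), ("rec", "tip")]
reversed_edges = [("rec", "coal")]

print(time_order_violations(times, ok_edges))
print(time_order_violations(times, reversed_edges))
```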

Example

Consider the following time-ordered ARG (Figure 1). The tips are labeled 2, 3, and 4, the root is 1, the coalescences are 7, 8, 9, and 10, and the recombinations are 5 and 6. Ancestral material transmitted along each edge is indicated. There are 20 sites in the ancestral material. Recombinations occur at the link between two sites, and there cannot be links before the first site or after the last; therefore there are 19 links.
Figure 1

A recombinant graph.

Thus, recombination 6 above is defined by a <node> entry giving its node type (Rec), its time, and its <rec_location>. The ancestral material transmitted between nodes 8 and 6 above is expressed as [17:21), a half-open interval read as “the segment that begins at site 17 and ends before site 21.” It therefore contains sites 17, 18, 19, and 20 and the links between them. Similarly, [10:17) contains sites 10–16 and their connecting links. To maintain consistency with this half-open interval notation, the <rec_location> of a recombination that falls between sites 16 and 17 is numbered 17 and can be thought of as being “before” site 17. Note in the figure that two discontinuous segments are transmitted between nodes 8 and 9. This is expressed by listing both intervals in the same field, in the form [w:x) [y:z).
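The half-open notation is simple to process mechanically. The following Python sketch, using the hypothetical helper names parse_segments and sites_in, turns such a string into (first, end) pairs and lists the sites covered. The two-segment string in the last line is a made-up value chosen for illustration, not the one shown in the figure.

```python
import re

# Matches one "[first:end)" entry; whitespace inside the brackets is tolerated.
_SEGMENT = re.compile(r"\[\s*(\d+)\s*:\s*(\d+)\s*\)")


def parse_segments(text):
    """Parse '[a:b) [c:d) ...' into a list of (first, end) integer pairs."""
    segments = [(int(a), int(b)) for a, b in _SEGMENT.findall(text)]
    if not segments:
        raise ValueError(f"no [first:end) segments found in {text!r}")
    for first, end in segments:
        if first >= end:
            raise ValueError(f"empty or reversed segment [{first}:{end})")
    return segments


def sites_in(text):
    """List every site covered; 'end' itself is excluded (half-open)."""
    return [s for first, end in parse_segments(text) for s in range(first, end)]


print(sites_in("[17:21)"))              # [17, 18, 19, 20]
print(sites_in("[10:17)"))              # [10, 11, 12, 13, 14, 15, 16]
print(parse_segments("[1:5) [17:21)"))  # hypothetical two-segment value
```

Storing the end as "last site plus one" means adjacent segments such as [10:17) and [17:21) share a boundary value, which is also the number assigned to the recombination between them.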

Conclusions

ArgML augments the well-established GraphML format with all of the information needed to transmit ARGs. A full ARG identical to the original can be drawn from the ArgML representation even if multiple recombinations occurred at the same inter-site link. This specialization allows users to leverage the numerous existing tools that already understand GraphML. Further information needed for handling of ARGs could be readily added to the standard.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1.  Generating samples under a Wright-Fisher neutral model of genetic variation.

Authors:  Richard R Hudson
Journal:  Bioinformatics       Date:  2002-02       Impact factor: 6.937

2.  LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters.

Authors:  Mary K Kuhner
Journal:  Bioinformatics       Date:  2006-01-12       Impact factor: 6.937

3.  Characterization of phylogenetic networks with NetTest.

Authors:  Miguel Arenas; Mateus Patricio; David Posada; Gabriel Valiente
Journal:  BMC Bioinformatics       Date:  2010-05-20       Impact factor: 3.169

4.  NeXML: rich, extensible, and verifiable representation of comparative data and metadata.

Authors:  Rutger A Vos; James P Balhoff; Jason A Caravas; Mark T Holder; Hilmar Lapp; Wayne P Maddison; Peter E Midford; Anurag Priyam; Jeet Sukumaran; Xuhua Xia; Arlin Stoltzfus
Journal:  Syst Biol       Date:  2012-02-22       Impact factor: 15.683

5.  Extended Newick: it is time for a standard representation of phylogenetic networks.

Authors:  Gabriel Cardona; Francesc Rosselló; Gabriel Valiente
Journal:  BMC Bioinformatics       Date:  2008-12-15       Impact factor: 3.169

