
HAL: a hierarchical format for storing and analyzing multiple genome alignments.

Glenn Hickey, Benedict Paten, Dent Earl, Daniel Zerbino, David Haussler.

Abstract

MOTIVATION: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. However, current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance.
RESULTS: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover).
AVAILABILITY: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal.
CONTACT: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2013 PMID: 23505295 PMCID: PMC3654707 DOI: 10.1093/bioinformatics/btt128

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

A DNA (or protein) sequence alignment groups together positions in the sequences it contains that are homologous (related by descent). Positions within the same sequence can be homologous via duplication events. The multiple alignment problem is NP-hard, but tools have been developed to produce large (tens of thousands of sequences), accurate alignments (Notredame, 2007), provided the input sequences are relatively short and conserved, such as gene exons. Whole-genome alignment is much more difficult, not only because of the increased length of the input but also because of the presence of large spans of non-conserved sequence (Blanchette et al., 2004; Paten et al., 2011). Changes because of large-scale rearrangement events, such as inversions, segmental duplications and transpositions, must also be taken into account in addition to point mutations and small insertions and deletions (indels).

These challenges of creating whole-genome alignments carry over to their representation and analysis. Gapped matrices that are traditionally used for gene alignments become fragmented into blocks by rearrangements or excessive divergence. These blocks are stored by current formats as ASCII lines containing coordinates and DNA strings for each sequence in the alignment. As blocks can only be ordered with respect to a single reference row, performing disk-based queries using non-reference coordinates is extremely inefficient because of fragmentation, even if external indexes were to be constructed for them. The hierarchical alignment (HAL) graph structure and tool set described later in the text were designed to address this issue, while adding support for file compression.
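The cost of querying by non-reference coordinates can be illustrated with a small Python sketch. It is not a MAF parser: the toy `blocks` list and the functions `query_reference` and `query_other` are hypothetical names introduced only for this illustration. Blocks sorted by reference start can be searched by bisection, whereas the same blocks are in arbitrary order with respect to the other genome and have to be scanned.

```python
from bisect import bisect_right

# Toy alignment blocks: (ref_start, other_start, length), sorted by ref_start.
# Rearrangements leave the other genome's coordinates in arbitrary order.
blocks = [(0, 500, 100), (100, 0, 100), (200, 900, 100), (300, 250, 100)]
ref_starts = [b[0] for b in blocks]


def query_reference(pos):
    """Find the block covering a reference position by binary search."""
    i = bisect_right(ref_starts, pos) - 1
    if i >= 0 and pos < blocks[i][0] + blocks[i][2]:
        return i
    return None


def query_other(pos):
    """Find the block covering a non-reference position by linear scan."""
    for i, (_, other_start, length) in enumerate(blocks):
        if other_start <= pos < other_start + length:
            return i
    return None


ref_hit = query_reference(150)   # block index found by bisection
other_hit = query_other(950)     # block index found only after scanning
```

A secondary index on the other genome would avoid the scan in memory, but the blocks it points to would still be scattered across the file, which is the fragmentation problem described above.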

2 METHODS

2.1 HAL format

HAL’s design was guided by two observations: (i) breakpoint graphs are the most natural way of representing genome rearrangements (Pevzner and Tesler, 2003) and (ii) progressive alignment (based on a phylogenetic decomposition) has been the most successful heuristic for multiple sequence alignment (Notredame, 2007) and is likely to remain so for whole-genome alignment. A HAL graph, therefore, decomposes a multiple alignment into a set of pairwise alignments, which are represented as breakpoint graphs. Each pairwise alignment corresponds to a branch of a rooted phylogenetic tree. In the absence of a tree, a reference genome can be used as a root with all other genomes as leaves.

A genome in HAL is represented by up to three arrays (Fig. 1A): a sequence array, a top segment array if the genome has an ancestor in the tree and a bottom segment array if the genome has one or more descendants in the tree. For each branch in the phylogenetic tree, edges in the HAL graph connect bottom segments from the ancestral genome to top segments in the descendant genome (Fig. 1B). These edges define the pairwise alignments between the ancestral genome and each of its descendants. The amount of segmentation along a branch is, therefore, determined by the number of unique breakpoints in these pairwise alignments, including those induced by indels. Each top (resp. bottom) segment S is assigned a parse edge connecting it to the bottom (resp. top) segment of the same genome that overlaps the first base of S on the DNA sequence. Paralogous (duplicated) regions of the genome are represented by sets of top segments that share an ancestor. Inversions are represented as flags on edges between segments. Chromosomes or scaffolds are represented as contiguous subranges of the genome.

Fig. 1.

(A) A single genome as represented in HAL. Two sequences are stored in an array of DNA characters and are segmented with respect to its parent (top segments) and children (bottom segments). (B) The same genome in the context of a HAL graph of five genomes. The dashed edge corresponds to an inversion event.

Provided the graph is in memory, all segments and DNA bases can be looked up by their array index, and all edges can be traversed, in O(1) time. Locating the segment that contains a particular DNA position within a given chromosome requires O(log m) time, where m is the number of segments. Heuristics are used to further speed this operation up in practice. An arbitrary amount of metadata can be added to each genome. Missing data are stored as segments of ‘N’ characters with no edges. Supplementary Sections S1 and S3 describe how HAL offers similar compression to gzip and vastly reduced query times when compared with Multiple Alignment Format (MAF).
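A position lookup of this kind can be done by binary search over sorted segment start coordinates, which takes time logarithmic in the number of segments. The sketch below assumes that segments tile the sequence without gaps and that the first segment starts at 0. The function name `find_segment` is hypothetical, and the heuristics mentioned above are not shown.

```python
from bisect import bisect_right


def find_segment(starts, total_length, pos):
    """Return the index of the segment containing pos.

    starts: ascending start coordinates of contiguous segments, starts[0] == 0.
    total_length: length of the sequence covered by the segments.
    """
    if not 0 <= pos < total_length:
        raise IndexError("position outside sequence")
    # bisect_right counts the starts that are <= pos; the last of them
    # belongs to the segment that contains pos.
    return bisect_right(starts, pos) - 1


starts = [0, 4, 9, 15]   # four segments covering a 20-base sequence
hits = [find_segment(starts, 20, p) for p in (0, 3, 4, 14, 19)]
```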

2.2 HAL API

A comprehensive C++ API is provided to create and query HAL files, which are presently stored in Hierarchical Data Format (HDF5). HDF5 is a longtime standard, supported on all major platforms, for storing large matrix data and is optimized for efficient indexing, caching and compression (The HDF5 Group, 2000–2010). Data within HDF5 can be quickly and randomly indexed, and decompression and caching are abstracted from the user. This allows for efficient external memory algorithm design, a requirement for multi-genome analysis. We note that different back ends could be added in the future. HAL graphs are accessed by segment iterators, which traverse the graph through its native structure of segments and edges, and column iterators, which dynamically transpose the graph (or desired subgraph) to a traditional matrix block/format to iterate across alignment columns.
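The idea behind column iteration can be illustrated with a toy Python generator that transposes one pairwise segment alignment into alignment columns. It is only a sketch and does not reproduce the C++ iterators: the function `columns` and the edge tuple layout `(parent_start, child_start, length, inverted)` are assumptions made for this example, and gaps, duplications and multi-genome traversal are ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def columns(parent_seq, child_seq, edges):
    """Yield (parent_pos, parent_base, child_pos, child_base) for each column.

    edges: (parent_start, child_start, length, inverted) tuples, one per
    aligned pair of segments. Inverted edges are read on the reverse strand.
    """
    for p_start, c_start, length, inverted in edges:
        for offset in range(length):
            p_pos = p_start + offset
            if inverted:
                # Walk the child segment backwards and complement each base.
                c_pos = c_start + length - 1 - offset
                c_base = child_seq[c_pos].translate(COMPLEMENT)
            else:
                c_pos = c_start + offset
                c_base = child_seq[c_pos]
            yield p_pos, parent_seq[p_pos], c_pos, c_base


parent = "ACGTGA"
child = "ACGTTC"   # last two bases are the reverse complement of "GA"
edges = [(0, 0, 4, False), (4, 4, 2, True)]
cols = list(columns(parent, child, edges))
```

Here every column pairs identical bases once the inverted segment is read on the opposite strand, and the child positions of the last two columns run in descending order.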

2.3 HAL tools

A set of command line utilities, summarized in Table 1, is provided to create and analyze multiple genome alignments in HAL. Importers are provided for UCSC’s MAF, which is a standard with its own rich set of filters and converters (e.g. to FASTA) (Blanchette et al., 2004), and for Cactus (Paten et al., 2011), which has been designed specifically to output HAL. MAF files can be quickly produced from HAL graphs for given subgraphs with respect to arbitrary references, for compatibility with existing browsers and tools. The memory usage of each tool is configurable via its command line options.

Table 1.

HAL tools summary

Tool	Description
halStats	Print summary statistics of HAL file
halSummarizeMutations	Print mutation summary for given subgraph
halBranchMutations	Generate BED file(s) of mutations for a branch
halLiftover	Map BED coordinates between genomes
hal2maf/maf2hal	Convert to and from MAF
cactus2hal	Convert from Cactus

Mutations can be identified along branches and output to tab-delimited annotation files using the halBranchMutations tool. A cycle decomposition of the breakpoint graph structure allows rearrangements, such as duplications, inversions and transpositions, to be reported in addition to substitutions, insertions and deletions. Small indels (determined by a provided threshold) can be nested within larger rearrangements to avoid overcounting in these cases. Patterns of conservation within a target sequence can be aggregated using the halLiftover tool, which maps coordinates in a BED file to an arbitrary target in the alignment. This utility provides a general strategy to efficiently liftover and project any comparative genomics information into the coordinate system of any reference genome. Excellent software packages are available for sorting, combining and querying BED files (Quinlan and Hall, 2010; Neph et al., 2012) and can be combined with the aforementioned tools to create powerful analysis pipelines for multiple genome alignments.
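Coordinate mapping across a single pairwise alignment can be sketched in a few lines of Python. This is not the halLiftover implementation, which works on the HAL graph: the function `liftover` and the block tuple layout `(src_start, tgt_start, length, inverted)` are hypothetical, and intervals are half-open as in BED.

```python
def liftover(interval, blocks):
    """Map a half-open (start, end) source interval through aligned blocks.

    blocks: (src_start, tgt_start, length, inverted) tuples.
    Returns a list of (tgt_start, tgt_end, strand), one per overlapped block.
    """
    start, end = interval
    mapped = []
    for src_start, tgt_start, length, inverted in blocks:
        lo = max(start, src_start)
        hi = min(end, src_start + length)
        if lo >= hi:
            continue  # no overlap with this block
        if inverted:
            # Source offset k maps to target offset length - 1 - k,
            # so the interval is mirrored within the block.
            t_lo = tgt_start + (src_start + length - hi)
            t_hi = tgt_start + (src_start + length - lo)
            mapped.append((t_lo, t_hi, "-"))
        else:
            shift = tgt_start - src_start
            mapped.append((lo + shift, hi + shift, "+"))
    return mapped


blocks = [(0, 100, 10, False), (10, 200, 10, True)]
result = liftover((5, 15), blocks)
```

The source interval spans both blocks, so it is split into one forward-strand piece and one reverse-strand piece. Unaligned source bases simply produce no output.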

3 CONCLUSION

We have presented HAL, a data format, API and set of tools for storing and analyzing genome alignments and ancestral reconstructions. The key features of HAL are its indexing, which allows fast coordinate mapping between arbitrary subsets of genomes, and its graph structure, which facilitates analysis of genome rearrangements, as well as modular decomposition into clades. The compression and chunking capabilities of HDF5 are leveraged to keep I/O and memory usage to a minimum. All of these properties, in particular the ability to parallelize by clade, will be necessary for alignments that arise from current large-scale sequencing projects, such as Genome 10K (Haussler et al., 2009) (more details in Supplementary Section S2).

ACKNOWLEDGEMENT

We thank the reviewers for their valuable suggestions, as well as our sources of funding.

Funding: California Institute for Quantitative Biosciences (to G.H.). A Data Analysis Center for the Encyclopedia of DNA Elements 5U01HG004695 (NHGRI/NIH), and gift funds from Dr and Mrs Gordon Ringold (to B.P.). Howard Hughes Medical Institute (to D.H.).

Conflict of Interest: none declared.

REFERENCES

1. Aligning multiple genomic sequences with the threaded blockset aligner.

Authors: Mathieu Blanchette; W James Kent; Cathy Riemer; Laura Elnitski; Arian F A Smit; Krishna M Roskin; Robert Baertsch; Kate Rosenbloom; Hiram Clawson; Eric D Green; David Haussler; Webb Miller
Journal: Genome Res Date: 2004-04 Impact factor: 9.043

2. BEDOPS: high-performance genomic feature operations.

Authors: Shane Neph; M Scott Kuehn; Alex P Reynolds; Eric Haugen; Robert E Thurman; Audra K Johnson; Eric Rynes; Matthew T Maurano; Jeff Vierstra; Sean Thomas; Richard Sandstrom; Richard Humbert; John A Stamatoyannopoulos
Journal: Bioinformatics Date: 2012-05-09 Impact factor: 6.937

3. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species.

Authors:
Journal: J Hered Date: 2009-11-05 Impact factor: 2.645

4. Cactus: Algorithms for genome multiple sequence alignment.

Authors: Benedict Paten; Dent Earl; Ngan Nguyen; Mark Diekhans; Daniel Zerbino; David Haussler
Journal: Genome Res Date: 2011-06-10 Impact factor: 9.043

5. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

6. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes.

Authors: Pavel Pevzner; Glenn Tesler
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

7. Recent evolutions of multiple sequence alignment algorithms.

Authors: Cédric Notredame
Journal: PLoS Comput Biol Date: 2007-08 Impact factor: 4.475
