The Genomedata format for storing large-scale functional genomics data.

Michael M Hoffman, Orion J Buske, William Stafford Noble.

Abstract

SUMMARY: We present a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. We have also developed utilities to load data into this format. We show that retrieving data from this format is more than 2900 times faster than a naive approach using wiggle files.
AVAILABILITY AND IMPLEMENTATION: Reference implementation in Python and C components available at http://noble.gs.washington.edu/proj/genomedata/ under the GNU General Public License.

Year: 2010 PMID: 20435580 PMCID: PMC2872006 DOI: 10.1093/bioinformatics/btq164

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The advent of functional genomics assays based on next-generation sequencing (Brunner et al., 2009; Hesselberth et al., 2009; Park, 2009; Wold and Myers, 2008) finally allows the high-throughput acquisition of data at 1-bp resolution across entire genomes. Processing this information, however, poses a challenge several orders of magnitude beyond that of previous genomic analyses and demands new techniques for efficient operation. We introduce the Genomedata format for genome-scale numerical data, which uses an HDF5 (Hierarchical Data Format; http://hdfgroup.org/HDF5/) container for efficient random access to huge genomic datasets. We also provide a Python interface to this format.

Traditional data interchange formats such as the wiggle (http://genome.ucsc.edu/goldenPath/help/wiggle.html) and bedGraph (http://genome.ucsc.edu/goldenPath/help/bedgraph.html) formats provide excellent means of disseminating genome-wide datasets but suffer from several disadvantages in the repeated processing of these data. Storing numerical data as ASCII text is inefficient and impedes random access to the data. This problem becomes even more apparent when processing the data in scripting languages such as Python and R, which provide high-performance methods for bulk numerical operations on arrays, but no method for quickly reading in data in interchange formats. It is also necessary to validate these data before use, checking that there is exactly one data point per position and that data are not defined outside the boundaries of the underlying sequence. Genomedata provides an intermediate format and off-loads the frustrations of parsing and validating the data from the analysis programmer.
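
The validation step described above can be illustrated with a minimal Python sketch. It assumes bedGraph-style, 0-based, half-open intervals `(start, end, value)` for a single chromosome, and it treats uncovered positions as missing data rather than as errors. The function name `validate_intervals` and its error messages are hypothetical and are not part of the Genomedata package.

```python
def validate_intervals(intervals, chrom_length):
    """Check bedGraph-style (start, end, value) intervals for one chromosome.

    Intervals are 0-based and half-open: [start, end). Raises ValueError if
    an interval falls outside the sequence, is empty or reversed, or overlaps
    another interval (which would give a position more than one data point).
    """
    prev_end = 0
    # Sort by coordinates only, so that values never take part in comparisons.
    for start, end, _value in sorted(intervals, key=lambda iv: (iv[0], iv[1])):
        if start < 0 or end > chrom_length:
            raise ValueError(f"interval {start}-{end} lies outside the sequence")
        if start >= end:
            raise ValueError(f"interval {start}-{end} is empty or reversed")
        if start < prev_end:
            raise ValueError(f"position {start} is defined more than once")
        prev_end = end
    return True
```

Genomedata is designed so that this kind of check happens once, when the data are loaded, rather than in every analysis script.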

The format also provides the conveniences of an application programming interface for reading a binary file format, akin to the programmatic access to sequence and alignment data provided by BAM (Li et al., 2009) and BioHDF (Mason et al., 2010), while, like bigWig (Rhead et al., 2010), being suited for dense numeric data.

In many workflows, Genomedata allows the user to parse, validate and convert the data into a binary format once, eliminating the computational expense of doing this repeatedly. The data are stored as 32-bit IEEE floating-point numbers so that minimal processing is needed when loading them into memory. Not-a-number (NaN) entries are used where data are missing or unassigned. HDF5 transparently breaks the data into chunks aligned with data columns, which minimizes work during loading. Genomedata compresses these chunks when stored on disk to save space, especially when values are repeated within a column, but in a way that still facilitates efficient random access. We also store some metadata in the archive so that simple summary statistics may be accessed quickly.

To ease the memory requirements of subsequent analysis, Genomedata may optionally break chromosomes into ‘supercontigs,’ which avoid the allocation of empty space in the observation matrix at large assembly gaps (by default, >100 000 bp). This is not necessary for efficient performance on disk, but it is convenient for programmers who wish to process the whole genome. The reference implementation includes several programs for loading data. The software requires Python 2.5.1, HDF5 1.8 and PyTables 2.1.
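
The supercontig idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: assembly gaps are given as 0-based, half-open `(start, end)` intervals, and the function name `supercontig_ranges` and the parameter `min_gap` are hypothetical rather than taken from the Genomedata implementation. Only the default threshold of 100 000 bp comes from the text.

```python
def supercontig_ranges(chrom_length, gaps, min_gap=100000):
    """Return half-open (start, end) ranges left after cutting out large gaps.

    gaps: iterable of half-open (start, end) assembly-gap intervals.
    Only gaps longer than min_gap split the chromosome. Shorter gaps stay
    inside a supercontig, where their positions would simply hold NaN.
    """
    ranges = []
    cursor = 0  # start of the supercontig currently being built
    for gap_start, gap_end in sorted(gaps):
        if gap_end - gap_start > min_gap:
            if gap_start > cursor:
                ranges.append((cursor, gap_start))
            cursor = max(cursor, gap_end)
    if cursor < chrom_length:
        ranges.append((cursor, chrom_length))
    return ranges


# Example: one large gap splits a 1 Mbp chromosome into two supercontigs.
example = supercontig_ranges(
    1000000, [(0, 10000), (200000, 350000), (600000, 600050)]
)
print(example)  # [(0, 200000), (350000, 1000000)]
```

An analysis that loads one supercontig at a time never has to allocate rows for the 150 000 bp gap in this example.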

2 USING GENOMEDATA

Genomedata supplies command-line utilities that make it easy to create archives and load data. The genomedata-load command loads the genome sequence and a number of tracks in wiggle, BED or bedGraph formats, and stores metadata that allow one to rapidly calculate summary statistics such as the minimum, maximum, mean or standard deviation. The package also contains utilities to complete only parts of the loading process so that one may load tracks for different chromosomes in parallel.

It is easy to access data in a Genomedata archive using the supplied Python interface. A programmer may retrieve a matrix of data by specifying individual coordinate ranges to the Genomedata interface. Alternatively, one can iterate through the entire dataset chromosome by chromosome. Programmers can easily accomplish tasks such as reporting the average data value in a number of tracks for specified genomic regions, allowing a greater focus on more interesting areas of analysis.
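
As an illustration of the region-averaging task mentioned above, the following sketch computes per-track means over a list of regions. It does not use the Genomedata Python interface. Instead it assumes a hypothetical in-memory stand-in for an archive: a dict that maps a chromosome name to a list of rows, with one row per base pair and one float per track, and NaN marking missing data. The function name `region_means` is also hypothetical.

```python
import math


def region_means(archive, regions):
    """Average each track over each (chrom, start, end) region, skipping NaN.

    archive: hypothetical stand-in for a Genomedata archive, a dict mapping a
    chromosome name to a list of rows (one row per base pair, one float per
    track). Regions are 0-based and half-open.
    Returns one list of per-track means per region (NaN where no data exist).
    """
    results = []
    for chrom, start, end in regions:
        rows = archive[chrom][start:end]
        n_tracks = len(archive[chrom][0])
        means = []
        for track in range(n_tracks):
            values = [row[track] for row in rows if not math.isnan(row[track])]
            means.append(sum(values) / len(values) if values else float("nan"))
        results.append(means)
    return results


nan = float("nan")
toy_archive = {"chr1": [[1.0, 10.0], [3.0, nan], [5.0, 30.0]]}
toy_means = region_means(toy_archive, [("chr1", 0, 2), ("chr1", 1, 3)])
print(toy_means)  # [[2.0, 10.0], [4.0, 30.0]]
```

With a real archive, the slicing step would be replaced by a coordinate-range query to the Genomedata interface, and the rest of the logic would stay the same.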

3 PERFORMANCE

Genomedata can quickly load large amounts of data. We measured the time to load a Genomedata archive with the complete human genome sequence (build NCBI36) and from 1 to 11 ChIP-seq data tracks on a 2.33-GHz Intel Xeon E5345 processor, and performed a linear regression on the timing results with the statistical computing environment R. This yielded a model with coefficient of determination R² = 0.98, in which loading the sequence and other constant overhead took (5.0 ± 2.5) × 10³ s, and each track took an additional (7.5 ± 0.4) × 10³ s.

One may retrieve functional genomics data from Genomedata archives much more quickly than from the text-based formats commonly used for these data. We measured the time to retrieve data from a whole-genome 1-bp-resolution DNase-seq data track at each of a randomly generated list of genomic positions using a method that accessed the original gzip-compressed wiggle file and two different methods that accessed a Genomedata archive loaded from that file (Fig. 1). The offline (sequential access) wiggle algorithm first sorts the list and then iterates through the original wiggle files until it finds the specified positions. The offline Genomedata algorithm works in a similar way, but iterates through a Genomedata archive instead. The online (random access) Genomedata algorithm retrieves the data at each position in the random order specified by the list. We repeated this process with nine different list sizes to examine the dependence of retrieval time on the number of positions.
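
The difference between the offline and online strategies can be sketched as follows. This is a simplified illustration, not the benchmark code: a plain Python list stands in for the data track, enumerating it stands in for streaming through a file or archive, and the function names `retrieve_online` and `retrieve_offline` are hypothetical. All requested positions are assumed to lie within the track.

```python
def retrieve_online(track, positions):
    """Random access: look up each position directly, in the order given."""
    return [track[pos] for pos in positions]


def retrieve_offline(track, positions):
    """Sequential access: sort the requests, then make one pass over the data.

    The scan visits positions in increasing order and picks up each requested
    value as it goes by, so the data are never read out of order.
    """
    wanted = sorted(set(positions))
    found = {}
    next_idx = 0
    for pos, value in enumerate(track):  # stands in for streaming a file
        if next_idx == len(wanted):
            break  # every requested position has been seen
        if pos == wanted[next_idx]:
            found[pos] = value
            next_idx += 1
    # Report values in the order in which they were originally requested.
    return [found[pos] for pos in positions]


toy_track = [0.0, 1.5, 2.5, 3.5, 4.5]
toy_positions = [3, 0, 3, 4]
print(retrieve_online(toy_track, toy_positions))   # [3.5, 0.0, 3.5, 4.5]
print(retrieve_offline(toy_track, toy_positions))  # [3.5, 0.0, 3.5, 4.5]
```

Both functions return the same values. They differ only in access pattern: the online version costs one lookup per position, whereas the offline version costs roughly one scan of the data whenever the requested positions are spread across the whole track.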

Fig. 1.

Scatter plot of the time to retrieve data from a list of random genomic positions against the number of positions for different algorithms. Each point represents the average run time of the last three of four sequential trials (to eliminate caching effects) with a specific algorithm and a particular list of random positions. We used three different random lists of nine different sizes on three different algorithms, resulting in 81 plotted data points. The wiggle (circles) and offline Genomedata (crosses) algorithms ran in approximately constant time for greater than 100 positions, averaging 140 000 s (39 h) and 48 s, respectively. The online Genomedata algorithm (triangles) ran in approximately linear time for greater than 1000 random positions, averaging 1.7 ms per random access.

Because the offline algorithms read data sequentially rather than randomly, their run times are mostly independent of the number of genomic positions. After creation of the Genomedata archive, the offline Genomedata algorithm ran 2900 times faster than the comparable offline wiggle approach, suggesting a considerable advantage for the use of Genomedata when repeatedly accessing a dataset. Even when including the one-time cost of creating the archive (4 h), the Genomedata approach still ran 10 times faster, because we wrote the Genomedata track loader in C. The advantage for an online Genomedata approach is even greater when retrieving fewer than ∼10 000 positions at once. Genomedata is especially suited for whole-genome, dense datasets, so it has less of a comparative advantage in cases of sparse datasets with data at only a limited number of genomic positions. Genomedata should still perform as well, however, in an absolute sense. Not only does using Genomedata improve performance, but it also makes programming against this type of data easier, resulting in less boilerplate code for data retrieval.
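
The two speed-up figures quoted above follow directly from the average run times reported for Fig. 1. The short calculation below reproduces them. The constant names are chosen here for illustration, and the inputs are the rounded values from the text, so the results are approximate.

```python
WIGGLE_OFFLINE_S = 140000.0      # average offline wiggle run time (Fig. 1)
GENOMEDATA_OFFLINE_S = 48.0      # average offline Genomedata run time (Fig. 1)
ARCHIVE_CREATION_S = 4 * 3600.0  # one-time cost of creating the archive (4 h)

# Speed-up once the archive already exists.
speedup = WIGGLE_OFFLINE_S / GENOMEDATA_OFFLINE_S

# Speed-up when the one-time archive creation is charged to a single query.
speedup_with_creation = WIGGLE_OFFLINE_S / (
    ARCHIVE_CREATION_S + GENOMEDATA_OFFLINE_S
)

print(round(speedup))                   # 2917, i.e. roughly 2900
print(round(speedup_with_creation, 1))  # 9.7, i.e. roughly 10
```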

According to SLOCCount (http://www.dwheeler.com/sloccount/), which counts the physical source lines of code in a program, it took 70 source lines of code to implement the wiggle method, but only 44 (37% fewer) to implement the offline Genomedata method and 16 (77% fewer) to implement the online Genomedata method.

Funding: National Institutes of Health (HG004695).

Conflict of Interest: none declared.

REFERENCES

1. Standardizing the next generation of bioinformatics software development with BioHDF (HDF5).

Authors: Christopher E Mason; Paul Zumbo; Stephan Sanders; Mike Folk; Dana Robinson; Ruth Aydt; Martin Gollery; Mark Welsh; N Eric Olson; Todd M Smith
Journal: Adv Exp Med Biol Date: 2010 Impact factor: 2.622

2. Sequence census methods for functional genomics.

Authors: Barbara Wold; Richard M Myers
Journal: Nat Methods Date: 2007-12-19 Impact factor: 28.547

3. Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver.

Authors: Alayne L Brunner; David S Johnson; Si Wan Kim; Anton Valouev; Timothy E Reddy; Norma F Neff; Elizabeth Anton; Catherine Medina; Loan Nguyen; Eric Chiao; Chuba B Oyolu; Gary P Schroth; Devin M Absher; Julie C Baker; Richard M Myers
Journal: Genome Res Date: 2009-03-09 Impact factor: 9.043

4. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

5. ChIP-seq: advantages and challenges of a maturing technology.

Authors: Peter J Park
Journal: Nat Rev Genet Date: 2009-09-08 Impact factor: 53.242

6. The UCSC Genome Browser database: update 2010.

Authors: Brooke Rhead; Donna Karolchik; Robert M Kuhn; Angie S Hinrichs; Ann S Zweig; Pauline A Fujita; Mark Diekhans; Kayla E Smith; Kate R Rosenbloom; Brian J Raney; Andy Pohl; Michael Pheasant; Laurence R Meyer; Katrina Learned; Fan Hsu; Jennifer Hillman-Jackson; Rachel A Harte; Belinda Giardine; Timothy R Dreszer; Hiram Clawson; Galt P Barber; David Haussler; W James Kent
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

7. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting.

Authors: Jay R Hesselberth; Xiaoyu Chen; Zhihong Zhang; Peter J Sabo; Richard Sandstrom; Alex P Reynolds; Robert E Thurman; Shane Neph; Michael S Kuehn; William S Noble; Stanley Fields; John A Stamatoyannopoulos
Journal: Nat Methods Date: 2009-03-22 Impact factor: 28.547
