Michael M Hoffman, Orion J Buske, William Stafford Noble.
Abstract
SUMMARY: We present a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. We have also developed utilities to load data into this format. We show that retrieving data from this format is more than 2900 times faster than a naive approach using wiggle files.
Year: 2010 PMID: 20435580 PMCID: PMC2872006 DOI: 10.1093/bioinformatics/btq164
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Scatter plot of the time to retrieve data from a list of random genomic positions against the number of positions for different algorithms. Each point represents the average run time of the last three of four sequential trials (to eliminate caching effects) with a specific algorithm and a particular list of random positions. We used three different random lists of nine different sizes on three different algorithms, resulting in 81 plotted data points. The wiggle (circles) and offline Genomedata (crosses) algorithms ran in approximately constant time for greater than 100 positions, averaging 140 000 s (39 h) and 48 s, respectively. The online Genomedata algorithm (triangles) ran in approximately linear time for greater than 1000 random positions, averaging 1.7 ms per random access.
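The arithmetic behind the figures quoted in the caption and abstract can be checked with a short sketch. The constants below are the averages reported in the caption; the function and variable names are hypothetical, chosen only for illustration. The break-even estimate is an extrapolation that assumes the constant-time and linear-time trends hold exactly, which the caption only describes as approximate.

```python
from statistics import mean

# Averages quoted in the Fig. 1 caption.
WIGGLE_SECONDS = 140_000             # naive wiggle approach (roughly constant)
OFFLINE_SECONDS = 48                 # offline Genomedata (roughly constant)
ONLINE_SECONDS_PER_ACCESS = 1.7e-3   # online Genomedata (roughly linear)


def mean_of_last_three(trial_times):
    """Average the last three of four sequential trials.

    The first trial is discarded, mirroring the caption's approach
    to eliminating caching effects.
    """
    if len(trial_times) != 4:
        raise ValueError("expected exactly four sequential trials")
    return mean(trial_times[1:])


def summarize():
    """Derive the headline numbers from the caption's averages."""
    return {
        # 140 000 s expressed in hours (the caption rounds this to 39 h).
        "wiggle_hours": WIGGLE_SECONDS / 3600,
        # Offline Genomedata versus wiggle (the abstract's "more than 2900 times").
        "offline_speedup": WIGGLE_SECONDS / OFFLINE_SECONDS,
        # 3 random lists x 9 list sizes x 3 algorithms.
        "plotted_points": 3 * 9 * 3,
        # Assumption: extrapolated number of positions at which the linear
        # online cost would equal the constant offline cost.
        "online_offline_break_even": OFFLINE_SECONDS / ONLINE_SECONDS_PER_ACCESS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions, the online algorithm would be the cheaper choice for small position lists and the offline algorithm for very large ones, with the crossover in the tens of thousands of positions.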