Disk-based compression of data from genome sequencing.

Szymon Grabowski¹, Sebastian Deorowicz¹, Łukasz Roguski².

Abstract

MOTIVATION: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, because the redundancy between overlapping reads cannot easily be captured in the (relatively small) main memory. The more interesting solutions to this problem are disk based; the better of the two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage.
RESULTS: We propose overlapping reads compression with minimizers (ORCOM), a compression algorithm dedicated to sequencing reads (DNA only). Our method uses the conceptually simple and easily parallelizable idea of minimizers to achieve a compression ratio of 0.317 bits per base, which allows the 134.0 Gbp dataset to fit into only 5.31 GB of space.
AVAILABILITY AND IMPLEMENTATION: http://sun.aei.polsl.pl/orcom under a free license.
CONTACT: sebastian.deorowicz@polsl.pl
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
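
The minimizer idea named in the abstract can be illustrated with a minimal sketch. It assumes the simplest definition, the lexicographically smallest k-mer of a read, and groups reads that share it into one bucket; the function names and the toy reads are hypothetical, and the published method may differ in details such as how minimizers are chosen and how the reads within a bucket are encoded. A small helper also reproduces the size arithmetic from the reported rate.

```python
from collections import defaultdict


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of `read` (assumed definition)."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k):
    """Group reads by minimizer; overlapping reads tend to share one."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


def compressed_size_gb(bases: float, bits_per_base: float) -> float:
    """Size in GB (10**9 bytes) for a dataset at a given rate in bits per base."""
    return bases * bits_per_base / 8 / 1e9


# Toy reads (hypothetical): the first two overlap, the third is unrelated.
reads = ["TTACGATTGCA", "ACGATTGCAGG", "GGGTTTCCCGT"]
buckets = bucket_by_minimizer(reads, k=4)

# Size check for the figures quoted in the abstract.
size_gb = round(compressed_size_gb(134.0e9, 0.317), 2)
```

With these toy reads, the two overlapping reads land in the same bucket (minimizer `ACGA`), which is what lets a compressor place similar reads next to each other without holding the whole dataset in memory. Each bucket can be processed independently, which is one way to read the "easily parallelizable" claim. The size check gives 134.0e9 × 0.317 / 8 bytes, about 5.31 GB, matching the abstract.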

Year: 2014; PMID: 25536966; DOI: 10.1093/bioinformatics/btu844

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

Cited by (13 in total):

1. Comparison of high-throughput sequencing data compression tools.

2. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

3. CoLoRd: compressing long reads.

4. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.

5. GDC 2: Compression of large collections of genomes.

6. Indexing Arbitrary-Length k-Mers in Sequencing Reads.

7. LW-FQZip 2: a parallelized reference-based compression of FASTQ files.

8. Optimal compressed representation of high throughput sequence data via light assembly.

9. CARGO: effective format-free compressed storage of genomic information.

10. BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs.