
Disk-based compression of data from genome sequencing.

Szymon Grabowski, Sebastian Deorowicz, Łukasz Roguski.

Abstract

MOTIVATION: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, because the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions to this problem are disk-based; the better of the two existing ones, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage.
RESULTS: We propose Overlapping Reads COmpression with Minimizers (ORCOM), a compression algorithm dedicated to sequencing reads (DNA only). Our method uses the conceptually simple and easily parallelizable idea of minimizers to achieve a compression ratio of 0.317 bits per base, which allows the 134.0 Gbp dataset to fit into only 5.31 GB of space.
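The minimizer idea mentioned in RESULTS can be illustrated with a minimal sketch. It assumes the simplest possible definition, the lexicographically smallest k-mer of a read, and ignores reverse complements and any further refinements the published method may apply. The function names `minimizer` and `bucket_reads`, the value of `k`, and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of `read` (simplified definition)."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k):
    """Group reads by minimizer; overlapping reads tend to share one."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


# Toy data (hypothetical): the first two reads overlap on "TTACAGG".
reads = ["GATTACAGG", "TTACAGGCT", "CCGGTTGGT"]
buckets = bucket_reads(reads, k=4)
for key, group in sorted(buckets.items()):
    print(key, group)
```

Because each read is assigned to a bucket independently of the others, the bucketing step parallelizes naturally, and each bucket can be written to disk and compressed on its own.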
AVAILABILITY AND IMPLEMENTATION: http://sun.aei.polsl.pl/orcom under a free license.
CONTACT: sebastian.deorowicz@polsl.pl
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
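The space figure quoted in the abstract follows from the bits-per-base ratio by simple arithmetic. The sketch below assumes that 1 GB means 10^9 bytes and that 1 Gbp means 10^9 bases. The helper name `compressed_size_gb` is hypothetical. The size derived for the BWT-based method is not stated in the abstract; it is computed here under the same assumptions.

```python
def compressed_size_gb(bases: float, bits_per_base: float) -> float:
    """Convert a bits-per-base ratio into a compressed size in GB (10^9 bytes)."""
    total_bits = bases * bits_per_base
    return total_bits / 8 / 1e9


DATASET_BASES = 134.0e9  # 134.0 Gbp, as given in the abstract

orcom_gb = compressed_size_gb(DATASET_BASES, 0.317)  # proposed method
bwt_gb = compressed_size_gb(DATASET_BASES, 0.518)    # Cox et al. (2012)

print(f"0.317 bits/base: {orcom_gb:.2f} GB")
print(f"0.518 bits/base: {bwt_gb:.2f} GB")
```

Under these assumptions the first value reproduces the 5.31 GB stated in the abstract.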

Year:  2014        PMID: 25536966     DOI: 10.1093/bioinformatics/btu844

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

