
Fast lossless compression via cascading Bloom filters.

Roye Rozov, Ron Shamir, Eran Halperin.   

Abstract

BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments pose challenges in both storage cost and file-transfer time. It is sometimes possible to store only a summary relevant to particular applications, but it is generally desirable to keep all the information needed to revisit experimental results in the future. Hence the need for efficient lossless compression methods for NGS reads. NGS-specific compression schemes have been shown to improve on generic compression methods such as the Lempel-Ziv algorithm, the Burrows-Wheeler transform, and arithmetic coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome and then encoding each read as its alignment position together with its differences from the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require takes several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
RESULTS: We present a new approach that achieves highly efficient compression by using a reference genome but completely circumvents the need for alignment, greatly reducing compression time. In contrast to reference-based methods that first align reads to the genome, we encode by hashing all reads into Bloom filters, and decode by querying the same Bloom filters with read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
CONCLUSIONS: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. At high coverage (50-100 fold), BARCODE saves 80-90% of the running time relative to the best tested compressors, at the cost of only a slight increase in compressed size.
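
The encode/decode idea described under RESULTS can be illustrated with a minimal Python sketch. It is not the BARCODE implementation: the abstract does not specify how the cascade is built, so the alternating scheme below (each filter stores the false positives of the previous one, with a final explicit set) is an assumption, as are all names (`BloomFilter`, `windows`, `encode`, `decode`) and parameters (`levels`, `bits_per_item`, `num_hashes`). The sketch treats reads as a set of unique strings, stores reads that do not occur in the reference explicitly, and ignores read order, multiplicity, and reverse complements.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings (hypothetical helper, not BARCODE's)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def windows(reference, read_len):
    """All distinct read-length subsequences of the reference."""
    return {reference[i:i + read_len] for i in range(len(reference) - read_len + 1)}


def encode(reads, reference, read_len, levels=3, bits_per_item=8, num_hashes=3):
    """Hash reads into a cascade of Bloom filters (assumed construction)."""
    ref_windows = windows(reference, read_len)
    unique = set(reads)
    unmapped = unique - ref_windows  # reads absent from the reference, kept as-is
    members = unique & ref_windows   # items inserted at the current level
    others = ref_windows - members   # items that must be told apart from members
    filters = []
    for _ in range(levels):
        bf = BloomFilter(max(8, bits_per_item * len(members)), num_hashes)
        for item in members:
            bf.add(item)
        filters.append(bf)
        # False positives of this level become the members of the next level.
        false_pos = {item for item in others if item in bf}
        members, others = false_pos, members
    # False positives of the last filter are stored explicitly.
    return filters, members, unmapped


def decode(filters, explicit, unmapped, reference, read_len):
    """Recover the read set by querying the filters with reference windows."""
    recovered = set(unmapped)
    for w in windows(reference, read_len):
        depth = 0  # number of consecutive filters that accept w
        for bf in filters:
            if w not in bf:
                break
            depth += 1
        if depth == len(filters) and w in explicit:
            depth += 1
        # Levels alternate between reads and non-reads, so parity decides.
        if depth % 2 == 1:
            recovered.add(w)
    return recovered
```

Calling `encode(reads, reference, read_len)` and passing its three return values to `decode` together with the same `reference` and `read_len` should return the set of unique input reads. Under the assumptions above the round trip is exact regardless of the false-positive rate of any single filter, because every false positive at one level is either rejected at the next level or listed in the final explicit set. Smaller filters simply push more items down the cascade.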

Year: 2014   PMID: 25252952   PMCID: PMC4168706   DOI: 10.1186/1471-2105-15-S9-S7

Source DB: PubMed   Journal: BMC Bioinformatics   ISSN: 1471-2105   Impact factor: 3.169


