
Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Shubham Chandak, Kedar Tatwawadi, Tsachy Weissman.

Abstract

Motivation: New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data.
Results: We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×–2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference.
Availability: The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
Contact: schandak@stanford.edu.
Supplementary information: Supplementary material is available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
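
The reordering idea in the Results paragraph can be illustrated with a toy sketch. This is not the HARC implementation: the function names, the prefix-only index, the greedy forward search, and the parameters `k` and `max_mismatch` are all assumptions chosen for illustration, and details a practical tool would need (reverse complements, memory-efficient hashing, encoding of the reordered reads) are omitted. Each read is indexed by its first `k` bases; starting from some read, the sketch repeatedly looks for an unplaced read whose prefix matches a shifted substring of the current read, so that consecutive reads in the output tend to overlap.

```python
from collections import defaultdict


def find_next(reads, index, placed, cur, k, max_mismatch):
    """Return (read_index, shift) of an unplaced read overlapping reads[cur], or None."""
    read = reads[cur]
    # Try small shifts first so that the closest-overlapping read is preferred.
    for shift in range(0, len(read) - k + 1):
        for j in index.get(read[shift:shift + k], ()):
            if placed[j]:
                continue
            overlap = len(read) - shift
            # Hamming distance over the overlapping region.
            mismatches = sum(a != b for a, b in zip(read[shift:], reads[j][:overlap]))
            if mismatches <= max_mismatch:
                return j, shift
    return None


def reorder_reads(reads, k=4, max_mismatch=0):
    """Greedy toy reordering; returns a list of (read_index, shift) pairs.

    shift is None for a read that starts a new chain.
    """
    # Index every read by its first k bases (the dict hashes the k-mer string).
    index = defaultdict(list)
    for i, r in enumerate(reads):
        index[r[:k]].append(i)

    placed = [False] * len(reads)
    order = []
    next_unplaced = 0
    while len(order) < len(reads):
        # Start a new chain from the first read not yet placed.
        while placed[next_unplaced]:
            next_unplaced += 1
        cur = next_unplaced
        placed[cur] = True
        order.append((cur, None))
        # Extend the chain for as long as an overlapping read can be found.
        while True:
            found = find_next(reads, index, placed, cur, k, max_mismatch)
            if found is None:
                break
            cur, shift = found
            placed[cur] = True
            order.append((cur, shift))
    return order


# Four length-8 reads taken from positions 0, 6, 4, 2 of "ACGTTGCAAGGCTA".
reads = ["ACGTTGCA", "CAAGGCTA", "TGCAAGGC", "GTTGCAAG"]
print(reorder_reads(reads, k=4))
```

Once reads are arranged this way, each read can in principle be described by its shift and mismatches relative to its predecessor rather than by its full sequence, which is the kind of redundancy a general-purpose compressor does not see in an unordered FASTQ file.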

Year:  2018        PMID: 29444237      PMCID: PMC5860611          DOI: 10.1093/bioinformatics/btx639

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  6 in total

1.  SPRING: a next-generation compressor for FASTQ data.

Authors:  Shubham Chandak; Kedar Tatwawadi; Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2019-08-01       Impact factor: 6.937

2.  Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format. (Review)

Authors:  Kirill Kryukov; Lihua Jin; So Nakagawa
Journal:  Patterns (N Y)       Date:  2022-07-07

3.  Sequence Compression Benchmark (SCB) database: a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences.

Authors:  Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi
Journal:  Gigascience       Date:  2020-07-01       Impact factor: 6.524

4.  BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs.

Authors:  Rongjie Wang; Junyi Li; Yang Bai; Tianyi Zang; Yadong Wang
Journal:  PeerJ       Date:  2018-10-19       Impact factor: 2.984

5.  Variable-order reference-free variant discovery with the Burrows-Wheeler Transform.

Authors:  Nicola Prezza; Nadia Pisanti; Marinella Sciortino; Giovanna Rosone
Journal:  BMC Bioinformatics       Date:  2020-09-16       Impact factor: 3.169

6.  Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.

Authors:  Yuansheng Liu; Jinyan Li
Journal:  PLoS Comput Biol       Date:  2021-07-19       Impact factor: 4.475
