Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Shubham Chandak, Kedar Tatwawadi, Tsachy Weissman.

Abstract

Motivation: Next-generation sequencing (NGS) technologies produce large numbers of short genomic reads per experiment, and these reads are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy because of the special structure present in the data.
Results: We present a new algorithm for compressing reads, both with and without preserving the read order. In both cases, it achieves a 1.4× to 2× compression gain over state-of-the-art read compression tools on datasets containing as many as 3 billion Illumina reads. The tool is based on approximately reordering the reads according to their position in the genome, using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on the fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and of read compression algorithms in general) and helps explain its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference.
Availability: The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
Contact: schandak@stanford.edu
Supplementary information: Supplementary materials are available at Bioinformatics online.
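The reordering idea described in the Results can be illustrated with a toy sketch. This is not the HARC implementation: the function names, the substring length `k`, the use of a read's prefix as its hash key, and the requirement of an exact match on the overlap are all assumptions made for illustration. The sketch indexes every read by its first `k` bases in a hash table, then greedily chains reads by looking up shifted windows of the current read, so that reads likely to come from neighboring genome positions end up adjacent.

```python
from collections import defaultdict


def _find_next(reads, index, used, current, k):
    """Return the id of an unused read that overlaps a suffix of reads[current], or None."""
    read = reads[current]
    # Try small shifts first, so the longest overlaps are preferred.
    for shift in range(len(read) - k + 1):
        window = read[shift:shift + k]
        for cand in index.get(window, ()):
            if used[cand]:
                continue
            overlap = len(read) - shift
            # Assumption: exact match on the overlapping region.
            if reads[cand][:overlap] == read[shift:]:
                return cand
    return None


def reorder_reads(reads, k=4):
    """Greedy approximate reordering of reads via a hash index on prefix k-mers.

    Returns a permutation of read ids (hypothetical helper, for illustration only).
    """
    # Hash table: prefix k-mer -> ids of reads starting with that k-mer.
    index = defaultdict(list)
    for i, r in enumerate(reads):
        index[r[:k]].append(i)

    used = [False] * len(reads)
    order = []
    for start in range(len(reads)):
        if used[start]:
            continue
        current = start
        # Extend the chain until no overlapping unused read is found.
        while current is not None:
            used[current] = True
            order.append(current)
            current = _find_next(reads, index, used, current, k)
    return order
```

Once overlapping reads are adjacent, each read can in principle be stored as a shift relative to its predecessor plus only the new bases, which is where the compression gain would come from. A practical tool would also need to tolerate mismatches from sequencing errors and handle reverse complements, which this sketch omits.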

Year: 2018 PMID: 29444237 PMCID: PMC5860611 DOI: 10.1093/bioinformatics/btx639

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
