Literature DB >> 30535063

SPRING: a next-generation compressor for FASTQ data.

Shubham Chandak1, Kedar Tatwawadi1, Idoia Ochoa2, Mikel Hernaez3, Tsachy Weissman1.   

Abstract

MOTIVATION: High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack in support of one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression.
RESULTS: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources.
AVAILABILITY AND IMPLEMENTATION: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Entities:  

Mesh:

Year:  2019        PMID: 30535063      PMCID: PMC6662292          DOI: 10.1093/bioinformatics/bty1015

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  8 in total

1.  QVZ: lossy compression of quality values.

Authors:  Greg Malysa; Mikel Hernaez; Idoia Ochoa; Milind Rao; Karthik Ganesan; Tsachy Weissman
Journal:  Bioinformatics       Date:  2015-05-28       Impact factor: 6.937

2.  Comparison of high-throughput sequencing data compression tools.

Authors:  Ibrahim Numanagić; James K Bonfield; Faraz Hach; Jan Voges; Jörn Ostermann; Claudio Alberti; Marco Mattavelli; S Cenk Sahinalp
Journal:  Nat Methods       Date:  2016-10-24       Impact factor: 28.547

3.  SCALCE: boosting sequence compression algorithms using locally consistent encoding.

Authors:  Faraz Hach; Ibrahim Numanagic; Can Alkan; S Cenk Sahinalp
Journal:  Bioinformatics       Date:  2012-10-09       Impact factor: 6.937

4.  Effect of lossy compression of quality scores on variant calling.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Brief Bioinform       Date:  2017-03-01       Impact factor: 11.622

5.  Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Authors:  Shubham Chandak; Kedar Tatwawadi; Tsachy Weissman
Journal:  Bioinformatics       Date:  2018-02-15       Impact factor: 6.937

6.  FaStore: a space-saving solution for raw sequencing data.

Authors:  Lukasz Roguski; Idoia Ochoa; Mikel Hernaez; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2018-08-15       Impact factor: 6.937

7.  DSRC 2--Industry-oriented compression of FASTQ files.

Authors:  Lukasz Roguski; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2014-04-18       Impact factor: 6.937

8.  Compression of FASTQ and SAM format sequencing data.

Authors:  James K Bonfield; Matthew V Mahoney
Journal:  PLoS One       Date:  2013-03-22       Impact factor: 3.240

  8 in total
  8 in total

1.  Lossless indexing with counting de Bruijn graphs.

Authors:  Mikhail Karasikov; Harun Mustafa; Gunnar Rätsch; André Kahles
Journal:  Genome Res       Date:  2022-05-24       Impact factor: 9.438

2.  ACO:lossless quality score compression based on adaptive coding order.

Authors:  Yi Niu; Mingming Ma; Fu Li; Xianming Liu; Guangming Shi
Journal:  BMC Bioinformatics       Date:  2022-06-07       Impact factor: 3.307

3.  CoLoRd: compressing long reads.

Authors:  Marek Kokot; Adam Gudyś; Heng Li; Sebastian Deorowicz
Journal:  Nat Methods       Date:  2022-03-28       Impact factor: 47.990

Review 4.  Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format.

Authors:  Kirill Kryukov; Lihua Jin; So Nakagawa
Journal:  Patterns (N Y)       Date:  2022-07-07

5.  Sequence Compression Benchmark (SCB) database-A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences.

Authors:  Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi
Journal:  Gigascience       Date:  2020-07-01       Impact factor: 6.524

6.  LFastqC: A lossless non-reference-based FASTQ compressor.

Authors:  Sultan Al Yami; Chun-Hsi Huang
Journal:  PLoS One       Date:  2019-11-14       Impact factor: 3.240

7.  FCLQC: fast and concurrent lossless quality scores compressor.

Authors:  Minhyeok Cho; Albert No
Journal:  BMC Bioinformatics       Date:  2021-12-20       Impact factor: 3.169

8.  Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.

Authors:  Yuansheng Liu; Jinyan Li
Journal:  PLoS Comput Biol       Date:  2021-07-19       Impact factor: 4.475

  8 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.