LFQC: a lossless compression algorithm for FASTQ files.

Marius Nicolae, Sudipta Pathak, Sanguthevar Rajasekaran.

Abstract

MOTIVATION: Next Generation Sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole-genome sequencing. One of the biggest challenges posed by modern sequencing technology is the economical storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storing and transmitting large FASTQ files using innovative compression techniques.
RESULTS: We introduce a new lossless, non-reference-based FASTQ compression algorithm named Lossless FASTQ Compressor (LFQC). We compared our algorithm with other state-of-the-art big data compression algorithms, namely gzip, bzip2, fastqz (Bonfield and Mahoney, 2013), fqzcomp (Bonfield and Mahoney, 2013), Quip (Jones et al., 2012) and DSRC2 (Roguski and Deorowicz, 2014). The comparison shows that our algorithm achieves better compression ratios on LS454 and SOLiD datasets.
AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/rajasek/lfqc-v1.1.zip.
CONTACT: rajasek@engr.uconn.edu.
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
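
The abstract compares tools by compression ratio and stresses losslessness. As a rough illustration only, the sketch below shows one way such a ratio might be measured for the two general-purpose baselines named above (gzip and bzip2), using Python's standard library. It is not the LFQC algorithm and does not reproduce the paper's benchmark: the synthetic reads, the helper names `make_fastq` and `compression_ratio`, and the read counts are all assumptions made for this example.

```python
import bz2
import gzip
import random


def make_fastq(n_reads, read_len, seed=0):
    """Build a small synthetic FASTQ text (4 lines per record).

    Hypothetical test data: random bases and random Phred+33 qualities,
    not drawn from any real LS454 or SOLiD dataset.
    """
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def compression_ratio(raw, compress):
    """Return original size divided by compressed size (higher is better)."""
    return len(raw) / len(compress(raw))


raw = make_fastq(2000, 100)

# Lossless means the decompressed bytes match the input exactly.
assert gzip.decompress(gzip.compress(raw)) == raw
assert bz2.decompress(bz2.compress(raw)) == raw

for name, fn in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    print(name, round(compression_ratio(raw, fn), 2))
```

Random synthetic reads carry far less redundancy than real sequencing data, so the printed ratios say nothing about how these tools perform on the datasets evaluated in the paper.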

Year:  2015        PMID: 26093148      PMCID: PMC4795634          DOI: 10.1093/bioinformatics/btv384

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


