G-SQZ: compact encoding of genomic sequence and quality data.

Waibhav Tembe, James Lowey, Edward Suh.

Abstract

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access and transfer. We present G-SQZ, a Huffman coding-based, sequencing-read-specific representation scheme that compresses data without altering the relative order. G-SQZ achieved 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from the start. This article focuses on the underlying encoding scheme and its software implementation; the more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data.

AVAILABILITY: http://public.tgen.org/sqz. Academic/non-profit: Source: available at no cost under a non-open-source license on request from the website; Binary: available for direct download at no cost. For-profit: submit a request for a for-profit license from the website.
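
The idea of Huffman-coding sequencing reads while keeping their order and allowing selective access can be illustrated with a small sketch. This is not the G-SQZ implementation: it assumes, for illustration only, that each symbol is a (base, quality) pair, that read length is known to the decoder, and that a per-read bit-offset index is what enables access without decoding from the start. The function names `build_code`, `encode_reads` and `decode_read` are hypothetical, and bits are kept as a text string of '0'/'1' for readability rather than packed into bytes.

```python
import heapq
from collections import Counter
from itertools import count


def build_code(symbols):
    """Return a prefix-free Huffman code {symbol: bitstring}."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    tie = count()                           # tie-breaker so the heap never compares dicts
    heap = [(n, next(tie), {s: ""}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)     # two least frequent subtrees
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


def encode_reads(reads, code):
    """Encode reads in their original order and record each read's starting bit offset."""
    chunks, offsets, pos = [], [], 0
    for seq, qual in reads:
        offsets.append(pos)
        chunk = "".join(code[pair] for pair in zip(seq, qual))
        chunks.append(chunk)
        pos += len(chunk)
    return "".join(chunks), offsets


def decode_read(bits, start, length, code):
    """Decode `length` (base, quality) symbols beginning at bit offset `start`."""
    inverse = {b: s for s, b in code.items()}
    seq, qual, cur, i = [], [], "", start
    while len(seq) < length:
        cur += bits[i]
        i += 1
        if cur in inverse:                  # prefix-free, so the first match is the symbol
            base, q = inverse[cur]
            seq.append(base)
            qual.append(q)
            cur = ""
    return "".join(seq), "".join(qual)


# Toy reads (sequence, quality string); made-up data, not from the paper's benchmarks.
reads = [("ACGTACGT", "IIIIHHGG"), ("ACGTTTGA", "IIIHHGG#"), ("GGGTACGA", "IIIIIHH#")]
code = build_code(pair for seq, qual in reads for pair in zip(seq, qual))
bits, offsets = encode_reads(reads, code)

# Selective access: decode only the second read, using its stored offset.
second = decode_read(bits, offsets[1], len(reads[1][0]), code)
```

Because frequent (base, quality) pairs receive short codewords, the encoded stream is shorter than the raw text, and the offset index lets a single read be recovered without touching the reads before it.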

Year:  2010        PMID: 20605925     DOI: 10.1093/bioinformatics/btq346

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  21 in total

1.  LFQC: a lossless compression algorithm for FASTQ files.

Authors:  Marius Nicolae; Sudipta Pathak; Sanguthevar Rajasekaran
Journal:  Bioinformatics       Date:  2015-06-20       Impact factor: 6.937

2.  SCALCE: boosting sequence compression algorithms using locally consistent encoding.

Authors:  Faraz Hach; Ibrahim Numanagic; Can Alkan; S Cenk Sahinalp
Journal:  Bioinformatics       Date:  2012-10-09       Impact factor: 6.937

3.  Toward a Better Compression for DNA Sequences Using Huffman Encoding.

Authors:  Anas Al-Okaily; Badar Almarri; Sultan Al Yami; Chun-Hsi Huang
Journal:  J Comput Biol       Date:  2016-12-13       Impact factor: 1.479

4.  Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification.

Authors:  Y William Yu; Deniz Yorukoglu; Bonnie Berger
Journal:  Res Comput Mol Biol       Date:  2014-04

5.  Computational solutions for omics data. (Review)

Authors:  Bonnie Berger; Jian Peng; Mona Singh
Journal:  Nat Rev Genet       Date:  2013-05       Impact factor: 53.242

6.  SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data.

Authors:  Young Jun Jeon; Sang Hyun Park; Sung Min Ahn; Hee Joung Hwang
Journal:  Evol Bioinform Online       Date:  2011-03-10       Impact factor: 1.625

7.  QualComp: a new lossy compressor for quality scores based on rate distortion theory.

Authors:  Idoia Ochoa; Himanshu Asnani; Dinesh Bharadia; Mainak Chowdhury; Tsachy Weissman; Golan Yona
Journal:  BMC Bioinformatics       Date:  2013-06-08       Impact factor: 3.169

8.  NGC: lossless and lossy compression of aligned high-throughput sequencing data.

Authors:  Niko Popitsch; Arndt von Haeseler
Journal:  Nucleic Acids Res       Date:  2012-10-12       Impact factor: 16.971

9.  Compression of FASTQ and SAM format sequencing data.

Authors:  James K Bonfield; Matthew V Mahoney
Journal:  PLoS One       Date:  2013-03-22       Impact factor: 3.240

10.  Compression of next-generation sequencing reads aided by highly efficient de novo assembly.

Authors:  Daniel C Jones; Walter L Ruzzo; Xinxia Peng; Michael G Katze
Journal:  Nucleic Acids Res       Date:  2012-08-16       Impact factor: 16.971

