G-SQZ: compact encoding of genomic sequence and quality data.

Waibhav Tembe, James Lowey, Edward Suh.

Abstract

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access and transfer. We present G-SQZ, a Huffman coding-based representation scheme specific to sequencing reads that compresses data without altering the relative order. G-SQZ achieved 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from the start. This article describes the underlying encoding scheme and its software implementation; the more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data.

AVAILABILITY: http://public.tgen.org/sqz. Academic/non-profit: source code is available at no cost under a non-open-source license on request from the website; the binary is available for direct download at no cost. For-profit: submit a request for a for-profit license through the website.
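
To make the idea of a Huffman coding-based read representation concrete, the following is a minimal sketch and not the G-SQZ implementation. It assumes that each symbol is a (base, quality character) pair; the abstract does not specify the symbol alphabet, so that pairing, the toy read, and the function names `build_huffman_code`, `encode` and `decode` are all illustrative assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_code(symbols):
    """Return {symbol: bitstring} built from the symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(weight, next(tiebreak), {sym: ""}) for sym, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the code words of the symbols, preserving their order."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert encode(); this works because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical toy read: bases plus FASTQ-style quality characters.
bases = "ACGTTTGACA"
quals = "IIIIHHHI#I"
pairs = list(zip(bases, quals))  # assumed symbol alphabet: (base, quality)

code = build_huffman_code(pairs)
bits = encode(pairs, code)
restored = decode(bits, code)
print(len(bits), "bits versus", 16 * len(pairs), "bits as two ASCII bytes per pair")
```

Frequent pairs receive shorter code words, and decoding returns the symbols in their original order. Selective access as described in the abstract would additionally need some form of index into the bit stream, which this sketch does not attempt.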

Year: 2010; PMID: 20605925; DOI: 10.1093/bioinformatics/btq346

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

Cited

21 in total

1. LFQC: a lossless compression algorithm for FASTQ files.

2. SCALCE: boosting sequence compression algorithms using locally consistent encoding.

3. Toward a Better Compression for DNA Sequences Using Huffman Encoding.

4. Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification.

5. Computational solutions for omics data. (Review)

6. SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data.

7. QualComp: a new lossy compressor for quality scores based on rate distortion theory.

8. NGC: lossless and lossy compression of aligned high-throughput sequencing data.

9. Compression of FASTQ and SAM format sequencing data.

10. Compression of next-generation sequencing reads aided by highly efficient de novo assembly.