CALQ: compression of quality values of aligned sequencing data.

Jan Voges¹, Jörn Ostermann¹, Mikel Hernaez².

Abstract

Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses.
Results: We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets.
Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq.
Contact: voges@tnt.uni-hannover.de or mhernaez@illinois.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
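
The idea of varying quantization coarseness per position, and of measuring size in bits per quality value, can be illustrated with a minimal Python sketch. It is not CALQ's implementation: the uniform quantizer, the `confidence` input (a hypothetical stand-in for the output of a genotyping model), the threshold, and the level counts 2 and 8 are all assumptions chosen for illustration, and the empirical entropy is only a rough proxy for the size an entropy coder would achieve.

```python
import math
from collections import Counter


def phred_to_int(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string (Phred+33 by default) to integers."""
    return [ord(c) - offset for c in qual]


def quantize(q: int, levels: int, q_min: int = 0, q_max: int = 41) -> int:
    """Uniform scalar quantizer: map q to the midpoint of one of
    `levels` equal-width bins covering [q_min, q_max]."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    q = min(max(q, q_min), q_max)
    width = (q_max - q_min + 1) / levels
    idx = min(int((q - q_min) / width), levels - 1)
    return q_min + round((idx + 0.5) * width - 0.5)


def adaptive_quantize(quals, confidence, coarse=2, fine=8, threshold=0.9):
    """Quantize coarsely where the (hypothetical) per-position confidence
    is high and finely where it is low. This only mimics the general idea
    of model-controlled coarseness; it is not the CALQ algorithm."""
    return [
        quantize(q, coarse if c >= threshold else fine)
        for q, c in zip(quals, confidence)
    ]


def entropy_bits(values) -> float:
    """Empirical zeroth-order entropy in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    quals = phred_to_int("IIIIHHGF@@?>=<;:")
    # Made-up confidences: the first half of the positions is "certain".
    confidence = [0.99] * 8 + [0.5] * 8
    lossy = adaptive_quantize(quals, confidence)
    print("original :", quals, f"{entropy_bits(quals):.2f} bits/value")
    print("quantized:", lossy, f"{entropy_bits(lossy):.2f} bits/value")
```

Fewer distinct reconstruction values generally lower the empirical entropy, which is why coarser quantization reduces the compressed size; the cost is distortion, which in this sketch is confined mostly to positions marked as high-confidence.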

Year: 2018 PMID: 29186284 PMCID: PMC5946873 DOI: 10.1093/bioinformatics/btx737

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
