CALQ: compression of quality values of aligned sequencing data.

Jan Voges, Jörn Ostermann, Mikel Hernaez.

Abstract

Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses.
Results: We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets.
Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq.
Contact: voges@tnt.uni-hannover.de or mhernaez@illinois.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
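
The idea stated in the Motivation, that the coarseness of quality value quantization is controlled by a genotyping model, can be illustrated with a minimal sketch. This is not CALQ's actual algorithm, which the abstract does not specify. The Phred range 0 to 41, the `confidence` input (a stand-in for whatever a genotyping model would report per locus), the bounds of 2 and 8 levels, and all function names are assumptions made for illustration only.

```python
from typing import List

# Assumed Phred quality range; real datasets may use a different range.
Q_MIN, Q_MAX = 0, 41


def quantize(q: int, levels: int, q_min: int = Q_MIN, q_max: int = Q_MAX) -> int:
    """Map q to a representative of one of `levels` equal-width bins."""
    if levels < 1:
        raise ValueError("levels must be >= 1")
    q = min(max(q, q_min), q_max)              # clamp into the assumed range
    width = (q_max - q_min + 1) / levels       # bin width in quality units
    index = min(int((q - q_min) / width), levels - 1)
    # Representative value: the (floored) centre of the bin.
    return q_min + int((index + 0.5) * width - 0.5)


def levels_for_confidence(confidence: float,
                          min_levels: int = 2,
                          max_levels: int = 8) -> int:
    """Hypothetical rule: higher genotype confidence allows coarser quantization."""
    confidence = min(max(confidence, 0.0), 1.0)
    return max_levels - round(confidence * (max_levels - min_levels))


def quantize_column(quals: List[int], confidence: float) -> List[int]:
    """Quantize all quality values covering one locus with a shared level count."""
    levels = levels_for_confidence(confidence)
    return [quantize(q, levels) for q in quals]


if __name__ == "__main__":
    # A confidently genotyped locus is quantized coarsely, an uncertain one finely.
    print(quantize_column([5, 20, 35, 40], confidence=1.0))
    print(quantize_column([5, 20, 35, 40], confidence=0.0))
```

With L quantization levels, each quality value needs at most log2(L) bits before entropy coding, so lowering the level count where the genotype is unambiguous reduces the bits per quality value while leaving finer resolution where it may matter for variant calling.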

Year: 2018; PMID: 29186284; PMCID: PMC5946873; DOI: 10.1093/bioinformatics/btx737

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

