A cluster-based approach to compression of Quality Scores.

Mikel Hernaez, Idoia Ochoa, Tsachy Weissman.

Abstract

Advances in sequencing technology and a dramatic drop in sequencing cost are generating massive amounts of sequencing data. Storing and sharing these data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. Lossless compression of these data has therefore been proposed. Quality scores, which indicate how reliable the sequencing machine is when calling a particular base pair, account for more than 70% of the compressed data. Lossy compression of quality scores is thus emerging as the natural candidate for further improving compression performance. Because the data are used for genetic variant discovery, lossy compressors for quality scores are evaluated both on their rate-distortion performance and on their effect on variant callers. Previously proposed algorithms do not perform well under all performance metrics and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, assuming that all quality score sequences come from a mixture of Markov models. It then quantizes the quality scores based on the Markov models, with each quantizer targeting a specific distortion so as to optimize the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the proposed algorithm is likely to affect any downstream application less noticeably than previously proposed lossy compressors do. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability it introduces in the calls is considerably smaller than the variability that exists between different SNP calling methodologies.
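
The three stages named in the abstract (clustering under a mixture of Markov models, quantization, entropy coding) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on several assumptions that are not in the abstract: hard-assignment clustering seeded by mean quality, first-order Markov models with additive smoothing, a uniform scalar quantizer standing in for the paper's Markov-model-based quantizers, and empirical zero-order entropy as a rough proxy for the rate of an entropy encoder. All function names are hypothetical.

```python
import math
from collections import Counter


def fit_markov(seqs, alphabet_size, pseudo=1.0):
    """Estimate first-order transition log-probabilities with additive smoothing."""
    counts = [[pseudo] * alphabet_size for _ in range(alphabet_size)]
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    log_probs = []
    for row in counts:
        total = sum(row)
        log_probs.append([math.log(c / total) for c in row])
    return log_probs


def log_likelihood(seq, log_probs):
    """Log-likelihood of the transitions in seq (the first symbol is not modeled)."""
    return sum(log_probs[a][b] for a, b in zip(seq, seq[1:]))


def cluster_sequences(seqs, k, alphabet_size, iters=10):
    """Hard-assignment clustering: refit one Markov model per cluster, then reassign.

    Assumption: clusters are seeded by ranking the (non-empty) sequences by
    mean quality; the abstract does not say how clustering is initialized.
    """
    n = len(seqs)
    order = sorted(range(n), key=lambda i: sum(seqs[i]) / len(seqs[i]))
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    for _ in range(iters):
        models = [
            fit_markov([s for s, lab in zip(seqs, labels) if lab == c], alphabet_size)
            for c in range(k)
        ]
        new_labels = [
            max(range(k), key=lambda c: log_likelihood(s, models[c])) for s in seqs
        ]
        if new_labels == labels:  # assignments are stable, stop early
            break
        labels = new_labels
    return labels


def quantize(seq, step):
    """Uniform scalar quantizer (a stand-in): map each score to its bin midpoint."""
    return [(q // step) * step + step // 2 for q in seq]


def mean_squared_error(original, reconstructed):
    """Average squared difference, one common distortion metric."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def empirical_entropy(symbols):
    """Zero-order entropy in bits per symbol, a rough proxy for the coded rate."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

In this sketch, a coarser `step` lowers the entropy (rate) of the quantized scores at the cost of a higher mean squared error, which is the rate-distortion trade-off the abstract refers to. Choosing a different `step` per cluster would loosely mirror the idea of each quantizer targeting its own distortion.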

Year:  2016        PMID: 29057318      PMCID: PMC5649045          DOI: 10.1109/DCC.2016.49

Source DB:  PubMed          Journal:  Proc Data Compress Conf        ISSN: 2375-0383


  8 in total

1.  QVZ: lossy compression of quality values.

Authors:  Greg Malysa; Mikel Hernaez; Idoia Ochoa; Milind Rao; Karthik Ganesan; Tsachy Weissman
Journal:  Bioinformatics       Date:  2015-05-28       Impact factor: 6.937

2.  Toward better understanding of artifacts in variant calling from high-coverage samples. (Review)

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2014-06-27       Impact factor: 6.937

3.  Lossy compression of quality scores in genomic data.

Authors:  Rodrigo Cánovas; Alistair Moffat; Andrew Turpin
Journal:  Bioinformatics       Date:  2014-04-10       Impact factor: 6.937

4.  Effect of lossy compression of quality scores on variant calling.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Brief Bioinform       Date:  2017-03-01       Impact factor: 11.622

5.  Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification.

Authors:  Y William Yu; Deniz Yorukoglu; Bonnie Berger
Journal:  Res Comput Mol Biol       Date:  2014-04

6.  DSRC 2--Industry-oriented compression of FASTQ files.

Authors:  Lukasz Roguski; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2014-04-18       Impact factor: 6.937

7.  QualComp: a new lossy compressor for quality scores based on rate distortion theory.

Authors:  Idoia Ochoa; Himanshu Asnani; Dinesh Bharadia; Mainak Chowdhury; Tsachy Weissman; Golan Yona
Journal:  BMC Bioinformatics       Date:  2013-06-08       Impact factor: 3.169

8.  Compression of FASTQ and SAM format sequencing data.

Authors:  James K Bonfield; Matthew V Mahoney
Journal:  PLoS One       Date:  2013-03-22       Impact factor: 3.240

  3 in total

1.  CALQ: compression of quality values of aligned sequencing data.

Authors:  Jan Voges; Jörn Ostermann; Mikel Hernaez
Journal:  Bioinformatics       Date:  2018-05-15       Impact factor: 6.937

2.  FCLQC: fast and concurrent lossless quality scores compressor.

Authors:  Minhyeok Cho; Albert No
Journal:  BMC Bioinformatics       Date:  2021-12-20       Impact factor: 3.169

3.  LCQS: an efficient lossless compression tool of quality scores with random access functionality.

Authors:  Jiabing Fu; Bixin Ke; Shoubin Dong
Journal:  BMC Bioinformatics       Date:  2020-03-18       Impact factor: 3.169

