Lossy compression of quality scores in genomic data.

Rodrigo Cánovas, Alistair Moffat, Andrew Turpin.

Abstract

MOTIVATION: Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies are typically represented as a string of bases, an associated sequence of per-base quality scores and other metadata, and in aggregate can require a large amount of space. The quality scores show how accurate the bases are with respect to the sequencing process, that is, how confident the sequencer is of having called them correctly, and are the largest component in datasets in which they are retained. Previous research has examined how to store sequences of bases effectively; here we add to that knowledge by examining methods for compressing quality scores. The quality values originate in a continuous domain, and so if a fidelity criterion is introduced, it is possible to introduce flexibility in the way these values are represented, allowing lossy compression over the quality score data.
RESULTS: We present existing compression options for quality score data, and then introduce two new lossy techniques. Experiments measuring the trade-off between compression ratio and information loss are reported, including quantifying the effect of lossy representations on a downstream application that carries out single nucleotide polymorphism and insert/deletion detection. The new methods are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation.
AVAILABILITY AND IMPLEMENTATION: An implementation of the methods described here is available at https://github.com/rcanovas/libCSAM.
CONTACT: rcanovas@student.unimelb.edu.au
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
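
ILLUSTRATION: The trade-off between storage and fidelity described in the abstract can be made concrete with a small sketch. The code below is not the authors' method and does not use libCSAM; it applies plain uniform binning to Phred quality scores and reports the empirical entropy (a rough proxy for storage in bits per score) alongside the mean squared error (one possible fidelity measure). The Phred+33 ASCII offset, the bin widths, the sample quality string, and all function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(qual_string, offset=33):
    """Decode an ASCII quality string into integer scores.

    The offset of 33 (Phred+33) is an assumption; other encodings exist.
    """
    return [ord(c) - offset for c in qual_string]


def quantize_uniform(scores, bin_width):
    """Lossy step: replace each score by the midpoint of its fixed-width bin.

    This is a generic uniform quantizer, used here only as a stand-in for
    a lossy transform; it is not one of the techniques from the paper.
    """
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy in bits per symbol (storage proxy)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_squared_error(original, approx):
    """Average squared difference between original and quantized scores."""
    return sum((x - y) ** 2 for x, y in zip(original, approx)) / len(original)


if __name__ == "__main__":
    # Hypothetical quality string, not taken from any real dataset.
    scores = decode_qualities("IIIIHHGF@5+#IIHGF@@5")
    for width in (1, 4, 8, 16):
        approx = quantize_uniform(scores, width)
        print(width, empirical_entropy(approx), mean_squared_error(scores, approx))
```

Wider bins leave fewer distinct symbols, so the entropy cannot increase, while the distortion generally grows; a bin width of 1 reproduces the scores exactly.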

Year: 2014; PMID: 24728856; DOI: 10.1093/bioinformatics/btu183

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937


  20 in total

1.  Aligned genomic data compression via improved modeling.

Authors:  Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  J Bioinform Comput Biol       Date:  2014-12       Impact factor: 1.122

2.  QVZ: lossy compression of quality values.

Authors:  Greg Malysa; Mikel Hernaez; Idoia Ochoa; Milind Rao; Karthik Ganesan; Tsachy Weissman
Journal:  Bioinformatics       Date:  2015-05-28       Impact factor: 6.937

3.  Quality score compression improves genotyping accuracy.

Authors:  Y William Yu; Deniz Yorukoglu; Jian Peng; Bonnie Berger
Journal:  Nat Biotechnol       Date:  2015-03       Impact factor: 54.908

4.  Effect of lossy compression of quality scores on variant calling.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Brief Bioinform       Date:  2017-03-01       Impact factor: 11.622

5.  An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values.

Authors:  Claudio Alberti; Noah Daniels; Mikel Hernaez; Jan Voges; Rachel L Goldfeder; Ana A Hernandez-Lopez; Marco Mattavelli; Bonnie Berger
Journal:  Proc Data Compress Conf       Date:  2016-12-19

6.  CALQ: compression of quality values of aligned sequencing data.

Authors:  Jan Voges; Jörn Ostermann; Mikel Hernaez
Journal:  Bioinformatics       Date:  2018-05-15       Impact factor: 6.937

7.  A cluster-based approach to compression of Quality Scores.

Authors:  Mikel Hernaez; Idoia Ochoa; Tsachy Weissman
Journal:  Proc Data Compress Conf       Date:  2016-12-19

8.  Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Authors:  Shubham Chandak; Kedar Tatwawadi; Tsachy Weissman
Journal:  Bioinformatics       Date:  2018-02-15       Impact factor: 6.937

9.  CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores.

Authors:  I Ochoa; A No; M Hernaez; T Weissman
Journal:  Proc Inf Theory Workshop       Date:  2016-10-27

10.  Denoising of Quality Scores for Boosted Inference and Reduced Storage.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Proc Data Compress Conf       Date:  2016-12-19