
QVZ: lossy compression of quality values.

Greg Malysa, Mikel Hernaez, Idoia Ochoa, Milind Rao, Karthik Ganesan, Tsachy Weissman.

Abstract

MOTIVATION: Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values present in genomic data files (e.g. FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit.
RESULTS: The proposed algorithm, QVZ, exhibits better rate-distortion performance than previously proposed algorithms for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data yield better genotyping performance than data compressed with previously proposed algorithms, in the sense that, at a similar rate, the resulting genotyping is closer to that achieved with the original quality values.
AVAILABILITY AND IMPLEMENTATION: QVZ is written in C and can be downloaded from https://github.com/mikelhernaez/qvz.
CONTACT: mhernaez@stanford.edu or gmalysa@stanford.edu or iochoa@stanford.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
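
The rate-distortion trade-off mentioned in the abstract can be illustrated with a small sketch. This is not the QVZ algorithm: it swaps in a plain uniform quantizer, and the function names, the example quality string, the Phred+33 encoding and the bin widths are all assumptions chosen for illustration. It shows how coarser quantization of quality values lowers the rate (approximated here by empirical entropy in bits per value) while raising the distortion, with the distortion function passed in by the caller.

```python
import math
from collections import Counter

def decode_phred33(qual_line):
    """Convert a FASTQ quality string (assumed Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in qual_line]

def quantize(scores, step):
    """Uniform quantizer (illustrative only): map each score to a
    representative value inside its bin of width `step`."""
    return [(q // step) * step + (step - 1) // 2 for q in scores]

def average_distortion(original, reconstructed, d):
    """Average per-symbol distortion under a caller-supplied function d(a, b)."""
    return sum(d(a, b) for a, b in zip(original, reconstructed)) / len(original)

def empirical_entropy(symbols):
    """Empirical entropy in bits per symbol, a rough proxy for the rate."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def squared_error(a, b):
    """One possible distortion function; others can be passed instead."""
    return (a - b) ** 2

# Hypothetical quality string, not taken from any real data set.
scores = decode_phred33("IIIIHHGG##")

for step in (1, 4, 8):
    rec = quantize(scores, step)
    rate = empirical_entropy(rec)
    dist = average_distortion(scores, rec, squared_error)
    print(f"step={step} rate={rate:.3f} bits/value distortion={dist:.3f}")
```

With `step=1` the quantizer is the identity, so the distortion is zero and the rate equals the entropy of the original scores. Larger steps merge neighbouring scores, which can only keep or reduce the empirical entropy.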

Year:  2015        PMID: 26026138      PMCID: PMC5856090          DOI: 10.1093/bioinformatics/btv330

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


References: 22 in total

1.  OnlineCall: fast online parameter estimation and base calling for Illumina's next-generation sequencing.

Authors:  Shreepriya Das; Haris Vikalo
Journal:  Bioinformatics       Date:  2012-05-07       Impact factor: 6.937

2.  A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2011-09-08       Impact factor: 6.937

Review 3.  Sequencing technologies - the next generation.

Authors:  Michael L Metzker
Journal:  Nat Rev Genet       Date:  2009-12-08       Impact factor: 53.242

4.  Adaptive reference-free compression of sequence quality scores.

Authors:  Lilian Janin; Giovanna Rosone; Anthony J Cox
Journal:  Bioinformatics       Date:  2013-05-09       Impact factor: 6.937

5.  The DNA Data Deluge: Fast, efficient genome sequencing machines are spewing out more data than geneticists can analyze.

Authors:  Michael C Schatz; Ben Langmead
Journal:  IEEE Spectr       Date:  2013-07       Impact factor: 2.875

6.  Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification.

Authors:  Y William Yu; Deniz Yorukoglu; Bonnie Berger
Journal:  Res Comput Mol Biol       Date:  2014-04

7.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

8.  Deploying whole genome sequencing in clinical practice and public health: meeting the challenge one bin at a time.

Authors:  Jonathan S Berg; Muin J Khoury; James P Evans
Journal:  Genet Med       Date:  2011-06       Impact factor: 8.822

9.  A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors:  Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal:  Nat Genet       Date:  2011-04-10       Impact factor: 38.330

10.  Sequencing and assembly of the 22-Gb loblolly pine genome.

Authors:  Aleksey Zimin; Kristian A Stevens; Marc W Crepeau; Ann Holtz-Morris; Maxim Koriabine; Guillaume Marçais; Daniela Puiu; Michael Roberts; Jill L Wegrzyn; Pieter J de Jong; David B Neale; Steven L Salzberg; James A Yorke; Charles H Langley
Journal:  Genetics       Date:  2014-03       Impact factor: 4.562

Cited by: 18 in total

1.  SPRING: a next-generation compressor for FASTQ data.

Authors:  Shubham Chandak; Kedar Tatwawadi; Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2019-08-01       Impact factor: 6.937

Review 2.  Towards precision medicine.

Authors:  Euan A Ashley
Journal:  Nat Rev Genet       Date:  2016-08-16       Impact factor: 53.242

3.  Effect of lossy compression of quality scores on variant calling.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Brief Bioinform       Date:  2017-03-01       Impact factor: 11.622

4.  An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values.

Authors:  Claudio Alberti; Noah Daniels; Mikel Hernaez; Jan Voges; Rachel L Goldfeder; Ana A Hernandez-Lopez; Marco Mattavelli; Bonnie Berger
Journal:  Proc Data Compress Conf       Date:  2016-12-19

5.  CALQ: compression of quality values of aligned sequencing data.

Authors:  Jan Voges; Jörn Ostermann; Mikel Hernaez
Journal:  Bioinformatics       Date:  2018-05-15       Impact factor: 6.937

6.  A cluster-based approach to compression of Quality Scores.

Authors:  Mikel Hernaez; Idoia Ochoa; Tsachy Weissman
Journal:  Proc Data Compress Conf       Date:  2016-12-19

7.  Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Authors:  Shubham Chandak; Kedar Tatwawadi; Tsachy Weissman
Journal:  Bioinformatics       Date:  2018-02-15       Impact factor: 6.937

8.  CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores.

Authors:  I Ochoa; A No; M Hernaez; T Weissman
Journal:  Proc Inf Theory Workshop       Date:  2016-10-27

9.  GeneComp, a new reference-based compressor for SAM files.

Authors:  Reggy Long; Mikel Hernaez; Idoia Ochoa; Tsachy Weissman
Journal:  Proc Data Compress Conf       Date:  2017-05-11

10.  Denoising of Quality Scores for Boosted Inference and Reduced Storage.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Proc Data Compress Conf       Date:  2016-12-19