MOTIVATION: Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values present in genomic data files (e.g. FASTQ and SAM files), which account for roughly half of the storage space (in the uncompressed domain). Lossy compression allows data to be compressed beyond its lossless limit.

RESULTS: The proposed algorithm, QVZ, exhibits better rate-distortion performance than previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data perform better in genotyping than data compressed with previously proposed algorithms, in the sense that, at a similar rate, the resulting genotyping is closer to that achieved with the original quality values.

AVAILABILITY AND IMPLEMENTATION: QVZ is written in C and can be downloaded from https://github.com/mikelhernaez/qvz.

CONTACT: mhernaez@stanford.edu, gmalysa@stanford.edu or iochoa@stanford.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.