GeneComp, a new reference-based compressor for SAM files.

Reggy Long, Mikel Hernaez, Idoia Ochoa, Tsachy Weissman.

Abstract

The affordability of DNA sequencing has led to unprecedented volumes of genomic data, which must be stored, processed, and analyzed. The most popular format for genomic data is the SAM format, which contains information such as alignments and quality values. These files are large (on the order of terabytes), which necessitates compression. In this work we propose GeneComp, a new reference-based compressor for SAM files that can accommodate different levels of compression based on the specific needs of the user. In particular, GeneComp allows the user to perform lossy compression of the quality scores, which have been shown to occupy more than half of the compressed file when losslessly compressed. We show that GeneComp overall achieves better compression ratios than previously proposed algorithms when operating in lossless mode.
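
The two ideas named in the abstract, reference-based encoding of reads and lossy compression of quality scores, can be illustrated with a minimal Python sketch. This is not GeneComp's algorithm: the function names (`encode_read`, `decode_read`, `quantize_qualities`), the substitution-only mismatch model, and the uniform binning with a `bin_width` parameter are all assumptions chosen for illustration, and no entropy coding is applied.

```python
from typing import List, Tuple

Mismatch = Tuple[int, str]  # (offset within the read, base observed in the read)


def encode_read(reference: str, pos: int, read: str) -> Tuple[int, int, List[Mismatch]]:
    """Describe an aligned read by its position, length, and substitutions
    relative to the reference (hypothetical, substitution-only model)."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return pos, len(read), mismatches


def decode_read(reference: str, pos: int, length: int, mismatches: List[Mismatch]) -> str:
    """Rebuild the read from the reference and the stored substitutions."""
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


def quantize_qualities(quals: List[int], bin_width: int = 10) -> List[int]:
    """Lossy step: replace each Phred score with the midpoint of its bin
    (uniform binning is an assumption, not the paper's quantizer)."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in quals]


reference = "ACGTACGTACGT"
read = "ACGAAC"  # aligned at position 4, one substitution
encoded = encode_read(reference, 4, read)
restored = decode_read(reference, *encoded)
binned = quantize_qualities([2, 17, 30, 39])
```

The read is stored as a position, a length, and a short list of differences rather than as its full base string, which is where the savings of a reference-based scheme come from. The quality scores, in contrast, cannot be recovered exactly after binning, so that step trades fidelity for a smaller alphabet.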

Year:  2017        PMID: 29046896      PMCID: PMC5641594          DOI: 10.1109/DCC.2017.76

Source DB:  PubMed          Journal:  Proc Data Compress Conf        ISSN: 2375-0383


  11 in total

1.  Aligned genomic data compression via improved modeling.

Authors:  Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  J Bioinform Comput Biol       Date:  2014-12       Impact factor: 1.122

2.  Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors:  Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal:  Genome Res       Date:  2011-01-18       Impact factor: 9.043

3.  QVZ: lossy compression of quality values.

Authors:  Greg Malysa; Mikel Hernaez; Idoia Ochoa; Milind Rao; Karthik Ganesan; Tsachy Weissman
Journal:  Bioinformatics       Date:  2015-05-28       Impact factor: 6.937

4.  DeeZ: reference-based compression by local assembly.

Authors:  Faraz Hach; Ibrahim Numanagić; S Cenk Sahinalp
Journal:  Nat Methods       Date:  2014-11       Impact factor: 28.547

5.  CSAM: Compressed SAM format.

Authors:  Rodrigo Cánovas; Alistair Moffat; Andrew Turpin
Journal:  Bioinformatics       Date:  2016-08-18       Impact factor: 6.937

6.  Effect of lossy compression of quality scores on variant calling.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Brief Bioinform       Date:  2017-03-01       Impact factor: 11.622

7.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

8.  NGC: lossless and lossy compression of aligned high-throughput sequencing data.

Authors:  Niko Popitsch; Arndt von Haeseler
Journal:  Nucleic Acids Res       Date:  2012-10-12       Impact factor: 16.971

9.  The Scramble conversion tool.

Authors:  James K Bonfield
Journal:  Bioinformatics       Date:  2014-06-14       Impact factor: 6.937

10.  CARGO: effective format-free compressed storage of genomic information.

Authors:  Łukasz Roguski; Paolo Ribeca
Journal:  Nucleic Acids Res       Date:  2016-04-29       Impact factor: 16.971
