Literature DB >> 23221092

KungFQ: a simple and powerful approach to compress fastq files.

Elena Grassi1, Federico Di Gregorio, Ivan Molineris.   

Abstract

Nowadays storing data derived from deep sequencing experiments has become pivotal and standard compression algorithms do not exploit in a satisfying manner their structure. A number of reference-based compression algorithms have been developed but they are less adequate when approaching new species without fully sequenced genomes or nongenomic data. We developed a tool that takes advantages of fastq characteristics and encodes them in a binary format optimized in order to be further compressed with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file, it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, losing some of the original information (IDs and/or qualities) but resulting in smaller files; it is also possible to define a quality cutoff under which corresponding base calls are converted to N. We achieve 2.82 to 7.77 compression ratios on various fastq files without losing information and 5.37 to 8.77 losing IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm performance with known tools, usually obtaining higher compression levels.

Entities:  

Mesh:

Year:  2012        PMID: 23221092     DOI: 10.1109/TCBB.2012.123

Source DB:  PubMed          Journal:  IEEE/ACM Trans Comput Biol Bioinform        ISSN: 1545-5963            Impact factor:   3.710


  2 in total

1.  Data compression for sequencing data.

Authors:  Sebastian Deorowicz; Szymon Grabowski
Journal:  Algorithms Mol Biol       Date:  2013-11-18       Impact factor: 1.405

2.  Compression of FASTQ and SAM format sequencing data.

Authors:  James K Bonfield; Matthew V Mahoney
Journal:  PLoS One       Date:  2013-03-22       Impact factor: 3.240

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.