KungFQ: a simple and powerful approach to compress fastq files.

Elena Grassi, Federico Di Gregorio, Ivan Molineris.

Abstract

Storing data derived from deep sequencing experiments has become pivotal, and standard compression algorithms do not satisfactorily exploit the structure of these data. A number of reference-based compression algorithms have been developed, but they are less adequate for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized to be further compressed with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, which discards some of the original information (IDs and/or qualities) but results in smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information, and 5.37 to 8.77 when discarding IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with known tools, usually obtaining higher compression levels.
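The lossy options described in the abstract (dropping IDs, masking low-quality base calls as N, then handing the result to a standard compressor) can be illustrated with a minimal Python sketch. This is not KungFQ's actual binary format or code: the function names, the tab-separated intermediate layout, and the Phred+33 quality offset are all assumptions made for illustration. The sketch only mirrors the single-pass, one-record-at-a-time structure the abstract mentions.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) one record at a time (single pass)."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def mask_low_quality(seq, qual, cutoff, offset=33):
    """Replace base calls whose quality is below `cutoff` with N.

    Assumes Phred+33 quality encoding (hypothetical default).
    """
    return "".join(
        base if ord(q) - offset >= cutoff else "N"
        for base, q in zip(seq, qual)
    )


def compress(fastq_text, cutoff=0, keep_ids=True):
    """Stream records through optional lossy steps into a gzip buffer.

    The tab-separated layout is a stand-in for illustration only,
    not the binary encoding used by the tool.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as out:
        for read_id, seq, qual in read_fastq(io.StringIO(fastq_text)):
            seq = mask_low_quality(seq, qual, cutoff)
            fields = ([read_id] if keep_ids else []) + [seq, qual]
            out.write(("\t".join(fields) + "\n").encode("ascii"))
    return buf.getvalue()


sample = "@r1\nACGT\n+\nI!I#\n"
packed = compress(sample, cutoff=10, keep_ids=False)
print(gzip.decompress(packed))
```

Because each record is processed and written before the next one is read, the working memory of the loop does not grow with the number of reads, which is the property the abstract refers to as a constant memory requirement.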

Year: 2012 PMID: 23221092 DOI: 10.1109/TCBB.2012.123

Source DB: PubMed Journal: IEEE/ACM Trans Comput Biol Bioinform ISSN: 1545-5963 Impact factor: 3.710

Cited by

2 in total

1. Data compression for sequencing data.

Authors: Sebastian Deorowicz; Szymon Grabowski
Journal: Algorithms Mol Biol Date: 2013-11-18 Impact factor: 1.405

2. Compression of FASTQ and SAM format sequencing data.

Authors: James K Bonfield; Matthew V Mahoney
Journal: PLoS One Date: 2013-03-22 Impact factor: 3.240