Shubham Chandak1, Kedar Tatwawadi1, Idoia Ochoa2, Mikel Hernaez3, Tsachy Weissman1. 1. Department of Electrical Engineering, Stanford University, Stanford, CA, USA. 2. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA. 3. Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Abstract
MOTIVATION: High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack in support of one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression. RESULTS: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. AVAILABILITY AND IMPLEMENTATION: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack in support of one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression. RESULTS: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. AVAILABILITY AND IMPLEMENTATION:SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Authors: Ibrahim Numanagić; James K Bonfield; Faraz Hach; Jan Voges; Jörn Ostermann; Claudio Alberti; Marco Mattavelli; S Cenk Sahinalp Journal: Nat Methods Date: 2016-10-24 Impact factor: 28.547