Lukasz Roguski1,2, Idoia Ochoa3, Mikel Hernaez4, Sebastian Deorowicz5. 1. Centro Nacional de Análisis Genómico-Centre for Genomic Regulation, Barcelona Institute of Science and Technology (CNAG-CRG), Barcelona, Spain. 2. Experimental and Health Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain. 3. Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA. 4. Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL, USA. 5. Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland.
Abstract
Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. Results: FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance. Availability and implementation: FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. Results: FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance. Availability and implementation: FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information: Supplementary data are available at Bioinformatics online.