Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences.

Kirill Kryukov¹, Mahoko Takahashi Ueda², So Nakagawa¹,², Tadashi Imanishi¹.

Abstract

SUMMARY: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA and FASTQ-formatted nucleotide sequences. The compression ratio of NAF is comparable to that of the best DNA compressors, while its decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general purpose compressors gzip, bzip2, xz, brotli and zstd.
AVAILABILITY AND IMPLEMENTATION: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30799504 PMCID: PMC6761962 DOI: 10.1093/bioinformatics/btz144

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

DNA sequence databases are growing exponentially, owing to the continuing advances in sequencing technologies. Data compression is typically used for all stored DNA sequence data to save storage space and network transmission time. In 1993, the first specialized DNA compressor was proposed (Grumbach and Tahi, 1993). Since then, numerous DNA compressors have been developed (e.g. Al-Okaily et al.; Benoit et al.; Cao et al.; Li et al.). In our experience only two compressors pass the practicality threshold: DELIMINATE (Mohammed et al.) and MFCompress (Pinho and Pratas, 2014). They are stable, support commonly used features of the FASTA format, and are efficient enough to handle practical tasks such as compressing (and decompressing) entire vertebrate genomes. Although they achieve impressive compression ratios, both DELIMINATE and MFCompress have very slow decompression, which significantly limits their usefulness with large databases.

Despite the numerous studies on DNA compression, the majority of sequence databases currently continue to rely on gzip (https://www.gzip.org/). We attribute this enduring popularity to gzip’s wide availability, robustness and speed (especially decompression speed). These qualities appear to outweigh gzip’s less than stellar compression ratio. Other popular general purpose compressors have been developed, such as bzip2 (http://www.bzip.org/) and lzma/xz (https://tukaani.org/xz/format.html). In addition, recent years have seen the emergence of a new generation of advanced general purpose compressors, most notably brotli (https://github.com/google/brotli) and zstd (https://github.com/facebook/zstd). These compressors improve upon gzip’s performance, but still cannot match specialized DNA compressors in compression strength.

In this work we describe a new DNA compression format called Nucleotide Archival Format (NAF). NAF provides a state of the art compression ratio, on par with DELIMINATE and slightly behind MFCompress. At the same time, it provides 30–80 times faster decompression. NAF compresses and decompresses from/to FASTA and FASTQ formats. NAF supports masked sequence and ambiguous IUPAC nucleotide codes. NAF has no restrictions on sequence length, and does not require reference sequences.

2 Materials and methods

Analogously to many previous methods, including DELIMINATE and MFCompress, NAF compression operates by splitting the input into headers, sequences, mask (in the case of masked sequence) and qualities (in the case of FASTQ input), which are processed separately. Sequences are concatenated together, with lengths stored separately. The combined sequence is then converted into a 4-bit encoding, which stores two nucleotides per byte. This representation is extremely fast, for both encoding and decoding, and allows natively representing ambiguous IUPAC nucleotide codes (NC-IUB, 1985), including ‘N’, ‘Y’, ‘R’ etc. All the resulting data streams are then compressed with the general purpose compressor zstd. Decompression consists of decompressing those separate streams and re-assembling them into FASTA or FASTQ-formatted output.

NAF’s high decompression speed is due to three factors: (i) it uses the zstd compressor, which is itself designed for fast decompression; (ii) in NAF, zstd works with the 4-bit encoded sequence, which means that it deals with data half the size of the original sequence; (iii) decoding of the 4-bit sequence is very fast, using a simple table lookup for pairs of nucleotides.

The NAF implementation provides an interface that is friendly to automated sequence analysis workflows. The NAF compressor reads data from the standard input stream, enabling on-the-fly compression of data originating from another process. Similarly, the NAF decompressor allows piping decompressed sequences directly into the next analysis step. In addition, NAF allows decompressing only the headers or only the sequences, as well as outputting the 4-bit encoded sequence. This will allow applications such as sequence search or composition analysis to work directly with the 4-bit encoded sequence.
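The stream splitting and 4-bit encoding described above can be illustrated with a minimal Python sketch. It is not the NAF implementation: the code table `CODES`, the run-length mask representation, the fixed output line width and all function names are assumptions made for illustration, and the zstd compression of each stream is omitted. The authoritative layout is the format specification in the NAF repository.

```python
# Hypothetical 4-bit code table (an assumption, NOT taken from the NAF
# specification): A=1, C=2, G=4, T=8, and each ambiguous IUPAC code is
# the bitwise OR of the bases it can stand for. Index 0 is the gap '-'.
CODES = "-ACMGRSVTWYHKDBN"
ENCODE = {ch: i for i, ch in enumerate(CODES)}
# 256-entry table: one packed byte decodes to a pair of nucleotides.
DECODE_PAIR = [CODES[b & 0x0F] + CODES[b >> 4] for b in range(256)]

def split_fasta(text):
    """Split FASTA text into headers, per-record lengths and one joined sequence."""
    headers, lengths, chunks = [], [], []
    for line in text.splitlines():
        if line.startswith(">"):
            headers.append(line[1:])
            lengths.append(0)
        elif line:
            chunks.append(line)
            lengths[-1] += len(line)
    return headers, lengths, "".join(chunks)

def mask_runs(seq):
    """Lengths of alternating unmasked/masked (lowercase) runs, unmasked first."""
    runs, masked, count = [], False, 0
    for ch in seq:
        if ch.islower() != masked:
            runs.append(count)
            masked, count = not masked, 0
        count += 1
    runs.append(count)
    return runs

def pack_4bit(seq):
    """Store two nucleotides per byte (low nibble first)."""
    codes = [ENCODE[ch] for ch in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the final half-byte
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_4bit(data, length):
    """Decode by table lookup, one byte (two nucleotides) at a time."""
    return "".join(DECODE_PAIR[b] for b in data)[:length]

def apply_mask(seq, runs):
    """Restore lowercase (masked) regions from the run lengths."""
    out, pos, masked = [], 0, False
    for n in runs:
        part = seq[pos:pos + n]
        out.append(part.lower() if masked else part)
        pos, masked = pos + n, not masked
    return "".join(out)

def rebuild_fasta(headers, lengths, seq, width=60):
    """Re-assemble the streams into FASTA text with a fixed line width."""
    lines, pos = [], 0
    for header, n in zip(headers, lengths):
        lines.append(">" + header)
        record = seq[pos:pos + n]
        lines.extend(record[i:i + width] for i in range(0, n, width))
        pos += n
    return "\n".join(lines) + "\n"

fasta = ">seq1 demo\nACGTNNryacgt\n>seq2\nGGKMT\n"
headers, lengths, seq = split_fasta(fasta)
runs = mask_runs(seq)
packed = pack_4bit(seq)  # in NAF, each stream would then be compressed with zstd
restored = rebuild_fasta(headers, lengths,
                         apply_mask(unpack_4bit(packed, len(seq)), runs))
```

In this toy example the 17 nucleotides occupy 9 bytes after packing, and `restored` is identical to the input, which is the lossless round trip the format is designed for.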

3 Results

We compared NAF with the DNA compressors DELIMINATE and MFCompress, as well as with the general purpose compressors gzip, bzip2, xz, brotli and zstd. Figure 1, Supplementary Figures S1 and S2 and Table S1 show their results on the human genome. Supplementary Figure S3 and Table S2 show overall compression ratios on a larger set of genomes. Supplementary Table S3 shows the results on FASTQ data. Supplementary Table S4 compares the availability and features of these compressors.

Fig. 1. Compression strength and decompression speed of eight compressors. Human genome (GRCh38, 3.3 GB) was used as the test dataset. ‘mfc’ and ‘dlim’ represent MFCompress and DELIMINATE, respectively. Each compressor was used with its strongest compression setting: ‘gzip -9’, ‘bzip2 -9’, ‘brotli -11’, ‘zstd --ultra -22’, ‘xz -e9’, ‘ennaf -22’, ‘delim a’, ‘MFCompressC -3’. CPU used: Intel Xeon E5-2643v3 (3.4 GHz).

The compression strength of NAF is close to that of DELIMINATE and slightly behind that of MFCompress. All three DNA compressors achieve a markedly better compression ratio than the general purpose compressors. However, what sets NAF apart is its high decompression speed. In the human genome example, NAF decompression is faster than DELIMINATE and MFCompress by factors of 35 and 78, respectively. NAF also provides high compression speed in its fast mode, selected with the ‘-1’ option of the compressor (Supplementary Fig. S2 and Tables S1 and S3).

When considering the typical application of compression for distributing data from sequence databases, we can estimate, for different compressors, the total time from initiating the download until accessing the decompressed data. Supplementary Figure S1 compares access times for the eight compressors, as well as for the uncompressed FASTA format, assuming link speeds of 100 and 1000 Mbit/s. It can be seen that NAF enables the fastest distribution of data over a network, reducing waiting time and bandwidth cost (in addition to reducing the required storage space). In Supplementary Figure S2, we compared the total transfer time including compression, for a hypothetical scenario of one-time data transfer. In this case the fast setting (‘-1’ option) of NAF provides the best total transfer time. As Supplementary Tables S1 and S3 show, NAF’s fast mode retains useful compression strength, which is still better than that of gzip’s strong mode (‘-9’).
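The access-time estimate discussed above can be sketched as simple arithmetic. The following Python snippet assumes the simplest possible model, in which the download finishes before decompression starts; the model used for Supplementary Figure S1 may differ. The function name and the placeholder compressor entries are hypothetical, and only the 3.3 GB dataset size and the two link speeds come from the text.

```python
def access_time_s(file_bytes, link_mbit_s, decompress_s=0.0):
    """Seconds from starting the download until decompressed data is available.

    Assumes the download completes before decompression begins.
    """
    download_s = file_bytes * 8 / (link_mbit_s * 1_000_000)
    return download_s + decompress_s

GENOME_BYTES = 3_300_000_000  # uncompressed human genome FASTA, 3.3 GB

# Uncompressed FASTA: download time only, at the two link speeds from the text.
for link in (100, 1000):
    print(f"uncompressed at {link} Mbit/s: {access_time_s(GENOME_BYTES, link):.1f} s")

# Placeholder entries (compressed size in bytes, decompression time in seconds).
# These numbers are made up for illustration and are NOT measurements.
hypothetical = {
    "compressor_x": (900_000_000, 20.0),
    "compressor_y": (700_000_000, 600.0),
}
for name, (size, dec_s) in hypothetical.items():
    print(f"{name} at 100 Mbit/s: {access_time_s(size, 100, dec_s):.1f} s")
```

Under this model a stronger but slower compressor can lose to a slightly weaker one with fast decompression, which is the trade-off the comparison in the text highlights.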

4 Conclusion

NAF offers significant advantages over both general purpose and specialized DNA compressors. NAF’s combination of compactness and decompression speed enables rapid distribution of database sequences to users. NAF’s fast decompression allows databases to be stored in NAF-compressed form and decompressed on-the-fly when necessary. NAF also provides a useful combination of compactness and high compression speed in fast mode, making it well suited for one-time data transfer. We believe that our new format will save network bandwidth, time and disk space, and thus greatly benefit both operators and users of DNA sequence databases.

Conflict of Interest: none declared.

References

1. Nomenclature Committee of the International Union of Biochemistry (NC-IUB). Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984.

Authors:
Journal: Eur J Biochem Date: 1985-07-01

2. DELIMINATE--a fast and efficient method for loss-less compression of genomic sequences: sequence analysis.

Authors: Monzoorul Haque Mohammed; Anirban Dutta; Tungadri Bose; Sudha Chadaram; Sharmila S Mande
Journal: Bioinformatics Date: 2012-07-25 Impact factor: 6.937

3. Toward a Better Compression for DNA Sequences Using Huffman Encoding.

Authors: Anas Al-Okaily; Badar Almarri; Sultan Al Yami; Chun-Hsi Huang
Journal: J Comput Biol Date: 2016-12-13 Impact factor: 1.479

4. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.

Authors: Gaëtan Benoit; Claire Lemaitre; Dominique Lavenier; Erwan Drezen; Thibault Dayris; Raluca Uricaru; Guillaume Rizk
Journal: BMC Bioinformatics Date: 2015-09-14 Impact factor: 3.169

5. MFCompress: a compression tool for FASTA and multi-FASTA data.

Authors: Armando J Pinho; Diogo Pratas
Journal: Bioinformatics Date: 2013-10-16 Impact factor: 6.937

6. DNA-COMPACT: DNA COMpression based on a pattern-aware contextual modeling technique.

Authors: Pinghao Li; Shuang Wang; Jihoon Kim; Hongkai Xiong; Lucila Ohno-Machado; Xiaoqian Jiang
Journal: PLoS One Date: 2013-11-25 Impact factor: 3.240
