Kirill Kryukov, Lihua Jin, So Nakagawa.
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome data are essential for epidemiology, vaccine development, and tracking emerging variants. Millions of SARS-CoV-2 genomes have been sequenced during the pandemic. However, downloading SARS-CoV-2 genomes from databases is slow and unreliable, largely due to a suboptimal choice of compression method. We evaluated the available compressors and found that Nucleotide Archival Format (NAF) would provide a drastic improvement over current methods. For the pre-compressed datasets of the Global Initiative on Sharing Avian Flu Data (GISAID), NAF would increase efficiency 52.2-fold relative to gzip-compressed data and 3.7-fold relative to xz-compressed data. For DNA DataBank of Japan (DDBJ), NAF would improve throughput 40-fold relative to gzip-compressed data. For GenBank and European Nucleotide Archive (ENA), NAF would accelerate data distribution by a factor of 29.3 compared with uncompressed FASTA. This article provides a tutorial for installing and using NAF. Offering a NAF download option in sequence databases would save significant time, bandwidth, and disk space and accelerate biological and medical research worldwide.
Keywords: GISAID; Global Initiative on Sharing Avian Flu Data; INSDC; International Nucleotide Sequence Database Collaboration; NAF; Nucleotide Archival Format; SARS-CoV-2 genome database; sequence data compression
Year: 2022 PMID: 35818472 PMCID: PMC9259476 DOI: 10.1016/j.patter.2022.100562
Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899
Figure 1. Comparison of compressors on SARS-CoV-2 genomes
The test dataset is 100,000 SARS-CoV-2 genomes (3.05 GB) selected randomly from the entire set of SARS-CoV-2 genomes downloaded from GenBank on January 17, 2022, which is available in the Sequence Compression Benchmark database (http://kirr.dyndns.org/sequence-compression-benchmark/).
(A–C) The best settings of each compressor are selected according to three measures: compression ratio (A), transfer + decompression speed (B), and single-threaded compression + transfer + decompression speed (C). A link speed of 100 Mbit/s is assumed when calculating transfer time.
Figure 2. Comparison of best-performing settings of gzip, xz, and naf
(A and B) Performance in terms of time (on a log scale) required to complete transfer + decompression (A) and compression + transfer + decompression (B).
(C and D) The same results but in terms of speed (in MB/s), computed as uncompressed data size divided by the time required for transfer + decompression (C) and compression + transfer + decompression (D).
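The speed measure described in these captions can be sketched in a few lines of Python. This is only an illustration of the stated definition (uncompressed size divided by total time), not the benchmark's actual code. The function names are hypothetical, and the unit conventions (1 MB = 10^6 bytes, 1 Mbit = 10^6 bits) are assumptions, since the captions do not spell them out.

```python
# Sketch of the speed measure from the figure captions.
# Assumptions (not stated in the source): 1 MB = 1e6 bytes, 1 Mbit = 1e6 bits.
# Function names are hypothetical and chosen for illustration only.

def transfer_time_s(compressed_bytes, link_mbit_per_s=100.0):
    """Seconds needed to send `compressed_bytes` over the link."""
    return compressed_bytes * 8 / (link_mbit_per_s * 1e6)


def effective_speed_mb_s(uncompressed_bytes, compressed_bytes,
                         decompress_s, compress_s=0.0,
                         link_mbit_per_s=100.0):
    """Uncompressed size (MB) divided by the total time in seconds.

    With compress_s left at 0 this is transfer + decompression speed;
    passing compress_s gives compression + transfer + decompression speed.
    """
    total_s = (compress_s
               + transfer_time_s(compressed_bytes, link_mbit_per_s)
               + decompress_s)
    return uncompressed_bytes / 1e6 / total_s
```

As a usage example with made-up numbers (not measurements from the study): a 1 GB file that compresses to 100 MB and decompresses in 2 s spends 8 s in transfer, so `effective_speed_mb_s(1e9, 1e8, 2.0)` evaluates to 100 MB/s. Sending the same file uncompressed (`effective_speed_mb_s(1e9, 1e9, 0.0)`) is limited to the raw link speed of 12.5 MB/s.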