Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi.
Abstract
BACKGROUND: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available.
Keywords: DNA; RNA; benchmark; compression; database; genome; protein; sequence
Year: 2020 PMID: 32627830 PMCID: PMC7336184 DOI: 10.1093/gigascience/giaa072
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Compressor versions
| A) Specialized sequence compressors | |
|---|---|
| Compressor | Version |
| 2bit | "faToTwoBit" and "twoBitToFa" binaries dated 7 November 2018 |
| ac | AC 1.1, 29 January 2020 |
| alapy | ALAPY 1.3.0, 25 July 2017 |
| beetl | BEETL, commit 327cc65, 14 November 2019 |
| blast | "convert2blastmask", "makeblastdb", and "blastdbcmd" binaries from BLAST 2.8.1+, 26 November 2018 |
| dcom | DNA-COMPACT, latest public source 29 August 2013 |
| dlim | DELIMINATE, version 1.3c, 2012 |
| dnaX | dnaX 0.1.0, 3 August 2014 |
| dsrc | DSRC 2.02, commit 5eda82c, 4 June 2015 |
| fastqz | fastqz 1.5, commit 39b2bbc, 15 March 2012 |
| fqs | FQSqueezer 0.1, commit 5741fc5, 17 May 2019 |
| fqzcomp | fqzcomp 4.6, commit 96f2f61, 2 December 2019 |
| geco | GeCo: v.2.1, 24 December 2016; GeCo2: v.1.1, 2 February 2019 |
| gtz | GTX.Zip PROFESSIONAL-2.1.3-V-2020-03-18 07:11:20, binary |
| harc | HARC, commit cf35caf, 4 October 2019 |
| jarvis | JARVIS v.1.1, commit d7daef5, 30 April 2019 |
| kic | KIC binary, 0.2, 25 November 2015 |
| leon | Leon, 1.0.0, 27 February 2016, Linux binary |
| lfastqc | LFastqC, commit 60e5fda, 28 February 2019, with fixes |
| lfqc | LFQC, commit 59f56e0, 6 January 2016 |
| mfc | MFCompress, 1.01, 3 September 2013, 64-bit Linux binary |
| minicom | Minicom, commit 2360dd9, 9 September 2019 |
| naf | NAF, 1.1.0, 1 October 2019 |
| nuht | NUHT, commit 08a42a8, 26 September 2018, Linux binary |
| pfish | Pufferfish, v.1.0 alpha, 11 April 2012 |
| quip | Quip, commit 9165bb5, 1.1.8-8-g9165bb5, 17 December 2017 |
| spring | SPRING, commit 6536b1b, 28 November 2019 |
| uht | UHT, binary from 27 December 2016 |
| xm | XM (eXpert-Model), 3.0, commit 9b9ea57, 7 January 2019 |
| B) General-purpose compressors | |
| Compressor | Version |
| bcm | 1.30, 21 January 2018 |
| brieflz | 1.3.0, 15 February 2020 |
| brotli | 1.0.7, 23 October 2018 |
| bsc | 3.1.0, 1 January 2016 |
| bzip2 | 1.0.6, 6 September 2010 |
| cmix | 17, 24 March 2019 |
| gzip | 1.6, 9 June 2013 |
| lizard | 1.0.0, 8 March 2019 |
| lz4 | 1.9.1, 24 April 2019 |
| lzop | 1.04, 10 August 2017 |
| lzturbo | 1.2, 11 August 2014 |
| nakamichi | 9 May 2020 |
| pbzip2 | 1.1.13, 18 December 2015 |
| pigz | 2.4, 26 December 2017 |
| snzip | 1.0.4, 2 October 2016 |
| xz | 5.2.2, 29 September 2015 |
| zpaq | 7.15, 17 August 2016 |
| zpipe | 2.01, 23 December 2010 |
| zstd | 1.4.5, 22 May 2020 |
Test datasets
| A) Genome sequence datasets | | | | |
|---|---|---|---|---|
| Category | Organism | Accession | Size | |
| Virus | | GCF_001884535.1 | 50.7 kB | |
| Bacteria | WS1 bacterium JGI 0000059-K21 | GCA_000398605.1 | 522 kB | |
| Protist | | GCA_000211355.2 | 1.71 MB | |
| Fungus | | GCA_000988165.1 | 5.81 MB | |
| Protist | | GCA_000165345.1 | 9.22 MB | |
| Protist | | GCA_000497125.1 | 13.1 MB | |
| Protist | | GCA_001606155.1 | 23.7 MB | |
| Fungus | | GCF_000240135.3 | 36.9 MB | |
| Protist | | GCA_000188695.1 | 56.2 MB | |
| Algae | | GCA_000350225.2 | 106 MB | |
| Algae | | GCA_002205965.2 | 341 MB | |
| Animal | | GCF_000002235.4 | 1.01 GB | |
| Plant | | GCA_900067695.1 | 13.4 GB | |
| | | | | |
| Mitochondrion | 9,402 | 245 MB | RefSeq ftp: | 15 March 2019 |
| | | | | |
| NCBI Virus Complete Nucleotide Human | 36,745 | 482 MB | NCBI Virus: | 11 May 2020 |
| Influenza | 700,001 | 1.22 GB | Influenza Virus Database: | 27 April 2019 |
| Helicobacter | 108,292 | 2.76 GB | NCBI Assembly: | 24 April 2019 |
| | | | | |
| SILVA 132 LSURef | 198,843 | 610 MB | Silva database: | 11 December 2017 |
| SILVA 132 SSURef Nr99 | 695,171 | 1.11 GB | Silva database: | 11 December 2017 |
| SILVA 132 SSURef | 2,090,668 | 3.28 GB | Silva database: | 11 December 2017 |
| | | | | |
| UCSC hg38 7way knownCanonical-exonNuc | 1,470,154 | 340 MB | UCSC: | 6 June 2014 |
| UCSC hg38 20way knownCanonical-exonNuc | 4,211,940 | 969 MB | UCSC: | 30 June 2015 |
| | | | | |
| PDB | 109,914 | 67.6 MB | PDB database FTP: | 9 April 2019 |
| Homo sapiens GRCh38 | 105,961 | 73.2 MB | NCBI ftp: | 12 March 2019 |
| NCBI Virus RefSeq Protein | 373,332 | 122 MB | NCBI Virus: | 10 May 2020 |
| UniProtKB Reviewed (Swiss-Prot) | 560,118 | 277 MB | UniProt ftp: | 2 April 2019 |
Figure 1: Comparison of 36 compressors on the human genome. The best settings of each compressor are selected on the basis of different aspects of performance: (A) compression ratio, (B) transfer + decompression speed, and (C) compression + transfer + decompression speed. The copy-compressor ("cat" command), shown in red, is included as a control. The selected settings of each compressor are shown in their names, after a hyphen. Multi-threaded compressors have "-1t" or "-4t" at the end of their names to indicate the number of threads used. Test data are the 3.31 GB reference human genome (accession number GCA_000001405.28). Benchmark CPU: Intel Xeon E5-2643v3 (3.4 GHz). A link speed of 100 Mbit/sec was used for estimating the transfer time.
Figure 2: Comparison of 334 settings of 36 compressors on the human genome. Each point represents a particular setting of one compressor. A, The relationship between compression ratio and decompression speed. B, The transfer + decompression speed plotted against the compression + transfer + decompression speed. Test data are the 3.31 GB reference human genome (accession number GCA_000001405.28). Benchmark CPU: Intel Xeon E5-2643v3 (3.4 GHz). A link speed of 100 Mbit/sec was used for estimating the transfer time.
Figure 3: Comparison of compressor settings to gzip. Genome datasets were used as test data. Each point shows the performance of a compressor setting on a specific genome test dataset. All values are shown relative to a representative setting of gzip. Only results that are at least half as good as gzip on both axes are shown. A, Settings that performed best in transfer + decompression speed. B, Settings that performed best in compression + transfer + decompression speed. A link speed of 100 Mbit/sec was used for estimating the transfer time.
Figure 4: Compressor memory consumption. The strongest setting of each compressor is shown. The x-axis shows the test data size. The y-axis shows the peak memory used by the compressor, for compression (A) and decompression (B).
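The captions for Figures 1 to 3 refer to "transfer + decompression" and "compression + transfer + decompression" speeds estimated for a 100 Mbit/sec link. The following is a minimal sketch of how such figures could be computed. It assumes, without confirmation from the text above, that 1 GB means 10^9 bytes, that 1 Mbit means 10^6 bits, and that each speed is the original data size divided by the total elapsed time. The function names and the example compressor timings are hypothetical.

```python
# Sketch of the speed estimates named in the figure captions.
# Assumptions (not confirmed by the source): 1 GB = 10**9 bytes,
# 1 Mbit = 10**6 bits, speed = original size / total elapsed time.

LINK_BITS_PER_SEC = 100 * 10**6  # 100 Mbit/sec, as stated in the captions


def transfer_seconds(compressed_bytes, link_bits_per_sec=LINK_BITS_PER_SEC):
    """Estimated time to send the compressed file over the link."""
    return compressed_bytes * 8 / link_bits_per_sec


def td_speed(original_bytes, compressed_bytes, decompress_sec,
             link_bits_per_sec=LINK_BITS_PER_SEC):
    """Transfer + decompression speed, in original bytes per second."""
    total = transfer_seconds(compressed_bytes, link_bits_per_sec) + decompress_sec
    return original_bytes / total


def ctd_speed(original_bytes, compressed_bytes, compress_sec, decompress_sec,
              link_bits_per_sec=LINK_BITS_PER_SEC):
    """Compression + transfer + decompression speed, in original bytes per second."""
    total = (compress_sec
             + transfer_seconds(compressed_bytes, link_bits_per_sec)
             + decompress_sec)
    return original_bytes / total


GENOME_BYTES = 3.31e9  # the 3.31 GB human genome used in Figures 1 and 2

# Control case: "cat" leaves the data unchanged. Treating its processing
# time as zero, its speed equals the link speed itself (12.5 MB/s).
cat_td = td_speed(GENOME_BYTES, GENOME_BYTES, decompress_sec=0.0)

# Hypothetical compressor: 4x ratio, 60 s to compress, 10 s to decompress.
hypo_td = td_speed(GENOME_BYTES, GENOME_BYTES / 4, decompress_sec=10.0)
hypo_ctd = ctd_speed(GENOME_BYTES, GENOME_BYTES / 4,
                     compress_sec=60.0, decompress_sec=10.0)

print(f"cat TD speed:   {cat_td / 1e6:.1f} MB/s")
print(f"hypo TD speed:  {hypo_td / 1e6:.1f} MB/s")
print(f"hypo CTD speed: {hypo_ctd / 1e6:.1f} MB/s")
```

Under these assumptions a compressor beats the "cat" control only when the transfer time it saves exceeds the time it spends compressing and decompressing, which is why the control is a useful reference line in the figures.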