Achraf El Allali, Mariam Arshad.
Abstract
BACKGROUND: Owing to technological progress in Next Generation Sequencing (NGS), the amount of genomic data produced daily has increased tremendously. This increase has shifted the bottleneck of genomic projects from sequencing to computation, specifically to storing, managing and analyzing large amounts of NGS data. Compression tools can reduce the physical storage needed to save large amounts of genomic data as well as the bandwidth used to transfer it. Recently, DNA sequence compression has gained much attention among researchers.
Keywords: DNA compression; FASTA files; FASTQ files; Next generation sequences
Year: 2019 PMID: 31171931 PMCID: PMC6547476 DOI: 10.1186/s13029-019-0073-5
Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473
Characteristics of selected compression tools
| Tool | Input | C-ratio | Speed | Memory | Techniques |
|---|---|---|---|---|---|
| gzip | General ASCII | Moderate | High | Low | LZ77 and Huffman coding |
| bzip2 | General ASCII | Moderate | High | Low | BWT and Huffman coding |
| LZMA | General ASCII | Moderate | Low | High | Lempel-Ziv Markov chain and LZ77 |
| Deliminate | FASTA | High | High | Low | Delta encoding with Lempel-Ziv |
| MFCompress | FASTA | High | Moderate | High | Finite Contexts Models |
| Leon | FASTQ | High | High | Moderate | De Bruijn graph and Order-0 Arithmetic coding |
| Slimfastq | FASTQ | High | High | Moderate | Delta encoding, Arithmetic coding, and Context Models |
| SCALCE | FASTQ | High | High | High | Reordering, gzip, bzip2 and Order-3 Arithmetic coding |
| LFQC | FASTQ | High | Low | High | PAQ compressors |
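The three general-purpose rows of this table can be tried out with Python's standard library, which ships `gzip`, `bz2` and `lzma` modules for the same compression formats. The sketch below is illustrative only: the helper names `make_synthetic_dna` and `compression_ratios` are hypothetical, the input is uniformly random DNA rather than real sequencing data, and the ratios it prints say nothing about the benchmark results reported in this record.

```python
import bz2
import gzip
import lzma
import random

# Standard-library compressors that produce gzip, bzip2 and LZMA (xz) output.
COMPRESSORS = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "LZMA": lzma.compress,
}


def make_synthetic_dna(length: int, seed: int = 0) -> bytes:
    """Return `length` random bases as ASCII bytes (toy input, not real reads)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def compression_ratios(data: bytes) -> dict:
    """Map each tool name to original size divided by compressed size."""
    return {name: len(data) / len(fn(data)) for name, fn in COMPRESSORS.items()}


if __name__ == "__main__":
    sample = make_synthetic_dna(100_000)
    for tool, ratio in compression_ratios(sample).items():
        print(f"{tool}: C-ratio {ratio:.2f}")
```

Real FASTA and FASTQ files contain repeats, headers and quality strings, so ratios on actual data will differ from those on this random input.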
Description of benchmark datasets
| Identifier | Size (MB) | Type | Technique | Organism | Description |
|---|---|---|---|---|---|
| SRR554369 | 456 | FASTQ paired short reads | Illumina GAIIx; 50x total depth | P. aeruginosa | Small genome (6-7 MB), medium depth |
| SRR327342 | 3,881 | FASTQ paired short reads | Illumina GAII; 175x total depth | S. cerevisiae | Small genome (12 MB), high depth |
| MH0001.081026 | 1,880 | FASTQ paired short reads | Illumina GA; unknown depth | Human gut metagenome | Mixed species and unknown references |
| SRR1284073 | 1,309 | FASTQ single variable-length long reads | PacBio; 140x depth | Bacteria E. coli | Small genome (4.7 MB), higher error rate |
| SRR870667 | 22,987 | FASTQ paired short reads | Illumina GAIIx; 35x total depth | Plant T. cacao | Medium sized genome (345 MB) |
| ERR174310 | 53,869 | FASTQ paired short reads | Illumina HiSeq 2000; 13x total depth | H. sapiens (NA12877) individual | Common instrument depth |
Fig. 1 MZPAQ: Illustration of the overall framework used to obtain MZPAQ
Compression of identifiers and sequences: rows labeled "Original" give the original file size
| Dataset | Method | Read identifiers (MB) | Sequences (MB) | Identifiers & Sequences (MB) | C-ratio |
|---|---|---|---|---|---|
| SRR554369 | Original | 60.8 | 167.4 | 228.2 | |
| | gzip | 7.5 | 48.8 | 56.3 | 4.05 |
| | bzip2 | 3.8 | 46.7 | 50.5 | 4.52 |
| | LZMA | 0.8 | 17.6 | 18.4 | 12.40 |
| | Leon | 0.1 | 18.6 | 18.7 | 12.20 |
| | SCALCE | 6.8 | 17.0 | 23.8 | 9.59 |
| | Slimfastq | 0.1 | 29.9 | 30.0 | 7.61 |
| | LFQC | 0.0 | 17.4 | 17.4 | 13.15† |
| | Deliminate | N/A | N/A | 27.9 | 8.18 |
| | MFCompress | N/A | N/A | 14.0 | |
| SRR327342 | Original | 978.1 | 962.3 | 1940.4 | |
| | gzip | 83.6 | 284.0 | 367.6 | 5.28 |
| | bzip2 | 70.9 | 269.4 | 340.3 | 5.70 |
| | LZMA | 46.7 | 120.2 | 166.9 | 11.63 |
| | Leon | 26.3 | 89.3 | 115.6 | |
| | SCALCE | 69.7 | 68.4 | 138.1 | 14.05 |
| | Slimfastq | 23.0 | 149.0 | 172.0 | 11.28 |
| | LFQC | 20.9 | 128.7 | 149.6 | 12.97 |
| | Deliminate | N/A | N/A | 164.1 | 11.83 |
| | MFCompress | N/A | N/A | 124.9 | 15.54† |
| MH0001 | Original | 416.4 | 523.8 | 940.2 | |
| | gzip | 36.7 | 157.5 | 194.2 | 4.84 |
| | bzip2 | 32.1 | 151.4 | 183.5 | 5.12 |
| | LZMA | 22.0 | 101.7 | 123.7 | 7.60 |
| | Leon | 20.6 | 87.0 | 107.6 | |
| | SCALCE | 76.7 | 70.7 | 147.4 | 6.38 |
| | Slimfastq | 17.5 | 103.7 | 121.2 | 7.76 |
| | LFQC | 16.1 | 103.2 | 119.3 | 7.88 |
| | Deliminate | N/A | N/A | 115.9 | 8.11 |
| | MFCompress | N/A | N/A | 111.1 | 8.46† |
| SRR1284073 | Original | 4.9 | 649.6 | 654.5 | |
| | gzip | 0.8 | 182.7 | 183.5 | 3.57 |
| | bzip2 | 0.7 | 176.5 | 177.2 | 3.69 |
| | LZMA | 0.5 | 160.1 | 160.6 | 4.08 |
| | Leon | 0.3 | 170.1 | 170.4 | 3.84 |
| | SCALCE | N/A | N/A | N/A | N/A |
| | Slimfastq | N/A | N/A | N/A | N/A |
| | LFQC | 0.3 | 155.8 | 156.1 | 4.19 |
| | Deliminate | N/A | N/A | 155.2 | |
| | MFCompress | N/A | N/A | 155.9 | 4.20† |
| SRR870667 | Original | 3,947.2 | 7,546.1 | 11,493.3 | |
| | gzip | 514.3 | 2081.6 | 2595.9 | 4.43 |
| | bzip2 | 422.0 | 1,974.4 | 2396.4 | 4.80 |
| | LZMA | 280.4 | 1515.7 | 1796.1 | 6.40 |
| | Leon | 139.7 | 1363.1 | 1502.8 | 7.65 |
| | SCALCE | 341.6 | 999.4 | 1341.0 | |
| | Slimfastq | 128.2 | 1419.1 | 1547.3 | 7.43 |
| | LFQC | 122.2 | N/A | N/A | N/A |
| | Deliminate | N/A | N/A | 1768.5 | 6.50 |
| | MFCompress | N/A | N/A | 1407.7 | 8.16† |
| ERR174310 | Original | 11,107.5 | 21,173.1 | 32,280.6 | |
| | gzip | 1483.8 | 6018.7 | 7501.7 | 4.30 |
| | bzip2 | 1223.6 | 5745.2 | 6968.8 | 4.63 |
| | LZMA | 691.0 | 4982.0 | 5673.0 | 5.69 |
| | Leon | 355.2 | 4734.4 | 5089.6 | 6.34 |
| | SCALCE | 1073.0 | 3016.0 | 4089.0 | |
| | Slimfastq | 323.4 | 4426.4 | 4749.8 | 6.80 |
| | LFQC | N/A | N/A | N/A | N/A |
| | Deliminate | N/A | N/A | 5604.0 | 5.76 |
| | MFCompress | N/A | N/A | 4666.3 | 6.92† |
Best results are bold faced and second-best results are marked with a dagger (†). N/A refers to unsupported or unsuccessful cases
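The C-ratio column appears to be the original size divided by the compressed size. This is an assumption, but the SRR554369 figures bear it out. The short sketch below recomputes the ratio for four rows of that dataset; the function name `compression_ratio` is hypothetical and the sizes are copied from the table.

```python
def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Assumed definition of C-ratio: original size / compressed size."""
    if compressed_mb <= 0:
        raise ValueError("compressed size must be positive")
    return original_mb / compressed_mb


# Identifiers & Sequences (MB) for SRR554369, with the C-ratio the table reports.
ORIGINAL_MB = 228.2
REPORTED = {
    "gzip": (56.3, 4.05),
    "bzip2": (50.5, 4.52),
    "LZMA": (18.4, 12.40),
    "Deliminate": (27.9, 8.18),
}

if __name__ == "__main__":
    for tool, (compressed_mb, reported_ratio) in REPORTED.items():
        computed = round(compression_ratio(ORIGINAL_MB, compressed_mb), 2)
        print(f"{tool}: computed {computed:.2f}, reported {reported_ratio:.2f}")
```

Because the sizes are rounded to one decimal place, a recomputed ratio can differ from the reported one in the last digit for very small compressed sizes.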
Compression of Quality Scores: rows labeled "Original" give the original file size
| Dataset | Method | Compression size (MB) | Compression ratio |
|---|---|---|---|
| SRR554369 | Original | 167.4 | |
| | gzip | 64.7 | 2.59 |
| | bzip2 | 57.6 | 2.91 |
| | LZMA | 57.0 | 2.94 |
| | Leon | 64.6 | 2.59 |
| | SCALCE | 52.0 | 3.22 |
| | Slimfastq | 47.8† | 3.50† |
| | LFQC | | |
| SRR327342 | Original | 962.3 | |
| | gzip | 428.6 | 2.25 |
| | bzip2 | 405.8 | 2.37 |
| | LZMA | 383.5 | 2.51 |
| | Leon | 429.1 | 2.24 |
| | SCALCE | 349.3 | 2.75 |
| | Slimfastq | 334.9† | 2.87† |
| | LFQC | | |
| MH0001 | Original | 523.8 | |
| | gzip | 184.4 | 2.84 |
| | bzip2 | 173.5 | 3.02 |
| | LZMA | 165.9 | 3.16 |
| | Leon | 183.9 | 2.85 |
| | SCALCE | 297.5 | 1.76 |
| | Slimfastq | 144.8† | 3.62† |
| | LFQC | | |
| SRR1284073 | Original | 649.6 | |
| | gzip | 308.7 | 2.10 |
| | bzip2 | 283.6 | 2.29 |
| | LZMA | 280.5† | 2.32† |
| | Leon | 308.6 | 2.10 |
| | SCALCE | N/A | N/A |
| | Slimfastq | N/A | N/A |
| | LFQC | | |
| SRR870667 | Original | 7,546.1 | |
| | gzip | 3021.5 | 2.50 |
| | bzip2 | 2780.7 | 2.71 |
| | LZMA | 2668.8 | 2.83 |
| | Leon | 3022.4 | 2.50 |
| | SCALCE | 2365.0 | 3.19 |
| | Slimfastq | 2281.7† | 3.31† |
| | LFQC | | |
| ERR174310 | Original | 21,173.1 | |
| | gzip | 8525.8 | 2.48 |
| | bzip2 | 7439.9 | 2.85 |
| | LZMA | 7397.0 | 2.86 |
| | Leon | 8533.3 | 2.48 |
| | SCALCE | 6738.0 | 3.14 |
| | Slimfastq | 6295.0† | 3.36† |
| | LFQC | | |
Best results are bold faced and second-best results are marked with a dagger (†). N/A refers to unsuccessful cases
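The results above are reported separately for read identifiers, sequences and quality scores, which implies that a FASTQ file is first separated into these three streams. A minimal sketch of that separation follows. It assumes the common four-line record layout (`@identifier`, sequence, `+`, quality) with no line-wrapped sequences; the function name `split_fastq_streams` is hypothetical and is not taken from any of the evaluated tools.

```python
from typing import Iterable, List, Tuple


def split_fastq_streams(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split four-line FASTQ records into identifier, sequence and quality lists."""
    identifiers: List[str] = []
    sequences: List[str] = []
    qualities: List[str] = []
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no data and can be skipped
        record.append(line)
        if len(record) == 4:
            header, sequence, separator, quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            if len(sequence) != len(quality):
                raise ValueError(f"sequence/quality length mismatch in {header!r}")
            identifiers.append(header[1:])  # drop the leading '@'
            sequences.append(sequence)
            qualities.append(quality)
            record = []
    if record:
        raise ValueError("truncated FASTQ input: incomplete final record")
    return identifiers, sequences, qualities


if __name__ == "__main__":
    example = ["@read1 sample", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "!!#I"]
    ids, seqs, quals = split_fastq_streams(example)
    print(ids, seqs, quals)
```

Each returned list can then be joined and passed to a compressor suited to that kind of data.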
Compression ratios of evaluated tools
| Dataset | SRR554369 | SRR327342 | MH0001 | SRR1284073 | SRR870667 | ERR174310 |
|---|---|---|---|---|---|---|
| Gzip | 3.16 | 3.87 | 3.89 | 2.40 | 3.43 | 2.96 |
| Bzip2 | 3.74 | 4.67 | 4.82 | 2.83 | 4.06 | 3.62 |
| LZMA | 4.99 | 5.47 | 5.15 | 2.84 | 4.40 | 3.67 |
| Leon | 5.48 | 7.13 | 6.45 | 2.73 | 5.08 | 3.95 |
| SCALCE | 5.97 | 7.96 | 6.32 | N/A | 6.20 | 4.98 |
| Slimfastq | 5.87 | 7.66 | 7.07 | N/A | 6.00 | 4.88 |
| LFQC | 7.02 | 8.06 | 7.18 | | N/A | N/A |
| MZPAQ | | | | | | |
N/A refers to unsuccessful compression
The values in bold typeface represent the best performance
Compression Speed of evaluated tools
| Dataset | SRR554369 | SRR327342 | MH0001 | SRR1284073 | SRR870667 | ERR174310 |
|---|---|---|---|---|---|---|
| Gzip | 5.77 | 11.22 | 4.69 | 5.13 | 6.18 | 6.24 |
| Bzip2 | 14.71 | 12.48 | 12.96 | 11.48 | 12.91 | 12.41 |
| LZMA | 0.91 | 1.24 | 1.05 | 0.79 | 1.04 | 0.96 |
| Leon | 3.86 | 5.94 | 4.8 | 3.54 | 3.12 | 3.35 |
| SCALCE | 18.24 | 21.8 | 19.58 | N/A | 12.95 | 9.24 |
| Slimfastq | | | | N/A | | |
| LFQC | 0.82 | 0.98 | 1.21 | 0.7 | N/A | N/A |
| MZPAQ | 0.98 | 1.34 | 1.33 | 0.78 | 0.99 | 0.83 |
N/A refers to unsuccessful compression
The values in bold typeface represent the best performance
Decompression speed of evaluated tools
| Dataset | SRR554369 | SRR327342 | MH0001 | SRR1284073 | SRR870667 | ERR174310 |
|---|---|---|---|---|---|---|
| Gzip | | | | | | |
| Bzip2 | 35.08 | 32.07 | 33.57 | 22.96 | 24.48 | 22.03 |
| LZMA | 76 | 55.44 | 62.67 | 46.75 | 35.47 | 33.77 |
| Leon | 16.29 | 18.39 | 27.65 | 8.5 | 13.12 | 9.26 |
| SCALCE | 25.33 | 31.55 | 27.24 | N/A | 22.3 | 19.12 |
| Slimfastq | 24 | 24.72 | 20.89 | N/A | 20.58 | 17.25 |
| LFQC | 0.8 | 1.04 | 1.11 | 0.68 | N/A | N/A |
| MZPAQ | 0.91 | 1.07 | 1.29 | 0.82 | 0.97 | 0.99 |
N/A refers to unsuccessful compression
The values in bold typeface represent the best performance
Compression memory usage of evaluated tools
| Dataset | SRR554369 | SRR327342 | MH0001 | SRR1284073 | SRR870667 | ERR174310 |
|---|---|---|---|---|---|---|
| Gzip | | | | | | |
| Bzip2 | 7.8 | 8.7 | 8.8 | 8.3 | 7.7 | 8.7 |
| LZMA | 691.4 | 691.5 | 691.3 | 691.4 | 691.4 | 691.3 |
| Leon | 382.2 | 385.7 | 95.1 | 4213.5 | 1858 | 3324.6 |
| SCALCE | 1429.9 | 3111.2 | 2584.2 | N/A | 5424.5 | 5450.4 |
| Slimfastq | 82.5 | 82.5 | 82.5 | N/A | 82.5 | 82.6 |
| LFQC | 1445.2 | 1189.5 | 1540.9 | 1522 | N/A | N/A |
| MZPAQ | 2398.8 | 2901.8 | 2691 | 2385.6 | 4544.5 | 5326.4 |
N/A refers to unsuccessful compression
The values in bold typeface represent the best performance
Decompression memory usage of evaluated tools
| Dataset | SRR554369 | SRR327342 | MH0001 | SRR1284073 | SRR870667 | ERR174310 |
|---|---|---|---|---|---|---|
| Gzip | | | | | | |
| Bzip2 | 5 | 5 | 5 | 4.8 | 4.9 | 4.9 |
| LZMA | 67.8 | 67.8 | 67.7 | 67.7 | 67.8 | 67.8 |
| Leon | 247.8 | 221.07 | 35.3 | 2923.9 | 762.2 | 2971 |
| SCALCE | 1031.4 | 1030.6 | 1031.1 | N/A | 1031.1 | 1032.6 |
| Slimfastq | 82.5 | 82.5 | 82.4 | N/A | 82.2 | 82.3 |
| LFQC | 1457.7 | 1559.6 | 1451.6 | 1527.7 | N/A | N/A |
| MZPAQ | 2383.9 | 2382 | 2384.1 | 2383 | 2396.3 | 2383 |
N/A refers to unsuccessful compression
The values in bold typeface represent the best performance
Compression of benchmark datasets using FASTQ tools
| Dataset | Size (MB) | Leon | SCALCE | Slimfastq | LFQC | MZPAQ |
|---|---|---|---|---|---|---|
| SRR554369 | 456 | | | | | |
| SRR327342 | 3,881 | | | | | |
| MH0001 | 1,880 | | | | | |
| SRR1284073 | 1,309 | yellow × | orange × | red × | orange × | |
| SRR870667 | 22,987 | | | | red × | |
| ERR174310 | 53,869 | | | | red × | |
red ×: Tool does not support data. orange ×: Tool produces invalid output. yellow ×: Tool produces wrong output
Fig. 2 Comparison: Compression sizes of different FASTQ streams in two large datasets using different compression tools
Fig. 3 Compression ratio vs. compression speed: The compression ratio versus the speed of compression for all benchmark datasets using different compression tools
Fig. 4 Memory usage vs. compression ratio: The maximum memory used during compression versus the compression ratio for all benchmark datasets using different compression tools