Sultan Al Yami, Chun-Hsi Huang.
Abstract
The cost-effectiveness of next-generation sequencing (NGS) has advanced genomic research and now routinely generates large volumes of raw data, which often require efficient infrastructure such as data centers for storage and transmission. NGS data are highly redundant and need to be compressed efficiently to reduce the cost of storage space and transmission bandwidth. To address these issues, we present LFastqC, a lossless, non-reference-based FASTQ compression algorithm that improves on the LFQC tool. We compare LFastqC with several state-of-the-art compressors, and the results indicate that it achieves better compression ratios on important datasets such as LS454, PacBio, and MinION. Moreover, LFastqC compresses and decompresses faster than LFQC, which was previously the top-performing compression algorithm for the LS454 dataset. LFastqC is freely available at https://github.uconn.edu/sya12005/LFastqC.
Year: 2019 PMID: 31725736 PMCID: PMC6855649 DOI: 10.1371/journal.pone.0224806
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Compression tools adopted and their parameters.
| Algorithm | Parameters |
|---|---|
| SPRING | -c -i -t 16 / -c -l -i -t 16 |
| LFQC | - |
| DSRC2 | c -m2 |
| Fqzcomp | SOLEXA: -n2 -s7+ -b -q3 |
| | LS454: -n1 -s7+ -b -q2 |
| | SOLiD: -S -n2 -s5+ -q1 |
| SeqSqueeze1 | -h 4 1/5 -hs 5 -b 1:3 -b 1:7 -b 1:12 1/10 -bg 0.9 -s 1:2 1/5 -s 1:3 1/10 -ss 10 -sg 0.95 |
| Quip | - |
| Gzip | -9 |
| Bzip2 | -9 |
Datasets.
| Datasets | Type | Organism | Coverage | Read Length | Size (Mb) |
|---|---|---|---|---|---|
| SRR001471 | LS454 | | 0.07x | 188 | 216 |
| SRR003177 | LS454 | Homo sapiens | 0.27x | 564 | 1196 |
| SRR003186 | LS454 | Homo sapiens | 0.21x | 581 | 886 |
| SRR007215 | SOLiD | Homo sapiens | 0.07x | 25 | 695 |
| SRR010637 | SOLiD | Homo sapiens | 0.14x | 35 | 2086 |
| SRR013951 | SOLEXA | Homo sapiens | 0.89x | 76 | 3190 |
| SRR027520_1 | SOLEXA | Homo sapiens | 1.19x | 76 | 4808 |
| SRR027520_2 | SOLEXA | Homo sapiens | 1.19x | 76 | 4808 |
| SRR554369 | GAIIx | P. aeruginosa | 50x | 100 | 384 |
| SRR327342 | GAII | S. cerevisiae | 175x | 63 | 2812 |
| SRR1284073 | PacBio | E. coli | 140x | 2942 | 1302 |
| SRR9046049 | PacBio | A. brasilense | 136x | 3078 | 2622 |
| SRR8858470 | PacBio | Homo sapiens | 0.67x | 13964 | 4288 |
| ERR3307082 | MinION | C. freundii | 367x | 4002 | 3632 |
| ERR637420 | MinION | E. coli | 118x | 6232 | 264 |
Compression ratios for each tool.
| Dataset | Compression Ratio | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| LFastqC | LFQC | SPRING | DSRC2 | SeqSqueeze1 | Quip | FQZComp | Gzip | Bzip2 | |
| SRR001471 | 5.24 | 4.58 | 4.84 | 5.15 | 4.47 | 5.02 | 3.23 | 3.93 | |
| SRR003177 | 5.11 | 4.46 | 4.60 | 4.90 | 4.45 | 4.77 | 3.16 | 3.81 | |
| SRR003186 | 4.64 | 4.17 | 4.34 | 4.63 | 4.17 | 4.49 | 2.97 | 3.59 | |
| SRR007215 | 6.60 | - | 6.76 | 7.07 | - | - | 4.18 | 5.20 | |
| SRR010637 | 5.30 | - | 5.31 | 5.56 | - | - | 3.48 | 4.25 | |
| SRR013951 | 3.46 | 3.48 | 3.29 | 3.39 | 3.46 | 3.48 | 2.40 | 2.80 | |
| SRR027520_1 | 4.28 | 4.36 | 4.14 | 4.33 | 4.44 | 4.48 | 2.87 | 3.41 | |
| SRR027520_2 | 4.25 | 4.27 | 4.04 | 4.24 | 4.35 | 4.38 | 2.80 | 3.33 | |
| SRR554369 | 6.12 | 5.90 | 4.32 | 5.37 | 4.34 | 4.94 | 2.82 | 3.38 | |
| SRR327342 | 5.90 | 5.84 | 4.74 | 5.64 | 5.24 | 6.08 | 3.07 | 3.65 | |
| SRR1284073 | 3.20 | 3.10 | - | - | 3.10 | 2.39 | 2.82 | ||
| SRR9046049 | 2.98 | - | - | - | 2.74 | 2.36 | |||
| SRR8858470 | 3.02 | 2.85 | - | - | - | 2.50 | 2.32 | ||
| ERR3307082 | 2.70 | 2.60 | - | - | - | 2.02 | 2.32 | ||
| ERR637420 | 2.81 | 2.85 | - | - | 2.21 | 2.59 | |||
Table 3: Compression ratio is defined as the ratio of the original file size to the compressed file size. Best performance is indicated in bold.
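The ratio defined in the caption can be illustrated with a minimal Python sketch. It uses the standard-library `gzip` and `bz2` modules at level 9 as stand-ins for the `Gzip -9` and `Bzip2 -9` baselines. The toy FASTQ record and the helper name `compression_ratio` are assumptions made for illustration, not part of the paper, and the ratios it prints are unrelated to the values in the table.

```python
import bz2
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (larger is better)."""
    if not compressed:
        raise ValueError("compressed payload is empty")
    return len(original) / len(compressed)


# Hypothetical toy FASTQ record, repeated to mimic highly redundant reads.
record = (
    b"@read1 length=16\n"
    b"ACGTACGTACGTACGT\n"
    b"+\n"
    b"IIIIIIIIIIIIIIII\n"
)
data = record * 1000

gz = gzip.compress(data, compresslevel=9)
bz = bz2.compress(data, compresslevel=9)

# Both codecs are lossless: decompression restores the input byte for byte.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data

ratios = {
    "gzip -9": compression_ratio(data, gz),
    "bzip2 -9": compression_ratio(data, bz),
}
for tool, ratio in ratios.items():
    print(f"{tool}: {ratio:.2f}")
```

Because the toy input is a single record repeated many times, the printed ratios are far higher than anything achievable on real sequencing data; the sketch only shows how the metric is computed.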
Compression speed.
| Dataset | Compression Time | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| LFastqC | LFQC | SPRING | DSRC2 | SeqSqueeze1 | Quip | FQZComp | Gzip | Bzip2 | |
| SRR001471 | 2m00s | 3m19s | 0m12s | 1m45s | 0m17s | 0m11s | 0m41s | 0m17s | |
| SRR003177 | 10m13s | 18m04s | 0m31s | 10m03s | 0m39s | 1m02s | 4m35s | 1m35s | |
| SRR003186 | 7m15s | 12m06s | 0m17s | 7m29s | 0m29s | 0m59s | 3m41s | 1m13s | |
| SRR007215 | 6m18s | 6m00s | - | 2m23s | - | - | 0m46s | 1m10s | |
| SRR010637 | 21m18s | 18m05s | - | 8m21s | - | - | 3m30s | 3m30s | |
| SRR013951 | 37m20s | 41m04s | 0m57s | 25m30s | 1m41s | 3m06s | 8m53s | 5m27s | |
| SRR027520_1 | 44m37s | 68m01s | 2m03s | 33m44s | 2m24s | 4m34s | 11m17s | 7m35s | |
| SRR027520_2 | 46m42s | 59m08s | 1m23s | 34m00s | 2m22s | 4m31s | 11m07s | 7m37s | |
| SRR554369 | 5m34s | 6m38s | 0m15s | 0m23s | 3m56s | 0m25s | 1m12s | 0m32s | |
| SRR327342 | 41m40s | 45m0s | 2m17s | 20m31s | 1m14s | 2m20s | 6m35s | 4m33s | |
| SRR1284073 | 15m11s | 21m21s | 1m7s | - | - | - | 3m49s | 2m7s | |
| SRR9046049 | 40m21s | 46m52s | - | - | - | - | 4m37s | 8m10s | |
| SRR8858470 | 70m49s | 74m47s | - | - | - | - | 7m56s | 22m9s | |
| ERR3307082 | 66m35s | 69m23s | - | - | - | - | 9m35s | 6m53s | |
| ERR637420 | 3m44s | 4m56s | 0m24s | - | - | - | 0m41s | 0m26s | |
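To compare timings across datasets of different sizes, the `XmYYs` cells can be converted to seconds and divided into the dataset size. The sketch below is a hedged illustration in Python. The helper names `parse_duration` and `throughput` are hypothetical, and the example input pairs the first timing listed for SRR003177 (10m13s) with its size of 1196 from the datasets table.

```python
import re

# Minutes and seconds, tolerating a missing trailing "s".
_DURATION = re.compile(r"^(\d+)m(\d+)s?$")


def parse_duration(cell: str) -> int:
    """Convert a table cell such as '10m13s' into whole seconds."""
    match = _DURATION.match(cell.strip())
    if match is None:
        raise ValueError(f"not a duration: {cell!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def throughput(size: float, cell: str) -> float:
    """Input size divided by elapsed seconds (size units per second)."""
    elapsed = parse_duration(cell)
    if elapsed == 0:
        raise ValueError("elapsed time is zero")
    return size / elapsed


seconds = parse_duration("10m13s")  # 613
rate = throughput(1196, "10m13s")
print(f"{seconds} s, {rate:.2f} size units per second")
```

Cells containing `-` (no result reported) raise `ValueError`, so a caller would need to skip them explicitly.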
Decompression speed.
| Dataset | Decompression Speed | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| LFastqC | LFQC | SPRING | DSRC2 | SeqSqueeze1 | Quip | FQZComp | Gzip | Bzip2 | |
| SRR001471 | 2m16s | 3m20s | 0m4s | 0m10s | 1m45s | 0m47s | 0m13s | 0m30s | |
| SRR003177 | 10m43s | 14m48s | 0m22s | 10m03s | 3m40s | 1m21s | 0m34s | 3m00s | |
| SRR003186 | 7m59s | 11m40s | 0m20s | 7m29s | 2m38s | 0m58s | 0m43s | 2m08s | |
| SRR007215_1 | 6m08s | 7m14s | - | 2m23s | - | - | 0m23s | 1m16s | |
| SRR010637 | 20m59s | 23m28s | - | 8m21s | - | - | 1m27s | 4m12s | |
| SRR013951_2 | 35m27s | 37m27s | 0m57s | 25m30s | 9m39s | 3m12s | 2m34s | 8m28s | |
| SRR027520_1 | 48m27s | 56m12s | 2m34s | 33m44s | 15m38s | 5m01s | 3m51s | 13m57s | |
| SRR027520_2 | 55m49s | 56m59s | 4m01s | 34m00s | 16m03s | 5m24s | 4m13s | 13m05s | |
| SRR554369_1 | 5m54s | 4m46s | 0m27s | 5m16s | 1m24s | 0m26s | 0m6s | 0m48s | |
| SRR327342_1 | 40m30s | 44m38s | 0m42s | 32m12s | 8m48s | 2m49s | 1m56s | 5m49s | |
| SRR1284073 | 16m37s | 18m44s | 0m31s | - | 2m18s | - | 2m46s | ||
| SRR9046049 | 35m52s | 40m29s | 2m45s | - | - | - | 6m18s | ||
| SRR8858470 | 62m59s | 68m12s | 2m36s | - | - | - | 8m50s | ||
| ERR3307082 | 56m40s | 59m38s | - | - | - | 2m13s | 9m16s | ||
| ERR637420 | 2m17s | 4m3s | 0m12s | 0m14s | - | - | - | 0m34s | |
Memory consumption.
| Datasets | Size (Mb) | Memory Usage |
|---|---|---|
| SRR001471 | 216 | 4 GB |
| SRR003177 | 1196 | 4 GB |
| SRR003186 | 886 | 4 GB |
| SRR007215 | 695 | 4 GB |
| SRR010637 | 2086 | 4 GB |
| SRR013951 | 3190 | 4 GB |
| SRR027520_1 | 4808 | 4 GB |
| SRR027520_2 | 4808 | 4 GB |
| SRR554369 | 384 | 4 GB |
| SRR327342 | 2812 | 4 GB |
| SRR1284073 | 1302 | 4 GB |
| SRR9046049 | 2622 | 4 GB |
| SRR8858470 | 4288 | 4 GB |
| ERR3307082 | 3632 | 4 GB |
| ERR637420 | 264 | 4 GB |