Abstract
BACKGROUND: Advances in sequencing technology have drastically reduced sequencing costs, and the amount of sequencing data has grown explosively as a result. Because FASTQ files (the standard format for sequencing data) are huge, they need to be compressed efficiently, especially their quality scores. Several quality score compression algorithms have recently been proposed, most of them focusing on lossy compression to push the compression rate further. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: compressing a 1 GB file can take thousands of seconds. Compression algorithms are also expected to offer additional features, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality.
Keywords: Concurrency; FASTQ; Lossless compressor; Quality score; Random access
Year: 2021 PMID: 34930110 PMCID: PMC8686598 DOI: 10.1186/s12859-021-04516-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The general workflow of FCLQC after Splitter. CounterHandler assigns sub-files (Q) to threads, and each thread counts the number of occurrences (LS) of quality scores in its sub-file. The main thread aggregates all local counts (LS) and generates summary statistics (SS), which contain the estimated marginal and conditional distributions. The estimated distributions are passed to the EncoderHandler, which provides each thread with a sub-file and the estimated distributions. Finally, each thread compresses the quality scores of its sub-file line by line using the adaptive arithmetic coder (AAC) and outputs a compressed sub-file (C).
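The counting and aggregation stages of this workflow can be illustrated with a minimal Python sketch. It is not the FCLQC implementation: the function names `count_subfile` and `summary_statistics` are hypothetical, and the conditional distribution is assumed here to condition on the single preceding quality score, which the caption does not specify.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_subfile(lines):
    """Local statistics (LS) for one sub-file.

    Returns symbol counts and (previous, current) pair counts.
    Assumption: an order-1 context within each line.
    """
    marginal = Counter()
    pairs = Counter()
    for line in lines:
        marginal.update(line)              # count each quality character
        pairs.update(zip(line, line[1:]))  # count adjacent pairs in the line
    return marginal, pairs


def summary_statistics(subfiles, num_threads=4):
    """Aggregate local counts into estimated distributions (SS)."""
    # One counting task per sub-file. Python threads only illustrate the
    # structure here; they do not give real CPU parallelism for this work.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        local_stats = list(pool.map(count_subfile, subfiles))

    # The main thread merges all local counts.
    marginal_counts, pair_counts = Counter(), Counter()
    for marginal, pairs in local_stats:
        marginal_counts.update(marginal)
        pair_counts.update(pairs)

    total = sum(marginal_counts.values())
    marginal_dist = {s: c / total for s, c in marginal_counts.items()}

    # Normalize pair counts by how often each context symbol was followed
    # by another symbol, giving P(current | previous).
    context_totals = Counter()
    for (prev, _cur), c in pair_counts.items():
        context_totals[prev] += c
    conditional_dist = {
        (prev, cur): c / context_totals[prev]
        for (prev, cur), c in pair_counts.items()
    }
    return marginal_dist, conditional_dist
```

Calling `summary_statistics([["IIHH", "IHHH"], ["HHII"]])` returns one dictionary of symbol probabilities and one dictionary keyed by `(previous, current)` pairs. In a pipeline like the one in the figure, such estimates would then initialize the model used by each encoding thread.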
Details of quality scores datasets
| Filename | Organism | Technology | Length | Size (MB) | Coverage |
|---|---|---|---|---|---|
| | P.Aeruginosa | Illumina GAIIx | 100 | 160 | 50x |
| | P.Aeruginosa | Illumina GAIIx | 100 | 160 | 50x |
| | S.Cerevisiae | Illumina GAII | 63 | 918 | 175x |
| | S.Cerevisiae | Illumina GAII | 75 | 1090 | 175x |
| | T.Cacao | Illumina GAIIx | 108 | 7197 | 35x |
| | T.Cacao | Illumina GAIIx | 74 | 4952 | 35x |
| | Synthetic | SimNGS | 101 | 43,775 | 30x |
| | Synthetic | SimNGS | 101 | 43,775 | 30x |
Supported features of compressors
| | FCLQC | LCQS | AQUa | 7-zip | pigz |
|---|---|---|---|---|---|
| Without preprocessing | ✕ | ✕ | ✕ | ✕ | |
| Random access | ✕ | ✕ | |||
| Multi-threading | ✕ | ||||
| Custom number of threads | ✕ | ✕ |
Configurations for compressors
| Compressor | Parameters | Source URL |
|---|---|---|
| FCLQC | Precision = 35, thead_num = 6 or 16 | |
| LCQS | | |
| AQUa | Windowsize = 1, cabacgrouping = 10485760 | |
| 7-zip | -mx9 (best), -mmt6 or -mmt16 | |
| pigz | -9 (best), -p 6 or -p 16 | |
Comparison results of compression speed and average memory usage
| Filename | Compression speed (MB/s) | | | | | Average memory usage (GB) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | FCLQC | LCQS | AQUa | 7-zip | pigz | FCLQC | LCQS | AQUa | 7-zip | pigz |
| 3.09 | 3.29 | 1.02 | 17.40 | 0.0132 | 1.76 | 0.59 | 0.63 | |||
| 2.98 | 3.28 | 0.97 | 14.41 | 0.0132 | 1.38 | 0.58 | 0.63 | |||
| 7.12 | 6.43 | 2.87 | 50.70 | 0.0134 | 7.68 | 0.60 | 3.52 | |||
| 6.37 | 8.30 | 3.45 | 77.64 | 0.0133 | 7.93 | 0.59 | 4.41 | |||
| 10.36 | 9.47 | 6.12 | 36.33 | 0.0133 | 10.71 | 0.62 | 7.43 | |||
| 7.73 | 8.85 | 4.21 | 43.10 | 0.0134 | 9.57 | 0.61 | 7.56 | |||
| 8.94 | 12.17 | 6.10 | 41.98 | 0.0134 | 14.24 | 0.62 | 12.12 | |||
| 7.95 | 11.46 | 5.58 | 38.32 | 0.0133 | 14.31 | 0.61 | 12.04 | |||
Bold denotes the fastest compression speed or lowest memory usage
Compression time and average CPU usage by number of threads
| Number of threads | Compression time (s) | | | Average CPU usage (%) | | |
|---|---|---|---|---|---|---|
| | FCLQC | 7-zip | pigz | FCLQC | 7-zip | pigz |
| 1 | 9728.62 | 3080.33 | 100 | 100 | 100 | |
| 10 | 1749.51 | 314.78 | 1000 | 600 | 1000 | |
| 20 | 903.39 | 159.89 | 1900 | 900 | 2000 | |
| 40 | 695.92 | 83.70 | 3200 | 2800 | 4000 | |
| 60 | 480.06 | 60.13 | 5900 | 2800 | 6000 | |
| 80 | 481.98 | 49.13 | 7300 | 2800 | 8000 | |
| 100 | 478.78 | 42.41 | 9500 | 2800 | 10,000 | |
| 120 | 481.53 | 36.63 | 11,600 | 2800 | 12,000 | |
Bold denotes the lowest compression time
Fig. 2 Speedup of FCLQC, 7-zip, and pigz for thread counts from 10 to 120
Results of random access decompression
| Filename | Random access decompression time (s) | | | | | |
|---|---|---|---|---|---|---|
| | FCLQC | | | LCQS | | |
| | Low | Mid | High | Low | Mid | High |
| 50.027 | 50.082 | 50.213 | ||||
| 52.310 | 52.507 | 52.627 | ||||
| 52.855 | 55.487 | 56.429 | ||||
| 54.961 | 57.307 | 58.232 | ||||
| 53.417 | 54.124 | 52.771 | ||||
| 58.847 | 58.174 | 57.997 | ||||
| 60.651 | ||||||
| 61.965 | ||||||
Bold denotes the fastest random access decompression
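Random access of the kind measured above generally relies on compressing the data in independently decodable blocks and keeping an index of where each block starts. The following sketch shows that idea only. It uses Python's `zlib` as a stand-in coder, not FCLQC's adaptive arithmetic coder, and the names `compress_blocks`, `read_line`, and `block_size` are hypothetical.

```python
import zlib


def compress_blocks(lines, block_size=2):
    """Compress groups of lines independently and record each block's position.

    Returns the concatenated compressed bytes and an index of
    (offset, length) entries, one per block.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(lines), block_size):
        raw = "\n".join(lines[start:start + block_size]).encode("ascii")
        blob = zlib.compress(raw)
        index.append((len(payload), len(blob)))
        payload += blob
    return bytes(payload), index


def read_line(payload, index, line_no, block_size=2):
    """Decode only the block that contains line_no (0-based)."""
    offset, length = index[line_no // block_size]
    block = zlib.decompress(payload[offset:offset + length]).decode("ascii")
    return block.split("\n")[line_no % block_size]
```

With this layout, fetching one line costs a single block decode regardless of where the line sits in the file. Smaller blocks make each lookup cheaper but usually reduce the compression ratio, because each block is compressed without knowledge of the others.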
Comparison results of decompression speed
| Filename | Decompression speed (MB/s) | |
|---|---|---|
| | FCLQC | LCQS |
| 3.21 | ||
| 3.07 | ||
| 11.38 | ||
| 11.81 | ||
| 8.21 | ||
| 18.8 | ||
| 10.29 | ||
| 10.35 | ||
Bold denotes the fastest decompression speed
Decompression time of FCLQC when the number of threads increases
| Filename | Decompression time (s) by number of threads | | | | |
|---|---|---|---|---|---|
| | 10 | 30 | 60 | 90 | 120 |
| | 1.49492 | 0.80012 | 0.55523 | 0.51588 | 0.48690 |
| | 1.52766 | 0.80448 | 0.56461 | 0.52217 | 0.48321 |
| | 11.26921 | 6.51885 | 4.63464 | 3.98554 | 3.35740 |
| | 12.21315 | 7.17441 | 4.93512 | 4.40018 | 4.08363 |
| | 67.20655 | 35.34112 | 24.08606 | 21.38290 | 18.88168 |
| | 54.44474 | 31.43011 | 21.77509 | 18.16317 | 16.11619 |
| | 412.46256 | 217.53169 | 151.75177 | 130.78657 | 111.23652 |
| | 433.82677 | 223.34369 | 153.61791 | 132.44471 | 112.75674 |
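To read the scaling behaviour off these timings, the speedup relative to the 10-thread run can be computed directly. The sketch below uses the row whose times run from 412.46256 s to 111.23652 s. The helper names are hypothetical, and taking the 10-thread run as the baseline is a choice made here because the table reports no single-thread time.

```python
THREADS = [10, 30, 60, 90, 120]
# Decompression times in seconds, copied from one row of the table.
TIMES_S = [412.46256, 217.53169, 151.75177, 130.78657, 111.23652]


def relative_speedup(times):
    """Speedup of each run relative to the first (10-thread) run."""
    base = times[0]
    return [base / t for t in times]


def parallel_efficiency(threads, times):
    """Speedup divided by the factor by which the thread count grew."""
    base_threads = threads[0]
    return [
        s / (n / base_threads)
        for n, s in zip(threads, relative_speedup(times))
    ]


if __name__ == "__main__":
    speedups = relative_speedup(TIMES_S)
    efficiencies = parallel_efficiency(THREADS, TIMES_S)
    for n, s, e in zip(THREADS, speedups, efficiencies):
        print(f"{n:>3} threads: speedup {s:.2f}x, efficiency {e:.2f}")
```

For this row, going from 10 to 120 threads gives a speedup of roughly 3.7x for a 12-fold increase in threads, so the benefit of each additional thread shrinks as the thread count grows.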
Comparison results of compression ratio
| Filename | Compression ratio | ||||
|---|---|---|---|---|---|
| | FCLQC | LCQS | AQUa | 7-zip | pigz |
| 3.02 | 2.97 | 2.94 | 2.59 | ||
| 3.04 | 2.93 | 2.87 | 2.54 | ||
| 2.59 | 2.57 | 2.51 | 2.25 | ||
| 2.42 | 2.35 | 2.31 | 2.09 | ||
| 2.89 | 2.86 | 2.83 | 2.50 | ||
| 2.66 | 2.58 | 2.54 | 2.27 | ||
| 2.52 | 2.39 | 2.29 | 2.14 | ||
| 2.20 | 2.07 | 2.07 | 1.91 | ||
Bold denotes the highest compression ratio
Fig. 3 The average compression ratio and compression speed of FCLQC and baseline compressors for all datasets
Fig. 4 The average compression ratio and decompression speed of FCLQC and baseline compressors for all datasets