Jiabing Fu1,2, Bixin Ke1,2, Shoubin Dong3,4.
Abstract
BACKGROUND: Advanced sequencing machines have dramatically accelerated the generation of genomic data, making efficient compression of sequencing data an urgent need. Quality scores are the hardest part of the standard FASTQ sequencing format to compress and have become the main bottleneck in FASTQ compression. Existing lossless quality-score compressors mainly rely on patterns specific to a particular sequencer and on complex context modeling to improve the compression ratio. Their main drawbacks are weak robustness (unstable results, or outright failure, on some sequencing files) and slow compression speed. Some compressors also build a fine-grained index structure to speed up random access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. An efficient lossless quality-score compressor that combines strong robustness, a high compression ratio, fast compression, and fast random access decompression is therefore needed.
Keywords: Efficient; Lossless compression; Parallelization; Quality score; Random access; Robust; ZPAQ
Year: 2020 PMID: 32183707 PMCID: PMC7079445 DOI: 10.1186/s12859-020-3428-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The framework of the proposed lossless compressor LCQS
Fig. 2 The procedure of the light-weight index method: Step 2
Fig. 3 The procedure of the adaptive k-mer packing method: Step 3
Detailed Descriptions of Test Quality Score Datasets
| Code | Filename (quality score only) | Organism | Technology | Coverage | Length | Size (MB) |
|---|---|---|---|---|---|---|
| 1_01 | SRR554369_1 | P. aeruginosa | Illumina GAIIx | 50x | 100 | 160 |
| 1_02 | SRR554369_2 | P. aeruginosa | Illumina GAIIx | 50x | 100 | 160 |
| 2_01 | MH0001_081026_clean.1 | H. sapiens gut | Illumina GAII | Unknown | 44 | 500 |
| 2_02 | MH0001_081026_clean.2 | H. sapiens gut | Illumina GAII | Unknown | 44 | 500 |
| 3_01 | SRR1284073 | E. coli | PacBio | 140x | Varied | 620 |
| 4_01 | SRR327342_1 | S. cerevisiae | Illumina GAII | 175x | 75 | 918 |
| 4_02 | SRR327342_2 | S. cerevisiae | Illumina GAII | 175x | 75 | 1090 |
| 5_02 | SRR870667_2 | T. cacao | Illumina GAIIx | 35x | 74 | 4952 |
| 5_01 | SRR870667_1 | T. cacao | Illumina GAIIx | 35x | 74 | 7197 |
| 6_01 | ERR09157 | Human | Illumina | Unknown | 101 | 166,142 |
Detailed Descriptions of Benchmark Compressors
| Compressors | Parameters | Source URLs |
|---|---|---|
| LCQS | k = 4, α = 0.1 (defined in step 1) | |
| AQUa | windowsize=1, cabacgrouping=10485760 | |
| 7-zip | -mx9 | |
| Gzip | -9 | |
Comparison Results of Compression Ratio
| Dataset | Compression ratio: LCQS | Compression ratio: AQUa | Compression ratio: 7-zip best | Compression ratio: Gzip best | LCQS file size reduction vs. AQUa (%) | LCQS file size reduction vs. 7-zip best (%) | LCQS file size reduction vs. Gzip best (%) |
|---|---|---|---|---|---|---|---|
| 1_01 | | 2.9726 | 2.9351 | 2.5884 | 13.56 | 14.65 | 24.73 |
| 1_02 | | 2.9296 | 2.8668 | 2.5365 | 11.87 | 13.76 | 23.69 |
| 2_01 | | 3.1762 | 3.1570 | 2.8401 | 9.31 | 9.86 | 18.91 |
| 2_02 | | 2.1817 | 2.2387 | 2.0756 | 11.29 | 8.97 | 15.60 |
| 3_01 | | - | 2.3159 | 2.1041 | - | 10.62 | 18.80 |
| 4_01 | | 2.5730 | 2.5093 | 2.2453 | 7.81 | 10.09 | 19.55 |
| 4_02 | | 2.3483 | 2.3099 | 2.0933 | 8.80 | 10.29 | 18.70 |
| 5_02 | | 2.5795 | 2.5400 | 2.2735 | 9.80 | 11.18 | 20.50 |
| 5_01 | | 2.8602 | 2.8276 | 2.4974 | 12.08 | 13.09 | 23.23 |
| 6_01 | | 3.2156 | 3.3046 | 2.8245 | | | |
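The reduction figures can be related to the compression ratios with a short calculation. The sketch below assumes that compression ratio means original size divided by compressed size, and that file size reduction is measured relative to the baseline compressor's output. Both definitions are assumptions inferred from the numbers rather than statements from the source, and the function names are hypothetical.

```python
def size_reduction_percent(ratio_new: float, ratio_baseline: float) -> float:
    """Percent by which the new output is smaller than the baseline output.

    Assumes ratio = original_size / compressed_size, so each compressed size
    is original_size / ratio and the original size cancels out.
    """
    if ratio_new <= 0 or ratio_baseline <= 0:
        raise ValueError("compression ratios must be positive")
    return (1.0 - ratio_baseline / ratio_new) * 100.0


def implied_ratio(ratio_baseline: float, reduction_percent: float) -> float:
    """Invert the relation: the ratio implied by a baseline ratio and a reduction."""
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return ratio_baseline / (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical ratios: doubling the ratio halves the compressed size.
    print(f"{size_reduction_percent(4.0, 2.0):.2f}")
    # Values taken from row 1_01, read as a baseline ratio and its reduction.
    print(f"{implied_ratio(2.5884, 24.73):.3f}")
```

If 2.5884 and 24.73 in row 1_01 are read as a baseline ratio and the matching reduction, the compared compressor's ratio works out to roughly 3.44 under these assumptions.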
Comparison Results of Compression Speed
| Dataset | Compression speed (MB/s): LCQS | Compression speed (MB/s): AQUa | Compression speed (MB/s): 7-zip best | Compression speed (MB/s): Gzip best | LCQS accelerating ratio vs. AQUa | LCQS accelerating ratio vs. 7-zip best | LCQS accelerating ratio vs. Gzip best |
|---|---|---|---|---|---|---|---|
| 1_01 | | 0.31 | 1.09 | 2.25 | 648% | 113% | 3% |
| 1_02 | | 0.31 | 1.04 | 1.79 | 639% | 120% | 28% |
| 2_01 | | 0.30 | 1.35 | 1.58 | 1617% | 281% | 226% |
| 2_02 | | 0.26 | 1.34 | 1.88 | 1650% | 240% | 142% |
| 3_01 | 4.63 | - | 0.98 | | - | 372% | -21% |
| 4_01 | | 0.31 | 0.99 | 2.39 | 1810% | 498% | 148% |
| 4_02 | | 0.31 | 0.88 | 3.84 | 1874% | 595% | 59% |
| 5_02 | | 0.31 | 1.00 | 2.03 | 1942% | 533% | 212% |
| 5_01 | | 0.31 | 0.99 | 1.75 | 2897% | | |
| 6_01 | | 0.32 | 1.09 | 2.28 | | 783% | 322% |
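The accelerating ratio appears to be the relative speed gain over a baseline. The sketch below assumes the definition (speed_new / speed_baseline − 1) × 100, which is inferred from the table values and not stated in this excerpt; the function name is hypothetical.

```python
def accelerating_ratio_percent(speed_new: float, speed_baseline: float) -> float:
    """Relative speed-up of the new compressor over a baseline, in percent.

    A positive value means the new compressor is faster; a negative value
    means it is slower than the baseline.
    """
    if speed_new < 0 or speed_baseline <= 0:
        raise ValueError("speeds must be non-negative and the baseline positive")
    return (speed_new / speed_baseline - 1.0) * 100.0


if __name__ == "__main__":
    # Speeds 4.63 MB/s and 0.98 MB/s both appear in row 3_01.
    print(f"{accelerating_ratio_percent(4.63, 0.98):.0f}%")
```

Under this assumed definition, the speeds 4.63 and 0.98 MB/s in row 3_01 give about 372%, which matches the value reported in that row.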
Comparison Results of CPU Usage and MEMORY Usage
| Dataset | Average CPU usage (%): LCQS | Average CPU usage (%): AQUa | Average CPU usage (%): 7-zip best | Average CPU usage (%): Gzip best | Average memory usage (GB): LCQS | Average memory usage (GB): AQUa | Average memory usage (GB): 7-zip best | Average memory usage (GB): Gzip best | M/C: LCQS |
|---|---|---|---|---|---|---|---|---|---|
| 1_01 | | 104 | 176 | 99 | 1.32 | 0.57 | 0.58 | | |
| 1_02 | | 104 | 175 | 99 | 1.26 | 0.57 | 0.56 | | 0.33 |
| 2_01 | | 105 | 180 | 100 | 3.82 | 0.6 | 0.63 | | 0.42 |
| 2_02 | | 105 | 175 | 99 | 4.6 | 0.61 | 0.64 | | 0.44 |
| 3_01 | | - | 157 | 99 | 6.12 | - | 0.65 | | 0.47 |
| 4_01 | | 105 | 168 | 99 | 7.52 | 0.63 | 0.65 | | 0.46 |
| 4_02 | | 105 | 165 | 99 | 9.26 | 0.63 | 0.66 | | 0.46 |
| 5_02 | | 104 | 173 | 100 | 15.6 | 0.63 | 0.67 | | 0.53 |
| 5_01 | | 105 | 177 | 100 | 14.37 | 0.66 | 0.67 | | 0.49 |
| 6_01 | | 105 | 172 | 100 | 16.99 | 0.66 | 0.67 | | 0.54 |
Comparison of Random Access Decompression Functionality
| Dataset | Random access decompression speed (thousand lines/s): LCQS, 40000 | LCQS, 80000 | LCQS, 160000 | AQUa, 40000 | AQUa, 80000 | AQUa, 160000 | Extra index size (%): LCQS | Extra index size (%): AQUa |
|---|---|---|---|---|---|---|---|---|
| 1_01 | | | | - | - | - | | 40.73 |
| 1_02 | | | | - | - | - | | 40.73 |
| 2_01 | | | | 0.53 | 0.83 | - | | 92.94 |
| 2_02 | | | | 0.24 | 0.47 | - | | 93.46 |
| 3_01 | | | | - | - | - | | - |
| 4_01 | | | | 0.33 | 0.75 | - | | 65.85 |
| 4_02 | | | | 0.47 | 0.61 | - | | 55.7 |
| 5_02 | | | | 0.35 | 0.78 | - | | 56.62 |
| 5_01 | | | | 0.26 | 0.40 | - | | 39.18 |
| 6_01 | | | | 0.40 | 0.65 | - | | 41.79 |
| Average | | | | 0.37 | 0.64 | - | | 58.56 |
Optimization Result of Libzpaq Library Using SIMD
| Dataset | Improvement, JIT (%) | Improvement, NON-JIT (%) |
|---|---|---|
| 1_01 | 18.63 | |
| 1_02 | 20.33 | 21.55 |
| 2_01 | 16.15 | 16.43 |
| 2_02 | 16.35 | 16.17 |
| 3_01 | 16.17 | 19.17 |
| 4_01 | 16.75 | 19.96 |
| 4_02 | 16.52 | 19.16 |
| 5_02 | 12.27 | 19.47 |
| 5_01 | 15.96 | |