Hansen Chen1, Jianhua Chen2, Zhiwen Lu1, Rongshu Wang1.
Abstract
BACKGROUND: Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that must be stored and transmitted has grown exponentially. In FASTQ, the standard format for sequencing data, the quality scores are a key and difficult part of file compression. A survey of the literature shows that most current quality score compression methods do not support random access. It is therefore worthwhile to investigate a lossless quality score compressor that combines a high compression rate, fast compression and decompression, and support for random access.
Keywords: FASTQ; Lossless compressor; Quality score; Random access
Year: 2022 PMID: 35870880 PMCID: PMC9308261 DOI: 10.1186/s12859-022-04837-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 An example of the FASTQ format
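Since the figure itself is not reproduced here, the record layout it refers to can be sketched in code. The following is a minimal illustration, assuming the common four-line FASTQ layout (identifier line starting with `@`, sequence, separator line starting with `+`, quality string) and a Phred+33 quality encoding; the names `FastqRecord`, `read_fastq`, and `phred_scores` are hypothetical and not taken from the paper.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].quality)                # II5!
print(phred_scores(records[0].quality))  # [40, 40, 20, 0]
```

The quality string is the field that a quality score compressor operates on; it has one character per base.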
Fig. 2 The structure of the random access and index
Detailed descriptions of tested genome datasets
| Code | Datasets | Platforms | Organism | Bases (Mbp) | Read length | Size (Quality Score) |
|---|---|---|---|---|---|---|
| 1 | SRR1284073 | PacBio | Escherichia coli | 649.4 | (130,10,000) | 476,930,701 bytes |
| 2 | SRR327342 | Illumina | S. cerevisiae | 2100 | 75 | 2,105,137,860 bytes |
| 3 | SRR870667 | Illumina | T. cacao | 12,600 | 74 or 108 | 11,455,676,056 bytes |
| 4 | ERR091571 | Illumina | Homo sapiens | 42,700 | 100 | 43,133,335,476 bytes |
| 5 | SRR003187 | LS454 | Homo sapiens | 803 | (500,1000) | 798,985,944 bytes |
| 6 | SRR003177 | LS454 | Homo sapiens | 855 | (500,1000) | 850,464,554 bytes |
| 7 | SRR007215 | ABI Solid | Homo sapiens | 238.6 | 25 | 248,099,332 bytes |
| 8 | SRR010712 | ABI Solid | Homo sapiens | 431.6 | 35 | 443,972,736 bytes |
| 9 | SRR070253 | ABI Solid | Homo sapiens | 45,600 | 50 | 12,719,021,580 bytes |
| 10 | SRR801793 | Illumina | Legionella pneumophila | 1100 | 100 | 1,092,105,122 bytes |
| 11 | SRR14340293 | Oxford Nanopore | Puccinia graminis | 8900 | (1000,10,000) | 7,782,970,748 bytes |
Comparison results of compression rates
| Dataset | CMIC (bits per quality score) | LCQS (bits per quality score) | AQUa (bits per quality score) | 7-zip (bits per quality score) | Gzip (bits per quality score) | CMIC file size vs LCQS | CMIC file size vs AQUa | CMIC file size vs 7-zip | CMIC file size vs Gzip |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.10 | 2.27 | – | 2.54 | 2.79 | −8.13% | − | −20.99% | −33.17% |
| 2 | 2.75 | 3.05 | 3.38 | 3.37 | 3.74 | −10.71% | −22.78% | −22.46% | −36.03% |
| 3 | 2.31 | 2.38 | – | 2.85 | 3.05 | −2.96% | − | −23.42% | −31.95% |
| 4 | 2.01 | 2.04 | 2.52 | 2.44 | 2.86 | −1.49% | −25.64% | −21.82% | −42.51% |
| 5 | 1.33 | 1.74 | – | 2.12 | 2.59 | − | – | − | − |
| 6 | 1.38 | 1.70 | – | 2.08 | 2.54 | −23.34% | – | −50.52% | −83.91% |
| 7 | 4.13 | 4.68 | 4.91 | 5.09 | 5.26 | −13.35% | −18.78% | −23.17% | −27.17% |
| 8 | 4.20 | 4.75 | 5.02 | 5.15 | 5.32 | −13.14% | −19.73% | −22.65% | −26.67% |
| 9 | 1.01 | 1.16 | 1.30 | 1.28 | 1.37 | −15.08% | − | −26.43% | −35.45% |
| 10 | 2.47 | 2.50 | 3.02 | 2.96 | 3.44 | −1.05% | −22.32% | −19.96% | −39.17% |
| 11 | 3.14 | 3.82 | – | 4.23 | 4.52 | −21.91% | − | −34.95% | −44.03% |
Bold denotes the best compression rates for compressors
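To make the bits-per-quality-score figures concrete, the sketch below converts a rate into an estimated compressed size. It assumes that each quality score occupies one byte in the uncompressed data, so that the "Size (Quality Score)" column of the dataset table approximates the number of scores; that assumption and the function names are illustrative, not stated in the source.

```python
def compressed_bytes(num_scores: int, bits_per_score: float) -> float:
    """Estimated compressed size in bytes for a given rate in bits per score."""
    return num_scores * bits_per_score / 8.0


def compression_ratio(bits_per_score: float) -> float:
    """Ratio of uncompressed to compressed size, assuming 8 bits per raw score."""
    return 8.0 / bits_per_score


# Dataset 1 (SRR1284073): 476,930,701 bytes of quality scores.
num_scores = 476_930_701  # assumption: one byte per quality score
for name, rate in [("CMIC", 2.10), ("LCQS", 2.27), ("Gzip", 2.79)]:
    size_mb = compressed_bytes(num_scores, rate) / 1e6
    print(f"{name}: about {size_mb:.1f} MB, ratio {compression_ratio(rate):.2f}:1")
```

Under these assumptions, a rate of 2.10 bits per quality score corresponds to roughly 125 MB for dataset 1, against about 477 MB uncompressed.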
Fig. 3 The average compression rates of the five compressors over the 11 datasets
Comparison results of compression speed
| Dataset | CMIC (MB/s) | LCQS (MB/s) | FCLQC (MB/s) | 7-zip (MB/s) | Gzip (MB/s) | CMIC speedup vs LCQS | CMIC speedup vs FCLQC | CMIC speedup vs 7-zip | CMIC speedup vs Gzip |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.592 | 5.073 | | 1.511 | | 10.23% | – | 270.09% | −12.61% |
| 2 | 8.796 | 8.741 | | 6.435 | 4.523 | 0.63% | −3352.25% | 36.69% | 94.47% |
| 3 | 10.556 | 9.472 | | 1.213 | 2.292 | 11.44% | −3018.23% | 770.24% | 360.56% |
| 4 | 12.86 | 11.074 | | 1.09 | 2.622 | 16.13% | −2156.84% | 1079.82% | 390.47% |
| 5 | | 6.981 | – | 2.822 | 2.0538 | 20.41% | – | 197.87% | 309.29% |
| 6 | | 6.983 | – | 2.846 | 2.768 | 14.74% | – | 181.52% | 189.45% |
| 7 | 2.831 | 2.787 | | 0.986 | 21.274 | 1.58% | −5520.63% | 187.12% | −86.69% |
| 8 | 3.532 | 3.374 | | 1.882 | 9.876 | 4.68% | – | 87.67% | −64.24% |
| 9 | 6.621 | 6.126 | | 3.063 | 13.644 | 8.08% | −6267.32% | 116.16% | −51.47% |
| 10 | 5.444 | 5.389 | | 2.351 | 3.36 | 1.02% | −5644.67% | 131.56% | 62.02% |
| 11 | 6.095 | 5.274 | – | 2.586 | | 15.57% | – | 135.69% | −73.87% |
Bold denotes the fastest compression speed
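The accelerating ratios against LCQS, 7-zip, and Gzip in the table above appear to follow (CMIC speed − other speed) / other speed. The sketch below checks this for dataset 2, assuming its surviving speed values belong to CMIC (8.796), LCQS (8.741), 7-zip (6.435), and Gzip (4.523) in MB/s. The formula is inferred from the numbers rather than stated in the source, and `speedup_percent` is a hypothetical helper name.

```python
def speedup_percent(cmic_speed: float, other_speed: float) -> float:
    """Relative speed gain of CMIC over another compressor, in percent."""
    return (cmic_speed - other_speed) / other_speed * 100.0


cmic = 8.796  # MB/s, dataset 2
others = {"LCQS": 8.741, "7-zip": 6.435, "Gzip": 4.523}  # MB/s, dataset 2

for name, speed in others.items():
    print(f"CMIC vs {name}: {speedup_percent(cmic, speed):.2f}%")
# CMIC vs LCQS: 0.63%
# CMIC vs 7-zip: 36.69%
# CMIC vs Gzip: 94.47%
```

The FCLQC column cannot come from the same formula, because this expression never falls below −100% for positive speeds; those entries presumably use a different base.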
Comparison results of decompression speed
| Dataset | CMIC (thousand lines/s) | LCQS (thousand lines/s) | CMIC speedup vs LCQS |
|---|---|---|---|
| 1 | 2.43 | 1.36 | 32.92% |
| 2 | 2.51 | 2.31 | 7.97% |
| 3 | 2.67 | 2.10 | 21.35% |
| 4 | 2.42 | 2.31 | 4.55% |
| 5 | 3.13 | 2.96 | 5.43% |
| 6 | 3.21 | 2.85 | 11.21% |
| 7 | 2.20 | 2.11 | 4.27% |
| 8 | 2.67 | 2.16 | 19.10% |
| 9 | 1.82 | 1.73 | 4.95% |
| 10 | 2.77 | 2.75 | 0.72% |
| 11 | 1.63 | 1.47 | 9.82% |
Comparison results of random access speed
| Method | No. 3 [75200000–75600000] (ms) | No. 4 [300500000–300900000] (ms) | No. 9 [120400000–121000000] (ms) | No. 10 [4300000–4700000] (ms) | No. 11 [700000–1100000] (ms) |
|---|---|---|---|---|---|
| CMIC | | | | | |
| LCQS | 524.4 | 869.3 | 530.5 | 418.0 | 496.1 |
Bold denotes the fastest random access speed
The decompression times by random access for different block sizes
| Dataset | Block size 15,000 | Block size 20,000 | Block size 25,000 | Block size 30,000 |
|---|---|---|---|---|
| 1 | 149.5 | 201.2 | 242.8 | 298.6 |
| 3 | 136.8 | 176.3 | 225.6 | 274.5 |
| 4 | 145.2 | 194.7 | 235.7 | 289.2 |
| 10 | 140.8 | 186.4 | 230.3 | 284.6 |
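The times above grow with block size, which is what one would expect if a range query has to decompress every block it touches in full. The following is a minimal sketch of block-indexed random access under that assumption. It uses `zlib` as a stand-in coder, not CMIC's actual method, and the names `compress_blocks` and `read_range` are hypothetical.

```python
import zlib
from typing import List, Tuple

Index = List[Tuple[int, int]]  # (byte offset, compressed length) per block


def compress_blocks(lines: List[str], block_size: int) -> Tuple[bytes, Index]:
    """Compress lines in independent blocks and record where each block starts."""
    payload = bytearray()
    index: Index = []
    for start in range(0, len(lines), block_size):
        block = "\n".join(lines[start:start + block_size]).encode("ascii")
        compressed = zlib.compress(block)
        index.append((len(payload), len(compressed)))
        payload += compressed
    return bytes(payload), index


def read_range(payload: bytes, index: Index, block_size: int,
               first: int, last: int) -> List[str]:
    """Return lines first..last (0-based, inclusive), decoding only needed blocks."""
    out: List[str] = []
    for block_no in range(first // block_size, last // block_size + 1):
        offset, length = index[block_no]
        raw = zlib.decompress(payload[offset:offset + length])
        block_lines = raw.decode("ascii").split("\n")
        base = block_no * block_size  # line number of the block's first line
        lo = max(first - base, 0)
        hi = min(last - base, len(block_lines) - 1)
        out.extend(block_lines[lo:hi + 1])
    return out


# Usage: 100 synthetic quality lines, blocks of 16 lines each.
lines = ["I" * 10 + chr(33 + i % 40) for i in range(100)]
payload, index = compress_blocks(lines, block_size=16)
subset = read_range(payload, index, 16, first=30, last=50)
print(len(subset))  # 21
```

In this scheme, larger blocks usually compress better because the coder sees more context, but each query decodes more unneeded lines, which is the trade-off the block sizes in the table explore.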