Zhi-An Huang, Zhenkun Wen, Qingjin Deng, Ying Chu, Yiwen Sun, Zexuan Zhu.
Abstract
BACKGROUND: The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The explosively growing volume of raw data outpaces the decline in disk cost, and the storage of huge sequencing data sets has become a bottleneck for downstream analyses. Data compression is considered a solution for reducing the dependency on storage, so efficient sequencing data compression methods are in high demand.
Keywords: High-throughput sequencing; Reference-based compression; Sequence alignment; Sequencing data compression
Year: 2017 PMID: 28320326 PMCID: PMC5359991 DOI: 10.1186/s12859-017-1588-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The general framework of LW-FQZip 2. First, the input FASTQ file is split into three data streams: metadata, bases, and quality scores. Second, the quality scores and metadata are compacted with run-length-limited encoding and incremental encoding, respectively, while the nucleotide bases are partitioned and mapped to an external reference sequence using the light-weight mapping model. Finally, the processed intermediate files from the three streams are compressed with an arithmetic coder and/or other specific coding schemes.
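The stream split and the two pre-encoding steps named in the Fig. 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the LW-FQZip 2 implementation: all function names are hypothetical, the run cap `max_run` is one possible reading of "run-length-limited", and incremental encoding is interpreted here as common-prefix coding against the previous header. The reference mapping and arithmetic coding stages are not covered.

```python
from typing import Iterable, List, Tuple


def split_fastq(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split 4-line FASTQ records into metadata, base, and quality streams."""
    metadata, bases, quality = [], [], []
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, _plus, qual = record
            metadata.append(header)
            bases.append(seq)
            quality.append(qual)
            record = []
    return metadata, bases, quality


def run_length_encode(text: str, max_run: int = 255) -> List[Tuple[str, int]]:
    """Collapse repeated symbols into (symbol, count) pairs.

    Assumption: each run is capped at max_run, so a longer run is
    emitted as several consecutive pairs.
    """
    runs: List[Tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch and runs[-1][1] < max_run:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def incremental_encode(headers: List[str]) -> List[Tuple[int, str]]:
    """Store each header as (shared-prefix length with previous header, suffix)."""
    encoded: List[Tuple[int, str]] = []
    prev = ""
    for header in headers:
        n = 0
        limit = min(len(prev), len(header))
        while n < limit and prev[n] == header[n]:
            n += 1
        encoded.append((n, header[n:]))
        prev = header
    return encoded


# Two made-up records, for illustration only.
example = [
    "@read.1 len=8", "ACGTACGT", "+", "IIIIHHHH",
    "@read.2 len=8", "TTGACCAA", "+", "IIIIIIII",
]
meta, seqs, quals = split_fastq(example)
print(incremental_encode(meta))
print([run_length_encode(q) for q in quals])
```

Sequential read headers usually differ only in a counter, so the prefix-plus-suffix form leaves little for the final entropy coder to store; long runs of identical quality symbols benefit from the run-length step in the same way.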
The ten real-world FASTQ data sets used for performance evaluation
| Category | Dataset | Platform | Species | Read length (bp) | Size (MB) | GC content |
|---|---|---|---|---|---|---|
| Long-read | SRR2916693 | 454GS | Pseudomonas moraviensis | 67-1201 | 425 | 58.8% |
| Long-read | SRR2994368 | Illumina MiSeq | Escherichia coli | 70-502 | 4688 | 49.7% |
| Long-read | SRR3211986 | PacBio RS | Homo sapiens | 2-62746 | 1759 | 39.6% |
| Long-read | ERR739513 | MinION | Phage | 5-246140 | 871 | 47.9% |
| Long-read | SRR3190692 | Illumina MiSeq | Escherichia coli | 70-602 | 11379 | 52.3% |
| Short-read | ERR385912 | Illumina HiSeq 2000 | Escherichia coli | 51 | 641 | 43.5% |
| Short-read | ERR386131 | Ion Torrent PGM | Capsicum baccatum | 151 | 1371 | 50.5% |
| Short-read | SRR034509 | Illumina Analyzer II | Escherichia coli | 101 | 5247 | 52.6% |
| Short-read | ERR174310 | Illumina HiSeq 2000 | Homo sapiens | 202 | 105122 | N.A. |
| Short-read | ERR194147 | Illumina HiSeq 2000 | Homo sapiens | 101 | 202631 | 40.3% |
Note: The long-read data sets have variable-length reads, while the short-read data sets have fixed-length reads
The compression ratios of the compared methods on ten test data sets
| Category | Dataset | LW-FQZip 2 | LW-FQZip 2 (−g) | LW-FQZip 1 | Quip (−a) | Quip (−r) | DSRC 2 | CRAM | FQZcomp | LFQC | LEON | SCALCE | bzip 2 | gzip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Long-read | SRR2916693 | 16.7% | 15.3% | 18.1% | 20.9% | 20.5% | 20.2% | 21.9% | 21.6% | | 19.5% | 17.2%a | 24.2% | 29.6% |
| Long-read | SRR2994368 | 17.3% | | 17.9% | 20.1% | N/A | 23.2% | 26.4% | N/A | N/A | 23.1% | 17.3%a | 28.5% | 34.2% |
| Long-read | SRR3211986 | 33.3% | | N/A | 33.3% | N/A | N/A | 33.9% | N/A | | N/A | 33.4%a | 36.4% | 42.6% |
| Long-read | ERR739513 | 35.2% | | N/A | N/A | N/A | N/A | 35.6% | N/A | 34.9% | N/A | N/A | 39.7% | 45.4% |
| Long-read | SRR3190692 | 12.7% | | 13.2% | 16.5% | N/A | 20.3% | 22.3% | N/A | N/A | 18.1% | 12.7%a | 24.4% | 29.5% |
| Short-read | ERR385912 | 6.4% | | 6.6% | 7.2% | N/A | 7.8% | N/A | N/A | 5.8% | 7.0% | 6.6%a | 13.9% | 17.9% |
| Short-read | ERR386131 | 16.5% | 16.0% | 18.7% | 17.7% | 16.6% | 16.8% | 25.5% | 24.6% | | N/A | 16.6%a | 21.5% | 26.0% |
| Short-read | SRR034509 | 23.7% | | 25.0% | 25.1% | 24.9% | 26.1% | 27.4% | 26.1% | 23.7% | 27.9% | 24.5%a | 31.5% | 36.9% |
| Short-read | ERR174310 | 21.0% | 20.1% | N/A | | N/A | 20.2% | N/A | N/A | N/A | 25.3% | 19.6%a | 26.2% | 31.7% |
| Short-read | ERR194147 | 20.1% | | N/A | 20.0% | N/A | 20.3% | N/A | N/A | N/A | 20.3% | 15.4%a | 19.7% | 23.6% |
Compression ratio: the compressed file size divided by the original file size. 'N/A': the program could not process the data because of an error, such as a failed decompression or a loss of fidelity after decompression. 'a': the read order is changed after decompression. The best results are highlighted in bold.
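As a small illustration of the ratio defined in the table note (compressed size divided by original size, so smaller is better), the sketch below computes it for an in-memory buffer. The helper name `compression_ratio` and the synthetic input are assumptions made for this example; Python's standard `gzip` and `bz2` modules stand in for the command-line tools and will not reproduce the table's figures.

```python
import bz2
import gzip


def compression_ratio(compressed_size: int, original_size: int) -> float:
    """Compressed size divided by original size (smaller is better)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return compressed_size / original_size


# Synthetic, highly repetitive FASTQ-like data, for illustration only.
data = b"@read\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" * 1000

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    ratio = compression_ratio(len(compress(data)), len(data))
    print(f"{name}: {ratio:.1%}")
```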
Fig. 2 The average compression and decompression speeds of the compared methods on ten test data sets. The compression speed is calculated as the original file size divided by the compression time. The decompression speed is calculated as the original file size divided by the decompression time.
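Both speeds in the Fig. 2 caption use the original (uncompressed) file size as the numerator. A minimal sketch of that calculation follows; the function names, the choice of 1 MB = 1024 × 1024 bytes, and the use of `zlib` as a stand-in compressor are assumptions for illustration, not details taken from the paper.

```python
import time
import zlib
from typing import Callable

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def speed_mb_per_s(original_size_bytes: int, elapsed_seconds: float) -> float:
    """Original file size (in MB) divided by the elapsed time (in seconds)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return original_size_bytes / BYTES_PER_MB / elapsed_seconds


def timed_speed(func: Callable[[bytes], bytes], payload: bytes,
                original_size_bytes: int) -> float:
    """Time one call of func(payload) and report speed relative to the original size."""
    start = time.perf_counter()
    func(payload)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return speed_mb_per_s(original_size_bytes, elapsed)


original = b"ACGT" * 250_000  # 1,000,000 bytes of synthetic data
compressed = zlib.compress(original)

# Decompression speed is also measured against the ORIGINAL size.
print(f"compression:   {timed_speed(zlib.compress, original, len(original)):.1f} MB/s")
print(f"decompression: {timed_speed(zlib.decompress, compressed, len(original)):.1f} MB/s")
```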
The memory usage (MB) of the compared methods on ten test data sets
| Dataset | Operation | LW-FQZip 2 | LW-FQZip 2 (−g) | LW-FQZip 1 | Quip (−a) | Quip (−r) | DSRC 2 | CRAM | FQZcomp | LFQC | SCALCE | LEON | bzip 2 | gzip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRR2916693 | compression | 1605 | 4420 | 459 | 759 | 391 | 582 | 784 | 312 | 3965 | 1380 | 1722 | 7.6 | |
| SRR2916693 | decompression | 1598 | 4127 | 37 | 756 | 389 | 601 | 620 | 314 | 3932 | 1050 | 696 | 4.8 | |
| SRR2994368 | compression | 1582 | 14158 | 1048 | 2801 | 389 | 6175 | 1355 | N/A | N/A | 4073 | 5435 | 7.6 | |
| SRR2994368 | decompression | 1579 | 13154 | 69 | 2198 | 387 | 7234 | 652 | N/A | N/A | 1050 | 2623 | 4.8 | |
| SRR3211986 | compression | 1190 | 12935 | N/A | 1098 | N/A | N/A | 5777 | N/A | 4768 | 2158 | N/A | 7.6 | |
| SRR3211986 | decompression | 1528 | 5657 | N/A | 1109 | N/A | N/A | 2381 | N/A | 4320 | 1035 | N/A | 4.8 | |
| ERR739513 | compression | 1283 | 11511 | N/A | N/A | N/A | N/A | 3694 | N/A | 5108 | N/A | N/A | 7.6 | |
| ERR739513 | decompression | 1403 | 11079 | N/A | N/A | N/A | N/A | 1455 | N/A | 4748 | N/A | N/A | 4.8 | |
| SRR3190692 | compression | 1726 | 14560 | 1058 | 3552 | 391 | 14157 | 1363 | N/A | N/A | 5219 | 6776 | 7.6 | |
| SRR3190692 | decompression | 1725 | 13329 | 69 | 2898 | 386 | 14794 | 661 | N/A | N/A | 1055 | 3217 | 4.8 | |
| ERR385912 | compression | 1603 | 2793 | 410 | 772 | 389 | 911 | N/A | N/A | 3140 | 1422 | 1717 | 7.6 | |
| ERR385912 | decompression | 1603 | 2655 | 69 | 770 | 392 | 908 | N/A | N/A | 3060 | 1040 | 504 | 4.8 | |
| ERR386131 | compression | 1691 | 12443 | 1033 | 771 | 389 | 1844 | 1318 | 322 | 5175 | 1961 | N/A | 7.6 | |
| ERR386131 | decompression | 1721 | 12165 | 39 | 768 | 384 | 1950 | 651 | 319 | 4835 | 1049 | N/A | 4.8 | |
| SRR034509 | compression | 1748 | 14670 | 1073 | 1531 | 383 | 6683 | 1351 | 324 | 5352 | 4151 | 4139 | 7.6 | |
| SRR034509 | decompression | 1752 | 7042 | 71 | 1218 | 391 | 7736 | 653 | 309 | 4859 | 1050 | 1799 | 4.8 | |
| ERR174310 | compression | 1886 | 16270 | N/A | 5333 | N/A | 5558 | N/A | N/A | N/A | 5333 | 7797 | 7.6 | |
| ERR174310 | decompression | 1865 | 11239 | N/A | 1156 | N/A | 18487 | N/A | N/A | N/A | 1045 | 5106 | 4.8 | |
| ERR194147 | compression | 1953 | 17771 | N/A | 782 | N/A | 20271 | N/A | N/A | N/A | 5380 | 7192 | 7.6 | |
| ERR194147 | decompression | 1963 | 13908 | N/A | 780 | N/A | 24284 | N/A | N/A | N/A | 1057 | 5322 | 4.8 | |
Note: The best results are highlighted in bold
Fig. 3 Comparison between LW-FQZip 2, LW-FQZip 2 (−g), and LFQC in a radar chart in terms of average compression ratio, compression time, decompression time, compression memory usage, and decompression memory usage. For each criterion, the results of the three methods are normalized to the range [0, 1]; a smaller value, i.e., one closer to the center of the chart, indicates better performance.
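The Fig. 3 caption does not state how the results are normalized. One common choice that maps each criterion onto [0, 1] is min-max scaling across the compared methods, sketched below. The function name and the input numbers are hypothetical; they are not values from the paper.

```python
from typing import Dict


def min_max_normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale one criterion's values across methods to [0, 1].

    Assumption: min-max scaling, so the best (smallest) method maps to 0
    and the worst (largest) maps to 1.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All methods tie on this criterion; place them at the center.
        return {name: 0.0 for name in values}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


# Made-up compression times in seconds, for illustration only.
times = {"method_A": 10.0, "method_B": 20.0, "method_C": 30.0}
print(min_max_normalize(times))
```

Because every criterion in the chart is lower-is-better, no axis needs to be inverted after scaling.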