Yuting Xing1, Gen Li2, Zhenguo Wang2, Bolun Feng2, Zhuo Song3, Chengkun Wu4.
Abstract
BACKGROUND: The rapid development of DNA sequencing technology is generating truly big data, which demands ever more storage and bandwidth. To speed up data sharing and bring data to computing resources faster and more cheaply, it is necessary to develop a compression tool that can support efficient compression and transmission of sequencing data onto cloud storage.
Keywords: Cloud computing; Compression; FASTQ; General-purpose; Lossless; Parallel compression and transmission
Year: 2017 PMID: 29297296 PMCID: PMC5751770 DOI: 10.1186/s12859-017-1973-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The workflow of GTZ
Fig. 2 The hierarchy of data containers
The format of a FASTQ file
| Line | Content |
|---|---|
| 1 | @ERR194147.1.HSQ1004:134:C0D8DACXX:1:1104:3874:86,238/1 |
| 2 | GGTTCCTACTTNAGGGTCATTAAATAGCCCACACGTC |
| 3 | + |
| 4 | CC@FFFFFHHH#JJJFHIIJJJJJJJIJHIJJJJJJJ |
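The four-line record layout shown above can be illustrated with a minimal Python sketch that splits a FASTQ record into its identifier, read, separator, and quality string. The names `FastqRecord` and `parse_fastq` are hypothetical and chosen for illustration only; this is not GTZ's parser.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, the base calls
    separator: str   # line 3, starts with "+"
    quality: str     # line 4, one quality character per base


def parse_fastq(lines: list[str]) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield FastqRecord(header[1:], seq, sep, qual)


# The example record from the table above, copied as printed there.
example = [
    "@ERR194147.1.HSQ1004:134:C0D8DACXX:1:1104:3874:86,238/1",
    "GGTTCCTACTTNAGGGTCATTAAATAGCCCACACGTC",
    "+",
    "CC@FFFFFHHH#JJJFHIIJJJJJJJIJHIJJJJJJJ",
]
records = list(parse_fastq(example))
```

In this record the uncalled base `N` and the lowest quality character `#` sit at the same position, which shows how line 4 annotates line 2 base by base.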
Fig. 3 Pre-process data files with pre-processing controllers and compression units
Fig. 4 Work flow of a typical statistical modelling
Fig. 5 A low-order encoder scheme
Fig. 6 A multi-order encoder scheme
Descriptions of 8 FASTQ datasets used for performance evaluation
| Dataset | Species | Size (MB) | Encoding | No. of quality scores in data file |
|---|---|---|---|---|
| ERR233152 | | 556 | Sanger | 32 |
| SRR935126 | | 9755 | Sanger | 39 |
| SRR489793 | | 12,807 | Illumina 1.8+ | 38 |
| SRR801793 | L. pneumophila | 2756 | Sanger | 38 |
| SRR125858 | | 50,744 | Sanger | 39 |
| SRR5419422 | RNA seq (H. sapiens) | 15,095 | Illumina 1.8+ | 6 |
| ERR1137269 | metagenomes | 56,543 | Illumina 1.8+ | 7 |
| NA12878 (read 2) | H. sapiens | 202,631 | Sanger | 38 |
Compression ratios (%) of different tools on 8 FASTQ datasets
| Dataset | GTZ | DSRC2 | QUIP | LW-FQZip | Fqzcomp | LFQC | pigz |
|---|---|---|---|---|---|---|---|
| ERR233152 | 15.9 | 16.7 | 19 | 19 | 16.8 | | 26.4 |
| SRR935126 | 18.6 | 19.6 | 17.7 | 20.5 | 17.8 | | 30.2 |
| SRR489793 | 22.8 | 22.7 | 22.6 | 25.5 | 22.5 | | 34.4 |
| SRR801793 | 21.4 | 21.9 | 21.1 | 21.2 | 20.8 | | 34.1 |
| SRR125858 | 19.4 | 19.5 | 18.9 | 23.1 | 28.9 | | 31 |
| SRR5419422 | 12.8 | 13.9 | | 12.5 | 12 | ERROR | 22 |
| ERR1137269 | 12.2 | 13.4 | 12.8 | 14.3 | | ERROR | 21.9 |
| NA12878 (read 2) | | 24 | 20.4 | TLE | 19.9 | TLE | 24.7 |
| avg | 17.86 | 18.96 | 17.93 | 19.44 | 18.83 | | 28.09 |
| SD | 3.87 | 3.97 | 4.07 | 4.64 | 5.60 | 3.62 | 5.05 |
| CV | 0.22 | 0.21 | 0.23 | 0.24 | 0.30 | 0.30 | |
The best results of all the tools are boldfaced
Fig. 7 CVs for the compression ratio of different tools
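The CV row of the compression ratio table is consistent with the coefficient of variation, CV = SD / mean. A minimal sketch that recomputes it from the reported avg and SD values follows; the function name `coefficient_of_variation` is illustrative, and LFQC is left out because its average is not available in this record.

```python
# (mean, SD) of the compression ratio in %, as reported in the avg and SD rows.
reported = {
    "GTZ": (17.86, 3.87),
    "DSRC2": (18.96, 3.97),
    "QUIP": (17.93, 4.07),
    "LW-FQZip": (19.44, 4.64),
    "Fqzcomp": (18.83, 5.60),
    "pigz": (28.09, 5.05),
}


def coefficient_of_variation(mean: float, sd: float) -> float:
    """Relative spread: standard deviation divided by the mean."""
    return sd / mean


# Round to two decimals, the precision used in the CV row.
cvs = {
    tool: round(coefficient_of_variation(mean, sd), 2)
    for tool, (mean, sd) in reported.items()
}
```

A lower CV means the compression ratio varies less from dataset to dataset relative to its average.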
Compression time (s) of different tools on 8 FASTQ datasets
| Dataset | Size (MB) | GTZ | DSRC2 | QUIP | LW-FQZip | Fqzcomp | LFQC | pigz |
|---|---|---|---|---|---|---|---|---|
| ERR233152 | 556.1 | 19 | 13 | 10 | 284 | 13 | 297 | |
| SRR935126 | 9754.6 | 49 | | 195 | 3966 | 191 | 3610 | 129 |
| SRR489793 | 12,807 | 51 | | 343 | 4893 | 289 | 4253 | 122 |
| SRR801793 | 2756.2 | 43 | 28 | 59 | 1212 | 73 | 1143 | |
| SRR125858 | 50,744.2 | 178 | | 1044 | 18,300 | 977 | 10,202 | 481 |
| SRR5419422 | 15,094.6 | 26 | | 329 | 4234 | 267 | ERROR | 67 |
| ERR1137269 | 56,543 | 117 | | 806 | 12,018 | 851 | ERROR | 213 |
| NA12878 (read 2) | 202,631 | 820 | 700 | 4703 | TLE | 4389 | TLE | |
| Average speed (MB/s) | | 267.4 | | 49.7 | 2.9 | 49.6 | 33.7 | 176.8 |
The best results of all the tools are boldfaced
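How the "Average speed (MB/s)" row is derived is not stated here. The sketch below assumes an unweighted mean of the per-dataset throughputs (size divided by compression time); under that assumption the GTZ and QUIP columns reproduce the reported 267.4 and 49.7 MB/s. The function name `mean_throughput` is illustrative.

```python
# Sizes (MB) and compression times (s) copied from the table above.
sizes_mb = [556.1, 9754.6, 12807, 2756.2, 50744.2, 15094.6, 56543, 202631]
gtz_seconds = [19, 49, 51, 43, 178, 26, 117, 820]
quip_seconds = [10, 195, 343, 59, 1044, 329, 806, 4703]


def mean_throughput(sizes: list[float], seconds: list[float]) -> float:
    """Unweighted mean of per-dataset throughput in MB/s (assumed method)."""
    speeds = [size / t for size, t in zip(sizes, seconds)]
    return sum(speeds) / len(speeds)


gtz_speed = round(mean_throughput(sizes_mb, gtz_seconds), 1)
quip_speed = round(mean_throughput(sizes_mb, quip_seconds), 1)
```

Because the mean is unweighted, small datasets count as much as the 202,631 MB one; dividing total size by total time would give a slightly different figure.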
Total time (compression time (s) + best data upload time) of different tools on 8 FASTQ datasets with maximum bandwidth
| Dataset | Size (MB) | GTZ | DSRC2 | QUIP | LW-FQZip | Fqzcomp | LFQC | pigz |
|---|---|---|---|---|---|---|---|---|
| ERR233152 | 556.1 | 19.0 | 13.4 | 10.4 | 284.4 | 13.4 | 297.4 | |
| SRR935126 | 9754.6 | 49.0 | | 202.8 | 3973.8 | 198.8 | 3617.8 | 136.8 |
| SRR489793 | 12,807 | | 59.2 | 353.2 | 4903.2 | 299.2 | 4263.2 | 132.2 |
| SRR801793 | 2756.2 | 43.0 | 30.2 | 61.2 | 1214.2 | 75.2 | 1145.2 | |
| SRR125858 | 50,744.2 | | 193.6 | 1084.6 | 18,340.6 | 1017.6 | 10,242.6 | 521.6 |
| SRR5419422 | 15,094.6 | 26.0 | | 341.1 | 4246.1 | 279.1 | ERROR | 79.1 |
| ERR1137269 | 56,543 | 117.0 | | 851.2 | 12,063.2 | 896.2 | ERROR | 258.2 |
| NA12878 (read 2) | 202,631 | 820.0 | 862.1 | 4865.1 | TLE | 4551.1 | TLE | |
| Average speed (MB/s) | | | 269.1 | 45.2 | 7.8 | 47.9 | 17.9 | 181.1 |
The best results of all the tools are boldfaced
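The total times for the non-GTZ tools can be read as compression time plus a fixed upload time per dataset. The bandwidth is not stated here; the sketch below assumes 1250 MB/s (10 Gbit/s) applied to the size listed in the Size column, a figure inferred only because it reproduces the offsets between the two tables. `UPLOAD_MB_PER_S` and `total_time` are hypothetical names.

```python
# Assumption: inferred from the table offsets, not stated in the text.
UPLOAD_MB_PER_S = 1250.0


def total_time(size_mb: float, compression_s: float,
               bandwidth: float = UPLOAD_MB_PER_S) -> float:
    """Compression time plus best-case upload time, in seconds."""
    return compression_s + size_mb / bandwidth


# Sizes and compression times taken from the compression time table.
quip_srr935126 = round(total_time(9754.6, 195), 1)
fqzcomp_srr125858 = round(total_time(50744.2, 977), 1)
dsrc2_na12878 = round(total_time(202631, 700), 1)
```

The GTZ entries that are present equal its compression times, which fits a tool that transmits data in parallel with compression, as the keywords suggest.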
Total time of different tools on the SRR125858_2 dataset in a real test
| Metric | GTZ | DSRC2 | QUIP | LW-FQZip | Fqzcomp | LFQC | pigz |
|---|---|---|---|---|---|---|---|
| Compression ratio (%) | 19.2 | 19.2 | 18.7 | 23.2 | 28.7 | 18 | 30.7 |
| Total time (s) | 99 | 122 | 553 | 9283 | 549 | 4982 | 324 |
Qualitative performance summary
| Algorithm | Compression speed | Compression ratio |
|---|---|---|
| GTZ | High | Moderate |
| DSRC2 | High | Moderate |
| QUIP | Moderate | Moderate |
| LW-FQZip | Low | Moderate |
| Fqzcomp | Moderate | Low |
| LFQC | Moderate | Low |
| pigz | High | High |
The compression ratio (%) of GTZ on the three components of FASTQ files
| Dataset | Metadata | Reads | Quality scores |
|---|---|---|---|
| ERR233152 | 2.62 | 20.6 | 20.8 |
| SRR935126 | 3.29 | 22.2 | 25.3 |
| SRR489793 | 0.01 | 22.7 | 29.95 |
| SRR801793 | 3.73 | 23.15 | 31.1 |
| SRR125858 | 2.81 | 23.3 | 28.25 |
| SRR5419422 | 0.01 | 22.9 | 9.5 |
| ERR1137269 | 3.23 | 24.05 | 19.35 |
| NA12878 (read 2) | 7.59 | 20.4 | 27.3 |
| Average | 2.91 | 22.39 | 23.94 |