James K Bonfield, Matthew V Mahoney.
Abstract
Storage and transmission of the data produced by modern DNA sequencing instruments have become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.
Year: 2013 PMID: 23533605 PMCID: PMC3606433 DOI: 10.1371/journal.pone.0059190
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. SequenceSqueeze results: real time vs compression ratio.
Each mark represents a different entry, coloured by author. Ibrahim Numanagic’s entry had a minor decoding problem causing a minority of read-names to mismatch. All other entries plotted here were lossless. The entries have been broken down into reference based (A) and non-reference based (B) solutions. A clear wall can be seen among the non-reference methods, where minor linear improvements in compression ratio require exponential growth in CPU time.
Data sets used for program evaluation.
| Run ID | Platform | Species | No. Seqs | Length | File size | Depth |
| SRR003177 | 454 GS FLX Titanium | Human | 1,504,571 | 564 | 1,754,042,560 | 0.28x |
| SRR007215_1 | ABI SOLiD System 2.0 | Human | 4,711,141 | 25 | 689,319,444 | 0.04x |
| SRR027520_1 | Illumina GA II | Human | 24,246,685 | 76 | 5,055,253,238 | 0.61x |
| SRR065390_1 | Illumina GA II | C. elegans | 33,808,546 | 100 | 8,819,496,191 | 33.8x |
The data sets used to test the compression tools along with the sequencing platforms that produced them. Length is the average sequence length. Depth is the average genome depth assuming 100% of sequences in the data set can be aligned.
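The Depth column can be approximated from the other columns as total bases divided by genome size. The sketch below is only illustrative: the genome sizes (3.0 Gbp for human, 100 Mbp for C. elegans) are assumptions chosen here, not values given in the table, and the function name is hypothetical.

```python
# Assumed genome sizes in base pairs (illustrative, not taken from the table).
HUMAN_GENOME_BP = 3.0e9
C_ELEGANS_GENOME_BP = 1.0e8

# Run ID -> (number of sequences, average length in bp, assumed genome size).
DATASETS = {
    "SRR003177": (1_504_571, 564, HUMAN_GENOME_BP),
    "SRR007215_1": (4_711_141, 25, HUMAN_GENOME_BP),
    "SRR027520_1": (24_246_685, 76, HUMAN_GENOME_BP),
    "SRR065390_1": (33_808_546, 100, C_ELEGANS_GENOME_BP),
}


def average_depth(num_seqs, avg_length, genome_bp):
    """Average depth if every sequence aligned: total bases / genome size."""
    return num_seqs * avg_length / genome_bp


depths = {run: average_depth(*values) for run, values in DATASETS.items()}

for run, depth in depths.items():
    print(f"{run}: {depth:.2f}x")
```

Under these assumed genome sizes the computed depths round to the figures listed in the table.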
Program names, versions and options.
| Name | Version | Compression mode | Options |
| SCALCE | 2.3 | fast | -B 1G -T 2 |
| | | slow | -c bz -T 2 |
| dsrc | 1.01 | fast | |
| | | slow | -l -lm2048 |
| SeqSqueeze1 | 1.0(svn) | slow | -h 4 1/5 -hs 5 -b 1:3 -b 1:7 -b 1:11 -b 1:15 1/20 -bg 0.9 -N -s 1:1 -s 1:2 1/5 -s 1:3 1/10 -s 1:4 1/20 -ss 10 -sg 0.95 |
| fastqz | 1.5 | fast | e |
| | | slow | c |
| fqzcomp | 4.4 | fast | -n1 -q1 -s1 |
| | | medium | -n2 -q2 -s6 |
| | | slow | -n2 -q3 -s8+ -b |
| quip | 1.1.1 | fast | |
| | | slow | -a |
| sam_comp1 | 0.7 | – | |
| sam_comp2 | 0.3 | – | |
| cramtools | 1.0 | – | --preserve-read-names -L m999 |
| goby | 2.01 | – | -x MessageChunkWriter:codec=hybrid-1 --preserve-soft-clips --preserve-read-names --preserve-all-mapped-qualities |
| samtools | 0.1.18 | – | |
| gzip | 1.3.12 | – | |
| bzip2 | 1.05 | – | |
Program version names, numbers and common command line options. Additional options were sometimes required to specify the name of the reference used, but this differed per data set.
Figure 2. File size ratios vs real time to compress.
SRR007215 is SOLiD data, SRR003177 is 454 data, while SRR027520 and SRR065390 are Illumina data at shallow and deep depths respectively. Not all programs support all types of data.
Compression rates and ratios.
| | | SRR003177 (LS454) | | | | SRR007215_1 (SOLiD) | | | |
| Program | Mode | Ratio | C.R. | D.R. | Mem | Ratio | C.R. | D.R. | Mem |
| gzip | | 0.3295 | 8.2 | | | 0.2524 | 16.6 | | |
| bzip2 | | 0.2681 | 8.3 | 12.0 | 7 | 0.1987 | 4.9 | 23.1 | 7 |
| SCALCE | fast | (a) | | | | (b) | | | |
| | slow | (a) | | | | (b) | | | |
| DSRC | fast | 0.2422 | | 51.1 | 61 | 0.1605 | 19.1 | 55.3 | 11 |
| | slow | 0.2372 | 18.2 | 47.1 | 1979 | 0.1605 | 19.5 | 51.7 | 11 |
| quip | fast | 0.2275 | 16.9 | 15.0 | 398 | (b) | | | |
| | slow | 0.2275 | 2.9 | 2.8 | 766 | (b) | | | |
| fastqz | fast | (a) | | | | (b) | | | |
| | slow | (a) | | | | (b) | | | |
| fqzcomp | fast | 0.2236 | 28.3 | 34.1 | 40 | 0.1455 | | 54.9 | |
| | medium | 0.2170 | 18.1 | 19.4 | 312 | 0.1422 | 27.2 | 38.2 | 310 |
| | slow | 0.2132 | 6.5 | 6.5 | 4407 | | 13.8 | 16.8 | 4405 |
| SeqSqueeze1 | slow | | 0.4 | 0.4 | 4587 | 0.1465 | 1.1 | 1.1 | 4888 |
| | | SRR027520_1 | | | | SRR065390_1 | | | |
| Program | Mode | Ratio | C.R. | D.R. | Mem | Ratio | C.R. | D.R. | Mem |
| gzip | | 0.3535 | 12.2 | | | 0.2805 | 8.9 | | |
| bzip2 | | 0.2905 | 7.0 | 13.0 | 7 | 0.2250 | 8.2 | 14.1 | 7 |
| SCALCE | fast | 0.2709 | 9.4 | 25.4 | 2212 | 0.1675 | 9.1 | 26.1 | 2181 |
| | slow | 0.2572 | 7.8 | 13.1 | 5162 | 0.1635 | 7.3 | 15.8 | 5257 |
| DSRC | fast | 0.2507 | 24.7 | 32.9 | 18 | 0.1912 | 26.4 | 33.2 | 20 |
| | slow | 0.2477 | 13.5 | 32.2 | 1058 | 0.1524 | 15.0 | 33.9 | 1965 |
| quip | fast | 0.2240 | 16.8 | 13.7 | 396 | 0.1622 | 17.7 | 14.5 | 391 |
| | slow | 0.2219 | 8.3 | 10.9 | 777 | 0.1584 | 8.9 | 11.7 | 775 |
| fastqz | fast | 0.3887 | | 32.8 | | 0.3456 | | 30.6 | |
| | slow | 0.2195 | 4.6 | 3.8 | 1459 | 0.1340 | 4.7 | 3.8 | 1527 |
| fqzcomp | fast | 0.2243 | 31.4 | 32.5 | 44 | 0.1733 | 32.7 | 29.4 | 40 |
| | medium | 0.2196 | 22.0 | 21.7 | 312 | 0.1524 | 22.4 | 20.8 | 311 |
| | slow | | 8.2 | 8.3 | 4407 | 0.1341 | 8.3 | 8.5 | 4406 |
| SeqSqueeze1 | slow | 0.2187 | 0.6 | 0.6 | 4919 | | 0.5 | 0.5 | 4930 |
SRR003177 is 1.5 M human sequences of variable length (avg 564 bp); SRR007215_1 is 4.7 M human sequences of length 25 bp plus 1 primer base; SRR027520_1 is 24.2 M human sequences of length 76 bp; SRR065390_1 is 33.8 M C. elegans sequences of length 100 bp. Ratio is the compressed size divided by the uncompressed size. C.R. and D.R. are compression and decompression rates in MB/s. (a) Program does not support variable length sequences. (b) Program does not support SOLiD data.
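One way to read the ratio and rate columns together is to ask which programs are not beaten on both axes at once. The sketch below applies a simple Pareto-dominance check to the SRR065390_1 ratio and compression rate figures. It is an illustration, not the authors' analysis: the function name is hypothetical, and rows whose ratio or compression rate is missing from the table as reproduced here (fastqz fast, SeqSqueeze1) are left out, so the result is partial.

```python
# (program, mode, ratio, compression rate in MB/s) for SRR065390_1.
# Only rows with both values present in the table are included.
ROWS = [
    ("gzip", "", 0.2805, 8.9),
    ("bzip2", "", 0.2250, 8.2),
    ("SCALCE", "fast", 0.1675, 9.1),
    ("SCALCE", "slow", 0.1635, 7.3),
    ("DSRC", "fast", 0.1912, 26.4),
    ("DSRC", "slow", 0.1524, 15.0),
    ("quip", "fast", 0.1622, 17.7),
    ("quip", "slow", 0.1584, 8.9),
    ("fastqz", "slow", 0.1340, 4.7),
    ("fqzcomp", "fast", 0.1733, 32.7),
    ("fqzcomp", "medium", 0.1524, 22.4),
    ("fqzcomp", "slow", 0.1341, 8.3),
]


def pareto_frontier(rows):
    """Return the rows that no other row dominates, sorted by ratio.

    Row A dominates row B when A is at least as good on both axes
    (ratio no larger, rate no smaller) and strictly better on one.
    """
    frontier = []
    for name, mode, ratio, rate in rows:
        dominated = any(
            r2 <= ratio and c2 >= rate and (r2 < ratio or c2 > rate)
            for _, _, r2, c2 in rows
        )
        if not dominated:
            frontier.append((name, mode, ratio, rate))
    return sorted(frontier, key=lambda row: row[2])


frontier = pareto_frontier(ROWS)
for name, mode, ratio, rate in frontier:
    print(f"{name:8s} {mode:7s} ratio={ratio:.4f} rate={rate:.1f} MB/s")
```

On this subset, the undominated entries are the three fqzcomp modes and fastqz slow, which is in line with the Pareto frontier claim in the abstract.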
Compression by data type.
| | | | SRR027520_1 | | | | | | SRR065390_1 | | | | | |
| Prog | Ref | Sort | Ratio | ID | Base | Qual | C.R. | Mem | Ratio | ID | Base | Qual | C.R. | Mem |
| Raw FASTQ | N | ID | 1.0000 | 419.9 | 8 | 8 | | | 1.0000 | 454.9 | 8 | 8 | | |
| Fastqz | N | ID | 0.2195 | 11.7 | 1.71 | 2.96 | 3.8 | 1459 | | 15.6 | | 1.53 | 3.8 | 1527 |
| Fqzcomp(medium) | N | ID | 0.2196 | 11.3 | 1.72 | 2.95 | | | 0.1524 | 14.8 | 1.52 | 1.52 | | |
| Fqzcomp(slow) | N | ID | | 11.3 | | | 8.2 | 4407 | 0.1341 | 14.8 | 1.16 | | 8.3 | 4406 |
| Quip | N | ID | 0.2219 | | 1.78 | 2.95 | 8.3 | 777 | 0.1584 | | 1.64 | 1.51 | 9.0 | 776 |
| Fastqz | Y | ID | 0.1816 | | 0.88 | 2.96 | 3.2 | 1365 | | | | 1.53 | 4.7 | 1352 |
| Samcomp2 | Y | ID | | | | | | | 0.1022 | 19.9 | 0.43 | | 17.1 | |
| Quip | Y | ID | 0.1885 | 22.2 | 0.90 | 2.95 | | 1515 | 0.1088 | 21.3 | 0.54 | 1.52 | | 807 |
| Fastqz | N | pos | 0.2414 | 52.1 | 1.66 | 2.95 | 3.2 | 1527 | 0.1397 | 64.1 | 0.74 | 1.54 | 4.0 | 1527 |
| Samcomp1 | N | pos | | | | | | 315 | | 58.7 | | 1.50 | | 288 |
| Samcomp2 | N | pos | 0.2628 | | 2.18 | | 13.5 | 341 | 0.1982 | 58.7 | 2.04 | | 15.2 | 341 |
| Quip | N | pos | 0.2453 | 50.5 | 1.78 | | 9.3 | 776 | 0.1890 | | 1.83 | 1.53 | 11.2 | 775 |
| SAMtools (BAM) | N | pos | 0.4013 | 137.8 | 2.79 | 4.21 | 13.7 | | 0.2344 | 150.9 | 0.94 | 2.47 | 16.7 | |
| Fastqz | Y | pos | 0.2009 | 52.1 | 0.77 | 2.95 | 2.9 | 1406 | 0.1184 | 64.1 | 0.29 | 1.54 | 4.4 | 1352 |
| Samcomp1 | Y | pos | | 49.8 | | | 15.7 | | | 58.7 | | 1.50 | | |
| Samcomp2 | Y | pos | 0.1920 | 49.8 | 0.62 | | 14.2 | 1079 | 0.1163 | 58.7 | 0.33 | | 20.1 | 365 |
| Quip | Y | pos | 0.1926 | | 0.64 | | | 1516 | 0.1165 | | 0.32 | 1.53 | 19.6 | 808 |
| Goby | Y | pos | 0.2706 | 99.5 | 0.62 | 4.01 | 4.8 | 1797 | 0.1587 | 110.6 | 0.28 | 1.93 | 6.8 | 1250 |
| CRAM | Y | pos | 0.2504 | 92.1 | 0.58 | 3.71 | 5.0 | 1514 | 0.1676 | 105.9 | 0.27 | 2.17 | 7.9 | 898 |
The compressed file size is broken down into bits per sequence identifier, per base call and per quality value. Some rows cover data that had previously been mapped against a reference even though the reference was not used during compression (e.g. BAM). The ID, Base and Qual columns are the number of bits required to store the complete sequence identifier, a single base nucleotide and a single quality value respectively. The C.R. column is the compression rate in MB per second. Mem is the amount of memory required during compression. References used were human hg19 and C. elegans WS233. Non-reference based Quip used the “-a” assembly option for high compression mode.
Goby does not store unmapped data. The Goby figures have been estimated by adding 2 bits per absent base-call and scaling up the name and quality figures by the percentage of unmapped reads.
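The per-component bit counts can be tied back to the overall Ratio column. The sketch below assumes fixed-length reads and takes the sequence counts, read lengths and FASTQ file sizes from the data set table. The function name is hypothetical, and the result is only an estimate because the published figures are rounded.

```python
def estimated_ratio(num_seqs, read_length, id_bits, base_bits, qual_bits,
                    fastq_bytes):
    """Estimate the compression ratio from per-component bit costs.

    Each read costs id_bits for its identifier plus base_bits and
    qual_bits for every one of its read_length positions.
    """
    bits_per_read = id_bits + read_length * (base_bits + qual_bits)
    compressed_bytes = num_seqs * bits_per_read / 8
    return compressed_bytes / fastq_bytes


# Fastqz, no reference, ID order, SRR027520_1 (table reports ratio 0.2195).
fastqz_est = estimated_ratio(24_246_685, 76, 11.7, 1.71, 2.96, 5_055_253_238)

# Fqzcomp(medium), no reference, ID order, SRR065390_1 (table reports 0.1524).
fqzcomp_est = estimated_ratio(33_808_546, 100, 14.8, 1.52, 1.52, 8_819_496_191)

print(f"Fastqz SRR027520_1 estimate: {fastqz_est:.4f}")
print(f"Fqzcomp(medium) SRR065390_1 estimate: {fqzcomp_est:.4f}")
```

Both estimates land within about 0.001 of the reported ratios, which suggests the three component columns account for essentially the whole compressed file.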
Fastqz alignment benchmarks.
| Data type | Mode | Unaligned size | Aligned size |
| Identifier | fast | 251,697,610 | 251,697,610 |
| Alignment | fast | n/a | 183,313,663 |
| Sequence | fast | 639,049,273 | 49,174,693 |
| Quality | fast | 867,178,255 | 867,178,255 |
| Total | fast | 1,757,925,138 | 1,351,364,221 |
| Identifier | slow | 47,861,283 | 47,861,283 |
| Alignment | slow | n/a | 105,063,319 |
| Sequence | slow | 503,239,070 | 30,852,888 |
| Quality | slow | 574,112,937 | 574,112,937 |
| Total | slow | 1,125,213,290 | 757,890,427 |
Size of data components in the public SequenceSqueeze test set SRR062634 (6,345,444,769 bytes uncompressed).
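The totals in this table can be checked by summing the components, and dividing by the uncompressed size gives the overall ratio for each configuration. The following is a small sketch of that arithmetic using only the figures above; the variable names are illustrative.

```python
UNCOMPRESSED_BYTES = 6_345_444_769  # SRR062634, from the caption

# (mode, alignment) -> component sizes in bytes, copied from the table.
SIZES = {
    ("fast", "unaligned"): {"Identifier": 251_697_610,
                            "Sequence": 639_049_273,
                            "Quality": 867_178_255},
    ("fast", "aligned"): {"Identifier": 251_697_610,
                          "Alignment": 183_313_663,
                          "Sequence": 49_174_693,
                          "Quality": 867_178_255},
    ("slow", "unaligned"): {"Identifier": 47_861_283,
                            "Sequence": 503_239_070,
                            "Quality": 574_112_937},
    ("slow", "aligned"): {"Identifier": 47_861_283,
                          "Alignment": 105_063_319,
                          "Sequence": 30_852_888,
                          "Quality": 574_112_937},
}

# Totals as printed in the table.
REPORTED_TOTALS = {
    ("fast", "unaligned"): 1_757_925_138,
    ("fast", "aligned"): 1_351_364_221,
    ("slow", "unaligned"): 1_125_213_290,
    ("slow", "aligned"): 757_890_427,
}

totals = {key: sum(parts.values()) for key, parts in SIZES.items()}
ratios = {key: total / UNCOMPRESSED_BYTES for key, total in totals.items()}

for key, total in totals.items():
    status = "matches" if total == REPORTED_TOTALS[key] else "differs from"
    print(f"{key}: {total:,} bytes ({status} table), ratio {ratios[key]:.4f}")
```

In slow mode the aligned Alignment and Sequence components together take 135,916,207 bytes, against 503,239,070 bytes for the unaligned sequence data, while the identifier and quality sizes are unchanged by alignment.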