Sebastian Deorowicz, Szymon Grabowski.
Abstract
Post-Sanger sequencing methods produce tons of data, and there is general agreement that the challenge of storing and processing them must be addressed with data compression. In this review we first answer the question "why compression" in a quantitative manner. We then answer the questions "what" and "how" by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we return to the question "why compression" and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.
Year: 2013 PMID: 24252160 PMCID: PMC3868316 DOI: 10.1186/1748-7188-8-25
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Trends in storage, transfer, and sequencing costs. The historic costs of low-end hard disk drives were taken from http://www.jcmit.com/diskprice.htm. They had been halving every 12 months in the 1990s and around 2000–2004. Then the halving time lengthened suddenly, to about 25 months. The real costs of sequencing, taken from the NHGRI Web page [5], reflect not only reagent costs, as in some studies, but also labor, administration, amortization of sequencing instruments, submission of data to a public database, etc. The significant change in sequencing costs around 2008 was caused by the popularization of second-generation technologies. The prices of Amazon storage and transfer reflect real market offers from the top data centers. Interestingly, storage costs at data centers drop very slowly, mainly because the cost of blank hard disks is only a part of the total cost of maintenance. The curves were not corrected for inflation.
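To make the two halving periods quoted in the caption concrete, the sketch below assumes a plain exponential decay model, cost(t) = c0 · 2^(−t/T), where T is the halving period in months. The function name `cost_after`, the unit starting price, and the five-year horizon are illustrative assumptions and are not taken from the figure data.

```python
def cost_after(initial_cost: float, months: float, halving_months: float) -> float:
    """Cost after `months`, assuming it halves every `halving_months` months."""
    return initial_cost * 0.5 ** (months / halving_months)


# Hypothetical starting price of 1.0 (arbitrary units), five-year horizon.
fast = cost_after(1.0, 60, 12)  # halving every 12 months
slow = cost_after(1.0, 60, 25)  # halving every 25 months

print(f"12-month halving, after 5 years: {fast:.4f}")
print(f"25-month halving, after 5 years: {slow:.4f}")
```

Under this model, five years of 12-month halving cut the price 32-fold, while 25-month halving cuts it only a little over five-fold.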
Summary of the most important compressors of sequencing data
| Name | src / exe / libs | | Ambig. codes | | | | | Methods | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| gzip | C++ / many / many | yes / no | yes | yes | moderate / very high | low | no | LZ, Huf | |
| bzip2 | C / many / many | yes / no | yes | yes | low / high | low | no | BWT, Huf | |
| 7zip | C, C++ / many / many | yes / no | yes | yes | low / very high | moderate | no | LZ, AC | |
| BWT-SAP | C++ / – / C++ | yes / no | yes | no | low / low | moderate | no | BWT, PPM | FASTA only |
| DSRC | C++ / Lin, Win / C++, Pyt | yes / no | yes | yes | high / high | moderate | yes | LZ, Huf | |
| Fqzcomp | C / Lin / – | yes / yes | no | yes | high / moderate | high | no | CM | |
| G-SQZ | C++ / Lin / – | yes / no | no | no | high / moderate | low | yes | Huf | |
| Kung-FQ | C# / Win / – | yes / yes | no | no | moderate / moderate | moderate | no | AC, LZ, RLE | |
| Quip | C / – / – | yes / no | no | no | high / high | high | no | M. models, AC | |
| ReCoil | C++ / – / C++ | yes / no | no | no | very low / high | moderate | no | BWT, PPM | FASTA only |
| SCALCE+gzip | C++ / – / – | yes / yes | no | no | moderate / high | moderate | no | AC, LZ, Huf | |
| Seq-DB | C++ / – / – | yes / yes | no | no | very high / very high | low | yes | AC, LZ, RLE | |
| SeqSqueeze1 | C / Lin / – | yes / no | no | yes | very low / very low | high | no | CM | |
| gzip | C++ / many / many | yes / no | yes | N/A | low / very high | low | no | LZ, Huf | |
| bzip2 | C / many / many | yes / no | yes | N/A | low / high | low | no | BWT, Huf | |
| 7z | C, C++ / many / many | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC | |
| BAM | C++ / many / many | yes / no | yes | N/A | moderate / high | moderate | yes | LZ, Huf | |
| CRAM | Java / many / Java | yes / yes | yes | N/A | moderate / moderate | moderate | yes | Huf, Gol, diff. | |
| Quip | C / – / – | yes / no | no | N/A | high / high | high | no | M. models, AC | |
| SAMZIP+rar | C / – / – | yes / no | yes | N/A | moderate / high | moderate | no | RLE, LZ, Huf | |
| gzip | C++ / many / many | yes / no | yes | N/A | moderate / very high | low | no | LZ, Huf | |
| bzip2 | C / many / many | yes / no | yes | N/A | low / high | low | no | BWT, Huf | |
| 7z | C, C++ / many / many | yes / no | yes | N/A | low / very high | moderate | no | LZ, AC | |
| dna3 | C / – / – | yes / no | no | N/A | low / low | moderate | no | LZ, PPM | |
| FCM-M | C / – / – | yes / no | no | N/A | very low / very low | moderate | no | M. models | |
| XM | Java / many / Java | yes / no | yes | N/A | very low / very low | moderate | no | M. models, AC | |
| gzip | C++ / many / many | yes / no | yes | N/A | low / very high | very low | no | LZ, Huf | |
| bzip2 | C / many / many | yes / no | yes | N/A | low / high | very low | no | BWT, Huf | |
| 7z | C, C++ / many / many | yes / no | yes | N/A | low / very high | high | no | LZ, AC | chr-ordered |
| ABRC | C++ / Lin, Win / C++ | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf | |
| GDC | C++ / Lin, Win / C++ | yes / no | yes | N/A | high / very high | very high | yes | LZ, Huf | |
| GReEn | C / – / – | yes / no | yes | N/A | high / high | high | no | M. models, AC | |
| GRS | C / Lin / – | yes / no | yes | N/A | moderate / low | high | no | LCS, Huf | |
| RLZ | C++ / – / – | yes / no | yes | N/A | moderate / very high | high | no | LZ, Gol | |
Abbreviations used in the table: src: source code; libs: libraries; Lin: Linux; Win: Windows; Pyt: Python; exe: binary executables; AC: arithmetic coding (a statistical coding method [12]); CM: context mixing for arithmetic coding [12]; diff.: differential coding (store only the changes between sequences); Gol: Golomb (a statistical coding method [12]); Huf: Huffman; LCS: longest common subsequence (a measure of similarity of sequences [43]); LZ: an algorithm from the Ziv–Lempel family; M. models: Markov models [12]; PPM: prediction by partial matching (an efficient general-purpose compressor [12]). “Ambig. codes” means the ability to compress DNA symbols other than {A, C, G, T, N}. “chr-ordered” for 7z and genome collections means that the input (human) genomes were split into chromosomes and ordered by chromosome before the actual compression. In this way several chromosomes fit into the 7z LZ buffer, which benefits compression.
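The effect of the “chr-ordered” arrangement can be illustrated on toy data. The sketch below is not the setup used in the review: it substitutes Python's `lzma` module for 7z and uses synthetic random "chromosomes" with a deliberately small LZ dictionary. All sizes, the mutation rate, and the helper names `mutate` and `packed_size` are assumptions chosen so that the dictionary holds two toy chromosomes but not a whole toy genome.

```python
import lzma
import random

random.seed(0)  # fixed seed so the toy data is reproducible

N_GENOMES, N_CHROMS, CHROM_LEN = 3, 4, 50_000  # assumed toy sizes
DICT_SIZE = 1 << 17  # 128 KiB LZ dictionary, smaller than one 200 kB toy genome


def mutate(seq: str, rate: float = 0.01) -> str:
    """Return a copy of seq with up to `rate` of its positions substituted."""
    bases = list(seq)
    for i in random.sample(range(len(bases)), int(len(bases) * rate)):
        bases[i] = random.choice("ACGT")
    return "".join(bases)


# One random reference genome plus slightly mutated copies of it.
reference = ["".join(random.choices("ACGT", k=CHROM_LEN)) for _ in range(N_CHROMS)]
genomes = [reference] + [[mutate(c) for c in reference] for _ in range(N_GENOMES - 1)]


def packed_size(chunks) -> int:
    """Compressed size in bytes of the concatenated chunks."""
    data = "".join(chunks).encode("ascii")
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": DICT_SIZE}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# Genome order: all chromosomes of genome 1, then all of genome 2, and so on.
genome_ordered = [c for g in genomes for c in g]
# Chromosome order: chromosome 1 of every genome, then chromosome 2, and so on.
chr_ordered = [g[k] for k in range(N_CHROMS) for g in genomes]

genome_size = packed_size(genome_ordered)
chr_size = packed_size(chr_ordered)
print(f"genome-ordered: {genome_size} bytes")
print(f"chr-ordered:    {chr_size} bytes")
```

In genome order, the previous copy of each chromosome lies a full genome length back, outside the dictionary, so the compressor cannot reference it. In chromosome order the previous copy sits immediately before the current one and can be matched, which is why the second figure should come out clearly smaller.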