Abstract
While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as metagenomes. In this article, I propose the Differential Direct Coding algorithm, a general-purpose nucleotide compression protocol that can differentiate between sequence data and auxiliary data by supporting the inclusion of supplementary symbols that are not members of the set of expected nucleotide bases, thereby offering reconciliation between sequence-specific and general-purpose compression strategies. This algorithm permits a sequence to contain a rich lexicon of auxiliary symbols that can represent wildcards, annotation data and special subsequences, such as functional domains or special repeats. In particular, the representation of special subsequences can be incorporated to provide structure-based coding that increases the overall degree of compression. Moreover, supporting a robust set of symbols removes the requirement of wildcard elimination and restoration phases, resulting in a complexity of O(n) for execution time, making this algorithm suitable for very large data sets. Because this algorithm compresses data on the basis of triplets, it is highly amenable to interpretation as a polypeptide at decompression time. Also, an encoded sequence may be further compressed using other existing algorithms, such as gzip, thereby maximizing the final degree of compression. Overall, the Differential Direct Coding algorithm can offer a beneficial impact on disk traffic for database queries and other disk-intensive operations.
Year: 2009 PMID: 20157486 PMCID: PMC2797453 DOI: 10.1093/database/bap013
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
The 2D data model
| Type | Description | Range | Compressible |
|---|---|---|---|
| Auxiliary | ASCII | 0 to 127 | No |
| Sequence | Triplet | −1 to −125 | Yes |
| Unknown | ? | −128 | No |
For sequence data, auxiliary data and unknown values, the range of byte values is listed, along with whether the data are compressed.
Figure 1. The 2D byte coding schema. The seven least significant bits encode the data. The most significant bit is a flag that marks the byte as either compressed or uncompressed data.
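The byte ranges in the data-model table and the flag bit described in Figure 1 can be illustrated with a minimal Python sketch. The function names `to_signed` and `classify` and the label `"reserved"` for the two values the table leaves unassigned (−126 and −127) are hypothetical choices made for illustration; they are not taken from the article.

```python
def to_signed(byte: int) -> int:
    """Interpret an unsigned byte (0..255) as a signed 8-bit value (-128..127)."""
    return byte - 256 if byte >= 128 else byte


def classify(byte: int) -> str:
    """Classify one encoded byte using the ranges from the data-model table."""
    value = to_signed(byte)
    if value >= 0:
        # Most significant bit is clear: plain 7-bit ASCII, stored uncompressed.
        return "auxiliary"
    if value == -128:
        # Single value set aside for an unknown symbol.
        return "unknown"
    if -125 <= value <= -1:
        # Most significant bit is set: the byte stands for a compressed triplet.
        return "sequence"
    # -126 and -127 are not assigned in the table (label is an assumption).
    return "reserved"
```

For example, `classify(ord("N"))` falls in the auxiliary range, while `classify(0xFF)` (signed value −1) falls in the triplet range.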
The 2D encoding process
| Step | Input sequence | Triplet | Uncompress count | Encoded sequence |
|---|---|---|---|---|
| 0 | ACTCNTGAGA | Empty | 0 | Empty |
| 1 | CTCNTGAGA | A | 0 | Empty |
| 2 | TCNTGAGA | AC | 0 | Empty |
| 3a | CNTGAGA | ACT | 0 | Empty |
| 3b | CNTGAGA | Empty | 0 | ∼ |
| 4 | NTGAGA | C | 0 | ∼ |
| 5a | TGAGA | Empty | 0 | ∼C |
| 5b | TGAGA | Empty | 0 | ∼CN |
| 6 | GAGA | Empty | 1 | ∼CNT |
| 7 | AGA | G | 0 | ∼CNT |
| 8 | GA | GA | 0 | ∼CNT |
| 9a | A | GAG | 0 | ∼CNT |
| 9b | A | Empty | 0 | ∼CNTÀ |
| 10 | Empty | A | 0 | ∼CNTÀA |
An example of the encoding process is given for the sequence ACTCNTGAGA, which contains the auxiliary symbol N. Each step shows the remaining input symbols, the symbols cached in the triplet structure, the value of the uncompress count (a variable that offsets compression after an auxiliary symbol occurs) and the encoded sequence.
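The walkthrough above can be approximated by the following Python sketch. It is not the article's implementation, and several details are assumptions: the five-symbol alphabet `ACGTU` (chosen only because 5³ = 125 matches the size of the triplet range), the assignment of codes to triplets, and the rule that sets `uncompress_count` so that the current frame of three symbols is finished uncompressed after an auxiliary symbol. That rule is inferred from this single worked example. The names `CODE`, `TRIPLET_OF`, `encode` and `decode` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGTU"  # assumption: five symbols, so 5**3 = 125 triplet codes
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]
# Hypothetical code assignment: triplet i is stored as unsigned byte 255 - i,
# i.e. signed values -1 down to -125, so the most significant bit is always set.
CODE = {t: 255 - i for i, t in enumerate(TRIPLETS)}
TRIPLET_OF = {b: t for t, b in CODE.items()}


def encode(seq: str) -> bytes:
    """Pack complete triplets into one byte each; keep everything else as ASCII."""
    out = bytearray()
    triplet = ""           # bases cached until a full triplet is available
    uncompress_count = 0   # bases still to be written uncompressed
    for ch in seq:
        if ord(ch) > 127:
            raise ValueError("only 7-bit ASCII input is supported in this sketch")
        if ch not in ALPHABET:
            # Auxiliary symbol: flush the partial triplet raw, then the symbol.
            out += triplet.encode("ascii")
            out.append(ord(ch))
            # Assumption: finish the current frame of three uncompressed.
            uncompress_count = 2 - len(triplet)
            triplet = ""
        elif uncompress_count > 0:
            out.append(ord(ch))
            uncompress_count -= 1
        else:
            triplet += ch
            if len(triplet) == 3:
                out.append(CODE[triplet])
                triplet = ""
    out += triplet.encode("ascii")  # leftover bases stay uncompressed
    return bytes(out)


def decode(data: bytes) -> str:
    """Expand flagged bytes back to triplets; copy ASCII bytes unchanged."""
    return "".join(TRIPLET_OF[b] if b >= 128 else chr(b) for b in data)


encoded = encode("ACTCNTGAGA")
restored = decode(encoded)
```

Under these assumptions the ten-symbol input becomes six bytes: one triplet code for ACT, the raw symbols C, N and T, one triplet code for GAG, and a trailing raw A, which mirrors the final row of the table. Because the decoder only inspects the flag bit, the round trip does not depend on the particular rule chosen for `uncompress_count`.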
Genomic compression benchmarking
| Compression method | Genome 1 size (bytes) | Genome 1 ratio | Genome 1 time (ms) | Genome 2 size (bytes) | Genome 2 ratio | Genome 2 time (ms) | Genome 3 size (bytes) | Genome 3 ratio | Genome 3 time (ms) |
|---|---|---|---|---|---|---|---|---|---|
| None | 4 274 929 | 1.000 | N/A | 4 706 046 | 1.000 | N/A | 588 437 | 1.000 | N/A |
| GenCompress | 0 | Ø | 58 363 756 | 0 | Ø | 27 887 599 | 0 | Ø | 8 127 438 |
| 2D | 1 465 177 | 2.918 | 717.5 | 1 612 930 | 2.918 | 788.9 | 201 721 | 2.917 | 100.5 |
| gzip | 1 300 308 | 3.288 | 1671.3 | 1 431 844 | 3.287 | 1819.4 | 174 398 | 3.374 | 254.5 |
| 2D + gzip | 1 093 657 | 3.909 | 824.9 | 1 214 444 | 3.875 | 891.3 | 145 727 | 4.038 | 182.8 |
Compression data for GenCompress, 2D, gzip and 2D + gzip were obtained using three bacterial genomes. File size, compression ratio and execution time are given for each algorithm with respect to each genome. Execution time is the average result from 100 trials, except for GenCompress, for which the value is the shortest execution time obtained after three consecutive failures.
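The ratio column is consistent with dividing the uncompressed file size by the compressed file size. A short Python check using the first genome's figures from the table follows; the helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Ratio of uncompressed to compressed size; larger means better compression."""
    return original_size / compressed_size


# Sizes in bytes for the first genome, copied from the benchmarking table.
original = 4_274_929
compressed = {"2D": 1_465_177, "gzip": 1_300_308, "2D + gzip": 1_093_657}

ratios = {
    method: round(compression_ratio(original, size), 3)
    for method, size in compressed.items()
}
```

The 2D ratio stays just below 3 because each complete triplet of bases is stored in a single byte, while auxiliary symbols and leftover bases are stored uncompressed.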
Genomic decompression benchmarking
| Source genome | Original size (bytes) | 2D compressed size (bytes) | 2D decompressed size (bytes) | File inflation (bytes) | File inflation (lines) | Decompression time (ms) |
|---|---|---|---|---|---|---|
| Genome 1 | 4 274 929 | 1 465 177 | 4 274 930 | 1 | 1 | 923.9 |
| Genome 2 | 4 706 046 | 1 612 930 | 4 706 047 | 1 | 1 | 1042.3 |
| Genome 3 | 588 437 | 201 721 | 588 438 | 1 | 1 | 116.2 |
Decompression data were obtained using the 2D compressed genomes. File sizes are given for the original source file, the compressed file and the decompressed file, with respect to each genome. The differences between the original and restored sizes are also given, along with the respective execution times. Execution time is the average result from 100 trials.
Metagenomic compression benchmarking
| Compression method | Size (bytes) | Ratio | Time (ms) |
|---|---|---|---|
| None | 962 651 334 | 1.000 | N/A |
| 2D | 419 368 931 | 2.295 | 145 115.0 |
| gzip | 261 995 558 | 3.674 | 315 564.6 |
| bzip2 | 238 973 241 | 4.028 | 301 924.0 |
| 2D + gzip | 220 487 270 | 4.366 | 153 175.8 |
Compression data for 2D, gzip, bzip2 and 2D + gzip were obtained using the Sargasso Sea metagenome. File size, compression ratio and execution time are given for each algorithm. Execution time is the average result from five trials.
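The timing column can also be read as throughput. The following Python sketch derives input throughput and the speed-up of 2D + gzip over gzip alone from the table values. The helper name `throughput_mb_per_s` and the use of 10⁶ bytes per megabyte are choices made here for illustration.

```python
INPUT_SIZE = 962_651_334  # bytes, uncompressed Sargasso Sea metagenome

# Execution times in milliseconds, copied from the table.
times_ms = {
    "2D": 145_115.0,
    "gzip": 315_564.6,
    "bzip2": 301_924.0,
    "2D + gzip": 153_175.8,
}


def throughput_mb_per_s(size_bytes: int, time_ms: float) -> float:
    """Input bytes processed per second, expressed in units of 10**6 bytes."""
    return size_bytes / (time_ms / 1000.0) / 1e6


throughput = {m: throughput_mb_per_s(INPUT_SIZE, t) for m, t in times_ms.items()}
speedup_vs_gzip = times_ms["gzip"] / times_ms["2D + gzip"]
```

On these figures, running 2D before gzip takes roughly half the time of gzip alone, presumably because gzip then operates on a much smaller input.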