Kenny Daily, Paul Rigor, Scott Christley, Xiaohui Xie, Pierre Baldi.
Abstract
BACKGROUND: High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage and sharing of HTS data.
Year: 2010 PMID: 20946637 PMCID: PMC2964686 DOI: 10.1186/1471-2105-11-514
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Example of Elias Gamma (EG) Coding.
| Number | Encoding |
|---|---|
| 1 | 1 |
| 2-3 | 01x |
| 4-7 | 001xx |
| 8-15 | 0001xxx |
| 16-31 | 00001xxxx |
Each integer j is encoded by concatenating ⌊log₂ j⌋ 0's with the binary value of j.
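The rule in the caption above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function names `elias_gamma_encode` and `elias_gamma_decode` are hypothetical, and bit strings are used in place of packed bits for readability.

```python
def elias_gamma_encode(j):
    """Return the Elias gamma code of a positive integer as a bit string."""
    if j < 1:
        raise ValueError("Elias gamma coding is defined for integers >= 1")
    binary = bin(j)[2:]  # binary value of j; always starts with a 1-bit
    # floor(log2 j) zeros (one fewer than the binary length), then j itself
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while i < len(bits) and bits[i] == "0":  # count the zero prefix
            zeros += 1
            i += 1
        if i + zeros + 1 > len(bits):
            raise ValueError("truncated Elias gamma stream")
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values
```

For instance, 9 lies in the 8-15 row of the table, so its code has three leading 0's followed by the four-bit binary value 1001.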
Example of Monotone Value (MOV) Coding.
| Number | Encoding |
|---|---|
| 1 | 1 |
| 2 | 10 |
| 3 | 11 |
| 9 | 1001 |
| 14 | 1110 |
| 26 | 11010 |
| 29 | 11101 |
The principle is illustrated using the vector of addresses (1, 2, 3, 9, 14, 26, 29). Each integer j is converted to its binary representation, which has length ⌊log₂ j⌋ + 1 and begins with a 1-bit. 0-bits are inserted between two consecutive integers only when the length (scale) increases, and the number of 0-bits equals the increase in length. The final encoding of the vector is 1 0 10 11 00 1001 1110 0 11010 11101.
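A small Python sketch of the MOV scheme described in the caption above may make the role of the separating 0-bits clearer. It is an illustration under stated assumptions rather than the authors' code: the names `mov_encode` and `mov_decode` are hypothetical, the input is assumed to be a non-decreasing sequence of positive integers, and the scale is assumed to start at one bit.

```python
def mov_encode(addresses):
    """Encode a non-decreasing sequence of positive integers as a bit string."""
    pieces = []
    prev_len = 1  # assumption: the scale starts at one bit
    prev = 0
    for j in addresses:
        if j < 1 or j < prev:
            raise ValueError("expected non-decreasing positive integers")
        binary = bin(j)[2:]  # starts with a 1-bit
        # one 0-bit per unit of increase in binary length (none if unchanged)
        pieces.append("0" * (len(binary) - prev_len))
        pieces.append(binary)
        prev_len, prev = len(binary), j
    return "".join(pieces)


def mov_decode(bits):
    """Invert mov_encode: recover the integers from the bit string."""
    values, length, i = [], 1, 0
    while i < len(bits):
        # every value starts with 1, so leading 0-bits can only mean
        # that the current scale grows by one bit each
        while i < len(bits) and bits[i] == "0":
            length += 1
            i += 1
        if i + length > len(bits):
            raise ValueError("truncated MOV stream")
        values.append(int(bits[i:i + length], 2))
        i += length
    return values
```

Decoding works because every value begins with a 1-bit: a 0-bit seen where a new value should start can only signal a longer scale.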
Statistics of Three High-Throughput Data Sets
| | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|
| Reads (× 10⁶) | 6.4 | 1.7 | 31 |
| Read length | 19 | 25 | 23-44 |
| Coverage | Very sparse | Sparse | Full |
| File sizes | |||
| Raw Sequence | 1,030,333,440 | 353,181,920 | 8,869,613,392 |
| Uniform | 912,352,288 | 252,540,968 | 4,946,059,912 |
| Location | 743,517,128 | 226,557,032 | 4,232,120,216 |
| Mismatches | 168,835,160 | 25,983,936 | 713,939,696 |
| Bowtie | 3,145,664,248 | 902,954,872 | 19,475,952,512 |
| Bowtie Extra Fields | |||
| gzip | 50,382,904 | 106,576,328 | 839,247,848 |
| 7zip | 36,306,064 | 93,238,688 | 778,347,264 |
Figure 1. Dataset 2 Nucleotide Substitutions. Distribution of nucleotide substitutions at each read position in Dataset 2. The shading of the bar indicates which nucleotide was present in the read.
Compression Algorithm Results on Three High-Throughput Data Sets
| | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|
| Standalone Methods | |||
| Read Length | 6,439,584 | 1,697,990 | 59,267,219 |
| Chromosome | 31,576,860 | 9,997,062 | 31,118,531 |
| Strand | 6,439,584 | 1,697,990 | 31,118,531 |
| # Mismatches | 12,382,598 | 2,499,664 | 55,624,291 |
| Start Location | |||
| MOV† | 121,565,953 | 44,200,254 | 787,554,494 |
| EG† | 236,691,716 | 86,701,276 | 1,543,990,407 |
| REG† | 10,745,562 | 26,180,752 | 76,430,489 |
| Huffman | 91,019,189 | 82,444,521 | 1,324,964,740 |
| RHuffman | |||
| Combined Methods | |||
| (C,S,M) Lookup | 64,424,309 | 33,809,380 | 158,272,463 |
| REG Indexed† | |||
| Mismatches | |||
| Nucleotide | 13,917,023 | 1,307,870 | 53,441,350 |
| From Start | 30,028,807 | 4,177,576 | 159,433,004 |
| From End | 32,671,455 | 2,333,372 | 153,865,294 |
| Total Start | 5,485,446 | 212,874,354 | |
| Total End | 46,588,478 | 207,306,644 | |
| Combined† | 44,033,309 | 3,757,400 | |
Comparison of Compression Results
| | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|
| Original Data Sizes | |||
| Raw Sequence | 1,030,333,440 | 353,181,920 | 8,869,613,392 |
| Uniform | 912,352,288 | 252,540,968 | 4,946,059,912 |
| Bowtie | 3,145,664,248 | 902,954,872 | 19,475,952,512 |
| Bowtie Extra Fields (7zip) | 36,306,064 | 93,238,688 | 778,347,264 |
| Best Compression | 56,078,940 | 35,983,322 | 390,541,330 |
| Raw Sequence | 18 | 10 | 23 |
| Uniform | 16 | 7 | 13 |
| Bowtie | 56 | 25 | 49 |
| Bowtie+ | 34 | 7 | 17 |
| GenCompress | 56,166,419 | 36,099,480 | 390,541,330 |
| Raw Sequence | 18 | 9 | 23 |
| Uniform | 16 | 7 | 13 |
| Bowtie | 56 | 25 | 49 |
| Bowtie+ | 34 | 7 | 17 |
| gzip | |||
| Raw Sequence | 41,378,624 | 95,688,992 | 618,818,824 |
| | 24 | 3 | 14 |
| Uniform | 42,918,256 | 54,762,528 | 603,836,784 |
| | 21 | 4 | 8 |
| Bowtie | 459,640,264 | 236,156,432 | 1,640,587,416 |
| | 7 | 4 | 12 |
| bzip2 | |||
| Raw Sequence | 42,233,336 | 94,030,320 | 955,061,616 |
| | 24 | 3 | 9 |
| Uniform | 36,400,576 | 54,656,000 | 649,419,632 |
| | 25 | 4 | 7 |
| Bowtie | 250,373,616 | 171,835,792 | 1,609,317,768 |
| | 13 | 5 | 12 |
| 7zip | |||
| Raw Sequence | 30,651,664 | 83,319,584 | 411,811,520 |
| | 33 | 4 | 21 |
| Uniform | 27,852,952 | 34,482,312 | 283,490,928 |
| | 33 | 7 | 17 |
| Bowtie | 247,481,992 | 183,522,960 | 1,254,167,144 |
| | 13 | 5 | 16 |
Comparison of Compression and Decompression Timing
| | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|
| Compression (sec) | |||
| GenCompress | 20 | 5 | 111 |
| gzip | 10 | 13 | 70 |
| bzip2 | 78 | 20 | 422 |
| 7zip | 107 | 77 | 447 |
| Decompression (sec) | |||
| GenCompress | 2 | 1 | 15 |
| gzip | 2 | 1 | 13 |
| bzip2 | 7 | 4 | 53 |
| 7zip | 4 | 2 | 21 |
Sequence encoding using Huffman Trees.
| Dataset | k | Sequence bits | Tree bits | Total bits |
|---|---|---|---|---|
| 1 | 1 | 31,674,558 | 40 | 31,674,598 |
| | 2 | 28,340,409 | 324 | 28,340,733 |
| | 3 | 27,708,166 | 1,951 | 27,710,117 |
| | 4 | 22,565,417 | 10,471 | 22,575,888 |
| | 5 | 19,126,288 | 53,178 | 19,179,466 |
| | 6 | 21,056,658 | 256,303 | 21,312,961 |
| 2 | 1 | 94,680,841 | 52 | 94,680,893 |
| | 2 | 81,954,644 | 549 | 81,955,193 |
| | 3 | 81,038,827 | 4,303 | 81,043,130 |
| | 4 | 80,554,549 | 27,458 | 80,582,007 |
| | 5 | 83,570,470 | 148,206 | 83,718,676 |
| | 6 | 79,977,714 | 622,784 | 80,600,498 |
The data are preprocessed by counting the frequencies of k-mers, and these counts are used to build a Huffman tree. The tree is then used to encode the data. The table gives the number of bits needed to store the encoded data and the tree.
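The procedure in this caption (count k-mers, build a Huffman tree, measure the encoded size) can be sketched as follows. This is a rough illustration with several assumptions that the source does not confirm: reads are split into non-overlapping k-mers, a shorter trailing fragment is treated as its own symbol, and the cost of storing the tree is left out because its serialization format is not described here. The function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def kmer_counts(reads, k):
    """Count non-overlapping k-mers (assumption) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(0, len(read), k):
            counts[read[i:i + k]] += 1  # last fragment may be shorter than k
    return counts


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman tree over counts."""
    if not counts:
        return {}
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}  # a lone symbol still needs 1 bit
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(freq, next(tie), {sym: 0}) for sym, freq in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # merging two subtrees pushes every symbol in them one level deeper
        merged = {s: d + 1 for s, d in left.items()}
        merged.update({s: d + 1 for s, d in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def sequence_bits(reads, k):
    """Total bits needed to encode the reads with the k-mer Huffman code."""
    counts = kmer_counts(reads, k)
    lengths = huffman_code_lengths(counts)
    return sum(counts[s] * lengths[s] for s in counts)
```

Larger k gives the code more context per symbol but multiplies the number of distinct symbols, which is consistent with the growth of the "Tree bits" column in the table.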