Abstract
The growing volume of generated DNA sequencing data makes the problem of its long-term storage increasingly important. In this work we present ReCoil, an I/O-efficient external memory algorithm designed for compression of very large collections of short-read DNA data. Typically each position of a DNA sequence is covered by multiple reads of a short-read dataset, and our algorithm makes use of the resulting redundancy to achieve a high compression rate.

While compression based on encoding mismatches between the dataset and a similar reference can yield a high compression rate, a good-quality reference sequence may be unavailable. Instead, ReCoil's compression is based on encoding the differences between similar or overlapping reads. As such reads may appear at large distances from each other in the dataset, and since random access memory is a limited resource, ReCoil is designed to work efficiently in external memory, leveraging the high bandwidth of modern hard disk drives.
Year: 2011 PMID: 21988957 PMCID: PMC3219593 DOI: 10.1186/1748-7188-6-23
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
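The abstract's central idea, storing a read as its differences from a similar read, can be illustrated with a small sketch. This is not ReCoil's actual encoding (the record refers to Section 2.4 for that, which is not reproduced here); it is a generic copy/literal scheme built on Python's `difflib`, and the function names `encode_against` and `decode_against` are hypothetical.

```python
from difflib import SequenceMatcher


def encode_against(reference: str, read: str) -> list:
    """Describe `read` as copy/literal operations relative to `reference`.

    Illustrative only: a generic edit script, not ReCoil's encoding.
    """
    ops = []
    matcher = SequenceMatcher(a=reference, b=read, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # A run shared with the reference: store only position and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bases that do not come from the reference are stored literally.
            ops.append(("lit", read[j1:j2]))
        # "delete": reference bases absent from the read; nothing to emit.
    return ops


def decode_against(reference: str, ops: list) -> str:
    """Rebuild the read from the reference and the operations."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            parts.append(reference[start:start + length])
        else:
            parts.append(op[1])
    return "".join(parts)


reference = "ACGTACGTTAGCCGATTACA"
read = "GTACGATAGCCGATTACAGG"  # overlaps the reference, with one mismatch
ops = encode_against(reference, read)
restored = decode_against(reference, ops)
```

When two reads overlap heavily, the operation list is dominated by a few long `copy` entries, which is the redundancy that a difference-based compressor exploits.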
Figure 1. Top-level illustration of the algorithm of Section 2.3 for construction of the . To simplify the illustration we do not consider here the reverse complement reads and the filtering of Step 2. The encoding that corresponds to the tree can be found in Section 2.4.
Comparison of compressed file size, runtime and compression ratios of ReCoil to Coil, 7-zip and bzip2.
| Read length | Original size (MB) | ReCoil size (MB) | ReCoil time (min) | ReCoil ratio | Coil size (MB) | Coil time (min) | Coil ratio | 7-zip size (MB) | 7-zip time (min) | 7-zip ratio | bzip2 size (MB) | bzip2 time (min) | bzip2 ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | 6912 | 1180 | 840 | 0.17 | NA | NA | NA | 1900 | 300 | 0.27 | 2250 | 45 | 0.36 |
| 70 | 1800 | 326 | 290 | 0.18 | 450 | 650 | 0.25 | 412 | 78 | 0.23 | 483 | 11 | 0.27 |
| 100 | 1800 | 278 | 246 | 0.15 | 415 | 625 | 0.23 | 405 | 76 | 0.22 | 481 | 11 | 0.27 |
| 120 | 1800 | 241 | 198 | 0.13 | 387 | 590 | 0.21 | 403 | 77 | 0.22 | 480 | 11 | 0.27 |
File sizes are shown in megabytes and run times in minutes. ReCoil, Coil and 7-zip were given 2 GB of RAM, while bzip2 was run with compression-level = 9. Both ReCoil and Coil used 7-zip as a post-processing compression step, and this step was included in the timings. Decoding time for ReCoil was less than 5 minutes for all datasets, including the 7-zip uncompress step.
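As a reading aid for the table, the ratio values appear to be compressed size divided by original size. That interpretation is an assumption, since the record does not define the ratio, but it reproduces the ReCoil figures when rounded to two decimals. The helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(compressed_mb: float, original_mb: float) -> float:
    """Compressed size over original size, rounded as in the table (assumed definition)."""
    return round(compressed_mb / original_mb, 2)


# (original size in MB, ReCoil compressed size in MB), one pair per table row.
recoil_rows = [(6912, 1180), (1800, 326), (1800, 278), (1800, 241)]
recoil_ratios = [compression_ratio(c, o) for o, c in recoil_rows]
```

Under this definition a smaller ratio means better compression.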
Figure 2. Comparison of compression with ReCoil, Coil, 7-zip and bzip2 for various coverages. The simulated datasets were generated by making random samples of length 70 from Human Chromosome 14, adding single-nucleotide errors (insertions, deletions or substitutions) with probability 0.02 and reverse complementing each read with probability 0.5.
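The simulation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' generator: it assumes the error probability applies independently per base, that the three error types are equally likely, and it uses a random string in place of Human Chromosome 14. The names `simulate_reads`, `add_errors` and `reads_for_coverage` are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def add_errors(read: str, error_rate: float, rng: random.Random) -> str:
    """Apply a single-nucleotide error at each base with probability `error_rate`."""
    out = []
    for base in read:
        if rng.random() >= error_rate:
            out.append(base)  # base kept unchanged
            continue
        # Assumption: the three error types are equally likely.
        kind = rng.choice(("substitution", "insertion", "deletion"))
        if kind == "substitution":
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif kind == "insertion":
            out.append(base)
            out.append(rng.choice("ACGT"))
        # deletion: the base is dropped
    return "".join(out)


def simulate_reads(genome, n_reads, read_len=70, error_rate=0.02,
                   rc_prob=0.5, seed=0):
    """Sample reads uniformly, add errors, and reverse complement some of them."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = add_errors(genome[start:start + read_len], error_rate, rng)
        if rng.random() < rc_prob:
            read = reverse_complement(read)
        reads.append(read)
    return reads


def reads_for_coverage(genome_len: int, coverage: float, read_len: int = 70) -> int:
    """Number of reads needed so that each position is covered `coverage` times on average."""
    return int(coverage * genome_len / read_len)


# Stand-in genome (assumption): random bases instead of a real chromosome.
genome = "".join(random.Random(1).choice("ACGT") for _ in range(5000))
reads = simulate_reads(genome, reads_for_coverage(len(genome), coverage=10))
```

Varying the `coverage` argument changes how many reads overlap each position, which is the quantity varied in the figure.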