Antonio A Ginart, Joseph Hui, Kaiyuan Zhu, Ibrahim Numanagić, Thomas A Courtade, S Cenk Sahinalp, David N Tse.
Abstract
The most effective genomic data compression methods either assemble reads into contigs, or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that among all methods using this representation (including all de novo assembly based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method approximates well. Our method significantly improves the compression performance of alternatives without compromising speed.
Year: 2018 PMID: 29422526 PMCID: PMC5805770 DOI: 10.1038/s41467-017-02480-6
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
The MPEG HTS benchmarking data set
| Sample | Genome length (Mbp) | Reads (millions) | Read length (bp) |
|---|---|---|---|
| SRR554369 | 6.60 | 1.66 | 100 |
| SRR327342 | 12.14 | 15.04 | 63 |
| MH0001.081026 | N/A | 11.64 | 44 |
| SRR870667 | 335.44 | 69.10 | 108 |
| ERR174310 | 2989.43 | 207.58 | 101 |
| ERP001775 | 2989.43 | 607.56 | 101 |
| Simulated | 335.44 | 65.43 | 108 |
| NA12878 | 2989.43 | 226.11 | 101 |
The HTS read collections from the MPEG HTS benchmarking data set used to evaluate the performance of Assembltrie. ERP001775 is a large data set that combines reads from 18 human individuals; it was downsampled to fit the memory requirements of Assembltrie. NA12878 is not part of the MPEG HTS FASTQ/FASTA benchmarking data set; its reads were extracted from a corresponding BAM file to demonstrate the comparative performance of Assembltrie and Orcom on data where strand correction is not needed. Finally, we used a simulated T. cacao data set instead of the original because of the high error rate observed in the original data set.
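The relationship between genome length, read count, read length, and coverage can be checked with a short calculation. The sketch below assumes that the first numeric column of the table is the genome length in Mbp and the second is the number of reads in millions; these column meanings are an assumption inferred from the values, and the function name `coverage` is hypothetical.

```python
# Values copied from the table above.
# ASSUMPTION: columns are (genome length in Mbp, reads in millions, read length in bp).
SAMPLES = [
    ("SRR554369", 6.60, 1.66, 100),
    ("SRR327342", 12.14, 15.04, 63),
    ("SRR870667", 335.44, 69.10, 108),
    ("ERR174310", 2989.43, 207.58, 101),
]


def coverage(genome_mbp, reads_millions, read_length):
    """Average depth: total sequenced bases divided by genome length.

    The 'millions' and 'Mbp' scale factors cancel, so no unit conversion is needed.
    """
    return reads_millions * read_length / genome_mbp


for name, genome_mbp, reads_millions, read_length in SAMPLES:
    depth = coverage(genome_mbp, reads_millions, read_length)
    print(f"{name}: about {depth:.1f}x")
```

Under these assumptions the computed depths fall close to the coverage values reported alongside the read lengths in the compression-ratio table (25, 80, 20, and 7).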
Comparative compression ratios achieved by Assembltrie on the MPEG HTS benchmarking data set
| Sample | Read length / coverage | Assembltrie | Assembltrie (corrected) | Orcom | BEETL | Mince | k-Path |
|---|---|---|---|---|---|---|---|
| SRR554369 | 100/25 | 0.369 | 0.345 | 0.518 | 1.133 | 0.484 | 0.673 |
| SRR327342 | 63/80 | 0.272 | 0.291 | 0.304 | 0.986 | 0.312 | 0.384 |
| MH0001.081026 | 44/NA | 0.781 | 0.758 | 0.804 | 1.785 | 0.786 | 2.545 |
| SRR870667 | 108/20 | 1.821 | 1.733 | 0.884 | 1.287 | 0.735 | 0.707 |
| ERR174310 | 101/7 | 0.701 | 0.570 | 0.686 | 1.493 | 0.746 | 0.797 |
| ERP001775 | 101/20 | 0.350 | 0.322 | 0.364 | N/A | N/A | N/A |
| Sim. | 108/19 | 0.538 | 0.479 | 0.667 | N/A | N/A | N/A |
| NA12878 | 101/7 | 0.444 | N/A | 0.650 | N/A | N/A | N/A |
Compression rates in bits per base for each software tool (with 8 threads) and each MPEG benchmark sample. The second column gives the read length and coverage; the remaining columns give the compression performance of each software tool. Assembltrie (possibly with the greedy strand correction heuristic) outperforms all of the existing sequence-only compressors, with the level of improvement depending on the read length and coverage, except on the sample SRR870667 from T. cacao (which has an unusually high error rate).
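A bits-per-base figure normalizes the compressed size by the total number of bases in the read collection. The following is a minimal sketch of that normalization, assuming fixed-length reads; the function name `bits_per_base` and the byte count in the example are hypothetical and are not taken from the benchmark.

```python
def bits_per_base(compressed_bytes, num_reads, read_length):
    """Compressed size in bits divided by the total number of bases."""
    total_bases = num_reads * read_length
    return 8 * compressed_bytes / total_bases


# Hypothetical example: 1,000,000 reads of length 100 stored in 4,612,500 bytes.
rate = bits_per_base(4_612_500, 1_000_000, 100)
print(f"{rate:.3f} bits per base")

# For comparison, a plain fixed-width encoding of A/C/G/T costs 2 bits per base.
print(f"{2 / rate:.1f}x smaller than a 2-bit encoding")
```

Lower values are better; a value above 2 means the output is larger than a plain 2-bit encoding of the bases.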
Compression times and memory usage of Assembltrie and Orcom
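| Sample | Assembltrie time (s), without strand correction | Assembltrie time (s), with strand correction | Assembltrie memory (MB), without strand correction | Assembltrie memory (MB), with strand correction | Orcom time (s) | Orcom memory (MB) |
|---|---|---|---|---|---|---|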
| SRR554369 | 23.2 | 30.6 | 1275.9 | 1372.8 | 10.1 | 631.0 |
| SRR327342 | 256.2 | 439.9 | 8553.6 | 7971.8 | 131.0 | 1767.8 |
| MH0001.081026 | 118.5 | 145.7 | 7662.4 | 7223.5 | 49.6 | 1674.2 |
| SRR870667 | 15251.1 | 12219.2 | 50580.9 | 53184.3 | 919.0 | 3150.6 |
| ERR174310 | 31657.8 | 43758.2 | 152719.8 | 168861.1 | 6411.1 | 3439.2 |
| ERP001775 (a) | 22227.0 | 31675.1 | 434425.5 | 473066.5 | 8969.1 | 10803.6 |
Compression time (in seconds) and memory usage (in MB) of Assembltrie and Orcom to generate the compression rates for the MPEG benchmark data set. The results of Assembltrie are given both with (the 3rd and 5th columns) and without (the 2nd and 4th columns) heuristic strand correction
(a) Assembltrie was tested with non-default parameters
Fig. 1 Compression performance of Assembltrie as a function of read coverage. The compression performance (with 8 threads) of Assembltrie in comparison to Orcom on downsampled read collections from SRR327342, as a function of coverage
Decompression times and memory usage of Assembltrie and Orcom
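| Sample | Assembltrie time (s), without strand correction | Assembltrie time (s), with strand correction | Assembltrie memory (MB), without strand correction | Assembltrie memory (MB), with strand correction | Orcom time (s) | Orcom memory (MB) |
|---|---|---|---|---|---|---|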
| SRR554369 | 2.9 | 3.4 | 249.7 | 259.0 | 4.7 | 15.2 |
| SRR327342 | 18.0 | 36.1 | 1558.7 | 1678.9 | 30.8 | 30.2 |
| MH0001.081026 | 13.7 | 15.3 | 1032.4 | 1133.7 | 24.9 | 23.7 |
| SRR870667 | 190.3 | 207.5 | 10500.8 | 11495.9 | 673.6 | 708.6 |
| ERR174310 | 524.1 | 775.8 | 31571.0 | 33555.2 | 707.3 | 561.7 |
| ERP001775 | 758.7 | 860.3 | 86468.1 | 91465.8 | 1487.9 | 1532.5 |
Decompression time (in seconds) and memory usage (in MBs) of Assembltrie and Orcom in single-threaded mode. The results of Assembltrie are given both with (the 3rd and 5th columns) and without (the 2nd and 4th columns) heuristic strand correction
Assembltrie’s compression performance is comparable to our entropy approximation for E. coli read collection
| Sample | Read length / coverage | LZ(G) | Multinomial sampling entropy | Error entropy | Overall entropy approximation | Assembltrie | Orcom |
|---|---|---|---|---|---|---|---|
| DH10B 1 | 120/40 | 0.048 | 0.020 | 0.047 | 0.115 | 0.146 | 0.372 |
| DH10B 2 | 100/40 | 0.048 | 0.025 | 0.053 | 0.126 | 0.163 | 0.399 |
| DH10B 3 | 80/40 | 0.048 | 0.028 | 0.042 | 0.119 | 0.164 | 0.379 |
| DH10B 4 | 100/25 | 0.077 | 0.031 | 0.053 | 0.162 | 0.194 | 0.507 |
| DH10B 5 | 100/80 | 0.024 | 0.017 | 0.053 | 0.094 | 0.141 | 0.292 |
| DH10B 6 | 100/40 | 0.048 | 0.025 | 0.018 | 0.092 | 0.123 | 0.290 |
Entropy approximation for the reads in the E. coli read collections above. LZ(G) denotes the bits per base after compressing each genome with gzip; the three columns that follow give the entropy approximation based on a multinomial sampling of each reference genome, the binary entropy of the error process, and the overall entropy approximation for each read collection. Finally, the last two columns compare the compression results given by Assembltrie (run with a single thread) and Orcom
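The tabulated values are consistent with the overall entropy approximation being the sum of the three terms that precede it, up to rounding in the third decimal place. The check below is a small sketch under that additive reading, which is inferred from the numbers rather than stated in the caption; the names `ROWS` and `largest_gap` are hypothetical.

```python
# Per sample: (LZ(G) term, sampling entropy, error entropy, overall approximation),
# copied from the table above, all in bits per base.
ROWS = {
    "DH10B 1": (0.048, 0.020, 0.047, 0.115),
    "DH10B 2": (0.048, 0.025, 0.053, 0.126),
    "DH10B 3": (0.048, 0.028, 0.042, 0.119),
    "DH10B 4": (0.077, 0.031, 0.053, 0.162),
    "DH10B 5": (0.024, 0.017, 0.053, 0.094),
    "DH10B 6": (0.048, 0.025, 0.018, 0.092),
}


def largest_gap(rows):
    """Largest absolute difference between the sum of the three terms and the overall value."""
    return max(
        abs((genome + sampling + error) - overall)
        for genome, sampling, error, overall in rows.values()
    )


# ASSUMPTION: overall = LZ(G) term + sampling entropy + error entropy.
# Each entry is rounded to three decimals, so gaps of up to about 0.002 are expected.
print(f"largest gap: {largest_gap(ROWS):.4f} bits per base")
```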
Fig. 2 Compression performance of Assembltrie as a function of error rate. The compression ratio achieved by Assembltrie is close to our information-theoretic approximation on simulated reads from the E. coli K-12 DH10B genome (the data set involves 1.8 M “reads,” i.e., substrings of length L = 101, sampled uniformly with simulated errors)
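The dependence on error rate can be related to the binary entropy of the error process mentioned for the E. coli table. The sketch below evaluates the standard binary entropy function, assuming each base is flagged as erroneous independently with probability p; this is an illustrative simplification and not necessarily the exact error model used by the authors.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


# Hypothetical error rates, chosen only to show how the per-base cost grows.
for p in (0.001, 0.005, 0.01, 0.02):
    print(f"p = {p:.3f}: {binary_entropy(p):.4f} bits per base")
```

Because this term grows with p, any lossless representation of erroneous reads needs more bits per base as the error rate increases.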
Fig. 3 The tradeoff between the running time and compression performance of Assembltrie. The tradeoff between Assembltrie’s running time and compression rate as a function of the user-set parameter K (initial match length) on the data set ERR174310. The blue point depicts the performance of Orcom on the same data set with default parameters
Fig. 4 Cycle-rooted trie. An example cycle-rooted trie constructed from a collection of reads of length 5, with a cycle (in the center) constituting the root
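The idea of storing each read as a node that hangs off an overlapping parent read can be illustrated with a toy encoder. This is a simplified sketch, not the Assembltrie algorithm: it assumes error-free reads of equal length, only lets a read pick a parent among earlier reads (so no cycle can form and no cycle-rooted structure is needed), and ignores reverse complements. The names `overlap`, `encode`, and `decode` are hypothetical.

```python
def overlap(parent, child):
    """Length of the longest proper suffix of `parent` that is a prefix of `child`."""
    for k in range(min(len(parent), len(child)) - 1, 0, -1):
        if parent.endswith(child[:k]):
            return k
    return 0


def encode(reads):
    """Return one (parent_index, tail) pair per read; parent_index is None for a root."""
    nodes = []
    for i, read in enumerate(reads):
        best_parent, best_k = None, 0
        for j in range(i):  # SIMPLIFICATION: earlier reads only, so the result is a forest
            k = overlap(reads[j], read)
            if k > best_k:
                best_parent, best_k = j, k
        # Only the characters not covered by the parent need to be stored.
        nodes.append((best_parent, read[best_k:]))
    return nodes


def decode(nodes, read_length):
    """Rebuild the reads from (parent_index, tail) pairs, assuming equal read lengths."""
    reads = []
    for parent, tail in nodes:
        if parent is None:
            reads.append(tail)
        else:
            keep = read_length - len(tail)  # overlap length, at least 1 here
            reads.append(reads[parent][-keep:] + tail)
    return reads


reads = ["ACGTA", "CGTAC", "GTACC", "CGTAG"]
nodes = encode(reads)
print(nodes)
assert decode(nodes, 5) == reads
```

In this toy example only the first read is stored in full; every other read is reduced to a parent pointer and the characters it adds beyond the overlap.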