Gaëtan Benoit, Claire Lemaitre, Dominique Lavenier, Erwan Drezen, Thibault Dayris, Raluca Uricaru, Guillaume Rizk.
Abstract
BACKGROUND: The data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip.
Year: 2015 PMID: 26370285 PMCID: PMC4570262 DOI: 10.1186/s12859-015-0709-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 LEON method overview. First, a de Bruijn graph is constructed from the reads: k-mers are counted, then sufficiently abundant k-mers are inserted into a Bloom filter representing a probabilistic de Bruijn graph. Reads are then mapped to this graph, and the information required to rebuild each read from the graph is stored in the compressed file: an anchoring k-mer and a list of bifurcations
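The probabilistic de Bruijn graph of Fig. 1 can be sketched in a few lines: solid (sufficiently abundant) k-mers are inserted into a Bloom filter, and graph membership is then answered by queries that allow false positives but no false negatives. The filter size, hashing scheme and k below are illustrative choices, not LEON's actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_hashes hash functions over a bit array."""
    def __init__(self, size=1 << 16, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive several positions from salted SHA-256 digests (illustrative).
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))

# Insert the k-mers of a read into the filter; membership queries then
# approximate nodes of the de Bruijn graph.
bf = BloomFilter()
for km in kmers("ACGTACGTGACT", k=5):
    bf.add(km)
assert "CGTAC" in bf
```

A Bloom filter stores set membership in a few bits per element, which is why the graph fits in memory even for large k-mer sets; the price is a small, tunable false-positive rate.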
Fig. 2 Schematic description of LEON's path encoding. The upper part shows the mapping of two reads to the de Bruijn graph. Anchor k-mers are shown in blue; bifurcations in the graph (read on the left side) and differences from the graph (read on the right side) are highlighted in green and red, respectively. The bottom part shows the corresponding path encodings for these two reads: the index of the anchor k-mer and, for each side, the path length and the bifurcation list
Fig. 3 Contribution of each component to sequence compression. Sequence compression ratio (top) and relative contribution of each component to the compressed sequence stream (bottom) for diverse datasets. "WGS high" means high coverage (116x, 70x and 102x, respectively); "WGS low" means down-sampling to 10x
SNP calling precision/recall test on data from human chromosome 20, compared to a gold standard from the 1000 Genomes Project
HG00096, chromosome 20:

| Prog | Precision | Recall | Compression ratio |
|---|---|---|---|
| Lossless | 85.02 | 67.02 | 2.95 |
| SCALCE | 85.15 | 66.13 | 4.1 |
| FASTQZ | 85.46 | 66.63 | 5.4 |
| LIBCSAM | 84.85 | 67.09 | 8.4 |
| FQZCOMP | 85.09 | 66.61 | 8.9 |
| LEON | | | 11.4 |
| RQS | 85.59 | 67.15 | 12.4 |
| No quality | 57.73 | 68.66 | — |
"No quality" means all quality scores were discarded and replaced by 'H'. The ratio is the original quality size divided by the compressed size. For the lossless line, the best compression ratio obtained by lossless compression tools is given (obtained here with FQZCOMP). Results are ordered by increasing compression ratio
Best overall results are in bold
Fig. 4 Sequence compression ratios by coverage. Compression ratios obtained by LEON on the sequence stream, with respect to the sequencing coverage of the datasets. The three WGS datasets were down-sampled to obtain lower coverages
Compression features obtained for the three high-coverage WGS datasets with several compression tools. The total compression ratio is the compression ratio (original size / compressed size) of the whole FASTQ file: header, sequence and quality combined
| Method | Compression ratio | Compression | Decompression | |||||
|---|---|---|---|---|---|---|---|---|
| Total | Header | Base | Quality | Time (s) | Mem. (MB) | Time (s) | Mem. (MB) | |
| SRR959239 - WGS | ||||||||
| gzip | 3.9 | — | — | — | 179 | 1 | 13 | 1 |
| dsrc | 7.6 | — | — | — | 9 | 1942 | 13 | 1998 |
| fqzcomp | 17.9 | 35.2 | 12.0 | 19.6 | 73 | 4171 | 74 | 4160 |
| fastqz | 13.4 | 40.8 | 14.1 | 8.7 | 255 | 1375 | 298 | 1375 |
| leon | | 45.1 | 17.5 | 59.3 | 39 | 353 | 33 | 205 |
| scalce | 9.8 | 21.4 | 8.3 | 9.2 | 62 | 2012 | 35 | 2012 |
| quip | 8.4 | 29.8 | 8.5 | 5.3 | 244 | 1008 | 232 | 823 |
| mince | — | — | 16.7 | — | 77 | 1812 | 19 | 242 |
| orcom* | — | — | 34.3* | — | 10 | 2243 | 15 | 197 |
| SRR065390 - WGS | ||||||||
| gzip | 3.8 | — | — | — | 2145 | 1 | 165 | |
| dsrc | 7.9 | — | — | — | 67 | 5039 | 85 | 5749 |
| fqzcomp | 12.8 | 54.2 | 7.6 | 15.0 | 952 | 4169 | 1048 | 4159 |
| fastqz | 10.3 | 61.9 | 7.3 | 8.7 | 2749 | 1527 | 3326 | 1527 |
| leon | | 48.6 | 12.0 | 32.9 | 627 | 1832 | 471 | 419 |
| scalce | 8.2 | 34.1 | 6.5 | 7.2 | 751.4 | 5309 | 182.3 | 1104 |
| quip | 6.5 | 54.3 | 4.8 | 5.2 | 928 | 775 | 968 | 771 |
| mince | — | — | 10.3 | — | 1907 | 21825 | 387 | 242 |
| orcom* | — | — | 24.2* | — | 113 | 9408 | 184 | 1818 |
| SRR345593/SRR345594 - WGS human - 733 GB - 102x | ||||||||
| gzip | 3.3 | — | — | — | 104,457 | 1 | 9124 | 1 |
| dsrc | 7.4 | — | — | — | 2797 | 5207 | 3598 | 5914 |
| fqzcomp | 9.3 | 23.2 | 5.3 | 15.0 | 39,613 | 4169 | 48,889 | 4158 |
| fastqz(a) | — | — | — | — | — | — | — | — |
| leon | | 27.5 | 9.2 | 26.8 | 40,766 | 9556 | 21,262 | 5869 |
| scalce(b) | — | — | — | — | — | — | — | — |
| quip | 6.5 | 54.3 | 4.8 | 5.2 | 52,854 | 776 | 46594 | 775 |
| mince(a) | — | — | — | — | — | — | — | — |
| orcom* | — | — | 19.2* | — | 29,364 | 27505 | 10,889 | 60,555 |
The following columns indicate the ratio for each individual component, when available. Running time (in s) and peak memory (in MB) are given for compression and decompression. All tools were used without a reference genome. Best overall results are in bold
aProgram does not support variable length sequences
bSCALCE was not able to finish on the large WGS human dataset
-lossy suffix means the method was run in lossy mode for quality scores compression
*Stars indicate that the given program changes read order and loses read-pairing information, and thus cannot be directly compared to other tools
Running time with # is on DNA sequence only
Fig. 5 Compression ratio comparison. Comparison of compression ratios between de novo compression software for diverse datasets. Top: overall compression factor (original file size / compressed file size). Bottom: space distribution between headers, sequences and quality scores (in red, green and blue, respectively)
Fig. 6 Compression/accuracy trade-off for quality compression. Impact of lossy compression methods for quality scores on SNP calling, for human chromosome 20 (individual HG00096, SRR062634), compared to a gold standard. Each line represents the F-score/compressed-size trade-off for a method; the higher the line, the better. The dashed line represents the F-score obtained with the original FASTQ file and with lossless compression methods
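For reference, the F-score plotted in Fig. 6 is the harmonic mean of precision and recall. Taking the first row of the SNP-calling table above (precision 85.02, recall 67.02, in percent) as a worked example:

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the first row of the SNP-calling table (in percent).
print(round(f_score(85.02, 67.02), 2))  # → 74.95
```

Because the harmonic mean is dominated by the smaller of the two values, a lossy quality scheme that trades a little precision for recall (or vice versa) can still keep the F-score, and hence its curve in Fig. 6, essentially flat.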