Tao Tang, Yuansheng Liu, Buzhong Zhang, Benyue Su, Jinyan Li.
Abstract
BACKGROUND: The rapid development of Next-Generation Sequencing technologies enables genomes to be sequenced at low cost. The dramatically increasing amount of sequencing data has created a pressing need for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance in compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers from a series of issues, such as difficult reference selection and remarkable performance variation.
Keywords: Data compression; NGS data; Reference-based compression; Unsupervised learning
Year: 2019 PMID: 31888458 PMCID: PMC6939838 DOI: 10.1186/s12864-019-6310-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Schematic diagram of our algorithm ECC
Compression ratio for the H. sapiens dataset-60 (171 GB), by algorithm
| Reference | HiRGC | iDoComp | GDC2 | ERGC | NRGC | SCCG |
|---|---|---|---|---|---|---|
| GCA_000004845 | 339.80 | 238.98 | 11.00 | 122.67 | 225.35 | |
| hg19 | 346.80 | 26.78 | 137.41 | |||
| YH | 134.13 | 237.39 | 108.26 | 123.24 | 228.20 | |
| GCA_000252825 | 241.01 | 92.65 | 230.45 | 102.62 | 176.08 | 122.25 |
| Huref | 245.79 | 140.84 | 224.47 | 69.59 | 123.26 | |
| ECC clustering result | ||||||
| Ratio gain* | 22.05% | 22.83% | 2.22% | 56.31% | 3.41% | 15.49% |
Bold text indicates the highest compression ratio of an algorithm; italic text indicates the best result among the fixed single-reference cases.
*The ratio gain of ECC against the best case
Compression ratios on H. sapiens dataset-1152 (3128 GB)
| Reference | HiRGC | iDoComp | GDC2 |
|---|---|---|---|
| HG00096 | 991.77 | ||
| NA18856 | 889.32 | 437.05 | 2805.19 |
| GCA_000004845 | 784.84 | 53.94 | 2901.44 |
| GCA_000252825 | 504.41 | 114.40 | 2897.76 |
| GCA_000365445 | 13.07 | / | / |
| hg19 | 68.36 | / | |
| hg38 | 826.31 | 52.03 | / |
| Result of ECC | 576.84 | 3033.84 | |
| Ratio gain* | 7.95% | 15.86% | 3.77% |
'/' indicates a running time longer than 500 h. Bold text indicates the highest compression ratio of an algorithm; italic text indicates the best result among the fixed single-reference cases.
*The ratio gain of ECC against the best case
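The footnote does not spell out how the ratio gain is computed. A minimal sketch, assuming it is the relative improvement of the ECC ratio over the best fixed-reference ratio, and assuming a compression ratio is original size divided by compressed size; the function names are hypothetical and only the GDC2 figures for dataset-1152 are taken from the table:

```python
def ratio_gain(ecc_ratio: float, best_fixed_ratio: float) -> float:
    """Relative improvement of ECC over the best fixed reference.

    ASSUMPTION: gain = (ECC - best) / best. The source only says
    "ratio gain of ECC against the best case".
    """
    return (ecc_ratio - best_fixed_ratio) / best_fixed_ratio


def implied_best(ecc_ratio: float, gain: float) -> float:
    """Invert the assumed definition to estimate the best fixed-reference ratio."""
    return ecc_ratio / (1.0 + gain)


def compressed_size_gb(original_gb: float, ratio: float) -> float:
    """Compressed size, assuming ratio = original size / compressed size."""
    return original_gb / ratio


# GDC2 on dataset-1152: ECC ratio and reported gain, as listed in the table.
ecc_gdc2 = 3033.84
gain_gdc2 = 0.0377

best_gdc2 = implied_best(ecc_gdc2, gain_gdc2)
print(f"implied best fixed-reference ratio: {best_gdc2:.2f}")
print(f"compressed size at ECC ratio: {compressed_size_gb(3128, ecc_gdc2):.3f} GB")
```

Under this assumed definition, the implied best fixed-reference ratio lies above every GDC2 value that survives in the table, which is consistent with the best case being one of the cells that is blank here.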
Compression ratio on the Oryza sativa L. dataset-2818 (1012 GB), by algorithm
| Reference | HiRGC | iDoComp | GDC2 |
|---|---|---|---|
| B035 | 77.62 | 529.40 | |
| CX319 | 73.76 | ||
| IRIS_313-10010 | 69.93 | 64.74 | 519.43 |
| IRIS_313-10776 | 70.97 | 77.10 | 533.81 |
| IRIS_313-9937 | 71.39 | 66.42 | 535.31 |
| Result of ECC | |||
| Ratio gain* | 13.89% | 21.22% | 2.48% |
Bold text indicates the highest compression ratio of an algorithm; italic text indicates the best result among the fixed single-reference cases.
*The ratio gain of ECC against the best case
Reference selection time of ECC (in hours)
| Dataset | dataset-60 | dataset-1152 | dataset-2818 |
|---|---|---|---|
| Number of genomes | 60 | 1152 | 2818 |
| Total running time | 0.023 | 0.830 | 0.759 |
Compression time (in hours) of each algorithm on the three datasets
| Algorithm | dataset-60 (reference-fixed) | dataset-60 (ECC) | dataset-1152 (reference-fixed) | dataset-1152 (ECC) | dataset-2818 (reference-fixed) | dataset-2818 (ECC) |
|---|---|---|---|---|---|---|
| HiRGC | 1.18 | 0.98 | 15.12 | 13.94 | 2.91 | 2.82 |
| iDoComp | 6.54 | 2.82 | 102.94 | 29.77 | 15.58 | 10.34 |
| GDC2 | 110.73 | 117.82 | 129.24 | 126.43 | 25.29 | 23.61 |
The time for the reference-fixed approach is the average running time of several fixed single-reference cases for each algorithm. Please see the supplementary file for the time range of all the cases and for the compression times of ERGC, SCCG and NRGC.
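The timing table can be read as a speedup factor, that is, the reference-fixed time divided by the ECC time. The following sketch computes that factor from the values in the table; the `times` layout and the `speedup` helper are illustrative names, not part of the original work:

```python
# Compression times in hours, copied from the table above:
# (reference-fixed average, ECC) per algorithm and dataset.
times = {
    "HiRGC": {
        "dataset-60": (1.18, 0.98),
        "dataset-1152": (15.12, 13.94),
        "dataset-2818": (2.91, 2.82),
    },
    "iDoComp": {
        "dataset-60": (6.54, 2.82),
        "dataset-1152": (102.94, 29.77),
        "dataset-2818": (15.58, 10.34),
    },
    "GDC2": {
        "dataset-60": (110.73, 117.82),
        "dataset-1152": (129.24, 126.43),
        "dataset-2818": (25.29, 23.61),
    },
}


def speedup(fixed_hours: float, ecc_hours: float) -> float:
    """Speedup of ECC over the reference-fixed average (values > 1 mean ECC is faster)."""
    return fixed_hours / ecc_hours


for algorithm, per_dataset in times.items():
    for dataset, (fixed_hours, ecc_hours) in per_dataset.items():
        print(f"{algorithm:8s} {dataset:13s} speedup = {speedup(fixed_hours, ecc_hours):.2f}x")
```

A factor below 1 (GDC2 on dataset-60) marks the one case in the table where ECC takes longer than the reference-fixed average.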