Haichang Yao, Guangyong Hu, Shangdong Liu, Houzhi Fang, Yimu Ji.
Abstract
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate enormous amounts of genomic data, which continue to expand as the cost of sequencing decreases. Hence, a much more efficient and scalable program for genome compression is urgently required. In this manuscript, we propose a new Apache Spark-based genome compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.
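To make the idea of reference-based compression concrete, the following is a minimal Python sketch of a first-order pass, assuming a toy k-mer hash index with greedy match extension. It is not SparkGC's implementation: the names `build_index`, `compress`, and `decompress` and the k-mer length `K = 4` are hypothetical, Spark is not used, and the second-order stage named in the abstract is not shown.

```python
from typing import Dict, List, Optional, Tuple, Union

# A token is either a literal string of bases or a (reference_position, length) match.
Token = Union[str, Tuple[int, int]]

K = 4  # hypothetical k-mer length for this toy index; real tools use larger values


def build_index(reference: str, k: int = K) -> Dict[str, int]:
    """Map each k-mer to the first position at which it occurs in the reference."""
    index: Dict[str, int] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)
    return index


def compress(target: str, reference: str, k: int = K) -> List[Token]:
    """Encode the target as matches against the reference plus literal bases."""
    index = build_index(reference, k)
    tokens: List[Token] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        pos: Optional[int] = None
        if i + k <= len(target):
            pos = index.get(target[i:i + k])
        if pos is None:
            # No k-mer match starts here: keep the base as a literal.
            literal.append(target[i])
            i += 1
            continue
        # Greedily extend the k-mer match while both sequences agree.
        length = k
        while (i + length < len(target)
               and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        if literal:
            tokens.append("".join(literal))
            literal = []
        tokens.append((pos, length))
        i += length
    if literal:
        tokens.append("".join(literal))
    return tokens


def decompress(tokens: List[Token], reference: str) -> str:
    """Rebuild the target from the tokens and the same reference."""
    parts: List[str] = []
    for token in tokens:
        if isinstance(token, str):
            parts.append(token)
        else:
            pos, length = token
            parts.append(reference[pos:pos + length])
    return "".join(parts)
```

With the reference `ACGTACGTTTGACCA` and the target `ACGTACGATTGACCA`, which differ by a single base, `compress` produces two matches and one literal base, and `decompress` restores the target exactly.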
Keywords: Distributed parallel; Genome compression; Reference-based compression; Spark
Year: 2022 PMID: 35879669 PMCID: PMC9310413 DOI: 10.1186/s12859-022-04825-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Summary of the related works of this paper
| Year | Name | Methodology | Characteristics | Parallelization |
|---|---|---|---|---|
| 2009 | DNAZip | A series of compression techniques (Variable integer (VINT), Delta positions (DELTA), SNP mapping (DBSNP), K-mer partitioning (KMER)) are combined to reduce the size of a single genome | The SNP database dbSNP [ | Serial |
| 2012 | BlockCompression | The reference and target sequences are divided into fixed-length blocks. Matching is performed between the blocks | A compressed suffix tree is employed to save memory. Straightforward approximate matching is used to improve the matching rate | Block processing can be distributed over several CPUs |
| 2013 | FRESCO | A suffix tree is used to index the reference sequence. The base after the exact match is saved as a mutation | Three schemes (selecting a good reference, reference rewriting, and second-order compression) were proposed to improve the compression ratio | Serial |
| 2015 | COGI | COGI transforms the genomic sequences into a bitmap, then applies a rectangular partition coding algorithm to compress the binary image | The reference sequence is selected using techniques based on co-occurrence entropy and multi-scale entropy. Compressing multiple sequences is supported by COGI, but the compression ratio decreases dramatically | Serial |
| 2015 | GDC2 | GDC2 was developed to compress large collections of genomes. A second-order compression scheme and a variable integer encoding scheme are employed to reduce the size of compressed files | GDC2 is implemented in a multithreaded fashion. By default, GDC2 uses 4 threads: 3 for the first-level Ziv–Lempel factoring and 1 for the second-level factoring and arithmetic coding | Multithreaded parallel |
| 2015 | iDoComp | A suffix array is used to index the reference sequence. A greedy matching scheme is used to match the reference and the target sequence | The suffix array has to be pre-computed and stored on the hard drive before compression | Serial |
| 2016 | NRGC | NRGC uses a score-based placement technique to quantify the differences between genome sequences, so as to obtain the best position of each target block on the reference blocks | NRGC has strict requirements on the similarity between the reference sequence and the target sequence, which makes it prone to compression failure | Serial |
| 2017 | HiRGC | In the pre-processing stage, HiRGC separates the target sequence file into the identifier, the length of each line, position intervals of lowercase letters and the letter ‘N’, special letters, and base letters; different compression schemes are then used to compress them according to their characteristics | The greedy matching scheme generates some suboptimal matching results | Serial |
| 2018 | SCCG | SCCG optimizes the greedy matching scheme of HiRGC. It combines greedy matching with the segmentation matching used in NRGC, first matching the target sequence to the corresponding reference segment, which improves the compression ratio | The compression time and memory consumption increase significantly | Serial |
| 2019 | HRCM | HRCM supports both pair-wise sequence compression and multiple-sequence compression. When multiple sequences are compressed, an optimized second-order compression scheme is used to improve the compression ratio | HRCM balances compression speed, compression ratio, and robustness well, especially when compressing large collections of genomes | Serial |
| 2020 | memRGC | bfMEM algorithm [ memRGC extends the MEMs if there are fewer than two SNPs between MEMs, which improves the compression ratio | INDELs (INsertions and DELetions) and more than two SNPs are omitted in the approximate matching of memRGC | Multithreaded parallel |
| 2021 | HadoopHRCM | HDFS and the Map/Reduce architecture are employed to improve the compression speed of HRCM | Distributed parallel computing technology is introduced to FASTA compression | Hadoop |
Fig. 1 Architecture of Spark-based genome compression
Fig. 2 Data flow of SparkGC
Fig. 3 Overview of the decompression of SparkGC
Fig. 4 Compression ratio of SparkGC and the state-of-the-art methods
Fig. 5 Compression speed of SparkGC and the state-of-the-art methods
Fig. 6 Total compression time under different numbers of worker nodes
Runtime of different parts on different numbers of worker nodes
| Chromosome | Stage | 1 worker node, time (s) | 1 worker node, % | 2 worker nodes, time (s) | 2 worker nodes, % | 3 worker nodes, time (s) | 3 worker nodes, % | 4 worker nodes, time (s) | 4 worker nodes, % |
|---|---|---|---|---|---|---|---|---|---|
| Chr1 | Pre-processing | 112 | 1.79 | 116 | 3.84 | 124 | 5.27 | 126 | 6.49 |
| Chr1 | First-order | 5577 | 89.23 | 2618 | 86.72 | 2007 | 85.37 | 1648 | 84.81 |
| Chr1 | Second-order | 531 | 8.50 | 254 | 8.41 | 188 | 8.00 | 136 | 7.00 |
| Chr1 | Post-processing | 30 | 0.48 | 31 | 1.03 | 32 | 1.36 | 32 | 1.70 |
| Chr1 | Total | 6250 | 100 | 3019 | 100 | 2351 | 100 | 1943 | 100 |
| Chr13 | Pre-processing | 62 | 3.58 | 70 | 6.42 | 70 | 9.06 | 70 | 10.74 |
| Chr13 | First-order | 1520 | 87.76 | 921 | 84.50 | 625 | 80.85 | 512 | 78.53 |
| Chr13 | Second-order | 120 | 6.93 | 69 | 6.33 | 47 | 6.08 | 39 | 5.98 |
| Chr13 | Post-processing | 30 | 1.73 | 30 | 2.75 | 31 | 4.01 | 31 | 4.75 |
| Chr13 | Total | 1732 | 100 | 1090 | 100 | 773 | 100 | 652 | 100 |
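The scaling behaviour in the runtime table can be summarized as speedup and parallel efficiency relative to the one-node run. The short Python sketch below does that arithmetic on the "Total" rows only. The names `TOTAL_SECONDS` and `scaling` are illustrative, and the definitions used (speedup = one-node time / n-node time, efficiency = speedup / n) are the standard ones, assumed here rather than taken from the paper.

```python
# Total compression time in seconds, copied from the "Total" rows of the table.
TOTAL_SECONDS = {
    "Chr1": {1: 6250, 2: 3019, 3: 2351, 4: 1943},
    "Chr13": {1: 1732, 2: 1090, 3: 773, 4: 652},
}


def scaling(times):
    """Return {nodes: (speedup, efficiency)} relative to the one-node run."""
    baseline = times[1]
    result = {}
    for nodes, seconds in sorted(times.items()):
        speedup = baseline / seconds
        result[nodes] = (speedup, speedup / nodes)
    return result


for chromosome, times in TOTAL_SECONDS.items():
    for nodes, (speedup, efficiency) in scaling(times).items():
        print(f"{chromosome}: {nodes} node(s), "
              f"speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

On four worker nodes this gives a speedup of about 3.2 for Chr1 and about 2.7 for Chr13. The table also shows that pre-processing and post-processing times stay roughly constant as nodes are added, so their share of the total grows while the first-order stage shrinks.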
Fig. 7 Compression performance for different numbers of target sequences
Compressed size (MB) under different references
| Chromosome | Original size (MB) | Method | HG13 | HG16 | K131 | YH | Huref | HG00096 | AVG | SD |
|---|---|---|---|---|---|---|---|---|---|---|
| Chr1 | 264,994 | HiRGC | 2474 | 1313 | 828 | 750 | 1026 | 480 | 1145 | 647 |
| Chr1 | 264,994 | SCCG | 2430 | 1284 | 775 | 706 | 986 | 464 | 1107 | 643 |
| Chr1 | 264,994 | memRGC | 2324 | 1192 | 650 | 593 | 887 | 406 | 1009 | 638 |
| Chr1 | 264,994 | HRCM | 126 | 140 | 156 | 154 | 151 | 136 | 144 | 11 |
| Chr1 | 264,994 | SparkGC | | | | | | | | |
| Chr13 | 122,492 | HiRGC | 333 | 296 | 413 | 394 | 314 | 219 | 328 | 64 |
| Chr13 | 122,492 | SCCG | / | 289 | 390 | 375 | 308 | 219 | / | / |
| Chr13 | 122,492 | memRGC | 288 | 254 | 333 | 319 | 262 | 190 | 274 | 47 |
| Chr13 | 122,492 | HRCM | 48 | 56 | 66 | 65 | 63 | 57 | 59 | 6 |
| Chr13 | 122,492 | SparkGC | | | | | | | | |
‘/’ indicates the method fails to compress the chromosome. Bold indicates the best value of the case
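The sizes in this table can be turned into compression ratios with one division per method. The Python sketch below assumes the common definition ratio = original size / compressed size and uses the AVG column. The names `ORIGINAL_MB`, `AVG_COMPRESSED_MB`, and `compression_ratios` are illustrative. Only methods whose AVG value appears in the table are included, so SparkGC and the Chr13 SCCG entry are left out.

```python
# Original sizes and average compressed sizes (MB), copied from the table.
ORIGINAL_MB = {"Chr1": 264_994, "Chr13": 122_492}
AVG_COMPRESSED_MB = {
    "Chr1": {"HiRGC": 1145, "SCCG": 1107, "memRGC": 1009, "HRCM": 144},
    "Chr13": {"HiRGC": 328, "memRGC": 274, "HRCM": 59},
}


def compression_ratios(chromosome):
    """Return {method: original size / average compressed size}."""
    original = ORIGINAL_MB[chromosome]
    return {
        method: original / compressed
        for method, compressed in AVG_COMPRESSED_MB[chromosome].items()
    }


for chromosome in ORIGINAL_MB:
    for method, ratio in compression_ratios(chromosome).items():
        print(f"{chromosome} {method}: about {ratio:.0f}:1")
```

Under this definition, HRCM reaches roughly 1840:1 on Chr1, compared with roughly 230:1 to 260:1 for the three pair-wise methods, which reflects the small AVG and SD values in its rows.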
Compression time (hours) under different references
| Chromosome | Method | HG13 | HG16 | K131 | YH | Huref | HG00096 | AVG | SD |
|---|---|---|---|---|---|---|---|---|---|
| Chr1 | HiRGC | 11.95 | 7.40 | 8.75 | 8.82 | 10.01 | 6.55 | 8.91 | 1.75 |
| Chr1 | SCCG | 21.25 | 21.91 | 37.73 | 37.11 | 39.86 | 36.66 | 32.42 | 7.73 |
| Chr1 | memRGC | 20.49 | 12.05 | 16.09 | 18.45 | 16.53 | 11.07 | 15.78 | 3.32 |
| Chr1 | HRCM | 11.19 | 8.28 | 9.60 | 8.40 | 9.76 | 5.72 | 8.82 | 1.69 |
| Chr1 | SparkGC | | | | | | | | |
| Chr13 | HiRGC | 2.67 | 2.43 | 2.37 | 2.45 | 2.48 | 2.41 | 2.47 | 0.10 |
| Chr13 | SCCG | / | 24.15 | 18.61 | 23.44 | 12.37 | 10.17 | / | / |
| Chr13 | memRGC | 9.07 | 8.13 | 14.62 | 9.82 | 11.25 | 8.23 | 10.19 | 2.25 |
| Chr13 | HRCM | 1.64 | 1.43 | 1.47 | 1.40 | 1.44 | 1.25 | 1.44 | 0.11 |
| Chr13 | SparkGC | | | | | | | | |
‘/’ indicates the method fails to compress the chromosome. Bold indicates the best value of the case
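The AVG and SD columns can be recomputed from the six per-reference values. The Python sketch below does this for the HiRGC Chr1 row, using `statistics.mean` and the population standard deviation `statistics.pstdev`. Treating SD as a population standard deviation is an assumption that matches this one row and has not been checked against the authors' method. The names `HIRGC_CHR1_HOURS` and `summarize` are illustrative.

```python
from statistics import mean, pstdev, stdev

# Compression time in hours for HiRGC on Chr1 under the six references.
HIRGC_CHR1_HOURS = [11.95, 7.40, 8.75, 8.82, 10.01, 6.55]


def summarize(values):
    """Return (mean, population standard deviation) of the values."""
    return mean(values), pstdev(values)


avg, sd = summarize(HIRGC_CHR1_HOURS)
print(f"AVG = {avg:.2f} h, SD (population) = {sd:.2f} h")
print(f"SD (sample) = {stdev(HIRGC_CHR1_HOURS):.2f} h")
```

For this row the mean and population standard deviation round to the tabulated 8.91 and 1.75, while the sample standard deviation is noticeably larger.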
Fig. 8 Trade-off between compression ratio and compression speed
Fig. 9 Compression ratio and speed on FASTQ data sets