Haichang Yao, Yimu Ji, Kui Li, Shangdong Liu, Jing He, Ruchuan Wang.
Abstract
As genome sequencing technology matures, huge amounts of sequence reads and assembled genomes are being generated. This explosive growth poses enormous challenges for the storage and transmission of genomic data. FASTA, one of the main storage formats for genome sequences, is widely used in the Gene Bank because it is easy to read and eases sequence analysis and gene research. Many compression methods for FASTA genome sequences have been proposed, but they leave room for improvement: compression ratio and speed are neither high nor robust enough, and memory consumption is not ideal. It is therefore of great significance to improve the efficiency, robustness, and practicability of genomic data compression, further reducing the storage and transmission cost of genomic data and promoting the research and development of genomic technology. In this manuscript, a hybrid referential compression method (HRCM) for FASTA genome sequences is proposed. HRCM is a lossless compression method able to compress a single sequence as well as large collections of sequences. It is implemented in three stages: sequence information extraction, sequence information matching, and sequence information encoding. Extensive experiments evaluated the performance of HRCM. The results show that HRCM is superior to the best-known methods in genome batch compression. Moreover, the memory consumption of HRCM is relatively low, so it can be deployed on standard PCs.
Year: 2019 PMID: 31915686 PMCID: PMC6930768 DOI: 10.1155/2019/3108950
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The main process of HRCM.
Algorithm 1. The first-level matching.
Algorithm 2. The second-level matching.
Output B sequences after sequence information extraction.
| Base sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R1 | A | G | A | T | G | G | G | C | C | C | T | T | T | A | G | G | T | A | T | T |
| T1 | A | G | C | T | G | G | T | C | C | C | T | G | A | A | G | G | A | A | T | C |
| T2 | A | G | C | T | G | G | T | C | C | C | T | G | G | A | G | G | A | A | T | C |
| T3 | A | G | T | T | G | G | T | C | C | C | T | G | G | A | G | G | A | T | T | T |
| T4 | A | G | T | T | G | G | T | C | C | C | T | G | A | A | G | G | A | T | T | T |
| T5 | A | T | A | T | G | G | T | C | C | C | T | G | A | A | G | G | A | T | T | T |
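The extraction stage that yields base sequences like those above is not detailed in this record. As a rough sketch only, assume that extraction separates a FASTA record into its header line, the positions of lowercase runs, and an uppercase base sequence. The function name `extract_sequence_info` and the `(start, length)` run format are hypothetical, and line width and non-ACGT characters are ignored here.

```python
def extract_sequence_info(fasta_text: str):
    """Split one FASTA record into (header, lowercase_runs, base_sequence).

    lowercase_runs holds 0-based (start, length) pairs; this layout is an
    assumption for illustration, not the format used by HRCM itself.
    """
    lines = fasta_text.strip().splitlines()
    header = lines[0]
    raw = "".join(lines[1:])  # drop line breaks inside the sequence body

    lowercase_runs = []
    start = None
    for i, ch in enumerate(raw):
        if ch.islower():
            if start is None:
                start = i  # a new lowercase run begins here
        elif start is not None:
            lowercase_runs.append((start, i - start))
            start = None
    if start is not None:  # run that extends to the end of the sequence
        lowercase_runs.append((start, len(raw) - start))

    return header, lowercase_runs, raw.upper()


record = ">demo\nAGat\nggGC\n"
print(extract_sequence_info(record))
```

Keeping the case information separate means the later matching stages can work on a uniform uppercase alphabet while the original text can still be restored exactly.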
Output after the first-level matching.
| Base sequence | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| T1 | (1, 2, C) | (4, 3, T) | (8, 4, GA) | (14, 3, A) | (18, 2, C) |
| T2 | (1, 2, C) | (4, 3, T) | (8, 4, GG) | (14, 3, A) | (18, 2, C) |
| T3 | (1, 2, T) | (4, 3, T) | (8, 4, GG) | (14, 3, AT) | (19, 2) |
| T4 | (1, 2, T) | (4, 3, T) | (8, 4, GA) | (14, 3, AT) | (19, 2) |
| T5 | (1, 1, T) | (3, 4, T) | (8, 4, GA) | (14, 3, AT) | (19, 2) |
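The entries in the table above appear to read as (1-based start position in R1, number of matching bases, mismatched bases that follow), with the mismatch omitted when the match runs to the end of the sequence. Algorithm 1 itself is not reproduced in this record, so the sketch below only mimics the toy example: it assumes reference and target have equal length with no insertions or deletions and compares them position by position. The name `first_level_match` is hypothetical.

```python
def first_level_match(ref: str, target: str):
    """Emit (position, length, mismatch) entities by aligned comparison.

    Simplifying assumption: len(ref) >= len(target) and no indels, so
    position i of the target is always compared with position i of ref.
    """
    entities = []
    i, n = 0, len(target)
    while i < n:
        start = i
        while i < n and target[i] == ref[i]:  # extend the matching run
            i += 1
        length = i - start
        lit_start = i
        while i < n and target[i] != ref[i]:  # collect mismatched bases
            i += 1
        literal = target[lit_start:i]
        if literal:
            entities.append((start + 1, length, literal))
        else:
            entities.append((start + 1, length))  # match reached the end
    return entities


R1 = "AGATGGGCCCTTTAGGTATT"
T1 = "AGCTGGTCCCTGAAGGAATC"
print(first_level_match(R1, T1))
# [(1, 2, 'C'), (4, 3, 'T'), (8, 4, 'GA'), (14, 3, 'A'), (18, 2, 'C')]
```

A real referential compressor would also need to locate matches that are shifted relative to the reference, which this aligned comparison cannot do.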
Output after the second-level matching.
| Base sequence | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| T2 | (1, 1, 2) | (8, 4, GG) | (1, 4, 2) | |
| T3 | (1, 2, T) | (2, 2, 2) | (14, 3, AT) | (19, 2) |
| T4 | (3, 1, 2) | (8, 4, GA) | (3, 4, 2) | |
| T5 | (1, 1, T) | (3, 4, T) | (4, 3, 3) | |
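Comparing the two tables, the all-numeric triples in the second-level output seem to mean (index of an earlier target sequence, 1-based position of its first reused entity, number of consecutive entities reused), while unmatched first-level entities are carried over unchanged. Algorithm 2 is not shown in this record, so the following is a brute-force sketch under that reading. The name `second_level_match`, the minimum run of two entities, and the tie-breaking rule (earliest sequence wins) are assumptions.

```python
def second_level_match(entity_lists, min_run=2):
    """Replace runs of entities already seen in earlier sequences.

    entity_lists[k] holds the first-level entities of target T(k+1).
    Returns the encoded lists for T2 onward.
    """
    encoded = []
    for t in range(1, len(entity_lists)):
        cur, out, i = entity_lists[t], [], 0
        while i < len(cur):
            best_len, best_seq, best_pos = 0, 0, 0
            for s in range(t):  # search every earlier sequence
                prev = entity_lists[s]
                for j in range(len(prev)):
                    k = 0
                    while (i + k < len(cur) and j + k < len(prev)
                           and cur[i + k] == prev[j + k]):
                        k += 1
                    if k > best_len:
                        best_len, best_seq, best_pos = k, s + 1, j + 1
            if best_len >= min_run:
                out.append((best_seq, best_pos, best_len))
                i += best_len
            else:
                out.append(cur[i])  # keep the first-level entity as is
                i += 1
        encoded.append(out)
    return encoded


first_level = [
    [(1, 2, "C"), (4, 3, "T"), (8, 4, "GA"), (14, 3, "A"), (18, 2, "C")],  # T1
    [(1, 2, "C"), (4, 3, "T"), (8, 4, "GG"), (14, 3, "A"), (18, 2, "C")],  # T2
    [(1, 2, "T"), (4, 3, "T"), (8, 4, "GG"), (14, 3, "AT"), (19, 2)],      # T3
    [(1, 2, "T"), (4, 3, "T"), (8, 4, "GA"), (14, 3, "AT"), (19, 2)],      # T4
    [(1, 1, "T"), (3, 4, "T"), (8, 4, "GA"), (14, 3, "AT"), (19, 2)],      # T5
]
second_level = second_level_match(first_level)
for name, row in zip(["T2", "T3", "T4", "T5"], second_level):
    print(name, row)
```

The nested scan is quadratic and only suitable for a toy input like this one; an efficient implementation would index the entities of earlier sequences instead.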
Algorithm 3Lowercase character information matching.
Overall comparison of compressed size (MB) and the relative gain for different methods under different reference genomes.
| Reference | Original file size (MB) | iDoComp | GDC2 | ERGC | NRGC | HiRGC | SCCG | HRCM-S | HRCM-B |
|---|---|---|---|---|---|---|---|---|---|
| hg17 | 20,966.28 | 517.43 | 1570.94 | 2220.85 | 1952.05 | 103.62 | 89.09 | 99.66 | |
| Relative gain | | 84.51% | 94.90% | 96.39% | 95.90% | 22.67% | 10.06% | 19.60% | |
| hg18 | 20,962.74 | 506.75 | 1564.89 | 1498.22 | 1237.86 | 97.27 | 82.05 | 96.55 | |
| Relative gain | | 84.63% | 95.02% | 94.80% | 93.71% | 19.92% | 5.07% | 19.32% | |
| hg19 | 20,947.88 | 581.90 | 1610.37 | 1826.78 | 1179.24 | 95.57 | 81.56 | 90.89 | |
| Relative gain | | 86.69% | 95.19% | 95.76% | 93.43% | 18.98% | 5.07% | 14.81% | |
| hg38 | 20,955.10 | 526.98 | 1659.51 | 1708.70 | 1247.31 | 96.78 | 81.93 | 96.73 | |
| Relative gain | | 85.70% | 95.46% | 95.59% | 93.96% | 22.11% | 7.99% | 22.07% | |
| K131 | 20,972.50 | 1338.91 | 1570.45 | 1874.15 | 1172.16 | 124.20 | 108.53 | 132.50 | |
| Relative gain | | 93.14% | 94.15% | 95.10% | 92.17% | 26.08% | 15.42% | 30.71% | |
| K224 | 20,972.51 | 1284.20 | 1540.93 | 1897.95 | 1172.65 | 124.97 | 109.44 | 133.70 | |
| Relative gain | | 92.71% | 93.93% | 95.07% | 92.02% | 25.13% | 14.50% | 30.02% | |
| YH | 20,972.51 | 399.29 | 1643.71 | 1840.61 | 1171.03 | 128.47 | 113.26 | 134.50 | |
| Relative gain | | 76.61% | 94.32% | 94.93% | 92.03% | 27.31% | 17.56% | 30.58% | |
| HuRef | 20,965.32 | 824.15 | 1501.46 | 2911.90 | 2295.88 | 146.15 | 129.33 | 152.50 | |
| Relative gain | | 88.18% | 93.51% | 96.65% | 95.76% | 33.35% | 24.68% | 36.13% | |
Bold indicates the best value of the case.
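This record does not define "relative gain". One common definition, assumed here, is the percentage by which a new compressed size undercuts a baseline size, (1 - new / baseline) × 100. The sketch below applies that assumed formula to the hg17 row and inverts it, pairing each listed size with the percentage in the same position. The helper names are hypothetical.

```python
def relative_gain(new_size: float, baseline_size: float) -> float:
    """Percentage reduction of new_size relative to baseline_size (assumed definition)."""
    return (1.0 - new_size / baseline_size) * 100.0


def implied_size(baseline_size: float, gain_percent: float) -> float:
    """Invert relative_gain: the size that would yield gain_percent over the baseline."""
    return baseline_size * (1.0 - gain_percent / 100.0)


# hg17 row: listed compressed sizes (MB) and the percentages beneath them.
sizes = [517.43, 1570.94, 2220.85, 1952.05, 103.62, 89.09, 99.66]
gains = [84.51, 94.90, 96.39, 95.90, 22.67, 10.06, 19.60]

implied = [implied_size(s, g) for s, g in zip(sizes, gains)]
print([round(x, 2) for x in implied])
```

Under this assumed definition every pair in the hg17 row implies a size between 80.0 and 80.2 MB, which suggests the percentages are all gains of one method over each of the listed ones.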
Figure 2. Compression and decompression time of different methods on the eight groups of human genomes. (a) Compression time. (b) Decompression time.
Compressed sizes (KB) of other species data sets by different methods.
| Reference | To-be-compressed | Original file size (KB) | iDoComp | GDC2 | ERGC | NRGC | HiRGC | SCCG | HRCM-S | HRCM-B |
|---|---|---|---|---|---|---|---|---|---|---|
| ce6 | ce10, ce11 | 199,816 | 251 | 1049 | 463 | 530 | 445 | 437 | 221 | |
| ce10 | ce6, ce11 | 199,804 | 241 | 1003 | 459 | 641 | 534 | 434 | | |
| ce11 | ce6, ce10 | 199,804 | 463 | 1941 | 505 | 690 | 489 | 480 | 488 | |
| TAIR9 | TAIR10 | 118,360 | | 3 | 8 | 153 | 5 | 12 | 5 | 5 |
| TAIR10 | TAIR9 | 118,383 | | 3 | 5 | 153 | 5 | 2 | 5 | 5 |
| TIGR5.0 | TIGR6.0, TIGR7.0 | 740,272 | 399 | | 26758 | 1284 | 376 | 363 | 407 | 342 |
| TIGR6.0 | TIGR5.0, TIGR7.0 | 740,009 | 279 | | 15841 | 11566 | 258 | 257 | 278 | 275 |
| TIGR7.0 | TIGR5.0, TIGR6.0 | 739,089 | 212 | 79 | 7091 | | 227 | 220 | 242 | 168 |
| sacCer2 | sacCer3 | 12,109 | | 4 | 6 | 754 | 6 | 3 | 6 | 6 |
| sacCer3 | sacCer2 | 12,109 | | 4 | 6 | 517 | 9 | 3 | 6 | 6 |
Bold indicates the best value of the case.
Figure 3. Peak memory usage of different methods. (a) Peak compression memory usage. (b) Peak decompression memory usage.