Qiuming Luo1, Chao Guo2, Yi Jun Zhang1, Ye Cai1, Gang Liu1.
Abstract
BACKGROUND: Falling gene sequencing costs and demand from emerging technologies such as precision medicine and deep learning on genomes have led to an explosion of gene data. How to store, transmit and analyze these data has become a research hotspot. Reference-based compression algorithms are now widely used because of their high compression ratios. A major problem is that data from different gene banks cannot be merged directly or shared efficiently, because they are usually compressed against different references. The traditional workflow is decompression followed by recompression, which is simplistic and time-consuming. This workflow should be improved and sped up.
Keywords: DNA sequence compression; Gene data transformation; Reference-based compression
Year: 2018 PMID: 29914357 PMCID: PMC6006589 DOI: 10.1186/s12859-018-2230-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 With this transformation framework, we obtain the distribution of the target sequence on Ref1 and the distribution of Ref1 on Ref2 while processing the compressed data. We can then exploit the similarity between Ref1 and Ref2 to make the transformation faster.
Fig. 2 In (a), T (target sequence) can be encoded with Ref (reference) as (0,6)(7,7,T)(8,12)(13,14,G)(15,21)(22,23,TC)(24,25)(26,26,C)(27,30). In (b), Ref2 (reference 2) can be encoded with Ref1 (reference 1) as (0,6,0,6)(7,7,7,7,A)(8,11,8,11)(12,12,12,12,A)(13,13,13,13)(14,14,13,13,G)(15,21,14,20)(22,24,21,23,GAT)(25,30,24,29).
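The encoding in panel (b) can be illustrated with a small decoder. This is only a sketch under an assumed reading of the tuples, which the caption does not spell out: a four-field entry `(s2, e2, s1, e1)` is taken to mean "copy Ref1[s1..e1] into Ref2[s2..e2]", and a five-field entry is taken to mean "write the literal bases into Ref2[s2..e2]". The `decode` function and the `REF1` string are hypothetical; the real Ref1 sequence is not given in the caption.

```python
# Entries copied from the Fig. 2(b) caption.
# ASSUMED reading (not stated in the source):
#   (s2, e2, s1, e1)       -> Ref2[s2..e2] is a copy of Ref1[s1..e1]
#   (s2, e2, s1, e1, lit)  -> Ref2[s2..e2] is the literal string `lit`
ENCODING_B = [
    (0, 6, 0, 6), (7, 7, 7, 7, "A"), (8, 11, 8, 11), (12, 12, 12, 12, "A"),
    (13, 13, 13, 13), (14, 14, 13, 13, "G"), (15, 21, 14, 20),
    (22, 24, 21, 23, "GAT"), (25, 30, 24, 29),
]


def decode(ref1: str, entries) -> str:
    """Rebuild Ref2 from Ref1 and the entry list (hypothetical helper)."""
    pieces = []
    for entry in entries:
        s2, e2, s1, e1 = entry[:4]
        # A fifth field carries literal bases; otherwise copy from Ref1.
        piece = entry[4] if len(entry) == 5 else ref1[s1:e1 + 1]
        # Each piece must exactly fill the Ref2 interval it describes.
        if len(piece) != e2 - s2 + 1:
            raise ValueError(f"entry {entry} does not fill its interval")
        pieces.append(piece)
    return "".join(pieces)


# Hypothetical 30-base Ref1, used only to exercise the decoder.
REF1 = ("ACGT" * 8)[:30]
ref2 = decode(REF1, ENCODING_B)
print(len(REF1), len(ref2))
```

Under this reading, the entry list covers Ref2 positions 0 to 30 without gaps, so the decoded string is one base longer than the 30-base Ref1.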
Fig. 3 Case (a): in this region, T is identical to Ref1 and Ref1 is identical to Ref2, so T is identical to Ref2. Case (b): T differs from Ref1 but Ref1 is identical to Ref2, so T differs from Ref2. Case (c): T is identical to Ref1 but Ref1 differs from Ref2, so T differs from Ref2. Case (d): T differs from Ref1 and Ref1 differs from Ref2, so the relationship between T and Ref2 cannot be determined.
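The four cases amount to a small decision table. The following sketch encodes it directly; the names `Relation` and `relate_to_ref2` are illustrative and do not come from the paper.

```python
from enum import Enum


class Relation(Enum):
    SAME = "same"
    DIFFERENT = "different"
    UNKNOWN = "unknown"


def relate_to_ref2(t_matches_ref1: bool, ref1_matches_ref2: bool) -> Relation:
    """Relationship between T and Ref2 in one region, per the four cases."""
    if t_matches_ref1 and ref1_matches_ref2:
        return Relation.SAME       # case (a)
    if t_matches_ref1 != ref1_matches_ref2:
        return Relation.DIFFERENT  # cases (b) and (c)
    return Relation.UNKNOWN        # case (d): both differ


for t_ok in (True, False):
    for r_ok in (True, False):
        print(t_ok, r_ok, relate_to_ref2(t_ok, r_ok).value)
```

Only case (d) leaves the relationship open, so only those regions would need to be compared against Ref2 directly.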
Fig. 4 RD denotes the reference with which the original dataset was compressed, and RT denotes the reference we want to transform RD to.
Fig. 5 Index structure of the memory pool
Fig. 6 Flowchart of TGI
Experimental datasets
| dataset | Target sequence (Tar) | Reference 1 (Ref1) | Reference 2 (Ref2) |
|---|---|---|---|
| D1 | YH-1 | KOR131 | KOR224 |
| D2 | YH-1 | KOR224 | KOR131 |
| D3 | KOR131 | YH-1 | KOR224 |
| D4 | KOR131 | KOR224 | YH-1 |
| D5 | KOR224 | YH-1 | KOR131 |
| D6 | KOR224 | KOR131 | YH-1 |
Result of transformation
| dataset | ERGC trans time | ERGC size | TDM trans time | TDM size | TPI trans time | TPI size | TGI trans time | TGI size | Index time |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 965.97 | 8.79 | 71.94 | 9.06 | 83.59 | 8.93 | 14.91 | 8.58 | 113.00 |
| D2 | 989.05 | 8.97 | 71.77 | 9.12 | 119.67 | 9.06 | 14.85 | 8.50 | 113.25 |
| D3 | 761.63 | 5.98 | 72.94 | 9.07 | 143.53 | 8.16 | 15.73 | 13.64 | 112.98 |
| D4 | 847.25 | 13.05 | 72.08 | 13.28 | 84.20 | 12.86 | 8.26 | 8.89 | 99.69 |
| D5 | 769.74 | 4.69 | 72.68 | 8.03 | 119.21 | 6.91 | 16.41 | 13.74 | 113.40 |
| D6 | 824.82 | 11.57 | 72.07 | 12.04 | 129.52 | 11.44 | 8.25 | 9.07 | 102.32 |
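
The speedup of TGI over the ERGC decompress-and-recompress baseline can be read off the table with a little arithmetic. The sketch below copies the ERGC and TGI transformation times and the index times from the rows above. Whether index construction should count toward the cost is an assumption that depends on whether the index is reused, so both variants are computed. The function name `speedup` is illustrative.

```python
# Times copied from the results table (ERGC trans time, TGI trans time, index time).
ERGC_TIME = {"D1": 965.97, "D2": 989.05, "D3": 761.63,
             "D4": 847.25, "D5": 769.74, "D6": 824.82}
TGI_TIME = {"D1": 14.91, "D2": 14.85, "D3": 15.73,
            "D4": 8.26, "D5": 16.41, "D6": 8.25}
INDEX_TIME = {"D1": 113.00, "D2": 113.25, "D3": 112.98,
              "D4": 99.69, "D5": 113.40, "D6": 102.32}


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


# Variant 1: transformation time only (index assumed to be built already).
speedup_trans = {d: speedup(ERGC_TIME[d], TGI_TIME[d]) for d in ERGC_TIME}
# Variant 2: index construction charged to every run (assumption: no reuse).
speedup_total = {d: speedup(ERGC_TIME[d], TGI_TIME[d] + INDEX_TIME[d])
                 for d in ERGC_TIME}

for d in ERGC_TIME:
    print(f"{d}: {speedup_trans[d]:.1f}x (trans only), "
          f"{speedup_total[d]:.1f}x (with index)")
```
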
Fig. 7 Times are in seconds. The transformation time of ERGC is the sum of the time to decompress the data compressed against Ref1 and the time to recompress the decompressed data against Ref2.
Fig. 8 The original size of the dataset is 2986.68 MB, and the compression ratio is expressed as original data size : compressed data size.
Fig. 9 Memory consumption at run time
Fig. 10 Index construction time for different values of k and different chromosomes
Fig. 11 Memory size of the index for different values of k