Pamela Vinitha Eric, Gopakumar Gopalakrishnan, Muralikrishnan Karunakaran.
Abstract
This paper proposes a seed based lossless compression algorithm for DNA sequences that uses a substitution method similar to the Lempel-Ziv compression scheme. The proposed method exploits the repetition structures inherent in DNA sequences by creating an offline dictionary that contains all such repeats along with the details of their mismatches. By ensuring that only promising mismatches are allowed, the method achieves a compression ratio that is on par with or better than existing lossless DNA sequence compression algorithms.
Year: 2016 PMID: 27555868 PMCID: PMC4983397 DOI: 10.1155/2016/3528406
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
Structure of the offline dictionary.
| Extended seed | Type of repeat | Position of repeat | Length | Mismatch details |
|---|---|---|---|---|
| AATAACTTG | Approx | 5 | 9 | |
| | Reverse | 20 | 9 | (1 10 01, 8 10 00) |
| AACTTG | Reverse | 36 | 6 | |
| | Approx | 73 | 7 | (4 01 10) |
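
The dictionary layout in the table above can be pictured as a mapping from each extended seed to its list of repeats. The following is a minimal sketch of that idea, not the authors' implementation: the names `RepeatEntry`, `parse_mismatches`, and `dictionary` are hypothetical, rows with an empty seed cell are assumed to belong to the seed above them, and each mismatch triple is stored verbatim because this excerpt does not define its three fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A mismatch is kept as the raw triple of strings from the table,
# e.g. ("1", "10", "01"). The meaning of the fields is NOT defined in
# this excerpt, so no interpretation is attempted here.
Mismatch = Tuple[str, ...]


@dataclass
class RepeatEntry:
    """One row of the offline dictionary (hypothetical structure)."""
    repeat_type: str                      # "Approx" or "Reverse", as in the table
    position: int                         # "Position of repeat" column
    length: int                           # "Length" column
    mismatches: List[Mismatch] = field(default_factory=list)


def parse_mismatches(text: str) -> List[Mismatch]:
    """Split a cell such as '(1 10 01, 8 10 00)' into raw triples."""
    text = text.strip().strip("()")
    if not text:
        return []
    triples = []
    for part in text.split(","):
        fields = part.split()
        if len(fields) != 3:
            raise ValueError(f"expected three fields, got {part!r}")
        triples.append(tuple(fields))
    return triples


# Assumption: a row with an empty seed cell continues the seed above it.
dictionary: Dict[str, List[RepeatEntry]] = {
    "AATAACTTG": [
        RepeatEntry("Approx", 5, 9),
        RepeatEntry("Reverse", 20, 9, parse_mismatches("(1 10 01, 8 10 00)")),
    ],
    "AACTTG": [
        RepeatEntry("Reverse", 36, 6),
        RepeatEntry("Approx", 73, 7, parse_mismatches("(4 01 10)")),
    ],
}
```

Keying on the seed keeps all repeats of one seed together, which mirrors the grouped rows of the table.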
Figure 1: The process flow of the seed based compression method, followed by an example sequence.
Figure 2: Graph comparing compression ratios against varying k values for different sequences; x-axis: “k” value; y-axis: compression ratio.
Figure 3: Graph comparing compression ratios of HUMDYSTROP against varying k values, with the threshold for the number of mismatches allowed set to log k and log log k; x-axis: “k” value; y-axis: compression ratio.
Comparison of compression ratios of the proposed method against existing methods [2, 8, 12, 18, 19, 21, 22, 24].
| Sequence | Length | CDNA | GeMNL | Bioc | CTW + LZ | GenC | DNAC | DNAP | XM | Proposed seed based method |
|---|---|---|---|---|---|---|---|---|---|---|
| HUMDYSTROP | 38,770 | 1.93 | 1.9085 | 1.9262 | 1.9175 | 1.9231 | 1.9116 | 1.9088 | 1.9031 | 1.8624 |
| HUMGHCSA | 66,496 | 0.95 | 1.0089 | 1.3072 | 1.0972 | 1.0969 | 1.0272 | 1.639 | 0.9828 | 1.0156 |
| HUMHBB | 73,308 | 1.77 | — | 1.8800 | 1.8082 | 1.8204 | 1.7897 | 1.7771 | 1.7513 | 1.7364 |
| HUMHDABCD | 58,863 | 1.67 | 1.7059 | 1.8770 | 1.8218 | 1.8192 | 1.7951 | 1.7394 | 1.6671 | 1.6237 |
| HUMHPRTB | 56,832 | 1.72 | 1.7639 | 1.9066 | 1.8433 | 1.8466 | 1.8165 | 1.7886 | 1.7361 | 1.688 |
| MPOMTCG | 186,609 | 1.87 | 1.8822 | 1.9378 | 1.9000 | 1.9058 | 1.8920 | 1.8932 | 1.8768 | 1.763 |
| VACCG | 191,735 | 1.81 | 1.7644 | 1.7614 | 1.7616 | 1.7614 | 1.7580 | 1.7583 | 1.6749 | 1.6434 |
Time taken for execution.
| Sequence | Length | DNACompress (sec) | GenCompress (sec) | Time taken by seed based method (sec) |
|---|---|---|---|---|
| HUMDYSTROP | 38,770 | 0.125 | 45 | 1.5 |
| HUMGHCSA | 66,496 | 0.094 | 874 | 2.5 |
| HUMHBB | 73,308 | 0.125 | NA | 2.8 |
| HUMHDABCD | 58,863 | 0.125 | 104 | 2.2 |
| HUMHPRTB | 56,832 | 0.124 | 90 | 2 |
| MPOMTCG | 186,609 | 0.124 | 781 | 3.5 |
| VACCG | 191,735 | 0.219 | 1239 | 4 |
Algorithm 1: Decompression algorithm.