Wenrui Dai, Hongkai Xiong, Xiaoqian Jiang, Lucila Ohno-Machado.
Abstract
Previous reference-based compression methods for DNA sequences do not fully exploit the intrinsic statistics, because they consider only approximate matches. This paper proposes an adaptive difference distribution-based coding framework that operates on fragments of nucleotides organized in a hierarchical tree structure. To keep the distribution of the difference sequence between the reference and target sequences concentrated, the sub-fragment size and the matching offset used for prediction adapt to the stepped size structure. Matching against approximate repeats in the reference uses a Hamming-like weighted distance measure within a local region close to the current fragment, which balances matching accuracy against the overhead of describing the matching offset. A well-designed coding scheme compactly represents both the difference sequence and the additional parameters, e.g. sub-fragment size and matching offset. Experimental results show that the proposed scheme achieves a 150% compression improvement over GReEn, the best reference-based compressor.
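The local matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `weighted_hamming` and `best_local_match`, the uniform default weights, the `radius` search window, and the linear `offset_penalty` term are all assumptions introduced here to show how a weighted mismatch count might be traded off against the cost of describing an offset.

```python
def weighted_hamming(a, b, weights=None):
    """Hamming-like distance: sum of the weights at positions where a and b differ.

    The weighting scheme is an assumption; uniform weights give plain Hamming distance.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def best_local_match(reference, fragment, position, radius, offset_penalty=0.0):
    """Search offsets in [-radius, radius] around `position` in the reference.

    Returns (offset, cost) for the lowest-cost candidate, or None if no
    candidate window fits inside the reference. The linear offset penalty is a
    hypothetical stand-in for the overhead of describing the matching offset.
    """
    n = len(fragment)
    best = None
    for offset in range(-radius, radius + 1):
        start = position + offset
        if start < 0 or start + n > len(reference):
            continue  # candidate window falls outside the reference
        candidate = reference[start:start + n]
        cost = weighted_hamming(candidate, fragment) + offset_penalty * abs(offset)
        if best is None or cost < best[1]:
            best = (offset, cost)
    return best


reference = "ACGTACGTTTGACC"
print(best_local_match(reference, "TTGA", position=6, radius=3))
print(best_local_match(reference, "TTGA", position=6, radius=3, offset_penalty=2.0))
```

With no penalty the search picks the exact repeat two positions to the right; with a penalty of 2.0 per position of offset it prefers the unshifted window with three mismatches, which is the kind of accuracy-versus-overhead balance the abstract refers to.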
Year: 2013 PMID: 26501129 PMCID: PMC4617277 DOI: 10.1109/DCC.2013.45
Source DB: PubMed Journal: Proc Data Compress Conf ISSN: 2375-0383