
Biological sequence compression algorithms.

T Matsumoto, K Sadakane, H Imai.

Abstract

A growing number of DNA sequences are becoming available, and the information about them is stored in molecular biology databases. Because these databases will continue to grow in size and importance, this information must be stored and communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences; they only expand them in size. On the other hand, CTW (the Context Tree Weighting method) can compress DNA sequences to less than two bits per symbol. None of these algorithms exploits the special structures of biological sequences. Two characteristic structures of DNA sequences are known: palindromes (reverse complements) and approximate repeats. Several algorithms designed specifically for DNA sequences use these structures and can compress them to less than two bits per symbol. In this paper, we improve CTW so that it can exploit these characteristic structures. Before encoding the next symbol, the algorithm searches for an approximate repeat and a palindrome using hashing and dynamic programming. If a palindrome or an approximate repeat of sufficient length is found, the algorithm represents it by its length and distance. With this preprocessing, the new program achieves a slightly higher compression ratio than existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
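The preprocessing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several simplifying assumptions: it finds exact matches only (the paper searches for approximate repeats using dynamic programming), it omits the CTW coder entirely, and it assumes the input contains only the symbols A, C, G and T. The function names, the seed length `k`, the threshold `min_len`, and the token layout are all hypothetical choices made for illustration. A hash table of k-mers seen so far provides candidate positions; each candidate is extended either forwards (a repeat) or backwards through the complement (a palindrome), and a sufficiently long match is emitted as a length and a distance.

```python
from collections import defaultdict

COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse the string and complement every base."""
    return "".join(COMP[b] for b in reversed(seq))

def longest_match(seq, pos, index, k):
    """Longest exact repeat or reverse complement starting at pos, or None."""
    n = len(seq)
    seed = seq[pos:pos + k]
    if len(seed) < k:
        return None
    best = None
    # Direct repeat: an earlier copy of the seed, extended forwards.
    for p in index.get(seed, ()):
        length = k
        while pos + length < n and seq[p + length] == seq[pos + length]:
            length += 1
        if best is None or length > best[1]:
            best = ("repeat", length, pos - p)
    # Palindrome: an earlier copy of the seed's reverse complement,
    # read backwards from its last base.
    for q in index.get(reverse_complement(seed), ()):
        end = q + k - 1  # history position paired with seq[pos]
        length = k
        while (pos + length < n and end - length >= 0
               and seq[pos + length] == COMP[seq[end - length]]):
            length += 1
        if best is None or length > best[1]:
            best = ("revcomp", length, pos - end)
    return best

def parse(seq, k=4, min_len=8):
    """Greedy parse into literals and (kind, length, distance) tokens."""
    tokens, index = [], defaultdict(list)
    indexed, pos = 0, 0
    while pos < len(seq):
        # Hash every k-mer that lies entirely before pos.
        while indexed + k <= pos:
            index[seq[indexed:indexed + k]].append(indexed)
            indexed += 1
        match = longest_match(seq, pos, index, k)
        if match and match[1] >= min_len:
            tokens.append(match)
            pos += match[1]
        else:
            tokens.append(("lit", seq[pos]))
            pos += 1
    return tokens

def decode(tokens):
    """Invert parse(): rebuild the sequence from the token list."""
    out = []
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
            continue
        kind, length, dist = tok
        anchor = len(out) - dist
        for i in range(length):
            if kind == "repeat":
                out.append(out[anchor + i])
            else:
                out.append(COMP[out[anchor - i]])
    return "".join(out)

s = "ACGGTCATTGCA"
seq = s + "TT" + s + "GG" + reverse_complement(s)
tokens = parse(seq)
print(tokens)
print(decode(tokens) == seq)
```

In a complete compressor along the lines the abstract describes, the literal symbols would presumably be passed to the statistical coder (CTW in the paper), and the length and distance fields would need an encoding of their own; both parts are outside the scope of this sketch.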


Year:  2000        PMID: 11700586

Source DB:  PubMed          Journal:  Genome Inform Ser Workshop Genome Inform

