Toward a Better Compression for DNA Sequences Using Huffman Encoding.

Anas Al-Okaily, Badar Almarri, Sultan Al Yami, Chun-Hsi Huang.

Abstract

Due to the large volume of DNA data generated by next-generation sequencing machines for genomes with lengths ranging from megabases to gigabases, there is an increasing need to compress such data so that they occupy less space and can be transmitted faster. Different implementations of Huffman encoding that incorporate the characteristics of DNA sequences are shown to compress DNA data better. These implementations center on selecting frequent repeats so as to force a skewed Huffman tree, and on constructing multiple Huffman trees when encoding. The implementations demonstrate improved compression ratios for five genomes with lengths ranging from 5 to 50 Mbp, compared with the standard Huffman tree algorithm. The research hence suggests an improvement for all DNA sequence compression algorithms that use conventional Huffman encoding. Accompanying software is publicly available (Al-Okaily, 2016).
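
As a rough illustration of the baseline that the abstract compares against, the following sketch builds a conventional Huffman code over fixed-length k-mers of a DNA string and compares the encoded size with the 2 bits per base of a plain fixed-width encoding. The function names, the choice of k = 2, and the toy sequence are assumptions made for this example. It is not the authors' software, and it does not implement their repeat selection or multiple-tree construction.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bitstring} for a symbol -> frequency mapping."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freqs)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def split_kmers(seq, k):
    """Cut seq into consecutive k-mers; the last one may be shorter."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def encode(seq, k=2):
    """Huffman-encode seq over its k-mers; return (bitstring, code table)."""
    symbols = split_kmers(seq, k)
    code = huffman_code(Counter(symbols))
    return "".join(code[s] for s in symbols), code


def decode(bits, code):
    """Invert encode() by walking the prefix-free code table."""
    inverse = {c: s for s, c in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)


if __name__ == "__main__":
    # Toy sequence dominated by one repeat, so the tree is heavily skewed.
    seq = "A" * 40 + "CGT" + "A" * 20
    bits, code = encode(seq, k=2)
    print(len(bits), "bits with Huffman vs", 2 * len(seq), "bits at 2 bits/base")
```

On a sequence dominated by a single repeat, the frequent k-mer receives a very short codeword, which is the skew effect the abstract refers to. A real comparison would also have to count the cost of storing the code table.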

Keywords:  DNA sequences compression; Huffman encoding; compression algorithm

Year:  2016        PMID: 27960065      PMCID: PMC5372760          DOI: 10.1089/cmb.2016.0151

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  4 in total

1.  Significantly lower entropy estimates for natural DNA sequences.

Authors:  D Loewenstern; P N Yianilos
Journal:  J Comput Biol       Date:  1999       Impact factor: 1.479

2.  PatternHunter: faster and more sensitive homology search.

Authors:  Bin Ma; John Tromp; Ming Li
Journal:  Bioinformatics       Date:  2002-03       Impact factor: 6.937

3.  DNACompress: fast and effective DNA sequence compression.

Authors:  Xin Chen; Ming Li; Bin Ma; John Tromp
Journal:  Bioinformatics       Date:  2002-12       Impact factor: 6.937

4.  G-SQZ: compact encoding of genomic sequence and quality data.

Authors:  Waibhav Tembe; James Lowey; Edward Suh
Journal:  Bioinformatics       Date:  2010-07-06       Impact factor: 6.937

  4 in total

1.  Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format. (Review)

Authors:  Kirill Kryukov; Lihua Jin; So Nakagawa
Journal:  Patterns (N Y)       Date:  2022-07-07

2.  Sequence Compression Benchmark (SCB) database: a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences.

Authors:  Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi
Journal:  Gigascience       Date:  2020-07-01       Impact factor: 6.524

3.  Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences.

Authors:  Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi
Journal:  Bioinformatics       Date:  2019-10-01       Impact factor: 6.937

4.  A self-contained and self-explanatory DNA storage system.

Authors:  Min Li; Jiashu Wu; Junbiao Dai; Qingshan Jiang; Qiang Qu; Xiaoluo Huang; Yang Wang
Journal:  Sci Rep       Date:  2021-09-10       Impact factor: 4.379
