Anas Al-Okaily, Badar Almarri, Sultan Al Yami, Chun-Hsi Huang.
Abstract
Due to the significant amount of DNA data being generated by next-generation sequencing machines for genomes of lengths ranging from megabases to gigabases, there is an increasing need to compress such data so that they occupy less space and can be transmitted faster. Different implementations of Huffman encoding that incorporate the characteristics of DNA sequences are shown to compress DNA data better. These implementations center on selecting frequent repeats so as to force a skewed Huffman tree, and on constructing multiple Huffman trees when encoding. The implementations demonstrate improved compression ratios for five genomes with lengths ranging from 5 to 50 Mbp, compared with the standard Huffman tree algorithm. The research hence suggests an improvement on all DNA sequence compression algorithms that use conventional Huffman encoding. Accompanying software is publicly available (AL-Okaily, 2016).
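To make the role of a skewed Huffman tree concrete, the following is a minimal Python sketch of conventional Huffman coding applied to a DNA string split into fixed-length tokens. It is not the authors' implementation: the function names `huffman_code_lengths` and `encoded_bits`, the non-overlapping k-mer tokenization, and the toy sequences are all assumptions made for illustration, and the cost of storing the code table is ignored.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table (hypothetical helper)."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encoded_bits(seq, k):
    """Bits needed to Huffman-code seq as non-overlapping k-mers (code table not counted)."""
    tokens = [seq[i:i + k] for i in range(0, len(seq), k)]
    freqs = Counter(tokens)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[t] * lengths[t] for t in freqs)


if __name__ == "__main__":
    uniform = "ACGT" * 4      # all four bases equally frequent
    skewed = "AAAAAAAACGT"    # one base dominates
    print(encoded_bits(uniform, 1), 2 * len(uniform))  # Huffman vs. fixed 2 bits/base
    print(encoded_bits(skewed, 1), 2 * len(skewed))
    print(encoded_bits(uniform, 4))  # the repeat "ACGT" treated as one symbol
```

In this toy setting, a uniform base distribution gives Huffman coding no advantage over a fixed 2-bit code, while a skewed distribution, or treating a frequent repeat as a single symbol, shortens the encoding.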
Keywords: DNA sequences compression; Huffman encoding; compression algorithm
Year: 2016 PMID: 27960065 PMCID: PMC5372760 DOI: 10.1089/cmb.2016.0151
Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479