
Nongreedy Unbalanced Huffman Tree Compressor for Single and Multifasta Files.

Sultan Alyami, Chun-Hsi Huang.

Abstract

Next-generation sequencing technologies are producing genomic data at ever-increasing rates. Storing, transmitting, and processing this volume of data has become a challenge, creating a need for tools that compress genomic data losslessly, thereby reducing storage space and speeding up data transmission. Data centers typically adopt one of two general-purpose compressors for genomic data: gzip or bzip2. Both use Huffman encoding, although they implement it in different ways. However, neither takes advantage of properties of DNA data, such as its small alphabet and many repeats. Huffman encoding compression can be improved by exploiting these DNA characteristics. Recently, it has been shown that Huffman encoding compression can be improved by creating an unbalanced Huffman tree (UHT), which achieves significantly better compression than the standard Huffman tree used in both gzip and bzip2. However, the UHT construction is greedy. This article proposes an improved nongreedy UHT (NUHT), a lossless, nonreference-based compressor for fasta and multifasta files. We compare our algorithm with two well-known general-purpose compressors, gzip and bzip2, as well as with UHT, a DNA-specific compressor based on a Huffman tree. Our algorithm outperforms all three in terms of compression ratio and is seven times faster than UHT.
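To illustrate why a small alphabet with a skewed symbol distribution benefits from Huffman encoding, the following is a minimal sketch of standard greedy Huffman code construction. It is not the UHT or NUHT algorithm described in the article, whose details are not given in this abstract; the function names `huffman_code` and `encoded_bits` and the sample sequence are illustrative assumptions.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {symbol: bitstring} for a standard (greedy) Huffman code.

    Illustrative sketch only; this is NOT the UHT/NUHT method of the article.
    """
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    # The unique tie-breaker ensures the dicts are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Greedy step: merge the two lightest subtrees.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(text, code):
    """Total number of bits needed to encode text with the given code."""
    return sum(len(code[ch]) for ch in text)
```

For the hypothetical sequence `"AAAAAAAACCCCGGT"`, this assigns code lengths of 1, 2, 3, and 3 bits to A, C, G, and T, for a total of 25 bits, compared with 30 bits for a fixed 2-bit-per-base encoding.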

Keywords:  DNA sequence; Huffman encoding; data compression; tree

Year:  2019        PMID: 31553226     DOI: 10.1089/cmb.2019.0249

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  3 in total

1.  Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format. (Review)

Authors:  Kirill Kryukov; Lihua Jin; So Nakagawa
Journal:  Patterns (N Y)       Date:  2022-07-07

2.  Sequence Compression Benchmark (SCB) database: A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences.

Authors:  Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi
Journal:  Gigascience       Date:  2020-07-01       Impact factor: 6.524

3.  Values of Contrast-Enhanced Ultrasound in Classification and Diagnosis of Common Bile Duct and Superficial Organ Lesions under Compression Algorithm.

Authors:  Yezhao Li; Caihong Zhao; Minpei Qin; Xia Zhang; Haizhen Liao; Haiqing Su
Journal:  J Healthc Eng       Date:  2021-09-29       Impact factor: 2.682

