
Iterative dictionary construction for compression of large DNA data sets.

Shanika Kuruppu, Bryan Beresford-Smith, Thomas Conway, Justin Zobel.

Abstract

Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms, and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA data sets. Our adaptation, COMRAD, of an existing disk-based method identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences. COMRAD compresses the data over multiple passes, which is an expensive process, but allows COMRAD to compress large data sets within reasonable time and space. COMRAD allows for random access to individual sequences and subsequences without decompressing the whole data set. COMRAD has no competitor in terms of the size of data sets that it can compress (extending to many hundreds of gigabytes) and, even for smaller data sets, the results are competitive compared to alternatives; as an example, 39 S. cerevisiae genomes compressed to 0.25 bits per base.
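The abstract does not spell out how the dictionary is built, so the following is only a minimal sketch of the general idea of iterative, multi-pass dictionary construction over a collection of sequences. It assumes a simple in-memory pair-replacement scheme (in the style of Re-Pair); it is not COMRAD's actual disk-based algorithm, and the names `build_dictionary`, `replace_pair`, `expand`, and `decode` are hypothetical.

```python
from collections import Counter


def replace_pair(seq, pair, sym):
    """Replace non-overlapping occurrences of `pair` in `seq` with `sym`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def build_dictionary(sequences, min_count=2):
    """Illustrative multi-pass dictionary construction (assumed scheme, not COMRAD itself).

    Each pass counts adjacent symbol pairs across the whole collection, so repeats
    are found regardless of which sequence they occur in, then replaces the most
    frequent pair with a new dictionary symbol.
    """
    # Nucleotides stay as str; dictionary symbols are ints, so the two never collide.
    seqs = [list(s) for s in sequences]
    rules = {}  # dictionary symbol -> (left symbol, right symbol)
    while True:
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_count:
            break
        sym = len(rules)
        rules[sym] = pair
        seqs = [replace_pair(seq, pair, sym) for seq in seqs]
    return seqs, rules


def expand(symbol, rules):
    """Recursively expand one symbol back into nucleotides."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def decode(seq, rules):
    """Decode a single compressed sequence using only the shared dictionary."""
    return "".join(expand(s, rules) for s in seq)


if __name__ == "__main__":
    sequences = ["ACGTACGTAC", "TTACGTACGG"]
    compressed, rules = build_dictionary(sequences)
    # Decode only the second sequence; the first is never touched.
    print(decode(compressed[1], rules))
```

Because every sequence is encoded against one shared dictionary, `decode(compressed[1], rules)` recovers the second sequence without decoding the first, which mirrors the random-access property the abstract describes. A real system at the scale mentioned (hundreds of gigabytes) would need disk-based counting rather than an in-memory `Counter`.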

Year: 2011   PMID: 21576758   DOI: 10.1109/TCBB.2011.82

Source DB: PubMed   Journal: IEEE/ACM Trans Comput Biol Bioinform   ISSN: 1545-5963   Impact factor: 3.710


  10 in total

1.  iDoComp: a compression scheme for assembled genomes.

Authors:  Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2014-10-24       Impact factor: 6.937

2.  A Hybrid Data-Differencing and Compression Algorithm for the Automotive Industry.

Authors:  Sabin Belu; Daniela Coltuc
Journal:  Entropy (Basel)       Date:  2022-04-19       Impact factor: 2.738

3.  Adaptive efficient compression of genomes.

Authors:  Sebastian Wandelt; Ulf Leser
Journal:  Algorithms Mol Biol       Date:  2012-11-12       Impact factor: 1.405

4.  Searching and Indexing Genomic Databases via Kernelization. (Review)

Authors:  Travis Gagie; Simon J Puglisi
Journal:  Front Bioeng Biotechnol       Date:  2015-02-09

5.  Compression of Large genomic datasets using COMRAD on Parallel Computing Platform.

Authors:  Christopher Leela Biji; Manu K Madhu; Vineetha Vishnu; Satheesh Kumar K; Achuthsankar S Nair
Journal:  Bioinformation       Date:  2015-05-28

6.  Relative Suffix Trees.

Authors:  Andrea Farruggia; Travis Gagie; Gonzalo Navarro; Simon J Puglisi; Jouni Sirén
Journal:  Comput J       Date:  2017-11-21       Impact factor: 1.494

7.  Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.

Authors:  Kelvin V Kredens; Juliano V Martins; Osmar B Dordal; Mauri Ferrandin; Roberto H Herai; Edson E Scalabrin; Bráulio C Ávila
Journal:  PLoS One       Date:  2020-05-26       Impact factor: 3.240

8.  A compression method for DNA.

Authors:  Shengwang Du; Junyi Li; Naizheng Bian
Journal:  PLoS One       Date:  2020-11-25       Impact factor: 3.240

9.  MBGC: Multiple Bacteria Genome Compressor.

Authors:  Szymon Grabowski; Tomasz M Kowalski
Journal:  Gigascience       Date:  2022-01-27       Impact factor: 6.524

10.  CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments.

Authors:  Md Ashiqur Rahman; Abdullah Aman Tutul; Sifat Muhammad Abdullah; Md Shamsuzzoha Bayzid
Journal:  PLoS One       Date:  2022-04-18       Impact factor: 3.752
