
Space-efficient representation of genomic k-mer count tables.

Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov.

Abstract

MOTIVATION: k-mer counting is a common task in bioinformatics pipelines, with many dedicated tools available. Many of these tools output k-mer count tables containing both k-mers and counts, which easily reach tens of GB. Furthermore, such tables generally do not support efficient random-access queries.
RESULTS: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), whose space is proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF, we obtain a Bloom-enhanced CSF (BCSF) that effectively overcomes this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations that break the empirical entropy lower bound for large enough k. We also extend these representations to the approximate case, saving additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy for large enough k.
© 2022. The Author(s).
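
As a rough illustration of the two ideas named in the abstract, the Python sketch below computes the empirical zero-order entropy of a list of counts and builds a toy Bloom-enhanced lookup table. The abstract does not say how the Bloom filter and the CSF are combined, so the sketch makes an assumption: the filter holds the k-mers whose count differs from the most frequent count, and only keys that pass the filter are stored explicitly. A plain dict stands in for the CSF, and all names (zero_order_entropy, BloomFilter, BloomEnhancedTable) are hypothetical, not taken from the authors' software.

```python
import hashlib
import math
from collections import Counter


def zero_order_entropy(counts):
    """Empirical zero-order entropy, in bits per key, of a sequence of counts."""
    freq = Counter(counts)
    n = len(counts)
    return -sum((c / n) * math.log2(c / n) for c in freq.values())


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # One salted BLAKE2b hash per hash function.
        for i in range(self.num_hashes):
            salt = i.to_bytes(8, "little")
            digest = hashlib.blake2b(key.encode(), salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class BloomEnhancedTable:
    """Toy Bloom-enhanced static function; a dict replaces the real CSF."""

    def __init__(self, table, num_bits, num_hashes=3):
        # Assumption: the most frequent count is left implicit.
        self.dominant = Counter(table.values()).most_common(1)[0][0]
        self.bloom = BloomFilter(num_bits, num_hashes)
        for kmer, count in table.items():
            if count != self.dominant:
                self.bloom.add(kmer)
        # Every key passing the filter is stored, false positives included,
        # so lookups stay exact for all keys of the original table.
        self.stored = {k: c for k, c in table.items() if k in self.bloom}

    def query(self, kmer):
        # Only defined for k-mers present in the original table,
        # as is usual for a static function.
        if kmer not in self.bloom:
            return self.dominant
        return self.stored[kmer]


if __name__ == "__main__":
    table = {"ACGT": 1, "CGTA": 1, "GTAC": 1, "TACG": 7, "ACGA": 1, "CGAT": 2}
    bcsf = BloomEnhancedTable(table, num_bits=64)
    print("entropy (bits/key):", zero_order_entropy(list(table.values())))
    print("explicitly stored keys:", len(bcsf.stored), "of", len(table))
    print("all counts recovered:",
          all(bcsf.query(k) == c for k, c in table.items()))
```

With a skewed distribution, most keys carry the dominant count and are rejected by the filter, so under this assumption only a small fraction of keys needs an explicit entry. Minimizer-based bucketing and the approximate variants mentioned in the abstract are not covered by this sketch.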


Keywords:  Bloom filter; Compressed static function; Compression; Counts; k-mers

Year:  2022        PMID: 35317833      PMCID: PMC8939220          DOI: 10.1186/s13015-022-00212-0

Source DB:  PubMed          Journal:  Algorithms Mol Biol        ISSN: 1748-7188            Impact factor:   1.405


