Literature DB >> 33693556

Compact and evenly distributed k-mer binning for genomic sequences.

Johan Nyström-Persson1,2, Gabriel Keeble-Gagnère3, Niamat Zawad2.   

Abstract

MOTIVATION: The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers-ordered m-mers where m < k - are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored.
RESULTS: We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequencysampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available. AVAILABILITY: Discount is GPL licensed and available at https://github.com/jtnystrom/discount. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2021. Published by Oxford University Press.

Entities:  

Year:  2021        PMID: 33693556     DOI: 10.1093/bioinformatics/btab156

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  1 in total

1.  Sequence-specific minimizers via polar sets.

Authors:  Hongyu Zheng; Carl Kingsford; Guillaume Marçais
Journal:  Bioinformatics       Date:  2021-07-12       Impact factor: 6.937

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.