Representation of k-Mer Sets Using Spectrum-Preserving String Sets.

Amatur Rahman, Paul Medvedev.

Abstract

Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
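To make the SPSS idea concrete, the following is a minimal illustrative sketch and not the authors' UST implementation. It assumes a simplified definition in which a string set is spectrum-preserving if it contains exactly the input k-mers, each exactly once, and it ignores reverse complements (the paper works with bidirected graphs, which this toy version does not model). The function names `spectrum`, `is_spss`, `total_chars`, and `greedy_stitch` are hypothetical and chosen only for illustration; `greedy_stitch` is a naive greedy path cover over k-mers rather than over unitigs.

```python
from collections import Counter


def spectrum(strings, k):
    """Count every k-mer occurrence across a collection of strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def is_spss(strings, kmers, k):
    """Check the simplified SPSS condition (forward strand only):
    the strings contain exactly the given k-mers, each exactly once."""
    counts = spectrum(strings, k)
    return set(counts) == set(kmers) and all(c == 1 for c in counts.values())


def total_chars(strings):
    """Size of a representation, measured as the total number of characters."""
    return sum(len(s) for s in strings)


def greedy_stitch(kmers, k, alphabet="ACGT"):
    """Toy greedy path cover (assumes k >= 2): start from an unused k-mer and
    extend right, then left, while an unused k-mer overlaps by k-1 characters."""
    unused = set(kmers)
    result = []
    for start in sorted(kmers):
        if start not in unused:
            continue
        unused.remove(start)
        path = start
        # Extend to the right with any unused successor k-mer.
        while True:
            suffix = path[-(k - 1):]
            nxt = next((suffix + c for c in alphabet if suffix + c in unused), None)
            if nxt is None:
                break
            unused.remove(nxt)
            path += nxt[-1]
        # Extend to the left with any unused predecessor k-mer.
        while True:
            prefix = path[:k - 1]
            prv = next((c + prefix for c in alphabet if c + prefix in unused), None)
            if prv is None:
                break
            unused.remove(prv)
            path = prv[0] + path
        result.append(path)
    return result


if __name__ == "__main__":
    kmers = ["ACG", "CGT", "CGA", "GTT"]
    stitched = greedy_stitch(kmers, 3)
    print(stitched, is_spss(stitched, kmers, 3))
    print(total_chars(kmers), "->", total_chars(stitched))
```

Each stitched string of length n carries n - k + 1 k-mers, so using fewer, longer strings lowers the total character count. This is the intuition behind relating a smallest SPSS to a smallest path cover.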

Keywords:  bidirected graph; k-mer compression; k-mer index; k-mer set; path cover; unitigs

Year: 2020; PMID: 33290137; PMCID: PMC8066325; DOI: 10.1089/cmb.2020.0431

Source DB: PubMed; Journal: J Comput Biol; ISSN: 1066-5277; Impact factor: 1.479


  6 in total

1.  Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs.

Authors:  Amatur Rahman; Paul Medvedev
Journal:  Genome Res       Date:  2022-07-27       Impact factor: 9.438

2.  Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2.

Authors:  Jamshed Khan; Marek Kokot; Sebastian Deorowicz; Rob Patro
Journal:  Genome Biol       Date:  2022-09-08       Impact factor: 17.906

3.  SPRISS: Approximating Frequent K-mers by Sampling Reads, and Applications.

Authors:  Diego Santoro; Leonardo Pellegrina; Matteo Comin; Fabio Vandin
Journal:  Bioinformatics       Date:  2022-05-18       Impact factor: 6.931

4.  Simplitigs as an efficient and scalable representation of de Bruijn graphs.

Authors:  Michael Baym; Gregory Kucherov; Karel Břinda
Journal:  Genome Biol       Date:  2021-04-06       Impact factor: 13.583

5.  Sparse and skew hashing of K-mers.

Authors:  Giulio Ermanno Pibiri
Journal:  Bioinformatics       Date:  2022-06-24       Impact factor: 6.931

6.  The K-mer File Format: a standardized and compact disk representation of sets of k-mers.

Authors:  Yoann Dufresne; Teo Lemane; Pierre Marijon; Pierre Peterlongo; Amatur Rahman; Marek Kokot; Paul Medvedev; Sebastian Deorowicz; Rayan Chikhi
Journal:  Bioinformatics       Date:  2022-07-29       Impact factor: 6.931
