Representation of k-Mer Sets Using Spectrum-Preserving String Sets.

Abstract

Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
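To make "spectrum-preserving" concrete, here is a minimal Python sketch. It is not the authors' UST implementation. It stitches k-mers directly rather than unitigs, and it ignores reverse complements, so it does not model the bidirected graph named in the keywords. The function names `spectrum`, `greedy_stitch`, and `total_chars` are hypothetical, and measuring size as total character count is an assumption made for this illustration.

```python
def spectrum(strings, k):
    """Return the set of all k-mers that occur in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def greedy_stitch(kmers, k):
    """Greedily glue k-mers that overlap by k-1 characters into longer strings.

    Illustrative only: forward strand, no reverse complements.
    Every k-mer is consumed exactly once, so the output has the same
    spectrum as the input and no k-mer is repeated.
    """
    unused = set(kmers)
    out = []
    for seed in sorted(kmers):  # sorted only to make the output reproducible
        if seed not in unused:
            continue
        unused.remove(seed)
        s = seed
        # Extend to the right while some unused k-mer overlaps the suffix.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[len(s) - k + 1:] + c
                if nxt in unused:
                    unused.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left while some unused k-mer overlaps the prefix.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in unused:
                    unused.remove(prv)
                    s = c + s
                    extended = True
                    break
        out.append(s)
    return out


def total_chars(strings):
    """Size of a string set, taken here as its total number of characters."""
    return sum(len(s) for s in strings)
```

For example, the six 3-mers of `ACGTTGCA` cost 18 characters when stored one by one, while `greedy_stitch(spectrum(["ACGTTGCA"], 3), 3)` stores the same set in a single 8-character string. A greedy pass like this gives no optimality guarantee; the lower bound and the near-optimality result mentioned in the abstract belong to the paper's own method.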

Keywords: bidirected graph; k-mer compression; k-mer index; k-mer set; path cover; unitigs

Year: 2020 PMID: 33290137 PMCID: PMC8066325 DOI: 10.1089/cmb.2020.0431

Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479
