Amatur Rahman, Paul Medvedev.
Abstract
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers using an order of magnitude less disk space than other lossless compression tools. The second is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
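To make the idea of a spectrum-preserving string set concrete, the following is a minimal Python sketch, not the authors' UST implementation. It assumes a forward-strand-only setting (no reverse complements, unlike the bidirected graph the paper works with), operates directly on k-mers rather than on unitigs of a compacted de Bruijn graph, and requires k >= 2. The function names `spectrum` and `greedy_stitch` are hypothetical and chosen for illustration only. It greedily glues strings that overlap by k-1 characters and checks that the set of k-mers is unchanged.

```python
def spectrum(strings, k):
    """Return the set of all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def greedy_stitch(kmers, k):
    """Greedily cover the k-mers with paths and spell each path as one string.

    Simplified illustration: forward strand only, k >= 2, and no claim of
    optimality (the starting points are picked in sorted order).
    """
    # Index k-mers by their (k-1)-character prefix so successors are easy to find.
    by_prefix = {}
    for kmer in sorted(kmers):
        by_prefix.setdefault(kmer[:k - 1], []).append(kmer)

    used = set()
    result = []
    for start in sorted(kmers):
        if start in used:
            continue
        used.add(start)
        path = start
        while True:
            # Look for an unused k-mer whose prefix equals the current suffix.
            candidates = by_prefix.get(path[-(k - 1):], [])
            nxt = next((c for c in candidates if c not in used), None)
            if nxt is None:
                break
            used.add(nxt)
            path += nxt[k - 1:]  # append only the non-overlapping tail
        result.append(path)
    return result


if __name__ == "__main__":
    k = 3
    kmers = {"ACG", "CGT", "GTT", "GGA"}
    stitched = greedy_stitch(kmers, k)
    print(stitched)
    print(spectrum(stitched, k) == kmers)
```

Because each merge overlaps exactly k-1 characters, no window of length k in a stitched string spans anything other than an original k-mer, so the spectrum is preserved while the total number of characters shrinks. A greedy choice of starting points can still produce more strings than necessary, which is why the paper compares its method against a lower bound.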
Keywords: bidirected graph; k-mer compression; k-mer index; k-mer set; path cover; unitigs
Year: 2020; PMID: 33290137; PMCID: PMC8066325; DOI: 10.1089/cmb.2020.0431
Source DB: PubMed; Journal: J Comput Biol; ISSN: 1066-5277; Impact factor: 1.479