The K-mer File Format: a standardized and compact disk representation of sets of k-mers.

Yoann Dufresne, Teo Lemane, Pierre Marijon, Pierre Peterlongo, Amatur Rahman, Marek Kokot, Paul Medvedev, Sebastian Deorowicz, Rayan Chikhi.

Abstract

SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here we introduce the K-mer File Format (KFF) as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5x compared to other formats, and bringing interoperability across tools. AVAILABILITY: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/.
© The Author(s) 2022. Published by Oxford University Press.

Year:  2022        PMID: 35904548      PMCID: PMC9477520          DOI: 10.1093/bioinformatics/btac528

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931


1 Introduction

Sets of k-mers are widely used in DNA sequence analysis, for instance in genome assembly [e.g. SPAdes (Bankevich )], indexes of sequence aligners [e.g. minimap2 (Li, 2018)] and large-scale sequence search tools (Marchet ). Often, bioinformatics tools are k-mer consumers, i.e. they take as input a k-mer set given by one of the k-mer producers, typically k-mer counters [e.g. KMC (Deorowicz ), DSK (Rizk )]. Producers use ad hoc binary formats for storing k-mers on disk. This leads to inefficient development practices, as consumers need to write a specific parser for each producer format. Standard file formats greatly facilitate interoperability, as shown by the SAM/BAM formats (Cock ) for sequence alignment and HDF5 (Folk ) for general structured data. We propose the K-mer File Format (KFF), an interoperable and efficient approach to storing k-mer sets. We provide APIs in C++ and Rust, as well as file manipulation and conversion tools to facilitate inspection and integration into other tools. KFF has already been integrated in several tools: the KMC and DSK k-mer counters, the ESS-Compress (Rahman ) compression tool and kmtricks (Lemane ) for k-mer matrix construction. We present the rationale of our approach, the KFF 1.0 file format, and demonstrate the efficiency of KFF for storing k-mers from sequencing data.

2 Approach

Tools producing k-mer sets essentially use similar storage techniques. In Jellyfish (Marçais and Kingsford, 2011) and DSK, a k-mer is encoded in 2 bits per nucleotide and the entire set is stored as a succession of k-mers and associated data (e.g. abundances). In KMC, a more advanced format is used to reduce space and to allow fast, logarithmic-time queries (see ‘KMC file format description’ in the Supplementary Material for more details). Recent works (Břinda ; Rahman ) demonstrated space-efficient storage of genomic k-mers using their spectrum-like property (Chikhi ), i.e. assuming that most k-mers originate from a set of long strings. A spectrum-preserving string set (SPSS) representation stores sequences longer than k, in which each window of length k is a k-mer from the original set, and achieves around 3 bits per k-mer [in Rahman , k = 31, no counts stored]. However, the representation is non-trivial to compute and requires hours for a human genome. We propose a space-efficient format that is fast to produce, encoding k-mers in binary and storing them in overlapping form. The price of this space efficiency is that random access is not supported in KFF; it is, however, unnecessary in the many consumer applications that only read k-mer sets from disk sequentially (Bankevich ; Rahman ).
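The benefit of storing k-mers in overlapping form can be illustrated with a short standalone sketch. It does not use the KFF API; the function names are hypothetical, and the bit counts consider only nucleotides at 2 bits each, ignoring headers, counts and padding.

```python
def kmers_of(sequence: str, k: int) -> list[str]:
    """Return every window of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bits_one_per_record(sequence: str, k: int) -> int:
    """Bits needed if each k-mer is stored separately at 2 bits per nucleotide."""
    return 2 * k * len(kmers_of(sequence, k))


def bits_overlapping(sequence: str) -> int:
    """Bits needed if the whole sequence is stored once at 2 bits per nucleotide."""
    return 2 * len(sequence)


if __name__ == "__main__":
    seq, k = "ACTAAACTGATG", 10
    print(kmers_of(seq, k))
    print(bits_one_per_record(seq, k), "bits as separate k-mers")
    print(bits_overlapping(seq), "bits in overlapping form")
```

A sequence of length n holds n - k + 1 k-mers, so the saving grows with the length of the stored sequences.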

3 Methods

A KFF file is composed of a short header and a succession of sections (see Fig. 1). The header contains the format version, the nucleotide 2-bit encoding (e.g. A:0, C:1, G:3, T:2), global flags to indicate whether k-mers are all unique and/or in canonical form, and finally a metadata section.
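As an illustration of a 2-bit nucleotide encoding, the following standalone sketch packs a k-mer into an integer using the example mapping A:0, C:1, G:3, T:2. The helper names are hypothetical and the code does not reproduce the on-disk byte layout of KFF (padding and byte order are defined by the format specification, not here).

```python
# Example mapping from the text; a KFF header may declare a different one.
ENCODING = {"A": 0, "C": 1, "G": 3, "T": 2}
DECODING = {code: nucleotide for nucleotide, code in ENCODING.items()}


def pack(kmer: str) -> int:
    """Pack a k-mer into an integer, first nucleotide in the highest bits."""
    value = 0
    for nucleotide in kmer:
        value = (value << 2) | ENCODING[nucleotide]
    return value


def unpack(value: int, k: int) -> str:
    """Recover the k-mer of length k from its packed integer form."""
    letters = []
    for shift in range(2 * (k - 1), -1, -2):
        letters.append(DECODING[(value >> shift) & 3])
    return "".join(letters)


if __name__ == "__main__":
    packed = pack("ACGT")
    print(packed, unpack(packed, 4))
```

Because the mapping is stored in the header, a reader can decode a file without assuming any particular encoding.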
Fig. 1.

Structure of the K-mer File Format with k = 10 and minimizers of size 8. Top right part: a toy k-mer set shown in plain text. Left part: the same k-mer set represented in KFF. The top-left box is the file header and each following box is a different section. Bottom right part: alternatively, a Sequences section can be represented more succinctly by a Minimizer section which contains the same set of k-mers. For example, the first entry in the M section has sequence ACTG with its minimizer at position 3, hence it corresponds to sequence ACTAAACTGATG of size 12 (which is identical to the first entry in the R section), from which three k-mers can be extracted.

The rest of the file consists of sections of several types. The header of a section indicates its type. A V section defines variables that are helpful for the following sections. Actual k-mer sets and their associated data are stored in either sequences (R) or minimizer sequences (M) sections. In both R and M sections, longer sequences store overlapping k-mers, avoiding some redundancy. R sections store sequences explicitly, and the key idea of M sections is to avoid storing the minimizer sequence explicitly, and instead only indicate at which position to insert it in the stored sequence. An I section provides an index to quickly find the positions of sections within a KFF file, but its purpose is not to index k-mers themselves. For more details, see Supplementary Material ‘KFF file format details’ section. The C++ and Rust APIs provide convenient ways to read and write KFF files, and in particular a high-level C++ function is provided to iterate through k-mers in only four lines of code.
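The idea behind M sections can be sketched on the example from the Fig. 1 caption. This is a standalone illustration, not the KFF API: the function name is hypothetical, the position is assumed to be 0-based, and the minimizer AAACTGAT is inferred from the caption's stored sequence ACTG and full sequence ACTAAACTGATG.

```python
def reconstruct(stored: str, minimizer: str, position: int) -> str:
    """Re-insert the minimizer into the stored sequence at the given position."""
    return stored[:position] + minimizer + stored[position:]


if __name__ == "__main__":
    k = 10
    full = reconstruct("ACTG", "AAACTGAT", 3)
    kmers = [full[i:i + k] for i in range(len(full) - k + 1)]
    print(full, len(full))
    print(kmers)
```

Only 4 of the 12 nucleotides are stored in the entry, since the minimizer is written once for the whole section.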

4 Results

We created the kff-tools software suite on top of the C++ KFF API. It is a collection of small programs that manipulate KFF files, performing operations such as merging/splitting, validation and bucketing. They are available at github.com/Kmer-File-Format/kff-tools. These tools complement the already existing KMC tools (Kokot ) that allow more complex operations on k-mer sets, e.g. union, intersection and complex joins. KMC tools have further been adapted to support KFF files where k-mers are ordered.

To demonstrate that KFF provides significant space savings compared to other file formats, we downloaded short-read sequencing data from the chicken genome (2.8 billion distinct 32-mers) and the human genome (5.7 billion distinct 32-mers), counted using KMC (Deorowicz ). We evaluated several file formats: naive text representation, KMC format, KFF storing k-mers naively, KFF where k-mers are compacted as super-k-mers (i.e. groups of overlapping k-mers that share the same minimizer) (see Supplementary Material ‘Experimental setup relative to kmtricks’ section) and KFF where k-mers are compacted using a spectrum-preserving string set (Rahman ) (see Supplementary Material ‘Experimental setup relative to ESS-Compress’ section). Full data processing details, as well as additional results using compression, are available in the Supplementary Materials.

Table 1 shows that by recording compacted super-k-mers with KFF, it is possible to use roughly 3× less space than with native KMC format for storing the same set of k-mers. In terms of running times, on the Gallus dataset using 8 threads, KMC took 9 min, KFF+sk 113 min and KFF+SPSS 900 single-threaded minutes (optimization pending). On average KFF with super-k-mers requires 17 bits per k-mer (omitting the data), while KMC uses 56 bits/k-mer. Using SPSS improves storage further to 5 bits per k-mer. Furthermore, gzip compression yields an additional gain of about 2× for KFF files and 1.25× for KMC files.
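Super-k-mer compaction, as defined above, can be sketched as follows. This is a simplified standalone illustration and not the kmtricks implementation: it assumes the minimizer of a k-mer is its lexicographically smallest substring of length m, ignores canonical forms, and groups consecutive k-mers by minimizer sequence only. The function names and the toy input are hypothetical.

```python
def minimizer(kmer: str, m: int) -> str:
    """Lexicographically smallest m-mer of a k-mer (an assumed ordering)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def super_kmers(sequence: str, k: int, m: int) -> list[tuple[str, str]]:
    """Group consecutive k-mers sharing a minimizer into (minimizer, super-k-mer) pairs.

    Assumes len(sequence) >= k >= m.
    """
    groups = []
    start = 0
    current = minimizer(sequence[:k], m)
    for i in range(1, len(sequence) - k + 1):
        candidate = minimizer(sequence[i:i + k], m)
        if candidate != current:
            # The previous k-mer started at i - 1, so the group ends at i - 1 + k.
            groups.append((current, sequence[start:i - 1 + k]))
            start, current = i, candidate
    groups.append((current, sequence[start:]))
    return groups


if __name__ == "__main__":
    for mini, skmer in super_kmers("GATTACAGATT", k=5, m=3):
        print(mini, skmer)
```

In this toy input the seven 5-mers (35 nucleotides if stored one by one) are covered by three super-k-mers totalling 19 nucleotides.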
Table 1.

Comparison of file sizes (in GB) for several techniques for storing 32-mers on disk: naive plain-text encoding (‘T’), KMC file format (‘KMC’), KFF file format storing one k-mer per block (‘KFF+naive’) or storing super-k-mers as created by kmtricks (‘KFF+sk’), or using k-mers stored as a spectrum-preserving string set (‘KFF+SPSS’)

Sample           T       KMC    KFF+naive   KFF+sk   KFF+SPSS
Gallus gallus    95.1    19.1   24.2        7.4      4.2
G.gallus, gz     19.9    15.0   16.6        4.8      2.0
Homo sapiens     191.0   37.7   48.5        16.8     11.1
H.sapiens, gz    37.9    30.6   33.8        11.9     6.4

Note: ‘gz’ indicates gzip compressed outputs.
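The space ratios between formats can be recomputed directly from the uncompressed rows of Table 1. The sketch below is plain arithmetic on the reported sizes; the dictionary layout and function name are hypothetical.

```python
# File sizes in GB, copied from the uncompressed rows of Table 1.
SIZES_GB = {
    "Gallus gallus": {"T": 95.1, "KMC": 19.1, "KFF+naive": 24.2, "KFF+sk": 7.4, "KFF+SPSS": 4.2},
    "Homo sapiens": {"T": 191.0, "KMC": 37.7, "KFF+naive": 48.5, "KFF+sk": 16.8, "KFF+SPSS": 11.1},
}


def ratio(sample: str, baseline: str, candidate: str) -> float:
    """How many times smaller the candidate format is than the baseline."""
    row = SIZES_GB[sample]
    return row[baseline] / row[candidate]


if __name__ == "__main__":
    for sample in SIZES_GB:
        for candidate in ("KFF+sk", "KFF+SPSS"):
            value = ratio(sample, "KMC", candidate)
            print(f"{sample}: KMC / {candidate} = {value:.2f}")
```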

In conclusion, we propose the k-mer set file format KFF, along with a versatile C++ and Rust API to read and write k-mers and a toolkit for file manipulations. We hope that KFF will boost interoperability between many software tools that use k-mer sets, and simultaneously improve their efficiency due to the compression features of KFF. Many suggestions and requests are emerging from discussions with the community and extensions of features to the format are currently being considered. The KFF format could for instance be used to store k-mer sketches, although current sketching tools store hashes on disk (Pierce ), discarding the originating k-mers.
References

1.  SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.

Authors:  Anton Bankevich; Sergey Nurk; Dmitry Antipov; Alexey A Gurevich; Mikhail Dvorkin; Alexander S Kulikov; Valery M Lesin; Sergey I Nikolenko; Son Pham; Andrey D Prjibelski; Alexey V Pyshkin; Alexander V Sirotkin; Nikolay Vyahhi; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal:  J Comput Biol       Date:  2012-04-16       Impact factor: 1.479

2.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Authors:  Guillaume Marçais; Carl Kingsford
Journal:  Bioinformatics       Date:  2011-01-07       Impact factor: 6.937

3.  Minimap2: pairwise alignment for nucleotide sequences.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2018-09-15       Impact factor: 6.937

4.  DSK: k-mer counting with very low memory usage.

Authors:  Guillaume Rizk; Dominique Lavenier; Rayan Chikhi
Journal:  Bioinformatics       Date:  2013-01-16       Impact factor: 6.937

5.  Large-scale sequence comparisons with sourmash.

Authors:  N Tessa Pierce; Luiz Irber; Taylor Reiter; Phillip Brooks; C Titus Brown
Journal:  F1000Res       Date:  2019-07-04

6.  KMC 3: counting and manipulating k-mer statistics.

Authors:  Marek Kokot; Maciej Dlugosz; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2017-09-01       Impact factor: 6.937

7.  Representation of k-Mer Sets Using Spectrum-Preserving String Sets.

Authors:  Amatur Rahman; Paul Medvedev
Journal:  J Comput Biol       Date:  2020-12-07       Impact factor: 1.479

8.  Data structures based on k-mers for querying large collections of sequencing data sets.

Authors:  Camille Marchet; Christina Boucher; Simon J Puglisi; Paul Medvedev; Mikaël Salson; Rayan Chikhi
Journal:  Genome Res       Date:  2020-12-16       Impact factor: 9.043

9.  Simplitigs as an efficient and scalable representation of de Bruijn graphs.

Authors:  Michael Baym; Gregory Kucherov; Karel Břinda
Journal:  Genome Biol       Date:  2021-04-06       Impact factor: 13.583

10.  Disk-based k-mer counting on a PC.

Authors:  Sebastian Deorowicz; Agnieszka Debudaj-Grabysz; Szymon Grabowski
Journal:  BMC Bioinformatics       Date:  2013-05-16       Impact factor: 3.169
