Yoann Dufresne1, Teo Lemane2, Pierre Marijon3, Pierre Peterlongo2, Amatur Rahman4, Marek Kokot5, Paul Medvedev4,6,7, Sebastian Deorowicz5, Rayan Chikhi1.
Abstract
SUMMARY: Bioinformatics applications increasingly rely on ad-hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here we introduce the K-mer File Format (KFF) as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5x compared to other formats, and bringing interoperability across tools. AVAILABILITY: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/.
Year: 2022 PMID: 35904548 PMCID: PMC9477520 DOI: 10.1093/bioinformatics/btac528
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Structure of the K-mer File Format with k = 10 and minimizers of size 8. Top right part: a toy k-mer set shown in plain text. Left part: the same k-mer set represented in KFF. The top-left box is the file header and each following box is a different section. Bottom right part: alternatively, a Sequences section can be represented more succinctly by a Minimizer section which contains the same set of k-mers. For example, the first entry in the M section has sequence ACTG with its minimizer at position 3, hence it corresponds to sequence ACTAAACTGATG of size 12 (which is identical to the first entry in the R section), from which three k-mers can be extracted.
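The expansion described in the caption can be sketched in a few lines of Python. This is only an illustration of the idea, not the KFF API or its binary layout: the function names are hypothetical, the position is assumed to be a 0-based offset, and the minimizer string AAACTGAT is inferred from the caption (the stored sequence ACTG, position 3 and the resulting 12-character sequence) rather than quoted from the figure.

```python
def expand_minimizer_entry(stored: str, minimizer: str, position: int) -> str:
    """Re-insert the shared minimizer into a stored sequence at a 0-based offset."""
    if not 0 <= position <= len(stored):
        raise ValueError("minimizer position outside the stored sequence")
    return stored[:position] + minimizer + stored[position:]


def extract_kmers(sequence: str, k: int) -> list[str]:
    """List every length-k window of the sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Values taken from the Fig. 1 caption. MINIMIZER is an assumption: it is the
# only 8-character string that turns "ACTG" into "ACTAAACTGATG" at offset 3.
K = 10
MINIMIZER = "AAACTGAT"

sequence = expand_minimizer_entry("ACTG", MINIMIZER, 3)
kmers = extract_kmers(sequence, K)

print(sequence)  # the full 12-character sequence
print(kmers)     # the three 10-mers it contains
```

Under these assumptions, storing the minimizer once per section and only the 4 remaining characters per entry is what makes the Minimizer section more compact than the equivalent Sequences section.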
Comparison of file sizes (in GB) for several techniques for storing 32-mers on disk: naive plain-text encoding (‘T’), KMC file format (‘KMC’), KFF file format storing one k-mer per block (‘KFF+naive’) or storing super-k-mers as created by kmtricks (‘KFF+sk’), or using k-mers stored as a spectrum-preserving string set (‘KFF+SPSS’)
| Sample | T | KMC | KFF+naive | KFF+sk | KFF+SPSS |
|---|---|---|---|---|---|
| | 95.1 | 19.1 | 24.2 | 7.4 | 4.2 |
| | 19.9 | 15.0 | 16.6 | 4.8 | 2.0 |
| | 191.0 | 37.7 | 48.5 | 16.8 | 11.1 |
| | 37.9 | 30.6 | 33.8 | 11.9 | 6.4 |
Note: ‘gz’ indicates gzip compressed outputs.