| Literature DB >> 31779668 |
Derrick E Wood1,2, Jennifer Lu2,3, Ben Langmead4,5.
Abstract
Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.Entities:
Keywords: Alignment-free methods; Metagenomics; Metagenomics classification; Microbiome; Minimizers; Probabilistic data structures
Mesh:
Year: 2019 PMID: 31779668 PMCID: PMC6883579 DOI: 10.1186/s13059-019-1891-0
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1Differences in operation between the two versions of Kraken. a Both versions of Kraken begin classifying a k-mer by computing its ℓ bp minimizer (highlighted in magenta). The default values of k and ℓ for each version are shown in the figure. b Kraken 2 applies a spaced seed mask of s spaces to the minimizer and calculates a compact hash code, which is then used as a search query in its compact hash table; the lowest common ancestor (LCA) taxon associated with the compact hash code is then assigned to the k-mer (see the “Methods” section for full details). In Kraken 1, the minimizer is used to accelerate the search for the k-mer, through the use of an offset index and a limited-range binary search; the association between k-mer and LCA is directly stored in the sorted list. c Kraken 2 also achieves lower memory usage than Kraken 1 by using fewer bits to store the LCA and storing a compact hash code of the minimizer rather than the full k-mer. d Impact on speed, memory usage, and prokaryotic genus F1-measure in Kraken 2 when changing k with respect to ℓ (ℓ = 31, s = 7 for all three graphs). e Impact on prokaryotic genus sensitivity and positive predictive value (PPV) when changing the number of minimizer spaces s (k = 35, ℓ = 31 for both graphs). In d and e, the data are from our parameter sweep results in Additional file 1: Table S2, and the default values of the independent variables for Kraken 2 are marked with a circle.
Fig. 2Comparison between Kraken 2 and other sequence classification tools. a Processing speed (in millions of reads per minute) and memory usage (measured by maximum resident set size, in gigabytes) are shown for each classifier, as evaluated on 50 million paired-end simulated reads with 16 threads. Accuracy results are shown for b 40 prokaryotic genomes and c 10 viral genomes. The results here are shown for sensitivity, positive predictive value (PPV), and F1-measure as evaluated on a per-fragment basis at the genus rank, with 1000 reads simulated from each genome. The strains from which reads were simulated were excluded from the reference libraries for each classification tool. “Kraken 2X” is Kraken 2 using translated search against a protein database. Full results for these strain-exclusion experiments are available in Additional file 1: Table S1