Shaopeng Liu, David Koslicki.
Abstract
MOTIVATION: K-mer-based methods are used ubiquitously in computational biology. However, choosing the optimal value of k for a specific application often remains heuristic. Reconstructing a new k-mer set for each candidate k-mer size is computationally expensive, especially in metagenomic analysis, where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k = kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time- and space-efficient.
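The truncation idea can be illustrated with a minimal Python sketch. It is not the CMash implementation: the function names, the SHA-1-based hash and the toy sequence are all assumptions made for illustration, and a plain set stands in for the KTST. A bottom-m sketch is built once at kmax, and estimates for smaller k are derived from length-k prefixes of the stored k-mers instead of from a rebuilt k-mer set.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (assumed choice: SHA-1 prefix)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_m_sketch(seq, k_max, m):
    """Keep the m k_max-mers of seq with the smallest hash values."""
    return set(sorted(kmers(seq, k_max), key=stable_hash)[:m])


def truncate(sketch, k):
    """Derive k-mers for a smaller k from length-k prefixes of stored k_max-mers."""
    return {kmer[:k] for kmer in sketch}


def containment_estimates(ref_sketch, query_seq, k_values):
    """For each k, the fraction of truncated reference k-mers found in the query."""
    estimates = {}
    for k in k_values:
        ref_k = truncate(ref_sketch, k)
        query_k = kmers(query_seq, k)
        estimates[k] = len(ref_k & query_k) / len(ref_k) if ref_k else 0.0
    return estimates


# Toy reference sequence (hypothetical data, for illustration only).
reference = "ACGTTGCAAGCTAGCTAGGATCCGATCGATTACG"
sketch = bottom_m_sketch(reference, k_max=12, m=8)  # built once, at k_max
print(containment_estimates(sketch, reference, k_values=[4, 8, 12]))
```

Note that the prefixes of a bottom-m sketch taken at kmax are not, in general, a bottom-m sketch of the k-mers at the smaller k, so estimates from this naive version can deviate from those of a sketch built directly at k.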
Year: 2022 PMID: 35758788 PMCID: PMC9235470 DOI: 10.1093/bioinformatics/btac237
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Overview of the CMash algorithm. (a) The inputs to CMash are genomes or sequencing reads. (b) Random samples of k-mers, drawn using a modified bottom-m sketch, can also be used for the classic MinHash algorithm. (c) For some large k value kmax, one and only one k-mer sketch of the reference data is constructed and inserted into a KTST. (d) All k-mer sketches corresponding to a smaller k value are obtained by a prefix lookup in the KTST. (e) For each k up to kmax, k-mers from the query data are streamed through the KTST, resulting in (f) reliable estimates for a range of k-mer sizes with greater computational efficiency.
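The prefix lookup in panel (d) can be sketched with a small ternary search tree. This is an illustrative stand-in, assuming a simple unbalanced tree; the class and method names are hypothetical and do not reflect the data structure CMash actually ships. Each kmax-mer is inserted once, and a shorter k-mer is treated as present if it is a prefix of any stored string.

```python
class Node:
    """One character of a stored string, with three child links."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word):
        """Store a non-empty string (for example, a k_max-mer)."""
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_end = True
        return node

    def has_prefix(self, prefix):
        """Return True if at least one stored string starts with prefix."""
        if not prefix:
            return self.root is not None
        node, i = self.root, 0
        while node is not None:
            ch = prefix[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(prefix):
                    return True
                node = node.eq
        return False


tree = TernarySearchTree()
for kmer in ["ACGTAC", "ACGGTT", "TTGACA"]:  # toy 6-mers, so k_max = 6 here
    tree.insert(kmer)

print(tree.has_prefix("ACGG"))  # a 4-mer that is a prefix of a stored 6-mer
print(tree.has_prefix("ACT"))   # a 3-mer that matches no stored 6-mer
```

Because every node lies on the path of at least one inserted string, a successful walk of length k is enough to confirm that the query k-mer is a prefix of some stored kmax-mer.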
Fig. 2. Comparison of ground truth Jaccard indices (JI) to those estimated by CMash and MinHash on all pairs of 30 Brucella genomes. (a) The ground truth Jaccard indices as a function of k-mer size from k = 15 to k = 60. (b) Boxplot of JI value differences between CMash and the ground truth. (c) Boxplot of relative errors of CMash compared to the ground truth. (d) Boxplot of JI value differences between MinHash and the ground truth. (e) Boxplot of relative errors of MinHash compared to the ground truth.
Fig. 3. Comparison of CMash with the classic MinHash approach for quantifying containment indices (CI), along with k size, database creation time and query time. The metagenomic data were simulated from 200 randomly selected genomes; 1000 random genomes (including the 200 true members) were then analyzed for the containment index for k values ranging from 20 to 60. (a) Boxplot of the absolute difference in CI between CMash (k = 60) and the classic MinHash algorithm under different k values. The x-axis shows the k value and the y-axis shows the absolute difference in CI; most differences are below 0.02. (b) Space usage for the two methods. (c) Time (in CPU minutes) needed by the two methods for data structure construction. (d) Query time (in CPU minutes) needed by the two methods.