Literature DB >> 35758788

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices.

Abstract

MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient.
RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure.
AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35758788 PMCID： PMC9235470 DOI： 10.1093/bioinformatics/btac237

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

K-mers, contiguous strings of DNA or RNA of length k, are frequently utilized in computational biology for a variety of purposes including in genome assembly (Koren ; Liu ; Luo ), metagenomic sequences classification (Dilthey ; Ondov ; Wood and Salzberg, 2014), motif discovery (Fletez-Brant ; Zhang ) and large-scale genomic comparisons (Ondov ; Solomon and Kingsford, 2017). A number of hashing-based techniques such as MinHash (Broder, 1997), Bloom filter (Bloom, 1970) and Count-Min Sketch (Cormode and Muthukrishnan, 2005) have been developed or adopted for efficient computation of k-mer-based similarity methods. In each such application, the first step is to collect a set of k-mers from input sequences. Importantly, it has been found that algorithm performance depends critically on the choice of size k. Indeed, various heuristic and empirical strategies have been introduced to find optimal k-mer sizes that increase performance in certain application areas (Chikhi and Medvedev, 2014; Schulz ; Zhang ). However, whenever a new k size is selected, each computational technique requires reconstructing the k-mer-based data structure and rerunning the analytical pipeline, leading to computational inefficiencies. In particular, hashing-based k-mer methods that compute measures of similarity of genomic and metagenomic data (such as the Jaccard and containment indices) have been demonstrated to extract valuable insight from metagenomic data (Besta ; Ondov ; Pierce ). Multiple hashing-based techniques involving the estimation of Jaccard index and/or other k-mer derivatives have been developed. For example, Mash (Ondov ), Sourmash (Pierce ) and Skmer (Sarmashghi ). Several efforts have been made to improve the efficiency for single k value hashing method, such as b-bit wise MinHash (Li and König, 2011) and Dashing with HyperLogLog sketches (Baker and Langmead, 2019). In these cases too, however, each time a new k size is utilized, the entire computational processes needs to be repeated.

1.1 Motivation

In situations where reference databases can exceed several hundred gigabytes, such as in metagenomics, indexing or sketching the database multiple times for different k-mer sizes is computationally expensive and may become an analysis bottleneck. Nevertheless, adjusting k-mer sizes plays a valuable role such as in metagenomics, where selection of k-mer size can impact performance of downstream methods (such as sensitivity or specificity of taxonomic profiling algorithms). To circumvent this, some subfields of application have proposed heuristic approaches to estimating optimal k-mer size. For example, KmerGenie (Chikhi and Medvedev, 2014) is a heuristic method to determine the optimal k value for genome assembly. Additionally, it has been recognized that it is non-trivial to find the ‘right’ k-mer size in practice (Marchet ; Pierce ; Song ). Besides, the choice of k is usually purpose-specific based on the compromise between sensitivity and specificity. In practice, Mash set defaults k to 21 for the purpose of controlling the probability q of observing a random k-mer under some cutoff (e.g. k = 19 is corresponding to q = 0.01 with genomes of ∼3 GB size). On the other hand, utilizing a low k value grants the tool the power to deal with variances in the real data (Ondov ). However, the cutoffs and tolerances are dynamic: Kraken, a k-mer-based metagenomic classification tool, set the default k value to 31 for better discriminatory power on lower taxonomic ranks (Wood and Salzberg, 2014). In our previous work, k = 61 was found to be empirically optimal to reflect metagenomic composition via alignment (LaPierre ). In many computational works, the default or optimal k values are obtained through benchmarking analysis and these tools usually relax the freedom of choosing arbitrary k values by the user to fit various aims based on the datasets. While our research is mainly about metagenomics, flexibility in different taxonomic levels is crucial. Being able to adjust k values grants us one more dimension of freedom than the similarity cutoffs: a small k value (e.g. 21) is feasible to search against the whole metagenomic database for similar matches while a large k value can discriminate the sub-structures within genus and even species. While we have observed that the rate of decrease of the Jaccard index as a function of k recapitulates the evolutionary relatedness (Koslicki and Falush, 2016), a tool that can efficiently handle the computational challenges of multiple k values can be helpful to further the exploration of metagenomic studies.

1.2 Outcome

To address this problem, we combine a modified MinHash technique (ArgMinHash) and a data structure called a k-mer ternary search tree (KTST), which allows Jaccard and containment indices to be computed at multiple k-mer sizes efficiently and simultaneously. In Figure 1, we provide a high-level description of how we accomplish this: first, we randomly subsample k-mers based on a large k size k (Fig. 1b) to build k-mer sketches. The sketch elements (i.e. k-mers) are then inserted into a KTST (Fig. 1c), which allows for efficient prefix lookups. A prefix lookup in the KTST effectively truncates a k-mer resulting in a smaller k-mer (Fig. 1d). This allows us to efficiently compute k-mer sketches for every (Fig. 1e). This truncation step avoids the needs to reprocessing the whole reference database for sketches with a different k size, making CMash much more efficient when handling large reference database. Combined with the containment MinHash approach (Koslicki and Zabeti, 2019), we can estimate the Jaccard and containment indices for all without requiring explicit re-computation of each single k value. More details about CMash workflow and the data processing can be found in Supplementary Figure S1.

Fig. 1.

Overview of the CMash algorithm. (a) The input to CMash are genomes or sequencing reads. (b) Random samples of k-mers using a modified bottom m sketch can also be used for the classic MinHash algorithm. (c) For some large k value k, one and only one k-mer sketch of the reference data will be constructed and inserted into a KTST. (d) All k-mer sketches corresponding to a smaller k value will be obtained by a prefix lookup in the KTST. (e) For , k-mers from the query data are streamed through the KTST resulting in (f) reliable estimates for a range of k-mer sizes with greater computational efficiency This truncation-based method turns out to be a biased estimator of the k-mer-based similarities. However, in our empirical analysis, we find that the CMash estimate of the Jaccard and containment index does not deviate significantly from the ground truth, indicating that this approach can give fast and reliable results with minimal bias. Compared to our previous MinHash-based approximation to the containment index (Koslicki and Zabeti, 2019), we find that the CMash estimate for ten k values is approximately ten times faster and requires only one-fifth of space to store the reference database. Importantly, this approach can be generalized to more than similarity computation: many sketching, k-mer or shingling-based approach may adopt our method to avoid the need to re-compute k-mer sets when changing the k size. As such, this probabilistic data analysis approach should find application outside of metagenomics (the application we focus on) to genomics more broadly, as well as other applications that utilize a k-mer or shingling approach. In summary, we demonstrate how this CMash technique can be applied to several widely utilized tools (e.g. Mash Screen (Ondov ), Sourmash (Pierce )) and will help to speed up k-mer-based computation when multiple k sizes are needed. A proof of concept implementation of the algorithm and data structure is freely available at https://github.com/dkoslicki/CMash.

2 Materials and methods

Here, we describe our algorithmic approach, but first we recall a few necessary definitions.

2.1 Preliminaries

2.1.1 Jaccard and containment index

In computational biology, k-mers are consecutive substrings of length k of nucleotides . The similarity between genomic data can be measured by the similarity of their respective k-mer sets: the collection of all distinct k-mers appearing as contiguous substrings in the data. If A is a collection of strings on the alphabet , then A is defined to be the set of all unique k-mers in A. In this entire section, most of the definitions given apply to arbitrary sets, but with the genomic application area in mind, we often suppress the superscript and write A instead of A for simplicity, with the implicit understanding that a set of k-mers depends on the k value chosen. The Jaccard index (JI) measures the similarity of two sets by comparing the relative size of the intersection over the union (Broder, 1997). For two non-empty finite sets A and B, the Jaccard index is defined as . Hence, with larger values indicating more overlap. Similarly, the containment index (CI) of A in B (with A non-empty) measures the relative size of the intersection over the size of A: . So with larger values indicating that more content of A resides in B. If the cardinality of both A and B are known, the Jaccard index and containment index are interchangeable: When applied to sets of k-mers, we call out the dependence on k with the following definitions:

2.1.2 Classic MinHash algorithm for the Jaccard index

For very large sets A and B (such as k-mer sets for moderate to large k derived from genomic data), computing the Jaccard index directly can be computationally taxing. To circumvent this, Broder proposed MinHash to efficiently estimate the Jaccard index for large sets (Broder, 1997). MinHash uses a random sampling process: first, we fix a constant (m is usually called sketch size) and select a family of m min-wise independent hash functions whose domains contain . Then, we define the MinHash sketch of a set A as the element (ties can be solved by lexicographic order) in A that cause some h to have the minimum value on A. More formally, define Next, define m random variables , such that: The probability of a MinHash collision (i.e. ) is an unbiased estimate of J(A, B): In practice, a ‘bottom sketch’ strategy, originally proposed by Broder (1997), is commonly used to implement the MinHash algorithm. Instead of using m hash functions, all k-mers from a given set A are passed through a single hash function and the smallest m hash values (instead of elements) are stored in a sorted sketch of size m. The probability that sketch share a hash value represents the probability of random sampling a shared element from the union of set A and B. So, the resemblance of set A, B can be quickly estimated by counting the matched values between and . This efficient approach has found use in, e.g. metagenomics where hundreds of thousands of microbial genomes may under consideration. For example, both Sourmash (Pierce ) and Mash Screen (Ondov ) maintain hash value sketches of all input genomes for comparison. However, k-mer information is lost during if one only considers hash values, instead of elements leading to minimal hash values. Herein, we will show how we can benefit from using a k-mer sketch instead of a hash value sketch in similarity analysis. We now define a bottom -mer sketch. Let A be the set of all k-mers derived from a set of sequences/string A and define as the set of the m elements corresponding to the m smallest hash values in set . Namely, for m a given sketch size and k the k-mer size, the k-mer MinHash sketch of A is defined to be We may suppress m and k for notational simplicity.

2.2 Containment MinHash

Though the MinHash approach gives an unbiased estimation of the Jaccard index, its performance may degrade considerably when A and B are of significantly different sizes (Koslicki and Zabeti, 2019). More robust estimation of J(A, B) can be obtained through C(A, B), the containment index of A in B. This strategy is called ‘containment MinHash’ (Koslicki and Zabeti, 2019). We detail this procedure now. Given a fixed k-mer size and two nonempty distinct sets of strings A and B on the alphabet such that , we first compute , the bottom sketch of the smaller set. Next, we can stream all elements in the set B over to estimate C(A, B). Since is a uniform random sample from set A, the proportion of elements in that are found in set B is an unbiased estimator of the containment index. Namely, To be noted, this streaming method is an efficient algorithm for the estimation of the containment index in metagenomic settings and is utilized by Mash Screen (Ondov ), Metalign (LaPierre ), etc. Finally, we can take advantage of Equation (1) to compute based on the containment index and the cardinalities of set A and B (which can be quickly approximated by fast cardinality estimation such as Hyperloglog (Flajolet )). In CMash, we use this contaiment MinHash approach for JI estimation considering its metagenomic analysis setting.

2.3 CMash

The approach we call CMash consists of two main components: first is the aforementioned k-mer MinHash sketches, and second a traditional ternary search tree applied to sets of k-mers.

2.3.1 ArgMinHash

We now detail the first half of the CMash approach: a data structure we call ‘ArgMinHash’ that utilizes k-mer MinHash sketches. In particular, there is an important but subtle difference between the aforementioned MinHash bottom m sketches and the k-mer MinHash sketches utilized by CMash. In particular, the definition in Equation (5) shows that the bottom m sketch utilized by the containment MinHash (or even MinHash itself) are comprised of the smallest m hash values. In contrast, the sketches utilized by CMash are comprised of elements of a set that hash to small values. This difference is key to allowing a truncation-based approach. Indeed, if we used a sketch comprised of hash values of k-mers, truncating these hash values would have no relationship at all to the hash values obtained from truncated k-mers. More formally, let A be the set of all k-mers derived from a set of sequences/strings A. A hash function h with domain containing A induces an order on A. If collisions are present, we can impose an additional ordering (say, lexicographic) to break ties. Then we define as the set of m smallest, according to the ordering imposed by h, elements of the set A. Then for m a given sketch size and k the k-mer size, the ‘argmin-bottom’ sketch of A is defined to be where AMH is an abbreviation for ‘ArgMinHash’.

2.3.2 K-mer ternary search tree

Given a set of collections of sequences , here thought of as genomes of N different (micro)organisms, we populate a single ternary search tree KTST with the sketches for a fixed sketch size m and a fixed (large) k. Recall that a ternary search tree is a data structure that allows fast (average ) lookup of prefixes so that every root to leaf path (equivalently, node) represents a k-mer. Furthermore, nodes in KTST can be labeled with which elements of D contain the prefix defined by that node. We further associated a sequence of counters to each A in D. We further accelerate prefix queries by populating a bloom filter with every k-mer defined by nodes in the KTST. Note that by inserting the sketches into the KTST, we have effectively computed proxies to for each . Indeed, we obtain new sketches for a smaller k-mer size k by truncating the KTST to a depth of k (Fig. 1d). We can then approximate the containment index of each reference A in some other set of sequences B (thinking of B as a large genomic dataset) in the following way: the k-mers of B for each are streamed through the KTST similar to the aforementioned Mash Screen (see Fig. 1e). When a k-mer is found to correspond to a node in the KTST, each of the counters is incremented for each A associated with that node/k-mer. After the streaming is complete, we will have that In doing so, in a single stream over the input data B, we are able to approximate for each A and each . If during the construction of the sketches , we also store the cardinality of , we can obtain the Jaccard indices as well. The ability to estimate Jaccard or contaiment indices for multiple k-mer sizes (up to some maximum k-mer size) motivates the multi-resolution nature of CMash. Indices can be calculated for both small k-mer sizes (low resolution), and large k-mer sizes (high resolution) utilizing a single data structure.

2.3.3 Biased nature of the estimate

There is no reason to think that the estimate in Equation (8) will be unbiased. Indeed, while is truly a random sample of m elements from the set and so the estimate given in Equation (8) corresponds exactly to MinHashing with , truncating the elements of S to k-length prefixes will not be a random sample of m elements from due to duplicated prefixes. Consider with k = 3 and m = 1: every one of the four 3-mers AAT, ATA, TAA, AAG has equal probability of being selected. As such, truncating these to 2-mers results in appearing with expected probability of 50% in the truncated 3-mers. In contrast, the frequency of in A2 is only 25% as four distinct 2-mers A2 = {AA, AT, TA, AG}. Though the truncation step will inevitably introduce some bias in the estimation, the gain in speed overwhelms the small sacrifice to accuracy, which we empirically verify in the next sections.

2.4 Theoretical analysis of CMash

Theoretically and practically, a truncation-based estimate of the Jaccard similarity will introduce data-dependent bias. Consider an arbitrary sequence data A and let A denote the set of all distinct k-mers of length k in A. Obviously, . Similarly, denote the set of all distinct -mers in A. Let denote the distinct k-mers obtained by directly truncating all elements in from length to k. In an ideal situation where no two elements share the same prefix of length k, the truncated k-mer set is exactly A. Unfortunately, this will not happen in most cases where duplicate prefixes will be introduced during the truncation, leading to estimation deviance. In this section, we will show how this truncation-introduced bias correlates with the truncation length L as well as the input data themselves.

2.4.1 Bias in truncation-based Jaccard index

First we define a prefix relationship between two k-mers of different lengths. For k-mer of length k + L and N of length k, if N is a prefix of , we can truncate by length L to get N, which is written as . We may suppress the length subscript for notational simplicity. Namely: . We then define right extensions: for a given k-mer X of length k, and , we use to denote all -mers in the set that have X as prefix. That is to say, Now, we can quantitatively describe the bias in the truncation-based method. Let be a family of suitable hash functions (e.g. min-wise independent) and be the element in the set A that minimize the hash values: Given two arbitrary non-empty genome/sequence files A and B (and the k-mer sets and for any arbitrary positive integers k, L), if we truncate all k-mers from length k + L to k, the truncation-based Jaccard index truncated from k + L to k, denoted by can be computed in the following: Note that we must truncate larger k-mer values instead of extending smaller k-mer values as the latter would require an additional pass over the input data. However, after is computed in a top-down fashion from a large k value, the k can be freely set to any value smaller than the initial large input k size. Briefly, the truncation-based estimation of Jaccard index utilized by CMash will bring a multiplicative bias factor upon the classic MinHash estimation of the Jaccard index as shown in Equation (18). This bias factor reflects the imbalance of prefixes (i.e. truncated k-mers) distributions between the intersection and the union of the original k-mer sets (before truncation). The bias factor implies that CMash will be more reliable when there are few duplication after truncation (in this case, expected number of right extension of any prefix tends to be 1) or when A and B have relative high similarity (namely, the intersection well represents the union). In other words, the truncation-based method might be limited when either k or true JI values are small. Furthermore, as a multiplicative bias is introduced during truncation, the further a truncated k is from the larger k + L used to construct the KTST, the bias will also increase. This can be seen in Supplementary Figure S4. To demonstrate that CMash is reliable when using large k values or running with closely related data, we applied this approach to 31 genomes within the genus Brucella with multiple k values (Fig. 2) and showed that CMash can robustly function to estimate Jaccard indices for multiple k values simultaneously. Considering the unavoidable variance due to random sampling in MinHash algorithm, the bias in CMash may not be an obstacle empirically.

Fig. 2.

Comparison of ground truth Jaccard indices to those estimated by CMash and MinHash on all pairs of 30 Brucella genomes. (a) The ground truth Jaccard indices as a function of k-mer size from k = 15 to k = 60. (b) Boxplot of JI value differences between CMash and the ground truth. (c) Boxplot of relative errors of CMash compared to the ground truth. (d) Boxplot of JI value differences between MinHash and the ground truth. (e) Boxplot of relative errors of MinHash compared to the ground truth

2.4.2 Bias in truncation-based containment index

Similar to Mash Screen (Ondov ), when computing the containment index (CI) of B in the set A, it is practically more convenient to form only a sketch of A and then stream the elements of B over it, looking for matches. Hence, truncation of elements of B is not necessary. We can then connect this streaming, truncation-based estimate of the containment index to the classic MinHash algorithm directly. To that end, let m be the given sketch size, and compute: where refers to the number of duplicate k-mers (prefixes) generated during truncating the k-mer sketches of A; and refers to the difference of cardinality of overlapping elements between the untruncated sketch and truncated sketch with B. Although the truncated sketch is not exact the same as the bottom sketch , the differences are negligible in practice due to the uniformity of the hash function(s), as we note below in Section 3. Similar to the truncated Jaccard index, the CMash estimate of the containment index will lead to a data-dependent bias factor that relies on the original k-mer length, the truncation length as well as the k-mer distribution in the input data themselves. The bias factor can be minimized when there are few duplicate prefixes (i.e. using large k values). Besides, a larger sketch size m can overwhelm the value of a and b, making the bias negligible in practice. The performance of CMash on truncated CI is examined in Figure 3a, showing reliable estimation while being more efficient in metagenomic settings.

Fig. 3.

Comparison of CMash with the classic MinHash approach to quantify containment indices, along with k size, database creation time and query time. The metagenomic data was simulated from 200 randomly selected genomes; and then 1000 random genomes (including the 200 true members) were analyzed for the containment index for k values ranging from 20 to 60. (a) Boxplot for absolute difference of CI value between CMash (k = 60) and the classic MinHash algorithm under different k values. The x-axis stands for different k values and y-axis stands for the absolute difference in CI. The majority of them are below 0.02. (b) Space usage for the two methods. (c) Time (per CPU minute) needed by the two methods for data structure construction. (d) Query time (per CPU minute) needed by the two methods

3 Results

Here, we compare the results from CMash using the truncation-based method to both the classic MinHash estimation as well as the ground truth (brute force) calculation on real and simulated data. For a proof-of-concept purpose, we coded both CMash and the classic MinHash (which has been adopted by many existing tools) via Python to perform a fair comparison. The comparison of CMash to Sourmash and Mash can be found in Supplementary Figure S3.

3.1 CMash accurately estimates the Jaccard index

Considering the metagenomic setting where researchers are less interested in distal relationships, we benchmarked the efficacy of CMash on a collection of organisms all belonging to the same genus: we selected the genus Brucella. A total of 31 complete or scaffold genomes were found in the NCBI GenBank database (Benson ) and all were downloaded, except for a single genome belonging to the species Brucella intermedia which was discarded due to its large evolutionary distance to the remaining 30 Brucella genomes. To assess the ability of CMash to estimate the Jaccard index, we computed all pairwise Jaccard indices for this set of genomes and compared them to the ground truth Jaccard indices which were computed in a brute force fashion. Figure 2 contains the results where the k-mer size ranged from 15 to 60 in steps of 5, for a k value of k = 60. The sketch size for CMash was m = 2000 by default (estimation variance decreases exponentially with increased sketch size while the computation becomes less time-/space-efficient (Koslicki and Zabeti, 2019)). We use canonical k-mers throughout: the lexicographic minimum of the k-mer and its reverse complement. As expected, Figure 2a shows that the Jaccard index between pairs of genomes decreases as k increases. Indeed, for a k-mer size equal to an input genome’s length, the Jaccard index at this k-mer size is equivalent to exact string matching. As an aside, the rate of decrease of the Jaccard index as a function of k recapitulates the evolutionary relatedness first observed in Koslicki and Falush (2016), which motivated the investigation contained in this article. The differences between the CMash estimate when compared with the ground truth are explained by two components: the variance introduced by the sampling-based approach of the ArgMinHash component of CMash, and the bias introduced by truncating the KTST. As seen in Figure 2b and c, neither of these biases are significant when estimating the Jaccard index with CMash when comparing pairs of genomes with medium or high Jaccard similarity. The higher relative error in Figure 2c for large k size is due to the decrease in the ground truth Jaccard index values as shown in Figure 2a. CMash and the MinHash estimate exactly agree at , so the performance characteristics are already well studied in this setting (e.g. Koslicki and Zabeti, 2019). Indeed, Figure 2b shows that the absolute differences are tightly distributed around zero. Figure 2d and e depicts the performance of the classic MinHash method for comparison. The lower variance of CMash estimation is achieved through the containment MinHash method in Section 2.2 (Koslicki and Zabeti, 2019).

3.2 CMash is significantly more efficient than MinHash

The large size of microbial genome reference databases is a constraint in metagenomic analyses due to database size directly impacting computational time. This is especially a concern when multiple k values are required (Koslicki and Falush, 2016; Pierce ; Rana et al., 2016). To examine the ability of CMash to ameliorate these concerns with large reference databases, we analyzed simulated metagenomic reads for the containment estimation of selected reference genomes. Among all species with complete or scaffold genomes in the NCBI GenBank database (Benson ), we randomly selected 1000 of them spanning 26 phyla, 174 families and 313 genera to serve as a reference database. Next, 200 of these 1000 genomes were used to simulate metagenomic samples. We used BBTools randomreads.sh (Bushnell, 2018) with the default metagenomic setting to simulate these datasets. In total, ten metagenomic datasets with depths ranging from 2 million reads to 20 million reads were simulated and then processed by CMash. We compared the CMash truncation-based estimate to the classic MinHash algorithm in a direct comparison: both algorithms were coded in the same programming language and using the same hash function. The choice of k values was slightly different than before: k values ranging from 20 to 60 in steps of 5 were used and k = 15 was excluded because the probability of sharing a 15-mer merely by chance is not negligible in the metagenomic setting where the genome pool tends to be large. We used the containment index (CI) in this experiment due to the very different sizes of the input data: it has been established that hashing approaches more accurately estimate the containment index in this such situations (Koslicki and Zabeti, 2019), though recall that Jaccard and containment indices can be computed from each other when the cardinalities are known (see Equation (1)). Considering that the major interests in metagenomic anlaysis are for microbes which show up in the sample (usually with moderate or high CI values) and k-mer matches from related or random genomes is unavoidable, we compared absolute difference of CI values between CMash and the classic MinHash algorithm for all the 1000 reference genomes. While the performances from different depths are similar, we only present the results for the depth of 10 million reads. The results, in Figure 3a, show that most of the absolute CI difference falls below 0.02, suggesting that CMash consistently agrees with the classic MinHash algorithm for all k values considered. In this figure, we only compare CMash and classic MinHash as the comparison to the ground truth is shown in Figure 2. Given the comparable performance, the CMash results were obtained more efficiently in terms of space and running time. CMash requires only one reference database for the estimation of all while the classic MinHash requires space linear in the number of k values (Fig. 3b). While CMash is currently a prototype model, the classic MinHash method is not implemented in the most memory efficient way (both hash values and k-mers are stored). Though the cost of the classic MinHash method was overestimated, the superiority of CMash in dealing with multiple k values is significant. In this experiment that used 10 k values, CMash used a total of 176 MB in space for the reference database while the classic MinHash approach used 947 MB to store all of its sketches. In addition, due to not needing to reconstruct new sketches for new k values, we observe in Figure 3c that the time needed for reference construction by CMash was almost one-tenth compared to the classic MinHash. The estimation portion, depicted in Figure 3d, was negligible in comparison to the database construction time, but here too we found CMash to be more efficient than the MinHash approach.

4 Conclusion

In this article, we introduced CMash: an algorithm and data structure that can provide efficient multi-resolution estimation of k-mer-based Jaccard and containment indices. It combines a bottom ‘argmin sketch’ strategy and a prefix lookup in a KTST to avoid the reconstruction of sketches for the entire reference databases each time the k-mer size is changed. One advantage of using a KTST not explored here is that a KTST can be represented as an on-disk database, thus freeing memory for other purposes. Indeed, the minimum memory needed is the size of one leaf node in the pre-built KTST. If needed, the amount of memory utilized by a KTST can be adjusted for a trade-off between speed and memory usage. We showed that this truncation-based method can provide results that well-approximate the ground truth in a more computationally efficient manner. We used CMash to analyze real microbial data and simulated metagenomic data and found it to give consistent and reliable estimates. While not an unbiased estimate for k-mer sizes smaller than the input maximum k value, we observed that the introduced bias was negligible for genomes with moderate and high Jaccard indices. The required space used by this approach is constant when we fix the choice of k, regardless of the number of different k values that we are interested to explore. In contrast, a classic MinHash method requires space that is linear to the number of k values used. Similarly, the time to construct reference database is significantly improved compared to MinHash as CMash only needs to proceed the data once; hence the total running time are effectively linearly reduced with respect to the number of k-mer sizes when compared with MinHash. This feature is extremely helpful in metagenomic analysis where the reference database can be as large as hundreds gigabytes and the querying cost can be overwhelmed by reference construction cost. Besides the algorithmic improvement on the MinHash algorithm, CMash can take advantage of other k-mer sketching methods in which truncated sketches (from prefix lookup) remain/are close to a random sample which can be then used for containment MinHash estimation (Koslicki and Zabeti, 2019). For example, spaced k-mers can be used to replace contiguous k-mers as spaced k-mers show potentials for improving metagenomic analysis (Boden ; Břinda ). When dealing with insertions and deletions, Strobemer (Sahlin, 2021) can be adopted if we only truncate with a length equal to the strobe length. However, care needs to be taken when a truncation sketch is not (nearly) a random sample. CMash is not capable of dealing with weighted Jaccard index and Order Min Hash (Marçais ) as the weight/ordinal information cannot be inherited during truncation, leading to erroneous estimation. In the future, further study of the bias factor may enhance the usability of CMash. Inspired by the tight empirical distribution of the bias factor in Supplementary Figure S4, we are interested in proving some statistical error boundary for CMash estimation regarding the bias based on assumptions of similarity level. This may explain why the measured bias factor is so much better than the current theory suggests. We believe this method will be useful in many metagenomic analyses where multi-resolution estimates can illuminate evolutionary relationships. Beyond metagenomics, k-mer (or shingling)-based methods are utilized extensively in computer and data science, so the CMash approach should find application beyond computational biology by essentially allowing multi-resolution (in terms of k-mer/shingling size) queries with little sacrifice to accuracy but greatly improved efficiency.

Funding

This material is based upon work supported by the National Science Foundation under Grant No. 2029170. Conflict of Interest: none declared. Click here for additional data file.

25 in total

1. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly.

Authors: Binghang Liu; Jianying Yuan; Siu-Ming Yiu; Zhenyu Li; Yinlong Xie; Yanxiang Chen; Yujian Shi; Hao Zhang; Yingrui Li; Tak-Wah Lam; Ruibang Luo
Journal: Bioinformatics Date: 2012-10-08 Impact factor: 6.937

2. Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.

Authors: Brad Solomon; Carl Kingsford
Journal: J Comput Biol Date: 2018-03-12 Impact factor: 1.479

3. MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation.

Authors: David Koslicki; Daniel Falush
Journal: mSystems Date: 2016-06-07 Impact factor: 6.496

4. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data.

Authors: Hongbo Zhang; Lin Zhu; De-Shuang Huang
Journal: Sci Rep Date: 2017-06-12 Impact factor: 4.379

5. Locality-sensitive hashing for the edit distance.

Authors: Guillaume Marçais; Dan DeBlasio; Prashant Pandey; Carl Kingsford
Journal: Bioinformatics Date: 2019-07-15 Impact factor: 6.937

6. Effective sequence similarity detection with strobemers.

Authors: Kristoffer Sahlin
Journal: Genome Res Date: 2021-10-19 Impact factor: 9.043

7. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524