Literature DB >> 36124869

Theory of local k-mer selection with applications to long-read alignment.

Abstract

MOTIVATION: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the 'lowest-ordered' k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well.
RESULTS: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality.
AVAILABILITY AND IMPLEMENTATION: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Year: 2022 PMID： 36124869 PMCID： PMC9563685 DOI： 10.1093/bioinformatics/btab790

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

In recent decades, there has been an exponential increase in the amount and throughput of available sequencing data (Goodwin ), necessitating more efficient modern methods for processing sequencing data (Berger ). Many methods employ k-mer- (length-k substrings of a sequence) based analysis, because k-mer methods tend to be fast and memory efficient. k-mer methods appear in metagenomics (Wood and Salzberg, 2014), genome assembly (Nagarajan and Pop, 2013), read alignment (Li, 2018), variant detection (Peterlongo ; Shajii ) and many more. Because k-mers overlap, selecting a subset of k-mers in a sequence can for many applications lead to a dramatic increase in efficiency while only losing a small amount of information. In this article, we will focus on local k-mer selection methods, which means that the criteria for selecting a specific k-mer should depend on the local information near the k-mer. A popular class of local selection methods use minimizers (Roberts ; Schleimer ), and a lot of recent literature focuses on both practically optimizing minimizer efficiency and theoretical intrinsic properties of minimizers (Marçais ; Zheng , 2020a,b). Recently, new non-minimizer local selection methods for k-mer selection have been proposed, including syncmers (Edgar, 2021) and minimally overlapping words (Frith ). Improvements for minimizer techniques have historically focused on optimizing for density, the fraction of selected k-mers. However, it often makes more sense for density to be an application dependent tunable parameter. Thus, Edgar (2021) instead propose the new metric of conservation, which measures the fraction of bases in a sequence which can be ‘recovered’ by k-mer matching after the sequence undergoes a random mutation process; a similar metric is also used in Frith and Sahlin (2021a). While newer techniques have demonstrated effectiveness through empirical studies, it is not clear why certain methods perform well beyond heuristic notions. Contributions. We make both theoretical and practical contributions in this manuscript. The first part of our article is theoretical. In Sections 2, 3 and 4, we develop a novel, more general, mathematical framework for analyzing local k-mer selection methods. We show how our framework rigorously discerns the relationship between the notion of ‘clumping’ of k-mers alluded to in Frith and conservation (Theorem 3). We then mathematically analyze existing local k-mer selection methods, resulting in new closed-form expressions of conservation for various k-mer selection methods and a novel result on optimal parameter choice for open syncmers (Theorem 8). A summary of the concepts discussed in the theoretical portion of our article can be found in Table 1.

Table 1.

Simplified definitions of concepts discussed in Sections 2 and 3

Term	Simplified definition
q-local k-mer selection method	Function which selects k-mers from a string based on windows of q consecutive k-mers
r-window guarantee	A k-mer selection method has this property if for every r consecutive k-mers, one is always selected
Minimizer	w-local k-mer selection method that selects the smallest k-mer (subject to some ordering) in windows of w consecutive k-mers
Word-based methods	1-local k-mer selection methods that select a k-mer if its prefix is in a specified set W
Open syncmer	1-local k-mer selection method that selects a k-mer if the smallest s-mer inside the k-mer is at the tth position (1-indexed)
Closed syncmer/charged context	1-local k-mer selection method that selects a k-mer if the smallest s-mer inside the k-mer is at the first or last position
Conservation	Percentage of bases in a long string S and a mutated version S′ that are covered by matching k-mers from S and S′
Spread	A k-mer selection method has this property if with high likelihood, the k-mers chosen are not too close together
Pr(f)	Probability vector of f, a vector of length k which is a precise measure of spread for a k-mer selection method
Pr(α(θ,k))	Vector of length k which measures the probability of runs of length ≥k in a sequence of 2k−1 Bernoulli trials
UB(d)	Vector which is an upper bound on Pr(f) computed using a union bound

Simplified definitions of concepts discussed in Sections 2 and 3 The second part of our contributions is practical. In Section 5, we empirically calculate conservation for a wide range of methods for which we have no closed-form expression. We then modified the existing software minimap2 (Li, 2018) to use open syncmers, which show better conservation than the default minimap2 choice of minimizers. Our results show (i) conservation and alignment sensitivity are correlated and (ii) alignment sensitivity is increased after modifying minimap2. However, we also show that the k-mers selected by more conserved methods are also more repetitive, as predicted in Section 3.4, leading to higher runtime. We investigate how different parameterizations lead to runtime and alignment quality trade-offs for ONT cDNA mapping.

2 Preliminaries

We formally define local k-mer selection in this section. We give an original general formalism that extends the existing formalisms in Marçais . We also review existing local k-mer selection methods.

2.1 k-mer selection methods

Let Σ be our alphabet. We will be implicitly dealing with nucleotides () for the rest of the article, although our results generalize without issues. For a string , we use the notation to mean the substring of length k starting at index i. We will assume our strings are 1-indexed. A k-mer selection method is a function f from the set of finite strings such that for , f(S) contains tuples (x, i) where is a k-mer in S, and i is the starting position where x occurs. We will sometimes refer to a k-mer selection method as a selection method or just a method when the value of k is implied. We now define a local k-mer selection method. A method f is a q-local method if for every of length after an appropriate shift in the position of the k-mers. In other words, a q-local method is just defined on -mers and then extended to arbitrary strings. The special case of q = 1 implies that the method can be defined by examining all k-mers and deciding if each k-mer is selected or not. We will always assume that , and will focus only on local methods in this article. The main reason for doing so is that q-local methods have the following desirable property that is an easy result of the definition. Let f be a q-local method. If two strings share a region of length , i.e. , then every k-mer in is also in and f(S) (if ignoring the index of the starting position). Proof. Follows easily from the definition of q-locality.□ To see not all k-mer selection methods are local, it is not hard to see that the MinHash (Broder, 1998; Ondov ) sketch, which is computed by selecting a fixed number of k-mers hashing to the smallest values over the entire genome, is not local. Another method that is not local is selecting every nth k-mer occurring in a string because it depends on the global property of starting position. In the next section, we will give examples of local methods. Our notion of a local k-mer selection method is more general than the notion of local schemes defined in Marçais . Local schemes are defined to be functions of the for where . Local schemes essentially select exactly one k-mer from a -mer by specifying the starting location for a specific k-mer. While local schemes give rise to w-local k-mer selection methods, not all local k-mer selection methods are local schemes because local selection methods may select 0 or more than 1 k-mer in a window. Local schemes are defined in such a way to satisfy a property called the window guarantee. A local k-mer selection method has the r-window guarantee property for r if for all . The window guarantee says that for every r consecutive k-mers, the local method will select at least one k-mer, guaranteeing that there will be no large gaps on the string for which no k-mer is selected. While the window guarantee is useful for many applications (Marçais ), in some applications such as alignment it is not necessary. A closely related notion to the window guarantee are universal hitting sets (UHS) (Ekim ; Orenstein ), which give rise to local methods with a window guarantee. To the best of our knowledge, current UHS implementations work within a relatively limited range of parameters. For example, PASHA (Ekim ) is not practical for k > 16, and in the paper for removal (DeBlasio ), only a window guarantee of length 6 is tested. Therefore, we will not explore UHS in this article, however, research in UHS is active and we expect these limitations to be improved upon in the future. Another important property of a selection method is the density. Visual example of the minimizer and open syncmer selection methods with k = 5, w = 4 and s = 3 over some arbitrary ordering on s-mers and k-mers. Minimizers are w-local methods, so they operate over all -mers and select k-mers from -mers whereas open syncmers are 1-local methods, so they select k-mers from a single k-mer, i.e. decide whether or not a k-mer is selected. The k-mer TCGTG is selected by the minimizer because it is the smallest k-mer in the window. The k-mer ATCGT is selected by the open syncmer method if and only if t = 3, since the smallest s-mer is at the third position The density of a k-mer selection method f is the expected number of selected k-mers divided by the number of total k-mers in a uniformly random string with i.i.d characters S as . Given a long uniform random string S, by taking indicator random variables Y for the event that the k-mer starting at i is in f(S), one can easily see density is equal to as long as i is not near the edges (the start/end) of S. The issue near the edges is that in Definition 2, a k-mer near the start of S appears in less of the ‘windows’ than a k-mer in the middle of the string. This is known as the edge bias problem in minimizers (Edgar, 2021).

2.2 Overview of specific selection methods

Minimizers

The most well-known class of local k-mer selection methods are minimizer methods, originally appearing in Roberts ; Schleimer . Given a triple where w, k are integers and is an ordering on the set of all k-mers, a minimizer outputs the smallest k-mer appearing in a -mer, or equivalently a window of w consecutive k-mers. From now on, when we specify the smallest value in a window, ties are broken by letting the leftmost k-mer be the smallest. A minimizer gives rise to a w-local method with a w-window guarantee by examining all windows of w k-mers and selecting k-mers inside this window. k and w are application-dependent parameters that control for density and k-mer size, so there is one free parameter which is the ordering . Somewhat surprisingly, the ordering, which can be string-dependent, plays a very important part of minimizer performance. We define a random minimizer to be a minimizer with the ordering defined by a random permutation. Schleimer show that the density of random minimizers is approximately . This proof is quite elegant and insightful, so we will reproduce the proof. First, we need the definition of a charged context. (Charged contexts). Given parameters , a window of consecutive w + 1 k-mers (i.e. a -mer) is a charged context if the smallest k-mer is the first or last k-mer in the window. A very similar definition of charging appears in Schleimer , although they do not actually define the charged context. (From Schleimer ). Assuming that no window of w + 1 k-mers contains duplicate k-mers, the density of a random minimizer is . Proof. The key in the proof is to note that counting the number of selected k-mers is the same as counting the number of times a new k-mer is selected as we check each -mer in a long random string S. Mathematically, let X be the (random) position of the k-mer selected at the window of w k-mers starting at i. By the above paragraph, the expected number of k-mers is then just , where the 1 is because the first window always selects a ‘new’ k-mer. Letting and dividing by , the density is then just . The next key step is to notice that if and only if the smallest k-mer in the window of w + 1 consecutive k-mers starting at i is first or last k-mer, or equivalently, if this window is a charged context. If the smallest k-mer is the first k-mer, then X = i, but , and if the smallest k-mer is the last, then but . If the smallest k-mer is at position , then necessarily. Assuming no duplicate k-mers in a window of length w + 1, the probability that the smallest k-mer is the first or last is simply , completing the proof. □ Of course, the assumption in the above theorem is not valid. Marçais give a true bound on the density for a random minimizer as under some reasonable assumptions for w and k. However, with a specific ordering, one can achieve provably better densities for the same parameters. For example, the miniception (Zheng ) algorithm finds an ordering which gives a density upper bounded by .

Syncmers

Syncmers were first defined in Edgar (2021) to be a class of 1-local k-mer selection methods. These methods break k-mers into a window of consecutive s-mers, for some parameter s < k, and select the k-mer based on some criteria involving the smallest s-mers in the window. From now on, we will assume the ordering on the s-mers is random. We have already seen such a construction; Edgar (2021) showed that charged contexts (Definition 6) are also called closed syncmers in the terminology of Edgar (2021), where k = s and . Closed syncmers can be shown to have a k—s window guarantee, so this is our first (and only) example of a method that is 1-local and has a window guarantee. We also analyze the open syncmer defined in Edgar (2021) as it was suggested to perform well with respect to conservation. (Open syncmer). Let (k, s, t) be parameters with s < k and . Considering a k-mer as a window of consecutive s-mers, a k-mer is selected if the smallest s-mer appears at position t in the window. Importantly, open syncmers may not have a window guarantee. Consider the string AAAAA… with t = 2 for any k, s. Remembering the leftmost s-mer is the smallest by convention, the smallest s-mer always occurs at the first position so no k-mers are selected. For t = 1, (Edgar, 2021) showed that a window guarantee exists, but may be very large. The density of open syncmers and closed syncmers are, respectively, and up to a small error term, following the exact same argument as Edgar (2021) and Section 2.3.1 of Zheng ) for the error term.

Word-based methods

Another class of 1-local selection methods considers a set of words and selects a k-mer if a prefix of the k-mer lies in W. Frith consider possible Ws and find ‘good’ possible choices for W. The intuition is that they want the words in W to not overlap with each other; this way selected k-mers overlap less. They offer a simple form for W by letting where and b can be any of {C, T, G} for the nucleotide alphabet. We will refer to this as the (a, b, n)-words method. The density of this method is . More sophisticated choices for W can be constructed, but we will not analyze such methods theoretically since these W are found by an optimization algorithm and thus hard to analyze. We will instead test against more sophisticated W empirically.

3 Analytical framework

While we have discussed intrinsic features of selection methods such as locality, density and window guarantee, we have not yet discussed the ‘performance’ of selection methods. It turns out that evaluating the performance of a method is subtle and depends on the task at hand. In certain contexts, methods cannot be compared because some tasks such as constructing de Bruijn graphs from minimizers (Rautiainen and Marschall, 2020) or counting and binning k-mers (Marçais ) require a window guarantee while other tasks such as alignment do not; it is shown in (Edgar, 2021) that when mutations are present between two genomes, the window guarantee does not necessarily ensure better k-mer matching (see Table 6 of Edgar, 2021). To compare methods, we will fix the density across methods and evaluate a precise notion of performance which we will define below. In previous studies on minimizers (Marçais ; 2018; Zheng ), the focus was only on optimizing the density for a fixed window size w, the assumption being that a method’s performance is only reliant on the size of the window guarantee. This is not an unreasonable assumption for certain tasks, but it is not applicable to methods without a window guarantee. When selecting a subset of k-mers, a good method should select k-mers that are spread apart on S—k-mers should not overlap or clump together (Frith ). Intuitively, close together k-mers give similar information due to more base overlaps. To formalize this, in Section 3.1, we give a new, precise notion of spread. In Section 3.3, we prove an original result detailing how this formalism relates to conservation.

3.1 Formalizing k-mer spread

Let be a long random string of fixed length with independent and uniformly random characters over Σ. We now define a key quantity associated to f which we will call the probability vector of f. (Probability vector of f). Let i be a position in S which is away from the edges. Define the event representing whether or not a k-mer at position j is selected by f, and the probability that some k-mer is selected from . We call the probability vector of f. This notion is well-defined because our string consists of i.i.d letters, and f is translation invariant along the string due to locality. As long as i is not near the end or beginning of S, the choice of i does not matter. Note for any j is the probability that a random k-mer is selected by f, which is the density of f. is a measure of how positionally spread out the events are, where is maximized when all events are disjoint. The interpretation follows because if all events are disjoint, then when a k-mer is selected at position i, no k-mer is selected at positions so the selected k-mers are spread out along the string. On the other hand, if all events only occur simultaneously, then is small and the selected k-mers are clumped together. We can get a natural upper bound for because so the union bound give where means over all components, remembering that is the density. The asymptotically optimal minimizer constructed in Marçais with density actually achieves this upper bound since and . On a technical note, one can actually see that for 1-local methods, is equivalent to . The issue is that for w-local methods, is not defined if . For example, one cannot deduce if a k-mer is selected by a minimizer method just based on the k-mer itself; a window of k-mers is needed.

3.2 Mutated k-mer model

Let be a mutated version of S such that and , where the mutated character is uniform over the rest of the alphabet and is some mutation parameter. A similar model for k-mer mutations is used in Blanca . We give a mathematical definition of the conservation metric from Edgar (2021). (Conservation). Given a k-mer selection method f and parameter θ, let the set of conserved bases be Define the conservation to be The set is the set of bases for which (i) a k-mer is selected by f overlapping the base and (ii) this k-mer is unmutated from S to . In our definition, the position of matching/conserved k-mers has to be the same in S and , so we disregard spurious matches across the genome.

3.3 Relating conservation and spread

We now show that and , which captures k-mer spread, are related. To calculate , we let X be indicator random variables where X = 1 if . By linearity of expectation The X are not independent; if , then it is likely that as well. If and both lie away from the ends of S, we get that . If , the contribution from positions near at the edges of the string is small. Therefore, we will make the assumption for some i in S away from the edges.

Understanding mutation configurations

Given the base i, the k-mers covering i on are so the substring of all covered bases is . We examine how mutations change the k-mers for these bases. We can consider these bases on as Bernoulli trials with success probability corresponding to an unchanged base. We call possible sequences of Bernoulli trials configurations of the mutations. A similar notion of configurations is used in Chaisson and Tesler (2012). Given some configuration, we will get unmutated k-mers overlapping i, an unmutated k-mer being one for which all bases are unmutated. These unmutated k-mers are candidates for being in . In Figure 2, we show graphically how different configurations lead to a different value of .

Fig. 2.

Three examples of mutation configurations. Circles represent bases of around i, while black and white indicate mutated or unmutated bases, respectively. Boxes indicate an unmutated k-mer overlapping position i From Figure 2, clearly is a random variable corresponding to the number of successful runs of length exactly k in trials, or equivalently where is the longest run in a sequence of Bernoulli trials with failure probability θ. This problem is well-studied; see Chapter 5.3 in Uspensky (1965) for a closed-form solution of general run length probabilities.

Calculating by conditioning

Conditioning on α, we get is the probability that some k-mer is selected from the a consecutive unmutated k-mers in S and . In the language of Definition 8, letting be the same event as E but over , where x is arbitrary and can be thought of as the starting position of the first unmutated k-mer (the probability does not depend on x). Locality comes into play now; if f is a 1-local method, then the event is true if and only if is true. This follows because by the assumption, and by 1-locality, this k-mer is in after reindexing if and only if this k-mer is selected by f. Hence for 1-local methods. Now we see that the right-hand side of Equation 1 is exactly from Definition 8, giving us Theorem 3. Let and f be a 1-local method. Then If f is not 1-local, then follows from the trivial upper bound If f is not 1-local, then . For minimizers, this is the context dependency problem; if a k-mer is selected in S and the same k-mer is also in , it may not be selected due to mutations in the window. Thus, 1-local methods are inherently superior for tasks that require k-mer matching for mutated strings (e.g. alignment), and the importance of this property has been recognized in other contexts as well (Ekim ).

Calculating

Even though the successful runs problem is solved for general parameters, the probability formula is derived by manipulating generating functions and is a bit unruly. For trials and runs, we provide a more straight-forward derivation in Supplementary Section S1, where we also show plots for over varying parameters. For i.i.d Bernoulli trials with success probability and , For the case , the probability of successes is just .

3.4 Repetitive k-mers and locality

Given that 1-local methods are superior for conservation, which is a measure of sensitivity, it is natural to ask if such methods have an inherent precision trade-off. In the context of alignment, k-mer matching precision is related to spurious k-mer matchings caused by repeat k-mers, as such matches would force an aligner to evaluate several candidate alignments. In general, unique k-mer matches are easier to handle computationally than repeat k-mer matches. It may therefore be more advantageous for a local selection method to select non-repetitive k-mers. We show that indeed, locality is related to repetitive k-mer selection. Let f local method with n > 1 such that the densities of both methods are equal. That is, for a long uniformly random string S, . Then Proof. Given any local method f, by linearity of expectation, we have Writing to mean x is unique in f(S) and doing the same for , where we abuse notation to mean unique over all k-mers of S, we can condition on to get Now notice that if , then follows. Furthermore, since S is uniformly random, is the same regardless of x. Therefore, the first term is independent of the method. However, for 1-local methods, can never be true if x is not unique in S by the definition of 1-locality. However, for an n-local method, it is certainly possible for even if x is not unique in S. Hence, . Assuming that up to a small relative error when S is large, the theorem follows. □ The above result show that 1-local methods may cause more repetitive matches than other methods such as minimizers. In fact, the above proof shows that all 1-local methods have the same expected number of unique k-mers if S is uniformly random. The same idea in the proof can be re-applied to show that the average expected multiplicity over all k-mers (i.e. how many times it repeats in the f(S)) is also higher when using a 1-local method. We explore the consequences of this in Section 5.2.3.

4 Mathematical analysis of specific methods

4.1 Syncmers

Let k, s, t be given and f be either a closed or open syncmer method. To calculate , we need to analyze the α consecutive k-mers. Breaking these k-mers into s-mers, we get a window of s-mers. Assume that all s-mers in the window are distinct. For uniform random strings, as shown in the proof of Lemma 9 in Zheng ), the probability of two identical s-mers appearing in a window is upper bounded by . For minimap2 (Li, 2018), the default parameters are k = 15 and w = 10. To achieve approximately the same density using an open syncmer, we need s = 10, and it is easy to show that this probability is very small. Since all s-mers are distinct and we assume a random ordering on all s-mers, the relative ordering of s-mers in this window is a uniformly random permutation in where is the relative ordering of position i in the window. Determining whether or not a k-mer is selected then amounts to analyzing a random permutation’s smallest elements (see Supplementary Fig. S2). Now given a permutation , we consider all windows of size corresponding to the s-mers inside a k-mer starting at position i. The permutation is ‘successful’ if one of these k-mers is chosen by f. We now count the number of successful permutation for open syncmers and closed syncmers. (Successful permutations for closed syncmers). Let be the number of permutations in such that for some window , either or is the smallest element in the window. If , If , then . Corollary 1. If f is a closed syncmer method, then We prove this theorem in Supplementary Section S2. The corollary follows by seeing that , which comes from our discussion about how random consecutive k-mers give rise to uniformly random permutations under our assumptions. Notice that and which is in line with the density and window guarantee discussed in Table 2.

Table 2.

Properties of discussed methods

Method (parameters)	(q)-locality	Density	(r)-window guarantee
Random minimizer (w, k)	w	∼2/(w+1)	w
Miniception (w,k,k0)	w	≤1.67/w+o(1/w)	w
Open syncmer (k, s, t)	1	∼1/(k−s+1)	∞*
Closed syncmer (k, s)	1	∼2/(k−s+1)	k - s
Words-based method (W)	1	depends on W	∞*
(a, b, n)-words method	1	= 14·34n	∞

Note: The ∼ sign denotes up to a small error term. means that there is no window guarantee. Words-based methods and open syncmers may have a window guarantee for some parameters, i.e. if t =1 for open syncmers, but usually do not.

We can also count open syncmers. Unfortunately, the number of permutations is only determined as a recurrence relation and the formula is not as nice. Theorem 7 is proved in Supplementary Section S2. Properties of discussed methods Note: The ∼ sign denotes up to a small error term. means that there is no window guarantee. Words-based methods and open syncmers may have a window guarantee for some parameters, i.e. if t =1 for open syncmers, but usually do not. (Successful permutations for open syncmers). Using parameters k, s, t as defined in Definition 7, let and be the number of permutations in such that for some window the smallest element is . Define . Then We define where the subscript indicates falling factorial, and if . One can divide by to get as previously discussed. Although lacking a nice closed-form, it still allows us to determine the optimal choice of parameter t. In Edgar (2021), the impact of parameter t was investigated on the performance of open syncmers, but no explicit conclusion was made about how to choose t. Below we prove a theorem that says that the parameter t should be chosen to be the middle position of a window of s-mers. Let . Then for any valid choice of t. We prove Theorem 8 in Supplementary Section S2. Note that, gives the same by Theorem 7. Notice that if is even, then the preceding remark shows that there are two optimal values for t. By Theorems 3 and 8, we can rigorously justify the optimal value for t to maximize conservation. Thus, t is not actually a free parameter when optimizing for conservation.

4.2 Random minimizer

Let w, k be parameters for a random minimizer method f. To calculate , as opposed to the analysis in the syncmer section where we look only at α consecutive k-mers, we now have to look all k-mers in all windows containing any of these α consecutive k-mers. As in the syncmer case, we will assume all k-mers are distinct. Given α consecutive k-mers, we need to also know the ordering of the w‒1 k-mers to the left and to the right of these α k-mers because they are included in some window containing one of these consecutive k-mers. This gives us k-mers in total, with their relative orders corresponding to a permutation in (see Supplementary Fig. S4 for visual). We proceed by counting permutations corresponding to some k-mer being chosen by a random minimizer method. We first define a function to count permutations in S satisfying a general condition. (Successful permutations for random minimizers). Given parameters with , let be the number of permutations in S , the smallest element is one of . Then where and using to mean the falling factorial, The specific choice of parameters corresponds to the number of successful permutations for the minimizer given α consecutive k-mers since the p parameter describes the leftmost position of the first unmutated k-mer covering position i. Therefore, as before, . This theorem is proved in Supplementary Section S3. It was shown to us recently (Spouge, 2022) that for the choice of parameters corresponding to the minimizer situation, this formula has a greatly simplified form of When α = 1, we get the desired result that the density is which agrees with Theorem 2. Note that, the exact equality for Theorem 3 holds only when f is a 1-local method, which minimizers are not. We give an example that shows how context dependency leads to lower conservation in Supplementary Section S3.1.

4.3 (a, b, n)-words method

We can also derive the probability vector for the previously (a, b, n)-words method which selects k-mers based on their prefix. We prove this result in Supplementary Section S4. under the (a, b, n)-words method is where if x < 0.

5 Empirical results

We perform two sets of experiments. In Section 5.1, we compare analytically and through simulations for a wider range of methods compared with previous studies (Edgar, 2021; Frith ) which focused only on simple minimizer methods and variations on their own methods. In Section 5.2, we modify the minimap2 (Li, 2018) software to use open syncmers and demonstrate that alignment sensitivity is increased, which is in agreement with our previous theory on conservation. However, despite the sensitivity increase given by open syncmers, the additional computational time as a result of increased repetitive k-mer indexing (see Section 3.4) due to 1-locality implies that optimizing for speed and time is not as simple as optimizing for conservation.

5.1 Comparing across different methods

In Section 4, we derived for four methods: closed and open syncmers, minimizers and words. Since all but minimizers are 1-local methods, we can calculate in closed-form for these three methods using Theorem 3. In this section, we empirically calculate for three more methods mentioned in Section 2.2. Fixing some density d, we compare to , where UB(d) is defined in Section 3.1 as the upper bound on probability vectors. For the miniception method, it is not obvious how parameters affect d so we let for as suggested in Zheng ), and then modify k0 and w slightly until we get the density close to d. For the words method, we use two choices of W as W4 and W8 (Supplementary Section S6), where the corresponding methods for W4 has density 1/4 and W8 has density 1/8. These sets are empirically found to perform well in Frith . For methods with a closed-form for , we plot the exact value. For methods without a closed-form, we ran 100 simulations with . We report mean and 95% confidence intervals (assuming normality) of for these methods. Code can be found at https://github.com/bluenote-1577/local-kmer-selection-results. We first fix and plot the fraction of upper bound achieved, , for all methods over a range of θ. We then fix and do a similar plot. For the (a, b, n)-words method, the closest choice of parameters leading to the most similar density is n = 2, 3 which gives density and ; we plot both of these for reference. Both results are shown in Figure 3.

Fig. 3.

Fraction of upper bound achieved for conservation. 95% confidence intervals are built from 100 simulations for methods with empirically deduced values. Note that, some methods have different densities due to parameter constraints; this is mentioned in the labels The results in Figure 3 show that the words method based on the set W4, W8 and the open syncmer methods perform well compared with the other methods. The large drop in conservation between the empirical random minimizer and the upper bound for the random minimizer indicates that context dependency plays a highly non-trivial role in conservation. We note that for d = 1/4, the (a, b, n)-words method performs poorly as it does not take advantage of selecting k-mers with non-overlapping prefixes n = 0. These results also suggest that the best methods already achieve fraction of the possible upper bound for reasonable parameters and error rates, so there is not room for drastic improvement.

5.2 Using open syncmers in minimap2

We now investigate applications to read mapping. We modified minimap2 (Li, 2018), a state-of-the-art read aligner, so that open syncmers are used instead of minimizers. We decided on using open syncmers because although its conservation is slightly less compared with word-based methods, it allows the user to choose a range of densities without constructing a new words set W for each density. We note that since minimap2 was designed with minimizers in mind, we expect the benefits of switching k-mer selection methods to be dampened compared with designing an aligner with open syncmers in mind. Minimap2 aligns reads using a seed-chain-extend procedure. Roughly speaking, this works by first applying a k-mer selection method to all reads and reference genomes. Selected k-mers on the reads are then used as seeds to be matched onto the selected k-mers of the reference. Colinear sets of k-mer matches are collected into chains, and then dynamic programming-based alignment is performed to fill gaps between chains. Our modification was to swap out the k-mer selection method, originally random minimizers, to an open syncmer method instead. Our version of minimap2 can be found at https://github.com/bluenote-1577/os-minimap2. We operate on four sets of real and simulated publicly available long-read datasets: Real microbial PacBio Sequel long-reads from PacBio, available at https://tinyurl.com/uhuwvxb8 and their corresponding assemblies. Real human ONT nanopore reads from Miga available at https://github.com/marbl/CHM13 (id: rel3; downsampled and only including reads > 1 kb in length) and the corresponding assembly CHM13. Simulated RNA (cDNA) ONT long-read data from Trans-NanoSim (Hafezqorani ). Simulated PacBio long-reads on human reference GRCh38 using PBSIM (Ono ). We discuss the first three listed datasets in the following sections, and discuss the simulated PacBio experiment in Supplementary Section 7.1.

Chaining score improvement

For each alignment of a read, minimap2 computes a chaining score (Section 2.1 in Li, 2018). This chaining score measures the goodness of the best possible chain for that alignment. Roughly speaking, if k-mers in a chain overlap and do not have gaps, the chaining score is high. We took the long-read sequences and assembly for E.coli W (bc1087) in the dataset and compared the mappings for open syncmers versus minimizers. This is shown in Figure 4.

Fig. 4.

Histogram of chaining scores corresponding to alignments of reads for E.coli W (bc1087) in against its assembly. Minimap2 with open syncmers and minimizers were compared against each other with parameters chosen so that density is fixed at 1/5. Mean chaining scores are given in the legend. t = 3 is the optimal value of t by Theorem 8 For t = 3, the optimal parameter of t with , open syncmers leads to better chaining scores. The theoretical value of increases as t increases for open syncmer methods because the vectors are increasing component-wise as t increases; see Supplementary Figure S3. Therefore, the increase in mean chaining score as t increases is suggestive that goodness of chaining is indeed correlated with conservation, even for PacBio reads which have higher rates of indels than substitutions (Dohm ). This suggests that the results from the conservation framework is still valid under realistic mutation models. Naturally, empirical densities may deviate from theoretical densities for minimizers (Marçais ). To verify that the increase in chaining was not because of open syncmers having higher density than minimizers, we calculated the empirical density for minimizers/open syncmers on the reference genome in this experiment and found that it was and , respectively. This shows that even though the density for open syncmers was lower than for minimizers, the chaining score was higher.

Improvement in number of mapped reads and quality for real datasets

We now examine how alignment sensitivity increases when using a more conserved k-mer selection method on real datasets. We fix t = 3 with the same parameters for k, w, s as in the previous section. For the real datasets, we analyze mapping quality and number of mapped reads. The difference between conservation of k-mer selection methods is more pronounced when the rate of mutation is higher, so we test how alignment changes as the reference genomes diverge from the reads. We test this effect by aligning the reads to varying reference genomes as well. The results are summarized in Table 3.

Table 3.

Open syncmer versus minimizer mappings for two versions of minimap2 on long-reads. We fix parameters so the density is 1/5 for both seeding methods

Long-read dataset	Reference	\|OS∩M¯\| (mapQ)	\|M∩OS¯\| (mapQ)	\|OS\|	\|M\|	Total no. of reads	% increase in mapped reads
E.coli W (bc1087)	E.coli W (bc1087)	312 (21.56)	102 (11.57)	194 455	194 245	196 901	0.108
E.coli K12 (bc1106)	E.coli W (bc1087)	548 (20.30)	187 (11.33)	220 459	220 098	226 906	0.164
K.pneumoniae (bc1074)	E.coli W (bc1087)	11 434 (19.80)	3679 (12.10)	143 724	135 969	251 838	5.70
Downsampled human ONT (rel3)	Human—CHM13	370 (3.53)	103 (2.52)	37 819	37 552	51 210	0.711
Downsampled human ONT (rel3)	Mouse—GRCm38	2467 (2.32)	1005 (1.90)	19 214	17 752	51 210	8.23

Note: OS is the subset of reads successfully mapped using open syncmers, and M similarly for minimizers. is the set of reads which are uniquely mapped by open syncmers, and are reads uniquely mapped by minimizers. The average mapQ outputted by minimap2 within the set is presented as well (Section 5.2).

Open syncmer versus minimizer mappings for two versions of minimap2 on long-reads. We fix parameters so the density is 1/5 for both seeding methods Note: OS is the subset of reads successfully mapped using open syncmers, and M similarly for minimizers. is the set of reads which are uniquely mapped by open syncmers, and are reads uniquely mapped by minimizers. The average mapQ outputted by minimap2 within the set is presented as well (Section 5.2). These results show that the number of reads that were uniquely mapped using open syncmers is consistently greater than with minimizers. In the case of human ONT reads mapping to the mouse genome, the relative increase in number of mapped reads is around 8.2%. This effect increases as the reads and reference genomes diverge more and mapping becomes more challenging. Since longer reads are more likely to be aligned than shorter reads, the effect of open syncmers on our metric for sensitivity, the percentage increase of mapped reads, is dependent on read length. We investigated the relationship between sensitivity increase and read lengths in Table 4. We mapped rel3 human nanopore reads onto CHM13 and stratified by read length. As expected, open syncmers increase mapping sensitivity for shorter nanopore reads disproportionately compared to very long reads.

Table 4.

Open syncmer versus minimizer mapping statistics as a function of read length for the rel3 ONT read set mapping onto CHM13

Human ONT (rel3) read lengths	\|OS∩M¯\| (mapQ)	\|M∩OS¯\| (mapQ)	\|OS\|	\|M\|	Total no. of reads	% change in mapped reads
100-1000 bp	546 (4.25)	146 (2.96)	6068	5668	44 397	7.05%
1000-2000 bp	165 (3.72)	38 (1.71)	3277	3150	10 097	4.03%
2000-3000 bp	53 (3.70)	14 (5.71)	1840	1801	4151	2.16%
>3000 bp	152 (3.27)	51 (2.25)	32 706	32 605	36 968	0.31%

Note: OS is the set of mapped reads with open syncmers, M is the set of mapped reads with minimizers. Values in parenthesis indicate average mapping qualities calculated by minimap2.

Open syncmer versus minimizer mapping statistics as a function of read length for the rel3 ONT read set mapping onto CHM13 Note: OS is the set of mapped reads with open syncmers, M is the set of mapped reads with minimizers. Values in parenthesis indicate average mapping qualities calculated by minimap2.

Simulated transcriptome analysis over varying parameters

We now analyze simulated RNA (cDNA) ONT data, where we can analyze mapping precision due to having a ground truth. We chose to analyze such data because read lengths are shorter, and in Table 4, we found that the difference in mapping sensitivity was more drastic when read lengths are shorter. We used Trans-NanoSim (Hafezqorani ) a software package for simulate cDNA ONT reads. We used the provided pre-trained ‘human_NA12878_cDNA_Bham1_guppy’ model for generating reads. After generating a set of reads from a transcriptome, we mapped the reads back to the genome (GRCh38) using the minimap2-x splice option. For every mapping, we take the primary alignment if the mapQ > 0. We designate a mapping successful if the transcript corresponding to the read overlaps the true gene, otherwise we designate it as an error. We only analyze genes found on the primary chromosomes of GRCh38 and ignore genes found on alternate loci. Based on the experiments in the previous section, it is not clear how increasing sensitivity using syncmers differs from just lowering the value of k to create more seed matches; this was also not investigated in (Edgar, 2021; Frith ). To investigate how each method depends on parameters k, s, w, we repeat the experiment over each value of k and over a set of fixed densities. For every parameterization, we generated 18 104 reads. The 18 104 reads are a result of simulating 20 000 reads using the Trans-NanoSim software; the software simulates garbage reads with error rates , thus we only took the 18 104 reads which should be alignable. The results are shown in Figure 5. We list some of the important takeaways from this experiment below.

Fig. 5.

Sensitivity and precision investigation of simulated ONT cDNA data. Mapped reads were classified into success and errors based on the true transcript location. We repeated the experiment 9 times to get a 95% confidence interval. 18 104 reads were generated for each parameterization. The two left plots show that reducing k or switching to syncmers improves alignment quality. The top right plot shows the success-error curves and the bottom right plot shows the CPU time taken for each method Increasing k-mer matching sensitivity by lowering k or switching to open syncmers increased the quality of overall mapping (less errors and more successfully mapped reads). Open syncmers and minimizers have similar error/success curves as seen in the top-right of Figure 5. This shows that lowering k and choosing more conserved methods effectively do the same thing by increasing the number of seed hits. Larger versions of this subplot are shown in Supplementary Figures S8 and S9. While the quality of the mapping seems independent of lowering k versus using open syncmers, open syncmers allow for new parameterizations. Figure 6 shows that when considering the most optimal parameterizations (toward the bottom-left of the figure), both open syncmers and minimizers have parameters that perform well. The results suggest that perform the best across methods, and that switching from minimizers to syncmers in this regime still gives quite optimal parameters while increasing both mapping quality and time.

Fig. 6.

Errors versus time taken for each method. Each point is for a specific value of k. Syncmers provide different parameterizations when modifying density and k. Parameters close to the bottom-left are well-performing

For the case of fixed density d = 1/7, we noticed that syncmers may offer a sizeable error-runtime improvement for small k, see Supplementary Figure S10. Thus, syncmers may provide more benefit for small d; this can also be seen in the top-left of Figure 5. Smaller values of d may be necessary in cases where a large number of genomes must be indexed by k-mers and memory is an issue. Syncmers take longer when k is fixed. For large k, this is primarily due to indexing speed, which we did not attempt to optimize. For smaller k, this seems to be due to a larger number of repetitive k-mers and spurious matches as mentioned in Section 3.4. Supplementary Figure S7 shows that the indexed set of k-mers for open syncmers contains less unique k-mers and higher k-mer multiplicity on average than minimizers. Errors versus time taken for each method. Each point is for a specific value of k. Syncmers provide different parameterizations when modifying density and k. Parameters close to the bottom-left are well-performing Interestingly, we found that for even values of , the sensitivity increase for choosing open syncmers over minimizers was greatly diminished (Supplementary Fig. S6). This is perhaps related to the fact that the optimal t value is unique when is odd, whereas when it is even there are two equally good choices.

6 Conclusion

In summary, we first described a new mathematical framework for understanding k-mer selection methods in the context of conservation, which allowed us to prove results pertaining to upper bounds, optimal parameter choices and closed-form expressions for conservation. We then investigated conservation empirically and then found that augmenting minimap2 with a more conserved method increases alignment sensitivity as predicted. However, such methods may give rise to more repetitive seed matches, increasing the computational time of alignment. To optimize for quality of mapping versus speed, one should maximize sensitivity, which is related to quality, and minimize the number of repetitive k-mers in the indexed set, which is related to speed. Further research should investigate if this trade-off can be improved using new local-selection methods and how to engineer aligners that take advantage of the trade-off. A notable result of ours is that the best methods can already achieve > 0.96 of the upper bound for conservation for certain parameters, implying that major improvements in conservation are not possible. However, real genomes do not consist of i.i.d uniform letters. There is a mix of high-complexity and low-complexity regions in genomes, so techniques for better distributed selection of k-mers should be investigated (Jain ) for example. Sequence-specific k-mer selection methods, where the selection method is specifically tuned for a certain string is another area of practical importance (Zheng et al., 2021). Theoretical problems include understanding how tight the bound given in Section 3.1 is when parameters are not in the asymptotic regime, and deeper analysis on the context dependency problem for random minimizers as well as on how locality relates to repetitiveness of selected k-mers. An orthogonally related recent idea are strobemers (Sahlin, 2021a), which have been proposed as a k-mer alternative for sequence mapping. It has been shown that strobemers allow for much higher conservation (called match-coverage in Sahlin, 2021a) than k-mers. StrobeAlign (Sahlin, 2021b) is a new short-read aligner that combines syncmers and strobemers for extremely efficient alignment. Another example is the LCP (locally consistent parsing) technique (Hach ; Sahinalp and Vishkin, 1996), which selects varying length substrings instead of k-mers in a locally consistent manner (i.e. a version of Theorem 1 holds). Understanding selection techniques for k-mer replacements is an area of unexplored research where we believe that some of our techniques may be useful. Click here for additional data file.

28 in total

Review 1. Sequence assembly demystified.

Authors: Niranjan Nagarajan; Mihai Pop
Journal: Nat Rev Genet Date: 2013-01-29 Impact factor: 53.242

2. Weighted minimizer sampling improves long read mapping.

Authors: Chirag Jain; Arang Rhie; Haowen Zhang; Claudia Chu; Brian P Walenz; Sergey Koren; Adam M Phillippy
Journal: Bioinformatics Date: 2020-07-01 Impact factor: 6.937

3. A closed formula relevant to 'Theory of local k-mer selection with applications to long-read alignment' by Jim Shaw and Yun William Yu.

Authors: John L Spouge
Journal: Bioinformatics Date: 2022-10-14 Impact factor: 6.931

4. PBSIM: PacBio reads simulator--toward accurate genome assembly.

Authors: Yukiteru Ono; Kiyoshi Asai; Michiaki Hamada
Journal: Bioinformatics Date: 2012-11-04 Impact factor: 6.937

5. Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data.

Authors: Saber Hafezqorani; Chen Yang; Theodora Lo; Ka Ming Nip; René L Warren; Inanc Birol
Journal: Gigascience Date: 2020-06-01 Impact factor: 6.524

6. MBG: Minimizer-based Sparse de Bruijn Graph Construction.

Authors: Mikko Rautiainen; Tobias Marschall
Journal: Bioinformatics Date: 2021-01-20 Impact factor: 6.937

7. Effective sequence similarity detection with strobemers.

Authors: Kristoffer Sahlin
Journal: Genome Res Date: 2021-10-19 Impact factor: 9.043

8. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory.

Authors: Mark J Chaisson; Glenn Tesler
Journal: BMC Bioinformatics Date: 2012-09-19 Impact factor: 3.169

9. Kraken: ultrafast metagenomic sequence classification using exact alignments.

Authors: Derrick E Wood; Steven L Salzberg
Journal: Genome Biol Date: 2014-03-03 Impact factor: 13.583

10. Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length.

Authors: Hongyu Zheng; Carl Kingsford; Guillaume Marçais
Journal: J Comput Biol Date: 2020-12-15 Impact factor: 1.549