Literature DB >> 27103940

MissMax: alignment-free sequence comparison with mismatches through filtering and heuristics.

Abstract

BACKGROUND: Measuring sequence similarity is central for many problems in bioinformatics. In several contexts alignment-free techniques based on exact occurrences of substrings are faster, but also less accurate, than alignment-based approaches. Recently, several studies attempted to bridge the accuracy gap with the introduction of approximate matches in the definition of composition-based similarity measures.
RESULTS: In this work we present MissMax, an exact algorithm for the computation of the longest common substring with mismatches between each suffix of a sequence x and a sequence y. This collection of statistics is useful for the computation of two similarity measures: the longest and the average common substring with k mismatches. As a further contribution we provide a "relaxed" version of MissMax that does not guarantee the exact solution, but it is faster in practice and still very precise.

Entities: Chemical Species

Keywords: Alignment free; Compositional approaches; Mismatches; Phylogenetic analysis; Sequence similarity

Year: 2016 PMID： 27103940 PMCID： PMC4839165 DOI： 10.1186/s13015-016-0072-x

Source DB: PubMed Journal: Algorithms Mol Biol ISSN： 1748-7188 Impact factor: 1.405

Background

Sequence similarity has long been playing a crucial role in Computational Biology and Bioinformatics as a key ingredient in the prediction of functional and structural properties, and of evolutionary mechanisms. Since the introduction of high throughput techniques, hundreds of fully sequenced genomes of different species have been made available at a fast pace. The increasing number of available sequences makes all kind of sequence analysis, most notably assembly, phylogenetic reconstruction, and multiple alignments, more challenging due to the time consuming and memory-demanding operations that need to be carried out on these huge datasets. To try to cope with the increasing demand of time efficiency, a wide range of alignment-free (or composition-based) approaches have been proposed. The idea behind compositional approaches is to model each sequence in terms of the substrings that it contains, and then to devise appropriate similarity measures to compare two sequences based on this model [19]. Traditionally, alignment-free approaches rely on the frequency or presence of L-mers, for a fixed length L, and consider exact matches. Although usually very fast, in several contexts such approaches can be much less accurate than alignment-based counter-parts. For this reason, within the last decade, several approaches have been proposed to improve the ability to better capture the nature of the similarity/dissimilarity between biological sequences with alignment-free techniques. Among the wide literature, we can mention, for example, the introduction of over-representation, rather than the raw frequency count, in the definition of the similarity measure for fixed length [17] and maximal length [3, 4] components; and the definition of distances based on average longest shared substrings [18], which frees the analysis from fixing the length of the substrings to analyse. More recently, several studies proposed to model the intrinsic variability of biosequences by considering approximate matches with a bounded number of mismatches, or by using spaced-words, in the characterization of the sequence composition. Several related experiments showed that, in the context of phylogenetic tree reconstruction, the introduction of approximate matches can improve the quality of the detected sequence similarity [5, 13, 14]. Given these premises, we focused our attention on these more involved formulations of the alignment-free approach, in particular on those allowing for approximate matching within a bounded number of mismatches. As a warm up we will give a brief overview of recently proposed approaches, and will use some of the presented results to introduce further notation used throughout the paper. Let us consider two sequences and defined over an alphabet . Let and be the suffixes of x and y starting at position i and j respectively. In the following we will assume, without loss of generality, that both sequences have the same length n. An early result on speed-ups for the computation of empirical statistic with mismatches was presented in [15], where an algorithm was proposed to compute the number of occurrences with k mismatches of all the substrings of length L in a string x of length n. The key feature of this algorithm is that its complexity is independent on the number of mismatches that are allowed. The algorithm was proposed within the pattern discovery framework [7, 8], thus the need to count the occurrences within the same string in order to subsequently estimate their over-representation. However, the proposed solution can be easily adapted to compute the number of occurrences of all the substrings of length L of a string x in another string y, leading to the definition of a similarity measure between the two sequences based on the frequency of shared approximate occurrences. The first formal definition of similarity measures based on shared maximal substrings with mismatches was introduced in [5]. We report here the main concepts, but with a slightly different notation. Let be the length of the longest common prefix between two strings x and y when k mismatches are allowed. Now, consider the set of defined for all the suffixes of x, and for all the suffixes of y. The following measures of cross correlation were defined for a given number of mismatches k:

Definition 1

, is an n-length vector storing the maximum value attained by for each i over all values correspondingly spanned by j.

Definition 2

, is an n-length vector storing the average value attained by for each i over all values correspondingly spanned by j.

Definition 3

is the maximum value attained by over all values of in x and in y.

Definition 4

is the average value attained by over all values of of x and in y. For measures such as and in [5] it was also proposed a subquadratic algorithm for their computation. This can futher be improved to [6]. In [14] kmacs, a greedy heuristic, was proposed to generalize the well known Average Common Substring (ACS) distance [18] so to account for k mismatches when considering the longest common substring between pairs of positions in the two strings. We refer to this variant of the ACS problem as kACS. The algorithm proposed in [14] has time complexity O(nkz), where z is the maximum number of occurrences in y of a string of maximal length occurring in both x and y. Being based on a heuristic, this method is very fast in practice, but it does not guarantee to find the optimal solution to the problem. Note that the kACS problem can be described in terms of the measures of cross correlation previously defined as the mean over all positions i in x of . We end our overview with some recently published theoretical results. In [9] the k-LCF problem is introduced as the generalization of the longest common substring (there named “factor” to avoid confusion with LCS as the Longest Common Subsequence problem) as finding the longest common shared match between two sequences when up to k mismatches are allowed. Also this problem can be described in terms of the previously defined scheme, as it corresponds to . In [9] an O(nm) time and O(1) space solution is provided for a generic k and two strings of length n and m respectively, and also an solution for the case . Finally, in [2] an time and O(n) space algorithm was proposed to provide a subquadratic solution to the kACS problem. Within this framework we developed a new strategy based on filtering for the computation of the longest common substring with mismatches between all the suffixes of a sequence x and another sequence y. This primitive is at the basis of recently proposed similarity measures for the problem of phylogentic tree reconstruction as the longest and the average common substring distances with mismatches. Formally, we will compute the values of , for all , because this vector allows us to derive the values of both (or equally k-LCF) and of kACS, being respectively:The paper is organized as follows: in "Methods" section we will describe the proposed filtering-based approach to compute the values of all . In "Results and discussion" section we will discuss the results of a set of experiments we devised to test the performances of the proposed algorithms in practice. In "Concluding remarks" section we will lead to the conclusions.

Methods

Our aim is to compute the values of for each position i in x. The main idea behind the proposed approach is to avoid the computation of the for all pairs of positions and . To this purpose we will compute the value of starting from the value of the already computed . This procedure will initially give us a candidate longest match that is at least equal to . We will use this information, among some others that will be discussed in the following subsections, to reduce the cardinality of the set of possible candidates for approximate matches longer than , and then we will verify them. Note that, when computing the for each i, one can either take track of their maximum value to compute , or of their sum to later compute kACS at no extra cost. In the description of the algorithm we will refer to techniques that allow us to keep the worst case analysis within the claimed worst case quadratic bound, but for practical purposes we will use different approaches that will be discussed in a dedicated subsection.

Initial set up:

We start with the computation of as the maximum approximate match with k mismatches of the suffix against the sequence y. For this purpose we will start from, and exploit, the classical concept of longest common substring (without any mismatch allowed).

Definition 5Longest common substring

Given two sequences x and y, of length n, find the maximum length L for which a pair of indexes (i, j) exists such that . The problem of finding the longest common substring between two sequences is a well known problem in pattern matching that can be solved in linear time by the traversal of a generalized suffix tree of the two sequences. More in details, we want to be able to find the longest common match starting at any two positions i in x and j in y. This problem can be solved through a call to the Lowest Common Ancestor of the corresponding leaves and in the generalized suffix tree. The length of the label of the path from the root to is the length of their longest common prefix. LCA queries can be carried out for any i and j in constant time after a linear-time preprocessing step [10]. In particular, similarly to the routine step in [14], we will perform jump-extensions to compute the longest approximate match between and the generic . As after the first jump-extension of length we know we will have a mismatch, we will call LCA on the nodes corresponding to positions and , and repeat the procedure until the -th mismatch is found. This is repeated for each , thus taking O(kn) time overall.

Minimum from the previous step

Candidate from when Candidate from when Assume now we have computed , and we want to compute . Let j be the position in y of a longest approximate match of . Two cases may hold, which are illustrated in Figs. 1 and 2 respectively:

Fig. 1

Candidate from when

Fig. 2

Candidate from when

: in this case the k mismatches all lie within and , respectively. Therefore we have . Note that this might or might not be the final over all positions of y. : in this case the mismatch between the first characters will be lost when considering the alignment of and , leading to mismatches in the following positions. After L positions we know we must have a mismatch, which is now counted as the k-th. To finally obtain we need a further call to LCA on the nodes corresponding to the positions and to obtain that will end on the -th mismatch. In summary: . Again, note that this might or might not be the final over all positions of y. It is possible that several suffixes in y are the site of a longest match with k mismatches with . All the starting positions of these longest matches are considered for further extension at this step. This is done for two purposes: to obtain the longest possible candidate length with k mismatches from the previous step at the minimum cost (only a jump till the next mismatch to the right is needed to have the exact length for the pairs of positions considered at this step); to avoid or reduce the possibility that we have to deal with special cases in the following steps (see Observation 4 in "Theoretical and Practical Considerations" section). Let be the candidate value for obtained either from Case 1 or Case 2.

Potential candidates from the previous step

Let us consider now a generic position r in Y. We must have , since L was the absolute maximum found in the step to compute , and the ties have already been considered. If then k mismatches lie between and , respectively, and . As a consequence, the pair can be ruled out as one that cannot have an approximate match longer than the one we are currently considering (which is greater or equal than ). Note that this observation allows us to exclude from the candidate set all the positions in y that are preceded by a symbol matching . Guessing the maximum extension between the suffix and a candidate The case where is more involved. With reference to Fig. 3, the alignment loses the mismatch in the first position of the alignment , and includes the one at position and , in x and y respectively. To obtain the length of for this alignment we should add to the value of , which gives the last exact contribution till the -th match. It may happen that the addition of this term to allows one to obtain a match longer than the potential we had from the previously discussed Case 1 or Case 2. The main problem here is that we do not know the value of .

Fig. 3

Guessing the maximum extension between the suffix and a candidate

We will then proceed by assuming r is indeed the site of a match longer than the current maximum . If this is the case, the gap with must be closed assuming the -th mismatch occurs after the positions and in the two strings. As a consequence, will end exactly where would end (see Fig. 3). If this value is indeed bigger than or equal to we need to make sure no further mismatch was present between and . This can be checked by running the jump-extension performed in the initial setup starting from positions and until mismatches are found. Let be the reached extension. If its value is equal to then the position r is the new candidate position for the longest match of the suffix in y, and the list of ties is reset, otherwise the position is dropped, and the next candidate is considered. If is equal to , then r is added to the list of ties.

Theoretical and practical considerations

We now discuss some theoretical and practical issues emerging from the proposed approach. Observation 1. The worst case complexity occurs when we inherit from step i an initial candidate that is smaller or equal than . In such a case any position r for which is a possible longer match that we need to verify with the jump-extension. This lead to potential O(n) candidate pairs per position i, and to a worst case time complexity . Observation 2. We observe that even in the worst case, we can rule out all the positions of y that are preceded by a character that matches . Assuming equal distribution, we will drop positions. In practice, we will drop a number of position equal to the frequency of in y. Observation 3. Whenever we impose a matching condition we potentially reduce the candidate set of a fraction equal to . When searching for a match longer than the current one we have that at least the first symbol of the last jump must be a match in order to have a longer match. Furthermore, if the value of computed from the longest matches at the previous step is longer than (which is the minimum) we can impose a match for the whole segment , thus further reducing the size of the candidate set. Observation 4. At step , if the value of obtained from the longest matches at step i is there is the possibility that the actual longest match for is smaller than the one at the previous step. Searching for ties when the longest match is is a very time consuming operation because we cannot apply the reduction of the candidate positions explained in Observation 3. This is because when and we will have that lands on the k-th mismatch rather than on the first position after that, thus ending up to erroneously discard position from further processing. For this reason we will always try first to find a strictly longer match. If we cannot find it, we will search for ties at length using only Observation 1 (and excluding of course also the positions that were already processed). Observation 5. The case discussed in Observation 4 is very time consuming because the candidate set remains pretty large. However, we have verified in practice that the number of ties that are actually found during that step is very small, and orders of magnitude less than the size of . For this reason we develop an “relaxed” version of our algorithm in which we ignore the search of ties for the case . This means that we are no longer guaranteed to find the longest match for each suffix , but in practice in our experiments the error was always negligible, and the time needed for the computation was reduced of about a half as it will be shown in "Results and discussion" section. Observation 6. Building indexing data structures can be expensive, and so can be operations that are theoretically efficient. For example, it was already observed in [14] that a naive extension to account for k mismatches gave better performances than calling LCA (or than performing the equivalent operation on an enhanced suffix array [1], as they did). In our experiments we experienced the same. Moreover, in [12] it is shown that, in practical applications, a simple computation of the longest common extension (LCE) between two strings can substantially improve the performances of several algorithms that use the LCE as a subroutine. Therefore, by keeping the original approach in mind, but avoiding reference to indexing data structures, we developed a tool, implementing our algorithm MissMax, in which the extensions are performed naively. Note that in many cases we just need to perform a one-step extension, as the k-step extensions are performed only in the initial step, and whenever we have a candidate with an approximate match longer than . Observation 7. In its current implementation MissMax uses arrays of bits to take track of the presence or absence of symbols at specific positions in y, in order to quickly compute the set of candidates for further inspection. For DNA applications, when reading the input, we will build 4 arrays , one for each symbol . The position i of is set if and only if . When computing , the first filter is given by the complement of . To get the second filter we take the array and shift it of L positions to the left. The bitwise AND of the two vectors is the bitvector B mentioned above that holds the positions of the candidate set (to avoid a further shift of one position to the right, when considering position j we look at the value of this vector at position ). This particular implementation limits the applicability to genomic sequences. However, this is not a theoretical limit of the approach. By changing the data structures used to store the sequences, and the approach to candidate identification, it will be also possible to deal with larger alphabets. For this purpose we plan in the near future to develop a library to compute statistic with mismatches on all kind of biological sequences.

Results and discussion

In this section we present the results of a set of experiments that we run to test the performances of the exact and relaxed version of MissMax. Here we are not mainly interested in the improvement of the quality of the tree reconstruction with mismatches, with respect of corresponding measures without mismatches, as they were already discussed in [5, 14]. We are rather interested on the time needed to compute the values of , for all the positions i in a sequence x with respect to a second sequence y, and on the precision achievable by heuristics. As explained in "Methods" section, these statistics can be used to compute both the (i.e. the longest common substring with k mismatches) based distance discussed in [5], and the kACS distance discussed in [14]. Moreover, as our algorithms are based on filtering, we will investigate also the filtering power of our approach for both the exact version and relaxed version discussed in Observation 4. For what concerns the comparison with other algorithms, the algorithm described in [2], which holds the best known asymptotic complexity for the exact computation of the n values of , has no available implementation yet. For the exact computation we will thus refer to the naive algorithm. On the other side of the spectrum, the greedy-based approach of kmacs [14], is uncomparably fast in almost all cases (we will discuss in details one experiment in which this did not occur). With respect to this approach we will therefore focus our attention on the precision achievable in the estimate of the actual value of the longest matches. For our experiments we considered two datasets that were previously used in other studies (e.g. [5, 14, 18]). The first dataset consists of the mitochondrial genomes of 34 mammals, including species from Euarchontoglires, Laurasiatheria, Afrotheria, Xenarthra, Ameridelphia, and Monotremata. The second dataset consists of the mitochondrial genomes of 27 primates. The length of the genomes is between 16,000 and 17,000 bp each. All the sequences were downloaded from the NCBI web site. All the experiments were performed on an Intel Core i5-4590 at 3,3 GHZ, with 8 GB of memory.

Time performances

As a first experiment we measured the time performances of both the exact and the “relaxed” version of MissMax on omogeneous and heterogeneous subsets of the 34 mammals datasets. In particular, we considered the subset of Rodents (rat, dormouse, house mouse, guinea pig, squirrel), the subset of Carnivora (cat, dog, harbor sail, grey sail), and a mixed set of the two. We then measured the average time needed to compute the similarity between two sequences within each of the omogeneous set, and between elements of different sets in the mixed subset. Finally we added to our analysis a set of 5 random sequences and measured the performances within the subset. Tables 1 and 2 report, for different k, the average time for the comparison of a pair of sequences in each dataset for the relaxed and the exact version of MissMax, respectively.

Table 1

Average time (in seconds) for the comparison of two sequences on several datasets with the “relaxed” version of MissMax

k	Rodents	Carnivora	Mixed	Random
5	2.11	2.22	2.15	1.85
10	2.45	2.56	2.47	2.18
20	3.07	3.17	3.09	2.72
50	4.74	4.87	4.82	4.28
100	7.42	7.62	7.57	6.65

Table 2

Average time (in seconds) for the comparison of two sequences on several datasets with the exact version of Missmax

k	Rodents	Carnivora	Mixed	Random
5	4.50	5.16	4.45	3.26
10	5.16	6.39	5.64	3.81
20	7.47	8.46	7.52	4.85
50	12.08	14.29	12.9	8.26
100	21.01	23.55	21.28	13.15

Average time (in seconds) for the comparison of two sequences on several datasets with the “relaxed” version of MissMax Average time (in seconds) for the comparison of two sequences on several datasets with the exact version of Missmax It is immediate to observe that the relaxed version of MissMax takes about half of the time needed by the exact version. Nevertheless, even the exact algorithm allows for a full pairwise comparison of typical datasets in about an hour on our desktop computer. The time required for each comparison is consistent with the one showed for the smaller datasets, thus the time required for the overall analysis is a function of the size of the dataset. For example, with the analysis of the full dataset of 27 primates took about 55 min, while the full analysis of the 34 mammals dataset took 1 h and 24 min. Moreover, if we take advantage of multithreading, the full analysis of the Primate dataset requires about 15 min, and the full analysis of the Mammals dataset about 23 min. With respect to the kind of sequences that are analized, we can see how both algorithms are faster on the random dataset than on the biological datasets. The difference is much more evident for the exact version of the algorithm, when k increases. For comparison with other approaches, we run the naive algorithm on the Rodents datasets. The average time was: 7.1 s for ; 10.6 s for ; 17.8 s for ; 39.3 s for ; and 74.99 s for . On the contrary, the software kmacs is generally much faster than MissMax: on such small datasets it ends practically istantaneously. We recall that the theoretical time complexity of kmac is O(kzn), where z is the number of positions in which there is a tie for the first longest exact match. It is shown in [14] that in practice the values of z are usually pretty small. Nevertheless, we report, for completeness of discussion, the following observation. While running our experiments, we had a case in which, increasing the length of the sequences to compare, kmacs performances had a suddent drop. After ruling out this could depend on the actual input length of the sequences (the performances were good on random sequences of the same length), or on some bug (the results were what we were expecting), a closer inspection of the input sequences revealed the probable cause. More or less in correspondance of the input length that was showing the slow down, there was an undefined region in one of the two sequences. Note that the computed value of the longest matches of suffixes was not affected, as the N region was present just in one sequence. Nevertheless, the first exact longest match of a suffix starting in the N region is 0, and in such a case, z becomes equal to n, and all the positions of the other sequence need to be extended to find the maximum extension with mismatches. Time performance of relaxed MissMax for different values of k, as a function of the input length Finally, to further investigate the scalability of the filter-based approach we performed a similar test on longer sequences. The results are reported in Fig. 4, where we show the time required for three different values of k, on sequences up to 100k bp, as a function of the input length. The trends for the exact version are the same. We reported a slowdown of about 2, with the actual values slightly increasing with k. Specifically, the average slowdown factor was 1.87 for ; 2.22 for ; and 2.29 for .

Fig. 4

Time performance of relaxed MissMax for different values of k, as a function of the input length

Exact vs heuristics

The tradeoff between exact algorithms and heuristics could be easily summarized saying that heuristics are faster, but do not guarantee the correct solution to the problem. This is of course true also in our case, but we performed anyway some further analysis to assess the performances of MissMax (both exact and relaxed version) and kmacs, in terms of precision in the computation of sequence similarity based on the aforementioned distances with mismatches. The tree for the 27 primates dataset reconstructed by MissMax with . It is in perfect agreement with the reference tree reported in [11] In terms of reconstruction of the correct phylogeny, we tested the algorithms on the Primates datasets. MissMax reported the same reference tree as in [11] (see Fig. 5; the permutations within a same level of the tree are considered as equivalents) for a relatively small number of allowed mismatches ( and ). kmacs reported an overall good reconstruction, but with some differences with respect to the references (in [14] a wide range of k’s have been tested on the same dataset, but the exact reference phylogeny was never captured).

Fig. 5

The tree for the 27 primates dataset reconstructed by MissMax with . It is in perfect agreement with the reference tree reported in [11]

As previosuly stated, kmacs is usually much faster even than the relaxed version of MissMax. However, in terms of computation of the actual value of the average common substring between two sequences, relaxed MissMax is much more precise. In fact, we measured a relative error of 0.53 % with respect to the real measure for both the Rodents and the Carnivora datasets, and 0.46 % for the mixed dataset. For the full Primate dateset, we measured the relative error for both and , reporting 0.45 % and 0.43 % respectively. kmacs did not achieve such a precision in the approximation of the correct value. In our experiments it usually estimates half of the length the average common substring. This lack of precision was noted also by the authors of kmacs in a set of experiments reported on their paper (although on simulated sequences with a given error rate). Anyway, as reported in [14], this does not seem to heavily affect the reconstruction of a phylogenetic tree. This may be due to the fact that the understimation holds equally for all pair of sequences. However, if the statistics need to be collected for other kind of analysis, then one has to keep in mind that the approximation provided by kmacs could not be as good as the one provided by relaxed MissMax.

Filtering power

Average percentage of pair of positions considered in several datasets with the relaxed filter, with respect to the quadratic maximum number of pairs Average percentage of pair of positions considered in several datasets with the exact filter, with respect to the quadratic maximum number of pairs. A set of experiments was performed to measure the goodness of our filters, investigating on the percentage of pair of positions that are actually considered with respect to the maximum possible. The results are shown in Table 3 for the relaxed and in Table 4 for the exact version of MissMax.

Table 3

Average percentage of pair of positions considered in several datasets with the relaxed filter, with respect to the quadratic maximum number of pairs

Relaxed MissMax	k = 5	k = 10	k = 20	k = 50	k = 100
Rodents	7.26	8.37	9.15	9.74	9.93
Mixed	7.16	8.15	8.76	9.28	9.79
Random	9.73	13.31	15.17	15.29	15.27

Table 4

Average percentage of pair of positions considered in several datasets with the exact filter, with respect to the quadratic maximum number of pairs.

Exact MissMax	k = 5	k = 10	k = 20	k = 50	k = 100
Rodents	9.16	12.38	16.08	18.13	19.89
Mixed	9.34	12.64	16.45	17.93	19.32
Random	15.40	26.92	33.51	32.56	31.54

We note that in the biological sequences the difference between the size of the candidate set of the relaxed and exact version is quite small for , but then increases substantially with k up to , and then remains pretty much constant for . The random sequences follow the same trend, but they appear to reach the saturation level earlier at . It is interesting to note that the analysis on random sequences is faster than in biological sequences (see Tables 1, 2) although the size of the candidate set is bigger. We explain this behavior as due to the time needed to make full check for a match. The comparison is stopped either when the expected length is reached or when the number of mismatches is . It is possible that with random sequences the second condition occurs more often after few comparisons, thus speeding up the entire process, even if the number of positions to check is bigger.

Concluding remarks

In this work we proposed a filtering-based approach for the computation of the longest common substring with k mismatches between each suffix of a sequence x and a sequence y we want to compare to. This statistics is useful for the computation of alignment free distances based on approximate matching, that are a promising approach to improve the quality of alignment free sequence comparison. We developed both an exact and a relaxed version of the algorithm. While the relaxed version cannot guarantee to find the optimal solution, it is in practice faster and still very precise.

8 in total

MissMax: alignment-free sequence comparison with mismatches through filtering and heuristics.

Background

Definition 1

Definition 2

Definition 3

Definition 4

Methods

Initial set up:

Definition 5Longest common substring

Minimum from the previous step

Potential candidates from the previous step

Theoretical and practical considerations

Results and discussion

Time performances

Exact vs heuristics

Filtering power

Concluding remarks

Review 1. Alignment-free sequence comparison-a review.

2. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach.

3. Efficient tools for comparative substring analysis.

4. The average common substring approach to phylogenomic reconstruction.

5. Estimating mutation distances from unaligned genomes.

6. Fast alignment-free sequence comparison using spaced-word frequencies.

7. Fast algorithms for computing sequence distances by exhaustive substring composition.

8. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.

1. Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences.

2. Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage.

3. A Novel Method for Alignment-free DNA Sequence Similarity Analysis Based on the Characterization of Complex Networks.

4. Phylogeny reconstruction based on the length distribution of k-mismatch common substrings.

5. A greedy alignment-free distance estimator for phylogenetic inference.

Review 6. Alignment-free sequence comparison: benefits, applications, and tools.

7. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform.

8. An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.