
Asymptotic Analysis of the kth Subword Complexity.

Abstract

Patterns within strings enable us to extract vital information regarding a string's randomness. Whether a string is random (showing little to no repetition in patterns) or periodic (showing repetitions in patterns) is described by a value called the kth Subword Complexity of the character string. By definition, the kth Subword Complexity is the number of distinct substrings of length k that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the kth Subword Complexity for binary strings over memory-less sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This gives us an exact expression for the two moments in terms of the patterns' auto-correlation and correlation polynomials. We then investigate the asymptotic behavior for values of $k = \Theta(\log n)$. In the proof, we compare the distribution of the kth Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology that we use involves complex analysis, analytical poissonization and depoissonization, the Mellin transform, and saddle point analysis.


Keywords: asymptotics; generating functions; moments; probability; saddle point method; subword complexity; the Mellin transform

Year: 2020 PMID: 33285983 PMCID: PMC7516637 DOI: 10.3390/e22020207

Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524

1. Introduction

Analyzing and understanding occurrences of patterns in a character string is helpful for extracting useful information regarding the nature of a string. We classify strings into low-complexity and high-complexity, according to their level of randomness. For instance, a binary string constructed by repeating a single short pattern is periodic, and therefore has low randomness. Such periodic strings are classified as low-complexity strings, whereas strings that do not show periodicity are considered to have high complexity. An effective way of measuring a string's randomness is to count all distinct patterns that appear as contiguous subwords in the string. This value is called the Subword Complexity. The name was given by Ehrenfeucht, Lee, and Rozenberg [1], although the notion was initially introduced by Morse and Hedlund in 1938 [2]. The higher the Subword Complexity, the more complex the string is considered to be. Assessing information about the distribution of the Subword Complexity enables us to better characterize strings, and to identify atypically random or periodic strings whose complexities are far from the average complexity [3]. This type of string classification has applications in fields such as data compression [4], genome analysis (see [5,6,7,8,9]), and plagiarism detection [10]. For example, in data compression, a data set is considered compressible if it has low complexity, as it consists of repeated subwords. In computational genomics, subwords of length k (known as k-mers) are used in the detection of repeated sequences and DNA barcoding [11,12]. k-mers are composed of A, T, G, and C nucleotides. For instance, the number of distinct 7-mers of the DNA sequence GTAGAGCTGT is four, meaning that there are four distinct substrings of length 7 in the given DNA sequence. Counting k-mers becomes challenging for longer DNA sequences. 
Our results can be easily extended to the alphabet $\{A, T, G, C\}$ and directly applied in the theoretical analysis of genomic k-mer distributions under the Bernoulli probabilistic model, particularly when the length n of the sequence approaches infinity. There are two variations of the definition of the Subword Complexity: one counts all distinct subwords of a given string (also known as Complexity Index and Sequence Complexity [13]), and the other counts only the subwords of the same length, say k, that appear in the string. In our work, we analyze the latter, and we call it the kth Subword Complexity to avoid any confusion. Throughout this work, we consider the kth Subword Complexity of a random binary string of length n over a memory-less source. We analyze its first and second factorial moments for the range $k = \Theta(\log n)$ (1), as $n \to \infty$. More precisely, we divide the analysis into three ranges of k. Our approach involves two major steps. First, we choose a suitable model for the asymptotic analysis, and afterwards we provide proofs for the derivation of the asymptotic expansion of the first two factorial moments.
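The definition of the kth Subword Complexity can be illustrated with a minimal Python sketch. The function name `kth_subword_complexity` is a hypothetical helper introduced here for illustration; it simply collects all length-k windows of the string in a set.

```python
def kth_subword_complexity(s: str, k: int) -> int:
    """Return the number of distinct substrings of length k in s.

    Hypothetical helper for illustration; returns 0 when no window
    of length k fits in s.
    """
    if k <= 0 or k > len(s):
        return 0
    # A string of length n has n - k + 1 windows of length k;
    # the set keeps each distinct window once.
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


# The DNA example from the text: four distinct 7-mers.
print(kth_subword_complexity("GTAGAGCTGT", 7))  # 4
# A periodic binary string has very few distinct subwords.
print(kth_subword_complexity("10101010", 3))    # 2
```

This direct approach takes time and memory proportional to the number of windows times k, which is one reason counting k-mers becomes demanding for long sequences.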

1.1. Part I

This part of the analysis is inspired by the earlier work of Jacquet and Szpankowski [14] on the analysis of suffix trees by comparing them to independent tries. A trie, first introduced by René de la Briandais in 1959 (see [15]), is a search tree that stores n strings according to their prefixes. A suffix tree, introduced by Weiner in 1973 (see [16]), is a trie where the strings are suffixes of a given string. Examples of these data structures are given in Figure 1.

Figure 1

The suffix tree in (a) is built over the first four suffixes of a string, and the trie in (b) is built over four strings.

A direct asymptotic analysis of the moments is a difficult task, as patterns in a string are not independent of each other. However, we note that each pattern in a string can be regarded as a prefix of a suffix of the string. Therefore, the number of distinct patterns of length k in a string is actually the number of nodes of the suffix tree at level k and lower. It is shown by I. Gheorghiciuc and M. D. Ward [17] that the expected value of the kth Subword Complexity of a Bernoulli string of length n is asymptotically comparable to the expected value of the number of nodes at level k of a trie built over n independent strings generated by a memory-less source. We extend this analysis to the desired range for k, and we prove that the result holds when k grows logarithmically with n. Additionally, we show that asymptotically, the second factorial moment of the kth Subword Complexity can also be estimated by admitting the same independent model generated by a memory-less source. The proof of this theorem relies heavily on the characterization of the overlaps of the patterns with themselves and with one another. Autocorrelation and correlation polynomials explicitly describe these overlaps. The analytic properties of these polynomials are key to understanding repetitions of patterns in large Bernoulli strings. This, in conjunction with Cauchy's integral formula (used to compare the generating functions in the two models) and the residue theorem, provides solid verification that the second factorial moment of the Subword Complexity behaves the same as in the independent model. To make this comparison, we derive the generating functions of the first two factorial moments in both settings. In a paper published in 2012 [18], F. Bassino, J. Clément, and P. Nicodème provide a multivariate probability generating function for the number of occurrences of patterns in a finite Bernoulli string. 
That is, given a pattern w, the coefficient of the term in is the probability in the Bernoulli model that a random string of size n has exactly m occurrences of the pattern w. Following their technique, we derive the exact expression for the generating functions of the first two factorial moments of the kth Subword Complexity. In the independent model, the generating functions are obtained by basic probability concepts.

1.2. Part II

This part of the proof is analogous to the analysis of the profile of tries [19]. To capture the asymptotic behavior, the expressions for the first two factorial moments in the independent trie are further improved by means of a Poisson process. The poissonized version yields generating functions in the form of harmonic sums for each of the moments. The Mellin transform and the inverse Mellin transform of these harmonic sums establish a connection between the asymptotic expansion and the singularities of the transformed function. This methodology is sufficient when the length k of the patterns is fixed. However, allowing k to grow with n makes the analysis more challenging. This is because for large k, the dominant term of the poissonized generating function may come from the term involving k, and singularities may not be significant compared to the growth of k. This issue is treated by combining the singularity analysis with a saddle point method [20]. The outcome of the analysis is precise first-order asymptotics of the moments in the poissonized model. Depoissonization theorems are then applied to obtain the desired result in the Bernoulli model.

2. Results

For a binary string , where ’s () are independent and identically distributed random variables, we assume that , , and . We define the kth Subword Complexity, , to be the number of distinct substrings of length k that appear in a random string X with the above assumptions. In this work, we obtain the first order asymptotics for the average and the second factorial moment of . The analysis is done in the range . We rewrite this range as , and by performing a saddle point analysis, we will show that In the first step, we compare the kth Subword Complexity to an independent model constructed in the following way: We store a set of n independently generated strings by a memory-less source in a trie. This means that each string is a sequence of independent and identically distributed Bernoulli random variables from the binary alphabet , with , . We denote the number of distinct prefixes of length k in the trie by , and we call it the kth prefix complexity. Before proceeding any further, we remind that factorial moments of a random variable are defined as following. The jth factorial moment of a random variable X is defined as where j = 1, 2, … will show that the first and second factorial moments of For large values of n, and for We also prove a similar result for the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity: For large values of n, and for In the second part of our analysis, we derive the first order asymptotics of the kth Prefix Complexity. The methodology used here is analogous to the analysis of profile of tries [19]. The rate of the asymptotic growth depends on the location of the value a as seen in (1). For instance, for the average kth Subword Complexity, , we have the following observations. For the range , the growth rate is of order , in the range , we observe some oscillations with n, and in the range , the average has a linear growth . 
The above observations will be discussed in depth in the proofs of the following theorems. The average of the kth Prefix Complexity has the following asymptotic expansion For where is a bounded periodic function. For For for some The second factorial moment of the kth Prefix Complexity has the following asymptotic expansion. For For For The periodic function in Theorems 3 and 4 is shown in Figure 2.
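As a reminder of the standard definitions behind the statements above, the factorial moments of a generic random variable $X$, and their relation to the second moment and the variance, can be written as follows. This block uses generic notation and is not specific to the symbols of this paper.

```latex
\begin{align*}
\mathbb{E}\big[(X)_j\big] &= \mathbb{E}\big[X(X-1)\cdots(X-j+1)\big], \qquad j = 1, 2, \ldots \\
\mathbb{E}\big[X^2\big] &= \mathbb{E}\big[X(X-1)\big] + \mathbb{E}[X], \\
\operatorname{Var}(X) &= \mathbb{E}\big[X(X-1)\big] + \mathbb{E}[X] - \big(\mathbb{E}[X]\big)^2 .
\end{align*}
```

The second identity is why a first-order result for the second factorial moment carries over to the second moment, while the variance involves a cancellation between leading terms and therefore needs separate care.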

Figure 2

Left: at , and various levels of . The amplitude increases as increases. Right: at , and various levels of p. The amplitude tends to zero as .

The results in Theorem 4 also hold for the second moment of the kth Subword Complexity, as the analysis can be easily extended from the second factorial moment to the second moment. The variance of the kth Prefix Complexity, however, as seen in Figure 3, does not show the same asymptotic behavior as the variance of the kth Subword Complexity.

Figure 3

Approximated second moments (left), and variances (right) of the kth Subword Complexity (red), and the kth Prefix Complexity (blue), for , at different probability levels, averaged over 10,000 iterations.
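A simulation of the kind summarized in Figure 3 might be sketched as follows. This is only an illustrative Monte Carlo experiment under stated assumptions, not the authors' code: the function names are hypothetical, the symbol 1 is assumed to have probability p, and the random text is taken to have length exactly n (so it has n - k + 1 windows, whereas the trie holds n independent strings).

```python
import random


def random_binary_string(length, p, rng):
    # Assumption: each symbol is '1' with probability p, '0' otherwise.
    return "".join("1" if rng.random() < p else "0" for _ in range(length))


def subword_complexity(s, k):
    # Distinct length-k substrings of one random text.
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def prefix_complexity(n, k, p, rng):
    # Distinct length-k prefixes among n independent strings;
    # only the first k symbols of each string matter.
    return len({random_binary_string(k, p, rng) for _ in range(n)})


def estimate(n, k, p, trials, seed=0):
    """Return sample (mean, variance) for both models."""
    rng = random.Random(seed)
    sub = [subword_complexity(random_binary_string(n, p, rng), k)
           for _ in range(trials)]
    pre = [prefix_complexity(n, k, p, rng) for _ in range(trials)]

    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    return {"subword": (mean(sub), var(sub)),
            "prefix": (mean(pre), var(pre))}


if __name__ == "__main__":
    print(estimate(n=200, k=6, p=0.7, trials=200))
```

In such experiments the sample means of the two models are expected to be close, in line with the comparison of the first moments, while the sample variances need not agree.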

3. Proofs and Methods

3.1. Groundwork

We first introduce a few terminologies and lemmas regarding overlaps of patterns and their number of occurrences in texts. Some of the notations we use in this work are borrowed from [18] and [21]. For a binary word The autocorrelation index set is And the autocorrelation polynomial is For the distinct binary words The correlation index set is The correlation polynomial is The following two lemmas present the probability generating functions for the number of occurrences of a single pattern and a pair of distinct pattern, respectively, in a random text of length n. For a detailed dissection on obtaining such generating functions, refer to [18]. The Occurrence probability generating function for a single pattern w in a binary text over a memoryless source is given by The coefficient The Occurrence PGF for two distinct Patterns of length k in a Bernoulli random text is given by and The coefficient The above results will be used to find the generating functions for the first two factorial moments of the kth Subword Complexity in the following section. For generating functions where where We define This yields We observe that . By defining and from (10), we obtain Having the above function, we derive the following result. For this part, we first note that Due to properties of indicator random variables, we observe that the expected value of the second factorial moment has only one term: We proceed by defining a second indicator variable as following. This gives Finally, we are able to express in the following where and . By (11) we have Having the above expression, we finally obtain □
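The overlap structure described above can be made concrete with a short Python sketch. It assumes one common convention (an assumption, since conventions differ between references): for words w and v of length k, the correlation index set contains every i in 1..k for which the length-i suffix of w equals the length-i prefix of v, and the correlation polynomial is the sum over these i of P(v_{i+1} ... v_k) z^{k-i}. The autocorrelation case is v = w. All function names are hypothetical, and the symbol 1 is assumed to have probability p.

```python
def correlation_set(w, v):
    """Indices i in 1..k where the length-i suffix of w equals
    the length-i prefix of v (assumed convention)."""
    k = len(w)
    if len(v) != k:
        raise ValueError("both words must have the same length")
    return [i for i in range(1, k + 1) if w[k - i:] == v[:i]]


def word_probability(u, p):
    # Probability of the word u under a memoryless source with P('1') = p.
    prob = 1.0
    for c in u:
        prob *= p if c == "1" else 1.0 - p
    return prob


def correlation_polynomial(w, v, p):
    """Return {exponent: coefficient} for sum_i P(v[i:]) * z**(k - i)."""
    k = len(w)
    coeffs = {}
    for i in correlation_set(w, v):
        coeffs[k - i] = coeffs.get(k - i, 0.0) + word_probability(v[i:], p)
    return coeffs


def autocorrelation_polynomial(w, p):
    # Overlaps of a word with itself.
    return correlation_polynomial(w, w, p)


print(correlation_set("101", "101"))           # [1, 3]
print(autocorrelation_polynomial("101", 0.5))  # {2: 0.25, 0: 1.0}
print(autocorrelation_polynomial("1000", 0.5)) # {0: 1.0}
```

A word such as 1000 overlaps with itself only trivially, so its autocorrelation polynomial is the constant 1; this is the typical situation for most words, which is what makes the comparison with the independent model feasible.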

3.2. Derivation of Generating Functions

In the following lemma, we present the generating functions for the first two factorial moments for the kth Prefix Complexity in the independent model. For We define the indicator variable as follows. For each , we have Summing over all words w of length k, determines the generating function : Similar to in (18) and (20), we obtain Subsequently, we obtain the generating function below. □ Our first goal is to compare the coefficients of the generating functions in the two models. The coefficients are expected to be asymptotically equivalent in the desired range for k. To compare the coefficients, we need more information on the analytic properties of these generating functions. This will be discussed in Section 3.3.
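In the independent model, the first two factorial moments can be evaluated numerically from elementary probability. The sketch below is an illustration under stated assumptions rather than a transcription of the lemma: a fixed word w with probability P(w) is the prefix of at least one of n independent strings with probability 1 - (1 - P(w))^n, and two distinct words w, v are both present with probability 1 - (1 - P(w))^n - (1 - P(v))^n + (1 - P(w) - P(v))^n by inclusion-exclusion. Words are grouped by their number of 1s, and the function names are hypothetical.

```python
from math import comb


def expected_prefix_complexity(n, k, p):
    """E[number of distinct length-k prefixes among n independent strings]."""
    q = 1.0 - p
    # comb(k, j) words have j ones and probability p**j * q**(k - j).
    return sum(comb(k, j) * (1.0 - (1.0 - p**j * q**(k - j)) ** n)
               for j in range(k + 1))


def second_factorial_moment_prefix_complexity(n, k, p):
    """E[X(X-1)] as a sum over ordered pairs of distinct words."""
    q = 1.0 - p
    probs = [p**j * q**(k - j) for j in range(k + 1)]
    counts = [comb(k, j) for j in range(k + 1)]
    total = 0.0
    for a in range(k + 1):
        for b in range(k + 1):
            # Ordered pairs (w, v) with w != v in classes a and b.
            pairs = counts[a] * counts[b] - (counts[a] if a == b else 0)
            if pairs == 0:
                continue
            pa, pb = probs[a], probs[b]
            # max(...) guards against tiny negative rounding errors.
            both = (1.0 - (1.0 - pa) ** n - (1.0 - pb) ** n
                    + max(0.0, 1.0 - pa - pb) ** n)
            total += pairs * both
    return total


print(expected_prefix_complexity(2, 1, 0.5))                 # 1.5
print(second_factorial_moment_prefix_complexity(2, 1, 0.5))  # 1.0
```

For n = 2 and k = 1 with a fair source, the two strings share their first symbol with probability 1/2, so the complexity is 1 or 2 with equal probability, which matches both printed values.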

3.3. Analytic Properties of the Generating Functions

Here, we turn our attention to the smallest singularities of the two generating functions given in Lemma 3. It has been shown by Jacquet and Szpankowski [21] that has exactly one root in the disk . Following the notations in [21], we denote the root within the disk of by , and by bootstrapping we obtain We also denote the derivative of at the root , by , and we obtain In this paper, we will prove a similar result for the polynomial through the following work. If w and If the minimal degree of is greater than , then for . For a fixed , we have This leads to the following □ There exist In other words, There are three cases to consider: Case When either or , then every term of has degree k or larger, and therefore There exists , such that for , we have . This yields Case If the minimal degree for or is greater than , then every term of has degree at least . We also note that, by Lemma 9, . Therefore, there exists , such that Case The only remaining case is where the minimal degree for and are both less than or equal to . If , then , where u is a word of length . Then we have There exists , such that Similarly, we can show that there exists , such that . Therefore, for we have We complete the proof by setting . □ There exist has exactly one root in the disk First note that This yields There exist , large enough, such that, for , we have and for , If we define , then we have, for , by Rouché’s theorem, as has only one root in , then also has exactly one root in . □ We denote the root within the disk of by , and by bootstrapping we obtain We also denote the derivative of at the root , by , and we obtain We will refer to these expressions in the residue analysis that we present in the next section.

3.4. Asymptotic Difference

We begin this section by the following lemmas on the autocorrelation polynomials. (Jacquet and Szpankowski, 1994). For most words w, the autocorrelation polynomial where (Jacquet and Szpankowski, 1994). There exist In other words, With high probability, for most distinct pairs We will use the above results to prove that the expected values in the Bernoulli model and the model built over a trie are asymptotically equivalent. We now prove Theorem 1 below. From Lemmas 3 and 4, we have and subtracting the two generating functions, we obtain We define Therefore, by Cauchy integral formula (see [20]), we have where the path of integration is a circle about zero with counterclockwise orientation. We note that the above integrand has poles at , , and (refer to expression (29)). Therefore, we define where the circle of radius contains all of the above poles. By the residue theorem, we have We observe that Then we obtain and finally, we have First, we show that, for sufficiently large n, the sum approaches zero. □ For large enough n, and for We let The Mellin transform of the above function is We define which is negative and uniformly bounded for all w. Also, for a fixed s, we have and therefore, we obtain From this expression, and noticing that the function has a removable singularity at , we can see that the Mellin transform exists on the strip where . We still need to investigate the Mellin strip for the sum . In other words, we need to examine whether summing over all words of length k (where k grows with n) has any effect on the analyticity of the function. We observe that Lemma 8 allows us to split the above sum between the words for which and words that have . Such a split yields the following This shows that is bounded above for and, therefore, it is analytic. This argument holds for as well, as would still be bounded above by a constant that depends on s and k. We would like to approximate when . By the inverse Mellin transform, we have We choose for a fixed . 
Then by the direct mapping theorem [22], we obtain and subsequently, we get □ We next prove the asymptotic smallness of in (54). Let For large n and We observe that For , we show that the denominator in (71) is bounded away from zero. To find a lower bound for , we can choose large enough such that We now move on to finding an upper bound for the numerator in (71), for . Therefore, there exists a constant such that Summing over all patterns w, and applying Lemma 8, we obtain which approaches zero as and . This completes the proof of Theorem 1. □ Similar to Theorem 1, we provide a proof to show that the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity have the same first order asymptotic behavior. We are now ready to state the proof of Theorem 2. As discussed in Lemmas 3 and 4, the generating functions representing and respectively, are and Note that In Theorem 1, we proved that for every (which does not depend on n or k), we have Therefore, both (77) and (78) are of order for . Thus, to show the asymptotic smallness, it is enough to choose , where is a small positive value. Now, it only remains to show (79) is asymptotically negligible as well. We define Next, we extract the coefficient of where the path of integration is a circle about the origin with counterclockwise orientation. We define The above integrand has poles at , (as in (46)), and . We have chosen such that the poles are all inside the circle . It follows that and the residues give us the following. and where is as in (47). Therefore, we get We now show that the above two terms are asymptotically small. □ There exists is of order O( We define The Mellin transform of the above function is where . 
We note that is negative and uniformly bounded from above for all . For a fixed s, we also have and Therefore, we have To find the Mellin strip for the sum , we first note that Since , we have and Therefore, we get By Lemma 10, with high probability, a randomly selected w has the property , and thus With that and by Lemma 8, for most words w, Therefore, both sums (91) and (93) are of the form . The sums (92) and (94) are also of order by Lemma 10. Combining all these terms we will obtain By the inverse Mellin transform, for , and , we have □ In the following lemma we show that the first term in (85) is asymptotically small. Recall that We have First note that We saw in (73) that , and therefore, it follows that For , is also bounded below as follows which is bounded away from zero by the assumption of Lemma 7. Additionally, we show that the numerator in (98) is bounded above, as follows This yields By (75), the first term above is of order and by Lemma 10 and an analysis similar to (75), the second term yields as well. Finally, we have which goes to zero asymptotically, for . □ This lemma completes our proof of Theorem 2.

3.5. Asymptotic Analysis of the kth Prefix Complexity

We finally proceed to analyzing the asymptotic moments of the kth Prefix Complexity. The results obtained hold true for the moments of the kth Subword Complexity. Our methodology involves poissonization, saddle point analysis (the complex version of Laplace’s method [23]), and depoissonization. (Jacquet and Szpankowski, 1998). Let (I) For where (II) For Then, for every non-negative integer n, we have On the Expected Value: To transform the sequence of interest, , into a Poisson model, we recall that in (25) we found Thus, the Poisson transform is To asymptotically evaluate this harmonic sum, we turn our attention to the Mellin Transform once more. The Mellin transform of is which has the fundamental strip . For , the inverse Mellin integral is the following where we define for . We emphasize that the above integral involves k, and k grows with n. We evaluate the integral through the saddle point analysis. Therefore, we choose the line of integration to cross the saddle point . To find the saddle point , we let , and we obtain and therefore, where . By (108) and the fact that for and , we can see that there are actually infinitely many saddle points of the form on the line of integration. We remark that the location of depends on the value of a. We have as , and as . We divide the analysis into three parts, for the three ranges , , and . In the first range, which corresponds to we perform a residue analysis, taking into account the dominant pole at . In the second range, we have and we get the asymptotic result through the saddle point method. The last range corresponds to and we approach it with a combination of residue analysis at , and the saddle point method. We now proceed by stating the proof of Theorem 3. We begin with proving part which requires a saddle point analysis. 
We rewrite the inverse Mellin transform with integration line at as

Step one: Saddle points' contribution to the integral estimation. First, we show that those saddle points with do not have a significant asymptotic contribution to the integral. To show this, we let Since as , we observe that which is very small for large n. Note that for , is decreasing, and bounded above by .

Step two: Partitioning the integral. There are now only finitely many saddle points to work with. We split the integral range into sub-intervals, each of which contains exactly one saddle point. This way, each integral has a contour traversing a single saddle point, and we will be able to estimate the dominant contribution in each integral from a small neighborhood around the saddle point. Assuming that is the largest j for which , we split the integral as follows By the same argument as in (115), the second term in (116) is also asymptotically negligible. Therefore, we are only left with where .

Step three: Splitting the saddle contour. For each integral , we write the expansion of about , as follows The main contribution to the integral estimate should come from a small integration path on which reduces to its quadratic expansion about . In other words, we want the integration path to be such that The above conditions are true when and . Thus, we choose the integration path to be . Therefore, we have

Saddle Tails Pruning. We show that the integral is small for . We define Note that for , we have where . Thus,

Central Approximation. Over the main path, the integrals are of the form We have and Therefore, by Laplace's theorem (refer to [22]) we obtain We finally sum over all j, and we get We can rewrite as where , and

For part , we move the line of integration to . Note that in this range, we must consider the contribution of the pole at . 
We have Computing the residue at , and following the same analysis as in part i for the above integral, we arrive at For part of Theorem 3, we shift the line of integration to , then we have where .

Step four: Asymptotic depoissonization. To show that both conditions in (15) hold for , we extend the real values z to complex values , where . To prove (103), we note that and therefore is absolutely convergent for . The same saddle point analysis applies here and we obtain where , and is as in (128). Condition (103) is therefore satisfied. To prove condition (104), we see that for a fixed k, Therefore, we have This completes the proof of Theorem 3. □

On the Second Factorial Moment: We poissonize the sequence as well. By the analysis in (27), which gives the following poissonized form We show that in all ranges of a the leftover sum in (138) has a lower order contribution to compared to . We define In the first range for k, we take the Mellin transform of , which is and we note that the fundamental strip for this Mellin transform of is as well. The inverse Mellin transform for is We note that this range of corresponds to The integrand in (141) is quite similar to the one seen in (107). The only difference is the extra term . However, we notice that is analytic and bounded. Thus, we obtain the same saddle points with the real part as in (109) and the same imaginary parts in the form of , . Thus, the same saddle point analysis for the integral in (107) applies to as well. We avoid repeating the similar steps, and we skip to the central approximation, where by Laplace's theorem (ref. [22]), we get which can be represented as where This shows that , when Subsequently, for , we get and for , we get It is not difficult to see that for each range of a as stated above, has a lower order contribution to the asymptotic expansion of , compared to . Therefore, this leads us to Theorem 4, which will be proved below. 
It only remains to show that the two depoissonization conditions hold: For condition (103) in Theorem 15, from (135) we have and for condition (104), we have, for fixed k, Therefore both depoissonization conditions are satisfied and the desired result follows. □

Corollary. A Remark on the Second Moment and the Variance. For the second moment we have Therefore, by (105) and (138) the Poisson transform of the second moment, which we denote by is which results in the same first order asymptotics as the second factorial moment. Also, it is not difficult to extend the proof in Section 3.4 to show that the second moments of the two models are asymptotically the same. For the variance we have Therefore the Poisson transform, which we denote by is The Mellin transform of the above function has the following form This is quite similar to what we saw in (106), which indicates that the variance has the same asymptotic growth as the expected value. But the variances of the two models do not behave in the same way (cf. Figure 3).
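The effect of poissonization on the expected value can be checked numerically with a small Python sketch. It rests on an assumption stated here rather than copied from the lost displays: if the exact mean in the independent model is the sum over words w of 1 - (1 - P(w))^n, then replacing n by a Poisson variable with mean z turns each term into 1 - exp(-z P(w)). The function names are hypothetical, and words are grouped by their number of 1s.

```python
from math import comb, expm1


def exact_mean(n, k, p):
    # Sum over words w of 1 - (1 - P(w))**n, grouped by the number of ones.
    q = 1.0 - p
    return sum(comb(k, j) * (1.0 - (1.0 - p**j * q**(k - j)) ** n)
               for j in range(k + 1))


def poisson_mean(z, k, p):
    # Poissonized counterpart: sum over words w of 1 - exp(-z * P(w)).
    # -expm1(-x) computes 1 - exp(-x) accurately for small x.
    q = 1.0 - p
    return sum(comb(k, j) * -expm1(-z * p**j * q**(k - j))
               for j in range(k + 1))


if __name__ == "__main__":
    n, k, p = 1000, 10, 0.6
    print(exact_mean(n, k, p), poisson_mean(n, k, p))
```

Evaluating the Poisson version at z = n gives a value very close to the exact mean for these parameters, which is the behavior that the depoissonization step makes rigorous.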

4. Summary and Conclusions

We studied the first-order asymptotic growth of the first two (factorial) moments of the kth Subword Complexity. We recall that the kth Subword Complexity of a string of length n is defined as the number of distinct subwords of length k that appear in the string. We are interested in the asymptotic analysis when k grows as a function of the string's length. More specifically, we conduct the analysis for $k = \Theta(\log n)$ as $n \to \infty$. The analysis is inspired by the earlier work of Jacquet and Szpankowski on the analysis of suffix trees, where they are compared to independent tries (cf. [14]). In our work, we compare the first two moments of the kth Subword Complexity to those of the kth Prefix Complexity in a random trie built over n independently generated binary strings. We recall that we define the kth Prefix Complexity as the number of distinct prefixes that appear in the trie at level k and lower. We obtain the generating functions representing the expected value and the second factorial moments as their coefficients, in both settings. We prove that the first two moments have the same asymptotic growth in both models. To derive the asymptotic behavior, we split the range for k into three intervals. We analyze each range using the saddle point method, in combination with residue analysis. We close our work with some remarks regarding the comparison of the second moment and the variance to those of the kth Prefix Complexity.

5. Future Challenges

The endpoints of the intervals for a in Theorems 3 and 4 are not investigated in this work. The asymptotic analysis at the endpoints can be studied using the van der Waerden saddle point method [24]. The analogous results are not (yet) known in the case where the underlying probability source has Markovian dependence or in the case of dynamical sources.

