The distribution of word matches between Markovian sequences with periodic boundary conditions.

Conrad J Burden, Paul Leopardi, Sylvain Forêt.

Abstract

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
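As an illustration of the word-match count described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes that "periodic boundary conditions" means each sequence is treated as circular, so that words may wrap from the end of a sequence back to its start and a sequence of length n contains exactly n words of length k. The function names `cyclic_word_counts` and `d2_periodic` are hypothetical and chosen only for this example.

```python
from collections import Counter


def cyclic_word_counts(seq: str, k: int) -> Counter:
    """Count all length-k words in seq, treating seq as circular.

    Assumption: periodic boundaries mean a word may wrap around the end
    of the sequence, giving exactly len(seq) words.
    """
    if not 1 <= k <= len(seq):
        raise ValueError("word length k must satisfy 1 <= k <= len(seq)")
    # Append the first k-1 letters so that wrap-around words can be sliced.
    wrapped = seq + seq[: k - 1]
    return Counter(wrapped[i : i + k] for i in range(len(seq)))


def d2_periodic(a: str, b: str, k: int) -> int:
    """Number of exact k-word matches between two circular sequences.

    Every pair (position in a, position in b) whose words are identical
    contributes one match, so the total is the sum over words of the
    product of the two word counts.
    """
    counts_a = cyclic_word_counts(a, k)
    counts_b = cyclic_word_counts(b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    # Circular 2-words of ACGT are AC, CG, GT, TA; of ACGA are AC, CG, GA, AA.
    print(d2_periodic("ACGT", "ACGA", 2))
```

Counting words once per sequence and multiplying the counts avoids comparing every pair of positions directly, which keeps the sketch practical for longer sequences.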

Year: 2013. PMID: 24160839. PMCID: PMC3880068. DOI: 10.1089/cmb.2012.0277

Source DB: PubMed. Journal: J Comput Biol. ISSN: 1066-5277. Impact factor: 1.479
