The distribution of word matches between Markovian sequences with periodic boundary conditions.

Conrad J Burden, Paul Leopardi, Sylvain Forêt.

Abstract

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
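
As an illustration of the statistic described in the abstract, the following minimal Python sketch counts exact k-letter word matches between two sequences, treating each sequence as circular so that words may wrap around the end. The function names `cyclic_word_counts` and `d2_periodic` are hypothetical, and the wrap-around convention is an assumption about what "periodic boundary conditions" means here; the paper's own definitions may differ in detail.

```python
from collections import Counter


def cyclic_word_counts(seq: str, k: int) -> Counter:
    """Count the k-letter words of seq, reading seq as a circle.

    Assumption: a periodic boundary means a word may start at any of the
    len(seq) positions and wrap around past the last letter.
    """
    if k < 1 or k > len(seq):
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    # Append the first k-1 letters so that wrapped words can be sliced directly.
    extended = seq + seq[: k - 1]
    return Counter(extended[i : i + k] for i in range(len(seq)))


def d2_periodic(a: str, b: str, k: int) -> int:
    """Number of (i, j) pairs whose k-words in a and b match exactly."""
    counts_a = cyclic_word_counts(a, k)
    counts_b = cyclic_word_counts(b, k)
    # Each word w contributes count_a(w) * count_b(w) matching pairs.
    return sum(n * counts_b[w] for w, n in counts_a.items())


if __name__ == "__main__":
    print(d2_periodic("ACGT", "ACGT", 2))
```

Summing products of word counts gives the same value as comparing every pair of start positions, but it avoids the quadratic loop over position pairs.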

Year:  2013        PMID: 24160839      PMCID: PMC3880068          DOI: 10.1089/cmb.2012.0277

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


