
Distributional regimes for the number of k-word matches between two random sequences.

Ross A Lippert, Haiyan Huang, Michael S Waterman.

Abstract

When comparing two sequences, a natural approach is to count the number of k-letter words the two sequences have in common. The count uses no positional information, but it has the virtue that comparison time is linear in sequence length. For this reason, the statistic D(2) and certain transformations of D(2) are used for EST sequence database searches. In this paper we begin the rigorous study of the statistical distribution of D(2). Using an independence model of DNA sequences, we derive limiting distributions by means of the Stein and Chen-Stein methods and identify three asymptotic regimes, including compound Poisson and normal. The compound Poisson distribution arises when the word size k is large and word matches are rare. The normal distribution arises when the word size is small and matches are common. Explicit expressions for what is meant by large and small word sizes are given in the paper. However, when the word size is small and the letters are uniformly distributed, the anticipated limiting normal distribution does not always occur; in this respect the uniform distribution is an exception among letter distributions. Therefore, a naive one-distribution-fits-all approach to D(2) statistics could easily create serious errors in estimating significance.
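To make the counting step concrete, here is a minimal Python sketch of the k-word match count. It assumes the usual reading of D(2) as the number of position pairs (i, j) at which the k-words of the two sequences agree, which equals the sum over words of the product of their occurrence counts. The function names `kmer_counts`, `d2`, and `expected_d2` are illustrative and do not come from the paper. The helper `expected_d2` further assumes that both sequences have independent letters drawn from the same distribution, so that two k-words match with probability (sum of squared letter probabilities) raised to the power k.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-letter word in seq (one linear pass)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Assumed definition: the sum over words w of
    count_a(w) * count_b(w), i.e. the number of position pairs
    (i, j) whose k-words are identical.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Iterate over the smaller table; a Counter returns 0 for missing words.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(n * counts_b[w] for w, n in counts_a.items())


def expected_d2(n, m, k, letter_probs):
    """Mean of d2 for sequences of lengths n and m under the assumed
    model of independent letters with one shared distribution."""
    p_match = sum(p * p for p in letter_probs.values())
    return max(n - k + 1, 0) * max(m - k + 1, 0) * p_match ** k
```

For example, `d2("AAAA", "AAA", 2)` evaluates to 6, because the word AA occurs three times in the first sequence and twice in the second. A small value of `expected_d2` corresponds to the rare-match setting the abstract associates with large k, and a large value to the common-match setting associated with small k; the precise thresholds are given in the paper, not in this sketch.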

Year:  2002        PMID: 12374863      PMCID: PMC137823          DOI: 10.1073/pnas.202468099

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


  32 in total

1.  Alignment-free sequence comparison (II): theoretical power of comparison statistics.

Authors:  Lin Wan; Gesine Reinert; Fengzhu Sun; Michael S Waterman
Journal:  J Comput Biol       Date:  2010-10-25       Impact factor: 1.479

2.  A geometric interpretation for local alignment-free sequence comparison.

Authors:  Ehsan Behnam; Michael S Waterman; Andrew D Smith
Journal:  J Comput Biol       Date:  2013-07       Impact factor: 1.479

Review 3.  Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.

Authors:  Oliver Bonham-Carter; Joe Steele; Dhundy Bastola
Journal:  Brief Bioinform       Date:  2013-07-31       Impact factor: 11.622

4.  Multiple alignment-free sequence comparison.

Authors:  Jie Ren; Kai Song; Fengzhu Sun; Minghua Deng; Gesine Reinert
Journal:  Bioinformatics       Date:  2013-08-29       Impact factor: 6.937

Review 5.  New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.

Authors:  Kai Song; Jie Ren; Gesine Reinert; Minghua Deng; Michael S Waterman; Fengzhu Sun
Journal:  Brief Bioinform       Date:  2013-09-23       Impact factor: 11.622

6.  The distribution of word matches between Markovian sequences with periodic boundary conditions.

Authors:  Conrad J Burden; Paul Leopardi; Sylvain Forêt
Journal:  J Comput Biol       Date:  2013-10-26       Impact factor: 1.479

7.  SEME: a fast mapper of Illumina sequencing reads with statistical evaluation.

Authors:  Shijian Chen; Anqi Wang; Lei M Li
Journal:  J Comput Biol       Date:  2013-11       Impact factor: 1.479

8.  Alignment-free sequence comparison (I): statistics and power.

Authors:  Gesine Reinert; David Chew; Fengzhu Sun; Michael S Waterman
Journal:  J Comput Biol       Date:  2009-12       Impact factor: 1.479

9.  A new statistic for efficient detection of repetitive sequences.

Authors:  Sijie Chen; Yixin Chen; Fengzhu Sun; Michael S Waterman; Xuegong Zhang
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

10.  New powerful statistics for alignment-free sequence comparison under a pattern transfer model.

Authors:  Xuemei Liu; Lin Wan; Jing Li; Gesine Reinert; Michael S Waterman; Fengzhu Sun
Journal:  J Theor Biol       Date:  2011-06-25       Impact factor: 2.691
