Distributional regimes for the number of k-word matches between two random sequences.

Ross A Lippert, Haiyan Huang, Michael S Waterman.

Abstract

When comparing two sequences, a natural approach is to count the number of k-letter words the two sequences have in common. No positional information is used in the count, but it has the virtue that the comparison time is linear in sequence length. For this reason, this statistic, D(2), and certain transformations of D(2) are used for EST sequence database searches. In this paper we begin the rigorous study of the statistical distribution of D(2). Using an independence model of DNA sequences, we derive limiting distributions by means of the Stein and Chen-Stein methods and identify three asymptotic regimes, including compound Poisson and normal. The compound Poisson distribution arises when the word size k is large and word matches are rare. The normal distribution arises when the word size is small and matches are common. Explicit expressions for what is meant by large and small word sizes are given in the paper. However, when the word size is small and the letters are uniformly distributed, the anticipated limiting normal distribution does not always occur. In this situation, the uniform distribution is the exception among letter distributions. Therefore a naive one-distribution-fits-all approach to D(2) statistics could easily create serious errors in estimating significance.
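The word count described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes D(2) counts every pair of positions (one in each sequence) at which the same k-letter word starts, which equals the sum over words of the product of that word's counts in the two sequences. The function names `kword_counts` and `d2` are hypothetical.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Assumed definition: the number of position pairs (i, j) such that
    the k-word starting at i in seq_a equals the k-word starting at j
    in seq_b. Summing count products over words gives the same value.
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # A Counter returns 0 for words that never occur in seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Each sequence is scanned once to build its word table, so the work grows linearly with sequence length for fixed k, and no alignment or positional information is involved.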

Substances: DNA

Year: 2002 PMID: 12374863 PMCID: PMC137823 DOI: 10.1073/pnas.202468099

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

1. STACK: Sequence Tag Alignment and Consensus Knowledgebase.

2. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition.

3. Basic local alignment search tool.

4. Approximations to profile score distributions.

5. Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains.

6. A measure of the similarity of sets of sequences not requiring sequence alignment.

7. Statistical method for rapid homology search.

8. Identification of common molecular subsequences.

9. d2_cluster: a validated method for clustering EST and full-length cDNA sequences.

10. AsMamDB: an alternative splice database of mammals.

1. Alignment-free sequence comparison (II): theoretical power of comparison statistics.

2. A geometric interpretation for local alignment-free sequence comparison.

3. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. (Review)

4. Multiple alignment-free sequence comparison.

5. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. (Review)

6. The distribution of word matches between Markovian sequences with periodic boundary conditions.

7. SEME: a fast mapper of Illumina sequencing reads with statistical evaluation.

8. Alignment-free sequence comparison (I): statistics and power.

9. A new statistic for efficient detection of repetitive sequences.

10. New powerful statistics for alignment-free sequence comparison under a pattern transfer model.