Alignment-free sequence comparison (II): theoretical power of comparison statistics.

Lin Wan, Gesine Reinert, Fengzhu Sun, Michael S Waterman.

Abstract

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models: the first assumes that the two sequences share a common motif, whereas the second is a pattern transfer model. The null model is that the two sequences are independent of each other and each is composed of independent and identically distributed letters. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
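
As a rough illustration of the three count statistics described above, the following Python sketch computes D2, D2* and D2S for two sequences under an i.i.d. letter model. The abstract does not state the formulas, so the centering and scaling used here are an assumption based on the commonly cited definitions of these statistics; the function names `kmer_counts` and `d2_statistics` are hypothetical and are not taken from the authors' program.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (word) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(x, y, k, letter_probs):
    """Return (D2, D2*, D2S) for sequences x and y.

    Assumed definitions (not given in the abstract):
      D2  = sum_w X_w * Y_w
      D2* = sum_w Xc_w * Yc_w / (sqrt(n_bar * m_bar) * p_w)
      D2S = sum_w Xc_w * Yc_w / sqrt(Xc_w**2 + Yc_w**2)
    where Xc_w = X_w - n_bar * p_w is the centralized count and p_w is the
    probability of word w under the i.i.d. null model.
    Words containing letters outside letter_probs are ignored.
    """
    n_bar = len(x) - k + 1  # number of k-tuples in x
    m_bar = len(y) - k + 1  # number of k-tuples in y
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)

    d2 = d2_star = d2_s = 0.0
    for letters in product(sorted(letter_probs), repeat=k):
        w = "".join(letters)
        p_w = prod(letter_probs[a] for a in letters)
        xw, yw = cx[w], cy[w]
        xc = xw - n_bar * p_w  # centralized count in x
        yc = yw - m_bar * p_w  # centralized count in y
        d2 += xw * yw
        d2_star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
        denom = sqrt(xc ** 2 + yc ** 2)
        if denom > 0:  # skip words whose centralized counts are both zero
            d2_s += xc * yc / denom
    return d2, d2_star, d2_s


if __name__ == "__main__":
    uniform = {a: 0.25 for a in "ACGT"}
    print(d2_statistics("ACGTACGT", "ACGTTTTT", 2, uniform))
```

The loop enumerates all words of length k over the alphabet, which is practical only for small k; iterating over the observed words alone would not be enough for D2* and D2S, because words absent from both sequences still contribute through their centralized counts.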

Year: 2010 PMID: 20973742 PMCID: PMC3123933 DOI: 10.1089/cmb.2010.0056

Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479
