Alignment-free sequence comparison (II): theoretical power of comparison statistics.

Lin Wan, Gesine Reinert, Fengzhu Sun, Michael S Waterman.

Abstract

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
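
As an illustration of the three count statistics named in the abstract, the following is a minimal sketch, not the authors' program. The abstract gives no formulas, so the definitions used here are an assumption taken from the usual statement of these statistics in the companion paper (Alignment-free sequence comparison (I)): with X_w and Y_w the counts of k-tuple w in the two sequences, centralized counts X_w - n̄·p_w and Y_w - m̄·p_w (n̄ = n - k + 1, m̄ = m - k + 1, p_w the product of the letter probabilities of w under the i.i.d. null model), D2 sums X_w·Y_w, D2* sums the centralized products divided by sqrt(n̄·m̄)·p_w, and D2S sums the centralized products divided by sqrt of the sum of their squares. The function names `kmer_counts` and `d2_statistics` and the `letter_probs` argument are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count overlapping k-tuples in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(seq_a, seq_b, k, letter_probs):
    """Return (D2, D2*, D2S) under an assumed i.i.d. letter model.

    letter_probs maps each letter to its (strictly positive) probability.
    The formulas are assumptions recalled from the literature, not taken
    from the abstract itself.
    """
    x = kmer_counts(seq_a, k)
    y = kmer_counts(seq_b, k)
    n_bar = len(seq_a) - k + 1  # number of k-tuple positions in seq_a
    m_bar = len(seq_b) - k + 1  # number of k-tuple positions in seq_b

    d2 = d2_star = d2_s = 0.0
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_w = prod(letter_probs[c] for c in letters)
        x_w, y_w = x[word], y[word]
        x_c = x_w - n_bar * p_w  # centralized count in seq_a
        y_c = y_w - m_bar * p_w  # centralized count in seq_b

        d2 += x_w * y_w
        d2_star += x_c * y_c / (sqrt(n_bar * m_bar) * p_w)
        norm = sqrt(x_c ** 2 + y_c ** 2)
        if norm > 0:  # skip words whose centralized counts are both zero
            d2_s += x_c * y_c / norm
    return d2, d2_star, d2_s


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(d2_statistics("ACGTACGT", "ACGTTGCA", 2, uniform))
```

For the two toy sequences above with k = 2, the shared 2-tuples are AC, CG and GT, each occurring twice in the first sequence and once in the second, so D2 = 6. The loop enumerates all |alphabet|^k words, which is only practical for small k; a real implementation would iterate over observed words and handle the unobserved ones in closed form.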

Year:  2010        PMID: 20973742      PMCID: PMC3123933          DOI: 10.1089/cmb.2010.0056

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  10 in total

Review 1.  DNA binding sites: representation and discovery.

Authors:  G D Stormo
Journal:  Bioinformatics       Date:  2000-01       Impact factor: 6.937

2.  Distributional regimes for the number of k-word matches between two random sequences.

Authors:  Ross A Lippert; Haiyan Huang; Michael S Waterman
Journal:  Proc Natl Acad Sci U S A       Date:  2002-10-08       Impact factor: 11.205

3.  JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors:  Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

4.  Linguistic features of noncoding DNA sequences.

Authors:  R N Mantegna; S V Buldyrev; A L Goldberger; S Havlin; C K Peng; M Simons; H E Stanley
Journal:  Phys Rev Lett       Date:  1994-12-05       Impact factor: 9.161

5.  The power of detecting enriched patterns: an HMM approach.

Authors:  Zhiyuan Zhai; Shih-Yen Ku; Yihui Luan; Gesine Reinert; Michael S Waterman; Fengzhu Sun
Journal:  J Comput Biol       Date:  2010-04       Impact factor: 1.479

6.  A statistical method for alignment-free comparison of regulatory sequences.

Authors:  Miriam R Kantorovitz; Gene E Robinson; Saurabh Sinha
Journal:  Bioinformatics       Date:  2007-07-01       Impact factor: 6.937

7.  Alignment-free sequence comparison (I): statistics and power.

Authors:  Gesine Reinert; David Chew; Fengzhu Sun; Michael S Waterman
Journal:  J Comput Biol       Date:  2009-12       Impact factor: 1.479

8.  Characterizing the D2 statistic: word matches in biological sequences.

Authors:  Sylvain Forêt; Susan R Wilson; Conrad J Burden
Journal:  Stat Appl Genet Mol Biol       Date:  2009-10-08

9.  Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.

Authors:  Sylvain Forêt; Miriam R Kantorovitz; Conrad J Burden
Journal:  BMC Bioinformatics       Date:  2006-12-18       Impact factor: 3.169

10.  Computational discovery of cis-regulatory modules in Drosophila without prior knowledge of motifs.

Authors:  Andra Ivan; Marc S Halfon; Saurabh Sinha
Journal:  Genome Biol       Date:  2008-01-28       Impact factor: 13.583

  47 in total

1.  A geometric interpretation for local alignment-free sequence comparison.

Authors:  Ehsan Behnam; Michael S Waterman; Andrew D Smith
Journal:  J Comput Biol       Date:  2013-07       Impact factor: 1.479

2.  Biological intuition in alignment-free methods: response to Posada.

Authors:  Mark A Ragan; Cheong Xin Chan
Journal:  J Mol Evol       Date:  2013-07-23       Impact factor: 2.395

Review 3.  Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.

Authors:  Oliver Bonham-Carter; Joe Steele; Dhundy Bastola
Journal:  Brief Bioinform       Date:  2013-07-31       Impact factor: 11.622

4.  Multiple alignment-free sequence comparison.

Authors:  Jie Ren; Kai Song; Fengzhu Sun; Minghua Deng; Gesine Reinert
Journal:  Bioinformatics       Date:  2013-08-29       Impact factor: 6.937

Review 5.  New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.

Authors:  Kai Song; Jie Ren; Gesine Reinert; Minghua Deng; Michael S Waterman; Fengzhu Sun
Journal:  Brief Bioinform       Date:  2013-09-23       Impact factor: 11.622

6.  The distribution of word matches between Markovian sequences with periodic boundary conditions.

Authors:  Conrad J Burden; Paul Leopardi; Sylvain Forêt
Journal:  J Comput Biol       Date:  2013-10-26       Impact factor: 1.479

7.  Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences.

Authors:  Chris-Andre Leimeister; Jendrik Schellhorn; Svenja Dörrer; Michael Gerth; Christoph Bleidorn; Burkhard Morgenstern
Journal:  Gigascience       Date:  2019-03-01       Impact factor: 6.524

8.  Phenetic Comparison of Prokaryotic Genomes Using k-mers.

Authors:  Maxime Déraspe; Frédéric Raymond; Sébastien Boisvert; Alexander Culley; Paul H Roy; François Laviolette; Jacques Corbeil
Journal:  Mol Biol Evol       Date:  2017-10-01       Impact factor: 16.240

9.  A new statistic for efficient detection of repetitive sequences.

Authors:  Sijie Chen; Yixin Chen; Fengzhu Sun; Michael S Waterman; Xuegong Zhang
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

10.  Inferring Phylogenomic Relationship of Microbes Using Scalable Alignment-Free Methods.

Authors:  Guillaume Bernard; Timothy G Stephens; Raúl A González-Pech; Cheong Xin Chan
Journal:  Methods Mol Biol       Date:  2021