Calculating the exact probability of language-like patterns in biomolecular sequences.

K Atteson.

Abstract

We present algorithms for the exact computation of the probability that a random string of a given length matches a given regular expression. These algorithms can be used to determine statistical significance in a variety of pattern searches, such as motif searches and gene-finding. This work improves upon the work of Kleffe and Langbecker (Kleffe & Langbecker 1990) and of Sewell and Durbin (Sewell & Durbin 1995) in several ways. First, in many cases of interest, the algorithms presented here are faster. Second, the class of patterns considered here strictly includes those of both previous works and also allows, for instance, gaps of arbitrary length. Third, the class of probability models that can be used is more general than that of Sewell and Durbin, allowing for Markov chains. The problem solved in this work is in fact NP-hard, and NP-hard problems are believed to be intractable. However, the problem is fixed-parameter tractable, meaning that it is tractable for small patterns. The problem is also computationally feasible for many patterns which occur in practice. As a sample application, we consider calculating the statistical significance of most of the PROSITE patterns, as in Sewell and Durbin. Whereas their method was only fast enough to compute the exact probabilities for sequences of length 13 greater than the pattern length, we calculate these probabilities for sequences of length up to 2000. In addition, we calculate most of these probabilities using a first-order Markov chain. Most of the PROSITE patterns have high significance at length 2000 under both the i.i.d. and Markov chain models. As further applications, we demonstrate the calculation of the probability of a PROSITE pattern occurring on either strand of a random DNA sequence of up to 500 kilobases, and of the probability of a simple gene model occurring in a random sequence of up to 1 megabase.
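
The abstract does not spell out the algorithms themselves. As a rough illustration of the kind of quantity being computed, the sketch below uses one standard textbook approach, which may differ from the paper's method: build a deterministic automaton for the pattern and propagate the probability mass over its states one character at a time. It handles only the simplest case, a fixed word occurring anywhere in an i.i.d. random string, not general regular expressions or Markov chain models. The function names `kmp_automaton` and `match_probability` are hypothetical and chosen for this example only.

```python
from fractions import Fraction


def kmp_automaton(pattern, alphabet):
    """Build a DFA for "the string contains `pattern`".

    State k means the longest pattern prefix that is a suffix of the
    text read so far has length k. State len(pattern) is absorbing.
    """
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for state in range(m):
        for ch in alphabet:
            s = pattern[:state] + ch
            # Longest k such that pattern[:k] is a suffix of s.
            k = min(m, len(s))
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            delta[state][ch] = k
    for ch in alphabet:
        delta[m][ch] = m  # once matched, stay matched
    return delta


def match_probability(pattern, n, probs):
    """Exact probability that an i.i.d. string of length n contains pattern.

    `probs` maps each alphabet character to its probability (Fractions
    keep the arithmetic exact).
    """
    m = len(pattern)
    delta = kmp_automaton(pattern, list(probs))
    dist = [Fraction(0)] * (m + 1)
    dist[0] = Fraction(1)  # start state, nothing read yet
    for _ in range(n):
        nxt = [Fraction(0)] * (m + 1)
        for state, mass in enumerate(dist):
            if mass:
                for ch, p in probs.items():
                    nxt[delta[state][ch]] += mass * p
        dist = nxt
    return dist[m]


uniform_dna = {c: Fraction(1, 4) for c in "ACGT"}
print(match_probability("ACG", 4, uniform_dna))
```

The running time of this sketch is proportional to the sequence length times the number of automaton states times the alphabet size. For a general regular expression the deterministic automaton can be exponentially larger than the pattern, which is in keeping with the abstract's remark that the problem is hard in general but tractable for small patterns.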

Year:  1998        PMID: 9783205

Source DB:  PubMed          Journal:  Proc Int Conf Intell Syst Mol Biol        ISSN: 1553-0833


