Abstract
BACKGROUND: In computational biology, pattern statistics are commonly computed under a Markov model that accounts for the sequence composition. The parameters of this model usually have to be estimated. The aim of this paper is to determine how sensitive these statistics are to parameter estimation, and what consequences this variability has for pattern studies (finding the most over-represented words in a genome, the most significant words common to a set of sequences, ...).
Year: 2006 PMID: 17044916 PMCID: PMC1647278 DOI: 10.1186/1748-7188-1-17
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Empirical and theoretical distributions of . A sample of size 10 000 was used to obtain the empirical distribution. The solid line represents the density of (S, σ²). The Kolmogorov-Smirnov goodness-of-fit test gives D = 0.023, which corresponds to a p-value of p = 5.3 × 10⁻⁵. Nobs = 1221 and n = ℓ = 10 000.
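As a rough check on the figures quoted in this caption, the p-value can be approximated from D and the sample size alone. The sketch below uses the asymptotic Kolmogorov series; this is an assumption, since the source does not say how the p-value was computed, and the function name `kolmogorov_pvalue` is hypothetical.

```python
import math

def kolmogorov_pvalue(d, n, terms=100):
    """Asymptotic p-value of the one-sample Kolmogorov-Smirnov statistic.

    Uses the limiting series 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2)
    with lambda = sqrt(n) * d. This is an approximation for large n and is
    not necessarily the formula used in the source.
    """
    lam = math.sqrt(n) * d
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    # Clamp to [0, 1]: the truncated series can drift slightly for tiny lambda.
    return min(1.0, max(0.0, 2.0 * total))

# Values taken from the caption: D = 0.023 with a sample of size 10 000.
p = kolmogorov_pvalue(0.023, 10_000)
print(f"{p:.2e}")
```

With D rounded to three decimals, the result is of the order of 5 × 10⁻⁵, which is consistent with the reported p = 5.3 × 10⁻⁵ up to the rounding of D.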
Figure 2. Comparison of . is estimated with a sample of size 1 000, and Nobs takes its values from 900 to 1 900. The solid line represents the theoretical values and the circles the empirical ones. The statistic S is used on the x-axis. n = ℓ = 10 000.
Figure 3. Comparison of (. The circles represent σ(n) and the solid line (n). n∞ = 10⁶ was used to compute the values of A and B. Nobs = 1221 and ℓ = 10 000.
Comparison of the theoretical and empirical mean and standard deviation of the pattern statistic on Escherichia coli K12.
| Markov order m | theoretical mean | theoretical standard deviation | empirical mean | empirical standard deviation |
| 1 | 35.57 | 0.28 | 35.57 | 0.27 |
| 2 | 31.61 | 0.49 | 31.60 | 0.50 |
| 3 | 46.75 | 1.04 | 46.77 | 1.03 |
| 4 | 45.33 | 1.74 | 45.32 | 1.81 |
| 5 | 62.27 | 3.45 | 62.36 | 3.34 |
We consider the pattern W = acgtacgt with Nobs = 150. The sequence length is ℓ = 4 639 675; we use an order-m Markov model and a sample of size M = 1 000.
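The order-m Markov model referred to above has to be estimated from the sequence before any pattern statistic can be computed. A minimal sketch of the usual count-based (maximum-likelihood) estimator follows; the source does not specify the estimator, so this choice, the function name `estimate_markov` and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def estimate_markov(seq, m, alphabet="acgt"):
    """Estimate order-m transition probabilities by counting.

    Returns a dict mapping each observed context (a word of length m) to a
    dict {letter: estimated probability of that letter following the context}.
    Contexts never seen in `seq` are absent from the result.
    """
    counts = defaultdict(Counter)
    for i in range(len(seq) - m):
        context = seq[i:i + m]
        counts[context][seq[i + m]] += 1
    probs = {}
    for context, following in counts.items():
        total = sum(following.values())
        probs[context] = {a: following[a] / total for a in alphabet}
    return probs

# Toy sequence (hypothetical): the pattern acgt repeated three times.
model = estimate_markov("acgtacgtacgt", m=1)
print(model["a"])
```

The number of contexts grows as 4^m, so for higher orders each transition probability is estimated from fewer observations; this is the source of the estimation variability that the tables quantify.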
Comparison of the theoretical and empirical mean and standard deviation of the pattern statistic on Mycoplasma genitalium.
| Markov order m | theoretical mean | theoretical standard deviation | empirical mean | empirical standard deviation |
| 1 | 42.48 | 0.38 | 42.47 | 0.40 |
| 2 | 44.62 | 0.78 | 44.62 | 0.81 |
| 3 | 55.96 | 1.49 | 56.02 | 1.52 |
| 4 | 55.06 | 3.39 | 55.48 | 3.48 |
| 5 | 56.49 | 10.35 | 57.21* | 9.09* |
We consider the pattern W = acgtacgt with Nobs = 30. The sequence length is ℓ = 580 076; we use an order-m Markov model and a sample of size M = 1 000. (*) For 123 terms in the sample we got P (N) = 0 and hence S was not computed.
Comparison of the theoretical and empirical mean and standard deviation of the pattern statistics on Mycoplasma genitalium.
| | theoretical | binomial | compound Poisson | large deviations |
| mean | 55.96 | 56.05 | 55.42 | 54.27 |
| standard deviation | 1.49 | 1.47 | 1.45 | 1.43 |
We consider the pattern W = acgtacgt with Nobs = 30. The sequence length is ℓ = 580 076; we use an order m = 3 Markov model and a sample of size M = 1 000. The empirical pattern statistics are computed (from left to right) through the binomial, compound Poisson and large deviations approximations.
Mean true positive rate and rank accordance rate in Escherichia coli K12.
| Markov order | 1 | 2 | 3 | 4 | 5 | 6 |
| TP rate | 99.0% | 98.0% | 97.9% | 94.4% | 82.1% | 47.6% |
| RA rate | 99.0% | 95.5% | 91.5% | 83.9% | 68.0% | 36.5% |
| Sample size per free parameter (× 10³) | 383.33 | 95.83 | 23.96 | 5.99 | 1.50 | 0.37 |
Both quantities are estimated with 1 000 simulations. We consider the 100 most over-represented octamers; the sequence length is ℓ = 4 639 675. The last row gives the sample size per free parameter (length n of the sequence divided by the number k^m(k − 1) of parameters, where k = 4 is the alphabet size and m the Markov order).
Mean true positive rate and rank accordance rate in Mycoplasma genitalium.
| Markov order | 1 | 2 | 3 | 4 | 5 | 6 |
| TP rate | 95.5% | 93.6% | 90.4% | 81.8% | 66.0% | 25.0% |
| RA rate | 92.6% | 85.4% | 79.8% | 66.5% | 45.1% | 11.0% |
| Sample size per free parameter (× 10³) | 48.33 | 12.08 | 3.02 | 0.76 | 0.19 | 0.05 |
Both quantities are estimated with 1 000 simulations. We consider the 100 most over-represented octamers; the sequence length is ℓ = 580 076. The last row gives the sample size per free parameter (length n of the sequence divided by the number k^m(k − 1) of parameters, where k = 4 is the alphabet size and m the Markov order).
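The last row of the two tables above can be recomputed directly. The sketch below assumes that an order-m Markov model over k = 4 letters has k^m(k − 1) free parameters; the function name `sample_size_per_parameter` is hypothetical. The tabulated values appear to be reproduced when the sequence lengths are rounded to 4.6 × 10⁶ and 5.8 × 10⁵, which is an observation from the arithmetic rather than something stated in the source.

```python
def sample_size_per_parameter(n, m, k=4):
    """Sequence length n divided by the k**m * (k - 1) free parameters
    of an order-m Markov model over an alphabet of k letters."""
    return n / (k ** m * (k - 1))

# Rounded sequence lengths (assumption: the tables seem to use these).
lengths = {"Escherichia coli K12": 4.6e6, "Mycoplasma genitalium": 5.8e5}

for name, n in lengths.items():
    # Report the values in thousands, as in the tables.
    row = [round(sample_size_per_parameter(n, m) / 1e3, 2) for m in range(1, 7)]
    print(name, row)
```

Each increase of the Markov order by one divides the sample size per parameter by 4, which goes together with the drop in the TP and RA rates at orders 5 and 6.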