Mireille Régnier, Evgenia Furletova, Victor Yakovlev, Mikhail Roytberg.
Abstract
BACKGROUND: Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability of finding at least S occurrences of words from a pattern in a random text of length N generated according to a given probability model. All words of the pattern are assumed to have the same length.
Keywords: Hidden Markov model; P-value; PSSM (PWM); Pattern occurrences
Year: 2014 PMID: 25648087 PMCID: PMC4307674 DOI: 10.1186/s13015-014-0025-1
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
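The P-value defined in the abstract can be illustrated on toy inputs. The sketch below is a naive dynamic programming baseline under a Bernoulli (independent letters) model; it is not the SUFPREF algorithm, and the function name `pattern_pvalue` and its parameters are hypothetical. It tracks the last m-1 letters of the text together with the number of occurrences seen so far, capped at S.

```python
from collections import defaultdict

def pattern_pvalue(words, probs, n, s):
    """Probability of at least s occurrences of pattern words in a random
    text of length n with independent letters (Bernoulli model).

    words: pattern words, all of the same length m
    probs: dict mapping each letter to its probability
    """
    words = set(words)
    m = len(next(iter(words)))
    assert all(len(w) == m for w in words), "words must have equal length"

    # State: (last up to m-1 letters, number of occurrences capped at s).
    dist = {("", 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (tail, k), p in dist.items():
            for letter, p_letter in probs.items():
                window = tail + letter
                # Words have equal length, so at most one can end here.
                hit = len(window) == m and window in words
                k_new = min(s, k + 1) if hit else k
                new_tail = window[-(m - 1):] if m > 1 else ""
                nxt[(new_tail, k_new)] += p * p_letter
        dist = nxt
    return sum(p for (_, k), p in dist.items() if k >= s)

# Toy example: two-letter alphabet, pattern {AC, CA}, text length 3.
uniform = {"A": 0.5, "C": 0.5}
print(pattern_pvalue({"AC", "CA"}, uniform, n=3, s=1))  # all texts except AAA, CCC
print(pattern_pvalue({"AC", "CA"}, uniform, n=3, s=2))  # only ACA and CAC
```

The state space grows with the number of distinct (m-1)-letter tails, so this baseline is only practical for short words and small alphabets.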
Figure 1. Overlap graph for pattern {…, CTTTCGC, TACCACA}. Nodes are the elements of […]. The node with index number “1” corresponds to ε; it is the root. Left edges are shown by continuous straight lines, right edges by dashed lines, and deep edges by double lines. Each left edge (lpred(w), w), where […], is labeled with Back(w). For example, edge (2,3), corresponding to the pair of overlaps (A, ACA), is labeled with Back(ACA) = CA. A deep edge (w, r) corresponds to equivalence class […]. The right edges (w, rpred(w)) are not labeled.
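The quantities named in the caption can be sketched in a few lines, under assumed definitions that are not spelled out in this excerpt: an overlap is a word (possibly empty) that is a proper prefix of some pattern word and a proper suffix of some pattern word; lpred(w) and rpred(w) are the longest overlaps that are a proper prefix and a proper suffix of w; and Back(w) is what remains of w after removing lpred(w). The first word of the example pattern, `ACATATA`, is hypothetical and was chosen only so that A and ACA are overlaps, as in the caption's example.

```python
def overlaps(words):
    """Words that are a proper prefix of some pattern word and a proper
    suffix of some pattern word (the empty word is included)."""
    prefixes = {w[:i] for w in words for i in range(len(w))}
    suffixes = {w[i:] for w in words for i in range(1, len(w) + 1)}
    return prefixes & suffixes

def lpred(w, ov):
    """Longest proper prefix of a non-empty word w that is an overlap."""
    return max((w[:i] for i in range(len(w)) if w[:i] in ov), key=len)

def rpred(w, ov):
    """Longest proper suffix of a non-empty word w that is an overlap."""
    return max((w[i:] for i in range(1, len(w) + 1) if w[i:] in ov), key=len)

def back(w, ov):
    """Part of w that follows lpred(w), so that w = lpred(w) + back(w)."""
    return w[len(lpred(w, ov)):]

# Hypothetical pattern: only the last two words come from the caption.
pattern = {"ACATATA", "CTTTCGC", "TACCACA"}
ov = overlaps(pattern)
print(sorted(ov, key=lambda w: (len(w), w)))
print(lpred("ACA", ov), back("ACA", ov), rpred("ACA", ov))
```

With this toy pattern, `back("ACA", ov)` gives `CA`, which agrees with the label on edge (2,3) described in the caption.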
Table 1. PSSM-based patterns of length 12
| Pattern | | | | | | P-value |
|---|---|---|---|---|---|---|
| PSSM(12,9.63) | 0.00001 | 169 | 14 | 57 | 468 | 2.1887831E-27 |
| PSSM(12,8.69) | 0.00003 | 503 | 22 | 125 | 1123 | 9.9588634E-22 |
| PSSM(12,7.41) | 0.0001 | 1682 | 49 | 395 | 3189 | 2.1630650E-16 |
| PSSM(12,5.89) | 0.0003 | 5045 | 157 | 1789 | 9070 | 3.9649240E-12 |
| PSSM(12,4.01) | 0.001 | 16835 | 488 | 8967 | 29297 | 2.0930535E-07 |
| PSSM(12,2.04) | 0.003 | 50490 | 1417 | 35313 | 83016 | 0.001494591 |
The number x in “PSSM(12,x)” denotes the cut-off. The P-values are given with respect to the text length and probability models described in the text of the paper. The intermediate values in the second column (0.003, 0.0003, etc., instead of the more common 0.005, 0.0005, etc.) were chosen to obtain a more homogeneous log scale.
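To make the role of the cut-off concrete, the sketch below assumes that a PSSM-based pattern consists of all words whose total PSSM score is at least the cut-off. The function name `pssm_pattern` and the three-column toy matrix are hypothetical; the patterns in the table have length 12 and use a matrix that is not reproduced here.

```python
from itertools import product

def pssm_pattern(pssm, cutoff, alphabet="ACGT"):
    """Return all words whose summed PSSM score is at least `cutoff`.

    pssm: list of dicts, one per position, mapping letter -> score
    """
    words = []
    for letters in product(alphabet, repeat=len(pssm)):
        score = sum(column[a] for column, a in zip(pssm, letters))
        if score >= cutoff:
            words.append("".join(letters))
    return words

# Hypothetical toy matrix of length 3 (integer scores for exact comparison).
toy_pssm = [
    {"A": 1, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": 0, "C": -1, "G": 1, "T": -1},
]

print(pssm_pattern(toy_pssm, cutoff=3))  # words scoring 3 or more
print(pssm_pattern(toy_pssm, cutoff=4))  # a higher cut-off keeps fewer words
```

Exhaustive enumeration visits every word of the given length, so for length 12 over a four-letter alphabet a pruned search would be preferable; the sketch only shows how lowering the cut-off enlarges the pattern.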
Figure 2. Average memory usage of SUFPREF. The details of the experiments are given in [Additional file 4]. The computer environment is described in the subsection “Comparison with the existing algorithms”.
Figure 3. Average run-time of SUFPREF. The details of the experiments are given in [Additional file 4]. The computer environment is described in the subsection “Comparison with the existing algorithms”.
Table 2. Comparison of running time and used space of SUFPREF and AHOPRO programs for PSSM-based patterns of length 12
| Pattern | | Model | SUFPREF time | AHOPRO time | Time ratio | SUFPREF space | AHOPRO space | Space ratio |
|---|---|---|---|---|---|---|---|---|
| PSSM(12,9.63) | 0.00001 | Bernoulli | 0.02 | 0.37 | 20.39 | 0.44 | 0.59 | 1.36 |
| PSSM(12,8.69) | 0.00003 | Bernoulli | 0.03 | 0.90 | 32.00 | 0.5 | 0.97 | 1.94 |
| PSSM(12,7.41) | 0.0001 | Bernoulli | 0.07 | 2.60 | 37.64 | 0.69 | 1.88 | 2.74 |
| PSSM(12,5.89) | 0.0003 | Bernoulli | 0.27 | 7.64 | 28.10 | 1.21 | 4.97 | 4.11 |
| PSSM(12,4.01) | 0.001 | Bernoulli | 1.27 | 26.15 | 20.61 | 3.01 | 15.28 | 5.07 |
| PSSM(12,2.04) | 0.003 | Bernoulli | 4.99 | 78.37 | 15.70 | 7.75 | 42.61 | 5.50 |
| PSSM(12,9.63) | 0.00001 | Markov | 0.03 | 0.38 | 15.12 | 0.47 | 0.62 | 1.32 |
| PSSM(12,8.69) | 0.00003 | Markov | 0.05 | 0.91 | 18.65 | 0.53 | 0.97 | 1.84 |
| PSSM(12,7.41) | 0.0001 | Markov | 0.11 | 2.64 | 23.13 | 0.71 | 1.91 | 2.67 |
| PSSM(12,5.89) | 0.0003 | Markov | 0.41 | 7.74 | 18.78 | 1.24 | 5.02 | 4.04 |
| PSSM(12,4.01) | 0.001 | Markov | 1.77 | 26.50 | 14.95 | 3.04 | 15.31 | 5.04 |
| PSSM(12,2.04) | 0.003 | Markov | 6.67 | 79.25 | 11.88 | 8.36 | 42.65 | 4.94 |
See Table 1 for the general information on the patterns. The intermediate values in the second column (0.003, 0.0003, etc., instead of the more common 0.005, 0.0005, etc.) were chosen to obtain a more homogeneous log scale.
Table 3. Sensitivity and specificity of TFBS recognition for various thresholds and probability models
| | | | | | | |
|---|---|---|---|---|---|---|
| Threshold | 1 | 0.5 | 0.5 | 0.5 | 0.5 | 0.8 |
| Sensitivity | 97.11% | 97.11% | 97.11% | 97.11% | 97.11% | 97.11% |
| Specificity | 62.33% | 62.56% | 62.56% | 62.56% | 62.78% | 62.33% |
| Threshold | 2 | 0.0189 | 0.01966 | 0.0215 | 0.0232 | 0.02619 |
| Sensitivity | 69.11% | 69.11% | 69.11% | 69.11% | 69.11% | 69.22% |
| Specificity | 87.33% | 92.33% | 92.33% | 92% | 92% | 92.22% |
| Threshold | 3 | 0.00135 | 0.00135 | 0.00157 | 0.00219 | 0.003 |
| Sensitivity | 32.33% | 32.44% | 32.44% | 32.44% | 32.44% | 32.33% |
| Specificity | 95.33% | 98.11% | 98% | 98% | 97.56% | 97.78% |
See details in the text of the paper.
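As a reminder of how the two rates in the table are computed, the sketch below uses the standard definitions: sensitivity is the fraction of true sites that are recognized, and specificity is the fraction of non-sites that are rejected. The helper name and the toy data are hypothetical. A region is called a hit when its occurrence count reaches the threshold, or, for a P-value based method, when its P-value does not exceed the threshold.

```python
def sensitivity_specificity(positives, negatives, is_hit):
    """Return (sensitivity, specificity) for a decision rule `is_hit`.

    positives: values measured on regions known to contain a site
    negatives: values measured on regions known not to contain a site
    """
    true_pos = sum(1 for x in positives if is_hit(x))
    true_neg = sum(1 for x in negatives if not is_hit(x))
    return true_pos / len(positives), true_neg / len(negatives)

# Hypothetical occurrence counts per region.
pos_counts = [3, 1, 0, 2, 5]
neg_counts = [0, 0, 1, 0, 2]
for threshold in (1, 2):
    print(threshold,
          sensitivity_specificity(pos_counts, neg_counts,
                                  lambda c: c >= threshold))

# Hypothetical P-values per region: small P-values indicate a hit.
pos_pvalues = [0.001, 0.2, 0.9, 0.01]
neg_pvalues = [0.5, 0.03, 0.7, 0.95]
print(sensitivity_specificity(pos_pvalues, neg_pvalues, lambda p: p <= 0.05))
```

Raising the count threshold, or lowering the P-value threshold, trades sensitivity for specificity, which is the pattern visible across the three threshold rows of the table.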
Figure 4. ROC curves for recognition methods. The methods are described in the text and Table 3. Blue squares correspond to the method based on the number of occurrences. The ROC curves for the P-value based methods almost coincide.
Table 4. Spearman’s rank correlation between experimental ENCODE signal value and characteristics of regions related to pattern occurrences
| | | | | | | |
|---|---|---|---|---|---|---|
| Spearman’s coef. | 0.12 | 0.061 | 0.061 | 0.058 | 0.059 | 0.063 |
| Significance level | 0.0003 | 0.0674 | 0.0673 | 0.0802 | 0.0796 | 0.0578 |
See the text for further explanations.
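For readers who want to reproduce this kind of comparison on their own data, Spearman's coefficient is the Pearson correlation of the ranks, with tied values receiving their average rank. The sketch below is a plain-Python version; the function names and toy values are hypothetical, and it does not compute the significance level reported in the table.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical signal values and occurrence counts for five regions.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
counts = [2, 1, 4, 3, 5]
print(spearman(signal, counts))
```

The function assumes that neither input is constant; a constant input has zero rank variance and would cause a division by zero.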