Leslie Regad, Juliette Martin, Gregory Nuel, Anne-Claude Camproux.
Abstract
BACKGROUND: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations for both low- and high-complexity patterns in the framework of homogeneous or heterogeneous Markov models.
Year: 2010 PMID: 20205909 PMCID: PMC2828453 DOI: 10.1186/1748-7188-5-15
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Minimal DFA that recognizes the language ℒ = {a, b}*.
Figure 2. Geometry of the 27 structural letters of the HMM-27 structural alphabet.
P-values for structural patterns in protein loop structures using exact computations or the single sequence approximation (SSA), with or without an offset.
| Structural pattern | Observed count | Exact | SSA (no offset) | SSA (offset = 3) |
|---|---|---|---|---|
| | 16 | 1.62 × 10⁻² | 5.95 × 10⁻¹ | 8.43 × 10⁻² |
| | 7 | 2.20 × 10⁻² | 6.68 × 10⁻² | 9.19 × 10⁻³ |
| | 25 | 1.37 × 10⁻³ | 4.89 × 10⁻¹ | 2.19 × 10⁻² |
| | 110 | 1.71 × 10⁻³ | 9.46 × 10⁻¹ | 2.59 × 10⁻³ |
| | 4 | 5.78 × 10⁻⁵ | 2.81 × 10⁻⁴ | 5.49 × 10⁻⁵ |
| | 27 | 5.69 × 10⁻⁶ | 3.07 × 10⁻³ | 3.81 × 10⁻⁶ |
| | 50 | 3.45 × 10⁻⁷ | 4.84 × 10⁻² | 9.71 × 10⁻⁶ |
| | 40 | 2.56 × 10⁻¹¹ | 4.49 × 10⁻⁵ | 1.22 × 10⁻⁹ |
| | 52 | 5.74 × 10⁻¹⁶ | 1.96 × 10⁻¹⁰ | 2.30 × 10⁻¹⁷ |
| | 58 | 3.19 × 10⁻³² | 1.91 × 10⁻²³ | 1.26 × 10⁻³² |
| | 149 | 1.05 × 10⁻⁴¹ | 1.06 × 10⁻³⁰ | 3.85 × 10⁻⁵¹ |
| | 282 | 7.26 × 10⁻¹⁶⁷ | 9.08 × 10⁻¹⁷⁴ | 3.56 × 10⁻²²² |
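The gap between the exact and SSA columns can be illustrated on a toy case. The sketch below is not the authors' algorithm: it is a naive dynamic program for a single word under an order 1 homogeneous Markov model. It makes three assumptions that this record does not confirm: the P-value is taken as P(N ≥ observed count), the SSA is read as replacing the set by one sequence of the pooled length, and the offset is read as the number of positions dropped at each junction between consecutive sequences. All function names are hypothetical.

```python
from collections import defaultdict


def word_dfa(word, alphabet):
    """delta[(state, letter)] for the DFA whose state is the length of the
    longest prefix of `word` that ends the text read so far."""
    delta = {}
    for s in range(len(word) + 1):
        for a in alphabet:
            text = word[:s] + a
            k = min(len(word), len(text))
            while not text.endswith(word[:k]):
                k -= 1
            delta[s, a] = k
    return delta


def count_distribution(word, n, mu, pi):
    """P(N = c), c = 0, 1, ..., for overlapping occurrences of `word` in one
    sequence of length n >= 1 under an order 1 homogeneous Markov model with
    starting distribution mu[a] and transitions pi[a][b]."""
    alphabet = sorted(mu)
    delta, m = word_dfa(word, alphabet), len(word)
    cur = defaultdict(float)  # (last letter, DFA state, count) -> probability
    for a in alphabet:
        s = delta[0, a]
        cur[a, s, int(s == m)] += mu[a]
    for _ in range(n - 1):
        nxt = defaultdict(float)
        for (a, s, c), p in cur.items():
            for b in alphabet:
                t = delta[s, b]
                nxt[b, t, c + int(t == m)] += p * pi[a][b]
        cur = nxt
    dist = [0.0] * (max(c for _, _, c in cur) + 1)
    for (_, _, c), p in cur.items():
        dist[c] += p
    return dist


def exact_set_distribution(word, lengths, mu, pi):
    """Total count over independent sequences: convolve per-sequence laws."""
    total = [1.0]
    for n in lengths:
        one = count_distribution(word, n, mu, pi)
        out = [0.0] * (len(total) + len(one) - 1)
        for i, x in enumerate(total):
            for j, y in enumerate(one):
                out[i + j] += x * y
        total = out
    return total


def ssa_distribution(word, lengths, mu, pi, offset=0):
    """SSA as assumed here: one sequence of the pooled length, shortened by
    `offset` positions at each junction between consecutive sequences."""
    pooled = sum(lengths) - offset * (len(lengths) - 1)
    return count_distribution(word, pooled, mu, pi)


def p_value(dist, n_obs):
    """Assumed definition: P(N >= n_obs)."""
    return sum(dist[n_obs:])


# Toy data (hypothetical): two sequences of length 2 over {a, b}, uniform model.
mu = {"a": 0.5, "b": 0.5}
pi = {"a": mu, "b": mu}
exact = exact_set_distribution("ab", [2, 2], mu, pi)
ssa = ssa_distribution("ab", [2, 2], mu, pi)
print(p_value(exact, 1), p_value(ssa, 1))  # prints 0.4375 0.6875
```

In this toy case the pooled sequence allows an occurrence that straddles the junction between the two sequences, which the set of independent sequences cannot contain, so the SSA without offset overstates the P-value. With `offset=1` (word length minus one, under the reading assumed above) the pooled length drops to 3 and the SSA P-value becomes 0.5.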
Figure 3. Starting and stationary distributions of the 27 structural letters in the loop structure data set.
Figure 4. Histogram of the log of the pattern complexity. Note that the 0.1% of patterns with the largest complexities have been removed from the graph to improve readability.
Size of the regular expression (regex) and pattern complexity (L) for a selected subset of PROSITE signatures.
| PROSITE signature | Accession number | Pattern size | Pattern complexity (L) |
|---|---|---|---|
| RGD | PS00016 | 3 | 22 |
| ER_TARGET | PS00014 | 3 | 28 |
| PPASE | PS00387 | 7 | 41 |
| ALDEHYDE_DEHYDR_GLU | PS00687 | 8 | 44 |
| PROKAR_NTER_METHYL | PS00409 | 21 | 46 |
| GLY_RADICAL_1 | PS00850 | 9 | 77 |
| PEP_ENZYMES_PHOS_SITE | PS00370 | 12 | 96 |
| PUR_PYR_PR_TRANSFER | PS00103 | 13 | 102 |
| PILI_CHAPERONE | PS00635 | 18 | 226 |
| SIGMA54_INTERACT_2 | PS00676 | 16 | 313 |
| EFACTOR_GTP | PS00301 | 16 | 320 |
| ALDEHYDE_DEHYDR_CYS | PS00070 | 12 | 331 |
| ADH_ZINC | PS00059 | 13 | 478 |
| THIOLASE_1 | PS00098 | 19 | 637 |
| SUGAR_TRANSPORT_1 | PS00216 | 15 to 17 | 796 |
| FGGY_KINASES_2 | PS00445 | 21 to 22 | 2668 |
| PTS_EIIA_TYPE_2_HIS | PS00372 | 16 | 2758 |
| MOLYBDOPTERIN_PROK_3 | PS00551 | 27 to 28 | 3907 |
| SUGAR_TRANSPORT_2 | PS00217 | 26 | 6889 |
P-values for a selection of PROSITE patterns of low (or moderate) complexities using the complete proteome of Escherichia coli (NC_000913.faa).
| PROSITE signature | Observed count | Exact | SSA (no offset) | SSA (offset) |
|---|---|---|---|---|
| RGD | 215 | 5.35 × 10⁻¹ | 5.91 × 10⁻¹ | 5.55 × 10⁻¹ (2) |
| ER_TARGET | 72 | 4.01 × 10⁻² | 5.21 × 10⁻² | 4.70 × 10⁻² (2) |
| PPASE | 3 | 2.60 × 10⁻² | 2.76 × 10⁻² | 2.63 × 10⁻² (6) |
| ALDEHYDE_DEHYDR_GLU | 12 | 1.99 × 10⁻⁵ | 2.41 × 10⁻⁵ | 1.95 × 10⁻⁵ (7) |
| PROKAR_NTER_METHYL | 10 | 6.79 × 10⁻³ | 8.01 × 10⁻³ | 5.10 × 10⁻³ (20) |
| GLY_RADICAL_1 | 6 | 1.58 × 10⁻⁶ | 1.86 × 10⁻⁶ | 1.60 × 10⁻⁶ (8) |
| PEP_ENZYMES_PHOS_SITE | 4 | 1.49 × 10⁻¹⁰ | 1.74 × 10⁻¹⁰ | 1.49 × 10⁻¹⁰ (12) |
| PUR_PYR_PR_TRANSFER | 7 | 2.15 × 10⁻¹⁴ | 2.75 × 10⁻¹⁴ | 2.10 × 10⁻¹⁴ (12) |
Exact P-values for a selection of PROSITE patterns of high complexities using the complete proteome of Escherichia coli (NC_000913.faa). We use an order 1 homogeneous Markov model estimated over the data set.
| PROSITE signature | Observed count | Exact |
|---|---|---|
| PILI_CHAPERONE | 10 | 3.27 × 10⁻⁴⁶ |
| SIGMA54_INTERACT_2 | 12 | 1.58 × 10⁻⁴² |
| EFACTOR_GTP | 8 | 4.43 × 10⁻²⁰ |
| ALDEHYDE_DEHYDR_CYS | 11 | 5.63 × 10⁻⁹ |
| ADH_ZINC | 12 | 8.93 × 10⁻¹⁶ |
| THIOLASE_1 | 5 | 5.76 × 10⁻⁹ |
| SUGAR_TRANSPORT_1 | 18 | 3.75 × 10⁻⁸ |
| FGGY_KINASES_2 | 5 | 2.14 × 10⁻⁴ |
| PTS_EIIA_TYPE_2_HIS | 8 | 7.19 × 10⁻¹⁹ |
| MOLYBDOPTERIN_PROK_3 | 11 | 2.59 × 10⁻³⁵ |
| SUGAR_TRANSPORT_2 | 10 | 1.22 × 10⁻⁵ |
P-values for several DNA patterns (known transcription factors are marked with a star) in the upstream region data set.
| DNA pattern | | | Homogeneous | Heterogeneous |
|---|---|---|---|---|
| | 28 | 10 | 2.95 × 10⁻³ | 3.74 × 10⁻³ |
| | 427 | 11 | 1.31 × 10⁻⁹⁹ | 1.29 × 10⁻⁹⁹ |
| | 25 | 10 | 1.76 × 10⁻⁶ | 1.38 × 10⁻⁶ |
| | 22 | 11 | 1.12 × 10⁻⁶ | 1.55 × 10⁻⁶ |
| | 18 | 11 | 6.52 × 10⁻¹⁰ | 1.65 × 10⁻⁹ |
| | 391 | 14 | 7.70 × 10⁻¹² | 1.68 × 10⁻¹² |
| | 15 | 17 | 4.15 × 10⁻¹ | 4.09 × 10⁻¹ |
| | 42 | 27 | 2.05 × 10⁻²³ | 2.14 × 10⁻²² |
| | 212 | 36 | 3.08 × 10⁻⁹ | 3.04 × 10⁻⁹ |
| | 11 | 40 | 3.10 × 10⁻² | 3.05 × 10⁻² |
| | 1 | 106 | 8.97 × 10⁻¹ | 8.84 × 10⁻¹ |
| | 102 | 183 | 1.26 × 10⁻¹⁴ | 1.73 × 10⁻¹³ |
| | 6 | 464 | 2.88 × 10⁻² | 2.84 × 10⁻² |
Figure 5. Some transitions of the order 1 heterogeneous Markov model fitted using a sliding window of size 200 on the upstream region data set. The plots respectively correspond to the following quantities: a) πᵢ(A, G); b) πᵢ(G, A); c) πᵢ(G, G); d) πᵢ(T, T), 1 ≤ i ≤ 800.
Figure 6. Marginal distribution of the four nucleotides along the 800 positions of an upstream region. The underlying model is an order 1 heterogeneous Markov model fitted using a sliding window of size 200 on the upstream region data set.
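The sliding-window fit mentioned in the captions of Figures 5 and 6 could look roughly like the sketch below. This record does not give the estimator, so several choices are assumptions: the window is centred on the target position and truncated at the sequence ends, all sequences share the same length and contain only letters of the alphabet, and a context never seen inside a window falls back to a uniform row. The function name is hypothetical.

```python
def sliding_window_transitions(seqs, alphabet="ACGT", window=200):
    """Return {i: pi_i}, where pi_i[a][b] estimates P(X_i = b | X_{i-1} = a)
    from the transitions that end inside a window centred on position i
    (0-based, i = 1 .. n - 1). All sequences are assumed to share length n."""
    n = len(seqs[0])
    half = window // 2
    # pos_counts[j][a][b]: number of sequences with a at j - 1 and b at j.
    pos_counts = [None] + [
        {a: {b: 0 for b in alphabet} for a in alphabet} for _ in range(1, n)
    ]
    for s in seqs:
        for j in range(1, n):
            pos_counts[j][s[j - 1]][s[j]] += 1
    models = {}
    for i in range(1, n):
        lo, hi = max(1, i - half), min(n - 1, i + half)  # truncated at the ends
        pi_i = {}
        for a in alphabet:
            row = {b: sum(pos_counts[j][a][b] for j in range(lo, hi + 1))
                   for b in alphabet}
            total = sum(row.values())
            # Unseen contexts fall back to a uniform row (assumption).
            pi_i[a] = {b: (row[b] / total if total else 1 / len(alphabet))
                       for b in alphabet}
        models[i] = pi_i
    return models


# Hypothetical toy input: the transition A -> A dominates early, A -> C later.
models = sliding_window_transitions(["AAACCC", "AAAACC"], window=2)
print(models[1]["A"], models[4]["A"])
```

Recomputing each window sum from scratch keeps the sketch short; on real data with 800 positions a running sum over `pos_counts` would avoid the repeated work.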