Michael M Mwangi, Eric D Siggia.
Abstract
BACKGROUND: To explain the vastly different phenotypes exhibited by the same organism under different conditions, it is essential that we understand how the organism's genes are coordinately regulated. While there are many excellent tools for predicting sequences encoding proteins or RNA genes, few algorithms exist to predict regulatory sequences on a genome-wide scale with no prior information.
Year: 2003 PMID: 12749771 PMCID: PMC165661 DOI: 10.1186/1471-2105-4-18
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. WLC algorithm: choosing the correct number of clusters. The ratio R (Eq. 5) of the mean child to parent intra-cluster affinities versus the number of clusters for B. subtilis generated by our WLC algorithm. As the weakest links are severed, the number of clusters increases from 29 to 732. Note the stabilization at and around 350 clusters, the optimal cluster number.
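The caption's observation that severing the weakest links increases the number of clusters can be illustrated with a generic graph sketch. This is not the authors' WLC algorithm and does not compute the ratio R of Eq. 5 (neither is specified in this text); it only assumes that clusters are connected components of a graph whose edges carry affinities. The function name and the edge representation are hypothetical.

```python
def count_clusters_after_severing(n_nodes, edges, n_severed):
    """Count connected components after removing the n_severed weakest edges.

    edges is a list of (node_a, node_b, affinity) tuples with integer node ids
    in range(n_nodes). Assumption: clusters are connected components.
    """
    # Keep every edge except the n_severed lowest-affinity ones.
    kept = sorted(edges, key=lambda e: e[2])[n_severed:]
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in kept:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return len({find(i) for i in range(n_nodes)})


# Toy affinity graph (made-up values): a chain of four nodes.
toy_edges = [(0, 1, 0.9), (1, 2, 0.2), (2, 3, 0.8)]
cluster_counts = [count_clusters_after_severing(4, toy_edges, k) for k in range(4)]
```

On the toy chain, each severed link splits one cluster in two, so the cluster count rises by one per step.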
The 10 most significant dimers (column 1). The dimers searched for had word lengths of 4–5 and spacers of 3–30. Coding sequence was not considered. Listed are the number of occurrences in the dataset (column 2) and the statistical significance -log10 P (column 3), with P calculated from Eq. 2.
| Dimer | Number observed | Significance |
| ttgaN20ataat | 48 | 21.1 |
| gccgcN11gcggc | 10 | 15.9 |
| ggtggN3cgcg | 10 | 14.6 |
| ttgaN19tata | 70 | 14.5 |
| gaaacN16cgta | 17 | 14.4 |
| ttgaN21taat | 58 | 13.8 |
| agggtN4ccgcg | 8 | 13.7 |
| gccgcN12cggc | 12 | 13.7 |
| ttgaN23ataa | 75 | 13.6 |
| ttgacN19ataat | 18 | 13.6 |
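A minimal sketch of how occurrences of such spaced dimers could be tallied, using the word lengths (4–5) and spacer range (3–30) stated in the caption. It covers only the counting step; the significance of Eq. 2 is not reproduced here because the equation is not given in this text. The function name and the `wordNgapword` key format are assumptions chosen to mirror the table's notation.

```python
from collections import Counter

ALPHABET = set("acgt")


def count_spaced_dimers(seqs, word_lens=(4, 5), spacers=range(3, 31)):
    """Count every word1-N{gap}-word2 pattern occurring in the sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.lower()
        for len1 in word_lens:
            for len2 in word_lens:
                for gap in spacers:
                    span = len1 + gap + len2
                    for i in range(len(seq) - span + 1):
                        word1 = seq[i:i + len1]
                        word2 = seq[i + len1 + gap:i + span]
                        # Skip windows containing ambiguous bases.
                        if set(word1) <= ALPHABET and set(word2) <= ALPHABET:
                            counts[f"{word1}N{gap}{word2}"] += 1
    return counts


# Toy sequence (made up): one ttga ... ataat pair separated by 20 bases.
toy = "ttga" + "c" * 20 + "ataat"
dimer_counts = count_spaced_dimers([toy])
```

The nested loops are quadratic in the number of word and spacer combinations, which is acceptable for a sketch but would need indexing for a whole genome.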
52 unique biologically significant weight matrices. Listed are each matrix's identifier (column 1), consensus sequence (column 2), regulon size (column 3), and annotation (column 4). The matrices are subdivided into categories according to the means by which they were identified: by comparison to documented regulatory mechanisms, by inspecting the operons in a matrix's regulon for related functions, and by examining the matrix's matches for positional biases. If a matrix was identified by several means, all listings for the matrix except the first, in the top-most category, are marked with pluses. Where applicable, the statistical significance -log10 P is reported in parentheses, and entries in a category are sorted according to significance.
| Weight matrix | Consensus sequence | Regulon size | Annotation |
| WM1 | N7TTGAN19TATAATAN6 | 1141 | σ |
| WM118 | [G/T]GTTTAN13 [A/C]GGGAA [G/T] | 8 | σ |
| WM11 | NTGAAACNTTTN12CGTAT [A/T] | 16 | σ |
| WM212 | TGGCA [C/T]N4CTTGCAT | 5 | σ |
| WM2 | AANNAGGGTGGTACCGCGNN | 24 | T-box, alternate transcription termination regulation of aminoacyl-tRNA synthetases |
| WM22 | [A/T]AAN [A/C]GAACNN [A/T]NGTTCNNTTN | 29 | LexA, SOS response |
| WM71 | NT [A/T]TGTAN10ACA [A/T]AN | 111 | TnrA, pleiotropic regulator involved in global nitrogen regulation |
| WM317 | [A/T]TGTAA [A/G]CG [C/T]TT [A/T]N [A/T] | 54 | CcpA, carbon catabolite repression |
| WM298 | NTAATN20ATTAN | 27 | YccG-YccH (3.4) |
| WM259 | TGCGN10CGCA | 5 | YclK-YclJ (3.3) |
| WM171 | TGGGN11GGGA | 2 | Sec-dependent protein export machinery |
| WM116 | AATTC [A/T]N28 [A/T]GAATT | 4 | Cell lysis |
| WM266 | TGGACAN3GCAGA | 3 | Extracellular proteins |
| WM304 | AGTGTN15ACACT | 4 | Transport |
| WM69 | TATCTN4 [A/T]TCGAGA | 5 | Transport |
| WM233 | NGGGAN3TGCGG | 7 | Antimicrobial resistance |
| WM290 | NTTGAN16TGTTAN3T | 18 | DNA synthesis and repair |
| WM47 | A [A/T]AGAGN18CTCTTT [C/T]N | 27 | DNA synthesis and repair |
| WM124 | NTTAG [A/T]N6TTAGN | 17 | Transport |
| +WM2 | AANNAGGGTGGTACCGCGNN | 24 | T-box, translation, ribosomal structure, and biogenesis (12) |
| +WM317 | [A/T]TGTAA [A/G]CG [C/T]TT [A/T]N [A/T] | 54 | CcpA, carbohydrate transport and metabolism (6.5), energy production and conversion (3.0) |
| WM130 | N4TTGAN14 [A/T]N4TGAAAN | 38 | Posttranslational modification, protein turnover, and chaperones (4.2) |
| +WM1 | N7TTGAN19TATAATAN6 | 1141 | σ |
| +WM212 | TGGCA [C/T]N4CTTGCAT | 5 | σ |
| WM255 | NCTGAAN26TTCAGN | 3 | Cell motility and secretion (2.9) |
| +WM22 | [A/T]AAN [A/C]GAACNN [A/T]NGTTCNNTTN | 29 | LexA, DNA replication, recombination, and repair (2.6) |
| WM39 | [A/G]NNTGCTN30AGCAN | 21 | Secondary metabolites biosynthesis, transport, and catabolism (2.5) |
| WM228 | NGCAGAN13TCTGCN | 3 | Secondary metabolites biosynthesis, transport, and catabolism (2.5) |
| WM283 | AGCTGN13GAGGTT | 3 | Translation, ribosomal structure, and biogenesis (2.4) |
| WM80 | NGTTTN29AAACN | 86 | Energy production and conversion (2.3) |
| WM223 | NATTTN28AAATN | 69 | Transcription (2.3) |
| WM16 | NCCGGC [C/T]N6GCCGGN [G/T]TTTT | 27 | Signal transduction mechanisms (2.3) |
| WM17 | [A/G]NCGGCN8 [A/C]NGCCGN | 40 | Cell motility and secretion (2.3) |
| WM23 | [A/T]CGAAN27TTCG [A/T] | 25 | Amino acid transport and metabolism (2.2) |
| WM221 | NGCGGN29CGGCN | 6 | Amino acid transport and metabolism (2.2) |
| WM119 | NAATAN9TATTN | 62 | Cell envelope biogenesis, outer membrane (2.1) |
| +WM304 | AGTGTN15ACACT | 4 | Inorganic ion transport and metabolism (2.1) |
| WM46 | NTATAN17AAAGGAG [A/G]N | 109 | DNA replication, recombination, and repair (2.1) |
| WM75 | [G/T]N3CTACN9GN12CTACA | 5 | Secondary metabolites biosynthesis, transport, and catabolism (2.0) |
| WM31 | NTGTTN5AACAN | 58 | Carbohydrate transport and metabolism (2.0) |
| +WM46 | NTATAN17AAAGGAG [A/G]N | 109 | Repressor (17) |
| WM21 | AANCCGN15CGGNTTTTTT | 128 | Activator (7.9) |
| WM33 | NAAGC [A/T]GN12C [A/T]GCTTN | 96 | Activator (4.7) |
| WM50 | NNGGTTTTTTTATTN | 152 | Activator (3.6) |
| WM173 | NAAAGN [A/G]NGGAAN4 | 35 | Repressor (3.0) |
| WM169 | NAAAGN3GTGAN | 40 | Repressor (2.9) |
| WM13 | [A/G] [A/C] [A/G]CGG [G/T]... [G/T]N9GGG [G/T] [G/T]TT [A/T]T | 21 | Activator (2.8) |
| WM180 | [A/T]AGAGN5AGAGN | 15 | Repressor (2.6) |
| WM58 | NAAAGANAN15TGTTTTN | 42 | Activator (2.6) |
| WM79 | NTTGT [A/T]N4TTGTN | 67 | Activator (2.5) |
| WM84 | AN3AACATN3GGAGGN | 19 | Repressor (2.4) |
| WM7 | NAAAGN19 [G/T]CTTTN3 | 90 | Activator (2.3) |
| +WM17 | [A/G]NCGGCN8 [A/C]NGCCGN | 40 | Activator (2.1) |
| +WM46 | NTATAN17AAAGGAG [A/G]N | 109 | (61) |
| +WM1 | N7TTGAN19TATAATAN6 | 1141 | σ |
| +WM169 | NAAAGN3GTGAN | 40 | (10) |
| +WM21 | AANCCGN15CGGNTTTTTT | 128 | (6.3) |
| +WM2 | AANNAGGGTGGTACCGCGNN | 24 | T-box (4.8) |
| +WM16 | NCCGGC [C/T]N6GCCGGN [G/T]TTTT | 27 | (4.1) |
| +WM13 | [A/G] [A/C] [A/G]CCC [G/T]... | 21 | (3.9) |
| +WM58 | NAAAGANAN15TGTTTTN | 42 | (3.4) |
| +WM11 | NTGAAACNTTTN12CGTAT [A/T] | 16 | σ |
| +WM17 | [A/G]NCGGCN8 [A/C]NGCCGN | 40 | (3.0) |
| WM25 | NNGTTTN17GG [A/T]A [A/T] | 59 | (3.0) |
| WM37 | NAAGC [A/T]N19GCTTT | 25 | (3.0) |
| WM14 | N3CGGCN11GCCGN3 | 197 | Tends to co-occur with T-box (3.0) |
| WM143 | NCGTCN24TTATN | 25 | (2.8) |
| WM185 | NAACCN15GGTTNNTT | 15 | (2.7) |
| +WM47 | A [A/T]AGAGN18CTCTTT [C/T]N | 27 | (2.6) |
| +WM33 | NAAGC [A/T]GN12C [A/T]GCTTN | 96 | (2.1) |
| WM28 | [A/G]AAAGCN21 [A/G]GCTT [C/T]TT | 30 | (2.0) |
| WM34 | NCACA [A/T]N [A/T]TGTGN | 17 | Three repeats overlap dnaA boxes TTATCCAGA |
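The consensus notation in the table (plain bases, `N` followed by a repeat count, and bracketed alternatives such as `[A/T]`) can be turned into a regular expression for quick scanning. The sketch below is a simplification: a real weight matrix scores every base probabilistically, whereas a regular expression only accepts or rejects. The function name is hypothetical, and entries containing an elided `...` are deliberately rejected rather than guessed.

```python
import re

# One token: a bracketed alternative, N with a repeat count, or a single base.
_TOKEN = re.compile(r"\[([A-Z/]+)\]|N(\d+)|([ACGTN])")


def consensus_to_regex(consensus):
    """Translate a consensus such as 'TGCGN10CGCA' into a compiled regex."""
    text = consensus.replace(" ", "").upper()
    parts = []
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unrecognised symbol at position {pos}: {text[pos:]!r}")
        alternatives, repeat, base = m.groups()
        if alternatives is not None:
            parts.append("[" + alternatives.replace("/", "") + "]")
        elif repeat is not None:
            parts.append("[ACGT]{" + repeat + "}")
        elif base == "N":
            parts.append("[ACGT]")
        else:
            parts.append(base)
        pos = m.end()
    return re.compile("".join(parts))


wm259 = consensus_to_regex("TGCGN10CGCA")            # WM259 from the table
wm23 = consensus_to_regex("[A/T]CGAAN27TTCG [A/T]")  # WM23 from the table
```

For example, `wm259.search(upstream_sequence)` would report the first exact consensus match in an upper-case upstream sequence; scanning the reverse strand would need the reverse complement as well.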
Figure 2. Histogram of regulon sizes. A regulon for a factor in (a) is defined as the set of our predicted operons that have, immediately upstream, a site documented in the DBTBS database as recognized by the factor. A regulon for a weight matrix in (b) and (c) is defined as the set of our predicted operons that have, immediately upstream, a match to the matrix. The matrices in (b) were derived from the experimentally verified sites in the DBTBS database. The matrices in (c) were derived from our clusters of over-represented dimers. The regulons in (b) (σ, σ, σ, ComK, GltC, GltR, Hpr, LevR, and Spo0A) and the three regulons in (c) with more than 400 members are discussed further in the text.
Figure 4. Other weight matrices with matches exhibiting a clear positional bias. Histograms of positions of the matches in all upstream sequences to the six non-σ weight matrices with the most positionally biased matches, using the same conventions as Figure 3.
Figure 3. Matches to our σ weight matrix WM1 exhibit a clear positional bias. Histograms of positions of the matches to our σ weight matrix WM1 between (a) divergent and (b) convergent operons. In (a), positions are measured relative to the translation start. In (b), positions are measured relative to the downstream end of the region. In either case, the first upstream base is assigned the position -1. The expected distribution, under the null hypothesis that the matches are uniformly distributed in their upstream regions, is denoted by *. The probability P of the observed distribution under the null hypothesis is reported as the significance score -log10 P.
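The expected distribution under the uniform null described in the caption can be sketched as follows. Assumptions not stated in the source: one match per upstream region, a match indexed by its base nearest the reference point, and every placement that fits inside the region being equally likely. The function name is hypothetical, and the computation of P itself is not reproduced.

```python
from collections import defaultdict


def expected_position_histogram(region_lengths, match_width):
    """Expected match count at each position under a uniform null.

    Positions follow the caption's convention: the first upstream base is -1.
    """
    expected = defaultdict(float)
    for length in region_lengths:
        n_slots = length - match_width + 1  # placements that fit in the region
        if n_slots <= 0:
            continue  # region too short to hold a match
        for k in range(n_slots):
            expected[-(k + 1)] += 1.0 / n_slots
    return dict(expected)


# Toy regions (made up): lengths 4 and 2, match width 1.
toy_hist = expected_position_histogram([4, 2], 1)
```

Because short regions contribute only to positions near -1, the expected curve falls off with distance even when no real positional bias exists, which is why an observed histogram is compared against this baseline rather than a flat line.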
Figure 5. Weight matrices with matches exhibiting a clear positional bias relative to σ sites. Histograms of positions of the matches to weight matrices relative to the best matches to the σ weight matrix WM1. The position of a weight matrix match relative to a WM1 match is measured center to center. The position -1 indicates that the center of the matrix match is 1 base upstream of the center of the WM1 match. Plots for the six weight matrices with the most positionally biased matches are shown using the same conventions as Figure 3.
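The center-to-center measurement in the caption amounts to a signed difference of match midpoints. A minimal sketch, assuming 0-based start coordinates that increase in the downstream direction on the forward strand; the function names and the `strand` parameter are hypothetical.

```python
def center(start, width):
    """Center coordinate of a match occupying [start, start + width)."""
    return start + (width - 1) / 2.0


def relative_position(match_start, match_width, ref_start, ref_width, strand=1):
    """Signed center-to-center offset; negative means upstream of the reference.

    Use strand=-1 when the operon lies on the reverse strand, so that
    'upstream' still maps to negative values.
    """
    return strand * (center(match_start, match_width) - center(ref_start, ref_width))


# A 5-base match centered one base upstream of a 1-base reference site.
offset = relative_position(10, 5, 13, 1)
```

Matches of even width have half-integer centers, so offsets may need rounding before they are binned into a histogram.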