Jens Lichtenberg, Edwin Jacox, Joshua D Welch, Kyle Kurz, Xiaoyu Liang, Mary Qu Yang, Frank Drews, Klaus Ecker, Stephen S Lee, Laura Elnitski, Lonnie R Welch.
Abstract
BACKGROUND: DNA repair genes provide an important contribution towards the surveillance and repair of DNA damage. These genes produce a large network of interacting proteins whose mRNA expression is likely to be regulated by similar regulatory factors. Full characterization of promoters of DNA repair genes and the similarities among them will more fully elucidate the regulatory networks that activate or inhibit their expression. To address this goal, the authors introduce a technique to find regulatory genomic signatures, which represents a specific application of the genomic signature methodology to classify DNA sequences as putative functional elements within a single organism.
Year: 2009 PMID: 19594877 PMCID: PMC2709261 DOI: 10.1186/1471-2164-10-S1-S18
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Top 25 words. The top 25 words for the bidirectional promoter set (a) and the unidirectional promoter set (b) of DNA-repair pathways. The words are sorted in descending order according to their statistical overrepresentation.
| (a) Bidirectional | |||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp | Position | Palindrome | |
| TCGCGCCA | 4 | 0.918299 | 4 | 0.9375 | 5.88611 | TGGCGCGA | 12538 | No | 0.015391 |
| TCCCGGGA | 8 | 3.97165 | 8 | 4.26667 | 5.60208 | TCCCGGGA | 2 | Yes | 0.068606 |
| GGCCCGCC | 10 | 5.85012 | 11 | 6.5 | 5.36123 | GGCGGGCC | 21073 | No | 0.066821 |
| TCCCGGCT | 6 | 2.54354 | 6 | 2.66667 | 5.14921 | AGCCGGGA | NA | No | 0.054084 |
| CAGGGGCC | 4 | 1.1085 | 4 | 1.13514 | 5.13315 | GGCCCCTG | 14546 | No | 0.028413 |
| AGGGCCGT | 5 | 1.80245 | 5 | 1.86667 | 5.10145 | ACGGCCCT | 613 | No | 0.04142 |
| TCTGAGGA | 5 | 1.84222 | 6 | 1.90909 | 4.99234 | TCCTCAGA | 5391 | No | 0.013499 |
| CGTGGGGG | 5 | 1.86693 | 5 | 1.93548 | 4.92572 | CCCCCACG | 20402 | No | 0.047015 |
| TGCTGAGA | 4 | 1.17067 | 4 | 1.2 | 4.91487 | TCTCAGCA | NA | No | 0.033766 |
| CGCGGCCG | 4 | 1.17067 | 4 | 1.2 | 4.91487 | CGGCCGCG | 20259 | No | 0.033766 |
| TCTGGGAT | 2 | 0.180188 | 2 | 0.181818 | 4.8138 | ATCCCAGA | 2854 | No | 0.014655 |
| GGGGCCGG | 5 | 1.92725 | 5 | 2 | 4.76672 | CCGGCCCC | 20866 | No | 0.052648 |
| AGGGAGGG | 6 | 2.73111 | 6 | 2.87234 | 4.7223 | CCCTCCCT | 9852 | No | 0.07159 |
| AGAAAAGA | 3 | 0.632564 | 3 | 0.642857 | 4.66976 | TCTTTTCT | NA | No | 0.027559 |
| CGACTCCG | 3 | 0.632564 | 3 | 0.642857 | 4.66976 | CGGAGTCG | NA | No | 0.027559 |
| GGGCCAGG | 7 | 3.61284 | 7 | 3.85714 | 4.6299 | CCTGGCCC | 19875 | No | 0.096315 |
| ACTCCAGC | 5 | 2.02051 | 5 | 2.1 | 4.53045 | GCTGGAGT | NA | No | 0.062121 |
| CGGGCCGA | 5 | 2.05153 | 5 | 2.13333 | 4.45426 | TCGGCCCG | 6128 | No | 0.065478 |
| TGCGGAAT | 2 | 0.220092 | 2 | 0.222222 | 4.41371 | ATTCCGCA | NA | No | 0.021321 |
| GCCCCTCC | 8 | 4.63031 | 9 | 5.03226 | 4.37454 | GGAGGGGC | 7041 | No | 0.070206 |
| GCCGGCGA | 3 | 0.707627 | 3 | 0.72 | 4.33335 | TCGCCGGC | 20143 | No | 0.036618 |
| TGAAGCCA | 4 | 1.38876 | 4 | 1.42857 | 4.23154 | TGGCTTCA | NA | No | 0.056996 |
| GGCAGGGA | 6 | 3.01111 | 6 | 3.18182 | 4.1367 | TCCCTGCC | 10531 | No | 0.103337 |
| TGCCCGCG | 5 | 2.19845 | 5 | 2.29167 | 4.10844 | CGCGGGCA | NA | No | 0.082773 |
| CAGCAGCC | 6 | 3.02748 | 6 | 3.2 | 4.10418 | GGCTGCTG | 19198 | No | 0.105399 |
| (b) Unidirectional | |||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp | Position | Palindrome | |
| ACCCGCCT | 4 | 0.716577 | 4 | 0.727273 | 6.87826 | AGGCGGGT | 19440 | No | 0.006562 |
| CTTCTTTC | 5 | 1.7686 | 5 | 1.81818 | 5.19624 | GAAAGAAG | 13567 | No | 0.037733 |
| AGGAAACA | 4 | 1.16659 | 4 | 1.19048 | 4.92885 | TGTTTCCT | 21667 | No | 0.032947 |
| GCAGGGCG | 6 | 2.75716 | 6 | 2.86957 | 4.66535 | CGCCCTGC | 1311 | No | 0.071337 |
| GGGGCTGC | 5 | 2.036 | 5 | 2.1 | 4.49226 | GCAGCCCC | 16359 | No | 0.062122 |
| TCTTCTTC | 4 | 1.30438 | 4 | 1.33333 | 4.48225 | GAAGAAGA | NA | No | 0.046491 |
| GGGGAGTA | 3 | 0.682407 | 3 | 0.692308 | 4.44222 | TACTCCCC | 17991 | No | 0.033211 |
| ATTAAAAT | 4 | 1.36853 | 4 | 1.4 | 4.29023 | ATTTTAAT | 16078 | No | 0.053723 |
| CGGAAACC | 3 | 0.750393 | 3 | 0.761905 | 4.15731 | GGTTTCCG | NA | No | 0.042101 |
| TGGGCGGA | 4 | 1.44679 | 4 | 1.48148 | 4.06778 | TCCGCCCA | NA | No | 0.063337 |
| CGGCGGCG | 3 | 0.787559 | 3 | 0.8 | 4.01229 | CGCCGCCG | 22091 | No | 0.047421 |
| TTTTTTGA | 3 | 0.787559 | 3 | 0.8 | 4.01229 | TCAAAAAA | NA | No | 0.047421 |
| TTTCTCCA | 4 | 1.48541 | 4 | 1.52174 | 3.96242 | TGGAGAAA | 2378 | No | 0.068398 |
| AGCCGGCT | 3 | 0.805285 | 3 | 0.818182 | 3.94551 | AGCCGGCT | 14 | Yes | 0.050071 |
| CCTCTTTA | 2 | 0.282982 | 2 | 0.285714 | 3.91104 | TAAAGAGG | NA | No | 0.033814 |
| CGCCCCTT | 6 | 3.12976 | 6 | 3.27273 | 3.90482 | AAGGGGCG | 21917 | No | 0.113859 |
| GCGCCGCG | 5 | 2.33164 | 5 | 2.41379 | 3.81433 | CGCGGCGC | 15062 | No | 0.097601 |
| ATTCCCAG | 3 | 0.843245 | 3 | 0.857143 | 3.80733 | CTGGGAAT | 21297 | No | 0.055985 |
| TCTCCCCT | 4 | 1.56036 | 4 | 1.6 | 3.7655 | AGGGGAGA | 18183 | No | 0.07881 |
| TCCGCCGG | 3 | 0.855341 | 3 | 0.869565 | 3.7646 | CCGGCGGA | NA | No | 0.057938 |
| CTCCCGCT | 3 | 0.867789 | 3 | 0.882353 | 3.72126 | AGCGGGAG | NA | No | 0.059981 |
| TGCGCCGA | 2 | 0.316812 | 2 | 0.32 | 3.68519 | TCGGCGCA | 3202 | No | 0.041483 |
| GGGCGCCC | 4 | 1.59514 | 4 | 1.63636 | 3.67732 | GGGCGCCC | 23 | Yes | 0.083901 |
| GTGCGTTT | 3 | 0.884961 | 3 | 0.9 | 3.66247 | AAACGCAC | NA | No | 0.062855 |
| TTGGTCTC | 4 | 1.60537 | 4 | 1.64706 | 3.65176 | GAGACCAA | NA | No | 0.085429 |
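The derived columns of the table above can be reproduced from the raw values. The sketch below is a minimal illustration, assuming that the column labelled Sln(S/ES) is the product S * ln(S/ES) (the tabulated values agree with this reading) and that RevComp and Palindrome follow the usual reverse-complement definition. The function names are hypothetical and are not taken from the authors' software.

```python
import math

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Return the reverse complement of a DNA word (uppercase A/C/G/T only)."""
    return word.translate(COMPLEMENT)[::-1]


def is_palindrome(word: str) -> bool:
    """A word is a reverse-complement palindrome if it equals its own reverse complement."""
    return word == reverse_complement(word)


def overrepresentation_score(s: int, es: float) -> float:
    """Score S * ln(S / ES), where S is the observed number of sequences
    containing the word and ES is the expected number (assumed reading of
    the Sln(S/ES) column)."""
    return s * math.log(s / es)


# First row of panel (a): word, S, ES as listed in the table.
word, s, es = "TCGCGCCA", 4, 0.918299
print(word, round(overrepresentation_score(s, es), 5),
      reverse_complement(word), is_palindrome(word))
```

Because the score multiplies the log-ratio by S, a word seen in many sequences ranks above a word with the same observed-to-expected ratio seen in few sequences.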
Figure 1. Score-based scatterplots. Shown here are the scatterplots for the scores of all words contained in the bidirectional promoter dataset (a) and the unidirectional promoter dataset (b) of the DNA repair pathways.
Figure 2. P-value-based scatterplots. Scatterplots of the p-values for all words contained in the promoters of the DNA repair pathways exhibiting bi-directionality (a) and uni-directionality (b).
Top 25 words not part of promoter sets. The top 25 words that were not discovered as being part of the bidirectional (a) and unidirectional (b) promoter set of DNA-repair pathways. The words are sorted in descending order by the expected sequence occurrence (ES).
| (a) Bidirectional | (b) Unidirectional | ||
| Word | ES | Word | ES |
| GCGGCCCG | 3.34859 | CGCCCCTG | 4.12035 |
| GGAGGCGC | 2.94738 | GGCGGAGG | 3.91749 |
| GCCTCTCC | 2.84694 | AAAGGGGC | 3.15484 |
| GCTGAGGA | 2.59894 | CTGGTCTC | 3.14943 |
| GCCGGGGC | 2.56699 | GCCTGGGC | 2.75165 |
| GCGCCTCC | 2.56699 | GTTTGAAA | 2.47933 |
| GCGAGGCG | 2.54354 | GCGCGAGG | 2.25604 |
| AGTGGGGG | 2.46473 | TTCTTTTC | 2.23192 |
| CTGGAGGC | 2.45191 | ATTCTGGA | 2.21123 |
| CGGGGGTG | 2.41485 | CAGGCAGG | 2.17759 |
| GAGGGGAG | 2.41485 | ATTTTGTT | 2.15141 |
| TGCCCGCC | 2.39066 | CAAAAAAA | 2.13045 |
| GCACCCCC | 2.23699 | AAACCTCA | 2.11329 |
| GCCTCTGG | 2.23699 | TCCCGCCT | 2.11329 |
| TGCCTGCG | 2.23699 | CCCCGCCG | 2.05605 |
| GGGCTCGC | 2.21328 | GAGGAGGC | 2.05268 |
| GGCAGGGC | 2.18091 | AGCACTGG | 2.02023 |
| CAGCAAGG | 2.1341 | TTATCTGC | 2.02023 |
| CGAGGCCT | 2.12325 | CCGCCCCA | 1.99873 |
| GAGGGAAG | 2.12325 | CCCGCCCT | 1.94132 |
| GGAGCTGA | 2.11348 | CTCTTTCT | 1.94132 |
| CCTGTCCT | 2.10187 | GAGAGAGC | 1.94132 |
| TCCAGGAC | 2.0706 | GGCCCAAC | 1.94132 |
| CCAGGCCG | 2.06039 | GTCTGGGC | 1.94132 |
| CGCCTGTC | 2.06039 | TAGGGGGC | 1.94132 |
Figure 3. Scatterplot of words not detected in the promoters. Scatterplots for the expected number of sequence occurrences for every word not detected in the bidirectional (a) or unidirectional (b) promoters.
Top 2 clusters for the bidirectional promoter. The word-based clusters for the two most overrepresented words for the bidirectional promoters. Rank 1 refers to word TCGCGCCA and Rank 2 to TCCCGGGA.
| (a) Rank 1 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp. | Position | Palindrome |
| TCGCGCCA | 4 | 0.918299 | 4 | 0.9375 | 5.88611 | TGGCGCGA | 12538 | No |
| TCGCCCCA | 3 | 0.805161 | 3 | 0.820513 | 3.94598 | TGGGGCGA | 2834 | No |
| TAGCGCCA | 1 | 0.263929 | 1 | 0.266667 | 1.33207 | TGGCGCTA | 4918 | No |
| TCGAGCCA | 1 | 0.469775 | 1 | 0.47619 | 0.755501 | TGGCTCGA | NA | No |
| TCGCGACA | 1 | 0.655751 | 1 | 0.666667 | 0.421975 | TGTCGCGA | NA | No |
| TCGGGCCA | 1 | 0.683955 | 1 | 0.695652 | 0.379863 | TGGCCCGA | NA | No |
| TTGCGCCA | 1 | 0.693903 | 2 | 0.705882 | 0.365423 | TGGCGCAA | NA | No |
| TCGCGGCA | 1 | 0.826074 | 1 | 0.842105 | 0.191071 | TGCCGCGA | NA | No |
| TCGCGTCA | 1 | 0.84063 | 1 | 0.857143 | 0.173604 | TGACGCGA | 4051 | No |
| TCGCGCCC | 1 | 1.51582 | 1 | 1.5625 | -0.41596 | GGGCGCGA | 13089 | No |
| CCGCGCCA | 2 | 2.5054 | 2 | 2.625 | -0.4506 | TGGCGCGG | NA | No |
| (b) Rank 2 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp. | Position | Palindrome |
| TCCCGGGA | 8 | 3.97165 | 8 | 4.26667 | 5.60208 | TCCCGGGA | 2 | Yes |
| TCCAGGGA | 2 | 0.941495 | 2 | 0.961538 | 1.50687 | TCCCTGGA | NA | No |
| TCCCGAGA | 2 | 1.05556 | 2 | 1.08 | 1.27816 | TCTCGGGA | 13248 | No |
| TGCCGGGA | 1 | 0.514348 | 1 | 0.521739 | 0.664856 | TCCCGGCA | NA | No |
| TCCCGTGA | 1 | 0.702073 | 1 | 0.714286 | 0.353718 | TCACGGGA | NA | No |
| TCCCAGGA | 4 | 3.71413 | 5 | 3.97222 | 0.296597 | TCCTGGGA | 19059 | No |
| TCTCGGGA | 2 | 1.73986 | 2 | 1.8 | 0.278683 | TCCCGAGA | 3074 | No |
| ACCCGGGA | 1 | 0.785281 | 1 | 0.8 | 0.241714 | TCCCGGGT | 20941 | No |
| TCCCCGGA | 1 | 0.852649 | 1 | 0.869565 | 0.159407 | TCCGGGGA | NA | No |
| TCCCGCGA | 1 | 1.01424 | 1 | 1.03704 | -0.01414 | TCGCGGGA | NA | No |
| TCCCGGAA | 3 | 3.29619 | 3 | 3.5 | -0.28247 | TTCCGGGA | NA | No |
| TCCTGGGA | 1 | 1.32696 | 1 | 1.36364 | -0.28289 | TCCCAGGA | 13129 | No |
| TCCCGGGG | 3 | 3.34568 | 3 | 3.55556 | -0.32717 | CCCCGGGA | 21071 | No |
| TCCCGGGT | 1 | 2.38044 | 1 | 2.48889 | -0.86729 | ACCCGGGA | 13746 | No |
| CCCCGGGA | 1 | 2.78651 | 1 | 2.93333 | -1.02479 | TCCCGGGG | 19211 | No |
| GCCCGGGA | 1 | 3.73853 | 2 | 4 | -1.31869 | TCCCGGGC | 21163 | No |
| TCCCGGGC | 3 | 5.1829 | 4 | 5.68889 | -1.64025 | GCCCGGGA | 21138 | No |
Top 2 clusters for the unidirectional promoter. The word-based clusters for the two most overrepresented words for the unidirectional promoters. Rank 1 refers to word ACCCGCCT and Rank 2 to CTTCTTTC.
| (a) Rank 1 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp. | Position | Palindrome |
| ACCCGCCT | 4 | 0.716577 | 4 | 0.727273 | 6.87826 | AGGCGGGT | 19440 | No |
| ATCCGCCT | 1 | 0.132296 | 1 | 0.133333 | 2.02271 | AGGCGGAT | NA | No |
| ACCAGCCT | 2 | 0.738772 | 2 | 0.75 | 1.99183 | AGGCTGGT | 1303 | No |
| AGCCGCCT | 1 | 0.657331 | 1 | 0.666667 | 0.419567 | AGGCGGCT | 1056 | No |
| ACCCACCT | 1 | 0.738772 | 1 | 0.75 | 0.302766 | AGGTGGGT | NA | No |
| ACGCGCCT | 1 | 1.16147 | 1 | 1.18519 | -0.14969 | AGGCGCGT | NA | No |
| CCCCGCCT | 1 | 2.45503 | 2 | 2.54545 | -0.89814 | AGGCGGGG | 21912 | No |
| (b) Rank 2 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp. | Position | Palindrome |
| CTTCTTTC | 5 | 1.7686 | 5 | 1.81818 | 5.19624 | GAAAGAAG | 13567 | No |
| CTACTTTC | 1 | 0.180301 | 1 | 0.181818 | 1.71313 | GAAAGTAG | NA | No |
| CTTCTTCC | 1 | 0.304671 | 1 | 0.307692 | 1.18852 | GGAAGAAG | 5306 | No |
| CTGCTTTC | 2 | 1.15305 | 2 | 1.17647 | 1.10147 | GAAAGCAG | 9703 | No |
| CGTCTTTC | 1 | 0.371023 | 1 | 0.375 | 0.991491 | GAAAGACG | 20167 | No |
| CTCCTTTC | 3 | 2.36561 | 3 | 2.45 | 0.712729 | GAAAGGAG | 11346 | No |
| CTTCTATC | 1 | 0.607134 | 1 | 0.615385 | 0.499005 | GATAGAAG | NA | No |
| CTTCCTTC | 1 | 0.921427 | 1 | 0.9375 | 0.0818318 | GAAGGAAG | 10908 | No |
| GTTCTTTC | 1 | 1.07027 | 1 | 1.09091 | -0.067912 | GAAAGAAC | 17502 | No |
| CTTTTTTC | 1 | 1.2055 | 1 | 1.23077 | -0.186894 | GAAAAAAG | NA | No |
| TTTCTTTC | 2 | 3.4628 | 2 | 3.63636 | -1.09786 | GAAAGAAA | NA | No |
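Every word listed in the word-based clusters above differs from its seed word at exactly one position. A minimal sketch of how such a cluster could be assembled follows, assuming a Hamming-distance threshold of 1. The threshold and the function names are illustrative assumptions, not the authors' documented procedure.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires words of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_cluster(seed: str, words: list[str], max_dist: int = 1) -> list[str]:
    """Keep the words within max_dist mismatches of the seed (assumed threshold)."""
    return [w for w in words if hamming(seed, w) <= max_dist]


# A few words from the tables, plus TCCCGGGA, which is not a neighbour of the seed.
observed = ["TCGCGCCA", "TCGCCCCA", "TAGCGCCA", "TCCCGGGA", "CCGCGCCA"]
print(hamming_cluster("TCGCGCCA", observed))
```

Restricting the cluster to single mismatches keeps every member alignable column by column, which is what a sequence logo of the cluster requires.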
Edit cluster for bidirectional promoters. The word-based clusters for the two most overrepresented words for the bidirectional promoters according to the edit distance metric. Rank 1 refers to word TCGCGCCA and Rank 2 to TCCCGGGA.
| (a) Rank 1 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp. | Position | Palindrome |
| TCGCGCCA | 4 | 0.918299 | 4 | 0.9375 | 5.88611 | TGGCGCGA | 12538 | No |
| TCGCCCCA | 3 | 0.805161 | 3 | 0.820513 | 3.94598 | TGGGGCGA | 2834 | No |
| TAGCTCCA | 2 | 0.352982 | 2 | 0.357143 | 3.46897 | TGGAGCTA | NA | No |
| TCTCGCGA | 2 | 0.438673 | 2 | 0.444444 | 3.0343 | TCGCGAGA | 4937 | No |
| TCGCCACA | 2 | 0.455424 | 2 | 0.461538 | 2.95935 | TGTGGCGA | 4669 | No |
| ... | ||||||||
| (b) Rank 2 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | RevComp. | Position | Palindrome |
| TCCCGGGA | 8 | 3.97165 | 8 | 4.26667 | 5.60208 | TCCCGGGA | 2 | Yes |
| TCCCGGCT | 6 | 2.54354 | 6 | 2.66667 | 5.14921 | AGCCGGGA | NA | No |
| ATCCGGGA | 2 | 0.395077 | 2 | 0.4 | 3.24364 | TCCCGGAT | NA | No |
| TCTCGCGA | 2 | 0.438673 | 2 | 0.444444 | 3.0343 | TCGCGAGA | 4937 | No |
| TTCCTGGA | 2 | 0.493082 | 2 | 0.5 | 2.80045 | TCCAGGAA | 9505 | No |
| ... | ||||||||
Edit cluster for unidirectional promoters. The word-based clusters for the two most overrepresented words for the unidirectional promoters according to the edit distance metric. Rank 1 refers to word ACCCGCCT and Rank 2 to CTTCTTTC.
| (a) Rank 1 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | Rev.Comp. | Position | Palindrome |
| ACCCGCCT | 4 | 0.716577 | 4 | 0.727273 | 6.87826 | AGGCGGGT | 19440 | No |
| AGCCGGCT | 3 | 0.805285 | 3 | 0.818182 | 3.94551 | AGCCGGCT | 14 | Yes |
| AGGCGCCT | 3 | 1.11427 | 3 | 1.13636 | 2.97124 | AGGCGCCT | 92 | Yes |
| AAGCGCCT | 4 | 2.15617 | 4 | 2.22727 | 2.47184 | AGGCGCTT | 5872 | No |
| ACCTGCAT | 2 | 0.592063 | 2 | 0.6 | 2.43458 | ATGCAGGT | NA | No |
| ... | ||||||||
| (b) Rank 2 | ||||||||
| Word | S | ES | O | EO | Sln(S/ES) | Rev.Comp. | Position | Palindrome |
| CTTCTTTC | 5 | 1.7686 | 5 | 1.81818 | 5.19624 | GAAAGAAG | 13567 | No |
| TCTTCTTC | 4 | 1.30438 | 4 | 1.33333 | 4.48225 | GAAGAAGA | NA | No |
| CCTCTTTA | 2 | 0.282982 | 2 | 0.285714 | 3.91104 | TAAAGAGG | NA | No |
| CTTTTTCA | 3 | 0.917377 | 3 | 0.933333 | 3.55455 | TGAAAAAG | NA | No |
| GTTCATTC | 2 | 0.359828 | 2 | 0.363636 | 3.43055 | GAATGAAC | NA | No |
| ... | ||||||||
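The edit distance used for these clusters also counts insertions and deletions, so it can relate words that are shifted copies of each other. The sketch below uses the standard Levenshtein recurrence with unit costs. The tables do not state the exact cost scheme or cluster threshold, so both are assumptions here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for substitution, insertion and deletion."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Seed and second row of the unidirectional Rank 2 cluster.
print(edit_distance("CTTCTTTC", "TCTTCTTC"))
```

CTTCTTTC and TCTTCTTC differ at four aligned positions, yet one deletion plus one insertion converts one into the other, which is why a shifted word can appear in an edit-distance cluster but not in a mismatch-only cluster.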
Figure 4. Sequence logos for bidirectional promoters. Sequence logos corresponding to the word-based clusters of the top 2 overrepresented words of the bidirectional promoters. Rank 1 (a) corresponds to the word TCGCGCCA, while Rank 2 (b) corresponds to TCCCGGGA.
Figure 5. Sequence logo for unidirectional promoters. Sequence logos corresponding to the word-based clusters of the top 2 overrepresented words of the unidirectional promoters. Rank 1 (a) corresponds to the word ACCCGCCT, while Rank 2 (b) corresponds to CTTCTTTC.
Figure 6. Edit distance cluster for bidirectional promoters. Sequence alignments corresponding to the word-based clusters of the top 2 overrepresented words of the bidirectional promoters. For each cluster, five words were chosen based on their overall overrepresentation in the promoter set. Rank 1 (a) corresponds to the word TCGCGCCA, while Rank 2 (b) corresponds to TCCCGGGA.
Figure 7. Edit distance cluster for unidirectional promoters. Sequence logos corresponding to the word-based clusters of the top 2 overrepresented words of the unidirectional promoters. Rank 1 (a) corresponds to the word ACCCGCCT, while Rank 2 (b) corresponds to CTTCTTTC.
Sequence clusters (pairs of sequences). Sequence clusters containing pairs of sequences for the bidirectional (a) and unidirectional (b) promoter sets. Each sequence occurs in only one cluster. The sequences are clustered based on the number of words (within the top 60 overrepresented words) that are shared between them with the distance denoting the number of words not shared between them.
| (a) Bidirectional | (b) Unidirectional | ||||
| Sequence 1 | Sequence 2 | Distance | Sequence 1 | Sequence 2 | Distance |
| chr3:185561446–185562546 | chr11:832429–833529 | 54 | chr10:50416978–50418078 | chr3:188006884–188007984 | 57 |
| chr19:53365272–53366372 | chr19:7600339–7601439 | 55 | chr12:52868924–52870024 | chr7:73306574–73307674 | 57 |
| chr11:18299718–18300818 | chr15:41589928–41591028 | 56 | chr5:68890824–68891924 | chr19:55578407–55579507 | 58 |
| chr4:57538069–57539168 | chr19:48776246–48777346 | 56 | chr6:30982955–30984055 | chr9:99499360–99500460 | 58 |
| chr11:107598052–107599152 | chr12:131773918–131775018 | 56 | chr10:131154509–131155609 | chr19:50618917–50620017 | 58 |
| chr13:107668425–107669525 | chr1:11674165–11675265 | 57 | chr5:86744492–86745592 | chr17:30330654–30331754 | 58 |
| chr6:43650922–43652022 | chr16:2037768–2038868 | 57 | chr11:118471287–118472387 | chr8:55097461–55098561 | 58 |
| chr22:36678663–36679763 | chr11:61315725–61316825 | 58 | chr16:13920523–13921623 | chr8:101231014–101232114 | 58 |
| chr5:60276548–60277648 | chr22:40346240–40347340 | 58 | chr5:131919528–131920628 | chr19:1046236–1047336 | 58 |
| chr11:93866588–93867688 | chr3:130641442–130642542 | 58 | chr12:108015528–108016628 | chr16:56053079–56054179 | 59 |
| chr17:7327421–7328521 | chr17:1679094–1680194 | 58 | chr1:28113723–28114823 | chr2:216681376–216682476 | 59 |
| chr20:5055168–5056268 | chr15:38773660–38774760 | 58 | chr8:91065972–91067072 | chr4:39044247–39045347 | 59 |
| chr14:19992129–19993229 | chr11:66877493–66878593 | 59 | chr14:60270222–60271322 | chr11:47192088–47193188 | 59 |
| chr17:38530557–38531657 | chr13:31786616–31787716 | 59 | chr7:7724663–7725763 | chr11:62284590–62285690 | 59 |
| chr12:122683333–122684433 | chr13:33289233–33290333 | chr12:116937892–116938992 | 59 | ||
| chr5:82408167–82409267 | chr9:109084364–109085464 | chr7:101906286–101907386 | 59 | ||
| chr2:127768122–127769222 | chr8:42314186–42315286 | chr19:50565569–50566669 | 59 | ||
| chr12:102882746–102883846 | chr3:9764704–9765804 | chr14:49224583–49225683 | 59 | ||
| chr13:102295174–102296274 | chr6:30790834–30791934 | 59 | |||
| chr12:912403–913503 | |||||
| chr2:128332074–128333174 | |||||
| chr7:44129555–44130655 | |||||
| chr11:73980276–73981376 | |||||
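One possible reading of the distance in this table is sketched below. It assumes the distance equals the size of the top-word list (60) minus the number of those words found in both sequences, so that a distance of 54 would correspond to six shared words. This interpretation, the helper name, and the toy word sets are assumptions for illustration only.

```python
def shared_word_distance(words_a: set[str], words_b: set[str], top_words: set[str]) -> int:
    """Number of top words NOT shared by the two sequences
    (assumed definition: len(top_words) minus the shared count)."""
    shared = top_words & words_a & words_b
    return len(top_words) - len(shared)


# Toy example with a hypothetical top list of 5 words instead of 60.
top = {"TCGCGCCA", "TCCCGGGA", "GGCCCGCC", "TCCCGGCT", "CAGGGGCC"}
seq1_words = {"TCGCGCCA", "TCCCGGGA", "GGCCCGCC"}
seq2_words = {"TCGCGCCA", "TCCCGGGA", "CAGGGGCC"}
print(shared_word_distance(seq1_words, seq2_words, top))
```

Under this reading, smaller distances mean more shared overrepresented words, which matches the ascending order of the Distance column.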
Figure 8. GBrowse visualization for primary bidirectional sequence cluster. The GBrowse visualization of the two sequences for the top sequence-based cluster in the bidirectional promoter set. Shown are the words from the set of top 60 words that are detected in these two sequences.
Figure 9. GBrowse visualization for primary unidirectional sequence cluster. The GBrowse visualization of the two sequences for the top sequence-based cluster in the unidirectional promoter set. Shown are the words from the set of top 60 words that are detected in these two sequences.
Word co-occurrence. The top 25 word pairs for the bidirectional (a) and unidirectional (b) promoter set. The word pairs are sorted in descending order by S*ln(S/ES) score.
| (a) Bidirectional | (b) Unidirectional | ||||||||
| Word 1 | Word 2 | S | ES | Sln(S/ES) | Word 1 | Word 2 | S | ES | Sln(S/ES) |
| TCTGAGGA | TCGCGCCA | 3 | 0.0529 | 12.1158 | GTTCATTC | TCCGCCGG | 2 | 0.0073 | 11.2184 |
| ACTCCAGC | TCGCGCCA | 3 | 0.0580 | 11.8387 | CTGTGTGC | TGCGCCGA | 2 | 0.0074 | 11.1966 |
| GCCCAGCC | TCCGCCGC | 3 | 0.0722 | 11.1827 | TGACGCGA | CTCCCGCT | 2 | 0.0082 | 10.9997 |
| GCCCAGCC | CGGAGCGC | 2 | 0.0087 | 10.8711 | AGCCGGCT | GGGGAGTA | 2 | 0.0131 | 10.0590 |
| TGCCCGCG | TCCCGGGA | 4 | 0.2729 | 10.7404 | ATTGCAGG | ATTCTCTC | 2 | 0.0169 | 9.5459 |
| GGCAGGGA | GGGCCAGG | 4 | 0.3400 | 9.8609 | GGGGAGTA | AGGAAACA | 2 | 0.0190 | 9.3177 |
| TCCCGGGA | TCGCGCCA | 3 | 0.1140 | 9.8112 | CTGGGAGC | GTTCATTC | 2 | 0.0218 | 9.0337 |
| AGCCTGTC | TCCCGGGA | 3 | 0.1158 | 9.7646 | CCTTCCGA | CTGGGAGC | 2 | 0.0240 | 8.8439 |
| GGAGGCTG | TCGCGCCA | 3 | 0.1173 | 9.7250 | TGGGCGGA | ACCCGCCT | 2 | 0.0247 | 8.7895 |
| TCCGCCGC | GCCCCTCC | 4 | 0.3554 | 9.6830 | TTTCTCCA | CGGAAACC | 2 | 0.0265 | 8.6446 |
| AGAAAAGA | TCGCGCCA | 2 | 0.0182 | 9.4042 | CCCCCGCG | ACCCGCCT | 2 | 0.0280 | 8.5339 |
| GCCCAGCC | GCCCCTCC | 3 | 0.1360 | 9.2808 | TCCGCCGG | GGGGCTGC | 2 | 0.0415 | 7.7522 |
| TGCCAAAA | GCCGGCGA | 2 | 0.0195 | 9.2604 | AGCTGGCT | CCAGGCTG | 2 | 0.0422 | 7.7192 |
| CAGCAGCC | TGCGGAAT | 2 | 0.0208 | 9.1297 | TTGGTCTC | AGGAAACA | 2 | 0.0446 | 7.6068 |
| AGGGCCGT | TCCCGGCT | 3 | 0.1433 | 9.1249 | CTGGGAGC | TCCGCCGG | 2 | 0.0519 | 7.3020 |
| CCTCCAGA | TTCCACCC | 2 | 0.0216 | 9.0521 | CTTTTCTC | GCGCCGCG | 2 | 0.0545 | 7.2046 |
| CGAGGAGA | TCGCGCCA | 2 | 0.0220 | 9.0204 | ATTGCAGG | ATTAAAAT | 2 | 0.0585 | 7.0639 |
| TCCGCCGC | CGGAGCGC | 2 | 0.0228 | 8.9501 | TGGAACCC | GCAGGGCG | 2 | 0.0645 | 6.8693 |
| ACCCTCGT | AGGGAGGG | 2 | 0.0253 | 8.7380 | GGGCAGGC | AGCTGGCT | 2 | 0.0657 | 6.8326 |
| GCCCAGCC | TCCACTGT | 2 | 0.0254 | 8.7315 | TTGGTCTC | CTTCTTTC | 2 | 0.0676 | 6.7745 |
| CAGCAGCC | AGGGCCGT | 3 | 0.1705 | 8.6024 | CTTTTTCA | CGCCCCTT | 2 | 0.0684 | 6.7522 |
| TGCCCGCG | TCCCGGCT | 3 | 0.1747 | 8.5291 | GCAGGGCG | AGGAAACA | 2 | 0.0766 | 6.5251 |
| CCCAGGAC | AGAGAGCT | 2 | 0.0291 | 8.4590 | GGGCAGGC | TTTCTCCA | 2 | 0.0939 | 6.1181 |
| TCTGGGAT | GGCCCGCC | 2 | 0.0329 | 8.2123 | CTGGGAGC | TCTCCCCT | 2 | 0.0947 | 6.0996 |
| AGCCGGGC | AGAAAAGA | 2 | 0.0333 | 8.1930 | AGCAGGGC | GGCTTTTA | 2 | 0.0956 | 6.0805 |
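The pair scores above can be approximated with a short sketch. The pair ES values in panel (a) are numerically consistent with ES1 * ES2 / 32, and those in panel (b) with ES1 * ES2 / 42, where ES1 and ES2 are the single-word expectations. This independence formula and the divisors are inferred from the numbers, not stated in the text, and the function names are hypothetical. The counting helper only checks the given strand.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(sequences: list[str], words: list[str]) -> Counter:
    """Count, for each word pair, how many sequences contain both words (S)."""
    counts = Counter()
    for seq in sequences:
        present = sorted(w for w in words if w in seq)
        counts.update(combinations(present, 2))
    return counts


def expected_pair(es1: float, es2: float, n_sequences: int) -> float:
    """Expected co-occurrence under independence (inferred, not documented)."""
    return es1 * es2 / n_sequences


def pair_score(s: int, es: float) -> float:
    """S * ln(S / ES), the score used to sort the table."""
    return s * math.log(s / es)


# First row of panel (a): TCTGAGGA (ES 1.84222) with TCGCGCCA (ES 0.918299), S = 3.
es = expected_pair(1.84222, 0.918299, 32)
print(round(es, 4), round(pair_score(3, es), 4))
```

Because the expected count of a pair is far smaller than that of either word alone, even two or three co-occurrences produce scores above those of the best single words.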
Figure 10. Comparison analysis: plot for complete set of words. Comparison of the words detected for the two promoter sets based on their computed overrepresentation scores.
Unique and interesting words for the promoter sets. The words from the unidirectional and bidirectional promoter sets that exhibit a significant score-based distance from the other data set.
| (a) Unidirectional | (b) Bidirectional | ||||||
| Word | Unidirectional | Bidirectional | Distance | Word | Unidirectional | Bidirectional | Distance |
| ACCCGCCT | 6.87826 | -0.0263597 | 4.882303411 | TCCCGGGA | -0.0850495 | 5.60208 | -4.021407835 |
| GGGGCTGC | 4.49226 | -1.0872000 | 3.945274001 | GGCCCGCC | 0 | 5.36123 | -3.790962089 |
| CGGCGGCG | 4.01229 | -1.3139900 | 3.766248706 | CGCGGCCG | -0.3641650 | 4.91487 | -3.732841447 |
| AGGAAACA | 4.92885 | 0.1254760 | 3.396498328 | TCCCGGCT | 0 | 5.14921 | -3.641041309 |
| CTTCTTTC | 5.19624 | 0.4219750 | 3.375915157 | CAGGGGCC | 0 | 5.13315 | -3.629685174 |
| TCCGCCGG | 3.76460 | -0.8986470 | 3.297413576 | AGGGCCGT | 0 | 5.10145 | -3.607269889 |
| TCTTCTTC | 4.48225 | 0 | 3.169429370 | TCTGAGGA | 0 | 4.99234 | -3.530117468 |
| ATTAAAAT | 4.29023 | 0 | 3.033650726 | CGTGGGGG | 0.0180292 | 4.92572 | -3.470261445 |
| GGGGAGTA | 4.44222 | 0.3737000 | 2.876878081 | TCTGGGAT | 0 | 4.81380 | -3.403870623 |
| CGCCCCTT | 3.90482 | -0.1463740 | 2.864626749 | AGGGAGGG | 0 | 4.72230 | -3.339170353 |
| TTTTTTGA | 4.01229 | 0 | 2.837117467 | AGAAAAGA | 0 | 4.66976 | -3.302018963 |
| TTTCTCCA | 3.96242 | 0 | 2.801854052 | GGGCCAGG | 0 | 4.62990 | -3.273833686 |
| AGCCGGCT | 3.94551 | 0 | 2.789896876 | ACTCCAGC | 0 | 4.53045 | -3.203511917 |
| TTGGTCTC | 3.65176 | -0.2608830 | 2.766656398 | CCCCAGCT | -0.9904730 | 3.48143 | -3.162112936 |
| GCGCCGCG | 3.81433 | 0 | 2.697138609 | CGGGCCGA | 0 | 4.45426 | -3.149637451 |
| ATTCCCAG | 3.80733 | 0 | 2.692188861 | TCCGCCGC | -0.8886350 | 3.55395 | -3.141381979 |
| GCAGGGCG | 4.66535 | 0.8645290 | 2.687586303 | TGCCCGCG | -0.3137370 | 4.10844 | -3.126951344 |
| GAGGGGCG | 3.03108 | -0.7557900 | 2.677721456 | TGCGGAAT | 0 | 4.41371 | -3.120964271 |
| CCCCCGCG | 3.55664 | -0.1908410 | 2.649869227 | GCCGGCGA | 0 | 4.33335 | -3.064141170 |
| AGGGGAGC | 3.15866 | -0.5635770 | 2.632019024 | CAGCAGCC | -0.0679120 | 4.10418 | -2.950114545 |
| TGCGCCGA | 3.68519 | 0 | 2.605822839 | CGAGGAGA | 0 | 4.09415 | -2.895001228 |
| CCGCGCCC | 2.25420 | -1.4189300 | 2.597295131 | CGCAGGCG | -0.2779570 | 3.74626 | -2.845551130 |
| GTGCGTTT | 3.66247 | 0 | 2.589757373 | TTCCACCC | 0 | 4.02098 | -2.843262225 |
| CTGGGAGC | 3.36673 | -0.2940760 | 2.588580747 | TCGCCCCA | 0 | 3.94598 | -2.790229216 |
| TGCCTCCC | 3.34992 | -0.2629130 | 2.554658714 | GGGGCCGG | 0.8548330 | 4.76672 | -2.766121825 |
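The Distance column matches the signed perpendicular distance of the point (unidirectional score, bidirectional score) from the diagonal on which both scores are equal, that is (U - B) / sqrt(2). The sketch below reproduces it. The formula is inferred from the tabulated values, and the function names are illustrative.

```python
import math


def diagonal_distance(uni_score: float, bi_score: float) -> float:
    """Signed distance of (uni_score, bi_score) from the line uni == bi.
    Positive values lean unidirectional, negative values lean bidirectional."""
    return (uni_score - bi_score) / math.sqrt(2)


def leaning(uni_score: float, bi_score: float) -> str:
    """Label a word by the sign of its distance (illustrative helper)."""
    d = diagonal_distance(uni_score, bi_score)
    if d > 0:
        return "unidirectional"
    if d < 0:
        return "bidirectional"
    return "neither"


# First rows of panels (a) and (b).
print(diagonal_distance(6.87826, -0.0263597), leaning(6.87826, -0.0263597))
print(diagonal_distance(-0.0850495, 5.60208), leaning(-0.0850495, 5.60208))
```

Dividing by sqrt(2) turns the raw score difference into a geometric distance in the score-versus-score scatterplot, so the values can be read directly off the plot.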
Figure 11. Comparison analysis: plot for distinctive words. The words descriptive of the unidirectional promoter set (red) and the bidirectional promoter set (green). Words that are not sufficiently descriptive of either data set are eliminated from the plot.
Descriptive words for both the unidirectional and bidirectional promoter sets. The top 25 words that are correlated in the two promoter sets, according to their overrepresentation scores. The words had to be overrepresented with an S*ln(S/ES) score of at least 1.5. Shown are the words with a distance between -0.11 and 0.11.
| Word | Unidirectional | Bidirectional | Distance |
| CTTTGGCC | 2.08857 | 2.23024 | -0.100175818 |
| AGGCAGGA | 1.51526 | 1.64780 | -0.093719933 |
| CTCAGGAT | 1.58527 | 1.71375 | -0.090849079 |
| GGGGGGAC | 1.61803 | 1.70814 | -0.063717392 |
| CTTGCGGA | 1.65530 | 1.73350 | -0.055295750 |
| CTGAGCAG | 1.99183 | 2.05890 | -0.047425652 |
| GCCTGAGG | 1.99183 | 2.04796 | -0.039689904 |
| TGAAGTGG | 1.61803 | 1.66175 | -0.030914708 |
| GCCATCCG | 1.86393 | 1.89589 | -0.022599133 |
| AGGTTGCA | 2.20477 | 2.23024 | -0.018010010 |
| TCTGTGCC | 1.84096 | 1.85915 | -0.012862272 |
| TACCACTA | 1.86393 | 1.88037 | -0.011624835 |
| CAAAGAAT | 1.61803 | 1.61872 | -0.000487904 |
| ACCGCTCA | 1.61803 | 1.61872 | -0.000487904 |
| TATCTTAG | 1.61803 | 1.61872 | -0.000487904 |
| AGAGTTCC | 1.62605 | 1.61872 | 0.005183093 |
| GTCGGCTT | 1.90512 | 1.88037 | 0.017500893 |
| CGCGCGCA | 1.94164 | 1.90263 | 0.027584236 |
| CAGGCCAG | 1.95383 | 1.86972 | 0.059474751 |
| ACAGAAAG | 2.79686 | 2.70295 | 0.066404398 |
| GTCAGGAG | 2.40520 | 2.25776 | 0.104255824 |
| GGAAGTGA | 1.96108 | 1.81095 | 0.106157941 |
| TAGAGAGC | 1.99183 | 1.84125 | 0.106476139 |
| TGCCAGGG | 1.75813 | 1.60511 | 0.108201480 |
| GCACAAGC | 1.95383 | 1.80053 | 0.108399470 |
| TTCACTTA | 2.15055 | 1.99725 | 0.108399470 |
Figure 12. Comparison analysis: plot for general words. The words that are significantly correlated in both data sets.
Lookup results for interesting words in the promoters. Information about the regulatory function of the top 10 overrepresented words for the bidirectional and unidirectional promoter set based on lookups in the TRANSFAC and JASPAR databases.
| (a) Bidirectional | |||||
| Sequence | Transcription Factor (Matrix ID^a) | Sequence (bottom) aligned to matrix consensus^b | Matches^c | Avg. Score^d | Score Range^e |
| TCGCGCCA | PF0112^f | 4/6 | 89.0 | 86.5–96.8 | |
| TCCCGGGA | STAT5A | 8/16 | 86.7 | 86.7–86.7 | |
| GGCCCGCC | SP1 (V$SP1_01) | 8/13 | 90.2 | 86.5–90.8 | |
| TCCCGGCT | ELK1 (MA0028) | 3/6 | 86.9 | 86.5–87.7 | |
| CAGGGGCC | V$WT1_Q6 | 5/6 | 87.4 | 85.0–91.1 | |
| AGGGCCGT | MYB (V$MYB_Q3) | 2/7 | 91.2 | 89.8–92.6 | |
| TCTGAGGA | TFIIA (V$TFIIA_Q6) | 2/8 | 88.1 | 85.8–90.5 | |
| CGTGGGGG | E2F (V$E2F1_Q3) | 6/6 | 87.3 | 87.3–87.3 | |
| TGCTGAGA | No match. | ||||
| CGCGGCCG | No match. | ||||
| (b) Unidirectional | |||||
| Sequence | Transcription Factor (Matrix ID^a) | Sequence (bottom) aligned to matrix consensus^b | Matches^c | Avg. Score^d | Score Range^e |
| ACCCGCCT | SP1 (V$SP1_01) | 4/7 | 86.2 | 85.9–87.3 | |
| CTTCTTTC | No match. | ||||
| AGGAAACA | NFAT (V$NFAT_Q4_01) | 5/5 | 87.3 | 85.8–88.1 | |
| GCAGGGCG | PF0096^f | 10/10 | 86.8 | 86.5–87.1 | |
| GGGGCTGC | LRF (V$LRF_Q2) | 5/8 | 85.4 | 85.4–85.4 | |
| TCTTCTTC | No match. | ||||
| GGGGAGTA | FOXC1 (MA0032) | 4/4 | 95.5 | 95.5–95.5 | |
| ATTAAAAT | OCT1 (V$OCT1_06) | 4/9 | 86.9 | 86.5–87.5 | |
| CGGAAACC | AREB6 (V$AREB6_04) | 3/3 | 92.2 | 88.3–95.8 | |
| TGGGCGGA | GC (V$GC_01) | 4/5 | 90.3 | 85.1–95.2 | |
a. JASPAR id or TRANSFAC id.
b. The consensus is in IUPAC notation: R = G or A, Y = T or C, M = A or C, H = not G, K = G or T, W = A or T, B = not A, S = G or C, V = not T, N = anything.
c. Number of occurrences of the matrix that scored greater than 85% in the dataset.
d. Average score for the occurrences meeting the 85% threshold.
e. Range of scores for the occurrences meeting the 85% threshold.
f. A profile that was extracted from phylogenetically conserved gene upstream elements.
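The IUPAC codes in footnote b can be turned into a simple consensus check. The sketch below is a minimal illustration that uses only the codes listed in that footnote plus the four plain bases. It tests exact-length agreement between a word and a consensus string and is not the matrix scoring used for the 85% threshold. The consensus strings in the example are hypothetical.

```python
# Allowed bases per code, restricted to the codes named in footnote b.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "W": "AT", "S": "GC",
    "H": "ACT",   # not G
    "B": "CGT",   # not A
    "V": "ACG",   # not T
    "N": "ACGT",  # anything
}


def matches_consensus(word: str, consensus: str) -> bool:
    """True if every base of word is permitted by the IUPAC code at that position."""
    if len(word) != len(consensus):
        return False
    return all(base in IUPAC[code] for base, code in zip(word, consensus))


# Hypothetical consensus strings, for illustration only.
print(matches_consensus("GGCCCGCC", "GGSSCGCC"))
print(matches_consensus("ACGT", "BNNN"))
```

A position weight matrix score generalizes this yes-or-no test by weighting each base at each position, which is what allows a percentage threshold to be applied.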
Conservation analysis. The results for conservation analysis of the top 10 word pairs in the bidirectional (a) and unidirectional (b) promoter set. For each word pair, the occurrence location of the pair is given, as well as an identifier for the conservation of the sites, and a PhastCons score for the quality of the conservation across 28 organisms. Conservation can be categorized as: none (no word was conserved), partial (one word was conserved) and complete (all words were conserved).
| (a) Bidirectional | |||||
| Word 1 | Word 2 | Location | Conservation | Hit | Score |
| TCTGAGGA | TCGCGCCA | chr19:53365272–53366372 | None | ||
| chr19:48776246–48777346 | None | ||||
| chr19:7600339–7601439 | Partial | TCGCGCCA | 385 | ||
| ACTCCAGC | TCGCGCCA | chr4:57538069–57539168 | None | ||
| chr19:48776246–48777346 | None | ||||
| chr19:7600339–7601439 | Partial | TCGCGCCA | 385 | ||
| GCCCAGCC | TCCGCCGC | chr3:185561446–185562546 | Partial | TCCGCCGC | 310 |
| chr14:19992129–19993229 | None | ||||
| chr11:832429–833529 | None | ||||
| GCCCAGCC | CGGAGCGC | chr3:185561446–185562546 | None | ||
| chr14:19992129–19993229 | None | ||||
| TGCCCGCG | TCCCGGGA | chr19:53365272–53366372 | Partial | TCCCGGGA | 390 |
| chr13:107668425–107669525 | None | ||||
| chr20:5055168–5056268 | None | ||||
| chr11:832429–833529 | None | ||||
| GGCAGGGA | GGGCCAGG | chr19:53365272–53366372 | Partial | GGGCCAGG | 390 |
| chr22:40346240–40347340 | Complete | GGCAGGGA | 325 | ||
| GGGCCAGG | 522 | ||||
| chr5:60276548–60277648 | None | ||||
| chr12:131773918–131775018 | None | ||||
| TCCCGGGA | TCGCGCCA | chr19:53365272–53366372 | Partial | TCCCGGGA | 390 |
| chr4:57538069–57539168 | None | ||||
| chr19:7600339–7601439 | Partial | TCGCGCCA | 385 | ||
| AGCCTGTC | TCCCGGGA | chr17:38530557–38531657 | None | ||
| chr13:107668425–107669525 | Partial | AGCCTGTC | 244 | ||
| chr4:57538069–57539168 | None | ||||
| GGAGGCTG | TCGCGCCA | chr4:57538069–57539168 | None | ||
| chr19:48776246–48777346 | None | ||||
| chr19:7600339–7601439 | Partial | TCGCGCCA | 385 | ||
| TCCGCCGC | GCCCCTCC | chr3:185561446–185562546 | Partial | TCCGCCGC | 310 |
| chr14:19992129–19993229 | None | ||||
| chr1:11674165–11675265 | Partial | GCCCCTCC | 360 | ||
| chr11:832429–833529 | None | ||||
| (b) Unidirectional | |||||
| Word 1 | Word 2 | Location | Conservation | Hit | Score |
| GTTCATTC | TCCGCCGG | chr7:73306574–73307674 | None | ||
| chr12:52868924–52870024 | Partial | TCCGCCGG | 325 | ||
| CTGTGTGC | TGCGCCGA | chr10:131154509–131155609 | None | ||
| chr19:1046236–1047336 | None | ||||
| TGACGCGA | CTCCCGCT | chr12:116937892–116938992 | None | ||
| chr17:30330654–30331754 | None | ||||
| AGCCGGCT | GGGGAGTA | chr6:30982955–30984055 | None | ||
| chr16:13920523–13921623 | None | ||||
| ATTGCAGG | ATTCTCTC | chr5:86744492–86745592 | None | ||
| chr17:30330654–30331754 | None | ||||
| GGGGAGTA | AGGAAACA | chr16:13920523–13921623 | None | ||
| chr8:101231014–101232114 | None | ||||
| CTGGGAGC | GTTCATTC | chr7:73306574–73307674 | None | ||
| chr12:52868924–52870024 | None | ||||
| CCTTCCGA | CTGGGAGC | chr5:68890824–68891924 | None | ||
| chr7:73306574–73307674 | None | ||||
| TGGGCGGA | ACCCGCCT | chr6:30982955–30984055 | None | ||
| chr9:99499360–99500460 | None | ||||
| TTTCTCCA | CGGAAACC | chr8:55097461–55098561 | None | ||
| chr11:118471287–118472387 | None | ||||
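The three conservation categories defined in the caption reduce to counting how many words of a pair were found conserved at a location. A minimal sketch follows. The function name and input format are assumptions, and the PhastCons lookup itself is not shown.

```python
def classify_conservation(pair: tuple[str, str], conserved_hits: set[str]) -> str:
    """Return 'None', 'Partial' or 'Complete' depending on how many words
    of the pair appear among the conserved hits at one location."""
    words = set(pair)
    n_conserved = len(words & conserved_hits)
    if n_conserved == 0:
        return "None"
    if n_conserved == len(words):
        return "Complete"
    return "Partial"


# Rows taken from panel (a) of the table above.
print(classify_conservation(("GGCAGGGA", "GGGCCAGG"), {"GGCAGGGA", "GGGCCAGG"}))
print(classify_conservation(("TCTGAGGA", "TCGCGCCA"), {"TCGCGCCA"}))
print(classify_conservation(("GCCCAGCC", "CGGAGCGC"), set()))
```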