Katsuhiko Murakami, Toshio Kojima, Yoshiyuki Sakaki.
Abstract
BACKGROUND: Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computational searches can predict TF binding sites (TFBS) using position weight matrices (PWMs), which represent the positional base frequencies of collected, experimentally determined TFBS. A disadvantage of this approach is the large number of predicted sites it returns for genomic DNA. One strategy for identifying genuine TFBS is to exploit local concentrations of predicted TFBS. It is unclear whether TFBS in general tend to cluster at promoter regions, although this is the case for certain TFBS. It is also unclear which TFs have TFBS concentrated in promoters and to what degree. This study aims to answer some of these questions.
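The PWM search and the "local concentration" idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the 4-position count matrix is a made-up toy (not a TRANSFAC matrix), the uniform background, the pseudocount, the score threshold, and the fixed non-overlapping windows are arbitrary choices, and the study's own scoring and cluster definition may differ.

```python
import math

# Toy count matrix (hypothetical, NOT a TRANSFAC matrix); one dict per motif position.
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform
PSEUDOCOUNT = 0.5  # assumed; avoids log(0) for unseen bases


def log_odds_matrix(counts, background, pseudocount):
    """Convert per-position base counts into log2 odds against the background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def hits_per_window(hits, sequence_length, window_size):
    """Count predicted sites in consecutive, non-overlapping windows."""
    counts = [0] * ((sequence_length + window_size - 1) // window_size)
    for start, _ in hits:
        counts[start // window_size] += 1
    return counts


matrix = log_odds_matrix(COUNTS, BACKGROUND, PSEUDOCOUNT)
sequence = "TTACGTACGTGGGGACGTTT"
hits = scan(sequence, matrix, threshold=4.0)
print(hits)
print(hits_per_window(hits, len(sequence), window_size=10))
```

Windows with many hits are candidate regions of locally concentrated predicted TFBS; a real analysis would also scan the reverse strand and calibrate the threshold per matrix.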
Year: 2004 PMID: 15053842 PMCID: PMC375527 DOI: 10.1186/1471-2164-5-16
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. A histogram of cluster scores for PWMs. Each number on the X-axis indicates the maximum score of the PWMs in the bin.
Table 1. Top 50 PWMs for chromosome 20, sorted by cluster score S in descending order. The columns are: rank number, accession number in TRANSFAC, identifier in TRANSFAC, cluster score (S), and threshold (T).
| Rank | ACCESSION | ID | S | T |
| 1 | M00736 | E2F1DP1_01 | 189.3 | 2.75 |
| 2 | M00332 | WHN_B | 176.0 | 1.90 |
| 3 | M00652 | NRF1_Q6 | 122.0 | 0.93 |
| 4 | M00649 | MAZ_Q6 | 117.2 | 4.35 |
| 5 | M00491 | MAZR_01 | 111.4 | 1.78 |
| 6 | M00739 | E2F4DP2_01 | 103.8 | 0.93 |
| 7 | M00737 | E2F1DP2_01 | 103.6 | 0.94 |
| 8 | M00108 | NRF2_01 | 81.4 | 0.92 |
| 9 | M00665 | SP3_Q3 | 72.1 | 2.39 |
| 10 | M00706 | TFIII_Q6 | 61.4 | 4.23 |
| 11 | M00740 | E2F1DP1RB_01 | 58.4 | 0.90 |
| 12 | M00324 | MINI20_B | 58.2 | 1.61 |
| 13 | M00032 | CETS1P54_01 | 57.3 | 3.70 |
| 14 | M00743 | CETS168_Q6 | 51.1 | 1.75 |
| 15 | M00341 | GABP_B | 48.6 | 0.88 |
| 16 | M00055 | NMYC_01 | 41.1 | 0.90 |
| 17 | M00329 | PAX9_B | 39.2 | 0.73 |
| 18 | M00243 | EGR1_01 | 37.3 | 0.87 |
| 19 | M00072 | CP2_01 | 36.5 | 1.66 |
| 20 | M00054 | NFKAPPAB_01 | 35.5 | 0.85 |
| 21 | M00056 | MYOGNF1_01 | 35.1 | 1.34 |
| 22 | M00694 | E4F1_Q6 | 35.0 | 0.86 |
| 23 | M00738 | E2F4DP1_01 | 34.9 | 0.91 |
| 24 | M00143 | PAX5_01 | 34.7 | 0.84 |
| 25 | M00235 | AHRARNT_01 | 34.6 | 0.92 |
| 26 | M00698 | HEB_Q6 | 33.6 | 0.91 |
| 27 | M00039 | CREB_01 | 33.6 | 1.00 |
| 28 | M00514 | ATF4_Q2 | 33.1 | 1.71 |
| 29 | M00650 | MTF1_Q4 | 31.4 | 0.88 |
| 30 | M00194 | NFKB_Q6 | 30.8 | 0.82 |
| 31 | M00007 | ELK1_01 | 30.0 | 0.85 |
| 32 | M00733 | SMAD4_Q6 | 29.7 | 0.81 |
| 33 | M00261 | OLF1_01 | 28.8 | 0.84 |
| 34 | M00017 | ATF_01 | 26.7 | 0.98 |
| 35 | M00053 | CREL_01 | 25.6 | 0.81 |
| 36 | M00691 | ATF1_Q6 | 25.5 | 0.89 |
| 37 | M00244 | NGFIC_01 | 25.2 | 0.88 |
| 38 | M00041 | CREBP1CJUN_01 | 24.9 | 1.00 |
| 39 | M00086 | IK1_01 | 24.2 | 0.90 |
| 40 | M00287 | NFY_01 | 24.0 | 1.95 |
| 41 | M00466 | HIF1_Q5 | 22.7 | 0.90 |
| 42 | M00634 | GCM_Q2 | 22.6 | 0.84 |
| 43 | M00273 | R_01 | 21.8 | 0.85 |
| 44 | M00373 | PAX4_01 | 21.7 | 2.57 |
| 45 | M00097 | PAX6_01 | 21.5 | 1.15 |
| 46 | M00134 | HNF4_01 | 21.1 | 0.64 |
| 47 | M00670 | TCF1P_Q6 | 21.1 | 0.80 |
| 48 | M00057 | COMP1_01 | 21.1 | 0.59 |
| 49 | M00035 | VMAF_01 | 21.0 | 1.32 |
| 50 | M00222 | HAND1E47_01 | 20.3 | 0.81 |
Figure 2. Sequence logos. (a) Top three PWMs from Table 1, (b) representative PWMs from Table 2, (c) representative PWMs from Table 3.
Table 2. CpG island (CGI)-related PWMs in descending order of r(=Y). The columns are: rank number, accession number, identifier in TRANSFAC, r(=X), r(=Y), cluster score (S), and threshold (T).
| Rank | ACCESSION | ID | X | Y | S | T |
| 1 | M00332 | WHN_B | 0.09 | 0.43 | 158.4 | 1.9 |
| 2 | M00736 | E2F1DP1_01 | 0.06 | 0.39 | 151.8 | 2.6 |
| 3 | M00739 | E2F4DP2_01 | 0.09 | 0.29 | 91.4 | 0.9 |
| 4 | M00737 | E2F1DP2_01 | 0.06 | 0.27 | 81.9 | 0.9 |
| 5 | M00108 | NRF2_01 | 0.09 | 0.25 | 72.6 | 0.9 |
| 6 | M00055 | NMYC_01 | 0.05 | 0.25 | 34.6 | 0.9 |
| 7 | M00235 | AHRARNT_01 | 0.02 | 0.23 | 26.8 | 0.9 |
| 8 | M00740 | E2F1DP1RB_01 | 0.04 | 0.23 | 48.1 | 0.9 |
| 9 | M00652 | NRF1_Q6 | 0.05 | 0.22 | 105.3 | 0.9 |
| 10 | M00466 | HIF1_Q5 | 0.01 | 0.22 | 19.7 | 0.9 |
| 11 | M00341 | GABP_B | 0.1 | 0.19 | 46.6 | 0.9 |
| 12 | M00738 | E2F4DP1_01 | 0.02 | 0.19 | 28.6 | 0.9 |
| 13 | M00538 | HTF_01 | 0 | 0.16 | 9.7 | 0.8 |
| 14 | M00694 | E4F1_Q6 | 0.03 | 0.16 | 23.6 | 0.9 |
| 15 | M00743 | CETS168_Q6 | 0.13 | 0.14 | 47.1 | 1 |
| 16 | M00650 | MTF1_Q4 | 0.04 | 0.14 | 22.6 | 0.9 |
| 17 | M00243 | EGR1_01 | 0.07 | 0.12 | 32.4 | 0.9 |
| 18 | M00251 | XBP1_01 | 0.01 | 0.12 | 7.8 | 0.9 |
| 19 | M00691 | ATF1_Q6 | 0.07 | 0.12 | 17.3 | 0.9 |
| 20 | M00236 | ARNT_01 | 0.02 | 0.11 | 6.5 | 1 |
| 21 | M00143 | PAX5_01 | 0.09 | 0.11 | 25.7 | 0.8 |
| 22 | M00273 | R_01 | 0.06 | 0.11 | 23.8 | 0.8 |
| 23 | M00244 | NGFIC_01 | 0.06 | 0.1 | 23 | 0.9 |
| 24 | M00280 | RFX1_01 | 0.06 | 0.1 | 11.1 | 0.9 |
| 25 | M00121 | USF_01 | 0.03 | 0.1 | 7.6 | 1 |
| 26 | M00287 | NFY_01 | 0.04 | 0.1 | 21.3 | 1.9 |
| 27 | M00039 | CREB_01 | 0.04 | 0.09 | 23.2 | 1 |
| 28 | M00309 | ACAAT_B | 0.04 | 0.09 | 6.8 | 0.9 |
| 29 | M00651 | NFMUE1_Q6 | 0.03 | 0.09 | 13 | 1.8 |
| 30 | M00017 | ATF_01 | 0.06 | 0.08 | 19.2 | 1 |
| 31 | M00481 | AR_01 | 0.05 | 0.08 | 7.5 | 0.8 |
| 32 | M00041 | CREBP1CJUN_01 | 0.04 | 0.08 | 20.4 | 1 |
| 33 | M00040 | CREBP1_01 | 0.03 | 0.08 | 4.7 | 0.9 |
| 34 | M00114 | TAXCREB_01 | 0.02 | 0.06 | 7.3 | 0.9 |
| 35 | M00279 | MIF1_01 | 0.02 | 0.06 | 10.9 | 1.8 |
| 36 | M00246 | EGR2_01 | 0.04 | 0.06 | 9.7 | 0.9 |
| 37 | M00085 | ZID_01 | 0.05 | 0.06 | 8 | 0.8 |
Table 3. CGI-independent PWMs in descending order of r(=X). The columns are: rank number, accession number, identifier in TRANSFAC, r(=X), r(=Y), cluster score (S), and threshold (T).
| Rank | ACCESSION | ID | X | Y | S | T |
| 1 | M00491 | MAZR_01 | 0.27 | 0.15 | 117.4 | 1.8 |
| 2 | M00706 | TFIII_Q6 | 0.24 | 0.06 | 52.7 | 3.5 |
| 3 | M00324 | MINI20_B | 0.22 | 0.1 | 53.2 | 0.8 |
| 4 | M00056 | MYOGNF1_01 | 0.22 | 0 | 31.6 | 1.3 |
| 5 | M00649 | MAZ_Q6 | 0.21 | 0.19 | 114.4 | 3.7 |
| 6 | M00665 | SP3_Q3 | 0.2 | 0.14 | 67.7 | 1.7 |
| 7 | M00032 | CETS1P54_01 | 0.19 | 0.1 | 47.7 | 1.8 |
| 8 | M00053 | CREL_01 | 0.19 | 0.04 | 26.9 | 0.8 |
| 9 | M00054 | NFKAPPAB_01 | 0.19 | 0.06 | 33.5 | 0.9 |
| 10 | M00632 | GATA4_Q3 | 0.19 | 0.04 | 25.1 | 0.6 |
| 11 | M00373 | PAX4_01 | 0.19 | 0.05 | 26.1 | 0.6 |
| 12 | M00072 | CP2_01 | 0.19 | 0.08 | 32 | 0.9 |
| 13 | M00733 | SMAD4_Q6 | 0.18 | 0.05 | 26.3 | 0.8 |
| 14 | M00134 | HNF4_01 | 0.18 | 0.06 | 25.7 | 0.6 |
| 15 | M00194 | NFKB_Q6 | 0.18 | 0.02 | 28.5 | 0.8 |
| 16 | M00445 | XVENT1_01 | 0.17 | 0.01 | 19.9 | 0.7 |
| 17 | M00057 | COMP1_01 | 0.17 | 0.05 | 24.1 | 0.5 |
| 18 | M00097 | PAX6_01 | 0.17 | 0.06 | 24.1 | 0.5 |
| 19 | M00104 | CDPCR1_01 | 0.17 | 0.03 | 21.3 | 0.6 |
| 20 | M00222 | HAND1E47_01 | 0.17 | 0.02 | 20.4 | 0.8 |
| 21 | M00626 | EFC_Q6 | 0.17 | 0.05 | 22.6 | 0.6 |
| 22 | M00745 | LEF1_Q6 | 0.16 | -0.02 | 15.9 | 0.8 |
| 23 | M00707 | TFIIA_Q6 | 0.16 | 0.03 | 20.2 | 0.7 |
| 24 | M00086 | IK1_01 | 0.16 | 0.06 | 24.1 | 0.9 |
| 25 | M00329 | PAX9_B | 0.16 | 0.1 | 33.7 | 0.7 |
| 26 | M00478 | CDC5_01 | 0.15 | 0.03 | 19 | 0.6 |
| 27 | M00670 | TCF1P_Q6 | 0.15 | 0.06 | 22.7 | 0.8 |
| 28 | M00257 | RREB1_01 | 0.15 | -0.02 | 15.8 | 0.8 |
| 29 | M00007 | ELK1_01 | 0.15 | 0.08 | 31 | 0.8 |
| 30 | M00698 | HEB_Q6 | 0.15 | 0.08 | 28.7 | 0.9 |
| 31 | M00052 | NFKAPPAB65_01 | 0.14 | -0.05 | 9.4 | 0.9 |
| 32 | M00514 | ATF4_Q2 | 0.14 | 0.05 | 21.8 | 1.7 |
| 33 | M00191 | ER_Q6 | 0.14 | -0.03 | 11 | 0.8 |
| 34 | M00003 | VMYB_01 | 0.14 | 0.05 | 18 | 0.8 |
| 35 | M00261 | OLF1_01 | 0.14 | 0.07 | 24.6 | 0.8 |
| 36 | M00490 | BACH2_01 | 0.13 | -0.03 | 9.3 | 0.7 |
| 37 | M00001 | MYOD_01 | 0.13 | -0.03 | 10.4 | 0.9 |
| 38 | M00634 | GCM_Q2 | 0.12 | 0.05 | 19.8 | 0.8 |
| 39 | M00035 | VMAF_01 | 0.12 | 0.06 | 17.5 | 0.7 |
| 40 | M00340 | ETS2_B | 0.12 | -0.08 | 5 | 0.8 |
| 41 | M00005 | AP4_01 | 0.12 | 0.01 | 14.1 | 0.8 |
| 42 | M00701 | SMAD3_Q6 | 0.11 | 0.03 | 11.4 | 0.8 |
| 43 | M00531 | NERF_Q2 | 0.1 | -0.08 | 4.8 | 0.9 |
| 44 | M00339 | ETS1_B | 0.1 | -0.07 | 5.7 | 0.9 |
| 45 | M00657 | PTF1BETA_Q6 | 0.1 | 0 | 7.5 | 0.9 |
| 46 | M00254 | CAAT_01 | 0.1 | -0.01 | 6.6 | 0.9 |
| 47 | M00118 | MYCMAX_01 | 0.09 | -0.02 | 6.2 | 0.9 |
| 48 | M00693 | E12_Q6 | 0.09 | -0.01 | 6.5 | 0.9 |
| 49 | M00004 | CMYB_01 | 0.08 | 0 | 7.1 | 0.9 |
| 50 | M00238 | BARBIE_01 | 0.08 | 0.02 | 9.4 | 0.9 |
| 51 | M00648 | MAF_Q6 | 0.07 | 0.01 | 5.8 | 0.8 |
| 52 | M00002 | E47_01 | 0.06 | 0.02 | 5.3 | 0.9 |
| 53 | M00262 | STAF_01 | 0.05 | 0 | 9.2 | 0.9 |
| 54 | M00119 | MAX_01 | 0.05 | 0.03 | 4.9 | 1 |
Figure 3. Correlation of cluster scores between (a) chromosomes 20 and 21 and (b) chromosomes 20 and 22. Each dot represents a distinct PWM (defined by the TRANSFAC matrix). The correlation coefficients were (a) 0.91 and (b) 0.93.
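As a reminder of how a correlation coefficient of this kind is computed, here is a minimal sketch of the Pearson coefficient over paired per-PWM scores. It assumes the reported values are Pearson coefficients, which the caption does not state; the function name and the numbers in the usage line are toy values, not the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Toy cluster scores for the same five PWMs on two chromosomes (made-up numbers).
scores_chr_a = [190.0, 120.0, 60.0, 35.0, 20.0]
scores_chr_b = [170.0, 110.0, 70.0, 30.0, 25.0]
print(round(pearson(scores_chr_a, scores_chr_b), 3))
```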
Figure 4. Plot of r(=Y) against r(=X). The top circle is the area where r(=X) is around zero and r(=Y) is high. The right circle is the area where r(=Y) is around zero and r(=X) is high. The two circles were drawn manually. Ideal CGI-related and CGI-independent PWMs would fall in the top and right circles, respectively.
The gene list sorted by DCC score. The genes listed are those in which clusters of TFBS were found on promoters using CGI-related or CGI-independent PWMs and whose tissue specificity has been previously identified. HK denotes housekeeping. Tissue-specific genes can be selected independently of CpG islands (start_p) using the DCC score.
| Rank | ACCESSION | DCC score | Tissue | start_p |
| 1 | NM006272 | 0.43 | brain | 0 |
| 2 | NM007341 | 0.4 | muscle | 0 |
| 3 | NM002592 | 0.37 | brain | 0.86 |
| 4 | NM001819 | 0.27 | brain | 0.68 |
| 5 | NM004414 | 0.23 | kidney | 0.89 |
| 6 | NM002999 | 0.19 | kidney | 0.73 |
| 7 | NM003195 | 0.16 | brain | 0.73 |
| 8 | NM002591 | 0.14 | liver | 0 |
| 9 | NM000454 | 0.11 | HK.liver | 0.87 |
| 10 | NM003312 | 0.1 | liver | 0.72 |
| 11 | NM004339 | 0.09 | brain | 0.9 |
| 12 | NM020708 | 0.08 | brain | 0.64 |
| 13 | NM006870 | 0.05 | HK | 0.7 |
| 14 | NM003277 | 0.04 | lung | 0.74 |
| 15 | NM005194 | 0.04 | brain | 0.86 |
| 16 | NM003610 | 0.01 | brain | 0.76 |
| 17 | NM000355 | -0.03 | kidney | 0 |
| 18 | NM002430 | -0.03 | muscle | 0.75 |
| 19 | NM006767 | -0.03 | brain | 0.74 |
| 20 | NM005137 | -0.03 | muscle | 0.76 |
| 21 | NM003279 | -0.05 | muscle | 0 |
| 22 | NM004535 | -0.05 | brain | 0 |
| 23 | NM007019 | -0.05 | HK | 0.72 |
| 24 | NM013236 | -0.07 | HK | 0.69 |
| 25 | NM004175 | -0.07 | brain | 0.72 |
| 26 | NM001958 | -0.07 | muscle | 0 |
| 27 | NM001338 | -0.13 | vulva | 0.84 |
| 28 | NM002676 | -0.14 | HK | 0.63 |
| 29 | NM003098 | -0.16 | muscle | 0.71 |
| 30 | NM002854 | -0.17 | brain | 0 |
| 31 | NM002305 | -0.23 | HK | 0 |
| 32 | NM005080 | -0.25 | HK | 0.84 |
| 33 | NM001024 | -0.25 | HK | 0.76 |
| 34 | NM021974 | -0.26 | HK | 0.63 |
| 35 | NM014876 | -0.3 | HK | 0.95 |
| 36 | NM001098 | -0.34 | muscle | 0.65 |
| 37 | NM000071 | -0.37 | liver | 0.8 |
| 38 | NM006198 | -0.37 | brain | 0 |
| 39 | NM001675 | -0.39 | HK.muscle | 0.8 |
| 40 | NM005423 | -0.68 | brain | 0 |
Figure 5. Distribution of accumulated score C for promoters and non-promoters for AP2_Q6.
Figure 6. A Venn diagram of three gene sets (DBTSS, old RefSeq, and new RefSeq). Gene sets A to G (bold letters) consist of genes in the regions bounded by the thick lines. D consists of Dn (genes whose 5' end sequences were not extended from the old RefSeq sequences with DBTSS data) and Dex (genes whose 5' end sequences were extended). G likewise consists of Gn (genes whose 5' end sequences were not extended from the old RefSeq sequences with DBTSS data) and Gex (genes whose 5' end sequences were extended). Namely, D = {Dn, Dex} and G = {Gn, Gex}. Genes in chromosomes 20, 21, 22 were denoted by U = {C, E, F, , }. Gene sets C, E, F, and are parts of C, E, F, Gn and Gex, respectively. Some of the set sizes are given in [33]: |{D,G}| = 7889, |{Dex,Gex}| = 2683, and |{Dex,Gex}| / |{D,G}| = 0.34, where |{D,G}| denotes the number of genes in set {D,G}.
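The fraction quoted in the caption can be checked with a line of arithmetic. The two counts are taken from the caption; the variable names are illustrative only.

```python
# Set sizes quoted in the caption (from reference [33] of the article).
genes_in_d_and_g = 7889      # |{D,G}|
genes_extended = 2683        # |{Dex,Gex}|

# Fraction of genes whose 5' end sequences were extended with DBTSS data.
extended_fraction = genes_extended / genes_in_d_and_g
print(f"{extended_fraction:.2f}")
```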
A contingency table when predicted TFBS and a threshold T are given.
| | Sequences where TFBS clusters found | Sequences where TFBS clusters not found | Sum |
| # of promoter | | | |
| # of non-promoter | | | |
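A table of this shape can be filled in as sketched below. This is an illustration under stated assumptions, not the study's procedure: it assumes each sequence carries one accumulated score and that a cluster counts as "found" when that score is at least the threshold T. The function name and the toy scores are hypothetical.

```python
def contingency_table(promoter_scores, non_promoter_scores, threshold):
    """Build the 2x2 table (plus row sums) for a given threshold.

    Assumption: a TFBS cluster is 'found' in a sequence when its
    accumulated score is greater than or equal to the threshold.
    """
    def count(scores):
        found = sum(1 for score in scores if score >= threshold)
        not_found = len(scores) - found
        return {"found": found, "not_found": not_found, "sum": len(scores)}

    return {
        "promoter": count(promoter_scores),
        "non_promoter": count(non_promoter_scores),
    }


# Toy accumulated scores (made-up values, one per sequence).
promoter_scores = [3.2, 0.4, 2.8, 1.9, 0.1]
non_promoter_scores = [0.2, 0.0, 1.1, 0.3, 2.5, 0.6]

table = contingency_table(promoter_scores, non_promoter_scores, threshold=1.5)
for row_name, row in table.items():
    print(row_name, row)
```

Repeating the count for a range of thresholds gives one table per threshold, which is the kind of input a significance score evaluated across thresholds would need.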
Figure 7. Significant score Q of matrix AP2_Q6 for different thresholds.