Matthieu Defrance, Hélène Touzet.
Abstract
BACKGROUND: Identifying cis-regulatory elements is crucial to understanding gene expression, which highlights the importance of the computational detection of overrepresented transcription factor binding sites (TFBSs) in coexpressed or coregulated genes. However, this is a challenging problem, especially when considering higher eukaryotic organisms.
Year: 2006 PMID: 16945132 PMCID: PMC1570149 DOI: 10.1186/1471-2105-7-396
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Score profile and window extraction. Example of the score used to predict windows with a significant overrepresentation of TFBSs. Panel (a) shows the predicted TFBSs (black boxes) along the upstream sequences of five genes that come from two species. Panel (b) shows the evolution of the cumulative score computed for a given PWM with those sequences. Local overrepresentations detected by the algorithm are represented by boxes.
Poisson chi-square goodness-of-fit test for the hit-count distribution. Percentage of PWMs for which the hit-count distribution (i.e., the number of putative TFBSs in a given sequence) is well modeled by a Poisson distribution according to the chi-square goodness-of-fit test, for three values of significance. Proximal upstream sequences from 1000 randomly selected human genes were used to compute the data listed.
| 72% | 80% | 87% |
| 68% | 74% | 83% |
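
As an illustration of the kind of test named in the caption above, here is a minimal sketch of a chi-square goodness-of-fit check of hit counts against a Poisson distribution. It is not the authors' implementation: the function names `poisson_gof` and `chi2_sf_approx`, the pooling of bins whose expected count is below 5, and the use of the Wilson-Hilferty normal approximation for the chi-square tail are all assumptions made for this example.

```python
import math
from collections import Counter


def chi2_sf_approx(x, dof):
    """Approximate chi-square upper tail (Wilson-Hilferty normal approximation)."""
    if x <= 0:
        return 1.0
    c = 2.0 / (9.0 * dof)
    z = ((x / dof) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def poisson_gof(hit_counts, min_expected=5.0):
    """Return (statistic, degrees of freedom, approximate P-value).

    hit_counts holds one integer per sequence: the number of putative
    TFBSs found in that sequence (hypothetical input for this sketch).
    """
    n = len(hit_counts)
    lam = sum(hit_counts) / n  # Poisson rate estimated from the data
    if lam == 0:
        raise ValueError("all hit counts are zero")
    observed = Counter(hit_counts)
    kmax = max(hit_counts)
    # One bin per count 0..kmax-1, plus a tail bin for counts >= kmax.
    obs = [observed.get(k, 0) for k in range(kmax)]
    exp = [n * math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
           for k in range(kmax)]
    obs.append(n - sum(obs))
    exp.append(n - sum(exp))
    # Pool adjacent bins until each pooled bin has enough expected mass.
    merged = []
    o_acc = e_acc = 0.0
    for o, e in zip(obs, exp):
        o_acc += o
        e_acc += e
        if e_acc >= min_expected:
            merged.append((o_acc, e_acc))
            o_acc = e_acc = 0.0
    if o_acc or e_acc:  # leftover mass joins the last pooled bin
        if merged:
            last_o, last_e = merged[-1]
            merged[-1] = (last_o + o_acc, last_e + e_acc)
        else:
            merged.append((o_acc, e_acc))
    # One degree of freedom lost for the total, one for the estimated rate.
    dof = len(merged) - 2
    if dof < 1:
        raise ValueError("too few bins for a goodness-of-fit test")
    stat = sum((o - e) ** 2 / e for o, e in merged)
    return stat, dof, chi2_sf_approx(stat, dof)
```

Under these assumptions, a PWM would count as well modeled by a Poisson distribution at a given significance level when the returned P-value exceeds that level. An exact chi-square tail function could replace the approximation where one is available.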
Results for skeletal-muscle-specific human genes. Most significant TFBSs detected in the muscle data set by TFM-Explorer, TOUCAN, OTFBS, and oPOSSUM using sequences that were 2 kb upstream. The TRANSFAC vertebrate matrix collection was used with TOUCAN and OTFBS, and JASPAR vertebrate matrices were used with TFM-Explorer and oPOSSUM. TFs with experimentally verified sites in the data set are marked with *.
| Rank | | PWM | Window | P-value |
| 1 | * | SRF | [-0224: -0091] | 7.869e-06 |
| 2 | * | MEF2 | [-0060: -0030] | 2.350e-05 |
| 3 | * | MZF_1-4 | [-1431: -0576] | 1.678e-04 |
| 4 | | Staf | [-1950: -1311] | 2.539e-04 |
| 5 | | Irf-2 | [-1892: -1592] | 3.002e-04 |
| 6 | | NRF-2 | [-0779: -0324] | 3.180e-04 |
| 7 | | Brachyury | [-0307: -0048] | 4.503e-04 |
| 8 | | Bsap | [-1911: -0978] | 4.800e-04 |
| 9 | | cEBP | [-1733: -1679] | 6.032e-04 |
| 10 | * | MZF_5-13 | [-1633: -1078] | 7.141e-04 |
| Rank | | PWM | P-value |
| 1 | | HEN1_01 | 8.567e-02 |
| 2 | * | MEF2_02 | 1.021e-01 |
| 3 | | RSRFC4_01 | 1.129e-01 |
| 4 | | TAL1BETAITF2_01 | 1.311e-01 |
| 5 | | STAT5A_01 | 1.856e-01 |
| 6 | | TAL1BETAE47_01 | 2.322e-01 |
| 7 | | YY1_01 | 2.391e-01 |
| 8 | | STAT5B_01 | 2.534e-01 |
| 9 | * | MEF2_03 | 3.056e-01 |
| 10 | | CDC5_01 | 3.134e-01 |
| Rank | | PWM | P-value |
| 1 | | YY1_02 | 2.047e-06 |
| 2 | * | MZF1_02 | 2.763e-06 |
| 3 | * | MEF2_02 | 9.493e-06 |
| Rank | | PWM | P-value |
| 1 | * | MEF2 | 1.768e-04 |
| 2 | | Hen-1 | 3.730e-04 |
| 3 | | SRY | 1.531e-03 |
| 4 | | c-MYB_1 | 1.780e-03 |
| 5 | | S8 | 2.983e-03 |
| 6 | | HFH-3 | 2.994e-03 |
| 7 | * | SP1 | 3.220e-03 |
| 8 | * | MZF_5-13 | 3.675e-03 |
| 9 | | Nkx | 6.399e-03 |
| 10 | | RORalfa-2 | 7.747e-03 |
Results for Rel/NF-κB target genes. Most significant TFBSs detected in the Rel/NF-κB target gene set by TFM-Explorer, TOUCAN, OTFBS, and oPOSSUM. The data set comprised 99 human Rel/NF-κB target genes that have experimentally verified binding sites. Both 2-kb-upstream sequences and 5-kb-downstream/5-kb-upstream sequences were used. The TRANSFAC vertebrate matrix collection was used with TFM-Explorer, TOUCAN, and OTFBS, and JASPAR vertebrate matrices were used with oPOSSUM. TFs of the Rel/NF-κB family are marked with *. Other TFBSs (such as the TATA box) are also likely to be biologically valid.
| Rank | | PWM | Window | P-value |
| 1 | * | NFKAPPAB65_01 | [-0520: +0115] | 8.875e-27 |
| 2 | * | NFKAPPAB_01 | [-0698: +0116] | 1.026e-20 |
| 3 | * | NFKB_C | [-0522: -0020] | 9.148e-19 |
| 4 | | TATA_01 | [-0056: -0010] | 5.585e-18 |
| 5 | * | NFKB_Q6 | [-0537: +0092] | 2.241e-16 |
| 6 | | TATA_C | [-0055: -0015] | 4.128e-16 |
| 7 | * | CREL_01 | [-0501: -0020] | 3.510e-15 |
| 8 | | CDXA_01 | [-0071: -0018] | 4.262e-15 |
| 9 | * | NFKAPPAB50_01 | [-0521: +0012] | 8.601e-13 |
| 10 | | SP1_Q6 | [-0094: -0043] | 1.451e-11 |
| Rank | | PWM | Window | P-value |
| 1 | * | NFKAPPAB65_01 | [-0520: -0019] | 7.706e-27 |
| 2 | * | NFKAPPAB_01 | [-0698: -0019] | 9.418e-20 |
| 3 | | TATA_01 | [-0056: -0023] | 1.118e-19 |
| 4 | * | NFKB_C | [-0522: -0020] | 9.148e-19 |
| 5 | | TATA_C | [-0055: -0015] | 4.128e-16 |
| 6 | * | CREL_01 | [-0501: -0020] | 3.510e-15 |
| 7 | | CDXA_01 | [-0071: -0018] | 4.262e-15 |
| 8 | * | NFKB_Q6 | [-0537: -0021] | 3.574e-14 |
| 9 | * | NFKAPPAB50_01 | [-0521: -0019] | 1.066e-12 |
| 10 | | SP1_Q6 | [-0094: -0043] | 1.451e-11 |
| Rank | | PWM | P-value |
| 1 | | HFH3_01 | 0.0 |
| 2 | | BRACH_01 | 4.667e-01 |
| 3 | | RORA2_01 | 8.596e-01 |
| 4 | | NRSF_01 | 9.956e-01 |
| 5 | | E47_01 | 1.0 |
| 6 | | VMYB_01 | 1.0 |
| 7 | | AP4_01 | 1.0 |
| 8 | | MEF2_01 | 1.0 |
| 9 | | ELK1_01 | 1.0 |
| 10 | | EVI1_06 | 1.0 |
| Rank | | PWM | P-value |
| 1 | * | NFKAPPAB65_01 | 1.381e-05 |
| 2 | * | NFKB_C | 6.975e-05 |
| 3 | * | NFKAPPAB_01 | 3.139e-04 |
| 4 | | ARP1_01 | 1.257e-03 |
| 5 | | SREBP1_01 | 6.795e-03 |
| 6 | * | NFKB_Q6 | 4.683e-02 |
| 7 | * | NFKAPPAB50_01 | 8.661e-02 |
| 8 | | RORA2_01 | 9.847e-02 |
| 9 | | E47_02 | 1.628e-01 |
| 10 | | HEN1_01 | 2.882e-01 |
| Rank | | PWM | P-value |
| 1 | | FOXJ2_01 | 6.097e-49 |
| 2 | | FOXD3_01 | 4.229e-45 |
| 3 | | HFH3_01 | 5.356e-41 |
| 4 | | HNF3B_01 | 7.352e-35 |
| 5 | | IK2_01 | 3.031e-20 |
| 6 | | SREBP1_01 | 1.969e-19 |
| 7 | * | NFKAPPAB65_01 | 3.708e-19 |
| 8 | * | NFKB_C | 8.819e-19 |
| 9 | * | CREL_01 | 2.571e-18 |
| 10 | | CHOP_01 | 1.004e-17 |
| Rank | | PWM | P-value |
| 1 | * | p65 | 1.941e-08 |
| 2 | * | NF-kappaB | 1.579e-05 |
| 3 | * | c-REL | 7.877e-05 |
| 4 | * | p50 | 1.510e-04 |
| 5 | | c-FOS | 6.236e-04 |
| 6 | | Irf-1 | 3.301e-03 |
| 7 | | MZF_5-13 | 5.543e-03 |
| 8 | | MZF_1-4 | 7.967e-03 |
| 9 | | NRF-2 | 2.933e-02 |
| 10 | | SPI-B | 3.239e-02 |
| Rank | | PWM | P-value |
| 1 | * | p65 | 1.333e-14 |
| 2 | * | NF-kappaB | 3.234e-11 |
| 3 | * | c-REL | 4.835e-09 |
| 4 | * | p50 | 3.272e-07 |
| 5 | | SPI-B | 5.137e-05 |
| 6 | | c-FOS | 1.519e-04 |
| 7 | | Elk-1 | 2.329e-04 |
| 8 | | deltaEF1 | 2.877e-04 |
| 9 | | MZF_1-4 | 3.731e-04 |
| 10 | | Irf-1 | 6.815e-04 |
Results for the H3 gene set. Most significant TFBSs detected in the H3 data set by TFM-Explorer, TOUCAN, OTFBS, and oPOSSUM using sequences that were 2 kb upstream. The TRANSFAC vertebrate matrices were used with TOUCAN, OTFBS, and TFM-Explorer. oPOSSUM was unable to produce results from this data set. TFs with experimentally verified sites in the set are marked with *.
| Rank | | PWM | Window | P-value |
| 1 | * | NFY_C | [-1375: -0039] | 4.757e-24 |
| 2 | * | OCT1_04 | [-0588: -0022] | 1.537e-20 |
| 3 | * | NFY_Q6 | [-1318: -0039] | 4.026e-16 |
| 4 | * | OCT1_07 | [-0574: -0025] | 7.932e-14 |
| 5 | | XFD1_01 | [-0890: -0025] | 2.253e-13 |
| 6 | | PBX1_02 | [-0491: -0040] | 2.737e-13 |
| 7 | | SRY_02 | [-0895: -0015] | 1.803e-12 |
| 8 | | MEF2_04 | [-0482: -0038] | 1.826e-12 |
| 9 | | HNF1_01 | [-0642: -0097] | 7.089e-12 |
| 10 | | EVI1_04 | [-0417: -0040] | 9.277e-12 |
| Rank | | PWM | P-value |
| 1 | * | NFY_01 | 1.364e-08 |
| 2 | * | OCT1_01 | 1.854e-05 |
| 3 | | GFI1_01 | 4.506e-05 |
| 4 | | TATA_01 | 1.315e-03 |
| 5 | * | CAAT_01 | 1.781e-03 |
| 6 | * | OCT_C | 1.018e-02 |
| 7 | | MEF2_02 | 1.041e-02 |
| 8 | | MEF2_03 | 1.041e-02 |
| 9 | | NFY_C | 1.633e-02 |
| 10 | | CART1_01 | 2.569e-02 |
| Rank | | PWM | P-value |
| 1 | | IRF1_01 | 5.099e-26 |
| 2 | | HFH3_01 | 6.865e-22 |
| 3 | | FOXJ2_01 | 1.606e-21 |
| 4 | | MEF2_01 | 6.896e-20 |
| 5 | | HNF3B_01 | 1.165e-18 |
| 6 | | MEF2_04 | 1.243e-18 |
| 7 | | FOXD3_01 | 3.698e-18 |
| 8 | | MEF2_02 | 2.964e-15 |
| 9 | | XFD1_01 | 8.016e-15 |
| 10 | * | NFY_C | 6.396e-14 |
Figure 2. Influence of noise on the positive predictive value. Starting from the Rel/NF-κB and muscle data sets, an increasing number of actual sequences were replaced by random sequences. The noise level represents the proportion of sequences for the given set that have been randomly selected in the genome. The positive predictive value corresponds to the proportion of valid predictions (the most significant extracted TF is known to be involved in the regulation of the reference set).
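
The noise experiment described in this caption can be sketched as follows. This is only an illustration under stated assumptions: `add_noise`, `positive_predictive_value`, and the inputs they take (lists of sequence identifiers, a pool of background sequences, one top-ranked TF name per trial) are hypothetical names introduced here, and the step that actually runs a TFBS detection tool on each noisy set is left out.

```python
import random


def add_noise(sequences, background, noise_level, rng):
    """Replace a proportion `noise_level` of the sequences with random picks
    from `background` (standing in for sequences drawn from the genome)."""
    n_replace = round(noise_level * len(sequences))
    noisy = list(sequences)
    for i in rng.sample(range(len(noisy)), n_replace):
        noisy[i] = rng.choice(background)
    return noisy


def positive_predictive_value(top_predictions, known_tfs):
    """Proportion of trials whose most significant TF is a known regulator.

    top_predictions: the top-ranked TF name from each trial.
    known_tfs: set of TFs known to regulate the reference set.
    """
    valid = sum(1 for tf in top_predictions if tf in known_tfs)
    return valid / len(top_predictions)
```

For example, if the top-ranked TF over four noisy trials were `p65`, `YY1`, `c-REL`, and `p50`, and the known regulators were `p65`, `c-REL`, and `p50`, the positive predictive value would be 3/4.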
Figure 3. Effect of P-value cutoff on the false positive error rate. Sets of various sizes (5, 50, and 100 sequences) were used to evaluate the false positive rate. The suggested P-value cutoffs for a fixed false positive rate of 10% are 10⁻⁶ and 10⁻⁸ for 5 and 100 sequences, respectively.
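
One way to relate a P-value cutoff to a false positive rate, as in this caption, is to look at the best P-value obtained on each of many random sequence sets. The sketch below assumes such a list is already available; `false_positive_rate` and `cutoff_for_rate` are hypothetical helper names, and the empirical procedure shown is an assumption for illustration, not necessarily the one used to produce the figure.

```python
def false_positive_rate(null_best_pvalues, cutoff):
    """Fraction of random sets whose best P-value passes the cutoff."""
    hits = sum(1 for p in null_best_pvalues if p <= cutoff)
    return hits / len(null_best_pvalues)


def cutoff_for_rate(null_best_pvalues, target_rate):
    """Largest observed P-value usable as a cutoff without exceeding
    `target_rate` false positives; None if no observed value qualifies."""
    ordered = sorted(null_best_pvalues)
    n = len(ordered)
    k = int(target_rate * n + 1e-12)  # number of random sets allowed to pass
    # Step back over ties so the cutoff does not admit extra sets.
    while 0 < k < n and ordered[k] == ordered[k - 1]:
        k -= 1
    return ordered[k - 1] if k > 0 else None
```

With ten random sets whose best P-values are 1e-9, 1e-7, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.2, 0.5, and 0.9, a cutoff of 1e-6 lets two sets through (a rate of 0.2), and the largest cutoff that keeps the rate at 10% is 1e-9.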