Tom Whitington, Andrew C Perkins, Timothy L Bailey.
Abstract
In silico prediction of transcription factor binding sites (TFBSs) is central to the task of gene regulatory network elucidation. Genomic DNA sequence information provides a basis for these predictions, due to the sequence specificity of TF-binding events. However, DNA sequence alone is an impoverished source of information for the task of TFBS prediction in eukaryotes, as additional factors, such as chromatin structure, regulate binding events. We show that incorporating high-throughput chromatin modification estimates can greatly improve the accuracy of in silico prediction of in vivo binding for a wide range of TFs in human and mouse. This improvement is superior to the improvement gained by equivalent use of either transcription start site proximity or phylogenetic conservation information. Importantly, predictions made with the use of chromatin structure information are tissue specific. This result supports the biological hypothesis that chromatin modulates TF binding to produce tissue-specific binding profiles in higher eukaryotes, and suggests that the use of chromatin modification information can lead to accurate tissue-specific transcriptional regulatory network elucidation.
Year: 2008 PMID: 18988630 PMCID: PMC2662491 DOI: 10.1093/nar/gkn866
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Filtering thresholds considered
| Filter type | Thresholds considered |
|---|---|
| Distance to nearest mouse KnownGene TSS | ⩽ {10000, 5000, 2000, 1000, 500, 200} base pairs |
| Distance to nearest mouse CAGE TU | ⩽ {10000, 5000, 2000, 1000, 500, 200} base pairs |
| Mouse ES cell H3K4me3 density | ⩾ {1, 2, 4, 8, 16, 32} arbitrary units |
| Mouse MEF cell H3K4me3 density | ⩾ {1, 2, 4, 8, 16, 32} arbitrary units |
| Mouse NP cell H3K4me3 density | ⩾ {1, 2, 4, 8, 16, 32} arbitrary units |
| Mouse masked phastCons score | ⩾ {0.1, 0.2, 0.3, 0.5, 0.7, 1} arbitrary units |
| Distance to nearest human KnownGene TSS | ⩽ {10000, 5000, 2000, 1000, 500, 200} base pairs |
| Human ES cell H3K4me3 density | ⩾ {0, 0.1, 0.2, 0.5, 0.7, 1} arbitrary units |
| Human liver H3K4me3 density | ⩾ {0, 0.1, 0.2, 0.5, 0.7, 1} arbitrary units |
| Human REH cell H3K4me3 density | ⩾ {0, 0.1, 0.2, 0.5, 0.7, 1} arbitrary units |
| Human T-cell H3K4me3 density | ⩾ {1, 2, 4, 8, 16, 32} arbitrary units |
Units for each H3K4me3 density threshold are defined in section ‘Histone modification data’. Units for the masked phastCons score are defined in the ‘Phylogenetic conservation data’ section.
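Each row of the table describes a simple threshold filter: a candidate site is kept if its annotation is at most (distance filters) or at least (density and conservation filters) the chosen value. A minimal sketch of sweeping two of these grids follows. The record layout, the field names (`h3k4me3`, `tss_dist`), the helper names and the example sites are illustrative assumptions and are not taken from the paper; only the threshold values come from the table.

```python
from typing import Callable, Dict, List

# Hypothetical record for one candidate binding site; field names are assumptions.
Site = Dict[str, float]


def make_filter(key: str, threshold: float, keep_if_at_least: bool) -> Callable[[Site], bool]:
    """Build a predicate that keeps a site when its annotation passes the threshold."""
    if keep_if_at_least:
        return lambda site: site[key] >= threshold
    return lambda site: site[key] <= threshold


def apply_filter(sites: List[Site], keep: Callable[[Site], bool]) -> List[Site]:
    """Return only the sites accepted by the predicate."""
    return [s for s in sites if keep(s)]


# Threshold grids copied from the table (mouse ES cell H3K4me3, mouse KnownGene TSS distance).
H3K4ME3_THRESHOLDS = [1, 2, 4, 8, 16, 32]
TSS_DISTANCE_THRESHOLDS = [10000, 5000, 2000, 1000, 500, 200]

# Made-up candidate sites, for illustration only.
sites = [
    {"pwm_score": 9.1, "h3k4me3": 12.0, "tss_dist": 350.0},
    {"pwm_score": 8.7, "h3k4me3": 0.5, "tss_dist": 42000.0},
    {"pwm_score": 8.2, "h3k4me3": 3.0, "tss_dist": 8000.0},
    {"pwm_score": 7.9, "h3k4me3": 40.0, "tss_dist": 150.0},
]

for t in H3K4ME3_THRESHOLDS:
    kept = apply_filter(sites, make_filter("h3k4me3", t, keep_if_at_least=True))
    print(f"H3K4me3 >= {t}: {len(kept)} sites kept")

for t in TSS_DISTANCE_THRESHOLDS:
    kept = apply_filter(sites, make_filter("tss_dist", t, keep_if_at_least=False))
    print(f"TSS distance <= {t}: {len(kept)} sites kept")
```

Building the filter as a predicate keeps the "at least" and "at most" cases symmetric, so the same sweep can be reused for every row of the table.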
Figure 1. Improvement in E2F1 TFBS prediction by H3K4me3 signal filtering. ROC-like plot shows the TP rate versus the actual number of FPs. Error bars indicate standard error. The TF gold-standard and H3K4me3 data are each derived from mouse ES cells. This figure also serves to illustrate calculation of the ‘best relative FP improvement’ statistic (I), defined in the Methods section.
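The ROC-like curve described in this caption can be sketched as a sweep over score-ranked predictions that records the number of FPs and the TP rate at each step. The Methods section that defines the statistic I is not reproduced here, so the formula below is an assumption: the fractional reduction in FPs at a fixed sensitivity, (FP_unfiltered − FP_filtered) / FP_unfiltered. All function names and the toy predictions are hypothetical.

```python
from typing import List, Optional, Tuple

# (PWM score, is the site a gold-standard positive?) -- hypothetical representation.
Prediction = Tuple[float, bool]


def roc_like_curve(preds: List[Prediction], n_positives: int) -> List[Tuple[int, float]]:
    """Return (number of FPs, TP rate) after each prediction, best score first."""
    curve = []
    tp = fp = 0
    for _score, is_positive in sorted(preds, key=lambda p: p[0], reverse=True):
        if is_positive:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp / n_positives))
    return curve


def fps_at_sensitivity(curve: List[Tuple[int, float]], sensitivity: float) -> Optional[int]:
    """Fewest FPs at which the TP rate reaches the sensitivity, or None if never reached."""
    for fp, tpr in curve:
        if tpr >= sensitivity:
            return fp
    return None


def relative_fp_improvement(fp_unfiltered: int, fp_filtered: int) -> float:
    """ASSUMED form of the statistic: fractional reduction in FPs at a fixed sensitivity."""
    if fp_unfiltered <= 0:
        raise ValueError("unfiltered FP count must be positive")
    return (fp_unfiltered - fp_filtered) / fp_unfiltered


# Toy data: 4 gold-standard positives among 8 predictions.
unfiltered = [(9, True), (8, False), (7, False), (6, True),
              (5, False), (4, True), (3, False), (2, True)]
# A filter that removes two FPs (scores 8 and 5) and, as a side effect, one TP (score 2).
filtered = [(9, True), (7, False), (6, True), (4, True), (3, False)]

# The TP rate is always measured against all 4 positives, including any the filter removed.
unfiltered_curve = roc_like_curve(unfiltered, n_positives=4)
filtered_curve = roc_like_curve(filtered, n_positives=4)

fp_u = fps_at_sensitivity(unfiltered_curve, 0.5)
fp_f = fps_at_sensitivity(filtered_curve, 0.5)
print(relative_fp_improvement(fp_u, fp_f))
```

Keeping the denominator fixed at the full positive count means a filter that discards true sites cannot reach high sensitivities, in which case `fps_at_sensitivity` returns `None`.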
Quality of gold-standard datasets
| Gold-standard dataset | q-value | # ADRs retained |
|---|---|---|
| Esrrb | 0.02 | 21 647 |
| CTCF | 0.05 | 39 609 |
| Klf2 | 0.14 | 219 |
| Tcfcp2l1 | 0.15 | 24 398 |
| Oct4 [Loh et al.] | 0.17 | 859 |
| Klf4 [Chen et al.] | 0.17 | 9912 |
| Smad1 | 0.19 | 770 |
| Sox2 | 0.29 | 2498 |
| Stat3 | 0.32 | 1368 |
| Klf5 | 0.36 | 198 |
| Oct4 [Chen et al.] | 0.39 | 2073 |
| Zfx | 0.40 | 256 |
| Klf4 [Jiang et al.] | 0.43 | 225 |
| cMyc | 0.48 | 825 |
| nMyc | 0.60 | 1077 |
| Nanog [Chen et al.] | 0.69 | 5 |
| Nanog [Loh et al.] | 0.91 | 3 |
‘q-value’ reported is the q-value of the author-defined TF-binding region (ADR) that had the worst match to the PWM and yet passed our MAST threshold. ‘# ADRs retained’ is the number of author-defined TF-binding regions that would be retained if a q-value threshold of 0.1 were applied.
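The two columns of this table can be read as two summaries of one list of per-region q-values. The sketch below is a minimal illustration under stated assumptions: each ADR is represented by a hypothetical pair (q-value of its PWM match, whether it passed the MAST threshold), retention is counted over all ADRs, and the comparison is taken as q ≤ 0.1. The source does not say whether the threshold is inclusive. The function name and example values are made up.

```python
from typing import List, Optional, Tuple

# (q-value of the region's PWM match, passed the MAST threshold?) -- assumed representation.
ADR = Tuple[float, bool]


def dataset_quality(adrs: List[ADR], q_threshold: float = 0.1) -> Tuple[Optional[float], int]:
    """Return (worst q-value among MAST-passing ADRs, number of ADRs with q <= q_threshold)."""
    passing_q = [q for q, passed in adrs if passed]
    worst_q = max(passing_q) if passing_q else None
    # Assumption: the threshold is inclusive and applied to every ADR.
    retained = sum(1 for q, _passed in adrs if q <= q_threshold)
    return worst_q, retained


# Made-up q-values, for illustration only.
example_adrs = [(0.01, True), (0.04, True), (0.08, True), (0.17, True), (0.60, False)]
worst_q, retained = dataset_quality(example_adrs)
print(worst_q, retained)
```

Under this reading, a dataset with a high reported q-value and few retained ADRs (such as the Nanog rows) is one whose binding regions mostly match the PWM poorly.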
Figure 2. Comparison of H3K4me3 and TSS proximity filter performance for Klf4 TFBS prediction. ROC-like plot shows the TP rate versus the actual number of FPs. Error bars indicate standard error. The TF gold-standard and H3K4me3 data are each derived from mouse ES cells. A subset of all CAGE thresholds is presented for clarity.
Figure 3. Comparison of H3K4me3 and phastCons filter performance for nMyc TFBS prediction. ROC-like plot shows the TP rate versus the actual number of FPs. Error bars indicate standard error. The TF gold-standard and H3K4me3 datasets are each derived from mouse ES cells. PhastCons filter performance for the other mouse TFs considered is similar to the performance shown here for nMyc, as the optimal phastCons filter never outperforms the optimal H3K4me3 filter for any TF or sensitivity level.
Figure 4. Tissue specificity of cMyc TFBS predictions made with H3K4me3 filter. ROC-like plot shows the TP rate versus the actual number of FPs. Error bars indicate standard error. The TF gold-standard data are derived from mouse ES cells.
Figure 5. Filter performance in mouse ES cells at sensitivity 20%. The best relative FP rate (as defined in the Methods section) of each filter type has been plotted for the 18 mouse gold-standard TFBS datasets. Multiple gold-standard datasets were available for Klf4, Oct4 and Nanog, and the first author of the corresponding gold-standard dataset has been indicated. PhastCons filtering failed to yield a positive relative FP rate improvement for any of the 18 gold-standard datasets at this sensitivity level, and so has been omitted. Error bars indicate standard error. Barplot means and standard errors smaller than −1 have been truncated to −1, to allow clearer visualization of relative FP improvement values between 0 and 1.
Figure 6. Filter performance in mouse ES cells at sensitivity 80%. The best relative FP rate (as defined in the Methods section) of each filter type has been plotted for the TFs cMyc, E2F1, nMyc and Zfx. PhastCons filtering failed to yield a positive relative FP rate improvement for any of the four gold-standard datasets at this sensitivity level, and so has been omitted. Error bars indicate standard error. For a given TF and filter, if the filter cannot attain a sensitivity of 80% due to actual positive elimination, then the bar is omitted from the plot.
Figure 7. Tissue specificity of TFBS predictions in three human tissues. The best relative FP rate (as defined in the Methods section) of each H3K4me3 filter is shown for the 10 human gold-standard TFBS datasets. Each arrow indicates the results for the H3K4me3 filter using data estimated from the same tissue as the given TFBS gold-standard data. For example, the distribution of HNF4A TFBSs was estimated in liver, so the arrow points to the liver results for HNF4A. Error bars indicate standard error. Barplot means and standard errors smaller than −1 have been truncated to −1, to allow clearer visualization of relative FP improvement values between 0 and 1.
Figure 8. Performance of H3K4me3 filtering without optimization of threshold. The relative FP rate has been plotted for an H3K4me3 filter, with a threshold of 1.0 at a sensitivity of 20% (a) and a more stringent threshold of 2.0 at the lower sensitivity of 10% (b). Error bars indicate standard error. Note that the results presented are the relative FP improvement of a filter with a single given threshold, rather than the best relative FP improvement. That is, we have not optimized the filtering threshold used.
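The distinction drawn in this caption, between a single preset threshold and the best threshold, can be illustrated with a small sketch. It assumes that the relative FP improvement has already been computed for each candidate threshold at one sensitivity level, and that `None` marks thresholds at which that sensitivity cannot be attained. The function names and the numbers are illustrative assumptions, not results from the paper.

```python
from typing import Dict, Optional


def best_relative_fp_improvement(by_threshold: Dict[float, Optional[float]]) -> Optional[float]:
    """Optimized case: the largest improvement over all thresholds that attain the sensitivity."""
    attained = [v for v in by_threshold.values() if v is not None]
    return max(attained) if attained else None


def fixed_threshold_improvement(by_threshold: Dict[float, Optional[float]],
                                threshold: float) -> Optional[float]:
    """Non-optimized case: the improvement at one threshold chosen in advance."""
    return by_threshold.get(threshold)


# Made-up improvements per H3K4me3 threshold at one sensitivity level.
improvements = {1.0: 0.40, 2.0: 0.55, 4.0: 0.70, 8.0: None}

print(fixed_threshold_improvement(improvements, 1.0))
print(best_relative_fp_improvement(improvements))
```

The fixed-threshold value can never exceed the optimized one, which is why results without threshold optimization give a more conservative picture of filter performance.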
Figure 9. Overlap between H3K4me3 and TF occupancy in ES cells at the Bmp4 (a) and Otx2 (b) gene loci. The track labelled ‘ES_K4 wig’ indicates the distribution of H3K4me3 in mouse ES cells, as published by Mikkelsen et al. (5). Units of H3K4me3 density are described in the Methods section. UCSC KnownGenes and NIA Genes are shown in the lowest two tracks for each displayed region. CAGE TU locations are indicated, as are binding locations for TFs Nanog, Oct4, Klf2, Klf4 and Klf5 estimated by Jiang et al. (23) and Loh et al. (31). Red boxes indicate regions at which the available H3K4me3 information should be of greater benefit to TFBS prediction, compared with the available TSS location information, due to the large distance between the TFBSs and known TSSs.