J Sunil Rao, Suresh Karanam, Colleen D McCabe, Carlos S Moreno.
Abstract
Background. The computational identification of functional transcription factor binding sites (TFBSs) remains a major challenge of computational biology. Results. We have analyzed the conserved promoter sequences for the complete set of human RefSeq genes using our conserved transcription factor binding site (CONFAC) software. CONFAC identified 16296 human-mouse ortholog gene pairs, and of those pairs, 9107 genes contained conserved TFBS in the 3 kb proximal promoter and first intron. To attempt to predict in vivo occupancy of transcription factor binding sites, we developed a novel marginal effect isolator algorithm that builds upon Bayesian methods for multigroup TFBS filtering and predicted the in vivo occupancy of two transcription factors with an overall accuracy of 84%. Conclusion. Our analyses show that integration of chromatin immunoprecipitation data with conserved TFBS analysis can be used to generate accurate predictions of functional TFBS. They also show that TFBS cooccurrence can be used to predict transcription factor binding to promoters in vivo.
Year: 2008 PMID: 19865592 PMCID: PMC2768302 DOI: 10.1155/2008/369830
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
Figure 1. %GC content of the conserved promoter sequences in each of the seven groups considered. Plotted are the mean and standard deviation of the %GC in each promoter set. Although there is slightly higher GC content in the HNF4-bound groups, no statistically significant GC bias was observed for any of the groups analyzed for patterns of conserved TFBS.
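The per-group quantity plotted in this figure can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy sequences are hypothetical, ambiguous bases such as N are ignored, and the sample standard deviation is used, which is an assumption since the caption does not say which estimator was applied.

```python
from statistics import mean, stdev

def gc_percent(seq: str) -> float:
    """Return %GC computed over the unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / (gc + at)

def summarize_group(seqs):
    """Return (mean, sd) of %GC across the sequences of one promoter group."""
    values = [gc_percent(s) for s in seqs]
    # Sample sd is undefined for a single sequence; report 0.0 in that case.
    sd = stdev(values) if len(values) > 1 else 0.0
    return mean(values), sd

# Toy sequences for illustration only; these are not real promoter data.
toy_groups = {
    "HNF4 both": ["GCGCGCAT", "GGCCATAT"],
    "Unbound": ["ATATGCAT", "AATTGGCC"],
}

for name, seqs in toy_groups.items():
    m, sd = summarize_group(seqs)
    print(f"{name}: mean %GC = {m:.1f}, sd = {sd:.1f}")
```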
Prediction of in vivo occupancy by HNF1 and HNF4. Data from ChIP-chip studies (3) were integrated with CONFAC TFBS data and genes were separated randomly into training and test sets. The BAM MEI classifier was applied to the independent test set of 1349 genes to predict which class each gene belonged to, based on the TFBS patterns that were predictive of occupancy.
| Observed (rows) / Predicted (columns) | HNF1 both | HNF1 hep | HNF1 panc | HNF4 both | HNF4 hep | HNF4 panc | Unbound | Total (obs) | Sensitivity |
|---|---|---|---|---|---|---|---|---|---|
| HNF1 both | 5 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 63% |
| HNF1 hep | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 23 | 0% |
| HNF1 panc | 0 | 0 | 3 | 0 | 0 | 0 | 7 | 10 | 30% |
| HNF4 both | 0 | 0 | 0 | 35 | 0 | 0 | 54 | 89 | 39% |
| HNF4 hep | 0 | 0 | 0 | 0 | 14 | 0 | 70 | 84 | 17% |
| HNF4 panc | 0 | 0 | 0 | 0 | 0 | 41 | 45 | 86 | 48% |
| Unbound | 0 | 0 | 0 | 6 | 5 | 2 | 1036 | 1049 | 99% |
| Total (pred) | 5 | 0 | 3 | 41 | 19 | 43 | 1238 | 1349 | |
| Specificity | 100% | NA | 100% | 85% | 74% | 95% | 84% | ||
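The summary figures in this table can be recomputed from its counts. The sketch below copies the matrix from the table (rows are observed classes, columns are predicted classes); the helper names are hypothetical. Sensitivity is the diagonal count divided by the row total. In this table the row labelled "Specificity" appears to match the diagonal count divided by the column total, so that is what the second list computes, and a column with no predictions yields `None`, corresponding to "NA". Overall accuracy is the sum of the diagonal divided by the number of genes.

```python
classes = ["HNF1 both", "HNF1 hep", "HNF1 panc",
           "HNF4 both", "HNF4 hep", "HNF4 panc", "Unbound"]

# Counts copied from the test-set table: rows = observed, columns = predicted.
counts = [
    [5, 0, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 0, 0, 23],
    [0, 0, 3, 0, 0, 0, 7],
    [0, 0, 0, 35, 0, 0, 54],
    [0, 0, 0, 0, 14, 0, 70],
    [0, 0, 0, 0, 0, 41, 45],
    [0, 0, 0, 6, 5, 2, 1036],
]

def ratio(num, den):
    """Return num/den, or None when the denominator is zero (reported as NA)."""
    return num / den if den else None

def summarize(matrix):
    """Per-class and overall summaries of a square confusion matrix."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    correct = [matrix[i][i] for i in range(n)]
    return {
        "sensitivity": [ratio(c, t) for c, t in zip(correct, row_totals)],
        "column_fraction_correct": [ratio(c, t) for c, t in zip(correct, col_totals)],
        "accuracy": sum(correct) / sum(row_totals),
    }

result = summarize(counts)
for name, sens, col in zip(classes, result["sensitivity"],
                           result["column_fraction_correct"]):
    col_text = "NA" if col is None else f"{col:.0%}"
    print(f"{name}: sensitivity {sens:.0%}, column fraction correct {col_text}")
print(f"Overall accuracy: {result['accuracy']:.0%}")
```

Overall accuracy here is dominated by the large unbound class, which is why it is far higher than the per-class sensitivities of the bound groups.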
Training set self-consistency performance. Data from ChIP-chip studies (3) were integrated with CONFAC TFBS data and genes were separated randomly into training and test sets. The BAM MEI classifier was trained on the training set of 5399 genes and predictions were made on this same set of genes.
| Observed (rows) / Predicted (columns) | HNF1 both | HNF1 hep | HNF1 panc | HNF4 both | HNF4 hep | HNF4 panc | Unbound | Total (obs) | Sensitivity |
|---|---|---|---|---|---|---|---|---|---|
| HNF1 both | 10 | 0 | 0 | 0 | 0 | 0 | 4 | 14 | 71% |
| HNF1 hep | 0 | 4 | 0 | 0 | 0 | 0 | 74 | 78 | 5% |
| HNF1 panc | 0 | 0 | 10 | 0 | 0 | 0 | 14 | 24 | 41% |
| HNF4 both | 0 | 0 | 0 | 130 | 0 | 0 | 142 | 272 | 48% |
| HNF4 hep | 0 | 0 | 0 | 0 | 88 | 0 | 295 | 383 | 23% |
| HNF4 panc | 0 | 0 | 0 | 0 | 0 | 181 | 159 | 340 | 53% |
| Unbound | 0 | 0 | 0 | 6 | 5 | 2 | 4249 | 4262 | 99% |
| Total (pred) | 10 | 4 | 10 | 136 | 93 | 183 | 4937 | 5373 | |
| Specificity | 100% | NA | 100% | 89% | 85% | 96% | 86% | ||
Summary of MEI predictions from 25 splits of training and test sets. “NA” means that the cell could not be calculated for all splits. Otherwise, the means and SDs were calculated from those splits without NAs.
| Group | Mean sensitivity (sd) | Mean specificity (sd) |
|---|---|---|
| HNF1Both | .252 (.152) | 1 (0) |
| HNF1Hep | 0 (0) | NA |
| HNF1Panc | .352 (.102) | 1 (0) |
| HNF4Both | .388 (.083) | .840 (.023) |
| HNF4Hep | .130 (.041) | .486 (.201) |
| HNF4Panc | .430 (.047) | .940 (.070) |
| Unbound | .982 (.008) | .838 (.004) |
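The aggregation described in the caption can be sketched as below. This is an illustration under stated assumptions: the function name and the toy values are hypothetical, a cell is treated as NA only when no split produced a value for it (one reading of the caption), and the sample standard deviation is used because the source does not say which estimator was applied.

```python
from statistics import mean, stdev

def summarize_splits(values):
    """Aggregate one metric across splits, where None marks an NA split.

    Returns None when every split is NA; otherwise returns (mean, sd)
    computed only from the splits that produced a value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return None
    # Sample sd needs at least two values; report 0.0 for a single value.
    sd = stdev(observed) if len(observed) > 1 else 0.0
    return mean(observed), sd

# Toy per-split values for illustration only; these are not the paper's data.
toy_metric = {
    "mostly defined": [0.50, None, 0.25, 0.75],
    "never defined": [None, None, None, None],
}

for name, values in toy_metric.items():
    summary = summarize_splits(values)
    if summary is None:
        print(f"{name}: NA")
    else:
        print(f"{name}: {summary[0]:.3f} ({summary[1]:.3f})")
```

A per-class value is typically undefined in a split when its denominator is zero, for example when no gene is predicted to belong to that class.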
Prediction of in vivo occupancy by HNF1 and HNF4 by BAM MEI analysis using human-mouse genomic alignments instead of local pairwise BLAST alignments restricted to regions of positive regulatory potential and a window size of 25 bp.
| Observed (rows) / Predicted (columns) | HNF1 both | HNF1 hep | HNF1 panc | HNF4 both | HNF4 hep | HNF4 panc | Unbound | Total (obs) | Sensitivity |
|---|---|---|---|---|---|---|---|---|---|
| HNF1 both | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 75% |
| HNF1 hep | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 16 | 0% |
| HNF1 panc | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 4 | 25% |
| HNF4 both | 0 | 0 | 0 | 20 | 0 | 0 | 30 | 50 | 40% |
| HNF4 hep | 0 | 0 | 0 | 0 | 18 | 0 | 71 | 89 | 20% |
| HNF4 panc | 0 | 0 | 0 | 0 | 0 | 8 | 8 | 16 | 50% |
| Unbound | 0 | 0 | 0 | 3 | 3 | 2 | 908 | 916 | 99% |
| Total (pred) | 3 | 0 | 1 | 23 | 21 | 10 | 1037 | 1095 | |
| Specificity | 100% | NA | 100% | 87% | 86% | 80% | 87% | ||
Rules associated with HNF1 and HNF4 binding identified by 10-fold cross-validation of BAMarray analysis.
| TFBS family | Negative association | Positive association |
|---|---|---|
| E2F | HNF1-pancreas | HNF4 binding |
| ETS | HNF1-both | None |
| MAF | HNF1-hepatocytes | None |
| NF | HNF1-any | None |
| Homeobox | HNF4-pancreas | None |
| SOX/TCF | HNF4-both | None |
| Homeobox | HNF4-hepatocytes | None |
| FOX/Homeobox | HNF4-any | None |
Prediction of in vivo occupancy by HNF1 and HNF4 by BAM MEI analysis after removal of the unbound class from the analysis.
| Observed (rows) / Predicted (columns) | HNF1 both | HNF1 hep | HNF1 panc | HNF4 both | HNF4 hep | HNF4 panc | Total (obs) | Sensitivity |
|---|---|---|---|---|---|---|---|---|
| HNF1 both | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 100% |
| HNF1 hep | 0 | 12 | 0 | 0 | 4 | 2 | 18 | 100% |
| HNF1 panc | 0 | 0 | 6 | 1 | 2 | 0 | 9 | 100% |
| HNF4 both | 0 | 0 | 0 | 66 | 3 | 7 | 76 | 93% |
| HNF4 hep | 0 | 0 | 0 | 1 | 74 | 9 | 84 | 85% |
| HNF4 panc | 0 | 0 | 0 | 4 | 6 | 82 | 92 | 84% |
| Total (pred) | 3 | 12 | 6 | 72 | 89 | 100 | 282 | |
| Specificity | 100% | 98% | 99% | 95% | 95% | 95% | ||
Sites overrepresented by oPOSSUM single site analysis.
| Gene set | Significant sites |
|---|---|
| HNF1-Hepatocytes | HNF4, TCF1 |
| HNF1-Pancreas | None |
| HNF1-Both | None |
| HNF4-Hepatocytes | HNF4 |
| HNF4-Pancreas | Staf, GABPA |
| HNF4-Both | Staf, ELK1, SPIB, Bapx1, ELK4 |
Prediction of in vivo occupancy by HNF1 and HNF4 by BAM MEI analysis using human-mouse genomic alignments instead of local pairwise BLAST alignments and a window size of zero.
| Observed (rows) / Predicted (columns) | HNF1 both | HNF1 hep | HNF1 panc | HNF4 both | HNF4 hep | HNF4 panc | Unbound | Total (obs) | Sensitivity |
|---|---|---|---|---|---|---|---|---|---|
| HNF1 both | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 75% |
| HNF1 hep | 0 | 1 | 0 | 0 | 0 | 0 | 18 | 19 | 5% |
| HNF1 panc | 0 | 0 | 2 | 0 | 0 | 0 | 4 | 6 | 33% |
| HNF4 both | 0 | 0 | 0 | 21 | 0 | 0 | 29 | 50 | 42% |
| HNF4 hep | 0 | 0 | 0 | 0 | 16 | 0 | 82 | 98 | 16% |
| HNF4 panc | 0 | 0 | 0 | 0 | 0 | 34 | 30 | 64 | 53% |
| Unbound | 0 | 0 | 0 | 7 | 7 | 1 | 1259 | 1274 | 99% |
| Total (pred) | 3 | 1 | 2 | 28 | 23 | 35 | 1423 | 1492 | |
| Specificity | 100% | 100% | 100% | 75% | 67% | 95% | 88% | ||