David M Budden, Daniel G Hurley, Joseph Cursons, John F Markham, Melissa J Davis, Edmund J Crampin.
Abstract
BACKGROUND: Transcription factors (TFs) and histone modifications (HMs) play critical roles in gene expression by regulating mRNA transcription. Modelling frameworks have been developed to integrate high-throughput omics data, with the aim of elucidating the regulatory logic that results from the interactions of DNA, TFs and HMs. These models have yielded an unexpected and poorly understood result: that TFs and HMs are statistically redundant in explaining mRNA transcript abundance at a genome-wide level.
Keywords: Gene expression; Histone modifications; Predictive modelling; Transcription factors; Transcriptional regulation
Year: 2014 PMID: 25489339 PMCID: PMC4258808 DOI: 10.1186/1756-8935-7-36
Source DB: PubMed Journal: Epigenetics Chromatin ISSN: 1756-8935 Impact factor: 4.954
Prediction accuracy of predictive models of mRNA transcript abundance
| Model | TF | HM+DNase | TF+HM+DNase |
|---|---|---|---|
| | | | |
| Log-linear regression | 0.58 (0.01) | 0.62 (0.01) | 0.68 (0.01) |
| Support vector regression | 0.64 (0.02) | 0.67 (0.01) | 0.70 (0.01) |
| | | | |
| Log-linear regression | 0.33 (0.01) | 0.42 (0.01) | 0.43 (0.01) |
| Support vector regression | 0.39 (0.01) | 0.45 (0.01) | 0.46 (0.01) |
Three sets of ChIP-seq input data were considered: TF binding (TF), HM and DNase I hypersensitivity (HM+DNase) and the concatenation of both (TF+HM+DNase). Prediction accuracy is the tenfold cross-validation adjusted R², reported as the mean and standard deviation (in parentheses) over the ten folds.
ChIP, chromatin immunoprecipitation; HM, histone modification; mESC, mouse embryonic stem cell; seq, sequencing; TF, transcription factor.
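The accuracy measure in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the standard definition of adjusted R², 1 − (1 − R²)(n − 1)/(n − p − 1) for n genes and p predictors, and it assumes the reported spread is the sample standard deviation. The function names `adjusted_r2` and `summarise_folds` are hypothetical.

```python
from statistics import mean, stdev


def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2 for predictions y_pred of observations y_true.

    Assumes the standard definition; the source does not spell out
    the exact formula used.
    """
    n = len(y_true)
    y_bar = mean(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Penalise R^2 for the number of predictors in the model.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def summarise_folds(fold_scores):
    """Mean and sample standard deviation of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)


if __name__ == "__main__":
    # Toy numbers only; these are not data from the study.
    observed = [1, 2, 3, 4, 5]
    predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
    print(adjusted_r2(observed, predicted, n_predictors=1))
    print(summarise_folds([0.57, 0.58, 0.59, 0.58, 0.58]))
```

In a tenfold setting, `adjusted_r2` would be evaluated once per held-out fold and the ten values passed to `summarise_folds`.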
(embryonic stem cell) data
| Data type | Data source | Notes |
|---|---|---|
| RNA-seq | [ | 18,936 genes RPKM-normalised [ |
| TSS | Ensembl mm8/NCBIM36.46 [ | Consider only most 5′-located TSS for each gene |
| TF ChIP-seq | [ | E2f1, Esrrb, Klf4, c-Myc, n-Myc, Nanog, Oct4, Smad1, Sox2, Stat3, Tcfcp2l1 and Zfx |
| HM ChIP-seq | [ | H3K4me1, H3K4me2, H3K4me3, H3K9me3, H4K20me3, H3K27me3 and H3K36me3 |
| DNase-seq | [ | DNase I hypersensitivity |
| Gene ontology | [ | GOC validation date: 15 November 2013; structure from GO.db R package |
| Housekeeping annotations | [ | 3,689 orthologs inferred from the MGI human-mouse homology |
Genes corresponding to haplotype variants, unmapped contig regions and low-confidence RNA-seq mappings were removed, resulting in a set of 17,517 genes for analysis. Pre-processed RNA-seq and ChIP-seq data and mapped DNase I hypersensitivity in mESCs are available online [15, 32].
ChIP, chromatin immunoprecipitation; GOC, Gene Ontology Consortium; HM, histone modification; MGI, Mouse Genome Informatics; seq, sequencing; TF, transcription factor; TSS, transcription start site; RPKM, reads per kilobase per million.
Figure 1. Statistical redundancy within TFs and HMs in predicting genome-wide mRNA transcript abundance. (a,b) mESCs and (c,d) GM12878 cells. Adjusted R² distributions of the log-linear regression models for all combinations of n TFs (a,c) and m HMs and DNase (b,d). The minimum and maximum prediction accuracies for each n and m are connected by the blue and red curves, respectively. Although models constructed from more regulatory elements generally yielded improved prediction accuracy, the rapidly diminishing improvement when adding additional elements to the model suggests significant statistical redundancy within TFs and HMs. It is important to note that statistical redundancy does not necessarily imply functional redundancy. HM, histone modification; TF, transcription factor.
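The enumeration behind the minimum and maximum curves in Figure 1 can be sketched as follows. This is an illustrative outline only: `min_max_by_size` is a hypothetical helper, and the `score` callback stands in for fitting a regression model on a subset of regulatory elements and returning its adjusted R². The toy scoring function and its weights are invented for the example.

```python
from itertools import combinations


def min_max_by_size(elements, score):
    """For each subset size n, return (min, max) of score over all subsets.

    `score` is a placeholder for "fit a model on this subset and
    return its prediction accuracy".
    """
    envelope = {}
    for n in range(1, len(elements) + 1):
        scores = [score(subset) for subset in combinations(elements, n)]
        envelope[n] = (min(scores), max(scores))
    return envelope


if __name__ == "__main__":
    # Invented weights; a real score would come from a fitted model.
    toy_weights = {"A": 0.3, "B": 0.2, "C": 0.1}

    def toy_score(subset):
        return sum(toy_weights[name] for name in subset)

    for size, (lowest, highest) in min_max_by_size(list(toy_weights), toy_score).items():
        print(size, lowest, highest)
```

With the 12 mESC TFs listed in the data table, an exhaustive enumeration of this kind covers 2^12 − 1 = 4,095 non-empty subsets.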
Figure 2. Predictive power of TF binding and HM+DNase-based models. These models are of mRNA transcript abundance for 1,880 sets of mESC genes grouped by ontology-classified biological processes. Sets of genes exhibiting a significant HM+DNase-to-TF adjusted R² ratio (i.e., for which HMs are more predictive of transcript abundance) are indicated in red, and those exhibiting a significant TF-to-HM+DNase adjusted R² ratio (i.e., for which TF binding is more predictive) are indicated in blue. The overlap between the significant (Benjamini–Hochberg-corrected P < 0.05 [43]) and non-significant (grey) regions is due to the ratio significance threshold varying with the number of genes belonging to each group. HM, histone modification; TF, transcription factor; TFAS, transcription factor association strength.
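The Benjamini–Hochberg correction mentioned in the caption can be illustrated with a short sketch. This is the textbook step-up procedure for adjusted P values, written under the assumption that one P value is available per gene set; how those P values were obtained is not shown here, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy P values, not results from the study.
    raw = [0.01, 0.04, 0.03, 0.005]
    corrected = benjamini_hochberg(raw)
    significant = [q < 0.05 for q in corrected]
    print(corrected, significant)
```

A gene set would then be called significant when its adjusted value falls below 0.05, matching the threshold quoted in the caption.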
Figure 3. Proportion of housekeeping genes contributing toward key biological processes. These processes have the top 100 TF-to-HM+DNase (TF) and HM+DNase-to-TF (HM+DNase) adjusted R² ratios for (a) mESCs and (b) GM12878 cells. The proportion of housekeeping genes is significantly larger for the TF group in both cases (Welch’s t-test (a) P < 2.2 × 10⁻¹⁶ and (b) P < 2.6 × 10⁻⁶). This suggests that TF binding provides more information regarding the transcriptional regulatory state of mammalian biological processes enriched for housekeeping genes and conversely that HMs and DNase provide more information for tissue and context-sensitive processes. HM, histone modification; TF, transcription factor; TFAS, transcription factor association strength.
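Welch's t-test, used for the comparison in Figure 3, differs from Student's t-test in that it does not assume equal variances in the two groups. A minimal sketch of the test statistic and the Welch–Satterthwaite degrees of freedom follows; the function name `welch_t` is hypothetical and the input values are toy numbers. Converting the statistic to a P value requires the t-distribution CDF, which the standard library does not provide; SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` is one option.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group squared standard errors (sample variance / group size).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


if __name__ == "__main__":
    # Toy proportions, not data from the study.
    tf_group = [0.42, 0.38, 0.45, 0.40, 0.44]
    hm_group = [0.21, 0.30, 0.18, 0.26, 0.23]
    print(welch_t(tf_group, hm_group))
```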
(GM12878 lymphoblastoid cell line) data
| Data type | Data source | Notes |
|---|---|---|
| RNA-seq | ENCODE [ | 49,488 genes, FPKM-normalised [ |
| TSS | Ensembl hg19/GRCh37 [ | Consider only most 5′-located TSS for each gene |
| TF ChIP-seq | ENCODE [ | c-Fos, Ctcf, Egr1, Nrf1, Nrsf, Pou2f2, Sp1, Srf, Stat3, Usf1 and Yy1 |
| HM ChIP-seq | ENCODE [ | H3K4me1, H3K4me2, H3K4me3, H4K20me1, H3K27me3 and H3K36me3 |
| DNase-seq | ENCODE [ | DNase I hypersensitivity |
| Gene ontology | [ | GOC validation date: 21 March 2014; structure from GO.db R package |
| Housekeeping annotations | [ | 3,804 genes; using RNA-seq data GSE30611 |
Genes corresponding to haplotype variants, unmapped contig regions and low-confidence RNA-seq mappings were removed, resulting in a set of 38,041 genes for analysis.
ChIP, chromatin immunoprecipitation; GOC, Gene Ontology Consortium; HM, histone modification; seq, sequencing; TF, transcription factor; TSS, transcription start site.
Figure 4. Flowchart illustrating the experimental pipeline presented in this study. ChIP/DNase-seq data were used to construct regression models of mRNA transcript abundance for a set of genes. The prediction accuracy of each model was evaluated relative to RNA-sequencing data. By constructing groups of genes categorised by biological process and applying the above methodology, it was possible to identify heterogeneity in the relative predictive power of TFs and HMs. These groups were later analysed for enrichment for housekeeping genes.