Debopriya Das, Zaher Nahlé, Michael Q Zhang.
Abstract
Although the human genome has been sequenced, progress in understanding gene regulation in humans has been particularly slow. Many computational approaches developed for lower eukaryotes to identify cis-regulatory elements and their associated target genes often do not generalize to mammals, largely due to the degenerate and interactive nature of such elements. Motivated by the switch-like behavior of transcriptional responses, we present a systematic approach that allows adaptive determination of active transcriptional subnetworks (cis-motif combinations, the direct target genes and physiological processes regulated by the corresponding transcription factors) from microarray data in mammals, with accuracy similar to that achieved in lower eukaryotes. Our analysis uncovered several new subnetworks active in human liver and in cell-cycle regulation, with functional characteristics similar to those of the known ones. We present biochemical evidence for our predictions, and show that the recently discovered G2/M-specific E2F pathway is wider than previously thought; in particular, E2F directly activates certain mitotic genes involved in hepatocellular carcinomas. Additionally, we demonstrate that this method can predict subnetworks in a condition-specific manner, as well as regulatory crosstalk across multiple tissues. Our approach allows systematic understanding of how phenotypic complexity is regulated at the transcription level in mammals and offers a marked advantage in systems where little or no prior knowledge of transcriptional regulation is available.
Year: 2006 PMID: 16760900 PMCID: PMC1681499 DOI: 10.1038/msb4100067
Source DB: PubMed Journal: Mol Syst Biol ISSN: 1744-4292 Impact factor: 11.429
Figure 1. Modeling mammalian transcription with linear splines. (A) Sigmoidal transcriptional response (Carey, 1998). The response is flat below a binding affinity threshold, namely, the gene activation threshold, and varies exponentially above it. It saturates at high binding energies. (B) Example of a linear spline. A linear spline is a piecewise linear function: it is zero below (above) a threshold, termed the knot (ξ), and changes linearly above (below) it. E_g refers to the observed mRNA level of gene g, whereas E_g^C is its mRNA level in the reference sample. Knots are related to gene activation thresholds. All genes with PWM scores S>ξ are predicted targets of the motif contributing to this spline, shaded in blue. (C) A schematic view of the key steps in identification of significant motifs and motif pairs using linear splines.
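The linear spline of panel (B) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `linear_spline` and `predicted_targets` and the knot value 0.46 are hypothetical, and the PWM scores are borrowed from the HNF-1 target table purely as sample input.

```python
def linear_spline(score, knot, rising=True):
    """Hinge-shaped linear spline basis function.

    rising=True : zero below the knot, linear above it.
    rising=False: zero above the knot, linear below it.
    """
    if rising:
        return max(0.0, score - knot)
    return max(0.0, knot - score)


def predicted_targets(pwm_scores, knot):
    """Return (sorted) genes whose PWM score S exceeds the knot (S > xi)."""
    return sorted(gene for gene, s in pwm_scores.items() if s > knot)


# Sample PWM scores; the knot below is a made-up value for illustration.
scores = {"ALB": 0.4873, "FGB": 0.4849, "AFP": 0.4630, "TXNIP": 0.4523}
knot = 0.46
targets = predicted_targets(scores, knot)
```

With this made-up knot, `targets` holds the three genes scoring above 0.46, and `linear_spline` returns zero for every gene at or below the knot, which is the switch-like behavior the spline is meant to capture.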
Correlation between PWM scores and expression
| Organism | Experiment | Biological process/tissue | Motif | Average information content | %RIV | P-value |
|---|---|---|---|---|---|---|
| Yeast | Microarray | Cell cycle (G2) | Mcm1 | 0.59 | 4.9 | 3.4 × 10^−9 |
| Human | Microarray | Liver | HNF-1 | 0.62 | 2.2 | 1.6 × 10^−6 |
| Human | Microarray | Pancreas | C/EBP beta | 0.76 | 2.5 | 2.5 × 10^−7 |
| Human | Microarray | Cell cycle (G1/S) | E2F | 1.74 | 5.9 | 1.7 × 10^−12 |
| Mouse | Microarray | Liver | HNF-1 | 0.62 | 2.1 | 2.1 × 10^−6 |
| Mouse | Microarray | Liver | HNF-4 | 0.60 | 2.0 | 4.1 × 10^−6 |
| Human | ChIP-chip | Liver/HNF-1 | HNF-1 | 0.62 | 14.7 | 8.7 × 10^−37 |
Comparison of percent reduction in variances (%RIVs) between yeast and mammals is shown. Single linear splines were used to obtain the %RIV. Yeast cell-cycle data were obtained from Spellman et al, liver and pancreas microarray data from Su et al, liver ChIP-chip data for HNF-1 from Odom et al and human cell-cycle data from Whitfield et al. The Mcm1 matrix was obtained from Bussemaker et al; all other matrices were obtained from the TRANSFAC and JASPAR databases. P-values were calculated using an F-test (see Materials and methods). Average information content refers to the information content of the position weight matrix, averaged over all the positions (columns). It is 2.0 for non-degenerate words and 0 if all positions were N's (see Materials and methods).
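The two quantities in this table can be illustrated with a short Python sketch. It rests on assumptions not spelled out here: information content is taken as 2 minus the Shannon entropy of each column (uniform background), which reproduces the stated limits of 2.0 and 0, and %RIV is taken as 100 × (1 − residual sum of squares / total sum of squares) for a least-squares fit of one rising spline at a given knot. The function names are hypothetical, and the F-test is not shown.

```python
import math


def average_information_content(pwm):
    """Mean per-column information content (bits) of a PWM.

    pwm: list of columns, each a list of four base probabilities (A, C, G, T).
    Assumes a uniform background, so each column carries 2 - H bits.
    """
    total = 0.0
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column if p > 0)
        total += 2.0 - entropy
    return total / len(pwm)


def percent_riv(scores, expression, knot):
    """Percent reduction in variance from fitting one rising linear spline.

    Fits expression ~ a + b * max(0, score - knot) by least squares and
    returns 100 * (1 - RSS / TSS); returns 0.0 if expression is constant.
    """
    x = [max(0.0, s - knot) for s in scores]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(expression) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, expression))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    tss = sum((yi - mean_y) ** 2 for yi in expression)
    rss = sum((yi - intercept - slope * xi) ** 2
              for xi, yi in zip(x, expression))
    return 100.0 * (1.0 - rss / tss) if tss > 0 else 0.0
```

A column that is always one base contributes 2 bits and a column of equal base frequencies contributes none, so degenerate mammalian matrices such as HNF-1 (0.62) sit much closer to the uninformative end than E2F (1.74).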
Significant motifs and motif pairs for analysis with known vertebrate motifs as input
| Significant motifs and motif pairs | P-value | Fit coefficients |
|---|---|---|
| HNF-1 | 2.2 × 10^−15 | 187.5 |
| HFH-3 | 1.3 × 10^−14 | −9.8 |
| ETF | 0.002 | −2.9 |
| OCT-1*HNF-1 | 1.5 × 10^−14 | 156.7 |
| OCT-1*PPAR-DR1 | 2.1 × 10^−10 | 56.2 |
| HFH-3*HNF-1 | 0.001 | −120.1 |
| HFH-3*MTF-1 | 0.003 | 86.2 |
| OCT-1*HFH-3 | 2.2 × 10^−15 | 30.0 |
HNF-1, PPAR-DR1 and MTF-1 are liver-specific motifs, whereas Oct-1 and HFH-3 are ubiquitous motifs. HNF-1 is a well-established liver-specific motif. PPAR-DR1 binds several nuclear factors involved in metabolism, for example, PPAR, RXR, etc. MTF-1 is active in heavy metal metabolism and is essential for liver development. Fit coefficients are the coefficients of optimal fit in the final spline model, which is reported in the Supplementary note. More details on significant motif combinations are given in Supplementary Table V.
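To make the role of the fit coefficients concrete, the sketch below evaluates a small spline model for one gene. It assumes that a motif-pair term is the product of the two single-motif splines, as in MARS-style models; the final model itself is only given in the Supplementary note, so this form, the knot values, the PWM scores and the function names are all hypothetical. Only the two coefficients (187.5 and 156.7) come from the table, and only rising splines are used.

```python
def hinge(score, knot):
    """Rising linear spline: zero below the knot, linear above it."""
    return max(0.0, score - knot)


def predict_log_ratio(pwm_scores, knots, terms, intercept=0.0):
    """Evaluate intercept + sum(coefficient * product of hinges).

    terms: list of (coefficient, [motif, ...]); one motif gives a single
    spline term, two motifs give a pair term (assumed to be a product).
    """
    value = intercept
    for coefficient, motifs in terms:
        product = 1.0
        for motif in motifs:
            product *= hinge(pwm_scores[motif], knots[motif])
        value += coefficient * product
    return value


# Coefficients from the table; knots and scores are made-up examples.
terms = [(187.5, ["HNF-1"]), (156.7, ["OCT-1", "HNF-1"])]
knots = {"HNF-1": 0.45, "OCT-1": 0.40}
gene = {"HNF-1": 0.47, "OCT-1": 0.35}
prediction = predict_log_ratio(gene, knots, terms)
```

In this toy case the OCT-1 score lies below its knot, so the pair term vanishes and only the single HNF-1 term contributes. Under the product assumption, a pair term is non-zero only for genes that exceed both knots.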
Significant motifs and motif pairs for analysis with known and predicted matrices
| Pair ID | Significant motifs and motif pairs | P-value | Fit coefficients | Model: no. of target genes | Model: percentage count | Genome-wide: no. of target genes | Genome-wide: percentage count | Tissue with maximum expression | Specificity in liver |
|---|---|---|---|---|---|---|---|---|---|
| — | HNF-1 | 3.0 × 10^−15 | 182.8 | 10 | 1.0 | 94 | 0.6 | Liver | 2.4 × 10^−15 |
| P1 | TGACCTTTG*HNF-1 | 0.0023 | 521.9 | 10 | 1.0 | 154 | 1.0 | Liver | 3.8 × 10^−13 |
| P2 | TGACCTTTG*ACCCTAGACC | 0.0002 | −348.2 | 54 | 5.4 | 812 | 5.3 | Monocytes | — |
| P3 | ARP-1*ATGGAAAGA | 0.0006 | 299.2 | 17 | 1.7 | 154 | 1.0 | Liver | 6.7 × 10^−6 |
| P4 | TTAACATGCA(∼FXR IR-1)*ATGGAAAGA | 0.0004 | −70.2 | 222 | 22.2 | 3416 | 22.3 | B cells | — |
| P5 | AATTGAAT(∼LEF-1)*ATGGAAAGA | 3.5 × 10^−7 | −66.0 | 331 | 33.1 | 4939 | 32.3 | Monocytes | — |
| P6 | ATGGAAAGA*ACCCTAGACC | 0.0005 | 25.8 | 499 | 49.9 | 7244 | 47.3 | Monocytes | — |
| P7 | TGACCTTTG*ATGCCTGTC(∼USF-2) | 6.1 × 10^−10 | 21.0 | 561 | 56.1 | 8481 | 55.4 | Monocytes | — |
| P8 | TGAATGAAT*HNF-1 | 3.0 × 10^−15 | −18.6 | 989 | 98.9 | 15 212 | 99.4 | B cells | — |
Predicted matrices are represented by their consensus sequences. Matches to known PWMs are indicated in parentheses. For example, the PWM ATGCCTGTC matches USF-2. Similarity was determined using MatCompare (Schones et al). USF-2 is a ubiquitous factor, FXR is a bile acid receptor and LEF-1 is a nuclear mediator in the Wnt signaling pathway. ARP-1 is involved in multiple processes including lipid metabolism. Under target gene count, model refers to the genes used to fit the spline model to microarray data (total=1000), whereas genome-wide indicates all the genes on the array whose promoters were available (total=15 309). The final spline model and the significant predicted matrices for this analysis are reported in the Supplementary note.
Target genes of HNF-1 in liver
| Gene name | Annotation | PWM score | Expression log ratio: adult liver | Expression log ratio: fetal liver | Evidence for direct targets: HNF-1 binding (P-value) | Evidence for direct targets: experimental evidence (PubMed ID) |
|---|---|---|---|---|---|---|
| ALB | Albumin | 0.4873 | 5.07 | 4.27 | 2.3 × 10^−7 | 2693890 |
| FGB | Fibrinogen, B beta polypeptide | 0.4849 | 4.37 | 4.89 | — | 8218230 |
| AFP | Alpha-fetoprotein | 0.4630 | −1.09 | 5.81 | 0.7 | 2479822 |
| CYP2E1 | Cytochrome | 0.4622 | 5.93 | −1.89 | 8.1 × 10^−9 | 7710685 |
| GC | Group-specific component (vitamin D binding protein) | 0.4592 | 4.86 | 4.49 | 0.8 | 9774468 |
| FGA | Fibrinogen, A alpha polypeptide | 0.4582 | 4.25 | 4.32 | — | 7499335 |
| GARS | Glycyl-tRNA synthetase | 0.4569 | −2.16 | −0.95 | 4.1 × 10^−5 | — |
| APOH | Apolipoprotein H | 0.4533 | 5.32 | 4.77 | 5.0 × 10^−8 | 14984368 |
| TXNIP | Thioredoxin interacting protein | 0.4523 | −0.96 | −2.17 | 0.1 | — |
| RPL34 | Ribosomal protein L34, transcript variant 1 | 0.4521 | −2.70 | −1.33 | 0.7 | — |
Characteristics of the predicted targets of HNF-1 among 1000 modeled genes under the given experimental condition. Expression data are log2(E_g/E_g^C), where E_g is the observed expression level of gene g and E_g^C is its average across all the tissues. In all cases, the mRNA levels change by at least two-fold in adult liver, and also in fetal liver, suggesting liver-specific activity. HNF-1 binding P-values were obtained from Odom et al: P⩽0.001 indicates strong binding. We find evidence for eight of the 10 predicted targets as being direct HNF-1 targets: ALB, FGB, AFP, CYP2E1, GC, FGA, GARS and APOH. The detailed references for reported experimental evidence in the last column are given in Supplementary Table VI.
Enrichment of the biological process categories from Gene Ontology (GO)
| Pair ID | Motif or motif pair | GO biological process | Total no. of genes with the term | Total no. of target genes with the term | Enrichment | P-value |
|---|---|---|---|---|---|---|
| — | HNF-1 | Regulation of blood pressure | 28 | 3 | 15.2 | 0.001 |
| | | Regulation of body fluids | 79 | 4 | 7.2 | 0.002 |
| | | Response to pathogenic bacteria | 13 | 2 | 21.8 | 0.004 |
| | | Cytolysis | 14 | 2 | 20.2 | 0.004 |
| | | Blood coagulation | 64 | 3 | 6.6 | 0.01 |
| | | Hemostasis | 69 | 3 | 6.2 | 0.01 |
| P1 | TGACCTTTG*HNF-1 | Hexose metabolism | 98 | 5 | 4.7 | 0.004 |
| | | Monosaccharide metabolism | 101 | 5 | 4.6 | 0.005 |
| | | Muscle development | 105 | 5 | 4.4 | 0.006 |
| | | Inflammatory response | 153 | 6 | 3.6 | 0.006 |
| | | Fructose metabolism | 13 | 2 | 14.1 | 0.008 |
| | | Energy derivation by oxidation of organic compounds | 118 | 5 | 3.9 | 0.009 |
| | | Main pathways of carbohydrate metabolism | 76 | 4 | 4.8 | 0.009 |
| P3 | ARP-1*ATGGAAAGA | Lipid transport | 57 | 4 | 6.7 | 0.003 |
| | | Lipoprotein metabolism | 31 | 3 | 9.2 | 0.004 |
| | | Inflammatory response | 153 | 6 | 3.7 | 0.005 |
| | | Glycosphingolipid metabolism | 15 | 2 | 12.6 | 0.01 |
| | | Response to wounding | 235 | 7 | 2.8 | 0.01 |
| P2 | TGACCTTTG*ACCCTAGACC | Response to external stimulus | 556 | 48 | 1.7 | 0.0003 |
| | | Response to pest, pathogen or parasite | 385 | 35 | 1.8 | 0.0008 |
| | | Response to wounding | 235 | 24 | 2.0 | 0.001 |
Significant GO terms with P⩽0.01 (q⩽0.15; Storey and Tibshirani, 2003) are reported here. Enrichment (column 6) quantifies how enriched a given category is with the target genes (Zeeberg et al). The P-value was calculated using a hypergeometric distribution (see Materials and methods). Parent terms with higher P-values than their child terms are not reported in this table. Terms with exactly one gene in the total and target gene sets are also not reported. Some of the P-values are close to the cutoff, possibly because the terms are quite broad.
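A hypergeometric enrichment test of the kind described here can be sketched with the Python standard library alone. This is an illustrative version under stated assumptions, not the procedure from Materials and methods: the P-value is taken as the upper tail P(X ≥ hits), enrichment is taken as the ratio of the annotated fraction among targets to the annotated fraction among all genes, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: all genes considered
    annotated : genes in the population carrying the GO term
    sample    : predicted target genes
    hits      : target genes carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def fold_enrichment(population, annotated, sample, hits):
    """Fraction of targets with the term over the fraction genome-wide."""
    return (hits / sample) / (annotated / population)
```

For example, if 10 of 100 genes carry a term and 5 of 10 targets do, `fold_enrichment(100, 10, 10, 5)` gives a five-fold enrichment. The counts in this example are invented and do not correspond to any row of the table.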
Figure 2. Schematic representation of our analysis. A snapshot of the tissue-specific transcriptional subnetworks discovered from microarray data on adult human liver under a normal condition.
Individual motifs identified as significant at various phases (time points) of cell cycle
| Significant motifs | P-value | Time (h) | Phase | Experimental evidence (PubMed ID) |
|---|---|---|---|---|
| ROR alpha 2 | 4.4 × 10^−5 | 22 | G0/G1 | 11241556 |
| ATF6 | 0.004 | 8 | G1/S | 10564271 |
| MEF-2 | <0.005 | 6, 14, 24, 28 | G1/S | 10322110 |
| IRF-2 | <0.01 | 22, 28, 30 | G1/S | 9417064 |
| NF-Y | 6.1 × 10^−15 | 8 | G1/S | 9281303 |
| v-Myb | 0.004 | 24 | G1/S | 9111313 |
| AhR:Arnt | <0.0005 | 12, 26 | G1/S, S | 8628281 |
| AP-2 alpha | 4.6 × 10^−12 | 28 | S | 9776742 |
| IRF-1 | 0.0004 | 28 | S | 1491701 |
| LEF-1 | 4.2 × 10^−8 | 14 | S | 15736165 |
| Oct-1 | 5.4 × 10^−16 | 28 | S | 12887926 |
| S8 | 0.009 | 28 | S | — |
| TCF-4 | 1.6 × 10^−6 | 14 | S | 12408868 |
| USF2 | 0.0007 | 28 | S | — |
| v-Jun | 0.0007 | 28 | S | 12717415 |
| c-Ets-1 | 1.7 × 10^−15 | 32 | G2/M | 8437861 |
| GATA-6 | 1.7 × 10^−15 | 32 | G2/M | — |
| GATA-1 | 2.0 × 10^−12 | 34 | M | 10216081 |
| MAZ | 3.7 × 10^−6 | 2 | M | 11395515 |
| POU3F2 | <0.01 | 10, 32 | — | 14708619 |
| POU6F1 | <0.0001 | 8, 34 | — | 8900043 |
The experimental evidence for activity of a given motif, as reported in the literature, is included in the last column.
Significant motif pairs from different phases (time points) of cell cycle
| Significant pair | P-value | Time (h) | Phase | Experimental evidence (PubMed ID) |
|---|---|---|---|---|
| Oct-1*NF-Y | 8.0 × 10^−9 | 6 | G1 | 14586402 |
| MEF-2*E2F-1 | 5.6 × 10^−15 | 24 | G1 | |
| TBP*AREB6 | 1.2 × 10^−3 | 8 | G1/S | 14761964 |
| E2F*E2F-4/DP-2 | 1.3 × 10^−15 | 10 | G1/S | 9372931 |
| E2F(M00050)*E2F(M00516) | 1.3 × 10^−15 | 10 | G1/S | 15014447 |
| AML1*E2F-4/DP-2 | 1.0 × 10^−10 | 10 | G1/S | — |
| AML1*E2F | 2.6 × 10^−5 | 10 | G1/S | |
| AhR*E2F | 7.3 × 10^−5 | 0 | G2/M | 10644764 |
| E2F*E2F-1/DP-1 | 9.7 × 10^−9 | 18 | G2/M | 7917337 |
| E2F-1/DP-1*Ik-3 | 9.7 × 10^−9 | 18 | G2/M | — |
| E2F-1/DP-1*Oct-1 | 2.8 × 10^−7 | 18 | G2/M | |
| E2F*Ik-3 | 5.1 × 10^−7 | 18 | G2/M | — |
The experimental evidence reported in the literature for activity of a given motif pair during the cell cycle is included in the last column. Italicized PubMed IDs indicate indirect evidence.
Figure 3. Biochemical validation of target genes. Results of RT–PCR analysis in NIH3T3 fibroblasts expressing an inducible ERE2F1 or a transactivation-deficient mutant, MERE2F1. At time zero, medium containing 500 nM of 4-hydroxytamoxifen (OHT) was added and transcript levels of (A) DLG7 and (B) CDC16 were determined by RT–PCR with gene-specific primers at the indicated times. The respective transcript levels with 10 μg/ml of cycloheximide (CTX) added are shown in (C) DLG7 and (D) CDC16. GAPDH was used as a standardization control. Plotted data reflect results after normalization with GAPDH.