Hyunsoo Kim, William Hu, Yuval Kluger.
Abstract
BACKGROUND: Gene expression and transcription factor (TF) binding data have been used to reveal gene transcriptional regulatory networks. Existing knowledge of gene regulation can be presented using gene connectivity networks. However, these composite connectivity networks do not specify the range of biological conditions under which each link in the network is active.
Year: 2006 PMID: 16551355 PMCID: PMC1488875 DOI: 10.1186/1471-2105-7-165
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Breakdown (decomposition) of the composite regulatory network. The input for our algorithm (upper panel) includes: (a) a composite regulatory network published by Milo et al. [22] (or a joint network obtained by integrating the literature-driven gene regulatory networks compiled by Milo et al. [22] and Herrgard et al. [21]). The edges (black links) and nodes of the composite network are illustrated by a graph with 11 genes, where yellow and blue circles represent TFs and non-regulating genes, respectively; (b) microarray gene expression profiles from 387 different experimental conditions involving diploid cells. This is a subset of the experiments stored in a compendium of S. cerevisiae gene expression datasets compiled by Ihmels et al. [9]. The expression dataset is illustrated by a miniature matrix consisting of 11 genes and 15 experimental conditions, whose red, blue and yellow entries correspond to up-regulation, down-regulation and intermediate expression levels, respectively. The output of our algorithm allows us to read out condition-specific regulatory networks, as illustrated in the lower panel.
Figure 2. Local network approach for identifying the experimental conditions under which a gene is regulated by its known direct regulator. As in Fig. 1, the input includes the composite network and a gene expression compendium dataset. For each link in the network, as illustrated for the link TF→Gene1 highlighted in yellow, we identify conditions in which the target gene (Gene1) is directly controlled by its regulator (TF) by extracting two types of condition subsets: (a) a subset of states in which the expression profile of the target gene (highlighted in pink) is positively or negatively correlated with the expression profile of its regulator; (b) a subset of states in which the expression profile of the target gene (highlighted in pink) is correlated with the expression profiles of other genes (highlighted in light blue) that are known to be regulated by the same TF. In the illustration shown in the lower panel, the conditions in which the TF directly regulates the target gene are indicated below the gene expression patterns across these conditions. These conditions are also displayed in the braces shown next to the sub-networks. To differentiate between conditions in which the link is functional or non-functional due to the activation or deactivation of the TF, we mark these conditions in red and blue, respectively.
Figure 3. The united signature algorithm (USA). This algorithm is designed to find a subset of conditions in which the input genes regulate each other or are co-regulated, and to identify additional genes that are potentially co-regulated under the same subset of conditions. The order of the procedures performed in the USA is shown in the following six panels: (a) bi-normalization and log transformation of the raw expression data, such that row sums and column sums are equal to zero; (b) selection of an input set of gene expression profiles consisting of the target gene and its TF regulator, or of the target gene and all the other known regulated genes controlled by the TF; (c) calculation of condition (column) scores by summing (or averaging) the columns of a sub-matrix whose rows represent the normalized expression profiles of the input genes across all conditions. These rows are first multiplied by +1 for input genes that are stimulated by the TF and by -1 (inversion) for target genes inhibited by the TF. Experimental conditions whose column average $S_c$ across the input genes satisfies $|S_c - \mathrm{mean}(S_c)| > \mathrm{threshold}_{\mathrm{column}}$ are retained, as indicated by black bullets and black experimental IDs below the sub-matrix; (d) calculation of gene (row) scores, defined as the weighted row average $S_g = \sum_c (S_c E_{gc}) / (\#\mathrm{genes})$ across the selected conditions $c$; (e) determination of a sub-matrix of genes and conditions, termed the united transcriptional module (UTM), consisting of gene expression profiles whose weighted row averages satisfy $|S_g - \mathrm{mean}(S_g)| > \mathrm{threshold}_{\mathrm{row}}$ across the selected conditions $c$; (f) retention of genes within the UTM whose correlation with the target gene Gene2, the TF or the centroid gene (a gene that is correlated with the largest number of genes within the UTM) satisfies $|R| > \alpha = \mathrm{threshold}_{\mathrm{correlation}}$, where $R$ is the correlation coefficient.
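Steps (a) and (c) to (e) of the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy matrix, the threshold values and all function names are hypothetical; double centering is used as one simple way to obtain zero row and column sums (the paper's bi-normalization may differ); and the gene score is normalized by the number of selected conditions, which only rescales the row threshold.

```python
from statistics import mean

def double_center(matrix):
    """Make every row and column sum to zero (assumed stand-in for bi-normalization)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [mean(row) for row in matrix]
    col_means = [mean(matrix[g][c] for g in range(n_rows)) for c in range(n_cols)]
    grand = mean(row_means)
    return [[matrix[g][c] - row_means[g] - col_means[c] + grand
             for c in range(n_cols)] for g in range(n_rows)]

def condition_scores(expr, signs):
    """Step (c): signed column average over the input genes.
    `signs` maps a row index to +1 (stimulated) or -1 (inhibited)."""
    return [mean(sign * expr[g][c] for g, sign in signs.items())
            for c in range(len(expr[0]))]

def deviating(scores, threshold):
    """Indices whose score differs from the mean score by more than `threshold`."""
    centre = mean(scores)
    return [i for i, s in enumerate(scores) if abs(s - centre) > threshold]

def gene_scores(expr, cond_scores, conditions):
    """Step (d): row average over the selected conditions, weighted by condition score."""
    return [mean(cond_scores[c] * row[c] for c in conditions) for row in expr]

# Hypothetical log-expression values: 4 genes x 5 conditions.
log_expr = [
    [2.0, 1.8, 0.1, -0.2, 0.0],    # row 0: TF
    [1.9, 2.1, 0.0, 0.1, -0.1],    # row 1: activated target
    [-2.0, -1.7, 0.2, 0.0, 0.1],   # row 2: repressed target
    [0.1, -0.1, 0.0, 0.2, -0.2],   # row 3: unrelated gene
]
signs = {0: 1, 1: 1, 2: -1}        # input genes and their regulatory signs

expr = double_center(log_expr)
s_c = condition_scores(expr, signs)
conditions = deviating(s_c, threshold=0.9)   # retained conditions
s_g = gene_scores(expr, s_c, conditions)
genes = deviating(s_g, threshold=0.5)        # rows of the resulting module
print(conditions, genes)
```

On this toy input the first two conditions are retained and the unrelated gene is left out of the module. The final correlation filter of step (f) is omitted here.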
Figure 4. A schema of the LINK and STAR models. (A) The LINK model is designed to find a subset of conditions in which the expression profiles of a TF-target gene pair (link highlighted in pink in the upper left panel) are positively or negatively correlated. In addition, it finds other known or putative target genes whose expression profiles are correlated with the TF or its target under this subset of conditions. The regulating TF and its target gene (Gene1) are the core set of input genes inserted into the USA. The blue and yellow nodes in the known local network represent the input gene set employed by the USA. This input set is also indicated in a genome-wide vector that has only two nonzero elements, representing the TF (+1) and its target (+1 for activation and -1 for suppression). The dataset we insert into the USA is the yeast compendium data described in Fig. 1 and schematized in the miniature matrix consisting of 11 genes and 15 experimental conditions shown between the left and right panels. The red, blue and yellow entries correspond to up-regulation, down-regulation and intermediate expression levels, respectively. The algorithm finds the conditions in which TF and Gene1 are correlated. It also finds the additional genes denoted in red (Gene2, Gene4 and Gene5), whose expression profiles across the subset of experimental conditions correlate positively or negatively with the TF or Gene1. Altogether, these genes and conditions constitute the united transcriptional module (UTM) shown in the left middle panel. We predict that the known regulatory interactions TF→Gene2 and TF⊣Gene4 are functional under the UTM conditions in which the TF, Gene1 and Gene2 are up-regulated (red pixels in the UTM matrix) and Gene4 is down-regulated (blue pixels). For the predicted link TF→Gene5, we further compute a MATCH score between the TF position-specific weight matrix (PWM) and the promoter region of Gene5. Links with a score higher than a threshold value of 0.94 (i.e. a PWM match) are reported along with the experimental conditions supported by their corresponding UTM, as illustrated in the lower left panel. (B) The STAR model enables us to find an alternative subset of experimental conditions in which Gene1 is directly regulated by TF (illustrated by the link highlighted in pink in the upper right panel). It searches for conditions in which the expression profile of Gene1 is positively or negatively correlated with some of the genes regulated by TF (highlighted in light blue). In addition, it is designed to identify new genes that are positively or negatively correlated with the input core of target genes under the same subset of experimental conditions, and whose promoters contain sequences similar to the TF binding site. In the STAR model, we apply the USA (right center panel) to the core set of input genes consisting of all the TF target genes (Gene1, Gene2, Gene3 and Gene4), excluding the TF itself. As in the LINK model, the input set of genes is indicated by the nonzero elements of a genome-wide vector (highlighted in green in the upper right panel), which denotes the regulatory relationship between the TF and its target genes. In addition to links from the original local (STAR) network, two new links, TF⊣Gene6 and TF→Gene7, are predicted based on their co-expression with the core genes and the match between their promoter regions and the TF PWM. Each target is predicted to be affected by the TF under the experimental conditions of the respective UTM.
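The difference between the two input sets, and the PWM filter applied to newly predicted targets, can be illustrated with a short Python sketch. It assumes the toy gene names from the schema; the function names, the candidate genes GeneX and GeneY, and their MATCH scores are hypothetical. Only the 0.94 cut-off comes from the caption.

```python
PWM_THRESHOLD = 0.94  # MATCH score cut-off quoted in the caption

def link_core(tf, target, sign):
    """LINK input: the TF (+1) and one target (+1 activation, -1 suppression)."""
    return {tf: 1, target: sign}

def star_core(known_targets):
    """STAR input: every known target of the TF with its sign; the TF itself is excluded."""
    return dict(known_targets)

def genome_vector(core, gene_order):
    """Genome-wide vector with nonzero entries only for the core input genes."""
    return [core.get(gene, 0) for gene in gene_order]

def pwm_filter(match_scores):
    """Keep co-expressed candidates whose promoter matches the TF PWM."""
    return [gene for gene, score in match_scores.items() if score > PWM_THRESHOLD]

gene_order = ["TF", "Gene1", "Gene2", "Gene3", "Gene4", "Gene5"]
known_targets = {"Gene1": 1, "Gene2": 1, "Gene3": 1, "Gene4": -1}

link_vec = genome_vector(link_core("TF", "Gene1", 1), gene_order)
star_vec = genome_vector(star_core(known_targets), gene_order)

# Hypothetical MATCH scores for two co-expressed candidate genes.
kept = pwm_filter({"GeneX": 0.97, "GeneY": 0.90})
print(link_vec, star_vec, kept)
```

The LINK vector has two nonzero entries, while the STAR vector marks all known targets and leaves the TF position at zero.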
Examples of predicted TF-target gene pairs generated by applying the LINK and STAR models to the S. cerevisiae transcriptional regulatory network published by Milo et al. [22]. The table shows the overlap of these links with: (a) high-confidence ChIP-on-chip data [6] at P ≤ 0.001 and with sequence conservation across at least 3 yeast species; (b) moderate-confidence ChIP-on-chip data at P ≤ 0.005, excluding the high-confidence binding events in (a); or (c) the literature-driven gene regulatory network constructed by Herrgard et al. [21].
| Links predicted by the LINK model | |
| Overlap with ChIP-on-chip data at P ≤ 0.001 (high confidence) | ACE2→CST13, GAT1→DAL2, GCN4→ATR1, GCN4→FOL2, GCN4→IDP1, GCN4→ISU1, GLN3→ARG1, GLN3→UGA3, GLN3→YHR029C, ROX1⊣YLR413W, STE12→FUS2, STE12→GPA1, STE12→INP52, STE12→KAR4, STE12→TEC1, SWI4→OCH1, SWI5→YPL158C, TEC1→GFA1, TEC1→GIC2, TEC1→PCL2, TEC1→STE12, UME6→YOR291W, YAP1→CYT2 |
| Overlap with ChIP-on-chip data at P ≤ 0.005 (moderate confidence) | ACE2→SCW11, ASH1→HSP150, ASH1→PIR1, ASH1→PIR3, DAL80→YLR053C, GAT1→PUT1, GCN4→YMC1, GCN4→ALD5, GCN4→YMC2, GCN4→CAF16, GCN4→BAT1, GCN4→ORT1, STE12→ASG7, STE12→YDR249C, STE12→MFA2, TEC1→PRM1, TEC1→PRM6, TEC1→AGA2, TEC1→KAR5, TEC1→PRP39, YAP1→AAD6, YAP1→GTT2, YAP1→YLR460C |
| Literature supported | BAS1→HIS7, DAL80→GAP1, MSN4→TPS2, STE12→MFA2, STE12→TEC1, TEC1→STE12 |
| Links predicted by the STAR model | |
| Overlap with ChIP-on-chip data at P ≤ 0.001 (high confidence) | ACE2⊣BUD9, ADR1→PXA1, BAS1→HIS4, BAS1→SHM2, FKH2→ALK1, FKH2→SWI5, GAT1→DAL2, GCN4→ATR1, GCN4→FOL2, GCN4→IDP1, GCN4→ILV3, GCN4→UGA3, GCR1→CDC19, GLN3→CPS1, HAP4→ATP1, HSF1→CPR6, MCM1→YNL058C, MSN2→TSL1, MSN4→TSL1, PHO4→PHO86, RPN4→PUP2, STE12→FUS2, STE12→GPA1, STE12→INP52, STE12→KAR4, STE12→TEC1, SWI5⊣CYK3 |
| Overlap with ChIP-on-chip data at P ≤ 0.005 (moderate confidence) | ABF2→RPS28A, ACE2→FAA3, ACE2→PRY3, ACE2→SCW11, DAL80→YLR053C, FKH2→ACE2, FKH2→HOF1, FKH2→YLR190W, FKH2→YOR315W, FKH2→YPL141C, GAT1→DAL3, GAT1→DAL5, GAT1→DAL7, GAT1→MEP2, GAT1→PUT1, GCN4→ALD5, GCN4→BAT1, GCN4→CAF16, GCN4→ORT1, GCN4→YMC1, GCN4→YMC2, GCN4→YNL129W, GLN3→MEP2, GLN3→OPT2, GLN3→YGR125W, GLN3→YMR088C, HAP4→NDI1, HAP4→SDH1, HSF1→HSP10, HSF1→HSP60, HSF1→TSL1, LEU3→BAT1, MCM1→BUD4, MCM1→YOR315W, RAP1⊣GPM1, RAP1⊣PGI1, RCS1→ARN1, RCS1→TAF1, RPN4→RPN12, RPN4→RPN6, SKN7⊣DDR48, SKN7⊣GPX2, STE12→ASG7, STE12→MFA2, SWI5⊣FAA3, SWI5⊣PIR1, SWI5⊣PRY3, SWI5⊣TEC1, YAP1→AAD6 |
| Literature supported | BAS1→ADE1, BAS1→ADE13, BAS1→ADE17, BAS1→ADE2, BAS1→ADE5,7, BAS1→HIS1, BAS1→HIS4, BAS1→HIS7, DAL80→GAP1, GCR1→CDC19, HAP4→SDH1, HSF1→SSA4, STE12→MFA2, STE12→TEC1 |
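The link notation used in the table (→ for activation, ⊣ for repression) maps naturally onto signed edges. The following Python sketch, with hypothetical function names, shows one way to parse a table cell into (TF, target, sign) tuples. It splits on a comma followed by a space so that gene names containing a bare comma, such as ADE5,7, stay intact.

```python
def parse_link(text):
    """Turn 'TF→Gene' or 'TF⊣Gene' into (tf, target, sign)."""
    for arrow, sign in (("→", 1), ("⊣", -1)):
        if arrow in text:
            tf, target = text.split(arrow, 1)
            return tf.strip(), target.strip(), sign
    raise ValueError(f"no regulatory arrow found in {text!r}")

def parse_cell(cell):
    """Parse a comma-separated table cell into a list of signed links."""
    return [parse_link(item) for item in cell.split(", ") if item.strip()]

# Three entries copied from the 'Literature supported' row of the STAR model.
links = parse_cell("BAS1→ADE2, BAS1→ADE5,7, BAS1→HIS1")
repressed = parse_link("SWI5⊣CYK3")
print(links, repressed)
```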
Figure 5. Extending the currently known parts of the regulatory network and breaking it down into state-dependent networks [see Additional file 2]. Here we display a representative subset of known and predicted links in the yeast regulatory network, along with categories of experimental conditions in which the target genes are controlled by their regulators. To simplify the display, we aggregated the experimental conditions into categories such as cell cycle (green links), amino acid starvation (orange), rapamycin treatment (blue) and alpha-factor treatment (purple). TFs are represented by squares and their target genes by circles. Solid and dotted links indicate known and predicted regulatory links, respectively. The predicted experimental conditions in a UTM corresponding to a regulatory link tend to contain the experimental condition in which TF binding to the promoter region of the target gene has been experimentally confirmed [6]. For example, the predicted regulatory links STE12→FUS2 and TEC1→GFA1 are supported by ChIP-on-chip location analyses performed with alpha-factor pheromone treatment (purple dotted links STE12→FUS2 and TEC1→GFA1).
Figure 6. Decomposing the composite network into condition-specific regulatory links. To predict the experimental conditions for each TF-target gene link in the regulatory network, we unified the sets of experimental conditions generated by the LINK and STAR models. We applied both models to the local networks containing the TF→Gene1 link. (a) A set of experimental conditions (A) in which the TF→Gene1 link is predicted to be active according to the LINK model. Under conditions A, Gene1 and TF are over-expressed. (b) A set of experimental conditions (B) in which the targets are regulated by TF according to the STAR model. Under conditions B, Gene1, Gene2 and Gene3 are stimulated and Gene4 is suppressed by TF. (c) Finally, we determine the experimental conditions in which the TF→Gene1 link is active by taking the union of A and B. The quality of our network decomposition was assessed using the union A ∪ B.
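The final step of this caption is a plain set union. A minimal Python sketch follows, assuming hypothetical condition identifiers and a hypothetical helper name; it also records which model supports each condition, which is an illustrative extra rather than something the caption describes.

```python
def active_conditions(link_conditions, star_conditions):
    """Union of the condition sets A (LINK) and B (STAR) for one TF-target link."""
    return set(link_conditions) | set(star_conditions)

# Hypothetical condition identifiers for the TF→Gene1 link.
set_a = {"cond03", "cond07", "cond12"}            # from the LINK model
set_b = {"cond07", "cond12", "cond15", "cond21"}  # from the STAR model

union = active_conditions(set_a, set_b)

# Optional bookkeeping: which model(s) support each condition.
support = {c: [name for name, s in (("LINK", set_a), ("STAR", set_b)) if c in s]
           for c in sorted(union)}
print(sorted(union), support)
```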
Figure 7. Predicting and validating condition-specific networks. To find the condition-specific network associated with rapamycin treatment, we first analyzed all the local networks and identified, from the respective UTMs, all the links associated with this condition [see Additional file 3]. Here we show the sub-network consisting of links in the literature-driven gene regulatory networks [21, 22] and predicted links that are supported by ChIP-on-chip binding assays (P ≤ 0.001) with rapamycin treatment. Blue links represent TF-target gene pairs that are bound to each other and predicted to be active in this condition (46/55). Dotted gray links (9/55) correspond to bound pairs for which our model failed to predict a condition-specific regulatory relationship.
Overlap between ChIP-on-chip events with P ≤ 0.005 (or P ≤ 0.005 excluding high-confidence binding events with P ≤ 0.001 and with sequence conservation across at least 3 yeast species) and predicted links obtained by the LINK model and by five other reference models. The table shows that fewer than 1% of all possible randomly selected links occur in binding experiments. Moreover, only 3.32% of the TF-gene pairs with an overall expression correlation greater than 0.5 (or smaller than -0.5) overlap with the experimental binding data. The overlap increases to 13.37% when we consider highly correlated TF-gene pairs under the experimental conditions of the UTMs generated by the LINK model. By filtering the predicted links via PWM matching, we discriminate between direct and indirect predicted interactions of co-expressed TF-gene pairs. As shown in the table, 40% (14.37%) of the new links predicted by the LINK model overlap with all ChIP-on-chip binding events with P ≤ 0.005 (P ≤ 0.005 excluding high-confidence binding events with P ≤ 0.001 and with sequence conservation across at least 3 yeast species). This is a substantially higher rate than the 23.19% (7.25%) obtained by a simpler approach that combines correlation with PWM-promoter matching but disregards information about other experimentally known links.
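| PWM filter | ChIP-on-chip events | Random links | Correlated links | LINK-UTM links |
| No | P ≤ 0.005 | 0.81% | 3.32% | 13.37% |
| Yes | P ≤ 0.005 | 6.03% | 23.19% | 40.00% |
| No | P ≤ 0.005, excluding high confidence | 0.53% | 1.05% | 8.62% |
| Yes | P ≤ 0.005, excluding high confidence | 3.56% | 7.25% | 14.37% |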
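Each percentage in the table is the share of a model's links that also appear among the ChIP-on-chip binding events. A minimal Python sketch of that calculation is shown below; the function name and the toy link labels are hypothetical and do not reproduce any value from the table.

```python
def overlap_rate(predicted_links, bound_links):
    """Percentage of predicted (TF, target) pairs that also appear in the binding data."""
    predicted = set(predicted_links)
    if not predicted:
        return 0.0
    return 100.0 * len(predicted & set(bound_links)) / len(predicted)

# Hypothetical toy data: four predicted links, one of which is also a binding event.
predicted = [("TF_A", "g1"), ("TF_A", "g2"), ("TF_B", "g3"), ("TF_B", "g4")]
bound = [("TF_A", "g2"), ("TF_C", "g9")]

rate = overlap_rate(predicted, bound)
print(f"{rate:.2f}%")
```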
Figure 8. Validation procedure for the STAR model. To demonstrate the feasibility of the STAR model, we designed a recapturing scheme. We removed one link at a time from the network and examined whether this link could be recaptured by applying the STAR model to the reduced network. This figure explains the validation procedure using the MCM1→CLB2 link as an example. (a) We first applied the USA to the local (STAR-like) network of MCM1 to find experimental conditions in which the core input genes are over-expressed or under-expressed. We then evaluated the correlations between all the targets of MCM1 under these (UTM) conditions and removed targets whose correlation with other members of the core set is insignificant. We recursively applied the USA to remove MCM1 target genes that are only weakly correlated with the other MCM1 target genes. At each step of the recursive elimination, we identified a centroid gene, defined as the gene with the largest number of highly correlated genes, and eliminated genes that are not highly correlated with it. We continued these iterations until all the remaining target genes were highly correlated with each other. CLB2 was identified as the centroid in the last iteration. (b) In the next step we removed the link MCM1→CLB2 associated with the centroid gene and applied the STAR model described in Figure 4(B). As shown, CLB2 reappeared in the UTM generated by applying the STAR model to a core set of input genes that excludes CLB2. Moreover, the matching score between the PWM of MCM1 and the promoter sequence of CLB2 was high. Overall, we recaptured 36% of the centroid genes by applying this procedure to all the multi-target local networks in the literature-driven gene regulatory network.
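The centroid-based elimination in panel (a) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the set of conditions is held fixed instead of re-running the USA at each round, the loop stops as soon as a round removes no gene, and the gene names, profiles and threshold are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def prune_to_centroid(profiles, alpha):
    """Repeatedly keep only the genes with |R| > alpha to the current centroid."""
    genes = dict(profiles)
    while True:
        names = list(genes)
        # For each gene, list the other genes it is highly correlated with.
        partners = {g: [h for h in names
                        if h != g and abs(pearson(genes[g], genes[h])) > alpha]
                    for g in names}
        # Centroid: the gene with the largest number of highly correlated genes.
        centroid = max(names, key=lambda g: len(partners[g]))
        keep = {centroid, *partners[centroid]}
        if len(keep) == len(names):   # nothing removed in this round: stop
            return centroid, names
        genes = {g: genes[g] for g in names if g in keep}

# Hypothetical target-gene profiles across five selected conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.5],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],    # anti-correlated, kept because |R| is used
    "geneD": [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with the rest
}
centroid, remaining = prune_to_centroid(profiles, alpha=0.9)
print(centroid, remaining)
```

In the recapturing scheme of panel (b), the centroid found this way would then be removed from the core set before the STAR model is applied again.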