Hong Sun, Tias Guns, Ana Carolina Fierro, Lieven Thorrez, Siegfried Nijssen, Kathleen Marchal.
Abstract
Computationally retrieving biologically relevant cis-regulatory modules (CRMs) is not straightforward. Because of the large number of candidates and the imperfection of the screening methods, many spurious CRMs are detected that score as high as the biologically true ones. Using ChIP information not only reduces the regions in which the binding sites of the assayed transcription factor (TF) should be located, but also allows the valid CRMs to be restricted to those that contain the assayed TF (here referred to as applying CRM detection in a query-based mode). In this study, we show that exploiting ChIP information in a query-based way makes in silico CRM detection a much more feasible endeavor. To handle the large datasets, the query-based setting and other specificities proper to CRM detection on ChIP-Seq-based data, we developed a novel, powerful CRM detection method, 'CPModule'. By applying it to a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells, we demonstrate how our tool can recover the combinatorial regulation of five known TFs that are key in this self-renewal. Additionally, we make a number of new predictions on combinatorial regulation of these five key TFs with other TFs documented in TRANSFAC.
Year: 2012 PMID: 22422841 PMCID: PMC3384348 DOI: 10.1093/nar/gks237
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. CPModule analysis flow. (A) The input consists of a library of PWMs and a set of sequences. In the first step, prior to the actual CRM detection, a screening with public motif databases is performed. Here, we combine standard PWM screening with filtering based on nucleosome occupancy. Motif sites displaying a high nucleosome occupancy are filtered out (indicated as the transparent shapes in A). (B) The second step consists of the actual combinatorial search. Here, we use a constraint programming for itemset mining approach to enumerate all valid motif sets, i.e. combinations of motifs (i) that occur frequently in the input set (frequency constraint). Only valid motif sets will be considered (indicated in regular boxes), while invalid ones will not (indicated in dashed boxes); (ii) of which the motif sites contributing to the motif set occur in each other’s proximity (proximity constraint). Only valid motif sets will be considered (indicated in regular boxes), while invalid ones will not (indicated in dashed boxes); (iii) that are non-redundant (redundancy constraint). The motif sets in the dashed box are redundant with the motif set in the regular box and will not be considered; (iv) that contain a query-motif (query-based constraint), which corresponds in this work to the motif of the ChIP-assayed TF. Valid motif sets are indicated in regular boxes. (C) Valid motif sets or CRMs are finally assigned a P-value that expresses their specificity for the input set.
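The frequency, proximity and query-based constraints of step (B) can be illustrated with a minimal brute-force sketch. It is not the constraint programming formulation used by CPModule: the function names, the representation of a sequence as a dictionary from motif name to site positions, and the `max_size` cap are all assumptions made for illustration, and the redundancy constraint and the P-value of step (C) are left out.

```python
from itertools import combinations, product

def occurs_within(sites_by_motif, motif_set, proximity):
    """True if one site per motif can be picked so that all picked sites
    span at most `proximity` bp (hypothetical reading of the proximity constraint)."""
    site_lists = [sites_by_motif.get(m, []) for m in motif_set]
    if any(not sites for sites in site_lists):
        return False
    return any(max(choice) - min(choice) <= proximity
               for choice in product(*site_lists))

def query_based_motif_sets(sequences, query, min_support, proximity, max_size=3):
    """Enumerate motif sets that contain `query` (query-based constraint),
    satisfy the proximity constraint in a sequence to count as a hit, and
    reach `min_support` as a fraction of sequences (frequency constraint)."""
    others = sorted({m for seq in sequences for m in seq} - {query})
    results = []
    for k in range(1, max_size):              # number of motifs added to the query
        for extra in combinations(others, k):
            motif_set = (query,) + extra
            hits = sum(occurs_within(seq, motif_set, proximity) for seq in sequences)
            support = hits / len(sequences)
            if support >= min_support:
                results.append((motif_set, support))
    return results

# Toy input: each sequence maps a motif name to the positions of its sites.
sequences = [
    {"SOX2": [100], "OCT4": [130], "STAT3": [900]},
    {"SOX2": [50, 400], "OCT4": [420]},
    {"SOX2": [10], "STAT3": [60]},
    {"OCT4": [5], "STAT3": [20]},
]
found = query_based_motif_sets(sequences, "SOX2", min_support=0.5, proximity=150)
print(found)
```

On this toy input only the set (SOX2, OCT4) survives: it occurs within 150 bp in two of the four sequences, whereas (SOX2, STAT3) is supported by one sequence only. Exhaustive enumeration of this kind scales poorly, which is the motivation for a dedicated search method.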
Comparison of CRM prediction algorithms
| | Cister | Cluster-Buster | ModuleSearcher | Compo | CPModule |
|---|---|---|---|---|---|
| mCC | 0.16 | 0.05 | / | 0.27 | 0.57 |
| nCC | 0.23 | 0.23 | / | 0.68 | 0.55 |
The tools were run on the synthetic data set of Xie et al. (36) using a stringent screening with 516 TRANSFAC PWMs. Slash indicates termination by lack of memory.
Figure 2. Performance comparison of CRM detection tools. All CRM detection tools were run on the synthetic data set of Xie et al. (36). Screening was performed with the PWMs used to generate the synthetic data in combination with an additional set of PWMs sampled from TRANSFAC (the number of PWMs added to the true PWMs is indicated on the x-axis). (A) mCC, (B) nCC.
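This excerpt does not define mCC and nCC. Assuming they are the motif-level and nucleotide-level variants of the correlation coefficient commonly used in motif-detection benchmarks, both can be computed from a confusion matrix as sketched below; the function name and the example counts are hypothetical.

```python
from math import sqrt

def correlation_coefficient(tp, fp, tn, fn):
    """Correlation coefficient over confusion-matrix counts.
    Assumption: the counts are taken per motif for mCC and per nucleotide for nCC."""
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0                      # undefined case, reported here as 0
    return (tp * tn - fn * fp) / denom

# Hypothetical counts, not taken from the paper.
cc = correlation_coefficient(tp=40, fp=10, tn=40, fn=10)
print(round(cc, 2))
```

A value of 1 corresponds to perfect agreement with the planted modules, 0 to a prediction no better than chance.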
Figure 3. Known combinatorial regulation of the five assayed TFs. Network representing combinatorial interactions between the five transcription factors (KLF4, SOX2, OCT4, NANOG and STAT3) involved in embryonic stem cell development. Edges indicate that a combinatorial interaction between the indicated TFs exists as reported in literature (with a combinatorial interaction referring to the fact that at least subsets of genes contain binding sites for both TFs in each other’s neighborhood). Dashed lines correspond to the interactions in the benchmark that were missed by CPModule. Solid lines correspond to the interactions in the benchmark that were recovered by CPModule. A thin line indicates that the interaction was detected using CPModule on the ChIP-Seq data set of one TF, while a thick line indicates that the interaction was detected by using either ChIP-Seq data set of the TFs involved in the interaction.
Figure 4. Effect of different screening/filtering combinations on motif prediction results. (A) Effect of using different screening/filtering combinations on the sensitivity of recovering true sites of the assayed TF. Sensitivity is assessed by the percentage of binding peak regions in which a motif site of the assayed TF could be detected. (B) Effect of using different screening/filtering combinations on the average number of remaining motif sites per sequence and per TF for each of the ChIP-assayed data sets. In each panel, we used a stringent screening, a non-stringent screening without filtering and a non-stringent screening with a filtering based on NuOS, respectively (the categories are indicated, in the order mentioned, by bars of increasing gray scale).
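The two quantities plotted in Figure 4 follow directly from the description in the legend. The sketch below is an illustration only: the function names, the toy data and the representation of each peak region as a dictionary from TF name to retained site positions are assumptions.

```python
def sensitivity(peak_sites, assayed_tf):
    """Percentage of peak regions holding at least one retained site of the assayed TF."""
    hits = sum(1 for sites in peak_sites if sites.get(assayed_tf))
    return 100.0 * hits / len(peak_sites)

def mean_sites_per_sequence_per_tf(peak_sites, n_tfs):
    """Average number of retained motif sites per sequence and per TF."""
    total = sum(len(positions) for sites in peak_sites for positions in sites.values())
    return total / (len(peak_sites) * n_tfs)

# Toy screening result for four peak regions and a library of two PWMs.
peaks = [
    {"KLF4": [10, 50], "OCT1": [70]},
    {"OCT1": [5]},
    {"KLF4": [30]},
    {},
]
sens = sensitivity(peaks, "KLF4")
density = mean_sites_per_sequence_per_tf(peaks, n_tfs=2)
print(sens, density)
```

A stricter screening or filtering lowers the site density of panel (B) but risks lowering the sensitivity of panel (A), which is the trade-off the figure compares.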
Benchmark CRMs obtained with CPModule in combination with filtering (non-stringent screening with filtering for all TFs except the assayed one)
| ChIP-Seq-assayed TF | CRM | Support (%) | Proximity threshold (bp) | Query-based (Rank/Total) | Non-query-based (Rank/Total) | Validation (%) |
|---|---|---|---|---|---|---|
| KLF4 | KLF4, STAT1 | 66 | 150 | 19/23 | 845/849 | 22.73 |
| | KLF4, OCT1 | 60 | 200 | 19/46 | 5562/5694 | 33.33 |
| | KLF4, STAT1, [CEBP] | 61 | 200 | 22/46 | 5635/5694 | 23.33 |
| | KLF4, STAT4, [SMAD] | 61 | 250 | 16/98 | 6160/7029 | 22.95 |
| | KLF4, OCT1 | 60 | 250 | 65/98 | 6994/7029 | 33.33 |
| | KLF4, STAT4, [T3R] | 63 | 300 | 23/183 | 6903/7704 | 22.22 |
| | KLF4, OCT1 | 61 | 300 | 131/183 | 7651/7704 | 32.79 |
| | KLF4, STAT4, STAT1, [CDXA, LEF1] | 60 | 350 | 29/284 | 25 056/26 843 | 23.33 |
| | KLF4, OCT1 | 61 | 350 | 212/284 | 26 771/26 843 | 32.79 |
| | KLF4, STAT4, [SMAD, T3R] | 60 | 400 | 5/468 | 24 930/31 549 | 21.67 |
| | KLF4, OCT1, [CDXA] | 60 | 400 | 207/468 | 31 220/31 549 | 33.33 |
| NANOG | NANOG, STAT5A_03, STAT5A_04 | 62 | 150 | 1/11 | 5930/5941 | 19.35 |
| | NANOG, STAT5A_04, [PU1] | 60 | 200 | 1/39 | 40 171/43 093 | 23.33 |
| | NANOG, STAT5A_03, [PU1] | 61 | 250 | 1/71 | 62 475/64 059 | 21.31 |
| | NANOG, OCT1 | 60 | 250 | 21/71 | 64 006/64 059 | 68.33 |
| | NANOG, OCT1, [FAC1] | 60 | 300 | 1/145 | 66 186/80 859 | 70.49 |
| | NANOG, STAT3, [FAC1] | 62 | 300 | 3/145 | 77 724/80 859 | 26.67 |
| | NANOG, STAT5A_04, [PU1, FAC1] | 60 | 350 | 1/406 | 159 818/217 328 | 25.00 |
| | NANOG, OCT1, STAT5A_04, [FAC1] | 60 | 350 | 2/406 | 167 806/217 328 | 66.67 (OCT1); 26.67 (STAT5A_04) |
| | NANOG, STAT5A_04, [PU1, HNF3, AR] | 60 | 400 | 1/883 | 204 024/299 409 | 23.33 |
| | NANOG, OCT1, STAT6, STAT5A_04, [FAC1] | 60 | 400 | 2/883 | 224 495/299 409 | 70.00 (OCT1); 26.67 (STAT6) |
| OCT4 | OCT4, STAT6, [XFD2, FOXJ2, FOXP3] | 63 | 150 | 6/1322 | 30/11 966 | 14.75 |
| | OCT4, SOX2 | 60 | 150 | 1272/1322 | 10 348/11 966 | 78.33 |
| | OCT4, STAT4, STAT6, [PAX2, PAX4, TITF1] | 62 | 200 | 1/13 141 | 6/111 817 | 16.39 |
| | OCT4, SOX2, [PAX2] | 62 | 200 | 11 740/13 141 | 83 797/111 817 | 79.03 |
| | OCT4, STAT4, STAT6, [PAX4, PAX2, ELF1] | 66 | 250 | 1/29 767 | 23/182 697 | 16.67 |
| | OCT4, SOX2, [CDXA] | 60 | 250 | 28 080/29 767 | 167 142/182 697 | 81.67 |
| | OCT4, STAT3 | 61 | 300 | 1/73 091 | 7/235 252 | 14.75 |
| | OCT4, SOX2, [CDXA, PAX2] | 60 | 300 | 68 944/73 091 | 217 331/235 252 | 75.00 |
| | OCT4, STAT3, [CDXA] | 60 | 350 | 1/290 997 | 1/859 377 | 12.90 |
| | OCT4, SOX2, [PAX2, FOXP3] | 60 | 350 | 106 443/290 997 | 296 722/859 377 | 73.33 |
| | OCT4, STAT3, STAT5A_03 | 60 | 400 | 1/383 001 | 11/1 080 139 | 14.75 |
| | OCT4, SOX2, [PAX2, FOXP3] | 60 | 400 | 150 936/383 001 | 449 140/1 080 139 | 73.33 |
| SOX2 | SOX2, OCT4 | 68 | 150 | 1/6318 | 322/46 471 | 88.24 |
| | SOX2, STAT5A_04, [NKX62, AR, HELIOSA] | 60 | 150 | 3/6318 | 840/46 471 | 23.33 |
| | SOX2, STAT5A_04, [GEN_INI2_B, FOXJ2, HNF3ALPHA, CEBP, AR] | 60 | 200 | 2/90 416 | 55/512 702 | 25.00 |
| | SOX2, OCT4, [CDXA, TST1] | 61 | 200 | 4/90 416 | 106/512 702 | 27.87 |
| | SOX2, STAT1, STAT5A_04, [SRY, CAP, NFAT, TEF, AR, CDX, HMGIY, BRCA] | 60 | 250 | 1/168 760 | 55/790 791 | 25.00 |
| | SOX2, OCT4, [CDXA, CDX2] | 62 | 250 | 4/168 760 | 183/790 791 | 87.10 |
| | SOX2, OCT4, [CDXA, CDX, CEBP] | 60 | 300 | 1/303 533 | 94/1 256 190 | 86.89 |
| | SOX2, STAT, STAT5A_03, [CAP, NFAT, GEN_INI2_B, FOXJ2, CEBP] | 60 | 300 | 2/303 533 | 238/1 256 190 | 21.67 |
| STAT3 | STAT3, OCT4, [CAP] | 61 | 150 | 36/5649 | 43/6426 | 36.07 |
| | STAT3, SOX2, [IRF1] | 61 | 150 | 2651/5649 | 3018/6426 | 31.15 |
| | STAT3, OCT4, [CEBP] | 61 | 200 | 301/32 257 | 312/33 640 | 29.51 |
| | STAT3, SOX2, STAT6, [YY1] | 60 | 200 | 4532/32 257 | 4675/33 640 | 30.00 |
| | STAT3, OCT4, [CAP, TEF1, YY1, PR] | 60 | 250 | 57/54 549 | 61/56 473 | 31.67 |
| | STAT3, SOX2, STAT6, [SRY, IRF1] | 60 | 250 | 6363/54 549 | 6666/56 473 | 28.33 |
| | STAT3, OCT1, [CAP, FOXM1, YY1, PR] | 60 | 300 | 186/73 378 | 188/74 106 | 33.33 |
| | STAT3, SOX2, STAT6, [XPF1] | 60 | 300 | 8442/73 378 | 8517/74 106 | 28.33 |
| | STAT3, OCT, STAT5A_03, STAT6, [HOXA3, AP2REP, PU1] | 60 | 350 | 6/243 758 | 6/243 979 | 35.00 |
| | STAT3, SOX2, STAT1, STAT4, STAT5A_04, STAT6, [HNF3, YY1] | 60 | 350 | 21 046/243 758 | 21 066/243 979 | 26.67 |
| | STAT3, OCT, STAT5A_03, [AP2REP, PU1, XPF1] | 61 | 400 | 12/308 757 | 12/308 993 | 36.07 |
| | STAT3, SOX2, [HOXA3, AP2REP] | 62 | 400 | 21 375/308 757 | 21 385/308 993 | 29.03 |
In this table, only benchmark CRMs recovered by CPModule are displayed. For reasons of conciseness, we only display for each parameter setting the best ranked version of each of the benchmark CRMs. For instance, whereas Oct4-Sox2 was found to be the best ranked CRM at a proximity threshold of 150, further combinations of Oct4 and Sox2 with other TFs were also detected at this setting of the proximity parameter, albeit at lower ranks. These alternative versions with lower rank are not displayed in the table.
If the PWMs for TFs belonging to the same family are very similar, we also considered as true those CRMs that contained, instead of the TF reported in literature, another member of the same family (48). This was the case for TFs of the STAT and OCT families.
The set of sequences corresponding to the top 100 scoring peak regions of the assayed TF was screened with a set of 517 TRANSFAC motifs using a non-stringent screening threshold. Filtering was applied to all motif sites except those of the assayed TF. ChIP-Seq-assayed TF: TF from which the top 100 binding peaks were used to perform the analysis. CRM: obtained CRMs that correspond to previously well-described CRMs for the assayed TF; other TFs that were predicted to belong to the same CRM, but that have not previously been described to interact with the assayed TF, are given in square brackets. Support: the percentage of sequences from the input set in which this CRM occurs (always higher than the frequency threshold). Proximity threshold (bp): the proximity threshold with which the displayed CRM was found. Query-based Rank/Total: the rank this CRM received in the query-based setting/the number of solutions containing the motif for the ChIP-Seq-assayed TF. Non-query-based Rank/Total: the rank this CRM received among all solutions/the total number of valid CRMs. Validation: we started from the ChIP-Seq data of one TF and used CRM detection to predict with which other TFs the assayed TF interacts. We then verified whether the motif sites contributing to the predicted CRMs fell within the binding peaks of the other ChIP-Seq-assayed TF.
Table 2 reads as follows. For instance, when starting from the ChIP-Seq data of SOX2, we predicted a previously described CRM containing SOX2-OCT4. This retrieved CRM was ranked first amongst the 6318 potential CRMs that contained SOX2 (rank in the query-based mode) and ranked 322 out of the total number of 46 471 possible CRMs in the non-query-based mode. SOX2 and OCT4 co-occurred in 68% of the SOX2 ChIP-Seq identified regions (Support) within a distance of 150 bp, and the identified sites for OCT4 in the predicted CRM fell within the identified OCT4 ChIP-Seq regions in 88.24% of the cases. As an example of how the same CRM can be detected at different proximity thresholds: the CRM containing KLF4-OCT1 was recovered at a proximity constraint of 200, 250, 300 and 350, but with an increasingly lower absolute rank in the non-query-based setting.
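The relation between the two rank columns and the validation column can be made concrete with a short sketch. It assumes that the query-based rank is the position of a CRM within the sub-list of ranked solutions that contain the query motif, and that validation is the share of partner motif sites lying inside the partner's ChIP-Seq peaks; the function names and the toy data are hypothetical.

```python
def query_and_overall_rank(ranked_crms, target, query):
    """ranked_crms: CRMs as frozensets of motif names, best first.
    Returns ((query-based rank, total), (non-query-based rank, total))."""
    overall = ranked_crms.index(target) + 1
    with_query = [crm for crm in ranked_crms if query in crm]
    return ((with_query.index(target) + 1, len(with_query)),
            (overall, len(ranked_crms)))

def validation_percentage(partner_sites, partner_peaks):
    """Percentage of partner motif sites (chrom, pos) that fall inside a
    ChIP-Seq peak (chrom, start, end) of the partner TF."""
    inside = sum(
        any(chrom == p_chrom and start <= pos <= end
            for p_chrom, start, end in partner_peaks)
        for chrom, pos in partner_sites
    )
    return 100.0 * inside / len(partner_sites)

ranked = [
    frozenset({"A", "B"}),
    frozenset({"SOX2", "STAT3"}),
    frozenset({"B", "C"}),
    frozenset({"SOX2", "OCT4"}),
    frozenset({"A", "C"}),
]
ranks = query_and_overall_rank(ranked, frozenset({"SOX2", "OCT4"}), "SOX2")

sites = [("chr1", 100), ("chr1", 5000), ("chr2", 250), ("chr3", 10)]
peaks = [("chr1", 50, 200), ("chr2", 200, 300)]
validated = validation_percentage(sites, peaks)
print(ranks, validated)
```

Restricting the solution list to CRMs that contain the query motif can only improve a CRM's rank, which is why the query-based ranks in Table 2 are never worse than the non-query-based ones.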
With the current screening/filtering, all runs could be performed except those for SOX2 with proximity thresholds of 350 bp and 400 bp. These did not finish within 7 days.
Indirect evidence for the suggested CRMs in Supplementary Table S3
| Suggested CRM | Indirect evidence |
|---|---|
| NANOG-FAC1 | The putative transcriptional regulator FAC1 is expressed in embryonic and extraembryonic tissues of the early mouse conceptus. Study ( |
| OCT4-FOXM1 | Foxm1 has been hypothesized to be one of the candidates to help reprogramming somatic cells into iPSCs (Induced pluripotent stem cells) ( |
| SOX2-CDXA | Binding of homeobox domain from CDX1 protein and SOX2 protein was shown to occur in a system of purified components ( |
| SOX2-BRCA | Roles of BRCA in both homologous recombination and non-homologous end joining DNA repair have been shown ( |
| STAT3-HOXA3 | As HOXA3 is involved in wound repair ( |
| STAT3-GATA1 | GATA1 was known to be one of the major transcription factors that stimulated cardiogenesis during development ( |
| STAT1-STAT3-STAT6 | Binding of human STAT3 protein and human STAT6 protein has been shown in a 2-hybrid assay ( |
Indirect evidence derived from literature searches, giving indications of possible interactions between the indicated TFs found in the predicted CRMs.