Peter Van Loo, Stein Aerts, Bernard Thienpont, Bart De Moor, Yves Moreau, Peter Marynen.
Abstract
We present ModuleMiner, a novel algorithm for computationally detecting cis-regulatory modules (CRMs) in a set of co-expressed genes. ModuleMiner outperforms other methods for CRM detection on benchmark data, and successfully detects CRMs in tissue-specific microarray clusters and in embryonic development gene sets. Interestingly, CRM predictions for differentiated tissues exhibit strong enrichment close to the transcription start site, whereas CRM predictions for embryonic development gene sets are depleted in this region.
Year: 2008 PMID: 18394174 PMCID: PMC2643937 DOI: 10.1186/gb-2008-9-4-r66
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Genome-wide databases of candidate transcription factor binding sites
| Number | Database properties | Number of genes | Number of regions | Number of binding sites |
| 1 | Human-mouse conserved regions, 10 kilobases 5' of TSS | 8,759 | 22,582 | 1,858,800 |
| 2 | (1) + limited to binding sites occurring both in the human and mouse CNS | 8,759 | 22,582 | 878,338 |
| 3 | (2) + correct for possible mouse TSS differences (add 100 kilobases of mouse sequence 5' and 3') | 11,653 | 35,021 | 1,316,927 |
CNS, conserved noncoding sequence; TSS, transcription start site.
Figure 1. Performance of ModuleMiner. Illustrated is the performance of ModuleMiner on a set of smooth muscle marker genes, using the three different sets of candidate transcription factor binding sites (TFBSs). Receiver operating characteristic curves are shown, representing results for leave-one-out cross-validations on the set of smooth muscle markers, (a) using singular transcriptional regulatory models and (b) using transcriptional regulatory global models.
Figure 2. Sensitivity of ModuleMiner's performance to the quality of the input genes. The ratio of true positive genes (containing the smooth muscle cis-regulatory module [CRM]) to negative genes (approximated by random genes) was varied. Each time, a leave-one-out cross-validation was performed, a receiver operating characteristic (ROC) curve was constructed, and the area under the ROC curve (AUC) was calculated. These AUCs were plotted as a function of the ratio negative genes/positive genes. Because an AUC of 50% signifies random ordering of the left-out genes (and hence indicates that no CRMs can be detected), this value was taken as the origin on the y-axis. Blue: the number of positive genes was kept constant at ten, and the number of negative genes was varied. Red: the total number of genes was kept constant at ten, and the ratio negative genes/positive genes was varied.
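The ROC and AUC computation described in these captions can be illustrated with a minimal sketch. It assumes that each left-out gene receives a rank within the whole genome (rank 1 is the best score), that all other genes in the genome are treated as negatives, and that there are no tied ranks. The function names `roc_from_ranks` and `auc` are hypothetical and do not come from ModuleMiner.

```python
def roc_from_ranks(ranks, genome_size):
    """Build a staircase ROC curve from the genome-wide ranks of left-out genes.

    Assumption: rank 1 is the top-scoring gene, ranks are distinct, and every
    gene that is not a left-out gene counts as a negative.
    """
    ranks = sorted(ranks)
    n_pos = len(ranks)
    n_neg = genome_size - n_pos
    fpr, tpr = [0.0], [0.0]
    for i, r in enumerate(ranks, start=1):
        x = (r - i) / n_neg          # negatives ranked above this positive
        fpr.extend([x, x])           # horizontal move, then vertical step
        tpr.extend([(i - 1) / n_pos, i / n_pos])
    fpr.append(1.0)
    tpr.append(1.0)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2
        for k in range(len(fpr) - 1)
    )


if __name__ == "__main__":
    # Hypothetical ranks of five left-out genes in a genome of 10,000 genes.
    fpr, tpr = roc_from_ranks([3, 40, 150, 900, 5200], 10_000)
    print(f"AUC = {auc(fpr, tpr):.3f}")
```

Under these assumptions, an AUC near 0.5 corresponds to left-out genes scattered randomly through the ranking, which is why the caption uses 50% as the baseline.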
Figure 3. Comparison with other CRM detection algorithms. (a-e) Receiver operating characteristic (ROC) curves for the leave-one-out cross-validation using ModuleMiner, ModuleSearcher, CisModule, EMCMODULE, Clover, and random transcriptional regulatory models for each of the five benchmark sets: ORegAnno Erythroid (panel a), liver (panel b), muscle (panel c), ORegAnno Stat1 (panel d) and smooth muscle (panel e). (f-i) ROC curves when using transcription factor binding site (TFBS) preservation (TFBS set 2) in the genome ranking step for all algorithms, on the four benchmark sets that performed above random: liver (panel f), muscle (panel g), ORegAnno Stat1 (panel h), and smooth muscle (panel i). (j) ModuleMiner performance for the three TFBS sets on the muscle benchmark data. CRM, cis-regulatory module.
Figure 4. Application of ModuleMiner to microarray clusters. (a) The two-step procedure used to detect similar cis-regulatory modules (CRMs) in a subset of genes within a given microarray cluster. In the first step, a fivefold cross-validation is performed, and the number of left-out genes considered as target genes is counted. If this number is significantly more than expected under a random distribution of the ranks, then these genes are transferred to the second step. In this second step, ModuleMiner is used to model the similar CRMs regulating the genes in this focused subcluster. (b) Results of the first step of the procedure in panel (a) for the ten microarray clusters and the three different sets of candidate transcription factor binding sites (TFBSs). Significantly higher numbers of target genes among the left-out genes than randomly expected are depicted by an asterisk. Clusters 7 and 10 only contained sufficient genes (≥ 25) in TFBS set 3 and therefore are omitted for the other two sets. (c) Leave-one-out cross-validation results on the subclusters with a significant enrichment of target genes from panel (b). Each left-out gene was ranked using the transcriptional regulatory global model (TRGM) obtained on the remaining genes. Next, sensitivity/specificity pairs were calculated for different detection thresholds, and these were used to construct receiver operating characteristic (ROC) curves. The areas under these ROC curves (AUCs) were calculated and are depicted here. The colors are as in panel (b). (d) Presented is an example of a set of similar CRMs identified by ModuleMiner. These results were obtained on the cardiac muscle genes by the procedure depicted in panel (a). Each horizontal line represents a human-mouse conserved noncoding sequence (CNS) upstream of a gene within the cluster. The different colored boxes represent binding sites of different transcription factors.
Detailed results, including descriptions of the genes shown and the exact positions of the CNSs, are available on our website [26].
Summary of ModuleMiner's results for the ten microarray clusters
| Cluster | Annotation | TFBS set | Number of target genes after cross-validation (P value) | AUC on target genes | Number of target genes in independent test set (P value) | Total number of CRMs |
| 1 | Protein synthesis | 1 | 10/50 (0.025) | 0.96 | 14/123 (0.35) | 30 |
| 2 | Oocyte/fertilized egg | 3 | 10/50 (0.025) | 0.98 | 30/164 (8.6 × 10⁻⁴) | 43 |
| 3 | Neural tissues | 3 | 10/50 (0.025) | 0.84 | 15/122 (0.24) | 29 |
| 4 | Lymphocytes | 3 | 10/50 (0.025) | 0.87 | 23/85 (7.0 × 10⁻⁶) | 36 |
| 5 | Testis/spermatogenesis | - | - | - | - | - |
| 6 | Liver | 3 | 14/50 (2.9 × 10⁻⁴) | 0.93 | 7/29 (0.022) | 23 |
| 7 | Mitochondrion | 3 | 9/31 (0.0026) | 0.87 | - | 12 |
| 8 | Extracellular matrix | 2 | 7/32 (0.036) | 0.92 | - | 10 |
| 9 | Cardiac muscle | 3 | 17/32 (6.6 × 10⁻¹⁰) | 0.95 | - | 16 |
| 10 | Energy metabolism | 3 | 7/26 (0.012) | 0.82 | - | 10 |
Transcription factor binding site (TFBS) sets: set 1 includes human-mouse conserved noncoding sequences (CNSs) 10 kilobases 5' of the transcription start site (TSS); set 2 includes set 1 + binding site preservation; and set 3 includes set 2 + correction for TSS differences. For clusters in which multiple TFBS sets resulted in successful cis-regulatory module (CRM) detection, only the result showing the best cross-validation performance is shown. Genes (in the cluster) that by cross-validation were ranked within the top 10% of the genome were considered target genes of the transcriptional regulatory global model (TRGM). The total number of CRMs constitutes all successful CRM predictions near genes in the cluster. CRM predictions were considered successful if the TRGM score was sufficient to rank the target gene within the top 10% of the genome. In some cases, multiple CRMs are found that control the same target gene.
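The parenthesized values that follow each target gene count can be approximated with a simple calculation. The sketch below assumes a one-sided binomial test in which each left-out gene has a 10% chance of landing in the top 10% of the genome under random ranking. This test is an assumption made for illustration, not a method stated in the text above, and the function name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(k, n, p=0.1):
    """Return P(X >= k) for X ~ Binomial(n, p).

    Assumption: under random ranking, each of the n left-out genes falls in
    the top 10% of the genome independently with probability p = 0.1.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # (target genes, left-out genes) pairs taken from the summary tables.
    for k, n in [(10, 50), (6, 7), (6, 9)]:
        print(f"{k}/{n}: P = {binomial_tail(k, n):.2g}")
```

Under this assumption, 6 target genes out of 7 gives about 6.4 × 10⁻⁶ and 10 out of 50 gives about 0.025, which is consistent with the tabulated values.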
Summary of ModuleMiner's results for the five embryonic development gene sets
| Embryonic development process | TFBS set | Number of target genes after LOOCV (P value) | AUC |
| Primary heart field [50] | 1 | 6/7 (6.4 × 10⁻⁶) | 0.92 |
| Secondary heart field [50] | 1 | 6/9 (6.4 × 10⁻⁵) | 0.79 |
| Neural crest cells [51] | 2 | 6/10 (1.5 × 10⁻⁴) | 0.86 |
| Eye development [52] | 1 | 10/15 (1.9 × 10⁻⁷) | 0.79 |
| Limb development [53] | 1 | 10/24 (5.2 × 10⁻⁵) | 0.77 |
A key review or book used as a basis for construction of the development gene set is given in the first column. The genes in each set as well as the detailed results can be viewed at our website [26]. Transcription factor binding site (TFBS) sets: set 1 includes the human-mouse conserved noncoding sequences (CNSs) 10 kilobases 5' of the transcription start site (TSS); set 2 includes set 1 + binding site preservation; and set 3 includes set 2 + correction for TSS differences. For clusters where multiple TFBS sets resulted in successful cis-regulatory module (CRM) detection, only the result showing the best cross-validation performance is shown. Genes (in the cluster) that by cross-validation were ranked within the top 10% of the genome were considered target genes of the transcriptional regulatory global model. LOOCV, leave-one-out cross-validation.
Transcriptional regulatory global models constructed for the ten microarray clusters
| Cluster | Key transcription factors and binding sites in TRGM (weight) |
| Protein synthesis | NF-Y (1.59), DEC (1.13), HIC1 (1.09), general initiator sequence (0.47), CCAAT box (0.44), TCF-4 (0.32) |
| Oocyte/fertilized egg | T3R (1.00), NF-Y (1.00), ETS/PEA3 (0.99), MAZ (0.92), AP2α (0.78), SP1 (0.30) |
| Neural tissues | UF1-H3β (1.13), CRE-BP/CJUN/ATF-1 (1.00), AP-2 (0.87), ETF (0.55), AP-1/NF-E2 (0.33) |
| Lymphocytes | STAT6 (1.00), PU.1 (0.99), ETS (0.96), STAT5/STAT (0.95), SP1 (0.89) |
| Testis/spermatogenesis | - |
| Liver | TCF1/HNF-1 (1.00), NF-1 (1.00), C/EBP (0.99), HNF-4/COUP (0.99), PPAR/HNF-4/COUP/RAR (0.66), MYC-MAX (0.58), PPAR (0.33) |
| Mitochondrion | c-ETS (1.35), VDR (1.00), GATA-1/GATA-2 (1.00), ZID (0.82), AR (0.43), ROAZ (0.34) |
| Extracellular matrix | AP-1/NF-E2/BACH1 (2.00), FOXD1 (1.00), BLIMP1 (1.00), SRF (0.70), MEF-2/RSRFC4 (0.51), STAT5/STAT6 (0.35) |
| Cardiac muscle | SP-3 (1.00), myogenin (1.00), MEF2A (1.00), SRF (1.00), thyroid hormone receptor/RAR/RXR (0.91), muscle TATA box (0.48) |
| Energy metabolism | CREB/ATF/HLF (1.01), WHN (1.00), SPIB (0.71), PPARγ/RXRα (0.65), general initiator sequence (0.51), RFX (0.31) |
TRGM, transcriptional regulatory global model.
Transcriptional regulatory global models constructed for the five embryonic development sets
| Development process | Key transcription factors and binding sites in TRGM (weight) |
| Primary heart field | D type LTRs (1.12), HAND1/TCF3 (1.01), STAT3 (0.92), STAT5A (0.89), GATA1/GATA2 (0.63), ELK1 (0.32) |
| Secondary heart field | HNF3α (1.56), STAT5A/STAT5B (1.00), GATA2 (0.56), NFAT (0.56), GATA/GATA3 (0.48), WHN (0.35) |
| Neural crest cells | FREAC-7 (1.00), Poly A (1.00), TBX5 (1.00), HSF (0.89), FREAC-2 (0.30) |
| Eye development | RREB1 (1.00), IRF (0.96), POU3F2 (0.92), ZF5 (0.80), GATA/GATA1 (0.46), LMO2 (0.39), NKX6-1 (0.32) |
| Limb development | TEF (1.00), PLZF (1.00), PAX4 (0.96), EGR (0.87), AP-2 (0.65), PBX (0.63), Ikaros 1 (0.37) |
TRGM, transcriptional regulatory global model.
Figure 5. Distribution of distance to transcription start site for CNSs and predicted CRMs. (a) All human-mouse conserved noncoding sequences (CNSs) in transcription factor binding site (TFBS) sets 1 and 2 (both are based on the same set of CNSs) and in TFBS set 3. (b) The distribution from panel (a), when divided into six unequal bins. (c) Distribution of all CNSs upstream of genes within the microarray clusters (of genes expressed in different adult tissues) and the embryonic development gene sets, where CRMs could successfully be detected (Tables 2 and 3), divided into the same six bins as under panel (b). (d) Distribution of the distance to transcription start for the CRMs that ModuleMiner identified near to the genes from panel (c). (e) Distribution of distance to transcription start for the CRMs that ModuleMiner identified in a whole genome scan (genes in panel (d) were removed, such that only new target genes were represented here). Note that panels (b) to (e) are drawn to the same scale. (f) Portion of CNSs near to the genes in the different microarray clusters and embryonic development sets that is located within 200 base pairs (bp) of the transcription start site. (g) Portion of predicted CRMs near to the genes in the different microarray clusters and embryonic development sets that is located within 200 bp of the transcription start site. (h) Portion of CRMs predicted in a whole-genome scan for the transcriptional regulatory global model built for the different gene sets that is located within 200 bp of the transcription start site. The blue line in panels (f) to (h) indicates the portion of all CNSs (within 10 kilobases 5' of all human genes) that is less than 200 base pairs from the transcription start site. CI, Confidence Interval.
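The quantity plotted in panels (f) to (h), the portion of regions within 200 bp of the transcription start site together with a confidence interval, can be sketched as follows. The caption does not say how the interval was computed. The Wilson score interval used here is an assumption chosen for illustration, and the function names and example distances are hypothetical.

```python
import math


def fraction_near_tss(distances, window=200):
    """Count regions whose distance to the TSS is at most `window` base pairs.

    Returns (count within window, total count, fraction).
    """
    k = sum(1 for d in distances if abs(d) <= window)
    n = len(distances)
    return k, n, k / n


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (z = 1.96 for about 95%).

    Assumption: the source does not state its interval method; Wilson is a
    common choice for proportions and is used here only as an example.
    """
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical distances (bp) of predicted CRMs relative to the TSS.
    distances = [-35, -120, -180, -450, -900, -2300, -4100, -7600]
    k, n, frac = fraction_near_tss(distances)
    low, high = wilson_interval(k, n)
    print(f"{k}/{n} within 200 bp: {frac:.2f} (interval {low:.2f} to {high:.2f})")
```

Comparing such an interval for a gene set against the genome-wide baseline is one way to judge whether the set is enriched or depleted near the transcription start site.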