Hani Z Girgis, Ivan Ovcharenko.
Abstract
BACKGROUND: Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled mainly by non-coding elements known as cis-regulatory modules (CRMs) and by epigenetic factors. CRMs modulating related genes share a regulatory signature, which consists of transcription factor (TF) binding sites (TFBSs). Identifying such CRMs is a challenging problem due to the prohibitive number of sequence sets that need to be analyzed.
Year: 2012 PMID: 22313678 PMCID: PMC3359238 DOI: 10.1186/1471-2105-13-25
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Method overview. The diagram illustrates the workflow of the system. During training, the system contrasts sequences from the mixed set with control sequences to identify motif pairs that are enriched in the mixed set. The system identifies and scores sequences that include at least one of the enriched pairs. A Bayesian classifier is trained on the scores to distinguish candidate sequences in the mixed set from candidates in the control set. During validation, the list of pairs and the trained classifier are used to classify sequences in the validation set. The training and the validation are repeated to find the parameters that result in the best performance on the validation set. Finally, CrmMiner is tested on sequences in the testing set.
Figure 2. Performances of CrmMiner and CisModule on controlled data sets. The performance is measured in terms of the true positive count, the false positive count, and the precision (Equation 9). A set name indicates the number of folds of background sequences mixed with the training and validation hypersensitive sites (HSSs). For instance, the training mixed set of "x10" included 740 HSSs located in the vicinity of 75 genes specific to the human CD4+ T cells. In addition to the 740 HSSs, this training mixed set contained 750 (10 × 75) background sequences. All testing mixed sets consisted of 443 HSSs, i.e., they were not mixed with background sequences. All testing control sets were composed of 1125 random genomic sequences. The exact number of CRMs among the HSSs is unknown; however, HSSs are enriched with CRMs.
Mining for heart-specific CRMs - input data
| n | CNEs/Locus | CNEs | Controls |
|---|---|---|---|
| 0 | 9.6 | 1480 | 7400 |
| 10 | 10.8 | 1667 | 8335 |
| 20 | 13.1 | 2019 | 10095 |
| 30 | 15.3 | 2358 | 11790 |
| 40 | 17.5 | 2692 | 13460 |
| 60 | 20.9 | 3219 | 16095 |
| 80 | 23.7 | 3644 | 18220 |
| 100 | 25.7 | 3959 | 19795 |
| All | 31.9 | 4911 | 24555 |
The mixed sets consisted of CNEs found in the vicinity of 154 heart-specific genes. Nine sets were constructed. All mixed sets contained CNEs within 10 kbp upstream and 10 kbp downstream of the transcription start sites (TSSs). We expanded the mixed sets by including the n CNEs closest to the TSSs, if they were not already included, for n = 0, 10, 20, 30, 40, 60, 80, 100, ∞. Each control set included five times as many sequences as its mixed set counterpart. We controlled for sequence length while assembling the control sets.
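The selection rule in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each CNE is reduced to a single coordinate, distance is measured from that coordinate to the TSS, and the names `select_cnes` and `control_set_size` are hypothetical rather than taken from CrmMiner.

```python
from typing import List, Optional


def select_cnes(cne_positions: List[int], tss: int,
                window: int = 10_000,
                n_closest: Optional[int] = 0) -> List[int]:
    """Return the CNE coordinates that enter the mixed set for one locus.

    A CNE is kept if it lies within `window` bp of the TSS, or if it is
    among the `n_closest` CNEs nearest to the TSS. Passing None for
    `n_closest` keeps every CNE (the "All" row of the table).
    Assumption: one coordinate per CNE, distance taken to the TSS.
    """
    if n_closest is None:
        return sorted(cne_positions)
    by_distance = sorted(cne_positions, key=lambda p: abs(p - tss))
    in_window = {p for p in cne_positions if abs(p - tss) <= window}
    nearest = set(by_distance[:n_closest])
    return sorted(in_window | nearest)


def control_set_size(mixed_set_size: int, ratio: int = 5) -> int:
    """Control sets hold `ratio` times as many sequences as the mixed set."""
    return ratio * mixed_set_size


locus = [100, 5_000, 20_000, 50_000, -30_000]
print(select_cnes(locus, tss=0, n_closest=0))
print(select_cnes(locus, tss=0, n_closest=3))
print(control_set_size(1480))
```

The five-to-one ratio matches the Controls column of the table (for example, 1480 CNEs and 7400 controls). Length matching of the control sequences is not modeled here.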
CrmMiner's performance on the 154 heart-specific genes
| n | CNEs | Controls | PCRMs | Controls(+) | E-value | P-value |
|---|---|---|---|---|---|---|
| 0 | 375 | 1875 | 41 | 30 | 6.8 | 1.4 |
| 10 | 492 | 2460 | 46 | 55 | 4.2 | 4.3 |
| 20 | 537 | 2685 | 19 | 8 | 11.2 | 7.3 |
| 30 | 680 | 3400 | 50 | 47 | 5.3 | 1.3 |
| 40 | 602 | 3010 | 41 | 45 | 4.6 | 1.3 |
| 60 | 768 | 3840 | 40 | 58 | 3.5 | 8.4 |
| 80 | 656 | 3280 | 38 | 43 | 4.5 | 1.4 |
| 100 | 1107 | 5535 | 32 | 32 | 5.0 | 6.5 |
| All | 892 | 4460 | 14 | 15 | 4.7 | 7.6 |
CrmMiner was evaluated on nine testing sets that were composed of CNEs found in 39 heart-specific gene loci. PCRMs stands for putative CRMs. Controls(+) stands for CRMs predicted among the control sequences. All mixed sets included CNEs in the region between -10 kbp and +10 kbp. Mixed sets were expanded as before. The number of CNEs in the mixed set did not always increase when the number of the closest CNEs, n, was increased because we randomly partitioned the loci into three sets every time n was increased. Fisher's exact test was used to calculate the P-values.
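The two statistics in this table can be illustrated with a short, standard-library Python sketch. Equation 1 is not reproduced in this excerpt, so the enrichment below is an assumption: it takes the E-value to be the fraction of mixed-set sequences predicted as CRMs divided by the same fraction among controls, which gives about 6.8 for the n = 0 row and 5.0 for the n = 100 row. The one-tailed form of Fisher's exact test is likewise assumed, and the function names are hypothetical.

```python
from math import comb


def enrichment(pos_hits: int, pos_total: int,
               ctrl_hits: int, ctrl_total: int) -> float:
    """Assumed E-value: ratio of the predicted fractions (mixed / control)."""
    return (pos_hits / pos_total) / (ctrl_hits / ctrl_total)


def fisher_one_tailed(pos_hits: int, pos_total: int,
                      ctrl_hits: int, ctrl_total: int) -> float:
    """P(X >= pos_hits) under the hypergeometric null of Fisher's exact test.

    The margins are fixed: `pos_total` mixed-set sequences, `ctrl_total`
    control sequences, and `pos_hits + ctrl_hits` predictions in total.
    """
    total = pos_total + ctrl_total
    hits = pos_hits + ctrl_hits
    denominator = comb(total, hits)
    upper = min(hits, pos_total)
    numerator = sum(comb(pos_total, k) * comb(ctrl_total, hits - k)
                    for k in range(pos_hits, upper + 1))
    return numerator / denominator


# Counts from the n = 100 row: 32 of 1107 CNEs and 32 of 5535 controls.
print(enrichment(32, 1107, 32, 5535))
print(fisher_one_tailed(32, 1107, 32, 5535))
```

Exact integer arithmetic via `math.comb` avoids floating-point overflow in the binomial coefficients; only the final division produces a float.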
Comparison between CrmMiner predictions and putative heart enhancers (PHEs)
| n | CNEs | PCRMs | Expected | Enrichment (fold) | Z-scores |
|---|---|---|---|---|---|
| 0 | 91 | 50 | 13.1 | 3.8 | 11.4 |
| 10 | 96 | 56 | 13.9 | 4.0 | 12.6 |
| 20 | 108 | 58 | 6.4 | 7.5 | 21.7 |
| 30 | 118 | 69 | 9.9 | 7.0 | 20.1 |
| 40 | 129 | 64 | 11.1 | 5.8 | 17.1 |
| 60 | 150 | 63 | 10.4 | 6.1 | 17.3 |
| 80 | 164 | 79 | 10.6 | 7.5 | 22.2 |
| 100 | 177 | 48 | 6.7 | 7.2 | 16.7 |
| All | 205 | 28 | 5.9 | 4.7 | 9.4 |
The PHEs were obtained by a supervised-learning method [9]. Columns CNEs and PCRMs display the number of overlaps between the CNEs and PHEs, and the PCRMs (putative CRMs) and PHEs, respectively. The expected number of overlaps between the CNEs and the PHEs was calculated experimentally in 10000 trials. In each trial, a set was randomly selected from the input CNEs. These CNEs are located in the 154 heart-specific gene loci and comprise the mixed sets that CrmMiner analysed. The number of CNEs in a random set is the same as the number of the PCRMs. Then, the overlaps between a random set and the PHEs were counted. The expected number is the average overlap in the 10000 trials. The Z-scores were based on the distribution of the overlaps in the 10000 trials. The P-values associated with the nine Z-scores are 0.
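The random-sampling procedure described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: CNEs are represented by arbitrary identifiers, the set of CNEs that overlap a PHE is assumed to be precomputed, the population standard deviation is an assumed choice, and `overlap_z_score` is a hypothetical name.

```python
import random
from statistics import mean, pstdev
from typing import Hashable, Iterable, Sequence, Tuple


def overlap_z_score(cne_ids: Sequence[Hashable],
                    overlapping_ids: Iterable[Hashable],
                    n_predicted: int,
                    observed: int,
                    trials: int = 10_000,
                    seed: int = 0) -> Tuple[float, float]:
    """Estimate the expected overlap and the Z-score of the observed one.

    Each trial draws `n_predicted` CNEs at random (without replacement)
    from the input CNEs and counts how many of them overlap an enhancer.
    """
    rng = random.Random(seed)
    overlapping = set(overlapping_ids)
    counts = []
    for _ in range(trials):
        sample = rng.sample(cne_ids, n_predicted)
        counts.append(sum(1 for cne in sample if cne in overlapping))
    expected = mean(counts)
    sd = pstdev(counts)
    if sd == 0:
        # Degenerate case: every trial gave the same overlap count.
        z = 0.0 if observed == expected else float("inf")
    else:
        z = (observed - expected) / sd
    return expected, z


# Toy data: 1000 CNEs, 100 of which overlap an enhancer; 50 predictions.
expected, z = overlap_z_score(list(range(1000)), range(100),
                              n_predicted=50, observed=20, trials=2000)
print(expected, z)
```

Dividing the observed overlap by `expected` gives the fold enrichment reported in the table.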
Comparison between predicted CRMs and experimentally validated heart enhancers
| n | CNEs | PCRMs | Sensitivity | Expected | P-value |
|---|---|---|---|---|---|
| 0 | 12 | 10 | 83% | 1.7 | 1.1 |
| 10 | 13 | 9 | 69% | 1.9 | 8.0 |
| 20 | 15 | 10 | 67% | 0.9 | 0.0 |
| 30 | 15 | 11 | 73% | 1.3 | 0.0 |
| 40 | 15 | 8 | 53% | 1.3 | 1.9 |
| 60 | 15 | 9 | 60% | 1.0 | 4.4 |
| 80 | 15 | 11 | 73% | 1.0 | 0.0 |
| 100 | 16 | 9 | 56% | 0.6 | 0.0 |
| All | 16 | 5 | 31% | 0.5 | 1.0 |
We calculated the overlaps of the CNEs found in the 154 heart-specific loci and the putative CRMs (PCRMs) with 95 experimentally validated heart enhancers [9]. Columns CNEs and PCRMs display the number of overlaps between the CNEs and heart enhancers, and the PCRMs and heart enhancers, respectively. The expected number is the average overlap between a random CNE set and the heart enhancers. We selected 10000 random sets as before. The P-values were based on the Z-scores calculated from the distribution of the overlaps in the 10000 trials.
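The caption does not spell out how the Sensitivity column is derived. A reading that is consistent with the table, offered here as an assumption, is the number of validated enhancers overlapped by PCRMs divided by the number overlapped by the input CNEs. The short Python check below uses the counts from the table; `sensitivity` is a hypothetical helper name.

```python
def sensitivity(pcrm_overlaps: int, cne_overlaps: int) -> float:
    """Fraction of enhancer-overlapping CNEs that are recovered as PCRMs."""
    return pcrm_overlaps / cne_overlaps


# (n, CNE overlaps, PCRM overlaps) copied from the table above.
rows = [("0", 12, 10), ("10", 13, 9), ("20", 15, 10), ("30", 15, 11),
        ("40", 15, 8), ("60", 15, 9), ("80", 15, 11), ("100", 16, 9),
        ("All", 16, 5)]

for n, cne_overlaps, pcrm_overlaps in rows:
    percent = round(100 * sensitivity(pcrm_overlaps, cne_overlaps))
    print(f"n={n}: {percent}%")
```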
CrmMiner performance on 57 human tissues and cell types
| Tissue | E-value | P-value | Pairs | The Three Most Enriched Pairs |
|---|---|---|---|---|
| subthalamic nucleus | 6.72 | 2.1 | 78 | E2F1-TFIII E2F1DP2-ZBED6 GAF-NFMUE1 |
| prefrontal cortex | 6.60 | 5.7 | 58 | SP1-PITX3 PAX4-BEN EFC-E2A |
| cingulate cortex | 6.44 | 4.3 | 37 | MAZR-NFAT2 PAX9-IK3 HEB-AHRARNT |
| heart | 5.52 | 4.4 | 89 | PAX9-PXRRXR RSRFC4-E47 CETS1P54-NRSE |
| caudate nucleus | 5.27 | 5.6 | 20 | SP1SP3-SP1 SP1-ETF MZF1-TAXCREB |
| prostate | 5.03 | 4.2 | 79 | E2F-UF1H3BETA NFY-AP2GAMMA MSX3-SP3 |
| amygdala | 4.61 | 7.3 | 34 | ZFP281-RFX AHR-CKROX ETF-SP4 |
| BM-CD71+ early erythroid | 4.56 | 1.5 | 90 | PAX4-NRF1 NFY-KROX MOVOB-NFAT2 |
| adrenal gland | 4.40 | 5.7 | 71 | STAT6-ZF5 SP1-GATA3 NKX11-SP4 |
| fetal brain | 4.26 | 5.2 | 90 | VDR-E2F1 OBOX2-PPARA BEN-ETF |
| bone marrow | 4.16 | 1.8 | 68 | CDXA-SP1 NFKB-E2F1 AR-PR |
| BM CD34+ | 4.15 | 6.0 | 130 | NRSE-E2F1 E2F1-SP1 E2F-SP1 |
| fetal thyroid | 4.13 | 5.3 | 108 | E2F-SP4 NFY-ZF5 STAT3-MEF2 |
| bronchial epithelial cells | 4.09 | 7.4 | 85 | HOXD3-CEBP KLF15-CEBP NKX25-HEN1 |
| occipital lobe | 4.09 | 6.4 | 38 | EGR2-DEAF1 AHR-CACCCBINDINGFACTOR E2F1-SREBP |
| whole brain | 4.05 | 3.5 | 21 | MAZ-ETF SP1SP3-GKLF MZF1-CTCF |
| PB-BDCA4+ dendritic cells | 4.04 | 2.3 | 98 | ATF1-ETF AP1-PEBP AP2-PPARA 1 |
| placenta | 3.82 | 1.6 | 96 | COMP1-GKLF GKLF-NFY CTCF-AP1 |
| hypothalamus | 3.82 | 5.3 | 80 | GADP-ERR1 AP1-FOXO1 OCT-PAX5 |
| liver | 3.73 | 5.7 | 97 | COUP-SZF11 COUPTF-SRF GATA4-GFI1 |
| thyroid | 3.65 | 8.6 | 93 | MTATA-OLF1 MUSCLE-CHX10 LBX2-SP4 |
| thymus | 3.65 | 5.0 | 93 | IK1-KLF15 MINI19-ALPHACP1 IK1-GKLF |
| spinal cord | 3.64 | 1.5 | 68 | IRF-LRF REX1-GLI3 ALPHACP1-PPARA |
| uterus | 3.58 | 2.9 | 85 | ELK1-UF1H3BETA TAL1-E2F1 AP2ALPHA-PITX1 |
| pons | 3.57 | 6.1 | 108 | IPF1-CART1 NMYC-CNOT3 PAX4-MYOGNF1 |
| trachea | 3.48 | 1.1 | 110 | TRF1-TFIII DEC-CP2 COUPTF-STAT4 |
| lung | 3.47 | 3.0 | 36 | AP2-TR4 MAZ-CBF SP1SP3-AP1 |
| tonsil | 3.42 | 0.0019 | 60 | HEN1-NFKAPPAB65 PAX4-SMAD E2F6-NRSF |
| PB-CD19+ Bcells | 3.40 | 9.5 | 61 | ESE1-PAX6 PAX4-TBX5 MZF1-NRF2 |
| 721 B lymphoblasts | 3.35 | 1.8 | 104 | SP1SP3-E2F4DP1 STAT4-NRF2 ZIC1-EGR1 |
| testis | 3.30 | 4.8 | 73 | AP2-ATF T3R-HNF4 MYOD-SRF |
| uterus corpus | 3.27 | 1.3 | 91 | TCF4-POU6F1 CAAT-PITX3 PAX4-XVENT1 |
| tongue | 3.24 | 3.8 | 72 | ZFP281-NFY SP3-PITX3 AP2-IK2 |
| temporal lobe | 3.21 | 3.7 | 95 | TBX5-MTF1 MTATA-HIC1 ERG-ZIC3 |
| PB-CD56+ NKCells | 3.12 | 3.1 | 53 | NFY-SP1 CMYB-STAT6 SRF-SP4 |
| smooth muscle | 3.11 | 2.9 | 87 | KROX-MEF2 SP1-XBP1 HIF1-IK |
| pituitary gland | 3.06 | 2.2 | 96 | MTATA-MAZR ZF5-WT1 AHR-ETF |
| adrenal cortex | 3.04 | 1.5 | 111 | MYCMAX-PAX4 SP2-E2F1 VJUN-ZNF219 |
| BM-CD105+ endothelial | 2.95 | 0.0049 | 79 | E2F-SMAD4 AP4-PAX4 P53-NFY |
| lymph node | 2.94 | 2.6 | 131 | CETS1P54-GABP HIF1-NRSE PEA3-MTF1 |
| adipocyte | 2.91 | 1.3 | 79 | AP2-NFY E2F1-LXR CNOT3-IK2 |
| skeletal muscle | 2.89 | 7.6 | 80 | P53-CACD MEIS2-STAT1 MYOD-XVENT1 |
| thalamus | 2.86 | 0.0065 | 42 | PLAG1-PAX3 AP1-CMAF MZF1-GATA2 |
| pancreatic islets | 2.79 | 6.7 | 97 | CDXA-PROP1 MEIS1-ZBED6 NFKB-E2F |
| skin | 2.79 | 7.8 | 73 | CACBINDINGPROTEIN-XVENT1 NFKB-ZTA ER-SREBP1 |
| medulla oblongata | 2.78 | 7.9 | 103 | ZNF515-NKX25 CETS1P54-NKX26 P53-CMYB |
| whole blood | 2.77 | 5.8 | 53 | PU1-SP2 GFI1-CTCF AP2-GR |
| cardiac myocytes | 2.69 | 9.1 | 120 | EGR-MECP2 SRY-CNOT3 E2F-EGR3 |
| fetal lung | 2.62 | 1.7 | 55 | HIF1-LMAF CTCF-MECP2 AML2-CP2 |
| PB-CD4+ Tcells | 2.58 | 0.0026 | 121 | NGFIC-EGR3 ZFP206-CBF AHRARNT-COUP |
| cerebellum peduncles | 2.52 | 1.2 | 74 | ZNF219-NFY KLF15-OBOX2 NKX11-ZNF219 |
| olfactory bulb | 2.31 | 4.3 | 70 | NFY-SP1 NFKAPPAB-CTCF CP2-GLI |
| cerebellum | 2.30 | 0.0034 | 58 | AP2-LBX2 COUPTF-E2F E2F-SREBP |
| kidney | 2.29 | 4.08 | 87 | GCM-ARP1 E2A-RORBETA NFY-TFIII |
| BM-CD33+ myeloid | 1.80 | 0.0489 | 92 | SRF-HMGIY PR-STAT6 T3R-AP1 |
| parietal lobe | 1.78 | 0.0443 | 82 | AP2ALPHA-E2F1 HIF1-GABP BEN-MYF |
| appendix | 1.75 | 0.0285 | 88 | AP1-LXR PAX2-PAX4 CNOT3-CREB |
We trained, validated and tested CrmMiner on 72 human tissues and cell types [36]. The mixed sets consisted of CNEs in the vicinity of 160 tissue-specific genes. Those CNEs were located within 10 kbp upstream or downstream of the TSSs, or were among the closest 20 CNEs to the TSSs. Here, we report CrmMiner performance on the testing sets. The number of motif pairs comprising each tissue-specific signature is listed under the Pairs column. The P-values were obtained by one-tailed Fisher's exact test.
The predicted regulatory signature of the developing human heart
| Pair | E-value | Pair | E-value | Pair | E-value |
|---|---|---|---|---|---|
| ZF5 & SZF11 | 12.5 | E2F & HIC1 | 12.5 | IRF1 & HIC1 | 12.5 |
| SOX2 & ZFX | 11.2 | GATA1 & ZFX | 12.5 | TGIF & COREBINDINGFACTOR | 11.2 |
| AP2 & GATA | 11.2 | LTF & E2F1 | 12.5 | FOXJ2 & CREL | 11.2 |
| MSX3 & CKROX | 11.2 | AIRE & LMX1 | 11.2 | EGR & NFY | 11.2 |
| DAX1 & LXR | 11.2 | E2F & MYOD | 12.5 | MEF2 & AP4 | 11.2 |
| CETS1P54 & DEAF1 | 11.2 | FPM315 & LMO2COM | 12.0 | ZFP206 & LUN1 | 11.2 |
| MAZ & BARX2 | 11.2 | BACH2 & COUP | 11.2 | ZF5 & CAAT | 11.2 |
| ATF1 & ZFX | 11.2 | ALPHACP1 & RNF96 | 11.2 | LHX8 & IK | 11.2 |
| SP1SP3 & RFX1 | 11.0 | SP1 & MSX1 | 11.2 | PEA3 & SMAD4 | 11.0 |
| MOVOB & ELK1 | 11.0 | ZNF219 & GATA2 | 10.8 | ETS & GABP | 10.7 |
| STAT1 & RNF96 | 10.0 | PXRRXR & SP1 | 10.0 | XPF1 & EGR1 | 10.0 |
| POU6F1 & AP2GAMMA | 10.0 | GATA6 & KROX | 10.0 | AP2ALPHA & LMX1 | 10.0 |
| RXRG & RNF96 | 10.0 | GATA1 & AP2 | 10.0 | NFY & CNOT3 | 10.0 |
| ZIC2 & CMYB | 10.0 | HIC1 & FOX | 10.0 | FPM315 & ATF | 10.0 |
| EBF & CNOT3 | 10.0 | LMX1 & NKX12 | 10.0 | LEF1 & WT1 | 10.0 |
| LIM1 & LHX61 | 10.0 | SP4 & GATA | 10.0 | NFKB & PAX4 | 10.0 |
| AR & PAX4 | 10.0 | AP2 & AHRARNT | 10.0 | CNOT3 & LUN1 | 10.0 |
| ZNF515 & BEN | 10.0 | NF1 & CAAT | 10.0 | CNOT3 & ARNT | 10.0 |
| AREB6 & RNF96 | 10.0 | AP2ALPHA & NMYC | 10.0 | NFKB & NGFIC | 10.0 |
| MAZR & LMO2COM | 10.0 | TAL1BETAITF2 & ZIC1 | 10.0 | MYOGNF1 & PAX5 | 10.0 |
| HEB & EN2 | 10.0 | MINI19 & LIM1 | 10.0 | CPHX & VDR | 10.0 |
| SPZ1 & GATA3 | 10.0 | EBF & E2F1 | 10.0 | MEF2 & E2A | 10.0 |
| GLI3 & FOXO4 | 10.0 | NFKB & MEIS1AHOXA9 | 10.0 | MSX2 & UF1H3BETA | 10.0 |
| ETF & TBX22 | 10.0 | ZBED6 & MEF2 | 10.0 | CNOT3 & PXRRXR | 10.0 |
| KROX & NFY | 10.0 | SP2 & GATA2 | 10.0 | MYOD & MEF2 | 10.0 |
| ROAZ & E2F1 | 10.0 | GATA6 & UF1H3BETA | 10.0 | SP1 & GATA | 10.0 |
| AML2 & KLF15 | 10.0 | ZF5 & TTF1 | 10.0 | MZF1 & MIF1 | 10.0 |
| MECP2 & GC | 10.0 | NFY & WT1 | 9.1 | AREB6 & WT1 | 9.3 |
| ZIC1 & E2F1 | 9.3 | HIC1 & FOX | 9.1 | ZNF219 & GTF2IRD1 | 9.1 |
| EGR & GATA2 | 9.1 | CNOT3 & PU1 | 9.2 | MINI19 & AREB6 | 9.1 |
| GATA & SP1 | 9.0 | MUSCLE & CREBATF | 9.0 | STAT3 & E2F1 | 9.0 |
| HIC1 & HFH4 | 9.0 | MUSCLE & MEF2 | 9.0 | TBX15 & SP3 | 9.0 |
| CTCF & CPHX | 9.0 | FOXO3 & PAX5 | 9.0 | PAX9 & PAX4 | 8.7 |
| NFY & SP1 | 8.7 | EGR & GATA2 | 8.6 | SPZ1 & ZF5 | 8.4 |
| LBP1 & POLY | 8.3 | PEA3 & MINI19 | 8.3 | CNOT3 & AHRARNT | 8.3 |
| SP1 & SOX10 | 8.3 | BARHL1 & CKROX | 8.3 | AP2 & ATF1 | 8.3 |
| AP2 & MEF2 | 8.3 | EGR & DEAF1 | 8.2 | COUPTF & ZF5 | 8.2 |
| SP1 & NFY | 8.1 | AMEF2 & TBX15 | 8.0 | TAL1BETAITF2 & SP2 | 8.0 |
| PXRRXR & HIC1 | 8.0 | AP2 & E2F1 | 8.0 | TGIF & MRG2 | 8.0 |
| RXRG & ZBED6 | 8.0 | E2A & SRF | 8.0 | SP3 & ELK1 | 8.0 |
| CEBP & CEBP | 8.0 | TATA & ERR1 | 8.0 | E2F1 & SP1 | 8.0 |
| AML2 & GC | 8.0 | XPF1 & SREBP | 8.0 | EGR1 & SMAD | 8.0 |
| ZIC1 & PPARG | 8.0 | DBP & E47 | 8.0 | KROX & HMEF2 | 8.0 |
| CTCF & GATA3 | 8.0 | EAR2 & ETF | 8.0 | SP1 & PXRRXR | 8.0 |
CrmMiner predicted 132 motif pairs that comprise the regulatory signature of the developing human heart. The score of a sequence is the sum of the enrichment values (E-value, Equation 1) of those of the 132 pairs that are present in the sequence, provided that the two motifs of a pair meet the distance requirement (≤ 100 bp). A sequence that has a score equal to or above 23.8 is predicted to be a CRM specific to the developing heart. This threshold of 23.8 was determined at the training and parameter-optimization stages. The unit of the E-value is fold.
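The scoring rule described in this caption can be sketched in Python. This is a minimal illustration under several assumptions that the caption does not settle: motif hits are given as (name, position) tuples, distance is measured between hit positions, each pair contributes its E-value at most once per sequence, and a pair of identical motifs (such as CEBP & CEBP) requires two distinct hits. The names `pair_present`, `score_sequence`, and `is_heart_crm` are hypothetical.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]          # (motif name, position in bp)
Pair = Tuple[str, str]          # unordered motif pair


def pair_present(sites: List[Site], a: str, b: str,
                 max_distance: int = 100) -> bool:
    """True if motifs a and b occur at two distinct sites within range."""
    for i, (m1, p1) in enumerate(sites):
        for j, (m2, p2) in enumerate(sites):
            if i != j and m1 == a and m2 == b and abs(p1 - p2) <= max_distance:
                return True
    return False


def score_sequence(sites: List[Site], pair_evalues: Dict[Pair, float],
                   max_distance: int = 100) -> float:
    """Sum the E-values of all signature pairs present in the sequence."""
    return sum(e_value for (a, b), e_value in pair_evalues.items()
               if pair_present(sites, a, b, max_distance))


def is_heart_crm(score: float, threshold: float = 23.8) -> bool:
    """Apply the reported threshold (score >= 23.8)."""
    return score >= threshold


# Three pairs taken from the table above; the motif hits are made up.
signature = {("ZF5", "SZF11"): 12.5, ("E2F", "HIC1"): 12.5,
             ("CEBP", "CEBP"): 8.0}
hits = [("ZF5", 10), ("SZF11", 60), ("E2F", 250), ("HIC1", 300)]
score = score_sequence(hits, signature)
print(score, is_heart_crm(score))  # 25.0 True
```

With the toy hits above, two pairs qualify and the score of 25.0 exceeds the 23.8 threshold, so the sequence would be called a putative CRM.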