Amar Drawid, Nupur Gupta, Vijayalakshmi H Nagaraj, Céline Gélinas, Anirvan M Sengupta.
Abstract
BACKGROUND: DNA sequence binding motifs for several important transcription factors happen to be self-overlapping. Many of the current regulatory site identification methods do not explicitly take into account the overlapping sites. Moreover, most methods use arbitrary thresholds and fail to provide a biophysical interpretation of statistical quantities. In addition, commonly used approaches do not include the location of a site with respect to the transcription start site (TSS) in an integrated probabilistic framework while identifying sites. Ignoring these features can lead to inaccurate predictions as well as incorrect design and interpretation of experimental results.
Year: 2009 PMID: 19583839 PMCID: PMC2718928 DOI: 10.1186/1471-2105-10-208
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overlapping κB sites. A. Sequence logo of the κB site, based on the initial motif profile used in HMM training. The overall height of the nucleotide stack at each position is proportional to the information content at that position, and the height of each nucleotide within the stack is proportional to its frequency. B. Four overlapping κB sites are present on the two strands in three adjacent 10-base-pair sequence windows.
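The logo construction described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming a uniform background (so that the maximum information content is 2 bits); the example columns are hypothetical and are not taken from the paper's motif profile.

```python
import math

def information_content(column):
    """Information content (in bits) of one motif column.

    `column` maps each nucleotide to its frequency (frequencies sum to 1).
    Assumption: uniform background, so the maximum is log2(4) = 2 bits.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

def logo_heights(column):
    """Height of each letter in the stack: its frequency times the column's information."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}

# Hypothetical columns for illustration only.
conserved = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(information_content(conserved), information_content(uniform))
```

A fully conserved position yields the tallest stack, and a position with equal frequencies yields a stack of zero height.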
Figure 2. HMM hidden states for κB sites. The OHMM consists of 21 states. The background state is colored red and designated B. Each of the 20 motif states corresponds to one of the ten positions within the κB motif on one of the two DNA strands. The motif states are colored yellow and designated by M, the position within the motif, and the strand. The emission probabilities of the motif states on the two strands are flipped from 5' to 3' so as to represent identical binding irrespective of the motif strand. Non-zero transition probabilities between states are represented by black arrows, and their values are shown.
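The state layout in the Figure 2 caption can be sketched as below. The caption does not list the transition values (they appear only in the figure), so the numbers here are assumptions: the background state enters the first motif state on either strand with probability z, stays in the background with probability 1 - 2z, motif states advance deterministically, and the last motif state returns to the background. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
MOTIF_LEN = 10

def reverse_complement_profile(profile):
    """Emissions for the opposite strand: reverse the column order and complement each base."""
    return [{COMPLEMENT[b]: p for b, p in col.items()} for col in reversed(profile)]

def build_states():
    """State 0 is the background B, followed by M1+..M10+ and then M1-..M10-."""
    names = ["B"]
    names += [f"M{i}+" for i in range(1, MOTIF_LEN + 1)]
    names += [f"M{i}-" for i in range(1, MOTIF_LEN + 1)]
    return names

def build_transitions(z):
    """21 x 21 transition matrix under the ASSUMED values described in the lead-in."""
    n = 1 + 2 * MOTIF_LEN
    t = [[0.0] * n for _ in range(n)]
    t[0][0] = 1.0 - 2.0 * z                      # background stays background
    for start in (1, 1 + MOTIF_LEN):             # first motif state on each strand
        t[0][start] = z                          # background enters the motif
        for k in range(MOTIF_LEN - 1):
            t[start + k][start + k + 1] = 1.0    # advance through the motif
        t[start + MOTIF_LEN - 1][0] = 1.0        # last motif position returns to background
    return t

print(len(build_states()))
```

Sharing one profile between the strands through `reverse_complement_profile` is what makes binding identical irrespective of the strand on which the motif lies.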
Figure 3. Trained HMM parameters. A. Sequence logo of the motif profile of the HMM trained on 50 bp sequences each consisting of a known κB site and surrounding region (surround-50 HMM) with initial transition probability to the motif (z) equal to 0.02. B. The estimated transition probability to the motif (z) for upstream 800 bp and downstream 100 bp regions with respect to the transcription start site (TSS) as the number of randomly selected training genes increases. The estimated z stabilizes after the addition of a few thousand genes. Each training set contains all sequences with known κB sites in the relevant region (20 and 4 known sites for the upstream 800 bp and downstream 100 bp regions, respectively).
Figure 4. Trained z*. HMMs were trained on TSS-n promoters keeping the initial motif profile fixed. The transition probability to the motif (z) is inversely proportional to the training promoters' length in the range 500–3000 bp, and hence z* is constant at around 0.9. This quantity drops slightly between 500 and 200 bp and then substantially below 200 bp owing to the lack of κB sites in the shorter training promoters.
Figure 5. ROC analysis shows that OHMM performs better than a weight matrix. The performances of the HMM and the weight matrix (WM) are represented by the green and the blue curves, respectively. Whereas the HMM and the WM perform similarly for strong sites, the HMM is more accurate in identifying weak sites. The positive examples consist of the 36 known human κB sites present in upstream 800 bp regions (in their native promoters), and the negative examples consist of all 10-mers in the upstream 800 bp regions in 100 randomly selected human genes as described in the text. Leave-one-out cross-validation was performed. ROC: Receiver Operating Characteristic curve.
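For readers unfamiliar with the comparison in Figure 5, a ROC curve can be computed from the scores of positive and negative examples roughly as follows. This is a generic sketch, not the authors' pipeline; the scores are made up, and the leave-one-out step is omitted.

```python
def roc_points(pos_scores, neg_scores):
    """(FPR, TPR) pairs obtained by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical scores: known sites (positives) and random 10-mers (negatives).
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2, 0.1]
print(auc(roc_points(positives, negatives)))
```

A method that ranks weak true sites above background 10-mers gains area in the low false-positive region of the curve, which is where the caption reports the HMM outperforming the weight matrix.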
Figure 6. κB sites with greater HMM occupancy probability are better conserved. Each curve represents the kernel-smoothing density estimate of the evolutionary conservation scores of a set of κB sites. Each set consists of κB sites predicted by OHMM to have occupancy probability above a threshold shown in the legend. The "random" set consists of 1000 10-tuples randomly selected from the human promoters. Conservation scores of κB sites predicted by OHMM are higher than those of the random sequences. Moreover, κB sites with higher HMM occupancy probability have higher conservation scores. Conservation scores and kernel-smoothing density estimates were calculated as described in the Methods section.
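A kernel-smoothing density estimate of the kind plotted in Figure 6 can be sketched as below. The kernel and bandwidth used by the authors are given in their Methods section and are not reproduced here; this sketch assumes a Gaussian kernel with a hand-picked bandwidth, and the conservation scores are invented.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples`.

    Assumption: Gaussian kernel with a fixed, user-supplied bandwidth.
    """
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Hypothetical conservation scores for one set of predicted sites.
scores = [0.2, 0.55, 0.6, 0.7, 0.9]
density = gaussian_kde(scores, bandwidth=0.1)
print(density(0.6) > density(0.0))
```

Evaluating one such density per occupancy threshold, on a common grid of conservation scores, gives curves that can be compared in the way the caption describes.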
Figure 7. Biological significance of predicted target gene sets using pathway analysis. Biological significance is assessed through the pathways enriched in the κB target gene sets predicted by the HMM at various thresholds. The y-axis shows the sum of the negative logarithms of the p-values of the top 25 enriched pathways. Gene sets predicted by the HMM are biologically significant compared with randomly selected gene sets. They show a peak at the threshold occupancy probability of 0.5 (~800 genes). The thresholds used for obtaining the gene sets for the pathway analysis (occupancy probability threshold between 0.05 and 0.7) are indicated. HMM-predicted gene sets and randomly selected gene sets are indicated by blue and green curves, respectively. Only about 50–70% of the genes in each gene set are available for pathway analysis because the rest of the genes are not adequately annotated. The numbers in the figure, however, correspond to the number of genes in the entire gene sets.
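The y-axis quantity in Figure 7 can be written out in a few lines. The caption does not state the base of the logarithm; base 10 is assumed here, and the p-values are invented for illustration.

```python
import math

def enrichment_score(p_values, top_n=25):
    """Sum of -log10(p) over the `top_n` most significant pathways.

    Assumption: base-10 logarithm (the caption says only "negative logarithm").
    """
    top = sorted(p_values)[:top_n]
    return sum(-math.log10(p) for p in top)

# Hypothetical pathway p-values for one predicted gene set.
print(enrichment_score([1e-3, 1e-2, 0.5], top_n=2))
```

Computing this score for the gene set obtained at each occupancy threshold, and for size-matched random gene sets, yields the two curves compared in the figure.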
Figure 8. Gel shift assays with extracts from 293T cells transiently transfected with either CMV-hRelA (A), CMV-hc-Rel (B) or empty CMV vector as control (vector) and radiolabeled double-stranded oligonucleotide probes containing the predicted NF-κB sites derived from chicken blnk site 1 or site 2, pdcd4, itm2b, pp1e, bcap, igλ, or mip-1β, or a palindromic NF-κB DNA site as control (κB-PD). Reactions containing the κB-PD probe alone, in the absence of cell extract, were loaded as control (probe). DNA/protein complexes were resolved from unbound DNA probes in native 5% polyacrylamide gels. (C) shows the sum of the Kullback-Leibler (KL) divergences between the HMM-predicted occupancy probabilities of the above sequences (in the gel shift constructs) and their binding affinities in the gel shift experiments, as a function of the transition probability to the motif, z. The sum of the KL divergences is minimal at z = 0.001 for both NF-κB proteins. (D) shows the correlation between the gel shift binding affinities of the above sequences and their occupancy probabilities predicted by the HMM at z = 0.001. The correlation coefficients are 0.91 and 0.92 for RelA and c-Rel, respectively. The dashed lines in (D) are linear least-squares fits.
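The two summary statistics in panels (C) and (D) of Figure 8 can be sketched as follows. The caption does not specify how the KL divergence is defined for a probe, so this sketch assumes that each measured binding affinity is expressed as a bound fraction between 0 and 1 and compared with the predicted occupancy as two Bernoulli distributions; the direction KL(measured || predicted) is also an assumption.

```python
import math

def bernoulli_kl(q, p, eps=1e-12):
    """KL(q || p) between two Bernoulli distributions, clipped away from 0 and 1."""
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def total_kl(measured, predicted):
    """Sum of per-probe divergences; in panel (C) this would be evaluated for each z."""
    return sum(bernoulli_kl(q, p) for q, p in zip(measured, predicted))

def pearson(xs, ys):
    """Pearson correlation coefficient, as reported in panel (D)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical bound fractions and predicted occupancies for four probes.
measured = [0.1, 0.4, 0.7, 0.9]
predicted = [0.15, 0.35, 0.75, 0.85]
print(total_kl(measured, predicted), pearson(measured, predicted))
```

Repeating `total_kl` over a grid of z values and taking the minimum is one way to select z from the gel shift data, in the spirit of panel (C).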
Figure 9. Occupancy probability increases sigmoidally with respect to z. Occupancy probability of the bcap and itm2b oligonucleotides used in the gel shift experiment, with either a C or a T at the beginning of the 3' padding sequence, was predicted using an HMM with different values of z. The HMM's motif profile was the same in all instances. The predicted occupancy probability rises as a sigmoidal function of z. The occupancy probability of the stronger κB site (itm2b vs. bcap) saturates at lower z, and therefore the occupancy probability of the stronger site is greater at a particular z. Moreover, the occupancy probability of oligonucleotides is greater when the 3' padding sequence begins with a C (resulting in a stronger spurious site) than when it begins with a T.
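The sigmoidal shape in Figure 9 can be illustrated with a simplified single-site calculation. This is not the full OHMM: it assumes one isolated candidate site and a small z, so that the posterior odds of the site being occupied are approximately z times the motif-to-background likelihood ratio R. The occupancy is then a logistic function of ln z + ln R. The log-ratio values below are hypothetical.

```python
import math

def single_site_occupancy(z, log_ratio):
    """Approximate occupancy of one isolated site.

    Assumptions: small z and a single candidate site, so posterior odds ~ z * R,
    where log_ratio = ln R is the motif-vs-background log-likelihood ratio.
    """
    return 1.0 / (1.0 + math.exp(-(math.log(z) + log_ratio)))

# Hypothetical log-likelihood ratios for a stronger and a weaker site.
strong, weak = 9.0, 6.0
for z in (1e-5, 1e-4, 1e-3, 1e-2):
    print(z, single_site_occupancy(z, strong), single_site_occupancy(z, weak))
```

In this approximation the occupancy reaches one half at z = 1/R, so a stronger site (larger R) saturates at a lower z, consistent with the behavior described in the caption.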
Enriched pathways, functions and diseases
| Pathway/Function/Disease | Gene Symbols |
|---|---|
| NF-κB Signaling | NFKB2*, CD40*, IL1F9, IKBKB, RRAS, TNFAIP3*, BCL3*, TLR7, TRAF5, NFKBIB, NFKB1*, LTA*, PIK3C3, NFKBIA*, RELB*, BTRC, PIK3R2, ZAP70, TRAF3, IL1RN*, PLCG2, MAP3K8 |
| Glucocorticoid Receptor Signaling | VCAM1*, ICAM1*, MED1, SMAD3, IKBKB, RRAS, MAPK12, BCL3*, IL13*, CCL5*, NFKBIB, NFKB1*, IL8*, PIK3C3, NFKBIA*, NR3C1*, STAT1, CXCL3*, CREB1, PIK3R2, JAK3, SELE*, IL1RN*, IL6* |
| Antigen Presentation Pathway | B2M*, PSMB9*, HLA-A, CD74, HLA-B*, HLA-DQA1, TAPBP* |
| Acute Phase Response Signaling | SAA1*, IL1F9, RBP1, IKBKB, RRAS, MAPK12, BCL3*, SERPINA3*, NFKBIB, CFB*, NFKBIA*, NR3C1*, PIK3R2, NOLC1, SAA2*, SOCS2, IL1RN*, IL6* |
| B Cell Receptor Signaling | IKBKB, RRAS, MAPK12, BCL3*, NFKBIB, CALML5, NFATC1, PTPN6, NFKBIA*, PIK3C3, CREB1, MAP3K11, PIK3R2, PLCG2, MAP3K8 |
| Death Receptor Signaling | NFKBIA*, BIRC3, DIABLO, IKBKB, BCL3*, TANK, NFKBIB, TNFSF15* |
| Apoptosis Signaling | NFKBIA*, BIRC3, DIABLO, IKBKB, RRAS, BCL3*, MAPK6, TP53*, NFKBIB, RPS6KA1, PLCG2, MAP3K8 |
| Cell Cycle: G1/S Checkpoint Regulation | BTRC, SMAD3, SIN3A, TP53*, HDAC8, E2F6 |
| Chemokine Signaling | CCL4*, RRAS, CCR3, MAPK12, CCL5*, PLCG2, CALML5 |
| T Cell Receptor Signaling | NFATC1, PIK3C3, NFKBIA*, IKBKB, RRAS, PIK3R2, ZAP70, CALML5 |
| Notch Signaling | DLL1, NOTCH2, RBPJ, MAML2 |
| P53 Signaling | BBC3, PIK3C3, SIRT1, PPP1R13B, MED1, PIK3R2, TP53* |
| Xenobiotic Metabolism Signaling | IL4I1, SULT1C2, MED1, RRAS, MAPK12, NFKB1*, NFKB2*, GSTP1*, PIK3C3, PPP2CB, ALDH3B2, EIF2AK3, PIK3R2, NFE2L2, IL6*, IL1RN*, GSTA5 |
| Neurotrophin/TRK Signaling | PIK3C3, CREB1, RRAS, PIK3R2, RPS6KA1 |
| Protein Ubiquitination Pathway | UBE2H, UBE2D3, B2M*, UBE2M*, BIRC3, BTRC, PSMB9*, HLA-A, HLA-B* |
| Skeletal and Muscle Development and Function | CD40*, CSF1*, CXCL11*, DLL1, IKBKB, IL6*, IL13*, IL1RN*, MED1, NFATC1, NFKB1*, NFKB2*, NFKBIA*, RBPJ, SMAD3, STAT1, VCAM1*, WNT10B* |
| Infection of Virus | CCL4*, CCL5*, CLEC4M, DEFA1, ICAM1*, IL13*, IRF8, XPO1 |
| Cancer | ACACA, AIM2, B2M*, BBC3, BCL2L10, BIRC3, BTRC, C6ORF66, CARD8, CD40*, CREB1, CTGF, CYLD, DBC1, DIABLO, DLL1, DPP4, DUT, EGR2, EIF2AK3, GNB1, GNB2L1*, HINT1, HUWE1, IER3*, IFNB1*, IGFBP6, IL6*, IL8*, IL13*, IL1RN*, IRF1*, IRF8, ITGA5, LCN2*, LTA*, LTB*, MAML2, MAP3K11, MAPK12, MEN1, MIA, MSX1, MYB*, NFKB1*, NFKB2*, NFKBIA*, NFKBIZ, NR3C1*, OAS3, PLCG2, PPP1R13B, PPP5C*, PTPN6, RBM17, REL*, RHOC, RPS6KA1, RUNX1T1, SMPD2, STAT1, THOC1, TNFAIP3*, TNFSF13, TP53*, TRAF3, TWIST1* |
| Rheumatoid Arthritis | ACAN, ACTA1, ADAMTS7, B2M*, BLR1*, CARD8, CCL1*, CCL4*, CCL5*, CCL19*, CD40*, CD69*, CD70, CD74, CD83*, CD86*, CD274*, CFB*, CXCL1*, CXCL2*, CXCL3*, CXCL5*, CXCL6*, CXCL10*, DEFA1, DPP4, GP1BA, HLA-A, HLA-DQA1, HPRT1, ICAM1*, IFNB1*, IL6*, IL8*, IL13*, LTA*, LTB*, MAPK12, NFKB1*, NFKBIA*, NR3C1*, PSMB9*, SAA1*, SAA2*, TNFAIP3*, TNFRSF13B, TNFSF15*, TP53*, TPM2, VIM*, WNT10B* |
| Experimental Autoimmune Encephalomyelitis | B2M*, CD40*, CD86*, CXCL10*, DPP4, HLA-DQA1, IFNB1*, IKBKB, IL6*, LTA*, LTB*, NR3C1*, REL*, STAT1 |
Selected cellular pathways, biological functions and diseases in which our predicted NF-κB targets were over-represented are shown. The associated predicted NF-κB targets are represented by official human gene symbols. Genes containing κB sites with predicted occupancy probability greater than 0.5 were used in this analysis. Please see Additional file 5 for the complete list. Genes known in the literature to be regulated by NF-κB (although not necessarily directly) [38] are denoted with *.
Figure 10. Gel-shift assays for selected human sites from OHMM predictions. We show the gel-shift results for representative sites from the OHMM-predicted locations of high NF-κB occupancy in human promoters. The NFκBIA site is the positive control. The negative control, at the bottom right, corresponds to the sequence called site 1 in the Results section (AACCACAACCTGCAGCTATTA). Note that lanes 3 and 4, corresponding to gel shifts with extracts from cells overexpressing hc-Rel and hRelA, respectively, show strong shifts. Control lane 1, with only the probe (no TF), shows no shift. Lane 2, the other control, represents the gel shift with extracts from cells transfected with only the vector. These extracts may contain some endogenous NF-κB from the cells, but the results show very weak shifts compared with lanes 3 and 4, which reflect overexpression of particular NF-κB proteins. The negative control shows that these shifts result from sequence-specific binding.