Saran Vardhanabhuti, Junwen Wang, Sridhar Hannenhalli.
Abstract
Computational discovery of cis-regulatory elements remains challenging. To cope with the high false-positive rate, evolutionary conservation is routinely used. However, conservation is only one of the attributes of cis-regulatory elements and is neither necessary nor sufficient. Here, we assess two additional attributes (positional and inter-motif distance specificity) that are critical for interactions between transcription factors. We first show that for a greater than expected fraction of known motifs, the genes that contain the motifs in their promoters in a position-specific or distance-specific manner are related in function, in expression pattern, or both. We then use the position and distance specificity to discover novel motifs. Our work highlights the importance of distance and position specificity, in addition to evolutionary conservation, in discovering cis-regulatory motifs.
Year: 2007 PMID: 17452354 PMCID: PMC1904283 DOI: 10.1093/nar/gkm201
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Conservation Z-scores for 165 TRANSFAC motifs. TRANSFAC motifs are generally more conserved than permuted motifs, and the majority of common core motifs are highly conserved. The two plots are significantly different (Wilcoxon rank-sum test, p-value = 2 × 10⁻¹⁰). We have categorized motifs based on their conservation Z-scores. The high-conservation category (Z > 8) has 27 motifs, the medium-conservation category (3 ≤ Z ≤ 8) has 27 motifs and the low-conservation category (Z < 3) has 111 motifs. Several core factors are conserved: CAAT box (100.3), Sp1 (90.4), Oct-1 (10.1), TATA (8.4).
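The three conservation categories in the Figure 1 caption amount to a simple binning rule. A minimal sketch in Python, assuming only the thresholds and the four core-motif Z-scores quoted in the caption; the function name `conservation_category` is hypothetical and not taken from the paper.

```python
def conservation_category(z: float) -> str:
    """Bin a conservation Z-score using the thresholds from the Figure 1 caption.

    high:   Z > 8
    medium: 3 <= Z <= 8
    low:    Z < 3
    """
    if z > 8:
        return "high"
    if z >= 3:
        return "medium"
    return "low"


# Z-scores quoted in the caption for four core motifs.
core_motifs = {"CAAT box": 100.3, "Sp1": 90.4, "Oct-1": 10.1, "TATA": 8.4}

# Map each core motif to its category.
categories = {name: conservation_category(z) for name, z in core_motifs.items()}
```

Under these thresholds all four quoted core motifs land in the high-conservation category.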
Figure 2. (a) Position specificity Z-max distribution for the 165 motifs. Also shown is the Z-max for random sequences with positionally matched base composition used as promoters. We define a set of 39 motifs to be position specific (Z-max ≥ 5) and a set of 38 motifs to be position nonspecific (Z-max ≤ 3). (b) Position specificity Z-score distribution for three core motifs. The transcription start site is at position +1.
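To make the Z-max cut-offs concrete, the sketch below computes a toy Z-max from per-window motif hit counts and applies the two thresholds from the caption. The caption does not describe how the Z-scores were computed, so the scoring here (each window's count standardized against the mean and standard deviation across all windows) is an assumption for illustration only and is not the authors' procedure. The names `z_max` and `position_class` and the example counts are hypothetical.

```python
from statistics import mean, pstdev


def z_max(window_counts):
    """Return (index of best window, Z-max) for per-window motif hit counts.

    ASSUMPTION: Z is computed across the windows themselves; the paper's
    actual background model is not given in this section.
    """
    mu = mean(window_counts)
    sigma = pstdev(window_counts)
    if sigma == 0:
        # Perfectly flat profile: no window stands out.
        return None, 0.0
    z_scores = [(c - mu) / sigma for c in window_counts]
    best = max(range(len(z_scores)), key=z_scores.__getitem__)
    return best, z_scores[best]


def position_class(zmax):
    """Apply the Figure 2 thresholds: >= 5 specific, <= 3 nonspecific."""
    if zmax >= 5:
        return "position specific"
    if zmax <= 3:
        return "position nonspecific"
    return "unclassified"


# Hypothetical profile: 40 windows, one with a strong excess of hits.
counts = [10] * 40
counts[27] = 60
best_window, zmax = z_max(counts)
label = position_class(zmax)
```

Motifs with a Z-max strictly between 3 and 5 fall into neither set, which is why the 39 position-specific and 38 position-nonspecific motifs do not add up to 165.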
Table 1. Functional relevance of known motifs. Gene targets were obtained for motifs based on three different criteria.
| Criterion | Number of motifs | Number associated with GO process with FDR ≤ 5% (Real) | Number associated with GO process with FDR ≤ 5% (Random) | Average number of associated GO processes (Real) | Average number of associated GO processes (Random) | Percent of GO-associated motifs that are conserved (Real) | Relative enrichment in tissue-associated motifs (Real) | Relative enrichment in tissue-associated motifs (Random) | Average number of tissues (Real) | Average number of tissues (Random) | Percent of tissue-associated motifs that are conserved (Real) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 39 | 6 (15%) | 0 (0%) | 3 | 0 | 4/6 (67%) | 1.6 | 0.3 | 31 | 8 | 17/35 (49%) |
| B | 39 | 1 (3%) | 0 (0%) | 1 | 0 | 1/1 (100%) | 0.9 | 0.3 | 22 | 4 | 11/21 (52%) |
| C | 38 | 1 (3%) | 1 (3%) | 1 | 2 | 0 (0%) | 0.6 | 0.1 | 9 | 0 | 1/14 (7%) |
(A) Gene promoters containing a position-specific motif in the preferred window, (B) gene promoters containing a position-specific motif at any position, (C) gene promoters containing a position-nonspecific motif at any position. For each ‘Real’ target gene-set, a ‘Random’ gene-set of the same size was selected. The table shows for each criterion: col2: the number of gene-sets, col3: the fraction of gene-sets associated with a GO process (FDR ≤ 5%), col4: the average number of GO processes, col5: among the motifs associated with a GO process, the fraction that was conserved, col6: the ratio of the ‘actual % of motifs that show differential expression (FDR ≤ 1%)’ to the expected fraction of 55% (see text for how 55% was calculated), col7: the average number of such tissues, col8: among the motifs associated with a tissue, the fraction that was conserved. For each of these columns we show the figures for both the ‘Real’ and the ‘Random’ gene-set.
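The relative-enrichment column can be checked against the counts in the table. The sketch below assumes that the expected fraction quoted in the legend is 55% and that the denominator in the last column (for example the 35 in 17/35) is the number of tissue-associated motifs; both readings are assumptions. The function name `relative_enrichment` is hypothetical.

```python
def relative_enrichment(n_tissue_associated: int, n_motifs: int,
                        expected_fraction: float = 0.55) -> float:
    """Ratio of the observed fraction of tissue-associated motifs to the expected one.

    ASSUMPTION: expected_fraction = 0.55 is our reading of the legend.
    """
    observed_fraction = n_tissue_associated / n_motifs
    return observed_fraction / expected_fraction


# Criterion A, 'Real' gene-sets: 35 of 39 motifs are tissue associated.
ratio_a = relative_enrichment(35, 39)
```

For criterion A this gives roughly 1.63, in line with the 1.6 reported in the table. Some other rows differ from the table by about 0.1 under this reading, so treat it as an approximation rather than the exact calculation described in the text.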
Figure 3. The distance-specificity Z-max distribution for the 21 777 motif-pairs on real promoters is shown in blue. The Z-max distribution for permuted PWM pairs on random promoter sequences is shown in cyan. The distribution for pairs with at least one non-position-specific motif (7804 pairs) is shown in green, and the distribution for pairs in which both motifs are position specific (1618 pairs) is shown in red.
Table 2. Functional relevance of known motif-pairs. Gene targets were obtained for motif-pairs based on four different criteria.
| Criterion | Number of motifs | Number associated with GO process with FDR ≤ 5% (Real) | Number associated with GO process with FDR ≤ 5% (Random) | Average number of associated GO processes (Real) | Average number of associated GO processes (Random) | Percent of GO-associated motifs that are conserved (Real) | Relative enrichment in tissue-associated motifs (Real) | Relative enrichment in tissue-associated motifs (Random) | Average number of tissues (Real) | Average number of tissues (Random) | Percent of tissue-associated motifs that are conserved (Real) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 245 | 61 (25%) | 12 (5%) | 4 | 1 | 37/61 (61%) | 1.6 | 0.6 | 36 | 10 | 78/217 (36%) |
| B | 321 | 35 (11%) | 16 (5%) | 3 | 1 | 4/35 (11%) | 1.5 | 0.4 | 25 | 6 | 8/266 (3%) |
| C | 321 | 31 (10%) | 12 (4%) | 2 | 1 | 2/31 (7%) | 1.1 | 0.4 | 18 | 7 | 16/196 (8%) |
| D | 417 | 24 (6%) | 15 (4%) | 2 | 1 | 1/24 (4%) | 1.1 | 0.3 | 16 | 7 | 3/254 (1%) |
For each criterion, all the gene-sets with significant GO processes (FDR ≤ 5%) were computed. Four criteria were used to select target gene-sets: (A) genes that contain distance-specific motifs at the Z-max distance where both motifs are position specific, (B) genes that contain distance-specific motifs at the Z-max distance where at least one of the motifs is position nonspecific, (C) genes that contain distance-specific motifs at any distance other than the Z-max distance where at least one of the motifs is position nonspecific and (D) genes that contain distance-nonspecific motifs at an arbitrary distance. For each ‘Real’ target gene-set, a ‘Random’ gene-set of the same size was selected. See the Table 1 legend for the description of columns 2–8.
Figure 4. For the 661 novel motifs, as we increase the threshold for the position specificity, the fraction of qualifying motifs that are conserved increases. At a stringent position-specificity Z-score ≥ 8, 46% of motifs are conserved compared to only 20% among the motifs that have position-specificity Z-score ≥ 4.
Figure 5. For all 79 576 motif-pairs (at least 1 motif is non-position specific), as we increase the threshold for the distance specificity, the fraction of qualifying motifs that are conserved increases. At a stringent distance-specificity Z-score ≥ 9, 19% of motifs are conserved compared to only 11% among the motifs that have distance-specificity Z-score ≥ 4.
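The quantity plotted in Figures 4 and 5 is the fraction of motifs (or motif-pairs) that are conserved among those passing a specificity cut-off. A minimal sketch of that calculation follows; the function name `fraction_conserved` is hypothetical and the seven (Z-score, conserved) records are invented toy values, not data from the paper.

```python
def fraction_conserved(motifs, z_threshold):
    """Fraction of conserved motifs among those with specificity Z >= z_threshold.

    `motifs` is an iterable of (z_score, is_conserved) pairs.
    Returns None when no motif passes the threshold.
    """
    qualifying = [is_conserved for z, is_conserved in motifs if z >= z_threshold]
    if not qualifying:
        return None
    return sum(qualifying) / len(qualifying)


# Toy records (hypothetical): specificity Z-score and conservation flag.
toy_motifs = [
    (4.2, False), (4.8, False), (5.5, True), (6.1, False),
    (8.3, True), (9.0, False), (9.7, True),
]

loose = fraction_conserved(toy_motifs, 4)   # 3 of 7 qualifying motifs conserved
strict = fraction_conserved(toy_motifs, 8)  # 2 of 3 qualifying motifs conserved
```

In this toy set the conserved fraction rises as the threshold is tightened, which is the same qualitative trend the two captions report.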
Table 3. Functional assessment of novel motifs. Gene targets were obtained for motifs based on three different criteria.
| Criterion | Number of motifs | Number associated with GO process with FDR ≤ 5% (Real) | Number associated with GO process with FDR ≤ 5% (Random) | Average number of associated GO processes (Real) | Average number of associated GO processes (Random) | Percent of GO-associated motifs that are conserved (Real) | Relative enrichment in tissue-associated motifs (Real) | Relative enrichment in tissue-associated motifs (Random) | Average number of tissues (Real) | Average number of tissues (Random) | Percent of tissue-associated motifs that are conserved (Real) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 168 | 2 (1%) | 0 (0%) | 3 | 0 | 1/2 (50%) | 1.6 | 0.3 | 32 | 9 | 48/146 (33%) |
| B | 168 | 0 (0%) | 0 (0%) | 0 | 0 | 0/0 (0%) | 1.1 | 0.3 | 20 | 10 | 35/98 (36%) |
| C | 123 | 0 (0%) | 0 (0%) | 0 | 0 | 0/0 (0%) | 0.5 | 0.1 | 7 | 7 | 2/29 (7%) |
For each criterion, all the gene-sets with significant GO processes (FDR ≤ 5%) were computed. Three criteria were used to select target gene-sets: (A) gene promoters containing a position-specific motif in the preferred window, (B) gene promoters containing a position-specific motif at any position and (C) gene promoters containing a position-nonspecific motif at any position. For each ‘Real’ target gene-set, a ‘Random’ gene-set of the same size was selected. See the Table 1 legend for the description of columns 2–8.
Table 4. Functional assessment of novel motif-pairs. Gene targets were obtained for motif-pairs based on four different criteria.
| Criterion | Number of motifs | Number associated with GO process with FDR ≤ 5% (Real) | Number associated with GO process with FDR ≤ 5% (Random) | Average number of associated GO processes (Real) | Average number of associated GO processes (Random) | Percent of GO-associated motifs that are conserved (Real) | Relative enrichment in tissue-associated motifs (Real) | Relative enrichment in tissue-associated motifs (Random) | Average number of tissues (Real) | Average number of tissues (Random) | Percent of tissue-associated motifs that are conserved (Real) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 200 | 23 (12%) | 8 (4%) | 3 | 1 | 6/23 (26%) | 1.7 | 0.9 | 44 | 15 | 44/189 (23%) |
| B | 200 | 17 (9%) | 6 (3%) | 3 | 1 | 4/17 (24%) | 1.6 | 0.6 | 32 | 10 | 14/172 (8%) |
| C | 200 | 16 (8%) | 5 (3%) | 3 | 1 | 3/16 (19%) | 1.3 | 0.6 | 29 | 13 | 15/148 (10%) |
| D | 200 | 3 (2%) | 3 (2%) | 1 | 1 | 2/3 (67%) | 1.0 | 0.4 | 16 | 10 | 0/107 (0%) |
For each criterion, all the gene-sets with significant GO processes (FDR ≤ 5%) were computed. Four criteria were used to select target gene-sets: (A) gene promoters containing distance-specific motif-pairs that are also position specific at the preferred distance, (B) gene promoters containing distance-specific motif-pairs (at least one motif is non-position specific) at the preferred distance, (C) gene promoters containing distance-specific motif-pairs (at least one motif is non-position specific) at any distance and (D) gene promoters containing distance-nonspecific motif-pairs at any distance. For each ‘Real’ target gene-set, a ‘Random’ gene-set of the same size was selected. See the Table 1 legend for the description of columns 2–8.