Abstract
The reliable recognition of eukaryotic RNA polymerase II core promoters, and of the associated transcription start sites (TSSs) of genes, has been an ongoing challenge for computational biology. High-throughput experimental methods such as tiling arrays or 5' SAGE/EST sequencing have recently led to much larger datasets of core promoters, and to the assessment that well-known core promoter sequence elements such as the TATA box appear to be much less frequent than previously thought. Here, we address the co-occurrence of several previously identified core promoter sequence motifs in Drosophila melanogaster to determine frequently occurring core promoter modules. We then use these modules in a new strategy that models core promoters as a set of alternative submodels, one for each core promoter architecture defined by a motif module. We show that this system greatly improves computational promoter recognition and leads to highly accurate in silico TSS prediction. Our results indicate that, at least for the fruit fly, we are getting closer to an understanding of how the beginning of a gene is defined in a eukaryotic genome.
Year: 2006 PMID: 17068082 PMCID: PMC1635271 DOI: 10.1093/nar/gkl608
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Frequency of occurrence of individual core promoter motifs and pairs of motifs, modified after (7)
| Motif X | % Seq with Motif X | % of promoters with Motif X also containing Motif Y below | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | | M1 | DRE | TATA | INR | M6 | DPE | MTE |
| M1 | 25.1 | 100.0 | 21.3 | 13.1 | 12.7 | 4.9 | 6.1 | |
| DRE | 26.0 | 20.6 | 100.0 | 14.9 | 16.8 | 14.1 | 5.7 | 6.9 |
| TATA | 19.3 | 17.1 | 20.1 | 100.0 | 14.4 | 4.8 | ||
| INR | 26.3 | 12.1 | 16.6 | 100.0 | 12.1 | |||
| M6 | 15.8 | 23.2 | 17.6 | 20.3 | 100.0 | 4.6 | 4.2 | |
| DPE | 7.9 | 15.6 | 18.8 | 11.7 | 9.1 | 100.0 | 8.4 | |
| MTE | 8.5 | 18.2 | 21.2 | 7.9 | 7.9 | 100.0 | ||
The first column gives the motif name. The second column shows the overall fraction of promoters containing a hit to the corresponding weight matrix model. The remaining columns list the frequency with which the motif in each row co-occurs with a particular second motif in the same core promoter. Cells in which the fraction of promoters containing the second motif exceeds that motif's overall frequency (column 2) are printed in bold and italic.
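The quantities in this table can be illustrated with a minimal Python sketch. It assumes motif hits are already available as sets of promoter identifiers; the function name `cooccurrence`, the toy identifiers, and the toy hit sets are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

def cooccurrence(hits: Dict[str, Set[str]], n_promoters: int):
    """Return overall and conditional motif frequencies in percent.

    hits maps a motif name to the set of promoter ids with a hit
    (hypothetical input format, assumed for this sketch).
    """
    # Column 2 of the table: % of all promoters containing motif X.
    overall = {m: 100.0 * len(ids) / n_promoters for m, ids in hits.items()}
    # Remaining columns: % of promoters with motif X that also contain motif Y.
    cond: Dict[str, Dict[str, float]] = {}
    for x, ids_x in hits.items():
        cond[x] = {}
        for y, ids_y in hits.items():
            if ids_x:
                cond[x][y] = 100.0 * len(ids_x & ids_y) / len(ids_x)
            else:
                cond[x][y] = float("nan")  # undefined when motif X never occurs
    # Pairs the caption would highlight: Y is more frequent among promoters
    # with X than it is overall.
    enriched: Set[Tuple[str, str]] = {
        (x, y)
        for x in hits
        for y in hits
        if x != y and cond[x][y] > overall[y]
    }
    return overall, cond, enriched

# Toy data: 10 promoters p0..p9 with made-up motif hits.
toy_hits = {
    "TATA": {"p0", "p1", "p2", "p3"},
    "INR": {"p3", "p4", "p5", "p6", "p7"},
    "DPE": {"p6", "p7"},
}
overall, cond, enriched = cooccurrence(toy_hits, n_promoters=10)
```

In the toy data, every promoter with a DPE hit also has an INR hit, so the (DPE, INR) cell would be highlighted, whereas the (TATA, INR) cell would not.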
Figure 1. A new Drosophila core promoter module. (A) and (B) show the location distributions of motifs 6 and 1 relative to the TSS (pos. 0), and (C) shows the distance between motif occurrences in the same promoter.
Figure 2. Comparison of motif module frequency in the initial and the final partitioning after semi-supervised clustering. Only initial frequencies are shown for the partition of sequences without a strong motif hit (‘no initial’), i.e. sequences that were not initially assigned to a particular motif class, and for the MTE motif partition, which proved unstable and was gradually split up among the other classes. For each of the final partitions, we show the number of promoters with the same motif/module, i.e. those that remain from the initial partition (blue); the number of promoters that were initially assigned to a different partition among the five stable subclasses (red); and the number of promoters from the initial ‘no motif’ and MTE partitions (yellow). Promoters were assigned to several initial partitions if several motifs/modules had a good hit, so the combined size of the initial partitions adds up to more than the total dataset of 1864 promoters.
Figure 3. Specific sequence profiles (left) in five different subclasses of core promoters (right). The left shows the average GC trinucleotide content in the region [−250, +50]. The right depicts the different core modules currently modeled in the McPromoter system.
Comparison of McPromoter using one or five promoter models, respectively, with the most recent Drosophila predictor proposed in Ref. (14)
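| McPromoter (one model): Sn (%) | McPromoter (one model): Sp (%) | Sharan and Myers: Sn (%) | Sharan and Myers: Sp (%) | McPromoter (five models): Sn (%) | McPromoter (five models): Sp (%) | McPromoter (five models): FP rate |
|---|---|---|---|---|---|---|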
| 20 | 69 | 20 | 79 | 23 | 91 | 1/426 590 |
| 37 | 51 | 35 | 53 | 36 | 79 | 1/94 797 |
| 52 | 40 | 50 | 33 | 50 | 47 | 1/16 097 |
| 67 | 29 | 65 | 20 | 64 | 36 | 1/8 203 |
To enable a fair comparison, the evaluation is done on the same dataset and annotation as in previous publications. Sn: sensitivity, i.e. fraction of correctly identified TSSs among the set of annotated start sites. A TSS is counted as correct if one or more predictions fall within a window of [−500, +50] of the 5′ end of genes in the set. Sp: specificity, i.e. fraction of correctly identified TSSs among the set of total predictions, where predictions are counted within the regions spanned by the genes. FP rate: false positive rate, i.e. the frequency of additional predictions per nucleotide. Numbers for the one-model McPromoter and Sharan and Myers were adapted from Ref. (14).
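The Sn and Sp definitions above can be sketched in a few lines of Python. This is a simplified illustration, not the evaluation code used in the paper: it assumes forward-strand genes given as hypothetical `(tss, end)` coordinate pairs, and it assumes that the "region spanned by a gene" runs from the start of the upstream window to the gene end. The function name `evaluate` and the toy coordinates are made up for this example.

```python
from typing import List, Tuple

def evaluate(
    predictions: List[int],
    genes: List[Tuple[int, int]],
    upstream: int = 500,
    downstream: int = 50,
) -> Tuple[float, float]:
    """Return (Sn, Sp) as fractions for forward-strand genes (tss, end)."""
    # A TSS counts as correct if at least one prediction falls
    # within [tss - upstream, tss + downstream].
    correct = sum(
        1
        for tss, _ in genes
        if any(tss - upstream <= p <= tss + downstream for p in predictions)
    )
    # Predictions are only counted inside gene regions (assumption: the
    # region extends from the upstream window boundary to the gene end).
    counted = sum(
        1
        for p in predictions
        if any(tss - upstream <= p <= end for tss, end in genes)
    )
    sn = correct / len(genes) if genes else 0.0
    sp = correct / counted if counted else 0.0
    return sn, sp

# Toy example with made-up coordinates.
toy_genes = [(1000, 3000), (10000, 12000)]
toy_predictions = [900, 1040, 2500, 9000, 20000]
sn, sp = evaluate(toy_predictions, toy_genes)
```

In the toy example, only the first gene has a prediction inside its window, and three predictions fall within gene regions, so Sn is 1/2 and Sp is 1/3. Two predictions near the same TSS count as one correct TSS, which is why Sp can drop when a predictor fires repeatedly around one start site.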
Relative frequency of predicted core modules in D. melanogaster and D. pseudoobscura (Figure 2).
| Core module | Frequency mel/pse (%) |
|---|---|
| Inr/DPE | 12/12 |
| Inr only | 6/17 |
| TATA/Inr | 36/37 |
| DRE | 23/15 |
| Motif 1/6 | 23/19 |