I A Shahmuradov, V V Solovyev, A J Gammerman.
Abstract
Accurate prediction of promoters is fundamental to understanding gene expression patterns, and confidence estimation is one of its main requirements. Using recently developed transductive confidence machine (TCM) techniques, we developed a new program, TSSP-TCM, for the prediction of plant promoters that also provides a confidence estimate for each prediction. The program was trained on 132 and 104 sequences and tested on 40 and 25 sequences (containing TATA and TATA-less promoters, respectively) with known transcription start sites (TSSs). As negative training samples for TCM learning we used coding and intron sequences of plant genes annotated in GenBank. In the test set of TATA promoters, the program correctly predicted the TSS for 35 out of 40 (87.5%) genes with a median deviation of several base pairs from the true site location. For 25 TATA-less promoters, TSSs were predicted for 21 out of 25 (84%) genes, including 14 cases of 5 bp distance between annotated and predicted TSSs. Using the TSSP-TCM program, we annotated promoters in the whole Arabidopsis genome. The predicted promoters were in good agreement with the start positions of known Arabidopsis mRNAs. Thus, the TCM technique has produced a plant-oriented promoter prediction tool of high accuracy. The TSSP-TCM program and annotated promoters are available at http://mendel.cs.rhul.ac.uk/mendel.php?topic=fgen.
Year: 2005 PMID: 15722481 PMCID: PMC549412 DOI: 10.1093/nar/gki247
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Characteristics of promoter sequences used for TATA and TATA-less promoter recognition, and the Mahalanobis distance [D²; (32)] showing the power of recognition of each characteristic
| Characteristics | D² for TATA promoters | D² for TATA-less promoters |
|---|---|---|
| Hexaplets −200 : −45 | 2.6 | 1.4 (−100 : −1) |
| TATA box score | 3.4 | 0.9 |
| Triplets around TSS | 4.1 | 0.7 |
| Hexaplets +1: +40 | 0.9 | |
| Sp1-motif content | 0.9 | |
| TATA fixed location | 0.7 | |
| CpG content | 1.4 | 0.7 |
| Similarity −200 : −100 | 0.3 | 0.7 |
| Motif Density (MD) −200 : +1 | 4.5 | 3.2 |
| Direct/Inverted MD −100 : +1 | 4.0 | 3.3 (−100 : −1) |
MD is motif density, computed on known promoters; functional motifs were taken from the Plant REGSITE Database.
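For a single characteristic, the squared Mahalanobis distance between the promoter and non-promoter classes reduces to the squared difference of the class means divided by the pooled variance. The Python sketch below illustrates only that one-dimensional case. The function name and the sample values are hypothetical, and the source does not state how the authors estimated the variance, so the pooled estimate here is an assumption.

```python
from statistics import mean, variance


def mahalanobis_d2(pos, neg):
    """Squared Mahalanobis distance between two samples of one scalar feature.

    Assumption: both classes share a common variance, estimated by pooling
    the two unbiased sample variances.
    """
    n_pos, n_neg = len(pos), len(neg)
    pooled_var = (
        (n_pos - 1) * variance(pos) + (n_neg - 1) * variance(neg)
    ) / (n_pos + n_neg - 2)
    return (mean(pos) - mean(neg)) ** 2 / pooled_var


# Hypothetical feature values (e.g. a TATA box score) for each class.
promoter_scores = [2.0, 3.0, 4.0]
non_promoter_scores = [0.0, 1.0, 2.0]

d2 = mahalanobis_d2(promoter_scores, non_promoter_scores)
print(f"D^2 = {d2:.1f}")
```

A larger D² means the characteristic separates the two classes more strongly, which is how the values in the table can be compared with each other.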
Statistics of testing procedure for 40 TATA and 25 TATA-less promoter sequences of 351 bpᵃ
| Promoter type | Accuracy of discrimination | Negative samples from CDSs (%) | Negative samples from introns (%) |
|---|---|---|---|
| TATA | Mean prediction error for positive samples (%) | 7.4 | 3.5 |
| TATA | Mean prediction error for negative samples (%) | 6.0 | 8.7 |
| TATA-less | Mean prediction error for positive samples (%) | 18.6 | 14.0 |
| TATA-less | Mean prediction error for negative samples (%) | 16.9 | 29.5 |
ᵃA total of 40 different sets of 1000 negative samples of the same length (351 bp) were randomly chosen from CDSs (20 sets, 20 000 sequences in total) and introns (20 sets, 20 000 sequences in total) of known plant genes. Confidence and credibility levels were ≥0.9 (90%) and ≥0.06 (6%), respectively.
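In the TCM framework, a p-value is computed for each candidate label of a test example. The predicted label is the one with the largest p-value, confidence is one minus the second-largest p-value, and credibility is the largest p-value. The Python sketch below shows how such values could be obtained and compared with the thresholds in the footnote. It is not the TSSP-TCM implementation: the 1-nearest-neighbour nonconformity measure, the one-dimensional feature, and all data and names are assumptions chosen for brevity.

```python
def nonconformity(i, xs, ys):
    """1-NN strangeness (an assumed measure, not the one used by TSSP-TCM):
    distance to the nearest same-label example divided by the distance
    to the nearest other-label example."""
    same = min(abs(xs[i] - xs[j]) for j in range(len(xs))
               if j != i and ys[j] == ys[i])
    other = min(abs(xs[i] - xs[j]) for j in range(len(xs))
                if ys[j] != ys[i])
    return same / other if other > 0 else float("inf")


def p_value(x_new, label, xs, ys):
    """Fraction of examples at least as strange as the test example
    when the test example is tentatively given `label`."""
    ext_x, ext_y = xs + [x_new], ys + [label]
    alphas = [nonconformity(i, ext_x, ext_y) for i in range(len(ext_x))]
    return sum(a >= alphas[-1] for a in alphas) / len(alphas)


def predict(x_new, xs, ys):
    """Return (predicted label, confidence, credibility)."""
    p = {lab: p_value(x_new, lab, xs, ys) for lab in sorted(set(ys))}
    ranked = sorted(p, key=p.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, 1.0 - p[runner_up], p[best]


# Hypothetical one-dimensional feature values for eight training sequences.
train_x = [0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2, 2.3]
train_y = ["promoter"] * 4 + ["non-promoter"] * 4

label, confidence, credibility = predict(0.25, train_x, train_y)

# Thresholds quoted in the footnote: confidence >= 0.9, credibility >= 0.06.
accepted = confidence >= 0.9 and credibility >= 0.06
print(label, round(confidence, 3), round(credibility, 3), accepted)
```

With only eight training examples the smallest possible p-value is 1/9, so confidence cannot exceed about 0.889 in this toy setting and the prediction is not accepted at the 0.9 level. Larger training sets allow finer-grained p-values.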
Accuracy of prediction by TSSP-TCM on genomic sequencesᵃ
| Statistic characteristics | For 40 TATA promoters | For 25 TATA-less promoters |
|---|---|---|
| False negatives | 5 | 4 |
| False positives | 14 | 9 |
| False positives' density | 1 per 5375 bp | 1 per ∼4720 bp |
ᵃThe confidence level for the prediction of both promoter classes was 95% or higher. The credibility level was ≥35% for TATA promoters and ≥60% for TATA-less promoters. For each class of promoters, only the single predicted TSS with the highest credibility level within an interval of 300 bp was kept. Predicted TATA and TATA-less promoters were evaluated separately by this criterion.
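The selection rule in the footnote can be sketched as a threshold filter followed by a greedy pass that keeps the most credible prediction and discards weaker ones closer than 300 bp. The Python example below is one possible reading of that rule, not the program's actual code: the greedy strategy, the tuple layout, and the candidate values are assumptions. The helper for false-positive density simply divides sequence length by the number of false positives; the 75 250 bp total is back-calculated from the table (14 × 5375), not a figure stated in the source.

```python
def select_tss(predictions, min_confidence=0.95, min_credibility=0.35, window=300):
    """Keep at most one prediction per `window` bp, preferring high credibility.

    predictions: list of (position, confidence, credibility) tuples.
    Defaults follow the footnote's TATA thresholds; use
    min_credibility=0.60 for TATA-less promoters.
    """
    passing = [p for p in predictions
               if p[1] >= min_confidence and p[2] >= min_credibility]
    kept = []
    # Visit candidates from most to least credible (assumed greedy strategy).
    for pos, conf, cred in sorted(passing, key=lambda p: p[2], reverse=True):
        if all(abs(pos - k[0]) >= window for k in kept):
            kept.append((pos, conf, cred))
    return sorted(kept)


def bp_per_false_positive(total_bp, false_positives):
    """Average sequence length per false positive."""
    return total_bp / false_positives


# Hypothetical candidate TSS predictions on one sequence.
candidates = [
    (100, 0.97, 0.40),
    (250, 0.99, 0.80),
    (420, 0.96, 0.50),
    (900, 0.98, 0.36),
    (950, 0.90, 0.90),  # fails the confidence threshold
]

selected = select_tss(candidates)
print([pos for pos, _, _ in selected])
print(bp_per_false_positive(75250, 14))
```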
Figure 1. Location of the nearest predicted TSS relative to the known TSS for 35 out of 40 genes with annotated TATA promoters.
Figure 2. Location of the nearest predicted TSS relative to the known TSS for 21 out of 25 genes with annotated TATA-less promoters.
Summary of promoter prediction for 13 350 mRNA-supported genes of A. thaliana
| Statistic | Value |
|---|---|
| Analyzed | 13 350 genes |
| At least 1 promoter found | 9653 (72.3%) genes |
| At least 1 TATA promoter found | 6141 (46.0%) genes |
| At least 1 TATA-less promoter found | 6717 (50.3%) genes |
| The predicted TATA promoter is the closest to the annotated mRNA start | 4465 (46.3%) genes |
| The predicted TATA-less promoter is the closest to the annotated mRNA start | 5188 (53.7%) genes |
| Total length of analyzed sequences | 27 709 288 bp |
| Total number of promoters predicted | 17 717 |
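The percentages in this summary use two different denominators, which is easy to miss: the first three are fractions of all 13 350 analyzed genes, while the two "closest to the annotated mRNA start" rows are fractions of the 9653 genes with at least one predicted promoter. The short Python check below reproduces the table's figures from its own counts. The variable names are illustrative, and the derived quantities (genes with both promoter types, average spacing) are computed here rather than quoted from the source.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)


analyzed = 13350
with_promoter = 9653
with_tata = 6141
with_tata_less = 6717
closest_tata = 4465
closest_tata_less = 5188
total_bp = 27709288
total_promoters = 17717

# Fractions of all analyzed genes.
share_any = pct(with_promoter, analyzed)
share_tata = pct(with_tata, analyzed)
share_tata_less = pct(with_tata_less, analyzed)

# Fractions of genes with at least one predicted promoter.
share_closest_tata = pct(closest_tata, with_promoter)
share_closest_tata_less = pct(closest_tata_less, with_promoter)

# Derived values: genes with both promoter types (inclusion-exclusion)
# and the average sequence length per predicted promoter.
with_both = with_tata + with_tata_less - with_promoter
bp_per_promoter = round(total_bp / total_promoters)

print(share_any, share_tata, share_tata_less)
print(share_closest_tata, share_closest_tata_less)
print(with_both, bp_per_promoter)
```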
Figure 3. Distribution of distances (D) between the predicted TSS and the annotated start of mRNA (gray, 4465 TATA promoters; black, 5188 TATA-less promoters).
Figure 4. Distribution of distances (D) for the 6454 predicted TATA and TATA-less promoters located closer than 100 bp to the annotated start of mRNA.