Ilham A Shahmuradov¹,², Ramzan Kh Umarov¹, Victor V Solovyev³.
Abstract
Current knowledge of eukaryotic promoters points to a complex architecture that is often composed of numerous functional motifs. Most known promoters include multiple and, in some cases, mutually exclusive transcription start sites (TSSs). Moreover, TSS selection depends on cell or tissue type, developmental stage and environmental conditions. Such complex promoter structures make their computational identification notoriously difficult. Here, we present TSSPlant, a novel tool that predicts both TATA and TATA-less promoters in sequences from a wide spectrum of plant genomes. The tool was developed using large promoter collections from ppdb and PlantProm DB. It uses eighteen significant compositional and signal features of plant promoter sequences selected in this study, which feed an artificial neural network model trained by the backpropagation algorithm. TSSPlant achieves significantly higher accuracy than the next best promoter prediction program for both TATA promoters (MCC≃0.84 and F1-score≃0.91 versus MCC≃0.51 and F1-score≃0.71) and TATA-less promoters (MCC≃0.80 and F1-score≃0.89 versus MCC≃0.29 and F1-score≃0.50). TSSPlant is available to download as a standalone program at http://www.cbrc.kaust.edu.sa/download/.
Year: 2017 PMID: 28082394 PMCID: PMC5416875 DOI: 10.1093/nar/gkw1353
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Typical structure of TATA (A) and TATA-less (B) promoters. Assumed locations of promoter elements: [−41:−18] for TATA-box, [−13:+13] for INR, [−51:−1] for YP, [+20:+35] for DPE and [−200:−1] for TFBSs.
Characteristics of promoter sequences used for TATA and TATA-less promoter recognition and their Mahalanobis distances (D²)
| Features | D² for all TATA promoters | D² for all TATA-less promoters | D² for Arabidopsis TATA promoters | D² for rice TATA promoters | D² for Arabidopsis TATA-less promoters | D² for rice TATA-less promoters |
|---|---|---|---|---|---|---|
| TATA | 1.6624 | 1.6601 | 1.5282 | |||
| INR | 0.7034 | 0.4696 | 0.7497 | 0.8507 | 0.4457 | 0.4996 |
| YP | 0.6164 | 0.5439 | 0.4871 | 0.7343 | 0.5307 | 0.5639 |
| DPE | 0.5205 | 0.5026 | 0.5472 | |||
| d(TATA,TSS) | 1.927 | 2.0112 | 1.9948 | |||
| d(INR-TSS) | 0.4176 | 0.1793 | 0.2664 | |||
| 2-mers | 1.0373 | 0.755 | 1.0895 | 1.0634 | 0.7754 | 0.8721 |
| 3-mers | 1.1256 | 0.8601 | 1.2182 | 1.2085 | 0.9007 | 0.8551 |
| 4-mers/1 | 1.5704 | 1.4162 | 1.7217 | 1.3374 | 1.4515 | 1.3953 |
| 4-mers/2 | 1.2662 | 0.9569 | 1.4141 | 1.3758 | 0.9901 | 0.9712 |
| 5-mers/1 | 1.6258 | 1.4658 | 1.7591 | 1.4809 | 1.5083 | 1.4206 |
| 5-mers/2 | 1.3411 | 1.0035 | 1.4746 | 1.4586 | 1.0655 | 1.1006 |
| 6-mers/1 | 1.6218 | 1.4228 | 1.7976 | 1.6006 | 1.4653 | 1.4072 |
| 6-mers/2 | 1.3953 | 1.0603 | 1.5586 | 1.5375 | 1.1642 | 1.2121 |
| TFBS density 1 | 1.6109 | 1.1462 | 1.5653 | 1.3288 | 1.2248 | 0.9516 |
| TFBS density 2 | 0.6487 | 0.7394 | 0.4107 | 0.2506 | 0.6303 | 0.6937 |
| sk(CG) | 1.0399 | 0.7854 | 1.0066 | 1.0056 | 0.7881 | 0.6884 |
| sk(AC) | 0.9233 | 0.5937 | 1.1646 | 1.1962 | 0.5535 | 0.6112 |
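
The D2 values in the table (squared Mahalanobis distances, D²) express how well each feature separates promoter from non-promoter sequences. For a single scalar feature this reduces to the squared difference of the group means divided by a pooled within-group variance. The sketch below assumes that univariate, pooled-variance form; the exact estimator used for the table is not stated here, and the function name and toy scores are hypothetical.

```python
from statistics import mean, variance


def mahalanobis_d2(pos, neg):
    """Squared Mahalanobis distance between two groups for one scalar feature."""
    n1, n2 = len(pos), len(neg)
    # Pooled unbiased within-group variance (an assumption of this sketch).
    pooled = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    return (mean(pos) - mean(neg)) ** 2 / pooled


# Toy feature scores, not data from the study.
promoter_scores = [4.0, 5.0, 6.0]
background_scores = [1.0, 2.0, 3.0]
d2 = mahalanobis_d2(promoter_scores, background_scores)
print(d2)
```

A larger D² means the feature's distributions in the two groups overlap less, which is why it can be used to rank candidate features.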
Figure 2. Flow chart of the algorithm implemented in the TSSPlant program. Ttata is the threshold for a TATA box located in the [−160:−182] region of the promoter sequences. Ttotal is the threshold for selecting predicted TSSs.
Testing results for TATA and TATA-less promoters on sequences of 251 bp length
| Promoter class | TP | FN | TN | FP | Sn, % | Sp, % | F1-score, % | MCC |
|---|---|---|---|---|---|---|---|---|
| TATA¹ | 276 | 2 | 491 | 9 | 99.3 | 98.2 | 98.8 | 0.97 |
| TATA (At)² | 174 | 1 | 172 | 3 | 99.4 | 98.3 | 98.9 | 0.97 |
| TATA (Os)³ | 102 | 1 | 101 | 2 | 99.0 | 98.1 | 98.6 | 0.98 |
| TATA-less⁴ | 594 | 128 | 1457 | 43 | 82.3 | 97.1 | 88.9 | 0.83 |
| TATA-less (At)⁵ | 272 | 53 | 317 | 8 | 83.7 | 97.5 | 89.9 | 0.78 |
| TATA-less (Os)⁶ | 322 | 75 | 383 | 14 | 81.1 | 96.5 | 87.9 | 0.84 |

¹ 278 TATA promoters from A. thaliana and O. sativa.
² 175 TATA promoters from A. thaliana (At) only.
³ 103 TATA promoters from O. sativa (Os) only.
⁴ 722 TATA-less promoters from A. thaliana and O. sativa.
⁵ 325 TATA-less promoters from A. thaliana only.
⁶ 397 TATA-less promoters from O. sativa only.

Comparison of accuracies of four promoter prediction programs on TATA promoters, 50 positive and 50 negative sequences, each 251 bp long
| Set | Tool | TP | FN | TN | FP | Sn, % | Sp, % | F1-score, % | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Mixed, | TSSPlant | 48 | 2 | 43 | 7 | 96 | 86 | 91.4 | 0.84 |
| | NNPP | 31 | 19 | 43 | 7 | 62 | 86 | 70.5 | 0.51 |
| | TSSP | 28 | 22 | 48 | 2 | 56 | 96 | 70.0 | 0.58 |
| | Proscan | 3 | 47 | 49 | 1 | 6 | 98 | 11.1 | 0.11 |
| | TSSPlant | 48 | 2 | 44 | 6 | 96 | 88 | 92.3 | 0.84 |
| | NNPP | 32 | 18 | 43 | 7 | 64 | 86 | 71.9 | 0.51 |
| | TSSP | 30 | 20 | 48 | 2 | 60 | 96 | 73.2 | 0.58 |
| | Proscan | 3 | 47 | 49 | 1 | 6 | 98 | 11.1 | 0.11 |
| | TSSPlant | 47 | 3 | 42 | 8 | 94 | 84 | 89.5 | 0.78 |
| | NNPP | 30 | 20 | 43 | 7 | 60 | 86 | 69.0 | 0.48 |
| | TSSP | 26 | 24 | 48 | 2 | 52 | 96 | 66.7 | 0.54 |
| | Proscan | 3 | 47 | 49 | 1 | 6 | 98 | 11.1 | 0.11 |
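
The Sn, Sp, F1-score and MCC columns can be recomputed from the TP, FN, TN and FP counts. The sketch below uses the standard definitions, with Sp taken as TN/(TN+FP), which is what the tabulated Sp values correspond to. The function name is hypothetical, and the example counts are those of the TSSPlant row with TP=48, FN=2, TN=44, FP=6.

```python
from math import sqrt


def confusion_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, F1-score and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity (true negative rate)
    f1 = 2 * tp / (2 * tp + fp + fn)         # harmonic mean of precision and recall
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "F1": f1, "MCC": mcc}


m = confusion_metrics(tp=48, fn=2, tn=44, fp=6)
print({k: round(v, 3) for k, v in m.items()})
```

Rounded to the precision used in the table, these counts give Sn 96%, Sp 88%, F1-score 92.3% and MCC 0.84.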
Comparison of accuracies of four promoter prediction programs on TATA-less promoters, 50 positive and 50 negative sequences, each 251 bp long.
| Set | Tool | TP | FN | TN | FP | Sn, % | Sp, % | F1-score, % | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Mixed, | TSSPlant | 46 | 4 | 43 | 7 | 92 | 86 | 89.3 | 0.80 |
| | NNPP | 19 | 31 | 43 | 7 | 38 | 86 | 50.0 | 0.29 |
| | TSSP | 13 | 37 | 48 | 2 | 26 | 96 | 40.0 | 0.32 |
| | Proscan | 1 | 49 | 49 | 1 | 2 | 98 | 4.0 | 0.02 |
| | TSSPlant | 46 | 4 | 44 | 6 | 92 | 88 | 90.2 | 0.80 |
| | NNPP | 18 | 32 | 44 | 6 | 36 | 88 | 48.7 | 0.30 |
| | TSSP | 14 | 36 | 48 | 2 | 28 | 96 | 42.4 | 0.33 |
| | Proscan | 1 | 49 | 49 | 1 | 2 | 98 | 4.0 | 0.02 |
| | TSSPlant | 46 | 4 | 42 | 8 | 92 | 84 | 90.2 | 0.77 |
| | NNPP | 20 | 30 | 42 | 8 | 40 | 84 | 51.3 | 0.28 |
| | TSSP | 12 | 38 | 48 | 2 | 24 | 96 | 37.5 | 0.30 |
| | Proscan | 1 | 49 | 49 | 1 | 2 | 98 | 4.0 | 0.02 |
Comparison of accuracies of five promoter prediction programs on 1100-bp regions of plant protein coding genes with experimentally validated TSS
| Set | Tool | Genes with TSSpr⁴ | Total number of TSSpr | TP⁵ | FP | FN | Sn, % | F1-score, % |
|---|---|---|---|---|---|---|---|---|
| Mixed, dicots and monocots¹ | TSSPlant | 54 | 115 | 40 | 75 | 15 | 72.7 | 47.1 |
| | TSSP | 45 | 105 | 35 | 80 | 20 | 63.6 | 41.2 |
| | NNPP | 47 | 122 | 28 | 97 | 27 | 50.9 | 31.1 |
| | EP3² | 16 | 16 | 11 | 5 | 44 | 20.0 | 31.0 |
| | Proscan² | 10 | 10 | 5 | 5 | 50 | 9.1 | 15.4 |
| Dicots only | TSSPlant | 44 | 92 | 32 | 60 | 13 | 73.3 | 46.7 |
| | TSSP | 38 | 85 | 29 | 56 | 16 | 64.4 | 44.6 |
| | NNPP | 40 | 100 | 23 | 77 | 22 | 51.1 | 31.7 |
| | EP3² | 10 | 10 | 8 | 2 | 37 | 20.0 | 29.1 |
| | Proscan² | 8 | 8 | 4 | 4 | 41 | 8.9 | 15.1 |
| Monocots only | TSSPlant | 10 | 23 | 8 | 15 | 2 | 70.0 | 48.5 |
| | TSSP | 7 | 20 | 6 | 14 | 4 | 60.0 | 40.0 |
| | NNPP | 7 | 22 | 5 | 17 | 5 | 50.0 | 31.3 |
| | EP3² | 6 | 6 | 3 | 3 | 7 | 20.0 | 37.5 |
| | Proscan² | 2 | 2 | 1 | 1 | 9 | 10.0 | 16.7 |

¹ 55 genes with experimentally verified TSS from both dicots and monocots.
² 45 genes with experimentally verified TSS from dicots only (21 A. thaliana, 1 Phaseolus vulgaris, 9 G. max, 3 Nicotiana tabacum, 3 Nicotiana sylvestris, 8 Lycopersicon esculentum, 1 Beta vulgaris, 1 Ricinus communis and 1 Cucumis sativus genes).
³ 10 genes with experimentally verified TSS from monocots only (4 O. sativa, 3 Z. mays, 1 Avena sativa and 2 Hordeum vulgare genes).
⁴ A prediction is considered true if the distance between the annotated TSS and the predicted TSS (TSSpr) is 50 bp or less.
⁵ The EP3 and Proscan programs predict a wide transcription start region (250 and 400 bp, respectively).
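
The 50-bp matching rule in the footnotes can be sketched as follows. The sketch assumes one annotated TSS per gene and at most one true positive per gene, so that TP + FN equals the number of genes and every other prediction counts as a false positive. That bookkeeping is an assumption made for illustration and may differ from the scoring actually used. Function names and coordinates are hypothetical.

```python
def score_gene(annotated_tss, predicted, tolerance=50):
    """Return (tp, fp, fn) for one gene with a single annotated TSS."""
    hit = any(abs(p - annotated_tss) <= tolerance for p in predicted)
    tp = 1 if hit else 0
    fp = len(predicted) - tp   # every prediction other than the one credited hit
    fn = 1 - tp                # gene missed if no prediction is close enough
    return tp, fp, fn


def evaluate(genes, tolerance=50):
    """genes: iterable of (annotated_tss, list_of_predicted_tss_positions)."""
    tp = fp = fn = 0
    for annotated_tss, predicted in genes:
        g_tp, g_fp, g_fn = score_gene(annotated_tss, predicted, tolerance)
        tp, fp, fn = tp + g_tp, fp + g_fp, fn + g_fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return tp, fp, fn, sn, f1


genes = [
    (1000, [960, 1300]),  # 960 is 40 bp away: one TP, one FP
    (1000, [1051]),       # 51 bp away: counted as FP, gene is FN
    (1000, []),           # no prediction: FN
]
result = evaluate(genes)
print(result)
```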
Figure 3. The scoring landscape of experimentally validated TSSs for TATA promoters. Gray curve: distribution of NN scores that are higher than the prediction threshold, computed for each position of a 600-bp sequence around experimentally validated TSSs (300 bp upstream and downstream). Black curve: the number of predicted TSSs in these positions.
Figure 4. The scoring landscape of experimentally validated TSSs for TATA-less promoters. Gray curve: distribution of NN scores that are higher than the prediction threshold, computed for each position of a 600-bp sequence around experimentally validated TSSs (300 bp upstream and downstream). Black curve: the number of predicted TSSs in these positions.
Summary of genome-wide search for TSSs in seven plant genomes
| Organism | Genes analyzed | Genes with ≥1 TSSpr | TSSpr, all | Genes with TSSpr at distance ≤ 50 bp from gene start | Genes with TSSpr at distance ≤ 100 bp from gene start | TSSpr density¹ |
|---|---|---|---|---|---|---|
| | 22 333 | 22 258 | 52 924 | 11 827; ≃53% | 14 572; ≃66% | 464 |
| Zea mays | 23 467 | 23 330 | 53 265 | 11 993; ≃51% | 14 706; ≃63% | 485 |
| | 17 901 | 17 896 | 44 108 | 10 355; ≃58% | 12 802; ≃72% | 446 |
| | 38 718 | 38 702 | 101 550 | 20 202; ≃52% | 25 672; ≃66% | 419 |
| Medicago truncatula | 18 227 | 18 226 | 48 014 | 10 477; ≃58% | 12 838; ≃70% | 417 |
| | 17 650 | 17 645 | 45 517 | 8222; ≃47% | 10 669; ≃61% | 426 |
| | 11 080 | 11 035 | 27 800 | 4970; ≃45% | 6467; ≃59% | 438 |

¹ Reported as the average number of base pairs per predicted TSS: the sum of lengths of the gene sequences analyzed in a genome divided by the total number of predicted TSSs (TSSpr).
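
The per-genome columns of this summary can be derived from a list of gene starts and predicted TSS positions. The sketch below is illustrative only: the function names, input layout and toy coordinates are hypothetical. The last field divides the summed sequence length by the number of predictions, which is the reading that matches the magnitude of the tabulated densities (an assumption of this sketch).

```python
from bisect import bisect_left


def closest_distance(gene_start, predicted_sorted):
    """Distance from gene_start to the nearest value in a sorted, non-empty list."""
    i = bisect_left(predicted_sorted, gene_start)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = predicted_sorted[max(i - 1, 0): i + 1]
    return min(abs(p - gene_start) for p in candidates)


def summarize(genes, total_length):
    """genes: list of (gene_start, predicted_tss_positions) pairs."""
    with_pred = [(start, sorted(pred)) for start, pred in genes if pred]
    dists = [closest_distance(start, pred) for start, pred in with_pred]
    n_pred = sum(len(pred) for _, pred in genes)
    return {
        "genes_analyzed": len(genes),
        "genes_with_tsspr": len(with_pred),
        "tsspr_total": n_pred,
        "within_50": sum(d <= 50 for d in dists),
        "within_100": sum(d <= 100 for d in dists),
        "bp_per_tsspr": total_length / n_pred if n_pred else None,
    }


# Toy input: three 1100-bp regions with the gene start at position 500.
toy_genes = [(500, [480, 900]), (500, [620]), (500, [])]
summary = summarize(toy_genes, total_length=3 * 1100)
print(summary)
```

The list of closest distances (`dists`) is also the quantity whose distribution a histogram of predicted-TSS-to-gene-start distances would show.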
Figure 5. Distribution of distances between the closest predicted TSS and gene start (annotated TSS, TSSan) for 23 330 protein-coding genes of Zea mays.
Figure 6. Distribution of distances between the closest predicted TSS and gene start (TSSan) for 18 226 protein-coding genes of Medicago truncatula.