
TSSPlant: a new tool for prediction of plant Pol II promoters.

Ilham A Shahmuradov¹,², Ramzan Kh Umarov¹, Victor V Solovyev³.

Abstract

Our current knowledge of eukaryotic promoters indicates a complex architecture that is often composed of numerous functional motifs. Most known promoters include multiple, and in some cases mutually exclusive, transcription start sites (TSSs). Moreover, TSS selection depends on cell/tissue type, developmental stage and environmental conditions. Such complex promoter structures make their computational identification notoriously difficult. Here, we present TSSPlant, a novel tool that predicts both TATA and TATA-less promoters in sequences of a wide spectrum of plant genomes. The tool was developed using large promoter collections from ppdb and PlantProm DB. It utilizes eighteen significant compositional and signal features of plant promoter sequences selected in this study, which feed an artificial neural network model trained by the backpropagation algorithm. TSSPlant achieves significantly higher accuracy than the next best promoter prediction program for both TATA promoters (MCC≃0.84 and F1-score≃0.91 versus MCC≃0.51 and F1-score≃0.71) and TATA-less promoters (MCC≃0.80 and F1-score≃0.89 versus MCC≃0.29 and F1-score≃0.50). TSSPlant is available to download as a standalone program at http://www.cbrc.kaust.edu.sa/download/.

Year: 2017 PMID: 28082394 PMCID: PMC5416875 DOI: 10.1093/nar/gkw1353

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In nuclear genomes of eukaryotic organisms, transcription of all protein-coding genes and most non-coding RNA genes, as well as of DNA regions of unknown function, is performed by RNA polymerase II (Pol II). Pol II promoters are key players in the regulation of gene expression, and their accurate identification has both fundamental and practical significance. Currently, there are many publicly available collections of transcripts and promoter sequences with annotated transcription start sites (TSS). However, for entire genomes, these data represent only a small fraction of the actual number of promoters, which control gene transcription depending on cell/tissue type, developmental stage and intra/inter-cellular signals (for review see: 1–4). Promoters have a complex and gene-specific architecture. Therefore, it is extremely difficult to devise a general strategy for their computational prediction. Promoter sequences contain multiple short DNA motifs of 5–10 bases that serve as binding sites for transcription factors (TFs) involved in specific regulation of transcription, and each promoter has a unique composition of TF binding sites (TFBSs; for review see: 3,5). A core promoter is a minimal promoter region that is capable of initiating basal transcription. It contains a TSS and typically spans from −60 to +40 bp relative to a TSS. A region from 200 to 300 bp immediately upstream of a TSS is commonly referred to as a proximal promoter (see reviews: 3,4). Because a promoter is defined relative to its TSS, ‘promoter prediction’ and ‘TSS prediction’ are used interchangeably. There are numerous computational tools developed in an attempt to predict TSSs or transcription start regions (TSRs). In this work, ‘TSS prediction’ means prediction of the proximal promoter region [−200:+51], where the reported TSS position corresponds to position +1 in this region. Promoter studies indicate that 30–50% of all known promoters contain a TATA-box located from 40 to 18 bp upstream of the TSS.
A TATA-box is apparently the most conserved and one of the key functional signals of eukaryotic promoters. Many highly expressed genes contain a strong TATA-box in their core promoter. However, promoters of many large groups of genes (e.g. housekeeping genes) lack a TATA-box; the corresponding promoters are referred to as TATA-less promoters. The TSS in such promoters is determined by other core-promoter elements, such as the Initiator element (INR) and the downstream promoter element (DPE). A region 200–300 bp immediately upstream of a core promoter constitutes a proximal promoter and contains multiple regulatory elements (REs) responsible for specific transcription regulation. The distal part of a promoter spans further upstream and is involved mainly in substantially increasing or decreasing the activity of the transcription machinery via enhancers and silencers, respectively (1–6). The typical architecture of TATA and TATA-less promoters is shown in Figure 1.

Figure 1.

Typical structure of TATA (A) and TATA-less (B) promoters. Assumed locations of promoter elements: [−41:−18] for TATA-box, [−13:+13] for INR, [−51:−1] for YP, [+20:+35] for DPE and [−200:−1] for TFBSs.

To study details of plant promoter architecture, a comparative analysis of thirteen promoter elements (TATA-box, INR, YP, CG skew and some TFBSs) with respect to TSS across monocot (Brachypodium distachyon, Oryza sativa ssp. japonica, Sorghum bicolor and Zea mays) and dicot (Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Glycine max) plants has been performed (7–9). The distribution of the analyzed promoter elements showed positional conservation within monocots and dicots, with some differences across species. Thanks to the development of advanced experimental techniques, significant progress was made in the analysis of gene regulatory sequences (for review see: 10–12). However, a detailed experimental exploration of transcripts is still an expensive and difficult procedure. Therefore, in addition to experimental efforts, accurate computational identification of putative promoter regions, for both individual genes and entire genomes, remains an important task of genomics and post-genomics studies. Over the last decade, relatively more accurate promoter prediction programs were developed. They can be divided into two groups: (i) programs that try to predict locations of promoter regions upstream of the TSS of known genes and (ii) tools that focus on finding a TSS. Recent studies indicate that there often exists an entire TSR with multiple TSSs that are used at different frequencies, rather than a single TSS (13,14). It seems, however, that predicting TSRs spanning several hundred nucleotides (from 250 up to 1000) as identifiers of gene start points is not very useful for genome annotation projects. For such tasks, finding the TSSs seems to be more informative. Abeel et al. (15) have developed a promoter prediction tool, EP3, that takes into account the presence or absence of a TATA-box, INR and TF IIB recognition element (BRE) as well as CpG islands in a promoter sequence, and the presence or absence of peaks and clefts in the structural profile of DNA. The program searches for TSRs of 400 bp, and if the predicted 400-bp region includes an annotated TSS, the prediction is evaluated as a true prediction. To define a gold standard for promoter prediction, Abeel et al. (14) compared EP3 and 16 other programs searching for TSRs (ARTS, CpGcluster, CpGProD, DragonGSF, DragonPF, Eponine, FirstEF, Nscan, NNPP2.2, PSPA, ProSOM, McPromoter, Promoter2.0, PromoterExplorer, PromoterScan and Wu-method; for references see: 14). The human genome (release hg18) was analyzed in that study. They used a common evaluation strategy comprising an in-depth analysis of predictive performance, promoter class specificity, overlap between predictors and positional bias of predictions. Four tools (ARTS, Eponine, EP3 and ProSOM) had a prediction score over 25%, with the best score (34%) for ARTS. Analysis of pairwise prediction overlaps demonstrated, however, that the ProSOM program gave the highest level of true promoter predictions cross-overlapped with results obtained by the other tools (ARTS, Eponine, EP3). Although most known computer tools try to predict promoters in human and animal sequences, there are some computer programs aimed at predicting plant promoters (16–23). In particular, the TSSP-TCM program was developed using transductive confidence machine techniques (18). The program has demonstrated quite high accuracy of TSS prediction in test sequences with experimentally validated TSSs: 87.5 and 84% for TATA and TATA-less promoters, respectively. Moreover, results of genome-wide prediction of promoters of annotated protein-coding genes of A. thaliana were in good agreement with the start positions of known mRNAs.
Another computational plant promoter predictor, PromPredict, is based on the difference in DNA stability of neighboring upstream and downstream regions relative to the experimentally determined TSSs (20,21). Genome-wide studies in A. thaliana and O. sativa genomes (−2000:+2000 regions around annotated TSSs) resulted in the following predictions: for protein-coding genes, recall and precision were 96 and 42% for Arabidopsis, 97 and 31% for rice, respectively; for non-coding RNA genes, recall and precision were 94 and 75% for Arabidopsis, 95 and 90% for rice, respectively. However, that tool searches for a TSR rather than a TSS: the authors considered the regions [−500:+100] and [−1000:+1] around the TSS (+1) as the true positive (TP) TSR for the protein-coding and ncRNA genes, respectively. Applying the support vector machine (SVM) technique, Azad et al. (22) have developed a novel PromoBot tool with 89% sensitivity and 86% specificity. Comparative analysis of PromoBot and other promoter predictors (NNPP 2.2, Promoter 2.0, TSSP-TCM, Promoter Scan 1.7, and PromMachine) on short (251 bp) promoter and non-promoter sequences showed that NNPP 2.2, TSSP-TCM, PromMachine and TSSP (http://www.softberry.com/berry.phtml) had prediction accuracy comparable with PromoBot, whereas Promoter Scan v1.7 and Promoter 2.0 demonstrated very poor performance. Finally, by analyzing the GC-compositional bias and specific structural patterns of TATA and TATA-less promoters in PlantProm DB (24), and applying SVM, Zuo and Li (23) have developed a hybrid multi-feature approach for predicting TATA and TATA-less promoters. Compared with the TSSP-TCM program on the same test dataset, the authors reported slightly better prediction results for their tool. Generally, the accuracy of any predictor depends on (i) the quality of training and testing datasets and (ii) the features used to distinguish promoter and non-promoter sequences.
The first collection of plant promoters with experimentally validated TSS was included in the Eukaryotic Promoter Database, which mainly consists of animal gene promoters; its current release has promoters from just two plant organisms (A. thaliana and Z. mays) (25). We have developed the PlantProm database (DB), an annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II (Pol II) with experimentally determined TSS(s) comprising 86 plant species (21). The latest version of that DB contains 576 entries including 150, 403 and 23 promoters of monocot, dicot and other plant genes, respectively (26; http://www.softberry.com/plantprom2016/). The database provides promoter DNA sequences and nucleotide composition characteristics upstream and downstream of the TSS (−200:+51), taxonomic/promoter type classification of promoters and Nucleotide Frequency Matrices (NFMs) for important promoter elements such as TATA-box, INR-motif, pyrimidine patches (Y Patch; YP) and DPE. To date, the largest collection of experimentally identified plant promoters is ppdb (27,28): its latest version contains promoters for the plant genomes of Arabidopsis, rice, Physcomitrella patens and poplar. Moreover, new TSS tag data (34 million) from A. thaliana, determined by high throughput sequencing, have been added to give a ∼200-fold increase in TSS data compared to version 1.0. This resulted in coverage of ∼27000 A. thaliana genes and finer positioning of promoters even for genes with a low expression level. In this paper, we present a new computer tool for predicting plant promoters for RNA polymerase II, TSSPlant, which was developed using large collections of promoter sequences from ppdb and PlantProm DB. This new program compares favorably with other known and available promoter predictors and was applied for genome-wide analysis of promoter sequences in seven plant genomes.

MATERIALS AND METHODS

In the Plant Promoter Database (ppdb), version 2.0, 15 655 and 14 214 TSSs for A. thaliana and O. sativa japonica, respectively, are annotated (28, http://ppdb.agr.gifu-u.ac.jp/ppdb/cgi-bin/index.cgi). Of them, 11 966 Arabidopsis and 11 387 rice TSSs are assigned to annotated protein-coding genes. However, some of these TSSs are located very close (at a distance of a few nucleotides) to the annotated CDS. In this study, we selected 12 682 (3597 Arabidopsis and 9085 rice) TSSs where every TSS is located at a distance of at least 300 bp from its neighboring TSS. Then, using the genome annotations and sequences of Arabidopsis (TAIR, version 6.0; https://www.arabidopsis.org/) and rice (IRGSP, build 3; http://rapdb.dna.affrc.go.jp/) genomes (the same releases as used in ppdb) we created the promoter set of 251-bp sequences (200 bp upstream and 51 bp downstream of a TSS). To get a non-redundant set of promoters, we performed an intra-species pairwise BLAST (29) comparison of the 3597 and 9085 sequences. We found 215 pairs (430 sequences) showing full-length similarity of 80% or higher. After excluding one sequence from every such pair, we finally generated a set of 12 381 promoter sequences from the ppdb. The current release of PlantProm DB (at http://www.softberry.com/plantprom2016/) includes 576 plant promoter sequences of 251 bp length where all TSSs have direct experimental validation. The pairwise BLAST comparison of these promoter sequences revealed 9 pairs (18 sequences) showing full-length similarity of 80% or higher. After excluding one sequence from every such pair, we got 567 promoter sequences from the PlantProm DB, including 50 rice and 106 Arabidopsis promoters. Twenty rice and 66 Arabidopsis promoter sequences (86 in total) are present in both ppdb and PlantProm DB sets, and therefore those 86 sequences were excluded from PlantProm DB set. By merging ppdb and PlantProm DB sets, we created a final set of 12 948 plant promoters.
Out of them, 11 893 promoters, including 512 sequences from the PlantProm DB, were used as a learning set. For testing, we selected two sets: Set 1 of 1000 sequences from ppdb only, and Set 2 of 55 promoters only from PlantProm DB. Lengths of sequences in the learning set and test Set 1 were 251 bp. Promoter sequences of Set 2 were extended to 1101 bp: 1000 bp upstream and 101 bp downstream of TSS (Supplementary Tables S1 and 2). It should be noted that TSS data from ppdb were obtained from several plant tissues. However, we merged all of them into a single collection, although an existence of some tissue-specific structural differences between promoters utilized in different tissues can't be excluded. Moreover, the majority of collected promoter sequences belong to only two species, monocot rice and dicot Arabidopsis; other monocot and dicot species are represented in these sets by only 327 promoters with experimentally identified exact position of TSS (Supplementary Table S1). We realized that promoters may have some species-specific characteristics. To estimate their influence on accuracy, we performed two kinds of additional tests: (i) to explore differences in the compositional characteristics of Arabidopsis and rice promoters, (ii) to evaluate the prediction accuracy of our new tool in sequences from other monocot and dicot species. To train and test our promoter prediction models, we also generated a negative dataset. From Arabidopsis and rice genome annotations we extracted 18 550 Arabidopsis and 14 46 rice genomic sequences composed of CDSs (starting from second exon) and introns; every sequence had length of 251 bp; the full-length similarity level between any two sequences of the set was less than 80%. In total, we selected 266 503 such sequences (130 431 and 136 072 sequences from Arabidopsis and rice, respectively; Supplementary Table S3). In both learning and testing procedures, the negative sets for TATA and TATA-less promoters were different. 
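The 251-bp promoter windows described above (200 bp upstream and 51 bp downstream of a TSS, with the TSS at +1) can be illustrated with a minimal Python sketch. The function name `promoter_window` and its parameters are hypothetical and are not part of TSSPlant; the sketch handles the forward strand only and assumes 1-based TSS coordinates.

```python
from typing import Optional


def promoter_window(genome: str, tss: int,
                    upstream: int = 200, downstream: int = 51) -> Optional[str]:
    """Return the [-upstream:+downstream] window around a 1-based TSS.

    The TSS itself is position +1, so the window holds `upstream` bases
    before the TSS plus `downstream` bases starting at the TSS.
    Forward strand only (an assumption made for brevity).
    """
    start = tss - 1 - upstream    # 0-based index of the first base in the window
    end = tss - 1 + downstream    # exclusive end index
    if start < 0 or end > len(genome):
        return None               # window would run off the sequence
    return genome[start:end]


# Toy sequence: the TSS base 'G' sits at 1-based position 211.
toy = "T" * 10 + "A" * 200 + "G" + "C" * 50
window = promoter_window(toy, tss=211)
print(len(window), window[200])
```

With the default parameters the TSS always lands at position 201 of the returned window, which matches the window layout used when scanning long sequences.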
For genome-wide search of putative promoters (TSSs) in higher plants, we selected 5΄-region sequences of protein coding genes of seven species from the Ensembl genome annotation system (http://plants.ensembl.org/info/website/ftp/index.html): monocots O. sativa japonica (35 655 genes; genome assembly IRGSP-1.0) and Z. mays (36 988 genes; genome assembly AGPv3), dicots A. thaliana (27 201 genes; genome assembly TAIR10), Medicago truncatula (47 202 genes; genome assembly MedtrA17_4.0), P. trichocarpa (38 449 genes; genome assembly JGI2.0), V. vinifera (26 118 genes; genome assembly IGGP_12x) and G. max (53 151 genes; genome assembly v1.0). Out of them, only genes with annotated 5΄-UTRs of length 20 bp or more were taken; if several gene (mRNA) start points for a gene were annotated, we selected only the variant with the longest 5΄-UTR. For the promoter search, we extracted [−1000:+101] regions of these genes, where +1 corresponds to the start position of a gene. Finally, we obtained [−1000:+101] regions for 22 333, 23 467, 17 901, 18 227, 17 651, 11 080 and 38 718 genes from O. sativa, Z. mays, A. thaliana, M. truncatula, P. trichocarpa, V. vinifera and G. max, respectively. Data on TFBSs were obtained from Regsite DB (www.softberry.com; Plant division) that contained 1976 TFBSs. We utilized the Expectation Maximization (EM) algorithm (30; see Supplementary Data) to compute NFMs for core-promoter elements, such as TATA-box, INR, DPE and YP. Then, a TATA-box NFM was applied to classify promoter sequences into TATA and TATA-less promoters. We used previously obtained matrices for INR, YP and DPE elements (23, 26) to compute NFMs for INR and YP, separately for TATA and TATA-less promoters, and an NFM for DPE for TATA-less promoters. To score an asymmetric nucleotide composition in DNA, such as CG skew, sk(CG), and AC skew, sk(AC), we applied formulas described in (31).
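The skew formulas themselves are given in (31) and are not reproduced in this text. As an illustration only, the sketch below uses a commonly used definition of compositional skew between two bases X and Y, (X − Y)/(X + Y); this definition, the function name `skew` and the choice to return 0.0 when neither base occurs are assumptions and may differ from the formulas actually applied in TSSPlant.

```python
def skew(seq: str, x: str, y: str) -> float:
    """Compositional skew (x - y) / (x + y) over a sequence.

    Assumed definition for illustration; the paper defers to reference (31).
    Returns 0.0 when neither base occurs, to avoid division by zero.
    """
    seq = seq.upper()
    n_x, n_y = seq.count(x), seq.count(y)
    if n_x + n_y == 0:
        return 0.0
    return (n_x - n_y) / (n_x + n_y)


region = "CCCGATAT"
print(skew(region, "C", "G"))  # sk(CG) under the assumed definition
print(skew(region, "A", "C"))  # sk(AC) under the assumed definition
```

In the tool, such scores are computed over the [−200:+20] region of each candidate promoter window.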
Scoring of some oligomer characteristics of promoter sequences, such as dimers (2-mers), triplets (3-mers), tetramers (4-mers), pentamers (5-mers) and hexamers (6-mers), was performed as described by Rani and Bapi (32). Mahalanobis distance (D2) was calculated to estimate the significance of each characteristic (33). Rules and formulas for computations of scores (weights) of promoter characteristics are presented in the Supplementary Data. A classification model that best separates promoter from non-promoter sequences was obtained using the neural network (back propagation learning algorithm) software module of the interactive data exploration package VISAN (Visual Analysis) (http://www.softberry.com/berry.phtml?topic=fdp.htm&no_menu=on; for details see Supplementary Data and Supplementary Figure S1). To estimate the performance of promoter predictors, we used the following statistical measures based on the numbers of TPs, true negatives (TN), false positives (FP) and false negatives (FN): sensitivity (Sn), specificity (Sp), F1 score (the harmonic mean of precision and recall, i.e. sensitivity) and Matthews correlation coefficient, MCC (34,35). These accuracy measures are briefly described in the Supplementary Data.
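For reference, the four measures can be computed from TP, FN, TN and FP counts as in the following Python sketch. It uses the standard textbook definitions (Sn = TP/(TP+FN), Sp = TN/(TN+FP), F1 as the harmonic mean of precision and sensitivity, and the usual MCC formula); the exact formulas used by the authors are in the Supplementary Data, so small differences from the reported values are possible. The function name `classification_metrics` is hypothetical.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard Sn, Sp, F1 and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0          # specificity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "F1": f1, "MCC": mcc}


# Counts taken from the first TSSPlant row of Table 3 (TP=48, FN=2, TN=43, FP=7).
m = classification_metrics(tp=48, fn=2, tn=43, fp=7)
print({name: round(value, 3) for name, value in m.items()})
```

For these counts the standard definitions reproduce the tabulated Sn (96%), Sp (86%) and F1-score (91.4%).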

RESULTS AND DISCUSSION

Classification of promoters

Using previously published plant TATA-box NFM computed utilizing 345 promoters from PlantProm DB (http://www.softberry.com/data/plantprom/Links/PLPR_TATA.h) and applying an EM procedure, learning and testing sets were divided into two classes: 3395 TATA promoters and 8498 TATA-less promoters in a learning set, 278 TATA promoters and 722 TATA-less promoters in a test Set 1. While calculating TATA NFM, an allowed range of variation of a distance between right boundary of TATA-core (TATAWAWA) box and a TSS was 18–40 bp. As a result of EM procedure for 3395 TATA promoters of a learning set, we computed a new NFM for a TATA-box that is in good agreement with an initial TATA-matrix (for 345 promoters).

Features used to predict promoters

To characterize promoter sequences, we used several significant compositional and signal features of plant promoters found in our previous studies (18,36), combined with features selected in this work. To distinguish promoter and non-promoter sequences, we analyzed about 30 different sequence characteristics computed on the positive and negative learning sets described above. Based on the values of Mahalanobis distances (D2) of individual characteristics reflecting power of each feature to distinguish between promoter and non-promoter sequences, we selected 17 and 15 intrinsic features of TATA and TATA-less promoters, respectively (Table 1). We grouped these selected features into the following categories:

Table 1.

Characteristics of promoter sequences used for TATA and TATA-less promoter recognition and their Mahalanobis distances

Features	D² for all TATA promoters	D² for all TATA-less promoters	D² for Arabidopsis TATA Promoters	D² for rice TATA promoters	D² for Arabidopsis TATA-less promoters	D² for rice TATA-less Promoters
TATA	1.6624		1.6601	1.5282
INR	0.7034	0.4696	0.7497	0.8507	0.4457	0.4996
YP	0.6164	0.5439	0.4871	0.7343	0.5307	0.5639
DPE		0.5205			0.5026	0.5472
d(TATA,TSS)	1.927		2.0112	1.9948
d(INR-TSS)	0.4176		0.1793	0.2664
2-mers	1.0373	0.755	1.0895	1.0634	0.7754	0.8721
3-mers	1.1256	0.8601	1.2182	1.2085	0.9007	0.8551
4-mers/1	1.5704	1.4162	1.7217	1.3374	1.4515	1.3953
4-mers/2	1.2662	0.9569	1.4141	1.3758	0.9901	0.9712
5-mers/1	1.6258	1.4658	1.7591	1.4809	1.5083	1.4206
5-mers/2	1.3411	1.0035	1.4746	1.4586	1.0655	1.1006
6-mers/1	1.6218	1.4228	1.7976	1.6006	1.4653	1.4072
6-mers/2	1.3953	1.0603	1.5586	1.5375	1.1642	1.2121
TFBS density 1	1.6109	1.1462	1.5653	1.3288	1.2248	0.9516
TFBS density 2	0.6487	0.7394	0.4107	0.2506	0.6303	0.6937
sk(CG)	1.0399	0.7854	1.0066	1.0056	0.7881	0.6884
sk(AC)	0.9233	0.5937	1.1646	1.1962	0.5535	0.6112
Total D²	6.3115	2.4047	5.8681	5.9468	2.2094	2.2153

Promoter elements: TATA-box for TATA promoters (search region −42:−19, relative to TSS), INR (−14:+14) for both TATA and TATA-less promoters, DPE for TATA-less promoters (+20:+35), YP (−50:+35 for TATA promoters, −50:+1 for TATA-less promoters).

Distances (d) between promoter elements and TSS: d(TATA,TSS) and d(INR,TSS) (for TATA promoters only).

Oligomers: eight features were formulated based on calculated scores for 2-mers, 3-mers, 4-mers, 5-mers and 6-mers in different segments of promoter sequences: (i) 2-mers in region [−20:+21], (ii) 3-mers in region [−20:+21], (iii) 4-mers/1 in [−200:−1], (iv) 4-mers/2 in [+1:+51], (v) 5-mers/1 in [−200:−1], (vi) 5-mers/2 in [+1:+51], (vii) 6-mers/1 in [−200:−1] and (viii) 6-mers/2 in [+1:+51].

Density of TFBSs: this category contains two features, TFBS density1, applied to the sense strand only within the interval [−200:−1], and TFBS density2, applied within the same interval but on both strands.

Variations in base frequencies along the sequence: sk(CG) and sk(AC) in the [−200:+20] region for both TATA and TATA-less promoters.

Our analysis demonstrates that, as a rule, features for TATA-less promoters have weaker predictive power than those for TATA promoters. This observation may reflect the fact that TATA-less promoters contain more gene-specific elements, which makes it difficult to develop a general-purpose approach for predicting members of this promoter class. It should also be noted that initially we tried to develop a new promoter prediction tool trained and tested separately for monocot and dicot sequences. For this purpose, we also calculated Mahalanobis distances of individual promoter characteristics separately for rice and Arabidopsis. Our studies, however, have not revealed any significant differences between monocot and dicot representatives (Table 1, columns 4–6). As a result, we merged both positive and negative sets from monocots (mostly rice) and dicots (mostly Arabidopsis) into a single pair of combined sets.
Further tests confirmed that the combination of sets was the right approach.
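To make the role of the D² values in Table 1 more concrete, the following Python sketch computes a one-dimensional Mahalanobis distance between the score distributions of a single feature on a positive and a negative set, using a pooled sample variance. This is a textbook form chosen for illustration; the formulas actually used are given in the Supplementary Data and may differ, and the function name `mahalanobis_d2` and the toy scores are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence


def mahalanobis_d2(pos: Sequence[float], neg: Sequence[float]) -> float:
    """One-dimensional D^2 = (mean_pos - mean_neg)^2 / pooled variance.

    Each sample needs at least two values so that a sample variance exists.
    """
    n_p, n_n = len(pos), len(neg)
    pooled = ((n_p - 1) * variance(pos) + (n_n - 1) * variance(neg)) / (n_p + n_n - 2)
    if pooled == 0:
        raise ValueError("pooled variance is zero; D^2 is undefined")
    return (mean(pos) - mean(neg)) ** 2 / pooled


# Toy feature scores (not real data): promoters score higher than non-promoters.
promoter_scores = [2.0, 4.0]
background_scores = [0.0, 2.0]
print(mahalanobis_d2(promoter_scores, background_scores))
```

A larger D² means the feature separates promoter from non-promoter sequences more strongly. For several features taken jointly, the distance generalizes to a quadratic form with the pooled covariance matrix, which is why a total D² need not equal the sum of the individual values.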

TSSPlant: the plant promoter prediction tool

Using a combination of 17 and 15 features for TATA and TATA-less promoters, respectively, we applied the Neural Network (NN) technique to obtain recognition function (classifier) parameters that best distinguish between promoter and non-promoter sequences, building a separate NN classifier for each of the two promoter classes. Then, we embedded these classifiers into the TSSPlant program. The TSSPlant sequence processing workflow is shown in Figure 2. In addition, Supplementary Figure S2 shows a general scheme for computation of scores (weights) of the promoter features used in the NN algorithm that was implemented in the TSSPlant program.

Figure 2.

Flow-chart of the algorithm implemented in the TSSPlant program. Ttata is a threshold for TATA box located in a region of [−160:−182] of the promoter sequences. Ttotal is a threshold for selecting predicted TSSs.

The program analyzes each window of 251 bp over a query sequence, sliding one nucleotide at a time, where position 201 is classified as TSS or non-TSS. The classification is based on a threshold that was computed during training. Predictions with a score higher than the threshold are marked as putative TSSs. The program performs additional filtering by discarding all but the highest-scoring TSS in intervals of a user-specified length (default is 300 bp). Depending on user choice, TSSPlant can output: (i) all predicted TSSs for both TATA and TATA-less promoter classes, (ii) only the TSS of the TATA promoter class, if TSSs of both TATA and TATA-less classes are found in the 300 bp interval, or (iii) the TSS with the highest score (regardless of the promoter class).
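The scanning and filtering steps described above can be sketched in Python as follows. This is not the TSSPlant implementation: the classifier is replaced by a caller-supplied scoring function (`score_window`, hypothetical), and the interval filter is written as a greedy pass that keeps the highest-scoring prediction and drops any other prediction closer than the interval length, which is only one plausible reading of the filtering rule.

```python
from typing import Callable, List, Tuple

WINDOW = 251       # window length analyzed at each step
TSS_OFFSET = 200   # 0-based offset of the candidate TSS (position 201) in a window

Hit = Tuple[int, float]  # (1-based TSS position in the query, score)


def scan(seq: str, score_window: Callable[[str], float], threshold: float) -> List[Hit]:
    """Slide a 251-bp window one nucleotide at a time and collect hits above threshold."""
    hits = []
    for start in range(len(seq) - WINDOW + 1):
        score = score_window(seq[start:start + WINDOW])
        if score > threshold:
            hits.append((start + TSS_OFFSET + 1, score))
    return hits


def keep_best_per_interval(hits: List[Hit], interval: int = 300) -> List[Hit]:
    """Greedy filter (assumed): visit hits by descending score, keep a hit only
    if it is at least `interval` bp away from every hit already kept."""
    kept: List[Hit] = []
    for pos, score in sorted(hits, key=lambda h: -h[1]):
        if all(abs(pos - kept_pos) >= interval for kept_pos, _ in kept):
            kept.append((pos, score))
    return sorted(kept)


# Toy scorer standing in for the neural network: fires when the candidate base is 'G'.
toy_scorer = lambda w: 1.0 if w[TSS_OFFSET] == "G" else 0.0
query = "A" * 200 + "G" + "A" * 55
print(scan(query, toy_scorer, threshold=0.5))
print(keep_best_per_interval([(100, 0.9), (250, 0.95), (500, 0.7), (900, 0.8)]))
```

In a real run there would be two scorers, one per promoter class, and the choice among output modes (i) to (iii) would be applied after filtering.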

Evaluating TSSPlant performance

We evaluated our TSS finder on positive and negative test sets of TATA and TATA-less promoter classes (Table 2). For TATA class, we observed, as expected, very high prediction accuracy, with sensitivity ≃99%, specificity ≃98%, F1 score ≃99% and MCC 0.97. In case of TATA-less promoters, we also achieved relatively high accuracy (sensitivity ≃82%, specificity ≃97%, F1 score ≃89% and MCC 0.83), although lower than for TATA promoters. This difference in performance for two promoter classes may be caused by the following factors: (i) TATA-box alone plays critical role in determining a start point of transcription, and therefore its presence ensures more accurate predictions; and (ii) TATA-less promoters may have significantly more gene-specific architecture that makes it very difficult to devise a classifier that will recognize all their varieties. Note that even if we have a very accurate classifier to recognize promoter regions, identification of exact TSS locations in long genomic sequences is still a difficult task, as we need to select one promoter location among many overlapping sequence segments having almost the same promoter features.

Table 2.

Testing results for TATA and TATA-less promoters on sequences of 251 bp length

Promoter class	TP	FN	TN	FP	Sn, %	Sp, %	F1-score, %	MCC
TATA¹	276	2	491	9	99.3	98.2	98.8	0.97
TATA (At)²	174	1	172	3	99.4	98.3	98.9	0.97
TATA (Os)³	102	1	101	2	99.0	98.1	98.6	0.98
TATA-less⁴	594	128	1457	43	82.3	97.1	88.9	0.83
TATA-less (At)⁵	272	53	317	8	83.7	97.5	89.9	0.78
TATA-less (Os)⁶	322	75	383	14	81.1	96.5	87.9	0.84

¹ 278 TATA promoters from A. thaliana and O. sativa.

² 175 TATA promoters from A. thaliana (At) only.

³ 103 TATA promoters from O. sativa (Os) only.

⁴ 722 TATA-less promoters from A. thaliana and O. sativa.

⁵ 325 TATA-less promoters from A. thaliana only.

⁶ 397 TATA-less promoters from O. sativa only.

Moreover, we did not observe a significant difference in the prediction accuracy for rice and Arabidopsis sets (see Table 2: columns Sn, Sp, F1 and MCC for different sets). Figures shown are the means of 10 separate experiments on randomly selected negative sets for TATA and TATA-less promoters.

Comparing TSSPlant to other promoter identification tools

Most of previously published plant promoter prediction programs are not currently obtainable. We compared our TSSPlant predictor with tools available for on-line execution or downloading and local installation: NNPP (37, http://www.fruitfly.org/cgi-bin/seq_tools/promoter.pl/, version 2.2), Proscan (38, http://www-bimas.cit.nih.gov/molbio/proscan/, version 1.7), EP3 (15, http://bioinformatics.psb.ugent.be/webtools/ep3/) and TSSP (http://www.softberry.com/berry.phtml). Input sequences for EP3 must be at least 400 bp long, so we could not run EP3 on short (251 bp) input sequences, while we included that program in our tests on long genomic sequences. First, we compared NNPP, Proscan, TSSP and TSSPlant tools on positive and negative sets of short (251 bp) sequences, randomly chosen from TATA and TATA-less test sets. Results of this analysis are shown in Tables 3 and 4. For both TATA and TATA-less promoters, the results show significantly higher prediction accuracy of TSSPlant. Again, we did not observe meaningful difference in prediction accuracies for rice and Arabidopsis sets (see: columns SN, SP, F1 and MCC for different sets).

Table 3.

Comparison of accuracies of four promoter prediction programs on TATA promoters, 50 positive and 50 negative sequences, each 251 bp long

Set	Tool	TP	FN	TN	FP	Sn, %	Sp, %	F1-score, %	MCC
Mixed, A. thaliana and O. sativa	TSSPlant	48	2	43	7	96	86	91.4	0.84
	NNPP	31	19	43	7	62	86	70.5	0.51
	TSSP	28	22	48	2	56	96	70.0	0.58
	Proscan	3	47	49	1	6	98	11.1	0.11
A. thaliana only	TSSPlant	48	2	44	6	96	88	92.3	0.84
	NNPP	32	18	43	7	64	86	71.9	0.51
	TSSP	30	20	48	2	60	96	73.2	0.58
	Proscan	3	47	49	1	6	98	11.1	0.11
O. sativa only	TSSPlant	47	3	42	8	94	84	89.5	0.78
	NNPP	30	20	43	7	60	86	69.0	0.48
	TSSP	26	24	48	2	52	96	66.7	0.54
	Proscan	3	47	49	1	6	98	11.1	0.11

Table 4.

Comparison of accuracies of four promoter prediction programs on TATA-less promoters, 50 positive and 50 negative sequences, each 251 bp long.

Set	Tool	TP	FN	TN	FP	Sn, %	Sp, %	F1-score,%	MCC
Mixed, A. thaliana and O. sativa	TSSPlant	46	4	43	7	92	86	89.3	0.80
	NNPP	19	31	43	7	38	86	50.0	0.29
	TSSP	13	37	48	2	26	96	40.0	0.32
	Proscan	1	49	49	1	2	98	4.0	0.02
A. thaliana only	TSSPlant	46	4	44	6	92	88	90.2	0.80
	NNPP	18	32	44	6	36	88	48.7	0.30
	TSSP	14	36	48	2	28	96	42.4	0.33
	Proscan	1	49	49	1	2	98	4.0	0.02
O. sativa only	TSSPlant	46	4	42	8	92	84	90.2	0.77
	NNPP	20	30	42	8	40	84	51.3	0.28
	TSSP	12	38	48	2	24	96	37.5	0.30
	Proscan	1	49	49	1	2	98	4.0	0.02

Other tests were performed on long sequences of plant genome intergenic regions. To estimate performance on such sequences, we used NNPP, Proscan, EP3, TSSP and TSSPlant tools to search for promoters in 1100-bp regions of 55 plant protein coding genes with experimentally validated TSS (position 1001) collected in PlantProm DB. The results are shown in Table 5. Again, no significant difference in accuracy between monocot and dicot sets was observed (see: columns Sn and F1 for different sets).

Table 5.

Comparison of accuracies of five promoter prediction programs on 1100-bp regions of plant protein coding genes with experimentally validated TSS

Set	Tool	Genes with TSSpr⁴	Total number of TSSpr	TP⁵	FP	FN	Sn,%	F1-score,%
Mixed, dicots and monocots¹	TSSPlant	54	115	40	75	15	72.7	47.1
	TSSP	45	105	35	80	20	63.6	41.2
	NNPP	47	122	28	97	27	50.9	31.1
	EP3²	16	16	11	5	44	20.0	31.0
	Proscan²	10	10	5	5	50	9.1	15.4
Dicots Only²	TSSPlant	44	92	32	60	13	73.3	46.7
	TSSP	38	85	29	56	16	64.4	44.6
	NNPP	40	100	23	77	22	51.1	31.7
	EP3²	10	10	8	2	37	20.0	29.1
	Proscan²	8	8	4	4	41	8.9	15.1
Monocots Only³	TSSPlant	10	23	8	15	2	70.0	48.5
	TSSP	7	20	6	14	4	60.0	40.0
	NNPP	7	22	5	17	5	50.0	31.3
	EP3²	6	6	3	3	7	20.0	37.5
	Proscan²	2	2	1	1	9	10.0	16.7

¹55 genes with experimentally verified TSS from both dicots and monocots.

²45 genes with experimentally verified TSS from dicots only (21 A. thaliana, 1 Phaseolus vulgaris, 9 G. max, 3 Nicotiana tabacum, 3 Nicotiana silvestris, 8 Lycopersicon esculentum, 1 Beta vulgaris, 1 Ricinus communis and 1 Cucumis sativus genes).

³10 genes with experimentally verified TSS from monocots only (4 O. sativa, 3 Z. mays, 1 Avena sativa and 2 Hordeum vulgare genes).

⁴Prediction is considered true if the distance between the annotated TSS and the predicted TSS (TSSpr) is 50 bp or less.

⁵EP3 and Proscan programs predict a wide transcription start region (250 and 400 bp, respectively).

Results shown in Tables 1–5 indicate that combining monocot and dicot learning and testing sets does not result in a meaningful drop of sensitivity, specificity, F1-score and MCC of TSSPlant. It is worth mentioning a well-known problem of multiple TSSs that researchers encounter in analyzing eukaryotic promoters. Thus, in our study of 18 935 experimentally mapped TSSs in the Arabidopsis genome (ppdb; 25), 2264 TSS pairs were located at a distance of 50 bp or less on the same strand, including 1321 cases where the distance was less than 20 bp. In 2498 out of 11 43 protein-coding genes, there is more than one experimentally mapped TSS. In some previously published promoter prediction tools, a predicted TSS (TSSpr) is considered a TP even if it is located at a relatively long distance from the closest experimentally mapped TSS (TSSmap). Specifically, Proscan (38) and EP3 (15) considered a TSS to be a TP if it was within a distance of ±250 and ±400 bp from a TSSmap, respectively. In our analysis, a TP is defined as a TSSpr located within a 50-bp distance upstream or downstream of a TSSmap. As shown in Table 5, TSSPlant demonstrated the highest accuracy among tested programs (Sn ≃ 72%, F1 ≃ 47%), followed by TSSP (Sn = 63.6%, F1 ≃ 41.2%), NNPP (Sn = 51%, F1 ≃ 31%), EP3 (Sn ≃ 20%, F1 ≃ 31%) and Proscan (Sn ≃ 9%, F1 ≃ 15%).
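
The ±50-bp criterion for a TP can be illustrated with a short sketch. It assumes that each gene contributes at most one TP and that every other prediction for that gene is counted as an FP; this counting rule is not stated explicitly in the text, although it is consistent with the TSSPlant row of Table 5 (115 predictions = 40 TP + 75 FP). The function names and the toy input are hypothetical.

```python
def score_gene(annotated_tss, predicted_tss, tolerance=50):
    """Return (tp, fp, fn) for one gene under a +/- tolerance (bp) criterion."""
    hit = any(abs(p - annotated_tss) <= tolerance for p in predicted_tss)
    tp = 1 if hit else 0
    fp = len(predicted_tss) - tp  # assumption: all remaining predictions are FPs
    fn = 0 if hit else 1
    return tp, fp, fn

def evaluate(genes, tolerance=50):
    """genes: iterable of (annotated_tss, [predicted positions])."""
    tp = fp = fn = 0
    for annotated, predicted in genes:
        g_tp, g_fp, g_fn = score_gene(annotated, predicted, tolerance)
        tp, fp, fn = tp + g_tp, fp + g_fp, fn + g_fn
    sn = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f1 = 100.0 * 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return tp, fp, fn, sn, f1

# Made-up toy input: annotated TSS at position 1001 of each 1100-bp region
toy = [(1001, [980, 400]), (1001, [700]), (1001, [1040, 1060, 150])]
print(evaluate(toy))
```

Raising `tolerance` to 250 or 400 bp in this sketch shows how a wider acceptance window turns more predictions into TPs, which is why results obtained with different criteria are not directly comparable.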
We observed that some experimentally verified promoters did not pass the prediction thresholds. In an earlier study (39), computational prediction tools available at the time predicted only 20% of known promoters. Later, Huerta and Collado-Vides (40) analyzed 599 promoters from Escherichia coli and found that over 50% of true promoters did not produce high enough scores. Recently, we observed the same effect on both E. coli and cyanobacterial promoters (41). We call this the ‘high scoring versus functional promoter’ or simply ‘Scoring’ phenomenon. To study this phenomenon, we analyzed regions ±300 bp of experimental TSSs for 278 TATA and 722 TATA-less promoters. Using TSSPlant, an NN score was calculated at every potential TSS position, and TSSs with scores above the promoter class-specific threshold were compared to experimentally mapped TSSs. For all TATA promoters and 83% of TATA-less promoters, TSSPlant gave experimentally mapped TSSs scores above threshold, but only 38 and 32% of them (for TATA and TATA-less promoters, respectively) passed the filtering criteria to make it to the final predicted set, because in the remaining cases a neighboring predicted TSS had a higher NN score (Figures 3 and 4). This may suggest that we can currently accurately predict a TSR very close to a true TSS, but not an exact TSS position.

Figure 3.

The scoring landscape of experimentally validated TSSs for TATA promoters. Gray curve: distribution of NN scores that are higher than the prediction threshold, computed for each position of 600-bp sequence around experimentally validated TSSs (300 bp upstream and downstream). Black curve: the number of predicted TSSs in these positions.

Figure 4.

The scoring landscape of experimentally validated TSSs for TATA-less promoters. Gray curve: distribution of NN scores that are higher than the prediction threshold, computed for each position of 600-bp sequence around experimentally validated TSSs (300 bp upstream and downstream). Black curve: the number of predicted TSSs in these positions.

As many plant genes have multiple TSSs, we built in a possibility to use TSSPlant for studying promoters with multiple TSSs. The latest version of the program has an option (-x) that allows the user to assign left and right boundaries of a region where a single TSS with the highest score will be selected. Currently, this parameter varies in the range from 20 to 300 bp (the default value is 300 bp).
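
The idea of keeping a single highest-scoring TSS per region can be sketched as a greedy selection over scored positions. This is an illustration only and not TSSPlant's actual code: the function `select_tss`, the score scale, the threshold of 0.5 and the example scores are all hypothetical, and the `window` argument merely plays a role similar to the region size described above.

```python
def select_tss(scored_positions, window=300, threshold=0.5):
    """Greedy selection: keep the best-scoring position and drop any
    candidate lying within `window` bp of an already kept position.

    scored_positions: list of (position, score) tuples.
    """
    candidates = sorted(
        (ps for ps in scored_positions if ps[1] >= threshold),
        key=lambda ps: ps[1],
        reverse=True,  # highest score first
    )
    kept = []
    for pos, score in candidates:
        if all(abs(pos - kept_pos) > window for kept_pos, _ in kept):
            kept.append((pos, score))
    return sorted(kept)

# Made-up scores; (720, 0.40) is below the hypothetical threshold
scores = [(100, 0.91), (130, 0.95), (160, 0.80), (700, 0.70), (720, 0.40)]
print(select_tss(scores, window=300))
print(select_tss(scores, window=20))
```

With the wide window, the position at 100 is discarded in favor of its higher-scoring neighbor at 130 even though its own score is above threshold; with the narrow window, both survive. A narrower region is therefore the natural setting when closely spaced alternative TSSs are of interest.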

Genome-wide search for putative promoters in seven plant genomes

We ran TSSPlant on seven genomes of higher plants (O. sativa, Z. mays, A. thaliana, M. truncatula, P. trichocarpa, V. vinifera and G. max) to predict putative TSSs in [−1000:+100] regions of protein coding genes with annotated 5΄-UTR of at least 20 bp (see ‘Materials and Methods’ section). Results are summarized in Table 6 (putative TSS maps of A. thaliana and P. trichocarpa genes in gff and text formats are given in Supplementary Arabidopsis_Populus_predicted_TSS_map.zip; for putative TSS maps of O. sativa, Z. mays, M. truncatula, V. vinifera and G. max visit the PlantProm database [24] at http://www.softberry.com/berry.phtml?topic=plantp_2016.03_p7_11&subgroup=plantprom&group=data&no_menu=on).

Table 6.

Summary of genome-wide search for TSSs in seven plant genomes

Organism	Genes analyzed	Genes with ≥1 TSSpr	TSSpr, all	Genes with TSSpr at distance ≤ 50 bp from gene start	Genes with TSSpr at distance ≤ 100 bp from gene start	TSSpr density¹
O. sativa	22 333	22 258	52 924	11 827; ≃53%	14 572; ≃66%	464
Z. mays	23 467	23 330	53 265	11 993; ≃51%	14 706; ≃63%	485
A. thaliana	17 901	17 896	44 108	10 355; ≃58%	12 802; ≃72%	446
G. max	38 718	38 702	101 550	20 202; ≃52%	25 672; ≃66%	419
M. truncatula	18 227	18 226	48 014	10 477; ≃58%	12 838; ≃70%	417
P. trichocarpa	17 650	17 645	45 517	8222; ≃47%	10 669; ≃61%	426
V. vinifera	11 080	11 035	27 800	4970; ≃45%	6467; ≃59%	438

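¹Computed from the total number of predicted TSSs (TSSpr) and the sum of lengths of gene sequences analyzed in a genome; the tabulated values correspond to the summed length divided by the number of TSSpr, i.e. the average number of bp per predicted TSS.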
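
The last column of Table 6 can be checked with simple arithmetic. The sketch below assumes that each gene contributes one 1100-bp region ([−1000:+100]); under this assumption the summed region length divided by the number of TSSpr comes within 1 bp of every tabulated value, which suggests that the column expresses bp per predicted TSS. The counts are copied from Table 6; the names `table6`, `REGION_LEN` and `bp_per_tsspr` are hypothetical.

```python
# organism: (genes analyzed, predicted TSSs, tabulated density), from Table 6
table6 = {
    "O. sativa": (22333, 52924, 464),
    "Z. mays": (23467, 53265, 485),
    "A. thaliana": (17901, 44108, 446),
    "G. max": (38718, 101550, 419),
    "M. truncatula": (18227, 48014, 417),
    "P. trichocarpa": (17650, 45517, 426),
    "V. vinifera": (11080, 27800, 438),
}

REGION_LEN = 1100  # assumption: one [-1000:+100] region per gene

def bp_per_tsspr(genes, tsspr, region_len=REGION_LEN):
    """Average number of analyzed bp per predicted TSS."""
    return genes * region_len / tsspr

for organism, (genes, tsspr, tabulated) in table6.items():
    print(
        f"{organism}: {bp_per_tsspr(genes, tsspr):.1f} bp per TSSpr "
        f"(table: {tabulated}), {tsspr / genes:.2f} TSSpr per gene"
    )
```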

These results call for the following observations.

For 108 938 out of 109 142 genes from 5 genomes (99.8%), excluding rice and Arabidopsis genomes that were involved in training of the TSSPlant program, TSSPlant predicted at least one TSS. Taking only one TSS closest to the annotated start for each gene, we computed a distribution of distances between the TSSpr and gene start (Figures 5 and 6, see also Supplementary Figures S3–5). For 55 864 out of 108 938 genes (51.2%), one of the predicted TSSs is located within 50 bp of an annotated gene start, and 70 352 (≃65%) genes have one of the predicted TSSs within 100 bp. For about 35% of genes, however, predicted TSSs are farther than 100 bp from an annotated gene start. Part of this discrepancy may be explained by the limited prediction power of TSSPlant, which is true for any promoter recognition tool published to date. Besides that explanation, the following factors may be at play. We analyzed protein-coding genes with annotated 5΄-UTRs longer than 20 bp. Among them, for 1218, 1064, 1178, 1711 and 1897 genes from Z. mays, G. max, M. truncatula, P. trichocarpa and V. vinifera genomes, respectively, annotated lengths of the longest 5΄-UTR are less than 40 bp. To date, the minimal length of 5΄-UTR required for proper processing and translation of mRNA is unknown. However, in the same genomes, the longest mRNAs for 8145, 11 333, 4606, 17 149, 5640, 5238 and 2828 genes have 5΄-UTR length of 300 nucleotides or more. This observation may suggest that for a significant portion of analyzed genes, the annotated 5΄-UTR is truncated and, therefore, the distance between the predicted TSS and the actual gene start is shorter than we currently observe. Moreover, the studied promoter set was compiled by merging TSS data obtained for different tissues of Arabidopsis and rice. Although their main promoter elements and DNA physico–chemical characteristics are similar, some limitation in prediction accuracy of the TSSPlant tool may be due to tissue-specific characteristics that are not accounted for in our algorithm.
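
The aggregate figures quoted for the five genomes not used in training follow directly from the per-genome rows of Table 6. The sketch below sums the relevant columns; the counts are copied from Table 6, while the variable names are hypothetical.

```python
# organism: (genes analyzed, genes with >=1 TSSpr,
#            genes with a TSSpr within 50 bp, genes with a TSSpr within 100 bp)
rows = {
    "Z. mays": (23467, 23330, 11993, 14706),
    "G. max": (38718, 38702, 20202, 25672),
    "M. truncatula": (18227, 18226, 10477, 12838),
    "P. trichocarpa": (17650, 17645, 8222, 10669),
    "V. vinifera": (11080, 11035, 4970, 6467),
}

# Column-wise sums over the five genomes
analyzed, predicted, within50, within100 = (sum(col) for col in zip(*rows.values()))

print(f"genes with a prediction: {predicted} of {analyzed} "
      f"({100 * predicted / analyzed:.1f}%)")
print(f"within 50 bp: {within50} ({100 * within50 / predicted:.1f}%)")
print(f"within 100 bp: {within100} ({100 * within100 / predicted:.1f}%)")
```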

Figure 5.

Distribution of distances between the closest predicted TSS and gene start (annotated TSS, TSSan) for 23 330 protein-coding genes of Zea mays.

Figure 6.

Distribution of distances between the closest predicted TSS and gene start (TSSan) for 18 226 protein-coding genes of Medicago truncatula.

The total number of predicted TSSs per gene varies between 2 and 3 (see Table 6). This finding agrees with the published data indicating that transcription of a significant fraction of genes in both animal and plant species is initiated from alternative TSSs. In particular, recent annotations of the human genome suggest that almost half of its protein-coding genes contain alternative promoters, including those in many studied disease-associated genes (42). Many Arabidopsis and rice genes also utilize multiple TSSs (43–45). Results of our analysis of the ppdb content for Arabidopsis and rice confirm this observation: even if one takes only TSSs at a distance of 300 bp or more, two TSSs are annotated for 905 and 200 genes, respectively. It should be noted that we actually estimated FP in relation to the corresponding annotated gene start (Table 6). To study the nature of such ‘false TSSs’, we used TSSPlant to search for TSSs in upstream regions [−1000:−800] of 3352 Arabidopsis genes represented in our learning and testing positive sets. Comparison of predicted TSSs with experimentally mapped TSS positions from ppdb showed that: (i) 337 (∼10%) of predicted TSSs exactly match an experimental TSS or are located within 50 bp of it; (ii) 1403 (∼42%) of predicted TSSs are located within 100 bp of an experimental TSS; (iii) 2565 (∼77%) of predicted TSSs are located within 200 bp of an experimental TSS. An example of the relative location of experimentally mapped and computationally predicted TSSs in the upstream region of the Arabidopsis AT2G41190 gene encoding a transmembrane amino acid transporter family protein is shown in Supplementary Figure S6. These results suggest that some ‘FP TSSs’ are indeed functional TSSs that are simply located at a distance from an annotated gene start. To summarize, multiple TSSs seem to be a real phenomenon of plant promoter architecture.

CONCLUDING REMARKS

The task of predicting eukaryotic promoters is a notoriously difficult one, due to natural diversity in gene regulation that depends on information encoded predominantly in promoter sequences. This work further advances computational promoter identification, which is often used in generating promoter candidates for experimental studies of gene regulation. For example, according to Google Scholar search results, the TSSP promoter prediction program has been used in more than 350 publications on gene regulation studies (https://scholar.google.com; see results of running a search for ‘TSSP AND promoter’). TSSPlant, a novel promoter prediction tool, exhibits significantly improved accuracy of plant Pol II promoter prediction. There is still room for further improving promoter finding tools, which should be based on a deeper understanding of the transcriptome and its regulation, as well as on new computational approaches able to identify relevant information that has not yet been accounted for in current algorithms. Our studies indicate that combining sequences from different species, in particular, from monocots and dicots, does not seem to decrease prediction accuracy, although applying species-specific and/or intra-species tissue-specific characteristics of promoter regions may allow, in the near future, a further increase of accuracy, as was achieved by developing organism-specific parameters in gene prediction tools (46).

AVAILABILITY

TSSPlant is available to download at http://www.cbrc.kaust.edu.sa/download/.
