Literature DB >> 29540741

Identification of putative promoters in 48 eukaryotic genomes on the basis of DNA free energy.

Venkata Rajesh Yella^1,2, Aditya Kumar^1,3, Manju Bansal⁴.

Abstract

Transcription is an intricate mechanism and is orchestrated at the promoter region. The cognate motifs in the promoters are observed in only a subset of total genes across different domains of life. Hence, sequence-motif based promoter prediction may not be a holistic approach for whole genomes. Conversely, the DNA structural property, duplex stability is a characteristic of promoters and can be used to delineate them from other genomic sequences. In this study, we have used a DNA duplex stability based algorithm 'PromPredict' for promoter prediction in a broad range of eukaryotes, representing various species of yeast, worm, fly, fish, and mammal. Efficiency of the software has been tested in promoter regions of 48 eukaryotic systems. PromPredict achieves recall values, which range from 68 to 92% in various eukaryotes. PromPredict performs well in mammals, although their core promoter regions are GC rich. 'PromPredict' has also been tested for its ability to predict promoter regions for various transcript classes (coding and non-coding), TATA-containing and TATA-less promoters as well as on promoter sequences belonging to different gene expression variability categories. The results support the idea that differential DNA duplex stability is a potential predictor of promoter regions in various genomes.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Year: 2018 PMID： 29540741 PMCID： PMC5852095 DOI： 10.1038/s41598-018-22129-8

Source DB: PubMed Journal: Sci Rep ISSN： 2045-2322 Impact factor: 4.379

Introduction

Genetic transcription program is initiated in a segment of DNA referred to as ‘promoter’ which serves as a platform for the assembly of pre-initiation complex to specify the transcription start sites (TSSs)[1]. In eukaryotes, promoter architecture is quite complex and diverse. Eukaryotic promoters are broadly classified as core promoters, proximal promoters and distal promoters. Core promoter regions are characterized by the presence of cognate sequence motifs such as Initiator (Inr), TATA-box and downstream promoter element (DPE) positioned at distinct locations relative to TSS, as well as non-canonical elements such as “ATG deserts” and “CpG islands” in mammals[2-4]. Proximal promoter regions are the sequences spanning region 500 base pairs relative to the TSS and contain certain sequence elements, which include the CAAT box, the GC box and cis-regulatory modules[5]. Distal promoter elements encompass enhancers, insulators and silencers. Based on the strategies of transcription initiation, core promoters can be broadly classified as focused core promoters or dispersed core promoters[6,7]. The promoter region for the genes with dispersed transcription initiation can occur several 1000s of nucleotides upstream of the gene body. Further, from the studies on genome-wide nucleosome density maps of different eukaryotes such as Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and humans, it has been revealed that well-positioned nucleosomes usually occupy the sequences downstream of TSS, and the core promoter regions are devoid of the nucleosome[8]. Recent analyses of eukaryotic promoters indicates that though promoters differ in their sequence context (most of them lack consensus motif) and GC composition (lower eukaryotes are AT rich while mammals are GC rich), some properties such as nucleosome-free region and epigenetic features around transcription start sites are quite common[9,10]. The complexity of eukaryotic promoter architecture is further revealed by the discovery of alternative[6,7] and bidirectional promoters[11,12]. In addition, recent experimental studies showed that the transcription landscape in eukaryotes is quite pervasive in nature, with a high proportion of transcripts originating from intergenic regions[12,13]. With this avalanche of genomic sequence data, it is important to characterize and predict promoter sequences in order to fully understand the process of transcription initiation. The delineation of promoters is also essential for complete annotation of genomes and a better understanding of genome’s regulatory networks and their architecture[14]. Experimental methods for locating promoter elements include techniques such as 5′-tag-based methods which characterize promoters and transcription initiation events, small RNA sequencing, as well as methods which capture DNA-bound proteins, including RNAPII, transcription factors and histone modifications[9]. They provide a snapshot of all transcribed regions or DNA-protein interactions in the genome for specific experimental conditions[7,10]. Experimental techniques are costly, labour-intensive and time-consuming. Alternatively, computational methods provide rapid and relatively inexpensive ways for promoter identification. These methods are mostly based on the basic premise that promoter sequences have unique statistical features when compared to other genomic sequences[10]. These approaches use DNA sequence feature information such as biological signals of core promoter elements (Inr, TATA-box, and CpG islands), statistical properties of k-mer composition and DNA secondary structural features. Structural features are found to be more informative and universal as compared to sequence or base-compositional features. DNA structural features such as flexibility/bendability, curvature, base stacking and duplex stability have been applied to characterize promoter regions[15-19]. They are better predictors, as the structural features are comparatively conserved, more informative and widely applicable across genomes. Recent reports suggest that structural properties are linked to variability of gene expression[15,20,21] and help in understanding different promoter classes[22]. Recent findings reveal that, DNA shape features can be useful in understanding, and characterizing transcription factor binding sites, origins of replications and other genomic regions[23-26]. The majority of the structure-based algorithms have been designed with the aim of annotating promoter regions specifically in humans, while a few algorithms such as “PromPredict”[17,27,28] and “EP3”[16] are applicable across a variety of genomes. It has also been reported that among all structural features examined, energy-based features such as base stacking or DNA duplex stability are better predictors[16]. DNA duplex stability, a secondary structural feature can be expressed in terms of short range nearest-neighbour interactions, inter base hydrogen bonds and stacking interactions, which are explicitly dependent on identity and orientation of flanking base pairs[14]. DNA duplex stability computation has been applied in techniques such as PCR, anti-gene targeting, and for understanding replication, repair, and transcription[29,30]. PromPredict algorithm encodes the dinucleotide free energy information obtained from studies of melting temperatures of oligonucleotides to compute the average free energy of a particular sequence, as an indicator of DNA duplex stability. The strategy behind using DNA duplex stability is that promoter regions should be less stable than flanking regions to facilitate melting at the time of transcription initiation. Although PromPredict was developed for bacterial promoter prediction[17,27], it also works very well for plants[28,31], but its efficiency has not been tested on fungal and metazoan promoter sequences. In comparison, EP3, which also uses base-stacking energy, has been applied to annotate promoter regions in several eukaryotes ranging from protists to mammals[16]. In the present study, we report the validation of ‘PromPredict’ in identification of the promoter sequences of 48 eukaryotic systems. The key elements of this in silico study are: Understanding the variation of DNA duplex stability eukaryotes, Evaluating performance of PromPredict across various eukaryotes, Whole genome promoter Prediction for different classes of transcripts in S. cerevisiae. An analysis of predictions for various gene expression variability classes and TATA-containing and TATA-less promoters in S. cerevisiae.

Results and Discussion

The efficiency of in-house software PromPredict, a promoter prediction algorithm that demarcates putative promoter regions based on DNA duplex stability has been studied extensively in archaea, bacteria[17] and plants (Rice and Arabidopsis)[28]. Current study deals with the promoter prediction analysis in the genome sequences of 48 different eukaryotes with translation start site (TLS) data, along with genome sequences of S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse and human whose transcription start site (TSS) data is available. The true positive region for predictions for all systems is considered to be −500 to +100 relative to TLS/TSS in all cases except when comparing PromPredict with EP3. Initially DNA duplex profiles in these systems have been compared qualitatively.

Eukaryotic promoters show distinct low stability region

The average DNA duplex stability profiles of promoter regions of six systems with mapped TSS data are presented in Figure 1 (data can be found in additional file). Promoter regions (−500 to +100 relative to TSS) in S. cerevisiae, C. elegans and D. melanogaster show low stability compared to their downstream regions, with narrow less stable regions being observed at −19, −11 and −114 (split peak at −25) for S. cerevisiae, C. elegans, and D. melanogaster respectively. In the case of vertebrates, zebrafish, mouse and human, core promoter regions are GC-rich but have two narrow low free energy peaks in the vicinity of TSS. Two narrow peaks are observed at −27 and +2 for zebrafish, −29 and +6 for mouse and −30 and +1 for human, which may correspond to the locations of TATA-box and INR (initiator) elements respectively (additional file). Similar low stability regions have been observed in different prokaryotic systems and plants[21,28,32]. The DNA duplex stability profiles of yeast and invertebrates are similar to prokaryotic promoters. In the case of a majority of prokaryotes, irrespective of genome GC content, promoter regions are less stable (or AT-rich) compared to downstream region, while in vertebrate promoters they are GC-rich, with several sequence and structure elements such as CpG islands, G-tracts, and G-quadruplex motifs being present in the vicinity of TSS[21]. GC contents of six eukaryotic systems in different regions of promoters, −500 to +500, −500 to −100, −100 to −1, +1 to +100 and +100 to +500 are given in (Supplementary Figure 1). The core promoter regions (−100 to −1) in yeast and invertebrates display less GC content compared to other regions as well as genomic GC while in mammals all regions prefer high GC content compared to genome GC.

Figure 1

DNA duplex stability profiles of S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human. The two vertical lines correspond to −150 and −30 position with respect to TSS. Promoter regions in yeast and invertebrates are less stable. Low free energy peaks are observed at −19, −11 and −114 for S. cerevisiae, C. elegans and D. melanogaster respectively. Vertebrate core promoters have two narrow low stability peaks near the vicinity of TSS. Two narrow peak regions are observed at −27, +2 for zebrafish, −29, +6 for mouse and −30 and +1 for human. The stability (average free energy) profiles of genomic promoter sequences are plotted in blue colour while those for shuffled promoter sequences are shown in green. The DNA duplex stability profiles show a similar trend in promoter regions of closely related species (Fig. 2, additional file). To compare this feature in different eukaryotes, the promoter sequences (−500 to +500 relative to TLS) with (40–45) % GC composition are chosen and average free energy is plotted. All fungal species display free energy profiles similar to that of S. cerevisiae in the vicinity of TLS, while fish, sea hare and lancelot show a dip at immediate downstream region of TLS. An extensive study by Kumari S. and Ware D. reported the characteristic signatures of free energy profiles in eight plant species[31]. Further, they showed variations in the free energy profiles of monocot and dicot plants. Our results suggest that unique patterns of DNA duplex free energy or stability are conserved among closely related eukaryotes. It is interesting to see the validation of PromPredict in two different types of promoter classes; AT-rich in yeast and invertebrates and GC-rich in vertebrate promoters.

Figure 2

Average free energy profiles of promoters regions in different eukaryotes belonging to fungi, invertebrates (worm and fly) and vertebrates. The promoter sequences (−500 to +500 relative to translation start site (TLS)) are downloaded from SGD, CGD, and UCSC genome browser. The promoter sequences with GC percentage range 40–45 have been plotted. The red, green, blue, cyan, black and pink color plots represent fungi, worm, fly, marine invertebrate, fish and mammal including bird respectively. The model systems S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human are highlighted with gray background. The trends in AFE profiles are unique to closely related classes.

Promoter prediction using PromPredict

Promoter prediction in 48 eukaryotic systems with TLS data (14 species of yeast, five species of worm, 12 species of fly, three marine invertebrates, six fish species, seven mammals and chicken) and six eukaryotic systems, S. cerevisiae, C. elegans and D. melanogaster, zebrafish, mouse and human with mapped TSS data has been carried out using PromPredict. The program discriminates putative promoters from other genomic sequences using relative differences in average free energy and cut-offs based on GC content of the sequences (methods). PromPredict gives predictions in 5′ to 3′ direction and can be applied to the whole genome as well as shorter fragments (not less than 1000 nt). Promoter prediction has been carried out using two strategies; (i) 1001mer or 2001mer sequences and (ii) on whole genome. The analysis of shorter fragments has been performed for 48 systems with TLS data and six systems with TSS data. While the second approach, viz whole genome annotation, is utilized for S. cerevisiae. In all cases, the true positive region is considered to be 600 nt of length, extending from −500 nt upstream to +100 nt downstream, relative to TSS or TLS. Least stable positions (lsp) are considered to evaluate the predictions. Since this kind of analysis gives only true positives, only recall values are derived to assess the performance of PromPredict.

Performance of PromPredict on 48 systems with TLS data

The 48 eukaryotic systems represent species closely related to the six systems with TSS data, along with distantly related systems such as marine invertebrates. They constitute 14 species of yeast, five species of worm, and 12 species of fly, three marine invertebrates, six fish species, seven mammals and chicken (Table 1). The genomic size of these eukaryotes ranges from 10.8 mb (for S. mikatae) to 3197 mb (for elephant) while genomic GC varies from 33% (for C. dubliniensis) to 47% (for lamprey). PromPredict has been tested on all these systems considering −500 to +100 region relative to TLS as true positive (TP) region and the recall values are calculated to evaluate the performance. Figure 3 shows the positional distribution of predictions (lsp) in −500 to +500 regions in different eukaryotic promoters. Since PromPredict gives predictions by considering 250 nucleotide (nt) sliding window (two 100 nt windows separated by 50 nt region) predictions are only obtained up to 750 nt position in a 1000 nt sequence. In all studied systems the frequency of lsp in upstream 50 nt bins is consistently higher as compared to downstream regions. In 14 species of yeast, the lsp signal is observed immediately upstream of TLS (approximately −150 to −50). The predictions in worm and fly species are spread over the upstream region, extending till −500 and possibly beyond. Mammals such as human, mouse, cow and pig show higher predictions in far upstream regions, beyond −250, from the TLS. Overall the trends in the distribution of predictions are similar to that of the related systems with TSS position as reference.

Table 1

Promoter prediction in 48 different eukaryotes.

	Genome size in Mb	Genome GC %	Promoter sequences	Total predictions	TP predictions	TP genes	Recall
Yeast
S. cerevisiae	12.5	38.4	6642	8113	6636	5386	81.1
S. bayanus	11.9	40.3	7216	8161	6415	5346	74.1
S. castellii	11.4	37	4655	5745	4528	3703	79.5
S. kluyveri	11	41.7	2932	2991	2433	2059	70.2
S. kudriazvevii	11.2	39.9	3736	4157	3431	2863	76.6
S. mikatae	10.8	38.2	3064	3773	3054	2465	80.5
S. paradoxus	11.9	38.6	7373	8578	6807	5608	76.1
C. albicans	14.5	33.7	5852	8628	6764	5063	86.5
C. dubliniensis	14.6	33.2	5933	8988	7114	5255	88.6
C. glabrata	12.1	38.5	5149	6913	5913	4575	88.9
C. lusitaniae	12.1	44.5	5797	5621	4867	4051	69.9
C. tropicalis	15.3	33.5	6119	8855	6714	5112	83.5
D. hansenii	11.5	35.4	6102	9100	7459	5592	91.6
L. elongisporus	15.5	37	5657	8337	6933	5076	89.7
Worm
C. elegans	98.3	35.4	32481	49575	34895	26474	81.5
C. brenneri	190.4	39.7	24989	28345	19993	16714	66.9
C. briggsae	108.4	37.7	31342	36290	25460	21187	67.6
C. remanei	145.4	38.5	26174	30728	21571	17874	68.3
C. japonica	166.3	39.9	22173	28033	19710	16061	72.4
Fly
D. melanogaster	143.7	42.1	17283	24152	18720	14536	84.1
D. annanassae	231	42.5	13677	18831	12513	10127	74.0
D. erecta	152.7	42.6	12395	15917	10887	8996	72.6
D. grimshawi	200.5	38.8	7861	10009	6822	5498	69.9
D. mojavensis	193.8	40.2	5329	7172	4982	3885	72.9
D. persimilis	188.4	45.2	7598	10487	7452	5812	76.5
D. pseudoobscura	152.7	45.3	31482	45081	32400	25117	79.8
D. sechellia	166.6	42.5	19059	25110	17308	13974	73.3
D. virilis	206	40.7	7920	10433	7238	5673	71.6
D. yakuba	165.7	42.4	2857	3951	2619	2105	73.7
A. gambiae	265	44.5	13901	17893	13044	10487	75.4
A. mellifera	250.3	34.1	21146	36526	23533	17690	83.7
Marine invertebrates
Sea hare	927.3	42	33340	41949	28465	23737	71.2
Sea squirt	116.7	36.1	729	1073	748	591	81.1
Lancelet	521.9	41.8	30538	41175	27789	23402	76.6
Fishes
Zebrafish	1371.7	36.7	14404	20036	13780	11329	78.7
Fugu	391.5	45.8	18679	25638	17793	14685	78.6
Lamprey	885.5	46.8	8724	11355	8087	6839	78.4
Medaka	869.8	42.3	29255	40229	27999	23106	79.0
Stickleback	446.6	42	38971	54946	37856	30987	79.5
Tetraodon	342.4	46.3	34618	50425	34481	27838	80.4
Mammals and bird
Mouse	2803.6	41.9	23878	34040	23118	18661	78.2
Human	2851.4	40.9	26800	41102	27229	21419	79.9
Chicken	1046.9	41.9	3707	5292	3353	2753	74.3
Cow	2983.3	42.3	7514	11110	7338	5849	77.8
Elephant	3196.7	40.9	16095	22427	15638	12580	78.2
Pig	2808.5	42.5	11953	18064	12403	9736	81.5
Platypus	1995.6	45.7	6166	8336	5969	4735	76.8
Rat	2616.4	42.4	8567	11987	8498	6817	79.6

Promoter prediction in 14 species of yeast, five species of worm, 12 species of fly, three marine invertebrates, six fish species, seven mammals and chicken are carried out using PromPredict algorithm. The promoter regions, −500 to +500 relative to TLS (TLS at 0) are considered for this analysis and are retrieved from SGD, CGD, and UCSC genome browsers. The −500 to +100 relative to TLS is considered as true positive region.

Figure 3

Positional distribution of predictions in promoter regions of different eukaryotes. The plot shows the distance of predictions from TLS in 50 nt bins for 48 different eukaryotes representing 14 species of yeast, five species of worm, 12 species of fly, three marine invertebrates, six fish species, seven mammals, and chicken. Color codes for the background of subplots (white for fungi, yellow for the worm, blue for fly, cyan for marine invertebrates, green for fishes and pink for mammals and bird) have been used to differentiate between different domains of life.

Promoter prediction in 48 different eukaryotes. Promoter prediction in 14 species of yeast, five species of worm, 12 species of fly, three marine invertebrates, six fish species, seven mammals and chicken are carried out using PromPredict algorithm. The promoter regions, −500 to +500 relative to TLS (TLS at 0) are considered for this analysis and are retrieved from SGD, CGD, and UCSC genome browsers. The −500 to +100 relative to TLS is considered as true positive region. Positional distribution of predictions in promoter regions of different eukaryotes. The plot shows the distance of predictions from TLS in 50 nt bins for 48 different eukaryotes representing 14 species of yeast, five species of worm, 12 species of fly, three marine invertebrates, six fish species, seven mammals, and chicken. Color codes for the background of subplots (white for fungi, yellow for the worm, blue for fly, cyan for marine invertebrates, green for fishes and pink for mammals and bird) have been used to differentiate between different domains of life. The percentage recall values for the 48 systems range from 67.6 to 91.6 for C. briggsae (worm) and D. hansenii respectively (Table 1, Fig. 4). In yeast, the recall values span the widest range from 70.4 to 91.6. The 14 yeast species show similar free energy profiles, but they differ considerably in their genomic GC content. The recall values for yeast species with high GC content such as S. bayanus, S. kluyveri and C. lusitaniae have lower recall values whereas genome with less GC content such as C. dubliniensis, C. albicans and D. hansenii show higher recall values. The correlation between genome GC and recall is observed to be −0.79 (p = 0.0007) (Supplementary Figure 2). The reason for this kind of relationship in yeasts may be due to the compactness of the genomes; the GC content of genome is directly reflected in GC content of promoter regions. There is no such relationship observed for genome GC and recall values for invertebrates or vertebrates. However, it should be noted that PromPredict performs better in less GC-rich genomes compared to high GC-rich genomes. The performance of PromPredict is weak in worm species except for C. elegans as they do not have a prominent free energy peak in the vicinity of TLS (Fig. 2). In mammals, PromPredict shows lower prediction accuracy when compared to invertebrates. Invertebrates and mammals have unique structural motifs (A-tracts in invertebrates and G-quadruplexes in mammals). Supplementary Figures 3 and 4 show the distribution of A-tracts and G-quadruplexes in 48 eukaryotes. The PromPredict analysis is tested further in six eukaryotes with experimentally validated TSS data.

Figure 4

Performance of PromPredict in different eukaryotes. The box plot shows the recall in 14 species of yeast, five species of worm, and 12 species of fly, three marine invertebrates, six fish species, seven mammals and chicken. The −500 to +100 region relative to TLS is considered as true positive region in this study.

Performance of PromPredict on six systems with TSS data

The six systems considered are S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human, as representatives of different domains of life, ranging from single-celled yeast to complex mammalian systems. The whole genome GC content varies from ~35% (in C. elegans) to ~42% (D. melanogaster and mouse). Prediction calculations are restricted from −500 to +500 regions for S. cerevisiae, C. elegans, and D. melanogaster while from −1000 to +1000 regions are considered in the case of vertebrates. The positional distribution of least stable position in 50 nt binned windows in six systems shows that the location of prediction varies in yeast, invertebrates, and vertebrates (Supplementary Figure 5). In S. cerevisiae, C. elegans and D. melanogaster majority of the predictions are located in the upstream vicinity of TSS (approximately −1 to −150 region) whereas in mammals, maximum number of predictions is observed beyond −300 region relative to TSS. Percentage recall values for the six systems are 85.1, 87.7, 91.9, 80.4, 67.2 and 71.3 for S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse, and human respectively (Table 2). Higher recall values are observed for lower eukaryotes as expected, but surprisingly good recall values are also obtained for mouse and human.

Table 2

Promoter predictions in six systems using “PromPredict” algorithm.

	Promoter sequences	Genome GC	Total predictions (−500 to +500 w.r.t TSS at ‘0’)	TP (−500 to 100)	GC% in TP region	TP genes	Recall
S. cerevisiae	4912	38.3	11227	6312	37.6	4182	85.1
C. elegans	18457	35.4	28823	23063	35.1	16178	87.7
D. melanogaster	12898	42.1	18332	16232	39.8	11856	91.9
Zebrafish	5366	36.1	8220	5464	37.6	4313	80.4
Mouse	17451	41.9	22017	14281	54.6	11721	67.2
Human	29456	40.9	43043	27138	53.1	21009	71.3

Predictions in −500 to +500 regions w.r.t TSS of S. cerevisiae, C. elegans and D. Melanogaster, zebrafish, mouse and human have been carried out. PromPredict performs better in the case of yeast and invertebrates as compared to mammals. The −500 to +100 region relative to TSS is considered as true positive region.

Promoter predictions in six systems using “PromPredict” algorithm. Predictions in −500 to +500 regions w.r.t TSS of S. cerevisiae, C. elegans and D. Melanogaster, zebrafish, mouse and human have been carried out. PromPredict performs better in the case of yeast and invertebrates as compared to mammals. The −500 to +100 region relative to TSS is considered as true positive region. To check the distribution of predictions, different TP classes are defined with each class representing the promoter sequences having a prediction within 50 nt bins (shown in Supplementary Figure 5) spanning the regions −1 to −50, −51 to −100, −101 to −150, etc. The average free energy profiles of different TP classes (Supplementary Figure 6) indicate that each class has a broad low stability region corresponding to the bin position, in all systems. These distinct stability profiles in the promoter regions may explain the abundance of cis-regulatory elements in the proximal as well as distal regions of TSS[28]. The results presented here suggested that PromPredict can be potential algorithm for promoter prediction in eukaryotes. Further, the performance of PromPredict has been compared with EP3.

Comparison of PromPredict performance with EP3

The performance of PromPredict has been compared with the other free energy based promoter predictor EP3[16] which has been applied primarily to predict mouse and human promoters (in the −1000 to +1000 nt regions, where TSS is centred at ‘0’ position). Hence the True Positive (TP) region for both predictors is considered as spanning −1000 to +1000 region. The EP3 program uses base-stacking property to distinguish promoter regions from non-promoter sequences. For a given sequence of DNA, it calculates inverted base-stacking energy values over non-overlapping window of 400 base pairs size and calls a region as a promoter when the structural feature value crosses a prescribed threshold score, which is genome specific. So for EP3 every 400 nt window has a prediction, but with true or false prediction value assigned to it, while PromPredict gives only predicted promoter regions in a continuously sliding window. PromPredict provides majority of the predictions in −600 to +200 regions with respect to TSS in both mouse and human, whereas EP3 gives predictions in −200 to +200 regions. The distribution of closest TP predictions shown in Figure 5 suggests that both EP3 and PromPredict are good predictors of promoters. However, as seen from Table 3, the recall values for EP3 are very low as compared to PromPredict for both human and mouse. PromPredict gives recall values of 99% for both mouse and human whereas EP3 gives 46% and 36% recall for mouse and human respectively which are comparable to whole genome prediction values (recall ~41% with CAGE dataset) reported in their study[16]. The results suggest that PromPredict is a better predictor than EP3, even though latter has been trained on the human genome. Since the entire 2001 nt region spanning TSS is considered as ‘TP’ region, no false positives have been identified and hence ‘Precision’ values have not been obtained in this study. PromPredict algorithm was further tested in various classes of transcripts, promoter types and gene expression datasets.

Figure 5

Table 3

Comparison between performance of PromPredict and EP3.

		Promoter sequences	Total predictions in −1000 to +1000nt region w.r.t TSS at ‘0’	TP	TP genes	Recall
PromPredict	Mouse	17451	50834	50834	17301	99.1
PromPredict	Human	29456	90622	90622	29246	99.3
EP3	Mouse	17451	87255	13336	8052	46.1
EP3	Human	29456	147279	18810	10522	35.7

Prediction for −1000 to +1000 region w.r.t TSS for both human and mouse are used for prediction. The whole region −1000 to +1000 region has been used as a truepositiveregion[16] to calculate recall values. PromPredict gives far better recall value when compared to EP3.

Distribution of ‘TP’ promoter predictions by PromPredict and EP3. Predictions in −1000 to +1000 region relative to TSS for both mouse and human have been considered. PromPredict uses a 250 nt sliding window with the midpoint of a 100 nt window being considered as a prediction, if it satisfies GC based cut-offs, while EP3 prediction is given in a 400 nt non-overlapping window and defines the window as a true or false prediction based on genome-specific thresholds. In case more than one ‘TP’ prediction is obtained for a gene, then the position closest to TSS has been considered. Hence, the total number of TPs plotted corresponds to ‘TP genes’. Comparison between performance of PromPredict and EP3. Prediction for −1000 to +1000 region w.r.t TSS for both human and mouse are used for prediction. The whole region −1000 to +1000 region has been used as a truepositiveregion[16] to calculate recall values. PromPredict gives far better recall value when compared to EP3.

Whole genome promoter prediction in S. cerevisiae

S. cerevisiae is an important model organism for eukaryotes, its genome being compact with genes representing ~72% of the entire sequence. Promoter prediction has been carried out for all 16 chromosomes in both forward and reverse strands separately. The predictions are mapped to the −500 to +100 region relative to TSS of each gene. If the least stable position (lsp) of the predicted promoter region lies within the 500 upstream regions or 100 downstream of a known TSS, then it is considered as true positive. On the other hand, if a predicted promoter region occurs within the coding regions of the gene, then it is regarded as a false positive prediction. A true positive region may have more than one prediction, in that case, the nearest prediction (or lsp) to the TSS has been considered. PromPredict achieves ~80% recall and ~37% precision in whole genome prediction (Table 4), which is comparable to its performance in prokaryotes[17]. Furthermore it has been shown in rice and Arabidopsis that a chunk of predictions in intronic regions given by PromPredict are potential cis-regulators[28] but in S. cerevisiae, the intronic regions represent only a total of 3.8% of the ORFs. Predictions of various types of transcripts belonging to protein coding (ORF) as well as non-coding RNAs (CUTs, SUTs and other RNAs such as rRNA, tRNA and SnoRNA) has been considered for checking the performance of the algorithm. Precision, recall and F-score values shown in Table 4 suggest that PromPredict can annotate putative promoters of different classes of transcripts. It should be noted that PromPredict identifies the promoters associated with SUTs and CUTs which are related to pervasive transcription[12].

Table 4

Whole genome promoter prediction in S. cerevisiae using “PromPredict”.

	Number of sequences	Transcript median length	TP	FP	TP promoters	Precision	Recall	F-score
ORF	4912	1548	5317	9072	3934	37.0	80.1	50.6
Cuts	501	428	540	242	404	69.1	80.6	74.4
Suts	729	964	727	704	552	50.8	75.7	60.8
Other	300	1272	296	447	223	39.8	74.3	51.9
All	6442	1436	6880	10465	5113	39.7	79.4	52.9
TATA and TATA-less promoters
TATA	842	1384	978	1298	701	43.0	83.3	56.7
TATA-less	4070	1544	4206	7529	3139	35.8	77.1	48.9

Promoter prediction for 16 chromosomes for both forward and reverse strands has been carried out. −500 to +100 region relative to transcript start was chosen as a true positive region. The performance of PromPredict has been evaluated using the parameters precision, recall, and F-score. Precision is the ratio of number of true positives to the sum of true and false positive predictions, while recall is the ratio of the numbers of promoters with an identified true positive gene to the total number of promoters. TATA-containing and TATA-less gene promoters are defined based on the criterion of presence of TATA-box in −150 to −1 region relative to TSS[22]. Recall values for prediction of different transcripts belonging to ORF, non-protein coding CUTs, SUTs and other RNA classes (tRNA, rRNA, and SnoRNA) suggest that PromPredict is a good predictor for yeast promoter sequences. The algorithm performs better for TATA-containing gene promoters as compared to TATA-less promoters.

Whole genome promoter prediction in S. cerevisiae using “PromPredict”. Promoter prediction for 16 chromosomes for both forward and reverse strands has been carried out. −500 to +100 region relative to transcript start was chosen as a true positive region. The performance of PromPredict has been evaluated using the parameters precision, recall, and F-score. Precision is the ratio of number of true positives to the sum of true and false positive predictions, while recall is the ratio of the numbers of promoters with an identified true positive gene to the total number of promoters. TATA-containing and TATA-less gene promoters are defined based on the criterion of presence of TATA-box in −150 to −1 region relative to TSS[22]. Recall values for prediction of different transcripts belonging to ORF, non-protein coding CUTs, SUTs and other RNA classes (tRNA, rRNA, and SnoRNA) suggest that PromPredict is a good predictor for yeast promoter sequences. The algorithm performs better for TATA-containing gene promoters as compared to TATA-less promoters. In yeasts, gene promoters can be classified into TATA-box containing and TATA- less promoters based on the presence of TATAWAWR consensus motif [33] in the vicinity of TSS. In S. cerevisiae ~17% of the gene promoters belong to TATA- containing class. The two categories of genes vary in their biological functions and gene expression. Further, the TATA-containing promoters are shown to be flexible while TATA-less promoters are rigid in different species of yeast[34]. In S. cerevisiae, TATA-containing promoters are less stable, more bendable and slightly curved compared to TATA-less promoters[22]. PromPredict has been tested on the promoters of two classes of genes. Recall and precision values are ~83% and ~43% for TATA-containing and ~77% and ~36% for TATA-less genes respectively (Table 4). Better performance of PromPredict for TATA-containing gene promoters compared to TATA-less promoters can be attributed to the lower stability of promoters[22] as well as smaller median length of ORFs in the former class[35]. The analysis has also been extended for genes with variable expression. Gene expression variability, which is essential for phenotypic variability, can be measured in the contexts of the environment, evolution, etc. Seven measures of variability, stochastic noise, responsiveness, stress response, trans variability, mutational variance, inter-strain variance and expression divergence have been used in an earlier study[36]. The performance of PromPredict was compared with high and low expression variability classes in all above mentioned measures (Supplementary Table 1). Genes with high or low expression variability are associated with specific biological functions. To define two categories, first, we have sorted the genes, based on their corresponding expression values and created ten bins, each containing 10% of the gene dataset. Top and bottom percentile bins are regarded as low and high expression categories. Precision, recall and F-score values for the seven measures suggest that less and more responsive genes exhibit major differences (Fig. 6c). In our earlier study, we have shown that gene responsiveness is intimately linked to DNA structural properties of promoters, nucleosome occupancy and promoter architecture[20]. Figure 6a shows an example of difference in the association of less responsive and more responsive genes with their biological function. The gene ontology (biological process) analysis shows that high responsive genes are linked to heterocyclic metabolic process, carbohydrate metabolism and response to chemical stimulus while being less responsive class associated with processes such as RNA metabolism, transcription, protein modification and transport etc. Distinct differences are also observed in free energy profiles of less responsive and more responsive genes. Less responsive genes have broader low stability region with two split peaks while more responsive ones have a comparatively narrow peak (Fig. 6b, additional file). More responsive genes (F-score = 57%) are better predicted by PromPredict compared to less responsive genes (F-score = 48%). The whole genome predictions suggest that PromPredict can be applied to yeast promoters and can be helpful to differentiate various classes of promoters.

Figure 6

Performance of PromPredict in less responsive and more responsive genes. Top and bottom ten percentile list of gene (or) promoter sequences considered for this analysis. The dataset contains 459 genes in each category. (a) The frequency of genes in each class of biological process has been plotted to compare the two classes of genes. The less responsive genes are overrepresented in few biological processes compared to more responsive genes. The genes categorized in the gene ontology GO SLIM categories were retrieved from Saccharomyces genome database (http://www.yeastgenome.org/). (b) AFE profiles plotted using 15 nt window shows the differences in the distribution of low stability regions for two classes of gene promoters. (c) Precision and recall values shown as bar plot suggest that PromPredict comparatively a better predictor for more responsive genes. The −500 to +100 region relative to TSS is considered as true positive region.

Conclusions

In silico promoter characterization is a great tool for identification of transcription initiation events, prediction of DNA-Transcription factor binding events and can help experimental molecular biologists to perform low-cost screening. Promoter prediction using in-house software PromPredict in the promoter sequences of six eukaryotic systems S. cerevisiae, C. elegans, D. melanogaster, zebrafish, mouse and human with transcription start site (TSS) data along with 48 eukaryotic systems with translation start site (TLS) data reveals that differential stability is a good criterion for promoter prediction. Whole genome prediction in S. cerevisiae using PromPredict suggests that it can be applied to yeast genomes, and its performance is observed to be different in distinct classes of genes such as TATA-containing and TATA-less, and variably expressed genes. Also, promoter prediction in 1001 nt long sequences flanking the TLS of 48 eukaryotes suggests that PromPredict is a good predictor of promoter regions in eukaryotes. Overall the recall values indicate that PromPredict can be applied for predicting promoter regions in different domains of life, even though it was initially designed for prokaryotes. PromPredict can be implemented without any specified cut-off to any genome, which makes it a very useful tool for global promoter annotation. The software is available as standalone as well as online version (http://nucleix.mbu.iisc.ac.in/prompredict/prompredict.html). Implementation of various sliding window sizes can be done on user’s request.

Materials and Methods

Promoter sequence datasets

Genome sequences of 48 different eukaryotes representing various domains of life are used in current study. Genomic sequences −500 and +500 relative to translation start site (TLS) for different species of invertebrates (Caenorhabditis elegans, C. brenneri, C. briggsae, C. remanei, C. japonica, Drosophila melanogaster, D. annanassae, D. erecta, D. grimshawi, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. virilis, D. yakuba, Anopheles gambiae, Apis mellifera, lancelet, sea squirt, sea hare), and vertebrates (zebrafish, fugu, lamprey, medaka, stickleback, tetraodon, mouse, human, chicken, cow, elephant, pig, platypus and rat) have been extracted from UCSC table browser[37]. Datasets for fungi are retrieved from the Saccharomyces Genome Database (SGD) (http://www.yeastgenome.org/) for Saccharomyces species (S. cerevisiae, S. castellii, S. bayanus, S. kluyveri, S. kudriazvevii, S. mikatae and S. paradoxus) and from the Candida Genome Database (CGD) (http://www.candidagenome.org/) for Candida and closely related species (C. albicans, C. dubliniensis, C. glabrata, C. lusitaniae, C. tropicalis, Debaryomyces hansenii and Lodderomyces elongisporus). Total number of sequences and the whole genome GC percentages are listed in Table 1. Transcription start site (TSS) information of three vertebrates, zebrafish, mouse, and human was obtained from Database of Transcription Start Sites (DBTSS), database version 7.0[38]. TSS information of C. elegans and D. melanogaster was retrieved from modENCODE database (http://www.modencode.org/) and Graveley et al.[39] transcriptome profiling studies respectively. In case of S. cerevisiae transcript positions with both ends mapped have been considered[12]. Transcript information of 4912 open reading frames (ORFs), 501 cryptic unstable transcripts (CUTs), 729 stable unannotated transcripts (SUTs) and 300 other transcripts (dubious ORF, tRNA, SnoRNA, rRNA, etc.) is used for whole genome promoter prediction. The TSS positions of each system are mapped to respective genomes and sequences upstream as well as downstream are extracted. Genomic sequence data is downloaded from SGD, Wormbase (https://www.wormbase.org/), Flybase (http://flybase.org/) and UCSC genome browser (https://genome.ucsc.edu/) for S. cerevisiae, C. elegans, D. melanogaster and vertebrates (zebrafish, mouse and human) respectively. Promoter sequences used in this study are 1001 nt long, starting 500 nt upstream, extending up to 500 nt downstream of either TSS or translation start sites (TLS) and the position ‘0’ corresponds to TSS or TLS. 2001 nucleotide long sequences are also considered for mouse and human promoter with TSS data.

TATA-containing and TATA-less promoters and gene expression variability data

Consensus sequence TATA[A/T]A[A/T][A/G][33] has been considered to extract TATA-containing and TATA-less gene promoters in S. cerevisiae. TATA-containing promoters are regarded as those sequences, which contain TATA-box within −150 to −1 region relative to TSS[22]. Gene expression data of S. cerevisiae for seven different gene expression variations namely stochastic noise, responsiveness, stress response, trans variability, mutational variance, inter-strain variance and expression divergence is taken from Choi and Kim 2009 study[36]. Top and bottom ten percentile expression variability values have been used to define genes with lowest and highest expression variability categories[20].

DNA duplex stability or Free energy calculation

DNA duplex stability is referred to the ability of DNA to open up or melt, depends on its hydrogen bonding and base pair stacking. The dinucleotide step energy values corresponding to the 16 dinucleotide steps are taken from melting studies on 108 oligonucleotides[40]. Free energy calculations are computed using 15 nucleotide sliding window[27].

Promoter Prediction

Promoter prediction analysis in 48 different eukaryotic systems is carried out using in-house algorithm “PromPredict”[17,27]. The promoter predictions in mouse and human are also achieved using other structure-based algorithm EP3[16].

PromPredict

PromPredict discriminates between promoter and non-promoter sequences by using the most informative energy related feature, DNA duplex stability. The algorithm considers the Gibbs free energy change (average free energy) of a DNA stretch, which is computed as a sum of the constituent of dinucleotides over two 100 nt (or 50 nt) segments separated by 50 nt window, and the difference of energy values between two segments are evaluated and GC percentage based cut-off values are used to call putative promoter regions[17]. The scoring function D(n) is used to look for differences in free energy of the two neighbouring regions with respect to every nucleotide position ‘n’ (equation 1). The average energy is assigned to the centre position ‘n + 50’ corresponding to each 100 nt window. The functions E1(n + 50) and E2(n + 50) correspond to the mean free energy for 100 nt segments starting from nucleotide position ‘n’ and ‘n + 150’, respectively (equations 2 and 3). D(n) is the difference between E1 and E2. A stretch of DNA sequence is assigned as a promoter only if its average free energy (E1) and the difference in free energy (D(n)) as compared to its neighbouring downstream region are greater than the chosen cut-off values for the respective %GC range, as defined in the TSS- TLS based cut-off values[17]. ‘PromPredict’ outputs predicted promoter region, start and end of the predicted promoter region along with the least stable position (lsp) in the predicted promoter region. In the current study, the ‘lsp’ of a predicted region has been chosen as a measure for defining true and false predictions. The algorithm is scalable and flexible, allowing the setting of sliding window sizes, such as [50 + 25 + 50], [100 + 50 + 100] or [200 + 50 + 200]. The default [100 + 50 + 100] window size restriction has been implemented based on previous analyses[27,41]. However, a statistical analysis on efficiency of “PromPredict” algorithm for various window sizes has been presented as boxplot (Supplementary Figure 7). The recall, precision and F-score were calculated for S. cerevisiae chromosomes (16) for both forward and reverse strands separately. It should be noted that the default window is best suited for both quantitative (recall) and qualitative (precision) analysis as revealed by the F-scores. Notably, with the bigger sliding window (such as 200), the computational time and ambiguity in assigning promoter location increases.

Easy Promoter Prediction Program (EP3)

The EP3 program implements a base-stacking property to demarcate promoter regions from other genomic regions. For a given sequence of DNA, it computes inverted base-stacking energy values over a predefined window size in non-overlapping fashion and calls a DNA fragment as a promoter when the feature value crosses the genome specific threshold score[10,16]. EP3 uses two empirically determined parameters: length of the window and deviation of feature value from genomic average. Window size, 400 has been defined as optimal, and threshold-values of base-stacking are predetermined for genome size. The EP3 program is available at the site: http://bioinformatics.psb.ugent.be/webtools/ep3.

Evaluation of performance

To assess the performance of PromPredict or EP3, evaluating parameters, recall, precision and F-score (harmonic mean of recall and precision) have been used (equations 4–6). These measures for the predictions are calculated using the following formulae: Several true positive regions have been used for eukaryotes for assessing the performance of promoter predictors such as [−1000, +1000] nucleotide[42], [−500, +500] nucleotide and [−500, +100] nucleotide[28]. In this study, the [−500, +100] nucleotide region is considered for evaluating the performance of PromPredict for both TSS (six systems) and TLS (48 systems) datasets. To compare PromPredict with EP3, the true positive (TP) region is chosen as [−1000, +1000] as the latter program gives a prediction in 400 nt non-overlapping windows. False positive (FP) prediction is regarded as the prediction outside the TP region, but for whole genome predictions, only the predictions that lie in the coding regions are chosen, and other predictions are omitted. The closest prediction for a TP gene is considered as a strong signal, which might correspond to the core promoter region[28]. Supplementary information Supplementary dataset

40 in total

1. Generic eukaryotic core promoter prediction using structural features of DNA.

Authors: Thomas Abeel; Yvan Saeys; Eric Bonnet; Pierre Rouzé; Yves Van de Peer
Journal: Genome Res Date: 2007-12-20 Impact factor: 9.043

2. Computing DNA duplex instability profiles efficiently with a two-state model: trends of promoters and binding sites.

Authors: Miriam R Kantorovitz; Zoi Rapti; Vladimir Gelev; Anny Usheva
Journal: BMC Bioinformatics Date: 2010-12-21 Impact factor: 3.169

3. Identification and distinct regulation of yeast TATA box-containing genes.

Authors: Andrew D Basehoar; Sara J Zanton; B Franklin Pugh
Journal: Cell Date: 2004-03-05 Impact factor: 41.582

4. Bidirectional promoters generate pervasive transcription in yeast.

Authors: Zhenyu Xu; Wu Wei; Julien Gagneur; Fabiana Perocchi; Sandra Clauder-Münster; Jurgi Camblong; Elisa Guffanti; Françoise Stutz; Wolfgang Huber; Lars M Steinmetz
Journal: Nature Date: 2009-01-25 Impact factor: 49.962

5. The developmental transcriptome of Drosophila melanogaster.

Authors: Brenton R Graveley; Angela N Brooks; Joseph W Carlson; Michael O Duff; Jane M Landolin; Li Yang; Carlo G Artieri; Marijke J van Baren; Nathan Boley; Benjamin W Booth; James B Brown; Lucy Cherbas; Carrie A Davis; Alex Dobin; Renhua Li; Wei Lin; John H Malone; Nicolas R Mattiuzzo; David Miller; David Sturgill; Brian B Tuch; Chris Zaleski; Dayu Zhang; Marco Blanchette; Sandrine Dudoit; Brian Eads; Richard E Green; Ann Hammonds; Lichun Jiang; Phil Kapranov; Laura Langton; Norbert Perrimon; Jeremy E Sandler; Kenneth H Wan; Aarron Willingham; Yu Zhang; Yi Zou; Justen Andrews; Peter J Bickel; Steven E Brenner; Michael R Brent; Peter Cherbas; Thomas R Gingeras; Roger A Hoskins; Thomas C Kaufman; Brian Oliver; Susan E Celniker
Journal: Nature Date: 2010-12-22 Impact factor: 49.962

6. A novel method for prokaryotic promoter prediction based on DNA stability.

Authors: Aditi Kanhere; Manju Bansal
Journal: BMC Bioinformatics Date: 2005-01-05 Impact factor: 3.169

7. Genome-wide computational prediction and analysis of core promoter elements across plant monocots and dicots.

Authors: Sunita Kumari; Doreen Ware
Journal: PLoS One Date: 2013-10-29 Impact factor: 3.240

8. Links between core promoter and basic gene features influence gene expression.

Authors: Sandra Moshonov; Rofa Elfakess; Michal Golan-Mashiach; Hadar Sinvani; Rivka Dikstein
Journal: BMC Genomics Date: 2008-02-25 Impact factor: 3.969

9. DBTSS: database of transcription start sites, progress report 2008.

Authors: Hiroyuki Wakaguri; Riu Yamashita; Yutaka Suzuki; Sumio Sugano; Kenta Nakai
Journal: Nucleic Acids Res Date: 2007-10-16 Impact factor: 16.971

Review 10. Mammalian RNA polymerase II core promoters: insights from genome-wide studies.

Authors: Albin Sandelin; Piero Carninci; Boris Lenhard; Jasmina Ponjavic; Yoshihide Hayashizaki; David A Hume
Journal: Nat Rev Genet Date: 2007-05-08 Impact factor: 53.242

8 in total

1. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction.

Authors: Meng Zhang; Cangzhi Jia; Fuyi Li; Chen Li; Yan Zhu; Tatsuya Akutsu; Geoffrey I Webb; Quan Zou; Lachlan J M Coin; Jiangning Song
Journal: Brief Bioinform Date: 2022-03-10 Impact factor: 11.622

2. Machine learning and statistics shape a novel path in archaeal promoter annotation.

Authors: Gustavo Sganzerla Martinez; Ernesto Pérez-Rueda; Sharmilee Sarkar; Aditya Kumar; Scheila de Ávila E Silva
Journal: BMC Bioinformatics Date: 2022-05-10 Impact factor: 3.307

3. TSSFinder-fast and accurate ab initio prediction of the core promoter in eukaryotic genomes.

Authors: Mauro de Medeiros Oliveira; Igor Bonadio; Alicia Lie de Melo; Glaucia Mendes Souza; Alan Mitchell Durham
Journal: Brief Bioinform Date: 2021-11-05 Impact factor: 11.622

4. Genome-Wide Prediction of Transcription Start Sites in Conifers.

Authors: Eugeniya I Bondar; Maxim E Troukhan; Konstantin V Krutovsky; Tatiana V Tatarinova
Journal: Int J Mol Sci Date: 2022-02-03 Impact factor: 5.923

5. ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation.

Authors: Ramzan Umarov; Yu Li; Takahiro Arakawa; Satoshi Takizawa; Xin Gao; Erik Arner
Journal: PLoS Comput Biol Date: 2021-09-07 Impact factor: 4.475

6. Characterization of promoters in archaeal genomes based on DNA structural parameters.

Authors: Gustavo Sganzerla Martinez; Sharmilee Sarkar; Aditya Kumar; Ernesto Pérez-Rueda; Scheila de Avila E Silva
Journal: Microbiologyopen Date: 2021-10 Impact factor: 3.139

7. Delineation of the DNA Structural Features of Eukaryotic Core Promoter Classes.

Authors: Akkinepally Vanaja; Venkata Rajesh Yella
Journal: ACS Omega Date: 2022-02-09

8. IMGT^® Biocuration and Analysis of the Rhesus Monkey IG Loci.

Authors: Viviane Nguefack Ngoune; Morgane Bertignac; Maria Georga; Ariadni Papadaki; Alexandre Albani; Géraldine Folch; Joumana Jabado-Michaloud; Véronique Giudicelli; Patrice Duroux; Marie-Paule Lefranc; Sofia Kossida
Journal: Vaccines (Basel) Date: 2022-03-03

8 in total