
Recognition of prokaryotic promoters based on a novel variable-window Z-curve method.

Abstract

Transcription is the first step in gene expression, and it is the step at which most of the regulation of expression occurs. Although sequenced prokaryotic genomes provide a wealth of information, transcriptional regulatory networks are still poorly understood from the available genomic information, largely because accurate prediction of promoters is difficult. To improve promoter recognition performance, a novel variable-window Z-curve method is developed to extract general features of prokaryotic promoters. The features are then classified by the partial least squares technique. To verify the prediction performance, the proposed method is applied to predict promoter fragments of two representative prokaryotic model organisms (Escherichia coli and Bacillus subtilis). Owing to the feature extraction and selection power of the proposed method, the promoter prediction accuracies are improved markedly over most existing approaches: for E. coli, the accuracies are 96.05% (σ70 promoters, coding negative samples), 90.44% (σ70 promoters, non-coding negative samples), 92.13% (known sigma-factor promoters, coding negative samples) and 92.50% (known sigma-factor promoters, non-coding negative samples); for B. subtilis, the accuracies are 95.83% (known sigma-factor promoters, coding negative samples) and 99.09% (known sigma-factor promoters, non-coding negative samples). Additionally, because the proposed method is linear, it is computationally simple and runs in a matter of minutes on ordinary personal computers or even laptops. More importantly, there is no need to optimize parameters, so the method is practical for predicting promoters of other species without any prior knowledge of the statistical properties of the samples.


Year: 2011 PMID: 21954440 PMCID: PMC3273801 DOI: 10.1093/nar/gkr795

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In genetics, a promoter is a region of DNA that facilitates the transcription of particular genes. In bacteria, the promoter is recognized by RNA polymerase (RNAP) and associated sigma factors, which may be recruited to the promoter by regulatory proteins binding to specific sites in the region. Thus, control of transcription initiation accounts for much of the overall regulation of gene expression (1). The continued development of large, sophisticated databases and repositories has made vast amounts of biological data accessible to researchers. Additionally, advances in molecular biology and computational techniques are enabling the systematic investigation of the complex molecular processes underlying biological systems. Many algorithms have been developed for the detection of promoters in prokaryotic genomes. For example, Askary et al. and Rangannan and Bansal developed promoter prediction algorithms based on the difference in stability between neighbouring upstream and downstream regions in the vicinity of experimentally determined transcription start sites (TSSs) (2,3). Rani and Bapi used n-grams (n = 3) as features for a neural network classifier for promoter prediction in Escherichia coli and achieved 67.75% prediction sensitivity and 86.10% specificity (4). Mann et al. used a hybrid technique combining profile hidden Markov models (HMMs) and artificial neural networks (ANNs) with Viterbi scoring optimizations (5). Burden et al. and Bland et al. also used ANNs to improve the promoter prediction accuracy (6,7). Lin and Li developed a hybrid approach (called IPMD) combining a position correlation score function and increment of diversity with a modified Mahalanobis discriminant to predict eukaryotic and prokaryotic promoters (8). By applying IPMD to E. coli and Bacillus subtilis promoter sequences, they achieved sensitivities and specificities of 84.9% and 91.4% for E. coli, and 80.4% and 91.3% for B. subtilis.
Although these attempts, which employ sophisticated machine-learning methods to identify promoters, offer increased accuracy in certain circumstances, the improvements may not justify the heavy computational requirements they impose for training classifiers. Moreover, the selection and optimization of parameters (such as the type and parameters of kernel functions, the number of hidden layer nodes, etc.) require sufficient prior knowledge of the statistical properties of the samples, which makes these methods impractical for the analysis of new genome sequences. The regular Z-curve (or Z-curve) method originally proposed by Zhang is a powerful tool for visualizing and analysing DNA sequences (9,10). It is a 3D curve or point representation of a DNA sequence in the sense that each can be uniquely reconstructed given the other. The resulting curve has a zigzag shape, hence the name Z-curve. The 3D curve or point of a given DNA sequence is calculated from the frequencies of the four bases occurring in it and evaluates the sequence along three main components: the distribution of purine/pyrimidine, the distribution of amino/keto and the distribution of strong H-bonds/weak H-bonds. The Z-curve method has been used in many different areas of genome research, such as replication origin identification (11,12), ab initio gene prediction (13), isochore identification (14), genomic island identification (15) and comparative genomics (16). However, the regular Z-curve method cannot extract information on the w-nucleotide sequence patterns occurring in DNA sequences, so the promoter recognition accuracy based on it is far from satisfactory. Hence, a novel variable-window Z-curve (vw Z-curve) method is proposed here as a feature-extraction tool for prokaryotic promoter recognition for the first time. The features extracted by it (with window size w = 1, 2, … , 6) are used as the input variables for further classification by a partial least squares (PLS) classifier.
Promoter fragments of two prokaryotic model organisms (E. coli and B. subtilis) are used to verify the prediction performance of the proposed method. The feature extraction power of the vw Z-curve method and the iterative feature selection power of the PLS technique improve the prediction performance markedly over most existing approaches: for E. coli, the accuracies are 96.05% (σ70 promoters, coding negative samples), 90.44% (σ70 promoters, non-coding negative samples), 92.13% (known sigma-factor promoters, coding negative samples) and 92.50% (known sigma-factor promoters, non-coding negative samples); for B. subtilis, the accuracies are 95.83% (known sigma-factor promoters, coding negative samples) and 99.09% (known sigma-factor promoters, non-coding negative samples). The results are verified by a 10-fold cross-validation test. Moreover, the proposed method is a linear technique, so its computational simplicity allows it to run on ordinary personal computers or laptops with run times of several minutes. In particular, because there is no need to optimize parameters, this method is practical for predicting promoters of other species without any prior knowledge of the statistical properties of the samples.

MATERIALS AND METHODS

Databases

The complete genomic sequences of E. coli K-12 and B. subtilis are obtained from NCBI GenBank (17). The positions of their experimentally determined TSSs are retrieved from RegulonDB version 7.0 (18) and DBTBS (19). Promoter regions [TSS-60 … TSS+19] (the site of TSS is +1) are then taken as the positive examples. The positive sample database of E. coli consists of two kinds of promoter fragments: 576 experimentally confirmed σ70 promoters and 825 experimentally confirmed promoters of several known sigma factors (576 σ70 promoters, 63 σ38 promoters, 40 σ38 and σ70 promoters, 64 σ24 promoters, 4 σ24 and σ70 promoters, 9 σ28 promoters, 44 σ32 promoters, 7 σ32 and σ70 promoters, 18 σ54 promoters). Considering the comparatively small number of experimentally confirmed B. subtilis promoters, all 660 promoters of known sigma factors (e.g. σ43, σ54, σ37 and so on) are used as the positive samples of B. subtilis. As there are not enough experimentally confirmed negative data (i.e. positions that are confirmed not to be TSSs), the risk has to be taken of choosing the negative examples randomly from the same chromosome. For E. coli K-12, approximately 81% of known TSSs are located in the intergenic non-coding regions and 19% in the coding regions (20). So two kinds of negative examples are prepared:
Coding negative examples: fragments extracted from the coding regions (genes). For E. coli, the coding negative sample set contains 836 80-bp fragments extracted from the start of the open reading frames (ORFs) with lengths of 80–380 bp. For B. subtilis, the coding negative sample set contains 665 80-bp fragments extracted from the start of the ORFs with lengths of 80–335 bp.
Non-coding negative examples: fragments extracted from the non-coding regions (convergent intergenic spacers). For E. coli, the non-coding negative sample set contains 825 fragments with lengths of 80 bp. For B. subtilis, the non-coding negative sample set contains 331 fragments with lengths of 80 bp.
The data sets and the corresponding detailed descriptions are shown in Table 1.

Table 1.

The detailed descriptions of data sets

Data set	Positive samples	Negative samples
Data set-1	576 σ⁷⁰ promoters of E. coli	836 coding fragments of E. coli
Data set-2	576 σ⁷⁰ promoters of E. coli	825 non-coding fragments of E. coli
Data set-3	825 known sigma-factor promoters of E. coli	836 coding fragments of E. coli
Data set-4	825 known sigma-factor promoters of E. coli	825 non-coding fragments of E. coli
Data set-5	660 known sigma-factor promoters of B. subtilis	665 coding fragments of B. subtilis
Data set-6	660 known sigma-factor promoters of B. subtilis	331 non-coding fragments of B. subtilis
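The coordinate arithmetic behind the positive samples described above, the region [TSS-60 … TSS+19], can be illustrated with a minimal sketch. The function name `promoter_fragment`, the 1-based forward-strand coordinate convention and the toy sequence are assumptions made for illustration; promoters on the reverse strand would additionally need reverse-complementing, which is not handled here.

```python
def promoter_fragment(genome: str, tss: int, upstream: int = 60, downstream: int = 19) -> str:
    """Return the fragment [TSS-upstream ... TSS+downstream] of a forward-strand TSS.

    `tss` is assumed to be a 1-based genome coordinate (hypothetical convention).
    With the defaults the fragment is 60 + 1 + 19 = 80 bp long.
    """
    start = tss - upstream      # 1-based, inclusive
    end = tss + downstream      # 1-based, inclusive
    if start < 1 or end > len(genome):
        raise ValueError("fragment runs off the end of the sequence")
    return genome[start - 1:end]  # convert to Python's 0-based, end-exclusive slice


toy_genome = "ACGT" * 50                      # 200-bp toy sequence, not real data
fragment = promoter_fragment(toy_genome, tss=100)
print(len(fragment))                          # 80
```

The TSS itself sits at index 60 of the returned string, so positions upstream of it occupy indices 0 to 59.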


The novel variable-window Z-curve feature extraction method

As the first step of transcription, initiation promoted by the interaction of RNAP with a gene promoter is a key level of control of gene expression. The RNAP holoenzyme is recruited to a given promoter through recognition of the promoter by transcription factors called ‘sigma (σ) factors’, which are variable subunits of the RNAP holoenzyme. Typically, the housekeeping σ70 factor of E. coli binds to the −35 and −10 DNA sequence elements in a promoter, with the consensus sequences TTGACA at position −35 and TATAAT at position −10, respectively (positions indicate the location of each sequence with respect to the TSS). Two other important sites are the extended −10 element with the consensus sequence ‘TGN’ and the AT-rich UP element (21,22). Alternatively, σ54 factors, which control several ancillary processes including the degradation of xylene and toluene, the transport of dicarboxylic acids and so on, bind to ‘GG’ at the −24 location and ‘GC’ at the −12 location of promoters (23). For B. subtilis, the DegU promoter has the ‘GNCATTTA’ consensus DNA-binding sequence (24), σE-independent sigG promoters have ‘TTT’ and ‘AAA’ motifs (25) and so on. It is well known that different sigma factors bind to different promoter motifs, and one genome may encode many different σ-factors. In general, bacterial housekeeping σ-factors, which regulate genes involved in cellular growth, are similar to the E. coli σ70 factor (26,27). Several members of the σ70 factor family have been described: E. coli K-12 has six σ70 family σ-factors (28), whereas B. subtilis has 17 known variants of σ70 (19). A specific subfamily of σ-factors that directly incorporates signals from the extracellular environment in regulating transcription (ECF σ-factors) also exists (29). More details about promoter architecture and sigma factors are available in the Supplementary Data.
Mismatches between RNAP, σ-factors and the given binding sites can be tolerated and even allow for the modulation of promoter strength at some specific genes. Multiple promoters can occur in the same regulatory region of one gene to serve different regulatory functions (30). Unless mutagenesis is performed, each site has the chance to be the place chosen by the RNAP to bind the DNA. Unlike eukaryotic promoters, tightly packed prokaryotic genes and promoters frequently overlap each other (18), obscuring promoter motifs. Experimental procedures identify individual promoters efficiently but are not feasible for sets of genes at the whole-genome scale. This motivated the search for computational methods based on the knowledge gained about the properties of known promoters or on an efficient representation of DNA motifs by means of combinatorial or stochastic methods. Unfortunately, the absence of relatively strong sequence patterns identifying true promoters, the diversity of the motifs, the comparative uncertainty of the motif locations and the incompletely understood mechanisms of promoter regulation confound exact predictions of prokaryotic promoters. The aims of this work are not only to predict promoters with very high accuracy, but also to predict promoters of different sigma factors that have different recognition motifs in one collective data set (Tables 1 and 3). It is therefore important to capture the distinctions between different sigma factors, whose motifs usually comprise more than 1 nt. The regular Z-curve parameters, however, are derived only from the frequencies of mononucleotides occurring in a DNA sequence. Consequently, the features extracted by the regular Z-curve method are not sufficient for promoter recognition, and the promoter prediction accuracy based on these features is far from satisfactory. Up to now, only Yang et al. (31) have used the Z-curve method, in human Pol II promoter recognition.

Table 3.

Prediction results of all known sigma-factor promoters of E. coli using different combinations of vw Z-curve features**

Number*	Data set-3					Data set-4
	4095	650	350	280	230	4095	1100	610	360
Sn (%)	79.63	87.07	91.59	91.95	92.44	82.56	89.63	91.34	92.20
Sp (%)	75.98	88.17	90.49	91.46	91.83	84.51	90.12	91.10	92.80
a (%)	77.80	87.62	91.04	91.71	92.13	83.54	89.88	91.22	92.50

*Number: number of selected vw Z-curve variables.

**The best average accuracies among the algorithms evaluated here were obtained with 230 parameters for Data set-3 and 360 parameters for Data set-4 (shown in boldface in the original table).

According to the key motifs mentioned above, it is reasonable to assume that parameters derived from the distributions of w-nucleotide patterns (window size w∈N) are the essential features that can distinguish promoter regions from non-promoter regions. Hence, a novel variable-window Z-curve (vw Z-curve) method, which introduces a variable window into the regular Z-curve method, is developed and used for prokaryotic promoter recognition for the first time. The following paragraphs explain the methodology of the vw Z-curve method.

Let Word be the set consisting of the 4 nt A, G, C and T, that is, Word = {A,G,C,T}, and let $W^{w}_{i}$ (window size w∈N, i = 1, … , $4^{w}$) be a string constructed by picking w elements from the set Word with order and repetition. For example, when w = 2, $W^{2}_{1}$ = ‘AA’, $W^{2}_{2}$ = ‘AT’, $W^{2}_{3}$ = ‘AG’, … , $W^{2}_{16}$ = ‘TT’. Let the frequency of the sequence pattern ‘$W^{w-1}_{i}X$’ occurring in an ORF or a fragment of DNA sequence be denoted by $p(W^{w-1}_{i}X)$, where X = A, C, G and T. Using the Z-curve method of DNA sequences (32,33), the uniform definition of the vw Z-curve variables (window size w∈N) is

$$
\begin{cases}
x_{W^{w-1}_{i}} = [p(W^{w-1}_{i}A) + p(W^{w-1}_{i}G)] - [p(W^{w-1}_{i}C) + p(W^{w-1}_{i}T)] \\
y_{W^{w-1}_{i}} = [p(W^{w-1}_{i}A) + p(W^{w-1}_{i}C)] - [p(W^{w-1}_{i}G) + p(W^{w-1}_{i}T)] \\
z_{W^{w-1}_{i}} = [p(W^{w-1}_{i}A) + p(W^{w-1}_{i}T)] - [p(W^{w-1}_{i}G) + p(W^{w-1}_{i}C)]
\end{cases}
\qquad i = 1, \ldots, 4^{w-1} \tag{1}
$$

where x, y and z are the coordinates of a point in a 3D space. The mono-nucleotide, di-nucleotide and tri-nucleotide phase-independent Z-curve parameters illustrated by Gao and Zhang (32) are the special instances of the vw Z-curve method with w = 1, 2, 3, as detailed in Equations (2–4).

(1) The Z-curve parameters for frequencies of phase-independent mononucleotides (window size w = 1, variable number n = 3 × 4^0 = 3): the frequencies of bases A, C, G and T occurring in a DNA sequence are denoted by a, c, g and t, respectively. Based on the Z-curve method, a, c, g and t are mapped onto a point in a 3D space with coordinates x, y, z (33):

$$
\begin{cases}
x = (a + g) - (c + t) \\
y = (a + c) - (g + t) \\
z = (a + t) - (g + c)
\end{cases} \tag{2}
$$

(2) The Z-curve parameters for frequencies of phase-independent di-nucleotides (window size w = 2, variable number n = 3 × 4^1 = 12): let the frequency of di-nucleotide XY be denoted by p(XY), where X, Y = A, C, G and T. Using the Z-curve method of DNA sequences, the following equations are obtained:

$$
\begin{cases}
x_{X} = [p(XA) + p(XG)] - [p(XC) + p(XT)] \\
y_{X} = [p(XA) + p(XC)] - [p(XG) + p(XT)] \\
z_{X} = [p(XA) + p(XT)] - [p(XG) + p(XC)]
\end{cases}
\qquad X = A, C, G, T \tag{3}
$$

(3) The Z-curve parameters for frequencies of phase-independent tri-nucleotides (window size w = 3, variable number n = 3 × 4^2 = 48): using similar notation,

$$
\begin{cases}
x_{XY} = [p(XYA) + p(XYG)] - [p(XYC) + p(XYT)] \\
y_{XY} = [p(XYA) + p(XYC)] - [p(XYG) + p(XYT)] \\
z_{XY} = [p(XYA) + p(XYT)] - [p(XYG) + p(XYC)]
\end{cases}
\qquad X, Y = A, C, G, T \tag{4}
$$

In the same way, the vw Z-curve parameters for frequencies of w-nucleotides can be deduced for any w. By a selective combination of n variables or parameters derived from the vw Z-curve method, a DNA sequence can be represented by a point or a vector in an n-dimensional space.

Unlike the variables extracted by Position Weight Matrix (PWM) based algorithms (30), vw Z-curve parameters are derived from the distributions of w-nucleotide patterns occurring within the same sequence fragment, not from their frequencies across different sequence fragments. Thus, the vw Z-curve parameters are not influenced by the uncertainty of motif positions relative to the TSS. Owing to the introduction of the window size w, the distributions of w-nucleotide patterns associated with different sigma factors can be taken into account simultaneously. Consequently, the vw Z-curve method is especially suitable for motif-finding or pattern recognition (PR) problems in DNA sequence research. Considering both the length of the widely known motifs and the computational requirement, the window sizes of the proposed vw Z-curve method used for promoter recognition are set to w = 1, 2, … , 6. The detailed descriptions of them are shown in Supplementary Table S1. For researchers’ convenience, the MATLAB codes of the vw Z-curve method are given in the Supplementary Data.
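To make the feature definition concrete, the following is a minimal Python sketch of vw Z-curve feature extraction for w = 1, … , 6; it is not the authors' MATLAB code from the Supplementary Data. The function name `vw_zcurve`, the sliding-window counting and the normalisation of each frequency by the number of w-mers in the fragment are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"


def vw_zcurve(seq: str, max_w: int = 6) -> list[float]:
    """Return the vw Z-curve variables of `seq` for window sizes 1..max_w.

    For each window size w there are 3 * 4**(w - 1) variables: one (x, y, z)
    triple per (w-1)-nucleotide prefix. Frequencies are taken over all
    overlapping w-mers of the fragment (an assumed normalisation).
    """
    seq = seq.upper()
    features: list[float] = []
    for w in range(1, max_w + 1):
        total = len(seq) - w + 1
        counts: dict[str, int] = {}
        for i in range(max(total, 0)):
            word = seq[i:i + w]
            counts[word] = counts.get(word, 0) + 1
        for prefix in product(BASES, repeat=w - 1):   # one empty prefix when w == 1
            s = "".join(prefix)
            # frequencies of the patterns sA, sC, sG, sT
            a, c, g, t = (counts.get(s + b, 0) / total if total > 0 else 0.0
                          for b in BASES)
            features.append((a + g) - (c + t))   # x: purine vs pyrimidine
            features.append((a + c) - (g + t))   # y: amino vs keto
            features.append((a + t) - (g + c))   # z: weak vs strong H-bonds
    return features


toy_fragment = "TTGACA" + "ACGT" * 5 + "TATAAT"   # toy sequence, not real data
print(len(vw_zcurve(toy_fragment)))               # 4095
```

With `max_w = 6` the vector has 3 × (1 + 4 + 16 + 64 + 256 + 1024) = 4095 entries, matching the number of variables used in the first feature selection iteration.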

Partial least squares classifier

Supervised pattern analysis can be treated as a regression problem in which the dependent variable is defined as l∈{−1,+1} in two-class problems or as l∈{1, 2, … , N} in multi-class problems, where N is the number of classes. Hence regression algorithms can be used as classifiers in supervised PR. The PLS algorithm is a key technique for modelling linear relationships between a set of output variables (known class labels) and a set of input variables (predictors). It creates orthogonal latent variables (LVs), which are linear combinations of the original variables. The basic point of the procedure is that the weights used to determine these linear combinations are chosen to maximize the covariance between the input and output variables (34). Hence, by the projection of the PLS algorithm, the n-dimensional variable space is compressed into the v-dimensional LV-space (v<<n in common cases) to remove the noise and the multicollinearity of the original data. This leads to a biased but lower-variance estimate of the regression coefficients compared with the least squares method (34). PLS has been proven to be very useful in situations where the number of observed variables (n) is significantly greater than the number of observations (m) and high multicollinearity among the variables exists (35,36), which is the case in the current study. Thus, PLS is expected to be a useful supervised PR method with potential applications in the discovery of key vw Z-curve features. For more detailed mathematical descriptions of the PLS classifier, please refer to the Supplementary Data.
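As an illustration of PLS used as a two-class classifier, the sketch below implements a single-response PLS regression (the PLS1 variant, computed with NIPALS-style deflation) in plain Python and classifies by the sign of the predicted label. This particular variant, the function names `pls1_fit` and `pls1_classify` and the toy data are assumptions; the paper's own formulation is given in its Supplementary Data and may differ in detail.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_components):
    """Fit PLS1 and return (coefficients, intercept) for y ~ X."""
    m, n = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / m for j in range(n)]
    y_mean = sum(y) / m
    Xc = [[row[j] - x_mean[j] for j in range(n)] for row in X]   # centred X
    yc = [v - y_mean for v in y]                                 # centred y
    coef = [0.0] * n
    loadings, rotations = [], []
    for _ in range(n_components):
        # weight vector: direction of maximum covariance between X and y
        w = [dot([row[j] for row in Xc], yc) for j in range(n)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:
            break                                   # nothing left to explain
        w = [v / norm for v in w]
        t = [dot(row, w) for row in Xc]             # latent variable (scores)
        tt = dot(t, t)
        if tt < 1e-12:
            break
        p = [dot([row[j] for row in Xc], t) / tt for j in range(n)]
        q = dot(yc, t) / tt                         # regression of y on t
        # r expresses the scores in terms of the original centred X: t = X r
        r = w[:]
        for p_j, r_j in zip(loadings, rotations):
            c = dot(p_j, w)
            r = [ri - c * rj for ri, rj in zip(r, r_j)]
        coef = [b + q * ri for b, ri in zip(coef, r)]
        loadings.append(p)
        rotations.append(r)
        # deflate X so the next latent variable is orthogonal to this one
        Xc = [[row[j] - t[i] * p[j] for j in range(n)] for i, row in enumerate(Xc)]
    intercept = y_mean - dot(coef, x_mean)
    return coef, intercept


def pls1_classify(X, coef, intercept):
    """Assign +1 (promoter) or -1 (non-promoter) by the sign of the score."""
    return [1 if dot(row, coef) + intercept >= 0 else -1 for row in X]


# toy data: two features, labels in {-1, +1}
X_toy = [[2.0, -1.0], [1.5, -2.0], [-2.0, 1.0], [-1.5, 2.0]]
y_toy = [1, 1, -1, -1]
coef, intercept = pls1_fit(X_toy, y_toy, n_components=1)
print(pls1_classify(X_toy, coef, intercept))
```

The absolute values of `coef` are the quantities that the feature selection procedure ranks; the number of latent variables is the only setting, and here it is simply passed in by the caller.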

The performance of promoter prediction

To evaluate the performance of promoter prediction, the following measurements are used here:

$$
S_n = \frac{TP}{TP + FN}, \qquad S_p = \frac{TN}{TN + FP}, \qquad a = \frac{S_n + S_p}{2}
$$

where TP, TN, FP and FN are the fractions of true positive, true negative, false positive and false negative predictions, respectively. The sensitivity Sn is the proportion of promoter sequences that have been correctly predicted as promoters. The specificity Sp is the proportion of negative samples that have been correctly predicted as negative samples. The accuracy a is defined as the average of Sn and Sp. Thus the goal in this study is to maximize the prediction accuracy a on the testing set while keeping a good trade-off between Sn and Sp. To reduce the effect of sample randomness and to evaluate the prediction performance rigorously, 10-fold cross-validation is performed (referred to here as the cross-validation jackknife test): the data set is divided into 10 parts, and each one-tenth is tested in turn with a model trained on the remaining nine-tenths.
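The measurements and the 10-fold split described above can be sketched as follows. The function names `sn_sp_accuracy` and `ten_fold_splits`, the label coding (+1 for promoters, −1 for negatives) and the use of a seeded shuffle are assumptions made for illustration.

```python
import random


def sn_sp_accuracy(y_true, y_pred):
    """Return (Sn, Sp, a) for labels coded +1 (promoter) and -1 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    sn = tp / (tp + fn)            # sensitivity
    sp = tn / (tn + fp)            # specificity
    return sn, sp, (sn + sp) / 2   # accuracy a is the mean of Sn and Sp


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


print(sn_sp_accuracy([1, 1, 1, 1, -1, -1, -1, -1],
                     [1, 1, 1, -1, -1, -1, 1, 1]))   # (0.75, 0.5, 0.625)
```

Averaging Sn and Sp, rather than counting all correct predictions, keeps the score from being dominated by the larger class when the positive and negative sets differ in size.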

RESULTS AND DISCUSSION

Optimum combination of the vw Z-curve features for promoter prediction using the PLS based iterative feature selection method

Apart from feature extraction, feature selection (also known as variable selection) is one of the most useful techniques for improving the performance of PR. By removing the most irrelevant and redundant features, feature selection helps to alleviate the effect of the curse of dimensionality, enhance generalizability, speed up the learning process and improve model interpretability. To select an optimal set of features, it is necessary to quantitatively evaluate the contribution of each feature of the vw Z-curve method. As mentioned above, two-class supervised pattern analysis can be handled as a univariate regression problem in which the dependent variable is defined as l∈{−1,+1}. For univariate regression problems, the absolute value of the regression coefficient of each variable is a reasonable measurement of its contribution. The total number of vw Z-curve variables for w = 1, … , 6 is $\sum_{w=1}^{6} 3 \times 4^{w-1} = 4095$. The absolute values of the coefficients of all 4095 variables in the PLS promoter recognition models of data set-1 and data set-5 are shown in Supplementary Figure S3. The figure shows that only a few variables stand out above the others with high absolute coefficient values. Important information is thus buried in a sea of trivialities, a phenomenon known as ‘information saturation’. Moreover, the methodology of the vw Z-curve method indicates that there are strong multicollinear relationships among all these features. Hence, considering the large number of variables compared with the number of samples, an iterative feature selection method relying on the PLS algorithm is used to improve the recognition performance. The detailed procedure is as follows (taking the iterative feature selection of data set-1 as an example):
(1) Selecting m positive and m negative samples (here, m = 576, the number of positive samples).
(2) Using n vw Z-curve variables for training the promoter recognition model (for the first feature selection iteration, n = 4095).
(3) Sorting the variables in descending order according to their absolute coefficient values.
(4) Selecting the top p variables (e.g. p = 600) with the highest absolute coefficient values.
(5) Using a cross-validation procedure to assess the prediction performance of these selected variables.
(6) Optimizing p to maximize the prediction average accuracy and to ensure a good trade-off between sensitivity (Sn) and specificity (Sp) of the recognition model.
(7) Repeating steps 2–6 (setting n = p) until the prediction average accuracy converges.
It is notable that, for different sample sets, the number of feature selection iterations and the optimal combination of vw Z-curve variables may be different. The prediction results shown in Tables 2–4 demonstrate that the promoter recognition performance, which is strongly lowered by information saturation and multicollinearity, is remarkably improved by the iterative feature selection method.
From the results shown in Table 2, it can be seen that after the first iteration of feature selection, the number of useful vw Z-curve variables is markedly reduced from 4095 to 600. By eliminating the interference of irrelevant variables, the prediction accuracy of data set-1 is improved by 15.27 percentage points, and the accuracy of data set-2 is improved by 8.16 percentage points. These 600 variables are then used to build new recognition models so that their importance can be re-evaluated clearly and fairly. To further improve the prediction accuracy, features are selected according to their re-evaluated importance, and the process continues until no more useless variables can be eliminated. After three iterations of feature selection, the final prediction accuracy of data set-1 reaches 96.05%, much better than the accuracy obtained with any previously developed method. The final accuracy of data set-2 is only 90.44%, but it is still much better than the accuracy achieved by most other methods. Furthermore, the trade-off between Sn and Sp is also improved by the feature selection procedure: the difference between Sn and Sp for data set-1 is reduced from 4.04% with 4095 variables to 0.53% with 330 variables; for data set-2, it is reduced from 5.96% to 3.34%.
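The iterative feature selection procedure can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the sequence of p values is passed in as a fixed `schedule` instead of being optimized at each iteration, and the default `fit` and `evaluate` callables are hypothetical stand-ins (covariance weights, which are proportional to the coefficients of a one-component PLS model on centred data, and a resubstitution accuracy) that a real run would replace with a full PLS model and cross-validation.

```python
def covariance_weights(X, y):
    """Stand-in for PLS coefficients: covariance of each centred column with y."""
    m = len(X)
    y_mean = sum(y) / m
    weights = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = sum(col) / m
        weights.append(sum((c - c_mean) * (v - y_mean) for c, v in zip(col, y)))
    return weights


def resubstitution_accuracy(X, y):
    """Stand-in for cross-validation: fraction of training samples classified correctly."""
    w = covariance_weights(X, y)
    m = len(X)
    means = [sum(row[j] for row in X) / m for j in range(len(w))]
    hits = 0
    for row, label in zip(X, y):
        score = sum(wj * (xj - mj) for wj, xj, mj in zip(w, row, means))
        hits += (1 if score >= 0 else -1) == label
    return hits / m


def iterative_selection(X, y, schedule, fit=covariance_weights,
                        evaluate=resubstitution_accuracy):
    """Keep the top-p variables by |coefficient| for each p in `schedule`."""
    selected = list(range(len(X[0])))            # start from all variables
    history = []
    for p in schedule:
        sub = [[row[j] for j in selected] for row in X]
        coefs = fit(sub, y)                      # train on the current variables
        order = sorted(range(len(coefs)), key=lambda k: abs(coefs[k]), reverse=True)
        selected = [selected[k] for k in sorted(order[:p])]
        kept = [[row[j] for j in selected] for row in X]
        history.append((p, evaluate(kept, y)))   # assess the reduced variable set
    return selected, history


# toy data: columns 0 and 2 carry the class signal, columns 1 and 3 do not
X_demo = [[2.0, 0.1, -1.0, 0.0], [1.5, -0.1, -2.0, 0.0],
          [-2.0, 0.1, 1.0, 0.0], [-1.5, -0.1, 2.0, 0.0]]
y_demo = [1, 1, -1, -1]
print(iterative_selection(X_demo, y_demo, schedule=[3, 2]))
```

Refitting after each reduction matters because the coefficients of the surviving variables change once the discarded, collinear variables are removed, which is why their importance is re-evaluated at every iteration.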

Table 2.

Prediction results of the σ70 promoters of E. coli using different combinations of vw Z-curve features**

Number*	Data set-1				Data set-2
	4095	600	350	330	4095	600	500	245	220
Sn (%)	80.00	92.63	95.79	96.32	81.40	88.42	87.02	91.40	92.11
Sp (%)	75.96	93.86	95.61	95.79	75.44	84.74	85.09	86.14	88.77
a (%)	77.98	93.25	95.70	96.05	78.42	86.58	86.05	88.77	90.44

*Number: number of selected vw Z-curve variables.

**The best average accuracies among the algorithms evaluated here were obtained with 330 parameters for Data set-1 and 220 parameters for Data set-2 (shown in boldface in the original table).

The prediction results for all experimentally confirmed promoters of known sigma factors of E. coli are shown in Table 3. The prediction accuracies for both data sets are improved markedly by the iterative feature selection method. The highest average accuracies of data set-3 and data set-4 are 92.13% and 92.50%, respectively. To verify the prediction performance of the proposed method further, it is used to predict promoter sequences of B. subtilis, a typical Gram-positive model organism. The samples are contained in data set-5 and data set-6, whose details are also shown in Table 1. The prediction results are shown in Table 4. Surprisingly, the average accuracies of data set-5 and data set-6 are as high as 95.83% and 99.09%, respectively, which are much higher than the accuracies obtained by any other existing method.

Table 4.

Prediction results of all known sigma-factor promoters of B. subtilis using different combinations of vw Z-curve features**

Number*	Data set-5				Data set-6
	4095	872	405	340	4095	740	490
Sn (%)	80.91	92.73	95.30	95.76	66.97	94.55	98.79
Sp (%)	81.82	91.97	94.24	95.91	73.03	95.76	99.39
a (%)	81.36	92.35	94.77	95.83	70.00	95.15	99.09

*Number: number of selected vw Z-curve variables.

**The best average accuracies among the algorithms evaluated here were obtained with 340 parameters for Data set-5 and 490 parameters for Data set-6 (shown in boldface in the original table).

Comparison with other existing methods

Evaluation of the performance of the proposed method requires comparisons with other available methods. Because different algorithms use different negative sample sets and different fragment sizes for promoter samples, it is only possible to give a rough comparison between the proposed method and other methods.

Comparing the prediction performance of E. coli promoter

Most existing methods tested their prediction performance using 80-bp σ70 promoter fragments of E. coli K-12 (TSS-60 … TSS+19). They used two kinds of negative samples: coding segments and intergenic segments. The best prediction results of the different methods are shown in detail in Table 5, using the measurements most commonly reported by these methods. For both coding and non-coding negative samples, the overall performance of the proposed method is much better than that of the other methods.

Table 5.

The best prediction results of E. coli promoters obtained by different methods (fragment length is 80 bp)**

Methods	Sensitivity (%) TP/(TP+FN)	Specificity (%) TN/(TN+FP)	Precision (%) TP/(TP+FP)
Negative samples: Coding segments
IPMD (8)	84.9	91.4	–
Sequence Alignment Kernel+SVM (20)	82	–	84
The proposed method	96.32	95.79	95.81
Negative samples: Intergenic segments
3-gram* (4)	67.75	86.10	–
IPMD (8)	81	92.7	–
Sequence Alignment Kernel+SVM (20)	81	–	81
The proposed method	92.11	88.77	89.13

*The negative sample set contained 709 sequence fragments from the coding region and 709 sequence segments from intergenic portions. Training data set size for E. coli was 1669. The paper did not give more details about the training and testing set.

**The best average accuracies among the algorithms evaluated here were shown in boldface.

When intergenic segments are taken as negative samples, the specificity obtained by IPMD is 92.7%, which is higher than the specificity obtained by the proposed method. However, the average accuracy (the mean of sensitivity and specificity) obtained by IPMD is 86.85%, whereas that obtained by the proposed method is 90.44%, an improvement of 3.59 percentage points. In addition, the difference between specificity and sensitivity is 11.7% for IPMD but only 3.34% for the proposed method. The trade-off between Sn and Sp achieved by the proposed method is therefore much better than that achieved by IPMD.

As mentioned above, for E. coli K-12 approximately 81% of known TSSs are located in intergenic non-coding regions and 19% in coding regions (20). Partly for this reason, the difference in pattern distribution between coding sequences and promoter sequences is much more statistically significant than that between intergenic sequences and promoter sequences. Consequently, as the results in Table 5 show, every listed method separates promoters from coding sequences better than it separates promoters from non-coding sequences.

The promoter region is less stable, and hence more prone to melting, than other genomic regions. Some methods therefore exploit the difference in DNA stability between promoter and non-promoter regions. Askary et al. presented a modified ANN (named N4) fed by nearest neighbours and based on DNA duplex stability (2). The sensitivity [TP/(TP+FN)] and precision [TP/(TP+FP)] of N4 for predicting promoters in E. coli are both 94%. To the authors' knowledge, this is the best result reported in the existing literature. The method presented here is compared with that of Askary et al. (2) using the same measurements and similar database construction methods. The positive sample set consists of 576 experimentally confirmed σ70 promoter fragments of 414 bp ([−207 … TSS …], with the TSS at position +1). The negative sample set consists of the first 414-bp fragment of each of the 530 ORFs with lengths of 414–585 bp. The best recognition results obtained by the two methods are shown in Table 6. The proposed method achieves higher sensitivity and precision (97.10% and 97.31%) than the N4 method (94% for both).
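
The construction of the negative sample set described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `first_fragment_negatives`, the input format (a mapping from ORF name to sequence) and the toy sequences are assumptions made for this example. Only the 414-bp fragment length and the 414–585 bp ORF length range come from the text.

```python
def first_fragment_negatives(orfs, frag_len=414, max_len=585):
    """Keep ORFs whose length lies in [frag_len, max_len] and return
    the first `frag_len` bases of each one (upper-cased).

    `orfs` is a hypothetical mapping: ORF name -> nucleotide sequence.
    """
    negatives = {}
    for name, seq in orfs.items():
        if frag_len <= len(seq) <= max_len:
            negatives[name] = seq[:frag_len].upper()
    return negatives


# Toy input with made-up sequences of 300, 414, 500 and 900 bp.
toy_orfs = {
    "orf_short": "ATG" * 100,   # 300 bp: too short, rejected
    "orf_exact": "ATG" * 138,   # 414 bp: kept as is
    "orf_mid": "atgc" * 125,    # 500 bp: kept, truncated to 414 bp
    "orf_long": "ATG" * 300,    # 900 bp: too long, rejected
}

negatives = first_fragment_negatives(toy_orfs)
for name, seq in sorted(negatives.items()):
    print(name, len(seq))
```

Restricting the ORF length keeps each negative fragment the same length as the positive fragments, so both classes present equally long sequences to the feature extractor.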

Table 6.

The best recognition results of E. coli promoters obtained by different methods (fragments length is 414 bp)*

Methods	Sensitivity (%) TP/(TP+FN)	Precision (%) TP/(TP+FP)
The proposed method	97.10	97.31
N4 Neural Networks (2)	94	94

*The best average accuracies among the algorithms evaluated here were shown in boldface.
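
The measures used in these comparisons follow directly from confusion-matrix counts. The sketch below only illustrates the formulas given in the table headers and the definition of average accuracy as the mean of sensitivity and specificity. The function names and the counts are hypothetical and are not taken from the paper.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): fraction of true promoters that are recovered."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Sp = TN / (TN + FP): fraction of non-promoters that are rejected."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Precision = TP / (TP + FP): fraction of predictions that are correct."""
    return tp / (tp + fp)


def average_accuracy(sn, sp):
    """Average accuracy as used in the text: the mean of Sn and Sp."""
    return (sn + sp) / 2


# Hypothetical counts, chosen only to exercise the formulas.
tp, fn, tn, fp = 90, 10, 80, 20

sn = sensitivity(tp, fn)
sp = specificity(tn, fp)
pr = precision(tp, fp)
acc = average_accuracy(sn, sp)

print(f"Sn={sn:.2%}  Sp={sp:.2%}  precision={pr:.2%}  average accuracy={acc:.2%}")
```

Specificity and precision both depend on the number of false positives, but they are normalised differently. Precision also changes with the ratio of positive to negative samples, which is one reason comparisons should use similarly constructed data sets.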

Comparing the prediction performance for B. subtilis promoters

Bacillus subtilis, a representative Gram-positive bacterium, is often used to demonstrate the performance of prokaryotic promoter prediction methods. Lin and Li applied the IPMD method to predict B. subtilis promoters (8); to the authors' knowledge, this is the best result reported in the existing literature. The proposed method is compared with that of Lin and Li (8) using the same measurements and similar database construction methods. The best recognition results obtained by the two methods are shown in Table 7. With coding negative samples, the proposed method raises the average accuracy from 85.85% to 95.83% and reduces the difference between Sn and Sp from 10.9% to 0.15%. With non-coding negative samples, it raises the average accuracy by 15.54 percentage points (from 83.55% to 99.09%) and reduces the difference between Sn and Sp by 21.3 percentage points (from 21.9% to 0.6%). These results indicate that both the prediction accuracy and the trade-off between Sn and Sp are improved substantially.

Table 7.

The best recognition results of B. subtilis promoters obtained by different methods (fragments length is 80 bp)*

Methods	Results (%)
Methods	Sensitivity (Sn) TP/(TP+FN)	Specificity (Sp) TN/(TN+FP)	Average accuracy	Difference between Sn and Sp
Negative samples: coding segments
IPMD (8)	80.4	91.3	85.85	10.9
The proposed method	95.76	95.91	95.83	0.15
Negative samples: intergenic segments
IPMD (8)	72.6	94.5	83.55	21.9
The proposed method	98.79	99.39	99.09	0.6

*The best average accuracies among the algorithms evaluated here were shown in boldface.
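
The derived columns of Table 7 can be rechecked from the reported Sn and Sp values. The sketch below is only an arithmetic check: the Sn and Sp values are copied from the table, while the dictionary layout and the helper name `summarize` are assumptions made for this example.

```python
# Sn and Sp (%) copied from Table 7, keyed by (negative set, method).
table7 = {
    ("coding", "IPMD"): (80.4, 91.3),
    ("coding", "proposed"): (95.76, 95.91),
    ("intergenic", "IPMD"): (72.6, 94.5),
    ("intergenic", "proposed"): (98.79, 99.39),
}


def summarize(sn, sp):
    """Return (average accuracy, |Sp - Sn|) for one table row."""
    return (sn + sp) / 2, abs(sp - sn)


summary = {key: summarize(sn, sp) for key, (sn, sp) in table7.items()}

for (negatives, method), (avg, gap) in summary.items():
    print(f"{negatives:10s} {method:8s} average={avg:.2f}  |Sp-Sn|={gap:.2f}")

# Gain of the proposed method over IPMD for each negative set.
for negatives in ("coding", "intergenic"):
    gain = summary[(negatives, "proposed")][0] - summary[(negatives, "IPMD")][0]
    print(f"{negatives:10s} accuracy gain={gain:.2f} percentage points")
```

The mean of 95.76 and 95.91 is exactly 95.835, so this row may print as 95.83 or 95.84 depending on floating-point rounding. The table reports it as 95.83.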

CONCLUSIONS

With the rapid development of research on synthetic biology and genetic regulatory networks, understanding gene regulation has become one of the main challenges for biologists. In this context, highly accurate prediction of promoter regions, which promote the initiation of gene expression, is essential for studying regulatory mechanisms. In this paper, a novel variable-window (vw) Z-curve method is developed, for the first time, as a feature extraction tool for prokaryotic promoter recognition. The proposed method is applied to promoter prediction in E. coli and B. subtilis. Combined with the iterative feature selection and classification power of the PLS algorithm, it markedly improves both the recognition accuracy and the trade-off between sensitivity and specificity. Its simplicity makes it practical for research without any prior knowledge or prior information, and it runs on ordinary personal computers or laptops in several minutes. Although the method was developed for prokaryotic promoter recognition and has only been tested on E. coli and B. subtilis promoter fragments, it can readily serve as a basis for developing eukaryotic promoter prediction methods or new motif-finding methods.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online: Supplementary Tables S1, Supplementary Figures S1–3, Supplementary Methods, Supplementary MATLAB codes and Supplementary References [37-56].

FUNDING

Funding for open access charge: National Natural Science Foundation of China (31000592); Doctoral Fund of Ministry of Education of China (200800561005).

Conflict of interest statement. None declared.
