Kai Song.
Abstract
Transcription is the first step in gene expression, and it is the step at which most regulation of expression occurs. Although sequenced prokaryotic genomes provide a wealth of information, transcriptional regulatory networks remain poorly understood from the available genomic information, largely because accurate promoter prediction is difficult. To improve promoter recognition, a novel variable-window (vw) Z-curve method is developed to extract general features of prokaryotic promoters. The features are then classified by the partial least squares technique. To verify prediction performance, the proposed method is applied to promoter fragments of two representative prokaryotic model organisms (Escherichia coli and Bacillus subtilis). Owing to the feature extraction and selection power of the proposed method, promoter prediction accuracies improve markedly over most existing approaches. For E. coli, the accuracies are 96.05% (σ70 promoters, coding negative samples), 90.44% (σ70 promoters, non-coding negative samples), 92.13% (known sigma-factor promoters, coding negative samples) and 92.50% (known sigma-factor promoters, non-coding negative samples); for B. subtilis, the accuracies are 95.83% (known sigma-factor promoters, coding negative samples) and 99.09% (known sigma-factor promoters, non-coding negative samples). Additionally, because the method is linear, it is computationally simple and runs in a matter of minutes on an ordinary personal computer or laptop. More importantly, there is no need to optimize parameters, so the method is practical for predicting promoters of other species without any prior knowledge of the statistical properties of the samples.
Year: 2011 PMID: 21954440 PMCID: PMC3273801 DOI: 10.1093/nar/gkr795
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
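The abstract names a variable-window Z-curve feature extractor but does not spell out the transform. As a rough illustration only, the sketch below computes the classical frequency-based Z-curve components (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding) over fixed, non-overlapping windows. The fixed window is a simplifying assumption and is not the paper's variable-window scheme; the function names `z_curve_components` and `windowed_z_curve` are hypothetical.

```python
from collections import Counter


def z_curve_components(seq: str) -> tuple[float, float, float]:
    """Return the (x, y, z) Z-curve components from base frequencies."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    a, c, g, t = (counts[b] / total for b in "ACGT")
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return x, y, z


def windowed_z_curve(seq: str, window: int) -> list[float]:
    """Concatenate Z-curve components over consecutive non-overlapping windows.

    Assumption: a single fixed window size; trailing bases that do not
    fill a whole window are ignored.
    """
    features: list[float] = []
    for start in range(0, len(seq) - window + 1, window):
        features.extend(z_curve_components(seq[start:start + window]))
    return features


if __name__ == "__main__":
    print(z_curve_components("ACGT"))
    print(windowed_z_curve("ACGTACGTAC", 4))
```

A feature vector of this kind would then be passed to a classifier; the abstract states that partial least squares is used for that step.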
Detailed descriptions of the data sets
| Data set | Positive samples | Negative samples |
|---|---|---|
| Data set-1 | 576 σ70 promoters of E. coli | 836 coding fragments of E. coli |
| Data set-2 | 576 σ70 promoters of E. coli | 825 non-coding fragments of E. coli |
| Data set-3 | 825 known sigma-factor promoters of E. coli | 836 coding fragments of E. coli |
| Data set-4 | 825 known sigma-factor promoters of E. coli | 825 non-coding fragments of E. coli |
| Data set-5 | 660 known sigma-factor promoters of B. subtilis | 665 coding fragments of B. subtilis |
| Data set-6 | 660 known sigma-factor promoters of B. subtilis | 331 non-coding fragments of B. subtilis |
Prediction results for all known sigma-factor promoters of E. coli using different combinations of vw Z-curve features**
| | Data set-3 | | | | | Data set-4 | | | |
|---|---|---|---|---|---|---|---|---|---|
| Number* | 4095 | 650 | 350 | 280 | 230 | 4095 | 1100 | 610 | 360 |
| Results (%) | | | | | | | | | |
| | 79.63 | 87.07 | 91.59 | 91.95 | | 82.56 | 89.63 | 91.34 | |
| | 75.98 | 88.17 | 90.49 | 91.46 | | 84.51 | 90.12 | 91.10 | |
| Average accuracy | 77.80 | 87.62 | 91.04 | 91.71 | **92.13** | 83.54 | 89.88 | 91.22 | **92.50** |
*Number: number of selected vw Z-curve variables.
**The average accuracies of the vw Z-curve method with 230 parameters for Data set-3 and 360 parameters for Data set-4, which were the best among the algorithms evaluated here, are shown in boldface.
Prediction results for the σ70 promoters of E. coli using different combinations of vw Z-curve features**
| | Data set-1 | | | | Data set-2 | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Number* | 4095 | 600 | 350 | 330 | 4095 | 600 | 500 | 245 | 220 |
| Results (%) | | | | | | | | | |
| | 80.00 | 92.63 | 95.79 | | 81.40 | 88.42 | 87.02 | 91.40 | |
| | 75.96 | 93.86 | 95.61 | | 75.44 | 84.74 | 85.09 | 86.14 | |
| Average accuracy | 77.98 | 93.25 | 95.70 | **96.05** | 78.42 | 86.58 | 86.05 | 88.77 | **90.44** |
*Number: number of selected vw Z-curve variables.
**The average accuracies of the vw Z-curve method with 330 parameters for Data set-1 and 220 parameters for Data set-2, which were the best among the algorithms evaluated here, are shown in boldface.
Prediction results for all known sigma-factor promoters of B. subtilis using different combinations of vw Z-curve features**
| | Data set-5 | | | | Data set-6 | | |
|---|---|---|---|---|---|---|---|
| Number* | 4095 | 872 | 405 | 340 | 4095 | 740 | 490 |
| Results (%) | | | | | | | |
| | 80.91 | 92.73 | 95.30 | | 66.97 | 94.55 | |
| | 81.82 | 91.97 | 94.24 | | 73.03 | 95.76 | |
| Average accuracy | 81.36 | 92.35 | 94.77 | **95.83** | 70.00 | 95.15 | **99.09** |
*Number: number of selected vw Z-curve variables.
**The average accuracies of the vw Z-curve method with 340 parameters for Data set-5 and 490 parameters for Data set-6, which were the best among the algorithms evaluated here, are shown in boldface.
The best prediction results for E. coli promoters obtained by different methods (fragment length: 80 bp)**
| Methods | Sensitivity (%) TP/(TP+FN) | Specificity (%) TN/(TN+FP) | Precision (%) TP/(TP+FP) |
|---|---|---|---|
| Negative samples: coding segments | | | |
| IPMD | 84.9 | 91.4 | – |
| Sequence Alignment Kernel+SVM | 82 | – | 84 |
| The proposed method | | | |
| Negative samples: intergenic segments | | | |
| 3-gram* | 67.75 | 86.10 | – |
| IPMD | 81 | 92.7 | – |
| Sequence Alignment Kernel+SVM | 81 | – | 81 |
| The proposed method | | | |
*The negative sample set contained 709 sequence fragments from the coding region and 709 sequence segments from intergenic regions. The training data set size for E. coli was 1669. The paper did not give more details about the training and testing sets.
**The best average accuracies among the algorithms evaluated here are shown in boldface.
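The three column headers above define their metrics in terms of confusion-matrix counts. A minimal sketch of those formulas follows; the counts in the usage lines are made-up illustrative numbers, not values from the paper, and the function names are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true promoters that are recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of negative fragments correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted promoters that are real: TP / (TP + FP)."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    tp, fn, tn, fp = 90, 10, 80, 20
    print(f"sensitivity = {100 * sensitivity(tp, fn):.2f}%")
    print(f"specificity = {100 * specificity(tn, fp):.2f}%")
    print(f"precision   = {100 * precision(tp, fp):.2f}%")
```

Sensitivity and specificity each depend on only one class, whereas precision mixes both, so precision shifts with the ratio of positive to negative samples. That matters when comparing rows that report only some of the three metrics.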
The best recognition results for E. coli promoters obtained by different methods (fragment length: 414 bp)*
| Methods | Sensitivity (%) TP/(TP+FN) | Precision (%) TP/(TP+FP) |
|---|---|---|
| The proposed method | | |
| N4 Neural Networks | 94 | 94 |
*The best average accuracies among the algorithms evaluated here are shown in boldface.
The best recognition results for B. subtilis promoters obtained by different methods (fragment length: 80 bp)*
| Methods | Sensitivity (%) | Specificity (%) | Average accuracy (%) | Difference between sensitivity and specificity (%) |
|---|---|---|---|---|
| Negative samples: coding segments | | | | |
| IPMD | 80.4 | 91.3 | 85.85 | 10.9 |
| The proposed method | | | | |
| Negative samples: intergenic segments | | | | |
| IPMD | 72.6 | 94.5 | 83.55 | 21.9 |
| The proposed method | | | | |
*The best average accuracies among the algorithms evaluated here are shown in boldface.
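The IPMD rows in the table above are consistent with the average accuracy being the arithmetic mean of sensitivity and specificity, and with the last column being their absolute difference. The short sketch below recomputes both columns from the reported sensitivity and specificity values; the helper names are hypothetical.

```python
def average_accuracy(sn: float, sp: float) -> float:
    """Mean of sensitivity and specificity (both in percent)."""
    return (sn + sp) / 2


def sn_sp_gap(sn: float, sp: float) -> float:
    """Absolute difference between sensitivity and specificity (percent)."""
    return abs(sp - sn)


if __name__ == "__main__":
    # (sensitivity, specificity) pairs reported for IPMD in the table.
    ipmd_rows = {
        "coding segments": (80.4, 91.3),
        "intergenic segments": (72.6, 94.5),
    }
    for label, (sn, sp) in ipmd_rows.items():
        avg = round(average_accuracy(sn, sp), 2)
        gap = round(sn_sp_gap(sn, sp), 1)
        print(f"{label}: average accuracy {avg}, difference {gap}")
```

A large gap, as in the intergenic IPMD row, indicates that a similar average accuracy can hide a classifier that favors one class over the other.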