Hong-Yan Lai, Zhao-Yue Zhang, Zhen-Dong Su, Wei Su, Hui Ding, Wei Chen, Hao Lin.
Abstract
A promoter is a fundamental DNA element located around the transcription start site (TSS) that regulates gene transcription. Promoter recognition is important for determining transcription units, studying gene structure, analyzing gene regulation mechanisms, and annotating gene function. Many models have been proposed to predict promoters, but their performance still needs improvement. In this work, we combined pseudo k-tuple nucleotide composition (PseKNC) with a position-correlation scoring function (PCSF) to formulate promoter sequences of Homo sapiens (H. sapiens), Drosophila melanogaster (D. melanogaster), Caenorhabditis elegans (C. elegans), Bacillus subtilis (B. subtilis), and Escherichia coli (E. coli). The minimum redundancy maximum relevance (mRMR) algorithm and an incremental feature selection strategy were then adopted to find the optimal feature subsets. A support vector machine (SVM) was used to distinguish promoters from non-promoters. In the 10-fold cross-validation test, accuracies of 93.3%, 93.9%, 95.7%, 95.2%, and 93.1% were obtained for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli, with areas under the receiver operating characteristic curves (AUCs) of 0.974, 0.975, 0.981, 0.988, and 0.976, respectively. Comparative results demonstrated that our method outperforms existing methods for identifying promoters. An online web server was established and can be freely accessed at http://lin-group.cn/server/iProEP/.
Keywords: feature selection; position-correlation scoring function; promoter; pseudo k-tuple nucleotide composition; web server
Year: 2019 PMID: 31299595 PMCID: PMC6616480 DOI: 10.1016/j.omtn.2019.05.028
Source DB: PubMed Journal: Mol Ther Nucleic Acids
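The feature encoding described in the abstract builds on k-tuple nucleotide composition. The following is a minimal sketch of that base component only, assuming a plain DNA alphabet; the pseudo (correlation) components of PseKNC and the PCSF features are not reproduced here, and the function name `ktuple_composition` is hypothetical.

```python
from itertools import product


def ktuple_composition(seq, k):
    """Return the normalized frequency of every k-tuple (k-mer) in seq.

    The output has 4**k entries, ordered lexicographically over "ACGT".
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy sequence, not taken from the paper's datasets.
features = ktuple_composition("ACGTACGTAC", 2)
```

For k = 2 the vector has 16 entries that sum to 1; a full PseKNC vector would append further correlation terms to it.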
Figure 1. A Flowchart to Outline the Promoter Prediction Program Construction
The Optimal Values of Three PseKNC Parameters for Five Species
| Kingdom | Species | k | λ | w | Acc (%) |
|---|---|---|---|---|---|
| Eukaryotes | H. sapiens | 4 | 24 | 0.1 | 90.9 |
| | D. melanogaster | 5 | 9 | 0.1 | 89.5 |
| | C. elegans | 4 | 22 | 0.1 | 81.4 |
| Prokaryotes | B. subtilis | 4 | 12 | 0.2 | 83.8 |
| | E. coli | 4 | 12 | 0.1 | 80.7 |
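As a worked check on the parameter table, the length of a PseKNC vector can be computed from its parameters. This sketch assumes that the first two numeric columns are the tuple size k and the number of correlation tiers λ, and that the type-1 PseKNC definition applies, which gives 4^k composition terms plus λ correlation terms; the weight w rescales the terms but does not change the length.

```python
def pseknc_dimension(k, lam):
    """Length of a type-1 PseKNC vector: 4**k k-tuple terms plus lam correlation terms."""
    return 4 ** k + lam


# (k, lambda) pairs read row by row from the table above (assumed column order).
params = [(4, 24), (5, 9), (4, 22), (4, 12), (4, 12)]
dims = [pseknc_dimension(k, lam) for k, lam in params]
print(dims)  # [280, 1033, 278, 268, 268]
```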
The Feature Numbers and Accuracies for Five Species before and after mRMR Feature Selection
| Kingdom | Species | Original Features: Feature Number | Original Features: Acc (%) | Optimal Features: Feature Number | Optimal Features: Acc (%) |
|---|---|---|---|---|---|
| Eukaryotes | H. sapiens | 423 | 93.4 | 410 | 93.5 |
| | D. melanogaster | 1,097 | 93.3 | 893 | 93.8 |
| | C. elegans | 405 | 94.4 | 65 | 95.6 |
| Prokaryotes | B. subtilis | 345 | 94.0 | 55 | 95.5 |
| | E. coli | 345 | 92.1 | 44 | 93.2 |
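The reduction from the original to the optimal feature sets comes from ranking features and then searching over prefixes of that ranking. The following is a minimal sketch of incremental feature selection under stated assumptions: the ranked list is taken as given (the mRMR ranking itself is not implemented), and `toy_evaluate` is a hypothetical stand-in for the scoring step, which in the paper is presumably cross-validated SVM accuracy.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score the top-1, top-2, ... prefixes of a ranked list; return the best prefix and its score."""
    best_subset, best_score = [], float("-inf")
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        score = evaluate(subset)
        if score > best_score:  # keep the smallest prefix that reaches the best score
            best_subset, best_score = subset, score
    return best_subset, best_score


def toy_evaluate(subset):
    """Hypothetical scorer: reward useful features, lightly penalize the rest."""
    useful = {"f3", "f1", "f7"}
    hits = sum(1 for f in subset if f in useful)
    return hits - 0.1 * (len(subset) - hits)


ranked = ["f3", "f1", "f9", "f7", "f2"]  # assumed output of a ranking step such as mRMR
best, score = incremental_feature_selection(ranked, toy_evaluate)
```

With this toy scorer the search stops improving after the fourth feature, so `best` is the first four entries of `ranked`.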
The Results for Five Species by Using 10-Fold Cross-Validation
| Kingdom | Species | Acc (%) | | | AUC |
|---|---|---|---|---|---|
| Eukaryotes | H. sapiens | 93.3 | 92.3 | 92.7 | 0.974 |
| | D. melanogaster | 93.9 | 92.6 | 92.6 | 0.975 |
| | C. elegans | 95.7 | 95.0 | 94.4 | 0.981 |
| Prokaryotes | B. subtilis | 95.2 | 94.8 | 94.3 | 0.988 |
| | E. coli | 93.1 | 92.2 | 91.2 | 0.976 |
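The results above come from 10-fold cross-validation. The following is a minimal sketch of how such a split can be generated with the standard library only. Stratifying by class label is an assumption here, since the record does not say how the folds were drawn, and the function names are hypothetical.

```python
import random


def stratified_kfold(labels, n_splits=10, seed=0):
    """Assign sample indices to n_splits folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


def train_test_splits(labels, n_splits=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_kfold(labels, n_splits, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Toy labels: 1 = promoter, 0 = non-promoter (not the paper's data).
labels = [1] * 50 + [0] * 50
splits = list(train_test_splits(labels))
```

Each sample is tested exactly once, and the reported metric would be aggregated over the ten test folds.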
Figure 2. Evaluating iProEP by Using ROC Curves
ROC curves for promoter prediction in (A) H. sapiens, (B) D. melanogaster, (C) C. elegans, (D) B. subtilis, and (E) E. coli.
Figure 3. Comparison between Our Proposed Method and IPMD Classifiers in 10-Fold Cross-Validation
Figure 4. The Prediction Results of Four Methods on the Same E. coli σ70 Promoter Data
The Results for Cross-Species Examination
| Kingdom | Model Training | Model Test | |
|---|---|---|---|
| Eukaryotes | | | 77.19 |
| | | | 66.63 |
| | | | 68.41 |
| | | | 65.68 |
| | | | 66.57 |
| | | | 69.58 |
| Prokaryotes | | | 75.95 |
| | | | 80.92 |
Figure 5. The Homepage of the iProEP Web Server
Available at http://lin-group.cn/server/iProEP/.
Detailed Information on the Training Datasets for Five Species
| Kingdom | Species | Promoter | Non-promoter (CDS) | Non-promoter (Non-CDS) | Location |
|---|---|---|---|---|---|
| Eukaryotes (300 bp) | H. sapiens | 1,787 | 1,800 | 1,800 | [−249, +50] |
| | D. melanogaster | 1,886 | 1,799 | 2,859 | [−249, +50] |
| | C. elegans | 598 | 600 | 600 | [−249, +50] |
| Prokaryotes (81 bp) | B. subtilis | 270 | 300 | 300 | [−60, +20] |
| | E. coli | 741 | 700 | 700 | [−60, +20] |
CDS, coding sequences.
Non-CDS: introns for eukaryotes and convergent intergenic regions for prokaryotes.
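The location ranges in the dataset table can be checked against the stated sequence lengths. This sketch assumes that offsets are relative to the TSS and that position 0 is counted, which is the reading consistent with 300 bp for [−249, +50] and 81 bp for [−60, +20]. The helper names are hypothetical and strand handling is omitted.

```python
def window_length(start_offset, end_offset):
    """Number of bases in an inclusive window [start_offset, end_offset] that counts position 0."""
    return end_offset - start_offset + 1


def extract_window(genome, tss_index, upstream, downstream):
    """Return the bases from tss_index - upstream to tss_index + downstream, inclusive.

    Returns None when the window would run past either end of the sequence.
    """
    start = tss_index - upstream
    end = tss_index + downstream + 1
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


eukaryote_len = window_length(-249, 50)   # 300
prokaryote_len = window_length(-60, 20)   # 81

# Toy sequence with an assumed TSS at index 250 (0-based).
genome = "ACGT" * 100
window = extract_window(genome, 250, upstream=249, downstream=50)
```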