Mostafa M Abbas, Mostafa M Mohie-Eldin, Yasser El-Manzalawy.
Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are key regulatory elements that recruit the transcriptional machinery by binding a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for the computational identification of prokaryotic promoter regions. However, the quality of the resulting predictors depends on several factors, including: i) the training data; ii) the data representation; iii) the classification algorithm; and iv) the evaluation procedure. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that, under some combinations of the first three factors, a prediction model may perform very well in cross-validation experiments while its performance on independent test data is drastically poor. This emphasizes the importance of evaluating promoter region predictors on independent test data, which corrects for the over-optimistic performance that cross-validation may estimate. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data were obtained. Poor prediction models, on the other hand, seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best-performing sequence-based classifiers outperform the best-performing structure-based classifiers in both cross-validation and independent-test evaluation experiments.
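The gap the abstract describes between cross-validation and independent-test performance can be illustrated in a few lines. This is a minimal sketch with synthetic data and a generic Naive Bayes classifier, not the paper's actual features or models; the distribution shift in the test set stands in for non-promoter sequences drawn by a different sampling strategy.

```python
# Hedged sketch: cross-validation AUC vs. AUC on an independent test set.
# All data here is synthetic; the feature matrices and classifier are
# placeholders, not the paper's promoter representations.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic "promoter" vs. "non-promoter" training data.
X_train = rng.normal(size=(400, 20))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Independent test set drawn from a shifted distribution, mimicking
# negative sequences obtained by a different construction procedure.
X_test = rng.normal(loc=0.3, size=(200, 20))
y_test = (X_test[:, 0] + 0.5 * rng.normal(size=200) > 0.3).astype(int)

clf = GaussianNB()
cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc").mean()
clf.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"cross-validation AUC: {cv_auc:.2f}, independent-test AUC: {test_auc:.2f}")
```

Reporting both numbers side by side is exactly the safeguard the abstract argues for: a model that looks strong only under cross-validation has likely overfit to how its negatives were sampled.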
Finally, we propose a meta-predictor that combines the two top-performing sequence-based and structure-based classifiers, and we compare its performance with some state-of-the-art E. coli σ70 promoter prediction methods.
Year: 2015 PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of cross-validation data sets.
| Data set | Source of negative (non-promoter) sequences |
|---|---|
| CV_Random | Randomly extracted from a single long sequence generated with frequencies 0.28, 0.22, 0.22, and 0.28 for T, G, C, and A (respectively), following Silva et al. |
| CV_Coding | Randomly extracted from coding regions of the E. coli K-12 genome downloaded from NCBI GenBank |
| CV_Convergent | Randomly extracted from convergent intergenic regions downloaded from the EcoGene 3.0 database |
| CV_Divergent | Randomly extracted from divergent intergenic regions downloaded from the EcoGene 3.0 database |
| CV_CoPos | Randomly extracted from codirectional positive spacer regions downloaded from the EcoGene 3.0 database |
| CV_CoNeg | Randomly extracted from codirectional negative spacer regions downloaded from the EcoGene 3.0 database |
| CV_Mixed | Six equal subsets of negative sequences drawn from CV_Random, CV_Coding, CV_Convergent, CV_Divergent, CV_CoPos, and CV_CoNeg |
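The CV_Random construction above can be sketched directly: sample one long background sequence with the stated base frequencies, then cut random fixed-length windows from it. The window length of 81 is an assumption for illustration; the paper's actual sequence length is not stated in this excerpt.

```python
# Sketch of the CV_Random negative-set construction: a long background
# sequence with base frequencies 0.28 (T), 0.22 (G), 0.22 (C), 0.28 (A),
# from which fixed-length windows are randomly extracted.
# The window length (81) is an illustrative assumption.
import random

random.seed(42)
BASES, WEIGHTS = "TGCA", [0.28, 0.22, 0.22, 0.28]

def random_background(length):
    """Generate a background sequence with the stated base frequencies."""
    return "".join(random.choices(BASES, weights=WEIGHTS, k=length))

def sample_windows(sequence, n, window=81):
    """Randomly extract n non-promoter windows from the long sequence."""
    starts = [random.randrange(len(sequence) - window + 1) for _ in range(n)]
    return [sequence[s:s + window] for s in starts]

background = random_background(100_000)
negatives = sample_windows(background, n=500)
print(len(negatives), len(negatives[0]))
```

The other negative sets (coding, convergent, divergent, spacer regions) would replace `random_background` with windows cut from the corresponding genomic regions.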
Grading scale for classifiers based on their AUC scores.
| AUC score | Grade |
|---|---|
| 0.90–1.00 | Excellent |
| 0.80–0.89 | Good |
| 0.70–0.79 | Fair |
| 0.50–0.69 | Poor |
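The AUC itself has a simple rank-based (Mann-Whitney) formulation, which pairs naturally with the grading scale above. This is a small illustrative helper, not code from the paper.

```python
# Rank-based AUC (Mann-Whitney formulation) plus the grading scale
# from the table above. Illustrative helpers, not the paper's code.
def auc(pos_scores, neg_scores):
    """AUC = P(score(pos) > score(neg)), counting ties as 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def grade(auc_score):
    """Map an AUC score to the grade used in the table above."""
    if auc_score >= 0.90:
        return "Excellent"
    if auc_score >= 0.80:
        return "Good"
    if auc_score >= 0.70:
        return "Fair"
    return "Poor"

print(grade(auc([0.9, 0.8, 0.7], [0.4, 0.6, 0.2])))  # prints "Excellent"
```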
AUC scores for selected classifiers (trained using CV_Mixed data) tested on different versions of the independent test set (e.g., TS_Random and TS_Coding).
| Test set |  |  |  |  |  |  |
|---|---|---|---|---|---|---|
| TS_Random | 0.83(1.5) | 0.77(5.0) | 0.80(3.5) | 0.76(6.0) | 0.80(3.5) | 0.83(1.5) |
| TS_Coding | 0.89(2.5) | 0.87(5.0) | 0.89(2.5) | 0.88(4.0) | 0.86(6.0) | 0.91(1.0) |
| TS_Convergent | 0.80(2.5) | 0.80(2.5) | 0.78(4.0) | 0.64(6.0) | 0.66(5.0) | 0.82(1.0) |
| TS_Divergent | 0.80(2.0) | 0.79(3.0) | 0.78(4.0) | 0.61(6.0) | 0.65(5.0) | 0.82(1.0) |
| TS_CoPos | 0.79(2.0) | 0.78(3.0) | 0.76(4.0) | 0.58(6.0) | 0.66(5.0) | 0.81(1.0) |
| TS_CoNeg | 0.82(2.0) | 0.80(3.5) | 0.80(3.5) | 0.68(5.5) | 0.68(5.5) | 0.84(1.0) |
| TS_Mixed | 0.83(2.0) | 0.82(3.0) | 0.81(4.0) | 0.70(6.0) | 0.71(5.0) | 0.85(1.0) |
| Average | 0.82(2.0) | 0.80(3.4) | 0.80(3.5) | 0.69(5.5) | 0.72(4.9) | 0.84(1.1) |
| STD | 0.03 | 0.03 | 0.04 | 0.10 | 0.08 | 0.03 |
See Methods Section for more information about these test sets. For each data set, the rank of each classifier is shown in parentheses.
Fig 1. Pair-wise comparison of selected classifiers with the Nemenyi test applied to results on independent test data sets.
Groups of classifiers that are not significantly different (at p-value = 0.05) are connected.
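The Nemenyi comparison behind Fig 1 reduces to a critical-difference (CD) computation over average ranks. As a hedged sketch, the snippet below plugs in the average ranks from the independent-test table above (6 classifiers, 7 test sets) and the standard studentized-range constant q₀.₀₅ = 2.850 for k = 6; the exact grouping in Fig 1 is the paper's, not reproduced here.

```python
# Nemenyi-style comparison: classifiers whose average ranks differ by less
# than the critical difference (CD) are treated as not significantly
# different. Average ranks are taken from the "Average" row of the
# independent-test table; q_0.05 = 2.850 is the studentized-range constant
# for k = 6 classifiers (Demsar, 2006).
import math

avg_ranks = [2.0, 3.4, 3.5, 5.5, 4.9, 1.1]  # from the table's Average row
k, n = len(avg_ranks), 7                     # 6 classifiers, 7 test sets
q_alpha = 2.850                              # alpha = 0.05, k = 6

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))
print(f"critical difference: {cd:.2f}")

# Pairs of classifiers whose rank gap exceeds CD differ significantly;
# everything else would be connected in the Fig 1 diagram.
significant = [(i, j) for i in range(k) for j in range(i + 1, k)
               if abs(avg_ranks[i] - avg_ranks[j]) > cd]
```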
Summary of the performance of NB, RF100, SVMLnr, and SVMBRF classifiers on cross-validation data using twelve different structure-based feature representations.
| Features | NB | RF100 | SVMLnr | SVMBRF | Average | STD |
|---|---|---|---|---|---|---|
| M1 | 0.70(9.0) | 0.72(9.0) | 0.67(11.0) | 0.68(10.0) | 0.69(9.8) | 0.02 |
| M2 | 0.65(12.0) | 0.70(10.0) | 0.61(12.0) | 0.61(12.0) | 0.64(11.5) | 0.04 |
| M3 | 0.68(10.5) | 0.68(11.5) | 0.69(9.5) | 0.68(10.0) | 0.68(10.4) | 0.00 |
| M4 | 0.78(1.5) | 0.80(2.5) | 0.78(1.5) | 0.78(2.0) | 0.79(1.9) | 0.01 |
| M5 | 0.76(5.5) | 0.78(7.0) | 0.76(5.0) | 0.76(6.0) | 0.77(5.9) | 0.01 |
| M6 | 0.78(1.5) | 0.79(5.5) | 0.77(3.0) | 0.78(2.0) | 0.78(3.0) | 0.01 |
| M7 | 0.74(7.5) | 0.80(2.5) | 0.76(5.0) | 0.77(4.5) | 0.77(4.9) | 0.03 |
| M8 | 0.77(3.5) | 0.80(2.5) | 0.78(1.5) | 0.78(2.0) | 0.78(2.4) | 0.01 |
| M9 | 0.74(7.5) | 0.79(5.5) | 0.71(7.5) | 0.72(7.5) | 0.74(7.0) | 0.04 |
| M10 | 0.76(5.5) | 0.77(8.0) | 0.71(7.5) | 0.72(7.5) | 0.74(7.1) | 0.03 |
| M11 | 0.68(10.5) | 0.68(11.5) | 0.69(9.5) | 0.68(10.0) | 0.68(10.4) | 0.00 |
| M12 | 0.77(3.5) | 0.80(2.5) | 0.76(5.0) | 0.77(4.5) | 0.78(3.9) | 0.02 |
A-philicity (M1) [41]; Ohler B-DNA twist (M2) [42]; Olson B-DNA twist (M3) [43]; DNA bending stiffness (M4) [44]; DNA denaturation temperature (M5) [45]; Z-DNA free energy (M6) [46]; duplex disruption free energy (M7) [47]; duplex stability free energy (M8) [48]; protein-induced deformability (M9) [43]; propeller twist (M10) [49]; protein-induced DNA twist (M11) [43]; and base stacking energy (M12) [50]. For each data set, the rank of each classifier is shown in parentheses.
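Each structure-based representation (M1–M12) converts a sequence into a numeric profile by replacing every dinucleotide with a tabulated physical property value. The sketch below shows the mechanism for a bending-stiffness-style scale; the lookup values are illustrative placeholders, not the published scale from reference [44].

```python
# Sketch of a structure-based feature conversion (e.g., M4, DNA bending
# stiffness): slide over the sequence and replace each dinucleotide with a
# tabulated property value. The values below are illustrative placeholders,
# NOT the published bending-stiffness scale.
STIFFNESS = {  # hypothetical dinucleotide -> property lookup
    "AA": 35.0, "AC": 60.0, "AG": 60.0, "AT": 20.0,
    "CA": 60.0, "CC": 130.0, "CG": 85.0, "CT": 60.0,
    "GA": 60.0, "GC": 85.0, "GG": 130.0, "GT": 60.0,
    "TA": 20.0, "TC": 60.0, "TG": 60.0, "TT": 35.0,
}

def structural_profile(seq, scale=STIFFNESS):
    """Map a DNA sequence to a length-(n-1) profile of dinucleotide values."""
    return [scale[seq[i:i + 2]] for i in range(len(seq) - 1)]

profile = structural_profile("TATAAT")  # the canonical -10 hexamer
print(profile)
```

The twelve representations differ only in which property scale fills the lookup table; the resulting profiles (as in Fig 2) are then fed to the classifiers as feature vectors.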
Fig 2. B-DNA twisting profiles (top) and DNA bending stiffness profiles (bottom) generated from cross-validation data.
AUC scores for the Naive Bayes classifier with DNID features (NB_DNID) trained using seven versions of the CV data and each time tested on the seven versions of the independent test data.
| Training data | TS_Random | TS_Coding | TS_Convergent | TS_Divergent | TS_CoPos | TS_CoNeg | TS_Mixed |
|---|---|---|---|---|---|---|---|
| CV_Random | 0.87(2.0) | 0.90(1.0) | 0.80(5.0) | 0.79(6.5) | 0.79(6.5) | 0.83(4.0) | 0.84(3.0) |
| CV_Coding | 0.80(2.5) | 0.93(1.0) | 0.74(6.5) | 0.75(5.0) | 0.74(6.5) | 0.78(4.0) | 0.80(2.5) |
| CV_Convergent | 0.82(4.5) | 0.84(1.0) | 0.82(4.5) | 0.80(6.5) | 0.80(6.5) | 0.83(2.5) | 0.83(2.5) |
| CV_Divergent | 0.80(4.5) | 0.86(1.0) | 0.80(4.5) | 0.79(6.0) | 0.77(7.0) | 0.82(2.0) | 0.81(3.0) |
| CV_CoPos | 0.80(5.0) | 0.84(1.0) | 0.81(2.5) | 0.80(5.0) | 0.79(7.0) | 0.80(5.0) | 0.81(2.5) |
| CV_CoNeg | 0.80(4.5) | 0.86(1.0) | 0.80(4.5) | 0.79(6.0) | 0.76(7.0) | 0.83(2.0) | 0.82(3.0) |
| CV_Mixed | 0.83(2.5) | 0.89(1.0) | 0.80(5.5) | 0.80(5.5) | 0.79(7.0) | 0.82(4.0) | 0.83(2.5) |
| Average | 0.82(3.6) | 0.87(1.0) | 0.80(4.7) | 0.79(5.8) | 0.78(6.8) | 0.82(3.4) | 0.82(2.7) |
| STD | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.01 |
Each row corresponds to a specified training set while each column corresponds to a specified test set.
AUC scores for the Naive Bayes classifier with 4-mer features (NB_4-mer) trained using seven versions of the CV data and each time tested on the seven versions of the independent test data.
| Training data | TS_Random | TS_Coding | TS_Convergent | TS_Divergent | TS_CoPos | TS_CoNeg | TS_Mixed |
|---|---|---|---|---|---|---|---|
| CV_Random | 0.87(1.0) | 0.82(2.0) | 0.63(5.0) | 0.60(6.0) | 0.58(7.0) | 0.65(4.0) | 0.69(3.0) |
| CV_Coding | 0.73(2.0) | 0.91(1.0) | 0.59(6.0) | 0.60(5.0) | 0.56(7.0) | 0.63(4.0) | 0.68(3.0) |
| CV_Convergent | 0.62(3.5) | 0.56(6.5) | 0.74(1.0) | 0.56(6.5) | 0.57(5.0) | 0.64(2.0) | 0.62(3.5) |
| CV_Divergent | 0.64(4.5) | 0.83(1.0) | 0.61(6.0) | 0.64(4.5) | 0.55(7.0) | 0.68(2.0) | 0.66(3.0) |
| CV_CoPos | 0.57(7.0) | 0.65(2.0) | 0.63(3.5) | 0.59(6.0) | 0.66(1.0) | 0.61(5.0) | 0.63(3.5) |
| CV_CoNeg | 0.58(6.0) | 0.74(1.0) | 0.62(4.0) | 0.60(5.0) | 0.55(7.0) | 0.71(2.0) | 0.63(3.0) |
| CV_Mixed | 0.76(2.0) | 0.88(1.0) | 0.64(5.0) | 0.61(6.0) | 0.58(7.0) | 0.68(4.0) | 0.70(3.0) |
| Average | 0.68(3.7) | 0.77(2.1) | 0.64(4.4) | 0.60(5.6) | 0.58(5.9) | 0.66(3.3) | 0.66(3.1) |
| STD | 0.11 | 0.13 | 0.05 | 0.02 | 0.04 | 0.03 | 0.03 |
Each row corresponds to a specified training set while each column corresponds to a specified test set.
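The 4-mer representation behind NB_4-mer is a standard k-mer frequency encoding: each sequence becomes a vector over all 4⁴ = 256 possible 4-mers. A minimal sketch (the fixed feature ordering and normalization are common conventions, assumed here rather than taken from the paper):

```python
# Sketch of the 4-mer sequence representation behind NB_4-mer: each sequence
# becomes a vector of frequencies over all 4^4 = 256 possible 4-mers.
from itertools import product
from collections import Counter

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 features

def kmer_features(seq, k=4):
    """Return normalized k-mer frequencies in a fixed feature order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts[kmer] / total for kmer in KMERS]

vec = kmer_features("TTGACATATAAT")  # -35 and -10 consensus hexamers joined
print(len(vec), sum(vec))
```

Because such vectors capture only composition, not structure, they contrast directly with the dinucleotide-profile representations of the previous section, which is the comparison the tables above quantify.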
Fig 3. Performance comparison of BacPP, IPMD, and two variable-window Z-curve models, VWZ1 and VWZ2, trained using Dataset-1 and Dataset-2 (respectively), with four selected classifiers (NB_DNID, RF100_M7, HMM, and the meta-predictor) on the TS_Mixed independent test set.