Rajeev Gangal, Pankaj Sharma.
Abstract
Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. These limitations stem from the difficulty of selecting features that distinguish promoters from non-promoters and from the limited generalization ability of the machine-learning algorithms. In this paper we attempt to define a novel approach that uses unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. Non-linear time series descriptors, along with non-linear machine-learning algorithms such as support vector machines (SVMs), are used to discriminate between promoter and non-promoter regions. The basic idea is to use descriptors that do not depend on the primary DNA sequence and that provide a clear distinction between promoter and non-promoter regions. The classification model, built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87%, and an independent test set had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters.
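The abstract names n-mer frequencies, GC% and Tsallis entropy among the descriptors but does not give their exact definitions. The following is a minimal sketch of how such descriptors could be computed for a single sequence. The function names, the choice of n = 2, the entropic index q = 2, and the decision to take the Tsallis entropy over the n-mer frequency distribution are all illustrative assumptions, not the authors' implementation. The Lyapunov-based descriptor is not covered here.

```python
from collections import Counter
from itertools import product
import math


def nmer_frequencies(seq, n=2):
    """Relative frequency of every possible n-mer over the ACGT alphabet."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Only n-mers made purely of A/C/G/T are counted (ambiguous bases are skipped).
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i**q)) / (q - 1); Shannon limit at q = 1."""
    probs = [p for p in probs if p > 0]
    if q == 1.0:
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def feature_vector(seq, n=2, q=2.0):
    """Hypothetical descriptor vector: n-mer frequencies, then the entropy, then GC%."""
    freqs = nmer_frequencies(seq, n)
    return list(freqs.values()) + [tsallis_entropy(freqs.values(), q), gc_percent(seq)]
```

A vector built this way could be passed to any standard SVM implementation; how the features were scaled and which kernel was used are not stated in the abstract.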
Year: 2005 PMID: 15741185 PMCID: PMC552959 DOI: 10.1093/nar/gki271
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Principal components analysis (PCA) plot for promoters and non-promoters. The descriptors used to discriminate between promoters and non-promoters are transformed to three orthogonal axes. The PCA plot shows a clear separation between promoter and non-promoter sequences.
Results of models built for promoter prediction
| Model (input data: promoter and non-promoter sequences) | Correctly classified instances on training data (%) | Correctly classified instances on cross-validation data (%) | Correctly classified instances on validation data (%) | Algorithm used | Correlation coefficient | Kappa statistic |
|---|---|---|---|---|---|---|
| Model 1 | 100.00 | 87.5 | 85.8 | SVM | 0.78 | 0.74 |
| Model 2 | 100.00 | 87.25 | 86.6 | SVM | 0.68 | 0.71 |
a. Twenty per cent of the training set was held out and used for model validation.
b. Model 1 includes calculation of n-mer frequencies, GC% and non-linear time series descriptors.
c. Model 2 includes only calculation of n-mer frequencies and GC%.
List of experimentally verified promoters on human chromosome 22
| Accession number | Gene name | Predicted by Prometheus |
|---|---|---|
| L43122 | COMT | + |
| X52828 | BCR | + |
| X84664 | MMP11 | + |
| AJ007494 | GGT1 | + |
| X72990 | EWSR1 | + |
| M63420 | LIF | + |
| AF129855 | OSM | + |
| AF047576 | TCN2 | + |
| AB016655 | LIMK2 | + |
| S79779 | TIMP3 | + |
| S58267 | HMOX1 | + |
| EP11091 | MB | + |
| X63578 | PVALB | + |
| X53093 | IL2RB | + |
| M87841 | H1F0 | + |
| AF115252 | PLA2G6 | + |
| EP11139 | PDGFB | + |
| AF106656 | ADSL | + |
| D86746 | SREBF2 | + |
| M77378 | ACR | + |
| Total 20 genes, correctly predicted instances 20 (100%) | | |
a. All sequences are taken from GenBank/EMBL/EPD; see the accession numbers for details.
b. EPD accession number.
Prediction done using the above models
| Predicted sequences | Total no. of sequences | True positive | False positive | False negative | True negative |
|---|---|---|---|---|---|
| Model 1 | | | | | |
| Promoter | 800 | 707 | Nil | 93 | Nil |
| Intron | 1000 | Nil | 97 | Nil | 903 |
| Human chromosome 22 experimentally verified promoters | 20 | 20 | Nil | Nil | Nil |
| Model 2 | | | | | |
| Promoter | 800 | 682 | Nil | 118 | Nil |
| Intron | 1000 | Nil | 93 | Nil | 907 |
| Human chromosome 22 experimentally verified promoters | 20 | 9 | Nil | 11 | Nil |
TP, true positives (number of correctly recognized positives); TN, true negatives (number of correctly recognized negatives); FN, false negatives (number of positives recognized as negatives); FP, false positives (number of negatives recognized as positives).
Program accuracy
| Program name | NNPP (threshold 0.8) | Soft Berry (TSSW) | Promoter Scan version 1.7 | Dragon Promoter Finder version 1.4 | Promoter 2.0 Prediction Server | Prometheus |
|---|---|---|---|---|---|---|
| Sensitivity (%) | 32 | 60 | 40 | 38 | 50 | 86 |
| Specificity (%) | 34 | 65 | 56 | 64 | 54 | 88 |
| Correlation coefficient | 0.34 | 0.27 | 0.11 | 0.18 | 0.20 | 0.74 |
a. Sensitivity = 100 × TP/(TP + FN).
b. Specificity = 100 × TN/(TN + FP).
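As a worked illustration of the sensitivity and specificity formulas, the sketch below applies them to the promoter and intron counts reported for Model 1 and Model 2 in the prediction table. Pooling the promoter and intron rows into a single confusion matrix is an assumption made for this example, so the results need not match the figures in the program accuracy table, which may come from a different test set. The source does not give the formula used for the correlation coefficient; the Matthews correlation coefficient is assumed here.

```python
import math


def sensitivity(tp, fn):
    """Sensitivity (%) = 100 * TP / (TP + FN)."""
    return 100.0 * tp / (tp + fn)


def specificity(tn, fp):
    """Specificity (%) = 100 * TN / (TN + FP)."""
    return 100.0 * tn / (tn + fp)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient (assumed; not confirmed by the source)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts taken from the promoter (800) and intron (1000) rows of the prediction table.
counts = {
    "Model 1": {"tp": 707, "fn": 93, "fp": 97, "tn": 903},
    "Model 2": {"tp": 682, "fn": 118, "fp": 93, "tn": 907},
}

for name, c in counts.items():
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    cc = matthews_cc(c["tp"], c["tn"], c["fp"], c["fn"])
    print(f"{name}: sensitivity={sens:.1f}%, specificity={spec:.1f}%, CC={cc:.2f}")
```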