Wen-Lin Huang, Chun-Wei Tung, Chyn Liaw, Hui-Ling Huang, Shinn-Ying Ho.
Abstract
The rapid and reliable identification of promoter regions is important as the number of sequenced genomes grows rapidly. Various methods have been developed, but few investigate the effectiveness of sequence-based features in promoter prediction. This study proposes a knowledge acquisition method, named PromHD, based on if-then rules for promoter prediction in human and Drosophila species. PromHD uses an effective feature-mining algorithm and a reference feature set of 167 DNA sequence descriptors (DNASDs), comprising three descriptors of physicochemical properties (absorption maxima, molecular weight, and molar absorption coefficient), 128 top-ranked descriptors of 4-mer motifs, and 36 global sequence descriptors. PromHD identifies two feature subsets with 99 and 74 DNASDs and yields test accuracies of 96.4% and 97.5% in human and Drosophila species, respectively. Based on the 99- and 74-dimensional feature vectors, PromHD generates several if-then rules for promoter prediction using a decision tree mechanism. The top-ranked informative rules with high certainty grades reveal that a global sequence descriptor (the length of nucleotide A at the first position of the sequence) is effective in distinguishing promoters from non-promoters in human, and that two physicochemical properties (absorption maxima and molecular weight) are effective in Drosophila.
Year: 2014 PMID: 24955394 PMCID: PMC3927563 DOI: 10.1155/2014/327306
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1. The promoter of a DNA sequence, containing a transcription factor binding site and a TATA box, lies immediately upstream of a transcription start site.
Conventional features for promoter prediction.
| Features | Label |
|---|---|
| Context features | |
| | A1 |
| Transition | A2 |
| Distribution | A3 |
| Entropy density profile | A4 |
| Codon-position-independent frequencies of mononucleotides | A5 |
| Digitized DNA sequence | A6 |
| Position-specific information | A7 |
| Relative entropy | A8 |
| Flanking genomic sequence | A9 |
| Signal features | |
| TATA | B1 |
| 5′UTR (untranslated region) | B2 |
| Exons region | B3 |
| Intron region | B4 |
| 3′UTR | B5 |
| Downstream promoter element | B6 |
| TFIIB recognition element | B7 |
| Motif ten element | B8 |
| CCAAT | B9 |
| GC | B10 |
| Transcription factor binding site | C |
| CpG islands | D |
| Structural features | |
| DNA curvature | E1 |
| DNA flexibility | E2 |
| Stabilizing energy of Z-DNA | E3 |
| DNA denaturation values | E4 |
| Base stacking energy | E5 |
| Nucleosome positioning preference | E6 |
| Dinucleotide free energy | E7 |
| Tri-nucleotide CG content | E8 |
| DNA bendability | E9 |
| DNA-bending stiffness | E10 |
| A-philicity | E11 |
| Protein induced deformability | E12 |
| Propeller twist | E13 |
| B-DNA twist | E14 |
| Protein-DNA twist | E15 |
| Duplex stability (disrupt energy) | E16 |
| Duplex stability (free energy) | E17 |
| Radical cleavage intensity | E18 |
| Z-DNA | E19 |
| Epigenetic features | F |
Some representative prediction methods and classifiers with their used features. The informative features are explained in Table 1.
| Methods | Classifier | Features |
|---|---|---|
| ARTS | SVM | B2, E5, E13 |
| CorePromoter | Stepwise strategy | B1, B6, C |
| CoreBoost | LogitBoost algorithm with decision trees | A1, B1, B9, B10, C, D, E2 |
| CoreBoost_HM | Hidden Markov model | A1, B1, B9, B10, C, D, E2, F |
| CpGcluster | Distance-based algorithm | D |
| CpGProD | A generalized linear model | D |
| DragonGSF | Artificial neural network | B9 |
| DragonPF | Artificial neural network | D |
| EP3 | Analysis approach | E3–18 |
| Eponine | Relevance vector machine | B1 |
| FSPP | SVM | E4–6, E10–17 |
| FirstEF | Decision tree | B4, D |
| Fuzzy-AIRS | Artificial immune recognition system | A1 |
| GDZE | Fisher's linear discriminant algorithm | A1–5, E7 |
| GSD-FLD | Fisher's linear discriminant algorithm | A1–4 |
| HMM-SA | Hidden Markov model, simulated annealing | F |
| McPromoter | Artificial neural network | E3–6, E8–17 |
| NNPP2.2 | Artificial neural network | B1, B4 |
| Nscan | Hidden Markov model | B2–5 |
| Prom-Machine | SVM | A1 (128 top-ranked 4-mer motifs) |
| PromPredict | A scoring function and threshold values | A10, B12, E1, E7, E9, E17 |
| Promoter 2.0 | Neural networks and genetic algorithms | B1, B4, B9, B10 |
| PromoterExplorer | AdaBoost algorithm | A1, A6, D |
| PromoterInspector | Context analysis approach | A1 |
| PromoterScan | Linear discriminant analysis | B1, C |
| ProSOM | Artificial neural network | E5, E7 |
| PSPA | Probabilistic model | A1, A7 |
| TSSW | Linear discriminant function | B1 |
| vw Z-curve | Partial least squares | A5 |
| Wu method | Linear discriminant analysis | A3–5, A7, A8 |
Figure 2. A block diagram of the PromHD method. The diagram contains the following main parts: (1) datasets, (2) DNA sequence descriptors, (3) DNASDmining algorithm, (4) estimation of appearance-frequency ratios, and (5) PromHD prediction system.
Three physicochemical properties of nucleotide.
| DNASD | Description | A | C | G | T | Rank by MED (Human) | Rank by MED (DPL) |
|---|---|---|---|---|---|---|---|
| | Absorption maxima (determined at pH 7.0) | 259 | 271 | 253 | 267 | 2 | 1 |
| | Molecular weight | 491.2 | 467.2 | 507.2 | 482.2 | 7 | 2 |
| | Molar absorption coefficient | 15200 | 9300 | 13700 | 9600 | 11 | 3 |
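The per-nucleotide values in this table can be turned into a sequence-level number in several ways. The following is a minimal sketch, assuming each physicochemical descriptor is simply the mean of the per-nucleotide values over the sequence. The source does not state the exact formula PromHD uses, and the names `PROPERTIES` and `mean_property` are hypothetical.

```python
# Per-nucleotide values copied from the table above.
PROPERTIES = {
    "absorption_maxima": {"A": 259.0, "C": 271.0, "G": 253.0, "T": 267.0},
    "molecular_weight": {"A": 491.2, "C": 467.2, "G": 507.2, "T": 482.2},
    "molar_absorption_coefficient": {
        "A": 15200.0, "C": 9300.0, "G": 13700.0, "T": 9600.0,
    },
}


def mean_property(sequence, prop):
    """Average a per-nucleotide property over a DNA sequence.

    Assumption: the descriptor is a plain mean. Characters other than
    A, C, G, T (for example N) are skipped.
    """
    table = PROPERTIES[prop]
    values = [table[base] for base in sequence.upper() if base in table]
    if not values:
        raise ValueError("sequence contains no A/C/G/T characters")
    return sum(values) / len(values)


if __name__ == "__main__":
    print(mean_property("ACGT", "absorption_maxima"))
```

Under this averaging assumption, an A-rich or G-rich sequence gets a lower absorption-maxima value than a C-rich or T-rich one, which is the kind of contrast a threshold rule could act on.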
Top 20 descriptors of 4-mer motifs in the reference set of 167 DNASDs. The descriptor of the TATA motif ranks 199th in the HPL dataset and 98th in the DPL dataset.
| Rank | HPL: 4-mer motif | HPL: Score | HPL: Included | DPL: 4-mer motif | DPL: Score | DPL: Included |
|---|---|---|---|---|---|---|
| 1 | TGAA | 1000 | + | AAAG | 1000 | + |
| 2 | TGAT | 941 | + | AAGA | 956 | + |
| 3 | CCGG | 878 | − | TTCG | 948 | + |
| 4 | TATG | 843 | + | AGAA | 922 | − |
| 5 | TGGA | 817 | − | GAAA | 866 | − |
| 6 | GATG | 770 | + | AAGG | 791 | + |
| 7 | TCAA | 739 | + | CGCC | 787 | − |
| 8 | TACA | 702 | + | AGAT | 777 | − |
| 9 | AGGC | 697 | − | AATA | 759 | − |
| 10 | ATGA | 694 | + | TCGC | 747 | − |
| 11 | TTGA | 672 | + | TGAT | 744 | + |
| 12 | CGGC | 662 | − | TGAA | 732 | − |
| 13 | CAGG | 651 | − | ATCG | 732 | + |
| 14 | ATGT | 634 | − | TCGA | 724 | + |
| 15 | AGCG | 633 | − | CGGT | 724 | − |
| 16 | CGCG | 629 | − | ATAA | 712 | + |
| 17 | AGCC | 618 | − | CGAT | 710 | − |
| 18 | TCAT | 595 | − | CGCG | 703 | − |
| 19 | GAGC | 592 | + | GAAG | 699 | + |
| 20 | AGGG | 582 | − | ATAG | 697 | − |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| | | | | 25 AAGT | 642 | |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 199 | TATA | 111 | − | 98 TATA | 365 | − |
+: included in the set of m DNASDs.
−: not included in the set of m DNASDs.
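The 4-mer descriptors ranked in this table are frequency descriptors, so a natural way to compute them is to count overlapping 4-mers and normalize. The sketch below assumes a simple relative frequency over all valid windows. How PromHD scores and ranks the motifs is not described here, and the function name `kmer_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Relative frequency of every k-mer over the alphabet ACGT.

    Assumption: overlapping windows, normalized by the number of windows
    that contain only A, C, G, T. Returns 4**k entries (256 for k=4).
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    valid = [w for w in windows if set(w) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }


if __name__ == "__main__":
    freqs = kmer_frequencies("TGAATGAA")
    print(freqs["TGAA"], freqs["TATA"])
```

A feature vector for a classifier would then keep only the entries for the selected motifs, for example those marked "+" above.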
Figure 3. Appearance-frequency ratios of R DNASDmining solutions, where k = 1, 2, …, R. The mean frequency ratio is 47.0% for the HPL dataset.
Figure 4. Training accuracies of the PromHD method and of SVM for different numbers r of selected informative features on the HPL dataset.
Comparisons of training and test accuracies (ACC, %), sensitivity (SN), specificity (SP), and MCC for the HP dataset.
| Method | No. of used features | 10-CV HPL: ACC | 10-CV HPL: SN | 10-CV HPL: SP | 10-CV HPL: MCC | Independent test HPT: ACC | Independent test HPT: SN | Independent test HPT: SP | Independent test HPT: MCC |
|---|---|---|---|---|---|---|---|---|---|
| SVM-GSD | 36 (2⁸, 2⁻²) | 97.4 | 0.972 | 0.976 | 0.949 | 93.6 | 0.930 | 0.941 | 0.872 |
| SVM-4mer | 128 (2³, 2⁻³) | 94.2 | 0.949 | 0.936 | 0.885 | 91.0 | 0.953 | 0.867 | 0.823 |
| SVM-RBS | 82 (2⁷, 2⁻⁵) | 96.0 | 0.964 | 0.956 | 0.920 | 91.9 | 0.885 | 0.962 | 0.840 |
| PromHD | 99 (2⁷, 2⁻⁵) | 98.9 | 0.979 | 0.979 | 0.979 | 96.4 | 0.967 | 0.960 | 0.927 |
Comparisons of training and test accuracies (ACC, %), sensitivity (SN), specificity (SP), and MCC for the DP dataset.
| Method | No. of used features | 10-CV DPL: ACC | 10-CV DPL: SN | 10-CV DPL: SP | 10-CV DPL: MCC | Independent test DPT: ACC | Independent test DPT: SN | Independent test DPT: SP | Independent test DPT: MCC |
|---|---|---|---|---|---|---|---|---|---|
| SVM-GSD | 36 (2², 2) | 95.1 | 0.956 | 0.946 | 0.902 | 89.2 | 0.789 | 0.996 | 0.802 |
| SVM-4mer | 128 (2³, 2⁻⁶) | 96.4 | 0.960 | 0.967 | 0.952 | 94.6 | 0.912 | 0.981 | 0.830 |
| SVM-RBS | 31 (2⁷, 1) | 95.3 | 0.959 | 0.946 | 0.906 | 80.5 | 0.612 | 0.996 | 0.660 |
| PromHD | 74 (2⁴, 1) | 99.3 | 0.996 | 0.990 | 0.986 | 97.5 | 0.961 | 0.988 | 0.949 |
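The ACC, SN, SP, and MCC columns in the two comparison tables are standard binary-classification measures. As a reminder of how they relate to a confusion matrix, here is a short sketch using the usual textbook definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy (%), sensitivity, specificity, and MCC from confusion counts.

    tp/fn count promoters predicted correctly/incorrectly;
    tn/fp count non-promoters predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    acc = 100.0 * (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(binary_metrics(tp=90, fn=10, tn=80, fp=20))
```

Because MCC uses all four cells of the confusion matrix, it penalizes a classifier that trades sensitivity for specificity, which accuracy alone can hide on unbalanced test sets.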
The rule-based knowledge of promoter prediction in human and Drosophila species.
| Species and rule | Rule-based knowledge | | | CF | Rules | Accuracy |
|---|---|---|---|---|---|---|
| (Human) | | | | | | |
| R1-p: | If | Then | Promoter | 0.928 | 1 | 50.0% |
| R2-n: | If | Then | Non-promoter | 0.999 | 1–2 | 96.2% |
| R3-n: | If | Then | Non-promoter | 0.985 | 1–3 | 99.5% |
| R4-n: | If | Then | Non-promoter | 0.974 | | |
| (Drosophila) | | | | | | |
| R1-p: | If | Then | Promoter | 0.997 | 1 | 50.0% |
| R2-n: | If | Then | Non-promoter | 0.999 | 1–2 | 84.7% |
| R3-n: | If | Then | Non-promoter | 0.997 | | |
Figure 5. Top 20 DNASDs, ranked by MED value, for the human and Drosophila training datasets. The MED values of the first two features (HPL dataset) and the first four features (DPL dataset) exceed 30.
Distribution of the extracted DNASDs.
| Descriptor type | HPL | | DPL | | Common | |
|---|---|---|---|---|---|---|
| PCP | 3 | 3( | 3 | 3( | 3 | 3( |
| GSDs | 23 | 4( | 22 | 2( | 15 | 1( |
| Frequency descriptors of 4-mer motifs | 73 | 73( | 49 | 49( | 14 | 14( |
| Total | 99 | | 74 | | 32 | |
The abbreviations D_P, D_C1, D_C4, D_E, D_, and D_ represent the descriptors of physicochemical property (PCP) and the global sequence descriptors (GSDs) of 1-mer motif, 4-mer motif, EDP, distribution, and transition, respectively.