Xiaoyue Zhao, Zhenyu Xuan, Michael Q Zhang.
Abstract
Promoter prediction is a difficult but important problem in gene finding, and it is critical for elucidating the regulation of gene expression. We introduce a new promoter prediction program, CoreBoost, which applies a boosting technique with stumps to select important small-scale as well as large-scale features. CoreBoost improves greatly on locating transcription start sites. We also demonstrate that by further utilizing some tissue-specific information, better accuracy can be achieved.
Year: 2007 PMID: 17274821 PMCID: PMC1852414 DOI: 10.1186/gb-2007-8-2-r17
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. The energy profiles of CpG-related promoters and non-CpG-related promoters. The transcription start site is located at position 1000 in the top figures and position 250 in the bottom figures. All of the plots were smoothed with an average window of width 5. TSS, transcription start site.
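The caption mentions smoothing with an average window of width 5. A minimal sketch of such a moving average is shown below. The function name `smooth` and the edge handling (averaging over whatever part of the window exists near the ends) are assumptions for illustration; the source does not say how the profile edges were treated.

```python
def smooth(values, width=5):
    """Centered moving average over `width` points.

    Assumption: near the ends of the profile, the average is taken over the
    part of the window that actually exists, so the output keeps the same
    length as the input.
    """
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# A single spike is spread over its five-point neighborhood.
profile = [0, 0, 10, 0, 0]
smoothed = smooth(profile)
```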
Input features to train CoreBoost
| Feature list | Details |
|---|---|
| Core promoter elements | Score of each core element |
| TFBSs | Weighted maximal scores for weight matrices from TRANSFAC and density of TFBS |
| Mechanical properties | Weighted energy/flexibility scores around positions -25 and +1 |
| Markovian score | Likelihood ratios from homogeneous third order Markov models |
| k-mer frequency | Frequency of 1- or 2-mers related to nucleotide G or C |
TFBS, transcription factor binding site.
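Two of the feature families in the table above are simple enough to sketch directly: the Markovian score (a log-likelihood ratio between two homogeneous third order Markov models) and the G/C-related k-mer frequencies. The sketch below is illustrative only. The function names, the add-one pseudocount, the choice of training sets, and the particular 1- and 2-mers reported are assumptions, not details taken from CoreBoost.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"
ORDER = 3  # third order: each base is conditioned on the previous three


def train_markov(seqs, order=ORDER, pseudocount=1.0):
    """Estimate log transition probabilities log P(base | previous `order` bases).

    Homogeneous model: the same transition table is used at every position.
    The pseudocount is an assumption to avoid zero probabilities.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            if set(ctx + nxt) <= set(ALPHABET):  # skip N and other symbols
                counts[ctx][nxt] += 1
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
        model[ctx] = {b: math.log((counts[ctx][b] + pseudocount) / total)
                      for b in ALPHABET}
    return model


def log_likelihood(seq, model, order=ORDER):
    """Sum of log transition probabilities along the sequence."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        if ctx in model and nxt in model[ctx]:
            total += model[ctx][nxt]
    return total


def markov_llr(seq, pos_model, neg_model):
    """Log-likelihood ratio: positive when `seq` looks more like the positive set."""
    return log_likelihood(seq, pos_model) - log_likelihood(seq, neg_model)


def gc_kmer_freqs(seq):
    """Frequencies of a few G/C-related 1-mers and 2-mers (illustrative choice)."""
    seq = seq.upper()
    n = max(len(seq), 1)
    return {
        "G": seq.count("G") / n,
        "C": seq.count("C") / n,
        "G+C": (seq.count("G") + seq.count("C")) / n,
        "CG": seq.count("CG") / max(len(seq) - 1, 1),
    }


# Toy training sets (hypothetical): GC-rich "promoter" vs AT-rich "background".
pos_model = train_markov(["CGCGCGCGCGCG"] * 3)
neg_model = train_markov(["ATATATATATAT"] * 3)
```

With these toy models, `markov_llr("CGCGCGCG", pos_model, neg_model)` is positive and `markov_llr("ATATATAT", pos_model, neg_model)` is negative, which is the behavior a promoter-versus-background score should have.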
Top features in CoreBoost
| Promoter class | Classifier type | Features |
|---|---|---|
| CpG | P versus U | Log-likelihood ratios from third order Markov chain, log-likelihood ratios from TSS weight matrix |
| | | GC-box score, weighted score of transcription factor NFY, weighted energy score at position +1 |
| | | Weighted score of transcription factor YY1, TATA score, weighted score of transcription factor ELK1 |
| | | MTE score, weighted score of transcription factor CREB |
| | P versus D | Log-likelihood ratios from third order Markov chain, GC-box score |
| | | Weighted score of transcription factor NFY |
| | | Log-likelihood ratios from TSS weight matrix |
| | | Difference between the energy score around positions -25 and +1 and the average from surroundings |
| | | Log-likelihood ratios from transcription factor ELK1, frequency of G+C |
| | | Log-likelihood ratios from transcription factor YY1, TATA score, frequency of G |
| Non-CpG | P versus U | Correlation between vector of energy scores and empirical average energy profile |
| | | Log-likelihood ratios from third order Markov chain, TATA score |
| | | Difference between the energy score around positions -25 and +1 and the average from surroundings |
| | | Weighted energy at position +1 |
| | | Proportion of Inr and GC-box pair within 10 bp of observed distance, Inr score |
| | P versus D | Correlation between vector of energy scores and empirical average energy profile, TATA score |
| | | Log-likelihood ratios from third order Markov chain |
| | | Weighted energy at position +1 |
| | | Correlation between vector of flexibility scores and empirical average flexibility profile, Inr score |
| | | Difference between the flexibility score around position +1 and the average from surroundings, GC-box score |
bp, base pairs; D, immediate downstream sequence; P, promoter; TSS, transcription start site; U, immediate upstream sequence.
Figure 2. Density plot of the relative distance from the positions with maximal scores to the annotated TSS. The dotted curve is based on the prediction from the ChIP-chip experiment. The solid curve is for CoreBoost. The dashed and the dot-dashed curves correspond to McPromoter and Eponine, respectively. The right figure is a zoomed-in version of the left one. ChIP, chromatin immunoprecipitation; TSS, transcription start site.
Figure 3. Positive predictive value versus sensitivity for CpG-related promoters. The solid and the long-dashed curves are for CoreBoost: the solid curve clusters predictions within 2,000 bp and the long-dashed curve within 500 bp. The dot-dashed curve is for McPromoter, which clusters predictions within 2,000 bp by default. The dotted and the short-dashed curves are for Eponine: the dotted curve clusters predictions within 2,000 bp and the short-dashed curve is the default output of Eponine. bp, base pairs.
Figure 4. Positive predictive value versus sensitivity for non-CpG-related promoters. The solid and the long-dashed curves are for CoreBoost: the solid curve clusters predictions within 2,000 bp and the long-dashed curve within 500 bp. The dot-dashed curve is for McPromoter, which clusters predictions within 2,000 bp by default. The dotted and the short-dashed curves are for Eponine: the dotted curve clusters predictions within 2,000 bp and the short-dashed curve is the default output of Eponine. bp, base pairs.
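Figures 3 and 4 plot positive predictive value (PPV) against sensitivity after clustering nearby predictions. The captions do not give the exact clustering rule or the distance at which a prediction counts as correct, so the sketch below makes two labeled assumptions: predictions are chained into one cluster whenever consecutive positions lie within `max_gap` bp and the highest-scoring position represents the cluster, and a prediction is a true positive when it falls within a hypothetical `tolerance` of an annotated TSS.

```python
def cluster_predictions(preds, max_gap):
    """Merge predictions whose consecutive positions lie within `max_gap` bp.

    `preds` is a list of (position, score) pairs. Each cluster is represented
    by its highest-scoring member. The chaining rule is an assumption.
    """
    clusters = []
    for pos, score in sorted(preds):
        if clusters and pos - clusters[-1]["last"] <= max_gap:
            cluster = clusters[-1]
            cluster["last"] = pos
            if score > cluster["best"][1]:
                cluster["best"] = (pos, score)
        else:
            clusters.append({"last": pos, "best": (pos, score)})
    return [c["best"] for c in clusters]


def ppv_sensitivity(pred_positions, true_tss, tolerance):
    """PPV = correct predictions / all predictions;
    sensitivity = annotated TSSs recovered / all annotated TSSs.

    `tolerance` (bp) is a hypothetical matching distance.
    """
    correct = sum(any(abs(p - t) <= tolerance for t in true_tss)
                  for p in pred_positions)
    recovered = sum(any(abs(p - t) <= tolerance for p in pred_positions)
                    for t in true_tss)
    ppv = correct / len(pred_positions) if pred_positions else 0.0
    sens = recovered / len(true_tss) if true_tss else 0.0
    return ppv, sens


# Toy data (hypothetical positions and scores).
preds = [(100, 0.2), (600, 0.9), (1500, 0.5), (5000, 0.7)]
wide = cluster_predictions(preds, max_gap=2000)
narrow = cluster_predictions(preds, max_gap=500)
ppv, sens = ppv_sensitivity([p for p, _ in wide], [650, 9000], tolerance=100)
```

Sweeping a score threshold over the clustered predictions and recomputing the two values at each threshold would trace out a curve of the kind shown in the figures.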
Figure 5. LogitBoost algorithm with trees.
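The figure itself is not reproduced here. As a rough guide, the standard two-class LogitBoost procedure with stumps (single-split trees, as mentioned in the abstract) can be sketched as follows. This is a generic textbook-style version, not CoreBoost's code: the function names, the number of rounds, and the clamping of the working response and weights are assumptions added for numerical stability.

```python
import math


def sigmoid2(f):
    """p = 1 / (1 + exp(-2f)); the exponent is clamped to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, 2.0 * f))))


def fit_stump(X, z, w):
    """Weighted least-squares regression stump: one feature, one threshold."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            sw, swz = [0.0, 0.0], [0.0, 0.0]
            for row, zi, wi in zip(X, z, w):
                side = 0 if row[j] <= thr else 1
                sw[side] += wi
                swz[side] += wi * zi
            if sw[0] == 0.0 or sw[1] == 0.0:
                continue  # degenerate split, skip
            means = (swz[0] / sw[0], swz[1] / sw[1])
            err = sum(wi * (zi - means[0 if row[j] <= thr else 1]) ** 2
                      for row, zi, wi in zip(X, z, w))
            if best is None or err < best[0]:
                best = (err, j, thr, means[0], means[1])
    if best is None:  # no usable split: fall back to a constant fit
        mean = sum(wi * zi for zi, wi in zip(z, w)) / sum(w)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_value(stump, row):
    j, thr, left, right = stump
    return left if row[j] <= thr else right


def logitboost(X, y, rounds=20):
    """Two-class LogitBoost; y holds 0/1 labels. Returns the list of stumps."""
    F = [0.0] * len(X)
    p = [0.5] * len(X)
    stumps = []
    for _ in range(rounds):
        # Weights and working response of the Newton step.
        w = [max(pi * (1.0 - pi), 1e-10) for pi in p]
        z = [max(-4.0, min(4.0, (yi - pi) / wi))  # clamp: assumed safeguard
             for yi, pi, wi in zip(y, p, w)]
        stump = fit_stump(X, z, w)
        stumps.append(stump)
        for i, row in enumerate(X):
            F[i] += 0.5 * stump_value(stump, row)
            p[i] = sigmoid2(F[i])
    return stumps


def predict_proba(stumps, row):
    return sigmoid2(sum(0.5 * stump_value(s, row) for s in stumps))


def predict(stumps, row):
    return 1 if predict_proba(stumps, row) >= 0.5 else 0


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5], [0.2, 3], [0.3, 4], [0.7, 5], [0.8, 3], [0.9, 4]]
y = [0, 0, 0, 1, 1, 1]
model = logitboost(X, y, rounds=10)
```

Because each stump uses a single feature, the features chosen in early rounds give a rough ranking of importance, which is the sense in which boosting with stumps performs feature selection.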