Bart Hooghe, Stefan Broos, Frans van Roy, Pieter De Bleser.
Abstract
Transcription factor binding sites (TFBSs) are DNA sequences of 6-15 base pairs. Interaction of these TFBSs with transcription factors (TFs) is largely responsible for most spatiotemporal gene expression patterns. Here, we evaluate to what extent sequence-based prediction of TFBSs can be improved by taking into account the positional dependencies of nucleotides (NPDs) and the nucleotide sequence-dependent structure of DNA. We make use of the random forest algorithm to flexibly exploit both types of information. Results in this study show that both the structural method and the NPD method can be valuable for the prediction of TFBSs. Moreover, their predictive values seem to be complementary, even to the widely used position weight matrix (PWM) method. This led us to combine all three methods. Results obtained for five eukaryotic TFs with different DNA-binding domains show that our method improves classification accuracy for all five eukaryotic TFs compared with other approaches. Additionally, we contrast the results of seven smaller prokaryotic sets with high-quality data and show that with the use of high-quality data we can significantly improve prediction performance. Models developed in this study can be of great use for gaining insight into the mechanisms of TF binding.
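For readers unfamiliar with the PWM baseline mentioned above, the following is a minimal sketch of log-odds PWM scoring. It is not the implementation used in the study: the toy `SITES` alignment, the uniform background of 0.25, the pseudocount, and the function names `build_pwm`, `score`, and `best_window` are all assumptions made for illustration.

```python
import math

# Hypothetical toy alignment of known binding sites (not data from the study).
SITES = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one dict of log2-odds scores per aligned position."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        scores = {}
        for base in BASES:
            # Smoothed base frequency relative to an assumed uniform background.
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds for a sequence as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_window(pwm, seq):
    """Slide the PWM along seq and return (best score, start offset)."""
    width = len(pwm)
    windows = range(len(seq) - width + 1)
    return max((score(pwm, seq[i:i + width]), i) for i in windows)


pwm = build_pwm(SITES)
print(best_window(pwm, "GGGTATAAAGGG"))
```

Because each position contributes independently to the sum, a PWM of this kind cannot represent dependencies between positions, which is the limitation the NPD features are meant to address.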
Year: 2012 PMID: 22492513 PMCID: PMC3413102 DOI: 10.1093/nar/gks283
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overview of our approach: (A) The input from which models are built consists of the two classes of nucleotide sequences that the method should learn to separate. One class contains positive sequences (P, green) known to be bound in vivo; the other contains negative sequences (N, red) highly unlikely to be bound in vivo. (B) Each nucleotide sequence, from either class, is converted into multiple series of values; each series provides values for a specific DNA structural characteristic at all positions of the TFBS and its context (structural model), or simply consists of one-base or two-base parts of the sequence (NPD). (C) Basic selection of relevant features (i.e. positions) is made by statistical comparison of distributions of values for positive and negative sequences with mild thresholds. (D) Further selection is performed through wrapper-based feature selection, i.e. cross-validation performance evaluation with the RF algorithm. Per characteristic, redundant features are removed by sequential backwards elimination (SBE). Several models with one characteristic might be merged through BIRS. The final NPD model and final structural model can be merged into one integrative model. (E) The resulting model can be used by RF to predict the likelihood that a nucleotide sequence is a TFBS, after converting the sequence into series of the features contained in the model.
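Step (B) for the NPD model can be illustrated with a small sketch that turns a sequence into positional mononucleotide and dinucleotide features. The naming pattern (`monont_p0=A`, `dint_p5=CG`) is borrowed from the feature names listed in the PCA table; the exact semantics of those names, the `site_start` argument, and the helper names `npd_features` and `one_hot` are assumptions, not the authors' code.

```python
def npd_features(seq, site_start):
    """Label every position of seq with its base and its dinucleotide.

    Positions are numbered relative to site_start, so flanking bases
    upstream of the site receive negative indices. This numbering is an
    assumption modelled on names such as monont_p-3=C.
    """
    features = {}
    for i, base in enumerate(seq):
        pos = i - site_start
        features[f"monont_p{pos}"] = base
        if i + 1 < len(seq):
            # Dinucleotide starting at this position.
            features[f"dint_p{pos}"] = seq[i:i + 2]
    return features


def one_hot(features, selected):
    """Convert labels into 0/1 indicators for selected 'name=value' features."""
    vector = []
    for item in selected:
        name, value = item.split("=")
        vector.append(1 if features.get(name) == value else 0)
    return vector


feats = npd_features("ACGTGC", site_start=2)
print(one_hot(feats, ["monont_p0=G", "dint_p-1=CG", "monont_p1=A"]))
```

A vector of such indicators, one row per positive or negative sequence, is the kind of input a random forest classifier could consume after the feature selection of steps (C) and (D).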
Figure 2. Accuracy of classification models in identifying TFBSs, as assessed for five eukaryotic TFs. Details of threshold-averaged ROC curves showing the trade-off between TPR (Y-axis) and FPR (X-axis). Classification models applied: PWM (black), NPD (green), struct (blue), NPD_struct (purple), NPD_struct_PWM (orange), CRoSSeD (brown). (A–E) ROC curves for various transcription factors: (A) HIF1; (B) P53; (C) SP1; (D) STAT1; (E) TBP.
Figure 3. Accuracy of classification models in identifying TFBSs, as assessed for eight prokaryotic TFs. Threshold-averaged ROC curves showing the trade-off between TPR (Y-axis) and FPR (X-axis). Classification models applied: PWM (black), NPD (green), struct (blue), NPD_struct (purple), NPD_struct_PWM (orange), CRoSSeD (brown). (A–H) ROC curves for various transcription factors: (A) AraC; (B) ArcA; (C) Fis; (D) FlhDC; (E) IHF; (F) LexA; (G) PurR; (H) Fis [ChIP-chip set (31)].
Performance of the TBP model on external ChIP-seq TBP data set (Mokry et al.), measured in ROC AUC
| | PWM | RF model | CRoSSeD |
|---|---|---|---|
| ROC AUC | 0.535 | 0.774 | 0.573 |
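As a reminder of what the ROC AUC values in this table measure, the sketch below computes the area under the ROC curve as the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counting one half. The function name `roc_auc` and the example scores are hypothetical and are not taken from the study.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equivalent to the normalised Mann-Whitney U statistic: the fraction
    of (positive, negative) pairs in which the positive scores higher,
    with ties contributing 0.5.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for three bound and three unbound sequences.
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

An AUC of 0.5 corresponds to random ranking, so under this reading the PWM value of 0.535 is only marginally better than chance on the external set.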
Figure 4. Visualization of our integrative model for SP1. Top: mononucleotide frequencies with the positions of the NPD model shown as shaded boxes. Bottom: average value of one of the structural characteristics contained in the structural model, namely conformational tendency restB; positions of the structural model are indicated by dotted-line boxes (X-axes indicate position relative to the aligned start of the SP1 binding sites).
Results of the PCA. For each TF model, we selected the five best features according to the Weka PCA.
| TF model | Feature | TF model | Feature |
|---|---|---|---|
| | PWMmatrixscore_general | | uniformity_A_fullseqmean |
| | minor_groove_clash_size_fullseqmean | | dint_p5=CG |
| | minor_groove_clash_size_p18 | | PWMmatrixscore_general |
| | monont_p19=G | | dint_p6=GT |
| | monont_p0=A | | dint_p7=TG |
| | PWMmatrixscore_general | | uniformity_A_fullseqmean |
| | groovewidth_unboundLiu_fullseqmean | | homogeneity_BI_fullseqmean |
| | groovewidth_unboundLiu_p0 | | homogeneity_RESTB_fullseqmean |
| | groovewidth_unboundLiu_p1 | | PWMmatrixscore_general |
| | groovewidth_unboundLiu_p-1 | | homogeneity_RESTB_p2 |
| | PWMmatrixscore_general | | homogeneity_RESTB_fullseqmean |
| | PWMcorescore_general | | PWMmatrixscore_general |
| | uniformity_A_p-2 | | uniformity_AB_fullseqmean |
| | uniformity_A_fullseqmean | | dint_p5=CC |
| | uniformity_A_p-3 | | dint_p6=CC |
| | bend_toward_major_groove_fullseqmean | | PWMmatrixscore_general |
| | bend_toward_minor_groove_fullseqmean | | dint_p13=AA |
| | PWMmatrixscore_general | | dint_p5=TT |
| | bend_toward_major_groove_p-6 | | dint_p12=GA |
| | bend_toward_major_groove_p-7 | | dint_p7=TC |
| | PWMmatrixscore_general | | bend_toward_major_groove_fullseqmean |
| | monont_p-3=C | | bend_toward_minor_groove_fullseqmean |
| | monont_p-20=G | | homogeneity_BII_fullseqmean |
| | monont_p-3=T | | bend_toward_minor_groove_p8 |
| | tors_1_nucleosome_p-7 | | bend_toward_major_groove_p8 |
| | minor_groove_clash_distance_p-8 | | |
| | dint_p-8=GC | | |
| | PWMmatrixscore_general | | |
| | minor_groove_clash_distance_p-7 | | |
| | minor_groove_clash_distance_p-9 | | |
| | PWMmatrixscore_general | | PWMmatrixscore_general |
| | PWMcorescore_general | | monont_p14=C |
| | monont_p-5=A | | bend_towards_minor_groove_p6 |
| | monont_p-4=A | | dint_p9=TT |
| | monont_p1=T | | monont_p0=G |