Jean-François Pessiot, Hirokazu Chiba, Hiroto Hyakkoku, Takeaki Taniguchi, Wataru Fujibuchi.
Abstract
How to identify true transcription factor binding sites on the basis of sequence motif information (e.g., motif pattern, location, combination, etc.) is an important question in bioinformatics. We present "PeakRegressor," a system that identifies binding motifs by combining DNA-sequence data and ChIP-Seq data. PeakRegressor uses L1-norm log linear regression in order to predict peak values from binding motif candidates. Our approach successfully predicts the peak values of STAT1 and RNA Polymerase II with correlation coefficients as high as 0.65 and 0.66, respectively. Using PeakRegressor, we could identify composite motifs for STAT1, as well as potential regulatory SNPs (rSNPs) involved in the regulation of transcription levels of neighboring genes. In addition, we show that among five regression methods, L1-norm log linear regression achieves the best performance with respect to binding motif identification, biological interpretability and computational efficiency.
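To make the regression idea concrete, here is a minimal sketch of one plausible reading of "L1-norm log linear regression": least squares on the logarithm of the peak values with an L1 penalty on the motif weights, solved by coordinate descent. The exact objective, the solver, the function names and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_log_linear(features, peaks, penalty=0.1, sweeps=200):
    """Coordinate-descent lasso on log(peak); returns (intercept, weights).

    Minimizes (1 / 2n) * sum((log y - b - x.w)^2) + penalty * sum(|w_j|).
    This objective is an assumed reading of the method, not the paper's code.
    """
    n, p = len(features), len(features[0])
    targets = [math.log(v) for v in peaks]
    intercept = sum(targets) / n
    weights = [0.0] * p
    residual = [t - intercept for t in targets]
    for _ in range(sweeps):
        for j in range(p):
            column = [row[j] for row in features]
            scale = sum(x * x for x in column) / n
            if scale == 0.0:
                continue  # a motif that never occurs keeps weight zero
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(x * (r + weights[j] * x) for x, r in zip(column, residual)) / n
            new_weight = soft_threshold(rho, penalty) / scale
            delta = new_weight - weights[j]
            residual = [r - delta * x for x, r in zip(column, residual)]
            weights[j] = new_weight
        # Exact update of the unpenalized intercept.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, weights


# Hypothetical toy data: column 0 is an informative motif count, column 1 is not.
toy_features = [[0, 0], [1, 0], [2, 0], [0, 1], [1, 1], [2, 1]]
toy_peaks = [math.exp(1.0 + 0.8 * row[0]) for row in toy_features]
toy_intercept, toy_weights = fit_l1_log_linear(toy_features, toy_peaks, penalty=0.1)
print(toy_intercept, toy_weights)
```

On this toy input the weight of the uninformative column is driven to exactly zero while the informative one is shrunk but kept, which is the sparsity property that makes an L1 penalty attractive for selecting a short list of motifs.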
Year: 2010 PMID: 20806061 PMCID: PMC2929187 DOI: 10.1371/journal.pone.0011881
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Influence of the peak filtering methods on the correlation coefficients between peak values and their predicted values in the test dataset.
| Filtering method | STAT1 | RNA Polymerase II |
| None | 0.50 | 0.44 |
| Promoter proximity | 0.41 | 0.53 |
| Q-value | 0.65 | |
The correlation coefficients are averaged over the folds of a 30-fold cross-validation.
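The evaluation described here, a correlation coefficient computed on each held-out fold and then averaged, can be sketched as follows. The single-feature least-squares model is only a stand-in for the actual regression, and the fold assignment by index modulo the number of folds, the function names and the toy data are all assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def cross_validated_correlation(xs, ys, folds):
    """Average, over folds, of the test-set correlation between y and its prediction."""
    scores = []
    for fold in range(folds):
        test = [i for i in range(len(xs)) if i % folds == fold]
        train = [i for i in range(len(xs)) if i % folds != fold]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predicted = [intercept + slope * xs[i] for i in test]
        observed = [ys[i] for i in test]
        scores.append(pearson(observed, predicted))
    return sum(scores) / len(scores)


# Hypothetical toy data: a noisy linear relation, evaluated with 3 folds.
toy_x = [float(i) for i in range(12)]
toy_y = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(toy_x)]
print(cross_validated_correlation(toy_x, toy_y, folds=3))
```

Averaging per-fold correlations, rather than pooling all predictions into one correlation, keeps each score tied to a model that never saw the corresponding test peaks.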
Figure 1. STAT1 regression results with two filtering methods: Q-value (right) and promoter proximity (left).
The correlation coefficients on the test data between peak values and their predicted values are 0.65 and 0.41 for Q-value and promoter proximity filtering, respectively.
List of putative STAT1 binding motifs identified by PeakRegressor.
| Motif | Weight |
| | 1. |
| | 0.96 |
| | 0.72 |
| | 0.65 |
| | −0.57 |
| | −0.56 |
| | 0.56 |
| | 0.55 |
| | 0.48 |
| | 0.47 |
The classical GAS motifs are shown in boldface.
List of putative RNA Polymerase II binding motifs identified by PeakRegressor.
| Motif | Weight |
| | 1. |
| | 0.78 |
| | 0.76 |
| | 0.68 |
| | 0.67 |
| | 0.63 |
| | 0.62 |
| | 0.61 |
| | 0.54 |
| | 0.54 |
The known Downstream Promoter Element and Initiator site motifs are shown in boldface.
List of putative STAT1 binding motifs identified by linear least squares regression.
| Motif | Weight |
| | −1.0 |
| | 0.94 |
| | 0.34 |
| | −0.34 |
| | 0.02 |
| | 0.02 |
| | 0.01 |
| | −0.01 |
| | 0.01 |
| | 0.01 |
The classical GAS motifs are shown in boldface.
List of putative RNA Polymerase II binding motifs identified by linear least squares regression.
| Motif | Weight |
| | 1.0 |
| | 0.86 |
| | 0.81 |
| | 0.74 |
| | 0.74 |
| | 0.69 |
| | 0.64 |
| | 0.62 |
| | 0.62 |
| | −0.61 |
The known Downstream Promoter Element and Initiator site motifs are shown in boldface.
List of putative STAT1 binding motifs identified by ridge regression.
| Motif | Weight |
| | 1. |
| | 0.89 |
| | −0.69 |
| | 0.69 |
| | 0.68 |
| | 0.65 |
| | 0.59 |
| | 0.58 |
| | −0.57 |
| | 0.53 |
The classical GAS motifs are shown in boldface.
List of putative RNA Polymerase II binding motifs identified by ridge regression.
| Motif | Weight |
| | 1.0 |
| | 0.86 |
| | 0.81 |
| | 0.75 |
| | 0.74 |
| | 0.70 |
| | 0.65 |
| | 0.63 |
| | 0.62 |
| | 0.61 |
The known Downstream Promoter Element and Initiator site motifs are shown in boldface.
List of putative STAT1 binding motifs identified by partial least squares regression.
| Motif | Weight |
| | 1.0 |
| | 0.80 |
| | 0.58 |
| | 0.56 |
| | 0.50 |
| | 0.42 |
| | −0.41 |
| | 0.41 |
| | 0.40 |
| | 0.39 |
The classical GAS motifs are shown in boldface.
List of putative STAT1 binding motifs identified by principal component regression.
| Motif | Weight |
| | 1.0 |
| | 0.94 |
| | 0.87 |
| | 0.86 |
| | 0.85 |
| | 0.82 |
| | 0.82 |
| | 0.81 |
| | 0.80 |
| | 0.78 |
The classical GAS motifs are shown in boldface.
List of putative RNA Polymerase II binding motifs identified by partial least squares regression.
| Motif | Weight |
| | 1.0 |
| | 0.99 |
| | −0.97 |
| | −0.90 |
| | −0.89 |
| | 0.87 |
| | 0.86 |
| | 0.86 |
| | 0.85 |
| | −0.81 |
The known Downstream Promoter Element and Initiator site motifs are shown in boldface.
List of putative RNA Polymerase II binding motifs identified by principal component regression.
| Motif | Weight |
| | −1.0 |
| | 0.97 |
| | −0.96 |
| | −0.94 |
| | −0.90 |
| | −0.89 |
| | −0.87 |
| | −0.86 |
| | −0.84 |
| | −0.83 |
The known Downstream Promoter Element and Initiator site motifs are shown in boldface.
Different regression methods and their correlation coefficients averaged on the test sets.
| Regression method | STAT1 | RNA Polymerase II |
| L1-norm log linear regression | | |
| Linear least squares regression | 0.64 | 0.64 |
| Ridge regression | 0.64 | 0.64 |
| Partial least squares regression | 0.64 | 0.65 |
| Principal component regression | 0.63 | 0.52 |
Figure 2. Schematic view of the workflow of PeakRegressor.
PeakRegressor takes ChIP-Seq data as input and outputs a list of transcription factor binding motif (TFBM) candidates and their weights that give the best regression accuracies.
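The input and output ends of such a workflow can be sketched as follows: candidate motifs are turned into a per-peak count matrix, and fitted weights are reported as a ranked list. The regression step itself is omitted. Exact string matching, the function names, the toy sequences and the example weights are assumptions for illustration, not the PeakRegressor implementation.

```python
def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    width = len(motif)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == motif
    )


def build_feature_matrix(sequences, motifs):
    """One row per peak sequence, one column per candidate motif."""
    return [[count_motif(seq.upper(), motif) for motif in motifs] for seq in sequences]


def rank_motifs(motifs, weights):
    """Keep motifs with nonzero weight, sorted by decreasing absolute weight."""
    kept = [(motif, weight) for motif, weight in zip(motifs, weights) if weight != 0.0]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical peak sequences and candidate motifs.
toy_sequences = ["TTCCCGGAATTC", "acgt"]
toy_motifs = ["TTC", "GAA"]
toy_matrix = build_feature_matrix(toy_sequences, toy_motifs)
print(toy_matrix)

# Hypothetical weights, standing in for the output of a fitted regression.
print(rank_motifs(["M1", "M2", "M3"], [0.5, -0.9, 0.0]))
```

Sorting by absolute weight places strongly negative motifs next to strongly positive ones of similar magnitude, so both kinds of candidate appear near the top of the list.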