Xiaolei Wang, Hiroyuki Kuwahara, Xin Gao.
Abstract
BACKGROUND: A quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to the rational design of gene regulatory networks. Recent advances in high-throughput technologies have enabled high-resolution measurements of protein-DNA binding affinity. Importantly, such experiments revealed the complex nature of TF-DNA interactions, whereby the effects of nucleotide changes on the binding affinity were observed to be context dependent. A systematic method to give high-quality estimates of such complex affinity landscapes is, thus, essential to the control of gene expression and the advance of synthetic biology.
Year: 2014 PMID: 25605483 PMCID: PMC4305984 DOI: 10.1186/1752-0509-8-S5-S5
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Figure 1. The workflow of the proposed two-round support vector regression method with weighted degree (WD) kernels. (a) The input training DNA binding site sequences with their corresponding K values, demonstrating the general form of the inputs. (b) The WD kernel matrix of the first round, calculated from Eq. 2. Rows and columns index the training binding sequences shown in (a), and each entry is the WD-kernel similarity between the corresponding pair of sequences. (c) Based on the kernel matrix in (b), we performed the first round of support vector regression to select the ten k-mers that contribute most to high binding affinity (in blue) and the ten k-mers that contribute most to low binding affinity (in red). The locally optimal parameters were also selected in this step. (d) The Round 2 regression, which predicts binding affinities by using the selected k-mers in a new WD kernel.
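The kernel matrix in panel (b) can be illustrated with a minimal sketch. It assumes the standard weighted degree kernel without shift or mismatch, which may differ in detail from the paper's Eq. 2; the function names and the example 12-mers are hypothetical and not taken from the paper's data.

```python
def wd_kernel(x, y, degree):
    """Standard weighted degree kernel (assumption: no shift, no mismatch).

    Counts position-wise matching k-mers for k = 1..degree, with
    weights beta_k = 2 * (degree - k + 1) / (degree * (degree + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for k in range(1, degree + 1):
        beta = 2.0 * (degree - k + 1) / (degree * (degree + 1))
        # Number of positions where the two k-mers are identical.
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1))
        total += beta * matches
    return total


def kernel_matrix(seqs, degree):
    """Pairwise WD kernel values for a list of equal-length sequences."""
    return [[wd_kernel(a, b, degree) for b in seqs] for a in seqs]


# Hypothetical 12-mer binding sites, for illustration only.
seqs = ["ATGACTCATTTT", "ATGACTAATTTT", "CCCCGGGGAAAA"]
K = kernel_matrix(seqs, degree=3)
for row in K:
    print(["%.3f" % v for v in row])
```

A matrix of this form could, for instance, be passed to a regressor that accepts precomputed kernels, such as scikit-learn's `SVR(kernel="precomputed")`; whether the authors used that library is not stated here.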
Figure 2. Grid search of parameters on the training data for Rounds 1 and 2 of our method. (a) Grid search of degree and shift for Round 1 in terms of the average Pearson Cor, with mismatch 1. The mismatch parameter can be searched in a similar manner (not shown). (b) Grid search of degree and shift for Round 2 in terms of Pearson Cor, with mismatch 0.
Table 1. Average prediction performance of Rounds 1 and 2 of our method on test sets of the 10-fold CV.
| Test Performance of Round 1: WD with s = 1 & m = 1 | | | | |
|---|---|---|---|---|
| d | | | | |
| 2 | 572 | 20.06 | 0.74 | 0.46 |
| 3 | 1034 | 19.99 | 0.74 | 0.47 |
| 4 | 1448 | 19.87 | 0.75 | 0.48 |
| 5 | 1834 | 19.79 | 0.75 | 0.49 |
| 6 | 2221 | 19.77 | 0.75 | 0.49 |
| 8 | 2908 | 19.75 | 0.75 | 0.50 |
| 9 | 3193 | 19.74 | 0.75 | 0.50 |
| Test Performance of Round 2: WD with s = 0 & m = 0 | | | | |
| d | | | | |
| 2 | 47 | 18.82 | 0.78 | 0.55 |
| 3 | 90 | 18.09 | 0.80 | 0.59 |
| 4 | 128 | 17.65 | 0.81 | 0.62 |
| 5 | 166 | 17.34 | 0.82 | 0.65 |
| 6 | 200 | 17.09 | 0.83 | 0.66 |
| 8 | 268 | 16.89 | 0.84 | 0.65 |
| 9 | 302 | 16.87 | 0.84 | 0.65 |
Round 1 uses all k-mers up to length d, with shift = 1 and mismatch = 1. Round 2 uses only selected k-mers from Round 1, with shift = 0 and mismatch = 0. 'Runtime' includes both training and testing, in seconds. The values for the parameters selected on training data are in bold.
Figure 3. The importance of all (a) 2-mers and (b) 3-mers at different positions from Round 1. The x-axis lists all the 2-mers and 3-mers, respectively. The y-axis shows the positions within the 12-mer DNA binding sequence. The baseline color is yellow. Red denotes an effect toward large K values, whereas blue denotes an effect toward small K values.
Table 2. Comparison with state-of-the-art methods.
| | PWM | HK | SVR w. WD | Our Method |
|---|---|---|---|---|
| RMSE | 20.2 | 25.4 | 22.5 | |
| RMSRE | 46% | 51% | 58% | |
| Pearson Cor | 0.56 | 0.49 | 0.70 | |
| Spearman Cor | 0.50 | 0.45 | 0.50 | |
"PWM" represents the position weight matrix model. "HK→ME" represents the linear model in [21]. "SVR w. WD" represents SVR with WD kernel without mismatch or shift. All values are the averages over the same 10-fold CV. The best values in each row are in bold.
Table 3. Statistics of the ten 7-mers that were identified to be important for high-affinity 12-mers through Round 1.
| Rank | 7-mer | Freq. | MIN | MAX | Average | Standard deviation |
|---|---|---|---|---|---|---|
| 1 | | 419 | 8.49 | 409.08 | 39.31 | 43.04 |
| 2 | | 990 | 8.49 | 567.81 | 56.66 | 54.61 |
| 3 | | 446 | 9.83 | 648.79 | 74.46 | 96.84 |
| 4 | | 453 | 14.52 | 303.87 | 63.66 | 54.64 |
| 5 | | 224 | 8.74 | 896.78 | 112.54 | 190.25 |
| 6 | | 392 | 8.49 | 963.28 | 167.26 | 254.46 |
| 7 | ATGAGTC | 504 | 15.60 | 975.18 | 276.01 | 292.93 |
| 8 | TGACTAA | 327 | 14.67 | 821.67 | 192.02 | 199.69 |
| 9 | TACTCAC | 847 | 9.65 | 975.05 | 437.92 | 336.43 |
| 10 | GACTAAT | 808 | 14.67 | 984.67 | 528.74 | 300.75 |
The seven columns list the rank of importance, the nucleotide sequence, the number of 12-mer sequences that contain this 7-mer, the minimum K over all such 12-mers, the maximum K over all such 12-mers, the mean K value of these 12-mers, and the standard deviation of K over these 12-mers, respectively. The six 7-mers in bold are the ones whose K values are less dispersed than those of the rest.
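The per-k-mer statistics described above can be computed with a short sketch like the one below. The function name and the toy sequences and K values are made up for illustration, and the use of the sample standard deviation is an assumption; the record does not say whether the sample or population form was used.

```python
import statistics


def kmer_stats(kmer, seqs, k_values):
    """Summarize K over all sequences that contain the given k-mer."""
    hits = [k for s, k in zip(seqs, k_values) if kmer in s]
    if not hits:
        return None
    return {
        "freq": len(hits),
        "min": min(hits),
        "max": max(hits),
        "mean": statistics.mean(hits),
        # Assumption: sample standard deviation (n - 1 denominator).
        "std": statistics.stdev(hits) if len(hits) > 1 else 0.0,
    }


# Toy 12-mers and K values, not taken from the paper's data.
toy_seqs = ["AATGAGTCAAAA", "CCATGAGTCCCC", "GGGGGGGGGGGG"]
toy_k = [10.0, 30.0, 500.0]
print(kmer_stats("ATGAGTC", toy_seqs, toy_k))
```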
Figure 4. Box plots of the 7-mers identified to be important for high-affinity 12-mers. (a) The distribution of K for the important 7-mers, in the same order as in Table 3. (b) The distribution of K for the three predicted 10-mers, including the stable 10-mers TATGACTCAT and TGTGACTCAT (the left two) and the sensitive 10-mer CATGACTAAT (the right one).
Table 4. Comparison with state-of-the-art methods on four other TFs in S. cerevisiae.
| | PWM | HK | SVR | Our |
|---|---|---|---|---|
| Cbf1p | 0.25 | 0.30 | 0.36 | |
| Cin5p | 0.21 | 0.26 | 0.47 | |
| Pho4p | 0.19 | 0.24 | 0.41 | |
| Yap1p | 0.22 | 0.24 | 0.40 | |
All values are Pearson Cor, averaged over the same 5-fold CV. The best values in each row are in bold.