Nak-Kyeong Kim, Kannan Tharakaraman, Leonardo Mariño-Ramírez, John L Spouge.
Abstract
BACKGROUND: Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with ad hoc methods, making them unsuitable for general motif searches.
Year: 2008 PMID: 18533028 PMCID: PMC2432075 DOI: 10.1186/1471-2105-9-262
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Positions of hypothetical TFBSs (gray boxes) with respect to the corresponding TSS.
Table 1. The A-GLAM output with positional information for 'hm08r'.
| Name | Start | Alignment | End | Score | E-value |
| seq_0 | -66 | | -59 | 11.0093 | 6.65E-06 |
| seq_2 | -65 | | -58 | 10.3315 | 2.30E-05 |
| seq_3 | -58 | | -51 | 11.2688 | 2.94E-06 |
| seq_5 | -188 | | -181 | 11.4594 | 1.28E-06 |
| seq_7 | -184 | C | -177 | 9.86871 | 4.64E-05 |
| seq_9 | -101 | | -94 | 10.9283 | 8.09E-06 |
| seq_10 | -220 | | -213 | 7.58906 | 3.78E-04 |
| seq_11 | -80 | | -73 | 11.1306 | 4.75E-06 |
| seq_12 | -52 | CTGACGGC | -45 | 10.0764 | 3.50E-05 |
| seq_14 | -8 | CTGATGTC | -1 | 7.60515 | 3.69E-04 |
A-GLAM predicted TFBSs in 10 sequences in the TSS Tompa data subset 'hm08r'. The column "Name" shows each sequence; the column "Alignment", the corresponding predicted TFBS. The start and end positions with respect to the corresponding TSS are shown in the columns "Start" and "End". The columns "Score" and "E-value" show the bit scores and E-values that A-GLAM assigned to the predicted TFBSs. The known binding sites in the alignment are underlined.
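The Start and End columns appear to be inclusive coordinates relative to the TSS, so a site's width would be End - Start + 1. The following is a minimal sketch that checks this reading against the rows of the table; the tuple layout and the helper names `site_width` and `in_window` are illustrative assumptions and are not part of A-GLAM.

```python
# Rows copied from the A-GLAM output table for 'hm08r':
# (name, start, end, e_value), with start/end relative to the TSS.
ROWS = [
    ("seq_0", -66, -59, 6.65e-06),
    ("seq_2", -65, -58, 2.30e-05),
    ("seq_3", -58, -51, 2.94e-06),
    ("seq_5", -188, -181, 1.28e-06),
    ("seq_7", -184, -177, 4.64e-05),
    ("seq_9", -101, -94, 8.09e-06),
    ("seq_10", -220, -213, 3.78e-04),
    ("seq_11", -80, -73, 4.75e-06),
    ("seq_12", -52, -45, 3.50e-05),
    ("seq_14", -8, -1, 3.69e-04),
]


def site_width(start, end):
    """Width of a site, assuming both start and end are included."""
    return end - start + 1


def in_window(start, end, lo=-300, hi=0):
    """True if the whole site lies inside the interval [lo, hi]."""
    return lo <= start and end <= hi


# Set of distinct widths across all predicted sites.
widths = {site_width(start, end) for _, start, end, _ in ROWS}
# Row with the smallest (most significant) E-value.
best = min(ROWS, key=lambda row: row[3])
# Whether every site falls in the [-300, 0] interval named in the abstract.
all_in_window = all(in_window(start, end) for _, start, end, _ in ROWS)

print(widths, best[0], all_in_window)
```

Under this reading every predicted site has the same width as the 8-letter alignments shown for seq_12 and seq_14, which is what a fixed-width motif alignment would produce.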
Table 2. The correlation coefficients for the TSS Tompa data subsets
| Data Subset | Without positional information | With positional information | Improvement |
| hm01r | -0.012 | -0.007 | 0.005 |
| hm02r | -0.009 | -0.007 | 0.002 |
| hm03r | -0.037 | 0.386 | 0.423 |
| hm04r | -0.008 | -0.005 | 0.003 |
| hm05r | -0.031 | -0.019 | 0.012 |
| hm06r | -0.014 | 0.156 | 0.170 |
| hm07r | -0.015 | -0.015 | -0.001 |
| hm08r | -0.012 | 0.574 | 0.586 |
| hm09r | -0.011 | 0.358 | 0.369 |
| hm10r | -0.019 | 0.083 | 0.102 |
| hm11r | -0.028 | -0.012 | 0.016 |
| hm13r | -0.015 | -0.016 | -0.001 |
| hm14r | 0.204 | -0.018 | -0.222 |
| hm15r | -0.011 | -0.012 | -0.002 |
| hm16r | -0.011 | -0.006 | 0.005 |
| hm17r | -0.015 | -0.012 | 0.004 |
| hm18r | -0.018 | 0.094 | 0.112 |
| hm19r | -0.010 | -0.007 | 0.003 |
| hm20r | -0.026 | 0.046 | 0.073 |
| hm21r | 0.401 | 0.384 | -0.016 |
| hm22r | -0.020 | -0.020 | 0.000 |
| hm24r | -0.016 | -0.010 | 0.006 |
| hm26r | -0.016 | 0.099 | 0.115 |
| Combined CC | -0.008 | 0.101 | 0.109 |
Table 2 shows the correlation coefficients for A-GLAM's predictions on the 23 subsets of the TSS Tompa dataset. The "Improvement" column quantifies the effect of positional information on predictions: it is the difference between the correlation coefficients in the second and third columns, "Without positional information" and "With positional information".
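The text here does not spell out how the correlation coefficient is computed. The sketch below assumes it is the nucleotide-level, Matthews-style coefficient commonly used to score motif finders on the Tompa benchmark, computed from per-base true/false positives and negatives. That choice, the zero-denominator convention, and the function names `nucleotide_cc` and `improvement` are all assumptions made for illustration.

```python
import math


def nucleotide_cc(tp, fp, tn, fn):
    """Matthews-style correlation coefficient from per-base counts.

    Assumed definition: (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)).
    """
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when the coefficient is undefined.
        return 0.0
    return (tp * tn - fn * fp) / denom


def improvement(without_pos, with_pos):
    """Difference 'with positional information' minus 'without'."""
    return round(with_pos - without_pos, 3)


# Toy counts for illustration only; they are not taken from the paper.
toy_cc = nucleotide_cc(tp=6, fp=2, tn=90, fn=2)

# Values for hm03r and hm08r as printed in the table.
hm03r = improvement(-0.037, 0.386)
hm08r = improvement(-0.012, 0.574)
print(toy_cc, hm03r, hm08r)
```

Recomputing the difference from the rounded columns can disagree with the printed "Improvement" by 0.001 (hm07r is one such row), presumably because the published difference was taken before rounding.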
Figure 2. Distribution of known binding-site locations in the TSS Tompa dataset. The x-axis is anchored on the TSS, denoted as location 0. All sequences in each test subset are collapsed into a single line; hence the 23 data subsets are shown as 23 different horizontal lines. Each data subset contains TFBSs corresponding to a single specific transcription factor.
Figure 3. Distribution of known binding-site locations in the TRANSFAC dataset. The x-axis is anchored on the TSS, denoted as location 0. All sequences in each test subset are collapsed into a single line; hence the 82 data subsets are shown as 82 different horizontal lines. Each data subset contains TFBSs corresponding to a single specific transcription factor.
Table 3. The effect of truncating the sequence upstream of the TSS
| Sequence range | TSS Tompa: without positional info | TSS Tompa: with positional info | TSS Tompa: p-value | TRANSFAC: without positional info | TRANSFAC: with positional info | TRANSFAC: p-value |
| [-2000, 0] | -0.008 | 0.101 | 0.002 | -0.009 | 0.027 | 10^-8 |
| [-1000, 0] | 0.086 | 0.098 | 0.583 | 0.050 | 0.066 | 0.112 |
| [-500, 0] | 0.125 | 0.133 | 0.338 | 0.077 | 0.078 | 0.070 |
| [-250, 0] | 0.139 | 0.139 | 0.054 | 0.094 | 0.076 | 0.603 |
The first column shows the sequence range upstream of the TSS given as input to A-GLAM. For each of the TSS Tompa and TRANSFAC datasets, a group of three columns shows the combined correlation coefficient (CCC) without and with positional information, followed by a Wilcoxon p-value that evaluates the difference between the two CCCs. Because not all TFBSs in our datasets are known, small improvements in the CCC correspond to true improvements of unknown magnitude. For example, the two CCC values in the table that both round to 0.139 differ enough in their unseen decimals to yield a p-value of 0.054. To view results for individual sites in the Tompa dataset, see Supplementary Table 7 [see Additional file 1].
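Truncating the input to a range such as [-500, 0] amounts to keeping only the bases nearest the TSS. The sketch below assumes that each input sequence ends at the TSS and reads the range [-N, 0] as "the last N bases". Whether position 0 itself is counted is not stated here, and `truncate_upstream` is a hypothetical helper, not an A-GLAM option.

```python
def truncate_upstream(seq, n_bases):
    """Keep the n_bases closest to the TSS.

    Assumes the sequence is written 5'->3' and ends at the TSS, so the
    bases of interest are at the end of the string. Sequences shorter
    than n_bases are returned unchanged.
    """
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return seq[-n_bases:]


# Range widths used in the truncation table.
RANGES = [2000, 1000, 500, 250]

# Toy 12-base sequence for illustration only (not real promoter data).
toy = "AAAACCCCGGGT"
short = truncate_upstream(toy, 4)

# Lengths that a 1200-base input would have after each truncation.
lengths = [len(truncate_upstream("A" * 1200, n)) for n in RANGES]
print(short, lengths)
```

Each truncated set would then be run in both modes and scored, which is how the pairs of CCC values in the table are compared.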