Abstract
BACKGROUND: Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most of the available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of the resolutions of assays used to obtain TFBSs, databases such as TRANSFAC, ORegAnno and PAZAR store unaligned variable-length DNA segments containing binding sites of a TF. These DNA segments need to be aligned to build a PSSM. While the TRANSFAC database provides scoring matrices for TFs, nearly 78% of the TFs in the public release do not have matrices available. As work on TFBS alignment algorithms has been limited, it is highly desirable to have an alignment algorithm tailored to TFBSs.
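To make concrete why the segments must be aligned first, here is a minimal sketch of building a log-odds PSSM from sites that are already aligned. It is an illustration only, not the procedure used by any of the tools named above: the function names `build_pssm` and `score`, the pseudocount of 1, the uniform background of 0.25, and the three example sites are all assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(aligned_sites, pseudocount=1.0, background=0.25):
    """Return one {base: log-odds score} dict per alignment column.

    Assumptions (not from the source): a flat pseudocount and a uniform
    background frequency for all four bases.
    """
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")
    pssm = []
    for col in range(width):
        # Count only real bases; gap characters such as '-' are ignored.
        counts = Counter(s[col] for s in aligned_sites if s[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        column = {}
        for base in ALPHABET:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score(pssm, window):
    """Sum the per-column scores of a window as wide as the PSSM."""
    return sum(col[base] for col, base in zip(pssm, window))

# Hypothetical aligned sites, invented for illustration.
sites = ["TTCCG", "TTCCC", "TACCG"]
pssm = build_pssm(sites)
```

The length check is the point of the example: column-wise counting is only meaningful once every site occupies the same columns, which is exactly what unaligned variable-length segments do not provide.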
Year: 2013 PMID: 23522376 PMCID: PMC3747862 DOI: 10.1186/1471-2105-14-108
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An illustration of LASAGNA with . (a) The aligned binding sites in A and the unaligned ones in U. The shortest binding site is in bold. (b) The sequence logo [48] of the PSSM built from A aligns with the augmented sequence - - - - - - - - - TTTCCCGCCAA - - - - - - - - -, where the matched portion is in bold. (c) The updated A and U, where the newly added binding site is in bold.
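The step in panel (b) can be sketched as sliding a PSSM along a gap-padded site and keeping the best-scoring position. This is a simplified illustration, not LASAGNA as published: the padding of (PSSM width − 1) gaps per side, the zero score for gap positions, the function name `best_offset`, and the toy matrix are all assumptions.

```python
def best_offset(pssm, site, gap_score=0.0):
    """Return (score, offset) of the best placement of pssm against site.

    offset is the position of the PSSM's first column relative to the
    first base of the site; negative values mean the PSSM overhangs the
    site on the left. Gap positions score gap_score (an assumption).
    """
    width = len(pssm)
    pad = "-" * (width - 1)
    augmented = pad + site + pad
    best = None
    for start in range(len(augmented) - width + 1):
        window = augmented[start:start + width]
        total = sum(col.get(base, gap_score) for col, base in zip(pssm, window))
        # Strict '>' keeps the leftmost placement when scores tie.
        if best is None or total > best[0]:
            best = (total, start - len(pad))
    return best

# Toy three-column PSSM that favours the word "CCG" (hypothetical values).
toy_pssm = [
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
result = best_offset(toy_pssm, "TTCCGAA")
```

Padding both ends lets the matrix overhang the site, so a site shorter than, or only partly overlapping, the current alignment can still be placed.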
TFBSs in TRANSFAC public database by species
| Species | # TFs 1 | # TFBSs 2 |
| Homo sapiens | 68 | 1984 |
| Mus musculus | 53 | 966 |
| Rattus norvegicus | 26 | 633 |
| Drosophila melanogaster | 29 | 935 |
| Saccharomyces cerevisiae | 13 | 253 |
| Overall | 189 | 4771 |
1. The total number of TFs.
2. The total number of TFBSs.
Figure 2. Overall ROC curves for the three alignment algorithms. The left panel shows the curves at low false positive rates, from 0 to 0.02. The right panel presents the curves at false positive rates from 0.02 to 0.6. The three methods are indistinguishable when the false positive rate is greater than 0.6 and hence the region is not shown. We note that the vertical axes of the two panels are on different scales.
Species-wise and overall comparisons between LASAGNA and ClustalW2
| Species | # LASAGNA better 1 | # Same 2 | # TFs 3 | p-value 4 |
| H. sapiens | 54 (79.4%) | 0 | 68 | 4.42×10⁻⁷ |
| M. musculus | 42 (79.2%) | 0 | 53 | 1.41×10⁻⁵ |
| D. melanogaster | 22 (75.9%) | 0 | 29 | 9.89×10⁻⁴ |
| S. cerevisiae | 9 (69.2%) | 1 | 13 | 3.88×10⁻² |
| R. norvegicus | 20 (76.9%) | 1 | 26 | 1.54×10⁻³ |
| Overall | 147 (77.8%) | 2 | 189 | 1.22×10⁻¹⁵ |
1. Number of TFs on which LASAGNA performs better than ClustalW2.
2. Number of TFs on which LASAGNA and ClustalW2 have the same performance.
3. Total number of TFs for a species.
4. Wilcoxon signed-rank test p-value.
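As a quick arithmetic check, the Overall row and the percentages in the LASAGNA versus ClustalW2 table can be recomputed from the species rows. The counts below are copied from that table; the variable and function names are illustrative only, and the p-values are not recomputed because the per-TF scores are not listed here.

```python
# (LASAGNA better, same performance, total TFs), copied from the table.
rows = {
    "H. sapiens": (54, 0, 68),
    "M. musculus": (42, 0, 53),
    "D. melanogaster": (22, 0, 29),
    "S. cerevisiae": (9, 1, 13),
    "R. norvegicus": (20, 1, 26),
}

def pct(wins, total):
    """Percentage of TFs, rounded to one decimal place as in the table."""
    return round(100 * wins / total, 1)

# Column-wise sums give the Overall row.
overall = tuple(sum(r[i] for r in rows.values()) for i in range(3))
percentages = {species: pct(r[0], r[2]) for species, r in rows.items()}
overall_pct = pct(overall[0], overall[2])
```

The recomputed totals agree with the Overall row: 147 wins, 2 ties and 189 TFs, or 77.8%.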
Comparison of two groups of TFs divided according to results on LASAGNA and ClustalW2
| Statistic | Group 1 (mean) 1 | Group 2 (mean) 2 | p-value 3 |
| # TFBSs 4 | 25.07483 | 25.83333 | 0.1409 |
| Mean of TFBS length | 18.78626 | 17.56167 | 0.08451 |
| SD of TFBS length 5 | 8.180204 | 6.921905 | 0.06295 |
1. LASAGNA performed better than ClustalW2 on TFs in this group.
2. ClustalW2 performed better than or equal to LASAGNA on TFs in this group.
3. Wilcoxon signed-rank test p-value.
4. Number of binding sites for each TF.
5. Standard deviation of binding site length for each TF.
Species-wise and overall comparisons between LASAGNA and MEME
| Species | # LASAGNA better 1 | # Same 2 | # TFs 3 | p-value 4 |
| H. sapiens | 41 (60.3%) | 0 | 68 | 7.83×10⁻³ |
| M. musculus | 41 (77.4%) | 0 | 53 | 8.79×10⁻⁶ |
| D. melanogaster | 26 (89.7%) | 0 | 29 | 1.02×10⁻⁷ |
| S. cerevisiae | 10 (76.9%) | 3 | 13 | 2.96×10⁻³ |
| R. norvegicus | 23 (88.5%) | 1 | 26 | 1.73×10⁻⁴ |
| Overall | 141 (74.6%) | 4 | 189 | 3.55×10⁻¹⁵ |
1. Number of TFs on which LASAGNA performs better than MEME.
2. Number of TFs on which LASAGNA and MEME have the same performance.
3. Total number of TFs for a species.
4. Wilcoxon signed-rank test p-value.
Comparison of two groups of TFs divided according to results on LASAGNA and MEME
| Statistic | Group 1 (mean) 1 | Group 2 (mean) 2 | p-value 3 |
| # TFBSs 4 | 23.33333 | 30.85417 | 0.03196 |
| Mean of TFBS length | 18.33468 | 19.04125 | 0.3007 |
| SD of TFBS length 5 | 7.95844 | 7.730625 | 0.1846 |
1. LASAGNA performed better than MEME on TFs in this group.
2. MEME performed better than or equal to LASAGNA on TFs in this group.
3. Wilcoxon signed-rank test p-value.
4. Number of binding sites for each TF.
5. Standard deviation of binding site length for each TF.
Distribution of the 1751 binding sites of 90 TFs in TRANSFAC public database
| Species | # TFBSs 1 |
| Homo sapiens | 735 |
| Mus musculus | 346 |
| Rattus norvegicus | 278 |
| Saccharomyces cerevisiae | 158 |
| Drosophila melanogaster | 155 |
| Gallus gallus | 73 |
| Bos taurus | 5 |
| Sus scrofa | 1 |
1. Total number of TFBSs.
Figure 3. Comparison of the PSSM method dependent on LASAGNA to SiTaR. (a) Scatter plot of precision by LASAGNA-PSSM against precision by SiTaR at the same recall rate for each TF. Each point corresponds to a TF. Seventy-three percent (65 out of 89) of the TFs are above the reference line, indicating that LASAGNA-PSSM is more precise for the 65 TFs. (b) Plots of precision against recall for LASAGNA-PSSM and SiTaR based on all 90 TFs.
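For readers unfamiliar with the two axes, the following minimal sketch computes precision and recall for one set of predictions against known sites. The function name `precision_recall` and the site positions are hypothetical; how the study matched predicted hits to known sites is not stated in this record, so exact set membership is used as a stand-in assumption.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for predicted versus known site positions.

    precision = true positives / number of predictions
    recall    = true positives / number of known sites
    Exact position matching is an assumption made for this sketch.
    """
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall

# Hypothetical start positions, invented for illustration.
precision, recall = precision_recall({10, 25, 40, 77}, {10, 40, 90})
```

Comparing two methods at the same recall, as in panel (a), means tuning each method's score threshold until both recover the same fraction of known sites and then comparing how many of their predictions are correct.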
Figure 4. Partial results of scanning the promoter of the human gene CCL2. The list of predicted binding sites is sorted by p-value in ascending order, and only the top four hits are shown. The best hit is visualized in the context of other binding sites over a stretch of the promoter, where the height of a box is −log10(p-value). CCL2 is known to be a target gene of AP-1, Sp1 and p50 [28]. These 3 binding sites are not in the TRANSFAC public database and were not used to build the PSSMs.
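The ranking and box heights described in this caption amount to a sort followed by a −log10 transform, sketched below. The hit names and p-values are placeholders invented for illustration; they are not results from the CCL2 scan, and `top_hits` is a hypothetical helper.

```python
import math

# Placeholder hits as (label, p-value); values are invented, not from the paper.
hits = [
    ("TF_A", 3e-4),
    ("TF_B", 1e-6),
    ("TF_C", 2e-3),
    ("TF_D", 5e-5),
    ("TF_E", 1e-2),
]

def top_hits(hits, n=4):
    """Sort by ascending p-value and attach the box height -log10(p)."""
    ranked = sorted(hits, key=lambda hit: hit[1])
    return [(name, p, -math.log10(p)) for name, p in ranked[:n]]

top = top_hits(hits)
```

Because the transform is monotonically decreasing in p, the smallest p-value always yields the tallest box, so the best hit is also the most prominent one in the plot.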