Ziliang Qian, Lingyi Lu, Liu Qi, Yixue Li.
Abstract
Various statistical models have been developed to describe the DNA binding preferences of transcription factors, by which putative transcription factor binding sites (TFBS) can be identified according to the scores assigned. The statistical significance of these scores, usually expressed as a p-value, plays a critical role in identification. We developed an efficient algorithm that calculates this statistical significance precisely, reducing the time complexity from exponential to linear, and extended the algorithm to a wide range of models, from the commonly used position weight matrix models to the more complicated Bayesian network models. Further, we calculated p-values for all transcription factor DNA binding sites recorded in the JASPAR database and, based on these, investigated previously unexamined properties of p-values as a whole, such as the p-value distribution under different models and the variation of p-values across scoring schemes. We hope that our algorithm and the results of our computational experiments will offer an improved solution for assessing the statistical significance of transcription factor binding sites. The software implementing our method can be downloaded from http://pcal.biosino.org/pCal.html.
Keywords: Bayesian network; DNA; binding sites; transcription factor
Year: 2007 PMID: 18305824 PMCID: PMC2241927 DOI: 10.6026/97320630002169
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Figure 1. The pseudocode of our method.
Figure 2. Comparison of p-values obtained with different scoring schemes. The vertical axis depicts −log(p-value). The horizontal axis of the two left sub-graphs depicts the length of the TFBS, while the horizontal axis of the two right sub-graphs depicts the frequency of different p-values. Each point in the two left sub-graphs represents the −log(p-value) of a particular TFBS. The two right sub-graphs show the distribution of p-values under the cumulative probability scoring scheme and the MATCH scoring scheme. Although different scoring schemes were used, the p-value distributions are not very different.
Figure 3. The p-value is better than the raw score for identifying TFBS. The vertical axis of the two upper sub-graphs depicts −log(p-value), while the vertical axis of the two lower sub-graphs depicts the raw score obtained directly from the scoring schemes. The horizontal axis in all sub-graphs depicts the length of the TFBS. The blue dots represent true TFBSs provided by JASPAR, whereas the red dots represent DNA sequence fragments of various lengths from the genomic background. In the two upper sub-graphs, there is a sharp distinction between true TFBSs and genomic background for both the cumulative probability based scoring scheme and the MATCH scoring scheme. In the two lower sub-graphs, however, the blue and red dots overlap, indicating that raw scores are not an appropriate criterion for identifying TFBS.
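To make the idea of a score p-value concrete, the following is a minimal sketch of a standard dynamic-programming calculation for a position weight matrix. It is not the pCal implementation and not the pseudocode of Figure 1. It assumes integer (for example, rounded) column scores and an independent background base distribution. The names `score_distribution`, `p_value`, `example_pwm` and `uniform` are hypothetical. Instead of enumerating all 4^L sequences of length L, it propagates the probability of each reachable score one column at a time.

```python
from collections import defaultdict

def score_distribution(pwm, background):
    """Return {score: probability} for random sequences scored by the PWM.

    pwm        -- list of columns, each a dict mapping base -> integer score
    background -- dict mapping base -> background probability
    """
    dist = {0: 1.0}  # before any column is read, the score is 0 with probability 1
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, s in column.items():
                # extend every partial score by one more position
                nxt[score + s] += prob * background[base]
        dist = dict(nxt)
    return dist

def p_value(pwm, background, threshold):
    """Probability that a random background sequence scores >= threshold."""
    dist = score_distribution(pwm, background)
    return sum(p for s, p in dist.items() if s >= threshold)

# Toy two-column matrix with made-up integer scores (illustration only).
example_pwm = [
    {"A": 2, "C": 0, "G": 0, "T": -1},
    {"A": 1, "C": 1, "G": 0, "T": 0},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(p_value(example_pwm, uniform, 3))  # 0.125 (only AA and AC reach score 3)
```

The work per column is proportional to the number of distinct partial scores, so for bounded integer scores the total cost grows with the matrix length rather than with the number of possible sequences.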