Ilya E Vorontsov, Ivan V Kulakovskiy, Vsevolod J Makeev.
Abstract
BACKGROUND: The positional weight matrix (PWM) remains the most popular model for quantifying transcription factor (TF) binding. A PWM supplied with a score threshold defines a set of putative transcription factor binding sites (TFBS), thus providing a TFBS model. TF-binding DNA fragments obtained by different experimental methods usually give similar but not identical PWMs. The same is common for different TFs from the same structural family. It is therefore often necessary to measure the similarity between PWMs. Popular tools compare PWMs directly using matrix elements. Yet for log-odds PWMs, negative elements do not contribute to the scores of high-scoring TFBS and thus may differ without affecting the sets of best-recognized binding sites. Moreover, the two TFBS sets recognized by a given pair of PWMs can be more or less different depending on the score thresholds.
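To illustrate how a PWM plus a score threshold defines a set of putative binding sites, here is a minimal Python sketch. The 3-column matrix `PWM`, its values, and both thresholds are hypothetical and chosen only for illustration. The sketch enumerates every word by brute force, which is feasible only for very short matrices.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical 3-column log-odds PWM (illustrative values, not from any database):
# one dict of per-nucleotide scores for each position.
PWM = [
    {"A": 1.2, "C": -0.8, "G": -0.5, "T": -1.0},
    {"A": -0.9, "C": 1.0, "G": -0.7, "T": -0.4},
    {"A": -1.1, "C": -0.6, "G": 1.3, "T": -0.2},
]


def score(pwm, word):
    """Sum the matrix elements selected by the letters of the word."""
    return sum(column[nt] for column, nt in zip(pwm, word))


def recognized_sites(pwm, threshold):
    """Return every word of the PWM's length scoring at or above the threshold."""
    words = ("".join(w) for w in product(ALPHABET, repeat=len(pwm)))
    return {w for w in words if score(pwm, w) >= threshold}


# The same matrix yields different TFBS sets under different thresholds.
strict = recognized_sites(PWM, 3.0)
relaxed = recognized_sites(PWM, 1.7)
print(sorted(strict))
print(sorted(relaxed))
```

In this toy case the stricter threshold keeps only the consensus word, while the relaxed one also admits single-mismatch words. This is why the same pair of matrices can look more or less alike depending on the thresholds.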
Year: 2013 PMID: 24074225 PMCID: PMC3851813 DOI: 10.1186/1748-7188-8-23
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. The cumulative distributions and probability density of similarities for pairs of TFBS models. The similarities for pairs of models for the same TF are shown by solid lines (data for 85 TFs with models available in both the HOCOMOCO [20] and JASPAR [19] databases). The similarities for all possible pairs among the 170 assessed models are shown by dashed lines. Different colors correspond to different P-value levels. Notably, paired models for the same TF are indeed closer to each other than pairs drawn from the whole set of possible pairs.
Figure 2. The mean and the standard deviation of similarities between TFBS models for the same TF. Similarities are computed for HOCOMOCO and JASPAR TFBS models for 85 TFs.
Figure 3. The similarities (depending on P-value) and LOGO representations for pairs of TFBS models (HOCOMOCO and JASPAR) for selected TFs. Notably, even for extremely similar LOGOs, like those of CTCF, the Jaccard similarity reaches only 0.6, indicating that the sets of binding sites defined by the two models overlap by only 60%. The similarity remains comparatively low even at high P-values (e.g. 0.01, where every 100th word of the dictionary is recognized as a binding site). The same effect is shown for KLF4 (with the exception of similarity 1.0 at the lowest P-value, where both models recognize only identical consensus sequences). SPI1 models differing in length show very weak similarities. HIF1A models are surprisingly dissimilar at low P-values (possibly due to shorter model lengths).
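The Jaccard similarity at a given P-value level can be sketched as follows. This is a brute-force toy under stated assumptions, not the authors' algorithm: both matrices (`PWM_A`, `PWM_B`) are hypothetical, have the same length, and are compared without shifts or reverse complements. The background is assumed uniform, so the P-value is simply the fraction of all words recognized. Scores are integers (scaled log-odds) to avoid floating-point ties.

```python
from itertools import product

ALPHABET = "ACGT"

# Two hypothetical 3-column PWMs with integer (scaled log-odds) scores.
PWM_A = [
    {"A": 12, "C": -8, "G": -5, "T": -10},
    {"A": -9, "C": 10, "G": -8, "T": -4},
    {"A": -11, "C": -6, "G": 13, "T": -2},
]
PWM_B = [
    {"A": 11, "C": -9, "G": -3, "T": -10},
    {"A": -8, "C": 12, "G": -9, "T": -7},
    {"A": -10, "C": -5, "G": 10, "T": 3},
]


def score(pwm, word):
    """Sum the matrix elements selected by the letters of the word."""
    return sum(column[nt] for column, nt in zip(pwm, word))


def sites_at_pvalue(pwm, pvalue):
    """Words recognized at the threshold matching the requested P-value.

    Assumes a uniform background. Words tied with the threshold score are
    all kept, so the actual fraction may slightly exceed `pvalue`.
    """
    words = ["".join(w) for w in product(ALPHABET, repeat=len(pwm))]
    ranked = sorted((score(pwm, w) for w in words), reverse=True)
    n_top = max(1, int(pvalue * len(words)))
    threshold = ranked[n_top - 1]
    return {w for w in words if score(pwm, w) >= threshold}


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)


def similarity(pwm_a, pwm_b, pvalue):
    """Jaccard similarity of the word sets recognized at the same P-value."""
    return jaccard(sites_at_pvalue(pwm_a, pvalue), sites_at_pvalue(pwm_b, pvalue))


lowest = similarity(PWM_A, PWM_B, 1 / 64)   # each model keeps one word
higher = similarity(PWM_A, PWM_B, 4 / 64)   # each model keeps four words
print(lowest, higher)
```

With these toy matrices both models recognize only the shared consensus word at the lowest P-value, and the recognized sets diverge once the threshold is relaxed, even though the two matrices have the same best word at every position.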
Figure 4. The circular tree illustrating the hierarchy of high-quality models from the HOCOMOCO collection. Clusters are shown by alternating colors. Examples of clustered TFBS models are shown with their respective LOGO representations. The tree is drawn using jsPhyloSVG [22].