Shaoqiang Zhang, Xiguo Zhou, Chuanbin Du, Zhengchang Su.
Abstract
BACKGROUND: Discovering transcription factor binding sites (TFBSs) is one of the primary challenges in deciphering the complex gene regulatory networks encoded in a genome. A set of short DNA sequences recognized by a transcription factor (TF) is known as a motif, which can be represented in matrix form, such as a position-specific scoring matrix (PSSM) or a position frequency matrix (PFM). We frequently need to query a database of motifs for those similar to a given motif, merge similar TFBS motifs possibly recognized by the same TF, separate unrelated motifs, or filter out spurious motifs. All of these applications require a metric that captures slight differences between unrelated motifs and highlights the similarity between motifs of the same group. Although several motif similarity metrics have been proposed, their performance is still far from satisfactory for these applications.
Year: 2013 PMID: 24564945 PMCID: PMC3866262 DOI: 10.1186/1752-0509-7-S2-S14
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
The definitions of six metrics used for motif comparison.
| Similarity metric | Formula | References |
|---|---|---|
| Average log-likelihood ratio (ALLR) | | Wang and Stormo [ |
| Average Kullback-Leibler | | Kullback and Leibler [ |
| Sum of squared distances (SSD) | | Schones et al. |
| 1-p-value of Chi-square (pCS) | | Schones et al. |
| Pearson correlation coefficient | | Pietrokovski [ |
| Asymptotic Covariance (AC) | | Pape et al. |
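As a rough illustration of three of the metrics listed above, the following sketch uses their textbook column-wise forms on a pair of base-frequency columns (order A, C, G, T). These forms, the function names, and the small `eps` guard are assumptions for illustration; the study's own definitions may differ in pseudocounts or normalization.

```python
import math

def pcc(x, y):
    """Pearson correlation coefficient between two frequency columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a perfectly uniform column scores 0
    return cov / math.sqrt(vx * vy)

def ssd(x, y):
    """Sum of squared distances between two frequency columns."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def avg_kl(x, y, eps=1e-9):
    """Symmetrized Kullback-Leibler divergence: mean of KL(x||y) and KL(y||x).

    eps is a hypothetical guard against log(0), not a value from the study.
    """
    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl(x, y) + kl(y, x))
```

Note that PCC is a similarity (higher means more alike), whereas SSD and the averaged Kullback-Leibler divergence are distances (lower means more alike), so they need a sign flip or similar transform before being used as alignment scores.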
Summary of the three datasets used for the evaluation in this study.
| Dataset | Number of true motifs | Number of putative motifs | Number of classes | Average length | True motifs source | Data source |
|---|---|---|---|---|---|---|
| Dataset-1 | 96 | 0 | 13 | 10.39 | JASPAR | Mahony et al. |
| Dataset-2 | 124 | 0 | Unknown | 10.6 | JASPAR | Xu and Su, 2010 [ |
| Dataset-3 | 122 | Unknown | 16 | RegulonDB | Zhang et al. | |
Comparison of the seven top-performing SPIC alignment strategies with the best strategies of existing methods for motif retrieval on Dataset-1.
| Strategy | Accuracy: ZF PFMs (25) | Accuracy: Non-ZF PFMs (71) | Accuracy: Total (96) |
|---|---|---|---|
| SPIC/SW(gap open = 1.00) | |||
| SPIC/SW(gap open = 0.75) | 0.613 | 0.918 | 0.837 |
| SPIC/SW(gap open = 0.50) | 0.614 | 0.916 | 0.837 |
| SPIC/SW(gap open = 1.50) | 0.605 | 0.916 | 0.835 |
| SPIC/SW(gap open = 0.25) | 0.606 | 0.915 | 0.835 |
| SPIC/SW(ungapped) | 0.610 | 0.916 | 0.836 |
| SPIC/NW(gap open = 1.0) | 0.585 | 0.793 | 0.731 |
| KFV(4-mer, cosine angle) | 0.600 | 0.915 | 0.833 |
| PCC/SWU | 0.600 | 0.887 | 0.813 |
| SSD/SW | 0.560 | 0.859 | 0.781 |
The last three columns report the results for the zinc-finger (ZF) families, the non-ZF families, and all families combined, respectively. The performance figures for SSD/SW and PCC/SWU are quoted from STAMP [11], and those for KFV from [25]. The gap extension penalty is set to half the gap open penalty.
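To make the gap settings in the table concrete, here is a minimal sketch of Smith-Waterman-style local alignment between two motifs, with an optional affine gap penalty whose extension cost is half the gap open cost. The column score `1 - SSD` is a hypothetical stand-in, since the SPIC score is not defined in this record, and charging `gap_open` for the first gapped position is one common convention that may not match the study's implementation.

```python
import math

def column_score(x, y):
    # Hypothetical placeholder score: 1 - SSD, so identical columns score 1.
    return 1.0 - sum((a - b) ** 2 for a, b in zip(x, y))

def local_align_score(a, b, gap_open=None, score=column_score):
    """Best local alignment score between motifs a and b (lists of columns).

    gap_open=None gives the ungapped variant; otherwise gaps are affine,
    with the extension penalty fixed at gap_open / 2.
    """
    n, m = len(a), len(b)
    neg = -math.inf
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    gap_ext = None if gap_open is None else gap_open / 2.0
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if gap_open is not None:
                E[i][j] = max(E[i][j - 1] - gap_ext, H[i][j - 1] - gap_open)
                F[i][j] = max(F[i - 1][j] - gap_ext, H[i - 1][j] - gap_open)
            diag = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            H[i][j] = max(0.0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# Toy one-hot columns in A, C, G, T order.
A, C, G, T = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
ungapped = local_align_score([A, C, G, T], [A, C, T])              # 2.0
gapped = local_align_score([A, C, G, T], [A, C, T], gap_open=0.5)  # 2.5
```

In the toy example, allowing a gap lets the alignment skip the unmatched G column and recover the trailing T match, which is why the gapped score exceeds the ungapped one.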
Figure 1. Evaluation of the three motif similarity metrics using ROC analysis on Dataset-1 and Dataset-2.
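For readers unfamiliar with ROC analysis in this setting, the sketch below computes the area under the ROC curve from two lists of similarity scores. Treating same-family motif pairs as positives and different-family pairs as negatives is an assumption about the evaluation setup, and the function name and example scores are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive pair scores
    higher than a randomly chosen negative pair (ties count as half).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical similarity scores, for illustration only.
same_family = [0.9, 0.3]
different_family = [0.5, 0.1]
auc = roc_auc(same_family, different_family)  # 0.75
```

An AUC of 1.0 means the metric ranks every same-family pair above every different-family pair, while 0.5 corresponds to random ranking.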
Figure 2. Comparison of SPIC with the seven existing methods for separation of relevant motifs from irrelevant ones on Dataset-3.