Abstract
BACKGROUND: Transcription factor binding site (TFBS) motifs can be accurately represented by position frequency matrices (PFMs) or other equivalent forms. TFBS motifs often need to be compared through their PFMs, either to search a motif database for similar motifs or to cluster motifs according to their binding preference. Most current methods for motif comparison combine a similarity metric for column-to-column comparison with a method for finding the optimal positional alignment between the two compared motifs. In some applications, alignment-free methods might be preferred; however, few such methods with high accuracy have been described.
Year: 2010 PMID: 20098703 PMCID: PMC2808352 DOI: 10.1371/journal.pone.0008797
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Three datasets used in this study for testing and evaluation.
| Dataset | Dataset-1 | Dataset-2 | Dataset-3 |
| Number of PFMs | 96 | 355 | 496 |
| Average length | 10.39 | 12.14 | 10.6 |
| Min length | 4 | 4 | 4 |
| Max length | 30 | 29 | 22 |
| Number of Classes | 13 | 6 | - |
| PFM sources | JASPAR | TRANSFAC | JASPAR |
| Dataset source | Mahony, et al., 2007 | Narlikar and Hartemink, 2006 | This study |
Performance of the KFV algorithm on the three datasets measured as the accuracy from the “best hit” test.
| Dataset-1 | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 |
| - Euclidean | 0.385 | 0.708 | 0.760 | 0.792 | 0.674 |
| - PCC | 0.250 | 0.677 | 0.802 | 0.823 | 0.768 |
| - cosine angle | 0.375 | 0.719 | 0.823 | | 0.768 |
| - KL-based | 0.427 | 0.646 | 0.729 | 0.667 | 0.568 |
| Dataset-2 | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 |
| - Euclidean | 0.531 | 0.789 | 0.854 | 0.868 | 0.859 |
| - PCC | 0.251 | 0.803 | 0.882 | 0.899 | 0.898 |
| - cosine angle | 0.475 | 0.800 | 0.873 | | 0.893 |
| - KL-based | 0.562 | 0.777 | 0.811 | 0.823 | 0.805 |
| Dataset-3 | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 |
| - Euclidean | 0.946 | | 0.984 | 0.986 | 0.986 |
| - PCC | 0.323 | 0.978 | 0.980 | 0.984 | 0.984 |
| - cosine angle | 0.760 | 0.982 | 0.982 | 0.984 | 0.986 |
| - KL-based | 0.921 | 0.974 | 0.972 | 0.964 | 0.953 |
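The "best hit" test is not spelled out in this excerpt. One common reading is a leave-one-out nearest-neighbour check: each motif is compared with every other motif, and the hit counts as correct when the closest motif belongs to the same class. The sketch below follows that reading for the four distance families named in the table. All function names are hypothetical, and the symmetrised form of the KL-based distance and its small `eps` smoothing term are assumptions rather than the paper's stated definition.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cos(angle); returns 1.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def pcc_distance(u, v):
    """1 - Pearson correlation, computed as the cosine distance of centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine_distance([a - mu for a in u], [b - mv for b in v])

def kl_distance(u, v, eps=1e-9):
    """Symmetrised Kullback-Leibler divergence (assumed form), smoothed by eps."""
    p = [a + eps for a in u]
    q = [b + eps for b in v]
    sp, sq = sum(p), sum(q)
    p = [a / sp for a in p]
    q = [b / sq for b in q]
    return 0.5 * sum(a * math.log(a / b) + b * math.log(b / a)
                     for a, b in zip(p, q))

def best_hit_accuracy(vectors, labels, dist):
    """Fraction of items whose nearest other item carries the same label."""
    correct = 0
    for i, vi in enumerate(vectors):
        nearest = min((j for j in range(len(vectors)) if j != i),
                      key=lambda j: dist(vi, vectors[j]))
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(vectors)
```

With frequency vectors for a labelled set of PFMs, `best_hit_accuracy(vectors, labels, cosine_distance)` would give one cell of a table like the one above; the toy data needed to try it can be any list of equal-length non-negative vectors.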
Comparison of the KFV algorithm with other methods for motif retrieval using Dataset-1.
| | Accuracy | | |
| Method | Non-ZF PFMs (71) | ZF PFMs (25) | Total (96) |
| KFV (k = 4, cosine) | | | |
| STAMP (PCC) | 0.887 | | 0.813 |
| STAMP (SSD) | 0.859 | 0.560 | 0.781 |
| STAMP (AKL) | 0.831 | 0.520 | 0.750 |
| STAMP (ALLR-LL) | 0.859 | 0.400 | 0.740 |
| STAMP (pCS) | 0.761 | 0.560 | 0.708 |
| STAMP (ALLR) | 0.775 | 0.400 | 0.677 |
| MoSta (Smax) | | 0.440 | 0.792 |
| MoSta (Ssum) | 0.817 | 0.560 | 0.750 |
The results are shown separately for the zinc-finger (ZF) and non-zinc-finger (non-ZF) families. The values in bold indicate the highest accuracy achieved for each category. In parentheses beside each method are the primary parameter settings (column comparison metric for STAMP or similarity measure score for MoSta). The accuracies for STAMP with different column comparison metrics were taken from [8], in which the evaluation was performed on the same dataset using the optimal alignment strategies and gap scores. For MoSta, a GC content of 0.5 and the balanced threshold were used.
Comparison of the KFV algorithm with other methods for motif retrieval using Dataset-2.
| | Accuracy | | | |
| Structural Class | KFV | STAMP | MoSta (Smax/Ssum) | Bayesian Learning |
| bZIP (93) | 0.92 | | 0.90/ | 0.92 |
| C2H2 (74) | | 0.76 | 0.76/0.72 | 0.77 |
| C4 (52) | | | | 0.91 |
| Homeo (50) | 0.88 | 0.82 | 0.82/ | 0.85 |
| Forkhead (49) | | 0.9 | | 0.83 |
| bHLH (37) | 0.89 | 0.81 | | 0.88 |
| | | | | |
The number in parentheses is the number of PFMs within that TF structural class. The accuracies for STAMP and Bayesian Learning were taken from Mahony et al. [8]. The accuracy for STAMP was evaluated using ungapped Smith-Waterman alignment and the PCC metric for column comparison. The accuracy for KFV was evaluated with k = 4 and the cosine angle distance. The values in bold font indicate the highest accuracy achieved for each structural class.
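The construction of the k-mer frequency vector itself is not described in this excerpt. One plausible construction, offered only as an assumption, is to normalise each PFM column to probabilities, slide a window of width k along the motif, and average the probability of every possible k-mer over all windows. The sketch below does that; `kmer_frequency_vector` and `normalize_columns` are hypothetical names, and reverse complements and motifs shorter than k (both relevant to the datasets, whose minimum length is 4) are deliberately left unhandled.

```python
from itertools import product

ALPHABET = "ACGT"

def normalize_columns(pfm):
    """Turn each column of counts [A, C, G, T] into probabilities."""
    cols = []
    for column in pfm:
        total = float(sum(column))
        if total <= 0:
            raise ValueError("empty PFM column")
        cols.append([count / total for count in column])
    return cols

def kmer_frequency_vector(pfm, k):
    """Average k-mer probability over all length-k windows (assumed definition).

    Returns a list of 4**k values ordered lexicographically by k-mer
    (AAAA, AAAC, ...). The values sum to 1.
    """
    cols = normalize_columns(pfm)
    n_windows = len(cols) - k + 1
    if n_windows < 1:
        raise ValueError("motif is shorter than k; not handled in this sketch")
    vector = []
    for kmer in product(range(len(ALPHABET)), repeat=k):
        total = 0.0
        for start in range(n_windows):
            prob = 1.0
            for offset, base_index in enumerate(kmer):
                prob *= cols[start + offset][base_index]
            total += prob
        vector.append(total / n_windows)
    return vector
```

Under this definition, two motifs of different lengths map to vectors of the same dimension (256 for k = 4), which is what allows a distance such as the cosine angle to be computed without aligning the motifs.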
Figure 1. Evaluation of three motif comparison algorithms using ROC curves.
The ROC curves were plotted based on the three datasets in Table 1.
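How the ROC curves were derived is not detailed in this excerpt. A minimal sketch, assuming that every motif pair is scored by its distance and that pairs from the same class count as positives, is given below. The function names are hypothetical, and the positive/negative labelling is an assumption.

```python
def roc_points(distances, is_positive):
    """ROC points obtained by sweeping a threshold over pairwise distances.

    A pair is predicted positive when its distance is at or below the
    threshold, so smaller distances are admitted first. Tied distances
    are admitted together.
    """
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(is_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    pairs = sorted(zip(distances, is_positive))
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))  # (false positive rate, true positive rate)
        i = j
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An area of 1.0 means every same-class pair is closer than every different-class pair, while 0.5 corresponds to a ranking no better than chance.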
Figure 2. The motif tree of the 71 non-ZF PFMs in Dataset-1.
The tree was constructed using the UPGMA algorithm based on the pairwise distances calculated by our KFV method with k = 4 and cosine angle metric. The vertical dashed blue line represents the level at which the CH metric estimates the optimal number of clusters. The 71 PFMs were grouped into 16 groups as indicated by the dashed line.
Figure 3. The CH plot.
The log modified Calinski and Harabasz metric (CH) at different hierarchical levels (from 2 to 70 clusters) for the motif tree in Figure 2. The maximal CH value was reached when the number of clusters (c) was 16, suggesting an optimal number of 16 groups of PFMs for the 71 non-ZF PFMs in Dataset-1.
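The procedure behind Figures 2 and 3 can be sketched as average-linkage (UPGMA) clustering on a pairwise distance matrix, followed by a scan over the number of clusters for the highest CH value. The code below is an illustration under two assumptions: it uses the standard Calinski-Harabasz index expressed through squared pairwise distances, not the log modified variant the caption refers to, and all function names are hypothetical.

```python
def upgma_partition(dist, n_clusters):
    """Merge the two clusters with the smallest average pairwise distance
    until n_clusters remain. dist is a symmetric square matrix (list of lists)."""
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def ch_index(dist, clusters):
    """Standard CH index, with sums of squares written as
    (1 / size) * sum of squared pairwise distances inside a group."""
    n, c = len(dist), len(clusters)
    if not 2 <= c < n:
        raise ValueError("need 2 <= number of clusters < number of items")

    def scatter(members):
        total = 0.0
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                total += dist[i][j] ** 2
        return total / len(members)

    within = sum(scatter(members) for members in clusters)
    between = scatter(list(range(n))) - within
    if within == 0.0:
        return float("inf")
    return (between / (c - 1)) / (within / (n - c))

def optimal_cluster_count(dist):
    """Return (c, CH) for the cluster count with the highest CH value."""
    best = None
    for c in range(2, len(dist)):
        score = ch_index(dist, upgma_partition(dist, c))
        if best is None or score > best[1]:
            best = (c, score)
    return best
```

Because UPGMA is hierarchical, stopping the merging at c clusters gives the same partition as cutting the finished tree at the corresponding level, which is why the scan can reuse `upgma_partition` for each candidate c. For 71 items this brute-force version is slow but still practical; a library implementation would be preferable for larger sets.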