Yipu Zhang, Ping Wang, Maode Yan.
Abstract
The motif discovery problem is crucial for understanding the structure and function of gene expression. Over the past decades, many approaches based on consensus and probabilistic training models have been applied successfully to motif finding. However, most existing motif discovery algorithms are still time-consuming or easily trapped in a local optimum. To overcome these shortcomings, we propose an entropy-based position projection algorithm, called EPP, which uses a projection process to divide the dataset and explore the best local optimal solution. Experimental results on real DNA sequences, Tompa data, and ChIP-seq data show that EPP is advantageous in dealing with the motif discovery problem and outperforms current widely used algorithms.
Year: 2016 PMID: 27882329 PMCID: PMC5110948 DOI: 10.1155/2016/9127474
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The process of calculating the PFM. (a) The input sequences. (b) The aligned substrings. (c) The count matrix. (d) The position frequency matrix.
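The steps named in the Figure 1 caption (aligned substrings, count matrix, position frequency matrix) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the four example substrings are hypothetical, and the sketch assumes equal-length substrings over the alphabet A, C, G, T.

```python
ALPHABET = "ACGT"


def count_matrix(aligned):
    """Count each nucleotide at every column of the aligned substrings."""
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("all aligned substrings must have the same length")
    counts = {base: [0] * width for base in ALPHABET}
    for s in aligned:
        for pos, base in enumerate(s.upper()):
            counts[base][pos] += 1  # KeyError on symbols outside ACGT
    return counts


def position_frequency_matrix(aligned):
    """Divide each count by the number of substrings to get frequencies."""
    n = len(aligned)
    return {base: [c / n for c in row]
            for base, row in count_matrix(aligned).items()}


# Hypothetical aligned substrings, one per input sequence.
aligned = ["ACGT", "ACGA", "TCGT", "ACTT"]
counts = count_matrix(aligned)
pfm = position_frequency_matrix(aligned)

for base in ALPHABET:
    print(base, counts[base], pfm[base])
```

Each column of the resulting matrix sums to 1, because every substring contributes exactly one nucleotide per position.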
Figure 2. The process of cluster projection.
Algorithm 1
The information of six DNA datasets.
| Datasets | No. of sequences | Sequence length | Motif width | No. of sites | Sites per sequence |
|---|---|---|---|---|---|
| CREB | 17 | 200 | 8 | 19 | 1.12 |
| CRP | 18 | 105 | 18 | 23 | 1.28 |
| MEF2 | 17 | 200 | 10 | 17 | 1 |
| MYOD | 17 | 200 | 6 | 21 | 1.23 |
| SRF | 20 | 200 | 10 | 36 | 1.8 |
| TBP | 95 | 200 | 7 | 95 | 1 |
The comparison of MEME, GAME, VINE, APMotif, and EPP on six DNA datasets (P: precision, R: recall, F: F score).
| Datasets | MEME P | MEME R | MEME F | GAME P | GAME R | GAME F | VINE P | VINE R | VINE F | APMotif P | APMotif R | APMotif F | EPP P | EPP R | EPP F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CREB | | 0.68 | 0.78 | 0.68 | 0.79 | 0.73 | 0.72 | 0.80 | 0.76 | 0.70 | 0.84 | 0.76 | 0.74 | | |
| CRP | 0.89 | 0.67 | 0.76 | 0.79 | 0.78 | 0.78 | | 0.70 | 0.80 | 0.86 | 0.72 | 0.78 | 0.83 | | |
| MEF2 | | 0.82 | 0.88 | 0.82 | 0.80 | 0.81 | 0.88 | 0.88 | 0.88 | 0.84 | | | 0.84 | | |
| MYOD | 0.60 | 0.28 | 0.38 | 0.48 | 0.48 | 0.48 | 0.47 | | 0.61 | 0.60 | 0.52 | 0.56 | | 0.68 | |
| SRF | 0.74 | 0.89 | 0.81 | 0.70 | 0.92 | 0.80 | 0.92 | 0.94 | 0.93 | 0.88 | 0.90 | 0.89 | | | |
| TBP | | 0.69 | 0.76 | 0.78 | 0.77 | 0.77 | 0.74 | | 0.80 | 0.72 | 0.80 | 0.76 | 0.82 | 0.81 | |
| Average | 0.82 | 0.56 | 0.73 | 0.72 | 0.77 | 0.74 | 0.78 | 0.84 | 0.81 | 0.77 | 0.79 | 0.78 | | | |
Figure 3. The accuracy comparison of MEME, GAME, VINE, APMotif, and EPP. (a) Precision comparison. (b) Recall comparison. (c) F score comparison.
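The three accuracy measures are related. As a hedged sketch, assuming the F score here is the standard harmonic mean of precision and recall (the record itself does not give the formula), the values can be cross-checked in Python. The function name `f_score` is hypothetical, and the example takes the first three values of the SRF row (0.74, 0.89, 0.81) as MEME's precision, recall, and F score.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both measures are zero
    return 2 * precision * recall / (precision + recall)


# First three values of the SRF row, read as MEME precision and recall.
srf_meme = f_score(0.74, 0.89)
print(round(srf_meme, 2))
```

Under that assumption the computed value rounds to the 0.81 reported in the table, which supports reading each algorithm's three columns as precision, recall, and F score in that order.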
The number of subsets and l-mers in EPP.
| Datasets | Total | [min_size, max_size] | The number of | The number of | The | Reducing amount |
|---|---|---|---|---|---|---|
| CREB | 3294 | [15,19] | 66 | 4 | 65 | 98% |
| CRP | 1584 | [16,24] | 31 | 5 | 104 | 98% |
| MEF2 | 3247 | [9,17] | 176 | 33 | 335 | 90% |
| MYOD | 3315 | [17,23] | 55 | 6 | 111 | 97% |
| SRF | 3820 | [20,30] | 73 | 13 | 310 | 92% |
| TBP | 18430 | [80,95] | 32 | 2 | 175 | 99% |
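The "Total" column appears to count every candidate l-mer in a dataset. A minimal sketch, assuming each of N sequences of length L contributes L - W + 1 overlapping windows of width W, and assuming the CRP row of the dataset table lists 18 sequences of length 105 with a motif width of 18; the function name `total_lmers` is hypothetical.

```python
def total_lmers(num_sequences, sequence_length, motif_width):
    """Number of overlapping windows of the given width across all sequences."""
    if motif_width > sequence_length:
        return 0  # no window fits
    return num_sequences * (sequence_length - motif_width + 1)


# CRP: assumed 18 sequences of length 105 and motif width 18.
crp_total = total_lmers(18, 105, 18)
# TBP: assumed 95 sequences of length 200 and motif width 7.
tbp_total = total_lmers(95, 200, 7)
print(crp_total, tbp_total)
```

Both results agree with the "Total" values listed for CRP and TBP. The CREB total (3294) does not follow from this formula with a uniform length of 200, so sequence lengths in that dataset may vary.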
The computational time comparison.
| Datasets | MEME | GAME | VINE | APMotif | EPP |
|---|---|---|---|---|---|
| CREB | 1.52 | 134.00 | 4.82 | 71.23 | 17.52 |
| CRP | 0.60 | 391.04 | 2.61 | 97.04 | 8.91 |
| MEF2 | 2.01 | 113.25 | 7.37 | 135.83 | 21.91 |
| MYOD | 2.25 | 96.08 | 8.25 | 68.36 | 30.27 |
| SRF | 2.12 | 223.56 | 10.11 | 147.29 | 28.28 |
| TBP | 39.05 | 786.32 | 55.53 | 280.43 | 10.83 |
Figure 4. Results of EPP and MEME on Tompa datasets.
The performance coefficient of MEME, VINE, and EPP on the synthetic datasets.
| Width | Con | MEME | VINE | EPP |
|---|---|---|---|---|
| Short | Low | 0.32 | 0.24 | 0.32 |
| Middle | Low | 0.88 | 0.72 | 0.90 |
| Long | Low | 0.98 | 0.88 | 0.98 |
| Short | High | 0.91 | 0.96 | 0.98 |
| Middle | High | 0.98 | 0.99 | 0.99 |
| Long | High | 1 | 1 | 1 |
Results of the mouse embryonic stem cell data.
| Datasets | Length | Seq. # | EPP | Weeder |
|---|---|---|---|---|
| | 11 | 39601 | | TNG |
| | 9 | 3422 | | C |
| | 11 | 21644 | | |
| | 10 | 10872 | | |
| | 7 | 10342 | | |
| | 10 | 7181 | | |
| | 16 | 1126 | | |
| | 15 | 3775 | | |
| | 9 | 2546 | | T |
| | 10 | 4525 | | CA |
| | 11 | 26907 | | |
| | 10 | 10336 | | CG |