Chunxiao Sun, Hongwei Huo, Qiang Yu, Haitao Guo, Zhigang Sun.
Abstract
The planted (l, d) motif search (PMS) is a fundamental problem in bioinformatics and plays an important role in locating transcription factor binding sites (TFBSs) in DNA sequences. Identifying weak motifs and reducing the effect of local optima remain important but challenging tasks in motif discovery. To address them, we propose a new algorithm, APMotif, which first applies Affinity Propagation (AP) clustering to DNA sequences to produce informative candidate motifs and then employs Expectation Maximization (EM) refinement to obtain the optimal motifs from those candidates. Experimental results on both simulated and real biological data sets show that APMotif usually outperforms four other widely used algorithms in prediction accuracy.
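As a point of reference for the problem statement, the following is a minimal sketch of the usual formulation of the planted (l, d) motif condition: an l-mer counts as a motif if every input sequence contains some l-mer within Hamming distance d of it. This is not APMotif itself; the function names `hamming`, `best_occurrence`, and `is_planted_motif` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_occurrence(motif: str, seq: str) -> Tuple[int, int]:
    """Return (start, distance) of the l-mer in seq closest to motif.

    Assumes len(seq) >= len(motif); ties go to the leftmost window.
    """
    l = len(motif)
    windows = ((i, hamming(motif, seq[i:i + l])) for i in range(len(seq) - l + 1))
    return min(windows, key=lambda t: t[1])


def is_planted_motif(motif: str, seqs: List[str], d: int) -> bool:
    """True if every sequence holds an l-mer within distance d of motif."""
    return all(best_occurrence(motif, s)[1] <= d for s in seqs)


if __name__ == "__main__":
    seqs = ["TTACGTAA", "GGACGAGG", "CCTCGTCC"]
    print(is_planted_motif("ACGT", seqs, d=1))
    print(is_planted_motif("ACGT", seqs, d=0))
```

The exhaustive window scan is only meant to make the definition concrete; it says nothing about how the candidate motif is found in the first place.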
Year: 2015 PMID: 26347887 PMCID: PMC4547008 DOI: 10.1155/2015/853461
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. AP clustering results.
Algorithm 1. APMotif.
Number of −∞ values for different sequence lengths n on the (15, 4) instance.
| n | Data size^a | Number of −∞ | Percentage |
|---|---|---|---|
| 100 | 93 | 6.81 × 10³ | 78.75% |
| 300 | 308 | 8.01 × 10⁴ | 84.43% |
| 500 | 523 | 2.29 × 10⁵ | 83.89% |
| 600 | 630 | 3.41 × 10⁵ | 85.82% |
| 800 | 846 | 5.84 × 10⁵ | 81.55% |
| 1000 | 1061 | 9.53 × 10⁵ | 84.67% |
^a Data size: the number of l-mers in one cluster.
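A minimal sketch of how such a count could be obtained, assuming the pairwise similarities among the l-mers of a cluster form a square matrix. The rule used here (negative Hamming distance, with −∞ when the distance exceeds 2d) is an assumption made for illustration, as are the names `similarity_matrix` and `count_neg_inf`; the record above does not state the exact similarity definition.

```python
import math
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def similarity_matrix(lmers: List[str], d: int) -> List[List[float]]:
    """Pairwise similarities: -distance, or -inf when distance > 2d (assumed rule)."""
    n = len(lmers)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            dist = hamming(lmers[i], lmers[k])
            sim[i][k] = -float(dist) if dist <= 2 * d else -math.inf
    return sim


def count_neg_inf(sim: List[List[float]]) -> Tuple[int, float]:
    """Return (number of -inf entries, their fraction of all entries)."""
    total = sum(len(row) for row in sim)
    count = sum(1 for row in sim for v in row if v == -math.inf)
    return count, (count / total if total else 0.0)


if __name__ == "__main__":
    sim = similarity_matrix(["AAAA", "AAAT", "TTTT"], d=1)
    print(count_neg_inf(sim))
```

Under this assumed rule, a cluster of m l-mers yields m² entries, so the fraction is the count divided by m².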
Prediction accuracy (nPC) on different (l, d) instances.
| (l, d) | Projection | MEME | VINE | Gibbs sampling | APMotif |
|---|---|---|---|---|---|
| (11, 3) | 92% | 65% | 95% | 56% | 96% |
| (12, 3) | 77% | 84% | 92% | 3% | 93% |
| (15, 4) | 93% | 86% | 98% | 19% | 96% |
| (16, 5) | 64% | 71% | 95% | 2% | 94% |
| (18, 6) | 75% | 79% | 93% | 3% | 98% |
| (19, 7) | 84% | 77% | 92% | 4% | 97% |
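The accuracy values are reported as nPC. Assuming this denotes the nucleotide-level performance coefficient commonly used in motif-finding benchmarks (positions shared by known and predicted sites, divided by positions covered by either), a minimal sketch of the computation could look as follows. The one-site-per-sequence input format and the names `site_positions` and `npc` are hypothetical.

```python
from typing import List, Set, Tuple


def site_positions(starts: List[int], l: int) -> Set[Tuple[int, int]]:
    """(sequence index, nucleotide index) pairs covered by one site per sequence."""
    return {(s, p) for s, start in enumerate(starts) for p in range(start, start + l)}


def npc(known_starts: List[int], predicted_starts: List[int], l: int) -> float:
    """Overlap of known and predicted site positions divided by their union."""
    known = site_positions(known_starts, l)
    predicted = site_positions(predicted_starts, l)
    union = known | predicted
    return len(known & predicted) / len(union) if union else 0.0


if __name__ == "__main__":
    # Two sequences, motif length 5: first site exact, second shifted by 2.
    print(npc([10, 20], [10, 22], l=5))
```

In the usage example, the first sequence contributes 5 shared positions out of 5 and the second 3 shared out of 7, giving 8/12.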
Prediction accuracy (nPC) for different sequence lengths n on the (15, 4) instance.
| n | Projection | MEME | VINE | Gibbs sampling | APMotif |
|---|---|---|---|---|---|
| 100 | 96% | 99% | 99% | 92% | 100% |
| 300 | 94% | 98% | 99% | 58% | 99% |
| 600 | 89% | 91% | 97% | 19% | 98% |
| 800 | 87% | 90% | 98% | 14% | 97% |
| 1000 | 88% | 76% | 91% | 8% | 95% |
Results of APMotif on real biological data.
| Data set | Predicted motif | Reference motif | (l, d) |
|---|---|---|---|
| c-fos | CCATATTAG | CCANATTNG | (9, 2) |
| Preproinsulin | TGCAGCCTCAGCCCC | CAGCCTCAGCCCCAT | (15, 2) |
| Yeast ECB | TTACCCNNTTAGGAAA | TTTCCCNNTNAGGAAA | (16, 3) |
| DHFR | ATTTCGCGCCA | ATTTCGCGCCA | (11, 2) |
| Metallothionein | TCTGCACCCGGCCCG | CTCTGACNCCGCCC | (15, 2) |
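For rows where the predicted and reference motifs have the same length, the two can be compared position by position. The sketch below treats N as the IUPAC "any nucleotide" symbol and counts it as a match; that wildcard convention and the name `mismatches` are assumptions for illustration, and rows whose motifs differ in length or are shifted relative to each other would need an alignment step that is not shown.

```python
def mismatches(predicted: str, reference: str) -> int:
    """Count differing positions, treating 'N' in either string as a wildcard."""
    if len(predicted) != len(reference):
        raise ValueError("motifs must have equal length for position-wise comparison")
    return sum(
        1
        for p, r in zip(predicted.upper(), reference.upper())
        if p != r and p != "N" and r != "N"
    )


if __name__ == "__main__":
    rows = {
        "c-fos": ("CCATATTAG", "CCANATTNG"),
        "Yeast ECB": ("TTACCCNNTTAGGAAA", "TTTCCCNNTNAGGAAA"),
        "DHFR": ("ATTTCGCGCCA", "ATTTCGCGCCA"),
    }
    for name, (pred, ref) in rows.items():
        print(name, mismatches(pred, ref))
```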
Figure 2. Sequence logos of the predicted motifs.
Figure 3. Prediction accuracy on real biological data.
Figure 4. Prediction accuracy on Tompa data.