Hieu Dinh, Sanguthevar Rajasekaran, Jaime Davila.
Abstract
Detection of rare events in a set of DNA/protein sequences could lead to new biological discoveries. One kind of rare event is the presence of patterns called motifs in DNA/protein sequences. Finding motifs is a challenging problem, since the general version of motif search has been proven to be intractable. Motif discovery is an important problem in biology. For example, it is useful in the detection of transcription factor binding sites and transcriptional regulatory elements that are crucial in understanding gene function, human disease, drug design, etc. Many versions of the motif search problem have been proposed in the literature. One such version is the (ℓ, d)-motif search, also called Planted Motif Search (PMS). A generalized version of the PMS problem, namely Quorum Planted Motif Search (qPMS), has been shown to accurately model motifs in real data. However, solving the qPMS problem is extremely difficult because a special case of it, the PMS problem, is already NP-hard, which means that any algorithm solving it can be expected to take exponential time in the worst case. In this paper, we propose a novel algorithm named qPMS7 that tackles the qPMS problem on real data as well as on challenging instances. Experimental results show that our Algorithm qPMS7 is on average 5 times faster than the state-of-the-art algorithm. The executable program of Algorithm qPMS7 is freely available on the web at http://pms.engr.uconn.edu/downloads/qPMS7.zip. Our online motif discovery tools that use Algorithm qPMS7 are freely available at http://pms.engr.uconn.edu or http://motifsearch.com.
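As an illustration of the problem statement only, the usual definition of an (ℓ, d)-motif with a quorum can be written as a brute-force check: a candidate string of length ℓ counts as a motif if it occurs, with at most d mismatches, in at least a required number of the input sequences. The sketch below is not Algorithm qPMS7 and makes no attempt at its pruning; the function names, and the choice to express the quorum as a minimum count of sequences (`quorum`), are assumptions made for this example.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate: str, seq: str, d: int) -> bool:
    """True if some substring of seq is within Hamming distance d of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def brute_force_qpms(seqs, l, d, quorum, alphabet="ACGT"):
    """Enumerate every length-l string over the alphabet and keep those that
    occur (with at most d mismatches) in at least `quorum` sequences.

    Setting quorum == len(seqs) gives the plain PMS special case.
    The running time grows as len(alphabet) ** l, so this is only usable
    for very small l.
    """
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        hits = sum(occurs_within(candidate, s, d) for s in seqs)
        if hits >= quorum:
            motifs.append(candidate)
    return motifs


if __name__ == "__main__":
    toy = ["ACGTACGT", "TTACGAAA", "CCCCACGG"]  # hypothetical toy input
    print(brute_force_qpms(toy, l=4, d=1, quorum=3))
```

The exhaustive enumeration makes the exponential cost mentioned in the abstract concrete: every increment of ℓ multiplies the number of candidates by the alphabet size.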
Year: 2012 PMID: 22848493 PMCID: PMC3404135 DOI: 10.1371/journal.pone.0041425
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Traverse the tree in qPMSPrune.
with alphabet . The value of at each node is the location of its shaded letter. For example, at node , at node .
Time comparison of different algorithms on the challenging instances of DNA sequences for the special case - PMS Problem.
| Algorithm | (13,4) | (15,5) | (17,6) | (19,7) | (21,8) | (23,9) |
| qPMS7 | 47 s | 2.6 m | 11 m | 0.9 h | 4.3 h | 24 h |
| PMS6 | 67 s | 3.2 m | 14 m | 1.16 h | 5.8 h | – |
| PMS5 | 117 s | 4.8 m | 21.7 m | 1.7 h | 9.7 h | 54 h |
| qPMSPruneI | 17 s | 2.6 m | 22.6 m | 3.4 h | 29 h | – |
| Pampa | 35 s | 6 m | 40 m | 4.8 h | – | – |
| qPMSPrune | 45 s | 10.2 m | 78.7 m | 15.2 h | – | – |
| Voting | 104 s | 21.6 m | – | – | – | – |
| RISOTTO | 772 s | 106 m | – | – | – | – |
The alphabet , , , and .
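Because the table mixes seconds, minutes, and hours, per-instance ratios are easier to read after converting every cell to seconds. The following is a minimal sketch that does this for the qPMS7 and PMS5 rows above; the helper names are assumptions made for this example, and it does not attempt to reproduce the averaged speedup quoted in the abstract, whose baseline is not identified in this record.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(cell: str) -> float:
    """Convert a table cell such as '2.6 m' or '0.9 h' to seconds."""
    value, unit = cell.split()
    return float(value) * UNIT_SECONDS[unit]


def speedups(baseline, candidate):
    """Per-instance ratio baseline_time / candidate_time."""
    return [to_seconds(b) / to_seconds(c) for b, c in zip(baseline, candidate)]


# Rows copied from the table above, in column order (13,4) ... (23,9).
INSTANCES = ["(13,4)", "(15,5)", "(17,6)", "(19,7)", "(21,8)", "(23,9)"]
QPMS7 = ["47 s", "2.6 m", "11 m", "0.9 h", "4.3 h", "24 h"]
PMS5 = ["117 s", "4.8 m", "21.7 m", "1.7 h", "9.7 h", "54 h"]

if __name__ == "__main__":
    for instance, ratio in zip(INSTANCES, speedups(PMS5, QPMS7)):
        print(f"{instance}: PMS5 / qPMS7 = {ratio:.2f}")
```

Rows containing a dash (no reported time) would need to be skipped before calling `to_seconds`.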
Time comparison of different algorithms on the challenging instances of DNA sequences for the generalized case - qPMS Problem.
| Algorithm | (13,3) | (15,4) | (17,5) | (19,6) | (21,7) |
| qPMS7 | 34 s | 2.4 m | 16 m | 1.8 h | 11.6 h |
| qPMSPruneI | 14 s | 2 m | 21 m | 3.9 h | – |
| qPMSPrune | 32 s | 9 m | 2.6 h | – | – |
The alphabet , , , and .
Figure 2. Traverse the graph in qPMS7.
Visited nodes in in a depth-first manner when the starting node is . In this example, . The value of at each node is the location of its shaded letter. For example, at node .
Time comparison of different algorithms on the challenging instances of protein sequences for the special case - PMS Problem.
| Algorithm | (11,5) | (13,6) | (15,7) | (17,8) | (19,9) |
| qPMS7 | 1 m | 1.4 m | 1.9 m | 6.8 m | 7.5 m |
| qPMSPruneI | 4.5 m | 21 m | 2.4 m | 17 h | – |
| qPMSPrune | 12 m | 104 m | 16 h | – | – |
The alphabet size , , , and .
Time comparison of different algorithms on the challenging instances of protein sequences for the generalized case - qPMS Problem.
| Algorithm | (11,4) | (13,5) | (15,6) | (17,7) | (19,8) |
| qPMS7 | 27 s | 3 m | 18 m | 3.8 h | 11 h |
| qPMSPruneI | 62 s | 16 m | 3.7 h | – | – |
| qPMSPrune | 181 s | 113 m | 29 h | – | – |
The alphabet size , , , and .
Results on real datasets.
| Data | Predicted Motifs | Known Motifs |
| 1 | | (10,2) |
| 2 | ATTTCnnGCCA | (13,2) |
| 3 | | (16,3) |
| 4 | TTTCCCnnTnAGGAAA | (16,3) |
Data 1: Preproinsulin; Data 2: DHFR; Data 3: c-fos; Data 4: Yeast ECB. Parameter is set to .
Results on real datasets for transcription factor-binding sites discovery.
| Data | Predicted Motifs | Matched Binding Sites |
| mus05r | | |
| mus07r | | |
| mus11r | | |
| hm03r | | |
| hm08r | | |
| hm19r | | |
| hm26r | | |
The datasets are from mouse (resp. human) if their names start with “mus” (resp. “hm”).