Tian Mi, Sanguthevar Rajasekaran.
Abstract
BACKGROUND: Motifs are significant patterns in DNA, RNA, and protein sequences that play an important role in biological processes and functions such as the identification of open reading frames, RNA transcription, and protein binding. Several versions of the motif search problem have been studied in the literature. One such version is called Planted Motif Search (PMS), or (l, d)-motif search. PMS is known to be NP-complete. The time complexities of most planted motif search algorithms depend exponentially on the alphabet size. Recently, a new version of the motif search problem was introduced by Kuksa and Pavlovic. We call this version the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l) with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible since the stems form a superset of the motifs.
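The relationship between a motif stem and the (l, d)-motifs it covers can be illustrated with a small brute-force sketch. It is not the MSS algorithm from the paper. It assumes the standard definition of an (l, d)-motif, which the abstract does not spell out: an l-mer is an (l, d)-motif if every input sequence contains some l-mer within Hamming distance d of it. The function names and the `*` wildcard symbol are hypothetical choices made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_motif(candidate, sequences, d):
    """Assumed (l, d)-motif test: every sequence must contain an l-mer
    within Hamming distance d of the candidate."""
    l = len(candidate)
    return all(
        any(hamming(candidate, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in sequences
    )


def expand_stem(stem, alphabet, wildcard="*"):
    """List every wildcard-free l-mer represented by a stem.
    The wildcard symbol '*' is an illustrative assumption."""
    choices = [alphabet if c == wildcard else c for c in stem]
    return ["".join(p) for p in product(*choices)]


def motifs_in_stem(stem, sequences, d, alphabet):
    """Filter the l-mers of a stem down to those that are (l, d)-motifs."""
    return [m for m in expand_stem(stem, alphabet) if is_motif(m, sequences, d)]
```

A stem with w wildcards expands to |Σ|^w l-mers, which is why a smaller set of stems with fewer spurious l-mers is preferable: each stem still has to be filtered down to the true motifs it contains.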
Year: 2013 PMID: 23679045 PMCID: PMC3679804 DOI: 10.1186/1471-2105-14-161
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Numbers of wildcards
Figure 1. Illustration of speeding up the 2-neighbors computation. A: l-mer alignments. B: computation order in the matrix.
Example values for alphabet sizes |Σ| = 4 and |Σ| = 20
| (l, d) | alphabet size 4 | alphabet size 20 |
| (7,1) | 1.29×10⁻² | 6.03×10⁻⁶ |
| (9,2) | 4.89×10⁻² | 3.32×10⁻⁵ |
| (11,3) | 1.15×10⁻¹ | 1.11×10⁻⁴ |
| (13,4) | 6.38×10⁻² | 8.88×10⁻⁵ |
| (15,5) | 4.27×10⁻⁴ | 8.22×10⁻⁷ |
| (17,6) | 5.82×10⁻¹⁰ | 2.76×10⁻²⁰ |
| (19,7) | 3.64×10⁻¹² | 1.91×10⁻²⁵ |
| (21,8) | 1.43×10⁻³ | 1.21×10⁻⁵ |
Time comparison of MSS, RISOTTO, and stemming algorithms
| (7,1) | 23.2 | 3.7 | 0.03 | 4.7 | 48.0 |
| (9,2) | 24.64 | 3.7 | 0.3 | 242.3 | 359.9 |
| (11,3) | 27.5 | 3.7 | 2.0 | 7166.1 | 4674.2 |
| (13,4) | 34.5 | 3.9 | 50.4 | - | - |
| (15,5) | 38.8 | 4.7 | 74.5 | - | - |
| (17,6) | 60.2 | 15.3 | 1459.0 | - | - |
| (19,7) | 200.8 | 117.3 | 18364.1 | - | - |
| (21,8) | 719.6 | 563.2 | 69340.8 | - | - |
Number of stems generated by MSS and stemming algorithms
| (7,1) | 2 | 1 | 928 |
| (9,2) | 22 | 2 | 17894 |
| (11,3) | 130 | 44 | 265587 |
| (13,4) | 2250 | 1452 | - |
| (15,5) | 5222 | 1032 | - |
| (17,6) | 60168 | 23829 | - |
| (19,7) | 521658 | 422019 | - |
| (21,8) | 2255690 | 1297576 | - |
Comparison of MSS, RISOTTO, and stemming algorithms on challenging instances
| (7,3) | 225.9 | 615.7 | 7068.6 | > 4 hours |
| (9,4) | 1051.0 | 1477.4 | > 4 hours | > 4 hours |
| (11,5) | 5129.4 | 5503.0 | > 4 hours | > 4 hours |
Statistics on different alphabet sizes
| 40 | 25.1/190 | 3.6/190 | 2.4/45 | 2125.5/16669665 |
| 60 | 26.2/400 | 3.6/400 | 6.9/169 | 3023.4/18465345 |
| 80 | 23.6/50 | 3.6/50 | 0.4/4 | 3493.0/11380993 |
| 100 | 27.1/260 | 3.6/260 | 5.6/216 | 4464.9/17733385 |
Motif search on protein data
| CPTINEPCC | 7 | (9,2) | 2.0 | 100.0 | 244.0 |
| CRFYNCHHLHEPGC | 10 | (14,4) | 22.2 | > 4 hours | > 4 hours |
| HTHPTQTAFLSSVD | 8 | (14,4) | 10.3 | > 4 hours | > 4 hours |
| ILPPVPVPK | 14 | (9,2) | 3.8 | 105.8 | 582.0 |
| PEPNGYLHIGH | 134 | (11,3) | 51.1 | 12827.0 | > 4 hours |
| PSPTGFIHLGN | 36 | (11,3) | 6.5 | 4336.6 | 4561.0 |
| PTVYNYAHIGN | 19 | (11,3) | 3.6 | 3358.9 | 4917.0 |
| PYANGSIHLGH | 110 | (11,3) | 52.1 | 11363.2 | > 4 hours |
| PYPSGQGLHVGH | 18 | (12,3) | 10.4 | > 4 hours | > 4 hours |
| QELFKRISEQFTAMF | 9 | (15,4) | 47.6 | > 4 hours | > 4 hours |
| QIKTLNNKFASFIDK | 9 | (15,4) | 20.3 | > 4 hours | > 4 hours |
| SGYSSPGSPGTPGSR | 9 | (15,4) | 32.6 | > 4 hours | > 4 hours |
| SSSSLEKSYELPDGQ | 10 | (15,4) | 41.3 | > 4 hours | > 4 hours |
| VTVYDYCHLGH | 8 | (11,3) | 2.9 | 3145.8 | 2235.0 |
MSS2 vs. PMSPrune on DNA data
| (7,1) | 4.1 | 3.3 |
| (9,2) | 10.7 | 3.4 |
| (11,3) | 87.2 | 8.1 |