Mostafa M Abbas, Mohamed Abouelhoda, Hazem M Bahig.
Abstract
BACKGROUND: Given a set of DNA sequences s1, ..., st, the (l, d) motif problem is to find an l-length motif sequence M, not necessarily occurring in any of the input sequences, such that each sequence si, 1 ≤ i ≤ t, contains at least one subsequence that differs from M in at most d positions. Many exact algorithms have been developed to solve the motif finding problem over the last three decades. However, the problem remains challenging, and its solution is still limited to small values of l and d.
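The definition above can be made concrete with a small brute-force sketch. It is not PMSprune or HEP_PMSprune; it simply enumerates every candidate l-mer and keeps those that occur in each input sequence with at most d mismatches. The function names are hypothetical, and the sketch assumes that "subsequence" means a contiguous window of length l and that the alphabet is A, C, G, T.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, seq: str, d: int) -> bool:
    """True if some length-l window of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def brute_force_motifs(seqs, l: int, d: int, alphabet: str = "ACGT"):
    """Enumerate all 4^l candidates; only feasible for very small l."""
    candidates = ("".join(c) for c in product(alphabet, repeat=l))
    return [m for m in candidates
            if all(occurs_within(m, s, d) for s in seqs)]


if __name__ == "__main__":
    # Toy input, not taken from the paper.
    print(brute_force_motifs(["AACGT", "CGTTT", "GGCGT"], l=3, d=0))
```

The exponential number of candidates is what makes the problem hard for larger l and d, which is why pruning-based exact algorithms are needed.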
Year: 2012 PMID: 23281969 PMCID: PMC3521218 DOI: 10.1186/1471-2105-13-S17-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Time Comparison of PMSprune and HEP_PMSprune(mns) with the Challenging Instances
| l | d | PMSprune | mns | HEP_PMSprune(mns) | Improvement |
|---|---|---|---|---|---|
| 11 | 3 | 1.92 s | 9 | 1.4 s | 27.1 % |
| 13 | 4 | 33.95 s | 7 | 26.05 s | 23.27 % |
| 15 | 5 | 7.7 m | 6 | 6.4 m | 16.8 % |
| 17 | 6 | 1.55 h | 7 | 1.26 h | 18.5 % |
| 19 | 7 | 18.62 h | 6 | 14.93 h | 19.8 % |
| 21 | 8 | 8.59 dy | 6 | 6.68 dy | 22.23 % |
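The percentages in the rows above are consistent with the relative reduction in running time, (T_PMSprune - T_HEP) / T_PMSprune. The following sketch recomputes two of them; the formula is an assumption inferred from the reported numbers, not one stated explicitly here, and the function name is hypothetical. Both times must be given in the same unit.

```python
def improvement_percent(t_base: float, t_new: float) -> float:
    """Relative running-time reduction of t_new over t_base, in percent.

    Assumed formula: 100 * (t_base - t_new) / t_base.
    """
    return 100.0 * (t_base - t_new) / t_base


if __name__ == "__main__":
    # (l, d, PMSprune time, HEP_PMSprune(mns) time), both times in seconds.
    rows = [(11, 3, 1.92, 1.4), (13, 4, 33.95, 26.05)]
    for l, d, t_pms, t_hep in rows:
        print(f"({l}, {d}): {improvement_percent(t_pms, t_hep):.2f} %")
```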
Time Comparison of PMSprune and HEP_PMSprune(ons) with the Challenging Instances
| l | d | PMSprune | ons | HEP_PMSprune(ons) | Improvement |
|---|---|---|---|---|---|
| 11 | 3 | 1.92 s | 10 | 1.34 s | 30 % |
| 13 | 4 | 33.95 s | 9 | 24.55 s | 27.69 % |
| 15 | 5 | 7.7 m | 7 | 6.02 m | 21.8 % |
| 17 | 6 | 1.55 h | 8 | 1.26 h | 18.65 % |
| 19 | 7 | 18.62 h | 7 | 14.39 h | 22.74 % |
| 21 | 8 | 8.59 dy | 6 | 6.68 dy | 22.23 % |
Figure 1. Performance of our method for different challenging instances. Behavior of HEP_PMSprune(q) for different (l, d) instances with q ∈ {mns, ..., t}: (a) (11, 3), (b) (13, 4), (c) (15, 5), (d) (17, 6), (e) (19, 7). The figures use the following markers: 1) a black downward triangle indicates the running time of HEP_PMSprune(mns); 2) a black star indicates the running time of PMSprune or HEP_PMSprune(t); 3) a white box indicates the running time of HEP_PMSprune(ons), i.e., using the theoretically estimated q.
The performance of HEP_PMSprune(ons) for different values of n and l
| n | d | l | ons | T_ons | ons_exp | T_ons_exp | T_pms |
|---|---|---|---|---|---|---|---|
| 300 | 3 | 11 | 9 | 0.0001 | 3-20 | 0.0001 | 0.0001 |
| 600 | 3 | 11 | 10 | 1.34 | 10 | 1.34 | 1.92 |
| 900 | 3 | 11 | 14 | 4 | 11-16 | 4 | 5 |
| 1200 | 3 | 11 | 17 | 7 | 17 | 7 | 8 |
| 1500 | 3 | 11 | 20 | 16 | 20 | 16 | 16 |
| 300 | 3 | 12 | 6 | 0.05 | 4-20 | 0.05 | 0.05 |
| 600 | 3 | 12 | 8 | 0.83 | 4-20 | 0.83 | 0.83 |
| 900 | 3 | 12 | 8 | 1.5 | 6-20 | 1.5 | 1.5 |
| 1200 | 3 | 12 | 9 | 3 | 6-15 | 3 | 4 |
| 1500 | 3 | 12 | 10 | 5 | 8-12 | 5 | 7 |
| 300 | 4 | 13 | 7 | 3 | 5-20 | 3 | 3 |
| 600 | 4 | 13 | 9 | 24.55 | 9 | 24.55 | 33.95 |
| 900 | 4 | 13 | 11 | 81 | 11 | 81 | 109 |
| 1200 | 4 | 13 | 14 | 190 | 14 | 190 | 217 |
| 1500 | 4 | 13 | 17 | 353 | 17-19 | 356 | 360 |
| 300 | 4 | 14 | 6 | 1 | 4-20 | 1 | 1 |
| 600 | 4 | 14 | 7 | 6.5 | 7-18 | 6.5 | 7 |
| 900 | 4 | 14 | 8 | 21.5 | 8-9 | 21.5 | 24 |
| 1200 | 4 | 14 | 8 | 54 | 8 | 54 | 67 |
| 1500 | 4 | 14 | 9 | 107 | 9 | 107 | 146 |
| 300 | 4 | 15 | 5 | 0.25 | 4-20 | 0.25 | 0.25 |
| 600 | 4 | 15 | 5 | 1.25 | 4-20 | 1.25 | 1.25 |
| 900 | 4 | 15 | 6 | 5 | 5-20 | 5 | 5 |
| 1200 | 4 | 15 | 6 | 12 | 8 | 10 | 13 |
| 1500 | 4 | 15 | 7 | 16.5 | 7-13 | 16.5 | 20 |
| 300 | 4 | 16+ | 5 | 0.002 | 3-20 | 0.002 | 0.002 |
| 600 | 4 | 16+ | 5 | 0.25 | 4-20 | 0.25 | 0.25 |
| 900 | 4 | 16+ | 5 | 1 | 4-20 | 1 | 1 |
| 1200 | 4 | 16+ | 6 | 2.34 | 5-20 | 2.34 | 2.34 |
| 1500 | 4 | 16+ | 6-8 | 4.89 | 5-20 | 4.89 | 4.89 |
| 300 | 5 | 15 | 7 | 38 | 6-10 | 38 | 46 |
| 600 | 5 | 15 | 8 | 361.2 | 8 | 360 | 462 |
| 900 | 5 | 15 | 9 | 1250 | 9 | 1250 | 1847 |
| 1200 | 5 | 15 | 11 | 2976 | 11 | 2976 | 4060 |
| 1500 | 5 | 15 | 13 | 5829 | 13 | 5829 | 6969 |
| 300 | 5 | 17 | 5 | 2 | 5-20 | 2 | 2 |
| 600 | 5 | 17 | 6 | 27 | 13-20 | 19 | 19 |
| 900 | 5 | 17 | 5 | 103 | 7-20 | 92 | 92 |
| 1200 | 5 | 17 | 6 | 231 | 6-8 | 224 | 264 |
| 1500 | 5 | 17 | 6 | 439 | 6-8 | 439 | 552 |
| 300 | 5 | 18+ | 5 | 1 | 5-20 | 1 | 1 |
| 600 | 5 | 18+ | 6 | 5 | 6-20 | 4 | 4 |
| 900 | 5 | 18+ | 6-7 | 14 | 6-20 | 14 | 14 |
| 1200 | 5 | 18+ | 6-7 | 33 | 6-20 | 33 | 33 |
| 1500 | 5 | 18+ | 6-8 | 74 | 6-20 | 74 | 74 |
The first column gives the sequence length n, the second the Hamming distance d, and the third the motif length l. An entry l+ means that motif lengths greater than l lead to no further improvement. "ons" stands for the theoretically computed q, while "ons_exp" stands for the experimentally found one. We report the range of ons_exp values that yielded the best time. For l+ entries, a range of ons values may also be given. "T_ons" and "T_ons_exp" stand for the times (in seconds) with ons and ons_exp, respectively. "T_pms" stands for the time with the original PMSprune algorithm only.
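Running time of PHEP_PMSprune(ons) using different numbers of processors p for some challenging instances
| l | d | p = 1 | p = 2 | p = 3 | p = 4 | p = 5 | p = 6 | p = 7 | p = 8 |
|---|---|---|---|---|---|---|---|---|---|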
| 13 | 4 | 24.86 s | 12.4 s | 8.35 s | 6.1 s | 4.95 s | 4.35 s | 3.6 s | 3.2 s |
| 15 | 5 | 6.34 m | 3.19 m | 2.13 m | 1.61 m | 1.28 m | 1.07 m | 55.2 s | 48.5 s |
| 17 | 6 | 1.28 h | 38.28 m | 25.58 m | 19.16 m | 15.34 m | 12.81 m | 10.98 m | 9.61 m |
| 19 | 7 | 14.56 h | 7.24 h | 4.81 h | 3.61 h | 2.98 h | 2.42 h | 2.07 h | 1.82 h |
| 21 | 8 | 6.68 dy | 3.33 dy | 2.23 dy | 1.67 dy | 1.34 dy | 1.12 dy | 23.18 h | 20.42 h |
Figure 2. Scalability plot of the parallel version. The plots show the speed-up for different numbers of processors and problem instances.
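As a rough illustration of the speed-up shown in the scalability plot, the sketch below divides the first reported time of the (13, 4) row by each later time in that row. It assumes that the first time column is the baseline run and that the columns are ordered by increasing processor count; the function name is hypothetical.

```python
def speedups(times):
    """Speed-up of each run relative to the first (baseline) time."""
    base = times[0]
    return [base / t for t in times]


if __name__ == "__main__":
    # Running times in seconds for the (13, 4) instance, copied from the table.
    times_13_4 = [24.86, 12.4, 8.35, 6.1, 4.95, 4.35, 3.6, 3.2]
    for k, s in enumerate(speedups(times_13_4), start=1):
        print(f"column {k}: speed-up {s:.2f}")
```

For this row the ratio grows close to linearly with the column index, reaching roughly 7.8 in the last column.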
Application of the PHEP_PMSprune(ons) on the real yeast dataset
| Transcription Factor | Genes | Detected motif(s) and parameters (l, d) | Published motif(s) and reference(s) | Time in s (improvement) |
|---|---|---|---|---|
| PHO4 (600 bp) | PHO5, PHO8, PHO81, PHO84, | CACGTG (6,0) | CACGT[G\|T] [ | 38 (5%) |
| HSE_HSTF | SSA1, HSP26, SSA4, HSC82, SIS1, CUP1-1 | TTCAGTGAA | TTCNNGAA [ | 37 (35%) |
| PDR | PDR3, SNQ2, | TCCGTGGA | TCCG[C\|T]GGA [ | 27 (13%) |
| MCB | CDC2, CDC9, | ACGCGT | [A\|T]CGCG[A\|T] [ | 31 (20%) |
| ECB | SWI4, MCM5 | TTTCCCATTAAGGAAA (16,3) | TTtCCcnntnaGGAAA [ | 41 (49%) |
The first column lists the transcription factors (regulatory elements) and the length of the upstream sequences. The second column lists the regulated genes. The first three factors and their related genes are available at the SCPD [38]. The ECB is the early-cell-cycle-box promoter region described in [39], and we extracted its related genes from the Yeast Genome Database [37]. The third column gives the motif detected by our tool and the respective parameters (l, d). The fourth column gives the published motifs and their references. The final column gives the running time in seconds needed to run our program over the parameter range from (6, 0) to (21, 3), i.e., 64 invocations of our program. The percentages in brackets are the percentage improvements in running time compared to the PMSprune method.
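The count of 64 invocations is consistent with running every combination of l from 6 to 21 and d from 0 to 3 (16 × 4 = 64). The sketch below enumerates such a grid; that every combination is run is an assumption inferred from the stated count, and the function name is hypothetical.

```python
def parameter_grid(l_min: int = 6, l_max: int = 21,
                   d_min: int = 0, d_max: int = 3):
    """All (l, d) pairs with l_min <= l <= l_max and d_min <= d <= d_max."""
    return [(l, d)
            for l in range(l_min, l_max + 1)
            for d in range(d_min, d_max + 1)]


if __name__ == "__main__":
    grid = parameter_grid()
    print(len(grid), grid[0], grid[-1])
```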
Application of the PHEP_PMSprune(ons) on the Blanchette real dataset
| DNA region | No. of sequences | Detected motif (l, d) | Published motif | Time in s (improvement) |
|---|---|---|---|---|
| Insulin family | 8 | CCTCAGCCCC (10,1) | CCTCAGCCCC [ | 87 (10%) |
| | | AAGACTCTAA (10,2) | AAGACTCTAA [ | |
| | | GCCATCTGCC (10,1) | GCCATCTGCC [ | |
| | | CTATAAAG (8,0) | CTATAAAG [36, GB] | |
| | | GGGAAATG (8,1) | GGGAAATG [ | |
| Metallothionein | 26 | TTTGCACACGC (11,3) | TTTGCACACG [ | 7.87 (1%) |
| | | TGCACAC (7,1) | TGCACACGG [ | |
| Interleukin-3 5'UTR+Promoter | 6 | TTGAGTACT (9,2) | TTGAGTACT [ | 466 (10%) |
| | | GATGAATAAT (10,1) | GATGAATAAT [ | |
| | | TCTTCAGAG (9,2) | TCTTCAGAG [ | |
| | | AGGACCAG (8,1) | AGGACCAG [ | |
| | | AGGTTCCATGTCAGATAAAG | Novel | |
| Growth-hormone | 16 | AACTTATCCAT (11,3) | ATTATCCAT [ | 3.43 (0%) |
| | | ATAAATGTAAA (11,3) | ATAAATGTA [ | |
| | | TATAAAAAG (9,2) | TATAAAAAG [ | |
| c-fos | 6 | CCATATTAGGAC (12,3) | CCATATTAGGACATCT [ | 350 (15%) |
| | | GAGTTGGCTGC (11,3) | GAGTTGGCTG [ | |
| | | CACAGGATGT (10,2) | CACAGGATGT [ | |
| | | AGGACATCTGCT (12,3) | AGGACATCTG [ | |
| c-myc | 7 | GTTTATTC (8,1) | GTTTATTC [ | 83.5 (42%) |
| | | CTTGCTGGG (9,2) | TTGCTGGG [ | |
| | | TGTTTACATC (10,2) | TGTTTACATC [ | |
| | | CCCTCCCC (8,1) | CCCTCCCC [ | |
| Histone H1 | | CAATCACCAC (10,2) | CAATCACCAC [36, GB] | 47.6 (9%) |
| | | AAACAAAAGT (10,1) | AAACAAAAGT [36, GB] | |
The first column gives the gene family and the length of the upstream sequences. The second column gives the number of sequences. The third column gives the motif detected by our tool and the respective parameters (l, d). The fourth column gives the published motifs and their references; "GB" stands for GenBank annotation. The final column gives the running time in seconds needed to run our program over the parameter range from (6, 0) to (21, 3), i.e., 64 invocations of our program. The percentages in brackets are the percentage improvements in running time compared to the PMSprune method.