Qiang Yu, Hongwei Huo, Yipu Zhang, Hongzhi Guo.
Abstract
Motif search is a fundamental problem in bioinformatics, with an important application in locating transcription factor binding sites (TFBSs) in DNA sequences. Exact algorithms can report all (l, d) motifs and find the best one under a specific objective function. However, identifying weak motifs remains challenging, because current exact algorithms require either a large amount of memory or a long execution time. This paper proposes a new exact algorithm, PairMotif, for planted (l, d) motif search (PMS). To reduce both the candidate motifs and the scanned l-mers, multiple pairs of l-mers with relatively large distances are selected from the input sequences to restrict the search space. Comparisons with several recently proposed algorithms show that PairMotif requires less storage space and runs faster on most PMS instances. In particular, among the algorithms compared, only PairMotif can solve the weak instance (27, 9) within 10 hours. Moreover, the performance of PairMotif is stable with respect to sequence length, which allows it to identify motifs in longer sequences. Experimental results on real biological data demonstrate the validity of the proposed algorithm.
Year: 2012 PMID: 23119020 PMCID: PMC3485246 DOI: 10.1371/journal.pone.0048442
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
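The planted (l, d) motif search described in the abstract can be made concrete with a small check. The sketch below assumes the usual problem statement (an l-mer is an (l, d) motif if every input sequence contains an l-mer within Hamming distance d of it); the function names are hypothetical and this brute-force check is not PairMotif itself.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("l-mers must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def occurs_within(motif: str, sequence: str, d: int) -> bool:
    """True if some l-mer of `sequence` is within Hamming distance d of `motif`."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_ld_motif(motif: str, sequences: list[str], d: int) -> bool:
    """True if `motif` has an occurrence with at most d mismatches in every sequence."""
    return all(occurs_within(motif, s, d) for s in sequences)
```

For example, `is_ld_motif("ACGT", ["TTACGATT", "GGAGGTCC", "ACGTACGT"], 1)` holds, while the same call with `d = 0` does not, because the first two sequences contain no exact copy of the motif.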
Figure 1. An example of partitioning positions in the alignment of two/three l-mers.
This figure shows an example of partitioning positions in the alignment of two/three 15-mers.
Figure 2. Illustration of the PairMotif algorithm.
This figure takes the instance (15, 4) as an example to explain the process of PairMotif, which consists of three stages: selecting pairs, filtering l-mers and verifying candidate motifs.
R(x, x′) and |M(x, x′)| for the instance (15, 4).
| dH(x, x′) | R(x, x′) | |Md(x, x′)| |
| 9–15 | ∅ | 0 |
| 8 | {<0,0>} | 70 |
| 7 | {<0,0>, <0,1>} | 350 |
| 6 | {<0,0>, <0,1>, <0,2>, <1,0>} | 1190 |
| 5 | {<0,0>, <0,1>, <0,2>, <0,3>, <1,0>, <1,1>} | 2970 |
| 4 | {<0,0>, <0,1>, <0,2>, <0,3>, <0,4>, <1,0>, <1,1>, <1,2>, <2,0>} | 6600 |
| 3 | {<0,0>, <0,1>, <0,2>, <0,3>, <1,0>, <1,1>, <1,2>, <1,3>, <2,0>, <2,1>} | 13504 |
| 2 | {<0,0>, <0,1>, <0,2>, <1,0>, <1,1>, <1,2>, <2,0>, <2,1>, <2,2>, <3,0>} | 27316 |
| 1 | {<0,0>, <0,1>, <1,0>, <1,1>, <2,0>, <2,1>, <3,0>, <3,1>} | 42760 |
| 0 | {<0,0>, <1,0>, <2,0>, <3,0>, <4,0>} | 100636 |
This table shows the values of R(x, x′) and |M(x, x′)| for the instance (15, 4) under different Hamming distances.
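The R(x, x′) column above can be reproduced from l, d and the Hamming distance alone. The sketch below is an assumption-laden reading of the table, not the authors' code: it introduces a hypothetical counter `gamma` for the differing positions where a candidate y copies x, so that dH(x, y) = α + h − γ and dH(x′, y) = α + β + γ, and keeps a pair <α, β> whenever some γ satisfies both bounds.

```python
def feasible_pairs(l: int, d: int, h: int) -> list[tuple[int, int]]:
    """List the <alpha, beta> pairs that admit at least one shared neighbor.

    l: l-mer length, d: allowed mismatches, h: Hamming distance dH(x, x').
    alpha counts changed positions where x and x' agree; beta counts
    differing positions set to a third character.
    """
    pairs = []
    for alpha in range(min(d, l - h) + 1):
        for beta in range(min(d, h) + 1):
            # dH(x, y) = alpha + h - gamma <= d gives a lower bound on gamma.
            gamma_min = max(0, alpha + h - d)
            # dH(x', y) = alpha + beta + gamma <= d, and gamma <= h - beta.
            gamma_max = min(h - beta, d - alpha - beta)
            if gamma_min <= gamma_max:
                pairs.append((alpha, beta))
    return pairs
```

With `l = 15` and `d = 4`, `feasible_pairs(15, 4, 7)` returns `[(0, 0), (0, 1)]` and `feasible_pairs(15, 4, 9)` returns an empty list, in line with the corresponding rows.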
Figure 3. An example of traversing candidate motifs in M(x, x′).
This figure shows an example of traversing the candidate motifs shared by two l-mers x and x′. After calculating R(x, x′), for each <α, β> in R(x, x′), let y = x′; the traversal then modifies y in three steps. First, select α positions from those where x[i] = x′[i], and at each of these positions change y[i] to one of the three characters different from x[i]. Second, select β positions from those where x[i] ≠ x′[i], and at each of these positions change y[i] to one of the two characters different from both x[i] and x′[i]. Third, select a subset of the remaining positions where x[i] ≠ x′[i] (excluding those chosen in the second step), and change y[i] to x[i] at each of them. The bold italic characters denote the changed positions in y.
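The three-step traversal in the Figure 3 caption can be sketched as a generator. This is an illustrative reading, not the published implementation: the name `shared_neighbors` and the counter `gamma` (how many differing positions are reset to x[i] in the third step) are hypothetical, and the bounds on `gamma` are derived here from requiring dH(x, y) ≤ d and dH(x′, y) ≤ d.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def shared_neighbors(x: str, xp: str, d: int):
    """Yield each l-mer y with dH(x, y) <= d and dH(xp, y) <= d exactly once."""
    same = [i for i in range(len(x)) if x[i] == xp[i]]
    diff = [i for i in range(len(x)) if x[i] != xp[i]]
    h = len(diff)
    for alpha in range(min(d, len(same)) + 1):
        for beta in range(min(d, h) + 1):
            g_lo = max(0, alpha + h - d)
            g_hi = min(h - beta, d - alpha - beta)
            for gamma in range(g_lo, g_hi + 1):
                # Step 1: alpha agreeing positions, 3 replacement characters each.
                for s_pos in combinations(same, alpha):
                    s_opts = [[c for c in ALPHABET if c != x[i]] for i in s_pos]
                    # Step 2: beta differing positions, 2 "third" characters each.
                    for b_pos in combinations(diff, beta):
                        b_opts = [
                            [c for c in ALPHABET if c not in (x[i], xp[i])]
                            for i in b_pos
                        ]
                        rest = [i for i in diff if i not in b_pos]
                        # Step 3: gamma of the remaining differing positions copy x.
                        for g_pos in combinations(rest, gamma):
                            for s_chars in product(*s_opts):
                                for b_chars in product(*b_opts):
                                    y = list(xp)  # start from y = x'
                                    for i, c in zip(s_pos, s_chars):
                                        y[i] = c
                                    for i, c in zip(b_pos, b_chars):
                                        y[i] = c
                                    for i in g_pos:
                                        y[i] = x[i]
                                    yield "".join(y)
```

For two 15-mers at Hamming distance 8 and d = 4, only <0, 0> is feasible and the generator yields 70 strings, the value listed for that row of the (15, 4) table.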
Algorithm complexities.
| Algorithm | Time complexity | Space complexity |
| PMSprune | | |
| iTriplet | | |
| PMS5 | | |
| PairMotif | | |
This table shows the time and space complexities of PairMotif and of other well-known exact algorithms. Here t is the number of sequences; n is the sequence length; p is the probability that the Hamming distance between two random l-mers is at most k; and L is the time to load the ILP table of PMS5, which is about 50 seconds [22].
Time comparison on fixed 2d neighborhood probability.
| Algorithm | (12, 3) | (15, 4) | (18, 5) | (21, 6) | (24, 7) | (27, 8) | (30, 9) |
| RISOTTO | 25 s | 3.8 m | 30.3 m | 4.1 h | -o | -o | -o |
| PMS5 | 17 s | 28 s | 2.4 m | 2.5 m | 2.4 m | -e | -e |
| iTriplet | 2.9 m | 3.1 m | 3.8 m | 4.2 m | 4.9 m | 5.9 m | 7.4 m |
| PMSprune | 1 s | 2 s | 6 s | 11 s | 19 s | 35 s | 50 s |
| PairMotif | 2 s | 2 s | 3 s | 5 s | 11 s | 24 s | 47 s |
Time units, s: seconds; m: minutes; h: hours. Note, -o: over 10 hours; -e: memory error.
Time comparison on different 2d neighborhood probabilities.
| (l, d) | Neighborhood Probability | RISOTTO | PMS5 | iTriplet | PMSprune | PairMotif |
| (29, 8) | 0.016 | -o | -e | 21 s | 1 s | 1 s |
| (9, 2) | 0.049 | 3 s | 14 s | 2.6 m | 1 s | 1 s |
| (23, 7) | 0.078 | -o | 2.6 m | 19.3 m | 2.3 m | 2.2 m |
| (28, 9) | 0.138 | -o | -e | 3.6 h | 1.1 h | 2.9 h |
| (19, 6) | 0.175 | 7.5 h | 3.0 m | 1.9 h | 5.9 m | 4.0 m |
| (27, 9) | 0.213 | -o | -e | -o | -o | 7.9 h |
| (18, 6) | 0.283 | -o | 7.1 m | -o | 29.6 m | 12.1 m |
| (15, 5) | 0.319 | 1.3 h | 4.1 m | -o | 8.7 m | 4.7 m |
| (17, 6) | 0.426 | -o | 31.3 m | -o | 1.8 h | 53.3 m |
| (19, 7) | 0.534 | -o | 1.4 h | -o | -o | 8.6 h |
Time units, s: seconds; m: minutes; h: hours. Note, -o: over 10 hours; -e: memory error.
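The neighborhood probability column can be approximated with a short calculation. The sketch below assumes, without confirmation from this record, that the 2d neighborhood probability is the chance that two independent, uniformly random DNA l-mers lie within Hamming distance k = 2d; under that model each position mismatches with probability 3/4. The function name is hypothetical.

```python
from math import comb


def neighborhood_probability(l: int, k: int) -> float:
    """P(dH(a, b) <= k) for two independent uniform random DNA l-mers.

    Under the uniform model, dH follows Binomial(l, 3/4), so the probability
    is sum_{i=0..k} C(l, i) * 3^i / 4^l.
    """
    return sum(comb(l, i) * 3**i for i in range(k + 1)) / 4**l
```

For the instance (9, 2), `neighborhood_probability(9, 4)` is roughly 0.049, which agrees with the value listed in the table; agreement on every row is not guaranteed under this simplified model.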
Figure 4. Time comparison on different sequence lengths.
This figure compares PairMotif with two well-known algorithms, PMS5 [22] and PMSprune [20], on the instance (18, 6) for different sequence lengths. The x-axis shows the sequence length. The y-axis shows the running time.
Experimental results on real biological data.
| Data set | (l, d) | Amount of (l, d) motifs | Motif |
| preproinsulin | (15, 2) | 102 | CAGCCTCAGCCCCCC |
| | | | TG |
| | | | GAAATTG |
| | | | TG |
| DHFR | (11, 2) | 103 | ATTTCGCGCCA |
| | | | CATCGTCGCCG |
| | | | |
| | | | |
| c-fos | (9, 2) | 104 | CCANATTNG |
| | | | GCCTCCCCC |
| | | | C |
| | | | GTTGGCTGC |
| metallothionein | (15, 2) | 101 | CTCTGCACRCCGCCC |
| | | | |
| | | | |
| | | | |
| Yeast ECB | (16, 3) | 101 | TTTCCCNNTNAGGAAA |
| | | | |
| | | | |
| | | | |
The published motif.
The predicted motif obtained by using consensus score.
The predicted motif obtained by using relative entropy.
The predicted motif obtained by using sequence specificity.
The corresponding (l, d) used is (10, 2).
Figure 5. Comparison of predicted motifs under different objective functions.
The x-axis shows the data sets used in our experiments. For each data set, we obtain three predicted motifs in terms of three objective functions. The y-axis shows the value of nucleotide-level correlation coefficient for each predicted motif.
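The nucleotide-level correlation coefficient on the y-axis is not defined in this record. The sketch below assumes the definition commonly used when benchmarking motif finders, nCC = (TP·TN − FN·FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)), computed over nucleotide positions; the function name and the set-based input format are hypothetical.

```python
from math import sqrt


def ncc(predicted: set[int], known: set[int], total: int) -> float:
    """Nucleotide-level correlation coefficient over `total` positions.

    predicted / known: indices of nucleotides covered by predicted and
    published binding sites. Returns 0.0 when the denominator vanishes
    (a convention chosen for this sketch).
    """
    tp = len(predicted & known)   # predicted and truly in a site
    fp = len(predicted - known)   # predicted but not in a site
    fn = len(known - predicted)   # in a site but missed
    tn = total - tp - fp - fn     # correctly left unpredicted
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0
```

A prediction that covers exactly the known positions scores 1.0, and a prediction that covers exactly the complement scores -1.0.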