Christina Boucher, James King.
Abstract
BACKGROUND: Improving the accuracy and efficiency of motif recognition is an important computational challenge with applications to detecting transcription factor binding sites in genomic data. Closely related to motif recognition is the CONSENSUS STRING decision problem, which asks, given a parameter d and a set of length-ℓ strings S = {s1, ..., sn}, whether there exists a consensus string that has Hamming distance at most d from every string in S. A set of strings S is pairwise bounded if the Hamming distance between any pair of strings in S is at most 2d. It is trivial to determine whether a set is pairwise bounded, and a set cannot have a consensus string unless it is pairwise bounded. We use CONSENSUS STRING to determine whether or not a pairwise bounded set has a consensus. Unfortunately, CONSENSUS STRING is NP-complete. The lack of an efficient method to solve the CONSENSUS STRING problem has caused it to become a computational bottleneck in MCL-WMR, a motif recognition program capable of solving difficult motif recognition problem instances.
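The definitions in the abstract can be made concrete with a small sketch. The function names (`hamming`, `is_pairwise_bounded`, `find_consensus`) and the DNA alphabet default are assumptions introduced here for illustration; they are not taken from MCL-WMR or sMCL-WMR. The consensus search is a naive exhaustive enumeration, exponential in the string length, and is only meant to show what the decision problem asks.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_pairwise_bounded(strings, d):
    """True if every pair of strings is within Hamming distance 2d."""
    return all(hamming(s, t) <= 2 * d for s, t in combinations(strings, 2))


def find_consensus(strings, d, alphabet="ACGT"):
    """Exhaustive search for a consensus string (illustration only).

    Returns a string within distance d of every input string, or None.
    Tries all len(alphabet) ** length candidates, so it is usable only
    for very short strings.
    """
    if not strings or not is_pairwise_bounded(strings, d):
        # Pairwise boundedness is a necessary condition, so fail fast.
        return None
    length = len(strings[0])
    for candidate in product(alphabet, repeat=length):
        c = "".join(candidate)
        if all(hamming(c, s) <= d for s in strings):
            return c
    return None
```

For example, with d = 1 the set {AAA, CCA, ACC, CAC} is pairwise bounded (every pair differs in exactly two positions), yet `find_consensus` returns `None` for it, which shows that pairwise boundedness is necessary but not sufficient. Dropping CAC leaves a set for which ACA is a consensus string.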
Year: 2010 PMID: 20122182 PMCID: PMC3009483 DOI: 10.1186/1471-2105-11-S1-S11
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Efficiency of rejection sampling. Average number of rejections when generating a pairwise bounded set with our rejection sampling heuristic. Each plot shows the effect of varying one of the three parameters (n, ℓ, d). Data points are connected with cubic splines. Note the logarithmic scale used in the right plot.
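The caption does not describe the heuristic itself. As a rough illustration of what counting rejections means, the sketch below assumes the simplest possible scheme: strings are drawn uniformly at random one at a time, and a draw is rejected if it lies farther than 2d from any string already accepted. The function name `sample_pairwise_bounded`, the uniform proposal distribution, and the `max_rejections` guard are all assumptions made here and may differ from the authors' heuristic.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def sample_pairwise_bounded(n, length, d, alphabet="ACGT",
                            rng=None, max_rejections=1_000_000):
    """Sequential rejection sampling of a pairwise bounded set.

    Returns (strings, rejections), where rejections counts the draws
    that were discarded. Hypothetical scheme, not the paper's heuristic.
    """
    rng = rng or random.Random()
    accepted = []
    rejections = 0
    while len(accepted) < n:
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        # Accept only if the candidate keeps the set pairwise bounded.
        if all(hamming(candidate, s) <= 2 * d for s in accepted):
            accepted.append(candidate)
        else:
            rejections += 1
            if rejections > max_rejections:
                raise RuntimeError("too many rejections; parameters too strict")
    return accepted, rejections
```

Calling `sample_pairwise_bounded(5, 8, 3, rng=random.Random(0))` yields five strings of length 8 together with the number of discarded draws. Under this uniform scheme the rejection count grows quickly as n increases or as d shrinks relative to ℓ, which is the kind of trend the figure reports for the authors' heuristic.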
Figure 2. Weight distribution histograms. Histograms showing weight distributions for motif sets and decoy sets. Normal distributions fitted to the data are shown to indicate that the weight distributions are approximately normal.
Weight distribution properties.
| (ℓ, d) | | | | | | | n | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (15, 4) | 794 | 1439 | 84 | 84 | 989 | 1243 | 15 | 432 | 980 | 52 | 60 | 552 | 840 |
| (16, 5) | 850 | 1651 | 86 | 102 | 1050 | 1413 | 20 | 794 | 1439 | 84 | 84 | 989 | 1243 |
| (18, 6) | 899 | 2204 | 89 | 140 | 1106 | 1878 | 25 | 1529 | 2250 | 129 | 110 | 1829 | 1994 |
| (25, 8) | 954 | 2670 | 111 | 175 | 1212 | 2262 | 30 | 1845 | 3263 | 196 | 169 | 2300 | 2869 |
| (28, 9) | 1024 | 3230 | 152 | 199 | 1378 | 2767 | 35 | 2240 | 4523 | 246 | 213 | 2812 | 4027 |
| (30, 11) | 1069 | 3882 | 169 | 245 | 1462 | 3312 | 40 | 3709 | 6110 | 389 | 275 | 4613 | 5460 |
Data illustrating the change to the mean and standard deviation of the weight of a random motif set and the weight of a random decoy set as the values of ℓ, d, and n increase. On the left the number of strings is fixed at 20 and on the right the values (ℓ, d) are fixed at (15, 4).
Performance on synthetic data with varying (ℓ, d).
| (ℓ, d) | sMCL-WMR | MCL-WMR | PROJECTION | Voting | PMSprune |
|---|---|---|---|---|---|
| (10, 2) | 15 | 1020 | 56 (98%) | < 1 | 12 |
| (12, 3) | 24 | 2780 | 321 (85%) | 28.4 | 23 |
| (14, 4) | 98 | 3120 | 658 (75%) | 412 | 102 |
| (16, 5) | 253 | 4101 | 1312 (80%) | 1620 | 520 |
| (18, 6) | 632 | 10202 | 2200 (85%) | 4210 | 33560 |
| (20, 7) | 1203 | - | 2700 (75%) | 20021 | - |
| (25, 9) | 1502 | - | - | - | - |
| (28, 12) | 1691 | - | - | - | - |
| (30, 14) | 2002 | - | - | - | - |
Comparison of the performance of sMCL-WMR and other motif recognition programs on synthetic data; the other programs tested are PROJECTION [4], MCL-WMR [10], Voting [9], and PMSprune [8]. All programs except PROJECTION had a success rate of 100%; for this reason, only the success rate of PROJECTION is included, in brackets, in the table. Times are given in CPU seconds. In all experiments, n = 600, m = 20, and ℓ and d are varied. "-" denotes that the program was not capable of solving the specific problem.
Performance on synthetic data with varying n.
| n | sMCL-WMR | MCL-WMR | PROJECTION | Voting | PMSprune |
|---|---|---|---|---|---|
| 18 | 223 | 5320 | 698 (85%) | 3930 | 37020 |
| 20 | 243 | 12032 | 729 (77%) | 5201 | 45030 |
| 24 | 1354 | 36112 | 874 (75%) | 10211 | - |
| 28 | 1960 | - | - | - | - |
| 30 | 2504 | - | - | - | - |
| 40 | 3203 | - | - | - | - |
The performance of sMCL-WMR as the number of strings increases in comparison to other motif recognition programs. Other programs tested include MCL-WMR [10], PROJECTION [4], Voting [9], and PMSprune [8]. The time is given in CPU seconds. In all experiments, ℓ = 18, d = 6, m = 600 and n ranges from 18 to 40.
Motif recognition on biological data.
| Data set | Published motif | Motif pattern discovered | Motif recognition program | ℓ | d | Time (CPU sec.) |
|---|---|---|---|---|---|---|
| hm01 | gggaggctgaggcatgag | cggaggcctaagcctcag | GLAM [ | 18 | 8 | 42.1 |
| hm03 | Cagccaggctgcagtgctg | Catccatacagaa | GLAM [ | 13 | 6 | 12.3 |
| hm04 | gcgatgtgtaatagtcgc | gacatgtgtaaaaga | MEME [ | 15 | 9 | 54.2 |
| hm08 | ggagaaattctaa | aTGACgTC | Weeder [ | 13 | 6 | 4.34 |
| hm20 | ctgTAatc | gagTAaac | MITRA [ | 8 | 3 | 6.7 |
| hm26 | GCCGGC | GCCGGC | MITRA [ | 6 | 0 | 3.44 |
Data were collected from TRANSFAC; the published motifs are those reported in the assessment of motif recognition tools by Tompa et al. [2]. For each data set, we determined motifs of the same length using sMCL-WMR. The number of strings ranges from 8 to 34.