Andra Ivan, Marc S Halfon, Saurabh Sinha.
Abstract
We consider the problem of predicting cis-regulatory modules without knowledge of motifs. We formulate this problem in a pragmatic setting, and create over 30 new data sets, using Drosophila modules, to use as a 'benchmark'. We propose two new methods for the problem, and evaluate these, as well as two existing methods, on our benchmark. We find that the challenge of predicting cis-regulatory modules ab initio, without any input of relevant motifs, is a realizable goal.
Year: 2008 PMID: 18226245 PMCID: PMC2395258 DOI: 10.1186/gb-2008-9-1-r22
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Statistics for the data sets in our benchmark
| Name | Number of CRMs | Minimum CRM length | Maximum CRM length | Average CRM length | Total CRM length (Kbp) |
| mapping3.adult | 34 | 83 | 2,013 | 748 | 25 |
| mapping1.adult mesoderm | 5 | 126 | 927 | 561 | 2 |
| mapping1.amnioserosa | 7 | 469 | 1,500 | 708 | 4 |
| mapping1.blastoderm | 77 | 126 | 1,833 | 906 | 69 |
| mapping1.cardiac mesoderm | 8 | 237 | 1,513 | 536 | 4 |
| mapping1.cns | 34 | 304 | 1,986 | 1,034 | 35 |
| mapping1.dorsal ectoderm | 8 | 267 | 1,657 | 842 | 6 |
| mapping1.ectoderm | 37 | 105 | 2,015 | 839 | 31 |
| mapping2.ectoderm | 51 | 105 | 2,015 | 815 | 41 |
| mapping1.endoderm | 16 | 220 | 1,373 | 579 | 9 |
| mapping1.eye | 6 | 187 | 1,930 | 824 | 4 |
| mapping2.eye | 18 | 187 | 2,015 | 868 | 15 |
| mapping1.fat body | 5 | 375 | 529 | 456 | 2 |
| mapping1.female gonad | 10 | 83 | 1,657 | 442 | 4 |
| mapping1.glia | 7 | 515 | 1,890 | 899 | 6 |
| mapping1.imaginal disc | 47 | 177 | 2,015 | 938 | 44 |
| mapping2.imaginal disc | 12 | 490 | 2,015 | 1,248 | 14 |
| mapping3.larva | 69 | 176 | 2,015 | 892 | 61 |
| mapping1.male gonad | 8 | 200 | 1,319 | 862 | 6 |
| mapping1.malpighian tubules | 4 | 540 | 1,373 | 782 | 3 |
| mapping1.mesectoderm | 5 | 601 | 1,415 | 913 | 4 |
| mapping1.mesoderm | 16 | 105 | 1,415 | 544 | 8 |
| mapping2.mesoderm | 45 | 105 | 1,513 | 518 | 23 |
| mapping1.neuroectoderm | 7 | 343 | 1,360 | 575 | 4 |
| mapping2.neuronal | 54 | 177 | 2,013 | 988 | 53 |
| mapping1.pns | 24 | 177 | 2,013 | 976 | 23 |
| mapping2.reproductive system | 21 | 83 | 1,801 | 734 | 15 |
| mapping1.salivary gland | 6 | 295 | 1,890 | 786 | 4 |
| mapping1.somatic muscle | 12 | 312 | 1,513 | 718 | 8 |
| mapping1.tracheal system | 9 | 515 | 2,015 | 1,236 | 11 |
| mapping1.ventral ectoderm | 12 | 343 | 1,657 | 700 | 8 |
| mapping1.visceral mesoderm | 12 | 183 | 1,104 | 451 | 5 |
| mapping2.wing | 33 | 177 | 2,015 | 1,029 | 33 |
Each control region is ten times the CRM length.
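The summary columns above can be reproduced from a list of CRM lengths. The following is a minimal sketch, not the authors' code: the function names, the rounding to whole base pairs and whole Kbp, and the individual lengths in the example are all assumptions made for illustration.

```python
def crm_stats(lengths):
    """Summarize CRM lengths (in bp) in the shape of one table row.

    Rounding the average to whole bp and the total to whole Kbp is an
    assumption about how the table was prepared.
    """
    total = sum(lengths)
    return {
        "number": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "average": round(total / len(lengths)),
        "total_kbp": round(total / 1000),
    }


def control_length(crm_length, factor=10):
    """Length of a control region, taken as ten times the CRM length."""
    return factor * crm_length


# Hypothetical lengths, chosen only for illustration.
hypothetical_lengths = [375, 420, 456, 500, 529]
stats = crm_stats(hypothetical_lengths)
print(stats)
print(control_length(456))
```

With these made-up lengths the summary has the same minimum, maximum, average, and total as the mapping1.fat body row, although the real individual CRM lengths are not given here.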
Performance of Stubb, D2Z-set, and CSam on 33 data sets in our benchmark
| Data set | Sequence number* | Length† | Maximum sensitivity‡ | Stubb§ p-value | Stubb§ sensitivity | D2Z-set§ p-value | D2Z-set§ sensitivity | CSam§ p-value | CSam§ sensitivity |
| MAPPING3.ADULT | 34 | 254,800 | 0.71 | 0.20 | 0.72 | 0.07 | 0.15 | 0.13 | |
| mapping1.adult mesoderm | 5 | 28,085 | 0.76 | 0.51 | 0.05 | 0.11 | 0.22 | 0.51 | 0.05 |
| mapping1.amnioserosa | 7 | 49,635 | 0.84 | 0.25 | 0.15 | 0.34 | 0.12 | 0.09 | 0.23 |
| MAPPING1.BLASTODERM | 77 | 698,840 | 0.77 | 0.36 | 0.10 | 0.13 | 0.26 | ||
| MAPPING1.CARDIAC MESODERM | 8 | 42,979 | 0.76 | 0.08 | 0.22 | 0.28 | 0.12 | 0.19 | |
| MAPPING1.CNS | 34 | 352,108 | 0.80 | 0.48 | 0.10 | 0.20 | 0.18 | ||
| mapping1.dorsal ectoderm | 8 | 67,490 | 0.77 | 0.08 | 0.22 | 0.88 | 0.00 | 0.08 | 0.22 |
| MAPPING1.ECTODERM | 37 | 311,000 | 0.72 | 0.20 | 0.20 | 0.21 | |||
| MAPPING2.ECTODERM | 51 | 416,473 | 0.74 | 0.18 | 0.15 | 0.23 | |||
| MAPPING1.ENDODERM | 16 | 92,723 | 0.82 | 0.24 | 0.31 | 0.12 | 0.26 | ||
| MAPPING1.EYE | 6 | 49,494 | 0.70 | 1.00 | 0.00 | 0.48 | 0.08 | 0.32 | |
| mapping2.eye | 18 | 156,531 | 0.69 | 0.19 | 0.14 | 0.68 | 0.07 | 0.88 | 0.04 |
| mapping1.fat body | 5 | 22,831 | 0.93 | 0.14 | 0.20 | 1.00 | 0.00 | 0.45 | 0.09 |
| MAPPING1.FEMALE GONAD | 10 | 44,269 | 0.62 | 0.24 | 0.97 | 0.00 | 0.86 | 0.02 | |
| mapping1.glia | 7 | 63,008 | 0.82 | 0.49 | 0.09 | 0.16 | 0.19 | 0.21 | 0.17 |
| MAPPING1.IMAGINAL DISC | 47 | 441,597 | 0.77 | 0.55 | 0.09 | 0.20 | 0.24 | 0.12 | |
| mapping2.imaginal disc | 12 | 149,915 | 0.80 | 0.57 | 0.08 | 0.12 | 0.18 | 0.33 | 0.12 |
| MAPPING3.LARVA | 69 | 616,635 | 0.76 | 0.14 | 0.15 | 0.18 | |||
| mapping1.male gonad | 8 | 69,044 | 0.85 | 0.22 | 0.15 | 0.46 | 0.10 | 0.15 | 0.18 |
| mapping1.malpighian tubules | 4 | 31,338 | 0.81 | 0.10 | 0.25 | 1.00 | 0.00 | 0.30 | 0.16 |
| MAPPING1.MESECTODERM | 5 | 45,712 | 0.83 | 0.18 | 0.20 | 0.43 | 0.10 | 0.46 | |
| MAPPING1.MESODERM | 16 | 87,140 | 0.72 | 0.21 | 0.09 | 0.17 | 0.22 | 0.13 | |
| MAPPING2.MESODERM | 45 | 233,441 | 0.75 | 0.22 | 0.20 | 0.16 | |||
| MAPPING1.NEUROECTODERM | 7 | 40,315 | 0.80 | 0.34 | 1.00 | 0.00 | 0.51 | ||
| MAPPING2.NEURONAL | 54 | 534,081 | 0.78 | 0.24 | 0.12 | 0.19 | 0.26 | ||
| MAPPING1.PNS | 24 | 234,532 | 0.78 | 0.19 | 0.07 | 0.17 | 0.21 | ||
| mapping2.reproductive system | 21 | 154,400 | 0.69 | 0.16 | 0.14 | 0.34 | 0.10 | 0.24 | 0.12 |
| mapping1.salivary gland | 6 | 47,232 | 0.74 | 0.55 | 0.06 | 1.00 | 0.00 | 0.36 | 0.11 |
| MAPPING1.SOMATIC MUSCLE | 12 | 86,317 | 0.79 | 0.29 | 0.12 | 0.05 | 0.21 | 0.28 | |
| mapping1.tracheal system | 9 | 111,351 | 0.85 | 0.55 | 0.08 | 0.21 | 0.16 | 0.18 | 0.17 |
| MAPPING1.VENTRAL ECTODERM | 12 | 84,154 | 0.77 | 0.38 | 0.32 | 0.12 | 0.27 | ||
| MAPPING1.VISCERAL MESODERM | 12 | 54,278 | 0.77 | 0.46 | 0.10 | 0.32 | 0.12 | 0.28 | |
| MAPPING2.WING | 33 | 340,094 | 0.78 | 0.14 | 0.13 | 0.23 | 0.22 | ||
*The number of sequences in a data set; †the total sequence length; ‡the maximum sensitivity possible. §The sensitivity and its empirical p-value are given for each method tested. Data set names are capitalized if at least one of the three methods performs significantly (p-value ≤0.05; shown in bold) on it.
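The two quantities reported per method can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the table: sensitivity is taken here to be the fraction of known CRM nucleotides covered by predicted windows, and the empirical p-value is taken to be the fraction of null (for example, randomly placed) predictions whose sensitivity is at least the observed one. Intervals are half-open `(start, end)` pairs on one sequence, predicted windows are assumed not to overlap each other, and all function names are hypothetical.

```python
def overlap(a, b):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def sensitivity(known, predicted):
    """Fraction of known CRM nucleotides covered by predictions.

    Assumes the predicted windows do not overlap one another, so no
    nucleotide is counted twice.
    """
    total = sum(end - start for start, end in known)
    covered = sum(overlap(k, p) for k in known for p in predicted)
    return covered / total


def empirical_p_value(observed, null_values):
    """Fraction of null sensitivities that reach the observed value."""
    at_least = sum(1 for v in null_values if v >= observed)
    return at_least / len(null_values)


# Toy values, not taken from the benchmark.
obs = sensitivity([(100, 200)], [(150, 300)])
p = empirical_p_value(obs, [0.1, 0.5, 0.7, 0.2])
print(obs, p)
```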
Entry for any pair of methods is the number of data sets on which both methods performed significantly well (sensitivity p-value <0.05)
| | Stubb | CSam | D2Z-set |
| Stubb | 12 | 9 | 4 |
| CSam | - | 16 | 8 |
| D2Z-set | - | - | 9 |
Diagonals indicate the number of data sets on which the corresponding method performed well.
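A table of this shape can be derived from the per-method lists of data sets with significant performance. The sketch below assumes each method is represented by a set of data set names; the function name and the toy sets are hypothetical and do not reproduce the benchmark results.

```python
from itertools import combinations_with_replacement


def pairwise_counts(significant):
    """Count data sets shared by each pair of methods.

    `significant` maps a method name to the set of data sets on which
    that method performed significantly well. Pairing a method with
    itself gives the diagonal entry.
    """
    methods = list(significant)
    return {
        (a, b): len(significant[a] & significant[b])
        for a, b in combinations_with_replacement(methods, 2)
    }


# Toy input, for illustration only.
toy = {
    "Stubb": {"set1", "set2", "set3"},
    "CSam": {"set2", "set3", "set4"},
    "D2Z-set": {"set3"},
}
counts = pairwise_counts(toy)
print(counts)
```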
Figure 1. Performance of CSam on five data sets where its sensitivity p-value was below 0.05. The data sets are (a) mapping1.neuroectoderm, (b) mapping1.mesectoderm, (c) mapping1.ventral ectoderm, (d) mapping1.eye and (e) mapping1.ectoderm. In each panel, every sequence is shown as a blue line, the location of a known module is shown as a red rectangle below the line and the location of a predicted module is shown as a green rectangle above the line. The displays of different panels are to different scales.
CRM-level sensitivity of data sets
| Set name | CRMs* | Stubb† | CSam† | D2Z-set† | CisModule† | MCD† |
| mapping3.adult | 34 | 0.24 (8) | 0.21 (7) | 0.24 (8) | 0.26 (9) | |
| mapping1.adult mesoderm | 5 | 0.20 (1) | 0.20 (1) | 0.00 (0) | 0.20 (1) | |
| mapping1.amnioserosa | 7 | 0.14 (1) | 0.14 (1) | 0.00 (0) | 0.29 (2) | |
| mapping1.blastoderm | 77 | 0.42 (32) | 0.21 (16) | 0.14 (11) | 0.12 (9) | |
| mapping1.cardiac mesoderm | 8 | 0.38 (3) | 0.25 (2) | 0.12 (1) | ||
| mapping1.cns | 34 | 0.12 (4) | 0.24 (8) | 0.15 (5) | 0.15 (5) | |
| mapping1.dorsal ectoderm | 8 | 0.25 (2) | 0.00 (0) | 0.12 (1) | 0.25 (2) | |
| mapping1.ectoderm | 37 | 0.30 (11) | 0.35 (13) | 0.22 (8) | 0.19 (7) | |
| mapping2.ectoderm | 51 | 0.24 (12) | 0.25 (13) | 0.14 (7) | 0.18 (9) | |
| mapping1.endoderm | 16 | 0.25 (4) | 0.19 (3) | 0.25 (4) | 0.12 (2) | |
| mapping1.eye | 6 | 0.00 (0) | 0.17 (1) | 0.33 (2) | 0.17 (1) | |
| mapping2.eye | 18 | 0.11 (2) | 0.11 (2) | 0.22 (4) | 0.11 (2) | |
| mapping1.fat body | 5 | 0.00 (0) | 0.00 (0) | 0.00 (0) | 0.00 (0) | |
| mapping1.female gonad | 10 | 0.10 (1) | 0.00 (0) | 0.20 (2) | 0.00 (0) | |
| mapping1.glia | 7 | 0.14 (1) | 0.29 (2) | 0.29 (2) | 0.14 (1) | |
| mapping1.imaginal disc | 47 | 0.11 (5) | 0.21 (10) | 0.19 (9) | 0.15 (7) | |
| mapping2.imaginal disc | 12 | 0.08 (1) | 0.17 (2) | 0.17 (2) | 0.08 (1) | |
| mapping3.larva | 69 | 0.16 (11) | 0.25 (17) | 0.19 (13) | 0.13 (9) | |
| mapping1.male gonad | 8 | 0.12 (1) | 0.12 (1) | 0.12 (1) | ||
| mapping1.malpighian tubules | 4 | 0.00 (0) | 0.00 (0) | |||
| mapping1.mesectoderm | 5 | 0.20 (1) | 0.20 (1) | 0.00 (0) | 0.20 (1) | |
| mapping1.mesoderm | 16 | 0.31 (5) | 0.31 (5) | 0.31 (5) | 0.31 (5) | |
| mapping2.mesoderm | 45 | 0.24 (11) | 0.31 (14) | 0.07 (3) | 0.22 (10) | |
| mapping1.neuroectoderm | 7 | 0.43 (3) | 0.00 (0) | 0.29 (2) | 0.14 (1) | |
| mapping2.neuronal | 54 | 0.15 (8) | 0.24 (13) | 0.09 (5) | 0.09 (5) | |
| mapping1.pns | 24 | 0.25 (6) | 0.17 (4) | |||
| mapping2.reproductive system | 21 | 0.19 (4) | 0.14 (3) | 0.19 (4) | 0.24 (5) | |
| mapping1.salivary gland | 6 | 0.00 (0) | 0.00 (0) | |||
| mapping1.somatic muscle | 12 | 0.25 (3) | 0.17 (2) | 0.08 (1) | ||
| mapping1.tracheal system | 9 | 0.11 (1) | 0.00 (0) | 0.00 (0) | ||
| mapping1.ventral ectoderm | 12 | 0.25 (3) | 0.08 (1) | 0.17 (2) | ||
| mapping1.visceral mesoderm | 12 | 0.17 (2) | 0.42 (5) | 0.25 (3) | 0.17 (2) | |
| mapping2.wing | 33 | 0.24 (8) | 0.09 (3) | 0.18 (6) |
*The number of CRMs in a data set. †The fraction (and number in parentheses) of CRMs in a data set that were 'hits' (that is, overlap greater than half the length of the shorter window). The best CRM-level sensitivity for each data set is in bold.
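The 'hit' criterion in the footnote can be written as a short check. This is a sketch under assumptions: "the shorter window" is read as the shorter of the known CRM and the predicted window, intervals are half-open `(start, end)` pairs on the same sequence, and the function names are hypothetical.

```python
def is_hit(crm, prediction):
    """True if the overlap exceeds half the length of the shorter window."""
    shared = max(0, min(crm[1], prediction[1]) - max(crm[0], prediction[0]))
    shorter = min(crm[1] - crm[0], prediction[1] - prediction[0])
    return shared > shorter / 2


def crm_level_sensitivity(crms, predictions):
    """Return (fraction, number) of CRMs hit by at least one prediction."""
    hits = sum(1 for c in crms if any(is_hit(c, p) for p in predictions))
    return hits / len(crms), hits


# Toy coordinates, for illustration only.
crms = [(0, 100), (500, 600), (1000, 1100)]
predictions = [(40, 140), (590, 690)]
fraction, hits = crm_level_sensitivity(crms, predictions)
print(f"{fraction:.2f} ({hits})")
```

The first CRM shares 60 of 100 positions with a prediction and counts as a hit; the second shares only 10 and does not, so the result is reported in the same "fraction (number)" form as the table entries.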
Figure 2. Logos of the known and predicted motifs in a data set. (a,b) The known Dorsal motif (a) and the motif discovered using YMF on ab initio predicted CRMs in the mapping1.neuroectoderm data set (b). The same motif-finding program, when run on the entire mapping1.neuroectoderm data set (which includes the CRMs), did not find this or any other known motif from the FlyREG database.
Figure 3. Pseudo-code for the CSam algorithm.