Eldon Emberly, Nikolaus Rajewsky, Eric D Siggia.
Abstract
BACKGROUND: One of the important goals in the post-genomic era is to determine the regulatory elements within the non-coding DNA of a given organism's genome. The identification of functional cis-regulatory modules has proven difficult, since the component factor binding sites are small and the rules governing their arrangement are poorly understood. However, the genomes of suitably diverged species help to predict regulatory elements, based on the generally accepted assumption that conserved blocks of genomic sequence are likely to be functional. To judge the efficacy of strategies that prefilter by sequence conservation, it is important to know to what extent the converse assumption holds, namely that functional elements common to both species will fall within these conserved blocks. The recently completed sequence of a second Drosophila species provides an opportunity to test this assumption for one of the experimentally best-studied regulatory networks in multicellular organisms, the body patterning of the fly embryo.
Year: 2003 PMID: 14629780 PMCID: PMC302112 DOI: 10.1186/1471-2105-4-57
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The statistics of conserved sequence between pairs of species, for the range of parameters that maximized the correlation between the experimental binding sites and the conserved sequence. Two alignment programs were used, along with two measures of intersection: (a) a site is counted only if it lies entirely within a conserved block; (b) the number of conserved bases belonging to a site is counted. Columns give the fraction of the total sequence aligned in conserved blocks (with the percent identity of the conserved blocks in parentheses), the fraction of sites in blocks, the fraction when sites are placed randomly (see Methods), and the statistical significance. The percentage intervals in the three columns correspond to one another.
| method | conserved seq. (%) (PID) | sites in conserved seq. (%) | random sites in conserved seq. (%) | Z score |
| SMASH (a) | 41–51 (82–92%) | 51–71 | 37–54 | 5.0 |
| SMASH (b) | 33–51 (82–92%) | 48–74 | 42–57 | 5.0 |
| LAGAN (a) | 33–71 (82–92%) | 36–53 | 25–39 | 5.0 |
| LAGAN (b) | 31–54 (82–92%) | 50–80 | 41–58 | 6.0–8.0 |

| method | conserved seq. (%) (PID) | sites in conserved seq. (%) | random sites in conserved seq. (%) | Z score |
| SMASH (a) | 25–31 (83–92%) | 18–32 | 13–25 | 2.0 |
| SMASH (b) | 25–26 (83–92%) | 36–41 | 27 | 4.5 |
| LAGAN (a) | 29–56 (81–92%) | 23–36 | 18–32 | 2.25 |
| LAGAN (b) | 30–48 (81–92%) | 45–61 | 30–50 | 5.0 |
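The two intersection measures and the Z score reported in the tables can be illustrated with a minimal sketch. It assumes half-open `(start, end)` coordinates, non-overlapping conserved blocks, and a null model in which each site is placed uniformly at random along the sequence; the actual randomization is described in the paper's Methods and may differ. All function names here are hypothetical.

```python
import random
import statistics

def site_fully_conserved(site, blocks):
    """Measure (a): True only if the site lies entirely within one block."""
    s, e = site
    return any(bs <= s and e <= be for bs, be in blocks)

def conserved_bases(site, blocks):
    """Measure (b): number of bases of the site that fall in conserved blocks."""
    s, e = site
    return sum(max(0, min(e, be) - max(s, bs)) for bs, be in blocks)

def fraction_a(sites, blocks):
    """Fraction of sites entirely inside a conserved block."""
    return sum(site_fully_conserved(x, blocks) for x in sites) / len(sites)

def fraction_b(sites, blocks):
    """Fraction of all site bases that are conserved."""
    total = sum(e - s for s, e in sites)
    return sum(conserved_bases(x, blocks) for x in sites) / total

def z_score(sites, blocks, seq_len, measure, trials=1000, seed=0):
    """Compare the observed fraction with uniformly random site placements.

    Assumption: uniform placement of same-length sites; not necessarily the
    randomization used in the paper.
    """
    rng = random.Random(seed)
    observed = measure(sites, blocks)
    null = []
    for _ in range(trials):
        shuffled = []
        for s, e in sites:
            length = e - s
            start = rng.randrange(0, seq_len - length + 1)
            shuffled.append((start, start + length))
        null.append(measure(shuffled, blocks))
    mu = statistics.mean(null)
    sd = statistics.pstdev(null)
    z = (observed - mu) / sd if sd > 0 else float("nan")
    return observed, mu, z
```

Measure (a) is the stricter of the two: a site that straddles a block boundary contributes nothing under (a) but still contributes its overlapping bases under (b).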
Figure 3. Histograms of the amount of conserved sequence in blocks of various sizes. The data are from the comparison with D. pseudoobscura; the parameters are match 1, mismatch -1, gap start -6, gap continue 0 for LAGAN, and (1, -2, -2, -1) and (1, -2, -4, -2) for the two SMASH runs, respectively.
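As a rough illustration of what such a histogram tallies, the sketch below sums the conserved bases contributed by blocks in each size class. The bin width of 10 bp and the function name are assumptions for illustration only; the caption does not state the binning used in the figure.

```python
from collections import Counter

def conserved_bases_by_block_size(blocks, bin_width=10):
    """Sum conserved bases per block-size bin.

    blocks: iterable of half-open (start, end) conserved intervals.
    bin_width: hypothetical bin size in bp (not taken from the paper).
    Returns {lower bin edge: total bases in blocks of that size class}.
    """
    hist = Counter()
    for start, end in blocks:
        length = end - start
        # Each block adds its full length to the bin for its size class.
        hist[(length // bin_width) * bin_width] += length
    return dict(sorted(hist.items()))
```

Weighting each bin by bases rather than by block count shows how much of the conserved sequence sits in short versus long blocks.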
Figure 1. The fraction of sequence conserved within the modules of our test set vs. random. The distribution of conserved sequence for the 30 test modules is compared with two randomized sets. The red histogram distributes the modules randomly within the 200 kb of sequence we aligned around the regulated genes, while the green data are sampled from all noncoding sequence. The conserved blocks were computed from the SMASH alignments with parameters corresponding to method (a) in Table 1. There are 10 equally spaced bins for each histogram.
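The comparison in this figure can be sketched as follows: compute the conserved fraction of each module, place same-length modules at random positions within an aligned region, and bin both sets of fractions into 10 equal bins on [0, 1]. This is a minimal sketch under stated assumptions (half-open coordinates, non-overlapping blocks, uniform random placement); the helper names are hypothetical and the authors' sampling procedure may differ.

```python
import random

def fraction_conserved(module, blocks):
    """Fraction of a module's bases covered by conserved blocks."""
    start, end = module
    covered = sum(max(0, min(end, be) - max(start, bs)) for bs, be in blocks)
    return covered / (end - start)

def histogram10(fractions):
    """Counts in 10 equally spaced bins on [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * 10
    for f in fractions:
        counts[min(int(f * 10), 9)] += 1
    return counts

def random_modules(modules, region_len, rng):
    """Place modules of the same lengths uniformly at random in the region."""
    placed = []
    for s, e in modules:
        length = e - s
        start = rng.randrange(0, region_len - length + 1)
        placed.append((start, start + length))
    return placed
```

For example, `histogram10(fraction_conserved(m, blocks) for m in random_modules(modules, 200_000, random.Random(0)))` would give one randomized histogram for a 200 kb region.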