Majid Kazemian, Qiyun Zhu, Marc S Halfon, Saurabh Sinha.
Abstract
Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.
Year: 2011 PMID: 21821659 PMCID: PMC3239187 DOI: 10.1093/nar/gkr621
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. (A) General scheme for evaluation of a CRM discovery method. We first select a set of genes from BDGP (or FlyBase) with expression patterns commensurate with those of the CRMs in a data set. We next take the modules predicted for that data set and extract their nearest neighboring genes (‘predicted gene set’). (In this step, we ignore predicted modules that overlap with any training CRMs.) Finally, we perform a Hypergeometric test of enrichment between the expression gene set and the predicted gene set. (B) An IMM of order n is a mixture of Markov models up to order n. ‘0th MM’, ‘1st MM’, …, ‘nth MM’ denote Markov models of order 0, 1, …, n, respectively; the component models are combined using mixture weights. (C) The scorer scores every window of length 500 bp, with a 50 bp shift, across the entire genome. In this toy example, the score of each sliding window (orange lines) is shown as a blue bar.
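Panels (B) and (C) of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes one fixed mixture weight per order, add-one pseudocounts, and a window score defined as a log-likelihood ratio against a background model. All function names and these modeling choices are hypothetical; only the 500 bp window and 50 bp shift come from the caption.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_counts(seqs, max_order):
    """Count next-base occurrences after every context of length 0..max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for seq in seqs:
        for i, base in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[k][seq[i - k:i]][base] += 1
    return counts

def imm_prob(counts, weights, context, base, pseudo=1.0):
    """P(base | context) as a weighted mixture of order-0..n Markov models.

    Assumption: one fixed, positive weight per order (len(weights) == n + 1).
    Orders longer than the available context are skipped and the remaining
    weights are renormalized.
    """
    mix, used = 0.0, 0.0
    for k, w in enumerate(weights):
        if k > len(context):
            break
        row = counts[k].get(context[len(context) - k:], {})
        p_k = (row.get(base, 0) + pseudo) / (sum(row.values()) + pseudo * len(ALPHABET))
        mix += w * p_k
        used += w
    return mix / used

def log_likelihood(seq, counts, weights):
    """Log-likelihood of seq under the mixture model."""
    n = len(weights) - 1
    return sum(
        math.log(imm_prob(counts, weights, seq[max(0, i - n):i], seq[i]))
        for i in range(len(seq))
    )

def scan(genome, crm_counts, bg_counts, weights, window=500, step=50):
    """Score every window; returns a list of (start, score) pairs.

    Assumption: score = log-likelihood under the CRM-trained model minus
    log-likelihood under a background-trained model.
    """
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        w = genome[start:start + window]
        score = log_likelihood(w, crm_counts, weights) - log_likelihood(w, bg_counts, weights)
        scores.append((start, score))
    return scores
```

Calling `scan` on a 600 bp sequence with the default settings scores three windows, starting at positions 0, 50 and 100.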
Figure 2. Evaluation of methods. (A) For each method, the number of data sets for which the evaluation P-value is significant at different LLHT P-value thresholds. For clarity, all single-species methods are shown as dashed lines and the two multi-species methods as solid lines. (B) The y-axis shows the number of overlaps between the expression gene set and the top k predicted genes for the ‘imaginal_disc.2’ data set. (See Supplementary Figures S2A and S2B for all other data sets.)
Comparison between evaluation P-values of msIMM and the best method from (20), for the 15 data sets that were reported on by (20)
The second and third columns show the total number of genes in the expression data source and the size of expression gene set for each data set, respectively (the size of the predicted gene set is 200 for all data sets). The lowest (best) P-value for each pair is shaded. Note that the P-values reported here are LLHT P-values, which are different from the standard Hypergeometric P-values shown in Table 2 of (20).
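For orientation, the standard Hypergeometric enrichment P-value that the LLHT extends can be computed as shown here. This is a minimal sketch of the textbook test only. It does not implement the LLHT, and the function name and example numbers are hypothetical rather than taken from the table (apart from the predicted gene set size of 200).

```python
from math import comb

def hypergeom_enrichment_p(total_genes, expr_set_size, predicted_size, overlap):
    """P(X >= overlap) when predicted_size genes are drawn without replacement
    from total_genes, of which expr_set_size carry the expression pattern."""
    upper = min(expr_set_size, predicted_size)
    denom = comb(total_genes, predicted_size)
    tail = sum(
        comb(expr_set_size, i) * comb(total_genes - expr_set_size, predicted_size - i)
        for i in range(max(overlap, 0), upper + 1)
    )
    return tail / denom

# Hypothetical numbers: 5000 genes in the expression data source, 300 in the
# expression gene set, 200 predicted genes, 30 genes shared by both sets.
p = hypergeom_enrichment_p(5000, 300, 200, 30)
```

Exact integer arithmetic via `math.comb` avoids the overflow and rounding issues that floating-point factorials would cause at genome-scale gene counts.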
Discrimination between control regions of an expression gene set and random sequences of matching lengths
The second and third columns show the number of training CRMs and the size of expression gene set, respectively, for each data set. Scores of genes in an expression gene set were compared to scores of a collection of randomly chosen genomic regions. The score of a sequence is the maximum score in that region, under a CRM prediction scheme. For each gene in the expression gene set, 50 random genomic segments of length equal to the gene's territory length were included in the random collection. The last seven columns show the P-values (Wilcoxon rank sum test) of such a comparison for each data set and for each method. The best P-value for each data set is shaded. The last row indicates the number of times that a method is superior (smallest P-value).
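The comparison described above can be sketched as follows: each region receives the maximum of its window scores, and gene-territory scores are compared with random-region scores by a rank sum test. This is an illustrative, standard-library-only sketch. The function names are hypothetical, the P-value uses a normal approximation with mid-ranks for ties and no continuity correction, and the one-sided alternative (gene territories score higher) is an assumption, since the caption does not state the sidedness.

```python
import math
from collections import Counter

def region_score(window_scores):
    """Score of a region = maximum score of any window inside it."""
    return max(window_scores)

def rank_sum_pvalue(x, y):
    """One-sided P-value that values in x tend to be larger than those in y.

    Normal approximation to the Wilcoxon rank sum statistic, with a tie
    correction of the variance. Requires at least two values in total.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))
    # Mid-rank of each distinct value: average of the 1-based positions it occupies.
    rank_of, i = {}, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    r1 = sum(rank_of[v] for v in x)
    u = r1 - n1 * (n1 + 1) / 2
    ties = sum(t**3 - t for t in Counter(pooled).values())
    var = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var <= 0:
        return 0.5  # all values identical: no evidence either way
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

In the setting of the caption, `x` would hold one `region_score` per gene territory and `y` the scores of the 50 length-matched random segments drawn for each gene.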
Figure 3. In vivo validation of 12 Drosophila CRM predictions. Transgenic embryos were stained with antibodies to GFP to detect reporter gene expression. Embryos are oriented anterior to the left. Panels A, F, G, K, M, O and S are dorsal views; H and N are ventral views; the remainder are lateral views with dorsal to the top. ‘Pattern-specific’ CRMs are shown in panels A–L and non-pattern specific CRMs in panels M–T. (A) The Antp_161 CRM drives expression in several tissues, including the visceral mesoderm (arrowhead) and the non-mesodermal midgut (asterisk). The endogenous Antp gene is expressed in a much more restricted manner, suggesting that the reporter gene expression is either ectopic or that the CRM is associated with a different gene of undetermined identity. (B) Co-labeling for Tropomyosin (magenta) shows GFP expression (green) in somatic muscles, in particular the lateral transverse muscle fibers (arrowheads). (C and D) The how_171 CRM drives expression in the mesoderm in both mid-stage (C) and late-stage (D) embryos, consistent with how gene expression. (E–H) The noc_532 CRM is active in many noc-positive tissues throughout embryogenesis. Pictured is metameric expression in both ectoderm and mesoderm at stage 9 (E), in the visceral mesoderm (F, arrowhead), in the mesodermally-derived lymph glands (G, arrowhead), the ventral nerve cord (H, arrow) and the hindgut (H, arrowhead). (I) Mhc_537 regulates reporter gene expression (green) in a subset of Mhc-positive mesodermal cells including longitudinal visceral muscles (arrow) and several ventral oblique somatic muscle fibers (arrowheads). (J) Longitudinal visceral muscle precursors express GFP under the control of the slou_847 CRM (arrowheads); these cells are not positive for endogenous slou expression. Expression is also observed in the supraesophageal ganglion (K) and ventral nerve cord (data not shown), consistent with known slou expression patterns.
Inset shows a more dorsal view of the anterior portion of the embryo in the main panel. (L) mbc_64 drives expression in the mbc-positive mesoderm, pictured here at stage 16. (M and N) Expression driven by the lola_648 CRM is confined to the central nervous system in both mid-stage (M) and late-stage (N) embryos, consistent with lola expression. lola_648 overlaps the independently discovered CRM40 of (40). (O) The slou_828 CRM regulates reporter gene expression in tissues that are not slou-positive including cells in the antenno-maxillary complex (black arrow), the posterior spiracles (arrowhead) and cells in the anterior and posterior portions of the foregut (white arrow and data not shown). (P) slou_828-controlled expression is also observed in the midgut, consistent with normal slou expression. (Q and R) The rib_120 CRM drives expression throughout hindgut development, part of the normal rib expression pattern. (S and T) Reporter gene expression regulated by the ama_299 CRM is restricted to the central nervous system. Although ama is not expressed in the CNS of late-stage embryos (41), earlier expression in the ventral midline beginning at stage 8 (data not shown) is consistent with ama expression at that stage. (U) Summary of results and msIMM scores for each predicted CRM. Names are based on the closest gene to each predicted module and do not necessarily reflect the actual gene regulated by the CRM. ‘Data set’ indicates the data set from which the CRM was predicted (‘S’ and ‘M’ for somatic muscle and mesoderm respectively). ‘Pattern?’ indicates whether the tested sequence drives a spatial and/or temporal expression pattern. ‘Matches training set?’ and ‘Matches gene?’ indicate whether the expression pattern agrees with that of the training set or the nearest gene, respectively. Check marks/blue coloring denote a positive result, crosses and yellow coloring a negative result.
Mixed blue and yellow coloring is used for cases where both endogenous and ectopic gene expression patterns are observed. ‘Score’ shows msIMM scores for each tested CRM. The gray arrow points to the best decision stump on msIMM scores in terms of predicting pattern-specific CRMs.