Literature DB >> 17224917

Tissue-specific regulatory elements in mammalian promoters.

Andrew D Smith¹, Pavel Sumazin, Michael Q Zhang.

Abstract

Transcription factor-binding sites and the cis-regulatory modules they compose are central determinants of gene expression. We previously showed that binding site motifs and modules in proximal promoters can be used to predict a significant portion of mammalian tissue-specific transcription. Here, we report on a systematic analysis of promoters controlling tissue-specific expression in heart, kidney, liver, pancreas, skeletal muscle, testis and CD4 T cells, for both human and mouse. We integrated multiple sources of expression data to compile sets of transcripts with strong evidence for tissue-specific regulation. The analysis of the promoters corresponding to these sets produced a catalog of predicted tissue-specific motifs and modules, and cis-regulatory elements. Predicted regulatory interactions are supported by statistical evidence, and provide a foundation for targeted experiments that will improve our understanding of tissue-specific regulatory networks. In a broader context, methods used to construct the catalog provide a model for the analysis of genomic regions that regulate differentially expressed genes.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：
RNA, Messenger

Year: 2007 PMID： 17224917 PMCID： PMC1800356 DOI： 10.1038/msb4100114

Source DB: PubMed Journal: Mol Syst Biol ISSN： 1744-4292 Impact factor: 11.429

Introduction

Reverse engineering mammalian transcriptional regulatory circuits can be achieved using systematic methodology that includes both computational and experimental techniques, working in tandem to generate, refine and verify hypotheses. Understanding tissue-specific transcription is a necessary step for extending regulatory circuit reverse-engineering efforts from single-cell eukaryotes to metazoans. We recently demonstrated that the information in proximal promoters can predict a significant portion of tissue-specific elevated or inhibited expression (Smith ). Here, focusing on tissue-specific regulatory pattern identification and prediction accuracy instead of proof of existence, we use refined analysis and data curation methods to discover and catalog high-confidence regulatory interactions and sites. This catalog will assist experimental efforts to reverse engineer tissue-specific transcriptional regulatory networks from the bottom up. Numerous techniques for analysis of regulatory sequences have been proposed, and the problem of module identification is now receiving due attention (Zhou and Wong, 2004; Gupta and Liu, 2005; Zhu ). Previously characterized binding site motifs have been used to infer transcription factor function in certain tissues (Nelander ). Xie identified conserved motifs across ortholog promoters of four mammalian genomes. Robertson describe cisRed, a database that integrates genome annotation data, homology data and genome alignments to identify motifs with conserved sites across mammals. We analyzed proximal promoters with evidence for tissue-specific regulation in order to identify tissue-specific motifs, modules and their sites in proximal promoters. We developed a new technique for characterizing tissue-specific modules that ensures that each module component significantly improves tissue-specific module enrichment. We integrated multiple sources of expression data to identify reliable sets of transcripts that are under tissue-specific regulation in human and mouse. Using transcription start site (TSS) annotation in Cold Spring Harbor Mammalian Promoter Database (CSHLmpd) (Xuan ), we compiled sets of proximal promoters corresponding to transcripts with evidence for specific regulation in the selected tissues. Our analysis was based on motifs discovered de novo (called novel motifs) using DME (Smith ) and DME-B (Smith ), as well as previously characterized vertebrate binding-site motifs (called known motifs) from TRANSFAC (Matys ) and JASPAR (Sandelin ). We evaluated motifs according to enrichment in tissue-specific promoters relative to other promoters from the same species. We showed that motifs associated with factors with known tissue-specific roles rank high for enrichment, that motif ranks are significantly correlated between human and mouse and that this same set of motifs and their corresponding cis-elements are unlikely to be identified using traditional, order-preserving alignments of ortholog promoter sequences. We constructed modules of interacting motifs (both novel and known motifs), ensuring that each component contributed significantly to the enrichment of the whole module. We annotated tissue-specific promoters with predicted tissue-specific regulatory elements and demonstrated that these sites are in excellent agreement with experimentally annotated liver-specific sites in the human albumin promoter and skeletal-muscle-specific sites in the human α-actin promoter. Both promoters are particularly well annotated with experimentally verified tissue-specific regulatory elements and permit an informative comparison. In other tissues, we gave predicted sites for tissue-specific motifs in representative promoters. The complete data and analysis are available in TCat: The Catalog of Tissue-Specific Regulatory Motifs (http://rulai.cshl.edu/tcat).

Results

We describe first steps toward cataloging high-confidence tissue-specific motifs, modules and their sites. We first collected and integrated expression and function data from various sources, and identified transcripts that are likely to be under tissue specific regulation. We demonstrated that transcripts with evidence for tissue-specific regulation from multiple expression sources in one species (human or mouse) are significantly more likely to have evidence for tissue-specificity in the other species. We analyzed and annotated proximal-promoter sets in seven representative tissues from both human and mouse, demonstrating that motifs and predicted binding sites are in agreement with experimentally verified data and that analyses in human and mouse are significantly correlated. We also showed that the top-scoring sites in orthologous tissue-specific promoters from human and mouse rarely have significant conservation of site order, suggesting that comparative genomics alone may not be sufficient to decode the regulatory signals in these proximal promoters.

Transcripts under tissue-specific regulation

Few transcripts have expression restricted to a single tissue, but many transcripts appear to be regulated in a tissue-specific manner (Su ), and the corresponding promoters are likely to contain tissue-specific regulatory elements. To circumvent problems associated with individual sources of information, we used a voting system that combined information about expression, function and tissue specificity from different sources. Table I gives the number of transcripts with single and multiple sources of evidence (votes) for tissue-specific regulation in each tissue. Orthologs of transcripts with multiple votes for tissue-specific regulation were more likely to have evidence for specific regulation in that tissue, suggesting that the false-positive rate for calling a transcript tissue specific is lower when based on multiple votes. The number of ortholog transcript pairs with multiple votes for tissue-specific regulation in both species ranged from 1 in CD4 T cells to 69 in liver. Table II lists genes and orthologous transcripts with votes for skeletal-muscle-specific regulation in both human and mouse. Gene and transcript lists for other tissues are given in Supplementary Section 1.5.

Table 1

Ability of single versus multiple votes to predict tissue-specificity of a transcript's ortholog

	Human		Mouse		Common evidence
Tissue	Multiple	Single	Multiple	Single	Multiple		Single		P-value
CD4 T-cells	2	247	6	212	3/7	42.9%	36/435	8.3%	1.79E−02
Heart	28	260	105	560	35/122	28.7%	102/766	13.3%	3.78E−05
Kidney	43	188	172	540	42/200	21.0%	66/706	9.3%	1.74E−05
Liver	152	411	271	651	148/354	41.8%	184/982	18.7%	6.46E−17
Pancreas	31	186	47	313	26/61	42.6%	75/450	16.7%	9.93E−06
Skeletal muscle	49	394	141	681	52/174	29.9%	137/1000	13.7%	4.47E−07
Testis	38	287	446	668	67/471	14.2%	86/923	9.3%	4.09E−03
Columns labeled ‘Multiple' and ‘Single' give the number of transcripts with multiple and single votes for specificity, respectively. Columns under ‘Common evidence' show the proportion of transcripts with single and multiple votes for tissue-specificity that have an ortholog with at least one vote for specificity in the same tissue. Excluding CD4 T-cells (which represents a small sample), human and mouse transcripts with multiple votes for tissue-specificity are significantly more likely to have an ortholog with at least one vote for specificity in the same tissue (P<0.01; Fisher's exact test).

Table 2

Transcripts with multiple votes for tissue-specificity in both human and mouse skeletal muscle

Symbol	Name	Human RefSeq	Mouse RefSeq	Votes
MYH2	Myosin, heavy polypeptide 2	NM_017534	NM_144961	7
TTID	Myotilin	NM_006790	NM_021484	6
TNNT3	Troponin T type 3	NM_006757	NM_011620	6
TNNC2	Troponin C type 2	NM_003279	NM_009394	6
MYBPC2	Myosin binding protein C	NM_004533	NM_146189	6
HUMMLC2B	Fast skeletal myosin light chain 2	NM_013292	NM_016754	6
ACTN2	Actinin α2	NM_001103	NM_033268	5
VAMP5	Vesicle-associated membrane protein 5	NM_006634	NM_016872	4
TRIP10	Thyroid hormone receptor interactor 10	NM_004240	NM_134125	4
TPM3	Tropomyosin 3	NM_153649	NM_022314	4
SGCG	Sarcoglycan γ	NM_000231	NM_011892	4
MYOD1	Myogenic differentiation 1	NM_002478	NM_010866	4
MYF6	Myogenic factor 6 (herculin)	NM_002469	NM_008657	4
CKM	Creatine kinase, muscle	NM_001824	NM_007710	4
CACNG1	Calcium channel, voltage-dependent γ1	NM_000727	NM_007582	4
The ‘Votes' column gives the total number of votes for skeletal muscle-specificity in both human and mouse. Our analysis used promoter sets of size 100 for both tissues, including promoters that correspond to transcripts with a single vote for tissue-specific regulation. Tables for the remaining 6 tissues are given in supplementary material.

Enrichment of known tissue-specific motifs

Knowledge of factors and corresponding binding sites that regulate tissue-specific transcription can be used to evaluate motif ranking. We measured motif enrichment in tissue-specific promoter sets using balanced error rates, evaluating motifs for their ability to distinguish tissue-specific sets (foreground sets) from background sets that are composed of non-tissue-specific promoter samples from CSHLmpd. Balanced error rates measure proportions of misclassified promoters after normalization of foreground and background sizes. We ranked motifs according to enrichment and determined whether the ranks assigned to binding-site motifs for factors with known tissue-specific roles are significantly elevated in the corresponding tissues. The results presented in Table III demonstrate that binding-site motifs for these factors ranked significantly high (P<0.01) according to a Wilcoxon signed-ranks test in almost all tissues tested. Excluding DBP in human liver and HNF-3 in mouse liver, these factors had evidence for expression in their respective tissues, and their binding-site motifs were highly enriched in our foreground sets. Results are summarized in Table IV, and the motifs with greatest enrichment in each tissue are given in TCat. Table IV also includes information for HNF-6 in liver. HNF-6 is a known liver regulator, but there is no evidence for its expression in liver based on our data, and its binding-site motif was not enriched in our liver foreground sets. In addition, C/G- or A/T-rich motifs are likely to be enriched in foreground sets that are C/G or A/T rich relative to promoter base composition. To eliminate this potential bias, we adjusted the GC content in background sets to match foreground sets. Some known tissue-specific motifs were identified as enriched only after GC content correction.

Table 3

Significance of elevated ranks for motifs associated with important factors in liver, skeletal muscle and testis

Tissue	Factors	Motifs	Human P-value	Mouse P-value
Liver	HNF-1, HNF-3, HNF-4, C/EBP, DBP	68	2.72E−18	4.19E−12
Skeletal muscle	MEF-2, SRF, Myogenin, Sp1	45	1.33E−14	2.29E−5
Testis	SRY, CREM, RFX	30	0.087	1.89E−4
Motifs give the total number of motifs associated with the listed factors.

Table 4

Evidence for expression and classification quality of binding-site motifs for factors with known tissue-specific regulatory roles

				Expressed		Classifies
Tissue	Factor		Similarmotifs	Hs	Mm	Hs	Mm	Comment
Liver	HNF-1	Hepatocyte nuclear factor 1; Member of the homeodomain factor family.		Yes	Yes	Yes	Yes	Ranks in the top 3 in both species.

	HNF-3	Hepatocyte nuclear factor 3; Member of the fork-head factor family.	HFH, XFD and certain other fork-head family members.	Yes	No	Yes	Yes	Unranked in mouse or mouse orthologs of human liver-specific promoters.

	HNF-4	Hepatocyte nuclear factor 4; Member of thyroid hormone receptor-like family.	PPAR, COUP; Steroid/thyroid hormone receptor-like super-family.	Yes	Yes	Yes	Yes	Ranks 1st in both species.

	HNF-6	Hepatocyte nuclear factor 6; A homeodomain factor from the CUT subfamily.		No	No	No	No	The known motif in TRANSFAC (v9.4) may be poorly characterized.

	C/EBP	CCAAT/enhancer binding protein; Variants form a subfamily of basic region leucine zipper family.		Yes	Yes	Yes	Yes	Ranks in the top 10 in both species and is known to interact with other high-ranking motifs.

	DBP	Albumin D-site binding protein; A member of PAR family of b-ZIP factors.	PAR family: HLF, TEF and VBP.	No	Yes	Yes	Yes	All three sources call DBP present in mouse, none of the three call DBP present in human.

Skeletal Muscle	MEF-2	Myocyte-specific enhancer factor 2; A member of the MADS domain family.	Splice variant RSRFC4.	Yes	Yes	Yes	Yes	Mouse skeletal muscle promoters are enriched with C/G-rich motifs. The A/T-rich MEF-2 ranks 2nd after GC-content correction.

	SRF	Serum response factor; a member of the MADS domain family.		Yes	Yes	Yes	No	The A/T-rich SRF is not identified in mouse even after GC-content correction.

	MyoD	Myogenic factor 3; Member of the myogenin family.	Myogenin family, E-box binding factors.	Yes	Yes	Yes	Yes	Ranks in the top 5 in both species and predicted to interact with SRF and MEF-2.

	Sp1	Stimulating protein 1; A ubiquitous factor with a Cys2His2 zinc finger domain.		Yes	Yes	Yes	Yes	The C/G-rich motif is higly ranked with and without GC-content correction.

Testis	SRY	Sex-determining region on Y chromosome; Member of the high-mobility group (HMG) class of factors.	SOX factors.	Yes	Yes	Yes	No	The A/T-rich motif is ranked 2nd in human after GC-content correction.

	CREM	Cyclic AMP-responsive element modulator; Member of the CREB/ATF subfamily of the bZIP factors.	CREB/ATF family.	Yes	Yes	Yes	Yes	C/G-balanced and synergistic, ranks 1st in mouse with and without GC-content correction.

	RFX	Regulatory factor X; subfamily of the forkhead factors with winged-helix binding domains.		Yes	Yes	No	Yes	The core of the RFX motif (GTTGCCA) is highly similar to the reverse of the core MYB motif (CCGTTG), ranking top in human.
A motif classifies foreground from background if it is ranked in the top 20 distinct motif classifiers.

Nuclear receptor binding-site motifs and E-box motifs are among the top enriched motifs in 11 and 10 of the 14 human and mouse tissues, respectively. A Wilcoxon signed-ranks test showed that nuclear receptor and E-box motifs (represented by 54 and 39 TRANSFAC motifs, respectively) are enriched in the union of our foreground sets with P-values below 6.07E−14 and 2.02E−13 (nuclear-receptor motifs) and 2.22E−10 and 4.57E−03 (E-box) in human and mouse, respectively. These results suggest a diversity of tissue-specific roles for nuclear receptors and E-box binders, likely mediated by tissue-specific cofactors.

Tissue-specific cis-regulatory elements

Highly enriched motifs, and the associated score thresholds identified by our methods, provide a starting point for targeted experimental annotation of tissue-specific promoters. Figure 1 shows known and predicted sites mapped on the −500 to +100 region of the human albumin promoter and the −250 to +50 region of the human skeletal muscle α-actin promoter. Human albumin has known functional binding sites for HNF-1, C/EBP, AFP and NF-Y (Paonessa ; Sawadaishi ; Frain ; Li ), all of which were identified among the top predicted motifs or included in top modules for liver. Locations of known sites for C/EBP (at −437) and NF-Y (at −125) do not perfectly align with the corresponding predicted sites (at −462 and −143, respectively), but current knowledge about binding sites for those factors raises the possibility that the predicted locations are more accurate. The only predicted binding site for these factors in the human albumin promoter that is not depicted in the figure is a C/EBP site at −956. Human skeletal muscle α-actin has known sites for SRF, TEF and a known TATA box (Boxer ; MacLellan ), all among top predicted motifs for human skeletal muscle (TATA box motifs have high similarity with MEF-2 motifs). Figure 2 gives predictions in a representative human promoter in each of the remaining tissues.

Figure 1

Verified and predicted binding sites in human albumin and skeletal muscle α-actin promoters. Predicted sites are represented by horizontal bars and verified sites by vertical bars. Verified sites for albumin (Paonessa ; Sawadaishi ; Frain ; Li ) and for α-actin (Boxer ; MacLellan ) were mapped to the promoter from CSHLmpd to obtain their correct locations relative to the TSS.

Figure 2

Predicted binding sites for selected factors in promoters from the human tissue-specific sets. The selected factors are among the top ranked in the corresponding tissues.

Comparison to previous results

Previous analysis of tissue-specific patterns in regulatory regions includes analysis based on cross-species conservation (Xie ) and coexpression (Smith ). Xie identified conserved elements in orthologous promoters of four mammalian genomes. They found 59 experimentally validated motifs that are significantly conserved and enriched in at least one human tissue. These include E-box, ETS, MEF2, MEIS1 and NF-1 in skeletal muscle; Chx10 in kidney; NRF-1, ELK-1, GABP and E12 in CD4 T cells; AP-4 and MEF-2 in heart; and NRF-1 in testis. Our results agree with Xie on the enrichment of E-Box and MEF-2 in skeletal muscle, ETS in CD4 T-cells and E-box in pancreas. In previous work (Smith ), we tested the hypothesis that proximal promoters contain information that can be used to predict tissue-specific expression. We were not concerned with identifying the most significant tissue-specific motifs, modules and sites. Considering the difference between the goals of the two projects, it is not surprising that the predictive models described by Smith have little similarity to our top motifs and modules. The most significant similarities between our top tissue-specific patterns and the predictive models of Smith include the enrichment of ETS in CD4 T-cell-specific promoters and the enrichment of Smith motifs Novel3 and Novel6 in mouse testis and Novel1 in human testis. The three novel testis motifs are very similar to motifs that rank in the top 100 in our analysis, but the enrichment of these motifs was not sufficiently high for inclusion in TCat.

Correlation between human and mouse regulatory regions

We compared motif enrichment ranks in each human foreground set to ranks in the corresponding mouse foreground set using Spearman's rank correlation test, and found that enrichment ranks across species are highly correlated (P<0.001) for all but CD4 T cells (Supplementary Table 15). In CD4 T cells, motif enrichment ranks are similar only for few highly enriched motifs. Despite the motif-enrichment ranks correlation, the order of the top predicted binding sites is not usually conserved between orthologous promoters. Fewer than 10% of orthologous pairs showed significant (P<0.01) conservation of site order. Weak site-order conservation suggests that the top tissue-specific sites would be difficult to identify using traditional cross-species alignment alone, and methods that rely on co-linear promoter alignment may have high false-negative detection rates. This evidence is in agreement with Frith , who found that homologous transcription start sites can be separated by more than 100 nucleotides. A list of the nine genes (out of 102 candidates) with significant conservation of site order is given in Supplementary Section 2.3.

Materials and methods

The steps used in creating the catalog include (1) identifying tissue-specific transcripts, (2) identifying factors that are expressed in each tissue, (3) obtaining promoter sequences for tissue-specific transcript, and (4) identifying individual motifs and modules (i.e. sets of interacting motifs) that characterize tissue-specific promoter sets.

Identifying tissue-specific transcripts

To identify motifs and modules that regulate tissue-specific transcription, we analyzed promoters of transcripts that appear to be regulated in a tissue-specific manner. If an information source indicated that a transcript has restricted expression, unusually high expression, or a specific function in the tissue, that source voted for tissue specificity of the transcript. For each tissue, we sorted the transcripts according to the number of votes received, retaining the top 100 with distinct TSS as tissue specific. Ties in the ranking were broken according to intensity values from the GNF SymAtlas expression data (discussed below), which we have found to be the most complete and the most reliable source of tissue-based expression information. We used the same number of transcripts for each tissue to facilitate comparison across tissues, and 100 sequences provided sufficient information for our analysis while allowing identification of well-known tissue-specific motifs.

Microarray data

The GNF SymAtlas microarray data were generated using Affymetrix HG-U133A array and the custom GNF1H and GNF1M Affymetrix arrays, and include expression profiles for 79 human and 61 mouse tissues (Su ). Among these are the seven tissues we selected to include in the catalog. Tissues were selected with consideration to data availability in GNF and other sources and interest from Zhang lab members and collaborators. A transcript received a vote for tissue specificity from this information source if it was called present and its intensity exceeded its mean across all tissues by 3 standard deviations. The Hughes Toronto microarray data (Zhang ), which was generated using custom-built oligonucleotide arrays, provide mouse expression profiles for 55 tissues, including all of our tissues but CD4 T cells. A transcript received a vote for tissue specificity if it was called present in the tissue, and had intensity at least 10 standard deviations above its mean across all 55 tissues. This large number of standard deviations was required to limit the number of transcripts receiving positive votes. The GeneNote expression profiles (Shmueli ), which were generated using the Affymetrix GeneChip HG-U95, provide human expression data for 12 tissues, including all of our tissues but CD4 T cells and testis. The GeneNote data were used in the same way as the GNF SymAtlas data, with a transcript being called tissue specific if it is present in that tissue and has intensity at least 3 standard deviations above its mean across all 12 tissues.

EST data

dbEST is a database of expressed sequence tags (Boguski ), and contains source information, such as the tissue of origin, for each EST. This information is used to annotate UniGene clusters with the source data, and a UniGene is said to have restricted expression in a tissue if more than half of the ESTs contributing to that UniGene have the same source tissue. A transcript received one vote for specificity in a particular tissue if the corresponding UniGene cluster is annotated as having expression restricted to that tissue.

GO terms

We associated a set of GO Terms with each tissue. This was performed by compiling a set of keywords for each tissue (e.g. ‘renal' was associated with kidney; ‘sperm' was associated with testis), and searching GO Term names and definitions for those keywords. This produced, for each tissue, a set of GO Terms that were subsequently reviewed to ensure that the context of the keywords was appropriate. A transcript of a gene annotated with a GO Term that is associated with a tissue received a vote for specificity in that tissue.

Selecting promoter sequences

Although regulatory elements can exist almost anywhere in the genome, they are concentrated near the TSS (Cooper ). We used the CSHLmpd to map transcripts to promoters, using experimentally confirmed promoters from EPD (Perier ), DBTSS (Suzuki ) and GenBank, as well as computationally predicted promoters. For each promoter, we used the proximal sequence region of −1000 to +100 relative to the TSS. Each part of our analysis is based on comparing the tissue-specific promoter sets to a background of random promoters from the same species. For each tissue, a background set was constructed by selecting 1000 transcripts uniformly at random from the set of RefSeqs for the corresponding species with TSS annotation in CSHLmpd. For each tissue, transcripts with at least one vote for specificity in that tissue were removed from consideration before selecting the background. Because our analysis focused on proximal promoters, −1000 to +100 relative to the TSS, if the TSS annotation is off by several hundred base pairs, important promoter regions might be excluded. AKR1D1, identified as liver specific by our voting system, has two known TSSs within 500 bp of the first exon for the corresponding RefSeq (NM_005989) (Charbonneau and Luu-The, 1999). We used the TSS located upstream of the first exon, but could have chosen to use the other promoter, which was annotated by a generally more reliable source (DBTSS versus GenBank). Currently, in such situations, there is little information that identifies the promoter responsible for observed tissue-specific regulation, but comparative genomics and rapidly improving arrays promise better 5′ end identification, thus improving proximal promoter annotation and association between transcripts and promoters (Kim ; Carninci ). Negative promoter sets can be used to cancel out patterns that are not related to tissue-specific transcription regulation. We use random negative promoter sets with and without GC-content correction; this correction cancels the inuence of genomic GC-content isochore variability.

Identifying and evaluating motifs

Given a motif M (represented as a position–frequency matrix) and a sequence S, the max-score of M in S, max-score (M, S) is the score of the top scoring subsequence of S when aligned against the scoring matrix for M. Details on constructing and using scoring matrices can be found in Stormo (2000). For a fixed threshold λ, the max-score classification method classifies S as belonging to the foreground if max-score (M,S)⩾λ, and the background otherwise. Given a set of foreground sequences FG (i.e. the tissue-specific promoters) and a set of background sequences BG, the sensitivity of M and λ under max-score classification is and the specificity is The balanced error rate for M and λ under max-score classification is then The quantity of interest in our analysis corresponds to the optimal value of λ for M in distinguishing FG from BG: Many known motifs are similar to each other, usually owing to similar binding specificities for distinct factors or distinct origins for mofits associated with a single factor. We used MATCOMPARE (Schones ) to eliminate redundancies in the sets of known and novel motifs.

Identifying and evaluating modules

Modules are used to classify sequences based on the max-score values of the motifs they contain. Let ℳ={M, …, M} be a module and Λ={λ1, …, λ} be an associated set of thresholds. The max-score classification for modules assigns sequence S to the foreground if and only if max-score (M, S)⩾λ holds for all i (1⩽i⩽k). Given sets of sequences FG and BG, the sensitivity of ℳ and Λ under max-score classification is and the specificity is The balanced-error rate for ℳ and Λ under the max-score classification is As with mofits, we are interested in the optimal value of Λ and define Because modules are intended to describe synergistic function of a set of motifs, we are interested in modules whose performance is better than expected given the performance of the individual motifs composing the module. For a module ℳ composed of k motifs, let ℳ′⊂ℳ minimize B(ℳ′, FG, BG) over all size k−1 modules built from motifs in ℳ, and let ℳ′=ℳ\ℳ′. To assess whether ℳ, with balanced-error rate u, significantly improves over ℳ′ and ℳ′, we use the probability We estimated this probability empirically, by sampling from the distribution of balanced-error rates resulting from intersections of sets with balanced-error rates B(ℳ′, FG, BG) and B(ℳ′, FG, BG). We used MODULATOR, which is available in CREAD (Smith ), to construct modules. Given a set of motifs, a set of foreground sequences and a set of background sequences, MODULATOR identifies those modules composed of the given motifs that have the best balanced-error rates. A branch-and-bound algorithm is used to simultaneously optimize the score thresholds for the motifs in a module. Modules are constructed by adding motifs to existing modules until a user-specified module size is reached or until motif addition does not significantly improve enrichment. Each time a motif is added to a module, the resulting larger module is retained only if the balanced-error rate of the larger module is improved significantly above expectation. The initial modules of size two are obtained by combining pairs of motifs. For modules that are entirely composed of known motifs, the top 100 motifs (before eliminating redundancies) were used. Modules were allowed to contain up to four motifs for reasons of computational feasibility, but many top modules are smaller. Novel modules, which must contain at least one novel motif, were constructed using the top 100 novel motifs and the top 100 known motifs. Redundancies were removed from the lists of top modules using a procedure described in Supplementary Section 2.

Measuring the significance of motifs and modules

To measure significance of enrichment for top known motifs we used known motifs to classify randomly assembled promoter sets. We constructed 1000 foreground/background pairs for each species by selecting 100 sequences for each foreground and 1000 for each background uniformly at random from CSHLmpd. For each foreground/background pair we calculated the balanced-error rate of each known motif. The best balanced-error rates overall obtained on random samples for human and mouse were 0.364 and 0.368, respectively. We used the distribution of these error rates to identify the q-value (Storey and Tibshirani, 2003) significance of the error rate of each motif. Tissues whose highest ranking motifs fail the q<0.05 test include CD4 T cells and heart in human, and CD4 T cells in mouse. TCat includes q-value annotation for each ranked known motif. The full set of motifs for which q<0.05 is estimated to include five false leads per 100 predictions. We did not obtain statistical significance measures for novel motifs, because this will require running DME and DME-B more times than is computationally feasible. Modules were identified by combining motifs whose cooccurrence was enriched in the foreground sets. We measured enrichment of modules using the balanced-error rate (analogous to that of motifs), and we required that each motif in a module contributes significantly to the enrichment of the module as a whole. To test significance, we randomly selected 100 of the 1000 foreground/background pairs used to evaluate individual known motifs, and performed the module identification procedure on each of the 100 selected pairs. The best balanced-error rates for modules that are entirely composed of known motifs (called known modules) in random human and mouse sets were 0.3145 and 0.304, respectively. We used these balanced-error rates as an estimate of the critical value for P<0.01. We opted for using P-value cutoffs instead of computing q-values because accurate q-value estimation for modules is computationally prohibitive. Top known modules in human kidney (0.2955), liver (0.3105), pancreas (0.3125) and testis (0.3) scored better than the cutoff, as did top known modules in mouse kidney (0.3005), liver (0.3025) and testis (0.2945). Supplementary information

31 in total

Review 1. DNA binding sites: representation and discovery.

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

2. GeneNote: whole genome expression profiles in normal human tissues.

Authors: Orit Shmueli; Shirley Horn-Saban; Vered Chalifa-Caspi; Michael Shmoish; Ron Ophir; Hila Benjamin-Rodrig; Marilyn Safran; Eytan Domany; Doron Lancet
Journal: C R Biol Date: 2003 Oct-Nov Impact factor: 1.583

3. JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors: Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. Statistical significance for genomewide studies.

Authors: John D Storey; Robert Tibshirani
Journal: Proc Natl Acad Sci U S A Date: 2003-07-25 Impact factor: 11.205

5. CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling.

Authors: Qing Zhou; Wing H Wong
Journal: Proc Natl Acad Sci U S A Date: 2004-08-05 Impact factor: 11.205

6. A high-resolution map of active promoters in the human genome.

Authors: Tae Hoon Kim; Leah O Barrera; Ming Zheng; Chunxu Qu; Michael A Singer; Todd A Richmond; Yingnian Wu; Roland D Green; Bing Ren
Journal: Nature Date: 2005-06-29 Impact factor: 49.962

7. Identification and characterization of a factor that binds to two human sarcomeric actin promoters.

Authors: L M Boxer; T Miwa; T A Gustafson; L Kedes
Journal: J Biol Chem Date: 1989-01-15 Impact factor: 5.157

8. A gene atlas of the mouse and human protein-encoding transcriptomes.

Authors: Andrew I Su; Tim Wiltshire; Serge Batalov; Hilmar Lapp; Keith A Ching; David Block; Jie Zhang; Richard Soden; Mimi Hayakawa; Gabriel Kreiman; Michael P Cooke; John R Walker; John B Hogenesch
Journal: Proc Natl Acad Sci U S A Date: 2004-04-09 Impact factor: 11.205

9. TRANSFAC: transcriptional regulation, from patterns to profiles.

Authors: V Matys; E Fricke; R Geffers; E Gössling; M Haubrock; R Hehl; K Hornischer; D Karas; A E Kel; O V Kel-Margoulis; D-U Kloos; S Land; B Lewicki-Potapov; H Michael; R Münch; I Reuter; S Rotert; H Saxel; M Scheer; S Thiele; E Wingender
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

10. Purification of a NF1-like DNA-binding protein from rat liver and cloning of the corresponding cDNA.

Authors: G Paonessa; F Gounari; R Frank; R Cortese
Journal: EMBO J Date: 1988-10 Impact factor: 11.598

30 in total

1. Sequence features that drive human promoter function and tissue specificity.

Authors: Jane M Landolin; David S Johnson; Nathan D Trinklein; Shelly F Aldred; Catherine Medina; Hennady Shulha; Zhiping Weng; Richard M Myers
Journal: Genome Res Date: 2010-05-25 Impact factor: 9.043

2. Genome-wide mapping and analysis of active promoters in mouse embryonic stem cells and adult organs.

Authors: Leah O Barrera; Zirong Li; Andrew D Smith; Karen C Arden; Webster K Cavenee; Michael Q Zhang; Roland D Green; Bing Ren
Journal: Genome Res Date: 2007-11-27 Impact factor: 9.043

3. Integrated biochemical and computational approach identifies BCL6 direct target genes controlling multiple pathways in normal germinal center B cells.

Authors: Katia Basso; Masumichi Saito; Pavel Sumazin; Adam A Margolin; Kai Wang; Wei-Keat Lim; Yukiko Kitagawa; Christof Schneider; Mariano J Alvarez; Andrea Califano; Riccardo Dalla-Favera
Journal: Blood Date: 2009-12-03 Impact factor: 22.113

4. Identification of interacting transcription factors regulating tissue gene expression in human.

Authors: Zihua Hu; Steven M Gallo
Journal: BMC Genomics Date: 2010-01-19 Impact factor: 3.969

5. A systems biology approach to transcription factor binding site prediction.

Authors: Xiang Zhou; Pavel Sumazin; Presha Rajbhandari; Andrea Califano
Journal: PLoS One Date: 2010-03-26 Impact factor: 3.240

6. Genetic variation in the proximal promoter of ABC and SLC superfamilies: liver and kidney specific expression and promoter activity predict variation.

Authors: Stephanie E Hesselson; Pär Matsson; James E Shima; Hisayo Fukushima; Sook Wah Yee; Yuya Kobayashi; Jason M Gow; Connie Ha; Benjamin Ma; Annie Poon; Susan J Johns; Doug Stryke; Richard A Castro; Harunobu Tahara; Ji Ha Choi; Ligong Chen; Nicolas Picard; Elin Sjödin; Maarke J E Roelofs; Thomas E Ferrin; Richard Myers; Deanna L Kroetz; Pui-Yan Kwok; Kathleen M Giacomini
Journal: PLoS One Date: 2009-09-09 Impact factor: 3.240

7. CpG-depleted promoters harbor tissue-specific transcription factor binding signals--implications for motif overrepresentation analyses.

Authors: Helge G Roider; Boris Lenhard; Aditi Kanhere; Stefan A Haas; Martin Vingron
Journal: Nucleic Acids Res Date: 2009-09-06 Impact factor: 16.971