Literature DB >> 26390058

Early enhancer establishment and regulatory locus complexity shape transcriptional programs in hematopoietic differentiation.

Alvaro J González¹, Manu Setty¹, Christina S Leslie¹.

Abstract

We carried out an integrative analysis of enhancer landscape and gene expression dynamics during hematopoietic differentiation using DNase-seq, histone mark ChIP-seq and RNA sequencing to model how the early establishment of enhancers and regulatory locus complexity govern gene expression changes at cell state transitions. We found that high-complexity genes-those with a large total number of DNase-mapped enhancers across the lineage-differ architecturally and functionally from low-complexity genes, achieve larger expression changes and are enriched for both cell type-specific and transition enhancers, which are established in hematopoietic stem and progenitor cells and maintained in one differentiated cell fate but lost in others. We then developed a quantitative model to accurately predict gene expression changes from the DNA sequence content and lineage history of active enhancers. Our method suggests a new mechanistic role for PU.1 at transition peaks during B cell specification and can be used to correct assignments of enhancers to genes.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Histones
Lysine

Year: 2015 PMID： 26390058 PMCID： PMC4626279 DOI： 10.1038/ng.3402

Source DB: PubMed Journal: Nat Genet ISSN： 1061-4036 Impact factor: 38.330

Introduction

Genome-scale studies of cellular differentiation have observed that many enhancers involved in cell-type specific programs are already established in precursor cells. For example, we recently found that most enhancers involved in the regulatory T (Treg) cell transcriptional program – based on their occupancy by the Treg cell master regulator Foxp3 – were DNase accessible in CD4+ precursor cells, occupied by other factors that “place-hold” to maintain the potential for Treg cell differentiation[1]. Evidence in support of early enhancer establishment or chromatin poising has also been documented in B cell and macrophage specification[2,3], T cell development[4], early hematopoiesis[5], and multipotent endoderm cells at enhancers associated with liver and pancreas cell fates[6]. Earlier concepts of poising include bivalent domains in embryonic stem cells (ESCs), where the active mark H3K4me3 and repressive mark H3K27me3 coincide[7]; other poised ESC elements marked by H3K4me1 and H3K27me3[8]; and poised/inactive enhancers marked with H3K4me1 but not H3K27ac[9]. Meanwhile, recent studies have defined the notion of cell-type specific “super-enhancers” – spatially clustered enhancers, occupied by master regulator transcription factors (TFs) for the cell type – that regulate developmentally important genes[10,11]. Others have used segmentation of histone mark data to identify long (>3kbp) “stretch enhancers”[12], associated wide domains of the active mark H3K27ac with high regulatory potential[13], or characterized broad domains of H3K4me3 as “buffer domains” for important cell-type specific genes[14]. Here we introduce a new definition of regulatory locus complexity based on the multiplicity of DNaseI hypersensitive sites (DHSs) regulating a gene across a lineage. We investigate how locus complexity and early enhancer establishment in hematopoietic differentiation work together to shape transcriptional programs and quantitatively determine gene expression changes in cell state transitions. Through an integrative DHS-centric analysis of chromatin state and gene expression across ESCs and five primary hematopoietic cell types and predictive modeling of gene expression changes in cell fate specification, we propose that both regulatory complexity and early enhancer establishment contribute to achieving large expression changes during differentiation and strong cell-type specific expression patterns for important cell identity genes.

Results

A lineage DHS atlas defines gene regulatory complexity

We carried out an integrative analysis of DNase-seq, histone modification ChIP-seq for multiple marks (H3K27ac, H3K27me3, H3K4me1, H3K4me3), and RNA-seq data in order to link enhancer dynamics and spatial organization to gene expression changes in hematopoietic differentiation. We focused on six cell types characterized by the Roadmap Epigenomics project[15-17] (Supplementary Table 1): human embryonic stem cells (hESC), hematopoietic stem and progenitor cells (CD34+ HSPC), one myeloid cell type (CD14+ monocytes), and three lymphoid lineages (CD19+ B cells, CD3+ T cells, CD56+ NK cells). We first performed peak calling on DNase-seq profiles, using three biological replicates per cell type to control for irreproducible discovery rate (IDR)[18], and assembled an atlas of over 120K reproducible DNase hypersensitive sites (DHSs, median width = 456bp; Supplementary Fig. 1, Online Methods). We initially assigned each DHS in the atlas to the nearest gene, and we defined the regulatory locus complexity of a gene as the total number of atlas DHSs, over all cell types, assigned to it. Nearest-gene enhancer assignment can incur errors, especially in gene-dense regions or conversely for distal intergenic enhancers. However, 58% of DHSs in the atlas reside within the transcription unit of their assigned target gene (from 2Kbp upstream of the TSS to 2Kbp downstream of the last annotated 3′ end) and an additional 10% lie within 10Kbp of the target gene. Therefore, in a majority of cases, we assign a DHS to the encompassing or local gene. Indeed, roughly half of the non-promoter DHSs in the atlas, as well as in individual cell types, are intronic (Supplementary Table 2). Several studies have characterized conserved intronic enhancers that reside in developmentally important lymphoid lineage genes, including Foxp3 and Ikzf1[19,20], and demonstrated that they indeed regulate these genes. Furthermore, recent high-resolution capture Hi-C analysis found that transcriptionally active promoters asymmetrically interact with the gene body more than with the local upstream sequence, suggesting an activating role for intronic enhancers[21]. However, intronic enhancers sometimes regulate a nearby gene rather than the encompassing gene[22-24], and therefore our assignment of intronic DHSs is imperfect. We will return to the problem of improving enhancer-gene assignments later. We next grouped genes into three equally sized classes (tertiles) based on their regulatory complexity: low complexity genes, with a total assignment of 0 to 2 atlas DHSs; medium complexity genes, with 3 to 7 DHSs; and high complexity genes, with 8 or more DHSs. Fig. 1a shows an example of a low complexity gene, NUP54, a housekeeping gene that encodes a component of the nuclear pore complex; and a high complexity gene, RUNX1, a developmentally important hematopoietic transcription factor. NUP54 has a single DHS assigned to it, a promoter peak that is accessible and has active histone marks (H3K4me3 and H3K27ac) across all cells; contains no additional DHSs in its short transcription unit; and displays a high constitutive level of gene expression with little variation across cell types. By contrast, RUNX1 is assigned 43 DHSs, including several promoter peaks and a large number of intronic enhancers across its very long transcription unit (only the first four introns are shown); its enhancers display dynamic patterns of accessibility and chromatin marks across cell types; and it achieves dramatic expression changes in specific cell state transitions.

Figure 1

DHS atlas defines high and low complexity genes in hematopoietic differentiation

(a) NUP54 (left) encodes a component of the nuclear pore complex, and RUNX1 (right) encodes a hematopoietic master regulator (first four introns shown). Colored signals correspond to histone modifications and the grey signal to DNaseI hypersensitivity. Atlas peaks (DHSs) at these two loci are represented in the top track as grey boxes. NUP54 has only one promoter peak while RUNX1 was assigned 43 peaks. Transcript expression is quantified from RNA-seq for each gene across cell types.

(b–c) Complexity classes were defined by taking the 33 and 66 percentiles of the complexity distribution to produce groups with similar number of genes: low complexity (0 to 2 peaks), medium (3 to 7), and high (8 or more). High complexity genes have (b) longer transcription units with (c) higher fractions of intronic sequence. P-values are computed using one-sided Kolmogorov-Smirnov tests. IQR: interquartile range.

(d–e) Gene expression variation correlates with regulatory complexity. (d) Distribution of expression dynamic range for the three complexity groups. (e) Expression log fold changes (logFC) in the transition from CD34+ HSPCs to CD14+ monocytes for the complexity groups, based on the top 10% most highly expressed genes in each cell type.

(f–g) Gene density analysis for genes in complexity classes, based on gene counts in 1MB genomic bins. (f) Low complexity genes are enriched in gene dense regions, whereas high complexity genes are found in gene poor regions. (g) Regions of high gene density contain predominantly promoter and ubiquitous DNase peaks (green dots).

Complexity tied to gene architecture, function, expression

In fact, the differences in gene architecture, function, and expression dynamics between NUP54 and RUNX1 are characteristic of the differences between low and high complexity genes. As one might expect, the log length distribution of the transcription unit for high complexity genes is significantly greater than that of medium complexity genes, which are longer than low complexity genes (Fig. 1b; P < 1×10−16, one-sided Kolmogorov-Smirnov test for both comparisons). Furthermore, the fraction of the transcription unit consisting of intronic sequence is significantly higher in high versus medium complexity genes and in medium versus low complexity genes (Fig. 1c; P < 1×10−16, one-sided KS tests) since higher complexity genes have longer introns (Supplementary Fig. 2). High complexity genes that are highly expressed (top 10% of expression distribution) in a cell type are strongly enriched for cell type specific functions such as hemopoiesis and leukocyte activation (HSPCs), lymphocyte activation (B cells), and immune response (monocytes), while highly expressed low complexity genes are enriched for similar housekeeping functions, such as translation and ubiquitination, in all cell types (Supplementary Table 3; Online Methods). The distribution of genes over complexity classes and properties of these classes also hold for the top 10% expressed genes in each cell type (Supplementary Fig. 3). Interestingly, translation elongation, RNA splicing, and mRNA processing terms were consistently enriched in the medium complexity genes. Importantly, genes in low, medium, and high complexity classes show significantly increasing dynamic range of gene expression across cell types (Fig. 1d; P < 1×10−4, one-sided KS tests for low vs. medium and medium vs. high). Moreover, when examining highly expressed genes in cell state transitions such as the specification of HSPCs to monocytes, high complexity genes achieve the largest increases in expression, with medium and low complexity genes showing progressively less upregulation (Fig. 1e; P < 2×10−7 for high vs. medium, P < 4×10−6 for medium vs. low, one-sided KS tests; see Supplementary Fig. 4 for other cell state transitions). These results suggest that high regulatory complexity may help to achieve large expression changes in differentiation. Given our nearest-gene DHS assignments, we also investigated how gene density interacts with our complexity definitions. For each gene, we computed the number of genes on either strand overlapping a 1M window centered at the gene (Fig. 1f) or in two 500KB windows flanking each gene (Supplementary Fig. 5a). Both analyses showed that high complexity genes occur in low density regions, while low complexity genes occur in gene dense regions. We considered if tightly packed “low complexity” genes could share their DHSs with each other in a dense network of chromatin loops. However, we found that a large fraction (61%) of DHSs at low complexity genes and present in at least one differentiated cell type are promoter peaks or ubiquitous peaks, and that the percentage of promoter/ubiquitous peaks increases with gene density (Fig. 1g). If there does exist a network of interacting DHSs in our “low complexity” class, it would appear to be constitutive rather than dynamic, consistent with the housekeeping functions and limited expression dynamics of low complexity genes.

High complexity genes are enriched for “transition” DHSs

We next examined the patterns of DHS accessibility across cell types and their association with gene complexity classes. The heatmap in Fig. 2a shows the most frequent patterns of accessibility found among ~48K DHSs present in monocytes (minor patterns omitted). Major accessibility patterns include promoter DHSs that are constitutively open across all cell types (11.1% of monocyte DHSs); ubiquitous intergenic and intronic DHSs (19.2%), non-promoter peaks open in all cell types; hematopoietic DHSs (4.4%), absent in hESCs but present in all hematopoietic cell types; and cell-type specific DHSs (35.6%), enhancers that are present only in monocytes. Finally, another major accessibility pattern we identified consisted of “transition peaks” (12.6%), enhancers that are established in HSPCs and maintained in monocytes but lost in other fates. Similar early enhancer establishment was suggested through histone mark analysis in early hematopoiesis[5]. In contrast to promoter and ubiquitous peaks, transition and cell-type specific peaks are preferentially found in regions with low gene density (Supplementary Fig. 5b, red dots). Examination of lymphoid cell types yielded similar major patterns, with the addition of a lymphoid-restricted accessibility pattern (Supplementary Fig. 6–8). In particular, each differentiated cell type had a distinguished set of transition enhancers, established in HSPCs and maintained upon specification to that cell fate but lost in others. Since our analysis identified a substantial number of DHSs shared between B cells and monocytes, and between T and NK cells, we allowed sharing of peaks between these two pairs of cell types when defining “cell-type specific” and “transition” peaks, provided they were not found in other differentiated cell types (Online Methods, Supplementary Table 4).

Figure 2

High complexity genes contain enhancers with distinct dynamics

(a) Heatmap showing DNase accessibility at peaks for hESC cells and 5 primary hematopoietic cell types. Each row represents one of the 48,303 DNase peaks present in CD14+ monocytes. DNase read counts are displayed in a 1,000 base window, binned at 10 bases, and normalized (TPM) in each cell type. 83% of CD14+ peaks can be grouped into one of the following accessibility patterns: promoter, ubiquitous (non-promoter), hematopoietic, transition, and cell-type specific peaks.

(b–f) Each DNase accessibility pattern displays characteristic chromatin signals. DNase and histone modification signals are shown as metapeaks, plotting the average of each signal across all peaks in the category. For DNase, the 10 and 90 TPM percentiles at each position are represented as a grey shadow around the mean signal.

(g–h) Genes in different complexity classes show distinct enrichments for accessibility patterns at DHSs. The majority of DHSs associated with low complexity genes are promoter or ubiquitous peaks, while DHSs assigned to high complexity genes are strongly enriched for cell-type specific and transition peaks. The lymphoid pattern is defined as DHSs accessible in B, T and NK cells. “Cell-type” is short for cell-type specific.

Meta-peak visualizations of the major DNase accessibility patterns for monocyte peaks, together with average chromatin signals across cell types, are shown in Fig. 2b–f. DNase and histone mark distributions within each pattern (Supplementary Fig. 9–15) show that accessibility is highest in promoter peaks and lowest in cell-type specific peaks, and that H3K4me1 is highest in transition peaks, followed by both cell-type specific and hematopoietic peaks. Ubiquitous peaks display a bimodal distribution of the active marks H3K4me3 and H3K27ac (Supplementary Fig. 10–11, 13–14), suggesting that a subpopulation of these DHSs are not active transcriptional enhancers. When we overlaid CTCF ChIP-seq from the ENCODE lymphoblastoid cell line GM12878 with ubiquitous DHSs, we saw dramatic enrichment of CTCF signal at low H3K4me3 ubiquitous peaks (Supplementary Fig. 16, Online Methods), suggesting their a role in 3D chromatin structure. An early ENCODE study of 1% of the human genome across 6 unrelated cell types found that almost all ubiquitous DNase peaks were promoter peaks (86%), and most of the remainder were CTCF bound (10%)[25]; our lineage-based analysis reveals a larger class of ubiquitous non-promoter DHSs with active histone marks and without CTCF occupancy. Similar observations can be made for analogous enhancer types in other differentiated cells (Supplementary Fig. 9–15). Strikingly, DHSs assigned to genes in different complexity classes are enriched for distinct lineage-wide DNase accessibility patterns. DHSs of low complexity genes are dominated by promoter and ubiquitous peaks in monocytes (Fig. 2g) and B cells (Fig. 2h) as well as other cell fates (Supplementary Fig. 17). Meanwhile, DHSs of high complexity genes are strongly enriched for both cell-type specific and transition peaks (Fig. 2g, < 2×10−107 and P ~ 0 for transition and monocyte specific peaks, Fisher’s test; Fig. 2h, P < 5×10−85 and P < 2×10−153 for transition and B cell specific peaks). Therefore, consistent with their greater dynamic range of expression, high complexity genes are also enriched for enhancers with dynamic accessibility, including both transition and cell-type specific enhancers.

Upregulated high complexity genes gain active enhancers

We next examined the gain and loss of active enhancers in cell state transitions from HSPCs to differentiated cell types. While cell-type specific enhancers change their accessibility, transition peaks display a change in chromatin marks. Fig. 3a, b show 2D density plots of the active mark H3K27ac[9] vs. the mark H3K4me1 for all DNase peaks in HSPCs and monocytes (see Supplementary Fig. 18 for other cell types). Gaussian mixture modeling defines three overlapping subpopulations (Online Methods): peaks with low activity/moderate H3K4me1, moderate activity/high H3K4me1, and high activity/moderate H3K4me1. Transition peaks for monocyte specification are indicated in blue. In HSPCs, monocyte transition peaks are strongly enriched in both the low activity/moderate H3K4me1 and moderate activity/high H3K4me1 subpopulations. Upon specification to monocytes, the transition peaks largely move to the moderate activity/high H3K4me1 region of the monocyte landscape (Fig. 3b) and display a strong increase in H3K27ac (Supplementary Fig. 19); transition peaks for other cell fates show the same patterns (Supplementary Fig. 20–22). A partial analogy can be made to bivalent domains in hESC cells[7], some of which coincide with DNase peaks and form a subpopulation of DHSs with moderate active promoter mark H3K4me3/high repressive mark H3K27me3 in the corresponding 2D hESC landscape (Supplementary Fig. 23) but resolve to active or inactive DHSs upon specification to HSPCs (Supplementary Fig. 24–26).

Figure 3

Gain of active enhancers in cell state transitions correlates with increase in expression

(a–b) The 2D topographical plots show the density of all accessible DNase peaks in (a) CD34+ HSPCs and (b) CD14+ monocytes in the landscape defined by the active mark H3K27ac (x-axis) and the mark H3K4me1 (y-axis). Gaussian mixture modeling identifies three subpopulations in each plot, shown by the mean of each mixture component and the ellipse corresponding to the 90% probability level curve. Enrichments of accessibility patterns in each subpopulation are indicated in figure; all enrichment P-values by Fisher’s test. Percentages indicate proportion of all the monocyte transition peaks that lie in each region of the ellipses.

(c–d) Panels show gene expression changes and enhancer changes in (c) low complexity and (d) high complexity genes for the transition from HSPCs to monocytes for the 100 genes with largest absolute logFC in each class. Each gene is illustrated by a stack of diamonds, where a diamond represents a DHS associated to the gene. The bottom of the stack corresponds to the logFC of the gene (y-axis). Red/blue diamonds are active peaks gained/lost in the transition. Peaks maintained in the transition and that have gains/losses in enhancer marks (H3K4me1 or H3K27ac) are represented by yellow/green diamonds (Online Methods). Grey dots represent peaks present in both cell types without changes in histone marks. Box-and-whisker plots show the logFC distributions for all the genes. Gene names annotated with [t]/[s] have GO annotations for transcriptional/signaling processes.

We then examined the 10% most highly expressed genes in HSPCs and monocytes and identified the low and complexity genes with largest expression changes in HSPC to monocyte specification. While a handful of low complexity genes appeared to achieve large expression changes without any gain/loss of DHSs (Fig. 3c and Supplementary Fig. 27–29), large increases/decreases in expression at high complexity genes were consistently associated with gains/losses in active cell-type specific enhancers, almost always accompanied by gain/loss of H3K4me1/H3K27ac at one or more transition enhancers (Fig. 3d). Indeed, highly expressed, high complexity genes with at least one transition peak experienced greater chromatin remodeling than those without a transition peak, gaining significantly more cell-type specific peaks while losing more HSPC peaks, and achieved higher log fold changes in monocyte specification (Supplementary Fig. 30a, b, c). These results suggest that regulatory complexity is important for achieving cell-type specific gene expression; however, these complex locus control regions are not typically established in a single cell state transition but rather are primed through early enhancer establishment.

SeqGL predicts TF binding signals from cell type DHSs

We next asked whether we could predict expression changes of genes in cell state transitions from the DNA sequence signals and lineage history of their active enhancers. To dissect sequence motifs in enhancers, we applied SeqGL, a group lasso algorithm we developed to identify multiple DNA signals de novo from DNase data using k-mer patterns[26]. SeqGL first computes a count matrix of occurrences of k-mers in DNase peak sequences and flanks and clusters k-mers by their count vectors to identify k-mer groups. Fig. 4a shows part of the clustered count matrix obtained by applying SeqGL to B cell DNase-seq data. Each group tends to contain sequence-similar k-mers that may represent variants of the DNA recognition signal of a single TF, found in a subset of training examples. To obtain a scoring function for each k-mer group, SeqGL learns weights over k-mers by training a binary logistic regression model to discriminate between peaks and flanks[26] (Online Methods). Groups with primarily positive weights represent predicted TF binding signals found in peaks (Fig. 4a), and for each of these groups, we can identify highly scored peak regions, generate a binding motif, and compare to existing databases to identify the corresponding TF, if found (Fig. 4a, Online Methods).

Figure 4

SeqGL identifies multiple TF sequence signals in B cell DNase peaks

(a) SeqGL identifies multiple DNA sequence signals de novo from DNase-seq data using k-mer patterns that can discriminate between peaks and flanks. SeqGL first computes a k-mer count matrix from the set of input sequences and clusters co-occurring k-mers into groups. A subset of this matrix for B cells is shown in the top panel. SeqGL then learns a weighting over k-mers in each group that discriminates peaks from flanks. The bottom panel shows the inferred weights and top ranking k-mers of each group. Groups with positive weights represent predicted TF binding signals.

(b) The PAX5 locus is shown highlighting the top TFs identified by training SeqGL on B cell DNase peaks. The first row shows the DNase read density and the remaining rows show SeqGL predicted TF binding signals.

(c) The active B cell peaks were divided into four categories: (1) promoter: peaks defined in the promoter; (2) ubiquitous: peaks with similar accessibility across cell types; (3) transition: peaks retained specifically in the transition from HSPC to B cells; (4) cell-type specific: peaks that are significantly higher in B cells.

(d–g) Barplots showing binding signal enrichment in different peak categories. NFY is strongly enriched in promoters, and CTCF in both ubiquitous and promoter peaks. The B cell factor IRF is significantly higher in cell-type specific peaks, while PU.1 is enriched in both cell-type specific and transition peaks. “*” indicates P < 0.01 by Wilcoxon rank sum test. Error bars represent standard deviation.

In a previous analysis of ENCODE DNase-seq data, SeqGL was more sensitive than traditional motif discovery methods at identifying TF binding signals supported by ChIP-seq data in the same cell type[26]. Running SeqGL on DNase-seq data for our 5 hematopoietic cell types produced ~50 predicted TF binding signals mapping to 25–34 distinct known TF motifs per cell type (Supplementary Table 5). While we do not have parallel TF ChIP-seq data for these cell types, SeqGL retrieved many TF binding signals that are supported by the literature and previous ChIP-seq studies, including PU.1, RUNX and AP-1 in HSPCs as well as differentiated myeloid and lymphoid cells[27]; T-BOX/TBET and NF-κB signals in NK, T cells and monocytes[28]; myeloid zinc factor MZF in HSPCs and monocytes[29]; CEBP in monocytes[30]; and STAT exclusively in T cells[31]. However, since SeqGL extracts DNA sequence signals from DNase-seq and relies on motif databases to associate signals to TFs, it cannot distinguish between TFs with nearly identical binding motifs. Fig. 4b gives an example of SeqGL TF binding predictions for DHSs in the PAX5 locus after training the model on DNase peaks in CD19+ B cells. Each track below the DNase signal track shows the SeqGL group k-mer score for the indicated TFs; for RUNX, k-mer patterns contributing to the scoring model are shown. We analyzed SeqGL’s TF score distributions across different categories of DHSs with active histone marks (H3K4me1 or H3K27ac). We refined the previous DNase accessibility categories by performing pairwise differential read count analysis on DHSs to identify cell-type specific or shared events[32] (Online Methods), obtaining four categories present in B cells: promoter, ubiquitous (non-promoter), transition, and cell-type specific (Fig. 4c). As expected, the NFY sequence signal is enriched in B cell promoter peaks (Fig. 4d), CTCF in ubiquitous as well as promoter peaks (Fig. 4e), and IRF in B cell specific peaks (Fig. 4f). Interestingly, PU.1 is enriched not only in cell-type specific peaks but also in transition peaks (Fig. 4g). While PU.1 is known to play a key role in both HSPC and B cell function[2,33], the binding of PU.1 in transition peaks and its impact on gene regulation have not been characterized. We therefore developed a regression framework to model the influence of PU.1 and other factors on gene expression changes in cell fate specification.

Regression model accurately predicts expression changes

We learned global regression models to predict expression changes from DNA signals in active enhancers by training on high and low complexity genes in differentiation from HSPCs to distinct cell fates. We trained a separate regression model for each cell fate and used SeqGL to derive TF sequence signals from DNase-seq data in HSPCs and the differentiated cell type. In the model for B cell specification, each gene is represented by a set of SeqGL-predicted TF binding scores (features) computed from its active (H3K4me1 or H3K27ac marked) DHSs in HSPCs and B cells (Fig. 5a, Online Methods). The model learns a regression coefficient for each TF signal or pair of signals in each category of DHS (promoter, ubiquitous, HSPC specific, B cell specific, transition) to allow these signals to play distinct roles in predicting expression changes. For example, TF signals in HSPC and B-cell specific peaks represent regulatory inputs that are lost or gained, respectively, in the cell state transition; transition peaks may be bound by TFs in HSPCs that act as “place-holders” for later binding by other TFs in B cells or may be continuously occupied by the same TFs in both cell types. The contribution of TFs in these different roles to orchestrating gene expression changes is captured by the sign and magnitude of the learned regression weights (Fig. 5a).

Figure 5

Regression model suggests a role for PU.1 in early establishment of B cell enhancers

(a) Regression framework to model the effect of TF signals on gene expression changes in the transition from HSPCs to B cells. TF binding scores in both HSPC and B cell peaks, categorized as in Fig. 4c, form the features for each gene. The weights of different TFs across peak categories, representing their contribution to expression changes, are inferred using ridge regression.

(b) Heatmaps showing the predicted contributions of each TF across peak categories to gene expression changes in cell state transitions. PU.1 signals from multiple peak classes occur in all models. High expression in HSPC, monocytes, B, T and NK cells is strongly predicted by corresponding cell-type specific factors. Changes in many B cell genes are predicted not only by signals in B cell peaks but also by transition peak signals, primarily PU.1.

(c) B cells genes in Fig. 5b were classified into three categories: genes with predictive signals from (i) transition peaks alone; (ii) B cell specific peaks alone; and (iii) both B cell specific and transition peaks, termed “poised” genes. “Poised” genes are significantly upregulated compared to other categories (P < 0.02, category (i) and P < 1 × 10−4, category (ii), one-sided KS test).

(d) Functional enrichment through gene ontology analysis shows that “poised” genes are significantly more enriched ontologies related to B cell functions compared to other genes.

(e) HSPC specific signals like GATA are significantly higher in active HSPC peaks of HSPC genes compared to those of poised B cell genes (P < 0.03, Wilcoxon rank sum test) and are low in B cell peaks. Similarly, signals for IRF, a B cell factor, are significantly higher in active peaks of poised B cell genes compared to HSPC genes (P < 3 × 10−4) and are low in HSPC peaks. Error bars represent standard deviation.

Restricting to the top 10% most highly expressed genes in HSPCs and the differentiated cell type, we found that the predicted expression changes for held-out genes in the B cell transition were remarkably accurate (Spearman correlation ρ = 0.666), especially for the high complexity genes (ρ = 0.742) which have the largest expression changes (Supplementary Fig. 31). Low complexity genes have more modest expression changes, and the model behaved appropriately for these genes by predicting smaller log fold changes (Supplementary Fig. 31). Similarly strong prediction performance was obtained for transitions to other cell fates (Supplementary Table 6). Analysis of the regression models reveals the specialized roles of different kinds of enhancers (Supplementary Fig. 32). The first heatmap in Fig. 5b shows the contribution of TFs residing in different DHS categories for predicting gene expression changes of high complexity genes in B cell specification, and the remaining heatmaps show other cell state transitions. In these heatmaps, each row represents a high complexity gene and every column represents a TF signal or pair of TF signals for a particular enhancer category; the value in each cell is the SeqGL-derived binding score weighted by the regression coefficient, so that we can visualize if it contributes positively (orange) or negatively (blue) to the overall predicted log expression change. In some cases, several k-mer groups are associated with the same TF by SeqGL and can lead to multiple columns labeled with the same TF, sometimes within the same DHS category. Some of these multiple k-mer groups for a TF may represent subtle differences in the underlying motifs and perhaps merit experimental examination, while others are likely a result of over-clustering (Supplementary Fig. 33), which nonetheless does not harm the predictive performance of the regression model. Using second order features significantly improves performance over first order features alone (P < 5×10−7, Wilcoxon signed rank test) but only fine-tunes the predictions. The model identifies known HSPC factors like GATA and RUNX[34] in HSPC specific peaks as predictors of negative log fold change for genes specifically expressed in HSPCs; similarly, known B cell factors like IRF[34] in B cell specific peaks predict positive log fold changes in B cell genes. Interestingly, binding signals for PU.1 in HSPC and B cell peaks are associated with downregulation of HSPC genes and upregulation of B cell genes, respectively; moreover, a subset of B cell genes harbor PU.1 signals in transition peaks that strongly predict their expression changes. This finding suggests a novel mechanistic role for binding of PU.1 in HSPCs at transition enhancers to poise genes for B cell specification. We therefore defined three sets of high complexity/highly expressed genes in B cells: (i) genes with predictive signals only in transition peaks; (ii) genes with predictive signals only in B cell specific peaks; and (iii) “poised genes” with predictive signals in both B cell specific peaks and transition peaks. Strikingly, poised genes displayed the strongest upregulation upon differentiation to B cells (Fig. 5c) as well as the strongest enrichment of B cell differentiation and activation ontology terms (Fig. 5d), suggesting that key B cell identity genes are regulated by TF binding to both B cell specific and transition enhancers. Among the poised genes were a number with important B cell functions, such as the anti-apoptotic and regulatory genes BCL2, IKZF1, KLF6, TCF3 and IRF8, as well as the immune signaling genes PLCG2, SYK, IL4R, INPP5D, IFNGR1, and IL16. We confirmed that the transition peaks of poised genes have significantly higher DNase accessibility and lower H3K27me3 repressive marks compared to their B cell specific peaks in HSPCs (P < 3×10−93, Wilcoxon rank sum test) and are close in chromatin state to the HSPC specific peaks of HSPC genes (Supplementary Fig. 34; P < 4×10−3, Wilcoxon rank sum test). This suggests that PU.1 binding at transition enhancers – the main predictive transition peak signal – may help maintain the chromatin in an open state. However, poised B cell genes have much lower expression in HSPCs than HSPC genes (Supplementary Fig. 34; P < 3×10−93, Wilcoxon rank sum test). To explain this, we confirmed that poised genes have reduced signals for key HSPC-specific factors like GATA in their HSPC and B cell active enhancers while harboring strong B cell specific TF signals like IRF in their B cell active enhancers (Fig. 5e, Online Methods). Therefore, while poised genes lack HSPC sequence signals to achieve high expression in precursor cells, chromatin poising by PU.1 may help promote their upregulation once B cell factors are expressed. Interestingly, our modeling of monocyte specification also reveals a role for PU.1 poising, but here the PU.1 signal arises in promoter peaks and may maintain open chromatin for later binding by the promoter-associated factor CEBP, a key regulator of monocyte/macrophage fate (Supplementary Fig. 35–36).

Regression model nominates gene-enhancer reassignments

While overall we observed high rank correlation between measured and predicted expression changes for highly expressed genes in cross-validation (Fig. 6a), predictions for a small number of genes had high error (Supplementary Fig. 37). We hypothesized that these errors may arise from misassignment of enhancers to nearby genes. Therefore, we implemented an iterative reassignment procedure, where we trained regression models using cross-validation, flagged held-out genes with large prediction errors, scanned for marginally expressed genes neighboring the flagged genes, and reassigned the active enhancers of the silent genes to the poorly predicted genes (Online Methods). In almost all cases, the prediction errors for expression changes of flagged genes decreased (Supplementary Fig. 38), and overall rank-correlation in cross-validation also improved (ρ = 0.78, Supplementary Fig. 37). For an independent validation of the enhancer reassignment method, we further found that B cell enhancer reassignments significantly reduced prediction error for the affected genes in the monocyte, NK cell, and T cell regression models (P < 2×10−3, 1×10−9, 3×10−7, respectively, Wilcoxon signed rank test; Fig. 6b, Supplementary Fig. 39). Similarly, using enhancer-gene reassignments from other regression models led to significantly improved regression performance in independent cell types in all but one case (Supplementary Fig. 39; Supplementary Table 7). As an example, the B cell marker gene CD20 resides in a gene dense region where no nearby genes are expressed in B cells. Originally assigned 2 active DNase peaks, it acquires 13 actively marked peaks through enhancer reassignment to achieve a notable improvement in regression prediction (Supplementary Fig. 40).

Figure 6

Regression model proposes reassignment of enhancers to genes

(a) The regression performs very well on high complexity genes (Spearman rank correlation ρ=0.74, 10-fold cross-validation.). A number of genes that are highly expressed in either B cells or HSPC had high regression errors (Supplementary Fig. 37), often because their large expression change was not associated with corresponding high complexity in the enhancer-gene assignment. To correct the potential misassignment of enhancers to genes, an iterative reassignment scheme was implemented: for each gene with high regression error and high expression in either B cells or HSPC, the nearest marginally expressed gene (expression <75th percentile) neighboring the current regulatory locus was masked, and the peaks associated with the marginally expressed gene were reassigned to the gene with regression error; the regression model was then retrained. This procedure was iterated until convergence of regression coefficients. After iterative enhancer reassignment, the regression model shows improved performance with significant reduction of prediction errors for affected genes in both cell types (Supplementary Fig. 38).

(b) For an independent evaluation of the enhancer reassignment procedure, the enhancer reassignments identified in B cells were used to predict gene expression changes in the T cell transition. Squared errors using these reassignments were significantly reduced compared to errors without any reassignments (P < 3 × 10−7, Wilcoxon signed rank test). Similarly, using enhancer-gene reassignments from any of the other regression models led to improved regression performance in independent cell types (Supplementary Fig. 39).

Discussion

Recent studies have proposed various epigenetic signals to define highly cell-type specific “super-enhancers” that mark cell identity genes[10-12,14]. We found that the highly expressed high complexity genes in our cell types are indeed enriched for genes proximal to super-enhancers defined by broad domains of Mediator ChIP-seq (for hESCs) or H3K27ac (for hematopoietic cell types[10]), though a majority of previously defined super-enhancer genes do not overlap with these genes (Supplementary Fig. 41). Moreover, about half of previously defined hESC super-enhancers contain 0 or 1 DHSs, while ~41% of HSPC super-enhancers and >30% of lymphoid cell super-enhancers contain 2 or fewer DHSs (Supplementary Fig. 42). Therefore, current definitions of super-enhancers do not always identify complex locus control regions with many individual enhancers, as others have also observed[35]. Rather than defining a separate repertoire of super-enhancers for each cell type, we define regulatory complexity as a property of individual genes based on their DHS multiplicity across a lineage. This definition may be more natural from a mechanistic and evolutionary standpoint by characterizing a gene’s regulatory potential and the cell-type dependent constraints on that potential. While we divide genes into low, medium, and high complexity classes to simplify our analysis, there may be no intrinsic threshold dividing super-from non-super-enhancers. Our analysis also demonstrates the value of examining the lineage dynamics and DNA sequence content of enhancers in addition to their spatial clustering, proposing a distinguished role of PU.1 transition enhancers in B cell specification. We found that presence of a transition peak in high complexity genes is associated with higher gain of cell-type specific DHSs and greater enhancer remodeling in the cell state transition. A potential model is that the transition enhancer nucleates the establishment of enhancer-promoter DNA looping in the new cell state. New genome editing tools should allow this model to be tested experimentally. We lacked the data to examine the sequential establishment of complex locus control regions over multiple steps in differentiation. However, we did find a sizeable set of non-promoter, non-ubiquitous DHSs that are established as accessible in hESCs and maintained until a differentiated cell fate; like transition peaks, these DHSs were enriched in high complexity genes (Supplementary Fig. 43). While limited by available data, we have shown proof-of-principle results that complex locus control and early establishment of enhancers play important roles in transcriptional programs in differentiation. Importantly, our study is not limited to a traditional examination of stable end states in development and the repertoire of active regulatory elements in these end states. Rather, our analysis considers the asynchronous establishment and activity of enhancers that collaborate to mediate large developmental shifts in expression. Our regulatory model starts by simply assigning DHSs to the nearest gene. There is disagreement in the literature about how often this rule is incorrect. An early ENCODE 5C analysis on 1% of the human genome concluded that only ~7% of looping interactions are with the nearest gene[36]. By contrast, recent Cohesin ChIA-PET interaction data in mouse ES cells supported the assignment of enhancers to the proximal active gene in a large majority of cases (83% of super-enhancers, 87% of typical enhancers) and also largely supported assignment of enhancers to single rather than multiple genes[37], and promoter capture Hi-C analysis of an ENCODE lymphoblastoid cell line found that interactions with promoters are directed at the nearest promoter in almost two thirds of cases, while the remaining interactions pass over intervening active or inactive promoters[21]. These studies are largely consistent with our initial assumptions, although definitive interaction data is not yet available and the statistical analysis of 3C-sequencing data is challenging. Computational methods for predicting enhancer-gene associations – often based on correlating the accessibility or active marks of DNase-mapped enhancers and promoters across many cell types[38-40] – have been effectively deployed in large-scale analyses. However, our results suggest that such strategies may not work in all cases: high complexity genes can have constitutively accessible and active promoters across a lineage but cell-type specific enhancers. Other approaches have correlated activity of enhancers with target gene expression across unrelated cell types without using a gene regulation model[38,41,42]. Despite nearest-gene enhancer assignment, our regression model predicts gene expression changes with high accuracy, and larger prediction errors for a minority of genes suggested enhancer reassignments that reduced the loss in independent cell types. As new technologies enable profiling of fine-grained cell populations using small cell numbers[5,43], our approach may point the way forward to improved enhancer annotation.

Online Methods

Data and preprocessing

Aligned DNase-seq bam files were downloaded from the Roadmap Epigenomics data portal (http://www.ncbi.nlm.nih.gov/geo/roadmap/epigenomics/). RNA-seq data was downloaded as fastq files and aligned to hg19 using STAR[44]. See Supplementary Table 1 for accession numbers and links to each library. Processed data files have also been made available at the supplementary website for the paper (see URLs). The Bioconductor package GenomicFeatures[45] was used to determine the number of reads mapping to each gene. Reads per kilobase (RPK) was calculated for each gene, and the RPK values were quantile normalized across all the cell types to obtain gene expression levels.

DNase peak calling

The peak calling pipeline is outlined in Supplementary Fig. 1. Peak calling was performed on each cell type individually: first the reads from different replicates were pooled, and then the MACS peak caller[46] was used to identify peaks with a permissive threshold (P < 2×10−3). MACS identified broad regions of DNase accessibility. Therefore the PeakSplitter tool[47] was applied to identify constituent narrow peaks, each potentially representing a specific binding event. Finally, IDR[18] was used to identify reproducible peaks for each cell type (IDR < 1e-2). After identification of reproducible peaks, a simple heuristic was used to combine peaks from multiple cell types to build a common atlas comprising peaks from all cell types. For simplicity, first suppose that there are only two cell types. In this case, the set of non-overlapping peaks from the two cell types are first added to the atlas. Then the following heuristic is used to combine overlapping peaks: if the overlap between the peaks is > 75%, the non-overlapping portions of the peaks are removed to create a single unified peak which is added to the atlas; if the overlap is < 75%, the overlapping portions are removed to create two separate peaks which are both added to the atlas. This procedure can be extended to multiple cell types by first building the set of peaks for any two cell types, and then combining this atlas with the third cell type, and so on. This procedure was extended to all the cell types to create the atlas of DNase peaks for subsequent analysis.

Assignment of DNase peaks to genes

The June 2013 RefSeq transcript annotation of the hg19 version of the human genome was used for genomic location of transcription units. For genes with multiple gene models, the longest transcription unit was used for the gene locus definition. DNase peaks located in the body of the transcription unit, together with the 2kb regions upstream of the TSS and downstream of the 3′ end, were assigned to the gene. If a peak is found in the overlap of the transcription units of two genes, one of the genes is chosen arbitrarily. Intergenic peaks were assigned to the gene whose TSS or 3′ end is closest to the peak. In this way, each peak was unambiguously assigned to one gene. This approach to associating enhancers to target genes has been used previously[48].

DNase peak atlas and enhancer categories

We found a total of 120,583 DNase peaks (reproducible across replicates) of which 51.4% were present in hESC cells, 45.4% in HSPCs, 40.1% in monocytes, 30.9% in B cells, 30% in NK cells and 29.2% in T cells. Examining genomic locations, 44.6% of the peaks were found in introns, 41.9% in intergenic regions, 8.7% in promoters (5′UTR and 2kb upstream of TSS), 2.9% in 3′UTRs and 1.9% in coding sequences. The peaks were assigned to enhancer classes (DNase accessibility patterns) according to the cell types where they appeared accessible as well as their genomic locations (promoter and non-promoter). In this way, we found the largest classes to be hESC-specific (31,119 peaks), monocyte-specific (16,327 peaks) and ubiquitous (14,683 peaks). We looked at the distribution of DNase accessibility patterns at different types of genomic loci and noticed that of the 10,520 peaks that are located in the promoters of genes, 58% were ubiquitous (present in all cell types and in all cell types but monocytes – see note below) and 18% were accessible only in hESC cells or HSPCs. Therefore, the overwhelming majority of promoter peaks that appear in a differentiated cell type (the focus of this study) are ubiquitous, and for this reason we included the ubiquitous promoter peaks as a DNase accessibility pattern. Non-promoter peaks had a much wider range of accessibility patterns, of which the most abundant ones constituted the major classes that were defined in this study: cell-type specific peaks (present in a differentiated cell type, 26,933), transition peaks (present in HSPCs and a differentiated cell type, 9,506), ubiquitous peaks (present in all cell types and in all cell types but monocytes, 11,997 – see note below), hematopoietic peaks (present in all cell types but hESC, 2,139) and lymphoid peaks (present in T, B and NK cells, 1,917). In the transition and cell-type specific classes, we included peaks that were present in both monocytes and B cells, and also peaks present in both T and NK cells, since these categories have a considerable number of peaks. See Supplementary Table 4 for numbers on sharing of peaks between cell types. Note: In the ubiquitous classes (both promoter and non-promoter) we included peaks present everywhere but in monocytes because upon reexamination of DNase-seq data, they exhibited non-negligible levels of DNase accessibility in monocytes (despite the lack of called reproducible peaks) and their histone modifications signals were highly similar to the signals in ubiquitous peaks.

Histone modification ChIP-seq processing

Aligned histone modification ChIP-seq bam files were downloaded from the Roadmap Epigenomics data portal. See Supplementary Table 1 for accession numbers and links to each library. Each data set was subjected to cross-correlation analysis for quality control, as described previously[49]. Briefly, it is assumed that a high-quality ChIP-seq experiment produces significant clustering of enriched DNA sequence tags at locations bound by the protein of interest, and that the sequence tag density accumulates to the left of the bound protein on forward strand, and to the right on the reverse strand, making the two strands appear shifted around the binding event. Therefore one can compute the Pearson linear correlation between the plus and minus strands, after shifting the plus strand by k base pairs, and expect to observe two peaks when cross-correlation is plotted against the shift value: a peak of enrichment corresponding to the predominant fragment length and a peak corresponding to the read length (“phantom” peak). Metrics obtained from this plot, and the presence/absence of these two peaks, were used to flag and discard some data sets and also to experimentally estimate the fragment length in each library. In the end, we were able to collect data sets with high signal to noise ratio for the histone modifications H3K4me1/me3 and H3K27ac/me3 in the six cell types of interest, although only in a few cases were we able to keep more than one replicate. When replicates remained available, they were pooled into one data set. These signals, after having been shifted by half the estimated fragment length, were used for counting enrichment of histone modifications at DNase peaks (400 bp in the flanking regions of the peak were included for counts) and for generating signal tracks (Fig. 1 and Fig. 2). Counts at DNase peaks were normalized to reads per kilobase (to normalize for peak width) and then quantile normalized across different cell types. For signal tracks, counts of reads in bins of length 200bp were used in Fig. 1 and normalized to tags per million; and bins of length 10bp were used for the meta-peak signals in Fig. 2 and also normalized to tags per million.

DNase and histone modification heat maps

The DNase heatmaps in Fig. 2 and Supplementary Fig. 6–8 were created by pooling replicates of DNase data sets and binning the 1kbp region around the peak summit in bins of length 10bp. Read counts at bins were normalized to tags per million, and these vectors were hierarchically clustered (for each enhancer class independently) using the “hclust” function of R with Euclidean distance. Heatmaps were drawn using the function “aheatmap” from the package “NMF”[50]. For histone modification heatmaps (Supplementary Fig. 25–26), a similar procedure was followed, but for improved visualization the dynamic range of the histone modification counts was linearly transformed to have the same dynamic range of DNase counts.

Gaussian mixture models

Two dimensional Gaussian mixture models were used to identify subpopulations of DNase peaks according to their histone modification counts in each cell type. One analysis focused on H3K27ac vs. H3K4me1 to study the poising of enhancers (Fig. 3a, b and Supplementary Fig. 19), and a second analysis focused on H3K4me3 vs. H3K27me3 to study bivalent peaks and their fates in differentiation (Supplementary Fig. 23–26). In each case, log transformed counts were used to fit a Gaussian mixture model with 3 components using the R package “mixtools”[51]. The function “mvnormalmixEM”, which runs the EM algorithm for fitting, was used with random initialization. In the case of H3K4me3 vs. H3K27me3 in hESCs (Supplementary Fig. 23), we noticed that the three Gaussians capture two subpopulations of interest (one with both signals low, and one with high H3K4me3 and low H3K27me3) and a third subpopulation that encompasses the background. Therefore, to capture bivalent peaks, we ran the EM fit again with four components, initializing it with the 3 components learned in the first pass and a fourth centered around bivalent peaks (by visual inspection). Once the bivalent component was captured, bivalent peaks (blue dots in Supplementary Fig. 23) were defined as those having posterior probability greater than 0.50 in this component. In all the fits, probability level curves with p = 0.90, were used for visualizing the different subpopulations. These curves enclose all the points x⃗ that satisfy: where μ⃗ is the mean of the subpopulation, Σ is the covariance matrix and is the quantile function for probability p (in this case 0.90) of the chi-squared distribution with k degrees of freedom, where k is the number of dimensions. Since here k is 2, the chi-squared distribution simplifies to an exponential distribution with mean 2.

Detailed description of diamond plots

In Fig. 3c, d, we illustrate how up/down expression changes in cell state transitions are accompanied by gains/losses of active DHSs in the regulatory loci of genes. Both panels show gene expression changes and enhancer changes in (c) low complexity genes and (d) high complexity genes for the transition from HSPCs to monocytes. Each gene is illustrated by a stack of diamonds, where a diamond represents a DHS associated to the gene. The bottom of the stack corresponds to the logFC of the gene (y-axis). Red diamonds are active peaks gained in the transition, with a significant change in DNase accessibility (FDR < 1%, 0 < logFC) and logFC of H3K4me1 or H3K27ac greater than 1. Blue diamonds are peaks lost in the transition, similarly defined as having significant change in DNase accessibility (FDR < 1%, logFC < 0) and logFC of H3K4me1 or H3K27ac smaller than −1. Peaks that are maintained in the transition are represented by yellow and green diamonds, defined as DNase peaks present in both cell types (no significant change at FDR of 2%) with a logFC in either H3K4me1 or H3K27ac greater than 1 (yellow diamonds) or smaller than −1 (green diamonds). Grey dots represent peaks present in both cell types without changes in histone marks. In solid colors are peaks that satisfy the previous conditions but also are assigned to different monocyte enhancer types: cell type specific (red), HSPC specific (blue) or transition (yellow and green). Restricting to the top 10% most expressed genes in both cell types, panels (c) and (d) show for each complexity group only the 100 genes with largest absolute logFC. The box-and-whisker plots on the right show the logFC distributions for all the genes; the box represents the interquartile range (IQR), and the whiskers are placed at distance 1.5*IQR from the corresponding quartile. Gene names have been annotated with [t] and [s] whenever the gene has annotation terms associated with transcription or signaling, respectively. In the high complexity group, greater changes in gene expression are associated with large numbers of gained/lost peaks, typically in the presence of transition peaks.

CTCF ChIP-seq processing

We observed that non-promoter ubiquitous peaks display a markedly bimodal distribution of H3K4me3 in all the differentiated cell types (Supplementary Fig. 13–14); H3K27ac has a similar behavior but to a smaller extent (Supplementary Fig. 10–11). We hypothesized that the subpopulation with low active signals should be enriched for structural DHSs, and to test this possibility, we measured CTCF signals at these loci. By visual inspection of the monocyte H3K4me3 density plot at ubiquitous peaks, we defined the “structural” group as the peaks with log2 H3K4me3 signal between 3.8 and 4.2 (873 peaks) and the “active” group as the peaks with signal between 8.8 and 9.2 (721 peaks). We downloaded aligned bam files for two biological replicates of CTCF ChIP-seq in the lymphoblastoid cell line GM12878 from the ENCODE web portal (accession ENCSR000DRZ), counted reads at peaks and transformed to reads per kilobase to normalize for peak width. The signals for the two groups of peaks were compared with empirical cumulative distribution functions (Supplementary Fig. 16).

SeqGL overview

SeqGL is a novel group-lasso based de novo motif discovery algorithm to extract multiple TF binding signals from ChIP-, DNase- and ATAC-seq profiles[26]. SeqGL identifies these signals by training a discriminative model to differentiate between sequences in peaks vs. flanking sequences using a k-mer feature representation. The clustering of these k-mer features across training examples revealed a block structure, and this block structure is encoded as a group lasso constraint in the binary classification problem. Formally this can be written as: where x represents the k-mer count vector for example i and yi are the labels: +1 for peaks and −1 for flanks. The first summation is over all the examples (peaks and flanks) and defines the logistic loss function. The second summation encodes the group lasso constraints across all k-mer weights w for all groups g. Here w is a vector of k-mer weights that belong to group g. The third constraint encodes the sparsity constraints over all k-mers and sets the weight of non-informative k-mers to zero. The two regularization parameters λ and λ control sparsity at the group level and k-mer level respectively. The prediction scores of flanks is used as an empirical null distribution to identify the subset of peaks best predicted by a group that discriminates peaks from flanks. HOMER is then run on this subset of peaks to associate a TF with each group. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. SeqGL also successfully scaled to DNase- or ATAC-seq maps, identifying numerous TF signals confirmed by ChIP, including those missed by conventional motif finders[26]. SeqGL was run by sampling 40,000 subpeaks of the cell type identified using PeakSplitter. 200 groups were used for the analysis. Group scores and motifs were identified using the scores of the flanks as empirical null distributions.

Peak category definitions for regression analysis

To refine peak categories for regression analysis, edgeR[32] was performed on all pairs of cell types using only peaks defined in either of the two cell types under consideration. A peak was identified as significantly differential if it satisfied an FDR-corrected P < 1×10−2 and absolute DNase fold change > 1. A number of DNase peaks were observed to be shared between (i) B cells and monocytes and (ii) T cells and NK cells (see Supplementary Table 4). Hence, this information was used in defining the peak categories (Fig 4b), as described below. The B cell peak categories were defined as follows: (1) promoter: peaks in the promoter regions genes, defined as 2kb window up- and downstream of the TSS; (2) ubiquitous: peaks with no significant change in all comparisons of B cell vs. HSPCs, T cells and NK cells; (3) transition: peaks with significant change in all of B cell vs. T cells and NK cells; and (4) cell-type specific: peaks with significant change in all of B cell vs. HSPCs, T cells and NK cells. A similar logic is used to categorize peaks in the other cell types.

Predictive model of gene expression changes

SeqGL scores were computed for all the subpeaks. Broad peak scores were computed by summing the constituent subpeak scores. The number of non-zero k-mer groups identified in each cell type is an outcome of the SeqGL optimization algorithm. In particular, HSPCs SeqGL scores for 53 non-zero k-mer groups were used for HSPC-specific peaks and transition peaks, and B cell SeqGL scores for 48 groups were used for cell-type specific, transition, ubiquitous, and promoter peaks. Similarly, ~50 groups were identified by SeqGL in each other cell type. These scores are used to learn the weights of TF signals occurring in different peak categories using a regression model. Each category of peaks (promoter, ubiquitous, cell-type specific, HSPC-specific, and transition) is represented by all SeqGL TF features from the relevant cell types (i.e. for B cell specification, B cell features are used for all peak categories except HSPC-specific, and HSPC features are used for all peak categories except B cell specific). For each TF signal and peak category, all the SeqGL scores for this signal across all the “active” peaks belonging to the category and assigned to a gene are summed up, and this sum is used as the corresponding feature value for the gene; “active” peaks are those marked with either H3K4me1 or H3K27ac. Therefore, the extent of DNase accessibility is not considered in the model; all reproducible DNase peaks with active histone marks in a particular category are treated equally. The multiplicity of peaks containing a TF signal does figure into the model, e.g. if a gene gains many B cell specific peaks all containing an IRF signal, than the corresponding feature value (IRF binding signal in B cell specific peaks) will be high because it will be the sum of multiple IRF SeqGL scores. The TF binding site identification is used to turn each gene’s set of assigned (active) DNase peaks into a feature vector of binding signals. As described above (and as we illustrate in Fig. 5a), binding scores for each TF are summed across each category of peaks. Therefore, e.g., the information in the set of ubiquitous peaks assigned to a gene is summarized as a feature vector of TF scores, one for each SeqGL TF signal. If there are multiple peaks in a category, information about which specific peak the signal came from is not retained. Similarly, while we use second order features in the model – e.g. “PU.1 signals and GATA signals in HSPC-specific peaks” – this is the product of the summary features for PU.1 and for GATA in HSPC-specific peaks rather than co-occurrences of these signals in individual peaks. Retaining more information about the sequence signals in individual peaks and interactions between peaks could be a direction for future work, perhaps by integrating high-resolution Hi-C data. For now, the model is sufficient to learn the major TFs that explain gene expression changes and the categories of peaks through which they act. Ridge regression models were trained using the SPAMS package[52], and a regression model was trained separately for the transition from HSPCs to each differentiated cell type.

DNA signals in poised B cells and HSPC genes

Active HSPC and B cell peaks (including cell-type specific, promoter, transition, and ubiquitous) peaks were first identified in both poised B cell and HSPC genes. The B cell and HSPC SeqGL models were used to predict sequence signals in these peaks.

Gene ontology analysis

Gene ontology analysis was performed using the DAVID tool[53]. All human genes were used as background, and analysis was performed using the “Biological Process” ontology terms.

Iterative reassignment of enhancers

Genes with high cross-validation errors were first identified (error > 85th percentile). For each high-error gene, the nearest marginally expressed gene (expression < 75% percentile, normalized RPM threshold: 9.5) neighboring the high-error gene’s currently defined regulatory locus was identified, and the enhancers of the non-expressed gene were assigned the expressed gene. For high-error genes with positive fold change, differentiated cell expression was used to identify non-expressed neighbors; HSPC expression was used to identify non-expressed genes for high-error genes with negative fold change. 10-fold cross-validation was then performed with the new enhancer assignments. This process was repeated until the improvement in cross validation was less than 1e-2.

48 in total

1. Super-enhancers in the control of cell identity and disease.

Authors: Denes Hnisz; Brian J Abraham; Tong Ihn Lee; Ashley Lau; Violaine Saint-André; Alla A Sigova; Heather A Hoke; Richard A Young
Journal: Cell Date: 2013-10-10 Impact factor: 41.582

2. A map of the cis-regulatory sequences in the mouse genome.

Authors: Yin Shen; Feng Yue; David F McCleary; Zhen Ye; Lee Edsall; Samantha Kuan; Ulrich Wagner; Jesse Dixon; Leonard Lee; Victor V Lobanenkov; Bing Ren
Journal: Nature Date: 2012-08-02 Impact factor: 49.962

3. Genome-wide chromatin state transitions associated with developmental and environmental cues.

Authors: Jiang Zhu; Mazhar Adli; James Y Zou; Griet Verstappen; Michael Coyne; Xiaolan Zhang; Timothy Durham; Mohammad Miri; Vikram Deshpande; Philip L De Jager; David A Bennett; Joseph A Houmard; Deborah M Muoio; Tamer T Onder; Ray Camahort; Chad A Cowan; Alexander Meissner; Charles B Epstein; Noam Shoresh; Bradley E Bernstein
Journal: Cell Date: 2013-01-17 Impact factor: 41.582

4. Overexpression of the zinc finger protein MZF1 inhibits hematopoietic development from embryonic stem cells: correlation with negative regulation of CD34 and c-myb promoter activity.

Authors: D Perrotti; P Melotti; T Skorski; I Casella; C Peschle; B Calabretta
Journal: Mol Cell Biol Date: 1995-11 Impact factor: 4.272

Review 5. Long range regulation of the sonic hedgehog gene.

Authors: Eve Anderson; Robert E Hill
Journal: Curr Opin Genet Dev Date: 2014-05-23 Impact factor: 5.578

6. The pluripotent regulatory circuitry connecting promoters to their long-range interacting elements.

Authors: Stefan Schoenfelder; Mayra Furlan-Magaril; Borbala Mifsud; Filipe Tavares-Cadete; Robert Sugar; Biola-Maria Javierre; Takashi Nagano; Yulia Katsman; Moorthy Sakthidevi; Steven W Wingett; Emilia Dimitrova; Andrew Dimond; Lucas B Edelman; Sarah Elderkin; Kristina Tabbada; Elodie Darbo; Simon Andrews; Bram Herman; Andy Higgs; Emily LeProust; Cameron S Osborne; Jennifer A Mitchell; Nicholas M Luscombe; Peter Fraser
Journal: Genome Res Date: 2015-03-09 Impact factor: 9.043

7. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

8. Role of conserved non-coding DNA elements in the Foxp3 gene in regulatory T-cell fate.

Authors: Ye Zheng; Steven Josefowicz; Ashutosh Chaudhry; Xiao P Peng; Katherine Forbush; Alexander Y Rudensky
Journal: Nature Date: 2010-01-13 Impact factor: 49.962

9. The long-range interaction landscape of gene promoters.

Authors: Amartya Sanyal; Bryan R Lajoie; Gaurav Jain; Job Dekker
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

10. Integrative analysis of 111 reference human epigenomes.

Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis
Journal: Nature Date: 2015-02-19 Impact factor: 69.504

46 in total

1. Inference of cell type specific regulatory networks on mammalian lineages.

Authors: Deborah Chasman; Sushmita Roy
Journal: Curr Opin Syst Biol Date: 2017-04-17

2. Genome-Wide Characterization of DNase I-Hypersensitive Sites and Cold Response Regulatory Landscapes in Grasses.

Authors: Jinlei Han; Pengxi Wang; Qiongli Wang; Qingfang Lin; Zhiyong Chen; Guangrun Yu; Chenyong Miao; Yihang Dao; Ruoxi Wu; James C Schnable; Haibao Tang; Kai Wang
Journal: Plant Cell Date: 2020-05-29 Impact factor: 11.277

3. Causal Gene Regulatory Network Modeling and Genomics: Second-Generation Challenges.

Authors: Ellen V Rothenberg
Journal: J Comput Biol Date: 2019-05-07 Impact factor: 1.479

4. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin.

Authors: Sai Ma; Bing Zhang; Lindsay M LaFave; Andrew S Earl; Zachary Chiang; Yan Hu; Jiarui Ding; Alison Brack; Vinay K Kartha; Tristan Tay; Travis Law; Caleb Lareau; Ya-Chieh Hsu; Aviv Regev; Jason D Buenrostro
Journal: Cell Date: 2020-10-23 Impact factor: 41.582

5. Deep learning of immune cell differentiation.

Authors: Alexandra Maslova; Ricardo N Ramirez; Ke Ma; Hugo Schmutz; Chendi Wang; Curtis Fox; Bernard Ng; Christophe Benoist; Sara Mostafavi
Journal: Proc Natl Acad Sci U S A Date: 2020-09-25 Impact factor: 11.205

6. High-Resolution Mapping of Multiway Enhancer-Promoter Interactions Regulating Pathogen Detection.

Authors: Pranitha Vangala; Rachel Murphy; Sofia A Quinodoz; Kyle Gellatly; Patrick McDonel; Mitchell Guttman; Manuel Garber
Journal: Mol Cell Date: 2020-09-28 Impact factor: 17.970

7. Epigenomic State Transitions Characterize Tumor Progression in Mouse Lung Adenocarcinoma.

Authors: Lindsay M LaFave; Vinay K Kartha; Sai Ma; Kevin Meli; Isabella Del Priore; Caleb Lareau; Santiago Naranjo; Peter M K Westcott; Fabiana M Duarte; Venkat Sankar; Zachary Chiang; Alison Brack; Travis Law; Haley Hauck; Annalisa Okimoto; Aviv Regev; Jason D Buenrostro; Tyler Jacks
Journal: Cancer Cell Date: 2020-07-23 Impact factor: 31.743

8. IKZF2 Drives Leukemia Stem Cell Self-Renewal and Inhibits Myeloid Differentiation.

Authors: Sun-Mi Park; Hyunwoo Cho; Angela M Thornton; Trevor S Barlowe; Timothy Chou; Sagar Chhangawala; Lauren Fairchild; James Taggart; Arthur Chow; Alexandria Schurer; Antoine Gruet; Matthew D Witkin; Jun Hyun Kim; Ethan M Shevach; Andrei Krivtsov; Scott A Armstrong; Christina Leslie; Michael G Kharas
Journal: Cell Stem Cell Date: 2018-11-21 Impact factor: 24.633

9. Developmental Control of NRAMP1 (SLC11A1) Expression in Professional Phagocytes.

Authors: Mathieu F M Cellier
Journal: Biology (Basel) Date: 2017-05-03

10. A RUNX-CBFβ-driven enhancer directs the Irf8 dose-dependent lineage choice between DCs and monocytes.

Authors: Koichi Murakami; Haruka Sasaki; Akira Nishiyama; Daisuke Kurotaki; Wataru Kawase; Tatsuma Ban; Jun Nakabayashi; Satoko Kanzaki; Yoichi Sekita; Hideaki Nakajima; Keiko Ozato; Tohru Kimura; Tomohiko Tamura
Journal: Nat Immunol Date: 2021-02-18 Impact factor: 25.606