Transcription is regulated by transcription factor (TF) binding at promoters and distal regulatory elements and histone modifications that control the accessibility of these elements. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has become the standard assay for identifying genome-wide protein-DNA interactions in vitro and in vivo. As large-scale ChIP-seq data sets have been collected for different TFs and histone modifications, their potential to predict gene expression can be used to test hypotheses about the mechanisms of gene regulation. In addition, complementary functional genomics assays provide a global view of chromatin accessibility and long-range cis-regulatory interactions that are being combined with TF binding and histone remodeling to study the regulation of gene expression. Thus, ChIP-seq analysis is now widely integrated with other functional genomics assays to better understand gene regulatory mechanisms. In this review, we discuss advances and challenges in integrating ChIP-seq data to identify context-specific chromatin states associated with gene activity. We describe the overall computational design of integrating ChIP-seq data with other functional genomics assays. We also discuss the challenges of extending these methods to low-input ChIP-seq assays and related single-cell assays.
Transcription is regulated by transcription factor (TF) binding at promoters and distal regulatory elements and histone modifications that control the accessibility of these elements. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has become the standard assay for identifying genome-wide protein-DNA interactions in vitro and in vivo. As large-scale ChIP-seq data sets have been collected for different TFs and histone modifications, their potential to predict gene expression can be used to test hypotheses about the mechanisms of gene regulation. In addition, complementary functional genomics assays provide a global view of chromatin accessibility and long-range cis-regulatory interactions that are being combined with TF binding and histone remodeling to study the regulation of gene expression. Thus, ChIP-seq analysis is now widely integrated with other functional genomics assays to better understand gene regulatory mechanisms. In this review, we discuss advances and challenges in integrating ChIP-seq data to identify context-specific chromatin states associated with gene activity. We describe the overall computational design of integrating ChIP-seq data with other functional genomics assays. We also discuss the challenges of extending these methods to low-input ChIP-seq assays and related single-cell assays.
DNA–protein interactions and epigenetic modifications are crucial for transcriptional regulation. Genome-wide profiling of transcription factor (TF)-binding sites, regions with covalently modified histones and other DNA-binding proteins reveal cell- or tissue-, species- and disease-specific cis-regulatory repertoires, which are vital for understanding gene regulation. Chromatin immunoprecipitation (ChIP) methodologies [1-3] use an antibody that recognizes a TF or histone modification to pull down attached DNA for identifying binding locations. With the rapid development of sequencing technology, chromatin immunoprecipitation followed by sequencing (ChIP-seq) [2-5] has become the most common and effective assay to identify bound loci genome-wide in vitro and in vivo. The basic computational pipeline and software for analyzing ChIP-seq data have been established and optimized alongside advances in sequencing library preparation and ChIP-seq techniques [6-8], including read quality control, alignment, peak calling and evaluation of reproducibility. ChIP peaks can be visualized using genome browsers as a simple quality check of signal over known true positives. Confirmed peaks can be further analyzed with differential density analysis for different treatments, gene-associated annotation, motif discovery and other downstream analyses. Limitations and advances in these steps are reviewed in detail elsewhere [9].However, the binding of one TF alone is rarely enough to directly infer functional effects on the gene expression levels of neighboring genes, which are typically under the combinatorial control of multiple TFs. Therefore, ChIP-seq data are often actively integrated with other functional genomic techniques to decipher the basic regulatory control of gene expression by incorporating open chromatin regions, long-range chromatin interactions and SNP (single-nucleotide polymorphism) variants. With the increasing availability of multiple ChIP-seq data sets [10, 11], as well as data sets from other genome-wide assays, the power of integrative computational analysis is of ever-increasing interest. In this review, we discuss the application of probabilistic models and machine learning methods to the analysis of TF and histone modification ChIP data simultaneously to identify chromatin patterns across multiple genomes and cell types. We also focus on the computational integration of ChIP-seq with other functional genomic assays such as RNA sequencing (RNA-seq) for gene expression levels, ATAC-/DNase-seq/FAIRE-seq for chromatin accessibility, and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET)/Hi-C for chromatin interactions that affect regulation of gene expression. Finally, we discuss the development of ChIP-seq assays that use low amounts of input materials, and their further application in the emerging field of integrative analysis of single-cell sequencing functional genomics data.
Identifying distinct chromatin states using histone modifications and TF occupancy
Histone modifications are often found in recurring combinations at promoters, enhancers and repressed regions. These combinations are referred to as ‘chromatin states’ and can be used to annotate regulatory regions in genomes [12, 13]. For example, H3K4me1 alone marks primed enhancers, while H3K4me1 combined with H3K27ac mark active enhancers. Promoters are characterized by a detectable level of H3K4me3 coupled with a high ratio of H3K4me3 to H3K4me1. Furthermore, H3K36me3 histone modifications and RNA polymerase (Pol) II ChIP signal are associated with transcribed regions, while the presence of H3K27me3 or H3K9me3 is associated with repressive chromatin states (Figure 1) [14, 15]. The goal of software packages analyzing chromatin states is to first discover these relationships in the data, and to then check for changes in states assigned to a particular region in different cell types. Large-scale data sets produced by ENCODE [10] and Roadmap Epigenomics [11] have been used to train and to test with statistical or machine learning methods that assign chromatin states to genomic segments (typically 100 bp or longer). These state assignments can then be interpreted through comparisons with known annotations and gene expression.
Figure 1.
Chromatin states are defined by different combinations of histone modifications, TFs and RNA Pol II binding. In this example, a typical repressive state (gray) is characterized by high H3K27me3 signal or H3K9me3 signal, an enhancer state (yellow) would show a high occupancy ratio of H3K4me1 to H3K4me3 as well as high H3K27ac and the promoter state (red) would show a high occupancy ratio of H3K4me3 to H3K4me1 as well as RNA Pol II binding at the promoter, whereas poised promoter state (magenta) would show the occupancy of H3K4me3 and H3K27me3 bivalent modifications. Actively transcribed region (green) is characterized by a high occupancy of H3K36me3 with some RNA Pol II binding along the gene body.
Chromatin states are defined by different combinations of histone modifications, TFs and RNA Pol II binding. In this example, a typical repressive state (gray) is characterized by high H3K27me3 signal or H3K9me3 signal, an enhancer state (yellow) would show a high occupancy ratio of H3K4me1 to H3K4me3 as well as high H3K27ac and the promoter state (red) would show a high occupancy ratio of H3K4me3 to H3K4me1 as well as RNA Pol II binding at the promoter, whereas poised promoter state (magenta) would show the occupancy of H3K4me3 and H3K27me3 bivalent modifications. Actively transcribed region (green) is characterized by a high occupancy of H3K36me3 with some RNA Pol II binding along the gene body.Hidden Markov models (HMMs) were originally developed for speech recognition, but have since been used extensively in other fields to identify hidden states from observed signal data [16]. In genomics studies, it has been successfully applied to gene annotation [17] and protein domain characterization [18]. HMMseg [19] was the earliest software package to partition and annotate a genome by training HMMs on functional genomics data. However, this tool can only identify two states (‘active’ or ‘inactive’), which limits its application in annotating chromatin states in greater detail, e.g. active/poised promoters and enhancers. ChromHMM [13] and Segway [20] were developed with the goal of capturing more comprehensive combinatorial patterns of multiple histone modifications, RNA Pol II binding and insulator CTCF binding genome-wide (Figure 2). ChromHMM segments the genome into minimum 200 bp intervals (default) and converts raw read counts into binary code using a product of independent Bernoulli random variables for each interval, which are then used to train a HMM. Similarly, Segway was developed based on dynamic Bayesian networks. It transforms raw read counts to coverage signal and can segment the genome down to 1 bp resolution, although 100 bp segments are more practical. Additional tools have been developed to extend and speed up the identification of chromatin states. For example, TreeHMM [21] also uses binary vectors but is position-dependent when inferring chromatin patterns during cell differentiation and across different cell types. hiHMM [22] uses a hierarchically linked infinite HMM model to not only identify chromatin states across multiple ChIP-seq data sets but also address species variance for cross-species inference. diHMM [23] inherits from ChromHMM but uses a hierarchical HMM to identify combinatorial patterns at variable length scale that range from nucleosome-level to higher-order domain-level states. Another joint analysis platform, IDEAS [24, 25], can infer chromatin states using both position-dependency and cell-type-specific cases at multiple range scales, and can run faster than both ChromHMM and Segway using single core mode. Additional tools have been developed for comparing chromatin patterns between different experimental treatments [26] and expanding the comprehensiveness of epigenomic maps [27]. The combinatorial patterns generated by these methods have been correlated with gene expression profiles to find context-specific signatures across cell types using linear regression model [24, 25]. However, the difficulty of interpreting large numbers of states has led to a practical preference for models with lower numbers of states. Typically, the focus is on the discovered states rather than their transition probabilities, unlike more traditional applications of HMM to gene annotation. The assumption is that a limited number of chromatin states and a small number of histone markers combinations covering significant fractions of the genome will capture most of the biologically relevant features.
Figure 2.
Graphical structure of annotating chromatin states using a HMM method such as ChromHMM. The genome is split into nonoverlapping segments, and ChIP-seq signal for histone modifications is binarized (0 or 1) and collected for each segment, which are further built into input matrix for HMM training. The hidden state of the current segment is dependent on the state of the previous one, and the transition probabilities (in red) of changing from one state to another are learnt from training on the input matrix. ChromHMM outputs trained hidden states for each segmentation, which are then interpreted as chromatin states based on the chromatin profile and gene annotations, such as active promoter/enhancer, transcriptional elongation or repressive states.
Graphical structure of annotating chromatin states using a HMM method such as ChromHMM. The genome is split into nonoverlapping segments, and ChIP-seq signal for histone modifications is binarized (0 or 1) and collected for each segment, which are further built into input matrix for HMM training. The hidden state of the current segment is dependent on the state of the previous one, and the transition probabilities (in red) of changing from one state to another are learnt from training on the input matrix. ChromHMM outputs trained hidden states for each segmentation, which are then interpreted as chromatin states based on the chromatin profile and gene annotations, such as active promoter/enhancer, transcriptional elongation or repressive states.While useful for predicting chromatin states, HMM-based methods have been relatively less successful when applied to a large number of TFs with restricted, presumably combinatorial binding patterns, which cover small fractions of the genome. Self-organizing maps (SOMs) are an alternative, unsupervised machine learning method for integratively analyzing such high-dimensional, comparatively sparse data. SOMs consist of individual units (which can be thought of as either neurons or mini-clusters) arranged on a scaffold that is trained with data to capture the high-density parts of high-dimensional data sets while preserving similarity relationships, i.e. data that are close in the input will also be close on the SOM. Chromatin SOMs identify TF-TF localization and co-binding pairs of TFs across cell types and tissues [28]. SOMs have been trained on the same data as chromHMM and Segway in ENCODE, namely, histone modification markers, RNA Pol II and CTCF. These are then overlaid post-training with additional data such as EP300 ChIP-seq signals to confirm cell-type-specific and commonly shared enhancer activity of groups of DNA segments [29]. For example, a trained SOM would distinguish open chromatin regions from promoters and enhancers based on their difference in H3K4me3 and H3K4me1 signal density (Figure 3). The individual units in SOM maps can be grouped into map regions called metaclusters [29, 30], which can then be analyzed for their ChIP-seq signal enrichments and used to automatically identify sets of potentially co-regulated regions [29]. Once a unit or metacluster of interest has been identified, proximal genes can be associated with bound DNA elements by using tools like GREAT [31] and Homer [32], and their gene expression profiles can be correlated [24] and visualized together with DNA element activity. Co-associated genes can then be analyzed for gene ontology enrichment using GREAT and Homer, but other tools such as DAVID [33] and Metascape [34] can also be applied to identify potential functional enrichments. While SOM does not impose a state transition model like HMMs, it recovers similar high-level states at the level of metaclusters but allows for further granular mining of ‘microstates’ corresponding to specific chromatin profiles in individual such as distinct combination of TFs that are present in small sections of the genome [29]. SOM can therefore be used to deeply data-mine for complicated relationships in highly dimensional ChIP-seq data sets.
Figure 3.
Graphical structure of annotating chromatin states using SOMs. (A) The genome is split into nonoverlapping segments, and ChIP-seq signal for histone modifications is collected for each segment to build a signal matrix for SOM training, where each segment represents a vector of signal. (B) At the beginning of training, the map consists of a grid of regularly spaced or randomly initialized units (green dots) that we wish to fit to the data, which are signal vectors (black plus signs) spread in high-dimensional space. (C) For each training step, a signal vector is selected and the closest unit is found. The best matching unit is pulled as well as other surrounding units toward to the selected signal vector, which causes the map to adapt itself to match the data distribution in the space. (D) The trained SOM map is divided into metacluster regions (metaclusters 1–8) that represent combinations of signal enrichments. (E) Metaclusters are then assigned chromatin state labels by inspection based on annotations and the combinations of signal enrichments as in the HMM case.
Graphical structure of annotating chromatin states using SOMs. (A) The genome is split into nonoverlapping segments, and ChIP-seq signal for histone modifications is collected for each segment to build a signal matrix for SOM training, where each segment represents a vector of signal. (B) At the beginning of training, the map consists of a grid of regularly spaced or randomly initialized units (green dots) that we wish to fit to the data, which are signal vectors (black plus signs) spread in high-dimensional space. (C) For each training step, a signal vector is selected and the closest unit is found. The best matching unit is pulled as well as other surrounding units toward to the selected signal vector, which causes the map to adapt itself to match the data distribution in the space. (D) The trained SOM map is divided into metacluster regions (metaclusters 1–8) that represent combinations of signal enrichments. (E) Metaclusters are then assigned chromatin state labels by inspection based on annotations and the combinations of signal enrichments as in the HMM case.
Incorporating chromatin accessibility with ChIP-seq
Eukaryotic chromatin is tightly packaged into nucleosomes, and the positioning of nucleosomes regulated by TFs and histone modifications shows dynamic patterns during cell differentiation and development [35]. Specific proteins, often called pioneer factors, can control nucleosome repositioning via recruitment of chromatin remodelers, thus exposing cis-regulatory elements to lineage- or cell-type-specific TFs that activate or repress gene expression [15, 36]. Additionally, nucleosomes with H3.3/H2A.Z histone variants show hypermobility, which make them less stable and the DNA more easily accessible for TFs binding [37, 38]. Histone-depleted regions are referred to as open chromatin (Figure 1), and several sequencing assays have been developed to capture chromatin accessibility directly at high resolution such as DNase-seq [39-41], FAIRE-seq [42, 43] and ATAC-seq [44]. MNase-seq [35, 45, 46] is a related assay for identifying DNA regions occupied by nucleosomes instead of detecting open chromatin regions directly. DNase- and ATAC-seq depend on enzymatic digestion and Tn5 transposase insertion, respectively, to detect open chromatin regions in vivo. Both of them have a higher signal-to-noise ratio than the other methods, and ATAC-seq has become increasingly popular because of its ease of use. All of these methods need deep sequencing (about 50–100 million reads per sample) to get accurate, high-resolution profiles. The basic computational pipeline for open chromatin assays includes reads alignment, visualization for QC, peak calling and footprint analysis for DNase- and ATAC-seq or nucleosome profiling for MNase- and ATAC-seq (each step has been reviewed in detail elsewhere) [35, 47]. Specific software packages have been developed to detect signal-enriched regions for each assay. For example, Hotspot [48] detects DNase I hypersensitive regions for DNase-seq; GeneTrack [49] and DANPOS [50] do nucleosome calling for MNase-seq; NucleoATAC [51] calls nucleosome positions and occupancy for ATAC-seq. In addition, tools developed for ChIP-seq and DNase-seq peak calling also work effectively for ATAC-seq, such as MACS [52], Hotspot [48] and Homer [32]. DNAse-seq open chromatin data have been used alongside histone modification ChIP-seq data to define chromatin states using HMMs and SOMs in the ENCODE project [10, 29].Deeper sequencing of open chromatin data to 200–500 million reads per sample can also be used to detect TF-binding occupancy ‘footprints’ at nucleotide resolution [35]. The ability of DNase- and ATAC-seq to perform footprint calling is the consequence of TF occupancy protecting DNA from nuclease cleavage and Tn5 transposition, which results in small stretches of fewer cuts within otherwise open regions. The sequences within these footprints can be compared with known motifs for identification [53-55]. The power of footprinting is that a single experiment can identify the binding sites for hundreds of TFs, a task that would be still gargantuan with hundreds of TF-specific ChIP-seq experiments. However, many TF motifs are similar to each other and can be difficult to distinguish based on sequence alone. For these cases, ChIP-seq of selected TFs can be used to validate the footprints when they are critical to the inferred gene regulatory networks [56]. Additionally, histone modification ChIP-seq data can be mapped to open chromatin peaks to confirm the chromatin state of regulatory elements [44, 57–59]. The profiling of chromatin accessibility and TFs/histone occupancy has revealed that cis-regulatory elements show both transitory and stable activity during development and differentiation process for different lineages [60, 61]. Integrative analysis of chromatin accessibility and TFs occupancy from ChIP-seq has revealed that the two processes are not necessarily synchronous. Some TFs commonly referred to as pioneer factors can induce and remodel chromatin accessibility [62-65]. On the other hand, chromatin can be opened and activated before TF binding [48] or closed well after the TF has ceased to be bound. As open chromatin assays such as ATAC-seq are relatively easier to do and require less starting material than ChIP-seq, we expect that an increasing number of studies will start with open chromatin data followed with selected ChIP-seq for TFs and/or histone modifications. These data will be analyzed integratively with additional packages developed to facilitate their joint analysis.
Integrative analysis of gene expression with ChIP-seq
Most users of ChIP-seq data are interested in understanding the impact of TF binding or histone modifications on the expression of nearby genes, and therefore, ChIP-seq and RNA-seq are analyzed jointly to estimate this effect [6, 7, 14, 66, 67]. In the ideal case, a high ChIP-seq signal of a transcriptional activator would be found near highly expressed genes, while a high ChIP-seq signal of a repressor would be found near silenced genes. In another case, differentially expressed genes are first identified and classified into upregulated or downregulated genes between different experimental treatments. Then, differential TF and epigenetic occupancy are correlated with differential gene expression levels. TF-binding peaks and histone modification-enriched regions are associated with genes based on which gene is nearest, or using a particular distance radius. However, TF and epigenetic occupancy alone are seldom effective in predicting nearby target gene expression level accurately because (a) they cannot account for posttranscriptional turnover of the transcript, (b) it is difficult to accurately associate ChIP-seq peaks with their target genes and (c) we may not have the ChIP-seq data for all of the TFs controlling the expression of the target genes. One study has reported that the binding signal of 12 embryonic stem cell (ESC) TFs can explain 65% of the variance in mES gene expression, and the correlation coefficient between predicted and observed gene expression is 0.8 [68]. However, the predictive power of the same set of TFs in differentiated mES decreased dramatically (r = 0.2) [68], and they can only explain 30% of gene expression variance in GM12878 [69]. In addition, while histone modifications alone can explain high gene expression variance in human CD4 T+ cells (r = 0.7), combinatorial histone modification combinations show different predictive power [70]. Inferring the effect of TFs on expression is complicated by the fact that TFs may activate a subset of target genes but repress others. Furthermore, TFs and histone marks have different power in predicting gene expression levels [71-73]. Thus, this approach is only practical for predicting gene expression in well-studied systems, where there are plenty of TFs and histone modifications data sets available that can be selected based on biological significance.Efforts have been made to integrate chromatin accessibility data and ChIP-seq together to predict gene expression, and this combination is more accurate than using ChIP-seq alone [69]. However, the asynchrony between binding and chromatin accessibility also accounts for the less than perfect correlation between changes in these metrics and changes in gene expression. This is because transcription is the sum total of the multitude of effects of chromatin remodelers, TFs co-occupancy, different combination of histone marks and even DNA methylation, which are laborious to capture and profile simultaneously. Using regression models of RNA-seq, ChIP-seq and chromatin accessibility data, gene expression can be predicted from TFs/histone binding [69] and ChIP-seq-identified TF-binding motifs in open chromatin regions [74]. Mixed linear models of gene expression correlated with chromatin accessibility corrected with ChIP-seq TF binding can predict TF triggering or binding before chromatin remodeling [75]. Furthermore, TF-TFs co-occupancy can be predicted using support vector machines (SVMs) trained on open chromatin, histone markers and TFs ChIP-seq data [76]. The predictive power of integrated chromatin feature data can also be extended to the inference of gene regulatory networks. In one recent study [77], chromatin feature data were not only used to predict gene expression but also to predict the activation status of regulatory elements and further infer a context-specific gene regulatory network. The expression of TFs, target genes and chromatin remodelers as well as the accessibility of cis-regulatory elements and TF motifs in regulatory elements are integrated together and fed into a statistical Paired Expression and Chromatin Accessibility (PECA) model. This model predicts active cis-regulatory elements, TF expression and expression of related target genes within the same context-specific gene regulatory network, which are confirmed by knocking down key TFs in the network [78]. Although combining TF/histone modification ChIP-seq and chromatin accessibility data is an effective strategy for predicting gene expression and inferring gene regulatory networks, more software packages and platforms are still needed to be developed for integrating data from different functional assays. We expect that the next generation of packages will improve the predictive power of ChIP-seq for gene expression prediction using ever-more sophisticated and robust statistical methods.
Incorporating long-range chromatin interactions with ChIP-seq
Most gene regulatory analyses only consider the effects of histone modifications and TFs on the nearest gene, thus not taking into account long-range interactions of cis-regulatory elements with more distal genes. Promoters and enhancers are physically coupled with target genes by chromatin loops mediated by TFs, cohesin, mediator and some noncoding RNAs to control gene expression [79-83]. A single promoter or enhancer can interact with multiple enhancers or promoters within the same chromatin loops [10, 84]. Recruitment of cofactors such as EP300 by TFs ultimately mediates these complex promoter–enhancer interactions. Chromosome conformation capture (3C)-based sequencing assays such as Hi-C [85, 86] and ChIA-PET [87] can be used to detect these long-range interactions. In particular, ChIA-PET combines ChIP and 3C-based methods to detect chromatin interactions between sites bound by specific proteins such as RNA Pol II or CTCF on a genome-wide scale [79, 88], but requires hundreds of millions of cells as starting materials. Compared with ChIA-PET, Hi-C can capture all sites interactions in the genome but at the expense of deep sequencing, as it needs at least a billion reads to achieve 1 kb resolution in mammalian genomes [85, 86, 89]. ChIA-PET can capture promoter–enhancer, promoter–promoter and enhancer–enhancer interactions that involve RNA Pol II directly, while Hi-C identifies TADs (topologically associated domains) in chromatin structure. Newer methods such as HiChIP [90] and PLAC-seq [91] combine the advantages of ChIA-PET and Hi-C to capture long-range interactions more efficiently and accurately. 3C-based methods and the basic computational analysis pipelines for each of the techniques have been reviewed previously [92, 93].Although the mechanisms of long-range interactions are not completely understood, it is known that TFs and histone modifications are actively involved in the interactions and may help alter the chromatin structures [94]. By coupling ChIP-seq with long-range interaction data, studies find that TFs such as CTCF and YY1 are highly enriched in interacting loci or the boundaries of TADs in long-range interactions [86, 88, 89, 95–101]. Multiple studies have reported that CTCF can also co-bind with other TFs to form lineage—or cell-type—specific long-range interactions and activate context-specific gene expression [101-104]. It has also been shown that disruptions to TF binding at TADs boundaries or cis-regulatory elements, whether caused by mutations, methylation of TF-binding sites or deletion of a TF, can cause remodeling of chromatin interactions and abnormal expression of target genes, which may lead to disease [105, 106]. To integrate ChIP-seq data with ChIP-based long-range interaction data (i.e. ChIA-PET), peak callers are used to find TF co-binding and histone modifications in anchor sites of PETs [79, 107, 108]. For example, RNA Pol II ChIA-PET detects promoter–promoter and enhancer–promoter interactions directly. Enhancers or promoters can be further confirmed by comparing ChIP signal between H3K4me3 and H3K4me1 modifications [79]. In addition, distal enhancers have been thought to interact with promoters via cohesin-associated CTCF-CTCF loops that also insulate enhancers from genes that they are not supposed to target. The insulators are identified by overlapping anchor sites of cohesin ChIA-PET with cohesin and CTCF ChIP signal, while active enhancers are marked with H3K27ac ChIP signal [108]. Specific TFs co-binding patterns involved in the cis-interactions can be detected with ChIP-seq peak calling in the anchor regions [107]. Furthermore, differential promoter–promoter, enhancer–promoter and enhancer–enhancer interactions can be identified using ChIP-seq of histone modifications and comparing ChIP signal between conditions [79, 108]. For example, CTCF ChIP-seq signal at the anchor sites of cohesin PETs was used to confirm CTCF-CTCF loops in hESC. Although CTCF-CTCF loops are highly conserved between naïve and primed ES cells, the loop structures are different in terms of enhancer–promoter and enhancer–enhancer interactions, as can be seen by comparing H3K27ac ChIP-seq signal between the two states [108]. A popular strategy is to segment the genome into TADs using HiC when available, or predicting TADs using CTCF and/or cohesin component ChIP-seq to constrain interactions between TFs and cis-regulatory elements within these ∼100–1000 kb regions [109]. By matching ChIP-seq peaks of CTCF and cohesin complex proteins to non-ChIP-based long-range interaction data, like Hi-C, TAD boundaries can be defined and TADs can be segmented into sub-transcription units more accurately [108]. Although TADs have relatively conserved segmentation structure during cell development and differentiation [105, 110], the intra-TAD interactions and epigenetic states of TADs are less stable in terms of outside stimulus and differentiation conditions [110, 111]. By comparing normalized ChIP signal of histone modifications within TADs before and after treatment, it is possible to define activated or repressed TAD states that are then correlated with differentially expressed genes within the same TADs. As ChIP-seq has been performed routinely in many laboratories and large consortiums such as the ENCODE [10] and modENCODE [112] projects, many ChIP-seq data sets are available for public use. Frequent chromatin interaction loci (‘hubs’) and TAD boundaries can be predicted accurately from published histone ChIP-seq data integrated with customized Hi-C [113]. Interestingly, a recent study shows that cohesin loss causes loop domains to disappear based on Hi-C data, but CTCF and histone modification ChIP-seq data show that their patterns are unaffected. The disappearance of loop domains only affects the expression levels of a small percentage of genes, which suggests that cohesin-mediated loops only have modest effects on transcription for most genes and that super-enhancers of genes seem to keep their activity intact without cohesin looping [114]. Thus, given the complex relationship between long-range interactions and gene expression, more studies applying Hi-C/ChIA-PET coupled with ChIP-seq are needed to understand the exact role of chromatin loops in gene expression and to further categorize genes based on their response to the disruption of loop formation.
Predicting regulatory sequence variants by integrative analysis with ChIP-seq
Sequence variants or SNPs are known to be associated with genetic traits and diseases [115, 116]. Most SNPs identified by genome-wide association studies as associated with traits or diseases are found outside of protein-coding regions, with the majority of these noncoding SNPs located in open chromatin regions [117, 118]. As open chromatin regions map to enhancers and promoters, noncoding SNPs in the accessible regions may interrupt or strengthen protein–DNA interactions by introducing sequence variants into binding motifs, and thus causing gene expression and traits to vary between individuals. Indeed, multiple studies have reported that many disease-causing nucleotide changes are in TF-binding sites and affect TF-DNA-binding events [10, 119–131]. The interruption in TF binding can not only influence proximal gene expression but also that of distal genes [122, 125, 129, 132]. However, only a minority of differential TF-DNA-binding causes can be explained by sequence variation in binding motifs [133]. Besides, allelic occupancy profiling of >20 TFs using ChIP-seq data revealed that only a small proportion of these events have sequencing variants in binding motifs for specific TFs [134]. Although local variants in motifs are not necessarily affecting specific TF binding, sequence context is still an important source of differential TF-DNA binding. For example, proximal sequence changes may influence cooperative TF-TF binding [133, 135–138], and distal variants can affect TF-DNA and TF-TF interactions by changing chromatin state and conformation [133, 139–141].Many efforts have been made to integrate ChIP-seq and other experimental data to predict regulatory sequence variants. One of the most straightforward methods is to match SNPs to known TF-binding motifs from database such as JASPAR [142] and TRANSFAC [143], or to look for putative TF-bindings sites using HMMs. The binding affinity score can be calculated based on a position weight matrix representation of the motif. When comparing the motif affinity score between two alleles, a greater motif score difference indicates that the variant is more likely to be regulatory [144-146]. However, these methods rely on known TF-binding sites and do not leverage the predictive power of chromatin signatures to filter out a large set of false-positive predictions. Recent studies have successfully integrated ChIP-seq and DNase-seq data into predictive analyses without relying on TF-binding motifs databases [147, 148]. In these studies, peak calls from ChIP-seq and DNase-seq are scanned for k-mers of a given length, and the putative regulatory sequences are used to train a SVM to predict the regulatory power of any k-mer sequence. The weighted sequences can then be used to predict the impacts of single-nucleotide changes on regulatory activity in the variant sequences [147]. Another version of this method is to weigh the predictive power of k-mer sequences and compute DNase-seq covariates from ChIP-seq data using regression methods. The trained k-mers and DNase-seq signals are then used to predict ChIP-seq binding signals at two alleles. By comparing the predicted ChIP-seq signal between the reference and variant alleles, the variant can be predicted to be regulatory or not [148]. Other studies have applied deep learning methods such as convolutional neural nets to more comprehensively integrate sequence variants, chromatin states, chromatin accessibility and even RNA-binding protein data to predict which regulatory variants will be functional [149, 150]. Some regulatory variants are disease-associated, and we can predict the effect of those variants on the binding affinity of TFs by evaluating the change in score for the motif [149]. We expect additional work on the development algorithms that can predict potential causal disease variants from the integration of functional genomics data, which will require experimental validation. The validation data in turn will be of great value for training the next set of methods to analyze variants from ChIP-seq data.
ChIP-seq integrative analysis in the era of low cell count and single-cell genomics
ChIP-seq has been the standard method for identifying genome-wide protein–DNA interactions when a specific antibody is available [151]. However, the traditional ChIP-seq technique requires a large amount of starting material (preferably >10 million of cells) to get high-resolution profiles, which limits its applicability for small organisms, rare cell types and single cells. Efforts have been made to optimize the ChIP-seq protocol for a low amount of starting materials, which successfully detect TF-binding signals with as few as 5000 cells [152] and H3K4me3-binding signals with only 500 cells [153]. Although these methods generate binding profiles at a good resolution with a small number of cells, the experimental procedures are still time-consuming and costly. Owing to the need for high polymerase chain reaction amplification in the low-input ChIP protocols, the number of identical aligned reads needs to be carefully corrected for during data analysis. The low-input ChIP-seq peaks can also be compared with open chromatin regions from ATAC-seq to show high correlation between enhancer histone modifications and open chromatin regions. By doing motif discovery analysis, people also identify lineage-specific TF binding to lineage-specific open chromatin regions. TF expression levels have been observed to correlate with differential open chromatin regions accessibility across cell types [153]. Another advancement in low-input ChIP-seq technique is to couple ChIP and Tn5 transposase tagmentation to add sequencing adapters to the bead-bound chromatin in a single step [154]. This protocol is both fast as well as cost-effective, and it successfully identifies TF binding with 100 000 cells and histone markers with 10 000 cells. The ChIP signal needs to be normalized to genomic tagmentated DNA to remove tagmentation bias. However, the protocol also benefits from Tn5 insertions in open chromatin regions to detect TF footprints and nucleosome positioning [154].Single-cell epigenetics is a rapidly emerging area because of the development of new techniques [155, 156]. While we know that TF binding, histone modifications, chromatin accessibility, DNA methylation and long-range interactions work together to generate context-specific patterns, these results are primarily based on experiments with bulk samples. Individual cells may have different epigenetic patterns that influence their random behaviors [157]. Therefore, many single-cell epigenetics assays [156] have been developed to study this, including scATAC-seq [158, 159], scHi-C [160] and scBS-seq [161-163]. In addition, several techniques have been developed to couple multiple functional assays together to get transcriptomic and epigenetic data from the same cell simultaneously [164-166]. Compared with these methods, single-cell ChIP-seq seems more limited because of the technical difficulties of working from so little material. Only one protocol has successfully performed ChIP-seq at single-cell level [167], identifying hundreds of histone modification peaks per cell. The authors successfully distinguished three cell types by doing unsupervised hierarchical clustering and identifying subpopulations with different chromatin signatures. However, the low input and antibody sensitivity cause single-cell ChIP-seq to suffer from high technical variance and low sensitivity across individual cells. Similarly, recent advances in single-cell ATAC-seq [158, 159] successfully identified individual open chromatin regions in single cells, with the downside of low signal-to-noise compared with bulk ATAC-seq. However, scACTAC-seq reads are aggregated to be validated when comparing with bulk ATAC-seq data, which shows less technical variance and higher sensitivity compared with single-cell ChIP-seq. The high background IP noise probably limits scChIP-seq to histone modifications, and extensive computational analysis needs to be carried out to remove the noise in peak calling. The strategy used for now is to segment ChIP-ed DNA for peak calling for individual cells and then cluster cells based on fractions of reads in known ChIP peaks from bulk samples. Thus, the analysis is still performed at low-input level rather than the true single-cell level [9]. Future studies need to develop methods to remove IP noise and improve solid peak calling in individual cells, as bulk analysis methods cannot be applied directly in single-cell assays. We can further expect that methods will appear combining single-cell ChIP-seq and single-cell RNA-seq from the same cells, which will open up new possibilities when working from mixed cell types and difficult-to-obtain samples.
Future direction and conclusion
ChIP-seq has become the standard method for profiling protein–DNA binding over the past decade, and it has been actively integrated with newer functional genomics assays such as RNA-seq, DNase/ATAC-seq and Hi-C/ChIA-PET to generate models of gene regulation. In the best studies, the integrative analysis is validated with a series of validation experiments to show that the binding of particular TFs is critical for target genes expression. As ATAC-seq and RNA-seq protocols continue to become easier, we expect that ChIP-seq will be routinely integrated with these functional genomics assays. While most current studies compare different ‘static’ cell types, transcription changes temporally in response to stimuli that involve changes in TF binding, and will become more often the subject of study using ChIP-seq during development and/or stimulation. ChIP-seq following perturbations will also become more routine, and will need to be integrated when building predictive models to identify potentially active cis-regulatory elements and key TFs, which would guide experimental validation and will feed back into further model building.Another challenge in ChIP-seq integrative analysis will be how to incorporate long-range interaction and gene expression data into the chromatin state analyses that are being done with HMMs and SOMs. Currently, all of these analyses include multiple ChIP-seq data sets and can incorporate chromatin accessibility but are not designed to incorporate connectivity between distant regions or gene expression data as part of their training as opposed to post-training analysis and annotation. A challenge is that while at least ChIA-PET and HiC are working in a similar ‘feature space’ of chromatin as ChIP-seq, regular RNA-seq is measuring the steady state of transcripts, which is affected by several posttranscriptional processes such as mRNA turnover mediated by microRNAs. As chromatin will always be more predictive of transcriptional initiation, it may be more fruitful to compare the predicted models of expression to GRO-seq and other measurements of transcriptional activity than regular RNA-seq.In recent years, ChIP-seq techniques for low-input materials have been developed to expand its applications to rare tissues or cell types, and even single-cell studies. Other functional genomics assays have also been developed at single-cell level to answer new biological questions. However, the integrative analysis of single-cell ChIP-seq with these functional genomics assays in single cells is a difficult challenge. One reason is that the experimental protocols to capture protein binding, transcriptomes and DNA methylation data from the same cell are still not available. However, it may still be worthwhile to integrate data from scChIP-seq and other functional genomics assays in different individual cells from the same pool based on the assumption that protein-binding profiles would match to the gene expression profiles from the assay because these cells are from the same pool. Once protocols are available to do scRNA-seq and scChIP-seq from the same single cell, algorithms will need to be developed to integrate these single-cell data types together to understand the connection between binding and gene expression heterogeneity in subsets of a cell population. As single-cell data are sparser than bulk data, new statistical methods and tools are required for integration. In a hopefully not-so-distant future where robust single-cell ChIP-seq and RNA-seq are practical, they could become the method of choice for studying samples where the amount of material or the heterogeneity of the population makes the bulk version of these experiments less attractive.Key PointsTF and histone modification ChIP-seq data can be
used to define chromatin states for annotating regulatory regions in the genome.ChIP-seq data can be integrated with chromatin
accessibility and long-range interaction data to further decipher mechanisms of gene regulation.ChIP-seq data can be integrated with other functional
genomics data to predict noncoding regulatory sequence variants.Single-cell ChIP-seq promises to reveal the cell-to-cell
variability of TF and histone occupancy, but the experimental and computational methods still need to be improved to capture meaningful ChIP signal.
Authors: B Ren; F Robert; J J Wyrick; O Aparicio; E G Jennings; I Simon; J Zeitlinger; J Schreiber; N Hannett; E Kanin; T L Volkert; C J Wilson; S P Bell; R A Young Journal: Science Date: 2000-12-22 Impact factor: 47.728
Authors: Rosa Karlić; Ho-Ryun Chung; Julia Lasserre; Kristian Vlahovicek; Martin Vingron Journal: Proc Natl Acad Sci U S A Date: 2010-02-01 Impact factor: 11.205
Authors: Menno P Creyghton; Styliani Markoulaki; Stuart S Levine; Jacob Hanna; Michael A Lodato; Ky Sha; Richard A Young; Rudolf Jaenisch; Laurie A Boyer Journal: Cell Date: 2008-11-06 Impact factor: 41.582
Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis Journal: Nature Date: 2015-02-19 Impact factor: 69.504
Authors: Anthony Mathelier; Oriol Fornes; David J Arenillas; Chih-Yu Chen; Grégoire Denay; Jessica Lee; Wenqiang Shi; Casper Shyr; Ge Tan; Rebecca Worsley-Hunt; Allen W Zhang; François Parcy; Boris Lenhard; Albin Sandelin; Wyeth W Wasserman Journal: Nucleic Acids Res Date: 2015-11-03 Impact factor: 16.971
Authors: Jonathan D Rubin; Jacob T Stanley; Rutendo F Sigauke; Cecilia B Levandowski; Zachary L Maas; Jessica Westfall; Dylan J Taatjes; Robin D Dowell Journal: Commun Biol Date: 2021-06-02