
Seq-ing answers: Current data integration approaches to uncover mechanisms of transcriptional regulation.

Barbara Höllbacher^1,2,3, Kinga Balázs^1, Matthias Heinig^2,3, N Henriette Uhlenhaut^1,4.

Abstract

Advancements in the field of next generation sequencing lead to the generation of ever-more data, with the challenge often being how to combine and reconcile results from different OMICs studies such as genome, epigenome and transcriptome. Here we provide an overview of the standard processing pipelines for ChIP-seq and RNA-seq as well as common downstream analyses. We describe popular multi-omics data integration approaches used to identify target genes and co-factors, and we discuss how machine learning techniques may predict transcriptional regulators and gene expression.


Keywords: ChIP-seq; Data integration; Multi-omics; NGS; RNA-seq; Transcriptional regulation

Year: 2020 PMID: 32612756 PMCID: PMC7306512 DOI: 10.1016/j.csbj.2020.05.018

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Within an organism, all cells contain the same genome, but have vastly different roles. These tissue and cell type specific functions are largely conferred by transcriptional regulators that control gene expression and thereby define cell identity. Transcriptional regulators include trans-acting factors, such as transcription factors (TFs), cis-regulatory elements (promoters and enhancers), as well as the chromatin structure (DNA-accessibility, nucleosome structures and chromatin looping) and epigenetic marks (histone modifications and DNA methylation). Specific elements in this gene regulatory machinery can be studied by different genome-wide analyses (Fig. 1). Chromatin immunoprecipitation followed by sequencing (ChIP-seq) [1] has become the method of choice to explore protein-DNA interactions such as TF binding and histone modifications (HMs). Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) [2] measures genome-wide chromatin accessibility, and RNA-sequencing (RNA-seq) [3], [4] identifies the transcriptome. Additionally, Hi-C [5], Capture-C [6] and other methods analyze the 3-dimensional chromosome structure by capturing chromatin interactions.

Fig. 1

Schematic representation of the transcriptional machinery. Cis-regulatory elements (enhancers or promoters), trans-regulatory elements (transcription factors) as well as epigenetic modifications and 3D chromatin structure are known to influence gene expression. TAD: Topologically associating domain.

Epigenetic modifications [7], [8] are a fundamental network controlling transcriptional outcomes. Since 2003, the Encyclopedia of DNA Elements (ENCODE) consortium [9] has systematically built a compendium of functional elements in the human genome. ENCODE also performs data curation and offers standardized processing pipelines [1] for various assay types online (https://www.encodeproject.org/), with regular updates. ENCODE includes thousands of datasets on gene expression (RNA-seq, Cap Analysis of Gene Expression (CAGE) and RNA-PET), ChIP-seq (TF binding, HMs) and chromatin accessibility (ATAC-seq, DNase-seq) from several cell types [10]. Within the cancer research field, the Cancer Genome Atlas [11] offers a vast collection of genetic, epigenetic, transcriptional and proteomics data on 33 different cancer types, which can be accessed through the Genomic Data Commons Data Portal [12]. The Roadmap Epigenomics project [13] and the BLUEPRINT project [14] are further large-scale undertakings that systematically collect data to characterize the human epigenome. Their datasets can be accessed through the IHEC [15] data portal.

With ever more data being generated (current high-throughput systems can sequence up to 6000 gigabases per run), the bottleneck has shifted from data generation towards data analysis, posing new challenges for bioinformaticians. In this review, we provide an overview of the standard processing pipelines for ChIP-seq and RNA-seq as well as common downstream analyses. Furthermore, we discuss popular approaches for data integration and point out shortcomings along the way. Specifically, we show how ChIP-seq and RNA-seq data can be used to identify the target genes of a TF as well as coregulators for transcription, and we review methods that leverage chromatin assays to predict gene expression. Finally, we discuss how new developments in the field of machine learning contribute to the understanding of gene regulation.

Experimental design

General experimental considerations

ChIP-seq experiments assess the interactions of a protein of interest (such as TFs or modified histones) with DNA on a genome-wide level [1]. Depending on the samples submitted for sequencing, this can answer different questions. The classic experiment is to determine the interactions within a certain cell-type at steady state. More often however, it is of interest how these interactions change in response to a perturbation. Changing the expression level of a gene through overexpression, knock-down or knock-out experiments, can lead to changes in DNA binding of molecularly connected factors. Comparing sequencing results of these samples with baseline data can reveal new insights on the relationship between these components. Similarly, introducing a treatment condition that changes the levels of the TF itself is used to determine the target genes by comparing DNA binding in the treatment condition with the steady state. Furthermore, mutating either the gene for the TF itself, or the DNA sequence it binds to, can validate putative targets with additional wet lab experiments. The same steady state and/or perturbed samples assayed in ChIP-seq, can also be submitted for RNA-seq, with the readout being the effect on gene expression. The advantage of combining RNA-seq and ChIP-seq in the same experiment is to link a change in occupancy with a change in transcription, which allows inference of which peaks are functional binding sites. In this review, we will discuss a number of methods that combine multiple ChIP-seq datasets and/or RNA-seq data to answer this and additional questions.

ChIP-seq specific considerations

The quality of ChIP-seq results is dependent on the specificity and the sensitivity of the chosen antibody. These factors should be taken into consideration when comparing data generated with different antibodies or the same antibody in different samples [1]. “Hyper-chippable” regions [16], GC rich regions [17] and non-random fragmentation [18] can introduce various biases or background. Therefore, “input controls” or “IgG controls” are crucial to accurately identify ‘real’ peak signals. To ensure reproducibility of the results, it is recommended to submit biological replicates of the samples for sequencing. In most cases, two replicates can be sufficient and little information is gained by further increasing the number of replicates [19]. Ranking peaks and comparing them between replicates can then be used to assess the agreement of the results and to determine their irreproducible discovery rate (IDR). Analogous to the concept of FDR, setting the IDR to be no bigger than a predefined significance level α can control for the rate of irreproducible peaks [20].

RNA-seq specific considerations

In contrast to ChIP-seq, the number of replicates in RNA-seq is very important for the detection of differentially expressed genes. As resources are limited, a thorough experimental design also includes decisions on sample sizes and on technical parameters, such as read depth [21], [22], [23]. Power analysis can be used to determine a study's optimal sample size and to gauge its impact on the test to be performed. Because the most reliable differential expression methods, DESeq2 and edgeR [24], share the statistical assumption of negative binomially distributed counts, power analysis for RNA-seq studies is based on the theory of negative binomial count regression [25], [26]. The required sample size also depends on biological heterogeneity and, importantly, on the minimum fold change that should be detectable between the conditions at the given significance level. Various approaches, including simulation-based models [24], [27], are compared and benchmarked in [28].

Data processing

Systematic literature search

By performing a systematic literature search, we investigated how most research groups approach data integration and whether a specific tool or strategy was taking hold in the scientific community. Gene Expression Omnibus (GEO) is an online database hosted by the National Center for Biotechnology Information (NCBI), archiving microarray and next generation sequencing (NGS) genomics data. We used the package GEOmetadb (1.44.0) within R version 3.5.2 to query all submitted entries matching the Dataset types “Expression profiling by high throughput sequencing” and “Genome binding/occupancy profiling by high throughput sequencing” performed in humans or mice. After filtering for those entries linked to Pubmed IDs, we checked which publications submitted both expression and genome binding data. Out of 4377 Pubmed IDs, 346 included datasets of both assay types (Fig. 2A). Quantifying which references those 346 studies shared revealed a number of frequently cited peak calling algorithms, read alignment tools and gene set enrichment approaches, to which we will refer in the corresponding sections. Importantly, no tool designed to integrate RNA-seq and ChIP-seq data came up in our search. Hence, although genome occupancy profiling and gene expression profiling are frequently employed in the same project, no specialized tools for integrating their results have established themselves.
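The overlap step of such a search can be sketched in a few lines of Python. This is only an illustration of the set logic behind the Venn diagram, not the GEOmetadb query used for the review; the PubMed IDs and the function name `split_by_assay` are made-up placeholders.

```python
# Toy illustration only: these PubMed IDs are made-up placeholders,
# not entries retrieved from GEO.
expression_pmids = {"1001", "1002", "1003", "1004"}
binding_pmids = {"1003", "1004", "1005"}


def split_by_assay(expression, binding):
    """Split PubMed IDs into expression-only, both assay types, and binding-only."""
    both = expression & binding
    return expression - both, both, binding - both


expr_only, both, bind_only = split_by_assay(expression_pmids, binding_pmids)
# The three set sizes correspond to the three areas of a two-set Venn diagram.
print(len(expr_only), len(both), len(bind_only))
```

The publications in the intersection are the ones whose reference lists would then be compared.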

Fig. 2

Systematic literature search on publications combining gene expression and DNA binding data. (A) Numbers of Pubmed IDs associated with RNA-seq and ChIP-Seq data submissions (retrieved on 01/22/2020). (B) Top 20 most commonly referenced citations from publications in the intersection of the Venn diagram shown in A. PMID: Pubmed ID.

ChIP-seq

Covalent modifications of histone tails are essential determinants of nucleosome positioning and gene regulation [29], [30]. Different types of HMs such as acetylation, phosphorylation, methylation or ubiquitination, can change the interaction strength of DNA with histones, which in turn influences transcription. Specific epigenetic marks are associated with gene activation, others with repression [31]. ChIP-seq offers a way to investigate HMs as well as interactions of TFs with their DNA binding sites. To identify the genomic sequences a transcription factor is binding to, crosslinked chromatin is fragmented, and an antibody specific to the target protein is used to purify the DNA-protein complex by immunoprecipitation. After de-crosslinking, the DNA is purified and prepared for NGS. Similarly, antibodies directed against various histone residues can be employed. Recent advances have further improved the technique by significantly increasing resolution and reducing background noise, as in ChIPexo [32] or CUT&Tag [33], for example.

Preprocessing & read alignment

Upon successful completion of such an NGS experiment, the read quality needs to be assessed by checking the quality of the base-calls, the duplication rate, GC content and adapter content. The tool FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) evaluates these and additional criteria, and returns an overview of the sample metrics. Depending on the results, removal of adapter sequences [34] and removal of low quality bases by read trimming might be desirable before mapping the reads to the reference genome. In contrast to RNA-seq, aligners for ChIP-seq reads do not need to be splice-aware, since the reads derive from genomic DNA and therefore do not span exon boundaries. Commonly used tools include BWA [35], Bowtie and its successor Bowtie2 [36] (Fig. 2B).

Peak calling

Most peak-calling algorithms have been developed for TF binding data and consequently were optimized for narrow peaks. A few HMs (such as H3K4me3) also fall into this category. The most commonly used peak caller (Fig. 2B) is the second version of Model-based Analysis of ChIP-seq data (MACS) [37]. MACS2 accounts for local biases by using a dynamic Poisson distribution when determining the fold enrichment during peak calling. On the other hand, most histone modifications or DNA methylation patterns show broad enrichments without clear peaks, so-called domains. Methods such as histoneHMM [38] specifically identify enriched domains, and some tools that were developed for narrow peaks offer parameter adjustments to accommodate domain calling (e.g. MACS2). MACS2 is a reliable choice for TF binding data [39], but Bayesian Change-Point (BCP) [40] and MUltiScale enrichment Calling for ChIP-seq (MUSIC) [41] slightly outperform it when calling broad peaks. For methods with higher signal-to-noise ratios such as CUT&RUN or CUT&Tag, standard peak callers may generate high false-positive rates, making specialized tools like SEACR [42] more appropriate. For an overview of statistical methods and their underlying models, see Supplementary Table 1. Quality metrics after mapping and peak calling include the percentage of mappable reads, the library complexity, percentage of reads in peaks and strand cross-correlation ([1], [43]). As discussed in section 2.2, the robustness of the results can further be assessed using IDR [20].
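The idea of a dynamic Poisson background can be illustrated with a minimal Python sketch. It assumes, as we understand the MACS model, that the local rate is the maximum of a genome-wide background rate and rates estimated from control reads in windows around the candidate peak. The function names and the example numbers are hypothetical, all rates are assumed to be already scaled to the width of the candidate region, and this is not the MACS2 implementation.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def local_lambda(genome_rate, window_rates):
    """Pick the most conservative (largest) expected background count."""
    return max([genome_rate] + list(window_rates))


def peak_p_value(chip_count, genome_rate, window_rates):
    """Poisson tail probability of the ChIP read count under the local background."""
    return poisson_sf(chip_count, local_lambda(genome_rate, window_rates))


# Hypothetical candidate region: 10 ChIP reads, genome-wide expectation of 2 reads,
# and control-derived expectations from three surrounding windows.
print(peak_p_value(10, 2.0, [1.5, 6.0, 3.0]))
```

Because the largest rate is used, a region with locally elevated control signal receives a larger p-value than it would under the genome-wide background alone.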

Differential binding analysis

Experimental questions answered by ChIP-seq may include qualitative or quantitative comparisons of multiple samples, i.e. whether the same peaks are present in different conditions or whether the strength of the peak signal differs. Immunoprecipitation efficiencies can vary between samples, potentially influencing the fraction of reads in peaks. Together with the signal-to-noise ratio, these factors may affect differential binding analyses [44]. Tools to determine differential binding use alternative approaches to model the data, each with their own strengths and weaknesses [45]. Some Bioconductor/R packages such as edgeR [46] and DESeq2 [47] are routinely used in RNA-seq analysis pipelines (see section 2.3.4). Others, such as csaw [48] or DiffBind (https://bioconductor.org/packages/release/bioc/html/DiffBind.html), make use of those packages in workflows specifically developed for ChIP-seq. Another popular tool called MAnorm [49] is based on the assumption that the peaks shared between samples do not differ globally; it fits a robust regression to these shared peaks, extrapolates the fit to all peaks and uses it for normalization. In histoneHMM [38], differential binding is formulated as an unsupervised classification problem and analyzed using a bivariate Hidden Markov Model (HMM). If the binding landscape changes profoundly, the assumptions underlying these approaches do not hold true. Alternative approaches use experimental spike-ins, i.e. chromatin from a different organism, during the ChIP. The reads derived from this reference can then be used for normalization. In CUT&Tag, the small amounts of E. coli DNA remaining after transposase production suffice as spike-in substitutes.
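The normalization idea behind MAnorm can be sketched as follows. This is a simplified illustration under stated assumptions: it uses an ordinary least squares fit in place of the robust regression named above, it expects strictly positive read counts, and the function names and toy counts are hypothetical.

```python
import math


def ma_values(count_a, count_b):
    """Return (M, A): log2 ratio and mean log2 intensity of two positive read counts."""
    m = math.log2(count_a / count_b)
    a = 0.5 * math.log2(count_a * count_b)
    return m, a


def fit_line(points):
    """Ordinary least squares fit of M = intercept + slope * A.

    Swapped in for the robust regression of the original method; requires
    at least two distinct A values.
    """
    n = len(points)
    mean_m = sum(m for m, _ in points) / n
    mean_a = sum(a for _, a in points) / n
    sxx = sum((a - mean_a) ** 2 for _, a in points)
    sxy = sum((a - mean_a) * (m - mean_m) for m, a in points)
    slope = sxy / sxx
    return mean_m - slope * mean_a, slope


def normalize(all_peaks, shared_peaks):
    """Fit on shared peaks, then subtract the fitted trend from M for all peaks."""
    intercept, slope = fit_line([ma_values(a, b) for a, b in shared_peaks])
    normalized = []
    for count_a, count_b in all_peaks:
        m, a = ma_values(count_a, count_b)
        normalized.append(m - (intercept + slope * a))
    return normalized


# Toy read counts (sample A, sample B) per peak: sample A is sequenced twice as deep.
shared = [(20, 10), (40, 20), (80, 40)]
print(normalize(shared + [(40, 10)], shared))
```

In this toy case the shared peaks end up with a normalized log2 ratio close to zero, while the last peak retains a residual difference that is not explained by the global trend.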

Peak annotation

For the scientist interpreting ChIP-seq results within their biological context, the positional information of putative cis-regulatory regions needs to be linked to genetic functions. An intuitive approach is to visually inspect the processed ChIP-seq data on a genome browser, such as Integrative Genomics Viewer [50] or the University of California, Santa Cruz Genome Browser [51]. The data can then be parsed in conjunction with publicly available datasets such as DNase hypersensitivity, HMs, single nucleotide polymorphisms, tissue specific gene expression etc. However, this strategy does not benefit from the myriad of tools designed to identify global patterns. While identifying the target genes of a TF is one prime objective of ChIP-seq experiments, the fact that most peaks are not promoter proximal impedes this task. Linear proximity to the closest transcription start site is often used to identify putative target genes for a given TF peak. For example, GREAT [52] allows the user to pick from a number of association rules that assign genomic regions to their target genes. Bioconductor/R packages such as ChIPpeakAnno [53] and ChIPseeker [54] annotate large quantities of peaks simultaneously and visualize the peak distribution within certain genomic features. One obvious shortcoming of this approach is that the three dimensional character of chromatin is discounted. For instance, distal cis-regulatory elements can physically interact with promoter regions by DNA loop formation, bringing distant regions into close spatial contact [55]. Recent studies on the principles of phase separation have revealed a surprising complexity of 3D chromatin dynamics, which are currently challenging to study [56]. New NGS methods such as Hi-C [5] assess genome-wide chromatin interactions and should be considered when assigning peaks to their potential targets. However, Hi-C currently lacks the resolution to go beyond topologically associating domains. Promoter-capture Hi-C [57] overcomes this shortcoming, but it only detects the proximity of genomic regions, which may not reflect functional interactions, a technical limitation shared by all ligation-based assays.
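The linear-proximity rule can be illustrated with a short Python sketch. It is a deliberately minimal version that assumes a single chromosome, ignores strand, and uses made-up gene names and coordinates; it does not reproduce the association rules of GREAT, ChIPpeakAnno or ChIPseeker.

```python
import bisect


def nearest_tss(peak_summit, tss_positions, tss_genes):
    """Return (gene, signed distance) for the TSS closest to a peak summit.

    tss_positions must be sorted in ascending order and non-empty;
    tss_genes holds the gene name for each position.
    """
    i = bisect.bisect_left(tss_positions, peak_summit)
    # Only the TSS directly left and right of the summit can be the closest one.
    candidates = []
    if i > 0:
        candidates.append(i - 1)
    if i < len(tss_positions):
        candidates.append(i)
    best = min(candidates, key=lambda j: abs(tss_positions[j] - peak_summit))
    return tss_genes[best], peak_summit - tss_positions[best]


# Hypothetical annotation on one chromosome.
positions = [1000, 5000, 20000]
genes = ["geneA", "geneB", "geneC"]
for summit in (500, 5400, 12000):
    print(summit, nearest_tss(summit, positions, genes))
```

The peak at position 12000 is assigned to a gene 7 kb away purely because it is the closest one on the linear genome, which is exactly the kind of assignment that chromatin conformation data may contradict.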

RNA-seq

With the advent of RNA-seq, or whole transcriptome shotgun sequencing, it became possible to screen the entire transcriptome of any organism or even single cells by NGS. Transcriptome analysis consists of the quantification of all kinds of transcripts (mRNA, microRNA, noncoding RNAs etc.), differential expression analysis, de novo transcript assembly as well as determining the transcriptional structures of genes [58], [59]. RNA-seq identifies and quantifies RNA species at a given time point (as RNA abundance is not stable over time) in biological samples. Experimentally, the RNA is extracted, randomly fragmented and reverse transcribed into cDNA with adaptors attached to one or both ends. After PCR amplification and sequencing, the raw data consists of a list of reads with associated quality scores for each sample, which are then subjected to RNA-seq data analysis. Here we focus only on the application of RNA-seq for differential gene expression analysis and we briefly summarize the most common necessary steps (Fig. 3).

Fig. 3

Standard processing workflow of ChIP-seq and RNA-seq. In both cases, the quality of the sequenced reads is checked before performing the alignment. The ChIP-seq data analysis continues with peak calling, followed by differential binding analysis. Searching for motifs in the peak regions and peak annotation are crucial steps. For RNA-seq, the aligned reads are quantified at gene level, the raw counts are then filtered and normalized to enable further comparisons. The differential expression analysis provides a list of significant genes, from which biological meaning may be retrieved. QC: Quality control, DE: differential expression.

Preprocessing

The steps for preprocessing raw data are comparable to those of ChIP-seq experiments (see section 3.2.1). The downstream analysis essentially consists of mapping, quantification, filtering and normalization, detection of differentially expressed genes and finally the biological interpretation of the results.

Read mapping

The process of assigning reads to their best matching location in the reference is referred to as mapping. Fragments can either be mapped to a reference transcriptome or genome. In the former case, all isoforms of a gene are considered separately, whereas in the latter, reads are aligned to the underlying genes, regardless of which isoform the read stems from [60]. The most popular splice-aware alignment tools that rely on a reference genome are STAR [61], TopHat [62], TopHat2 [63], and Bowtie2 [36] (Fig. 2B). In the case of mapping to a transcriptome, popular and efficient alignment-free tools quantify the transcripts directly, for example Kallisto [64] and Salmon [65]. Their quantification is based on k-mers, i.e. they fragment the reads into all possible k-mers and then map only the unique ones to the pre-indexed transcriptome. Multi-mappers (i.e. reads mapping to multiple locations) represent a significant fraction of mapped reads and are bioinformatically challenging. The simplest approach is to discard ambiguously mapped reads and keep only uniquely mapped ones. Another option is to keep all matches, so that the number of mapped reads exceeds the number of raw reads. It is also possible to use a scoring function to find the best possible alignment, and in case of equal scores to distribute the reads randomly between loci. There is also the option to allocate ambiguous reads in relative proportion according to probabilistic inference, for example in RSEM [66] and TopHat [62]. The latter strategy might be the most applicable, as it appears to produce the least bias in inferring differential gene expression [67].
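The k-mer idea can be made concrete with a small Python sketch. It is a strongly simplified illustration: the transcript sequences, names and the choice of k = 4 are made up, and intersecting the transcript sets of all k-mers of a read is our simplification of how alignment-free tools narrow down compatible transcripts, not the actual algorithm of Kallisto or Salmon.

```python
def kmers(read, k):
    """Return all overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers of a read."""
    result = None
    for km in kmers(read, k):
        hits = index.get(km, set())
        result = hits if result is None else result & hits
    return result or set()


# Hypothetical transcriptome with two isoforms sharing their first six bases.
transcripts = {"tx1": "ACGTACGGA", "tx2": "ACGTACCTT"}
index = kmer_index(transcripts, 4)
print(compatible_transcripts("GTACGG", index, 4))  # read spanning the tx1-specific part
print(compatible_transcripts("ACGTAC", index, 4))  # read from the shared part
```

The second read is compatible with both isoforms and illustrates the multi-mapping problem: its assignment has to be resolved by one of the strategies listed above.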

Gene or transcript level quantification

The counting and clustering of reads can be performed over different genomic features, such as transcripts or genes. The most common approach is to estimate gene level abundances by counting the number of reads/fragments overlapping the exons of the gene. However, even for the best annotated human or mouse data, a significant fraction of the reads will map outside annotated exons [68]. Widely used quantification tools are Cufflinks [69], featureCounts [70], kallisto [64] and Salmon [65]. While featureCounts is an exon-based approach, kallisto and Salmon are transcript-based approaches, which rely on an expectation-maximization algorithm for estimating transcript abundances. In either case, the final output is a matrix of read/fragment counts, where each row corresponds to a feature of interest, while the columns represent the different samples.

Filtering and normalization

Importantly, the choice of normalization method has a bigger impact on the results than the mapping method or the test statistics used for finding differentially expressed genes [71], [72]. There are two types of normalization to account for biological or technical bias: within-sample and between-sample normalization. In the first case, comparisons between the features of a single sample are enabled by correcting for gene length and sequence composition, for example GC-content [73]. In the second case, for comparisons of features across samples, normalization is performed to adjust for the library size [74], [75]. As a filtering step, genes with zero or low counts are omitted from the count table. Of note, when correcting for sequencing depth, the assumption is that the total expression is similar under different conditions, so each condition is assumed to have the same amount of mRNA per cell [76]. In total count normalization, each read count is divided by the sum of the reads of the sample [77]. The RPKM method (reads per kilobase per million mapped reads) is based on total count normalization, but also accounts for the length of the gene [78]. Other very popular methods rely on capturing information from non-changing genes. For example, the Trimmed Mean of the M-values approach implemented in the edgeR package assumes that the majority of genes are not differentially expressed and excludes those that are differential from the normalization factor [79]. It selects a reference sample, computes log count ratios relative to it, trims the genes with the most extreme ratios, and uses the mean of the remaining ratios for normalizing read counts. The DESeq normalization [47] is similar, but for each gene it computes the ratio of a sample's count to the geometric mean of that gene's counts across all samples, and then uses the median of these ratios as the scaling factor for that sample.
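Two of these normalizations can be sketched in Python. The code is a minimal illustration with made-up counts: `rpkm` follows the definition given above, and `size_factors` implements the median-of-ratios idea behind the DESeq normalization as we understand it, skipping genes with a zero count in any sample. The function names are ours and the code is not taken from DESeq2 or edgeR.

```python
import math
import statistics


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)


def size_factors(counts):
    """Median-of-ratios size factors.

    counts is a list of per-gene rows with one column per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # the geometric mean is zero if any count is zero
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The median over genes is robust against a minority of changing genes.
    return [math.exp(statistics.median(r)) for r in log_ratios]


# Toy count matrix: rows are genes, columns are two samples;
# the second sample is sequenced twice as deep.
counts = [[10, 20], [100, 200], [30, 60], [0, 5]]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
print(rpkm(500, 2000, 10_000_000))
```

Dividing each column by its size factor puts both samples on a common scale, so that the first three genes no longer appear to differ between the samples.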

Differential gene expression

After normalization, Principal Component Analysis can be used for visual data inspection to detect and remove outlier samples [80], which would distort downstream analyses. Another way to visualize the results of the read normalization and check for outliers is by heatmaps. The R package ComplexHeatmap [81] offers highly customizable row and column annotations such as dendrograms, based on different distance functions. This way of including unsupervised clustering offers an intuitive way to interpret the overall similarity in expression across samples and genes.

Initially, RNA-seq count data were approximated with the Poisson distribution, under the assumption that reads follow a random sampling process [82], [83]. However, since the variance and mean of RNA-seq counts are not equal, the negative binomial distribution was found to be more adequate [84], [85]. The most popular approaches developed as a consequence include DESeq2 [86] and edgeR [46]. As mentioned in the previous subsection, DESeq and DESeq2 both assume a negative binomial distribution of the counts and have two parameters, the dispersion and the mean. The dispersion describes how much the variance (i.e. within-group variability) deviates from the mean ($\mathrm{Var}(K_{ij}) = \mu_{ij} + \alpha_i \mu_{ij}^2$, where $\alpha_i$ is the dispersion parameter) and it is estimated in three steps. First, with maximum likelihood, a dispersion value is estimated for every gene, then a curve model, as a function of the mean expression level, is fitted to these values. Finally, a dispersion value is assigned to every gene. In DESeq, this is computed as a function of the mean by fitting a smoothed curve to the observed values. In DESeq2, the dispersion value is assigned by using an empirical Bayes method to shrink the gene-wise dispersion estimates close to the fitted values. When comparing the distribution of counts between different groups, DESeq2 fits a generalized linear model (GLM) for each gene, as defined by the design matrix. The coefficients represent a log2 fold change in simple case-control experiments, but more complex relations can also be modeled. After the fit, a hypothesis test for differential expression is applied on the coefficient of interest, i.e. whether it is different from 0 (the no effect case). DESeq2 offers the use of the likelihood ratio test or the Wald test, which can test individual coefficients, as well as contrasting them.

The different edgeR variants also assume a negative binomial distribution. In edgeR classic, the quantile-adjusted conditional maximum likelihood is used to estimate the dispersions, conditioning on the total count of the particular gene [87]. Since edgeR classic can only be used for designs with a single factor, an exact test similar to Fisher's exact test can be constructed to test for differential expression [46]. The more advanced edgeR glm [88] and edgeR robust [89] use the Cox-Reid profile-adjusted likelihood to estimate the dispersions, and fit a GLM as in DESeq, followed by a likelihood ratio test for differential expression. To reduce the influence of outliers, edgeR robust assigns weights to observations based on their Pearson residual in the GLM fit.

To identify genes that change significantly in abundance across different samples and conditions, testing methods focus on evaluating the null hypothesis that there is no difference between conditions, i.e. the log fold-changes between cases and controls are exactly zero. A threshold of 5% on these p-values would limit the probability of a false positive in a single test, but one still needs to account for the large numbers of tests that are typically performed in parallel. Under the assumption that the null hypothesis is true for all genes, performing 20,000 tests at this threshold would be expected to yield about 1,000 false positives. To control for type I errors (i.e. incorrectly rejected null hypotheses), several methods controlling the family-wise error rate (i.e. the probability of making at least one type I error) exist. One of these, the Bonferroni correction [90], adjusts the significance threshold by dividing the significance level α by the number of performed tests. In practice, this correction is often too conservative, and instead of controlling the family-wise error rate, the rate of type I errors can be limited by false discovery rate (FDR) controlling procedures. The FDR is the fraction of false positives (falsely rejected null hypotheses) among all results that were declared significant (all rejected null hypotheses). Most commonly, the Benjamini-Hochberg procedure (BH step-up procedure) is implemented, which controls the FDR at a predefined level [91]. While adjusted p-values (i.e. q-values) are computed for each test, the interpretations of p-values and q-values are quite different. For p-values, a cutoff of 5% means that 5% of all tests will result in false positives, assuming that there are no differentially expressed genes (the null hypothesis is true). However, the same cutoff for q-values means that 5% of the significant tests are false positives (i.e. the rate of false discoveries is 5%). Both in edgeR and DESeq2, the p-values for each gene are adjusted for multiple testing, controlling for the false discovery rate according to the Benjamini-Hochberg procedure.

Calling a gene as being differentially expressed based on an FDR cutoff alone has the disadvantage of including results whose effect size, while being statistically significant due to the consistency of the result, is biologically insignificant. Hence an additional filter may be applied on the log2 fold change, at the risk of distorting the FDR statistics in the selected subset. Accordingly, the SEQC consortium [92] found that pipeline-dependent filters for p-value, fold-change and expression-level are necessary to reproduce results.
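The two multiple-testing corrections discussed above can be written down compactly. The following Python sketch is a plain implementation of the Bonferroni threshold and of Benjamini-Hochberg adjusted p-values; the function names and the example p-values are our own, and packages such as DESeq2 and edgeR apply the adjustment through their own routines.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value is the
    # minimum of p * n / rank over its own rank and all higher ranks.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of false positives for 20,000 true-null tests at a 5% cutoff.
print(20000 * 0.05)
print(bonferroni_threshold(0.05, 20000))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Genes whose adjusted value falls below the chosen FDR level, for example 0.05, would be called differentially expressed, optionally after an additional fold-change filter.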

Biological interpretation of the results

Once a gene set of interest has been defined, enrichment analyses can ascribe biological meaning. Gene Set Enrichment Analysis [93] is the most widely used tool (Fig. 2B) and checks for significant over- or underrepresentation of annotated gene sets, such as Gene Ontology terms [94], within provided lists. Also, DAVID [95], [96] is an online platform which functionally annotates and classifies genes. Other approaches determine overrepresentation of selected genes in metabolic pathways or map them to putative protein interaction networks. These analyses obviously depend on prior knowledge about those biological pathways. Gene lists can be mapped onto specific pathway diagrams, and statistically significant associations can be retrieved and visualized, for example using the Kyoto Encyclopedia of Genes and Genomes [97], Reactome [98] and WikiPathways [99]. Protein-protein interaction networks contribute to the system-level data interpretation. Known cellular interaction networks represent another source of information, since proteins that participate in the same biological process may be more likely to interact. Therefore, integrative interactomics aims to provide a similar view as pathway analyses, by exploiting large interactomes identified in model organisms [100], [101]. For example, differentially expressed genes can be mapped to protein-protein interaction data, and the functional clusters in the networks can then be determined. Important protein network databases include IntAct [102], STRING [103] and BioGRID [104].

Data integration

Jointly characterizing multiple omics might enable an in-depth understanding of the interplay between various cogs of the transcriptional machinery. Depending on the specific question, various flavors of data integration could be applied (Fig. 4).

Fig. 4

Data integration approaches. (A) ChIP-seq and RNA-seq data can be integrated in a discretized fashion by determining the overlap of significantly affected genes in the 2 assays. (B) Newer approaches combine ChIP-seq data from multiple TFs and HMs together with expression data and accessibility data such as DNase-seq and ATAC-seq. They achieve data integration through various mathematical concepts such as GLMs, HMMs and deep neural networks to identify co-regulators, predict gene expression or model TF binding. DE: differential expression, TF: transcription factor. This figure was created with BioRender (biorender.com).

Identifying coregulators

TF ChIP-seq can serve to identify co-factors through motif analysis, which takes a number of sequences as input and finds motifs (usually 8–16 bp in length) that are present more frequently than would be expected by chance [105]. In addition to the consensus motif expected for the TF targeted by the specific antibody, other binding sites for co-factors cross-talking with the protein of interest may be enriched. Furthermore, motif analyses can pinpoint the exact site within the ChIP peak that is occupied by the TF. ChIP peak lists can also first be narrowed down by integrating expression data before searching for distinct motifs associated with a defined transcriptional outcome. Exploring all possible solutions to find the highest ranking motifs is still challenging. The most commonly used tool, HOMER [106], determines enrichment using cumulative hypergeometric distributions. MEME-ChIP [107] applies expectation maximization and Discrover [108] uses discriminative learning based on Hidden Markov Models. Most tools perform de novo motif discovery as well as testing for the enrichment of known motifs, which are represented as position weight matrices (PWMs). Motif databases like JASPAR [109], Cis-BP [110] and HOCOMOCO [111] store PWMs and can be used by motif analysis tools to link the discovered sequences to known consensus motifs.
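
To illustrate how a PWM can be used to locate a motif within a peak sequence, the following minimal Python sketch converts a small count matrix into log-odds scores and slides it along a sequence. The count matrix, the pseudocount, the uniform background frequency and the example sequence are invented for illustration and are not taken from JASPAR or any other database. Only the forward strand is scanned and only the bases A, C, G and T are assumed to occur.

```python
import math

BASES = "ACGT"

def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm):
    """Return (best_score, best_offset) over all forward-strand windows."""
    width = len(pwm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

# Hypothetical count matrix for a 4 bp motif resembling "TGAC"
TOY_COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]

pwm = counts_to_pwm(TOY_COUNTS)
score, offset = scan("AAATGACAAA", pwm)  # the motif starts at offset 3
```

Dedicated tools additionally scan the reverse complement and compare motif frequencies against background sequences before reporting an enrichment.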

Identifying epigenetic cofactors

In addition to the profiling and functional characterization of individual histone marks, comprehensive models aim to combine several dozen HMs [112]. For example, a multivariate HMM trained on the combinatorial patterns of 38 different modifications together with RNA polymerase II, H2A.Z and CTCF ChIP-seq data was used to define “chromatin states” and to systematically annotate the genome at 200 bp resolution [113]. This approach of chromatin segmentation has since been implemented and expanded by the NIH Roadmap Epigenomics Consortium [13]. The Roadmap project integrated chromatin states with DNA methylation, DNA accessibility and RNA expression to create reference epigenomes for over 100 human cell types and tissues.
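
The idea of chromatin segmentation can be sketched with a toy HMM in Python. The example below assumes that each genomic bin carries a binary presence/absence call per mark, that marks are emitted independently given the state, and that all parameters are fixed by hand. The two state names, the probabilities and the observations are invented for illustration; published segmentation models learn their parameters from data and use many more marks and states.

```python
import math

def log_emission(observation, mark_probs):
    """Log-probability of a binary mark vector under independent Bernoulli emissions."""
    return sum(
        math.log(p if present else 1.0 - p)
        for present, p in zip(observation, mark_probs)
    )

def viterbi(observations, states, log_start, log_trans, emission_probs):
    """Return the most likely state path for a sequence of binary mark vectors."""
    scores = [{
        s: log_start[s] + log_emission(observations[0], emission_probs[s])
        for s in states
    }]
    back = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + log_trans[r][s])
            current[s] = (prev[best_prev] + log_trans[best_prev][s]
                          + log_emission(obs, emission_probs[s]))
            pointers[s] = best_prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical two-state model with two marks per bin
STATES = ["active", "quiescent"]
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "active": {"active": math.log(0.9), "quiescent": math.log(0.1)},
    "quiescent": {"active": math.log(0.1), "quiescent": math.log(0.9)},
}
EMISSION = {"active": [0.9, 0.8], "quiescent": [0.1, 0.1]}

bins = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0)]
segmentation = viterbi(bins, STATES, LOG_START, LOG_TRANS, EMISSION)
```

Here each tuple stands for one genomic bin, and the returned path assigns one state label per bin.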

Identifying target genes

Classical strategies to investigate the direct and indirect targets of a TF are gain and loss of function experiments or specific treatments in conjunction with controls. ChIP-seq and RNA-seq data of matched samples may first be processed separately according to their respective analysis standards, and then be combined in a discretized fashion. In order to obtain comparable results, ChIP-seq peaks are usually assigned to the nearest genes (see section 3.2.4). Then, one can determine whether the genes that are differentially expressed show concordant patterns of differential TF binding or epigenetic modifications. A prevalent approach to assess the similarity in changes across assays is to arrange those genes showing differential ChIP signals, and those being differentially expressed, in a contingency table and to test for overrepresentation with Fisher’s exact test. A common way to depict these numbers in publications is as a Venn diagram. The intersection, which represents genes that have differential ChIP signals and expression changes, can then further be displayed in a heatmap to visually inspect their expression pattern (Fig. 4A). The biggest shortcoming of this approach is that the results for both assays need to be binarized by setting an arbitrary threshold to split the data into significant and non-significant results. A possible approach to avoid arbitrary cutoffs when integrating the results of different experiments was proposed by Roider and colleagues [114]. It was originally developed for a combination of ChIP-chip and affinity data, but could be applied to combine p-values of ChIP-seq and expression data as well. This method transforms the results into ranked lists and systematically adjusts the threshold to find the optimal cutoff, yielding the most significant enrichment as measured by hypergeometric testing. Another way to avoid setting a hard p-value threshold on one of the datasets is to perform gene set testing. Here, the results of one platform are ranked according to a test statistic of choice, and the positions of the elements in a gene set on that ranked list, such as the significant hits of another platform, are determined [115]. For example, RNA-seq results can be ranked based on a test statistic representing the degree of differential expression between two samples (such as the t-value), and the genes with significant ChIP-seq peaks can be indexed on the ranked list. This can then be used to test whether the genes with ChIP-seq peaks tend to be more differentially expressed than genes without ChIP-seq peaks. In order to avoid setting thresholds altogether, the log2 fold changes of expression and peak intensities can be tested for correlation. Those genes that show alterations in both ChIP-seq and RNA-seq are likely direct or indirect targets of the TF. More formally, BETA [116] assigns a regulatory potential to each gene based on the number and proximity of TF binding sites to its transcription start site, and determines whether the TF is mainly an activator or a repressor in conjunction with the gene expression data. Direct targets are selected using a rank product between the RNA and ChIP data. Interestingly, BETA and Discrover [108] also return differential motifs, which again might identify co-regulators.
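
The discretized overlap test described above can be sketched as follows. This minimal Python example builds the 2x2 contingency table from two gene lists and computes a one-sided Fisher's exact test for overrepresentation. The gene identifiers, the list sizes and the choice of background universe are invented for illustration; in practice the universe would be the set of genes that could have been detected in both assays.

```python
from math import comb

def overlap_table(chip_genes, de_genes, universe):
    """Build the 2x2 table (a, b, c, d) of ChIP-bound versus differentially expressed genes."""
    background = set(universe)
    chip = set(chip_genes) & background
    de = set(de_genes) & background
    a = len(chip & de)               # bound and differentially expressed
    b = len(chip - de)               # bound only
    c = len(de - chip)               # differentially expressed only
    d = len(background) - a - b - c  # neither
    return a, b, c, d

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test: probability of an overlap of at least a."""
    row1 = a + b   # all bound genes
    col1 = a + c   # all differentially expressed genes
    n = a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)

# Hypothetical toy data: 20 genes, 6 with a differential peak, 5 differentially expressed
universe = [f"g{i}" for i in range(20)]
chip_genes = universe[:6]
de_genes = universe[:4] + ["g10"]

table = overlap_table(chip_genes, de_genes, universe)
p_value = fisher_exact_greater(*table)
```

Because both lists depend on the significance thresholds chosen upstream, the resulting p-value inherits the arbitrariness of those cutoffs, which is the shortcoming noted in the text.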

Predicting gene expression

It is still an open question whether TF binding strength (ChIP-seq) can be used to predict gene expression levels (RNA-seq). A study using the quantitative ChIP-seq signal of TFs around the transcription start site could explain 67% of the variation in CAGE data, i.e. nascent transcription, but performed poorly for total RNA [117]. In contrast, chromatin marks and histone modifications (for example H3K27ac) are more established predictors, with a small number of HMs at promoter regions being sufficient to correlate well with gene expression (Fig. 4B). It appears that the relationship between the chromatin landscape and RNA expression can be generalized across different cell types [118]. Finally, IMAGE pinpoints transcriptional regulators by utilizing PWMs to model the activity of a certain motif. This information is then used to infer causality by modelling the contribution of the motif to expression levels [119].
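
The notion of "explained variation" can be made concrete with a coefficient of determination. The following Python sketch fits an ordinary least-squares line with a single predictor and reports R². The signal and expression values are invented, and the single-predictor model is a simplifying assumption; the cited studies combine many features in more elaborate models.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical promoter signal (e.g. log-scaled HM coverage) and log expression per gene
promoter_signal = [0.0, 1.0, 2.0, 3.0, 4.0]
log_expression = [1.0, 2.9, 5.2, 6.8, 9.1]

slope, intercept = fit_line(promoter_signal, log_expression)
explained = r_squared(promoter_signal, log_expression, slope, intercept)
```

A value of `explained` close to 1 indicates that the predictor accounts for most of the variation, whereas a value near 0 indicates that it accounts for little.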

Predicting TF binding

Modelling cell-type specific gene expression on TF binding data remains difficult, as the available ChIP-seq datasets for any given cell type are still limited. In an attempt to predict gene expression with fewer assays, tools hinging on chromatin accessibility in combination with PWMs were developed.

Classical approaches

DNase-seq and ATAC-seq find cell type specific open regulatory regions in the genome, which are prone to DNase I and Tn5 activity, respectively [120], [121]. TF occupancy protects short sequences from these cleavage enzymes, causing dips in the accessibility signal. Matching the protected sequence of these footprints with known PWMs can identify the bound TF [122]. The presence of TF motifs within proximal and distal DNase I hypersensitive sites can be quantified and used to generate scores or footprints for regression models classifying tissue specific expression patterns [123], [124]. The CENTIPEDE [125] algorithm uses DNase-seq and HM data as prior information to predict TF binding with hierarchical mixture models. HINT [126] also uses accessibility data and HMs to calculate active TF binding sites based on HMMs. This algorithm was later extended by HINT-ATAC [127] to identify footprints in ATAC-seq data, while correcting for transposase specific artifacts. Continued interest in predicting in vivo TF binding for various tissue types sparked the ENCODE-DREAM challenge which now serves as a benchmarking study (https://www.synapse.org/#!Synapse:syn6131484/wiki/402031).
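
The footprint concept, a dip in cleavage signal flanked by accessible sequence, can be illustrated with a simple depletion score. The Python sketch below compares the mean cut count in the flanks with the mean cut count in a central window of motif width. This heuristic, the window sizes and the cut counts are assumptions made for illustration; it is not the model used by CENTIPEDE, HINT or HINT-ATAC, and it ignores enzyme sequence bias.

```python
def footprint_scores(cuts, motif_width, flank_width):
    """Return (start, score) for every window; a higher score means a deeper dip."""
    scores = []
    last_start = len(cuts) - motif_width - flank_width
    for start in range(flank_width, last_start + 1):
        left = cuts[start - flank_width:start]
        center = cuts[start:start + motif_width]
        right = cuts[start + motif_width:start + motif_width + flank_width]
        flank_mean = (sum(left) + sum(right)) / (2 * flank_width)
        center_mean = sum(center) / motif_width
        scores.append((start, flank_mean - center_mean))
    return scores

def best_footprint(cuts, motif_width, flank_width):
    """Return the (start, score) pair with the strongest central depletion."""
    return max(footprint_scores(cuts, motif_width, flank_width),
               key=lambda item: item[1])

# Hypothetical per-base cut counts with a protected stretch at positions 4 to 7
cut_counts = [8, 9, 10, 9, 1, 0, 1, 0, 9, 10, 8, 9]
start, depth = best_footprint(cut_counts, motif_width=4, flank_width=3)
```

The sequence under the best-scoring window could then be matched against known PWMs to propose the bound TF.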

Deep learning approaches

The availability of large amounts of training data and breakthroughs in high-performance computing, such as the use of graphics processing units (GPUs), have triggered a comeback of neural networks in the analysis of genomic data [128]. Two methods based on convolutional neural networks (CNNs) are DeepSEA [129] and DeepBind [130]. DeepSEA predicts chromatin features such as HMs, DNase I hypersensitive sites and TF binding sites, and calculates how sequence alterations can affect chromatin. DeepBind applies deep CNNs to predict both binding affinity from sequence and the binding in vivo (as measured by ChIP-seq). The performance of deep learning is evidenced by FactorNet placing in the top 3 of the DREAM challenge, predicting TF binding from DNA sequences. Its convolutional-recurrent neural networks can predict cell type specific TF binding [126], leveraging binding in a reference cell type and chromatin accessibility from the cell type of interest. Moreover, ExPecto [131] uses a deep CNN to predict HMs, TF binding and other transcriptional regulators from DNA sequence alone by training on ENCODE and Roadmap Epigenomics data. These features are then transformed and fed into a cell type-specific linear model to predict gene expression (whereas DeepSEA only predicted the effect of non-coding variants on chromatin). CNNs are also used in BPNet [132], which takes a DNA sequence as input and directly predicts ChIP-exo signals at single base resolution to elucidate how TF binding is influenced by the motif syntax. This way, no information is lost in the intermediary peak calling process which usually precedes motif discovery in a standard analysis pipeline, and the regulatory elements of multiple TFs can be assessed simultaneously.
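
The basic operation shared by sequence-based CNNs can be sketched in a few lines of Python: the DNA sequence is one-hot encoded, a convolutional filter is slid along it, and the activations are rectified and pooled. The filter weights, the bias and the sequence below are set by hand for illustration, whereas real networks learn many filters from training data; this is not the architecture of DeepSEA, DeepBind or any other published model.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode a DNA string as a list of 4-element indicator vectors (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]

def conv1d_relu(encoded, kernel, bias=0.0):
    """Slide one filter along the sequence and apply a ReLU to each activation."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for i in range(width):
            for j in range(4):
                total += encoded[start + i][j] * kernel[i][j]
        activations.append(max(0.0, total))
    return activations

def global_max_pool(activations):
    """Summarize the filter response by its strongest activation."""
    return max(activations)

# Hypothetical hand-set filter that rewards "GATA" (+1 per match, -1 per mismatch)
KERNEL = [
    [-1.0, -1.0, 1.0, -1.0],  # G
    [1.0, -1.0, -1.0, -1.0],  # A
    [-1.0, -1.0, -1.0, 1.0],  # T
    [1.0, -1.0, -1.0, -1.0],  # A
]
BIAS = -2.0

activations = conv1d_relu(one_hot("CCGATACC"), KERNEL, BIAS)
strongest = global_max_pool(activations)
```

A first-layer filter of this kind behaves much like a PWM scan, which is one reason why learned filters are often inspected as candidate motifs.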

Conclusions & outlook

Taken together, the integration of multi-omics data can contribute to decrypting transcriptional regulatory codes. With new techniques and data forms constantly emerging, novel data integration methods are evolving. Besides data analysis tools, databases support meaningful biological interpretation. Overall, the exact molecular mechanisms of TF binding, histone modifications and transcriptional regulation are far from understood. The field has moved from individual genes and factors towards a higher dimensional view, integrating epigenetic marks, distal regulatory elements and the 3D structure. Furthermore, live cell imaging coupled with single-cell RNA-seq is on the rise. Advancements in the development of experimental methods in combination with novel analysis tools hold great potential. As such, pooled CRISPR screening combined with single-cell RNA-seq is a powerful method to investigate distinct perturbations in thousands of individual cells. For example, Perturb-seq [133] couples gene inactivation using CRISPR with single-cell RNA-seq to study phenotypic alterations in parallel in many cells. scMAGeCK [134] is now able to find genes and enhancers which play a role in cell proliferation, simply by associating common proliferation markers. Despite the progress in bioinformatics, the identification of functional enhancer-promoter interactions remains challenging, and in silico predictions still require high-throughput experimental validation. While STARR-seq [135] (self-transcribing active regulatory region sequencing) creates genome-wide quantitative enhancer activity maps, the recently published enCRISPRa and enCRISPRi [136] epigenetic editing systems allow for functional interrogation of enhancers in situ and in vivo. To understand complex biological systems, specialized tools that merge different omics data sets, such as genomics, transcriptomics, metabolomics and proteomics, and that ultimately integrate more than transcriptional data alone, will yield unprecedented insights into the feature space of systems biology.

CRediT authorship contribution statement

Barbara Höllbacher: Conceptualization, Writing - original draft, Visualization, Formal analysis. Kinga Balázs: Conceptualization, Writing - original draft, Visualization. Matthias Heinig: Conceptualization, Writing - review & editing, Funding acquisition, Supervision. N. Henriette Uhlenhaut: Conceptualization, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

130 in total

1. Differential expression in RNA-seq: a matter of depth.

Authors: Sonia Tarazona; Fernando García-Alcalde; Joaquín Dopazo; Alberto Ferrer; Ana Conesa
Journal: Genome Res Date: 2011-09-08 Impact factor: 9.043

2. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution.

Authors: Ho Sung Rhee; B Franklin Pugh
Journal: Cell Date: 2011-12-09 Impact factor: 41.582

3. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

4. BLUEPRINT to decode the epigenetic signature written in blood.

Authors: David Adams; Lucia Altucci; Stylianos E Antonarakis; Juan Ballesteros; Stephan Beck; Adrian Bird; Christoph Bock; Bernhard Boehm; Elias Campo; Andrea Caricasole; Fredrik Dahl; Emmanouil T Dermitzakis; Tariq Enver; Manel Esteller; Xavier Estivill; Anne Ferguson-Smith; Jude Fitzgibbon; Paul Flicek; Claudia Giehl; Thomas Graf; Frank Grosveld; Roderic Guigo; Ivo Gut; Kristian Helin; Jonas Jarvius; Ralf Küppers; Hans Lehrach; Thomas Lengauer; Åke Lernmark; David Leslie; Markus Loeffler; Elizabeth Macintyre; Antonello Mai; Joost H A Martens; Saverio Minucci; Willem H Ouwehand; Pier Giuseppe Pelicci; Hèléne Pendeville; Bo Porse; Vardhman Rakyan; Wolf Reik; Martin Schrappe; Dirk Schübeler; Martin Seifert; Reiner Siebert; David Simmons; Nicole Soranzo; Salvatore Spicuglia; Michael Stratton; Hendrik G Stunnenberg; Amos Tanay; David Torrents; Alfonso Valencia; Edo Vellenga; Martin Vingron; Jörn Walter; Spike Willcocks
Journal: Nat Biotechnol Date: 2012-03-07 Impact factor: 54.908

5. Understanding mechanisms underlying human gene expression variation with RNA sequencing.

Authors: Joseph K Pickrell; John C Marioni; Athma A Pai; Jacob F Degner; Barbara E Engelhardt; Everlyne Nkadori; Jean-Baptiste Veyrieras; Matthew Stephens; Yoav Gilad; Jonathan K Pritchard
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

6. Power analysis and sample size estimation for RNA-Seq differential expression.

Authors: Travers Ching; Sijia Huang; Lana X Garmire
Journal: RNA Date: 2014-09-22 Impact factor: 4.942

7. Integrated analysis of motif activity and gene expression changes of transcription factors.

Authors: Jesper Grud Skat Madsen; Alexander Rauch; Elvira Laila Van Hauwaert; Søren Fisker Schmidt; Marc Winnefeld; Susanne Mandrup
Journal: Genome Res Date: 2017-12-12 Impact factor: 9.043

8. Mining biological pathways using WikiPathways web services.

Authors: Thomas Kelder; Alexander R Pico; Kristina Hanspers; Martijn P van Iersel; Chris Evelo; Bruce R Conklin
Journal: PLoS One Date: 2009-07-30 Impact factor: 3.240

9. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

10. Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-seq data.

Authors: Yanchun Bao; Veronica Vinciotti; Ernst Wit; Peter A C 't Hoen
Journal: BMC Bioinformatics Date: 2013-05-30 Impact factor: 3.169
