Hang Xu, Shijie Zhang, Xianfu Yi, Dariusz Plewczynski, Mulin Jun Li.
Abstract
Mechanisms underlying gene regulation are key to understanding how multicellular organisms with various cell types develop from the same genetic blueprint. Dynamic interactions between enhancers and genes have been revealed to play central roles in controlling gene transcription, but the determinants that link functional enhancer-promoter pairs remain elusive. A major challenge is the lack of reliable approaches to detect and verify functional enhancer-promoter interactions (EPIs). In this review, we summarize the current methods for detecting EPIs and describe how developing techniques facilitate EPI identification by assessing the merits and drawbacks of these methods. We also review recent state-of-the-art EPI prediction methods in terms of their rationale, data usage and characterization. Furthermore, we briefly discuss the evolved strategies for validating functional EPIs.
Keywords: Chromatin Conformation Capture; Chromatin loop; Computational method; Enhancer-promoter interaction; Machine learning; cis-Regulatory element
Year: 2020 PMID: 32226593 PMCID: PMC7090358 DOI: 10.1016/j.csbj.2020.02.013
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 Definition of functional EPIs. Functional EPIs require evidence from three aspects: (A) active status of enhancers and promoters; (B) spatial proximity between enhancer and promoter (though some studies have revealed exceptional cases); (C) context-dependent alteration of gene expression.
Fig. 2 Conventional workflow for detecting, predicting and validating functional EPIs. (A) Epigenomic features and nascent transcripts are the major characteristics of active CREs. (B) Functional EPIs require the enhancer and promoter to be spatially adjacent. (C) Candidate EPIs are routinely derived from the combination of active CREs and chromatin loops. (D) Computational methods are developed on candidate EPIs using either supervised or unsupervised algorithms. (E) Disrupting CREs and testing the effects on gene transcription are the main approaches to validating candidate EPIs.
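Step (C) of the workflow above, pairing active CREs that fall inside the two anchors of a chromatin loop, can be sketched with simple interval arithmetic. This is a minimal illustration, not the pipeline of any specific tool from this review; all coordinates and element names are invented.

```python
# Minimal sketch of deriving candidate EPIs: pair an enhancer and a
# promoter whenever their intervals sit in opposite anchors of a loop.
# Coordinates and names are illustrative, not taken from the review.

def overlaps(interval, anchor):
    """True if a CRE interval (start, end) overlaps a loop anchor."""
    return interval[0] < anchor[1] and anchor[0] < interval[1]

def candidate_epis(enhancers, promoters, loops):
    """Enumerate enhancer-promoter pairs bridged by a chromatin loop."""
    pairs = []
    for anchor_a, anchor_b in loops:
        for e_name, e_iv in enhancers.items():
            for p_name, p_iv in promoters.items():
                if (overlaps(e_iv, anchor_a) and overlaps(p_iv, anchor_b)) or \
                   (overlaps(e_iv, anchor_b) and overlaps(p_iv, anchor_a)):
                    pairs.append((e_name, p_name))
    return pairs

# Toy data: one loop whose anchors contain enhancer E1 and promoter P1.
enhancers = {"E1": (1000, 1500), "E2": (9000, 9400)}
promoters = {"P1": (52000, 52500)}
loops = [((900, 1600), (51800, 52600))]
print(candidate_epis(enhancers, promoters, loops))  # [('E1', 'P1')]
```

Real pipelines would operate per chromosome and use interval trees for speed, but the pairing logic is the same.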
Computational methods for EPI prediction.
| Tool | Year | Method category | Features | Algorithm | Links |
|---|---|---|---|---|---|
| Ernst et al. | 2011 | Correlation-based | Histone marks, TF binding | Pearson’s Correlation | |
| Thurman et al. | 2012 | Correlation-based | DHS | Pearson’s Correlation | |
| DRE-target | 2013 | Correlation-based | DHS, Sequence homology | Pearson’s Correlation | |
| Andersson et al. | 2014 | Correlation-based | CAGE | Pearson’s Correlation | |
| PreSTIGE | 2014 | Distance-based | Distance, Insulator | Linear Domain Models | |
| gkm-SVM | 2014 | Train Classifier | DNA | Support Vector Machine | |
| IM-PET | 2014 | Train Classifier | Histone marks, TF binding, DNA, RNA-seq | Random Forest | |
| ELMER | 2015 | Correlation-based | DNA methylation, RNA-seq | Pearson’s Correlation | |
| RIPPLE | 2015 | Train Classifier | Histone marks, TF binding, DHS, DNA-seq | Random Forest | |
| PEGASUS | 2015 | Correlation-based | Conservation | Linkage Scoring | |
| Basset | 2016 | Train Classifier | DNA | CNN | |
| TargetFinder | 2016 | Train Classifier | Histone marks, TF binding, DHS, CAGE | Gradient Tree Boosting | |
| PETModule | 2016 | Train Classifier | Histone marks, Conservation, Motif | Random Forest | |
| EpiTensor | 2016 | Decomposition-based | Histone marks | Tensor Decomposition | |
| JEME | 2017 | Regression-based | Histone marks, DHS, DNA methylation, eRNA | Linear Regression | |
| McEnhancer | 2017 | Train Classifier | DHS | Markov Chain Model | |
| SWIPE-NMF | 2017 | Decomposition-based | eQTL, DHS | Matrix Factorization | |
| EPIANN | 2017 | Train Classifier | DNA | CNN + Attention Model | |
| PEP | 2017 | Train Classifier | DNA | Gradient Tree Boosting | |
| CISD | 2017 | Train Classifier | MNase-seq | Logistic Regression | |
| FOCS | 2018 | Regression-based | DHS, CAGE, GRO-seq | Linear Regression | |
| Cicero | 2018 | Correlation-based | scATAC-seq | Graphical Lasso | |
| TransDecomp | 2018 | Decomposition-based | CAGE | Decomposition | |
| Rambutan | 2018 | Train Classifier | DNA, DHS | CNN | |
| SPEID | 2018 | Train Classifier | DNA | CNN | |
| 3DEpiLoop | 2018 | Train Classifier | Histone marks, TF binding | Random Forest | |
| EP2vec | 2018 | Train Classifier | DNA | Word2vec + Gradient Boosted Regression Trees | |
| DeepTACT | 2019 | Train Classifier | DHS, DNA | CNN + Attention Model | |
| C3D | 2019 | Correlation-based | DHS | Pearson’s Correlation | |
| EPIP | 2019 | Train Classifier | DHS, Histone marks | Adaboost | |
| DRAGON | 2019 | Polymer Simulation | Histone marks, TF binding | Maximum Entropy | |
| CHINN | 2019 | Train Classifier | DNA | CNN | |
| CT-FOCS | 2019 | Regression-based | DHS | Linear Mixed Effect Models | |
| HiC-Reg | 2019 | Regression-based | DHS, Histone marks, TF binding | Random Forests Regression | |
| ABC | 2019 | Distance-based | Distance, DHS, Histone marks | Activity-by-contact Model | |
| 3DPredictor | 2020 | Train Classifier | CAGE, CTCF | Gradient Boosting | |
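The ABC entry in the table scores each enhancer-gene pair as enhancer activity times enhancer-promoter contact, normalised over all candidate enhancers for that gene. A toy numerical sketch of that idea (the signal values below are invented, and real ABC uses DHS/H3K27ac activity and Hi-C contact frequencies):

```python
# Toy illustration of the activity-by-contact (ABC) idea:
# score_i = A_i * C_i / sum_j(A_j * C_j), where A is enhancer activity
# and C is contact frequency with the target promoter. Values invented.

def abc_scores(activities, contacts):
    """Normalised activity-by-contact score for each candidate enhancer."""
    products = {e: activities[e] * contacts[e] for e in activities}
    total = sum(products.values())
    return {e: p / total for e, p in products.items()}

activities = {"E1": 10.0, "E2": 2.0}   # e.g. DHS / H3K27ac signal
contacts   = {"E1": 0.5,  "E2": 0.25}  # e.g. normalised Hi-C contact
scores = abc_scores(activities, contacts)
print(scores)  # E1 dominates: 5.0 / 5.5, roughly 0.91
```

The normalisation makes scores comparable across genes with different numbers of candidate enhancers.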
Fig. 3 Overview of computational methods for EPI prediction. Strategies to predict EPIs fall into two major categories, unsupervised learning and supervised learning. Unsupervised approaches include: (A) distance-based methods, which assign enhancers to the nearest genes (some methods additionally restrict the regulatory scope); (B) correlation-based methods, which detect EPIs from the high correlation of chromatin features between enhancer and promoter across a panel of samples; (C) decomposition-based methods, which decompose a feature matrix/tensor into subspaces that capture the spatial features of the genome and thus can be used to detect EPIs. Supervised approaches include: (D) train-classifier methods, which build machine learning classifiers to distinguish positive EPIs from randomly selected negative sets; (E) regression-based methods, which measure the relationship between gene activity and enhancers by estimating the regulatory potential of enhancers for a specific gene. ML: machine learning, DL: deep learning.
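The correlation-based strategy in panel (B) can be made concrete with a short sketch: an enhancer-promoter link is called when their chromatin signals co-vary across many samples. The signal vectors below are invented, and the 0.95 cutoff is an arbitrary illustration, not a threshold used by any tool in the table.

```python
import math

# Sketch of a correlation-based EPI call: compute Pearson's r between
# an enhancer's and a promoter's signal (e.g. DHS) across a sample
# panel, and link the pair if r exceeds a chosen cutoff. Data invented.

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# DHS-like signal across six samples (cell types / tissues).
enhancer_signal = [1.0, 3.0, 2.0, 8.0, 5.0, 7.0]
promoter_signal = [0.9, 3.2, 1.8, 7.5, 5.1, 6.8]  # tracks the enhancer
r = pearson(enhancer_signal, promoter_signal)
print(round(r, 3), "-> linked" if r > 0.95 else "-> not linked")
```

Methods in this category differ mainly in which signal they correlate (DHS, CAGE, DNA methylation, eRNA) and how they control for distance and multiple testing.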