Predicting 3D chromatin interactions from DNA sequence using Deep Learning.

Robert S Piecyk¹, Luca Schlegel¹, Frank Johannes^1,2.

Abstract

Gene regulation in eukaryotes is profoundly shaped by the 3D organization of chromatin within the cell nucleus. Distal regulatory interactions between enhancers and their target genes are widespread, and many causal loci underlying heritable agricultural or clinical traits have been mapped to distal cis-regulatory elements. Dissecting the sequence features that mediate such distal interactions is key to understanding their underlying biology. Deep Learning (DL) models coupled with genome-wide 3C-based sequencing data have emerged as powerful tools to infer the DNA sequence grammar underlying such distal interactions. In this review we show that most DL models have remarkably high prediction accuracy, which indicates that DNA sequence features are important determinants of chromatin looping. However, DL model training has so far been limited to a small set of human cell lines, raising questions about the generalization of these predictions to other tissue types and species. Furthermore, we find that the model architecture seems less relevant for model performance than the training strategy and the data preparation step. Transfer learning, coupled with functionally curated interactions, appears to be the most promising approach to learn cell-type specific and possibly species-specific sequence features in future applications.

Keywords: 3D Chromatin Interaction; Chromosome conformation capture (3C); Deep Learning; Epigenetics; Genome folding

Year: 2022 PMID: 35832620 PMCID: PMC9271978 DOI: 10.1016/j.csbj.2022.06.047

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

The spatial organization of chromatin in the nucleus of animal and plant cells plays important roles in genome regulation. It has central functions in processes such as DNA replication and repair, the spatial and temporal patterning of gene expression, and the silencing of transposable elements (TEs). In the past two decades, chromosome conformation capture (3C) coupled with next-generation sequencing has emerged as a powerful method to interrogate 3D chromatin interactions in a high-throughput fashion [1]. Among these methods, Hi-C was the first to be developed [2]. It is designed to capture all chromatin interactions at the genome-wide scale, albeit at low resolution (see Fig. 1, Fig. 2, [2]). By contrast, more recent methods, such as chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) (see Fig. 2, [3]) or Hi-C coupled to ChIP-seq (HiChIP or PLAC-Seq) [4], [5], generate high-resolution interaction maps, but are restricted to specific loci occupied by proteins that can be pulled down by ChIP (e.g. modified histones, transcription factors, and RNA polymerase II). Together, these techniques have generated unprecedented insights into the function of 3D chromatin organization in mammalian and, more recently, plant genomes. They have led to the systematic identification of Enhancer-Promoter Interactions (EPIs), Insulator Loops (e.g. the CTCF-cohesin loop in human cells) and interactions mediated by specific transcription factors [6].

Fig. 1

Fig. 2

Training procedure of a typical CNN + LSTM model. (A) During input data preprocessing, chromatin sequences are translated using a one-hot encoding technique. (B) Schematic representation of a commonly used architecture in chromatin interaction detection. Two input anchors are processed separately by CNN blocks, which are defined by the number of convolutional and pooling layers and several hyperparameters. Once the architecture, the arrangement of the layers and the hyperparameters are selected, training starts from an initial unbiased parameter distribution. The concept of ’training’ refers to minimizing a certain loss function L with respect to the parameters λ, where L depends on the input values x, the parameter set and all non-linear functions. This representation is drastically simplified and highly abstract. Many additional decisions are necessary to define a full CNN model with LSTM units. (C) After training, we end up with a set of optimal parameters. This set of trained parameters, in combination with the model architecture, must be applied to a final test data set for validation before it is applied to completely new data.

(A), (B1) – (B5) Hi-C sequencing. A restriction enzyme buffer in combination with SDS solubilization enables access to open cross-linked chromatin and removes any other substances. Next, a type II restriction endonuclease digests the accessible chromatin. The HindIII enzyme recognizes and cleaves all 5’-AAGCTT-3’ sequences, and the resulting ends are filled in with biotin-14-dCTP. Dilute conditions favor proximity ligation, and the ligation products can be identified by their unique 5’-GCTAGC-3’ NheI site. These chromatin ligation products are treated with Proteinase K, followed by several DNA purification steps [44]. (B6), (C6) DNA samples are mapped to a given DNA library with the correct reference genome, including several quality-control steps. (A), (C1) - (C5) ChIA-PET sequencing. Formaldehyde stabilizes cross-linked DNA–protein complexes before sonication is used for fragmentation. ChIP is then applied using an antibody against the protein of interest. The precipitate, which is enriched with the fragmented chromatin complexes, is divided into two aliquots, each with a different half-linker oligonucleotide. Both samples are mixed, which enables proximal half-linker ligation and self-ligation. The restriction enzyme MmeI is added for digestion, and DNA fragments forming paired-end tags (PETs) are isolated. The final DNA sequences can be assigned uniquely by their ligation type. Self-ligated sequences are interpreted as chromatin loops spanning short distances, whereas mixed linkers refer to long-distance contacts, possibly between different chromosomes [49].

Numerous machine learning (ML) approaches have emerged in parallel to these technological developments [7]. Their general aim is to use 3C data as input to train sequence-based predictors of chromatin looping and to identify specific sequence features that may facilitate physical contacts between distal genomic regions. The most promising of such ML approaches are supervised Deep Learning (DL) methods. 
As in other areas of genome biology, DL methods provide the most accurate predictions, can handle large and complex amounts of genomic data and automatically detect patterns or unanticipated genomic relationships [8]. DL methods thus provide a powerful framework for dissecting the causal determinants underlying 3D chromatin interactions and for providing testable hypotheses for experimental follow-up. Here we review the state of the art in the use of DL methods to predict 3D chromatin interactions from DNA sequence (Table 1). We evaluate these models in terms of their objective, data preprocessing, architecture, training procedure and, finally, their prediction performance. We find that data selection and preprocessing in combination with transfer learning appear to be more important for model performance than the choice of model architecture. Moreover, even though all these models perform very well on their test data, they have so far been mainly restricted to the analysis of specific human cell lines. Similar model-based approaches are thus urgently needed for a wider range of cell lines, tissues or species. This information could facilitate deeper insights into the evolutionary and developmental factors that impact chromatin looping biology. Moreover, experimental validation of model-based predictions is needed to assess the biological value of these approaches and to fine-tune model architecture.

Table 1

Deep Learning algorithms for 3D chromatin interactions, sorted by architecture. All models are based on Convolutional Neural Networks or Recurrent Neural Networks. [18], [41], [43], [9], [19], [27], [28], [13], [22], [29], [33], [38], [40], [42], [31], [30], [11], [12], [14], [15], [16], [17], [20], [21], [23], [25], [26], [32], [34], [35], [36], [37].

Sequencing methods

Chromosome conformation capture (3C) followed by high-throughput sequencing has emerged as a powerful experimental approach to probe chromatin interactions at the genome-wide scale. Hi-C and ChIA-PET are two variants of these measurement approaches, which have served as the main data input for most of the DL methods reviewed in Table 1. The general experimental workflows for Hi-C and ChIA-PET involve cross-linking and fragmentation of chromatin, the addition of biomarkers, ligation, purification and finally sequencing [44]. We detail these experimental steps in Fig. 2. Despite the popularity of Hi-C, this method has a number of well-known limitations. First, its resolution is limited by the choice of the digestion enzyme [45], so that spatially close chromatin interactions may be missed in genomic regions where the distribution of enzymatic cut sites is sparse. Second, Hi-C often does not capture long-range interactions and omits many simultaneous promoter-enhancer interactions [46]. Third, false positive interactions are often detected because of spurious cross-linking and ligation. The first two limitations could be viewed as potential opportunities for DL methods, because computationally predicted loops could, in principle, be generated in genomic regions where the measurement technology has failed. However, the third limitation is disadvantageous for model training, where clean true positive (and true negative) loop sets are necessary. Newer experimental approaches, including ChIA-PET, Micro-C or Promoter Capture Hi-C (PCHi-C), try to bypass many of these limitations. Micro-C uses micrococcal nuclease (MNase) as a replacement for restriction enzymes to achieve a higher resolution for short-distance interactions [47]. PCHi-C introduces an additional biotinylated RNA purification step for promoter-rich fragments, to reduce the amount of ligation products before PCR sequencing is applied [48]. 
Unlike Hi-C, ChIA-PET combines 3C with chromatin immunoprecipitation (ChIP) followed by sequencing [3]. It performs chromatin fragmentation by sonication and uses a smarter biolinker concept [49]. Additionally, the use of ChIP enriches for chromatin contacts that harbor specific transcription factors, or other binding proteins, such as CTCF, and thus ensures a higher rate of true functional chromatin interactions [49]. However, as a trade-off ChIA-PET has relatively low sensitivity and requires a large amount of material [5], which could increase measurement variation due to cellular heterogeneity. Still, ChIA-PET is frequently applied and has thus been employed in the training of several DL algorithms (see Table 1).

Computational methods

Preprocessing

Preparation of training data

Measured (i.e. observed) chromatin interactions from Hi-C or ChIA-PET are the starting point for all DL methods reviewed here. However, DL methods differ in the way they use this information at the input stage. We can broadly distinguish between classification- and regression-based methods. Classification-based methods (EPIANN, TransEPI, EPIsCNN, Rambutan, EPIsHilbert, EPIHC, ChINN, DeepTACT, DeepMILO, SEPT, SPEID, EPIVAN, EPI-DLMH) typically take a list of discrete, interacting regions, called “anchors”, as input, which are treated as true positives. The goal is to learn specific sequence features within the anchors that may facilitate looping. The sizes of the input anchors can range from 2 kb to 3 kb, depending on the anchor type (see Table 1). In addition to these true positives, it is also necessary to supply true negative anchor pairs. Most methods do this by “rewiring” the positive set of anchors in new ways; that is, they form in silico loops between anchors that have not been observed to interact in the original Hi-C or ChIA-PET data. These generated negatives are often matched for distances similar to those in the positive set, in order to avoid biases in model training (DeepMILO, DeepTACT, SEPT, SPEID). Alternatively, distance information can be added as an auxiliary input in the training procedure itself (ChINN, Rambutan, TransEPI). Because of the substantial imbalance between positive and negative interactions, a data augmentation procedure is sometimes employed that adds arbitrary sub-sequences flanking the original anchors from the positive set (EPIANN, EPIsCNN, EPIsHilbert, SEPT, SPEID, EPIVAN, EPI-DLMH, EPIHC). In contrast to classification-based approaches, regression-based methods (Akita, DeepC, Orca) take contact frequency maps as input, which are constructed at megabase (Mb) resolution. 
To this end, the genome is partitioned into non-overlapping virtual contigs, and interaction counts within 1 Mb sub-sequences are taken to generate a two-dimensional frequency matrix. The DNA sequences and the corresponding frequency matrix are fed into the DL algorithm to predict factors affecting local interaction topologies (Akita, DeepC). In addition to the sequence encoder, Orca introduces a multilevel cascading decoder to provide genome-wide interactions on window sizes from 1 Mb to 256 Mb with resolutions between 4 kb and 1024 kb.
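
The "rewiring" of positive anchors into distance-matched negatives described above can be sketched as follows. This is a minimal illustration, not the procedure of any specific tool: the function name `rewire_negatives`, the tolerance parameter `max_distance_diff` and the toy coordinates are all assumptions, and anchors are reduced to single midpoint positions on one chromosome.

```python
import random

def rewire_negatives(positives, max_distance_diff, n_per_positive=1, seed=0):
    """Build negative anchor pairs by re-pairing anchors of the positive set.

    positives: list of (left, right) anchor midpoints on one chromosome.
    A rewired pair is accepted only if it was never observed as a positive
    and its span differs from the span of the positive pair it is matched
    to by at most max_distance_diff (hypothetical tolerance, in bp).
    """
    rng = random.Random(seed)
    observed = set(positives)
    lefts = sorted({a for a, _ in positives})
    rights = sorted({b for _, b in positives})
    negatives = set()
    for left, right in positives:
        target_span = right - left
        candidates = [
            (a, b)
            for a in lefts
            for b in rights
            if a < b
            and (a, b) not in observed
            and abs((b - a) - target_span) <= max_distance_diff
        ]
        rng.shuffle(candidates)
        picked = 0
        for pair in candidates:
            if picked == n_per_positive:
                break
            if pair not in negatives:
                negatives.add(pair)
                picked += 1
    return sorted(negatives)

# Toy coordinates, invented for illustration only.
toy_positives = [(1000, 51000), (20000, 80000), (40000, 95000), (60000, 115000)]
print(rewire_negatives(toy_positives, max_distance_diff=10000))
```

Matching each negative to the span of a positive pair keeps the distance distributions of the two classes similar, so that a classifier cannot separate them by anchor distance alone.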

Auxiliary data

Many classification-based methods further employ auxiliary data to filter anchor pairs prior to training. For example, methods such as EPIANN, TransEPI or SPEID (see Table 1) select anchors that map to annotated enhancers and promoters. Such filtering strategies reduce the genome-wide interaction landscape to a subset of functional regions, and (probably) reduce biases or measurement errors arising from the 3C assay itself. Clearly, such a priori filtering strategies are only sensible in applications involving high-quality genomes where sufficient epigenomic information is available and all regulatory elements are well annotated. This is certainly true for all DL methods to date, as they have been developed specifically for human genomic applications, and therefore benefit from the extensive ENCODE data resources [24]. Common auxiliary data modalities include RNA-seq (gene expression), DNase-seq (accessible regions), as well as ChIP-seq for various histone modifications, RNA polymerase, transcription factors or CTCF binding proteins. Using such functional data, anchors overlapping lowly expressed transcripts from RNA-seq can be removed from the enhancer-promoter data, since they are less likely to be actively regulated by enhancers [13], [40], [50]. Similarly, anchors can be screened for accessible chromatin regions related to CTCF or RNA Pol II binding sites [31] in order to enrich for active loops. Furthermore, the incorporation of functional data makes it possible to identify cell-type-specific chromatin interactions. This creates avenues for predicting and understanding the looping biology underlying cell lineage determination. As an alternative to a priori filtering, some methods integrate such auxiliary data directly into the model training, either in a pre-training step (DeepC, Orca, ChINN) or in the full training procedure (TransEPI, EPIHC). 
Either way, it has been shown that the inclusion of auxiliary data in DL methods can improve algorithm performance (TransEPI, DeepC, Orca, ChINN, EPIHC), and thus appears to be an important aspect of data preparation.

Sequence embedding

The nucleotide sequences used in DL model training first need to be converted into a machine-readable format. One-hot encoding is a commonly used approach to represent sequences as binary vectors. The four nucleotides A, T, C and G are saved in four separate channels. A is represented as (1, 0, 0, 0), T as (0, 1, 0, 0), C as (0, 0, 1, 0) and G as (0, 0, 0, 1). The unknown nucleotide category N can be either removed (SPEID, EPIANN, SIMCNN), given its own encoding (ChINN), or saved as the fifth dimension in the matrix representation (DeepMILO). Moreover, one-hot encoding can be easily extended to capture the spatial relationship between anchors by converting a two-dimensional one-hot matrix into a three-dimensional matrix–vector representation by applying a space-filling Hilbert curve [51], [52] (see EPIsHilbert). Another group of methods instead uses a distributed representation of small DNA sequences, called k-mers. The dna2vec embedding [53] is based on the popular word2vec [54] natural language processing model. The algorithm takes distributed samples of variable-length k-mers to train a shallow two-layer neural network model. This aggregated model enables the user to perform a decomposition by k-mer length, followed by the selection of the best low-dimensional and high-quality vector representation used for sequence embedding. This increases computational efficiency as well as specificity in DL approaches (EPIVAN), and preserves the hidden information from the correlation between single small DNA sequences. It can be trained either on the whole genome or on a set of anchors.
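
As a minimal sketch of the one-hot encoding described above, the following function maps a DNA string to binary rows with the channel order A, T, C, G. The function name and the `unknown` option are assumptions made for illustration; only two of the N-handling strategies mentioned in the text (dropping N, or adding a fifth channel) are implemented.

```python
CHANNELS = "ATCG"  # channel order follows the text: A, T, C, G

def one_hot(sequence, unknown="drop"):
    """One-hot encode a DNA string as a list of binary rows.

    unknown="drop"  removes N positions before encoding.
    unknown="fifth" keeps N as an additional fifth channel.
    """
    if unknown not in ("drop", "fifth"):
        raise ValueError("unknown must be 'drop' or 'fifth'")
    alphabet = CHANNELS + ("N" if unknown == "fifth" else "")
    encoded = []
    for base in sequence.upper():
        if base not in alphabet:
            if base == "N":  # only reached when unknown == "drop"
                continue
            raise ValueError(f"unexpected character: {base!r}")
        encoded.append([1 if base == channel else 0 for channel in alphabet])
    return encoded

print(one_hot("ATCG"))
print(one_hot("AN", unknown="fifth"))
```

Dropping N shortens the encoded sequence, whereas the fifth channel preserves positions; which option is preferable depends on whether the downstream model relies on fixed-length input.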

Deep Learning approaches

We examined the model architectures of the DL methods in Table 1 in detail. The most frequent architecture uses a convolutional neural network (CNN) in combination with a long short-term memory (LSTM) unit (see Fig. 2). CNNs, or simply convolutional networks (CNs), are a specific type of DL model based on artificial neural networks (ANNs) [55]. Models with an LSTM unit belong to the family of recurrent neural networks (RNNs), which are also a class of ANNs [56]. The same family comprises Gated Recurrent Units (GRUs), a slightly simpler variant with fewer parameters [57]. Both unit types often occur with additional connections in the opposite direction within one layer [58]. In that case, they are called Bidirectional LSTM (Bi-LSTM) and Bidirectional GRU (Bi-GRU). Due to the continuous refinement of 3C-based methods, especially Hi-C, ChIA-PET and other high-throughput techniques, large amounts of high-quality chromatin data are available. These data motivate the use of supervised versions of the ANN models introduced above. Table 1 shows that all models use CNNs as core models and that six of 16 add a (Bi-)LSTM or Bi-GRU unit (DeepTACT, DeepMILO, SEPT, SPEID, EPIVAN and EPI-DLMH). This observation could be explained by the historical applications of CNNs and LSTMs. As their name indicates, CNNs use a convolution operator, also called kernel function, to search for specific patterns in discrete grid-based topologies or sequential data [59]. The concept was originally developed for image or sound classification and is related to the neuronal connections of the human brain [60]. Section 2.1 explained the conversion of genomic sequences into a matrix or vector-based representation. 
This indicates the connection between image recognition and the computation of chromatin interactions, since the input data of both problems can be represented by a matrix or linear vector combination that contains unknown patterns and possibly long-term dependencies. LSTM units are designed to capture long-term dependencies through shared parameters that function as a memory unit. One could say that, while CNNs mostly catch local patterns, LSTMs act as a global observer. Therefore, LSTMs have been used mainly in time-dependent problems, as they store crucial information over a long period of time. A typical application is speech recognition, where it can take a long time until a specific word or phrase is repeated [56]. A similar situation occurs in promoter-enhancer interactions, which often span distances of up to several Mb [61]. A common DL configuration as well as typical training data are illustrated in Fig. 2. Even though the layer-based representation shown in that figure is helpful for visualizing the overall model structure, it is important to keep the actual neural-based architecture in mind. All layers contain nodes that are connected through edges. On the edges, linear transformations are applied, which provide all the weights or parameters. On the nodes, preselected non-linear transformations are applied. The process of training is equivalent to minimizing a loss function L with respect to the corresponding parameters, which connect the layers. The output of the training process is this specific set of parameters in combination with the layer structure and arrangement. During the training process, the model is validated with a subset of the training set (usually around 10%). Another 10% of the data, the test set, is held out and used for the final test run on unseen data. In the supervised setup, this test run provides all statistical measurements and quality values such as accuracy, sensitivity and specificity on the sample set. 
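
To make the idea of a convolution kernel scanning a one-hot encoded sequence for local patterns concrete, here is a minimal sketch of a single 1D convolution with ReLU activation and global max pooling. The kernel weights are hand-set from a motif string purely for illustration (in a trained CNN they are learned), and all function names as well as the example motif and sequence are assumptions.

```python
CHANNELS = "ATCG"  # same channel order as in the one-hot encoding

def one_hot(sequence):
    """Encode a DNA string as a list of 4-element binary rows."""
    return [[1 if base == ch else 0 for ch in CHANNELS] for base in sequence]

def kernel_from_motif(motif, match=1.0, mismatch=-1.0):
    """Hypothetical hand-set kernel; a real CNN learns these weights."""
    return [[match if base == ch else mismatch for ch in CHANNELS] for base in motif]

def conv1d_relu(encoded, kernel):
    """Slide the kernel along the sequence and apply ReLU to each score."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][c] * kernel[i][c]
            for i in range(width)
            for c in range(len(CHANNELS))
        )
        activations.append(max(0.0, score))
    return activations

def global_max_pool(activations):
    """Keep only the strongest response of the kernel along the sequence."""
    return max(activations)

activations = conv1d_relu(one_hot("TTGATACC"), kernel_from_motif("GATA"))
print(activations)
print(global_max_pool(activations))
```

Each activation measures how well one window matches the kernel, which is why CNN layers mainly capture local patterns; a recurrent layer placed on top of such activations can then relate patterns that lie far apart.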

Performance and training strategies

We sought to rank the prediction performance of the models reviewed in Table 1. The majority of the original studies provide values for the area under the precision-recall curve (AUPR) as a statistical measure of prediction accuracy. A higher AUPR value is indicative of better model performance. Since Akita, Orca and DeepC are regression-based models, they did not publish such values and are not included in the performance analysis. Nevertheless, since Akita and Orca are trained on the same Micro-C data set, it is possible to compare their Pearson correlation coefficients. On average, this correlation coefficient is higher in Orca for both the H1 embryonic stem cells (H1-ESC) and the human foreskin fibroblasts (HFF). Besides these, we exclude all algorithms that are not trained and tested on Hi-C [10] data, because mixed data modalities would render comparisons difficult. The remaining models (EPIsHilbert, EPIVAN, EPIsCNN, EPI-DLMH, EPIHC and EPIANN), which provide AUPR values, have been trained on a combination of six cell lines: K562 (mesoderm lineage cells from a leukemia patient), GM12878 (lymphoblastoid cells), HeLa-S3 (ectoderm lineage cells from a cervical cancer patient), HUVEC (umbilical vein endothelial cells), IMR90 (fetal lung fibroblasts) and NHEK (epidermal keratinocytes) [10], and employ a rich set of auxiliary functional data [62]. We observe four essential training strategies: (i) training on a specific cell line, (ii) training on a combination of all cell lines, (iii) model-based learning and (iv) cellular transfer learning (data-based). Model-based training often introduces an attention layer to merge the acquired feature knowledge of cell-line specific sub-models. Cellular transfer learning involves two or more stages. First, all cell line data are used for training, and second, a specific cell line is selected for another training procedure. 
The second cell-line specific training consumes little computational effort [28], since the model architecture remains the same and the primary trained parameters are used as initial weights. If cellular transfer learning has been applied, the process is repeated for all six cell lines separately to obtain a total of six AUPR values. Training and test-sets originate from the same cell line. The mean value of these six AUPR values is plotted in Fig. 3 with the range of all values indicated by the yellow interval. We observe that models using transfer learning tend to perform comparatively well. This could be an indicator that cell-specific features are important predictors [63]. Interestingly, we also see that the more sophisticated CNN + LSTM architectures are not among the top performers.
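
The AUPR summary described above can be sketched as follows. Average precision is used here as a common step-wise estimator of the area under the precision-recall curve; this choice, the function names and the toy labels and scores are assumptions made for illustration, and tied scores are not treated specially. No values from the reviewed studies are reproduced.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a true interaction, 0 for a negative pair.
    scores: predicted interaction probabilities, higher means more confident.
    """
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_positive

def summarize(values):
    """Mean, minimum and maximum, as used for the bar and the interval."""
    return sum(values) / len(values), min(values), max(values)

# Toy predictions for two hypothetical cell lines (invented numbers).
toy_results = {
    "toy_line_1": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "toy_line_2": ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]),
}
per_line = [average_precision(y, s) for y, s in toy_results.values()]
print(per_line)
print(summarize(per_line))
```

Because precision and recall are both defined relative to the positive class, this summary is insensitive to the number of correctly rejected negatives, which is what makes it attractive for strongly imbalanced interaction data.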

Fig. 3

Performance bar plot for models that provided AUPR values for cell-type specific training and testing. Light blue bars indicate a pure CNN model, dark blue bars a CNN + RNN model. Transfer learning is represented by dotted bars. The yellow interval is defined by the minimum and maximum AUPR value.

There are a number of caveats with this simple side-by-side performance evaluation. First, AUPR values are just one of several statistical performance measures. The imbalanced training data sets contain approximately 20 times more negative than true interactions. Since the AUPR value mainly reflects performance on true interactions [64], it is the most frequently used performance measure. Second, we do not consider the computational effort and costs of the models, which are an essential factor for efficiency.

Biological insights and applications

Genomic determinants of looping

DL models are undoubtedly powerful tools for predicting 3D chromatin looping based on DNA sequence. However, they also provide an intriguing framework for dissecting the underlying looping biology. A common approach in classification-based algorithms is to use in silico mutagenesis, whereby mutations are artificially introduced into specific anchor pairs, which are then re-supplied to the trained DL model. The concept of this approach is presented in Fig. 4. Mutations that lead to significant drops in the predicted interaction probability would suggest an important role in facilitating chromatin contacts, perhaps because they are located in crucial protein binding motifs. Algorithms such as DeepC, Akita, Orca and DeepMILO have extensively used this approach to assess the effect of specific deletions, single nucleotide polymorphisms (SNPs) or structural variants. Akita, for instance, mutagenized a set of random regions within and near CTCF motifs. The results of this study showed the significant impact of SNPs on CTCF binding, either directly or through flanking cofactors. Additionally, a mouse-trained Akita model was used to predict the effect of a 622 kb inversion at the Epha4 enhancer locus on 3D folding. Experimental studies had observed that the inversion affected CTCF binding [65]. Using its predicted contact maps, Akita confirmed that an in silico inversion of the Epha4 locus led to a loss of CTCF-mediated insulator looping. However, in silico mutagenesis is a brute-force approach. It is not optimal for probing the complete combinatorial mutation space within a given anchor sequence. This limitation may be crucial in situations where looping is facilitated by combinations of specific (and possibly complex) motif sequences. As an alternative, DeepC employs a metric called saliency score [66], which quantifies the importance of every base pair and motif to the predicted interaction. 
This metric can be calculated as the scalar product of the model output gradient, with respect to the one-hot-encoded input sequence. Akita used this metric as well. They found saliency peaks at the CTCF motifs and active promoters, and hypothesized that mutations of these regions would affect chromatin architecture, and thus, gene expression, which can be investigated by expression quantitative trait locus (eQTL) studies. To test this, they used the set of cell-type specific eQTLs located in open chromatin or CTCF sites. They compared the saliency scores of these eQTLs and mutagenized random regions. The significantly higher saliency score for eQTLs revealed that this metric can be used for eQTL mapping when the expression changes are caused by chromatin architecture perturbation. Orca successfully predicted the influence of several structural variants in six studies while comparing the model output with experimental chromatin capture data.
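
The in silico mutagenesis procedure described above can be sketched as an exhaustive single-nucleotide scan. The `toy_model` below is only a stand-in for a trained network (it returns a high score whenever a fixed motif string is present), and the motif, the sequence and all names are assumptions chosen for illustration.

```python
BASES = "ACGT"

def toy_model(sequence):
    """Stand-in for a trained predictor: high score if the motif is present."""
    return 0.9 if "CCCTC" in sequence else 0.1

def in_silico_mutagenesis(sequence, predict):
    """Score every single-nucleotide substitution against the reference.

    Returns (position, reference base, alternative base, score change),
    where a negative change means the mutation lowers the prediction.
    """
    reference_score = predict(sequence)
    effects = []
    for position, ref_base in enumerate(sequence):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue
            mutant = sequence[:position] + alt_base + sequence[position + 1:]
            effects.append(
                (position, ref_base, alt_base, predict(mutant) - reference_score)
            )
    return effects

effects = in_silico_mutagenesis("AACCCTCGG", toy_model)
disruptive = sorted({pos for pos, _, _, delta in effects if delta < -0.5})
print(disruptive)
```

The scan requires three model evaluations per base, which illustrates why the approach is described as brute force: covering pairs or larger combinations of mutations grows the number of evaluations combinatorially.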

Fig. 4

Biological insights by Deep Learning. Deep Learning models can be used to predict chromatin interactions for several nucleotide variants through in silico mutagenesis. A change in the loop probabilities between sequence variants indicates the importance of the specific single nucleotide polymorphisms in chromatin looping. Transfer learning can be used to extend previously trained knowledge to different species or cell lines.

Class Activation Maps (CAM) [67] are another approach to quantify the influence of sequence features on enhancer-promoter interactions. They visualize three-dimensional vectors as a heatmap matrix, which represents the interaction occurrence and their spatial relationship. These association maps may be used to highlight sequence patterns leading to the chromatin interaction (EPIsHilbert). Another proposed method to detect motifs responsible for chromatin looping is based on the output of the first convolutional layer. This procedure can extract the best matching subsequences for each kernel, with respect to the model architecture. SEPT used this strategy to compute a position frequency matrix (PFM) and then compared the PFM-related features with known TF motifs from the HOCOMOCO database [68]. They found a set of potentially important regulatory elements, which are involved in transcription and cell-cycle regulation and may therefore play a role in chromatin looping. These results revealed that SEPT has the ability to learn cell-type specific patterns crucial in genome folding, which explains its relevance in transfer learning approaches. Attention layers [69] can be used not only to merge cell-line specific submodels in transfer learning, but also to evaluate the impact of those features on prediction. EPIANN labeled each base in observed enhancer and promoter sequences with the corresponding marginal attention to highlight regulatory elements overlapping patterns crucial in predicting EPIs. 
It showed that the attention regions are usually highly correlated with other genomic annotations. This information can be further investigated to interpret the key patterns in chromatin interactions.
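
The kernel-based motif extraction described above can be illustrated with a short sketch: for each input sequence, the window that scores best under a kernel is collected, and the collected windows are tallied into a position frequency matrix (PFM). The scoring function below simply counts matches to a fixed string and stands in for a learned first-layer kernel; it, the function names and the toy sequences are assumptions.

```python
def best_window(sequence, width, score):
    """Return the subsequence of the given width with the highest score."""
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    return max(windows, key=score)

def position_frequency_matrix(subsequences):
    """Count how often each base occurs at each position of the windows."""
    width = len(subsequences[0])
    if any(len(s) != width for s in subsequences):
        raise ValueError("all subsequences must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for subsequence in subsequences:
        for position, base in enumerate(subsequence):
            pfm[base][position] += 1
    return pfm

def toy_kernel_score(window, motif="GATA"):
    """Stand-in for a kernel activation: number of positions matching motif."""
    return sum(a == b for a, b in zip(window, motif))

toy_sequences = ["TTGATACC", "CGATTAAC", "AAGATAGG"]
hits = [best_window(s, 4, toy_kernel_score) for s in toy_sequences]
print(hits)
print(position_frequency_matrix(hits))
```

The resulting matrix can then be compared with known transcription factor motifs, for example from a motif database, which is the step the text attributes to SEPT.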

Screening of disrupting variants

Building on the above-described approach, several studies have tried to test or identify specific loop-disrupting genetic variants and their impact on carcinogenesis [39]. Such genetic variants could result in the suppression, or in some cases even the activation, of chromatin loops. In the latter case, enhancer elements could be erroneously brought into physical contact with proto-oncogenes and thus promote cancer progression [70], [71]. The DL prediction framework provides a means to systematically screen available GWAS or re-sequencing databases to identify putative causal variants underlying differential looping. Using such an approach, DeepMILO was tested on two known deletions in T-Cell Acute Lymphoblastic Leukemia (T-ALL) patients at anchors containing the oncogenes TAL1 and LMO2. The algorithm correctly predicted that these mutations should lead to insulator loop disruption. In addition, DeepMILO employed in silico mutagenesis and further predicted that a number of smaller mutations, just outside the CTCF binding motif, may also affect chromatin interactions. These latter predictions provide concrete hypotheses for experimental follow-up. Conceptually similar approaches were taken by TransEPI and ChINN to identify putative loop-disrupting mutations in the context of neuronal disorder and chronic lymphocytic leukemia (CLL), respectively. Using CLL patient data, ChINN was able to construct a patient-specific chromatin interaction profile, suggesting that such predictions could serve as a pre-clinical tool to predict CLL disease risk. Orca predicted regions of high structural impact (10 bp) that were consistent with ChIP-seq data and confirmed that disrupting motifs such as POU5F1::SOX2 or AP-1 reduces genome interactions. Overall, these results suggest local and global effects of disrupting variants. 
Many eQTLs identified in human population data have been shown to act in trans, that is, the QTL affects the expression state of distal genes rather than acting locally. One possible mechanism for these distal interactions is chromatin looping [72]. Indeed, application of Akita to a set of eQTLs from GTEx (Genotype-Tissue Expression, [73]) whole blood samples revealed significantly higher disruption for SNPs with greater causal probability, both within and outside of CTCF motifs. This indicates that both CTCF and non-CTCF variants affect genome folding. Hence, eQTL data sets can be used for the biological validation of predicted interactions, to reduce false positives from 3C-based data.
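The in silico mutagenesis screens described above can be illustrated with a minimal Python sketch. It is not the procedure of DeepMILO, Akita or any other tool reviewed here: the scoring function `toy_contact_score` is a hypothetical stand-in for a trained model (it simply counts occurrences of an arbitrary toy motif), and all names are assumptions made for illustration. The point is only the screening loop, in which every single-nucleotide substitution is scored against the reference prediction.

```python
from typing import Callable, List, Tuple

BASES = "ACGT"

def toy_contact_score(seq: str, motif: str = "CCCTC") -> float:
    """Hypothetical stand-in for a trained DL model.

    Returns the number of exact occurrences of a toy motif. A real model
    would return a predicted interaction score or contact map instead.
    """
    k = len(motif)
    return float(sum(seq[i:i + k] == motif for i in range(len(seq) - k + 1)))

def in_silico_mutagenesis(
    seq: str, predict: Callable[[str], float]
) -> List[Tuple[int, str, str, float]]:
    """Score every single-nucleotide substitution against the reference.

    Returns tuples (position, ref_base, alt_base, delta), where delta is
    predict(mutated) - predict(reference). Negative values indicate a
    predicted loss of interaction signal.
    """
    reference_score = predict(seq)
    effects = []
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            effects.append((pos, ref_base, alt, predict(mutated) - reference_score))
    return effects

# Toy example: the motif occupies positions 2 to 6 of this sequence.
effects = in_silico_mutagenesis("AACCCTCAA", toy_contact_score)
disrupting_positions = sorted({pos for pos, _, _, delta in effects if delta < 0})
print(disrupting_positions)
```

In a real screen, `predict` would wrap a trained network, and the candidate substitutions would typically come from GWAS or re-sequencing data rather than from exhaustive enumeration.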

Cell-type and species specific chromatin interactions

Cross-cell line prediction poses a major challenge due to cell-type specificity [63], [10]. Our review revealed that general DL models trained on all cell lines together tend to perform poorly in capturing cell-specific events, owing to the imbalance between shared and private chromatin interactions. Conversely, most DL methods that are trained on each cell line separately fail to identify general and cell-line specific interactions simultaneously. Approaches based on transfer learning perform much better in this setting. The authors of EPIsHilbert hypothesized that the effectiveness of these approaches is determined by the numerous sequence patterns that are common to all cell lines. To support this, they calculated the overlap ratio of chromatin interactions between different tissues, which indicated that there are more common than private interactions. This knowledge can be used to create a model that is able to predict chromatin interactions in novel cell lines. SEPT used a feature extractor and a domain discriminator to learn EPI-related features and recognize cell-line specific patterns at the same time. This provides an opportunity to create a universal model that is able to predict chromatin interactions not only in the cell lines used for training, but also in novel cell lines using general EPI features. Similarities in genome folding among mammals may also allow us to predict species-specific differences in genome folding. Recent studies showed that ChAHP complexes lead to the disruption of insulator loops within mouse-specific B2 SINE elements [74], [75]. To test this, the authors of Akita trained models on human and murine embryonic stem cell (ESC) Hi-C data to show the impact of in silico mutation of these elements on CTCF binding. Comparison of these results confirmed that both models correctly predicted the disturbance of genome folding before and after mutagenesis of B2 SINE elements.
This highlights the opportunity to use DL and transfer learning approaches in studies investigating species-specific regulatory strategies (see Fig. 4).
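The overlap analysis mentioned above for cell lines can be sketched as follows. The exact definition of the overlap ratio used by EPIsHilbert is not given here, so the sketch assumes a Jaccard index (shared interactions divided by the union) and treats an interaction as an unordered pair of anchor identifiers. The cell-line names are real, but the interaction sets and anchor identifiers are invented toy data.

```python
from typing import Dict, FrozenSet, Set, Tuple

Interaction = FrozenSet[str]

def interaction(anchor_a: str, anchor_b: str) -> Interaction:
    """Unordered anchor pair, so (enhancer, promoter) equals (promoter, enhancer)."""
    return frozenset((anchor_a, anchor_b))

def overlap_ratio(x: Set[Interaction], y: Set[Interaction]) -> float:
    """Jaccard index of two interaction sets (an assumed definition)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def shared_and_private(
    by_cell_line: Dict[str, Set[Interaction]]
) -> Tuple[Set[Interaction], Dict[str, Set[Interaction]]]:
    """Split interactions into those shared by all cell lines and those
    private to exactly one cell line."""
    shared = set.intersection(*by_cell_line.values())
    private = {}
    for name, own in by_cell_line.items():
        others = set().union(*(s for n, s in by_cell_line.items() if n != name))
        private[name] = own - others
    return shared, private

# Invented toy data: anchor identifiers are placeholders, not real loci.
toy = {
    "GM12878": {interaction("e1", "p1"), interaction("e2", "p2"), interaction("e3", "p3")},
    "K562": {interaction("p1", "e1"), interaction("e2", "p2"), interaction("e4", "p4")},
}
ratio = overlap_ratio(toy["GM12878"], toy["K562"])
shared, private = shared_and_private(toy)
print(ratio, len(shared), {name: len(s) for name, s in private.items()})
```

With real data, anchors would be genomic intervals, and two interactions would usually be matched by interval overlap within some tolerance rather than by exact identity.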

Discussion

Here, we have reviewed current supervised DL models for predicting 3D chromatin interactions from DNA sequence. We find that these methods have remarkably high prediction accuracy, which indicates that DNA sequence features are important determinants of chromatin looping. By examining the learned sequence features it is possible to uncover complex, combinatorial sequence motifs that would otherwise be difficult to discover, even with elaborate experimental assays. Thus, DL models have the potential to provide novel insights into chromatin biology. Similarly, trained DL models can be used as a tool to identify loop-disrupting genetic variants from population-level sequencing data. This type of information is highly relevant for understanding the genetic basis of regulatory variation underlying complex traits, in both clinical and agricultural settings. Indeed, numerous genome-wide association studies have identified causal loci in non-coding regions of genomes, many of which appear to act as eQTLs for distal target genes [76], [77]. DL models could be used to assess whether these non-coding loci and their targets are likely to interact physically, and whether the type of genetic variant seen at the eQTL locus is expected to cause differential looping. Understanding the mode of action of disease-associated non-coding variants can facilitate insights into disease etiology and potentially lead to novel treatment targets in biomedical applications. Despite such exciting prospects, our review also revealed a number of key limitations of current DL methods. All models to date have been trained and tested on human data from six cell lines of the ENCODE project [24], or a subset of those cell lines. While the use of a single, or a few, reference data set(s) enables comparisons of the algorithms among themselves, it reduces the generalizability of the trained models.
This potential limitation is already apparent in DL models that were trained on multiple cell lines and showed how sensitive the trained model can be to cell-type specific features (EPIsCNN, EPIsHilbert, EPIVAN, SEPT, TransEPI). It is therefore highly unlikely that current models readily extend to other in vivo tissues in humans, and much less across different species. Hence, a much broader range of training data sets is urgently needed. From our perspective, a cross-species DL model would be highly interesting. For instance, it is well known that the biology underlying chromatin looping in mammals and plants differs fundamentally. Plants lack CTCF proteins and display several other key topological differences in the 3D organization of their genomes. In contrast to mammalian TAD structure, plant compartment domains tend to interact with each other at the intra- and inter-chromosomal level [78], [79]. Moreover, although mammalian loop domains are conserved across species due to the conservation of CTCF binding sites, plant compartments might differ between plant species [80]. The molecular components underlying chromatin looping in plants are not fully resolved. Cross-species DL models would not only be able to identify conserved and divergent sequence determinants, but also reveal how the molecular mechanisms underlying chromatin looping have evolved. However, training the DL models reviewed here on other species may not be trivial. Many of the models rely heavily on auxiliary functional data (DNase-seq, RNA-seq, ChIP-seq, CTCF binding proteins, transcription factors, RNA polymerase) and high-quality genome annotations in the training procedure itself (see Table 1). While this type of data is useful for boosting model performance, it is not readily available for most non-human species. Thus, such auxiliary data would have to be generated first, or alternative training strategies would have to be employed that rely less on such data sources.
From a bioinformatics point of view, it may be tempting to focus on the development of the ANN structure. However, if we consider the performance metrics in Fig. 3, there is no clear indication that the ANN structure has a significant impact, provided that it meets some minimal criteria, such as two well-performing CNN blocks. Zhuang et al. [28], for instance, show that a simple CNN model has similar performance on the same training data as a CNN model coupled with an RNN. This even raises questions about the implementation of an LSTM unit altogether, even though there are plausible biological interpretations of its function. Since LSTM and related units are computationally much more expensive than CNN blocks, they might prove impractical in the long run due to lack of efficiency. On the other hand, the training process itself seems to have a strong impact on prediction results. If we consider the AUPR values in Fig. 3, we observe that the most promising results are obtained by training processes that use transfer learning strategies. This improvement in statistical measurements can be explained biologically by the presence of both cell-line specific and general features along the chromatin. In mammalian cells, for example, it is known that CTCF binding sites are conserved across all cell lines, but other interactions are cell-line specific [81]. It should also be clear that the use of DL models cannot fully replace experimental data in revealing chromatin interactions. Experimental validation of predicted interactions should go hand-in-hand with model building. To date, there has been relatively little effort to perform such validations.
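Because the comparison above rests on AUPR values, a short sketch of how such a value can be computed may be helpful. The code below uses average precision, a common estimator of the area under the precision-recall curve. Whether the reviewed studies computed AUPR in exactly this way is an assumption, and the labels and scores are invented toy values. The sketch also assumes that no two predictions share the same score.

```python
from typing import Sequence

def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision values obtained at the rank
    of each true interaction, with predictions sorted by decreasing score.

    labels: 1 for a true interaction, 0 for a non-interacting pair.
    scores: model outputs, higher meaning more likely to interact.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / n_positive

# Invented toy predictions for four candidate anchor pairs.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(toy_labels, toy_scores)
print(round(ap, 4))
```

Unlike the area under the ROC curve, this measure depends on the ratio of positive to negative pairs, which is one reason it is often reported for interaction data sets in which true interactions are strongly outnumbered.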

Author contributions

All authors contributed to the conceptualisation and writing of the paper.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

64 in total

1. Enhancer-promoter specificity mediated by DPE or TATA core promoter motifs.

Authors: J E Butler; J T Kadonaga
Journal: Genes Dev Date: 2001-10-01 Impact factor: 11.361

2. Serial genomic inversions induce tissue-specific architectural stripes, gene misexpression and congenital malformations.

Authors: Katerina Kraft; Andreas Magg; Verena Heinrich; Christina Riemenschneider; Robert Schöpflin; Julia Markowski; Daniel M Ibrahim; Rocío Acuna-Hidalgo; Alexandra Despang; Guillaume Andrey; Lars Wittler; Bernd Timmermann; Martin Vingron; Stefan Mundlos
Journal: Nat Cell Biol Date: 2019-02-11 Impact factor: 28.824

3. Predicting enhancer-promoter interactions by deep learning and matching heuristic.

Authors: Xiaoping Min; Congmin Ye; Xiangrong Liu; Xiangxiang Zeng
Journal: Brief Bioinform Date: 2021-07-20 Impact factor: 11.622

4. 3D Chromatin Architecture of Large Plant Genomes Determined by Local A/B Compartments.

Authors: Pengfei Dong; Xiaoyu Tu; Po-Yu Chu; Peitao Lü; Ning Zhu; Donald Grierson; Baijuan Du; Pinghua Li; Silin Zhong
Journal: Mol Plant Date: 2017-11-22 Impact factor: 13.164

5. Genetic Control of Chromatin States in Humans Involves Local and Distal Chromosomal Interactions.

Authors: Fabian Grubert; Judith B Zaugg; Maya Kasowski; Oana Ursu; Damek V Spacek; Alicia R Martin; Peyton Greenside; Rohith Srivas; Doug H Phanstiel; Aleksandra Pekowska; Nastaran Heidari; Ghia Euskirchen; Wolfgang Huber; Jonathan K Pritchard; Carlos D Bustamante; Lars M Steinmetz; Anshul Kundaje; Michael Snyder
Journal: Cell Date: 2015-08-20 Impact factor: 41.582

6. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

Review 7. ChIP-based methods for the identification of long-range chromatin interactions.

Authors: Melissa J Fullwood; Yijun Ruan
Journal: J Cell Biochem Date: 2009-05-01 Impact factor: 4.429

8. Novel distal eQTL analysis demonstrates effect of population genetic architecture on detecting and interpreting associations.

Authors: Matthew Weiser; Sayan Mukherjee; Terrence S Furey
Journal: Genetics Date: 2014-09-16 Impact factor: 4.562

9. Promoter Capture Hi-C: High-resolution, Genome-wide Profiling of Promoter Interactions.

Authors: Stefan Schoenfelder; Biola-Maria Javierre; Mayra Furlan-Magaril; Steven W Wingett; Peter Fraser
Journal: J Vis Exp Date: 2018-06-28 Impact factor: 1.355

10. A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods.

Authors: Jill E Moore; Henry E Pratt; Michael J Purcaro; Zhiping Weng
Journal: Genome Biol Date: 2020-01-22 Impact factor: 13.583