Analysis of the landscape of human enhancer sequences in biological databases.

Juan Mulero Hernández¹, Jesualdo Tomás Fernández-Breis¹.

Abstract

Gene regulation operates as a network that involves both genetic sequences and proteins, with multiple levels of regulation and multiple mechanisms. Transcription is the main control mechanism for most genes, with downstream steps responsible for refining the transcription patterns. In turn, gene transcription is mainly controlled by regulatory events that occur at promoters and enhancers. Several studies focus on analyzing the contribution of enhancers to the development of diseases and their possible use as therapeutic targets. The study of regulatory elements has advanced rapidly in recent years with the development and use of next generation sequencing techniques. This work has generated a large volume of information that has been deposited in a growing number of public repositories. In this article, we analyze the content of those public repositories that contain information about human enhancers, with the aim of detecting whether the knowledge generated by scientific research is contained in those databases in a way that could be computationally exploited. The analysis is based on three main aspects identified in the literature: types of enhancers, type of evidence about the enhancers, and methods for detecting enhancer-promoter interactions. Our results show that no single database facilitates the optimal exploitation of enhancer data, that most types of enhancers are not represented in the databases, and that there is a need for a standardized model for enhancers. We have identified major gaps and challenges for the computational exploitation of enhancer data.

Keywords: Bioinformatics; Biological databases; Enhancers; Gene regulation; Human

Year: 2022 PMID: 35685360 PMCID: PMC9168495 DOI: 10.1016/j.csbj.2022.05.045

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

Enhancers are distal cis-regulatory sequences capable of increasing the transcription of the genes that they regulate, independently of their orientation and distance to the transcription start site (TSS) [1], [2], [3]. Moreover, they have been shown to be fundamental sequences in the regulation of genes and of relevant processes such as cell identity and disease development. In fact, the term enhanceropathies has been used to refer to diseases associated with these sequences, and they have been studied as possible therapeutic targets [4], [5], [6], [7], [8]. The literature estimates that the human genome contains more enhancers than protein-coding genes, because one gene can be regulated by multiple enhancers, although one enhancer can also control multiple genes [3], [9], [10] (Fig. 1). Some studies have found that each enhancer interacts with approximately two promoters and each promoter interacts with 4–5 enhancer elements [9]. Furthermore, species and tissue specificity implies that different combinations of enhancers can be used to control the expression of a gene, and that these combinations can differ depending on the cellular state and environmental factors [3], [9], [11], [12]. Even at the enhancer level, the existence of multiple transcription factor binding sequences (TFBS) means that combinations of transcription factors (TF) and mechanisms of action can be even more varied and context dependent [13].

Fig. 1

Traditional chromatin loop model (A). The enhancer physically interacts with the promoter through chromatin flexibility and mechanisms like the loop extrusion model. In this way, the enhancer can provide molecular elements that increase transcription of the target gene. Alternatively, in the hub model, the spatial proximity of multiple regulatory sequences allows the recruitment of high concentrations of molecular elements that can generate a network and a microenvironment, even phase-separated, that increases the transcription of target genes. This model can explain phenomena like the regulation of multiple genes by the same sequence (B) and the regulation of the same gene by multiple sequences (C). Modified from www.addgene.org and [42].

Initially, enhancer sequences followed a functional definition. They have traditionally been characterised as nucleosome-free regions (NFR or DHS) enriched with TFBS that allow the recruitment of molecular elements. In turn, they collaborate in the gene transcription process when these molecular elements can interact through chromatin loops that bring the sequences closer together or through the formation of hubs [14], [15], [16] (Fig. 1). Multiple mechanisms have been observed and suggested to explain how enhancers are able to increase the expression of target genes [17].
We find modes of action based on the nature of the sequence that performs the enhancer action (enhancer action based on the DNA sequence or on the transcribed eRNA), the environment of action (cis/trans action), the mode of linking between regulatory sequences (chromatin loop, transcription factories or hubs, facilitated tracking, linking/chaining), the mechanism that initiates the enhancer action (recruitment of transcription factors, cofactors, chromatin modifiers, RNAPII transfer, release of the transcriptional pause and eRNA action) or the consequence of the enhancer activity (increased initiation or elongation of transcription, or reduced transcriptional pausing). It has also been proposed that they can increase splicing, polyadenylation, transcription termination rate and RNAPII recycling, but the involvement of enhancers in post-transcriptional regulatory steps requires more study [11], [17], [18], [19], [20], [21], [22], [23]. In short, enhancers have the capacity to regulate by increasing the transcription of a gene, increasing the activity of a promoter or providing essential information that the promoter does not provide. With the development of genomics and next generation sequencing (NGS), studies have tried to identify genomic features in these sequences to study their function and their mechanisms of action, and to perform a massive screening of these sequences in the genome using these correlated physical properties [24], [25]. In this task, high-throughput genomics methods have allowed us to study in depth the chromatin properties of enhancers [26] and the transcripts that they can generate (eRNA) [27], [28], [29], [30], while chromatin conformation capture and high-resolution microscopy techniques have allowed the determination of distal chromosomal contacts between promoter and enhancer sequences (EPI), as well as the mechanisms of action [31], [32], [33], [34].
However, deeper exploration of these properties has also shown that there is no homogeneous profile of features in the enhancers [26], [35] and, therefore, the different methodologies provide a partial view of the regulatory landscape [36]. The same situation applies to the identification of EPI [37]. In addition, the different characteristics identified in the enhancers have also generated different classifications and terminologies that have spread through the literature, whose validity is controversial due to the lack of consensus on the definition of enhancers [12] and the absence of a controlled vocabulary supported by knowledge representation models. The study of regulatory elements has advanced quickly in recent years with the development and use of new techniques, and this progress has been paralleled by the clinical aim of detecting and studying diseases [25], [38], [39], [40], [41]. As a result, a large volume of data and knowledge about human enhancer sequences has been generated and stored in a growing number of biological databases. Given the increasing bioinformatics processing of these data, reviewing the content of the databases will show how well they cover the knowledge generated by the scientific community and what the main limitations are for the computational exploitation of enhancer data. Hence, we first describe a model for representing human enhancer sequences derived from the analysis of the literature. That model drives the analysis of the content of 25 biological databases, which focuses on three main aspects: types of enhancers, type of evidence about the enhancers, and methods for detecting enhancer-promoter interactions. Finally, we identify the main challenges in this area. This work aims to help describe the current landscape of human enhancer data and to contribute to the development of the Gene Regulation Knowledge Commons targeted by the GREEKC COST Action (https://greekc.org/).
The GREEKC focus is on the generation, curation and analysis of data and knowledge about gene regulation processes. The current article is focused on the data resources, but discussion on the need for enhancer data interoperability is also provided in this article.

Materials and methods

A model for representing enhancers

After a review of the literature, we propose a model to represent the fundamental information about enhancers (Fig. 2). In the following subsections we describe the main elements which should be covered and which will drive our analysis of bioinformatics data resources: types of enhancer, methodologies to generate evidence, enhancer-promoter interactions (EPI) and other annotations of interest. Furthermore, any enhancer must have, at least, the coordinates that allow mapping this sequence in the genome and the biosample. As we will see below, the type of enhancer may vary according to the biological sample, because enhancers are regulatory sequences specific to cell type, but also to environmental stimuli [3], [13].
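
As a rough illustration, the elements listed above could be captured in a small set of record types. The sketch below is only one possible reading of the model: every class and field name (`Enhancer`, `Evidence`, `TargetGene`, `assembly` and so on) is an assumption introduced for illustration and does not come from any database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    method: str     # e.g. "ChIP-seq" or "STARR-seq" (free text in this sketch)
    reference: str  # bibliographic reference that supports the annotation


@dataclass
class TargetGene:
    gene: str
    evidence: List[Evidence] = field(default_factory=list)  # support for the EPI


@dataclass
class Enhancer:
    # Minimum required by the model: genomic coordinates and biosample.
    chrom: str
    start: int
    end: int
    assembly: str   # assumed field: coordinates need a reference assembly
    biosample: str
    # Types may change with the biosample, so they are stored per record.
    types: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    targets: List[TargetGene] = field(default_factory=list)
    # Optional annotations of interest.
    diseases: List[str] = field(default_factory=list)
    tfbs: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.start < 0 or self.end <= self.start:
            raise ValueError("coordinates must satisfy 0 <= start < end")

    def length(self) -> int:
        return self.end - self.start
```

Storing one record per enhancer and biosample is a design choice made here so that the same genomic region can carry different types in different samples.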

Fig. 2

Proposed model for the representation of enhancers. Each enhancer is located in a region of the genome and belongs to one or more classifications of enhancers, which may differ according to the biological sample in question, because enhancers are cell type-specific sequences. The identification of the enhancer is supported by evidence derived from the methodology used and must have a bibliographic reference that allows the information to be verified. As regulatory sequences, enhancers regulate genes and this regulation is also supported by evidence. In addition, the enhancers can be enriched with information of interest such as their link to diseases or the TFBS that compose the sequence.

Types of enhancers

Enhancers do not have homogeneous characteristics, but have a variable profile [26]. Enhancers are dynamic and specific to species and tissue; even the TFBS that make up the enhancers can be specific (pleiotropic sequences) [13]. Therefore, annotating a sequence as an enhancer is not sufficient to describe its biology, and multiple typologies have emerged in the scientific literature to classify different patterns of features. Fig. 3 includes the types of enhancers which have been described in the literature and shows different ways of classifying and subtyping enhancers, so one enhancer can belong to different types at the same time. We describe those types next:

Fig. 3

Classification of the main types of enhancers found in the literature. The characteristics of enhancers do not have a homogeneous profile. For this reason, we find different classifications in the literature, which have been compiled in this figure. Each classification is based on different properties, so an enhancer can belong to several types at the same time.

Enhancers by distance: This classification is based on the distance of the enhancer to promoters or target genes. Primary and shadow enhancers are probably among the most frequently mentioned types of enhancer in the related literature [11], [6]. Initially, enhancers that were close to promoters were considered as primary or principal enhancers in the regulation of gene expression, while more remote enhancers that exhibited the same regulatory activity were considered as secondary or redundant [43]. However, studies have shown that the genomic characteristics of primary and shadow enhancers do not differ significantly [11]. The selective constraint on shadow enhancers can be as great as or even greater than that on other enhancers [44] and they can act simultaneously with other enhancers and over large distances [45]. Other studies have also shown that they can contribute to the accuracy and robustness of gene transcription against environmental and genetic variability, as well as stochastic perturbations, by reducing the level of transcription noise [22], [46], [44], [47]. Therefore, their contribution may be relevant, meaning that this classification based on the position of the sequence with respect to the target gene is not the most useful. In addition, because one enhancer can regulate more than one gene, an enhancer could be labeled as both a primary and a shadow enhancer depending on the gene. Proximal and distal enhancers are also part of this classification, and they differ in their distance to the target gene.
If the distance is larger than a certain value, then the enhancer is classified as distal; otherwise it is proximal. There is no consensus on the threshold distance. For example, the threshold in SCREEN ENCODE is 2 kb.

Enhancers by location: This classification is based on the location of the enhancers in the genome. Since enhancers are regulatory sequences, they have typically been searched for in intergenic regions (intergenic enhancers), which represent approximately 98–99% of the genome [48]. However, studies have shown that there are also intragenic enhancers, which can be located in intronic (intronic enhancers) and exonic sequences (exonic enhancers or eExons) and modulate gene expression, e.g. by acting as alternative promoters [49], [50]. Furthermore, there are intragenic enhancers which are capable of recruiting TF that increase the recruitment of RNAPII and other GTF to the promoter of the gene itself [51] or to other genes [52]. On the other hand, other studies showed that the presence of enhancers in intragenic sequences can also attenuate the expression of the gene itself, possibly through interference between the elongation of the gene transcript and the transcription of the enhancer [52]. Thus, transcriptionally active intragenic enhancers could have disparate functions and improve transcription of one or more distal genes while limiting transcription of the gene itself (double function, as enhancer and silencer).

Enhancers by clustering: Enhancers can be found close to each other and linked to the same regulatory process. For this reason, the idea was to cluster and merge these sequences and consider them as a single regulatory element. The term distributed enhancers has been coined to represent that multiple enhancers can regulate the same gene and to eliminate the controversy about the classification of enhancers as primary or secondary based on the distance to the target gene [11].
However, we now know that multiple enhancers can act on the same gene and that one enhancer can act on more than one gene [9], [53], [54], so a deeper investigation of the regulatory profile may result in a generalization of this distributed or collaborative property that eliminates this concept as a subtype of enhancer. In general, when one enhancer that explains a large part of an expression pattern is found, the search for more sequences is usually not carried out, so the number of genes with distributed enhancers could be high. Superenhancers (SE) are clusters of merged enhancers located close together in the genome, typically within 12.5 kb, that have unusually high levels of master TF, RNAPII, cofactors like the mediator and integrator complexes and other enhancer-associated features, such as H3K27ac and H3K4me1. They are also associated with high transcription values of eRNA and associated target genes, and with a higher frequency of chromatin interactions than individual enhancers [55], [56], [57], [58], [59], [60], [61]. In addition, the enhancers that compose the SE (constituent enhancers) are usually linked to the same regulatory process. To identify SE, the enhancers located within a distance of 12.5 kb are first joined and then those sequences that exceed the inflection point in the signal level plot [55] are selected. After this process, enhancers not classified as SE are called typical enhancers. SE are the most studied type of enhancer due to their relationship with important regulatory processes such as cell development and cell identity. This relevance derives from their ability to bind master transcription factors and their association with the expression of pluripotency genes and tissue-specific genes [55], [56], [57].
These characteristics and relevance also result in the association of these sequences with the development of diseases like cancer, and their consideration as therapeutic targets [62], [63], [64], [7], [65]. Other studies suggest further aspects: individual enhancers may have a role equivalent to SE [66]; SE sequences show a cooperative profile; not all elements have to act at the same time and under the same conditions; and SE have a hierarchical structure [66], [13], [67], [42]. Some studies have tried to dissect this hierarchy in SE to determine which sequences are essential and which are more dispensable [61]. In turn, this organization has also introduced new terms or subtypes: hierarchical and non-hierarchical SE. Non-hierarchical SE have enhancers with similar contact frequencies and, therefore, they are more homogeneous. Hierarchical SE are more heterogeneous and have some enhancers with a higher frequency of interactions (hub enhancers) than the rest of the constituent enhancers (non-hub enhancers). Hub enhancers share similar histone marks with the non-hub enhancers, but have more CTCF and cohesin binding sites. Therefore, it has been suggested that hub enhancers act as organizational centers within the SE that coordinate contacts with the rest of the non-hub enhancers and with other distal regulatory elements [61]. In addition, hub enhancers are more strongly associated with SE function and disease-related variants, and their manipulation or deletion has shown profound effects on gene activation and local chromatin state [61].

Enhancers by sequence length: This classification is based on the length of the sequence of the enhancer. The average length identified is 800 bp; stretch enhancers were defined as those longer than 3 kb [68], and the rest are standard enhancers. Stretch enhancers were also associated with genes with higher expression levels and with cell type-specific genes [69].
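
The two-step SE identification described above (joining enhancers within 12.5 kb, then selecting those beyond the inflection point of the ranked signal) and the 3 kb length criterion for stretch enhancers can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any specific tool: the `Region` type, the summed `signal` field and the way the inflection point is located (the point where a line of slope 1 touches the ranked curve once both axes are scaled to [0, 1]) are choices made here for the example.

```python
from dataclasses import dataclass

STITCH_DISTANCE = 12_500    # bp; the 12.5 kb window quoted in the text
STRETCH_MIN_LENGTH = 3_000  # bp; stretch enhancers are longer than 3 kb


@dataclass
class Region:
    chrom: str
    start: int
    end: int
    signal: float  # hypothetical aggregate signal for the region


def stitch(regions, max_gap=STITCH_DISTANCE):
    """Join regions on the same chromosome separated by at most max_gap bp."""
    stitched = []
    for r in sorted(regions, key=lambda r: (r.chrom, r.start)):
        last = stitched[-1] if stitched else None
        if last is not None and last.chrom == r.chrom and r.start - last.end <= max_gap:
            last.end = max(last.end, r.end)
            last.signal += r.signal
        else:
            stitched.append(Region(r.chrom, r.start, r.end, r.signal))
    return stitched


def split_super_typical(stitched):
    """Return (super, typical) using a slope-1 tangent on the ranked curve."""
    ranked = sorted(stitched, key=lambda r: r.signal)
    n = len(ranked)
    top = ranked[-1].signal if ranked else 0.0
    if n < 2 or top <= 0:
        return [], ranked
    # With both axes scaled to [0, 1], a line of slope 1 touches the curve
    # from below where (scaled signal - scaled rank) is smallest.
    cut = min(range(n), key=lambda i: ranked[i].signal / top - i / (n - 1))
    return ranked[cut + 1:], ranked[:cut + 1]


def is_stretch(region, min_length=STRETCH_MIN_LENGTH):
    """Length-based label: stretch if longer than min_length."""
    return region.end - region.start > min_length
```

Because the two labels rest on different properties (clustered signal versus length), the same stitched region can be both a superenhancer and a stretch enhancer.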
Enhancers by sequence overlap: This classification is based on the existence of overlap in the position and sequence of the enhancers. The classifications by sequence length and by clustering have identified SE and stretch enhancers. Around 85% of the SE overlap with stretch enhancers, SE being the set of stretch enhancers with higher activity and with higher enrichment values in the analyzed marks. The overlapping sequences of SE and stretch enhancers were designated as super-stretch enhancers [68]. Other enhancers overlap with known promoters and exhibit enhancer activity because they can interact with other promoters [70], [71] and have bidirectional transcription [72]. With the improvement in gene editing systems and high-throughput reporter assays, distal enhancer activity in promoter sequences was verified [73], [74], [75], [76], [77]. Epromoters is the term used for these sequences and some works consider that 2–3% of promoters are of this type [76]. We also find enhancers that overlap with Locus Control Regions (LCR), because LCR are structures composed of different regulatory modules that can include enhancers [78], [79]. Enhancers have also been mapped onto sequences annotated as silencers, which have the opposite biological function. This is because regulatory sequences sometimes have a dual function depending on cellular context [80], [81]. However, a special nomenclature is not used for these sequences. We represent these cases in Fig. 3 as LCR enhancer and silencer/enhancer.

Enhancers by chromatin profile: Different terminologies and classifications have been established according to the chromatin profile [26], [82], [83], [84], [85], [86]. This classification is one of the most important because the activity of enhancers is not usually measured directly; instead, it is inferred from the chromatin profile of the sequences.
As the activity of enhancers varies depending on the biological context, this classification is directly linked to the biological sample used. Previously, the community thought that when a cell had completed its terminal differentiation, the regulatory repertoire was established and maintained by lineage-specific TF. Thus, internal or external stimuli could not change the regulatory pool, but acted within it through cooperation with the master TF that were already bound to the sequences. However, cellular plasticity has shown that stimuli can create new functional properties through the activation of regulatory sequences that are not pre-established by the cell lineage, even to the point of defining a new cellular subtype [85]. In response to stimuli, some enhancers without characteristic histone marks and without bound TF (inactive enhancers) can recruit master TF that provide chromatin accessibility and allow the acquisition of a chromatin profile associated with enhancer activity, such as H3K4me1 and H3K27ac. These inactive enhancers without marks that can be activated after a stimulus are called latent enhancers [85]. After the loss of the stimulus, some chromatin marks may be lost, like acetylation and TF binding, as well as regulatory activity, but the sequences can retain marks like H3K4me1. Subsequently, when cells receive a stimulus again, the sequences can be re-stimulated with a faster and stronger response [85]. Sequences with H3K4me1 were initially considered as enhancers with little or no activity, but predisposed to acquire acetylation and transition to a more active state. These sequences were named primed/poised enhancers [85], [26]. However, studies have shown that the activity of sequences marked with H3K4me1 is not necessarily lower because they lack H3K27ac; they could contribute to expression in a similar or even superior way to the enhancers marked with acetylation, which are called active enhancers [87].
On the other hand, it was also observed that these predisposed enhancers can also present marks associated with silencing, like H3K27me3, H3K9me3 and PRC2 binding, but also P300 (associated with activity) [88], [89], [83], [90], [91]. These bivalent sequences have been found close to inactive genes that are important during development and can be activated during differentiation by loss of H3K27me3 and gain of H3K27ac. In addition, chromosome conformation capture assays have shown that these sequences can be detected interacting physically with their target genes through the PRC2 complex even before activation [90], [20], [92], [93]. The observed variability has led to different subdivisions of this set, but also to different nomenclatures due to the lack of consensus and of a controlled vocabulary. Some studies name the enhancers without marks inactive enhancers, the H3K4me1+ enhancers primed or intermediate enhancers, the subset of primed enhancers that also carries marks like H3K27me3 and PRC2 poised or bivalent enhancers, and the H3K4me1+ and H3K27ac+ enhancers active enhancers [26], [83], [94], [95]. Other classifications and nomenclatures are common in the chromatin state annotations of different studies and projects, like Roadmap Epigenomics, which have generated datasets widely used in other research and public sources [96], [82], [86], [97]. In these annotations we find categories such as:

Strong and weak enhancers [82], similar to a classification reduced to active and primed enhancers.
Genetic enhancers, enhancers and bivalent enhancers [96], equivalent to active, primed and poised enhancers.
Enhancer, permissive regulatory region and bivalent enhancer [95], also equivalent to active, primed and poised enhancers.
Active, poised, repressed and inactive enhancers, similar to active, primed, poised and inactive enhancers, respectively [98].

It is important to note that, although some labels are closer to a classification by activity, these labels correspond to a classification by chromatin profile, because they are assigned on the basis of chromatin properties. These properties are correlated with activity, but activity must be confirmed experimentally and it has also been shown that these marks are not strictly necessary to identify activity in enhancer sequences [35], [99], [87].

Enhancers by transcription: Transcription has been observed in some enhancers, mainly those considered to be active [29], [53], [100], [101]. Therefore, enhancers can also be classified into transcribed enhancers (T-Enh) and non-transcribed enhancers (NT-Enh). This transcription is initiated in the NFR of the enhancers and is mainly bidirectional, although eRNAs with structural heterogeneity and different combinations of properties have been observed: unidirectional, bidirectional, polyadenylated and non-polyadenylated [27], [29], [102], [103]. Within this heterogeneity, the most usual subdivision is between enhancers with unidirectional (1D-Enh) and bidirectional transcription (2D-Enh), although a recent single-cell CAGE sequencing study suggests that the directionality of transcription could be more complex than unidirectional or bidirectional in absolute terms [29], [102]. Different functions have been proposed for the transcripts [17], [29], [103], [104], [105], such as formation and stabilisation of the chromatin loop [106], release of the transcriptional pause [107] or increasing the occupancy of TF and coactivators in enhancer sequences [108]. The function of eRNA can even differ depending on the transcribed strand [109].
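
To make the nomenclature problem for the chromatin-profile labels concrete, the equivalences listed in this section can be written as a lookup table, together with a simple mark-based labelling rule. This is only a sketch: the source tags (`ref82`, `ref95`, `ref96`, `ref98`) are hypothetical keys standing for the cited annotations, the equivalences are the approximate ones stated in the text, and the order in which marks are tested in `label_from_marks` is an assumption.

```python
# Common vocabulary taken from the first nomenclature described in the text.
INACTIVE, PRIMED, POISED, ACTIVE = "inactive", "primed", "poised", "active"

# (source, label) -> harmonised label. The same word can map to different
# states depending on the source, which is why the source is part of the key.
HARMONISED = {
    ("ref82", "strong enhancer"): ACTIVE,
    ("ref82", "weak enhancer"): PRIMED,
    ("ref96", "genetic enhancer"): ACTIVE,
    ("ref96", "enhancer"): PRIMED,
    ("ref96", "bivalent enhancer"): POISED,
    ("ref95", "enhancer"): ACTIVE,
    ("ref95", "permissive regulatory region"): PRIMED,
    ("ref95", "bivalent enhancer"): POISED,
    ("ref98", "active enhancer"): ACTIVE,
    ("ref98", "poised enhancer"): PRIMED,
    ("ref98", "repressed enhancer"): POISED,
    ("ref98", "inactive enhancer"): INACTIVE,
}


def harmonise(source, label):
    """Map a source-specific label to the common vocabulary, or None."""
    return HARMONISED.get((source, label.strip().lower()))


def label_from_marks(marks):
    """Assign a chromatin-profile label from the marks seen in one biosample."""
    marks = set(marks)
    if not marks:
        return INACTIVE
    if "H3K4me1" not in marks:
        return None  # profile not covered by this simple scheme
    if "H3K27ac" in marks:
        return ACTIVE  # assumption: acetylation takes precedence
    if marks & {"H3K27me3", "PRC2"}:
        return POISED
    return PRIMED
```

Note that "poised" in one source corresponds to "primed" in the common vocabulary, which is the kind of ambiguity a controlled vocabulary would remove. These labels describe chromatin profile only and say nothing about experimentally confirmed activity.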

Methodologies to generate evidence

The variety of enhancer features also results in a variety of experimental approaches for their detection, which provide different levels of evidence. Comparative studies of enhancers obtained by different methodologies have shown that, currently, there is no preferred method for the detection of enhancers [36]. Each methodology provides a set of candidate sequences because they have genomic characteristics that make them potential enhancers [24]. Nevertheless, each of them provides only a partial view of the profile of enhancers due to technical limitations, and because most techniques identify sequences indirectly, through characteristics correlated with enhancer identity and activity. For this reason, the inclusion of supporting methodologies is essential in an enhancer representation model, because they provide the level of evidence for candidate sequences that need to be verified. Fig. 4 shows the two groups of evidence generation methodologies identified in the literature according to the identification strategy used, namely, based on chromatin characteristics and based on reporter assays. Next, we describe both types.

Fig. 4

Enhancers can be identified through different methodologies, which follow a certain strategy or approach that provides the level of evidence for the sequence. We can distinguish two main types of evidence. Based on chromatin features: they rely on sequence features, that is, correlated properties that are not a direct measure of enhancer activity. Reporter-based: these measure enhancer activity directly, but the interpretation of the results can be complex.

Evidence based on chromatin characteristics: The objective is to identify sequences according to chromatin properties correlated with enhancers. The development of high-throughput genomic methods has been fundamental to capture these properties, while bioinformatics tools have allowed us to analyze the data, to search for patterns in them and to generate models that classify sequences according to chromatin properties [110], [111], [112], [86], [96]. Therefore, these properties are the evidence supporting the enhancers and we distinguish different approaches or levels of evidence, which we cover below.

Detection of sequence conservation involves finding conserved sequences between species and over time. It was one of the first approaches used for the identification of enhancers and was followed in the early repositories. The detection of conserved sequences has been successful in the discovery of enhancers involved in biological processes of high importance in most organisms, like sequences active during early development, where enhancers have represented almost 50% of the highly conserved sequences analyzed [113]. However, this approach presents problems in the detection of sequences with species specificity, where evolutionary conservation is lower [3]. In addition, studies show that enhancers have different levels of evolutionary conservation when they are obtained by different identification methods.
All of them show a higher overlap with conserved elements than expected at random, but the enhancers detected by eRNA transcription were the most conserved, while the enhancers obtained by histone post-translational modification (PTM) marks were the least conserved [36].

The identification of TFBS is an approach based on the fact that the activity of many enhancers depends on their ability to bind TF and coactivators that, by different possible mechanisms, increase the transcription of target genes [17], [114]. For this purpose, ChIP-seq has been used to capture and sequence those DNA fragments bound to a specific TF, p300 and the mediator complex being the most commonly used [115], [58]. However, this method may have low specificity because, in addition to non-specific binding, it can capture those sequences that have a complementary sequence and chromatin accessibility. This is also an antibody-dependent method with its associated problems [23]. Therefore, it is not typically used in isolation for the identification of enhancers; rather, it is used to generate chromatin profiles that integrate different chromatin properties in order to classify and predict sequences.

The identification of accessible chromatin (DHS) is another approach used in the identification of enhancers. If many enhancers base their activity on their ability to bind TF and on the mechanisms of action of the eRNAs [17], these sequences must be accessible, at least during their period of activity. This approach has low specificity, because multiple non-enhancer sequences share this property [116], [117]. Therefore, as for TFBS, the identification of DHS is generally used for chromatin profiling.

The detection of PTM in histones, such as methylations and acetylations, has been widely used both as an individual approach and for building chromatin profiles [26]. For this, chromatin immunoprecipitation followed by microarrays or sequencing is the most popular technique [118].
However, no methylation or acetylation mark is strictly necessary or fully reliable for unambiguously identifying enhancer sequences [35], [99], [87]. In the beginning, H3K4me3 was associated with promoter sequences and H3K4me1 with enhancer sequences, while dimethylations were observed in both types of sequences without a clear distinction [119]. H3K4me1 and H3K4me3 are not mutually exclusive marks in a genomic region, so the H3K4me1/H3K4me3 signal ratio has also been used [120], [121]. However, this criterion was affected by the detection of enhancers with high H3K4me3 values [120], [122] as well as enhancers without H3K4me1 [123], [124], [125]. On the other hand, H3K4me1 is not a mark capable of discerning enhancer activity. Acetylation analysis associated the H3K27ac mark with the activity of the enhancers, so this mark has been widely used in the identification of active enhancers [88], [89]. However, other studies have shown that H3K27ac is not strictly necessary for enhancer activity either [126], [87]. For these reasons, histone PTM are also used with DHS and TF binding data for the development of computational models that annotate chromatin following chromatin profiles.

Detection of eRNA. Since transcription has been correlated with sequence activity, eRNA detection has been used to identify active enhancers [53], [127], [128], [129]. The methods employed for the sequencing of these RNAs are varied. There are techniques that allow the detection of RNA already produced, either the full length of the sequence (e.g., flcDNA-seq) or the first nucleotides (e.g., CAGE and TSS-seq). Other techniques measure the transcription rate (e.g., GRO-seq, PRO-seq or Start-seq). These nascent RNA sequencing techniques allow transcript levels to be measured as they are produced, so they have the advantage of quantifying RNA sequences which are not very stable and are rapidly degraded. This is the case of eRNAs and PROMPT sequences of promoters [99].

Computational genome annotation. The development of algorithms able to work efficiently with large volumes of data has also made it possible to work with multiple lines of experimental evidence rather than individual chromatin properties. These models or algorithms approach the problem of enhancer identification from a computational point of view and their goal is to determine whether a sequence can function as an enhancer according to a set of multiple types of data that provide a description of the sequence [130], [131]. Therefore, the first step in these algorithms is the integration of different types of data that provide information about the sequences. Subsequently, these data are preprocessed (normalisation and scaling) to generate a feature vector that serves as input for an enhancer identification and analysis system, which is responsible for annotating the DNA regions based on these feature vectors. The computational models used have been developed following different computational strategies and we find supervised and unsupervised methods. Some tools developed include clustering algorithms, like K-means or bi-clustering; others use regression models, like least absolute shrinkage and selection operator (LASSO); probabilistic graphical models (PGMs), like Dynamic Bayesian Networks (DBNs) and Hidden Markov Models (HMM); or classification systems like artificial neural networks (ANNs), support vector machines (SVMs), random forests (RFs) and decision trees (DTs) [130].

Evidence based on reporter assays: The objective is to identify whether a sequence can increase the expression of a reporter under the control of a minimal promoter. With the development of high-throughput methods like MPRA and STARR-seq, this approach can now be used for massive sequence identification [132], [133], [134], [24], [135].
According to the reporter method used, the enhancer can drive the expression of: a given sequence, such as a barcode used as a reference; its own expression, through eRNA measurement; or the expression of an alternative reporter gene, like a fluorescent reporter. The vector used also varies the methodology. Assays based on plasmids. This is a simple approach with higher throughput, but is unable to replicate the complexity of gene regulation in chromosomes. Examples are episomal reporter assays, STARR-seq and MPRA. Assays based on integration. The integration into the genome is a complicated process and can be done randomly or in a guided manner. If random integration is chosen, the genomic context can be lost, while in guided integration the context can be maintained, but at the cost of low efficiency due to the limitation of the number of sequences that we can analyse in parallel. On the other hand, in vivo systems are more reliable than in vitro, although we should not ignore the technical limitations and problems that may arise from the genomic context, cell and organism specificity when we use model organisms [132], [133]. Gene editing with CRISPR-Cas9 technology and the use of guide RNAs (sgRNA) have also facilitated the endogenous manipulation of enhancers [136], [66] and can help to screen sequences even at the TFBS level. The main advantage of this method is the possibility to work in vivo at a high scale, in a targeted manner and to maintain the local chromatin context by using targeted sgRNA libraries [137], [138]. It also allows other molecules to be incorporated into the Cas9 nuclease, so many technical variants have been developed to activate and silence sequences [139], [74], [140]. However, the main problem is to be able to evaluate the causality and impact of the alterations of the enhancers on expression, as well as the scaling of the technique.

Enhancer-promoter interactions

The public sources often include information that enriches the knowledge about enhancers. Within this information, enhancer-promoter relationships are among the most important because they inform about the potential regulatory role of the enhancer, which may regulate the transcription of protein-coding sequences or of non-coding sequences, such as lncRNA or miRNA [141], [142], [143], [144]. The distance between interacting enhancers and promoters can be very large, reaching one megabase or more, but they are usually closer [10], [71]. The majority of enhancer-promoter interactions (EPI) span less than 200 kb and the most numerous are usually around 20–50 kb [71], [20]. Chromatin conformation capture methods (3C and derived methodologies) and high-throughput microscopy techniques, such as fluorescence in situ hybridization (FISH) experiments, are experimental tools used to study the three-dimensional structure of chromatin and to determine the contacts between sequences [145], [31], [146], [147], [32], [33], [34]. Computational methodologies are also used to predict EPI [141], [145]. Therefore, similar to enhancer identification, the strategies used in the different sources for EPI identification are varied (Fig. 5) and, despite advances in new technologies, comparison of different methods against a curated reference set has found that they still need to be improved [37]. The computational association methods can be classified into two groups: unsupervised machine learning and supervised machine learning [145].

Fig. 5

Similar to the identification of enhancers, the determination of EPI can follow different strategies, which provide the level of evidence for the regulatory relationship. Two main groups are also distinguished. Experimental methods determine the relationship directly. Computational methods make predictions and can follow two main approaches. Supervised methods generate a model from a training set, while unsupervised methods lack this set.

Unsupervised methods: In unsupervised models, machine learning is based on a model that is adjusted according to the observations, so there is no a priori knowledge. Within this type we highlight the methods based on distance, correlations and scores. Distance-based methods link enhancers to targets according to one or more distance functions. Linkage to the closest TSS has been the most widely used strategy, and estimates indicate that approximately 40% of EPIs are established with the gene linked to this TSS [53]. The association with the closest gene could happen for different reasons, such as a low specificity of the enhancer for its target sequence [20]. Other widely used distance-based strategies include: genes overlapping with the enhancer, proximal genes or genes within a distance window, and flanking genes or closest genes on both sides of the sequence. Distance-based models can be useful for generating an initial list of possible target genes, but they can be inefficient because they do not consider other aspects such as long-distance interactions, cooperativity between sequences or specificity [13], [45], [9], [148]. Correlation-based methods evaluate the correlation of properties between pairs of sequences, such as correlation between eRNA transcription and gene transcription [53], or correlation between eQTL values and active chromatin marks [82]. 
A significant advantage of these methods is that they can identify multiple targets for one enhancer and measure quantitatively the strength of the association [141]. In contrast, a major problem is the need for a large number of samples of sufficient quality for comparisons, because correlation methods assume that enhancer activity changes between conditions and between cells [141]. They are also very sensitive to outliers and can therefore generate a large number of false positive predictions, although some algorithms have been developed to deal with this outlier problem [149]. Score-based methods integrate data of different types: each feature is associated with a quantitative value, and these values are combined into a total score that establishes an association ranking for enhancer-gene interactions [141]. These methods have also been defined as decomposition-based methods in other sources [145]. An advantage of these methods is that all possible interactions between sequences can be quantified, the level of significance can be adjusted and different priorities can be assigned to each of the features. However, the need to assign weights to the features is also one of the main problems, because the choice can be arbitrary. Supervised methods: In supervised learning models, we start with a set of labelled data, i.e. true positive and true negative EPI, and use it as a training set to find patterns and create models, which can be classifiers or regression models. Classifiers use the patterns found in genomic features to create a model that generates labels that are then applied to new datasets. As with correlation and scoring methods, the number and type of features used for the predictor can be highly variable, and the potential of these models will depend on the data used for training and the quality of the variables [145], [141]. 
The main advantage of these systems is that they can find hidden patterns that are difficult to see in large amounts of data, while the main problem is that they are very dependent on the dataset used for training. Regression-based methods differ from classifiers primarily in their ability to quantify regulatory potential. These methods systematically evaluate the quantitative contributions of the enhancers that can regulate a gene within a genomic window by exploiting a large number of genomic features that are assigned contribution weights. These methods work with the logic that multiple enhancers can regulate one gene. Therefore, a combinatorial approach is applied for sequence pairing [141]. This has the advantage of capturing the cooperative effect of the enhancers under consideration. However, they share the disadvantage typical of supervised machine learning methods in terms of training data. In addition, they require the definition of a genomic window, which is set by a distance criterion, and of a maximum number of enhancers to be considered around a TSS.
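As an illustration of the correlation-based methods described in this section, the following minimal Python sketch links an enhancer to every gene whose TSS lies within a distance window and whose expression correlates with the enhancer activity across samples. The data layout, the function names, the 500 kb window and the 0.8 correlation cut-off are assumptions chosen for the example, not values prescribed by any of the cited methods, and all coordinates are assumed to lie on the same chromosome.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def link_by_correlation(enhancers, genes, window=500_000, min_r=0.8):
    """Return (enhancer, gene, r) for pairs inside the window with r >= min_r.

    enhancers: {id: (midpoint, activity per sample)}  (hypothetical layout)
    genes:     {id: (tss, expression per sample)}     (hypothetical layout)
    """
    links = []
    for enh_id, (enh_mid, enh_activity) in enhancers.items():
        for gene_id, (tss, expression) in genes.items():
            if abs(enh_mid - tss) > window:
                continue  # outside the distance window
            r = pearson(enh_activity, expression)
            if r >= min_r:
                links.append((enh_id, gene_id, round(r, 3)))
    return links


# Toy data: four samples, one enhancer, three candidate genes.
enhancers = {"e1": (100_000, [1, 2, 3, 4])}
genes = {
    "geneA": (300_000, [2, 4, 6, 8]),   # in window, perfectly correlated
    "geneB": (350_000, [5, 1, 4, 2]),   # in window, negatively correlated
    "geneC": (900_000, [1, 2, 3, 4]),   # correlated, but outside the window
}
links = link_by_correlation(enhancers, genes)
```

With only four samples, one outlying value can change the coefficient substantially, which illustrates why these methods need many good-quality samples.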

Other annotations of interest

In the annotation of enhancers, besides the sequence coordinates, we have emphasized the importance of the different types, which in the case of a chromatin profile classification depend on the biological sample. That is because the activity of the enhancer sequences can be cell-specific, but also stimulus-dependent [13], [9], [3], [150]. Therefore, the chromatin profile of the sequences also varies between biological samples and, consequently, so does the type. All cells in an organism have the same genetic information and therefore all cells have the same enhancers. Thus, simply identifying enhancers in the genome tells us nothing about possible enhancer activity and cell-to-cell variability. For this reason, the annotation of the type of enhancer and the biological source is essential for a model of enhancer representation that aims to study gene regulation. We have also pointed out the importance of annotating the methods that support the sequences, the possible target genes, because they are the target of regulation, and the corresponding EPI prediction methods, because they support the prediction. However, the representation model can be enriched with other annotations of interest that increase its value. The activity of enhancers derives mainly from their ability to bind TFs and to generate functional eRNAs [17]. Alteration of enhancer activity can also contribute to the disruption of regulatory networks and the development of diseases [4]. Therefore, the value of the model increases when it includes information about TFBS, eRNA transcription and mutations, links enhancers to biological networks and diseases, and annotates chromatin profiles, mechanisms of action and other useful information.

Databases

Databases store the information generated by the scientific community about enhancers. The volume of sequences obtained by the different research efforts varies according to the identification method used, because most methods use different correlated features, or combinations of correlated features, to identify enhancer sequences in the genome. The consequence is that the results can differ in several orders of magnitude [36] and this variability is also reflected in the repositories, because each source identifies and stores sequences according to different criteria and data inputs. Thus, while the RefSeq reference genome GRCh38.p13 (release 109.20211119) contains around 5,000 enhancers, FANTOM5 contains around 50,000 sequences [53] and SCREEN ENCODE around 1 million [151]. For our study, we have selected 25 publicly available and accessible repositories specialized in identifying and annotating human enhancer sequences and which annotate, at least, the coordinates of the sequences (see Table 1). The data was collected in February 2022.

Table 1

Brief description of the databases included in this study.

Repository	Focus	Short Description
CancerEnD	Diseases	Set of enhancers for TCGA cancer types
dbInDel	Mutations	Enhancer-associated insertion and deletion variants
dbSUPER	SE general annotation	Super-enhancers archive
ENdb	Diseases	A manually curated database of experimentally supported enhancers for human and mouse
EnDisease 2.0	Diseases	A manually curated database for enhancer-disease associations
EnhancerAtlas 2.0	General annotation	General annotation of enhancers in different human biosamples and other species
EnhancerDB	General annotation	General annotation of enhancers in different human biosamples
EnhFFL	Feed-forward loops (FFL) with enhancers	A database of enhancer mediated feed-forward loops for human and mouse
Ensembl Regulatory Build v105	General annotation	Set of regions of the genome that probably are involved in gene regulation
ETph	Pig-human homology	General enhancers and their targets in pig and human
FANTOM5	Transcribed enhancers	Transcription-capable enhancers
FOCS	EPI	Method for inferring an extended enhancer-promoter and predicted set
GeneHancer 4.8 (UCSC)	General annotation	Integration of enhancer sequences to generate a consensus set
HACER	Transcribed enhancers	Transcription-capable enhancers
HEDD	Diseases	Human enhancers with a focus on their links to diseases
HeRA	Transcribed enhancers	Transcription-capable enhancers
RAEdb	Enhancers identified by reporter assays	Enhancers identified by high-throughput reporter assays
SCREEN V3	General annotation	Set of regions of the genome that probably are involved in gene regulation
Roadmap epigenomics	General annotation	Genome annotation in states
SEA 3.0	SE general annotation	Super-enhancers archive
SEanalysis	Biological networks with SE	Super-enhancers associated with regulatory networks
SEdb 1.03	SE general annotation	Super-enhancers archive
RefSeq GRCh38.p13	General annotation	Annotation of functional elements in the reference genome
TiED	General annotation	Identification and annotation of active and transcribed enhancers in 10 tissues
VISTA Enhancer	General annotation	Validated enhancers with transgenic mice


Results

In this section we describe how the selected databases cover the information about enhancers included in our model.

Types of enhancers

Table 2 shows that the majority of the databases do not cover the types of enhancers, but annotate the sequences as general enhancers (see bar plot in Fig. 6A). Therefore, both the type of enhancer and its possible activity profile have to be inferred mainly from the methodology used for sequence identification, through an analysis or by other means. Due to their relevance, the most covered enhancer types in the repositories are SE and transcribed enhancers, although the constituent enhancers are not always included. dbSUPER [152], ENdb [153], SEA [154], SEdb [155], SEanalysis [156] and EnhFFL [157] are repositories which contain SE. RAEdb [158] is the only source that covers epromoters, while SCREEN [151] distinguishes between proximal and distal enhancers according to their distance to the nearest TSS (2 kb limit). On the other hand, two sources mainly classify enhancers according to chromatin profile: Ensembl [98] and Roadmap [96]. The first distinguishes between Active, Poised, Repressed, Inactive and NA. The second distinguishes between Genic enhancers, Enhancers and Bivalent enhancers.

Table 2

Type of enhancers hosted by each database. The types of enhancers not included in this table are not covered by any database included in this study.

Enhancer types according to model	Repositories
Enhancers (without classification)	CancerEnD, dbInDel, ENdb, EnDisease 2.0, EnhancerAtlas 2.0, EnhancerDB, ETph, FOCS, GeneHancer 4.8, HEDD, RAEdb, RefSeq GRCh38.p13, VISTA Enhancer
Super-enhancers	dbSUPER, ENdb, EnhFFL, SEA 3.0, SEanalysis, SEdb
Typical enhancers	EnhFFL, SEA 3.0, SEanalysis, SEdb
Constituent enhancers	dbSUPER, SEanalysis, SEdb
Epromoters	RAEdb
Proximal enhancers	SCREEN V3
Distal enhancers	SCREEN V3
Active enhancers	Ensembl Regulatory Build v105, Roadmap, TiED
Primed enhancers	Ensembl Regulatory Build v105, Roadmap
Poised enhancers	Ensembl Regulatory Build v105, Roadmap
Inactive enhancers	Ensembl Regulatory Build v105
Transcribed enhancers	FANTOM5, HACER, HeRA, TiED

Fig. 6

Coverage of the different items in the 25 biological databases analyzed with information about human enhancers.

Furthermore, while Ensembl first annotates the consensus enhancer sequence in the genome and then profiles the type of enhancer according to the biological sample, the other repositories usually annotate the sequences by biological sample, without defining a reference sequence. Therefore, the number of enhancers in databases is usually very high, because the enhancers annotated in one biological sample may coincide with those of another biological sample, or overlap them and differ in sequence boundaries. This number is reduced when we keep only sequences with unique coordinates, and could be further reduced if we derived consensus sequences from overlapping sequences that differ at the boundaries.
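The two reductions mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: enhancer calls are given as (start, end) pairs on a single chromosome, the function names are hypothetical, and the consensus is taken simply as the union of overlapping calls, which is only one of several possible definitions of a consensus sequence.

```python
def unique_intervals(calls):
    """Keep one copy of calls that share exactly the same coordinates."""
    return sorted(set(calls))


def merge_overlapping(calls):
    """Collapse overlapping calls into intervals spanning their union.

    Calls that merely touch (start equal to the previous end) are also merged.
    """
    merged = []
    for start, end in sorted(calls):
        if merged and start <= merged[-1][1]:
            # Overlaps the last consensus interval: extend its right boundary.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy enhancer calls pooled from several biosamples (invented coordinates).
calls = [(100, 200), (100, 200), (150, 260), (400, 500), (480, 520)]

unique = unique_intervals(calls)      # exact duplicates removed
consensus = merge_overlapping(calls)  # overlapping calls collapsed
```

In this toy input, five per-biosample calls reduce to four unique coordinate pairs and then to two consensus intervals.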

Methodologies to generate evidence

The first aspect studied was the origin of the data in the resources (see supplementary material) and whether the resources that include data from different databases integrate the data or preserve the original sequences. First, we found repositories that store data from their own study, such as FANTOM5, and repositories that integrate data from different sources, such as ENdb. In turn, these integrative repositories can enrich the information with their own contributions or include new sequences. Regarding the sequences, we can find examples of both situations. On the one hand, we find repositories that compile enhancers from different sources to generate a new set, such as GeneHancer [159], EnhancerAtlas [160] or HEDD [161]. On the other hand, there are repositories that preserve the original sequences, such as dbSUPER [152], EnDisease [162] and ENdb [153]. In addition, in both cases, the sequences do not usually provide information or cross-references to the original source records, which makes it difficult to compare and integrate the information and to trace the evolution of the data. Table 3 summarizes the results obtained for each identification method, which are described next. In quantitative terms, repositories that integrate data and generate new sequences are more frequent, and the most common evidence comes from strategies based on chromatin properties. Fig. 6B shows that 14 of the 25 databases analyzed have enhancers with eRNA transcription evidence. This is because FANTOM5 is widely used by other repositories as a source (13 repositories use the FANTOM5 sequences, see supplementary material). On the other hand, by volume of data, strategies based on PTM and computational annotation provide the highest number of sequences (see supplementary material).

Table 3

Databases classified by the experimental evidence supporting the sequences that they contain.

Repository	Seq conservation	TFBS	NFR/DHS	PTM	eRNA	Computational annot.	Reporter assays
CancerEnD					X
dbInDel				X
dbSUPER		X		X
ENdb	X	X	X	X	X		X
EnDisease 2.0
EnhancerAtlas 2.0		X	X	X	X		X
EnhancerDB	X		X	X	X		X
EnhFFL		X	X	X
Ensembl Regulatory Build v105	X				X	X	X
ETph					X	X
FANTOM5					X
FOCS			X	X
GeneHancer 4.8 (UCSC)	X				X	X	X
HACER					X
HEDD					X	X
HeRA					X	X
RAEdb							X
Roadmap epigenomics						X
SCREEN V3						X
SEA 3.0		X		X
SEanalysis				X	X	X
SEdb 1.05				X	X	X
RefSeq GRCh38.p13	X						X
TiED			X	X	X
VISTA Enhancer	X						X


Evidence based on chromatin characteristics

Detection of sequence conservation: This is the case of VISTA Enhancer, which selects candidate sequences by sequence conservation and subsequently validates them by reporter gene assays in mouse embryos [163]. Sequence conservation is also used by GeneHancer for enhancer confidence scoring [159]. Identification of sequences that bind TF: This is the methodology used by sources like dbSUPER [152], which contains SE identified by Med1 and BRD4, and for sequences included in EnhancerAtlas [160]. It is also used to generate genome annotations in repositories such as SCREEN ENCODE [151], Ensembl [98] or Roadmap [96] and, therefore, by the sources that use these sequences (see supplementary material). In addition, many repositories enrich their enhancers with information about TFBS obtained from ChIP-seq experiments or by computational prediction, because this is relevant information in the study of enhancers and regulatory networks. However, this enrichment information is not always available for download (see supplementary material). Identification of DHS: DHS are used mainly within chromatin profiling to identify enhancers. They are also a useful type of information to enrich sequence information, as is done in the EnDisease source [162]. Detection of PTM in histones: dbSUPER is an example of a source that used PTM for the identification of SE, specifically the H3K27ac signal [152]. This mark is also used in other sources such as SEdb [155], SEanalysis [156], SEA [154], dbInDel [164] and EnhFFL [157]. On the other hand, EnhancerDB identified enhancers using high levels of H3K27ac and H3K4me1 and low levels of H3K4me3 [165]. Furthermore, histone PTM are also used with DHS and TF binding data for the development of computational models that annotate chromatin following chromatin profiles. 
Detection of eRNA: The CAGE technique was the methodology used, for example, by the FANTOM5 consortium [53], so it is the technique that provides the level of evidence for these sequences. In addition, the dataset obtained by FANTOM5 has been widely used by other sources, both for sequence integration purposes and to enrich sequences with transcription data. This is the case of repositories like CancerEnD [166], HeRA [167], FOCS [168], Ensembl Regulatory Build [98], EnhancerDB [165], HACER [169], SEdb [155], SEanalysis [156], EnhancerAtlas [160], GeneHancer [159] and TiED [101]. On the other hand, GRO-seq and PRO-seq technologies were used for the identification of enhancers in the HACER source [169]. Computational genome annotation: ENCODE, Roadmap and Ensembl are examples of repositories that follow this approach of computational genome annotation through integration of different experimental evidence, mainly PTM, DHS and TF binding [151], [98], [96].

Reporter-based methods

Reporter-based assays are not widely represented in the enhancer databases. We summarize the results for this type next. Assays based on plasmids: The ENdb source collects enhancers from the literature, some of which have reporter gene assays of this type as evidence [153]. Assays based on integration: After the identification of candidate enhancers by evolutionary conservation, the VISTA Enhancer repository validated the sequences by reporter gene assays in transgenic mice. STARR-seq and MPRA high-throughput methodologies: These were used in the identification of RAEdb sequences [158], so EnhancerAtlas also contains information of this type because it integrates enhancers from this database [160]. The ENdb source collects enhancers from the literature, some of which have reporter gene assays as evidence [153].

Enhancer-promoter interactions

We have distinguished between experimental methods and computational predictions. Similar to enhancer identification, the strategies used in the different databases for EPI identification are varied (Table 4). It is important to note that repositories usually include both coding and non-coding sequences, such as miRNA or lncRNA, as target genes. HACER [169], GeneHancer [159] and ENdb [153] are sources of enhancers that incorporate data from 3C experiments and derivatives to annotate and predict potential target genes. Only ENdb contains EPI with evidence based on reporter assays and gene editing, which are the most commonly used experimental techniques to verify enhancer-gene regulation [24], [141], [145].

Table 4

Experimental approach used in the identification of EPI, which constitute the evidence of the regulatory relationship between sequences.

Repository	Experimental evidence	Distance	Correlation	Score	Supervised method
CancerEnD		X
dbInDel		X
dbSUPER		X
ENdb	X
EnhancerAtlas 2.0					X
EnhancerDB		X
EnhFFL		X
ETph		X
FANTOM5			X
FOCS					X
GeneHancer 4.8 (UCSC)				X
HACER	X	X	X
HEDD			X
HeRA			X
SEA 3.0		X
SEanalysis		X	X
SEdb 1.03		X	X
TiED		X
VISTA Enhancer		X

Regarding computational methods, distance-based methods are still the most used by biological databases (12/25), followed by correlation-based methods (6/25) (see Fig. 6C and Table 4). We can also note that not all databases include EPI (see supplementary material).

Unsupervised methods

Distance-based methods: Some sources annotate the closest genes, like SEA [170], VISTA Enhancer [163] or dbInDel [164]. Others use a window that varies in size according to the source. In EnhancerDB this window is 100 kb [165], while in dbSUPER it is 50 kb [152]. There are also sources that use a combination of distances. SEdb [155] and SEanalysis [156] include the strategies of nearest active gene, genes overlapping with the enhancer and proximal genes, together with results obtained by the Lasso [171] and PreSTIGE [172] algorithms. In the case of HACER [169], the closest gene and genes within a distance of 50 kb are included. Correlation-based methods: FANTOM5 used correlation between eRNA and gene transcription to link enhancers and genes within a 500 kb window [53]. This set of associations established by FANTOM5 has been used to enrich other repositories, such as HEDD [161] and HACER [169], but also as part of other models. This is the case of the scoring system used by GeneHancer [159]. On the other hand, Roadmap [96] used a method based on correlation between eQTL values and active chromatin marks. Score-based methods: The GeneHancer repository is one example of a source that uses this system (based on eQTLs, CHi-C, eRNA co-expression, TF co-expression and distance) [159].
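The two distance-based strategies most often found in the repositories, closest gene and genes within a window, can be sketched as follows. This is a minimal illustration rather than the implementation of any of the databases: the function names, the toy coordinates and the use of the enhancer midpoint are assumptions, and all positions are assumed to lie on the same chromosome. The 50 kb and 100 kb windows echo the values reported above for dbSUPER and EnhancerDB.

```python
def nearest_gene(enhancer_mid, tss_by_gene):
    """Return the gene whose TSS is closest to the enhancer midpoint."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_mid))


def genes_in_window(enhancer_mid, tss_by_gene, window):
    """Return all genes whose TSS lies within `window` bp of the enhancer."""
    return sorted(
        gene for gene, tss in tss_by_gene.items()
        if abs(tss - enhancer_mid) <= window
    )


# Toy annotation (invented coordinates, single chromosome).
tss_by_gene = {"geneA": 10_000, "geneB": 70_000, "geneC": 160_000}
enhancer_mid = 100_000

closest = nearest_gene(enhancer_mid, tss_by_gene)               # 30 kb away
within_50kb = genes_in_window(enhancer_mid, tss_by_gene, 50_000)
within_100kb = genes_in_window(enhancer_mid, tss_by_gene, 100_000)
```

The same enhancer is linked to one gene with a 50 kb window and to three genes with a 100 kb window, which shows how strongly the window size chosen by each source shapes the resulting set of EPI.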

Supervised methods

Not many databases register EPI detected by applying supervised methods. The EAGLE algorithm, a classifier, was used by the EnhancerAtlas database [160], while FOCS is an example of a regression-based method that uses ordinary least squares to predict promoter activity as a function of k nearby enhancers within a window of 500 kb [168].
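To make the regression idea concrete, the sketch below selects the k enhancers closest to a promoter within a 500 kb window and fits an ordinary least squares model of promoter activity on their activities across samples. It is a simplified illustration and not the FOCS implementation: the function names, the toy data, the choice k = 2 and the solution of the normal equations by Gauss-Jordan elimination are all assumptions made for the example, and any model validation a real tool needs is omitted.

```python
def closest_enhancers(tss, enhancer_mids, k=2, window=500_000):
    """Ids of up to k enhancers nearest to the TSS within `window` bp."""
    nearby = sorted(
        (abs(mid - tss), enh_id)
        for enh_id, mid in enhancer_mids.items()
        if abs(mid - tss) <= window
    )
    return [enh_id for _, enh_id in nearby[:k]]


def ols_fit(X, y):
    """Least squares coefficients [intercept, b1, ..., bk] from (X'X)b = X'y."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("collinear predictors: X'X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Toy data (invented): enhancer midpoints and activities in five samples.
enhancer_mids = {"e1": 120_000, "e2": 480_000, "e3": 900_000}
activity = {
    "e1": [1, 2, 3, 4, 5],
    "e2": [2, 1, 4, 3, 6],
    "e3": [0, 0, 1, 0, 0],
}
# Promoter activity built as 1 + 2*e1 + 0.5*e2, so the fit is exact.
promoter = [4.0, 5.5, 9.0, 10.5, 14.0]

selected = closest_enhancers(200_000, enhancer_mids)
X = list(zip(*(activity[e] for e in selected)))
coefficients = ols_fit(X, promoter)
```

The fitted coefficients quantify the contribution of each selected enhancer to promoter activity, which is the feature that distinguishes regression-based methods from classifiers.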

Other annotations of interest

Our supplementary data file lists the main annotations included in each of the public sources analyzed with information about human enhancer sequences. Most of these annotations are available in the web version of the databases, but not for download. Fig. 6D shows that the different annotations are covered by fewer than half of the databases analyzed.

Discussion

Enhancers are distal regulatory sequences that have been shown to be able to modulate gene expression, even over large distances, and to be fundamental in important regulatory processes such as development and cell identity, but also in pathologies that have been termed enhanceropathies [4]. Enhancers do not have a homogeneous profile; there is great diversity even between different tissues due to cell specificity [13]. For this reason, there are also different methodologies for the identification of enhancers, each of which provides a partial view of the regulatory landscape. Identifying the relationship between genes and enhancers is not a simple task and different approaches have been used. All this variability of information has been transferred to the different databases. This study has included 25 publicly available databases. There are more resources about enhancers in the literature, but they were excluded because of unavailability (e.g., DENdb, DiseaseEnhancer and SELER), not containing human data (e.g., Animal-eRNAdb and Zenbase) or containing non-specific sequences (e.g., PReMod and UCNE). The current situation is that there is no central repository, each database has a different model and there is no cross-referencing between these databases. This makes the collection of information about enhancer sequences difficult and justifies the need for the analysis carried out in this work, which has been based on a model for enhancer sequences extracted from the literature. Next, we describe the major findings, gaps and challenges that can be drawn from our work.

Findings

These are the major findings drawn from our research. None of the existing databases can be considered the main entry point when searching for information about enhancers, since no database includes every type of information. The choice of database(s) will depend on the requirements and goals of each study. Given that the databases do not share a unified model, they are not interoperable, which makes it difficult to combine the information from the different resources. The classification of enhancers is poorly covered in the databases. The classification into SE and typical enhancers is the most popular in databases, but the majority of repositories annotate the sequences in a general way by biosamples. SCREEN, SEdb, HACER and EnhancerAtlas exhibit the largest diversity in biological sources. We highlight the Ensembl annotation, because it annotates the enhancers in the genome and subsequently classifies them according to their chromatin profile, a classification that allows us to estimate the activity of the enhancers in each biosample. However, this type of annotation could be expanded to cover the different types of enhancers. Regarding the identification of enhancers, each database includes enhancers identified by different methodologies, providing a partial view of the regulatory landscape. In this respect, the strategy followed by EnhancerAtlas is of great interest, because it integrates enhancers obtained by different approaches, which can provide a broader view of the current knowledge. However, in the generation of the consensus set, the database does not provide the original sequences that produce the new enhancers or their methodologies, so we cannot keep track of the historical record and support for the sequence prediction. The majority of databases use basic distance strategies for enhancer-promoter interactions. Since there is no preferred prediction method according to the literature, it is positive to have predictions developed by different strategies. 
Therefore, in a similar way to enhancer identification, we should highlight the annotation of repositories such as HACER, because it includes EPI identified by different strategies. With respect to the other annotations of interest, only the SEanalysis and EnhFFL databases include biological networks. The number of repositories about enhancer-disease relationships is also small, as is the volume of information they contain (see Fig. 6D and the supplementary material). Therefore, the study of the influence of enhancers on diseases is limited with these specialized databases. However, other databases without a focus on diseases contain biosamples associated with pathologies. Thus, given this situation, the comparative study of the enhancer profile between pathological and healthy biosamples may be an alternative. Enrichments related to other data of interest such as TFBS or mutations also vary between repositories. Moreover, many of these enrichments are only available in the web version of the databases and are not available for download. This complicates the use of this information, because the repositories do not usually offer APIs for programmatic queries either.

Challenges and future directions

Next, we describe the main challenges in the field that we have identified through our study. We also propose research directions of interest in this area.

Identification of enhancers

The study of enhancers also confronts other challenges associated with the identification of sequences, their target genes and the validation of candidates. In addition to the limitations associated with the experimental and computational techniques used, these challenges derive fundamentally from the identification and association of genes by indirect methods, because most of the methods rely on correlated properties that are not a direct measure, that offer only a partial view of the enhancer profile, and that also yield false predictions. Therefore, the validation of results becomes a fundamental pillar, although it is itself limited by the interpretation of the results, because the specificity of the regulatory sequences and their genomic context-dependent activity make this task difficult. In this context, progress in high-throughput reporter assays such as MPRA and STARR-seq, and in gene editing with CRISPR-Cas9, could provide tools for the massive screening of candidate enhancers to validate their role as regulatory elements. In the meantime, the enhancer sequences hosted in the different resources should be considered candidate sequences whose experimental validation is pending. The contribution of the scientific community is essential, both in submitting scientific results to the databases to update existing knowledge and in keeping the databases updated and working, thus avoiding the obsolescence of the content and/or the shutdown of the databases. On the other hand, novel software to identify enhancer sequences is being developed [173], [174]. Comparative studies of algorithms and reviews of these tools have been published in other works [130], [175], [176], although a more recent in-depth review of this issue would be of interest. In the supplementary material we have included the main algorithms that have been used to identify the enhancers provided in each repository. 
It is remarkable that most of this software is not specific to enhancer detection, but consists of common tools for peak identification, alignment and sequence processing, owing to the approach/strategy used for the detection of these sequences (Fig. 4 and Table 3). Moreover, the databases do not annotate this information, so users must check the original papers for more details. In addition, some papers report the identification process in the methodology but do not specify the software used, or use their own code, so the inclusion of this data is difficult and may not be complete. Therefore, the annotation of the tools that provide evidence is also an aspect that current databases should improve and that future repositories should include.

Underrepresented concepts in biological databases

With the exception of SE, the types of enhancers are underrepresented. Even so, the annotation of SE also needs to be improved, because the constituent enhancers that compose the sequence are often not included. The classification by chromatin profile is particularly interesting, because chromatin marks are correlated with enhancer activity. Since the majority of repositories do not label the type of enhancer, the type has to be inferred from the methodology used to identify the sequences and, therefore, the activity of the enhancers also has to be inferred. However, the methodology is not usually annotated in the databases either, and must be obtained by reading the corresponding article, a situation that becomes more complicated when the repository uses different experimental approaches, because the experimental evidence for each sequence may be lost. Therefore, an interesting area of further research is to explore the diversity of those sequences and their different profiles, which would increase the knowledge about the different typologies of enhancers. Also, the annotation of the type of evidence that supports the sequences is usually missing in the databases. That type of evidence is needed to properly report the validity of the data. The use of resources such as the Evidence Ontology should be considered [177]. A similar situation is found in the integration of sequences. Many repositories integrate information from different sources (see supplementary material), either to generate a new dataset, to increase the volume of data in the repository, or to use these sequences to add new useful information. However, databases lack cross-references between sources and do not keep the identifiers used in the reference sources. This complicates the identification of sequences and the monitoring of the evolution of information. 
This is an important aspect because each database has a different approach and does not provide all the annotations that may be of interest. It is therefore necessary to consult sequences in different repositories, and the lack of links between sources and of common identifiers makes this task difficult. Therefore, future work on the definition of community standards identifying, for instance, the minimal amount of information [178] that should be reported in the databases, and how to represent that minimal information and those cross-references, would help to obtain more homogeneous datasets and to facilitate link discovery. In this context, our model is offered as a tool for the representation and structuring of knowledge, and it promotes the use of identifiers instead of strings that vary between sources. Biological networks are the least covered aspect of enhancer annotations. The association between enhancers and diseases is also an under-covered aspect. However, repositories often contain both healthy and pathological biosamples, so the information present can be explored in comparative studies. It is important to note that, although biological samples are often annotated, the system of representation is suboptimal: strings are annotated instead of instances corresponding to a knowledge model. Therefore, choosing biosamples of interest among hundreds of possibilities in the repositories can be a complex task, because the annotation is unstructured. Here, annotation with ontology instances can help, as it would make it possible to obtain the samples belonging to the level of granularity of interest to the user.
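The benefit of annotating biosamples with ontology instances rather than strings can be sketched as follows. The small hierarchy, the sample names and the function names are hypothetical and do not correspond to any existing ontology or repository; a real implementation would rely on the terms and relations of a curated ontology. Under those assumptions, selecting all samples below a chosen level of granularity reduces to walking up the hierarchy.

```python
# Toy hierarchy (hypothetical): each term maps to its broader parent term.
BROADER = {
    "HepG2": "liver",
    "hepatocyte": "liver",
    "liver": "digestive system",
    "pancreatic islet": "pancreas",
    "pancreas": "digestive system",
    "digestive system": "anatomical entity",
    "cardiac muscle cell": "heart",
    "heart": "anatomical entity",
}


def ancestors(term, broader=BROADER):
    """Return all broader terms of `term`, nearest first."""
    found = []
    # Stop at the root, or if a malformed hierarchy would loop back.
    while term in broader and broader[term] not in found:
        term = broader[term]
        found.append(term)
    return found


def samples_under(term, annotations, broader=BROADER):
    """Return the samples annotated with `term` or with any narrower term."""
    return sorted(
        sample
        for sample, annotated in annotations.items()
        if annotated == term or term in ancestors(annotated, broader)
    )


# Illustrative annotations: sample identifier -> term of the toy hierarchy.
annotations = {
    "sample_1": "HepG2",
    "sample_2": "hepatocyte",
    "sample_3": "pancreatic islet",
    "sample_4": "cardiac muscle cell",
}
print(samples_under("liver", annotations))
print(samples_under("digestive system", annotations))
```

The same query is not possible with free-text labels, where "HepG2" and "hepatocyte" share no machine-readable relation to "liver".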

Formal knowledge model for enhancers

The variety of characteristics detected in enhancers has led to a lack of consensus on the definition of these sequences [12] and to the proliferation of different subtypes of enhancers described in the literature [26], some of which overlap with each other, which makes the regulatory landscape complex to understand. The representation of enhancers by biological sample in the repositories also contributes to this problem, because millions of overlapping sequences that vary in their boundaries have been generated. In this article, we have provided a model that captures relevant information about enhancer sequences. However, that kind of model should evolve towards a knowledge model. Formalizing enhancer-related knowledge in the form of an ontology would help to eliminate controversy and duplication and to reach a consensus. Ontologies are a useful tool both for structuring information and for representing it, and they are used in the Life Sciences, the Gene Ontology being the most successful example of a biological ontology [179]. Such an enhancer-related ontology would be the knowledge reference that facilitates the comprehension and appropriate transmission of scientific knowledge. Currently, the Sequence Ontology (SO) [180] is the most relevant ontology about features and attributes of biological sequences. SO includes the enhancer class (SO:0000165), but does not contain the most common subtypes described in the literature. SO has recently been extended with new terms related to gene regulation as part of the collaborative research carried out in the GREEKC consortium [181]. More concretely, the terminology related to gene expression has been updated in the cis-regulatory module (CRM) [182]. A similar effort should be pursued in order to incorporate into the Sequence Ontology the terminology related to enhancers included in our model and extracted from the literature.

Integrated data exploitation

When searching for enhancer sequences located in a given region or that regulate a certain gene, the current database landscape requires the user to query a wide variety of biological databases that are not interoperable with each other, which means that they cannot easily exchange information and that their information cannot be easily combined. The search tools provided by web portals are often simple, so performing multiple queries requires downloading the full dataset. This is also because the majority of databases do not have APIs that allow programmatic queries. Moreover, these full downloads are not always available, or offer only a partial dataset. In addition, many annotations are made using free-text strings, which makes it difficult to integrate and compare information. In this context, the availability of the aforementioned ontology would provide the terms for describing the data of the different resources, which would facilitate data exchange and interoperability. The interoperability of the datasets would generate a virtual global repository, which would enable a powerful exploitation of the large volume of existing but isolated data about enhancers. Such data interoperability should also be rooted in the FAIR principles (Findable, Accessible, Interoperable and Reusable) for data management [183]. The methodological aspects discussed and proposed by the GREEKC consortium [181] for the development of the Gene Regulation Knowledge Commons would be applicable here. This would also help to keep track of the evolution of the information about enhancers.
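As a rough illustration of what such interoperability would enable, the sketch below maps records from two differently structured sources onto one shared schema and then answers a single region query over both. The two input layouts, the field names, the source names and the small label vocabulary are all hypothetical, and the sketch assumes that both sources use the same genome assembly and coordinate convention, which in practice would have to be verified or converted.

```python
from typing import NamedTuple


class Enhancer(NamedTuple):
    """Shared record schema (hypothetical) for enhancers from any source."""
    source: str
    source_id: str
    chrom: str
    start: int
    end: int
    biosample: str


# Hypothetical controlled vocabulary: free-text label -> shared term.
BIOSAMPLE_TERMS = {
    "hepg2": "HepG2",
    "hep g2 cells": "HepG2",
    "k562": "K562",
    "k-562": "K562",
}


def normalize_biosample(label):
    """Map a free-text label to a shared term; unknown labels pass through."""
    cleaned = label.strip()
    return BIOSAMPLE_TERMS.get(cleaned.lower(), cleaned)


def from_source_a(row):
    """Assumed layout A: a dict with keys id, chr, start, end, cell."""
    return Enhancer("source_A", row["id"], row["chr"],
                    int(row["start"]), int(row["end"]),
                    normalize_biosample(row["cell"]))


def from_source_b(line):
    """Assumed layout B: tab-separated chrom, start, end, name, tissue."""
    chrom, start, end, name, tissue = line.rstrip("\n").split("\t")
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # harmonize chromosome naming
    return Enhancer("source_B", name, chrom, int(start), int(end),
                    normalize_biosample(tissue))


def query_region(enhancers, chrom, start, end):
    """Return enhancers overlapping the half-open interval [start, end)."""
    return [e for e in enhancers
            if e.chrom == chrom and e.start < end and start < e.end]


pool = [
    from_source_a({"id": "A1", "chr": "chr8", "start": "1000",
                   "end": "1500", "cell": "Hep G2 cells"}),
    from_source_b("8\t1400\t1800\tB9\tHEPG2\n"),
    from_source_b("8\t5000\t5200\tB10\tK-562"),
]
for hit in query_region(pool, "chr8", 1450, 1600):
    print(hit.source, hit.source_id, hit.biosample)
```

The per-source adapters are the part that a shared ontology and a standardized model would make unnecessary, since every resource would already describe its data with the same terms.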

Conclusions

There is an increasing interest in the exploitation of information about enhancers for generating new knowledge about regulatory processes, due to their potential relation with disorders. We have analyzed the landscape of databases that contain information about enhancers. Our study shows that the resources are highly heterogeneous in the types of information they hold about enhancers, which makes the integrated exploitation of the resources very difficult. The annotation of the data should also be improved to reflect the content of the literature. The development of knowledge models about enhancers and their integration into existing ontologies should contribute to the interoperability of the databases and improve the usability of the landscape of biological databases with information about enhancer sequences.

Ethical approval

Ethics approval was not required for this study.

CRediT authorship contribution statement

Juan Mulero Hernández: Conceptualization, Methodology, Investigation, Writing - original draft. Jesualdo Tomás Fernández-Breis: Conceptualization, Methodology, Writing - review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

181 in total

1. Chromatin stretch enhancer states drive cell-specific gene regulation and harbor human disease risk variants.

Authors: Stephen C J Parker; Michael L Stitzel; D Leland Taylor; Jose Miguel Orozco; Michael R Erdos; Jennifer A Akiyama; Kelly Lammerts van Bueren; Peter S Chines; Narisu Narisu; Brian L Black; Axel Visel; Len A Pennacchio; Francis S Collins
Journal: Proc Natl Acad Sci U S A Date: 2013-10-14 Impact factor: 11.205

Review 2. Identification and function of enhancers in the human genome.

Authors: Candice J Coppola; Ryne C Ramaker; Eric M Mendenhall
Journal: Hum Mol Genet Date: 2016-07-08 Impact factor: 6.150

3. Super-Enhancer-Mediated RNA Processing Revealed by Integrative MicroRNA Network Analysis.

Authors: Hiroshi I Suzuki; Richard A Young; Phillip A Sharp
Journal: Cell Date: 2017-03-09 Impact factor: 41.582

4. Cicero Predicts cis-Regulatory DNA Interactions from Single-Cell Chromatin Accessibility Data.

Authors: Hannah A Pliner; Jonathan S Packer; José L McFaline-Figueroa; Darren A Cusanovich; Riza M Daza; Delasa Aghamirzaie; Sanjay Srivatsan; Xiaojie Qiu; Dana Jackson; Anna Minkina; Andrew C Adey; Frank J Steemers; Jay Shendure; Cole Trapnell
Journal: Mol Cell Date: 2018-08-02 Impact factor: 17.970

5. SEA: a super-enhancer archive.

Authors: Yanjun Wei; Shumei Zhang; Shipeng Shang; Bin Zhang; Song Li; Xinyu Wang; Fang Wang; Jianzhong Su; Qiong Wu; Hongbo Liu; Yan Zhang
Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971

6. Mll3 and Mll4 Facilitate Enhancer RNA Synthesis and Transcription from Promoters Independently of H3K4 Monomethylation.

Authors: Kristel M Dorighi; Tomek Swigut; Telmo Henriques; Natarajan V Bhanu; Benjamin S Scruggs; Nataliya Nady; Christopher D Still; Benjamin A Garcia; Karen Adelman; Joanna Wysocka
Journal: Mol Cell Date: 2017-05-05 Impact factor: 17.970

