Dimitrios Kleftogiannis, Panos Kalnis, Vladimir B Bajic.
Abstract
Enhancers are cis-acting DNA elements that play critical roles in distal regulation of gene expression. Identifying enhancers is an important step for understanding distinct gene expression programs that may reflect normal and pathogenic cellular conditions. Experimental identification of enhancers is constrained by the set of conditions used in the experiment. This requires multiple experiments to identify enhancers, as they can be active under specific cellular conditions but not in different cell types/tissues or cellular states. This has opened prospects for computational prediction methods that can be used for high-throughput identification of putative enhancers to complement experimental approaches. Potential functions and properties of predicted enhancers have been catalogued and summarized in several enhancer-oriented databases. Because the current methods for the computational prediction of enhancers produce significantly different enhancer predictions, it will be beneficial for the research community to have an overview of the strategies and solutions developed in this field. In this review, we focus on the identification and analysis of enhancers by bioinformatics approaches. First, we describe a general framework for computational identification of enhancers, present relevant data types and discuss possible computational solutions. Next, we cover over 30 existing computational enhancer identification methods that were developed since 2000. Our review highlights advantages, limitations and potentials, while suggesting pragmatic guidelines for development of more efficient computational enhancer prediction methods. Finally, we discuss challenges and open problems of this topic, which require further consideration.
Keywords: bioinformatics; chromatin signatures; computer science; enhancers; gene regulation; genome annotation; histone modification marks; machine learning
Year: 2015 PMID: 26634919 PMCID: PMC5142011 DOI: 10.1093/bib/bbv101
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Comparison analysis of enhancer predictions obtained by different methods across six ENCODE cell lines
| Method 1 versus Method 2 | Gm12878: Coverage 1 versus Coverage 2 (Mb) | Gm12878: Overlap (Mb) | Gm12878: Jaccard (%) | H1hesc: Coverage 1 versus Coverage 2 (Mb) | H1hesc: Overlap (Mb) | H1hesc: Jaccard (%) | K562: Coverage 1 versus Coverage 2 (Mb) | K562: Overlap (Mb) | K562: Jaccard (%) | HeLa: Coverage 1 versus Coverage 2 (Mb) | HeLa: Overlap (Mb) | HeLa: Jaccard (%) | HepG2: Coverage 1 versus Coverage 2 (Mb) | HepG2: Overlap (Mb) | HepG2: Jaccard (%) | HUVEC: Coverage 1 versus Coverage 2 (Mb) | HUVEC: Overlap (Mb) | HUVEC: Jaccard (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 10.7 versus 42.69 | 5.3 | 11.0 | 19.1 versus 56.5 | 2.2 | 3.0 | 34.5 versus 28.1 | 8.0 | 14.8 | 26.6 versus 42.9 | 5.3 | 8.3 | 40.8 versus 24 | 6.4 | 11.0 | 49.4 versus 47.1 | 21.3 | 28.3 |
| | 344.1 versus 42.69 | 31.6 | 8.9 | 124.7 versus 56.5 | 29.5 | 19.4 | 130.6 versus 28.1 | 18.5 | 13.1 | 87.4 versus 42.9 | 26.4 | 25.4 | 253.1 versus 24 | 12 | 4.5 | 191.2 versus 47.1 | 33.6 | 16.4 |
| | 82.7 versus 42.69 | 37.7 | 43.0 | 80.5 versus 56.5 | 36.9 | | 111.4 versus 28.1 | 24.8 | 21.6 | 70.9 versus 42.9 | 36.0 | | 72.88 versus 24 | 10.8 | 12.6 | 107.2 versus 47.1 | 40.0 | 35.0 |
| | 119.5 versus 42.69 | 39.1 | 31.7 | 404.9 versus 56.5 | 20.3 | 4.6 | 282.5 versus 28.1 | 27.6 | 9.7 | 124.8 versus 42.9 | 41.1 | 32.7 | 230.3 versus 24 | 11.6 | 4.5 | 189.6 versus 47.1 | 39.1 | 19.7 |
| CSI-ANN versus RFECS | 10.7 versus 344.1 | 6.1 | 1.7 | 19.1 versus 124.7 | 2.2 | 1.5 | 34.5 versus 130.6 | 11.4 | 7.4 | 26.6 versus 87.4 | 5.71 | 5.2 | 40.8 versus 253.1 | 12.6 | 4.4 | 49.4 versus 191.2 | 21.9 | 10.0 |
| RFECS versus ChromHMM | 344.1 versus 82.7 | 55.9 | 15.0 | 124.7 versus 80.5 | 40.3 | 24.4 | 130.6 versus 111.4 | 51.1 | 26.7 | 87.4 versus 70.9 | 42.1 | 36.1 | 253.1 versus 72.88 | 45.2 | 16.2 | 191.2 versus 107.2 | 73.7 | 32.7 |
| RFECS versus Segway | 344.1 versus 119.5 | 71.6 | 18.2 | 124.7 versus 404.9 | 52.3 | 10.9 | 130.6 versus 282.5 | 80.5 | 24.2 | 87.4 versus 124.8 | 51.1 | 31.7 | 253.1 versus 230.3 | 77.8 | 19.1 | 191.2 versus 189.6 | 92.4 | 32.0 |
| CSI-ANN versus ChromHMM | 10.7 versus 82.7 | 6.0 | 6.8 | 19.1 versus 80.5 | 1.6 | 1.6 | 34.5 versus 111.4 | 12.0 | 8.9 | 26.6 versus 70.9 | 5.8 | 6.3 | 40.8 versus 72.88 | 10.9 | 10.6 | 49.4 versus 107.2 | 25.8 | 19.7 |
| ChromHMM versus Segway | 82.7 versus 119.5 | 63.8 | | 80.5 versus 404.9 | 52.5 | 12.1 | 111.4 versus 282.5 | 100.6 | | 70.9 versus 124.8 | 61.4 | 45.7 | 72.88 versus 230.3 | 56.3 | | 107.2 versus 189.6 | 99.0 | |
| CSI-ANN versus Segway | 10.7 versus 119.5 | 8.0 | 6.5 | 19.1 versus 404.9 | 1.3 | 0.3 | 34.5 versus 282.5 | 19.74 | 6.6 | 26.6 versus 124.8 | 10.0 | 7.1 | 40.8 versus 230.3 | 18.1 | 7.1 | 49.4 versus 189.6 | 30.4 | 14.5 |
We report the total number of bases, in millions, predicted as belonging to enhancers. Coverage 1 corresponds to enhancers predicted by Method 1, and Coverage 2 to enhancers predicted by Method 2. The overlap column gives the number of bases, in millions, predicted as enhancers by both Method 1 and Method 2. In the third column, we report the similarity of the predictions of Method 1 and Method 2 based on the Jaccard similarity index (as a percentage).
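The relation between the coverage, overlap and Jaccard columns can be illustrated with a minimal Python sketch. It assumes the Jaccard index is taken at base level, i.e. overlap divided by the union of the two predicted sets, which reproduces the tabulated CSI-ANN versus RFECS value for Gm12878. The interval helpers (`merge_intervals`, `covered_bases`, `overlap_bases`) and the toy coordinates are hypothetical and only show how such numbers could be derived from lists of predicted regions on one chromosome.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    """Total number of bases covered by a set of predicted regions."""
    return sum(end - start for start, end in merge_intervals(intervals))


def overlap_bases(regions_1, regions_2):
    """Number of bases predicted by both methods (two-pointer sweep)."""
    a, b = merge_intervals(regions_1), merge_intervals(regions_2)
    i = j = total = 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if high > low:
            total += high - low
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard_percent(coverage_1, coverage_2, overlap):
    """Jaccard index as a percentage: overlap / (coverage 1 + coverage 2 - overlap)."""
    union = coverage_1 + coverage_2 - overlap
    if union <= 0:
        raise ValueError("the union of the two prediction sets must be positive")
    return 100.0 * overlap / union


# CSI-ANN versus RFECS in Gm12878, using the values (in Mb) from the table above.
print(round(jaccard_percent(10.7, 344.1, 6.1), 1))  # prints 1.7

# Toy regions (hypothetical coordinates) for two methods.
method_1 = [(0, 100), (50, 150)]
method_2 = [(120, 200)]
shared = overlap_bases(method_1, method_2)
print(jaccard_percent(covered_bases(method_1), covered_bases(method_2), shared))  # prints 15.0
```

Because the index divides by the union, a method with very large coverage (such as 344.1 Mb) yields a low Jaccard value against a small prediction set even when most of the smaller set is recovered.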
Figure 1. Basic components of a general enhancer identification system. The first block on the left (lilac colour) handles integration and preprocessing of different data types. These data types (summarized in Table 2) can be combined in different ways to generate feature vectors that describe DNA regions. The feature values can be normalized or rescaled (second block, red colour). Feature selection (FS) techniques can then be applied to reduce the number of features and select smaller feature sets with higher discriminative capability. The feature vectors feed computational models that make decisions using unsupervised and/or supervised algorithms (third block, green colour). The outcome is a list of identified enhancer regions (fourth block, orange colour), which can be analysed further using computational techniques.
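The stages described in this caption (rescaling, feature selection, decision model, output list) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any reviewed method: min-max rescaling, a Fisher-score ranking for FS and a nearest-centroid classifier are stand-ins chosen for brevity, and the three toy features (two histone-mark-like signals and one uninformative column) are invented values.

```python
from statistics import mean, pvariance


def fit_min_max(rows):
    """Learn per-feature minima and maxima from the training regions."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, lows, highs):
    """Rescale every feature to [0, 1] using the learned bounds."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def fisher_scores(rows, labels):
    """Squared difference of class means over summed within-class variance."""
    scores = []
    for column in zip(*rows):
        pos = [v for v, y in zip(column, labels) if y == 1]
        neg = [v for v, y in zip(column, labels) if y == 0]
        spread = pvariance(pos) + pvariance(neg)
        scores.append((mean(pos) - mean(neg)) ** 2 / (spread + 1e-9))
    return scores


def select_top_features(rows, labels, k):
    """Return the indices of the k highest-scoring features, in column order."""
    scores = fisher_scores(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def train_centroids(rows, labels, keep):
    """Supervised stand-in model: one centroid per class over the kept features."""
    centroids = {}
    for cls in (0, 1):
        members = [[row[i] for i in keep] for row, y in zip(rows, labels) if y == cls]
        centroids[cls] = [mean(col) for col in zip(*members)]
    return centroids


def predict(row, keep, centroids):
    """Assign a rescaled region to the class with the nearest centroid."""
    point = [row[i] for i in keep]
    return min(
        centroids,
        key=lambda cls: sum((p - q) ** 2 for p, q in zip(point, centroids[cls])),
    )


# Toy training set (invented values): columns are two signal-like features
# and one uninformative feature; label 1 = enhancer, 0 = background.
train_rows = [[9.0, 7.0, 3.0], [8.0, 9.0, 1.0], [10.0, 8.0, 2.0],
              [1.0, 2.0, 2.0], [2.0, 1.0, 3.0], [0.0, 0.0, 1.0]]
labels = [1, 1, 1, 0, 0, 0]

lows, highs = fit_min_max(train_rows)
scaled = apply_min_max(train_rows, lows, highs)
keep = select_top_features(scaled, labels, k=2)
centroids = train_centroids(scaled, labels, keep)

candidates = [[7.0, 6.0, 3.0], [1.0, 1.0, 1.0]]
calls = [predict(r, keep, centroids) for r in apply_min_max(candidates, lows, highs)]
print(keep, calls)
```

The scaling bounds are learned on the training regions and reused for new candidates, so that feature selection and prediction see values on the same scale.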
Overview of data and features used for enhancer identification
| Data sources | Feature example | Advantage | Disadvantage | Representative methods |
|---|---|---|---|---|
| Evolutionary conservation | Conserved motifs across species | Easy to compute | Insufficient information for predicting enhancer's tissue-specific activity | [ |
| Histone marks | ChIP-seq from H3K4me1 | Provides cell-line-/tissue-specific information that characterizes enhancers and also different categories of enhancers (e.g. poised versus active) | Different cell lines/tissues are associated with different combinations of histone marks | [ |
| TFBSs | ChIP-seq from P300 | Provides cell-line-/tissue-specific information that characterizes enhancers. High-resolution data for testing activity of enhancer-related TFs | Not available for many cell lines/tissues | [ |
| Open chromatin | DHS | High discriminative capacity when combined with other data types, e.g. P300-binding sites | Regions with enriched DHS activity do not necessarily correspond to enhancers | [ |
| Sequence characteristics | k-mers of size 5 | Easy to compute | Insufficient information for predicting enhancers’ activity across different tissues | [ |
| eRNA expression | CAGE data | High accuracy | eRNA regulation mechanisms are unknown, and not all of the enhancers are known to produce eRNAs | [ |
| Enhancer-screening data | STARR-seq | High accuracy for testing enhancer activity | Not useful for | [ |
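As an illustration of the sequence-characteristics row, a k-mer feature vector for a DNA region might be computed as in the sketch below. The function name `kmer_feature_vector`, the normalization to frequencies and the choice to skip windows containing ambiguous bases are assumptions made for this example and are not taken from any specific method in the review.

```python
from itertools import product


def kmer_feature_vector(sequence, k=5):
    """Return normalized counts of all 4**k DNA k-mers in a sequence."""
    sequence = sequence.upper()
    # Fixed lexicographic ordering of k-mers: AAAAA, AAAAC, ..., TTTTT.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # skip windows containing N or other symbols
            counts[position] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


features = kmer_feature_vector("ACGTACGTAC")
print(len(features))  # prints 1024
```

With k = 5 each region is described by 1024 values regardless of its length, which is what makes this data type easy to compute but, as the table notes, uninformative about tissue-specific activity on its own.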
Figure 2. Roadmap of existing approaches for enhancer identification. We have categorized the methods into three basic streams, which we partitioned further into subcategories based on the underlying computational solutions and the combination of relevant enhancer data.
Summary of the most popular bioinformatics approaches for enhancer identification
| Name | Computational method | Highlight | Link | Reference |
|---|---|---|---|---|
| Heintzman | Clustering and correlation of histone marks profiles | High-recognition performance in HeLa | – | [ |
| ChromaSig (*) | Identification of specific histone mark motifs and clustering | The method is sensitive enough to capture patterns characterizing different classes of enhancers. | [ | |
| Rye | Clustering of profiles | The results indicate that selection of relevant TFs may be sufficient to identify regulatory elements | – | [ |
| Won | HMMs | State-of-the-art method suggesting that HMMs are capable of integrating information from multiple histone marks for predicting regulatory elements | [ | |
| Boyle | Combination of DHS with TFBSs | Active enhancers usually overlap with open chromatin regions, but not all of the DNA accessible regions correspond to enhancers | – | [ |
| ChromHMM (*) | HMMs | State-of-the-art genome annotation method by ENCODE | [ | |
| Segway (*) | DBNs | State-of-the-art genome annotation method by ENCODE | [ | |
| ChroModule | HMMs | Annotated human genome for eight cell lines and improved the AUC compared with state-of-the-art HMM based methods | – | [ |
| CSI-ANN (*) | ANNs | Effective combination of ANNs with FDA for FS | [ | |
| ChromaGenSVM (*) | SVMs | Effective combination of SVMs with GA for optimization and FS | [ | |
| EnhancerFinder | MKL | Functional genomics combined with sequence motifs can accurately identify developmental enhancers | – | [ |
| RFECS (*) | RFs | Method less prone to overfitting, which introduces additional novelties on the way enhancer predictions are validated | [ | |
| DEEP (*) | SVMs and ANNs | Novel ensemble-learning-based algorithm with good generalization capabilities in unknown cell lines | [ | |
| kmer-SVM (*) | SVMs | Extensively studies the enhancer sequence context | [ | |
| dREG (*) | SVR | Usage of GRO-seq data combined with regression analysis | [ | |
| DELTA (*) | AdaBoost | Introduces the concept of shape features from ChIP-seq data | [ | |
| Andersson | eRNA expression analysis | Introduces one of the most accurate features for enhancer identification | [ | |
| CoSBI (*) | Bi-clustering | Reports combination of histone marks with high discriminative power for the category of enhancers | [ |
Note. Methods marked with (*) provide source code or executable files.