Judith Mary Hariprakash, Francesco Ferrari.
Abstract
Enhancers are non-coding regulatory elements that are distant from their target genes. Their characterization remains elusive, mainly because of the challenges in achieving a comprehensive pairing of enhancers and target genes. A number of computational biology solutions have been proposed to address this problem, leveraging the increasing availability of functional genomics data and the improved mechanistic understanding of enhancer action. In this review we focus on computational methods for the genome-wide definition of enhancer-target gene pairs. We outline the different classes of methods, as well as their main advantages and limitations. The types of information integrated by each method, along with details on their applicability, are presented and discussed. We especially highlight the technical challenges that are still unresolved and that hamper a satisfactory and comprehensive solution. We expect this field to keep evolving in the coming years, owing to the ever-growing availability of data and increasing insights into the crucial role of enhancers in regulating genome functionality.
Year: 2019 PMID: 31316726 PMCID: PMC6611831 DOI: 10.1016/j.csbj.2019.06.012
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Timeline of the enhancer-target gene pairing algorithms. The main methods described in the review (tool name in bold, if defined) are listed along the horizontal axis by year of publication.
Enhancer-target gene (ETG) pairing methods. The table lists the various ETG algorithms and their grouping into four main classes: correlation-based (C), supervised learning-based (SL), regression-based (R) and score-based (S). Methods with mixed features are marked accordingly (e.g. SL + R or C + R). C* denotes a method conceptually related to correlation-based solutions. Details on each method and the features adopted for ETG pairing are also listed.
| Name | Class | Method details | Features |
|---|---|---|---|
| Thurman et al. | C | Pearson correlation | DNase-seq |
| Shen et al. | C | Spearman correlation | ChIP-seq for Pol2 and H3K4me1 |
| PreSTIGE | C* | Shannon entropy to select cell type-specific patterns | RNA-seq, ChIP-seq for H3K4me1 |
| ELMER | C | Inverse correlation | RNA-seq, DNA methylation |
| Rodelsperger et al. | SL | Random forest | Distance, conserved synteny, gene ontology, protein-protein interactions |
| Ernst et al. | SL | Logistic regression | Gene expression (microarrays), ChIP-seq for 3 histone marks |
| IM-PET | SL | Random forest | Distance, conserved synteny, correlation between enhancer (CSI-ANN score on 3 histone marks) and target promoter (RNA-seq) activity, TFs binding (sequence motifs) and target promoter correlation |
| PETModule | SL | Random forest | Distance, conserved synteny, DNase-seq |
| TargetFinder | SL | Ensemble of boosted decision trees | DNase-seq, FAIRE-seq, DNA methylation, RNA-seq, ChIP-seq for 32 histone marks, in addition to TFs and architectural proteins |
| McEnhancer | SL | Third-order interpolated Markov chain model in a semi-supervised learning setup | Sequence motifs |
| Andersson et al. | C + R | Pearson correlation, then linear models and lasso shrinkage | DNase-seq |
| RIPPLE | SL + R | Random forest and group lasso | DNase-seq, RNA-seq, ChIP-seq for 8 histone marks and 15 TFs |
| JEME | R + SL | Multiple linear regression and lasso shrinkage | DNase-seq, RNA-seq, ChIP-seq for 3 histone marks |
| FOCS | R | Ordinary least squares regression | DNase-seq, CAGE-seq |
| EpiTensor | S | Higher-order tensors decomposition | DNase-seq, RNA-seq, ChIP-seq for 16 histone marks |
| GeneHancer | S | Additive score with custom weights and data transformations for each quantitative | Distance, TFs co-expression, eRNAs, eQTLs, capture Hi-C |
| PEGASUS | S | Score reflecting the evolutionary sequence and synteny conservation | Conserved synteny and sequence conservation |
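The correlation-based entries in the table (class C) share one idea: an enhancer is paired with a gene when their activity signals rise and fall together across cell types. The sketch below illustrates this with a Pearson correlation on toy values. It is not the implementation of any listed tool: the signal values, the `min_r` cutoff and the function names are assumptions made for illustration, and a real analysis would also restrict candidate pairs to a genomic window around each gene.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant signal: correlation undefined, treated as no association
    return cov / sqrt(var_x * var_y)

def pair_by_correlation(enhancers, promoters, min_r=0.7):
    """Return (enhancer, gene, r) for each pair with correlation >= min_r.

    min_r is a hypothetical cutoff, not a value taken from the review.
    """
    pairs = []
    for e_name, e_signal in enhancers.items():
        for g_name, p_signal in promoters.items():
            r = pearson(e_signal, p_signal)
            if r >= min_r:
                pairs.append((e_name, g_name, round(r, 3)))
    return pairs

# Toy activity values (e.g. DNase-seq signal) across five hypothetical cell types
enhancers = {"E1": [1.0, 2.0, 3.0, 4.0, 5.0], "E2": [5.0, 1.0, 4.0, 2.0, 3.0]}
promoters = {"geneA": [2.0, 4.0, 6.0, 8.0, 10.0], "geneB": [3.0, 3.5, 2.5, 3.0, 3.2]}

print(pair_by_correlation(enhancers, promoters))  # [('E1', 'geneA', 1.0)]
```

Because every enhancer is tested against every promoter independently, one enhancer can be assigned to several genes, and the coefficient itself serves as a measure of association strength.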
Fig. 2. Features used in ETG pairing tools. The figure summarizes the main types of features used to define ETG pairs by the tools discussed in this review. For each feature, the plot reports its frequency (y-axis, number of methods) and the year of its first adoption by the tools discussed in this review (x-axis). The size of each dot is also proportional to the frequency (number of methods). The colors represent the category of the data: genomic annotations independent of cell type (dark green); epigenomics data (orange); transcriptomic data (mauve).
Fig. 3. Main classes of ETG pairing methods. The cartoon highlights the principles underlying the four main classes of ETG pairing methods discussed in this review. (a) Correlation-based methods assess the correlation between the activity of individual enhancer-promoter pairs across multiple cell types. Activity is measured by one or more types of functional epigenomics or transcriptomics data. (b) Supervised learning-based methods instead build a predictor from a known set of true positive and negative ETG pairs. For each of these pairs, several features (e.g. functional genomics data) describe enhancer and promoter activity across multiple cell types. These can be enriched with other features directly associated with the ETG pair, such as genomic distance or synteny conservation. (c) Regression-based methods simultaneously assess the quantitative contribution of multiple enhancers, within the considered genomic window, to the activity of a promoter. These methods leverage a large number of genomic features and functional data. Regression methods can provide a weight for the contribution of individual enhancers (represented by ETG pairing lines of different thickness in the cartoon). (d) Score-based methods integrate information from a large set of genomic features and functional data into a single quantitative score, which quantifies the strength of individual ETG pairs. In all cartoons, enhancers and promoters are represented as boxes labelled "E" or "P", respectively. The TSS is marked with an arrow. Colored curves (purple for enhancers, green for promoters) represent the quantitative functional genomics data used to infer activity levels. They are meant to hint at the peaks of varying intensity that such features would show in genomics data such as ChIP-seq.
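Panel (c) can be made concrete with a small regression: the promoter signal across cell types is modelled as a weighted sum of the signals of all candidate enhancers in the window, and the fitted weights indicate each enhancer's contribution. The following is a minimal sketch using ordinary least squares solved through the normal equations in plain Python. The data are synthetic, all names are hypothetical, and the lasso shrinkage used by several of the tools is deliberately left out.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: enhancer signals are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_promoter_model(enhancer_signals, promoter_signal):
    """OLS fit of promoter = intercept + sum_k w_k * enhancer_k across cell types."""
    names = list(enhancer_signals)
    n_samples = len(promoter_signal)
    # Design matrix with a leading column of ones for the intercept
    X = [[1.0] + [enhancer_signals[name][i] for name in names]
         for i in range(n_samples)]
    p = len(names) + 1
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(X[i][r] * X[i][c] for i in range(n_samples)) for c in range(p)]
           for r in range(p)]
    xty = [sum(X[i][r] * promoter_signal[i] for i in range(n_samples))
           for r in range(p)]
    coefs = solve_linear_system(xtx, xty)
    return coefs[0], dict(zip(names, coefs[1:]))

# Synthetic example: six cell types, two candidate enhancers in the window.
# The promoter signal is constructed as 1 + 2*E1 + 0.5*E2.
enhancer_signals = {"E1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                    "E2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]}
promoter_signal = [4.0, 5.5, 9.0, 10.5, 14.0, 15.5]

intercept, weights = fit_promoter_model(enhancer_signals, promoter_signal)
ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
print(ranked)  # enhancers ordered by absolute fitted weight
```

Ranking by absolute weight is only meaningful when the enhancer signals are on comparable scales, so in practice the inputs would be normalized first.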
Details on the applicability of the methods. The table lists details on the usability and applicability of each method: the organisms, cell types and tissue types used to develop each tool, and the availability of its code.
| Name | Organism | Samples | Code availability |
|---|---|---|---|
| Rodelsperger et al. | Mouse | Embryonic mouse forebrain and limb | – |
| Ernst et al. | Human | 9 cell lines | – |
| Thurman et al. | Human | 125 cell lines | – |
| Shen et al. | Mouse | 19 cell and tissue types | – |
| PreSTIGE | Human | 12 cell lines | Galaxy module |
| IM-PET | Drosophila and human | Drosophila and 12 human cell lines | Galaxy module and archive with a collection of scripts for PERL, Python, Java, R and other tools |
| Andersson et al. | Human | 432 primary cell, 135 tissue and 241 cell line samples | – |
| RIPPLE | Human | 4 cell lines | Bitbucket repository with a collection of C++ programs and MATLAB scripts (current version: 1.0) |
| ELMER | Human | 2841 TCGA samples | R package on Bioconductor (release 3.9) |
| PETModule | Mouse and human | 2 mouse cell lines and 8 human cell lines | Archive with a collection of scripts for Python, Java and other tools |
| TargetFinder | Human | 4 cell lines | GitHub repository with a collection of Python scripts |
| EpiTensor | Human | 5 cell lines | Archive with a collection of scripts for Bash, R, MATLAB and other tools (current version: v0.9) |
| JEME | Human | 935 human primary cell and tissue types | GitHub repository with a collection of scripts for Bash, R and other tools |
| GeneHancer | Human | Cell lines from multiple compendiums | GeneCards web portal |
| McEnhancer | Drosophila | Drosophila embryo development stages | GitHub repository with a collection of scripts for Bash, PERL, R, Python, Java and other tools |
| PEGASUS | Zebrafish and human | Human embryonic stem cells and zebrafish developmental stages | – |
| FOCS | Human | 2630 samples | GitHub repository with a collection of scripts for R |
Pros and cons of ETG pairing approaches. The table summarizes the advantages and limitations of each class of methods.
| Method | Correlation | Machine learning | Regression | Score-based |
|---|---|---|---|---|
| Pros | can identify multiple targets of an enhancer; can directly derive a quantitative measure of the strength of association; correlation can also be measured between regulatory elements and genes within a short distance | once the classifier is trained, in principle it could predict ETG pairs in other cell types | multiple enhancers that are candidate regulators of a given gene can be ranked to select the most informative ones | flexible prioritization of ETG pairs by adjusting a single threshold on the score; all possible ETG pairs can be scored |
| Cons | need genomics data over a large panel of cells, with consistent quality and resolution; may overlook the cell type and time specificity of interactions, thus missing relevant connections when extending to a new cell type or condition; do not directly consider multiple enhancers acting cooperatively on a gene | the training of the classifier requires a set of known positive as well as negative interactions; hampered by the lack of a comprehensive genome-wide definition of known true positive and negative ETG pairs | arbitrarily chosen parameters such as the window size or the maximum number of enhancers around each TSS; generally need a large compendium of cell types with functional data to build the models | rely on a number of assumptions and arbitrarily defined parameters or weights to combine a heterogeneous set of information into a single quantitative value |
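The score-based column can be illustrated with a toy additive score: each candidate ETG pair gets a weighted sum of its features, and a single threshold controls how many pairs are retained. In this sketch the feature names, their 0 to 1 scaling, the weights and the thresholds are all hypothetical choices made for illustration. This arbitrariness is exactly the limitation noted in the table, and the code does not reproduce the scoring scheme of any specific tool.

```python
def additive_score(features, weights):
    """Weighted sum of the feature values of one candidate ETG pair."""
    return sum(weights[name] * value for name, value in features.items())

def prioritize(candidates, weights, threshold):
    """Score every candidate pair and keep those at or above the threshold,
    sorted from the strongest to the weakest score."""
    scored = [(pair, additive_score(feats, weights))
              for pair, feats in candidates.items()]
    kept = [(pair, score) for pair, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical weights; features are assumed to be pre-scaled to [0, 1]
weights = {"proximity": 0.2, "coactivity": 0.5, "conservation": 0.3}

candidates = {
    ("E1", "geneA"): {"proximity": 0.9, "coactivity": 0.8, "conservation": 0.5},
    ("E1", "geneB"): {"proximity": 0.4, "coactivity": 0.3, "conservation": 0.6},
    ("E2", "geneA"): {"proximity": 0.7, "coactivity": 0.7, "conservation": 0.2},
}

# Raising the threshold trades sensitivity for stringency
for threshold in (0.5, 0.6):
    kept = prioritize(candidates, weights, threshold)
    print(threshold, [(pair, round(score, 3)) for pair, score in kept])
```

Every possible pair receives a score, so the same table of scores can be reused with a different cutoff without recomputing anything.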