Mary Lauren Benton, Sai Charan Talipineni, Dennis Kostka, John A Capra.
Abstract
BACKGROUND: Non-coding gene regulatory enhancers are essential to transcription in mammalian cells. As a result, a large variety of experimental and computational strategies have been developed to identify cis-regulatory enhancer sequences. Given the differences in the biological signals assayed, some variation in the enhancers identified by different methods is expected; however, the concordance of enhancers identified by different methods has not been comprehensively evaluated. This is critically needed, since in practice, most studies consider enhancers identified by only a single method. Here, we compare enhancer sets from eleven representative strategies in four biological contexts.
Keywords: Cis-regulatory elements; Enhancer identification; Gene regulation
Year: 2019 PMID: 31221079 PMCID: PMC6585034 DOI: 10.1186/s12864-019-5779-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Eleven diverse enhancer identification strategies were evaluated across four cellular contexts. Each row summarizes the data sources, analytical approaches, and contexts for the eleven enhancer identification strategies we considered. The leftmost columns of the schematic represent the experimental assays and sources of the data used by each identification strategy. The middle columns describe the computational processing (if any) performed on the raw data (ML: machine learning). The rightmost columns give the contexts in which the sets were available. Table 1 gives the number, length, and genomic coverage of each enhancer set.
Table 1 Summary of all enhancer sets analyzed in this study
| Context | Enhancer Set | Total Length (kb) | Number of Enhancers | Median Length (bp) | Genome Coverage (fraction) |
|---|---|---|---|---|---|
| K562 | H3K27acPlusH3K4me1 | 22,113 | 6642 | 1903 | 0.0078 |
| K562 | H3K27acMinusH3K4me3 | 34,072 | 19,698 | 525 | 0.0120 |
| K562 | DNasePlusHistone | 6620 | 13,402 | 431 | 0.0023 |
| K562 | ChromHMM | 96,545 | 100,837 | 600 | 0.0339 |
| K562 | EncodeEnhancerlike | 39,961 | 36,008 | 878 | 0.0140 |
| K562 | Ho14 | 29,027 | 35,769 | 556 | 0.0102 |
| K562 | Yip12 | 5389 | 13,303 | 342 | 0.0019 |
| K562 | p300 | 7939 | 26,463 | 316 | 0.0028 |
| K562 | GRO-cap | 3905 | 23,825 | 160 | 0.0014 |
| K562 | FANTOM | 390 | 1084 | 344 | 0.0001 |
| Gm12878 | H3K27acPlusH3K4me1 | 28,355 | 8019 | 2749 | 0.0099 |
| Gm12878 | H3K27acMinusH3K4me3 | 20,868 | 11,238 | 701 | 0.0073 |
| Gm12878 | DNasePlusHistone | 9286 | 19,815 | 386 | 0.0033 |
| Gm12878 | ChromHMM | 73,929 | 69,314 | 800 | 0.0259 |
| Gm12878 | EncodeEnhancerlike | 50,224 | 38,872 | 1018 | 0.0176 |
| Gm12878 | Ho14 | 41,543 | 39,550 | 674 | 0.0146 |
| Gm12878 | Yip12 | 5389 | 13,303 | 342 | 0.0019 |
| Gm12878 | p300 | 6480 | 17,532 | 360 | 0.0023 |
| Gm12878 | GRO-cap | 3646 | 21,308 | 160 | 0.0013 |
| Gm12878 | FANTOM | 1025 | 2826 | 343 | 0.0004 |
| Liver | H3K27acPlusH3K4me1 | 87,576 | 37,644 | 1831 | 0.0307 |
| Liver | H3K27acMinusH3K4me3 | 137,874 | 77,014 | 1096 | 0.0484 |
| Liver | DNasePlusHistone | 51,292 | 170,212 | 152 | 0.0180 |
| Liver | ChromHMM | 108,375 | 101,260 | 800 | 0.0380 |
| Liver | EncodeEnhancerlike | 89,129 | 37,426 | 1849 | 0.0313 |
| Liver | FANTOM | 326 | 869 | 347 | 0.0001 |
| Liver | Villar15 | 86,139 | 27,725 | 2545 | 0.0302 |
| Heart | H3K27acPlusH3K4me1 | 59,892 | 42,910 | 1102 | 0.0210 |
| Heart | H3K27acMinusH3K4me3 | 157,468 | 141,162 | 684 | 0.0553 |
| Heart | DNasePlusHistone | 33,224 | 103,898 | 168 | 0.0117 |
| Heart | ChromHMM | 93,067 | 113,092 | 600 | 0.0327 |
| Heart | EncodeEnhancerlike | 186,866 | 47,235 | 2872 | 0.0656 |
| Heart | FANTOM | 611 | 1720 | 335 | 0.0002 |
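The per-set summary columns in this table (number of enhancers, total length, median length, and genome coverage) can be derived from a list of enhancer intervals. The following is a minimal sketch, not the authors' pipeline: it assumes half-open, non-overlapping `(chrom, start, end)` intervals, and both the toy intervals and the `genome_bp` denominator are hypothetical. The tabulated coverage values are consistent with a denominator of roughly 2.8 Gb, but the exact genome length used is not stated here.

```python
from statistics import median

def summarize(intervals, genome_bp):
    """Summarize an enhancer set.

    intervals: list of (chrom, start, end), half-open and assumed
    non-overlapping (an assumption of this sketch).
    genome_bp: genome length used as the coverage denominator (hypothetical).
    """
    lengths = [end - start for _, start, end in intervals]
    total_bp = sum(lengths)
    return {
        "n_enhancers": len(intervals),
        "total_kb": total_bp / 1000,
        "median_length": median(lengths),
        "coverage": total_bp / genome_bp,
    }

# Hypothetical toy data, not taken from the study
toy = [("chr1", 100, 400), ("chr1", 1000, 1500), ("chr2", 50, 250)]
stats = summarize(toy, genome_bp=100_000)
print(stats)
```

If the intervals in a set can overlap one another, they would need to be merged before summing lengths so that shared bases are not counted twice.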
Fig. 2 Enhancer identification methods vary in the number and length of predicted enhancers. (a) The number of K562 and liver enhancers identified by each method varies over two orders of magnitude. There is considerable variation even among methods defined based on similar input data, e.g., histone modifications. (b) The length of K562 and liver enhancers identified by different methods shows similar variation. Enhancer lengths are plotted on a log10 scale on the y-axis. Data for other contexts are available in Table 1 and Additional file 1: Figure S1.
Fig. 3 Enhancer sets have low genomic overlap. (a) Pairwise bp enrichment values (log2 fold change) for overlap between each pair of K562 (upper triangle) or liver (lower triangle) enhancer sets, compared to the expected overlap between randomly distributed, length-matched regions. (b) The log2 enrichment for bp overlap compared to a random genomic distribution for each pair of enhancer sets within each context. Only methods with annotations in all biological contexts are included. The fold changes across annotations for the primary tissues (liver and heart) are significantly lower than for the cell lines (K562 and Gm12878) (p = 6.88E-11, Kruskal-Wallis test, followed by Dunn’s test with Bonferroni correction for pairwise comparisons between contexts). The patterns are similar for element-wise comparisons (Additional file 1: Figure S3). (c, d) The percent base pair (bp) overlap between all pairs of (c) K562 enhancer sets and (d) liver enhancer sets. Percent overlap for each pair was calculated by dividing the number of bp shared between the two sets by the total number of bp in the set on the y-axis. The highest overlap is observed for pairs based on similar input, e.g., machine learning models trained on the same functional genomics data, or for comparisons with large sets, e.g., ChromHMM. Comparisons between biological replicates average 76% overlap. (e, f) The Jaccard similarity between all pairs of (e) K562 or (f) liver enhancer sets. The upper triangle gives the Jaccard similarity, and the lower triangle gives the relative Jaccard similarity, in which the observed similarity is divided by the maximum possible similarity for the pair of sets.
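The overlap measures described in this caption can be illustrated on toy intervals. The sketch below is a simplified, single-chromosome version under stated assumptions: intervals are half-open `(start, end)` pairs, the helper names are hypothetical, and the maximum possible Jaccard similarity is taken to be the case where the smaller set lies entirely inside the larger one (smaller total bp divided by larger total bp). The caption does not spell out how that maximum was computed, so this definition is an assumption.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def total_bp(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))

def shared_bp(a, b):
    """Number of bases covered by both sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared

def percent_overlap(a, b):
    """Shared bp as a percentage of set a (the set on the y-axis)."""
    return 100 * shared_bp(a, b) / total_bp(a)

def jaccard(a, b):
    """Shared bp divided by the bp covered by either set."""
    inter = shared_bp(a, b)
    return inter / (total_bp(a) + total_bp(b) - inter)

def relative_jaccard(a, b):
    """Observed Jaccard divided by the assumed maximum (smaller set nested in larger)."""
    size_a, size_b = total_bp(a), total_bp(b)
    return jaccard(a, b) / (min(size_a, size_b) / max(size_a, size_b))

# Hypothetical toy sets on one chromosome
set_a = [(0, 100), (200, 300)]   # 200 bp
set_b = [(50, 150)]              # 100 bp, 50 bp shared with set_a
print(percent_overlap(set_a, set_b), percent_overlap(set_b, set_a))
print(jaccard(set_a, set_b), relative_jaccard(set_a, set_b))
```

Percent overlap is asymmetric (here 25% of `set_a` but 50% of `set_b`), which is why the caption specifies which set supplies the denominator. The relative Jaccard similarity corrects for the fact that two sets of very different size can never reach a Jaccard similarity of 1.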
Fig. 4 Enhancers have different levels of enrichment with functional attributes. (a) Enhancer sets vary in their degree of evolutionary conservation. Each point represents the enrichment (fold change compared to randomly shuffled regions) for overlap between conserved elements (combined primate and vertebrate PhastCons) and each enhancer set. Methods based on transcriptional assays and TF binding profiles (GRO-cap, FANTOM, p300, and Yip12) are the most enriched for conserved elements, while sets based on histone modification data alone are among the least enriched. (b) GWAS SNP enrichment among all enhancer sets for each biological context. All sets are significantly enriched, except FANTOM in the K562 and liver contexts, owing to its small sample size. (c) GTEx eQTL enrichment among all enhancer sets for each biological context. Transparent points indicate nonsignificant enrichment (p > 0.05).
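Enrichment as a fold change over randomly shuffled regions can be sketched with a simple permutation scheme. This is an illustration under assumptions, not the procedure used in the study: it works on one hypothetical chromosome, places each shuffled region uniformly at random while keeping its length, allows shuffled regions to overlap one another, and represents covered bases as Python sets, which is only practical at toy scale. All names and data below are hypothetical.

```python
import random

def covered(intervals):
    """Set of base positions covered by half-open (start, end) intervals."""
    return set().union(*(range(s, e) for s, e in intervals))

def shuffle(intervals, genome_len, rng):
    """Place each interval at a uniformly random start, keeping its length."""
    out = []
    for s, e in intervals:
        length = e - s
        new_start = rng.randrange(genome_len - length + 1)
        out.append((new_start, new_start + length))
    return out

def enrichment(enhancers, features, genome_len, n_shuffles=1000, seed=0):
    """Return (fold change over shuffled mean, empirical one-sided p-value)."""
    rng = random.Random(seed)
    feature_bases = covered(features)
    observed = len(covered(enhancers) & feature_bases)
    null = [
        len(covered(shuffle(enhancers, genome_len, rng)) & feature_bases)
        for _ in range(n_shuffles)
    ]
    expected = sum(null) / n_shuffles
    fold = observed / expected if expected > 0 else float("inf")
    # Add-one correction so the empirical p-value is never exactly zero
    p_value = (1 + sum(x >= observed for x in null)) / (1 + n_shuffles)
    return fold, p_value

# Hypothetical toy data: two conserved elements, three enhancers
conserved = [(1000, 1100), (5000, 5100)]
enhancers = [(950, 1150), (4990, 5110), (8000, 8200)]
fold, p_value = enrichment(enhancers, conserved, genome_len=10_000)
print(fold, p_value)
```

A realistic analysis would typically shuffle within chromosomes and exclude unmappable regions, and would use an interval library rather than per-base sets.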
Fig. 5 Enhancers identified by different methods differ in functional attributes. (a) The 9 kb region on human chromosome 1 containing genetic variants associated with LDL cholesterol levels and MI in GWAS and the causal SNP (rs12740374). Here, the region containing the causal SNP is predicted to be an enhancer by four of the seven methods. GWAS tag SNPs are colored in red and LD blocks are shown with a horizontal line. (b) The 60 kb region of human chromosome 9 containing loci associated with coronary artery disease (CAD) in GWAS. Two of the associated variants (rs10811656 and rs4977757) have been shown to contribute to CAD risk. However, the enhancer annotations in this region are generally non-overlapping and do not highlight either functional variant. (c) Few GWAS SNPs overlap an enhancer; the colored bars represent the number of methods that identified the region as an enhancer. The majority of these variants are not predicted to be in enhancers, and very few GWAS variants overlap enhancers from multiple methods. The conclusions are similar when considering variants in high LD (r² > 0.9) with the GWAS tag SNPs in liver (Liver LD; Additional file 1: Figure S8). The pattern is also similar when limiting to SNPs associated with liver- or heart-related phenotypes (Liver Specific, Heart Specific). When considering the SNP in each LD block with the maximum number of enhancer overlaps, there is still a large percentage of SNPs supported by no method or only one method (Liver Max). This demonstrates that the situation illustrated in panel (b) is very common. (d) Among all eQTL that overlap at least one enhancer, the majority are supported by only a single method. This holds for LD-expanded and context-specific sets (Liver LD, Liver Specific, Heart Specific; Additional file 1: Figure S8). Many variants remain unique to a single method, even when limiting to the variant in each LD block overlapping the maximum number of enhancer sets (Liver Max). These trends are similar to what is seen for GWAS SNPs in (c). (e) Enhancer sets from the same biological context have different functional associations. We identified Gene Ontology (GO) functional annotations enriched among genes likely to be regulated by each enhancer set using GREAT. The upper triangle represents the pairwise semantic similarity for significant molecular function (MF) GO terms associated with predicted liver enhancers. The lower triangle shows the number of shared MF GO terms in the top 30 significant hits for liver enhancer sets. Results were similar when using enhancer-gene target predictions from JEME (Additional file 1: Figures S9–10).
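The tallies in panels (c) and (d) rest on counting, for each variant, how many methods place it inside an enhancer, and the "Max" analysis keeps the best-supported variant per LD block. A minimal sketch of that bookkeeping follows; the method names, positions, and LD block assignments are all hypothetical, and everything is assumed to sit on one chromosome with half-open intervals.

```python
from collections import Counter

def n_supporting(pos, enhancer_sets):
    """Count methods with at least one enhancer containing position pos."""
    return sum(
        any(start <= pos < end for start, end in intervals)
        for intervals in enhancer_sets.values()
    )

# Hypothetical toy data, not taken from the study
enhancer_sets = {
    "methodA": [(100, 200), (500, 600)],
    "methodB": [(150, 250)],
    "methodC": [(550, 650), (900, 950)],
}
snps = {"snp1": 160, "snp2": 120, "snp3": 580, "snp4": 700}
ld_blocks = {"block1": ["snp1", "snp2"], "block2": ["snp3"], "block3": ["snp4"]}

# Number of methods supporting each SNP
support = {name: n_supporting(pos, enhancer_sets) for name, pos in snps.items()}

# Distribution of support levels across SNPs (the colored bars)
histogram = Counter(support.values())

# Best-supported SNP per LD block (the "Max" analysis)
block_max = {
    block: max(support[s] for s in members) for block, members in ld_blocks.items()
}
print(support, dict(histogram), block_max)
```

Taking the maximum per LD block is the most generous reading for each locus, so a large share of blocks with zero or one supporting method under this rule indicates that disagreement is not just an artifact of which tag SNP was reported.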
Top 5 Gene Ontology (Molecular Function) terms for liver enhancer sets from GREAT and JEME target-mapped WebGestalt enrichments
| Enhancer Set | GO MF Terms (GREAT) | GO MF Terms (JEME+WebGestalt) |
|---|---|---|
| H3K27acPlusH3K4me1 | cytoskeletal adaptor activity | small molecule binding |
| H3K27acPlusH3K4me1 | 14-3-3 protein binding | anion binding |
| H3K27acPlusH3K4me1 | leukotriene-C4 synthase activity | nucleoside phosphate binding |
| H3K27acPlusH3K4me1 | nucleobase-containing compound transmembrane transporter activity | nucleotide binding |
| H3K27acPlusH3K4me1 | FAD binding | transferase activity |
| H3K27acMinusH3K4me3 | 14-3-3 protein binding | oxidoreductase activity |
| H3K27acMinusH3K4me3 | cytoskeletal adaptor activity | anion binding |
| H3K27acMinusH3K4me3 | thyroid hormone receptor binding | small molecule binding |
| H3K27acMinusH3K4me3 | ARF guanyl-nucleotide exchange factor activity | nucleoside phosphate binding |
| H3K27acMinusH3K4me3 | high-density lipoprotein particle binding | nucleotide binding |
| DNasePlusHistone | cytoskeletal adaptor activity | small molecule binding |
| DNasePlusHistone | glucocorticoid receptor binding | anion binding |
| DNasePlusHistone | nucleobase-containing compound transmembrane transporter activity | transferase activity |
| DNasePlusHistone | high-density lipoprotein particle binding | nucleotide binding |
| DNasePlusHistone | 14-3-3 protein binding | nucleoside phosphate binding |
| ChromHMM | high-density lipoprotein particle binding | nucleotide binding |
| ChromHMM | nucleobase-containing compound transmembrane transporter activity | nucleoside binding |
| ChromHMM | cytoskeletal adaptor activity | purine nucleoside binding |
| ChromHMM | 14-3-3 protein binding | DNA binding |
| ChromHMM | retinoid X receptor binding | RNA binding |
| EncodeEnhancerlike | cytoskeletal adaptor activity | nucleotide binding |
| EncodeEnhancerlike | 14-3-3 protein binding | transferase activity |
| EncodeEnhancerlike | nucleobase-containing compound transmembrane transporter activity | small molecule binding |
| EncodeEnhancerlike | apolipoprotein A-I binding | anion binding |
| EncodeEnhancerlike | high-density lipoprotein particle binding | carbohydrate derivative binding |
| FANTOM | glucocorticoid receptor binding | structural constituent of ribosome |
| FANTOM | protein kinase binding | receptor binding |
| FANTOM | kinase binding | cell adhesion molecule binding |
| FANTOM | methylglutaconyl-CoA hydratase activity | molecular function regulator |
| FANTOM | vitamin D response element binding | transcription regulatory region DNA binding |
| Villar15 | protease binding | anion binding |
| Villar15 | phosphatidylinositol 3-kinase binding | small molecule binding |
| Villar15 | 14-3-3 protein binding | oxidoreductase activity |
| Villar15 | cytoskeletal adaptor activity | cofactor binding |
| Villar15 | glucocorticoid receptor binding | oxidoreductase activity, acting on CH-OH group of donors |
Fig. 6 The genomic and functional similarities between enhancer sets are not consistent. (a) Multidimensional scaling (MDS) plot of liver enhancer sets based on the Jaccard similarity of the genomic distributions (Fig. 3f). (b) MDS plot for liver enhancers based on distances calculated from molecular function (MF) Gene Ontology (GO) term semantic similarity values with GREAT (Fig. 5e). (c, d) Ranked hierarchical clustering of all liver enhancer sets based on (c) the Jaccard similarities of the genomic distributions compared to (d) clustering based on GO semantic similarity. FANTOM enhancers are the most distant from all other enhancer sets in both genomic and functional similarity, but the relationships between other sets are not conserved. Red branches denote identical subtrees within the hierarchy. (e) Hierarchical clustering based on genomic Jaccard distances for all contexts and methods with annotations in each context. (f) Hierarchical clustering of all available enhancer sets based on GO term distances. Terminal branches are colored by biological context. With the exception of FANTOM enhancers, the enhancer sets’ genomic distributions are more similar within than between biological contexts. Functional similarity does not always correlate with genomic similarity, and the clustering by biological context is weaker in functional space.
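Clustering enhancer sets from pairwise similarities, as in panels (c) to (f), requires converting each similarity to a distance and then merging the closest groups step by step. The sketch below uses distance = 1 − Jaccard similarity and average linkage; both choices are assumptions, since the caption does not name the linkage rule, and the set names and similarity values are hypothetical. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would usually replace this hand-written loop.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Returns the merge history as (cluster_1, cluster_2, distance) tuples.
    """
    def between(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], between(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical Jaccard similarities between four toy enhancer sets
similarity = {
    ("setA", "setB"): 0.6,
    ("setA", "setC"): 0.2,
    ("setB", "setC"): 0.3,
    ("setA", "outlier"): 0.01,
    ("setB", "outlier"): 0.02,
    ("setC", "outlier"): 0.01,
}
distance = {frozenset(pair): 1 - value for pair, value in similarity.items()}
merges = average_linkage(["setA", "setB", "setC", "outlier"], distance)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

Running the same procedure on genomic distances and on GO-term distances, then comparing the two merge histories, is one way to see whether relationships between sets are preserved across the two spaces.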
Fig. 7 Enhancers identified by multiple methods have little additional evidence of function. (a) Enrichment for overlap between conserved elements (n = 3,930,677) and liver enhancers stratified by the number of identification methods that predicted each enhancer. (b) Enrichment for overlap between GWAS SNPs (n = 20,458) and liver enhancers stratified by the number of identification methods that predicted each enhancer. (c) Enrichment for overlap between GTEx eQTL (n = 429,964) and liver enhancers stratified by the number of identification methods that predicted each enhancer. In (a–c), the average enrichment compared to 1000 random sets is plotted as a circle; error bars represent 95% confidence intervals; and n gives the number of enhancers in each bin. The only significant differences are found in the enrichment for evolutionary conservation (a), but the difference is modest in magnitude (1.36x for 1 vs. 1.62x for 6+). (d) Boxplots showing the distribution of confidence score ranks for FANTOM enhancers in liver partitioned into bins based on the number of other methods that also identify the region as an enhancer. Lower rank indicates higher confidence; note that the y-axis is flipped so the high confidence (low rank) regions are at the top. The lack of increase in enhancer score with the number of methods supporting it held across all methods tested (Additional file 1: Figures S14–17).