Florian Schmidt, Fabian Kern, Marcel H Schulz.
Abstract
BACKGROUND: Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter-enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organization of the chromatin paved the way to pinpoint long-range PEIs. Here, we investigated whether including PEIs in computational models for the prediction of gene expression improves performance and interpretability.
Keywords: Chromatin accessibility; Chromatin conformation; DNase1-seq; Gene expression prediction; Gene regulation; HiC; HiChIP; Machine learning
Year: 2020 PMID: 32029002 PMCID: PMC7003490 DOI: 10.1186/s13072-020-0327-0
Source DB: PubMed Journal: Epigenetics Chromatin ISSN: 1756-8935 Impact factor: 4.954
Fig. 1 Assignment of DNase1-seq peaks to genes. The different setups are illustrated for two genes g1 and g2. The colour code of peaks and the border colour of segments indicate to which gene a peak is assigned. Peaks with a striped filling are not assigned to any gene. a In a window-based annotation, peaks are linked to a gene if they are located within a window w centred at the 5′ transcription start site (TSS) of a gene of interest. denotes the set of all DHSs overlapping window w1 centred around the promoter of gene g1. b Peaks are linked to the nearest gene, defining nearest as the gene with the closest TSS in linear genomic distance. Here, refers to the set of all DHSs linked to gene g1 following the nearest gene approach. c Using HiC or HiChIP, secondary windows covering the distal regions linked to the TSS are considered in addition to the TSS window. For gene g1, two additional windows, v1 and v2, are considered, yielding the additional peak sets and
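The three annotation setups in Fig. 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the closed-interval overlap test, the use of the peak midpoint for the nearest-gene distance, and the representation of a loop as a pair of anchor intervals are all assumptions made for illustration, and coordinates are taken to lie on a single chromosome.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and a[1] >= b[0]

def assign_by_window(peaks, tss, window):
    """Panel a (assumed form): peaks overlapping a window of total size
    `window` centred at the TSS."""
    half = window // 2
    promoter = (tss - half, tss + half)
    return [p for p in peaks if overlaps(p, promoter)]

def assign_to_nearest_gene(peaks, tss_by_gene):
    """Panel b (assumed form): each peak goes to the gene whose TSS is
    closest to the peak midpoint in linear distance."""
    assigned = {gene: [] for gene in tss_by_gene}
    for start, end in peaks:
        mid = (start + end) // 2
        nearest = min(tss_by_gene, key=lambda g: abs(tss_by_gene[g] - mid))
        assigned[nearest].append((start, end))
    return assigned

def distal_windows(loops, tss, search_window):
    """Panel c (assumed form): for every loop with one anchor overlapping the
    search window around the TSS, return the other anchor as a secondary window."""
    half = search_window // 2
    promoter = (tss - half, tss + half)
    distal = []
    for left, right in loops:
        if overlaps(left, promoter):
            distal.append(right)
        elif overlaps(right, promoter):
            distal.append(left)
    return distal

# Hypothetical toy data: three peaks, two genes, one loop.
peaks = [(900, 1100), (4000, 4200), (9000, 9300)]
tss_by_gene = {"g1": 1000, "g2": 8000}
loops = [((0, 2000), (50000, 52000))]

window_peaks = assign_by_window(peaks, tss_by_gene["g1"], 3000)
nearest_peaks = assign_to_nearest_gene(peaks, tss_by_gene)
secondary = distal_windows(loops, tss_by_gene["g1"], 5000)
```

With this toy input, the window-based annotation links only the first peak to g1, the nearest-gene annotation links the first two peaks to g1 and the third to g2, and the loop contributes one secondary window for g1.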
Different combinations of features evaluated in this study
| Name | Considered peak features | Considered TF features | Annotation |
|---|---|---|---|
| Promoter: peaks | Peak length, Peak count, Peak signal | | Window; Nearest gene; ChromHMM |
| Promoter + HiC: peaks; Promoter + HiChIP: peaks | Peak length, Peak count, Peak signal; Peak length, Peak count, Peak signal | | Window + HiC; Window + HiChIP; ChromHMM |
| Promoter + HiC: C peaks; Promoter + HiChIP: C peaks | Peak length, Peak count, Peak signal | | Window + HiC; Window + HiChIP; ChromHMM |
| Promoter: peaks + TFs | Peak length, Peak count, Peak signal | Affinities in promoter DHS | Window; ChromHMM |
| Promoter + HiChIP: peaks + TFs (EF) | Peak length, Peak count, Peak signal; Peak length, Peak count, Peak signal | Affinities in promoter DHS; Affinities in loop DHS | Window + HiChIP; ChromHMM |
Fig. 6 Protein–protein association network obtained from the database illustrating interactions among YY1 and TFs selected as predictors in gene expression models for K562, Jurkat, and HCT116 in promoter and distal loop sites
Fig. 2 The performance of gene expression prediction models measured in terms of Spearman correlation on hold-out test data is shown for various models using peak length, peak count, and peak signal within the gene promoter regions. Two different window sizes (, ) and the nearest gene approach are compared. We observe that the models outperform the models. Considering the models, there is a slight advantage of the window-based models over the nearest gene-based annotation
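The evaluation metric named in the captions, Spearman correlation between predicted and measured expression on hold-out data, can be sketched in plain Python as the Pearson correlation of average ranks. This is a generic textbook formulation, not code from the study; the function names and the toy expression values are hypothetical.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical hold-out genes: measured vs. predicted expression.
measured = [0.2, 1.5, 3.1, 4.8]
predicted = [0.1, 0.9, 2.0, 7.5]
rho = spearman(measured, predicted)  # both orderings agree, so rho is 1.0
```

Because the metric depends only on ranks, a model is rewarded for ordering genes correctly by expression even when the predicted values are on a different scale.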
Fig. 3 a The observed over expected ratio for the overlap between DHSs and HiC regions is shown for different cell lines and different HiC resolutions. b Percentage of HiChIP contacts overlapping with a DHS (yellow) compared to those not overlapping a DHS (blue). All HiChIP samples have a resolution of 5 kb. c The bar plot shows the number of protein-coding genes that overlap a HiC loop using a LW of 25 kb and a HiC resolution of 10 kb. d Analogous to c but with HiChIP data and a promoter search window of 5 kb. For c and d, overlaps with DHSs are not considered
Fig. 4 Model performance measured in Spearman correlation on hold-out test data considering chromatin contacts derived from a HiC data using two different search windows (25 kb and 50 kb) and b HiChIP data. We considered two different promoter windows, a 3 kb and a 50 kb window. For HiC, the 50 kb promoter windows led to better models than the 3 kb windows, whereas for HiChIP data it is the other way around. Significance is assessed using a Wilcoxon test using the promoter model as the reference group (, **, *, ns: )
Fig. 5 a Model performance assessed in terms of Spearman correlation on hold-out test data for models including TF-gene scores computed in the promoter and in the distal enhancers. Generally, including TF predictions improves model performance compared to considering only peak features. Significance is assessed using a Wilcoxon test using the promoter model as the reference group (****, **, *, ns : ). b UpSet plot showing the relationship between TFs with non-zero regression coefficients inferred by the extended feature space models. c Ranking of the top 20 TFs by their absolute regression coefficients for each cell line. The colour code indicates the mean scaled regression coefficient of the TFs computed in a ten-fold outer cross-validation