Literature DB >> 35749574

MaxHiC: A robust background correction model to identify biologically relevant chromatin interactions in Hi-C and capture Hi-C experiments.

Hamid Alinejad-Rokny^1,2,3, Rassa Ghavami Modegh⁴, Hamid R Rabiee⁴, Ehsan Ramezani Sarbandi⁴, Narges Rezaie⁵, Kin Tung Tam¹, Alistair R R Forrest¹.

Abstract

Hi-C is a genome-wide chromosome conformation capture technology that detects interactions between pairs of genomic regions and exploits higher order chromatin structures. Conceptually Hi-C data counts interaction frequencies between every position in the genome and every other position. Biologically functional interactions are expected to occur more frequently than transient background and artefactual interactions. To identify biologically relevant interactions, several background models that take biases such as distance, GC content and mappability into account have been proposed. Here we introduce MaxHiC, a background correction tool that deals with these complex biases and robustly identifies statistically significant interactions in both Hi-C and capture Hi-C experiments. MaxHiC uses a negative binomial distribution model and a maximum likelihood technique to correct biases in both Hi-C and capture Hi-C libraries. We systematically benchmark MaxHiC against major Hi-C background correction tools including Hi-C significant interaction callers (SIC) and Hi-C loop callers using published Hi-C, capture Hi-C, and Micro-C datasets. Our results demonstrate that 1) Interacting regions identified by MaxHiC have significantly greater levels of overlap with known regulatory features (e.g. active chromatin histone marks, CTCF binding sites, DNase sensitivity) and also disease-associated genome-wide association SNPs than those identified by currently existing models, 2) the pairs of interacting regions are more likely to be linked by eQTL pairs and 3) more likely to link known regulatory features including known functional enhancer-promoter pairs validated by CRISPRi than any of the existing methods. We also demonstrate that interactions between different genomic region types have distinct distance distributions only revealed by MaxHiC. MaxHiC is publicly available as a python package for the analysis of Hi-C, capture Hi-C and Micro-C data.

Entities: Chemical

Mesh：

Substances：
Chromatin

Year: 2022 PMID： 35749574 PMCID： PMC9262194 DOI： 10.1371/journal.pcbi.1010241

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.779

This is a PLOS Computational Biology Methods paper.

Background

Hi-C is a genome-wide chromosome conformation capture technology used to identify interactions between pairs of genomic regions. Hi-C libraries consist of a pair of reads which can be mapped to the genome for identifying pairs of interacting regions. The data is typically summarized as a contact matrix, in which each row and column correspond to a (fixed-bin) fragment of the genome and each cell shows the number of reads supporting an interaction between the two bins. Hi-C data has multiple biases due both to the multiple steps used in the protocol and the proclivity of closely located genomic regions that are more likely to interact. Not all interactions are biologically relevant and thus the important challenge in the analysis of Hi-C data is to distinguish between background/artefactual interactions and those interactions that are more likely to be functional/regulatory. There are two main approaches used to identify significant interactions in Hi-C data. The first, as used in Hi-C-DC [1], attempts to quantify known sources of biases (fragment length, GC content and mappability), and use them as variables in a background expectation function. The second approach, as used in Fit-Hi-C [2], CHiCAGO [3] and GOTHiC [4], assumes that all Hi-C biases are reflected in the read count data of interactions. Therefore, the observed or normalised read counts are used to assign a P-value to interactions. In order to account for unknown sources of bias a separate bias parameter is assumed for each bin. The tools using the first approach are limited to analysis of Hi-C libraries generated with restriction enzymes, whereas those using the second approach can also work with DNase Hi-C experiments in which the fragment sizes are unknown. Rao et al. [5] also proposed HiCCUPS, which is not a background correction model, but can be used to identify chromatin loops. The term chromatin loop refers to the interaction between promoters and enhancers, genes or architectural loops, or polycomb-mediated regions [6]. Loop anchors usually happen at domain boundaries and bind CCCTC-binding factor (CTCF) sites. Chromatin loops can be detected as enriched Hi-C interacting regions compared to their neighbourhood regions. HiCCUPS identifies clusters of Hi-C contact matrixes and reports the centroid of the cluster, in which the frequency of the contact matrix is significantly higher than the local background. Using a local background and reporting only the centroid of a cluster results in identification of a small number of significant Hi-C interactions for a small range of distances ~50kb to ~5000kb. Peakachu [7], and Mustache [8] two other loop callers use machine learning techniques to identify chromatin loops. Peakachu builds loop-classifying models by using Random Forest classification techniques to predict chromatin loops from Hi-C data. Mustache uses a new computer vision technique, scale-space theory, to predict blob-shaped objects in Hi-C and Micro-C contact data. We introduce MaxHiC (Maximum Likelihood estimation for Hi-C), a negative binomial model that uses a maximum likelihood technique to correct the complex combination of known and unknown biases in both Hi-C and capture Hi-C libraries. Benchmarking against current leading background correction models (GOTHiC, CHiCAGO, Fit-Hi-C, Fit-Hi-C2 [9], HiC-DC+ and loop callers (HiCCUPS, Peakachu [7], and Mustache [8].) shows that MaxHiC identifies interactions that contain substantially more evidence of biological relevance. Interacting regions identified by MaxHiC are significantly (P-value < 0.05) enriched for H3K27ac, H3K4me1, CTCF binding sites, DNase hypersensitive sites, and GWAS SNPs. They are also more likely to be annotated as promoters, enhancers, and insulator elements. More importantly the interacting pairs are substantially enriched for interactions between regulatory regions and non-coding polymorphisms linked by eQTL to target genes. MaxHiC is an open-source python package and is available here https://github.com/bcb-sut/MaxHiC.

Results

A negative binomial regression background model for Hi-C data

Here we present MaxHiC, a novel background correction model for identifying statistically significant interacting regions in both general and capture Hi-C libraries. In MaxHiC, the genome is divided into fixed size non-overlapping bins, and the read counts supporting interactions between bins are tested for significance. The read-count of interactions are assumed to follow a negative binomial distribution which is widely used to model over dispersed count data [10, 11]. The mean parameter of the negative binomial distribution for each interaction is calculated as a function of the bias factors of its two ends and their genomic distance. It is known that the contact frequency between pairs of regions linked by Hi-C reads decays as a function of distance along the chromosome [12]. A large fraction of Hi-C reads represents background ligation products of nearby genomic regions. In MaxHiC, distance is modelled by a function that decreases at increasing genomic distances to reach a small but constant non-zero value to account for background ligations. For trans-interactions, we use the same constant value as observed for distant cis interactions. Bias factors of bins are calculated as a function of their total read-count and the mappability of their surrounding bins. All of the parameters of the model are learned by maximizing the logarithm of likelihood of the observed interactions using the ADAM algorithm [13]. The model is trained in multiple iterations; in each iteration, putative interactions with the best P-values are identified and set aside. These significant interactions are ignored in the next phase of training to prevent the trained background model from becoming biased [2]. The same approach is also used in MaxHiC to analyze Capture-Hi-C data. The only difference is that separate sets of parameters are learned for modelling target-target, target-non-target and non-target-non-target interactions to account for the fact that interactions involving one or two of the targeted regions will be enriched to different levels compared to non-targeted regions. A schematic of the MaxHiC model for analysis of general Hi-C is shown in S1 Fig and a detailed explanation of both models is provided in the methods section.

MaxHiC vs other leading Hi-C tools

In order to evaluate the performance of MaxHiC, we performed a comprehensive benchmarking between MaxHiC and Hi-C significant interactions callers (SIC) including GOTHiC, Fit-Hi-C, Fit-Hi-C2, HiCDC+ [14], and CHiCAGO (for capture Hi-C only). We also compared the performance of MaxHiC with Hi-C loop callers including HiCCUPS [5], Peakachu [7], and Mustache [8]. We used several published Hi-C and capture Hi-C datasets to perform the analyses (see more details about the samples in ). We first used MaxHiC to identify significant interactions (at 1kb, 5kb and 10kb bin sizes) in two publicly available Hi-C data from Rao et al. [5] (on the GM12878 EBV transformed B-lymphoblastoid cell line and human mammary epithelial cells (HMEC)) and one set of capture Hi-C data from Mifsud et al. [15] (also on GM12878). For each library we also used the above-mentioned background correction models and loop callers with their default parameters to identify significantly interacting regions. Tables 1 and S2 summarise the numbers of significantly interacting pairs found using each method. Notably the number of interactions called as significant by each method varied greatly (from 10,487 using HiCCUPS, to 8,136,100 using GOTHiC on GM12878).

Table 1

Statistical summary of significant interactions identified by MaxHiC, GOTHiC, Fit-Hi-C, Fit-HiC2, HiCDC+, HiCCUPS, Peakachu, and Mustache in the GM12878 Hi-C library from Rao et al. [5]. at bin-sizes 5kb and 10kb.

Statistical summaries for HMEC library are provided in , respectively.

1kb	MaxHiC	GOTHiC	Fit-Hi-C	Fit-Hi-C2	HiCDC+	HiCCUPS	Peakachu	Mustache	All
percentage of significant interactions	0.03	9.4	16.6	17.1	0.05	-	-	-	100
average read-count of interactions	18.1	2.9	1.1	1.5	11.8	-	-	-	1.2
min read-count of cis interactions	4	2	1	1	4	-	-	-	1
median distance of cis interactions (kb)	3	14	30832	29426	5	-	-	-	544
5kb
percentage of significant interactions	0.17	9.25	0.8	0.9	0.07	0.0095	0.0172	0.0091	100
average read-count of interactions	21.3	7.85	5.6	5.9	16.7	32.39	28.49	36.16	1.77
min read-count of cis interactions	3	2	1	1	3	11	10	14	1
median distance of cis interactions (kb)	130	85	5235	5014	245	110	145	165	15173
10kb
percentage of significant interactions	0.32	7.32	1.06	1.1	0.11	0.0094	0.0163	0.0066	100
average read-count of interactions	29.4	15.17	12.3	13.5	26.6	58.62	46.37	62.74	2.38
min read-count of interactions	3	2	2	2	5	11	9	16	1
median distance of cis interactions (kb)	270	170	7590	7426	480	150	230	330	6990

* HiCCUPS, Peakachu, and Mustache calls interactions with FDR < 0.05 as significant. For all other methods we used a threshold of P-value < 0.001 to identify significant interactions.

Statistical summary of significant interactions identified by MaxHiC, GOTHiC, Fit-Hi-C, Fit-HiC2, HiCDC+, HiCCUPS, Peakachu, and Mustache in the GM12878 Hi-C library from Rao et al. [5]. at bin-sizes 5kb and 10kb.

Statistical summaries for HMEC library are provided in , respectively. * HiCCUPS, Peakachu, and Mustache calls interactions with FDR < 0.05 as significant. For all other methods we used a threshold of P-value < 0.001 to identify significant interactions. As the p-values from all models may not be comparable, we ranked interactions identified by each tool, based on their reported p-value and then took the top 100K and 20K interactions reported by each tool. S2 and S3 Figs show Venn diagrams of the top 100K and 20K interacting pairs identified by each method at 10kb bin sizes. The sets of interactions called as significant or insignificant by each method ( for sample GM12878) had distinct distance distributions. At 10kb bin size, GOTHiC and HiCCUPS had the smallest median distances between interacting regions of 170kb, while that for MaxHiC, Fit-Hi-C, Fit-Hi-C2 and HiCDC+ were 270kb, 7,590kb, 7,426kb and 480kb respectively. The low median length observed for GOTHiC is reflected in the observation that it calls almost all putative interactions below 100kb as significant.

Background modelling as a function of distance in MaxHiC and other HiC tools.

a) Average read-count of significant (blue) and insignificant interactions (red) at different genomic distances identified by different Hi-C interaction callers at 10kb bin size on the Rao et al. [5] GM12878 data. b) Number of significant interactions identified by the five Hi-C interaction callers in different genomic distances on Rao et al. GM12878 sample at fragment size 10kb. Comparing the interactions called as significant by each method revealed interactions identified by MaxHiC had substantially more read support than those identified by all SIC models. Importantly, the number of significant interactions identified by MaxHiC is almost two times more than HiCDC+, however, the average read count of significant interactions in MaxHiC is much higher than HiCDC+ (Tables 1 and S2). Both the average number of reads and minimum number of reads supporting a significant interaction were higher for those called by MaxHiC than other models (Tables 1 and S2). We also compared the average read count of MaxHiC’s interactions with the loop caller methods. As Tables 1 and S2 show the average read count of MaxHiC’s interactions is smaller than loop callers which was expected as loop callers only report the average signals from selected regions of interest (the centroid of the clusters, CTCF loops, H3K27ac loops, etc), in which the reported frequency for the selected regions is either the total number of interactions in the local cluster or the maximum frequency of all interaction in the local cluster. We then compared ranked interactions identified by MaxHiC based on their reported P-value and compared the average read count of top 20k MaxHiC’s interaction with loop callers at 10kb bin size. Interestingly the average read count of significant interactions for MaxHiC (68.6) was higher than the loop callers (HiCCUPS: 58.6, Peakachu: 46.4, Mustache: 62.7). By examining the average read count of significant and insignificant interactions called by each SIC method ( for GM12878), it is revealed that at all distances, significant interactions called by MaxHiC and HiCDC+ had substantially more reads than those called as insignificant. GOTHiC only demonstrated this trend at distances above 100kb. At all distances, paired regions called as significantly interacting by MaxHiC had higher average read count support than the other five methods. This was also reflected in the minimum number and average number of reads required to call an interaction as significant (S4 Fig). We observed a similar pattern when considering interactions calculated using different fragment sizes (S5 Fig shows the same plots using 1kb, 5kb, 10kb, and 50kb bin sizes).

Evidence of regulatory potential for interacting regions identified by MaxHiC

We next investigated whether interacting regions identified by each method as significant were enriched for genomic and epigenetic annotations indicative of regulatory regions. Comparison of the interacting regions with histone mark profiles generated for GM12878 and HMEC by the ENCODE consortium [16-18] revealed the regions identified by MaxHiC were significantly (P-value < 0.05) enriched for H3K9ac, H3K27ac, H3K79me2, H3K4me3, H3K4me1, and H4K20me1 marks (). They were also enriched for DNAse hypersensitive sites and CTCF binding but not for the repressive mark H3K27me3. Importantly there was very little if any enrichment of these features in regions called as significant by GOTHiC, Fit-Hi-C, and Fit-Hi-C2. There were also enrichments for HiCDC+’s regions, however, these enrichments were less substantial than MaxHiC’s regions. Interestingly MaxHiC’s regions had higher enrichments than regions identified by the loop callers Peakachu and Mustache but had a similar level of enrichment with HiCCUPS. Notably, MaxHiC identified 34-fold more interactions.

Enrichment of regulatory potential in regions identified by MaxHiC and other tools.

a) Enrichment of GM12878 specific histone marks, putative regulatory regions and GWAS signal in regions identified by MaxHiC in the Rao et al. [5] GM12878 dataset. b) As in (a), but enrichment of signals in the HMEC dataset are shown. ENCODE chromatin states are from GM12878 and HMEC data respectively. For FANTOM5, enrichment is shown for all FANTOM5 promoters and enhancers and for those only active in GM12878 and HMEC. GWAS SNPs for autoimmune disease, breast and kidney/liver/lung disorders () were downloaded from Maurano et al. [29]. Note: HiCCUPS calls interactions with FDR < 0.05 as significant. For all other methods we used a threshold of P-value < 0.001 to identify significant interactions. We also used MaxHiC to investigate the enrichment of other important histone markers, such as H3K4me1, H3K4me2, H3K4me3, H3K27me3, H3K9me3 in the significant interactions. As shows all histone marks used in the analysis, were significantly enriched (P-value < 0.05) for the significant interactions identified in GM12878, except H3K27me3, H3K9me3, and H4K20me1. To ensure that MaxHiC’s performance is not affected by the number of statistically significant chromatin interactions used in the analyses, we ranked interactions identified by each tool, based on the reported p-value, and then took the top 10K, 20K, 50K, 100K interactions reported by each tool. We then repeated the enrichment analysis. As S7 Fig shows MaxHiC’s interactions have the highest enrichment for regulatory features compared to all other tools. For the GM12878 analysis we also observed significant enrichment (P-value < 0.05) of transcription factor binding sites from ENCODE ChIP-seq data (S8 Fig and S3 Table). Again, MaxHiC had much higher enrichment than SIC models and the loop callers Peakachu, and Mustache while it had a similar level of enrichment with HiCCUPS for GM12878 specific transcription factor binding sites. Together this suggests that both MaxHiC and HiCCUPS identify interactions between regions with regulatory potential however MaxHiC identifies substantially more (34-fold). Next, the interacting regions identified by MaxHiC were significantly enriched (P-value < 0.05) for ENCODE annotated enhancers, promoters, insulators and transcription states [19, 20] (Figs , and for top 10k, 20k, 50k, 100kinteractions analyses). They were also significantly depleted (P-value < 0.05) of regions annotated as heterochromatin. Again, HiCCUPS showed similar levels of enrichment while Fit-Hi-C and Fit-Hi-C2, HiCDC+ showed some lower level of enrichment for some of the chromatin states. Besides the regulatory regions predicted by ENCODE, significant interactions identified by MaxHiC also contained a much higher fraction of promoters and enhancers identified by FANTOM5 as active in GM12878 and HMEC (in comparison to all FANTOM5 promoters and enhancers) () indicating that MaxHiC not only enriches for promoters and enhancers but those active in the cell type from which the Hi-C data was generated. Importantly, interacting regions identified by MaxHiC and HiCCUPS on the GM12878 dataset were significantly (P-value < 0.05) more likely to overlap autoimmune SNPs identified in GWAS while those identified in the HMEC were significantly (P-value < 0.05) more likely to overlap breast trait associated SNPs ( GWAS traits considered are provided in S4 Table). No such enrichment was observed for SNPs associated with unrelated traits such as kidney, lung, and liver traits. Lastly, we compared interactions found by MaxHiC to those from the next top SIC method Fit-HiC2 (based on top 10k, 20k, 50k, and 100k interactions analyses) and the top loop caller method, HiCCUPS. In the comparison to Fit-Hi-C2, the interactions found by MaxHiC alone were significantly enriched (P-value < 0.05) for interactions between regulatory regions whereas those found by Fit-Hi-C2 alone showed no enrichment and were near the background expectation of 1 (S9A Fig). In contrast the MaxHiC and HiCCUPS -specific interactions were enriched to similar levels (S9B Fig). S10 Fig shows an example of a Hi-C interaction identified by MaxHiC as significant, which was not identified as a significant interaction by other tools. Both sides of the interaction have strong signals of regulatory features. (S10 Fig).

MaxHiC enriches for interactions between regulatory regions

We next examined the fractions of interacting pairs identified by each method involving regulatory elements at both ends. We used GM12878 related ChromHMM predicted chromatin segments from ENCODE. We specifically focused on promoter-enhancer, promoter-promoter, enhancer-enhancer and insulator-insulator pairs. In the GM12878 dataset approximately 12% of the raw unfiltered pairs fell into at least one of the above categories (). Strikingly, the significant interactions identified by MaxHiC were almost 4 times (47%) more likely to fall into one of these four categories (). Notably ~17% of the pairs corresponded to promoter-enhancer interactions, ~15% enhancer-enhancer, ~8% promoter-promoter and ~7% to insulator-insulator interactions. Similar levels of enrichment were observed for HiCCUPS and Peakachu but as noted above, HiCCUPS identified almost 34-fold fewer significant loops. HiCDC+’s interactions were also significantly enriched for promoter-enhancer, promoter-promoter, enhancer-enhancer and insulator-insulator pairs. In contrast, the other methods had distributions similar to the unfiltered pairs with Fit-Hi-C, Fit-Hi-C2 and GOTHiC showing fractions slightly above that of the unfiltered pairs. Extending this to any interaction involving an enhancer, promoter or insulator increased the fraction of annotated pairs to 70%. Similar fractions were observed with the HMEC data.

MaxHiC enriches for interactions between regulatory regions, eQTL-target pairs and known enhancer-promoter pairs.

a) Fraction of regulatory elements in the significant interactions identified by the four models. b) Enrichment of cell line specific chromatin pairs, chromatin state pairs, and eQTL pairs that overlapped with both ends of the significant interactions identified by the six models at fragment size 10kb. The enrichment is calculated by dividing the fraction of features overlapping both ends of the significant interactions identified by each model by the fraction of features overlapping both ends of all interactions (see ). We also repeated this analysis for the top 100K and 20K interactions identified by each model. Interestingly, for the top 100K interactions, MaxHiC had higher enrichment than all SIC models and loop callers in both GM1278 and HMEC cell lines. This enrichment was much higher in the top 20K interactions analysis (S11 Fig). In the top 20K interactions on HMEC cell line, 99%) of MaxHiC’s interactions were annotated as enhancers, promoters or insulators. shows the relative enrichment for interactions involving pairs of chromatin features and predicted chromatin states. Similar to we observed enrichment of interactions involving active chromatin states and depletion of heterochromatin-heterochromatin pairs using the interactions called as significant by MaxHiC and HiCCUPS and to a much lesser extent with Fit-Hi-C and Fit-Hi-C2. HiCDC+ also had a significant level of enrichment for interactions involving pairs of chromatin features, but less compared to MaxHiC. We next used expression quantitative trait loci (eQTL) pairs, identified in EBV-transformed lymphocytes as part of the Genotype-Tissue Expression (GTEx) Project [21] to examine enrichment of eQTL SNPs in regions that interact with promoters. Approximately 2.1% of the interactions identified by MaxHiC in the GM12878 cell line overlapped at least one eQTL pair. This corresponded to a 15.7-fold enrichment (). In contrast GOTHiC, Fit-Hi-C’, Fit-Hi-C2, and HiCDC+’s interactions were lower (8.3, 7.9, 9, and 9.6-fold enriched respectively). Interestingly, the enrichment of eQTL pairs in MaxHiC’s interaction was slightly higher than HiCCUPS and much higher than Peakachu and Mustache. Matching analyses on HMEC confirmed similar enrichments for regulatory features and eQTLs from mammary tissue ().

Genomic spacing of interacting regulatory elements

Using the putative regulatory region annotations from the analysis above, we next examined the spacing preferences for interactions between different regulatory features. We first examined preferred spacing for H3K27ac-H3K27ac and CTCF-CTCF structural loops. For MaxHiC, Fit-Hi-C, and Fit-Hi-C2 we observed the spacing between CTCF pairs to be larger (130kb) than H3K27ac-H3K27ac pairs (80kb) (). In both cases there were long tails reaching beyond 1Mb. Performing the same analysis on annotated regulatory regions revealed promoter-promoter, enhancer-promoter, and enhancer-enhancer pairs had almost identical spacing preferences (with median spacings of 70kb, 70kb and 90kb respectively), while insulator-insulator pairs were considerably further apart with a median spacing of 170kb (). We did not see this trend for HiCDC+ and loop callers, however we note HiCCUPS did not identify significant interactions below 50kb and above 5Mb.

Distance distribution for interactions between different classes of regulatory elements.

a) Comparison of spacing observed for H3k27ac-H3k27ac and CTCF-CTCF structural loops in the interactions identified by MaxHiC in the Rao et al. GM12878 dataset. b) Comparison of spacing observed for promoter-promoter, enhancer-promoter, enhancer-enhancer and insulator-insulator pairs in the interactions identified by MaxHiC in the Rao et al. [5] GM12878 dataset. Plots use fragment bin-size 5kb and window size 5kb.

MaxHiC contact maps and aggregate peak analyses

Comparing contact maps of significant interactions found for MaxHiC and HiCCUPS models (S12 Fig) shows at first glance they are very similar, however in a zoomed in region we can see additional interactions called by MaxHiC that are missed by HiCCUPS. Examining this in more detail revealed as approximately 8-fold more bases are identified as significantly interacting by MaxHiC than HiCCUPS (S13 Fig). However, to further show that these interactions are not random we provide an aggregate peak analysis (APA) comparing signals surrounding significant peaks called by MaxHiC and HiCCUPS. shows the aggregate background interaction frequency of 5 bins upstream and downstream of the significantly interacting bins. APA plots are shown for the top 5k, and 10k. As the figure shows, MaxHiC and HiCCUPS have almost the same APA patterns and confirmed that the significant interactions identified by MaxHiC are not random and that even when all 352k significant interactions identified by MaxHiC are considered there is strong focal enrichment in the APA. To demonstrate application of MaxHiC to a very large dataset we also analysed a Micro-C dataset [22] consisting of ~480M raw valid interactions. The contact map for this dataset is shown in . Reassuringly matching aggregate peak analysis plots confirm there is strong focal enrichment for these significant interactions .

Aggregate peak analysis (APA) of significant interactions called by MaxHiC and HiCCUPS.

Interactions were called at 10kb resolution using the GM12878 Hi-C dataset from Rao et al. The analysis shows the aggregate background interaction frequency of 5 bins upstream and downstream of the significantly interacting bins. For HiCCUPS the bins at the centroid of the loop regions are used. For HiCCUPS the APA for all 5,348 interactions is shown. For MaxHiC, APA plots are shown for the top 5k, 10k, 50k and all 243k significant interactions.

Application of MaxHiC to identify differentially interacting regions

To further demonstrate the use of MaxHiC in comparative Hi-C experiments we applied it to Hi-C datasets from cells depleted of the cohesion release factor WAPL (Haarhuis et al. [23]) and the polycomb protein RING1B cells (Boyle et al. [24]); and their matched wildtype controls. In the case of WAPL-depleted HAP1 cells MaxHiC identified almost 3 times more significant interactions than in matched wildtype cells (). Furthermore the distance distribution of interacting genomic regions found in the ΔWAPL cells (median = 6.4Mb) was significantly longer in comparison to wildtype (median = 3.1Mb) (); this reiterates the finding of Haarhuis et al. [23].

WAPL restricts chromatin loop extension.

HiC Contact map comparing significant interactions identified in WT (a) and ΔWAPL HAP1 (b) cells at 10kb fragment size. Black pixels are significant interactions identified by MaxHiC at P-value <0.001. (chr7: 25M-31M). c) Distance distribution of significant interactions identified by MaxHiC (P-value <0.001, 10kb bins) in WT and ΔWAPL HAP1 cells. The number of significant interactions identified in HAP1 WT is 283,028 and the number of significant interactions identified in HAP1 WAPL is 761,820. In the case of the RING1B depleted cells, MaxHiC was able to re-identify differentially interacting regions identified in the original paper ( replicate Fig 5A from Boyle et al. [24]). We also observed that wildtype cells have a slightly larger median distance between interactions (325kb for WT) than the knockout cells (258kb for RING1B-/-) (). Lastly we examined enrichment for repressive H3K27me3 and RING1B sites (using ChIP-seq data from wildtype cells [25]) in interacting regions found in wild type cells and RING1B-/- cells. In all cases the enrichment was close to the genome wide background expectation of 1, however the interactions found in the knockout were slightly more enriched than the wildtype, which suggests the sites are more open for interactions after removal of RING1B ().

Fig 5

Aggregate peak analysis (APA) of significant interactions called by MaxHiC and HiCCUPS.

MaxHiC identifies more validated pairs of element-gene pairs tested by CRISPRi

Lastly, as significant interactions identified by MaxHiC were enriched for enhancer-promoter interactions we sought to test whether these interactions are enriched for functional regulatory potential. To do this we asked whether interactions identified by MaxHiC and alternative tools run on a K562 Hi-C dataset from Rao et al. [5] could connect a set of 134 functional regulatory element-target gene pairs (involving 103 CRISPRi targeted regions) identified in K562 cells [26]. 102 of the 103 CRISPRi targeted regions and 57 of the 134 element-target gene pairs are present in the raw Hi-C data (182M total interactions). MaxHiC identified approximately 275K significant interactions, amongst which 16% (21 of 134) of the CRISPRi element-target gene pairs were captured (). The next best performing tool Fit-HiC2 only identified 10 of 134 pairs, HiCCUPS identified 1 and Mustache, HiCDC+ and Peakachu failed to identify any of the pairs. From this we conclude that MaxHiC identifies functionally relevant regulatory interactions.

Discussion

Enhancers have important roles in controlling gene expression. Linking distal regulatory regions such as enhancers to the genes that they regulate is critical to the interpretation of regulatory variants identified in whole genome data. Importantly, genomic variants in enhancers and regulatory elements have been linked to human complex diseases [27, 28]. Therefore, pairing enhancer regions to their regulatory targets is crucial for diagnosis of genetic diseases. Compared to simplistic assignment of enhancers to their nearest promoters, computational methods have improved the prediction of enhancer-promoter pairs [29-32]. However, physical interaction data is still far more convincing in validating the pairs. Hi-C data offers one approach to identify these links, however Hi-C sequencing libraries are extremely noisy because of self-ligations, background ligation artifacts and other complex sources of biases such as GC content and mappability of the sequences. Given this, accurate identification of significantly interacting regions is dramatically affected by the tool used. Here, we have presented MaxHiC, a new tool for identifying significantly interacting regions from Hi-C data (and capture Hi-C data see ). Through a dramatically improved discrimination of biologically relevant interactions from background, we showed that MaxHiC enriches for interacting regions with substantially more evidence of regulatory potential (Figs ) than those identified by existing Hi-C analysis tools [2–4, 9]. This is a critical advance as enhancer-promoter interaction maps are needed to interpret the impact of non-coding variants identified in intergenic regions. Importantly CRISPRi data were able to confirm that MaxHiC enriches for functional enhancer-promoter relationships. From our analyses, only MaxHiC and HiCDC+ (and to a much lesser extent Fit-Hi-C2) enrich for interactions between regulatory regions, but MaxHiC regions, had much higher enrichment than HiCDC+. Specifically, interactions identified by MaxHiC were enriched for regulatory features, disease-associated genome-wide association SNPs, eQTL pairs and experimentally validated enhancer-promoter interactions. Although interacting regions identified by loop callers had similar levels of enrichment MaxHiC found 34-fold more significantly interacting regions. Our top‘k’ (k:10k, 20k, 50k, and 100k) interactions analyses showed that MaxHiC significantly outperformed not only SIC models, but also loop callers. Additionally, loop callers such as HiCCUPS appears to have limitations on the distances at which interactions can be found and the bin sizes possible. They also need very deep Hi-C resolution and substantial computational resources. From this observation we conclude that MaxHiC is a substantially better tool than existing Hi-C analysis packages. We also implemented a capture version of MaxHiC to analyse capture Hi-C experiments; in the capture Hi-C technology, target regions (also called baits) are sequenced with higher depth than the others, resulting in three different types of interactions target-target, target-other and other-other with different properties. Capture Hi-C baits, but not other ends, have a further source of bias associated with uneven capture efficiency, thus for capture Hi-C experiments we model each class separately. As show the capture version of MaxHiC also identifies target-interacting regions that are enriched for regulatory features and disease-associated GWAS SNPs. The levels of enrichment are much higher than the regions identified by GOTHiC, Fit-Hi-C, and Fit-Hi-C2, and HiCDC+. The average read count of significant interacting regions identified by MaxHiC were much higher than other models. Importantly, MaxHiC’s interacting regions had higher level of enrichment than the interacting regions identified by CHiCAGO, a specific model designed for the analysis of capture Hi-C data. In conclusion, MaxHiC is a new open source tool for identifying significant interacting regions from Hi-C and capture Hi-C data. We have demonstrated that it significantly outperforms existing tools in terms of enrichment of interactions between known regulatory regions. We will continue to develop MaxHiC related tools for identifying topologically associating domains (TADs) and for identifying interactions that are significantly different between conditions. As single-cell Hi-C data becomes more broadly available we will also adapt MaxHiC to work with these datasets. Lastly, we believe that with minor modifications the same principles used here to analyse Hi-C data may be more broadly extended to identify significant interactions from RNA-chromatin interaction datasets [33, 34].

Online methods

Tool availability

The MaxHiC source code and python package, a sample dataset, and instructions on how to run MaxHiC are provided at https://github.com/bcb-sut/MaxHiC. For more details about each parameter, please visit the GitHub page. A tutorial on using the MaxHiC package is also provided in (instructions for MaxHiC). MaxHiC is memory and speed optimized and will work with both single core and multicore machines. It can work with very large Hi-C data sets (as bam file or contact maps generated by other tools) to identify significant interactions at a broad range of resolutions (1kb-500kb).

Publicly available Hi-C data and data access

Samples GM12878 B-lymphoblastoid cell line, Human Mammary Epithelial cell line, and k562 cell line Hi-C data from Rao et al. [5], capture Hi-C sample GM12878 B-Lymphoblastoid from Mifsud et al. [15] and mouse cortex Hi-C sample from Dixon et al. [35] were used to benchmark MaxHiC against other models. All samples are publically available. More details about the samples are provided in

Mapping, filtering and Interaction calling

MaxHiC takes as input either bam files or contact map files generated by packages such as HiCUP [36] and HiC-Pro [37]. For the analyses presented here, FASTQ paired-end reads were aligned to the human hg19 genome and filtered through the HiC-Pro pipeline (statistics are presented in ). To remove duplicate reads, filter for valid interactions, and generate Hi-C interaction matrices (5kb and 10kb bin-sizes), we set the following parameters: MIN_FRAG_SIZE = 230; MAX_FRAG_SIZE = 1100; MIN_INSERT_SIZE = 120; and MAX_INSERT_SIZE = 990. All experimental artifacts, such as circularized reads and re-ligations, singletons have been filtered out using HiC-Pro and MHiC [38]. We also removed all self-interactions from the downstream analyses. In-house scripts were then used for statistical evaluation of Hi-C libraries (see ).

The MaxHiC algorithm

In the following, we explain MaxHiC algorithm and mathematics behind the model for the analysis of both Hi-C and capture Hi-C experiments. In MaxHiC, a negative binomial distribution has been used to model the read count of interactions based on the bias factors of the two ends and their genomic distance. The negative binomial distribution is the analytic form of a Poisson distribution with its mean parameter following a gamma distribution to account for over-dispersion in data. As shown in Eq 1, the observed read count of interactions, x, is modelled using their expected read count, μ, and a shared dispersion factor, r, between all interactions. MaxHiC considers ‘neighborhooding’ (similar to local background) when identifying significant interactions. The parameters of the model are learned using a maximum likelihood approach. Gradient descent has been used to find the values of the parameters maximizing the logarithm of the likelihood function. In order to have higher accuracy, by having a higher difference between the read count of meaningful and meaningless interactions, and also to have higher speed, the proposed model uses the interactions between DNA bins, not DNA fragments. DNA is divided into pieces and all reads recorded for each fragment are assigned to the bin containing their middle points. All of the recorded interactions between pairs of bins with at least one read are the input samples used for training the model. After the model has been trained, the P-value of each interaction can be calculated based on the value of reliability, 1—CDF, for its observed read count.

Calculating the expected value for the read count

The expected read count for the interaction between bins i and j is calculated according to Eq 2. f(d) indicates the expected value of read count to be recorded in the experiment for any two loci with distance d, v and v are bias factors of the two interacting bins. As mentioned above, bins have different characteristics and different probabilities of showing up in the experiment that affects the expected value of their interactions. It’s assumed that the effects of two bins are independent from each other, so the total bias factor has been calculated as a product of the bias factors of the two ends.

Calculating bias factors for bins

In order to be applicable to all types of Hi-C protocols, accounting for the unknown sources of bias and also the relation between the known sources, the second approach, mentioned in the introduction, has been adopted in this study. A uniform bias factor has been assumed for each bin, which becomes more valid as the resolution of the analysis increases and the length of bins become shorter. We expect the bins included in more interactions with higher reads to have higher tendency to show up in the experiment. In addition, the total number of interactions that a bin has been included in, would have no effect on its bias. To deal with these problems, the bias factor for bins have been calculated using Eq 3 in which v, b and b are the parameters of the function and s is the total number of reads observed for bin i. Here it is assumed that the total number of reads observed for each bin is a measure of its ability to show up in the final results of the test, and bias factor is assumed to be a linear function of s in log-log space. v is also considered as the base possible value for the bias factor. This prevents the bins with very few recorded reads to affect the parameters in the training process and it also prevents their related interactions with very few reads from showing up as significant because of their abnormally low bias. Many tools do this by putting a threshold and filtering bins with less bias, but in MaxHiC it is done automatically without any need for adjusting the threshold. As illustrated in Eq 3, v is proportional to s raised to a power b. The power can be learned and if it becomes less than one it can express a saturation phenomena. As duplicated read pairs are discarded in post-processing, having more read count for two interacting bins as the read count of their interaction increases, becomes lower than their initial potential.

Calculating expected read count as a function of distance

The expected read count as a function of distance for cis interactions is illustrated in Eq 4. The function is the maximum of two values. The first one, f(d), is a degree 3 polynomial function of genomic distance in log-log space which specifies the expected read count of interactions based on the linear genomic distance between the two interacting loci. The reason for using an exponential function is the fact that Brownian motion and diffusion can be modelled using an exponential function. This assumption is also consistent with the power law function observed for the read count as a function of distance which has been introduced by Lieberman et al. for the first time [39]. The degree of 3 has been used so the function can have varying curvature in different distances. In this function, the logarithm of distance has been used instead of the distance itself, as DNA is not a straight string and has multiple folding with different scales that causes proximity to have a closer relation with the logarithm of the length than the length itself. A similar concept has been used in Fit-Hi-C [2], CHiCAGO [3] and HiC-DC [1] methods. The biggest difference is the fact that all of these methods consider a strictly decreasing function for the expected read count which approaches zero as the genomic distance increases and causes all interactions between far genomic loci to be considered significant while they may have only one read. This cannot be completely logical as farther than some specific distance, random background collisions occur due to three dimensional closeness not the genomic closeness. This three dimensional closeness can be due to closeness to the fixed three dimensional structures such as loops or random dynamic crossing of DNA string while two different parts of it are having contact. The expected read count for these kinds of random interactions has been modeled using a constant parameter, f. Therefore, in close genomic distances is dominating while the value of becomes very small, the value of f dominates the expected read count. For trans-interactions, similar to the cis interactions for far loci, a constant parameter has been used to denote the expected value of read count for any random background trans-interactions. Maximum function is not a smooth function to use in optimization. Therefore, instead of maximum, the soft-maximum function has been used. This soft-maximum resembles the maximum function. When an argument, for example α, is larger than the other, it’s exponential becomes far more larger and so it’s related factor, , approaches 1 while the other factor, , approaches 0. Therefore, the final value will be approximately equal to α itself.

Training the model in multiple rounds

As mentioned above, the method models read count of interactions occurring based on Brownian motion, not the meaningful chromatin contacts, which are called real interactions from now on. But real interactions are also among the data and can affect the parameters of the model and pull them up. For example, in distances where there are more read interactions, the expectation may become higher than other distances. To deal with this bias, the model is trained in multiple rounds, which can be specified by user (the default value for iterations is 4). In each round, first the parameters of the model are learned using a stochastic gradient descent approach. The training is done in iterations until all the parameters of the model have changed by less than 0.0001 compared to the previous iteration. Training also will stop when reaching 1000 iterations to handle oscillating situations around the optimum values. This number is high enough to give the model enough time to be trained well and has been chosen by the developers based on running the model on multiple Hi-C datasets (25 samples with 2–4 different resolutions). Next, a P-value is calculated for each interaction using the trained model and the real interactions are identified based on a fixed user-defined P-value as significance limit. The read count of these real interactions is substituted with their expected read count for calculating the total number of reads for each bin, as if they were not read what would their read count be, to remove the bias of having real interactions from bias factors of the bins. They are also completely ignored in the training of the model’s parameters. This is another main difference between MaxHiC and other models.

An automatic way of removing the effect of noisy data

Typically, a high percentage of interacting fragments in Hi-C have only 1 read between; for example, the typical percentage can be above 85% for a resolution of 5Kb. End bins of many of these interactions have very low coverage which results in a very small expected value for the interaction. This can pull the value of parameters down for the benefit of themselves in the model. In order to cope with such noisy data, the expectation for read count is calculated based on Eq 6 instead of Eq 2. In this way, the maximum of expected value and a base expected read count, base, is used as the final expected value. This will prevent the mentioned noisy interactions to affect greatly the parameters of the model.

A negative binomial regression background model for capture Hi-C data

Capture-Hi-C is a version of Hi-C in which some regions in DNA, which are called Baits or Target Regions, are sequenced with higher depth than the others. This will result in three different types of interactions in Capture-Hi-C with different properties, Target-Target, Target-Other and Other-Other. The background model developed for Capture-Hi-C follows the same statistical model as the one developed for General-Hi-C, but in order to account for the differences between these types of interactions, three different sets of parameters are used, each for one interaction type. Each set of parameters is learned separately from the others based on its own interactions, in the process of learning the parameters of the model. In calculating the bias factors for bins, the sum of read counts cannot be used directly as different interactions correspond to different types. To remove this barrier, all the read counts of interactions are converted to their equivalent read-counts in Target-Target interaction type: In Eq 7, t1 and t2 stand for the original interaction type and the one we want to convert the read-count to, which is assumed to be Target-Target, respectively. The expected value for each interaction type is easily calculated by using the set of parameters related to that interaction type and the calculated properties of the interaction, sum of reads for bins and their genomic distance, in the expected read-count formula regardless of the original type of the interaction.

Justification for the estimation of expected value of read count for trans chromatin interactions

In MaxHiC, distance is modelled by a function that decreases at increasing genomic distances to reach a small but constant non-zero value to account for random ligations. For trans-interactions, we use the same constant value as observed for distant cis interactions. Here two things are considered in the function modelling distance, more exactly the function modelling expected random ligations based on distance. It is a strictly decreasing function and it converges to a constant nonzero value when distance gets high. Which means it converges to a constant value when distance gets larger than a limit. In trans interactions, distance is not close so the same constant is considered in calculating the expected value. To show that the long distance cis have similar read counts as trans, we counted average read count of long distance (≥ 100Mb) significant cis interactions (P-value < 0.001) against average read count of all significant trans interactions (P-value < 0.001). As a results, long distance significant cis interactions had a very similar average read count (21.7) as significant trans interactions (22.1). Note: the average read count of all significant cis interactions was 29.4. We also plotted distribution of number of significant long distance cis interactions against significant trans interactions. As the plot shows both groups have almost the distribution .

Assessment of feature enrichment (ENCODE regulatory features, FANTOM5 promoters/enhancers)

Annotation datasets

Tissue specificity is an important aspect of many genetic diseases [40]; We therefore downloaded tissue-specific ChIP-Seq data in the form of processed peak calls for histone modifications (e.g. H3K27ac, H3K4me1, etc), CTCF binding sites, DNase hypersensitive sites, and ChromHMM predicted promoters and enhancers from the ENCODE project [19]. We used FANTOM5 active promoters and transcribed enhancers from the FANTOM5 consortium [41, 42]. FANTOM5 tissue-specific enhancers were obtained from [42].

Region-based analysis

To compute the enrichment of interactions overlapping an epigenetic feature, we first calculated the fraction of significant interactions with at least one end overlapping a feature. We then calculated fraction of all interactions overlapping with the feature as the genome-wide expectation. The enrichment value was then calculated by dividing the fraction of significant interactions overlapping a feature by the genome-wide expectation. To calculate the enrichment P-value, we used Fisher’s Exact test in the following manner: (i) number of significant interactions overlap with the feature; (ii) number of significant interactions that do not overlap with the feature; (iii) number of all interactions overlap with the feature; (iv) number of all interactions that do not overlap with the feature. The same approach is used to calculate the P-values for other enrichment analyses in this study.

Pair-based analysis

To compute the enrichment of interactions that both sides of an interaction overlap with an epigenetic feature, we repeated the above analyses in which both sides of interaction overlap with the feature.

Capture Hi-C analysis

In the capture Hi-C analysis, to ensure our results were not affected by the strong peak signal over target regions, only target-interacting regions (“bait-to-any interactions”) are considered and signals that mapped as bait-bait interactions and those signals that mapped outside of the target-interacting regions are excluded from the analysis. To calculate the fraction of significant target-interacting regions overlapping an epigenetic feature in the capture Hi-C analysis, we first calculated the fraction of significantly target-interacting regions overlapping a feature. We then randomly selected fragments that had no interaction with the target regions (we selected exactly the same number of fragments as the number of significant target-interacting fragments). The enrichment value then calculated by dividing the fraction of significant interacting regions overlapping a feature on the average fraction of 1000 permutations. We performed the above analysis for each Hi-C background model, separately.

GWAS data analysis

GWAS SNPs for autoimmune disease, neurological/behavioural traits and kidney/liver/lung disorders were downloaded from Maurano et al. [29]. To compute the enrichment of interactions overlapping GWAS SNPs, we first calculated the fraction of significant interactions with at least one end containing GWAS SNPs. We then calculated fraction of all interactions overlapping GWAS SNPs as the genome-wide expectation. The enrichment value was then calculated by dividing the fraction of significant interactions overlapping GWAS SNPs by the genome-wide expectation. In the analysis of capture Hi-C library, we only considered overlapping of GWAS SNPs with target-interacting regions in both the significant target-interacting regions and the permutated target-interacting regions.

Overlapping Hi-C interactions with eQTL pairs

We used the set of GTEx v7 eQTLs identified as significant in EBV-transformed lymphocytes from the Genotype-Tissue Expression (GTEx) Project [21]. To calculate the fraction of eQTL pairs overlapping with significant interactions, we considered those Hi-C interactions where one side of the interaction overlapped an eQTL SNP and the other side overlapped the promoter of the eQTL target gene identified by GTEx. The promoter regions are defined as 5kb± of starting point of a gene (TSS). To calculate enrichment, we divided these by the genome-wide expectation.

CRISPRi validation

To validate P-E pairs identified in Hi-C data, an orthogonal dataset of validated enhancer-promoter pairs was taken from Nasser et al. publication [26]. CRISPR perturbations in K562 cells identified 103 non-promoter elements involved in 134 functional element-gene relationships defined as causing significant changes in gene expression in K562 cells. We then counted a) number of non-promoter elements that overlapped with significant fragments identified by each model; b) number of element-gene pairs that overlapped with significant pairs identified by each model.

Parameters used for comparative Hi-C tools

All tools were run with their default parameters. For SIC models, we used P-value 0.001 to identify significant interactions. For loop callers, we used FDR 0.05 to identify significant loops.

In this file, we provided an instruction to run MaxHiC.

(PDF) Click here for additional data file.

Intuitive diagram showing the procedures and logic of MaxHiC.

(TIFF) Click here for additional data file.

Venn diagram of the top 100K significant interactions identified by each model in the Rao et al. [5] GM12878 and HMEC datasets.

a) Venn diagram of top 100K significant interactions identified by each model in the GM12878 data at fragment size 10kb. Loop callers are highlighted by red colour. The number of identified loops for each method are 10,487 for HiCCUPS, 18,148 for Peakachu, and 7,346 for Mustache. b) Venn diagram of significant interactions identified by each model in the HMEC data at fragment size 10kb. We did not include Fit-Hi-C2 in this figure as the results are almost the same as for Fit-Hi-C. Loop callers are indicated in red font. The number of identified loops are 12,053 for HiCCUPS, 18,385 for Peakachu, and 9,469 for Mustache. (TIFF) Click here for additional data file.

Venn diagram of the top 20K significant interactions identified by each model in the Rao et al. GM12878 and HMEC datasets.

a) Venn diagram of top 20K significant interactions identified by each model in the GM12878 data at fragment size 10kb. Loop callers are indicated in red font. The number of identified loops for each method are 10,487 for HiCCUPS, 18,148 for Peakachu, and 7,346 for Mustache. b) Venn diagram of significant interactions identified by each model in the HMEC data at fragment size 10kb. We did not include Fit-Hi-C2 in this figure as the results are almost the same as for Fit-Hi-C. Loop callers are indicated in red font. The number of identified loops are 12,053 for HiCCUPS, 18,385 for Peakachu, and 9,469 for Mustache. (TIFF) Click here for additional data file.

Distance distribution of read-count support of significant interactions identified by the SIC models in GM12878 and HMEC.

a) Minimum read-count of significant interactions in the Rao et al. [5] GM12878 data, b) as in a but for HMEC, c) Average read-count of significant interactions in the Rao et al. GM12878 data and d) as in c but for the HMEC data. Fragment size 10kb. (TIFF) Click here for additional data file.

Comparisons of the number of significant interactions and average read-counts of significant interactions identified by SIC models at fragment sizes 1kb, 5kb, 10kb, 50kb.

Plots as in . (TIFF) Click here for additional data file.

Enrichment of Histone marks in the significant interactions identified by MaxHiC.

The histone marks were downloaded in the peak calling format from UCSC genome browser. (TIFF) Click here for additional data file.

Enrichment of GM12878 specific regulatory potential in top 100K, 50K, 20K, 10K regions identified by MaxHiC and other tools.

a) Top 100K interactions identified by each model. b) Top 50K interactions identified by each model. c) Top 20K interactions identified by each model. d) Top 10K interactions identified by each model. (TIFF) Click here for additional data file.

Enrichment of transcription factor binding sites from ENCODE ChIP-seq data on Rao et al. [5] GM12878 data.

As the plot shows the enrichments in MaxHiC’s interactions are similar to the enrichments in the chromatin loops identified by HiCCUPS (which identified 34-fold fewer interactions) and much better than SIC models as well as two other loop callers. (TIFF) Click here for additional data file.

Enrichment of regulatory features in model specific interacting regions on Rao et al. GM12878 Hi-C data.

a) Interacting regions identified by MaxHiC only versus interacting regions identified by Fit-Hi-C2 only. The number of MaxHiC-specific interacting regions is 39,629; and the number of Fit-Hi-C2-specific interacting regions is 516,719. b) Interacting regions identified by MaxHiC only versus interacting regions identified by HiCCUPS. The number of MaxHiC-specific interacting regions is 219,397; and the number of HiCCUPS-specific interacting regions is 61. P-value < 0.001 and fragment size = 10kb. (TIFF) Click here for additional data file.

An example of Hi-C interaction identified by MaxHiC as significant, which was not identified as a significant interaction by other tools.

Both sides of the interaction have strong signals of regulatory features. (TIFF) Click here for additional data file.

Fraction of regulatory pairs in the top 100K and 20K significant interactions identified by each model in GM12878 and HMEC cell lines.

a) GM12878 – 100K. b) HGM12878 – 20K. c) HMEC– 100K. d) HMEC– 20K. (TIFF) Click here for additional data file.

Hi-C interaction map on GM12878 HiC data.

Heatmap showing the raw read count of interactions at 5k bins of chr21 (36000kb to 39500kb) (same region as S10 Fig in the cLoops paper [43] on the GM12878 Hi-C dataset in Rao et al. [5]. The heatmap is colored based on the log of read count. Significant interactions identified by a) MaxHiC and b) HiCCUPS, are shown in black. A zoomed view of one TAD is shown in c) and d). (TIFF) Click here for additional data file. Venn diagrams showing a) Overlap between the sets of interactions identified by MaxHiC and HiCCUPS on the GM12878 dataset. b) Overlap of 10kb genomic windows involved in significantly interacting regions identified (Note: for HiCCUPS only the 10kb centroids are counted). c) Similar to b, but showing overlap when the entire anchor sequence is considered for HiCCUPS. (TIFF) Click here for additional data file.

Micro-C data analysis.

a) Contact map on Micro-C data from Kreitensten et al. [22] (chr21:36000kb-39500kb). b) Aggregate peak analysis (APA) of significant interactions called by MaxHiC on Micro-C data. Interactions were called at 5kb resolution using the Micro-C Hi-C dataset from Kreitensten et al. [22]. The analysis shows the aggregate background interaction frequency of 5 bins upstream and downstream of the significantly interacting bins. APA plots are shown for the top 5k, and 10k. (TIFF) Click here for additional data file.

RING1B-depleted Hi-C analysis.

a) Hi-C Contact map comparing significant interactions identified in WT and RING1B-/- cells at 10kb fragment size. Black pixels are significant interactions identified by MaxHiC at P-value <0.001. chr5:28.3-30Mb). b) Distance distribution of significant interactions identified by MaxHiC (Pvalue <0.001, 10kb bins) in RING1B-/-, and RING1B WT cells. c) Enrichment of Ring1b (left) and H3K27me3 (right) binding sites in interacting regions found i) only in the wild type cells, ii) only observed in the RING1B knockout cells and iii) those stably observed in both. Data is from [24]. (TIFF) Click here for additional data file.

Number of CRISPRi interactions captured by each model in k562 Hi-C data.

Y-axis shows number of CRISPRi interactions captured by each model and X-axis shows number of significant interactions/loops identified by each model. a) The total number of CRISPRi elements was 103. b) The total number of CRISPRi element-gene pairs was 134. CRISPRi data is from [26]. (TIFF) Click here for additional data file.

Comparison of the four models in the analysis of capture Hi-C experiment in capture Hi-C sample GM12878.

We also developed a separate version of MaxHiC to handle capture Hi-C libraries. In the capture Hi-C library, we are particularly interested to identify significant interactions that at least one end of interaction is baited (i.e., a capture region). Enrichment of GM12878 specific histone marks, putative regulatory regions and GWAS signal in regions identified by CHiCAGO, GOTHiC, Fit-Hi-C, Fit-Hi-C2, HiCDC+ and capture version of MaxHiC, on the capture Hi-C library GM12878 from Mifsud et al. [15]. This showed that significant target-interacting pairs identified by MaxHiC have much higher enrichment for regulatory pairs and autoimmune related GWAS SNPs. (TIFF) Click here for additional data file.

Comparisons of the number of significant interactions and average read-counts of significant interactions identified by SIC models at fragment sizes 1kb, 5kb, 10kb, 50kb capture Hi-C sample GM12878.

Plots as in Fig 1. Capture Hi-C data was obtained from Mifsud et al. [15].

Fig 1

Background modelling as a function of distance in MaxHiC and other HiC tools.

(TIFF) Click here for additional data file.

Distribution of number of significant long distance cis interactions (>100Mb) against significant trans interactions in Rao et al. [5], GM12878 Hi-C sample.

All significant cis interactions identified by MaxHiC at P-value < 0.001 that had distance more than 100Mb were categorized as long-distance cis interactions. All trans interactions with P-value < 0.001 were identified as significant trans interactions. Long distance significant cis interactions had a very similar average read count (21.7) as significant trans interactions (22.1). (TIFF) Click here for additional data file.

Details of publicly available Hi-C samples used in this study.

(XLSX) Click here for additional data file.

The statistical summaries of significant interactions identified by MaxHiC, GOTHiC, CHiCAGO, Fit-Hi-C, Fit-Hi-C2, HiCDC+, HiCCUPS, Peakachu, and Mustache on the Rao HMEC Hi-C library at fragment sizes 1kb, 5kb and 10kb.

(XLSX) Click here for additional data file.

Enrichment of transcription factor binding sites in the significant interactions identified by SIC models on the Rao GM Hi-C library at fragment sizes 10kb.

(XLSX) Click here for additional data file.

Details of autoimmune and breast cancer traits used in this study for GWAS analysis.

(XLSX) Click here for additional data file.

The statistical summaries of significant interactions identified by MaxHiC, GOTHiC, CHiCAGO, Fit-Hi-C, Fit-Hi-C2, and HiCDC+ on the Mifsud GM12878 capture Hi-C library at fragments 5kb and 10kb.

As the table shows, the average read count of significant interactions identified by MaxHiC are much higher than those identified by CHiCAGO, GOTHiC, Fit-Hi-C, Fit-HiC2 and HiCDC+. (XLSX) Click here for additional data file.

42 in total

1. Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2.

Authors: Arya Kaul; Sourya Bhattacharyya; Ferhat Ay
Journal: Nat Protoc Date: 2020-01-24 Impact factor: 13.491

2. Accurate loop calling for 3D genomic data with cLoops.

Authors: Yaqiang Cao; Zhaoxiong Chen; Xingwei Chen; Daosheng Ai; Guoyu Chen; Joseph McDermott; Yi Huang; Xiaoxiao Guo; Jing-Dong J Han
Journal: Bioinformatics Date: 2020-02-01 Impact factor: 6.937

3. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

4. A promoter-level mammalian expression atlas.

Authors: Alistair R R Forrest; Hideya Kawaji; Michael Rehli; J Kenneth Baillie; Michiel J L de Hoon; Vanja Haberle; Timo Lassmann; Ivan V Kulakovskiy; Marina Lizio; Masayoshi Itoh; Robin Andersson; Christopher J Mungall; Terrence F Meehan; Sebastian Schmeier; Nicolas Bertin; Mette Jørgensen; Emmanuel Dimont; Erik Arner; Christian Schmidl; Ulf Schaefer; Yulia A Medvedeva; Charles Plessy; Morana Vitezic; Jessica Severin; Colin A Semple; Yuri Ishizu; Robert S Young; Margherita Francescatto; Intikhab Alam; Davide Albanese; Gabriel M Altschuler; Takahiro Arakawa; John A C Archer; Peter Arner; Magda Babina; Sarah Rennie; Piotr J Balwierz; Anthony G Beckhouse; Swati Pradhan-Bhatt; Judith A Blake; Antje Blumenthal; Beatrice Bodega; Alessandro Bonetti; James Briggs; Frank Brombacher; A Maxwell Burroughs; Andrea Califano; Carlo V Cannistraci; Daniel Carbajo; Yun Chen; Marco Chierici; Yari Ciani; Hans C Clevers; Emiliano Dalla; Carrie A Davis; Michael Detmar; Alexander D Diehl; Taeko Dohi; Finn Drabløs; Albert S B Edge; Matthias Edinger; Karl Ekwall; Mitsuhiro Endoh; Hideki Enomoto; Michela Fagiolini; Lynsey Fairbairn; Hai Fang; Mary C Farach-Carson; Geoffrey J Faulkner; Alexander V Favorov; Malcolm E Fisher; Martin C Frith; Rie Fujita; Shiro Fukuda; Cesare Furlanello; Masaaki Furino; Jun-ichi Furusawa; Teunis B Geijtenbeek; Andrew P Gibson; Thomas Gingeras; Daniel Goldowitz; Julian Gough; Sven Guhl; Reto Guler; Stefano Gustincich; Thomas J Ha; Masahide Hamaguchi; Mitsuko Hara; Matthias Harbers; Jayson Harshbarger; Akira Hasegawa; Yuki Hasegawa; Takehiro Hashimoto; Meenhard Herlyn; Kelly J Hitchens; Shannan J Ho Sui; Oliver M Hofmann; Ilka Hoof; Furni Hori; Lukasz Huminiecki; Kei Iida; Tomokatsu Ikawa; Boris R Jankovic; Hui Jia; Anagha Joshi; Giuseppe Jurman; Bogumil Kaczkowski; Chieko Kai; Kaoru Kaida; Ai Kaiho; Kazuhiro Kajiyama; Mutsumi Kanamori-Katayama; Artem S Kasianov; Takeya Kasukawa; Shintaro Katayama; Sachi Kato; Shuji Kawaguchi; Hiroshi Kawamoto; Yuki I Kawamura; Tsugumi Kawashima; Judith S Kempfle; Tony J Kenna; Juha Kere; Levon M Khachigian; Toshio Kitamura; S Peter Klinken; Alan J Knox; Miki Kojima; Soichi Kojima; Naoto Kondo; Haruhiko Koseki; Shigeo Koyasu; Sarah Krampitz; Atsutaka Kubosaki; Andrew T Kwon; Jeroen F J Laros; Weonju Lee; Andreas Lennartsson; Kang Li; Berit Lilje; Leonard Lipovich; Alan Mackay-Sim; Ri-ichiroh Manabe; Jessica C Mar; Benoit Marchand; Anthony Mathelier; Niklas Mejhert; Alison Meynert; Yosuke Mizuno; David A de Lima Morais; Hiromasa Morikawa; Mitsuru Morimoto; Kazuyo Moro; Efthymios Motakis; Hozumi Motohashi; Christine L Mummery; Mitsuyoshi Murata; Sayaka Nagao-Sato; Yutaka Nakachi; Fumio Nakahara; Toshiyuki Nakamura; Yukio Nakamura; Kenichi Nakazato; Erik van Nimwegen; Noriko Ninomiya; Hiromi Nishiyori; Shohei Noma; Shohei Noma; Tadasuke Noazaki; Soichi Ogishima; Naganari Ohkura; Hiroko Ohimiya; Hiroshi Ohno; Mitsuhiro Ohshima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry A Ovchinnikov; Arnab Pain; Robert Passier; Margaret Patrikakis; Helena Persson; Silvano Piazza; James G D Prendergast; Owen J L Rackham; Jordan A Ramilowski; Mamoon Rashid; Timothy Ravasi; Patrizia Rizzu; Marco Roncador; Sugata Roy; Morten B Rye; Eri Saijyo; Antti Sajantila; Akiko Saka; Shimon Sakaguchi; Mizuho Sakai; Hiroki Sato; Suzana Savvi; Alka Saxena; Claudio Schneider; Erik A Schultes; Gundula G Schulze-Tanzil; Anita Schwegmann; Thierry Sengstag; Guojun Sheng; Hisashi Shimoji; Yishai Shimoni; Jay W Shin; Christophe Simon; Daisuke Sugiyama; Takaai Sugiyama; Masanori Suzuki; Naoko Suzuki; Rolf K Swoboda; Peter A C 't Hoen; Michihira Tagami; Naoko Takahashi; Jun Takai; Hiroshi Tanaka; Hideki Tatsukawa; Zuotian Tatum; Mark Thompson; Hiroo Toyodo; Tetsuro Toyoda; Elvind Valen; Marc van de Wetering; Linda M van den Berg; Roberto Verado; Dipti Vijayan; Ilya E Vorontsov; Wyeth W Wasserman; Shoko Watanabe; Christine A Wells; Louise N Winteringham; Ernst Wolvetang; Emily J Wood; Yoko Yamaguchi; Masayuki Yamamoto; Misako Yoneda; Yohei Yonekura; Shigehiro Yoshida; Susan E Zabierowski; Peter G Zhang; Xiaobei Zhao; Silvia Zucchelli; Kim M Summers; Harukazu Suzuki; Carsten O Daub; Jun Kawai; Peter Heutink; Winston Hide; Tom C Freeman; Boris Lenhard; Vladimir B Bajic; Martin S Taylor; Vsevolod J Makeev; Albin Sandelin; David A Hume; Piero Carninci; Yoshihide Hayashizaki
Journal: Nature Date: 2014-03-27 Impact factor: 49.962

5. The Cohesin Release Factor WAPL Restricts Chromatin Loop Extension.

Authors: Judith H I Haarhuis; Robin H van der Weide; Vincent A Blomen; J Omar Yáñez-Cuna; Mario Amendola; Marjon S van Ruiten; Peter H L Krijger; Hans Teunissen; René H Medema; Bas van Steensel; Thijn R Brummelkamp; Elzo de Wit; Benjamin D Rowland
Journal: Cell Date: 2017-05-04 Impact factor: 41.582

6. A central role for canonical PRC1 in shaping the 3D nuclear landscape.

Authors: Shelagh Boyle; Ilya M Flyamer; Iain Williamson; Dipta Sengupta; Wendy A Bickmore; Robert S Illingworth
Journal: Genes Dev Date: 2020-05-21 Impact factor: 11.361

7. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

8. HiC-DC+ enables systematic 3D interaction calls and differential analysis for Hi-C and HiChIP.

Authors: Merve Sahin; Wilfred Wong; Yingqian Zhan; Kinsey Van Deynze; Richard Koche; Christina S Leslie
Journal: Nat Commun Date: 2021-06-07 Impact factor: 14.919

9. GOTHiC, a probabilistic model to resolve complex biases and to identify real interactions in Hi-C data.

Authors: Borbala Mifsud; Inigo Martincorena; Elodie Darbo; Robert Sugar; Stefan Schoenfelder; Peter Fraser; Nicholas M Luscombe
Journal: PLoS One Date: 2017-04-05 Impact factor: 3.240

10. Mustache: multi-scale detection of chromatin loops from Hi-C and Micro-C maps using scale-space representation.

Authors: Abbas Roayaei Ardakany; Halil Tuvan Gezer; Stefano Lonardi; Ferhat Ay
Journal: Genome Biol Date: 2020-09-30 Impact factor: 13.583

1 in total

1. Correction: MaxHiC: A robust background correction model to identify biologically relevant chromatin interactions in Hi-C and capture Hi-C experiments.

Authors: Hamid Alinejad-Rokny; Rassa Ghavami Modegh; Hamid R Rabiee; Ehsan Ramezani Sarbandi; Narges Rezaie; Kin Tung Tam; Alistair R R Forrest
Journal: PLoS Comput Biol Date: 2022-09-09 Impact factor: 4.779

1 in total