Literature DB >> 35685374

SnapHiC2: A computationally efficient loop caller for single cell Hi-C data.

Xiaoqi Li¹, Lindsay Lee², Armen Abnousi², Miao Yu³, Weifang Liu⁴, Le Huang⁵, Yun Li^4,6,7, Ming Hu².

Abstract

Single cell Hi-C (scHi-C) technologies enable the study of chromatin spatial organization directly from complex tissues at single cell resolution. However, the identification of chromatin loops from single cells is challenging, largely due to the extremely sparse data. Our recently developed SnapHiC pipeline provides the first tool to map chromatin loops from scHi-C data, but it is computationally intensive. Here we introduce SnapHiC2, which adapts a sliding window approximation when imputing missing contacts in each single cell and reduces both memory usage and computational time by 70%. SnapHiC2 can identify 5 Kb resolution chromatin loops with high sensitivity and accuracy and help to suggest target genes for GWAS variants in a cell-type-specific manner. SnapHiC2 is freely available at: https://github.com/HuMingLab/SnapHiC/releases/tag/v0.2.2.

Entities: Chemical

Keywords: Chromatin loops; Chromatin spatial organization; single cell Hi-C; the random walk with restart (RWR) algorithm

Year: 2022 PMID： 35685374 PMCID： PMC9168059 DOI： 10.1016/j.csbj.2022.05.046

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 6.155

Introduction

The recently developed single cell Hi-C (scHi-C) [1], [2], [3], [4] and its derived technologies, such as sn-m3c-seq [5], [6], provide powerful tools to study chromatin spatial organization directly from complex tissues at single cell resolution. In particular, sn-m3c-seq can reveal distinct cell types based on single cell DNA methylome [5], [6]. Many three-dimensional (3D) genome features, including A/B compartments, topologically associating domains (TADs) and chromatin loops, have been characterized in single cells [7], [8], [9]. Among them, it is of particular interest to identify chromatin loops from single cells, which can shed novel insights on cell-type-specific gene regulation mechanisms. Chromatin loops, which are defined as pairs of genomic loci with significantly higher contact frequency compared to the expected contact frequency due to random collision, have been originally discovered from deeply sequenced bulk cell Hi-C data [10] and play a critical role in genome structure and genome function [11]. However, identifying chromatin loops from single cell Hi-C (scHi-C) data is extremely challenging, largely due to data sparsity. Specifically, ∼1 billion contacts are usually required to robustly map chromatin loops from bulk cell Hi-C data. Nevertheless, most currently available scHi-C and sn-m3c-seq datasets consist of a few hundred cells for each cell type, with on average ∼ 1 million contacts per cell [2], [5], [6]. Aggregating all cells belonging to the same cell type into pseudo bulk Hi-C data still cannot reach the typical required number of contacts for loop calling, limiting the sensitivity of existing methods developed for bulk Hi-C data [12]. A detailed description of publicly available scHi-C datasets[1], [2], [3], [4], [5], [6], [13], [14], [15], [16], [17], [18], [19], [20] and related computational challenges can be found in recent review papers[21], [22], [23]. To address this challenge, we recently published SnapHiC, a computational pipeline to identify loops from scHi-C data [9]. SnapHiC consists of two steps. First, it adopts the random walk with restart (RWR) algorithm [7] to impute contact probability in each cell at 10 Kb bin resolution. Next, it applies both global and local background models to identify chromatin loops. The key methodological innovation in SnapHiC is that each cell is treated as an independent unit, leading to substantially improved power in loop calling (see details in our recent review paper [21]). However, the computational cost of SnapHiC is high. In particular, the first step (i.e., RWR imputation) consumes a large amount of computational resource in terms of both computing time and required memory, thus limiting its applicability to a large number (e.g., greater than500) of cells or calling loops at sub-10 Kb resolution. Given the increasing interest in identifying loops from scHi-C data [24], a computationally more efficient implementation of SnapHiC is therefore of urgent need. Here we present SnapHiC2, an efficient chromatin loop caller for scHi-C data. Compared to SnapHiC, SnapHiC2 achieves more than 3X speed up and saves ∼ 70% memory usage, making the identification of sub-10 Kb-resolution loops from a large number of cells computationally feasible. SnapHiC2 is freely available at: .

Materials and methods

A sliding window approach for scHi-C imputation.

In SnapHiC2, we employ a sliding window approach to approximate the RWR algorithm (Figure S1). Briefly, in each single cell, we model each chromosome as an unweighted undirected graph, where each node is an equal-sized genomic locus (e.g., 5 Kb bin or 10 Kb bin). We then add edges to adjacent genomic loci pairs and all genomic loci pairs with chromatin contacts. Next, we use the RWR algorithm to calculate the probability of traveling between any two loci, which is defined as the imputed chromatin contact probability. The computational complexity of the RWR algorithm is determined by the size of the graph, in other words, the number of genomic loci in each chromosome. In the original study [9], we applied the RWR algorithm to the entire chromosome, which is time-consuming and requires a large amount of memory. As a computationally more efficient implementation, SnapHiC2 leverages partially overlapped sliding window approximation, and substantially speeds up the RWR algorithm. Let represent a 10 Kb bin resolution Hi-C contact matrix for one cell (the bin resolution can also be set to 5 Kb), where is the number of contacts between bin and bin . Instead of computing the RWR-imputed contact probability for all 10 Kb bin pairs in the entire chromosome-wide Hi-C contact matrix, we break down the computation into a series of partially overlapped matrices along the diagonal line of the original matrix. Specifically, each matrix is by in size, and two consecutive matrices overlap by a by sub-matrix. In other words, the step size of the sliding window is (see detailed illustration in Figure S1). Let be a by matrix, which is the th step in the moving window. We use the RWR algorithm to impute the contact probability for all 10 Kb bin pairs within this by matrix. To avoid artifacts near corners due to incomplete chromatin contact information, we only keep imputed contact probability for a subset of bin pairs in the middle rectangle specified below: Figure S1 illustrates the first 3 steps of the sliding window procedure. In step 1, we perform RWR for bin pairs in the red block and generate imputed contact probabilities for bin pairs in the red rectangle. Then moving to step 2, we again perform RWR for bin pairs in the blue block and generate imputed contact probabilities for bin pairs in the blue rectangle. Continuing to step 3, we yet again perform RWR for bin pairs in the purple block and generate imputed contact probabilities for bin pairs in the purple rectangle. We repeat this sliding window procedure until we reach the end of the chromosome. Then, we concatenate the RWR-imputed contact probabilities for all bin pairs within in 1 Mb genomic distance (the 1 Mb boundary is highlighted by the dashed yellow line, and bin pairs subject to loop calling reside in shaded regions illustrated by the striped red, blue, purple lines), while skipping the two corners near chromosome start and end (i.e., the black triangles).

A data-driven approach to determine sliding window size

It is critical to choose an appropriate size for the sliding windows. When the window size is too small, a large amount of information outside the sliding window will be lost, resulting in an inaccurate approximation of RWR-imputed contact probabilities. In contrast, when the window size is too large, the computational cost will be too expensive. SnapHiC2 adopts a data-driven approach and selects the window size to keep more than 80% of the chromatin contacts, achieving a balance between imputation accuracy and computational efficiency (see details in Results).

Results

The data-driven approach to determine sliding window size

We used scHi-C data from 100 oligodendrocytes (ODCs) from human cortex [5] and 742 mouse embryonic stem cells (mESCs) [2] as benchmark examples. For each cell, we calculated the proportion of intra-chromosomal contacts within a certain 1D genomic distance (Figure S2A) and found that more than 80% of contacts were within 20 Mb and 10 Mb, for 100 ODCs and 742 mESCs, respectively. Next, we applied SnapHiC2 with five different sliding window sizes (10 Mb, 20 Mb, 30 Mb, 40 Mb, and 50 Mb) and compared SnapHiC2-identified chromatin loops with SnapHiC-identified chromatin loops, which are based on the full chromosome version of the RWR imputation. We first evaluated the loop overlap between these two versions. SnapHiC2 identified a slightly larger number of loops than SnapHiC (Figure S2B), with 60.9% ∼77.3% of loops identified as overlapping (Figure S2C). We then evaluated the precision, recall, and F1-score (i.e., the harmonic mean of precision and recall), treating as working truth chromatin loops identified from bulk Hi-C data, H3K4me3 PLAC-seq data, as well as cohesin and H3K27ac HiChIP data [25], [26], [27], [28], [29]. We found that SnapHiC2 with different sliding window sizes achieved comparable precision, recall, and F1-score, compared with SnapHiC (Figure S2D, S2E, and S2F). For example, when applied to 100 ODCs, SnapHiC2 with sliding window size 20 Mb achieved precision 0.699, recall 0.261 and F1-score 0.380, similar to SnapHiC (precision 0.775, recall 0.232 and F1-score 0.357). Consistently, when applied to 742 mESCs, SnapHiC2 with sliding window size 10 Mb achieved precision 0.560, recall 0.285 and F1-score 0.378, again comparable to SnapHiC (precision 0.655, recall 0.278 and F1-score 0.391). Taken together, our results suggest that SnapHiC2, with its data-driven strategy to select sliding window size that retains more than 80% of contacts, can identify loops with similar quality as the original SnapHiC algorithm.

SnapHiC2 is computationally efficient

Reassured with the satisfactory performance of SnapHiC2, we then evaluated its computational efficiency. Specifically, for the RWR imputation step, we applied both SnapHiC and SnapHiC2 (with sliding window size 20 Mb) to 100 ODCs and 742 mESCs, each repeated three times, and recorded the corresponding computational time and memory used (Fig. 1). When applied to 100 ODCs, SnapHiC2 was 3.2 times faster than SnapHiC (average time: 1.56 h vs. 5.04 h, Fig. 1A) and only consumed 30.2% of the memory (average required memory: 16.02 GB vs. 53.02 GB, Fig. 1B). Consistently, when applied to 742 mESCs, SnapHiC2 was 3.1 times faster than SnapHiC (average time: 8.98 h vs. 27.81 h, Fig. 1A), and only consumed 36.7% of the memory (average required memory: 13.04 GB vs. 35.55 GB, Fig. 1B). These results demonstrate that SnapHiC2 substantially enhances computational efficiency compared to SnapHiC.

Figure 1

Computational advantage of SnapHiC2 in speed and memory, using 100 ODCs and 742 mESCs as the illustrative example. A. Computational time comparison. B.Memory usage comparison.

SnapHiC2 can identify 5 Kb resolution chromatin loops with high sensitivity and accuracy

In the original study [9], we applied SnapHiC to identify 10 Kb resolution chromatin loops. With the more efficient SnapHiC2 implementation, we can afford to improve the bin resolution to 5 Kb. Here, we re-analyzed scHi-C data in chromosome 3 from 742 mESCs [2] at 5 Kb resolution using both SnapHiC and SnapHiC2. In addition, we aggregated the 742 mESCs and applied two versions of the HiCCUPS algorithm [10], one with the default settings (HiCCUPS-default) for deeply sequenced bulk Hi-C data (greater than1 billion contacts) and the other with more lenient settings (HiCCUPS-lenient) for shallowly sequenced bulk Hi-C data (∼500 million contacts). We first applied SnapHiC, SnapHiC2, HiCCUPS-default, and HiCCUPS-lenient to chromosome 3 data in all 742 mESCs, and a subset of mESCs (N = 100, 200, 300, 400, 500, 600, and 700, Fig. 2). We randomly sampled a subset of mESCs three times to evaluate the robustness of SnapHiC2, HiCCUPS-default and HiCCUPS-lenient. Due to the high computational cost of SnapHiC at 5 Kb bin resolution, we only run SnapHiC once. We found that both SnapHiC and SnapHiC2 identified more 5 Kb loops than both two versions of HiCCUPS (Fig. 2A and Table S1). For example, using all 742 mESCs, SnapHiC, SnapHiC2, HiCCUPS-default and HiCCUPS-lenient identified 252, 294, 6 and 24 5 Kb loops in chromosome 3, respectively (Fig. 2A). Next, we used 5 Kb resolution loops identified from bulk Hi-C data, H3K4me3 PLAC-seq and cohesin, and H3K27ac HiChIP data [26], [27], [28], [29] as the reference loop list (i.e., treated as the working truth) and evaluated the precision, recall, and F1-score. Compared to loops identified from two versions of HiCCUPS, both SnapHiC and SnapHiC2-identified loops achieved slightly lower precision (Fig. 2B), much improved recall (Fig. 2C), and substantially higher F1-score (Fig. 2D). For example, using all 742 mESCs, SnapHiC2 loops have precision 0.534, recall 0.182 and F1-score 0.271, while SnapHiC loops have precision 0.623, recall 0.180 and F1-score 0.279. In addition, HiCCUPS-default loops have precision 0.667, recall 0.006 and F1-score 0.013. HiCCUPS-lenient loops have precision 0.708, recall 0.031 and F1-score 0.060.

Figure 2

SnapHiC2 can identify high-quality 5Kb resolution chromatin loops from mESC cells. In Figure 2A, 2B, 2C and 2D, the error bar represents the standard deviation across three random samples. A. For 100, 200, 300, 400, 500, 600, 700, 742 cells, SnapHiC2 identifies similar numbers of loops with SnapHiC. SnapHiC2 identified more loops than HiCCUPS with either default or lenient settings. B. SnapHiC2 achieves slightly lower precision than SnapHiC and the two versions of HiCCUPS. C. SnapHiC2 achieves slightly higher recall than SnapHiC and much higher recall than the two versions of HiCCUPS. D. SnapHiC2 achieves comparable F1-score with SnapHiC and much higher F1-score than the two versions of HiCCUPS. E. Both SnapHiC and SnapHiC2 detect a functional 5Kb resolution enhancer-promoter interaction at the Sox2 locus in mESCs with much fewer cells than the two versions of HiCCUPS. F. Genome browser view of the Sox2 locus showing that the 5Kb loops identified by SnapHiC2 can better pinpoint the functional enhancer of Sox2 gene than the 10Kb loops identified by SnapHiC.

The Sox2 locus

We then examined the Sox2 gene locus, where a functional enhancer-promoter interaction between Sox2 promoter and the super-enhancer ∼ 100 Kb downstream has previously been validated by CRISPR experiment in mESCs [30]. At 5 Kb resolution, both SnapHiC and SnapHiC2 required as few as 100 cells to identify such enhancer-promoter interaction, whereas HiCCUPS-default and HiCCUPS-lenient required at least 600 cells and 500 cells, respectively (Fig. 2E). This example showcases the significantly enhanced sensitivity of SnapHiC2 compared to two versions of HiCCUPS. Next, for all 742 mESCs, we compared the SnapHiC2-identified 5 Kb-resolution loop with the SnapHiC-identified 10 Kb-resolution loop with Sox2 promoter (Fig. 2F). The bin interacting with Sox2 promoter is chr3:34.755 Mb-34.76 Mb and chr3:34.75 Mb-34.76 Mb, based on SnapHiC2′s 5 Kb calling and SnapHiC’s 10 Kb calling, respectively. In the CRISPR experiment [30], the deleted region is chr3:34,756,113–34,761,855, largely overlapped with the 5 Kb interacting bin identified by SnapHiC2. In addition, we found a strong H3K27ac ChIP-seq peak (chr3:34,754,670–34,758,019) from mESCs overlapped with the 5 Kb interacting bin. Taken together, these pieces of evidence suggest that the 5 Kb loops identified by SnapHiC2 can better pinpoint the functional enhancer of Sox2 gene than the 10 Kb loops identified by SnapHiC.

SnapHiC2 can identify cell-type-specific loops linking GWAS variants to putative disease genes.

Lastly, we applied SnapHiC2 to identify 5 Kb-resolution loops from astrocytes, L2/3 neurons, microglia and oligodendrocytes, each with 261 cells [5] (Table S2). We identified a large proportion of cell-type-specific loops, underscoring the importance of mapping chromatin loops using single cells from the same cell type (Figure S3). As one illustrative example, we checked chromatin loops anchored at the transcription start site of gene PAGR1, which has been reported to be associated with neuropsychiatric disorders [31]. We found a L2/3-specific loop linking the promoter of gene PAGR1 to a 5 Kb bin (chr16:29,930,000–29,935,000) containing six schizophrenia-associated GWAS SNPs (Fig. 3) (rs4407079 [chr16:29,931,899], rs4609871 [chr16:29,932,064], rs9936474 [chr16:29,932,691], rs12932403 [chr16:29,934,050], rs12596042 [chr16:29,934,754] and rs11150574 [chr16:29,934,819]). Among them, four SNPs (rs4407079, rs4609871, rs9936474 and rs12932403) reside in neuronal enhancers [25]. In addition, gene PAGR1 shows the highest expression in L2/3 neurons among these four cell types [32] (Fig. 3). Notably, gene PARG1 is greater than 100 Kb away from these schizophrenia-associated GWAS SNPs, with multiple genes residing in the middle. In addition, multiple schizophrenia-associated GWAS SNPs are outside of the neuronal enhancers. Integrating the multi-omics evidence from epigenome, transcriptome and 3D genome suggests that PAGR1 is the likely functional gene underlying the GWAS variants, and most likely in L2/3 neurons.

Figure 3

SnapHiC2 identifies a 5Kb L2/3-specific loop linking the PAGR1 promoter to six schizophrenia-associated GWAS SNPs.

Discussion

In this paper, we describe SnapHiC2, a computationally efficient pipeline to identify chromatin loops from scHi-C data. Compared to our original SnapHiC, SnapHiC2 adopts a sliding window approach when implementing the RWR algorithm, achieving more than 3 times speed up and reducing memory usage by around 70%. In addition, we provide an implementation of the loop calling step (i.e., step 2) for running on GPUs, leveraging the JAX library in Python. When testing on the Volta GPU with Intel(R) Xeon(R) Gold 6148 CPU @ 2.40 GHz, we found that SnapHiC2′s GPU implementation of step 2 achieved more than 2X speed-up compared with SnapHiC’s CPU implementation of step 2, each with a single processor. SnapHiC2 enables the identification of chromatin loops from single cells at 5 Kb resolution, facilitating the characterization of promoter-interacting enhancers, as well as identification and prioritization of putative target genes for GWAS variants. We believe that SnapHiC2 will become a useful tool to study 3D chromatin organization at single cell resolution.

Funding

This research was supported by the National Institute of Health grants U01HG011720 (awarded to YL) and R35HG011922 (awarded to MH). YL, XL, and LH are partially supported by U01DA052713, R01MH125236, and P50HD103573. Conflict of interest: none declared.

Reference genomes

We used mm10 and hg19 for mouse scHi-C data and human scHi-C data, respectively.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. A.A. is currently an employee of RocketReach.

21 in total

1. Changes in genome architecture and transcriptional dynamics progress independently of sensory experience during post-natal brain development.

Authors: Longzhi Tan; Wenping Ma; Honggui Wu; Yinghui Zheng; Dong Xing; Ritchie Chen; Xiang Li; Nicholas Daley; Karl Deisseroth; X Sunney Xie
Journal: Cell Date: 2021-01-18 Impact factor: 41.582

2. Parental-to-embryo switch of chromosome organization in early embryogenesis.

Authors: Samuel Collombet; Noémie Ranisavljevic; Takashi Nagano; Csilla Varnai; Tarak Shisode; Wing Leung; Tristan Piolot; Rafael Galupa; Maud Borensztein; Nicolas Servant; Peter Fraser; Katia Ancelin; Edith Heard
Journal: Nature Date: 2020-03-25 Impact factor: 49.962

3. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements.

Authors: Maxwell R Mumbach; Ansuman T Satpathy; Evan A Boyle; Chao Dai; Benjamin G Gowen; Seung Woo Cho; Michelle L Nguyen; Adam J Rubin; Jeffrey M Granja; Katelynn R Kazane; Yuning Wei; Trieu Nguyen; Peyton G Greenside; M Ryan Corces; Josh Tycko; Dimitre R Simeonov; Nabeela Suliman; Rui Li; Jin Xu; Ryan A Flynn; Anshul Kundaje; Paul A Khavari; Alexander Marson; Jacob E Corn; Thomas Quertermous; William J Greenleaf; Howard Y Chang
Journal: Nat Genet Date: 2017-09-25 Impact factor: 38.330

SnapHiC2: A computationally efficient loop caller for single cell Hi-C data.

Introduction

Materials and methods

A sliding window approach for scHi-C imputation.

A data-driven approach to determine sliding window size

Results

The data-driven approach to determine sliding window size

SnapHiC2 is computationally efficient

SnapHiC2 can identify 5 Kb resolution chromatin loops with high sensitivity and accuracy

The Sox2 locus

SnapHiC2 can identify cell-type-specific loops linking GWAS variants to putative disease genes.

Discussion

Funding

Reference genomes

Declaration of competing interest

1. Changes in genome architecture and transcriptional dynamics progress independently of sensory experience during post-natal brain development.

2. Parental-to-embryo switch of chromosome organization in early embryogenesis.

3. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements.

4. HiChIP: efficient and sensitive analysis of protein-directed genome architecture.

5. Prenatal expression patterns of genes associated with neuropsychiatric disorders.

6. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition.

7. Cell-cycle dynamics of chromosomal organization at single-cell resolution.

8. Robust single-cell Hi-C clustering by convolution- and random-walk-based imputation.

9. Multiscale and integrative single-cell Hi-C analysis with Higashi.

10. Joint profiling of DNA methylation and chromatin architecture in single cells.

Review 1. Understanding the function of regulatory DNA interactions in the interpretation of non-coding GWAS variants.