| Literature DB >> 26619908 |
Nicolas Servant1,2,3, Nelle Varoquaux4,5,6, Bryan R Lajoie7, Eric Viara8, Chong-Jian Chen9,10,11,12,13,14, Jean-Philippe Vert15,16,17, Edith Heard18,19,20, Job Dekker21, Emmanuel Barillot22,23,24.
Abstract
HiC-Pro is an optimized and flexible pipeline for processing Hi-C data from raw reads to normalized contact maps. HiC-Pro maps reads, detects valid ligation products, performs quality controls and generates intra- and inter-chromosomal contact maps. It includes a fast implementation of the iterative correction method and is based on a memory-efficient data format for Hi-C contact maps. In addition, HiC-Pro can use phased genotype data to build allele-specific contact maps. We applied HiC-Pro to different Hi-C datasets, demonstrating its ability to easily process large data in a reasonable time. Source code and documentation are available at http://github.com/nservant/HiC-Pro .Entities:
Mesh:
Year: 2015 PMID: 26619908 PMCID: PMC4665391 DOI: 10.1186/s13059-015-0831-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Comparing solutions for Hi-C data processing
| Mapping | Detection of valid interactions | Binning | Correction of systematic noise | Parallel implementation | Allele-specific analysis | |
|---|---|---|---|---|---|---|
| HOMER | x | x | ||||
| HICUP | x | x | x | |||
| HiC-inspector | xa | x | x | |||
| HiC-Box | xa | x | x | x | ||
| HiCdat | xa | x | x | x | ||
| Hiclib | x | x | x | x | ||
| HiC-Pro | x | x | x | x | x | x |
HOMER [10] offers several programs to analysis Hi-C data from aligned reads. aHiC-inpector [12], HiCdat [13] and HiC-Box [14] do not allow chimeric reads to be rescued during the mapping. HICUP [11] provides a complete pipeline until the detection of valid interaction products. It can be used together with the SNPsplit software [15] to extract allele-specific mapped reads. The hiclib Python library [17] can be applied for all analysis steps but requires good programming skills and cannot be used in a single command-line manner. None of these software enable very large amounts of data to be processed easily in a parallel mode. Note that HOMER, hiclib and HiCdat also offer additional functions for downstream analysis. In the case of HiC-Pro, the downstream analysis is supported by the HiTC BioConductor package [28]
HiC-Pro performance and comparison with hiclib
| Dataset | IMR90 | IMR90 | IMR90 | IMR90_CCL186 |
|---|---|---|---|---|
| Number of reads | 397,200,000 | 397,200,000 | 397,200,000 | 1,535,222,082 |
| Pipeline | hiclib | HiC-Pro | HiC-Pro parallel | HiC-Pro parallel |
| Number of input files | 10 | 10 | 84 | 160 |
| Number of jobs | 1 | 1 | 42 | 80 |
| Number of CPUs per job | 8 | 8 | 4 | 4 |
| Maximum memory | 10 | 7 | 7 | 24 |
| Wall time | 28:24 | 14:32 | 02:15 | 11:49 |
| Mapping | 22:03 | 10:31 | 00:21 | 05:56 |
| Filtering | 00:30 | 03:10 | 00:05 | 00:36 |
| Merge | 00:20 | 00:18 | 00:50 | |
| Contacts maps | 01:45 | 00:15 | 00:15 | 00:42 |
| Normalization | 04:06 | 01:16 | 01:16 | 03:49 |
HiC-Pro was run on the IMR90 Hi-C dataset from Dixon et al. and Rao et al. in order to generate contact maps at resolutions of 20 kb, 40 kb, 150 kb, 500 kb and 1 Mb. Contact maps at 5 kb were also generated for the IMR90_CCL186 dataset. The CPU time for each step of the pipeline is reported and compared with the hiclib Python library. The reported results include time of writing contact maps in text format. Times are minutes:seconds
Fig. 1Comparison of HiC-Pro and hiclib processing. a Both pipelines generate concordant results across processing steps. The fraction of uniquely aligned read pairs is calculated on the total number of initial reads. Self-circle and dangling-end fractions are calculated on the total number of aligned read pairs. Intra- and inter-chromosomal contacts are calculated as a fraction of filtered valid interactions. b Boxplots of the Spearman correlation coefficients of intra- and inter-chromosomal maps generated at different resolutions by both pipelines. c Chromosome 6 contact maps generated by hiclib (top) and HiC-Pro (bottom) at different resolutions. The chromatin interaction data generated by the two pipelines are highly similar
Performance of iterative correction on IMR90 data
| HiC-Pro – Iced (dense – 1 CPU) | HiC-Pro – Iced (sparse – 1 CPU) | HiCorrector – MES (dense – 1 CPU) | HiCorrector – MEP (dense – 8 CPUs) | |
|---|---|---|---|---|
| IMR90 1Mbp | 00:00:12 | 00:00:25 | 00:00:25 | 00:00:06 |
| IMR90 500 kbp | 00:00:40 | 00:01:30 | 00:02:15 | 00:00:22 |
| IMR90 150 kbp | - | 00:04:28 | 00:13:21 | 00:03:10 |
| IMR90 40 kbp | - | 00:07:19 | 02:35:34 | 00:35:43 |
| IMR90 2 0kbp | - | 00:08:36 | 12:57:17 | 02:34:05 |
HiC-Pro is based on a fast implementation of the iterative correction algorithm. We therefore compare our method with the MES (Memory-Efficient Sequential) and MEP (Memory-Efficient Parallel) algorithms of the HiCorrector software [20] for Hi-C data normalization (hours:minutes:seconds). All algorithms were terminated after 20 iterations (see Additional file 1 for details)
Fig. 2Allele-specific analysis. a Allele-specific analysis of the GM12878 cell line. Phasing data were gathered from the Illumina Platinum Genomes Project. In total, 2,239,492 high quality SNPs from GM12878 data were used to distinguish both alleles. Around 6 % of the read pairs were assigned to each parental allele and used to build the allele-specific contact maps. b Intra-chromosomal contact maps of inactive and active X chromosome of the GM12878 cell line at 500-kb resolution. The inactive copy of chromosome X is partitioned into two mega-domains which are not seen in the active X chromosome. The boundary between the two mega-domains lies near the DXZ4 micro-satellite
Fig. 3HiC-Pro workflow. Reads are first aligned on the reference genome. Only uniquely aligned reads are kept and assigned to a restriction fragment. Interactions are then classified and invalid pairs are discarded. If phased genotyping data and N-masked genome are provided, HiC-Pro will align the reads and assign them to a parental genome. For the Hi-C protocol based on restriction enzyme digestion, the read pairs will then be assigned to a restriction fragment and invalid ligation products will be filtered out. These first steps can be performed in parallel for each read chunk. Data from multiple chunks are then merged and binned to generate a single genome-wide interaction map. For allele-specific analysis, only pairs with at least one allele-specific read are used to build the contact maps. The normalization is finally applied to remove Hi-C systematic bias on the genome-wide contact map. MAPQ Mapping Quality , PE paired end
Fig. 4Read pair alignment and filtering. a Read pairs are first independently aligned to the reference genome using an end-to-end algorithm. Then, reads spanning the ligation junction which were not aligned in the first step are trimmed at the ligation site and their 5′ extremity is realigned on the genome. All aligned reads after these two steps are used for further analysis. b According to the Hi-C protocol, digested fragments are ligated together to generate Hi-C products. A valid Hi-C product is expected to involve two different restriction fragments. Read pairs aligned on the same restriction fragment are classified as dangling end or self-circle products, and are not used to generate the contact maps. PE paired end, LS Ligation Site
Fig. 5HiC-Pro quality controls. Quality controls reported by HiC-Pro (IMR90, Dixon et al. data). a Quality control on read alignment and pairing. Low quality alignment, singleton and multiple hits are usually removed at this step. b Read pair filtering. Read pairs are assigned to a restriction fragment. Invalid pairs, such as dangling-end and self-circle, are good indicators of the library quality and are tracked but discarded for subsequent further analysis. The fractions of duplicated reads, as well as short range versus long range interactions, are also reported
Comparison of contact map formats
| Dense format (MB) | Sparse symmetric format (MB) | |
|---|---|---|
| IMR90_CCL186 1 Mbp | 27 | 49 |
| IMR90_CL186 500 kbp | 82 | 181 |
| IMR90_CCL186 150 kbp | 822 | 911 |
| IMR90_CCL186 40 kbp | 12,000 | 1900 |
| IMR90_CL186 20 kbp | 45,000 | 2600 |
| IMR90_CL186 5 kbp | 720,000 | 4200 |
Disk space for IMR90_CCL186 genome-wide contact maps generated using either the classical dense format or the sparse symmetric format at different resolutions