
Hi-Corrector: a fast, scalable and memory-efficient package for normalizing large-scale Hi-C data.

Wenyuan Li¹, Ke Gong¹, Qingjiao Li¹, Frank Alber¹, Xianghong Jasmine Zhou¹.

Abstract

Genome-wide proximity ligation assays, e.g. Hi-C and its variant TCC, have recently become important tools to study spatial genome organization. Removing biases from chromatin contact matrices generated by such techniques is a critical preprocessing step for subsequent analyses. The continuing decline of sequencing costs has led to an ever-improving resolution of Hi-C data, resulting in very large matrices of chromatin contacts. Such large matrices, however, pose a great challenge to the memory usage and speed of their normalization. Therefore, there is an urgent need for fast and memory-efficient methods for normalization of Hi-C data. We developed Hi-Corrector, an easy-to-use, open source implementation of the Hi-C data normalization algorithm. Its salient features are (i) scalability: the software is capable of normalizing Hi-C data of any size in reasonable time; (ii) memory efficiency: the sequential version can run on any single computer with very limited memory; (iii) speed: the parallel version can run very fast on multiple computing nodes with limited local memory.
AVAILABILITY AND IMPLEMENTATION: The sequential version is implemented in ANSI C and can be easily compiled on any system; the parallel version is implemented in ANSI C with the MPI library (a standardized and portable parallel environment designed for solving large-scale scientific problems). The package is freely available at http://zhoulab.usc.edu/Hi-Corrector/.

Substances: Chromatin

Year: 2014 PMID: 25391400 PMCID: PMC4380031 DOI: 10.1093/bioinformatics/btu747

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The recent development of genome-wide proximity ligation assays such as Hi-C (Lieberman-Aiden et al.) and its variant TCC (Kalhor et al.) has significantly facilitated the study of spatial genome organization. The raw chromatin interaction data obtained by Hi-C methods can have both technical and biological biases (Imakaev et al.). Therefore, correcting biases in Hi-C data is an important preprocessing step. Among several recently developed methods (Hu et al.; Imakaev et al.; Yaffe and Tanay, 2011), the iterative correction (IC) algorithm (Imakaev et al.) has been used most widely in recent studies (Ay et al.; Le et al.; Naumova et al.; Varoquaux et al.) because of its conceptual simplicity, its lack of parameters and its ability to account for unknown biases, although its assumption of equal visibility across all loci may require further exploration. Mathematically, the IC algorithm is a matrix scaling or balancing method that transforms a symmetric matrix into one that is doubly stochastic, meaning that every row sum and every column sum of the matrix equals 1. However, the Hi-C chromatin interaction matrix has the massive size O(N²), where N is the number of genomic regions, so normalizing it demands large memory and long computation time. This is especially problematic for high-resolution data at the kilobase level or beyond (Jin et al.; Le et al.). For example, at a resolution of 10K base pairs per region, the human genome has 303 640 regions and the Hi-C matrix occupies about 343 GB of memory, which cannot be loaded into any common desktop computer even in compressed format. Most scaling algorithms in the matrix computation field (Knight, 2008; Knight and Ruiz, 2012) suffer from this scalability issue, because their main focus is improving the convergence rate and numerical stability. Therefore, there is high demand for a fast and scalable IC algorithm that can work with massive Hi-C data matrices on common computing resources.
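The memory figure quoted above can be checked with a short calculation. The sketch below assumes a dense N x N matrix stored with 4 bytes per entry; that element size is an assumption inferred from the reported numbers, not something the text states, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Size in GiB of a dense n_regions x n_regions matrix.
   Assumption: every entry is stored, elem_bytes bytes each
   (4 would correspond to a single-precision float). */
double dense_matrix_gib(size_t n_regions, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    /* Compute in double to avoid integer overflow for large N. */
    double bytes = (double)n_regions * (double)n_regions * (double)elem_bytes;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}
```

Under this assumption, `dense_matrix_gib(303640, 4)` evaluates to roughly 343.5, in line with the 343 GB quoted for 10K bp resolution.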
Here we propose a set of scalable algorithms (adapted from the original IC algorithm) to meet this need. Both sequential and parallel versions were implemented in the standard and efficient C language, which allows for precise memory control. The sequential implementation is memory efficient and can run on any single computer with limited memory, even for large Hi-C datasets. It overcomes the memory limit by loading only a portion of the data into memory at a time, so it requires some extra time for file reading. The parallel implementation is both memory efficient and fast. It can run on one of the most popular parallel computing resources: a computer cluster (i.e. a distributed-memory computing environment). In this environment, a set of general-purpose processors or computers is interconnected to share resources, and each computer retains its own limited local memory. The parallel algorithm is designed with very low communication overhead among computing nodes, so that it runs faster on clusters with more computers. Although the Hi-C analysis pipeline ICE (Imakaev et al.) implements the IC algorithm, it works only on a single computer and cannot draw on additional computing resources to speed up the computation. Very few parallel matrix scaling or balancing algorithms had been developed prior to this work (Amestoy et al.; Zenios and Iu, 1990), and none of them is suitable for the bias correction of Hi-C data. Zenios and Iu (1990) parallelized the matrix balancing algorithm for a shared-memory computer, which cannot address the memory shortage problem. Amestoy et al. designed a complicated data distribution strategy based on partitions of the non-zero elements; their method is not applicable to the raw Hi-C contact map, which contains a high proportion of non-zero elements. We performed experiments on high-performance computing resources and clusters with different numbers of nodes and memory capacities.
The results showed that this package could meet the strong demand for normalizing massive Hi-C data given limited computing resources.

2 Algorithms and implementation

Given an observed chromosome contact frequency matrix O over N genomic regions, the IC method eliminates biases so that all genomic regions have equal visibility (Imakaev et al.). To make this algorithm memory efficient, we designed a strategy of breaking the matrix O into K equal partitions of complete rows and loading only one partition into memory at any given time. Therefore, the memory requirement can be very low when K is large. This strategy adapts the IC algorithm by adding two steps: (i) loading the kth partition of O into memory and (ii) updating this partition with the most recently updated bias vector b. The new Memory-Efficient Sequential algorithm (called IC-MES) works even in the extreme case of K = N, where only one row is loaded at a time. IC-MES is memory efficient, but it is still a sequential algorithm that runs on a single machine. Therefore, it may be too slow when the machine has little memory. To normalize large Hi-C matrices in a short time, we also designed a fast, scalable and Memory-Efficient Parallel algorithm (called IC-MEP) that exploits the parallelism of the normalization problem and makes use of many commonly available computing resources. In essence, the normalization problem is a data-divisible task: a series of operations that can work independently on separate partitions of the data. This problem is well suited to the data-parallel model in a distributed-memory computing environment such as a computer cluster, which consists of K independent processors (or nodes) that are loosely or tightly connected by high-speed networks and have limited local memory. We employed the manager–worker parallel programming paradigm. The manager task partitions the data into K blocks, then initiates K worker tasks on different nodes; each worker task processes a single data block. The manager coordinates all workers and synchronizes their calculations with updated bias vectors.
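The partition-by-partition strategy of IC-MES can be sketched as follows. This is a minimal illustration, not the Hi-Corrector source: the names `ic_mes`, `load_rows_fn`, `mem_source` and `load_from_memory` are hypothetical, the loader reads from an in-memory array where a real out-of-core run would read rows from a file, and the bias update (row sum divided by the mean row sum) follows the commonly described form of iterative correction, which may differ in detail from the version given in the Supplementary materials.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback: copy rows [row0, row0 + nrows) of the raw
   n x n matrix into buf. A real run would read them from disk. */
typedef void (*load_rows_fn)(void *ctx, size_t row0, size_t nrows, double *buf);

/* Demo data source: a row-major matrix held in memory (stand-in for a file). */
typedef struct { const double *data; size_t n; } mem_source;

void load_from_memory(void *ctx, size_t row0, size_t nrows, double *buf)
{
    const mem_source *src = ctx;
    memcpy(buf, src->data + row0 * src->n, nrows * src->n * sizeof *buf);
}

/* Runs `iters` rounds of iterative correction while holding at most
   ceil(n / k) rows in memory. On return, bias[i] is the estimated bias
   of region i. Returns 0 on success, -1 if allocation fails. */
int ic_mes(load_rows_fn load, void *ctx, size_t n, size_t k,
           int iters, double *bias)
{
    assert(n > 0 && k > 0 && k <= n);
    size_t rows_per_part = (n + k - 1) / k;   /* at most k partitions */
    double *part = malloc(rows_per_part * n * sizeof *part);
    double *sums = malloc(n * sizeof *sums);
    if (part == NULL || sums == NULL) { free(part); free(sums); return -1; }

    for (size_t i = 0; i < n; i++) bias[i] = 1.0;

    for (int it = 0; it < iters; it++) {
        double total = 0.0;
        for (size_t row0 = 0; row0 < n; row0 += rows_per_part) {
            size_t nrows = n - row0 < rows_per_part ? n - row0 : rows_per_part;
            load(ctx, row0, nrows, part);          /* step (i): load partition */
            for (size_t r = 0; r < nrows; r++) {   /* step (ii): apply bias b */
                size_t i = row0 + r;
                double s = 0.0;
                for (size_t j = 0; j < n; j++)
                    s += part[r * n + j] / (bias[i] * bias[j]);
                sums[i] = s;                       /* corrected row sum */
                total += s;
            }
        }
        double mean = total / (double)n;
        if (mean <= 0.0) break;                    /* all-zero matrix: nothing to do */
        for (size_t i = 0; i < n; i++)
            if (sums[i] > 0.0)                     /* leave empty rows untouched */
                bias[i] *= sums[i] / mean;
    }
    free(part);
    free(sums);
    return 0;
}
```

Because the raw matrix is never modified, each iteration re-reads every partition and applies the current bias vector on the fly, so only one partition plus two length-N vectors are resident at a time. Setting k equal to n corresponds to the one-row-at-a-time extreme mentioned above.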
The IC-MEP algorithm has very little network messaging overhead, because workers do not communicate with each other; each exchanges data only with the manager. Therefore, it is computationally efficient. Furthermore, so that each worker can run its task in limited memory, we also used the memory-saving strategy of the IC-MES algorithm: each worker further partitions its assigned data block into a set of sub-blocks and loads only one sub-block into memory at any given time. Theoretically, the IC-MEP algorithm can work on any number of processors with any local memory capacity. Details of these three algorithms and their flowchart figures are provided in the Supplementary materials. We used ANSI C to implement the two sequential algorithms, IC and IC-MES, because it offers maximum control over memory and is memory efficient. We implemented the parallel algorithm IC-MEP using the popular Message Passing Interface (MPI), a highly standardized and portable environment designed for solving large-scale scientific problems on distributed-memory systems.
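The two-level partitioning described above (blocks per worker, then sub-blocks per worker) can be illustrated with a small planning routine. This is a hedged sketch: the type `partition_plan` and the function `plan_partitions` are hypothetical, and it assumes dense rows with a fixed element size and counts only the sub-block buffer, ignoring the bias vectors and other bookkeeping a real worker would also hold.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t rows_per_worker;    /* rows in the largest worker block */
    size_t rows_per_subblock;  /* rows that fit in one worker's buffer */
    size_t subblocks;          /* sub-blocks needed for the largest block */
} partition_plan;

/* Plan how n matrix rows are split across `workers` nodes, and how each
   worker's block is further split so one sub-block fits in buffer_bytes.
   Assumption: each row holds n entries of elem_bytes bytes. */
partition_plan plan_partitions(size_t n, size_t workers,
                               unsigned long long buffer_bytes,
                               size_t elem_bytes)
{
    assert(n > 0 && workers > 0 && workers <= n && elem_bytes > 0);
    unsigned long long row_bytes = (unsigned long long)n * elem_bytes;
    assert(buffer_bytes >= row_bytes);   /* at least one row must fit */

    partition_plan p;
    p.rows_per_worker = (n + workers - 1) / workers;            /* ceiling */
    p.rows_per_subblock = (size_t)(buffer_bytes / row_bytes);   /* floor */
    if (p.rows_per_subblock > p.rows_per_worker)
        p.rows_per_subblock = p.rows_per_worker;
    p.subblocks = (p.rows_per_worker + p.rows_per_subblock - 1)
                  / p.rows_per_subblock;
    return p;
}
```

As an illustration, for 303 640 bins, 48 workers, a 2 GiB buffer and 4-byte entries, this plan gives 6326 rows per worker, 1768 rows per sub-block and 4 sub-blocks per worker. These values only illustrate the arithmetic and are not taken from the package.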

3 Results

We compared three algorithms (IC, IC-MES and IC-MEP) on the TCC/Hi-C data of two human cell types: GM12878 and hESC (Dixon et al.; Kalhor et al.). The whole genome is partitioned into equal-size regions (or bins); the bin size is the main indicator of Hi-C data resolution. The results are listed in Table 1. In the experiment with 20K bp resolution data, the basic IC algorithm requires a minimum of 86 GB of memory. IC-MES can run with just 4 GB of memory (a common configuration in office computers) and complete the same work in reasonable time (within 4 h). IC-MEP dramatically speeds up the computation by using more processors (under 7 min with 48 processors), while using only 1 GB of memory in each processor. For the 10K bp data, none of the HPC computer nodes (with a 128 GB memory limit) can load the full matrix (about 343 GB) required by the basic IC algorithm. IC-MES and IC-MEP, however, still complete the task: IC-MEP finishes in under half an hour using 48 processors with 2 GB of memory each, and IC-MES runs on a single processor with the memory and time listed in Table 1. Details are provided in the Supplementary materials.

Table 1. Running time of three algorithms on 10K and 20K bp resolution Hi-C data

Algorithm	IC	IC-MES	IC-MEP	IC-MEP
20K bp data (151 825 bins)
#Processors	1	1	16	48
Memory	86 GB	4 GB	1 GB	1 GB
Time (GM12878)	0:36:50	3:58:14	0:19:50	0:06:38
Time (hESC)	0:35:01	3:49:18	0:19:48	0:06:47
10K bp data (303 640 bins)
#Processors	1	1	16	48
Memory	343 GB	32 GB	2 GB	2 GB
Time (GM12878)	NA	47:27:32	4:50:03	0:26:02
Time (hESC)	NA	37:26:15	4:49:27	0:26:09

All algorithms were terminated after 10 iterations for the purpose of performance comparison, since each iteration has almost the same running time. ‘Memory’ includes only the memory allocated for computation in each processor, not system overhead. The elapsed time format is ‘hours : minutes : seconds’.
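The elapsed times in Table 1 can be turned into speedup ratios with a few lines of C. This is an illustrative sketch; the helper names `elapsed_seconds` and `speedup` are hypothetical. Because the compared runs also differ in memory per processor, the ratios describe the reported configurations rather than a pure scaling measurement.

```c
#include <assert.h>
#include <stdio.h>

/* Convert an "hours:minutes:seconds" string to seconds.
   Returns -1 if the string is malformed. */
long elapsed_seconds(const char *hms)
{
    long h, m, s;
    assert(hms != NULL);
    if (sscanf(hms, "%ld:%ld:%ld", &h, &m, &s) != 3) return -1;
    if (h < 0 || m < 0 || m > 59 || s < 0 || s > 59) return -1;
    return h * 3600 + m * 60 + s;
}

/* Ratio of a baseline running time to a parallel running time. */
double speedup(const char *baseline, const char *parallel)
{
    long b = elapsed_seconds(baseline);
    long p = elapsed_seconds(parallel);
    assert(b > 0 && p > 0);
    return (double)b / (double)p;
}
```

For the 20K bp GM12878 data, `speedup("3:58:14", "0:19:50")` is about 12 (IC-MES versus IC-MEP on 16 processors) and `speedup("3:58:14", "0:06:38")` is about 36 (48 processors).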


4 Conclusion

With the rapidly increasing resolution of Hi-C datasets, the size of the chromatin contact map will soon exceed the memory capacity of general-purpose computers. We developed Hi-Corrector, a scalable and memory-efficient package for bias removal in Hi-C data. Hi-Corrector can complete the task on any single computer or computer cluster with limited memory. We performed experiments on high-resolution Hi-C data from two human cell types to show that the package can process very large datasets in reasonable time using a single processor, and in very short time with multiple processors. The experiments further demonstrate the scalability of our package: as Table 1 shows, the more processors are used, the faster it runs. Therefore, Hi-Corrector is a timely resource addressing the challenge of normalizing high-resolution Hi-C data.

Funding

National Science Foundation Grant CAREER 1150287 and the Arnold and Mabel Beckman foundation (BYI program) (to F.A.); and National Institutes of Health Grant (NHLBI MAPGEN U01HL108634 to X.J.Z.); and F.A. is a Pew Scholar in Biomedical Sciences, supported by the Pew Charitable Trusts. Conflict of Interest: none declared.

11 in total

1. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture.

Authors: Eitan Yaffe; Amos Tanay
Journal: Nat Genet Date: 2011-10-16 Impact factor: 38.330

2. Organization of the mitotic chromosome.

Authors: Natalia Naumova; Maxim Imakaev; Geoffrey Fudenberg; Ye Zhan; Bryan R Lajoie; Leonid A Mirny; Job Dekker
Journal: Science Date: 2013-11-07 Impact factor: 47.728

3. High-resolution mapping of the spatial organization of a bacterial chromosome.

Authors: Tung B K Le; Maxim V Imakaev; Leonid A Mirny; Michael T Laub
Journal: Science Date: 2013-10-24 Impact factor: 47.728

4. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling.

Authors: Reza Kalhor; Harianto Tjong; Nimanthi Jayathilaka; Frank Alber; Lin Chen
Journal: Nat Biotechnol Date: 2011-12-25 Impact factor: 54.908

5. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

6. Topological domains in mammalian genomes identified by analysis of chromatin interactions.

Authors: Jesse R Dixon; Siddarth Selvaraj; Feng Yue; Audrey Kim; Yan Li; Yin Shen; Ming Hu; Jun S Liu; Bing Ren
Journal: Nature Date: 2012-04-11 Impact factor: 49.962

7. Three-dimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression.

Authors: Ferhat Ay; Evelien M Bunnik; Nelle Varoquaux; Sebastiaan M Bol; Jacques Prudhomme; Jean-Philippe Vert; William Stafford Noble; Karine G Le Roch
Journal: Genome Res Date: 2014-03-26 Impact factor: 9.043

8. Iterative correction of Hi-C data reveals hallmarks of chromosome organization.

Authors: Maxim Imakaev; Geoffrey Fudenberg; Rachel Patton McCord; Natalia Naumova; Anton Goloborodko; Bryan R Lajoie; Job Dekker; Leonid A Mirny
Journal: Nat Methods Date: 2012-09-02 Impact factor: 28.547

9. A high-resolution map of the three-dimensional chromatin interactome in human cells.

Authors: Fulai Jin; Yan Li; Jesse R Dixon; Siddarth Selvaraj; Zhen Ye; Ah Young Lee; Chia-An Yen; Anthony D Schmitt; Celso A Espinoza; Bing Ren
Journal: Nature Date: 2013-10-20 Impact factor: 49.962

10. A statistical approach for inferring the 3D structure of the genome.

Authors: Nelle Varoquaux; Ferhat Ay; William Stafford Noble; Jean-Philippe Vert
Journal: Bioinformatics Date: 2014-06-15 Impact factor: 6.937
