Wenyuan Li, Ke Gong, Qingjiao Li, Frank Alber, Xianghong Jasmine Zhou.
Abstract
Genome-wide proximity ligation assays, e.g. Hi-C and its variant TCC, have recently become important tools for studying spatial genome organization. Removing biases from the chromatin contact matrices generated by such techniques is a critical preprocessing step for subsequent analyses. The continuing decline of sequencing costs has led to ever-improving resolution of Hi-C data, resulting in very large matrices of chromatin contacts. Such large matrices, however, pose a great challenge to the memory usage and speed of their normalization. Therefore, there is an urgent need for fast and memory-efficient methods for normalizing Hi-C data. We developed Hi-Corrector, an easy-to-use, open source implementation of the Hi-C data normalization algorithm. Its salient features are (i) scalability: the software is capable of normalizing Hi-C data of any size in reasonable time; (ii) memory efficiency: the sequential version can run on any single computer with very limited memory, no matter how little; (iii) speed: the parallel version can run very fast on multiple computing nodes with limited local memory.
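The abstract does not spell out the normalization algorithm, so the following is only a minimal sketch of the general idea behind iterative correction (matrix balancing) computed with bounded memory. It assumes the raw contact matrix is symmetric and can be read a few rows at a time; `estimate_biases`, `read_rows` and `chunk_size` are hypothetical names chosen for illustration and are not Hi-Corrector's interface. Only a per-bin bias vector and one block of rows are held in memory at any time.

```python
from statistics import mean


def estimate_biases(read_rows, n_bins, chunk_size=2, n_iter=10):
    """Estimate per-bin biases while holding only `chunk_size` rows at once.

    `read_rows(start, stop)` is a hypothetical callback that returns rows
    start..stop-1 of the raw contact matrix (for example, read from disk).
    """
    bias = [1.0] * n_bins
    for _ in range(n_iter):
        row_sums = [0.0] * n_bins
        # Stream the matrix block by block instead of loading it whole.
        for start in range(0, n_bins, chunk_size):
            stop = min(start + chunk_size, n_bins)
            for offset, row in enumerate(read_rows(start, stop)):
                i = start + offset
                # Row sum of the matrix corrected by the current biases.
                row_sums[i] = sum(v / (bias[i] * bias[j])
                                  for j, v in enumerate(row))
        positive = [s for s in row_sums if s > 0]
        if not positive:
            break
        avg = mean(positive)
        # Fold the normalized row sums into the biases; skip empty rows.
        bias = [b * s / avg if s > 0 else b
                for b, s in zip(bias, row_sums)]
    return bias


# Toy symmetric matrix standing in for a large on-disk contact matrix.
raw = [[1.0, 1.8, 4.4],
       [1.8, 4.0, 8.0],
       [4.4, 8.0, 16.0]]


def read_rows(start, stop):
    return raw[start:stop]


bias = estimate_biases(read_rows, n_bins=len(raw))
corrected_sums = [sum(raw[i][j] / (bias[i] * bias[j]) for j in range(len(raw)))
                  for i in range(len(raw))]
print(corrected_sums)  # row sums of the corrected matrix, nearly equal
```

Because each block's row sums depend only on that block and the shared bias vector, the inner loop over blocks could in principle be distributed across processors, which is one plausible reading of how a parallel version keeps per-node memory small.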
Year: 2014 PMID: 25391400 PMCID: PMC4380031 DOI: 10.1093/bioinformatics/btu747
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Running time of three algorithms on 10K and 20K bp resolution Hi-C data
| Algorithm | IC | IC-MES | IC-MEP | IC-MEP |
|---|---|---|---|---|
| 20K bp data (151 825 bins) | | | | |
| #Processors | 1 | 1 | 16 | 48 |
| Memory | 86 GB | 4 GB | 1 GB | 1 GB |
| Time (gm12878) | 0:36:50 | 3:58:14 | 0:19:50 | 0:06:38 |
| Time (hESC) | 0:35:01 | 3:49:18 | 0:19:48 | 0:06:47 |
| 10K bp data (303 640 bins) | | | | |
| #Processors | 1 | 1 | 16 | 48 |
| Memory | 343 GB | 32 GB | 2 GB | 2 GB |
| Time (gm12878) | NA | 47:27:32 | 4:50:03 | 0:26:02 |
| Time (hESC) | NA | 37:26:15 | 4:49:27 | 0:26:09 |
All algorithms were terminated after 10 iterations for the purpose of performance comparison, since each iteration has almost the same running time. ‘Memory’ includes only the memory allocated for computation in each processor, not system overhead. The elapsed time format is ‘hours : minutes : seconds’.
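As a rough check on the table, the short script below estimates the memory needed to hold a full dense contact matrix and converts the reported times into speedups. The 4-byte entry size and the use of binary gigabytes are assumptions, not stated in the source, although they happen to reproduce the 86 GB and 343 GB figures listed for IC. The helper names `to_seconds` and `dense_matrix_gib` are illustrative.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'hours:minutes:seconds' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def dense_matrix_gib(n_bins: int, bytes_per_entry: int = 4) -> float:
    """Memory for a full n_bins x n_bins matrix, in GiB.

    Assumes 4-byte entries (an assumption, not stated in the source).
    """
    return n_bins ** 2 * bytes_per_entry / 2 ** 30


for n_bins in (151_825, 303_640):
    print(f"{n_bins} bins: about {dense_matrix_gib(n_bins):.0f} GiB dense")

# gm12878 running times at 20K bp resolution, copied from the table.
times = {
    "IC (1 processor)": "0:36:50",
    "IC-MES (1 processor)": "3:58:14",
    "IC-MEP (16 processors)": "0:19:50",
    "IC-MEP (48 processors)": "0:06:38",
}
baseline = to_seconds(times["IC-MES (1 processor)"])
for name, elapsed in times.items():
    speedup = baseline / to_seconds(elapsed)
    print(f"{name}: {speedup:.1f}x relative to IC-MES")
```

Read this way, the table shows a trade-off: on one processor IC-MES uses far less memory than IC but runs several times longer, while IC-MEP recovers the speed by spreading the work over 16 or 48 processors.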