Charalampos Lazaris, Stephen Kelly, Panagiotis Ntziachristos, Iannis Aifantis, Aristotelis Tsirigos.
Abstract
BACKGROUND: Chromatin conformation capture techniques have evolved rapidly over the last few years and have provided new insights into genome organization at an unprecedented resolution. Analysis of Hi-C data is complex and computationally intensive, involving multiple tasks and requiring robust quality assessment. This has led to the development of several tools and methods for processing Hi-C data. However, most of the existing tools do not cover all aspects of the analysis and offer only a few quality assessment options. Additionally, the availability of a multitude of tools leaves scientists wondering how these tools and their associated parameters can be used optimally, and how potential discrepancies can be interpreted and resolved. Most importantly, investigators need to be assured that slight changes in parameters and/or methods do not affect the conclusions of their studies.
Keywords: Benchmarking; Chromosome conformation; Computational pipeline; Data provenance; Hi-C; Parameter exploration
Year: 2017 PMID: 28056762 PMCID: PMC5217551 DOI: 10.1186/s12864-016-3387-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Comparison of HiC-bench with published Hi-C analysis or visualization tools
Tools compared (16): HiC-bench, HiFive, Hi-Cpipe, HiCNorm, hiclib, HiTC, HOMER, Hi-Corrector, HiC-Pro, TADbit, HiCUP, HiC-Box, HiCdat, HIPPIE, Sushi, HiCPlotter.

| Hi-C tasks | Tools marked as supporting the task (of 16) |
|---|---|
| Alignment | 8 |
| Filtering | 9 |
| Genome browser tracks | 1 |
| Quality assessment plots | 8 |
| Contact matrices | 8 |
| Matrix correction | 11 |
| Matrix comparison | 2 |
| Boundary scores | 1 |
| Domains | 2 |
| Boundary comparison | 1 |
| Specific interactions | 9 |
| Annotations | 3 |
| Allele-specific interactions | 2 |
| Visualization | 8 |
| Integration with ChIP-seq data | 3 |
| Parallelization | 7 |
| Integration of alternative tools | 1 |
| Parameter exploration | 1 |
| Reproducibility | 2 |
HiC-bench is a comprehensive and feature-rich Hi-C analysis pipeline that performs various Hi-C tasks by combining our newly developed tools with existing tools.
Fig. 1 HiC-bench workflow. Raw reads (input fastq files) are aligned and then filtered (align and filter tasks). Filtered reads are used to create Hi-C track files (tracks) that can be uploaded directly to the WashU Epigenome Browser [27]. A report with a statistics summary of filtered Hi-C reads is also generated automatically (filter-stats). Raw Hi-C matrices (matrix-filtered) are normalized using (a) scaling (matrix-prep), (b) iterative correction (matrix-ic) [9] or (c) HiCNorm (matrix-hicnorm) [28]. A report with plots of the normalized Hi-C counts as a function of the distance between the interacting partners (matrix-stats) is generated automatically for all methods. The resulting matrices are compared across all samples in terms of Pearson and Spearman correlation (compare-matrices and compare-matrices-stats). Boundary scores are calculated and the corresponding report with the Principal Component Analysis (PCA) is generated automatically (boundary-scores and boundary-scores-pca). Domains are identified using various TAD calling algorithms (domains), followed by comparison of TAD boundaries (compare-boundaries and compare-boundaries-stats). A report with the statistics of the boundary comparison is also generated automatically. Hi-C visualization of user-defined genomic regions is performed using HiCPlotter (hicplotter) [23]. Specific chromatin interactions (interactions) are detected and annotated (annotations). Finally, enrichment of top interactions in certain chromatin marks, transcription factors, etc., provided by the user is calculated automatically (annotations-stats).
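The matrix comparison step of this workflow reports Pearson and Spearman correlation between samples. The following is a minimal Python sketch of such a comparison, assuming two symmetric contact matrices of equal size given as nested lists. The function names, the toy matrices, and the choice to compare the upper triangle including the diagonal are illustrative assumptions, not the pipeline's actual implementation.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the upper triangle (diagonal included) of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def compare_matrices(a, b):
    """Correlate the upper triangles of two contact matrices."""
    xa, xb = upper_triangle(a), upper_triangle(b)
    return {"pearson": pearson(xa, xb), "spearman": spearman(xa, xb)}


# Toy 3x3 matrices (made-up counts); rep2 is rep1 scaled by 2,
# so both coefficients should be very close to 1.
rep1 = [[10, 4, 1], [4, 12, 5], [1, 5, 9]]
rep2 = [[20, 8, 2], [8, 24, 10], [2, 10, 18]]
result = compare_matrices(rep1, rep2)
print(result)
```

In practice the comparison would likely be restricted to bins within a maximum genomic distance, since distant bin pairs are dominated by zeros; that restriction is omitted here for brevity.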
Fig. 2 a Computational trails. Each combination of tools and parameter settings can be imagined as a unique computational “trail” that is executed simultaneously with all the other possible trails to create a collection of output objects. As an example, one of these possible trails is presented in red. The raw reads were aligned, filtered and then binned into matrices at 40 kb resolution. Our own naïve matrix scaling method was then used for matrix correction and domains were called using TopDom [31]. b HiC-bench pipeline task architecture. All pipeline tasks are performed by a single R script, “pipeline-master-explorer.r”. This script generates output objects based on all combinations of input objects and parameter scripts while taking into account the split variable, group variable and tuple settings. The output objects are stored in the corresponding “results” directory. As an example, domain calling for IMR90 is presented. The filtered reads of the IMR90 Hi-C sample (digested with HindIII) are used as input. The pipeline-master-explorer script tests whether TAD calling with these settings has already been performed and, if not, calls the domain calling wrapper script (code/hicseq-domains.tcsh) with the corresponding parameters (e.g., params/params.armatus.gamma_0.5.tcsh). After the task is complete, the output is stored in the corresponding “results” directory.
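The idea of computational trails can be illustrated with a short Python sketch: every trail is one choice of setting per pipeline step, and a task is run only if its output does not exist yet. This is a simplified illustration, not the logic of pipeline-master-explorer.r. The step names, the setting labels (including the hypothetical `res_100kb`), the trail identifier format, and the in-memory `completed` set standing in for the “results” directory are all assumptions.

```python
from itertools import product


def enumerate_trails(steps):
    """Yield every trail: one (step, setting) pair per pipeline step, in order."""
    names = list(steps)
    for combo in product(*(steps[name] for name in names)):
        yield tuple(zip(names, combo))


def trail_id(trail):
    """Build a path-like identifier for a trail (hypothetical naming scheme)."""
    return "/".join(f"{step}.{setting}" for step, setting in trail)


def pending_trails(steps, completed):
    """Return the trails whose identifier is not yet in the completed set."""
    return [t for t in enumerate_trails(steps) if trail_id(t) not in completed]


# Hypothetical settings per step; only some labels echo names from the text.
steps = {
    "align": ["default"],
    "filter": ["standard"],
    "matrix": ["res_40kb", "res_100kb"],
    "correction": ["scaling", "ic", "hicnorm"],
    "domains": ["topdom", "armatus.gamma_0.5", "di"],
}

# The trail highlighted in the figure: 40 kb bins, scaling, TopDom.
red_trail = "align.default/filter.standard/matrix.res_40kb/correction.scaling/domains.topdom"

all_trails = list(enumerate_trails(steps))
todo = pending_trails(steps, completed={red_trail})
print(len(all_trails), "trails in total,", len(todo), "still to run")
```

Because the number of trails is the product of the number of settings per step, adding one more setting to any step multiplies the total, which is why skipping already completed outputs matters.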
The HiC-bench toolkit
| Hi-C tasks | HiC-bench toolkit |
|---|---|
| Alignment | bowtie2, |
| Filtering | |
| Genome browser tracks | |
| Matrix generation | |
| Matrix correction | IC, HiCNorm, |
| Boundary scores | |
| Domain calling | DI, Armatus, TopDom, |
| Interactions | |
| Annotations | |
| Visualization | HiCPlotter |
The HiC-bench toolkit consists mostly of newly-developed tools (shown in bold) but we have also incorporated existing tools to allow comparisons and benchmarking
Fig. 3 Comparison of topological domain calling methods subject to Hi-C contact matrix preprocessing by simple filtering or iterative correction (IC). The methods were assessed in terms of boundary overlap between replicates (a), change (%) in mean boundary overlap after matrix correction (b), change (%) in standard deviation of mean overlap across replicates after matrix correction (c) and number of identified topological domains per cell type (d). The different colors correspond to the different callers. Gradients of the same color are used for the different values of the same parameter, ranging from low (light color) to high (dark color) values. The TAD callers along with the corresponding parameter settings are presented in the legend. For this analysis, all available read pairs were used.
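Boundary overlap between replicates and its percent change after matrix correction can be computed in several ways, and the caption does not spell out the definition used. The Python sketch below shows one plausible version under stated assumptions: a boundary counts as matched if the other replicate has a boundary within a fixed tolerance, and the overlap is the mean of the two directional matched fractions. The tolerance, the boundary coordinates, and the function names are hypothetical.

```python
from bisect import bisect_left


def matched_fraction(query, reference, tolerance):
    """Fraction of query boundaries with a reference boundary within tolerance."""
    if not query:
        return 0.0
    ref = sorted(reference)
    hits = 0
    for pos in query:
        # First reference boundary at or after pos - tolerance.
        i = bisect_left(ref, pos - tolerance)
        if i < len(ref) and ref[i] <= pos + tolerance:
            hits += 1
    return hits / len(query)


def boundary_overlap(rep_a, rep_b, tolerance):
    """Symmetric overlap: mean of the two directional matched fractions."""
    return 0.5 * (
        matched_fraction(rep_a, rep_b, tolerance)
        + matched_fraction(rep_b, rep_a, tolerance)
    )


def percent_change(before, after):
    """Relative change in percent, e.g. in mean overlap after matrix correction."""
    return 100.0 * (after - before) / before


# Made-up boundary positions (bp) on one chromosome for two replicates.
rep1 = [400_000, 1_200_000, 2_000_000, 3_600_000]
rep2 = [440_000, 1_200_000, 2_800_000]
overlap = boundary_overlap(rep1, rep2, tolerance=40_000)
print(round(overlap, 3))
```

A tolerance of one bin (40 kb here) is a common way to absorb small shifts in boundary position between replicates; a stricter or looser tolerance changes the absolute overlap values but not the way callers are compared.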
Fig. 4 Comparison of topological domain calling methods for different preprocessing methods and sequencing depths. TAD calling methods were assessed in terms of boundary overlap between replicates (a), number of identified topological domains (b) and boundary overlap across replicates upon increasing sequencing depth (c) for different matrix preprocessing (filtered and IC-corrected) and different sequencing depths (10 million, 20 million and 40 million reads). For TAD calling, only the optimal caller/parameter value pairs are shown (defined as the ones achieving the maximum boundary overlap for IC and 40 million reads). The boxplot and line colors correspond to the different TAD callers.
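The selection rule in this caption (for each caller, keep the parameter value with the maximum boundary overlap under IC at 40 million reads) can be sketched in a few lines of Python. The record layout, the parameter labels, and all overlap values below are made up for illustration and are not results from the paper.

```python
def optimal_parameters(records, preprocessing="ic", depth=40_000_000):
    """For each caller, pick the parameter with the highest overlap
    among records matching the given preprocessing and sequencing depth."""
    best = {}
    for rec in records:
        if rec["preprocessing"] != preprocessing or rec["depth"] != depth:
            continue  # only the reference condition is used for selection
        current = best.get(rec["caller"])
        if current is None or rec["overlap"] > current["overlap"]:
            best[rec["caller"]] = rec
    return {caller: rec["parameter"] for caller, rec in best.items()}


def record(caller, parameter, preprocessing, depth, overlap):
    """Small helper to build one benchmark record (hypothetical layout)."""
    return {
        "caller": caller,
        "parameter": parameter,
        "preprocessing": preprocessing,
        "depth": depth,
        "overlap": overlap,
    }


# Made-up benchmark records; overlap values are purely illustrative.
records = [
    record("topdom", "win_3", "ic", 40_000_000, 0.70),
    record("topdom", "win_5", "ic", 40_000_000, 0.78),
    record("topdom", "win_5", "filtered", 40_000_000, 0.90),
    record("armatus", "gamma_0.5", "ic", 40_000_000, 0.66),
    record("armatus", "gamma_1.0", "ic", 40_000_000, 0.61),
    record("armatus", "gamma_1.0", "ic", 10_000_000, 0.95),
]

chosen = optimal_parameters(records)
print(chosen)
```

Fixing the selection to a single reference condition keeps the same caller/parameter pair across all panels, so differences between preprocessing methods and depths are not confounded by a changing parameter choice.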