Francesco Ferrari1,2, Silvio Bicciato3, Mattia Forcato3, Chiara Nicoletti3, Koustav Pal1, Carmen Maria Livi1.
Abstract
Hi-C is a genome-wide sequencing technique used to investigate 3D chromatin conformation inside the nucleus. Computational methods are required to analyze Hi-C data and identify chromatin interactions and topologically associating domains (TADs) from genome-wide contact probability maps. We quantitatively compared the performance of 13 algorithms in their analyses of Hi-C data from six landmark studies and simulations. This comparison revealed differences in the performance of methods for chromatin interaction identification, but more comparable results for TAD detection between algorithms.
Year: 2017 PMID: 28604721 PMCID: PMC5493985 DOI: 10.1038/nmeth.4325
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Methods for Hi-C data analysis used in this comparison.
| Analysis | Method | Programming language |
|---|---|---|
| Chromatin interactions | Fit-Hi-C | Python |
| Chromatin interactions | GOTHiC | R |
| Chromatin interactions | HOMER | Perl, R |
| Chromatin interactions | HIPPIE | Python, Perl, R |
| Chromatin interactions | diffHic | R, Python |
| Chromatin interactions | HiCCUPS | Java |
| TADs | HiCseg | R |
| TADs | TADbit | Python |
| TADs | DomainCaller | Matlab, Perl |
| TADs | InsulationScore | Perl |
| TADs | Arrowhead | Java |
| TADs | TADtree | Python |
| TADs | Armatus | C++ |
HiCCUPS and Arrowhead are the algorithms for interaction and TAD calling of the Juicer software suite.
Hi-C experimental data.
| Study | Cell type(s) | Hi-C protocol | Restriction enzyme(s) | Read length (bp) | Median read count per replicate (millions) | Resolution (kb) | No. of replicate samples |
|---|---|---|---|---|---|---|---|
| Lieberman-Aiden | LCL | Dilution | HindIII (6 bp), NcoI (6 bp) | 76 | 11 | 1000 | 4 |
| Sexton | Fly embryo | Simplified | DpnII (4 bp) | 36 | 362 | 40 | 1 |
| Dixon 2012 | H1-hESC, IMR90 | Dilution | HindIII (6 bp) | 36-100 | 328 | 40 | 4 |
| Jin | H1-hESC, IMR90 | Dilution | HindIII (6 bp) | 36-50 | 440 | 5-40 | 7 |
| Rao | LCL, IMR90 | In situ | DpnII (4 bp), MboI (4 bp) | 101 | 240 | 5-40 | 23 |
| Dixon 2015 | H1-hESC | Dilution | HindIII (6 bp) | 36-50 | 999 | 5-40 | 2 |
LCL: lymphoblastoid cell lines (i.e., GM06990 in Lieberman-Aiden and GM12878 in Rao)
Dilution, simplified, and in situ refer to the Hi-C protocols presented in Lieberman-Aiden et al. (2009), Sexton et al. (2012), and Rao et al. (2014), respectively.
Samples were sequenced with different read lengths within the same study.
Resolution refers to the resolution used in this comparison. In the case of two values, the first refers to the resolution used for chromatin interactions, the second for TADs.
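As an illustration of what a resolution of 5 kb or 40 kb means in practice, the following minimal sketch maps a genomic coordinate to a fixed-size bin of a contact matrix. The function names and the 0-based, half-open coordinate convention are assumptions made for this example and are not taken from any of the compared tools.

```python
def bin_index(position, resolution_kb):
    """Return the index of the fixed-size bin containing a 0-based position."""
    return position // (resolution_kb * 1000)


def bin_interval(index, resolution_kb):
    """Return the half-open interval [start, end) covered by a bin."""
    size = resolution_kb * 1000
    return index * size, (index + 1) * size


# Hypothetical coordinate on a chromosome, binned at the two resolutions
# used for chromatin interactions (5 kb) and TADs (40 kb).
position = 35_012_345
fine_bin = bin_index(position, 5)
coarse_bin = bin_index(position, 40)
print(fine_bin, bin_interval(fine_bin, 5))
print(coarse_bin, bin_interval(coarse_bin, 40))
```

A coarser resolution pools more read pairs into each bin, which is one reason a low-coverage dataset may only be analyzed at a large bin size.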
Figure 1. Tools for Hi-C data analysis used in the comparison and their performance in data preprocessing.
a) Tools for the identification of chromatin interactions and TADs from Hi-C data and key analysis steps (orange arrows). Blue boxes detail the strategy used in each analysis step by each tool. A grey box is used when an external tool is required for a preprocessing step. Since most tools perform filtering and binning together, a blue or grey box spanning both steps is used in the schematic workflow. For filtering the following abbreviations are used: read level filtering (R); read-pair level filtering (R-pair); fragment level filtering (Fr.).
b) Percentage of aligned read pairs (alignment rate) for all datasets ordered by read length (grey arrows at the bottom). Data are shown as mean ± standard error of the mean. Samples with different or mixed read lengths were not used when calculating the alignment rate.
c) Percentage of mapped reads retained after filtering (fraction of usable reads) in each dataset, ordered by experimental protocol (grey arrows at the bottom). Data are shown as mean ± standard error of the mean. GOTHiC could not be applied to Dixon 2015 because the read-pairing step required more than 1 TB of RAM.
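Panels b and c summarize replicates as mean ± standard error of the mean. A minimal sketch of that summary is shown below. The replicate values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the legend does not state which estimator was used.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two replicates to estimate the SEM")
    mean = statistics.mean(values)
    # SEM = sample standard deviation / sqrt(number of replicates)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical alignment rates (%) for four replicates of one dataset.
rates = [90.0, 92.0, 94.0, 96.0]
mean, sem = mean_and_sem(rates)
print(f"{mean:.1f} ± {sem:.2f}")
```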
Figure 2. Comparative results of methods for the identification of chromatin interactions.
a) Scatter plot of the total number of cis interactions called by each method as a function of the number of reads retained by the filtering step in all datasets at 5 kb resolution (i.e., Jin H1-hESC, Jin IMR90, Rao GM12878, Rao IMR90, and Dixon 2015 H1-hESC; n = 32). Different points represent sample replicates. Linear interpolation for each method is shown as a solid line.
b) Boxplot of average distances between anchoring points in cis interactions (log scale) in sample replicates considering all datasets analyzed at 5 kb resolution (n = 32).
c) Heatmap of the contact matrix of Rao GM12878 replicate H (chr21:35,000,000-36,000,000) at 5kb resolution. Identified peaks are marked in different colors for the various methods.
d) Box plots of the Jaccard Index for concordance of cis (upper) and trans (lower) interaction calls between sample replicates (intra-dataset concordance) for all datasets with at least 2 replicates (n=39; Supplementary Table 1). For Fit-Hi-C and HiCCUPS, the Jaccard Index was calculated only for cis interactions since these tools do not return trans interactions.
e) Proportion of cis interactions classified on the basis of the chromatin states at their anchoring points (promoter-enhancer, upper; heterochromatin/quiescent to heterochromatin/quiescent, middle; less expected, lower) in all datasets at 5 kb. With the exception of Jin H1-hESC (which contains a single replicate), only cis interactions conserved in at least 2 replicates within each dataset were classified using the chromatin states (Supplementary Table 4).
f) Performance in recovering validated (true-positive) cis interactions. Each row represents the comparison between a list of true positives and the interactions called by each method in each dataset. The dot size is proportional to the percentage of recalled true positives, and the dot color indicates the total number of called interactions. The validation technique and the name of each true-positive list are displayed on the left. The datasets used to call interactions are on the right, shaded in grey if analyzed at 40 kb resolution. True-positive interactions were searched among cis interactions conserved in at least 2 replicates within each dataset, with the exception of Jin H1-hESC and Sexton (both containing a single replicate). GOTHiC was not applied to Dixon 2015 (see legend of Fig. 1c).
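The Jaccard Index used in panel d to measure concordance between replicates is the size of the intersection of two call sets divided by the size of their union. The sketch below illustrates the idea under simplifying assumptions: each interaction is a hypothetical `(chromosome, bin, bin)` tuple, and two calls match only if both anchor bins are identical. The matching criteria actually used in the comparison are not given in this text.

```python
def normalize(interaction):
    """Order the two anchor bins so (a, b) and (b, a) are the same call."""
    chrom, bin_a, bin_b = interaction
    return (chrom, min(bin_a, bin_b), max(bin_a, bin_b))


def jaccard_index(calls_a, calls_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of interaction calls."""
    set_a = {normalize(i) for i in calls_a}
    set_b = {normalize(i) for i in calls_b}
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty call sets
    return len(set_a & set_b) / len(union)


# Hypothetical cis interactions called in two replicates (5 kb bin indices).
rep1 = [("chr21", 7000, 7010), ("chr21", 7020, 7050), ("chr21", 7100, 7130)]
rep2 = [("chr21", 7010, 7000), ("chr21", 7020, 7050), ("chr21", 7200, 7210)]
print(jaccard_index(rep1, rep2))
```

A value of 1 means the two replicates yield identical call sets, and a value of 0 means they share no calls.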
Figure 3. Comparative results of methods for the identification of TADs.
a) Scatter plot of the total number of TADs called by each method as a function of the number of reads retained by the filtering step in all datasets except Lieberman-Aiden and Jin H1-hESC (n = 36; Supplementary Table 1). Different points represent sample replicates. Loess interpolation for each method is shown as a solid line.
b) Boxplot of median TAD size in all replicates of all datasets (analyzed at 40kb) except Lieberman-Aiden and Jin H1-hESC (n=36).
c) Heatmap of the contact matrix of Rao GM12878 replicate H (chr1:153,000,000-155,500,000) at 40kb resolution. Identified TADs are framed in different colors for the various methods.
d) Box plots of the Jaccard Index for concordance of TAD boundaries between sample replicates of all datasets with at least 2 replicates (n=39).
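The quantities in panels b and d can be derived from a plain list of TAD intervals. The following sketch computes the median TAD size of one replicate and the overlap of TAD boundaries between two replicates. The `(chromosome, start, end)` representation, the example coordinates, and the requirement that boundaries coincide exactly are assumptions for illustration; the comparison may have allowed a tolerance around each boundary.

```python
import statistics


def median_tad_size(tads):
    """Return the median length (bp) of a list of (chrom, start, end) TADs."""
    return statistics.median([end - start for _, start, end in tads])


def boundary_set(tads):
    """Collect every TAD start and end as a (chrom, position) boundary."""
    return {(chrom, pos) for chrom, start, end in tads for pos in (start, end)}


def boundary_concordance(tads_a, tads_b):
    """Shared boundaries divided by all boundaries seen in either replicate."""
    a, b = boundary_set(tads_a), boundary_set(tads_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical TAD calls from two replicates, aligned to 40 kb bins.
rep1 = [("chr1", 153_000_000, 153_400_000),
        ("chr1", 153_400_000, 154_200_000),
        ("chr1", 154_200_000, 155_000_000)]
rep2 = [("chr1", 153_000_000, 153_440_000),
        ("chr1", 153_440_000, 154_200_000)]
print(median_tad_size(rep1), boundary_concordance(rep1, rep2))
```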