Galip Gürkan Yardımcı, Hakan Ozadam, Michael E G Sauria, Oana Ursu, Koon-Kiu Yan, Tao Yang, Abhijit Chakraborty, Arya Kaul, Bryan R Lajoie, Fan Song, Ye Zhan, Ferhat Ay, Mark Gerstein, Anshul Kundaje, Qunhua Li, James Taylor, Feng Yue, Job Dekker, William S Noble.
Abstract
BACKGROUND: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study.
Year: 2019 PMID: 30890172 PMCID: PMC6423771 DOI: 10.1186/s13059-019-1658-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of the study. (a) Schematic showing the approach for generating noise-injected Hi-C matrices. In the upper panel, we generate two types of noise from real Hi-C data (center): random ligation noise (right) and genomic distance effect noise (left). The three matrices are then mixed to generate noisy datasets (lower panel). By changing the mixing proportions, we can create datasets with varying percentages of noise. (b) To benchmark the performance of various quality control and reproducibility measures, we compiled a large number of Hi-C replicates from 13 cell types and simulated noise-injected datasets from the original data. Real and simulated datasets, binned at different resolutions and downsampled to different coverage levels, are the inputs to the reproducibility and quality control measures, which assign a score to each replicate pair and to each single replicate. The performance of each measure is evaluated on its ability to correctly rank real and simulated datasets. (c) Summary of the basic principles of the four reproducibility methods evaluated in this study.
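The mixing step described in panel (a) of Fig. 1 can be sketched as a weighted blend of three matrices. This is a minimal illustration under simplifying assumptions, not the study's implementation: the function names, the per-diagonal averaging used for the distance-effect noise, and the uniform matrix used for random ligation noise are all hypothetical stand-ins chosen so that each matrix carries the same total count.

```python
def distance_noise(real):
    """Hypothetical distance-effect noise: replace each entry with the
    mean count of its diagonal, keeping only the distance decay."""
    n = len(real)
    sums = [0.0] * n
    counts = [0] * n
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            sums[d] += real[i][j]
            counts[d] += 1
    means = [s / c for s, c in zip(sums, counts)]
    return [[means[abs(i - j)] for j in range(n)] for i in range(n)]


def random_ligation_noise(real):
    """Hypothetical random-ligation noise: spread the total count
    uniformly over all bin pairs."""
    n = len(real)
    total = sum(map(sum, real))
    return [[total / (n * n)] * n for _ in range(n)]


def mix_matrices(real, ligation, distance, noise_frac, ligation_share):
    """Blend real signal with the two noise matrices.

    noise_frac:     fraction of the signal replaced by noise (0..1).
    ligation_share: fraction of that noise drawn from random ligation;
                    the remainder comes from the distance-effect noise.
    """
    n = len(real)
    w_real = 1.0 - noise_frac
    w_lig = noise_frac * ligation_share
    w_dist = noise_frac * (1.0 - ligation_share)
    return [
        [
            w_real * real[i][j] + w_lig * ligation[i][j] + w_dist * distance[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Because the three weights sum to one and all three matrices share the same total, the mixed matrix keeps the original total count; only `noise_frac` and `ligation_share` change between simulated datasets.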
Fig. 2 Comparison of reproducibility measures. (a) Curves showing the mean reproducibility score assigned to 11 cell types at each noise injection level for 33% and 66% random ligation noise configurations. Vertical bars represent one standard deviation away from the mean. (b) Reproducibility scores assigned to biological replicate (blue), non-replicate (red), and pseudo-replicate (purple) pairs for each cell type. Coverage values are the mean number of interactions for each pair of replicates. (c) Reproducibility scores assigned to biological replicate (blue), non-replicate (red), and pseudo-replicate (purple) pairs from six cell types at seven different coverage levels. Dashed lines indicate the empirical threshold for distinguishing biological replicate pairs from non-replicate pairs. (d) Reproducibility scores assigned to biological replicate (blue) and non-replicate (red) pairs for clone-8 and S2 cells from Drosophila. Each panel shows the separation between two replicate pair types for each Hi-C reproducibility measure. Dashed lines correspond to the empirical thresholds inferred from human Hi-C data.
Fig. 3 Effects of resolution on reproducibility measures. (a) Reproducibility scores assigned to biological replicate (blue), non-replicate (red), and pseudo-replicate (purple) pairs from HepG2 and HeLa Hi-C datasets at 10-kb, 40-kb, and 500-kb resolutions. (b) Reproducibility scores assigned to different cell types at different resolutions, plotted as a function of noise level. (c) Reproducibility scores assigned to downsampled biological replicate pairs at different resolutions. Both the HepG2 and HeLa datasets contain more than 400 million read pairs.
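Downsampling a replicate to a lower coverage level, as used for the comparisons in Fig. 3, can be approximated by keeping each read pair independently with a fixed probability. The sketch below is one simple way to do this and is an assumption rather than the procedure used in the study; the function name `downsample` and its `fraction` and `seed` parameters are hypothetical.

```python
import random


def downsample(matrix, fraction, seed=0):
    """Thin a symmetric integer contact matrix.

    Each read pair is kept independently with probability `fraction`,
    so the expected coverage of the result is `fraction` times the
    original coverage. Only the upper triangle is sampled and then
    mirrored, which keeps the output symmetric.
    """
    rng = random.Random(seed)
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < fraction)
            out[i][j] = kept
            out[j][i] = kept
    return out
```

Looping over individual reads is slow for deeply sequenced data; it is used here only to keep the sketch dependency-free and easy to check.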
Fig. 4 Quality measures. (a) QuASAR-QC scores assigned to noise-injected matrices from 11 cell types. (b) Total number of significant contacts above a 5% FDR threshold from noise-injected matrices from 11 cell types. (c) Violin plots showing the distribution of TAD boundary distances between biological replicates and noise-injected replicates for T47D cells. There is no significant change in the distribution of TAD boundary distances at any given noise level. (d) QuASAR-QC scores assigned to downsampled replicates from six different cell types. (e) Total number of significant contacts above a 5% FDR threshold from downsampled replicates from six different cell types. (f) Violin plots showing the distribution of distances between domain boundaries in biological replicates and noise-injected replicates for T47D cells. In panels c and f, asterisks indicate that the distribution of boundary distances is significantly larger than the null distribution, which is obtained by comparing biological replicates.
Fig. 5 Comparison of QuASAR-QC to mapping statistics. Scatter plots of QuASAR-QC scores of biological replicates from 13 cell types plotted against quality statistics that describe percentages of (a) successful mapping, (b) artifactual Hi-C fragments, (c) intrachromosomal interactions, and (d) PCR duplicates. Dots correspond to low-coverage Hi-C replicates from 11 cell types generated using HindIII, and triangles correspond to replicates from two deeply sequenced cell types generated using DpnII. Red dots correspond to a subset of samples with very similar total coverage (138–171 million read pairs). Each plot lists two Pearson correlation coefficients: the correlation between the given statistic and QuASAR-QC scores for only the 11 HindIII cell types, and the correlation for all 13 cell types.
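The two Pearson coefficients listed on each panel of Fig. 5 can be computed as sketched below: one over the HindIII replicates only and one over all replicates. The record layout `(statistic, qc_score, enzyme)` and the function names are hypothetical, introduced only for this illustration; no values from the study are reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def panel_correlations(records):
    """Return (r_hindiii_only, r_all) for one mapping statistic.

    records: iterable of (statistic, qc_score, enzyme) tuples, where
    enzyme is a string such as "HindIII" or "DpnII" (assumed layout).
    """
    records = list(records)
    hind = [(s, q) for s, q, e in records if e == "HindIII"]
    r_hind = pearson([s for s, _ in hind], [q for _, q in hind])
    r_all = pearson([s for s, _, _ in records], [q for _, q, _ in records])
    return r_hind, r_all
```

Reporting both coefficients shows whether a trend seen in the low-coverage HindIII replicates still holds once the two deeply sequenced DpnII cell types are included.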
| Biosample | ENCODE sample ID |
| --- | --- |
| A549 | ENCSR444WCZ |
| CAKI2 | ENCSR401TBQ |
| G401 | ENCSR079VIJ |
| LNCaP | ENCSR346DCU |
| NCIH460 | ENCSR489OCU |
| PANC1 | ENCSR440CTR |
| RPMI7951 | ENCSR862OG |
| SKMEL5 | ENCSR312KHQ |
| SKNDZ | ENCSR105KFX |
| SKNMC | ENCSR834DXR |
| T47D | ENCSR549MGQ |
| HepG2 | ENCSR194SRI |
| HeLa | ENCSR693GXU |