HiC-spector: a matrix library for spectral and reproducibility analysis of Hi-C contact maps.

Koon-Kiu Yan, Galip Gürkan Yardimci, Chengfei Yan, William S Noble, Mark Gerstein.

Abstract

SUMMARY: Genome-wide proximity ligation based assays like Hi-C have opened a window to the 3D organization of the genome. In so doing, they present data structures that are different from conventional 1D signal tracks. To exploit the 2D nature of Hi-C contact maps, matrix techniques like spectral analysis are particularly useful. Here, we present HiC-spector, a collection of matrix-related functions for analyzing Hi-C contact maps. In particular, we introduce a novel reproducibility metric for quantifying the similarity between contact maps based on spectral decomposition. The metric successfully separates pairs of contact maps derived from biological replicates, pseudo-replicates and different cell types.
AVAILABILITY AND IMPLEMENTATION: Source code in Julia and Python, and detailed documentation, are available at https://github.com/gersteinlab/HiC-spector.
CONTACT: koonkiu.yan@gmail.com or mark@gersteinlab.org.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2017. Published by Oxford University Press.

Year:  2017        PMID: 28369339      PMCID: PMC5870694          DOI: 10.1093/bioinformatics/btx152

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Genome-wide proximity ligation assays such as Hi-C have emerged as powerful techniques to understand the 3D organization of the genome (Kalhor et al.; Lieberman-Aiden et al.). Although these techniques offer new biological insights, they demand different data structures and present new computational questions (Ay and Noble, 2015; Dekker et al.). For instance, a fundamental question of practical importance is how to quantify the similarity between two Hi-C data sets. In particular, given two experimental replicates, how can we determine whether the experiments are reproducible? Data from Hi-C experiments are usually summarized by so-called chromosomal contact maps. After the genome is divided into equally sized bins, a contact map is a matrix whose elements store the population-averaged co-location frequencies between pairs of loci. Because a contact map is a matrix, mathematical tools like spectral analysis can be extremely useful in understanding it. Our aim is to provide a set of basic analysis tools for handling Hi-C contact maps. In particular, we introduce a simple but novel metric to quantify the reproducibility of the maps using spectral decomposition.
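The binning step described above can be illustrated with a minimal sketch. It assumes intra-chromosomal read pairs given as 0-based positions smaller than the chromosome length; the function name `build_contact_map` and its arguments are hypothetical and are not part of HiC-spector, which takes contact maps produced by other tools as input.

```python
def build_contact_map(pairs, chrom_length, bin_size):
    """Count read pairs per pair of equally sized bins (illustrative only)."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    contact_map = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        contact_map[i][j] += 1
        if i != j:
            contact_map[j][i] += 1  # mirror the count to keep the matrix symmetric
    return contact_map


# Toy input: three read pairs on a 3 kb chromosome, binned at 1 kb.
toy_pairs = [(100, 2500), (1200, 1300), (2600, 150)]
toy_map = build_contact_map(toy_pairs, chrom_length=3000, bin_size=1000)
```

The result is a symmetric, non-negative matrix of raw counts; real pipelines usually also filter and normalize the counts before any downstream analysis.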

2 Algorithms

We represent a chromosomal contact map by a symmetric and non-negative adjacency matrix $W$. The matrix elements represent the frequencies of contact between genomic loci. A recent single-cell imaging experiment suggests that the frequency serves as a reasonable proxy of spatial distance (Wang et al.). In principle, the larger the value of $W_{ij}$, the shorter the distance between loci $i$ and $j$. The starting point of spectral analysis is the Laplacian matrix $L$, which is defined as $L = D - W$. Here $D$ is a diagonal matrix in which $D_{ii} = \sum_j W_{ij}$ (the coverage of bin $i$ in the context of Hi-C). The Laplacian matrix further takes a normalized form $\mathcal{L} = D^{-1/2} L D^{-1/2}$ (Chung, 1997). It can be verified that 0 is an eigenvalue of $\mathcal{L}$, and the set of eigenvalues of $\mathcal{L}$ ($0 = \lambda_0 \le \lambda_1 \le \dots \le \lambda_{n-1}$, where $n$ is the number of bins) is referred to as the spectrum of $\mathcal{L}$. Given two contact maps $W^A$ and $W^B$, we propose to quantify their similarity by decomposing their corresponding Laplacian matrices $\mathcal{L}^A$ and $\mathcal{L}^B$ and then comparing their eigenvectors. Let $\{\lambda_0^A, \lambda_1^A, \dots, \lambda_{n-1}^A\}$ and $\{\lambda_0^B, \lambda_1^B, \dots, \lambda_{n-1}^B\}$ be the spectra of $\mathcal{L}^A$ and $\mathcal{L}^B$, and $\{v_0^A, v_1^A, \dots, v_{n-1}^A\}$ and $\{v_0^B, v_1^B, \dots, v_{n-1}^B\}$ be their sets of normalized eigenvectors. A distance metric $S_d$ is defined as

$$S_d(A, B) = \sum_{i=0}^{r-1} \lVert v_i^A - v_i^B \rVert.$$

Here $\lVert \cdot \rVert$ represents the Euclidean norm. The parameter $r$ is the number of leading eigenvectors picked from $\mathcal{L}^A$ and $\mathcal{L}^B$. In general, $S_d$ provides a metric to gauge the similarity between two contact maps. $v_i^A$ and $v_i^B$ are more correlated if A and B are two biological replicates than if they come from two different cell lines (see Supplementary Fig. S1). For the choice of $r$, as in any principal component analysis, the leading eigenvectors are more important than the lower ranked eigenvectors. In fact, we observe that the Euclidean distance between a pair of high-order eigenvectors is the same as the distance between a pair of unit vectors whose components are randomly sampled from a standard normal distribution (see Supplementary Fig. S2). In other words, the high-order eigenvectors are essentially noise terms, whereas the signal is stored in the leading vectors. As a rule of thumb, we found the choice $r = 20$ is good enough for practical purposes.
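The construction above (normalized Laplacian, eigendecomposition, summed distances between leading eigenvectors) can be sketched with NumPy as follows. This is a minimal illustration, not the HiC-spector implementation. It assumes both maps are dense, share the same bins and have no zero-coverage bins. It also assumes that the arbitrary sign of each eigenvector is resolved by taking the smaller of the two possible distances, which the text does not specify. The function names are hypothetical.

```python
import numpy as np


def normalized_laplacian(W):
    """Return D^{-1/2} (D - W) D^{-1/2} for a symmetric, non-negative matrix W."""
    W = np.asarray(W, dtype=float)
    coverage = W.sum(axis=1)  # D_ii, the coverage of bin i
    if np.any(coverage <= 0):
        raise ValueError("remove zero-coverage bins before calling this function")
    scale = 1.0 / np.sqrt(coverage)
    # D^{-1/2} (D - W) D^{-1/2} simplifies to I - D^{-1/2} W D^{-1/2}
    return np.eye(len(coverage)) - scale[:, None] * W * scale[None, :]


def leading_eigenvectors(W, r):
    """Unit-norm eigenvectors for the r smallest eigenvalues of the normalized Laplacian."""
    _, vectors = np.linalg.eigh(normalized_laplacian(W))  # eigenvalues in ascending order
    return vectors[:, :r]


def spectral_distance(W_a, W_b, r):
    """Sum of Euclidean distances between matching leading eigenvectors."""
    V_a = leading_eigenvectors(W_a, r)
    V_b = leading_eigenvectors(W_b, r)
    total = 0.0
    for i in range(r):
        a, b = V_a[:, i], V_b[:, i]
        # Assumption of this sketch: an eigenvector's sign is arbitrary,
        # so compare a with whichever of +b and -b lies closer.
        total += min(np.linalg.norm(a - b), np.linalg.norm(a + b))
    return total
```

Comparing a map with itself gives a distance of zero. With the sign handling used here, each term is at most $\sqrt{2}$, so the sum is bounded by $r\sqrt{2}$.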
Furthermore, as the distance between a pair of randomly sampled unit vectors provides a reference, we linearly rescale the distance metric into a reproducibility score $Q$ that ranges from 0 to 1 (see the Supplementary Material). We used HiC-spector to calculate the reproducibility scores for more than a hundred pairs of Hi-C contact maps. As shown in Figure 1, the reproducibility scores between pseudo-replicates are greater than the scores for real biological replicates, which are in turn greater than the scores between maps from different cell lines (see the Supplementary Material). It is worthwhile to point out that two contact maps can also be compared in terms of features like topologically associating domains (TADs) and loops. Such comparisons, however, depend strongly on the choices of methods and parameters. By contrast, what we refer to as 'reproducibility' is a direct comparison of the contact maps.
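The role of the random reference can be illustrated with a short sketch. Two independent random unit vectors in a high-dimensional space are nearly orthogonal, so their Euclidean distance is close to $\sqrt{2}$. The exact rescaling used by HiC-spector is given in the Supplementary Material and is not reproduced here; the form $Q = 1 - S_d / (r\sqrt{2})$ below, clipped to the interval from 0 to 1, is an assumption made for illustration, and the function names are hypothetical.

```python
import math
import random


def random_unit_vector(n, rng):
    """Unit vector whose components are drawn from a standard normal distribution."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_random_distance(n, trials, seed=0):
    """Average Euclidean distance between pairs of random unit vectors of length n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = random_unit_vector(n, rng)
        b = random_unit_vector(n, rng)
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / trials


def reproducibility_score(s_d, r, reference=math.sqrt(2)):
    """Assumed linear rescaling of S_d: 1 for identical maps, 0 at the random reference."""
    q = 1.0 - s_d / (r * reference)
    return min(1.0, max(0.0, q))
```

Under this assumed rescaling, a distance of zero maps to a score of 1, and a distance equal to $r$ times the random reference maps to a score of 0.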
Fig. 1

Reproducibility scores for three sets of pairs of Hi-C contact maps. Contact maps came from Hi-C experiments performed in 11 cell lines. Biological replicates refer to a pair of replicates of the same experiment. Pseudo-replicates are obtained by pooling the reads from two replicates and then downsampling. There are 11 pairs of biological replicates, 33 pairs of pseudo-replicates, and 110 pairs of maps between different cell types. Each box shows, for one pair, the distribution of Q across 23 chromosomes; crosses mark the outliers.

Mathematically, there are different ways to compare two matrices. For instance, one could assume all matrix elements are independent and define a distance metric using Spearman correlation. The intuition behind the distance metric $S_d$ is that spectral decomposition is essentially a better way to decompose a contact map. The normalized Laplacian matrix $\mathcal{L}$ is closely related to a random walk process taking place in the underlying graph of the contact map $W$. The leading eigenvector refers to the steady state distribution; the next few eigenvectors correspond to the slower decay modes of the random walk process and capture the densely interacting domains that are highly significant in contact maps. In fact, HiC-spector can better separate biological replicates and non-replicates compared with the correlation coefficient (see Supplementary Fig. S3).
Apart from the reproducibility score, HiC-spector provides a number of matrix algorithms useful for analyzing contact maps. For instance, to perform a widely used normalization procedure for contact maps (Imakaev et al.), we include the Knight-Ruiz algorithm (Knight and Ruiz, 2012), which is a newer and faster algorithm for matrix balancing. We have also included functions for estimating the average contact frequency with respect to the genomic distance, as well as for identifying the so-called A/B compartments (Lieberman-Aiden et al.) using the corresponding correlation matrix.
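As an illustration of the last of these utilities, the average contact frequency at each genomic distance can be estimated by averaging the diagonals of the contact map. This is a minimal sketch that assumes a dense, square matrix with equally sized bins; the function name `mean_contact_by_distance` is hypothetical and is not the name used in HiC-spector.

```python
def mean_contact_by_distance(W):
    """Mean contact frequency at each bin separation k (the k-th diagonal of W)."""
    n = len(W)
    means = []
    for k in range(n):
        diagonal = [W[i][i + k] for i in range(n - k)]
        means.append(sum(diagonal) / len(diagonal))
    return means


# Toy map in which contact frequency decays with genomic distance.
toy_W = [
    [4, 2, 1],
    [2, 4, 2],
    [1, 2, 4],
]
decay = mean_contact_by_distance(toy_W)
```

Dividing each entry of the map by the mean at its separation gives an observed-over-expected matrix, which is a common starting point for the correlation matrix used in compartment analysis.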

3 Implementation and benchmark

HiC-spector is a library written in Julia, a high-performance language for technical computing. A Python script for the reproducibility score is also provided. The bottleneck for evaluating $Q$ is matrix diagonalization. The computation is fast in practice, but the runtime depends on the size of the contact maps (see Supplementary Fig. S5 for details).

4 Materials and methods

Hi-C data were generated by the ENCODE consortium (see the Supplementary Material). Contact maps were generated using the tool cworld (https://github.com/dekkerlab/cworld-dekker).
References

1. Job Dekker; Marc A Marti-Renom; Leonid A Mirny. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet. 2013-05-09. (Review)

2. Reza Kalhor; Harianto Tjong; Nimanthi Jayathilaka; Frank Alber; Lin Chen. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat Biotechnol. 2011-12-25.

3. Siyuan Wang; Jun-Han Su; Brian J Beliveau; Bogdan Bintu; Jeffrey R Moffitt; Chao-ting Wu; Xiaowei Zhuang. Spatial organization of chromatin domains and compartments in single chromosomes. Science. 2016-07-21.

4. Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009-10-09.

5. Ferhat Ay; William S Noble. Analysis methods for studying the 3D architecture of the genome. Genome Biol. 2015-09-02. (Review)

6. Maxim Imakaev; Geoffrey Fudenberg; Rachel Patton McCord; Natalia Naumova; Anton Goloborodko; Bryan R Lajoie; Job Dekker; Leonid A Mirny. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods. 2012-09-02.
