4DNvestigator: time series genomic data analysis toolbox.

Stephen Lindsly¹, Can Chen², Sijia Liu³,⁴, Scott Ronquist¹, Samuel Dilworth⁵, Michael Perlman⁶, Indika Rajapakse¹,².

Abstract

Data on genome organization and output over time, or the 4D Nucleome (4DN), require synthesis for meaningful interpretation. Development of tools for the efficient integration of these data is needed, especially for the time dimension. We present the '4DNvestigator', a user-friendly network-based toolbox for the analysis of time series genome-wide genome structure (Hi-C) and gene expression (RNA-seq) data. Additionally, we provide methods to quantify network entropy, tensor entropy, and statistically significant changes in time series Hi-C data at different genomic scales.

Keywords: 4DN; centrality; entropy; networks; time series

Year: 2021; PMID: 33794739; PMCID: PMC8049205; DOI: 10.1080/19491034.2021.1910437

Source DB: PubMed; Journal: Nucleus; ISSN: 1949-1034; Impact factor: 4.197

Introduction

4D nuclear organization (4D Nucleome, 4DN) is defined by the dynamical interaction between 3D genome structure and function [1-3]. To analyze the 4DN, genome-wide chromosome conformation capture (Hi-C) and RNA sequencing (RNA-seq) are often used to observe genome structure and function, respectively (Figure 1a). The availability and volume of Hi-C and RNA-seq data are expected to increase as high-throughput sequencing costs decline, so methods to analyze these data are needed. The relationship between genome structure and function has been studied previously [3-7], yet comprehensive and accessible tools for 4DN analysis remain underdeveloped. The 4DNvestigator is a unified toolbox that loads time series Hi-C and RNA-seq data, extracts important structural and functional features (Figure 1b), and applies both established and novel 4DN data analysis methods. We show that network centrality can be integrated with gene expression to elucidate structural and functional changes through time, and provide relevant links to the NCBI and GeneCards databases for biological interpretation of these changes [8,9]. Furthermore, we use entropy to quantify the uncertainty of genome structure, and present a simple statistical method for comparing two or more Hi-C matrices.

Figure 1.

The 4D Nucleome. (a) Representative time series Hi-C and RNA-seq data correspond to genome structure and function, respectively. (b) Genome structure and function are intimately related. The 4DNvestigator integrates and visualizes time series data to study their dynamical relationship.

Materials and methods

An overview of the 4DNvestigator workflow is depicted in Figure 2, and a Getting Started document is provided to guide the user through the main functionalities of the 4DNvestigator. The 4DNvestigator takes processed Hi-C and RNA-seq data as input, along with a metadata file that describes the sample and time point for each input Hi-C and RNA-seq file (see Supplementary Materials ‘Data Preparation’). Several novel methods for analyzing 4DN data are included within the 4DNvestigator and are described below.

Figure 2.

Overview of the 4DNvestigator data processing pipeline. Within this diagram, 4DN refers to the 4DNvestigator.

4DN feature analyzer

The ‘4DN feature analyzer’ quantifies and visualizes how much a genomic region changes in structure and function over time. To analyze both structural and functional data, we consider the genome as a network. Nodes within this network are genomic loci, where a locus can be a gene or a genomic region at a particular resolution (e.g. 100 kb or 1 Mb bins). Edges in the genomic network are the relationships or interactions between genomic loci.

Structural data

Structure in the 4DN feature analyzer is derived from Hi-C data. Hi-C determines the edge weights in our genomic network through the frequency of contacts between genomic loci. To analyze genomic networks, we adopt an important concept from network theory called centrality. Network centrality is motivated by the identification of nodes that are the most ‘central’ or ‘important’ within a network [10]. The 4DN feature analyzer uses degree, eigenvector, betweenness, and closeness centrality (step 1 of Algorithm 1), which have been shown to be biologically relevant [7]. For example, eigenvector centrality can identify structurally defined regions of active/inactive gene expression, since it encodes clustering information of a network [7,11]. Additionally, betweenness centrality measures the importance of nodes in regard to the flow of information between pairs of nodes. Boundaries between euchromatin and heterochromatin, which often change in reprogramming experiments, can be identified in a genomic network through betweenness centrality [7].
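As a rough illustration of the centrality idea, the sketch below computes weighted degree and eigenvector centrality for a toy contact matrix, using power iteration for the leading eigenvector. The matrix values and function names are hypothetical and are not taken from the 4DNvestigator itself; betweenness and closeness centrality are omitted for brevity.

```python
import math

def degree_centrality(contacts):
    """Weighted degree: total contact frequency of each locus (row sum)."""
    return [sum(row) for row in contacts]

def eigenvector_centrality(contacts, iterations=1000, tol=1e-12):
    """Leading eigenvector of a symmetric non-negative matrix via power iteration."""
    n = len(contacts)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(contacts[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        nxt = [v / norm for v in nxt]
        if max(abs(a - b) for a, b in zip(nxt, vec)) < tol:
            return nxt
        vec = nxt
    return vec

# Hypothetical 4-locus contact matrix (symmetric, zero diagonal).
toy_contacts = [
    [0, 8, 6, 1],
    [8, 0, 7, 1],
    [6, 7, 0, 2],
    [1, 1, 2, 0],
]
degree = degree_centrality(toy_contacts)
eig = eigenvector_centrality(toy_contacts)
```

In this toy network the fourth locus has few contacts with the others, so it receives the lowest score under both measures.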

Functional data

Function in the 4DN feature analyzer is derived from gene expression through RNA-seq. Function is defined as the log2 transformation of Transcripts Per Million (TPM) or Reads Per Kilobase Million (RPKM). For regions containing more than one gene, the mean expression of all genes within the region is used. The 4DN feature analyzer can also use other one-dimensional features (e.g. ChIP-seq, DNase-seq, etc.), in which case the interpretation of the results and visualizations changes accordingly.
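The per-region averaging can be sketched as follows. This is only an illustration: the gene coordinates and TPM values are made up, and the log2 transform, the pseudocount of 1, and the choice to average before transforming are assumptions of the sketch rather than details confirmed by the text.

```python
import math
from collections import defaultdict

def bin_expression(genes, bin_size=100_000, pseudocount=1.0):
    """Average the expression of genes in each genomic bin, then log2-transform.

    genes: list of (gene_name, start_position_bp, tpm) tuples.
    The pseudocount and the mean-then-log order are assumptions of this sketch.
    """
    per_bin = defaultdict(list)
    for _name, start, tpm in genes:
        per_bin[start // bin_size].append(tpm)
    return {
        bin_index: math.log2(sum(values) / len(values) + pseudocount)
        for bin_index, values in sorted(per_bin.items())
    }

# Hypothetical genes on one chromosome: (name, start in bp, TPM).
toy_genes = [("geneA", 20_000, 3.0), ("geneB", 70_000, 11.0), ("geneC", 150_000, 0.0)]
binned = bin_expression(toy_genes)
```

Here the first 100 kb bin contains two genes, so its value is based on their mean TPM, while the second bin holds a single unexpressed gene.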

Integration of data

Hi-C data is naturally represented as a matrix of contacts between genomic loci. Network centrality measures are one-dimensional vectors that describe important structural features of the genomic network. We combine network centrality with RNA-seq expression to form a structure-function ‘feature’ matrix that defines the state of each genomic region at each time point (Figure 3A, step 3 of Algorithm 1). Within this matrix, rows represent genomic loci and columns are the centrality measures (structure) and gene expression (function) of each locus. The z-score for each column is computed to normalize the data (step 4 of Algorithm 1).
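A minimal sketch of assembling and normalizing such a feature matrix is given below. The feature values are hypothetical, and the use of the population standard deviation in the z-score is an assumption; the toolbox may normalize differently.

```python
import statistics

def zscore_columns(matrix):
    """Z-score each column: subtract the column mean, divide by its standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]  # population SD (assumption)
    return [
        [(value - mu) / sd if sd > 0 else 0.0 for value, mu, sd in zip(row, means, stdevs)]
        for row in matrix
    ]

# Hypothetical per-locus features at one time point.
degree_centrality = [15.0, 16.0, 15.0, 4.0]
eigenvector_centrality = [0.58, 0.60, 0.52, 0.16]
expression = [3.0, 0.0, 5.2, 1.1]

# Rows are loci; columns are structural features followed by expression.
features = [list(row) for row in zip(degree_centrality, eigenvector_centrality, expression)]
normalized = zscore_columns(features)
```

After normalization every column has zero mean and unit variance, so features measured on very different scales contribute comparably to later dimension reduction.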

Figure 3.

4DN feature analyzer. (A) 4DN data is input to the 4DN feature analyzer. Top: Structure data (Hi-C) is described using one-dimensional features for compatibility with function data (RNA-seq). Bottom: Multiple structural features and function data are integrated into the structure-function feature matrix. (B) The 4DN feature analyzer can use structure and function data directly to visualize a system’s dynamics using the 4DN phase plane [3,15]. Structure defines the x-axis (left: eigenvector centrality, right: PC1) and function defines the y-axis (left: log2(RPKM), right: log2(TPM)), and points show structure-function coordinates through time. Left: Maternal and paternal alleles of nine cell cycle genes through G1, S, and G2/M phases of the cell cycle (adapted from [15]). Right: Top ten genomic regions (100 kb) with the largest changes in structure and function during cellular reprogramming [7]. (C) Multiple dimension reduction techniques can be used to visualize the 4DN feature analyzer’s structure-function feature matrix (from left to right: LE, UMAP, and t-SNE). Top: 100 kb regions of Chromosome 4 across six time points during cellular differentiation [28]. Bottom: 100 kb regions of Chromosome 11 across three time points during cellular reprogramming [7]. (D) Example output of the 4DN feature analyzer. The output includes genes contained in loci with the largest changes, and links to their NCBI and GeneCards database entries [8,9].

4DN analysis

The 4DN feature analyzer reduces the dimension of the structure-function feature matrix for visualization and further analysis (steps 5 and 6 of Algorithm 1). We include the main linear dimension reduction method, Principal Component Analysis (PCA), and multiple nonlinear dimension reduction methods: Laplacian Eigenmaps (LE) [12], t-distributed Stochastic Neighbor Embedding (t-SNE) [13], and Uniform Manifold Approximation and Projection (UMAP) [14] (Figure 3C). These methods are described in more detail in Supplementary Materials ‘Dimension Reduction’. The 4DN feature analyzer can also visualize the dynamics of genome structure and function using the 4DN phase plane (step 6 of Algorithm 1) [3,15]. We designate one axis of the 4DN phase plane as a measure of genome structure (e.g. eigenvector centrality) and the other as a measure of genome function (gene expression). Each point on the phase plane represents the structure and function of a genomic locus at a specific point in time (Figure 3B). The 4DN feature analyzer identifies genomic regions and genes with large changes in structure and function over time, and provides relevant links to the NCBI and GeneCards databases [8,9].
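One simple way to rank loci by how much they move on the phase plane is to sum the distances between consecutive time points. The text does not state which measure of change the toolbox uses, so the path-length criterion, the function names, and the coordinates below are all assumptions made for illustration.

```python
import math

def path_length(trajectory):
    """Total Euclidean distance travelled by one locus across consecutive time points."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))

def rank_loci(trajectories, top=2):
    """Return locus names sorted by decreasing phase-plane path length."""
    ranked = sorted(trajectories, key=lambda name: path_length(trajectories[name]), reverse=True)
    return ranked[:top]

# Hypothetical (structure, function) coordinates at three time points.
toy_trajectories = {
    "locus_1": [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1)],
    "locus_2": [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)],
    "locus_3": [(1.0, 1.0), (1.0, 2.0), (2.0, 2.0)],
}
top_loci = rank_loci(toy_trajectories)
```

Because the ranking uses raw coordinates, the structure and function axes should be on comparable scales (for example, after z-scoring) before distances are compared.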

Additional 4DNvestigator tools

General structure and function analysis

The 4DNvestigator also includes a suite of previously developed Hi-C and RNA-seq analysis methods. Euchromatin and heterochromatin compartments can be identified from Hi-C [4,16], and regions that change compartments between samples are automatically identified. Significant changes in gene expression between RNA-seq samples can be determined through differential expression analysis using established methods [17].

Network entropy

Entropy measures the amount of uncertainty within a system [18]. We use entropy to quantify the organization of chromatin structure from Hi-C data, where higher entropy corresponds to less structural organization. Since Hi-C is a multivariate analysis measurement (each contact coincidence involves two variables, the two genomic loci), we use multivariate entropy as follows:

$$\text{Entropy} = -\sum_{i} \lambda_i \ln \lambda_i, \tag{1}$$

where $\lambda_i$ represents the dominant features of the Hi-C contact matrix. In mathematics, these dominant features are called eigenvalues [19]. Biologically, genomic regions with high entropy likely correlate with high proportions of euchromatin, as euchromatin is more structurally permissive than heterochromatin [20,21]. Furthermore, entropy can be used to quantify stemness, since cells with high pluripotency are less defined in their chromatin structure [22]. We provide the full algorithm for network entropy and calculate the entropy of Hi-C data from multiple cell types in Supplementary Materials ‘Network Entropy’.
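The following sketch evaluates an eigenvalue-based entropy of the form −Σ λ ln λ for small symmetric matrices. It assumes the eigenvalues are normalized to sum to one and that the input is symmetric positive semidefinite (for example, a correlation matrix); the full algorithm is in the Supplementary Materials and may differ. Eigenvalues are obtained here with a plain Jacobi rotation routine, chosen only to keep the example free of dependencies.

```python
import math

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[q][q] == a[p][p]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]

def network_entropy(matrix):
    """Entropy of the eigenvalues after normalizing them to sum to one (assumption)."""
    eigenvalues = [max(v, 0.0) for v in symmetric_eigenvalues(matrix)]
    total = sum(eigenvalues)
    shares = [v / total for v in eigenvalues]
    return -sum(x * math.log(x) for x in shares if x > 1e-12)

# Hypothetical inputs: no dominant pattern, one dominant pattern, and an intermediate case.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
all_ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
toy_corr = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
entropies = [network_entropy(m) for m in (identity, all_ones, toy_corr)]
```

The identity matrix spreads its spectrum evenly and reaches the maximum entropy ln 3, whereas the all-ones matrix has a single nonzero eigenvalue and an entropy of zero, matching the interpretation that lower entropy means more organized structure.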

Tensor entropy

The notion of transcription factories supports the existence of simultaneous interactions involving three or more genomic loci [23]. This implies that the configuration of the human genome can be more accurately represented by $k$-uniform hypergraphs, a generalization of networks in which each edge can join exactly $k$ nodes (e.g. a standard network is a 2-uniform hypergraph). We can construct $k$-uniform hypergraphs from Hi-C contact matrices by computing the multi-correlations of genomic loci. Tensor entropy, an extension of network entropy, measures the uncertainty or disorganization of uniform hypergraphs [24]. Tensor entropy can be computed from the same entropy formula (1) with generalized singular values from tensor theory [24,25]. We provide the definitions for multi-correlation and generalized singular values, the algorithm to compute tensor entropy, and an application of tensor entropy on Hi-C data in Supplementary Materials ‘Tensor Entropy’.

Larntz-Perlman procedure

The 4DNvestigator includes a statistical test, proposed by Larntz and Perlman (the LP procedure), that compares correlation matrices [26,27]. The LP procedure is applied to correlation matrices from Hi-C data, and is able to determine whether multiple Hi-C samples are significantly different from one another. Suppose that $R^{(\ell)} \in \mathbb{R}^{n \times n}$ are the sample correlation matrices of Hi-C contacts with corresponding population correlation matrices $P^{(\ell)} \in \mathbb{R}^{n \times n}$ for $\ell = 1, 2, \dots, k$. The null hypothesis is $H_0: P^{(1)} = \cdots = P^{(k)}$. First, compute the Fisher z-transformation $Z^{(\ell)}$ by

$$Z^{(\ell)}_{ij} = \frac{1}{2} \ln\left(\frac{1 + R^{(\ell)}_{ij}}{1 - R^{(\ell)}_{ij}}\right).$$

Then, form the matrix $S$ such that

$$S_{ij} = (n - 3) \sum_{\ell=1}^{k} \left(Z^{(\ell)}_{ij} - \bar{Z}_{ij}\right)^2,$$

where $\bar{Z}_{ij} = \frac{1}{k} \sum_{\ell=1}^{k} Z^{(\ell)}_{ij}$. The test statistic is given by $T = \max_{ij} S_{ij}$, and $H_0$ is rejected at level $\alpha$ if $T > \chi^2_{k-1,\epsilon(\alpha)}$, where $\chi^2_{k-1,\epsilon(\alpha)}$ is the $\epsilon(\alpha)$ quantile of the chi-square distribution with $k-1$ degrees of freedom, and $\epsilon(\alpha) = (1-\alpha)^{2/(n(n-1))}$ is the Šidák correction. Finally, calculate the $p$-value at which $T > \chi^2_{k-1,\epsilon(p)}$. We note that this $p$-value is conservative, and that the actual $p$-value may be smaller depending upon the amount of correlation among the variables. The LP procedure determines the statistical significance of any differences between multiple Hi-C samples for a genomic region of interest. We provide benchmark results of the LP procedure with other Hi-C comparison methods in Supplementary Materials ‘LP Procedure for Comparing Hi-C Matrices’.
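The LP procedure can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the toolbox's code: `sample_size` is a hypothetical parameter for the number of observations behind each correlation (the Fisher z variance is 1/(sample size − 3)), the p-value is obtained by inverting the Šidák-corrected rejection rule as 1 − F(T)^(n(n−1)/2) with F the chi-square CDF, and the correlation matrices are made up. The chi-square CDF uses a simple series that is adequate for moderate values of the statistic.

```python
import math

def fisher_z(r):
    """Fisher z-transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-16:
        term *= half / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(-half + a * math.log(half) - math.lgamma(a)))

def lp_test(corr_matrices, sample_size):
    """Return (T, conservative p-value) for H0: all population correlation matrices are equal."""
    k, n = len(corr_matrices), len(corr_matrices[0])
    t_stat = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # off-diagonal entries only; the diagonal is always 1
            z = [fisher_z(r[i][j]) for r in corr_matrices]
            z_bar = sum(z) / k
            s_ij = (sample_size - 3) * sum((v - z_bar) ** 2 for v in z)
            t_stat = max(t_stat, s_ij)
    pairs = n * (n - 1) // 2
    p_value = 1.0 - chi2_cdf(t_stat, k - 1) ** pairs  # Sidak-adjusted over all pairs
    return t_stat, p_value

# Hypothetical correlation matrices for two samples.
r1 = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
r2 = [[1.0, -0.5, 0.1], [-0.5, 1.0, 0.2], [0.1, 0.2, 1.0]]
t_same, p_same = lp_test([r1, r1], sample_size=50)
t_diff, p_diff = lp_test([r1, r2], sample_size=50)
```

Identical matrices give a statistic of zero and a p-value of one, while the large change in the first correlation drives the statistic up and the p-value toward zero.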

Results

We demonstrate how the 4DN feature analyzer can process time series structure and function data (Figure 3A) with three examples (Figure 3B-D).

Example 1: Cellular Proliferation. Hi-C and RNA-seq data from B-lymphoblastoid cells (NA12878) capture the G1, S, and G2/M phases of the cell cycle for the maternal and paternal genomes [15]. We visualize the structure-function dynamics of the maternal and paternal alleles for nine cell cycle regulating genes using the 4DN phase plane (Figure 3B, left). We are interested in the importance of these genes within the genomic network through the cell cycle, so we use eigenvector centrality as the structural measure. This analysis highlights the coordination between the maternal and paternal alleles of these genes through the cell cycle.

Example 2: Cellular Differentiation. We constructed a structure-function feature matrix from time series Hi-C and RNA-seq data obtained from differentiating human stem cells [28]. These data consist of six time points which include human embryonic stem cells, mesodermal cells, cardiac mesodermal cells, cardiac progenitors, primitive cardiomyocytes, and ventricular cardiomyocytes [28]. We analyze Chromosome 4 across the six time points at 100 kb resolution by applying three dimension reduction techniques to the structure-function feature matrix: LE, UMAP, and t-SNE (Figure 3C, top). UMAP and t-SNE separate the cell types during differentiation better than LE. The optimal methods for visualization and analysis are data dependent, so the 4DNvestigator offers multiple tools for users to explore their own data.

Example 3: Cellular Reprogramming. Time series Hi-C and RNA-seq data were obtained from an experiment that reprogrammed human dermal fibroblasts to the skeletal muscle lineage [7]. We analyze samples collected 48 hr prior to, 8 hr after, and 80 hr after the addition of the transcription factor MYOD1. The ten 100 kb regions from Chromosome 11 that varied most in structure and function are visualized using the 4DN phase plane in Figure 3B (right). We also construct a structure-function feature matrix of Chromosome 11 at 100 kb resolution. As in the differentiation data analysis, we use LE, UMAP, and t-SNE to visualize the structure-function dynamics. These low dimensional projections show the separation of the three time points corresponding to before, during, and after cellular reprogramming (Figure 3C, bottom). We show an example output of the 4DN feature analyzer, which highlights genes contained in the genomic loci that have the largest structure-function changes through time and provides links to the NCBI and GeneCards database entries for these genes (Figure 3D) [8,9].

Discussion

The 4DNvestigator provides rigorous and automated analysis of Hi-C and RNA-seq time series data by drawing on network theory, information theory, and multivariate statistics. It also introduces a simple statistical method for comparing Hi-C matrices, the LP procedure. The LP procedure is distinct from established Hi-C matrix comparison methods, as it takes a statistical approach to test for matrix equality, and allows for the comparison of many matrices simultaneously. Thus, the 4DNvestigator provides a comprehensive toolbox that can be applied to time series Hi-C and RNA-seq data simultaneously or independently. These methods are important for producing rigorous quantitative results in 4DN research.

Algorithm 1: 4DN feature analyzer
Input: Hi-C matrices $A^{(m)} \in \mathbb{R}^{n \times n}$ and RNA-seq vectors $r^{(m)} \in \mathbb{R}^{n \times 1}$, $m = 1, \dots, T$
Output: Low dimensional space $Y^{(m)}$ and genes in loci with the largest structure-function changes
1 Compute degree, eigenvector, betweenness, and closeness centrality of $A^{(m)}$, and define as $b_{deg}^{(m)}$, $b_{eig}^{(m)}$, $b_{bet}^{(m)}$, $b_{close}^{(m)}$, respectively, where each $b^{(m)} \in \mathbb{R}^{n \times 1}$
2 Compute the first principal component (PC1) of $A^{(m)}$
3 Form the feature matrices $X^{(m)} = [b_{deg}^{(m)}, b_{eig}^{(m)}, b_{bet}^{(m)}, b_{close}^{(m)}, r^{(m)}]$, where $X^{(m)} \in \mathbb{R}^{n \times 5}$
4 Normalize the columns of $X^{(m)}$
5 Compute the common low dimensional space $Y^{(m)}$
6 Visualize the low dimensional projection $Y^{(m)}$ or 4DN phase plane
Return: $Y^{(m)}$ and genes in loci with the largest structure-function changes

15 in total

1. Statistical mechanics of pluripotency.

Authors: Ben D MacArthur; Ihor R Lemischka
Journal: Cell Date: 2013-08-01 Impact factor: 41.582

2. Functional organization of the human 4D Nucleome.

Authors: Haiming Chen; Jie Chen; Lindsey A Muir; Scott Ronquist; Walter Meixner; Mats Ljungman; Thomas Ried; Stephen Smale; Indika Rajapakse
Journal: Proc Natl Acad Sci U S A Date: 2015-06-15 Impact factor: 11.205

3. The 4D Nucleome.

Authors: Thomas Ried; Indika Rajapakse
Journal: Methods Date: 2017-07-01 Impact factor: 3.608

4. Spectral identification of topological domains.

Authors: Jie Chen; Alfred O Hero; Indika Rajapakse
Journal: Bioinformatics Date: 2016-05-05 Impact factor: 6.937

5. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses.

Authors: Gil Stelzer; Naomi Rosen; Inbar Plaschkes; Shahar Zimmerman; Michal Twik; Simon Fishilevich; Tsippi Iny Stein; Ron Nudel; Iris Lieder; Yaron Mazor; Sergey Kaplan; Dvir Dahary; David Warshawsky; Yaron Guan-Golan; Asher Kohn; Noa Rappaport; Marilyn Safran; Doron Lancet
Journal: Curr Protoc Bioinformatics Date: 2016-06-20

6. The 4D nucleome project.

Authors: Job Dekker; Andrew S Belmont; Mitchell Guttman; Victor O Leshyk; John T Lis; Stavros Lomvardas; Leonid A Mirny; Clodagh C O'Shea; Peter J Park; Bing Ren; Joan C Ritland Politz; Jay Shendure; Sheng Zhong
Journal: Nature Date: 2017-09-13 Impact factor: 49.962

7. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

8. What can systems theory of networks offer to biology?

Authors: Indika Rajapakse; Mark Groudine; Mehran Mesbahi
Journal: PLoS Comput Biol Date: 2012-06-28 Impact factor: 4.475

9. Transcriptionally active HERV-H retrotransposons demarcate topologically associating domains in human pluripotent stem cells.

Authors: Yanxiao Zhang; Ting Li; Sebastian Preissl; Maria Luisa Amaral; Jonathan D Grinstein; Elie N Farah; Eugin Destici; Yunjiang Qiu; Rong Hu; Ah Young Lee; Sora Chee; Kaiyue Ma; Zhen Ye; Quan Zhu; Hui Huang; Rongxin Fang; Leqian Yu; Juan Carlos Izpisua Belmonte; Jun Wu; Sylvia M Evans; Neil C Chi; Bing Ren
Journal: Nat Genet Date: 2019-08-19 Impact factor: 38.330

10. Genome Architecture Mediates Transcriptional Control of Human Myogenic Reprogramming.

Authors: Sijia Liu; Haiming Chen; Scott Ronquist; Laura Seaman; Nicholas Ceglia; Walter Meixner; Pin-Yu Chen; Gerald Higgins; Pierre Baldi; Steve Smale; Alfred Hero; Lindsey A Muir; Indika Rajapakse
Journal: iScience Date: 2018-08-08
