Shiqing Yu1, Mathias Drton2, Daniel E L Promislow3, Ali Shojaie4. 1. Department of Statistics, University of Washington, NE Stevens Way, Seattle, WA, 98195, USA. syu.phd@gmail.com. 2. Department of Mathematics, Technical University of Munich, Boltzmannstraße, 85748, Garching bei München, Germany. 3. Departments of Pathology and Biology, University of Washington, NE Pacific St, Seattle, WA, 98195, USA. 4. Department of Biostatistics, University of Washington, NE Pacific St, Seattle, WA, 98195, USA.
Abstract
BACKGROUND: Differential correlation networks are increasingly used to delineate changes in interactions among biomolecules. They characterize differences between omics networks under two different conditions, and can be used to delineate mechanisms of disease initiation and progression. RESULTS: We present a new R package, CorDiffViz, that facilitates the estimation and visualization of differential correlation networks using multiple correlation measures and inference methods. The software is implemented in R, HTML and Javascript, and is available at https://github.com/sqyu/CorDiffViz . Visualization has been tested for the Chrome and Firefox web browsers. A demo is available at https://diffcornet.github.io/CorDiffViz/demo.html . CONCLUSIONS: Our software offers considerable flexibility by allowing the user to interact with the visualization and choose from different estimation methods and visualizations. It also allows the user to easily toggle between correlation networks for samples under one condition and differential correlations between samples under two conditions. Moreover, the software facilitates integrative analysis of cross-correlation networks between two omics data sets.
BACKGROUND: Differential correlation networks are increasingly used to delineate changes in interactions among biomolecules. They characterize differences between omics networks under two different conditions, and can be used to delineate mechanisms of disease initiation and progression. RESULTS: We present a new R package, CorDiffViz, that facilitates the estimation and visualization of differential correlation networks using multiple correlation measures and inference methods. The software is implemented in R, HTML and Javascript, and is available at https://github.com/sqyu/CorDiffViz . Visualization has been tested for the Chrome and Firefox web browsers. A demo is available at https://diffcornet.github.io/CorDiffViz/demo.html . CONCLUSIONS: Our software offers considerable flexibility by allowing the user to interact with the visualization and choose from different estimation methods and visualizations. It also allows the user to easily toggle between correlation networks for samples under one condition and differential correlations between samples under two conditions. Moreover, the software facilitates integrative analysis of cross-correlation networks between two omics data sets.
Correlations between omics measurements are widely used to interrogate mechanisms of biological interactions. Differential correlation networks capture differences between omics correlations in two populations/conditions, e.g., cases and controls [1, 2]. They can thus be used to gain insight into aberrations in biological processes and mechanisms of disease initiation and progression [3]. They have also been instrumental in gaining insights into biological responses to environmental factors [4, 5] or functional consequences of mutations [6, 7]. This has led to the development of multiple methods for differential correlation analysis in recent years [8-14]; see [2, 15] for more comprehensive review. However, software tools for estimating and visualizing differential correlation networks have received less attention. Moreover, existing software either only focus on a single omics data type (commonly, mRNA expressions) and do not facilitate integrative analysis of multiple omics data [10, 13, 14, 16–19], or only provide static visualizations (e.g. heatmaps) [16, 17, 19–21].Our CorDiffViz package provides a simple tool for estimation and interactive visualization of correlation networks and their differences. It also facilitates omics data integration via unifying visualizations for single and differential cross-correlation networks among two omics data types. Differential cross-correlation networks have been examined recently [11, 12, 21]; however, this work only provides visualization for a single correlation at each time [21], or has no publicly available package or visualization tools [11, 12] .Correlations in multi-omics settings can be compared via a direct approach: concatenating the multiple omics measures into a single data set and using existing software tools to compare correlations in such concatenated data across two populations. However, by developing a tailored method for cross-omics correlation analysis, our software has at least two advantages over the direct approach. First, the direct approach requires estimating and testing many more correlations ( for two data sets and with and variables), whereas our approach processes only correlations. This not only saves a significant amount of computation but also narrows the focus of multiple testing adjustments to the correlations of interest, leading to a power gain for statistical tests. Second, in addition to the interactive visualization that is currently not widely available, our software offers tailored visualization for cross-omics correlation.Another benefit of CorDiffViz compared with existing tools is that, in addition to Pearson correlation, it implements rank-based correlation measures that are better suited for non-Gaussian observations commonly encountered in omics data. The package provides both parametric and permutation tests for these correlation types. Unlike existing software tools, the resulting p-values, together with the implemented adjustments for multiple testing and false discovery rate (FDR) control, provide formal inference for differential correlation/cross-correlation analysis by accounting for the uncertainty in differential correlation measures. These estimation and visualization capabilities are particularly designed for sparse (differential) correlation matrices, where most (changes in) correlations are zero or negligible. The user has access to interactive visualization of both single condition and differential correlation networks by just calling one simple function in R [22]; see Fig. 1.
Fig. 1
A screenshot of the function prototype of viz(), the main user-facing function in the CorDiffViz package
A screenshot of the function prototype of viz(), the main user-facing function in the CorDiffViz package
Implementation
Overview
In this section, we give an overview of the estimation methods implemented in the CorDiffViz package for differential correlation analysis. To simplify the user interface, all methods have been implemented in a single function, viz() in R. The full list of its options is displayed in Fig. 1. In what follows, we describe the estimation methods available through the function, along with their various options.Consider two sets of omics measurements (e.g., mRNA expression and protein abundances) with data matrices and (with units and and measurements, respectively) coming from one population and and from another (with units and and measurements, respectively). We estimate and visualize , and , where denotes the matrix of correlations between column vectors of and . It is worth noting that the software can also be used for analyzing one omics data type, e.g., mRNA expression levels, by simply excluding the second sets of data, and from the above expressions and focusing on, e.g. . The NULL default values for dat1Y and dat2Y in Fig. 1 corresponds to this simpler problem.We consider five measures of correlation: (i) Pearson’s product-moment correlation r, (ii) Kendall’s , (iii) Spearman’s , (iv) the -transformed , , and (v) the -transformed , . For continuous distributions obtained from arbitrary monotone transformations of the original data (known as Gaussian copulas), the transformed rank correlations from (iv) and (v) consistently estimate an underlying Pearson’s r [23, 24]. Under non-Gaussian models, these correlation types, along with the corresponding tests described below, provide more robust inference for differential correlations compared to the z-tests for Pearson correlations in [16].For each correlation measure, the user can choose from the following estimates for visualization: (a) the raw (differential) correlation matrices, (b) the matrices thresholded using parametric tests, and (c) the matrices thresholded using permutation tests. For (b), the limiting distribution of each sample correlation is used for z-tests that are further adjusted for multiple testing; entries in the matrices that are not statistically significant are set to 0. The user can choose the adjustment method through the adj_method argument (see Fig. 1) from those supported by p.adjust() in base R. Denoting by the standard normal distribution, and by the Student’s t-distribution with degrees of freedom, the limiting null distributions for our correlation estimates are determined as follows. For the differential correlations, we use limiting normal distributions that follow from (I)–(V) above, using the fact that for independent random variables U and V.Pearson’s correlation: under the Fisher transformation, [25, 26]Kendall’s : [26];Spearman’s : [26];: (from (II) with the delta method [27]);: (from (III) with the delta method [27]).For (c), we use permutation tests in which samples are randomly shuffled; the user can choose the number of permutations through argument B (see Fig. 1). A random number seed for this procedure can be specified by the argument perm_seed to ensure reproducibility.For either case, parametric or permutation tests, one can also choose to perform one-sided ( or ) or two-sided tests using the sides argument (Fig. 1). The choice between parametric and permutation tests is up to the user—we note that, permutation tests tend to be more robust to violated assumptions, while parametric tests provide higher statistical power when their assumptions are met.Finally, for Pearson’s correlation, the user can also choose an inference procedure for high-dimensional differential correlation networks adapted from the method proposed by [3]. Specifically, suppose we have samples and , corresponding to two omics data types (e.g., mRNA expression and protein abundances), respectively, for population and variables , . For random vectors and , let be the unadjusted sample covariance. Then the raw correlation between and is .Then the thresholded differential correlation between and for populations is defined as . The parameter in (1) is empirically chosen through cross-validation [3]. Since cross-validation involves random sampling, a corresponding seed can be specified through the argument Cai_seed (Fig. 1).
Procedure
When calling the main function viz() (Fig. 1) in R, the package automatically estimates the (differential) correlation matrices, and performs permutation and parametric tests as instructed by the user. The user may run the function multiple times (with different arguments) on multiple datasets by assigning a different name to each run; each run can be visualized by selecting it from a dropdown menu in viz.html, which is automatically generated by the package.The function outputs all raw data matrices, raw (differential) correlation matrices, and matrices that contain entry-wise p-values for the tests (with adjustments for multiple testing as selected by the user). These files store data using Javascript code and are for internal use by the HTML and Javascript files only. The user can then open viz.html to access the visualization.In addition, the function also outputs static heatmaps as well as static plots of the (differential) undirected graphs. The randomization in the choice of layout for the latter can be controlled by layout_seed (Fig. 1).
Visualization design
Interactive visualization is available through the automatically copied HTML file in the current working directory in Google Chrome or Mozilla Firefox browsers. The user first needs to select the dataset (the name of the run they wish to visualize) under the current directory from a dropdown menu. Two visualization modes are currently available: Correlation Plots and Interactive Networks. In both modes, one can toggle between correlation matrices/networks for either population or the differential correlation matrix/network by enabling the “One sample” and “Two sample” buttons, respectively; for the former, the user can choose which population to visualize. One can also choose from the five correlation measures discussed above. Instead of the default 5% significance level, the user can manually enter a desired level (before adjustments for multiple testing) for the tests. In addition, under both modes, a dropdown menu allows the user to choose which variables to include (as shown in Fig. 2). The red/blue color represents a negative/positive (differential) correlation, whose magnitude is indicated by the color saturation.
Fig. 2
A screenshot of the dropdown menu for selecting variables to include in the correlation plots
A screenshot of the dropdown menu for selecting variables to include in the correlation plots
Results and discussion
In this section we demonstrate CorDiffViz in two applications. In the first application, we perform differential correlation analysis in a single omics data set, the setting that has also been considered in some of the existing software, and compare the capabilities of CorDiffViz with the existing software. The second application demonstrates how CorDiffViz can be used for differential cross-correlation analysis among two omics data sets, a setting of increasing interest for which public estimation and visualization software tools are lacking. We end this section with additional comments about the broader applicability of the package.
Differential correlation analysis of single omics data
We illustrate our tool using a metabolomics dataset for mice with 100 metabolites from [28]. The data contains the metabolic profiles of 41 non-diabetic and 30 diabetic mice and has been recently analyzed in [29]. The names of the metabolites are compound IDs in the Kyoto Encyclopedia of Genes and Genomes (KEGG); for example, the hub node C00152 in Fig. 4 corresponds to L-Asparagine. In Figs. 3 and 4, we show screenshots of differential correlations using Kendall’s with permutation tests () and p-values adjusted using the FDR controlling procedure of [30]. The significance level is set to 0.05.
Fig. 4
A screenshot of the interactive network plots using Cytoscape.js for the metabolomics dataset. Here C00152 (L-Asparagine) is clicked on, with all its neighbors and edges highlighted, and all other edges hidden. Red/blue colors of the edges indicate negative/positive (differential) correlations, while color saturation suggests their magnitude
Fig. 3
A screenshot of the interactive correlation plots using D3.js for the metabolomics dataset. Red/blue colors indicate negative/positive (differential) correlations, while color saturation and size of the circles suggest their magnitude
A screenshot of the interactive correlation plots using D3.js for the metabolomics dataset. Red/blue colors indicate negative/positive (differential) correlations, while color saturation and size of the circles suggest their magnitudeA screenshot of the interactive network plots using Cytoscape.js for the metabolomics dataset. Here C00152 (L-Asparagine) is clicked on, with all its neighbors and edges highlighted, and all other edges hidden. Red/blue colors of the edges indicate negative/positive (differential) correlations, while color saturation suggests their magnitudeThe first visualization mode, Correlation Plots, implemented using D3.js and illustrated in Fig. 3, is a direct presentation of the raw or thresholded (differential) correlation matrices. The square/rectangle represents the matrix, with the entry in the j-th row and k-th column representing for one population , or their difference. As in the figures, the user can view the value of a specific cell in the matrix, its corresponding variable names, and the corresponding scatter plots (for one population or both depending on the selection).The second visualization mode, Interactive Networks, is implemented using Cytoscape.js and is illustrated in Fig. 4. Each node in the undirected correlation network represents a variable (feature), and an edge is present if the corresponding entry in the (differential) correlation matrix is statistically significant. Multiple network layouts are available. Each node is draggable with size positively related to the number of variables connected to it. The user can easily highlight an edge and hide all other edges, and read the (differential) correlation value and the two variables associated with it. It is further possible to highlight one node and all edges linked to it as well as the corresponding (differential) correlations, sorted in descending magnitude, as shown in Fig. 4.The differential correlation matrix heatmap from DGCA [19] in Fig. 5 serves a somewhat similar purpose as the correlation plots from our tool in Fig. 3. For consistency, this heatmap is also obtained using 1000 permutations. However, only static heatmaps are supported by DGCA. Moreover, in larger data sets, even with the 100 metabolites in our dataset, the heatmap can become more difficult to discern, as there are no easy options for selecting a subset of variables to visualize. Other existing tools have similar limitations—they either provide static visualization, or visualization for pairwise correlations only.
Fig. 5
Heatmap by DGCA for the metabolomics dataset
Heatmap by DGCA for the metabolomics dataset
Differential cross-correlation analysis of two omics data sets
We now illustrate how our tool can be used to analyze and visualize differential cross-correlation networks by applying it to joint protein abundances and expression levels (based on RNAseq) from the Cancer Genome Atlas (TCGA). We denote the protein abundances as variables and the RNAseq data as variables. We do not compare our results to DGCA in this dataset, as that package does not support multi-omics data analysis.For simplicity, we work with the subset of samples with no missing values for all variables, as well as prostate-specific antigen (PSA) levels available. This leaves 156 samples with 127 and 4749 variables. Since some expression levels have extreme variance while the others have many zeros, following [17], we pick the variables that have a coefficient of variation between 0.5 and 10, after which 2679 variables are left. We then transform the variables using .PSA is an established marker for prostate cancer. While PSA is a continuous measure, it is often dichotomized by practitioners in order to assess the risk of developing prostate cancer. Following this strategy, we split the 156 samples into 105 individuals with PSA levels 10 nanograms per milliliter—a common threshold used for identifying those with highest risk of prostate cancer—and the remaining individuals 51with PSA levels > 10 who are at higher risk of prostate cancer.To reduce the number of variables for better illustration, we focus on proteins and expression levels that are more clearly associated with prostate cancer. To this end, we use Wilcoxon signed-rank test as a screening method, where for each variable we calculate the p-value associated with the hypothesis test that the two samples have equal mean. As a simple illustration, we pick variables that have the highest p-values and no 0’s from both and . We visualize the results using Spearman’s with p-values adjusted to control the FDR [31] at significance level 0.2.The correlation plots are shown in Fig. 6. In this two-omics case, the variables are on the vertical axis, and are on the horizontal axis. Thus, the entry in the j-th row and k-th column now represents for one population , or their difference. Interactive network visualization for the same analysis is shown in Fig. 7. In the two-omics case, nodes in the and groups are colored in orange and green, respectively.
Fig. 6
A screenshot of the interactive correlation plots using D3.js for the TCGA dataset
Fig. 7
A screenshot of the interactive network plots using Cytoscape.js for the TCGA dataset
A screenshot of the interactive correlation plots using D3.js for the TCGA datasetA screenshot of the interactive network plots using Cytoscape.js for the TCGA dataset
Usage
The CorDiffViz package can be used to visualize differential (cross-)correlation networks across various omics data, both for exploratory analysis as well as formal inference. Differential network analysis can be applied to, for example, gene regulatory interaction networks—to analyze the mechanistic changes resulting from responses to changed environmental conditions—or to metabolic interactions—to study the cellular processes that are differentially important [1].The p-values stored in the data files represent multiple testing-adjusted p-values for the parametric and permutation tests. These p-values can be used for direct analysis, or through the visualization interface, where the user can change the default significance threshold. The interactive visualization also offers more insights into detailed changes in the networks. These features provide important insight into altered biological mechanisms, beyond what would be obtained by simply examining differential correlation heatmaps. In fact, our visualization was recently used to interrogate changes in metabolomic interaction mechanisms in Drosophila under two different diets [32], leading to new biological discoveries.
Conclusions
We have developed an integrated R package for estimation and interactive visualization of (differential) correlation matrices/networks for two populations. The package is designed so that by calling a single R function and specifying some parameters for estimation, the estimates will be automatically saved to the local directory. Users then have access to interactive visualization by simply opening an HTML file in the browser. The package is intended to provide convenient tools for interpreting (differential) correlation networks for multi-omics data.
Availability and requirements
Project name: CorDiffVizProject home page: https://github.com/sqyu/CorDiffVizOperating system(s): Platform independentProgramming language: R, HTML and JavascriptOther requirements: Google Chrome or FirefoxLicense: GPL-3Any restrictions to use by non-academics: None.
Authors: Kelly Jin; Kenneth A Wilson; Jennifer N Beck; Christopher S Nelson; George W Brownridge; Benjamin R Harrison; Danijel Djukovic; Daniel Raftery; Rachel B Brem; Shiqing Yu; Mathias Drton; Ali Shojaie; Pankaj Kapahi; Daniel Promislow Journal: PLoS Genet Date: 2020-07-09 Impact factor: 5.917
Authors: Zheng-Zheng Tang; Guanhua Chen; Qilin Hong; Shi Huang; Holly M Smith; Rachana D Shah; Matthew Scholz; Jane F Ferguson Journal: Front Genet Date: 2019-05-17 Impact factor: 4.599
Authors: Duo Jiang; Courtney R Armour; Chenxiao Hu; Meng Mei; Chuan Tian; Thomas J Sharpton; Yuan Jiang Journal: Front Genet Date: 2019-11-08 Impact factor: 4.599