Literature DB >> 35727808

A nonlinear correlation measure with applications to gene expression data.

Yogesh M Tripathi^1,2, Suneel Babu Chatla^3,4, Yuan-Chin I Chang¹, Li-Shan Huang³, Grace S Shieh^1,5,6,7.

Abstract

Nonlinear correlation exists in many types of biomedical data. Several types of pairwise gene expression in humans and other organisms show nonlinear correlation across time, e.g., genes involved in human T helper (Th17) cells differentiation, which motivated this study. The proposed procedure, called Kernelized correlation (Kc), first transforms nonlinear data on the plane via a function (kernel, usually nonlinear) to a high-dimensional (Hilbert) space. Next, we plug the transformed data into a classical correlation coefficient, e.g., Pearson's correlation coefficient (r), to yield a nonlinear correlation measure. The algorithm to compute Kc is developed and the R code is provided online. In three simulated nonlinear cases, when noise in data is moderate, Kc with the RBF kernel (Kc-RBF) outperforms Pearson's r and the well-known distance correlation (dCor). However, when noise in data is low, Pearson's r and dCor perform slightly better than (equivalently to) Kc-RBF in Case 1 and 3 (in Case 2); Kendall's tau performs worse than the aforementioned measures in all cases. In Application 1 to discover genes involved in the early Th17 cell differentiation, Kc is shown to detect the nonlinear correlations of four genes with IL17A (a known marker gene), while dCor detects nonlinear correlations of two pairs, and DESeq fails in all these pairs. Next, Kc outperforms Pearson's and dCor, in estimating the nonlinear correlation of negatively correlated gene pairs in yeast cell cycle regulation. In conclusion, Kc is a simple and competent procedure to measure pairwise nonlinear correlations.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35727808 PMCID： PMC9212159 DOI： 10.1371/journal.pone.0270270

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.752

Introduction

Introduced in the 1990s, a microarray slide can simultaneously detect the expression of thousands of genes from a sample (e.g., from a tissue) under investigation, and cDNA microarray technology has become widely used following the seminal paper [1]. There are two types of experiments, namely static and temporal, in studies using microarrays [2]. In the former experiments, microarrays capture only a single moment of gene expression, whereas, in a temporal experiment, the arrays are collected over a time course which allows the dynamic behavior studied. Because the regulation of gene expression is a dynamic process, it is important to identify changes in gene expression, as well as identify correlated genes and correlated mRNA and protein over time [2-4]. Thus, temporal experiments are commonly carried out in biological sciences [2], and there is extensive statistical literature on time course data analysis [5]. Exploring nonlinearity in biomedical data is gaining popularity in biomedical research methodologies [6]. Several types of pairwise gene expression in humans and other organisms show a nonlinear correlation across time, e.g., expression of paired genes involved in the early human T helper (Th17) cell differentiation, show nonlinear correlation across time (Fig 4 of [7]; Fig 7 of [8]), a phenomenon which motivated this study. Th17 cells have been demonstrated to play an important role in auto-immune disease and inflammation in humans. Moreover, time-course expression of genes involved in the yeast and human cell cycle are also nonlinear (see Fig 1A for details; [9, 10]). However, classical correlation coefficients such as the Pearson’s correlation coefficient (Pearson’s r) and the Spearman’s rank correlation measure only the linear relationship between two variables (e.g., genes in this study). Fig 7 in [8] led us to hypothesize other novel Th17-specific genes can be identified by having the top-ranked absolute nonlinear correlations (across time) with IL17A. Similarly, novel cell cycle genes may be inferred through the top-ranked nonlinear correlations with a known marker gene, e.g., HST3 in Fig 1A. This motivated us to develop a measure for the nonlinear correlation between two variables.

Fig 1

A. Original time-course expression of the gene pair RAD51-HST3, where the red cross (green circle) denotes expression levels of RAD51 (HST3). B. Kernelized (using polynomial kernel of degree 2) expression of the gene pair RAD51-HST3. Székely and colleagues proposed distance correlation (dCor) as a measure of dependence and test for independence between two random vectors [11]. Dcor can be easily implemented in arbitrary dimensions and is widely cited. However, dCor is based on distance covariance, thus it ranges between 0 and 1, i.e., it can not be negative. Chen and colleagues [12] introduced a nonparametric test to detect nonlinear correlations of time-course gene expression data (GED). The maximal local correlation metric was shown to detect the nonlinear association of five time-point GED between rd mice and age-matched wild-type controls, while other correlation methods such as Pearson correlation could not. Later on, an empirical copula-based statistic (CoS) was developed to assess the strength of and test for independence between two random variables [13]. Here, we propose a procedure to measure the nonlinear correlation between two variables across time, using time-course microarray and RNA-seq gene expression data. To the best of our knowledge, this nonlinear measure is the first one that can quantify a negative correlation (a complementary pattern) of two variables by a negative value, e.g., HST3-RAD51 in Fig 1A. This procedure first transforms nonlinear data on the plane (say, in R) via a function (called a kernel, usually nonlinear) to a high-dimensional (Hilbert) space (say, , [14, 15]). A chosen kernel implicitly defines a transformation that conducts a “nonlinear transformation” of the original observation to . Next, we plug the transformed data into a classical correlation coefficient, e.g., Pearson’s r. Because this is a nonlinear transformation, the correlation computed of will be the nonlinear correlation of the original observation ; more details can be found in the Materials and Methods section. This nonlinear correlation coefficient, called kernelized correlation (denoted by K), can capture the nonlinearity across time between two variables or nonlinear relationships of any two pairwise variables in general. As indicated in [13], nonlinear correlations are prevalent in many applications, but not many have been developed. If any approach is developed, there will be many applications in data exploration, variable selection, and others. Thus, K can also be applied to data exploration and variable selection. For example, of the human genome, genes with the top-ranked absolute K values with IL17A, a known marker gene in early Th17 cell differentiation, may be of interest to biologists. K is applicable to detect the correlation of any two genes over time within a single biological group, such as the pairwise correlations of ~800 cell cycle genes in yeast [9]. Furthermore, K can also detect the correlation of one gene’s expression over time between two groups. For example, pairwise correlation of each gene’s expression in the mouse genome, profiled from age-matched wild type controls and rd mice with rod degeneration at five postnatal time points (6, 10, 14, 17, and 21 days of age) [11]. Although examples of short time-course experiments are demonstrated for K, as K plugs transformed data into any correlation coefficient, it can be applied to long series of data in the same way as any classical correlation coefficient. As an important application, the proposed K has been demonstrated to uncover known and novel genes involved in the early differentiation of Th17 cells, by revealing genes having a nonlinear correlation with a known marker gene such as IL17A, whose expression is commonly used to assess the Th17 polarization efficiency ([8, 16, 17]). These uncovered genes are likely to be involved with the early Th17 cell differentiation. Thus, K is a simple but efficient way to identify genes associated with early Th17 cell differentiation. Aijo and colleagues generated RNA-seq data to measure gene expression during early human T helper 17 (Th17) cell differentiation and T-cell activation (Th0) [8]. They let RNA-seq (count data) of a gene assume a negative binomial distribution with the mean following a Gaussian process at the ith time point of the jth replicate. Further, they employed an MCMC method to identify differential expression dynamics between Th17 and Th0; some selected identified differentially expressed genes were verified by qRT-PCR. The method was called DyNB, which extended maSigPro-GLM [18]. maSigPro-GLM is a package, that takes temporal dimension and correlation of RNA-seq time series into account, to detect differential expression of time-course RNA-seq data with or without replicates. Furthermore, we applied K to reveal the nonlinear correlation between pairs of cell cycle genes in yeast [9]. Proper regulation of the cell cycle is crucial to the growth and development of all organisms, and understanding this regulation is central to the study of many diseases. As shown in Fig 1A, the gene pair RAD51-HST3 in yeast has a complementary pattern which suggests that their correlation is negative. The nonlinear correlation of RAD51-HST3 is also supported by the PARE score -18.6, which is mainly based on the area enclosed by the two curves (Section 2.3, [19]). Although Pearson’s r for RAD51- HST3 is -0.50, it is not significant (P = 0.203); see S2 Dataset for details). Thus, it does not reflect the negative nonlinear correlation between RAD51 and HST3, as Pearson’s r only measures linear association between the two genes (variables). Although we demonstrate K using time-course gene expression data, it can be applied to any two pairwise variables in general and other types of multi-dimensional data, e.g., proteomics and metabolomics data. In the Materials and Methods section, sources of data for applications are stated, the kernelized correlation is introduced and the algorithm provided. Furthermore, a code of K to compute [n(n-1)]/2 pairwise correlations in a set of data consisting of n variables, where n could be 20,000 genes in a microarray, can be downloaded at http://staff.stat.sinica.edu.tw/gshieh/KC/Kc.html. In the Result section, K is applied to measure nonlinear relationships between two experimental units (genes in the case of the motivating applications), using the time-course expression of paired genes. First, we compare K to Pearson’s correlation coefficient via a simulation study. Next, we show that K reveals known and potential genes involved in the differentiation of Th17 cells in Application 1. As biological verifications are available in [8], we compare K to dCor [11], in addition to DyNB [8] and a time point-wise analysis DESeq [20]. Furthermore, Pearson’s correlation, dCor, and K are applied to measure nonlinear relationships of gene pairs involved in the cell cycle of yeast [9] in Application 2, and their performances are compared. We close with some discussions and future directions.

Materials and methods

Data

In the Application section, we used publicly available data. Specifically, the time-course expression levels (RNA-seq; in normalized read counts) of eight genes involved in the differentiation of human Th17 cells, profiled in Th0 and Th17 cells were downloaded from [8]. Moreover, time-course microarray gene expression of six gene pairs was downloaded from [9]. Both datasets are available in the S1 and S2 Datasets, respectively.

Transforming nonlinear data to a high-dimensional space via a kernel

Kernelization is known to help reveal nonlinear structures of data [15], provided that there is sufficient data. Here, we propose mapping the original nonlinear data in Euclidean space R (e.g., the data in Fig 1) to a high-dimensional Hilbert space, then computing a linear correlation, e.g., Pearson’s r, based on the kernelized data to result in the nonlinear correlation. Suppose that X and Y are two random variables, and x,…,x (y,…,y) are T observations of X(Y). Let and denote two T×1 vectors. For example, T is the total number of time points in experiments, and T = 5 for immune genes [8]. Note that this kernelization is not limited to time-course data, it can be applied to any pairwise variables. In the following, we introduce the polynomial and Gaussian (or Radial Basis Function (RBF)) kernels, which are commonly used to transform the data into the high-dimensional Hilbert space.

Definition

A polynomial kernel of degree d (≥ 2) is defined as , where the inner product between and is defined as . A Gaussian (RBF) kernel is defined as where the parameter γ (> 0) is known as the inverse kernel width and the norm of is defined as . The following steps are executed to transform a given data set to a nonlinear high-dimensional space and compute the proposed kernelized correlation. Standardize the original gene expression data across time points of each variable (gene here). Choose a kernel, e.g., Gaussian kernel. Let and calculate the kernel matrix K, where and K is a T×T symmetric and strictly positive definite matrix. Center the above kernel matrix K by the following formula and denote the modified matrix by K, where I is a T×T dentity matrix and 1 is a vector of one’s in R. Calculate and to result in the kernelized data corresponding to the original data. Next, we plug the associated elements of and into a correlation coefficient such as Pearson’s r, to produce the proposed kernelized correlation coefficient. Note that K is nonlinear, thus the standardization of variables in Step 1 is essential, where standardization means the z-score of the original variable. Moreover, the values of kernelized data depend on the choice of the kernel and its associated parameter which is trained by cross-validation. Fig 1B is the kernelized gene expression of (RAD51, HST3) using the polynomial kernel of degree 2, which depicts the nonlinearity of the original data well. As an illustration, we use gene expression of the three-time points, t = 2, 3, and 4 of the gene pair RAD51-HST3 in [9], to compute the corresponding K and K in the following. We also obtain K = -1 for RAD51-HST3; the expression of RAD51 and HST3 is depicted in Fig 1A.

Statistical analyses

One sample t-test was used to test for the significance of all estimated correlation coefficients. All statistical tests were one-sided except where otherwise specified, and all analyses were conducted in the R software [21].

Results

Simulation study

Spellman and colleagues built a comprehensive catalog of yeast genes involved in the cell cycle [9], i.e., their mRNA levels vary periodically within the cell cycle. Specifically, the profiled mRNA levels of yeast genes using two-color cDNA microarray, using synchronized cell cultures, e.g., synchronized by α factor arrest. In the alpha set of microarray data, they profiled gene expression for two cell cycles at 18-time points, where time t = 0,7,14,…119 min. They identified the 800 cell cycle regulation genes by their correlations with genes known to be regulated by the cell cycle. As indicated in [9], these genes were regulated in a periodic way coincident with the cell cycle, for the adequate functioning of mechanisms that maintain order during cell division. In Fig 3(A) of [9], genes peaked at the G1 phase of the cell cycle and regulated similarly to the G1 cyclin CLN2 were clustered. In particular, the similar-patterned RNR1-SWE1 gene pair was studied by a time-lagged correlation and machine learning approach (Fig 1(a) of [19]). We mimicked yeast cell-cycle genes in the alpha data set [9], e.g., RNR1-SWE1, to generate the expression of two genes across time. We generated pairwise gene expression in three cases representing different nonlinear relationships and compared the performances of six correlation measures. The six correlation measures studied are K with the polynomial kernel of degree 2 and degree 3 (denoted by K-poly2 and K-poly3, respectively) and the RBF kernel (K-RBF), Pearson’s correlation (r), Kendall’s correlation (τ) and distance correlation (dCor). Case 1 is composed of the time-course expression of two genes that follow the same sine function except with a π/6 time difference. In Case 2, the expression of two genes follows the same cosine function, but the magnitudes of gene 1 (G) are twice of gene 2 (G)’s. Identical time-course expression of two genes is generated in Case 3, except that G has a location shift of 3 in the y-axis above G. Specifically, time-course expression of paired genes with 18-time points was generated with 100 replications in each of the three cases. We mimicked yeast cell cycle genes in the alpha data set [9] to generate gene expression 7 minutes apart, i.e., T = 0, 7, …, 119, using Eqs (1)–(6) in the following section. The error terms ε and ε were independent and generated from i.i.d. N(0,1), i = 1, …, 18, and c = 0.5, 1, and 2 which provided small, medium to large random errors to mimic noises in real data. When a = 0 in Eqs (1)–(6), G and G are independent, by which we estimated the false positive rate (FPR) of the five correlation measures using 100 repeat simulations. The FPR is interpreted as the Type I error. When a = 1 in Eqs (1)–(6), we simulated 100 replicates to estimate the true positive rates (TPRs; statistical power) and summarize the mean (the estimate for nonlinear correlation) and p-value for each correlation measure at the bottom of the result tables. We set the significance level at α = 0.05, and the significance for each measure in each replicate is determined after a Bonferroni correction (< 5⋅10-4). The following Case 1 shows that G potentially regulates G, so G expresses behind G.

Case 1

The true time-course expression of G and G (when c = 0.0) and a set of simulated data (when c = 0.5, 1.0, and 2.0) are plotted in Fig 2, in which the two dotted sine curves have a time difference of π/6.

Fig 2

The simulated expression of gene G and G generated by Eqs (1) and (2) with various values of c, where˙(▲) denotes the values of G (G) and i = 0, 1, …, 17.

As shown in Table 1, when a = 0 and noise in data increases from low, medium to high (c = 0.5, 1.0 to 2.0), the FPR of K-poly2 and K-poly3 are high (24-28% and 30-52%, respectively), while the FPR of Pearson’s, Kendall’s and distance correlation remain 0.0. When a = 1 and noise is low (c = 0.5), K-RBF, Pearson’s correlation, and dCor have high TPRs (99%, 92%, and 92%, respectively), while Kendall’s correlation has only 4%; K-RBF has a moderate FPR (18%). The advantage of K-RBF is clearer when noise is increased to moderate (c = 1.0), its TPR is much higher (63%) than that of Pearson’s correlation (21%), Kendall’s correlation (0%) and dCor (33%), while its FPR (1%) is equivalent to that of Pearson’s and Kendall’s correlation and dCor (0%). Further, Fig 2 shows that when c = 1.0, G and G are nonlinearly correlated, K-RBF (0.71) is significant, but dCor, Pearson’s and Kendall’s correlation are not (their p-values are not < 5⋅10-4; Table 1).

Table 1

Comparison of the six correlation measures on the nonlinear correlation between genes G and G, generated by Eqs (1) and (2) with 100 replicates.

c =	0.0	0.5	1.0	2.0
a = 0	False positive rates
K_c-poly2^a	--	0.28	0.24	0.27
K_c-poly3	--	0.30	0.45	0.52
K_c-RBF (γ = 0.5)	--	0.18	0.01	0.00
Pearson’s corr.^b	--	0.00	0.00	0.00
Kendall’s corr.	--	0.00	0.00	0.00
Distance corr.	--	0.00	0.00	0.00
a = 1	True Positive rates
K_c-poly2	--	1.00	0.88	0.47
K_c-poly3	--	1.00	0.96	0.69
K_c-RBF (γ = 0.5)	--	0.99	0.63	0.08
Pearson’s corr.	--	0.92	0.21	0.02
Kendall’s corr.	--	0.04	0.00	0.00
Distance corr.	--	0.92	0.33	0.04
	Correlation measurement (p-values)
K_c-poly2	1.00 (< ϵ^a)	0.99 (< ϵ)	0.88 (< ϵ)	0.47 (0.02)
K_c-poly3	1.00 (< ϵ)	1.00 (< ϵ)	0.96 (< ϵ)	0.67 (0.001)
K_c-RBF (γ = 0.5)	0.97 (< ϵ)	0.92 (< ϵ)	0.71 (< ϵ)	0.32 (0.10)
Pearson’s corr.	0.86 (< ϵ)	0.78 (< ϵ)	0.61 (0.004)	0.32 (0.10)
Kendall’s corr.	0.65 (0.002)	0.57 (0.007)	0.43 (0.04)	0.20 (0.21)
Distance corr.	0.84 (< ϵ)	0.79 (< ϵ)	0.65 (0.002)	0.47 (0.02)

aK-poly2, K-poly3, and K-RBF denote K with the polynomial kernel of degree 2 and 3, and the RBF kernel, respectively.

bThe notation corr. and ϵ denote correlation coefficient and 5⋅10-4, respectively.

aK-poly2, K-poly3, and K-RBF denote K with the polynomial kernel of degree 2 and 3, and the RBF kernel, respectively. bThe notation corr. and ϵ denote correlation coefficient and 5⋅10-4, respectively. The results show that K-RBF outperforms Pearson’s and Kendall’s correlation on the estimation of the nonlinear correlation in data generated with low to medium noise in Case 1. When noise is low, Pearson’s correlation and dCor performs slightly better than K-RBF, K-RBF outperforms Pearson’s correlation and dCor when noise is moderate. However, K with polynomial kernels have high FPRs which may result in overestimation, and Kendall’s correlation coefficients are small and insignificant, thus these measures are not suitable for estimation of nonlinear correlation.

Case 2

The true time course expression of G and G (generated by Eqs (3) and (4) with c = 0.0), and a set of simulated data (generated by Eqs (3) and (4) with c = 0.5, 1.0 and 2.0) are plotted in Fig 3, in which the dotted curves of the expression generated by c = 0.0 are two identical cosine curves except that the range of G is twice that of G.

Fig 3

The simulated expression of gene G and G generated by Eqs (3) and (4) with various values of c, where˙(▲) denotes the values of G (G) and i = 0, 1, …, 17.

The results are similar to those of Case 1 and are summarized in Table 2. Taking both PR and FPR into account, when noise is low, K-RBF performs similarly to Pearson’s correlation and dCor on the estimation of the nonlinear correlation in Case 2. But K-RBF has better performance than Pearson’s correlation and dCor when noise is moderate. Similar to Case 1, K with polynomial kernels and Kendall’s correlation coefficient are not suitable estimators, due to high FPRs and small and insignificant coefficients, respectively.

Table 2

Comparison of the six correlation measures on the nonlinear correlation between G and G, generated by Eqs (3) and (4) with 100 replicates.

c =	0.0	0.5	1.0	2.0
a = 0	False positive rates
K_c-poly2^a	--	0.30	0.29	0.34
K_c-poly3	--	0.34	0.49	0.57
K_c-RBF (γ = 0.5)	--	0.14	0.06	0.01
Pearson’s corr.^b	--	0.00	0.00	0.00
Kendall’s corr.	--	0.00	0.00	0.00
Distance corr.	--	0.00	0.00	0.00
a = 1	True Positive rates
K_c-poly2	--	1.00	0.77	0.36
K_c-poly3	--	1.00	0.95	0.61
K_c-RBF (γ = 0.5)	--	0.98	0.30	0.00
Pearson’s corr.	--	0.82	0.08	0.00
Kendall’s corr.	--	0.07	0.00	0.00
Distance corr.	--	0.85	0.08	0.00
	Correlation measurement (p-values)
K_c-poly2	1.00 (< ϵ ^a)	0.99 (< ϵ)	0.81 (< ϵ)	0.38 (0.06)
K_c-poly3	1.00 (< ϵ)	1.00 (< ϵ)	0.88 (< ϵ)	0.49 (0.02)
K_c-RBF (γ = 0.5)	1.00 (< ϵ)	0.92 (< ϵ)	0.55 (0.01)	0.18 (0.24)
Pearson’s corr.	1.00 (< ϵ)	0.77 (< ϵ)	0.47 (0.03)	0.20 (0.21)
Kendall’s corr.	1.00 (< ϵ)	0.57 (0.01)	0.33 (0.09)	0.13 (0.30)
Distance corr.	1.00 (< ϵ)	0.79 (< ϵ)	0.55 (0.01)	0.41 (0.05)

aK-poly2, K-poly3, and K-RBF denote K with the polynomial kernel of degree 2 and 3, and the RBF kernel, respectively.

bThe notation corr. and ϵ denote correlation coefficient and 5⋅10-4, respectively.

Case 3

The true time course expression of gene G and G (generated by Eqs (5) and (6) with c = 0) and a set of simulated data (generated by Eqs (5) and (6) with c = 0.5, 1.0 and 2.0, respectively) are plotted in Fig 4. When c = 0, these two dotted expression curves differed only by a location shift of 3 in the y-axis, and as c assumed larger values, two less identical curves were generated. The results are similar to those of Case 1 and 2 and are summarized in Table 3.

Fig 4

The simulated expression of gene G and G generated by Eqs (5) and (6) with various values of c, where˙(▲) denotes the values of G (G) and i = 0, 1, …, 17.

Table 3

Comparison of the six correlation measures on the nonlinear correlation between genes G and G, generated by Eqs (5) and (6) with 100 replicates.

c =	0.0	0.5	1.0	2.0
a = 0	False positive rates
K_c-poly2^a	--	0.20	0.19	0.23
K_c-poly3	--	0.26	0.40	0.49
K_c-RBF (γ = 0.5)	--	0.11	0.07	0.01
Pearson’s corr.^b	--	0.00	0.00	0.00
Kendall’s corr.	--	0.00	0.00	0.00
Distance corr.	--	0.00	0.00	0.00
a = 1	True Positive rates
K_c-poly2	--	1.00	0.94	0.42
K_c-poly3	--	1.00	0.99	0.75
K_c-RBF (γ = 0.5)	--	1.00	0.76	0.10
Pearson’s corr.	--	1.00	0.39	0.02
Kendall’s corr.	--	0.49	0.02	0.00
Distance corr.	--	1.00	0.54	0.05
	Correlation measurement (p-values)
K_c-poly2	1.00 (< ϵ ^a)	1.00 (< ϵ)	0.92 (< ϵ)	0.48 (0.02)
K_c-poly3	1.00 (< ϵ)	1.00 (< ϵ)	0.98 (< ϵ)	0.72 (< ϵ)
K_c-RBF (γ = 0.5)	1.00 (< ϵ)	0.97 (< ϵ)	0.78 (< ϵ)	0.37 (0.07)
Pearson’s corr.	1.00 (< ϵ)	0.90 (< ϵ)	0.68 (0.001)	0.35 (0.08)
Kendall’s corr.	1.00 (< ϵ)	0.70 (0.001)	0.49 (0.02)	0.23 (0.18)
Distance corr.	1.00 (< ϵ)	0.90 (< ϵ)	0.71 (< ϵ)	0.47 (0.02)

aK-poly2, K-poly3, and K-RBF denote K with the polynomial kernel of degree 2 and 3, and the RBF kernel, respectively.

bThe notation corr. and ϵ denote correlation coefficient and 5⋅10-4, respectively.

aK-poly2, K-poly3, and K-RBF denote K with the polynomial kernel of degree 2 and 3, and the RBF kernel, respectively. bThe notation corr. and ϵ denote correlation coefficient and 5⋅10-4, respectively. When the noise in data is low, Pearson’s correlation and dCor perform slightly better than K-RBF (with smaller FPRs). However, when noise is moderate, K-RBF outperforms Pearson’s correlation and dCor on the estimation of the nonlinear correlation in Case 3. And similar to Case 1 and 2, K with polynomial kernels and Kendall’s correlation are not suitable estimators, due to high FPRs and small and insignificant coefficients, respectively, when noise is low to moderate in data.

Applications

In this section, we first applied the proposed kernelization correlation procedure to identify genes whose expression patterns across time were similar or complementary to IL17A, which played an important role in the early differentiation of Th17 cells in humans. Through this procedure, the genes significantly correlated to IL17A were inferred to be involved in the early differentiation of Th17 cells. This is a simple but efficient way to identify genes associated with the differentiation of Th17 cells. Next, we applied K to estimate the nonlinear correlation of cell cycle genes in yeast.

Application 1

The expression of IL17A is commonly used to assess the Th17 polarization efficiency [16]. Previously, transcriptional factors Rorc and Stat3 were revealed to be the key regulators at the early stage of Th17 differentiation in murine and humans [22]. Aijo and colleagues profiled gene expression in the early phase of Th17 differentiation to understand the process of differentiation and how the differentiation signal propagates through various pathways. The knowledge they gained is very useful for uncovering markers of differentiation of Th17 cell populations. In this application, we applied K to the RNA-seq data of the early phase of Th17 cell differentiation and T-cell activation (Th0) cells [8]. There are three replicates for each gene profiled from Th0 and Th17 cells at 0, 12, 24, 48 and 72 h, in which the differences in differentiation efficiency among the replicates have been adjusted; see Section 3.2 [8]. The time-course expression of IL17A and RORC (ISG20, RAB3, and TIAM1) in Fig 1 (Fig 7) of [8], respectively, show that IL17A has a positive correlation with RORC, ISG20, and RAB3, but it has a negative correlation with TIAM1. These correlations are consistent with the expression (in normalized read counts) of these genes in Th17 cells. Moreover, the CoS test was applied (with 1,000 repeats) to test for the nonlinearity of the four IL17A gene pairs, which were all significant with P equal to 0.014 and 0.000 (rounded to the third digit) for the latter three pairs, respectively. This result justifies that the nonlinear correlations of these pairs exist. We applied K-RBF (with the default γ value) to estimate the nonlinear correlation of similar-patterned (complementary-patterned) gene pairs, e.g., IL17A-ISG20 (IL17A-TIAM1) profiled in Th17 cells [8]. Because the normalized data of these genes in replicate 1 differed much from those of replicate 2 and 3, we set replicate 1 data aside and computed K using replicate 2 and 3 data. For genes whose expression of Th0 cells was highly similar to that of Th17 cells, they were not involved in the differentiation of Th17 cells. Thus, to exclude genes irrelevant to immune differentiation, we first computed K of MAP1B, RORC, KIF11, IGS20, RAB3 and TIAM1 with themselves in Th17 cells and Th0 cells, e.g., correlation of the expression of MAP1B in Th17 and the expression of MAP1B in Th0 cells. Since self-correlation is similar-patterned, we used K-RBF (γ = 0.5), and obtained an averaged self-correlation of MAP1B (KIF11) equal to 0.998 (0.996) with P < 0.001, using normalized read counts (via DESeq). This result is consistent with the expression of MAP1B (KIF11) in Th0 cells being highly similar to that in Th17 cells (S4 Fig, [8]). Consequently, MAP1B and KIF11 are likely to be false positives (not involved in immune differentiation) and should be excluded. Next, we applied K-RBF to compute the correlation of IL17A with each of TIAM1, ISG20, RAB3 and RORC in Th17 cells. For similar-patterned gene pairs, the default value of γ (0.5) was used. For negatively correlated gene pairs such as IL17A-TIAM1, we first used replicate 2 (replicate 3) gene expression to train γ, then applied the trained K to estimate the nonlinear correlation for replicate 3 (replicate 2) data and averaged the two values of K. The cross-validation formula to train γ was and the trained γ = 7.5 for IL17A-TIAM1. The averaged K value of IL17A-TIAM1, IL17A-ISG20, IL17A-RAB3 and IL17A-RORC are -0.21, 0.83, 0.90 and 0.84, respectively and the corresponding p-values using t-test are 0.002, 0.020, 0.006 and 0.017, which are all significant at α = 0.05. These four genes have significant K with IL17A, thus these are classified as involved in immune differentiation. Comparison of K to DyNB, timepoint-wise analysis using DESeq and dCor. Aijo and colleagues showed that DyNB [8] was able to uncover the nonlinear correlation of IL17A with ISG20, RAB3 and TIAM1, which DESeq [20] from timepoint-wise analysis failed to uncover; mRNA expression of IL17A and RORC also exhibited similar patterns across time (Fig 1(B) and 1(D) in [8]). Further, timepoint-wise analysis (DESeq) detected two false positives, KIF11 and MAP1B. We note that timepoint-wise analysis did not take into account correlations between time points and the whole pattern of time-course gene expression. Therefore, both K and DyNB could detect the aforementioned nonlinear temporal correlation of the above four pairs, but not DESeq. We further compared the nonlinear correlation measure dCor to K. DCor estimated the correlation of IL17A-IGS20 and IL17A-RORC correctly (at the level of 0.05), but failed to quantify IL17A-RAB3 and IL17A-TIAM1. As IL17A-TIAM1 exhibited a complementary pattern (Fig 7(C) in [8]), their correlation should be negative; we refer to Table 4 for details.

Table 4

DCor and K-RBF estimated for the four gene pairs of IL17A, which played an important role in the early differentiation of Th17 cells in humans.

mean (se) p-value	dCor	K_c-RBF^a
IL17A-TIAM1	0.65 (0.09) 0.032	-0.21 (0.001) 0.002
IL17A-ISG20	0.61 (0.10) 0.035	0.83 (0.04) 0.020
IL17A-RAB3	0.66 (0.24) 0.081	0.90 (0.01) 0.006
IL17A-RORC	0.68 (0.15) 0.048	0.84 (0.03) 0.017

aThe trained γ = 7.5 (γ = 0.5) was used in the RBF kernel for IL17A-TIAM1 (the remaining three pairs).

Application 2

In this application, we applied Pearson’s correlation, K and dCor [11] to estimate the nonlinear correlations of six gene pairs, which were involved in the cell cycle of S.cerevisiae [9]. These pairs are RNR1-SWE1, RNR1-RAD51, SWE1-RAD51, HST3-RAD51, HST3-RNR1, and HST3-SWE1, where the former three pairs have similar patterns, but the latter three exhibit complementary patterns. We first tested whether the nonlinear correlations exist in these pairs by the CoS test [12]. The P value of CoS (with 1,000 repeats) for these pairs was equal to 0.000, 0.000, 0.004, 0.000, 0.002, and 0.003, respectively, which demonstrated nonlinearity existed in these pairs. Next, we applied K-RBF (with the default γ = 0.5) to estimate the nonlinear correlation of gene pairs with similar patterns, similar to Application 1. We treated data of the two cell cycles in the experiments, namely time t = 1-9 and 10-18 (two biological repeats), as two replicates. The Pearson’s correlation, dCor and K computed by the replicates for (RNR1-SWE1, RNR1-RAD51 and SWE1-RAD51) were (0.82, 0.92 and 0.72), (0.85, 0.91 and 0.78) and (0.91, 0.99 and 0.91), respectively. All of these were significant at the 0.10 level; see Table 5 for details. For gene pairs with anti-similar patterns, we used replicate 1 (replicate 2) to train γ of the RBF kernel and applied the trained K to estimate the nonlinear correlation for replicate 2 (replicate 1). The objective function to be optimized is as follows. and K is the value of K in biological repeat i.

Table 5

Pearson’s r, dCor, and K-RBF estimated for the pairs of similar and complementary patterned cell cycle genes in yeast.

mean (se) p-value	Pearson’s r	dCor	K_c-RBF^a
Similar patterned
RNR1-SWE1	0.82 (0.06) 0.048	0.85 (0.01) 0.003	0.91 (0.07)0.050
RNR1-RAD51	0.92 (0.04)0.024	0.91 (0.06) 0.014	0.99 (0.004) 0.003
SWE1-RAD51	0.72 (0.11) 0.094	0.78 (0.15) 0.043	0.91 (0.04) 0.031
Complementary patterned
HST3-RNR1	-0.55 (0.16) 0.129	0.72 (0.01) 0.003	-0.85 (0.01) 0.006
HST3-RAD51	-0.50 (0.23) 0.203	0.65 (0.06) 0.022	-0.87 (0.004) 0.002
HST3-SWE1	-0.49 (0.26) 0.224	0.67 (0.01) 0.003	-0.75 (0.11) 0.066

aThe default γ = 0.5 was used in the RBF kernel for the similar patterned pairs. The trained γ = 1.0 (γ = 0.7) was used in the RBF kernel for HST3-RNR1 and HST3-RAD51 (HST3-SWE1).

aThe default γ = 0.5 was used in the RBF kernel for the similar patterned pairs. The trained γ = 1.0 (γ = 0.7) was used in the RBF kernel for HST3-RNR1 and HST3-RAD51 (HST3-SWE1). As shown in the S1 Dataset in which γ was trained, the values of the objective function with γ =1.0, 1.0, and 0.7 did not differ much from the associated global minimum of HST3-RAD51, RNR1-HST3, and HST3-SWE1, respectively; the differences were < 8×10-5, thus we used K-RBF with these trained γ values. The estimated correlation of RNR1-HST3, HST3-RAD51 and HST3-SWE1 by Pearson’s correlation (r), dCor and K are summarized in the lower half of Table 5. Note that this training of the γ value is essential for complementary patterned variables. At the 0.10 significance level, K of HST3-RNR1, HST3-RAD51, and HST3-SWE1 were all significant, but none of the Pearson’s correlation coefficients was significant, nor did dCor result in negative signs of the complementary patterns correctly. In particular, both Pearson’s correlation and dCor did not reflect the complementary pattern of HST3-RAD51 as shown in Fig 1A.

A rule of thumb for kernel selection of K

The simulation study shows that when the noise of data is low to moderate, taking both TPR and FPR into account (e.g., using the likelihood ratio for a positive test TPR/FPR), K-RBF is better than K-poly2, and K-poly3 performs the worst. Furthermore, Application 1 and 2 show that K -RBF with a trained γ value (K-RBF with γ = 0.5) is able to detect negative (positive) correlations adequately, but K-poly2 fails to detect IL17A-TIAM1 (-0.0002 with P = 0.500) and IL17A-RORC (-0.0002 with P = 0.500). Therefore, we suggest using K-RBF (with γ = 0.5) for positively correlated variables, and K -RBF (with a trained γ value) for negatively correlated variables.

Discussion

We have proposed the kernelized correlation K, which has been shown able to detect a nonlinear correlation of two variables (across time) as shown in the simulation study and two applications. To the best of our knowledge, this nonlinear measure is the first to quantify negatively correlated pairwise variables by a negative value, and it can also be applied to data exploration, variable selection, and others. Nevertheless, it cannot be applied to measure correlations of paired variables in static experiments, e.g., gene A in the control and experimental groups. The advantages of K lie in that it is simple and distribution-free. This tool will enable non-computational researchers to identify nonlinear correlations in biomedical data which may have important applications, such as identifying genes involved in the differentiation of human T helper cells. For similar patterned pairwise variables, using the default γ value of the RBF kernel is sufficient. For complementary patterned pairwise variables, a CV-trained value of γ is required similar to other bandwidth-based methods; nevertheless, this training step is fairly simple using the cross-validation formula, and the code is also provided at http://staff.stat.sinica.edu.tw/gshieh/KC/Kc.html. As demonstrated in the Result section, the proposed K was able to quantify nonlinear associations in the simulated cases that had been designed with some degree of nonlinear correlation, in contrast to Pearson’s and Kendall’s correlations, which did not detect significant correlation when the data were nonlinear. In Application 1, K-RBF was shown to identify the negative correlation of IL17A-TIAM1 and the remaining three gene pairs with positive correlations, as well as the sophisticated DyNB [3], while DESeq failed in these pairs and dCor only identified IL17A-ISG20 and IL17A-RORC correctly. In Application 2, K was shown to detect negative nonlinear correlations of the three gene pairs involved in yeast cell cycle regulation, while both Pearson’s correlation and dCor failed in all these pairs. Our method is a general approach to detecting nonlinear correlations. The property of distribution-free makes K applicable to a wide variety of problems. Although we only applied it to gene expression data in this study, this method can be applied to other types of omic data, e.g., proteomics and metabolomics. Taken together, this nonlinear correlation approach is useful for the estimation of the nonlinearity of biological associations. In future research, a natural extension would be to develop a multivariate version of K. The derivation of measures for nonlinear correlation in functional data is another promising topic.

Training the parameter gamma in the RBF kernel of Application 1 and 2.

(DOCX) Click here for additional data file.

The R code of kernelized correlation K.

(RMD) Click here for additional data file.

Dataset of Application 1.

(ZIP) Click here for additional data file.

Dataset of Application 2.

(CSV) Click here for additional data file. 17 May 2022

PONE-D-21-31352

A Nonlinear Correlation Measure with Applications to Gene Expression Data

PLOS ONE Dear Dr. Shieh, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jul 01 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Maria Alessandra Ragusa, PhD Professor Academic Editor PLOS ONE Journal requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf". 2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section. 3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. Additional Editor Comments: Dear Corresponding Author, according to the referee's comments the paper is accepted for publication. Best regards. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The manuscript presents a novel approach to reveal nonlinear correlation in biomedical datasets. The nonlinear kernel measure has been proposed to project high-dimensional data into Hilbert space. The authors demonstrated the effectiveness their approach for discovery of genes involved in differentiation human T helper cells. The statistical analysis provides sufficient evidence for advantages of the approach with respect to several canonical correlation measures. Reviewer #2: The authors proposed a kernelized correlation coefficient to measure a possibly nonlinear association among genes. Although the proposed method is straightforward, it has potential and broad applicability in bioinformatics. My specific comments are given in the following. - One obvious competitor is distance correlation to measure the nonlinear association between two variables. So I suggest adding this method to the comparison. - I believe the choice of the kernel is quite essential, especially in the applications in genetics. I suggest providing more details about how to choose the kernel in practice. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 1 Jun 2022 We have included point-to-point replies to the comments and suggestions made by AE and Reviewers in the "Response to Reviewers" .docx Submitted filename: Response_to_Reviewers.docx Click here for additional data file. 8 Jun 2022 A Nonlinear Correlation Measure with Applications to Gene Expression Data PONE-D-21-31352R1 Dear Dr. Shieh, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Maria Alessandra Ragusa, PhD Professor Academic Editor PLOS ONE Additional Editor Comments (optional): The revised version is now ready for publication. Best regards. Reviewers' comments: 10 Jun 2022 PONE-D-21-31352R1 A nonlinear correlation measure with applications to gene expression data Dear Dr. Shieh: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Maria Alessandra Ragusa Academic Editor PLOS ONE

17 in total

1. Significance analysis of time course microarray experiments.

Authors: John D Storey; Wenzhong Xiao; Jeffrey T Leek; Ronald G Tompkins; Ronald W Davis
Journal: Proc Natl Acad Sci U S A Date: 2005-09-02 Impact factor: 11.205

2. An overview of statistical learning theory.

Authors: V N Vapnik
Journal: IEEE Trans Neural Netw Date: 1999

3. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.

Authors: M Schena; D Shalon; R W Davis; P O Brown
Journal: Science Date: 1995-10-20 Impact factor: 47.728

4. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

Authors: P T Spellman; G Sherlock; M Q Zhang; V R Iyer; K Anders; M B Eisen; P O Brown; D Botstein; B Futcher
Journal: Mol Biol Cell Date: 1998-12 Impact factor: 4.138

A nonlinear correlation measure with applications to gene expression data.

Introduction

Materials and methods

Data

Transforming nonlinear data to a high-dimensional space via a kernel

Definition

Statistical analyses

Results

Simulation study

Case 1

Case 2

Case 3

Applications

Application 1

Application 2

A rule of thumb for kernel selection of K

Discussion

Training the parameter gamma in the RBF kernel of Application 1 and 2.

The R code of kernelized correlation K.

Dataset of Application 1.

Dataset of Application 2.

1. Significance analysis of time course microarray experiments.

2. An overview of statistical learning theory.

3. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.

4. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

5. Phenotypical characterization of human Th17 cells unambiguously identified by surface IL-17A expression.

6. Identification of early gene expression changes during human Th17 cell differentiation.

Review 7. Transcriptional regulation of Th17 cell differentiation.

8. Differential expression analysis for sequence count data.

9. DGCA: A comprehensive R package for Differential Gene Correlation Analysis.

10. Next maSigPro: updating maSigPro bioconductor package for RNA-seq time series.