Yi-Hui Zhou1. 1. Department of Biological Sciences and Bioinformatics Research Center North Carolina State University Raleigh 27695 North Carolina USA.
Abstract
The problem of detecting the changes in covariance for a single pair of genomic features has been studied in some detail but may be limited in importance or general applicability. For testing equality of covariance matrices of a set of features, many methods have been limited to the two-sample problem and involve varying assumptions on the number of features p versus the sample size n. More general covariance regression approaches are appealing but have been insufficiently structured to provide interpretable testing. To address these deficiencies, we propose a simple uniform framework to test association of covariance matrices with an experimental variable, whether discrete or continuous. We describe four different summary statistics, to ensure power and flexibility under various alternatives, including a new "connectivity" statistic that is sensitive to the changes in overall covariance magnitude. For continuous experimental variables, a natural individual "risk score" is associated with several of the statistics. We establish asymptotic results applicable to both continuous and discrete responses, with relatively mild conditions and allowing for situations where p>n. We also show that the proposed statistics are permutationally equivalent to some existing methods in the two-sample special case. We demonstrate the power and utility of our approaches via simulation and analysis of real data. The R package CorDiff is published on R CRAN.
The problem of detecting the changes in covariance for a single pair of genomic features has been studied in some detail but may be limited in importance or general applicability. For testing equality of covariance matrices of a set of features, many methods have been limited to the two-sample problem and involve varying assumptions on the number of features p versus the sample size n. More general covariance regression approaches are appealing but have been insufficiently structured to provide interpretable testing. To address these deficiencies, we propose a simple uniform framework to test association of covariance matrices with an experimental variable, whether discrete or continuous. We describe four different summary statistics, to ensure power and flexibility under various alternatives, including a new "connectivity" statistic that is sensitive to the changes in overall covariance magnitude. For continuous experimental variables, a natural individual "risk score" is associated with several of the statistics. We establish asymptotic results applicable to both continuous and discrete responses, with relatively mild conditions and allowing for situations where p>n. We also show that the proposed statistics are permutationally equivalent to some existing methods in the two-sample special case. We demonstrate the power and utility of our approaches via simulation and analysis of real data. The R package CorDiff is published on R CRAN.
Methods to detect association of an experimental variable with the changes in mean expression of sets of genes are well established (Barry, Nobel, & Wright, 2005; Goeman, Van De Geer, De Kort, & Van Houwelingen, 2004). In contrast, although tests of the changes in correlations or covariances have received considerable attention in areas such as genomics (McKenzie, Katsyv, Song, Wang, & Zhang, 2016) and finance (Isogai, 2016), set‐based methods are less established. Existing methods for genomics have been confined largely to the two‐sample problem (Choi & Kendziorski, 2009; Hu, Qiu, & Glazko, 2009; Hu, Qiu, Glazko, Klebanov, & Yakovlev, 2009). Exceptions include the liquid association method of Li (2002), developed to describe ternary relationships among genes, but the underlying motivation is based on the concept of differential covariance. For tests of biological systems, ensemble tests of covariance matrices for sets of genes in a proposed network or pathway can provide insight that might not be apparent in tests of individual gene pairs (Yuan, Deng, Tang, & Li, 2016).Several existing differential covariance methods for genomics have primarily used permutation resampling for testing (e.g., Hu et al., 2009), partly because classical likelihood approaches require the sample size n to be large compared with the number of features p (Anderson, 1962; John, 1971). The two‐sample problem tests equality of p×p covariance matrices H
0:Σ1=Σ2 based on samples of sizes n
1 and n
2, where n
1+n
2=n. In settings where p>min{n
1,n
2}, likelihood ratio testing may perform poorly or be undefined.Recently, a number of statistical investigators have reconsidered the two‐sample set‐based differential covariance problem with an emphasis on behaviour when p>n, with little reference to the existing genomics literature. Li and Chen (2012) derived an approximately standard normal statistic for the Frobenius norm of differences in the two p×p sample covariance matrices, with considerable attention to sources of bias when p is large. A maximum standardized difference statistic was proposed (Cai, Liu, & Xia, 2013) between two sample covariances, with testing based on an extreme value approximation. There has been comparatively little discussion of the fact that these methods are designed for very different alternatives, ranging from modest but widespread differences in the two sample covariance matrices (Li & Chen, 2012) to large differences in a very few covariance elements (Cai et al., 2013). An approach was reported (Zhu, Lei, Devlin, & Roeder, 2017) for sparse effects using eigenvalues but is also limited to the two‐sample problem. Cai and Sun (2017) give a review of the previously cited papers, from the perspective of one‐sample and two‐sample testing. Additional aspects, such as local testing to discover groups of differentially correlated features, are beyond our scope. Importantly, there are few approaches to test association of covariances with a continuous outcome.Covariance regression methods (Hoff & Niu, 2012; Zou, Lan, Wang, & Tsai, 2017) offer a potentially more flexible approach, with the ability to handle continuous predictor variables. Methods such as Zou et al. (2017) are ultimately sensitive to the property that similarity in predictors corresponds to similarity in responses and thus are similar to kernel methods such as SKAT (Ionita‐Laza, Lee, Makarov, Buxbaum, & Lin, 2013). The special case of a linear kernel and a restricted set of predictors in Zou et al. (2017) has some similarities to our testing approach, but the authors focus on efficiency and comparison of maximum likelihood and other fitting approaches. The regression method of Hoff and Niu (2012) for a predictor denoted y, which may be continuous, models the covariance as Σ=Ψ+Byy
B
and for a positive definite p×p matrix Ψ maintains a positive definite covariance function. However, the p‐vector B provides in some ways less flexibility compared with the methods we propose, because we wish to be sensitive to any changes in the covariance matrix, but at the same time, the vector B may be difficult to interpret. In addition, the approach does not match with standard statistics used in the two‐sample problem and does not lend itself to a simple test of association with y.The differential covariance literature continues to grow, with little cross‐talk among the areas described above (classical two‐sample testing, two‐sample testing with p>n, and covariance regression). Key advances for genomics would be (a) to recognize situations in which analytic p‐values are available and accurate even if p>n, (b) to move beyond the two‐sample scenario without the complications of interpreting the full results of covariance regression, (c) to have a framework for understanding different test statistics, for a variety of alternatives.
A new set of statistics
To address these issues, we first note that the two‐sample problem can be viewed as an “association” of the covariance matrix with a binary group indicator. More generally, the investigator may be interested in trend association of covariance with an experimental variable y that might be multilevel or on a continuous scale. To our knowledge, no general method is available with the requisite flexibility, without restrictive parametric requirements or assumptions of the feature size p relative to n.In this manuscript, we use a simple framework to propose four different statistics to test the changes in a covariance matrix of p features. Several of these statistics have been proposed for the two‐sample problem but not extended to a continuous predictor. Moreover, existing methods have been published in isolation, providing little opportunity to consider power characteristics for various types of alternatives. Thus, a major contribution of this paper is to provide a common framework and to point out the similarities and differences among covariance‐change statistics.Here, we propose four statistics to test the changes in a covariance matrix of p features, when it is anticipated that a change in y will result in (a) a directional change in many elements of the covariance matrix, (b) a nondirectional change in covariance, (c) a change in the overall magnitude of covariances, or (d) a large change in one or a few elements of the covariance matrix. In contrast to almost all of the comparable methods, the statistics apply naturally whether y is continuous or discrete. Asymptotic results and p‐values for most of the statistics are derived under relatively mild conditions, to provide computationally efficient p‐values for the two‐sample comparisons. Permutation can be used for the remaining statistic and if the researcher does wish to rely on asymptotic theory to ensure control of type I error. Our approach is not limited by the data dimensions and is applicable to situations where p>n.This paper is organized as follows. In Section 2, we introduce the method and test statistics. Section 3 provides asymptotic results for the general methods, and for the special case of the two‐sample problem shows that several of our statistics match up with those previously proposed. In Section 4, we compare the proposed statistics with existing methods, in terms of type I error and power. Several different simulation settings are presented for the two‐sample problem, comparing our statistics to existing methods. In addition, we compare our proposed methods in the setting with continuous y. Section 5 illustrates with real examples.
METHODS
Notation
Let X be the p×n data matrix consisting of elements x
and y the n‐vector of clinical/experimental data. The sample mean and variance of a vector follow standard notation, for example,
and
. The ith row and kth column of X are denoted x
and x
., and each column is assumed to have a population p‐mean of zero. Random variables are capitalized (e.g., random Y
vs. observed y
). We denote the p×p covariance of X, which may depend on y
, as
. The zero‐mean assumption is implicit in most covariance tests, following an intent that the test statistics be sensitive only to the changes in covariance. For a subset of samples ω with at least two samples, the sample covariance is
. A single i,j element is
. We use ξ to denote the operator that sums all elements of a matrix and the superscript “∘k” to denote the element‐wise exponent of a matrix to power k.
A conceptual trend model
The existing methods are limited to two‐sample comparison with binary y. To motivate our statistics for binary/continuous y, we adopt a conceptual trend model for the covariance dependence of X on y: Σ=β
0+β
1
y for p×p matrices β
0, β
1. Thus, for sample k, according to our assumptions, E(X
|y
)=0 for each i and cov(X
,X
|y
)=E(X
X
,|y
)=β
0,+β
1,
y
for the ith and jth features. Letting z
=x
x
, the model immediately suggests linear regression of z on y, for which the least squares slope solution is
, where
. Although the trend assumption is simple,
can be viewed as an approximate score statistic for additive models of the form E(Z
|y
)=η
0,+η
1,
f(y
) for a strictly monotone smooth f, where η
0, and η
1, are the unknown coefficients in this “true” model, and our β
1, is thus locally powerful for detecting departures from the null η
1,=0. We make two further observations: (a) We do not consider β
0, to be of interest for detecting the covariance changes, and (b) linear rescaling of y will not meaningfully change our results, because it results in constant changes in the proposed statistics. Thus, without loss of generality, we assume
, so
. These least squares solutions are not intended to be used directly but serve to motivate global test statistics described below. For the two‐sample special case, the trend model results in
matrices that are sample covariance matrices for each of the two samples and thus non‐negative definite. For continuous y, the trend model does not guarantee non‐negative definiteness throughout the range of y. However, the clear interpretation remains for
and its use for each of the summary covariance‐change statistics.
Four statistics
In this subsection, we propose four statistics as a unified framework for covariance testing for sets of genes, which are equally applicable to a binary (the two‐sample problem) or continuous y . In different scenarios, each of them has a role to promote biological/medical discoveries. A summation statistic S is sensitive to the covariance changes in the same direction, whereas a quadratic form statistic Q is sensitive to the changes in either direction. Several existing methods are essentially two‐sample special cases of S and Q, as we later show. The connectivity statistic C is completely novel and has a relationship to overall covariance magnitude, which may be useful in network analysis. Statistics similar to the maximum statistic M have been well studied (Cai et al., 2013) but had been previously limited to the two‐sample problem.
A summation statistic
To effectively measure the covariance changes, we propose
as a summation statistic to detect the global changes in covariances that are concordantly associated with the experimental variable y (i.e., in the same direction). A simplification for S is
for
. In datasets where the null for S can be rejected, the value w
represents a natural “risk score” for sample k, with extreme w values corresponding to extreme y.Although the initial motivation for S was based on p×p covariance terms, the restated statistic is ultimately based on an inner product of n‐vectors, and thus, we may use a large sample normal approximation as n→∞ for rescaled S to obtain p‐values.
A quadratic form statistic
In contrast,
is sensitive to the changes that are not directionally concordant. Similar to S, Q can also be represented by n‐vectors and n×n matrices.
where
. The matrix with elements a
can be simplified to A=(X
X)∘2. Finally, we have the quadratic form Q=y
Ay.The nature of Q makes it difficult derive a risk score analogue and also difficult to justify closed‐form limiting approximations to its null distribution. An exception is in extreme cases, such as dominance of a single eigenvalue in A (approximately chi‐square), or with a large number of eigenvalues of similar magnitude (approximately normal). As explored in Zhou et al. (2013), for small to moderate sample sizes, a weighted beta approximation can be more accurate than standard approximations for sums of independent chi‐square distributions. However, the procedure can be somewhat computationally intensive, and here, we opt for direct permutation of y to obtain p‐values, for a general A. For certain special cases, solutions for the first four permutation moments (Zhou et al., 2013) may be used to obtain p‐values but require restrictive assumptions on the form of A.
A connectivity statistic
Each element a
of A has the form of a squared inner product between samples k and l, and so
reflects broad‐scale association (a “connectivity index”) of sample k with remaining samples. Accordingly, we propose the connectivity statistic C=y
b for risk scores {b
} to reflect correlation between y and the connectivity index. Correlations between samples are ultimately driven by correlation between features, and C reflects the tendency for the aggregate magnitude of feature–feature correlations to be associated with y, which is quite different from the type of alternative envisioned for S and Q.
A maximum statistic
Our fourth statistic is similar to that of Cai et al. (2013) for the two‐sample special case. We use a test for the maximum element difference, scaled by an appropriate standard error. For our generalization of the statistic, we define
, where r
is the Pearson correlation between y and z
. Defining
, we propose the maximum statistic
, with a risk score
.Approximate p‐values for M use an extreme value approximation for
variates as n→∞. Beyond standard assumptions that elements of X and Y have appropriate tail behaviour (e.g., sub‐Gaussian), there are modest restrictions on Σ and that
grows more slowly than n
1/5. With these assumptions, approximate p‐values are obtained using
.
Multiple testing
For practical data analysis, we use the standard Benjamini–Hochberg false discovery rate (FDR) control for each of the covariance‐change statistics across all the gene sets examined, in order to examine all findings comprehensively. This approach may be viewed as a form of stratified FDR control (Sun, Craiu, Paterson, & Bull, 2006), with the aggregated FDR (for all gene sets and all four statistics) expressible as a weighted average of the stratum‐specific FDRs. In this manner, we are able to focus on top findings for each of the statistics in turn, highlighting the most significant findings. Note that, according to this perspective, we can choose the most significant findings across the statistics while maintaining the stratified FDR, and whichever statistics appear in the FDR‐significant set are driven by those that are most powerful for the problem at hand.
THEORETICAL RESULTS
Asymptotic theorems to obtain p values
We have established asymptotic results for statistics S, C, and M for general y, which is non‐trivial because p may be large relative to n. Suitable rescaling provides p‐values using normal (S and C) or extreme value (M) distributions.
Theorem 1 (S and C)
This theorem implies that S and C are asymptotically normal, as will be explained following the proof. The proof is a triangular array version of the central limit theorem tailored for our setting.The Lindeberg condition for the sequence of random variables {U
} states that for any ε>0,
as n→∞. Assume Y and W
are random variables such that E(Y
4)<∞ and
for all k,n, and k=1,…,n. Define V
=W
Y
and
and assume the Lindeberg condition holds for
and {V
}. Further, define
where
,
, s
, and s
are the sample means and standard deviations of W
., and Y, and
. Then, under H
0 that W
. and Y are independent, for any t, P(Z ≤ t)→Φ(t) as n→∞.The proof is in the Supporting Information.Both S and C can be placed within this framework. To apply the theorem to S, we define
and
. Under the null hypothesis that X is independent of Y, clearly, W is also independent of Y. It is clear from the definition that it allows for dependence of p on n. For C, we define W
=B
where B
is the random variable risk score given in Section 2.3.3.Note that the theorem is stated generally enough that the columns of X, which correspond to the k subscript in W, be nonidentically distributed. Verifying the conditions requires a model for X with increasing n but will be satisfied for a wide variety of “typical” assumptions. For example, previous work has often assumed multivariate normality of X and either a binary or normal Y, for which the conditions can be shown to be satisfied (see below for example) under very general conditions.
An example of establishing conditions for Theorem 1 for S and C for multivariate normal X
Suppose each of the columns of X is p‐variate multivariate normal MVN(0,Σ), where both p and Σ may depend on n. We have from the above definitions
,
, and
. For MVN X,
, where Z
∼N(0,1) (and we drop the n subscript from Z). As k is arbitrary, Theorem 1 requires establishing Lindeberg conditions for
, for which the stronger Lyapunov condition is sufficient, and we compute
for δ>0, and the terms involving ξ
cancel whereas other terms do not depend on n. Also, p does not appear because the sum across rows of X is normal and already considered in
. It is thus clear that L
→0 as n→∞, satisfying the condition. Applying the same approach to
(i.e., establishing the condition for the V
terms) similarly yields cancellation of ξ
and L
→0. Specifically, p does not appear, and the denominator of the right‐hand side of the above equation will include ξ
4+2
E(Y
2)var(Z
2) and
, because under the null hypothesis,
and Y
are independent.For C, we apply the theorem conditions to
, and the Lyapunov condition is easy to establish from independence of columns of X.
Theorem 2 (M)
Suppose that the correlation condition C1, tail condition C2, and the moment condition C3 shown below hold. Then, under H
0, for any t,
as n→∞.The proof is in the Supporting Information.In Cai et al. (2013), several conditions (denoted C1–C3) were assumed to hold, and our analogues are required. In our notation, consistent with Cai et al. (2013), X
is the data value for feature i and sample k, with mean μ
. Key terms in the proof and conditions include the pair {U
,V
}, where U
=(X
−μ
),V
=W
if i ≤ j, and U
=W
,V
=(X
−μ
) if i>j. Terms ρ
are feature–feature correlations, and
,
, and under the null hypothesis σ
=0. Roughly, condition C1 controls the proportion of large feature–feature correlations, C2 limits the tail behaviour for X, and C3 provides bounds on four‐way cross moments among features.Cai et al. (2013). For r∈(0,1), define Λ(r)={1 ≤ i ≤ p:|ρ
|>r} for some j≠i. Suppose there exists a subset Υ⊂{1,2,..,p} with cardinality o(p) and a constant α
0>0 such that for all γ>0,
and there exists r>1 and a sequence Λ such that the cardinality of Λ(r) ≤ Λ=o(p).Sub‐Gaussian tail and polynomial tail conditions, analogue of C2 and C2∗ in Cai et al. (2013). Suppose that
. There exist constants η>0 and K>0 such that
for all i. Alternatively, we assume that for some γ
0,c
1>0,
, and ϵ>0,
and
for all i.Also, in either case, we assume
for some τ>0.Analogue of C3 in Cai et al. (2013). For any collection i,j,k, and l∈{1,2,…,p}, we assume without loss of generality that i ≤ j and k ≤ l, and we suppose there exists
such thatWe use Lemmas 1 and 2 from Cai et al. (2013) and the following lemma.Analogue of Lemma 3 of Cai et al. (2013). Under the conditions of C2, there exists some constant C>0 such that
where
as n,p→∞.We have
. We first note that Lemma 3 from Cai et al. (2013) applies directly to
as an estimator of var(Z
) or alternatively to
as an estimator of the same quantity if Z is centred. Furthermore,
and so
using variance scaling wlog as in the Supplemental Appendix of Cai et al. (2013)With the above assumptions in hand, the proof follows from the proof of Theorem 1 in Cai et al. (2013), where in each instance, our
is substituted for (say)
from Cai et al. (2013), and zero substituted for
, and a single denominator
in place of the denominator in Cai et al. (2013), which gives the correlation coefficient. Condition C1 prevents excessive correlation of features, whereas C2 ensures that large deviations in the data do not prevent limiting convergence of extreme
.
Permutation testing
Although for computational speed we typically rely on asymptotic p‐values, permutation testing can be useful, as a means of both performing small sample analysis and informing interpretation of our statistics, as we show in the next subsection. Letting Π denote a random permutation of n elements from among the n! possibilities (realized value π), the statistics for permutation π are
,
,
, and M
(which require computation of the
values and standard errors for each permutation). S and C are subjected to two‐sided testing, with p‐values based on both right and left tails, whereas Q and M are one‐tailed, rejecting for large values. For example, with H random permutations and π[h] denoting the hth permutation, the empirical p value for S is
, whereas the p value for Q is
.The null hypothesis is that the relationships of columns of X to the elements of y are exchangeable (Good, 2002), which holds if X and y are drawn from independent distributions. A primary advantage of permutation testing is that, aside from slight issues due to discreteness or tied outcomes, type I error rates are controlled without requiring parametric assumptions (Zhou & Wright, 2015). Note that, because permutation testing is conditional on the observed data, the dimension p is immaterial in terms of the ability to maintain appropriate false positive control. However, the sample size n should be sufficient that p‐values <α can be achieved. For example, if y is continuous, then the minimum p value achievable for two‐sided statistics is 2/n!.
Special case: Two group comparisons and permutation equivalence
As stated earlier, all of our statistics apply for a general y. Here, we show that in the two‐sample special case, our proposed statistics S and Q are equivalent (in a permutation sense) to natural summaries of sample covariance matrices. In addition, M closely matches the Cai et al. (2013) statistic. These concepts are illustrated in Figure 1 (left panel), which summarizes the properties that each statistic has in terms of alternatives to which it is sensitive.
Figure 1
Left panel: A visual summary of the proposed covariance‐change statistics and are grouped with previously proposed statistics for the two‐sample problem special case. Right panel: Comparison of the four proposed statistics to various existing statistics for the two‐sample problem special case, for a single‐simulated dataset, and for 100 permutations. Pearson correlations illustrate the exact and approximate correspondence of some pairs of statistics
Left panel: A visual summary of the proposed covariance‐change statistics and are grouped with previously proposed statistics for the two‐sample problem special case. Right panel: Comparison of the four proposed statistics to various existing statistics for the two‐sample problem special case, for a single‐simulated dataset, and for 100 permutations. Pearson correlations illustrate the exact and approximate correspondence of some pairs of statisticsFor the two‐sample problem, an obvious summation‐like statistic would be
. A statistic
is sensitive to the covariance changes in either direction that might otherwise cancel in the previous statistic. A statistic
is directional, but for the magnitude of covariances, and by squaring the covariance elements provides additional weight to gene–gene pairs with high covariance magnitude. Finally, a maximum statistic would identify the maximum covariances differences
but appropriately scaled by a standard error for each gene pair.Let ω
1 and ω
2 be the indexes for samples in groups 1 and 2, respectively, and the subscripts 1 and 2 will be used for simplicity. We assign the experimental variable
if k∈ω
1 and
if k∈ω
2. Thenthe directional statistic S is equivalent to
;the nondirectional statistic Q is equivalent to
.The proof is in the Supporting Information.Figure 1 (right panel) shows the results from 100 random permutations of y for the two‐sample problem with n
1=n
2=20, p=50. A single X was generated using the null version of Model 2 described in the next section, but the qualitative results hold regardless of the choice of X. As we showed above, S and Q are equivalent to
and
, respectively. Under the permutations, C has a high Pearson correlation over permutations with
, supporting the perspective that C reflects a contrast in the overall magnitude of covariances. Finally, our M is correlated under permutation with the statistic from Cai et al. (2013), although they differ modestly due to the differences in the standard errors used.This permutation example underscores the correspondence between our statistics and those that seem “natural” for the two‐sample problem, but we emphasize that our statistics apply for either discrete or continuous y.
SIMULATION MODELS FOR TYPE I ERROR AND POWER
Initial comparisons follow the simulation settings from Li and Chen (2012), for which feature covariances were described using auto‐regressive notation. More compactly than the original articles, we describe their simulation settings in terms of the covariance matrices.
Simulation Model 1 with a continuous y
For this simulation model, values in y are drawn iid N(0,1) in each simulation and converted to the rescaled experimental variable
. X is drawn as multivariate
, with
. We assume γ
1 is the identity matrix and γ
2 is the compound symmetric matrix,
in which we call ρ the “effect size.” Under the null, there is no change in the covariance structure, that is, γ
2 is the identity matrix, as is
for all y
∗. As ρ increases, the relationship between the covariance and y
∗ becomes stronger. Figure 2 shows that the power for the proposed statistics is near the intended α=.05 when ρ=0. Figure 2 also shows that the directional statistic S is the most powerful approach overall.
Figure 2
Power comparison among S, Q, C, and M for Simulation Model 4. The dashed line at α=.05 indicates that all the proposed methods control type I error well under the null (ρ=0). The effect size ρ ranges from 0 to .8
Power comparison among S, Q, C, and M for Simulation Model 4. The dashed line at α=.05 indicates that all the proposed methods control type I error well under the null (ρ=0). The effect size ρ ranges from 0 to .8
Simulation Model 2 with a continuous y
This simulation model is a bit more complex, following a similar approach used in Cai et al. (2013). The approach generates covariance matrices that are nondirectional in relationship to y and with no overall variation in magnitude, while respecting the need for positive definiteness. To an initial p×p identity matrix I, Σ∗(1) was formed by drawing the first p/2×p/2 off‐diagonal elements from a uniform density U[−ρ,ρ], followed by Σ∗(2)=Σ∗(1)+Σ∗(1) and
. Σ2 is formed by reversing the rows and columns of Σ1, and finally,
, where y
∗ is the result of linear rescaling of y to the [0,1] interval as in the previous subsection. Here, ρ∈[0,1) serves as an effect size, and Σ1 and Σ2 differ in the groups of genes that show correlation structure but otherwise are the same in the average magnitude of elements and show no directionality. Figure 3 provides the power comparison among the four proposed methods. As expected, S and C have little or no power, whereas M has low power. The statistic Q benefits from aggregation of covariance‐squared differences and thus has much more power than the other methods. All methods control type I error properly (dashed line at 0.05 in Figure 3).
Figure 3
Power comparison among S, Q, C, and M for Simulation Model 5. The dashed line at α=.05 indicates that all the proposed methods control type I error well under the null (ρ=0). The effect size ρ ranges from 0 to .6. Q is the most powerful method among these statistics for this simulation model
Power comparison among S, Q, C, and M for Simulation Model 5. The dashed line at α=.05 indicates that all the proposed methods control type I error well under the null (ρ=0). The effect size ρ ranges from 0 to .6. Q is the most powerful method among these statistics for this simulation model
Simulation Model 3 with discrete y (to assess type I error)
This simulation procedure was originally from Li and Chen (2012). We assume the first population X
1 ∼N(0,Σ1); whereas the second population X
2 ∼N(0,Σ2), where
The difference between the two covariance matrices is
To assess type I error, we set θ
2=0, which implies the null Σ1−Σ2=0. We show results for n
1=n
2={20,50,80,100} and feature dimension p={32,64,128,256,512,700}. The number of simulations was 1,000 for each setting. The asymptotic results were used to obtain p‐values for S and C and 1,000 permutations for Q. Although the parametric method works well for M for a moderate sample size, to ensure robustness for this statistic for the entire range of simulations, we also report permutation‐based p‐values, labelled M
.Table 1 shows that for this multivariate normal model, most methods perform well and control type I error. The Cai et al. (2013) method and our similar M statistics are noticeably anticonservative for α=.05 for the smaller sample size (n
1=n
2=20) and more so as p increases. As stated above, for comparison, we have also included in the table a statistic M
, which is the permutation‐based version of our M. For larger sample sizes, the asymptotic p‐values for M are close to nominal, even for large p. For the setting with n
1=n
2=20, p=50, 100,000 simulations were performed to provide greater insight into tail behaviour (Figure 4); p‐values for Q perform well, which is sensible, as the permutation null holds. In addition, we also show good results for a “residualized” Q (lower right panel), in which each row of X is residualized using simple linear regression for the effect of y. The rationale for such an approach might be to ensure that the test statistic is sensitive to the changes in covariance only, not to any linear association with y. Here, the residualization is also performed inside the permutation loop.
Table 1
Type I error comparison, Simulation Model 1, X
∼N(0,Σ), Σ1=Σ2
n1=n2
Method
p=32
p=64
p=128
p=256
p=512
p=700
20
S
.042
.043
.051
.035
.033
.044
Q
.055
.058
.046
.043
.043
.052
C
.041
.049
.029
.053
.053
.061
Li–Chen
.044
.054
.051
.048
.051
.038
Cai
.092
.14
.139
.204
.211
.263
M
.074
.085
.085
.132
.177
.209
Mp
.053
.054
.050
.052
.051
.050
50
S
.047
.042
.053
.035
.049
.035
Q
.052
.041
.045
.049
.042
.046
C
.055
.049
.051
.046
.057
.042
Li–Chen
.052
.060
.033
.043
.054
.049
Cai
.042
.068
.058
.065
.055
.059
M
.049
.042
.058
.034
.043
.033
Mp
.059
.054
.051
.048
.051
.050
80
S
.043
.057
.051
.043
.047
.04
Q
.065
.051
.040
.046
.044
.048
C
.056
.045
.047
.051
.063
.05
Li–Chen
.054
.060
.047
.048
.052
.053
Cai
.052
.056
.043
.052
.058
.041
M
.074
.043
.034
.036
.042
.027
Mp
.046
.046
.051
.047
.050
.049
100
S
.056
.05
.042
.051
.047
.05
Q
.039
.051
.050
.040
.060
.053
C
.057
.042
.028
.043
.057
.055
Li–Chen
.056
.049
.052
.046
.049
.048
Cai
.050
.052
.043
.039
.036
.047
M
.072
.052
.04
.045
.032
.027
Mp
.047
.054
.048
.050
.048
.046
Figure 4
QQplots for several of the proposed methods and existing methods for the null two‐sample problem of Simulation Model 1, p=50, n
1=n
2=20
Type I error comparison, Simulation Model 1, X
∼N(0,Σ), Σ1=Σ2QQplots for several of the proposed methods and existing methods for the null two‐sample problem of Simulation Model 1, p=50, n
1=n
2=20
Simulation Model 4 with discrete y (to assess type I error)
Here, we follow the previous Simulation Model but with skewed data elements. Specifically, let G(w;4,0.5) denote the Gamma distribution function with shape parameter 4 and scale 0.5 evaluated at w. Then, if W∼G, X=W−2 has mean zero and variance 1, that is, follows a centred Gamma. The elements of X
1 and X
2 are drawn as shown above, following the same null covariance structure that was used in Simulation Model 3.Here, the Cai approach in Cai et al. (2013) becomes conservative, with increasing both the sample size and feature size (Table 2). The Li–Chen method is anticonservative, but the type I error becomes closer to nominal as the sample size and feature size increase.
Table 2
Type I error comparison, Simulation Model 2, Σ1=Σ2, elements following centred Gamma
n1=n2
Method
p=32
p=64
p=128
p=256
p=512
p=700
20
S
.034
.042
.039
.039
.039
.04
Q
.057
.046
.066
.050
.047
.042
C
.035
.05
.047
.05
.037
.05
Li–Chen
.158
.112
.083
.071
.053
.063
Cai
.048
.048
.058
.055
.083
.085
M
.083
.102
.108
.145
.152
.171
Mp
.039
.039
.050
.037
.042
.046
50
S
.049
.061
.062
.048
.055
.044
Q
.055
.049
.055
.043
.051
.042
C
.045
.053
.051
.048
.044
.05
Li–Chen
.048
.048
.058
.055
.083
.085
Cai
.016
.013
.010
.007
.003
.004
M
.039
.062
.053
.054
.047
.042
Mp
.051
.05
.055
.042
.050
.048
80
S
.051
.046
.05
.049
.045
.056
Q
.056
.048
.043
.042
.038
.057
C
.062
.049
.042
.047
.047
.061
Li–Chen
.165
.141
.090
.059
.051
.056
Cai
.019
.010
.005
.006
.005
.002
M
.046
.051
.036
.049
.048
.038
Mp
.059
.039
.045
.045
.048
.051
100
S
.054
.037
.047
.047
.047
.048
Q
.045
.042
.049
.049
.051
.039
C
.041
.038
.045
.058
.049
.053
Li–Chen
.176
.133
.088
.069
.050
.046
Cai
.013
.009
.007
.003
.003
.003
M
.048
.044
.039
.034
.037
.048
Mp
.059
.041
.058
.052
.053
.050
Type I error comparison, Simulation Model 2, Σ1=Σ2, elements following centred Gamma
Simulation Model 5 with discrete y (to assess power)
For power comparisons, we return to the multivariate normal data elements. We use Simulation Model 3 but with covariance matrices determined by θ
1=2,θ
2=1 (one of the simulation models also used by Li and Chen, 2012, and summarized in their Table 3). Although this simulation model was used by Li and Chen (2012) to support their proposed statistic, our proposed C has consistently the highest power for all the n,p settings. The Li–Chen statistic shows power slightly higher than that of Q, even though they both are based on the Frobenius norm. We speculate that the reason is related to the fact that permutation testing is conditional on the observed data, and the power difference nearly disappears at the larger sample sizes. It is perhaps a bit surprising that S is less powerful than Q, as the covariance differences are directional. However, the squared terms in Q also may effectively act to reduce noise, and we have observed situations in which S is more powerful. The Cai and M statistics show the lowest power, as they use only the most extreme covariance difference element and do not aggregate over the large number of covariance difference elements.
Table 3
Power comparison, Simulation Model 3, X
∼N(0,Σ), Σ1≠Σ2
n1=n2
Method
p=32
p=64
p=128
p=256
p=512
p=700
20
S
.18
.156
.179
.163
.164
.158
Q
.211
.231
.235
.234
.221
.213
C
.63
.826
.972
.999
1
1
Li–Chen
.273
.273
.252
.285
.269
.272
Cai
.138
.140
.164
.204
.233
.282
M
.125
.111
.115
.157
.182
.189
Mp
.129
.072
.050
.061
.083
.054
50
S
.456
.474
.456
.452
.473
.437
Q
.705
.751
.803
.809
.772
.789
C
.989
1
1
1.000
1
1
Li–Chen
.752
.800
.824
.861
.839
.857
Cai
.234
.163
.146
.136
.104
.084
M
.217
.14
.134
.126
.097
.077
Mp
.270
.133
.092
.122
.034
.051
80
S
.644
.695
.686
.692
.677
.703
Q
.955
.972
.991
.995
.992
.992
C
1
1
1
1.000
1
1
Li–Chen
.941
.980
.992
.994
.996
.998
Cai
.496
.420
.377
.316
.246
.189
M
.468
.417
.367
.305
.229
.175
Mp
.574
.394
.333
.242
.253
.201
100
S
.761
.776
.788
.796
.813
.814
Q
.991
.997
.999
1.000
1.000
1.000
C
1
1
1
1.000
1
1
Li–Chen
.997
1.000
.999
1.000
1.000
1.000
Cai
.700
.652
.557
.508
.423
.406
M
.689
.633
.56
.494
.403
.398
Mp
.700
.649
.601
.487
.375
.374
Power comparison, Simulation Model 3, X
∼N(0,Σ), Σ1≠Σ2
ANALYSIS OF REAL DATASETS
A continuous phenotype example
To illustrate the utility of covariance testing in association with a continuous phenotype, we reanalyzed the well‐known data of van de Vijver et al. (2002), in which gene expression in breast tumours of 295 patients younger than 55 was examined for association with disease‐free survival. To identify biological pathways (gene sets) of interest in the comparisons, groups of genes for 372 KEGG and 8,039 Gene Ontology pathways were identified, so that for each test, p represents the number of genes in the pathway. To identify pathways of greatest interest, we aggregate over all gene pairs in each pathway, using statistics as described below. We used martingale residuals, adjusted for age and sex, and 50 surrogate variables, obtained from the expression data, as a quantitative phenotype y; p‐values for Q were determined by 100,000 permutations for high accuracy, and the remaining statistics used the asymptotic approximations. Of the four statistics proposed, three achieved false discovery q<0.15 for the most significant gene set using Benjamini–Hochberg adjustment. These were Q (GO:0043254 regulation of protein complex assembly, 193 genes, q=0.0066), C (GO:0022408 negative regulation of cell–cell adhesion, 75 genes, q=0.124), and M (GO:1900221 regulation of beta‐amyloid clearance, five genes, q=0.055).To further examine the finding for statistic C, we performed proportional hazards regression using the risk score vector {b
} as a predictor for disease‐free survival while including the covariates described. The resulting Wald statistic for the risk scores was p=8.0×10−15. This striking result for the risk scores would be highly significant by any conceivable multiple test correction, even though the q‐value for C was of borderline significance. To visualize, we divided the risk score vector for C into tertiles, and Figure 5 shows the corresponding Kaplan–Meier curves for disease‐free survival. The result shows that high “connectivity” of the genes in the pathway is associated with reduced disease‐free survival, consistent with observations that loss of cellular adhesion promotes metastasis (Martin, Ye, Sanders, Lane, & Jiang, 2013). Note that this observation would not be apparent from standard gene‐set enrichment approaches using overall expression levels.
Figure 5
Kaplan–Meier curves for disease‐free survival and breast cancer data of van de Vijver et al. (2002; p=8×10−15). The curves correspond to tertiles of the risk scores for statistic C, for pathway GO:0022408 “negative regulation of cell–cell adhesion”
Kaplan–Meier curves for disease‐free survival and breast cancer data of van de Vijver et al. (2002; p=8×10−15). The curves correspond to tertiles of the risk scores for statistic C, for pathway GO:0022408 “negative regulation of cell–cell adhesion”
Discrete phenotype example 1
The second real dataset is gene expression data on kidney transplant tissue (Modena et al., 2016), in which those with acute rejection (y=1,n
1=54) were compared with normal outcomes (y=2,n
2=99). Pathway analyses proceeded similarly as with the previous example, with 7,266 Gene Ontology BP and 402 KEGG pathways examined. The most significant pathways for each statistic are listed in Figure 6. Each heatmap depicts the matrix corresponding to the statistic (e.g., for S, it depicts
). The most significant pathways for each statistic are as follows, with multiple comparison false discovery q‐values: GO:0035754 B cell chemotaxis (five genes, for S, q=1.4×10−6), GO:0070193 synaptonemal complex organization (11 genes, for Q, q=0.03), and GO:0009394 2‐deoxyribonucleotide metabolic process (28 genes, for C, q=4.2×10−18).
Figure 6
Illustration of the four statistics in the kidney transplant data, using gene set analysis of gene expression in those with acute rejection (n
1=54) versus normal (n
2=99). All panels except for Panel (a) were zero‐centred to better illustrate the covariance changes. (a) Heatmap of
for GO:0035754, the most significant pathway for S. (b) Heatmap of
for GO:0070193, the most significant pathway for Q. (c) Heatmap of
for GO:0009394, the most significant pathway for C. (d) Heatmap of M
values for GO:0021889, the most significant pathway for M. The inset shows the covariance in acute rejection versus controls for A
T
F5 and E
R
B
B4
Illustration of the four statistics in the kidney transplant data, using gene set analysis of gene expression in those with acute rejection (n
1=54) versus normal (n
2=99). All panels except for Panel (a) were zero‐centred to better illustrate the covariance changes. (a) Heatmap of
for GO:0035754, the most significant pathway for S. (b) Heatmap of
for GO:0070193, the most significant pathway for Q. (c) Heatmap of
for GO:0009394, the most significant pathway for C. (d) Heatmap of M
values for GO:0021889, the most significant pathway for M. The inset shows the covariance in acute rejection versus controls for A
T
F5 and E
R
B
B4As another illustration for this dataset, we show the results for the most significant M statistic for GO:0021889 olfactory bulb interneuron differentiation (Figure 6d, 13 genes, p value = .0005, q n.s.). To best illustrate the changes in correlation rather than the changes in variance, for this statistic, we row‐scaled the data to have variance 1 for each gene. The gene pair {ATF5, ERRB4} shows the most significant change, with a high negative correlation in the AR group and little correlation in the normal group. ATF5 has been associated with transplant rejection in multiple organ systems (Morgun et al., 2006). There is little literature on ERBB4 and transplant rejection, but the gene has been associated with kidney nephropathy (Sandholm et al., 2012) and thought to be protective of polycystic kidney disease in a mouse model (Zeng, Miyazawa, Kloepfer, & Harris, 2014).
Discrete phenotype example 2
The third real dataset is a targeted reanalysis of a brain expression dataset compiled by Fulcher and Fornito (2016). The authors had used previous mouse brain connectome findings to classify each of 213 brain regions as “hub” and “nonhub” regions using imaging‐based connectivity, with expression data obtained for each region. A primary biological finding in their paper was that hub regions involved coexpression of genes involved in energy metabolism. However, to test this hypothesis, the authors needed to use an indirect means, creating a threshold‐based connectivity score for each gene and testing for enrichment using methods that do not acknowledge gene–gene correlations. We reasoned that a gene‐set approach using S and C might be able to obtain a similar finding directly. Defining y=1 for hub regions and y=0 for nonhub regions, and considering each region to be a sample, we performed testing for each of 5,944 Gene Ontology pathways containing at least five genes. For S, the most significant pathways associated with an increase of covariance for hub regions were GO:0005746 (q=1.09×10−4, five genes, “mitochondrial respiratory chain”) and GO:0045039 (q=1.37×10−4, five genes, “protein import into mitochondrial inner membrane”). Similarly, for C, for the same GO category, we obtained q=9.85×10−4. These pathways clearly involve energy metabolism, whereas the top pathways showing increased covariance in nonhub regions were not involved in energy metabolism. For example, in the nonhub regions, for S, we obtained GO:0015677 (q=4.08×10−6, seven genes, “copper ion import”) and GO:0019218 (q=4.75×10−4, five genes, “regulation of steroid metabolic process”).
DISCUSSION
We have proposed four covariance test statistics in a straightforward trend‐testing framework that applies to general y. The approach is not limited by p, n, or whether y is discrete or continuous. For most of the statistics, a natural risk score is an output. The availability of testing for a continuous y is a distinct advantage over previous methods, making covariance testing a simple approach that can be applied in a huge variety of settings. We propose that the approach can be part of a standard testing toolkit and used to evaluate, for example, pathway associations, in high‐throughput data or the statistical significance of network discoveries.
DATA AVAILABILITY STATEMENT
The three datasets described in Section 5 are in https://sites.google.com/ncsu.edu/zhouslab/home/software?authuser=0.
SOFTWARE
The accompanying software CorDiff is available on R CRAN (https://cran.r-project.org/web/packages/CorDiff/index.html).Supporting info itemClick here for additional data file.
Authors: Niina Sandholm; Rany M Salem; Amy Jayne McKnight; Eoin P Brennan; Carol Forsblom; Tamara Isakova; Gareth J McKay; Winfred W Williams; Denise M Sadlier; Ville-Petteri Mäkinen; Elizabeth J Swan; Cameron Palmer; Andrew P Boright; Emma Ahlqvist; Harshal A Deshmukh; Benjamin J Keller; Huateng Huang; Aila J Ahola; Emma Fagerholm; Daniel Gordin; Valma Harjutsalo; Bing He; Outi Heikkilä; Kustaa Hietala; Janne Kytö; Päivi Lahermo; Markku Lehto; Raija Lithovius; Anne-May Osterholm; Maija Parkkonen; Janne Pitkäniemi; Milla Rosengård-Bärlund; Markku Saraheimo; Cinzia Sarti; Jenny Söderlund; Aino Soro-Paavonen; Anna Syreeni; Lena M Thorn; Heikki Tikkanen; Nina Tolonen; Karl Tryggvason; Jaakko Tuomilehto; Johan Wadén; Geoffrey V Gill; Sarah Prior; Candace Guiducci; Daniel B Mirel; Andrew Taylor; S Mohsen Hosseini; Hans-Henrik Parving; Peter Rossing; Lise Tarnow; Claes Ladenvall; François Alhenc-Gelas; Pierre Lefebvre; Vincent Rigalleau; Ronan Roussel; David-Alexandre Tregouet; Anna Maestroni; Silvia Maestroni; Henrik Falhammar; Tianwei Gu; Anna Möllsten; Danut Cimponeriu; Mihai Ioana; Maria Mota; Eugen Mota; Cristian Serafinceanu; Monica Stavarachi; Robert L Hanson; Robert G Nelson; Matthias Kretzler; Helen M Colhoun; Nicolae Mircea Panduru; Harvest F Gu; Kerstin Brismar; Gianpaolo Zerbini; Samy Hadjadj; Michel Marre; Leif Groop; Maria Lajer; Shelley B Bull; Daryl Waggott; Andrew D Paterson; David A Savage; Stephen C Bain; Finian Martin; Joel N Hirschhorn; Catherine Godson; Jose C Florez; Per-Henrik Groop; Alexander P Maxwell Journal: PLoS Genet Date: 2012-09-20 Impact factor: 5.917