J. R. Deller1, Hayder Radha1, J. Justin McCormick2, Huiyan Wang3. 1. Department of Electrical and Computer Engineering, Michigan State University, 2120 EB, East Lansing, MI 48824, USA. 2. Carcinogenesis Laboratory, Department of Molecular Biology and Biochemistry, Michigan State University, 341 FST, East Lansing, MI 48824, USA. 3. College of Computer Science and Information Engineering, Zhejiang Gongshang University, 18 Xuezheng Street, Hangzhou, Zhejiang Province 310018, China.
Abstract
Microarray data are used to determine which genes are active in response to a changing cell environment. Genes are "discovered" when they are significantly differentially expressed in the microarray data collected under the differing conditions. In one prevalent approach, all genes are assumed to satisfy a null hypothesis, ℍ 0, of no difference in expression. A false discovery (type 1 error) occurs when ℍ 0 is incorrectly rejected. The quality of a detection algorithm is assessed by estimating its number of false discoveries, 𝔉. Work involving the second-moment modeling of the z-value histogram (representing gene expression differentials) has shown significantly deleterious effects of intergene expression correlation on the estimate of 𝔉. This paper suggests that nonlinear dependencies could likewise be important. With an applied emphasis, this paper extends the "moment framework" by including third-moment skewness corrections in an estimator of 𝔉. This estimator combines observed correlation (corrected for sampling fluctuations) with the information from easily identifiable null cases. Nonlinear-dependence modeling reduces the estimation error relative to that of linear estimation. Third-moment calculations involve empirical densities of 3 × 3 covariance matrices estimated using very few samples. The principle of entropy maximization is employed to connect estimated moments to 𝔉 inference. Model results are tested with BRCA and HIV data sets and with carefully constructed simulations.
1. Introduction

This work is motivated by analytical challenges that arise in the use of microarray data to discover genes that are differentially expressed across experimental conditions such as control and treatment. Although the discussion centers around this genomics task, the developed methods are quite general and should be useful in other multiple-testing applications in which there is substantial dependence among test measures, and in which a small sample size may cause significant fluctuations in statistics employed in the testing. The specific aim of this work is to develop a reliable estimator of the number of false discoveries (type I errors, denoted 𝔉) in a multiple-testing problem in such an adverse setting.

The classic and contemporary literature in cell biology, and the more recent literature in genomics (both in print and posted on the Internet), is replete with tutorial information at all levels about cell anatomy and physiology, and genomics, as well as the microarray technology deployed in the present application. A good entry point for accessing information about contemporary developments in the genomics field is the web site of the US National Genome Research Institute [1]. The papers by Page et al. [2] and Wang [3] provide relatively current reviews of microarray technology and methods. A brief description of the biological aspects of the genomics application underlying this work is found in Appendix A of this paper.

In the gene-discovery application, each gene is tested against a null hypothesis, ℍ0, that the gene is not differentially expressed across experimental conditions. All genes are initially assumed to satisfy ℍ0 in this analysis, and 𝔉 is estimated conservatively. This "all-null" presumption is consistent with this application in which ℍ0 is true for a vast majority of the genes in any experiment. Beyond the gene detection problem, however, this presumption is realistic in many practical applications of large-scale testing in which the prior probability of null cases, say π0, is large, and in which the goal is to identify a small set of interesting "nonnull" cases [4]. With π0 ≈ 1, it is also possible to impose identifiability (the strongly justified assumption that a given gene satisfies ℍ0) on some of the genes, yielding crucial information with which to condition the estimation of 𝔉 [5].

One of the earliest reports of research using the microarray (or "gene chip") appeared in a paper by Schena et al. in Science in 1995 [6]. Generally speaking, research that employs the microarray to analyze gene expression data has one (or both) of the following underlying aims: the discovery of gene coexpression, or the discovery of gene differential expression. To the extent that these problems have been investigated separately, the coexpression problem has frequently been addressed by clustering methods (e.g., [7-13]), whereas differential expression has been studied using variations of classical statistical hypothesis testing (e.g., [5, 14]). Whereas differential-expression/hypothesis-testing research was, and is, concerned with expression in response to differing cell conditions (normal versus pathology, medical treatment versus control, etc.), the early coexpression/clustering research was often focused on phenotypic manifestations of the gene expression.

As the technology has matured, the dichotomy suggested above has blurred, with many current applications of the microarray involving "hybrid" research questions into both differential expression and coexpression. Application areas include discovery and exploration of gene regulatory systems, tissue and tumor classification, biomarker prediction, and discovery and reverse engineering of gene expression networks, not to mention the microarray's deployment in the study of protein synthesis, metabolism, evolution, and other areas related to cell biology. Technical approaches to these problems have gone well beyond classical clustering and hypothesis-testing methods.
Indeed, in the past few years, statistical (i.e., "nonclustering") methods to address the coexpression problem have been reported (e.g., [15, 16]), while the hybrid of the two problems—that of detecting and analyzing differentially coexpressed genes—has been researched using an ever-increasing number of methods including clustering with complex and dynamic feature selection, image transformation and processing of expression data, biclustering, graph and network theory, hypothesis testing, and other statistical approaches (e.g., [17-28]). In this paper, we return to the focused problem of detecting differential expression across treatment conditions, but it will become clear—as it has to the community working on this problem—that differential expression cannot be studied independently of coexpression.

Early work on classical statistical techniques for microarray-based gene discovery is summarized in the 2002 review paper by Pan [29]. Initially, it was customary to treat gene expression outcomes as realizations of independent random variables (RVs). More recent papers, however—notably, those of Owen [30], Efron [31], and Pawitan et al. [32]—caution researchers about the detrimental effects of correlated gene expressions on the validity of "discovered" genes. In particular, it was reported that highly correlated tests increase the variance of 𝔉 (or of its normalized counterpart, the false discovery rate, 𝔉/G∗, where G∗ is the number of "discovered" genes), thus making estimates of 𝔉 less reliable; such high variance renders the average an unreliable estimator of 𝔉 [30]. Among many causes, intergene correlation is attributable to coexpressed genes [4] and to unmodeled factors that introduce systematic effects across genes [33, 34]. As a result, for most real data, the assumption of independence or weak dependence among gene expressions is unfounded, and methods treating correlation are necessary [35, 36].

Accordingly, there has been significant recent interest in improving statistical gene detection methods in light of this detrimental correlation. For example, Storey et al. [37] present an approach to the notion of sharing information across t scores, which they describe as "borrowing strength across the tests" for a potential increase in statistical power. Tibshirani and Wasserman [38] discuss a quantity called the "correlation-shared" t-statistic and derive theoretical bounds on its performance. Hu et al. [39] examine the covariance structure of the expression data and discover benefits of linking coexpression and differential expression in a distance measure—thus, moving toward the "hybrid" problem described above.

Recent research into the hybrid differential coexpression problem has also yielded results and methods that could ultimately benefit the differential expression problem. Because the differential coexpression research is often concerned with differing phenotypes, rather than with different treatment conditions, two given research efforts involving differential coexpression might seek answers to different sets of genetic questions through expression data. Like the "dual conditions" researchers, however, the "phenotype" researchers have encountered their own forms of confounding dependencies, notably the relative gene locations, the expression time sequencing, and phase information (e.g., [40-42]).
Papers have been published addressing these issues, including the exposition of new statistical approaches—for example, "CorScor" developed by Dettling et al. [21], the "ECF-statistic" of Lai et al. [22], and the "gene-set coexpression analysis" of Choi and Kendziorski [15]—as well as new clustering methods—for example, the web-based expression analyzer of Xiang et al. [43], the high-order preclustering method of Wong et al. [44], and the "BioSym" distance measure of Bandyopadhyay and Bhattacharyya [45]. A recent review of clustering methods in genomics appears in the paper by Dalton et al. [46]. A more general examination of the performance of classifiers of microarray expressions appears in the paper by Ancona et al. [47].

The present paper returns to the problem of gene discovery by statistical hypothesis testing, but with concern for the effects of nonlinear dependencies on (the estimation of) the number of false results. Empirical work below provides cogent evidence that accounting for intergene correlation alone does not sufficiently mitigate the adverse effects of dependency. Recent work by Hu et al. [48] has shown the importance of accounting for nonlinear dependence in imputing missing values in microarray data. Modeling nonlinear dependencies is a challenging problem, and the present work makes only a modest—nevertheless, empirically significant—step into the realm of nonlinear dependence by modeling the third-moment characteristics of the quantity 𝔉. In principle, the proposed extension admits moments of any order, but computational constraints limit the present developments. However, even the single step to a third-moment extension under severe sampling fluctuations is very challenging, and, in spite of this modest modeling enhancement, it is a hard-won extension yielding significantly improved estimates for a range of real and simulated examples (see Section 5).

Thus, a central finding of this work is that null statistic histogram approaches can be improved by including third-moment skewness corrections. Advancing the techniques to model higher-order dependencies is challenging, but the effort could have a substantial payoff. Errors in gene detection are expensive in financial terms, but the derailing of biomedical research resulting from a false gene discovery could be profoundly costly in many ways. Even modest improvements in genomic techniques are potentially very significant.
2. Notation and Terminology
Because this paper has a practical aim, we will assume, without comment, "friendly" mathematical conditions such as existence of distributions, measurability, and sure convergence of integrals. Even so, the mathematical developments in this paper necessarily involve extensive notation, and we strive for consistency and clarity in its use. Quantities are generally formulated as RVs unless stated otherwise. This excludes obviously deterministic quantities like sequence indices, integers defined in the abstract (e.g., the number of microarrays, "M"), and statistical expectations. Precise formulations require that probability distributions ordinarily be denoted formally as, for example, p_{v_1,v_2}(ξ_1, ξ_2) for the joint distribution of RVs v_1 and v_2, but the more common abusive notation "p(v_1, v_2)" is more expedient in a few cases. The notation p(·) may denote either a discrete or continuous (i.e., density) distribution, depending, of course, on the RV(s) being modeled. The meaning should be clear in context. We deliberately allow this ambiguity because it avoids some notational awkwardness as discrete distributions are fitted with densities. On the other hand, the notation ℙ(A) is used to denote a probability assignment to a measurable event A.

Many developments in this paper are centered on second-order statistical concepts. It is important to carefully define terminology used in this regard, since the vocabulary has nuanced differences across disciplines. The elementary notation for scalar RVs in Table 1 is standard and is used conventionally in this paper. A caveat arises in the discussion of related matrices, however. The term "correlation matrix" is used in this paper in a way consistent with its use in many statistical developments, but not in a way that is universal across disciplines. The following definitions are used throughout this paper.
Table 1
Notation used for elementary scalar quantities. v, v_1, and v_2 are RVs and ℰ denotes the expectation.

Mean, average: μ_v or μ(v) = ℰ{v}
Standard deviation: σ_v ≝ √(ℰ{(v − μ_v)²}) = √φ(v, v)
Covariance: φ(v_1, v_2) ≝ ℰ{(v_1 − μ_{v_1})(v_2 − μ_{v_2})}
Variance: σ_v² ≝ ℰ{(v − μ_v)²} = φ(v, v)
Correlation: ℰ{v_1 v_2} = φ(v_1, v_2) + μ_{v_1} μ_{v_2} (no special symbol reserved)
Correlation coefficient: ρ(v_1, v_2) ≝ φ(v_1, v_2)/(σ_{v_1} σ_{v_2}) (normalized covariance)
Definitions 1.
Consider a K-dimensional random vector v = [v_1 ⋯ v_K]ᵀ with mean vector μ ≝ ℰ{v}. Then, the covariance matrix associated with v is defined as

Σ ≝ ℰ{(v − μ)(v − μ)ᵀ},    (1)

in which the (i, j) element is φ(v_i, v_j). The correlation matrix of v is

R ≝ S⁻¹ Σ S⁻¹,    (2)

in which the (i, j) element is ρ(v_i, v_j) and S is a diagonal matrix with (i, i) element σ_{v_i}, the standard deviation of the ith RV, v_i.

In this paper, the term "correlation matrix" will refer to the definition in (2). On the contrary, in much of the engineering literature, the outer product ℰ{v vᵀ} = Σ + μ μᵀ is called the (spatial) correlation matrix. In this case the (i, j) element of the matrix is the scalar correlation ℰ{v_i v_j}. In our definition the elements are correlation coefficients, which are, in fact, normalized covariances. One significant implication of this fact is that mean values of the RVs have no effect on either matrix. This should be kept in mind in the developments to follow.
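As a quick numerical check of Definitions 1 (our own illustrative sketch; the data and variable names are hypothetical, not from the paper), the correlation matrix of definition (2) is obtained from a sample covariance matrix by scaling with the inverse diagonal standard-deviation matrix S⁻¹, and the mean-invariance noted above is easy to verify:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 200))            # 4 RVs, 200 realizations each

Sigma = np.cov(v)                            # covariance matrix Sigma
S_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
R = S_inv @ Sigma @ S_inv                    # correlation matrix R = S^-1 Sigma S^-1, per (2)

assert np.allclose(np.diag(R), 1.0)          # unit diagonal
assert np.allclose(R, R.T)                   # symmetry
# Mean values have no effect: shifting every realization leaves Sigma unchanged.
Sigma_shift = np.cov(v + 5.0)
assert np.allclose(Sigma, Sigma_shift)
```

The same R could be obtained from `np.corrcoef(v)`; the explicit S⁻¹ Σ S⁻¹ form is shown to mirror the definition.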
3. Problem Formulation
G genes are to be studied using M microarray experiments. The expression values are recorded in a G × M matrix, X = [x_{gm}]. For analytical purposes, the expression quantities x_{gm} are generally RVs. Each of the M microarray experiments takes place under one of two conditions (indexed by k = 1 or 2) such as control and treatment. These two subsets of the data are called treatment groups in the paper. There are M_k samples (i.e., microarrays) in treatment group k. Based on the evidence in X, we seek to identify a "small" number, G∗ ≪ G, of genes that are significantly differentially expressed across the two treatment groups. One widely used strategy (e.g., [5, 14]) is to posit that each of the genes, for g = 1, 2,…, G, satisfies a null hypothesis,

ℍ0: gene g is not differentially expressed across the two treatment groups.    (3)

All G genes are tested against this hypothesis using two-sample null statistics z_1, z_2,…, z_G [14]. The magnitudes of the z scores establish a gene ranking, and the G∗ genes with the largest scores are reported as statistically significant discoveries.

Clearly, the list of G∗ discovered genes is only meaningful to the extent that 𝔉 is very small. Of course, 𝔉 can only be estimated since the state of any gene (i.e., whether or not it should be "discovered") is unknown. Strong causal relationships among genes give rise to highly correlated z scores and greatly complicate the estimation of 𝔉 [30, 31]. Moreover, in spite of their declining cost, microarrays are still a relatively expensive technology. Consequently, the number of microarrays, M, in an experiment is usually smaller than the number of genes, G, by as much as four orders of magnitude. Typically, existing microarrays record expression data for at least a few thousand genes. The fact that M ≪ G further complicates the problem because the knowledge about the underlying gene-gene correlation structure is critically sparse in the observations. At the same time, the consequences of correlation on differential analysis cannot be overlooked [35]. In fact, the present work will suggest that even nonlinear dependencies must be accounted for in order to properly estimate 𝔉. Theoretical justifications for this contention are given momentarily.

This paper develops a moment-based estimator of 𝔉 by giving the null z histogram a stochastic interpretation. The observed null counts are viewed as realizations of a more fundamental random model shaped by inter-z dependence. A small zero-symmetric bin in the space of z statistics is designated as the center area, and it is posited that no z scores from "nonnull" genes fall in this range. Then, the null count in the center area, say C, is observable, and by conditioning 𝔉 on C, the variance of 𝔉 can be reduced significantly [5, 49]. We relate 𝔉 to C through the discrete joint distribution p(𝔉, C). To obtain an approximation of p(𝔉, C), we estimate its first three moments, then fit the maximum entropy function (Appendix B). This approach inherently yields an estimate of the conditional distribution p(𝔉∣C). An estimate of the distribution of 𝔉 is theoretically more useful than a point estimate because of the noisy nature of large-scale inferences [30]. Compared to histogram-curve-fitting techniques like the empirical null [5], however, the present approach enjoys the attractive feature that covariance is separately estimated, and then explicitly incorporated into the inference.

Efron [31] reports that the RVs 𝔉 and C are found to be extremely negatively correlated in a number of real experiments. He provides an explanation for this finding, then employs these insights to develop a Poisson-model-based second-order estimator of 𝔉 which, like the present approach, relies on the center area concept. While Efron's work is extremely important, his own research has gone on to show that purely second-order 𝔉 estimates suffer from over- and under-estimation effects. The second-moment estimates of 𝔉 reported later in the present paper (see Section 5), as well as those in the cited Efron paper, all show these adverse effects. There are three contributory factors: (i) 𝔉 is bounded below by zero, (ii) the mean of 𝔉 is small, and (iii) intergene covariance causes the variance of 𝔉 to inflate. All of these factors suggest that skewness corrections—reflecting nonlinear dependence—are vital.
4. Methods
4.1. Moments of the Joint Distribution p(𝔉, C)
4.1.1. Count Models
The process begins by transforming test t statistics to z values as z_g = P_N⁻¹{P_0(t_g)}, g = 1,…, G, where P_0 is the putative null cumulative distribution function (c.d.f.) of the test statistic, and P_N⁻¹ is the inverse c.d.f. of the unit normal density, p_N ≡ 𝒢(0, 1). The z values, modeled as RVs, provide the analytical convenience of multivariate normal form while describing the joint t-statistic behavior. We formally define the fundamental quantities

C ≝ #{g : z_g ∈ 𝒵_C},  𝔉 ≝ #{g : z_g ∈ 𝒵_F},    (4)

in which #{𝒮} denotes the number of elements in the discrete set 𝒮, and 𝒵 ⊆ ℝ is the sample space of z values. The interval 𝒵_C ≝ {z ∈ 𝒵 : |z| < c} corresponding to count C is called the center area, and the semi-infinite interval 𝒵_F ≝ {z ∈ 𝒵 : z ≤ δ} associated with count 𝔉 is the left tail area. For proper comparison with Efron's results [31], we work with a left tail area; however, the present approach can employ right- or double-sided tail areas equally well.

The premise that very few nonnull z scores fall in 𝒵_C and, hence, that C is practically observable is of prime importance. A similar assumption plays a central role in the literature on estimating the proportion of null genes, as in, for example, papers by Efron [31], Pawitan et al. [32], and Langaas et al. [50]. The empirical null approach [5] relies on similar reasoning. We exploit the observability of C to: (i) estimate the moments of p(𝔉, C), (ii) use them to infer an estimate of the distribution p(𝔉, C), and then (iii) report (an estimated) p(𝔉∣C) which in turn could be used to find an estimator of 𝔉 conditioned upon C. Initially, all cases are treated as null. Improvement is possible by estimating π0 [50, 51].
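To make the counts concrete, C and the tail count are simple tallies over a vector of z values. The following sketch is our own illustration (the thresholds c and δ are hypothetical choices, not values from the paper):

```python
import numpy as np

def center_and_tail_counts(z, c=0.5, delta=-2.5):
    """Tally the center-area count C (|z| < c) and the left-tail count
    (z <= delta) over a vector of z scores."""
    z = np.asarray(z)
    C = int(np.sum(np.abs(z) < c))   # center area: practically observable null count
    F = int(np.sum(z <= delta))      # left tail area: candidate discoveries
    return C, F

z = np.array([-3.1, -0.2, 0.1, 0.4, 1.9, -2.6, 0.05])
C, F = center_and_tail_counts(z)
# C = 4 (scores -0.2, 0.1, 0.4, 0.05); F = 2 (scores -3.1, -2.6)
```

Under the all-null presumption, the tail tally plays the role of 𝔉; with nonnull genes present, it counts discoveries, of which 𝔉 are false.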
4.1.2. Assumptions
The following assumptions underlie these developments:
(1) π0 is large, say π0 ≥ 0.9 (Efron discusses this bound in [4]).
(2) z_g is a unit normal variate [~𝒢(0, 1)] for all g = 1, 2,…, G.
(3) The z scores are jointly Gaussian to the third order (but not uncorrelated).
(4) Recall that x_{gm} [element (g, m) of the expression matrix X] denotes (the RV model for) the expression of gene g on microarray m. Let x_g denote the marginal RV for x_{gm}, that is, the model for the expression outcomes of gene g. The realizations of x_g are the elements of row g of an observed X. Then, it is assumed that φ(z_g, z_{g′}) = ρ(z_g, z_{g′}) ≈ ρ(x_g, x_{g′}) for all g, g′ (justified below).
4.1.3. The z-Value Histogram
It is convenient and computationally efficient to obtain the moments of p(𝔉, C) using the moments of the z-value histogram. We seek central moments because they facilitate working with the maximum entropy distribution (Appendix B). The moment estimation is carried out as follows. 𝒵 is partitioned into B disjoint bins, 𝒵 = ∪_{b=1}^{B} 𝒵_b, where the bth bin has center z_b and width Δ (constant with b). Then, the z-histogram bin counts are

Y_b ≝ Σ_{g=1}^{G} I_b(z_g),  b = 1, 2,…, B,    (5)

in which I_b(z_g) is the indicator function for the event "score z_g falls in bin 𝒵_b."

Consider bin b with count Y_b for some b ∈ {1,…, B}. The mean of count Y_b is

μ(Y_b) = Σ_{g=1}^{G} ℙ(z_g ∈ 𝒵_b) ≈ G Δ p_N(z_b).    (6)

The second-order joint central moment (covariance) of the pair (Y_b, Y_{b′}), where b may equal b′, is

φ(Y_b, Y_{b′}) = Σ_{g=1}^{G} Σ_{g′=1}^{G} [ℙ(z_g ∈ 𝒵_b, z_{g′} ∈ 𝒵_{b′}) − ℙ(z_g ∈ 𝒵_b) ℙ(z_{g′} ∈ 𝒵_{b′})].    (7)

Because of the bivariate normality of z_g and z_{g′}, (7) can be approximated by

φ(Y_b, Y_{b′}) ≈ δ_{bb′} G Δ p_N(z_b) − G Δ² p_N(z_b) p_N(z_{b′}) + Δ² Σ_{g≠g′} [p_2(z_b, z_{b′}; 0, Σ_{gg′}) − p_N(z_b) p_N(z_{b′})],    (8)

where δ_{bb′} is the Kronecker delta, and p_2(ζ_1, ζ_2; 0, Σ_{gg′}) ≡ 𝒢(0, Σ_{gg′}) is the bivariate Gaussian density with mean vector 0 = [0 0]ᵀ, and covariance matrix (equivalent to the correlation matrix in this case)

Σ_{gg′} ≝ [[1, ρ(z_g, z_{g′})], [ρ(z_g, z_{g′}), 1]].    (9)

That is, defining the vector of arguments ζ ≝ [ζ_1 ζ_2]ᵀ,

p_2(ζ; 0, Σ_{gg′}) = (2π)⁻¹ |Σ_{gg′}|^{−1/2} exp{−½ ζᵀ Σ_{gg′}⁻¹ ζ}
                  = [2π √(1 − ρ²(z_g, z_{g′}))]⁻¹ exp{−[ζ_1² − 2ρ(z_g, z_{g′}) ζ_1 ζ_2 + ζ_2²] / [2(1 − ρ²(z_g, z_{g′}))]},    (10)

where |·| denotes the determinant. In the approximation (8), the density in (10) is evaluated at ζ_1 = z_b and ζ_2 = z_{b′}.

As reflected in the second line of (10), the covariance matrix Σ_{gg′} is fully specified by a single scalar parameter, ρ(z_g, z_{g′}), for each g, g′ pair. Thus, we can express the Gaussian density in (10) as being parameterized by this correlation coefficient for the given z-score pair, p_2[ζ; 0, ρ(z_g, z_{g′})]. Now, suppose that we can derive an empirical density [52], say q_z(·), over the interval [−1, 1], fitted to the discrete set of correlation coefficients, {ρ(z_g, z_{g′}), 1 ≤ g, g′ ≤ G}. This empirical distribution allows the summation over gene indices in (8) to be replaced by a continuous computation. This smoothed computation represents a further approximation of φ(Y_b, Y_{b′}), which is stated in the form of a lemma below. This result is similar to those of Owen [30, Theorem 1] and Efron [31, Lemma 2].
Lemma 1.
Let q_z(ξ) denote the empirical density derived from the z-pair sample correlation coefficients, ρ(z_g, z_{g′}), 1 ≤ g, g′ ≤ G. Then, the second joint central moment of a histogram count pair (Y_b, Y_{b′}), where b may equal b′, is approximated by

φ(Y_b, Y_{b′}) ≈ δ_{bb′} G Δ p_N(z_b) − G Δ² p_N(z_b) p_N(z_{b′}) + [G!]_2 Δ² ∫_{−1}^{1} [p_2(z_b, z_{b′}; 0, Σ_ξ) − p_N(z_b) p_N(z_{b′})] q_z(ξ) dξ,    (11)

where [G!]_ℓ ≝ G(G − 1) ⋯ (G − [ℓ − 1]), for 1 ≤ ℓ ≤ G, and Σ_ξ denotes the 2 × 2 correlation matrix (9) with off-diagonal element ξ. Further, the third joint central moment of a triplet (Y_b, Y_{b′}, Y_{b″}), where two or more indices may be equal, is given by the analogous triple sum over gene indices of the joint trivariate bin probabilities, centered by the corresponding products of lower-order probabilities, in which δ_{bb′b″} is the Kronecker sequence over ℤ × ℤ × ℤ.

The assumed trivariate normality of the z scores implies that the joint distribution for each score triplet is specified by a 3 × 3 covariance matrix. Let us denote the covariance (equivalent to correlation) matrix for the z-value triplet (z_g, z_{g′}, z_{g″}) by

R_{gg′g″} ≝ [[1, ρ(z_g, z_{g′}), ρ(z_g, z_{g″})], [ρ(z_{g′}, z_g), 1, ρ(z_{g′}, z_{g″})], [ρ(z_{g″}, z_g), ρ(z_{g″}, z_{g′}), 1]].    (14)

For each z-score triplet, R_{gg′g″} is an element of the space—call it ℛ_3—of all symmetric positive-semidefinite matrices in ℝ^{3×3} with element magnitudes no greater than unity. R_{gg′g″} is continuously distributed over ℛ_3.

Again, we need an empirical way to compute the joint moments of the z scores. Let q_{Rz}(Ξ) be the empirical density of the correlation matrices, R_{gg′g″}. This density must be inferred from the observed data. Of practical importance is the fact that, although each R_{gg′g″} is distributed over a subspace of ℝ^{3×3}, the domain of q_{Rz} is effectively a three-dimensional manifold of that space because each covariance matrix is unambiguously determined by its three values above or below its main diagonal (with unity diagonal elements). The argument Ξ may be thought of as a vector of these three elements (over the continuum of allowable values), but we will continue to denote it as a matrix as a reminder of its association with R_{gg′g″}. We have, in conjunction with (6)–(14), the following useful approximation.
Lemma 2.
The third-order joint central moment of a histogram count triplet (Y_b, Y_{b′}, Y_{b″}), where two or more indices may be equal, is approximated by replacing the triple summation over gene indices with integration against the empirical density q_{Rz}(Ξ) over the manifold of 3 × 3 correlation matrices, with the trivariate Gaussian density evaluated at the bin centers (z_b, z_{b′}, z_{b″}); the factors [G!]_ℓ are defined in (11), and Δ is the z-histogram bin width.

To obtain the moments of p(𝔉, C), it is simply necessary to combine the moments of the corresponding Y_b counts. For example,

φ(𝔉, C) = Σ_{b : 𝒵_b ⊆ 𝒵_F} Σ_{b′ : 𝒵_{b′} ⊆ 𝒵_C} φ(Y_b, Y_{b′}).

The key quantities in the lemmas for approximating moments are the empirical covariance densities. Obtaining these densities in the presence of severe sampling errors is discussed next.
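Because 𝔉 and C are sums of disjoint sets of bin counts, their joint moments follow from the per-bin moments by summation over the corresponding index sets. A minimal sketch (our own; `mu_Y` and `cov_Y` stand for the outputs of Lemma 1, and the function name is hypothetical):

```python
import numpy as np

def joint_count_moments(mu_Y, cov_Y, tail_bins, center_bins):
    """Combine per-bin moments into moments of the pair (F, C).

    mu_Y:        length-B vector of bin-count means.
    cov_Y:       B x B matrix of bin-count covariances.
    tail_bins:   indices of the bins composing the left tail area.
    center_bins: indices of the bins composing the center area.
    """
    t = np.asarray(tail_bins)
    c = np.asarray(center_bins)
    mu_F = mu_Y[t].sum()
    mu_C = mu_Y[c].sum()
    var_F = cov_Y[np.ix_(t, t)].sum()
    var_C = cov_Y[np.ix_(c, c)].sum()
    cov_FC = cov_Y[np.ix_(t, c)].sum()   # cross-covariance phi(F, C)
    return mu_F, mu_C, var_F, var_C, cov_FC

mu_Y = np.array([1.0, 2.0, 3.0, 4.0])
cov_Y = np.eye(4)                        # uncorrelated bins, for illustration
mu_F, mu_C, var_F, var_C, cov_FC = joint_count_moments(mu_Y, cov_Y, [0], [2, 3])
# mu_F = 1.0, mu_C = 7.0, var_F = 1.0, var_C = 2.0, cov_FC = 0.0
```

The third joint moments combine in the same way, with a triple sum over the two index sets.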
4.2. Empirical Correlation Densities
4.2.1. Approach
Because of severe sampling fluctuations, the current methods can recover only q_z(ξ) from the data. This density is then used to estimate q_{Rz}(Ξ). For this purpose, as well as to facilitate the calculations of Lemmas 1 and 2, we seek to parameterize the requisite densities. For most real examples, a single omnibus parameter α is found to be sufficient.
4.2.2. Data Normalization
For all-null false discovery rate calculations, normalization of the per-microarray expression results has been found to be beneficial [31, Remark E]. The columns of the data matrix X are standardized to mean zero and unity variance. This standardization normalizes output "brightness" among microarrays [53, 54]. It also forces the sum of covariances, and approximately the sum of correlations, to be zero. This permits the fitting of a zero-symmetric density to q_x(ξ), which, in turn, has profound consequences for the form of q_{Rx}(Ξ).

Formally, let X_r denote the residual expression matrix, obtained by subtracting from X each gene's average response within each treatment group, and let x_g henceforth denote the marginal random variable modeling the residual expression outcomes for gene g [like x_g of Assumption (4)]. All further discussion of gene expression values will refer to these normalized residual values.
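The normalization and residual steps of this subsection can be sketched as follows (our own illustration; the function and variable names are ours):

```python
import numpy as np

def normalize_and_residualize(X, groups):
    """Standardize the columns of X to zero mean and unit variance, then
    form residuals by removing each gene's within-group average.

    X:      G x M expression matrix.
    groups: length-M array of treatment-group labels (e.g., 1 or 2).
    """
    # per-microarray (column) standardization: normalizes "brightness"
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    Xr = Xn.copy()
    for k in np.unique(groups):
        cols = (groups == k)
        # subtract the gene's average response within treatment group k
        Xr[:, cols] -= Xr[:, cols].mean(axis=1, keepdims=True)
    return Xn, Xr

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8)) * 3.0 + 1.0      # 50 genes, 8 arrays (toy data)
groups = np.array([1, 1, 1, 1, 2, 2, 2, 2])
Xn, Xr = normalize_and_residualize(X, groups)
```

Row (gene) correlations for the densities below would then be computed from `Xr`.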
4.2.3. Obtaining q_z(ξ)
The empirical densities q_z and q_{Rz}, as well as others to be introduced below, clearly play a key role in the moment estimation above. In each case, the empirical density—a surrogate for the true statistical density of the correlation coefficient(s) being modeled—is a distribution of a correlation function or matrix over a continuum, but it must be inferred from the data samples.

To deduce q_z(ξ), we require q_x(ξ)—the empirical density of the gene expression correlation coefficients. The mapping between the domains of q_z and q_x is needed, in principle, to calibrate q_z. However, for the usual two-sample t-statistic, assuming independent columns in X_r, ρ(z_g, z_{g′}) ≈ ρ(x_g, x_{g′}) [recall Assumption (4)]; hence, q_z(ξ) ≈ q_x(ξ). We make the assumption of equality of these densities below.

Let φ̂_{gg′} denote the sample covariance between rows (genes) g and g′ of X_r. For convenience, we define the notation

ρ̂_{gg′} ≝ φ̂_{gg′} / √(φ̂_{gg} φ̂_{g′g′}),  1 ≤ g, g′ ≤ G.    (19)

These are the values to be fit with density q_x(ξ).

To reduce the variability added by sampling errors, we apply the Fisher transform

ŵ_{gg′} ≝ tanh⁻¹(ρ̂_{gg′}) = ½ ln[(1 + ρ̂_{gg′}) / (1 − ρ̂_{gg′})].    (20)

For bivariate normal samples, the Fisher transform has remarkable normalizing and variance-stabilizing properties [55], and each ŵ_{gg′} is well modeled by the distribution 𝒢(τ_{gg′}, [M − 3]⁻¹), where τ_{gg′} is the Fisher-transformed underlying correlation coefficient. Assuming the sampling model

ŵ_{gg′} = τ_{gg′} + ɛ_{gg′},    (21)

where p_τ is the distribution of the Fisher-transformed underlying correlation coefficients, we can interpret the histogram of Fisher-transformed sample correlations, say the "ŵ-histogram," as a convolution of p_τ(ξ), the statistical correlation density on the scale resulting from the Fisher transform, and the histogram of sampling errors, say, the "ɛ-histogram," also on the τ-scale. Then, the underlying p_τ(ξ) is obtained by deconvolving this density from the convolved pair. For a wide variety of microarray data sets studied in this work (also see [30]), the normal distribution 𝒢(0, σ_ŵ²) fits nicely to the ŵ-histogram. For bivariate normal samples, where ɛ ~ 𝒢(0, [M − 3]⁻¹), the estimate of p_τ(ξ), say p̂_τ(ξ), takes the normal form 𝒢(0, σ_ŵ² − [M − 3]⁻¹). It is this estimate that will serve as the empirical density of the Fisher-transformed correlations, q_τ(ξ).

Having obtained the underlying q_τ(ξ), we must, in principle, undo the mapping (20) to obtain q_x(ξ), then deduce q_z from q_x. Recall, however, that we assume that the correlation coefficients of the z and x_g variables are identical [31], so that we may directly seek q_z(ξ) = q_x(ξ) from q_τ(ξ). A distribution that fits the inverse-transformed q_τ(ξ) well is

q_z(ξ) = ½ p_β([1 + ξ]/2; α, α),    (22)

a member of the general Beta distribution family given by

p_β(y; α, β) = y^{α−1} (1 − y)^{β−1} / B(α, β),  y ∈ [0, 1],    (23)

where B is the Beta function and α and β are nonnegative shape parameters. Comparing (22) and (23) gives a useful probabilistic interpretation of the correlation coefficient, say ρ̃, modeled by the empirical density q_z: We see from (22) that ρ̃ = 2β̃ − 1, where β̃ is a Beta-distributed RV with parameters α = β. Therefore, σ_ρ̃² will be a factor of four greater than the variance of a Beta-distributed random variable with parameters α = β. That is,

σ_ρ̃² = 4 · 1/[4(2α + 1)] = 1/(2α + 1).    (24)

Thus, using the (corrected) sample variance of the correlation coefficients as an estimate of σ_ρ̃², we obtain the parameter α and, hence, the distribution q_z.
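The pipeline of this subsection can be sketched numerically (our own code, under the bivariate-normal assumptions above; names are ours): Fisher-transform the sample correlations per (20), fit a zero-mean normal to the transformed values, deconvolve the sampling variance, map the result back to the correlation scale, and solve σ_ρ̃² = 1/(2α + 1) for the omnibus parameter α:

```python
import math
import numpy as np

def fit_alpha(rho_hat, M, n_mc=100_000, seed=1):
    """Estimate the omnibus Beta shape parameter alpha from sample
    correlation coefficients rho_hat computed over M microarrays."""
    w = np.arctanh(np.asarray(rho_hat))              # Fisher transform, (20)
    var_tau = max(w.var() - 1.0 / (M - 3), 1e-12)    # deconvolve sampling noise
    # map the tau-scale normal density back to the correlation scale
    rng = np.random.default_rng(seed)
    rho = np.tanh(rng.normal(0.0, math.sqrt(var_tau), n_mc))
    return 0.5 * (1.0 / rho.var() - 1.0)             # var(rho) = 1/(2*alpha + 1)

# synthetic check: true tau-scale density G(0, 0.2^2), M = 20 arrays
rng = np.random.default_rng(7)
tau = rng.normal(0.0, 0.2, 50_000)
w_obs = tau + rng.normal(0.0, math.sqrt(1.0 / 17), 50_000)   # add sampling error
alpha = fit_alpha(np.tanh(w_obs), 20)
```

The Monte Carlo step stands in for the analytic inverse mapping of the τ-scale density; with the synthetic settings above, the recovered α is on the order of 13, consistent with a correlation variance near 0.04.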
4.2.4. Obtaining q_{Rz}(Ξ)
We now pursue q
(Ξ) as an extension of q
(ξ). Like q
, the effective domain of q
is of only three dimensions. Also similarly to the scalar density, for the two-sample t-statistic, q
(Ξ) ≈ q
(Ξ). Hence, we can pursue q
indirectly by finding q
.We seek a joint distribution on the space of all 3 × 3 correlation matrices such that all the inherent marginal distributions (i.e., the distributions of ρ(x
, x
) for g ≠ g′) are equivalent to q
(ξ). Such a result can be obtained from the inverse-Wishart distribution whose marginalization properties are helpful when studying a subset of variables [56]. Suppose that the true statistical covariance matrix of X
, say,
follows the inverse-Wishart distribution 𝒲
−1(I, ν), ν ≥ G,
where ν is the single parameter that characterizes the distribution, and tr {·} indicates the trace. The goal is to relate ν to parameter α of (22) and to determine the distribution of any of the 3 × 3 covariance submatrices of Σ.Following the separation strategy of Barnard et al. [57], we decompose Σ into its variances and normalized covariances (i.e., correlation coefficients) as
where S ∈ ℝ^{G×G} is the diagonal matrix whose ith diagonal element, s_i, is the standard deviation of the gene-i residual expression [recall (2)], and R_x is the G × G correlation matrix of the residual expression matrix X. Under the transformation Σ → (S, R_x), the Jacobian is 2^G ∏_{i=1}^{G} s_i^G [58, Theorem 3]. Thus, after marginalization over S,

q(Ξ) ∝ |Ξ|^{−(ν+G+1)/2} ∏_{i=1}^{G} ∫₀^∞ s_i^{−(ν+1)} exp(−ξ^{ii}/(2 s_i²)) ds_i,   (28)

where ξ^{ii} is the ith diagonal element of Ξ⁻¹. The product arises because of the independence of the s_i elements. Substituting ω_i = ξ^{ii}/(2 s_i²) yields

q(Ξ) ∝ |Ξ|^{−(ν+G+1)/2} ∏_{i=1}^{G} (ξ^{ii})^{−ν/2} ∫₀^∞ ω_i^{ν/2−1} e^{−ω_i} dω_i ∝ |Ξ|^{−(ν+G+1)/2} ∏_{i=1}^{G} (ξ^{ii})^{−ν/2},   (29)

which leads to an expression for the probability density of the matrix R_x:

q_{R_x}(Ξ) ∝ |Ξ|^{(ν−1)(G−1)/2 − 1} ∏_{i=1}^{G} |(Ξ)_{(i)}|^{−ν/2},   (30)

where (A)_{(i)} denotes the ith principal submatrix of A, and where we have used the fact that ξ^{ii} = |(Ξ)_{(i)}|/|Ξ|. For R_x with probability density (30), the marginal density of its arbitrary correlation submatrix also has a useful expression.
Lemma 3 .
For a correlation matrix R_x ∈ ℛ_G with the probability density (30), the κ × κ correlation submatrix, R_κ ∈ ℛ_κ, has the density

q_{R_κ}(Ξ) ∝ |Ξ|^{(ν−G+κ−1)(κ−1)/2 − 1} ∏_{i=1}^{κ} |(Ξ)_{(i)}|^{−(ν−G+κ)/2}.   (31)

Proof
Suppose that the κ × κ statistical covariance submatrix Σ_κ undergoes the transformation Σ_κ → (S_κ, R_κ), where S_κ is the diagonal scaling matrix of appropriate standard deviations [recall (27)]. Then, due to the marginalization property of the inverse-Wishart distribution, Σ_κ ~ 𝒲⁻¹(I, ν − G + κ). Following steps (26)–(30) for Σ_κ yields (31).

Substituting κ = 2 in result (31) yields

q_{R_2}(Ξ) ∝ (1 − ρ12²)^{(ν−G−1)/2}.   (32)

Note that this density is a function of a scalar argument, the single value, ρ12, of the off-diagonal elements of R_2. We have indicated this by the use of the subscript "ρ12" in the second term of the expression. The critical property of this result is that it has the same uniparametric form as (22). By setting ν − G = 2α + 1, we can force the inherent marginal densities of the R_x entries (ρ(x_g, x_{g′}), g ≠ g′) to equal q_ρ(ξ), the specific aim of this derivation.

Finally, substituting κ = 3 in result (31), we obtain

q_{R_3}(Ξ) ∝ |Ξ|^{ν−G+1} ∏_{i<j} (1 − ξ_{ij}²)^{−(ν−G+3)/2},   (33)

in which ξ_{ij} is the (i, j) element of the (abstract) matrix argument Ξ. The density of a particular covariance matrix in ℛ_3, say Ξ = R_3, involves the three elements in the upper triangle of the matrix, reinforcing earlier assertions that the domain of q_{R_z} is a manifold of the matrix space. Recall that R_z (assumed equivalent to R_x) is the original notation for the 3 × 3 covariance matrix of a score triplet (z_g, z_{g′}, z_{g″}) and, by extension, (x_g, x_{g′}, x_{g″}). In the present discussion, R_z assumes the role of a 3 × 3 submatrix of R_x, namely, R_3.

For large G, the assumption of the inverse-Wishart distribution in (26) is not well justified. However, the assumption is used here strictly for its value in deducing q_{R_z}(Ξ) from q_ρ(ξ); there is no concern for a model of the entire matrix R_x. Further, single-parameter distributions on a positive-definite matrix space are few, and the inverse-Wishart distribution is chosen for its useful marginalization property. Of course, a tenuous assumption is not justified by a useful property if the resulting procedure is not useful; the practical validation of the assumption is manifest in Section 5. The "Bayesian correlation priors" point of view in the work of Liechty et al. [59] was especially helpful in formulating these ideas. Exploring other ways to obtain q_{R_z} is a worthwhile but challenging endeavor.

Finally, we have derived p_{R_κ}(Ξ ∣ ν) only up to a proportionality constant. However, for q_{R_3}(Ξ) (κ = 3), normalization is straightforward. For κ ≥ 4, it is necessary to resort to Monte Carlo methods, which require densities only up to a proportionality constant.
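A quick numerical illustration of the machinery above: draw a covariance matrix from the inverse-Wishart prior, separate it as in (27), and read off a 3 × 3 correlation submatrix. The dimensions are illustrative, and scipy's inverse-Wishart degrees-of-freedom convention is assumed here to be compatible with 𝒲⁻¹(I, ν) as used in the text.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)

# Illustrative dimensions only; the paper's G is in the thousands and
# nu is tied to alpha through nu - G = 2*alpha + 1.
G, nu = 10, 20

def correlation_submatrix(idx):
    """Draw Sigma ~ W^{-1}(I, nu), separate Sigma = S R S as in (27),
    and return the principal correlation submatrix indexed by idx."""
    Sigma = invwishart.rvs(df=nu, scale=np.eye(G), random_state=rng)
    d = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(d, d)
    return R[np.ix_(idx, idx)]

R3 = correlation_submatrix([0, 3, 7])    # a 3 x 3 correlation matrix
print(R3.shape)
```

Repeating the draw many times gives an empirical picture of the submatrix density whose closed form, up to normalization, is (33).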
5. Results
5.1. Testing on Real Data Sets
A MATLAB implementation of the approach based on the work above is available at the website http://www.egr.msu.edu/~deller/. The methods were tested on two real data sets, both showing a significant amount of intergene covariance but exactly opposite 𝔉 behaviors. Calculations below are for left-sided tail-area parameter δ = −2.5 and center-area parameter c = 1 [recall (4)]. Comparisons with Efron's [31] second-order estimator of 𝔉 are made.

The first data set is from the breast cancer (BRCA) study of Hedenfalk et al. [60]. These data record the expression of G = 3226 genes on M = 15 microarrays, with seven samples assigned to BRCA1 mutations and eight to BRCA2. The original research seeks to identify genuine mRNA activity differences between these two categories. In the present paper, the logarithm is applied to the mRNA levels to increase Gaussianity [61].

In a study of the human immunodeficiency virus (HIV), Van 't Wout et al. [62] investigated G = 7680 genes over M = 8 microarrays, with four samples assigned to an HIV-infected condition and the remaining four to the control. To produce the test cells, the control cells (CD4 T-cell lines) were infected with the HIV-1BRU virus. The paper reports raw mRNA levels which, like the BRCA data, are converted to logarithms in the present work.

The present analysis reduces an entire expression matrix to two numbers: the center-area null count, C, and the omnibus parameter, α. As is evident in Figure 1, the parametrization described in Section 4 is realistic. For the BRCA data, α = 17.77, and for HIV, α = 3.51. The Figure 1 caption provides details.
Figure 1
Effect of sampling fluctuations on the empirical covariance density. (a) Upper curves: BRCA data. (b) Lower curves: HIV data. For each subfigure: the left panel is the histogram of sample covariances after applying the Fisher transformation (20), with a normal distribution (heavy curve) fitted to it; the right panel is the histogram of denoised covariances, with a modified Beta distribution fitted to it (heavy curve). These distributions summarize the cumulative effect of the G(G − 1)/2 gene-gene covariances in a single-parameter (α) distribution.
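The score pipeline that produces the tail-area and center counts can be sketched as follows. This is a generic reconstruction, not the paper's code: per-gene two-sample t-statistics are mapped to z-scores through the t and normal c.d.f.s, then counted against δ and c; the group sizes mirror the BRCA layout.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def z_values(X, group1, group2):
    """Per-gene two-sample t-statistics mapped to z-scores through
    z = Phi^{-1}(F_t(t)); a generic sketch of the score pipeline,
    not the paper's exact preprocessing."""
    t = stats.ttest_ind(X[:, group1], X[:, group2], axis=1).statistic
    df = len(group1) + len(group2) - 2
    return stats.norm.ppf(stats.t.cdf(t, df))

def tail_and_center_counts(z, delta=-2.5, c=1.0):
    """Left-sided tail-area count (candidate discoveries) and the
    center-area null count C on [-c, c]."""
    return int((z <= delta).sum()), int((np.abs(z) <= c).sum())

# Null toy data sized like the BRCA layout: G = 3226 genes, 7 + 8 arrays.
X = rng.normal(size=(3226, 15))
z = z_values(X, list(range(7)), list(range(7, 15)))
print(tail_and_center_counts(z))
```

For independent null genes, the tail count concentrates near G·Φ(δ), which is the "usual" mean estimate discussed below.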
The next step is to compute the moments of p(𝔉, C) per Lemmas 1 and 2. These calculations require parameters G, α, c, δ, and Δ. We set Δ = 0.1. The estimated moments are used to find the maximum-entropy (maxent) distribution (Appendix B). Figure 2, for example, reports the moments and the corresponding maxent distribution for the BRCA data. 𝔉 and C exhibit a strong negative correlation of −0.89, a value similar to that in [31, Table 1]. Furthermore, 𝔉 shows significant positive skewness, which causes C to exhibit negative skewness. This is not surprising, as 𝔉 is bounded below by zero and yet has a small mean but inflated variance. The third moment provides an additional level of detail about the joint behavior of 𝔉 and C.
Figure 2
BRCA example: estimated p(𝔉, C) moments and the corresponding maxent distribution estimates. The distribution estimate on the left uses third-moment information in the maxent optimization, while the right estimate uses only second moments. The third-order estimate exhibits finer detail than its second-order counterpart, and a contour that cannot be modeled using a quadratic distribution.
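The maxent step that turns estimated moments into a distribution estimate can be illustrated on a small one-dimensional mesh. Everything below (mesh, normalized features, target moments) is illustrative rather than taken from the paper, which fits a two-dimensional distribution over (𝔉, C); the computation is the standard convex dual of the maximum-entropy problem.

```python
import numpy as np
from scipy.optimize import minimize

# Discrete mesh and normalized support (cf. the normalization remark
# in the Figure 6 caption); targets are illustrative moment values.
x = np.linspace(0.0, 10.0, 200)
u = x / 10.0
features = np.stack([u, u**2, u**3])          # first three moments of u
target = np.array([0.30, 0.11, 0.045])        # prescribed E[u], E[u^2], E[u^3]

def dual(lam):
    """Convex dual of the maxent problem: log-partition minus lam . target."""
    logits = lam @ features
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum()) - lam @ target

lam = minimize(dual, np.zeros(3), method="BFGS").x
logits = lam @ features
p = np.exp(logits - logits.max())
p /= p.sum()                                  # maxent pmf on the mesh
print(p @ x)                                  # mean on the original scale
```

At the dual optimum, the fitted pmf reproduces the prescribed moments; adding the third-moment feature is what admits the skewed, non-quadratic contours seen in Figure 2.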
During the maxent numerical optimization, a 100 × 500 equispaced mesh was found sufficient for the BRCA data; however, for the HIV data, the resolution had to be increased to 400 × 2000. This is because, in addition to the larger G, the HIV X also exhibits more covariance. The BRCA optimization required 30 iterations, while HIV took ~70 iterations, to converge to an estimated distribution.

Figure 3 reports the estimated p(𝔉 ∣ 𝒪), where 𝒪 refers to the observed data. Second- and third-moment estimates are shown separately. In the framework of statistical inference, such a distribution is the ultimate goal, and the result can also be used for other purposes, such as computing point estimates and associated confidence intervals.
Figure 3
Estimated conditional distributions of the number of false discoveries. Panel (a): BRCA data. Panel (b): HIV data. To show the effect of skewness corrections, the third-moment 𝔉 distribution (solid curve) is compared to its second-moment counterpart (dashed curve). For BRCA, the second-moment mean estimate is 79, compared to 104 for the third moment; for HIV, these are 19 and 8, respectively. The BRCA curves are also labeled with 50% (solid line) and 75% (dotted line) confidence intervals.
If the mean of the estimated p(𝔉 ∣ 𝒪) is used as a point estimate of 𝔉, then, for the BRCA data, third-moment calculations suggest 104 false discoveries versus 79 for the second moment, while the usual mean estimator suggests only 20 false discoveries. These numbers must be put in perspective by noting that the actual z_g count falling in the left-sided tail area is 116. For the HIV data, the third-moment analysis found eight false discoveries compared to 19 for the second moment, while the mean estimator produces 48. This time, the z_g count in the left-sided tail area is 46. Clearly, extensive analysis of intergene dependence can lead to very different conclusions from the same data, relative to those of the mean estimate of false discoveries (and, in turn, the procedures built around it).

The second-order estimator designed by Efron [31] found 77 false discoveries for BRCA. Efron compares that result to a nonparametric analysis in the same paper and concludes underestimation, but the issue is not further pursued. Our findings show that nonlinear dependence (as reflected in the present case by moments higher than the second) is potentially very important in characterizing the null z histogram.

We note in passing that the availability of p(𝔉 ∣ 𝒪) permits the application of the bound ℙ(𝔉/G* ≥ γ ∣ 𝒪) ≤ λ as a control measure, as recommended by Lehmann and Romano [63]. It is not a trivial matter to choose γ and λ such that a fair comparison with other error measures is possible; however, for illustrative purposes, we set γ = 0.15 and λ = 0.5. With this constraint, the present approach reports 174 discoveries (108 for the second moment) for the HIV X. This compares favorably with the results of Efron [31], where the Benjamini-Hochberg procedure, with false discovery rate control level 0.10 and an empirical null from [5], yields 180 discoveries. Without covariance modeling, the Benjamini-Hochberg procedure reports only 20 discoveries.
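The exceedance-control rule just mentioned is easy to apply once a distribution estimate for 𝔉 is in hand. The sketch below uses a made-up pmf, not the model's output; only the bound-checking logic is the point.

```python
import numpy as np

def exceedance_ok(f_support, f_pmf, n_discoveries, gamma=0.15, lam=0.5):
    """Accept a rejection set of size n_discoveries only if the estimated
    exceedance probability P(F / n_discoveries >= gamma) is at most lam."""
    prob = f_pmf[f_support / n_discoveries >= gamma].sum()
    return bool(prob <= lam)

# A made-up pmf for F (not model output): discretized bell curve at 20.
f_support = np.arange(60)
f_pmf = np.exp(-0.5 * ((f_support - 20.0) / 8.0) ** 2)
f_pmf /= f_pmf.sum()

print(exceedance_ok(f_support, f_pmf, n_discoveries=174))
```

With this pmf, a rejection set of 174 discoveries passes the (γ, λ) = (0.15, 0.5) test, while a much smaller set of 50 would fail it, since the fixed 𝔉 mass then represents a large fraction of the discoveries.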
5.2. Testing on Simulated Data
Insight is gained by testing the approach on simulated data for which the "correct answer" is known. In the studies below, all cases are null (no treatment, residuals only). The goal is to see how well the realized left-sided tail-area count can be estimated from the center count. To maintain realism, we simulate raw mRNA levels. The testing scenario is a "two-group study," so, en route to z-values, we take the usual two-sample t-statistic.

Let the mRNA expression level, x_{gm}, of gene g measured by microarray m, be distributed according to the Gamma density: for m = 1, …, M,

p(x_{gm}) = x_{gm}^{κ−1} exp(−x_{gm}/θ_g) / [Γ(κ) θ_g^κ],   x_{gm} ≥ 0,   (34)

where Γ(κ) = ∫₀^∞ v^{κ−1} e^{−v} dv is the Gamma function. κ and θ_g are called the shape parameter and the scale parameter of the distribution, respectively. This distribution is similar to the Gamma-Gamma model used by Newton et al. [64]. In (34), the shape parameter κ is common to all genes, while the scale parameters {θ_g}, which characterize varying mRNA levels from gene to gene, are assumed i.i.d. as

θ_g ~ Gamma(κ₀, θ₀).   (35)

The intuition that genes with larger underlying mRNA levels should have higher variance is supported by model (34), since the mean of the gth gene is κθ_g and its variance is κθ_g².

The parameter set (κ, κ₀, θ₀) in (34) and (35) can be chosen on the basis of the overall gene-expression histogram of real microarray data. Results for three such parameter sets, (1, 0.6, 500), (2, 0.39, 384), and (3, 0.33, 300), are presented. These numbers were chosen to preserve the total sample variance, and the particular values are based on the HIV data of Van 't Wout et al. [62], which were collected using Affymetrix microarrays. In particular, κ = 1 models x_{gm} variables that are exponentially distributed, while κ = 2 models a unimodal distribution with heavy tails and a noticeable departure from Gaussianity. Case κ = 3 represents an approximation to a Gaussian distribution, but with slightly heavier tails.

Substantial row-wise covariance was added via the Gaussian copula technique: a G × M matrix, say Z, of i.i.d. unit normal RVs was used to produce a correlated matrix, Z_c, via the mapping

Z_c = L Z,   (36)

in which L is the lower-triangular Cholesky factor (e.g., [65]) of R_x + ɛI, the correlation matrix of the actual expression matrix X from the BRCA study, plus a small diagonal load to prevent singularity due to the fact that M < G. Several other dense matrices, R, generated through a different method [66], yielded similar results. This process imposes the covariance of the real BRCA data on a simulated substrate of independent Gaussian variables. The resulting elements z_{gm} were mapped to P-values, P_{gm}, through the unit normal c.d.f., then further transformed to simulated expression variables, x_{gm}, through the inverse Gamma c.d.f. as in (34). The result is the simulated expression matrix X = [x_{gm}].

Figures 4 and 5 compare second- and third-moment estimates for δ = −2.0 and δ = −2.5, respectively. In both cases, c = 1 for the center area (see Section 6). For each (κ, κ₀, θ₀), 800 matrices X were processed. On each X, the approach was applied in its entirety, and no additional knowledge was assumed. The a posteriori mean was used as the final estimate. The usual mean estimator consistently reported 20 for δ = −2.5 and 73 for δ = −2.0, regardless of the particular X.
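The simulation recipe can be sketched compactly. The correlation matrix below is synthetic (the paper imposes the empirical BRCA correlation matrix), and the dimensions are kept small for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Sketch of the recipe (34)-(36): Gaussian copula for the row-wise
# covariance, Gamma margins for the mRNA levels.
G, M = 50, 8
kappa, kappa0, theta0 = 2.0, 0.39, 384.0      # one of the parameter sets

R = np.corrcoef(rng.normal(size=(G, G)))      # dense synthetic correlations
L = np.linalg.cholesky(R + 1e-4 * np.eye(G))  # small diagonal load, cf. (36)

Z = rng.normal(size=(G, M))                   # i.i.d. unit normals
Zc = L @ Z                                    # correlated rows, eq. (36)
P = stats.norm.cdf(Zc)                        # probability-integral transform
theta = stats.gamma.rvs(kappa0, scale=theta0, size=G, random_state=rng)
X = stats.gamma.ppf(P, kappa, scale=theta[:, None])   # Gamma margins, eq. (34)
print(X.shape)
```

Because the copula acts on ranks, the resulting X has (approximately) the prescribed row-wise dependence while every entry keeps its Gamma margin.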
Figure 4
Simulation experiments comparing conditional estimates: third-moment estimates (+ markers) and second-moment estimates (∘ markers). This figure corresponds to the left-sided tail area with δ = −2.0 [see (4)]. Substantial row-wise covariance is present. The abscissa is the realized count, while the ordinate is the estimated count. The significance of the parameters (κ, κ₀, θ₀) is discussed in the text. Third-moment skewness corrections tend to make the estimation process more accurate.
Figure 5
Left-sided tail area with δ = −2.5; otherwise, the caption of Figure 4 applies.
Strikingly, for all three parameter sets (κ, κ₀, θ₀), the third-moment skewness corrections make the estimation process more accurate. For some of the scenarios, the third-moment estimates saturate somewhat, but the effect is minor compared to that in the second-order approaches. To the extent that these parameter sets cover a wide range of realistic gene-expression data, the practical utility of the proposed approach is evident.
6. Discussion
Advances in DNA microarray technology, improved standardization procedures, and careful execution of laboratory protocols collectively lead to testing situations with marginally correct but strongly correlated null hypotheses. If correlation is the result of intrinsic gene-gene interactions, no experimental design can circumvent it. Correlation can cause the realized 𝔉 to vary significantly from case to case [63], and the control of 𝔉 via the usual μ_𝔉 = ℰ{𝔉} may no longer represent the basic facts. The moment theory of the null-statistic histogram can be used to deduce an estimator of μ_𝔉 which explicitly combines identifiability and covariance. Though we have explored these ideas in the differential-analysis context above, the findings are quite general.

It is reasonable to question the necessity of the heavy mathematical machinery of the foregoing sections, since it is possible to simulate a number of sets of z-scores {z_g} from the distribution z ~ 𝒢(0, Σ_z), then estimate the moments. However, due to sampling fluctuations, the underlying covariance matrix Σ_z is ordinarily unattainable, whereas pursuing quantities like q_τ(ξ) and q_ρ(ξ) is still possible. Also, as G gets larger (~25,000 for recent microarray studies), the computational demands, as well as the large number of z-score sets required, become prohibitive.

Permutation calculations, as in [31, Section 4], offer an alternative way to estimate the moments. They, too, can run into computational difficulties, especially when the test statistic itself is computationally intensive. Further difficulty arises when samples are few: for a two-group study like HIV (four samples in each condition), the data provide only 70 unique permutations.

When direct extraction of inter-hypothesis covariance is not feasible, models with a single omnibus parameter remain useful in that the investigator can still use judgment to incorporate some form of covariance effect intelligently by setting a value of the parameter α.

The distribution of interest, p(𝔉, C), resides over the support domain 𝒟 shown in Figure 6, and the maxent algorithm is adept at handling such complicated support regions. At a more fundamental level, through maxent, we seek to minimize the amount of unintentional prior information brought into the inference.
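The permutation count quoted for the HIV design is simple to verify: a relabeling of a 4-versus-4 two-group study is a choice of which four arrays form the first group.

```python
from itertools import combinations
from math import comb

# With M = 8 arrays split 4 vs. 4, each labeled split is a choice of
# which four arrays form the first group: C(8, 4) = 70 in total.
splits = list(combinations(range(8), 4))
print(len(splits), comb(8, 4))
```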
Figure 6
Discrete support 𝒟
(□ markers) versus continuous support 𝒮
(solid boundary). 𝒮
is normalized to improve numerical stability.
Apart from the numerical parameters Δ (bin width) and the mesh resolution in maxent, the only open parameterization choice in the present method is c, the center-area boundary. The selection of c = 1 in this paper is based on the first-eigenvector analysis of Efron [31], which suggests that (within certain approximations) the interval [−1, 1] has completely opposite count behavior from the rest of the 𝒵 space.

One surprising result of this and similar studies is that more inter-z covariance does not translate into more extreme covariance between the variables 𝔉 and C. In the BRCA data, for example, the correlation coefficient between 𝔉 and C is −0.89, while for the HIV data it drops in magnitude to −0.75. Further research to gain insight into this behavior would be very useful.

Finally, while covariance was viewed in the present paper as a "destructive factor" in the attempt to estimate 𝔉, inter-z covariance can, in fact, be exploited to increase power by finding a superior ranking of potential discoveries. The recent multiple-testing literature has begun to address this possibility.
Ethical Approval
Human subject data used in this study are publicly available and anonymous and are, therefore, exempt from continuing Institutional Review Board scrutiny according to US Health and Human Services Policy 45 CFR 46, Subpart A, §46.101(b)(4).