Karl Oskar Ekvall1,2, Aaron J Molstad3. 1. Division of Biostatistics, Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden. 2. Applied Statistics Research Unit, Institute of Statistics and Mathematical Methods in Economics, TU Wien, Vienna, Austria. 3. Department of Statistics and Genetics Institute, University of Florida, Gainesville, Florida, USA.
Abstract
We propose a new method for multivariate response regression and covariance estimation when elements of the response vector are of mixed types, for example some continuous and some discrete. Our method is based on a model which assumes the observable mixed-type response vector is connected to a latent multivariate normal response linear regression through a link function. We explore the properties of this model and show its parameters are identifiable under reasonable conditions. We impose no parametric restrictions on the covariance of the latent normal other than positive definiteness, thereby avoiding assumptions about unobservable variables which can be difficult to verify in practice. To accommodate this generality, we propose a novel algorithm for approximate maximum likelihood estimation that works "off-the-shelf" with many different combinations of response types, and which scales well in the dimension of the response vector. Our method typically gives better predictions and parameter estimates than fitting separate models for the different response types and allows for approximate likelihood ratio testing of relevant hypotheses such as independence of responses. The usefulness of the proposed method is illustrated in simulations and in one biomedical and one genomic data example.
In many regression applications, there are multiple response variables of mixed types. For instance, when modeling complex biological processes like fertility (see Section 6.1), the outcome (ability to conceive) is often best characterized through a collection of variables of mixed types: both count (number of egg cells and number of embryos) and continuous (square‐root estradiol levels and log gonadotropin levels). Similarly, one may be interested in the dependence between various types of omic data in a particular genomic region (see Section 6.2), some binary (eg, presence of somatic mutations) and some continuous (eg, gene expression). Joint modeling of responses facilitates inference on their dependence; it can also lead to more efficient estimation and better prediction, and it allows joint hypotheses to be tested without the need for multiple testing corrections. Popular regression models, however, typically assume all responses are of the same type. A case in point is the multivariate normal linear regression model, which assumes a vector of responses and vector of predictors satisfy
for some regression coefficient matrix and covariance matrix , the set of symmetric and positive definite matrices. Model (1) is fundamental in multivariate statistics and leads to relatively straightforward likelihood‐based inference; the parameters are identifiable, have intuitive interpretations, and maximum likelihood estimates are computationally tractable even when and are relatively large (but smaller than the number of independent observations ). Here, our goal is the development of a likelihood‐based method for mixed‐type response regression that retains some of the useful properties of (1). We focus on settings where dependence between responses cannot be parsimoniously parameterized and interest is in inference on an unstructured covariance matrix characterizing this dependence. Rather than a method for specific combinations of response distributions, of which there are many,
we pursue a unified method that can be adapted to many different combinations of response distributions with relative ease. To develop our method, we assume there is a known link function and latent vector such that
That is, a latent vector satisfies the multivariate linear regression model and the observable responses are connected to that latent vector through their conditional mean. We further suppose that, conditionally on , the elements of satisfy generalized linear models with linear predictors given by the elements of (see detailed specification in Section 2). This leads to a likelihood based on independent observations which can be written
where is the conditional density for given and is the multivariate normal density with mean and covariance matrix . Note (1) is a special case of (2) where and is the identity. Conversely, (3) can be obtained as the likelihood for a specific generalized linear latent and mixed model (GLLAMM),
a class of models which also includes many other models for mixed‐type responses as special cases.
Example 1 illustrates our setting.

Example 1. Suppose is a vector of 10 count variables () and 10 continuous variables () whose joint distribution is of interest. If there are no predictors, a possible version of (2) assumes for , for , and , where and .

There are several challenges with inference based on (3). First, in general the integrals cannot be computed analytically, and numerical integration can be prohibitively time consuming even for . This is commonly addressed by, for example, Laplace approximations, Monte Carlo integration, or penalized quasi‐likelihood (PQL), which results from dropping terms in a Laplace approximation that are assumed to be of lower order.
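To make the setup concrete, a small simulation in the spirit of Example 1 can be sketched as follows. The dimensions, the log and identity links, and the equicorrelated latent covariance are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 20                                   # 10 count + 10 continuous responses
mu = np.zeros(r)                         # latent mean (no predictors)
# Illustrative latent covariance: equicorrelated, positive definite
Sigma = 0.09 * (0.5 * np.eye(r) + 0.5 * np.ones((r, r)))

def sample_mixed(n):
    """Draw n mixed-type response vectors: latent w ~ N(mu, Sigma),
    counts via a log link, continuous responses via the identity link."""
    w = rng.multivariate_normal(mu, Sigma, size=n)
    y = np.empty((n, r))
    y[:, :10] = rng.poisson(np.exp(w[:, :10]))   # conditionally Poisson
    y[:, 10:] = rng.normal(w[:, 10:], 1.0)       # conditionally normal
    return y

y = sample_mixed(2000)
```

Note that the marginal mean of each count response is exp(mu_j + sigma_jj / 2) rather than exp(mu_j): the latent variance propagates into the observable moments, a point developed in Section 2.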
Here, in the interest of developing a method operational for relatively large , we will approximate the likelihood using a modified version of PQL (Section 3), which is known to be fast.
We will also discuss how the proposed method can be extended to settings where the likelihood is approximated by other means. A second challenge is that, even if the integrals in (3) can be computed efficiently, maximum likelihood estimation of requires optimization over :
When is small it may be possible to solve (4) using off‐the‐shelf optimizers (eg, the ml command in Stata or the optim function in R). For example, if , then (4) can be characterized as a constrained optimization problem over , the constraints being that , , and . When is even moderately large, however, such constraints become untenable. In software packages for mixed and latent variable models it is common to maximize the likelihood in and the Cholesky root such that .
This ensures the estimate of is positive semidefinite, but it does not ensure positive definiteness and it complicates more ambitious inference. For example, implementing a likelihood‐ratio test of whether some of the responses are independent requires solving (4) subject to the constraint that some off‐diagonal elements of are zero (Section 4). Similarly, for some response types it is necessary to constrain diagonal elements of to ensure identifiability (see Section 2), and sometimes it is desirable to assume diagonal elements of corresponding to responses of the same type (eg, responses 1‐10 or 11‐20 in Example 1) are the same. These types of restrictions are nontrivial to accommodate when parameterizing in terms of , but as we will see in Section 3, they are elegantly handled by the method we propose. Briefly, our method iteratively updates a PQL‐like approximation of (3) and maximizes that approximation in and using block coordinate descent. The update for is a least squares problem which can be solved in closed form and the update for is solved using an accelerated projected gradient descent algorithm. The algorithm scales well in the dimension and it natively supports restrictions of the form for sets onto which a projection can be computed.

In some settings it may be appropriate to, as is common in mixed models, restrict for some design matrix and covariance matrix of random effects (see eg, Jiang
). Such structures may be suggested by subject‐specific knowledge or the sampling design, or they may be motivated by necessity when is small relative to . Similarly, many methods based on the marginal moments, such as generalized estimating equations,
specify a covariance structure. In many applications, however, it is unclear whether there is any particular dependence structure. Indeed, it may be of primary interest to discover a structure using data, rather than imposing one a priori; our method enables such discoveries. Some methods based on the marginal moments also allow the specification of an unstructured correlation matrix, but for rather than .
In general, however, such methods are inconsistent with (2) since the mean vector and correlation matrix of cannot be independently parameterized. Accordingly, it is often unclear whether the estimates are in the parameter set of any particular model for mixed‐type responses. Nevertheless, methods based on the marginal moments can be useful in practice and they are a reasonable comparison to our method for some purposes.

Other advantages of the parameterization we consider are that (i) the off‐diagonal elements of affect the joint distribution of responses but not their marginal (univariate) distributions and (ii) responses are independent if the corresponding off‐diagonal elements of are zero. Thus, we can test the null hypothesis that some responses are independent by testing whether the corresponding off‐diagonal elements of are zero, without that null hypothesis implying restrictions on the marginal distributions. This is generally not possible in more parsimonious parameterizations of the form since, in general, elements of affect both marginal and joint distributions.
MODEL
Specification
We now specify our model in detail and generalize somewhat compared to the Introduction. Assume the elements of are conditionally independent given with conditional densities of the form
where is a vector of strictly positive dispersion parameters and is a cumulant function for the distribution of . For example, and , with primes denoting derivatives. From these properties it follows that the link function defined by satisfies with real‐valued and strictly increasing for . That is, the th latent variable has a direct effect on the th response but no other responses. Equation (5) specifies a (conditional) generalized linear model
(GLM), which when specializes to a one‐parameter exponential family distribution, as in Example 1. Here, we do not assume for every but we do assume the are known. Because it makes our development no more difficult, in what follows we consider a slightly more general version of (2) where
for a nonstochastic design matrix and . The classical multivariate response regression setting in (2) which motivates our study is then a special case with , , and , where is the Kronecker product and the vectorization operator stacking the columns of its matrix argument. Unlike (2), where all responses have the same predictors, (6) allows distinct predictors for each response, as in seemingly unrelated regressions. With , denoting independent realizations, the likelihood is
To see the connection to GLLAMMs and other mixed models which are often written for all observations simultaneously, let , , and . Then (7) is the likelihood for a model which assumes the elements of follow conditionally independent GLMs given , with canonical link functions and linear predictors given by the elements of
that is, each of the responses has a linear predictor with its own random intercept, and the covariance matrix for those random intercepts is .
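The Kronecker-product form of the stacked design described above can be checked numerically; the sizes and variable names below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
r, p = 4, 3                            # illustrative sizes
B = rng.normal(size=(r, p))            # coefficient matrix (responses x predictors)
x = rng.normal(size=p)                 # predictor vector for one observation

# Stacked design for one observation: X_i = x_i^T (Kronecker) I_r
X_i = np.kron(x[None, :], np.eye(r))

# vec() stacks columns, so reshape in column-major (Fortran) order
beta = B.reshape(-1, order="F")        # vec(B)

lhs = X_i @ beta                       # stacked linear predictor
rhs = B @ x                            # matrix-form linear predictor
```

The two linear predictors agree because vec(I B x) = (x^T ⊗ I) vec(B), the standard vec identity.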
Parameter interpretation and identifiability
It is often difficult to interpret parameters in latent variable models and, similarly, it is often unclear which parameters are identifiable—we address some such concerns in this section. The parameters are straightforward to interpret in the latent regression, but interpreting them in the marginal distribution of requires more work. To that end, note that the mean vector and covariance matrix of are, respectively, by iterated expectations,
We make a number of observations based on (8): first, because is assumed diagonal, the covariance between responses is determined by . Second, since and are determined by the univariate distribution of , off‐diagonal elements of do not affect means and variances of the responses. Third, since and are nonlinear and nonconstant in general, and in general depend on both and diagonal elements of . Fourth, since is increasing in and does not depend on , is decreasing in and . This is intuitive as responses are conditionally uncorrelated and hence, loosely speaking, a large element of indicates substantial variation in the corresponding response is independent of the variation in the other responses. In some settings, more precise statements are possible by analyzing closed form expressions for the moments in (8), as the next example illustrates.

Example 2 (Normal and Poisson responses). Suppose there are responses such that and for , and and for . These moments are consistent with assuming for , and, if , , . When not assuming , we say these moments are consistent with normal and (conditional) quasi‐Poisson distributions. We examine the effects of these assumptions on the marginal moments of . Some algebra (Supplementary Material) gives the following moments: , , , , , , and ; the remaining entries of are the same as those given up to obvious changes in subscripts. Clearly, both and depend on and , but regardless of type, the variance of is increasing in , the mean is increasing in , and the covariance between and is increasing in . We will later use these observations to prove a result which implies and are identifiable in this example.

Consider the linear dependence between responses with conditional normal and quasi‐Poisson moments, and , say. The sign of their correlation is the sign of and the squared correlation satisfies, by the Cauchy‐Schwarz inequality, .
Thus, strong linear dependence between and requires a small . To gain intuition for how two responses with quasi‐Poisson moments behave, suppose for simplicity that , , and . Then . For small , this correlation is approximately , which for is upper bounded by 1 and lower bounded by . The latter expression tends to if and 0 if . Thus, strong negative correlation between and requires a small .

Example 2 is convenient because the moments have closed form expressions. In more complicated settings, the following result can be useful. It implies the mean of and covariance of and are strictly increasing in, respectively, the mean of and covariance between and .

Lemma 1. Let $\phi$ be a bivariate normal density with marginal densities $\phi_1$ and $\phi_2$ and covariance $\sigma_{12}$; then for any increasing, nonconstant $f_1$ and $f_2$, the functions defined by $\mu_1 \mapsto \int f_1(w_1)\,\phi_1(w_1)\,\mathrm{d}w_1$ and $\sigma_{12} \mapsto \iint f_1(w_1)\,f_2(w_2)\,\phi(w_1, w_2)\,\mathrm{d}w_1\,\mathrm{d}w_2$ are, assuming the (Lebesgue) integrals exist, strictly increasing in $\mu_1$ and $\sigma_{12}$, respectively.

The proof is in the Supplementary Material. We illustrate the usefulness of this result in another example.

Example 3 (Normal and Bernoulli responses). Suppose with and Bernoulli distributed with
. Suppose also for simplicity , . The marginal distribution of is Bernoulli with where is the standard normal density. One can show that, if is fixed, if and if . That is, any success probability is attainable by varying and, hence, some restrictions are needed for identifiability. One possibility, which has been used in similar settings, is to fix to some value, say one.
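The nonidentifiability described here can be seen numerically. The sketch below assumes a logit link (one of the links considered later) and evaluates the marginal success probability by Gauss-Hermite quadrature; the function name is ours, not the authors'.

```python
import numpy as np

def bernoulli_marginal_p(mu, sigma2, k=80):
    """Marginal success probability p = E[expit(w)] for w ~ N(mu, sigma2),
    computed with probabilists' Gauss-Hermite quadrature (logit link assumed)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(k)
    w = mu + np.sqrt(sigma2) * nodes
    # weights sum to sqrt(2*pi) for the probabilists' weight exp(-x^2/2)
    return weights @ (1.0 / (1.0 + np.exp(-w))) / np.sqrt(2.0 * np.pi)
```

With mu fixed, p still varies with sigma2 (increasing the latent variance pulls p toward 1/2), so (mu, sigma2) are not jointly identifiable from the Bernoulli marginal alone; this is the reason for fixing the corresponding diagonal element of the latent covariance.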
While fixing does not impose restrictions on the distribution of as long as can vary freely, it may impose restrictions on the joint distribution of , properties of which we consider next. Equation (8) implies . The integral does not admit a closed form expression, but Lemma 1 says the covariance is strictly increasing in , which can be used to show the parameters are identifiable in this example if is known (Theorem 1). To understand which values can take, consider the limiting case as and assume for simplicity . In the limit, the covariance matrix is singular and the distribution of the same as that obtained by letting . Then one can show . By using the dominated convergence theorem as and , this can be shown to tend to and be upper bounded by . This correlation corresponds to a limiting case and is an upper bound on the attainable correlation between Bernoulli and normal responses.

We conclude with a result on identifiability. The result is stated for some common choices of link functions and distributions of , but the proof can be adapted to other settings. In the Supplementary Material, we state a result (Lemma B.4) which outlines conditions for identifiability more generally. Essentially, when satisfies a GLM, it suffices that, for every , the variance of is not a function of the mean of . If it is, as is the case when is Bernoulli distributed, restrictions on diagonal elements of are needed for identifiability. The proof is in the Supplementary Material.

Theorem 1. Suppose the data are an independent sample from our model and that (i) each link function $g_j$, $j = 1, \dots, r$, is either the identity, natural logarithm, or logit (log‐odds) function; (ii) $\sigma_{jj}$ is fixed and known for every $j$ corresponding to a logit link; and (iii) the design matrix has full column rank; then distinct parameter values correspond to distinct distributions for the observable responses.

From a computational perspective, nonidentifiability can lead to likelihoods with infinitely many maximizers or ridges along which the likelihood is constant. In light of this result, we constrain the diagonals of corresponding to binary response variables; we found this leads to faster computation and improved estimation accuracy.
ESTIMATION
Overview
We propose an algorithm based on linearization of the conditional mean function , where ; this is essentially equivalent to linearization of the link function, which has been considered in other latent variable models.
More specifically, consider the elementwise first order Taylor approximation of around an arbitrary : . Applying expectations and covariances on both sides yields and . Approximating leads to . Intuitively, we expect and to be good approximations if takes values near with high probability. Now, consider a working model which says are independent with
for observation‐specific approximation points , . The corresponding negative log‐likelihood is, up to scaling and additive constants
If all responses are normal, then the working model is exact and minimizers of are maximum likelihood estimates (MLEs). More generally, minimizers of are approximate MLEs whose quality depends on the accuracy of the working model (9). For further insight it is helpful to note that if is set to the maximizer of the th integrand in (7), then is the same type of approximation of as that used in PQL, specialized to our setting. The correspondence between linearization of the link function and PQL is detailed by Breslow.
The correspondence implies that, in addition to the moment‐based motivation given here, can be motivated as a Laplace approximation of , with terms assumed to be of lower stochastic order ignored (see Breslow and Clayton
for details). Roughly speaking, the approximation will be better the closer the distribution of is to a normal. Because the latent variables are normal, we expect the distribution of to be close to normal if the distribution of is. Now, to estimate variance parameters, conventional PQL makes further approximations which lead to a set of estimating equations. We proceed differently and avoid these approximations, both because they lack formal motivation (section 2.5 in Breslow and Clayton
) and because they do not in general lead to a simpler optimization problem for an unstructured and positive definite . Thus, we will work directly with the approximate log‐likelihood and propose an algorithm that is substantially different from ones commonly used for mixed models.

With as the starting point, a natural algorithm for estimating and would iterate between updating by minimizing with the held fixed and then updating the to get a more accurate working model. Motivated by the connection to Laplace approximations, we update the by setting them equal to the maximizers of the integrand in (7) with and fixed at their current iterates. To summarize, we propose a blockwise iterative algorithm whose th iterates are obtained using the updating equations
This algorithm can be run for a prespecified number of iterations or until convergence of the and iterates, for example. While the complete algorithm is not designed to minimize a particular objective function, the individual updates, which we discuss in more detail shortly, minimize objective functions that can be tracked to determine convergence within each update. In our experience, the values of and tend to converge after (at most) tens of iterations of (10) and (11). The final iterates of and are approximate MLEs, and evaluated at the final iterates of is an approximate log‐likelihood which we will use for approximate likelihood‐based inference, including the construction of likelihood‐ratio tests. A formal statement of the proposed algorithm is in Algorithm 1.
Updating the regression coefficients and the covariance matrix
To solve (10), we use a blockwise coordinate descent algorithm. Treating as fixed throughout and ignoring the iterate superscript, this algorithm iterates between updating and . Specifically, the th iterates of the algorithm for solving (10) can be expressed
Update (12) can be shown to be a weighted residual sum‐of‐squares problem with solution
where and . Minimizing with respect to is nontrivial owing to nonconvexity and the constraint that is positive semi‐definite. One possibility is to parameterize in a way that lends itself to unconstrained optimization (see eg, Pinheiro and Bates
) and use a generic solver. However, such parameterizations are inconvenient since we, as discussed in Section 2, sometimes restrict diagonal elements of to be equal to a prespecified constant for identifiability. Similarly, testing correlation of responses requires constraining some off‐diagonal elements to equal zero. Thus, we need an algorithm that allows restrictions on the elements of and ensures estimates are positive semidefinite, and can be constrained to be positive definite if desirable. By picking an appropriate (convex) , (13) can be characterized as an optimization problem over with the constraint that . To handle both the nonconvexity and general constraints, we propose to solve this problem using a variation of the inertial proximal algorithm proposed by Ochs et al.
This is an accelerated projected gradient descent algorithm that can be used to minimize an objective function which is the sum of a nonconvex smooth function and convex nonsmooth function. In our case, (as a function of ) is the nonconvex smooth function and the convex nonsmooth function is the function which equals if and zero otherwise. This algorithm, like many popular accelerated first order algorithms, for example, FISTA,
uses “inertia” in the sense that the search point is informed by the direction of change from the previous iteration, which can lead to faster convergence. To summarize briefly, our algorithm for (13) has th iterate , where , is determined using backtracking line search (see algorithm 4 of Ochs et al
), and is the projection onto . We assume the projection is defined; it suffices, for example, that is nonempty, closed, and convex (corollary 5.1.19 of Megginson
). In our software, is the intersection of a set of matrices with constrained elements and the set of symmetric matrices with eigenvalues bounded below by , where ensures positive semidefiniteness and positive definiteness. To compute projections onto of this form, we implement Dykstra's alternating projection algorithm.
This algorithm iterates between projections onto each of the two sets whose intersection defines . Both projections can be computed in closed form, so this algorithm tends to be very efficient. The gradient of with respect to needed for implementing the algorithm can be found in the Supplementary Material. The algorithm is terminated when the objective function values converge.
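A minimal sketch of the projection step, with the two sets taken to be matrices with eigenvalues bounded below and matrices with prescribed fixed entries (as when pinning diagonal elements for Bernoulli responses). Function names are ours, not the authors' software.

```python
import numpy as np

def proj_eigs(S, eps):
    """Project a symmetric matrix onto {eigenvalues >= eps}."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2.0)
    return (vecs * np.clip(vals, eps, None)) @ vecs.T

def proj_fixed(S, mask, values):
    """Project onto matrices whose entries flagged by `mask` equal `values`."""
    out = S.copy()
    out[mask] = values[mask]
    return out

def dykstra(S, mask, values, eps=1e-6, iters=500):
    """Dykstra's alternating projections onto the intersection of the sets."""
    x = S.copy()
    p = np.zeros_like(S)   # correction term for the eigenvalue set
    q = np.zeros_like(S)   # correction term for the fixed-entry set
    for _ in range(iters):
        y = proj_eigs(x + p, eps)
        p = x + p - y
        x = proj_fixed(y + q, mask, values)
        q = y + q - x
    return x

# Example: repair an indefinite matrix while pinning the (0, 0) entry to 1
S0 = np.array([[2.0, 1.5, 0.0],
               [1.5, 0.2, 0.0],
               [0.0, 0.0, -0.5]])
mask = np.zeros((3, 3), dtype=bool)
mask[0, 0] = True
S_proj = dykstra(S0, mask, np.ones((3, 3)))
```

Unlike averaged alternating projections, Dykstra's corrections make the iterates converge to the projection of the input onto the intersection, not merely to some feasible point.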
Updating the approximation points
We use a trust region algorithm for updating (eg, chapter 4 of Nocedal and Wright
). Essentially, the trust region algorithm approximates the objective function locally by a quadratic and requires the computation of gradients and Hessians. The gradient is given in the Supplementary Material and the Hessian is, assuming is positive definite, for , . Since and are positive definite and the latter does not depend on , the objective function is strongly concave and therefore has a unique maximizer and stationary point. In practice, however, can be singular or near‐singular and the Hessian can have a large condition number. To improve stability, we regularize by (i) adding an ‐penalty on and (ii) replacing by for some small . Then the optimization problem for updating is
where . The intuition for shrinking to is that the latter is the mean of when is the true parameter. The penalty and regularization of are only included in the update for , not in the objective function for updating and . In the Supplementary Material, we outline a procedure for obtaining starting values that can improve computing times and the quality of the resulting estimates relative to naive starting values.
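As a sanity check for this update, in the all-normal, identity-link special case the maximizer of the integrand has a closed form (the usual precision-weighted average from a normal-normal mode computation). The notation below is ours.

```python
import numpy as np

def normal_mode(y, m, Sigma, Lam):
    """Maximizer of w -> log N(y; w, Lam) + log N(w; m, Sigma):
    solves (Sigma^-1 + Lam^-1) w = Sigma^-1 m + Lam^-1 y."""
    A = np.linalg.inv(Sigma) + np.linalg.inv(Lam)
    b = np.linalg.solve(Sigma, m) + np.linalg.solve(Lam, y)
    return np.linalg.solve(A, b)

# Stationarity check: the gradient of the objective vanishes at the mode
rng = np.random.default_rng(4)
r = 3
M = rng.normal(size=(r, r))
Sigma = M @ M.T + r * np.eye(r)               # latent covariance (SPD)
Lam = np.diag(rng.uniform(0.5, 2.0, size=r))  # conditional (dispersion) covariance
y, m = rng.normal(size=r), rng.normal(size=r)
w_star = normal_mode(y, m, Sigma, Lam)
grad = -np.linalg.solve(Sigma, w_star - m) - np.linalg.solve(Lam, w_star - y)
```

For non-normal responses no such closed form exists, which is what motivates the trust region iterations described above.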
APPROXIMATE LIKELIHOOD RATIO TESTING
To make inferences about the parameters we use the approximate negative log‐likelihood , where are held at the final iterates given by Algorithm 1. Focus is on testing hypotheses of the form vs , where and partition the parameter set. If the null hypothesis constrains only, we write for simplicity , and similarly with if the null hypothesis constrains only. We propose the test statistic , where and the approximation points are obtained by running Algorithm 1 with the restrictions implied by and . That is, and are estimates and expansion points, respectively, from fitting the null model, while and are obtained by maximizing the working likelihood from (9) with the expansion points fixed at those obtained by fitting the null model. We fix the expansion points to ensure and are maximizers of the same working likelihood, but under different restrictions. We chose the null model's expansion points to be conservative; that is, to favor the null hypothesis model. If the working model is accurate and contains no boundary points of the parameter set, we expect to be, under the null hypothesis, approximately chi‐square distributed with degrees of freedom equal to the number of restrictions implied by .

A main motivation for our method is inference on the covariance matrix . Null models corresponding to hypotheses that constrain elements of are straightforward to fit by including those constraints in the definition of the set in the update for . For example, to test whether the first and second elements of are independent, fitting the null model corresponds to setting
assuming there are no other restrictions. Similarly, independence of more than two responses can be tested by including more off‐diagonal restrictions in the definition of , and equality of variances for some responses can be tested by including restrictions such as . In principle, our method could also be used to test restrictions such as , corresponding to the first latent variable being constant. However, any null hypothesis which forces some eigenvalues of to be zero corresponds to testing of boundary points, and then the likelihood ratio test statistic has a different asymptotic distribution.
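For a single interior restriction (eg, one off-diagonal element set to zero), the reference distribution is chi-square with one degree of freedom, whose upper tail has a closed form via the complementary error function. A sketch, with illustrative inputs:

```python
import math

def lrt_df1(nll_null, nll_alt):
    """Approximate LRT with one restriction: T = 2 * (nll_null - nll_alt),
    p = P(chi2_1 > T) = erfc(sqrt(T / 2)), nll = negative log-likelihood."""
    T = 2.0 * (nll_null - nll_alt)
    p = math.erfc(math.sqrt(max(T, 0.0) / 2.0))
    return T, p

T, p = lrt_df1(nll_null=105.92, nll_alt=104.0)   # illustrative values
```

For several restrictions the degrees of freedom increase accordingly and a general chi-square tail function is needed; the df = 1 case is shown only because it covers the common test of one off-diagonal element.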
A formal statement of the full algorithm for hypothesis testing is given in Algorithm 2. We investigate the size and power of the proposed procedure in Section 5.5.
NUMERICAL EXPERIMENTS
We are not aware of software, publicly available or otherwise, that fits our model outside of special cases. We therefore compare to existing methods which assume related but somewhat different models. Specifically, when investigating prediction (Section 5.2) and estimation of (Supplementary Material), we compare to separate GLMMs for the different response types. We chose this comparison because the GLMMs have correctly specified marginal distributions in our setting. In particular, the marginal moments and , , are correctly specified under the GLMMs. Thus, we expect the GLMMs to give reasonable estimates of the mean functions, and expect differences in predictive performance relative to our method to be indicative of the usefulness of joint modeling and of estimating the off‐diagonal elements of . We also compare our method, on prediction, estimation of , and estimation of , to multivariate covariance generalized linear models (MCGLMs).
MCGLMs model the marginal moments of and, accordingly, provide estimates of the mean vector and covariance matrix of , but not of . To further highlight the usefulness of modeling dependence, we include results for our method with constrained to be diagonal, that is, with responses assumed independent. Whether constraining to be diagonal or not, diagonal elements corresponding to Bernoulli responses are constrained to equal one when using our method.
Prediction comparisons
We consider response variables, three normally distributed, three Poisson, and three Bernoulli. When fitting separate GLMMs for the three‐dimensional response vectors with elements of the same type, we consider two covariance structures that common software can fit: (i) and for or (ii) , for some and . Option (i) assumes all responses are independent and option (ii) corresponds to using a shared random effect for observations of the same type in the same response vector. We refer to these as independent and clustered GLMMs, respectively. There are many software packages for fitting GLMMs. We pick the glmm
package to fit the independent GLMMs and fit the clustered GLMMs using lme4.
Briefly, the former uses a Monte Carlo approximation of the likelihood and the latter uses adaptive Gaussian quadrature.

Predictions are formed by plugging estimates from the different methods into the expressions for in Section 2, and for simplicity we define prediction errors as the differences between those expectations and the observed responses, regardless of type. When a closed form expression is unavailable, the expectation is obtained by one‐dimensional numerical integration. We compare to (oracle) predictions using the true and .

We next describe how data are generated in the simulations. The responses have different predictors and is partitioned accordingly: , , and . We write for the th observation of the predictors for the th response. In all simulations, each consists of a one in the first element (an intercept) and, in the remaining elements, independent realizations of a random variable, where for all and . For , the true regression coefficient has first element equal to and all other elements chosen as independent realizations of a . We set if the response is normal or (quasi‐)Poisson, and equal to zero if the response is Bernoulli. Similarly, if the response is normal, we set ; otherwise, we set .

We consider three different structures for : for some we set where is (i) autoregressive (); (ii) compound symmetric (); or (iii) block‐diagonal, meaning if , , or , and zero otherwise. The first through third responses are normal, the fourth through sixth Bernoulli, and the seventh through ninth Poisson. Hence, each of the blocks given by the structure in (iii) includes one of each response type. These structures are used to generate the data but are not imposed when fitting models. For all the structures, the GLMMs have correctly specified diagonal elements of , but incorrectly specified off‐diagonal elements in general. For each structure of , we investigate the effects of the sample size (), the number of predictors (, ), and the correlation parameter ().
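The three covariance structures can be generated compactly. The block version below encodes our reading that entries j, k with j congruent to k mod 3 are correlated (so each block contains one response of each type); this is an interpretation of the description above, not code from the paper.

```python
import numpy as np

def ar1(r, rho):
    """Autoregressive: entry (j, k) is rho^|j-k|."""
    idx = np.arange(r)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def compound_symmetric(r, rho):
    """Ones on the diagonal, rho everywhere off the diagonal."""
    return (1.0 - rho) * np.eye(r) + rho * np.ones((r, r))

def block_across_types(r, rho):
    """rho between responses j, k with j = k (mod 3), zero otherwise."""
    idx = np.arange(r)
    S = np.where(idx[:, None] % 3 == idx[None, :] % 3, rho, 0.0)
    np.fill_diagonal(S, 1.0)
    return S

S_ar = ar1(9, 0.5)
S_cs = compound_symmetric(9, 0.3)
S_bl = block_across_types(9, 0.5)
```

All three are positive definite for the correlation values used here, so each is a valid latent covariance up to the scaling described in the text.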
We present relative squared out-of-sample prediction errors, defined as the ratio of a method's sum of squared prediction errors to the sum of squared prediction errors of the oracle predictions. Averages are based on 500 independent replications, and for each replication, out-of-sample predictions are made on an independent test set.

In the top row of Figure 1, as the sample size increases, each method's performance improves relative to the oracle predictions. However, across all settings, the proposed method performs best. When the covariance structure is nonsparse (eg, autoregressive or compound symmetric), the clustered GLMMs can outperform our method with the diagonal covariance matrix and the independent GLMMs. The same relative performances are observed as the number of predictors increases in the middle row, and as the correlation strength increases in the bottom row. When the covariance is block-diagonal, both versions of our method outperform the competitors. In the case of clustered GLMMs, this is likely because the specified covariance structure is a poor approximation to the true covariance. For independent GLMMs, this is likely because neither glmm nor other software can impose the identifiability condition for the Bernoulli responses. MCGLMs perform second best in most settings, can perform similarly to clustered GLMMs when the true covariance structure is nonsparse, and can be outperformed by our method with diagonal covariance when dependence is weak. In summary, the results show the usefulness of joint modeling for prediction, and they illustrate the usefulness of the proposed method relative to ones based on marginal moments.
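The error metric just defined can be sketched directly; this helper is an illustration of the stated definition (the function name is ours), not the paper's evaluation code.

```python
import numpy as np

def relative_sq_error(y_test, pred_method, pred_oracle):
    """Ratio of a method's sum of squared prediction errors to the
    oracle's, on the same test responses; values near 1 mean the
    method predicts almost as well as the oracle."""
    num = np.sum((y_test - pred_method) ** 2)
    den = np.sum((y_test - pred_oracle) ** 2)
    return num / den
```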
FIGURE 1
Average relative squared prediction errors. Top: varying sample size. Middle: varying number of predictors. Bottom: varying correlation strength. glmm indicates clustered GLMMs, glmm-Ind GLMMs with diagonal covariance matrix, mcglm the method of Bonat and Jørgensen, mmrr the proposed method, and mmrr-Ind the proposed method with diagonal covariance matrix
The Supplementary Material includes results on mean squared errors for estimating the regression coefficients, for the same methods and settings used in Figure 1; those results are qualitatively similar to the ones in Figure 1. The Supplementary Material also includes a comparison to separate GLMs, which effectively assume independence between responses and lead to substantially worse predictions than all methods considered here.
Performances for different response types
In Figure 1, averages were taken over all response types. To see whether the benefits of joint modeling are greater for some response types, it is of interest to stratify results by type. We compare the two versions of our method (diagonal vs nondiagonal covariance matrix). Because both versions have correctly specified univariate response distributions and are fit using the same algorithm except for the constraints on the covariance matrix, these simulations investigate the usefulness of joint modeling of mixed-type responses. Data are generated as in the previous section.

In the first row of Figure 2, as the sample size increases, both methods' relative mean squared prediction errors approach the oracle prediction error. However, the predictions from joint modeling outperform those using a diagonal covariance matrix. The differences between the two methods are smallest for Bernoulli responses. A similar result is observed in the second row: as the number of predictors approaches 10, both methods' relative performance degrades, although for all three response types, the predictions from joint modeling degrade more gradually. In the bottom row, we display results as the correlation strength varies. When it is small, there is a less substantial difference between the two methods. As it approaches 0.95, the difference between the two methods becomes greater. This pattern is also observed for the Bernoulli responses, but to a lesser degree than for the normal and Poisson responses.
FIGURE 2
Average relative squared prediction errors. Top: varying sample size. Middle: varying number of predictors. Bottom: varying correlation strength. mmrr is the proposed method and mmrr-Ind the proposed method with diagonal covariance matrix
In Section F.2 of the Supplementary Material, we present a simulation study which focuses on modeling many Bernoulli responses and a single normal response. Those results highlight that even though the relative squared prediction error for the Bernoulli responses is only slightly improved by joint modeling, one can realize substantial gains in prediction accuracy for the single normal response by exploiting its dependence with the Bernoulli responses. We present similar results comparing our method and MCGLMs in the Supplementary Material.
Covariance estimation
To investigate the usefulness of the proposed method for estimating covariances and correlations specifically, we consider a setting without predictors. That is, we consider (2) with an intercept only.

Ideally, we would compare to an established method for estimating the latent covariance matrix in our setting. Since none is available (to the best of our knowledge), we compare the covariance and correlation estimates from our method to moment-based estimates. One is the empirical (or sample) covariance matrix, centered at the sample mean, together with the corresponding empirical correlation matrix. MCGLMs also provide a moment-based estimate of the correlation matrix, which we found to be indistinguishable from the empirical correlation matrix in the present setting without predictors.

We note the comparison to the empirical covariance matrix, or other moment-based estimates not assuming (2), is not ideal because such an estimate cannot in general be mapped to an estimate of the latent covariance matrix. Formally, the issue is that while the observable covariance is injective as a function of the latent covariance, which follows from Theorem 1, that function is not onto. Put differently, not all realizations of the empirical covariance matrix give an estimate consistent with (2). Nevertheless, the empirical covariance matrix is (strongly) consistent when there are no predictors, since the observations are then i.i.d., so we expect reasonable estimates.

The data generating model is as in Section 5.2, but with an intercept only. Figure 3 shows average relative mean squared errors for the diagonal entries of the covariance matrix, stratified by response type; specifically, we report type-specific averages of the relative squared errors of the estimated variances for the normal, Bernoulli, and Poisson components. In Figure 3, we see our method estimates the variances of the Poisson components more accurately than does the sample covariance, but the two perform similarly for normal and Bernoulli responses.
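The moment-based competitors described above can be sketched as follows; this is an illustration of the standard definitions (we divide by the sample size here, though a divisor of n - 1 would change nothing qualitatively), not the paper's code.

```python
import numpy as np

def empirical_cov(Y):
    """Empirical covariance of an n x r response matrix, centered at
    the sample mean (divides by n)."""
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc / Y.shape[0]

def empirical_corr(Y):
    """Empirical correlation matrix obtained by standardizing the
    empirical covariance."""
    S = empirical_cov(Y)
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)
```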
FIGURE 3
Average relative mean squared error for estimating the diagonals of the covariance matrix, stratified by response type. mmrr is the proposed method and Empirical the sample covariance
Figure 4 displays the average mean squared estimation error for the correlation matrix of the latent variables. For smaller sample sizes, the proposed method performs better than MCGLMs. For larger sample sizes, the differences diminish somewhat, which is not surprising given that sample correlations are consistent. As the correlation strength varies, the differences between the methods remain relatively constant, with the proposed method being better in every setting.
FIGURE 4
Average squared Frobenius norm error for estimating the latent correlation matrix. mmrr is the proposed method and mcglm the method of Bonat and Jørgensen
Approximate likelihood ratio testing
We examine the approximate likelihood ratio testing procedure described in Section 4. We study the size and power of the proposed tests of two null hypotheses: that the latent covariance matrix is diagonal (and positive definite) against an unrestricted alternative; and, assuming all responses have the same predictors as in (2), that the first predictor (ignoring the intercept) has no effect on any response. Multiple testing corrections, which are often needed when using separate models for the responses, are not needed here.

Data are generated as in Section 5.3. The top row of Figure 5 displays the proportion of rejections at the 0.05 significance level. When the null hypothesis is true, the proportion of rejections decreases toward the nominal level with the sample size: approximately 0.10 at the smallest sample size, below 0.075 at the intermediate one, and near 0.05 at the largest. As the correlation strength increases, even at moderate sample sizes, the proportion of correctly rejected null hypotheses is near one. The power depends positively on both the correlation strength and the sample size.
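The testing procedure compares twice the gap in maximized (approximate) log-likelihoods to a chi-squared reference distribution. The sketch below illustrates this generic calculation; the log-likelihood values are hypothetical placeholders, and the degrees of freedom shown correspond to the number of free off-diagonal parameters, r(r - 1)/2, for the diagonal-covariance null with r = 9 responses as in the simulations.

```python
from scipy.stats import chi2

def lr_test(loglik_null, loglik_full, df):
    """Approximate likelihood ratio test: twice the log-likelihood gap
    between the restricted (null) and unrestricted fits, referred to a
    chi-squared distribution with df degrees of freedom."""
    stat = 2.0 * (loglik_full - loglik_null)
    pval = chi2.sf(stat, df)
    return stat, pval

r = 9  # nine responses, as in the simulations
# Hypothetical maximized approximate log-likelihoods under the two fits:
stat, pval = lr_test(loglik_null=-1510.2, loglik_full=-1498.7,
                     df=r * (r - 1) // 2)
```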
FIGURE 5
Top: proportion of rejections of the null hypothesis of a diagonal covariance matrix at the 0.05 level. Bottom: proportion of rejections of the null hypothesis of no first-predictor effect at the 0.05 level. Horizontal dashed and dotted black lines indicate 0.10 and 0.05
In the second setting, we fix the sample size and study how the effect size affects power. After generating the data as in Section 5.3, we replace the relevant regression coefficients with independent normal realizations whose variance controls the effect size. The second row of Figure 5 shows that when the effect size is zero, so that the null hypothesis holds, the proportion of rejections is slightly above 0.10 at the smallest sample size, but close to 0.05 (the nominal size) otherwise. There is also an indication that correlation between responses benefits power. For example, the power curves under compound symmetry tend to be above the corresponding ones under the block-diagonal structure.
DATA EXAMPLES
Fertility data
We consider a dataset collected on 333 women who were having difficulty becoming pregnant.
The goal is to model four mixed-type response variables related to the ability to conceive. The predictors are age and three variables related to antral follicles: small antral follicle count, average antral follicle count, and maximum follicle stimulating hormone level. Antral follicle counts can be measured via noninvasive ultrasound and are therefore often used to model fertility.

The response variables quantify the ability to conceive in different ways. Two are approximately normally distributed (square-root estradiol level and log-total gonadotropin level) and two are counts (number of egg cells and number of embryos). We modeled the latter using our model with conditional quasi-Poisson distributions, with the model specification chosen separately for the continuous responses and the counts.

To illustrate how the proposed methods can be applied, we test the null hypothesis that the latent covariance matrix is diagonal and find evidence against it using the test described in Section 5.5. That is, there is evidence suggesting the four responses are not independent given the predictors. Fitting the unrestricted model using our software took less than three seconds on a laptop computer with a 2.3 GHz 8-core Intel Core i9 processor. The hypothesis testing procedure took less than six seconds on the same machine.

The estimated correlation matrices for the four observed responses and for the latent variables are, respectively,
where the variable ordering is square-root estradiol level, log-total gonadotropin level, number of egg cells, and number of embryos. The estimated correlation matrix of the observed responses is evaluated at the fitted parameter values. The estimates indicate substantial positive correlation between the number of egg cells and the number of embryos, whereas estradiol and gonadotropin levels appear weakly negatively correlated with these two variables.

We also test whether the small antral follicle count is a significant predictor of any of the responses after accounting for age, average antral follicle count, and maximum follicle stimulating hormone level. The number of small antral follicles (2-5 mm) is correlated with the total number of antral follicles (2-10 mm), and it has been argued that only the total antral follicle count is needed in practice.
Fitting our model, we reject the null hypothesis that the four regression coefficients (one for each response) corresponding to small antral follicle count are all zero.

Finally, to illustrate how uncertainty can be quantified using the approximate likelihood, we construct an approximate 95% confidence interval by numerically inverting the proposed approximate likelihood ratio test statistic. This gives a confidence interval corresponding to correlations between 0.62 and 0.8. Confidence intervals for the other parameters could be constructed similarly.
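Inverting a likelihood ratio test to obtain a confidence interval means collecting all parameter values the test does not reject at the chosen level. The toy sketch below illustrates the numerical inversion for the mean of i.i.d. normal data with known unit variance (our example, not the paper's model or data); the paper inverts its approximate likelihood ratio statistic for a covariance parameter in the same way.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=100)  # toy data

def loglik(mu):
    # Log-likelihood up to an additive constant, known unit variance.
    return -0.5 * np.sum((y - mu) ** 2)

mu_hat = y.mean()               # MLE of the mean
crit = chi2.ppf(0.95, df=1)     # 95% critical value, 1 degree of freedom

def lr_stat(mu):
    # Likelihood ratio statistic for H0: mean equals mu.
    return 2.0 * (loglik(mu_hat) - loglik(mu))

# The endpoints of the 95% CI are the roots of lr_stat(mu) - crit
# on either side of the MLE.
lo = brentq(lambda m: lr_stat(m) - crit, mu_hat - 5.0, mu_hat)
hi = brentq(lambda m: lr_stat(m) - crit, mu_hat, mu_hat + 5.0)
```

In this toy case the interval coincides with the familiar Wald interval, but the inversion approach carries over to parameters, such as latent correlations, whose profile likelihoods are not quadratic.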
Somatic mutations and gene expression in breast cancer
We now focus on jointly modeling common somatic mutations and gene expression measured on patients with breast cancer, collected by The Cancer Genome Atlas (TCGA) project. A somatic mutation is an alteration in the DNA of a somatic cell. Somatic mutations are believed to play a central role in the development of cancer. Because somatic mutations modify gene expression, directly and indirectly, it is natural to model somatic mutations and gene expression jointly.

The somatic mutation variables are binary, indicating the presence or absence of a somatic mutation in the region of a particular gene. We focus on the ten genes in which a somatic mutation was present in more than 5% of subjects. Thus, we have twenty responses coming from ten genes, each with one response corresponding to gene expression and one to the presence of a somatic mutation. For gene expression, we model log-transformed RPKM measurements as normal random variables. Each patient's age is included as a predictor.

We test the covariance matrix for block-diagonality. Under the null hypothesis, the entries corresponding to correlations between somatic mutations and gene expression measurements are zero (ie, there is no correlation between somatic mutations and gene expression). The observed statistic, compared to its asymptotic chi-squared distribution with 100 degrees of freedom, provides evidence against block-diagonality.

Figure 6 displays the estimated correlation matrix for the latent variables. We observe the latent variables corresponding to somatic mutation and gene expression in CDH1 are highly negatively correlated, whereas for GATA3, the somatic mutation and gene expression latent variables have a strong positive correlation. Latent variables for many of the somatic mutations are highly correlated (eg, TTN, MLL3, MUC4, MUC12, MUC16). However, latent variables corresponding to some somatic mutations, for example those in the region of TP53, exhibit small or even negative correlations with many others (eg, GATA3, CDH1, PIK3CA). Confidence intervals could be constructed as in Example 6.1.
FIGURE 6
Heatmap of the estimated correlation matrix for the latent variables in Section 6.2. Suffix -SM indicates somatic mutations; -GEx gene expression
DISCUSSION
We have proposed a likelihood-based method for mixed-type multivariate response regressions. Our method gives an approximate maximum likelihood estimate of an unstructured covariance matrix characterizing the dependence between responses. This is particularly useful in settings where a dependence structure is not suggested by subject-specific knowledge, or where one wants to discover such a structure from data. To address the computational challenges of estimating an unstructured covariance matrix, we have proposed a new algorithm. The algorithm handles identifiability constraints and scales well in the dimension of the response vectors.

An advantage of likelihood-based methods over ones based on marginal moments, such as generalized estimating equations and its extensions, is the plethora of existing procedures for inference and model selection using likelihoods. For example, Wald tests and standard errors for the coefficient estimates, based on the observed Fisher information, are readily available once maximum likelihood estimates have been computed, as are information criteria. We also used the likelihood to propose a testing procedure for the covariance matrix. On the other hand, evaluating the likelihood in latent variable models is often computationally expensive and can lead to complicated optimization problems. We addressed this using a PQL-like approximation and a new algorithm, and showed the resulting approximate maximum likelihood estimates are often useful. Extending our method to other likelihood approximations is a possibility; for example, the PQL approximation could be replaced by a Laplace or Monte Carlo approximation. Another potential line of future research is the asymptotic properties of PQL-type estimators, which, despite the estimators' apparent practical usefulness, are not fully understood.

Finally, we note some types of mixed-type longitudinal data are handled by our method as-is, while others would require modifications or would be better analyzed using methods developed for that purpose. For example, our method can be applied when the elements of each response vector are repeated measures of mixed-type responses on independently sampled patients. Modifications may be necessary if one wants to impose a particular structure on the covariance matrix, or if there is dependence between patients.