
Asymptotic theory for maximum likelihood estimates in reduced-rank multivariate generalized linear models.

E Bura1,2, S Duarte3, L Forzani3, E Smucler4,5, M Sued5.   

Abstract

Reduced-rank regression is a dimensionality reduction method with many applications. The asymptotic theory for reduced rank estimators of parameter matrices in multivariate linear models has been studied extensively. In contrast, few theoretical results are available for reduced-rank multivariate generalized linear models. We develop M-estimation theory for concave criterion functions that are maximized over parameter spaces that are neither convex nor closed. These results are used to derive the consistency and asymptotic distribution of maximum likelihood estimators in reduced-rank multivariate generalized linear models, when the response and predictor vectors have a joint distribution. We illustrate our results in a real data classification problem with binary covariates.

Keywords:  M-estimation; exponential family; non-convex parameter spaces; rank restriction

Year:  2018        PMID: 30174379      PMCID: PMC6101205          DOI: 10.1080/02331888.2018.1467420

Source DB:  PubMed          Journal:  Statistics (Ber)        ISSN: 0233-1888            Impact factor:   1.051


Introduction

The multivariate multiple linear regression model for a q-dimensional response Y and a p-dimensional predictor vector X postulates that Y = BX + ε, where B is a q × p matrix and ε is the error term, with E(ε) = 0 and Cov(ε) = Σ. Based on a random sample (Y_i, X_i), i = 1, …, n, satisfying the model, ordinary least squares estimates the parameter matrix B by minimizing the squared error loss function

  L(B) = Σ_{i=1}^n ||Y_i − B X_i||²   (1)

to obtain B̂_ols. Reduced-rank regression introduces a rank constraint on B, so that Equation (1) is minimized subject to the constraint rank(B) ≤ r, where r < min(p, q). The solution is B̂_r = Û_r Û_r′ B̂_ols, where the columns of Û_r are the first r left singular vectors of the fitted-value matrix (see, e.g. [1]). Reduced-rank regression has attracted attention as a regularisation method by introducing a shrinkage penalty on B. Moreover, it is used as a dimensionality reduction method as it constructs latent factors in the predictor space that explain the variance of the responses. Anderson [2] obtained the likelihood-ratio test of the hypothesis that the rank of B is a given number and derived the associated asymptotic theory under the assumption of normality of Y and non-stochastic X. Under the assumption of joint normality of Y and X, Izenman [3] obtained the asymptotic distribution of the estimated reduced-rank regression coefficient matrix and drew connections between reduced-rank regression, principal component analysis and correlation analysis. Specifically, principal component analysis coincides with reduced-rank regression when Y = X [4]. Recently, Fan et al. [5] studied the theoretical properties of nuclear norm regularized maximum likelihood type estimates in a class of models that includes reduced-rank regression. They derive statistical rates of convergence in possibly high-dimensional scenarios. However, they do not provide asymptotic results that can be used to construct confidence intervals or hypothesis tests. The monograph by Reinsel and Velu [1] contains a comprehensive survey of the theory and history of reduced-rank regression, including in time series, and its many applications.
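As an illustration of the estimator just described, the following numpy sketch (our own illustration, not code from the paper) computes the reduced-rank least squares fit by truncating the OLS estimate; the simulated data, variable names and the identity error covariance are all assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 500, 6, 4, 2

# Simulate Y = B X + error with a true coefficient matrix B of rank r.
B_true = rng.normal(size=(q, r)) @ rng.normal(size=(r, p))
X = rng.normal(size=(n, p))
Y = X @ B_true.T + 0.1 * rng.normal(size=(n, q))

# Unrestricted OLS estimate of the q x p matrix B.
B_ols = np.linalg.lstsq(X, Y, rcond=None)[0].T

# Reduced-rank estimate: project B_ols onto the span of the first r
# response-space singular vectors of the fitted values.
fitted = X @ B_ols.T                     # n x q fitted-value matrix
_, _, Vt = np.linalg.svd(fitted, full_matrices=False)
V_r = Vt[:r].T                           # q x r leading directions
B_rrr = V_r @ V_r.T @ B_ols              # rank-r estimate of B
```

With low noise and a truly rank-r coefficient matrix, `B_rrr` has rank exactly r and is close to `B_true`.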
Despite the application potential of reduced-rank regression, it has received limited attention in generalized linear models. Yee and Hastie [6] were the first to introduce reduced-rank regression to the class of multivariate generalized linear models, which covers a wide range of data types for both the response and predictor vectors, including categorical data. Multivariate or vector generalized linear models (VGLMs) are the topic of Yee's [7] book, which is accompanied by the associated R packages, VGAM and VGAMdata. Yee and Hastie [6] proposed an alternating estimation algorithm, which was shown to result in the maximum likelihood estimate of the parameter matrix in reduced-rank multivariate generalized linear models by Bura et al. [8]. Asymptotic theory for the restricted rank maximum likelihood estimates of the parameter matrix in multivariate GLMs has not been developed yet. In general, a maximum likelihood estimator is a concave M-estimator in the sense that it maximizes the empirical mean of a concave criterion function. Asymptotic theory for M-estimators defined through a concave function has received much attention. Huber [9], Haberman [10] and Niemiro [11] are among the classical references. More recently, Hjort and Pollard [12] presented a unified framework for the statistical theory of M-estimation for convex criterion functions that are minimized over open convex sets of a Euclidean space. Geyer [13] studied M-estimators restricted to a closed subset of a Euclidean space. The rank restriction in reduced-rank regression imposes constraints that have not been studied before in M-estimation, as they result in neither convex nor closed parameter spaces. In this paper we (a) develop M-estimation theory for concave criterion functions, which are maximized over parameter spaces that are neither convex nor closed, and (b) apply the results from (a) to obtain asymptotic theory for reduced-rank regression estimators in generalized linear models.
Specifically, we derive the asymptotic distribution and properties of maximum likelihood estimators in reduced-rank multivariate generalized linear models where both the response and predictor vectors have a joint distribution. The asymptotic theory we develop covers reduced-rank regression for linear models as a special case. We show the improvement in inference the asymptotic theory offers by re-analysing the data set of Yee and Hastie [6]. Throughout, for a real-valued function f, ∂f(ξ)/∂ξ′ denotes the row vector of partial derivatives, ∇f(ξ) stands for the column vector of derivatives, while ∇²f(ξ) denotes the symmetric matrix of second-order derivatives. For a vector-valued function g, ∇g(ξ) denotes the matrix of partial derivatives with (i, j) entry ∂g_j(ξ)/∂ξ_i.

M-estimators

Let Z be a random vector taking values in a measurable space and distributed according to the law P. We are interested in estimating a finite dimensional parameter ξ0 using n independent and identically distributed copies Z_1, …, Z_n of Z. In the sequel, we use Pf to denote the mean of f(Z); i.e. Pf = ∫ f dP. Let Ξ be a subset of a Euclidean space and m_ξ(z) a known function. Assume that the parameter of interest ξ0 is the maximizer of the map

  ξ ↦ M(ξ) = P m_ξ   (2)

defined on Ξ. One can estimate ξ0 by maximizing an empirical version of the optimization criterion. Specifically, define M_n(ξ) = P_n m_ξ, where, here and throughout, P_n denotes the empirical mean operator P_n f = (1/n) Σ_{i=1}^n f(Z_i). Hereafter, m_ξ and M_n will be used interchangeably as the criterion function, depending on which of the two is appropriate in a given setting. Assume that ξ0 is the unique maximizer of the deterministic function M defined in Equation (2). An M-estimator for the criterion function over Ξ is defined as

  ξ̂_n = argmax_{ξ ∈ Ξ} M_n(ξ).   (3)

If the maximum of the criterion function over Ξ is not attained but the supremum of M_n over Ξ is finite, any value that almost maximizes the criterion function, in the sense that it satisfies

  M_n(ξ̂_n) ≥ sup_{ξ ∈ Ξ} M_n(ξ) − ε_n   (4)

for ε_n small, can be used instead. An estimator that satisfies Equation (4) with ε_n = o_P(1) is called a weak M-estimator for the criterion function over Ξ. When ε_n = o_P(n^{-1}), it is called a strong M-estimator. Proposition A.1 in the Appendix lists the conditions for the existence, uniqueness and strong consistency of an M-estimator, as defined in Equation (3), when m_ξ is concave and the parameter space is convex. Under regularity conditions, such as those stated in Theorem 5.23 in van der Vaart [14], the asymptotic expansion and distribution of a consistent strong M-estimator [see Definition 2.1] for ξ0 is given by

  √n(ξ̂_n − ξ0) = (1/√n) Σ_{i=1}^n ψ(Z_i) + o_P(1),   (5)

where ψ(z) = −V^{-1} ∇m_{ξ0}(z), and V = ∇²M(ξ0) is the non-singular symmetric second derivative matrix of M at ξ0.
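The expansion (5) suggests the usual plug-in "sandwich" estimate of the asymptotic variance, V̂^{-1} P_n[∇m ∇m′] V̂^{-1}. The following self-contained numpy sketch (our own illustration, not from the paper) uses logistic regression, whose log-likelihood is a concave criterion, to show the computation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, -1.0])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

# Concave criterion per observation: m_beta(z) = y * X beta - log(1 + exp(X beta)).
# Newton-Raphson on the empirical criterion M_n.
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - mu) / n              # P_n grad m_beta
    V = -(X.T * (mu * (1 - mu))) @ X / n   # second derivative matrix of M_n
    beta = beta - np.linalg.solve(V, grad)

# Sandwich estimate of the asymptotic variance appearing in Equation (5).
mu = 1 / (1 + np.exp(-X @ beta))
score = X * (y - mu)[:, None]
S = score.T @ score / n                    # P_n[grad m grad m']
Vinv = np.linalg.inv(V)
avar = Vinv @ S @ Vinv
se = np.sqrt(np.diag(avar) / n)            # plug-in standard errors
```

For a correctly specified likelihood the sandwich reduces (asymptotically) to the inverse information, but the formula is the same one that reappears for restricted estimators below.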

Restricted M-estimators

We now consider the optimization of M over S = g(Θ), where S is the image of a function g that is not necessarily injective. Specifically, we restrict the optimization problem to the set S by requiring:

Condition 2.2 There exists an open set Θ in a Euclidean space and a map g: Θ → Ξ such that S = g(Θ).

Even when an M-estimator for the unrestricted problem as defined in (3) exists, there is no a priori guarantee that the supremum is attained when considering the restricted problem. Nevertheless, Lemma 2.3 establishes the existence of a restricted strong M-estimator, regardless of whether the original M-estimator is a real maximizer, weak or strong. All proofs are provided in the Appendix.

Lemma 2.3 Assume there exists a weak/strong M-estimator for the criterion function over Ξ. If Condition 2.2 holds, then there exists a strong M-estimator for the criterion function over S.

Proposition 2.4 next establishes the existence and consistency of a weak restricted M-estimator sequence when m_ξ is a concave function in ξ, under the same assumptions as those of the unrestricted problem, stated in Proposition A.1 in the Appendix.

Proposition 2.4 Assume that Condition 2.2 holds. Then, under the assumptions of Proposition A.1 in the Appendix, there exists a strong M-estimator of the restricted problem over S. Moreover, any strong M-estimator of the restricted problem converges to ξ0 in probability.

We derive next the asymptotic distribution of ξ̂_n = g(θ̂_n), with θ̂_n ∈ Θ. The constrained estimator is well defined under Lemma 2.3, even when θ̂_n is not uniquely determined. If θ̂_n were unique and had an asymptotic distribution, one could use a Taylor series expansion for g to derive asymptotic results for ξ̂_n. Building on this idea, Condition 2.5 introduces a parametrization of a neighbourhood of ξ0 that allows applying standard tools in order to obtain the asymptotic distribution of the restricted M-estimator.

Condition 2.5 Given ξ0 ∈ S, there exists an open set U in S with ξ0 ∈ U and U = h(W), where W is an open set in a Euclidean space and h: W → U is one-to-one, bi-continuous and twice continuously differentiable, with ξ0 = h(w0) for some w0 ∈ W.
Under the setting in Condition 2.5, we will prove that ŵ_n = h^{-1}(ξ̂_n) is a strong M-estimator for the criterion function over W. Then, we can apply Theorem 5.23 of van der Vaart [14] to obtain the asymptotic behaviour of ŵ_n, which, combined with a Taylor expansion of h about w0, yields a linear expansion for ξ̂_n = h(ŵ_n). Finally, requiring Condition 2.6, which relates the parametrizations g and h, suffices to derive the asymptotic distribution of ξ̂_n in terms of g.

Condition 2.6 Consider g as in Condition 2.2 and h as in Condition 2.5. For each θ ∈ g^{-1}(U), span ∇g(θ) = span ∇h(h^{-1}(g(θ))).

Condition 2.6 ensures that span ∇g(θ) is well defined regardless of the fact that g^{-1}(ξ) may contain multiple θ's. Moreover, span ∇h(w0) also agrees with span ∇g(θ0). Consequently, the orthogonal projection P onto span ∇h(w0) with respect to the inner product defined by a symmetric positive definite matrix Σ satisfies

  P = ∇h(w0) (∇h(w0)′ Σ ∇h(w0))^− ∇h(w0)′ Σ,   (6)

where A^− denotes a generalized inverse of the matrix A. The gradient of g is not necessarily of full rank, in contrast to the gradient of h. Note that P is idempotent (P² = P) and the span of its columns is equal to span ∇h(w0). However, P is not in general symmetric, but rather self-adjoint with respect to the inner product induced by Σ since ΣP = P′Σ.

Proposition 2.7 Assume that Conditions 2.2, 2.5 and 2.6 hold. Assume also that the unrestricted problem satisfies the regularity Conditions A.2–A.4 in the Appendix. Then, any strong M-estimator ξ̂_n of the restricted problem that converges in probability to ξ0 satisfies

  √n(ξ̂_n − ξ0) = P (1/√n) Σ_{i=1}^n ψ(Z_i) + o_P(1),   (7)

where ψ is the influence function of the unrestricted estimator defined in Equation (5), V is the non-singular symmetric second derivative matrix of M at ξ0 and P is defined according to Equation (6). Moreover, √n(ξ̂_n − ξ0) is asymptotically normal with mean zero and asymptotic variance P A P′, where A is the asymptotic variance of the unrestricted estimator. As an aside remark, we conjecture that for estimators that are maximizers of a criterion function under restrictions that satisfy Conditions 2.2–2.6, when Equation (5) is true, Equation (7) also holds. This can be important since the asymptotic distribution of restricted estimators will be derived directly from the asymptotic distribution of the unrestricted one.
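The algebraic properties of the projection in Equation (6) are easy to verify numerically. A small numpy sketch (our own illustration; H stands in for the gradient ∇h(w0) and Σ for an arbitrary positive definite inner-product matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
m, t = 6, 3
H = rng.normal(size=(m, t))          # stand-in for grad h(w0), full column rank
A = rng.normal(size=(m, m))
Sigma = A @ A.T + m * np.eye(m)      # symmetric positive definite

# Orthogonal projection onto span(H) w.r.t. <u, v> = u' Sigma v, as in (6):
P = H @ np.linalg.pinv(H.T @ Sigma @ H) @ H.T @ Sigma

assert np.allclose(P @ P, P)                 # idempotent
assert np.allclose(Sigma @ P, P.T @ Sigma)   # self-adjoint w.r.t. Sigma
assert np.allclose(P @ H, H)                 # fixes span(H)
```

As the text notes, P is generally not symmetric in the Euclidean sense, only Σ-self-adjoint.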
The optimization problem that defines the restricted M-estimators considered in this paper is, in general, not convex and hence difficult to solve. However, in the case of maximum likelihood estimation in reduced-rank multivariate generalized linear models, the main application of the results in our paper, efficient algorithms exist for solving it (see [6] and the VGAM R package).

Asymptotic theory for the maximum likelihood estimator in reduced rank multivariate generalized linear models

In this section we show that maximum likelihood estimators in reduced-rank multivariate generalized linear models are restricted strong M-estimators for the conditional log-likelihood. Using results in Section 2.1, we obtain the existence, consistency and asymptotic distribution of maximum likelihood estimators in reduced-rank multivariate generalized linear models. In practice, these estimators can be obtained using the R package VGAM, developed by Yee [15].

Exponential family

Let Y be a q-dimensional random vector and assume that its distribution belongs to a k-parameter canonical exponential family with pdf (pmf)

  f(y | η) = exp{η′ T(y) − ψ(η)} h(y),

where T(y) = (T_1(y), …, T_k(y))′ is a vector of known real-valued functions, h is a non-negative known function and η is the vector of natural parameters, taking values in

  H = { η ∈ R^k : ∫ exp{η′ T(y)} h(y) dy < ∞ },

where the integral is replaced by a sum when Y is discrete. The natural parameter space H is assumed to be open and convex in R^k, and ψ a strictly convex function defined on H. Moreover, we assume ψ is infinitely differentiable in H. In particular,

  E[T(Y)] = ∇ψ(η)  and  Cov[T(Y)] = ∇²ψ(η),   (11)

where ∇²ψ(η) is the matrix of second derivatives of ψ. Since ψ is strictly convex, ∇²ψ(η) is non-singular for every η ∈ H.
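The moment identities above can be checked by simulation for a concrete family, e.g. the Poisson, where T(y) = y and ψ(η) = exp(η), so that the mean and variance of T(Y) both equal exp(η) (a sketch of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
eta = 0.7                          # natural parameter of the Poisson family
lam = np.exp(eta)                  # psi'(eta) = psi''(eta) = exp(eta)

# Monte Carlo check of E[T(Y)] = grad psi and Var[T(Y)] = Hessian of psi.
y = rng.poisson(lam, size=200_000)
print(abs(y.mean() - lam) < 0.02)  # mean matches psi'(eta)
print(abs(y.var() - lam) < 0.05)   # variance matches psi''(eta)
```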

Multivariate generalized linear models

Let (Y, X) be a random vector, where now Y is a multivariate response and X ∈ R^p is a vector of predictors. The multivariate generalized linear model postulates that the conditional distribution of Y given X belongs to some fixed exponential family and hypothesizes that the k-vector of natural parameters, which we henceforth call η_x to emphasize the dependence on x, depends linearly on the vector of predictors. Thus, the pdf (pmf) of Y | X = x is given by

  f(y | x) = exp{η_x′ T(y) − ψ(η_x)} h(y),   (12)

where η_x depends linearly on x. Frequently, a subset of the natural parameters depends on x, whereas its complement does not. The normal linear model with constant variance is such an example. To accommodate this structure, we partition the vector η indexing model (12) into η(1) and η(2), with k1 and k2 components, and assume that H, the natural parameter space of the exponential family, is H1 × H2, where H2 is an open convex subset of R^{k2}. We also assume that

  η_x(1) = α + C x,   (13)

where α ∈ R^{k1} and C ∈ R^{k1 × p}, while η(2) does not depend on x. Let ξ = (α, vec(C), η(2)) denote a generic parameter vector and ξ0 the true parameter. Suppose n independent and identically distributed copies of (Y, X) satisfying Equations (12) and (13), with true parameter vector ξ0, are available. Given a realization (y_i, x_i), i = 1, …, n, the conditional log-likelihood, up to a term that does not depend on the parameter of interest, is

  (1/n) Σ_{i=1}^n [ η_{x_i}′ T(y_i) − ψ(η_{x_i}) ].   (14)

Let

  m_ξ(y, x) = η_x′ T(y) − ψ(η_x).   (15)

By definition (3), the maximum likelihood estimator (MLE) of the parameter indexing model (12) subject to (13), if it exists, is an M-estimator. Theorem 3.1 next establishes the existence, consistency and asymptotic normality of ξ̂_n, the MLE of ξ0. Assume that (Y, X) satisfies model (12) subject to (13) with true parameter ξ0. Under regularity conditions (A20), (A21), (A22) and (A23) in the Appendix, the maximum likelihood estimate of ξ0 exists, is unique and converges in probability to ξ0. Moreover, √n(ξ̂_n − ξ0) is asymptotically normal with covariance matrix V^{-1}, where V, defined in Equation (16), is the Fisher information matrix of the model, expressed in terms of the matrix ∇²ψ that was defined in Equation (11).

Partial reduced rank multivariate generalized linear models

When the number of natural parameters or the number of predictors is large, the precision of the estimation and/or the interpretation of results can be adversely affected. A way to address this is to assume that the parameters live in a lower dimensional space. That is, we assume that the vector of predictors can be partitioned as x = (x1, x2), with x1 ∈ R^{p1} and x2 ∈ R^{p2}, and that the parameter matrix corresponding to x2, C2, has rank d. In this way, the natural parameters in Equation (12) are related to the predictors via

  η_x(1) = α + C1 x1 + C2 x2,   (17)

where C2 belongs to the set of matrices in R^{k1 × p2} of rank d ≤ min(k1, p2), while α ∈ R^{k1} and C1 ∈ R^{k1 × p1}. Following Yee and Hastie [6], we refer to the exponential conditional model (12) subject to the restrictions imposed in Equation (17) as the partial reduced rank multivariate generalized linear model. The reduced-rank multivariate generalized linear model is a special case of model (17) with p1 = 0. To obtain the asymptotic distribution of the M-estimators for this reduced model, we will show that Conditions 2.2–2.6 are satisfied for Ξ, S, Θ, g and h, which are defined next. To maintain consistency with notation introduced in Section 2.1, we vectorize each matrix involved in the parametrization of our model and reformulate the parameter space accordingly for each vectorized object. With this understanding, we use the symbol ≅ to indicate that a matrix space component in a product space is identified with its image through the vec operator. In the sequel, to keep the notation as simple as possible, we concatenate column vectors without transposing them; that is, we write (u, v) for (u′, v′)′. Moreover, we write β, with the understanding that β stands for the vectorized parameter. For the non-restricted problem, C2 belongs to R^{k1 × p2}, so that the entire parameter belongs to the set Ξ defined in Equation (18). However, for the restricted problem, we assume that the true parameter belongs to the set S defined in Equation (19). Without loss of generality, we assume that the first d rows of C2 are linearly independent. Therefore, C2 is determined by its first d rows together with the (k1 − d) × d matrix of coefficients expressing its remaining rows as linear combinations of them. This is trivial since rank(C2) = d and its first d rows are linearly independent.
Consider the remaining sets introduced in Equations (20)–(23). Finally, let Θ be the resulting parameter set and g be the map defined in Equation (24).

Lemma 3.2 Conditions 2.2–2.6 are satisfied for Ξ, S, Θ, g and h defined in Equations (18)–(24), respectively.

Under the rank restriction on C2 in Equation (17), the existence of the maximum likelihood estimator cannot be guaranteed in the sense of an M-estimator as defined in Equation (3), with m_ξ in Equation (15) and Ξ replaced by S in Equation (19). However, we can work with a strong M-estimator sequence for the criterion function over S using Lemma 2.3. Theorem 3.3 states our main contribution.

Theorem 3.3 Let ξ0 denote the true parameter value of ξ. Assume that (Y, X) satisfies model (12) subject to (17), with S defined in Equation (19). Then, there exists a strong maximizing sequence ξ̂_n for the criterion function over S, for m_ξ defined in Equation (15). Moreover, any weak M-estimator sequence converges to ξ0 in probability. If ξ̂_n is a strong M-estimator sequence, then √n(ξ̂_n − ξ0) is asymptotically normal with covariance matrix

  P V^{-1} P′,   (25)

where V is defined in Equation (16) and P = G (G′ V G)^− G′ V, with G defined in Equation (26) in terms of A and B for any decomposition C2 = A B′.

The asymptotic variance of the estimator does not depend on the specific decomposition of C2. That is, for another decomposition, even if the matrix G in (26) changes, (25) remains the same. A plug-in procedure can be used to estimate the elements of the matrix in Equation (25) and then, appealing to Slutsky's theorem, asymptotic confidence regions for ξ0 can be constructed. Also, bootstrap variance estimates can be computed by sampling with replacement from the empirical distribution, leading to so-called normal bootstrap intervals (see, e.g. [16], Section 8.3). Since P is a projection, V^{-1} − P V^{-1} P′ is positive semi-definite. That is, the eigenvalues of the difference between the unrestricted and restricted asymptotic variances are non-negative, so that using partial reduced-rank multivariate generalized linear models results in an efficiency gain. Decomposing x into more sets of predictors with reduced rank parameters amounts to adding new rows, like the second set of rows of G in Equation (26), for each reduced rank parameter.
For more general families, like the one considered by Yee and Hastie [6], if the family in question satisfies the regularity conditions in Proposition 2.7, the asymptotic variance of the estimators can also be computed by making the corresponding substitutions in the formulas of Proposition 2.7.

Application: marital status in a workforce study

Yee and Hastie [6] analyse data from a self-administered questionnaire collected in a large New Zealand workforce observational study conducted during 1992–1993. For homogeneity, the analysis was restricted to a subset of 4105 European males with no missing values in any of the variables used. Yee and Hastie [6] were interested in exploring whether certain lifestyle and psychological variables were associated with marital status, especially separation/divorce. The response variable is marital status, with levels single, separated or divorced, widower, and married or living with a partner. The married/partnered are the reference group. Data on 14 regressors were collected, 12 of which are binary (1/0 for presence/absence, respectively). These have been coded so that their presence is negative healthwise. The goal was to investigate if and how these 12 unhealthy variables were related to Y, adjusting for age and level of education. The variables are described in Table 1.
Table 1.

Variables used in the workforce study.

Variable name   Description
marital         Marital status. 1=single, 2=separated or divorced, 3=widower, and 4=married or living with a partner
age30           Age − 30, in years
logedu1         log(1 + years of education at secondary or high school)
binge           In the last 3 months what is the largest number of drinks that you had on any one day? (1=20 or more, 0=less than 20)
smokenow        Current smoker?
sun             Does not usually wear a hat, shirt or suntan lotion when outside during summer
nerves          Do you suffer from 'nerves'?
nervous         Would you call yourself a 'nervous' person?
hurt            Are your feelings easily hurt?
tense           Would you call yourself tense or 'highly strung'?
miserable       Do you feel 'just miserable' for no reason?
fedup           Do you often feel 'fed-up'?
worry           Do you worry about awful things that might happen?
worrier         Are you a worrier?
mood            Does your mood often go up and down?
A categorical response Y taking values 1, 2, 3, 4 with probabilities p1, p2, p3, p4 can be expressed as a multinomial vector to fit the generalized linear model presented in this paper, Y = (Y1, Y2, Y3)′, where Y_i = 1 if Y = i and Y_i = 0 otherwise, and Y4 = 1 − Y1 − Y2 − Y3. The pdf of Y can be written in the exponential family form (27), with natural parameter η = (η1, η2, η3)′ related to the pdf of Y through the identity η_i = log(p_i/p_4), i = 1, 2, 3. Let X be the vector of predictor variables and consider η as a function of x. The dependence of η on x will not be made explicit in the notation. As in [6], we fit a multinomial regression model, as in Equation (27), with

  η_x = α + C x,   (28)

where α is the intercept and C is the 3 × 14 coefficient matrix, so that there are 45 parameters to estimate. When a multinomial linear model is fitted to the data at level 0.05, age30 and binge are significant for log(p1/p4), smokenow and tense for log(p2/p4), and only age30 is significant for log(p3/p4). We next fitted a partial reduced rank multivariate generalized linear model, where the two continuous variables, age30 and logedu1, were not subject to restriction. That is,

  η_x = α + C1 x1 + C2 x2,   (29)

where x1 represents the continuous variables and x2 the 12 binary predictors. The AIC criterion estimates the rank of C2 in Equation (29) to be one (see [6]). Using the asymptotic results from the current paper, Duarte [17] developed a test based on Bura and Yang [18] that also estimates the dimension to be 1. Therefore, in our notation, q = 4, p = 14, r = 2 and d = 1. The rank restriction results in a drastic reduction in the total number of parameters, from 45 to 24. The reduction in the estimation burden is also reflected in how tight the confidence intervals are compared with those of the unrestricted model, as can be seen in Table 3 here and Table 2 in [6]. As a consequence, the variables nervous and hurt, which are not significant in the unrestricted generalized linear model, are significant in the reduced model (29). Furthermore, some variables, such as binge, smokenow, nervous and tense, are now significant for all responses.
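The parameter-count reduction from 45 to 24 can be verified directly (a small arithmetic sketch of ours; the factorization C2 = AB′, with A of size k1 × d and B of size p2 × d, follows the decomposition used in Theorem 3.3):

```python
# Parameter counts for the workforce example: q = 4 response levels give
# k1 = 3 logits; p1 = 2 continuous predictors are unrestricted and
# p2 = 12 binary predictors get a rank-d coefficient matrix C2 = A B'.
k1, p1, p2, d = 3, 2, 12, 1

full = k1 * (1 + p1 + p2)                # intercepts and all slopes, no restriction
reduced = k1 * (1 + p1) + d * (k1 + p2)  # intercepts, continuous slopes, A and B

print(full, reduced)  # 45 24
```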
Table 3.

MLEs with their standard errors in parentheses for the full rank generalized linear model (first two rows) and the partial reduced rank generalized linear model (last two rows).

Variable     log(p1/p4)   log(p2/p4)   log(p3/p4)
binge
  GLM        0.801*       0.318        1.127
             (0.143)      (0.256)      (0.670)
  PRR–GLM    0.569*       0.786*       1.114*
             (0.125)      (0.196)      (0.409)
smokenow
  GLM        0.022        0.501*       0.654
             (0.126)      (0.157)      (0.469)
  PRR–GLM    0.222*       0.306*       0.434*
             (0.088)      (0.119)      (0.208)
sun
  GLM        −0.066       0.120        −0.088
             (0.122)      (0.161)      (0.518)
  PRR–GLM    0.011        0.015        0.021
             (0.084)      (0.116)      (0.164)
nerves
  GLM        −0.102       0.123        −1.456
             (0.138)      (0.198)      (0.841)
  PRR–GLM    −0.054       −0.074       −0.105
             (0.101)      (0.139)      (0.197)
nervous
  GLM        0.297        0.353        1.007
             (0.169)      (0.228)      (0.665)
  PRR–GLM    0.312*       0.430*       0.609*
             (0.124)      (0.168)      (0.291)
hurt
  GLM        0.184        0.210        0.483
             (0.126)      (0.167)      (0.501)
  PRR–GLM    0.180*       0.248*       0.352
             (0.089)      (0.122)      (0.199)
tense
  GLM        0.166        0.483*       1.108
             (0.176)      (0.214)      (0.612)
  PRR–GLM    0.302*       0.416*       0.590*
             (0.122)      (0.163)      (0.284)
miserable
  GLM        −0.050       0.128        −0.093
             (0.138)      (0.178)      (0.613)
  PRR–GLM    0.019        0.0268       0.038
             (0.094)      (0.129)      (0.185)
fedup
  GLM        0.112        0.249        −0.214
             (0.122)      (0.171)      (0.548)
  PRR–GLM    0.117        0.161        0.229
             (0.094)      (0.129)      (0.185)
worry
  GLM        0.113        −0.102       −0.548
             (0.145)      (0.209)      (0.818)
  PRR–GLM    0.003        0.004        0.005
             (0.106)      (0.146)      (0.207)
worrier
  GLM        −0.027       −0.243       −0.548
             (0.131)      (0.180)      (0.550)
  PRR–GLM    −0.116       −0.160       −0.227
             (0.092)      (0.128)      (0.193)
mood
  GLM        −0.111       0.092        −0.193
             (0.123)      (0.171)      (0.553)
  PRR–GLM    −0.037       −0.052       −0.073
             (0.087)      (0.120)      (0.172)

*Significant at 5% level.

Table 2.

Estimators with standard errors for the intercept and the unrestricted predictors age30 and logedu1, for the generalized linear model (first two rows) and the partial reduced rank generalized linear model (last two rows).

Variable     log(p1/p4)   log(p2/p4)   log(p3/p4)
Intercept
  GLM        −1.573*      −2.921*      −6.123*
             (0.388)      (0.396)      (1.106)
  PRR–GLM    −1.762*      −2.699*      −6.711*
             (0.377)      (0.383)      (1.047)
age30
  GLM        −0.190*      0.012        0.077*
             (0.008)      (0.008)      (0.024)
  PRR–GLM    −0.191*      0.012        0.086*
             (0.008)      (0.008)      (0.023)
logedu1
  GLM        0.254        −0.316       −0.198
             (0.228)      (0.214)      (0.566)
  PRR–GLM    0.338        −0.365       −0.089
             (0.225)      (0.213)      (0.559)

*Significant at 5% level.

All significant coefficients are positive. These correspond to the variables binge, smokenow, nervous, tense and hurt for the single and divorced/separated groups. Since a positive value of the binary variables indicates poor lifestyle and negative psychological characteristics, our analysis concludes that for men with these features, the chance of being single, divorced or widowed is higher than the chance of being married, adjusting for age and education. Also, the coefficients corresponding to the widower response are roughly twice as large as those of the single response, suggesting the effect of the predictors differs in each group. All computations were performed using the R package VGAM, developed by Yee [15]. The R script to reproduce the data analysis in Section 4 can be accessed in the supplemental data.

Discussion

With the exception of the work of Yee and his collaborators on VGLMs [6,7,19], where the distribution of the response can be any member of the multivariate exponential family, reduced-rank regression has been almost exclusively restricted to regressions with continuous response variables. Estimation methods and the corresponding software for general partial reduced rank multivariate generalized linear models were developed by Yee and Hastie [6] and Yee [7]. Yet, the distribution and statistical properties of the estimators were not obtained. In this paper we fill this gap by developing asymptotic theory for the restricted rank maximum likelihood estimates of the parameter matrix in multivariate GLMs. To illustrate the potential impact of our results, we refer to the real data analysis example in Section 4. In order to assess the significance of the predictors, Yee and Hastie [6] calculate the standard errors for the coefficient matrix factors, A and B, independently, and can therefore only draw inferences about the significance of the components of the matrix A and the components of the matrix B separately. The asymptotic distribution of either estimator is obtained assuming that the other is fixed and known. In this way, they first analyse the estimate of B to check which predictors are significant and then the estimate of A to examine how they influence each response. Their standard errors are ad hoc, and it is unclear what the product of standard errors measures as it relates to the significance of the entries of the product coefficient matrix AB′. Moreover, this practical ad hoc approach cannot readily be extended when d>1. Using the results of Theorem 3.3, we can obtain the standard errors of each component of the coefficient matrices A and B simultaneously, and assess the statistical significance of each predictor on each response. Using the ad hoc approach of Yee and Hastie [6], a predictor can only be found to be significant across all responses.
For example, Yee and Hastie [6] find the predictor hurt to be significant for all three groups (single, divorced/separated, widower). On the other hand, we can assess the significance of any response/predictor combination. Thus, we find hurt to be significant for the single and divorced/separated groups, but not for the widower group (see Table 3). A potential future direction for our approach was brought to our attention by a referee. The computational cost of fitting a reduced-rank multinomial logistic regression can be very high. Powers et al. [20] proposed replacing the rank restriction with a restriction on the nuclear norm, which amounts to a convex relaxation of the reduced-rank multinomial regression problem. Our methodology can be adapted to obtain asymptotic inference for the regularized parameter estimates.