Literature DB >> 26608962

Factor analysis models via I-divergence optimization.

Abstract

Given a positive definite covariance matrix [Formula: see text] of dimension n, we approximate it with a covariance of the form [Formula: see text], where H has a prescribed number [Formula: see text] of columns and [Formula: see text] is diagonal. The quality of the approximation is gauged by the I-divergence between the zero mean normal laws with covariances [Formula: see text] and [Formula: see text], respectively. To determine a pair (H, D) that minimizes the I-divergence we construct, by lifting the minimization into a larger space, an iterative alternating minimization algorithm (AML) à la Csiszár-Tusnády. As it turns out, the proper choice of the enlarged space is crucial for optimization. The convergence of the algorithm is studied, with special attention given to the case where D is singular. The theoretical properties of the AML are compared to those of the popular EM algorithm for exploratory factor analysis. Inspired by the ECME (a Newton-Raphson variation on EM), we develop a similar variant of AML, called ACML, and in a few numerical experiments, we compare the performances of the four algorithms.

Entities: Chemical Disease Gene

Keywords: I-divergence; alternating minimization; factor analysis; optimal approximate model

Mesh：

Year: 2015 PMID： 26608962 PMCID： PMC4978782 DOI： 10.1007/s11336-015-9486-5

Source DB: PubMed Journal: Psychometrika ISSN： 0033-3123 Impact factor: 2.500

Introduction

Let Y be a given zero mean normal vector of dimension n and covariance . A standard Factor Analysis (FA) model for Y is a linear modelwhere H is a deterministic matrix, X is a standard normal vector of dimension , i.e., with zero mean and (the k-dimensional identity), and is a zero mean normal vector, independent from X, with diagonal. The model (1.1) explains the n components of Y as linear combinations of the components of X, perturbed by the independent noise . The FA model built-in linear structure and data reduction mechanism make it very attractive in applied research. It is not always possible to describe the given Y with a FA model. Indeed, as a consequence of the hypotheses on X and ,a relation which imposes strong structural constraints on the covariance . Determining whether the given Y admits a FA model (1.1) requires the solution of an algebraic problem: given , find, if they exist, pairs (H, D) such that (1.2) holds. The structural constraints impose that H must be a tall matrix, and D a diagonal matrix. For a given , the existence and uniqueness of a pair (H, D) are not guaranteed. Generically, the Ledermann bound (Anderson & Rubin, 1956; Ledermann, 1937), gives necessary conditions for the existence of a FA model in terms of k and n. As it turns out, for the data reduction case of this paper, the right tools to deal with the existence and the construction of an FA model are geometric in nature and come from the theory of stochastic realization, see Finesso and Picci (1984) for an early contribution on the subject. In the present paper we address the problem of constructing an approximate FA model of the given Y. Since in general the relation (1.2) does not hold for any (H, D), one has to find ways to gauge the closeness of to the FA model covariance . One possibility is to use a form of least squares as a loss criterion. Here we adopt the I-divergence , also known as Kullback-Leibler distance, between the corresponding (multivariate) normal laws. Throughout the paper is given and is assumed to be non-singular. In statistical inference it is well known, and reviewed in Section 2, that the I-divergence is, up to constants independent of H and D, the parameters yielding the covariance , the opposite of the normal log likelihood. One has the identitywhere is now the empirical covariance matrix, used as an estimator of the true covariance . In the empirical context non-singular is usually the case if the number of variables is smaller than the number of observations. A completely different situation, singular , arises when the number of variables is larger than the number of observations, see e.g., Bai and Li (2012), Trendafilov and Unkel (2011) for recent results. The choice of the best (H, D) pair can then be posed as a maximum- likelihood problem, as first proposed by Lawley (1940). Lacking a closed form solution, the maximization problem (1.3) has to be attacked numerically, and several optimization algorithms have been either adapted or custom-tailored for it. Among the former, the EM method, introduced in the context of FA estimation by Rubin and Thayer (1982), and still mutating and evolving (Adachi, 2013; Zhao, Yu, & Jiang, 2008; Zhao & Shi, 2014), takes full advantage of the structure of the likelihood in order to guarantee descent at each iteration, although at the expense of a less than ideal convergence rate, which can be slow and sensitive to the initial conditions. It has long been known, see Csiszár and Tusnády (1984), that any EM algorithm can be reformulated embedding the problem in a properly chosen larger parameter space and then performing alternating partial minimizations of the I-divergence over properly defined subspaces. This setup has previously been followed for various problems, e.g., mixture decomposition (Csiszár & Tusnády, 1984), non-negative matrix factorization (Finesso & Spreij, 2006), and approximation of non-negative impulse response functions (Finesso & Spreij, 2015). The advantage afforded by the embedding procedure is that both partial minimizations have closed form solutions; moreover a necessary and sufficient condition of optimality of a geometric flavor, a Pythagoras rule, see Csiszár (1975), is available to check optimality for both partial minimizations. As it turns out, and we prove this assertion in Section 5, the EM method proposed in Rubin and Thayer (1982) corresponds to a suboptimal embedding, as one of the Pythagoras rules fails. The main result of this paper is an iterative algorithm, called AML, for the construction of an (H, D) pair minimizing the I-divergence from using an optimal embedding, for which both Pythagoras rules hold. We also study the behavior of the algorithm in the singular case, i.e., D not of full rank, which is well known to be problematic for FA modeling (Jöreskog, 1967). These theoretical considerations make up the bulk of the paper. We emphasize that the present paper is not on numerical subtleties and (often very clever) improvements as established in the literature to accelerate the convergence of EM type algorithms. Rather, the central feature is the systematic methodology to derive an algorithm by a constructive procedure. Nevertheless, we make a brief foray into the numerical aspects, developing a version of AML, which we call ACML, in the spirit of ECME [a Newton–Raphson variation on EM, Liu and Rubin (1994)]. The remainder of the paper is organized as follows. In Section 2 the approximation problem is posed and discussed, as well as its estimation problem counterpart. Section 3 recasts the problem as a double minimization in a larger space, making it amenable to a solution in terms of alternating minimization. In Section 4, we present the alternating minimization algorithm, provide alternative versions of it, and study its asymptotics. We also point out, in Section 5, the similarities and the differences between our algorithm and the EM algorithm. Section 6 is dedicated to a constrained version of the optimization problem (the singular D case) and the pertinent alternating minimization algorithm. The study of the singular case also sheds light on the boundary limit points of the algorithm presented in Section 4. The last Section 7 is devoted to numerical illustrations, where we compare the performance of the AML, EM, ACML, and ECME algorithms. The Appendix contains the proofs of most of the technical results, and also decomposition results on the I-divergence, which are interesting in their own right, beyond application to Factor Analysis.

Problem Statement

In the present section, we pose the approximation problem and discuss the closely related estimation problem and its statistical counterpart.

Approximation Problem

Consider independent normal, zero mean, random vectors X and , of respective dimensions k and n, where , and with and , a diagonal matrix. For any deterministic conformable matrix H, the n dimensional vector Y given byis called a standard FA model. The matrices (H, D) are the parameters that identify the model. As a consequence of the hypotheses,Given an n-dimensional covariance matrix , one can pose the problem of approximating it with the covariance of a standard FA model, i.e., of finding (H, D) such thatIn this paper, we pose and solve the problem of finding an optimal approximation (2.3) when the criterion of closeness is the I-divergence (also known as Kullback-Leibler distance) between normal laws. Recall that [see e.g., Theorem 1.8.2 in Ihara (1993)] if and are two zero mean normal distributions on , whose covariance matrices, and , respectively, are both non-singular, the I-divergence takes the explicit form ( denotes determinant)Since, because of zero means, the I-divergence depends only on the covariance matrices, we usually write instead of . The approximate FA model problem can then be cast as follows.

Problem 2.1

Given the covariance matrix , of size n, and an integer , minimize1where the minimization is taken over all diagonal matrices , and . The first result is that in Problem 2.1, a minimum always exists.

Proposition 2.2

There exist matrices , and non-negative diagonal that minimize the I-divergence in Problem 2.1.

Proof

The proof can be found in Finesso and Spreij (2007). Finesso and Spreij (2006) studied an approximate non-negative matrix factorization (NMF) problem where the objective function was also of I-divergence type. In that case, using a relaxation technique, the original minimization was lifted to a double minimization in a higher dimensional space, leading naturally to an alternating minimization algorithm. The core of the present paper consists in following the same approach, in the completely different context of covariance matrices, and to solve Problem 2.1 with an alternating minimization algorithm. As a side remark note that , computed as in (2.4), can be considered as an I-divergence between two positive definite matrices, without referring to normal distributions. Hence the approximation Problem 2.1 is meaningful even without assuming normality.

Estimation Problem

In a statistical setup, the approximation Problem 2.1 has an equivalent formulation as an estimation problem. Let indeed be a sequence of i.i.d. observations, whose distribution is modeled according to (2.1), where the matrices H and D are the unknown parameters. This is the context of Exploratory Factor Analysis, where no constraints are imposed on the matrix H. Let denote the sample covariance matrix of the data. If the data have strictly positive covariance, for large enough N, the sample covariance will be strictly positive almost surely. The normal log likelihood yieldsOne immediately sees that is, up to constants not depending on H and D, equal to . Hence, maximum-likelihood estimation parallels I-divergence minimization in Problem 2.1, only the interpretation is different. The equations for the maximum-likelihood estimators can be found in e.g., Section 14.3.1 of Anderson (1984). In terms of the unknown parameters H and D, with D assumed to be non-singular, they arewhere , defined for any square M, coincides with M on the diagonal and is zero elsewhere. Note that the matrix obtained by maximum-likelihood estimation is automatically invertible. Then it can be verified that Equation (2.7) is equivalent towhich is meaningful also when D is not invertible. It is clear that the system of Equations (2.7), (2.8) does not have an explicit solution. For this reason, several numerical algorithms have been devised, among others a version of the EM algorithm, see Rubin and Thayer (1982). In the present paper we consider an alternative approach, which we will compare with the EM and some of the other algorithms.

Lifted Version of the Problem

In this section, Problem 2.1 is recast in a higher dimensional space, making it amenable to solution via two partial minimizations which will lead, in Section 4.1, to an alternating minimization algorithm. The original n dimensional FA model (2.1) is embedded in a larger dimensional linear model as follows:where the deterministic matrix is square of size k, and for the terms H, X, and the hypotheses leading to model (2.1) still hold. The vector V, as well as its subvector Z, is therefore zero mean normal, with

Remark 3.1

The embedding (3.1) has a simple interpretation as a convenient reparametrization of the following alternative version of the standard FA model (2.1),where Z and are zero mean normal vectors of sizes k and n, respectively, with and . Letting and , where Q is any square root2 of P, one easily recognizes that (3.1) is a reparametrization of (3.3). In this paper, all vectors are zero mean and normal, with law completely specified by the covariance matrix. The set of all covariance matrices of size will be denoted as . An element can always be decomposed aswhere and are square, of respective sizes n and k. Two subsets of , comprising covariance matrices with special structure, will play a major role in what follows. The subset contains all covariances (3.4) with , a given matrix, i.e.,The generic element of will often be denoted as . Also of interest is the subset , containing all covariances (3.4) of the special form (3.2), i.e.,The generic elements of will often be denoted , or when the parameters are relevant. We are now ready to pose the following double minimization problem.

Problem 3.2

Find Problems 3.2 and 2.1 are related by the following proposition.

Proposition 3.3

Let be given. It holds that

Proof

The proof can be found in Finesso and Spreij (2007).

Partial Minimization Problems

The first partial minimization, required for the solution of Problem 3.2, is as follows.

Problem 3.4

Given a strictly positive definite covariance matrix , find The unique solution to this problem can be computed analytically and is given below.

Proposition 3.5

The unique minimizer of Problem 3.4 is given byMoreover,and the Pythagorean ruleholds for any strictly positive . See Appendix 2. Next we turn to the second partial minimization

Problem 3.6

Given a strictly positive definite covariance matrix , find The proposition below gives the explicit solution to this problem.

Proposition 3.7

A minimizer of Problem 3.6 is given bywhereThe corresponding minimizing matrix isMoreover, and the Pythagorean ruleholds for any . See Appendix 2. Note that Problem 3.6 cannot have a unique solution in terms of the matrices H and Q. Indeed, if U is a unitary matrix and , , then , , and . Nevertheless, the optimal matrices , HQ, and are unique, as it can be easily checked using the expressions in Proposition 3.7.

Remark 3.8

Note that, since is supposed to be strictly positive, is strictly positive too. It follows that is strictly positive. We close this section by considering a constrained version of the second partial minimization Problem 3.6 to which we will return in Section 5, when we discuss the connection with the EM algorithm. The constraint that we impose is fixed, whereas H and D remain free. The set over which the optimization will be carried out is defined asWe pose the following constrained optimization problem.

Problem 3.9

Given a strictly positive covariance , find The solution is given in the next proposition.

Proposition 3.10

A solution of Problem 3.9 is obtained for , and as in Proposition 3.7, resulting in See Appendix 2. For the constrained problem, the Pythagorean rule does not hold. Intuitively, since the optimal value of the constrained Problem 3.9 is in general higher than the optimal value of the free Problem 3.6. To compute the extra cost incurred notice that , therefore the Pythagorean rule (3.7) giveshence , where is as in Proposition 3.7. The quantity represents the extra cost. An elementary computation givesi.e., the optimizing matrices and , see (3.6), (3.8), coincide iff . Summarizing the comments on the constrained problem: (i) the optimal value at the minimum is higher since , (ii) the extra cost is explicitly given by (3.10), as , and (iii) there is no analog to the Pythagorean rule (3.7). The conclusion is that the solution of the constrained Problem 3.9 is suboptimal for the free Problem 3.6. The consequences of the suboptimality will be further discussed in Section 5.

Alternating Minimization Algorithm

In this section, the core of the paper, the two partial minimizations of Section 3 are combined into an alternating minimization algorithm for the solution of Problem 2.1. A number of equivalent formulations of the updating equations will be presented and their properties are discussed.

The Algorithm

We suppose that the given covariance matrix is strictly positive definite. To set up the iterative minimization algorithm, assign initial values to the parameters, with diagonal, invertible and invertible. The updating rules are constructed as follows. Let be the parameters at the t-th iteration, and the corresponding covariance, defined as in (3.2). Now solve the two partial minimizations as illustrated below.where denotes the solution of the first minimization with input . To express in a compact form the resulting update equations, defineNote that, by Remark 3.8, is actually invertible for all t, since both and have been chosen to be invertible. It is easy to show that also , and consequently , are strictly positive and therefore invertible. The update equations resulting from the cascade of the two minimizations areProperly choosing the square root in Equation (4.2) will make disappear from the update equations. This is an attractive feature since, at the t-th step of the algorithm, only and are needed to construct the approximation . The choice that accomplishes this is , where is a symmetric root of , resulting in . Upon substituting in Equation (4.3), one gets the AML algorithm.

Algorithm 4.1

(AML) Given , from the t-th step, and as in (4.1), the update equations for a I-divergence minimizing algorithm are Since only depends on and , see (4.1), the parameter has been effectively removed from the update equations, although its presence was essential for the derivation.

Remark 4.2

Algorithm 4.1 has one immediate attractive feature: it preserves at each step the diagonal structure of . Indeed, if we let , then it follows from Equation (4.6) that . Algorithm 4.1 potentially has two drawbacks making its implementation computationally less attractive. To update via Equation (4.5) one has to compute, at each step, the square root of the matrix and the inverse of the matrix . The latter problem may in principle be addressed via the matrix inversion lemma, but this requires an invertible which could be problematic in practical situations when one encounters nearly singular . An alternative approach to Algorithm 4.1, to avoid the square roots at each iteration, is to update and as before.

Proposition 4.3

Let be as in Algorithm 4.1. Pick , and such that is invertible. The update equation for becomes See Appendix 2. One can run the update Equation (4.7), for any number T of steps, and then switch back to , taking any factor of i.e., solve . Since Equation (4.7) transforms into preserving the rank, the latter factorization is always possible.

Asymptotic Properties

In Proposition 4.4 below, we collect the asymptotic properties of Algorithm 4.1, also quantifying the I-divergence decrease at each step.

Proposition 4.4

For Algorithm 4.1, the followings hold. for all . If and then for all . The matrices are invertible for all . If then . Decrease of the objective function: where is the t-th approximation of , and were defined in Section 4.1. The interior limit points (H, D) of the algorithm satisfy which are the ML Equations (2.7) and (2.8). If (H, D) is a solution to these equation also (HU, D) is a solution, for any unitary matrix . Limit points satisfy This follows from Remark 3.8 and the construction of the algorithm as a combination of the two partial minimizations. This similarly follows from Remark 3.8. Use the identity and non-negative definite. In this case, Equation (4.1) shows that and substituting this into the update equations yields the conclusion. As matter of fact, we can express the decrease as a sum of two I-divergences, since the algorithm is the superposition of the two partial minimization problems. The results follow from a concatenation of Proposition 3.5 and Proposition 3.7. Assume that all variables converge. Then, from (4.3), it follows that Equation (2.9) holds in the limit. This gives the first of the desired relations, the rest is trivial. This follows by inserting the result of (f.). In part (f) of Proposition 4.4, we made the assumption that the limit points (H, D) are interior points. This does not always hold true as it may happen that D contains zeros on the diagonal. See also Section 6.2.

Remark 4.5

Assertions (b) and (c) of Proposition 4.4 agree with the recent results of Adachi (2013) (Lemma 1 and Theorem 1) for the closely related EM algorithm with a strictly positive definite empirical covariance matrix . We note that the assertions (b) and (c) are automatic consequences of our setup, they follow from the casting of the problem as a double divergence minimization problem. Indeed, the solutions to the ensuing partial minimization problems are automatically strictly positive definite matrices, as otherwise the minimum divergences would be infinite, which is impossible.

Comparison with the EM Algorithm

In Rubin and Thayer (1982), a version of the EM algorithm (see Dempster, Laird, & Rubin, 1977) has been put forward in the context of estimation for FA models. This algorithm is as follows, with as in (4.1).

Algorithm 5.1

(EM) The EM Algorithm 5.1 differs in both equations from our AML Algorithm 4.1. It is well known that EM algorithms can be derived as alternating minimizations, see Csiszár and Tusnády (1984), it is therefore interesting to investigate how Algorithm 5.1 can be derived within our framework. Thereto one considers the first partial minimization problem together with the constrained second partial minimization Problem 3.9, the constraint being , for some . Later on we will see that the particular choice of , as long as it is invertible, is irrelevant. The concatenation of these two problems results in the EM Algorithm 5.1, as is detailed below. Starting at , one performs the first partial minimization that results in the matrixPerforming now the constrained second minimization, according to the results of Proposition 3.10, one obtainsSubstitution of (5.1) into (5.2) yieldsOne sees that the matrix does not appear in the recursion, just as the matrices do not occur in Algorithm 4.1, but we lost the second optimality property in the construction of Algorithm 4.1, due to the imposed constraint . Moreover, the EM algorithm does not enjoy the diagonal preservation property mentioned in Remark 4.2 for Algorithm 4.1, due to the presence of in Equation (5.3). Summarizing, both Algorithms 4.1 and 5.1 are the result of two partial minimization problems. The latter algorithm differs from ours in that the second partial minimization is constrained. In view of the extra cost incurred by the suboptimal constrained minimization, see Equation (3.9), our Algorithm 4.1 yields a better performance. We will illustrate these considerations by some numerical examples in Section 7.

Singular D

It has been known for a long time, see e.g., Jöreskog (1967), that numerical solutions to the ML Equations (2.7), (2.8) often produce a nearly singular matrix D. This motivates the analysis of the minimization Problem 2.1 when D is constrained, at the outset, to be singular (Section 6.1), and the investigation of its consequences for the minimization algorithm of Proposition 4.3 (Section 6.2).

Approximation with Singular D

In this section, we consider the approximation Problem 2.1 under the constraint wherewith of size and of size . The constrained minimization problem can be formulated as follows.

Problem 6.1

Given of size and integers and k, with , minimizeover (H, D) with D satisfying (6.1).

Remark 6.2

Alternatively, in Jöreskog (1967), the solution of the likelihood Equations (2.8) and (2.9) has been investigated under zero constraints on D. In this section, we work directly on the objective function of Problem 6.1. To reduce the complexity of Problem 6.1 we will now decompose the objective function, choosing a convenient representation of the matrix , where has rows. Inspired by the parametrization in Jöreskog (1967) we make the following observation. Given any orthogonal matrix Q, define , then clearly . Let be the singular value decomposition of , with a positive definite diagonal matrix of size , and U and V orthogonal of sizes and , respectively. LetThe blocks of are and , with and . Hence, without loss of generality, in the remainder of this section we assume thatFinally, letwhich, under (6.2), is equivalent toHere is the announced decomposition of the objective function.

Lemma 6.3

Let D be as in Equation (6.1). The following I-divergence decomposition holds. The proof follows from Lemma 9.1. We are now ready to characterize the solution of Problem 6.1. Observe first that the second and third terms on the right-hand side of (6.3) are non-negative and can be made zero. To this end, it is enough to select such that and then . The remaining blocks, and , are determined minimizing the first term. We have thus proved the following

Proposition 6.4

Any pair (H, D), as in (6.1) and (6.2), solving Problem 6.1 satisfies is minimized, , , .

Remark 6.5

In the special case , the matrices and are empty, , and . From Proposition 6.4, at the minimum, , , and minimizes . The latter problem has solution . It is remarkable that in this case the minimization problem has an explicit solution.

Algorithm When a Part of D has Zero Diagonal

In Section 6.1, we have posed the minimization problem under the additional constraint that the matrix D contains a number of zeros on the diagonal. In the present section, we investigate how this constraint affects the alternating minimization algorithm. For simplicity, we give a detailed account of this, only using the recursion (4.7) for . Initialize the algorithm at withwhere , andwhere is assumed to have full row rank, so that . Clearly, is invertible. For as in Equation (6.5) put

Proposition 6.6

Consider the update Equation (4.7). The upper left block of can be computed running a recursion for , with initial condition ,whereas the blocks on the border of remain constant. The iterates for all have a lower right block of zeros, while the upper left block satisfiesLimit points with satisfy , . See Appendix 2. Note that the recursions of Proposition 6.6 are exactly those that follow from the optimization Problem 6.1. Comparison with (4.7) shows that, while the algorithm for the unconstrained case updates of size , now one needs to update which is of smaller size . In the special case , the matrix of (6.6) is equal to zero. Therefore, which proves the following

Corollary 6.7

Let the initial value be as in Equation (6.4) with . Then for any initial value , the algorithm converges in one step and one has that the first iterates and , which are equal to the terminal values, are given by It is remarkable that in this case the algorithm reaches in one step, the optimal values are computed explicitly in Remark 6.5.

Numerical Comparisons with Other Algorithms

We briefly investigate the numerical performance of our AML Algorithm 4.1, and compare it against the performance of other algorithms. The natural competitor of AML is the EM Algorithm 5.1. After the publication of Rubin and Thayer (1982), the EM algorithm has evolved into a cohort of improved alternatives (Liu & Rubin, 1994, 1998, and more recently by Zhao et al., 2008), basically differing from the original EM on numerical implementation aspects. Most notably, in the ECME variant (Liu & Rubin, 1998), is updated as in the EM algorithm, but is updated by direct maximization of the likelihood (equivalently minimization of the I-divergence) with respect to D, keeping H fixed at the value . This step cannot be done analytically, and is realized taking a few Newton–Raphson iterations, and Liu and Rubin (1998) suggests that one or two iterations are usually sufficient. The resulting does not necessarily increase the likelihood with respect to ; therefore, a check has to be performed, and possibly the iteration has to be repeated adjusting its size. The rationale behind ECME is that the advantage in speed afforded by the direct (along the D parameter) maximization of the likelihood outweighs the drawback of having to check each iteration for actual improvement. We have derived, in the same spirit, a variant of AML retaining the update Equation (4.5) and replacing the update Equation (4.6) with the same Newton–Raphson iterations as in ECME. We named the resulting algorithm ACML. All numerical experiments in this section should be read as comparisons between the performances of AML and ACML versus EM and ECME.

Findings

To run the numerical comparisons, we have selected from the published literature five examples of correlation matrices, some of which are well known for being problematic for FA modeling. We have also constructed a sixth data set as an exact FA covariance , selecting randomly the entries of H and D, see below. For each of the six data sets we ran the four algorithms in parallel (sharing the same initial conditions) several times, changing the initial conditions at each run. The figures at the end of the paper are plots of the I-divergence vs. iterations and have been selected to show the typical behaviors of the four algorithms for each data set. The data sets are the following correlation matrices of size . Figure 1: Rubin–Thayer, , taken from Rubin and Thayer (1982).

Fig. 1

Rubin–Thayer.

Figure 2: Maxwell, , Table 4 in Maxwell (1961), also analyzed as data set 2 in Jöreskog (1967).

Fig. 2

Maxwell.

Figure 3: Rao, , taken from Rao (1955).

Fig. 3

Rao.

Figure 4: Harman, , Table 5.3 in Harman (1967).

Fig. 4

Harman.

Figure 5: Emmett, , Table I in Emmett (1949), also analyzed as data set 1 in Jöreskog (1967).

Fig. 5

Emmett.

Figure 6: The data set is a randomly generated covariance of the standard FA model type, i.e., , with . The elements of and of are samples of a uniform on . For the choice of see below under (c2).

Fig. 6

True FA model.

Rubin–Thayer. Maxwell. Rao. Harman. Emmett. True FA model. In all numerical experiments, the number of factors has been fixed, equal to . Initially it was found that, for a number of runs with different data sets and initial conditions, the ECME algorithm produced negative values for the diagonal matrices caused by a routine application of the Newton–Raphson (NR) algorithm. The NR routine has afterward been improved, implementing the restricted step version of the NR algorithm for both ECME and ACML. In all ECME and ACML experiments, we have consistently taken 2 steps of the NR algorithm at each iteration. To present the findings, we have grouped the data sets into three groups (a.), (b.), and (c.), within which we observed similar behaviors. Different behaviors are ranked according to their limit divergence and speed of convergence, with priority given to the former. Rubin–Thayer data (Figure 1). The graphs of the EM/ECME pair are very similar to those of Liu and Rubin (1998) and we observe that the AML/ACML pair outperforms EM/ECME. The typical ranking for this data set was ACML best, followed by ECME, AML, EM in that order. In a few occasions we observed AML best, followed by EM, ACML, and ECME. The ECME was the most sensitive to initial conditions. Maxwell data (Figure 2). The typical ranking for this data set is as above, ACML, ECME, AML, EM, in decreasing order. For both ECME and ACML we have been able to reproduce Table 5 of Jöreskog for the elements of the D-matrix, and also identified the eighth element of the D-matrix as . In a few occasions ACML and ECME displayed very close behaviors, significantly outperforming AML and EM whose behaviors were also close to each other. Rao data (Figure 3). The typical ranking for the data set was ACML, AML, EM, and ECME. Sometimes it took more than 1500 iterations before a significant drop in the divergence of the best performing algorithm could be seen. The should be estimated close to zero (Jennrich & Robinson, 1969), which was usually the case for ACML, for AML and EM with slower convergence. ECME displayed different behaviors (sometimes very good), depending on the initial conditions. Harman data (Figure 4). The typical ranking for the data set was as above, ACML, AML, EM, ECME. In our runs, ACML performed consistently best, whereas ECME consistently exhibited much larger divergences. For this data set, is known to be zero (Jennrich & Robinson, 1969). All runs of the ACML have quickly produced , sometimes ECME too, although large deviations have been seen as well. AML and ME exhibited much slower convergence, often 5000 iterations were not enough. Emmett data (Figure 5). The behavior of the four algorithms for this data set is exceptional. Very often AML and EM gave faster and better (i.e., to smaller values) convergence than ACML and ECME. True FA covariance matrix (Figure 6). Since , where and the selected number of factors is , this is a perfect modeling situation, with vanishing theoretical minimum divergence. The AML is the best performer, reaching null divergence extremely fast, while the ranking of the other algorithms is sensitive to the value of . Figure 6, for , shows AML and EM. The pair ACML and ECME has a much worse behavior, which cannot be plotted on the same graph. For , the ranking of behaviors is different. AML is still the best, immediately followed by ACML, whereas ECME and EM behave erratically and do not converge to zero. We omitted the figure.

Conclusions

Given a positive definite covariance matrix , which may be an empirical covariance, of dimension n, we considered the problem of approximating it with a covariance of the form , where H has a prescribed low number columns and is diagonal. We have chosen to gauge the quality of the approximation by the I-divergence between the zero mean normal laws with covariances and , respectively. By lifting the minimization problem into a larger space, we have been able to develop an optimal procedure from first principles to determine a pair (H, D) that minimizes the I-divergence. As consequence, the procedure also yields an iterative alternating minimization algorithm (AML) à la Csiszár–Tusnády. As it turns out, the proper choice of the enlarged space is crucial for optimization. We have obtained a number of theoretical results that are of independent interest. The convergence of the algorithm has also been studied, with special attention given to the case where D is singular. The theoretical properties of the AML have been compared to those of the popular EM algorithm for exploratory factor analysis. Inspired by the ECME (a Newton–Raphson variation on EM), we also developed a similar variant of AML, called ACML, and in a few numerical experiments we compared the performances of the four algorithms. We have seen that usually the ACML algorithm performed best, in particular, better than ECME. In some specific experiments, AML was best, and always outperforming the EM algorithm.

1 in total

1. Factor analysis with EM algorithm never gives improper solutions when sample covariance and initial parameter matrices are proper.

Authors: Kohei Adachi
Journal: Psychometrika Date: 2012-11-28 Impact factor: 2.500

1 in total