Vincent Guillemot1, Derek Beaton2, Arnaud Gloaguen3, Tommy Löfstedt4, Brian Levine2, Nicolas Raymond5, Arthur Tenenhaus3, Hervé Abdi6. 1. Bioinformatics and Biostatistics Hub, Institut Pasteur, Paris, France. 2. The Rotman Research Institute Institution at Baycrest, Toronto, ON, Canada. 3. L2S, UMR CNRS 8506, CNRS-CentraleSupélec-Université Paris-Sud, Université Paris-Saclay, 3 rue Joliot-Curie, 91192 Gif-sur-Yvette, France. 4. Department of Radiation Sciences, Umeå University, Umeå, Sweden. 5. IRMAR, UMR 6625, Université de Rennes, Rennes, France. 6. School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, TX, United States of America.
Abstract
We propose a new sparsification method for the singular value decomposition-called the constrained singular value decomposition (CSVD)-that can incorporate multiple constraints such as sparsification and orthogonality for the left and right singular vectors. The CSVD can combine different constraints because it implements each constraint as a projection onto a convex set, and because it integrates these constraints as projections onto the intersection of multiple convex sets. We show that, with appropriate sparsification constants, the algorithm is guaranteed to converge to a stable point. We also propose and analyze the convergence of an efficient algorithm for the specific case of the projection onto the balls defined by the norms L1 and L2. We illustrate the CSVD and compare it to the standard singular value decomposition and to a non-orthogonal related sparsification method with: 1) a simulated example, 2) a small set of face images (corresponding to a configuration with a number of variables much larger than the number of observations), and 3) a psychometric application with a large number of observations and a small number of variables. The companion R-package, csvd, that implements the algorithms described in this paper, along with reproducible examples, are available for download from https://github.com/vguillemot/csvd.
We propose a new sparsification method for the singular value decomposition-called the constrained singular value decomposition (CSVD)-that can incorporate multiple constraints such as sparsification and orthogonality for the left and right singular vectors. The CSVD can combine different constraints because it implements each constraint as a projection onto a convex set, and because it integrates these constraints as projections onto the intersection of multiple convex sets. We show that, with appropriate sparsification constants, the algorithm is guaranteed to converge to a stable point. We also propose and analyze the convergence of an efficient algorithm for the specific case of the projection onto the balls defined by the norms L1 and L2. We illustrate the CSVD and compare it to the standard singular value decomposition and to a non-orthogonal related sparsification method with: 1) a simulated example, 2) a small set of face images (corresponding to a configuration with a number of variables much larger than the number of observations), and 3) a psychometric application with a large number of observations and a small number of variables. The companion R-package, csvd, that implements the algorithms described in this paper, along with reproducible examples, are available for download from https://github.com/vguillemot/csvd.
The singular value decomposition (SVD) [1-3]—the tool “par excellence” of multivariate statistics—constitutes the core of many multivariate methods such as, to name but a few, principal component analysis [4], canonical correlation analysis [5], multiple correspondence analysis [6], and partial least squares methods [7]. To analyze data tables whose rows typically correspond to observations and columns to variables, these statistical methods use the SVD to generate orthogonal optimal linear combinations of the variables—called components or factor scores—that extract the most important information in the original data. In most cases, only the components that explain the largest proportion of the data variance are kept for further investigation. The coefficients—called loadings—of the linear combination used to compute a component are also often used to understand or “interpret” the corresponding components and this interpretation is greatly facilitated (particularly when the number of variables is large) when, for a given component, only a few variables have large loadings and the other variables have negligible loadings. If this pattern does not naturally hold, several procedures can be used to select the variables that are important for a component. The early psychometric school, for example, would use rotations, such as VARIMAX, [8] of the components in a low dimensional subspace; whereas recent approaches favor computationally based methods such as bootstrap ratios [7], or select important variables using an explicit non-linear optimization method such as the LASSO [9]. Closely related to the current work, in the specific case of principal component analysis (for an extensive review of sparsification for PCA see [10]), Witten et al. (see Section 3.2 in [11]) propose to implement either an orthogonality constraint, or a sparsity constraint (but not both simultaneously, see also, for related ideas, [12, 13]). Along the same lines, Benidis et al. [14] proposed, recently, an algorithm, based on Procrustes approach, for sparse principal component analysis that includes an orthogonality constraint on the loadings.Unfortunately, in the more general case of having concurrently the sparsity and the orthogonality constraints active on both left and right pseudo-singular vectors, the components obtained from the LASSO and its derivatives are not orthogonal and this often makes their interpretation difficult. To palliate this problem, we present and illustrate a new LASSO-like sparsification method for the SVD, called constrained singular value decomposition (CSVD), that incorporates orthogonality constraints on both the rows and the columns of a matrix.
1 Notations
Matrices are denoted by uppercase bold letters (e.g., X), vectors by lowercase bold (e.g., x), and their elements by lower case italic (e.g., x). Matrices, vectors, and elements from the same matrix all use the same letter (e.g., A, a, a). The transpose operation is denoted by the superscript “⊤”, the inverse of a square matrix A is denoted by A−1. The identity matrix is denoted I, vectors or matrices of ones are denoted by 1, matrices or vectors of zeros are denoted by 0 (when multiplied with or added to other matrices, matrices I, 1, and 0 are assumed to be conformable, in case of doubt their size is specified). When provided with a square matrix, the diag(⋅) operator returns a vector with the diagonal elements of the matrix, and when provided with a vector, the diag(⋅) operator returns a diagonal matrix with the elements of the vector as the diagonal elements of this matrix. When provided with a square matrix, the trace(⋅) operator gives the sum of the diagonal elements of this matrix. The L2 norm of vector x, denoted ‖x‖2 is defined as , The L1 norm of vector x, denoted ‖x‖1 is defined as ‖x‖1 = ∑(|x|). A vector x is normalized by dividing this vector by its L2 norm (and so a normalized vector has an L2 norm equal to 1). The Frobenius norm of matrix X, denoted ‖X‖ is defined as . The Frobenius inner product of two rectangular matrices A and B of same dimensions, denoted 〈A, B〉 is defined as . The concatenation of an I by J matrix X and an I by 1 vector y is the I by J + 1 matrix denoted [X, y] obtained by the juxtaposition of y on the right side of matrix X. The orthogonal complement of the space linearly spanned by the columns of a rectangular matrix M is denoted M⊥. Two rectangular matrices A and B of same dimensions are said to be orthogonal if and only if 〈A, B〉 = 0.With x being a real scalar and γ a non-negative real number, the scalar soft-thresholding function denoted s(x, γ) is defined as
and the vector soft-thresholding function denoted S(x, γ) is defined as:
This function shrinks all the components of x toward 0, and set the smallest components to 0.The projection of a vector x onto a space is denoted by proj(x, ). The L2-ball of radius ρ, denoted , is defined as
and the L1-ball of radius ρ, denoted denoted , is defined asSingular values are denoted δ, eigenvalues are denoted λ = δ2.
2 Unconstrained singular value decomposition
The SVD of a data matrix of rank L ≤ min(I, J) gives the solution of the following problem: Find a least-squares optimal, rank R (with R ≤ L) approximation of X, denoted . Specifically, the SVD solves the following optimization problem [1, 2, 15]:
where is the set of all real I × J matrices of rank R.Recall that the SVD decomposes X as
where P⊤P = Q⊤Q = I and with δ1 ≥ δ2 ≥ ⋯ ≥ δ > 0, and L is the rank of X. The matrix (respectively ) contains the left (respectively right) singular vectors of X and the diagonal matrix Δ contains the singular values of X. If p (respectively q) denotes the ℓ-th column of P (respectively Q), and δ the ℓ-th element of , then, for any R ≤ L, the optimal matrix is obtained as:
with , and , ∀ ℓ ≠ ℓ′.A classic, albeit non-optimal and potentially numerically unstable, algorithm (described in Algorithm 1) to obtain the unconstrained singular value decomposition of X is based on the “power iteration method.” This algorithm—originally developed for the eigen-decomposition of a square matrix—provides the first singular triplet that comprises the first singular value and first left and right singular vectors. In order to ensure orthogonality between successive singular vectors, the first rank one approximation of X, computed as
is subtracted from X. This procedure—called deflation (see Appendix A)—gives a new matrix
which is orthogonal to . The power iteration method is then applied to the deflated matrix X(2), giving a second rank one approximation denoted . The deflation is then applied to X(2) to give the new residual matrix X(3) orthogonal to X(2), and so on, until nothing is left to subtract because, then, X has been completely decomposed. This way, the optimization problem from Eq 5 can be re-expressed as:Algorithm 1: The power iteration method for the unconstrained SVD. The algorithm consists in alternating the multiplication of the data matrix by the left and right vectors followed by a normalization step. After convergence, the data matrix is deflated and the process is re-iterated.Data: X, ε, RResults: SVD of XDefine X(1) = X;for
ℓ = 1, …, R
dop(0) and q(0) are randomly initialized;δ(0) = 0;δ(1) = p(0)⊤Xq(0);s = 0;while |δ( − δ(| ≥ ε
dop( ← normalize (Xq();q( ← normalize (X⊤p();δ( = p(Xq(;s ← s + 1;endX( = X( − δ(
p(
q(;endAs an alternative to the deflation approach used in Algorithm 1, the orthogonality constraint can be eliminated and integrated into the power iteration algorithm by replacing the normalization steps by the projection of the result of the current iteration onto the intersection of the L2 ball and the space orthogonal to the previously found left or right singular vectors (see Algorithm 2). This projection onto the intersection of these two spaces can be implemented in a number of ways [16], we chose here to use the projection onto convex sets algorithm (POCS, see, e.g., [17, Page 101])—an iterative algorithm easily implementable and generalizable to the projection onto the intersection of more than two convex sets (see Algorithm 3). Recall that, when normalized, a vector x is projected onto the L2 ball and that a vector x is projected onto V⊥ by multiplying it by I − V⊤
V. In POCS, these two projection steps are iterated until convergence.Algorithm 2 An alternative algorithm of the power iteration for the unconstrained SVD: The deflation step is replaced by a projection onto the space orthogonal to the space defined by the already computed lower rank version of the data matrix. Note that 0⊥ corresponds to the whole space, so it is either or .Data: X, ε, RResult: SVD of XDefine P = 0;Define Q = 0;for
ℓ = 1, …, R
dop(0) and q(0) are randomly initialized;δ(0) ← 0;δ(1) ← p(0)⊤Xq(0);s ← 0;while |δ( − δ(| ≥ ε
dop( ← proj(Xq(, );q( ← proj(X⊤
p(, );δ( ← p(
Xq(;s ← s+1;endδ ← δ(;P ← [P, p(];Q ← [Q, q(];endAlgorithm 3 Projection onto the intersection of K convex sets (POCS).Data: x, , εResults: Projection of x ontoDefine x(0) = x;while |x( − x(| ≥ ε
dox( ← proj(x(, ));for
k = 2, …, K
dox( ← proj(x(, ));ends ← s + 1endAlgorithm 1 is obviously faster than Algorithm 2 as implemented using Algorithm 3, because the orthogonality constraint in Algorithm 1 is performed with one operation whereas Algorithm 2 always requires several operations. However, the main benefit of Algorithm 2 is that it easily can be extended to include additional constraints, as illustrated below.
3 Constrained singular value decomposition
3.1 Previous work
Algorithm 4: The Algorithm of Witten et al. [11]: The penalized matrix decomposition (PMD) approach.Data: X, ε, RResulet: SVD of XDefine X(1) = X;for
ℓ = 1, …, R
dop(0) and q(0) are randomly initialized;δ(0) = 0;δ(1) = p(0)⊤Xq(0);s = 0;while |δ( − δ(| ≥ ε
dop( ← normalize (S(Xq(, λ1,)), with λ1, such that ‖p(‖1 = c1,;q( ← normalize (S(X⊤
p(, λ2,)), with λ2, such that ‖q(‖1 = c2,;δ( = p(
Xq(;s ← s + 1;endX( = X( − δ(
p(
q(;endRecently, several authors have proposed sparse variants of the SVD (see e.g., [4, 11, 15, 18, 19] for reviews), or, more specifically, of PCA [14, 20]. For most of these sparse variants, the sparsification is obtained by adding new constraints to Eq 10. For example, the penalized matrix decomposition (PMD) approach by Witten et al. [11] solves the following optimization problem for the first pair of left and right singular vectors:
where C1 and C2 are convex penalty functions from (respectively ) to (such as, e.g., the LASSO, or the fused LASSO constraints) and with c1, and c2, being positive constants. The PMD procedure is described in Algorithm 4. In PMD, the next pseudo-singular triplet is estimated by solving the same optimization problem where X is replaced by a deflated matrix. In contrast to Eq 11, however, the added constraints create a nonlinear optimization problem and this makes the deflated matrices non-orthogonal to the previous rank one optimal matrix [21]. This lack of orthogonality makes the interpretation of the components somewhat difficult because the conclusions about one component involve all correlated components and because the same information is explained (to different degrees) by all correlated components. In the specific case of PCA, Witten et al. proposed, alternatively, to impose an orthogonality constraint on the left singular vectors, without the sparsity constraint, and to leave the sparsity constraint active only on the right vectors (i.e., the loadings). However, this procedure does not solve the problem of having both constraints simultaneously active on the left and right singular vectors (See Equation 3.17 and the subsequent algorithm in [11] for more details).In order to eliminate the problems created by the non-orthogonality of the singular vectors, we present below a new optimal sparsification method, called the constrained singular value decomposition (CSVD) that implements orthogonality constraints on successive sparsified singular vectors.
3.2 Current work: The constrained SVD (CSVD)
The constrained SVD still decomposes the matrix X into singular values and vectors, but imposes, in addition, on the singular vectors constraints that induce sparsity of the weights. Such sparsity-inducing constraints are common in fields where the data comprise large numbers of variables [22] (e.g., tens of thousands, as in genomics [23], to millions, as in neuroimaging [24, 25]). Although the theory of sparsity-inducing constraints is well documented, we are interested in a general formulation that could also be applied for several types of sparsification, as well as more sophisticated constraints.Specifically, we consider the following optimization problem:
where C1 and C2 are convex penalty functions from (respectively ) to , (which could be, e.g., the LASSO, group-LASSO, or fused LASSO constraints) and with c1, and c2, being positive constants: smaller values of c1, and c2, lead to solutions that are more sparse. See, however, e.g., [11, 26], and, as developed in Appendix B, only some ranges of values of c1, and c2, will lead to solutions.An equivalent, but more convenient, form of the problem described in Eq 12 can be derived by considering two orthogonal matrices P, and Q, and a diagonal matrix Δ = diag{} such thatThe term is constant and does not depend on p or q, and so for a given δ positive, the solutions of are the same as the solutions of . In addition the maximum is reached when (see, e.g., [11]). Consequently, minimizing from Eq 13 is equivalent to maximizing each term of the sum . Therefore, Eq 12 is equivalent to the following 1-dimensional maximization problem for ℓ ≥ 1, given the previous vectors p and q for all 0 ≤ ℓ′ < ℓ:
where we set (for convenience) p0 and q0 to 0 (note that for the first pseudo-singular vectors, the orthogonality constraint is not active because all vectors are orthogonal to 0).To facilitate the resolution of this optimization problem, the unicity constraint on the L2 norm of the singular vectors (i.e., p⊤p = 1 and q⊤q = 1) needs to be relaxed and replaced by an equivalent inequality. Specifically, the optimization problem in Eq 14 is reframed asEq 15 defines a biconcave maximization problem with convex constraints. This problem can be solved using block relaxation [27]: An iterative algorithm that consists in a series of two-part iterations in which (Part 1) the expression in Eq 15 is maximized for p with q being fixed, and is then (Part 2) maximized for q with p being fixed. Part 1 of the iteration can be re-expressed as the following optimization problem:
Eq 16 shows that finding the optimal value for p (i.e., Part 1 of the alternating procedure) is equivalent to finding the projection of the vector Xq onto the subspace of defined by the intersection of all the convex spaces involved by the constraints. During Part 2 of the alternating procedure, the vector p is fixed and therefore Part 2 can be expressed as the following minimization problem:
Eqs 16 and 17 replace the L1 and L2 constraints in the minimization problem expressed in Eq 11 by projections onto the intersection of the convex sets (POCS) defined by these constraints. Because the intersection of several convex sets is also a convex set [28], the block relaxation algorithm from Eq 15 is essentially composed of sequential series of operations applied until convergence of the two projections onto their respective convex sets. This strategy can obviously be extended to incorporate multiple additional constraints as long as these constraints define convex subspaces. As shown in Appendix D, the CSVD algorithm is guaranteed to converge to a stable point because it is a member of the more general class of the block successive upper-bound minimization (BSUM) algorithms.In the specific case of the projection on the intersection of the balls L1 and L2, POCS can be replaced by a fast and exact algorithm called PL1L2 ([26], see Appendix C for details). This algorithm (see Algorithm 5) differs from the more general Algorithm 4 only by the specification of the projection method onto the L1 ball which is implemented as a simple and fast algorithm based on the soft-thresholding operator.Note that, Algorithm 2—presented in Section 2 for the unconstrained SVD and using POCS for the projection onto an intersection of convex sets—can be easily generalized to incorporate a new sparsity constraint, by simply applying POCS to the intersection of 3 convex sets (instead of just 2 convex sets): the L2-ball of radius 1, an L1-ball, and the orthogonal subspace to the space defined by the previously found left or right pseudo-singular vectors.Algorithm 5: General algorithm for the Constrained Singular Value Decomposition. The projection onto the L1-ball can be replaced by another projection onto a convex set, making it possible to adapt this algorithm to other purposes.Data: X, ε, R, c1, and c2, for ℓ in 1, …, RResults: CSVD of XDefine P = 0;Define Q = 0;for
ℓ = 1, …, R
dop(0) and q(0) are randomly initialized;δ(0) ← 0;δ(1) ← p(0)⊤Xq(0);s ← 0;while |δ( − δ(| ≥ ε
dop( ← proj(Xq(, );q( ← proj(X⊤p(, );δ( ← p(Xq(;s ← s + 1;endδ ← δ(;P ← [P, p(];Q ← [Q, q(];endIn the following sections, we illustrate—using simulated and real data—the effect and importance of the orthogonality constraint and show how this constraint improves the interpretability of the analysis.
4 Empirical comparative evaluation of the CSVD
In this section, we empirically evaluate, illustrate the constrained singular value decomposition (CSVD), and compare its performance to the performance of the plain SVD and the closely related sparsification method of Witten et al. [11], the PMD method. To do so, we used: 1) some simulated datasets, 2) one simulated dataset mimicking a real dataset, and 3) one real dataset (the characteristics of these datasets are listed in Table 1).
Table 1
Characteristics of the various datasets used to assess the performance of the CSVD and related methods.
Dataset
I(# of rows)
J(# of columns)
Rank
Simulated
150
600
149
Face Data
6
55,200
6
Memory
2,100
30
30
With the first (simulated) dataset we evaluated how the SVD, PMD, and CSVD recover the ground truth for a relatively large dataset with more variables than observations contains a mixture of signal and Gaussian noise.Datasets two and three were chosen to each illustrate a particular aspect of the data. The second dataset investigates the N ≪ P problem and comprises six face images consisting of 230 × 240 = 55, 200 pixels—with each pixel measuring light intensity on a scale going from 0 to 255. The third dataset illustrates the effects of sparsification on a dataset corresponding to a traditional psychometric problem. This simulated has been created to match the pattern of loadings of a real dataset that was collected from 2, 100 participants who—as part of a larger project on memory—answered an online version of the “object-spatial imagery questionnaire” (OSIQ, [29])—a psychometric instrument measuring mental imagery for objects and places. Using a 5-point rating scale, participants rated their agreement for 30 items that should span a 2-dimensional space corresponding to the spatial and object imagery psychometric factors. This dataset is used to compare sparsification and the standard traditional psychometric approach relying on (Varimax) rotation to recover a two dimensional factor structure.For Datasets two and three, we applied three degrees of sparsity (low, medium, and high). As detailed in Appendix B, only some values of c1 and c2 will lead to solutions (specifically, c1 has to be chosen between 1 and and c2 between 1 and ). Table 2 lists the values chosen for c1 and c2 and their interpretation for PMD and the CSVD. Also, for technical reasons, the values of c1 and c2 corresponding to the maximum sparsity for P and Q are set, respectively, to 1 + ε1 and 1 + ε2 (instead of 1) with ε1 and ε2 being two small real positive values.
Table 2
The different values of the sparsity parameters for the CSVD and PMD.
c1
c2
Resulting degree of sparsity
Notation
1 + ε1
1 + ε2
The sparsest level, most of the coefficients are close zero
H (High)
13I
13J
Very sparse
M (Medium)
23I
23J
Somewhat sparse
L (Low)
I
J
No sparsity, corresponds to the regular SVD
N (None)
4.1 Simulated data
With these simulated data, we evaluate the ability of the CSVD to recover known singular triplets, their sparsity structure, and the orthogonality of the estimated left and right singular vectors. These simulated data were created by adding a matrix of Gaussian noise to a 150 by 600 matrix of rank 5 built from its SVD decomposition.Specifically, the data matrix X was created as
whereis the rank 5 matrix of the ground truth with:Δ a 5 × 5 the diagonal matrix of the singular values equal to Δ = diag () = diag([15, 14, 13, 12, 11])P an I = 150 × 5 = 750 by 5, orthogonal matrix with P⊤P = I,Q a J = 600 × 5 = 3, 000 by 5 orthogonal matrix with Q⊤Q = I,E is an I = 150 × J = 600 matrix containing I × J = 90, 000 independent realizations of a Gaussian variable with mean equal to 0 and standard deviation equal to 0.01.Matrices P and Q were designed to be both sparse and orthogonal. Specifically, matrix P was generated with the following model
where 0 denotes 25 × 1 vectors of 0s, and where all p and p′ were 25 × 1 vectors with norm equal to and such that where . A similar model was used for Q which was generated with the following model
where 0 were 120 × 1 vectors of 0s, and where all q q′ vectors were 120 × 1 vectors with norm equal to and such that .With the structure described in Eqs 19 and 20, the low-rank matrix to recover from X is then composed of 2 parts: 1) a common part (no sparsity) to all 5 components (i.e., the part corresponding to the p and q vectors), and 2) one part specific to each component (i.e., the p′ and the q′ vectors) with the corresponding part in the other 4 components being sparse.Figs 1 and 2 show heatmaps of the true left-singular vectors (P) and right-singular vectors (Q).
Fig 1
Simulated data.
Heatmap of : the true left singular vectors (in an horizontal representation: one line equals one singular vector).
Fig 2
Simulated data.
Heatmap of : the true right singular vectors (in an horizontal representation: one line equals one singular vector).
Simulated data.
Heatmap of : the true left singular vectors (in an horizontal representation: one line equals one singular vector).Heatmap of : the true right singular vectors (in an horizontal representation: one line equals one singular vector).We analyzed X with the CSVD and PMD. For both methods, the L1 constraint was set to c1 = 5 for the left-singular vectors and c2 = 11 for the right-singular vectors, based on the sparsity of the ground truth.We asked each method to return 7 vectors in order to evaluate if the methods could recover the ground truth (i.e., the 5-dimensional sub-space) but also how they would behave after this 5-dimensional subspace had been recovered. Fig 3 shows the boxplots of the distribution of the squared difference between the estimated singular vectors and the ground truth for PMD and the CSVD: Both methods correctly uncover the singular vectors. Fig 4 shows the correlations between the first 7 estimated singular vectors compared to the ground truth: Although the first 5 singular vectors are correctly estimated, and, roughly, orthogonal to the previous singular vectors, the 2 last vectors estimated by PMD are correlated to some of the previously computed vectors. This demonstrates the failure of the standard deflation technique to impose the orthogonality constraint when a non-linear optimization methods is used. By contrast with the PMD approach, the orthogonality constraint of the CSVD prevented this problem.
Fig 3
Simulated data.
Boxplots of the squared difference between the estimated singular vectors and the ground truth.
Fig 4
Simulated data.
Heatmap of the correlations between the estimated left (P) and right (Q) singular vectors with the ground truth for the CSVD and PMD. Each cell of the heatmap represents the correlation between one of the 7 estimated (left or right) vectors with the 5 true vectors. Each heatmap contains 5 rows (the ground truth) and 7 columns (the estimated vectors). On the left, are the results obtained with the CSVD and on the right, the results obtained with PMD.
Boxplots of the squared difference between the estimated singular vectors and the ground truth.Heatmap of the correlations between the estimated left (P) and right (Q) singular vectors with the ground truth for the CSVD and PMD. Each cell of the heatmap represents the correlation between one of the 7 estimated (left or right) vectors with the 5 true vectors. Each heatmap contains 5 rows (the ground truth) and 7 columns (the estimated vectors). On the left, are the results obtained with the CSVD and on the right, the results obtained with PMD.Fig 5 displays the computational times of the CSVD and PMD as a function of the number of computed components: the CSVD is faster than PMD for the estimation of the first component, but this advantage diminishes when the number of computed components increases. This pattern indicates that the orthogonality constraint increases the computational time.
Fig 5
Simulated data.
Computational time of PMD and the CSVD when estimating one sparse singular triplet (left) and two sparse singular triplets (right).
Computational time of PMD and the CSVD when estimating one sparse singular triplet (left) and two sparse singular triplets (right).Table 3 contains the estimated singular values for the CSVD and PMD as well as the ground-truth singular values. The estimated pseudo-singular values are comparable for both methods and behave in a similar way compared to the ground truth and to the regular SVD: The first singular values are slightly smaller than their ground truth values, but at some point—which varies depending on the imposed degree of sparsity—the estimated number of pseudo-singular values is larger than the ground truth.
Table 3
Estimated singular values and ground truth.
In the “ground truth” column, a value of 0 indicates that the corresponding real singular value does not exist (i.e., because the underlying matrix has rank 5).
Order
CSVD
PMD
SVD
Ground Truth
1
14.77
14.77
14.97
15.00
2
13.62
13.64
13.99
14.00
3
12.53
12.59
12.99
13.00
4
11.86
11.90
11.97
12.00
5
10.34
10.45
10.93
11.00
6
0.21
2.94
0.04
0
7
0.15
2.98
0.04
0
Estimated singular values and ground truth.
In the “ground truth” column, a value of 0 indicates that the corresponding real singular value does not exist (i.e., because the underlying matrix has rank 5).Additionally, broader settings were considered for the comparison of SVD, CSVD and PMD on simulated data. Specifically, we considered a case with a low signal to noise ratio, and a case where the noise is structured: PMD and CSVD performed equally poorly on noisy data, but were unaffected by a structured noise. These additional results are reported in S1 Table.To sum up: 1) the CSVD and PMD produce highly similar estimates of the first singular vectors; 2) the CSVD and PMD both recover the true sparsity structure of the ground-truth data; 3) for singular vectors of an order higher than the rank of the matrix, PMD produces singular vectors correlated with the previous ones; and 4) the CSVD is computationally more efficient than PMD but this advantage shrinks as the number of computed components increases.
4.2 The face data
The dataset consists in six 240 × 230 = 55, 200 gray scale digitized (range from 0 to 255) face images (three men and three women) that were extracted from a larger face database (see [30, 31]) and are available as the dataset sixFaces from the R-package data4PCCAR (obtained from the Github repository HerveAbdi/data4PCCAR).Each image was unfolded (i.e., “vectorized”) into a 240 × 230 = 55, 200 element vector, which was re-scaled to norm one. A plain SVD was then performed on the 6 × 55, 200 matrix (see Fig 6) obtained from the concatenation of the 6 face vectors. The SVD extracted 6 components with the first one extracting a very large proportion of the total variance (i.e., λ1 = 5.616, τ1 = 94%, see also Fig 7 left panel). This very large first eigenvalue indicates that these face images are highly correlated (see Fig 8)—An interpretation confirmed by the very similar values of the coordinates of the six faces for the first component. But this large first eigenvalue reflects also, in part, that the data were not centered, because, with all entries of the matrix being positive, the first left (respectively right) singular vector (i.e., the first component) is respectively, the 6-element long vector of a weighted mean of the pixels across the faces (respectively the 55, 200-element long vector of a weighted mean of the faces across the pixels) and so all elements of the first pair of singular vectors have the same sign, see Table 4 and the picture of the first “eigen-face” [32] in the left of the top row of Fig 9). The second component differentiates females and males (see the map of the faces for Components 1 versus 2 in Fig 9, and the picture of the second “eigen-face” in the right of the top row of Fig 9).
Fig 6
Face data.
The data matrix of the face dataset: 6 faces by 55,200 voxels. The female faces are denoted F1, F2, and F3, the male faces M1, M2, and M3.
Fig 7
Face data.
Eigenvalues and pseudo-eigenvalues per dimension. Left column: eigenvalues obtained from regular SVD; middle column: pseudo-eigenvalues (i.e., variance of the factor scores) for the CSVD; right column: pseudo-eigenvalues (i.e., variance of the factor scores) for the PMD. For PMD and the CSVD, each line represents a level of sparsity: none (No), low (L), medium (M), and high (H).
Fig 8
Face data.
Cosine matrix for the 6 faces of the face dataset. The female faces are denoted F1, F2, and F3, the male faces M1, M2, and M3.
Table 4
Face example.
Left singular vectors (i.e., face loadings) and associated eigenvalues.
Dimension
Faces
Eigenvalue
Percentage
1
2
3
4
5
6
λ
τ
1
−.41
−.41
−.40
−.41
−.40
−.41
5.616
93.61
2
.14
.09
.76
−.16
−.53
−.30
0.160
2.66
3
−.40
.13
.29
−.14
.67
−.52
0.086
1.43
4
.08
−.68
.34
−.41
.27
.42
0.055
0.91
5
.45
−.54
−.04
.50
.11
−.50
0.052
0.87
6
−.66
−.25
.25
.60
−.16
.22
0.031
0.52
Fig 9
Face data.
First two sparse left singular vectors (P) for the CSVD (left) and PMD (right) for three levels of sparsity: low (L), medium (M), and high (H). The results for the CSVD are reported on the left, and the results for PMD are reported on the right.
Face data.
The data matrix of the face dataset: 6 faces by 55,200 voxels. The female faces are denoted F1, F2, and F3, the male faces M1, M2, and M3.Eigenvalues and pseudo-eigenvalues per dimension. Left column: eigenvalues obtained from regular SVD; middle column: pseudo-eigenvalues (i.e., variance of the factor scores) for the CSVD; right column: pseudo-eigenvalues (i.e., variance of the factor scores) for the PMD. For PMD and the CSVD, each line represents a level of sparsity: none (No), low (L), medium (M), and high (H).Cosine matrix for the 6 faces of the face dataset. The female faces are denoted F1, F2, and F3, the male faces M1, M2, and M3.First two sparse left singular vectors (P) for the CSVD (left) and PMD (right) for three levels of sparsity: low (L), medium (M), and high (H). The results for the CSVD are reported on the left, and the results for PMD are reported on the right.
Face example.
Left singular vectors (i.e., face loadings) and associated eigenvalues.We applied the CSVD and PMD to the face set using three different sparsity levels (low, medium, and high). Fig 9 shows the plot of the first two components for these three levels of sparsity. Both the CSVD and PMD tend to isolate the woman faces on the first dimension and the male faces on the second dimension. The corresponding first two eigen-faces, (see Figs 10 and 11) show that both the CSVD and PMD tend to extract characteristic features of the female faces (first eigen-face) or the male faces (second eigen-face). In contrast, the first and second eigen-faces for the plain SVD (plotted in Fig 12) show respectively a weighted average face (i.e., a linear combination with positive coefficients of the faces) and a mixture between male (with positive coefficients) and female (with negative coefficients) faces. This interpretation for the plain SVD is confirmed by Fig 13 where all faces load almost identically on Dimension 1, and where Dimension 2 separates men from women.
Fig 10
Face data.
The pseudo-eigenfaces for Dimension 1 on the left column and Dimension 2 on the right column. For this graph, only the CSVD was applied, with three different levels of sparsity: low on the top row (L), medium on the middle row (M), and high on the bottom row (H).
Fig 11
Face data.
The pseudo-eigenfaces for Dimension 1 on the left column and Dimension 2 on the right column. For this graph, only PMD was applied, with three different levels of sparsity: low on the top row (L), medium on the middle row (M), and high on the bottom row (H).
Fig 12
Face data.
The six eigenfaces obtained from the plain SVD.
Fig 13
Face data.
The two first left singular vectors of the plain SVD of the non-centered data.
The pseudo-eigenfaces for Dimension 1 on the left column and Dimension 2 on the right column. For this graph, only the CSVD was applied, with three different levels of sparsity: low on the top row (L), medium on the middle row (M), and high on the bottom row (H).The pseudo-eigenfaces for Dimension 1 on the left column and Dimension 2 on the right column. For this graph, only PMD was applied, with three different levels of sparsity: low on the top row (L), medium on the middle row (M), and high on the bottom row (H).The six eigenfaces obtained from the plain SVD.The two first left singular vectors of the plain SVD of the non-centered data.Overall the CSVD and PMD behave similarly and both show (compared to the plain SVD), that introducing sparsity can make the results easier to interpret because groups of individuals (men or women) can be identified and linked to small subset of variables (i.e., here pixels). However CSVD and PMD differ in the number of components suggested by their scree plot as indicated by Fig 7. The differences between the loadings estimated by both methods are also seen in Fig 14, which depicts the cross-product between the 6 right singular vectors for three different sparsity levels and for both methods. This figure shows that the risk of re-injecting variability, that was already described in previous components, increases with the sparsity parameter and the number of required components.
Fig 14
Face data.
Cross-product matrix of the 6 right pseudo-singular vectors for three levels of sparsity: low (L), medium (M), and high (H). The results for the CSVD are reported on the left, and the results for PMD are reported on the right.
Cross-product matrix of the 6 right pseudo-singular vectors for three levels of sparsity: low (L), medium (M), and high (H). The results for the CSVD are reported on the left, and the results for PMD are reported on the right.
4.3 Psychometric example: The mental imagery questionnaire
These simulated data were created to match the loading structure of an original dataset obtained from 2,100 participants who—as part of a larger project on memory—answered an online version of the “object-spatial imagery questionnaire” (OSIQ, [29])—a psychometric instrument measuring mental imagery for objects and places. Using a 5-point rating scale, participants rated their agreement for 30 items (e.g., “I am a good Tetris player”) that should span a 2-dimensional space corresponding to the hypothesized spatial and object imagery psychometric factors.The simulated data were obtained from an original data set by first performing a (centered and un-scaled) PCA on the original dataset and keeping only the loadings and the eigenvalues. Random pseudo-observations were generated by randomly sampling (with a uniform probability distribution) points in the factor space and then building back the corresponding data matrix from these random factor scores and the loadings. The final simulated data matrix was then obtained by scaling this new data matrix so that it only contains integer values whose distribution match, as best as possible, the original data matrix. This way, the simulated data matrix contains random values whose means, variances, and loadings roughly match the original data matrix. The R-code used to create the simulated data can be found from the R-package data4PCCAR (available from from the Github repository HerveAbdi/data4PCCAR; the simulated data matrix can also be found in the same R-package.The 2,100 (participants) by 30 (items) data matrix was pre-processed by centering and normalizing each variable and was then analyzed by PCA (i.e., an SVD of the pre-processed matrix). Fig 15 plots the loadings for the 30 items for the first two components of the PCA. In this figure, each item is labeled by its number in the questionnaire (see [29] for details and list of questions), and its a priori category (i.e., “object” vs “spatial”) is indicated with the first letter (o vs s) and color (blue for “object” vs gold for “spatial”). The scree plot (see Fig 16 left) and the plot of the loadings for the first two dimensions (Fig 15) supports a two factor model (with the plane created by Dimensions 1 and 2 explaining 44% of the total variance). The pattern of the loadings, however, reveals that some items load, as predicted, on only one factor (most object items and some spatial items) but that roughly half of spatial items (i.e., s02, s03, s03, s05, s06, s11, s20, s23, s24) and, at least, one object item (i.e., o15) load on both Dimensions 1 and 2. These items are ambiguous because they can reflect either only one of the hypothesized factors or a combination of both factors. To simplify the interpretation, a standard psychometric approach would keep only the unambiguous items, re-run the analysis with these items, and “prettify” the solution with an orthogonal rotation such as Varimax [33].
Fig 15
OSIQ data.
Loadings of the first two principal components.
Fig 16
OSIQ data.
Scree plots for SVD, the CSVD and PMD for different values of the sparsity parameter.
OSIQ data.
Loadings of the first two principal components.Scree plots for SVD, the CSVD and PMD for different values of the sparsity parameter.To evaluate the effects of sparsification, we used three levels of sparsity (in addition to the “no sparsity” condition corresponding to the plain SVD): Low (L), Medium (M), and High (H). As expected, and illustrated by the scree plots (see Fig 16), sparsification reduced the amount of variance (i.e., the pseudo-eigenvalues) explained by the sparsified components.Fig 17 plots the item loadings for Dimensions 1 and 2 for both the CSVD (left column) and PMD (right column) as a function of the levels of sparsity (L/M/H). For the low and intermediate levels of sparsity. For the first two levels of sparsity (L and M) the CSVD and PMD give similar results, possibly because the factor structure of the items on the first dimension is strong enough to be recovered without the orthogonality constraints. For the highest level of sparsity, the CSVD and PMD single out the same item (o28) on the first dimension (an unsurprising result because the maximized criteria are equivalent) but single out different items (s29 vs s18) on the second dimension.
Fig 17
OSIQ data.
Loadings of Dimensions 1 and 2 with an increasing degree of sparsity for both the CSVD (left column) and PMD (right column).
Loadings of Dimensions 1 and 2 with an increasing degree of sparsity for both the CSVD (left column) and PMD (right column).Fig 18 plots the correlations between the loadings estimated with the CSVD or PMD for all 30 dimensions and shows again that the components extracted by PMD are correlated with other components.
Fig 18
OSIQ data.
Plot of the cross-product between the loadings obtained for up to 30 dimensions. Left: CSVD. Right: PMD. Top to bottom: the three different levels of sparsity.
Plot of the cross-product between the loadings obtained for up to 30 dimensions. Left: CSVD. Right: PMD. Top to bottom: the three different levels of sparsity.Visual inspection of the plain PCA analysis suggests that the items are roughly clustered into three groups (pure object, pure spatial, and mixed spatial/objects). To better characterize these three groups of items, we ran an additional analysis in which we set the sparsity parameters to values (specifically and for, respectively the left and right singular vectors) that would generate three pure dimensions for the item loadings (see Fig 19). With this analysis, the first two dimensions isolate the pure items and the third dimension extracts the mixed items. To confirm this interpretation, we ran a plain PCA on the pure items (see Fig 20 left) followed by a Varimax rotation for two dimensions (see Fig 20 right). The Varimax rotated space recovered a solution equivalent to the CSVD with, however, the caveat, that Varimax required the a priori knowledge of the dimensionality of the space whereas the CSVD did not require this information.
Fig 19
OSIQ data.
Loadings for Dimensions 1, 2, and 3 with sparsity parameters set to c1 ≈ 15.11 and c2 ≈ 2.50. The sparsity parameters were empirically determined visually to create “pure” dimensions.
Fig 20
OSIQ data.
PCA with the reduced set of 14 items from the OSIQ. Left: Dimensions 1 and 2 for plain PCA; Right: Dimensions 1 and 2 after a two-dimensional Varimax rotation.
Loadings for Dimensions 1, 2, and 3 with sparsity parameters set to c1 ≈ 15.11 and c2 ≈ 2.50. The sparsity parameters were empirically determined visually to create “pure” dimensions.PCA with the reduced set of 14 items from the OSIQ. Left: Dimensions 1 and 2 for plain PCA; Right: Dimensions 1 and 2 after a two-dimensional Varimax rotation.
Discussion
The constrained singular value decomposition is a computationally efficient new method that sparsifies the SVD while preserving the orthogonality of the singular vectors. To do so, the CSVD expresses each constraint as a projection onto a convex set and integrates multiple constraints as the projection onto the intersection of the convex sets expressing the constraints (POCS). As shown in Appendix D, the CSVD algorithm is guaranteed to converge to a stable point because it is a member of the more general class of the block successive upper-bound minimization (BSUM) algorithms. The CSVD can easily be extended to incorporate additional constraints (e.g., group LASSO, metrics constraints of the rows or columns, spatial constraints) as long as these constraints can be expressed as projections onto convex sets.To evaluate the relevance of the orthogonality constraints, we compared, on three examples, the plain SVD, the penalized matrix decomposition [11], and the CSVD. We found that, as could be expected, without the orthogonality constraint, higher singular vectors shared information with the earlier singular vectors—a problem likely to hinder the interpretation of these later components. The example using face images shows that the CSVD could extract, from the images, characteristic features defining clusters of observations (e.g., men vs women). The psychometric example illustrates how the CSVD can be used in lieu of rotation (e.g., Varimax) to identify psychometrically “pure” components without having to choose a priori the dimensionality of the space.Of course, some questions remain open. For example, the choice of the sparsification constant is left to the user but this choice could be helped with some cross-validation schemes such as the one suggested by Witten et al. (see Algorithm 5 in [11]). In practice, this choice is likely to involve some trade-off between the interpretability of the pseudo singular vectors and the number of non-zeros loadings for a few of the first singular vectors. Along the same lines, and just like for the plain SVD, the number of (sparse) singular vectors to examine remains, in part, a subjective decision. Finally, the problem of the reliability of the sparsification, although a topic of great interest for sparse methods of prediction [34], remains open for sparse SVD or PCA and should be a topic for future research. Future directions should also include the integration of the CSVD into other methods that are traditionally based on the regular SVD—such as canonical correlation, or partial least squares correlation—or on the generalized SVD—such as correspondence analysis (see, e.g., [15] for previous relevant work along these lines).The R package csvd implementing the constrained singular value decomposition is available for download from https://github.com/vguillemot/csvd.
A The deflation operation generates orthogonal singular vectors
In this section, we show that repeatedly using the deflation operation will generate a set of left (respectively right) orthogonal singular vectors ordered by their singular value.Theorem 1 (Deflation). The first singular triplet of the (k + 1)th deflated matrix is the (k + 1)th singular triplet of the original matrix.Proof. Assume that the k ≥ 1, singular values, left and right singular vectors of a given matrix X have been estimated and stored in matrices Δ, P and Q.Let
be the (k + 1)th deflated matrix with:
Additionally, let δ, p and q be (respectively) the first singular value, left and right singular vectors of X(.To prove Theorem 1, we show that δ, p and q are the (k + 1)th singular triplet of X.First, because δ, p and q are a singular triplet of X(, we have:
andSecond, to prove that q is orthogonal to each column of Q, we consider the quantity . By multiplying both sides of Eq 23 by , developing, and simplifying we obtain
Therefore, when δ is not null, q is orthogonal to each column of Q. A similar proof shows that p is orthogonal to each column of P.Third, because p (respectively, q) is orthogonal to each column of P (respectively Q), we have
which, combined with Eqs (23) and (24), implies that
which, in turn, shows that δ, p, and q are a singular triplet of X. This proof also shows (mutatis mutandis) that any singular triplet of X( is a singular triplet of X.Finally, from the definition of X(, and because X( is orthogonal to P
Δ
Q, the rank of X is equal to the rank of X( plus the rank of , which, in run, implies that all the singular values of X( are the remaining singular values of X and that the first singular value of X( is the (k + 1)th singular value of X.
B Only some values of the constraints lead to solutions
As stated by Witten et al. ([11], page 519 ff.) the constraint parameters c1 and c2 lead to solutions only when they are in the range:
Fig 1 in [11] describes the geometric intuition behind this range in .In this appendix, we provide a proof of Statement 28. First, we prove Lemma 1 (that directly implies Statement 28).Lemma 1. Let , then
Proof. Assume that x belongs to and is different from 0. The left side of the inequality is a consequence of Hölder’s inequality, which states that if 0 < p < +∞, and q is a positive real number such that , then
With p = 1, this version of Hölder’s inequality becomes
Since ‖x‖∞ ≤ ‖x‖2, we haveThe right hand side of Eq 29 can be seen as a consequence of Cauchy-Schwarz inequality, which, in our case, would be formulated as follows:
If we set a to [|x1|, …, |x|]⊤ and b to 1, we obtain
which is equivalent toPutting together Eqs 32 and 35 gives:Lemma 1 implies that the constraints on the L2 and L1 norm of the left and right pseudo-singular vectors can be both active at the same time only if the sparsity parameter is chosen such that: (i) the L1-ball of radius ρ (i.e., ) is entirely included in the L2-ball of radius 1 (i.e., ) when ρ ≤ 1, and so that fulfilling the sparsity constraint implies that the L2 constraint is also fulfilled; and (ii) is entirely included in when , and so fulfilling the normalization constraint implies that the sparsity constraint is also fulfilled. To fulfill the constraints on both rows and columns of the CSVD gives the following range for the values of c1 and c2:
which proves the assertion.
C A fast and exact algorithm for the projection onto the intersection of an L1 and L2 ball
In this section we describe a fast and exact algorithm for the projection onto the intersection of the L1-ball of radius c (i.e., ) and the L2-ball of radius 1 (i.e., ). This projection is defined by the following equation:
where , N is the number of variables of the dimension of interest (i.e., I or J), and c is the sparsity parameter (i.e., c1 or c2) with .In [11], the solution of Eq 38 is computed using a binary search algorithm (BiSe). In the main part of our artcie, we propose to use the more general POCS algorithm. BiSe and POCS are iterative algorithms that give an approximate solution to Eq 38. In the case of the projection on the intersection of the L1 and L2 balls, the general POCS algorithm can be replaced by a fast and exact algorithm (see [26, 35]), that we call PL1L2 and detail in this appendix.
Projection onto the L1-ball
The proposed approach implements an efficient algorithm for projecting a vector onto the L1-ball [35].Let be the vector containing the absolute value of the components of x with its elements sorted in decreasing order. Additionally, we define the function φ(λ) = ‖S(x, λ)‖1. This function is continuous, piecewise linear and decreasing from to . Therefore, if ‖x‖1 ≥ c, since φ is continuous, there is a positive number λ such that φ(λ) = c. From this, we can deduce the algorithm of the projection onto the L1-ball of radius c that narrows down to 4 steps.Algorithm 6: Fast projection onto the L1-ball.Data: x, cResult:1. Take the absolute value of the components of x and sort them in decreasing order into a new vector ;2. Find i such that ;3. Find δ such that . Since , then ;4. Compute S(x, λ) with ;At the end of the algorithm, we obtain S(x, λ) which is now the projection of x onto . A similar algorithm was proposed in [36], [37], and [38].
Projection onto the intersection of the L1 and L2-balls
In order to solve the optimization problem from Eq 38, we extend Algorithm 6 to the functionWe have the following Lemma:Lemma 2. Let
x
be a vector of
, composed of n ≤ N non-zero elements. Then
Proof. The proof of this Lemma is very similar to the proof given in Appendix B. Recall that as a consequence of the Cauchy-Schwarz inequality:
With a = [|x1|, …, |x|] and b a vector such that
the previous inequality becomes
which is equivalent toProposition 1. For ,
verifies the 3 following properties:ψ is continuous and decreasing.There exist an integer i and a positive real number δ, smaller than
, such that
.δ is the solution of a second degree polynomial equation.Proof. (i). The numerator and denominator of ψ are continuous because there are compositions of continuous functions. Moreover, for any λ strictly smaller than , . Therefore, ψ is continuous because it is the ratio of a continuous function and a non-zero continuous function.Assuming , for there exists k ∈ 1, …, N such that . For this specific λ, we have:
and
Together, Eqs 46 and 47 imply that the derivative of ψ has the form:
Moreover, because the number of non-zero elements of vector is equal to k, Lemma 2 implies that , and therefore ψ(λ)2 ≤ k. As a consequence, ψ′(λ) ≤ 0, which, in turn, implies that ψ, being a continuous function with a negative derivative, is a decreasing function.(ii). Let Nmax be the number of elements of x equal to (the maximum of ) and . Then
Thus, ψ is decreasing from (Lemma 2) to . This implies that for , there is an integer i ∈ 1, …, N such that . Finally, because ψ is continuous, there is a real number δ in such that .(iii). Using the notations and , with i (and δ) defined as previously stated in (ii), we have:
and
Moreover, since
the following equality holds:Incorporating Eqs 50 and 51 into Eq 53 gives:
The goal is now to find the positive root of this second degree polynomial equation. The discriminant Δ is equal to . It remains to show that Δ is positive.First, the number of non-zero elements of is equal to i and Lemma 2 yields . Second, so . Combining the two previous inequalities yields which implies that i−c2 > 0. Third, from , we deduce that which ensures that Δ is positive.To conclude, the sign of corresponds to the sign of the product of the 2 roots. As this term is negative, the 2 roots have opposite signs. The single solution of is:
Using the fact that , the previous equation can be simplified asWe deduce from this a four step algorithm, called PL1L2, for the projection onto the intersection of the L1-ball of radius c and the L2-ball of radius 1.Algorithm 7: PL1L2: an algorithm for a fast and exact projection onto .Data: x, cResult:1. Take the absolute value of x and sort its elements in decreasing order to get ;2. Find i such that ;3. Let ;4. Compute S(x, λ) with ;
D Convergence of the CSVD algorithm
In this appendix we prove the convergence of the CSVD. To do so, we show that the CSVD is an instance of the block successive upper-bound minimization (BSUM) algorithm (introduced in [27]) and, as such, converges to a stationary point.
D.1 Definitions and notations
Define f, which is the negative of the objective function from Eq 15
The functions u1 and u2 are two “approximations” of f, defined as
and
These two functions depend on the fixed given vectors p and q and vary according to and .In the BSUM framework, f is minimized by iteratively minimizing u1 over a convex set and u2 over a convex set .Definition 1 (BSUM algorithm [27]). The BSUM algorithm (in the present setting) is defined as:Minimize u1
over
with
q
fixed, and update
p
with the solution;Minimize u2
over
with
p
fixed, and update
q
with the solution,and iterate until convergence.Recall the following definitions.Definition 2 (Directional derivative). Let g be a function with gradient at
x
denoted ∇
g. The directional derivative of g in a direction
d
isDefinition 3 (Regularity). Let f be a differentiable function defined over
. Assume that
and
with
and
. If this implies that
then f is regular.
D.2 Equivalence of the BSUM algorithm and the CSVD
Algorithm 5 is equivalent to the BSUM algorithm because:Minimizing f is the same as maximizing the objective function in Eq 15.Minimizing u1 over is equivalent to the left projection step.Similarly, minimizing u2 is equivalent to the right projection step.
D.3 Convergence of the BSUM algorithm
In order to converge to a stationary point, the BSUM algorithm needs to meet a few key assumptions that are specified in the following theorem (adapted from Theorem 2 in [27]).Theorem 2 (Convergence). The BSUM algorithm converges to a stationary point under the following conditions:f is regular,u1, u2
and f coincide (condition (B1) in [27])u1, u2
are upper bounds of f (condition (B2) in [27]), , andthe directional derivatives of u1, u2
and f coincide (condition (B3) in [27])
andu1
and
u2
are continuous functions (condition (B4) in [27]).
D.3.1 Regularity
We show in this section that f is regular. The gradient of f with respect to its arguments, p and q, is defined as
Thus, the directional derivative of f in the direction , with and is equal to
Hence, if and , then the directional derivative of f in the direction of d is also positive, which proves that f is regular.
D.3.2 (B1), (B2), and (B3) in Razaviyayn et al., 2013
Because u1 and u2 are equal to the function f with either p or q fixed, u1 and u2 coincide with f, which proves (B1). Necessarily, so do their directional derivatives, which proves (B3). Finally because they coincide, u1 and u2 are upper bounds of f [in the sense of the condition (B2) in [27]].
D.3.3 Continuity: (B4) in Razaviyayn et al., 2013
Being compositions of linear operations, the functions u1 and u2 are both continuous.
D.4 Conclusion
It follows from these properties that the CSVD method, as described in Algorithm 5 being based on alternating between applying the projected power method with respect to p and to q, is a particular instance of the block successive upper-bound minimization (BSUM) algorithm ([27]). Therefore, any limit point of the CSVD method is a stationary point and so the CSVD converges to a stationary solution of Eq 15.Additional results on broader settings.(PDF)Click here for additional data file.
Authors: Edith Le Floch; Vincent Guillemot; Vincent Frouin; Philippe Pinel; Christophe Lalanne; Laura Trinchera; Arthur Tenenhaus; Antonio Moreno; Monica Zilbovicius; Thomas Bourgeron; Stanislas Dehaene; Bertrand Thirion; Jean-Baptiste Poline; Edouard Duchesnay Journal: Neuroimage Date: 2012-07-08 Impact factor: 6.556