Jianqing Fan¹, Weichen Wang¹, Yiqiao Zhong¹
¹ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.
Abstract
In statistics and machine learning, we are often interested in the eigenvectors (or singular vectors) of certain matrices (e.g. covariance matrices, data matrices, etc.). However, those matrices are usually perturbed by noise or statistical errors, arising either from random sampling or from structural patterns. The Davis-Kahan sin θ theorem is often used to bound the difference between the eigenvectors of a matrix A and those of a perturbed matrix Ã = A + E, in terms of the ℓ2 norm. In this paper, we prove that when A is a low-rank and incoherent matrix, the ℓ∞ norm perturbation bound of singular vectors (or eigenvectors in the symmetric case) is smaller by a factor of √d1 or √d2 for left and right vectors, where d1 and d2 are the matrix dimensions. The power of this new perturbation result is shown in robust covariance estimation, particularly when random variables have heavy tails. There, we propose new robust covariance estimators and establish their asymptotic properties using the newly developed perturbation bound. Our theoretical results are verified through extensive numerical experiments.
The perturbation of matrix eigenvectors (or singular vectors) has been well studied in matrix perturbation theory (Wedin, 1972; Stewart, 1990). The best known result on eigenvector perturbation is the classic Davis-Kahan theorem (Davis and Kahan, 1970). It originally emerged as a powerful tool in numerical analysis, but soon found widespread use in other fields, such as statistics and machine learning. Its popularity has continued to surge in recent years, largely owing to the ubiquity of data analysis, where it is common practice, for example, to employ PCA (Jolliffe, 2002) for dimension reduction, feature extraction, and data visualization.
The eigenvectors of matrices are closely related to the underlying structure in a variety of problems. For instance, principal components often capture most information of data and extract the latent factors that drive the correlation structure of the data (Bartholomew et al., 2011); in classical multidimensional scaling (MDS), the centered squared distance matrix encodes the coordinates of data points embedded in a low dimensional subspace (Borg and Groenen, 2005); and in clustering and network analysis, spectral algorithms are used to reveal clusters and community structure (Ng et al., 2002; Rohe et al., 2011). In those problems, the low dimensional structure that we want to recover is often ‘perturbed’ by observation uncertainty or statistical errors. Besides, there might be a sparse pattern corrupting the low dimensional structure, as in approximate factor models (Chamberlain et al., 1983; Stock and Watson, 2002) and robust PCA (De La Torre and Black, 2003; Candès et al., 2011).
A general way to study these problems is to consider
$$\tilde{A} = A + S + N, \tag{1}$$
where A is a low rank matrix, S is a sparse matrix, and N is a random matrix regarded as random noise or estimation error, all of which have the same size d1 × d2. Usually A is regarded as the ‘signal’ matrix we are primarily interested in, S is some sparse contamination whose effect we want to separate from A, and N is the noise (or estimation error in covariance matrix estimation).
The decomposition (1) forms the core of a flourishing literature on robust PCA (Chandrasekaran et al., 2011; Candès et al., 2011), structured covariance estimation (Fan et al., 2008, 2013), multivariate regression (Yuan et al., 2007) and so on. Among these works, a standard condition on A is matrix incoherence (Candès et al., 2011). Let the singular value decomposition be
$$A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T, \tag{2}$$
where r is the rank of A, the singular values are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$, and the matrices $U = [u_1, \ldots, u_r] \in \mathbb{R}^{d_1 \times r}$, $V = [v_1, \ldots, v_r] \in \mathbb{R}^{d_2 \times r}$ consist of the singular vectors. The coherences $\mu(U)$, $\mu(V)$ are defined as
$$\mu(U) = \frac{d_1}{r} \max_{i} \sum_{j=1}^{r} U_{ij}^2, \qquad \mu(V) = \frac{d_2}{r} \max_{i} \sum_{j=1}^{r} V_{ij}^2, \tag{3}$$
where U_ij and V_ij are the (i, j) entries of U and V, respectively. It is usually expected that μ0 := max{μ(U), μ(V)} is not too large, which means the singular vectors u_i and v_i are incoherent with the standard basis. This incoherence condition (3) is necessary for us to separate the sparse component S from the low rank component A; otherwise A and S are not identifiable. Note that we do not need any incoherence condition on UVᵀ, which is different from Candès et al. (2011) and is arguably unnecessary (Chen, 2015).
Now we denote the eigengap where for notational convenience. Also we let E = S + N, and view it as a perturbation matrix to the matrix A in (1). To quantify the perturbation, we define a rescaled measure as , where
which are commonly used norms gauging sparsity (Bickel and Levina, 2008). They are also operator norms in suitable spaces (see Section 2). The rescaled norms and are comparable to the spectral norm in many cases; for example, when E is an all-one matrix, .
Suppose the perturbed matrix Ã also has the singular value decomposition:
$$\tilde{A} = \sum_{i=1}^{d_1 \wedge d_2} \tilde{\sigma}_i \tilde{u}_i \tilde{v}_i^T, \tag{5}$$
where the σ̃_i are nonnegative and in decreasing order, and the notation ∧ means a ∧ b = min{a, b}. Denote Ũ = [ũ_1, …, ũ_r] and Ṽ = [ṽ_1, …, ṽ_r], which are counterparts of the top r singular vectors of A.
We will present an ℓ∞ matrix perturbation result that bounds ‖ũ_i − u_i‖∞ and ‖ṽ_i − v_i‖∞ up to sign.[1] This result is different from ℓ2 bounds, Frobenius-norm bounds, or the sin Θ bounds, as the ℓ∞ norm is not orthogonally invariant. The following theorem is a simplified version of our main results in Section 2.
Theorem 1
Let Ã = A + E and suppose the singular decomposition in (2) and (5), where . Then there exists such that, if , up to sign,
where is the coherence given after (3) and .
When A is symmetric, and the condition on the eigengap is simply . The incoherence condition naturally holds for a variety of applications, where the low rank structure emerges as a consequence of a few factors driving the data matrix. For example, in Fama-French factor models, the excess returns in a stock market are driven by a few common factors (Fama and French, 1993); in collaborative filtering, the ratings of users are mostly determined by a few common preferences (Rennie and Srebro, 2005); in video surveillance, A is associated with the stationary background across image frames (Oliver et al., 2000). We will have a detailed discussion in Section 2.3.
The eigenvector perturbation was studied by Davis and Kahan (1970), where Hermitian matrices were considered, and the results were extended by Wedin (1972) to general rectangular matrices. To compare our result with these classical results, assuming , a combination of Wedin’s theorem and Mirsky’s inequality (Mirsky, 1960) (the counterpart of Weyl’s inequality for singular values) implies
where .
Yu et al. (2015) also proved a bound similar to (7), and that result is more convenient to use. If we are interested in the ℓ∞ bound but naively use the trivial inequality ‖x‖∞ ≤ ‖x‖2, we would have a suboptimal bound in many situations, especially in cases where is comparable to . Compared with (6), the bound is worse by a factor of √d1 for the left singular vectors and √d2 for the right singular vectors. In other words, converting the ℓ2 bound from the Davis-Kahan theorem directly to the ℓ∞ bound does not give a sharp result in general, in the presence of incoherent and low rank structure of A. Actually, assuming is comparable with , for square matrices, our ℓ∞ bound (6) matches the ℓ2 bound (7) in terms of dimensions d1 and d2. This is because ‖x‖∞ ≥ ‖x‖2/√n for any x ∈ ℝⁿ, so we expect to gain a factor √d1 or √d2 in those ℓ∞ bounds. The intuition is that, when A has an incoherent and low-rank structure, the perturbation of singular vectors is not concentrated on a few coordinates.
To understand how matrix incoherence helps, let us consider a simple example with no matrix incoherence, in which (7) is tight up to a constant. Let be a d-dimensional square matrix, and of the same size. It is apparent that , and that up to sign. Clearly, the perturbation is not vanishing as d tends to infinity in this example, and thus, there is no hope of a strong upper bound as in (6) without the incoherence condition.
The reason that the factor √d1 or √d2 comes into play in (7) is that the error (and similarly for v) spreads out evenly over d1 (or d2) coordinates, so that the ℓ∞ error is far smaller than the ℓ2 error. This, of course, hinges on the incoherence condition, which in essence precludes eigenvectors from aligning with any coordinate.
Our result is very different from the sparse PCA literature, in which it is usually assumed that the leading eigenvectors are sparse. In Johnstone and Lu (2009), it is proved that there is a threshold for p/n (the ratio between the dimension and the sample size), above which PCA performs poorly, in the sense that is approximately 0. This means that the principal component computed from the sample covariance matrix reveals nothing about the true eigenvector. In order to mitigate this issue, in Johnstone and Lu (2009) and subsequent papers (Vu and Lei, 2013; Ma, 2013; Berthet and Rigollet, 2013), sparse leading eigenvectors are assumed. However, our result is different, in the sense that we require a stronger eigengap condition (i.e. stronger signal), whereas in Johnstone and Lu (2009), the eigengap of the leading eigenvectors is a constant times . This explains why it is plausible to have a strong uniform eigenvector perturbation bound in this paper.
We will illustrate the power of this perturbation result using robust covariance estimation as one application. In the approximate factor model, the true covariance matrix admits a decomposition into a low rank part A and a sparse part S. Such models have been widely applied in finance, economics, genomics, and health to explore correlation structure.
However, in many studies, especially financial and genomics applications, it is well known that the observations exhibit heavy tails (Gupta et al., 2013). This problem can be resolved with the aid of recent results on concentration bounds in robust estimation (Catoni, 2012; Hsu and Sabato, 2014; Fan et al., 2017a), which produce the estimation error N in (1) with an optimal entry-wise bound. It nicely fits our perturbation results, and we can tackle it easily by following the ideas in Fan et al. (2013).
Here is some notation used in this paper. For a generic d1 by d2 matrix, the matrix max-norm is denoted as . The matrix operator norm induced by vector norm is . In particular, and denotes the spectral norm, or the matrix 2-norm for simplicity. We use to denote the jth largest singular value. For a symmetric matrix M, denote as its jth largest eigenvalue. If M is a positive definite matrix, then M^{1/2} is the square root of M, and M^{−1/2} is the square root of M^{−1}.
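The role of incoherence described above can be illustrated numerically. The sketch below is not the paper's own example: it assumes a hypothetical perturbation E = c·d·(e1e2ᵀ + e2e1ᵀ), which has both spectral norm and maximum absolute row sum equal to c·d, and applies it to two rank-one signals with the same eigenvalue d, one aligned with a coordinate axis and one with a perfectly flat eigenvector. The function names are illustrative.

```python
import numpy as np

def top_eigvec(M):
    """Unit eigenvector for the largest eigenvalue of a symmetric matrix."""
    _, vecs = np.linalg.eigh(M)          # eigenvalues come back in ascending order
    return vecs[:, -1]

def linf_error(v_hat, v):
    """Entrywise (l-infinity) error after resolving the sign ambiguity."""
    sign = 1.0 if v_hat @ v >= 0 else -1.0
    return np.max(np.abs(sign * v_hat - v))

def compare(d, c=0.1):
    """Return (error for a coherent signal, error for an incoherent signal)."""
    # Assumed perturbation (not from the paper): couples coordinates 1 and 2.
    E = np.zeros((d, d))
    E[0, 1] = E[1, 0] = c * d

    # Coherent signal: top eigenvector is e_1, eigenvalue d.
    e1 = np.zeros(d)
    e1[0] = 1.0
    A_spiky = d * np.outer(e1, e1)

    # Incoherent signal: top eigenvector is (1, ..., 1)/sqrt(d), eigenvalue d.
    flat = np.ones(d) / np.sqrt(d)
    A_flat = d * np.outer(flat, flat)

    return (linf_error(top_eigvec(A_spiky + E), e1),
            linf_error(top_eigvec(A_flat + E), flat))
```

Under these assumptions, calling `compare(d)` for increasing d should show the first error staying at a constant level while the second shrinks roughly like 1/√d, which is the qualitative behaviour the incoherence condition is meant to capture.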
The perturbation results
Symmetric matrices
First, we study perturbation for symmetric matrices (so d1 = d2). The approach we take for symmetric matrices will also be useful for analyzing asymmetric matrices, because we can always augment a d1 × d2 rectangular matrix into a symmetric matrix, and transfer the study of singular vectors to the eigenvectors of the augmented matrix. This augmentation is called Hermitian dilation (Tropp, 2012; Paulsen, 2002).
Suppose that A is a d-dimensional symmetric matrix. The perturbation matrix E is also d-dimensional and symmetric. Let the perturbed matrix be Ã = A + E. Suppose the spectral decomposition of A is given by
where , , and where . Note the best rank-r approximation of A under the Frobenius norm is .[2] Analogously, the spectral decomposition of à is
and write , where . Recall that ‖E‖∞ given by (4) is an operator norm in the ℓ∞ space, in the sense that ‖E‖∞ = sup_{‖u‖∞ ≤ 1} ‖Eu‖∞. This norm is the natural counterpart of the spectral norm ‖E‖2 := sup_{‖u‖2 ≤ 1} ‖Eu‖2.
We will use notations and to hide absolute constants.[3] The next theorem bounds the perturbation of eigenspaces up to a rotation.
Theorem 2
Suppose , where , which is the approximation error measured under the matrix ∞-norm and is the coherence of V defined in such that
This result involves an unspecified rotation R, due to the possible presence of multiplicity of eigenvalues. In the case where , the individual eigenvectors of V are only identifiable up to rotation. However, assuming an eigengap (similar to the Davis-Kahan theorem), we are able to bound the perturbation of individual eigenvectors (up to sign).
Theorem 3
Assume the conditions in Theorem 2, and for any , the interval does not contain any eigenvalues of A other than . Then, up to sign,
To understand the above two theorems, let us consider the case where A has exactly rank r (i.e., ε = 0), and r and μ are not large (say, bounded by a constant). Theorem 2 gives a uniform entrywise bound on the eigenvector perturbation. As a comparison, the Davis-Kahan sin Θ theorem (Davis and Kahan, 1970) gives a bound on with suitably chosen rotation R.[4] This is an order of √d larger than the bound given in Theorem 2 when is of the same order as . Thus, in scenarios where is comparable to , this is a refinement of the Davis-Kahan theorem, because the max-norm bound in Theorem 2 provides an entry-wise control of perturbation. Although ,[5] there are many settings where the two quantities are comparable; for example, if E has a submatrix whose entries are identical and has zero entries otherwise, then .
Theorem 3 provides the perturbation of individual eigenvectors, under a usual eigengap assumption. When r and μ are not large, we incur an additional term in the bound. This is understandable, since is typically .
When the rank of A is not exactly r, we require that is larger than the approximation error . It is important to state that this assumption is more restrictive than the eigengap assumption in the Davis-Kahan theorem, since . However, different from the matrix max-norm, the spectral norm only depends on the eigenvalues of a matrix, so it is natural to expect perturbation bounds that only involve and . It is not clear whether we should expect an bound that involves instead of ε. More discussion can be found in Section 5.
We do not pursue the optimal bound in terms of r and μ(V) in this paper, as the two quantities are not large in many applications, and the current proof is already complicated.
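To make the "up to a rotation" statement of Theorem 2 concrete, the following sketch measures the max-norm error between two eigenspaces after aligning them. Theorem 2 leaves R unspecified; choosing R as the orthogonal Procrustes solution is an assumption made here for illustration, as is ordering eigenvalues by magnitude. The helper names are hypothetical.

```python
import numpy as np

def procrustes_rotation(V_tilde, V):
    """Orthogonal R minimizing ||V_tilde @ R - V||_F (orthogonal Procrustes)."""
    W, _, Zt = np.linalg.svd(V_tilde.T @ V)
    return W @ Zt

def top_r_eigvecs(M, r):
    """Eigenvectors of a symmetric M for the r eigenvalues largest in magnitude."""
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(np.abs(vals))[::-1][:r]
    return vecs[:, idx]

def subspace_errors(A, E, r):
    """Max-norm and Frobenius-norm errors of the top-r eigenspace of A + E."""
    V = top_r_eigvecs(A, r)
    V_tilde = top_r_eigvecs(A + E, r)
    R = procrustes_rotation(V_tilde, V)
    D = V_tilde @ R - V
    return np.max(np.abs(D)), np.linalg.norm(D, "fro")
```

For an incoherent low-rank A, one would expect the first returned value to be smaller than the second by roughly a factor of √d, in line with the comparison with the Davis-Kahan bound above.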
Rectangular matrices
Now we establish perturbation bounds for general rectangular matrices. The results here are more general than those in Section 1, and in particular, we allow the matrix A to be of approximate low rank. Suppose that both A and E are d1 × d2 matrices, and Ã := A + E. The rank of A is at most d1 ∧ d2 (where a ∧ b = min{a, b}). Suppose an integer r satisfies r ≤ rank(A). Let the singular value decomposition of A be
where the singular values are ordered as , and the unit vectors (or unit vectors ) are orthogonal to each other. We denote and . Analogously, the singular value decomposition of Ã is
where . Similarly, columns of and are orthonormal.
Define μ0 = max{μ(U), μ(V)}, where μ(U) (resp. μ(V)) is the coherence of U (resp. V). This μ0 will appear in the statement of our results, as it controls the structure of both the left and right singular spaces. In the special case where A is a symmetric matrix, the spectral decomposition of A is also the singular value decomposition (up to sign), and thus μ0 coincides with μ defined in Section 2.1.
Recall the definition of the matrix ∞-norm and 1-norm of a rectangular matrix (4). Similar to the matrix ∞-norm, ‖·‖1 is an operator norm in the ℓ1 space. An obvious relationship between the matrix ∞-norm and 1-norm is ‖E‖∞ = ‖Eᵀ‖1. Note that the matrix ∞-norm and 1-norm have different numbers of summands in their definitions, so we are motivated to consider to balance the dimensions d1 and d2.
Let be the best rank-r approximation of A under the Frobenius norm, and let , which also balances the two dimensions. Note that in the special case where A is symmetric, this approximation error ε0 is identical to ε defined in Section 2.1. The next theorem bounds the perturbation of singular spaces.
Theorem 4
Suppose that . Then, there exist orthogonal matrices such that,
Similar to Theorem 3, under an assumption of gaps between singular values, the next theorem bounds the perturbation of individual singular vectors.
Theorem 5
Suppose the same assumptions in satisfies , and for any , the interval does not contain any eigenvalues of A other than . Then, up to sign,
As mentioned in the beginning of this section, we will use dilation to augment all d1 × d2 matrices into symmetric ones with size d1 + d2. In order to balance the possibly different scales of d1 and d2, we consider a weighted max-norm. This idea will be further illustrated in Section 5.
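The dilation step can be checked on a small example. The sketch below builds the plain (unweighted) Hermitian dilation of a rectangular matrix and verifies the standard fact that its nonzero eigenvalues are plus and minus the singular values, with eigenvectors obtained by stacking left and right singular vectors. The weighting used in the paper's max-norm is not reproduced here, and the function name is an illustrative choice.

```python
import numpy as np

def hermitian_dilation(M):
    """Symmetric (d1 + d2) x (d1 + d2) matrix [[0, M], [M^T, 0]]."""
    d1, d2 = M.shape
    return np.block([[np.zeros((d1, d1)), M],
                     [M.T, np.zeros((d2, d2))]])

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(M, full_matrices=False)   # s is in descending order

H = hermitian_dilation(M)
eigvals = np.linalg.eigvalsh(H)                    # ascending order

# The three largest eigenvalues of H are the singular values of M;
# the three smallest are their negatives.
largest = eigvals[::-1][:3]
smallest = eigvals[:3]

# Stacking (u_1; v_1)/sqrt(2) gives a unit eigenvector of H for sigma_1.
w = np.concatenate([U[:, 0], Vt[0]]) / np.sqrt(2)
```

This is why bounds for eigenvectors of symmetric matrices transfer to singular vectors: the top eigenvectors of the dilation carry both the left and right singular vectors of the original matrix.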
Examples: which matrices have such structure?
In many problems, low-rank structure naturally arises due to the impact of pervasive latent factors that influence most observed data. Since observations are imperfect, the low-rank structure is often ‘perturbed’ by an additional sparse structure, gross errors, measurement noises, or the idiosyncratic components that can not be captured by the latent factors. We give some motivating examples with such structure.
Panel data in stock markets.
Consider the excess returns from a stock market over a period of time. The driving factors in the market are reflected in the covariance matrix as a low rank component A. The residual covariance of the idiosyncratic components is often modeled by a sparse component S. Statistical analysis including PCA is usually conducted based on the estimated covariance matrix , which is perturbed from the true covariance by the estimation error N (Stock and Watson, 2002; Fan et al., 2013). In Section 3.1, we will develop a robust estimation method in the presence of heavy-tailed return data.
Video surveillance.
In image processing and computer vision, it is often desired to separate moving objects from static background before further modeling and analysis (Oliver et al., 2000; Hu et al., 2004). The static background corresponds to the low rank component A in the data matrix, which is a collection of video frames, each consisting of many pixels represented as a long vector in the data matrix. Moving objects and noise correspond to the sparse matrix S and noise matrix N. Since the background is global information and reflected by many pixels of a frame, it is natural for the incoherence condition to hold.
Wireless sensor network localization.
In wireless sensor networks, we are usually interested in determining the location of sensor nodes with unknown position based on a few (noisy) measurements between neighboring nodes (Doherty et al., 2001; Biswas and Ye, 2004). Let X be an r by n matrix such that each column x_i gives the coordinates of a node in a plane (r = 2) or a space (r = 3). Assume the center of the sensors has been relocated to the origin. Then the low rank matrix A = XᵀX, encoding the true distance information, has to satisfy distance constraints given by the measurements. The noisy distance matrix after centering is equal to the sum of A and a matrix N consisting of measurement errors. Suppose that each node is a random point uniformly distributed in a rectangular region. It is not difficult to see that with high probability, the top r eigenvalues of and their eigengap scale with the number of sensors n and the leading eigenvectors have a bounded coherence.
In our theorems, we require that the coherence μ is not too large. This is a natural structural condition associated with low rank matrices. Consider the following very simple example: if the eigenvectors u1,...,ur of the low rank matrix A are uniform unit vectors in a sphere, then with high probability, , which implies . An intuitive way to understand the incoherence structure is that no coordinates of the eigenvectors are dominant. In other words, the eigenvectors are not concentrated on a few coordinates.
In all our examples, the incoherence structure is natural. The factor model satisfies such structure, which will be discussed in Section 3. In the video surveillance example, ideally, when the images are static, A is a rank one matrix x1ᵀ. Since usually a majority of pixels (coordinates of x) help to display an image, the vector x often has dense coordinates with comparable magnitude, so A also has an incoherence structure in this example. Similarly, in the sensor localization example, the coordinates of all sensor nodes are comparable in magnitude, so the low rank matrix A formed by also has the desired incoherence structure.
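The claim that random directions are incoherent can be checked with a few lines of code. The sketch below assumes the usual coherence definition μ(V) = (d/r) max_i Σ_j V_ij² for a d × r matrix with orthonormal columns, and draws a random subspace via a QR factorization of a Gaussian matrix; both the helper names and the choice d = 1000, r = 3 are illustrative.

```python
import numpy as np

def coherence(V):
    """mu(V) = (d / r) * max_i sum_j V[i, j]**2 for a d x r orthonormal V."""
    d, r = V.shape
    return d / r * np.max(np.sum(V**2, axis=1))

def random_orthonormal(d, r, rng):
    """Orthonormal basis of a uniformly random r-dimensional subspace of R^d."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return Q

rng = np.random.default_rng(0)
d, r = 1000, 3

# Random directions: coherence stays small (it always lies in [1, d / r]).
mu_random = coherence(random_orthonormal(d, r, rng))

# Standard basis vectors: the worst case, coherence equals d / r.
mu_spiky = coherence(np.eye(d)[:, :r])
```

The contrast between `mu_random` and `mu_spiky` is the distinction the theorems rely on: eigenvectors spread over many coordinates versus eigenvectors concentrated on a few.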
Other perturbation results
Although eigenvector perturbation theory is well studied in numerical analysis, there has recently been renewed interest in the statistics and machine learning communities, due to the wide applicability of PCA and other eigenvector-based methods. Cai and Zhang (2016) and Yu et al. (2015) obtained variants or improvements of the Davis-Kahan theorem (or Wedin’s theorem), which are user-friendly in statistical contexts. These results assume the perturbation is deterministic, as do the Davis-Kahan theorem and Wedin’s theorem. In general, these results are sharp, even when the perturbation is random, as evidenced by the BBP transition (Baik et al., 2005).
However, these classical results can be suboptimal when the perturbation is random and the smallest eigenvalue gap does not capture particular spectrum structure. For example, Vu (2011) and O’Rourke et al. (2013) showed that with high probability, there are bounds sharper than Wedin’s theorem when the signal matrix is low-rank and satisfies certain eigenvalue conditions.
In this paper, our perturbation results are deterministic, so the bound can be suboptimal when the perturbation is random with certain structure (e.g. the difference between the sample covariance and the population one for i.i.d. samples). However, the advantage of a deterministic result is that it is applicable to any random perturbation. This is especially useful when we cannot make strong random assumptions on the perturbation (e.g., the perturbation is an unknown sparse matrix). In Section 3, we will see examples of this type.
Application to robust covariance estimation
We will study the problem of robust estimation of covariance matrices and show the strength of our perturbation result. Throughout this section, we assume both rank r and the coherence μ(V) are bounded by a constant, though this assumption can be relaxed. We will use C to represent a generic constant, and its value may change from line to line.
PCA in spiked covariance model
To initiate our discussion, we first consider sub-Gaussian random variables. Let X = (X1,...,Xd) be a random d-dimensional vector with mean zero and covariance matrix
and be an n by d matrix, whose rows are independently sampled from the same distribution. This is the spiked covariance model that has received intensive study in recent years. Let the empirical covariance matrix be . Viewing the empirical covariance matrix as its population version plus an estimation error, we have the decomposition
which is a special case of the general decomposition in (1). Here, is the sparse component, and the estimation error is the noise component. Note that v1,...,vr are just the top r leading eigenvectors of and we write V = [v1,...,vr]. Assume the top r eigenvectors of are denoted by . We want to find an ℓ∞ bound on the estimation error for all .
When the dimension d is comparable to or larger than n, it has been shown by Johnstone and Lu (2009) that the leading empirical eigenvector is not a consistent estimate of the true eigenvector v1, unless we assume larger eigenvalues. Indeed, we will impose more stringent conditions on ’s in order to obtain good bounds.
Assuming the coherence μ(V) is bounded, we can easily see for some constant C. It follows from the standard concentration result (e.g., Vershynin (2010)) that if the rows of contain i.i.d. sub-Gaussian vectors and , then with probability greater than 1 − d^{−1},
To apply Theorem 3, we treat as A and as E. If the conditions in Theorem 3 are satisfied, we will obtain
Note there are simple bounds on and :
By assuming a strong uniform eigengap, the conditions in Theorem 3 are satisfied, and the bound in (13) can be simplified. Define the uniform eigengap as
Note , so if , we have
In particular, when and , we have
The above analysis pertains to the structure of the sample covariance matrix. In the following subsections, we will estimate the covariance matrix using a more complicated robust procedure. Our perturbation theorems in Section 2 provide a fast and clean approach to obtain new results.
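The setting of this subsection can be simulated in a few lines. The sketch below assumes a spiked covariance of the form Σ = V diag(λ1, …, λr) Vᵀ + I with Gaussian data; the identity noise covariance, the Gaussian sampling, and the function name are illustrative assumptions rather than the paper's exact model (11). It returns the largest entrywise error of the top r sample eigenvectors after fixing signs.

```python
import numpy as np

def spiked_pca_error(n, d, spikes, rng):
    """Max entrywise error of the top sample eigenvectors in a spiked model.

    `spikes` must be a 1-D array of distinct positive values in decreasing order.
    """
    r = len(spikes)
    # Random orthonormal V; incoherent with high probability.
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))

    # Rows have covariance V diag(spikes) V^T + I (assumed noise level 1).
    F = rng.standard_normal((n, r))
    X = F @ (V * np.sqrt(spikes)).T + rng.standard_normal((n, d))

    S = X.T @ X / n                          # sample covariance (mean is zero)
    _, vecs = np.linalg.eigh(S)
    V_hat = vecs[:, ::-1][:, :r]             # top-r sample eigenvectors

    # Resolve the sign ambiguity column by column.
    signs = np.sign(np.sum(V_hat * V, axis=0))
    return np.max(np.abs(V_hat * signs - V))
```

With spikes that grow proportionally to d, the returned error should be small and decrease as n grows, in line with the discussion of the uniform eigengap above.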
PCA for robust covariance estimation
The usefulness of Theorem 3 is more pronounced when the random variables are heavy-tailed. Consider again the covariance matrix with structure (11). Instead of assuming sub-Gaussian distribution, we assume there exists a constant C > 0 such that , i.e. the fourth moments of the random variables are uniformly bounded.
Unlike sub-Gaussian variables, there is no concentration bound similar to (12) for the empirical covariance matrix. Fortunately, thanks to recent advances in robust statistics (e.g., Catoni (2012)), a robust estimate of with guaranteed concentration properties becomes possible. We shall use the method proposed in Fan et al. (2017a). Motivated by the classical M-estimator of Huber (1964), Fan et al. (2017a) proposed a robust estimator for each element of , by solving a Huber loss based minimization problem
where lα is the Huber loss defined as
The parameter α is suggested to be for , where u is assumed to satisfy . If , Fan et al. (2017a) showed
From this result, the next proposition is immediate by taking .
Proposition 1
Suppose that there is a constant C with . Then with probability greater than , the robust estimate of covariance matrix with satisfies
where v is a pre-determined parameter assumed to be no less than .
This result relaxes the sub-Gaussianity assumption by robustifying the covariance estimate. It is apparent that the bound in the previous section is still valid in this case. To be more specific, suppose μ(V) is bounded by a constant. Then, (13) holds for the PCA based on the robust covariance estimation. When and , we again have
Note that an entrywise estimation error necessarily implies consistency of the estimated eigenvectors, since we can easily convert an ℓ∞ result into an ℓ2 result. The minimum signal strength (or magnitude of leading eigenvalues) for such consistency is shown to be under the sub-Gaussian assumption (Wang and Fan, 2017).
If the goal is simply to prove consistency of , the strategy of using our perturbation bounds is not optimal. However, there are also merits: our result is nonasymptotic; it holds for more general distributions (beyond sub-Gaussian distributions); and its entrywise bound gives a stronger guarantee. Moreover, the perturbation bounds provide greater flexibility for analysis, since it is straightforward to adapt the analysis to problems with more complicated structure. For example, the above discussion can be easily extended to a general with bounded rather than a diagonal matrix.
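The entrywise Huber-type estimator discussed in this subsection can be sketched as follows. The code assumes the Huber loss in the form l_α(x) = x² for |x| ≤ α and 2α|x| − α² otherwise, applies it to the products X_ti X_tj for mean-zero data, and solves each one-dimensional problem by a simple gradient iteration. The solver, the iteration count, and the function names are illustrative choices, and α is left as a user-supplied tuning parameter rather than set by the paper's rule.

```python
import numpy as np

def huber_mean(z, alpha, n_iter=200):
    """Minimize sum_t l_alpha(z_t - mu) over mu, elementwise along axis 0.

    Assumed loss: l_alpha(x) = x**2 if |x| <= alpha, else 2*alpha*|x| - alpha**2.
    """
    mu = np.median(z, axis=0)                # robust starting point
    for _ in range(n_iter):
        # Gradient step with step size 1/(2n): residuals are clipped at +/- alpha.
        mu = mu + np.mean(np.clip(z - mu, -alpha, alpha), axis=0)
    return mu

def robust_covariance(X, alpha):
    """Entrywise Huber estimate of E[X X^T] for an n x d matrix of mean-zero rows."""
    products = X[:, :, None] * X[:, None, :]     # shape (n, d, d)
    return huber_mean(products, alpha)
```

When α is very large, no residual is clipped and the estimate reduces to the sample second-moment matrix; a moderate α limits the influence of any single observation on each entry, which is the property the entrywise concentration bound relies on.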
Robust covariance estimation via factor models
In this subsection, we will apply Theorem 3 to robust large covariance matrix estimation for approximate factor models in econometrics. With this theorem, we are able to extend the data distribution in factor analysis beyond exponentially decayed distributions considered by Fan et al. (2013), to include heavy-tailed distributions.
Suppose the observation y, say, the excess return at day t for stock i, admits a decomposition
where is the unknown but fixed loading vector, denotes the unobserved factor vector at time t, and u’s represent the idiosyncratic noises. Let and so that , where . Suppose that f and u are uncorrelated and centered random vectors, with bounded fourth moments, i.e., the fourth moments of all entries of f and u are bounded by some constant. We assume are independent for t, although it is possible to allow for weak temporal dependence as in Fan et al. (2013). From (15), we can decompose into a low rank component and a residual component:
where . To circumvent the identifiability issue common in latent variable models, here we also assume, without loss of generality, and that B is a diagonal matrix, since rotating B will not affect the above decomposition (16).
We will need two major assumptions for our analysis: (1) the factors are pervasive in the sense of Definition 2, and (2) there is a constant C > 0 such that , which are standard assumptions in the factor model literature. The pervasive assumption is reasonable in financial applications, since the factors have impacts on a large fraction of the outcomes (Chamberlain et al., 1983; Bai, 2003). If the factor loadings are regarded as random realizations from a bounded random vector, the assumption holds (Fan et al., 2013).
Definition 2
In the factor model > 0 such that and the eigenvalues of the r by r matrix B.
Let be the top r eigenvalues and eigenvectors of , and similarly, for BBᵀ. In the following proposition, we show that pervasiveness is naturally connected to the incoherence structure. This connects the econometrics and machine learning literature and provides a good interpretation of the concept of incoherence. Its proof can be found in the appendix.
Proposition 3
Suppose there exists a constant C > 0 such that . The factors for is bounded by some constant, and for so that .
Our goal is to obtain a good covariance matrix estimator by exploiting the structure (16). Our strategy is to use a generalization of the principal orthogonal complement thresholding (POET) method proposed in Fan et al. (2013). The generic POET procedure encompasses three steps:
(1) Given three pilot estimators respectively for true covariance , leading eigenvalues and leading eigenvectors , compute the principal orthogonal complement :
(2) Apply the correlation thresholding to to obtain thresholded estimate defined as follows:
where is the generalized shrinkage function (Antoniadis and Fan, 2001; Rothman et al., 2009) and is an entry-dependent threshold. will be determined later in Theorem 6. This step exploits the sparsity of .
(3) Construct the final estimator .
The key feature in the above procedure lies in the flexibility of choosing the pilot estimators in the first step. We will choose according to the data generating distribution. Typically we can use for i ≤ r as the eigenvalues/vectors of . However, and in general do not have to come from the spectral information of and can be obtained separately via different methods.
To guide the selection of proper pilot estimators, Fan et al. (2017+) provided a high level sufficient condition for this simple procedure to be effective, and its performance is gauged, in part, through the sparsity level of , defined as . When q = 0, m corresponds to the maximum number of nonzero elements in each row of . For completeness, we present the theorem given by Fan et al. (2017+) in the following.
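The three steps of the generic POET procedure can be sketched in code. The version below is a simplified illustration under stated assumptions: soft-thresholding stands in for the generalized shrinkage function, the entry-dependent threshold is taken to be a constant times the square root of the product of the corresponding diagonal entries (a correlation-scale threshold), and the diagonal is left untouched. The function names and the `threshold` parameter are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding, one example of a generalized shrinkage function."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def generic_poet(Sigma_hat, lam_hat, V_hat, threshold):
    """Three-step POET sketch from pilot estimators (Sigma_hat, lam_hat, V_hat).

    lam_hat has shape (r,), V_hat has shape (d, r).
    """
    # Step 1: principal orthogonal complement.
    low_rank = (V_hat * lam_hat) @ V_hat.T        # sum_i lam_i v_i v_i^T
    Sigma_u = Sigma_hat - low_rank

    # Step 2: threshold off-diagonal entries on the correlation scale
    # (assumed form: tau_ij = threshold * sqrt(s_ii * s_jj)).
    diag = np.clip(np.diag(Sigma_u), 0.0, None)
    tau = threshold * np.sqrt(np.outer(diag, diag))
    Sigma_u_T = soft_threshold(Sigma_u, tau)
    np.fill_diagonal(Sigma_u_T, np.diag(Sigma_u))  # keep the diagonal as is

    # Step 3: recombine the low-rank and thresholded sparse parts.
    return low_rank + Sigma_u_T
```

Because the pilot estimators are passed in as arguments, the same routine works whether they come from the sample covariance or from a robust estimate, which mirrors the flexibility emphasized above.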
Theorem 6
Let . Suppose there exists C > 0 such that and we have pilot estimators satisfying
Under the pervasiveness condition of the factor model , if , the following rates of convergence hold with the generic POET procedure:
and
where is the relative Frobenius norm.
We remark that the additional term in , is due to the estimation of unobservable factors and is negligible when the dimension d is high. The optimality of the above rates of convergence is discussed in detail in Fan et al. (2017+). Theorem 6 reveals a profound deterministic connection between the estimation error bound of the pilot estimators and the rates of convergence of the POET output estimators. Notice that the eigenvector estimation error is under the ℓ∞ norm, for which our perturbation bounds will prove to be useful.
In this subsection, since we assume only bounded fourth moments, we choose to be the robust estimate of the covariance matrix defined in (14). We now invoke our bounds to show that the spectrum properties (eigenvalues and eigenvectors) are stable to perturbation.
Let us decompose into a form such that Theorem 3 can be invoked:
where is viewed as , the low-rank part , which is also BBᵀ, is viewed as A, and the remaining terms are treated as E. The following results follow immediately.
Proposition 4
Assume that there is a constant C > 0 such that
. If the factors are pervasive, then with probability greater than
, we have
as the leading eigenvalues/vectors of
for i ≤ r. In addition, .
The inequality (19) follows directly from Proposition 1 under the assumption of bounded fourth moments. It is also easily verifiable that (20), (21) follow from (19) by Weyl’s inequality and Theorem 3 (noting that ). See Section 3.2 for more details.
Note that in the case of sub-Gaussian variables, the sample covariance matrix and its leading eigenvalues/vectors will also serve the same purpose due to (12) and Theorem 3 as discussed in Section 3.1.
We have seen that the perturbation bounds are useful in robust covariance estimation, and particularly, they resolve a theoretical difficulty in the generic POET procedure for factor model based covariance matrix estimation.
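The three-step generic POET procedure described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`generic_poet`, `pilot_from_spectrum`, `soft_threshold`) are hypothetical, soft thresholding stands in for the generalized shrinkage function, and a single scalar threshold `tau` replaces the entry-dependent threshold of the text. Leaving the diagonal unthresholded is a further assumption of the sketch.

```python
import numpy as np

def soft_threshold(x, tau):
    # One member of the generalized shrinkage family (assumed choice).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def pilot_from_spectrum(sigma_hat, r):
    # Pilot eigenvalues/eigenvectors taken from the pilot covariance itself.
    evals, evecs = np.linalg.eigh(sigma_hat)
    idx = np.argsort(evals)[::-1][:r]
    return evals[idx], evecs[:, idx]

def generic_poet(sigma_hat, lam_hat, v_hat, tau):
    # Step 1 is the choice of the pilot trio (sigma_hat, lam_hat, v_hat).
    # Step 2: remove the low-rank part and threshold the residual.
    low_rank = (v_hat * lam_hat) @ v_hat.T
    resid = sigma_hat - low_rank
    resid_thr = soft_threshold(resid, tau)
    np.fill_diagonal(resid_thr, np.diag(resid))  # keep variances (assumption)
    # Step 3: add the low-rank part back.
    return low_rank + resid_thr
```

Because the pilot trio is passed in explicitly, the eigenvalues and eigenvectors need not come from the spectrum of `sigma_hat`; `pilot_from_spectrum` is only the typical choice.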
Simulations
Simulation: the perturbation result
In this subsection, we implement numerical simulations to verify the perturbation bound in Theorem 3. We will show that the error behaves in the same way as indicated by our theoretical bound.
In the experiments, we let the matrix size d run from 200 to 2000 by an increment of 200. We fix the rank of A to be 3 (r = 3). To generate an incoherent low-rank matrix, we sample a d × d random matrix with iid standard normal variables, perform singular value decomposition, and extract the first r right singular vectors . Let and where γ, as before, represents the eigengap. Then, we set A = VDVᵀ. By orthogonal invariance, v is uniformly distributed on the unit sphere . It is not hard to see that with probability , the coherence of .
We consider two types of sparse perturbation matrices E: (a) construct a d × d matrix E0 by randomly selecting s entries for each row, and sampling a uniform number in [0, L] for each entry, and then symmetrize the perturbation matrix by setting ; (b) pick , and let . Note that in (b) we have , and thus we can choose suitable L′ and ρ to control the norm of E. This covariance structure is common in cases where correlations between random variables depend on their “distance” |i − j|, which usually arises from autoregressive models.
The perturbation of eigenvectors is measured by the element-wise error:
where are the eigenvectors of in the descending order.
To investigate how the error depends on γ and d, we generate E according to mechanism (a) with s = 10, L = 3, and run simulations in different parameter configurations: (1) let the matrix size d range from 200 to 2000, and choose the eigengap γ in {10, 50, 100, 500} (Figure 1); (2) fix the product to be one of {2000, 3000, 4000, 5000}, and let the matrix size d run from 200 to 2000 (Figure 2).
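The simulation design above can be sketched as follows. Several details are assumptions of this sketch, since the corresponding formulas are not legible in the text: the diagonal matrix is taken as D = diag(rγ, (r−1)γ, …, γ), the symmetrization is (E0 + E0ᵀ)/2, and the element-wise error is the largest entry of |ṽ_i − v_i| over i ≤ r after aligning signs. All function names are hypothetical.

```python
import numpy as np

def incoherent_low_rank(d, r, gamma, rng):
    # Right singular vectors of a Gaussian matrix give an incoherent basis.
    G = rng.standard_normal((d, d))
    _, _, Vt = np.linalg.svd(G)
    V = Vt[:r].T
    D = np.diag(gamma * np.arange(r, 0, -1))  # assumed eigenvalue layout
    return V @ D @ V.T, V

def sparse_perturbation(d, s, L, rng):
    # Mechanism (a): s uniform entries in [0, L] per row, then symmetrize.
    E0 = np.zeros((d, d))
    for i in range(d):
        cols = rng.choice(d, size=s, replace=False)
        E0[i, cols] = rng.uniform(0.0, L, size=s)
    return (E0 + E0.T) / 2.0  # assumed symmetrization

def entrywise_error(A, E, V, r):
    # Largest entrywise deviation of the top-r eigenvectors, up to sign.
    evals, evecs = np.linalg.eigh(A + E)
    V_tilde = evecs[:, np.argsort(evals)[::-1][:r]]
    signs = np.sign(np.sum(V_tilde * V, axis=0))
    signs[signs == 0] = 1.0
    return np.max(np.abs(V_tilde * signs - V))
```

Repeating `entrywise_error` over a grid of d and γ, and taking the largest value over independent draws, would mimic the quantity reported in the figures.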
Figure 1:
The left plot shows the perturbation error of eigenvectors against matrix size d ranging from 200 to 2000, with different eigengap γ. The right plot shows log(err) against log(d). The slope is around −0.5. Blue lines represent γ = 10; red lines γ = 50; green lines γ = 100; and black lines γ = 500. We report the largest error over 100 runs.
Figure 2:
The left plot shows the perturbation error of eigenvectors against matrix size d ranging from 200 to 2000, when γ√d is kept fixed, with different values. The right plot shows the error multiplied by γ√d against d. Blue lines represent γ√d = 2000; red lines γ√d = 3000; green lines γ√d = 4000; and black lines γ√d = 5000. We report the largest error over 100 runs.
To find how the errors behave for E generated from different methods, we run simulations as in (1) but generate E differently. We construct E through mechanism (a) with L = 10, s = 3 and L = 0.6, s = 50, and also through mechanism (b) with L′ = 1.5, ρ = 0.9 and L′ = 7.5, ρ = 0.5 (Figure 3). The parameters are chosen such that is about 30.
Figure 3:
These plots show log(err) against log(d), with matrix size d ranging from 200 to 2000 and different eigengap γ. The perturbation E is generated in different ways. Top left: L = 10, s = 3; top right: L = 0.6, s = 50; bottom left: L′ = 1.5, ρ = 0.9; bottom right: L′ = 7.5, ρ = 0.5. The slopes are around −0.5. Blue lines represent γ = 10; red lines γ = 50; green lines γ = 100; and black lines γ = 500. We report the largest error over 100 runs.
In Figures 1–3, we report the largest error based on 100 runs. Figure 1 shows that the error decreases as d increases (the left plot); moreover, the logarithm of the error is linear in log(d), with a slope of about −0.5, that is, err ∝ 1/√d (the right plot). We can take the eigengap γ into consideration and characterize the relationship in a more refined way. In Figure 2, it is clear that err almost falls on the same horizontal line for different configurations of d and γ, with γ√d fixed. The right panel clearly indicates that err × γ√d is a constant, and therefore err ∝ 1/(γ√d). In Figure 3, we find that the errors behave almost the same regardless of how E is generated. These simulation results provide strong evidence supporting the perturbation bound in Theorem 3.
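The slope reading in the log-log plots amounts to a least-squares line fit of log(err) on log(d). The sketch below shows that fit on synthetic errors that follow c/(γ√d) exactly; the constant c = 3 and the helper name `loglog_slope` are hypothetical, and no simulated or reported numbers from the paper are reproduced.

```python
import numpy as np

def loglog_slope(d_values, errors):
    # Fit log(err) = slope * log(d) + intercept by least squares.
    slope, intercept = np.polyfit(np.log(d_values), np.log(errors), deg=1)
    return slope, intercept

d_values = np.arange(200, 2001, 200)
gamma = 50.0
# Synthetic errors obeying err = c / (gamma * sqrt(d)) with c = 3 (assumed).
synthetic_err = 3.0 / (gamma * np.sqrt(d_values))
slope, intercept = loglog_slope(d_values, synthetic_err)
rescaled = synthetic_err * gamma * np.sqrt(d_values)  # constant if the law holds
```

On real simulation output, a fitted slope near −0.5 and a nearly flat `rescaled` curve would be the two signatures described in the text.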
Simulation: robust covariance estimation
We consider the performance of the generic POET procedure in robust covariance estimation in this subsection. Note that the procedure is flexible in employing any pilot estimators satisfying the conditions (19) – (21) respectively.
We implemented the robust procedure with four different initial trios: (1) the sample covariance with its leading r eigenvalues and eigenvectors as and ; (2) Huber’s robust estimator given in (14) and its top r eigen-structure estimators and ; (3) the marginal Kendall’s tau estimator with its corresponding and ; (4) lastly, we use the spatial Kendall’s tau estimator to estimate the leading eigenvectors instead of the marginal Kendall’s tau, so in (3) is replaced with . We need to briefly review the two types of Kendall’s tau estimators here, and specifically give the formula for and .
Kendall’s tau correlation coefficient, for estimating pairwise comovement correlation, is defined as
Its population expectation is related to the Pearson correlation via the transform for elliptical distributions (which are far too restrictive for high-dimensional applications). Then is a valid estimate of the Pearson correlation r. Letting and containing the robustly estimated standard deviations, we define the marginal Kendall’s tau estimator as
In the above construction of , we still use the robust variance estimates from .
The spatial Kendall’s tau estimator is a second-order U-statistic, defined as
Then is constructed by the top r eigenvectors of . It has been shown by Fan et al. (2017+) that under elliptical distribution, and its top r eigenvalues satisfy (19) and (20) while suffices to conclude (21). Hence Method (4) indeed provides good initial estimators if data are from an elliptical distribution. However, since attains (19) for elliptical distributions, by an argument similar to the one used to derive Proposition 4 from our perturbation bound, consisting of the leading eigenvectors of is also valid for the generic POET procedure. For more details about the two types of Kendall’s tau, we refer the readers to Fang et al. (1990); Choi and Marden (1998); Han and Liu (2014); Fan et al. (2017+) and references therein.
In summary, Method (1) is designed for the case of sub-Gaussian data; Methods (3) and (4) work under the situation of elliptical distribution; while Method (2) is proposed in this paper for the general heavy-tailed case with bounded fourth moments without further distributional shape constraints.
We simulated n samples of from two settings: (a) a multivariate t-distribution with covariance matrix and various degrees of freedom (ν = 3 for very heavy tail, ν = 5 for medium heavy tail and ν = ∞ for Gaussian tail), which is one example of the elliptical distribution (Fang et al., 1990); (b) an element-wise iid one-dimensional t distribution with the same covariance matrix and degrees of freedom ν = 3, 5 and ∞, which is a non-elliptical heavy-tailed distribution.
Each row of coefficient matrix B is independently sampled from a standard normal distribution, so that with high probability, the pervasiveness condition holds with . The data are then generated by and the true population covariance matrix is .
For d running from 200 to 900 and n = d/2, we calculated errors of the four robust estimators in different norms. The tuning for α in minimization (14) is discussed more thoroughly in Fan et al. (2017b). For the thresholding parameter, we used .
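The two Kendall's tau constructions reviewed above can be sketched in NumPy. The sketch assumes the standard definitions: the pairwise Kendall's tau as an average of sign concordances over sample pairs, the sine transform sin(πτ/2) to recover a correlation under ellipticity, and the spatial Kendall's tau as the average of outer products of normalized pairwise differences. The rescaling by robustly estimated standard deviations mentioned in the text is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def kendall_tau_matrix(X):
    # Pairwise Kendall's tau: average sign concordance over all sample pairs.
    n, d = X.shape
    tau = np.zeros((d, d))
    for i in range(n - 1):
        diff_sign = np.sign(X[i + 1:] - X[i])
        tau += diff_sign.T @ diff_sign
    return tau * 2.0 / (n * (n - 1))

def marginal_kendall_correlation(X):
    # Sine transform of Kendall's tau (valid for elliptical distributions).
    return np.sin(np.pi / 2.0 * kendall_tau_matrix(X))

def spatial_kendall_matrix(X):
    # Second-order U-statistic of normalized pairwise differences.
    n, d = X.shape
    K = np.zeros((d, d))
    for i in range(n - 1):
        diff = X[i + 1:] - X[i]
        norms2 = np.sum(diff ** 2, axis=1)
        keep = norms2 > 0  # skip exact ties
        unit = diff[keep] / np.sqrt(norms2[keep])[:, None]
        K += unit.T @ unit
    return K * 2.0 / (n * (n - 1))
```

The leading eigenvectors of `spatial_kendall_matrix(X)` could then serve as the pilot eigenvectors in the spirit of Method (4), while `marginal_kendall_correlation(X)` corresponds to the correlation part of Method (3).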
The estimation errors are gauged in the following norms: and as shown in Theorem 6. The two different settings are separately plotted in Figures 4 and 5. The estimation errors of applying sample covariance matrix in Method (1) are used as the baseline for comparison. For example, if relative Frobenius norm is used to measure performance, will be depicted for k = 2,3,4, where are generic POET estimators based on Method (k). Therefore if the ratio curve moves below 1, the method is better than naive sample estimator (Fan et al., 2013) and vice versa. The more it gets below 1, the more robust the procedure is against heavy-tailed randomness.
Figure 4:
Error ratios of robust estimates against varying dimension. Blue lines represent errors of Method (2) over Method (1) under different norms; black lines errors of Method (3) over Method (1); red lines errors of Method (4) over Method (1). The data are generated by a multivariate t-distribution with df = 3 (solid), 5 (dashed) and ∞ (dotted). The median errors and their IQRs (interquartile ranges) over 100 simulations are reported.
Figure 5:
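Error ratios of robust estimates against varying dimension. Blue lines represent errors of Method (2) over Method (1) under different norms; black lines errors of Method (3) over Method (1); red lines errors of Method (4) over Method (1). The data are generated by an element-wise iid t-distribution with df = 3 (solid), 5 (dashed) and ∞ (dotted). The median errors and their IQRs (interquartile ranges) over 100 simulations are reported.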
The first setting (Figure 4) represents a heavy-tailed elliptical distribution, where we expect Methods (2), (3), (4) all to outperform the POET estimator based on the sample covariance, i.e. Method (1), especially in the presence of extremely heavy tails (solid lines for ν = 3). As expected, all three curves under various measures show error ratios visibly smaller than 1. On the other hand, if data are indeed Gaussian (dotted line for ν = ∞), Method (1) has better behavior under most measures (error ratios are greater than 1). Nevertheless, our robust Method (2) still performs comparably well with Method (1), whereas the median error ratios for the two Kendall’s tau methods are much worse. In addition, the IQR (interquartile range) plots reveal that Method (2) is indeed more stable than the two Kendall’s tau Methods (3) and (4). It is also noteworthy that Method (4), which leverages the advantage of spatial Kendall’s tau, performs more robustly than Method (3), which solely bases its estimation of the eigen-structure on marginal Kendall’s tau.
The second setting (Figure 5) provides an example of non-elliptically distributed data. We can see that the performance of the general robust Method (2) dominates the other three methods, which verifies the benefit of robust estimation for a general heavy-tailed distribution. Note that Kendall’s tau methods do not apply to distributions outside the elliptical family, excluding even the element-wise iid t distribution in this setting. Nonetheless, even in the first setting where the data are indeed elliptical, with proper tuning, the proposed robust method can still outperform Kendall’s tau by a clear margin.
Proof organization of main theorems
Symmetric Case
For shorthand, we write , and . An obvious bound for κ is (by Cauchy-Schwarz inequality). We will use these notations throughout this subsection.
Recall the spectral decomposition of A in (8). Expressing E in terms of the column vectors of V and V┴, which form an orthogonal basis in , we write
Note that since E is symmetric. Conceptually, the perturbation results in a rotation of , and we write a candidate orthogonal basis as follows:
where is to be determined. It is straightforward to check that is an orthogonal matrix. We will choose Q in a way such that is a block diagonal matrix, i.e., . Substituting (28) and simplifying the equation, we obtain
The approach of studying perturbation through a quadratic equation is known. See Stewart (1990) for example. Yet, to the best of our knowledge, existing results study perturbation under orthogonal-invariant norms (or unitary-invariant norms in the complex case), which include a family of matrix operator norms and the Frobenius norm, but exclude the matrix max-norm. The advantages of orthogonal-invariant norms are pronounced: such norms of a symmetric matrix only depend on its eigenvalues regardless of eigenvectors; moreover, with suitable normalization they are consistent in the sense . See Stewart (1990) for a clear exposition.
The max-norm, however, does not possess these important properties. An immediate issue is that it is not clear how to relate Q to , which will appear in (29) after expanding E according to (27), and which we want to control. Our approach here is to study directly through a transformed quadratic equation, obtained by left-multiplying (29) by V┴. Denote . If we can find an appropriate matrix , and it satisfies the quadratic equation
then Q also satisfies the quadratic equation (29). This is because multiplying both sides of (30) by yields (29), and thus any solution to (30) with the form must result in a solution Q to (29).
Once we have such (or equivalently Q), then is a block diagonal matrix, and the span of column vectors of is a candidate for the span of the first r eigenvectors, namely . We will verify the two spaces are identical in Lemma 7. Before stating that lemma, we first provide bounds on and .
Lemma 5
Suppose
. Then, there exists a matrix
such that
is a solution to the quadratic equation
satisfies
. Moreover, if
, the matrix
defined in
Here, ω is defined as
.
The second claim of the lemma (i.e., the bound (31)) is relatively easy to prove once the first claim (i.e., the bound on ) is proved. To understand this, note that we can rewrite as , and can be controlled by a trivial inequality . To prove the first claim, we construct a sequence of matrices through recursion that converges to the fixed point , which is a solution to the quadratic equation (30). For all iterates of matrices, we prove a uniform max-norm bound, which leads to a max-norm bound on by continuity. To be specific, we initialize , and given , we solve a linear equation:
and the solution is defined as . Under some conditions, the iterate converges to a limit , which is a solution to (30). The next general lemma captures this idea. It follows from Stewart (1990) with minor adaptations.
Lemma 6
Let T be a bounded linear operator on a Banach space
. Assume that T has a bounded inverse, and define β = ||T⁻¹||⁻¹. Let ϕ
be a map that satisfies
for some
. Suppose that
is a closed subspace of
such that
and
. Suppose
that satisfies
. Then, the sequence initialized with x0 = 0 and iterated through
converges to a solution x
. Moreover, we have
, and
.
To apply this lemma to the equation (30), we view as the space of matrices with the max-norm , and as the subspace of matrices of the form where . The linear operator T is set to be , and the map is set to be the quadratic function . Roughly speaking, under the assumption of Lemma 6, the nonlinear effect caused by is weak compared with the linear operator T. Therefore, it is crucial to show T is invertible, i.e. to give a good lower bound on . Since the norm is not orthogonal-invariant, a subtle issue arises when A is not of exact low rank, which will be discussed at the end of the subsection.
If there is no perturbation (i.e., E = 0), all the iterates are simply 0, so is identical to V. If the perturbation is not too large, the next lemma shows that the column vectors of span the same space as . In other words, with a suitable orthogonal matrix R, the columns of are .
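The fixed-point iteration behind Lemma 6 can be illustrated on a small finite-dimensional example. The sketch assumes the sign convention of Stewart's classical result, namely the equation T x = y + ϕ(x) with iterates x_{k+1} = T⁻¹(y + ϕ(x_k)), the smallness condition 4η‖y‖ < β², and the conclusion ‖x⋆‖ ≤ 2‖y‖/β; the paper's own statement may differ in such details. The quadratic map ϕ(x) = η·x⊙x (entrywise square), the ℓ∞ norm, and the function name are illustrative choices, not the matrix equation (30).

```python
import numpy as np

def fixed_point_solve(T, y, eta, n_iter=200, tol=1e-14):
    # Solve T x = y + eta * x**2 by iterating x <- T^{-1}(y + eta * x**2).
    T_inv = np.linalg.inv(T)
    # beta = 1 / ||T^{-1}||, using the operator norm induced by the max-norm.
    beta = 1.0 / np.linalg.norm(T_inv, ord=np.inf)
    if 4.0 * eta * np.max(np.abs(y)) >= beta ** 2:
        raise ValueError("smallness condition 4*eta*||y|| < beta^2 fails")
    x = np.zeros_like(y)
    for _ in range(n_iter):
        x_new = T_inv @ (y + eta * x ** 2)
        converged = np.max(np.abs(x_new - x)) < tol
        x = x_new
        if converged:
            break
    return x, beta

T = np.array([[5.0, 1.0, 0.0],
              [1.0, 5.0, 1.0],
              [0.0, 1.0, 5.0]])
y = np.array([1.0, -0.5, 0.25])
eta = 1.0
x_star, beta = fixed_point_solve(T, y, eta)
residual = np.max(np.abs(T @ x_star - y - eta * x_star ** 2))
```

The entrywise square satisfies both inequalities required of ϕ in the max-norm with constant η, so the example falls within the setting of the lemma as assumed here.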
Lemma 7
Suppose
. Then, there exists an orthogonal matrix
such that the column vectors of
Proof of Theorem 2
It is easy to check that under the assumption of Theorem 2, the conditions required in Lemma 5 and Lemma 7 are satisfied. Hence, the two lemmas imply Theorem 2. ■
To study the perturbation of individual eigenvectors, we assume, in addition to the condition on , that satisfy a uniform gap (namely ). This additional assumption is necessary, because otherwise, the perturbation may lead to a change in the relative order of eigenvalues, and we may be unable to match eigenvectors from the order of eigenvalues. Suppose is an orthogonal matrix such that are eigenvectors of . Now, under the assumption of Theorem 2, the column vectors of and are identical up to sign, so we can rewrite the difference as
We already provided a bound on in Lemma 5. By the triangle inequality, we can derive a bound on . If we can prove a bound on , it will finally lead to a bound on . In order to do so, we use the Davis-Kahan theorem to obtain an ℓ2 bound on for all . This will lead to a max-norm bound on R − I (with the price of potentially increasing the bound by a factor of r). The details about the proof of Theorem 3 are in the appendix.
We remark that the conditions on in Theorem 2 and Theorem 3 are only useful in cases where . Ideally, we would like to have results with assumptions only involving and like in the Davis-Kahan theorem. Unfortunately, unlike orthogonal-invariant norms that only depend on the eigenvalues of a matrix, the max-norm is not orthogonal-invariant, and thus it also depends on the eigenvectors of a matrix. For this reason, it is not clear whether we could obtain a lower bound on using only the eigenvalues and so that Lemma 6 could be applied. The analysis appears to be difficult if we do not have a bound on , considering that even in the analysis of linear equations, we need invertibility and a well-controlled condition number.
Asymmetric case
Let Aᵈ and Eᵈ be (d1 + d2) × (d1 + d2) square matrices defined as
Also denote . This augmentation of an asymmetric matrix into a symmetric one is called Hermitian dilation. Here the superscript d means the Hermitian dilation. We also use this notation to denote quantities corresponding to Aᵈ and .
An important observation is that
From this identity, we know that Aᵈ has nonzero eigenvalues , where , and its corresponding eigenvectors are . For a given r, we stack these (normalized) eigenvectors with indices into a matrix :
Through the augmented matrices, we can transfer eigenvector results for symmetric matrices to singular vectors of asymmetric matrices. However, we cannot directly invoke the results proved for symmetric matrices, due to an issue about the coherence of Vᵈ: when d1 and d2 are not comparable, the coherence μ(Vᵈ) can be very large even when μ(V) and μ(U) are bounded. To understand this, consider the case where and all entries of U are and all entries of V are . Then, the coherences μ(U) and μ(V) are O(1), but .
This unpleasant issue about the coherence, nevertheless, can be tackled if we consider a different matrix norm. In order to deal with the different scales of d1 and d2, we define the weighted max-norm for any matrix M with d1 + d2 rows as follows:
In other words, we rescale the top d1 rows of M by a factor of and rescale the bottom d2 rows by . This weighted norm serves to balance the potentially different scales of d1 and d2.
The proofs of theorems in Section 2.2 will be almost the same as those in the symmetric case, with the major difference being the new matrix norm. Because the derivation is slightly repetitive, we will provide concise proofs in the appendix. Similar to the decomposition in (2.1),
where has rank 2r. Equivalently,
Analogously, we define notations in (28)-(30) and use d in the superscript to signify that they are augmented through Hermitian dilation. It is worthwhile to note that , and that . Recall . In the proof, we will also use , which is a quantity similar to κ.
The next key lemma, which is parallel to Lemma 5, provides a bound on the solution to the quadratic equation
Lemma 8
Suppose
. Then, there exists a matrix
such that
is a solution to the quadratic equation
satisfies
Moreover, if
, the matrix
defined in
Here,
is defined as
.In this lemma, the bound (38) bears a similar form to (31): if we consider the max-norm, the first d1 rows of correspond to the left singular vectors u’s, and they scale with ; and the last d2 rows correspond to the right singular vectors v’s, which scale with . Clearly, the weighted max-norm indeed helps to balance the two dimensions. The rest of the proofs can be found in the appendix.
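The Hermitian dilation used in this subsection, and the coherence issue it creates, can be checked numerically. The sketch assumes the standard facts that the dilation [[0, A], [Aᵀ, 0]] of a d1 × d2 matrix A has nonzero eigenvalues ±σ_i with eigenvectors (u_i; ±v_i)/√2, and it takes the coherence of a d × r matrix with orthonormal columns to be (d/r) times the largest squared row norm, which is an assumption about definition (3). Function names are hypothetical.

```python
import numpy as np

def hermitian_dilation(A):
    # Symmetric (d1 + d2) x (d1 + d2) augmentation of a d1 x d2 matrix.
    d1, d2 = A.shape
    return np.block([[np.zeros((d1, d1)), A],
                     [A.T, np.zeros((d2, d2))]])

def coherence(V):
    # Assumed definition: (d / r) * max_i ||i-th row of V||^2.
    d, r = V.shape
    return d / r * np.max(np.sum(V ** 2, axis=1))

rng = np.random.default_rng(0)
d1, d2 = 5, 8
A = rng.standard_normal((d1, d2))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_dil = hermitian_dilation(A)
eigvals = np.sort(np.linalg.eigvalsh(A_dil))[::-1]
W_plus = np.vstack([U, Vt.T]) / np.sqrt(2.0)  # eigenvectors for +sigma_i

# Rank-one example with flat singular vectors and very unequal dimensions.
d1c, d2c = 10, 1000
u = np.ones((d1c, 1)) / np.sqrt(d1c)
v = np.ones((d2c, 1)) / np.sqrt(d2c)
V_dil = np.block([[u, u], [v, -v]]) / np.sqrt(2.0)
mu_u, mu_v, mu_dil = coherence(u), coherence(v), coherence(V_dil)
```

In the rank-one example the individual coherences equal 1 while the coherence of the stacked matrix is (d1 + d2)/(2·min(d1, d2)), which grows with the imbalance between d1 and d2 and motivates a row-weighted norm.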
Proofs for Section 2.1
Denote the column span of a matrix M by span(M). Suppose two matrices M1, M2 have orthonormal column vectors. It is known that (Stewart, 1990)
where are the canonical angles between span(M1) and span(M2). Recall the notations defined in (27), and also recall The first lemma bounds .
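A standard identity of this kind, which we assume is the one intended here, states that for two d × r matrices with orthonormal columns the spectral norm of the difference of the projectors M1M1ᵀ − M2M2ᵀ equals the sine of the largest canonical angle between the two column spans. The sketch below checks it numerically, computing the cosines of the canonical angles as the singular values of M1ᵀM2; the function names are hypothetical.

```python
import numpy as np

def sin_theta_spectral(M1, M2):
    # Cosines of the canonical angles are the singular values of M1^T M2.
    cosines = np.linalg.svd(M1.T @ M2, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    # The largest angle corresponds to the smallest cosine.
    return np.sqrt(1.0 - cosines.min() ** 2)

def projector_distance(M1, M2):
    # Spectral norm of the difference of the two orthogonal projectors.
    return np.linalg.norm(M1 @ M1.T - M2 @ M2.T, ord=2)

rng = np.random.default_rng(0)
d, r = 20, 3
M1, _ = np.linalg.qr(rng.standard_normal((d, r)))
M2, _ = np.linalg.qr(rng.standard_normal((d, r)))
lhs = sin_theta_spectral(M1, M2)
rhs = projector_distance(M1, M2)
```

Both quantities are invariant to the choice of orthonormal basis within each subspace, which is why they are natural for subspace (rather than individual eigenvector) perturbation.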
Lemma 9
We have the following bound on
Proof
Using the definition in (27), we can write . Since the columns of V and form an orthogonal basis in , clearly
By Cauchy-Schwarz inequality and the definition of μ, for any ,
Using the identity (40) and the above inequality, we derive
which completes the proof. ■
Lemma 10
If
, then
is an invertible matrix. Furthermore,
where Q0
is a d × r matrix.
Proof
Let Q0 be any d × r matrix with . Note
We will derive upper bounds on Q0E11 and and a lower bound on Q0Λ1. Since E11 = VᵀEV by definition, we expand Q0E11 and use a trivial inequality to derive
By Cauchy-Schwarz inequality and the definition of μ in (3), for
Substituting into (42), we obtain an upper bound:
To bound , we use the identity (40) and write
Using two trivial inequalities , we have
In the proof of Lemma 9, we showed . Thus,
Moreover, . Combining the two bounds,
It is straightforward to obtain a lower bound on : since there is an entry of Q0, say (Q0), that has an absolute value of 1, we have
To show is invertible, we use (42) and (45) to obtain
When
must have full rank, because otherwise we can choose an appropriate Q0 in the null space of so that , which is a contradiction. To prove the second claim of the lemma, we combine the lower bound (45) with upper bounds (43) and (44) to derive
which is exactly the desired inequality. ■
Next we prove Lemma 6. This lemma follows from Stewart (1990), with minor changes that involve . We provide a proof for the sake of completeness.
Proof of Lemma 6
Let us write for shorthand and recall . As the first step, we show that the sequence is bounded. By construction in (34), we bound using :
We use this inequality to derive an upper bound on {x} for all k. We define and
then clearly (which can be shown by induction). It is easy to check (by induction) that the sequence is increasing. Moreover, since the quadratic function
has two fixed points (namely solutions to ) and the smaller one satisfies
If , then . Thus, by induction, all are bounded by . This implies . The next step is to show that the sequence converges. Using the recursive definition (34) again, we derive
Since , the sequence is a Cauchy sequence, and convergence is secured. Let be the limit. It is clear by assumption that implies
and by continuity.
The final step is to show x⋆ is a solution to . Because is bounded and ϕ satisfies (33), the sequence converges to by continuity and compactness. The linear operator T is also continuous, so we can take limits on both sides of ; we conclude that x⋆ is a solution to . ■
With all the preparations, we are now ready to present the key lemma. As discussed in Section 5, we set
which is a subspace Consider the matrix max-norm in .
Lemma 11
Suppose
. Then there exists a solution
to the equation
We will invoke Lemma 6 and apply it to the quadratic equation (30). To do so, we first check the conditions required in Lemma 6.
Let the linear operator be . By Lemma 10, has a bounded inverse, and is bounded from below:
Let us define . To check the inequalities in (33), observe that
where we used Lemma 9. We also observe
Thus, if we set , then the inequalities required in (33) are satisfied. For any , obviously . To show , let and observe that
By definition, we know , so we deduce . Our assumption implies , so by Lemma 10, the matrix is invertible, and thus . The last condition we check is . By Lemma 9 and (46), this is true if
The above inequality holds when . Under this condition, we have, by Lemma 6,
where the second inequality is due to . ■
The next lemma is a consequence of Lemma 11. We define, as in Lemma 5, that
Lemma 12
If
then
By the triangle inequality, the second inequality is immediate from the first one. To prove the first inequality, suppose the spectral decomposition of , where where , and where are orthonormal vectors in . Since has nonnegative eigenvalues, we have . Using these notations, we can rewrite the matrix as
Note that , which implies . It is easy to check that whenever . From this fact, we know . Using Cauchy-Schwarz inequality, we deduce that for any ,
This leads to the desired max-norm bound. ■
Proof of Lemma 5
The first claim of the lemma (the existence of and its max-norm bound) follows directly from Lemma 11. To prove the second claim, we split into two parts:
where we used identity . Note that implies . Thus, we can use Lemma 12 and derive
where we used Cauchy-Schwarz inequality. Using the above inequality and the bound on (namely, the first claim in the lemma),
Simplifying the bound using and a trivial bound μ ≥ 1, we obtain (31). ■
Proof of Lemma 7
Using the identity in (39), it follows from Davis-Kahan sin Θ theorem (Davis and Kahan, 1970) and Weyl’s inequality that
when , where . Since and , the condition implies . Hence, we have . Moreover,
where we used a trivial inequality for any . Under the condition , it is easy to check that . Thus, we obtain . By the triangle inequality,
Since is a block diagonal matrix, span is the same as the subspace spanned by r eigenvectors of . We claim that span . Otherwise, there exists an eigenvector whose associated eigenvalue is distinct from (since ), and thus u is orthogonal to . Therefore,
This implies , which is a contradiction. ■
Proof of Theorem 3
We split into two parts; see (35). In the following, we first obtain a bound on , which then results in a bound on .
Under the assumption of the theorem, rω < 1/2, so
To bound , we rewrite R as . Expand according to (28),
Let us make a few observations: (a) by Cauchy-Schwarz inequality; (b) by Cauchy-Schwarz inequality again; and (c) by Lemma 12. Using these inequalities, we have
Furthermore, by Davis-Kahan sin Θ theorem (Davis and Kahan, 1970) and Weyl’s inequality, for any ,
when (δ is defined in Theorem 3). This leads to the bound (which is a simplified bound). This is because when , the bound is implied by (49); when the bound trivially follows from . We obtain, up to sign, for i ≤ r,
In other words, each diagonal entry of , namely , is bounded by . Since are orthonormal vectors, we have for any i ≠ j, which leads to bounds on off-diagonal entries of . We will combine the two bounds. Note that when ,
and when is trivially bounded by 1 (up to sign), which is trivially bounded by . In either case, we deduce
Using the bounds in (49) and (51) and , we obtain
We use the inequality rω < 1/2 to simplify the above bound:
We are now ready to bound . In (35), we use the bounds (48), (53), (31) to obtain
Using a trivial inequality , the above bound leads to
■
Proofs for Section 2.2
Recall the definitions of μ0, , κ0 and in Section 5.2. Similar to the symmetric case, we will use the following easily verifiable inequalities.
Lemma 13
Parallel to
where
as defined.
Recall . Note and . Thus,
■
Lemma 14
Parallel to
, then
is a non-degenerate matrix. Furthermore, we have the following bound
where
.
Following derivations similar to those in Lemma 10, we have , and for any matrix with ,
This can be checked by expressing as a block matrix and expanding the matrix multiplication. In particular, one can verify that (i) ; (ii) For any matrix M with d1 + d2 rows, ; (iii) ; (iv) . Moreover, . Thus,
which is the desired inequality in the lemma. In addition, is non-degenerate if . ■
Lemma 15
Parallel to
to the system
, then
We again invoke Lemma 6. Let be the space equipped with the weighted max-norm . We also define as a subspace of consisting of matrices of the form where has size (d1 + d2 − 2r) × 2r. Let the linear operator be . First notice from Lemma 14, is a linear operator with bounded inverse, i.e., is bounded from below by
Let be a map given by . Note that . Using the (easily verifiable) inequality
we derive, by the bound on (Lemma 13), that
Moreover, using the inequality (56) and the bound on (Lemma 13),
Thus, we can choose and the condition (33) in Lemma 6 is satisfied. To ensure , it suffices to require (again by Lemma 13),
It is easily checkable that the above inequality holds when . Under this condition, by Lemma 6,
which completes the proof. ■
Proof of Lemma 8
The first claim of the lemma (existence of and its max-norm bound) follows from Lemma 15. To prove the second claim, we split into two parts:
Note (see (54)). It can be checked that the condition implies . Since and similar to Lemma 12, we have
This yields
■
Lemma 16
Suppose
. Then, there exists an orthogonal matrix
(or R
(and ) are the top r right (and left) singular vectors of Ã.
Similar to the proof of Lemma 7, we will prove , which would then imply that and are the same only up to an orthogonal transformation. The same is true for and , and we will leave out its proof.
By Weyl’s inequality for singular values (also known as Mirsky’s theorem (Mirsky, 1960)), for any . By Wedin’s perturbation bounds for singular vectors (Wedin, 1972),
Note that (see (54)) . Under the assumption in the lemma, clearly , and we have . Moreover, by Lemma 8, we have . Note that each column vector of and are -dimensional. Looking at the last d2 dimensions, we have
Under the assumption of the lemma, . Therefore, we deduce , and conclude that there exists an orthogonal matrix such that . ■
Proof of Theorem 4
Lemma 8, together with Lemma 16, implies Theorem 4. ■
Proof of Theorem 5
Similar to the proof of Theorem 3, we first split the difference :
To bound the first term, note that under our assumption, (derived in the proof of Lemma 8), it is easy to check . We rewrite the matrix R as
Notice that and
where we used (58). Following the same derivations as in the proof of Theorem 3, and using the (easily verifiable) fact , we can bound by . Thus, using , under , we have
Finally, in order to bound , we use (60), (61) and (62), and derive
This completes the proof. ■
Proofs for Section 3
Proof of Proposition 3
Note first that, by Weyl’s inequality, . This implies that if and only if for i ≤ r. Furthermore, the eigenvalues of are distinct if and only if .
To prove the equivalence of bounded and bounded coherence, we first prove the necessary condition. Again from Weyl’s inequality, for . If is bounded, must also be bounded, since . Therefore implies is bounded. Namely, the factors are pervasive.
Conversely, if pervasiveness holds, we need to prove that μ(V) is bounded. Let . Obviously and . Without loss of generality, assume ’s are decreasing. So and where . By Theorem 3,
where with the convention . Hence, we have , which implies bounded coherence . ■