
Inference and uncertainty quantification for noisy matrix completion.

Yuxin Chen¹, Jianqing Fan², Cong Ma², Yuling Yan².

Abstract

Noisy matrix completion aims at estimating a low-rank matrix given only partial and corrupted entries. Despite remarkable progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained estimates and how to perform efficient statistical inference on the unknown matrix (e.g., constructing a valid and short confidence interval for an unseen entry). This paper takes a substantial step toward addressing such tasks. We develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly precise nonasymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals/regions for, say, the missing entries and the low-rank factors. Our inferential procedures do not require sample splitting, thus avoiding unnecessary loss of data efficiency. As a byproduct, we obtain a sharp characterization of the estimation accuracy of our debiased estimators in both rate and constant. Our debiased estimators are tractable algorithms that provably achieve full statistical efficiency.


Keywords: confidence intervals; convex relaxation; nonconvex optimization

Year: 2019 PMID: 31666329 PMCID: PMC6859358 DOI: 10.1073/pnas.1910053116

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

Low-rank matrix completion is concerned with recovering a low-rank matrix, when only a small fraction of its entries are revealed (1–3). The importance of this problem cannot be overstated, due to its broad applications in, e.g., recommendation systems, sensor network localization, magnetic resonance imaging, computer vision, large covariance estimation, and latent factor learning to name just a few. Tackling this problem in large-scale applications is computationally challenging, resulting from the intrinsic nonconvexity incurred by the low-rank structure. To further complicate matters, another inevitable challenge stems from the imperfectness of data acquisition mechanisms, wherein the acquired samples are usually contaminated by a certain amount of noise. Fortunately, if the entries of the unknown matrix are sufficiently delocalized and randomly revealed, this problem may not be as hard as it seems. Substantial progress has been made over the past several years in designing computationally tractable algorithms—including both convex and nonconvex approaches— that allow one to fill in unseen entries faithfully from partial noisy samples (4–13). Nevertheless, modern decision making would often require one step further. It not merely anticipates a faithful estimate, but also seeks to quantify the uncertainty or “confidence” of the provided estimate, ideally in a reasonably accurate fashion. For instance, given an estimate returned by the convex approach, how does one use it to identify a short interval that is likely to contain a missing entry? Conducting effective uncertainty quantification for noisy matrix completion is, however, far from straightforward. For the most part, the state-of-the-art matrix completion algorithms require solving highly complex optimization problems, which often do not admit closed-form solutions. Of necessity, it is extremely challenging to pin down the distributions of the estimates returned by these algorithms. 
The lack of distributional characterizations presents a major roadblock to performing valid, yet efficient, statistical inference on the unknown matrix of interest. It is worth noting that a number of recent papers have been dedicated to inference and uncertainty quantification for various high-dimensional problems, including Lasso (14–18), generalized linear models (17, 19), and graphical models (20, 21), among others. Very little work, however, has looked into noisy matrix completion along this direction. While nonasymptotic statistical guarantees for noisy matrix completion have been derived in prior theory, the existing estimation error bounds are supplied only at an order-wise level. Such order-wise error bounds either lose a significant factor relative to the optimal guarantees or come with an unspecified (but often enormous) preconstant. Viewed in this light, a confidence region constructed directly based on such results is bound to be overly conservative, resulting in substantial overcoverage.

A Glimpse of Our Main Contributions

This paper takes a substantial step toward statistically optimal inference and uncertainty quantification for noisy matrix completion. Specifically, we develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting debiased estimators admit nearly accurate nonasymptotic distributional guarantees. Such distributional characterizations in turn allow us to reason about the uncertainty of the obtained estimates vis-à-vis the unknown matrix. For instance, we can construct 1) confidence intervals for each entry—either observed or missing—of the unknown matrix and 2) confidence regions for the low-rank factors of interest (modulo some global ambiguity), both of which are provably optimal. As a byproduct, we characterize the Euclidean estimation errors of the proposed debiased estimators, which match statistical efficiency precisely (including the preconstant). This theory demonstrates that a computationally feasible algorithm can achieve the best possible statistical efficiency (including the preconstant) for noisy matrix completion.

Models and Notation

To cast noisy matrix completion in concrete statistical settings, we adopt a model commonly studied in the literature.

Ground Truth.

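We are interested in estimating an unknown rank-$r$ matrix $M^\star \in \mathbb{R}^{n \times n}$, whose rank-$r$ singular-value decomposition (SVD) is given by $M^\star = U^\star \Sigma^\star V^{\star\top}$. For convenience, let $X^\star := U^\star (\Sigma^\star)^{1/2}$ and $Y^\star := V^\star (\Sigma^\star)^{1/2}$ be the balanced low-rank factors of $M^\star$, which obey $X^{\star\top} X^\star = Y^{\star\top} Y^\star = \Sigma^\star$ and $M^\star = X^\star Y^{\star\top}$. Denote by $\sigma_i$ the $i$th largest singular value of $M^\star$. Set $\sigma_{\max} := \sigma_1$, $\sigma_{\min} := \sigma_r$, and $\kappa := \sigma_{\max} / \sigma_{\min}$.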

Observation Models.

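What we observe is a randomly subsampled and corrupted subset of the entries of $M^\star$; namely, $M_{ij} = M^\star_{ij} + E_{ij}$ with $E_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$ for all $(i,j) \in \Omega$, where $\Omega \subseteq \{1, \ldots, n\} \times \{1, \ldots, n\}$ is a small set of indexes, and $E_{ij}$ denotes independently generated noise at the location $(i,j)$. From now on, we assume the random sampling model where each index $(i,j)$ is included in $\Omega$ independently with probability $p$ (i.e., data are missing uniformly at random). We use $\mathcal{P}_\Omega$ to represent the orthogonal projection onto the subspace of matrices that vanish outside the index set $\Omega$.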
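The observation model can be simulated in a few lines. The following is a minimal NumPy sketch under illustrative assumptions: the function name `simulate_observations`, the default parameter values, and the choice of a ground truth with unit singular values are ours and are not taken from the paper.

```python
import numpy as np

def simulate_observations(n=200, r=2, p=0.3, sigma=1e-3, seed=0):
    """Draw a rank-r ground truth and a noisy, randomly subsampled copy of it."""
    rng = np.random.default_rng(seed)
    # Random orthonormal factors obtained from QR decompositions.
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    M_star = U @ V.T  # rank-r matrix whose nonzero singular values all equal 1
    # Each index enters Omega independently with probability p.
    mask = rng.random((n, n)) < p
    # Independent Gaussian noise with standard deviation sigma.
    E = sigma * rng.standard_normal((n, n))
    # P_Omega(M_star + E): unobserved entries are stored as zeros.
    M_obs = np.where(mask, M_star + E, 0.0)
    return M_star, M_obs, mask

M_star, M_obs, mask = simulate_observations()
```

Storing unobserved entries as zeros means that multiplying by `mask` plays the role of the projection onto matrices supported on the observed index set.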

Incoherence Conditions.

Clearly, not all matrices can be reliably estimated from a highly incomplete set of measurements. To address this issue, we impose a standard incoherence condition (2) on the singular subspaces of $M^\star$ (i.e., $U^\star$ and $V^\star$), $\max\{\|U^\star\|_{2,\infty}, \|V^\star\|_{2,\infty}\} \le \sqrt{\mu r / n}$, where $\mu$ is termed the incoherence parameter and $\|A\|_{2,\infty}$ denotes the largest $\ell_2$ norm of all rows in $A$. A small $\mu$ implies that the energies of $U^\star$ and $V^\star$ are reasonably spread out across all of their rows.
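To make the incoherence parameter concrete, the sketch below computes the smallest value of the parameter for which the row-norm bound on the top-$r$ singular subspaces holds, i.e., $(n/r)$ times the largest squared row norm. The function name `incoherence` is hypothetical, and the two test matrices are illustrative extremes.

```python
import numpy as np

def incoherence(M, r):
    """Smallest mu such that the top-r singular subspaces of M satisfy
    max row norm <= sqrt(mu * r / n)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    n1, n2 = M.shape
    # Largest squared l2 norm over the rows of each subspace basis.
    max_row_U = np.max(np.sum(U**2, axis=1))
    max_row_V = np.max(np.sum(V**2, axis=1))
    return max(n1 * max_row_U, n2 * max_row_V) / r

# Perfectly spread-out energy: the all-ones matrix.
mu_flat = incoherence(np.ones((50, 50)), r=1)
# Fully localized energy: a single nonzero entry.
spike = np.zeros((50, 50))
spike[0, 0] = 1.0
mu_spike = incoherence(spike, r=1)
```

The all-ones matrix attains the smallest possible value, while the single-spike matrix attains the largest, which is why matrices of the latter kind cannot be completed from a small random sample.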

Asymptotic Notation.

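$f(n) \lesssim g(n)$ (or $f(n) = O(g(n))$) means $|f(n)| \le C |g(n)|$ for some constant $C > 0$, $f(n) \gtrsim g(n)$ means $|f(n)| \ge C |g(n)|$ for some constant $C > 0$, $f(n) \asymp g(n)$ means $C_1 |g(n)| \le |f(n)| \le C_2 |g(n)|$ for some constants $C_1, C_2 > 0$, and $f(n) = o(g(n))$ means $\lim_{n \to \infty} f(n)/g(n) = 0$.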

Inferential Procedures and Main Results

The proposed inferential procedure builds on 2 of the most popular matrix completion paradigms: convex relaxation and nonconvex optimization. Recognizing the complicated bias of these 2 highly nonlinear estimators and motivated by refs. 14, 15, and 17, we first illustrate how to perform bias correction, followed by a theory that establishes the near-Gaussianity and optimality of the proposed debiased estimators.

Background: Convex and Nonconvex Estimation Algorithms.

We first review in passing 2 computationally feasible estimation algorithms that are arguably the most widely used in practice. They serve as the starting point for us to design inferential procedures for noisy low-rank matrix completion.

Convex Relaxation.

Recall that the rank function is highly nonconvex, which often prevents us from computing a rank-constrained estimator in polynomial time. For the sake of computational feasibility, prior works suggest relaxing the rank function into its convex surrogate (22); for example, one can consider a penalized least-squares convex program $\min_{Z \in \mathbb{R}^{n \times n}} \ \frac{1}{2} \|\mathcal{P}_\Omega(Z - M)\|_{\mathrm{F}}^2 + \lambda \|Z\|_*$. Here, $\|\cdot\|_*$ is the nuclear norm (the sum of singular values, which is a convex surrogate of the rank function), and $\lambda > 0$ is some regularization parameter. Under mild conditions, the solution to this convex program attains appealing estimation accuracy (in an order-wise sense), provided that a proper regularization parameter is adopted (4, 13).
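A nuclear-norm-penalized least-squares program of this kind can be solved by proximal gradient descent, in which each step applies singular-value soft-thresholding. The sketch below assumes the objective $\frac{1}{2}\|\mathcal{P}_\Omega(Z - M)\|_{\mathrm{F}}^2 + \lambda\|Z\|_*$ and a unit step size (valid because the gradient of the smooth part is 1-Lipschitz). The function names, the toy dimensions, and the value of `lam` are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def svt(A, tau):
    """Singular-value soft-thresholding: proximal map of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(Z, M_obs, mask, lam):
    fit = 0.5 * np.sum((mask * (Z - M_obs)) ** 2)
    return fit + lam * np.sum(np.linalg.svd(Z, compute_uv=False))

def convex_estimate(M_obs, mask, lam, n_iter=200):
    """Proximal gradient with unit step size, started from the zero matrix."""
    Z = np.zeros_like(M_obs)
    for _ in range(n_iter):
        grad = mask * (Z - M_obs)   # gradient of the least-squares term
        Z = svt(Z - grad, lam)      # proximal step on the nuclear norm
    return Z

# Toy problem (illustrative sizes only).
rng = np.random.default_rng(0)
n, r, p, sigma, lam = 60, 2, 0.5, 1e-3, 0.02
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
mask = rng.random((n, n)) < p
M_obs = mask * (U @ V.T + sigma * rng.standard_normal((n, n)))

Z_cvx = convex_estimate(M_obs, mask, lam)
obj_init = objective(np.zeros((n, n)), M_obs, mask, lam)
obj_final = objective(Z_cvx, M_obs, mask, lam)
```

Each iteration costs one full SVD, which illustrates why this route becomes expensive in large dimensions.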

Nonconvex Optimization.

It is recognized that the convex approach, which typically relies on solving a semidefinite program, is still expensive and not scalable to large dimensions. This motivates an alternative route, which represents the matrix variable via 2 low-rank factors $X, Y \in \mathbb{R}^{n \times r}$ and attempts solving the following nonconvex program directly: $\min_{X, Y \in \mathbb{R}^{n \times r}} \ \frac{1}{2} \|\mathcal{P}_\Omega(XY^\top - M)\|_{\mathrm{F}}^2 + \frac{\lambda}{2} \|X\|_{\mathrm{F}}^2 + \frac{\lambda}{2} \|Y\|_{\mathrm{F}}^2$. Here, we choose a regularizer of the form $\frac{\lambda}{2}(\|X\|_{\mathrm{F}}^2 + \|Y\|_{\mathrm{F}}^2)$ primarily to mimic the nuclear norm $\lambda \|Z\|_*$ (23, 24). A variety of optimization algorithms have been proposed to tackle this nonconvex program or its variants (7, 10, 11, 25); readers are referred to ref. 26 for a recent overview. As a prominent example, a 2-stage algorithm (gradient descent following suitable initialization) provably enjoys fast convergence for a wide range of scenarios (11, 13). The present paper focuses on this simple yet powerful algorithm, as documented in Algorithm 1 and detailed in SI Appendix.

Intimate Connections between Convex and Nonconvex Estimates.

Denote by $Z^{\mathrm{cvx}}$ any minimizer of the convex program and by $(X^{\mathrm{ncvx}}, Y^{\mathrm{ncvx}})$ the estimate returned by Algorithm 1 aimed at solving the nonconvex program. As was recently shown in ref. 13, when the regularization parameter $\lambda$ is properly chosen, these 2 estimates provably obey (see SI Appendix for a precise statement) $Z^{\mathrm{cvx}} \approx X^{\mathrm{ncvx}} Y^{\mathrm{ncvx}\top}$. In truth, the 2 matrices in this relation are exceedingly close to, if not identical with, each other. This salient feature paves the way for a unified treatment of convex and nonconvex approaches: Most inferential procedures and guarantees developed for the nonconvex estimate can be readily transferred to perform inference for the convex one, and vice versa.

Constructing Debiased Estimators.

We are now well equipped to describe how to construct estimators based on the convex estimate $Z^{\mathrm{cvx}}$ and the nonconvex estimate $(X^{\mathrm{ncvx}}, Y^{\mathrm{ncvx}})$, to enable efficient inference. Motivated by the proximity of the convex and nonconvex estimates and for the sake of conciseness, we abuse notation by using $Z$, $X$, and $Y$ for both convex and nonconvex estimates; see Table 1 and SI Appendix for precise definitions. This allows us to unify the presentation for both convex and nonconvex estimators.

Table 1.

Notation used to unify the convex estimate $Z^{\mathrm{cvx}}$ and the nonconvex estimate $(X^{\mathrm{ncvx}}, Y^{\mathrm{ncvx}})$

$Z \in \mathbb{R}^{n \times n}$	Either $Z^{\mathrm{cvx}}$ or $X^{\mathrm{ncvx}} Y^{\mathrm{ncvx}\top}$
$X, Y \in \mathbb{R}^{n \times r}$	For the nonconvex case, we take $X = X^{\mathrm{ncvx}}$ and $Y = Y^{\mathrm{ncvx}}$; for the convex case, let $X = X^{\mathrm{cvx}}$ and $Y = Y^{\mathrm{cvx}}$, which are the balanced low-rank factors of $Z^{\mathrm{cvx},r}$ obeying $Z^{\mathrm{cvx},r} = X^{\mathrm{cvx}} Y^{\mathrm{cvx}\top}$ and $X^{\mathrm{cvx}\top} X^{\mathrm{cvx}} = Y^{\mathrm{cvx}\top} Y^{\mathrm{cvx}}$.
$M^{\mathrm{d}} \in \mathbb{R}^{n \times n}$	The proposed debiased estimator as in Eq. 10.
$X^{\mathrm{d}}, Y^{\mathrm{d}} \in \mathbb{R}^{n \times r}$	The proposed estimator as in Eq. 11.

Here, $Z^{\mathrm{cvx},r}$ is the best rank-$r$ approximation of $Z^{\mathrm{cvx}}$. See SI Appendix for a complete summary.

Given that the convex and nonconvex programs above are both regularized least-squares problems, they behave effectively like shrinkage estimators, indicating that the provided estimates necessarily suffer from nonnegligible bias. To enable the desired statistical inference, it is natural to first correct the estimation bias.

A debiased estimator for the matrix.

A natural debiasing strategy that immediately comes to mind is the simple linear transformation (recall the notation in Table 1) $Z^0 := Z - \frac{1}{p} \mathcal{P}_\Omega(Z - M) = Z + \frac{1}{p} \mathcal{P}_\Omega(M - Z)$, where we identify $\mathcal{P}_\Omega(M)$ with $\mathcal{P}_\Omega(M^\star + E)$. Heuristically, if $\Omega$ and $Z$ are statistically independent, then $Z^0$ serves as an unbiased estimator of $M^\star$, i.e., $\mathbb{E}[Z^0] = M^\star$; this arises since the noise has zero mean and $\mathbb{E}[\mathcal{P}_\Omega] = p\,\mathcal{I}$ under the uniform random sampling model, with $\mathcal{I}$ the identity operator. Despite its (near) unbiasedness at a heuristic level, however, the matrix $Z^0$ is typically full rank, with nonnegligible energy spread across its entire spectrum. This results in dramatically increased variability in the estimate, which is undesirable for inferential purposes. To remedy this issue, we propose to further project $Z^0$ onto the set of rank-$r$ matrices, leading to the estimator $M^{\mathrm{d}} := \mathcal{P}_{\text{rank-}r}(Z^0) = \mathcal{P}_{\text{rank-}r}\big[Z - \frac{1}{p} \mathcal{P}_\Omega(Z - M)\big]$, where $\mathcal{P}_{\text{rank-}r}(B) := \arg\min_{A:\, \mathrm{rank}(A) \le r} \|A - B\|_{\mathrm{F}}$, and $Z$ can again be found in Table 1. This projection step effectively suppresses the variability outside the $r$-dimensional principal subspace. As we shall demonstrate, the proposed estimator Eq. 10 provably debiases the provided estimate $Z$, while optimally controlling the extent of variability.
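The two-step construction (an inverse-probability-weighted correction followed by a rank-$r$ projection) can be sketched in NumPy as follows. We assume the correction takes the form $Z - \frac{1}{p}\mathcal{P}_\Omega(Z - M)$ and that the projection is a truncated SVD; the function names are hypothetical, and the sanity check uses a deliberately simple case (full observation, no noise) chosen for illustration.

```python
import numpy as np

def rank_r_projection(A, r):
    """Best rank-r approximation in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def debias_matrix(Z, M_obs, mask, p, r):
    """Correct the shrinkage bias of Z, then project back to rank r."""
    # Inverse-probability-weighted residual correction on observed entries.
    Z0 = Z - mask * (Z - M_obs) / p
    # Suppress the variability outside the r-dimensional principal subspace.
    return rank_r_projection(Z0, r)

# Sanity check: with every entry observed and no noise, the debiased
# estimator undoes an artificial shrinkage of the truth exactly.
rng = np.random.default_rng(0)
n, r = 40, 2
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
Z_shrunk = 0.5 * M_star
full_mask = np.ones((n, n), dtype=bool)
M_d = debias_matrix(Z_shrunk, M_star, full_mask, p=1.0, r=r)
```

In the partially observed case the correction alone is full rank and noisy, which is why the projection step is needed before the estimate is used for inference.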

An equivalent perspective on the low-rank factors.

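As it turns out, the debiased estimator Eq. 10 admits another almost equivalent representation that offers further insights. Specifically, we consider the following estimator for the low-rank factors, $X^{\mathrm{d}} := X \big(I_r + \frac{\lambda}{p} (X^\top X)^{-1}\big)^{1/2}$ and $Y^{\mathrm{d}} := Y \big(I_r + \frac{\lambda}{p} (Y^\top Y)^{-1}\big)^{1/2}$, where we recall the definition of $X$ and $Y$ in Table 1. To develop some intuition, let us look at a simple scenario where $U \Sigma V^\top$ is the rank-$r$ SVD of $XY^\top$ and $X = U \Sigma^{1/2}$, $Y = V \Sigma^{1/2}$. It is then self-evident that $X^{\mathrm{d}} = U \big(\Sigma + \frac{\lambda}{p} I_r\big)^{1/2}$ and $Y^{\mathrm{d}} = V \big(\Sigma + \frac{\lambda}{p} I_r\big)^{1/2}$. In words, $X^{\mathrm{d}}$ and $Y^{\mathrm{d}}$ are obtained by deshrinking the spectrum of $X$ and $Y$ properly. As we formally establish in SI Appendix, the estimator Eq. 11 for the low-rank factors is extremely close to the debiased estimator Eq. 10 for the whole matrix, in the sense that $M^{\mathrm{d}} \approx X^{\mathrm{d}} Y^{\mathrm{d}\top}$.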
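The deshrinking of the factors can be checked numerically. The sketch below assumes the map $F \mapsto F\,(I_r + \frac{\lambda}{p}(F^\top F)^{-1})^{1/2}$ applied to each factor; the function name `deshrink` and the toy values are illustrative. In the balanced case, the singular values of the product of the deshrunk factors should each increase by exactly $\lambda/p$.

```python
import numpy as np

def deshrink(F, lam, p):
    """Return F (I + (lam/p) (F^T F)^{-1})^{1/2} for a full-column-rank F."""
    G = F.T @ F
    r = G.shape[0]
    S = np.eye(r) + (lam / p) * np.linalg.inv(G)
    S = 0.5 * (S + S.T)                     # symmetrize against round-off
    w, Q = np.linalg.eigh(S)                # S is symmetric positive definite
    S_half = (Q * np.sqrt(w)) @ Q.T         # principal square root of S
    return F @ S_half

# Balanced factors X = U Sigma^{1/2}, Y = V Sigma^{1/2} (toy example).
rng = np.random.default_rng(0)
n, lam, p = 30, 0.1, 0.5
U, _ = np.linalg.qr(rng.standard_normal((n, 2)))
V, _ = np.linalg.qr(rng.standard_normal((n, 2)))
s = np.array([3.0, 1.0])
X, Y = U * np.sqrt(s), V * np.sqrt(s)

X_d, Y_d = deshrink(X, lam, p), deshrink(Y, lam, p)
s_d = np.linalg.svd(X_d @ Y_d.T, compute_uv=False)[:2]
```

The principal square root is computed through an eigendecomposition because the matrix inside the square root is symmetric positive definite whenever the factor has full column rank.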

Main Results: Distributional Guarantees.

The proposed estimators admit tractable distributional characterizations in the large-$n$ regime, which facilitates the construction of confidence regions for many quantities of interest. In particular, this paper centers around 2 types of inferential problems:

1) Each entry of the matrix $M^\star$: The entry can be either missing (i.e., predicting an unseen entry) or observed (i.e., denoising an observed entry). For example, in the problem of sensor localization (27), one wants to infer the distance between any 2 sensors, given partially revealed distances. Mathematically, this seeks to determine the distribution of $M^{\mathrm{d}}_{ij} - M^\star_{ij}$ for $1 \le i, j \le n$.

2) The low-rank factors $X^\star, Y^\star$: The low-rank factors often reveal critical information about the applications of interest [e.g., community memberships of each individual in the community detection problem (28), angles between each object and a global reference point in the angular synchronization problem (29), or factor loadings and latent factors in factor analysis (30)]. Recognizing the global rotational ambiguity issue, we aim to pin down the distributions of $X^{\mathrm{d}}$ and $Y^{\mathrm{d}}$ up to global rotational ambiguity. More precisely, we intend to characterize the distributions of $X^{\mathrm{d}} H^{\mathrm{d}} - X^\star$ and $Y^{\mathrm{d}} H^{\mathrm{d}} - Y^\star$ for the global rotation matrix $H^{\mathrm{d}} \in \mathbb{R}^{r \times r}$ that best “aligns” $(X^{\mathrm{d}}, Y^{\mathrm{d}})$ and $(X^\star, Y^\star)$, i.e., $H^{\mathrm{d}} := \arg\min_{R \in \mathcal{O}^{r \times r}} \|X^{\mathrm{d}} R - X^\star\|_{\mathrm{F}}^2 + \|Y^{\mathrm{d}} R - Y^\star\|_{\mathrm{F}}^2$. Here, $\mathcal{O}^{r \times r}$ is the set of orthonormal matrices in $\mathbb{R}^{r \times r}$.

Clearly, the above 2 inferential problems are tightly related: An accurate distributional characterization for the low-rank factors often results in a distributional guarantee for the entries.
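Finding the rotation that best aligns a pair of estimated factors with the true factors is an orthogonal Procrustes problem, which has a closed-form solution through an SVD. The sketch below assumes the alignment criterion is the sum of the two squared Frobenius distances over all orthonormal $r \times r$ matrices; the function name `best_rotation` and the test setup are illustrative.

```python
import numpy as np

def best_rotation(X_d, Y_d, X_star, Y_star):
    """Orthonormal R minimizing ||X_d R - X_star||_F^2 + ||Y_d R - Y_star||_F^2."""
    # Minimizing the criterion is equivalent to maximizing trace(R^T C).
    C = X_d.T @ X_star + Y_d.T @ Y_star
    A, _, Bt = np.linalg.svd(C)
    return A @ Bt

# Sanity check: rotate the true factors by a known orthonormal Q and
# verify that the alignment recovers it.
rng = np.random.default_rng(0)
n, r = 20, 3
X_star = rng.standard_normal((n, r))
Y_star = rng.standard_normal((n, r))
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
X_rot, Y_rot = X_star @ Q.T, Y_star @ Q.T

H = best_rotation(X_rot, Y_rot, X_star, Y_star)
```

Note that the alignment requires the true factors, so it is an analysis device for stating distributional results rather than a step that can be run on real data.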

Distributional guarantees for low-rank factors.

We begin with our distributional characterizations of the low-rank factors. Here, $e_j$ denotes the $j$th standard basis vector in $\mathbb{R}^n$.

Theorem 1.

Suppose that the sample complexity meets for some sufficiently large constant and the noise obeys for some sufficiently small constant . Then one can writewith defined in , defined in , and defined in . Here, the rows of (resp. ) are independent and obeyIn addition, the residual matrices satisfy, with probability at least , that In words, decomposes the estimation error (resp. ) into a Gaussian component (resp. ) and a residual term (resp. ). If the sample size is sufficiently large and the noise size is sufficiently small, then the residual terms are much smaller in size compared to and . To see this, it is helpful to leverage the Gaussianity (Eq. ) to compute that for each , the th row of obeysin other words, the typical size of the th row of is no smaller than the order of . In comparison, the size of each row of (Eq. ) is much smaller than (and hence smaller than the size of the corresponding row of ) with high probability.

Remark 1.

Another interesting feature—which we make precise in the proof of Theorem 1—is that for any given , the two random vectors and are nearly statistically independent. This is crucial for deriving inferential guarantees for the entries of the matrix.

Distributional guarantees for matrix entries.

Equipped with the above theory for low-rank factors and Remark 1, we are ready to characterize the distribution of $M^{\mathrm{d}}_{ij} - M^\star_{ij}$.

Theorem 2.

For each , define the variance aswhere (resp. ) denotes the th (resp. th) row of (resp. ). Suppose thatThen the matrix defined in satisfieswhere and the residual obeys with probability exceeding . Several remarks are in order. First, we develop some intuition regarding where the formula comes from. By virtue of , one has the following Gaussian approximationAssuming that the first-order expansion is tight, one hasAccording to Remark 1, and are nearly independent. One can thus compute the variance of Eq. asHere, (i) relies on Eq. and the near independence between and , (ii) uses the variance formula in , and (iii) arises from the definitions of and (cf. Eq. ). This explains (heuristically) the variance formula . Given that reveals the tightness of Gaussian approximation under conditions in Eq. , it in turn allows us to construct nearly accurate confidence intervals for each matrix entry . This is formally summarized in the following corollary. Here, denotes the interval .
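The heuristic variance computation described above can be written out as follows. This is a sketch under stated assumptions rather than the paper's exact display: we write $Z_X$ and $Z_Y$ for the Gaussian components of the factor errors and assume that their rows are independent Gaussian vectors with covariance $\frac{\sigma^2}{p}(\Sigma^\star)^{-1}$, and that the $i$th row of $Z_X$ and the $j$th row of $Z_Y$ are nearly independent. These symbols and the covariance form are our assumptions, since the corresponding displays are not legible in this copy.

```latex
\begin{align*}
M^{\mathrm{d}}_{ij} - M^\star_{ij}
  &\approx \big[ X^{\mathrm{d}} H^{\mathrm{d}} (Y^{\mathrm{d}} H^{\mathrm{d}})^\top
            - X^\star Y^{\star\top} \big]_{ij} \\
  &\approx e_i^\top Z_X Y^{\star\top} e_j + e_i^\top X^\star Z_Y^\top e_j
  && \text{(first-order expansion)} \\[4pt]
\operatorname{Var}\big( e_i^\top Z_X Y^{\star\top} e_j
                      + e_i^\top X^\star Z_Y^\top e_j \big)
  &\approx \frac{\sigma^2}{p}\, e_j^\top Y^\star (\Sigma^\star)^{-1} Y^{\star\top} e_j
         + \frac{\sigma^2}{p}\, e_i^\top X^\star (\Sigma^\star)^{-1} X^{\star\top} e_i
  && \text{(near independence)} \\
  &= \frac{\sigma^2}{p} \Big( \big\| V^\star_{j,\cdot} \big\|_2^2
                            + \big\| U^\star_{i,\cdot} \big\|_2^2 \Big)
  && \text{(since } X^\star = U^\star (\Sigma^\star)^{1/2},\;
                    Y^\star = V^\star (\Sigma^\star)^{1/2}\text{)}
\end{align*}
```

Under these assumptions the variance of an entry depends only on the noise level, the sampling rate, and the row norms of the singular subspaces at the corresponding row and column.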

Corollary 1 (Confidence Intervals for the Entries ).

Let , , and be as defined in . For any given , suppose that holds and thatDenote by the CDF of a standard Gaussian random variable and by its inverse function. Letbe the empirical estimate of . Then one has In words, Corollary 1 tells us that for any fixed significance level , the intervalis a nearly accurate confidence interval of . In addition, we remark that when (and hence ), the above Gaussian approximation is completely off. In this case, one can still leverage to show thatwhere are independent and identically distributed according to . However, it is nontrivial to determine whether is vanishingly small or not based on the observed data, which makes it challenging to conduct efficient inference for entries with small (but a priori unknown) . Last but not least, the careful reader might wonder how to interpret our conditions on the sample complexity and the signal-to-noise ratio. Take the case with for example: Our conditions readThe first condition matches the sample complexity limit (up to some log factor), while the second one coincides with the regime (up to log factor) in which popular algorithms (like spectral methods or nonconvex algorithms) work better than a random guess (7, 10, 11). The take-away message is this: Once we are able to compute a reasonable estimate in an overall sense, then we can reinforce it to conduct entrywise inference in a statistically efficient fashion.
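The construction of entrywise confidence intervals can be sketched in NumPy as follows. We assume a plug-in variance of the form $\frac{\sigma^2}{p}(\|U_{i,\cdot}\|_2^2 + \|V_{j,\cdot}\|_2^2)$, where $U$ and $V$ are the singular subspaces of the product of the debiased factors, and we assume the noise level $\sigma$ is known. Both are assumptions on our part, as the exact formula is not legible in this copy; the function name is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def entry_confidence_intervals(X_d, Y_d, sigma, p, alpha=0.05):
    """Two-sided (1 - alpha) intervals for every entry of X_d @ Y_d.T."""
    center = X_d @ Y_d.T
    r = X_d.shape[1]
    # Singular subspaces of the rank-r point estimate.
    U, _, Vt = np.linalg.svd(center, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    # Assumed plug-in variance: (sigma^2 / p) * (||U_i||^2 + ||V_j||^2).
    row_u = np.sum(U**2, axis=1)
    row_v = np.sum(V**2, axis=1)
    v_hat = (sigma**2 / p) * (row_u[:, None] + row_v[None, :])
    # Standard Gaussian quantile at level 1 - alpha / 2.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * np.sqrt(v_hat)
    return center - half_width, center + half_width

# Illustrative call on random factors (not a statistical validation).
rng = np.random.default_rng(1)
X_d = rng.standard_normal((30, 2))
Y_d = rng.standard_normal((30, 2))
lower, upper = entry_confidence_intervals(X_d, Y_d, sigma=0.1, p=0.5)
```

Because the intervals are centered at the point estimate and need only one SVD, they are available for missing and observed entries alike at essentially no extra cost.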

Lower Bounds and Optimality for Inference.

It is natural to ask how well our inferential procedures perform compared to other algorithms. Encouragingly, the debiased estimator is optimal in some sense; for instance, it attains the minimum covariance among all unbiased estimators. To formalize this claim, we 1) quantify the performance of 2 ideal estimators with the assistance of an oracle and 2) demonstrate that the performance of our debiased estimators is arbitrarily close to that of the ideal estimators. We remark in passing such results here; see for precise statements. Below, we denote by (resp. ) the th row of (resp. ).

Lower bound for estimating .

Suppose there is an oracle informing us of and we observe the same set of data as in Eq. . Under such an idealistic setting and under our sample complexity condition, one has, with high probability, that any unbiased estimator of satisfiesThis reveals that the covariance of the estimator (cf. ) attains the Cramér–Rao lower bound with high probability. The same conclusion applies to too.

Lower bound for estimating ().

Suppose there is another oracle informing us of and , that is, everything about except and everything about except . In addition, we observe the same set of data as in Eq. , except that we do not get to see . Under this idealistic model, one can show that with high probability, any unbiased estimator of must have variance no smaller than , where is defined in . This indicates that the variance of our debiased estimator (cf. )—which certainly does not have access to the side information provided by the oracle—is arbitrarily close to the Cramér–Rao lower bound aided by an oracle.

Back to Estimation: The Debiased Estimator Is Optimal.

While the emphasis herein is on inference, we nevertheless single out an important consequence that informs the estimation step. To be specific, the distributional guarantees derived in Theorems 1 and 2 allow us to track the estimation accuracy of , as stated below.

Theorem 3 (Estimation Accuracy of ).

Let be the debiased estimator as defined in . Instate the conditions in . Then with probability at least , one has In stark contrast to prior statistical estimation guarantees (e.g., refs. 4–6 and 13), Theorem 3 pins down the estimation error of the proposed debiased estimator in a sharp manner (namely, even the preconstant is fully determined). Encouragingly, there is a sense in which the proposed debiased estimator achieves the best possible statistical estimation accuracy. In fact, a lower bound has already been derived in ref. 4, section III.B, asserting that one cannot beat the mean-square estimation error even with the help of an oracle. See for a precise statement. The implication of Theorems 1 to 3 is remarkable: The debiasing step not merely facilitates uncertainty assessment, but also proves crucial in minimizing estimation errors. It achieves optimal statistical efficiency in terms of both the rate and the preconstant. This theory about a polynomial time algorithm matches the statistical limit in terms of the preconstant. This intriguing finding is further corroborated by numerical experiments (see Fig. 2).

Fig. 2.

(Left) Estimation error of vs. measured in the Frobenius norm. (Right) Estimation error of vs. measured in the norm. The results are averaged over 20 independent trials for , , and .

Numerical Experiments.

We conduct numerical experiments on synthetic data to verify the distributional characterizations provided in . The verification of is left to . Note that our main results hold for the debiased estimators built upon and . As we formalize in , these 2 debiased estimators are extremely close to each other. Therefore, to save space, we use the debiased estimator built upon the convex estimate throughout the experiments. Fix the dimension and the regularization parameter throughout the experiments. We generate a rank- matrix , where are random orthonormal matrices, and apply the proximal gradient method to solve the convex program Eq. . Denote , where is the empirical variance defined in Eq. . In view of the confidence interval predicted by Corollary 1, for each , we define to be the empirical coverage rate of over 200 Monte Carlo simulations. Correspondingly, denote by (resp. ) the average (resp. the SD) of over indexes . As before, Table 2 gathers the empirical coverage rates for and Fig. 1 displays the quantile–quantile (Q-Q) plots of and vs. the standard Gaussian random variable over 200 Monte Carlo trials for , , and . It is evident that the distribution of matches that of reasonably well.

Table 2.

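Empirical coverage rates of $M^\star_{ij}$ for different $(r, p, \sigma)$ over 200 Monte Carlo trials

$(r, p, \sigma)$	Mean($\widehat{\mathrm{Cov}}_{\mathrm{E}}$)	Std($\widehat{\mathrm{Cov}}_{\mathrm{E}}$)
$(2, 0.2, 10^{-6})$	0.9380	0.0200
$(2, 0.2, 10^{-3})$	0.9392	0.0196
$(2, 0.4, 10^{-6})$	0.9455	0.0164
$(2, 0.4, 10^{-3})$	0.9456	0.0164
$(5, 0.2, 10^{-6})$	0.9226	0.0247
$(5, 0.2, 10^{-3})$	0.9271	0.0228
$(5, 0.4, 10^{-6})$	0.9410	0.0173
$(5, 0.4, 10^{-3})$	0.9417	0.0172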

Fig. 1.

Q-Q plots of (Left) and (Right) vs. the standard normal distribution. The results are reported over 200 independent trials for , , and .

In addition to the tractable distributional guarantees, the debiased estimator $M^{\mathrm{d}}$ also exhibits superior estimation accuracy compared to the original estimator $Z^{\mathrm{cvx}}$ (cf. Theorem 3). Fig. 2 reports the estimation error of $Z^{\mathrm{cvx}}$ vs. $M^{\mathrm{d}}$ measured in both the Frobenius norm and the $\ell_\infty$ norm across different noise levels. The results are averaged over 20 Monte Carlo simulations at a fixed rank and sampling rate. It can be seen that the errors of the debiased estimator are uniformly smaller than those of the original estimator and are much closer to the oracle lower bound. As a result, we recommend using $M^{\mathrm{d}}$ even for the purpose of estimation.

We conclude this section with experiments on real data. Similar to ref. 4, we use the daily temperature data (31) for 1,400 stations across the world in 2018, which results in a $1{,}400 \times 365$ data matrix. Inspection of the singular values reveals that the data matrix is nearly low rank. We vary the observation probability $p$ from 0.5 to 0.9 and randomly subsample the data accordingly. Based on the observed temperatures, we then apply the proposed methodology to obtain confidence intervals for all of the entries. Table 3 reports the empirical coverage probabilities and the average length of the confidence intervals as well as the estimation error of both $Z^{\mathrm{cvx}}$ and $M^{\mathrm{d}}$ over 20 independent experiments. It can be seen that the average coverage probabilities are reasonably close to the nominal level and the confidence intervals are also quite short. In addition, the estimation error of $M^{\mathrm{d}}$ is smaller than that of $Z^{\mathrm{cvx}}$, which corroborates our theoretical prediction. The discrepancy between the nominal coverage probability and the actual one might arise from the facts that 1) the underlying true temperature matrix is only approximately low rank and 2) the noise in the temperature might not be independent.

Table 3.

Empirical coverage rates and average lengths of the confidence intervals of the entries as well as the estimation error vs. observation probability $p$

	Coverage		CI length		$\|\widehat{Z} - M^\star\|_{\mathrm{F}} / \|M^\star\|_{\mathrm{F}}$
$p$	Mean	SD	Mean	SD	Convex $Z^{\mathrm{cvx}}$	Debiased $M^{\mathrm{d}}$
0.5	0.8265	0.0016	3.6698	0.0209	0.029	0.028
0.6	0.8268	0.0011	2.8774	0.0098	0.025	0.023
0.7	0.8431	0.0006	2.3426	0.0054	0.022	0.019
0.8	0.8725	0.0003	2.0234	0.0052	0.020	0.015
0.9	0.9093	0.0003	1.8296	0.0072	0.018	0.011

The results are averaged over 20 Monte Carlo trials.

Discussion

The present paper makes progress toward inference and uncertainty quantification for noisy matrix completion, by developing simple debiased estimators that admit tractable and accurate distributional characterizations. While we have achieved some early success in accomplishing this, our results are likely suboptimal in the dependency on the rank and the condition number . Also, our theory operates under the moderate-to-high signal-to-noise ratio (SNR) regime, where (which is proportional to the SNR) is required to exceed the order of ; see the conditions in . How to conduct inference in the low SNR regime is an important future direction. More broadly, this paper uncovers that computational feasibility and full statistical efficiency can sometimes be simultaneously achieved despite a high degree of nonconvexity. The analysis and insights herein might shed light on inference for a broader class of nonconvex statistical problems.

Algorithm 1.

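Gradient descent for solving the nonconvex program

Suitable initialization: $X^0$, $Y^0$ (SI Appendix)

Gradient updates: for $t = 0, 1, \ldots, t_0 - 1$

$X^{t+1} = X^t - \frac{\eta}{p} \big[ \mathcal{P}_\Omega(X^t Y^{t\top} - M)\, Y^t + \lambda X^t \big]$, [6a]

$Y^{t+1} = Y^t - \frac{\eta}{p} \big[ [\mathcal{P}_\Omega(X^t Y^{t\top} - M)]^\top X^t + \lambda Y^t \big]$, [6b]

where $\eta > 0$ determines the step size or the learning rate.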
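The gradient updates 6a and 6b translate directly into NumPy. The sketch below is illustrative only: the initialization is deferred to the SI Appendix in the text, so we substitute a spectral initialization (rank-$r$ SVD of the rescaled observations) as an assumption, and the function names, step size, regularization level, and problem sizes are ours rather than the paper's.

```python
import numpy as np

def spectral_init(M_obs, p, r):
    """Assumed initialization: balanced factors of the rank-r SVD of M_obs / p."""
    U, s, Vt = np.linalg.svd(M_obs / p, full_matrices=False)
    root = np.sqrt(s[:r])
    return U[:, :r] * root, Vt[:r].T * root

def gradient_descent(M_obs, mask, p, r, lam, eta, n_iter=500):
    """Run the updates 6a and 6b from the assumed spectral initialization."""
    X, Y = spectral_init(M_obs, p, r)
    for _ in range(n_iter):
        R = mask * (X @ Y.T - M_obs)                   # P_Omega(X Y^T - M)
        X_next = X - (eta / p) * (R @ Y + lam * X)     # update 6a
        Y_next = Y - (eta / p) * (R.T @ X + lam * Y)   # update 6b
        X, Y = X_next, Y_next                          # simultaneous update
    return X, Y

def rel_error(X, Y, M_star):
    return np.linalg.norm(X @ Y.T - M_star) / np.linalg.norm(M_star)

# Toy problem (illustrative sizes only).
rng = np.random.default_rng(0)
n, r, p, sigma = 100, 2, 0.5, 1e-3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
M_star = U @ V.T
mask = rng.random((n, n)) < p
M_obs = mask * (M_star + sigma * rng.standard_normal((n, n)))

X0, Y0 = spectral_init(M_obs, p, r)
X, Y = gradient_descent(M_obs, mask, p, r, lam=0.01, eta=0.1)
err_init, err_final = rel_error(X0, Y0, M_star), rel_error(X, Y, M_star)
```

Each iteration needs only matrix products with the $n \times r$ factors and no SVD, which is the source of the scalability advantage over the convex approach.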
