Literature DB >> 34557057

Learning from Binary Multiway Data: Probabilistic Tensor Decomposition and its Statistical Optimality.

Abstract

We consider the problem of decomposing a higher-order tensor with binary entries. Such data problems arise frequently in applications such as neuroimaging, recommendation system, topic modeling, and sensor network localization. We propose a multilinear Bernoulli model, develop a rank-constrained likelihood-based estimation method, and obtain the theoretical accuracy guarantees. In contrast to continuous-valued problems, the binary tensor problem exhibits an interesting phase transition phenomenon according to the signal-to-noise ratio. The error bound for the parameter tensor estimation is established, and we show that the obtained rate is minimax optimal under the considered model. Furthermore, we develop an alternating optimization algorithm with convergence guarantees. The efficacy of our approach is demonstrated through both simulations and analyses of multiple data sets on the tasks of tensor completion and clustering.

Entities: Chemical

Keywords: CANDECOMP/PARAFAC tensor decomposition; binary tensor; constrained maximum likelihood estimation; diverging dimensionality; generalized linear model

Year: 2020 PMID： 34557057 PMCID： PMC8457422

Source DB: PubMed Journal: J Mach Learn Res ISSN： 1532-4435 Impact factor: 5.177

Introduction

Motivation

Multiway arrays, a.k.a. tensors, have gained increasing attention in numerous fields, such as genomics (Hore et al., 2016), neuroscience (Zhou et al., 2013), recommender systems (Bi et al., 2018), social networks (Nickel et al., 2011), and computer vision (Tang et al., 2013). An important reason of the wide applicability is the effective representation of data using tensor structure. One example is recommender system (Bi et al., 2018), the data of which can be naturally described as a three-way tensor of user 00D7 item × context and each entry indicates the user-item interaction under a particular context. Another example is the DBLP database (Zhe et al., 2016), which is organized into a three-way tensor of author × word × venue and each entry indicates the co-occurrence of the triplets. Despite the popularity of continuous-valued tensors, recent decades have witnessed many instances of binary tensors, in which all tensor entries are binary indicators encoded as 0/1. Examples include click/no-click action in recommender systems, presence/absence of edges in multi-relational social networks (Nickel et al., 2011), and connection/disconnection in brain structural connectivity networks (Wang et al., 2019). These binary tensors are often noisy and high-dimensional. It is crucial to develop effective tools that reduce the dimensionality, take into account the tensor formation, and learn the underlying structures of these massive discrete observations. A number of successful tensor decomposition methods have been proposed (Kolda and Bader, 2009; Anandkumar et al., 2014; Wang and Song, 2017), revitalizing the classical methods such as CANDECOMP/PARAFAC (CP) decomposition (Hitchcock, 1927) and Tucker decomposition (Tucker, 1966). These methods treat tensor entries as continuous-valued, and therefore they are not suitable to analyze binary tensors. There is a relative paucity of decomposition methods for binary tensors compared to continuous-valued tensors. In this article, we develop a general method and the associated theory for binary tensor decomposition. Let be an order-K (d1, … ,d)-dimensional binary data tensor, where the entries are either 1 or 0 that encodes the presence or absence of the event indexed by the K-tuplet (i1, … ,i). We consider the following low-rank Bernoulli model, where, for ease of notation, we have allowed the operators (~, f, etc) to be applied to tensors in an element-wise manner. That is, the entries of are realizations of independent Bernoulli random variables with success probability , where f is a suitable function that maps to [0, 1]. The parameter tensor, is of the same dimension as but its entries are continuous-valued, and we assume Θ admits a low-rank CP structure. Our goal is to estimate Θ from one instance of the binary tensor . In particular, we are interested in the high dimensional setting where dmin = min d grows. Our primary focus is to understand (i) the statistical estimation error of binary tensor decomposition; (ii) the statistical hardness, in terms of minimax rate and signal-to-noise ratio, of the binary problem compared to its continuous-valued counterpart; and (iii) the computational properties of associated estimation algorithms.

Related Work

Our work is closely related to but also clearly distinctive from several lines of existing research. We survey the main related approaches for comparison.

Continuous-valued tensor decomposition.

In principle, one can apply the existing decomposition methods designed for continuous-valued tensor (Kolda and Bader, 2009; Wang and Song, 2017) to binary tensor, by pretending the 0/1 entries were continuous. However, such an approach will yield an inferior performance: flipping the entry coding 0 ↔ 1 would totally change the decomposition result, and the predicted values for the unobserved entries could fall outside the valid range [0, 1]. Our method, in contrast, is invariant to flipping, because reversing the entry coding of changes only the sign but not the decomposition result of the parameter Θ. Moreover, as we show in Section 3.3, binary tensor decomposition exhibits a “dithering” effect (Davenport et al., 2014) that necessitates the presence of stochastic noise in order to estimate Θ. This is clearly contrary to the behavior of continuous-valued tensor decomposition.

Binary matrix decomposition.

When the order K = 2, the problem reduces to binary or logit principal component analysis (PCA), and a similar model as (1) has been proposed (Collins et al., 2002; De Leeuw, 2006; Lee et al., 2010). While tensors are conceptual generalization of matrices, matrix decomposition and tensor decomposition are fundamentally different (Kolda and Bader, 2009). Under the matrix case, the rank R is required to be no greater than min(d1, d2), and the factor matrices are constrained to be orthogonal for the identification purpose. Both constraints are unnecessary for tensors, since the uniqueness of tensor CP decomposition holds under much milder conditions (Bhaskara et al., 2014). In fact, factors involved in tensors may be nonorthogonal, and the tensor rank R may exceed the dimension. These differences make the earlier algorithms built upon matrix decomposition unsuitable to tensors. Moreover, as we show in Section 3.1, if we were to apply the matrix version of binary decomposition to a tensor by unfolding the tensor into a matrix, the result is suboptimal with a slower convergence rate.

Binary tensor decomposition.

More recently, Mažgut et al. (2014); Rai et al. (2015); Hong et al. (2020) studied higher-order binary tensor decomposition, and we target the same problem. However, our study differs in terms of the scope of the results. In general, there are two types of properties that an estimator possesses. The first type is the algorithm-dependent property that quantifies the impact of a specific algorithm, such as the choice of loss function, initialization, and iterations, on the final estimator. The second type is the statistical property that characterizes the population behavior and is independent of any specific algorithm. Earlier solutions of Mažgut et al. (2014); Rai et al. (2015); Hong et al. (2020) focused on the algorithm effectiveness, but did not address the population optimality. By contrast, we study both types of properties in Sections 3 and 4. This allows us to better understand the gap between a specific algorithm and the population optimality, which may in turn offer a useful guide to the algorithm design.

1-bit completion.

Our work is also connected to 1-bit matrix completion (Cai and Zhou, 2013; Davenport et al., 2014) and its recent extension to 1-bit tensor completion (Ghadermarzy et al., 2018). The completion problem aims to recover a matrix or tensor from incomplete observations of its entries. The observed entries are highly quantized, sometimes even to a single bit. We first show in Section 2.2 that our Bernoulli tensor model has an equivalent interpretation as the threshold model commonly used in 1-bit quantization. Then, the two methods are compared in Section 3.1. We achieve a faster convergence rate than that in 1-bit tensor completion (Ghadermarzy et al., 2018), assuming the signal rank is of constant order. The optimality of our estimator is safeguarded by a matching minimax lower bound.

Boolean tensor decomposition.

Boolean tensor decomposition (Miettinen, 2011; Erdos and Miettinen, 2013a; Rukat et al., 2018) is a data-driven algorithm that decomposes a binary tensor into binary factors. The idea is to use logical operations to replace arithmetic operations such as addition and multiplication in the factorization. These methods also study binary tensors, same as we do, but they took an empirical approach to approximate a particular data instance. One important difference is that we focus on parameter estimation in a population model. The population interpretation offers useful insight on the effectiveness of dimension reduction. Having a population model allows us to tease apart the algorithmic error versus the statistical error. We numerically compare the two approaches in Section 5.

Bayesian binary tensor decomposition.

There have been a number of Bayesian binary tensor decomposition algorithms (Nickel et al., 2011; Rai et al., 2014, 2015). Most of these algorithms focus on the specific context of multi-relational learning. Although we take multi-relational learning as one of our applications, we address a general binary tensor decomposition problem, and we study the statistical properties of the problem, such as the SNR phase diagram and minimax rate. Besides, we provide a frequentist-type solution which is computationally more tractable than a Bayesian one.

Our Contributions

The primary goal of this paper is to study both the statistical and computational properties of binary tensor problem. Our contributions are summarized below. First, we quantify the differences and connections between binary tensor problem and continuous-valued tensor problem. We show that the Bernoulli tensor model (1) is equivalent to entrywise quantization of a latent noisy, continuous-valued tensor. The impact of latent signal-to-noise ratio (SNR) on the tensor recovery accuracy is characterized, and we identify three different phases for tensor recovery according to SNR; see Table 1 in Section 3.3. When SNR is bounded by a constant, the loss in binary tensor decomposition is comparable to the case of continuous-valued tensor, suggesting very little information has been lost by quantization. On the other hand, when SNR is sufficiently large, stochastic noise turns out to be helpful, and is in fact essential, for estimating the signal tensor. The later effect is related to “dithering” (Davenport et al., 2014) and “perfect separation” (Albert and Anderson, 1984) phenomenon, and this is clearly contrary to the behavior of continuous-valued tensor decomposition.

Table 1:

Error rate for low-rank tensor estimation. For ease of presentation, we omit the constants that depend on the order K or rank R.

Tensor type	SNR≫O(1)	O(1)≳SNR≫O(d−(K−1)/2)	O(d−(K−1)/2)≳SNR
Binary	σeα2/σ2d−(K−1)/2	σd ^−(K−1)/2	α
Continuous	σd ^−(K−1)/2	σd ^−(K−1)/2	α

Second, we propose a method for binary tensor decomposition and establish its statistical properties, including the upper bound and the minimax lower bound on the tensor recovery accuracy. These properties characterize the population optimality of the estimator. Note that, in our problem, the tensor dimensions (d1, … ,d) diverge, and so does the number of unknown parameters. As such, the classical maximum likelihood estimation (MLE) theory does not directly apply. We leverage the recent development in random tensor theory and high-dimensional statistics to establish the error bounds of the tensor estimation. The matching information-theoretical lower bounds are correspondingly provided. To our knowledge, these statistical guarantees are among the first for binary tensor decomposition. Lastly, we supplement the above general statistical properties by proposing an alternating optimization algorithm and establish the corresponding algorithmic properties. Our algorithm-dependent error bound reveals an interesting interplay between statistical and computational efficiency. We illustrate the efficacy of our algorithm through both simulations and data applications.

Notation and Organization

We adopt the following notation throughout the article. We use to denote an order-K (d1, … ,d)-dimensional tensor over a filed . We focus on real or binary tensors, i.e., or . The Frobenius norm of is defined as , and the maximum norm of is defined as . We use uppercase letters (e.g., Θ, , ) to denote tensors and matrices, and use lowercase letters (e.g., θ, ) to denote scales and vectors. The vectorization of tensor , denoted , is defined as the operation rearranging all elements of into a column vector. We use ⊗ to denote the kronecker product of vectors and , and ⊙ for the Khatri-Rao product of matrices and . We use to denote the (d − 1)-dimensional unit sphere, and the shorthand [n] := {1, …,n} to denote the n-set for . The rest of the article is organized as follows. Section 2 presents the low-rank Bernoulli tensor model, its connection with 1-bit observation model, and the rank-constrained MLE framework. in Section 3, we establish the statistical estimation error bounds and the phase transition phenomenon. We next develop an alternating optimization algorithm and establish its convergence guarantees in Section 4. We present the simulations in Section 5 and data analyses in Section 6. All technical proofs are deferred to Section 7 and Appendix A. We conclude the paper with a discussion in Section 8.

Model

Low-rank Bernoulli Model

Let be a binary data tensor. We assume the tensor entries are realizations of independent Bernoulli random variables, such that, for all (i1, … ,i) ∈ [d1] × ⋯ × [d], In this model, is a strictly increasing function. We further assume that f(θ) is twice-differentiable in ; f(θ) is strictly increasing and strictly log-concave; and f′(θ) is unimodal and symmetric with respect to θ = 0. All these assumptions are fairly mild. In the context of generalized linear models (GLMs), f is often referred to as the “inverse link function.” When no confusion arises, we also call f the “link function.” The parameter tensor is continuous-valued and unknown; it is the main object of interest in our tensor estimation inquiry. The entries of are assumed to be mutually independent conditional on Θ, which is commonly adopted in the literature (Collins et al., 2002; De Leeuw, 2006; Lee et al., 2010). Note that this assumption does not rule out the marginal correlations among the entries of . Furthermore, we assume the parameter tensor Θ admits a rank-R CP decomposition, where λ1 ≥ … ≥ λ > 0 and , for all r ∈ [R], k ∈ [K]. Without loss of generality, we assume that Θ cannot be written as a sum of fewer than R outer products. The CP structure in (3) is frequently used in tensor data analysis, and the rank R determines the tradeoff between model complexity and model flexibility. For the theory, we assume the true rank R is known; the adaptation to unknown R is addressed in Section 4.3. The low-rank structure dramatically reduces the number of parameters in Θ, from the order of to the order of . More precisely, the effective number of parameters in (3) is p = R(d1 + d2) − R2 for matrices (K = 2) after adjusting for the nonsingular transformation indeterminacy, and for higher-order tensors (K ≥ 3) after adjusting for the scaling indeterminacy. Combining (2) and (3) leads to our low-rank Bernoulli model. We seek to estimate the rank-R tensor Θ given the observed binary tensor . The model can be viewed as a generalization of the classical CP decomposition for continuous-valued tensors to binary tensors, in a way that is analogous to the generalization from a linear model to a GLM. When imposing low-rank structure to a continuous-valued tensor directly, the problem amounts to seeking the best rank-R approximation to , in the least-squares sense. The least-squares criterion is equivalent to the MLE for the low-rank tensor Θ based on a noisy observation , where collects independent and identically distributed (i.i.d.) Gaussian noises. In the next section, we present a close connection between a continuous-valued tensor problem and a binary tensor problem.

Latent Variable Model Interpretation

We show that our binary tensor model (2) has an equivalent interpretation as the threshold model commonly used in 1-bit quantization (Davenport et al., 2014; Bhaskar and Javanmard, 2015; Cai and Zhou, 2013; Ghadermarzy et al., 2018). The later viewpoint sheds light on the nature of the binary (1-bit) measurements from the information perspective. Consider an order-K tensor with a rank-R CP structure. Suppose that we do not directly observe Θ. Instead, we observe the quantized version following the scheme where is a noise tensor to be specified later. Equivalently, the observed binary tensor is , and the associated latent tensor is . Here the sign function is applied to tensors in an element-wise manner. In light of this interpretation, the tensor Θ serves as an underlying, continuous-valued quantity whose noisy discretization gives . The latent model (4) in fact is equivalent to our Bernoulli tensor model (2), if the link f behaves like a cumulative distribution function. Specifically, for any choice of f in (2), if we define as having i.i.d. entries drawn from a distribution whose cumulative distribution function is , then (2) reduces to (4). Conversely, if we set the link function , then model (4) reduces to (2). Such relationship gives a one-to-one correspondence between the error distribution in the latent model and the link function in the Bernoulli model. We describe three common choices of f, or equivalently, the distribution of . Example 1. (Logistic link/Logistic noise). The logistic model is represented by (2) with f(θ) = (1+e−)−1 and the scale parameter σ > 0. Equivalently, the noise in (4) follows i.i.d. logistic distribution with the scale parameter σ. Example 2. (Probit link/Gaussian noise). The probit model is represented by (2) with f(θ) = Φ(θ/σ), where Φ is the cumulative distribution function of a standard Gaussian. Equivalently, the noise in (4) follows i.i.d. N(0, σ2). Example 3. (Laplacian link/Laplacian noise). The Laplacian model is represented by (2) with and the scale parameter σ > 0. Equivalently, the noise in (4) follows i.i.d. Laplace distribution with the scale parameter σ. The above link functions are common for the Bernoulli model, and the choice is informed by several considerations (McCullagh, 1980). The probit is the canonical link based on the Bernoulli likelihood, and it has a direct connection with the log-odds of success. The probit is connected to threshold latent Gaussian tensors. The Laplace has a heavier tail than the normal distribution, and it is more suitable for modeling long-tail data.

Rank-constrained Likelihood-based Estimation

We propose to estimate the unknown parameter tensor Θ in model (2) using a constrained likelihood approach. The log-likelihood function for (2) is where the second equality is due to the symmetry of the link function f. To incorporate the CP structure (3), we propose a constrained optimization, for a given rank and a bound . Here the search space is assumed to be a compact set containing the true parameter Θtrue. The candidate tensor of our interest satisfies two constraints. The first is that Θ admits the CP structure (3) with rank R. As discussed in Section 2.1, the low-rank structure (3) is an effective dimension reduction tool in tensor data analysis. The second constraint is that all the entries of Θ are bounded in absolute value by a constant . We refer to α as the “signal” bound of Θ. This maximum-norm condition is a technical assumption to aid the recovery of Θ in the noiseless case. Similar techniques have been employed for the matrix case (Davenport et al., 2014; Bhaskar and Javanmard, 2015; Cai and Zhou, 2013). In the next section, we first investigate the statistical error bounds for the global optimizer . These bounds characterize the population behavior of the global estimator and weave three quantities: tensor dimension, rank, and signal-to-noise ratio. We then compare these properties to the information-theoretical bound and reveal a phase-transition phenomenon. In Section 4, we develop a specific algorithm for the optimization problem in (5), and we derive the convergence properties of the empirical estimator resulting from this algorithm.

Statistical Properties

Performance Upper Bound

We define two quantities L and γ to control the “steepness” and “convexity” of the link function f. Let where , and α is the bound on the entrywise magnitude of Θ. When α is a fixed constant and f is a fixed function, all these quantities are bounded by some fixed constants independent of the tensor dimension. In particular, for the logistic, probit and Laplacian models, we have We assess the estimation accuracy using the deviation in Frobenius norm. For the true coefficient tensor and its estimator , define The next theorem establishes the upper bound for under model (2). Theorem 1 (Statistical convergence). Suppose (2) with the link function f and the true coefficient tensor (5). Then, there exists an absolute constant C1 > 0, and a constant C2 > 0 that depends only on K, such that, with probability at least , Note that f is strictly log-concave if and only if (Boyd and Vandenberghe, 2004). Henceforth, γ > 0 and L > 0, which ensures the validity of the bound in (6). In fact, the proof of Theorem 1 (see Section 7) shows that the statistically optimal rate holds, not only for the MLE , but also for any estimators in the level set . To compare our upper bound to existing results in literature, we consider a special setting where the dimensions are the same in all modes; i.e., d1 = ⋯ = d = d. In such a case, our bound (6) reduces to for a fixed rank R and a fixed signal bound α. The MLE thus achieves consistency with polynomial convergence rate. Our bound has a faster convergence rate than that in 1-bit tensor recovery (Ghadermarzy et al., 2018), The rate improvement comes from the fact that we impose an exact low-rank structure on Θ, whereas Ghadermarzy et al. (2018) employed the max norm as a surrogate rank measure. Our bound also generalizes the previous results on low-rank binary matrix completion. The convergence rate for rank-constrained matrix completion is (Bhaskar and Javanmard, 2015), which fits into our special case when K = 2. Intuitively, in the tensor data analysis problem, we can view each tensor entry as a data point, and sample size is the total number of entries. A higher tensor order has a larger number of data points and thus exhibits a faster convergence rate as d → ∞. We compare the results (7) to the scenario if we apply the matrix version of binary decomposition to a tensor by unfolding the tensor into a matrix. The “best” matricization solution that unfolds a tensor into a near-square matrix (Mu et al., 2014) gives a convergence rate , with ⌊K/2⌋ being the integer part of K/2. The gap between the rates highlights the importance of decomposition that specifically takes advantage of the multimode structure in tensors. As an immediate corollary of Theorem 1, we obtain the explicit form of the upper bound (6) when the link f is a logistic, probit, or Laplacian function. Corollary 1. Assume the same setup as in Theorem 1. There exists an absolute constant C′ > 0 such that with probability at least , where C(α, σ) is a scaler factor, and C1, C2, C3 > 0 are constants that depend only on K. The dependency of the above error bounds on the signal bound α and the noise level σ will be discussed in Section 3.3.

Information-theoretical Lower Bound

We next establish two lower bounds. The first lower bound is for all statistical estimators , including but not limited to the estimator in (5), under the binary tensor model (2). The result is based on the information theory and is thus algorithm-independent. We show that this lower bound nearly matches the upper bound on the estimation accuracy of , thereby implying the rate optimality of . With a little abuse of notation, we use to denote the set of tensors with the rank bounded by R and the maximum norm bounded by α. The next theorem establishes this first lower bound for all estimators in under the model (2). Theorem 2 (Minimax lower bound for binary tensors). Suppose ≤ min d and the dimension max d ≥ 8. Let 0 ∈ (0, 1) and c0 > 0, such that Here we only present the result for the probit model, while similar results can be obtained for the logistic and Laplacian models. In this theorem, we assume that R ≤ min d. This condition is automatically satisfied in the matrix case, since the rank of a matrix is always bounded by its row and column dimension. For the tensor case, this assertion may not always hold. However, in the most applications, the tensor rank is arguably smaller than its dimension. We view this as a mild condition. Note that the earlier Theorem 1 places no constraint on the rank R. In Section 5, we will assess the empirical performance when the rank exceeds dimension. We next compare the lower bound (9) to the upper bound (8), as the tensor dimension d → ∞ while the signal bound α and the noise level σ are fixed. Since , both the bounds are of the form , where C is a factor that does not depend on the tensor dimension. Henceforth, our estimator is rate-optimal. The second lower bound is for all estimators based on the “unquantized” observation , which enables the evaluation of information loss due to binary quantization . Recall that Section 2.2 introduces a latent variable view of binary tensor model as an entrywise quantization of a noisy continuous-valued tensor. We seek an estimator by “denoising” the continuous-valued observation . The lower bound is obtained via an information-theoretical argument and is again applicable to all estimators . Theorem 3 (Minimax lower bound for continuous-valued tensors). Suppose ≤ min d and max d ≥ 8. Let 0 ∈ (0, 1) and c0 > 0 such that This lower bound (10) quantifies the statistical hardness of the tensor estimation problem. In the next section, we compare the information loss of tensor estimation, based on the data with quantization, , vs. the data without quantization, .

Phase Diagram

The error bounds we have established depend on the signal bound α and the noise level σ. In this section, we define three regimes based on the signal-to-noise ratio (SNR) = ‖Θ‖max/σ, in which the tensor estimation exhibits different behaviors. Table 1 and Figure 1 summarize the error bounds of the three phrases under the case when d1 = ⋯ = d = d. Our discussion focuses on the probit model, but similar patterns also hold for the logistic and Laplacian models.

Figure 1:

Phase diagram according to the SNR. (A) “Noise helps” region: the estimation error decreases with the noise. (B) “Noise hurts” region: the error increases with the noise. (C) Impossible region: a consistent estimator of Θ is impossible. The dashed line between regions (B) and (C) depicts the boundary d−( as K varies. Note that the origin in the x-axis corresponds to the high-dimensional region, d−( → 0, which is of our main interest.

The first phase is when the noise is weak, in that σ ≪ α equivalently . In this regime, the error bound in (8) scales as σ exp(α2/σ2), suggesting that increasing the noise level would lead to an improved tensor estimation accuracy. This “noise helps” region may seem surprising; however it is not an artifact of our proof. It turns out this phenomena is intrinsic to 1-bit quantization, and we confirm this behavior in simulations in Section 5. As the noise level σ goes to zero, the problem essentially reverts to the noiseless case where an accurate estimation of Θ becomes impossible. To see this, we consider a simple example with a rank-1 signal tensor in the latent model (4) in the absence of noise. Two different coefficient tensors, Θ1 = 1⊗2⊗3 and Θ2 = sign(1)⊗sign(2)⊗sign(3), would lead to the same observation , and thus recovery of Θ from becomes hopeless. Interestingly, adding a stochastic noise to the signal tensor prior to 1-bit quantization completely changes the nature of the problem, and an efficient estimator can be obtained through the likelihood approach. In the 1-bit matrix/tensor completion literature, this phenomenon is referred to as “dithering” effect of random noise (Davenport et al., 2014). The second phase is when the noise is comparable to the signal, in that . In this regime, the error bound in (8) scales linearly with σ. We find that the lower bound (10) from the unquantized tensor matches with the upper bound (8) from a quantized one. This suggests that 1-bit quantization induces very little loss of information towards the estimation of Θ. In other words, , which is based on the quantized observation, can achieve the similar degree of accuracy as if the completely unquantized measurements were observed. The third phase is when the noise completely dominates the signal, in that . A consistent estimation of Θ becomes impossible. In this regime, a trivial zero estimator achieves the minimax rate.

Algorithm and Convergence Properties

Alternating Optimization Algorithm

In this section, we introduce an algorithm to solve (5) and study the algorithmic convergence. For notational convenience, we drop the subscript in and simply write . The optimization (5) is a non-convex problem in Θ due to the non-convexity in the feasible set . We use the CP representation of Θ in (3) and turn the optimization into a block-wise convex problem. Algorithm 1 summarizes the full optimization procedure, and we discuss the individual steps in the next paragraph.

Algorithm 1

Binary tensor decomposition

Input: Binary tensor Y∈{0,1}d1×⋯×dK, link function f, rank R, and entrywise bound α.

Output: Rank-R coefficient tensor Θ, along with the factor matrices A = (A₁,…,A_k).

1: Initialize random matrices A(0)={A1(0),…,AK(0)} and iteration index t = 0.

2: while the relative increase in objective function L(A) is less than the tolerance do

3: Update iteration index t ← t + 1.

4: for k = 1 to K do

5: Obtain Ak(t+1) by solving d_k separate GLMs with link function f.

6: end for

7: Line search to obtain γ*.

8: Update Ak(t+1)←γ*Ak(t)+(1−γ*)Ak(t+1), for all k ∈ [K].

9: Normalize the columns of Ak(t+1) to be of unit-norm for all k ≤ K − 1, and absorb the scales into the columns of Ak(t+1).

10: end while

Specifically, write the mode-k factor matrices from (3) as where, without loss of generality, we choose to collect λ’s into the last factor matrix. Let = (1, … ,) denote the collection of all block variables satisfying the above convention. Then the optimization problem (5) is equivalent to Although the objective function in (12) is in general not concave in the K factor matrices jointly, the problem is concave in each factor matrix individually with all other factor matrices fixed. This feature enables a block relaxation type minimization, where we alternatively update one factor matrix at a time while keeping the others fixed. In each iteration, the update of each factor matrix involves solving a number of separate GLMs. To see this, let denote the kth factor matrix at the tth iteration, and Let denote the subtensor of at the jth position of the kth mode. Then the update can be obtained row-by-row by solving d separate GLMs, where each GLM takes as the “response”, as the “predictors”, and the jth row of as the “regression coefficient”, for all j ∈ [d],k ∈ [K]. In each GLM, the effective number of predictors is R, and the effective sample size is . These separable, low-dimensional GLMs allow us to leverage the fast GLM solvers as well as parallel processing to speed up the computation. After each iteration, we post-process the factor matrices by performing a line search, We then update and normalize the columns of . Binary tensor decomposition In practice, we run the algorithm from multiple initializations to locate a final estimate with the highest objective value.

Algorithmic Properties

We study the convergence of Algorithm 1. The convergence of the objective function is guaranteed whenever the is bounded from above, due to the monotonic nature of over iterations. We next study the convergence of the iterates ( and Θ( = Θ{(}. To simplify the analysis, we assume the optimization path is in the interior of the search domain {Θ: ‖Θ‖max ≤ α}. We drop the dependence of α for technical convenience, but all the results should be interpreted with this assumption imposed. In practice, α can be adjusted via probing the MLE frontier (Sur and Candès, 2019). One may start with a reasonably large α and check whether MLE is in the interior of the search domain. If perfect separation occurs, one may want to reduce α to a smaller value in order to control the estimation error. We refer to Sur and Candès (2019) for more discussions on adjusting α via probing the MLE frontier. We need the following assumptions for algorithmic convergence. for ′, ′′ sufficiently close to *. Here ′, ′′ represent the block variables subject to convention (11). (Regularity condition) The log-likelihood is continuous and the set is compact. (Strictly local maximum condition) Each block update in Algorithm 1 is well-defined; i.e., the GLM solution exists and is unique, and the corresponding sub-block in the Hession matrix is non-singular at the solution. (Local uniqueness condition) The set of stationary points of are isolated module scaling. (Local Lipschitz condition) Let * be a local maximizer of . The rank-R CP representation Θ = Θ() is locally Lipschitz at A*; namely, there exist two constants c1, c2 > 0 such that These conditions are mild and often imposed in the literature. Specifically, Assumption (A1) ensures the upper boundedness of log-likelihood and the existence of global optimum. Therefore, the stopping rule of Algorithm 1 is well defined. Assumption (A2) asserts the negative-definiteness of the Hessian in the block coordinate . Note that the full Hession needs not to be negative-definite in all variables simultaneously. We consider this requirement as a reasonable assumption, as similar conditions have been imposed in various non-convex problems (Uschmajew, 2012; Zhou et al., 2013). Assumptions (A2)–(A4) guarantee the local uniqueness of the CP representation Θ = Θ(). The conditions exclude the case of rank-degeneracy; e.g., the case when the tensor Θ can be written in fewer than R factors, or when the columns of are linearly dependent in the GLM update. We comment that the local uniqueness condition is fairly mild for tensors of order three or higher. This property reflects the fundamental difference between tensor and matrix decomposition, in that the same property often fails for the matrix case. Consider an example of a 2-by-2 matrix. Suppose that the local maximizer is , where e1, e2 are canonical vectors in . The variable * = (e1, e2) is a nonattracting point for the matrix problem. Indeed, one can construct a point (0) = (1, 2), with 1 = (sinθ, cosθ)′, and 2 = (cosθ, −sinθ)′. The point (0) can be made arbitrarily close to * by tuning θ, but the algorithm iterates initialized from (0) would never converge to . In contract, a 2-by-2-by-2 tensor problem with the maximizer possesses locally unique decomposition. For more discussion on decomposition uniqueness and its implication in the optimization, we refer to Kruskal (1977); Uschmajew (2012); Zhou et al. (2013). Proposition 1 (Algorithmic convergence). Suppose Assumptions (A1)-(A3) hold. (Global convergence) Every sequence generated by Algorithm 1 converges to a stationary point of . (Locally linear convergence) Let * be a local maximizer of , such that, for any staring point (0) in this neighborhood, the iterates (t) of Algorithm 1 linearly converge to , where ρ ∈ (0, 1) is a contraction parameter. Furthermore, if Assumption (A4) holds at , then there exists a constant C > 0 such that Proposition 1(ii) shows that every local maximizer of is an attractor of Algorithm 1. This property ensures an exponential decay of the estimation error near a local maximum. Combining Proposition 1 and Theorem 1, we have the following theorem. Theorem 4 (Empirical performance). Let (2) with parameter Θ = Θ(). Let (t) denote a sequence of estimators generated from . Suppose * is a local maximizer satisfying that 0 ≥ 0, such that, for all t ≥ T0, where ρ ∈ (0, 1) is a contraction parameter, and C1, C2 > 0 are two constants. Theorem 4 provides the estimation error of the empirical estimator from our Algorithm 1 at each iteration. The bound (13) consists of two terms: the first term is the computational error, and the second is the statistical error. The computational error decays exponentially with the number of iterations, whereas the statistical error remains the same as t grows. The statistical error is unavoidable, as it reflects the statistical error due to estimation with noise; see also Theorem 2. For tensors with d1 = ⋯ = d = d, the computational error is dominated by the statistical error when the iteration number satisfies

Missing Data, Rank Selection, and Computational Complexity

When some tensor entries are missing, we replace the objective function with , where Ω ⊂ [d1]×⋯×[d] is the index set for non-missing entries. The same strategy has been used for continuous-valued tensor decomposition (Acar et al., 2010). For implementation, we modify line 5 in Algorithm 1, by fitting GLMs to the data for which are observed. Other steps in Algorithm 1 are amendable to missing data accordingly. Our approach requires that there are no completely missing subtensors , which is a fairly mild condition. This requirement is similar to the coherence condition in the matrix completion problem; for instance, the recovery of true decomposition is impossible if an entire row or column of a matrix is missing. As a by-product, our tensor decomposition output can also be used for missing value prediction. That is, we predict the missing values using , where is the coefficient tensor estimated from the observed entries. Note that the predicted values are always between 0 and 1, which can be interpreted as a prediction for . For accuracy guarantees with missing data, we refer to Lee and Wang (2020) for detailed results. Algorithm 1 takes the rank of Θ as an input. Estimating an appropriate rank given the data is of practical importance. We adopt the usual Bayesian information criterion (BIC) and choose the rank that minimizes BIC; i.e., where is the estimated coefficient tensor under the working rank R, and p(R) is the effective number of parameters. This criterion aims to balance between the goodness-of-fit for the data and the degree of freedom in the population model. The empirical performance of BIC is investigated in Section 5. Finally, the computational complexity of our algorithm is for each iteration. The per-iteration computational cost scales linearly with the tensor dimension, and this complexity matches with the classical continuous-valued tensor decomposition (Kolda and Bader, 2009). More precisely, the update of involves solving d separate GLMs. Solving these GLMs requires , and therefore the cost for updating K factors in total is . We further report the computation time in Section 5.

Simulations

CP Tensor Model

In this section, we first investigate the finite-sample performance of our method when the data indeed follows the CP tensor model. We consider an order-3 dimension-(d,d,d) binary tensor generated from the threshold model (4), where , and the entries of are i.i.d. drawn from Uniform[−1, 1] for all k ∈ [3] and r ∈ [R]. Without loss of generality, we scale Θtrue such that kΘtruekmax = 1. The binary tensor is generated based on the entrywise quantization of the latent tensor , where consists of i.i.d. Gaussian entries. We vary the rank R ∈ {1, 3, 5}, the tensor dimension d ∈ {20,30, … ,60}, and the noise level σ ∈ {10−3,10−2.5, … ,100.5}. We use BIC to select the rank and report the estimation error based on logistic link averaged across nsim = 30 replications. Figure 2(a) plots the estimation error as a function of the tensor dimension d while holding the noise level fixed at σ = 10−0.5 for three different ranks R ∈ {1, 3, 5}. We find that the estimation error of the constrained MLE decreases as the dimension increases. Consistent with our theoretical results, the decay in the error appears to behave on the order of d−1. A higher-rank tensor tends to yield a larger recovery error, as reflected by the upward shift of the curves as R increases. Indeed, a higher rank means a higher intrinsic dimension of the problem, thus increasing the difficulty of the estimation.

Figure 2:

Estimation error of binary tensor decomposition. (a) Estimation error as a function of the tensor dimension d = d1 = d2 = d3. (b) Estimation error as a function of the noise level.

Figure 2(b) plots the estimation error as a function of the noise level σ while holding the dimension fixed at d = 50 for three different ranks R ∈ {1, 3, 5}. A larger estimation error is observed when the noise is either too small or too large. The non-monotonic behavior confirms the phase transition with respect to the SNR. Particularly, the random noise is seen to improve the recovery accuracy in the high SNR regime. This is consistent to our theoretical result on the “dithering” effects brought by stochastic noise. We next assess the tensor rank selection by BIC. We consider the tensor dimension d ∈ {20, 40, 60} and rank R ∈ {5, 10, 20, 40}. Note that, in some of the combinations, the rank equals or exceeds the tensor dimension. We set the noise level σ ∈ {0.1, 0.01} such that the noise is neither negligible nor overwhelming. For each combination, we simulate the tensor data following the Bernoulli tensor model (2). We minimize BIC using a grid search from R − 5 to R + 5. Table 2 reports the selected rank averaged over nsim = 30 replications, with the standard error shown in the parenthesis. We find that, when d = 20, the selected rank is slightly smaller than the true rank, whereas for d ≥ 40, the selection is accurate. This agrees with our expectation, as the total number of entries corresponds to the sample size in tensor decomposition. A larger d implies a larger sample size, so the BIC selection becomes more accurate.

Table 2:

Rank selection in binary tensor decomposition via BIC. The selected rank is averaged across 30 simulations, with the standard error shown in the parenthesis.

	σ = 0.1			σ = 0.01
True rank	d = 20	d = 40	d = 60	d = 20	d = 40	d = 60
R = 5	4.9 (0.2)	5 (0)	5 (0)	4.8 (1.0)	5 (0)	5 (0)
R = 10	8.7 (0.9)	10 (0)	10 (0)	8.8 (0.4)	10 (0)	10 (0)
R = 20	17.7(1.7)	20.4(0.5)	20.2(0.5)	16.4(0.5)	20.4(0.5)	20.6(0.5)
R = 40	36.8(1.1)	39.6(1.7)	40.2(0.4)	36.0(1.2)	38.8(1.6)	40.3(1.1)

We also evaluate the numerical stability of our optimization algorithm. Although Algorithm 1 has no theoretical guarantee to land at the global optimum, in practice, we often find that the convergence point is satisfactory, in that the corresponding objective value is close to and actually slightly larger than the objective function evaluated at the true parameter . As an illustration, Figure 3 shows the typical trajectories of the objective function under different tensor dimensions and ranks. The dashed line is the objective value at the true parameter, . We find that, upon random initializations, the algorithm lands at a good convergence point and converges quickly. It usually takes fewer than 8 iterations for the relative change in the objective to be below 3%, even for a large d and R. The average computation time per iteration is shown in the plot legend. For instance, when d = 60 and R = 10, each iteration of Algorithm 1 takes fewer than 3 seconds on average.

Figure 3:

Trajectory of the objective function over iterations with varying d and R.

Stochastic Multi-way Block Model

We next evaluate our method under the stochastic multi-way block model, which can be viewed as a higher-order generalization of the stochastic block model commonly used for random graphs, network analysis, and community detection. Under this model, the signal tensor does not have an explicit CP structure with known rank. Specifically, we generate of dimension d = d1 = d2 = d3, where we vary d ∈ {20, 30, 40, 50, 60}. The entries in are realizations of independent Bernoulli variables with a probability tensor Θ. The probability tensor Θ has five blocks along each of the modes, where 1, 2, 3 ∈ {0, 1} are membership matrices indicating the block allocation along each of the mode, × denotes the tensor-by-matrix multiplication (Kolda and Bader, 2009) for k ∈ [3], and is a core tensor corresponding to the block-means on a probit scale, and m1, m2, m3 ∈ {1, … ,5} are block indices. We generate the block means in the following ways: Combinatorial-mean model: ; i.e., each three-way block has its own mean, independent of each other. Additive-mean model: , where , and are i.i.d. drawn from Unif[−1, 1]. Multiplicative-mean model: , and the rest of setup is the same as the additive-mean model. We evaluate our method in terms of the accuracy of recovering the latent tensor Θ given the binary observations. Table 3 reports the relative loss, the estimated rank, and the running time, averaged over nsim = 30 data replications, for the above three sub-models. The relative loss is computed as . Our method is able to recover the signal tensors well in all three scenarios. As an illustration, we also plot one typical realization of the true signal tensor, the input binary tensor, and the recovered signal tensor for each sub-model in Table 3. It is interesting to see that, not only the block structure but also the tensor magnitude are well recovered. We remark that, the data has been generated from a probit model, but we always fit with a logistic link. Our method is shown to maintain a reasonable performance under this model misspecification.

Table 3:

Latent tensor recovery. Figures in the column of “Experiment” are color images of the simulated tensor under different block mean models. Reported are the relative loss, estimated rank, and running time, averaged over 30 data replications. Standard error is shown in the parenthesis.

Comparison with Alternative Methods

We next compare our method with a number of alternative solutions for binary tensor decomposition. Boolean tensor factorization (BooleanTF) (Miettinen, 2011; Erdos and Miettinen, 2013b; Rukat et al., 2018). This method decomposes a binary tensor into binary factors and then recovers the binary entries based on a set of logical rules among the factors. We use the implementation of Rukat et al. (2018). Bayesian tensor factorization (BTF Bayeisan) (Rai et al., 2014). This method uses expectation-maximization to decompose a binary tensor into continuous-valued factors. The algorithm imposes a Gaussian prior on the factor entries and a multiplicative gamma process prior on the factor weights {λ}. Bernoulli tensor factorization with gradient descent (BTF_Gradient) (Hong et al., 2020). This method uses a gradient descent algorithm to decompose a binary tensor into continuous-valued factors. We use the implementation in the toolbox of Matlab. For easy reference, we denote our method by BTF_Alternating[1]. These four methods differ in several ways. BooleanTF is different from the other three in both the cost function and the output format. The rest are all based on the Bernoulli model (2), but with different implementations. BTF_Bayesian employs a Bayesian approach, whereas the other two are frequentist solutions. BTF_Gradient and our method, BTF_Alternating, share the same model, but utilize different optimization algorithms. So the two methods complement each other. On the other hand, we provide not only the algorithm-specific convergence properties, but also algorithm-independent statistical properties including the statistical convergence rate, SNR phase diagram, and mini-max rate. These results are not available in the proposal of BTF_Gradient (Hong et al., 2020). We apply the four methods with default parameters, while selecting the rank R using the recommended approach of each. For our method BTF_Alternating, we use the proposed BIC to select the rank. Because BTF_Gradient does not provide any rank selection criterion, we apply the same R selected by our BIC. For BTF_Alternating, we set the hyper-parameter α to infinity, which essentially poses no prior on the tensor magnitude. Besides, because BTF_Bayesian only supports the logistic link, we use the logistic link in all three BTF methods. We evaluate each method by two metrics. The first metric is the root mean square error, , where denotes the estimated probability tensor. For BooleanTF, this quantity is represented as the posterior mean of (Miettinen, 2011), and for the other three methods, . The second metic is the misclassification error rate, . Here the indicator function is applied to tensors in an element-wise manner, and ‖·‖0 counts the number of non-zero entries in the tensor. These metrics reflect two aspects of the statistical error. RMSE summarizes the estimation error in the parameters, whereas MER summarizes the classification errors among 0’s and 1’s. We simulate data from two different models, and in both cases, the signal tensors do not necessarily follow an exact low-rank CP structure. Therefore, in addition to method comparison, it also allows us to evaluate the robustness of our method under potential model misspecification. The first model is a boolean (logical) tensor model following the setup in Rukat et al. (2018). We first simulate noiseless tensors from the following model, where the binary factor entries {a}, {b}, {c} are mutually independent with each other, the factor probabilities , , are generated i.i.d. from Beta(2,4), and ∨ and ∧ denote the logical OR and AND operations, respectively. Equivalently, the tensor entry is 1 if and only if there exists one or more components in which all corresponding factor entries are 1. It is easy to verify that We then add contamination noise to by flipping the tensor entries 0 ↔ 1 i.i.d. with probability 0.1. We consider the tensor dimension d1 = d2 = d3 = 50 and the boolean rank R ∈ {10, 15, 20, 25, 30}. Figure 4(a)–(b) shows the performance comparison based on nsim = 30 replications. We find that the three BTF methods outperform BooleanTF in RMSE. The results shows the advantage of a probabilistic model, upon which all three BTF methods are built. In contrast, BooleanTF seeks patterns in a specific data realization, but does not target for population estimation. For classification, BooleanTF performs reasonably well in distinguishing 0’s versus 1’s, which agrees with the data mining nature of BooleanTF. It is also interesting to see that MER peaks at R = 20. Further investigation reveals that this setting corresponds to the case when the Bernoulli probabilities concentrate around 0.5, which becomes particularly challenging for classification. Actually, the average Bernoulli probability for R =10, 15, 20, 25, 30 is 0.31, 0.44, 0.53, 0.61, 0.68, respectively. Figure 4(b) also shows that BTF_Alternating and BTF_Gradient achieve a smaller classification error than BTF_Bayesian. One possible explanation is that the normal prior in BTF Bayesian has a poor distinguishing power around θ ≈ 0, which corresponds to the hardest case when Bernoulli probability ≈ 0.5.

Figure 4:

Performance comparison in terms of root mean squared error and misclassification error rate. (a)-(b) Estimation errors under the boolean tensor model. (c)-(d) Estimation errors under the stochastic multiway block model. Error bars represent one standard error around the mean.

The second model is the stochastic multi-way block model considered in Section 5.2, with the block means generated from the combinatorial-mean sub-model. Figure 4(c)-(d) shows the performance comparison, and a similar pattern is observed. The two frequentist-type BTF methods, BTF_Gradient and BTF_Alternating, behave numerically similarly, and they outperform the other alternatives. In particular, the BTF methods exhibit decaying estimation errors, whereas BooleanTF appears to flatten out as dimension grows. This observation suggests that, compared to the algorithmic error, the statistical error is likely more dominating in this setting.

Data Applications

We next illustrate the applicability of our binary tensor decomposition method on a number of data sets, with applications ranging from social networks, email communication networks, to brain structural connectivities. We consider two tasks: one is tensor completion, and the other is clustering along one of the tensor modes. The data sets include: Kinship (Nickel et al., 2011): This is a 104 × 104 × 26 binary tensor consisting of 26 types of relations among a set of 104 individuals in Australian Alyawarra tribe. The data was first collected by Denham and White (2005) to study the kinship system in the Alyawarra language. The tensor entry is 1 if individual i used the kinship term k to refer to individual j, and 0 otherwise. Nations (Nickel et al., 2011): This is a 14 × 14 × 56 binary tensor consisting of 56 political relations of 14 countries between 1950 and 1965. The tensor entry indicates the presence or absence of a political action, such as “treaties”, “sends tourists to”, between the nations. We note that the relationship between a nation and itself is not well defined, so we exclude the diagonal elements from the analysis. Enron (Zhe et al., 2016): This is a 581 × 124 × 48 binary tensor consisting of the three-way relationship, (sender, receiver, time), from the Enron email data set. The Enron data is a large collection of emails from Enron employees that covers a period of 3.5 years. Following Zhe et al. (2016), we take a subset of the Enron data and organize it into a binary tensor, with entry indicating the presence of emails from a sender i to a receiver j at a time period k. HCP (Wang et al., 2019): This is a 68 × 68 × 212 binary tensor consisting of structural connectivity patterns among 68 brain regions for 212 individuals from Human Connectome Project (HCP). All the individual images were preprocessed following a standard pipeline (Zhang et al., 2018), and the brain was parcellated to 68 regions-of-interest following the Desikan atlas (Desikan et al., 2006). The tensor entries encode the presence or absence of fiber connections between those 68 brain regions for each of the 212 individuals. The first task is binary tensor completion, where we apply tensor decomposition to predict the missing entries in the tensor. We compare our binary tensor decomposition method using a logistic link function with the classical continuous-valued tensor decomposition. Specifically, we split the tensor entries into 80% training set and 20% testing set, while ensuring that the nonzero entries are split the same way between the training and testing data. The entries in the testing data are masked as missing, and we predict them based on the tensor decomposition from the training data. The training-testing split is repeated five times, and we report the average area under the receiver operating characteristic curve (AUC) and RMSE across five splits in Table 4. It is clearly seen that the binary tensor decomposition substantially outperforms the classical continuous-valued tensor decomposition. In all data sets, the former obtains a much higher AUC and mostly a lower RMSE. We also report in Table 4 the percentage of nonzero entries for each data. We find that our decomposition method performs well even in the sparse setting. For instance, for the Enron data set, only 0.01% of the entries are non-zero. The classical decomposition almost blindly assigns 0 to all the hold-out testing entires, resulting in a poor AUC of 79.6%. By comparison, our binary tensor decomposition achieves a much higher classification accuracy, with AUC = 94.3%.

Table 4:

Tensor completion for the four binary tensor data sets using two methods: the proposed binary tensor decomposition, and the classical continuous-valued tensor decomposition.

Data set	Non-zeros	Tensor decomposition method
Data set	Non-zeros	Binary (logistic link)		Continuous-valued
		AUC	RMSE	AUC	RMSE
Kinship	3.80%	0.9708	1.2 × 10⁻⁴	0.9436	1.4 × 10⁻³
Nations	21.1%	0.9169	1.1 × 10⁻²	0.8619	2.2 × 10⁻²
Enron	0.01%	0.9432	6.4 × 10⁻³	0.7956	6.3 × 10⁻⁵
HCP	35.3%	0.9860	1.3 × 10⁻³	0.9314	1.4 × 10⁻²

The second task is clustering. We perform the clustering analyses on two data sets, Nations and HCP. For the Nations data set, we utilize a two-step procedure by first applying the proposed binary tensor decomposition method with the logistic link, then applying the K-means clustering along the country mode from the decomposition. In the first step, the BIC criterion suggests R = 9 factors, and in the second step, the classical elbow method selects 5 clusters out of the 9 components. Figure 5(a) plots the 9 tensor factors along the country mode. It is interesting to observe that the countries are partitioned into one group containing those from the communist bloc, two groups from the western bloc, two groups from the neutral bloc, and Brazil forming its own group. We also plot the top four relation types based on their loadings in the tensor factors along the relationship mode in Figure 5(b). The partition of the countries is consistent with their relationship patterns in the adjacency matrices. Indeed, those countries belonging to the same group tend to have similar linking patterns with other countries, as reflected by the block structure in Figure 5(b).

Figure 5:

Analysis of the Nations data set. (a) Top nine tensor components in the country mode from the binary tensor decomposition. The overlaid box depicts the results from the K-means clustering. (b) Relation types with large loadings. Top four relationships identified from the top tensor components are plotted.

We also perform the clustering analysis on the data set HCP. We apply the decomposition method with the logistic link and BIC-selected rank R = 6. Figure 6 plots the heatmap for the top 6 tensor components across the 68 brain regions, and Figure 7 shows the edges with high loadings based on the tensor components. Edges are overlaid on the brain template BrainMesh ICBM152 (Xia et al., 2013), and nodes are color coded based on their regions. We see that the brain regions are spatially separated into several groups and that the nodes within each group are more densely connected with each other. Some interesting spatial patterns in the brain connectivity are observed. For instance, the edges captured by tensor component 2 are located within the cerebral hemisphere. The detected edges are association tracts consisting of the long association fibers, which connect different lobes of a hemisphere, and the short association fibers, which connect different gyri within a single lobe. In contrast, the edges captured by tensor component 3 are located across the two hemispheres. Among the nodes with high connection intensity, we identify superior frontal gyrus, which is known to be involved in self-awareness and sensory system (Goldberg et al., 2006). We also identify corpus callosum, which is the largest commissural tract in the brain that connects two hemispheres. This is consistent with brain anatomy that suggests the key role of corpus callosum in facilitating interhemispheric connectivity (Roland et al., 2017). Moreover, the edges shown in tensor component 4 are mostly located within the frontal lobe, whereas the edges in component 5 connect the frontal lobe with parietal lobe.

Figure 6:

Heatmap for binary tensor components across brain regions in the HCP analysis. The connection matrix = λ ⊗ is plotted for component r ∈ [6].

Figure 7:

Edges with high loadings in the HCP analysis. The top 10% edges with positive loadings (i, j) are plotted, for r ∈ [6] and (i, j) ∈ [68]2. The width of the edge is proportional to the magnitude of (i, j).

Proofs

Proof of Theorem 1

Proof. It follows from the expression of that Define where is the collection of the score functions evaluated at Θtrue, and is the collection of the Hession functions evaluated at Θtrue. We organize the entries in and treat as an order-K dimension-(d1, … ,d) tensor. Similarly, we organize the entries in and treat as a matrix. By the second-order Taylor’s theorem, we expand around Θtrue and obtain where for some γ ∈ [0, 1], and denotes the Hession matrix evaluated at . We first bound the linear term in (14). Note that, by Lemma 1, Define It follows from model (2) and the expression for L that is a random tensor whose entries are independently distributed and satisfy By Lemma 6, with probability at least , we have where C1, C2 are two positive constants. Furthermore, note that rank(Θ) ≤ R, rank(Θtrue) ≤ R, so rank(Θ−Θtrue) ≤ 2R. By Lemma 2, . Combining (15), (16) and (17), we have that, with probability at least , where the constant C2 absorbs all factors that depend only on K. We next bound the quadratic term in (14). Notice that where the second line comes from the fact that and the definition of γ. Combining (14), (18) and (19), we have that, for all , with probability at least , In particular, the above inequality also holds for . Therefore, Since , , which gives Henceforth, Remark 1. Based on the proof of Theorem 1, we can relax the global optimum assumption on the estimator . The same convergence rate holds in the level set .

Proof of Theorem 2

Proof. Without loss of generality, we assume d1 = dmax, and denote by . Let γ ∈ [0, 1] be a constant to be specified later. Our strategy is to construct a finite set of tensors satisfying the properties of (i)-(iv) in Lemma 8. By Lemma 8, such a subset of tensors exist. For any given tensor , let denote the distribution of , where is the observed binary tensor. In particular, is the distribution of induced by the zero parameter tensor 0; i.e., the distribution of conditional on the coefficient tensor Θ = 0. Then conditioning on , the entries of are independent Bernoulli random variables. In addition, we note that (c.f. Lemma 3), where σ is the scale parameter. Therefore, under these link functions, the KL divergence between and satisfies where the first inequality comes from (20), and the second inequality comes from property (iii) of . From (21) and the property (i), we conclude that the inequality is satisfied for any ε ≥ 0, whenγ ∈ [0, 1] is chosen to be sufficiently small depending on ε, e.g., . By applying Tsybakov (2009, Theorem 2.5) to (22), and in view of the property (iv), we obtain that Note that and . By taking ε = 1/10 and γ = 1/11, we conclude from (23) that which is ≥ 1/8.

Proof of Theorem 3

Proof. The argument is similar as that in the proof of Theorem 2. Specifically, we construct a set of tensors such that, for all , Θ satisfies the properties (i) to (iv) of Lemma 8. Given a continuous-valued tensor , let denote the distribution of according to the Gaussian model; that is, . N(0, σ2). Note that, for the Gaussian distribution, So the condition is satisfied for any ε ≥ 0 when γ ∈ [0, 1] is chosen to be sufficiently small depending on ε. In view of the property (iv) and (24), the conclusion follows readily from the application of Tsybakov (2009, Theorem 2.5).

Proof of Proposition 1

Proof. The proof of the global convergence is similar to that of Zhou et al. (2013, Proposition 1). We present the main ideas here for completeness. By Assumption (A2), the block update is well-defined and differentiable. The isolation of stationary points ensures that there are only finite number of stationary points. It suffices to show that every sub-sequence of ( convergences to a same limiting point. Let be one subsequence with limiting point . We aim to show that * is the only limiting point for all possible subsequences in (. As the algorithm monotonically increases the objective value, the limiting point is a stationary point of . Now take the set of all limiting points, which is contained in the set , and is thus compact due to (A1). The compactness of the set of limiting points implies that the set is also connected (Lange, 2010, Propositions 8.2.1 and 15.4.2). Note that a connected subset of the finite stationery points is a single point. Henceforth, every subsequence of ( convergences to a stationary point of . The local convergence follows from Uschmajew (2012, Theorem 3.3) and Zhou et al. (2013, Proposition 1). Here we elaborate on the contraction parameter ρ ∈ (0,1) in our context. Let denote the Hession matrix of the log-likelihood at the local maximum . We partition the Hession into = + + , where is the strictly block lower triangular part and is the block diagonal part. By Assumption (A2), each sub-block of the Hession is negative definite, so the diagonal entries of are strictly negative. This ensures that the block lower triangular matrix + is invertible. The differential of the iteration map can be shown as (Bezdek and Hathaway, 2003, Lemma 2). Therefore ρ = max|λ{(+)−1}| ∈ (0, 1), where λ{·} denotes the i-th singular value of the matrix. By the contraction principle, for (0) sufficiently close to . Because Θ = Θ() is local Lipschitz at with constants c1, c2 > 0, we have for all sufficiently large . Therefore where C > 0 is a constant.

Proof of Theorem 4

Proof. Based on Remark 1 after Theorem 1, we have Meanwhile, Proposition 1 implies that, there exists an iteration number T0 ≥ 0, such that holds for all t ≥ T0. Combining the above two results yields for all t ≥ T0.

Conclusions

Many data tensors consist of binary observations. This article presents a general method and the associated theory for binary tensor decomposition. We have shown that the unknown parameter tensor can be accurately and efficiently recovered under suitable assumptions. When the maximum norm of the unknown tensor is bounded by a constant, our error bound is tight up to a constant and matches with the best possible error bound for the unquantized observations. We comment on a number of possible extensions. Our method leverages on the alternating updating algorithm for the optimization. Although a non-convex optimization procedure such as Algorithm 1 has no guarantee on global optimality, our numerical experiments have suggested that, upon random initializations, the convergence point is often satisfactory, in that the corresponding objective value is close to the objective value . We have shown in Theorem 1 that the same statistically optimal convergence rate holds, not only for the MLE, but also for every local maximizer with sufficiently large objective values. When starting from random initializations, there could be multiple estimates, whose objective values are all greater than . In theory, any of those choices perform equally well in estimating Θtrue. In this sense, local optimality is not necessarily a severe concern in our context. On the other hand, characterizing global optimality for non-convex optimization problem of this type is itself of great interest. There has been recent progress investigating the landscape of non-convex optimization involving tensors (Anandkumar et al., 2014; Richard and Montanari, 2014; Ge and Ma, 2017). The problem is challenging, as the geometry can depend on multiple factors including the tensor rank, dimension, and factorization form. In some special cases such as rank-1 or orthogonally decomposable tensors, one may further obtain the required asymptotical number of initializations, however, at the cost of more stringent assumptions on the target tensor (Anandkumar et al., 2014; Richard and Montanari, 2014). We leave the pursuit of optimization landscape as future research. For the theory, we assume the true rank R is known, whereas for the application, we propose to estimate the rank using BIC given the data. It remains an open and challenging question to establish the convergence rate of the estimated rank (Zhou et al., 2013). We leave a full theoretical investigation of the rank selection consistency and the decomposition error bound under the estimated rank as future research. Finally, although we have concentrated on the Bernoulli distribution in this article, we may consider extensions to other exponential-family distributions, for example, count-valued tensors, multinomial-valued tensors, or tensors with mixed types of entries. Moreover, our proposed method can be thought of as a building block for more specialized tasks such as exploratory data analysis, tensor completion, compressed object representation, and network link prediction. Exploiting the benefits and properties of binary tensor decomposition in each specialized task warrants future research.

11 in total