Kaiqi Zhang, Cole Hawkins, Zheng Zhang.
Abstract
A major challenge in many machine learning tasks is that a model's expressive power depends on its size. Low-rank tensor methods are an efficient tool for handling the curse of dimensionality in many large-scale machine learning models. The major challenges in training a tensor learning model include how to process high-volume data, how to determine the tensor rank automatically, and how to estimate the uncertainty of the results. While existing tensor learning methods focus on a specific task, this paper proposes a generic Bayesian framework that can be employed to solve a broad class of tensor learning problems, such as tensor completion, tensor regression, and tensorized neural networks. We develop a low-rank tensor prior for automatic rank determination in nonlinear problems. Our method is implemented with both stochastic gradient Hamiltonian Monte Carlo (SGHMC) and Stein variational gradient descent (SVGD), and we compare the automatic rank determination and uncertainty quantification of these two solvers. We demonstrate that the proposed method can determine the tensor rank automatically and can quantify the uncertainty of the obtained results. We validate our framework on tensor completion tasks and tensorized neural network training tasks.
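As a concrete illustration of one of the two solvers named above, here is a minimal, self-contained Stein variational gradient descent loop on a toy 1-D Gaussian target. The toy target, the RBF kernel with median-heuristic bandwidth, and the step size are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def rbf_kernel(x):
    # RBF kernel with the median-heuristic bandwidth commonly used for SVGD
    diffs = x[:, None] - x[None, :]            # pairwise differences, shape (n, n)
    sq = diffs ** 2
    h = np.median(sq) / np.log(len(x) + 1) + 1e-8
    k = np.exp(-sq / h)                        # k(x_j, x_i)
    grad_k = -2.0 * diffs / h * k              # d k(x_j, x_i) / d x_j
    return k, grad_k

def svgd(x, grad_logp, step=0.1, iters=1000):
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    x = x.copy()
    n = len(x)
    for _ in range(iters):
        k, grad_k = rbf_kernel(x)
        phi = (k * grad_logp(x)[:, None]).sum(axis=0) / n + grad_k.sum(axis=0) / n
        x += step * phi
    return x

rng = np.random.default_rng(0)
particles = rng.normal(-3.0, 0.5, size=50)     # start far from the target mode
out = svgd(particles, lambda x: -(x - 2.0))    # target density: N(2, 1)
```

The first term of `phi` pulls particles toward high-density regions; the kernel-gradient term is the repulsive force that keeps the particle set spread out, so the final particles approximate the target rather than collapsing to its mode.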
Keywords: Bayesian inference; deep learning; tensor decomposition; tensor learning; uncertainty quantification
Year: 2022 PMID: 35072057 PMCID: PMC8777296 DOI: 10.3389/frai.2021.668353
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Equations to calculate the potential energy U(Θ) for different tensor learning tasks and with different low-rank tensor formats.
| Task | Likelihood | Format 1 | Format 2 | Format 3 |
|---|---|---|---|---|
| Tensor completion | Gaussian | (12) + (14) + (10) | (31) + (14) + (10) | (20) + (14) + (10) |
| Neural network classification | Multinomial | (12) + (24) | (31) + (24) | (20) + (24) |
| Neural network regression | Gaussian | (12) + (26) | (31) + (26) | (20) + (26) |
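The equation numbers in the table refer to the paper's full text, which is not reproduced in this record. As a rough illustration of what a potential energy U(Θ) = −log p(D | Θ) − log p(Θ) looks like for Gaussian-likelihood tensor completion, the sketch below uses a CP factorization with a plain Gaussian prior on the factors; the paper's actual low-rank prior and format-specific terms differ:

```python
import numpy as np

def cp_reconstruct(factors):
    # rank-R CP model of a 3-way tensor: T[i, j, k] = sum_r A[i, r] B[j, r] C[k, r]
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def potential_energy(factors, data, mask, noise_var=0.01, prior_var=1.0):
    # U(Theta) = -log p(D | Theta) - log p(Theta), dropping additive constants:
    # Gaussian likelihood on the observed entries, Gaussian prior on each factor
    resid = mask * (data - cp_reconstruct(factors))
    neg_loglik = 0.5 * np.sum(resid ** 2) / noise_var
    neg_logprior = sum(0.5 * np.sum(F ** 2) / prior_var for F in factors)
    return neg_loglik + neg_logprior

rng = np.random.default_rng(1)
true = tuple(rng.normal(size=(n, 3)) for n in (6, 7, 8))
data = cp_reconstruct(true)                           # noise-free rank-3 tensor
mask = (rng.random(data.shape) < 0.3).astype(float)   # ~30% of entries observed
u_true = potential_energy(true, data, mask)
u_rand = potential_energy(tuple(rng.normal(size=F.shape) for F in true), data, mask)
```

Because the true factors reproduce the observed entries exactly, `u_true` reduces to the prior term alone and is far smaller than `u_rand`; samplers such as SGHMC explore the low-U region of this landscape.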
Numerical results of tensor completion for the synthetic experiment and MRI dataset.
| Data | Noise | Rank | Error 1 | Error 2 | Rank | Error 1 | Error 2 | Rank | Error 1 | Error 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Uniform rank-5 | 0.001 | 5 | 0.0013 | 0.0047 | 5 | 0.0019 | 0.0011 | 1 | 0.1476 | 0.1517 |
| | 0.003 | 5 | 0.0038 | 0.0040 | 5 | 0.0031 | 0.0016 | 1 | 0.1499 | 0.1607 |
| | 0.01 | 5 | 0.0128 | 0.0118 | 5 | 0.0114 | 0.0098 | 1 | 0.1386 | 0.1365 |
| | 0.03 | 5 | 0.0403 | 0.0318 | 5 | 0.0512 | 0.0071 | 1 | 0.1468 | 0.1523 |
| Gaussian rank-5 | 0.001 | 5 | 0.0013 | 0.0025 | 5 | 0.0019 | 0.0007 | 6 | 0.0005 | 0.0011 |
| | 0.003 | 5 | 0.0038 | 0.0031 | 5 | 0.0027 | 0.0019 | 6 | 0.0033 | 0.0033 |
| | 0.01 | 5 | 0.0130 | 0.0102 | 5 | 0.0119 | 0.0069 | 5 | 0.0106 | 0.0110 |
| | 0.03 | 5 | 0.0418 | 0.0236 | 5 | 0.0336 | 0.0193 | 7 | 0.0338 | 0.0354 |
| MRI dataset | | 65 | 0.0856 | 0.0670 | 65 | 0.0727 | 0.0319 | 17 | 0.1495 | 0.1456 |
Results of different networks on two datasets.
| Dataset | Network | # Params (compression) | LL | Accuracy | LL | Accuracy |
|---|---|---|---|---|---|---|
| Fashion-MNIST | NN | 3.97 × 10⁵ (1×) | −0.7118 | 88.91% | −0.6730 | 89.41% |
| | TT-NN | 2.63 × 10⁴ (15.1×) | −0.6687 | 87.07% | −0.6337 | 87.78% |
| | HMC-BF-TT-NN | 4.02 × 10³ (98.8×) | −0.3317 | 88.24% | −0.3254 | 88.64% |
| | SVGD-BF-TT-NN | 2.8 × 10⁴ (14.1×) | −0.3317 | 88.24% | −0.3261 | 88.57% |
| | Tucker-NN | 2.57 × 10⁵ (1.54×) | −1.1673 | 87.20% | −1.0984 | 87.53% |
| | HMC-BF-Tucker-NN | 3.10 × 10⁴ (12.8×) | −1.2948 | 87.18% | −0.4405 | 88.18% |
| | SVGD-BF-Tucker-NN | 3.10 × 10⁴ (12.8×) | −1.2948 | 87.18% | −0.4705 | 87.86% |
| CIFAR-10 | CNN | 9.91 × 10⁶ (1×) | −0.5337 | 91.54% | −0.5370 | 91.53% |
| | TT-CNN | 6.93 × 10⁵ (14.3×) | −0.6077 | 89.00% | −0.5329 | 90.13% |
| | HMC-BF-TT-CNN | 7.83 × 10⁴ (127×) | −0.3936 | 86.68% | −0.3623 | 88.01% |
| | SVGD-BF-TT-CNN | 7.83 × 10⁴ (127×) | −0.3936 | 86.68% | −0.3419 | 88.41% |
LL, predictive log-likelihood (larger is better); TT, tensor-train decomposition; Tucker, Tucker decomposition; BF, Bayesian low-rank prior.
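The compression ratios in the table come from replacing dense weight matrices with low-rank tensor factorizations. A back-of-the-envelope parameter count for a TT-factorized fully connected layer can be sketched as follows; the layer shape and TT ranks here are hypothetical, not the paper's:

```python
def tt_fc_params(in_modes, out_modes, ranks):
    # A TT-factorized fully connected layer replaces a prod(in) x prod(out)
    # weight matrix with one 4-way core per mode; core g has shape
    # (ranks[g], in_modes[g], out_modes[g], ranks[g + 1]), with boundary ranks 1.
    assert len(in_modes) == len(out_modes) == len(ranks) - 1
    assert ranks[0] == ranks[-1] == 1
    return sum(ranks[g] * in_modes[g] * out_modes[g] * ranks[g + 1]
               for g in range(len(in_modes)))

# hypothetical layer: 784 = 7*4*7*4 inputs, 625 = 5*5*5*5 outputs, TT ranks 8
dense = 784 * 625
tt = tt_fc_params([7, 4, 7, 4], [5, 5, 5, 5], [1, 8, 8, 8, 1])
```

For this hypothetical 784 × 625 layer the TT cores hold 3,960 parameters versus 490,000 dense weights, roughly 124× compression, which is the mechanism behind the (15.1×), (98.8×), etc. entries above; automatic rank determination shrinks the `ranks` vector further.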
Figure 1. The inferred tensor ranks at different layers. (A) 2 TT-FC layers for Fashion-MNIST. (B) 2 Tucker-FC layers for Fashion-MNIST. (C) 4 TT-Conv and 2 TT-FC layers for CIFAR-10. Legend: SGHMC with thermostats; SVGD.
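For reference, a minimal SGHMC step on a toy 1-D Gaussian target, omitting the thermostat variable and using a full (rather than stochastic) gradient, might look like this; all constants are illustrative assumptions:

```python
import numpy as np

def sghmc(grad_u, theta0, step=0.01, friction=0.1, iters=20000, seed=0):
    # SGHMC update (Chen et al., 2014 form, with the gradient-noise estimate
    # set to zero and no thermostat):
    #   v     <- v - step * grad U(theta) - friction * v + N(0, 2 * friction * step)
    #   theta <- theta + v
    rng = np.random.default_rng(seed)
    theta, v = float(theta0), 0.0
    noise_std = np.sqrt(2.0 * friction * step)
    samples = []
    for t in range(iters):
        v += -step * grad_u(theta) - friction * v + rng.normal(0.0, noise_std)
        theta += v
        if t >= iters // 2:              # keep only post-burn-in samples
            samples.append(theta)
    return np.array(samples)

# toy target N(2, 1): U(theta) = (theta - 2)^2 / 2, so grad U = theta - 2
samples = sghmc(lambda th: th - 2.0, theta0=-3.0)
```

The friction term and the matched injected noise keep the kinetic energy in balance; the thermostat variable mentioned in the legend replaces the fixed `friction` constant with an adaptively estimated one to compensate for unknown gradient noise.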