Stochastic Control for Bayesian Neural Network Training.

Ludwig Winkler¹, César Ojeda², Manfred Opper²,³.

Abstract

In this paper, we propose to leverage the Bayesian uncertainty information encoded in parameter distributions to inform the learning procedure for Bayesian models. We derive a first principle stochastic differential equation for the training dynamics of the mean and uncertainty parameter in the variational distributions. On the basis of the derived Bayesian stochastic differential equation, we apply the methodology of stochastic optimal control on the variational parameters to obtain individually controlled learning rates. We show that the resulting optimizer, StochControlSGD, is significantly more robust to large learning rates and can adaptively and individually control the learning rates of the variational parameters. The evolution of the control suggests separate and distinct dynamical behaviours in the training regimes for the mean and uncertainty parameters in Bayesian neural networks.

Keywords: Bayesian inference; Bayesian neural networks; learning

Year: 2022 PMID: 36010761 PMCID: PMC9407447 DOI: 10.3390/e24081097

Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738

1. Introduction

Deep Bayesian neural networks (BNNs) aim to leverage the advantages of two different methodologies. First, in recent years, deep representations have been remarkably successful in fields as diverse as computer vision, speech recognition and natural language processing [1,2,3]. Much of this success, however, revolves around prediction accuracy. Second, Bayesian methodologies are required to obtain an estimate of model uncertainty, a crucial feature that allows deep neural networks to support risk assessment and informed model decisions. The role of model uncertainty in the training procedure of BNNs, however, remains unaddressed; the present investigation seeks to exploit the model uncertainty in Bayesian neural networks for the development of new learning algorithms. For the training of BNNs, the approximate posterior over the model parameters is obtained by maximizing the variational lower bound. Such a posterior introduces a form of uncertainty in the parameters that is different from that injected by random batches of data. In this investigation, we seek to exploit both the data uncertainty (aleatoric) and the model uncertainty (epistemic) to solve a control problem aimed at maximizing the evidence lower bound (ELBO), where the control parameters gauge the dynamics of the gradient during descent.

The contributions of our work are threefold. First, we provide a first-principles derivation of the stochastic differential equation that governs the evolution of the parameters in variational distributions trained with variational inference, and we decompose the uncertainty of the gradients into their aleatoric and epistemic components. Second, we derive a stochastic optimal control optimization algorithm which incorporates the uncertainty in the gradients to optimally control the learning rates for each variational parameter. Third, we show that the evolution of the control exhibits distinct dynamical behaviour and demonstrates different fluctuation and dissipation regimes for the variational mean and uncertainty parameters.

Section 1 offers an introduction to the topic. In Section 2, we provide an overview of probabilistic models and Bayesian neural networks. Sections 2.1 and 2.2 detail the derivation of the stochastic differential equations governing the dynamics of the frequentist and variational parameters, respectively. Subsequently, we derive a stochastic optimal control algorithm in Section 3 on the basis of the dynamics of the variational parameters. Section 4 summarizes the experiments undertaken and the performance of the stochastic optimal control optimizer, as well as the distinct behaviour of the control parameters. Finally, Section 5 reviews related work and Section 6 concludes.

2. Variational Inference for Bayesian Neural Networks

For a training dataset $\mathcal{D}$, the Bayesian formulation of a neural network places a posterior distribution $p(\theta \mid \mathcal{D})$ and a prior $p(\theta)$ on each of its parameters $\theta$. The quintessential task in Bayesian inference is to compute the posterior according to Bayes' rule:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}.$$

Given a likelihood function $p(\mathcal{D} \mid \theta)$ and the parameter prior $p(\theta)$, we can make predictions by marginalizing over the parameters. For the most common application of supervised learning with label y and data x, $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, this gives us

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta,$$

where $p(y \mid x, \theta)$ is the likelihood function of the output y given the input x and $p(\theta \mid \mathcal{D})$ is the posterior parameter distribution. For highly parameterized models, inferring the posterior distribution requires a high-dimensional integral that is numerically intractable for most complex models, as they can easily have millions of parameters. There are two main approaches for inferring the posterior distribution in Bayesian neural networks: sampling from the posterior distribution in proportion to the data likelihood and prior [4], and variational inference, which optimizes a bound on the evidence and approximates the true posterior with a tractable distribution $q_\phi(\theta)$ with the variational parameters $\phi$ [5]. Our approach focuses on the variational inference formulation, which scales well to large data regimes as the bound is amenable to gradient-based optimization schemes [6]. Variational inference infers the posterior distribution by minimizing the Kullback–Leibler divergence between the true posterior $p(\theta \mid \mathcal{D})$ and a variational distribution $q_\phi(\theta)$. The important detail is that the variational distribution $q_\phi(\theta)$ is assumed to be independent of the data $\mathcal{D}$, which makes the solution approximate yet tractable. The optimization problem is then

$$\phi^{*} = \arg\min_{\phi}\, \mathrm{KL}\left[ q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D}) \right].$$

We are thus left to optimize the ELBO, which is derived in full in Appendix A, in the form

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(\theta)}\left[ \log p(\mathcal{D} \mid \theta) \right] - \mathrm{KL}\left[ q_\phi(\theta) \,\|\, p(\theta) \right]$$

as a surrogate loss function.

The ELBO is optimized numerically through gradient descent algorithms, which bring their own set of challenges with respect to gradient step size, directional sensitivity and exploding and vanishing gradients. We propose a stochastic optimal control algorithm for gradient descent optimization which controls the learning rate for every variational parameter based on the local surface of the Kullback–Leibler divergence. For the remainder of this paper, we assume that the variational distribution $q_\phi(\theta)$ for each parameter follows an independent normal or Laplace distribution, with the location $\mu$ of the distribution and the scale $\sigma$ as the variational parameters of the parameter $\theta$. Since the scale parameter $\sigma$ is constrained to be positive, we employ an additional reparameterization which allows us to compute derivatives for the scale during optimization while keeping $\sigma$ strictly positive.
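
The ingredients described above (a reparameterized variational sample, a strictly positive scale and the ELBO as a surrogate loss) can be illustrated with a minimal sketch. Several choices here are assumptions made for illustration and are not taken from the text: the softplus map as the positivity-preserving reparameterization, the unconstrained name `rho`, a standard normal prior, and a toy one-parameter regression model `y ~ N(theta * x, 1)`.

```python
import math
import random


def softplus(rho):
    """Map an unconstrained value rho to a strictly positive scale (assumed choice)."""
    # Numerically stable form of log(1 + exp(rho)).
    return max(rho, 0.0) + math.log1p(math.exp(-abs(rho)))


def sample_theta(mu, rho, rng):
    """Reparameterized draw theta = mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + softplus(rho) * rng.gauss(0.0, 1.0)


def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)]."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def log_likelihood(theta, data):
    """Gaussian log-likelihood of the toy model y ~ N(theta * x, 1)."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (y - theta * x) ** 2
               for x, y in data)


def elbo_estimate(mu, rho, data, rng, num_samples=64):
    """Monte Carlo ELBO: E_q[log p(D | theta)] - KL[q || prior], prior N(0, 1)."""
    expected_ll = sum(log_likelihood(sample_theta(mu, rho, rng), data)
                      for _ in range(num_samples)) / num_samples
    return expected_ll - kl_normal(mu, softplus(rho), 0.0, 1.0)


rng = random.Random(0)
# Hypothetical data generated with a true slope of 2.
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in (-1.0, 0.0, 1.0, 2.0)]
elbo_near_truth = elbo_estimate(2.0, -2.0, data, rng)
elbo_far_away = elbo_estimate(-2.0, -2.0, data, rng)
print(elbo_near_truth, elbo_far_away)
```

A variational mean close to the data-generating slope yields a higher ELBO than one far from it, while the KL term is identical in both cases because the two means are equally distant from the prior mean.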

2.1. Stochastic Differential Equations for Frequentist Models

During optimization, the model parameters follow a dynamical process. In this section, we show how this dynamic can be approximated by an SDE. We start with a frequentist version, in which no distribution is imposed on the parameters (unlike in the BNN) and the stochasticity is injected by the dataset and the samples drawn from it. Given a probabilistic model with a set of scalar parameters $\theta$, the input $x$ and the output $y$, we compute the derivative of a scalar loss function $\mathcal{L}$ to obtain the derivative $g_n = \partial_\theta \mathcal{L}(x_n, y_n; \theta)$ with respect to each parameter in the probabilistic model for a single data point. Gradient descent requires us to calculate the derivative of the loss over the entire training dataset at each iteration $t$. This gradient,

$$\bar{g} = \frac{1}{N} \sum_{n=1}^{N} g_n,$$

has an associated variance:

$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} \left( g_n - \bar{g} \right)^2.$$

The computational cost of calculating gradients over entire training datasets is prohibitively expensive, which has favoured the use of mini-batch sampled gradients. Now, a mini-batch with $M$ data points is sampled [7]. The assumption is that a mini-batch is computationally tractable while providing a representative sample of the training dataset on which to compute a sufficiently good gradient. We denote by $g_m$ the gradient of a single data sample and by $\hat{g} = \frac{1}{M} \sum_{m=1}^{M} g_m$ the mini-batch gradient. The sampling of the mini-batches introduces stochasticity into the gradient estimation. The first and second moments, denoted as $\mathbb{E}[\hat{g}]$ and $\mathbb{V}[\hat{g}]$, for each scalar parameter of the mini-batch gradients are:

$$\mathbb{E}[\hat{g}] = \bar{g}, \qquad \mathbb{V}[\hat{g}] = \frac{\Sigma}{M}.$$

It is easy to see that we can decrease the variance in the gradient estimation by increasing the size of the mini-batch M. The change in the parameters in gradient-based optimization consequently follows a noisy estimate of the true gradient, which is distributed according to the first- and second-order moments in (4) and (5). The central limit theorem implies that the derivatives follow a Gaussian distribution, $\hat{g} \sim \mathcal{N}(\bar{g}, \Sigma / M)$ [8].

Given the distribution of the gradients, the evolution of the parameter $\theta$ through time with the learning rate $\eta$ can be approximated by:

$$\theta_{t+1} = \theta_t - \eta\, \bar{g}_t + \eta \sqrt{\Sigma_t / M}\; \xi_t, \qquad \xi_t \sim \mathcal{N}(0, 1).$$

This formulation of the parameter dynamics during training has strong similarities with the Euler–Maruyama discretization of an Ito drift–diffusion process. Indeed, for an SDE with drift $a$ and diffusion $b$,

$$dX_t = a(X_t, t)\, dt + b(X_t, t)\, dW_t,$$

we have the associated Euler–Maruyama discretization:

$$X_{t + \Delta t} = X_t + a(X_t, t)\, \Delta t + b(X_t, t) \sqrt{\Delta t}\; \xi_t.$$

We proceed by setting $\Delta t = \eta$, $a = -\bar{g}$ and $b = \sqrt{\eta\, \Sigma / M}$ to denote equivalency, as further described in [8,9,10]. This modification allows the use of stochastic analysis for Ito drift–diffusion processes (see [11] for a more thorough discussion of the relationship between the learning rate and the diffusion of SGD). If we additionally consider the learning in the limit of an infinitesimal time step, we arrive at a formulation for the instantaneous change in time, which is given by

$$d\theta_t = -\bar{g}_t\, dt + \sqrt{\eta\, \Sigma_t / M}\; dW_t,$$

which is a stochastic differential equation, where $W_t$ is a Wiener process that originates from the limit applied to the Gaussian noise $\xi_t$ [12]. We can thus conclude that the change in the parameters $\theta$, for an infinitesimally small learning rate $\eta$, follows a stochastic differential equation in the form of an Ito drift–diffusion process over time in which the sampling of the mini-batches contributes the diffusion [12].
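
The correspondence between a mini-batch SGD step and an Euler–Maruyama step can be checked numerically. The following is a minimal sketch under stated assumptions: a toy per-sample loss `0.5 * (theta - b)^2`, mini-batches drawn with replacement, a time step equal to the learning rate, drift equal to the negative mean gradient, and diffusion equal to the square root of learning rate times gradient variance divided by batch size. The function names and the values of `lr`, `batch_size` and `trials` are hypothetical.

```python
import random
import statistics


def sgd_step(theta, targets, batch_size, lr, rng):
    """Plain SGD step on the loss 0.5 * (theta - b)^2 with a random mini-batch."""
    batch = [rng.choice(targets) for _ in range(batch_size)]  # with replacement
    grad = sum(theta - b for b in batch) / batch_size
    return theta - lr * grad


def gradient_moments(theta, targets):
    """Mean and variance of the per-sample gradients over the whole dataset."""
    grads = [theta - b for b in targets]
    return statistics.fmean(grads), statistics.pvariance(grads)


def euler_maruyama_step(theta, mean_grad, var_grad, batch_size, lr, rng):
    """Euler-Maruyama step of d(theta) = a dt + b dW with dt = lr."""
    drift = -mean_grad                                  # a
    diffusion = (lr * var_grad / batch_size) ** 0.5     # b
    return theta + drift * lr + diffusion * lr ** 0.5 * rng.gauss(0.0, 1.0)


rng = random.Random(1)
targets = [rng.gauss(0.0, 1.0) for _ in range(200)]
theta0, lr, batch_size, trials = 3.0, 0.1, 8, 5000

mean_grad, var_grad = gradient_moments(theta0, targets)
sgd = [sgd_step(theta0, targets, batch_size, lr, rng) for _ in range(trials)]
sde = [euler_maruyama_step(theta0, mean_grad, var_grad, batch_size, lr, rng)
       for _ in range(trials)]

print(statistics.fmean(sgd), statistics.fmean(sde))    # one-step means
print(statistics.pstdev(sgd), statistics.pstdev(sde))  # one-step spreads
```

Starting from the same parameter value, the two kinds of step produce matching means and spreads up to Monte Carlo error, which is the sense in which the SDE approximates the SGD dynamics.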

2.2. Stochastic Differential Equations for Bayesian Models

In BNN models, each scalar parameter $\theta$ is modelled by a univariate distribution $q(\theta)$. The use of the distribution extends the loss to the form of the ELBO, which is additive in the mini-batch samples m and has a closed-form regularization term (the Kullback–Leibler divergence between the approximate posterior $q(\theta)$ and the prior $p(\theta)$), the derivation of which can be found in Appendix A. Not only do we choose data samples at random but, concurrently, we sample the parameter $\theta$ from the distribution $q(\theta)$ following the reparameterization trick. The parameter $\theta$ is thus a random variable itself. Consequently, the derivative $g_m(\theta)$ for a single data sample m will exhibit randomness originating both from the randomly sampled mini-batches and from the stochasticity of the parameters sampled from the variational distribution. The uncertainty of the parameter derivative can be decomposed into the aleatoric and the epistemic uncertainty. The aleatoric uncertainty arises from the variance in the data and is irreducible, whereas the epistemic uncertainty arises from the uncertainty of the parameter and can be reduced to zero since, in principle, the variational distribution from which the parameters are sampled can collapse to a point. Employing the tractable univariate variational distribution $q(\theta)$ to achieve a scalable optimization, for a derivative $g_m(\theta)$ which depends on the random parameter $\theta$ and the randomly chosen data sample m, we can decompose the uncertainty of $g_m(\theta)$ into a sum of the data uncertainty and the parameter uncertainty, which follows from the law of total variance [13]:

$$\mathbb{V}_{m, \theta}\left[ g_m(\theta) \right] = \mathbb{V}_{m}\left[ \mathbb{E}_{\theta}\left[ g_m(\theta) \right] \right] + \mathbb{E}_{m}\left[ \mathbb{V}_{\theta}\left[ g_m(\theta) \right] \right].$$

In effect, we draw samples twice in BNNs to obtain "per data sample, per variational sample" derivatives: data samples for the mini-batch and parameter samples from the variational distribution. Aleatoric uncertainty first computes the expectation over the variationally sampled derivatives per data sample, $\mathbb{E}_{\theta}[g_m(\theta)]$, and subsequently computes the variance over the mini-batch, $\mathbb{V}_{m}[\cdot]$. Epistemic uncertainty first computes the variance over the variationally sampled gradients, $\mathbb{V}_{\theta}[g_m(\theta)]$, and then computes the expectation over the mini-batch, $\mathbb{E}_{m}[\cdot]$. It matters over which source of randomness the variance is computed in the uncertainty decomposition. The first term, $\mathbb{V}_{m}[\mathbb{E}_{\theta}[g_m(\theta)]]$, represents the aleatoric uncertainty and measures the data uncertainty, that is, how much the average gradient varies over the dataset. The second term, $\mathbb{E}_{m}[\mathbb{V}_{\theta}[g_m(\theta)]]$, is called the epistemic uncertainty and measures the uncertainty originating from the model parameter distribution. For the epistemic uncertainty, the variance is computed over the source of parameter uncertainty and averaged over the data samples. In BNNs, this is explicitly modelled through the use of distributions $q(\theta)$ for every parameter $\theta$. Frequentist models exhibit only aleatoric uncertainty, as the variance over the deterministic gradients in the epistemic uncertainty evaluates to zero.

For a univariate variational distribution $q(\theta)$, we can now formulate the stochastic differential equation (SDE) that governs the dynamics of the variational parameters $\mu$ and $\sigma$. The first modification with respect to the SDE of a frequentist model in Equation (10) is that, for every parameter $\theta$ in the frequentist model, we have two separate variational parameters in the Bayesian model, corresponding to the mean and scale of the variational distribution from which we sample $\theta$. We thus have two differential equations for the variational parameters $\mu$ and $\sigma$, in which $\sigma$ has a separate Wiener process due to the externalized noise in the reparameterization, the details of which are given in Appendix C. The second modification is the separation of uncertainty, given that we have the additional source of uncertainty from the distribution $q(\theta)$. We can thus employ the uncertainty decomposition to split the diffusion term into its aleatoric and epistemic parts. The only difference between the SDEs that govern the training dynamics in frequentist and Bayesian models is therefore the added epistemic uncertainty in the diffusion term of the Bayesian stochastic differential equation. Figure 1 exemplifies the different terms in the Bayesian stochastic differential equation and how uncertainty in stochastic gradient descent for a variational distribution can be decomposed for a toy example in one dimension. The derivation of the Bayesian stochastic differential equation is detailed in Appendix C.
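
The decomposition of gradient uncertainty by the law of total variance can be verified on sampled gradients. The sketch below is illustrative only: the per-sample loss `0.5 * (theta * x - y)^2`, the normal variational distribution with mean `mu` and scale `sigma`, and all numerical values are assumptions. Rows index data samples and columns index parameter samples, so the aleatoric term is the variance of the row means and the epistemic term is the mean of the row variances.

```python
import random
import statistics


def gradient(theta, x, y):
    """Derivative of the per-sample loss 0.5 * (theta * x - y)^2 w.r.t. theta."""
    return x * (theta * x - y)


def decompose(mu, sigma, batch, num_param_samples, rng):
    """Return (aleatoric, epistemic, total) variance of the sampled gradients."""
    # One row per data sample, one column per draw theta = mu + sigma * eps.
    grads = [[gradient(mu + sigma * rng.gauss(0.0, 1.0), x, y)
              for _ in range(num_param_samples)]
             for x, y in batch]
    row_means = [statistics.fmean(row) for row in grads]
    row_vars = [statistics.pvariance(row) for row in grads]
    aleatoric = statistics.pvariance(row_means)  # variance over data of mean over theta
    epistemic = statistics.fmean(row_vars)       # mean over data of variance over theta
    total = statistics.pvariance([g for row in grads for g in row])
    return aleatoric, epistemic, total


rng = random.Random(0)
batch = [(x, 1.5 * x + rng.gauss(0.0, 0.5)) for x in (0.5, 1.0, 1.5, 2.0)]

aleatoric, epistemic, total = decompose(1.0, 0.3, batch, 50, rng)
print(aleatoric, epistemic, total)

# With a zero scale the parameter is deterministic, as in a frequentist model.
freq_aleatoric, freq_epistemic, freq_total = decompose(1.0, 0.0, batch, 50, rng)
print(freq_aleatoric, freq_epistemic, freq_total)
```

Because every data sample receives the same number of parameter draws and population variances are used, the two terms add up to the total variance exactly (up to floating-point error), and the epistemic term vanishes when the scale is zero.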

Figure 1

The components of the stochastic differential equation for the variational parameters $\mu$ and $\sigma$ over time. The empirical drift and diffusion estimates shown in blue are unbiased estimates of the true analytically derived drift and diffusion terms. In the loss of this toy example, b was sampled randomly to simulate aleatoric uncertainty. The aleatoric uncertainty from the data in the gradients remains constant, whereas the epistemic uncertainty from the parameter distribution is reduced to zero.

3. Stochastic Control for Learning Rates

Having derived and characterized the training dynamics of the variational parameters on a first-principles basis, we now construct our proposed stochastic optimal control algorithm for BNNs. Our approximation methodology relies on the limit of an infinitesimally small learning rate. We first introduce a new control variable that respects this limit: an additional adaptive diagonal control matrix U that adjusts the learning rate during training. This leads to a full SDE for the dynamics of training, which is an Ito drift–diffusion process in which both the drift and the diffusion are controlled by the diagonal control matrix U and in which the diffusion term is estimated from the variance of the gradients. In essence, the control matrix U scales the update on a per-parameter basis. We clip the individual control parameters on the diagonal of U to a fixed range, which bounds the step size. We pose our problem as follows: given the gradients, how do we choose the policy for adjusting the control parameter U to minimize the loss at the end of the training, provided that X follows Equation (19)? The general optimal control formalism requires us to minimize the cost C for the optimal control parameter U, accumulated over time, together with the final cost, under the constraint of the dynamics.

3.1. Simplifying the Loss

It is known that the loss surface of deep neural network architectures is highly non-linear, which makes global optimization nearly impossible. In a similar way to [14,15], we therefore approximate the loss surface locally with a quadratic function of the form

$$g(X) = \tfrac{1}{2} (X - b)^{\top} A\, (X - b).$$

The quadratic approximation, as seen in Figure 2, forfeits the global loss surface for a local approximation in which the respective quantities can be computed optimally in the sense of the local approximation. This simplification is chosen such that a tractable stochastic optimal control algorithm can be derived. Intuitively, given a local quadratic approximation of the loss surface, the offset parameter b denotes the optimum of the quadratic approximation $g$, whereas the curvature A denotes how flat or steep the loss surface is locally.

Figure 2

A one-dimensional illustration of how the optimal stochastic control u is determined from the gradient and parameter information. The parameters and their gradient information are used to estimate the curvature A and offset b for the quadratic approximation g, through which the optimal control parameter u is determined. In our experiments with Bayesian neural networks, each parameter has two variational parameters, $\mu$ and $\sigma$, so that these quantities become two-dimensional.

Consequently, we want to move the variational parameters in the observable state vector as close as possible to this local optimum, which coincides with the offset parameter b. The curvature A and the offset parameter b of the local quadratic approximation of the loss surface can be conveniently calculated via ordinary least squares with the gradient relation (see Appendix D for details)

$$\nabla_X\, g(X) = A\, (X - b).$$

We maintain running averages of the gradients and the parameters to prevent abrupt changes in the control. The quadratic approximation of the loss surface is maintained for each parameter distribution in the BNN architectures.
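
The least-squares estimate of curvature and offset from running averages can be sketched for a single scalar parameter. This is an illustration under assumptions rather than the procedure of Appendix D: it assumes the gradient of the local quadratic is linear in the parameter, `grad = A * (x - b)`, and uses exponential moving averages with a hypothetical `decay` and a shared normalizing weight. The class name `RunningQuadraticFit` is invented for this example.

```python
class RunningQuadraticFit:
    """Running least-squares fit of grad = A * (x - b) for one scalar parameter."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.weight = 0.0  # accumulated weight, used to normalize the averages
        self.sx = self.sg = self.sxx = self.sxg = 0.0

    def update(self, x, grad):
        """Fold one (parameter, gradient) observation into the moving averages."""
        d = self.decay
        self.weight = d * self.weight + (1.0 - d)
        self.sx = d * self.sx + (1.0 - d) * x
        self.sg = d * self.sg + (1.0 - d) * grad
        self.sxx = d * self.sxx + (1.0 - d) * x * x
        self.sxg = d * self.sxg + (1.0 - d) * x * grad

    def estimate(self):
        """Return (curvature A, offset b) from the weighted regression of grad on x."""
        mx, mg = self.sx / self.weight, self.sg / self.weight
        mxx, mxg = self.sxx / self.weight, self.sxg / self.weight
        curvature = (mxg - mx * mg) / (mxx - mx * mx)  # slope: cov(x, grad) / var(x)
        offset = mx - mg / curvature                   # point where the fitted gradient is zero
        return curvature, offset


fit = RunningQuadraticFit()
true_a, true_b = 2.0, 1.0
for x in (3.0, 2.5, 2.1, 1.8, 1.6):  # hypothetical parameter trajectory
    fit.update(x, true_a * (x - true_b))
curvature, offset = fit.estimate()
print(curvature, offset)
```

With noise-free gradients the fit recovers the curvature and offset used to generate them. With noisy mini-batch gradients the moving averages smooth the estimate, which mirrors the stated aim of preventing abrupt changes in the control. At least two distinct parameter values are needed before `estimate` is well defined.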

3.2. Our Control Problem

Taking inspiration from the local quadratic approximation $g$, we wish to minimize the distance of the observable state variables to the optimum b of the quadratic approximation. We introduce an auxiliary variable L, which allows us to simplify the classical control problem that would otherwise require solving the Hamilton–Jacobi–Bellman equation (Appendix E). By definition, the state X is a stochastic variable, which gives a relationship between L and the approximation of the error g. We make use of Ito's lemma, detailed in Appendix B, to obtain the dynamics of the error and define the diffusion matrix D. The result is again an Ito drift–diffusion process, for which we provide the relevant gradient calculations in Appendix D. Separating the drift and evaluating the matrix derivatives, the details of which are also in Appendix D, yields the average dynamics of the error: they correspond to the drift of the Ito drift–diffusion process and describe how the error function evolves on average over time, given the dynamics of the parameters. The task is to minimize the loss in (23) in such a way that we arrive at the optimum after the control period, where A is the curvature of the local approximation $g$. The motivation of this formulation is that M measures the distance of the state vector to the local optimum b in the quadratic approximation, scaled by the curvature A. Thus, minimizing the distance M at each time step is equivalent to minimizing the entire cost C. The full derivation can be found in Appendix E.

Taking the derivative with respect to the individual control parameters and setting it to zero gives the optimal control, expressed in terms of a vector u with the corresponding control parameters, the Hadamard product ∘ and an operator that extracts the diagonal elements of a matrix. For indefinite matrices A, we project onto the eigenvector corresponding to the positive eigenvalue to ensure that the optimality condition is met [16]. The full derivation can be reviewed in Appendix E. We compute the control parameter jointly for the variational parameters $\mu$ and $\sigma$, which results in the matrices A, D and M being in $\mathbb{R}^{2 \times 2}$. The inversion of the matrices can be performed analytically, as detailed at the end of Appendix F. Comparing the operations required per parameter in ADAM (addition, subtraction, division, etc.) with those in StochControlSGD (mostly 2 × 2 matrix multiplications and analytical inversions), we arrive at an approximately 2.5× increase in computations for StochControlSGD compared to ADAM. It is important to note that ADAM has to be applied to both variational parameters independently, whereas StochControlSGD computes the control parameters jointly, thus saving computation. The StochControlSGD algorithm is detailed in its entirety in Algorithm 1.

4. Experiments

We evaluate the proposed stochastic optimal control SGD, which we abbreviate as StochControlSGD, on the MNIST [17], FashionMNIST [18] and CIFAR10 [19] datasets. In Table 1, we compare the final performance of StochControlSGD with the performance of ADAM, controlled SGD (cSGD) as proposed by [15], SGD and SGD with cosine learning rate scheduling.

Table 1

Test accuracy on the MNIST, FMNIST and CIFAR10 datasets. We abbreviate StochControlSGD as scSGD, and the SGD with cosine learning rate scheduling as LRSGD, for notational brevity. The best performing optimization algorithm per data set is denoted in bold.

	MNIST					FMNIST					CIFAR10
	SGD	ADAM	cSGD	scSGD	LRSGD	SGD	ADAM	cSGD	scSGD	LRSGD	SGD	ADAM	cSGD	scSGD	LRSGD
NN	0.959	0.987	0.961	/	0.985	0.818	0.890	0.851	/	0.878	0.461	0.512	0.432	/	0.499
CNN	0.989	0.993	0.981	/	0.990	0.904	0.918	0.912	/	0.907	0.853	0.865	0.857	/	0.855
BNN (Normal)	0.956	0.963	0.970	0.971	0.069	0.865	0.870	0.876	0.900	0.900	0.441	0.442	0.451	0.471	0.462
CBNN (Normal)	0.982	0.988	0.982	0.990	0.989	0.869	0.914	0.903	0.921	0.915	0.615	0.854	0.836	0.853	0.801
BNN (Laplace)	0.976	0.978	0.974	0.977	0.975	0.890	0.875	0.903	0.901	0.9	0.501	0.452	0.461	0.479	0.500
CBNN (Laplace)	0.989	0.987	0.985	0.991	0.989	0.899	0.916	0.907	0.918	0.912	0.627	0.857	0.829	0.857	0.853

Learning rate scheduling was chosen as the cosine annealing, where the initial learning rate was chosen as and was decreased to . The experimental setup is detailed in Appendix G. ADAM provides a strong baseline for the frequentist models when the learning rate is chosen to be appropriately small. Following the notion of learning rate scheduling, we initialized the learning of both cSGD and StochControlSGD as and . Both cSGD and StochControlSGD are able to adaptively and individually set their control parameters over the course of optimization. Additionally, we plot the convergence of the ADAM, cSGD and StochControlSGD in Figure 3. The results are portrayed more concisely in Figure 4, for which five runs for each learning rate and each optimizer are combined in a boxplot format.
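
For reference, the cosine annealing baseline can be written in a few lines. This is a generic sketch of the standard cosine schedule; the start and end learning rates and the number of steps below are hypothetical placeholders, not the values used in the experiments.

```python
import math


def cosine_annealing(step, total_steps, lr_start, lr_end):
    """Learning rate that decays from lr_start to lr_end along half a cosine wave."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: 100 steps from 1e-1 down to 1e-4.
schedule = [cosine_annealing(t, 100, 1e-1, 1e-4) for t in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

Unlike the controlled optimizers, this schedule depends only on the step count and ignores gradient information, which is why it serves as a fixed a priori baseline.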

Figure 3

Comparison of StochControlSGD with SGD, controlled SGD and ADAM. StochControlSGD offers very robust performance over varying learning rates.

Figure 4

Combined performance of the optimizers over different learning rates. StochControlSGD provides reliable performance over a wide range of learning rates without the necessity of hyperparameter tuning.

In contrast to cSGD and StochControlSGD, ADAM does not have the ability to modify the a priori chosen learning rate $\eta$. Coupled with the first- and second-order moments from which the surrogate gradient is computed, ADAM is sensitive to large learning rates, with significantly worse performance at the largest learning rates. The larger learning rates do not pose a problem for the optimal control optimizers cSGD and StochControlSGD, as they can adaptively and individually control their learning rates. We consider only optimizers which rely on gradient information to accelerate gradient descent and forgo learning rate scheduling algorithms which incorporate performance information, such as schedulers which decrease the learning rate when a performance plateau is detected. Among the optimal control optimizers, StochControlSGD provides tighter bounds on the lower and upper performance while offering higher performance. Especially on the CIFAR10 dataset in Figure 3, StochControlSGD improves upon cSGD with better absolute performance and less variation between the largest and the smallest learning rate. Furthermore, the performance of StochControlSGD and cSGD improves with larger learning rates. As can be seen in Figure 3, the largest learning rate yields the best performance for the optimal control optimizers, whereas it yields the worst performance for ADAM. The direct comparison of ADAM with StochControlSGD connects to recent work carried out by [20] on the fundamental optimization of deep Bayesian models with gradient optimization algorithms developed for frequentist models. The methodology of BNNs is limited in the amount of relevant information in the uncertainty with respect to the learning optimization due to its reliance on normal priors. 
Modern frequentist deep neural networks rely on custom layer architectures, such as BatchNorm [21], with additional data augmentation schemes, which have no clear Bayesian interpretation, raising additional questions on the applicability of porting frequentist ideas, such as layer designs, in deep neural networks, to their Bayesian formulations.

Behaviour of Control Parameter

The evolution of the control parameter U allows insight into the descent and fluctuation behaviour of the variational parameters $\mu$ and $\sigma$ with respect to the ELBO. More specifically, it sheds some light on the interplay between the data log likelihood and the KL divergence. The data log likelihood aims at minimizing the uncertainty parameter $\sigma$ of each variational distribution as much as possible. The gradients of the KL divergence, in turn, favour an uncertainty parameter $\sigma$ which corresponds to the chosen prior. The relative weighting of the data log likelihood and the KL divergence with respect to the number of samples in the ELBO heavily favours the gradients of the data log likelihood during the descent phase for large datasets. As the gradients of the KL divergence are independent of the data by definition, their importance increases as the gradients of the converging data log likelihood diminish. The uncertainty parameters $\sigma$ were initialized to the same value in all our experiments, which allows the BNN to increase the uncertainty of select parameters if the KL divergence dominates the gradients of the specific parameter in question. The intuition is that deep neural networks, in fact, use only a few weights [22], and thus the uncertainty parameters can be maximized by the KL divergence for parameters for which the gradients of the KL divergence are stronger than the gradients originating from the data log likelihood. We can observe this behaviour in Figure 5, where the median control parameters of the mean $\mu$ decrease quickly alongside the control parameters for the uncertainty parameter $\sigma$. However, as the data log likelihood converges, the median control parameter of the uncertainty parameter $\sigma$ is increased, as the relative importance of the gradients originating from the data log likelihood decreases and the gradients from the KL divergence dominate.

Figure 5

The median control parameter over time, plotted together with the training ELBO that is used to compute the gradients, for a BNN trained on FashionMNIST.

This indicates two different dynamical regimes in the optimization of the uncertainty parameter of the variational distribution. The mean control parameter remains small during the descent and fluctuation dynamics whereas the uncertainty control is, in fact, increased by the stochastic control optimization algorithm in the fluctuation phase.

5. Related Work

The authors of [15] derived an optimal control algorithm for frequentist models which incorporated the variance into the learning rate scheduling. In [23], it was argued that, instead of decreasing the learning rate in the dissipation phase of the optimization, the batch size should be increased to reduce the uncertainty in the gradients. The authors of [24,25] examined adaptive learning rate schemes for changing loss surfaces. The idea of a priori cyclical scaling of the learning rates was pioneered in [26]. The use of the reparameterization of the Gaussian variational distribution in deep Bayesian neural networks to arrive at a scalable optimization algorithm based on variational inference was proposed in [27]. The authors of [28] examined the behaviour of DropOut [29] as a form of approximate Bayesian inference. The authors of [30] demonstrated that the dropout rate could be learned as an approximate uncertainty parameter.

6. Conclusions

We have examined the potential for incorporating Bayesian uncertainty information directly into a learning algorithm. For this, we derived the SDEs for variational parameters on a first principle basis. With both aleatoric and epistemic uncertainty present in the optimization process, we decomposed the diffusion parameter of the SDE into its data and parameter uncertainties. Having identified the underlying dynamics of the variational parameters during optimization, we proceeded to formulate a stochastic optimal control algorithm for Bayesian models which was able to incorporate the Bayesian uncertainty information into an adaptive and selective learning rate schedule. An analysis of the control parameters indicated separate dynamical behaviours during optimization of the mean and uncertainty parameters. This can be investigated further to examine the dynamics of the ELBO as a loss function for other probabilistic models.
