Ludwig Winkler, César Ojeda, Manfred Opper.
Abstract
In this paper, we propose to leverage the Bayesian uncertainty information encoded in parameter distributions to inform the learning procedure for Bayesian models. We derive a first-principles stochastic differential equation for the training dynamics of the mean and uncertainty parameters of the variational distributions. On the basis of the derived Bayesian stochastic differential equation, we apply the methodology of stochastic optimal control to the variational parameters to obtain individually controlled learning rates. We show that the resulting optimizer, StochControlSGD, is significantly more robust to large learning rates and can adaptively and individually control the learning rates of the variational parameters. The evolution of the control suggests separate and distinct dynamical behaviours in the training regimes for the mean and uncertainty parameters in Bayesian neural networks.
Keywords: Bayesian inference; Bayesian neural networks; learning
Year: 2022; PMID: 36010761; PMCID: PMC9407447; DOI: 10.3390/e24081097
Source DB: PubMed; Journal: Entropy (Basel); ISSN: 1099-4300; Impact factor: 2.738
Figure 1. The components of the stochastic differential equation for the variational parameters (mean and uncertainty) over time. The empirical drift and diffusion estimates, shown in blue, are unbiased estimates of the true, analytically derived drift and diffusion terms. In the loss, the term b was sampled randomly to simulate aleatoric uncertainty. The aleatoric uncertainty from the data remains constant in the gradients, whereas the epistemic uncertainty from the parameter distribution is reduced to zero.
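The split between aleatoric and epistemic gradient noise described in the Figure 1 caption can be illustrated with a small Monte Carlo sketch. It assumes a toy loss 0.5 * (theta - b)^2 with theta = mu + sigma * eps and Gaussian b; the loss form, the Gaussian choice for b, and all function names are illustrative assumptions and are not taken from the paper. Under these assumptions the gradient with respect to the mean has expectation mu - b_mean and variance sigma^2 + b_std^2, which play the role of drift and diffusion up to sign and learning-rate scaling.

```python
import random
import statistics


def empirical_gradient_moments(mu, sigma, b_mean, b_std, n=20000, seed=0):
    """Monte Carlo mean and variance of d/dmu of 0.5 * (theta - b)^2.

    Assumed toy setting (not from the paper):
    theta = mu + sigma * eps with eps ~ N(0, 1), and b ~ N(b_mean, b_std^2).
    """
    rng = random.Random(seed)
    grads = []
    for _ in range(n):
        theta = mu + sigma * rng.gauss(0.0, 1.0)  # epistemic noise from the parameter distribution
        b = rng.gauss(b_mean, b_std)              # aleatoric noise from the data
        grads.append(theta - b)                   # gradient of the toy loss w.r.t. mu
    return statistics.fmean(grads), statistics.variance(grads)


def analytic_gradient_moments(mu, sigma, b_mean, b_std):
    """Closed-form mean and variance of the same gradient under the toy assumptions."""
    return mu - b_mean, sigma ** 2 + b_std ** 2


if __name__ == "__main__":
    for sigma in (0.5, 0.1, 0.0):
        emp = empirical_gradient_moments(1.0, sigma, 0.0, 0.3)
        ana = analytic_gradient_moments(1.0, sigma, 0.0, 0.3)
        print(f"sigma={sigma}: empirical={emp}, analytic={ana}")
```

As sigma shrinks to zero, the variance in this sketch approaches b_std^2: the epistemic contribution vanishes while the aleatoric contribution stays constant, mirroring the behaviour the caption describes.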
Figure 2. A one-dimensional illustration of how the optimal stochastic control u is determined from the gradient and parameter information. The parameters and their gradient information are used to estimate the curvature A and offset b of the quadratic approximation g, through which the optimal control parameter u is determined. In our experiments with Bayesian neural networks, each parameter has two variational parameters, the mean and the uncertainty.
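One way to picture the estimation step in the Figure 2 caption is a one-dimensional least-squares fit. The sketch below assumes the quadratic approximation has the form g(theta) = 0.5 * A * (theta - b)^2, so that its gradient A * (theta - b) is linear in theta; the function name `fit_quadratic` and this specific form are assumptions for illustration. It covers only the estimation of A and b and does not reproduce the paper's control law for u.

```python
def fit_quadratic(thetas, grads):
    """Estimate curvature A and offset b from (parameter, gradient) pairs.

    Assumes grad ~= A * (theta - b), i.e. a line with slope A and
    intercept -A * b, fitted by ordinary least squares.
    """
    if len(thetas) != len(grads) or len(thetas) < 2:
        raise ValueError("need at least two (theta, grad) pairs of equal length")
    n = len(thetas)
    mean_t = sum(thetas) / n
    mean_g = sum(grads) / n
    cov = sum((t - mean_t) * (g - mean_g) for t, g in zip(thetas, grads))
    var = sum((t - mean_t) ** 2 for t in thetas)
    if var == 0:
        raise ValueError("all parameter values are identical; slope is undefined")
    A = cov / var                      # slope of the gradient = curvature
    intercept = mean_g - A * mean_t    # equals -A * b under the assumed form
    if A == 0:
        raise ValueError("zero curvature; offset b is undefined")
    b = -intercept / A
    return A, b


if __name__ == "__main__":
    thetas = [0.0, 1.0, 2.0, 3.0]
    grads = [2.0 * (t - 1.5) for t in thetas]  # exact gradients of 0.5 * 2 * (theta - 1.5)^2
    print(fit_quadratic(thetas, grads))
```

With noisy stochastic gradients, the same fit would typically be run over a window of recent iterates or with running averages; that choice is left open here.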
Test accuracy on the MNIST, FMNIST and CIFAR10 datasets. For notational brevity, we abbreviate StochControlSGD as scSGD, controlled SGD as cSGD, and SGD with cosine learning rate scheduling as LRSGD. The best-performing optimization algorithm per dataset is denoted in bold.
| | MNIST | | | | | FMNIST | | | | | CIFAR10 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | SGD | ADAM | cSGD | scSGD | LRSGD | SGD | ADAM | cSGD | scSGD | LRSGD | SGD | ADAM | cSGD | scSGD | LRSGD |
| NN | 0.959 | | 0.961 | / | 0.985 | 0.818 | | 0.851 | / | 0.878 | 0.461 | | 0.432 | / | 0.499 |
| CNN | 0.989 | | 0.981 | / | 0.990 | 0.904 | | 0.912 | / | 0.907 | 0.853 | | 0.857 | / | 0.855 |
| BNN (Normal) | 0.956 | 0.963 | 0.970 | | 0.069 | 0.865 | 0.870 | 0.876 | | 0.900 | 0.441 | 0.442 | 0.451 | | 0.462 |
| CBNN (Normal) | 0.982 | 0.988 | 0.982 | | 0.989 | 0.869 | 0.914 | 0.903 | | 0.915 | 0.615 | | 0.836 | 0.853 | 0.801 |
| BNN (Laplace) | 0.976 | | 0.974 | 0.977 | 0.975 | 0.890 | 0.875 | | 0.901 | 0.9 | | 0.452 | 0.461 | 0.479 | 0.500 |
| CBNN (Laplace) | 0.989 | 0.987 | 0.985 | | 0.989 | 0.899 | 0.916 | 0.907 | | 0.912 | 0.627 | 0.857 | 0.829 | | 0.853 |
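The LRSGD baseline in the table uses cosine learning rate scheduling. The exact schedule and its hyperparameters are not given here, so the following is only a minimal sketch of the commonly used cosine annealing form, lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)); the function name `cosine_lr` and its parameters are illustrative assumptions.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate, decaying from lr_max to lr_min.

    A generic cosine schedule (assumed form, not necessarily the one
    used for the LRSGD baseline).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    t = min(max(step, 0), total_steps)  # clamp the step into [0, total_steps]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / total_steps))


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, cosine_lr(step, total_steps=100, lr_max=0.1))
```

The schedule starts at lr_max, reaches the midpoint between lr_max and lr_min halfway through training, and ends at lr_min.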
Figure 3. Comparison of StochControlSGD with SGD, controlled SGD and ADAM. StochControlSGD offers robust performance over varying learning rates.
Figure 4. Combined performance of the optimizers over different learning rates. StochControlSGD provides reliable performance over a wide range of learning rates without requiring hyperparameter tuning.
Figure 5. The median control parameter over time, plotted together with the training ELBO used to compute the gradients, for a BNN trained on Fashion MNIST.