This repository contains an implementation of a Bayesian neural network on a toy regression dataset, using the mean-field variational approximation. This implementation was completed as part of my MEng project with the Machine Learning Group at Cambridge, supervised by Matt Ashman and Adrian Weller. The purpose of this toy implementation was to familiarise myself with PyTorch, as I would need to be comfortable using it for experiments with more novel models for the rest of the project. The notebook consists of three main sections: dataset generation, a deterministic neural network implementation, and a Bayesian neural network implementation.
The theory behind the approach, together with the assumptions made, is as follows:
- The neural network is a function that returns an output given the input and the model parameters,
  $$\hat{y} = f(x, \boldsymbol{\theta}).$$
- In our probabilistic setting, we assume a Gaussian likelihood,
  $$\mathcal{L}(\boldsymbol{\theta}) = p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{n}\mathcal{N}(y_n|f(x_n,\boldsymbol{\theta}), \sigma^2_{noise}),$$
  where $\sigma_{noise}$ is a hyperparameter to be selected through trial and error (a minimal sketch of the network and this likelihood appears after this list).
- We take independent Gaussian priors over the parameters,
  $$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \sigma_p^2\mathbf{I}),$$
  where $\sigma_p^2$ is a prior variance hyperparameter. Note that in the implementation below we take $\sigma_p = 1$.
- Since the network function is highly nonlinear, the true posterior distribution is intractable, so we resort to variational inference with a mean-field (diagonal) Gaussian variational distribution,
  $$q_\phi(\boldsymbol{\theta}) = \prod_{i=1}^{|\boldsymbol{\theta}|}\mathcal{N}(\theta_i|\mu_i, \sigma_i^2),$$
  where $\phi = \{\mu_i, \sigma_i\}$ are the variational parameters (see the mean-field layer sketch after this list).
- As usual in variational inference, we would like to select the variational parameters by minimising the Kullback-Leibler (KL) divergence between the variational distribution and the posterior; this too is intractable, so we instead maximise the evidence lower bound (ELBO),
  $$\text{ELBO} = \mathbb{E}_{\boldsymbol{\theta}\sim q_{\phi}}\left[\log\mathcal{L}(\boldsymbol{\theta})\right] - \text{KL}\left[q_{\phi}\,||\,p(\boldsymbol{\theta})\right].$$
  Looking at the two terms, we see that maximising the ELBO maximises the expected log likelihood (the fit to the data) while minimising the KL divergence between the variational distribution and the prior (the standard Bayesian "Occam's razor" tradeoff). A sketch of the ELBO as a training loss appears after this list.
- Since both the variational and prior distributions are Gaussian, the KL divergence between them can be calculated in closed form. Taking $k$ to denote the number of parameters,
  $$\text{KL}\left[q_{\phi}\,||\,p(\boldsymbol{\theta})\right] = \frac{1}{2}\sum_{i=1}^{k}\left[\frac{\mu_i^2 + \sigma_i^2}{\sigma_p^2} - 2\log\frac{\sigma_i}{\sigma_p} - 1\right].$$
  Note that in the implementation below we use a PyTorch method to calculate the KL divergence, which uses this expression under the hood (see the KL sketch after this list).
- Unfortunately we cannot evaluate the first term of the ELBO directly, but we can estimate it by sampling parameters from the variational distribution and averaging the log likelihood over those samples. Naive sampling would prevent gradients from propagating back to the variational parameters, which we need in order to train the network, but fortunately we can use the reparameterisation trick (Kingma and Welling, 2013, "Auto-Encoding Variational Bayes"):
  $$\boldsymbol{\theta} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I}),$$
  $$\mathbb{E}_{\boldsymbol{\theta}\sim q_{\phi}}\left[\log\mathcal{L}(\boldsymbol{\theta})\right] = \mathbb{E}_{\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\mathbf{I})}\left[\log\mathcal{L}(\boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\epsilon})\right] \approx \frac{1}{L}\sum^L_{l=1}\log\mathcal{L}(\boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\epsilon}^{(l)}),$$
  where $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I})$, $L$ is the number of samples used to estimate the expectation, and $\odot$ denotes the element-wise product (the mean-field layer sketch after this list uses this sampling).
- The predictive distribution is also intractable, so we approximate it by Monte Carlo: we sample parameters from the variational posterior, make a prediction with each sample, and take the sample mean and variance of those predictions as estimates of the predictive mean and variance (see the prediction sketch after this list).
- The choice of $\sigma_{noise}$ heavily influences how the network learns. If it is too large then datapoints far from the mean are under-penalised, so the expected log-likelihood term in the ELBO is not large enough and the KL divergence term dominates. This leaves us with a network whose parameters stay very close to the prior, and predictions that are inexpressive and highly uncertain. If $\sigma_{noise}$ is too small then the expected log-likelihood term dominates (datapoints far from the mean are over-penalised), and we get an overconfident network. The "most correct" hyperparameter setting is one that reflects the true level of noise in the data, which in this slightly contrived case we happen to know since we set it during generation of our dataset. In a more realistic setting, a good way to select this hyperparameter might be the evidence framework/type II maximum likelihood. VI gives us a natural way to implement this, since the ELBO is a lower bound on the log marginal likelihood (see the final sketch after this list), but unfortunately the mean-field Gaussian approximation is not especially accurate, so the lower bound is not particularly tight and the evidence framework does not work so well here.
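
The sketches below are not the notebook's code; they are minimal PyTorch illustrations of the steps above, with layer sizes, sample counts, and `sigma_noise` values chosen arbitrarily. First, a small deterministic regression network $f(x, \boldsymbol{\theta})$ and the Gaussian log likelihood it induces:

```python
import torch
import torch.nn as nn

# A small deterministic regression network f(x, theta); the architecture here is
# illustrative, not necessarily the one used in the notebook.
net = nn.Sequential(nn.Linear(1, 50), nn.Tanh(), nn.Linear(50, 1))

def log_likelihood(x, y, sigma_noise=0.1):
    """log p(D | theta) = sum_n log N(y_n | f(x_n, theta), sigma_noise^2)."""
    y_hat = net(x)
    return torch.distributions.Normal(y_hat, sigma_noise).log_prob(y).sum()
```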
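
Next, a mean-field Gaussian over the weights of a single linear layer, with sampling done via the reparameterisation trick so that gradients reach the variational parameters. The softplus parameterisation of $\sigma$ is an assumption, not necessarily what the notebook uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldLinear(nn.Module):
    """Linear layer with a diagonal-Gaussian variational distribution over its
    weights and biases (a sketch, not the notebook's exact implementation)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Variational parameters phi = {mu, rho}, with sigma = softplus(rho) > 0.
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        # Reparameterisation trick: theta = mu + sigma * eps with eps ~ N(0, I),
        # so the sample is a differentiable function of mu and rho.
        w_sigma = F.softplus(self.w_rho)
        b_sigma = F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return F.linear(x, w, b)
```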
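
The closed-form KL term can be written with `torch.distributions.kl_divergence` (which may or may not be the "PyTorch method" the notebook uses). The helper below assumes the `MeanFieldLinear` layer from the previous sketch and the prior standard deviation $\sigma_p = 1$:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

sigma_p = 1.0  # prior standard deviation, sigma_p = 1 as in the notebook

def kl_to_prior(layer):
    """KL[q_phi || p] for one MeanFieldLinear layer, summed over its parameters.
    Equivalent to 0.5 * sum((mu^2 + sigma^2) / sigma_p^2 - 2 log(sigma / sigma_p) - 1)."""
    total = 0.0
    for mu, rho in [(layer.w_mu, layer.w_rho), (layer.b_mu, layer.b_rho)]:
        sigma = F.softplus(rho)
        q = Normal(mu, sigma)
        p = Normal(torch.zeros_like(mu), sigma_p * torch.ones_like(sigma))
        total = total + kl_divergence(q, p).sum()
    return total
```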
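
A Monte Carlo estimate of the negative ELBO, used as a training loss. The two-layer architecture and the number of samples are assumptions; the code builds on `MeanFieldLinear` and `kl_to_prior` from the sketches above:

```python
import torch
import torch.nn as nn

class BNN(nn.Module):
    """Two-layer Bayesian network built from the MeanFieldLinear sketch above."""

    def __init__(self):
        super().__init__()
        self.l1 = MeanFieldLinear(1, 50)
        self.l2 = MeanFieldLinear(50, 1)

    def forward(self, x):
        return self.l2(torch.tanh(self.l1(x)))

def negative_elbo(model, x, y, sigma_noise=0.1, num_samples=5):
    """-ELBO = -(E_q[log L(theta)] - KL[q_phi || p]), with the expectation
    estimated by averaging the log likelihood over num_samples parameter draws."""
    exp_log_lik = 0.0
    for _ in range(num_samples):
        y_hat = model(x)  # each forward pass draws fresh reparameterised samples
        exp_log_lik = exp_log_lik + torch.distributions.Normal(y_hat, sigma_noise).log_prob(y).sum()
    exp_log_lik = exp_log_lik / num_samples
    kl = kl_to_prior(model.l1) + kl_to_prior(model.l2)
    return kl - exp_log_lik
```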
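
The predictive distribution is approximated by repeated stochastic forward passes (each of which implicitly samples parameters from $q_\phi$), taking the sample mean and variance of the predictions; the number of samples is arbitrary:

```python
import torch

@torch.no_grad()
def predict(model, x, num_samples=100):
    """Estimate the predictive mean and variance by sampling from q_phi."""
    preds = torch.stack([model(x) for _ in range(num_samples)])  # fresh theta per pass
    return preds.mean(dim=0), preds.var(dim=0)
```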
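
Finally, one way (an assumption, not something the notebook does) to implement the type II maximum likelihood idea from the last point: treat $\log\sigma_{noise}$ as a learnable parameter and maximise the ELBO over it jointly with the variational parameters:

```python
import torch

model = BNN()
log_sigma_noise = torch.nn.Parameter(torch.tensor(0.0))  # learnable noise level (hypothetical)
optimiser = torch.optim.Adam(list(model.parameters()) + [log_sigma_noise], lr=1e-2)

def train_step(x, y):
    optimiser.zero_grad()
    loss = negative_elbo(model, x, y, sigma_noise=log_sigma_noise.exp())
    loss.backward()
    optimiser.step()
    return loss.item()
```

As the last bullet notes, this only works as well as the ELBO approximates the log marginal likelihood, which for a mean-field Gaussian is not especially tight.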