Bayesian CNNs with Variational Inference

This document summarizes a research paper on Bayesian convolutional neural networks with variational inference. The paper introduces a method to apply Bayesian methods to CNNs to model weight uncertainties with probability distributions rather than single point estimates. This allows the CNN to express predictive uncertainty and helps prevent overfitting with fewer training examples. The method extends previous work on Bayesian neural networks to CNN architectures. It shows the method can be generically applied to different CNNs and examines how to decompose predictive uncertainty into aleatoric and epistemic sources. Performance is equivalent to non-Bayesian CNNs while gaining uncertainty estimates and regularization.

Uploaded by

Selva Kumar

Bayesian Convolutional Neural Networks with Variational Inference

Kumar Shridhar*
Department of Computer Science, TU Kaiserslautern; MindGarage
k shridhar16@[Link]

Felix Laumann*
Department of Mathematics, Imperial College London; NeuralSpace
f.laumann18@[Link]

Adrian Llopart Maurin
Department of Electrical Engineering, Technical University of Denmark
adllo@[Link]

Martin Olsen
Department of Business Development and Technology, Aarhus University
martino@[Link]

Marcus Liwicki
Department of Computer Science, Luleå University of Technology; MindGarage
[Link]@[Link]

arXiv:1806.05978v5 [[Link]] 14 Nov 2018

Abstract

We introduce Bayesian convolutional neural networks with variational inference, a variant of convolutional neural networks (CNNs) in which the intractable posterior probability distributions over weights are inferred by Bayes by Backprop. We demonstrate how this reliable variational inference method can serve as a fundamental construct for various network architectures. On multiple datasets in supervised learning settings (MNIST, CIFAR-10, CIFAR-100), our proposed variational inference method achieves performances equivalent to frequentist inference in identical architectures, while the two desiderata, a measure for uncertainty and regularization, are incorporated naturally. We examine in detail how this measure for uncertainty, namely the predictive variance, can be decomposed into aleatoric and epistemic uncertainties. In the past, Bayes by Backprop has been successfully implemented in feedforward and recurrent neural networks, but not in convolutional ones. This work extends the group of Bayesian neural networks with variational inference, which now encompasses all three aforementioned types of network architectures.

1. Introduction

Convolutional neural networks (CNNs) excel at tasks in the realm of image classification (e.g. [10, 28, 18]). However, from a probability theory perspective, it is unjustifiable to use single point-estimates as weights to base any classification on. CNNs with frequentist inference require substantial amounts of data examples to train on and are prone to overfitting on datasets with few examples per class.

In this work, we apply Bayesian methods to CNNs in order to add a measure of uncertainty to their predictions and regularization to their training. This approach allows the network to express uncertainty via its parameters in the form of probability distributions (see Figure 1). At the same time, by using a prior probability distribution to integrate out the parameters, we compute the average across many models during training, which gives a regularization effect to the network, thus preventing overfitting.

We build our Bayesian CNN upon Bayes by Backprop [8, 1], and approximate the intractable true posterior probability distributions $p(w|\mathcal{D})$ with variational probability distributions $q_\theta(w|\mathcal{D})$, which comprise the properties of Gaussian distributions $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d$, denoted $\mathcal{N}(\theta|\mu, \sigma^2)$, where $d$ is the total number of parameters defining a probability distribution. The shape of these Gaussian variational posterior probability distributions, determined by their variance $\sigma^2$, expresses an uncertainty estimation of every model parameter. The main contributions of our work are as follows:

1. We present how Bayes by Backprop can be efficiently applied to CNNs. We therefore introduce the idea of applying two convolutional operations, one for the mean and one for the variance.

2. We empirically show that our proposed generic and reliable variational inference method for Bayesian CNNs can be applied to various CNN architectures without any limitations on their performances.

3. We examine how to estimate the aleatoric and epistemic uncertainties and add regularization in CNNs by means of applying Bayesian methods to their training. We compare the performances of these Bayesian CNNs to CNNs which use single point-estimates as weights, i.e. which are trained by frequentist inference.

This work builds on the foundations laid out by Blundell et al. [1], who introduced Bayes by Backprop for feedforward neural networks. Together with the extension to recurrent neural networks, introduced by Fortunato et al. [5], Bayes by Backprop is now applicable to the three most frequently used types of neural networks, i.e., feedforward, recurrent, and convolutional neural networks.

The paper is structured as follows: after this introduction, we briefly review related work in this field; third, we explain the Bayes by Backprop method on which our work is built; fourth, we explain our proposed Bayesian CNN with variational inference in detail; fifth, we introduce aleatoric and epistemic uncertainties; sixth, we present our results and findings through experimental evaluation of the proposed method on various architectures and datasets before we finally conclude our work.

2. Related Work

Applying Bayesian methods to neural networks has been studied in the past with various approximation methods for the intractable true posterior probability distribution $p(w|\mathcal{D})$. Buntine and Weigend [2] started to propose various maximum-a-posteriori (MAP) schemes for neural networks. They were also the first to suggest second order derivatives in the prior probability distribution $p(w)$ to encourage smoothness of the resulting approximate posterior probability distribution. In subsequent work by Hinton and Van Camp [11], the first variational methods were proposed which naturally served as a regularizer in neural networks. Hochreiter and Schmidhuber [12] suggest taking an information theory perspective into account and utilising a minimum description length (MDL) loss. This penalises non-robust weights by means of an approximate penalty based upon perturbations of the weights on the outputs. Denker and LeCun [3], and MacKay [23] investigated the posterior probability distributions of neural networks by using Laplace approximations. As a response to the limitations of Laplace approximations, Neal [25] investigated the use of hybrid Monte Carlo for training neural networks, although it has so far been difficult to apply these to the large sizes of neural networks built in modern applications. More recently, Graves [8] derived a variational inference scheme for neural networks and Blundell et al. [1] extended this with an update for the variance that is unbiased and simpler to compute. Graves [9] derives a similar algorithm in the case of a mixture posterior probability distribution.

Several authors have claimed that Dropout [29] and Gaussian Dropout [30] can be viewed as approximate variational inference schemes [7, 16]. We compare our results to those of Gal and Ghahramani [7] and discuss the methodological differences in detail.

3. Bayes by Backprop

Bayes by Backprop [8, 1] is a variational inference method to learn the posterior distribution on the weights $w \sim q_\theta(w|\mathcal{D})$ of a neural network from which weights $w$ can be sampled in backpropagation. Since the true posterior is typically intractable, an approximate distribution $q_\theta(w|\mathcal{D})$ is defined that is aimed to be as similar as possible to the true posterior $p(w|\mathcal{D})$, measured by the Kullback-Leibler (KL) divergence [19]. Hence, we define the optimal parameters $\theta_{opt}$ as

$$\theta_{opt} = \arg\min_{\theta} \mathrm{KL}\left[q_\theta(w|\mathcal{D}) \,\|\, p(w|\mathcal{D})\right] = \arg\min_{\theta} \mathrm{KL}\left[q_\theta(w|\mathcal{D}) \,\|\, p(w)\right] - \mathbb{E}_{q(w|\theta)}\left[\log p(\mathcal{D}|w)\right] + \log p(\mathcal{D}) \tag{1}$$

where

$$\mathrm{KL}\left[q_\theta(w|\mathcal{D}) \,\|\, p(w)\right] = \int q_\theta(w|\mathcal{D}) \log \frac{q_\theta(w|\mathcal{D})}{p(w)} \, dw. \tag{2}$$

This derivation forms an optimisation problem with a resulting cost function widely known as variational free energy [26, 31, 6], which is built upon two terms: the former, $\mathrm{KL}[q_\theta(w|\mathcal{D}) \,\|\, p(w)]$, is dependent on the definition of the prior $p(w)$, thus called complexity cost, whereas the latter, $\mathbb{E}_{q(w|\theta)}[\log p(\mathcal{D}|w)]$, is dependent on the data $p(\mathcal{D}|w)$, thus called likelihood cost. The term $\log p(\mathcal{D})$ can be omitted in the optimisation because it is constant.

Since the KL-divergence is also intractable to compute exactly, we follow a stochastic variational method [8, 1]. We sample the weights $w$ from the variational distribution $q_\theta(w|\mathcal{D})$ since it is much more probable to draw samples which are appropriate for numerical methods from the variational posterior $q_\theta(w|\mathcal{D})$ than from the true posterior
$p(w|\mathcal{D})$. Consequently, we arrive at the tractable cost function (3), which is minimised w.r.t. $\theta$ during training:

$$\mathcal{F}(\mathcal{D}, \theta) \approx \sum_{i=1}^{n} \left[ \log q_\theta(w^{(i)}|\mathcal{D}) - \log p(w^{(i)}) - \log p(\mathcal{D}|w^{(i)}) \right] \tag{3}$$

where $n$ is the number of draws and each $w^{(i)}$ is sampled from $q_\theta(w|\mathcal{D})$. The uncertainty afforded by Bayes by Backprop trained neural networks has been used successfully for training feedforward neural networks in both supervised and reinforcement learning environments [1, 22, 13], and for training recurrent neural networks [5], but has not been applied to convolutional neural networks to date.
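The Monte Carlo estimate in Equation (3) can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a single scalar weight, a standard Gaussian prior, and a Gaussian likelihood for a linear model y = w x, and the names `log_gaussian` and `free_energy_estimate` are hypothetical helpers chosen for this example.

```python
import math
import random


def log_gaussian(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def free_energy_estimate(mu, sigma, xs, ys, n_draws, rng,
                         prior_std=1.0, noise_std=1.0):
    """Monte Carlo estimate of F(D, theta) as in Equation (3), with theta = (mu, sigma).

    Assumptions of this sketch: one scalar weight w, prior p(w) = N(0, prior_std^2),
    and likelihood p(y | x, w) = N(w * x, noise_std^2).
    """
    total = 0.0
    for _ in range(n_draws):
        w = mu + sigma * rng.gauss(0.0, 1.0)          # draw w^(i) from q_theta(w|D)
        log_q = log_gaussian(w, mu, sigma)            # log q_theta(w^(i)|D)
        log_prior = log_gaussian(w, 0.0, prior_std)   # log p(w^(i))
        log_lik = sum(log_gaussian(y, w * x, noise_std) for x, y in zip(xs, ys))
        total += log_q - log_prior - log_lik          # summand of Equation (3)
    return total


# Toy data generated with a true weight of 2.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [2.0 * x for x in xs]

f_near = free_energy_estimate(2.0, 0.1, xs, ys, n_draws=10, rng=random.Random(0))
f_far = free_energy_estimate(-2.0, 0.1, xs, ys, n_draws=10, rng=random.Random(0))
print(f_near < f_far)
```

A variational posterior centred on the weight that generated the data yields a lower estimated cost than one centred far from it, which is the behaviour the minimisation w.r.t. θ relies on.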

4. Bayesian convolutional neural networks with variational inference

In this section, we explain our algorithm of building a CNN with probability distributions over its weights in each filter, as seen in Figure 1, and apply variational inference, i.e. Bayes by Backprop, to compute the intractable true posterior probability distribution, as described in the previous section. Notably, for most CNN architectures a fully Bayesian perspective is not accomplished by merely placing probability distributions over weights in convolutional layers; it also requires probability distributions over weights in fully-connected layers (see Figure 2).

Figure 1. Input image with exemplary pixel values, filters, and corresponding output with point-estimates (top) and probability distributions (bottom) over weights.

Figure 2. Fully Bayesian perspective of an exemplary CNN. Weights in filters of convolutional layers, and weights in fully-connected layers have the form of a probability distribution.

4.1. Local reparameterization trick for convolutional layers

We utilise the local reparameterization trick [16] and apply it to CNNs. Following [16, 27], we do not sample the weights $w$, but instead sample layer activations $b$, because this accelerates computation. The variational posterior probability distribution $q_\theta(w_{ijhw}|\mathcal{D}) = \mathcal{N}(\mu_{ijhw}, \alpha_{ijhw}\mu_{ijhw}^2)$ (where $i$ and $j$ index the input and output layers, and $h$ and $w$ the height and width of any given filter) allows us to implement the local reparameterization trick in convolutional layers. This results in the subsequent equation for convolutional layer activations $b$:

$$b_j = A_i * \mu_i + \epsilon_j \odot \sqrt{A_i^2 * (\alpha_i \odot \mu_i^2)} \tag{4}$$

where $\epsilon_j \sim \mathcal{N}(0, 1)$, $A_i$ is the receptive field, $*$ signalises the convolutional operation, and $\odot$ the component-wise multiplication.

4.2. Applying two sequential convolutional operations (mean and variance)

The crux of equipping a CNN with probability distributions over weights instead of single point-estimates and being able to update the variational posterior probability distribution $q_\theta(w|\mathcal{D})$ by backpropagation lies in applying two convolutional operations whereas filters with single point-estimates apply one. As explained in the previous section, we deploy the local reparameterization trick and sample from the output $b$. Since the output $b$ is a function of mean $\mu_{ijhw}$ and variance $\alpha_{ijhw}\mu_{ijhw}^2$ among others, we are then able to compute the two variables determining a Gaussian
probability distribution, namely mean $\mu_{ijhw}$ and variance $\alpha_{ijhw}\mu_{ijhw}^2$, separately.

We do this in two convolutional operations: in the first, we treat the output $b$ as an output of a CNN updated by frequentist inference. We optimize with Adam [15] towards a single point-estimate which increases the validation accuracy of classifications. We interpret this single point-estimate as the mean $\mu_{ijhw}$ of the variational posterior probability distributions $q_\theta(w|\mathcal{D})$. In the second convolutional operation, we learn the variance $\alpha_{ijhw}\mu_{ijhw}^2$. As this formulation of the variance includes the mean $\mu_{ijhw}$, only $\alpha_{ijhw}$ needs to be learned in the second convolutional operation [24]. In this way, we ensure that only one parameter is updated per convolutional operation, exactly as it would have been with a CNN updated by frequentist inference.

In other words, while we learn in the first convolutional operation the MAP of the variational posterior probability distribution $q_\theta(w|\mathcal{D})$, we observe in the second convolutional operation how much values for weights $w$ deviate from this MAP. This procedure is repeated in the fully-connected layers. In addition, to accelerate computation, to ensure a positive non-zero variance $\alpha_{ijhw}\mu_{ijhw}^2$, and to enhance accuracy, we learn $\log \alpha_{ijhw}$ and use the Softplus activation function as further described in the Experiments section.

5. Uncertainty estimation in CNNs

In classification tasks, we are interested in the predictive distribution $p_\mathcal{D}(y^*|x^*)$, where $x^*$ is an unseen data example and $y^*$ its predicted class. For a Bayesian neural network, this quantity is given by:

$$p_\mathcal{D}(y^*|x^*) = \int p_w(y^*|x^*) \, p_\mathcal{D}(w) \, dw \tag{5}$$

In Bayes by Backprop, Gaussian distributions $q_\theta(w|\mathcal{D}) \sim \mathcal{N}(w|\mu, \sigma^2)$, where $\theta = \{\mu, \sigma\}$, are learned with some dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$ as we explained previously. Due to the discrete and finite nature of most classification tasks, the predictive distribution is commonly assumed to be categorical. Incorporating this aspect into the predictive distribution gives us

$$p_\mathcal{D}(y^*|x^*) = \int \mathrm{Cat}(y^*|f_w(x^*)) \, \mathcal{N}(w|\mu, \sigma^2) \, dw \tag{6}$$

$$= \int \prod_{c=1}^{C} f(x_c^*|w)^{y_c} \, \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(w-\mu)^2}{2\sigma^2}} \, dw \tag{7}$$

where $C$ is the total number of classes and $\sum_c f(x_c^*|w) = 1$.

As there is no closed-form solution due to the lack of conjugacy between categorical and Gaussian distributions, we cannot recover this distribution. However, we can construct an unbiased estimator of the expectation by sampling from $q_\theta(w|\mathcal{D})$:

$$\mathbb{E}_q\left[p_\mathcal{D}(y^*|x^*)\right] = \int q_\theta(w|\mathcal{D}) \, p_w(y^*|x^*) \, dw \tag{8}$$

$$\approx \frac{1}{T} \sum_{t=1}^{T} p_{w_t}(y^*|x^*) \tag{9}$$

where $T$ is the pre-defined number of samples. This estimator allows us to evaluate the uncertainty of our predictions by the definition of variance, hence called predictive variance and denoted as $\mathrm{Var}_q$:

$$\mathrm{Var}_q\left(p(y^*|x^*)\right) = \mathbb{E}_q[y y^T] - \mathbb{E}_q[y] \, \mathbb{E}_q[y]^T \tag{10}$$

This quantity can be decomposed into the aleatoric and epistemic uncertainty [14, 20]:

$$\mathrm{Var}_q\left(p(y^*|x^*)\right) = \underbrace{\frac{1}{T} \sum_{t=1}^{T} \left[ \mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^T \right]}_{\text{aleatoric}} + \underbrace{\frac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^T}_{\text{epistemic}} \tag{11}$$

where $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t$ and $\hat{p}_t = \mathrm{Softmax}(f_{w_t}(x^*))$.

It is of paramount importance that uncertainty is split into aleatoric and epistemic quantities since it allows the modeler to evaluate the room for improvements: while aleatoric uncertainty (also known as statistical uncertainty) is merely a measure for the variation of ("noisy") data, epistemic uncertainty is caused by the model. Hence, a modeler can see whether the quality of the data is low (i.e. high aleatoric uncertainty), or the model itself is the cause for poor performances (i.e. high epistemic uncertainty). The former can be improved by gathering more data, whereas the latter calls for refining the model [4].

6. Experiments

For all conducted experiments, we implement the foregoing description of Bayesian CNNs with variational inference in LeNet-5 [21] and AlexNet [18]. The exact architecture specifications can be found in the Appendix and in our GitHub repository ([Link]). We train the networks with the MNIST dataset of handwritten digits [21], and with the CIFAR-10 and CIFAR-100 datasets [17] since these datasets serve widely as benchmarks for CNNs' performances. The originally chosen activation functions in all architectures are ReLU, but we must introduce another, called Softplus, see (12), because of our method to apply two convolutional
or fully-connected operations. As aforementioned, one of these is determining the mean $\mu$, and the other the variance $\alpha\mu^2$. Specifically, we apply the Softplus function because we want to ensure that the variance $\alpha\mu^2$ never becomes zero. This would be equivalent to merely calculating the MAP, which can be interpreted as equivalent to a maximum likelihood estimation (MLE), which is further equivalent to utilising single point-estimates, hence frequentist inference. The Softplus activation function is a smooth approximation of ReLU. Although it is practically not influential, it has the subtle and analytically important advantage that it is strictly positive for every finite $x$ and only approaches zero as $x \to -\infty$, whereas ReLU is exactly zero for all $x \le 0$.

$$\mathrm{Softplus}(x) = \frac{1}{\beta} \cdot \log\left(1 + \exp(\beta \cdot x)\right) \tag{12}$$

where $\beta$ is by default set to 1.

All experiments are performed with the same hyper-parameter settings as stated in the Appendix.

6.1. Datasets

As aforementioned, we train various architectures on multiple datasets, namely MNIST, CIFAR-10, and CIFAR-100.

Classification on MNIST. The MNIST dataset of handwritten digits consists of 60,000 training and 10,000 validation images of 28 by 28 pixels. Each image is labelled with its corresponding number (between zero and nine, inclusive).

Classification on CIFAR-10. The CIFAR-10 dataset consists of 60,000 colour images in 10 classes, with 6,000 images per class, each image 32 by 32 pixels large. Each of the classes has 5,000 training images and 1,000 validation images.

Classification on CIFAR-100. This dataset is similar to CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 validation images per class. As in CIFAR-10, each image is 32 by 32 pixels.

6.2. Results

First, we evaluate the performance of our proposed method, Bayesian CNNs with variational inference. Table 1 shows a comparison of validation accuracies (in percentage) for architectures trained by two disparate Bayesian approaches, namely variational inference, i.e. Bayes by Backprop, and Dropout as proposed by Gal and Ghahramani [7], plus frequentist inference for all three datasets. Bayesian CNNs trained by variational inference achieve validation accuracies comparable to their counter-architectures trained by frequentist inference. On MNIST, validation accuracies of the two disparate Bayesian approaches are comparable, but a Bayesian LeNet-5 with Dropout achieves a considerably higher validation accuracy on CIFAR-10, although we were not able to reproduce these reported results.

| | MNIST | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| Bayesian AlexNet (with VI) | 99 | 73 | 36 |
| Frequentist AlexNet | 99 | 73 | 38 |
| Bayesian LeNet-5 (with VI) | 98 | 69 | 31 |
| Frequentist LeNet-5 | 98 | 68 | 33 |
| Bayesian LeNet-5 (with Dropout) | 99.5 | 83 | |

Table 1. Comparison of validation accuracies (in percentage) for different architectures with variational inference (VI), frequentist inference and Dropout as a Bayesian approximation as proposed by Gal and Ghahramani [7] for MNIST, CIFAR-10, and CIFAR-100.

In Figure 3, we show how Bayesian networks naturally incorporate effects of regularization, exemplified on AlexNet. While an AlexNet trained by frequentist inference without any regularization overfits greatly on CIFAR-100, an AlexNet trained by Bayesian inference on CIFAR-100 does not. It performs equivalently to an AlexNet trained by frequentist inference with three layers of Dropout after the first, fourth, and sixth layers in the architecture. In initial epochs, Bayesian CNNs trained by variational inference start with a low validation accuracy compared to architectures trained by frequentist inference. This stems from the initialization of the variational posterior probability distributions $q_\theta(w|\mathcal{D})$ as uniform distributions, while initial point-estimates in architectures trained by frequentist inference are randomly drawn from a standard Gaussian distribution. The latter initialization method ensures the initialized weights are neither too small nor too large. In other words, the motivation of the latter initialization is to start with weights such that the activation functions do not let them begin in saturated or dead regions. This is not true in the case of uniform distributions and hence, Bayesian CNNs' starting validation accuracies can be comparably low.

Figure 3. AlexNet trained on CIFAR-100 by Bayesian and frequentist inference. The frequentist AlexNet without Dropout overfits while the Bayesian AlexNet naturally incorporates an effect of regularization, comparable to a frequentist AlexNet with three Dropout layers.

Figure 4 displays the convergence of the standard deviation $\sigma$ of the variational posterior probability distribution $q_\theta(w|\mathcal{D})$ of a random model parameter over epochs. As aforementioned, all prior probability distributions $p(w)$ are initialized as uniform distributions. The variational posterior probability distributions $q_\theta(w|\mathcal{D})$ are approximated as Gaussian distributions which become more confident as more data is processed, observable by the decreasing standard deviation over epochs in Figure 4. Although the validation accuracy for MNIST on Bayesian LeNet-5 has already reached 99%, we can still see a fairly steep decrease in the parameter's standard deviation. In Figure 5, we plot the actual Gaussian variational posterior probability distributions $q_\theta(w|\mathcal{D})$ of a random parameter of LeNet-5 trained on CIFAR-10 at some epochs.

Figure 4. Convergence of the standard deviation of the Gaussian variational posterior probability distribution $q_\theta(w|\mathcal{D})$ of a random model parameter at epochs 1, 5, 20, 50, and 100. Bayesian LeNet-5 is trained on MNIST.

Figure 5. Convergence of the Gaussian variational posterior probability distribution $q_\theta(w|\mathcal{D})$ of a random model parameter at epochs 1, 5, 20, 50, and 100. Bayesian LeNet-5 is trained on CIFAR-10.

Finally, Table 2 compares the means of aleatoric and epistemic uncertainties for a Bayesian LeNet-5 with variational inference on MNIST and CIFAR-10. The aleatoric uncertainty of CIFAR-10 is about twenty times as large as that of MNIST. Considering that the aleatoric uncertainty measures the irreducible variability and depends on the predicted values, a larger aleatoric uncertainty for CIFAR-10 can be directly deduced from its lower validation accuracy and may be further due to the smaller number of training examples. The epistemic uncertainty of CIFAR-10 is about fifteen times larger than that of MNIST, which we anticipated, since epistemic uncertainty decreases in proportion to validation accuracy.

| | Aleatoric uncertainty | Epistemic uncertainty |
|---|---|---|
| Bayesian LeNet-5 (MNIST) | 0.0096 | 0.0026 |
| Bayesian LeNet-5 (CIFAR-10) | 0.1920 | 0.0404 |

Table 2. Aleatoric and epistemic uncertainty for Bayesian LeNet-5 calculated for MNIST and CIFAR-10, computed as proposed by Kwon et al. [20].

7. Conclusion

We propose Bayesian CNNs utilizing Bayes by Backprop as a reliable variational inference method for CNNs, which has not been studied to date, and estimate the models' aleatoric and epistemic uncertainties.

There has been previous work by Gal and Ghahramani [7], who utilized the various outputs of a Dropout function to define a distribution, and concluded that one can then speak of a Bayesian CNN. This approach finds, perhaps also due to its ease, a large confirming audience. However, we argue against this approach, and claim deficiencies. Specifically, in Gal and Ghahramani's [7] approach, no prior probability distributions $p(w)$ are placed on the CNN's parameters. But these are a substantial part of a Bayesian interpretation for the simple reason that Bayes' theorem includes them. Thus we argue that starting with prior probability distributions $p(w)$ is essential in Bayesian methods. In comparison, we place prior probability distributions over all model parameters, and update them according to Bayes' theorem with variational inference, precisely Bayes by Backprop. We show that these neural networks achieve results equivalent to those achieved by the same network architectures trained by frequentist inference. Furthermore, we examine how aleatoric and epistemic uncertainties can be computed
for our proposed method and show the natural regularization effect of Bayesian methods.

As an add-on method to further enhance the stability of the optimization, posterior sharpening [5] could be applied to Bayesian CNNs in future work. There, the variational posterior distribution $q_\theta(w|\mathcal{D})$ is conditioned on the training data of a batch $\mathcal{D}^{(i)}$. We can see $q_\theta(w|\mathcal{D}^{(i)})$ as a proposal distribution, or hyper-prior when we rethink it as a hierarchical model, to improve the gradient estimates of the intractable likelihood function $p(\mathcal{D}|w)$.

References

[1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[2] W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex systems, 5(6):603–643, 1991.
[3] J. S. Denker and Y. LeCun. Transforming neural-net output levels to probability distributions. In Advances in neural information processing systems, pages 853–859, 1991.
[4] A. Der Kiureghian and O. Ditlevsen. Aleatory or epistemic? does it matter? Structural Safety, 31(2):105–112, 2009.
[5] M. Fortunato, C. Blundell, and O. Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
[6] K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny. Variational free energy and the laplace approximation. Neuroimage, 34(1):220–234, 2007.
[7] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[8] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
[9] A. Graves. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[11] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM, 1993.
[12] S. Hochreiter and J. Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in neural information processing systems, pages 529–536, 1995.
[13] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks. arXiv preprint arXiv:1605.09674, 2016.
[14] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[19] S. Kullback and R. A. Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
[20] Y. Kwon, J.-H. Won, B. J. Kim, and M. C. Paik. Uncertainty quantification using bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. 2018.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Z. C. Lipton, J. Gao, L. Li, X. Li, F. Ahmed, and L. Deng. Efficient exploration for dialogue policy learning with bbq networks & replay buffer spiking. arXiv preprint arXiv:1608.05081, 2016.
[23] D. J. MacKay. Probable networks and plausible predictions - a review of practical bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[24] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
[25] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
[26] R. M. Neal and G. E. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
[27] K. Neklyudov, D. Molchanov, A. Ashukha, and D. Vetrov. Variance networks: When expectation does not meet your expectations. arXiv preprint arXiv:1803.03764, 2018.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[30] S. Wang and C. Manning. Fast dropout training. In international conference on machine learning, pages 118–126, 2013.
[31] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on information theory, 51(7):2282–2312, 2005.
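8. Appendix

8.1. Experiment specifications

| variable | value |
|---|---|
| learning rate | 0.001 |
| epochs | 100 |
| batch size | 128 |
| sample size | 10 |
| $(\alpha\mu^2)_{init}$ of approximate posterior $q_\theta(w|\mathcal{D})$ | -10 |
| optimizer | Adam [15] |
| $\lambda$ in $\ell_2$ normalisation | 0.0005 |
| $\beta_i$ | $\frac{2^{M-i}}{2^M - 1}$ [1] |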
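The weighting $\beta_i = 2^{M-i} / (2^M - 1)$ in the table above can be checked with a few lines of exact arithmetic. The sketch assumes, following Blundell et al. [1], that $M$ is the number of mini-batches per epoch and that $\beta_i$ scales the complexity cost of the $i$-th mini-batch; the function name `kl_weights` is a hypothetical helper for this example.

```python
from fractions import Fraction


def kl_weights(num_batches):
    """Return beta_i = 2^(M - i) / (2^M - 1) for i = 1..M as exact fractions."""
    denom = 2 ** num_batches - 1
    return [Fraction(2 ** (num_batches - i), denom) for i in range(1, num_batches + 1)]


weights = kl_weights(4)
print([str(w) for w in weights])  # ['8/15', '4/15', '2/15', '1/15']
print(sum(weights))               # 1
```

The weights halve from one mini-batch to the next and sum to one, so under this reading the complexity cost counts most for the first mini-batches of an epoch and the likelihood cost dominates later ones.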

8.2. Model architectures

8.3. LeNet-5

| layer type | width | stride | padding | input shape | nonlinearity |
|---|---|---|---|---|---|
| convolution (5 × 5) | 6 | 1 | 0 | M × 1 × 32 × 32 | Softplus |
| max-pooling (2 × 2) | | 2 | 0 | M × 6 × 28 × 28 | |
| convolution (5 × 5) | 16 | 1 | 0 | M × 6 × 14 × 14 | Softplus |
| max-pooling (2 × 2) | | 2 | 0 | M × 16 × 10 × 10 | |
| fully-connected | 120 | | | M × 400 | Softplus |
| fully-connected | 84 | | | M × 120 | Softplus |
| fully-connected | 10 | | | M × 84 | |
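The chain of input shapes in the LeNet-5 table can be reproduced with the standard output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The sketch below only does this arithmetic; the layer list is transcribed from the table, and `out_size` is a hypothetical helper name. Note that pooling preserves the channel count, so the second convolution receives 6 channels.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (layer type, kernel, stride, padding, output channels or None for pooling)
layers = [
    ("convolution", 5, 1, 0, 6),
    ("max-pooling", 2, 2, 0, None),
    ("convolution", 5, 1, 0, 16),
    ("max-pooling", 2, 2, 0, None),
]

channels, size = 1, 32                 # one input channel, 32 x 32 pixels
shapes = [(channels, size, size)]
for name, kernel, stride, padding, width in layers:
    size = out_size(size, kernel, stride, padding)
    if width is not None:              # only convolutions change the channel count
        channels = width
    shapes.append((channels, size, size))

flat_features = channels * size * size
print(shapes)
print(flat_features)                   # 400, the input width of the first fully-connected layer
```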

8.4. AlexNet

layer type              width  stride  padding  input shape         nonlinearity
convolution (11 × 11)   64     4       5        M × 3 × 32 × 32     Softplus
max-pooling (2 × 2)     -      2       0        M × 64 × 32 × 32    -
convolution (5 × 5)     192    1       2        M × 64 × 15 × 15    Softplus
max-pooling (2 × 2)     -      2       0        M × 192 × 15 × 15   -
convolution (3 × 3)     384    1       1        M × 192 × 7 × 7     Softplus
convolution (3 × 3)     256    1       1        M × 384 × 7 × 7     Softplus
convolution (3 × 3)     128    1       1        M × 256 × 7 × 7     Softplus
max-pooling (2 × 2)     -      2       0        M × 128 × 7 × 7     -
fully-connected         128    -       -        M × 128             -
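
Both architectures list Softplus as the nonlinearity. For reference, softplus(x) = log(1 + e^x) is a smooth, strictly positive function. Below is a minimal, numerically stable sketch in plain Python; it is an illustration under that standard definition, not the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x)).

    Uses the identity log(1 + e^x) = max(x, 0) + log(1 + e^(-|x|)),
    which avoids overflow for large positive x.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


for value in (-5.0, 0.0, 5.0):
    print(value, softplus(value))
```

For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero, so it behaves like a smoothed ReLU.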
