
Bayesian CNNs with Variational Inference

This document summarizes a research paper on Bayesian convolutional neural networks with variational inference. The paper introduces a method to apply Bayesian methods to CNNs to model weight uncertainties with probability distributions rather than single point estimates. This allows the CNN to express predictive uncertainty and helps prevent overfitting with fewer training examples. The method extends previous work on Bayesian neural networks to CNN architectures. It shows the method can be generically applied to different CNNs and examines how to decompose predictive uncertainty into aleatoric and epistemic sources. Performance is equivalent to non-Bayesian CNNs while gaining uncertainty estimates and regularization.

Uploaded by

Selva Kumar
Copyright
© All Rights Reserved

Bayesian Convolutional Neural Networks with Variational Inference

Kumar Shridhar* (Department of Computer Science, TU Kaiserslautern; MindGarage), k shridhar16@[Link]
Felix Laumann* (Department of Mathematics, Imperial College London; NeuralSpace), f.laumann18@[Link]
Adrian Llopart Maurin (Department of Electrical Engineering, Technical University of Denmark), adllo@[Link]
Martin Olsen (Department of Business Development and Technology, Aarhus University), martino@[Link]
Marcus Liwicki (Department of Computer Science, Luleå University of Technology; MindGarage), [Link]@[Link]

arXiv:1806.05978v5 [[Link]] 14 Nov 2018

Abstract

We introduce Bayesian convolutional neural networks with variational inference, a variant of convolutional neural networks (CNNs), in which the intractable posterior probability distributions over weights are inferred by Bayes by Backprop. We demonstrate how this reliable variational inference method can serve as a fundamental construct for various network architectures. On multiple datasets in supervised learning settings (MNIST, CIFAR-10, CIFAR-100), our proposed variational inference method achieves performances equivalent to frequentist inference in identical architectures, while the two desiderata, a measure for uncertainty and regularization, are incorporated naturally. We examine in detail how this measure for uncertainty, namely the predictive variance, can be decomposed into aleatoric and epistemic uncertainties. In the past, Bayes by Backprop has been successfully implemented in feedforward and recurrent neural networks, but not in convolutional ones. This work extends the group of Bayesian neural networks with variational inference, which now encompasses all three aforementioned types of network architectures.

1. Introduction

Convolutional neural networks (CNNs) excel at tasks in the realm of image classification (e.g. [10, 28, 18]). However, from a probability theory perspective, it is unjustifiable to use single point-estimates as weights to base any classification on. CNNs with frequentist inference require substantial amounts of data examples to train on and are prone to overfitting on datasets with few examples per class.

In this work, we apply Bayesian methods to CNNs in order to add a measure for uncertainty to their predictions and regularization to their training. This approach allows the network to express uncertainty via its parameters in the form of probability distributions (see Figure 1). At the same time, by using a prior probability distribution to integrate out the parameters, we compute the average across many models during training, which gives a regularization effect to the network, thus preventing overfitting.

We build our Bayesian CNN upon Bayes by Backprop [8, 1], and approximate the intractable true posterior probability distributions $p(w|D)$ with variational probability distributions $q_\theta(w|D)$, which comprise the properties of Gaussian distributions $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d$, denoted $\mathcal{N}(\theta|\mu, \sigma^2)$, where $d$ is the total number of parameters defining a probability distribution. The shape of these Gaussian variational posterior probability distributions, determined by their variance $\sigma^2$, expresses an uncertainty estimation of every model parameter. The main contributions of our work are as follows:

1. We present how Bayes by Backprop can be efficiently applied to CNNs. We therefore introduce the idea of applying two convolutional operations, one for the mean and one for the variance.

2. We empirically show that our proposed generic and reliable variational inference method for Bayesian CNNs can be applied to various CNN architectures without any limitations on their performances.

3. We examine how to estimate the aleatoric and epistemic uncertainties and add regularization in CNNs by means of applying Bayesian methods to their training. We compare the performances of these Bayesian CNNs to CNNs which use single point-estimates as weights, i.e. which are trained by frequentist inference.

This work builds on the foundations laid out by Blundell et al. [1], who introduced Bayes by Backprop for feedforward neural networks. Together with the extension to recurrent neural networks, introduced by Fortunato et al. [5], Bayes by Backprop is now applicable to the three most frequently used types of neural networks, i.e., feedforward, recurrent, and convolutional neural networks.

The paper is structured as follows: after this introduction, we briefly review related work in this field; third, we explain the Bayes by Backprop method on which our work is built; fourth, we explain our proposed Bayesian CNN with variational inference in detail; fifth, we introduce aleatoric and epistemic uncertainties; sixth, we present our results and findings through experimental evaluation of the proposed method on various architectures and datasets before we finally conclude our work.

2. Related Work

Applying Bayesian methods to neural networks has been studied in the past with various approximation methods for the intractable true posterior probability distribution $p(w|D)$. Buntine and Weigend [2] started to propose various maximum-a-posteriori (MAP) schemes for neural networks. They were also the first to suggest second order derivatives in the prior probability distribution $p(w)$ to encourage smoothness of the resulting approximate posterior probability distribution. In subsequent work by Hinton and Van Camp [11], the first variational methods were proposed which naturally served as a regularizer in neural networks. Hochreiter and Schmidhuber [12] suggest taking an information theory perspective into account and utilising a minimum description length (MDL) loss. This penalises non-robust weights by means of an approximate penalty based upon perturbations of the weights on the outputs. Denker and LeCun [3], and MacKay [23] investigated the posterior probability distributions of neural networks by using Laplace approximations. As a response to the limitations of Laplace approximations, Neal [25] investigated the use of hybrid Monte Carlo for training neural networks, although it has so far been difficult to apply these to the large sizes of neural networks built in modern applications. More recently, Graves [8] derived a variational inference scheme for neural networks and Blundell et al. [1] extended this with an update for the variance that is unbiased and simpler to compute. Graves [9] derives a similar algorithm in the case of a mixture posterior probability distribution.

Several authors have claimed that Dropout [29] and Gaussian Dropout [30] can be viewed as approximate variational inference schemes [7, 16]. We compare our results to those of Gal and Ghahramani [7] and discuss the methodological differences in detail.

3. Bayes by Backprop

Bayes by Backprop [8, 1] is a variational inference method to learn the posterior distribution on the weights $w \sim q_\theta(w|D)$ of a neural network from which weights $w$ can be sampled in backpropagation. Since the true posterior is typically intractable, an approximate distribution $q_\theta(w|D)$ is defined that is aimed to be as similar as possible to the true posterior $p(w|D)$, measured by the Kullback-Leibler (KL) divergence [19]. Hence, we define the optimal parameters $\theta^{opt}$ as

$$\begin{aligned} \theta^{opt} &= \arg\min_\theta \mathrm{KL}\left[q_\theta(w|D) \,\|\, p(w|D)\right] \\ &= \arg\min_\theta \mathrm{KL}\left[q_\theta(w|D) \,\|\, p(w)\right] - \mathbb{E}_{q(w|\theta)}\left[\log p(D|w)\right] + \log p(D) \end{aligned} \tag{1}$$

where

$$\mathrm{KL}\left[q_\theta(w|D) \,\|\, p(w)\right] = \int q_\theta(w|D) \log \frac{q_\theta(w|D)}{p(w)} \, dw. \tag{2}$$

This derivation forms an optimisation problem with a resulting cost function widely known as variational free energy [26, 31, 6] which is built upon two terms: the former, $\mathrm{KL}\left[q_\theta(w|D) \,\|\, p(w)\right]$, is dependent on the definition of the prior $p(w)$, thus called complexity cost, whereas the latter, $\mathbb{E}_{q(w|\theta)}\left[\log p(D|w)\right]$, is dependent on the data $p(D|w)$, thus called likelihood cost. The term $\log p(D)$ can be omitted in the optimisation because it is constant with respect to $\theta$.

Since the KL-divergence is also intractable to compute exactly, we follow a stochastic variational method [8, 1]. We sample the weights $w$ from the variational distribution $q_\theta(w|D)$ since it is much more probable to draw samples which are appropriate for numerical methods from the variational posterior $q_\theta(w|D)$ than from the true posterior
$p(w|D)$. Consequently, we arrive at the tractable cost function (3), which is minimised w.r.t. $\theta$ during training:

$$F(D, \theta) \approx \sum_{i=1}^{n} \left[ \log q_\theta(w^{(i)}|D) - \log p(w^{(i)}) - \log p(D|w^{(i)}) \right] \tag{3}$$

where $n$ is the number of draws.
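To make eq. (3) concrete, the following is a minimal sketch of the Monte Carlo estimate for a toy linear model. It is not taken from the paper's repository: the Gaussian prior, the Gaussian likelihood, the toy data, and all function and parameter names (`log_gaussian`, `free_energy_estimate`, `prior_std`, `noise_std`) are assumptions chosen for illustration only.

```python
import numpy as np

def log_gaussian(x, mean, std):
    """Elementwise log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def free_energy_estimate(mu, sigma, x, y, n_draws=10,
                         prior_std=1.0, noise_std=0.1, seed=0):
    """Monte Carlo estimate of eq. (3) for a toy linear model y = x @ w.

    mu, sigma: mean and standard deviation of a factorised Gaussian q_theta(w|D).
    The N(0, prior_std^2) prior and the Gaussian likelihood are assumptions.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        w = mu + sigma * rng.standard_normal(mu.shape)     # w^(i) ~ q_theta(w|D)
        log_q = log_gaussian(w, mu, sigma).sum()            # log q_theta(w^(i)|D)
        log_prior = log_gaussian(w, 0.0, prior_std).sum()   # log p(w^(i))
        log_lik = log_gaussian(y, x @ w, noise_std).sum()   # log p(D|w^(i))
        total += log_q - log_prior - log_lik                # summand of eq. (3)
    return total

# Toy data (made up): 50 examples, 3 weights.
rng = np.random.default_rng(1)
x = rng.standard_normal((50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w + 0.1 * rng.standard_normal(50)

near = free_energy_estimate(true_w, np.full(3, 0.05), x, y)        # q centred on the generating weights
far = free_energy_estimate(np.zeros(3), np.full(3, 0.05), x, y)    # q centred far away
```

In this toy setting the estimate for `near` is lower than for `far`, which is the behaviour a minimiser of the cost function relies on. The sum is left unnormalised, as in eq. (3); dividing by `n_draws` would give the per-draw average.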
We sample $w^{(i)}$ from $q_\theta(w|D)$. The uncertainty afforded by neural networks trained with Bayes by Backprop has been used successfully in feedforward neural networks, in both supervised and reinforcement learning environments [1, 22, 13], and in recurrent neural networks [5], but the method has not been applied to convolutional neural networks to date.
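The step between the two lines of eq. (1) is not spelled out in the text. Assuming only Bayes' theorem, $p(w|D) = p(D|w)\,p(w)/p(D)$, and the definition of the KL divergence in eq. (2), it can be sketched as follows:

```latex
\begin{align*}
\mathrm{KL}\left[q_\theta(w|D)\,\|\,p(w|D)\right]
  &= \int q_\theta(w|D)\,\log\frac{q_\theta(w|D)}{p(w|D)}\,dw \\
  &= \int q_\theta(w|D)\,\log\frac{q_\theta(w|D)\,p(D)}{p(D|w)\,p(w)}\,dw \\
  &= \int q_\theta(w|D)\,\log\frac{q_\theta(w|D)}{p(w)}\,dw
     - \int q_\theta(w|D)\,\log p(D|w)\,dw
     + \log p(D)\int q_\theta(w|D)\,dw \\
  &= \mathrm{KL}\left[q_\theta(w|D)\,\|\,p(w)\right]
     - \mathbb{E}_{q(w|\theta)}\left[\log p(D|w)\right]
     + \log p(D).
\end{align*}
```

The last step uses the fact that $q_\theta(w|D)$ integrates to one, which is also why $\log p(D)$ does not depend on $\theta$ and can be dropped from the optimisation.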

4. Bayesian convolutional neural networks with variational inference

In this section, we explain our algorithm of building a CNN with probability distributions over its weights in each filter, as seen in Figure 1, and apply variational inference, i.e. Bayes by Backprop, to compute the intractable true posterior probability distribution, as described in the previous section. Notably, a fully Bayesian perspective on a CNN is for most CNN architectures not accomplished by merely placing probability distributions over weights in convolutional layers; it also requires probability distributions over weights in fully-connected layers (see Figure 2).

Figure 1. Input image with exemplary pixel values, filters, and corresponding output with point-estimates (top) and probability distributions (bottom) over weights.

Figure 2. Fully Bayesian perspective of an exemplary CNN. Weights in filters of convolutional layers, and weights in fully-connected layers have the form of a probability distribution.

4.1. Local reparameterization trick for convolutional layers

We utilise the local reparameterization trick [16] and apply it to CNNs. Following [16, 27], we do not sample the weights $w$, but we sample instead layer activations $b$ due to its consequent computational acceleration. The variational posterior probability distribution $q_\theta(w_{ijhw}|D) = \mathcal{N}(\mu_{ijhw}, \alpha_{ijhw}\mu_{ijhw}^2)$ (where $i$ and $j$ are the input, respectively output layers, $h$ and $w$ the height, respectively width of any given filter) allows to implement the local reparameterization trick in convolutional layers. This results in the subsequent equation for convolutional layer activations $b$:

$$b_j = A_i * \mu_i + \epsilon_j \odot \sqrt{A_i^2 * (\alpha_i \odot \mu_i^2)} \tag{4}$$

where $\epsilon_j \sim \mathcal{N}(0, 1)$, $A_i$ is the receptive field, $*$ signalises the convolutional operation, and $\odot$ the component-wise multiplication.

4.2. Applying two sequential convolutional operations (mean and variance)

The crux of equipping a CNN with probability distributions over weights instead of single point-estimates and being able to update the variational posterior probability distribution $q_\theta(w|D)$ by backpropagation lies in applying two convolutional operations whereas filters with single point-estimates apply one. As explained in the previous section, we deploy the local reparameterization trick and sample from the output $b$. Since the output $b$ is a function of mean $\mu_{ijhw}$ and variance $\alpha_{ijhw}\mu_{ijhw}^2$ among others, we are then able to compute the two variables determining a Gaussian
probability distribution, namely mean $\mu_{ijhw}$ and variance $\alpha_{ijhw}\mu_{ijhw}^2$, separately.

We do this in two convolutional operations: in the first, we treat the output $b$ as an output of a CNN updated by frequentist inference. We optimize with Adam [15] towards a single point-estimate which increases the validation accuracy of the classifications. We interpret this single point-estimate as the mean $\mu_{ijhw}$ of the variational posterior probability distributions $q_\theta(w|D)$. In the second convolutional operation, we learn the variance $\alpha_{ijhw}\mu_{ijhw}^2$. As this formulation of the variance includes the mean $\mu_{ijhw}$, only $\alpha_{ijhw}$ needs to be learned in the second convolutional operation [24]. In this way, we ensure that only one parameter is updated per convolutional operation, exactly how it would have been with a CNN updated by frequentist inference.

In other words, while we learn in the first convolutional operation the MAP of the variational posterior probability distribution $q_\theta(w|D)$, we observe in the second convolutional operation how much values for weights $w$ deviate from this MAP. This procedure is repeated in the fully-connected layers. In addition, to accelerate computation, to ensure a positive non-zero variance $\alpha_{ijhw}\mu_{ijhw}^2$, and to enhance accuracy, we learn $\log \alpha_{ijhw}$ and use the Softplus activation function as further described in the Experiments section.

5. Uncertainty estimation in CNNs

In classification tasks, we are interested in the predictive distribution $p_D(y^*|x^*)$, where $x^*$ is an unseen data example and $y^*$ its predicted class. For a Bayesian neural network, this quantity is given by:

$$p_D(y^*|x^*) = \int p_w(y^*|x^*) \, p_D(w) \, dw \tag{5}$$

In Bayes by Backprop, Gaussian distributions $q_\theta(w|D) \sim \mathcal{N}(w|\mu, \sigma^2)$, where $\theta = \{\mu, \sigma\}$, are learned with some dataset $D = \{x_i, y_i\}_{i=1}^{n}$ as we explained previously. Due to the discrete and finite nature of most classification tasks, the predictive distribution is commonly assumed to be categorical. Incorporating this aspect into the predictive distribution gives us

$$p_D(y^*|x^*) = \int \mathrm{Cat}(y^*|f_w(x^*)) \, \mathcal{N}(w|\mu, \sigma^2) \, dw \tag{6}$$

$$= \int \prod_{c=1}^{C} f(x_c^*|w)^{y_c^*} \, \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(w-\mu)^2}{2\sigma^2}} \, dw \tag{7}$$

where $C$ is the total number of classes and $\sum_c f(x_c^*|w) = 1$.

As there is no closed-form solution due to the lack of conjugacy between categorical and Gaussian distributions, we cannot recover this distribution. However, we can construct an unbiased estimator of the expectation by sampling from $q_\theta(w|D)$:

$$\mathbb{E}_q[p_D(y^*|x^*)] = \int q_\theta(w|D) \, p_w(y^*|x^*) \, dw \tag{8}$$

$$\approx \frac{1}{T} \sum_{t=1}^{T} p_{w_t}(y^*|x^*) \tag{9}$$

where $T$ is the pre-defined number of samples. This estimator allows us to evaluate the uncertainty of our predictions by the definition of variance, hence called predictive variance and denoted as $\mathrm{Var}_q$:

$$\mathrm{Var}_q\big(p(y^*|x^*)\big) = \mathbb{E}_q[y y^\top] - \mathbb{E}_q[y]\,\mathbb{E}_q[y]^\top \tag{10}$$

This quantity can be decomposed into the aleatoric and epistemic uncertainty [14, 20]:

$$\mathrm{Var}_q\big(p(y^*|x^*)\big) = \underbrace{\frac{1}{T} \sum_{t=1}^{T} \mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^\top}_{\text{aleatoric}} + \underbrace{\frac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^\top}_{\text{epistemic}} \tag{11}$$

where $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t$ and $\hat{p}_t = \mathrm{Softmax}\big(f_{w_t}(x^*)\big)$.

It is of paramount importance that uncertainty is split into aleatoric and epistemic quantities since it allows the modeler to evaluate the room for improvements: while aleatoric uncertainty (also known as statistical uncertainty) is merely a measure for the variation of ("noisy") data, epistemic uncertainty is caused by the model. Hence, a modeler can see whether the quality of the data is low (i.e. high aleatoric uncertainty), or the model itself is the cause for poor performances (i.e. high epistemic uncertainty). The former can be improved by gathering more data, whereas the latter requires refining the model [4].

6. Experiments

For all conducted experiments, we implement the foregoing description of Bayesian CNNs with variational inference in LeNet-5 [21] and AlexNet [18]. The exact architecture specifications can be found in the Appendix and in our GitHub repository ([Link]). We train the networks with the MNIST dataset of handwritten digits [21], and with the CIFAR-10 and CIFAR-100 datasets [17] since these datasets serve widely as benchmarks for CNNs' performances. The originally chosen activation functions in all architectures are ReLU, but we must introduce another, called Softplus, see (12), because of our method to apply two convolutional
or fully-connected operations. As aforementioned, one of these is determining the mean $\mu$, and the other the variance $\alpha\mu^2$. Specifically, we apply the Softplus function because we want to ensure that the variance $\alpha\mu^2$ never becomes zero. A zero variance would be equivalent to merely calculating the MAP, which can be interpreted as equivalent to a maximum likelihood estimation (MLE), which is further equivalent to utilising single point-estimates, hence frequentist inference. The Softplus activation function is a smooth approximation of ReLU. Although it is practically not influential, it has the subtle and analytically important advantage that it is strictly positive for every finite $x$ and only approaches zero as $x \to -\infty$, whereas ReLU is exactly zero for all $x \le 0$.

$$\mathrm{Softplus}(x) = \frac{1}{\beta} \cdot \log\big(1 + \exp(\beta \cdot x)\big) \tag{12}$$

where $\beta$ is by default set to 1.

All experiments are performed with the same hyper-parameter settings as stated in the Appendix.

6.1. Datasets

As aforementioned, we train various architectures on multiple datasets, namely MNIST, CIFAR-10, and CIFAR-100.

Classification on MNIST. The MNIST dataset of handwritten digits consists of 60,000 training and 10,000 validation images of 28 by 28 pixels. Each image is labelled with its corresponding number (between zero and nine, inclusive).

Classification on CIFAR-10. The CIFAR-10 dataset consists of 60,000 colour images in 10 classes, with 6,000 images per class, each image 32 by 32 pixels large. Each of the classes has 5,000 training images and 1,000 validation images.

Classification on CIFAR-100. This dataset is similar to CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 validation images per class. The resolution of the images is, as in CIFAR-10, 32 by 32 pixels.

6.2. Results

First, we evaluate the performance of our proposed method, Bayesian CNNs with variational inference. Table 1 shows a comparison of validation accuracies (in percentage) for architectures trained by two disparate Bayesian approaches, namely variational inference, i.e. Bayes by Backprop, and Dropout as proposed by Gal and Ghahramani [7], plus frequentist inference for all three datasets. Bayesian CNNs trained by variational inference achieve validation accuracies comparable to their counter-architectures trained by frequentist inference. On MNIST, validation accuracies of the two disparate Bayesian approaches are comparable, but a Bayesian LeNet-5 with Dropout achieves a considerably higher validation accuracy on CIFAR-10, although we were not able to reproduce these reported results.

                                   MNIST   CIFAR-10   CIFAR-100
Bayesian AlexNet (with VI)         99      73         36
Frequentist AlexNet                99      73         38
Bayesian LeNet-5 (with VI)         98      69         31
Frequentist LeNet-5                98      68         33
Bayesian LeNet-5 (with Dropout)    99.5    83         -

Table 1. Comparison of validation accuracies (in percentage) for different architectures with variational inference (VI), frequentist inference and Dropout as a Bayesian approximation as proposed by Gal and Ghahramani [7] for MNIST, CIFAR-10, and CIFAR-100.

In Figure 3, we show how Bayesian networks naturally incorporate effects of regularization, exemplified on AlexNet. While an AlexNet trained by frequentist inference without any regularization overfits greatly on CIFAR-100, an AlexNet trained by Bayesian inference on CIFAR-100 does not. It performs equivalently to an AlexNet trained by frequentist inference with three layers of Dropout after the first, fourth, and sixth layers in the architecture. In initial epochs, Bayesian CNNs trained by variational inference start with a low validation accuracy compared to architectures trained by frequentist inference. This presumably stems from the initialization of the variational posterior probability distributions $q_\theta(w|D)$ as uniform distributions, while initial point-estimates in architectures trained by frequentist inference are randomly drawn from a standard Gaussian distribution. The latter initialization method ensures the initialized weights are neither too small nor too large. In other words, the motivation of the latter initialization is to start with weights such that the activation functions do not let them begin in saturated or dead regions. This is not true in the case of uniform distributions and hence, Bayesian CNNs' starting validation accuracies can be comparably low.

Figure 3. AlexNet trained on CIFAR-100 by Bayesian and frequentist inference. The frequentist AlexNet without Dropout overfits while the Bayesian AlexNet naturally incorporates an effect of regularization, comparable to a frequentist AlexNet with three Dropout layers.

Figure 4 displays the convergence of the standard deviation $\sigma$ of the variational posterior probability distribution $q_\theta(w|D)$ of a random model parameter over epochs. As aforementioned, all prior probability distributions $p(w)$ are initialized as uniform distributions. The variational posterior probability distributions $q_\theta(w|D)$ are approximated as Gaussian distributions which become more confident as more data is processed, as can be seen from the decreasing standard deviation over epochs in Figure 4. Although the validation accuracy for MNIST on Bayesian LeNet-5 has already reached 99%, we can still see a fairly steep decrease in the parameter's standard deviation. In Figure 5, we plot the actual Gaussian variational posterior probability distributions $q_\theta(w|D)$ of a random parameter of LeNet-5 trained on CIFAR-10 at some epochs.

Figure 4. Convergence of the standard deviation of the Gaussian variational posterior probability distribution $q_\theta(w|D)$ of a random model parameter at epochs 1, 5, 20, 50, and 100. MNIST is trained on Bayesian LeNet-5.

Figure 5. Convergence of the Gaussian variational posterior probability distribution $q_\theta(w|D)$ of a random model parameter at epochs 1, 5, 20, 50, and 100. CIFAR-10 is trained on Bayesian LeNet-5.

Finally, Table 2 compares the means of aleatoric and epistemic uncertainties for a Bayesian LeNet-5 with variational inference on MNIST and CIFAR-10. The aleatoric uncertainty of CIFAR-10 is about twenty times as large as that of MNIST. Considering that the aleatoric uncertainty measures the irreducible variability and depends on the predicted values, a larger aleatoric uncertainty for CIFAR-10 can be directly deduced from its lower validation accuracy and may be further due to the smaller number of training examples. The epistemic uncertainty of CIFAR-10 is about fifteen times larger than that of MNIST, which we anticipated, since epistemic uncertainty decreases as validation accuracy increases.

                              Aleatoric uncertainty   Epistemic uncertainty
Bayesian LeNet-5 (MNIST)      0.0096                  0.0026
Bayesian LeNet-5 (CIFAR-10)   0.1920                  0.0404

Table 2. Aleatoric and epistemic uncertainty for Bayesian LeNet-5 calculated for MNIST and CIFAR-10, computed as proposed by Kwon et al. [20].

7. Conclusion

We propose Bayesian CNNs utilizing Bayes by Backprop as a reliable, variational inference method for CNNs which has not been studied to date, and estimate the models' aleatoric and epistemic uncertainties.

There has been previous work by Gal and Ghahramani [7] who utilized the various outputs of a Dropout function to define a distribution, and concluded that one can then speak of a Bayesian CNN. This approach finds, perhaps also due to its ease, a large confirming audience. However, we argue against this approach, and claim deficiencies. Specifically, in Gal and Ghahramani's [7] approach, no prior probability distributions $p(w)$ are placed on the CNN's parameters. But these are a substantial part of a Bayesian interpretation for the simple reason that Bayes' theorem includes them. Thus we argue that starting with prior probability distributions $p(w)$ is essential in Bayesian methods. In comparison, we place prior probability distributions over all model parameters, and update them according to Bayes' theorem with variational inference, precisely Bayes by Backprop. We show that these neural networks achieve results comparable to those achieved by the same network architectures trained by frequentist inference. Furthermore, we examine how aleatoric and epistemic uncertainties can be computed
for our proposed method and show the natural regularization effect of Bayesian methods.

As an add-on method to further enhance the stability of the optimization, posterior sharpening [5] could be applied to Bayesian CNNs in future work. There, the variational posterior distribution $q_\theta(w|D)$ is conditioned on the training data of a batch $D^{(i)}$. We can see $q_\theta(w|D^{(i)})$ as a proposal distribution, or hyper-prior when we rethink it as a hierarchical model, to improve the gradient estimates of the intractable likelihood function $p(D|w)$.

References

[1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[2] W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems, 5(6):603–643, 1991.
[3] J. S. Denker and Y. LeCun. Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems, pages 853–859, 1991.
[4] A. Der Kiureghian and O. Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.
[5] M. Fortunato, C. Blundell, and O. Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
[6] K. Friston, J. Mattout, N. Trujillo-Barreto, J. Ashburner, and W. Penny. Variational free energy and the Laplace approximation. Neuroimage, 34(1):220–234, 2007.
[7] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[8] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
[9] A. Graves. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.
[12] S. Hochreiter and J. Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems, pages 529–536, 1995.
[13] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. arXiv preprint arXiv:1605.09674, 2016.
[14] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[19] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[20] Y. Kwon, J.-H. Won, B. J. Kim, and M. C. Paik. Uncertainty quantification using Bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. 2018.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Z. C. Lipton, J. Gao, L. Li, X. Li, F. Ahmed, and L. Deng. Efficient exploration for dialogue policy learning with BBQ networks & replay buffer spiking. arXiv preprint arXiv:1608.05081, 2016.
[23] D. J. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[24] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
[25] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
[26] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.
[27] K. Neklyudov, D. Molchanov, A. Ashukha, and D. Vetrov. Variance networks: When expectation does not meet your expectations. arXiv preprint arXiv:1803.03764, 2018.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[30] S. Wang and C. Manning. Fast dropout training. In International Conference on Machine Learning, pages 118–126, 2013.
[31] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.

8. Appendix
8.1. Experiment specifications

variable                                                           value
learning rate                                                      0.001
epochs                                                             100
batch size                                                         128
sample size                                                        10
$(\alpha\mu^2)_{init}$ of approximate posterior $q_\theta(w|D)$    -10
optimizer                                                          Adam [15]
$\lambda$ in $\ell_2$ normalisation                                0.0005
$\beta_i$                                                          $\frac{2^{M-i}}{2^{M}-1}$ [1]
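The "sample size" entry above presumably plays the role of $T$ in eqs. (9) and (11), although the table does not say so explicitly. Under that assumption, the following sketch shows how the softmax outputs of $T$ stochastic forward passes could be turned into the aleatoric and epistemic terms of eq. (11). The function names and the random logits are illustrative, not the authors' code.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decompose_uncertainty(logits):
    """Split the predictive variance of eq. (11) into its two terms.

    logits: array of shape (T, C), one row f_{w_t}(x*) per stochastic forward pass.
    Returns (aleatoric, epistemic), each a C x C matrix.
    """
    p_hat = softmax(logits)            # p_hat_t for t = 1..T, shape (T, C)
    p_bar = p_hat.mean(axis=0)         # p_bar, shape (C,)
    T = p_hat.shape[0]
    # (1/T) sum_t [diag(p_hat_t) - p_hat_t p_hat_t^T]
    aleatoric = np.diag(p_bar) - p_hat.T @ p_hat / T
    # (1/T) sum_t (p_hat_t - p_bar)(p_hat_t - p_bar)^T
    diff = p_hat - p_bar
    epistemic = diff.T @ diff / T
    return aleatoric, epistemic

# Made-up logits for one input: T = 10 passes, C = 3 classes.
rng = np.random.default_rng(0)
logits = rng.standard_normal((10, 3))
aleatoric, epistemic = decompose_uncertainty(logits)
p_bar = softmax(logits).mean(axis=0)
```

A useful sanity check is that the two terms add up to `np.diag(p_bar) - np.outer(p_bar, p_bar)`, the covariance of a categorical distribution with mean `p_bar`. How the matrices are reduced to the scalar means reported in Table 2 is not specified in the text, so no reduction is shown here.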

8.2. Model architectures


8.3. LeNet-5

layer type             width   stride   padding   input shape          nonlinearity
convolution (5 × 5)    6       1        0         M × 1 × 32 × 32      Softplus
max-pooling (2 × 2)    -       2        0         M × 6 × 28 × 28      -
convolution (5 × 5)    16      1        0         M × 6 × 14 × 14      Softplus
max-pooling (2 × 2)    -       2        0         M × 16 × 10 × 10     -
fully-connected        120     -        -         M × 400              Softplus
fully-connected        84      -        -         M × 120              Softplus
fully-connected        10      -        -         M × 84               -
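As an illustration of how the first convolution in this table could be combined with the local reparameterization trick of eq. (4), here is a minimal NumPy sketch. It is an assumption-laden toy, not the authors' implementation: `conv2d` and `bayes_conv2d` are hypothetical helper names, the weights are random, and `log_alpha = -10` merely echoes the initial value listed in the experiment specifications.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, w):
    """Valid cross-correlation with stride 1 (the usual deep-learning 'convolution').

    x: (M, C_in, H, W); w: (C_out, C_in, kH, kW) -> (M, C_out, H-kH+1, W-kW+1).
    """
    kh, kw = w.shape[2], w.shape[3]
    windows = sliding_window_view(x, (kh, kw), axis=(2, 3))  # (M, C_in, H', W', kH, kW)
    return np.einsum("mchwij,ocij->mohw", windows, w)

def bayes_conv2d(x, mu, log_alpha, rng):
    """Sample layer activations b as in eq. (4) instead of sampling weights."""
    alpha = np.exp(log_alpha)                  # learning log(alpha) keeps alpha > 0
    mean = conv2d(x, mu)                       # first convolution:  A_i * mu_i
    var = conv2d(x ** 2, alpha * mu ** 2)      # second convolution: A_i^2 * (alpha_i . mu_i^2)
    eps = rng.standard_normal(mean.shape)      # eps_j ~ N(0, 1)
    return mean + eps * np.sqrt(var)

rng = np.random.default_rng(0)
M = 2                                          # batch size for the demo
x = rng.standard_normal((M, 1, 32, 32))        # input shape M x 1 x 32 x 32
mu = 0.1 * rng.standard_normal((6, 1, 5, 5))   # 6 filters of size 5 x 5
log_alpha = np.full(mu.shape, -10.0)           # illustrative initial value
b = bayes_conv2d(x, mu, log_alpha, rng)
print(b.shape)
```

The output shape is M × 6 × 28 × 28, which matches the input shape of the following max-pooling row. Because the variance convolution operates on squared inputs and non-negative filters, the argument of the square root is never negative.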

8.4. AlexNet

layer type               width   stride   padding   input shape           nonlinearity
convolution (11 × 11)    64      4        5         M × 3 × 32 × 32       Softplus
max-pooling (2 × 2)      -       2        0         M × 64 × 32 × 32      -
convolution (5 × 5)      192     1        2         M × 64 × 15 × 15      Softplus
max-pooling (2 × 2)      -       2        0         M × 192 × 15 × 15     -
convolution (3 × 3)      384     1        1         M × 192 × 7 × 7       Softplus
convolution (3 × 3)      256     1        1         M × 384 × 7 × 7       Softplus
convolution (3 × 3)      128     1        1         M × 256 × 7 × 7       Softplus
max-pooling (2 × 2)      -       2        0         M × 128 × 7 × 7       -
fully-connected          128     -        -         M × 128               -
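Every nonlinearity listed in the two architecture tables is Softplus, eq. (12). The short sketch below, which assumes NumPy and uses illustrative function names, evaluates eq. (12) in a numerically stable way and contrasts it with ReLU for a negative input.

```python
import numpy as np

def softplus(x, beta=1.0):
    """Softplus(x) = (1/beta) * log(1 + exp(beta * x)), eq. (12).

    np.logaddexp(0, z) computes log(exp(0) + exp(z)) without overflow for large z.
    """
    x = np.asarray(x, dtype=float)
    return np.logaddexp(0.0, beta * x) / beta

def relu(x):
    """ReLU(x) = max(x, 0), for comparison."""
    return np.maximum(np.asarray(x, dtype=float), 0.0)

xs = np.array([-20.0, -1.0, 0.0, 1.0, 50.0])
sp = softplus(xs)
rl = relu(xs)
```

For `x = -20` the Softplus value is tiny but still positive, while ReLU returns exactly zero; for large positive inputs the two agree closely. In floating point, Softplus does underflow to zero for extremely negative inputs, so the strict positivity is an analytical property rather than a numerical guarantee.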

