Introduction to Bayesian Optimization
Javier González
Masterclass, 7-February, 2107 @Lancaster University
Big picture
“Civilization advances by extending the
number of important operations which
we can perform without thinking of them.”
(Alfred North Whitehead)
We are interested on optimizing data science pipelines:
I Automatic model configuration.
I Automate the design of physical experiments.
Agenda of the day
I 9:00-11:00, Introduction to Bayesian Optimization:
I What is BayesOpt and why it works?
I Relevant things to know.
I 11:30-13:00, Connections, extensions and
applications:
I Extensions to multi-task problems, constrained domains,
early-stopping, high dimensions.
I Connections to Armed bandits and ABC.
I An applications in genetics.
I 14:00-16:00, GPyOpt LAB!: Bring your own problem!
I 16:30-15:30, Hot topics current challenges:
I Parallelization.
I Non-myopic methods
I Interactive Bayesian Optimization.
Section I: Introduction to Bayesian Optimization
I What is BayesOpt and why it works?
I Relevant things to know.
Data Science pipeline/Autonomous System
Challenges and needs for automation
Features Results
Data Modelling Interpretation/
Collection extraction
Decision
Filters, Model tuning and
dimensionality configuration, code Data visualisation,
Optimal design
reduction optimisation Sequential
experimentation
Pipeline improvement / Interaction with environment
Experimental Design - Uncertainty Quantification
Can we automate/simplify the process of designing complex experiments?
Emulator - Simulator - Physical system
Global optimization
Consider a ‘well behaved’ function f : X → R where X ⊆ RD is
a bounded domain.
xM = arg min f (x).
x∈X
I f is explicitly unknown and multimodal.
I Evaluations of f may be perturbed.
I Evaluations of f are expensive.
Expensive functions, who doesn’t have one?
Parameter tuning in ML algorithms.
I Number of layers/units per layer
I Weight penalties
I Learning rates, etc.
Figure source: [Link]
Expensive functions, who doesn’t have one?
Active Path Finding in Middle Level
Optimise the location of a sequence of waypoints in a map to
navigate from a location to a destination.
Expensive functions, who doesn’t have one?
Tuning websites with A/B testing
Optimize the web design to maximize sign-ups, downloads,
purchases, etc.
Expensive functions, who doesn’t have one?
[González, Lonworth, James and Lawrence, NIPS workshops 2014, 2015]
Design of experiments: gene optimization
I Use mammalian cells to make protein products.
I Control the ability of the cell-factory to use synthetic DNA.
Optimize genes (ATTGGTUGA...) to best enable the
cell-factory to operate most efficiently.
Expensive functions, who doesn’t have one?
Many other problems:
I Robotics, control, reinforcement learning.
I Scheduling, planning
I compilers, hardware, software?
I Intractable likelihoods.
What to do?
Option 1: Use previous knowledge
To select the parameters at hand. Perhaps not very scientific
but still in use...
What to do?
Option 2: Grid search?
If f is L-Lipschitz continuous and we are in a noise-free domain
to guarantee that we propose some xM,n such that
f (xM ) − f (xM,n ) ≤
we need to evaluate f on a D-dimensional unit hypercube:
(L/)D evaluations!
Example: (10/0.01)5 = 10e14...
... but function evaluations are very expensive!
What to do?
Option 3: Random search?
We can sample the space uniformly [Bergstra and Bengio 2012]
Better than grid search in various senses but still expensive to
guarantee good coverage.
What to do?
Key question:
Can we do better?
Problem (the audience is encouraged to participate!)
I Find the optimum of some function f in the interval [0,1].
I f is L-Lipchitz continuous and differentiable.
I Evaluations of f are exact and we have 4 of them!
Situation
We have a few function evaluations
Where is the minimum of f ?
Where should the take the next evaluation?
Intuitive solution
One curve
Intuitive solution
Three curves
Intuitive solution
Ten curves
Intuitive solution
Hundred curves
Intuitive solution
Many curves
Intuitive solution
Infinite curves
General idea: surrogate modelling
1. Use a surrogate model of f to carry out the optimization.
2. Define an utility function to collect new data points
satisfying some optimality criterion: optimization as
decision.
3. Study decision problems as inference using the surrogate
model: use a probabilistic model able to calibrate both,
epistemic and aleatoric uncertainty.
Uncertainty Quantification
Utility functions
The utility should represent our design goal:.
1. Active Learning and experimental design: Maximize the
differential entropy of the posterior distribution p(f |X, y)
(D-optimality in experimental design).
2. Minimize the loss in a sequence x1 , . . . , xn
N
X
rN = f (xn ) − N f (xM )
n=1
(1) does to a lot exploration whereas (2) encourages
exploitation about the minimum of f .
Bayesian Optimisation
[Mockus, 1978]
Methodology to perform global optimisation of multimodal
black-box functions.
1. Choose some prior measure over the space of possible
objectives f .
2. Combine prior and the likelihood to get a posterior
measure over the objective given some observations.
3. Use the posterior to decide where to take the next
evaluation according to some acquisition/loss function.
4. Augment the data.
Iterate between 2 and 4 until the evaluation budget is over.
Surrogate model: Gaussian process
Default Choice: Gaussian processes [Rasmunsen and Williams, 2006]
Infinite-dimensional probability density, such that each linear
finite-dimensional restriction is multivariate Gaussian.
I Model f (x) ∼ GP(µ(x), k(x, x0 )) is determined by the mean
function m(x) and covariance function k(x, x0 ; θ).
I Posterior mean µ(x; θ, D) and variance σ(x; θ, D) can be
computed explicitly given a dataset D.
Other models are also possible: Random Forrest
[Criminisi et al, 2011]
Other models are also possible: t-Student processes
Exploration vs. exploitation
Bayesian optimization explains human active search
[Borji and Itti, 2013]
Exploration vs. exploitation
Picture source: [Link]
GP Upper (lower) Confidence Band
[Srinivas et al., 2010]
Direct balance between exploration and exploitation:
αLCB (x; θ, D) = −µ(x; θ, D) + βt σ(x; θ, D)
GP Upper (lower) Confidence Band
[Srinivas et al., 2010]
I In noiseless cases, it is a lower bound of the function to
minimize.
I This allows to computer a bound on how close we are to
the minimum.
I Optimal choices available for the ’regularization parameter’.
Expected Improvement
[Jones et al., 1998]
Z
αEI (x; θ, D) = max(0, ybest − y)p(y|x; θ, D)dy
y
Expected Improvement
[Jones et al., 1998]
I Perhaps the most used acquisition.
I Explicit for available for Gaussian posteriors.
I It is too greedy in some problems. It is possible to make
more explorative adding a ’explorative’ parameter
αEI (x; θ, D) = σ(x; θ, D)(γ(x)Φ(γ(x))) + N (γ(x); 0, 1).
where
f (xbest ) − µ(x; θ, D) + ψ
γ(x) = .
σ(x; θ, D)
Maximum Probability of Improvement
[Hushner, 1964]
γ(x) = σ(x; θ, D)−1 (µ(x; θ, D) − ybest )
αM P I (x; θ, D) = p(f (x) < ybest ) = Φ(γ(x))
Maximum Probability of Improvement
[Hushner, 1964]
I First used acquisition: very intuitive.
I Less used in practice.
I Explicit for available for Gaussian posteriors.
αM P I (x; θ, D) = Φ(γ(x))).
where
f (xbest ) − µ(x; θ, D) + ψ
γ(x) = .
σ(x; θ, D)
Information-theoretic approaches
[Hennig and Schuler, 2013; Hernández-Lobato et al., 2014]
αES (x; θ, D) = H[p(xmin |D)] − Ep(y|D,x) [H[p(xmin |D ∪ {x, y})]]
Information-theoretic approaches
Uses the distribution of the minimum
Z Y
pmin (x) ≡ p[x = arg min f (x)] = p(f ) θ[f (x̃) − f (x)]df
f :I→< x̃∈I
x̃6=x
where θ is the Heaviside’s step function. No closed form!
Use Thomson sampling to approximate the distribution.
Generate many sample paths from the GP, optimize them to
take samples from pmin (x).
Thomson sampling
Probability matching
αT HOM SON (x; θ, D) = g(x)
g(x) is sampled form GP(µ(x), k(x, x0 ))
Thompson sampling
Probability matching [Rahimi and B. Recht, 2007]
I It is easy to generate posterior samples of a GP at a finite
set of locations.
I More difficult is to generate ‘continuous’ samples.
Possible using the Bochner’s lemma: existence of the Fourier
dual of k, s(ω) which is equal to the spectral density of k
h T 0
i
k(x, x0 ) = νEω e−iω (x−x ) = 2νEω,b cos(ωxT + b) cos(ωxT + b)
With sampling and this lemma (taking p(w) = s(ω)/ν and
b ∼ U[0, 2π]) we can construct a feature based approximation
for sample paths of the GP.
m
0 ν X −iω(i)T x −iω(i)T x0
k(x, x ) ≈ e e
m
i=1
The choice of utility matters
[Hoffman, Shahriari and de Freitas, 2013]
The choice of the utility may change a lot the result of the
optimisation.
The choice of utility in practice
[Hoffman, Shahriari and de Freitas, 2013]
The best utility depends on the problem and the level of
exploration/exploitation required.
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Bayesian Optimization
As a ’mapping’ between two problems
BO is an strategy to transform the problem
xM = arg min f (x)
x∈X
solvable!
into a series of problems:
xn+1 = arg max α(x; Dn , Mn )
x∈X
solvable!
where now:
I α(x) is inexpensive to evaluate.
I The gradients of α(x) are typically available.
I Still need to find xn+1 .
BO vs other methods
[Osborne et al, 2009]
Bayesian optimization works better in practice!
Recap
I Bayesian optimization is a way of encoding our beliefs
about a property of a function (the minimum)
I Two key elements: the model and the acquisition function.
I Many choices in both cases, especially in terms of the
acquisition function used.
I The key is to find a good balance between exploration and
exploitation.
Main issues
I What to do with the hyper-parameters of the model?
I How to select points to initialize the model?
I How to optimize the acquisition function?
BO independent of the parameters of the GP.
[Snoek et al. 2012]
Integrate out across parameter values or location outputs.
How to initialise the model?
I One point in the centre of the domain.
I Uniformly selected random locations.
I Latin design.
I Halton sequences.
I Determinantal point processes.
The idea is always to start at some locations trying to minimise
the initial model uncertainty.
Latin design
n × n array filled with n different symbols, each occurring
exactly once in each row and exactly once in each column.
pyDOE
Python framework for standard experimental design
Latin design
Window honors Ronald Fisher. Fisher’s student, A. W. F.
Edwards, designed this window for Caius College, Cambridge.
Halton sequences
[Halton, 1964]
I Used to generate points in (0, 1) × (0, 1)
I Sequence that is constructed according to a deterministic
method that uses a prime number as its base.
Figure source: Wikipedia
Halton sequences
[Halton, 1964]
Better coverage than random.
Halton Random
Figure source: Wikipedia
Determinantal point processes
Kulesza and Taskar, [2012]
We say that X is a ‘determinantal point process’ on Λ with
kernel K if it is a simple point process on Λ with a joint
intensity or ‘correlation function’ given by
ρn (x1 , . . . , xn ) = det(K(xi , xj )1≤i,j≤n )
I Probability measures over subsets.
I Possible to characterise the samples in terms of quality and
diversity.
Determinantal point processes
Kulesza and Taskar, [2012]
Key idea:
Determinantal point processes
Kulesza and Taskar, [2012]
Methods to optimise the acquisition function
This may not be easy.
I Gradient descent methods: Conjugate gradient, BFGS, etc.
I Lipschitz based heuristics: DIRECT.
I Evolutionary algorithms: CMA.
Some of these methods can also be used to directly optimize f
Gradient descent
[Avriel,2013], but many others
We need to know the gradients. This is the case for most
acquisitions but not for all of them (PES for instance).
Gradient descent
May fall in local minima if the function is multimodal: multiple
initializations.
‘DIviding RECTangles’, DIRECT
[Perttunen at al. 1993]
Minimal hypothesis about the acquisition
‘DIviding RECTangles’, DIRECT
[Perttunen at al. 1993]
Finds good solution in general and doesn’t need gradient. Not
generalizable to non-squared domains.
Covariance Matrix Adaptation, CMA
[Hansen and Ostermeier, 2001].
I Sample for a Gaussian with some mean µ and covariance
matrix Σ.
I Select the best points and use them to update µ and Σ.
I Sample form the new Gaussian.
Took a while to start using these ideas in ML
Although in the stats community have been there for a while
I BO depends on its own parameters.
I Lack of software to apply these methods as a black
optimization boxes.
I Reduced scalability in dimensions and number of
evaluations (this is still a problem).
Practical Bayesian Optimization of Machine Learning
Algorithms. Snoek, Larochelle and Adams. NIPS 2012
(Spearmint)
Increasing popular field
I Hot topic in Machine Learning.
I The BO workshop at NIPS is well stablished and it is a
mini-conference itself.
Bayesian optimization now
It has become increasingly popular since it allows to configure
algorithms without human intervention.
BO takes to human out of the loop!
BO in industry: Twitter
BO in industry: Uber
Questions?