0% found this document useful (0 votes)
7 views78 pages

LancasterMasterclass 1

The document provides an introduction to Bayesian Optimization, focusing on its application in optimizing data science pipelines and experimental design. It outlines the agenda for a masterclass, discusses the challenges of optimizing expensive functions, and presents various methods for surrogate modeling and utility functions. Key concepts include exploration vs. exploitation, the use of Gaussian processes, and the importance of acquisition functions in the optimization process.

Uploaded by

Jordi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views78 pages

LancasterMasterclass 1

The document provides an introduction to Bayesian Optimization, focusing on its application in optimizing data science pipelines and experimental design. It outlines the agenda for a masterclass, discusses the challenges of optimizing expensive functions, and presents various methods for surrogate modeling and utility functions. Key concepts include exploration vs. exploitation, the use of Gaussian processes, and the importance of acquisition functions in the optimization process.

Uploaded by

Jordi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Introduction to Bayesian Optimization

Javier González

Masterclass, 7-February, 2107 @Lancaster University


Big picture

“Civilization advances by extending the


number of important operations which
we can perform without thinking of them.”
(Alfred North Whitehead)

We are interested on optimizing data science pipelines:

I Automatic model configuration.


I Automate the design of physical experiments.
Agenda of the day
I 9:00-11:00, Introduction to Bayesian Optimization:
I What is BayesOpt and why it works?
I Relevant things to know.

I 11:30-13:00, Connections, extensions and


applications:
I Extensions to multi-task problems, constrained domains,
early-stopping, high dimensions.
I Connections to Armed bandits and ABC.
I An applications in genetics.

I 14:00-16:00, GPyOpt LAB!: Bring your own problem!

I 16:30-15:30, Hot topics current challenges:


I Parallelization.
I Non-myopic methods
I Interactive Bayesian Optimization.
Section I: Introduction to Bayesian Optimization
I What is BayesOpt and why it works?
I Relevant things to know.
Data Science pipeline/Autonomous System
Challenges and needs for automation

Features Results
Data Modelling Interpretation/
Collection extraction
Decision
Filters, Model tuning and
dimensionality configuration, code Data visualisation,
Optimal design
reduction optimisation Sequential
experimentation

Pipeline improvement / Interaction with environment


Experimental Design - Uncertainty Quantification
Can we automate/simplify the process of designing complex experiments?

Emulator - Simulator - Physical system


Global optimization
Consider a ‘well behaved’ function f : X → R where X ⊆ RD is
a bounded domain.

xM = arg min f (x).


x∈X

I f is explicitly unknown and multimodal.


I Evaluations of f may be perturbed.
I Evaluations of f are expensive.
Expensive functions, who doesn’t have one?
Parameter tuning in ML algorithms.

I Number of layers/units per layer


I Weight penalties
I Learning rates, etc.

Figure source: [Link]


Expensive functions, who doesn’t have one?
Active Path Finding in Middle Level

Optimise the location of a sequence of waypoints in a map to


navigate from a location to a destination.
Expensive functions, who doesn’t have one?

Tuning websites with A/B testing

Optimize the web design to maximize sign-ups, downloads,


purchases, etc.
Expensive functions, who doesn’t have one?
[González, Lonworth, James and Lawrence, NIPS workshops 2014, 2015]

Design of experiments: gene optimization

I Use mammalian cells to make protein products.


I Control the ability of the cell-factory to use synthetic DNA.

Optimize genes (ATTGGTUGA...) to best enable the


cell-factory to operate most efficiently.
Expensive functions, who doesn’t have one?

Many other problems:

I Robotics, control, reinforcement learning.


I Scheduling, planning
I compilers, hardware, software?
I Intractable likelihoods.
What to do?

Option 1: Use previous knowledge

To select the parameters at hand. Perhaps not very scientific


but still in use...
What to do?

Option 2: Grid search?

If f is L-Lipschitz continuous and we are in a noise-free domain


to guarantee that we propose some xM,n such that

f (xM ) − f (xM,n ) ≤ 

we need to evaluate f on a D-dimensional unit hypercube:

(L/)D evaluations!

Example: (10/0.01)5 = 10e14...


... but function evaluations are very expensive!
What to do?
Option 3: Random search?

We can sample the space uniformly [Bergstra and Bengio 2012]

Better than grid search in various senses but still expensive to


guarantee good coverage.
What to do?

Key question:

Can we do better?
Problem (the audience is encouraged to participate!)

I Find the optimum of some function f in the interval [0,1].

I f is L-Lipchitz continuous and differentiable.

I Evaluations of f are exact and we have 4 of them!


Situation
We have a few function evaluations

Where is the minimum of f ?


Where should the take the next evaluation?
Intuitive solution
One curve
Intuitive solution
Three curves
Intuitive solution
Ten curves
Intuitive solution
Hundred curves
Intuitive solution
Many curves
Intuitive solution
Infinite curves
General idea: surrogate modelling

1. Use a surrogate model of f to carry out the optimization.

2. Define an utility function to collect new data points


satisfying some optimality criterion: optimization as
decision.

3. Study decision problems as inference using the surrogate


model: use a probabilistic model able to calibrate both,
epistemic and aleatoric uncertainty.

Uncertainty Quantification
Utility functions

The utility should represent our design goal:.

1. Active Learning and experimental design: Maximize the


differential entropy of the posterior distribution p(f |X, y)
(D-optimality in experimental design).

2. Minimize the loss in a sequence x1 , . . . , xn


N
X
rN = f (xn ) − N f (xM )
n=1

(1) does to a lot exploration whereas (2) encourages


exploitation about the minimum of f .
Bayesian Optimisation
[Mockus, 1978]

Methodology to perform global optimisation of multimodal


black-box functions.

1. Choose some prior measure over the space of possible


objectives f .

2. Combine prior and the likelihood to get a posterior


measure over the objective given some observations.

3. Use the posterior to decide where to take the next


evaluation according to some acquisition/loss function.

4. Augment the data.

Iterate between 2 and 4 until the evaluation budget is over.


Surrogate model: Gaussian process
Default Choice: Gaussian processes [Rasmunsen and Williams, 2006]

Infinite-dimensional probability density, such that each linear


finite-dimensional restriction is multivariate Gaussian.

I Model f (x) ∼ GP(µ(x), k(x, x0 )) is determined by the mean


function m(x) and covariance function k(x, x0 ; θ).
I Posterior mean µ(x; θ, D) and variance σ(x; θ, D) can be
computed explicitly given a dataset D.
Other models are also possible: Random Forrest
[Criminisi et al, 2011]
Other models are also possible: t-Student processes
Exploration vs. exploitation

Bayesian optimization explains human active search

[Borji and Itti, 2013]


Exploration vs. exploitation

Picture source: [Link]


GP Upper (lower) Confidence Band
[Srinivas et al., 2010]

Direct balance between exploration and exploitation:


αLCB (x; θ, D) = −µ(x; θ, D) + βt σ(x; θ, D)
GP Upper (lower) Confidence Band
[Srinivas et al., 2010]

I In noiseless cases, it is a lower bound of the function to


minimize.
I This allows to computer a bound on how close we are to
the minimum.
I Optimal choices available for the ’regularization parameter’.
Expected Improvement
[Jones et al., 1998]

Z
αEI (x; θ, D) = max(0, ybest − y)p(y|x; θ, D)dy
y
Expected Improvement
[Jones et al., 1998]

I Perhaps the most used acquisition.


I Explicit for available for Gaussian posteriors.
I It is too greedy in some problems. It is possible to make
more explorative adding a ’explorative’ parameter

αEI (x; θ, D) = σ(x; θ, D)(γ(x)Φ(γ(x))) + N (γ(x); 0, 1).


where
f (xbest ) − µ(x; θ, D) + ψ
γ(x) = .
σ(x; θ, D)
Maximum Probability of Improvement
[Hushner, 1964]

γ(x) = σ(x; θ, D)−1 (µ(x; θ, D) − ybest )


αM P I (x; θ, D) = p(f (x) < ybest ) = Φ(γ(x))
Maximum Probability of Improvement
[Hushner, 1964]

I First used acquisition: very intuitive.


I Less used in practice.
I Explicit for available for Gaussian posteriors.

αM P I (x; θ, D) = Φ(γ(x))).
where
f (xbest ) − µ(x; θ, D) + ψ
γ(x) = .
σ(x; θ, D)
Information-theoretic approaches
[Hennig and Schuler, 2013; Hernández-Lobato et al., 2014]

αES (x; θ, D) = H[p(xmin |D)] − Ep(y|D,x) [H[p(xmin |D ∪ {x, y})]]


Information-theoretic approaches

Uses the distribution of the minimum


Z Y
pmin (x) ≡ p[x = arg min f (x)] = p(f ) θ[f (x̃) − f (x)]df
f :I→< x̃∈I
x̃6=x

where θ is the Heaviside’s step function. No closed form!

Use Thomson sampling to approximate the distribution.


Generate many sample paths from the GP, optimize them to
take samples from pmin (x).
Thomson sampling
Probability matching

αT HOM SON (x; θ, D) = g(x)


g(x) is sampled form GP(µ(x), k(x, x0 ))
Thompson sampling
Probability matching [Rahimi and B. Recht, 2007]

I It is easy to generate posterior samples of a GP at a finite


set of locations.
I More difficult is to generate ‘continuous’ samples.

Possible using the Bochner’s lemma: existence of the Fourier


dual of k, s(ω) which is equal to the spectral density of k
h T 0
i
k(x, x0 ) = νEω e−iω (x−x ) = 2νEω,b cos(ωxT + b) cos(ωxT + b)
 

With sampling and this lemma (taking p(w) = s(ω)/ν and


b ∼ U[0, 2π]) we can construct a feature based approximation
for sample paths of the GP.
m
0 ν X −iω(i)T x −iω(i)T x0
k(x, x ) ≈ e e
m
i=1
The choice of utility matters
[Hoffman, Shahriari and de Freitas, 2013]

The choice of the utility may change a lot the result of the
optimisation.
The choice of utility in practice
[Hoffman, Shahriari and de Freitas, 2013]

The best utility depends on the problem and the level of


exploration/exploitation required.
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Bayesian Optimization
As a ’mapping’ between two problems

BO is an strategy to transform the problem


xM = arg min f (x)
x∈X
solvable!

into a series of problems:


xn+1 = arg max α(x; Dn , Mn )
x∈X
solvable!
where now:

I α(x) is inexpensive to evaluate.


I The gradients of α(x) are typically available.
I Still need to find xn+1 .
BO vs other methods
[Osborne et al, 2009]

Bayesian optimization works better in practice!


Recap

I Bayesian optimization is a way of encoding our beliefs


about a property of a function (the minimum)

I Two key elements: the model and the acquisition function.

I Many choices in both cases, especially in terms of the


acquisition function used.

I The key is to find a good balance between exploration and


exploitation.
Main issues

I What to do with the hyper-parameters of the model?

I How to select points to initialize the model?

I How to optimize the acquisition function?


BO independent of the parameters of the GP.
[Snoek et al. 2012]

Integrate out across parameter values or location outputs.


How to initialise the model?

I One point in the centre of the domain.


I Uniformly selected random locations.
I Latin design.
I Halton sequences.
I Determinantal point processes.

The idea is always to start at some locations trying to minimise


the initial model uncertainty.
Latin design

n × n array filled with n different symbols, each occurring


exactly once in each row and exactly once in each column.
pyDOE

Python framework for standard experimental design


Latin design
Window honors Ronald Fisher. Fisher’s student, A. W. F.
Edwards, designed this window for Caius College, Cambridge.
Halton sequences
[Halton, 1964]

I Used to generate points in (0, 1) × (0, 1)


I Sequence that is constructed according to a deterministic
method that uses a prime number as its base.

Figure source: Wikipedia


Halton sequences
[Halton, 1964]

Better coverage than random.

Halton Random
Figure source: Wikipedia
Determinantal point processes
Kulesza and Taskar, [2012]

We say that X is a ‘determinantal point process’ on Λ with


kernel K if it is a simple point process on Λ with a joint
intensity or ‘correlation function’ given by

ρn (x1 , . . . , xn ) = det(K(xi , xj )1≤i,j≤n )

I Probability measures over subsets.


I Possible to characterise the samples in terms of quality and
diversity.
Determinantal point processes
Kulesza and Taskar, [2012]

Key idea:
Determinantal point processes
Kulesza and Taskar, [2012]
Methods to optimise the acquisition function

This may not be easy.

I Gradient descent methods: Conjugate gradient, BFGS, etc.


I Lipschitz based heuristics: DIRECT.
I Evolutionary algorithms: CMA.

Some of these methods can also be used to directly optimize f


Gradient descent
[Avriel,2013], but many others

We need to know the gradients. This is the case for most


acquisitions but not for all of them (PES for instance).
Gradient descent

May fall in local minima if the function is multimodal: multiple


initializations.
‘DIviding RECTangles’, DIRECT
[Perttunen at al. 1993]

Minimal hypothesis about the acquisition


‘DIviding RECTangles’, DIRECT
[Perttunen at al. 1993]

Finds good solution in general and doesn’t need gradient. Not


generalizable to non-squared domains.
Covariance Matrix Adaptation, CMA
[Hansen and Ostermeier, 2001].

I Sample for a Gaussian with some mean µ and covariance


matrix Σ.
I Select the best points and use them to update µ and Σ.
I Sample form the new Gaussian.
Took a while to start using these ideas in ML

Although in the stats community have been there for a while

I BO depends on its own parameters.

I Lack of software to apply these methods as a black


optimization boxes.

I Reduced scalability in dimensions and number of


evaluations (this is still a problem).

Practical Bayesian Optimization of Machine Learning


Algorithms. Snoek, Larochelle and Adams. NIPS 2012
(Spearmint)
Increasing popular field

I Hot topic in Machine Learning.


I The BO workshop at NIPS is well stablished and it is a
mini-conference itself.
Bayesian optimization now

It has become increasingly popular since it allows to configure


algorithms without human intervention.

BO takes to human out of the loop!


BO in industry: Twitter
BO in industry: Uber
Questions?

You might also like