0% found this document useful (0 votes)

7 views78 pages

LancasterMasterclass 1

The document provides an introduction to Bayesian Optimization, focusing on its application in optimizing data science pipelines and experimental design. It outlines the agenda for a masterclass, discusses the challenges of optimizing expensive functions, and presents various methods for surrogate modeling and utility functions. Key concepts include exploration vs. exploitation, the use of Gaussian processes, and the importance of acquisition functions in the optimization process.

Uploaded by

Jordi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views78 pages

LancasterMasterclass 1

Uploaded by

Jordi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Introduction to Bayesian Optimization

Javier González

Masterclass, 7-February, 2107 @Lancaster University

Big picture

“Civilization advances by extending the

number of important operations which
we can perform without thinking of them.”
(Alfred North Whitehead)

We are interested on optimizing data science pipelines:

I Automatic model configuration.

I Automate the design of physical experiments.
Agenda of the day
I 9:00-11:00, Introduction to Bayesian Optimization:
I What is BayesOpt and why it works?
I Relevant things to know.

I 11:30-13:00, Connections, extensions and

applications:
I Extensions to multi-task problems, constrained domains,
early-stopping, high dimensions.
I Connections to Armed bandits and ABC.
I An applications in genetics.

I 14:00-16:00, GPyOpt LAB!: Bring your own problem!

I 16:30-15:30, Hot topics current challenges:

I Parallelization.
I Non-myopic methods
I Interactive Bayesian Optimization.
Section I: Introduction to Bayesian Optimization
I What is BayesOpt and why it works?
I Relevant things to know.
Data Science pipeline/Autonomous System
Challenges and needs for automation

Features Results
Data Modelling Interpretation/
Collection extraction
Decision
Filters, Model tuning and
dimensionality configuration, code Data visualisation,
Optimal design
reduction optimisation Sequential
experimentation

Pipeline improvement / Interaction with environment

Experimental Design - Uncertainty Quantification
Can we automate/simplify the process of designing complex experiments?

Emulator - Simulator - Physical system

Global optimization
Consider a ‘well behaved’ function f : X → R where X ⊆ RD is
a bounded domain.

xM = arg min f (x).

x∈X

I f is explicitly unknown and multimodal.

I Evaluations of f may be perturbed.
I Evaluations of f are expensive.
Expensive functions, who doesn’t have one?
Parameter tuning in ML algorithms.

I Number of layers/units per layer

I Weight penalties
I Learning rates, etc.

Figure source: [Link]

Expensive functions, who doesn’t have one?
Active Path Finding in Middle Level

Optimise the location of a sequence of waypoints in a map to

navigate from a location to a destination.
Expensive functions, who doesn’t have one?

Tuning websites with A/B testing

Optimize the web design to maximize sign-ups, downloads,

purchases, etc.
Expensive functions, who doesn’t have one?
[González, Lonworth, James and Lawrence, NIPS workshops 2014, 2015]

Design of experiments: gene optimization

I Use mammalian cells to make protein products.

I Control the ability of the cell-factory to use synthetic DNA.

Optimize genes (ATTGGTUGA...) to best enable the

cell-factory to operate most efficiently.
Expensive functions, who doesn’t have one?

Many other problems:

I Robotics, control, reinforcement learning.

I Scheduling, planning
I compilers, hardware, software?
I Intractable likelihoods.
What to do?

Option 1: Use previous knowledge

To select the parameters at hand. Perhaps not very scientific

but still in use...
What to do?

Option 2: Grid search?

If f is L-Lipschitz continuous and we are in a noise-free domain

to guarantee that we propose some xM,n such that

f (xM ) − f (xM,n ) ≤

we need to evaluate f on a D-dimensional unit hypercube:

(L/)D evaluations!

Example: (10/0.01)5 = 10e14...

... but function evaluations are very expensive!
What to do?
Option 3: Random search?

We can sample the space uniformly [Bergstra and Bengio 2012]

Better than grid search in various senses but still expensive to

guarantee good coverage.
What to do?

Key question:

Can we do better?
Problem (the audience is encouraged to participate!)

I Find the optimum of some function f in the interval [0,1].

I f is L-Lipchitz continuous and differentiable.

I Evaluations of f are exact and we have 4 of them!

Situation
We have a few function evaluations

Where is the minimum of f ?

Where should the take the next evaluation?
Intuitive solution
One curve
Intuitive solution
Three curves
Intuitive solution
Ten curves
Intuitive solution
Hundred curves
Intuitive solution
Many curves
Intuitive solution
Infinite curves
General idea: surrogate modelling

1. Use a surrogate model of f to carry out the optimization.

2. Define an utility function to collect new data points

satisfying some optimality criterion: optimization as
decision.

3. Study decision problems as inference using the surrogate

model: use a probabilistic model able to calibrate both,
epistemic and aleatoric uncertainty.

Uncertainty Quantification
Utility functions

The utility should represent our design goal:.

1. Active Learning and experimental design: Maximize the

differential entropy of the posterior distribution p(f |X, y)
(D-optimality in experimental design).

2. Minimize the loss in a sequence x1 , . . . , xn

N
X
rN = f (xn ) − N f (xM )
n=1

(1) does to a lot exploration whereas (2) encourages

exploitation about the minimum of f .
Bayesian Optimisation
[Mockus, 1978]

Methodology to perform global optimisation of multimodal

black-box functions.

1. Choose some prior measure over the space of possible

objectives f .

2. Combine prior and the likelihood to get a posterior

measure over the objective given some observations.

3. Use the posterior to decide where to take the next

evaluation according to some acquisition/loss function.

4. Augment the data.

Iterate between 2 and 4 until the evaluation budget is over.

Surrogate model: Gaussian process
Default Choice: Gaussian processes [Rasmunsen and Williams, 2006]

Infinite-dimensional probability density, such that each linear

finite-dimensional restriction is multivariate Gaussian.

I Model f (x) ∼ GP(µ(x), k(x, x0 )) is determined by the mean

function m(x) and covariance function k(x, x0 ; θ).
I Posterior mean µ(x; θ, D) and variance σ(x; θ, D) can be
computed explicitly given a dataset D.
Other models are also possible: Random Forrest
[Criminisi et al, 2011]
Other models are also possible: t-Student processes
Exploration vs. exploitation

Bayesian optimization explains human active search

[Borji and Itti, 2013]

Exploration vs. exploitation

Picture source: [Link]

GP Upper (lower) Confidence Band
[Srinivas et al., 2010]

Direct balance between exploration and exploitation:

αLCB (x; θ, D) = −µ(x; θ, D) + βt σ(x; θ, D)
GP Upper (lower) Confidence Band
[Srinivas et al., 2010]

I In noiseless cases, it is a lower bound of the function to

minimize.
I This allows to computer a bound on how close we are to
the minimum.
I Optimal choices available for the ’regularization parameter’.
Expected Improvement
[Jones et al., 1998]

Z
αEI (x; θ, D) = max(0, ybest − y)p(y|x; θ, D)dy
y
Expected Improvement
[Jones et al., 1998]

I Perhaps the most used acquisition.

I Explicit for available for Gaussian posteriors.
I It is too greedy in some problems. It is possible to make
more explorative adding a ’explorative’ parameter

αEI (x; θ, D) = σ(x; θ, D)(γ(x)Φ(γ(x))) + N (γ(x); 0, 1).

where
f (xbest ) − µ(x; θ, D) + ψ
γ(x) = .
σ(x; θ, D)
Maximum Probability of Improvement
[Hushner, 1964]

γ(x) = σ(x; θ, D)−1 (µ(x; θ, D) − ybest )

αM P I (x; θ, D) = p(f (x) < ybest ) = Φ(γ(x))
Maximum Probability of Improvement
[Hushner, 1964]

I First used acquisition: very intuitive.

I Less used in practice.
I Explicit for available for Gaussian posteriors.

αM P I (x; θ, D) = Φ(γ(x))).
where
f (xbest ) − µ(x; θ, D) + ψ
γ(x) = .
σ(x; θ, D)
Information-theoretic approaches
[Hennig and Schuler, 2013; Hernández-Lobato et al., 2014]

αES (x; θ, D) = H[p(xmin |D)] − Ep(y|D,x) [H[p(xmin |D ∪ {x, y})]]

Information-theoretic approaches

Uses the distribution of the minimum

Z Y
pmin (x) ≡ p[x = arg min f (x)] = p(f ) θ[f (x̃) − f (x)]df
f :I→< x̃∈I
x̃6=x

where θ is the Heaviside’s step function. No closed form!

Use Thomson sampling to approximate the distribution.

Generate many sample paths from the GP, optimize them to
take samples from pmin (x).
Thomson sampling
Probability matching

αT HOM SON (x; θ, D) = g(x)

g(x) is sampled form GP(µ(x), k(x, x0 ))
Thompson sampling
Probability matching [Rahimi and B. Recht, 2007]

I It is easy to generate posterior samples of a GP at a finite

set of locations.
I More difficult is to generate ‘continuous’ samples.

Possible using the Bochner’s lemma: existence of the Fourier

dual of k, s(ω) which is equal to the spectral density of k
h T 0
i
k(x, x0 ) = νEω e−iω (x−x ) = 2νEω,b cos(ωxT + b) cos(ωxT + b)

With sampling and this lemma (taking p(w) = s(ω)/ν and

b ∼ U[0, 2π]) we can construct a feature based approximation
for sample paths of the GP.
m
0 ν X −iω(i)T x −iω(i)T x0
k(x, x ) ≈ e e
m
i=1
The choice of utility matters
[Hoffman, Shahriari and de Freitas, 2013]

The choice of the utility may change a lot the result of the
optimisation.
The choice of utility in practice
[Hoffman, Shahriari and de Freitas, 2013]

The best utility depends on the problem and the level of

exploration/exploitation required.
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Illustration of BO
Bayesian Optimization
As a ’mapping’ between two problems

BO is an strategy to transform the problem

xM = arg min f (x)
x∈X
solvable!

into a series of problems:

xn+1 = arg max α(x; Dn , Mn )
x∈X
solvable!
where now:

I α(x) is inexpensive to evaluate.

I The gradients of α(x) are typically available.
I Still need to find xn+1 .
BO vs other methods
[Osborne et al, 2009]

Bayesian optimization works better in practice!

Recap

I Bayesian optimization is a way of encoding our beliefs

about a property of a function (the minimum)

I Two key elements: the model and the acquisition function.

I Many choices in both cases, especially in terms of the

acquisition function used.

I The key is to find a good balance between exploration and

exploitation.
Main issues

I What to do with the hyper-parameters of the model?

I How to select points to initialize the model?

I How to optimize the acquisition function?

BO independent of the parameters of the GP.
[Snoek et al. 2012]

Integrate out across parameter values or location outputs.

How to initialise the model?

I One point in the centre of the domain.

I Uniformly selected random locations.
I Latin design.
I Halton sequences.
I Determinantal point processes.

The idea is always to start at some locations trying to minimise

the initial model uncertainty.
Latin design

n × n array filled with n different symbols, each occurring

exactly once in each row and exactly once in each column.
pyDOE

Python framework for standard experimental design

Latin design
Window honors Ronald Fisher. Fisher’s student, A. W. F.
Edwards, designed this window for Caius College, Cambridge.
Halton sequences
[Halton, 1964]

I Used to generate points in (0, 1) × (0, 1)

I Sequence that is constructed according to a deterministic
method that uses a prime number as its base.

Figure source: Wikipedia

Halton sequences
[Halton, 1964]

Better coverage than random.

Halton Random
Figure source: Wikipedia
Determinantal point processes
Kulesza and Taskar, [2012]

We say that X is a ‘determinantal point process’ on Λ with

kernel K if it is a simple point process on Λ with a joint
intensity or ‘correlation function’ given by

ρn (x1 , . . . , xn ) = det(K(xi , xj )1≤i,j≤n )

I Probability measures over subsets.

I Possible to characterise the samples in terms of quality and
diversity.
Determinantal point processes
Kulesza and Taskar, [2012]

Key idea:
Determinantal point processes
Kulesza and Taskar, [2012]
Methods to optimise the acquisition function

This may not be easy.

I Gradient descent methods: Conjugate gradient, BFGS, etc.

I Lipschitz based heuristics: DIRECT.
I Evolutionary algorithms: CMA.

Some of these methods can also be used to directly optimize f

Gradient descent
[Avriel,2013], but many others

We need to know the gradients. This is the case for most

acquisitions but not for all of them (PES for instance).
Gradient descent

May fall in local minima if the function is multimodal: multiple

initializations.
‘DIviding RECTangles’, DIRECT
[Perttunen at al. 1993]

Minimal hypothesis about the acquisition

‘DIviding RECTangles’, DIRECT
[Perttunen at al. 1993]

Finds good solution in general and doesn’t need gradient. Not

generalizable to non-squared domains.
Covariance Matrix Adaptation, CMA
[Hansen and Ostermeier, 2001].

I Sample for a Gaussian with some mean µ and covariance

matrix Σ.
I Select the best points and use them to update µ and Σ.
I Sample form the new Gaussian.
Took a while to start using these ideas in ML

Although in the stats community have been there for a while

I BO depends on its own parameters.

I Lack of software to apply these methods as a black

optimization boxes.

I Reduced scalability in dimensions and number of

evaluations (this is still a problem).

Practical Bayesian Optimization of Machine Learning

Algorithms. Snoek, Larochelle and Adams. NIPS 2012
(Spearmint)
Increasing popular field

I Hot topic in Machine Learning.

I The BO workshop at NIPS is well stablished and it is a
mini-conference itself.
Bayesian optimization now

It has become increasingly popular since it allows to configure

algorithms without human intervention.

BO takes to human out of the loop!

BO in industry: Twitter
BO in industry: Uber
Questions?

Introduction to Bayesian Optimization Techniques
No ratings yet
Introduction to Bayesian Optimization Techniques
79 pages
Bayesian Optimization Tutorial Overview
No ratings yet
Bayesian Optimization Tutorial Overview
49 pages
Bayesian Optimization in Machine Learning
No ratings yet
Bayesian Optimization in Machine Learning
84 pages
In-Context Bayesian Optimization Tutorial
No ratings yet
In-Context Bayesian Optimization Tutorial
22 pages
Understanding Bayesian Optimization Techniques
No ratings yet
Understanding Bayesian Optimization Techniques
3 pages
Bayesian Optimization
No ratings yet
Bayesian Optimization
23 pages
Bayesian Optimization: Theory and Practice Using Python Peng Liu Ebook Premium Edition
100% (2)
Bayesian Optimization: Theory and Practice Using Python Peng Liu Ebook Premium Edition
35 pages
HEBO - Heteroscedastic Evolutionary Bayesian Optimisation
No ratings yet
HEBO - Heteroscedastic Evolutionary Bayesian Optimisation
12 pages
Bayesian Optimization Explained
No ratings yet
Bayesian Optimization Explained
376 pages
Bayesoptbook A4
100% (1)
Bayesoptbook A4
374 pages
Bayesian Optimization with Neural Networks
No ratings yet
Bayesian Optimization with Neural Networks
9 pages
Bayesian Optimization
No ratings yet
Bayesian Optimization
31 pages
Bayesian Optimization Techniques Explained
No ratings yet
Bayesian Optimization Techniques Explained
22 pages
Bayesian Optimization in Machine Learning
No ratings yet
Bayesian Optimization in Machine Learning
78 pages
Bayesian Optimization Explained
No ratings yet
Bayesian Optimization Explained
6 pages
第2课机器学习进阶张景昭
No ratings yet
第2课机器学习进阶张景昭
70 pages
Theoretical Analysis of Bayesian Optimization
No ratings yet
Theoretical Analysis of Bayesian Optimization
6 pages
Bayesian Optimization for ML Hyperparameters
No ratings yet
Bayesian Optimization for ML Hyperparameters
11 pages
Introduction to Bayesian Optimization
No ratings yet
Introduction to Bayesian Optimization
4 pages
Practical Bayesian Optimization Techniques
No ratings yet
Practical Bayesian Optimization Techniques
29 pages
Simplifying Bayesian Optimization with FIBO
No ratings yet
Simplifying Bayesian Optimization with FIBO
14 pages
Inducing Point Allocation in Bayesian Optimisation
No ratings yet
Inducing Point Allocation in Bayesian Optimisation
18 pages
FunBO: LLM-Based Acquisition Functions
No ratings yet
FunBO: LLM-Based Acquisition Functions
20 pages
Snoek 15
No ratings yet
Snoek 15
10 pages
Bayesian Optimization in Action MEAP V07 1st / Chapters 1 To 8 of 13 Edition Quan Nguyen Ebook Testbank Solutions Optimized Version 2026
100% (4)
Bayesian Optimization in Action MEAP V07 1st / Chapters 1 To 8 of 13 Edition Quan Nguyen Ebook Testbank Solutions Optimized Version 2026
135 pages
High-Dimensional Bayesian Optimization Insights
No ratings yet
High-Dimensional Bayesian Optimization Insights
20 pages
Bayesian Optimization with GPR Techniques
No ratings yet
Bayesian Optimization with GPR Techniques
59 pages
Predictive Entropy Search for Optimization
No ratings yet
Predictive Entropy Search for Optimization
12 pages
Bayesian Hyperparameter Optimization in R
No ratings yet
Bayesian Hyperparameter Optimization in R
6 pages
Bayesian Optimization For Accelerating Hyper-Parameter Tuning
No ratings yet
Bayesian Optimization For Accelerating Hyper-Parameter Tuning
4 pages
Bayesian Learning and Models Explained
No ratings yet
Bayesian Learning and Models Explained
55 pages
Bayesian Optimization Techniques Explained
No ratings yet
Bayesian Optimization Techniques Explained
25 pages
Conformal Bayesian Optimization Explained
No ratings yet
Conformal Bayesian Optimization Explained
28 pages
Theoretical Analysis of Bayesian Optimization
No ratings yet
Theoretical Analysis of Bayesian Optimization
12 pages
Hyperparameter Optimization Interview Guide
No ratings yet
Hyperparameter Optimization Interview Guide
7 pages
No-Regret Bayesian Optimization Algorithm
No ratings yet
No-Regret Bayesian Optimization Algorithm
24 pages
Understanding Machine Learning Hyperparameters
No ratings yet
Understanding Machine Learning Hyperparameters
13 pages
All Knowing Algorithm Foundation 1
No ratings yet
All Knowing Algorithm Foundation 1
7 pages
Bayesian Linear Regression Overview
No ratings yet
Bayesian Linear Regression Overview
36 pages
Bayesian Learning Tutorial Overview
No ratings yet
Bayesian Learning Tutorial Overview
69 pages
Bayesian Optimization for Hyperparameter Tuning
No ratings yet
Bayesian Optimization for Hyperparameter Tuning
4 pages
Lecture2 2013
No ratings yet
Lecture2 2013
60 pages
Bayesian Optimization in Experimental Design
No ratings yet
Bayesian Optimization in Experimental Design
12 pages
Human-Algorithm Collaboration in Optimization
No ratings yet
Human-Algorithm Collaboration in Optimization
19 pages
Evolutionary Learning in Bayesian Models
No ratings yet
Evolutionary Learning in Bayesian Models
18 pages
Bayesian Programming
100% (5)
Bayesian Programming
208 pages
Bayesian Programming
No ratings yet
Bayesian Programming
208 pages
Bayesian Optimization for Hyperparameter Tuning
No ratings yet
Bayesian Optimization for Hyperparameter Tuning
33 pages
PRML Slides 3
No ratings yet
PRML Slides 3
57 pages
Bayesian Algorithms for Econometrics
No ratings yet
Bayesian Algorithms for Econometrics
33 pages
Bias Variance Trade Off
No ratings yet
Bias Variance Trade Off
4 pages
Surrogate Optimization Strategies Explained
No ratings yet
Surrogate Optimization Strategies Explained
12 pages
Ensemble Gaussian Processes for Bayesian Optimization
No ratings yet
Ensemble Gaussian Processes for Bayesian Optimization
14 pages
Batman Clothing Model Assessment Task
No ratings yet
Batman Clothing Model Assessment Task
12 pages
Econometrics II: AR Process Predictions
No ratings yet
Econometrics II: AR Process Predictions
5 pages
Understanding Metric Spaces in Mathematics
No ratings yet
Understanding Metric Spaces in Mathematics
19 pages
Understanding Pick's Theorem
100% (1)
Understanding Pick's Theorem
10 pages
12th Arts Math Crash Course Schedule
No ratings yet
12th Arts Math Crash Course Schedule
1 page
Series Convergence Tests Practice
No ratings yet
Series Convergence Tests Practice
2 pages
Form 3 Mathematics Paper 2 Instructions
No ratings yet
Form 3 Mathematics Paper 2 Instructions
6 pages
PL2 The Quantum Scars of Primes
No ratings yet
PL2 The Quantum Scars of Primes
6 pages
c2 Sequences and Series
No ratings yet
c2 Sequences and Series
11 pages
Non-Standard Analysis: Hyperreals Explained
100% (1)
Non-Standard Analysis: Hyperreals Explained
10 pages
Relational Algebra in DBMS Explained
100% (1)
Relational Algebra in DBMS Explained
69 pages
If Then Statements Math
No ratings yet
If Then Statements Math
18 pages
Kleene's Theorem in Automata Theory
No ratings yet
Kleene's Theorem in Automata Theory
6 pages
Mathematics & Statistics Course Overview
No ratings yet
Mathematics & Statistics Course Overview
6 pages
Geometry Transformation Challenges in Students
No ratings yet
Geometry Transformation Challenges in Students
14 pages
Formulae: Derivation and Application
No ratings yet
Formulae: Derivation and Application
40 pages
Analytical Mechanics Homework 1
No ratings yet
Analytical Mechanics Homework 1
2 pages
Putnam Practice Problems Fall 2007
No ratings yet
Putnam Practice Problems Fall 2007
3 pages
Theorems for Initial Value Problems
No ratings yet
Theorems for Initial Value Problems
4 pages
Grade IX Mathematics Sample Paper 2023
No ratings yet
Grade IX Mathematics Sample Paper 2023
9 pages
Logic and Truth Table Exercises
No ratings yet
Logic and Truth Table Exercises
3 pages
Understanding Inequalities and Solutions
No ratings yet
Understanding Inequalities and Solutions
57 pages
Understanding Competency-Based Education
No ratings yet
Understanding Competency-Based Education
10 pages
Adding & Subtracting Fractions Guide
No ratings yet
Adding & Subtracting Fractions Guide
53 pages
Class 9 Olympiad Maths Polynomials Questions
No ratings yet
Class 9 Olympiad Maths Polynomials Questions
3 pages
Coordinate Geometry Worksheet for Class IX
No ratings yet
Coordinate Geometry Worksheet for Class IX
5 pages
Dynamics of Structures 4E Anil K Chopra Testbank & Ebook
No ratings yet
Dynamics of Structures 4E Anil K Chopra Testbank & Ebook
246 pages
Pre Grade 8 Mathematics Assessment Test
No ratings yet
Pre Grade 8 Mathematics Assessment Test
5 pages
Understanding Significant Figures and Calculations
No ratings yet
Understanding Significant Figures and Calculations
6 pages