HEBO - Heteroscedastic Evolutionary Bayesian Optimisation
HEBO - Heteroscedastic Evolutionary Bayesian Optimisation
Optimisation
Abstract
We introduce HEBO: Heteroscedastic Evolutionary Bayesian Optimisation that
won the NeurIPS 2020 black-box optimisation competition. We present non-
conventional modifications to the surrogate model and acquisition maximisation
process and show such a combination superior against all baselines provided by
the Bayesmark package. Lastly, we perform an ablation study to highlight the
components that contributed to the success of HEBO.
1 Introduction
The NeurIPS 2020 black-box optimisation challenge is an annual competition that evaluates black-box
optimisation algorithms on real-world score functions. The contest is constructed from automated
machine learning (AutoML) tasks [1] and the Bayesmark package, both of which aim at tuning
hyper-parameters of models to improve validation set performance. Given the importance of correctly
tuning machine learning algorithms [2], this challenge constitutes a major stepping-stone towards
real-world deployment of large-scale models.
In general, the contest can be formulated as a problem of optimising a performance measure (score)
for various hyper-parameter configurations of ML algorithms. For instance, these parameters can
correspond to learning rates, layer depths and widths, or dropout rates during a neural network training
step. The ’score’ that we would optimise in this instance would be the validation loss. Among various
challenges, the main difficulties in optimising hyper-parameters are; that performance measures
are neither explicitly available in closed form, nor do they necessarily adhere to differentiability
assumptions, they can typically be very computationally expensive to evaluate, and there is no precise
method to perform mixed variable optimisation. Nonetheless, one can still exploit a low number
of validation loss evaluations for a better exploration of the hyper-parameter space. This, in turn,
transforms the problem into an instance of black-box optimisation:
min f (x), (1)
x∈X
∗
Equal contribution.
†
Honorary position at University College London.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
with x denoting optimisation variables that correspond to mixed (continuous and discrete) hyper-
parameter settings and f (x) the score function we wish to minimise.
Bayesian Optimisation (BO) presents a sample-efficient and derivative-free solution for finding a
minimum x? of black-box objectives such as these encountered in Equation 1. Real-world examples
of BBO problems are ubiquitous [3, 4, 5, 6, 7, 8, 9, 10, 11]. BO algorithms are typically equipped
with two core components that allow for improved exploration of input points. The first corresponds
to a probabilistic surrogate model (generally a Gaussian process) that approximates f (x) from
historical interactions whilst providing uncertainty estimates needed to guide exploration. The second
component equates to an acquisition function that acts as a proxy to the true sequential risk, measuring
the utility of gathering new input points by trading off exploration and exploitation.
Lots of BO implementations assume an idealised setting for calculating future black-box evaluations,
such as Gaussian noise likelihoods and one optimal acquisition function. While analysing the compe-
tition’s data, we realised that neither of these two assumptions hold; noise processes are complex
and heteroscedastic and different acquisition functions arrived at conflicting results. Furthermore, in
case unseen data involved even more compounded stochasticity, as evaluation tasks are hidden to
participants, we would prefer to hedge against our losses by equipping the surrogate model with the
ability to handle non-stationarities.
Hence, to arrive at a winning solution, we developed a heteroscedastic and evolutionary Bayesian
optimisation (HEBO) algorithm that supports: 1) complex noise processes through input warping
and output transformations, and 2) multi-objective acquisition functions with evolutionary optimis-
ers. In particular, while input and output non-linear transformations handle non-stationarity and
heteroscedasticity respectively, multi-objectivity avoids conflicts by enabling a consensus among
various acquisition functions through a Pareto-frontier.
2 Background
We consider a sequential decision approach to the global optimisation of a smooth function f : X →
R over a bounded input domain X ⊆ Rd . At each decision round, i, we select a collection of q input
(new) (new) (new)
points x1:q ∈ X q and observe values of the black-box function y1:q = f (x1:q ). We allow the
returned value to be either deterministic i.e., yi = f (xi ) or stochastic with yi = f (xi ) + i , where i
denotes a bounded-variance random variable. Our goal is to rapidly (in terms of N ) approach the
maximum x? = arg maxx∈X f (x). Since both f (·) and x? are unknown, solvers need to trade off
exploitation and exploration during the search process, as shown in Algorithm 1.
To reason about the unknown function, typical Bayesian optimisation algorithms assume smoothness
and adopt Bayesian modelling as a principle to carry out inference about the properties of f (·)
in light of the observations. Here, one introduces a surrogate model (line 3 of Algorithm 1) to
encode beliefs over the smoothness properties and an observation model to describe collected data,
ni
Di = {xl , yl }l=1 , up to the ith round with ni denoting the total acquired data so far. Using these
two components in addition to Bayes rule, we can then compute a posterior p(f (·)|Di ) to encode
all knowledge of f (·) allowing us to account for the location of the maximum. A generic Bayesian
(new)
optimisation solver finds a batch of q input points, x1:q , by maximising an acquisition function that
generally trades off exploration versus exploitation as shown in line 4 of the algorithm.
(new)
At this stage, x1:q are evaluated by querying the black-box to generate a set of q response variables
(new) (new)
y1:q = f (x1:q ) that are then used to augment the dataset; see line 6 of Algorithm 1. The above
process repeats for a total of N iterations at which the best performing input in DN is returned as the
solution to the problem in Equation 1.
A Gaussian process (GP) offers a flexible and sample-efficient procedure for placing priors over
unknown functions [12]. These models are fully specified by a mean function m(x) and a covariance
function, or kernel, k(x, x0 ) that encodes the smoothness assumptions on f (·). Given any finite
collection of inputs x1:ni , the outputs are jointly Gaussian given by:
f (x1:ni )|θ ∼ N (m(x1:ni ), Kθ (x1:ni , x1:ni )) ,
2
Algorithm 1 Batched Bayesian Optimisation
1: Inputs: Total number of outer iterations N , initial randomly-initialised dataset D0 = {xl , yl ≡
f (xl )}nl=1
0
, batch size q, acquisition function
2: for i = 0 : N − 1:
3: Fit a surrogate model to the current dataset Di
(new)
4: Find q points x1:q by maximising an acquisition function, trading off exploration and
exploitation
(new) (new)
5: Evaluate new inputs by querying the black-box to acquire y1:q = f (x1:q )
(new) (new) q
6: Update the dataset creating Di+1 = Di ∪ {xl , yl }l=1
7: end for
8: Output: Return the best-performing query point from the data x? = arg maxx∈DN f (x)
where [m(x1:ni )]k = m(xk ) denotes the mean vector, and Kθ (x1:ni , x1:i ) ∈ Rni ×ni the covari-
ance matrix with its (k, l)th entry computed as [Kθ (x1:ni , x1:ni )]k,l = kθ (xk , xl ). Here, kθ (·, ·)
represents a parameterised kernel with unknown hyperparameters θ corresponding to lengthscales
or signal amplitudes for example. For ease of presentation following [12], we use a zero-mean
prior in our notation here. In terms of the choice of Gaussian process kernel, there are a wide array
of options which encode prior modelling assumptions about the latent function. Two of the most
commonly-encountered kernels in the Bayesian optimisation literature are the squared exponential
(SE) and Matérn(32) kernels.
Given the data Di , and assuming Gaussian-corrupted observations yi = f (xi ) + i with i ∼
2
N (0, σnoise ), we can write the joint distribution over the data and an arbitrary evaluation input x as:
1
(i)
1
(i),−1 ni
min J (θ, σnoise ) = det Cθ + (y1:ni − m(x1:ni ))T Cθ (y1:ni − m(x1:ni )) + log 2π,
θ,σnoise 2 2 2
(2)
(i) 2 (i)
with Cθ = Kθ + σnoise I.
The objective in Equation 2 represents a non-convex optimisation problem making GPs susceptible
to local minima. Various off-the-shelf optimisation solvers ranging from first-order [13, 14] to
second-order [15, 16] methods have been rigorously studied in the literature.
Having introduced a distribution over latent black-box functions and specified mechanisms for up-
dating hyperparameters, we now discuss the process by which novel query points are suggested
for collection in order to improve the surrogate model’s best guess for the global optimiser x? . In
Bayesian optimisation, proposing novel query points is performed through maximising an acqui-
sition function αtype (·|Di ) that trades off exploration and exploitation by utilising statistics from
(new)
p(f (·)|Di ), i.e., x1:q = arg maxx∈X q αtype (x|Di ). During our participation in the competition,
we experimented with the following commonly used batch forms originally introduced in [17]:
3
1. Expected Improvement: One of the most popular acquisition functions is expected im-
provement [18, 19], which determines new query points by maximising expected gain
relative to the function values observed so far:
αq-EI (x1:q |Di ) = Ef (x1:q )|Di ,θ max {ReLU(f (x1:q ) − f (x+
i )1q )} , (3)
j∈1:q
where 1q denotes a q-dimensional vector of ones and as such, the ReLu(·) is to be executed
element-wise, and x+i is the best performing input in the data so far.
2. Probability of Improvement: Another commonly-used acquisition function in Bayesian
optimisation is the probability of improvement criterion which measures the probability
of acquiring gains in the function value compared to f (x+ i ) [20]. Such a probability is
measured through an expected Heaviside step function as follows:
+
αq-PI (x1:q |Di ) = Ef (x1:q )|Di ,θ max 11{f (x1:q ) − f (xi )1q } , (4)
j∈1:q
where 11{f (x1:q ) − f (x+ i )1q } returns a q-dimensional binary vector with [1 1{f (x1:q ) −
f (x+
i )}] j = 1 if [f (x1:q )]j ≥ [f (x+
i )1q ] j and zero otherwise for all j ∈ {1, . . . , q}.
3. Upper Confidence Bound: In this type of acquisition, the learner trades off the mean and
variance of the predictive distribution to gather new query points for function evaluation [21].
To allow parallel computation, the authors in [17] have suggested the following form:
n o
p
αq-UCB (x1:q |Di ) = Ef (x1:q )|Di ,θ max µi (x1:q ; θ) + βπ /2|γi (x1:q ; θ)| , (5)
j∈1:q
During our initial attempts, we found that standard off-the-shelve Bayesian optimisation submissions,
i.e., ones with simplistic noise models and a single choice of an acquisition function, tended to under-
perform due to the complexity associated with the competition’s data. Tackling these shortcomings,
we proposed two adaptations to: 1) handle complex noise processes using input-output non-linear
transformations, and 2) enable a multi-objective choice of acquisitions.
As noted earlier, we found that a vanilla Gaussian process model faced difficulties handling het-
eroscedasticity and non-stationarity available in the data. To tackle these problems, we perform two
transformations; a power transformation (Γζ (·)) on the output and Warping (ΨΘ (·)) on the input,
dependent on parameters Θ.
Output: For output objectives, instead of the commonly used linear normalisation, we use
the power transformation (Γζ (·)) to reduce heteroscedasticity. Precisely, we adopt the box-cox
transformation[22] if objectives are strictly positive (or strictly negative), otherwise we follow the
yeo-johnson’s [23] one. For each transformation, the hyper-parameters for stabilising variance and
minimising skewness are estimated through maximum likelihood. Note, this is executed before
minimising the negative log marginal likelihood of the GP.
Input: For input parameters, we utilised an input warped Gaussian process [24] to handle non-
stationary functions. Here, we perform a non-linear monotonic transformation by making use of
the cumulative distribution function of the Kumaraswamy distribution given by: ΨΘ (x; a, b)i =
1 − (1 − xa i (bi −1)
i ) with the transformation executed component-wise and Θ = {a, b} denoting the
additional parameters introduced.
In executing these transformations, we followed readily available implementations, where the
power_transform from sklearn [25] and InputWarpedGP from the GPy’s package [26] were
used.
4
Gaussian Process & Kernels: When selecting a surrogate model needed for line 3 of Algorithm 1,
we deemed sample efficiency important and, hence, settled on adopting a Gaussian process with an
additive kernel given by:
(HEBO) (Linear) (Matérn32)
kθ1 ,θ2 (x, x0 ) = kθ1 (x, x0 ) + kθ2 (x, x0 ) with θ1 and θ2 are kernel hyper-parameters.
Stochastic mean function: Lastly, we observed from the Bayesmark benchmarks that most AutoML
tasks were highly noisy. In this case, we found that adding noise to the GP’s posterior mean helped
aid exploration. To achieve this, we add on an additional likelihood noise-dependent term to the
posterior mean, yielding a new mean function given by: m(HEBO) (·) = m(·) + ξσnoise2
.
As mentioned earlier, we found that in certain circumstances, different acquisitions led to conflicting
results. To circumvent such a problem, we adopted a multi-objective acquisition as that proposed
in [27]. The multi-objective acquisition ensemble algorithm (MACE) searches for a Pareto front across
multiple acquisitions functions, i.e., the solution that empirically scores higher than any individually
tested acquisition. This also enables parallel optimisation as the multi-objective optimisation returns
multiple Pareto-optimal recommendations. We utilised an evolutionary optimiser supporting mutation
and cross-over – NSGA-II – from the pymoo [28] package to solve the following:
With τ is the best function value observed so far, In our implementation, we use the above ap-
proximation when zi < −6 and the exact log αq-EI (x|Di ) otherwise. We compare the proposed
logarithmic approximation to the exact evaluation in Figure 2 in the Appendix. It can be seen that our
approximation is accurate when zi is small, but returns +∞ when z 2 = 1. Note, a similar approach
can be applied for approximating probability improvement.
4 Experiments
In this section, we first detail the summary results of our method compared with all Bayesmark
baselines. We then discuss a thorough ablation study and the apparent significance of various design
choices.
Bayesmark Offline Datasets: We conducted a comparison of HEBO across all 108 Bayesmark
Offline datasets and all available baselines provided in the BBO Challenge starter kit. Figure 1b shows
that HEBO significantly outperforms others with Pysot [29] and Turbo [30] being the second-and
third-best, respectively. Interestingly, we can also see a more widely spread upper-quartile of HEBO
vs other baselines3 .
5
102 102
100
100 96
94
99
92
98 90
w/o Power Transformation w/o Warp w/o Log Approx w/o Stochastic Mean HEBO Random Nevergrad Opentuner Skopt Hyperopt Turbo Pysot HEBO
(a) (b)
Figure 1: All algorithms in the plots are evaluated on the full Bayesmark Offline Dataset of 108
tasks and each task repeated with 20 random seeds. (a) Ablation Study of HEBO on Bayesmark
Offline Dataset. We see that when all components are combined, we see a more appealing upper
quartile distribution which reaches a much higher maximum scores than other ablations were able
to, hinting towards a positively synergistic combination of algorithm components was found. (b)
Summary of HEBO vs all available baselines. We see substantial improvements when using HEBO
against all baselines, in many cases with HEBO also producing tighter distribution of scores on the
lower quartile, and a high distribution on the upper quartile. This distribution of quartiles suggests
that worst case we know it should roughly perform near the mean, and best case we could get a result
significantly higher than the mean. This, paired with the fact that our mean performance is above
satisfactory, give us much confidence in applying this method to a wide-range of BO tasks. Notice,
as we significantly outperformed all baselines, similarly we see that Pysot and Turbo significantly
outperform all other baselines.
Table 1: Bayesmark Offline Dataset Summary of HEBO vs all available benchmarks on 108 tasks,
each task repeated with 20 random seeds. Where P-value is a two-sided test for the null hypothesis
that an algorithm has the same expected value as HEBO: p-values are highlight in bold if this null
hypothesis is rejected with 95% significance. We found all methods to be significantly worse than
HEBO.
In contrast, the removal of the stochastic mean function and log approximation showed insignificant
differences. Significance test results from this ablation study are shown in Table 2 in the Appendix.
We want to remind the reader again that the hypotheses from this ablation study do not necessarily
carry over to the online dataset from the competition, due to the aforementioned change of evaluation
domain.
5 Conclusion
We presented HEBO, a Bayesian optimisation method that utilises a novel formulation of warped
Gaussian Processes, multi-objective acquisition and a multi-objective evolutionary approach. In union,
these adaptations led to a general BO algorithm that won the NeurIPS 2020 Black-box Optimisation
Challenge, where the rankings were based on real-world tasks. We performed a thorough ablation
study to identify fundamental properties that are strongly correlated with the observed success. In
conclusion, we believe our work to help improve the understanding and highlight areas of Bayesian
optimisation research that highly impact real-world problems.
6
Acknowledgements
We want to thank the organisers of the competition, as well as Huawei Noah’s Ark for providing all
the necessary resources for us to compete in this challenge.
References
[1] Isabelle Guyon, Lisheng Sun-Hosoya, Marc Boullé, Hugo Jair Escalante, Sergio Escalera,
Zhengying Liu, Damir Jajetic, Bisakha Ray, Mehreen Saeed, Michéle Sebag, Alexander Stat-
nikov, WeiWei Tu, and Evelyne Viegas. Analysis of the automl challenge series 2015-2018. In
AutoML, Springer series on Challenges in Machine Learning, 2019.
[2] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine
learning algorithms. In Advances in neural information processing systems, pages 2951–2959,
2012.
[3] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato,
Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel,
Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven
continuous representation of molecules. ACS central science, 4(2):268–276, 2018.
[4] Ksenia Korovina, Sailun Xu, Kirthevasan Kandasamy, Willie Neiswanger, Barnabas Poczos,
Jeff Schneider, and Eric Xing. Chembo: Bayesian optimization of small organic molecules
with synthesizable recommendations. In International Conference on Artificial Intelligence and
Statistics, pages 3393–3403. PMLR, 2020.
[5] Henry B Moss and Ryan-Rhys Griffiths. Gaussian process molecule property prediction with
flowmo. arXiv preprint arXiv:2010.01118, 2020.
[6] Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained Bayesian optimization for
automatic chemical design using variational autoencoders. Chemical Science, 11(2):577–586,
2020.
[7] Zhuhong Zhang, Lei Wang, and Fei Long. Immune optimization approach solving multi-
objective chance-constrained programming. Evolving Systems, 6(1):41–53, 2015.
[8] Prasant Kumar Mahapatra, Susmita Ganguli, and Amod Kumar. A hybrid particle swarm
optimization and artificial immune system algorithm for image enhancement. Soft Computing,
19(8):2101–2109, 2015.
[9] Kwang-Seon Yoo and Seog-Young Han. Modified ant colony optimization for topology opti-
mization of geometrically nonlinear structures. International Journal of Precision Engineering
and Manufacturing, 15(4):679–687, 2014.
[10] Dmitrii V Speranskii. Ant colony optimization algorithms for digital device diagnostics.
Automatic Control and Computer Sciences, 49(2):82–87, 2015.
[11] Alexander I Cowen-Rivers, Daniel Palenicek, Vincent Moens, Mohammed Abdullah, Aivar
Sootla, Jun Wang, and Haitham Ammar. Samba: Safe model-based & active reinforcement
learning. arXiv preprint arXiv:2006.09436, 2020.
[12] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine
Learning, volume 2. MIT press Cambridge, MA, 2006.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua
Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[14] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. Advances in Neural
Information Processing Systems, 20:161–168, 2007.
[15] Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B:
Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on
Mathematical Software (TOMS), 23(4):550–560, 1997.
7
[16] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–
276, 1998.
[17] James Wilson, Frank Hutter, and Marc Deisenroth. Maximizing acquisition functions for
Bayesian optimization. In Advances in Neural Information Processing Systems, pages 9884–
9895, 2018.
[18] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques
IFIP Technical Conference, pages 400–404. Springer, 1975.
[19] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of
expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
[20] Harold J. Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak
Curve in the Presence of Noise. Journal of Basic Engineering, 86(1):97–106, 03 1964.
[21] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process
optimization in the bandit setting: no regret and experimental design. In Proceedings of the
27th International Conference on Machine Learning, pages 1015–1022, 2010.
[22] George EP Box and David R Cox. An analysis of transformations. Journal of the Royal
Statistical Society: Series B (Methodological), 26(2):211–243, 1964.
[23] In-Kwon Yeo and Richard A Johnson. A new family of power transformations to improve
normality or symmetry. Biometrika, 87(4):954–959, 2000.
[24] Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for bayesian
optimization of non-stationary functions. In International Conference on Machine Learning,
pages 1674–1682, 2014.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12:2825–2830, 2011.
[26] GPy. GPy: A gaussian process framework in python. [Link]
GPy, 2012.
[27] Wenlong Lyu, Fan Yang, Changhao Yan, Dian Zhou, and Xuan Zeng. Batch bayesian op-
timization via multi-objective acquisition ensemble for automated analog circuit design. In
International Conference on Machine Learning, pages 3306–3314, 2018.
[28] J. Blank and K. Deb. Pymoo: Multi-objective optimization in python. IEEE Access, 8:89497–
89509, 2020.
[29] David Eriksson, David Bindel, and Christine A Shoemaker. pysot and poap: An event-driven
asynchronous framework for surrogate optimization. arXiv preprint arXiv:1908.00420, 2019.
[30] David Eriksson, Michael Pearce, Jacob Gardner, Ryan D Turner, and Matthias Poloczek.
Scalable global optimization via local bayesian optimization. In Advances in Neural Information
Processing Systems, pages 5496–5507, 2019.
8
A Ablation
Table 2 shows the two-sided independent t-test performed on all components against the complete
HEBO algorithm. We observe that the two most significant features were the input-output non-linear
transformations. We noticed slight improvements by using the log approximation scheme and having
in a stochastic mean function. However, these small improvements were more considerable on the
more challenging online dataset.
HEBO w/o LogApprox HEBO w/o Stochastic Mean HEBO w/o Power HEBO w/o Warp HEBO
Mean 100.061 100.099 99.158 99.625 100.117
Std 0.351 0.408 0.451 0.453 0.476
P-value 0.675 0.894 1.041e−07 0.002 -
Min 99.441 99.215 98.358 98.680 99.349
25% 99.884 99.970 98.963 99.515 99.871
50% 100.051 100.163 99.155 99.700 100.057
75% 100.290 100.368 99.375 99.922 100.369
Max 100.792 100.629 99.974 100.240 101.516
Table 2: Ablation study on HEBO in offline benchmarks, evaluating each component based on
normalised score statistics. There are a total of 108 tasks in the offline benchmarks, and we repeat
each of these with 20 random seeds. Where P-value is a two sided independent t-test on the hypothesis
that the component significantly effects the performance of HEBO. We can see clearly in the offline
data that each component offered an increase in mean score. For each ablation study, we perform a
t-test with HEBO. The test highlights that the results from using the power transformation and input
warped Gaussian Process are significant in their effect on the mean score. However, adding noise
to the posterior mean and using the log approximation for numerical stability did not significantly
improve HEBO. It is worth noting, however, that we observed a big difference in how algorithms
performed offline vs online, so this ablation is not conclusive, but instead offers support to each
components claim.
B Log Approximation
τ − µi (x; θ)
αq-EI (x|Di ) = σi (x; θ)(zi Φ(zi ) + φ(zi )) with zi = ,
σi (x; θ)
where Φ(·) and φ(·) are the cumulative and density functions of a standard normal distribution
N (0, 1).
When optimising expected improvement, sometimes we start from point x with very small zi =
τ −µi (x;θ)
σi (x;θ) (starting point of gradient based optimiser or initial population of evolutionary optimiser),
this could happen when x has bad objective value, and is acknowledged by the surrogate model with
high confidence. In that case, the expected improvement value (and its gradient) would be very close
to zero, making it hard to optimise.
An apparent approach is to instead optimise log αq-EI (x|Di ). However, we found that sometimes
the expected improvement value is so small that evaluating log αq-EI (x|Di ) returns NaN, leading to
numerical instability during the optimisation.
The analytical expression of log αq-EI (x|Di ) is:
zi Φ(zi )
log αq-EI (x|Di ) = log σi (x; θ) + log φ(zi ) + log 1 + (7)
φ(zi )
9
Approximation
Exact
−15
−20
logEI
−25
−30
−35
−6 −5 −4 −3 −2 −1 0 1 2
z
(a) (b)
Figure 2: (a-b) Comparison between the exact log αq-EI (x1:q |Di ) and the proposed logarithm approx-
imation, note that when z < −40, exact evaluation returns NaN while the approximation can still be
calculated.
In Equation 7, the first two terms are easy to evaluate, but we don’t have analytical expression to the
last term. With Lh́ospital rule, it’s easy to show that:
zi Φ(zi ) Φ(zi ) + zi φ(zi ) Φ(zi ) φ(zi )
lim = lim = lim −1 − = −1 −
zi →−∞ φ(zi ) zi →−∞ −zi φ(zi ) zi →−∞ zi φ(zi ) φ(zi ) − zi2 φ(zi )
1
= −1 + 2 .
zi − 1
We compare the proposed logarithmic approximation to the exact evaluation in Figure 2. It can be
seen that our approximation is accurate when zi is small, but returns +∞ when z 2 = 1.
10
2
0
C
2
2
1
intercept_scaling
2
0.6
0.7
0.8
y
0.9
1.0
2 0 2 2 0 2 1.0 0.8 0.6
C intercept_scaling y
Figure 3: All plots show data from the same Bayesmark task. The aim of this task is to tune two
hyper-parameters, a regularisation parameter C, and an intercept scaling parameter to maximising
the validation accuracy (y) of a lasso classifier on the Wine dataset. (a) Shows each of the hyper-
parameters and accuracy correlated against each other. Importantly, we find weak correlation between
intercept scaling and accuracy. In contrast, we find strong correlation between C and accuracy.
11
(a)
HEBO
100 HEBO w/o stochastic mean
Score on lasso_wine_acc
80
60
40
20
0
0 2 4 6 8 10 12 14 16
Iteration
(b)
Figure 4: All plots show data from the same Bayesmark task. The goal of this task is to tune two
hyper-parameters, a regularisation parameter C, and an intercept scaling parameter to maximising
the validation accuracy (y) of a lasso classifier on the Wine dataset. (a) We then do a more broader
evaluation of many (uniformly sampled) C values and the corresponding validation classification
accuracy once the lasso classifier has been trained and notice in a specific region of regularisation
values (between [−1, 2]), to be no clear correlation with accuracy. Thus, the best method for searching
in this region would appear to be random — this motivates us exploring a BO algorithm which can
resort to some form of random search in challenging regions. (b) We plot the average score of HEBO
w and w/o the stochastic mean term. Note, on this task one achieves a normalised score of 100 if the
model achieves 97.9% validation accuracy, 67 if it gets 97.2% validation accuracy and 35 for 96.4%
validation accuracy. Hence, we can see the more difficult to identify regularisation hyper-parameters
are found faster with the stochastic mean term, than without.
Noise level: 0.00, lb = -2.00, ub = 2.00 Noise level: 0.01, lb = -2.00, ub = 2.00 Noise level: 0.10, lb = -2.00, ub = 2.00 Noise level: 1.00, lb = -2.00, ub = 2.00 Noise level: 1.00, lb = -0.50, ub = 0.50
4.0 4.0 4 2
5
3.5 3.5
4
3.0 3.0 1
3
3
2.5 2.5
2 0
2.0 2.0 2
1
1.5 1.5 1
0
1.0 1.0 1
1
0.5 0.5 2
0 2
0.0 0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 0.4 0.2 0.0 0.2 0.4
Figure 5: f (x) = x2 Synthetic black-box function experiment to demonstrate the optimised points
(orange crosses) that a genetic algorithm arrives at when varying noise levels are introduced, which
simulate the varying likelihood noise values that effect the posterior mean prediction. We can clearly
see the correlation of, as noise increases as does the fine spread of optimised. It is this principle that
allows HEBO to achieve a coarse-to-fine search
12