Understanding Multi-armed Bandits Concepts
Understanding Multi-armed Bandits Concepts
These notes were written for myself to refer to while lecturing. They are not a replacement for
the course textbooks [Lattimore and Szepesvári, 2020, Bubeck et al., 2012] and may contain
errors! I have posted them by request.
1 Introduction
Machine learning, and in particular, supervised learning, is the study of making statistical inferences from
previously collected data. Multi-armed bandits is more about an interaction between an agent (algorithm)
and an environment where one simultaneously collects data and makes inferences in a closed-loop.
You have n “arms” or actions, representing distributions. “Pulling” an arm represents requesting a
sample from that arm.
At each time t = 1, 2, 3, . . .
• Algorithm chooses an action It ∈ {1, . . . , n}
• Observes a reward XIt ,t ∼ PIt where P1 , . . . , Pn are unknown distributions
That is, playing arm i and time s results in a reward Xi,s from the ith distribution. In these lectures, all
distributions will be Gaussian (or sub-Gaussian) with variance 1 unless otherwise specified. Example of
sub-Gaussian distribution is bounded distributions on [−1, 1] or Gaussian N (0, 1). Formally, a distribution
of X is 1-sub-Gaussian if E[exp(λX)] ≤ exp(λ2 /2).
We will find that the means of the distribution are the most pertinent parameters of these distributions.
Let θi∗ = EX∼Pi [X] be the mean of the ith distribution. Define ∆i = maxj=1,...,n θj∗ − θi∗ . We measure
performance of an algorithm in two ways: 1) how much total reward is accumulated, and 2) how many total
pulls are required to identify the best mean.
1
1.2 Best-arm identification
Given a δ ∈ (0, 1) identify the best arm with probability at least 1 − δ using as few total pulls as possible.
While related, these objectives are at odds with one another. Sometimes called the (, δ)-PAC setting,
but for simplicity we’ll take = 0.
1.3 Warm-up
Suppose n = 2. How long would it take to decide one arm was better than another using sub-gaussian
bounds? Consider the trivial algorithm:
Input: 2 arms, time τ ∈ N.
Pull each arm i ∈ {1, 2} exactly τ times and compute empirical mean θbi .
For all t > 2τ play arm arg maxi θbi
Proposition 1. Fix , δ. If Z1 , Z2 , . . . are independent mean-zero σ 2 -sub-Gaussian
Pτ random variables so that
E[exp(λZt )] ≤ exp(λ2 σ 2 /2), then for τ = d2σ 2 −2 log(1/δ)e we have P( τ1 t=1 Zt ≤ ) ≥ 1 − δ.
Proof.
τ τ
!
1X X
P( Zt > ) = P(exp λ Zt > exp(λτ ))
τ t=1 t=1
" τ
!#
X
≤ e−λτ E exp λ Zt
t=1
τ
Y
= e−λτ E [exp (λZt )]
t=1
= e−λτ exp λ2 σ 2 τ /2
= exp −λτ + λ2 σ 2 τ /2
≤ exp −τ 2 /2σ 2 ≤ δ
Then P(E1c ∪ E2c ) ≤ P(E1c ) + P(E1c ) ≤ δ. Thus, if we pull each arm τ times then on E1 ∩ E2 we have
θb1 > θ1∗ − ∆/2
≥ θ2∗ + ∆/2
> θb2
so that we have determined the best-arm. And we can play it forever.
After any T total plays such that arm i has been played Ti times and T = T1 + T2 , the expected regret
is at most
" T #
X
∗
θ1 T − E XIs ,s = θ1∗ T − E [(T1 θ1∗ + T2 θ2∗ )]
s=1
= E [T2 ∆]
= E [T2 ∆1{E1 ∩ E2 } + T2 ∆1{E1c ∪ E2c }]
≤ E [τ ∆1{E1 ∩ E2 } + T ∆1{E1c ∪ E2c }]
≤ 8∆−1 log(4/δ) + ∆T P(E1c ∪ E2c )
≤ 8∆−1 log(4/δ) + ∆T δ.
2
If we take δ = 1/T then the expected regret is less than ∆ + 8∆−1 log(4T ). On the other hand, the regret
can’t possibly be greater than ∆T , thus the total regret is bounded by
" T #
X
∗
θ1 T − E XIs ,s = min{T ∆, ∆ + 8∆−1 log(4T )}
s=1
p
≤1+2 8T log(4T )
p
where the last step takes the worst case ∆ = 8 log(4T )/T .
Takeaway: For√ very small ∆ we lose almost nothing, for very large ∆ its easy to distinguish, its maximized
at around 1/ T . We’ll see this again.
Lemma 1. Assume that maxi∈X ∆i ≤ 4. With probability at least 1−δ, we have 1 ∈ Xb` and maxi∈Xb` ∆i ≤ 8`
for all ` ∈ N.
Proof. For any ` ∈ N and i ∈ [n] define
n o
Ei,` = |θbi,` − θi∗ | ≤ `
T n T∞ q 2 /δ)
and E = i=1 `=1 Ei,` . Noting that ` = 2 log(4n`
τ` we have
∞ ∞
n [
! n X
[ X δ
P(E c ) = P c
Ei,` ≤ ≤ δ.
i=1 `=1 i=1 `=1
2n`2
3
which implies 1 ∈ Xb`+1 . Thus, 1 ∈ Xb` for all `. On the other hand, any i for which ∆i = θ1∗ − θi∗ > 4` we
have
= (θbj,` − θj ) − (θbi,` − θi ) − ∆i
> 2` − 4` = 2`
Theorem 1. Assume that maxi∈X ∆i ≤ 4. Then with probability at least 1 − δ, 1 is returned from the
algorithm at a time τ that satisfies
n
X
τ ≤c ∆−2 −2
i log(n log(∆i )/δ)
i=2
Proof. If ∆ = mini6=1 ∆i then Xb` = {1} for t ≥ dlog2 (8∆)e. Note that
dlog2 (8∆)e
X
Ti = τ` 1{i ∈ Xb` }
`=1
dlog2 (8∆)e
X
≤ τ` 1{∆i ≤ 8` }
`=1
dlog2 (8∆−1
i )e
X
= τ`
`=1
dlog2 (8∆−1
i )e
2
4` |X |
X
= d2−2
` log( δ )e
`=1
dlog2 (8∆−1
i )e
4 log22 (16∆−2
i )|X |
X
≤ d2 log( δ )e 4`
`=1
4 log22 (16∆−2
i )|X |
≤ c∆i−2 log( δ )
which implies
Thus, the total number of samples taken before Xb` = {1} is equal to
n n
4 log22 (16∆−2
i )|X |
X X
Ti ≤ c∆−2
i log( δ )
i=1 i=1
Pn
which implies that one can identify the best arm after no more than i=2 ∆−2 −2
i log(n log(∆i )/δ).
Pn
∆−1
p
Moreover, if the algorithm is run with δ = 1/T then RT ≤ c i=2 i log(T ) and RT ≤ c nT log(T ).
4
Suppose you run for T timesteps. Then for any ν ≥ 0 the regret is bounded by:
n
X X
∆i Ti ≤ νT + 8` τ` |Xb` |
i=2 `:8` >ν
dlog2 (8(∆∨ν)−1 )e
X
≤ νT + 8` τ` |Xb` |
`=1
−1
n dlog2 (8(∆∨ν)
X X )e
≤ νT + 8` τ` 1{∆i ≤ 8` }
i=1 `=1
−1
n dlog2 (8(∆
X X i ∨ν) )e
= νT + 8` τ`
i=1 `=1
−1
n dlog2 (8(∆ i ∨ν) )e
2
4` |X |
X X
= νT + 8` d2−2
` log( δ )e
i=1 `=1
n dlog2 (8(∆i ∨ν)−1 )e
4 log22 (8(∆i ∨ν)−2 )|X |
X X
≤ νT + c log( δ ) 2`
i=1 `=1
n
−1
c(∆i ∨ ν)−1 log( log((∆i ∨ν) )|X |
X
≤ νT + δ )
i=1
Pn −1 −1
Setting ν = 0 yields a regret of i=2p ∆i log(n log(∆i )/δ). On the other hand, using ∆i ∨ ν ≥ ν and
minimizing over ν yields a regret of nT log(n log(T )/δ). The expected regret, of course, is then bounded
by
n
" n #
X X
∆i E[Ti ] = E ∆i Ti
i=2 i=2
n
X
≤ ∆−1 −1 c
i log(n log(∆i )/δ) + T P(E )
i=2
Pn
Setting δ = 1/T implies the regret is less than i=2 c∆−1
i log(T ).
Some remarks:
• This analysis doesn’t reuse samples from previous rounds, it is easy to make this change.
• Regret bound requires knowledge of T a priori.
5
2 2
1 −(x−µ) /2σ
Let pµ (x) = 2π e be the Gaussian distribution with mean µ. Under H0 , Xi ∼ p0 and under
H1 , Xi ∼ p∆ . Let φ : Rn → {0, ∆}. Then the minimax probability of error is equal to
1
inf max{P0 (φ = 1), P1 (φ = 0)} ≥ inf (P0 (φ = 1), P1 (φ = 0))
φ φ 2
Z Z
1
= inf 1{φ(x) = 1}p0 (x)dx + 1{φ(x) = 0}p1 (x)dx
φ 2 x∈Rn x∈Rn
Z
1
= min{p0 (x), p1 (x)}dx
2 x∈Rn
Z 2
1 p
≥ p0 (x)p1 (x)dx (Cauchy-Schwartz)
4 x∈Rn
Z
1
≥ exp − log( pp10 (x) )p
(x) 1 (x)dx (Jensen’s)
4 x∈Rn
where
Z 2 Z 2
p p
p0 (x)p1 (x)dx = min{p0 (x), p1 (x)} max{p0 (x), p1 (x)}dx
x∈Rn n
Z x∈R Z
≤ min{p0 (x), p1 (x)}dx max{p0 (x), p1 (x)}dx (Cauchy-Schwartz)
x∈Rn x∈Rn
Z
≤2 min{p0 (x), p1 (x)}dx
x∈Rn
Note that
Z n
! n
Y p1 (xi ) Y
KL(P1 |P0 ) = log p1 (xi )dx
x i=1
p0 (xi ) i=1
= nKL(p1 |p0 ) = n∆2 /2
3.2 Indentification
An algorithm for best-arm identification at time t is described by given a history (Is , Xs )s<t for each time t
is described by a
• selection rule It ∈ [n] is Ft−1 measurable where Ft = σ(I1 , X1 , I2 , X2 , . . . , It−1 , Xt−1 )
• stopping time τ is Ft measurable, and
6
• recommendation rule bi ∈ [n] invoked at time τ which is Fτ -measurable.
Definition 1. We say that an algorithm for best-arm identification is δ-PAC if for all θ∗ ∈ Rn we have
Pθ∗ (bi = arg maxi∈[n] θi∗ ) ≥ 1 − δ.
The following is due to [Kaufmann et al., 2016], a strengthening of the first time it appeared in [Mannor and Tsitsiklis, 2004
Theorem 3 (Best-arm identification lower bound). Any algorithm Pnthat is δ-PAC on {P : Pi = N (θi , 1), θ1 >
1
maxi6=1 θi , θ ∈ [0, 1]n } for δ < 0.15 satisfies Eθ∗ [τ ] ≥ 2 log( 2.4δ ) i=1 ∆−2
i .
Proof sketch: The original instance has Pi = N (θi∗ , 1). Pick some j ∈ [n] and define an alternative mean
(j) (j)
vector θ(j) ∈ [0, 1]n such that θi = θi∗ if i 6= j and θi = θ1 + for j = i for some arbitrarily small number
(j)
. Note that under θ , arm j is the best arm.
Because the algorithm claims to be δ-PAC, it has to output arm 1 under θ∗ and arm j under θ(j) . But
these two bandit games only differ on arm j so to tell the difference between them its only natural to sample
arm j until one can figure out which instance is being played (i.e., is its mean θj or θ1 + ?) The discussion
above suggests that to make this distinction with probability at least 1 − δ, it is necessary to sample arm j
at least 2(θ1 − θj + )−2 log(1/4δ) times. Taking to zero and noticing that j was arbitrary completes the
sketch.
This is not a proof, however, because the number of times the algorithm samples arm j is random whereas
in the above argument it was fixed. The proof of [Kaufmann et al., 2016] provides convenient tools to prove
general lower bounds for δ-PAC settings.
Takeaway: This is what his field does: prove an initial upper, then lower, then chase it.
7
The recent work of [Lattimore, 2018] defined a UCB-based algorithm that achieves asymptotic optimal
√
constants, and finite regret bounds of i log(T )
P
∆ −1 and nT .
i
Note that η ∼ N (0, I). For any W ∼ N (µ, Σ) we have AW + b ∼ N (Aµ + b, AΣA> ). Thus
We will use the notation kzk2A = z > Az so that with probability at least 1 − δ
p
z > (θ̂ − θ∗ ) ≤ kzk(X > X)−1 2 log(1/δ)
8
where in the final step we made use of the following equality
5 Experimental design
Note that if I take measurements (x1 , . . . , xn ) ∈ X and observe their corresponding observations yi =
hxi , θ∗ i + ηi where ηi ∈ 0, ∞, then E[(θb − θ)(θb − θ)> ] = σ 2 (X T X)−1 and also, θb − θ∗ ∼ N (0, σ 2 (X > X)−1 ).
We can visualize this as a confidence ellipsoid for each choice of X. And we can even think of optimizing
1 −x> Σ−1 x/2
the choice. Recall that the PDF of a Gaussian is φ(x) = (2π|Σ|) d/2 e . With entropy 12 log(2πe|Σ|).
When the number of selected points is large, its more convenient to think of sampling n points from a
distribution placed over X . Define
X
Aλ = λx xx>
x∈X
so that for every X ∈ Rτ ×d there exists some λ ∈ 4X such that X > X = >
P
x∈X dλx τ exx = Aλ . This Aλ
can then be used to shape the covariance θ:
b
Lemma 3 (Kiefer-Wolfowitz (1960)). For any X with d = dim(span(X )), there exists a λ∗ ∈ 4X that
• maxλ gD (λ) = gD (λ∗ )
• minλ fG (λ) = fG (λ∗ )
• fG (λ∗ ) = gD (λ∗ ) = d
• support(λ∗ ) = (d + 1)d/2
Proposition 2. If λ∗ is the G-optimal design for X then if we pull arm x ∈ X exactly dτ λ∗x e times for some
τ > 0 and compute the least squares estimator θ. b Then for each x ∈ X we have with probability at least 1 − δ
p
hx, θb − θ∗ i ≤ kxk(Px∈X dτ λ∗x exx> )−1 2 log(1/δ)
1 p
≤ √ kxk(Px∈X λ∗x xx> )−1 2 log(1/δ)
τ
r
2d log(1/δ)
≤
τ
and we have taken at most τ + d(d+1) pulls. Thus, for any δ 0 ∈ (0, 1) we have P( x∈X {|hx, θb − θ∗ i| >
S
q 2
2d log(2|X |/δ 0 )
τ }) ≤ δ 0 .
9
Notes:
• The support size of (d + 1)d/2 is trivial application of Caratheodory’s theorem. Many algorithms to
find this efficiently.
• Note that one can find a λ∗ with a constant approximation with just support O(d).
` = 2−` , τ` = 2d−2 2
` log(4` |X |/δ)
Pull arm x ∈ X exactly dλ b`,x τ` e times and construct the least squares estimator θb` using only
the observations of this round
Xb`+1 ← Xb` \ x ∈ Xb` : maxx0 ∈Xb` hx0 − x, θb` i > 2`
`←`+1
Output: Xb`
After T time steps, define the regret as
" T #
X
RT = hx? , θ∗ i − E hxt , θ∗ i
t=1
X
= E Tx ∆x
x6=x?
where ∆x = hx? − x, θ∗ i.
Lemma 4. Assume that maxx∈X hx? − x, θ∗ i ≤ 4. With probability at least 1 − δ, we have x? ∈ Xb` and
maxx∈Xb` hx? − x, θ∗ i ≤ 8` for all ` ∈ N.
Proof. For any V ⊆ X and x ∈ V define
where it is implicit that θb` is the G-optimal design constructed in the algorithm at stage ` with respect
to Xb` = V. Note that this is precisely the analogous events of multi-armed bandits. The key piece of the
10
analysis is that
∞ [
[ ∞
X [
c c
P {Ex,` (Xb` )} ≤ P {Ex,` (Xb` )}
`=1 x∈X
b` `=1 x∈X
b`
∞
!
X X [
c
= P {Ex,` (V)}, Xb` =V
`=1 V⊆X x∈V
∞ X
!
X [
c
= P {Ex,` (V)} P(Xb` = V)
`=1 V⊆X x∈V
X∞ X
δ|V|
≤ 2`2 |X | P(X`
b = V) ≤ δ
`=1 V⊆X
T T∞
Thus, in what follows, assume E := x∈X `=1 {Ex,` (Xb` )} holds.
Fix any ` for which x? ∈ Xb` (note x? ∈ Xb1 ). Then for any x ∈ Xb` we have
which implies x? ∈ Xb`+1 . Thus, x? ∈ Xb` for all `. On the other hand, any x for which hx? − x, θ∗ i > 4` we
have
For any ` ≥ dlog2 (8∆−1 )e we have that Xb` = {x? }. Suppose you run for T timesteps. Then for any ν ≥ 0
the regret is bounded by:
dlog2 (8(∆∨ν)−1 )e
X X
∆x Tx = T ν + 8` (|support(λ
b` )| + τ` )
x∈X \x? `=1
dlog2 (8(∆∨ν)−1 )e
X
= Tν + 8` ( (d+1)d
2 + 2d−2 2
` log(4` |X |/δ))
`=1
dlog2 (8(∆∨ν)−1 )e
X
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) −1
)e + 16d−1 2
` log(4` |X |/δ)
`=1
dlog2 (8(∆∨ν)−1 )e
X
−1 −1
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) )e + 16d log(4 log22 (16(∆ ∨ ν) )|X |/δ) 2`
`=1
−1 −1 −1
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) )e + 512d(∆ ∨ ν) log(4 log22 (16(∆ ∨ ν) )|X |/δ)
Setting ν = 0 yields a regret bound of O(d∆−1 log(|X | log(∆−1 )/δ)) which implies RT ≤ c ∆
d
log(|X |T ). Mini-
p p
mizing over ν > 0 yields a regret bound of O( dT log(log(T /d)|X |/δ)) which implies RT ≤ c dT log(|X |T ).
Remarks:
• Let X = {ei : i ∈ [d]}. Then for this action set, this bound is nearly minimax according to our lower
bounds!
11
Pd
• However, this is also concerning: we know that in the bandit setting the regret scales like i=2 ∆−1
i log(T )
but this scales d∆−1 log(T ), which is significantly worse. Can we achieve this?
• For pure-exploration, an analogous analysis shows that one can identify the best-arm in ∆d2 log(1/δ)
pulls. But this is exactly the same rate we would have gotten if we did G-optimal once in the beginning
and sample according to that!
ρ? := inf τ
λ∈4X ,τ ∈N
t←t+1
Output: Xbt+1
Lemma 5. Assume that maxx∈X hx? − x, θ∗ i ≤ 2. With probability at least 1 − δ, we have x? ∈ Xb` and
maxx∈Xb` hx? − x, θ∗ i ≤ 4` for all ` ∈ N.
12
where it is implicit that θb` is the design constructed in the algorithm at stage ` with respect to Xb` = V.
Given Xb` , with probability at least 1 − 2`2δ|X |
p
|hx − x? , θb` − θ∗ i| ≤ kx − x? k(Px∈V dτ` λ`,x (V)exx> )−1 2 log(4`2 |X |/δ)
kx − x? k(Px∈V λ`,x (V)xx> )−1 p
≤ √ 2 log(4`2 |X |/δ)
τ`
v
u kx − x? k2(P
u
> −1 p
x∈V λ`,x (V)xx )
≤t −2 2
2 log(4`2 |X |/δ)
2` f (V) log(4` |X |/δ)
= `
T∞ T
By exactly the same sequence of steps as above, we have P( `=1 x∈Xb` {|hx − x? , θbt − θ∗ i| > t }) =
T T∞
0
P `=1 Ex,` (X` ) ≥ 1 − δ, so assume these events hold. Consequently, for any x ∈ X`
b b
x∈X
so that x? would survive to round ` + 1. And for any x ∈ Xb` such that hx? − x, θ∗ i > 2` we have
which implies this x would be kicked out. Note that this implies that maxx∈Xb`+1 hx? −x, θ∗ i ≤ 2` = 4`+1 .
Theorem 5. Assume that maxx∈X hx? − x, θ∗ i ≤ 2. Then with probability at least 1 − δ, x? is returned from
the algorithm at a time τ that satisfies
Proof. Define S` = {x ∈ X : hx? − x, θ∗ i ≤ 4` }. Note that by assumption X = Xb1 = S1 . The above lemma
T∞
implies that with probability at least 1 − δ we have `=1 {Xb` ⊆ S` }. This implies that
≤ min max
0
kx − x0 k2(P λx xx> )−1
λ∈4X x,x ∈S` x∈X
= f (S` )
13
For ` ≥ dlog2 (4∆−1 )e we have that S` = {x? }, thus, the sample complexity to identify x? is equal to
14
Note that
1X
ρ? := inf αx
α∈[0,∞)X 2
x∈X
kx? − xk2(P αx xx> )−1 1
x∈X
subject to max ≤
x∈X ∆2x 2
Notes
• There exists an asymptotic algorithm [Lattimore and Szepesvari, 2016], but no satisfying finite-time
algorithm as of yet.
• Information directed sampling may be near-optimal and very high performance.
9.1 Preliminaries
This presentation of mirror descent follows [Bubeck et al., 2012, Ch. 5].
For any open convex set D ⊂ Rd and its closure denoted D̄, for any Legendre F on D̄ define F ∗ (x) :=
supy∈D̄ x> y − F (y).
Define DF (x, y) = F (x) − F (y) − (x − y)> ∇F (y).
Let the Mirror Descent iterations satisfy, a1 = arg mina∈A F (a) then
15
Corollary 1 (Online Mirror Descent with Linear Losses). Let A ⊂ D ⊂ Rd be a closed convex action set,
{zt }Tt=1 ⊂ Z, `(a, z) = a> z, and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ ∇F (D)
for all (a, z) ∈ A × Z is satisfied. Then OMD satisfies
T T
X
> supa∈A F (a) − F (a1 ) 1 X
sup (at − a) zt ≤ + DF ∗ (∇F (at ) − ηzt , ∇F (at )) .
a∈A t=1 η η t=1
Corollary 2 (Stochastic Online Mirror Descent with Linear Losses). Let A ⊂ D ⊂ Rd be a finite action set,
zt }Tt=1 ⊂ Z, `(a, z) = a> z, and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ ∇F (D)
{b
for all (a, z) ∈ A × Z is satisfied. Then OMD satisfies
" T # T
X
> supa∈A F (a) − F (a1 ) 1 X
sup E (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))] .
a∈A t=1
η η t=1
Taking the expectation on both sides yields the result by noting that
" T # " T # " T # " T # " T #
X X X X X
> > > > >
E zt (At − a) = E zt (at − a) = E E[zt (at − a)|Pt ] = E zt (at − a)|Pt ] = E
E[b zbt (at − a) .
t=1 t=1 t=1 t=1 t=1
16
9.2.1 Full information game
Assume the setting of Example 1. The following algorithm implements the updates of Mirror Descent above
for the particular loss `(a, z) = a> z.
Exponential Weights
Input: Time horizon T .
Initialize: Player sets a1 = (1/d, . . . , 1/d). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player chooses at ∈ A
Player suffers (and observes) loss `(at , zt ) = a>
t zt
Player observes zt
Update mirror descent iterates:
t
X d
X
at+1,i = exp(−η
e zs,i ) at,i = e
at+1,i / at+1,j .
e
s=1 j=1
Corollary 3 (Exponential weights). Under the conditions of Example 1 with let `(a, z) = a> z, the expo-
nential weights algorithm satisfies
T
X log(d) ηT p
sup `(at , zt ) − `(a, zt ) ≤ + ≤ 2T log(d)
a∈A t=1 η 2
Proof. Note ∇a `(a, z) = z. Plug in quantities of the example to obtain for any a ∈ A
T
X T
X
`(at , zt ) − `(a, zt ) = zt> (at − a)
t=1 t=1
T d
log(d) 1 X X
≤ + at,i (exp(−ηzt,i ) − 1 + ηzt,i )
η η t=1 i=1
T d
log(d) η X X 2
≤ + at,i zt,i
η 2 t=1 i=1
log(d) ηT
≤ +
η 2
where the second line uses F (x) ≤ 0 and F (a1 ) = log(d), the third line uses exp(−x) ≤ 1 − x + 21 x2 for
x ≥ 0, and the last line follows from zt,i ∈ [0, 1] and at is a probability distribution.
17
in hindsight is
" T
# " T #
X X
E `(At , zt ) − `(a, zt ) = E zt> (At − a)
t=1 t=1
" T #
X
=E zt> (at − a)
t=1
" T
#
X
=E `(at , zt ) − `(a, zt )
t=1
where the expectation is with to the random selection of each It from at . Alternatively, we can directly
apply Corollary 2 with zbt = zt since we are in the full information setting.
Corollary 4 (Exponential weights over finite actions). Under the conditions of Example 1 where the player
can only play ei for i ∈ {1, . . . , d} with let `(a, z) = a> z, the exponential weights over finite actions algorithm
satisfies
" T
#
X log(d) ηT p
E sup `(eIt , zt ) − `(a, zt ) ≤ + ≤ 2T log(d)
a∈A t=1 η 2
Proof. Immediate from reduction described above and the previous corollary since the iterates are identical
in the full information game. Due to the oblivious adversary we have
" T # " T
#
X X
sup E `(at , zt ) − `(a, zt ) = E sup `(at , zt ) − `(a, zt )
a∈A t=1 a∈A t=1
Note, as we did in class, one can also prove a high probability bound that would apply to a general reactive
adversary [?, Ch. 2.7]
1{It = i}zt,i
zt,i |a1 , I1 , . . . , at−1 , It−1 , at ] = E[
E[b |a1 , I1 , . . . , at−1 , It−1 , at ]
at,i
d
X 1{j = i}zt,i
= at,j
j=1
at,i
= zt,i .
18
Pd
Also note that E[At ] = E[eIt ] = i=1 ei at,i = at .
Corollary 5 (EXP3). Under the conditions of Example 1 where the player can only play ei for i ∈ {1, . . . , d}
with `(a, z) = a> z and only observe bandit feedback, the EXP3 algorithm satisfies
" T # T
" d #
X log(d) η X X
2
sup E `(At , zt ) − `(a, zt ) ≤ + E at,i zbt,i
a∈A t=1
η 2 t=1 i=1
log(d) ηT d p
≤ + ≤ 2dT log(d)
η 2
Proof. We can directly apply Corollary 2:
" T
# T
X
> supa∈A F (a) − F (a1 ) 1 X
E sup (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb zt , ∇F (at ))]
a∈A t=1 η η t=1
T
" d #
log(d) 1 X X
= + E at,i (exp(−ηb zt,i ) − 1 + ηb
zt,i )
η η t=1 i=1
T
" d #
log(d) η X X
2
≤ + E at,i zbt,i
η 2 t=1 i=1
T
" d #
2
log(d) η X X 1{It = i}zt,i
= + E at,i
η 2 t=1 i=1
a2t,i
T d
log(d) η X X 2
= + z
η 2 t=1 i=1 t,i
log(d) ηdT
≤ +
η 2
For an alternative proof of EXP3, see [Lattimore and Szepesvári, 2020, Ch. 11]
19
then:
" T
# T
X
> supa∈A F (a) − F (a1 ) 1 X
E sup (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))]
a∈A t=1 η η t=1
T
1 ηX
zt k22
≤ + E kb
η 2 t=1
T
" d #
1 ηX X
2
= + E zbt,i
η 2 t=1 i=1
T
" d #
X zt,i 2
1 ηX
= + E .
η 2 t=1 a
i=1 t,i
The issue here is that at,i can become arbitrarily close to 0 and blow up the bound. If we mix in unfirom
2/3
exploration
p at each round, one can show that the regret bound is O((dT ) ) which is significantly worse
than O( dT log(d)) of EXP3 above. So given an action set A how do we choose F ? The next proposition
sheds some light on this question.
Proposition 3. If F is twice continuously differntiable, and if its Hessian ∇2 F (x) is invertible ∀x ∈ D,
then ∀x, y ∈ D, there exists ζ ∈ D such that ∇F (ζ) ∈ [∇F (x), ∇F (y)] and
1
DF ∗ (∇F (x), ∇F (y)) = k∇F (x) − ∇F (y)k2(∇2 F (ζ))−1 .
2
The implication of the above proposition is that ∃∇F (ζt ) ∈ [∇F (at ) − ηb
zt , ∇F (at )] and
η2
DF ∗ (∇F (at ) − ηb
zt , ∇F (at )) = zt k2(∇2 F (ζt ))−1
kb
2
For the choice of F (x) = 12 kxk22 we have ∇2 F (ζt ) = I for any ζt so that the Hessian is flat across the action
Pd
set. On the other hand, with the choice F (x) = i=1 xi log(xi )−xi , we have ∇2 F (x) = diag(1/x1 , . . . , 1/xd )
which blows up as a component of x approaches 0. But this is perfect, since then
2
η 2
E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))] = E kb
zt k(∇2 F (ζt ))−1
2
" d
#
η2 X 2
=E zb ζt,i
2 i=1 t,i
" d
#
2
η 2 X zt,i
=E ζt,i
2 i=1 at,i
d
η2 X 2
≈ z
2 i=1 t,i
where we have used the approximation that ζt ≈ at . The hessian of F is blowing up precisely at the locations
where zbt blows up, essentially cancelling each other! We’ll see another example of this in the next subsection.
20
d
• zbt = (1 − ξt ) 1−ka t k2
At A>
t zt
It is straightforward to verify that E[Ath|at ] = at and E[b zti|at ] = zt . Following the intuition of the previous
η2 2
section, we need to choose F to make E 2 kb zt k(∇2 F (ζt ))−1 small. Let F (x) = − log(1 − kxk2 ) − kxk2 . Note
that
h
2
i d > 2
E kbzt k(∇2 F (ζt ))−1 = E k(1 − ξt ) At At zt k(∇2 F (ζt ))−1
1 − kat k2
d2
> 2
=E kAt At zt k(∇2 F (ζt ))−1 |ξt = 0
1 − kat k2
d
X 1 d2
= kei e> 2
i zt k(∇2 F (ζt ))−1
i=1
d 1 − kat k2
d
= kzt k2(∇2 F (ζt ))−1
1 − kat k2
d
≤ (1 − kζt k2 )kzt k22
1 − kat k2
≤ 2d
1−kzt k2
where we have used the fact that ∇2 F (x) I/(1−kxk2 ) and 1−ka t k2
≤ 2 (see [Lattimore and Szepesvári, 2020]
for second fact). √
Using a shrunken action set, one can prove that the regret is bounded by O( dT ) (see [Lattimore and Szepesvári, 2020]).
A straightforward modification for the stochastic mirror descent proof accounts for the forced exploration:
Proposition 4 (EXP3(γ)). The regret of EXP3(γ) algorithm satisfies
" T # T
" n #
X log(n) 1X X
max E yt,It − yt,i ≤ + 2γT + E qt,i φ(−ηb
yt,i )
i∈[n]
t=1
η η t=1 i=1
21
Theorem 8. Let |A| = n and A = {x1 , . . . , xn } ⊂ Rd and maxx∈A maxt |x> zt | ≤ 1. Using the reduction to
EXP3(γ) with γ = ηd we have
" T #
X
> log(n) p
max E (At − a) zt ≤ + 3ηdT ≤ 3dT log(n)
a∈A
t=1
η
= x>
i zt = yt,i
We also have
2
yt,i
E[b ] = E[(x> bt )2 |pt ]
i z
−1 > −1
= E[x> > 2
i Qt At At Qt xi (At zt ) |pt ]
−1
≤ x>
i Qt xi · h
2
so that
" n
# n
X X
−1
E 2
qt,i ybt,i ≤ h2 qt,i x>
i Qt xi
i=1 i=1
n
X
−1
≤ h2 Trace(qt,i xi x>
i Qt )
i=1
≤ h Trace(Qt Q−1
2
t ) = dL
2
Finally,
yt,i | = |ηx>
|ηb bt |
i z
−1
= |ηx> >
i Qt At At zt |
−1
≤ ηh|x>
i Qt At |
−1
= ηh|x>
i Qt xIt |
≤ ηhkxi kQ−1 kxIt kQ−1
t t
ηhd
≤
γ
so it suffices to take γ = ηhd.
10 Contextual Bandits
This section is inspired by [Lattimore and Szepesvári, 2020]
For t = 1, 2, . . .
iid
• Nature reveals ct ∼ D
22
which models users showing up to websites, or patients showing up to the doctor with different symptoms.
A policy π maps contexts to actions in [n] and the value of a policy π is defiend as
V (π) = EC, [v(C, π(C)) + ] = EC [v(C, π(C))]
At each time, we assume the action taken is according to some policy πt ∈ Π so that the regret is defined as
XT
RT = T · V (π ? ) − E[ V (πt )]
t=1
Notes
• This is not the only way we could have defined regret. Perhaps an even more natural approach would
be to define the optimal policy wrt the actual rewards, now the expected rewards.
One naive way to handle this is to run a different linear bandit for each ct , but this is impractical for
infinite sets of ct .
and play xt = arg maxx∈Xt U CBt (x). If x?t = arg maxx∈Xt hx, θ∗ i then
Let θ̂t be the `2 -regularized least-squares estimate of θ∗ with regularization parameter λ > 0 given by
θ̂t = arg min kX1:t θ − Y1:t k + λkθk22 = (XT1:t X1:t + λI)−1 XT1:t Y1:t
θ
where we are denoting X1:t as a matrix with rows X1T , X2T , . . . , XtT and Y1:t as the vector (Y1 , . . . , Yt )T . The
following theorem says that with high probability θ∗ lies with high probability in an ellipsoid with center at
θ̂t .
23
Theorem 9. Confidence Ellipsoid. Assume the same as in Theorem ??, let V = Iλ, λ > 0, define Yt =
hXt , θt i + ηt and assume that kθ∗ k ≤ S. Then for any δ > 0, with probability at least 1 − δ, for all t ≥ 0, θ∗
lies in the set
s
det(V ) 1/2 ) det(λI)−1/2
t
Ct = θ ∈ Rd : kθ̂ − θkV t ≤ R 2 log( ) + λ1/2 S .
δ
Furthermore, if for all t ≥ 1, kXt k2 ≤ L then with probability at least 1 − δ, for all t ≥ 0, θ∗ lies in the set
( r )
0 d 1 + tL2 /λ 1/2
Ct = θ ∈ R : kθ̂ − θkV t ≤ R d log( )+λ S .
δ
10.2.1 τ -greedy
Here we explore a simple strategy that explores for τ steps then exploits or the last T − τ steps. For any
context c fix an exploration distribution µ(x|c) ∈ 4n such that µ(x|c) > 0 for all x, c.
Then suppose at time t, I observed some context ct and played xt ∼ µ(·|c) to receive reward rt =
v(ct , xt ) + t . Then define the inverse propensity scoring estimator as
1{xt = x}
vb(ct , x) = rt
µ(x|ct )
Note that
1{xt = x}
v (ct , x)|ct ] = E rt
E[b |ct
µ(x|ct )
X 1{xt = x}
= µ(x0 |ct )E rt |ct , xt = x0
0
µ(x|ct )
x ∈X
X 1{x0 = x}
= µ(x0 |ct )v(ct , x0 )
µ(x|ct )
x0 ∈X
= v(ct , x)
with variance
v (ct , x) − v(ct , x))2 |ct ] ≤ E[(b
E[(b v (ct , x))2 |ct ]
1{xt = x}
≤E |c t (rt ≤ 1)
µ(x|ct )2
X 1{x0 = x}
≤ µ(x0 |ct )
0
µ(x|ct )2
x ∈X
1
=
µ(x|ct )
1
Pt
and trivially, vb(c, x) ≤ vbmax := maxc,x µ(x|c) . Finally, if Vbt (π) = 1s s=1 vb(ct , π(ct )) then E[Vbt (π)] = V (π)
for any π ∈ Π. By Bernstein’s inequality we have with probability at least 1 − δ
s
2 log(2/δ) 1 2b
vmax log(2/δ)
|V (π) − Vt (π)| ≤
b Ec,x∼µ +
t µ(π(c)|c) 3t
24
1
and in particular, if µ(x|c) = for all x, c then
n
r r
2n log(2/δ) 2n log(2/δ) 4n log(2/δ)
|V (π) − Vt (π)| ≤
b + ≤
t 3t t
q
for some t ≥ 2n log(2/δ). Define t := 4n log(2|Π|/δ)
t . If π
b = arg maxπ∈Π Vbt (π)
V (b π ) − Vbt (b
π ) = V (b π ) − Vbt (π ? ) + Vbt (π ? ) − V (π ? ) +V (π ? )
π ) + Vbt (b
| {z } | {z } | {z }
≥−t ≥0 ≥−t
?
≥ V (π ) − 2t
If we explore uniformly for τ rounds than exploit or T − τ rounds then we achieve a regret of at most
r
˙ − τ ) ≤ τ + T 4n log(2|Π|/δ)
1 · t + 2τ (T
τ
which is minimized at τ = (nT 2 log(2|Π|/δ))1/3 which yields a regret of O(T 2/3 (n log(2|Π|/δ))1/3 ) regret.
1
Pm
Lemma 6 (Bernstein’s inequality). Let X1 , . . . , Xm be independent random variables such that m i=1 E[(Xi −
E[Xi ])2 ] ≤ σ 2 and |Xi | ≤ B. Then
m
r
1 X 2σ 2 log(2/δ) 2B log(2/δ)
Xi ≤ +
m i=1 m 3m
b ⊂ Π.
for some active set Π
Input: Policy set Π such that π : X → [n] for all π ∈ Π, confidence level δ ∈ (0, 1).
Let Πb 1 ← Π, ` ← 1
while |Π b ` | > 1 do
q
log(2|Π|T /δ)
` = 2−` , τ` = d16n−2 ` log(2|Π|T /δ)e, γ` = min{ 2n ,
1
9nτ`
}
h i
1
Q` = arg minQ∈4Πb maxπ∈Π b ` EC Qγ (π(C)|C)
`
γ P
s.t. Q (x|c) = γ + (1 − γn) π∈Π b ` :π(c)=x Q(π)
for t = T`−1 + 1, . . . , T`
Observe context ct
Play xt ∼ Qγ (·|ct ), set pt = Qγ (xt |ct ) and observe reward rt = v(ct , xt ) + ηt
1
P 1{π(ct )=ct }
Set Vb` (π) = T` −T `−1 t∈(T`−1 ,T` ] rt pt
b ` | max 0 b Vb` (π 0 ) − Vb` (π) ≥ 2`
b `+1 ← Π
Π b` \ π ∈ Π
π ∈Π`
t←t+1
Output: Πt+1
We state without proof:
1
Lemma 7. For any finite policy set Π and γ ≤ 2n we have
1
min max EC ≤ 2n
Q∈4Π π∈Π Qγ (π(C)|C)
25
Lemma 8. For all ` = 1, 2, . . . we have π ? ∈ Π
b ` and max b V (π) ≥ V (π ? ) − 8` .
π∈Π`
h i
1
Proof. Let τ` = T` − T`−1 . Noting that the variance of Vb` (π) is bounded by maxπ∈Π
b ` EC Qγ
, we
` (π(C)|C)
apply Bernstein’s inequality at each stage ` to find
s
4n log(2|Π|T /δ) 2 log(2|Π|T /δ)
|V (π) − Vb` (π)| ≤ +
τ` 3γ` τ`
s
16n log(2|Π|T /δ)
≤
τ`
q
1 log(2|Π|T /δ)
for the choice of γ` = min{ 2n , 9nτ` } to equalize the terms for large τ` . The last inequality holds if
τ` ≥ n log(2|Π|T /δ). To make the right hand side less than ` , it suffices to take τ` = d16n−2
` log(2|Π|T /δ)e.
b ` with π ? ∈ Π
For any fixed Π b ` , we have that any π ∈ Π
b ` satisfies
≤ 2` .
On the other hand, for any π such that V (π ? ) − V (π) > 4`
> 2`
? ?
which implies this π will be kicked out. This means that maxπ∈Π b `+1 V (π) ≥ V (π ) − 4` ≥ V (π ) − 8`+1 .
Extending the proof to all ` and random Π b ` is identical to above for linear bandits.
Suppose you run for T timesteps. Let ∆ = minπ6=π? V (π ? ) − V (π). Then for any ν ≥ 0 the regret is
bounded by:
dlog2 (4(∆∨ν)−1 )e
X
T ν+ (γ` n + 8` (1 − γ` n))τ`
`=1
dlog2 (4(∆∨ν)−1 )e q
X log(2|Π|T /δ)
= Tν + (n 9nτ` + 8` )τ`
`=1
dlog2 (4(∆∨ν)−1 )e q
X
= Tν + n d16n−2 −2
` log(2|Π|T /δ)e log(2|Π|T /δ)/9n + 8` d16n` log(2|Π|T /δ)e
`=1
dlog2 (4(∆∨ν)−1 )e
X
≤ Tν + 2n−1 −1
` log(4|Π|T /δ) + 128n` log(4|Π|T /δ)
`=1
dlog2 (4(∆∨ν)−1 )e
X
≤ T ν + 8 + 130n log(2|Π|T /δ) 2t
t=1
≤ T ν + 8 + 2080n(∆ ∨ ν)−1 log(2|Π|T /δ).
p using the upper bound (∆ ∨ ν) ≤ ν and optimizing over ν we have that the regret is no great
As before,
than O( nT log(|Π|T /δ)).
26
10.3 EXP4 for oblivious adversaries
The following algorithm and proof are very standard. However, the textbooks [Lattimore and Szepesvári, 2020,
Bubeck et al., 2012] have some unfortunate typos and/or notation that I found confusing so I have repro-
duced EXP4 here.
and
n n X
m m n m n
(t) (t)
X X X X X X
E[`t,It ] = E[ pt,i `t,i ] = Qt,j Ej,i `t,i = Qt,j Ej,i `t,i = E[ Qt,j ybt,j ]
i=1 i=1 j=1 j=1 i=1 j=1 i=1
log(m) ηnT
≤ +
η 2
where the first two lines follow from plugging in the identities of above, the third from the definition of Mt ,
27
and the first inequality follows from the guarantee of EXP3. The last inequality follows from
" n #
X (t)
2
yt,j
E[b ]=E ( Ej,i `bt,i )2
i=1
!2
n
(t) 1{It = i}`t,i
X
= E Ej,i
i=1
pt,i
(t)
!2
Ej,It `t,It
=E
pt,It
n (t)
!2
X Ej,i `t,i
= pt,i
i=1
pt,i
n (t)
X Ej,i
≤
i=1
pt,i
so
m m n (t)
X
2
X X Ej,i
E Qt,j ybt,j = Qt,j
j=1 j=1 i=1
pt,i
n Pm (t)
X j=1 Qt,j Ej,i
= =n
i=1
pt,i
28
References
[Audibert and Bubeck, 2009] Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic
bandits.
[Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47(2-3):235–256.
[Bubeck et al., 2012] Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic
multi-armed bandit problems. Foundations and Trends R in Machine Learning, 5(1):1–122.
[Cappé et al., 2013] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., Stoltz, G., et al. (2013). Kullback–leibler
upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541.
[Fiez et al., 2019] Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. (2019). Sequential experimental design for
transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676.
[Kaufmann et al., 2016] Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best-arm identifi-
cation in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42.
[Lattimore, 2018] Lattimore, T. (2018). Refining the confidence level for optimistic bandit strategies. The Journal
of Machine Learning Research, 19(1):765–796.
[Lattimore and Szepesvari, 2016] Lattimore, T. and Szepesvari, C. (2016). The end of optimism? an asymptotic
analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491.
[Lattimore and Szepesvari, 2017] Lattimore, T. and Szepesvari, C. (2017). The end of optimism? an asymptotic
analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737.
[Lattimore and Szepesvári, 2020] Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. https: //
tor-lattimore. com/ downloads/ book/ book. pdf .
[Mannor and Tsitsiklis, 2004] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the
multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648.
[Pukelsheim, 2006] Pukelsheim, F. (2006). Optimal design of experiments. SIAM.
[Soare, 2015] Soare, M. (2015). Sequential resource allocation in linear stochastic bandits. PhD thesis, Université
Lille 1-Sciences et Technologies.
[Soare et al., 2014] Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In
Advances in Neural Information Processing Systems, pages 828–836.
[Yu et al., 2006] Yu, K., Bi, J., and Tresp, V. (2006). Active learning via transductive experimental design. In
Proceedings of the 23rd international conference on Machine learning, pages 1081–1088. ACM.
High probability bounds are effective in adversarial settings as they provide a statistical guarantee on performance under the worst-case loss sequence, ensuring that the regret remains low with high confidence. This probabilistic assurance caters to robust algorithm design, capable of handling uncertainties and adversarial influences more reliably .
In the full information setting, the exploration-exploitation trade-off can be managed directly using the complete loss vectors, allowing precise adjustments based on observed data. In bandit settings, where only partial feedback is obtained, the trade-off is encoded through probabilistic estimates and stochastic approaches like EXP3, reflecting an indirect method of balancing exploration and exploitation amidst uncertainty .
Mirror Descent iterates adjust the decision variables by projecting the gradients of the loss function onto the feasible decision space with respect to a chosen distance-generating function. This method helps exploit the curvature of the space and can more effectively converge to an optimal solution by controlling the learning rate dynamically, which is especially useful in high-dimensional task environments .
The least squares estimator allows the algorithm to approximate the unknown parameter B8* based on the observations collected in each round. By constructing this estimator only with the data from the current round, the algorithm efficiently updates its understanding of the reward structures, thus enhancing its decision-making process for selecting arms that maximize expected reward while minimizing regret .
Exponential weights play a critical role in managing situations where the feedback available is limited to only the outcome associated with the chosen action. By assigning weights to actions based on their historical performance, the algorithm can adjust probabilities for choosing each action in future rounds. This approach enables the estimation of the outcomes for all actions despite observing only one, thus facilitating a balance between exploration and exploitation .
The primary goal of the Action Elimination Algorithm is to identify the best arm out of a set of multiple arms with high confidence. The algorithm aims to ensure that the optimal arm remains in the set of candidates throughout the iterations by eliminating arms that perform significantly worse than the current best estimate. This elimination process relies on calculating empirical means and their confidence intervals to decide whether an arm can be eliminated from further consideration .
The assumptions made regarding the distribution of rewards include the use of sub-Gaussian distributions, ensuring that variability in observations is controlled and that reward distributions do not deviate significantly from their mean. This assumption simplifies the regret analysis and calculation of confidence bounds in best-arm identification settings .
Neglecting practicalities can lead to a simplified analysis; however, it may result in algorithms that perform slightly weaker in practice. By sacrificing some performance potential for analytical simplicity, the behavior of algorithms in realistic environments might not be accurately captured, which can affect their reliability when implemented .
When B3 is set to zero, the regret bound of the bandit algorithm is O(d 94^{-1} log(|X|T)), and it implies a regret bounded by c d 94 log(|X|T). On the other hand, minimizing over B3 > 0, the regret bound becomes O( 1Ap d T log(log(T/d)|X|/B4)), which implies a regret bounded by c 1Ap dT log(|X|T).
The EXP3 algorithm uses Exponential Weights for updating the probability distribution. This approach ensures that exploration is handled through random sampling based on current probability estimates, while also allowing exploitation of known rewarding actions by iteratively adjusting action probabilities .